Article
A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
2015 IEEE 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER): Proceedings: March 2-6, 2015, Montréal
  • Yuan TIAN, Singapore Management University
  • David LO, Singapore Management University
Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
3-2014
Abstract

Many software artifacts are written in natural language or contain a substantial amount of natural language content. These artifacts can therefore be analyzed with text analysis techniques from the natural language processing (NLP) community, e.g., part-of-speech (POS) tagging, which assigns POS tags (e.g., verb, noun) to the words in a sentence. Several studies have already applied POS tagging to software artifacts to recover important words in them, which are then used to automate various tasks, e.g., locating buggy files for a given bug report. Many POS tagging techniques have been proposed, but they are trained and evaluated on non-software-engineering corpora (documents). It is thus unknown whether they can correctly identify the POS of a word in a software artifact, and which of them performs best. To fill this gap, we investigate the effectiveness of seven POS taggers on bug reports. We randomly sample 100 bug reports from the Eclipse and Mozilla projects and create a text corpus that contains 21,713 words. We manually assign POS tags to these words and use them to evaluate the studied POS taggers. Our comparative study shows that state-of-the-art POS taggers achieve an accuracy of 83.6%-90.5% on bug reports, and that the Stanford POS tagger and the TreeTagger achieve the highest accuracy on the sampled bug reports. Our findings show that researchers could use these POS taggers to analyze software artifacts if an accuracy of 80-90% is acceptable for their specific needs, and we recommend using the Stanford POS tagger or the TreeTagger.
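
The evaluation described in the abstract compares each tagger's output against manually assigned tags and reports accuracy. The following is a minimal sketch of how such a token-level accuracy could be computed. It is not the authors' implementation: the function name `tagging_accuracy`, the `(word, tag)` pair representation, the Penn Treebank-style tag labels, and the five-token example are all assumptions made for illustration, and the example data is hand-made rather than taken from the study.

```python
from typing import List, Tuple

# Assumed representation: one (word, POS tag) pair per token.
TaggedToken = Tuple[str, str]


def tagging_accuracy(gold: List[TaggedToken],
                     predicted: List[TaggedToken]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty corpus")

    correct = 0
    for (gold_word, gold_tag), (pred_word, pred_tag) in zip(gold, predicted):
        # Both sequences must be tokenized identically for the comparison
        # to be meaningful.
        if gold_word != pred_word:
            raise ValueError(f"token mismatch: {gold_word!r} vs {pred_word!r}")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


# Hypothetical hand-made example (NOT data from the study).
gold = [("Crash", "NN"), ("when", "WRB"), ("opening", "VBG"),
        ("the", "DT"), ("editor", "NN")]
predicted = [("Crash", "VB"), ("when", "WRB"), ("opening", "VBG"),
             ("the", "DT"), ("editor", "NN")]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.1%}")  # 4 of 5 tags match
```

The sketch assumes the tagger output and the manual annotation share the same tokenization and tag set. In practice, taggers that use different tag sets would need their tags mapped to a common set before comparison.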

ISBN
9781479984695
Identifier
10.1109/SANER.2015.7081879
Publisher
IEEE
City or Country
Piscataway, NJ
Copyright Owner and License
Authors
Creative Commons License
Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International
Additional URL
https://doi.org/10.1109/SANER.2015.7081879
Citation Information
Yuan TIAN and David LO. "A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports" 2015 IEEE 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER): Proceedings: March 2-6, 2015, Montréal (2014) p. 570 - 574
Available at: http://works.bepress.com/david_lo/152/