Data quality in empirical software engineering: a targeted review

aut.conference.type: Paper Published in Proceedings
aut.relation.endpage: 176
aut.relation.startpage: 171
aut.researcher: MacDonell, Stephen Gerard
dc.contributor.author: Bosu, MF
dc.contributor.author: MacDonell, SG
dc.contributor.editor: Silva, FQBD
dc.contributor.editor: Juzgado, NJ
dc.contributor.editor: Travassos, GH
dc.date.accessioned: 2013-06-24T07:19:31Z
dc.date.accessioned: 2013-06-24T23:50:18Z
dc.date.available: 2013-06-24T07:19:31Z
dc.date.available: 2013-06-24T23:50:18Z
dc.date.copyright: 2013
dc.date.issued: 2013
dc.description.abstract: Context: The utility of prediction models in empirical software engineering (ESE) is heavily reliant on the quality of the data used in building those models. Several data quality challenges such as noise, incompleteness, outliers and duplicate data points may be relevant in this regard. Objective: We investigate the reporting of three potentially influential elements of data quality in ESE studies: data collection, data pre-processing, and the identification of data quality issues. This enables us to establish how researchers view the topic of data quality and the mechanisms that are being used to address it. Greater awareness of data quality should inform both the sound conduct of ESE research and the robust practice of ESE data collection and processing. Method: We performed a targeted literature review of empirical software engineering studies covering the period January 2007 to September 2012. A total of 221 relevant studies met our inclusion criteria and were characterized in terms of their consideration and treatment of data quality. Results: We obtained useful insights as to how the ESE community considers these three elements of data quality. Only 23 of these 221 studies reported on all three elements of data quality considered in this paper. Conclusion: The reporting of data collection procedures is not documented consistently in ESE studies. It will be useful if data collection challenges are reported in order to improve our understanding of why there are problems with software engineering data sets and the models developed from them. More generally, data quality should be given far greater attention by the community. The improvement of data sets through enhanced data collection, pre-processing and quality assessment should lead to more reliable prediction models, thus improving the practice of software engineering.
dc.identifier.citation: In Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering (EASE). Pp. 171-176.
dc.identifier.doi: 10.1145/2460999.2461024
dc.identifier.isbn: 978-1-4503-1848-8
dc.identifier.uri: https://hdl.handle.net/10292/5502
dc.publisher: ACM
dc.relation.replaces: http://hdl.handle.net/10292/5494
dc.relation.replaces: 10292/5494
dc.relation.uri: http://dx.doi.org/10.1145/2460999.2461024
dc.rights: © ACM, 2013. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in PUBLICATION (see Citation), (see Publisher’s Version).
dc.rights.accessrights: OpenAccess
dc.subject: Data quality
dc.subject: Data sets
dc.subject: Empirical software engineering
dc.subject: Literature review
dc.title: Data quality in empirical software engineering: a targeted review
dc.type: Conference Contribution
pubs.elements-id: 142256
pubs.organisational-data: /AUT
pubs.organisational-data: /AUT/Design & Creative Technologies
pubs.organisational-data: /AUT/Design & Creative Technologies/School of Computing & Mathematical Science
Files
Original bundle
Name: Bosu and MacDonell (2013a) EASE.pdf
Size: 146.09 KB
Format: Adobe Portable Document Format
License bundle
Name: licence.htm
Size: 30.34 KB
Format: Unknown data format