A fine-grained data set and analysis of tangling in bug fixing commits

2022 | Journal article. A publication affiliated with the University of Göttingen.


Cite this publication

A fine-grained data set and analysis of tangling in bug fixing commits
Herbold, S.; Trautsch, A.; Ledel, B.; Aghamohammadi, A.; Ghaleb, T. A.; Chahal, K. K.; Bossenmaier, T. et al. (2022)
Empirical Software Engineering, 27(6). DOI: https://doi.org/10.1007/s10664-021-10083-5

Documents & Media

document.pdf (2.27 MB, Adobe PDF)

License

GRO License

Details

Authors
Herbold, Steffen; Trautsch, Alexander; Ledel, Benjamin; Aghamohammadi, Alireza; Ghaleb, Taher A.; Chahal, Kuljit Kaur; Bossenmaier, Tim; Nagaria, Bhaveet; Makedonski, Philip; Ahmadabadi, Matin Nili; Erbel, Johannes
Abstract
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy, until proven otherwise.
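
The consensus rule described in the Methods section (each changed line receives labels from four participants; a label counts as consensus if at least three agree) can be sketched in Python as follows. This is a minimal illustration only; the function name, label strings, and threshold parameter are assumptions for the example and are not taken from the paper's replication package.

from collections import Counter

def consensus_label(labels, required=3):
    # labels: the labels assigned to one changed line by the participants,
    # e.g. ["bugfix", "bugfix", "test", "bugfix"] (label names are illustrative).
    # Returns the consensus label if at least `required` participants agree,
    # otherwise None (no consensus for this line).
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= required else None

# Example: four participants label one line of a bug fixing commit.
consensus_label(["bugfix", "bugfix", "bugfix", "refactoring"])  # -> "bugfix"
consensus_label(["bugfix", "test", "refactoring", "bugfix"])    # -> None (active disagreement)

Lines for which no label reaches the threshold correspond to the roughly 11% of hard-to-label lines with active disagreement reported in the Results.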
Issue Date
2022
Journal
Empirical Software Engineering 
ISSN
1382-3256
eISSN
1573-7616
Language
English
