Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets
2021-09-11 | preprint
Jump to: Cite & Linked | Documents & Media | Details | Version history
Documents & Media
Details
- Authors
- Zhukova, Anastasia ; Hamborg, Felix; Gipp, Bela
- Abstract
- Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice and more loosely-related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare annotation schemes of ECB+ and NewsWCL50 with multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains on a more detailed level than previously proposed metric, e.g., a number of unique lemmas. We discuss the different tasks that both CDCR datasets create, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
- Issue Date
- 11-September-2021