Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets

2021-09-11 | preprint

Cite this publication

Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets
Zhukova, A.; Hamborg, F. & Gipp, B. (2021)

Documents & Media

License

GRO License

Details

Authors
Zhukova, Anastasia; Hamborg, Felix; Gipp, Bela
Abstract
Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice and more loosely related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+ and NewsWCL50 using multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains at a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that the two CDCR datasets create, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
Issue Date
11-September-2021
