Learning to identify fragmented words in spoken discourse

2003 | conference paper


Cite this publication

Learning to identify fragmented words in spoken discourse
Lendvai, P. (2003)
Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics (Vol. 2), pp. 25-32. EACL '03, Budapest, Hungary. DOI: https://doi.org/10.3115/1067737.1067742

Documents & Media

License

GRO License

Details

Authors
Lendvai, Piroska 
Abstract
Disfluent speech adds to the difficulty of processing spoken language utterances. In this paper we concentrate on identifying one disfluency phenomenon: fragmented words. Our data, from the Spoken Dutch Corpus, samples nearly 45,000 sentences of human discourse, ranging from spontaneous chat to media broadcasts. We classify each lexical item in a sentence as either a completely uttered or an incompletely uttered, i.e. fragmented, word. The task is carried out by both the IB1 and RIPPER machine learning algorithms, trained on a variety of features with an extensive optimization strategy. Our best classifier has a 74.9% F-score, which is a significant improvement over the baseline. We discuss why memory-based learning has more success than rule induction in correctly classifying fragmented words.
Issue Date
2003
Conference
EACL '03
ISBN
978-1-932432-00-8
Conference Place
Budapest, Hungary
Event start
2003-04-12
Event end
2003-04-17
Language
English
