Learning to identify fragmented words in spoken discourse
2003 | conference paper
Details
- Authors
- Lendvai, Piroska
- Abstract
- Disfluent speech adds to the difficulty of processing spoken language utterances. In this paper we concentrate on identifying one disfluency phenomenon: fragmented words. Our data, from the Spoken Dutch Corpus, samples nearly 45,000 sentences of human discourse, ranging from spontaneous chat to media broadcasts. We classify each lexical item in a sentence as either a completely uttered word or an incompletely uttered (i.e. fragmented) word. The task is carried out by both the IB1 and RIPPER machine learning algorithms, trained on a variety of features with an extensive optimization strategy. Our best classifier achieves an F-score of 74.9%, which is a significant improvement over the baseline. We discuss why memory-based learning has more success than rule induction in correctly classifying fragmented words.
- Issue Date
- 2003
- Conference
- EACL '03
- ISBN
- 978-1-932432-00-8
- Conference Place
- Budapest, Hungary
- Event start
- 2003-04-12
- Event end
- 2003-04-17
- Language
- English