Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

Scharpf, Philipp; Schubotz, Moritz; Youssef, Abdou; Hamborg, Felix; Meuschke, Norman; Gipp, Bela

2020-05-22 | conference paper

Cite this publication

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language
Scharpf, P.; Schubotz, M.; Youssef, A.; Hamborg, F.; Meuschke, N. & Gipp, B. (2020)
In: Huang, Ruhua; Wu, Dan; Marchionini, Gary; He, Daqing; Cunningham, Sally Jo; Hansen, Preben (Eds.), Proceedings, pp. 137-146. JCDL '20: The ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event.
ACM Digital Library. DOI: https://doi.org/10.1145/3383583.3398529

Details

Authors: Scharpf, Philipp; Schubotz, Moritz; Youssef, Abdou; Hamborg, Felix; Meuschke, Norman; Gipp, Bela
Editors: Huang, Ruhua; Wu, Dan; Marchionini, Gary; He, Daqing; Cunningham, Sally Jo; Hansen, Preben
Abstract: In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
Issue Date: 22-May-2020
Publisher: ACM Digital Library
Conference: JCDL '20: The ACM/IEEE Joint Conference on Digital Libraries in 2020
ISBN: 978-1-4503-7585-6
Conference Place: Virtual Event
Event start: 2020-08-01
Event end: 2020-08-05