This thesis contributes to sequence labeling tasks in the field of Natural Language Processing by introducing novel concepts, models and algorithms for Part-of-Speech (POS) tagging, social media text detection and Web page cleaning.
First, the task of social media text classification in Web pages is addressed, where sequences of Web text segments are classified based on a high-dimensional feature vector.
New features motivated by social media text characteristics are introduced and investigated with respect to different classifiers.
Two classification problems in the context of social media text classification are treated: (1) social media text detection and (2) Web page cleaning for social media platforms.
A new Web page corpus, designed specifically to train and test the classifiers on representative Web pages, is created.
Moreover, a POS tagger for social media texts is developed.
The need for a specialized tagger is due to the specific social media text characteristics and the high non-standardization of such texts.
Based on these factors, a Markov model tagger with parameter estimation enhancements tailored to social media texts is proposed.
Particular focus is put on reliable estimation for non-standardized tokens such as out-of-vocabulary words.
To that end, methods are proposed to improve the reliability of probability estimation.
Moreover, a novel approach is presented that maps unknown tokens either to tokens known from training or to token classes represented by regular expressions.
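The mapping idea can be illustrated with a minimal Python sketch. The abstract does not state the actual mapping rules, so everything here is an assumption made for illustration: the class names, the regular expressions, the lowercasing step, and the collapsing of repeated characters are hypothetical stand-ins, not the rules used in the thesis.

```python
import re

# Hypothetical token classes; names and patterns are illustrative assumptions.
TOKEN_CLASSES = [
    ("<URL>", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("<NUMBER>", re.compile(r"^\d+([.,]\d+)*$")),
    ("<EMOTICON>", re.compile(r"^[:;=8][\-o']?[)(\]\[DPp/\\|]+$")),
]


def squeeze_repeats(token):
    """Collapse runs of three or more identical characters into one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def map_token(token, vocabulary):
    """Map a token to a known token or a regex class, else return None.

    The order of the steps (exact match, lowercase, squeezed form,
    regex classes) is an assumption chosen for this sketch.
    """
    if token in vocabulary:
        return token
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    squeezed = squeeze_repeats(lowered)
    if squeezed in vocabulary:
        return squeezed
    for class_name, pattern in TOKEN_CLASSES:
        if pattern.match(token):
            return class_name
    return None  # still unknown; left to a later fallback


vocabulary = {"super", "das"}
print(map_token("Suuuuper", vocabulary))
print(map_token("http://example.org", vocabulary))
```

Tokens for which `map_token` returns `None` remain unknown and would have to be handled by a separate fallback.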
Finally, for remaining unknown tokens, semi-supervised auxiliary lexica and adequate estimation from prefix and suffix information are proposed.
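As a rough illustration of estimation from suffix information, the following Python sketch derives a tag distribution for an unknown word from the longest suffix it shares with training words. This is a generic longest-suffix heuristic under assumed names (`build_suffix_counts`, `suffix_tag_distribution`, `max_len`); the estimator actually used in the thesis is not described in the abstract and may differ, for example in smoothing or in its use of prefixes.

```python
from collections import Counter, defaultdict


def build_suffix_counts(tagged_words, max_len=3):
    """Count tags for every word suffix of length 1..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        lowered = word.lower()
        for n in range(1, min(max_len, len(lowered)) + 1):
            counts[lowered[-n:]][tag] += 1
    return counts


def suffix_tag_distribution(word, counts, max_len=3):
    """Relative tag frequencies for the longest suffix seen in training."""
    lowered = word.lower()
    for n in range(min(max_len, len(lowered)), 0, -1):
        tag_counts = counts.get(lowered[-n:])
        if tag_counts:
            total = sum(tag_counts.values())
            return {tag: k / total for tag, k in tag_counts.items()}
    return {}  # no suffix of the word was seen in training


# Toy training data; words and tags are made up for this example.
training = [
    ("Freiheit", "NN"), ("Schönheit", "NN"),
    ("laufen", "VVINF"), ("kaufen", "VVINF"), ("offen", "ADJD"),
]
counts = build_suffix_counts(training)
print(suffix_tag_distribution("Gesundheit", counts))
```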
Furthermore, we propose to combine sparse in-domain social media training data with a newspaper corpus using an oversampling technique, which significantly improves POS tagging accuracy.
Training and evaluation of the proposed POS tagger are performed on a new manually annotated German social media text corpus.
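One simple way to picture such a combination is to repeat the in-domain sentences before merging them with the larger out-of-domain corpus, so that in-domain evidence carries more weight in the resulting counts. The sketch below assumes plain replication by an integer factor; the function name, the factor, and the toy data are hypothetical, and the abstract does not specify how the oversampling is actually carried out.

```python
from collections import Counter


def combine_with_oversampling(in_domain, out_of_domain, factor):
    """Repeat the in-domain sentences `factor` times, then append the rest."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return list(in_domain) * factor + list(out_of_domain)


# Toy corpora: each sentence is a list of (token, tag) pairs.
social_media = [[("lol", "ITJ")]]
newspaper = [[("Das", "ART"), ("Haus", "NN")]] * 3

combined = combine_with_oversampling(social_media, newspaper, factor=5)

# Tag counts as a tagger's parameter estimation would see them.
tag_counts = Counter(tag for sentence in combined for _, tag in sentence)
print(len(combined), tag_counts["ITJ"], tag_counts["NN"])
```

With a factor of 5, the single social media sentence accounts for five of the eight training sentences instead of one of four.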
Tagging accuracies are presented and compared to accuracies achieved with state-of-the-art POS taggers.
Finally, we show that the proposed social media text detection and Web page cleaning methods, as well as the presented POS tagger, can be used efficiently in the context of information retrieval for Web page corpus construction.
By applying Web page cleaning and social media text detection to Web page corpora obtained from Web crawlers, the generated corpus can be further refined.
Elektro- und Informationstechnik
Part-of-Speech Tagging and Detection of Social Media Texts