Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics

Frey, Jennifer Carmen (2020) Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Traduzione, interpretazione e interculturalità, 32 Ciclo. DOI 10.6092/unibo/amsdottorato/9300.
Available under license: Creative Commons Attribution Non-commercial ShareAlike 4.0 (CC BY-NC-SA 4.0).


A growing number of studies report insights gained from existing data resources. Some of these analyse textual data, which gives reason to consider such methods for linguistics as well. However, corpus linguistics usually works with purposefully collected, representative language samples that are designed to answer only a limited set of research questions. This thesis examines the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating whether existing German language corpora can be repurposed for linguistic inquiry using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora computationally. The thesis first introduces strategies and methods that have already been applied to language data, discusses how they can assist corpus linguistic analysis, and points to available toolkits and software as well as to state-of-the-art research and further references. The methodological toolset is then applied in two differently designed corpus studies that use readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings of student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions derived from previous research in the field.
Both studies yield linguistic insights that fit into the current understanding of the investigated phenomena in German. At the same time, they systematically test the methodological toolset introduced beforehand, allowing a detailed discussion, at the end of the thesis, of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics.

Document type
Doctoral thesis
Author
Frey, Jennifer Carmen
Keywords
Data Mining, Corpus Linguistics, Computer-mediated Communication, Social Media, Sociolinguistics, German, Data Science, Machine Learning, Student essays, Text Quality, Interpretability, Repurposing, Linguistic Complexity
Defence date
3 April 2020
