Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics

Frey, Jennifer Carmen (2020) Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Traduzione, interpretazione e interculturalità, 32nd cycle. DOI 10.6092/unibo/amsdottorato/9300.

Full-text documents available:

PDF document (English) - Requires a PDF reader such as Xpdf or Adobe Acrobat Reader
Available under licence: Creative Commons Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0).
Download (8MB)

Abstract

A growing number of studies report interesting insights gained from existing data resources. Among them are analyses of textual data, which gives reason to consider such methods for linguistics as well. However, corpus linguistics usually works with purposefully collected, representative language samples that are designed to answer only a limited set of research questions. This thesis aims to shed light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating whether existing German language corpora can be repurposed for linguistic inquiry using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora computationally. The thesis first introduces strategies and methods that have already been applied to language data, discusses how they can assist corpus linguistic analysis, and points to available toolkits and software as well as to state-of-the-art research and further references. It then applies this methodological toolset in two differently shaped corpus studies that use readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings of student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field.
While both studies yield linguistic insights that fit into the current understanding of the investigated phenomena in German, they also systematically test the methodological toolset introduced beforehand, allowing a detailed discussion of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics at the end of the thesis.

Document type

Doctoral thesis

Author

Frey, Jennifer Carmen

Supervisor

Soffritti, Marcello

Co-supervisor

Glaznieks, Aivars

PhD programme

Traduzione, interpretazione e interculturalità

Cycle

32

Coordinator

Baccolini, Raffaella

Scientific-disciplinary sector

Area 10 - Antiquity, philological-literary and historical-artistic sciences > L-LIN/14 Language and translation - German language

Academic recruitment field

Area 01 - Mathematics and computer science > 01/B - Computer science > 01/B1 Computer science
Area 10 - Antiquity, philological-literary and historical-artistic sciences > 10/M - Germanic and Slavic languages, literatures and cultures > 10/M1 Germanic languages, literatures and cultures

Keywords

Data Mining, Corpus Linguistics, Computer-mediated Communication, Social Media, Sociolinguistics, German, Data Science, Machine Learning, Student essays, Text Quality, Interpretability, Repurposing, Linguistic Complexity

URN:NBN

urn:nbn:it:unibo-26018

DOI

10.6092/unibo/amsdottorato/9300

Date of defence

3 April 2020

URI

https://amsdottorato.unibo.it/id/eprint/9300