Analysis and Application of Language Models to Human-Generated Textual Content

Di Giovanni, Marco (2022) Analysis and Application of Language Models to Human-Generated Textual Content, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Data science and computation, 36 Ciclo. DOI 10.48676/unibo/amsdottorato/10057.
Documenti full-text disponibili:
[img] Documento PDF (English) - Richiede un lettore di PDF come Xpdf o Adobe Acrobat Reader
Disponibile con Licenza: Salvo eventuali più ampie autorizzazioni dell'autore, la tesi può essere liberamente consultata e può essere effettuato il salvataggio e la stampa di una copia per fini strettamente personali di studio, di ricerca e di insegnamento, con espresso divieto di qualunque utilizzo direttamente o indirettamente commerciale. Ogni altro diritto sul materiale è riservato.
Download (7MB)


Social networks are enormous sources of human-generated content. Users continuously create information, useful but hard to detect, extract, and categorize. Language Models (LMs) have always been among the most useful and used approaches to process textual data. Firstly designed as simple unigram models, they improved through the years until the recent release of BERT, a pre-trained Transformer-based model reaching state-of-the-art performances in many heterogeneous benchmark tasks, such as text classification and tagging. In this thesis, I apply LMs to textual content publicly shared on social media. I selected Twitter as the principal source of data for the performed experiments since its users mainly share short and noisy texts. My goal is to build models that generate meaningful representations of users encoding their syntactic and semantic features. Once appropriate embeddings are defined, I compute similarities between users to perform higher-level analyses. Tested tasks include the extraction of emerging knowledge, represented by users similar to a given set of well-known accounts, controversy detection, obtaining controversy scores for topics discussed online, community detection and characterization, clustering similar users and detecting outliers, and stance classification of users and tweets (e.g., political inclination, COVID-19 vaccines position). The obtained results suggest that publicly available data contains delicate information about users, and Language Models can now extract it, threatening users' privacy.

Tipologia del documento
Tesi di dottorato
Di Giovanni, Marco
Dottorato di ricerca
Settore disciplinare
Settore concorsuale
Parole chiave
Language Models, Social Media, User Similarity, Text Classification, Knowledge Extraction
Data di discussione
21 Marzo 2022

Altri metadati

Statistica sui download

Gestione del documento: Visualizza la tesi