Structuring cultural heritage content and context: integrating LLMs in ontology-driven knowledge graph extraction

Schimmenti, Andrea (2026) Structuring cultural heritage content and context: integrating LLMs in ontology-driven knowledge graph extraction, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Patrimonio culturale nell'ecosistema digitale (Cultural Heritage in the Digital Ecosystem), 38th cycle. DOI 10.48676/unibo/amsdottorato/13084.

Full-text documents available:

PDF document (English, 3 MB)
Available under license: Creative Commons Attribution 4.0 (CC BY 4.0).

Abstract

Cultural Heritage (CH) institutions have digitized extensive collections and published their metadata through Semantic Web technologies, yet the content and the scholarly contextualization of documents (entities, relationships, events, and interpretations) remain largely inaccessible through semantic querying. Manual Knowledge Graph (KG) creation proves prohibitively expensive at scale, while automatic Knowledge Extraction faces critical barriers in CH contexts: limited annotated training data and domain-specific linguistic complexity. This dissertation investigates automatic Knowledge Graph extraction from Cultural Heritage texts in data-scarce scenarios, addressing three research questions: (1) What methodologies and challenges characterize existing CH text-to-KG projects? (2) How can Large Language Models (LLMs) be integrated into ontology-driven Knowledge Extraction pipelines, and what are the limitations and trade-offs? (3) Can LLM-based systems produce sufficiently accurate Knowledge Graphs of scholarly interpretations while preserving provenance and epistemic uncertainty? We conduct a systematic survey of eleven CH projects (2015-2025) and analyze 227 papers, identifying persistent bottlenecks in Named Entity Recognition, Relationship Extraction, and Entity Linking. We introduce Adaptive Text-to-KG for Cultural Heritage (ATR4CH), a five-step methodology coordinating ontology analysis, Competency Question formulation, ground-truth annotation, LLM-based extraction, and multi-layered evaluation. We validate ATR4CH through case studies including authenticity debates, archival finding aids, RAG-based argument extraction, and synthetic training data generation for Aspect-Based Sentiment Analysis. Results establish that LLMs enable ontology-aligned extraction under data scarcity, achieving accuracy sufficient for scholarly workflows.
LLMs augment rather than replace traditional pipelines, providing capabilities for bootstrapping development and serving domains where annotation costs cannot be justified. However, human oversight remains necessary: errors may propagate through pipelines, data alignment represents a persistent bottleneck, and epistemic uncertainty requires continued development. This dissertation advances the state of the art by providing a replicable methodological framework and empirical evidence that LLM-based extraction can bridge the gap between digitization and semantic accessibility of Cultural Heritage repositories.

Document type

Doctoral thesis

Author

Schimmenti, Andrea

Supervisor

Vitali, Fabio

Co-supervisor

Van Erp, Maria Godefrida Jacoba

PhD programme

Patrimonio culturale nell'ecosistema digitale (Cultural Heritage in the Digital Ecosystem)

Cycle

38

Coordinator

Tomasi, Francesca

Scientific-disciplinary sector

Area 01 - Mathematics and Computer Science > INF/01 Computer Science

Academic recruitment field

Area 01 - Mathematics and Computer Science > 01/B - Computer Science > 01/B1 Computer Science

Keywords

Cultural Heritage, Knowledge Graph, Knowledge Extraction, Text-to-KG, Large Language Models, Semantic Web, Ontology, Digital Humanities, Retrieval-Augmented Generation, Aspect-Based Sentiment Analysis, Pipeline Architecture

DOI

10.48676/unibo/amsdottorato/13084

Defence date

26 March 2026

URI

https://amsdottorato.unibo.it/id/eprint/13084