Schimmenti, Andrea
(2026)
Structuring cultural heritage content and context: integrating LLMs in ontology-driven knowledge graph extraction, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in
Patrimonio culturale nell'ecosistema digitale (Cultural Heritage in the Digital Ecosystem), 38th cycle. DOI 10.48676/unibo/amsdottorato/13084.
Abstract
Cultural Heritage (CH) institutions have digitized extensive collections and published their metadata through Semantic Web technologies, yet the content and the scholarly contextualization of documents (entities, relationships, events, and interpretations) remain largely inaccessible through semantic querying. Manual Knowledge Graph (KG) creation proves prohibitively expensive at scale, while automatic Knowledge Extraction faces critical barriers in CH contexts: limited annotated training data and domain-specific linguistic complexity. This dissertation investigates automatic Knowledge Graph extraction from Cultural Heritage texts in data-scarce scenarios, addressing three research questions: (1) What methodologies and challenges characterize existing CH text-to-KG projects? (2) How can Large Language Models (LLMs) be integrated into ontology-driven Knowledge Extraction pipelines, and what are the limitations and trade-offs? (3) Can LLM-based systems produce sufficiently accurate Knowledge Graphs of scholarly interpretations while preserving provenance and epistemic uncertainty? We conduct a systematic survey of eleven CH projects (2015-2025) and analyze 227 papers, identifying persistent bottlenecks in Named Entity Recognition, Relationship Extraction, and Entity Linking. We introduce Adaptive Text-to-KG for Cultural Heritage (ATR4CH), a five-step methodology coordinating ontology analysis, Competency Question formulation, ground-truth annotation, LLM-based extraction, and multi-layered evaluation. We validate ATR4CH through case studies including authenticity debates, archival finding aids, RAG-based argument extraction, and synthetic training data generation for Aspect-Based Sentiment Analysis. Results establish that LLMs enable ontology-aligned extraction under data scarcity, achieving accuracy sufficient for scholarly workflows.
LLMs augment rather than replace traditional pipelines, providing capabilities for bootstrapping development and serving domains where annotation costs cannot be justified. However, human oversight remains necessary: errors may propagate through pipelines, data alignment represents a persistent bottleneck, and the handling of epistemic uncertainty requires continued development. This dissertation advances the state of the art by providing a replicable methodological framework and empirical evidence that LLM-based extraction can bridge the gap between digitization and semantic accessibility of Cultural Heritage repositories.
Document type
Doctoral thesis
Author
Schimmenti, Andrea
Supervisor
Co-supervisor
PhD programme
Cycle
38
Coordinator
Academic discipline
Academic recruitment field
Keywords
Cultural Heritage
Knowledge Graph
Knowledge Extraction
Text-to-KG
Large Language Models
Semantic Web
Ontology
Digital Humanities
Retrieval-Augmented Generation
Aspect-Based Sentiment Analysis
Pipeline Architecture
DOI
10.48676/unibo/amsdottorato/13084
Defence date
26 March 2026
URI