Structural patterns for document engineering: from an empirical bottom-up analysis to an ontological theory

Poggi, Francesco (2015) Structural patterns for document engineering: from an empirical bottom-up analysis to an ontological theory, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Informatica, 27 Ciclo. DOI 10.6092/unibo/amsdottorato/7123.
Documenti full-text disponibili:
Documento PDF (English) - Richiede un lettore di PDF come Xpdf o Adobe Acrobat Reader
Download (2MB) | Anteprima


This thesis aims at investigating a new approach to document analysis based on the idea of structural patterns in XML vocabularies. My work is founded on the belief that authors do naturally converge to a reasonable use of markup languages and that extreme, yet valid instances are rare and limited. Actual documents, therefore, may be used to derive classes of elements (patterns) persisting across documents and distilling the conceptualization of the documents and their components, and may give ground for automatic tools and services that rely on no background information (such as schemas) at all. The central part of my work consists in introducing from the ground up a formal theory of eight structural patterns (with three sub-patterns) that are able to express the logical organization of any XML document, and verifying their identifiability in a number of different vocabularies. This model is characterized by and validated against three main dimensions: terseness (i.e. the ability to represent the structure of a document with a small number of objects and composition rules), coverage (i.e. the ability to capture any possible situation in any document) and expressiveness (i.e. the ability to make explicit the semantics of structures, relations and dependencies). An algorithm for the automatic recognition of structural patterns is then presented, together with an evaluation of the results of a test performed on a set of more than 1100 documents from eight very different vocabularies. This language-independent analysis confirms the ability of patterns to capture and summarize the guidelines used by the authors in their everyday practice. Finally, I present some systems that work directly on the pattern-based representation of documents. The ability of these tools to cover very different situations and contexts confirms the effectiveness of the model.

Tipologia del documento
Tesi di dottorato
Poggi, Francesco
Dottorato di ricerca
Scuola di dottorato
Scienze e ingegneria dell'informazione
Settore disciplinare
Settore concorsuale
Parole chiave
XML, descriptive markup, document visualisation, ontology, pattern recognition, structural patterns, EARMARK, document engineering, markup languages, OWL, information visualization
Data di discussione
4 Giugno 2015

Altri metadati

Statistica sui download

Gestione del documento: Visualizza la tesi