Pattern-based segmentation of digital documents: model and implementation

di Iorio, Angelo (2007) Pattern-based segmentation of digital documents: model and implementation, [Doctoral thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Computer Science (Dottorato di ricerca in Informatica), 19th cycle. DOI 10.6092/unibo/amsdottorato/370.


This thesis proposes a new document model in which any document can be segmented into independent components and transformed into a pattern-based projection that uses only a very small set of objects and composition rules. The key point is that such a normalized document expresses the same fundamental information as the original, in a simple, clear and unambiguous way. The central part of my work discusses that model and investigates how a digital document can be segmented and how the segmented version can be used to implement advanced conversion tools. I present seven patterns that are versatile enough to capture the most relevant document structures, and whose minimality and rigour make that implementation possible. The abstract model is then instantiated into an actual markup language called IML. IML is a general and extensible language that essentially adopts XHTML syntax and captures, a posteriori, only the content of a digital document. It is compared with other languages and proposals in order to clarify its role and objectives. Finally, I present some systems built upon these ideas. These applications are evaluated in terms of advantages for users, workflow improvements and impact on the overall quality of the output. In particular, they cover heterogeneous content management processes: from web editing to collaboration (IsaWiki and WikiFactory), and from e-learning (IsaLearning) to professional printing (IsaPress).

Document type
Doctoral thesis
di Iorio, Angelo
PhD programme
Disciplinary sector
Competition sector
Keywords
segmentation, patterns, pentaformat, IML, digital documents
Defence date
16 April 2007
