Rizzo, Stefano Giovanni (2017)
Temporal Dimension of Text: Quantification, Metrics and Features, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in Computer Science and Engineering, 29th cycle. DOI 10.6092/unibo/amsdottorato/8004.
Abstract
The time dimension is so inherently bound to any information space that it can hardly be ignored when describing reality, nor can it be disregarded when interpreting most information. Given the pressing need to search and classify ever larger amounts of unstructured data with better accuracy, the temporal dimension of text documents is becoming a crucial property for information retrieval and text mining tasks.
Of all the features that characterize textual information, the time dimension is still not fully exploited, despite its richness and diversity. Temporal information retrieval is still in its infancy, and the time features of documents are barely taken into account in text classification.
The temporal aspects of text can be used to better interpret the relative truthfulness and the context of old information, and to determine the relevance of a document with respect to information needs and categories.
In this research, we first explore the temporal dimension of text collections in a large-scale study of more than 30 million documents, quantifying its extent and showing its peculiarities and patterns, such as the relation between the creation time of documents and the time they mention.
Then we define a comprehensive and accurate representation of the temporal aspects of documents, modeling ad-hoc temporal similarities based on metric distances between time intervals.
Evaluation results show that taking into account the temporal relevance of documents yields a significant improvement in retrieval effectiveness, for both implicit and explicit time queries, and a gain in classification accuracy when temporal features are included.
By defining a set of temporal features that comprehensively describe the temporal scope of text documents, we show their significant relation to topical categories and how the proposed features are able to categorize documents, improving text categorization when combined with ordinary term-frequency features.
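The idea of a temporal similarity built on a metric distance between time intervals can be illustrated with a minimal sketch. The abstract does not specify which distance the thesis uses, so everything below is an assumption made for illustration: the `Interval` class, the Manhattan distance on interval endpoints, and the exponential decay with a hypothetical `scale` parameter are not taken from the thesis.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed time interval, e.g. expressed in years (illustrative type)."""
    start: float
    end: float

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("interval end must not precede its start")


def interval_distance(a: Interval, b: Interval) -> float:
    """Manhattan distance between the (start, end) points of two intervals.

    Treating an interval as a point in the plane makes this a true metric:
    it is non-negative, zero only for identical intervals, symmetric, and
    satisfies the triangle inequality. This choice is an assumption, not
    necessarily the distance defined in the thesis.
    """
    return abs(a.start - b.start) + abs(a.end - b.end)


def temporal_similarity(a: Interval, b: Interval, scale: float = 10.0) -> float:
    """Map a distance to a similarity in (0, 1] by exponential decay.

    `scale` is a hypothetical tuning parameter controlling how quickly
    similarity falls off with distance.
    """
    return math.exp(-interval_distance(a, b) / scale)


# A query about the 1990s compared with two document intervals.
query = Interval(1990, 1999)
doc_close = Interval(1992, 1998)
doc_far = Interval(1950, 1960)
ranked = sorted([doc_far, doc_close],
                key=lambda d: temporal_similarity(query, d),
                reverse=True)
```

In this sketch a document whose mentioned interval lies closer to the query interval ranks higher; such a score could in principle be combined with a textual relevance score, although how the thesis combines them is not stated in the abstract.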
Document type
Doctoral thesis
Author
Rizzo, Stefano Giovanni
Supervisor
PhD programme
Cycle
29
Coordinator
Scientific-disciplinary sector
Competition sector
Keywords
Time, dimension, temporal expressions, timex, information retrieval, text categorization, feature engineering, machine learning, New York Times, Wikipedia, metric distances, time intervals, time quantification, temporal queries, content-level time, relative time
URN:NBN
DOI
10.6092/unibo/amsdottorato/8004
Defence date
15 May 2017
URI