Domeniconi, Giacomo
(2016)
Data and Text Mining Techniques for In-Domain and Cross-Domain Applications, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in Computer Science, 28th cycle. DOI 10.6092/unibo/amsdottorato/7494.
Abstract
In the big data era, vast amounts of data are generated in different domains, from social media to news feeds, from health care to genomic functionalities. Addressing a problem usually requires harnessing multiple disparate datasets. Data from different domains may follow different modalities, each with its own representation, distribution, scale and density. For example, text is usually represented as discrete, sparse word-count vectors, whereas an image is represented by pixel intensities.
Many Data Mining and Machine Learning techniques have been proposed in the literature and have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. Nevertheless, some challenging questions remain when tackling a new problem: How should the problem be represented? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed?
This dissertation proposes possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way of representing a classical classification problem: instead of using one instance for each object to be classified (a document, a gene, a social post, etc.), it uses a pair of objects or an object-class pair, with the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows new categories to be added efficiently during classification. The same idea is also used to extract conversational threads from an unregulated pool of messages and to classify biomedical literature according to the genomic features it discusses.
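The object-class pair representation described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the toy documents, the category profiles built by summing word counts, and the single cosine-similarity feature are all assumptions chosen only to show the shape of the data.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Sparse word-count vector for a text (hypothetical tokenizer: whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def category_profiles(labelled_docs):
    """Assumed category representation: summed word counts of its training documents."""
    profiles = {}
    for text, category in labelled_docs:
        profiles.setdefault(category, Counter()).update(bag_of_words(text))
    return profiles


def pair_instances(labelled_docs, profiles):
    """Build one instance per (document, category) pair.

    The label describes the relationship between the two ("related" or
    "unrelated"), not the category itself. A real setup would build the
    profiles from documents other than the one being paired.
    """
    instances = []
    for text, true_category in labelled_docs:
        doc = bag_of_words(text)
        for category, profile in profiles.items():
            features = (cosine(doc, profile),)  # illustrative single feature
            label = "related" if category == true_category else "unrelated"
            instances.append((text, category, features, label))
    return instances


# Toy data, invented for illustration only.
train = [
    ("gene expression in yeast cells", "genomics"),
    ("protein coding gene annotation", "genomics"),
    ("stock market closes higher", "finance"),
    ("bank raises interest rates", "finance"),
]
profiles = category_profiles(train)
pairs = pair_instances(train, profiles)
for text, category, features, label in pairs:
    print(f"{text!r} / {category}: {features[0]:.2f} -> {label}")
```

Because the features describe a document-category relationship rather than a specific category, a model trained on such pairs could in principle score a category it has never seen, given only a profile for it. This is one way to read the abstract's remark about adding new categories efficiently.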
Document type
Doctoral thesis
Author
Domeniconi, Giacomo
Supervisor
PhD programme
Doctoral school
Information Science and Engineering
Cycle
28
Coordinator
Scientific disciplinary sector
Competition sector
Keywords
data mining, text mining, transfer learning, cross-domain classification, term weighting, hierarchical text categorization, text classification, wordnet, job recommendation, gene ontology, genomic features prediction biomedical literature
URN:NBN
DOI
10.6092/unibo/amsdottorato/7494
Defence date
12 May 2016
URI