Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

Domeniconi, Giacomo (2016) Data and Text Mining Techniques for In-Domain and Cross-Domain Applications, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Computer Science, 28th cycle. DOI 10.6092/unibo/amsdottorato/7494.

Abstract

In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities. Many Data Mining and Machine Learning techniques have been proposed in the literature and have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. However, some challenging issues remain when tackling a new problem: how should the problem be represented? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes some possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using one instance for each object (a document, a gene, a social post, etc.) to be classified, it proposes using a pair of objects or an object-class pair, with the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature based on the genomic features treated.
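
The object-class pair representation described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the toy documents, the hand-written class profiles and the two pair features (cosine similarity and word overlap) are all assumptions chosen for illustration. The point it shows is that each training instance describes the relationship between a document and a category, so the feature space stays the same size however many categories exist.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Sparse word-count vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_features(doc, profile):
    """Hypothetical features describing a (document, class) pair."""
    shared = len(doc.keys() & profile.keys())
    overlap = shared / len(doc) if doc else 0.0
    return (cosine(doc, profile), overlap)


def build_pair_dataset(docs, profiles):
    """One instance per (document, class) pair; label 1 means 'belongs to'."""
    rows = []
    for text, true_class in docs:
        bow = bag_of_words(text)
        for name, profile in profiles.items():
            rows.append((pair_features(bow, profile), int(name == true_class)))
    return rows


# Toy data, invented for this sketch.
docs = [
    ("the striker scored a late goal", "sport"),
    ("the gene regulates protein expression", "biology"),
]
profiles = {
    "sport": bag_of_words("goal match striker team scored"),
    "biology": bag_of_words("gene protein cell expression dna"),
}

pairs = build_pair_dataset(docs, profiles)
for features, label in pairs:
    print(features, label)

# A new category only needs a profile; the pair features keep the same shape.
profiles["finance"] = bag_of_words("market stock bank price shares")
new_doc = bag_of_words("the bank raised its stock price")
print(pair_features(new_doc, profiles["finance"]))
```

Any binary classifier could then be trained on such pairs; because the model learns the "belongs to" relationship and not the categories themselves, a category added later can in principle be scored without changing the feature space.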

Document type

Doctoral thesis

Author

Domeniconi, Giacomo

Supervisor

Moro, Gianluca ; Sartori, Claudio

PhD programme

Computer Science

Doctoral school

Information Science and Engineering

Cycle

28

Coordinator

Ciaccia, Paolo

Academic discipline

Area 09 - Industrial and Information Engineering > ING-INF/05 Information Processing Systems

Academic recruitment field

Area 09 - Industrial and Information Engineering > 09/H - Computer Engineering > 09/H1 Information Processing Systems

Keywords

data mining, text mining, transfer learning, cross-domain classification, term weighting, hierarchical text categorization, text classification, wordnet, job recommendation, gene ontology, genomic features prediction biomedical literature

URN:NBN

urn:nbn:it:unibo-18679

DOI

10.6092/unibo/amsdottorato/7494

Defence date

12 May 2016

URI

http://amsdottorato.unibo.it/id/eprint/7494