Pasolini, Roberto
  
(2015)
Learning Methods and Algorithms for Semantic Text Classification across Multiple Domains, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. 
 Dottorato di ricerca in 
Ingegneria elettronica, informatica e delle telecomunicazioni, 27 Ciclo. DOI 10.6092/unibo/amsdottorato/7058.
  
 
  
  
        
        
        
  
  
  
  
  
  
  
    
  
    
      Documenti full-text disponibili:
      
    
  
  
    
      Abstract
      Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems.
Among these, text categorization deals with the automatic organization of large quantities of documents in priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a consistent training set and notable computational effort.
Methods for cross-domain text categorization have been proposed, allowing to leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times.
A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively able to be reused in a different one.
     
    
      Abstract
      Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems.
Among these, text categorization deals with the automatic organization of large quantities of documents in priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a consistent training set and notable computational effort.
Methods for cross-domain text categorization have been proposed, allowing to leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times.
A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively able to be reused in a different one.
     
  
  
    
    
      Tipologia del documento
      Tesi di dottorato
      
      
      
      
        
      
        
          Autore
          Pasolini, Roberto
          
        
      
        
          Supervisore
          
          
        
      
        
          Co-supervisore
          
          
        
      
        
          Dottorato di ricerca
          
          
        
      
        
          Scuola di dottorato
          Scienze e ingegneria dell'informazione
          
        
      
        
          Ciclo
          27
          
        
      
        
          Coordinatore
          
          
        
      
        
          Settore disciplinare
          
          
        
      
        
          Settore concorsuale
          
          
        
      
        
          Parole chiave
          text mining, text categorization, cross-domain text categorization, 
          
        
      
        
          URN:NBN
          
          
        
      
        
          DOI
          10.6092/unibo/amsdottorato/7058
          
        
      
        
          Data di discussione
          14 Aprile 2015
          
        
      
      URI
      
      
     
   
  
    Altri metadati
    
      Tipologia del documento
      Tesi di dottorato
      
      
      
      
        
      
        
          Autore
          Pasolini, Roberto
          
        
      
        
          Supervisore
          
          
        
      
        
          Co-supervisore
          
          
        
      
        
          Dottorato di ricerca
          
          
        
      
        
          Scuola di dottorato
          Scienze e ingegneria dell'informazione
          
        
      
        
          Ciclo
          27
          
        
      
        
          Coordinatore
          
          
        
      
        
          Settore disciplinare
          
          
        
      
        
          Settore concorsuale
          
          
        
      
        
          Parole chiave
          text mining, text categorization, cross-domain text categorization, 
          
        
      
        
          URN:NBN
          
          
        
      
        
          DOI
          10.6092/unibo/amsdottorato/7058
          
        
      
        
          Data di discussione
          14 Aprile 2015
          
        
      
      URI
      
      
     
   
  
  
  
  
  
    
    Statistica sui download
    
    
  
  
    
      Gestione del documento: 
      
        