Anderlucci, Laura (2012) Comparing Different Approaches for Clustering Categorical Data, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Metodologia statistica per la ricerca scientifica, 24th cycle. DOI 10.6092/unibo/amsdottorato/4302.

      Abstract
      The literature offers several ways to carry out cluster analysis of categorical data, and the choice among them depends strongly on the researcher's aim, leaving aside time and economic constraints.
The main approaches to clustering are usually divided into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects using a defined dissimilarity measure and, based on it, allocate units to the closest group.
In clustering, one may be interested in classifying similar objects into groups, or in finding observations that come from the same true homogeneous distribution.
But do these two aims lead to the same clustering? And how well do clustering methods designed to fulfil one of these aims perform in terms of the other?
To answer these questions, two approaches, namely a latent class model (a mixture of multinomial distributions) and partitioning around medoids, are evaluated and compared using the Adjusted Rand Index, the Average Silhouette Width and the Pearson-Gamma index in a fairly wide simulation study.
The simulation outcomes are plotted in two-dimensional graphs via Multidimensional Scaling; the size of each point is proportional to the number of overlapping points, and different colours are used according to cluster membership.
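
As an illustration of one of the comparison criteria named in the abstract, the following is a minimal sketch of the Adjusted Rand Index in Python, which measures the agreement between two partitions of the same units after correcting for chance. The function name `adjusted_rand_index` and the example label vectors are assumptions made for illustration; they are not taken from the thesis, and the thesis may compute the index differently.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same units.

    Hypothetical helper for illustration only (not from the thesis).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        # With fewer than two units there are no pairs to compare.
        return 1.0

    # Contingency counts: units falling in each pair of clusters.
    cell_counts = Counter(zip(labels_a, labels_b))
    row_counts = Counter(labels_a)
    col_counts = Counter(labels_b)

    # Numbers of unit pairs placed together by both, by A, and by B.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # largest attainable agreement
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Made-up labels: a "true" grouping and the output of some clustering method.
true_groups = [0, 0, 1, 1]
found_groups = [1, 1, 0, 0]  # same partition, different label names
print(adjusted_rand_index(true_groups, found_groups))
```

The index depends only on which units are grouped together, not on the label names, so a relabelled copy of the same partition scores 1, while values near 0 indicate agreement no better than chance.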
     
    
      Document type: Doctoral thesis
      Author: Anderlucci, Laura
      Doctoral school: Scienze economiche e statistiche
      Cycle: 24
      Keywords: clustering, latent class models, partitioning around medoids, multidimensional scaling
      DOI: 10.6092/unibo/amsdottorato/4302
      Defence date: 3 February 2012
   
  