Cluster analysis on low dimensional embeddings: unveiling similarity patterns in multi-dimensional data for better stratification

Dall'olio, Lorenzo (2024) Cluster analysis on low dimensional embeddings: unveiling similarity patterns in multi-dimensional data for better stratification, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Physics, 36th cycle.
Full-text documents available:
PDF document (English) - Restricted access until 10 April 2027
Available under licence: Creative Commons Attribution Non-commercial No Derivatives 4.0 (CC BY-NC-ND 4.0).

Abstract

The thesis explores the utility of clustering analysis and dimensionality reduction in various applied physics contexts. Clustering identifies groups of data points with high similarity and is assessed through different algorithms, each with unique capabilities and limitations. Dimensionality reduction serves as both a visualization aid and preprocessing step to enhance cluster quality and reduce computational resources required for subsequent algorithms. Chapter 1 defines clustering and dimensionality reduction, compares algorithms, and selects one from each category to form a robust clustering pipeline adaptable to real-world scenarios. Three applications demonstrate the pipeline's versatility and robustness, showcasing tailored preprocessing and analysis for each context. Chapter 2 addresses automatic clustering for single cells in lymphoid tissue, tackling challenges like cross-bleed effects and detailed cell type separation. A novel preprocessing technique called Lognormal Shrinkage improves cell type separation, facilitating expert cluster validation. Chapter 3 presents a pipeline for identifying interesting genes from RNA-seq experiments, leveraging Weighted Gene Co-expression Network Analysis and novel approaches like identifying drug-targeted gene-enriched clusters. Grouping genes by co-expression patterns mitigates RNA-seq variability, enhancing robust gene identification. Chapter 4 applies the pipeline to patients with myelodysplastic syndrome using mutational data, demonstrating its efficacy in identifying distinct patient groups with varying survival outcomes, even in boolean and sparse data scenarios. Conclusions summarize algorithm properties and analyze the pipeline's advantages and potential drawbacks across diverse use cases. Further discussion includes innovative steps necessary for specific datasets.

Document type
Doctoral thesis
Author
Dall'olio, Lorenzo
Supervisor
Co-supervisor
PhD programme
Cycle
36
Coordinator
Disciplinary sector
Competition sector
Keywords
Clustering, Dimensionality Reduction, Manifold Learning, Similarity, Metric, Gene Expression, Robust, Comparison, PCA, Feature Selection, Density-Based clustering, UMAP, Hierarchical clustering, HDBSCAN, t-SNE, Machine Learning, Deep Learning, Artificial Intelligence, Applied Physics
URN:NBN
Defence date
17 June 2024
URI
