Forresi, Chiara (2025) Techniques and methodologies to support data management and analysis in big data ecosystems, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Data science and computation, 36th cycle.
Abstract
In recent years, industries have widely adopted digital technologies, reshaping key business operations, processes, and management structures and thus driving digital transformation. Central to digital transformation is the seamless integration of processes and the exploitation of hidden data value, which pushes information systems toward complex ecosystems of data-oriented services that meet diverse data needs and requirements. Big data, characterized by the 4 Vs (volume, velocity, veracity, and variety), drives digital transformation. While scalable storage solutions exist, managing data variety remains a significant challenge in achieving the unified view of data that is essential for effective transformation. This thesis tackles the challenge of managing data variety in both batch and streaming contexts. NoSQL DBMSs have led to the adoption of polyglot storage systems, which combine the strengths of various technologies and data models. While operational applications benefit from this, analytical applications struggle with inconsistent schemas across different DBMSs and even within a single NoSQL system. As a result, data science is shifting towards a flexible, lightweight approach, moving away from traditional data warehousing. This thesis proposes an approach to support data analysis in a high-variety multistore with heterogeneous schemas and overlapping records. It also presents a case study on a data platform that integrates multiple sources using a traditional warehousing approach, and a formal study on representing complex, non-standard data distribution strategies. The literature on analyzing schemaless data streams is still in its early stages. This thesis presents a novel schema profiling technique for schemaless data streams within an overlapping sliding window paradigm and introduces a self-adaptive stream analysis framework. These approaches are integrated into a dashboard for real-time monitoring of schemaless data streams.
Document type
Doctoral thesis
Author
Forresi, Chiara
Supervisor
PhD programme
Cycle
36
Coordinator
Disciplinary sector
Competition sector
Keywords
Query optimization, Cost model, Multistore, Stream analysis, Schemaless, Heterogeneous data, Schema profiling
Defence date
8 July 2025
URI