Data sensitivity detection in chat interactions for privacy protection

Gambarelli, Gaia (2023) Data sensitivity detection in chat interactions for privacy protection, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Culture letterarie e filologiche, 35th cycle. DOI 10.48676/unibo/amsdottorato/10573.

Full-text documents available:

PDF document (English) - Requires a PDF reader such as Xpdf or Adobe Acrobat Reader
Available under license: Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0).
Download (13MB)

Abstract

In recent years, there has been exponential growth in the use of virtual spaces, including dialogue systems, that handle personal information. The concept of personal privacy is widely debated and contested in the literature, whereas in the technological field it directly influences how reliable an information system is perceived to be (privacy ‘as trust’). This work aims to protect the right to privacy of personal data (GDPR, 2018) and to prevent the loss of sensitive content by exploring the sensitive information detection (SID) task. It is grounded on the following research questions: (RQ1) What does sensitive data mean? How can a personal sensitive information domain be defined? (RQ2) How can a state-of-the-art model for SID be created? (RQ3) How can the model be evaluated? RQ1 theoretically investigates the concepts of privacy and the state-of-the-art ontological representations of personal information. The Data Privacy Vocabulary (DPV) is the taxonomic resource taken as the authoritative reference for defining the knowledge domain. Concerning RQ2, we investigate two approaches to classifying sensitive data: the first, bottom-up, explores machine learning methods based on transformer networks; the second, top-down, proposes logical-symbolic methods through the construction of PRIVAFRAME, a knowledge graph of compositional frames representing personal data categories. Both approaches are tested. For the evaluation (RQ3), we create SPeDaC, a sentence-level labeled resource. It can be used as a benchmark or as training data for the SID task, filling the gap left by the lack of a shared resource in this field. While the approach based on artificial neural networks confirms the validity of the direction adopted in the most recent studies on SID, the logical-symbolic approach emerges as the preferred way to classify fine-grained personal data categories, thanks to the semantically grounded, tailored modeling it allows.
At the same time, the results highlight the strong potential of hybrid architectures in solving automatic tasks.

Document type

Doctoral thesis

Author

Gambarelli, Gaia

Supervisor

Gangemi, Aldo

Co-supervisor

Lunghini, Marco ; Tripodi, Rocco

PhD programme

Culture letterarie e filologiche

Cycle