Abstract
This thesis addresses the challenge of deploying Artificial Intelligence (AI) algorithms at the edge—a critical issue in the evolving Internet of Things (IoT) landscape, where applications such as smart cities, autonomous drones, and augmented reality demand near-sensor processing under strict power, performance, and storage constraints. With traditional hardware scaling methods reaching their limits, this work advances architectural innovations on both the host CPU and the AI accelerator in IoT Systems-on-Chip. On the host side, the first contribution introduces a technique to mitigate load-use hazards in the CVA6 CPU—a 6-stage, single-issue, open-source RISC-V processor. This redesign increases maximum frequency by 4% while reducing area and power consumption by 2.5%. The backend improvements yield an average 6.5% gain in IPC, with peaks of 29%, leading to overall performance boosts of up to 35% and energy efficiency improvements averaging 6.5% at 1 GHz. The second host contribution enhances the dual-issue superscalar configuration of the CVA6 core by implementing a renaming scheme to eliminate write-after-write hazards, a low-overhead ALU-to-ALU forwarding mechanism, and an improved branch predictor. At the cost of an 11% increase in area, this extension improves IPC by 45%. On the accelerator side, two works target low-power embedded environments. The first presents Dustin, a fully flexible accelerator in 65 nm TSMC technology featuring a 16-core RISC-V compute cluster. Dustin incorporates mixed-precision support via hardware-based packing/unpacking and a novel Vector Lockstep Execution Mode (VLEM) that allows switching from a MIMD to a SIMD-like execution model, yielding up to 40% power savings. The final contribution explores heterogeneous acceleration through Analog In-Memory Computing. By integrating two specialized accelerators with a flexible compute cluster, this approach achieves performance and efficiency improvements of up to 10× over a core-only implementation, reaching nearly 1 TOPS of performance and 6.4 TOPS/W of energy efficiency, underscoring the promise of heterogeneous architectures for future AI devices.
Document type
Doctoral thesis
Author
Ottavi, Gianmarco
Supervisor
Co-supervisor
PhD programme
Cycle
37
Coordinator
Scientific-disciplinary sector
Recruitment field
Keywords
Computer Architectures, RISC-V, CVA6, CV32E40P, AI, Deep Neural Networks, DNN, Mixed-precision, Quantized Neural Networks, QNN, AI Acceleration
Defence date
4 April 2025
URI
Other metadata