Simone Campanoni

Associate professor
Department of Computer Science at Northwestern University

Publications


David M. Brooks

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David M. Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
arXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual rewriting and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
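The abstract's premise that assembly divides naturally into non-branching basic blocks can be illustrated with a toy splitter. The mnemonics, branch set, and input listing below are hypothetical, not the paper's actual tooling; the sketch only shows why the decomposition is mechanical.

```python
# Sketch: split an assembly listing into non-branching basic blocks,
# the unit on which symbolic methods can be applied. The instruction
# set here is illustrative.
BRANCHES = {"jmp", "je", "jne", "call", "ret"}

def basic_blocks(lines):
    blocks, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):        # a label starts a new block
            if current:
                blocks.append(current)
            current = [line]
        else:
            current.append(line)
            op = line.split()[0]
            if op in BRANCHES:        # a branch ends the block
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks

asm = """
main:
  mov r0, 1
  add r0, r0
  jne main
  mov r1, 0
  ret
""".splitlines()

for block in basic_blocks(asm):
    print(block)
```

Each resulting block is straight-line code, which is what makes it amenable to checking with a symbolic solver.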

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks
Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design
Communications of the ACM Research Highlights (CACM), 2017
PDF BibTeX CACM ACM

Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Simone Campanoni, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks
HELIX-UP: Relaxing Program Semantics to Unleash Parallelization
International Symposium on Code Generation and Optimization (CGO), 2015
Acceptance rate: 27.3% (24/88)
PDF BibTeX ACM IEEE

Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.


One of four papers nominated for the Best Paper Award by the Program Committee
Khalid Al-Hawaj, Simone Campanoni, Gu-Yeon Wei, and David M. Brooks
Unified Cache: A Case for Low Latency Communication
International Workshop on Parallelism in Mobile Platforms (PRISM), 2015
PDF
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs
International Symposium on Computer Architecture (ISCA), 2014
Acceptance rate: 17.8% (46/258)
PDF BibTeX ACM IEEE

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.


IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014
Communications of the ACM Research Highlights (CACM), 2017
Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, and David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing
International Workshop on Parallelism in Mobile Platforms (PRISM), 2014
PDF
Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream
IEEE Computer Society Digital Library (IEEE Micro), 2012
PDF BibTeX IEEE

Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing
International Symposium on Code Generation and Optimization (CGO), 2012
Acceptance rate: 28.9% (26/90)
PDF BibTeX ACM

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25×, with a maximum of 4.12×, for thirteen C benchmarks from SPEC CPU2000.
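The core idea of assigning successive loop iterations to separate threads (cyclic multithreading) can be sketched in a few lines. This toy uses a plain lock to serialize the loop-carried update, whereas HELIX uses compiler-inserted signal-based synchronization and prefetching helper threads, so the sketch illustrates only the iteration assignment, not the paper's system.

```python
# Sketch of HELIX-style cyclic iteration assignment: thread t runs
# iterations t, t+T, t+2T, ... of the loop.
import threading

def parallel_loop(n_iters, n_threads, body):
    def worker(tid):
        for i in range(tid, n_iters, n_threads):  # cyclic schedule
            body(i)
    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

total = 0
lock = threading.Lock()

def body(i):
    global total
    with lock:              # "sequential segment": loop-carried dependence
        total += i * i

parallel_loop(100, 4, body)
print(total)  # matches the sequential sum of i*i for i < 100
```

The cost of the lock in this toy is exactly the kind of inter-thread communication overhead that HELIX works to mitigate.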

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks
The HELIX Project: Overview and Directions
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks
Voltage Noise in Production Processors
IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
PDF BibTeX IEEE

Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling
International Symposium on Microarchitecture (MICRO), 2010
Acceptance rate: 18.1% (45/248)
PDF BibTeX ACM IEEE

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.


IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Kim Hazelwood, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation
ACM Transactions on Architecture and Code Optimization (TACO), 2010
PDF BibTeX ACM

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations.

Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack
Design Automation Conference (DAC), 2009
Acceptance rate: 21.7% (148/682)
PDF BibTeX ACM IEEE

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.


Peter Dinda

Yian Su, Mike Rainey, Nicholas Wanninger, Nadharm Dhiantravan, Jasper Liang, Umut A. Acar, Peter Dinda, and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
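The tuning burden described above can be made concrete with a toy fork-join sum that uses a hand-picked sequential cutoff, exactly the kind of manual granularity control HBC is designed to make unnecessary. The cutoff value, workload, and use of one OS thread per fork are illustrative simplifications, not how HBC or Heartbeat Scheduling work.

```python
# Sketch of manual granularity control in a nested fork-join computation:
# below CUTOFF, run sequentially so task-creation overhead does not
# dominate; above it, fork one half and compute the other inline.
import threading

CUTOFF = 1000  # hand-tuned threshold: too low wastes time on task
               # management, too high leaves parallelism unused

def parallel_sum(xs):
    if len(xs) <= CUTOFF:
        return sum(xs)                      # sequential base case
    mid = len(xs) // 2
    out = {}
    left = threading.Thread(
        target=lambda: out.__setitem__("left", parallel_sum(xs[:mid])))
    left.start()                            # fork
    right = parallel_sum(xs[mid:])          # continue inline
    left.join()                             # join
    return out["left"] + right

print(parallel_sum(list(range(10_000))))    # matches the sequential sum
```

Picking CUTOFF well requires reasoning about task overheads and input shape, which is the input-sensitive tuning problem the paper automates.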

Nicholas Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda
Getting a Handle on Unmanaged Memory
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.
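The handle indirection at the heart of this approach can be sketched in a few lines. This toy models only the handle table and object movement; Alaska implements the indirection transparently at the compiler level behind a malloc-compatible interface, which this sketch does not attempt.

```python
# Sketch: programs hold opaque handles, and a table maps each handle to
# the object's current location, so the runtime can move (compact)
# objects without invalidating any reference the program holds.
class Heap:
    def __init__(self):
        self.table = {}        # handle -> index into self.cells
        self.cells = []        # backing "memory"
        self.next_handle = 0

    def alloc(self, value):
        h = self.next_handle
        self.next_handle += 1
        self.table[h] = len(self.cells)
        self.cells.append(value)
        return h               # the program sees only the handle

    def free(self, h):
        self.cells[self.table.pop(h)] = None   # leaves a hole

    def deref(self, h):
        return self.cells[self.table[h]]

    def compact(self):
        # Move live objects together and update the table; handles held
        # by the program remain valid because only the table changes.
        live = sorted(self.table.items(), key=lambda kv: kv[1])
        self.cells = [self.cells[idx] for _, idx in live]
        self.table = {h: i for i, (h, _) in enumerate(live)}

heap = Heap()
a, b, c = heap.alloc("a"), heap.alloc("b"), heap.alloc("c")
heap.free(b)           # fragmentation: a hole in the middle
heap.compact()         # defragment; "a" and "c" move
print(heap.deref(a), heap.deref(c))  # handles still resolve
```

The extra table lookup on every dereference is the overhead the Alaska compiler's optimizations aim to minimize.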

Brian R. Tauro, Brian Suchy, Simone Campanoni, Peter Dinda, and Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.

Lab
Lab
Nicholas Wanninger , Tommy McMichen , Simone Campanoni , and Peter Dinda
Getting a Handle on Unmanaged Memory
ArXiv, 2024
PDF BibTeX arXiv

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis.

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation as it is challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Kyle C. Hale , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack
International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021
PDF BibTeX IEEE

The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work.

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

Lab
Souradip Ghosh , Michael Cuevas , Simone Campanoni , and Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism
High Performance Computing, Networking, Storage and Analysis (SC), 2020
Acceptance rate: 25.1% (95/378)
PDF BibTeX ACM IEEE

In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.

Lab
Brian Suchy , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM

Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions.
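In this model, protection comes from compiler-inserted guards checked against kernel-maintained region permissions rather than from hardware page tables. A toy Python sketch of such a guard (names and the region table are invented for illustration):

```python
# CARAT-style guard sketch (illustrative): the compiler emits a
# permission check before each memory access it cannot prove safe,
# against regions the kernel maintains.

regions = []   # (base, length, writable) entries, kernel-maintained

def access_ok(addr, is_write):
    """Would this access be permitted under the current regions?"""
    for base, length, writable in regions:
        if base <= addr < base + length:
            return writable if is_write else True
    return False

def guarded_store(memory, addr, value):
    """A store as it might look after the instrumentation pass."""
    assert access_ok(addr, is_write=True), "protection violation"
    memory[addr] = value
```

Moving an object then amounts to the kernel updating its region entries and asking the runtime to patch the affected pointers, with no TLB or page-table machinery involved.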

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
Acceptance rate: 17.6% (56/319)
PDF BibTeX ACM

The demand for thread-level parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
The Liberation Day of Nondeterministic Programs
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
PDF BibTeX IEEE

The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

David I. August

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT, preserving all features. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Lab
Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. The PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
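The core of a cost-aware policy like the one described is victim selection: lines whose misses caused costly decode starvations resist eviction, while ordinary LRU picks among the rest. A simplified Python sketch of that selection logic (data layout and names are invented for illustration):

```python
# Cost-aware victim selection sketch (illustrative, not EMISSARY's
# actual policy): protect "costly" lines; apply LRU to the others.

def pick_victim(lines):
    """lines: list of dicts with 'valid', 'costly', 'lru' (smaller = older).
    Returns the index of the way to evict."""
    # An invalid (free) way never requires an eviction.
    for i, ln in enumerate(lines):
        if not ln["valid"]:
            return i
    # Prefer the oldest line whose miss was cheap to tolerate.
    candidates = [i for i, ln in enumerate(lines) if not ln["costly"]]
    if not candidates:
        candidates = range(len(lines))   # all costly: fall back to plain LRU
    return min(candidates, key=lambda i: lines[i]["lru"])
```

The fallback case matters: without it, a set full of protected lines could never be refilled, which is why cost-aware policies keep a plain replacement policy in reserve.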

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Lab
Lab
Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT, preserving all features. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy-to-extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Greg Chan, Simone Campanoni , and David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Artifact: Functional Slides Video DOI GitHub

Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.

Lab
Sotiris Apostolakis , Ziyang Xu , Greg Chan, Simone Campanoni , and David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
Acceptance rate: 18.1% (86/476)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Video GitHub

The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability.

Ziyang Xu

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was reduced by 5.3× and 7.1×, respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Moreover, the PDG is not an ideal candidate for this task, as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM, enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Greg Chan, Simone Campanoni , and David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Artifact: Functional Slides Video DOI GitHub

Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.

Lab
Sotiris Apostolakis , Ziyang Xu , Greg Chan, Simone Campanoni , and David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
Acceptance rate: 18.1% (86/476)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Video GitHub

The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability.

Enrico Armenio Deiana

Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Moreover, the PDG is not an ideal candidate for this task, as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM, enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Enrico Armenio Deiana , and Simone Campanoni
Workload Characterization of Nondeterministic Programs Parallelized by STATS
International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019
Acceptance rate: 29.5% (26/88)
PDF BibTeX IEEE

Chip Multiprocessors (CMPs) are everywhere, from mobile systems to servers. Thread Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts to create TLP, multiple cores are still underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input, and it has the potential to generate scalable performance with the number of cores. Even though STATS realizes most of this potential, some of it remains unreached. This paper identifies and characterizes the sources of overhead that currently prevent STATS-parallelized programs from achieving their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28-core Intel-based machine (dual-socket). This paper shows that the performance loss is due to a combination of factors: some can be optimized via engineering efforts, and some require a deeper evolution of STATS. We also highlight potential solutions that would significantly reduce most of this overhead. Exploiting these insights will unlock scalable performance for the parallel binaries generated by STATS.

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
Acceptance rate: 17.6% (56/319)
PDF BibTeX ACM

The demand for thread-level parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
The Liberation Day of Nondeterministic Programs
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
PDF BibTeX IEEE

The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

Nikos Hardavellas

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Kyle C. Hale , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack
International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021
PDF BibTeX IEEE

The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work.

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
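The amortization idea at the heart of heartbeat scheduling can be illustrated with a toy sketch (in Python; the actual TPAL implementation works at the assembly level with interrupt-driven beats, and the function and parameter names here are hypothetical):

```python
def heartbeat_sum(items, beat_interval=100):
    """Process items sequentially, but only at every beat_interval-th
    iteration (the 'heartbeat') consider promoting remaining latent work
    into a parallel task. Promotion is merely counted here, so the
    overhead of parallelism is paid at most once per beat, not per item."""
    total, promotions = 0, 0
    for i, x in enumerate(items):
        if i > 0 and i % beat_interval == 0:
            promotions += 1  # a real runtime would split off a parallel task here
        total += x
    return total, promotions
```

The sketch's point is that promotion cost scales with the number of beats (i.e., with elapsed time), not with the amount of latent parallelism, which is how heartbeat scheduling amortizes parallelism overheads over useful work.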

Lab
Brian Suchy , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM

Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions.

Lab
Georgios Tziantzioulis , Nikos Hardavellas , and Simone Campanoni
Temporal Approximate Function Memoization
IEEE Micro's special issue on Approximate Computing, 2018
PDF BibTeX IEEE

Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality on their output. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (2-4x speedup) at the cost of a relatively small degradation of the application's output quality.
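The contrast with input-based memoization can be sketched with a small hypothetical wrapper (in Python, not the paper's compiler-directed implementation), assuming the wrapped function exhibits temporal value locality on its output:

```python
def make_output_memoized(fn, reuse_budget=3):
    """Wrap fn so that, after one real invocation, its last output is
    reused for up to reuse_budget subsequent calls before recomputing.
    Unlike input-based memoization, no input pattern-matching is done."""
    state = {"last": None, "reuses_left": 0, "real_calls": 0}

    def wrapper(*args):
        if state["reuses_left"] > 0:
            state["reuses_left"] -= 1
            return state["last"]       # approximate: rely on temporal output locality
        state["last"] = fn(*args)      # exact invocation
        state["reuses_left"] = reuse_budget
        state["real_calls"] += 1
        return state["last"]

    wrapper.state = state
    return wrapper
```

Because no input matching is performed, the lookup cost is constant; the trade-off is approximate output whenever the reuse budget outlives the locality window.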

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
Acceptance rate: 17.6% (56/319)
PDF BibTeX ACM

The demand for thread-level parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
The Liberation Day of Nondeterministic Programs
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
PDF BibTeX IEEE

The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

Timothy M. Jones

Lab
Xiaochun Zhang, Timothy M. Jones , and Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming
International Symposium on Workload Characterization (IISWC), 2021
Acceptance rate: 39.6% (19/48)
PDF BibTeX IEEE Video

Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system.

Lab
Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design
Communications of the ACM Research Highlights (CACM), 2017
PDF BibTeX CACM ACM

Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Lab
Niall Murphy, Timothy M. Jones , Robert Mullins , and Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops
International Conference on Compiler Construction (CC), 2016
Acceptance rate: 31.2% (24/77)
PDF BibTeX ACM

Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis alone cannot reach this additional performance because the existing loop-carried dependences hold only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism.

Lab
Niall Murphy, Timothy M. Jones , Simone Campanoni , and Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization
International Workshop on Compilers for Parallel Computing (CPC), 2015
PDF
Lab
Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs
International Symposium on Computer Architecture (ISCA), 2014
Acceptance rate: 17.8% (46/258)
PDF BibTeX ACM IEEE

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks.


IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014 IEEE
Communications of the ACM Research Highlights (CACM), 2017 CACM
Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Gu-Yeon Wei , and David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream
IEEE Computer Society Digital Library (IEEE Micro), 2012
PDF BibTeX IEEE

Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Vijay Janapa Reddi , Gu-Yeon Wei , and David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing
International Symposium on Code Generation and Optimization (CGO), 2012
Acceptance rate: 28.9% (26/90)
PDF BibTeX ACM

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.
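The core iteration-assignment scheme can be sketched as follows (a simplified Python model with hypothetical names; real HELIX also synchronizes loop-carried dependences between cores, which this sketch assumes away):

```python
from concurrent.futures import ThreadPoolExecutor

def helix_style_map(fn, n_iters, n_threads=4):
    """Assign successive loop iterations round-robin to worker threads,
    HELIX-style. Sketch only: it assumes iterations are independent,
    i.e., there are no loop-carried dependences to synchronize."""
    results = [None] * n_iters

    def worker(tid):
        # thread tid runs iterations i with i ≡ tid (mod n_threads)
        for i in range(tid, n_iters, n_threads):
            results[i] = fn(i)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for tid in range(n_threads):
            pool.submit(worker, tid)
        # leaving the with-block waits for all workers to finish
    return results
```

The round-robin assignment is what makes loop-carried dependences between adjacent iterations become cross-thread communication, which is why the paper focuses on mitigating that communication cost.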

Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Gu-Yeon Wei , and David M. Brooks
The HELIX Project: Overview and Directions
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.

Zujun Tan

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.

Lab
Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we still lack an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. The PDG is not an ideal candidate for this task, as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Greg Chan, Simone Campanoni , and David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Artifact: Functional Slides Video DOI GitHub

Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.

Ishita Chaturvedi

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
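The core idea described above, resisting the eviction of lines whose misses caused costly decode starvations, can be sketched as a toy replacement policy. This is illustrative only: the class and field names are mine, the cache is simplified to a single fully associative set, and EMISSARY's actual (probabilistic, bit-budgeted) policy differs.

```python
from collections import OrderedDict

class CostAwareCache:
    """Toy cost-aware cache: a single fully associative set with LRU order,
    where lines whose misses stalled the decoder are marked protected and
    evicted only as a last resort. A sketch of the idea, not EMISSARY itself."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # addr -> protected flag; order = LRU

    def access(self, addr, miss_was_costly=False):
        """Return True on hit, False on miss (inserting the line)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh LRU position on hit
            return True
        if len(self.lines) >= self.capacity:
            # Prefer the least-recently-used UNPROTECTED line as victim;
            # fall back to plain LRU if every line is protected.
            victim = next((a for a, prot in self.lines.items() if not prot),
                          next(iter(self.lines)))
            del self.lines[victim]
        # Protect lines whose miss caused a costly decode starvation.
        self.lines[addr] = miss_was_costly
        return False
```

With capacity 2, a costly miss on `a` protects it, so a later miss on `c` evicts the unprotected `b` even though `a` is older.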

Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Stefano Crespi Reghizzi

Michele Tartara , Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler
Architecture of Computing Systems (ARCS), 2010
PDF BibTeX IEEE GitHub

Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different, and binary-incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims to exploit parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability.

Simone Campanoni , Martino Sykora , Giovanni Agosta , and Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment
International Conference on Compiler Construction (CC), 2009
Acceptance rate: 25.0% (18/72)
PDF BibTeX ACM Springer GitHub

Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on lightweight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation.
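The weighted-graph model of method calls described above can be illustrated with a small greedy selection sketch. All names here are hypothetical; the paper's actual criteria are richer and the graph evolves dynamically as the program runs, whereas this sketch orders a fixed snapshot.

```python
import heapq

def compilation_order(call_graph, weights, root):
    """Greedy look-ahead ordering: starting from the running method, compile
    reachable methods in decreasing (path probability x computational weight).
    call_graph: {method: [(callee, call_probability), ...]}
    weights:    {method: estimated computational weight}
    A sketch of the DLA idea on a static snapshot, not the paper's algorithm."""
    best = {root: 1.0}                      # best path probability seen so far
    heap = [(-1.0 * weights.get(root, 0), root)]
    order, seen = [], set()
    while heap:
        _, m = heapq.heappop(heap)
        if m in seen:                       # skip stale heap entries
            continue
        seen.add(m)
        order.append(m)
        for callee, p in call_graph.get(m, []):
            prob = best[m] * p
            if prob > best.get(callee, 0.0):
                best[callee] = prob
                # priority: likelihood of being called times its weight
                heapq.heappush(heap, (-prob * weights.get(callee, 0), callee))
    return order
```

A likely, heavy callee is compiled ahead of an unlikely one of equal weight, which is the intuition behind masking JIT latency with spare cores.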

Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode
ACM SIGPLAN Notices, 2008
PDF BibTeX ACM GitHub

Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit all the potentialities of their hardware parallelism. At the same time, object code virtualization technologies are achieving a growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency-masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores.

Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler
International Conference on Very Large Scale Integration (VLSI-SoC), 2008
PDF GitHub

Svilen Kanev

Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.

Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design
Communications of the ACM Research Highlights (CACM), 2017
PDF BibTeX CACM ACM

Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs
International Symposium on Computer Architecture (ISCA), 2014
Acceptance rate: 17.8% (46/258)
PDF BibTeX ACM IEEE

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks.


IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014
Communications of the ACM Research Highlights (CACM), 2017
Simone Campanoni , Svilen Kanev , Kevin Brownell , Gu-Yeon Wei , and David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing
International Workshop on Parallelism in Mobile Platforms (PRISM), 2014
PDF
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Noise in Production Processors
IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
PDF BibTeX IEEE

Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.

Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling
International Symposium on Microarchitecture (MICRO), 2010
Acceptance rate: 18.1% (45/248)
PDF BibTeX ACM IEEE

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.


IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011

Yian Su

Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
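The granularity-control idea behind this work, paying task-creation cost at most once per fixed amount of sequential work, can be sketched as follows. This is an illustrative Python analogue under assumptions of my own: the real HBC compiler operates on C/C++ binaries and Heartbeat Scheduling uses timer-based heartbeats and a work-stealing runtime, not an iteration counter and raw threads.

```python
import threading

HEARTBEAT = 64  # consider promoting parallelism once every HEARTBEAT iterations

def heartbeat_map(f, lo, hi):
    """Apply f to every index in [lo, hi), mostly sequentially; on each
    'heartbeat' promote the second half of the remaining range to a new
    thread. Task-creation cost is thus amortized over HEARTBEAT iterations,
    so fine-grained (and irregular) loops are not swamped by overhead."""
    helpers = []
    i = lo
    while i < hi:
        # Heartbeat fires: split off the latter half of the remaining work.
        if (i - lo) % HEARTBEAT == HEARTBEAT - 1 and hi - i > 2:
            mid = (i + 1 + hi) // 2
            t = threading.Thread(target=heartbeat_map, args=(f, mid, hi))
            t.start()
            helpers.append(t)
            hi = mid            # keep only the first half locally
        f(i)
        i += 1
    for t in helpers:           # fork-join semantics: wait for promoted work
        t.join()
```

Each promotion partitions the remaining range exactly, so every index is processed once; with cheap iterations almost all work stays sequential, while long ranges still expose parallelism.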

Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT, preserving all features. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.
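The division of labor described above, where the developer states the events of interest and writes only the core profiling logic while the framework owns the instrumentation plumbing, can be illustrated with a toy skeleton. Every name below is hypothetical and none of this is PROMPT's real (LLVM-based) API.

```python
class MemoryProfiler:
    """Base class for the toy framework: a profiler declares which events it
    observes and supplies a handler; instrumentation would live elsewhere."""
    events = ()  # e.g. ("load", "store")

    def handle(self, event, addr, op_id):
        raise NotImplementedError

class DependenceProfiler(MemoryProfiler):
    """Core profiling logic only: record flow (store-to-load) dependences."""
    events = ("load", "store")

    def __init__(self):
        self.last_store = {}  # addr -> id of the last store to that address
        self.deps = set()     # (store id, load id) dependence pairs

    def handle(self, event, addr, op_id):
        if event == "store":
            self.last_store[addr] = op_id
        elif event == "load" and addr in self.last_store:
            self.deps.add((self.last_store[addr], op_id))

def run_trace(profiler, trace):
    """Stand-in for the framework side: replay an (event, addr, op_id) trace
    and dispatch only the events the profiler subscribed to."""
    for event, addr, op_id in trace:
        if event in profiler.events:
            profiler.handle(event, addr, op_id)
    return profiler
```

The profiler author writes roughly the `DependenceProfiler` part; everything `run_trace` stands in for (instrumentation, buffering, parallel event delivery) is what a framework like PROMPT aims to provide once.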

Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Unfortunately, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT, preserving all features. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Bhargav Reddy Godala

Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Unfortunately, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.


Sotiris Apostolakis

Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT, preserving all features. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.

Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM that provides this support. NOELLE extends abstractions and functionalities provided by LLVM, enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Greg Chan, Simone Campanoni , and David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Artifact: Functional Slides Video DOI GitHub

Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.

Lab
Sotiris Apostolakis , Ziyang Xu , Greg Chan, Simone Campanoni , and David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
Acceptance rate: 18.1% (86/476)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Video GitHub

The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches, which apply speculative techniques only after memory analysis proves imprecise, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability.

Vijay Janapa Reddi

Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Vijay Janapa Reddi , Gu-Yeon Wei , and David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing
International Symposium on Code Generation and Optimization (CGO), 2012
Acceptance rate: 28.9% (26/90)
PDF BibTeX ACM

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.
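HELIX's cyclic assignment of successive iterations to separate threads can be illustrated with a toy C++ sketch (illustrative only; real HELIX is a compiler transformation that also inserts synchronization for the sequential segments created by loop-carried dependences):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Toy HELIX-style cyclic scheduling: thread t runs iterations
// t, t+T, t+2T, ... of a loop whose iterations are independent.
std::vector<int> squareAll(const std::vector<int>& in, unsigned T) {
    std::vector<int> out(in.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < T; ++t)
        workers.emplace_back([&in, &out, t, T] {
            for (std::size_t i = t; i < in.size(); i += T)
                out[i] = in[i] * in[i];  // no loop-carried dependence here
        });
    for (auto& w : workers) w.join();
    return out;
}
```

The interesting case HELIX handles, which this sketch omits, is when iterations do depend on one another: there, cross-thread synchronization signals (and the helper-thread prefetching of those signals mentioned above) govern the cost.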

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Noise in Production Processors
IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
PDF BibTeX IEEE

Voltage variations are a major challenge in processor design. This article characterizes the voltage noise profiles of programs as they run to completion on a production Core 2 Duo processor, and examines the implications of resilient architecture design for voltage variation in future systems.

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling
International Symposium on Microarchitecture (MICRO), 2010
Acceptance rate: 18.1% (45/248)
PDF BibTeX ACM IEEE

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.


Also selected for IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011. IEEE
Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Kim Hazelwood , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation
ACM Transactions on Architecture and Code Optimization (TACO), 2010
PDF BibTeX ACM

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware--software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.

Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack
Design Automation Conference (DAC), 2009
Acceptance rate: 21.7% (148/682)
PDF BibTeX ACM IEEE

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.


Brian Homerding

Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming extensions like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Moreover, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.
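The C++ smart-pointer abstraction the abstract mentions can be shown with a small sketch (illustrative only, not CARMOT output): once characterization establishes that a heap buffer has a single owner within a region, a raw new/delete pair can become a std::unique_ptr, eliminating the leak the raw version risks on early exits.

```cpp
#include <memory>

// Illustrative refactoring of the kind a PSEC tool can justify: the
// buffer is owned solely by this function, so std::unique_ptr manages
// it and releases it on every exit path automatically.
int sumFirstN(int n) {
    auto buf = std::make_unique<int[]>(n);  // freed when buf goes out of scope
    for (int i = 0; i < n; ++i)
        buf[i] = i + 1;
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += buf[i];
    return sum;  // no delete[] needed; no leak even on early return
}
```

The value of a tool like CARMOT is precisely in establishing the single-owner property that makes this rewrite safe, rather than requiring the programmer to study the whole codebase.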

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM that provides this support. NOELLE extends abstractions and functionalities provided by LLVM, enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework is interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK) adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
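The kind of user-level OpenMP code the RTK, PIK, and CCK approaches bring into the kernel can be as simple as a parallel reduction (a generic OpenMP example, not taken from the paper); compiled without OpenMP support the pragma is ignored and the loop runs serially, producing the same result.

```cpp
#include <vector>

// Minimal OpenMP-style loop: with -fopenmp the iterations are split
// across threads and the partial sums are combined by the reduction
// clause; without it, the pragma is ignored and the loop is serial.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        s += a[i] * b[i];
    return s;
}
```

Running such a loop as the kernel, rather than on top of one, is what lets the three approaches specialize scheduling and memory management to the application.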

Brian Suchy

Lab
Brian R. Tauro , Brian Suchy , Simone Campanoni , Peter Dinda , and Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Brian Suchy , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM

Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions.

Giovanni Agosta

Lab
Michele Tartara , Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler
Architecture of Computing Systems (ARCS), 2010
PDF BibTeX IEEE GitHub

Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different and binary incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability.

Lab
Simone Campanoni , Martino Sykora , Giovanni Agosta , and Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment
International Conference on Compiler Construction (CC), 2009
Acceptance rate: 25.0% (18/72)
PDF BibTeX ACM Springer GitHub

Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation.

Lab
Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode
ACM SIGPLAN Notices, 2008
PDF BibTeX ACM GitHub

Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit all the potentialities of their hardware parallelism. At the same time, object code virtualization technologies are achieving a growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores.

Lab
Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler
International Conference on Very Large Scale Integration (VLSI-SoC), 2008
PDF GitHub

Glenn Holloway

Lab
Simone Campanoni , Glenn Holloway , Gu-Yeon Wei , and David M. Brooks
HELIX-UP: Relaxing Program Semantics to Unleash Parallelization
International Symposium on Code Generation and Optimization (CGO), 2015
Acceptance rate: 27.3% (24/88)
PDF BibTeX ACM IEEE

Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.


One of four papers nominated for the Best Paper Award by the Program Committee
Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Gu-Yeon Wei , and David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream
IEEE Computer Society Digital Library (IEEE Micro), 2012
PDF BibTeX IEEE

Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Vijay Janapa Reddi , Gu-Yeon Wei , and David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing
International Symposium on Code Generation and Optimization (CGO), 2012
Acceptance rate: 28.9% (26/90)
PDF BibTeX ACM

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.
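The core HELIX execution model — successive iterations on separate threads, with loop-carried work kept in iteration order by cross-thread signals — can be sketched as a simulation (illustrative only, not the compiler's generated code; the `parallel_work`/`sequential_work` split is a hypothetical stand-in for HELIX's parallel and sequential segments):

```python
from threading import Thread, Event

def helix_parallel_loop(n_iters, n_threads, parallel_work, sequential_work):
    """HELIX-style sketch: iteration i runs on thread i % n_threads.
    The loop-carried (sequential) segment of iteration i waits for a
    signal from iteration i-1, keeping cross-iteration updates ordered."""
    signals = [Event() for _ in range(n_iters + 1)]
    signals[0].set()                       # iteration 0 may enter immediately
    state = {"acc": 0}

    def worker(tid):
        for i in range(tid, n_iters, n_threads):
            local = parallel_work(i)       # independent part: runs in parallel
            signals[i].wait()              # wait for predecessor iteration
            sequential_work(state, local)  # loop-carried part: runs in order
            signals[i + 1].set()           # signal successor iteration

    threads = [Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return state
```

For example, summing squares with `parallel_work=lambda i: i * i` and `sequential_work` accumulating into `state["acc"]` lets the squaring run concurrently while the accumulation stays sequentially ordered.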

Lab
Simone Campanoni , Timothy M. Jones , Glenn Holloway , Gu-Yeon Wei , and David M. Brooks
The HELIX Project: Overview and Directions
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.


Michael D. Smith

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Noise in Production Processors
IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
PDF BibTeX IEEE

Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling
International Symposium on Microarchitecture (MICRO), 2010
Acceptance rate: 18.1% (45/248)
PDF BibTeX ACM IEEE

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.


IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 IEEE
Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Kim Hazelwood , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation
ACM Transactions on Architecture and Code Optimization (TACO), 2010
PDF BibTeX ACM

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware--software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.

Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack
Design Automation Conference (DAC), 2009
Acceptance rate: 21.7% (148/682)
PDF BibTeX ACM IEEE

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.


Souradip Ghosh

Lab
Vito Kortbeek , Souradip Ghosh , Josiah Hester , Simone Campanoni , and Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing
International Conference on Programming Language Design and Implementation (PLDI), 2022
Acceptance rate: 20.9% (68/326)
PDF ACM Artifact: Available Artifact: Reusable Video DOI GitHub

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
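The key intuition — clustering writes reduces the number of checkpoints that WAR violations force — can be sketched with a toy trace model (illustrative only; the op encoding is hypothetical, and the real WARio transformations operate on compiler IR, not traces):

```python
def checkpoints_needed(trace):
    """WARio intuition sketch: on intermittently powered systems with
    non-volatile memory, a write to a location read since the last
    checkpoint (a WAR dependency) must be preceded by a checkpoint.
    Count checkpoints for a trace of ('r'|'w', addr) ops; each
    checkpoint clears the tracked read set."""
    checkpoints, read_set = 0, set()
    for op, addr in trace:
        if op == 'r':
            read_set.add(addr)
        elif addr in read_set:   # write after read: WAR violation
            checkpoints += 1
            read_set = set()     # checkpoint clears tracked reads
    return checkpoints
```

Interleaving `read x, write x, read y, write y` costs two checkpoints, while hoisting the reads so the writes are adjacent (`read x, read y, write x, write y`) costs only one — mirroring the write-clustering transformation the paper proposes.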

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy-to-extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Lab
Souradip Ghosh , Michael Cuevas , Simone Campanoni , and Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism
High Performance Computing, Networking, Storage and Analysis (SC), 2020
Acceptance rate: 25.1% (95/378)
PDF BibTeX ACM IEEE

In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.
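The basic transformation — replacing hardware timer interrupts with compiler-injected callbacks — can be sketched for straight-line code (an illustrative pass over a toy instruction list; the real system's static analyses bound callback spacing across arbitrary control flow, and `timer_callback` is a hypothetical name):

```python
def inject_timing(blocks, granularity, callback_name="timer_callback"):
    """Compiler-based timing sketch: inject a callback into a sequence of
    basic blocks so that control reaches it at least every `granularity`
    instructions along this straight-line path."""
    out = []
    since_callback = 0
    for block in blocks:
        new_block = []
        for instr in block:
            new_block.append(instr)
            since_callback += 1
            if since_callback >= granularity:
                new_block.append(f"call {callback_name}")
                since_callback = 0
        out.append(new_block)
    return out
```

With a granularity of 2, the blocks `[a, b, c]` and `[d, e]` become `[a, b, call timer_callback, c]` and `[d, call timer_callback, e]`: the instruction count, not a hardware interrupt, determines when timing-related work runs.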

Tommy McMichen

Lab
Tommy McMichen , Nathan Greiner , Peter Zhong , Federico Sossai , Atmn Patel , and Simone Campanoni
Representing Data Collections in an SSA Form
International Symposium on Code Generation and Optimization (CGO), 2024
Acceptance rate: 32.5% (37/114)
PDF BibTeX IEEE Artifact: Available Artifact: Reusable Artifact: Reproduced DOI GitHub

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., via alias analysis, and is further hampered when performing heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.
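Dead element elimination — one of the analyses that element-level def-use information enables — can be sketched as a backward pass over collection operations (illustrative only; the op encoding here is hypothetical, and Memoir operates on an SSA IR rather than flat op lists):

```python
def dead_element_elimination(ops, live_out=frozenset()):
    """MEMOIR-flavored sketch: with element-level def-use information on a
    collection, a write to key k is dead if k is neither read later nor
    live at exit. Walk the ops backwards, tracking keys still needed."""
    needed = set(live_out)
    kept = []
    for op in reversed(ops):
        if op[0] == "read":
            needed.add(op[1])
            kept.append(op)
        else:  # ("write", key, value)
            if op[1] in needed:
                needed.discard(op[1])  # this write defines the needed value
                kept.append(op)
            # else: dead element write, eliminated
    return kept[::-1]
```

For `write a, write b, read a` with nothing live at exit, the write to `b` is provably dead at the element level and is removed — precision a whole-collection view of memory cannot offer.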

Lab
Nicholas Wanninger , Tommy McMichen , Simone Campanoni , and Peter Dinda
Getting a Handle on Unmanaged Memory
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.
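The handle indirection at the heart of this design can be sketched in a few lines (a conceptual model only: the actual Alaska system compiles pointer-based C/C++ and translates handles natively, whereas this toy uses a dictionary as the translation table):

```python
class HandleHeap:
    """Alaska-style sketch: allocations return opaque handles; every access
    goes through a translation table, so the runtime can move the backing
    storage (e.g., to defragment) without invalidating any handle."""
    def __init__(self):
        self.table = {}           # handle -> backing storage (a bytearray)
        self.next_handle = 1

    def malloc(self, size):
        h = self.next_handle
        self.next_handle += 1
        self.table[h] = bytearray(size)
        return h                  # a handle, not a raw address

    def store(self, h, offset, byte):
        self.table[h][offset] = byte    # compiler-inserted translation

    def load(self, h, offset):
        return self.table[h][offset]

    def move(self, h):
        """Relocate the object's storage; outstanding handles stay valid."""
        self.table[h] = bytearray(self.table[h])  # fresh copy elsewhere
```

Because code only ever holds handles, `move` can relocate an object at any time and subsequent loads still observe the same contents — the property that makes defragmentation possible for unmanaged code.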

Lab
Nicholas Wanninger , Tommy McMichen , Simone Campanoni , and Peter Dinda
Getting a Handle on Unmanaged Memory
ArXiv, 2024
PDF BibTeX arXiv

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employ runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis.

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

William Fornaciari

Lab
Simone Campanoni , and William Fornaciari
Multi-level Design and Optimization of Wireless Sensor Networks
International Conference on Networked Sensing Systems (INSS), 2008
PDF BibTeX IEEE

This paper proposes a methodology for the off-line planning of WSNs (wireless sensor networks) by addressing the problem in a multi-layer manner. At the sensor level, a model is described to properly select and distribute the sensors in the environment. To optimize the cost and deployment of realistic WSNs, further design activities are proposed at an intermediate level, targeting board-level clustering of sensors. Finally, a methodology is presented to hierarchically organize the set of sensors in patches with an additional gateway-level communication layer, to take into account possible scaling of the application complexity. Particular emphasis is put on cost modeling and on ensuring the correct behavior of the WSN against variation of parameters such as sensor position, protocol, cost, etc.

Lab
Simone Campanoni , and William Fornaciari
Node-Level Optimization of Wireless Sensor Networks
International Conference on Wireless Communications, Networking and Mobile Computing (WiCom), 2008
PDF BibTeX IEEE

A methodology for WSN planning should produce a suitable placement of sensors to simplify the acquisition of the data relevant for the application. The goal of this paper is to present a strategy to formally specify the system-level characteristics of the events to be monitored and to identify a proper set of sensor-position pairs, tailored to provide the required sensing capabilities. In particular, the focus is on node-level optimization, where the designer has to face the problem of clustering sensors onto the same board based upon a cost-performance tradeoff.

Lab
Simone Campanoni , and William Fornaciari
Models and Tradeoffs in WSN System-Level Design
Euromicro Symposium on Digital System Design (DSD), 2008
PDF BibTeX IEEE

System-level design of WSNs includes the selection of the sensing nodes and their dissemination in the environment to be monitored. Many design choices have to be made during this stage of the development of the application. The goal of this paper is to present a methodology to formally specify the desired behavior of the sensing application and to derive an optimal selection and placement of the network nodes. The approach is flexible and powerful, since it allows the designer to analyze the impact of clustering sensors onto a reduced set of boards (nodes) and to perform sensitivity analysis on parameters such as type of sensor, position of the nodes, observation time, presence of faults, etc. The paper introduces the concepts with some representative examples as well as with bigger use cases extracted from international projects.

Lab
Simone Campanoni , and William Fornaciari
Ensuring Feasibility of Wireless Sensor Networks
International Conference on Circuits and Systems for Communications (ICCSC), 2008
PDF BibTeX IEEE

A comprehensive WSN (wireless sensor network) design methodology should devote enough effort to selecting a suitable set of sensors with a proper spatial distribution, in order to ensure correct monitoring of the parameters relevant for the application. The goal of this paper is to show how it is possible to ensure a priori that a given WSN is actually capable of satisfying the application requirements. In addition, the flexibility of the approach is demonstrated by evaluating the sensitivity of the WSN performance to parameters such as sensor distribution, observation time, and the events to be analyzed. This paper presents the overall design methodology implemented in SWORDFISH (sensor networks development framework integrating simulation and hardware optimization) and some representative case studies.

Lab
Simone Campanoni , and William Fornaciari
SWORDFISH: A framework to formally design WSNs capturing events
International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2007
PDF BibTeX IEEE

A Wireless Sensor Network (WSN) consists of spatially distributed autonomous devices equipped with sensors and radio communication capabilities. They can act in a cooperative manner for monitoring environmental parameters like pressure, temperature, acceleration, light, humidity, etc. The typical WSN design problem is to discover a proper set of sensors and their spatial distribution, so as to enable the monitoring of the parameters relevant for the application, while concurrently optimizing some design goals (e.g., cost, reliability, lifetime, and energy requirements). In this paper we present part of the SWORDFISH (Sensor netWORks Development Framework Integrating Simulation and Hardware optimization) project, which aims at providing a design and verification environment for WSNs, including in the design loop the simulation of physical events and a formal verification of the desired WSN properties against the network optimization goals. In particular, we present the overall design framework, the approach to modeling the properties and goals of the network, and some representative case studies showing the value of a toolsuite gathering WSN formal optimization/planning and environment simulation.

Atmn Patel

Lab
Tommy McMichen , Nathan Greiner , Peter Zhong , Federico Sossai , Atmn Patel , and Simone Campanoni
Representing Data Collections in an SSA Form
International Symposium on Code Generation and Optimization (CGO), 2024
Acceptance rate: 32.5% (37/114)
PDF BibTeX IEEE Artifact: Available Artifact: Reusable Artifact: Reproduced DOI GitHub

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., via alias analysis, and is further hampered when performing heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.

Lab
Lab
Brian Homerding , Atmn Patel , Enrico Armenio Deiana , Yian Su , Zujun Tan , Ziyang Xu , Bhargav Reddy Godala , David I. August , and Simone Campanoni
The Parallel Semantics Program Dependence Graph
ArXiv, 2024
PDF BibTeX arXiv

A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. The PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 instruction caching delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.

Kevin Brownell

Lab
Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design
Communications of the ACM Research Highlights (CACM), 2017
PDF BibTeX CACM ACM

Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Lab
Simone Campanoni , Kevin Brownell , Svilen Kanev , Timothy M. Jones , Gu-Yeon Wei , and David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs
International Symposium on Computer Architecture (ISCA), 2014
Acceptance rate: 17.8% (46/258)
PDF BibTeX ACM IEEE

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks.


IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014 IEEE
Communications of the ACM Research Highlights (CACM), 2017 CACM
Lab
Simone Campanoni , Svilen Kanev , Kevin Brownell , Gu-Yeon Wei , and David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing
International Workshop on Parallelism in Mobile Platforms (PRISM), 2014
PDF

Kyle C. Hale

Lab
Brian R. Tauro , Brian Suchy , Simone Campanoni , Peter Dinda , and Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.

Lab
Kyle C. Hale , Simone Campanoni , Nikos Hardavellas , and Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack
International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021
PDF BibTeX IEEE

The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work.

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

Yebin Chon

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
Acceptance rate: 30.9% (38/123)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI GitHub

Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.

Lab
Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Lab
Ziyang Xu , Yebin Chon , Yian Su , Zujun Tan , Sotiris Apostolakis , Simone Campanoni , and David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework
ArXiv, 2023
PDF BibTeX arXiv GitHub

Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Greg Chan

Lab
Lab
Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Greg Chan, Simone Campanoni , and David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework
International Conference on Programming Language Design and Implementation (PLDI), 2020
Acceptance rate: 22.6% (77/341)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Artifact: Functional Slides Video DOI GitHub

Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks.

Lab
Sotiris Apostolakis , Ziyang Xu , Greg Chan, Simone Campanoni , and David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
Acceptance rate: 18.1% (86/476)
PDF BibTeX ACM Artifact: Available Artifact: Reusable Video GitHub

The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability.

Meeta S. Gupta

Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Kim Hazelwood , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation
ACM Transactions on Architecture and Code Optimization (TACO), 2010
PDF BibTeX ACM

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware--software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.

Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack
Design Automation Conference (DAC), 2009
Acceptance rate: 21.7% (148/682)
PDF BibTeX ACM IEEE

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.


Michael Cuevas

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Lab
Souradip Ghosh , Michael Cuevas , Simone Campanoni , and Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism
High Performance Computing, Networking, Storage and Analysis (SC), 2020
Acceptance rate: 25.1% (95/378)
PDF BibTeX ACM IEEE

In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.

Nicholas Wanninger

Lab
Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.

Lab
Nicholas Wanninger , Tommy McMichen , Simone Campanoni , and Peter Dinda
Getting a Handle on Unmanaged Memory
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.

Lab
Nicholas Wanninger , Tommy McMichen , Simone Campanoni , and Peter Dinda
Getting a Handle on Unmanaged Memory
ArXiv, 2024
PDF BibTeX arXiv

The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis.

Umut A. Acar

Lab
Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with these overheads by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
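The manual tuning burden described above often takes the form of a hand-picked grain size below which recursion runs serially. A minimal sketch of that pattern follows; the `GRAIN` constant and function names are illustrative, not taken from HBC:

```c
#include <stddef.h>

/* Manual granularity control for a divide-and-conquer reduction: below
 * GRAIN elements the recursion is cut off and the work runs serially,
 * amortizing task-creation overhead over a serial chunk. Picking GRAIN
 * well is input- and machine-dependent, which is the tuning problem. */

#define GRAIN 2048

long sum_range(const long *a, size_t lo, size_t hi) {
    if (hi - lo <= GRAIN) {            /* small range: serial, no task overhead */
        long s = 0;
        for (size_t i = lo; i < hi; i++) s += a[i];
        return s;
    }
    size_t mid = lo + (hi - lo) / 2;   /* large range: split; parallel tasks
                                          would be spawned for each half */
    return sum_range(a, lo, mid) + sum_range(a, mid, hi);
}
```

The paper's point is that such cutoffs should not need to be hand-tuned per workload: heartbeat-based granularity control makes the splitting decision automatically at run time.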

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), which holds by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
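The amortization idea behind heartbeat scheduling can be sketched with a software polling counter standing in for the interrupt-driven beats that TPAL leverages; all names below are illustrative:

```c
/* Sketch of heartbeat-style promotion: parallelism stays latent in a plain
 * sequential loop; every H units of work a "beat" fires, and only then does
 * the program consider promoting the remaining iterations into a parallel
 * task. The beat here just counts promotion opportunities, standing in for
 * the kernel/hardware interrupt mechanism the paper evaluates. */

#define H 100               /* work units per beat (illustrative) */

long promotions = 0;        /* promotion opportunities observed at beats */

long loop_with_heartbeat(long n) {
    long acc = 0, work = 0;
    for (long i = 0; i < n; i++) {
        acc += i;           /* useful work */
        if (++work == H) {  /* beat: the only per-iteration overhead is
                               this amortized counter check */
            work = 0;
            promotions++;   /* a task split would be attempted here */
        }
    }
    return acc;
}
```

Because promotion is only attempted once per beat, its cost is amortized over the H units of useful work between beats, which is what keeps the overhead of latent parallelism low.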

Aaron Nelson

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.
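The core mechanism, compiler-inserted software checks in place of hardware translation, can be pictured as a guard the compiler would emit before memory accesses. The single-region "policy" below is a hypothetical toy, not the CARAT CAKE implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy software protection guard in the spirit of compiler/kernel address
 * translation: instead of relying on page tables, instrumentation checks
 * each access against kernel-maintained policy. Here the policy is one
 * permitted region; a real system tracks all allocations and can also
 * redirect accesses when objects move. */

static uintptr_t region_base;
static size_t    region_len;

/* Kernel side: declare which byte range the process may touch. */
void set_allowed_region(void *base, size_t len) {
    region_base = (uintptr_t)base;
    region_len  = len;
}

/* Compiler-inserted check: returns 1 if an n-byte access at p is allowed.
 * On failure a real system would trap to the kernel instead. */
int guard(const void *p, size_t n) {
    uintptr_t a = (uintptr_t)p;
    return a >= region_base && a + n <= region_base + region_len;
}
```

The engineering challenge the paper addresses is making such checks cheap enough (via compiler optimization) and complete enough (covering all code, including the kernel) to replace paging outright.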

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, the run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
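For context, the kind of OpenMP fork-join code the three approaches bring into the kernel looks like the following standard user-level example (parallel when compiled with `-fopenmp`, serial otherwise):

```c
#include <stddef.h>

/* A minimal OpenMP worksharing loop: the pragma forks a team of threads,
 * splits the iterations among them, and combines the partial sums with a
 * reduction. RTK/PIK/CCK differ in where this runtime machinery lives. */
double dot(const double *x, const double *y, size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < (long)n; i++)
        s += x[i] * y[i];
    return s;
}
```

Running this inside the kernel requires replacing the services a user-level OpenMP runtime takes for granted (threads, synchronization, memory), which is exactly the design space the paper explores.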

Abdulrahman Mahmoud

Lab
Lab
Celine Lee , Abdulrahman Mahmoud , Michal Kurek , Simone Campanoni , David M. Brooks , Stephen Chong , Gu-Yeon Wei , and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
ArXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
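The basic-block property the neurosymbolic split relies on, that straight-line regions end at control transfers, can be sketched with a toy partitioner. The mnemonics and leader rules below are deliberately simplified (labels as block leaders are omitted) and are not from the paper:

```c
#include <string.h>

/* Toy basic-block splitter: a new block starts after every control-transfer
 * instruction, so each block is non-branching and small enough for symbolic
 * equivalence checking. The branch test is illustrative (ARM-style "b*"
 * prefix plus jmp/ret), not a real ISA decoder. */

static int is_branch(const char *op) {
    return strncmp(op, "b", 1) == 0 || strcmp(op, "jmp") == 0 ||
           strcmp(op, "ret") == 0;
}

/* Fills block_id[i] with the basic-block index of instruction i and
 * returns the total number of blocks. */
int split_blocks(const char **ops, int n, int *block_id) {
    int b = 0;
    for (int i = 0; i < n; i++) {
        block_id[i] = b;
        if (is_branch(ops[i])) b++;   /* next instruction starts a new block */
    }
    if (n == 0) return 0;
    return is_branch(ops[n - 1]) ? b : b + 1;
}
```

Because each resulting block is short and branch-free, a solver can check input/output equivalence per block rather than over the exponentially larger whole program.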

Alex Bernat

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), which holds by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Alexander M. Rush

Lab
Lab
Celine Lee , Abdulrahman Mahmoud , Michal Kurek , Simone Campanoni , David M. Brooks , Stephen Chong , Gu-Yeon Wei , and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
ArXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.

Angelo Matni

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Celine Lee

Lab
Lab
Celine Lee , Abdulrahman Mahmoud , Michal Kurek , Simone Campanoni , David M. Brooks , Stephen Chong , Gu-Yeon Wei , and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
ArXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.

Doug Downey

Lab
Ettore M. G. Trainiti, Simone Campanoni , and Doug Downey
Compiler-based neuron-aware deep neural network ensemble training
US Patents, 2022
Lab
Ettore M. G. Trainiti, Thanapon Noraset , David Demeter, Doug Downey , and Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training
Machine Learning and Systems (MLSys), 2021
Acceptance rate: 24.1% (52/216)
PDF BibTeX

Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles.

Ettore M. G. Trainiti

Lab
Ettore M. G. Trainiti, Simone Campanoni , and Doug Downey
Compiler-based neuron-aware deep neural network ensemble training
US Patents, 2022
Lab
Ettore M. G. Trainiti, Thanapon Noraset , David Demeter, Doug Downey , and Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training
Machine Learning and Systems (MLSys), 2021
Acceptance rate: 24.1% (52/216)
PDF BibTeX

Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles.

Federico Sossai

Lab
Tommy McMichen , Nathan Greiner , Peter Zhong , Federico Sossai , Atmn Patel , and Simone Campanoni
Representing Data Collections in an SSA Form
International Symposium on Code Generation and Optimization (CGO), 2024
Acceptance rate: 32.5% (37/114)
PDF BibTeX IEEE Artifact: Available Artifact: Reusable Artifact: Reproduced DOI GitHub

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered when attempting heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.
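The SSA discipline for collections can be pictured with a toy versioned vector in which a write defines a fresh collection version rather than mutating in place. This is an illustration of the general idea only, not MEMOIR's actual representation:

```c
#include <stdlib.h>
#include <string.h>

/* SSA for collections, in miniature: each write produces a new definition
 * of the whole collection, so every program point sees exactly one static
 * definition and element-level dataflow can be analyzed like scalar SSA.
 * The copying here is the naive semantics; a compiler would elide it. */

typedef struct {
    int   *elems;
    size_t len;
} Vec;

/* SSA-style write: returns a fresh version; the old version stays valid,
 * just as an SSA name is never redefined. */
Vec vec_write(Vec v, size_t i, int value) {
    Vec w;
    w.len = v.len;
    w.elems = malloc(w.len * sizeof(int));
    memcpy(w.elems, v.elems, w.len * sizeof(int));
    w.elems[i] = value;
    return w;
}

int vec_read(Vec v, size_t i) { return v.elems[i]; }
```

Decoupling the logical collection from its storage in this way is what lets a compiler reason about which elements and fields are live, and then choose (or elide) the physical layout.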

Jasper Liang

Lab
Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.

Lukas Gross

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , Brian Homerding , Tommy McMichen , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
International Symposium on Code Generation and Optimization (CGO), 2022
Acceptance rate: 28.7% (27/94)
PDF BibTeX ACM IEEE Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

Lab
Angelo Matni , Enrico Armenio Deiana , Yian Su , Lukas Gross , Souradip Ghosh , Sotiris Apostolakis , Ziyang Xu , Zujun Tan , Ishita Chaturvedi , David I. August , and Simone Campanoni
NOELLE Offers Empowering LLVM Extensions
ArXiv, 2021
PDF BibTeX arXiv GitHub

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.

Marco D. Santambrogio

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, this is the time for a fresh approach to the way systems are designed and used.

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation has been carried out with the PARSEC 2.1 benchmark suite.
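The observe phase implemented by HRM can be sketched as an application emitting heartbeats at progress points while a monitor derives a heart rate. Virtual time stands in for real clocks here, and all names are illustrative rather than HRM's API:

```c
/* Toy heart-rate monitor: the application calls heartbeat() whenever it
 * makes a unit of progress; the monitor computes beats per unit of time.
 * A scheduler (the decide/act phases) would compare this rate against the
 * application's performance goal and adapt resource allocation. */

static long beats = 0;
static long now = 0;     /* virtual clock; a real monitor reads wall time */

void heartbeat(void) { beats++; }
void tick(void)      { now++; }

/* Observed heart rate: beats per elapsed time unit. */
double heart_rate(void) {
    return now > 0 ? (double)beats / (double)now : 0.0;
}
```

The point of the abstraction is that "progress" is defined by the application itself, so the operating system can manage performance without understanding application semantics.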

Michael Wilkins

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.
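The classification that PSEC performs can be pictured as recording, per program state element, whether a code region reads or writes it, then deriving a recommendation. This toy profiler is purely illustrative; CARMOT's actual instrumentation and recommendations are far richer:

```c
/* Toy PSEC profiler: program state elements (PSEs) are identified by small
 * integer ids; instrumentation records reads and writes inside a region.
 * The classification could drive recommendations, e.g., a PSE only read in
 * an OpenMP region can be shared, while a written one may need privatizing. */

#define MAX_PSE 64

static int reads[MAX_PSE];
static int writes[MAX_PSE];

void observe_read(int pse)  { reads[pse]++;  }
void observe_write(int pse) { writes[pse]++; }

/* 0 = unused in region, 1 = read-only, 2 = written. */
int classify(int pse) {
    if (writes[pse]) return 2;
    return reads[pse] ? 1 : 0;
}
```

Naive instrumentation of every access is what makes profile-based PSEC so expensive; CARMOT's contribution is cutting that overhead by two orders of magnitude through compiler/runtime co-design.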

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Michal Kurek

Lab
Lab
Celine Lee , Abdulrahman Mahmoud , Michal Kurek , Simone Campanoni , David M. Brooks , Stephen Chong , Gu-Yeon Wei , and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
ArXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.

Michele Tartara

Lab
Michele Tartara , Simone Campanoni , Giovanni Agosta , and Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler
Architecture of Computing Systems (ARCS), 2010
PDF BibTeX IEEE GitHub

Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different and binary incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability.

Lab

Mike Rainey

Lab
Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

Nathan Greiner

Lab
Tommy McMichen , Nathan Greiner , Peter Zhong , Federico Sossai , Atmn Patel , and Simone Campanoni
Representing Data Collections in an SSA Form
International Symposium on Code Generation and Optimization (CGO), 2024
Acceptance rate: 32.5% (37/114)
PDF BibTeX IEEE Artifact: Available Artifact: Reusable Artifact: Reproduced DOI GitHub

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.

Lab

Nayana Prasad Nagendra

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.

Lab

Niall Murphy

Lab
Niall Murphy, Timothy M. Jones , Robert Mullins , and Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops
International Conference on Compiler Construction (CC), 2016
Acceptance rate: 31.2% (24/77)
PDF BibTeX ACM

Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism.

Lab
Niall Murphy, Timothy M. Jones , Simone Campanoni , and Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization
International Workshop on Compilers for Parallel Computing (CPC), 2015
PDF

Peter Zhong

Lab
Tommy McMichen , Nathan Greiner , Peter Zhong , Federico Sossai , Atmn Patel , and Simone Campanoni
Representing Data Collections in an SSA Form
International Symposium on Code Generation and Optimization (CGO), 2024
Acceptance rate: 32.5% (37/114)
PDF BibTeX IEEE Artifact: Available Artifact: Reusable Artifact: Reproduced DOI GitHub

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of Memoir is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, Memoir compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our Memoir compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%.

Lab
Peter Zhong , Shu-Hung You , Simone Campanoni , Robert Bruce Findler , and Christos Dimoulas
A Calculus for Unreachable Code
ArXiv, 2024
PDF BibTeX arXiv

In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively.

Robert Mullins

Lab
Niall Murphy, Timothy M. Jones , Robert Mullins , and Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops
International Conference on Compiler Construction (CC), 2016
Acceptance rate: 31.2% (24/77)
PDF BibTeX ACM

Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism.

Lab
Niall Murphy, Timothy M. Jones , Simone Campanoni , and Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization
International Workshop on Compilers for Parallel Computing (CPC), 2015
PDF

Russ Joseph

Lab
Yuanbo Fan , Simone Campanoni , and Russ Joseph
Time Squeezing for Tiny Devices
International Symposium on Computer Architecture (ISCA), 2019
Acceptance rate: 17.0% (62/365)
PDF BibTeX ACM IEEE Video

Dynamic timing slack has emerged as a compelling opportunity for eliminating inefficiency in ultra-low power embedded systems. This slack arises when all the signals have propagated through logic paths well in advance of the clock signal. When it is properly identified, the system can exploit this unused cycle time for energy savings. In this paper, we describe compiler and architecture co-design that opens new opportunities for timing slack that are otherwise impossible. Through cross-layer optimization, we introduce novel mechanisms in the hardware and in the compiler that work together to improve the benefit of circuit-level timing speculation by effectively squeezing time during execution. This approach is particularly well-suited to tiny embedded devices. Our evaluation on a gate-level model of a complete processor shows that our co-design saves (on average) 40.5% of the original energy consumption (additional 16.5% compared to the existing clock scheduling technique) across 13 workloads while retaining transparency to developers.

Lab
Yuanbo Fan , Tianyu Jia , Jie Gu , Simone Campanoni , and Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors
Design Automation Conference (DAC), 2018
Acceptance rate: 24.3% (168/691)
PDF BibTeX ACM

Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select clock period at cycle-level.

Stephen Chong

Lab
Lab
Celine Lee , Abdulrahman Mahmoud , Michal Kurek , Simone Campanoni , David M. Brooks , Stephen Chong , Gu-Yeon Wei , and Alexander M. Rush
Guess and Sketch: Language Model Guided Transpilation
ArXiv, 2023
PDF BibTeX arXiv

Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.

Vincent St-Amour

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
Acceptance rate: 17.6% (56/319)
PDF BibTeX ACM

The demand for thread-level-parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Lab
Enrico Armenio Deiana , Vincent St-Amour , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
The Liberation Day of Nondeterministic Programs
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
PDF BibTeX IEEE

The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

Wonyoung Kim

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Noise in Production Processors
IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011
PDF BibTeX IEEE

Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they examine the implications of resilient architecture design for voltage variation in future systems.

Lab
Vijay Janapa Reddi , Svilen Kanev , Wonyoung Kim , Simone Campanoni , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling
International Symposium on Microarchitecture (MICRO), 2010
Acceptance rate: 18.1% (45/248)
PDF BibTeX ACM IEEE

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.

Yuanbo Fan

Lab
Yuanbo Fan , Simone Campanoni , and Russ Joseph
Time Squeezing for Tiny Devices
International Symposium on Computer Architecture (ISCA), 2019
Acceptance rate: 17.0% (62/365)
PDF BibTeX ACM IEEE Video

Dynamic timing slack has emerged as a compelling opportunity for eliminating inefficiency in ultra-low power embedded systems. This slack arises when all the signals have propagated through logic paths well in advance of the clock signal. When it is properly identified, the system can exploit this unused cycle time for energy savings. In this paper, we describe compiler and architecture co-design that opens new opportunities for timing slack that are otherwise impossible. Through cross-layer optimization, we introduce novel mechanisms in the hardware and in the compiler that work together to improve the benefit of circuit-level timing speculation by effectively squeezing time during execution. This approach is particularly well-suited to tiny embedded devices. Our evaluation on a gate-level model of a complete processor shows that our co-design saves (on average) 40.5% of the original energy consumption (additional 16.5% compared to the existing clock scheduling technique) across 13 workloads while retaining transparency to developers.

Lab
Yuanbo Fan , Tianyu Jia , Jie Gu , Simone Campanoni , and Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors
Design Automation Conference (DAC), 2018
Acceptance rate: 24.3% (168/691)
PDF BibTeX ACM

Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select clock period at cycle-level.

Yucan Wu

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Lab

Zhen Huang

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
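For readers unfamiliar with OpenMP, the user-level code the paper considers moving into the kernel looks like the following minimal worksharing loop (a generic illustration, not code from the paper). Without OpenMP support the pragma is simply ignored and the loop runs serially with the same result:

```c
#include <stddef.h>

/* A minimal OpenMP worksharing loop: the iterations are divided among
   threads, and the `reduction` clause combines the per-thread partial
   sums into `acc`. Compiled without OpenMP, the pragma is ignored and
   the function computes the same dot product serially. */
long dot(const long *a, const long *b, size_t n) {
    long acc = 0;
    #pragma omp parallel for reduction(+:acc)
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```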

Alessandro A. Nacci

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used.

Andrea Di Biagio

Lab

August Ning

Lab

Bangyen Pham

Lab

Brian R. Tauro

Lab
Brian R. Tauro , Brian Suchy , Simone Campanoni , Peter Dinda , and Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.

Christos Dimoulas

Lab
Peter Zhong , Shu-Hung You , Simone Campanoni , Robert Bruce Findler , and Christos Dimoulas
A Calculus for Unreachable Code
ArXiv, 2024
PDF BibTeX arXiv

In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively.
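In C and C++, such a hint is typically written with the GCC/Clang builtin `__builtin_unreachable()`; the short example below (not from the paper) shows the usual pattern:

```c
/* Hypothetical example: the caller guarantees x is always 1, 2, or 3.
   The unreachable hint lets the compiler drop the fallback path and,
   for instance, emit a dense jump table; if the hint is ever wrong,
   the behavior is undefined. */
int label(int x) {
    switch (x) {
        case 1: return 10;
        case 2: return 20;
        case 3: return 30;
        default: __builtin_unreachable();  /* GCC/Clang unreachability hint */
    }
}
```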

Conghao Liu

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

David Demeter

Lab
Ettore M. G. Trainiti, Thanapon Noraset , David Demeter, Doug Downey , and Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training
Machine Learning and Systems (MLSys), 2021
Acceptance rate: 24.1% (52/216)
PDF BibTeX

Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles.

Davide B. Bartolini

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite.

Di Zhang

Lab
Ling Jin, Yinzhi Cao , Yan Chen , Di Zhang, and Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities
IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
PDF BibTeX IEEE

Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called ExGen, to generate multiple transactions as exploits to given vulnerable smart contracts. ExGen first translates either Ethereum or EOS contracts to an intermediate representation (IR). Then, ExGen generates symbolic attack contracts with transactions in a partial order and then symbolically executes the attack contracts together with the target to find and solve all the constraints. Lastly, ExGen concretizes all the symbols, generates attack contracts with multiple transactions, and verifies the generated contracts’ exploitability on a private chain with values crawled from the public chain. We implemented a prototype of ExGen and evaluated it on Ethereum and EOS benchmarks. ExGen successfully exploits 1,258/1,399 (89.9%) Ethereum and 126/130 (96.9%) EOS vulnerabilities. ExGen is also able to exploit zero-day vulnerabilities on EOS.

Dimitrios Soudris

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Donatella Sciuto

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite.

Drew Kersnar

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation as it is challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE enables significant new benefits for systems, including energy savings, larger L1 caches, and arbitrary-granularity memory management.

Ella Colby

Lab

Fabio Cancare

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite.

Filippo Sironi

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite.

Galen Collier

Lab

Gaurav Chaudhary

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation as it is challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE enables significant new benefits for systems, including energy savings, larger L1 caches, and arbitrary-granularity memory management.

Georgios Tziantzioulis

Lab
Georgios Tziantzioulis , Nikos Hardavellas , and Simone Campanoni
Temporal Approximate Function Memoization
IEEE Micro's special issue on Approximate Computing, 2018
PDF BibTeX IEEE

Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality in their output. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (a 2-4x speedup) at the cost of a relatively small degradation of the application's output quality.
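A minimal sketch of the output-based idea, using a hypothetical stand-in function rather than the paper's implementation, could look like this:

```c
/* Illustrative sketch (hypothetical, not the paper's implementation) of
   output-based memoization: instead of matching the current inputs
   against previously seen ones, reuse the last output and only
   re-execute the function periodically, exploiting temporal value
   locality of the output at the cost of bounded approximation. */
static double stand_in_kernel(double x) {   /* stand-in expensive function */
    return x * x + 1.0;
}

double memoized_kernel(double x) {
    static double last_output;
    static unsigned calls;
    if (calls++ % 4 == 0)            /* refresh every 4th invocation */
        last_output = stand_in_kernel(x);
    return last_output;              /* otherwise reuse the last output */
}
```

An input-based memoizer would instead hash `x` and look up a table of past (input, output) pairs on every call, which is exactly the overhead the output-based approach avoids.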

Gianluca C. Durelli

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used.

Gilles A. Pokam

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.

Henry Hoffmann

Lab
Filippo Sironi , Davide B. Bartolini , Simone Campanoni , Fabio Cancare , Henry Hoffmann , Donatella Sciuto , and Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing
Design Automation Conference (DAC), 2012
Acceptance rate: 22.7% (168/741)
PDF BibTeX ACM IEEE

In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance--Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self--adaptive computing system Observe--Decide--Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed looking towards multi--core processors; therefore, its experimental evaluation has been carried on with the PARSEC 2.1 benchmark suite.

Jared Stark

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.

Jeremiah Giordani

Lab

Jiacheng Ma

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Jie Gu

Lab
Yuanbo Fan , Tianyu Jia , Jie Gu , Simone Campanoni , and Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors
Design Automation Conference (DAC), 2018
Acceptance rate: 24.3% (168/691)
PDF BibTeX ACM

Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize the benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select the clock period at the cycle level.

Johannes Doerfert

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Jonathan D. Halverson

Lab

Jose Ayala

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used.

Josiah Hester

Lab
Vito Kortbeek , Souradip Ghosh , Josiah Hester , Simone Campanoni , and Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing
International Conference on Programming Language Design and Implementation (PLDI), 2022
Acceptance rate: 20.9% (68/326)
PDF ACM Artifact: Available Artifact: Reusable Video DOI GitHub

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
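The kind of WAR dependence at issue can be illustrated with a small C fragment (an illustration, not code from the paper):

```c
/* Illustrative sketch (hypothetical) of the Write-After-Read (WAR)
   problem on non-volatile memory that WARio targets: `sum` is read and
   then overwritten in the same statement. If power fails mid-loop and
   execution restarts from a checkpoint taken before the loop, the
   partially updated `sum` is read again and values are accumulated
   twice. A WAR-protecting compiler inserts a checkpoint before such
   writes; WARio instead transforms the code to cluster the writes so
   that far fewer checkpoints are needed. */
int sum;   /* imagine this lives in non-volatile main memory */

void accumulate(const int *v, int n) {
    for (int i = 0; i < n; i++) {
        /* checkpoint();  <- where a naive WAR-aware compiler inserts one */
        sum = sum + v[i];   /* read of sum (R) followed by write (W) */
    }
}
```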

Josue Pagan

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960's. Therefore this is the time for a fresh approach to the way systems are designed and used.

Katarzyna Dunajewski

Lab
Enrico Armenio Deiana , Brian Suchy , Michael Wilkins , Brian Homerding , Tommy McMichen , Katarzyna Dunajewski , Peter Dinda , Nikos Hardavellas , and Simone Campanoni
Program State Element Characterization
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced DOI GitHub

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Khalid Al-Hawaj

Lab
Khalid Al-Hawaj, Simone Campanoni , Gu-Yeon Wei , and David M. Brooks
Unified Cache: A Case for Low Latency Communication
International Workshop on Parallelism in Mobile Platforms (PRISM), 2015
PDF

Kim Hazelwood

Lab
Vijay Janapa Reddi , Simone Campanoni , Meeta S. Gupta , Kim Hazelwood , Michael D. Smith , Gu-Yeon Wei , and David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation
ACM Transactions on Architecture and Code Optimization (TACO), 2010
PDF BibTeX ACM

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.

Konstantinos Iliakis

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.
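GhOST's core selection idea can be illustrated with a toy model. This is an illustrative sketch under assumed simplifications, not the paper's hardware design: instead of stalling when the oldest decoded instruction is not ready, scan the small pool of already-decoded instructions and issue the oldest ready one.

```python
# Hypothetical model of issue-stage selection from the decode pool
# (not GhOST's actual microarchitecture).

def issue_next(pool, ready_regs):
    """pool: list of (name, source_registers) in program order.
    In-order issue would stall if pool[0] isn't ready; an OoO-style
    selector picks the oldest ready instruction instead.
    Returns its name, or None if nothing is ready."""
    for name, srcs in pool:
        if all(r in ready_regs for r in srcs):
            return name
    return None

pool = [("load r1", {"r9"}), ("add r2", {"r3", "r4"})]
# "load r1" waits on r9, so the selector issues "add r2" instead of stalling.
print(issue_next(pool, ready_regs={"r3", "r4"}))
```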

Krishna K. Rangan

Lab

Ling Jin

Lab
Ling Jin, Yinzhi Cao , Yan Chen , Di Zhang, and Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities
IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
PDF BibTeX IEEE

Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called ExGen, to generate multiple transactions as exploits to given vulnerable smart contracts. ExGen first translates either Ethereum or EOS contracts to an intermediate representation (IR). Then, ExGen generates symbolic attack contracts with transactions in a partial order and then symbolically executes the attack contracts together with the target to find and solve all the constraints. Lastly, ExGen concretizes all the symbols, generates attack contracts with multiple transactions, and verifies the generated contracts’ exploitability on a private chain with values crawled from the public chain. We implemented a prototype of ExGen and evaluated it on Ethereum and EOS benchmarks. ExGen successfully exploits 1,258/1,399 (89.9%) Ethereum and 126/130 (96.9%) EOS vulnerabilities. ExGen is also able to exploit zero-day vulnerabilities on EOS.

Luca Rocchini

Lab
Simone Campanoni , and Luca Rocchini
Static Memory Management within Bytecode Languages on Multicore Systems
Workshop on Computing in Heterogeneous, Autonomous ’N’ Goal-oriented Environments (CHANGE), 2011
PDF IEEE GitHub

Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The pipeline model is appealing for the compilation tasks that dynamic compilers need to perform, but it can bring deadlock issues when static memories are exploited by the running program. This research suggests a solution that both solves the mentioned problem and reduces the number of unnecessary compiler threads used to handle static memories. The proposed solution is a self-aware runtime system that both detects and avoids deadlocks and adapts the number of compilation threads used to handle static memories to the current workload.

Marina Zapater

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960's. Therefore this is the time for a fresh approach to the way systems are designed and used.

Martino Sykora

Lab
Simone Campanoni , Martino Sykora , Giovanni Agosta , and Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment
International Conference on Compiler Construction (CC), 2009
Acceptance rate: 25.0% (18/72)
PDF BibTeX ACM Springer GitHub

Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation.
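The weighted-graph selection described above can be sketched compactly. This is a hedged illustration of the Dynamic Look Ahead idea, not the paper's implementation: among the methods reachable from the currently running one, compile ahead the heaviest one not yet compiled. The graph shape and weights are invented for the example.

```python
# Illustrative DLA-style selection (hypothetical data, not the paper's
# compiler): pick the heaviest uncompiled method reachable from `current`.

def next_to_compile(edges, weights, current, compiled):
    """edges: caller -> list of callees; weights: method -> estimated cost.
    Returns the heaviest uncompiled method reachable from `current`,
    or None if everything reachable is already compiled."""
    best, seen, stack = None, {current}, [current]
    while stack:
        m = stack.pop()
        for callee in edges.get(m, []):
            if callee in seen:
                continue
            seen.add(callee)
            stack.append(callee)
            if callee not in compiled and (
                best is None or weights[callee] > weights[best]
            ):
                best = callee
    return best

edges = {"main": ["parse", "eval"], "eval": ["gc"]}
weights = {"main": 1, "parse": 5, "eval": 9, "gc": 3}
print(next_to_compile(edges, weights, "main", compiled={"main"}))
```

In the real system the weights evolve as execution tracing refines the cost estimates, so this query is re-run as the graph changes.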

Matteo Ferroni

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960's. Therefore this is the time for a fresh approach to the way systems are designed and used.

Michael Kruse

Lab
Zujun Tan , Yebin Chon , Michael Kruse , Johannes Doerfert , Ziyang Xu , Brian Homerding , Simone Campanoni , and David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Acceptance rate: 26.7% (72/270)
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Michael Leonard

Lab
Michael Leonard , and Simone Campanoni
Introducing the Pseudorandom Value Generator Selection in the Compilation Toolchain
International Symposium on Code Generation and Optimization (CGO), 2020
Acceptance rate: 27.4% (26/95)
PDF BibTeX ACM

As interest in randomization has grown within the computing community, the number of pseudorandom value generators (PRVGs) at developers' disposal has dramatically increased. Today, developers lack the tools necessary to obtain optimal behavior from their PRVGs. We provide the first deep study into the tradeoffs among the PRVGs in the C++ standard, finding no silver bullet for all programs and architectures. With this in mind, we have built PRV Jeeves, the first fully automatic PRVG selector. We demonstrate that when compiling widely-used, highly optimized programs with PRV Jeeves, we are able to cut execution time by 34% on average. This enhancement comes at no cost to developers.
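The selection problem can be illustrated with a minimal sketch. PRV Jeeves works at compile time on C++ standard PRVGs; this hypothetical Python analogue just times each candidate generator on a fixed draw count and picks the fastest, which is the kernel of any empirical selector.

```python
# Illustrative PRVG-selection sketch (not PRV Jeeves): benchmark each
# candidate generator and return the fastest for this workload.

import random
import timeit

def pick_prvg(candidates, n_draws=10_000):
    """candidates: name -> zero-argument callable producing one value.
    Returns the name of the fastest candidate."""
    def cost(name):
        return timeit.timeit(candidates[name], number=n_draws)
    return min(candidates, key=cost)

candidates = {
    "mersenne": random.Random(42).random,       # Mersenne Twister
    "sysrandom": random.SystemRandom().random,  # OS entropy, usually slower
}
print(pick_prvg(candidates))
```

A compile-time selector additionally has to prove the substitution preserves the program's statistical requirements, which is where the hard work lies.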

Monica Vallejo

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960's. Therefore this is the time for a fresh approach to the way systems are designed and used.

Nadharm Dhiantravan

Lab
Yian Su , Mike Rainey , Nicholas Wanninger , Nadharm Dhiantravan, Jasper Liang , Umut A. Acar , Peter Dinda , and Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
PDF BibTeX ACM Artifact: Available Artifact: Functional Artifact: Reproduced Video DOI GitHub

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.

Panagiotis-Eleftherios Eleftherakis

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Przemysław Pawełczak

Lab
Vito Kortbeek , Souradip Ghosh , Josiah Hester , Simone Campanoni , and Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing
International Conference on Programming Language Design and Implementation (PLDI), 2022
Acceptance rate: 20.9% (68/326)
PDF ACM Artifact: Available Artifact: Reusable Video DOI GitHub

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
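The insight above can be made concrete with a toy model. This is a hypothetical checkpoint-counting sketch, not WARio's compiler analysis: a checkpoint is needed before a write to a location that has been read since the last checkpoint, so clustering the writes lets one checkpoint cover several WAR dependencies.

```python
# Hypothetical WAR-checkpoint model (not WARio itself): count checkpoints
# for a trace of ('r'|'w', location) operations.

def checkpoints_needed(ops):
    """A checkpoint is placed before a write whose location was read
    since the last checkpoint; the checkpoint clears all pending WARs."""
    count, read_set = 0, set()
    for kind, loc in ops:
        if kind == "r":
            read_set.add(loc)
        elif loc in read_set:  # write-after-read: checkpoint first
            count += 1
            read_set.clear()
    return count

scattered = [("r", "x"), ("w", "x"), ("r", "y"), ("w", "y")]
clustered = [("r", "x"), ("r", "y"), ("w", "x"), ("w", "y")]
print(checkpoints_needed(scattered), checkpoints_needed(clustered))  # 2 1
```

Moving the two writes next to each other halves the checkpoint count in this toy trace, which is the effect WARio's transformations exploit at scale.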

Riccardo Cattaneo

Lab
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo, Simone Campanoni , Jose Ayala, and Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems
International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015
PDF BibTeX ACM IEEE

Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960's. Therefore this is the time for a fresh approach to the way systems are designed and used.

Robert Bruce Findler

Lab
Peter Zhong , Shu-Hung You , Simone Campanoni , Robert Bruce Findler , and Christos Dimoulas
A Calculus for Unreachable Code
ArXiv, 2024
PDF BibTeX arXiv

In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively.

Ryan Newton

Lab
Mike Rainey , Kyle C. Hale , Ryan Newton , Nikos Hardavellas , Simone Campanoni , Peter Dinda , and Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism
International Conference on Programming Language Design and Implementation (PLDI), 2021
Acceptance rate: 27.2% (87/320)
PDF BibTeX ACM Artifact: Functional

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
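The amortization argument above can be sketched in a few lines. This is a hedged simulation of the heartbeat idea, not TPAL itself: parallelism stays latent, and only when a periodic beat fires does the loop pay the cost of promoting its remaining iterations into a task, so spawn overhead is amortized over the work between beats. The cost model is an invented simplification.

```python
# Hypothetical heartbeat-scheduling cost model (not TPAL): compare spawn
# overhead when promotion happens only on periodic beats.

def run_loop(n_iters, beat_every, spawn_cost=1):
    """Simulate a loop that checks for a heartbeat each iteration and
    promotes latent parallelism (paying spawn_cost) only when it fires.
    Returns (work_done, total_spawn_overhead)."""
    work = overhead = ticks = 0
    for _ in range(n_iters):
        work += 1                # one unit of useful work
        ticks += 1
        if ticks == beat_every:  # heartbeat fired: promote now
            overhead += spawn_cost
            ticks = 0
    return work, overhead

# Eager spawning would pay n_iters * spawn_cost; heartbeat pays a fraction.
print(run_loop(1000, beat_every=100))  # (1000, 10)
```

The formal results behind heartbeat scheduling bound exactly this overhead-to-work ratio, which the TPAL primitives realize with hardware interrupts instead of a per-iteration counter.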

Sam Westrick

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Shaowei Zhu

Lab

Shu-Hung You

Lab
Peter Zhong , Shu-Hung You , Simone Campanoni , Robert Bruce Findler , and Christos Dimoulas
A Calculus for Unreachable Code
ArXiv, 2024
PDF BibTeX arXiv

In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively.

Siyuan Chai

Lab
Brian Suchy , Souradip Ghosh , Aaron Nelson , Zhen Huang, Drew Kersnar , Siyuan Chai , Michael Cuevas , Alex Bernat , Gaurav Chaudhary , Nikos Hardavellas , Simone Campanoni , and Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
Acceptance rate: 20.2% (80/397)
PDF BibTeX ACM Artifact: Available Video DOI

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Sotirios Xydis

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Thanapon Noraset

Lab
Ettore M. G. Trainiti, Thanapon Noraset , David Demeter, Doug Downey , and Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training
Machine Learning and Systems (MLSys), 2021
Acceptance rate: 24.1% (52/216)
PDF BibTeX

Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles.

Tianyu Jia

Lab
Yuanbo Fan , Tianyu Jia , Jie Gu , Simone Campanoni , and Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors
Design Automation Conference (DAC), 2018
Acceptance rate: 24.3% (168/691)
PDF BibTeX ACM

Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select clock period at cycle-level.

Tipp Moseley

Lab
Nayana Prasad Nagendra , Bhargav Reddy Godala , Ishita Chaturvedi , Atmn Patel , Svilen Kanev , Tipp Moseley , Jared Stark , Gilles A. Pokam , Simone Campanoni , and David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching
International Symposium on Computer Architecture (ISCA), 2023
Acceptance rate: 21.2% (79/372)
PDF BibTeX ACM Video

For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
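The eviction bias described above can be captured in a toy policy. This is a hypothetical model of the EMISSARY intuition, not the paper's hardware policy: among replacement candidates, prefer to evict lines whose misses were cheap, falling back to costly lines only when no cheap victim exists.

```python
# Toy cost-aware victim selection (hypothetical, not EMISSARY's design):
# resist evicting lines whose misses caused costly decode starvation.

def choose_victim(lines):
    """lines: dict addr -> (last_use_tick, miss_was_costly).
    Evict the least recently used line among cheap-miss lines when
    possible; otherwise fall back to LRU over all lines."""
    cheap = [a for a, (_, costly) in lines.items() if not costly]
    pool = cheap if cheap else list(lines)
    return min(pool, key=lambda a: lines[a][0])

cache = {
    0x10: (5, True),   # costly miss: resist evicting
    0x20: (1, False),  # cheap and oldest -> victim
    0x30: (9, False),
}
print(hex(choose_victim(cache)))  # 0x20
```

Plain LRU would keep 0x20 or 0x30 over 0x10 only by recency; the cost bias is what protects the starvation-causing line.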

Tor M. Aamodt

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.
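GhOST's selection logic is a hardware mechanism, but the dependence rules it must respect can be sketched in software. The toy model below is our own rough analogy, not GhOST's design: scanning the decoded pool oldest-first, a younger instruction (modeled as hypothetical destination/source register sets) may issue ahead only if its sources are ready and it has no RAW, WAW, or WAR conflict with any older, not-yet-issued instruction.

```python
# Toy model (our illustration, not GhOST's hardware) of selecting one
# instruction out of order from a decoded pool without register renaming.
def pick_issue(pool, ready_regs):
    """pool: oldest-first list of (dsts, srcs) register-name sets.
    Return the index of the instruction to issue, or None."""
    blocked_reads, blocked_writes = set(), set()
    for i, (dsts, srcs) in enumerate(pool):
        if srcs <= ready_regs and not (
            (srcs & blocked_writes)     # RAW on an older stalled write
            or (dsts & blocked_writes)  # WAW with an older instruction
            or (dsts & blocked_reads)   # WAR with an older instruction
        ):
            return i
        # This older instruction stays in the pool; record its registers.
        blocked_reads |= srcs
        blocked_writes |= dsts
    return None  # nothing can issue this cycle
```

For example, if the oldest instruction waits on a long-latency load result while a younger, independent instruction's source is ready, the younger one issues first; without renaming, a WAW conflict on a destination register keeps the younger one stalled.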

Tyler Sorensen

Lab
Ishita Chaturvedi , Bhargav Reddy Godala , Yucan Wu , Ziyang Xu , Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis , Dimitrios Soudris , Tyler Sorensen , Simone Campanoni , Tor M. Aamodt , and David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction
International Symposium on Computer Architecture (ISCA), 2024
Acceptance rate: 21.2% (79/372)
PDF BibTeX IEEE Artifact: Available Artifact: Functional Artifact: Reproduced

Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Vijay Kandiah

Lab
Michael Wilkins , Sam Westrick , Vijay Kandiah , Alex Bernat , Brian Suchy , Enrico Armenio Deiana , Simone Campanoni , Umut A. Acar , Peter Dinda , and Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages
International Symposium on Code Generation and Optimization (CGO), 2023
Acceptance rate: 39.2% (20/51)
PDF BibTeX ACM Artifact: Available Artifact: Reusable DOI

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Vito Kortbeek

Lab
Vito Kortbeek , Souradip Ghosh , Josiah Hester , Simone Campanoni , and Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing
International Conference on Programming Language Design and Implementation (PLDI), 2022
Acceptance rate: 20.9% (68/326)
PDF ACM Artifact: Available Artifact: Reusable Video DOI GitHub

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
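The insight that clustering writes reduces checkpoints can be made concrete with a toy model. The sketch below is our own simplified illustration, not WARio's compiler analysis: we assume a checkpoint must precede any write to an address read since the last checkpoint, and that a checkpoint clears the tracked read set, so consecutive WAR-causing writes can share one checkpoint.

```python
# Toy model (ours, not WARio's analysis) of checkpoint insertion for
# Write-After-Read hazards under intermittent execution.
def count_checkpoints(trace):
    """trace: list of ('r'|'w', addr) events. Return checkpoints inserted."""
    read_set, checkpoints = set(), 0
    for op, addr in trace:
        if op == 'w' and addr in read_set:
            checkpoints += 1   # WAR hazard: checkpoint before the write
            read_set.clear()   # a checkpoint resets the tracked reads
        if op == 'r':
            read_set.add(addr)
    return checkpoints

# Scattered WAR writes: each write needs its own checkpoint.
scattered = [('r', 'x'), ('w', 'x'), ('r', 'y'), ('w', 'y')]
# Writes hoisted together (assuming the move preserves semantics):
clustered = [('r', 'x'), ('r', 'y'), ('w', 'x'), ('w', 'y')]
```

Here `count_checkpoints(scattered)` is 2 while `count_checkpoints(clustered)` is 1, mirroring the abstract's point that moving write operations close together lets one checkpoint cover several WAR violations.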

Wenyi Wang

Lab
Jiacheng Ma , Wenyi Wang , Aaron Nelson , Michael Cuevas , Brian Homerding , Conghao Liu, Zhen Huang, Simone Campanoni , Kyle C. Hale , and Peter Dinda
Paths to OpenMP in the Kernel
High Performance Computing, Networking, Storage and Analysis (SC), 2021
Acceptance rate: 26.8% (98/365)
PDF BibTeX ACM GitHub

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Xiaochun Zhang

Lab
Xiaochun Zhang, Timothy M. Jones , and Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming
International Symposium on Workload Characterization (IISWC), 2021
Acceptance rate: 39.6% (19/48)
PDF BibTeX IEEE Video

Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system.

Xinyu Xing

Lab

Yan Chen

Lab
Ling Jin, Yinzhi Cao , Yan Chen , Di Zhang, and Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities
IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
PDF BibTeX IEEE

Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called ExGen, to generate multiple transactions as exploits to given vulnerable smart contracts. ExGen first translates either Ethereum or EOS contracts to an intermediate representation (IR). Then, ExGen generates symbolic attack contracts with transactions in a partial order and then symbolically executes the attack contracts together with the target to find and solve all the constraints. Lastly, ExGen concretizes all the symbols, generates attack contracts with multiple transactions, and verifies the generated contracts’ exploitability on a private chain with values crawled from the public chain. We implemented a prototype of ExGen and evaluated it on Ethereum and EOS benchmarks. ExGen successfully exploits 1,258/1,399 (89.9%) Ethereum and 126/130 (96.9%) EOS vulnerabilities. ExGen is also able to exploit zero-day vulnerabilities on EOS.

Yinzhi Cao

Lab
Ling Jin, Yinzhi Cao , Yan Chen , Di Zhang, and Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities
IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
PDF BibTeX IEEE

Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called ExGen, to generate multiple transactions as exploits to given vulnerable smart contracts. ExGen first translates either Ethereum or EOS contracts to an intermediate representation (IR). Then, ExGen generates symbolic attack contracts with transactions in a partial order and then symbolically executes the attack contracts together with the target to find and solve all the constraints. Lastly, ExGen concretizes all the symbols, generates attack contracts with multiple transactions, and verifies the generated contracts’ exploitability on a private chain with values crawled from the public chain. We implemented a prototype of ExGen and evaluated it on Ethereum and EOS benchmarks. ExGen successfully exploits 1,258/1,399 (89.9%) Ethereum and 126/130 (96.9%) EOS vulnerabilities. ExGen is also able to exploit zero-day vulnerabilities on EOS.

Zheng Yu

Lab

Zhenpeng Lin

Lab

Ziyi Guo

Lab

Publications without collaborators

Lab
Simone Campanoni
Guide to ILDJIT
Book from Springer, 2011
BibTeX Springer GitHub
External links:
LinkedIn · Twitter · GitHub