Publications

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David M. Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M. Rush. Guess and Sketch: Language Model Guided Transpilation. International Conference on Learning Representations (ICLR), 2024.

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David M. Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M. Rush. Guess and Sketch: Language Model Guided Transpilation. arXiv, 2023.
Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual rewriting and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
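
To make the division of labor concrete, the following minimal Python sketch shows a guess-then-repair pipeline in the spirit of the abstract above. The LM guess, block aligner, equivalence checker, and solver-based repair are passed in as callables because all four are hypothetical stand-ins for the paper's components, not its actual API.

```python
# Hedged sketch of a Guess & Sketch-style loop; every helper passed in is a
# hypothetical stand-in, not the authors' implementation.
from typing import Callable, Optional

def transpile(src_asm: str,
              guess: Callable[[str], str],
              align: Callable[[str, str], list[tuple[str, str]]],
              equivalent: Callable[[str, str], bool],
              repair: Callable[[str, str], Optional[str]]) -> Optional[str]:
    candidate = guess(src_asm)          # LM proposes a full translation
    out: list[str] = []
    # Assembly divides into short, non-branching basic blocks, keeping each
    # equivalence query handed to the symbolic solver small.
    for src_blk, dst_blk in align(src_asm, candidate):
        if equivalent(src_blk, dst_blk):
            out.append(dst_blk)
        else:
            fixed = repair(src_blk, dst_blk)  # solver resolves the mismatch
            if fixed is None:
                return None                   # cannot transpile this input
            out.append(fixed)
    return "\n".join(out)
```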

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks. Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design. Communications of the ACM (CACM) Research Highlights, 2017.
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Simone Campanoni, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. HELIX-UP: Relaxing Program Semantics to Unleash Parallelization. International Symposium on Code Generation and Optimization (CGO), 2015. Acceptance rate: 27.3% (24/88).
Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.
One of four papers nominated for the Best Paper Award by the Program Committee.

Khalid Al-Hawaj, Simone Campanoni, Gu-Yeon Wei, and David M. Brooks. Unified Cache: A Case for Low Latency Communication. International Workshop on Parallelism in Mobile Platforms (PRISM), 2015.

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks. HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs. International Symposium on Computer Architecture (ISCA), 2014. Acceptance rate: 17.8% (46/258).
Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
Honorable mention, IEEE Micro's Top Picks in Computer Architecture Conferences, 2014. Selected as a Communications of the ACM (CACM) Research Highlight, 2017.

Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, and David M. Brooks. Breaking Cyclic-Multithreading Parallelization with XML Parsing. International Workshop on Parallelism in Mobile Platforms (PRISM), 2014.

Gu-Yeon Wei, David M. Brooks, Simone Campanoni, Kevin Brownell, and Svilen Kanev. Methods and apparatus for parallel processing. US Patents, 2014.

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. HELIX: Making the Extraction of Thread-Level Parallelism Mainstream. IEEE Computer Society Digital Library (IEEE Micro), 2012.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David M. Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. International Symposium on Code Generation and Optimization (CGO), 2012. Acceptance rate: 28.9% (26/90).
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25×, with a maximum of 4.12×, for thirteen C benchmarks from SPEC CPU2000.
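
The key idea above, successive iterations on separate threads with loop-carried dependences still satisfied in iteration order, can be illustrated with a small DOACROSS-style toy. This is a sketch under stated assumptions (Python threads, one sequential segment ordered by per-iteration events); the real HELIX runtime depends on compiler analysis, loop selection heuristics, and helper threads that prefetch synchronization signals.

```python
# Toy model of HELIX-style parallelization: cyclic iteration assignment,
# with the loop-carried update ("sequential segment") ordered by events.
import threading

N, THREADS = 16, 4
done = [threading.Event() for _ in range(N + 1)]
done[0].set()                    # pretend iteration -1 already finished
total = 0                        # loop-carried state

def worker(tid: int) -> None:
    global total
    for i in range(tid, N, THREADS):  # iteration i runs on thread i % THREADS
        local = i * i                  # parallel work: no dependences
        done[i].wait()                 # wait for the previous iteration
        total += local                 # sequential segment, in order
        done[i + 1].set()              # signal the next iteration

threads = [threading.Thread(target=worker, args=(t,)) for t in range(THREADS)]
for t in threads: t.start()
for t in threads: t.join()
assert total == sum(i * i for i in range(N))
```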

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. The HELIX Project: Overview and Directions. Design Automation Conference (DAC), 2012. Acceptance rate: 22.7% (168/741).
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Noise in Production Processors. IEEE Micro's Top Picks in Computer Architecture Conferences, 2011.
Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor and examine the implications of resilient architecture design for voltage variation in future systems.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling. International Symposium on Microarchitecture (MICRO), 2010. Acceptance rate: 18.1% (45/248).
Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.
Selected as one of IEEE Micro's Top Picks in Computer Architecture Conferences, 2011.
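
As a rough illustration of the scheduling idea, the sketch below pairs the phases of two programs so that high-activity phases of one overlap low-activity phases of the other, flattening aggregate demand on the shared power rail. The per-phase activity numbers and the greedy pairing objective are illustrative assumptions, not the paper's measured noise metrics.

```python
# Hedged sketch: co-schedule program phases to smooth aggregate activity.
def co_schedule(phases_a: list[float], phases_b: list[float]) -> list[tuple[int, int]]:
    """Pair phase indices of A and B so quiet phases meet loud ones."""
    quiet_first = sorted(range(len(phases_a)), key=lambda i: phases_a[i])
    loud_first = sorted(range(len(phases_b)), key=lambda i: -phases_b[i])
    return list(zip(quiet_first, loud_first))

# Example: A's calmest phase (index 1) runs alongside B's noisiest (index 0).
print(co_schedule([0.9, 0.1, 0.5], [0.8, 0.2, 0.4]))  # [(1, 0), (2, 2), (0, 1)]
```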

Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Kim Hazelwood, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Eliminating Voltage Emergencies via Software-Guided Code Transformation. ACM Transactions on Architecture and Code Optimization (TACO), 2010.
In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations.

Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack. Design Automation Conference (DAC), 2009. Acceptance rate: 21.7% (148/682).
Power-constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.

Vijay Janapa Reddi, Meeta S. Gupta, Krishna K. Rangan, Simone Campanoni, Glenn Holloway, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Noise: Why It's Bad, and What To Do About It. International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009.

Yian Su, Mike Rainey, Nicholas Wanninger, Nadharm Dhiantravan, Jasper Liang, Umut A. Acar, Peter Dinda, and Simone Campanoni. Compiling Loop-Based Nested Parallelism for Irregular Workloads. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
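
Heartbeat scheduling's central move, keeping parallelism latent and promoting it to a real task only when a beat has elapsed, fits in a few lines. The sketch below is a toy under assumptions of this write-up (Python threads, a global beat, halving the remaining range on promotion); HBC itself emits compiler-generated code backed by formal performance guarantees.

```python
# Hedged sketch of heartbeat-style granularity control for a parallel loop.
import threading, time

HEARTBEAT_SEC = 0.001           # promote latent parallelism at most this often
_last = time.monotonic()
_lock = threading.Lock()

def _beat() -> bool:
    """True at most once per heartbeat interval, across all threads."""
    global _last
    with _lock:
        now = time.monotonic()
        if now - _last >= HEARTBEAT_SEC:
            _last = now
            return True
    return False

def parallel_for(lo: int, hi: int, body) -> None:
    while lo < hi:
        if hi - lo > 1 and _beat():            # a beat: pay for a real task
            mid = (lo + hi) // 2
            t = threading.Thread(target=parallel_for, args=(mid, hi, body))
            t.start()
            parallel_for(lo, mid, body)        # keep the other half here
            t.join()
            return
        body(lo)                               # between beats: stay sequential
        lo += 1

parallel_for(0, 10_000, lambda i: None)        # tiny smoke test
```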

Nicholas Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda. Getting a Handle on Unmanaged Memory. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.
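
The handle indirection at the heart of this design is easy to demonstrate: when clients hold stable handles instead of raw addresses, the runtime may relocate the backing storage at will. The toy heap below is illustrative only; Alaska performs the equivalent transformation transparently on C/C++ pointers at compile time.

```python
# Toy movable heap: clients keep integer handles; compact() relocates
# every live object, and handles remain valid because only the table changes.
class HandleHeap:
    def __init__(self) -> None:
        self._mem = bytearray()
        self._table: dict[int, tuple[int, int]] = {}  # handle -> (offset, len)
        self._next = 0

    def alloc(self, data: bytes) -> int:
        h = self._next
        self._next += 1
        self._table[h] = (len(self._mem), len(data))
        self._mem += data
        return h                      # caller stores the handle, not an address

    def deref(self, h: int) -> bytes:  # the one extra level of indirection
        off, n = self._table[h]
        return bytes(self._mem[off:off + n])

    def compact(self, live: set[int]) -> None:
        new_mem, new_table = bytearray(), {}
        for h in sorted(live):
            off, n = self._table[h]
            new_table[h] = (len(new_mem), n)
            new_mem += self._mem[off:off + n]
        self._mem, self._table = new_mem, new_table

heap = HandleHeap()
a, b = heap.alloc(b"stale"), heap.alloc(b"keys")
heap.compact(live={b})                 # drop a, move b; b's handle still works
assert heap.deref(b) == b"keys"
```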

Brian R. Tauro, Brian Suchy, Simone Campanoni, Peter Dinda, and Kyle C. Hale. TrackFM: Far-out Compiler Support for a Far Memory World. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.
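
A compiler-based far-memory system ultimately rewrites every possibly-remote dereference into a guarded access that faults the object into local memory on first touch. The sketch below illustrates that rewrite under assumptions of this write-up (object-granularity caching, a fetch_remote callable); TrackFM operates on LLVM IR and applies many further optimizations, such as hoisting guards out of loops.

```python
# Hedged sketch of the guarded-access pattern a far-memory compiler emits.
from typing import Callable

class FarPtr:
    def __init__(self, obj_id: int) -> None:
        self.obj_id = obj_id           # names an object living in far memory

class LocalCache:
    def __init__(self, fetch_remote: Callable[[int], bytes]) -> None:
        self._resident: dict[int, bytes] = {}
        self._fetch = fetch_remote

    def deref(self, p: FarPtr) -> bytes:
        # The compiler-inserted guard: local hit if resident, otherwise one
        # round trip to far memory, after which accesses stay local.
        if p.obj_id not in self._resident:
            self._resident[p.obj_id] = self._fetch(p.obj_id)
        return self._resident[p.obj_id]

cache = LocalCache(fetch_remote=lambda oid: f"object-{oid}".encode())
assert cache.deref(FarPtr(7)) == b"object-7"
```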

Zhenpeng Lin, Zheng Yu, Ziyi Guo, Simone Campanoni, Peter Dinda, and Xinyu Xing. CAMP: Compiler and Allocator-based Heap Memory Protection. USENIX Security Symposium (USENIX Security), 2024.

Nicholas Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda. Getting a Handle on Unmanaged Memory. arXiv, 2024.
The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis.

Enrico Armenio Deiana, Brian Suchy, Michael Wilkins, Brian Homerding, Tommy McMichen, Katarzyna Dunajewski, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. Program State Element Characterization. International Symposium on Code Generation and Optimization (CGO), 2023. Acceptance rate: 39.2% (20/51).
Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Michael Wilkins, Sam Westrick, Vijay Kandiah, Alex Bernat, Brian Suchy, Enrico Armenio Deiana, Simone Campanoni, Umut A. Acar, Peter Dinda, and Nikos Hardavellas. WARDen: Specializing Cache Coherence for High-Level Parallel Languages. International Symposium on Code Generation and Optimization (CGO), 2023. Acceptance rate: 39.2% (20/51).
High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46×) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Brian Suchy, Souradip Ghosh, Aaron Nelson, Zhen Huang, Drew Kersnar, Siyuan Chai, Michael Cuevas, Alex Bernat, Gaurav Chaudhary, Nikos Hardavellas, Simone Campanoni, and Peter Dinda. CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022. Acceptance rate: 20.2% (80/397).
Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Kyle C. Hale, Simone Campanoni, Nikos Hardavellas, and Peter Dinda. The Case for an Interwoven Parallel Hardware/Software Stack. International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021.
The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work.

Jiacheng Ma, Wenyi Wang, Aaron Nelson, Michael Cuevas, Brian Homerding, Conghao Liu, Zhen Huang, Simone Campanoni, Kyle C. Hale, and Peter Dinda. Paths to OpenMP in the Kernel. High Performance Computing, Networking, Storage and Analysis (SC), 2021. Acceptance rate: 26.8% (98/365).
OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework is interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Mike Rainey, Kyle C. Hale, Ryan Newton, Nikos Hardavellas, Simone Campanoni, Peter Dinda, and Umut A. Acar. Task Parallel Assembly Language for Uncompromising Parallelism. International Conference on Programming Language Design and Implementation (PLDI), 2021. Acceptance rate: 27.2% (87/320).
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

Souradip Ghosh, Michael Cuevas, Simone Campanoni, and Peter Dinda. Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism. High Performance Computing, Networking, Storage and Analysis (SC), 2020. Acceptance rate: 25.1% (95/378).
In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.
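
The core mechanism, replacing timer interrupts with compiler-injected counter checks at cut points such as loop back edges, looks roughly like the hand-written illustration below. The interval constant and scheduler callback are hypothetical; in the real system the instrumentation is a whole-program compiler transformation whose static analyses make the callback rate predictable on every control-flow path.

```python
# Hedged sketch of compiler-based timing: a decrement-and-test that the
# compiler would inject at loop back edges, here written by hand.
CALLBACK_INTERVAL = 1_000        # units of work between runtime callbacks
_budget = CALLBACK_INTERVAL

def _back_edge(callback) -> None:
    global _budget
    _budget -= 1                 # what the injected code does per iteration
    if _budget == 0:
        _budget = CALLBACK_INTERVAL
        callback()               # fine-grain "preemption" point, no interrupt

def transformed_loop(data, callback) -> int:
    total = 0
    for x in data:
        total += x               # original loop body
        _back_edge(callback)     # compiler-injected timing check
    return total

assert transformed_loop(range(10), callback=lambda: None) == 45
```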

Brian Suchy, Simone Campanoni, Nikos Hardavellas, and Peter Dinda. CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation. International Conference on Programming Language Design and Implementation (PLDI), 2020. Acceptance rate: 22.6% (77/341).
Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions.
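
CARAT's invariant, that the runtime knows every allocation and every location that may hold a pointer to it, is what lets a kernel move memory and still keep programs correct without hardware translation. The toy runtime below makes the tracking calls explicit and uses one-element lists to stand in for memory words; in CARAT these calls are injected by the compiler at IR level.

```python
# Hedged sketch of CARAT-style tracking: the runtime can relocate an
# allocation and patch every tracked pointer to it.
class CaratRuntime:
    def __init__(self) -> None:
        self.escapes: dict[int, list] = {}   # allocation addr -> pointer cells

    def track_allocation(self, addr: int) -> None:
        self.escapes.setdefault(addr, [])

    def track_escape(self, addr: int, cell: list) -> None:
        self.escapes[addr].append(cell)      # 'cell' models a memory word

    def move(self, old: int, new: int) -> None:
        cells = self.escapes.pop(old, [])
        for cell in cells:                   # patch every pointer to 'old'
            cell[0] = new
        self.escapes[new] = cells

rt = CaratRuntime()
rt.track_allocation(0x1000)
word = [0x1000]                              # program data holding a pointer
rt.track_escape(0x1000, word)
rt.move(0x1000, 0x2000)                      # kernel relocates the allocation
assert word[0] == 0x2000                     # the program never notices
```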

Enrico Armenio Deiana, Vincent St-Amour, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. Unconventional Parallelization of Nondeterministic Applications. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018. Acceptance rate: 17.6% (56/319).
The demand for thread-level parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Enrico Armenio Deiana, Vincent St-Amour, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. The Liberation Day of Nondeterministic Programs. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017.
The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August. PROMPT: A Fast and Extensible Memory Profiling Framework. International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024. Acceptance rate: 30.9% (38/123).
Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1×, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.
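
The division of labor PROMPT argues for can be pictured simply: the framework owns instrumentation and event delivery, and a profiler contributes only its core logic as event handlers. The register/emit API and event names below are assumptions made for illustration, not PROMPT's interface; the example reduces a flow-dependence profiler to its essential handlers.

```python
# Hedged sketch of an event-driven memory-profiling framework.
from collections import defaultdict
from typing import Callable

class ProfilingFramework:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event: str, handler: Callable) -> None:
        self._handlers[event].append(handler)     # e.g. "load", "store"

    def emit(self, event: str, **info) -> None:   # called from instrumentation
        for h in self._handlers[event]:
            h(**info)

# Core logic of a dependence profiler: remember each address's last writer
# and record read-after-write edges.
fw = ProfilingFramework()
last_writer: dict[int, int] = {}
deps: set[tuple[int, int]] = set()

def on_store(addr: int, inst: int) -> None:
    last_writer[addr] = inst

def on_load(addr: int, inst: int) -> None:
    if addr in last_writer:
        deps.add((last_writer[addr], inst))       # RAW dependence edge

fw.on("store", on_store)
fw.on("load", on_load)
fw.emit("store", addr=0x10, inst=1)
fw.emit("load", addr=0x10, inst=2)
assert (1, 2) in deps
```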

Ishita Chaturvedi, Bhargav Reddy Godala, Yucan Wu, Ziyang Xu, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis, Dimitrios Soudris, Tyler Sorensen, Simone Campanoni, Tor M. Aamodt, and David I. August. GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction. International Symposium on Computer Architecture (ISCA), 2024. Acceptance rate: 21.2% (79/372).
Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) the prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Jeremiah Giordani, Ziyang Xu, Ella Colby, August Ning, Bhargav Reddy Godala, Ishita Chaturvedi, Shaowei Zhu, Yebin Chon, Greg Chan, Zujun Tan, Galen Collier, Jonathan D. Halverson, Enrico Armenio Deiana, Jasper Liang, Federico Sossai, Yian Su, Atmn Patel, Bangyen Pham, Nathan Greiner, Simone Campanoni, and David I. August. Revisiting Computation for Research: Practices and Trends. High Performance Computing, Networking, Storage and Analysis (SC), 2024. Acceptance rate: 22.7% (102/449).

Brian Homerding, Atmn Patel, Enrico Armenio Deiana, Yian Su, Zujun Tan, Ziyang Xu, Bhargav Reddy Godala, David I. August, and Simone Campanoni. The Parallel Semantics Program Dependence Graph. arXiv, 2024.
A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming like OpenMP extends a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Unfortunately, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Nayana Prasad Nagendra, Bhargav Reddy Godala, Ishita Chaturvedi, Atmn Patel, Svilen Kanev, Tipp Moseley, Jared Stark, Gilles A. Pokam, Simone Campanoni, and David I. August. EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching. International Symposium on Computer Architecture (ISCA), 2023. Acceptance rate: 21.2% (79/372).
For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
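
Expressed as a replacement policy, the idea reduces to a protected bit: lines whose misses starved the decoder resist eviction, and victims are drawn from unprotected lines first. The single-set model below, with random victim choice among unprotected lines, is a simplification for illustration; the paper evaluates concrete policy variants within a set-associative L2.

```python
# Hedged sketch of an EMISSARY-like, cost-aware instruction-cache set.
import random

class CostAwareSet:
    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.lines: dict[int, bool] = {}   # tag -> protected bit

    def access(self, tag: int, miss_caused_starvation: bool = False) -> bool:
        if tag in self.lines:
            return True                    # hit
        if len(self.lines) >= self.ways:   # pick a victim, unprotected first
            pool = [t for t, p in self.lines.items() if not p] or list(self.lines)
            del self.lines[random.choice(pool)]
        # Misses that caused costly decode starvation earn protection.
        self.lines[tag] = miss_caused_starvation
        return False

s = CostAwareSet(ways=2)
s.access(0xA, miss_caused_starvation=True)   # protected
s.access(0xB)                                # unprotected
s.access(0xC)                                # evicts 0xB, never 0xA
assert s.access(0xA)                         # 0xA survived
```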
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but its profitability can be limited by missing facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
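For intuition, here is the shape of portable, natural OpenMP output such a decompiler could emit for a PolyBench-style kernel. This is an illustrative sketch, not actual SPLENDID output; the kernel and names are invented.

```cpp
// Sketch: decompiled output as plain, portable OpenMP source.
#include <cstdio>

enum { N = 1024 };
static double A[N][N], x[N], y[N];

void mvt_like() {
  // Standard OpenMP accepted by any conforming compiler: the decompiled
  // loop nest carries no dialect- or backend-specific constructs.
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      x[i] += A[i][j] * y[j];
}

int main() { mvt_like(); std::printf("%f\n", x[0]); return 0; }
```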
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
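The division of labor the abstract describes can be sketched as follows: the developer writes only the core profiling logic, invoked at events the framework drives. All names here (`DepProfiler`, `onStore`, `onLoad`) are invented for illustration and are not PROMPT's actual API; instrumentation would be generated by the framework.

```cpp
// Sketch: the developer-written core of a flow-dependence profiler.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct DepProfiler {
  std::unordered_map<uint64_t, uint64_t> lastStore;  // address -> store id
  uint64_t flowDeps = 0;

  // Core logic only: the framework would call these at each memory event.
  void onStore(uint64_t addr, uint64_t instId) { lastStore[addr] = instId; }
  void onLoad(uint64_t addr, uint64_t /*instId*/) {
    if (lastStore.count(addr)) ++flowDeps;  // a RAW dependence was observed
  }
};

int main() {
  DepProfiler p;       // stand-in for framework-generated instrumentation
  p.onStore(0x10, 1);
  p.onLoad(0x10, 2);   // depends on store 1
  p.onLoad(0x18, 3);   // no prior store: no dependence
  std::printf("flow deps: %llu\n", (unsigned long long)p.flowDeps);
  return 0;
}
```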
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
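As a rough picture of what "higher-level, program-wide abstractions" buy a custom tool, consider the sketch below: the tool queries pre-built loop and dependence abstractions instead of reconstructing them from raw LLVM IR. The interface names are illustrative stand-ins, not NOELLE's actual API.

```cpp
// Sketch: a custom tool built on a NOELLE-style compilation layer.
#include <cstdio>
#include <vector>

// Abstractions provided by the layer (illustrative stand-ins).
struct LoopAbstraction {
  bool hasLoopCarriedMemoryDependence() const { return false; }
};
struct CompilationLayer {
  std::vector<LoopAbstraction> getLoops() const { return {}; }
};

// The custom tool: count loops that look DOALL-parallelizable, program-wide.
int countDOALLCandidates(const CompilationLayer& layer) {
  int n = 0;
  for (const auto& loop : layer.getLoops())
    if (!loop.hasLoopCarriedMemoryDependence()) ++n;
  return n;
}

int main() {
  CompilationLayer layer;
  std::printf("DOALL candidates: %d\n", countDOALLCandidates(layer));
  return 0;
}
```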
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
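The collaborative structure SCAF describes can be pictured as modules answering the same dependence query and the framework keeping the most precise answer, tracking whether it rests on a profile-backed speculative assertion (which would need validation code). This is a minimal sketch under invented names, not SCAF's implementation.

```cpp
// Sketch: combining analysis modules' answers to one dependence query.
#include <cstdio>
#include <vector>

enum class Answer { MayDep, NoDepProved, NoDepSpeculated };

struct Module { Answer (*query)(int src, int dst); };

Answer combined(const std::vector<Module>& mods, int src, int dst) {
  Answer best = Answer::MayDep;
  for (const auto& m : mods) {
    Answer a = m.query(src, dst);
    if (a == Answer::NoDepProved) return a;       // strongest result wins
    if (a == Answer::NoDepSpeculated) best = a;   // usable, needs validation
  }
  return best;
}

int main() {
  std::vector<Module> mods = {
      {+[](int, int) -> Answer { return Answer::MayDep; }},          // static
      {+[](int, int) -> Answer { return Answer::NoDepSpeculated; }}  // profile
  };
  // Prints 2: the dependence can be speculated away, with validation.
  std::printf("%d\n", (int)combined(mods, 1, 2));
  return 0;
}
```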
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT where all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
|
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
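The selection idea can be illustrated in software, greatly simplified and not GhOST's hardware: rather than stalling on the oldest decoded instruction, the issue stage may pick a younger one whose operands are ready, using only information already present in the decode and issue stages. Structures and names below are assumptions for illustration.

```cpp
// Sketch: issue-stage selection from the decoded-instruction pool.
#include <array>
#include <cstdio>
#include <vector>

struct Inst { int id; std::vector<int> srcRegs; };

int selectToIssue(const std::vector<Inst>& decodedPool,
                  const std::array<bool, 32>& regReady) {
  for (const Inst& in : decodedPool) {          // oldest-first scan
    bool ready = true;
    for (int r : in.srcRegs) ready = ready && regReady[r];
    if (ready) return in.id;                    // first ready, possibly OoO
  }
  return -1;                                    // nothing ready: stall
}

int main() {
  std::array<bool, 32> ready{};
  ready[3] = ready[4] = true;
  std::vector<Inst> pool = {{0, {1}}, {1, {3, 4}}};  // inst 0 waits on r1
  std::printf("issue inst %d\n", selectToIssue(pool, ready));  // issues 1
  return 0;
}
```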
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
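One way to picture PSEC's payoff: the characterization of each program state element maps directly onto OpenMP data-sharing clauses. In the sketch below the classification comments are hand-written for the example, standing in for what a tool like CARMOT would report; the loop itself is invented.

```cpp
// Sketch: a PSEC-style characterization turned into OpenMP clauses.
#include <cstdio>

int main() {
  const int n = 1000;
  double scale = 2.0;   // PSEC: read-only          -> firstprivate
  double tmp;           // PSEC: per-iteration write -> private
  double sum = 0.0;     // PSEC: read-modify-write   -> reduction(+)

  #pragma omp parallel for firstprivate(scale) private(tmp) reduction(+ : sum)
  for (int i = 0; i < n; ++i) {
    tmp = scale * i;
    sum += tmp;
  }
  std::printf("%f\n", sum);
  return 0;
}
```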
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
|
Enrico Armenio Deiana
, and
Simone Campanoni
Workload Characterization of Nondeterministic Programs Parallelized by STATS International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019 Acceptance rate: 29.5% (26/88) Chip Multiprocessors (CMPs) are everywhere, from mobile systems to servers. Thread-Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts to create TLP, multiple cores are still underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input, and it has the potential to generate scalable performance with the number of cores. Even though STATS realizes most of this potential, some of it is still unreached. This paper identifies and characterizes the sources of overhead that currently prevent STATS-parallelized programs from achieving their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28-core Intel-based machine (dual-socket). This paper shows that the performance loss is due to a combination of factors: some can be removed via engineering effort, while others require a deeper evolution of STATS. We also highlight potential solutions to significantly reduce most of this overhead. Exploiting these insights will unlock scalable performance for the parallel binaries generated by STATS. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018 Acceptance rate: 17.6% (56/319) The demand for thread-level-parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
The Liberation Day of Nondeterministic Programs International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017 The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%. |
|
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
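Conceptually, compiler- and runtime-based address translation replaces hardware page-walks with compiler-inserted checks that let the kernel enforce protection and move memory. The toy guard below illustrates only that shape; it is an assumption-laden sketch, not the CARAT CAKE implementation, and `Region`/`guardStore` are invented names.

```cpp
// Sketch: a compiler-inserted protection guard before a store.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct Region { uintptr_t base, size; bool writable; };
static Region gRegion;  // in reality: per-process metadata kept by the kernel

// The compiler would insert this check before stores it cannot prove safe.
inline void guardStore(void* p) {
  uintptr_t a = (uintptr_t)p;
  if (a < gRegion.base || a >= gRegion.base + gRegion.size ||
      !gRegion.writable) {
    std::fprintf(stderr, "protection violation at %p\n", p);
    std::abort();  // stand-in for trapping into the kernel
  }
}

int main() {
  int* buf = (int*)std::malloc(4 * sizeof(int));
  gRegion = {(uintptr_t)buf, 4 * sizeof(int), true};
  guardStore(&buf[2]);  // inserted by the compiler before the store
  buf[2] = 42;
  std::printf("%d\n", buf[2]);
  std::free(buf);
  return 0;
}
```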
|
Kyle C. Hale
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021 The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work. |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
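The heartbeat idea can be sketched at source level: parallelism stays latent in a sequential loop, and only when a periodic beat fires does the program pay the cost of splitting off work. Here the beat is a plain counter checked at the loop back-edge; TPAL instead uses interrupt-based primitives at the assembly level, and the recursive split below merely stands in for spawning a task.

```cpp
// Sketch: heartbeat-style promotion of latent loop parallelism.
#include <cstdio>

const int HEARTBEAT = 1000;  // promote latent parallelism every N iterations

long sumRange(long lo, long hi) {
  long acc = 0, sinceBeat = 0;
  for (long i = lo; i < hi; ++i) {
    acc += i;  // useful sequential work
    if (++sinceBeat == HEARTBEAT) {     // heartbeat check at the back-edge
      sinceBeat = 0;
      long mid = (i + 1 + hi) / 2;
      if (hi - (i + 1) > HEARTBEAT) {   // enough work left: split off half
        long rest = sumRange(mid, hi);  // in TPAL this would become a task
        hi = mid;
        acc += rest;
      }
    }
  }
  return acc;
}

int main() { std::printf("%ld\n", sumRange(0, 1000000)); return 0; }
```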
|
Brian Suchy
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions. |
|
Georgios Tziantzioulis
,
Nikos Hardavellas
, and
Simone Campanoni
Temporal Approximate Function Memoization IEEE Micro's special issue on Approximate Computing, 2018 Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality in their outputs. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (a 2-4x speedup) at the cost of a relatively small degradation of the application's output quality. |
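A minimal sketch of the output-based idea, with an invented expensive function and window size: instead of matching inputs against a table, the wrapper exploits temporal value locality of the output, returning the last computed value for a window of calls and refreshing it periodically, trading accuracy for time.

```cpp
// Sketch: output-based (temporal) memoization of an expensive function.
#include <cmath>
#include <cstdio>

double expensive(double x) { return std::sin(x) * std::exp(-x / 100.0); }

double memoizedExpensive(double x) {
  static double lastOut = 0.0;
  static int calls = 0;
  const int WINDOW = 8;      // accept approximation for this many calls
  if (calls++ % WINDOW == 0)
    lastOut = expensive(x);  // refresh the memoized output
  return lastOut;            // otherwise reuse: output changes slowly
}

int main() {
  double acc = 0.0;
  for (int i = 0; i < 100; ++i) acc += memoizedExpensive(i * 0.01);
  std::printf("%f\n", acc);  // approximate, but far cheaper to compute
  return 0;
}
```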
Xiaochun Zhang,
Timothy M. Jones
, and
Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming International Symposium on Workload Characterization (IISWC), 2021 Acceptance rate: 39.6% (19/48) Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system. |
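The "semantic gap" can be made concrete with a small example. The sequential histogram below carries a loop-carried dependence on `hist` that a conservative compiler must respect, yet the programmer knows the updates commute. Declaring that fact, here with a standard OpenMP array reduction standing in for the language extensions the paper evaluates, legalizes the parallel schedule.

```cpp
// Sketch: closing a semantic gap by declaring commutative updates.
#include <cstdio>

int main() {
  const int N = 1 << 20, B = 16;
  long hist[B] = {0};

  // Updates to hist commute: their order across iterations is irrelevant
  // to the program's semantics, and the reduction clause says so.
  #pragma omp parallel for reduction(+ : hist)
  for (int i = 0; i < N; ++i)
    hist[i % B] += 1;

  std::printf("%ld\n", hist[0]);  // 65536 regardless of schedule
  return 0;
}
```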
|
Niall Murphy,
Timothy M. Jones
,
Robert Mullins
, and
Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops International Conference on Compiler Construction (CC), 2016 Acceptance rate: 31.2% (24/77) Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism. |
|
Niall Murphy,
Timothy M. Jones
,
Simone Campanoni
, and
Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization International Workshop on Compilers for Parallel Computing (CPC), 2015 |
|
Simone Campanoni
,
Kevin Brownell
,
Svilen Kanev
,
Timothy M. Jones
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs International Symposium on Computer Architecture (ISCA), 2014 Acceptance rate: 17.8% (46/258) Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks. IEEE Micro's Top Picks in Computer Architecture Conferences honorable mention, 2014. Communication ACM Research Highlights (CACM), 2017. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream IEEE Computer Society Digital Library (IEEE Micro), 2012 Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Vijay Janapa Reddi
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing International Symposium on Code Generation and Optimization (CGO), 2012 Acceptance rate: 28.9% (26/90) We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000. |
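The execution shape HELIX produces can be rendered as a toy sketch: successive iterations go to different threads (cyclic assignment), and the sequential segment, the loop-carried update, is ordered by per-iteration signaling. Real HELIX selects loops with a profile-driven model and optimizes signal traffic; this sketch, with invented variable names, only shows the shape.

```cpp
// Sketch: HELIX-style cyclic iteration assignment with ordered segments.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int iters = 1000, nthreads = 4;
  std::atomic<int> turn{0};  // which iteration may run its sequential part
  long acc = 0;

  auto worker = [&](int tid) {
    for (int i = tid; i < iters; i += nthreads) {  // cyclic assignment
      long contribution = (long)i * i;             // parallel part
      while (turn.load(std::memory_order_acquire) != i) { /* wait signal */ }
      acc += contribution;  // sequential segment, executed in loop order
      turn.store(i + 1, std::memory_order_release);  // signal next iteration
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker, t);
  for (auto& th : pool) th.join();
  std::printf("%ld\n", acc);
  return 0;
}
```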
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
The HELIX Project: Overview and Directions Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism. |
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
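The division of labor the abstract describes, where the framework owns instrumentation and event delivery and the profiler writer supplies only the events of interest plus core logic, might look roughly like the sketch below. All interface names are invented for illustration; PROMPT's actual API differs.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Illustrative framework/profiler split. A profiler subscribes to
// memory events and supplies only its core logic; the (omitted)
// framework would drive these callbacks from instrumented code.
struct MemoryProfiler {
  virtual ~MemoryProfiler() = default;
  virtual void onLoad(uint64_t /*instrId*/, uint64_t /*addr*/) {}
  virtual void onStore(uint64_t /*instrId*/, uint64_t /*addr*/) {}
  virtual void report() {}
};

// Example core logic: record store->load (flow) memory dependences,
// the kind of fact speculative parallelization consumes.
class FlowDepProfiler : public MemoryProfiler {
  std::unordered_map<uint64_t, uint64_t> lastStore_;  // addr -> storing instr
  std::unordered_map<uint64_t, uint64_t> rawDeps_;    // load instr -> store instr
 public:
  void onStore(uint64_t instrId, uint64_t addr) override { lastStore_[addr] = instrId; }
  void onLoad(uint64_t instrId, uint64_t addr) override {
    auto it = lastStore_.find(addr);
    if (it != lastStore_.end()) rawDeps_[instrId] = it->second;
  }
  void report() override {
    for (auto& [ld, st] : rawDeps_)
      std::printf("load %llu depends on store %llu\n",
                  (unsigned long long)ld, (unsigned long long)st);
  }
};

int main() {
  FlowDepProfiler p;
  p.onStore(1, 0x1000);  // instruction 1 writes address 0x1000
  p.onLoad(2, 0x1000);   // instruction 2 reads it: a flow dependence
  p.report();
  return 0;
}
```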
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Brian Homerding
,
Atmn Patel
,
Enrico Armenio Deiana
,
Yian Su
,
Zujun Tan
,
Ziyang Xu
,
Bhargav Reddy Godala
,
David I. August
, and
Simone Campanoni
The Parallel Semantics Program Dependence Graph ArXiv, 2024 A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Moreover, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics. |
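For readers unfamiliar with dependence graphs, a minimal sketch of the kind of structure involved may help: a PDG records the constraints any semantically-equivalent plan must respect, and a parallel-semantics variant must additionally distinguish hard constraints from relaxable ones. The types and fields below are invented for illustration and are not the PS-PDG's actual definition.

```cpp
#include <cstdint>
#include <vector>

// A bare-bones dependence graph gesturing at the paper's point: a
// parallel-semantics representation must record which constraints
// every semantically-equivalent parallel execution plan has to
// respect. Names are illustrative only.
enum class DepKind { Control, DataFlow, DataAnti, DataOutput };

struct DepEdge {
  uint32_t from, to;  // instruction ids
  DepKind kind;
  bool loopCarried;   // does the dependence cross loop iterations?
  bool mustRespect;   // hard constraint on every parallel plan, as
                      // opposed to one a transform may relax
};

struct PSPDGLike {
  uint32_t numInstrs = 0;
  std::vector<DepEdge> edges;
  void addEdge(DepEdge e) { edges.push_back(e); }
  // Two instructions may run in parallel if no hard edge orders them.
  bool mayRunInParallel(uint32_t a, uint32_t b) const {
    for (const auto& e : edges)
      if (e.mustRespect && ((e.from == a && e.to == b) || (e.from == b && e.to == a)))
        return false;
    return true;
  }
};

int main() {
  PSPDGLike g; g.numInstrs = 3;
  g.addEdge({0, 1, DepKind::DataFlow, false, true});   // hard flow dependence
  g.addEdge({1, 2, DepKind::DataAnti, false, false});  // relaxable, e.g. by privatization
  return (g.mayRunInParallel(1, 2) && !g.mayRunInParallel(0, 1)) ? 0 : 1;
}
```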
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
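For concreteness, the target form such a decompiler aims for is ordinary, compiler-agnostic C/C++ with standard OpenMP pragmas, along these lines (a generic kernel, not an example from the paper):

```cpp
#include <vector>

// The kind of portable parallel source a decompiler like SPLENDID
// targets: plain code with a standard OpenMP pragma that any OpenMP
// compiler accepts and a programmer can read and refine.
void scale_rows(double* A, const double* s, int n, int m) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      A[i * m + j] *= s[i];  // each row i is independent: safe to parallelize
}

int main() {
  std::vector<double> A(4 * 3, 1.0);
  std::vector<double> s{2.0, 3.0, 4.0, 5.0};
  scale_rows(A.data(), s.data(), 4, 3);
  return A[0] == 2.0 ? 0 : 1;
}
```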
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation efforts. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, they should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy-to-extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
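The modular-and-collaborative structure described above can be sketched as dependence-query modules whose answers the framework combines, with a speculative "no" carrying a validation obligation. The interfaces below are invented for illustration and are not SCAF's API.

```cpp
#include <memory>
#include <vector>

// Sketch of a collaborative dependence-analysis framework in the
// spirit of SCAF: modules answer dependence queries; a speculative
// "no" must later be validated; the framework keeps the most precise
// (and cheapest) answer any module can give.
enum class Answer { No, NoIfSpeculated, Maybe };

struct Query { int srcInstr, dstInstr; bool loopCarried; };

struct AnalysisModule {
  virtual ~AnalysisModule() = default;
  virtual Answer query(const Query&) const = 0;
};

struct ConservativeAA : AnalysisModule {  // static analysis: no precise answer
  Answer query(const Query&) const override { return Answer::Maybe; }
};

struct ProfileGuidedSpec : AnalysisModule {  // profiling never saw the dependence
  Answer query(const Query& q) const override {
    return q.loopCarried ? Answer::NoIfSpeculated : Answer::Maybe;
  }
};

// Collaboration: a non-speculative No wins outright (nothing to
// validate); otherwise a speculative No beats Maybe.
Answer resolve(const std::vector<std::unique_ptr<AnalysisModule>>& mods, const Query& q) {
  Answer best = Answer::Maybe;
  for (const auto& m : mods) {
    Answer a = m->query(q);
    if (a == Answer::No) return a;
    if (a == Answer::NoIfSpeculated) best = a;
  }
  return best;
}

int main() {
  std::vector<std::unique_ptr<AnalysisModule>> mods;
  mods.push_back(std::make_unique<ConservativeAA>());
  mods.push_back(std::make_unique<ProfileGuidedSpec>());
  Query q{1, 2, true};
  return resolve(mods, q) == Answer::NoIfSpeculated ? 0 : 1;
}
```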
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
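The selection mechanism the abstract alludes to, choosing a ready instruction from the already-decoded pool rather than stalling on the head, can be sketched as follows; the structures are invented and greatly simplified relative to a real GPU issue stage.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Toy issue-stage selection illustrating the kind of lightweight
// reordering GhOST performs: scan the existing pool of decoded
// instructions oldest-first and pick the first one whose source
// registers are not pending, instead of stalling on the head.
struct DecodedInstr {
  uint8_t src1, src2, dst;
  bool isMemory;  // e.g., a long-latency load
};

int selectNext(const std::vector<DecodedInstr>& pool,
               const std::bitset<256>& pendingRegs) {
  for (int i = 0; i < (int)pool.size(); ++i)   // oldest-first scan
    if (!pendingRegs[pool[i].src1] && !pendingRegs[pool[i].src2])
      return i;                                // first ready instruction
  return -1;                                   // everything is stalled
}

int main() {
  std::bitset<256> pending;
  pending.set(10);  // r10 awaits an outstanding load
  std::vector<DecodedInstr> pool = {
    {10, 2, 3, false},  // head: blocked on r10
    {4, 5, 6, false},   // younger but ready -> issued out of order
  };
  return selectNext(pool, pending) == 1 ? 0 : 1;
}
```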
|
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
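A toy rendition of the cost-aware idea, sparing lines whose misses previously caused decode starvations, might look like this; it is a sketch of the principle, not the published policy.

```cpp
#include <cstdint>
#include <vector>

// Toy cost-aware victim selection in the spirit of EMISSARY: lines
// whose misses previously caused costly decode starvations carry a
// priority bit and resist eviction; plain LRU decides among the rest.
struct Line { uint64_t tag = 0; bool valid = false, priority = false; uint32_t lruAge = 0; };

int chooseVictim(const std::vector<Line>& set) {
  for (int i = 0; i < (int)set.size(); ++i)  // free slot first
    if (!set[i].valid) return i;
  int victim = -1; uint32_t oldest = 0;
  for (int i = 0; i < (int)set.size(); ++i)  // oldest non-priority line
    if (!set[i].priority && set[i].lruAge >= oldest) { oldest = set[i].lruAge; victim = i; }
  if (victim >= 0) return victim;
  oldest = 0;                                // all prioritized: fall back to LRU
  for (int i = 0; i < (int)set.size(); ++i)
    if (set[i].lruAge >= oldest) { oldest = set[i].lruAge; victim = i; }
  return victim;
}

int main() {
  std::vector<Line> set(4);
  for (int i = 0; i < 4; ++i) { set[i].valid = true; set[i].lruAge = (uint32_t)i; }
  set[3].priority = true;  // the oldest line once caused a starvation: spare it
  return chooseVictim(set) == 2 ? 0 : 1;
}
```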
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
|
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
|
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different and binary incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability. |
|
Michele Tartara
,
Stefano Crespi Reghizzi
, and
Simone Campanoni
Extending Hammocks for Parallelism Detection Italian Conference on Theoretical Computer Science (ICTCS), 2010 |
|
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
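The method-selection loop the abstract describes, a background compiler repeatedly picking the most promising not-yet-compiled method from a dynamically re-weighted graph, reduces to something like the following sketch; weights and method names are illustrative.

```cpp
#include <queue>
#include <string>
#include <vector>

// Sketch of the Dynamic Look Ahead idea: methods reachable from the
// currently executing code are scored, and a background compiler
// thread always compiles the hottest not-yet-compiled method ahead
// of its first call, masking JIT latency.
struct Candidate {
  std::string method;
  double weight;  // e.g., call probability times estimated execution weight
  bool operator<(const Candidate& o) const { return weight < o.weight; }
};

int main() {
  std::priority_queue<Candidate> lookAhead;
  // As execution proceeds, the runtime re-scores reachable methods.
  lookAhead.push({"Parser::parse", 0.9});
  lookAhead.push({"Log::flush", 0.1});
  lookAhead.push({"Matrix::mul", 0.7});
  while (!lookAhead.empty()) {  // background JIT worker loop
    Candidate next = lookAhead.top(); lookAhead.pop();
    // compile(next.method);    // hypothetical: JIT-compile before first call
    (void)next;
  }
  return 0;
}
```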
|
Simone Campanoni
, and
Stefano Crespi Reghizzi
Traces of Control-Flow Graphs International Conference on Developments in Language Theory (DLT), 2009 Acceptance rate: 45.7% (32/70) |
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode ACM SIGPLAN Notices, 2008 Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit the full potential of their hardware parallelism. At the same time, object code virtualization technologies are achieving growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency-masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores. |
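The pipeline organization of the compiler's internal threads can be pictured as stages connected by blocking queues, so that translation and code generation overlap; a minimal sketch, with invented stage names and payloads:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A two-stage compilation pipeline in the spirit of ILDJIT's model:
// one thread translates bytecode to IR while another generates native
// code, so compilation latency overlaps with other work.
template <typename T>
class Channel {
  std::queue<T> q_; std::mutex m_; std::condition_variable cv_; bool closed_ = false;
 public:
  void push(T v) { { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); } cv_.notify_one(); }
  void close() { { std::lock_guard<std::mutex> l(m_); closed_ = true; } cv_.notify_all(); }
  bool pop(T& out) {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return false;
    out = std::move(q_.front()); q_.pop(); return true;
  }
};

int main() {
  Channel<std::string> toCodegen;
  std::thread frontend([&] {   // stage 1: bytecode -> IR
    for (auto m : {"Main.run", "List.sort"}) toCodegen.push(std::string(m) + ".ir");
    toCodegen.close();
  });
  std::thread backend([&] {    // stage 2: IR -> native code
    std::string ir;
    while (toCodegen.pop(ir)) { /* emit machine code for `ir` */ }
  });
  frontend.join(); backend.join();
  return 0;
}
```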
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler International Conference on Very Large Scale Integration (VLSI-SoC), 2008 |
|
Simone Campanoni
,
Svilen Kanev
,
Kevin Brownell
,
Gu-Yeon Wei
, and
David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing International Workshop on Parallelism in Mobile Platforms (PRISM), 2014 |
|
Gu-Yeon Wei
,
David M. Brooks
,
Simone Campanoni
,
Kevin Brownell
, and
Svilen Kanev
Methods and apparatus for parallel processing US Patents, 2014 |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
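The granularity-control idea behind Heartbeat Scheduling, running sequentially by default and promoting remaining iterations into a parallel task only when a heartbeat interval has elapsed, can be sketched as follows; `spawn` is a stand-in for a real task runtime and is replaced here by a sequential call.

```cpp
#include <chrono>
#include <cstdint>

// Sketch of heartbeat-style granularity control, the idea HBC
// compiles into binaries automatically: task-creation cost is paid
// at most once per heartbeat interval, so overhead stays bounded
// even for fine-grained, irregular loops.
using Clock = std::chrono::steady_clock;
constexpr auto kHeartbeat = std::chrono::microseconds(100);

template <typename Body>
void heartbeat_for(int64_t lo, int64_t hi, Body body) {
  auto last = Clock::now();
  for (int64_t i = lo; i < hi; ++i) {
    body(i);
    if (Clock::now() - last >= kHeartbeat && hi - i > 1) {
      int64_t mid = (i + 1 + hi) / 2;        // split the remaining range
      // spawn(heartbeat_for(mid, hi, body)); // hypothetical task runtime
      heartbeat_for(mid, hi, body);           // sequential stand-in here
      hi = mid;                               // keep the first half locally
      last = Clock::now();
    }
  }
}

int main() {
  int64_t sum = 0;
  heartbeat_for(0, 100000, [&](int64_t i) { sum += i % 7; });
  return sum >= 0 ? 0 : 1;
}
```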
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Vijay Janapa Reddi
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing International Symposium on Code Generation and Optimization (CGO), 2012 Acceptance rate: 28.9% (26/90) We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25 x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations. |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack Design Automation Conference (DAC), 2009 Acceptance rate: 21.7% (148/682) Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance. |
|
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Brian Homerding
,
Atmn Patel
,
Enrico Armenio Deiana
,
Yian Su
,
Zujun Tan
,
Ziyang Xu
,
Bhargav Reddy Godala
,
David I. August
, and
Simone Campanoni
The Parallel Semantics Program Dependence Graph ArXiv, 2024 A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. However, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics. |
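A loose data-structure illustration of the gap the paper addresses (field names invented here; the PS-PDG's actual extensions are defined in the paper): a sequential PDG records control and data edges, while a parallel-semantics representation must additionally know, per dependence, facts such as whether the program's existing synchronization already enforces it.

```cpp
#include <cstdio>
#include <vector>

enum class EdgeKind { Control, DataFlow, Memory };

struct Edge {
  int from, to;
  EdgeKind kind;
  bool loopCarried;     // crosses loop iterations
  bool enforcedBySync;  // already ordered by explicit synchronization
};

struct DependenceGraph {
  int numNodes = 0;
  std::vector<Edge> edges;
  // An edge constrains the space of parallel execution plans only if it is
  // not already enforced by the synchronization the programmer wrote.
  bool constrains(const Edge &e) const { return !e.enforcedBySync; }
};

int main() {
  DependenceGraph g;
  g.numNodes = 2;
  g.edges.push_back({0, 1, EdgeKind::Memory, true, true});
  std::printf("constraining edges: %d\n", (int)g.constrains(g.edges[0]));
}
```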
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but its profitability can be limited because the compiler lacks facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
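For flavor, this is the kind of portable, natural OpenMP source such a decompiler aims to emit, which any C/C++ compiler can rebuild and re-tune (a hand-written example, not SPLENDID's actual output):

```cpp
#include <cstdio>

int main() {
  const int N = 1 << 20;
  static double a[1 << 20];
  double sum = 0.0;
  // Standard, compiler-agnostic OpenMP rather than IR-level constructs.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < N; ++i) {
    a[i] = i * 0.5;
    sum += a[i];
  }
  std::printf("%f\n", sum);
}
```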
|
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
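A toy rendering of what a PSEC result can feed (categories and mapping invented for illustration; CARMOT's real characterization is richer): once the tool knows how a region accesses each program state element, the access pattern maps to an OpenMP data-sharing recommendation.

```cpp
#include <cstdio>

// How a region accesses a program state element (simplified).
enum class Access { ReadOnly, WriteFirst, ReadThenWrite, ReductionLike };

const char *suggestClause(Access a) {
  switch (a) {
    case Access::ReadOnly:      return "shared";        // safe to share as-is
    case Access::WriteFirst:    return "private";       // fresh per-thread copy
    case Access::ReadThenWrite: return "firstprivate";  // needs copy-in
    case Access::ReductionLike: return "reduction";     // combine at the end
  }
  return "?";
}

int main() {
  std::printf("sum -> %s\n", suggestClause(Access::ReductionLike));
  std::printf("tmp -> %s\n", suggestClause(Access::WriteFirst));
}
```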
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
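In spirit (the API below is invented for illustration; see the NOELLE paper and repository for the real abstractions), the value of such a layer is that a custom tool consumes program-wide abstractions, such as per-loop dependence summaries, instead of recomputing them from raw LLVM IR.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct LoopSummary {
  bool hasLoopCarriedDataDependence;  // from a program-wide dependence graph
  long estimatedTripCount;            // from profiles or static heuristics
};

struct ProgramAbstractions {
  std::vector<LoopSummary> loops;  // would be computed once by the layer
};

int main() {
  // A tiny "custom tool": find profitable DOALL candidates.
  ProgramAbstractions p{{{false, 100000}, {true, 8}, {false, 16}}};
  for (std::size_t i = 0; i < p.loops.size(); ++i)
    if (!p.loops[i].hasLoopCarriedDataDependence &&
        p.loops[i].estimatedTripCount > 1000)
      std::printf("loop %zu: DOALL candidate\n", i);
}
```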
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
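A minimal sketch of what a compiler-based far-memory transformation amounts to (runtime names invented; TrackFM's real runtime manages objects across machines): the compiler rewrites dereferences of potentially-far pointers into guarded accesses that fetch the object on first touch.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using ObjId = std::uint64_t;

std::unordered_map<ObjId, void *> localCache;  // objects already resident

void *fetchFromFarMemory(ObjId id) {
  // Stand-in for a network fetch into local memory.
  void *p = new std::uint64_t(id);
  localCache[id] = p;
  return p;
}

// What the compiler emits in place of a plain dereference of a far pointer.
void *deref(ObjId id) {
  auto it = localCache.find(id);
  return it != localCache.end() ? it->second : fetchFromFarMemory(id);
}

int main() {
  auto *v = static_cast<std::uint64_t *>(deref(42));  // first touch fetches
  std::printf("%llu\n", static_cast<unsigned long long>(*v));
}
```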
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
|
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
|
Brian Suchy
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions. |
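A minimal user-level sketch of the concept (names invented; the paper's prototype works on LLVM bitcode with a co-designed runtime): allocations are registered with the runtime, and the compiler can guard accesses with a tracking check instead of relying on MMU page protection.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Region { std::uintptr_t base; std::size_t size; };
static std::vector<Region> tracked;

void *carat_alloc(std::size_t n) {  // compiler-instrumented allocation
  void *p = std::malloc(n);
  tracked.push_back({reinterpret_cast<std::uintptr_t>(p), n});
  return p;
}

bool carat_check(const void *p) {   // guard the compiler may insert
  auto a = reinterpret_cast<std::uintptr_t>(p);
  for (const auto &r : tracked)
    if (a >= r.base && a < r.base + r.size) return true;
  return false;                     // a real system would involve the kernel
}

int main() {
  int *x = static_cast<int *>(carat_alloc(sizeof(int)));
  *x = 7;
  std::printf("tracked: %d\n", static_cast<int>(carat_check(x)));
  std::free(x);
}
```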
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
|
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different, and binary-incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability. |
|
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
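The core scheduling decision reduces to a priority ordering over methods; a minimal sketch with invented numbers (the paper's criteria combine light-weight program analysis with execution tracing):

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Methods likely to run soon and expensive to compile are compiled ahead
// of execution, ordered by call probability times computational weight.
struct Method {
  const char *name;
  double callProbability;  // from execution tracing
  double weight;           // estimated computational weight
  double priority() const { return callProbability * weight; }
};

struct ByPriority {
  bool operator()(const Method &a, const Method &b) const {
    return a.priority() < b.priority();
  }
};

int main() {
  std::priority_queue<Method, std::vector<Method>, ByPriority> ready;
  ready.push({"Main.loop", 0.9, 50.0});
  ready.push({"Log.debug", 0.1, 5.0});
  ready.push({"Codec.decode", 0.7, 80.0});
  while (!ready.empty()) {  // compiler threads would pop from here
    std::printf("compile ahead: %s\n", ready.top().name);
    ready.pop();
  }
}
```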
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode ACM SIGPLAN Notices, 2008 Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit all the potentialities of their hardware parallelism. At the same time, object code virtualization technologies are achieving a growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency-masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores. |
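A minimal sketch of the pipeline organization (two stages and one queue; ILDJIT's actual pipeline has more stages and adaptive policies): one internal thread translates methods while another optimizes and emits them, so compilation latencies overlap across methods.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> stage1to2;
std::mutex m;
std::condition_variable cv;
bool done = false;

void frontend() {  // bytecode -> IR
  for (auto name : {"A", "B", "C"}) {
    std::lock_guard<std::mutex> g(m);
    stage1to2.push(name);
    cv.notify_one();
  }
  { std::lock_guard<std::mutex> g(m); done = true; }
  cv.notify_one();
}

void backend() {   // IR -> native code, overlapped with the frontend
  for (;;) {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [] { return !stage1to2.empty() || done; });
    if (stage1to2.empty()) return;
    std::string name = stage1to2.front();
    stage1to2.pop();
    g.unlock();
    std::printf("emitted %s\n", name.c_str());
  }
}

int main() {
  std::thread t1(frontend), t2(backend);
  t1.join();
  t2.join();
}
```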
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler International Conference on Very Large Scale Integration (VLSI-SoC), 2008 |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream IEEE Computer Society Digital Library (IEEE Micro), 2012 Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
The HELIX Project: Overview and Directions Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism. |
Vito Kortbeek
,
Souradip Ghosh
,
Josiah Hester
,
Simone Campanoni
, and
Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing International Conference on Programming Language Design and Implementation (PLDI), 2022 Acceptance rate: 20.9% (68/326) Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks. |
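A minimal sketch of the underlying problem and the transformation's effect (checkpointing reduced here to copying two variables; real systems persist state to non-volatile memory): re-executing a read-modify-write after a power failure can double-apply it, so a checkpoint must sit between the read and the write that completes the WAR, and clustering the writes lets a single checkpoint cover several would-be violations.

```cpp
#include <cstdio>

static int x = 0, y = 0;  // imagine these live in non-volatile RAM
static int saved_x, saved_y;

void checkpoint() { saved_x = x; saved_y = y; }  // persist a recovery point

int main() {
  checkpoint();  // one checkpoint before a cluster of WAR-completing writes
  int tx = x + 1;  // reads...
  int ty = y + 2;
  x = tx;          // ...then the writes, grouped so they share the
  y = ty;          // checkpoint above instead of needing one each
  std::printf("%d %d\n", x, y);
}
```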
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Souradip Ghosh
,
Michael Cuevas
,
Simone Campanoni
, and
Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism High Performance Computing, Networking, Storage and Analysis (SC), 2020 Acceptance rate: 25.1% (95/378) In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model. |
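A minimal sketch of the mechanism (constants invented): the compiler injects a decrement-and-test at points whose interval it can bound, such as loop back-edges, and the injected code calls what would have been the timer interrupt handler.

```cpp
#include <cstdio>

static long budget = 1000;  // work allowed per "tick"

void timer_callback() {     // what an interrupt handler would have done
  budget = 1000;            // e.g., a scheduler could preempt here
}

int main() {
  long sum = 0;
  for (long i = 0; i < 100000; ++i) {
    sum += i;               // original loop body
    if (--budget == 0)      // compiler-injected timing check
      timer_callback();
  }
  std::printf("%ld\n", sum);
}
```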
Tommy McMichen
,
Nathan Greiner
,
Peter Zhong
,
Federico Sossai
,
Atmn Patel
, and
Simone Campanoni
Representing Data Collections in an SSA Form International Symposium on Code Generation and Optimization (CGO), 2024 Acceptance rate: 32.5% (37/114) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program (e.g., through alias analysis), which further hampers heavyweight memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. |
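A loose illustration of the representational shift (the instruction names below are invented, not MEMOIR's actual mnemonics): instead of raw loads and stores into a vector's buffer, each mutation defines a new SSA value of the whole collection, which is what makes element-level analyses such as dead element elimination possible.

```cpp
#include <cstdio>
#include <vector>

// What a collection-aware SSA form could look like for the code below:
//
//   %v1 = collection.alloc i32, 0
//   %v2 = collection.insert %v1, 0, 10   ; defines a new collection value
//   %x  = collection.read   %v2, 0       ; element-level analysis possible
//
// If %x were never used, a compiler could delete the insert entirely.
int main() {
  std::vector<int> v;  // what the source wrote; the IR sketch above is
  v.push_back(10);     // what a collection-aware compiler reasons about
  std::printf("%d\n", v[0]);
}
```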
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis. |
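A minimal sketch of handle-based indirection (names invented; Alaska's compiler inserts and optimizes the equivalent of h_deref automatically): every access goes through a handle table, so the runtime can move an object by updating a single table entry, with no pointer fixups in application data.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

static void *table[1024];  // the handle table: one slot per live object
static int nextFree = 0;

int h_alloc(std::size_t n) { table[nextFree] = std::malloc(n); return nextFree++; }
void *h_deref(int h) { return table[h]; }  // compiler-inserted lookup

void h_move(int h, std::size_t n) {        // relocation the runtime may do
  void *to = std::malloc(n);
  std::memcpy(to, table[h], n);
  std::free(table[h]);
  table[h] = to;                           // one table update, no fixups
}

int main() {
  int h = h_alloc(sizeof(int));
  *(int *)h_deref(h) = 7;
  h_move(h, sizeof(int));                  // object moved transparently
  std::printf("%d\n", *(int *)h_deref(h)); // still 7
}
```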
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory ArXiv, 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis. |
|
Simone Campanoni
, and
William Fornaciari
Multi-level Design and Optimization of Wireless Sensor Networks International Conference on Networked Sensing Systems (INSS), 2008 This paper proposes a methodology for the off-line planning of WSNs (wireless sensor networks) by addressing the problem in a multi-layer manner. At the sensor level, a model is described to properly select and distribute the sensors in the environment. To optimize the cost and deployment of realistic WSNs, further design activities are proposed at an intermediate level, targeting board-level clustering of sensors. Finally, a methodology is presented to hierarchically organize the set of sensors in patches with an additional gateway-level communication layer, to take into account also possible scaling of the application complexity. Particular emphasis is put on the cost modeling and on ensuring the correct behavior of the WSN against variation of parameters like sensor position, protocol, cost, etc. |
|
Simone Campanoni
, and
William Fornaciari
Node-Level Optimization of Wireless Sensor Networks International Conference on Wireless Communications, Networking and Mobile Computing (WiCom), 2008 A methodology for WSN planning should produce a suitable placement of sensors to simplify the acquisition of the data relevant for the application. The goal of this paper is to present a strategy to formally specify the system-level characteristics of the events to be monitored and to identify a proper set of sensor-position pairs, tailored to provide the required sensing capabilities. In particular, the focus is on node-level optimization, where the designer has to face the problem of clustering sensors onto the same board based upon a cost-performance tradeoff. |
|
Simone Campanoni
, and
William Fornaciari
Models and Tradeoffs in WSN System-Level Design Euromicro Symposium on Digital System Design (DSD), 2008 System-level design of WSNs includes the selection of the sensing nodes and their dissemination in the environment to be monitored. Many design choices have to be taken during this stage of the development of the application. The goal of this paper is to present a methodology to specify formally the desired behavior of the sensing application and to derive an optimal selection and placement of the network nodes. The approach is flexible and powerful, since it allows the designer to analyze the impact of clustering sensors onto a reduced set of boards (nodes) and to perform sensitivity analysis on parameters such as type of sensor, position of the nodes, observation time, presence of faults, etc. The paper introduces the concepts with some representative examples as well as by considering bigger use cases extracted from international projects. |
|
Simone Campanoni
, and
William Fornaciari
Ensuring Feasibility of Wireless Sensor Networks International Conference on Circuits and Systems for Communications (ICCSC), 2008 A comprehensive WSN (wireless sensor network) design methodology should devote enough effort to selecting a suitable set of sensors with a proper spatial distribution, in order to ensure correct monitoring of the parameters relevant for the application. The goal of this paper is to show how to ensure a priori that a given WSN is actually capable of satisfying the application requirements. In addition to this, the flexibility of the approach is also demonstrated by evaluating the sensitivity of the WSN performance to parameters like sensor distribution, observation time, and the events to be analyzed. This paper presents the overall design methodology implemented in SWORDFISH (sensor networks development framework integrating simulation and hardware optimization) and some representative case studies. |
|
Simone Campanoni
, and
William Fornaciari
SWORDFISH: A framework to formally design WSNs capturing events International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2007 A Wireless Sensor Network (WSN) consists of spatially distributed autonomous devices equipped with sensors and radio communication capabilities. They can act in a cooperative manner for monitoring environmental parameters like pressure, temperature, acceleration, light, humidity, etc. The typical WSN design problem is to discover a proper set of sensors and their spatial distribution, so as to enable the monitoring of relevant parameters for the application, while concurrently optimizing some design goals (e.g., cost, reliability, lifetime, and energy requirements). In this paper we present part of the SWORDFISH (Sensor netWORks Development Framework Integrating Simulation and Hardware optimization) project, aiming at providing a design and verification environment for WSNs, including in the design loop the simulation of physical events and a formal verification of the WSN's desired properties against the network optimization goals. In particular, we present the overall design framework, the approach to modeling properties and goals of the network, and some representative case studies showing the value of a toolsuite gathering WSN formal optimization/planning and environment simulation. |
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies designed specifically for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY, applied to instructions in the L2 cache, delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
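The flavor of the policy can be sketched in a few lines. The victim-selection loop below is a toy illustration, not the published EMISSARY algorithm: a line is marked costly when its miss starved the decoder, and eviction prefers the least-recently-used line among the non-costly ones, falling back to plain LRU only when every line is costly.

```cpp
// A toy cost-aware victim-selection loop (illustrative only; not the
// EMISSARY policy itself): lines whose misses caused decode starvation
// are marked "costly" and resist eviction.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Line {
  std::uint64_t tag = 0;
  bool valid = false;
  bool costly = false;       // a miss for this line starved the decoder
  std::uint64_t lastUse = 0; // for LRU ordering
};

// Pick a victim within one set: prefer the LRU line among non-costly
// lines; only if every line is costly, fall back to plain LRU.
int pickVictim(const std::vector<Line>& set) {
  int victim = -1, fallback = -1;
  for (int i = 0; i < (int)set.size(); i++) {
    if (!set[i].valid) return i; // free slot
    if (fallback < 0 || set[i].lastUse < set[fallback].lastUse)
      fallback = i;
    if (!set[i].costly &&
        (victim < 0 || set[i].lastUse < set[victim].lastUse))
      victim = i;
  }
  return victim >= 0 ? victim : fallback;
}

int main() {
  std::vector<Line> set(4);
  for (int i = 0; i < 4; i++) {
    set[i].tag = 100 + i;
    set[i].valid = true;
    set[i].costly = (i < 2); // ways 0 and 1 caused decode starvations
    set[i].lastUse = i;      // way 0 is globally least recently used
  }
  std::printf("victim way = %d\n", pickVictim(set)); // prints 2, not 0
}
```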
|
Simone Campanoni
,
Kevin Brownell
,
Svilen Kanev
,
Timothy M. Jones
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs International Symposium on Computer Architecture (ISCA), 2014 Acceptance rate: 17.8% (46/258) Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks. IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014. Communication ACM Research Highlights (CACM), 2017 |
|
Simone Campanoni
,
Svilen Kanev
,
Kevin Brownell
,
Gu-Yeon Wei
, and
David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing International Workshop on Parallelism in Mobile Platforms (PRISM), 2014 |
|
Gu-Yeon Wei
,
David M. Brooks
,
Simone Campanoni
,
Kevin Brownell
, and
Svilen Kanev
Methods and apparatus for parallel processing US Patents, 2014 |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit from extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
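A sketch of the compiler-based idea follows, with hypothetical helper names (deref, farTier, localCache), not TrackFM's actual runtime interface: the compiler inserts a guard before each access through a possibly-far pointer, and the guard fetches the object from the far tier on demand.

```cpp
// Minimal sketch of a compiler-inserted far-memory guard (hypothetical
// helpers; not TrackFM's real interface). The "far tier" stands in for
// another machine's memory reached over the network.
#include <cstdio>
#include <cstring>
#include <unordered_map>
#include <vector>

std::unordered_map<int, std::vector<char>> farTier;    // remote objects
std::unordered_map<int, std::vector<char>> localCache; // resident copies

// The guard the compiler would insert before each possibly-far access.
template <typename T>
T* deref(int objId) {
  auto it = localCache.find(objId);
  if (it == localCache.end()) {
    // Miss: fetch the object from far memory (a network RPC in reality).
    it = localCache.emplace(objId, farTier.at(objId)).first;
  }
  return reinterpret_cast<T*>(it->second.data());
}

int main() {
  int x = 42;
  farTier[7] = std::vector<char>(sizeof(int));
  std::memcpy(farTier[7].data(), &x, sizeof(int));
  std::printf("%d\n", *deref<int>(7)); // first access faults the object in
  std::printf("%d\n", *deref<int>(7)); // second access hits locally
}
```

The design point the paper argues for is that static analysis can elide or hoist many such guards, which a kernel-based page-fault scheme cannot do and a library-based scheme pushes onto the programmer.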
|
Kyle C. Hale
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021 The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work. |
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
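The heartbeat idea can be approximated in portable C++ (TPAL itself works at the assembly level with kernel and hardware interrupt support; the timer thread and std::async below are illustrative stand-ins, not the TPAL primitives): parallelism stays latent in a sequential loop, and only when a beat arrives does the loop pay to split its remaining iterations into a task.

```cpp
// Illustrative heartbeat-style promotion (TPAL uses assembly-level
// primitives and real interrupts; this portable approximation polls a
// flag set by a timer thread).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

std::atomic<bool> heartbeat{false};

long sumRange(const int* a, long lo, long hi) {
  long acc = 0;
  for (long i = lo; i < hi; i++) {
    // Parallelism stays latent; only on a beat do we pay the cost of
    // manifesting it, by forking off the remaining iterations.
    if (heartbeat.exchange(false) && hi - i > 1024) {
      long mid = i + (hi - i) / 2;
      auto rest = std::async(std::launch::async, sumRange, a, mid, hi);
      return acc + sumRange(a, i, mid) + rest.get();
    }
    acc += a[i];
  }
  return acc;
}

int main() {
  std::vector<int> v(1 << 22, 1);
  std::atomic<bool> done{false};
  std::thread timer([&] { // stand-in for a periodic heartbeat interrupt
    while (!done.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
      heartbeat.store(true);
    }
  });
  std::printf("%ld\n", sumRange(v.data(), 0, (long)v.size())); // 4194304
  done.store(true);
  timer.join();
}
```

The amortization argument is visible here: task-creation cost is paid at most once per beat, regardless of how many iterations run between beats.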
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1×, respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
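The division of labor the abstract describes can be sketched as follows (hypothetical interface, not PROMPT's real API): the framework owns instrumentation and event delivery, and a profiler author contributes only core logic as event handlers. Here a toy client records read-after-write dependences from a replayed trace.

```cpp
// Hypothetical sketch of a PROMPT-style split (not PROMPT's real API):
// the framework delivers memory events; the profiler supplies only its
// core logic.
#include <cstdint>
#include <cstdio>
#include <set>
#include <unordered_map>
#include <utility>

struct RawDepProfiler {
  std::unordered_map<std::uint64_t, int> lastStore; // shadow: addr -> store site
  std::set<std::pair<int, int>> deps;               // (store site, load site)
  void onStore(std::uint64_t addr, int site) { lastStore[addr] = site; }
  void onLoad(std::uint64_t addr, int site) {
    auto it = lastStore.find(addr);
    if (it != lastStore.end()) deps.insert({it->second, site});
  }
};

int main() {
  // The framework's job (replayed here from a tiny hard-coded trace
  // instead of compiler-inserted instrumentation): feed events to handlers.
  struct Ev { bool isStore; std::uint64_t addr; int site; };
  const Ev trace[] = {{true, 0x10, 1}, {false, 0x10, 2}, {false, 0x18, 3}};
  RawDepProfiler p;
  for (const Ev& e : trace)
    e.isStore ? p.onStore(e.addr, e.site) : p.onLoad(e.addr, e.site);
  for (const auto& d : p.deps)
    std::printf("RAW: store@%d -> load@%d\n", d.first, d.second);
}
```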
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited because the compiler lacks facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
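The output shape is the point here: by emitting standard OpenMP rather than tool-specific annotations, the decompiled code stays portable across compilers and remains human-editable. The snippet below only illustrates that output shape; it is not SPLENDID's verbatim output for any particular input.

```cpp
// Illustration of the kind of portable parallel source a decompiler-based
// collaboration hands back to the programmer: plain OpenMP that any
// OpenMP-capable compiler accepts and any programmer can refine.
#include <cstdio>

int main() {
  const int n = 1 << 20;
  static double a[1 << 20];
  double sum = 0.0;
  for (int i = 0; i < n; i++) a[i] = 1.0;
  // The parallelizing compiler proved this loop to be a reduction;
  // emitting it as a standard pragma keeps that fact visible and editable.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < n; i++) sum += a[i];
  std::printf("%.0f\n", sum); // 1048576
}
```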
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
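The kind of answer a speculation-aware query returns can be sketched as follows (hypothetical types, not SCAF's interface): rather than a bare may-depend bit, the analysis can report that a dependence disappears provided a profiled assertion is validated at run time, letting clients weigh parallelization benefit against validation cost.

```cpp
// Sketch of a speculation-aware dependence query (hypothetical types;
// not SCAF's interface): the answer can carry the speculative assertion
// under which the dependence vanishes.
#include <cstdio>
#include <optional>
#include <string>

struct Assertion {
  std::string check;   // what the runtime must validate
  double profiledRate; // observed misspeculation rate
};

struct DepAnswer {
  bool mayDepend;                  // conservative static answer
  std::optional<Assertion> unless; // speculation that removes the dep
};

DepAnswer queryDep(int storeId, int loadId) {
  // Static analysis alone must answer "may depend"; the profile says the
  // two accesses never aliased, so speculation can remove the dependence.
  (void)storeId; (void)loadId;
  return {true, Assertion{"no-alias(store#1, load#2)", 0.0}};
}

int main() {
  DepAnswer a = queryDep(1, 2);
  if (a.mayDepend && a.unless)
    std::printf("parallelize if runtime validates: %s (misspec %.1f%%)\n",
                a.unless->check.c_str(), a.unless->profiledRate * 100);
}
```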
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations. |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack Design Automation Conference (DAC), 2009 Acceptance rate: 21.7% (148/682) Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance. |
|
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
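At user level, the essence of the idea can be sketched with two hypothetical helpers (tracked_alloc and guard; not the CARAT CAKE implementation): the compiler rewrites allocations so the runtime knows every region, and inserts guards so protection checks happen in software rather than in a TLB, leaving addresses physical.

```cpp
// Sketch of compiler/runtime address tracking (hypothetical helpers;
// not the CARAT CAKE implementation): allocations are registered and
// accesses are policed in software instead of via page tables.
#include <cstdio>
#include <cstdlib>
#include <map>

// Runtime-side tracking table: region start -> extent in bytes.
std::map<char*, std::size_t> regions;

void* tracked_alloc(std::size_t n) { // compiler rewrites malloc to this
  void* p = std::malloc(n);
  regions[(char*)p] = n;
  return p;
}

bool guard(void* p) { // compiler inserts before (some) memory accesses
  auto it = regions.upper_bound((char*)p);
  if (it == regions.begin()) return false;
  --it; // the region starting at or below p
  return (char*)p < it->first + it->second; // inside a tracked region?
}

int main() {
  int* a = (int*)tracked_alloc(4 * sizeof(int));
  a[0] = 1;
  std::printf("in-bounds access ok: %d\n", (int)guard(&a[0]));    // 1
  std::printf("stray access ok: %d\n", (int)guard((void*)0x10));  // 0
}
```

As in the paper's argument, most of the engineering lies in proving guards redundant so they can be removed, and in making the transformation cover all code, including the kernel's own.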
|
Souradip Ghosh
,
Michael Cuevas
,
Simone Campanoni
, and
Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism High Performance Computing, Networking, Storage and Analysis (SC), 2020 Acceptance rate: 25.1% (95/378) In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model. |
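The effect of the transformation can be sketched as follows (illustrative; in the real system the compiler injects and sizes these checks so that every control-flow path polls at a predictable, periodic rate): work is metered by a counter, and a callback fires when the budget is spent, replacing the hardware timer interrupt.

```cpp
// Sketch of compiler-based timing (illustrative; the real system injects
// these callbacks via whole-program compiler transformation): roughly
// every GRANULARITY units of estimated work, control calls back into the
// scheduler instead of waiting for a timer interrupt.
#include <cstdio>

constexpr long GRANULARITY = 100000; // estimated work units per callback
long workBudget = GRANULARITY;
long callbacks = 0;

void timerCallback() { callbacks++; /* a scheduler would run here */ }

// The compiler inserts decrements sized to each region's estimated cost,
// placed so every path through the code checks at a predictable rate.
inline void consume(long estimatedCost) {
  workBudget -= estimatedCost;
  if (workBudget <= 0) { workBudget = GRANULARITY; timerCallback(); }
}

int main() {
  long sum = 0;
  for (long i = 0; i < 10'000'000; i++) {
    sum += i;
    consume(1); // injected by the compiler, not written by hand
  }
  std::printf("sum=%ld callbacks=%ld\n", sum, callbacks); // 100 callbacks
}
```

This is what enables granularity far below a hardware timer quantum: the poll is a subtract-and-branch, and its period is a compile-time knob.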
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
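For reference, this is the input shape HBC targets: an ordinary nested OpenMP program over an irregular workload (an illustrative example, not taken from the paper). With a conventional runtime, the cost of spawning tasks for the inner loop on short rows is exactly the granularity problem the compiler is meant to control automatically.

```cpp
// Illustrative irregular, nested fork-join input (not from the paper):
// row i has i elements, so any fixed manual chunking fits some rows badly.
#include <cstdio>
#include <vector>

int main() {
  const int n = 2000;
  std::vector<std::vector<int>> rows(n);
  for (int i = 0; i < n; i++) rows[i].assign(i, 1);
  long total = 0;
  #pragma omp parallel for reduction(+ : total) schedule(dynamic)
  for (int i = 0; i < n; i++) {
    long s = 0;
    // Nested logical parallelism: cheap on short rows only if the
    // runtime or compiler controls task-creation cost.
    #pragma omp parallel for reduction(+ : s)
    for (int j = 0; j < (int)rows[i].size(); j++) s += rows[i][j];
    total += s;
  }
  std::printf("%ld\n", total); // n*(n-1)/2 = 1999000
}
```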
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis. |
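The mechanism is a classic level of indirection, sketched below with hypothetical names (halloc, pin, relocate; not the Alaska API): programs hold stable handles, the runtime holds the only raw addresses, so objects can move and only one table entry needs patching.

```cpp
// Sketch of handle-based indirection (hypothetical names; not the Alaska
// API): a handle is an index into a table owned by the runtime, so the
// runtime may move the object and update the table behind the program.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

std::vector<void*> table; // handle -> current object address

int halloc(std::size_t n) { // returns a handle, not a pointer
  table.push_back(std::malloc(n));
  return (int)table.size() - 1;
}

// The compiler rewrites each pointer dereference into a table lookup.
template <typename T> T* pin(int h) { return (T*)table[h]; }

// The runtime can relocate any object (e.g., to defragment the heap)
// because no application code caches the raw address across the move.
void relocate(int h, std::size_t n) {
  void* fresh = std::malloc(n);
  std::memcpy(fresh, table[h], n);
  std::free(table[h]);
  table[h] = fresh;
}

int main() {
  int h = halloc(sizeof(int));
  *pin<int>(h) = 41;
  relocate(h, sizeof(int)); // the object moves; the handle stays valid
  *pin<int>(h) += 1;
  std::printf("%d\n", *pin<int>(h)); // 42
}
```

The optimization burden then shifts to keeping objects "pinned" across as few instructions as possible while hoisting redundant lookups, which is what makes a compiler the natural place to introduce the indirection.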
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory ArXiv, 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis. |
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
Ettore M. G. Trainiti,
Simone Campanoni
, and
Doug Downey
Compiler-based neuron-aware deep neural network ensemble training US Patents, 2022 |
|
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
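For orientation, below is a plain LLVM module pass (new pass manager) with program-wide scope; a NOELLE-based tool would start from a skeleton like this and then query NOELLE's higher-level abstractions (loops, dependences) instead of re-deriving them. The sketch uses only standard LLVM APIs; NOELLE's own interface is deliberately not reproduced here.

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// A minimal whole-program analysis pass: walk every function in the module.
struct CountInstsPass : PassInfoMixin<CountInstsPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    unsigned N = 0;
    for (Function &F : M)      // program-wide scope: every function
      for (BasicBlock &BB : F)
        N += BB.size();        // instructions per basic block
    errs() << "module " << M.getName() << " has " << N << " instructions\n";
    return PreservedAnalyses::all();  // analysis only, nothing changed
  }
};
```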
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo,
Simone Campanoni
, Jose Ayala, and
Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015 Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used. |
|
Filippo Sironi
,
Davide B. Bartolini
,
Simone Campanoni
,
Fabio Cancare
,
Henry Hoffmann
,
Donatella Sciuto
, and
Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite. |
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
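The kind of question PSEC answers, sketched in C++/OpenMP: which program state elements a region reads, writes, or reduces, which in turn determines the data-sharing clauses. The clauses below are hand-annotated to illustrate what a CARMOT-style recommendation would target; the loop itself is an invented example.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> in(1 << 20, 1.0), out(in.size());
    double scale = 2.0, sum = 0.0;

    // PSEC-style characterization of this region:
    //   in    -> read-only            (safe to share)
    //   out   -> disjoint writes      (safe to share)
    //   scale -> read-only            (firstprivate is one valid choice)
    //   sum   -> read-modify-write    (needs a reduction)
    #pragma omp parallel for firstprivate(scale) reduction(+ : sum)
    for (size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * scale;
        sum += out[i];
    }
    std::printf("%f\n", sum);
}
```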
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), that holds by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
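A sketch of the WARD property in plain C++ threads (an HLPL runtime would create these tasks itself): during the fork-join phase, tasks write disjoint locations and no task reads another task's writes before the join, so coherence traffic during the phase is unnecessary. WAW apathy would additionally allow overlapping writes whose order does not matter; this example shows only the RAW dependence-freedom half.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<int> a(1000);
    auto fill = [&](size_t lo, size_t hi) {
        for (size_t i = lo; i < hi; ++i) a[i] = (int)i;  // disjoint writes
    };
    std::thread t1(fill, 0, 500), t2(fill, 500, 1000);   // fork
    t1.join();
    t2.join();                                           // join
    // Reads happen only after the join: the absence of cross-task RAW
    // dependences during the parallel phase is what lets WARDen disable
    // coherence for that phase.
    std::printf("%d %d\n", a[0], a[999]);
}
```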
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different, and binary-incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims to exploit parallelism even when dealing with non-explicitly parallel programs, and discusses the advantages obtained by introducing portability. |
|
Michele Tartara
,
Stefano Crespi Reghizzi
, and
Simone Campanoni
Extending Hammocks for Parallelism Detection Italian Conference on Theoretical Computer Science (ICTCS), 2010 |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
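A user-level sketch of the heartbeat idea, assuming a software pacer thread in place of TPAL's interrupt-based, assembly-level primitives: work runs sequentially, and only when a beat has fired does the loop promote its remaining iterations into a parallel task, amortizing task-creation cost over the work between beats. All names and the beat period are illustrative.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> beat{false};

long long sum_range(long long lo, long long hi) {
    long long acc = 0;
    for (long long i = lo; i < hi; ++i) {
        if (beat.exchange(false) && hi - i > 1024) {  // promotion point
            long long mid = i + (hi - i) / 2;
            long long right = 0;
            std::thread t([&] { right = sum_range(mid, hi); });  // spawn half
            acc += sum_range(i, mid);
            t.join();
            return acc + right;
        }
        acc += i;  // fast sequential path: no task-creation overhead
    }
    return acc;
}

int main() {
    std::thread pacer([] {  // emits a "beat" every 100 microseconds
        for (int k = 0; k < 1000; ++k) {
            std::this_thread::sleep_for(std::chrono::microseconds(100));
            beat.store(true);
        }
    });
    std::printf("%lld\n", sum_range(0, 50'000'000));
    pacer.join();
}
```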
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
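A toy sketch of the core EMISSARY intuition; the set structure, insertion heuristics, and probabilistic details of the real policies are simplified away. Lines whose misses caused a decode starvation are marked costly and protected, and eviction prefers unmarked victims.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

struct Line { uint64_t tag = 0; bool valid = false; bool costly = false; };

struct Set {
    std::array<Line, 8> ways{};

    // starved = this miss stalled the decode stage (observed by the core).
    void insert(uint64_t tag, bool starved) {
        for (Line &w : ways)                 // 1) fill an invalid way first
            if (!w.valid) { w = {tag, true, starved}; return; }
        for (Line &w : ways)                 // 2) prefer an unmarked victim,
            if (!w.costly) { w = {tag, true, starved}; return; }
        ways[0] = {tag, true, starved};      // 3) all ways costly: fall back
    }
    bool lookup(uint64_t tag) const {
        for (const Line &w : ways)
            if (w.valid && w.tag == tag) return true;
        return false;
    }
};

int main() {
    Set s;
    s.insert(0x1000, true);                  // miss caused decode starvation
    for (uint64_t t = 1; t <= 16; ++t)       // a storm of cheap misses
        s.insert(t, false);
    std::printf("costly line retained: %d\n", s.lookup(0x1000));  // prints 1
}
```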
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
Niall Murphy,
Timothy M. Jones
,
Robert Mullins
, and
Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops International Conference on Compiler Construction (CC), 2016 Acceptance rate: 31.2% (24/77) Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism. |
|
Niall Murphy,
Timothy M. Jones
,
Simone Campanoni
, and
Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization International Workshop on Compilers for Parallel Computing (CPC), 2015 |
|
Peter Zhong
,
Shu-Hung You
,
Simone Campanoni
,
Robert Bruce Findler
, and
Christos Dimoulas
A Calculus for Unreachable Code ArXiv, 2024 In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively. |
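In C and C++, the annotations the calculus models appear as the GCC/Clang builtin __builtin_unreachable() (or std::unreachable() in C++23). A small example of the contract:

```cpp
#include <cstdio>

int classify(int x) {
    // Caller promises x is in 0..2; the compiler may delete the range check
    // entirely and optimize the switch under that assumption. If the hint
    // is wrong, behavior is undefined.
    if (x < 0 || x > 2) __builtin_unreachable();
    switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
    }
    __builtin_unreachable();  // all cases covered: no fall-through path
}

int main() { std::printf("%d\n", classify(1)); }
```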
Yuanbo Fan
,
Simone Campanoni
, and
Russ Joseph
Time Squeezing for Tiny Devices International Symposium on Computer Architecture (ISCA), 2019 Acceptance rate: 17.0% (62/365) Dynamic timing slack has emerged as a compelling opportunity for eliminating inefficiency in ultra-low power embedded systems. This slack arises when all the signals have propagated through logic paths well in advance of the clock signal. When it is properly identified, the system can exploit this unused cycle time for energy savings. In this paper, we describe a compiler and architecture co-design that opens new opportunities for timing slack that are otherwise impossible. Through cross-layer optimization, we introduce novel mechanisms in the hardware and in the compiler that work together to improve the benefit of circuit-level timing speculation by effectively squeezing time during execution. This approach is particularly well-suited to tiny embedded devices. Our evaluation on a gate-level model of a complete processor shows that our co-design saves (on average) 40.5% of the original energy consumption (an additional 16.5% compared to the existing clock scheduling technique) across 13 workloads while retaining transparency to developers. |
|
Yuanbo Fan
,
Tianyu Jia
,
Jie Gu
,
Simone Campanoni
, and
Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors Design Automation Conference (DAC), 2018 Acceptance rate: 24.3% (168/691) Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize the benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select the clock period at the cycle level. |
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018 Acceptance rate: 17.6% (56/319) The demand for thread-level parallelism (TLP) on commodity processors is endless, as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
The Liberation Day of Nondeterministic Programs International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017 The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%. |
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
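A sketch of the compiler/runtime cooperation a CARAT-style system relies on: the compiler instruments allocations and pointer escapes with runtime calls so the kernel can track, protect, and move memory without page tables. The carat_track_* hooks below are hypothetical stand-ins, not the actual CARAT runtime API.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical runtime hooks a CARAT-style compiler would insert.
extern "C" void carat_track_alloc(void *p, size_t n) {
    std::printf("track alloc %p (%zu bytes)\n", p, n);
}
extern "C" void carat_track_escape(void **loc) {
    std::printf("track pointer stored at %p\n", (void *)loc);
}

int main() {
    int *buf = (int *)std::malloc(64 * sizeof(int));
    carat_track_alloc(buf, 64 * sizeof(int));  // inserted after allocation
    static int *global_ref;
    global_ref = buf;
    carat_track_escape((void **)&global_ref);  // inserted where the pointer escapes
    std::free(buf);
    return 0;
}
```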
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework is interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
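The compiler-based far-memory idea in miniature: every dereference of a "far pointer" is rewritten into a residency check, fetching the object into local memory on first use. FarPtr, runtime_fetch, and deref are hypothetical names, not TrackFM's actual runtime; a real system batches and optimizes these guards rather than emitting the naive check shown here.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct FarPtr {
    void  *local;      // cached local copy; nullptr if not resident
    long   remote_id;  // handle naming the object in the far tier
    size_t size;
};

// Hypothetical runtime fetch: stands in for pulling bytes over the network.
static void *runtime_fetch(long id, size_t n) {
    void *p = std::malloc(n);
    std::memset(p, (int)id, n);  // pretend these bytes came from far memory
    return p;
}

// The guard a TrackFM-style compiler would inline before each dereference.
static void *deref(FarPtr &fp) {
    if (!fp.local)                                        // residency check
        fp.local = runtime_fetch(fp.remote_id, fp.size);  // slow path: fetch
    return fp.local;                                      // fast path
}

int main() {
    FarPtr fp{nullptr, 42, 16};
    unsigned char *bytes = (unsigned char *)deref(fp);
    std::printf("%u\n", bytes[0]);  // prints 42, the simulated remote data
    std::free(fp.local);
}
```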
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
Ling Jin,
Yinzhi Cao
,
Yan Chen
, Di Zhang, and
Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities IEEE Transactions on Dependable and Secure Computing (TDSC), 2023 Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called EXGEN, that automatically generates exploits for smart contract vulnerabilities across contract platforms. |
Georgios Tziantzioulis
,
Nikos Hardavellas
, and
Simone Campanoni
Temporal Approximate Function Memoization IEEE Micro's special issue on Approximate Computing, 2018 Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality on their output. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (2-4x speedup) at the cost of a relatively small degradation of the application's output quality. |
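A minimal sketch of the output-based idea, assuming a hypothetical expensive() function and a fixed recomputation interval (the paper's compiler decides when to recompute; this only illustrates the tradeoff):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical expensive function whose output changes slowly across calls.
double expensive(double x) { return std::exp(std::sin(x)); }

// Output-based memoization: recompute only every Nth call and otherwise
// return the last output, trading output quality for time and energy.
double expensive_memo(double x) {
    static double last_output = 0.0;
    static int calls = 0;
    if (calls++ % 4 == 0)        // recompute every 4th call (illustrative N)
        last_output = expensive(x);
    return last_output;          // reuse: temporal value locality on outputs
}

int main() {
    for (double x = 0.0; x < 1.0; x += 0.1)
        std::printf("f(%.1f) ~ %.4f\n", x, expensive_memo(x));
}
```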
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo,
Simone Campanoni
, Jose Ayala, and
Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015 Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used. |
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
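A software analogy of the policy's core preference, under the assumption (mine, for illustration) that each line carries one priority bit recording whether its miss starved the decoder; this is not EMISSARY's hardware design:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Line {
    uint64_t tag = 0;
    bool valid = false;
    bool priority = false;  // set when this line's miss starved the decoder
    uint64_t last_use = 0;  // LRU timestamp
};

// One cache set. On a miss, prefer to evict invalid lines, then the LRU
// non-priority line, touching priority lines only as a last resort.
struct Set {
    std::vector<Line> ways;
    uint64_t clock = 0;
    explicit Set(int assoc) : ways(assoc) {}

    bool access(uint64_t tag, bool miss_starved_decode) {
        for (auto& l : ways)
            if (l.valid && l.tag == tag) { l.last_use = ++clock; return true; }
        auto victim = std::min_element(ways.begin(), ways.end(),
            [](const Line& a, const Line& b) {
                if (a.valid != b.valid) return !a.valid;          // invalid first
                if (a.priority != b.priority) return !a.priority; // then non-priority
                return a.last_use < b.last_use;                   // then LRU
            });
        *victim = Line{tag, true, miss_starved_decode, ++clock};
        return false;
    }
};

int main() {
    Set s(2);
    s.access(1, true);   // miss; line 1 marked priority (starved decode)
    s.access(2, false);  // miss; line 2 unmarked
    s.access(3, false);  // miss; evicts line 2, resisting eviction of line 1
    std::printf("hit(1)=%d\n", s.access(1, false));  // prints hit(1)=1
}
```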
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Yuanbo Fan
,
Tianyu Jia
,
Jie Gu
,
Simone Campanoni
, and
Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors Design Automation Conference (DAC), 2018 Acceptance rate: 24.3% (168/691) Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize the benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select the clock period at cycle granularity. |
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
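For flavor, here is the kind of portable, natural C/OpenMP source such a decompiler aims to emit for a parallelized loop nest; this is an illustrative example, not actual SPLENDID output:

```cpp
#include <cstdio>

// A human-readable parallel matrix-vector product: standard OpenMP,
// compilable by any OpenMP-capable compiler (build with -fopenmp).
int main() {
    const int N = 512;
    static double A[512][512], x[512], y[512];
    for (int i = 0; i < N; ++i) {
        x[i] = 1.0;
        for (int j = 0; j < N; ++j) A[i][j] = 0.001 * (i + j);
    }
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int j = 0; j < N; ++j) s += A[i][j] * x[j];
        y[i] = s;
    }
    std::printf("y[0]=%f y[N-1]=%f\n", y[0], y[N - 1]);
}
```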
Vito Kortbeek
,
Souradip Ghosh
,
Josiah Hester
,
Simone Campanoni
, and
Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing International Conference on Programming Language Design and Implementation (PLDI), 2022 Acceptance rate: 20.9% (68/326) Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks. |
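The WAR problem and the benefit of clustering writes fit in a few lines. A minimal sketch, assuming non-volatile globals and a hypothetical checkpoint() primitive (real systems insert checkpoints via the compiler):

```cpp
#include <cstdio>

int a = 1, b = 2;        // imagine these live in non-volatile main memory
void checkpoint() {}     // hypothetical: persists registers and volatile state

void step() {
    int t = a + b;       // reads a and b
    // Writing a after reading it is a WAR dependence: if power fails after
    // the write, re-execution from the last checkpoint would read the new a
    // and diverge. A checkpoint must therefore precede the write.
    checkpoint();
    a = t;               // WAR on a
    b = t;               // WAR on b; because the two writes are adjacent,
                         // one checkpoint covers both. WARio's transformations
                         // move writes together to enable exactly this saving.
}

int main() { step(); std::printf("a=%d b=%d\n", a, b); }
```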
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
Khalid Al-Hawaj,
Simone Campanoni
,
Gu-Yeon Wei
, and
David M. Brooks
Unified Cache: A Case for Low Latency Communication International Workshop on Parallelism in Mobile Platforms (PRISM), 2015 |
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage fluctuations, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations. |
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Ling Jin,
Yinzhi Cao
,
Yan Chen
, Di Zhang, and
Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called EXGEN. |
Simone Campanoni
, and Luca Rocchini
Static Memory Management within Bytecode Languages on Multicore Systems Workshop on Computing in Heterogeneous, Autonomous ’N’ Goal-oriented Environments (CHANGE), 2011 Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The pipeline model is appealing for the compilation tasks that dynamic compilers need to perform, but it can introduce deadlocks when static memories are used by the running program. This research proposes a solution that both solves this problem and reduces the number of unnecessary compiler threads used to handle static memories. The proposed solution is a self-aware runtime system that both detects and avoids deadlocks and adapts the number of compilation threads needed to handle static memories to the current workload. |
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
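A toy rendering of the DLA idea, with hypothetical names: a compiler thread repeatedly picks the heaviest not-yet-compiled method from a weight-ordered work list (in the real system the weights evolve as the call graph is discovered at run time):

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

struct Method { std::string name; double weight; };  // predicted hotness
struct ByWeight {
    bool operator()(const Method& a, const Method& b) const {
        return a.weight < b.weight;                   // max-heap on weight
    }
};

int main() {
    std::priority_queue<Method, std::vector<Method>, ByWeight> ready;
    // In a real runtime these weights would evolve as calls are observed.
    ready.push({"Main.run", 10.0});
    ready.push({"Parser.next", 42.0});
    ready.push({"Log.flush", 1.5});
    while (!ready.empty()) {                          // compiler thread's loop
        std::printf("JIT-compiling %s (weight %.1f)\n",
                    ready.top().name.c_str(), ready.top().weight);
        ready.pop();
    }
}
```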
Michael Leonard
, and
Simone Campanoni
Introducing the Pseudorandom Value Generator Selection in the Compilation Toolchain International Symposium on Code Generation and Optimization (CGO), 2020 Acceptance rate: 27.4% (26/95) As interest in randomization has grown within the computing community, the number of pseudorandom value generators (PRVGs) at developers' disposal dramatically increased. Today, developers lack the tools necessary to obtain optimal behavior from their PRVGs. We provide the first deep study into the tradeoffs among the PRVGs in the C++ standard, finding no silver bullet for all programs and architectures. With this in mind, we have built PRV Jeeves, the first fully automatic PRVG selector. We demonstrate that when compiling widely-used, highly optimized programs with PRV Jeeves, we are able to cut execution time by 34% on average. This enhancement comes at no cost to developers. |
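The tradeoff space is easy to reproduce with the engines in the C++ standard's <random> header; a minimal benchmark sketch (timings will vary by program and architecture, which is the paper's point):

```cpp
#include <chrono>
#include <cstdio>
#include <random>

// Time one PRVG engine drawing 10M uniform ints; engines differ in state
// size, speed, and statistical quality.
template <class Engine>
void time_engine(const char* name) {
    Engine eng(42);
    std::uniform_int_distribution<int> dist(0, 99);
    auto t0 = std::chrono::steady_clock::now();
    long long sink = 0;
    for (int i = 0; i < 10'000'000; ++i) sink += dist(eng);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%-16s %.3fs (sink=%lld)\n", name, dt.count(), sink);
}

int main() {
    time_engine<std::minstd_rand>("minstd_rand");  // tiny state, fast
    time_engine<std::mt19937>("mt19937");          // large state, high quality
    time_engine<std::ranlux48>("ranlux48");        // highest quality, slow
}
```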
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
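A small OpenMP example of the problem HBC targets: logically nested parallel loops over irregular data, where manual granularity control (here the if() clause, illustrative only) is what programmers tune by hand today:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int N = 256;
    std::vector<std::vector<int>> rows(N);
    for (int i = 0; i < N; ++i) rows[i].assign(i % 17 + 1, 1);  // irregular sizes

    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < N; ++i) {
        long long s = 0;
        int m = (int)rows[i].size();
        // The inner loop is logically parallel too, but its iterations are
        // tiny; spawning tasks here usually costs more than it saves. The
        // if() threshold is the kind of manual tuning HBC automates.
        #pragma omp parallel for reduction(+ : s) if (m > 4096)
        for (int j = 0; j < m; ++j) s += rows[i][j];
        total += s;
    }
    std::printf("total=%lld\n", total);
}
```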
Peter Zhong
,
Shu-Hung You
,
Simone Campanoni
,
Robert Bruce Findler
, and
Christos Dimoulas
A Calculus for Unreachable Code ArXiv, 2024 In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively. |
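For a concrete instance of such a hint, GCC and Clang expose __builtin_unreachable(); reaching it is undefined behavior, so the compiler optimizes under the assumption that it never executes:

```cpp
#include <cstdio>

int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    if (x == 0) return 0;
    // Hint: every int falls into one of the cases above, so control can
    // never reach here; the compiler may drop fall-through handling.
    __builtin_unreachable();
}

int main() { std::printf("%d\n", sign(-7)); }
```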
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
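A sequential C++ sketch of the heartbeat idea only (no interrupts and no TPAL primitives; the names are mine): work proceeds serially, and only at each beat do we pay the cost of manifesting latent parallelism by splitting off the remaining half of the range, which a real system could hand to another core:

```cpp
#include <cstdio>
#include <deque>
#include <utility>

using Range = std::pair<long, long>;  // [lo, hi)

int main() {
    const int HEARTBEAT = 1000;       // amortize: split at most once per beat
    std::deque<Range> latent;         // latent parallelism: split-off ranges
    latent.push_back({0, 1'000'000});
    long long sum = 0;
    int since_beat = 0;
    while (!latent.empty()) {
        auto [lo, hi] = latent.front(); latent.pop_front();
        for (long i = lo; i < hi; ++i) {
            sum += i;                          // the useful sequential work
            if (++since_beat == HEARTBEAT) {   // the "beat" fires
                since_beat = 0;
                long mid = (i + 1 + hi) / 2;
                if (hi - (i + 1) > 1 && mid < hi) {
                    latent.push_back({mid, hi});  // could go to another core
                    hi = mid;                     // keep the first half here
                }
            }
        }
    }
    std::printf("sum=%lld\n", sum);   // 499999500000 regardless of splits
}
```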
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
Xiaochun Zhang,
Timothy M. Jones
, and
Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming International Symposium on Workload Characterization (IISWC), 2021 Acceptance rate: 39.6% (19/48) Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system. |
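One familiar extension of this kind is the OpenMP reduction clause; a minimal illustration (not from the paper) of how an annotation conveys developer knowledge a compiler cannot infer:

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 0.5);
    double sum = 0.0;
    // A compiler alone must preserve the sequential order of these
    // floating-point additions and so cannot parallelize the loop. The
    // reduction clause encodes the developer's knowledge that reassociating
    // the additions is acceptable, closing the semantic gap.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)v.size(); ++i) sum += v[i];
    std::printf("sum=%f\n", sum);
}
```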
Zhenpeng Lin
,
Zheng Yu
,
Ziyi Guo
,
Simone Campanoni
,
Peter Dinda
, and
Xinyu Xing
CAMP: Compiler and Allocator-based Heap Memory Protection USENIX Security Symposium (USENIX Security), 2024 |