Publications

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David M. Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M. Rush. Guess and Sketch: Language Model Guided Transpilation. International Conference on Learning Representations (ICLR), 2024.

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David M. Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M. Rush. Guess and Sketch: Language Model Guided Transpilation. arXiv, 2023.
Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual rewriting and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
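
To make the division of labor concrete, the following minimal Python sketch shows a guess-then-repair pipeline in the spirit of the abstract above. The LM guess, block aligner, equivalence checker, and solver-based repair are passed in as callables because all four are hypothetical stand-ins for the paper's components, not its actual API.

```python
# Hedged sketch of a Guess & Sketch-style loop; every helper passed in is a
# hypothetical stand-in, not the authors' implementation.
from typing import Callable, Optional

def transpile(src_asm: str,
              guess: Callable[[str], str],
              align: Callable[[str, str], list[tuple[str, str]]],
              equivalent: Callable[[str, str], bool],
              repair: Callable[[str, str], Optional[str]]) -> Optional[str]:
    candidate = guess(src_asm)          # LM proposes a full translation
    out: list[str] = []
    # Assembly divides into short, non-branching basic blocks, keeping each
    # equivalence query handed to the symbolic solver small.
    for src_blk, dst_blk in align(src_asm, candidate):
        if equivalent(src_blk, dst_blk):
            out.append(dst_blk)
        else:
            fixed = repair(src_blk, dst_blk)  # solver resolves the mismatch
            if fixed is None:
                return None                   # cannot transpile this input
            out.append(fixed)
    return "\n".join(out)
```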

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks. Automatically Accelerating Non-Numerical Programs By Extracting Threads with an Architecture-Compiler Co-Design. Communications of the ACM (CACM) Research Highlights, 2017.
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.

Simone Campanoni, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. HELIX-UP: Relaxing Program Semantics to Unleash Parallelization. International Symposium on Code Generation and Optimization (CGO), 2015. Acceptance rate: 27.3% (24/88).
Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.
One of four papers nominated for the Best Paper Award by the Program Committee.

Khalid Al-Hawaj, Simone Campanoni, Gu-Yeon Wei, and David M. Brooks. Unified Cache: A Case for Low Latency Communication. International Workshop on Parallelism in Mobile Platforms (PRISM), 2015.

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David M. Brooks. HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs. International Symposium on Computer Architecture (ISCA), 2014. Acceptance rate: 17.8% (46/258).
Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
Honorable mention, IEEE Micro's Top Picks in Computer Architecture Conferences, 2014. Selected as a Communications of the ACM (CACM) Research Highlight, 2017.

Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, and David M. Brooks. Breaking Cyclic-Multithreading Parallelization with XML Parsing. International Workshop on Parallelism in Mobile Platforms (PRISM), 2014.

Gu-Yeon Wei, David M. Brooks, Simone Campanoni, Kevin Brownell, and Svilen Kanev. Methods and apparatus for parallel processing. US Patents, 2014.

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. HELIX: Making the Extraction of Thread-Level Parallelism Mainstream. IEEE Computer Society Digital Library (IEEE Micro), 2012.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David M. Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. International Symposium on Code Generation and Optimization (CGO), 2012. Acceptance rate: 28.9% (26/90).
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25×, with a maximum of 4.12×, for thirteen C benchmarks from SPEC CPU2000.
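
The key idea above, successive iterations on separate threads with loop-carried dependences still satisfied in iteration order, can be illustrated with a small DOACROSS-style toy. This is a sketch under stated assumptions (Python threads, one sequential segment ordered by per-iteration events); the real HELIX runtime depends on compiler analysis, loop selection heuristics, and helper threads that prefetch synchronization signals.

```python
# Toy model of HELIX-style parallelization: cyclic iteration assignment,
# with the loop-carried update ("sequential segment") ordered by events.
import threading

N, THREADS = 16, 4
done = [threading.Event() for _ in range(N + 1)]
done[0].set()                    # pretend iteration -1 already finished
total = 0                        # loop-carried state

def worker(tid: int) -> None:
    global total
    for i in range(tid, N, THREADS):  # iteration i runs on thread i % THREADS
        local = i * i                  # parallel work: no dependences
        done[i].wait()                 # wait for the previous iteration
        total += local                 # sequential segment, in order
        done[i + 1].set()              # signal the next iteration

threads = [threading.Thread(target=worker, args=(t,)) for t in range(THREADS)]
for t in threads: t.start()
for t in threads: t.join()
assert total == sum(i * i for i in range(N))
```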

Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David M. Brooks. The HELIX Project: Overview and Directions. Design Automation Conference (DAC), 2012. Acceptance rate: 22.7% (168/741).
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Noise in Production Processors. IEEE Micro's Top Picks in Computer Architecture Conferences, 2011.
Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor and examine the implications of resilient architecture design for voltage variation in future systems.

Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling. International Symposium on Microarchitecture (MICRO), 2010. Acceptance rate: 18.1% (45/248).
Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.
Selected as one of IEEE Micro's Top Picks in Computer Architecture Conferences, 2011.
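
As a rough illustration of the scheduling idea, the sketch below pairs the phases of two programs so that high-activity phases of one overlap low-activity phases of the other, flattening aggregate demand on the shared power rail. The per-phase activity numbers and the greedy pairing objective are illustrative assumptions, not the paper's measured noise metrics.

```python
# Hedged sketch: co-schedule program phases to smooth aggregate activity.
def co_schedule(phases_a: list[float], phases_b: list[float]) -> list[tuple[int, int]]:
    """Pair phase indices of A and B so quiet phases meet loud ones."""
    quiet_first = sorted(range(len(phases_a)), key=lambda i: phases_a[i])
    loud_first = sorted(range(len(phases_b)), key=lambda i: -phases_b[i])
    return list(zip(quiet_first, loud_first))

# Example: A's calmest phase (index 1) runs alongside B's noisiest (index 0).
print(co_schedule([0.9, 0.1, 0.5], [0.8, 0.2, 0.4]))  # [(1, 0), (2, 2), (0, 1)]
```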

Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Kim Hazelwood, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Eliminating Voltage Emergencies via Software-Guided Code Transformation. ACM Transactions on Architecture and Code Optimization (TACO), 2010.
In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations.

Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack. Design Automation Conference (DAC), 2009. Acceptance rate: 21.7% (148/682).
Power-constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.

Vijay Janapa Reddi, Meeta S. Gupta, Krishna K. Rangan, Simone Campanoni, Glenn Holloway, Michael D. Smith, Gu-Yeon Wei, and David M. Brooks. Voltage Noise: Why It's Bad, and What To Do About It. International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009.

Yian Su, Mike Rainey, Nicholas Wanninger, Nadharm Dhiantravan, Jasper Liang, Umut A. Acar, Peter Dinda, and Simone Campanoni. Compiling Loop-Based Nested Parallelism for Irregular Workloads. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
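
Heartbeat scheduling's central move, keeping parallelism latent and promoting it to a real task only when a beat has elapsed, fits in a few lines. The sketch below is a toy under assumptions of this write-up (Python threads, a global beat, halving the remaining range on promotion); HBC itself emits compiler-generated code backed by formal performance guarantees.

```python
# Hedged sketch of heartbeat-style granularity control for a parallel loop.
import threading, time

HEARTBEAT_SEC = 0.001           # promote latent parallelism at most this often
_last = time.monotonic()
_lock = threading.Lock()

def _beat() -> bool:
    """True at most once per heartbeat interval, across all threads."""
    global _last
    with _lock:
        now = time.monotonic()
        if now - _last >= HEARTBEAT_SEC:
            _last = now
            return True
    return False

def parallel_for(lo: int, hi: int, body) -> None:
    while lo < hi:
        if hi - lo > 1 and _beat():            # a beat: pay for a real task
            mid = (lo + hi) // 2
            t = threading.Thread(target=parallel_for, args=(mid, hi, body))
            t.start()
            parallel_for(lo, mid, body)        # keep the other half here
            t.join()
            return
        body(lo)                               # between beats: stay sequential
        lo += 1

parallel_for(0, 10_000, lambda i: None)        # tiny smoke test
```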

Nicholas Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda. Getting a Handle on Unmanaged Memory. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.
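
The handle indirection at the heart of this design is easy to demonstrate: when clients hold stable handles instead of raw addresses, the runtime may relocate the backing storage at will. The toy heap below is illustrative only; Alaska performs the equivalent transformation transparently on C/C++ pointers at compile time.

```python
# Toy movable heap: clients keep integer handles; compact() relocates
# every live object, and handles remain valid because only the table changes.
class HandleHeap:
    def __init__(self) -> None:
        self._mem = bytearray()
        self._table: dict[int, tuple[int, int]] = {}  # handle -> (offset, len)
        self._next = 0

    def alloc(self, data: bytes) -> int:
        h = self._next
        self._next += 1
        self._table[h] = (len(self._mem), len(data))
        self._mem += data
        return h                      # caller stores the handle, not an address

    def deref(self, h: int) -> bytes:  # the one extra level of indirection
        off, n = self._table[h]
        return bytes(self._mem[off:off + n])

    def compact(self, live: set[int]) -> None:
        new_mem, new_table = bytearray(), {}
        for h in sorted(live):
            off, n = self._table[h]
            new_table[h] = (len(new_mem), n)
            new_mem += self._mem[off:off + n]
        self._mem, self._table = new_mem, new_table

heap = HandleHeap()
a, b = heap.alloc(b"stale"), heap.alloc(b"keys")
heap.compact(live={b})                 # drop a, move b; b's handle still works
assert heap.deref(b) == b"keys"
```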

Brian R. Tauro, Brian Suchy, Simone Campanoni, Peter Dinda, and Kyle C. Hale. TrackFM: Far-out Compiler Support for a Far Memory World. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024.
Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.
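
A compiler-based far-memory system ultimately rewrites every possibly-remote dereference into a guarded access that faults the object into local memory on first touch. The sketch below illustrates that rewrite under assumptions of this write-up (object-granularity caching, a fetch_remote callable); TrackFM operates on LLVM IR and applies many further optimizations, such as hoisting guards out of loops.

```python
# Hedged sketch of the guarded-access pattern a far-memory compiler emits.
from typing import Callable

class FarPtr:
    def __init__(self, obj_id: int) -> None:
        self.obj_id = obj_id           # names an object living in far memory

class LocalCache:
    def __init__(self, fetch_remote: Callable[[int], bytes]) -> None:
        self._resident: dict[int, bytes] = {}
        self._fetch = fetch_remote

    def deref(self, p: FarPtr) -> bytes:
        # The compiler-inserted guard: local hit if resident, otherwise one
        # round trip to far memory, after which accesses stay local.
        if p.obj_id not in self._resident:
            self._resident[p.obj_id] = self._fetch(p.obj_id)
        return self._resident[p.obj_id]

cache = LocalCache(fetch_remote=lambda oid: f"object-{oid}".encode())
assert cache.deref(FarPtr(7)) == b"object-7"
```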

Zhenpeng Lin, Zheng Yu, Ziyi Guo, Simone Campanoni, Peter Dinda, and Xinyu Xing. CAMP: Compiler and Allocator-based Heap Memory Protection. USENIX Security Symposium (USENIX Security), 2024.

Nicholas Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda. Getting a Handle on Unmanaged Memory. arXiv, 2024.
The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis.

Enrico Armenio Deiana, Brian Suchy, Michael Wilkins, Brian Homerding, Tommy McMichen, Katarzyna Dunajewski, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. Program State Element Characterization. International Symposium on Code Generation and Optimization (CGO), 2023. Acceptance rate: 39.2% (20/51).
Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.

Michael Wilkins, Sam Westrick, Vijay Kandiah, Alex Bernat, Brian Suchy, Enrico Armenio Deiana, Simone Campanoni, Umut A. Acar, Peter Dinda, and Nikos Hardavellas. WARDen: Specializing Cache Coherence for High-Level Parallel Languages. International Symposium on Code Generation and Optimization (CGO), 2023. Acceptance rate: 39.2% (20/51).
High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46×) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.

Brian Suchy, Souradip Ghosh, Aaron Nelson, Zhen Huang, Drew Kersnar, Siyuan Chai, Michael Cuevas, Alex Bernat, Gaurav Chaudhary, Nikos Hardavellas, Simone Campanoni, and Peter Dinda. CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022. Acceptance rate: 20.2% (80/397).
Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Kyle C. Hale, Simone Campanoni, Nikos Hardavellas, and Peter Dinda. The Case for an Interwoven Parallel Hardware/Software Stack. International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021.
The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work.

Jiacheng Ma, Wenyi Wang, Aaron Nelson, Michael Cuevas, Brian Homerding, Conghao Liu, Zhen Huang, Simone Campanoni, Kyle C. Hale, and Peter Dinda. Paths to OpenMP in the Kernel. High Performance Computing, Networking, Storage and Analysis (SC), 2021. Acceptance rate: 26.8% (98/365).
OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework is interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.

Mike Rainey, Kyle C. Hale, Ryan Newton, Nikos Hardavellas, Simone Campanoni, Peter Dinda, and Umut A. Acar. Task Parallel Assembly Language for Uncompromising Parallelism. International Conference on Programming Language Design and Implementation (PLDI), 2021. Acceptance rate: 27.2% (87/320).
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

Souradip Ghosh, Michael Cuevas, Simone Campanoni, and Peter Dinda. Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism. High Performance Computing, Networking, Storage and Analysis (SC), 2020. Acceptance rate: 25.1% (95/378).
In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.
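
The core mechanism, replacing timer interrupts with compiler-injected counter checks at cut points such as loop back edges, looks roughly like the hand-written illustration below. The interval constant and scheduler callback are hypothetical; in the real system the instrumentation is a whole-program compiler transformation whose static analyses make the callback rate predictable on every control-flow path.

```python
# Hedged sketch of compiler-based timing: a decrement-and-test that the
# compiler would inject at loop back edges, here written by hand.
CALLBACK_INTERVAL = 1_000        # units of work between runtime callbacks
_budget = CALLBACK_INTERVAL

def _back_edge(callback) -> None:
    global _budget
    _budget -= 1                 # what the injected code does per iteration
    if _budget == 0:
        _budget = CALLBACK_INTERVAL
        callback()               # fine-grain "preemption" point, no interrupt

def transformed_loop(data, callback) -> int:
    total = 0
    for x in data:
        total += x               # original loop body
        _back_edge(callback)     # compiler-injected timing check
    return total

assert transformed_loop(range(10), callback=lambda: None) == 45
```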

Brian Suchy, Simone Campanoni, Nikos Hardavellas, and Peter Dinda. CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation. International Conference on Programming Language Design and Implementation (PLDI), 2020. Acceptance rate: 22.6% (77/341).
Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions.
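
CARAT's invariant, that the runtime knows every allocation and every location that may hold a pointer to it, is what lets a kernel move memory and still keep programs correct without hardware translation. The toy runtime below makes the tracking calls explicit and uses one-element lists to stand in for memory words; in CARAT these calls are injected by the compiler at IR level.

```python
# Hedged sketch of CARAT-style tracking: the runtime can relocate an
# allocation and patch every tracked pointer to it.
class CaratRuntime:
    def __init__(self) -> None:
        self.escapes: dict[int, list] = {}   # allocation addr -> pointer cells

    def track_allocation(self, addr: int) -> None:
        self.escapes.setdefault(addr, [])

    def track_escape(self, addr: int, cell: list) -> None:
        self.escapes[addr].append(cell)      # 'cell' models a memory word

    def move(self, old: int, new: int) -> None:
        cells = self.escapes.pop(old, [])
        for cell in cells:                   # patch every pointer to 'old'
            cell[0] = new
        self.escapes[new] = cells

rt = CaratRuntime()
rt.track_allocation(0x1000)
word = [0x1000]                              # program data holding a pointer
rt.track_escape(0x1000, word)
rt.move(0x1000, 0x2000)                      # kernel relocates the allocation
assert word[0] == 0x2000                     # the program never notices
```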

Enrico Armenio Deiana, Vincent St-Amour, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. Unconventional Parallelization of Nondeterministic Applications. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018. Acceptance rate: 17.6% (56/319).
The demand for thread-level parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.

Enrico Armenio Deiana, Vincent St-Amour, Peter Dinda, Nikos Hardavellas, and Simone Campanoni. The Liberation Day of Nondeterministic Programs. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017.
The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%.

Ziyang Xu, Yebin Chon, Yian Su, Zujun Tan, Sotiris Apostolakis, Simone Campanoni, and David I. August. PROMPT: A Fast and Extensible Memory Profiling Framework. International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024. Acceptance rate: 30.9% (38/123).
Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1×, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.
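
The division of labor PROMPT argues for can be pictured simply: the framework owns instrumentation and event delivery, and a profiler contributes only its core logic as event handlers. The register/emit API and event names below are assumptions made for illustration, not PROMPT's interface; the example reduces a flow-dependence profiler to its essential handlers.

```python
# Hedged sketch of an event-driven memory-profiling framework.
from collections import defaultdict
from typing import Callable

class ProfilingFramework:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event: str, handler: Callable) -> None:
        self._handlers[event].append(handler)     # e.g. "load", "store"

    def emit(self, event: str, **info) -> None:   # called from instrumentation
        for h in self._handlers[event]:
            h(**info)

# Core logic of a dependence profiler: remember each address's last writer
# and record read-after-write edges.
fw = ProfilingFramework()
last_writer: dict[int, int] = {}
deps: set[tuple[int, int]] = set()

def on_store(addr: int, inst: int) -> None:
    last_writer[addr] = inst

def on_load(addr: int, inst: int) -> None:
    if addr in last_writer:
        deps.add((last_writer[addr], inst))       # RAW dependence edge

fw.on("store", on_store)
fw.on("load", on_load)
fw.emit("store", addr=0x10, inst=1)
fw.emit("load", addr=0x10, inst=2)
assert (1, 2) in deps
```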

Ishita Chaturvedi, Bhargav Reddy Godala, Yucan Wu, Ziyang Xu, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis, Sotirios Xydis, Dimitrios Soudris, Tyler Sorensen, Simone Campanoni, Tor M. Aamodt, and David I. August. GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction. International Symposium on Computer Architecture (ISCA), 2024. Acceptance rate: 21.2% (79/372).
Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) the prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks.

Jeremiah Giordani, Ziyang Xu, Ella Colby, August Ning, Bhargav Reddy Godala, Ishita Chaturvedi, Shaowei Zhu, Yebin Chon, Greg Chan, Zujun Tan, Galen Collier, Jonathan D. Halverson, Enrico Armenio Deiana, Jasper Liang, Federico Sossai, Yian Su, Atmn Patel, Bangyen Pham, Nathan Greiner, Simone Campanoni, and David I. August. Revisiting Computation for Research: Practices and Trends. High Performance Computing, Networking, Storage and Analysis (SC), 2024. Acceptance rate: 22.7% (102/449).

Brian Homerding, Atmn Patel, Enrico Armenio Deiana, Yian Su, Zujun Tan, Ziyang Xu, Bhargav Reddy Godala, David I. August, and Simone Campanoni. The Parallel Semantics Program Dependence Graph. arXiv, 2024.
A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming like OpenMP extends a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Unfortunately, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics.

Nayana Prasad Nagendra, Bhargav Reddy Godala, Ishita Chaturvedi, Atmn Patel, Svilen Kanev, Tipp Moseley, Jared Stark, Gilles A. Pokam, Simone Campanoni, and David I. August. EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching. International Symposium on Computer Architecture (ISCA), 2023. Acceptance rate: 21.2% (79/372).
For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
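
Expressed as a replacement policy, the idea reduces to a protected bit: lines whose misses starved the decoder resist eviction, and victims are drawn from unprotected lines first. The single-set model below, with random victim choice among unprotected lines, is a simplification for illustration; the paper evaluates concrete policy variants within a set-associative L2.

```python
# Hedged sketch of an EMISSARY-like, cost-aware instruction-cache set.
import random

class CostAwareSet:
    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.lines: dict[int, bool] = {}   # tag -> protected bit

    def access(self, tag: int, miss_caused_starvation: bool = False) -> bool:
        if tag in self.lines:
            return True                    # hit
        if len(self.lines) >= self.ways:   # pick a victim, unprotected first
            pool = [t for t, p in self.lines.items() if not p] or list(self.lines)
            del self.lines[random.choice(pool)]
        # Misses that caused costly decode starvation earn protection.
        self.lines[tag] = miss_caused_starvation
        return False

s = CostAwareSet(ways=2)
s.access(0xA, miss_caused_starvation=True)   # protected
s.access(0xB)                                # unprotected
s.access(0xC)                                # evicts 0xB, never 0xA
assert s.access(0xA)                         # 0xA survived
```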
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but its profitability can be limited by missing facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
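For intuition, here is the shape of portable, natural OpenMP output such a decompiler could emit for a PolyBench-style kernel. This is an illustrative sketch, not actual SPLENDID output; the kernel and names are invented.

```cpp
// Sketch: decompiled output as plain, portable OpenMP source.
#include <cstdio>

enum { N = 1024 };
static double A[N][N], x[N], y[N];

void mvt_like() {
  // Standard OpenMP accepted by any conforming compiler: the decompiled
  // loop nest carries no dialect- or backend-specific constructs.
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      x[i] += A[i][j] * y[j];
}

int main() { mvt_like(); std::printf("%f\n", x[0]); return 0; }
```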
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
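The division of labor the abstract describes can be sketched as follows: the developer writes only the core profiling logic, invoked at events the framework drives. All names here (`DepProfiler`, `onStore`, `onLoad`) are invented for illustration and are not PROMPT's actual API; instrumentation would be generated by the framework.

```cpp
// Sketch: the developer-written core of a flow-dependence profiler.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct DepProfiler {
  std::unordered_map<uint64_t, uint64_t> lastStore;  // address -> store id
  uint64_t flowDeps = 0;

  // Core logic only: the framework would call these at each memory event.
  void onStore(uint64_t addr, uint64_t instId) { lastStore[addr] = instId; }
  void onLoad(uint64_t addr, uint64_t /*instId*/) {
    if (lastStore.count(addr)) ++flowDeps;  // a RAW dependence was observed
  }
};

int main() {
  DepProfiler p;       // stand-in for framework-generated instrumentation
  p.onStore(0x10, 1);
  p.onLoad(0x10, 2);   // depends on store 1
  p.onLoad(0x18, 3);   // no prior store: no dependence
  std::printf("flow deps: %llu\n", (unsigned long long)p.flowDeps);
  return 0;
}
```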
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
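As a rough picture of what "higher-level, program-wide abstractions" buy a custom tool, consider the sketch below: the tool queries pre-built loop and dependence abstractions instead of reconstructing them from raw LLVM IR. The interface names are illustrative stand-ins, not NOELLE's actual API.

```cpp
// Sketch: a custom tool built on a NOELLE-style compilation layer.
#include <cstdio>
#include <vector>

// Abstractions provided by the layer (illustrative stand-ins).
struct LoopAbstraction {
  bool hasLoopCarriedMemoryDependence() const { return false; }
};
struct CompilationLayer {
  std::vector<LoopAbstraction> getLoops() const { return {}; }
};

// The custom tool: count loops that look DOALL-parallelizable, program-wide.
int countDOALLCandidates(const CompilationLayer& layer) {
  int n = 0;
  for (const auto& loop : layer.getLoops())
    if (!loop.hasLoopCarriedMemoryDependence()) ++n;
  return n;
}

int main() {
  CompilationLayer layer;
  std::printf("DOALL candidates: %d\n", countDOALLCandidates(layer));
  return 0;
}
```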
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
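The collaborative structure SCAF describes can be pictured as modules answering the same dependence query and the framework keeping the most precise answer, tracking whether it rests on a profile-backed speculative assertion (which would need validation code). This is a minimal sketch under invented names, not SCAF's implementation.

```cpp
// Sketch: combining analysis modules' answers to one dependence query.
#include <cstdio>
#include <vector>

enum class Answer { MayDep, NoDepProved, NoDepSpeculated };

struct Module { Answer (*query)(int src, int dst); };

Answer combined(const std::vector<Module>& mods, int src, int dst) {
  Answer best = Answer::MayDep;
  for (const auto& m : mods) {
    Answer a = m.query(src, dst);
    if (a == Answer::NoDepProved) return a;       // strongest result wins
    if (a == Answer::NoDepSpeculated) best = a;   // usable, needs validation
  }
  return best;
}

int main() {
  std::vector<Module> mods = {
      {+[](int, int) -> Answer { return Answer::MayDep; }},          // static
      {+[](int, int) -> Answer { return Answer::NoDepSpeculated; }}  // profile
  };
  // Prints 2: the dependence can be speculated away, with validation.
  std::printf("%d\n", (int)combined(mods, 1, 2));
  return 0;
}
```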
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT where all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
|
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application's execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage's existing pool of decoded instructions and the existing issue stage's information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia's intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
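The selection idea can be illustrated in software, greatly simplified and not GhOST's hardware: rather than stalling on the oldest decoded instruction, the issue stage may pick a younger one whose operands are ready, using only information already present in the decode and issue stages. Structures and names below are assumptions for illustration.

```cpp
// Sketch: issue-stage selection from the decoded-instruction pool.
#include <array>
#include <cstdio>
#include <vector>

struct Inst { int id; std::vector<int> srcRegs; };

int selectToIssue(const std::vector<Inst>& decodedPool,
                  const std::array<bool, 32>& regReady) {
  for (const Inst& in : decodedPool) {          // oldest-first scan
    bool ready = true;
    for (int r : in.srcRegs) ready = ready && regReady[r];
    if (ready) return in.id;                    // first ready, possibly OoO
  }
  return -1;                                    // nothing ready: stall
}

int main() {
  std::array<bool, 32> ready{};
  ready[3] = ready[4] = true;
  std::vector<Inst> pool = {{0, {1}}, {1, {3, 4}}};  // inst 0 waits on r1
  std::printf("issue inst %d\n", selectToIssue(pool, ready));  // issues 1
  return 0;
}
```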
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
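One way to picture PSEC's payoff: the characterization of each program state element maps directly onto OpenMP data-sharing clauses. In the sketch below the classification comments are hand-written for the example, standing in for what a tool like CARMOT would report; the loop itself is invented.

```cpp
// Sketch: a PSEC-style characterization turned into OpenMP clauses.
#include <cstdio>

int main() {
  const int n = 1000;
  double scale = 2.0;   // PSEC: read-only          -> firstprivate
  double tmp;           // PSEC: per-iteration write -> private
  double sum = 0.0;     // PSEC: read-modify-write   -> reduction(+)

  #pragma omp parallel for firstprivate(scale) private(tmp) reduction(+ : sum)
  for (int i = 0; i < n; ++i) {
    tmp = scale * i;
    sum += tmp;
  }
  std::printf("%f\n", sum);
  return 0;
}
```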
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
|
Enrico Armenio Deiana
, and
Simone Campanoni
Workload Characterization of Nondeterministic Programs Parallelized by STATS International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019 Acceptance rate: 29.5% (26/88) Chip Multiprocessors (CMPs) are everywhere, from mobile systems to servers. Thread-Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts to create TLP, multiple cores are still underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input, and it has the potential to generate scalable performance with the number of cores. Even though STATS realizes most of this potential, some of it is still unreached. This paper identifies and characterizes the sources of overhead that currently prevent STATS-parallelized programs from achieving their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28-core Intel-based machine (dual-socket). This paper shows that the performance loss is due to a combination of factors: some can be removed via engineering effort, while others require a deeper evolution of STATS. We also highlight potential solutions to significantly reduce most of this overhead. Exploiting these insights will unlock scalable performance for the parallel binaries generated by STATS. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018 Acceptance rate: 17.6% (56/319) The demand for thread-level-parallelism (TLP) on commodity processors is endless as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
The Liberation Day of Nondeterministic Programs International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017 The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%. |
|
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
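Conceptually, compiler- and runtime-based address translation replaces hardware page-walks with compiler-inserted checks that let the kernel enforce protection and move memory. The toy guard below illustrates only that shape; it is an assumption-laden sketch, not the CARAT CAKE implementation, and `Region`/`guardStore` are invented names.

```cpp
// Sketch: a compiler-inserted protection guard before a store.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct Region { uintptr_t base, size; bool writable; };
static Region gRegion;  // in reality: per-process metadata kept by the kernel

// The compiler would insert this check before stores it cannot prove safe.
inline void guardStore(void* p) {
  uintptr_t a = (uintptr_t)p;
  if (a < gRegion.base || a >= gRegion.base + gRegion.size ||
      !gRegion.writable) {
    std::fprintf(stderr, "protection violation at %p\n", p);
    std::abort();  // stand-in for trapping into the kernel
  }
}

int main() {
  int* buf = (int*)std::malloc(4 * sizeof(int));
  gRegion = {(uintptr_t)buf, 4 * sizeof(int), true};
  guardStore(&buf[2]);  // inserted by the compiler before the store
  buf[2] = 42;
  std::printf("%d\n", buf[2]);
  std::free(buf);
  return 0;
}
```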
|
Kyle C. Hale
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021 The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work. |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
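The heartbeat idea can be sketched at source level: parallelism stays latent in a sequential loop, and only when a periodic beat fires does the program pay the cost of splitting off work. Here the beat is a plain counter checked at the loop back-edge; TPAL instead uses interrupt-based primitives at the assembly level, and the recursive split below merely stands in for spawning a task.

```cpp
// Sketch: heartbeat-style promotion of latent loop parallelism.
#include <cstdio>

const int HEARTBEAT = 1000;  // promote latent parallelism every N iterations

long sumRange(long lo, long hi) {
  long acc = 0, sinceBeat = 0;
  for (long i = lo; i < hi; ++i) {
    acc += i;  // useful sequential work
    if (++sinceBeat == HEARTBEAT) {     // heartbeat check at the back-edge
      sinceBeat = 0;
      long mid = (i + 1 + hi) / 2;
      if (hi - (i + 1) > HEARTBEAT) {   // enough work left: split off half
        long rest = sumRange(mid, hi);  // in TPAL this would become a task
        hi = mid;
        acc += rest;
      }
    }
  }
  return acc;
}

int main() { std::printf("%ld\n", sumRange(0, 1000000)); return 0; }
```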
|
Brian Suchy
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions. |
|
Georgios Tziantzioulis
,
Nikos Hardavellas
, and
Simone Campanoni
Temporal Approximate Function Memoization IEEE Micro's special issue on Approximate Computing, 2018 Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality in their outputs. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (a 2-4x speedup) at the cost of a relatively small degradation of the application's output quality. |
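A minimal sketch of the output-based idea, with an invented expensive function and window size: instead of matching inputs against a table, the wrapper exploits temporal value locality of the output, returning the last computed value for a window of calls and refreshing it periodically, trading accuracy for time.

```cpp
// Sketch: output-based (temporal) memoization of an expensive function.
#include <cmath>
#include <cstdio>

double expensive(double x) { return std::sin(x) * std::exp(-x / 100.0); }

double memoizedExpensive(double x) {
  static double lastOut = 0.0;
  static int calls = 0;
  const int WINDOW = 8;      // accept approximation for this many calls
  if (calls++ % WINDOW == 0)
    lastOut = expensive(x);  // refresh the memoized output
  return lastOut;            // otherwise reuse: output changes slowly
}

int main() {
  double acc = 0.0;
  for (int i = 0; i < 100; ++i) acc += memoizedExpensive(i * 0.01);
  std::printf("%f\n", acc);  // approximate, but far cheaper to compute
  return 0;
}
```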
Xiaochun Zhang,
Timothy M. Jones
, and
Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming International Symposium on Workload Characterization (IISWC), 2021 Acceptance rate: 39.6% (19/48) Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system. |
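The "semantic gap" can be made concrete with a small example. The sequential histogram below carries a loop-carried dependence on `hist` that a conservative compiler must respect, yet the programmer knows the updates commute. Declaring that fact, here with a standard OpenMP array reduction standing in for the language extensions the paper evaluates, legalizes the parallel schedule.

```cpp
// Sketch: closing a semantic gap by declaring commutative updates.
#include <cstdio>

int main() {
  const int N = 1 << 20, B = 16;
  long hist[B] = {0};

  // Updates to hist commute: their order across iterations is irrelevant
  // to the program's semantics, and the reduction clause says so.
  #pragma omp parallel for reduction(+ : hist)
  for (int i = 0; i < N; ++i)
    hist[i % B] += 1;

  std::printf("%ld\n", hist[0]);  // 65536 regardless of schedule
  return 0;
}
```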
|
Niall Murphy,
Timothy M. Jones
,
Robert Mullins
, and
Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops International Conference on Compiler Construction (CC), 2016 Acceptance rate: 31.2% (24/77) Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism. |
|
Niall Murphy,
Timothy M. Jones
,
Simone Campanoni
, and
Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization International Workshop on Compilers for Parallel Computing (CPC), 2015 |
|
Simone Campanoni
,
Kevin Brownell
,
Svilen Kanev
,
Timothy M. Jones
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs International Symposium on Computer Architecture (ISCA), 2014 Acceptance rate: 17.8% (46/258) Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks. IEEE Micro's Top Picks in Computer Architecture Conferences honorable mention, 2014. Communication ACM Research Highlights (CACM), 2017. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream IEEE Computer Society Digital Library (IEEE Micro), 2012 Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Vijay Janapa Reddi
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing International Symposium on Code Generation and Optimization (CGO), 2012 Acceptance rate: 28.9% (26/90) We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000. |
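The execution shape HELIX produces can be rendered as a toy sketch: successive iterations go to different threads (cyclic assignment), and the sequential segment, the loop-carried update, is ordered by per-iteration signaling. Real HELIX selects loops with a profile-driven model and optimizes signal traffic; this sketch, with invented variable names, only shows the shape.

```cpp
// Sketch: HELIX-style cyclic iteration assignment with ordered segments.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int iters = 1000, nthreads = 4;
  std::atomic<int> turn{0};  // which iteration may run its sequential part
  long acc = 0;

  auto worker = [&](int tid) {
    for (int i = tid; i < iters; i += nthreads) {  // cyclic assignment
      long contribution = (long)i * i;             // parallel part
      while (turn.load(std::memory_order_acquire) != i) { /* wait signal */ }
      acc += contribution;  // sequential segment, executed in loop order
      turn.store(i + 1, std::memory_order_release);  // signal next iteration
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker, t);
  for (auto& th : pool) th.join();
  std::printf("%ld\n", acc);
  return 0;
}
```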
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
The HELIX Project: Overview and Directions Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism. |
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
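The division of labor the abstract describes, where the framework owns instrumentation and event delivery and the profiler writer supplies only the events of interest plus core logic, might look roughly like the sketch below. All interface names are invented for illustration; PROMPT's actual API differs.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Illustrative framework/profiler split. A profiler subscribes to
// memory events and supplies only its core logic; the (omitted)
// framework would drive these callbacks from instrumented code.
struct MemoryProfiler {
  virtual ~MemoryProfiler() = default;
  virtual void onLoad(uint64_t /*instrId*/, uint64_t /*addr*/) {}
  virtual void onStore(uint64_t /*instrId*/, uint64_t /*addr*/) {}
  virtual void report() {}
};

// Example core logic: record store->load (flow) memory dependences,
// the kind of fact speculative parallelization consumes.
class FlowDepProfiler : public MemoryProfiler {
  std::unordered_map<uint64_t, uint64_t> lastStore_;  // addr -> storing instr
  std::unordered_map<uint64_t, uint64_t> rawDeps_;    // load instr -> store instr
 public:
  void onStore(uint64_t instrId, uint64_t addr) override { lastStore_[addr] = instrId; }
  void onLoad(uint64_t instrId, uint64_t addr) override {
    auto it = lastStore_.find(addr);
    if (it != lastStore_.end()) rawDeps_[instrId] = it->second;
  }
  void report() override {
    for (auto& [ld, st] : rawDeps_)
      std::printf("load %llu depends on store %llu\n",
                  (unsigned long long)ld, (unsigned long long)st);
  }
};

int main() {
  FlowDepProfiler p;
  p.onStore(1, 0x1000);  // instruction 1 writes address 0x1000
  p.onLoad(2, 0x1000);   // instruction 2 reads it: a flow dependence
  p.report();
  return 0;
}
```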
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Brian Homerding
,
Atmn Patel
,
Enrico Armenio Deiana
,
Yian Su
,
Zujun Tan
,
Ziyang Xu
,
Bhargav Reddy Godala
,
David I. August
, and
Simone Campanoni
The Parallel Semantics Program Dependence Graph ArXiv, 2024 A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. Moreover, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics. |
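For readers unfamiliar with dependence graphs, a minimal sketch of the kind of structure involved may help: a PDG records the constraints any semantically-equivalent plan must respect, and a parallel-semantics variant must additionally distinguish hard constraints from relaxable ones. The types and fields below are invented for illustration and are not the PS-PDG's actual definition.

```cpp
#include <cstdint>
#include <vector>

// A bare-bones dependence graph gesturing at the paper's point: a
// parallel-semantics representation must record which constraints
// every semantically-equivalent parallel execution plan has to
// respect. Names are illustrative only.
enum class DepKind { Control, DataFlow, DataAnti, DataOutput };

struct DepEdge {
  uint32_t from, to;  // instruction ids
  DepKind kind;
  bool loopCarried;   // does the dependence cross loop iterations?
  bool mustRespect;   // hard constraint on every parallel plan, as
                      // opposed to one a transform may relax
};

struct PSPDGLike {
  uint32_t numInstrs = 0;
  std::vector<DepEdge> edges;
  void addEdge(DepEdge e) { edges.push_back(e); }
  // Two instructions may run in parallel if no hard edge orders them.
  bool mayRunInParallel(uint32_t a, uint32_t b) const {
    for (const auto& e : edges)
      if (e.mustRespect && ((e.from == a && e.to == b) || (e.from == b && e.to == a)))
        return false;
    return true;
  }
};

int main() {
  PSPDGLike g; g.numInstrs = 3;
  g.addEdge({0, 1, DepKind::DataFlow, false, true});   // hard flow dependence
  g.addEdge({1, 2, DepKind::DataAnti, false, false});  // relaxable, e.g. by privatization
  return (g.mayRunInParallel(1, 2) && !g.mayRunInParallel(0, 1)) ? 0 : 1;
}
```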
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
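For concreteness, the target form such a decompiler aims for is ordinary, compiler-agnostic C/C++ with standard OpenMP pragmas, along these lines (a generic kernel, not an example from the paper):

```cpp
#include <vector>

// The kind of portable parallel source a decompiler like SPLENDID
// targets: plain code with a standard OpenMP pragma that any OpenMP
// compiler accepts and a programmer can read and refine.
void scale_rows(double* A, const double* s, int n, int m) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      A[i * m + j] *= s[i];  // each row i is independent: safe to parallelize
}

int main() {
  std::vector<double> A(4 * 3, 1.0);
  std::vector<double> s{2.0, 3.0, 4.0, 5.0};
  scale_rows(A.data(), s.data(), 4, 3);
  return A[0] == 2.0 ? 0 : 1;
}
```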
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation efforts. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, they should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy-to-extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
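The modular-and-collaborative structure described above can be sketched as dependence-query modules whose answers the framework combines, with a speculative "no" carrying a validation obligation. The interfaces below are invented for illustration and are not SCAF's API.

```cpp
#include <memory>
#include <vector>

// Sketch of a collaborative dependence-analysis framework in the
// spirit of SCAF: modules answer dependence queries; a speculative
// "no" must later be validated; the framework keeps the most precise
// (and cheapest) answer any module can give.
enum class Answer { No, NoIfSpeculated, Maybe };

struct Query { int srcInstr, dstInstr; bool loopCarried; };

struct AnalysisModule {
  virtual ~AnalysisModule() = default;
  virtual Answer query(const Query&) const = 0;
};

struct ConservativeAA : AnalysisModule {  // static analysis: no precise answer
  Answer query(const Query&) const override { return Answer::Maybe; }
};

struct ProfileGuidedSpec : AnalysisModule {  // profiling never saw the dependence
  Answer query(const Query& q) const override {
    return q.loopCarried ? Answer::NoIfSpeculated : Answer::Maybe;
  }
};

// Collaboration: a non-speculative No wins outright (nothing to
// validate); otherwise a speculative No beats Maybe.
Answer resolve(const std::vector<std::unique_ptr<AnalysisModule>>& mods, const Query& q) {
  Answer best = Answer::Maybe;
  for (const auto& m : mods) {
    Answer a = m->query(q);
    if (a == Answer::No) return a;
    if (a == Answer::NoIfSpeculated) best = a;
  }
  return best;
}

int main() {
  std::vector<std::unique_ptr<AnalysisModule>> mods;
  mods.push_back(std::make_unique<ConservativeAA>());
  mods.push_back(std::make_unique<ProfileGuidedSpec>());
  Query q{1, 2, true};
  return resolve(mods, q) == Answer::NoIfSpeculated ? 0 : 1;
}
```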
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
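The selection mechanism the abstract alludes to, choosing a ready instruction from the already-decoded pool rather than stalling on the head, can be sketched as follows; the structures are invented and greatly simplified relative to a real GPU issue stage.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Toy issue-stage selection illustrating the kind of lightweight
// reordering GhOST performs: scan the existing pool of decoded
// instructions oldest-first and pick the first one whose source
// registers are not pending, instead of stalling on the head.
struct DecodedInstr {
  uint8_t src1, src2, dst;
  bool isMemory;  // e.g., a long-latency load
};

int selectNext(const std::vector<DecodedInstr>& pool,
               const std::bitset<256>& pendingRegs) {
  for (int i = 0; i < (int)pool.size(); ++i)   // oldest-first scan
    if (!pendingRegs[pool[i].src1] && !pendingRegs[pool[i].src2])
      return i;                                // first ready instruction
  return -1;                                   // everything is stalled
}

int main() {
  std::bitset<256> pending;
  pending.set(10);  // r10 awaits an outstanding load
  std::vector<DecodedInstr> pool = {
    {10, 2, 3, false},  // head: blocked on r10
    {4, 5, 6, false},   // younger but ready -> issued out of order
  };
  return selectNext(pool, pending) == 1 ? 0 : 1;
}
```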
|
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
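A toy rendition of the cost-aware idea, sparing lines whose misses previously caused decode starvations, might look like this; it is a sketch of the principle, not the published policy.

```cpp
#include <cstdint>
#include <vector>

// Toy cost-aware victim selection in the spirit of EMISSARY: lines
// whose misses previously caused costly decode starvations carry a
// priority bit and resist eviction; plain LRU decides among the rest.
struct Line { uint64_t tag = 0; bool valid = false, priority = false; uint32_t lruAge = 0; };

int chooseVictim(const std::vector<Line>& set) {
  for (int i = 0; i < (int)set.size(); ++i)  // free slot first
    if (!set[i].valid) return i;
  int victim = -1; uint32_t oldest = 0;
  for (int i = 0; i < (int)set.size(); ++i)  // oldest non-priority line
    if (!set[i].priority && set[i].lruAge >= oldest) { oldest = set[i].lruAge; victim = i; }
  if (victim >= 0) return victim;
  oldest = 0;                                // all prioritized: fall back to LRU
  for (int i = 0; i < (int)set.size(); ++i)
    if (set[i].lruAge >= oldest) { oldest = set[i].lruAge; victim = i; }
  return victim;
}

int main() {
  std::vector<Line> set(4);
  for (int i = 0; i < 4; ++i) { set[i].valid = true; set[i].lruAge = (uint32_t)i; }
  set[3].priority = true;  // the oldest line once caused a starvation: spare it
  return chooseVictim(set) == 2 ? 0 : 1;
}
```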
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
|
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
|
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different and binary incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability. |
|
Michele Tartara
,
Stefano Crespi Reghizzi
, and
Simone Campanoni
Extending Hammocks for Parallelism Detection Italian Conference on Theoretical Computer Science (ICTCS), 2010 |
|
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
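The method-selection loop the abstract describes, a background compiler repeatedly picking the most promising not-yet-compiled method from a dynamically re-weighted graph, reduces to something like the following sketch; weights and method names are illustrative.

```cpp
#include <queue>
#include <string>
#include <vector>

// Sketch of the Dynamic Look Ahead idea: methods reachable from the
// currently executing code are scored, and a background compiler
// thread always compiles the hottest not-yet-compiled method ahead
// of its first call, masking JIT latency.
struct Candidate {
  std::string method;
  double weight;  // e.g., call probability times estimated execution weight
  bool operator<(const Candidate& o) const { return weight < o.weight; }
};

int main() {
  std::priority_queue<Candidate> lookAhead;
  // As execution proceeds, the runtime re-scores reachable methods.
  lookAhead.push({"Parser::parse", 0.9});
  lookAhead.push({"Log::flush", 0.1});
  lookAhead.push({"Matrix::mul", 0.7});
  while (!lookAhead.empty()) {  // background JIT worker loop
    Candidate next = lookAhead.top(); lookAhead.pop();
    // compile(next.method);    // hypothetical: JIT-compile before first call
    (void)next;
  }
  return 0;
}
```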
|
Simone Campanoni
, and
Stefano Crespi Reghizzi
Traces of Control-Flow Graphs International Conference on Developments in Language Theory (DLT), 2009 Acceptance rate: 45.7% (32/70) |
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode ACM SIGPLAN Notices, 2008 Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit the full potential of their hardware parallelism. At the same time, object code virtualization technologies are achieving growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency-masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores. |
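The pipeline organization of the compiler's internal threads can be pictured as stages connected by blocking queues, so that translation and code generation overlap; a minimal sketch, with invented stage names and payloads:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A two-stage compilation pipeline in the spirit of ILDJIT's model:
// one thread translates bytecode to IR while another generates native
// code, so compilation latency overlaps with other work.
template <typename T>
class Channel {
  std::queue<T> q_; std::mutex m_; std::condition_variable cv_; bool closed_ = false;
 public:
  void push(T v) { { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); } cv_.notify_one(); }
  void close() { { std::lock_guard<std::mutex> l(m_); closed_ = true; } cv_.notify_all(); }
  bool pop(T& out) {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return false;
    out = std::move(q_.front()); q_.pop(); return true;
  }
};

int main() {
  Channel<std::string> toCodegen;
  std::thread frontend([&] {   // stage 1: bytecode -> IR
    for (auto m : {"Main.run", "List.sort"}) toCodegen.push(std::string(m) + ".ir");
    toCodegen.close();
  });
  std::thread backend([&] {    // stage 2: IR -> native code
    std::string ir;
    while (toCodegen.pop(ir)) { /* emit machine code for `ir` */ }
  });
  frontend.join(); backend.join();
  return 0;
}
```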
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler International Conference on Very Large Scale Integration (VLSI-SoC), 2008 |
|
Simone Campanoni
,
Svilen Kanev
,
Kevin Brownell
,
Gu-Yeon Wei
, and
David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing International Workshop on Parallelism in Mobile Platforms (PRISM), 2014 |
|
Gu-Yeon Wei
,
David M. Brooks
,
Simone Campanoni
,
Kevin Brownell
, and
Svilen Kanev
Methods and apparatus for parallel processing US Patents, 2014 |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
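The granularity-control idea behind Heartbeat Scheduling, running sequentially by default and promoting remaining iterations into a parallel task only when a heartbeat interval has elapsed, can be sketched as follows; `spawn` is a stand-in for a real task runtime and is replaced here by a sequential call.

```cpp
#include <chrono>
#include <cstdint>

// Sketch of heartbeat-style granularity control, the idea HBC
// compiles into binaries automatically: task-creation cost is paid
// at most once per heartbeat interval, so overhead stays bounded
// even for fine-grained, irregular loops.
using Clock = std::chrono::steady_clock;
constexpr auto kHeartbeat = std::chrono::microseconds(100);

template <typename Body>
void heartbeat_for(int64_t lo, int64_t hi, Body body) {
  auto last = Clock::now();
  for (int64_t i = lo; i < hi; ++i) {
    body(i);
    if (Clock::now() - last >= kHeartbeat && hi - i > 1) {
      int64_t mid = (i + 1 + hi) / 2;        // split the remaining range
      // spawn(heartbeat_for(mid, hi, body)); // hypothetical task runtime
      heartbeat_for(mid, hi, body);           // sequential stand-in here
      hi = mid;                               // keep the first half locally
      last = Clock::now();
    }
  }
}

int main() {
  int64_t sum = 0;
  heartbeat_for(0, 100000, [&](int64_t i) { sum += i % 7; });
  return sum >= 0 ? 0 : 1;
}
```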
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Vijay Janapa Reddi
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing International Symposium on Code Generation and Optimization (CGO), 2012 Acceptance rate: 28.9% (26/90) We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25 x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations. |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack Design Automation Conference (DAC), 2009 Acceptance rate: 21.7% (148/682) Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance. |
|
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Brian Homerding
,
Atmn Patel
,
Enrico Armenio Deiana
,
Yian Su
,
Zujun Tan
,
Ziyang Xu
,
Bhargav Reddy Godala
,
David I. August
, and
Simone Campanoni
The Parallel Semantics Program Dependence Graph ArXiv, 2024 A compiler's intermediate representation (IR) defines a program's execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan with a semantically-equivalent one that increases the program's performance for the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan by adding the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers in identifying the set of parallel execution plans that preserve the original semantics. In other words, we are still lacking an alternative representation of parallel IRs to capture the minimum set of constraints that parallel execution plans must satisfy to be semantically-equivalent. However, the PDG is not an ideal candidate for this task as it was designed for sequential code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select different parallel execution plans while maintaining the same original semantics. |
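A loose data-structure illustration of the gap the paper addresses (field names invented here; the PS-PDG's actual extensions are defined in the paper): a sequential PDG records control and data edges, while a parallel-semantics representation must additionally know, per dependence, facts such as whether the program's existing synchronization already enforces it.

```cpp
#include <cstdio>
#include <vector>

enum class EdgeKind { Control, DataFlow, Memory };

struct Edge {
  int from, to;
  EdgeKind kind;
  bool loopCarried;     // crosses loop iterations
  bool enforcedBySync;  // already ordered by explicit synchronization
};

struct DependenceGraph {
  int numNodes = 0;
  std::vector<Edge> edges;
  // An edge constrains the space of parallel execution plans only if it is
  // not already enforced by the synchronization the programmer wrote.
  bool constrains(const Edge &e) const { return !e.enforcedBySync; }
};

int main() {
  DependenceGraph g;
  g.numNodes = 2;
  g.edges.push_back({0, 1, EdgeKind::Memory, true, true});
  std::printf("constraining edges: %d\n", (int)g.constrains(g.edges[0]));
}
```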
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but its profitability can be limited because the compiler lacks facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
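For flavor, this is the kind of portable, natural OpenMP source such a decompiler aims to emit, which any C/C++ compiler can rebuild and re-tune (a hand-written example, not SPLENDID's actual output):

```cpp
#include <cstdio>

int main() {
  const int N = 1 << 20;
  static double a[1 << 20];
  double sum = 0.0;
  // Standard, compiler-agnostic OpenMP rather than IR-level constructs.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < N; ++i) {
    a[i] = i * 0.5;
    sum += a[i];
  }
  std::printf("%f\n", sum);
}
```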
|
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
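A toy rendering of what a PSEC result can feed (categories and mapping invented for illustration; CARMOT's real characterization is richer): once the tool knows how a region accesses each program state element, the access pattern maps to an OpenMP data-sharing recommendation.

```cpp
#include <cstdio>

// How a region accesses a program state element (simplified).
enum class Access { ReadOnly, WriteFirst, ReadThenWrite, ReductionLike };

const char *suggestClause(Access a) {
  switch (a) {
    case Access::ReadOnly:      return "shared";        // safe to share as-is
    case Access::WriteFirst:    return "private";       // fresh per-thread copy
    case Access::ReadThenWrite: return "firstprivate";  // needs copy-in
    case Access::ReductionLike: return "reduction";     // combine at the end
  }
  return "?";
}

int main() {
  std::printf("sum -> %s\n", suggestClause(Access::ReductionLike));
  std::printf("tmp -> %s\n", suggestClause(Access::WriteFirst));
}
```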
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
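In spirit (the API below is invented for illustration; see the NOELLE paper and repository for the real abstractions), the value of such a layer is that a custom tool consumes program-wide abstractions, such as per-loop dependence summaries, instead of recomputing them from raw LLVM IR.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct LoopSummary {
  bool hasLoopCarriedDataDependence;  // from a program-wide dependence graph
  long estimatedTripCount;            // from profiles or static heuristics
};

struct ProgramAbstractions {
  std::vector<LoopSummary> loops;  // would be computed once by the layer
};

int main() {
  // A tiny "custom tool": find profitable DOALL candidates.
  ProgramAbstractions p{{{false, 100000}, {true, 8}, {false, 16}}};
  for (std::size_t i = 0; i < p.loops.size(); ++i)
    if (!p.loops[i].hasLoopCarriedDataDependence &&
        p.loops[i].estimatedTripCount > 1000)
      std::printf("loop %zu: DOALL candidate\n", i);
}
```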
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
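A minimal sketch of what a compiler-based far-memory transformation amounts to (runtime names invented; TrackFM's real runtime manages objects across machines): the compiler rewrites dereferences of potentially-far pointers into guarded accesses that fetch the object on first touch.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using ObjId = std::uint64_t;

std::unordered_map<ObjId, void *> localCache;  // objects already resident

void *fetchFromFarMemory(ObjId id) {
  // Stand-in for a network fetch into local memory.
  void *p = new std::uint64_t(id);
  localCache[id] = p;
  return p;
}

// What the compiler emits in place of a plain dereference of a far pointer.
void *deref(ObjId id) {
  auto it = localCache.find(id);
  return it != localCache.end() ? it->second : fetchFromFarMemory(id);
}

int main() {
  auto *v = static_cast<std::uint64_t *>(deref(42));  // first touch fetches
  std::printf("%llu\n", static_cast<unsigned long long>(*v));
}
```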
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
|
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
|
Brian Suchy
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
CARAT: A Case for Virtual Memory through Compiler- And Runtime-based Address Translation International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Virtual memory is a critical abstraction in modern computer systems. Its common model, paging, is currently seeing considerable innovation, yet its implementations continue to be co-designs between power-hungry/latency-adding hardware (e.g., TLBs, pagewalk caches, pagewalkers, etc.) and software (the OS kernel). We make a case for a new model for virtual memory, compiler- and runtime-based address translation (CARAT), which instead is a co-design between the compiler and the OS kernel. CARAT can operate without any hardware support, although it could also be retrofitted into a traditional paging model, and could leverage simpler hardware support. CARAT uses compile-time transformations and optimizations combined with tightly-coupled runtime/kernel interaction to generate programs that run efficiently in a physical address space, but nonetheless allow the kernel to maintain protection and dynamically manage physical memory similar to what is possible using traditional virtual memory. We argue for the feasibility of CARAT through an empirical study of application characteristics and kernel behavior, as well as through the design, implementation, and performance evaluation of a CARAT prototype. Because our prototype works at the IR level (in particular, via LLVM bitcode), it can be applied to most C and C++ programs with minimal or no restrictions. |
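A minimal user-level sketch of the concept (names invented; the paper's prototype works on LLVM bitcode with a co-designed runtime): allocations are registered with the runtime, and the compiler can guard accesses with a tracking check instead of relying on MMU page protection.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Region { std::uintptr_t base; std::size_t size; };
static std::vector<Region> tracked;

void *carat_alloc(std::size_t n) {  // compiler-instrumented allocation
  void *p = std::malloc(n);
  tracked.push_back({reinterpret_cast<std::uintptr_t>(p), n});
  return p;
}

bool carat_check(const void *p) {   // guard the compiler may insert
  auto a = reinterpret_cast<std::uintptr_t>(p);
  for (const auto &r : tracked)
    if (a >= r.base && a < r.base + r.size) return true;
  return false;                     // a real system would involve the kernel
}

int main() {
  int *x = static_cast<int *>(carat_alloc(sizeof(int)));
  *x = 7;
  std::printf("tracked: %d\n", static_cast<int>(carat_check(x)));
  std::free(x);
}
```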
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
|
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different, and binary-incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims at exploiting parallelism even when dealing with non-explicitly parallel programs, and the advantages obtained by introducing portability. |
|
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed, for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
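The core scheduling decision reduces to a priority ordering over methods; a minimal sketch with invented numbers (the paper's criteria combine light-weight program analysis with execution tracing):

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Methods likely to run soon and expensive to compile are compiled ahead
// of execution, ordered by call probability times computational weight.
struct Method {
  const char *name;
  double callProbability;  // from execution tracing
  double weight;           // estimated computational weight
  double priority() const { return callProbability * weight; }
};

struct ByPriority {
  bool operator()(const Method &a, const Method &b) const {
    return a.priority() < b.priority();
  }
};

int main() {
  std::priority_queue<Method, std::vector<Method>, ByPriority> ready;
  ready.push({"Main.loop", 0.9, 50.0});
  ready.push({"Log.debug", 0.1, 5.0});
  ready.push({"Codec.decode", 0.7, 80.0});
  while (!ready.empty()) {  // compiler threads would pop from here
    std::printf("compile ahead: %s\n", ready.top().name);
    ready.pop();
  }
}
```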
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
A parallel dynamic compiler for CIL bytecode ACM SIGPLAN Notices, 2008 Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit all the potentialities of their hardware parallelism. At the same time, object code virtualization technologies are achieving a growing popularity, as they allow higher levels of software portability and reuse. Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and to exploit as much as possible the underlying parallel hardware. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler. Thanks to the compilation-latency-masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline, when the underlying hardware exposes at least two cores. |
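A minimal sketch of the pipeline organization (two stages and one queue; ILDJIT's actual pipeline has more stages and adaptive policies): one internal thread translates methods while another optimizes and emits them, so compilation latencies overlap across methods.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> stage1to2;
std::mutex m;
std::condition_variable cv;
bool done = false;

void frontend() {  // bytecode -> IR
  for (auto name : {"A", "B", "C"}) {
    std::lock_guard<std::mutex> g(m);
    stage1to2.push(name);
    cv.notify_one();
  }
  { std::lock_guard<std::mutex> g(m); done = true; }
  cv.notify_one();
}

void backend() {   // IR -> native code, overlapped with the frontend
  for (;;) {
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [] { return !stage1to2.empty() || done; });
    if (stage1to2.empty()) return;
    std::string name = stage1to2.front();
    stage1to2.pop();
    g.unlock();
    std::printf("emitted %s\n", name.c_str());
  }
}

int main() {
  std::thread t1(frontend), t2(backend);
  t1.join();
  t2.join();
}
```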
|
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
ILDJIT: a Parallel Dynamic Compiler International Conference on Very Large Scale Integration (VLSI-SoC), 2008 |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX: Making the Extraction of Thread-Level Parallelism Mainstream IEEE Computer Society Digital Library (IEEE Micro), 2012 Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers. |
|
Simone Campanoni
,
Timothy M. Jones
,
Glenn Holloway
,
Gu-Yeon Wei
, and
David M. Brooks
The HELIX Project: Overview and Directions Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism. |
Vito Kortbeek
,
Souradip Ghosh
,
Josiah Hester
,
Simone Campanoni
, and
Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing International Conference on Programming Language Design and Implementation (PLDI), 2022 Acceptance rate: 20.9% (68/326) Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks. |
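A minimal sketch of the underlying problem and the transformation's effect (checkpointing reduced here to copying two variables; real systems persist state to non-volatile memory): re-executing a read-modify-write after a power failure can double-apply it, so a checkpoint must sit between the read and the write that completes the WAR, and clustering the writes lets a single checkpoint cover several would-be violations.

```cpp
#include <cstdio>

static int x = 0, y = 0;  // imagine these live in non-volatile RAM
static int saved_x, saved_y;

void checkpoint() { saved_x = x; saved_y = y; }  // persist a recovery point

int main() {
  checkpoint();  // one checkpoint before a cluster of WAR-completing writes
  int tx = x + 1;  // reads...
  int ty = y + 2;
  x = tx;          // ...then the writes, grouped so they share the
  y = ty;          // checkpoint above instead of needing one each
  std::printf("%d %d\n", x, y);
}
```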
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adapt to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
|
Souradip Ghosh
,
Michael Cuevas
,
Simone Campanoni
, and
Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism High Performance Computing, Networking, Storage and Analysis (SC), 2020 Acceptance rate: 25.1% (95/378) In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model. |
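A minimal sketch of the mechanism (constants invented): the compiler injects a decrement-and-test at points whose interval it can bound, such as loop back-edges, and the injected code calls what would have been the timer interrupt handler.

```cpp
#include <cstdio>

static long budget = 1000;  // work allowed per "tick"

void timer_callback() {     // what an interrupt handler would have done
  budget = 1000;            // e.g., a scheduler could preempt here
}

int main() {
  long sum = 0;
  for (long i = 0; i < 100000; ++i) {
    sum += i;               // original loop body
    if (--budget == 0)      // compiler-injected timing check
      timer_callback();
  }
  std::printf("%ld\n", sum);
}
```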
Tommy McMichen
,
Nathan Greiner
,
Peter Zhong
,
Federico Sossai
,
Atmn Patel
, and
Simone Campanoni
Representing Data Collections in an SSA Form International Symposium on Code Generation and Optimization (CGO), 2024 Acceptance rate: 32.5% (37/114) Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program (e.g., through alias analysis), which further hampers heavyweight memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%. |
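A loose illustration of the representational shift (the instruction names below are invented, not MEMOIR's actual mnemonics): instead of raw loads and stores into a vector's buffer, each mutation defines a new SSA value of the whole collection, which is what makes element-level analyses such as dead element elimination possible.

```cpp
#include <cstdio>
#include <vector>

// What a collection-aware SSA form could look like for the code below:
//
//   %v1 = collection.alloc i32, 0
//   %v2 = collection.insert %v1, 0, 10   ; defines a new collection value
//   %x  = collection.read   %v2, 0       ; element-level analysis possible
//
// If %x were never used, a compiler could delete the insert entirely.
int main() {
  std::vector<int> v;  // what the source wrote; the IR sketch above is
  v.push_back(10);     // what a collection-aware compiler reasons about
  std::printf("%d\n", v[0]);
}
```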
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis. |
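A minimal sketch of handle-based indirection (names invented; Alaska's compiler inserts and optimizes the equivalent of h_deref automatically): every access goes through a handle table, so the runtime can move an object by updating a single table entry, with no pointer fixups in application data.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

static void *table[1024];  // the handle table: one slot per live object
static int nextFree = 0;

int h_alloc(std::size_t n) { table[nextFree] = std::malloc(n); return nextFree++; }
void *h_deref(int h) { return table[h]; }  // compiler-inserted lookup

void h_move(int h, std::size_t n) {        // relocation the runtime may do
  void *to = std::malloc(n);
  std::memcpy(to, table[h], n);
  std::free(table[h]);
  table[h] = to;                           // one table update, no fixups
}

int main() {
  int h = h_alloc(sizeof(int));
  *(int *)h_deref(h) = 7;
  h_move(h, sizeof(int));                  // object moved transparently
  std::printf("%d\n", *(int *)h_deref(h)); // still 7
}
```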
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory ArXiv, 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis. |
|
Simone Campanoni
, and
William Fornaciari
Multi-level Design and Optimization of Wireless Sensor Networks International Conference on Networked Sensing Systems (INSS), 2008 This paper proposes a methodology for the off-line planning of WSNs (wireless sensor networks) by addressing the problem in a multi-layer manner. At the sensor level, a model is described to properly select and distribute the sensors in the environment. To optimize the cost and deployment of realistic WSNs, further design activities are proposed at an intermediate level, targeting board-level clustering of sensors. Finally, a methodology is presented to hierarchically organize the set of sensors in patches with an additional gateway-level communication layer, to take into account also possible scaling of the application complexity. Particular emphasis is put on the cost modeling and on ensuring the correct behavior of the WSN against variation of parameters like sensor position, protocol, cost, etc. |
|
Simone Campanoni
, and
William Fornaciari
Node-Level Optimization of Wireless Sensor Networks International Conference on Wireless Communications, Networking and Mobile Computing (WiCom), 2008 A methodology for WSN planning should produce a suitable placement of sensors to simplify the acquisition of the data relevant for the application. The goal of this paper is to present a strategy to formally specify the system-level characteristics of the events to be monitored and to identify a proper set of sensor-position pairs, tailored to provide the required sensing capabilities. In particular, the focus is on node-level optimization, where the designer has to face the problem of clustering sensors onto the same board based upon a cost-performance tradeoff. |
|
Simone Campanoni
, and
William Fornaciari
Models and Tradeoffs in WSN System-Level Design Euromicro Symposium on Digital System Design (DSD), 2008 System-level design of WSNs includes the selection of the sensing nodes and their dissemination in the environment to be monitored. Many design choices have to be taken during this stage of the development of the application. The goal of this paper is to present a methodology to specify formally the desired behavior of the sensing application and to derive an optimal selection and placement of the network nodes. The approach is flexible and powerful, since it allows the designer to analyze the impact of clustering sensors onto a reduced set of boards (nodes) and to perform sensitivity analysis on parameters such as type of sensor, position of the nodes, observation time, presence of faults, etc. The paper introduces the concepts with some representative examples as well as by considering bigger use cases extracted from international projects. |
|
Simone Campanoni
, and
William Fornaciari
Ensuring Feasibility of Wireless Sensor Networks International Conference on Circuits and Systems for Communications (ICCSC), 2008 A comprehensive WSN (wireless sensor network) design methodology should devote enough effort to selecting a suitable set of sensors with a proper spatial distribution, in order to ensure correct monitoring of the parameters relevant for the application. The goal of this paper is to show how to ensure a priori that a given WSN is actually capable of satisfying the application requirements. In addition to this, the flexibility of the approach is also demonstrated by evaluating the sensitivity of the WSN performance to parameters like sensor distribution, observation time, and the events to be analyzed. This paper presents the overall design methodology implemented in SWORDFISH (sensor networks development framework integrating simulation and hardware optimization) and some representative case studies. |
|
Simone Campanoni
, and
William Fornaciari
SWORDFISH: A framework to formally design WSNs capturing events International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2007 A Wireless Sensor Network (WSN) consists of spatially distributed autonomous devices equipped with sensors and radio communication capabilities. They can act in a cooperative manner for monitoring environmental parameters like pressure, temperature, acceleration, light, humidity, etc. The typical WSN design problem is to discover a proper set of sensors and their spatial distribution, so as to enable the monitoring of relevant parameters for the application, while concurrently optimizing some design goals (e.g., cost, reliability, lifetime, and energy requirements). In this paper we present part of the SWORDFISH (Sensor netWORks Development Framework Integrating Simulation and Hardware optimization) project, aiming at providing a design and verification environment for WSNs, including in the design loop the simulation of physical events and a formal verification of the WSN's desired properties against the network optimization goals. In particular, we present the overall design framework, the approach to modeling properties and goals of the network, and some representative case studies showing the value of a toolsuite gathering WSN formal optimization/planning and environment simulation. |
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
|
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies designed specifically for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY, applied to instructions in the L2 cache, delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
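The flavor of the policy can be sketched in a few lines. The victim-selection loop below is a toy illustration, not the published EMISSARY algorithm: a line is marked costly when its miss starved the decoder, and eviction prefers the least-recently-used line among the non-costly ones, falling back to plain LRU only when every line is costly.

```cpp
// A toy cost-aware victim-selection loop (illustrative only; not the
// EMISSARY policy itself): lines whose misses caused decode starvation
// are marked "costly" and resist eviction.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Line {
  std::uint64_t tag = 0;
  bool valid = false;
  bool costly = false;       // a miss for this line starved the decoder
  std::uint64_t lastUse = 0; // for LRU ordering
};

// Pick a victim within one set: prefer the LRU line among non-costly
// lines; only if every line is costly, fall back to plain LRU.
int pickVictim(const std::vector<Line>& set) {
  int victim = -1, fallback = -1;
  for (int i = 0; i < (int)set.size(); i++) {
    if (!set[i].valid) return i; // free slot
    if (fallback < 0 || set[i].lastUse < set[fallback].lastUse)
      fallback = i;
    if (!set[i].costly &&
        (victim < 0 || set[i].lastUse < set[victim].lastUse))
      victim = i;
  }
  return victim >= 0 ? victim : fallback;
}

int main() {
  std::vector<Line> set(4);
  for (int i = 0; i < 4; i++) {
    set[i].tag = 100 + i;
    set[i].valid = true;
    set[i].costly = (i < 2); // ways 0 and 1 caused decode starvations
    set[i].lastUse = i;      // way 0 is globally least recently used
  }
  std::printf("victim way = %d\n", pickVictim(set)); // prints 2, not 0
}
```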
|
Simone Campanoni
,
Kevin Brownell
,
Svilen Kanev
,
Timothy M. Jones
,
Gu-Yeon Wei
, and
David M. Brooks
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs International Symposium on Computer Architecture (ISCA), 2014 Acceptance rate: 17.8% (46/258) Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks. IEEE Micro’s Top Picks in Computer Architecture Conferences honorable mention, 2014. Communication ACM Research Highlights (CACM), 2017 |
|
Simone Campanoni
,
Svilen Kanev
,
Kevin Brownell
,
Gu-Yeon Wei
, and
David M. Brooks
Breaking Cyclic-Multithreading Parallelization with XML Parsing International Workshop on Parallelism in Mobile Platforms (PRISM), 2014 |
|
Gu-Yeon Wei
,
David M. Brooks
,
Simone Campanoni
,
Kevin Brownell
, and
Svilen Kanev
Methods and apparatus for parallel processing US Patents, 2014 |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit from extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
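A sketch of the compiler-based idea follows, with hypothetical helper names (deref, farTier, localCache), not TrackFM's actual runtime interface: the compiler inserts a guard before each access through a possibly-far pointer, and the guard fetches the object from the far tier on demand.

```cpp
// Minimal sketch of a compiler-inserted far-memory guard (hypothetical
// helpers; not TrackFM's real interface). The "far tier" stands in for
// another machine's memory reached over the network.
#include <cstdio>
#include <cstring>
#include <unordered_map>
#include <vector>

std::unordered_map<int, std::vector<char>> farTier;    // remote objects
std::unordered_map<int, std::vector<char>> localCache; // resident copies

// The guard the compiler would insert before each possibly-far access.
template <typename T>
T* deref(int objId) {
  auto it = localCache.find(objId);
  if (it == localCache.end()) {
    // Miss: fetch the object from far memory (a network RPC in reality).
    it = localCache.emplace(objId, farTier.at(objId)).first;
  }
  return reinterpret_cast<T*>(it->second.data());
}

int main() {
  int x = 42;
  farTier[7] = std::vector<char>(sizeof(int));
  std::memcpy(farTier[7].data(), &x, sizeof(int));
  std::printf("%d\n", *deref<int>(7)); // first access faults the object in
  std::printf("%d\n", *deref<int>(7)); // second access hits locally
}
```

The design point the paper argues for is that static analysis can elide or hoist many such guards, which a kernel-based page-fault scheme cannot do and a library-based scheme pushes onto the programmer.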
|
Kyle C. Hale
,
Simone Campanoni
,
Nikos Hardavellas
, and
Peter Dinda
The Case for an Interwoven Parallel Hardware/Software Stack International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2021 The layered structure of the system software stacks we use today allows for separation of concerns and increases portability. However, the confluence of widely available virtualization and hardware partitioning technology, new OS techniques, rapidly changing hardware, and significant advances in compiler technology together present a ripe opportunity for restructuring the stack, particularly to support effective parallel execution. We argue that there are cases where layers, particularly the compiler, run-time, kernel, and hardware, should be interwoven, enabling new optimizations and abstractions. We present four examples where we have successfully applied this interweaving model of system design, and we outline several lines of promising ongoing work. |
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
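The heartbeat idea can be approximated in portable C++ (TPAL itself works at the assembly level with kernel and hardware interrupt support; the timer thread and std::async below are illustrative stand-ins, not the TPAL primitives): parallelism stays latent in a sequential loop, and only when a beat arrives does the loop pay to split its remaining iterations into a task.

```cpp
// Illustrative heartbeat-style promotion (TPAL uses assembly-level
// primitives and real interrupts; this portable approximation polls a
// flag set by a timer thread).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

std::atomic<bool> heartbeat{false};

long sumRange(const int* a, long lo, long hi) {
  long acc = 0;
  for (long i = lo; i < hi; i++) {
    // Parallelism stays latent; only on a beat do we pay the cost of
    // manifesting it, by forking off the remaining iterations.
    if (heartbeat.exchange(false) && hi - i > 1024) {
      long mid = i + (hi - i) / 2;
      auto rest = std::async(std::launch::async, sumRange, a, mid, hi);
      return acc + sumRange(a, i, mid) + rest.get();
    }
    acc += a[i];
  }
  return acc;
}

int main() {
  std::vector<int> v(1 << 22, 1);
  std::atomic<bool> done{false};
  std::thread timer([&] { // stand-in for a periodic heartbeat interrupt
    while (!done.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
      heartbeat.store(true);
    }
  });
  std::printf("%ld\n", sumRange(v.data(), 0, (long)v.size())); // 4194304
  done.store(true);
  timer.join();
}
```

The amortization argument is visible here: task-creation cost is paid at most once per beat, regardless of how many iterations run between beats.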
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024 Acceptance rate: 30.9% (38/123) Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1×, respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers. |
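The division of labor the abstract describes can be sketched as follows (hypothetical interface, not PROMPT's real API): the framework owns instrumentation and event delivery, and a profiler author contributes only core logic as event handlers. Here a toy client records read-after-write dependences from a replayed trace.

```cpp
// Hypothetical sketch of a PROMPT-style split (not PROMPT's real API):
// the framework delivers memory events; the profiler supplies only its
// core logic.
#include <cstdint>
#include <cstdio>
#include <set>
#include <unordered_map>
#include <utility>

struct RawDepProfiler {
  std::unordered_map<std::uint64_t, int> lastStore; // shadow: addr -> store site
  std::set<std::pair<int, int>> deps;               // (store site, load site)
  void onStore(std::uint64_t addr, int site) { lastStore[addr] = site; }
  void onLoad(std::uint64_t addr, int site) {
    auto it = lastStore.find(addr);
    if (it != lastStore.end()) deps.insert({it->second, site});
  }
};

int main() {
  // The framework's job (replayed here from a tiny hard-coded trace
  // instead of compiler-inserted instrumentation): feed events to handlers.
  struct Ev { bool isStore; std::uint64_t addr; int site; };
  const Ev trace[] = {{true, 0x10, 1}, {false, 0x10, 2}, {false, 0x18, 3}};
  RawDepProfiler p;
  for (const Ev& e : trace)
    e.isStore ? p.onStore(e.addr, e.site) : p.onLoad(e.addr, e.site);
  for (const auto& d : p.deps)
    std::printf("RAW: store@%d -> load@%d\n", d.first, d.second);
}
```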
|
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited because the compiler lacks facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
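The output shape is the point here: by emitting standard OpenMP rather than tool-specific annotations, the decompiled code stays portable across compilers and remains human-editable. The snippet below only illustrates that output shape; it is not SPLENDID's verbatim output for any particular input.

```cpp
// Illustration of the kind of portable parallel source a decompiler-based
// collaboration hands back to the programmer: plain OpenMP that any
// OpenMP-capable compiler accepts and any programmer can refine.
#include <cstdio>

int main() {
  const int n = 1 << 20;
  static double a[1 << 20];
  double sum = 0.0;
  for (int i = 0; i < n; i++) a[i] = 1.0;
  // The parallelizing compiler proved this loop to be a reduction;
  // emitting it as a standard pragma keeps that fact visible and editable.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < n; i++) sum += a[i];
  std::printf("%.0f\n", sum); // 1048576
}
```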
|
Ziyang Xu
,
Yebin Chon
,
Yian Su
,
Zujun Tan
,
Sotiris Apostolakis
,
Simone Campanoni
, and
David I. August
PROMPT: A Fast and Extensible Memory Profiling Framework ArXiv, 2023 Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported to PROMPT with all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x, respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers. |
|
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
, Greg Chan,
Simone Campanoni
, and
David I. August
SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework International Conference on Programming Language Design and Implementation (PLDI), 2020 Acceptance rate: 22.6% (77/341) Program analysis determines the potential dataflow and control flow relationships among instructions so that compiler optimizations can respect these relationships to transform code correctly. Since many of these relationships rarely or never occur, speculative optimizations assert they do not exist while optimizing the code. To preserve correctness, speculative optimizations add validation checks to activate recovery code when these assertions prove untrue. This approach results in many missed opportunities because program analysis and thus other optimizations remain unaware of the full impact of these dynamically-enforced speculative assertions. To address this problem, this paper presents SCAF, a Speculation-aware Collaborative dependence Analysis Framework. SCAF learns of available speculative assertions via profiling, computes their full impact on memory dependence analysis, and makes this resulting information available for all code optimizations. SCAF is modular (adding new analysis modules is easy) and collaborative (modules cooperate to produce a result more precise than the confluence of all individual results). Relative to the best prior speculation-aware dependence analysis technique, by computing the full impact of speculation on memory dependence analysis, SCAF dramatically reduces the need for expensive-to-validate memory speculation in the hot loops of all 16 evaluated C/C++ SPEC benchmarks. |
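The kind of answer a speculation-aware query returns can be sketched as follows (hypothetical types, not SCAF's interface): rather than a bare may-depend bit, the analysis can report that a dependence disappears provided a profiled assertion is validated at run time, letting clients weigh parallelization benefit against validation cost.

```cpp
// Sketch of a speculation-aware dependence query (hypothetical types;
// not SCAF's interface): the answer can carry the speculative assertion
// under which the dependence vanishes.
#include <cstdio>
#include <optional>
#include <string>

struct Assertion {
  std::string check;   // what the runtime must validate
  double profiledRate; // observed misspeculation rate
};

struct DepAnswer {
  bool mayDepend;                  // conservative static answer
  std::optional<Assertion> unless; // speculation that removes the dep
};

DepAnswer queryDep(int storeId, int loadId) {
  // Static analysis alone must answer "may depend"; the profile says the
  // two accesses never aliased, so speculation can remove the dependence.
  (void)storeId; (void)loadId;
  return {true, Assertion{"no-alias(store#1, load#2)", 0.0}};
}

int main() {
  DepAnswer a = queryDep(1, 2);
  if (a.mayDepend && a.unless)
    std::printf("parallelize if runtime validates: %s (misspec %.1f%%)\n",
                a.unless->check.c_str(), a.unless->profiledRate * 100);
}
```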
|
Sotiris Apostolakis
,
Ziyang Xu
, Greg Chan,
Simone Campanoni
, and
David I. August
Perspective: A Sensible Approach to Speculative Automatic Parallelization International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020 Acceptance rate: 18.1% (86/476) The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines a novel speculation-aware memory analyzer, new efficient speculative privatization methods, and a planning phase to select a minimal-cost set of parallelization-enabling transforms. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory commodity machine) than Privateer (11.5x), the prior automatic DOALL parallelization system with the highest applicability. |
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations. |
|
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Software-Assisted Hardware Reliability: Abstracting Circuit-level Challenges to the Software Stack Design Automation Conference (DAC), 2009 Acceptance rate: 21.7% (148/682) Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance. |
|
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
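At user level, the essence of the idea can be sketched with two hypothetical helpers (tracked_alloc and guard; not the CARAT CAKE implementation): the compiler rewrites allocations so the runtime knows every region, and inserts guards so protection checks happen in software rather than in a TLB, leaving addresses physical.

```cpp
// Sketch of compiler/runtime address tracking (hypothetical helpers;
// not the CARAT CAKE implementation): allocations are registered and
// accesses are policed in software instead of via page tables.
#include <cstdio>
#include <cstdlib>
#include <map>

// Runtime-side tracking table: region start -> extent in bytes.
std::map<char*, std::size_t> regions;

void* tracked_alloc(std::size_t n) { // compiler rewrites malloc to this
  void* p = std::malloc(n);
  regions[(char*)p] = n;
  return p;
}

bool guard(void* p) { // compiler inserts before (some) memory accesses
  auto it = regions.upper_bound((char*)p);
  if (it == regions.begin()) return false;
  --it; // the region starting at or below p
  return (char*)p < it->first + it->second; // inside a tracked region?
}

int main() {
  int* a = (int*)tracked_alloc(4 * sizeof(int));
  a[0] = 1;
  std::printf("in-bounds access ok: %d\n", (int)guard(&a[0]));    // 1
  std::printf("stray access ok: %d\n", (int)guard((void*)0x10));  // 0
}
```

As in the paper's argument, most of the engineering lies in proving guards redundant so they can be removed, and in making the transformation cover all code, including the kernel's own.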
|
Souradip Ghosh
,
Michael Cuevas
,
Simone Campanoni
, and
Peter Dinda
Compiler-based Timing For Extremely Fine-grain Preemptive Parallelism High Performance Computing, Networking, Storage and Analysis (SC), 2020 Acceptance rate: 25.1% (95/378) In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model. |
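The effect of the transformation can be sketched as follows (illustrative; in the real system the compiler injects and sizes these checks so that every control-flow path polls at a predictable, periodic rate): work is metered by a counter, and a callback fires when the budget is spent, replacing the hardware timer interrupt.

```cpp
// Sketch of compiler-based timing (illustrative; the real system injects
// these callbacks via whole-program compiler transformation): roughly
// every GRANULARITY units of estimated work, control calls back into the
// scheduler instead of waiting for a timer interrupt.
#include <cstdio>

constexpr long GRANULARITY = 100000; // estimated work units per callback
long workBudget = GRANULARITY;
long callbacks = 0;

void timerCallback() { callbacks++; /* a scheduler would run here */ }

// The compiler inserts decrements sized to each region's estimated cost,
// placed so every path through the code checks at a predictable rate.
inline void consume(long estimatedCost) {
  workBudget -= estimatedCost;
  if (workBudget <= 0) { workBudget = GRANULARITY; timerCallback(); }
}

int main() {
  long sum = 0;
  for (long i = 0; i < 10'000'000; i++) {
    sum += i;
    consume(1); // injected by the compiler, not written by hand
  }
  std::printf("sum=%ld callbacks=%ld\n", sum, callbacks); // 100 callbacks
}
```

This is what enables granularity far below a hardware timer quantum: the poll is a subtract-and-branch, and its period is a compile-time knob.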
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
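For reference, this is the input shape HBC targets: an ordinary nested OpenMP program over an irregular workload (an illustrative example, not taken from the paper). With a conventional runtime, the cost of spawning tasks for the inner loop on short rows is exactly the granularity problem the compiler is meant to control automatically.

```cpp
// Illustrative irregular, nested fork-join input (not from the paper):
// row i has i elements, so any fixed manual chunking fits some rows badly.
#include <cstdio>
#include <vector>

int main() {
  const int n = 2000;
  std::vector<std::vector<int>> rows(n);
  for (int i = 0; i < n; i++) rows[i].assign(i, 1);
  long total = 0;
  #pragma omp parallel for reduction(+ : total) schedule(dynamic)
  for (int i = 0; i < n; i++) {
    long s = 0;
    // Nested logical parallelism: cheap on short rows only if the
    // runtime or compiler controls task-creation cost.
    #pragma omp parallel for reduction(+ : s)
    for (int j = 0; j < (int)rows[i].size(); j++) s += rows[i][j];
    total += s;
  }
  std::printf("%ld\n", total); // n*(n-1)/2 = 1999000
}
```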
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, Alaska, that acts as a drop-in replacement for malloc. Without any programmer effort, the Alaska compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of Alaska on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use Alaska to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis. |
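The mechanism is a classic level of indirection, sketched below with hypothetical names (halloc, pin, relocate; not the Alaska API): programs hold stable handles, the runtime holds the only raw addresses, so objects can move and only one table entry needs patching.

```cpp
// Sketch of handle-based indirection (hypothetical names; not the Alaska
// API): a handle is an index into a table owned by the runtime, so the
// runtime may move the object and update the table behind the program.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

std::vector<void*> table; // handle -> current object address

int halloc(std::size_t n) { // returns a handle, not a pointer
  table.push_back(std::malloc(n));
  return (int)table.size() - 1;
}

// The compiler rewrites each pointer dereference into a table lookup.
template <typename T> T* pin(int h) { return (T*)table[h]; }

// The runtime can relocate any object (e.g., to defragment the heap)
// because no application code caches the raw address across the move.
void relocate(int h, std::size_t n) {
  void* fresh = std::malloc(n);
  std::memcpy(fresh, table[h], n);
  std::free(table[h]);
  table[h] = fresh;
}

int main() {
  int h = halloc(sizeof(int));
  *pin<int>(h) = 41;
  relocate(h, sizeof(int)); // the object moves; the handle stays valid
  *pin<int>(h) += 1;
  std::printf("%d\n", *pin<int>(h)); // 42
}
```

The optimization burden then shifts to keeping objects "pinned" across as few instructions as possible while hoisting redundant lookups, which is what makes a compiler the natural place to introduce the indirection.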
|
Nicholas Wanninger
,
Tommy McMichen
,
Simone Campanoni
, and
Peter Dinda
Getting a Handle on Unmanaged Memory ArXiv, 2024 The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to reduce performance impact. A codesigned runtime system manages this level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through compaction, reducing memory usage by up to 40% in Redis. |
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
Ettore M. G. Trainiti,
Simone Campanoni
, and
Doug Downey
Compiler-based neuron-aware deep neural network ensemble training US Patents, 2022 |
|
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
|
Jeremiah Giordani,
Ziyang Xu
, Ella Colby,
August Ning
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Shaowei Zhu
,
Yebin Chon
, Greg Chan,
Zujun Tan
,
Galen Collier
,
Jonathan D. Halverson
,
Enrico Armenio Deiana
,
Jasper Liang
,
Federico Sossai
,
Yian Su
,
Atmn Patel
,
Bangyen Pham
,
Nathan Greiner
,
Simone Campanoni
, and
David I. August
Revisiting Computation for Research: Practices and Trends High Performance Computing, Networking, Storage and Analysis (SC), 2024 Acceptance rate: 22.7% (102/449) |
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
Brian Homerding
,
Tommy McMichen
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions International Symposium on Code Generation and Optimization (CGO), 2022 Acceptance rate: 28.7% (27/94) Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it. |
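For orientation, below is a plain LLVM module pass (new pass manager) with program-wide scope; a NOELLE-based tool would start from a skeleton like this and then query NOELLE's higher-level abstractions (loops, dependences) instead of re-deriving them. The sketch uses only standard LLVM APIs; NOELLE's own interface is deliberately not reproduced here.

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// A minimal whole-program analysis pass: walk every function in the module.
struct CountInstsPass : PassInfoMixin<CountInstsPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    unsigned N = 0;
    for (Function &F : M)      // program-wide scope: every function
      for (BasicBlock &BB : F)
        N += BB.size();        // instructions per basic block
    errs() << "module " << M.getName() << " has " << N << " instructions\n";
    return PreservedAnalyses::all();  // analysis only, nothing changed
  }
};
```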
|
Angelo Matni
,
Enrico Armenio Deiana
,
Yian Su
,
Lukas Gross
,
Souradip Ghosh
,
Sotiris Apostolakis
,
Ziyang Xu
,
Zujun Tan
,
Ishita Chaturvedi
,
David I. August
, and
Simone Campanoni
NOELLE Offers Empowering LLVM Extensions ArXiv, 2021 Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy to extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE. |
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo,
Simone Campanoni
, Jose Ayala, and
Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015 Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used. |
|
Filippo Sironi
,
Davide B. Bartolini
,
Simone Campanoni
,
Fabio Cancare
,
Henry Hoffmann
,
Donatella Sciuto
, and
Marco D. Santambrogio
Metronome: Operating System Level Performance Management via Self-Adaptive Computing Design Automation Conference (DAC), 2012 Acceptance rate: 22.7% (168/741) In this paper, we present Metronome: a framework to enhance commodity operating systems with self-adaptive capabilities. The Metronome framework features two distinct components: Heart Rate Monitor (HRM) and Performance-Aware Fair Scheduler (PAFS). HRM is an active monitoring infrastructure implementing the observe phase of a self-adaptive computing system's Observe-Decide-Act (ODA) control loop, while PAFS is an adaptation policy implementing the decide and act phases of the control loop. Metronome was designed and developed with multi-core processors in mind; therefore, its experimental evaluation was carried out with the PARSEC 2.1 benchmark suite. |
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
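The kind of question PSEC answers, sketched in C++/OpenMP: which program state elements a region reads, writes, or reduces, which in turn determines the data-sharing clauses. The clauses below are hand-annotated to illustrate what a CARMOT-style recommendation would target; the loop itself is an invented example.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> in(1 << 20, 1.0), out(in.size());
    double scale = 2.0, sum = 0.0;

    // PSEC-style characterization of this region:
    //   in    -> read-only            (safe to share)
    //   out   -> disjoint writes      (safe to share)
    //   scale -> read-only            (firstprivate is one valid choice)
    //   sum   -> read-modify-write    (needs a reduction)
    #pragma omp parallel for firstprivate(scale) reduction(+ : sum)
    for (size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * scale;
        sum += out[i];
    }
    std::printf("%f\n", sum);
}
```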
|
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), that holds by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
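A sketch of the WARD property in plain C++ threads (an HLPL runtime would create these tasks itself): during the fork-join phase, tasks write disjoint locations and no task reads another task's writes before the join, so coherence traffic during the phase is unnecessary. WAW apathy would additionally allow overlapping writes whose order does not matter; this example shows only the RAW dependence-freedom half.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<int> a(1000);
    auto fill = [&](size_t lo, size_t hi) {
        for (size_t i = lo; i < hi; ++i) a[i] = (int)i;  // disjoint writes
    };
    std::thread t1(fill, 0, 500), t2(fill, 500, 1000);   // fork
    t1.join();
    t2.join();                                           // join
    // Reads happen only after the join: the absence of cross-task RAW
    // dependences during the parallel phase is what lets WARDen disable
    // coherence for that phase.
    std::printf("%d %d\n", a[0], a[999]);
}
```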
Michele Tartara
,
Simone Campanoni
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Parallelism and Retargetability in the ILDJIT Dynamic Compiler Architecture of Computing Systems (ARCS), 2010 Modern computer architectures are becoming increasingly parallel with each generation. At the same time, new, different, and binary-incompatible platforms are becoming more and more widespread on the market. This paper presents ILDJIT, a Just-In-Time compiler that aims to exploit parallelism even when dealing with non-explicitly parallel programs, and discusses the advantages obtained by introducing portability. |
|
Michele Tartara
,
Stefano Crespi Reghizzi
, and
Simone Campanoni
Extending Hammocks for Parallelism Detection Italian Conference on Theoretical Computer Science (ICTCS), 2010 |
|
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
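A user-level sketch of the heartbeat idea, assuming a software pacer thread in place of TPAL's interrupt-based, assembly-level primitives: work runs sequentially, and only when a beat has fired does the loop promote its remaining iterations into a parallel task, amortizing task-creation cost over the work between beats. All names and the beat period are illustrative.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> beat{false};

long long sum_range(long long lo, long long hi) {
    long long acc = 0;
    for (long long i = lo; i < hi; ++i) {
        if (beat.exchange(false) && hi - i > 1024) {  // promotion point
            long long mid = i + (hi - i) / 2;
            long long right = 0;
            std::thread t([&] { right = sum_range(mid, hi); });  // spawn half
            acc += sum_range(i, mid);
            t.join();
            return acc + right;
        }
        acc += i;  // fast sequential path: no task-creation overhead
    }
    return acc;
}

int main() {
    std::thread pacer([] {  // emits a "beat" every 100 microseconds
        for (int k = 0; k < 1000; ++k) {
            std::this_thread::sleep_for(std::chrono::microseconds(100));
            beat.store(true);
        }
    });
    std::printf("%lld\n", sum_range(0, 50'000'000));
    pacer.join();
}
```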
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first family of cost-aware cache replacement policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
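A toy sketch of the core EMISSARY intuition; the set structure, insertion heuristics, and probabilistic details of the real policies are simplified away. Lines whose misses caused a decode starvation are marked costly and protected, and eviction prefers unmarked victims.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

struct Line { uint64_t tag = 0; bool valid = false; bool costly = false; };

struct Set {
    std::array<Line, 8> ways{};

    // starved = this miss stalled the decode stage (observed by the core).
    void insert(uint64_t tag, bool starved) {
        for (Line &w : ways)                 // 1) fill an invalid way first
            if (!w.valid) { w = {tag, true, starved}; return; }
        for (Line &w : ways)                 // 2) prefer an unmarked victim,
            if (!w.costly) { w = {tag, true, starved}; return; }
        ways[0] = {tag, true, starved};      // 3) all ways costly: fall back
    }
    bool lookup(uint64_t tag) const {
        for (const Line &w : ways)
            if (w.valid && w.tag == tag) return true;
        return false;
    }
};

int main() {
    Set s;
    s.insert(0x1000, true);                  // miss caused decode starvation
    for (uint64_t t = 1; t <= 16; ++t)       // a storm of cheap misses
        s.insert(t, false);
    std::printf("costly line retained: %d\n", s.lookup(0x1000));  // prints 1
}
```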
|
Bhargav Reddy Godala
,
Nayana Prasad Nagendra
,
Ishita Chaturvedi
,
Simone Campanoni
, and
David I. August
Modern Front-end Support in gem5 International Workshop on Gem5 (GEM5), 2023 |
Niall Murphy,
Timothy M. Jones
,
Robert Mullins
, and
Simone Campanoni
Performance Implications of Transient Loop-Carried Data Dependences in Automatically Parallelized Loops International Conference on Compiler Construction (CC), 2016 Acceptance rate: 31.2% (24/77) Recent approaches to automatic parallelization have taken advantage of the low-latency on-chip interconnect provided in modern multicore processors, demonstrating significant speedups, even for complex workloads. Although these techniques can already extract significant thread-level parallelism from application loops, we are interested in quantifying and exploiting any additional performance that remains on the table. This paper confirms the existence of significant extra thread-level parallelism within loops parallelized by the HELIX compiler. However, improving static data dependence analysis is unable to reach the additional performance offered because the existing loop-carried dependences are true only on a small subset of loop iterations. We therefore develop three approaches to take advantage of the transient nature of these data dependences through speculation, via transactional memory support. Results show that coupling the state-of-the-art data dependence analysis with fine-grained speculation achieves most of the speedups and may help close the gap towards the limit of HELIX-style thread-level parallelism. |
|
Niall Murphy,
Timothy M. Jones
,
Simone Campanoni
, and
Robert Mullins
Limits of Static Dependence Analysis for Automatic Parallelization International Workshop on Compilers for Parallel Computing (CPC), 2015 |
|
Peter Zhong
,
Shu-Hung You
,
Simone Campanoni
,
Robert Bruce Findler
, and
Christos Dimoulas
A Calculus for Unreachable Code ArXiv, 2024 In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively. |
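In C and C++, the annotations the calculus models appear as the GCC/Clang builtin __builtin_unreachable() (or std::unreachable() in C++23). A small example of the contract:

```cpp
#include <cstdio>

int classify(int x) {
    // Caller promises x is in 0..2; the compiler may delete the range check
    // entirely and optimize the switch under that assumption. If the hint
    // is wrong, behavior is undefined.
    if (x < 0 || x > 2) __builtin_unreachable();
    switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
    }
    __builtin_unreachable();  // all cases covered: no fall-through path
}

int main() { std::printf("%d\n", classify(1)); }
```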
Yuanbo Fan
,
Simone Campanoni
, and
Russ Joseph
Time Squeezing for Tiny Devices International Symposium on Computer Architecture (ISCA), 2019 Acceptance rate: 17.0% (62/365) Dynamic timing slack has emerged as a compelling opportunity for eliminating inefficiency in ultra-low power embedded systems. This slack arises when all the signals have propagated through logic paths well in advance of the clock signal. When it is properly identified, the system can exploit this unused cycle time for energy savings. In this paper, we describe a compiler and architecture co-design that opens new opportunities for timing slack that are otherwise impossible. Through cross-layer optimization, we introduce novel mechanisms in the hardware and in the compiler that work together to improve the benefit of circuit-level timing speculation by effectively squeezing time during execution. This approach is particularly well-suited to tiny embedded devices. Our evaluation on a gate-level model of a complete processor shows that our co-design saves (on average) 40.5% of the original energy consumption (an additional 16.5% compared to the existing clock scheduling technique) across 13 workloads while retaining transparency to developers. |
|
Yuanbo Fan
,
Tianyu Jia
,
Jie Gu
,
Simone Campanoni
, and
Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors Design Automation Conference (DAC), 2018 Acceptance rate: 24.3% (168/691) Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize the benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select the clock period at the cycle level. |
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Unconventional Parallelization of Nondeterministic Applications International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018 Acceptance rate: 17.6% (56/319) The demand for thread-level parallelism (TLP) on commodity processors is endless, as it is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic and multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform. |
|
Enrico Armenio Deiana
,
Vincent St-Amour
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
The Liberation Day of Nondeterministic Programs International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017 The demand for thread-level parallelism (TLP) is endless, especially on commodity processors, as TLP is essential for gaining performance. However, the TLP of today's programs is limited by dependences that must be satisfied at run time. We have found that for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, therefore boosting the program's TLP. We show how these dependences (which we call "state dependences" because they are related to the program's state) can be exploited using algorithm-specific knowledge. To demonstrate the practicality of our technique, we implemented a system called April25th that incorporates the concept of "state dependences". This system boosts the performance of five nondeterministic, multi-threaded PARSEC benchmarks by 100.5%. |
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise in Production Processors IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems. |
|
Vijay Janapa Reddi
,
Svilen Kanev
,
Wonyoung Kim
,
Simone Campanoni
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-guided Thread Scheduling International Symposium on Microarchitecture (MICRO), 2010 Acceptance rate: 18.1% (45/248) Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core 2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4 percent, far from the processor's 14 percent worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today's processors, such resilient designs could yield 15 percent to 20 percent performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs. IEEE Micro’s Top Picks in Computer Architecture Conferences, 2011 |
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
|
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Yucan Wu
,
Simone Campanoni
, and
David I. August
QPoints: QEMU to gem5 ARM Full System Checkpointing International Workshop on Gem5 (GEM5), 2023 |
Brian Suchy
,
Souradip Ghosh
,
Aaron Nelson
, Zhen Huang,
Drew Kersnar
,
Siyuan Chai
,
Michael Cuevas
,
Alex Bernat
,
Gaurav Chaudhary
,
Nikos Hardavellas
,
Simone Campanoni
, and
Peter Dinda
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022 Acceptance rate: 20.2% (80/397) Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management. |
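A sketch of the compiler/runtime cooperation a CARAT-style system relies on: the compiler instruments allocations and pointer escapes with runtime calls so the kernel can track, protect, and move memory without page tables. The carat_track_* hooks below are hypothetical stand-ins, not the actual CARAT runtime API.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical runtime hooks a CARAT-style compiler would insert.
extern "C" void carat_track_alloc(void *p, size_t n) {
    std::printf("track alloc %p (%zu bytes)\n", p, n);
}
extern "C" void carat_track_escape(void **loc) {
    std::printf("track pointer stored at %p\n", (void *)loc);
}

int main() {
    int *buf = (int *)std::malloc(64 * sizeof(int));
    carat_track_alloc(buf, 64 * sizeof(int));  // inserted after allocation
    static int *global_ref;
    global_ref = buf;
    carat_track_escape((void **)&global_ref);  // inserted where the pointer escapes
    std::free(buf);
    return 0;
}
```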
|
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework is interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Simone Campanoni
,
Giovanni Agosta
,
Stefano Crespi Reghizzi
, and
Andrea Di Biagio
A highly flexible, parallel virtual machine: design and experience of ILDJIT Software: Practice and Experience (SPE), 2010 |
Brian R. Tauro
,
Brian Suchy
,
Simone Campanoni
,
Peter Dinda
, and
Kyle C. Hale
TrackFM: Far-out Compiler Support for a Far Memory World International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits. |
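The compiler-based far-memory idea in miniature: every dereference of a "far pointer" is rewritten into a residency check, fetching the object into local memory on first use. FarPtr, runtime_fetch, and deref are hypothetical names, not TrackFM's actual runtime; a real system batches and optimizes these guards rather than emitting the naive check shown here.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct FarPtr {
    void  *local;      // cached local copy; nullptr if not resident
    long   remote_id;  // handle naming the object in the far tier
    size_t size;
};

// Hypothetical runtime fetch: stands in for pulling bytes over the network.
static void *runtime_fetch(long id, size_t n) {
    void *p = std::malloc(n);
    std::memset(p, (int)id, n);  // pretend these bytes came from far memory
    return p;
}

// The guard a TrackFM-style compiler would inline before each dereference.
static void *deref(FarPtr &fp) {
    if (!fp.local)                                        // residency check
        fp.local = runtime_fetch(fp.remote_id, fp.size);  // slow path: fetch
    return fp.local;                                      // fast path
}

int main() {
    FarPtr fp{nullptr, 42, 16};
    unsigned char *bytes = (unsigned char *)deref(fp);
    std::printf("%u\n", bytes[0]);  // prints 42, the simulated remote data
    std::free(fp.local);
}
```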
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
Ling Jin,
Yinzhi Cao
,
Yan Chen
, Di Zhang, and
Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities IEEE Transactions on Dependable and Secure Computing (TDSC), 2023 Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called EXGEN, that automatically generates exploits for smart contract vulnerabilities across contract platforms. |
Georgios Tziantzioulis
,
Nikos Hardavellas
, and
Simone Campanoni
Temporal Approximate Function Memoization IEEE Micro's special issue on Approximate Computing, 2018 Improving the performance of applications is a core target of computer systems research and has led to the creation of various techniques. Among them is function memoization, an intuitive technique that, unfortunately, failed to live up to its promised potential. Traditional function memoization falls short mainly because it is input-based, meaning the system needs to pattern-match the current invocation's inputs with previously seen ones to decide which memoized output value to return. This is often computationally intensive and only suitable for a limited set of functions. However, function calls often exhibit temporal value locality on their output. We capitalize on this observation to create the first output-based function memoization technique. We demonstrate that compiler-directed output-based memoization reduces energy and runtime by 50 to 75 percent (2-4x speedup) at the cost of a relatively small degradation of the application's output quality. |
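A minimal sketch of the output-based idea, assuming a hypothetical expensive() function and a fixed recomputation interval (the paper's compiler decides when to recompute; this only illustrates the tradeoff):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical expensive function whose output changes slowly across calls.
double expensive(double x) { return std::exp(std::sin(x)); }

// Output-based memoization: recompute only every Nth call and otherwise
// return the last output, trading output quality for time and energy.
double expensive_memo(double x) {
    static double last_output = 0.0;
    static int calls = 0;
    if (calls++ % 4 == 0)        // recompute every 4th call (illustrative N)
        last_output = expensive(x);
    return last_output;          // reuse: temporal value locality on outputs
}

int main() {
    for (double x = 0.0; x < 1.0; x += 0.1)
        std::printf("f(%.1f) ~ %.4f\n", x, expensive_memo(x));
}
```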
Alessandro A. Nacci, Gianluca C. Durelli, Josue Pagan, Marina Zapater, Matteo Ferroni, Riccardo Cattaneo, Monica Vallejo,
Simone Campanoni
, Jose Ayala, and
Marco D. Santambrogio
Power-Awareness and Smart-Resource Management in Embedded Computing Systems International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015 Resources such as quantities of transistors and memory, the level of integration, and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used. |
Nayana Prasad Nagendra
,
Bhargav Reddy Godala
,
Ishita Chaturvedi
,
Atmn Patel
,
Svilen Kanev
,
Tipp Moseley
,
Jared Stark
,
Gilles A. Pokam
,
Simone Campanoni
, and
David I. August
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching International Symposium on Computer Architecture (ISCA), 2023 Acceptance rate: 21.2% (79/372) For decades, architects have designed cache replacement policies to reduce cache misses. Since not all cache misses affect processor performance equally, researchers have also proposed cache replacement policies focused on reducing the total miss cost rather than the total miss count. However, all prior cost-aware replacement policies have been proposed specifically for data caching and are either inappropriate or unnecessarily complex for instruction caching. This paper presents EMISSARY, the first cost-aware cache replacement family of policies specifically designed for instruction caching. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY resists evicting those cache lines whose misses cause costly decode starvations. In the context of a modern processor with fetch-directed instruction prefetching and other aggressive front-end features, EMISSARY applied to L2 cache instructions delivers an impressive 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on widely used server applications with large code footprints. This speedup is 21.6% of the total speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses. |
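A software analogy of the policy's core preference, under the assumption (mine, for illustration) that each line carries one priority bit recording whether its miss starved the decoder; this is not EMISSARY's hardware design:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Line {
    uint64_t tag = 0;
    bool valid = false;
    bool priority = false;  // set when this line's miss starved the decoder
    uint64_t last_use = 0;  // LRU timestamp
};

// One cache set. On a miss, prefer to evict invalid lines, then the LRU
// non-priority line, touching priority lines only as a last resort.
struct Set {
    std::vector<Line> ways;
    uint64_t clock = 0;
    explicit Set(int assoc) : ways(assoc) {}

    bool access(uint64_t tag, bool miss_starved_decode) {
        for (auto& l : ways)
            if (l.valid && l.tag == tag) { l.last_use = ++clock; return true; }
        auto victim = std::min_element(ways.begin(), ways.end(),
            [](const Line& a, const Line& b) {
                if (a.valid != b.valid) return !a.valid;          // invalid first
                if (a.priority != b.priority) return !a.priority; // then non-priority
                return a.last_use < b.last_use;                   // then LRU
            });
        *victim = Line{tag, true, miss_starved_decode, ++clock};
        return false;
    }
};

int main() {
    Set s(2);
    s.access(1, true);   // miss; line 1 marked priority (starved decode)
    s.access(2, false);  // miss; line 2 unmarked
    s.access(3, false);  // miss; evicts line 2, resisting eviction of line 1
    std::printf("hit(1)=%d\n", s.access(1, false));  // prints hit(1)=1
}
```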
Jiacheng Ma
,
Wenyi Wang
,
Aaron Nelson
,
Michael Cuevas
,
Brian Homerding
, Conghao Liu, Zhen Huang,
Simone Campanoni
,
Kyle C. Hale
, and
Peter Dinda
Paths to OpenMP in the Kernel High Performance Computing, Networking, Storage and Analysis (SC), 2021 Acceptance rate: 26.8% (98/365) OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks. |
Yuanbo Fan
,
Tianyu Jia
,
Jie Gu
,
Simone Campanoni
, and
Russ Joseph
Compiler-guided instruction-level clock scheduling for timing speculative processors Design Automation Conference (DAC), 2018 Acceptance rate: 24.3% (168/691) Despite the significant promise that circuit-level timing speculation has for enabling operation in marginal conditions, overheads associated with recovery prove to be a serious drawback. We show that fine-grained clock adjustment guided by the compiler can be used to stretch and shrink the clock to maximize the benefits of timing speculation and reduce the overheads associated with recovery. We present a formulation for compiler-driven clock scheduling and explore the benefits in several scenarios. Our results show that there are significant opportunities to exploit timing slack when there are appropriate channels for the compiler to select the clock period at cycle granularity. |
Zujun Tan
,
Yebin Chon
,
Michael Kruse
,
Johannes Doerfert
,
Ziyang Xu
,
Brian Homerding
,
Simone Campanoni
, and
David I. August
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023 Acceptance rate: 26.7% (72/270) Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score. |
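For flavor, here is the kind of portable, natural C/OpenMP source such a decompiler aims to emit for a parallelized loop nest; this is an illustrative example, not actual SPLENDID output:

```cpp
#include <cstdio>

// A human-readable parallel matrix-vector product: standard OpenMP,
// compilable by any OpenMP-capable compiler (build with -fopenmp).
int main() {
    const int N = 512;
    static double A[512][512], x[512], y[512];
    for (int i = 0; i < N; ++i) {
        x[i] = 1.0;
        for (int j = 0; j < N; ++j) A[i][j] = 0.001 * (i + j);
    }
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int j = 0; j < N; ++j) s += A[i][j] * x[j];
        y[i] = s;
    }
    std::printf("y[0]=%f y[N-1]=%f\n", y[0], y[N - 1]);
}
```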
Vito Kortbeek
,
Souradip Ghosh
,
Josiah Hester
,
Simone Campanoni
, and
Przemysław Pawełczak
WARio: Efficient Code Generation for Intermittent Computing International Conference on Programming Language Design and Implementation (PLDI), 2022 Acceptance rate: 20.9% (68/326) Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect from errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure---and therefore, inefficient---when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks. |
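The WAR problem and the benefit of clustering writes fit in a few lines. A minimal sketch, assuming non-volatile globals and a hypothetical checkpoint() primitive (real systems insert checkpoints via the compiler):

```cpp
#include <cstdio>

int a = 1, b = 2;        // imagine these live in non-volatile main memory
void checkpoint() {}     // hypothetical: persists registers and volatile state

void step() {
    int t = a + b;       // reads a and b
    // Writing a after reading it is a WAR dependence: if power fails after
    // the write, re-execution from the last checkpoint would read the new a
    // and diverge. A checkpoint must therefore precede the write.
    checkpoint();
    a = t;               // WAR on a
    b = t;               // WAR on b; because the two writes are adjacent,
                         // one checkpoint covers both. WARio's transformations
                         // move writes together to enable exactly this saving.
}

int main() { step(); std::printf("a=%d b=%d\n", a, b); }
```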
Enrico Armenio Deiana
,
Brian Suchy
,
Michael Wilkins
,
Brian Homerding
,
Tommy McMichen
,
Katarzyna Dunajewski
,
Peter Dinda
,
Nikos Hardavellas
, and
Simone Campanoni
Program State Element Characterization International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. |
Khalid Al-Hawaj,
Simone Campanoni
,
Gu-Yeon Wei
, and
David M. Brooks
Unified Cache: A Case for Low Latency Communication International Workshop on Parallelism in Mobile Platforms (PRISM), 2015 |
Vijay Janapa Reddi
,
Simone Campanoni
,
Meeta S. Gupta
,
Kim Hazelwood
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Eliminating Voltage Emergencies via Software-Guided Code Transformation ACM Transactions on Architecture and Code Optimization (TACO), 2010 In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach to circumvent the issue altogether by optimizing for the worst-case voltage fluctuations, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations. |
Ishita Chaturvedi
,
Bhargav Reddy Godala
,
Yucan Wu
,
Ziyang Xu
, Konstantinos Iliakis, Panagiotis-Eleftherios Eleftherakis,
Sotirios Xydis
,
Dimitrios Soudris
,
Tyler Sorensen
,
Simone Campanoni
,
Tor M. Aamodt
, and
David I. August
GhOST: a GPU Out-of-Order Scheduling Technique for stall reduction International Symposium on Computer Architecture (ISCA), 2024 Acceptance rate: 21.2% (79/372) Graphics Processing Units (GPUs) use massive multi-threading coupled with static scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge as their latencies vary throughout the application’s execution, leading to stalls. Out-of-order (OoO) execution has been shown to effectively mitigate these types of stalls. However, prior OoO proposals involve costly techniques such as reordering loads and stores, register renaming, or two-phase execution, amplifying implementation overhead and consequently creating a substantial barrier to adoption in GPUs. This paper introduces GhOST, a minimal yet effective OoO technique for GPUs. Without expensive components, GhOST can manifest a substantial portion of the instruction reorderings found in an idealized OoO GPU. GhOST leverages the decode stage’s existing pool of decoded instructions and the existing issue stage’s information about instructions in the pipeline to select instructions for OoO execution with little additional hardware. A comprehensive evaluation of GhOST and the prior state-of-the-art OoO technique across a range of diverse GPU benchmarks yields two surprising insights: (1) Prior works utilized Nvidia’s intermediate representation PTX for evaluation; however, the optimized static instruction scheduling of the final binary form negates many purported improvements from OoO execution; and (2) The prior state-of-the-art OoO technique results in an average slowdown across this set of benchmarks. In contrast, GhOST achieves a 36% maximum and 6.9% geometric mean speedup on GPU binaries with only a 0.007% area increase, surpassing previous techniques without slowing down any of the measured benchmarks. |
Vijay Janapa Reddi
,
Meeta S. Gupta
,
Krishna K. Rangan
,
Simone Campanoni
,
Glenn Holloway
,
Michael D. Smith
,
Gu-Yeon Wei
, and
David M. Brooks
Voltage Noise: Why It’s Bad, and What To Do About It International Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009 |
Ling Jin,
Yinzhi Cao
,
Yan Chen
, Di Zhang, and
Simone Campanoni
EXGEN: Cross-platform, Automated Exploit Generation for Smart Contract Vulnerabilities IEEE Transactions on Dependable and Secure Computing (TDSC), 2023
Smart contracts, just like other computer programs, are prone to a variety of vulnerabilities, which lead to severe consequences including massive token and coin losses. Prior works have explored automated exploit generation for vulnerable Ethereum contracts. However, the scopes of prior works are limited in both vulnerability types and contract platforms. In this paper, we propose a cross-platform framework, called EXGEN. |
Simone Campanoni
, and Luca Rocchini
Static Memory Management within Bytecode Languages on Multicore Systems Workshop on Computing in Heterogeneous, Autonomous ’N’ Goal-oriented Environments (CHANGE), 2011 Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The pipeline model is appealing for the compilation tasks that dynamic compilers need to perform, but it can introduce deadlocks when static memories are used by the running program. This research proposes a solution that both solves this problem and reduces the number of unnecessary compiler threads used to handle static memories. The proposed solution is a self-aware runtime system that both detects and avoids deadlocks and adapts the number of compilation threads needed to handle static memories to the current workload. |
Simone Campanoni
,
Martino Sykora
,
Giovanni Agosta
, and
Stefano Crespi Reghizzi
Dynamic Look Ahead Compilation: a technique to hide JIT compilation latencies in multicore environment International Conference on Compiler Construction (CC), 2009 Acceptance rate: 25.0% (18/72) Object-code virtualization, commonly used to achieve software portability, relies on a virtual execution environment, typically comprising an interpreter used for initial execution of methods, and a JIT for native code generation. The availability of multiple processors on current architectures makes it attractive to perform dynamic compilation in parallel with application execution. The major issue is to decide at runtime which methods to compile ahead of execution, and how much time to invest in their optimization. This research introduces an abstract model, termed Dynamic Look Ahead (DLA) compilation, which represents the available information on method calls and computational weight as a weighted graph. The graph dynamically evolves as computation proceeds. The model is then instantiated by specifying criteria for adaptively choosing the method compilation order. The DLA approach has been applied within our dynamic compiler for .NET. Experimental results are reported and analyzed for both synthetic programs and benchmarks. The main finding is that a careful choice of method-selection criteria, based on light-weight program analysis and execution tracing, is essential to mask compilation times and to achieve higher overall performance. On multi-processors, the DLA approach is expected to challenge the traditional virtualization environments based on bytecode interpretation and JITing, thus bridging the gap between ahead-of-time and just-in-time translation. |
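A toy rendering of the DLA idea, with hypothetical names: a compiler thread repeatedly picks the heaviest not-yet-compiled method from a weight-ordered work list (in the real system the weights evolve as the call graph is discovered at run time):

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

struct Method { std::string name; double weight; };  // predicted hotness
struct ByWeight {
    bool operator()(const Method& a, const Method& b) const {
        return a.weight < b.weight;                   // max-heap on weight
    }
};

int main() {
    std::priority_queue<Method, std::vector<Method>, ByWeight> ready;
    // In a real runtime these weights would evolve as calls are observed.
    ready.push({"Main.run", 10.0});
    ready.push({"Parser.next", 42.0});
    ready.push({"Log.flush", 1.5});
    while (!ready.empty()) {                          // compiler thread's loop
        std::printf("JIT-compiling %s (weight %.1f)\n",
                    ready.top().name.c_str(), ready.top().weight);
        ready.pop();
    }
}
```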
Michael Leonard
, and
Simone Campanoni
Introducing the Pseudorandom Value Generator Selection in the Compilation Toolchain International Symposium on Code Generation and Optimization (CGO), 2020 Acceptance rate: 27.4% (26/95) As interest in randomization has grown within the computing community, the number of pseudorandom value generators (PRVGs) at developers' disposal dramatically increased. Today, developers lack the tools necessary to obtain optimal behavior from their PRVGs. We provide the first deep study into the tradeoffs among the PRVGs in the C++ standard, finding no silver bullet for all programs and architectures. With this in mind, we have built PRV Jeeves, the first fully automatic PRVG selector. We demonstrate that when compiling widely-used, highly optimized programs with PRV Jeeves, we are able to cut execution time by 34% on average. This enhancement comes at no cost to developers. |
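The tradeoff space is easy to reproduce with the engines in the C++ standard's <random> header; a minimal benchmark sketch (timings will vary by program and architecture, which is the paper's point):

```cpp
#include <chrono>
#include <cstdio>
#include <random>

// Time one PRVG engine drawing 10M uniform ints; engines differ in state
// size, speed, and statistical quality.
template <class Engine>
void time_engine(const char* name) {
    Engine eng(42);
    std::uniform_int_distribution<int> dist(0, 99);
    auto t0 = std::chrono::steady_clock::now();
    long long sink = 0;
    for (int i = 0; i < 10'000'000; ++i) sink += dist(eng);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%-16s %.3fs (sink=%lld)\n", name, dt.count(), sink);
}

int main() {
    time_engine<std::minstd_rand>("minstd_rand");  // tiny state, fast
    time_engine<std::mt19937>("mt19937");          // large state, high quality
    time_engine<std::ranlux48>("ranlux48");        // highest quality, slow
}
```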
Yian Su
,
Mike Rainey
,
Nicholas Wanninger
, Nadharm Dhiantravan,
Jasper Liang
,
Umut A. Acar
,
Peter Dinda
, and
Simone Campanoni
Compiling Loop-Based Nested Parallelism for Irregular Workloads International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs. |
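A small OpenMP example of the problem HBC targets: logically nested parallel loops over irregular data, where manual granularity control (here the if() clause, illustrative only) is what programmers tune by hand today:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int N = 256;
    std::vector<std::vector<int>> rows(N);
    for (int i = 0; i < N; ++i) rows[i].assign(i % 17 + 1, 1);  // irregular sizes

    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < N; ++i) {
        long long s = 0;
        int m = (int)rows[i].size();
        // The inner loop is logically parallel too, but its iterations are
        // tiny; spawning tasks here usually costs more than it saves. The
        // if() threshold is the kind of manual tuning HBC automates.
        #pragma omp parallel for reduction(+ : s) if (m > 4096)
        for (int j = 0; j < m; ++j) s += rows[i][j];
        total += s;
    }
    std::printf("total=%lld\n", total);
}
```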
Peter Zhong
,
Shu-Hung You
,
Simone Campanoni
,
Robert Bruce Findler
, and
Christos Dimoulas
A Calculus for Unreachable Code ArXiv, 2024 In Racket, the LLVM IR, Rust, and other modern languages, programmers and static analyses can hint, with special annotations, that certain parts of a program are unreachable. As with other assumptions about undefined behavior, the compiler assumes these hints are correct and transforms the program aggressively. |
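For a concrete instance of such a hint, GCC and Clang expose __builtin_unreachable(); reaching it is undefined behavior, so the compiler optimizes under the assumption that it never executes:

```cpp
#include <cstdio>

int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    if (x == 0) return 0;
    // Hint: every int falls into one of the cases above, so control can
    // never reach here; the compiler may drop fall-through handling.
    __builtin_unreachable();
}

int main() { std::printf("%d\n", sign(-7)); }
```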
Mike Rainey
,
Kyle C. Hale
,
Ryan Newton
,
Nikos Hardavellas
,
Simone Campanoni
,
Peter Dinda
, and
Umut A. Acar
Task Parallel Assembly Language for Uncompromising Parallelism International Conference on Programming Language Design and Implementation (PLDI), 2021 Acceptance rate: 27.2% (87/320) Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability. |
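A sequential C++ sketch of the heartbeat idea only (no interrupts and no TPAL primitives; the names are mine): work proceeds serially, and only at each beat do we pay the cost of manifesting latent parallelism by splitting off the remaining half of the range, which a real system could hand to another core:

```cpp
#include <cstdio>
#include <deque>
#include <utility>

using Range = std::pair<long, long>;  // [lo, hi)

int main() {
    const int HEARTBEAT = 1000;       // amortize: split at most once per beat
    std::deque<Range> latent;         // latent parallelism: split-off ranges
    latent.push_back({0, 1'000'000});
    long long sum = 0;
    int since_beat = 0;
    while (!latent.empty()) {
        auto [lo, hi] = latent.front(); latent.pop_front();
        for (long i = lo; i < hi; ++i) {
            sum += i;                          // the useful sequential work
            if (++since_beat == HEARTBEAT) {   // the "beat" fires
                since_beat = 0;
                long mid = (i + 1 + hi) / 2;
                if (hi - (i + 1) > 1 && mid < hi) {
                    latent.push_back({mid, hi});  // could go to another core
                    hi = mid;                     // keep the first half here
                }
            }
        }
    }
    std::printf("sum=%lld\n", sum);   // 499999500000 regardless of splits
}
```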
Michael Wilkins
,
Sam Westrick
,
Vijay Kandiah
,
Alex Bernat
,
Brian Suchy
,
Enrico Armenio Deiana
,
Simone Campanoni
,
Umut A. Acar
,
Peter Dinda
, and
Nikos Hardavellas
WARDen: Specializing Cache Coherence for High-Level Parallel Languages International Symposium on Code Generation and Optimization (CGO), 2023 Acceptance rate: 39.2% (20/51) High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages. |
Ettore M. G. Trainiti,
Thanapon Noraset
, David Demeter,
Doug Downey
, and
Simone Campanoni
CODE: Compiler-Based Neuron-Aware Ensemble Training Machine Learning and Systems (MLSys), 2021 Acceptance rate: 24.1% (52/216) Deep Neural Networks (DNNs) are redefining the state-of-the-art performance in a variety of tasks like speech recognition and image classification. These impressive results are often enabled by ensembling many DNNs together. Surprisingly, ensembling is often done by training several DNN instances from scratch and combining them. This paper shows that there is significant redundancy in today’s way of ensembling. The novelty we propose is CODE, a compiler approach designed to automatically generate DNN ensembles while avoiding unnecessary retraining among its DNNs. For this purpose, CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose the ensemble. Removing redundancy enables CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques. These savings can be leveraged by CODE to increase the output quality of its ensembles. |
Xiaochun Zhang,
Timothy M. Jones
, and
Simone Campanoni
Quantifying the Semantic Gap Between Serial and Parallel Programming International Symposium on Workload Characterization (IISWC), 2021 Acceptance rate: 39.6% (19/48) Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system. |
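One familiar extension of this kind is the OpenMP reduction clause; a minimal illustration (not from the paper) of how an annotation conveys developer knowledge a compiler cannot infer:

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 0.5);
    double sum = 0.0;
    // A compiler alone must preserve the sequential order of these
    // floating-point additions and so cannot parallelize the loop. The
    // reduction clause encodes the developer's knowledge that reassociating
    // the additions is acceptable, closing the semantic gap.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)v.size(); ++i) sum += v[i];
    std::printf("sum=%f\n", sum);
}
```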
Zhenpeng Lin
,
Zheng Yu
,
Ziyi Guo
,
Simone Campanoni
,
Peter Dinda
, and
Xinyu Xing
CAMP: Compiler and Allocator-based Heap Memory Protection USENIX Security Symposium (USENIX Security), 2024 |