Research Interests


Parallel systems, computer architecture, microarchitecture, nanophotonics-based computer architectures, memory systems, design for dark silicon, memory-oriented system design, and quantum computer systems.




Current Research

Please see the PARAG@N web site.


Earlier Research - Carnegie Mellon University


Research performed as a member of CALCM and Databases @ CMU. The performance of modern multicore processors is shaped by three trends:

1.     Increasing wire delays force processors to become distributed, dispersing the cores and the cache across the die area and making cache block placement a determinant of the processor's performance.

2.     On-chip transistor counts increase exponentially, but this increase does not translate directly into performance improvement; rather, processors must allocate transistors optimally among components (e.g., cores, caches) to attain maximum performance while remaining within physical constraints (e.g., power, area, bandwidth).

3.     Conventional software is still hampered by arbitrarily complex data access and sharing patterns that inhibit most hardware optimizations. To fully realize the potential of modern multicore processors, the software must be redesigned for the new hardware landscape.

To address these trends, my research proceeds along three synergistic fronts: (a) hardware designs that optimize for fast data accesses and use the on-chip transistors efficiently, (b) scalable parallel software architectures that mitigate the rising on-chip data latencies, and (c) scalable performance evaluation techniques.

a)    Scalable Hardware: cache designs, transistor-efficient multicore designs and memory systems.

Cache Designs. To optimize for data placement on chip, we developed R-NUCA (Reactive Non-Uniform Cache Access). R-NUCA is a distributed cache architecture that places blocks on chip based on the observation that cache accesses can be classified into distinct classes, where each class lends itself to a different placement policy. Fast lookup is provided by Rotational Interleaving, an indexing scheme that affords the fast lookup of conventional address interleaving while allowing cache block replication and migration. Finally, through intelligent cache block placement, R-NUCA obviates the need for hardware coherence at the last-level cache, greatly simplifying the design and improving scalability. [IEEE Micro Top Pick 2010]
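As a rough illustration of class-based placement, the sketch below picks a cache slice for a block according to its access class. The three classes (instructions, private data, shared data) follow the R-NUCA publications; the tile count, cluster size, function names, and the simple modulo mapping inside fixed instruction clusters are assumptions made for this example and do not reproduce Rotational Interleaving.

```python
from enum import Enum

NUM_TILES = 16        # assumed: 16 tiles, one cache slice per tile
CLUSTER_SIZE = 4      # assumed: instructions shared within a 4-tile cluster
BLOCK_BITS = 6        # assumed: 64-byte cache blocks


class AccessClass(Enum):
    INSTRUCTION = "instruction"
    PRIVATE_DATA = "private"
    SHARED_DATA = "shared"


def home_slice(addr: int, requester: int, cls: AccessClass) -> int:
    """Return the tile whose slice should hold the block (illustrative only)."""
    block = addr >> BLOCK_BITS
    if cls is AccessClass.PRIVATE_DATA:
        # Private data stay in the requesting core's local slice.
        return requester
    if cls is AccessClass.SHARED_DATA:
        # Shared data have a single address-interleaved location on chip,
        # so no replicas exist that would need to be kept coherent.
        return block % NUM_TILES
    # Instructions: one copy per cluster, interleaved inside the cluster.
    # Simplified stand-in: fixed clusters of consecutive tile ids.
    cluster_base = (requester // CLUSTER_SIZE) * CLUSTER_SIZE
    return cluster_base + block % CLUSTER_SIZE
```

For example, under these assumptions a private block requested by tile 5 maps to slice 5, while a shared block maps to the same slice regardless of which tile requests it.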

Transistor-Efficient Multicores. To use the abundant on-chip transistors efficiently, we developed ADviSE (Analytic Design-Space Exploration), a collection of performance, area, bandwidth, power and thermal analytical models for multicore processors. ADviSE suggests design rules across process technologies that optimize for a given metric (e.g., throughput, power efficiency) and allocates transistors judiciously among components, leading to near-optimal designs.

Memory Systems. To tame the off-chip data latency, we developed STeMS (Spatio-Temporal Memory Streaming), a memory system in which data move in correlated streams, rather than in individual cache blocks. STeMS is based on the observation that applications execute repetitive code sequences that result in recurring data access sequences ("streams"), which can be used to predict future requests and prefetch their data.
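The idea that a recurring access sequence can predict future requests can be sketched in a few lines. The toy predictor below only replays the addresses that followed a miss the last time it was seen; the class name, history size, and prefetch depth are hypothetical, and the sketch is not the STeMS hardware design.

```python
from collections import deque


class StreamPrefetcher:
    """Toy predictor: replay the blocks that followed an address last time."""

    def __init__(self, history_size=1024, depth=4):
        self.history = deque(maxlen=history_size)  # recent miss addresses
        self.depth = depth                          # addresses to predict

    def on_miss(self, addr):
        """Record a miss and return the addresses predicted to follow it."""
        past = list(self.history)
        prediction = []
        # Search backwards for the most recent earlier occurrence of addr.
        for i in range(len(past) - 1, -1, -1):
            if past[i] == addr:
                prediction = past[i + 1 : i + 1 + self.depth]
                break
        self.history.append(addr)
        return prediction
```

An address seen for the first time yields an empty prediction; once a sequence repeats, the addresses that followed it previously are returned as prefetch candidates.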

b)   Scalable Parallel Software. To overcome the limitations of conventional software, we developed Data-Oriented Staging, a software architecture that decomposes otherwise single-threaded requests into parallel tasks and partitions their data logically across the cores. The logical data partitioning (a) transforms data that were shared across requests and slow to access into core-private data with fast local access times, and (b) renders data sharing patterns within a parallel request predictable, enabling prefetch mechanisms to hide data access latencies. To show the feasibility of the design, we developed a prototype staged database system as part of the StagedDB-CMP project [Best Demonstration Award at ICDE 2006]. We then analyzed the behavior of modern database systems, identified their bottlenecks on modern hardware that favors scalability over single-thread performance, and proposed optimizations to overcome these limitations. The resulting system, Shore-MT, influenced the design of subsequent DBMSs. [10-Year Test-of-Time Award at EDBT 2019]
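A minimal sketch of logical data partitioning, assuming a request is a list of (key, operation) actions and that keys are assigned to cores by a simple modulo rule. Both the request format and the partitioning rule are assumptions for illustration; a real system could partition by key range or by table instead.

```python
from collections import defaultdict

NUM_CORES = 4  # assumed core count


def owner_core(key):
    """Map a record key to the core that logically owns its partition."""
    return key % NUM_CORES


def decompose(request):
    """Split one request, a list of (key, op) actions, into per-core tasks."""
    tasks = defaultdict(list)
    for key, op in request:
        # Every action on a given key is routed to the same core, so that
        # core can treat the record as private data.
        tasks[owner_core(key)].append((key, op))
    return dict(tasks)
```

Because every access to a given key lands on the same core, data that would otherwise be shared across requests becomes core-private in this toy model.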

c)    Scalable Performance Evaluation Techniques. The growing size and complexity of modern hardware make software simulators prohibitively slow, barring researchers from evaluating their designs on commercial-grade workloads and large-scale systems. To overcome this limitation, we developed FLEXUS, a cycle-accurate full-system simulation infrastructure that reduces simulation turnaround through statistical sampling. FLEXUS has been adopted by several research groups, has been the primary infrastructure for computer architecture courses, and is currently in its third public release.
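The statistical reasoning behind sampled simulation can be sketched as follows: measure only a random subset of short execution windows and report the mean with a confidence interval. The function name, the list of per-window CPI values, and the normal-approximation interval (z = 1.96) are assumptions for this example; it does not use or describe the FLEXUS API.

```python
import math
import random
import statistics


def estimate_cpi(window_cpis, sample_size, z=1.96, seed=0):
    """Estimate mean CPI from a random sample of measurement windows.

    Returns (mean, half_width) of an approximate 95% confidence interval,
    assuming the sample mean is roughly normally distributed.
    Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    sample = rng.sample(window_cpis, sample_size)
    mean = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, half_width
```

In an actual sampled simulation only the chosen windows would be simulated in detail; here `window_cpis` stands in for those measurements. The half-width shrinks with the square root of the sample size, which is why a modest number of short windows can replace a full detailed run.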


Even Earlier Research - University of Rochester


Cashmere: a software distributed-shared-memory system over low-latency remote-memory-access networks like DEC's Memory Channel. Cashmere utilizes the hardware coherence within a multiprocessor, provides software coherence across nodes in a cluster and allows memory scaling through remote memory paging.


Carnival: a tool for the characterization and analysis of waiting time and communication patterns of parallel shared-memory applications. Through cause-effect analysis, Carnival detects performance bottlenecks of individual code fragments.



Prior Life Research - Industry (Digital, Compaq, Hewlett-Packard)


Alpha Processors and High-Performance Multiprocessor Systems: While affiliated with Digital Equipment Corp., Compaq Computer Corp., and Hewlett-Packard, I was a member of the design team of high-end enterprise servers. I contributed to the Alpha EV6 (21264), EV7 (21364), and EV8 (21464) generations of microprocessors, and had a fleeting relationship with the Piranha multicore. The rest of my time was spent working on a number of AlphaServers in the Titan, WildFire, and Marvel families: the Marvel (GS-1280), WildFire (GS-320), and Privateer (ES-45) multiprocessor systems. My work focused on memory hierarchy and multiprocessor system design, including adaptive and multi-level cache coherence protocols, migratory data optimizations, novel caching schemes, Rambus modeling and optimizations, link retraining, flow control, directory caches, routing, and system topology. I also worked on the design and development of full-system execution-driven, trace-driven, and statistical simulators, as well as tools and techniques for performance analysis.