This is a selected list of my publications. Look me up at Google Scholar for another view on my publications.
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand for GPU computational power. Emerging workloads contain hundreds or thousands of GPU kernel launches, which incur high overheads, and exhibit data-dependent behavior between kernels, which requires synchronization, leading to GPU under-utilization. Task-based execution models have been proposed to solve these issues, but they require significant programmer effort to port applications to proprietary task-based programming models in order to specify tasks and task dependencies. To address this need, we propose BlockMaestro, a software-hardware solution that combines command queue reordering, kernel-launch-time static analysis, and runtime hardware support to dynamically identify and resolve thread-block level data dependencies between kernels. Through static analysis of memory access patterns at kernel-launch-time, BlockMaestro can extract inter-kernel thread block-level data dependencies. BlockMaestro also introduces kernel pre-launching to reduce the kernel launch overheads experienced by multiple dependent kernels. Correctness is enforced by dynamically resolving thread block-level data dependency at runtime through hardware support. BlockMaestro achieves an average speedup of 51.76% (up to 2.92x) on data-dependent benchmarks, and requires minimal hardware overhead.
The Register File (RF) is a critical structure in Graphics Processing Units (GPUs) responsible for a large portion of the area and power. To simplify the architecture of the RF, it is organized in a multi-bank configuration with a single port for each bank. Not surprisingly, the frequent accesses to the register file during kernel execution incur a sizeable overhead in GPU power consumption, and introduce delays as accesses are serialized when port conflicts occur. In this paper, we observe that there is a high degree of temporal locality in accesses to the registers: within short instruction windows, the same registers are often accessed repeatedly. We characterize the opportunities to reduce register accesses as a function of the size of the instruction window considered, and establish that there are many recurring reads and updates of the same register operands in most GPU computations. To exploit this opportunity, we propose Breathing Operand Windows (BOW), an enhanced GPU pipeline and operand collector organization that supports bypassing register file accesses and instead passes values directly between instructions within the same window. Our baseline design can only bypass register reads; we introduce an improved design capable of also bypassing unnecessary write operations to the RF. We introduce compiler optimizations to help guide the write-back destination of operands depending on whether they will be reused to further reduce the write traffic. To reduce the storage overhead, we analyze the occupancy of the bypass buffers and discover that we can significantly down size them without losing performance. BOW along with optimizations reduces dynamic energy consumption of the register file by 55% and increases the performance by 11%, with a modest overhead of 12KB increase in the size of the operand collectors (4% of the register file size).
In many emerging applications such as deep learning, large data set is essential to generate reliable solutions. In these big data workloads, memory latency and bandwidth are the main performance bottlenecks. In this paper, we propose a locality-aware GPU register file that enables data sharing for memory-intensive big data workloads on GPUs without relying on small on-chip memories. We exploit two types of data sharing patterns commonly found from the big data workloads and have warps opportunistically share data in physical registers instead of issuing memory loads separately and storing the same data redundantly in their registers as well as small shared memory. With an extended register file mapping mechanism, our proposed design enables warps to share data by simply mapping to the same physical registers or reconstructing from the data in the register file already. The proposed sharing not only reduces the memory transactions but also further decreases the register file usage. The spared registers make rooms for applying orthogonal optimizations for energy and performance improvement. Our evaluation on two deep learning workloads and matrixMul show that the proposed locality-aware GPU register file achieves over 2× speedup and saves register space up to 57%.
The Register File (RF) in GPUs is a critical structure that maintains the state for thousands of threads that support the GPU processing model. The RF organization substantially aﬀects the overall performance and the energy efciency of a GPU. For example, the frequent accesses to the RF consume a substantial amount of the dynamic energy, and port contention due to limited ports on operand collectors and register fle banks aﬀect performance as register operations are serialized. We present CORF, a compiler-assisted Coalescing Operand Register File which performs register coalescing by combining reads to multiple registers required by a single instruction, into a single physical read. To enable register coalescing, CORF utilizes register packing to co-locate narrow-width operands in the same physical register. CORF uses compiler hints to identify which register pairs are commonly accessed together. CORF saves dynamic energy by reducing the number of physical register fle accesses, and improves performance by combining read operations, as well as by reducing pressure on the register fle. To increase the coalescing opportunities, we re-architect the physical register fle to allow coalescing reads across diﬀerent physical registers that reside in mutually exclusive sub-banks; we call this design CORF++. The compiler analysis for register allocation for CORF++ becomes a form of graph coloring called the bipartite edge frustration problem. CORF++ reduces the dynamic energy of the RF by 17%, and improves IPC by 9%.
Dynamic neural networks enable higher represen- tation flexibility compared to networks with a fixed architecture and are extensively deployed in problems dealing with vary- ing input-induced network structure, such as those in Natural Language Processing. One of the optimizations used in training networks is persistency of recurrent weights on the chip. In dynamic nets, a possibly-inhomogeneous computation graph for every input prevents caching recurrent weights in GPU registers. Therefore, existing solutions suffer from excessive recurring off-chip memory loads as well as compounded kernel launch overheads and underutilization of GPU SMs. In this paper, we present a software system that enables persistency of weight matrices during the training of dynamic neural networks on the GPU. Before the training begins, our ap- proach named Virtual Persistent Processor Specialization (VPPS) specializes a forward-backward propagation kernel that contains in-register caching and operation routines. VPPS virtualizes persistent kernel CTAs as CISC-like vector processors that can be guided to execute supplied instructions. VPPS greatly reduces the overall amount of off-chip loads by caching weight matrices on the chip, while simultaneously, provides maximum portability as it does not make any assumptions about the shape of the given computation graphs hence fulfilling dynamic net requirements. We implemented our solution on DyNet and abstracted away its design complexities by providing simple function calls to the user. Our experiments on a Volta micro-architecture shows that, unlike the most competitive solutions, VPPS shows excellent performance even in small batch sizes and delivers up to 6x speedup on training dynamic nets.
Registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads. Due to existence of a great number of concurrently executing threads, and the high cost of context switching mechanisms, contemporary GPUs are equipped with large register files. However, to avoid over-complicating the hardware, registers are statically assigned and exclusively dedicated to threads for the entire duration of the thread’s lifetime. This decomposition takes into account the maximum number of live registers at any given point in the GPU binary although the points at which all the requested registers are used may constitute only a small fraction of the whole program. Therefore, a considerable portion of the register file remains under-utilized. In this paper, we propose a software-hardware comechanism named RegMutex (Register Mutual Exclusion) to share a subset of physical registers between warps during the GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While physical registers corresponding to the base register set are statically and exclusively assigned to the warp, the hardware time-shares the remaining physical registers across warps to provision their extended register set. Therefore, the GPU programs can sustain approximately the same performance with the lower number of registers hence yielding higher performance per dollar. For programs that require a large number of registers for execution, RegMutex will enable a higher number of concurrent warps to be resident in the hardware via sharing their register allocations with each other, leading to a higher device occupancy. Since some aspects of register sharing orchestration are being offloaded to the compiler, RegMutex introduces lower hardware complexity compared to existing approaches. Our experiments show that RegMutex improves the register utilization and reduces the number of execution cycles by up to 23% for kernels demanding a high number of registers.
Recently, side-channel attacks on Last Level Caches (LLCs) were demonstrated. The attacks require the ability to evict critical data from the cache hierarchy, making future accesses visible. We propose Relaxed Inclusion Caches (RIC), a low-complexity cache design protecting against LLC side channel attacks. RIC relaxes inclusion when it is not needed, preventing the attacker from replacing the victim’s data from the local core caches thus protecting critical data from leakage. RIC improves performance (by about 10%) and retains snoop filtering capabilities of inclusive cache hierarchies, while requiring only minimal changes to the cache.