This is a selected list of my publications. Look me up on Google Scholar for another view of my publications.
The Register File (RF) in GPUs is a critical structure that maintains the state for thousands of threads that support the GPU processing model. The RF organization substantially affects the overall performance and the energy efficiency of a GPU. For example, the frequent accesses to the RF consume a substantial amount of the dynamic energy, and port contention due to limited ports on operand collectors and register file banks affects performance as register operations are serialized. We present CORF, a compiler-assisted Coalescing Operand Register File which performs register coalescing by combining reads to multiple registers required by a single instruction into a single physical read. To enable register coalescing, CORF utilizes register packing to co-locate narrow-width operands in the same physical register. CORF uses compiler hints to identify which register pairs are commonly accessed together. CORF saves dynamic energy by reducing the number of physical register file accesses, and improves performance by combining read operations, as well as by reducing pressure on the register file. To increase the coalescing opportunities, we re-architect the physical register file to allow coalescing reads across different physical registers that reside in mutually exclusive sub-banks; we call this design CORF++. The compiler analysis for register allocation for CORF++ becomes a form of graph coloring called the bipartite edge frustration problem. CORF++ reduces the dynamic energy of the RF by 17%, and improves IPC by 9%.
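As a rough illustration of the register packing idea described above, the following Python sketch models two narrow operands sharing one physical register so that a single physical read serves both. It is a toy software model, not the CORF hardware design: the 16-bit operand width, the 32-bit register width, and every name in it (`pack`, `unpack`, `RegisterFile`, `add_unpacked`, `add_coalesced`) are assumptions made for this example.

```python
MASK16 = 0xFFFF          # assumed width of a "narrow" operand
MASK32 = 0xFFFFFFFF      # assumed width of a physical register


def pack(lo, hi):
    """Place two 16-bit operands into one 32-bit register value."""
    if not (0 <= lo <= MASK16 and 0 <= hi <= MASK16):
        raise ValueError("operands must fit in 16 bits")
    return (hi << 16) | lo


def unpack(word):
    """Recover the two 16-bit operands from a packed 32-bit value."""
    return word & MASK16, (word >> 16) & MASK16


class RegisterFile:
    """Toy register file that counts physical reads (hypothetical model)."""

    def __init__(self, size):
        self.regs = [0] * size
        self.reads = 0

    def write(self, idx, value):
        self.regs[idx] = value & MASK32

    def read(self, idx):
        self.reads += 1
        return self.regs[idx]


def add_unpacked(rf, a_idx, b_idx):
    """Two source operands in two registers: two physical reads."""
    return rf.read(a_idx) + rf.read(b_idx)


def add_coalesced(rf, packed_idx):
    """Both source operands packed in one register: one physical read."""
    a, b = unpack(rf.read(packed_idx))
    return a + b
```

In this model, adding the operands 7 and 35 costs two counted reads when they live in separate registers and one counted read when they are packed together, which is the access reduction the abstract attributes to coalescing.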
Dynamic neural networks enable higher representation flexibility compared to networks with a fixed architecture and are extensively deployed in problems dealing with varying input-induced network structure, such as those in Natural Language Processing. One of the optimizations used in training networks is persistency of recurrent weights on the chip. In dynamic nets, a possibly-inhomogeneous computation graph for every input prevents caching recurrent weights in GPU registers. Therefore, existing solutions suffer from excessive recurring off-chip memory loads as well as compounded kernel launch overheads and underutilization of GPU SMs. In this paper, we present a software system that enables persistency of weight matrices during the training of dynamic neural networks on the GPU. Before the training begins, our approach, named Virtual Persistent Processor Specialization (VPPS), specializes a forward-backward propagation kernel that contains in-register caching and operation routines. VPPS virtualizes persistent kernel CTAs as CISC-like vector processors that can be guided to execute supplied instructions. VPPS greatly reduces the overall amount of off-chip loads by caching weight matrices on the chip, while simultaneously providing maximum portability, as it does not make any assumptions about the shape of the given computation graphs, hence fulfilling dynamic net requirements. We implemented our solution on DyNet and abstracted away its design complexities by providing simple function calls to the user. Our experiments on a Volta micro-architecture show that, unlike the most competitive solutions, VPPS shows excellent performance even at small batch sizes and delivers up to 6x speedup on training dynamic nets.
Registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads. Due to the existence of a great number of concurrently executing threads, and the high cost of context switching mechanisms, contemporary GPUs are equipped with large register files. However, to avoid over-complicating the hardware, registers are statically assigned and exclusively dedicated to threads for the entire duration of the thread’s lifetime. This decomposition takes into account the maximum number of live registers at any given point in the GPU binary, although the points at which all the requested registers are used may constitute only a small fraction of the whole program. Therefore, a considerable portion of the register file remains under-utilized. In this paper, we propose a software-hardware co-mechanism named RegMutex (Register Mutual Exclusion) to share a subset of physical registers between warps during the GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While physical registers corresponding to the base register set are statically and exclusively assigned to the warp, the hardware time-shares the remaining physical registers across warps to provision their extended register set. Therefore, the GPU programs can sustain approximately the same performance with a lower number of registers, hence yielding higher performance per dollar. For programs that require a large number of registers for execution, RegMutex will enable a higher number of concurrent warps to be resident in the hardware via sharing their register allocations with each other, leading to a higher device occupancy. Since some aspects of register sharing orchestration are being offloaded to the compiler, RegMutex introduces lower hardware complexity compared to existing approaches.
Our experiments show that RegMutex improves the register utilization and reduces the number of execution cycles by up to 23% for kernels demanding a high number of registers.
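The split into a base and an extended register set can be sketched with a small Python model. This is only an illustration of the bookkeeping involved, under assumed numbers: the pool size, the per-warp register counts, and the names `ExtendedRegisterPool`, `acquire`, `release`, `resident_warps_static`, and `resident_warps_shared` are all hypothetical and are not taken from the RegMutex design itself.

```python
class ExtendedRegisterPool:
    """Toy pool of extended register sets time-shared across warps."""

    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.owner = {}  # warp id -> slot currently held

    def acquire(self, warp_id):
        """Try to grant an extended set; False means the warp must wait."""
        if warp_id in self.owner:
            return True
        if not self.free_slots:
            return False
        self.owner[warp_id] = self.free_slots.pop()
        return True

    def release(self, warp_id):
        """Return the warp's extended set to the pool, if it holds one."""
        slot = self.owner.pop(warp_id, None)
        if slot is not None:
            self.free_slots.append(slot)


def resident_warps_static(total_regs, regs_per_warp):
    """Each warp exclusively owns its peak register demand."""
    return total_regs // regs_per_warp


def resident_warps_shared(total_regs, base_regs, ext_regs, num_slots):
    """Each warp owns only its base set; num_slots extended sets are shared."""
    return (total_regs - num_slots * ext_regs) // base_regs
```

With an assumed 256 registers and a peak demand of 32 per warp, static allocation fits 8 warps. Splitting the demand into a base set of 20 and an extended set of 12, with 2 shared extended sets, fits 11 warps in this model, at the cost that a warp may have to wait when both extended sets are held by others.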
Recently, side-channel attacks on Last Level Caches (LLCs) were demonstrated. The attacks require the ability to evict critical data from the cache hierarchy, making future accesses visible. We propose Relaxed Inclusion Caches (RIC), a low-complexity cache design protecting against LLC side-channel attacks. RIC relaxes inclusion when it is not needed, preventing the attacker from evicting the victim’s data from the local core caches, thus protecting critical data from leakage. RIC improves performance (by about 10%) and retains the snoop filtering capabilities of inclusive cache hierarchies, while requiring only minimal changes to the cache.