If this mechanism is invoked very often, it may harm performance. Software Transactional Memory for GPU Architectures (IEEE Xplore). Scheduling techniques for GPU architectures with processing. In this paper, we analyze the performance and energy ef. Aamodt, University of British Columbia, Canada. Motivation. Abbreviations: TM, transactional memory; STM, software transactional memory; HTM, hardware transactional memory; HyTM, hybrid transactional memory; TSX, Intel's Transactional Synchronization Extensions; HLE, hardware lock elision; RTM, restricted transactional memory; GPU, graphics processing unit; GPGPU, general-purpose computation on graphics processing units; CPU, central processing unit. To make applications with dynamic data sharing among threads benefit from GPU acceleration, we propose a novel software transactional memory system for GPU architectures (GPU-STM). Efficient transactional-memory-based implementation of morph. One hardware proposal, Kilo TM, can scale to thousands of concurrent transactions. Computing Without Processors, August 2011, Communications.
Energy efficiency of software transactional memory in a. Software Transactional Memory for GPU Architectures, Yunlong Xu. The unconverted parts of the Java program could use up the CPU's multicore resources with their multithreaded workload. Exploration of lock-based software transactional memory, Justin Gottschlich.
Heterogeneous accelerated processing units (APUs) integrate a multicore CPU and a GPU on the same chip. GPU-STM, a software TM for GPUs, enables simplified data synchronization on GPUs, scales to many concurrent transactions, ensures livelock freedom, runs on commercially available GPUs and runtime, and outperforms GPU coarse-grain locks by up to 20x. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. There are three ways to copy data to GPU memory: implicitly through calResMap/calResUnmap, explicitly via calCtxMemCopy, or via a custom copy shader that reads from PCIe memory and writes to GPU memory. However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs.
ACLE Q3 2019 documentation. Hardware Transactional Memory for GPU Architectures. However, ensuring atomicity for complex data types is a task delegated to programmers. Transactional Synchronization Extensions (Wikipedia). Data layout transformation for enhancing locality on NUCA chip multiprocessors. Toward a software transactional memory for heterogeneous. Hardware Transactional Memory for GPU Architectures (UBC ECE). GPU-LocalTM allocates transactional metadata in the existing memory resources, minimizing the storage requirements for TM support. Towards a software transactional memory for graphics processors. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. Software Transactional Memory for GPU Architectures.
Matt: software transactional memory, Herlihy's hardware accelerator concept. Software transactional memory provides transactional memory semantics in a software runtime library or the programming language, and requires minimal hardware support (typically an atomic compare-and-swap operation, or equivalent). Towards a software transactional memory for heterogeneous. Both hardware and software transactional memories have been proposed for GPU architectures. Rafael Ubal, David Kaeli, Department of Electrical and Computer Engineering. To evaluate TLLL, we use it to implement six widely used programs, and compare it with the state-of-the-art ad hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware. While transactional memory for processors with hundreds of cores is likely to require hardware support, software implementations will be required for backward compatibility with current and near. Hardware support for scratchpad memory transactions on GPU. Compiler, Architecture and Tools Conference program abstracts. We propose GPU-LocalTM, a hardware transactional memory (TM), as an alternative to data locking mechanisms in local memory.
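The remark above that software TM needs little more than an atomic compare-and-swap can be illustrated with a minimal sketch. It is not taken from any of the systems named here. The names `VersionedCell`, `atomically`, and `run_demo` are hypothetical, and because Python exposes no hardware compare-and-swap, a small lock stands in for the atomic instruction. The point is only the optimistic pattern: read a snapshot, compute, and commit only if nobody else committed in between.

```python
import threading


class VersionedCell:
    """A shared value with a version counter (hypothetical illustration)."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        # Assumption: this lock emulates a single atomic CAS instruction.
        self._guard = threading.Lock()

    def read(self):
        """Return a consistent (version, value) snapshot."""
        with self._guard:
            return self._version, self._value

    def compare_and_swap(self, expected_version, new_value):
        """Commit new_value only if no other commit happened since the read."""
        with self._guard:
            if self._version != expected_version:
                return False  # conflict: another transaction committed first
            self._value = new_value
            self._version += 1
            return True


def atomically(cell, update):
    """Optimistically apply update(old) -> new, retrying on conflict."""
    retries = 0
    while True:
        version, old = cell.read()
        new = update(old)
        if cell.compare_and_swap(version, new):
            return new, retries
        retries += 1


def run_demo(num_threads=4, increments=1000):
    """Increment one shared counter from several threads without user locks."""
    counter = VersionedCell(0)

    def worker():
        for _ in range(increments):
            atomically(counter, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read()[1]
```

Calling `run_demo()` returns the final counter value. A conflicting commit is never lost, only retried, which is also why frequent conflicts can cost performance.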
GPU computing architecture for irregular parallelism (UBC). For a set of TM-enhanced GPU applications, Kilo TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0. Or would these kinds of building blocks be just what we want? Thesis, Department of Electrical and Computer Engineering, University of Colorado. In addition, it ensures forward progress through an automatic serialization mechanism. Modern GPU architectures have a memory hierarchy that needs to be explicitly programmed to obtain good performance. Programming GPUs is challenging for applications with irregular fine-grained communication between threads. CPU and GPU architectures, memory subsystem design, hardware/software co-design. Software Transactional Memory for GPU Architectures (IEEE). Qingda Lu, Christophe Alias, Uday Bondhugula, Sriram Krishnamoorthy, J. Secondly, the conflict detection mechanism is based on unified read/write signatures i. Yunlong Xu, Rui Wang, Nilanjan Goswami, Tao Li and Depei Qian.
Software Transactional Memory for GPU Architectures, Nilanjan. Accelerating GPU hardware transactional memory with snapshot. Hardware Transactional Memory for GPU Architectures, Wilson W. To improve GPUs' programmability and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions. Evaluation of AMD's Advanced Synchronization Facility within a complete transactional memory stack. Performance evaluation of Intel Transactional Synchronization Extensions for high-performance computing. Software transactional memory. Next-generation CUDA architecture, code-named Fermi.
Ennals, Efficient Software Transactional Memory, technical report, Intel Research Cambridge, UK, 2005. An STM system that supports per-thread transactions faces new challenges. 3. The graphics memory is the GPU's version of host memory. I have been working on software transactional memory for an in-memory database. Modern GPUs have shown promising results in accelerating computation-intensive and numerical workloads with limited dynamic data sharing. To appear in the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2014. Hardware Transactional Memory for GPU Architectures. Towards a software transactional memory for heterogeneous CPU. Advanced computer architecture and systems detailed. Hardware Support for Local Memory Transactions on GPU Architectures, Alejandro Villegas, Angeles Navarro. An efficient software transactional memory using commit-time invalidation. Modern APUs implement CPU-GPU platform atomics for simple data types. Nilanjan Goswami, GPU architect, Advanced Computing Lab.
Hardware support for local memory transactions on GPU. Many TM systems have been proposed in the last two decades for multicore architectures [7], implemented either in hardware or software or a combination. Improvements in hardware transactional memory for GPU. On the hardware side, Kilo TM was proposed in 2011. Toward a software transactional memory for heterogeneous CPU. On the downside, software implementations usually come with a performance penalty when compared to hardware. On the GPU, main memory is accessed via a cache hierarchy where, in most cases, the L1 data cache is not coherent. Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multithreaded software through lock elision. A question that arises in our smart highways use case is this. His research interests include parallel programming, software transactional memory, and distributed architectures. And now, having read about Intel's hardware TM, I have many questions. Transactional Memory for Heterogeneous Systems (arXiv).
A CUDA program starts on a CPU and then launches parallel compute kernels onto a GPU. TM simplifies software development for parallel architectures by providing the programmer with the illusion that code blocks, called transactions, execute. The ability of the GPU to handle considerably more threads than the CPU has recently led to increased interest in utilising transactional memory for GPU. With TM, the programmer does not need to write code with locks to ensure mutual exclusion. The major challenges include ensuring good scalability with respect to the massive multithreading of GPUs, and preventing livelocks. Each kernel launch dispatches a hierarchy of threads, a grid of blocks. It is only accessible by the GPU and not accessible via the CPU. Nov 11, 20. Compiler, Architecture and Tools Conference program abstracts. System-wide data consistency issues can be handled by a GPU-friendly design of software transactional memory. Today most people who make effective use of GPUs undergo a steep learning curve and are forced to program close to the machine using special GPU programming languages. Software Transactional Memory for GPU Architectures (proceedings).
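The launch model described in this passage, a host program dispatching a kernel over a grid of thread blocks, can be mimicked on the CPU to show how each thread finds its own data element. This is a sequential emulation in Python, not CUDA code. The names `launch` and `scale_kernel` are hypothetical, and real GPU threads run concurrently rather than in a loop. The index formula `block_idx * block_dim + thread_idx` is the usual one-dimensional global index.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch: run the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, factor):
    """Each emulated thread scales at most one element of data."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):  # guard: the grid may contain more threads than elements
        data[i] *= factor
```

For example, `launch(scale_kernel, 3, 4, data, 2)` covers a ten-element list with twelve emulated threads. The bounds check makes the two surplus threads do nothing.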
Transactional memory (TM) is an optimistic approach to achieve this goal. The major challenges include ensuring good scalability with respect to the massive multithreading of GPUs, and preventing livelocks caused by the SIMT execution paradigm of GPUs. This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. Sadayappan, Yongjian Chen, Haibo Lin and Tin-Fook Ngai. Scheduling Techniques for GPU Architectures with Processing-in-Memory Capabilities, Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kay. Transactional memory for heterogeneous CPU-GPU systems.