GPGPU Storage Architecture

Review of the CPU Storage Hierarchy

Figure 1: The classic CPU storage hierarchy, an upright pyramid structure

 The hierarchical storage structure widely used in CPU storage systems is shown in the figure above, a typical "upright pyramid". Each storage medium has different characteristics: media at the top sit close to the arithmetic units and are fast, but their circuit overhead is high, so they are few in number and small in capacity; media at the bottom are far from the arithmetic units and slow, but their circuit overhead is low, so they can be much larger. It is this hierarchy, combined with sensible data-layout management, that creates for programmers the "illusion" of a storage system that is both large and fast, and in practice it has proved highly effective at accelerating data access.
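
 To make the role of locality concrete, here is a minimal host-side C++ sketch (the array size N is arbitrary): both loops visit the same elements, but only the first traversal matches the memory layout, so it reuses each fetched cache line while the second keeps evicting lines before reuse.

```cuda
#include <cstdio>

const int N = 4096;          // illustrative size; ~64 MB of data
static float a[N][N];        // row-major: a[i][j] and a[i][j+1] are adjacent

int main() {
    float sum = 0.0f;

    // Cache-friendly: consecutive iterations read consecutive addresses,
    // so each cache line brought in is fully used (spatial locality).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += a[i][j];

    // Cache-hostile: consecutive iterations jump N * sizeof(float) bytes,
    // touching a new cache line every time; typically several times slower.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```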

The GPGPU Storage Hierarchy

 Like the CPU storage system, the GPGPU storage system adopts a hierarchical design that exploits data locality to reduce the cost of off-chip memory accesses. However, to sustain the massive SIMT thread parallelism of the GPGPU cores, the GPGPU follows a throughput-first principle, so its storage system differs markedly from the CPU's. The differences show up mainly in memory capacity, structural composition, and access management.

Figure 2: The GPGPU storage hierarchy, an inverted pyramid structure

 The CUDA and OpenCL programming models roughly divide GPGPU storage into the register file, L1 data cache/shared memory, L2 cache, and global memory. Although the device type used at each level roughly matches its CPU counterpart, the sizing of the levels is very different. As the figure above shows, the register file in each streaming multiprocessor of a GPGPU is significantly larger than the L1 cache and shared memory, forming an "inverted pyramid" that is the exact opposite of the CPU's hierarchy. This inverted structure is a defining feature of the GPGPU storage system.
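
 As a rough illustration of how these spaces appear in CUDA source, the hypothetical kernel below notes where each object is expected to reside (the exact mapping is the compiler's and architecture's choice):

```cuda
// Assumes a launch with blockDim.x == 256; the kernel itself is illustrative.
__global__ void scale_via_shared(const float *in, float *out, int n, float k) {
    // Automatic scalar variables such as idx normally live in the register file.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // A __shared__ array sits in the per-SM shared memory, which on many
    // GPGPUs shares on-chip storage with the L1 data cache.
    __shared__ float tile[256];

    // Dereferencing a global pointer reaches off-chip global memory,
    // normally travelling through the L2 (and possibly L1) cache.
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    if (idx < n)
        out[idx] = tile[threadIdx.x] * k;
}
```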

 The GPGPU register file is given such a large capacity mainly to support zero-overhead switching between thread warps. Each streaming multiprocessor keeps many warps resident at once, and to switch flexibly among them, hiding long-latency operations such as cache misses, the context of every active warp, above all its register contents, must be kept in the register file. If register resources were cut back, then whenever a thread's register demand exceeded the physical capacity of the register file, local storage space in global memory would have to be allocated to hold the excess registers, a phenomenon known as "register spilling". Spill accesses often cause significant performance degradation, so for the sake of performance the GPGPU has to adopt a large-capacity register file.
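
 One common way to observe and bound this register pressure is sketched below; the kernel body is hypothetical, and the reported register counts vary with compiler and architecture. Compiling with `nvcc -Xptxas -v` prints per-thread register use along with any spill loads and stores to local memory.

```cuda
// __launch_bounds__ caps the threads per block the kernel may be launched
// with and requests a minimum number of resident blocks per SM, which lets
// the compiler budget registers accordingly. If the budget is too tight for
// the kernel's live values, the compiler spills some of them to local
// memory (backed by global memory) -- the "register spilling" described above.
__global__ void
__launch_bounds__(256 /* max threads per block */, 4 /* min blocks per SM */)
heavy_kernel(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // A real kernel with many simultaneously live temporaries drives up
    // per-thread register demand; this loop merely stands in for that.
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += in[idx] * (float)i;
    out[idx] = acc;
}
```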

 Beyond the difference in capacity, GPGPU memory accesses are also highly parallelized. Because the GPGPU executes at the granularity of thread warps, each warp's memory accesses exploit spatial parallelism to coalesce requests or broadcast shared data whenever possible, improving access efficiency. For example, the L1 data cache and global memory each have their own address-coalescing rules, which usually requires dedicated coalescing units in the hardware to merge the threads' access requests on the fly.
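
 The contrast can be seen in two hypothetical copy kernels: in the first, adjacent threads of a warp present adjacent addresses that the coalescing unit can merge into a few wide transactions; in the second, a stride scatters the warp's addresses and the access splinters into many transactions.

```cuda
__global__ void copy_coalesced(const float *in, float *out, int n) {
    // Thread t of a warp reads element base+t: the 32 requests fall in a
    // few contiguous segments and are merged by the coalescing unit.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    // Thread t reads element (base+t)*stride: for a large stride every
    // request lands in a different memory segment, so the warp's access
    // is split into many transactions and effective bandwidth collapses.
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) out[idx] = in[idx];
}
```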

 GPGPU memory access behavior also reflects finer-grained on-chip storage management. The programmer can choose among different storage spaces for data, such as shared memory, the L1 data cache, the texture cache, and the constant cache, and can flexibly adjust the split between shared memory and the L1 data cache while keeping their total capacity fixed. Many GPGPUs also allow the caching policy for data in the L1 data cache to be specified. Together these features let programmers implement more refined storage management tailored to each application's characteristics.
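
 A brief sketch of these knobs using standard CUDA runtime calls follows; my_kernel is a hypothetical example, and whether a given hint is honored depends on the architecture.

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(const float * __restrict__ in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // __ldg() routes the load through the read-only (texture) cache path,
    // one of the per-access cache-policy choices mentioned above.
    if (idx < n) out[idx] = __ldg(&in[idx]);
}

int main() {
    // Hint that the configurable on-chip storage should favor shared
    // memory over L1 data cache for this kernel; total capacity is fixed.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // ... allocate device buffers, launch my_kernel, and clean up ...
    return 0;
}
```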

 As GPGPU hardware iterates from generation to generation, its storage system architecture keeps evolving. It is therefore difficult to give a single, specific description of the hardware structure, but most designs conform to the multiple-storage-space abstraction of the CUDA and OpenCL programming models.