Reference Books

General-Purpose Graphics Processor Design: GPGPU Programming Model and Architecture Principles (通用图形处理器设计:GPGPU编程模型与架构原理)

References

Aamodt T M, Fung W W L, Rogers T G. General-purpose graphics processor architectures[J]. Synthesis Lectures on Computer Architecture, 2018, 13(2): 1-140.
Nvidia. CUDA C programming guide[Z]. (2017-06-01)[2021-08-01]. https://eva.fing.edu.uy/pluginfile.php/174141/mod_resource/content/1/CUDA_C_Programming_Guide.pdf.
Nvidia. PTX: Parallel thread execution ISA version 6.4[M]. (2017-06-01)[2021-08-01]. https://docs.nvidia.com/pdf/ptx_isa_5.0.pdf.
Nvidia. Cooperative Groups: Flexible CUDA Thread Programming[Z]. [2021-08-01]. https://developer.nvidia.com/blog/cooperative-groups/.
ElTantawy A, Aamodt T M. MIMD synchronization on SIMT architectures[C]. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016: 1-14.
Diamos G F, Johnson R C, Grover V, et al. Execution of divergent threads using a convergence barrier: U.S. Patent 10,067,768[P]. 2018-09-04[2021-08-12]. https://www.freepatentsonline.com/y2016/0019066.html.
Fung W W L, Aamodt T M. Thread block compaction for efficient SIMT control flow[C]. 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 2011: 25-36.
Rhu M, Erez M. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures[J]. ACM SIGARCH Computer Architecture News, 2012, 40(3): 61-71.
Lee M, Song S, Moon J, et al. Improving GPGPU resource utilization through alternative thread block scheduling[C]. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014: 260-271.
Sadrosadati M, Mirhosseini A, Hajiabadi A, et al. Enabling High-Capacity, Latency-Tolerant, and Highly-Concurrent GPU Register Files via Software/Hardware Cooperation[J]. arXiv preprint arXiv:2010.09330, 2020.
Jing N, Wang J, Fan F, et al. Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs[C]. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016: 1-12.
Teich P. Tearing apart Google's TPU 3.0 AI coprocessor[Z]. [2021-09-13]. https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/.
Nvidia. NVIDIA A100 tensor core GPU architecture[Z/OL]. (2020-05-14)[2021-08-13]. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
Wan C R, Evans D J. Nineteen ways of systolic matrix multiplication[J]. International Journal of Computer Mathematics, 1998, 68(1-2): 39-69.
Asghari Esfeden H, Khorasani F, Jeon H, et al. CORF: Coalescing operand register file for GPUs[C]. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 2019: 701-714.
Nvidia. NVIDIA Turing GPU architecture[Z]. [2021-08-01]. https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
Nvidia. NVIDIA Tesla V100 GPU architecture[Z]. [2021-08-01]. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.