Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low locality data-intensive applications. We can categorize the in-DRAM PIMs depending on how many banks perform the PIM computat...

Bibliographic Details
Main Authors:	Yoonah Paik, Chang Hyun Kim, Won Jun Lee, Seon Wook Kim
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution
Online Access:	https://ieeexplore.ieee.org/document/9870805/

collection	DOAJ
description	Processing-in-Memory (PIM) has been actively studied as a way to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low-locality, data-intensive applications. In-DRAM PIMs can be categorized by how many banks perform the PIM computation in response to one DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank at a time, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks at once, achieving high performance but raising design issues such as thermal and power consumption. We introduce memory-computation decoupling execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing per-bank execution, so that it can be easily adapted to commercial platforms. We divide the PIM execution into two phases: a memory phase and a computation phase. In the memory phase, we read the bank-private operands from each bank and store them in the PIM engines’ registers, bank by bank. In the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command so that all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the all-bank execution opportunity, we introduce a compiler analysis and code generation technique that identifies the bank-private and bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM against commercial computing platforms. In Level-3 BLAS, we achieved speedups of $75.8\times$, $1.2\times$, and $4.7\times$ over the CPU, the GPU, and the per-bank PIM, respectively, and reached up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than the GPU and the per-bank PIM, respectively, and 7.4% more than the ideal all-bank PIM.
first_indexed	2024-12-10T12:24:29Z
id	doaj.art-b16ad0a5f7924796ab062994bb468ca5
institution	Directory Open Access Journal
issn	2169-3536
last_indexed	2024-12-10T12:24:29Z
spelling	doaj.art-b16ad0a5f7924796ab062994bb468ca5; 2022-12-22T01:48:59Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10; pp. 93256-93272; DOI 10.1109/ACCESS.2022.3203051; IEEE document 9870805; Yoonah Paik (https://orcid.org/0000-0002-8294-1079); Chang Hyun Kim (https://orcid.org/0000-0001-6224-7074); Won Jun Lee (https://orcid.org/0000-0001-6161-0871); Seon Wook Kim (https://orcid.org/0000-0001-6555-1741); all authors: Department of Electrical Engineering, Korea University, Seoul, South Korea
title	Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

Similar Items