Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling

Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low locality data-intensive applications. We can categorize the in-DRAM PIMs depending on how many banks perform the PIM computat...

Bibliographic Details
Main Authors: Yoonah Paik, Chang Hyun Kim, Won Jun Lee, Seon Wook Kim
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution
Online Access: https://ieeexplore.ieee.org/document/9870805/
author Yoonah Paik
Chang Hyun Kim
Won Jun Lee
Seon Wook Kim
collection DOAJ
description Processing-in-Memory (PIM) has been actively studied as a way to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing data-intensive applications with low locality. In-DRAM PIMs can be categorized by how many banks perform the PIM computation in response to one DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank per command, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks at once, achieving high performance but raising design issues such as thermal and power consumption. We introduce memory-computation decoupled execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing per-bank execution, so that it can be easily adopted on commercial platforms. We divide PIM execution into two phases: a memory phase and a computation phase. In the memory phase, we read the bank-private operands from each bank and store them in the PIM engines’ registers, bank by bank. In the computation phase, we decouple the PIM engines from the memory array and broadcast a bank-shared operand using a standard read/write command so that all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the opportunity for all-bank execution, we introduce a compiler analysis and code generation technique that identifies the bank-private and bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM against commercial computing platforms.
In Level-3 BLAS, we achieved speedups of 75.8×, 1.2×, and 4.7× compared to the CPU, the GPU, and the per-bank PIM, respectively, and up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than the GPU and the per-bank PIM, respectively, but 7.4% more than the ideal all-bank PIM.
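The two-phase execution described in the abstract can be illustrated with a toy software model. The Python sketch below is not the authors' hardware or compiler; it is a minimal illustration under stated assumptions: each bank holds one row of a matrix `a` as its bank-private operand, elements of matrix `b` are the bank-shared operands, and every register load or broadcast costs exactly one DRAM command. The function names `decoupled_matmul` and `per_bank_command_count` and the command accounting are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def decoupled_matmul(a: Matrix, b: Matrix) -> Tuple[Matrix, Dict[str, int]]:
    """Toy model of memory-computation decoupling for C = A x B.

    Assumption: bank i stores row i of `a` (bank-private operands), and
    each element of `b` is a bank-shared operand broadcast to all banks.
    """
    num_banks = len(a)
    k_dim = len(b)
    n_dim = len(b[0])
    commands = {"memory": 0, "computation": 0}

    # Memory phase: copy the bank-private operands into each PIM engine's
    # registers, one bank at a time (per-bank execution).
    registers: Matrix = []
    for bank in range(num_banks):
        registers.append(list(a[bank]))
        commands["memory"] += k_dim  # assumed: one read command per operand

    # Computation phase: the engines work from their registers only, so a
    # single broadcast command drives a multiply-accumulate in every bank.
    acc: Matrix = [[0.0] * n_dim for _ in range(num_banks)]
    for n in range(n_dim):
        for k in range(k_dim):
            shared = b[k][n]
            commands["computation"] += 1  # one command, all banks compute
            for bank in range(num_banks):
                acc[bank][n] += registers[bank][k] * shared
    return acc, commands


def per_bank_command_count(num_banks: int, k_dim: int, n_dim: int) -> int:
    """Commands needed if every multiply-accumulate targets a single bank."""
    return num_banks * k_dim * n_dim


if __name__ == "__main__":
    result, counts = decoupled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, counts, per_bank_command_count(2, 2, 2))
```

In this model the computation-phase command count is independent of the number of banks, while the per-bank count grows linearly with it. That is the intuition behind reusing registered bank-private operands across many broadcast operands, as in Level-3 BLAS.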
first_indexed 2024-12-10T12:24:29Z
format Article
id doaj.art-b16ad0a5f7924796ab062994bb468ca5
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-10T12:24:29Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-b16ad0a5f7924796ab062994bb468ca5
2022-12-22T01:48:59Z
eng
IEEE
IEEE Access
ISSN 2169-3536
2022-01-01
Volume 10, pages 93256-93272
DOI 10.1109/ACCESS.2022.3203051
Article number 9870805
Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling
Yoonah Paik, https://orcid.org/0000-0002-8294-1079, Department of Electrical Engineering, Korea University, Seoul, South Korea
Chang Hyun Kim, https://orcid.org/0000-0001-6224-7074, Department of Electrical Engineering, Korea University, Seoul, South Korea
Won Jun Lee, https://orcid.org/0000-0001-6161-0871, Department of Electrical Engineering, Korea University, Seoul, South Korea
Seon Wook Kim, https://orcid.org/0000-0001-6555-1741, Department of Electrical Engineering, Korea University, Seoul, South Korea
https://ieeexplore.ieee.org/document/9870805/
Memory-computation decoupling
in-memory processing
standard memory interface
all-bank execution
title Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling
topic Memory-computation decoupling
in-memory processing
standard memory interface
all-bank execution
url https://ieeexplore.ieee.org/document/9870805/