Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling
Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low locality data-intensive applications. We can categorize the in-DRAM PIMs depending on how many banks perform the PIM computat...
Main Authors: | Yoonah Paik, Chang Hyun Kim, Won Jun Lee, Seon Wook Kim |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2022-01-01 |
Series: | IEEE Access |
Subjects: | Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution |
Online Access: | https://ieeexplore.ieee.org/document/9870805/ |
_version_ | 1818056163621601280 |
---|---|
author | Yoonah Paik; Chang Hyun Kim; Won Jun Lee; Seon Wook Kim |
author_facet | Yoonah Paik; Chang Hyun Kim; Won Jun Lee; Seon Wook Kim |
author_sort | Yoonah Paik |
collection | DOAJ |
description | Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing data-intensive applications with low locality. In-DRAM PIMs can be categorized by how many banks perform the PIM computation in response to one DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank per command, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks at once, achieving high performance but raising design issues such as thermal issues and power consumption. We introduce memory-computation decoupled execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing per-bank execution, so that it can be easily adopted on commercial platforms. We divide the PIM execution into two phases: a memory phase and a computation phase. In the memory phase, we read the bank-private operands from each bank and store them in the PIM engines’ registers, bank by bank. In the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command so that all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the opportunity for all-bank execution, we introduce a compiler analysis and code generation technique that identifies the bank-private and the bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM against commercial computing platforms. In Level-3 BLAS, we achieved speedups of $75.8\times$, $1.2\times$, and $4.7\times$ compared to the CPU, the GPU, and the per-bank PIM, respectively, and up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than the GPU and the per-bank PIM, respectively, but 7.4% more than the ideal all-bank PIM. |
first_indexed | 2024-12-10T12:24:29Z |
format | Article |
id | doaj.art-b16ad0a5f7924796ab062994bb468ca5 |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-10T12:24:29Z |
publishDate | 2022-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-b16ad0a5f7924796ab062994bb468ca5; 2022-12-22T01:48:59Z; eng; IEEE; IEEE Access; 2169-3536; 2022-01-01; vol. 10, pp. 93256-93272; doi:10.1109/ACCESS.2022.3203051; 9870805; Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling; Yoonah Paik (https://orcid.org/0000-0002-8294-1079), Chang Hyun Kim (https://orcid.org/0000-0001-6224-7074), Won Jun Lee (https://orcid.org/0000-0001-6161-0871), Seon Wook Kim (https://orcid.org/0000-0001-6555-1741), all at Department of Electrical Engineering, Korea University, Seoul, South Korea; https://ieeexplore.ieee.org/document/9870805/; Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution |
title | Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decoupling |
topic | Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution |
url | https://ieeexplore.ieee.org/document/9870805/ |
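
The two-phase execution summarized in the description field above can be illustrated with a small sketch. This is a toy model, not the authors' implementation: the function names `per_bank` and `decoupled`, the `dot` helper, and the idea of counting one "command" per operand vector are all assumptions made for illustration, and command counts are only a rough proxy for execution time. Each bank holds one bank-private vector, and a list of bank-shared vectors is multiplied against every bank's private vector.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def per_bank(private, shared):
    """Hypothetical per-bank baseline: every command drives exactly one bank."""
    commands = 0
    out = [[0] * len(shared) for _ in private]
    for b, row in enumerate(private):       # visit one bank at a time
        for j, col in enumerate(shared):
            out[b][j] = dot(row, col)
            commands += 1                   # this command targets bank b only
    return out, commands


def decoupled(private, shared):
    """Hypothetical memory-computation decoupled execution (toy model)."""
    commands = 0
    # Memory phase: copy each bank's private operand into its PIM
    # registers, bank by bank, using one per-bank command each.
    regs = []
    for row in private:
        regs.append(list(row))
        commands += 1
    # Computation phase: the engines no longer touch the memory array;
    # one broadcast command delivers a shared operand to all banks,
    # which then compute on it at the same time.
    out = [[0] * len(shared) for _ in private]
    for j, col in enumerate(shared):
        commands += 1
        for b, reg in enumerate(regs):
            out[b][j] = dot(reg, col)
    return out, commands


if __name__ == "__main__":
    private = [[1, 2], [3, 4]]               # one private vector per bank
    shared = [[1, 0], [0, 1], [1, 1]]        # operands broadcast to all banks
    print(per_bank(private, shared))
    print(decoupled(private, shared))
```

In this model the per-bank baseline issues one command per (bank, shared operand) pair, while the decoupled version issues one command per bank in the memory phase plus one per shared operand in the computation phase. The gap widens as the number of banks and the reuse of the register-resident private operands grow, which is consistent with the description's point about extending the computation phase.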