Brief announcement: Distributed shared memory based on computation migration
Format: Article
Language: en_US
Published: Association for Computing Machinery (ACM), 2012
Online access: http://hdl.handle.net/1721.1/72358
ORCID iDs: https://orcid.org/0000-0001-8253-7714 https://orcid.org/0000-0003-4301-1159 https://orcid.org/0000-0001-5490-2323 https://orcid.org/0000-0003-1467-2150
Abstract: Driven by increasingly unbalanced technology scaling and power
dissipation limits, microprocessor designers have resorted to increasing
the number of cores on a single chip, and pundits expect
1000-core designs to materialize in the next few years [1]. But how
will memory architectures scale and how will these next-generation
multicores be programmed?
One barrier to scaling current memory architectures is the off-chip
memory bandwidth wall [1,2]: off-chip bandwidth grows with
package pin density, which scales much more slowly than on-die
transistor density [3]. To reduce reliance on external memories and
keep data on-chip, today’s multicores integrate very large shared
last-level caches on chip [4]. The interconnects used with such shared
caches, however, do not scale beyond relatively few cores, and the
power requirements and access latencies of large caches preclude
their use in chips on a 1000-core scale. For massive-scale multicores,
then, we are left with relatively small per-core caches.
Per-core caches on a 1000-core scale, in turn, raise the question
of memory coherence. On the one hand, a shared memory abstraction
is a practical necessity for general-purpose programming, and
most programmers prefer a shared memory model [5]. On the other
hand, ensuring coherence among private caches is an expensive
proposition: bus-based and snoopy protocols don’t scale beyond
relatively few cores, and directory sizes needed in cache-coherence
protocols must equal a significant portion of the combined size of
the per-core caches, since otherwise directory evictions limit performance
[6]. Moreover, directory-based coherence protocols are
notoriously difficult to implement and verify [7].
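The directory-sizing argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a full-map directory (one presence bit per core per entry), which is a common textbook organization but is not specified in the abstract, and it uses hypothetical parameters (32 KiB private caches, 64-byte lines). It ignores tag and state bits, so real overheads would be somewhat higher. The helper names `lines_tracked`, `full_map_directory_bytes`, and `overhead_ratio` are hypothetical.

```python
"""Back-of-the-envelope directory sizing (illustrative; parameters are hypothetical)."""


def lines_tracked(num_cores: int, cache_bytes: int, line_bytes: int) -> int:
    """Number of lines that can be resident across all private caches at once.

    A directory with fewer entries than this must, when the caches are full,
    evict entries and typically invalidate the corresponding cached copies.
    """
    return num_cores * cache_bytes // line_bytes


def full_map_directory_bytes(num_cores: int, cache_bytes: int, line_bytes: int) -> int:
    """Sharer-vector storage for an assumed full-map directory.

    One entry per trackable line, one presence bit per core per entry;
    tags and state bits are ignored.
    """
    entries = lines_tracked(num_cores, cache_bytes, line_bytes)
    return entries * num_cores // 8


def overhead_ratio(num_cores: int, cache_bytes: int, line_bytes: int) -> float:
    """Directory storage divided by the combined capacity of the private caches."""
    total_cache = num_cores * cache_bytes
    return full_map_directory_bytes(num_cores, cache_bytes, line_bytes) / total_cache


if __name__ == "__main__":
    # Hypothetical chips with 32 KiB private caches and 64-byte lines.
    for cores in (16, 64, 1000):
        ratio = overhead_ratio(cores, 32 * 1024, 64)
        print(f"{cores:5d} cores: directory = {ratio:.3f} x combined cache size")
```

Under these assumptions the ratio reduces to N / (8 × line size in bytes), independent of cache capacity: 0.031 at 16 cores, 0.125 at 64 cores, and about 1.95 at 1000 cores, where the sharer vectors alone would outgrow the data they track. Limited-pointer or coarse-vector directories shrink each entry, but the entry count still has to cover every cached line.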