Brief announcement: Distributed shared memory based on computation migration

Bibliographic details
Main authors: Lis, Mieszko, Shim, Keun Sup, Cho, Myong Hyon, Fletcher, Christopher Wardlaw, Kinsy, Michel A., Lebedev, Ilia A., Khan, Omer, Devadas, Srinivas
Other authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: en_US
Published by: Association for Computing Machinery (ACM), 2012
Online access: http://hdl.handle.net/1721.1/72358
https://orcid.org/0000-0001-8253-7714
https://orcid.org/0000-0003-4301-1159
https://orcid.org/0000-0001-5490-2323
https://orcid.org/0000-0003-1467-2150
Description
Abstract: Driven by increasingly unbalanced technology scaling and power dissipation limits, microprocessor designers have resorted to increasing the number of cores on a single chip, and pundits expect 1000-core designs to materialize in the next few years [1]. But how will memory architectures scale, and how will these next-generation multicores be programmed? One barrier to scaling current memory architectures is the off-chip memory bandwidth wall [1,2]: off-chip bandwidth grows with package pin density, which scales much more slowly than on-die transistor density [3]. To reduce reliance on external memories and keep data on-chip, today's multicores integrate very large shared last-level caches on chip [4]; the interconnects used with such shared caches, however, do not scale beyond relatively few cores, and the power requirements and access latencies of large caches exclude their use in chips on a 1000-core scale. For massive-scale multicores, then, we are left with relatively small per-core caches. Per-core caches on a 1000-core scale, in turn, raise the question of memory coherence. On the one hand, a shared memory abstraction is a practical necessity for general-purpose programming, and most programmers prefer a shared memory model [5]. On the other hand, ensuring coherence among private caches is an expensive proposition: bus-based and snoopy protocols do not scale beyond relatively few cores, and the directory sizes needed in cache-coherence protocols must equal a significant portion of the combined size of the per-core caches, as otherwise directory evictions will limit performance [6]. Moreover, directory-based coherence protocols are notoriously difficult to implement and verify [7].
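
As a rough illustration of the directory-sizing argument in the abstract, the sketch below estimates the storage a full-map directory would need to track every privately cached line on a hypothetical 1000-core chip. The core count, cache size, line size, and per-entry state bits are illustrative assumptions, not figures from the paper.

    # Back-of-envelope estimate of full-map directory overhead at 1000 cores.
    # All parameters are illustrative assumptions, not values from the paper.
    CORES = 1000
    PER_CORE_CACHE_BYTES = 256 * 1024   # assume a small 256 KiB private cache per core
    LINE_BYTES = 64                     # assume 64-byte cache lines

    lines_per_core = PER_CORE_CACHE_BYTES // LINE_BYTES
    total_lines = CORES * lines_per_core

    # A full-map directory keeps one sharer bit per core for each tracked line,
    # plus a handful of state/tag bits (16 assumed here).
    entry_bits = CORES + 16
    directory_bytes = total_lines * entry_bits / 8

    total_cache_bytes = CORES * PER_CORE_CACHE_BYTES
    print(f"tracked lines:       {total_lines:,}")
    print(f"directory storage:   {directory_bytes / 2**20:.0f} MiB")
    print(f"combined cache size: {total_cache_bytes / 2**20:.0f} MiB")
    print(f"directory / caches:  {directory_bytes / total_cache_bytes:.0%}")

Even under these modest assumptions, the sharer vectors alone approach or exceed the combined capacity of the caches they track; practical directories are therefore kept much smaller, which is exactly the regime where, as the abstract notes, directory evictions begin to limit performance.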