By Piotr Mitros
Memory speed has not kept up with CPU speed. On the order of half of a modern CPU's die area is devoted to dealing with this problem, mostly in the form of caches, but also as added instruction-set complexity (prefetch instructions, etc.). One solution to this problem is to integrate processing and memory onto a single die. Memory bandwidth increases considerably as we move closer to the physical memory:
| Location | Bandwidth |
|---|---|
| Sense Amps | 2.9 TB/sec |
| Column Decode | 49 GB/sec |
| Chip Pins | 6.2 GB/sec |
| System Bus | 190 MB/sec |
| CPU Cache | 600 MB/sec |
(Source: the C-RAM project.) Although these numbers are somewhat dated, they give an idea of the orders-of-magnitude difference between on-chip and off-chip accesses to memory. When reading the table, note that for many types of operations, most of the data at the sense amps is useless.
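To make the gap concrete, here is a quick back-of-the-envelope calculation in C using the figures above (treating TB, GB, and MB as decimal units; the exact ratios depend on the original measurements):

```c
/* Rough ratios between the bandwidth figures in the table above. */
#include <stdio.h>

int main(void)
{
    double sense_amps = 2.9e12;  /* 2.9 TB/sec */
    double chip_pins  = 6.2e9;   /* 6.2 GB/sec */
    double system_bus = 190e6;   /* 190 MB/sec */

    printf("sense amps vs. chip pins:  %.0fx\n", sense_amps / chip_pins);   /* ~468x   */
    printf("sense amps vs. system bus: %.0fx\n", sense_amps / system_bus);  /* ~15263x */
    return 0;
}
```

In other words, the memory arrays themselves can deliver on the order of ten thousand times more bandwidth than the system bus exposes.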
The manufacturing techniques for memory are different from those for logic. In logic, one wants to minimize capacitances for high-speed operation; in memory, one wants to maximize them, for cell stability and slow refresh rates. Logic also has a much less regular structure than memory. As such, it is quite difficult to manufacture high-density memory on the same die as high-speed logic.
One cannot achieve the same memory capacity on a single chip as on multiple chips. As such, designers of computing-in-memory systems must either settle for much smaller amounts of memory, or accept a combination of high-speed local memory and low-speed off-chip memory. With massively parallel SIMD or MIMD arrays, communication between memory chips can become the primary bottleneck, and many of the current approaches to PIM do not address this problem adequately.
Fortunately, memory density has grown more quickly than demand for memory. As such, there is hope that we may be able to integrate enough memory on chip for most mainstream applications.
PIM would also decommoditize the memory market; 128MB of Micron memory would no longer be interchangeable with 128MB of IBM memory. This could drive consumer prices up and hurt interoperability, and it has been one of the dominant reasons that memory architectures known to be suboptimal persist.
PIM projects vary a fair bit in overall architecture. At one end, several groups have made system-on-a-chip designs consisting of a normal CPU with integrated on-board memory for fast access. Others have instead opted to distribute simple computing elements within main memory, controlled by a separate CPU. Here, the main CPU spawns off memory-intensive operations to the intelligent memory, while handling serial, CPU-intensive work internally.
This second approach has varied from supercomputer-like massively parallel SIMD or MIMD arrays, to adding simple logic that coprocesses standard memory operations like refresh.
Active Pages integrate a memory chip with an FPGA capable of performing simple operations on the data in memory. This has the advantage that, due to its regular organization, an FPGA is considerably easier to manufacture in a memory process than conventional logic. However, FPGAs are also somewhat more difficult to program than conventional processors. In addition, swapping FPGA configurations to disk is slow compared to memory.
At the moment, Active Pages lack a mechanism for interchip communication, so all such communication must be mediated by the CPU. This is likely to become a bottleneck for some types of operations, especially ones with pointer-based structures.
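To illustrate the programming model (and only that: every name below is hypothetical, not the actual Active Pages interface), an application might hand a simple per-word operation to the logic sitting next to a page of memory. A plain software loop stands in for that logic here:

```c
/* Sketch of an Active-Pages-style per-page offload. All names are
 * hypothetical; the software loop in pim_page_apply() stands in for
 * FPGA logic that would run beside the memory page itself. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t (*page_op_fn)(uint32_t word);

/* On real hardware this would dispatch to the logic attached to the
 * page; here an ordinary loop models the semantics. */
static void pim_page_apply(uint32_t *page, size_t nwords, page_op_fn op)
{
    for (size_t i = 0; i < nwords; i++)
        page[i] = op(page[i]);
}

static uint32_t double_word(uint32_t w) { return w * 2; }

int main(void)
{
    uint32_t data[4] = { 1, 2, 3, 4 };
    pim_page_apply(data, 4, double_word);
    for (size_t i = 0; i < 4; i++)
        printf("%u\n", (unsigned)data[i]);
    return 0;
}
```

Anything that crosses a page boundary, such as chasing a pointer into another chip, falls back on the CPU, which is exactly where the bottleneck above appears.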
Programming PIM CPUs varies wildly based on the architecture of the PIM. As such, it is impossible to present a single approach to programming PIM systems; PIMs span the whole range of programming difficulty, depending on architecture.
For instance, programming a system with 1T-SRAM would be identical to programming a CPU with normal memory. A system-on-chip or IRAM-like approach would also be identical, assuming no secondary off-chip memory. With off-chip memory, the OS swapper could be modified to keep active data in on-chip memory, swapping to off-chip memory and then to disk, hiding the hierarchy from application-level software (sketched below). For massively parallel SIMD/MIMD arrays, programming would be identical to the existing models for those machines.
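As a minimal sketch of that modified-swapper idea (the tier names and data structures are invented for illustration; a real kernel would fold this into its existing page-reclaim machinery):

```c
/* Toy model of a three-tier swapper for a PIM system with secondary
 * off-chip memory. All names are hypothetical. */
#include <stdio.h>

typedef enum { TIER_ONCHIP, TIER_OFFCHIP, TIER_DISK } tier_t;

struct page_info {
    int id;
    unsigned long last_access;  /* e.g. from referenced-bit scans */
    tier_t tier;
};

/* When the on-chip tier fills, demote its least-recently-used page to
 * off-chip memory; a cold off-chip page would go to disk the same way. */
static void demote_coldest(struct page_info *pages, int n)
{
    struct page_info *coldest = NULL;
    for (int i = 0; i < n; i++)
        if (pages[i].tier == TIER_ONCHIP &&
            (coldest == NULL || pages[i].last_access < coldest->last_access))
            coldest = &pages[i];
    if (coldest != NULL) {
        coldest->tier = TIER_OFFCHIP;
        printf("demoted page %d to off-chip memory\n", coldest->id);
    }
}

int main(void)
{
    struct page_info pages[] = {
        { 0, 100, TIER_ONCHIP },
        { 1,  10, TIER_ONCHIP },  /* least recently used: demoted first */
        { 2,  50, TIER_ONCHIP },
    };
    demote_coldest(pages, 3);
    return 0;
}
```

Applications never see the tiers; they simply take a page fault when data has been demoted, exactly as with a conventional swapper.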
Several new approaches arise for intermediate models, where a fast main CPU can spawn off simple, memory-intensive operations to the memory. From a programmer's point of view, this can often be hidden in lower-level APIs. For instance, we could build a (hairy) database on top of a massively parallel PIM, but a programmer would still interface with it through standard SQL queries. In the standard C library, we could replace functions like memcpy() with a version that invalidates the relevant parts of the CPU cache and spawns the operation off to memory; the memory would then block all accesses to parts of memory where it had not yet finished processing (see the sketch below). For a simple but quite suboptimal approach, one could also treat programs as object-oriented blocks of data plus the operations on that data, and have message passing return immediately rather than wait.
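Here is a sketch of that memcpy() replacement. Everything prefixed pim_ is hypothetical; the stubs stand in for platform primitives so the sketch compiles, and on real hardware pim_submit_copy() would return immediately while the memory blocks accesses to regions it has not finished:

```c
/* Sketch of a PIM-aware memcpy(). All pim_-prefixed names are
 * hypothetical platform primitives, stubbed out here. */
#include <stddef.h>
#include <string.h>

static void pim_cache_invalidate(const void *addr, size_t len)
{
    /* Stub: a real implementation would invalidate the CPU cache
     * lines covering [addr, addr + len), so the CPU cannot read
     * stale data after the in-memory engine rewrites it. */
    (void)addr; (void)len;
}

static void pim_submit_copy(void *dst, const void *src, size_t len)
{
    /* Stub: a real implementation would enqueue the copy to logic in
     * the memory and return without waiting for completion. */
    memcpy(dst, src, len);
}

void *pim_memcpy(void *dst, const void *src, size_t n)
{
    pim_cache_invalidate(dst, n);  /* drop cached destination lines */
    pim_submit_copy(dst, src, n);  /* conceptually asynchronous copy */
    return dst;
}
```

An application would call pim_memcpy() exactly as it calls memcpy(); the difference is only in when the work happens and who does it.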
Several architectures introduce unique sets of problems. For instance, FPGAs require the programmer or compiler to configure the FPGA, and then add the swapping difficulties noted above.
Plagiarism warning and disclaimers: Except for short snippets in the programming section, the information on this page is blatantly stolen from the web pages listed within, as well as several research papers. This document was written in March 2000, and is in the public domain.