By Piotr Mitros
Memory speed has not kept up with CPU speed. On the order of half of a modern CPU's die area is devoted to dealing with this problem, mostly in the form of caches, but also as added instruction-set complexity (prefetch instructions, etc.). One solution to this problem is to integrate processing and memory onto a single die. Memory bandwidth increases considerably as we move closer to the physical memory:
| Location | Bandwidth |
|---|---|
| Sense Amps | 2.9 TB/sec |
| Column Decode | 49 GB/sec |
| Chip Pins | 6.2 GB/sec |
| System Bus | 190 MB/sec |
| CPU Cache | 600 MB/sec |
(Source: the C-RAM project.) Although these numbers are somewhat dated, they give an idea of the orders-of-magnitude difference between on-chip and off-chip accesses to memory. When reading the table, note that for many types of operations, most of the data at the sense amps is useless.
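To make the gap concrete, here is a quick back-of-the-envelope calculation in C using the figures above (treating TB, GB, and MB as decimal units; the exact ratios depend on the original measurements):

```c
/* Rough ratios between the bandwidth figures in the table above. */
#include <stdio.h>

int main(void)
{
    double sense_amps = 2.9e12;  /* 2.9 TB/sec */
    double chip_pins  = 6.2e9;   /* 6.2 GB/sec */
    double system_bus = 190e6;   /* 190 MB/sec */

    printf("sense amps vs. chip pins:  %.0fx\n", sense_amps / chip_pins);   /* ~468x   */
    printf("sense amps vs. system bus: %.0fx\n", sense_amps / system_bus);  /* ~15263x */
    return 0;
}
```

In other words, the memory arrays themselves can deliver on the order of ten thousand times more bandwidth than the system bus exposes.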
The manufacturing techniques for memory are different from those for logic. In logic, one wants to minimize capacitances for high-speed operation; in memory, one wants to maximize them, for cell stability and slow refresh rates. Logic also has a much less regular structure than memory. As such, it is quite difficult to manufacture high-density memory on the same die as high-speed logic.
One cannot achieve the same memory capacity on a single chip as on multiple chips. As such, designers of computing-in-memory systems must either settle for much smaller amounts of memory, or accept a combination of high-speed local memory and low-speed off-chip memory. With massively parallel SIMD or MIMD arrays, communication between memory chips can become the primary bottleneck, and many of the current approaches to PIM do not address this problem adequately.
Fortunately, memory density has grown more quickly than demand for memory. As such, there is hope that we may be able to integrate enough memory on chip for most mainstream applications.
PIM would also decommoditize the memory market; 128MB of Micron memory would no longer be interchangeable with 128MB of IBM memory. This could drive consumer prices up and hurt interoperability, and it has been one of the dominant reasons that memory architectures known to be suboptimal persist.
PIM projects vary a fair bit in overall architecture. At one end, several groups have made system-on-a-chip designs consisting of a normal CPU with integrated on-board memory for fast access. Others have instead opted to distribute simple computing elements within main memory, controlled by a separate CPU. Here, the main CPU spawns off memory-intensive operations to the intelligent memory, while handling serial, CPU-intensive work internally.
This second approach has varied from supercomputer-like massively parallel SIMD or MIMD arrays, to adding simple logic that coprocesses standard memory operations like refresh.
Active Pages integrate a memory chip with an FPGA capable of performing simple operations on the data in memory. This has the advantage that, due to its regular organization, an FPGA is considerably easier to manufacture in a memory process than conventional logic. However, FPGAs are also somewhat more difficult to program than conventional processors. In addition, swapping FPGA configurations to disk is slow compared to memory.
At the moment, Active Pages lack a mechanism for interchip communication, so all such communication must be mediated by the CPU. This is likely to become a bottleneck for some types of operations, especially ones with pointer-based structures.
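To illustrate the programming model (and only that: every name below is hypothetical, not the actual Active Pages interface), an application might hand a simple per-word operation to the logic sitting next to a page of memory. A plain software loop stands in for that logic here:

```c
/* Sketch of an Active-Pages-style per-page offload. All names are
 * hypothetical; the software loop in pim_page_apply() stands in for
 * FPGA logic that would run beside the memory page itself. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t (*page_op_fn)(uint32_t word);

/* On real hardware this would dispatch to the logic attached to the
 * page; here an ordinary loop models the semantics. */
static void pim_page_apply(uint32_t *page, size_t nwords, page_op_fn op)
{
    for (size_t i = 0; i < nwords; i++)
        page[i] = op(page[i]);
}

static uint32_t double_word(uint32_t w) { return w * 2; }

int main(void)
{
    uint32_t data[4] = { 1, 2, 3, 4 };
    pim_page_apply(data, 4, double_word);
    for (size_t i = 0; i < 4; i++)
        printf("%u\n", (unsigned)data[i]);
    return 0;
}
```

Anything that crosses a page boundary, such as chasing a pointer into another chip, falls back on the CPU, which is exactly where the bottleneck above appears.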
Programming PIM CPUs varies wildly based on the architecture of the PIM. As such, it is impossible to present a single approach to programming PIM systems; PIMs span the whole range of programming difficulty, depending on architecture.
For instance, programming a system with 1T-SRAM would be identical to programming a CPU with normal memory. A system-on-chip or IRAM-like approach would also be identical, assuming no secondary off-chip memory. With off-chip memory, the OS swapper could be modified to keep active data in on-chip memory, swapping to off-chip memory and then to disk, hiding the hierarchy from application-level software (sketched below). For massively parallel SIMD/MIMD arrays, programming would be identical to the existing models for those machines.
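As a minimal sketch of that modified-swapper idea (the tier names and data structures are invented for illustration; a real kernel would fold this into its existing page-reclaim machinery):

```c
/* Toy model of a three-tier swapper for a PIM system with secondary
 * off-chip memory. All names are hypothetical. */
#include <stdio.h>

typedef enum { TIER_ONCHIP, TIER_OFFCHIP, TIER_DISK } tier_t;

struct page_info {
    int id;
    unsigned long last_access;  /* e.g. from referenced-bit scans */
    tier_t tier;
};

/* When the on-chip tier fills, demote its least-recently-used page to
 * off-chip memory; a cold off-chip page would go to disk the same way. */
static void demote_coldest(struct page_info *pages, int n)
{
    struct page_info *coldest = NULL;
    for (int i = 0; i < n; i++)
        if (pages[i].tier == TIER_ONCHIP &&
            (coldest == NULL || pages[i].last_access < coldest->last_access))
            coldest = &pages[i];
    if (coldest != NULL) {
        coldest->tier = TIER_OFFCHIP;
        printf("demoted page %d to off-chip memory\n", coldest->id);
    }
}

int main(void)
{
    struct page_info pages[] = {
        { 0, 100, TIER_ONCHIP },
        { 1,  10, TIER_ONCHIP },  /* least recently used: demoted first */
        { 2,  50, TIER_ONCHIP },
    };
    demote_coldest(pages, 3);
    return 0;
}
```

Applications never see the tiers; they simply take a page fault when data has been demoted, exactly as with a conventional swapper.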
Several new approaches arise for intermediate models, where a fast main CPU can spawn off simple, memory-intensive operations to the memory. From a programmer's point of view, this can often be hidden in lower-level APIs. For instance, we could build a (hairy) database on top of a massively parallel PIM, but a programmer would still interface with it through standard SQL queries. In the standard C library, we could replace functions like memcpy() with a version that invalidates the relevant parts of the CPU cache and spawns the operation off to memory; the memory would then block all accesses to parts of memory where it had not yet finished processing (see the sketch below). For a simple but quite suboptimal approach, one could also treat programs as object-oriented blocks of data plus the operations on that data, and have message passing return immediately rather than wait.
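Here is a sketch of that memcpy() replacement. Everything prefixed pim_ is hypothetical; the stubs stand in for platform primitives so the sketch compiles, and on real hardware pim_submit_copy() would return immediately while the memory blocks accesses to regions it has not finished:

```c
/* Sketch of a PIM-aware memcpy(). All pim_-prefixed names are
 * hypothetical platform primitives, stubbed out here. */
#include <stddef.h>
#include <string.h>

static void pim_cache_invalidate(const void *addr, size_t len)
{
    /* Stub: a real implementation would invalidate the CPU cache
     * lines covering [addr, addr + len), so the CPU cannot read
     * stale data after the in-memory engine rewrites it. */
    (void)addr; (void)len;
}

static void pim_submit_copy(void *dst, const void *src, size_t len)
{
    /* Stub: a real implementation would enqueue the copy to logic in
     * the memory and return without waiting for completion. */
    memcpy(dst, src, len);
}

void *pim_memcpy(void *dst, const void *src, size_t n)
{
    pim_cache_invalidate(dst, n);  /* drop cached destination lines */
    pim_submit_copy(dst, src, n);  /* conceptually asynchronous copy */
    return dst;
}
```

An application would call pim_memcpy() exactly as it calls memcpy(); the difference is only in when the work happens and who does it.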
Several architectures introduce unique sets of problems. For instance, FPGAs require the programmer or compiler to configure the FPGA, and then add the swapping difficulties noted above.
Plagiarism warning and disclaimers: Except for short snippets in the programming section, the information on this page is blatantly stolen from the web pages listed within, as well as several research papers. This document was written in March 2000, and is in the public domain.