Designing a parallel architecture is difficult because the parameters that define a "good" machine are highly application-dependent. Machines that support only a restricted set of applications are faster and more efficient, since their hardware resources can be tailored to the needs of the software. However, such machines are unlikely to be produced in volume and are therefore not cost-effective. Since many budgets do not have room for computers costing millions of dollars, there is a need for commodity general-purpose parallel machines.
The goal of the Hamal project is to investigate design principles for general-purpose shared-memory parallel computers. Specific interests include improving silicon efficiency (roughly defined as performance per unit area), developing a memory system that will scale to billions of processor/memory nodes, and tightly integrating processor and memory. Our approach is to develop a flexible cycle-accurate simulator for the Hamal architecture, a parallel shared-memory machine that integrates a number of new and existing architectural ideas. Measuring benchmark performance across various hardware configurations and machine loads will allow us to evaluate the mechanisms of the Hamal design.
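To give a rough sense of what a cycle-accurate simulator's core loop looks like, here is a minimal sketch in which every component advances in lockstep, one cycle at a time. The component interface and names are entirely hypothetical, not taken from the actual Hamal simulator:

```python
# Minimal sketch of a cycle-accurate simulation loop (illustrative only,
# not the Hamal simulator). Each component exposes a tick(cycle) method.
class Counter:
    """Toy 'component' that simply counts the cycles it has seen."""
    def __init__(self):
        self.cycles = 0

    def tick(self, cycle):
        self.cycles += 1


def simulate(components, n_cycles):
    """Advance every component in lockstep, one cycle at a time."""
    for cycle in range(n_cycles):
        for c in components:
            c.tick(cycle)


cpu, mem = Counter(), Counter()
simulate([cpu, mem], 100)
print(cpu.cycles, mem.cycles)  # both components saw 100 cycles
```

A real simulator would replace the toy components with pipelined processor, memory, and network models, but the lockstep structure, which is what makes the simulation cycle-accurate, stays the same.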
The laws of physics fix the speed at which information propagates through a dielectric medium, which translates into a fundamental lower bound on latency for a computer of a given physical size. Process technology improvements have brought us to the point where even the desktop PC, in principle, operates near this lower bound. Hence, a central feature of any contemporary computer architecture is how it addresses latency.
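The bound is easy to work out as a back-of-the-envelope calculation. The relative permittivity below is an assumed value typical of PCB dielectrics, not a figure from this project:

```python
# Signal propagation speed in a dielectric is c / sqrt(eps_r).
C = 3.0e8       # speed of light in vacuum, m/s
EPS_R = 4.0     # assumed relative permittivity (typical PCB dielectric)
v = C / EPS_R ** 0.5   # ~1.5e8 m/s

def min_latency_ns(distance_m):
    """Physical lower bound on one-way signal latency, in nanoseconds."""
    return distance_m / v * 1e9

# Crossing a 30 cm board takes at least ~2 ns: two full cycles at 1 GHz.
print(min_latency_ns(0.30))
```

At these scales, no amount of process improvement can hide the latency of a cross-machine reference, which is why architectural mechanisms for tolerating or reducing latency matter.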
The Q-Machine is a novel architecture in which all communication and data movement is achieved via hardware queues. The Q-Machine manages latency through the efficient migration of computation and data throughout a spatially distributed machine. An introspective mechanism observes communication patterns and attempts to optimize the placement of memory and lightweight threads to reduce latency. This migration occurs without requiring explicit programmer effort, and without sacrificing single-threaded code performance. The mechanisms used for latency management can also be easily extended to enhance system-level fault tolerance, through the migration of data and threads out of failing nodes.
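As a toy sketch of the flavor of such an introspective heuristic (this is not the actual Q-Machine mechanism; the policy and names are invented for illustration), one could count each thread's remote accesses per node and migrate the thread toward the node it touches most:

```python
from collections import Counter

def best_node(access_log):
    """Placement heuristic sketch: given the node ids touched by one
    thread, return the node it accessed most often -- the candidate
    destination for migrating that thread."""
    counts = Counter(access_log)
    return counts.most_common(1)[0][0]

# A thread whose accesses mostly land on node 3 should migrate there.
print(best_node([3, 1, 3, 3, 2, 3]))  # -> 3
```

A real mechanism would also weigh migration cost against the expected latency savings, and would observe access patterns in hardware rather than in a log.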
Memory management on a massively parallel architecture presents several challenges: an object on one processing node may reference objects on many other nodes; objects may be distributed over many nodes; and the persistent heap may greatly exceed the size of physical memory. Additionally, because inter-processor references in a dedicated parallel architecture are low-latency, memory management overhead on inter-processor operations must be minimal to avoid significantly impacting performance. In light of these challenges, specific goals for this project are:
Empirical performance data is an important resource when designing a high-performance computing system. The ability to quickly and easily test hypothetical architectures can avoid potentially costly errors caused by educated guesses made in the absence of sufficient analytical tools. Unfortunately, simulating the interactions of the various components in a complex system is a computationally difficult task. In the case of high-performance, data-intensive computers, one must accurately account for perhaps thousands of processors, gigabytes of data, an equally complex interprocessor communication network, and the interactions between software drivers, compilers, and the hardware itself. It is as if one needs a supercomputer to design a supercomputer.
A flexible prototyping system (FPS) based on reconfigurable components provides a solution to the architect's dilemma. The FPS is a scalable, medium-performance parallel computer built from FPGAs, making it possible to implement a variety of high-performance architectures on a single piece of hardware. It also provides a number of hardware-assisted performance monitoring and profiling features to aid in the collection of empirical performance data. The FPS is intended to run simulations of moderately complex parallel supercomputers at a factor of 10 to 100 slowdown relative to real time, with little or no performance penalty for the collection of profiling information.
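The slowdown figures translate directly into wall-clock cost. A hypothetical comparison (the 10,000x software-simulator slowdown below is an assumed figure for illustration, not a measurement):

```python
def sim_time_s(target_s, slowdown):
    """Wall-clock seconds needed to simulate target_s seconds of
    target-machine time at a given slowdown factor."""
    return target_s * slowdown

# Simulating one second of target time:
print(sim_time_s(1.0, 100))     # FPS at 100x slowdown: 100 s of wall clock
print(sim_time_s(1.0, 10_000))  # assumed 10,000x software simulator: ~2.8 h
```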
Here is a list of possible projects that UROP or MEng students may wish to try. Contact the graduate student listed in italics at the end of the project description to get more information and supervision.