Designing a parallel architecture is difficult because the parameters that define a "good" machine are highly application-dependent. Machines that support only a restricted set of applications are faster and more efficient, since their hardware resources can be tailored to the needs of the software. However, such machines are unlikely to be produced in volume and are therefore not cost-effective. Since many budgets do not have room for computers costing millions of dollars, there is a need for commodity general-purpose parallel machines.
The goal of the Hamal project is to investigate design principles for general-purpose shared-memory parallel computers. Specific interests include improving silicon efficiency (roughly defined as performance per unit area), developing a memory system that will scale to billions of processor/memory nodes, and tightly integrating processor and memory. Our approach is to develop a flexible cycle-accurate simulator for the Hamal architecture, a parallel shared-memory machine that integrates a number of new and existing architectural ideas. Measuring benchmark performance across various hardware configurations and machine loads will allow us to evaluate the mechanisms of the Hamal design.
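To give a rough sense of what a cycle-accurate simulator's core loop looks like, here is a minimal sketch in which every component advances in lockstep, one cycle at a time. The component interface and names are entirely hypothetical, not taken from the actual Hamal simulator:

```python
# Minimal sketch of a cycle-accurate simulation loop (illustrative only,
# not the Hamal simulator). Each component exposes a tick(cycle) method.
class Counter:
    """Toy 'component' that simply counts the cycles it has seen."""
    def __init__(self):
        self.cycles = 0

    def tick(self, cycle):
        self.cycles += 1


def simulate(components, n_cycles):
    """Advance every component in lockstep, one cycle at a time."""
    for cycle in range(n_cycles):
        for c in components:
            c.tick(cycle)


cpu, mem = Counter(), Counter()
simulate([cpu, mem], 100)
print(cpu.cycles, mem.cycles)  # both components saw 100 cycles
```

A real simulator would replace the toy components with pipelined processor, memory, and network models, but the lockstep structure, which is what makes the simulation cycle-accurate, stays the same.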
The laws of physics fix the speed at which information propagates through a dielectric medium, which translates into a fundamental lower bound on latency for a computer of a given physical size. Process technology improvements have brought us to the point where even the desktop PC, in principle, operates near this lower bound. Hence, a central feature of any contemporary computer architecture is how it addresses latency.
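The bound is easy to work out as a back-of-the-envelope calculation. The relative permittivity below is an assumed value typical of PCB dielectrics, not a figure from this project:

```python
# Signal propagation speed in a dielectric is c / sqrt(eps_r).
C = 3.0e8       # speed of light in vacuum, m/s
EPS_R = 4.0     # assumed relative permittivity (typical PCB dielectric)
v = C / EPS_R ** 0.5   # ~1.5e8 m/s

def min_latency_ns(distance_m):
    """Physical lower bound on one-way signal latency, in nanoseconds."""
    return distance_m / v * 1e9

# Crossing a 30 cm board takes at least ~2 ns: two full cycles at 1 GHz.
print(min_latency_ns(0.30))
```

At these scales, no amount of process improvement can hide the latency of a cross-machine reference, which is why architectural mechanisms for tolerating or reducing latency matter.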
The Q-Machine is a novel architecture in which all communication and data movement is achieved via hardware queues. The Q-Machine manages latency through the efficient migration of computation and data throughout a spatially distributed machine. An introspective mechanism observes communication patterns and attempts to optimize the placement of memory and lightweight threads to reduce latency. This migration occurs without requiring explicit programmer effort, and without sacrificing single-threaded code performance. The mechanisms used for latency management can also be easily extended to enhance system-level fault tolerance, through the migration of data and threads out of failing nodes.
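As a toy sketch of the flavor of such an introspective heuristic (this is not the actual Q-Machine mechanism; the policy and names are invented for illustration), one could count each thread's remote accesses per node and migrate the thread toward the node it touches most:

```python
from collections import Counter

def best_node(access_log):
    """Placement heuristic sketch: given the node ids touched by one
    thread, return the node it accessed most often -- the candidate
    destination for migrating that thread."""
    counts = Counter(access_log)
    return counts.most_common(1)[0][0]

# A thread whose accesses mostly land on node 3 should migrate there.
print(best_node([3, 1, 3, 3, 2, 3]))  # -> 3
```

A real mechanism would also weigh migration cost against the expected latency savings, and would observe access patterns in hardware rather than in a log.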
Memory management on a massively parallel architecture presents several challenges: an object on one processing node may reference objects on many other nodes; objects may be distributed over many nodes; and the persistent heap may greatly exceed the size of physical memory. Additionally, because inter-processor references in a dedicated parallel architecture are low-latency, memory management overhead on inter-processor operations must be minimal to avoid significantly impacting performance. In light of these challenges, specific goals for this project are:
Empirical performance data is an important resource when designing a high-performance computing system. The ability to quickly and easily test hypothetical architectures can avoid potentially costly errors caused by educated guesses made in the absence of sufficient analytical tools. Unfortunately, simulating the interactions of the various components in a complex system is a computationally difficult task. In the case of high-performance, data-intensive computers, one must accurately account for perhaps thousands of processors, gigabytes of data, an equally complex interprocessor communication network, and the interactions between software drivers, compilers, and the hardware itself. It is as if one needs a supercomputer to design a supercomputer.
A flexible prototyping system (FPS) based on reconfigurable components provides a solution to the architect's dilemma. The FPS is a scalable, medium-performance parallel computer built from FPGAs, making it possible to implement a variety of high-performance architectures on a single piece of hardware. It also provides a number of hardware-assisted performance monitoring and profiling features to aid in the collection of empirical performance data. The FPS is intended to run simulations of moderately complex parallel supercomputers at a factor of 10 to 100 slowdown relative to real time, with little or no performance penalty for the collection of profiling information.
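The slowdown figures translate directly into wall-clock cost. A hypothetical comparison (the 10,000x software-simulator slowdown below is an assumed figure for illustration, not a measurement):

```python
def sim_time_s(target_s, slowdown):
    """Wall-clock seconds needed to simulate target_s seconds of
    target-machine time at a given slowdown factor."""
    return target_s * slowdown

# Simulating one second of target time:
print(sim_time_s(1.0, 100))     # FPS at 100x slowdown: 100 s of wall clock
print(sim_time_s(1.0, 10_000))  # assumed 10,000x software simulator: ~2.8 h
```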
Here is a list of possible projects that UROP or MEng students may wish to try. Contact the graduate student listed in italics at the end of the project description to get more information and supervision.