MIT Media Laboratory

Perceptual Computing

The Perceptual Computing Section of the Media Laboratory is concerned with making computers understand their environment, with special emphasis on understanding and cooperating with people. The Section consists of eight faculty, six postgraduate researchers, and more than forty graduate students in the areas of vision, music, speech, and multimodal interaction. Faculty participating in the Media Lab's ICCV open house include Aaron Bobick, Mike Bove, Roz Picard, Ted Adelson, and Alex Pentland.

Demonstration Abstracts

Lightness, Transparency, and Mid-Level Vision
Some new brightness illusions will be demonstrated. These illusions indicate the importance of mid-level mechanisms involving transparency, occlusion, and lighting.

Object-oriented television and Cheops processing system
Structured video represents a television program as 2-D and 3-D objects rather than as pixels or frames. These objects are "transmitted" together with a script that tells how to assemble them into a program. Cheops is a data-flow computer built in the lab that displays such structured video programs in real time.
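The script-driven assembly idea can be sketched in a few lines. This is an illustrative toy, not the Cheops data format: the object names, the script tuples, and the alpha-blend compositing rule are all assumptions for the sketch.

```python
import numpy as np

def composite(canvas, sprite, alpha, x, y):
    """Alpha-blend a rectangular sprite onto the canvas at (x, y)."""
    h, w = sprite.shape[:2]
    region = canvas[y:y+h, x:x+w]
    canvas[y:y+h, x:x+w] = alpha * sprite + (1 - alpha) * region
    return canvas

# A "script" places each named object at a position with an opacity.
# (Hypothetical format -- the real system's scripts are richer.)
script = [("background", 1.0, 0, 0), ("actor", 0.8, 16, 8)]
objects = {
    "background": np.full((64, 64, 3), 0.2),
    "actor": np.full((24, 24, 3), 0.9),
}
frame = np.zeros((64, 64, 3))
for name, alpha, x, y in script:
    frame = composite(frame, objects[name], alpha, x, y)
```

Because only the objects and the script are transmitted, the receiver can re-render the program at any resolution or with objects substituted.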

Automated Extraction and Resynthesis of Walkers
Automated layer decomposition of walkers in an image sequence: an approach for efficient coding and resynthesis of walking motion using component layers.

Put That There/Models from Video
We will show a wide-baseline stereo system for tracking people in 3-D based on symbolic correspondence. The system is self-calibrated, and its output is used for gestural control in a 3-D audio-visual environment. We will also show a system for building 3-D models from video, based on our structure-from-motion research described in last month's IEEE PAMI.

Ambient Microphone Speech Recognition
This demonstration illustrates the use of an array of microphones along with visual cues to perform speech recognition "at a distance" in a noisy, open environment.

Semiautomatic 3D model building and lens distortion correction
Reconstructing camera parameters, planar 3-D geometry, and surface texture from one or more views of a scene with pre-selected parallel and coplanar edges. This technique has been used to generate a 3-D textured database from a set of still images taken with an uncalibrated 35mm camera. It has also been used to determine the 3-D positions of actors from video.
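The key geometric step behind such calibration is that pre-selected parallel scene edges converge at a vanishing point in the image, which is found by intersecting the edge lines in homogeneous coordinates. A minimal sketch, with made-up pixel coordinates:

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, returned in pixels."""
    v = np.cross(l1, l2)
    return v[:2] / v[2]

# Two image projections of scene-parallel edges (hypothetical coordinates):
l1 = line_through((0, 0), (100, 50))      # y = 0.5 x
l2 = line_through((0, 100), (100, 120))   # y = 0.2 x + 100
vp = intersect(l1, l2)                    # their vanishing point
```

Vanishing points of orthogonal scene directions in turn constrain the focal length and camera orientation, which is how parameters can be recovered from an uncalibrated camera.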

Physics-based scene understanding
Knowledge-intensive vision systems can understand scenes with complex visual and causal structure. This demo shows the visual analysis and explanation of a variety of artifacts, including mechanical transmissions.

Phase Space Recognition of Human Body Motion
This work presents a method for representing and recognizing human body motion. It identifies sets of constraints that are diagnostic of a movement; different constraints identify different movements.

Vision-Steered Phased-Array Microphones
A beam-forming microphone array is used to capture noisy speech input from the ALIVE space. Using the position information provided by the vision system, we obtain audio signal enhancements of up to 10 dB.
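The principle can be sketched with a delay-and-sum beamformer: given the talker position (supplied by the vision system in the real demo), each microphone's arrival delay is known, so the channels can be re-aligned and averaged; the speech adds coherently while the noise averages down. The mic count, delays, and noise level below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, n_samples = 8, 4000
delays = rng.integers(0, 40, size=n_mics)  # per-mic arrival delay in samples

# Periodic test tone standing in for speech (period 100 divides 4000,
# so np.roll models the delay exactly).
signal = np.sin(2 * np.pi * 0.01 * np.arange(n_samples))
mics = [np.roll(signal, d) + 0.5 * rng.standard_normal(n_samples)
        for d in delays]

# Delay-and-sum: undo each mic's known delay, then average. Uncorrelated
# noise power drops by ~1/N, i.e. up to 10*log10(N) dB of array gain.
beam = np.mean([np.roll(x, -d) for x, d in zip(mics, delays)], axis=0)
```

With 8 microphones the ideal array gain is 10·log10(8) ≈ 9 dB, consistent with the ~10 dB figure quoted above.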

ALIVE, Active Face Tracking/Recognition/Pose Estimation
We will show active face tracking, recognition and pose estimation in the ALIVE system. Users can walk about a room and interact with autonomous virtual creatures in a `Magic Mirror' paradigm; the creatures can recognize, track, and respond to the user's face as well as body position and hand gestures.

Recognizing Facial Expressions
We describe our methods for extracting detailed representations of facial motion from video. We will show how these representations can be used for coding, analysis, recognition, tracking, and synthesis of facial expressions.

Transaural Rendering
The STIVE demo will feature a three-dimensional audio system which uses only two speakers to create the illusion of sounds emanating from arbitrary directions around the listener.

Closed-World Tracking
Tracking for video annotation using contextual information to dynamically select tracked features. Example domain: football plays.

Wold-based Texture Modeling
We apply the Wold-based texture model to image database retrieval. The Wold model provides perceptually sensible features corresponding to the dimensions reported as most important in human texture perception -- periodicity, directionality, and randomness.
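As a rough illustration of a periodicity-style feature, one can measure how much of a texture's non-DC spectral energy sits in a single harmonic peak. This crude score is only a stand-in for the Wold decomposition's harmonic component, not the actual model:

```python
import numpy as np

def periodicity_score(image):
    """Fraction of non-DC spectral energy in the strongest peak.
    Illustrative stand-in for a Wold-style periodicity feature."""
    spec = np.abs(np.fft.fft2(image)) ** 2
    spec[0, 0] = 0.0  # discard the DC term
    return spec.max() / spec.sum()

x = np.mgrid[0:64, 0:64][1]
periodic = np.sin(2 * np.pi * x / 8)                        # strongly periodic
random = np.random.default_rng(1).standard_normal((64, 64))  # purely stochastic
```

A sinusoidal texture concentrates its energy in two conjugate bins (score near 0.5), while a random texture spreads it across the spectrum (score near zero), so the feature separates the periodic from the random end of the perceptual axis.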

Video Orbits for Mosaicing and Resolution Enhancement/Wearable Computers
A new featureless multiscale method for estimating the homographic coordinate transformation between a pair of images. The method is used to make pictures with a "visual filter" equipped with image acquisition and display capability: standing in a single location, a scene is scanned onto a large "video canvas," where each new frame undergoes the appropriate homographic coordinate transformation to insert it correctly into the image mosaic. I will also show my work on wearable computers and NetCam.
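Once the homography between a frame and the canvas is estimated, inserting the frame is a coordinate warp. A minimal forward-mapping sketch (nearest-neighbor, and taking the homography as given rather than estimating it, which is where the actual multiscale method does its work):

```python
import numpy as np

def warp_into_canvas(canvas, frame, H):
    """Place `frame` into `canvas` under the 3x3 homography H that maps
    homogeneous frame coordinates (x, y, 1) to canvas coordinates.
    Forward mapping with nearest-neighbor rounding -- a simplification."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    mapped = H @ pts
    mx = np.round(mapped[0] / mapped[2]).astype(int)
    my = np.round(mapped[1] / mapped[2]).astype(int)
    ok = (mx >= 0) & (mx < canvas.shape[1]) & (my >= 0) & (my < canvas.shape[0])
    canvas[my[ok], mx[ok]] = frame.ravel()[ok]
    return canvas

# Simplest case: a pure-translation homography shifting the frame 10 px right.
H = np.array([[1.0, 0, 10], [0, 1.0, 0], [0, 0, 1.0]])
canvas = warp_into_canvas(np.zeros((32, 64)), np.ones((32, 32)), H)
```

Repeating this for every frame, each with its own estimated homography, grows the mosaic across the whole scanned scene.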

Photobook: Content-Based Image Retrieval
Content-based image annotation is complicated by the fact that feature salience varies with context. FourEyes indexes images using several features, consulted independently based on user interaction.

Large Database Face Recognition, and Active Face Recognition/Tracking/Pose Recognition
An automatic system for detection, recognition and model-based coding of human faces is presented. The system is able to detect human faces (at various scales and different poses) in the input scene and geometrically align them prior to recognition and compression. The system has been tested successfully on over 2,000 faces from ARPA's FERET program.
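A subspace approach to the recognition step can be sketched as follows. This assumes an eigenface-style method (PCA on the aligned face gallery), which the abstract does not spell out; the gallery size, image size, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
gallery = rng.standard_normal((20, 256))  # 20 aligned faces, 16x16, flattened

mean = gallery.mean(axis=0)
X = gallery - mean
# Principal components ("eigenfaces") of the centered gallery via SVD.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
eigenfaces = Vt[:10]                      # keep the top 10 components

def recognize(probe):
    """Return the index of the nearest gallery face in the subspace."""
    coeffs = eigenfaces @ (probe - mean)
    gallery_coeffs = (eigenfaces @ X.T).T
    return int(np.argmin(np.linalg.norm(gallery_coeffs - coeffs, axis=1)))

probe = gallery[7] + 0.1 * rng.standard_normal(256)  # noisy view of face 7
```

The same low-dimensional coefficients that index a face also serve as its compressed code, which is why recognition and model-based coding share one representation.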

Thin-plate Models for Motion Analysis and Object Recognition
We present a deformable model for nonrigid motion tracking (e.g., heart motion). A similar model can be used for object recognition (e.g., face recognition).

Detecting Kinetic Occlusion
Detecting motion boundaries in image sequences through local spatiotemporal junction analysis; deducing ordinal depth locally from accretion and deletion cues.

SmartCams
A SmartCam is a robotic camera which operates in a TV studio without a cameraman, using computer vision to find objects and people in complex scenes. The development of SmartCams requires new methods and ideas in context-based vision, action recognition, and the architecture of computer vision systems.

High-Dimensional Probabilistic Modeling
Improved probabilistic models often mean better performance in a variety of systems. Accurate modeling usually requires high-dimensional modeling, with its attendant difficulties. We explore some approaches to high-dimensional modeling, and explore their application to image compression and restoration, and to texture synthesis and classification.

M-Lattice -- Nonlinear Dynamics For Vision and Image Processing
This research investigates the mathematical properties of the reaction-diffusion model, originated by Alan Turing to explain morphogenesis, and of its derivative, the new "M-Lattice" system. We demonstrate these models' applications to computational vision and image processing.

Real-time Visual Recognition of American Sign Language/Wearable Computing
Full-sentence, 40-word lexicon ASL is recognized with an accuracy of 99.2% in real time without explicit modelling of the fingers. One color camera is used for tracking. I will also show my work on wearable computers and remembrance agents.

Scene Cut Detection and Motion Texture Modeling
1) A robust algorithm for finding cuts in video -- to "skip ahead to the next shot." 2) A stochastic motion model for estimating and resynthesizing spatio-temporal patterns (water, smoke, etc.).
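A common baseline for the cut-finding part is a frame-to-frame histogram difference: a cut produces a sharp change in the intensity distribution. The sketch below is that baseline, with an illustrative fixed threshold rather than the robust statistic the demo's algorithm would use.

```python
import numpy as np

def find_cuts(frames, bins=16, threshold=0.5):
    """Flag a cut wherever consecutive grayscale histograms differ sharply
    (L1 distance on normalized histograms). Threshold is illustrative."""
    cuts, prev = [], None
    for i, f in enumerate(frames):
        h, _ = np.histogram(f, bins=bins, range=(0.0, 1.0))
        h = h / h.sum()
        if prev is not None and np.abs(h - prev).sum() > threshold:
            cuts.append(i)
        prev = h
    return cuts

# Synthetic video: 5 dark frames then 5 bright frames -> one cut at index 5.
dark = [np.full((8, 8), 0.1)] * 5
bright = [np.full((8, 8), 0.9)] * 5
```

Histogram differencing is robust to motion within a shot (which shuffles pixels but barely changes the distribution), which is why it outperforms raw pixel differencing for shot-boundary detection.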

Query by Content in Video Sequences
Unsupervised, cross-modal characterization of discourse in Tonight Show monologues: preliminary results from analysis of audio and visual-kinesic features, processed with the ISODATA clustering algorithm, demonstrate a bottom-up approach to discourse analysis.

Layered Image Representation
We will demonstrate novel techniques in motion estimation and segmentation based on mid-level vision concepts for applications in image coding, data compression, video special effects, and 3D structure recovery.

Non-Rigid Motion Segmentation: Psychophysics and Modeling
Estimating non-rigid motion requires integrating some constraints while segmenting out others. We will show psychophysical demonstrations which reveal how the human visual system solves this dilemma.

Learning Visual Behavior for Gesture Analysis
The "visual behavior" of gesture is recovered from a number of example image sequences by concurrently training the temporal model and multiple models of the visual scene. The training process is demonstrated.
