Past Projects

CALO Perceptive Laptop Interface

As part of the DARPA CALO project led by SRI, the Vision Interface group at MIT CSAIL has developed a perceptive laptop interface to support meeting understanding and multimodal human-computer interaction. We are developing algorithms to assess the conversational state of meeting participants or users of the CALO environment from a personalized device, such as a laptop or tablet computer. Our “ptablet” device is purely passive and offers the following cues to conversation or interaction state: presence, attention, turn-taking, agreement and grounding gestures, emotion and expression cues, and visual speech features. More...


 


Recognition and Retrieval with the Pyramid Match

Local image features have emerged as a powerful way to describe images of objects and scenes.  Their stability under variable image conditions is critical for success in a wide range of recognition and retrieval applications.  However, comparing images represented by their collections of local features is challenging, since each set may vary in cardinality and its elements lack a meaningful ordering.  Existing methods compare feature sets by searching for explicit correspondences between their elements, which is too computationally expensive in many realistic settings.  In this work we develop efficient methods for matching sets of local features, and we show how this matching may be used as a robust measure of similarity to perform content-based image retrieval, as well as a basis for learning object categories or inferring 3D pose. More...
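As a rough illustration of the matching idea, the sketch below scores two feature sets with multi-resolution histogram intersection, counting matches in increasingly coarse bins and discounting matches first formed at coarser levels. It assumes scalar features in [0, 1) for brevity; the full method handles multi-dimensional features and normalizes for set size.

    import numpy as np

    def pyramid_match(X, Y, levels=5):
        # X, Y: 1-D arrays of feature values in [0, 1); a toy stand-in for
        # sets of local image features (the sets may differ in cardinality)
        score, prev = 0.0, 0.0
        for i in range(levels):
            width = 2.0 ** i / 2.0 ** (levels - 1)   # bin width doubles per level
            edges = np.arange(0.0, 1.0 + width, width)
            hx, _ = np.histogram(X, edges)
            hy, _ = np.histogram(Y, edges)
            matches = np.minimum(hx, hy).sum()       # histogram intersection
            score += (matches - prev) / 2.0 ** i     # new matches weighted 1/2^i
            prev = matches
        return score

Because only histogram intersections are computed, the cost grows linearly with the number of features, rather than cubically as for optimal explicit correspondences.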

 

Learning Task-Specific Similarity

A notion of similarity is central to many tasks in machine learning, information retrieval, and computer vision. The semantics of similarity are often determined by the specific task and may not be captured well by a standard metric. We develop a framework for learning a task-specific similarity measure from examples of what is and is not considered similar. More...
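One simple way to make this concrete (a toy sketch, not the embedding method developed in this project) is to train a classifier on labeled pairs, using the element-wise difference between two examples as input:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                     # synthetic examples
    i = rng.integers(0, 200, 500)
    j = rng.integers(0, 200, 500)
    # toy task: only the first two dimensions matter for "similarity"
    y = (np.linalg.norm(X[i, :2] - X[j, :2], axis=1) < 1.0).astype(int)

    clf = LogisticRegression().fit(np.abs(X[i] - X[j]), y)

    def similarity(a, b):
        # learned, task-specific similarity: P(pair is similar)
        return clf.predict_proba(np.abs(a - b)[None])[0, 1]

A learned model of this form can down-weight nuisance dimensions that a fixed Euclidean metric would treat as equally important.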

 

Fast example-based pose estimation

We develop Parameter-Sensitive Hashing (PSH): an algorithm for fast similarity search for regression tasks, where the goal is to find examples in the database whose values of some parameter(s) are similar to those of the query example, but the parameters cannot be computed directly from the examples. We apply PSH to the task of inferring articulated body pose from a single image of a person. More...
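The sketch below shows only the retrieval side of the idea, with toy hash bits (thresholded random features); the actual PSH algorithm instead selects hash functions that are sensitive to the parameters, using training examples paired with their parameter values.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))        # image features for the database
    theta = rng.normal(size=(1000, 3))     # e.g., pose parameters per example

    # Toy hash: one bit per (feature, median threshold) pair. Real PSH picks
    # bits so that examples with similar theta tend to collide.
    dims = rng.integers(0, X.shape[1], 20)
    thr = np.median(X[:, dims], axis=0)

    table = defaultdict(list)
    for idx, bits in enumerate((X[:, dims] > thr).astype(int)):
        table[tuple(bits)].append(idx)

    query = X[0]
    bucket = table[tuple((query[dims] > thr).astype(int))]
    # `bucket` holds candidate neighbors; their theta values estimate the
    # query's pose without ever computing pose from pixels directly

Only the short candidate list is then examined exactly, so lookup time is sublinear in the database size.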

 

WATSON: Real-time Head Pose Estimation

Our real-time object tracker uses range and appearance information from a stereo camera to recover the 3D rotation and translation of objects, or of the camera itself. The system can be connected to a face detector and used as an accurate head tracker. Additional supporting algorithms can improve the accuracy of the tracker. More...
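The tracker's appearance modeling and drift-reduction machinery are beyond a short sketch, but its geometric core is the classic rigid-alignment problem: given corresponding 3D points, such as those the stereo range data provides, recover the rotation and translation relating them. A minimal sketch of the standard least-squares solution (Kabsch/Procrustes):

    import numpy as np

    def rigid_transform(P, Q):
        # Least-squares R, t with Q ~ P @ R.T + t, from corresponding
        # 3-D point sets P, Q of shape (n, 3)
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cp).T @ (Q - cq)                  # cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cq - R @ cp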

Six-degree-of-freedom head tracking with drift reduction.

 

Learning Uncertainty Models for Audio Localization

We want computers to be able to localize sounds better. Human listeners can successfully localize sounds in noisy and reverberant environments, but computer systems can localize sounds well only in quiet, nearly anechoic environments. One reason for this performance gap is that human listeners exhibit the precedence effect: they weigh localization cues from sound onsets more heavily when making localization judgements. This effect has been well studied in the psychoacoustics literature but has not yet been integrated into a practical computer system for audio localization. We formulate the estimation of localization accuracy as a regression problem, learning to predict the uncertainty of each cue from the reverberant speech itself; weighting cues by this predicted uncertainty leads to improved audio source localization. More...
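A minimal sketch of the combination step, assuming a regressor (not shown) has already mapped each time-frequency cue to a direction estimate and a predicted standard deviation; inverse-variance weighting then lets reliable, onset-like cues dominate, echoing the precedence effect:

    import numpy as np

    def fuse(estimates, sigmas):
        # combine per-cue direction estimates, weighting each by its
        # predicted reliability (inverse variance)
        w = 1.0 / np.square(sigmas)
        return (w * estimates).sum() / w.sum()

    angles = np.array([42.0, 40.5, 65.0])   # azimuth estimates from three cues
    sigma = np.array([2.0, 2.5, 30.0])      # predicted uncertainties (degrees)
    print(fuse(angles, sigma))              # ~41.5: the unreliable cue is discounted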

Learning the relationship between reverberant speech spectrograms and localization uncertainty allows us to better combine localization information across time and frequency.

 

Hidden Conditional Random Fields for Object Recognition

We model objects as flexible constellations of parts conditioned on local observations found by an interest operator. For each class, the probability of a given assignment of model parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class-conditional CRFs into a unified framework for part-based object recognition, the hidden CRF (hCRF). The main advantage of the proposed model is that it allows us to relax the assumption of conditional independence of the observed data (i.e., local features) often made in generative approaches. More...
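In the standard hCRF formulation, with x the observed local features, y a class label, h a hidden assignment of parts, and Ψ a potential function with parameters θ, the class posterior marginalizes over the hidden part assignments:

    P(y \mid x; \theta) \;=\; \sum_{h} P(y, h \mid x; \theta)
      \;=\; \frac{\sum_{h} \exp \Psi(y, h, x; \theta)}
                 {\sum_{y', h'} \exp \Psi(y', h', x; \theta)}

Because Ψ can include features defined over arbitrary subsets of the observations, the local features need not be conditionally independent given the parts.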

 

Information Extraction From Images And Captions

One of the main challenges in building a model that extracts semantic information from an image is that it might require a significant amount of labeled data. On the other hand, extracting templates from captions might be an easier task that requires less training data. As a first step we propose to use a small training set of labeled captions to train a model that maps captions to templates. We will then use that model to label a larger set of images via their paired captions, and use this larger dataset to train a visual classifier that maps images to templates. More...
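A toy sketch of the bootstrapping pipeline (the caption texts, template labels, and classifiers here are all illustrative placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Step 1: train a caption-to-template model on a small labeled set
    captions = ["troops patrol the streets of the city",
                "the president meets the prime minister",
                "soldiers guard a checkpoint near the border",
                "leaders shake hands at the summit"]
    templates = ["conflict", "diplomacy", "conflict", "diplomacy"]
    caption_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    caption_model.fit(captions, templates)

    # Step 2: pseudo-label a larger set of captioned images
    new_captions = ["tanks roll into the capital", "envoys sign a treaty"]
    pseudo_labels = caption_model.predict(new_captions)

    # Step 3 (not shown): train a visual classifier on the images paired
    # with these pseudo-labels, mapping images directly to templates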

 

Contextual Recognition of Head Gestures

Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. We investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of user gestures. We present a recognition framework which (1) extracts contextual features from an ECA's dialog manager, (2) computes a prediction of head nods and head shakes, and (3) integrates the contextual predictions with the output of a vision-based head gesture recognizer. Using a discriminative approach to contextual prediction and multimodal integration, we were able to improve the performance of head gesture detection even when the topic of the test set was significantly different from that of the training set. More...
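A toy sketch of the integration step, assuming the two upstream scores already exist; a simple discriminative model over the vision detector's score and the contextual prediction stands in for the project's actual integration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    vision = rng.uniform(0, 1, 300)     # score from vision-based nod detector
    context = rng.uniform(0, 1, 300)    # nod prediction from dialog features
    # toy ground truth: nods happen when the two cues jointly agree
    y = (0.6 * vision + 0.4 * context > 0.55).astype(int)

    fused = LogisticRegression().fit(np.column_stack([vision, context]), y)
    p_nod = fused.predict_proba(np.column_stack([vision, context]))[:, 1]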

Contextual recognition of head gestures during face-to-face interaction with an embodied agent.

 

Hidden Conditional Random Fields for Gesture Recognition

Gesture sequences often have a complex underlying structure, and models that can incorporate hidden structure have proven advantageous. Most existing approaches to gesture recognition with hidden state employ a Hidden Markov Model or a suitable variant to model gesture streams; a significant limitation of these models is the requirement of conditional independence of observations. In addition, hidden states in a generative model are selected to maximize the likelihood of generating all examples of a given gesture class, which is not necessarily optimal for discriminating the gesture class against other gestures. More...

Arm Gesture Sequence superimposed with a 3D body tracker

 

IDeixis: Image-based Deixis for Finding Location-Based Information

The IDeixis project develops image-based deixis for referring to a remote location when seeking location-specific information. IDeixis is an image-based approach to finding location-based information from camera-equipped mobile devices. It follows a point-by-photograph paradigm, in which users specify a location simply by taking a picture of it. Our technique uses content-based image retrieval methods to search the web or other databases for matching images and their source pages to find relevant location-based information. In contrast to conventional approaches to location detection, this method can refer to distant locations and does not require any physical infrastructure beyond mobile internet service. More...
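A toy sketch of the lookup (the descriptors, database size, and URLs are all illustrative placeholders; the content-based features, such as color or texture statistics, are assumed to be computed elsewhere):

    import numpy as np

    rng = np.random.default_rng(0)
    db_features = rng.random((500, 64))   # descriptor per indexed web image
    db_pages = ["http://example.org/page%d" % i for i in range(500)]

    def locate(snapshot_feature, k=5):
        # rank indexed images by distance to the phone snapshot's descriptor
        # and return the source pages of the nearest matches
        d = np.linalg.norm(db_features - snapshot_feature, axis=1)
        return [db_pages[i] for i in np.argsort(d)[:k]]

    print(locate(rng.random(64)))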


 

Multimodal Co-training of Agreement Gesture Classifiers

In this work we investigate the use of multimodal semi-supervised learning to train a classifier which detects user agreement during a dialog with a robotic agent. Separate ‘views’ of the user’s agreement are given by head nods and by keywords in the user’s speech. We develop a co-training algorithm for the gesture and speech classifiers to adapt each classifier to a particular user and increase recognition performance. Multimodal co-training allows us to build user-adaptive models without labeled training data for that user. We evaluate our algorithm on a data set of subjects interacting with a robotic agent and demonstrate that co-training can be used to learn user-specific head nods and keywords to improve overall agreement recognition. More...
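A minimal sketch of the co-training loop under toy assumptions (both classes appear in the initial labeled set, each view is informative on its own, and the confidence threshold and classifiers are placeholders):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain(X_nod, X_speech, seed_labels, unlabeled, rounds=5, conf=0.9):
        # seed_labels: {index: 0/1 agreement label} for the labeled examples
        y = dict(seed_labels)
        unlabeled = list(unlabeled)
        nod_clf, speech_clf = LogisticRegression(), LogisticRegression()
        for _ in range(rounds):
            idx = sorted(y)
            labels = [y[i] for i in idx]
            nod_clf.fit(X_nod[idx], labels)
            speech_clf.fit(X_speech[idx], labels)
            # each view confidently pseudo-labels examples for the other view
            for clf, X in ((nod_clf, X_nod), (speech_clf, X_speech)):
                if not unlabeled:
                    break
                p = clf.predict_proba(X[unlabeled])
                for row, i in zip(p, list(unlabeled)):
                    if row.max() >= conf:
                        y[i] = int(row.argmax())
                        unlabeled.remove(i)
        return nod_clf, speech_clf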

 

Face recognition from image sets

In most face recognition paradigms it is assumed that, while a set of training images is available for each individual in the database, the input (test data) consists of a single shot. However, in many scenarios the recognition system has access to a set of face images of the person to be recognized. We exploit this by matching the distribution of the input set against those of the training sets, rather than comparing individual images, to improve recognition. More...
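One natural instantiation of distribution matching (a sketch, with Gaussian set models and symmetric KL divergence as illustrative choices):

    import numpy as np

    def gaussian_stats(X):
        # summarize a set of face feature vectors as a Gaussian
        mu = X.mean(axis=0)
        cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])   # regularized
        return mu, cov

    def kl(m1, S1, m2, S2):
        # KL divergence between Gaussians N(m1, S1) and N(m2, S2)
        S2inv = np.linalg.inv(S2)
        d = m2 - m1
        _, ld1 = np.linalg.slogdet(S1)
        _, ld2 = np.linalg.slogdet(S2)
        return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d
                      - len(m1) + ld2 - ld1)

    def set_distance(A, B):
        # compare two image *sets* via their distributions (symmetric KL)
        (ma, Sa), (mb, Sb) = gaussian_stats(A), gaussian_stats(B)
        return kl(ma, Sa, mb, Sb) + kl(mb, Sb, ma, Sa)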

Matching distributions, rather than individual images or a single image to a set.