Multi-Modal Orientation Behavior
The Cog Shop
MIT Artificial Intelligence Laboratory
545 Technology Square, #920
Cambridge, MA 02139
To integrate multi-modal sensory information (visual, auditory, tactile, etc.) and use
this information to orient the robot toward the source of sensory stimuli.
The ability to orient toward visual, auditory, or tactile stimuli is an
important skill for systems intended to interact with and explore their environment. In
the mammalian brain, the Superior Colliculus is specialized for integrating
multi-modal sensory information, and for using this information to orient the animal to
the source of sensory stimuli, such as noisy, moving objects. Within the Superior Colliculus,
this ability appears to be implemented using layers of registered, multi-modal,
topographic maps. Inspired by the structure, function, and plasticity of the Superior
Colliculus, we are in the process of implementing multi-modal orientation behaviors on our
humanoid robot using registered topographic maps.
In the Superior Colliculus of the cat, there are visuotopic maps representing motion in
visual space, somatotopic maps yielding a body representation of tactile inputs, and
spatiotopic maps of auditory space encoding inter-aural time differences (ITD) and
inter-aural intensity differences (IID). Hence, a sensory stimulus originating from a
given direction will elicit activity in the corresponding region of the appropriate
sensory map. There are also motor movement maps consisting of pre-motor neurons whose
movement fields are topologically organized. These maps exist for the eyes, head, neck,
body, and ears. Stimulating a specific region in a map elicits a corresponding motor
movement.
These multi-modal maps overlap and are aligned with each other so that they share a
common multisensory spatial coordinate system. The maps are said to be registered
with one another when this is the case. Arranging multi-modal information into a common
representational framework (within each map) and registering the maps with respect to
each other allows the information within each map to interact with and influence the other maps.
There are a couple of advantages to this organizational strategy. First, it is an
economical way of specifying the location of peripheral stimuli, and for organizing and
activating the motor program required to orient toward the stimulus, thereby allowing any
sensory modality to orient the other sensory organs to the source of stimulation. Second, it
supports enhancement of simultaneous sensory cues. Stimuli that occur in the same place at
the same time are likely to be interrelated by common causality. For instance, a bird
rustling in the bushes will provide both visual motion and auditory cues. During
enhancement, certain combinations of meaningful stimuli become more salient because their
neuronal responses are spatio-temporally related. Once the multi-modal maps are aligned,
neuronal enhancement (or depression) is a function of the temporal and spatial
relationships of neural activity among the maps.
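The enhancement rule above can be sketched in a few lines of Python. This is an illustrative model, not Cog's actual implementation: the maps are 2-D lists of activity, and the superadditive "coincidence bonus" (the boost parameter is our own assumption) rewards sites that are active in both modalities at the same time.

```python
# Hypothetical sketch of multi-modal enhancement on two registered maps.
# Sites active in both modalities at once receive a superadditive boost;
# sites active in only one modality contribute just their own activity.

def enhanced_salience(visual, auditory, boost=0.5):
    """Combine two registered topographic maps (2-D lists of activity
    in [0, 1]) into one salience map, enhancing coincident activity."""
    rows, cols = len(visual), len(visual[0])
    salience = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            v, a = visual[r][c], auditory[r][c]
            salience[r][c] = v + a + boost * v * a  # coincidence bonus
    return salience
```

With this rule, a site receiving both a visual and an auditory cue ends up more salient (2.5 for two unit inputs) than two separate unimodal sites (1.0 each), which is the behavioral signature of enhancement.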
In our framework, a map is a two dimensional array of elements where each element
corresponds to a site in the map. The maps are arranged into interconnected layers, where
a given map can be interfaced to more than one map. Each connection is uni-directional, so
recurrent connections between maps require both a feedforward connection and a feedback
connection. The activity level of sites on one map is passed to another map through these
connections, hence the input to a given map is a function of the spatio-temporal activity
of the maps feeding into it and the connectivity between these maps. Currently, all
connections have equal weights, although this could change in the future. The output of a
given map is its spatio-temporal activity pattern. What this pattern of activity
represents depends upon the map: if it is a visuotopic map, it could represent motion
coming from a particular direction in the visual field; if it is an oculomotor map, it
could encode a motor command to move the eyes, and so forth.
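The framework just described can be sketched as a small Python data structure. The class and method names are our own, and the site-to-site, equal-weight connectivity is a simplification (in practice, which source site drives which target site is exactly what registration learning adjusts):

```python
# Minimal sketch of the map framework: a map is a 2-D activity array,
# and connections are uni-directional, so a map's input is the
# (equal-weight) sum of the activity of the maps feeding into it.

class TopographicMap:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.activity = [[0.0] * cols for _ in range(rows)]
        self.inputs = []  # incoming uni-directional connections

    def connect_from(self, source):
        """Register an equal-weight feedforward connection. For
        simplicity, sites connect one-to-one at the same coordinates."""
        assert (source.rows, source.cols) == (self.rows, self.cols)
        self.inputs.append(source)

    def update(self):
        """New activity at each site is the sum over all source maps."""
        for r in range(self.rows):
            for c in range(self.cols):
                self.activity[r][c] = sum(src.activity[r][c]
                                          for src in self.inputs)
```

A recurrent pair of maps would simply call connect_from in both directions, matching the feedforward-plus-feedback arrangement described above.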
The smallest map ensemble capable of producing an observable behavior consists of a
sensory input map, a motor output map, and an established set of connections between them.
The input map could have a fairly rigid structure consisting simply of time-differenced
intensity images. Because visual information already contains a spatial component, this
simple map is topographic without any additional tuning. The motor map could be
established such that a given site on the map corresponds to a particular motor
displacement from the current position. If the motor displacement commands vary linearly
with motor space, for instance, this map is also topographically organized. Assuming the
cameras are motionless, a moving object occupies a localized region in the visual field,
and correspondingly causes a localized intensity difference (an active region) in the
time-differenced image map. If connections exist from this region of the
time-differenced image map to the appropriate region of the oculomotor map, then a motion
stimulus in the visual field excites the corresponding region of the visual motion map,
which in turn excites the connected region of the oculomotor map, which evokes the
necessary camera motion to foveate the stimulus.
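The minimal ensemble above can be sketched concretely. The scale factor and the winner-take-all readout are our own assumptions, made only to keep the example short: the sensory map is a time-differenced intensity image, and the motor map linearly converts the most active site's offset from the map center into a camera displacement.

```python
# Illustrative sketch of the minimal ensemble: a time-differenced
# intensity image as the sensory input map, and a linear motor map
# whose sites encode displacements from the current gaze direction.

def time_difference(prev, curr):
    """Sensory map: absolute per-pixel intensity difference between
    two successive frames (2-D lists of intensities)."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def saccade_command(diff_map, degrees_per_site=1.0):
    """Motor map readout: pick the most active site and convert its
    offset from the map center linearly into (tilt, pan) degrees."""
    rows, cols = len(diff_map), len(diff_map[0])
    peak = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: diff_map[rc[0]][rc[1]])
    dr = (peak[0] - rows // 2) * degrees_per_site
    dc = (peak[1] - cols // 2) * degrees_per_site
    return dr, dc  # displacement needed to foveate the stimulus
```

A moving object produces a localized active region in the difference map, and the returned displacement is the motor command that would center that region in the fovea.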
Of course, sensor to motor integration is only one type of multi-modal
registration. As mentioned earlier, it is also possible to register sensor to sensor maps,
such as aligning an auditory ITD map with the visuotopic motion map. By integrating these
maps with motor maps, the robot could orient to either visual or auditory stimuli.
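As a sketch of how an auditory cue could be brought into the same spatial frame as the visual map, the standard far-field ITD model can be inverted to recover a sound's azimuth. The microphone separation and speed-of-sound values here are assumptions for illustration, not Cog's parameters:

```python
import math

# Sketch of sensor-to-sensor registration: convert an inter-aural time
# difference into an azimuth that can index a column of a spatiotopic
# auditory map aligned with the visual map.

def itd_to_azimuth(itd_s, mic_separation_m=0.15, speed_of_sound=343.0):
    """Invert the far-field ITD model  itd = d * sin(theta) / c  to
    recover the sound azimuth theta (radians, 0 = straight ahead)."""
    s = itd_s * speed_of_sound / mic_separation_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.asin(s)
```

Once both the auditory azimuth and the visual motion map are expressed in the same angular coordinates, the enhancement and orientation machinery described earlier applies to either modality.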
Several mechanisms and models have been proposed to account for the alignment process
of sensory-motor maps in animals. The mechanisms we use for map organization and alignment
on Cog are inspired by similar mechanisms. However, different combinations of mechanisms
are used depending on what is being learned: i.e. tuning the organization within a map,
registering different sensory maps, or registering sensory maps and motor maps. A variety
of mechanisms determine how map connections are established. Guided by sensori-motor
experience, these mechanisms govern how connections between maps are modified to improve
behavioral performance.
These mechanisms have been used to register a visual motion map with an oculomotor map
so that the robot saccades to moving objects seen in either its peripheral field of view
or its foveal field of view. A successful saccade is one that centers the stimulus in the
foveal field of view. The registration of the visuotopic map with the oculomotor map
proceeds from an initial random mapping between the two maps. So far we have learned the
registration for the center 20x20 degree region of the peripheral map (this region
corresponds to the fovea).
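One way such a registration could be refined from an initially random mapping is an error-correcting update after each saccade. The rule below is our own assumption, not Cog's exact algorithm: the residual offset of the stimulus from the fovea center is used to nudge the motor-map entry for the stimulated site.

```python
# Hedged sketch of learning the visuotopic-to-oculomotor registration:
# after a saccade, the leftover offset of the stimulus from the fovea
# center corrects the stored motor command for that visual site.

def refine_mapping(mapping, site, residual_error, rate=0.5):
    """mapping: dict of site -> (tilt, pan) motor displacement, degrees.
    residual_error: (dr, dc) post-saccade offset of the stimulus from
    the fovea center. Move the stored command a fraction of the way
    toward the command that would have centered the stimulus."""
    tilt, pan = mapping[site]
    dr, dc = residual_error
    mapping[site] = (tilt + rate * dr, pan + rate * dc)
```

Repeated saccades to stimuli at a given site shrink the residual geometrically, so even a random initial mapping converges toward commands that center the stimulus.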
The figure below shows the performance of the learned mapping. The error at each site
of the map is measured as the displacement of the centroid of the stimulus (after the
saccade) from the center of the foveal field of view. The error is measured in degrees. The
overall performance is the average error over the mapping, i.e., the average of the
absolute value of the error at each site.
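The performance measure just described can be written out directly. Treating each site's error as the Euclidean magnitude of its (tilt, pan) offset is our reading of "absolute value of the error"; the exact metric used on Cog may differ:

```python
import math

# Sketch of the saccade performance measure: per-site error is the
# angular offset (degrees) of the post-saccade stimulus centroid from
# the fovea center; overall performance is the mean absolute error.

def mean_absolute_error(errors):
    """errors: 2-D list of per-site (tilt, pan) errors in degrees."""
    flat = [math.hypot(dr, dc) for row in errors for (dr, dc) in row]
    return sum(flat) / len(flat)
```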
Ongoing work extends this approach to sensory-sensory map registration to integrate
auditory and visual information so that the robot orients to both noisy and moving
stimuli. The neck and body degrees of freedom are also being incorporated for full body
orientation to stimuli.