Multi-Modal Orientation Behavior
The Cog Shop
MIT Artificial Intelligence Laboratory
545 Technology Square, #920
Cambridge, MA 02139
To integrate multi-modal sensory information (visual, auditory, tactile, etc.) and use
this information to orient the robot toward the source of sensory stimuli.
The ability to orient toward visual, auditory, or tactile stimuli is an
important skill for systems intended to interact with and explore their environment. In
the mammalian brain, the Superior Colliculus is specialized for integrating
multi-modal sensory information, and for using this information to orient the animal to
the source of sensory stimuli, such as noisy, moving objects. Within the Superior Colliculus,
this ability appears to be implemented using layers of registered, multi-modal,
topographic maps. Inspired by the structure, function, and plasticity of the Superior
Colliculus, we are in the process of implementing multi-modal orientation behaviors on our
humanoid robot using registered topographic maps.
In the Superior Colliculus of the cat, there are visuotopic maps representing motion in
visual space, somatotopic maps yielding a body representation of tactile inputs, and
spatiotopic maps of auditory space encoding inter-aural time differences (ITD) and
inter-aural intensity differences (IID). Hence, a sensory stimulus originating from a
given direction will elicit activity in the corresponding region of the appropriate
sensory map. There are also motor movement maps consisting of pre-motor neurons whose
movement fields are topologically organized. These maps exist for the eyes, head, neck,
body, and ears. Stimulating a specific region in a map elicits a corresponding motor
movement.
These multi-modal maps overlap and are aligned with each other so that they share a
common multisensory spatial coordinate system. The maps are said to be registered
with one another when this is the case. Arranging multi-modal information into a common
representational framework (within each map) and registering the maps with respect to
each other allows the information within each map to interact with and influence the other maps.
There are a couple of advantages to this organizational strategy. First, it is an
economical way of specifying the location of peripheral stimuli, and for organizing and
activating the motor program required to orient toward the stimulus, thereby allowing any
sensory modality to orient the other sensory organs to the source of stimulation. Second, it
supports enhancement of simultaneous sensory cues. Stimuli that occur in the same place at
the same time are likely to be interrelated by common causality. For instance, a bird
rustling in the bushes will provide both visual motion and auditory cues. During
enhancement, certain combinations of meaningful stimuli become more salient because their
neuronal responses are spatio-temporally related. Once the multi-modal maps are aligned,
neuronal enhancement (or depression) is a function of the temporal and spatial
relationships of neural activity among the maps.
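The enhancement rule above can be sketched in a few lines of Python. This is an illustrative model, not Cog's actual implementation: the maps are 2-D lists of activity, and the superadditive "coincidence bonus" (the boost parameter is our own assumption) rewards sites that are active in both modalities at the same time.

```python
# Hypothetical sketch of multi-modal enhancement on two registered maps.
# Sites active in both modalities at once receive a superadditive boost;
# sites active in only one modality contribute just their own activity.

def enhanced_salience(visual, auditory, boost=0.5):
    """Combine two registered topographic maps (2-D lists of activity
    in [0, 1]) into one salience map, enhancing coincident activity."""
    rows, cols = len(visual), len(visual[0])
    salience = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            v, a = visual[r][c], auditory[r][c]
            salience[r][c] = v + a + boost * v * a  # coincidence bonus
    return salience
```

With this rule, a site receiving both a visual and an auditory cue ends up more salient (2.5 for two unit inputs) than two separate unimodal sites (1.0 each), which is the behavioral signature of enhancement.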
In our framework, a map is a two dimensional array of elements where each element
corresponds to a site in the map. The maps are arranged into interconnected layers, where
a given map can be interfaced to more than one map. Each connection is uni-directional, so
recurrent connections between maps require both a feedforward connection and a feedback
connection. The activity level of sites on one map is passed to another map through these
connections, hence the input to a given map is a function of the spatio-temporal activity
of the maps feeding into it and the connectivity between these maps. Currently, all
connections have equal weights, although this could change in the future. The output of a
given map is its spatio-temporal activity pattern. What this pattern of activity
represents depends upon the map: if it is a visuotopic map, it could represent motion
coming from a particular direction in the visual field; if it is an oculomotor map, it
could encode a motor command to move the eyes, and so forth.
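The framework just described can be sketched as a small Python data structure. The class and method names are our own, and the site-to-site, equal-weight connectivity is a simplification (in practice, which source site drives which target site is exactly what registration learning adjusts):

```python
# Minimal sketch of the map framework: a map is a 2-D activity array,
# and connections are uni-directional, so a map's input is the
# (equal-weight) sum of the activity of the maps feeding into it.

class TopographicMap:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.activity = [[0.0] * cols for _ in range(rows)]
        self.inputs = []  # incoming uni-directional connections

    def connect_from(self, source):
        """Register an equal-weight feedforward connection. For
        simplicity, sites connect one-to-one at the same coordinates."""
        assert (source.rows, source.cols) == (self.rows, self.cols)
        self.inputs.append(source)

    def update(self):
        """New activity at each site is the sum over all source maps."""
        for r in range(self.rows):
            for c in range(self.cols):
                self.activity[r][c] = sum(src.activity[r][c]
                                          for src in self.inputs)
```

A recurrent pair of maps would simply call connect_from in both directions, matching the feedforward-plus-feedback arrangement described above.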
The smallest map ensemble capable of producing an observable behavior consists of a
sensory input map, a motor output map, and an established set of connections between them.
The input map could have a fairly rigid structure consisting simply of time-differenced
intensity images. Because visual information already contains a spatial component, this
simple map is topographic without any additional tuning. The motor map could be
established such that a given site on the map corresponds to a particular motor
displacement from the current position. If the motor displacement commands vary linearly
with motor space, for instance, this map is also topographically organized. Assuming the
cameras are motionless, a moving object occupies a localized region in the visual field,
and correspondingly causes a localized intensity difference (an active region) in the
time-differenced image map. If connections exist from this region of the
time-differenced image map to the appropriate region of the oculomotor map, then a motion
stimulus in the visual field excites the corresponding region of the visual motion map,
which in turn excites the connected region of the oculomotor map, which evokes the
necessary camera motion to foveate the stimulus.
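The minimal ensemble above can be sketched concretely. The scale factor and the winner-take-all readout are our own assumptions, made only to keep the example short: the sensory map is a time-differenced intensity image, and the motor map linearly converts the most active site's offset from the map center into a camera displacement.

```python
# Illustrative sketch of the minimal ensemble: a time-differenced
# intensity image as the sensory input map, and a linear motor map
# whose sites encode displacements from the current gaze direction.

def time_difference(prev, curr):
    """Sensory map: absolute per-pixel intensity difference between
    two successive frames (2-D lists of intensities)."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def saccade_command(diff_map, degrees_per_site=1.0):
    """Motor map readout: pick the most active site and convert its
    offset from the map center linearly into (tilt, pan) degrees."""
    rows, cols = len(diff_map), len(diff_map[0])
    peak = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: diff_map[rc[0]][rc[1]])
    dr = (peak[0] - rows // 2) * degrees_per_site
    dc = (peak[1] - cols // 2) * degrees_per_site
    return dr, dc  # displacement needed to foveate the stimulus
```

A moving object produces a localized active region in the difference map, and the returned displacement is the motor command that would center that region in the fovea.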
Of course, sensor to motor integration is only one type of multi-modal
registration. As mentioned earlier, it is also possible to register sensor to sensor maps,
such as aligning an auditory ITD map with the visuotopic motion map. By integrating these
maps with motor maps, the robot could orient to either visual or auditory stimuli.
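As a sketch of how an auditory cue could be brought into the same spatial frame as the visual map, the standard far-field ITD model can be inverted to recover a sound's azimuth. The microphone separation and speed-of-sound values here are assumptions for illustration, not Cog's parameters:

```python
import math

# Sketch of sensor-to-sensor registration: convert an inter-aural time
# difference into an azimuth that can index a column of a spatiotopic
# auditory map aligned with the visual map.

def itd_to_azimuth(itd_s, mic_separation_m=0.15, speed_of_sound=343.0):
    """Invert the far-field ITD model  itd = d * sin(theta) / c  to
    recover the sound azimuth theta (radians, 0 = straight ahead)."""
    s = itd_s * speed_of_sound / mic_separation_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.asin(s)
```

Once both the auditory azimuth and the visual motion map are expressed in the same angular coordinates, the enhancement and orientation machinery described earlier applies to either modality.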
Several mechanisms and models have been proposed to account for the alignment process
of sensory-motor maps in animals. The mechanisms we use for map organization and alignment
on Cog are inspired by similar mechanisms. However, different combinations of mechanisms
are used depending on what is being learned: i.e. tuning the organization within a map,
registering different sensory maps, or registering sensory maps and motor maps. A variety
of mechanisms determine how map connections are established. Guided by sensori-motor
experience, these mechanisms govern how connections between maps are modified to improve
behavioral performance.
These mechanisms have been used to register a visual motion map with an oculomotor map
so that the robot saccades to moving objects seen in either its peripheral field of view
or its foveal field of view. A successful saccade is one that centers the stimulus in the
foveal field of view. The registration of the visuotopic map with the oculomotor map
proceeds from an initial random mapping between the two maps. So far we have learned the
registration for the center 20x20 degree region of the peripheral map (this region
corresponds to the fovea).
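One way such a registration could be refined from an initially random mapping is an error-correcting update after each saccade. The rule below is our own assumption, not Cog's exact algorithm: the residual offset of the stimulus from the fovea center is used to nudge the motor-map entry for the stimulated site.

```python
# Hedged sketch of learning the visuotopic-to-oculomotor registration:
# after a saccade, the leftover offset of the stimulus from the fovea
# center corrects the stored motor command for that visual site.

def refine_mapping(mapping, site, residual_error, rate=0.5):
    """mapping: dict of site -> (tilt, pan) motor displacement, degrees.
    residual_error: (dr, dc) post-saccade offset of the stimulus from
    the fovea center. Move the stored command a fraction of the way
    toward the command that would have centered the stimulus."""
    tilt, pan = mapping[site]
    dr, dc = residual_error
    mapping[site] = (tilt + rate * dr, pan + rate * dc)
```

Repeated saccades to stimuli at a given site shrink the residual geometrically, so even a random initial mapping converges toward commands that center the stimulus.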
The figure below shows the performance of the learned mapping. The error at each site
of the map is measured as the displacement of the centroid of the stimulus (after the
saccade) from the center of the foveal field of view. The error is measured in degrees. The
overall performance is the average error over the mapping, i.e., the average of the
absolute value of the error at each site.
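The performance measure just described can be written out directly. Treating each site's error as the Euclidean magnitude of its (tilt, pan) offset is our reading of "absolute value of the error"; the exact metric used on Cog may differ:

```python
import math

# Sketch of the saccade performance measure: per-site error is the
# angular offset (degrees) of the post-saccade stimulus centroid from
# the fovea center; overall performance is the mean absolute error.

def mean_absolute_error(errors):
    """errors: 2-D list of per-site (tilt, pan) errors in degrees."""
    flat = [math.hypot(dr, dc) for row in errors for (dr, dc) in row]
    return sum(flat) / len(flat)
```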
Ongoing work extends this approach to sensory-sensory map registration to integrate
auditory and visual information so that the robot orients to both noisy and moving
stimuli. The neck and body degrees of freedom are also being incorporated for full body
orientation to stimuli.