Pointing to a Visual Target

The Cog Shop
MIT Artificial Intelligence Laboratory
545 Technology Square, #920
Cambridge, MA 02139

write to the Cog Documentation Project: cdp@ai.mit.edu

We have implemented a pointing behavior which enables Cog to reach out its arm to point to a visual target. The behavior is learned over many repeated trials without human supervision, using gradient descent methods to train forward and inverse mappings between a visual parameter space and an arm position parameter space. This behavior uses a novel approach to arm control, and the learning bootstraps from prior knowledge contained within the saccade behavior. As implemented, the behavior assumes that the robot’s neck remains in a fixed position.

From an external perspective, the behavior is quite rudimentary. Given a visual stimulus, typically a researcher waving an object in front of its cameras, the robot saccades to foveate the target and then reaches out its arm toward it. Early reaches are inaccurate, and often in the wrong direction altogether, but after a few hours of practice the accuracy improves drastically.

The reaching algorithm involves an amalgam of several subsystems. A motion detection routine identifies a salient stimulus, which serves as a target for the saccade module. Foveation guarantees that the target is always at the center of the visual field, so its retinal coordinates are fixed and the position of the target relative to the robot is wholly characterized by the gaze angle of the eyes (only two degrees of freedom). Once the target is foveated, the joint configuration necessary to point to that target is generated from the gaze angle of the eyes using a "ballistic map." This configuration is used by the arm controller to generate the reach.

Training the ballistic map is complicated by the inappropriate coordinate space of the error signal. When the arm is extended, the robot waves its hand; this motion is used to locate the end of the arm in the visual field. The distance of the hand from the center of the visual field is the measure of the reach error. This error signal is measured in pixels, yet the map being trained relates gaze angles to joint positions, so the reach error measured by the visual system cannot be used directly to train the ballistic map. However, the saccade map has already been trained to relate pixel positions to gaze angles. It converts the reach error, measured as a pixel offset on the retina, into an offset in the gaze angles of the eyes (as if Cog were looking at a different target).
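
The conversion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the per-axis linear gain, and the pixel and angle values are all hypothetical, and the real saccade map is learned rather than a fixed gain.

```python
def saccade_map(pixel_offset, gain=(0.05, 0.05)):
    """Hypothetical stand-in for the learned saccade map.

    Returns the change in gaze angles that would foveate a point seen at
    pixel_offset from the image center. A fixed per-axis gain (angle per
    pixel) is an assumption made for this sketch only.
    """
    dx, dy = pixel_offset
    return (gain[0] * dx, gain[1] * dy)


def reach_error_in_gaze(hand_pixel, image_center, gaze_at_target):
    """Express the visually measured reach error in gaze coordinates.

    hand_pixel     : where the waving hand was located in the image
    image_center   : where the foveated target sits
    gaze_at_target : current gaze angles (the eyes are fixating the target)
    Returns (gaze_offset, hand_gaze): the error as a gaze-angle offset, and
    the gaze angles at which the hand would itself be foveated.
    """
    offset = (hand_pixel[0] - image_center[0],
              hand_pixel[1] - image_center[1])
    d_pan, d_tilt = saccade_map(offset)
    hand_gaze = (gaze_at_target[0] + d_pan, gaze_at_target[1] + d_tilt)
    return (d_pan, d_tilt), hand_gaze
```

The second return value is the gaze position the hand actually reached, which is the quantity a forward kinematics model can be trained against.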

This is still not enough to train the ballistic map. Our error is now in terms of gaze angles, not joint positions: we know where Cog could have looked, but not how it should have moved the arm. To train the ballistic map, we also need a "forward map," a forward kinematics function which gives the gaze angle of the hand in response to a commanded set of joint positions. The error in gaze coordinates can be back-propagated through this map, yielding a signal appropriate for training the ballistic map.
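
The back-propagation step can be written out with the chain rule. The notation here is assumed for illustration and is not taken from the implementation: $B_\theta$ is the ballistic map with parameters $\theta$, $F$ is the forward map, $g$ is the gaze angle of the target, $a = B_\theta(g)$ is the commanded arm position, $e$ is the reach error in gaze angles (hand minus target), and $\eta$ is a learning rate. Taking the squared error as the loss and using $F$ as a differentiable stand-in for the real arm:

$$E = \tfrac{1}{2}\,\lVert e \rVert^2, \qquad \frac{\partial E}{\partial \theta} \approx \left(\frac{\partial B_\theta(g)}{\partial \theta}\right)^{\top} \left(\frac{\partial F(a)}{\partial a}\right)^{\top} e, \qquad \theta \leftarrow \theta - \eta\,\frac{\partial E}{\partial \theta}$$

Only the derivative of $F$ enters the update, not its value. In the one-dimensional case this means a forward map whose derivative has the right sign, even with the wrong magnitude, still moves $\theta$ in a direction that reduces the error.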

The forward map is learned incrementally during every reach: after each reach we know the commanded arm position, as well as the position measured in eye gaze coordinates (even though that was not the target position). For the ballistic map to train properly, the forward map must have the correct signs in its derivative. Hence, training of the forward map begins first, during a "flailing" period in which Cog performs reaches to random arm positions distributed through its workspace.
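
A toy simulation may make the two training phases concrete. It is a minimal sketch, not the robot's code: `plant` is an invented linear stand-in for the real arm and vision system, both maps are assumed linear, and the trial counts and learning rate are arbitrary. The arm command is taken to be two-dimensional.

```python
import random


def plant(arm):
    """Invented stand-in for arm plus vision: arm command -> gaze angles of the hand."""
    x, y = arm
    return (0.9 * x - 0.3 * y + 0.1, 0.2 * x + 0.7 * y - 0.05)


class LinearMap:
    """y = W x + b on 2-vectors, trained by gradient descent."""

    def __init__(self):
        self.W = [[0.0, 0.0], [0.0, 0.0]]
        self.b = [0.0, 0.0]

    def __call__(self, x):
        return tuple(self.W[i][0] * x[0] + self.W[i][1] * x[1] + self.b[i]
                     for i in range(2))

    def step(self, x, grad_out, rate):
        """Descend along grad_out, the loss gradient with respect to the output."""
        for i in range(2):
            for j in range(2):
                self.W[i][j] -= rate * grad_out[i] * x[j]
            self.b[i] -= rate * grad_out[i]


def train(flail_trials=2000, reach_trials=4000, rate=0.1, seed=0):
    rng = random.Random(seed)
    forward, ballistic = LinearMap(), LinearMap()

    # Flailing: random arm commands; the forward map learns where the hand appears.
    for _ in range(flail_trials):
        arm = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        seen, pred = plant(arm), forward(arm)
        forward.step(arm, [pred[i] - seen[i] for i in range(2)], rate)

    # Reaching: the gaze-space error is pushed back through the forward map.
    for _ in range(reach_trials):
        target = (rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
        arm = ballistic(target)
        seen = plant(arm)
        err = [seen[i] - target[i] for i in range(2)]
        # Transposed forward-map Jacobian times gaze error gives an arm-space gradient.
        grad_arm = [forward.W[0][j] * err[0] + forward.W[1][j] * err[1]
                    for j in range(2)]
        ballistic.step(target, grad_arm, rate)
        # The forward map keeps learning from every reach as well.
        pred = forward(arm)
        forward.step(arm, [pred[i] - seen[i] for i in range(2)], rate)
    return forward, ballistic


def reach_error(ballistic, target):
    """Distance in gaze space between where the hand lands and the target."""
    seen = plant(ballistic(target))
    return ((seen[0] - target[0]) ** 2 + (seen[1] - target[1]) ** 2) ** 0.5
```

Calling `train()` and then `reach_error(ballistic, target)` for a few targets shows the error of the trained map against that of an untrained `LinearMap()`. Skipping the flailing phase leaves the forward map's Jacobian at zero, so the first reaches produce no useful gradient for the ballistic map.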

Although the arm has four joints active in moving the hand to a particular position in space (the other two control the orientation of the hand), we re-parameterize so that we control only two degrees of freedom for a reach. The position of the outstretched arm is governed by a normalized vector of "postural primitives." A primitive is a fixed set of joint angles, corresponding to a static position of the arm, placed at a corner of the workspace. Three such primitives form a basis for the workspace. The joint space command for the arm is calculated by interpolating the joint space components among the primitives, weighted by the coefficients of the primitive-space vector. Since the vector in primitive space is normalized, three coefficients give rise to only two degrees of freedom. Hence, a mapping between eye gaze position and arm position, and vice versa, is a simple, non-degenerate $R^2 \rightarrow R^2$ function. This considerably simplifies learning.
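
The interpolation can be sketched as follows. The joint angles are made-up values, and the sketch assumes that "normalized" means the coefficients sum to one; the text does not say which normalization the implementation uses. All names are hypothetical.

```python
# Hypothetical joint angles (radians) for three postural primitives,
# four joints each. These values are illustrative, not Cog's.
PRIMITIVES = [
    [0.0, 0.2, -0.5, 1.0],
    [0.8, -0.1, 0.3, 0.4],
    [-0.6, 0.9, 0.1, 0.7],
]


def normalize(weights):
    """Scale the weights so they sum to 1 (the normalization assumed here)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def joint_command(weights, primitives=PRIMITIVES):
    """Blend the primitives joint by joint, weighted by normalized coefficients."""
    w = normalize(weights)
    n_joints = len(primitives[0])
    return [sum(w[k] * primitives[k][j] for k in range(len(primitives)))
            for j in range(n_joints)]


def from_two_dof(a, b):
    """Two free parameters fix all three coefficients once they must sum to 1."""
    return [a, b, 1.0 - a - b]
```

Under this assumption, `joint_command(from_two_dof(a, b))` maps a point in a two-dimensional parameter space to a four-joint command, which is the sense in which three coefficients leave only two degrees of freedom.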

Unfortunately, the notion of postural primitives as formulated is very brittle: the primitives are chosen ad hoc to yield a reasonable workspace. Finding methods to adaptively generate primitives and divide the workspace is a subject of active research.

Representatives of the press who are interested in acquiring further information about the Cog project should contact Elizabeth Thomson, thomson@mit.edu, from the MIT News Office, http://web.mit.edu/newsoffice/www/ .