The Role of Fixation and Visual Attention in Object Recognition


This project is a study of the role of fixation and visual attention in object recognition. We built an active vision system that can recognize a target object in a cluttered scene efficiently (with minimal search) and reliably (with minimal false identifications). Our system integrates the visual cues of color and stereo to perform selection (figure/ground separation), yielding candidate regions on which to focus attention. One advantage of using multiple cues for selection is that no individual cue needs to be very accurate. We need only methods for roughly comparing properties of the model with the image, so that we are unlikely to exclude a correct target region while still reducing the extraneous information in the image.
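The multi-cue idea can be illustrated with a minimal sketch. Here each cue (color, stereo) independently assigns a rough similarity score to a candidate region, and deliberately loose thresholds are used so that a correct target region is unlikely to be excluded; all names and threshold values are illustrative, not taken from the system described here.

```python
# Hypothetical sketch of multi-cue selection. Each cue maps a region
# to a rough similarity score in [0, 1]; thresholds are kept loose so
# that the true target region is very unlikely to be filtered out.
def select_candidates(regions, color_score, stereo_score,
                      color_thresh=0.3, stereo_thresh=0.3):
    """Keep regions that roughly match the model under BOTH cues."""
    return [r for r in regions
            if color_score[r] >= color_thresh
            and stereo_score[r] >= stereo_thresh]

# Example: region 'a' roughly matches under both cues; 'b' fails on
# color, 'c' fails on stereo, so only 'a' survives as a candidate.
regions = ["a", "b", "c"]
color = {"a": 0.9, "b": 0.2, "c": 0.5}
stereo = {"a": 0.8, "b": 0.9, "c": 0.1}
candidates = select_candidates(regions, color, stereo)
```

The intersection of loose per-cue filters is the point: each cue alone passes many spurious regions, but their conjunction is selective without requiring either cue to be precise.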

The system fixates each candidate region in the image and uses stereo to extract features that lie within a narrow disparity range about the fixation position. These selected features are then used as input to an Alignment-style recognition system. We show that visual attention and fixation significantly reduce both the complexity of, and the false identifications in, model-based recognition using Alignment methods.
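Alignment-style recognition hypothesizes a transform from a minimal set of model-to-image correspondences and then verifies it by projecting the rest of the model. The sketch below shows the standard minimal case for a 2D affine transform, solved from three point correspondences; this is a generic illustration of the alignment idea, not the actual recognition engine used in this system.

```python
import numpy as np

def affine_from_3(model_pts, image_pts):
    """Solve the 2D affine transform mapping three model points onto
    three image points (the minimal alignment hypothesis).
    Returns the 2x2 matrix A and translation t."""
    M = np.asarray(model_pts, float)
    P = np.asarray(image_pts, float)
    # Linear system for the unknowns [a11, a12, tx, a21, a22, ty].
    S = np.zeros((6, 6))
    b = np.zeros(6)
    for k in range(3):
        S[2 * k]     = [M[k, 0], M[k, 1], 1, 0, 0, 0]
        S[2 * k + 1] = [0, 0, 0, M[k, 0], M[k, 1], 1]
        b[2 * k], b[2 * k + 1] = P[k]
    x = np.linalg.solve(S, b)
    return x[[0, 1, 3, 4]].reshape(2, 2), x[[2, 5]]

def verify(A, t, model_pts, image_pts, tol=2.0):
    """Verification step: project remaining model points and check
    they land near the corresponding image features (tol in pixels)."""
    proj = np.asarray(model_pts, float) @ A.T + t
    err = np.linalg.norm(proj - np.asarray(image_pts, float), axis=1)
    return bool(np.all(err < tol))

# Example: image points are the model translated by (2, 3), so the
# recovered A is the identity and t is (2, 3).
A, t = affine_from_3([(0, 0), (1, 0), (0, 1)], [(2, 3), (3, 3), (2, 4)])
ok = verify(A, t, [(1, 1)], [(3, 4)])
```

The cost of this scheme is combinatorial in the number of image features considered, which is why the disparity-based selection described above pays off: fewer candidate features means far fewer alignment hypotheses to verify.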

Another theme we investigate in this project is the use of stereo for selection rather than for 3D reconstruction. It has been shown that small inaccuracies in measuring camera parameters lead to large errors in computed depth. We demonstrate that, by using stereo for selection in object recognition, we can avoid the need for explicit depth information and accurate camera calibration.
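The sensitivity of reconstructed depth can be seen directly from the pinhole stereo relation Z = fB/d: a fixed error in disparity (or an equivalent calibration error) produces a relative depth error that grows as disparity shrinks. The numbers below are illustrative, not measurements from this system.

```python
def depth_from_disparity(f, baseline, d):
    """Pinhole stereo depth: Z = f * B / d  (f and d in pixels)."""
    return f * baseline / d

def relative_depth_error(f, baseline, d, d_err):
    """Relative depth error caused by a disparity error of d_err."""
    z_true = depth_from_disparity(f, baseline, d)
    z_meas = depth_from_disparity(f, baseline, d - d_err)
    return abs(z_meas - z_true) / z_true

# Illustrative numbers: the same half-pixel disparity error is
# negligible for a nearby point (large disparity) but doubles the
# computed depth when the disparity is only one pixel.
f, B = 500.0, 0.1  # focal length (pixels), baseline (meters)
errs = [relative_depth_error(f, B, d, 0.5) for d in (50.0, 5.0, 1.0)]
```

Using stereo only to group features at similar disparity sidesteps this amplification entirely: the grouping needs only relative disparities, not metrically accurate depth.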

Various stages in the working of the system

Target object.

Figures 1-9 on the following pages illustrate the operation of the algorithm.

Figure 1 - Initial left and right images

Figure 2 - Segments in left and right images

Figure 3 - Color regions extracted from the images. The stereo algorithm is run on the edges retained after color filtering to find target focal edges in the left image that have unique matches in the right image, subject to constraints on epipolar geometry, edge orientation, length, and contrast. The disparity associated with each unique match is used to move the cameras and extract high-resolution images around the matched edges. If there are several unique matches, each is fixated in turn.
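The unique-matching step described in the caption can be sketched as follows. Edge segments are compared under the four stated constraints, and a left edge is accepted only if exactly one right edge is compatible; the field names and tolerance values here are illustrative assumptions, not the system's actual parameters.

```python
def compatible(l, r, max_dy=1.0, max_dori=10.0,
               len_ratio=0.75, con_ratio=0.75):
    """Rough compatibility test between a left and a right edge
    segment (tolerances are illustrative)."""
    return (abs(l["y"] - r["y"]) <= max_dy            # epipolar constraint
            and abs(l["ori"] - r["ori"]) <= max_dori  # orientation
            and min(l["len"], r["len"]) / max(l["len"], r["len"]) >= len_ratio
            and min(l["con"], r["con"]) / max(l["con"], r["con"]) >= con_ratio)

def unique_matches(left_edges, right_edges):
    """Return (left_edge, disparity) pairs for left edges with exactly
    one compatible right edge; ambiguous edges are discarded."""
    out = []
    for l in left_edges:
        cands = [r for r in right_edges if compatible(l, r)]
        if len(cands) == 1:
            out.append((l, l["x"] - cands[0]["x"]))
    return out

# Example: one right edge agrees in scanline, orientation, length and
# contrast (disparity 5); the other violates the orientation bound.
left = [{"x": 100, "y": 10, "ori": 90, "len": 20, "con": 50}]
r_good = {"x": 95, "y": 10, "ori": 90, "len": 20, "con": 50}
r_bad = {"x": 60, "y": 10, "ori": 40, "len": 20, "con": 50}
matches = unique_matches(left, [r_good, r_bad])
```

Discarding ambiguous edges rather than guessing keeps the false-match rate low, which matters more here than match density since each unique match only has to seed a fixation.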


Figure 4 - Foveated left and right images

Figure 5 - Segments in foveated left and right images

Figure 6 - Selected segments. The stereo matcher is used to find all matches within a narrow disparity range about the disparity of the fixated edge. The matched edges are likely to come from a single object and are fed into the recognition engine. The selected segments in Figure 6 are NOT recognized as an instance of the model, so the system fixates the next target edge.
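The disparity-band selection in this step amounts to a simple filter: keep only the matched edges whose disparity lies close to that of the fixated edge. A minimal sketch, with an illustrative band width:

```python
def select_band(matches, d_fix, band=2.0):
    """Keep matched edges whose disparity lies within a narrow band
    about the fixated edge's disparity; these are likely to belong to
    a single object and are passed to the recognizer.
    (The band width of 2 pixels is an illustrative assumption.)"""
    return [m for m in matches if abs(m["disp"] - d_fix) <= band]

# Example: two edges near the fixation disparity survive; the edge at
# a very different disparity (likely background clutter) is dropped.
matches = [{"disp": 5.0}, {"disp": 5.5}, {"disp": 9.0}]
selected = select_band(matches, d_fix=5.0, band=1.0)
```

Because fixation centers the disparity range on the attended object, this filter removes most background and foreground clutter before recognition is attempted.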


Figure 7 - Foveated left and right images

Figure 8 - Segments in foveated left and right images

Figure 9(a) - Selected segments. 9(b) - Model transformed and aligned with the data. The alignment is good enough, and the object is FOUND in the given scene.