At the base of this entire effort is our tracker. The data acquired by our tracking system is used to calibrate the systems, to build up extended scenes, and as training data for our automatic classification.
We begin this section with examples of data captured using our tracking system. Each of the following pages includes a Java 1.1 applet that loads tracking data, which you can then review interactively. For instructions, see the Applet Instructions.
It is difficult to evaluate the effectiveness of the tracker from these examples alone, because objects that were not tracked at all are not included. The best indicator of how well the system tracked is whether it maintains the tracking ID. By sorting the objects you can easily see where tracks were broken (resulting in two or more tracking sequences for the same object). Usually this results from objects crossing paths or from significant occlusions. Questions or comments regarding the tracker, or suggestions for scenes to add to this section, should be sent to Chris Stauffer. Currently, the tracker only works on video taken from a static camera.
To view extensive amounts of data, we also produce multi-camera time-lapse videos (1.15 MByte MPEG), or you can use the Java applet to view a whole day's activities (39 MByte).
A common method for real-time segmentation of moving regions in image sequences involves ``background subtraction,'' or thresholding the error between an estimate of the image without moving objects and the current image. The numerous approaches to this problem differ in the type of background model used and the procedure used to update the model. This section briefly discusses a method of modeling each pixel as a mixture of Gaussians and using an on-line approximation to update the model. The Gaussian distributions of the adaptive mixture model are then evaluated to determine which are most likely to result from a background process. Each pixel is classified based on whether the Gaussian distribution which represents it most effectively is considered part of the background model.
This results in a stable, real-time outdoor tracker which reliably deals with lighting changes, repetitive motions from clutter, and long-term scene changes. This system has been run almost continuously for 16 months, 24 hours a day, through rain and snow. Of the over 500,000 frames processed, only a small percentage have presented significant tracking failures.
In the past, computational barriers have limited the complexity of real-time video processing applications. As a consequence, most systems were either too slow to be practical, or succeeded by restricting themselves to very controlled situations. Recently, faster computers have enabled researchers to consider more complex, robust models for real-time analysis of streaming data. These new methods allow researchers to begin modeling real world processes under varying conditions.
Consider the problem of video surveillance and monitoring. A robust system should not depend on careful placement of cameras. It should also be robust to whatever is in its visual field or whatever lighting effects occur. It should be capable of dealing with movement through cluttered areas, objects overlapping in the visual field, shadows, lighting changes, effects of moving elements of the scene (e.g.\ swaying trees), slow-moving objects, and objects being introduced or removed from the scene. Traditional approaches based on backgrounding methods typically fail in these general situations. Our goal is to create a robust, adaptive tracking system that is flexible enough to handle variations in lighting, moving scene clutter, multiple moving objects and other arbitrary changes to the observed scene. The resulting tracker is primarily geared towards scene-level video surveillance applications.
Most researchers have abandoned non-adaptive methods of backgrounding because of the need for manual initialization. Without re-initialization, errors in the background accumulate over time, making this method useful only in highly-supervised, short-term tracking applications without significant changes in the scene.
A standard method of adaptive backgrounding is averaging the images over time, creating a background approximation which is similar to the current static scene except where motion occurs. While this is effective in situations where objects move continuously and the background is visible a significant portion of the time, it is not robust to scenes with many moving objects particularly if they move slowly. It also cannot handle bimodal backgrounds, recovers slowly when the background is uncovered, and has a single, predetermined threshold for the entire scene.
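The running-average approach described above can be sketched as follows. This is an illustrative implementation, not the code of any particular system; the values of `alpha` and `threshold` are example settings.

```python
import numpy as np

def running_average_subtract(frames, alpha=0.05, threshold=25.0):
    """Yield a foreground mask per frame from a grayscale sequence.

    Illustrates the weaknesses noted above: one global threshold for
    the whole scene, and a single background value per pixel (so a
    bimodal background cannot be represented).
    """
    background = frames[0].astype(np.float64)
    masks = []
    for frame in frames[1:]:
        frame = frame.astype(np.float64)
        # single, predetermined threshold for the entire scene
        masks.append(np.abs(frame - background) > threshold)
        # blend the current frame into the background estimate;
        # slow-moving objects pollute the average here
        background = (1.0 - alpha) * background + alpha * frame
    return masks
```

Because every frame is blended in, a slow-moving object is gradually absorbed into the estimate, which is exactly the failure mode the text identifies.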
Changes in scene lighting can cause problems for many backgrounding methods. Ridder et al. modeled each pixel with a Kalman Filter which made their system more robust to lighting changes in the scene. While this method does have a pixel-wise automatic threshold, it still recovers slowly and does not handle bimodal backgrounds well. Koller et al. have successfully integrated this method in an automatic traffic monitoring application.
"Pfinder" uses a multi-class statistical model for the tracked objects, but the background model is a single Gaussian per pixel. After an initialization period where the room is empty, the system reports good results. There have been no reports on the success of this tracker in outdoor scenes.
Friedman and Russell have recently implemented a pixel-wise EM framework for detection of vehicles that bears the most similarity to our work. Their method attempts to explicitly classify the pixel values into three separate, predetermined distributions corresponding to the road color, the shadow color, and colors corresponding to vehicles. Their attempt to mediate the effect of shadows appears to be somewhat successful, but it is not clear what behavior their system would exhibit for pixels which did not contain these three distributions. For example, pixels may present a single background color or multiple background colors resulting from repetitive motions, shadows, or reflectances.
Rather than explicitly modeling the values of all the pixels as one particular type of distribution, we simply model the values of a particular pixel as a mixture of Gaussians. Based on the persistence and the variance of each of the Gaussians of the mixture, we determine which Gaussians may correspond to background colors. Pixel values that do not fit the background distributions are considered foreground until there is a Gaussian that includes them with sufficient, consistent evidence supporting it.
Our system adapts to deal robustly with lighting changes, repetitive motions of scene elements, tracking through cluttered regions, slow-moving objects, and introducing or removing objects from the scene. Slowly moving objects take longer to be incorporated into the background, because their color has a larger variance than the background. Also, repetitive variations are learned, and a model for the background distribution is generally maintained even if it is temporarily replaced by another distribution which leads to faster recovery when objects are removed.
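The test for whether a pixel value "fits" one of its Gaussians can be sketched as below. Matching a value to a Gaussian within 2.5 standard deviations is a common choice in mixture-based backgrounding; treat the exact constant and function names here as illustrative assumptions.

```python
def match_gaussian(value, means, stds, k=2.5):
    """Return the index of the first Gaussian within k standard
    deviations of the value, or -1 if none matches (foreground)."""
    for i, (mu, sigma) in enumerate(zip(means, stds)):
        if abs(value - mu) < k * sigma:
            return i
    return -1
```

A per-distribution threshold of k standard deviations is what gives each pixel its own automatic threshold, rather than one global setting for the scene.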
Our backgrounding method contains two significant parameters: alpha, the learning rate, and T, the proportion of the data that should be accounted for by the background. Without altering these parameters, our system has been used in an indoor human-computer interface application and, for the past 16 months, has continuously monitored outdoor scenes.
|Figure 1- This figure illustrates one situation where a bimodal method could do much better than most alternatives. The scatter plot shows the values of a particular pixel over a short window in time. This is certainly not a Gaussian distribution. Other multi-modal processes result from monitor flicker, flicker from particular lighting, flags in the wind, trees blowing in the wind, etc.
If each pixel resulted from a particular surface under particular lighting, a single Gaussian would be sufficient to model the pixel value while accounting for acquisition noise. If only lighting changed over time, a single, adaptive Gaussian per pixel would be sufficient. In practice, multiple surfaces often appear in the view frustum of a particular pixel and the lighting conditions change. Water in sunlight shows a particularly salient bimodality as illustrated in Figure 1. Thus, multiple, adaptive Gaussians are necessary. We use a mixture of adaptive Gaussians to approximate this process.
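A toy experiment makes the point concrete. The mode locations below are invented for illustration (e.g. shadowed water vs. sunlit glint); fitting a single Gaussian to such samples puts the mean in a region where the pixel value almost never actually falls.

```python
import random

random.seed(0)

# simulate a bimodal pixel: two well-separated surface/lighting modes
samples = [random.gauss(60, 3) if random.random() < 0.5 else random.gauss(180, 3)
           for _ in range(1000)]

# a single-Gaussian fit centers on the overall mean...
mean = sum(samples) / len(samples)

# ...but almost no observed value lies near that mean
near_mean = sum(1 for s in samples if abs(s - mean) < 10) / len(samples)
print(round(mean), near_mean)
```

The single Gaussian's mean lands between the modes, so thresholding against it would misclassify both legitimate background appearances; a two-component mixture captures each mode directly.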
|Figure 2- This figure shows a version of our tracker. It includes the current image with tracking information overlaid, the foreground image, the accumulated template, and the zoomed TrackCam window.
Each time the parameters of the Gaussians are updated, the Gaussians are evaluated using a simple heuristic to hypothesize which are most likely to be part of the ``background process.'' Pixel values that do not match one of the pixel's ``background'' Gaussians are grouped using connected components. Finally, the connected components are tracked from frame to frame using a multiple hypothesis tracker. The process is illustrated in Figure 2.
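The heuristic for hypothesizing which Gaussians belong to the background process can be sketched as follows, under the common formulation: distributions are ranked by weight/std (persistent, low-variance distributions first), and the first few whose weights sum past the proportion T are labelled background. Variable names and the value of T here are illustrative assumptions, not the system's exact settings.

```python
import numpy as np

def background_indices(weights, stds, T=0.7):
    """Return the indices of the Gaussians hypothesized as background.

    weights: mixture weights (evidence/persistence of each Gaussian)
    stds:    standard deviations (low variance suggests background)
    T:       proportion of the data the background should account for
    """
    weights = np.asarray(weights, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # rank by weight/std: high-evidence, low-variance Gaussians first
    order = np.argsort(-weights / stds)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(int(i))
        cum += weights[i]
        if cum > T:
            break
    return chosen
```

A pixel value matching none of the chosen Gaussians is marked foreground and passed on to the connected-components and tracking stages.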
Rather than discuss particular aspects of the method on this web page, we refer you to our Publications sections which includes papers describing the system in more detail.
On an SGI O2 with a R10000 processor, this method can process 11 to 13 frames a second (frame size 160x120 pixels). The variation in the frame rate is due to variation in the amount of foreground present. Our tracking system has been effectively storing tracking information for five scenes for over 16 months.
While quick changes in cloud cover (relative to alpha, the learning rate) can sometimes necessitate a new set of background distributions, the model stabilizes within 10-20 seconds and tracking continues unhindered.
When deciding on a tracker to implement, the most important information to a researcher is where the tracker is applicable. This section will endeavor to pass on some of the knowledge we have gained through our experiences with this tracker.
The tracking system has the most difficulty with scenes containing high occurrences of objects that visually overlap. The multiple hypothesis tracker is not extremely sophisticated about reliably disambiguating objects which cross. This problem can be compounded by long shadows, but for our applications it was much more desirable to track an object and its shadow and avoid cropping or missing dark objects than it was to attempt to remove shadows. In our experience, on bright days when the shadows are the most significant, both shadowed regions and shady sides of dark objects are black (not dark green, not dark red, etc.).
The good news is that the tracker was relatively robust to all but relatively fast lighting changes (e.g.\ flood lights turning on and partly cloudy, windy days). It successfully tracked outdoor scenes in rain, snow, sleet, and hail, and on overcast and sunny days. It has also been used to track birds at a feeder, mice at night using Sony NightShot, fish in a tank, people entering a lab, and objects in outdoor scenes. In these environments, it reduces the impact of repetitive motions from swaying branches, rippling water, specularities, slow-moving objects, and camera and acquisition noise. The system has proven robust to day/night cycles and long-term scene changes. More recent results and project updates are available at http://www.ai.mit.edu/projects/vsam/.
As computers improve and parallel architectures are investigated, this algorithm can be run faster, on larger images, and using a larger number of Gaussians in the mixture model. All of these factors will increase performance. A full covariance matrix would further improve performance. Adding prediction to each Gaussian (e.g.\ the Kalman filter approach) may also lead to more robust tracking of lighting changes.
Beyond these obvious improvements, we are investigating modeling some of the inter-dependencies of the pixel processes. Relative values of neighboring pixels and correlations with neighboring pixels' distributions may be useful in this regard. This would allow the system to model changes in occluded pixels from observations of some of their neighbors.
Our method has been used on grayscale, RGB, HSV, and local linear filter responses, but it should be capable of modeling any streamed input source in which our assumptions and heuristics are generally valid. We are investigating use of this method with frame-rate stereo and IR cameras, and including depth as a fourth channel (R, G, B, D). Depth is an example where multi-modal distributions are useful: while disparity estimates are noisy due to false correspondences, those noisy values are often relatively predictable when they result from false correspondences in the background.
In the past, we were often forced to deal with relatively small amounts of data, but with this system we can collect images of moving objects and tracking data robustly on real-time streaming video for weeks at a time. This ability is allowing us to investigate future directions that were not available to us in the past. We are working on activity classification and object classification using literally millions of examples.
This web page has shown a novel, probabilistic method for background subtraction. It involves modeling each pixel as a separate mixture model. We implemented a real-time approximate method which is stable and robust. The method requires only two parameters, alpha and T. These two parameters are robust to different cameras and different scenes.
This method deals with slow lighting changes by slowly adapting the values of the Gaussians. It also deals with multi-modal distributions caused by shadows, specularities, swaying branches, computer monitors, and other troublesome features of the real world which are not often mentioned in computer vision. It recovers quickly when background reappears and has an automatic pixel-wise threshold. All these factors have made this tracker an essential part of our activity and object classification research.
This system has been successfully used to track people in indoor environments, people and cars in outdoor environments, fish in a tank, ants on a floor, and remote control vehicles in a lab setting. All these situations involved different cameras, different lighting, and different objects being tracked. This system achieves our goals of real-time performance over extended periods of time without human intervention.