Research in Algorithms for Geometric Pattern Matching

MIT2001-06

 

Progress Report: July 1, 2002 ­ December 31, 2002

 

Piotr Indyk

 

 

 

Project Overview

 

Geometric pattern matching is pervasive in many areas of computer science, e.g., in computer vision, computational drug design and computational biology.

The goal of this project is to develop efficient algorithms for key geometric pattern matching problems.

 

As mentioned earlier, the focus of this project is to design a fast similarity search algorithm for large data sets of images. To evaluate the similarity between images, we use the Earth-Mover Metric (EMD). It was experimentally verified to capture well the perceptual notion of a difference between images, in fact much better than other well-known metrics (e.g., Euclidean distance between the feature vectors). The basic idea behind EMD is as follows. Assume that the features of an image are represented by a set of points in low-dimensional space Rd. For example, an image could be represented by a set of pixels, where each pixel is a point in 3-dimensional color space or texture space. The distance between two sets of points (representing two different images) is defined as the minimum amount of work needed to transform one set into another. Formally, this corresponds to the minimum weight matching between the two sets of points.

 

Since EMD has been shown to outperform other measures for comparing color or texture similarity between images, it is of great interest to design efficient algorithms for pattern matching under this metric. In particular, the most interesting case occurs when one is given a ³query² image, and wants to scan a large database of images, in order to find the image most similar to the query. The approach used so far is to compute the distances between the query image and each image stored in the database. This is highly inefficient, since the time needed to answer a query could be very large for large databases.

 

During the earlier stages of this project we designed and implemented a method which drastically reduces the time needed to solve this problem [IT¹01].   The main idea of our approach is to embed the Earth Mover Distance into Manhattan space and use very efficient nearest neighbor data structure for the latter (well-studied) space. In other words, we show that one can represent each pixel set by a feature vector, in such a way that the EMD between two pixel sets is approximately proportional to the Manhattan distance between the feature vectors. The distortion induced by the embedding algorithm is provably bounded.

 

Since very fast nearest neighbor algorithms for normed spaces are known (e.g., see [IM¹98, GIM¹98]), our embedding method yields dramatic improvement in the running of nearest neighbor algorithms for EMD. However, as we mentioned above, the embedding is not exact ­ it introduces a small error which could in principle affect the quality of the retrieved images. Thus, for this approach to work in practice, it is crucial to verify that the actual error occurring in practice is low. This could require additional adjustments and fine-tuning of the algorithm, to minimize the embedding error.

 

In the next section we describe our progress on implementing and evaluating our method in the context of image retrieval in large image databases.

 

 

 

Progress Through December 2002

 

 

During the period of June-December 2002, we implemented and evaluated algorithms for fast nearest neighbor search in Rd. For this purpose, we implemented a variant of the Locality-Sensitive Hashing algorithm [GIM¹99]. To this end, we needed to modify many parts of the original algorithm. In particular:

 

·       The original algorithm worked efficiently only when the input vectors were binary; we introduced new hashing scheme that works directly on points in d-dimensional space

·       The original algorithm could not handle sparse vectors efficiently. This was undesirable, since the vectors obtained by using our embedding tools were very high-dimensional but sparse. We adapted the algorithm to deal with sparse data.

 

The final algorithm enables very fast similarity search in large collections of high-dimensional images. In particular, for a set of 20,000 points obtained by extracting color features of Corel-Draw images, LSH returned the answers 10-20 times faster than linear scan (the best previous method). Although LSH is a probabilistic and approximate algorithm, it almost always returned exact nearest neighbor. Since the ³EMD to Rd ³ embedding introduces some small error (about 15%), our algorithm incurs a small error with respect to the original EMD metric, while achieving an order of magnitude speedup over earlier methods. We mention that the LSH algorithm has not been fine-tuned yet, and thus we expect much larger gain in future experiments.

 

Research Plan for the Next Six Months

 

Our main goals for the nearest future are:

 

·       Build an easy-to-navigate user interface

·       Perform rigorous user experiments to measure the influence of the embedding error on the perceptual retrieval error

·       Write a report

 

In addition (time permitting) we plan to investigate the (quite fortunate!) discrepancy between the theoretical error bounds and the error we achieve in practice. This could entail developing a model for color histogram data sets which are closer to reality than the current worst-case approach.

 

References

 

[EMD] Scott Cohen, ³Computing Earth-Mover distance under transformations²,   http://robotics.stanford.edu/~scohen/research/emdg/emdg.html

 

[GIM¹99] Aris Gionis, Piotr Indyk and Rajeev Motwani, ³Similarity Search in High Dimensions via Hashing², IEEE Symposium on Very Large Databases, 1999.

 

[IM¹98] Piotr Indyk and Rajeev Motwani, ³Approximate Nearest Neighbor ­ Towards Removing the Curse of Dimensionality², ACM Symposium on Theory of Computing, 1998.

 

[IT¹01] Piotr Indyk and Nitin Thaper,  ³Embedding Earth-Mover Distance into the Euclidean space², manuscript, 2001.