Microsoft open source Space Partition Tree And Graph, key algorithms behind Bing search
Microsoft has just announced the open source of a key algorithm behind Bing search, Space Partition Tree And Graph (SPTAG), which enables Bing to quickly return search results to users.
SPTAG (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenerio, which is written in C++ and wrapped by Python.
This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distances or cosine distances. Vectors returned for a query vector are the vectors that have smallest L2 distance or cosine distances with the query vector.
SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relatrive neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index building cost, and SPTAG-BKT is advantageous in search accuracy in very high-dimensional data.
SPTAG is inspired by the NGS approach [WangL12]. It contains two basic modules: index builder and searcher. The RNG is built on the k-nearest neighborhood graph [WangWZTG12, WangWJLZZH14] for boosting the conectivity. Balanced k-means trees are used to replace kd-trees to avoid the inaccurate distance bound estimation in kd-trees for very high-dimensional vectors. The search begins with the search in the space partition trees for finding several seeds to start the search in the RNG. The searches in the trees and the graph are iteratively conducted.
Microsoft said the SPTAG library has so far cataloged more than 150 billion pieces of data, including individual words, characters, web snippet, and full queries. “Bing processes billions of documents every day, and the idea now is that we can represent these entries as vectors and search through this giant index of 100 billion-plus vectors to find the most related results in 5 milliseconds,” said Jeffrey Zhu, a program manager on Microsoft’s Bing team.
The Bing team expects open source SPTAG to be used to build applications that recognize language based on audio clips, or for users to take photos of plants and identify genus and species. SPTAG is an open source project on GitHub.