Inverted File Index Demo
Advanced version of Inverted File Index Demo
Inverted File Index
Inverted File Index is the name for the family of algorithms designed for a quick search of relevant documents in response to a query. Those who need quick introduction can read this article.
In this research I publish the method and the program that is close to other technologies but not exactly the same. The criteria, that algorithm must meet, is the capability to find documents in a collection of at least 10 billion items for not longer than a quarter of a second. It must allow concurrent search, use of multi-processor systems or cloud computing, allow easy update of collection (remove and add documents) without reconstructing of all indexes and replacement of auxiliary data sets.
During the processing of documents, the words in documents are stemmed and reduced to lower case, also some stop words are removed (but that is optional). The words then are passed to the dictionary, which returns index as a whole number. When same word is repeated in text, the dictionary returns same whole number, so the documents are converted into the sequences of integers. The queries are subjected to the same transformation and are converted to shorter set of integers. The programmatic task is to find documents where these integers from the query are located in the neighborhood, not very far from each other, but not necessarily in the same order. Requirement of proximity for the words in a query is mandatory otherwise system may treat words VICE and PRESIDENT sitting far apart as VICE PRESIDENT and return this document to a query VICE PRESIDENT.
The solution at the top link is complete demo of this algorithm. It contains separate projects: creation of indexes and auxiliary data set (project is called INVERTEX) and demo of query search (project is called QUERY). INVERTEX needs only folder name for the collection of documents. In demo version it creates the local database in the location of executable. The database contains list of processed documents in ASCII format, list of all words in ASCII format, binary file for the indexes of the documents and subfolder with binary data holding location of every word. These data can be recognized by the word. Each individual file name match the word. Demo project QUERY needs the text of the query and location of database. For these demo projects the relative paths are hardcoded and can be easily recognized.
AlgorithmThe number of distinct words in collections is, usually, near 100,000. The total number of words in documents is, let say, 500 * 10,000,000,000 = 0x48C27395000, which is longer than 4 bytes. On that reason the collection is processed by smaller groups where number of words is 4 byte integer. We can set limit for a group as reasonably large number, up to a million documents, depending on computer resources. In INVERTEX we build a single array of integers and list of processed files. Array of integers is sequence of indexes for each word returned by dictionary, such as 34, 8765, 8876, 145, 34, 778, 3456, ..., and so on. When file is terminated we add separator as -1. This functionality can be found in function doProcessFiles(), class WordProcessing. This array then is passed to function OutputDatabase() that completes the building of indexes. In OutputDatabase() we create set of lists and populate them with sequential indexes. For example, for above figures, we create list 34 and pass 0 to it, then list 8765 and pass 1 to it and so on. When number 34 is met for the second time we pass 4 to list 34. When set of lists is ready, we save them as database, each list is output into a separate file. File 34 will have numbers 0, 4, ... and so on, all positions where word 34 is met. When -1 is read, that means we reached the end of file and we increment position counter on a relatively large number, such as 1000. That means when word 34 is met at the last position of the previous file and first position of the next file, the counter is significantly different and we don't see these words next to each other. For clarity only, I replaced the numbers in file names by words from the dictionary, so rare and frequent words can be told from each other by simple sorting files in BIN directory.
The query is processed in a very simple way. We pass query to the same dictionary and get numbers that are names of files. We read all integers from files, sort them, read this sorted array and find longest sequence with given but small admissible gap. Gap tells the program how far from each other we allow the words from the query to be sitting in the file. We ignore the order and also how many times each individual word occurs. That may cause return of irrelevant file as rare event, but it can be easy prevented by elementary post processing that already deals with few hundred files, not millions.
To save space, I added elementary compression. The integers in files are, actually, differences between next and previous count. The numbers are always positive and not very large, most of them are between 1 and 3000. I apply simple Huffman compression. The least significant byte is output as is. Fist 3 bytes are subjected to Huffman compression with predefined tree. The tree only covers first 4 bits. When number sitting in first 3 bytes is longer than 4 bits it outputs as is with the flag. The overall compression ratio is 5:1 for small collections and 7:1 for large collections.
TestingThe simplest way to test is to use collection PATENTCORPUS128. Download the data, unzip and make sure that rootFolder (in Main function of ivertex project) is assigned correct path to data. Start invertex and wait when it finishes preparation of the database. It usually takes less than a minute. When it is finished, activate query project. Make sure path variable (in Main function of query project) is pointing to location of created database. Open any file from collection PATENTCORPUS128, copy any text fragment and start query program. If copied text is not few generic words the program identify the file where query is copied from. It also returns statistics in a form of array.
520 75 16 1 1 0 1 0 0 0 0 0 1It shows that only one word from the query occurred in 520 locations, only two words in 75 locations, and so on. As result of the query the program shows the file name where it found the maximum number of words. In this particular case it was one time and the number of words was 13. The variable wordsproximity = 2 sets the limit for counting words from the query. Number 2 means that it allows next word to be the first or the second one from the previous word. The order is ignored, the next and previous could be any two words from the query. Very convenient way to test queries is to assign them values from text definition of the patent class. The files are technical desriptions of the patents. First 3 numbers in the file name is class number. USPTO.GOV provides definitions for each class, for example:
SCANNING-PROBE TECHNIQUES OR APPARATUS; APPLICATIONS OF SCANNING-PROBE TECHNIQUES, E.G., SCANNING-PROBE MICROSCOPY [SPM] is class 850
INTERACTIVE VIDEO DISTRIBUTION SYSTEMS is class 725
In my experiments it perfectly worked. We enter human definition for a class and system finds technical descriptions of the patents classified by human that belong to this class.
The statistics can be used for preparation of sorted lists of files according to number of occurrences. In case of very large collection program should be updated to make set of independent databases, process them concurrently in multi-processor system and merge the results. Files can be added at any time. Removal of the file from collection means to clean sequential indexes. It affects only one database out of many and can be arranged without even replacement of database, just by removal of some digits in binary files.
The QUERY project of advanced version is adjusted to return sorted list of files where the words from the query are observed in close proximity. It takes into consideration the overall number of words from the query and longest sequence of words from query in the neighborhood of each other.
Also new research is added recently. Now user can find files similar to a given one. It is provided by function:
string files = qProc.findSimilarFiles(fileIndex, queryLength, wordsproximity, KMAX, outputListSize);The file ID is provided by the fist parameter - file index, which is index in the list of processed files. Function reads file, makes all possible sequential queries of provided length, execute all possible queries, score the result for every file in collection and returns the list of most frequent file names returned in all queries. It, usually, takes about one minute. The size of collection does not make this time significantly longer. It may be important and also convenient to find files similar to one provided. Comparison of this method to LSA and HAC showed that it works approximately with the same precision, however HAC is much quicker.