Author: Jim McKeeth
Describe the principles of an indexed search engine (like Google): how it works when a lot of text information is compressed into an index, so that a search can be done in a very short time.
There are really two main ways to search a large collection of text documents. The
simplest method is to load each document and scan through it for the search
terms; this is referred to as a full text scan. The second, much faster
method is to create an index and then search the index. An index is a list of
terms found in a document or set of documents. Each word appears only once per
document, so the index is much shorter than the original document.
Creating an index
Finding the words
In order to create an index you must first parse the document. Parsing is the
process of picking out the individual tokens (terms or words) in a piece of text.
A parser is a type of state machine, and there are many existing parsing routines
available. "The Tomes of Delphi: Algorithms and Data Structures" by Julian Bucknall
contains many very good parsers.
A simple parser scans through a string of text, starting at the beginning and
looking at each character. If the character is a letter or number then it is part
of a word; if it is white space or punctuation then it is a separator. Each word
is added to a list (e.g. a TStringList) in the order it is found in the document.
Typically each word is converted to the same case (upper or lower).
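A minimal sketch of such a parser in Python (the function name and the "letters and digits" word rule are just illustrative assumptions):

```python
import re

def parse_words(text):
    """Pick out individual tokens: runs of letters and digits are words,
    everything else (white space, punctuation) is a separator.
    Every word is converted to lower case."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", text)]

words = parse_words("The quick, brown fox -- jumped!")
# words == ["the", "quick", "brown", "fox", "jumped"]
```

A production parser would be a character-by-character state machine, but a regular expression captures the same word/separator rule for short texts.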
It is important to consider what you are indexing and how your index will be
used when creating your index and parsing. For example, if you are parsing HTML
then you want to exclude most tags (with the obvious exception of META tags, which
are handled specially). Other times you might only want to index summary
information about each document.
Indexing the Words
Now that we have a parsed token list we need to index it. The simplest index is
just a list of each word found in a document, and a reference to the document.
This reference may be a URL, a document name or any other unique identifier (a GUID
or a foreign key to another table describing the document). A more complex index
may include the number of times the word is found in the document or a ranking for
where it is in the document (in the title, keyword section, first paragraph,
middle, last, etc.) This additional information stored with each word is part of
what differentiates one search engine's performance from another.
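As a rough sketch, an in-memory version of such an index (word, document reference, and per-document occurrence count) might look like this in Python; the structure and names are illustrative, not a prescribed design:

```python
from collections import Counter

def index_document(doc_id, words, index):
    """Add one parsed document's words to an inverted index.
    index maps word -> {doc_id: occurrence count}."""
    for word, count in Counter(words).items():
        index.setdefault(word, {})[doc_id] = count

index = {}
index_document(1, ["the", "cat", "sat", "on", "the", "mat"], index)
index_document(2, ["the", "dog", "sat"], index)
# index["sat"] == {1: 1, 2: 1}
# index["the"] == {1: 2, 2: 1}
```

Extending each entry with a positional ranking (title, first paragraph, and so on) is what more sophisticated engines layer on top of this basic shape.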
Many times certain words are left out. These are called stop words. Stop words
are common words, words that will not be searched on, or words that will not
enhance the meaning of a search. Examples of stop words include "THE, A, AN,
AND, IF, BUT", words with numbers in them, or anything else you want to filter out.
Selecting stop words is another point of separation in performance.
Some web search engines used to leave out words like "HTML" or "WEB" because they
were so common while other search engines would include every word. Other search
engines start with a dictionary list and only index words found in that dictionary
list. This leads to trouble when you are indexing names, technical terms or
anything else not found in your original dictionary.
One time I was creating a search engine for a collection of newsgroup articles. I
discovered that there were UUEncoded (similar to MIME or Base64) binaries in the
articles. This resulted in my parser finding words that were hundreds of
characters long and total gibberish. I decided to omit any word longer than 50
characters or shorter than 4. Making the choices about what to include and what to
omit is an important decision, and will vary based on what content you are indexing.
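The filtering described above (stop words plus word-length limits) could be sketched like this; the stop list and the 4/50 limits come from the examples in the text, and a real deployment would tune both:

```python
STOP_WORDS = {"the", "a", "an", "and", "if", "but"}

def filter_terms(words, min_len=4, max_len=50):
    """Drop stop words, plus any term shorter than min_len or longer
    than max_len characters (e.g. UUEncoded gibberish)."""
    return [w for w in words
            if w not in STOP_WORDS and min_len <= len(w) <= max_len]

filter_terms(["the", "index", "searching", "x" * 60])
# → ["index", "searching"]
```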
So here is an example table structure for your index:
Document: Number (foreign key to Documents table)
Word : String (if our longest word is 20 characters)
Count : Number (how many times the word is found)
The primary key would be a compound of Document and Word, since each word is listed
once per document. The Documents table itself might look like this:
Document : Number (primary key index)
Title : String (title of document)
Location : String (URL or full filename and path)
Optionally you could include the entire document as a blob in this table. You
could also have other tables that list terms (from the META section of the
document) or include authors. Again, this design choice depends on the type of
documents you are indexing and the purpose of your search engine.
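Using SQLite as a stand-in database, the two tables above could be created and populated like this; the table names (Documents, WordIndex) are illustrative, following the example layout rather than any fixed requirement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    Document INTEGER PRIMARY KEY,          -- unique document id
    Title    TEXT,                         -- title of document
    Location TEXT                          -- URL or full filename and path
);
CREATE TABLE WordIndex (
    Document INTEGER REFERENCES Documents(Document),
    Word     TEXT,
    Count    INTEGER,                      -- times the word is found
    PRIMARY KEY (Document, Word)           -- each word listed once per document
);
""")
conn.execute("INSERT INTO Documents VALUES (1, 'Example page', 'http://example.com')")
conn.execute("INSERT INTO WordIndex VALUES (1, 'delphi', 3)")
```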
Searching Your Index
Once all the indexes are stored in a database you need to be able to search the
index for a document. A simple SQL statement to search for documents that contain
a single word (assuming the index table above is named WordIndex) could look like
this:
SELECT Document, Count
FROM WordIndex
WHERE Word = :v_search_term
ORDER BY Count DESC
This returns all documents containing your single search term, ordered by the
number of times the word is found. If you want to use SQL, then searching on
multiple terms involves a join for each term. Instead you could retrieve a list
for each term and then merge them manually. This is where you would support the
AND, OR and NOT key words.
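Merging the per-term lists manually might look like this sketch (function and parameter names are made up for illustration), reusing an index shaped as word -> {document: count}:

```python
def search(index, all_terms=(), any_terms=(), not_terms=()):
    """Combine per-term document sets with AND / OR / NOT semantics.
    all_terms: every term must appear (intersection).
    any_terms: at least one must appear (union, added to the result).
    not_terms: documents containing these are removed."""
    docs = set()
    if all_terms:
        docs = set(index.get(all_terms[0], {}))
        for term in all_terms[1:]:
            docs &= set(index.get(term, {}))     # AND: intersect
    for term in any_terms:
        docs |= set(index.get(term, {}))         # OR: union
    for term in not_terms:
        docs -= set(index.get(term, {}))         # NOT: subtract
    return docs

index = {"cat": {1: 1, 2: 2}, "dog": {2: 1, 3: 1}, "fish": {3: 1}}
search(index, all_terms=["cat", "dog"])          # → {2}
```

The surviving documents could then be ranked by summing the stored counts for each matched term.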
If you want to allow phrase searching then you could search for each word in the
phrase and then search those documents for the phrase. The same technique could be
used for the NEAR key word. There are other more advanced techniques to do this
that are much quicker, but they are beyond the scope of this document.
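A sketch of that two-step phrase search, assuming we kept the parsed word list for each document (the names here are hypothetical):

```python
def phrase_search(index, documents, phrase_words):
    """Two-step phrase search: first narrow to documents containing
    every word of the phrase (via the index), then confirm the words
    actually appear consecutively.
    documents maps doc_id -> parsed word list."""
    candidates = set(index.get(phrase_words[0], {}))
    for word in phrase_words[1:]:
        candidates &= set(index.get(word, {}))   # must contain every word
    hits = set()
    n = len(phrase_words)
    for doc_id in candidates:                    # now check adjacency
        words = documents[doc_id]
        if any(words[i:i + n] == phrase_words
               for i in range(len(words) - n + 1)):
            hits.add(doc_id)
    return hits
```

A NEAR operator would relax the adjacency check to "within k words of each other".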
Once the hits are found and ranked then display the title of each document,
possibly a summary or the context of the hits, and provide a way for your user to
reach the document.
One thing Google does a little differently is that it looks at how pages are
linked. This works really well with the hyperlinked nature of the web. For
example, if you search for Borland, most pages that mention Borland link to
www.borland.com. This is assumed to indicate that www.borland.com is a very
important site about Borland. Google also limits the number of hits you get on
each domain.
Many search engines also rank pages higher if the search term appears in the URL or
title of the page. They also look at the description and keywords meta tags for
ranking. Some search engines will actually ignore a word if it appears too often
in a page. This weeds out sites that try to artificially inflate their rankings.
Phonetics, or Soundex, is another technique that can be used. This could be done
with an additional table similar to the word table, but storing the Soundex value
for each word instead of the actual word.
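For reference, here is a sketch of one common variant of the Soundex algorithm (implementations differ in small details, so treat this as illustrative rather than canonical):

```python
def soundex(word):
    """Classic American Soundex: keep the first letter, then encode the
    remaining consonants as up to three digits by sound group.
    Assumes a non-empty alphabetic word."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3",
             "l": "4",
             "m": "5", "n": "5",
             "r": "6"}
    word = word.lower()
    result = word[0].upper()
    last = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:       # skip adjacent duplicates
            result += code
        if ch not in "hw":              # h and w do not separate doubles
            last = code
    return (result + "000")[:4]         # pad/trim to letter + 3 digits

soundex("Robert")   # → "R163"
soundex("Rupert")   # → "R163" (sounds alike, same code)
```

Storing these codes in a parallel table lets a query for "Rupert" also find documents containing "Robert".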
Searching a shorter and well organized index is much quicker than searching an
entire document. Creating and searching an index takes a lot more planning and
effort up front, but quickly pays off if the text is searched very often.
Typically the larger and more complex the index, the more effective the search. If
your index gets too large or complex then the search speed will degrade.
There are off the shelf searching tools available to end users and developers
alike. dtSearch ( http://www.dtsearch.com ) makes a wide range of searching tools
for both end users and all types of developers. Tamarack Associates' Rubicon
( http://www.tamaracka.com ) is a search component for Delphi that provides a lot
of flexibility, especially in storage options. Both are extremely fast and useful,
but don't let that stop you from designing your own, especially if you have a
specialized need.