How do I search an index for a prefix?
Author: Deron Eriksson
Description: This Java tutorial shows how to query an index for a particular prefix using the PrefixQuery class of Lucene.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)


In another tutorial, we examined how to search an index for a term. In this tutorial, we'll examine how to search an index for a prefix of a word. We'll see how to do this for searches performed with a QueryParser, and also how to do this for searches performed with a PrefixQuery.

We'll utilize a project with the following structure. The "filesToIndex" directory contains two text files that will be indexed, and the "indexDirectory" will contain a file system index that we will create based on the text files.

project structure

Two text files in the "filesToIndex" directory will be indexed. The first one, deron-foods.txt, lists some foods that I like.

deron-foods.txt

Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt

Here are some foods that Nicole likes:
apples
bananas
salad
mushrooms
cheese

Now, let's examine the LucenePrefixQueryDemo class. It first creates an index using the two text documents that we just mentioned. One field of the documents in the index consists of the canonical path to the file, and the other field is based on the contents of the text file. I won't discuss the details of index creation, since this is covered in another tutorial.

LucenePrefixQueryDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LucenePrefixQueryDemo {

	public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
	public static final String INDEX_DIRECTORY = "indexDirectory";

	public static final String FIELD_PATH = "path";
	public static final String FIELD_CONTENTS = "contents";

	public static void main(String[] args) throws Exception {

		createIndex();
		searchIndex(FIELD_CONTENTS, "deron");
		searchIndexWithPrefixQuery(FIELD_CONTENTS, "deron");
		searchIndex(FIELD_CONTENTS, "der");
		searchIndexWithPrefixQuery(FIELD_CONTENTS, "der");
		searchIndex(FIELD_CONTENTS, "der*");
		searchIndexWithPrefixQuery(FIELD_CONTENTS, "der*");
		searchIndex(FIELD_PATH, "C:\\projects\\workspace\\demo\\filesToIndex");
		searchIndexWithPrefixQuery(FIELD_PATH, "C:\\projects\\workspace\\demo\\filesToIndex");

	}

	public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
		Analyzer analyzer = new StandardAnalyzer();
		boolean recreateIndexIfExists = true;
		IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
		File dir = new File(FILES_TO_INDEX_DIRECTORY);
		File[] files = dir.listFiles();
		for (File file : files) {
			Document document = new Document();

			String path = file.getCanonicalPath();
			document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

			Reader reader = new FileReader(file);
			document.add(new Field(FIELD_CONTENTS, reader));

			indexWriter.addDocument(document);
		}
		indexWriter.optimize();
		indexWriter.close();
	}

	public static void searchIndex(String whichField, String searchString) throws IOException, ParseException {
		System.out.println("\nSearching for '" + searchString + "' using QueryParser");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		// The parser decides which Query subclass to build (for example TermQuery or PrefixQuery)
		QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer());
		Query query = queryParser.parse(searchString);
		System.out.println("Type of query: " + query.getClass().getSimpleName());
		Hits hits = indexSearcher.search(query);
		displayHits(hits);
		// Hits loads documents lazily, so close the searcher only after the hits have been displayed
		indexSearcher.close();
	}

	public static void searchIndexWithPrefixQuery(String whichField, String searchString) throws IOException {
		System.out.println("\nSearching for '" + searchString + "' using PrefixQuery");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		// The term text is used as the prefix exactly as given; no parsing or analysis takes place
		Term term = new Term(whichField, searchString);
		Query query = new PrefixQuery(term);
		Hits hits = indexSearcher.search(query);
		displayHits(hits);
		// Hits loads documents lazily, so close the searcher only after the hits have been displayed
		indexSearcher.close();
	}

	public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
		System.out.println("Number of hits: " + hits.length());

		Iterator<Hit> it = hits.iterator();
		while (it.hasNext()) {
			Hit hit = it.next();
			Document document = hit.getDocument();
			String path = document.get(FIELD_PATH);
			System.out.println("Hit: " + path);
		}
	}
}

The searchIndex() method of LucenePrefixQueryDemo searches the index using a QueryParser. The search essentially comes down to the following lines of code:

	Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
	IndexSearcher indexSearcher = new IndexSearcher(directory);
	QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer());
	Query query = queryParser.parse(searchString);
	Hits hits = indexSearcher.search(query);

The QueryParser's parse() method returns a Query object based on the text of the searchString parameter. If the searchString consists of one word and this word ends in an asterisk, the Query object returned from parse() will be a PrefixQuery.
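To make the rule above concrete, here is a minimal sketch that classifies a search string the way the paragraph describes. It is a simplified model written for this tutorial, not Lucene's actual parser: the QueryTypeSketch class, the Kind enum, and the classify() method are hypothetical names, and anything beyond a single word with an optional trailing asterisk is simply reported as OTHER.

```java
// Simplified model of the "single word ending in an asterisk" rule.
// This is NOT Lucene code; all names here are hypothetical.
final class QueryTypeSketch {

    enum Kind { TERM, PREFIX, OTHER }

    static Kind classify(String searchString) {
        String s = searchString.trim();
        // Only single words are modeled in this sketch
        if (s.isEmpty() || s.indexOf(' ') >= 0) {
            return Kind.OTHER;
        }
        boolean hasWildcard = s.indexOf('*') >= 0 || s.indexOf('?') >= 0;
        if (!hasWildcard) {
            return Kind.TERM;
        }
        // A prefix query has exactly one wildcard: a '*' in the last position
        String head = s.substring(0, s.length() - 1);
        boolean onlyTrailingStar = s.endsWith("*") && !head.isEmpty()
                && head.indexOf('*') < 0 && head.indexOf('?') < 0;
        return onlyTrailingStar ? Kind.PREFIX : Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("deron")); // TERM
        System.out.println(classify("der"));   // TERM
        System.out.println(classify("der*"));  // PREFIX
    }
}
```

The three calls in main() mirror the "deron", "der", and "der*" search strings used by the demo class.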

The searchIndexWithPrefixQuery() method of LucenePrefixQueryDemo searches the index using a PrefixQuery built directly from a Term. Unlike with the QueryParser, no asterisk should be placed at the end of the word being searched for, because the term text is used as the prefix exactly as given.

	Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
	IndexSearcher indexSearcher = new IndexSearcher(directory);
	Term term = new Term(whichField, searchString);
	Query query = new PrefixQuery(term);
	Hits hits = indexSearcher.search(query);

If we look at the main() method, we can see that first our index gets created and then eight searches against the index are performed:

	createIndex();
	searchIndex(FIELD_CONTENTS, "deron");
	searchIndexWithPrefixQuery(FIELD_CONTENTS, "deron");
	searchIndex(FIELD_CONTENTS, "der");
	searchIndexWithPrefixQuery(FIELD_CONTENTS, "der");
	searchIndex(FIELD_CONTENTS, "der*");
	searchIndexWithPrefixQuery(FIELD_CONTENTS, "der*");
	searchIndex(FIELD_PATH, "C:\\projects\\workspace\\demo\\filesToIndex");
	searchIndexWithPrefixQuery(FIELD_PATH, "C:\\projects\\workspace\\demo\\filesToIndex");

Let's look at the console output from executing LucenePrefixQueryDemo so that we can examine our search results:

Console Output


Searching for 'deron' using QueryParser
Type of query: TermQuery
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'deron' using PrefixQuery
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'der' using QueryParser
Type of query: TermQuery
Number of hits: 0

Searching for 'der' using PrefixQuery
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'der*' using QueryParser
Type of query: PrefixQuery
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'der*' using PrefixQuery
Number of hits: 0

Searching for 'C:\projects\workspace\demo\filesToIndex' using QueryParser
Type of query: TermQuery
Number of hits: 0

Searching for 'C:\projects\workspace\demo\filesToIndex' using PrefixQuery
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

In our first two searches, we look for the word "deron", first using the QueryParser and then using a PrefixQuery, and each search gets one hit. Note that the word is lowercased in the index because the file contents are analyzed with the StandardAnalyzer when the index is created. Since "deron" doesn't end in an asterisk, the Query returned from the QueryParser's parse() method is a TermQuery rather than a PrefixQuery.

In our third search, we search for "der" using the QueryParser. There is no asterisk at the end of "der", so the QueryParser's parse() method returns a TermQuery, which matches only complete terms. Since there is no "der" term in the indexed contents, this search gets no hits.

In our fourth search, we search for "der" as a PrefixQuery. Since "deron" in the index begins with "der", this results in a search hit.
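One way to picture a prefix match is as a range scan over a sorted list of terms: every term that starts with the prefix sits in one contiguous run beginning at the first term that is greater than or equal to the prefix. The sketch below illustrates that idea with a plain TreeSet. It is a conceptual model only, not how Lucene is implemented; the PrefixScanSketch class and termsWithPrefix() method are hypothetical, and the term list is a hand-picked sample of lowercased words from deron-foods.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Conceptual model of prefix matching over a sorted term list.
// This is NOT Lucene code; all names here are hypothetical.
final class PrefixScanSketch {

    static List<String> termsWithPrefix(TreeSet<String> sortedTerms, String prefix) {
        List<String> matches = new ArrayList<String>();
        // tailSet() starts at the first term >= prefix; matching terms are contiguous from there
        for (String term : sortedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the end of the matching run
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-picked sample of lowercased terms, assumed for illustration
        TreeSet<String> terms = new TreeSet<String>(Arrays.asList(
                "artichokes", "deron", "foods", "french", "fries", "hamburger", "steak"));
        System.out.println(termsWithPrefix(terms, "der")); // [deron]
        System.out.println(termsWithPrefix(terms, "f"));   // [foods, french, fries]
    }
}
```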

In the fifth search, we search for "der*" using the QueryParser. The asterisk tells the QueryParser that this is a prefix query, so its parse() method returns a PrefixQuery for the prefix "der". The search gets one hit, since "deron" is in the index.

In the sixth search, we search for "der*" as a PrefixQuery. Here the asterisk is treated as a literal character of the prefix, and since no term in the index begins with "der*", the search returns no hits. The asterisk indicates a prefix query only when using the QueryParser and should not be used when dealing directly with a PrefixQuery (unless you're actually searching for a *).
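If the search string comes from a user who may type a trailing asterisk out of habit, one option is to clean it up before building the PrefixQuery. The helper below is a hypothetical sketch, not part of Lucene or of the demo class: PrefixInputSketch and toPrefix() are invented names. It also assumes that a field analyzed with the StandardAnalyzer holds lowercased terms, so the prefix is lowercased for such fields, because a PrefixQuery built directly from a Term is not analyzed.

```java
import java.util.Locale;

// Hypothetical helper for preparing user input for a directly built PrefixQuery.
// This is NOT Lucene code; all names here are assumptions.
final class PrefixInputSketch {

    static String toPrefix(String userInput, boolean fieldIsLowercased) {
        String s = userInput.trim();
        // Drop a single trailing asterisk, which would otherwise be matched literally
        if (s.endsWith("*")) {
            s = s.substring(0, s.length() - 1);
        }
        // Match the casing of the indexed terms when the field was lowercased at index time
        return fieldIsLowercased ? s.toLowerCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(toPrefix("der*", true)); // der
        System.out.println(toPrefix("Der", true));  // der
        // Untokenized path field: keep the original casing
        System.out.println(toPrefix("C:\\projects\\workspace*", false)); // C:\projects\workspace
    }
}
```

The returned string would then be passed as the text of the Term given to the PrefixQuery, as in searchIndexWithPrefixQuery().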

The seventh search uses the file path field rather than the file contents field. This field stores each file's canonical path. When "C:\projects\workspace\demo\filesToIndex" is passed to the QueryParser as a search string, the result is a TermQuery, and the search gets no hits because no term in the index matches it exactly: each indexed path also includes a file name. The QueryParser is also a poor fit for raw paths in general, since its query syntax gives special meaning to characters such as the colon and the backslash.

In the eighth search, we run a PrefixQuery with "C:\projects\workspace\demo\filesToIndex" as the prefix. Because the path field is indexed as UN_TOKENIZED, each canonical file path is kept in the index as a single term. The search gets two hits, since both canonical file paths begin with that prefix.
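The path prefix works here because the path field is indexed as UN_TOKENIZED in createIndex(), so the whole path is one term. The sketch below contrasts that with what would happen if the same value were split into tokens. It is only an illustration: FieldTermsSketch and its methods are hypothetical, and asTokens() is a crude stand-in (split on non-alphanumeric characters, then lowercase) rather than the StandardAnalyzer's real behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration of single-term versus tokenized field values.
// This is NOT Lucene code; all names here are hypothetical.
final class FieldTermsSketch {

    // Untokenized: the entire value becomes exactly one term
    static List<String> asSingleTerm(String value) {
        return Collections.singletonList(value);
    }

    // Crude stand-in for an analyzer: split on non-alphanumerics and lowercase
    static List<String> asTokens(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String part : value.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String path = "C:\\projects\\workspace\\demo\\filesToIndex\\deron-foods.txt";
        String prefix = "C:\\projects\\workspace\\demo\\filesToIndex";
        System.out.println(anyTermStartsWith(asSingleTerm(path), prefix)); // true
        System.out.println(anyTermStartsWith(asTokens(path), prefix));     // false
    }
}
```

In the tokenized case, no individual token contains the backslashes or the mixed-case directory name, so a prefix made of the whole directory path has nothing to match against.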

As we have seen, prefix queries can be performed either through the QueryParser or by working with a PrefixQuery directly. With a PrefixQuery, the prefix is normally given without an asterisk, whereas with the QueryParser a trailing asterisk marks a word as a prefix query.