How do I query for words near each other with a phrase query?
Author: Deron Eriksson
Description: This Java tutorial shows how to use Lucene's PhraseQuery and QueryParser classes to perform queries related to how close words are to each other.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)


With Lucene, a PhraseQuery can be used to query for a sequence of terms, where the terms do not necessarily have to be next to each other or in order. The PhraseQuery object's setSlop() method sets how many words may appear between the words in the query phrase. Here are the current javadocs (at the time of this writing) for PhraseQuery's setSlop() method:

javadocs for PhraseQuery's setSlop() method

Sets the number of other words permitted between words in query phrase. If zero, then this is an exact phrase search. For larger values this works like a WITHIN or NEAR operator.

The slop is in fact an edit-distance, where the units correspond to moves of terms in the query phrase out of position. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two.

More exact matches are scored higher than sloppier matches, thus search results are sorted by exactness.

The slop is zero by default, requiring exact matches.
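To make the edit-distance idea in the javadocs more concrete, here is a minimal sketch for a two-term phrase. It is a simplified model and not Lucene's actual scorer: the class name SlopSketch and the method requiredSlop() are hypothetical, and the sketch assumes each term occurs once at a known token position.

```java
public class SlopSketch {

	/**
	 * Smallest slop at which the two-term phrase "first second" would match,
	 * given the token position of each term in the document.
	 *
	 * Simplified model (an assumption, not Lucene's real code): shift each
	 * position back by the term's offset within the phrase, then measure how
	 * far apart the shifted positions are.
	 */
	public static int requiredSlop(int firstPos, int secondPos) {
		int shiftedFirst = firstPos; // the first term sits at phrase offset 0
		int shiftedSecond = secondPos - 1; // the second term sits at phrase offset 1
		return Math.abs(shiftedFirst - shiftedSecond);
	}

	public static void main(String[] args) {
		System.out.println(requiredSlop(0, 1)); // adjacent and in order: 0
		System.out.println(requiredSlop(0, 3)); // two other words in between: 2
		System.out.println(requiredSlop(1, 0)); // adjacent but reversed: 2
	}
}
```

The reversed case comes out as 2, which agrees with the javadoc statement that switching the order of two words requires two moves.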

The QueryParser class can be used to generate a PhraseQuery if the search string that is passed to its parse() method is formatted for it. This formatting involves putting double quotes around the words of the phrase query. The 'slop' can be set by following the closing double quote with a tilde (~) and the slop number.
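As a small illustration of this formatting, the following sketch builds such a search string. The class PhraseQueryStringSketch and the method toPhraseQueryString() are hypothetical helpers and not part of Lucene, and the sketch assumes the phrase itself contains no double quotes or other characters that are special to QueryParser.

```java
public class PhraseQueryStringSketch {

	/**
	 * Hypothetical helper: wraps the phrase in double quotes and, when the
	 * slop is greater than zero, appends a tilde followed by the slop number.
	 */
	public static String toPhraseQueryString(String phrase, int slop) {
		if (slop < 0) {
			throw new IllegalArgumentException("slop must not be negative");
		}
		String quoted = "\"" + phrase + "\"";
		return slop == 0 ? quoted : quoted + "~" + slop;
	}

	public static void main(String[] args) {
		System.out.println(toPhraseQueryString("french fries", 0)); // "french fries"
		System.out.println(toPhraseQueryString("hamburger steak", 2)); // "hamburger steak"~2
	}
}
```

The resulting strings have the same form as the ones passed to searchIndexWithQueryParser() in the demo class.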

This tutorial will utilize a project with the following structure. The "filesToIndex" directory contains two text files that will be indexed, and the "indexDirectory" will contain a file system index that we will create based on the text files.

project structure

Two text files in the "filesToIndex" directory will be indexed. The first one, deron-foods.txt, lists some foods that I like.

deron-foods.txt

Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt

Here are some foods that Nicole likes:
apples
bananas
salad
mushrooms
cheese

Here is the LucenePhraseQueryDemo class. It first creates an index from the text files mentioned above. Following this, it performs 5 phrase queries using the PhraseQuery class directly. After this, it performs another 4 queries using the QueryParser class.

LucenePhraseQueryDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LucenePhraseQueryDemo {

	public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
	public static final String INDEX_DIRECTORY = "indexDirectory";

	public static final String FIELD_PATH = "path";
	public static final String FIELD_CONTENTS = "contents";

	public static void main(String[] args) throws Exception {

		createIndex();

		searchIndexWithPhraseQuery("french", "fries", 0);
		searchIndexWithPhraseQuery("hamburger", "steak", 0);
		searchIndexWithPhraseQuery("hamburger", "steak", 1);
		searchIndexWithPhraseQuery("hamburger", "steak", 2);
		searchIndexWithPhraseQuery("hamburger", "steak", 3);

		searchIndexWithQueryParser("french fries"); // BooleanQuery
		searchIndexWithQueryParser("\"french fries\""); // PhraseQuery
		searchIndexWithQueryParser("\"hamburger steak\"~1"); // PhraseQuery
		searchIndexWithQueryParser("\"hamburger steak\"~2"); // PhraseQuery

	}

	public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
		Analyzer analyzer = new StandardAnalyzer();
		boolean recreateIndexIfExists = true;
		IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
		File dir = new File(FILES_TO_INDEX_DIRECTORY);
		File[] files = dir.listFiles();
		for (File file : files) {
			Document document = new Document();

			String path = file.getCanonicalPath();
			document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

			Reader reader = new FileReader(file);
			document.add(new Field(FIELD_CONTENTS, reader));

			indexWriter.addDocument(document);
		}
		indexWriter.optimize();
		indexWriter.close();
	}

	public static void searchIndexWithQueryParser(String searchString) throws IOException, ParseException {
		System.out.println("Searching for '" + searchString + "' using QueryParser");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		QueryParser queryParser = new QueryParser(FIELD_CONTENTS, new StandardAnalyzer());
		Query query = queryParser.parse(searchString);
		System.out.println("Type of query: " + query.getClass().getSimpleName());
		displayQuery(query);
		Hits hits = indexSearcher.search(query);
		displayHits(hits);
	}

	public static void searchIndexWithPhraseQuery(String string1, String string2, int slop) throws IOException,
			ParseException {
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		Term term1 = new Term(FIELD_CONTENTS, string1);
		Term term2 = new Term(FIELD_CONTENTS, string2);
		PhraseQuery phraseQuery = new PhraseQuery();
		phraseQuery.add(term1);
		phraseQuery.add(term2);
		phraseQuery.setSlop(slop);
		displayQuery(phraseQuery);
		Hits hits = indexSearcher.search(phraseQuery);
		displayHits(hits);
	}

	public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
		System.out.println("Number of hits: " + hits.length());

		Iterator<Hit> it = hits.iterator();
		while (it.hasNext()) {
			Hit hit = it.next();
			Document document = hit.getDocument();
			String path = document.get(FIELD_PATH);
			System.out.println("Hit: " + path);
		}
		System.out.println();
	}

	public static void displayQuery(Query query) {
		System.out.println("Query: " + query.toString());
	}

}

Let's talk briefly about the searchIndexWithPhraseQuery() method. It takes two strings representing words in the document and a slop value. It constructs a PhraseQuery by adding two Term objects based on the "contents" field and the string1 and string2 parameters. Following that, it sets the slop value of the PhraseQuery object using the setSlop() method of PhraseQuery. The search is conducted by passing the PhraseQuery object to IndexSearcher's search() method.

	public static void searchIndexWithPhraseQuery(String string1, String string2, int slop) throws IOException,
			ParseException {
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		Term term1 = new Term(FIELD_CONTENTS, string1);
		Term term2 = new Term(FIELD_CONTENTS, string2);
		PhraseQuery phraseQuery = new PhraseQuery();
		phraseQuery.add(term1);
		phraseQuery.add(term2);
		phraseQuery.setSlop(slop);
		displayQuery(phraseQuery);
		Hits hits = indexSearcher.search(phraseQuery);
		displayHits(hits);
	}

The console output from the execution of LucenePhraseQueryDemo is shown here.

Console Output

Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"
Number of hits: 0

Query: contents:"hamburger steak"~1
Number of hits: 0

Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"~3
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'french fries' using QueryParser
Type of query: BooleanQuery
Query: contents:french contents:fries
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"french fries"' using QueryParser
Type of query: PhraseQuery
Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"hamburger steak"~1' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~1
Number of hits: 0

Searching for '"hamburger steak"~2' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Let's talk briefly about the console output. The first phrase query searches for "french" and "fries" with a slop of 0, meaning that the phrase search ends up being a search for "french fries", where "french" and "fries" are next to each other. Since this exists in deron-foods.txt, we get 1 hit.

In the second query, we search for "hamburger" and "steak" with a slop of 0. Since "hamburger" and "steak" don't exist next to each other in either document, we get 0 hits. The third query also involves a search for "hamburger" and "steak", but with a slop of 1. Two words ("french" and "fries") sit between them in deron-foods.txt, so they are not within 1 word of each other and we again get 0 hits.

The fourth query searches for "hamburger" and "steak" with a slop of 2. In the deron-foods.txt file, we have the words "... hamburger french fries steak ...". Since "hamburger" and "steak" are within two words of each other, we get 1 hit. The fifth phrase query is the same search but with a slop of 3. Since "hamburger" and "steak" are within three words of each other (they are two words from each other), we again get 1 hit.
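The hit counts for the "hamburger" and "steak" queries can be checked by hand with a small sketch that uses only the JDK. It is an approximation under stated assumptions: tokenize() is a stand-in for the analyzer (it just lowercases and splits on non-letters, so absolute positions may differ from what StandardAnalyzer produces, although the distance between "hamburger" and "steak" is the same), and matches() is a simplified model of slop for a two-term phrase rather than Lucene's real matching code. The names FoodsSlopSketch, tokenize() and matches() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FoodsSlopSketch {

	// Stand-in for the analyzer (an assumption): lowercase, split on non-letters.
	public static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<String>();
		for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
			if (token.length() > 0) {
				tokens.add(token);
			}
		}
		return tokens;
	}

	// True if the phrase "first second" matches within the given slop, using a
	// simplified model: the second term is expected one position after the
	// first, and the slop is how far it may be from that expected position.
	public static boolean matches(List<String> tokens, String first, String second, int slop) {
		for (int i = 0; i < tokens.size(); i++) {
			if (!tokens.get(i).equals(first)) {
				continue;
			}
			for (int j = 0; j < tokens.size(); j++) {
				if (tokens.get(j).equals(second) && Math.abs(i - (j - 1)) <= slop) {
					return true;
				}
			}
		}
		return false;
	}

	public static void main(String[] args) {
		// Contents of deron-foods.txt from this tutorial
		String text = "Here are some foods that Deron likes:\n"
				+ "hamburger\nfrench fries\nsteak\nmushrooms\nartichokes\n";
		List<String> tokens = tokenize(text);
		for (int slop = 0; slop <= 3; slop++) {
			System.out.println("slop " + slop + ": " + matches(tokens, "hamburger", "steak", slop));
		}
	}
}
```

Under this model the phrase fails to match at slop 0 and 1 and matches at slop 2 and 3, which is consistent with the console output above.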

The next four queries utilize QueryParser. Notice that in the first of the QueryParser queries, we get a BooleanQuery rather than a PhraseQuery. This is because we passed QueryParser's parse() method "french fries" rather than "\"french fries\"". If we want QueryParser to generate a PhraseQuery, the search string needs to be surrounded by double quotes. The next query does search for "\"french fries\"" and we can see that it generates a PhraseQuery (with the default slop of 0) and gets 1 hit in response to the query.

The last two QueryParser queries demonstrate setting slop values. We can see that the slop value can be set by following the closing double quote of the search string with a tilde (~) and the slop number.

As we have seen, phrase queries are a great way to produce queries that have a degree of leeway to them in terms of the proximity and ordering of the words to be searched. The total allowed spacing between words can be controlled using the setSlop() method of PhraseQuery.