How do I combine queries with a boolean query?
Author: Deron Eriksson
Description: This Java tutorial shows how to create Lucene boolean queries that combine multiple queries using BooleanQuery and also QueryParser.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)


With Lucene, it's possible to combine multiple queries with boolean conditions using the BooleanQuery class. The QueryParser class also generates BooleanQuery objects via its parse() method when the search text passed to parse() uses boolean syntax, such as multiple terms, "+" and "-" prefixes, or the AND and OR operators. In this tutorial, we will look at examples of creating boolean queries directly using BooleanQuery, and we will also see how QueryParser generates boolean queries.

This tutorial will utilize a project with the following structure. The "filesToIndex" directory contains two text files that will be indexed, and the "indexDirectory" will contain a file system index that we will create based on the text files.

[Figure: project structure]

Two text files in the "filesToIndex" directory will be indexed. The first one, deron-foods.txt, lists some foods that I like.

deron-foods.txt

Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt

Here are some foods that Nicole likes:
apples
bananas
salad
mushrooms
cheese

The LuceneBooleanQueryDemo class is our demonstration class. It first creates a file system index based on the aforementioned text files via its createIndex() method. Following this, in its main() method, we can see that it creates two term queries that will query the file contents in the index. One query is for "mushrooms" and the other query is for "steak".

After this, it creates a BooleanQuery based on the two term queries and requires that both terms be in the results via:

	BooleanQuery booleanQuery = new BooleanQuery();
	booleanQuery.add(query1, BooleanClause.Occur.MUST);
	booleanQuery.add(query2, BooleanClause.Occur.MUST);

This boolean query requires that "mushrooms" and "steak" must both be present for a hit to be returned from the index. Notice that the queries are added to the BooleanQuery object via its add() method. Although not demonstrated here, the add() method can also be used to add boolean queries to boolean queries. We can construct complex boolean queries by adding boolean queries to other boolean queries.
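To make the idea of nesting concrete, here is a minimal plain-Java sketch that models boolean queries as a small tree. It does not use Lucene at all: the Node, TermNode, and BoolNode types and the example() helper are hypothetical names invented for this illustration, and a document is reduced to a set of terms. The matching rule assumed here is that every MUST clause has to match, no MUST_NOT clause may match, and when there is no MUST clause at least one SHOULD clause has to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in types for illustration only; these are not Lucene classes.
class NestedBooleanSketch {

	enum Occur { MUST, SHOULD, MUST_NOT }

	interface Node {
		boolean matches(Set<String> docTerms);
	}

	// Leaf node: matches when the document contains the term.
	static class TermNode implements Node {
		private final String term;

		TermNode(String term) {
			this.term = term;
		}

		public boolean matches(Set<String> docTerms) {
			return docTerms.contains(term);
		}
	}

	// Composite node: its clauses can be terms or other boolean nodes.
	static class BoolNode implements Node {
		private final List<Node> nodes = new ArrayList<Node>();
		private final List<Occur> occurs = new ArrayList<Occur>();

		void add(Node node, Occur occur) {
			nodes.add(node);
			occurs.add(occur);
		}

		public boolean matches(Set<String> docTerms) {
			boolean hasMust = false;
			boolean anyShould = false;
			for (int i = 0; i < nodes.size(); i++) {
				boolean hit = nodes.get(i).matches(docTerms);
				switch (occurs.get(i)) {
				case MUST:
					if (!hit) return false;
					hasMust = true;
					break;
				case MUST_NOT:
					if (hit) return false;
					break;
				default: // SHOULD
					anyShould = anyShould || hit;
				}
			}
			// With no MUST clause, at least one SHOULD clause has to match.
			return hasMust || anyShould;
		}
	}

	// Food terms from the two sample files (header lines omitted).
	static final Set<String> DERON = new HashSet<String>(Arrays.asList(
			"hamburger", "french", "fries", "steak", "mushrooms", "artichokes"));
	static final Set<String> NICOLE = new HashSet<String>(Arrays.asList(
			"apples", "bananas", "salad", "mushrooms", "cheese"));

	// Builds +(mushrooms raspberries) +steak: a boolean node nested in another.
	static Node example() {
		BoolNode inner = new BoolNode();
		inner.add(new TermNode("mushrooms"), Occur.SHOULD);
		inner.add(new TermNode("raspberries"), Occur.SHOULD);

		BoolNode outer = new BoolNode();
		outer.add(inner, Occur.MUST);
		outer.add(new TermNode("steak"), Occur.MUST);
		return outer;
	}

	public static void main(String[] args) {
		System.out.println("deron-foods.txt matches: " + example().matches(DERON));
		System.out.println("nicole-foods.txt matches: " + example().matches(NICOLE));
	}
}
```

With the real API, the same shape would be built by passing one BooleanQuery to another BooleanQuery's add() method together with a BooleanClause.Occur value.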

In the next boolean query, a document must contain the "mushrooms" term and must not contain the "steak" term in order to be returned as a hit.

	booleanQuery = new BooleanQuery();
	booleanQuery.add(query1, BooleanClause.Occur.MUST);
	booleanQuery.add(query2, BooleanClause.Occur.MUST_NOT);

In the third boolean query, "mushrooms" must be present, while "steak" only should be present. If "steak" is not present but "mushrooms" is present for a document, the document will still be returned as a hit.

	booleanQuery = new BooleanQuery();
	booleanQuery.add(query1, BooleanClause.Occur.MUST);
	booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
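The three combinations above can be checked by hand against the two sample files. The following is a rough plain-Java sketch, not Lucene code: the OccurSketch class and its matches() method are hypothetical, each file is reduced to a set of its food terms, and only the yes/no matching decision is modeled (no scoring).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of MUST / SHOULD / MUST_NOT matching; not part of Lucene.
class OccurSketch {

	// Food terms from the two sample files (header lines omitted).
	static final Set<String> DERON = new HashSet<String>(Arrays.asList(
			"hamburger", "french", "fries", "steak", "mushrooms", "artichokes"));
	static final Set<String> NICOLE = new HashSet<String>(Arrays.asList(
			"apples", "bananas", "salad", "mushrooms", "cheese"));

	static boolean matches(Set<String> docTerms, List<String> must,
			List<String> should, List<String> mustNot) {
		for (String term : must) {
			if (!docTerms.contains(term)) return false; // a required term is missing
		}
		for (String term : mustNot) {
			if (docTerms.contains(term)) return false; // a prohibited term is present
		}
		if (!must.isEmpty()) {
			return true; // SHOULD terms are optional next to a MUST term
		}
		for (String term : should) {
			if (docTerms.contains(term)) return true; // otherwise one SHOULD term is needed
		}
		return false;
	}

	public static void main(String[] args) {
		List<String> none = Collections.<String>emptyList();
		List<String> mushrooms = Arrays.asList("mushrooms");
		List<String> steak = Arrays.asList("steak");
		List<String> both = Arrays.asList("mushrooms", "steak");

		System.out.println("MUST + MUST:     deron=" + matches(DERON, both, none, none)
				+ " nicole=" + matches(NICOLE, both, none, none));
		System.out.println("MUST + MUST_NOT: deron=" + matches(DERON, mushrooms, none, steak)
				+ " nicole=" + matches(NICOLE, mushrooms, none, steak));
		System.out.println("MUST + SHOULD:   deron=" + matches(DERON, mushrooms, steak, none)
				+ " nicole=" + matches(NICOLE, mushrooms, steak, none));
	}
}
```

Under these assumptions, the first combination matches only Deron's file, the second only Nicole's file, and the third matches both, which lines up with the hit counts in the console output shown later in this tutorial.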

Here is the LuceneBooleanQueryDemo class.

LuceneBooleanQueryDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneBooleanQueryDemo {

	public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
	public static final String INDEX_DIRECTORY = "indexDirectory";

	public static final String FIELD_PATH = "path";
	public static final String FIELD_CONTENTS = "contents";

	public static void main(String[] args) throws Exception {

		createIndex();

		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		Query query1 = new TermQuery(new Term(FIELD_CONTENTS, "mushrooms"));
		Query query2 = new TermQuery(new Term(FIELD_CONTENTS, "steak"));

		BooleanQuery booleanQuery = new BooleanQuery();
		booleanQuery.add(query1, BooleanClause.Occur.MUST);
		booleanQuery.add(query2, BooleanClause.Occur.MUST);
		displayQuery(booleanQuery);
		Hits hits = indexSearcher.search(booleanQuery);
		displayHits(hits);

		booleanQuery = new BooleanQuery();
		booleanQuery.add(query1, BooleanClause.Occur.MUST);
		booleanQuery.add(query2, BooleanClause.Occur.MUST_NOT);
		displayQuery(booleanQuery);
		hits = indexSearcher.search(booleanQuery);
		displayHits(hits);

		booleanQuery = new BooleanQuery();
		booleanQuery.add(query1, BooleanClause.Occur.MUST);
		booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
		displayQuery(booleanQuery);
		hits = indexSearcher.search(booleanQuery);
		displayHits(hits);

		searchIndexWithQueryParser("+contents:mushrooms +contents:steak");
		searchIndexWithQueryParser("mushrooms steak");
		searchIndexWithQueryParser("bacon eggs");
		searchIndexWithQueryParser("(mushrooms steak) OR (bacon eggs)");
		searchIndexWithQueryParser("(mushrooms steak) AND (bacon eggs)");
		searchIndexWithQueryParser("(mush*ms OR raspberries) AND (ste?k)");

	}

	public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
		Analyzer analyzer = new StandardAnalyzer();
		boolean recreateIndexIfExists = true;
		IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
		File dir = new File(FILES_TO_INDEX_DIRECTORY);
		File[] files = dir.listFiles();
		for (File file : files) {
			Document document = new Document();

			String path = file.getCanonicalPath();
			document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

			Reader reader = new FileReader(file);
			document.add(new Field(FIELD_CONTENTS, reader));

			indexWriter.addDocument(document);
		}
		indexWriter.optimize();
		indexWriter.close();
	}

	public static void searchIndexWithQueryParser(String searchString) throws IOException, ParseException {
		System.out.println("Searching for '" + searchString + "' using QueryParser");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		QueryParser queryParser = new QueryParser(FIELD_CONTENTS, new StandardAnalyzer());
		Query query = queryParser.parse(searchString);
		System.out.println("Type of query: " + query.getClass().getSimpleName());
		displayQuery(query);
		Hits hits = indexSearcher.search(query);
		displayHits(hits);
	}

	public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
		System.out.println("Number of hits: " + hits.length());

		Iterator<Hit> it = hits.iterator();
		while (it.hasNext()) {
			Hit hit = it.next();
			Document document = hit.getDocument();
			String path = document.get(FIELD_PATH);
			System.out.println("Hit: " + path);
		}
		System.out.println();
	}

	public static void displayQuery(Query query) {
		System.out.println("Query: " + query.toString());
	}
}

After the three boolean queries in LuceneBooleanQueryDemo's main() method that use BooleanQuery directly, we can see that six more queries are performed using a QueryParser via the searchIndexWithQueryParser() method. These queries showcase some of the different text formats that can be used with QueryParser's parse() method to create boolean queries. Notice that we use parentheses to nest boolean queries in other boolean queries.

	searchIndexWithQueryParser("+contents:mushrooms +contents:steak");
	searchIndexWithQueryParser("mushrooms steak");
	searchIndexWithQueryParser("bacon eggs");
	searchIndexWithQueryParser("(mushrooms steak) OR (bacon eggs)");
	searchIndexWithQueryParser("(mushrooms steak) AND (bacon eggs)");
	searchIndexWithQueryParser("(mush*ms OR raspberries) AND (ste?k)");

The searchIndexWithQueryParser() method uses standard Lucene code to search the index.

	Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
	IndexSearcher indexSearcher = new IndexSearcher(directory);
	QueryParser queryParser = new QueryParser(FIELD_CONTENTS, new StandardAnalyzer());
	Query query = queryParser.parse(searchString);
	Hits hits = indexSearcher.search(query);

When working with Lucene queries, it can be useful to use the query object's toString() method to examine the query. The displayQuery() method displays the query using toString(). I find it particularly useful in conjunction with the query object returned from QueryParser's parse() method, since it allows us to verify that the query string that we pass to parse() actually generates the query that we expect.

	public static void displayQuery(Query query) {
		System.out.println("Query: " + query.toString());
	}

The console output from the execution of LuceneBooleanQueryDemo is shown here.

Console Output

Query: +contents:mushrooms +contents:steak
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: +contents:mushrooms -contents:steak
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

Query: +contents:mushrooms contents:steak
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

Searching for '+contents:mushrooms +contents:steak' using QueryParser
Type of query: BooleanQuery
Query: +contents:mushrooms +contents:steak
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'mushrooms steak' using QueryParser
Type of query: BooleanQuery
Query: contents:mushrooms contents:steak
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

Searching for 'bacon eggs' using QueryParser
Type of query: BooleanQuery
Query: contents:bacon contents:eggs
Number of hits: 0

Searching for '(mushrooms steak) OR (bacon eggs)' using QueryParser
Type of query: BooleanQuery
Query: (contents:mushrooms contents:steak) (contents:bacon contents:eggs)
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

Searching for '(mushrooms steak) AND (bacon eggs)' using QueryParser
Type of query: BooleanQuery
Query: +(contents:mushrooms contents:steak) +(contents:bacon contents:eggs)
Number of hits: 0

Searching for '(mush*ms OR raspberries) AND (ste?k)' using QueryParser
Type of query: BooleanQuery
Query: +(contents:mush*ms contents:raspberries) +contents:ste?k
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

From the toString() output, notice that a "+" prefix indicates that a clause is a MUST, a "-" prefix indicates that a clause is a MUST_NOT, and no prefix indicates that a clause is a SHOULD. We can also see that toString() shows the field that each term is searched in, which appears before the ":".
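As a small illustration of this notation, the sketch below rebuilds strings of the same shape from a list of terms and occur values. It is plain Java written for this tutorial rather than Lucene's own toString() implementation; the ClauseNotationSketch class, its Occur enum, and the prefix() and render() methods are all hypothetical names.

```java
// Hypothetical helper that mimics the clause notation seen in the console output.
class ClauseNotationSketch {

	enum Occur { MUST, SHOULD, MUST_NOT }

	// Maps an occur value to the prefix used in the query string.
	static String prefix(Occur occur) {
		switch (occur) {
		case MUST:
			return "+";
		case MUST_NOT:
			return "-";
		default: // SHOULD has no prefix
			return "";
		}
	}

	// Renders each clause as <prefix><field>:<term>, separated by spaces.
	static String render(String field, String[] terms, Occur[] occurs) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < terms.length; i++) {
			if (i > 0) {
				sb.append(' ');
			}
			sb.append(prefix(occurs[i])).append(field).append(':').append(terms[i]);
		}
		return sb.toString();
	}

	public static void main(String[] args) {
		String[] terms = { "mushrooms", "steak" };
		System.out.println(render("contents", terms, new Occur[] { Occur.MUST, Occur.MUST }));
		System.out.println(render("contents", terms, new Occur[] { Occur.MUST, Occur.MUST_NOT }));
		System.out.println(render("contents", terms, new Occur[] { Occur.MUST, Occur.SHOULD }));
	}
}
```

The three printed lines have the same form as the first three "Query:" lines in the console output.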

From the output of the first three queries, notice that the first query looks for documents that must contain "mushrooms" and must contain "steak", and it gets a hit from the deron-foods.txt document. The second query looks for documents that must contain "mushrooms" and must not contain "steak", and this gets a hit from the nicole-foods.txt document. The third query looks for documents that must contain "mushrooms" and should contain "steak". This gets hits from deron-foods.txt (which has both terms) and nicole-foods.txt (which has the 'must' term but not the 'should' term).

The last six queries utilize QueryParser. These queries are useful for demonstrating how QueryParser parses a variety of inputs. These examples include boolean queries added to other boolean queries, and even the use of wildcard terms in queries (in the case of the last query).
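To see why "mush*ms" and "ste?k" match the indexed terms, here is a minimal sketch of the wildcard rules, where "*" stands for any run of characters (including none) and "?" stands for exactly one character. It translates a wildcard into a java.util.regex pattern and tests a single term. This is only an illustration of the matching rules and not how Lucene evaluates wildcard queries against an index; the WildcardSketch class and its methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard matching against a single term.
class WildcardSketch {

	// Converts a wildcard expression into an equivalent regular expression.
	static Pattern toPattern(String wildcard) {
		StringBuilder regex = new StringBuilder();
		for (char c : wildcard.toCharArray()) {
			if (c == '*') {
				regex.append(".*"); // any run of characters, possibly empty
			} else if (c == '?') {
				regex.append('.'); // exactly one character
			} else {
				regex.append(Pattern.quote(String.valueOf(c))); // literal character
			}
		}
		return Pattern.compile(regex.toString());
	}

	// True when the whole term matches the wildcard expression.
	static boolean matches(String wildcard, String term) {
		return toPattern(wildcard).matcher(term).matches();
	}

	public static void main(String[] args) {
		System.out.println(matches("mush*ms", "mushrooms"));
		System.out.println(matches("ste?k", "steak"));
		System.out.println(matches("ste?k", "stek"));
	}
}
```

In the last query, deron-foods.txt is the only file with a term that fits "ste?k", which is why it is the only hit even though both files contain "mushrooms".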

As you can see from these examples, much of Lucene's querying power comes from the ability to formulate complex boolean queries from other queries, including other boolean queries.