How do I control how much of a document is indexed?
Author: Deron Eriksson
Description: This Java tutorial shows how to control how much of a document is indexed in a Lucene index using IndexWriter's setMaxFieldLength() method.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)


When using an IndexWriter to write documents to an index, you can control how much of a document's content gets indexed via the setMaxFieldLength() method of IndexWriter. According to the javadocs for IndexWriter, maxFieldLength is:

The maximum number of terms that will be indexed for a single field in a document.

So, setMaxFieldLength() basically sets the cutoff point for the number of words in a document that should be indexed. The maxFieldLength value applies to document content added to the index after maxFieldLength has been set, so it does not affect document content already in the index. By default, maxFieldLength is set to 10,000, meaning that only the first 10,000 words of document content will be indexed; if a document has more than 10,000 words, the extra words will not be indexed.

public class IndexWriter {
...
	public final static int DEFAULT_MAX_FIELD_LENGTH = 10000;
...
	private int maxFieldLength = DEFAULT_MAX_FIELD_LENGTH;
...
}

If you'd like extremely large documents to be indexed in their entirety, you can set maxFieldLength to Integer.MAX_VALUE.
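
As a quick illustration, here is a minimal sketch of what that call might look like using the same Lucene 2.x-style API as the demo code later in this tutorial (the class name and the "indexDirectory" path are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class UnlimitedFieldLengthSketch {

	public static void main(String[] args) throws Exception {
		// Lucene 2.x-style constructor: index directory path, analyzer, recreate-if-exists flag
		IndexWriter indexWriter = new IndexWriter("indexDirectory", new StandardAnalyzer(), true);

		// Index every term of every field, no matter how large the documents are
		indexWriter.setMaxFieldLength(Integer.MAX_VALUE);

		// ... add documents here ...

		indexWriter.close();
	}

}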

Now, let's look at a setMaxFieldLength() example. We'll use a project with the following structure. The LuceneMaxFieldLengthDemo class creates indexes and lets us search those indexes. It indexes files in the "filesToIndex" directory. It writes the index files to the "indexDirectory" directory.

project structure

Two text files in the "filesToIndex" directory will be indexed. The first one, deron-foods.txt, lists some foods that I like.

deron-foods.txt

Here are some foods that Deron likes:
hamburger
french fries
steak
mushrooms
artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt

Here are some foods that Nicole likes:
apples
bananas
salad
mushrooms
cheese

One thing we should also mention is "stop words", which are common words that a Lucene analyzer ignores. The StandardAnalyzer class references the StopAnalyzer class's ENGLISH_STOP_WORDS, which are shown here:

  public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };

These stop words will be ignored and thus will not be indexed. The stop words do not count towards the maxFieldLength limit that we set with setMaxFieldLength().
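
If you'd like to see exactly which terms the analyzer produces, and therefore which words count towards maxFieldLength, you can run a line of text through the analyzer's token stream. Here is a rough sketch using the old Lucene 2.x TokenStream.next()/Token.termText() API from the same era as the demo code below (the class name is just a placeholder):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenCountSketch {

	public static void main(String[] args) throws Exception {
		Analyzer analyzer = new StandardAnalyzer();
		TokenStream stream = analyzer.tokenStream("contents",
				new StringReader("Here are some foods that Deron likes:"));

		int count = 0;
		Token token;
		while ((token = stream.next()) != null) {
			// Stop words such as "are" and "that" are filtered out and never appear here
			System.out.println(token.termText());
			count++;
		}
		System.out.println("Terms counted towards maxFieldLength: " + count);
	}

}

For the first line of deron-foods.txt, this should print five terms (here, some, foods, deron, likes), which lines up with the word numbering we'll do at the end of this tutorial.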

Now, let's examine LuceneMaxFieldLengthDemo.java, shown below. Most of the index creation and searching functionality is described in another tutorial, so I won't cover that here. The LuceneMaxFieldLengthDemo class features a few slight modifications to the LuceneDemo class of the other tutorial.

LuceneMaxFieldLengthDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneMaxFieldLengthDemo {

	public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
	public static final String INDEX_DIRECTORY = "indexDirectory";

	public static final String FIELD_PATH = "path";
	public static final String FIELD_CONTENTS = "contents";

	public static void main(String[] args) throws Exception {
		createIndex(8);
		doSearches();
		createIndex(9);
		doSearches();
		createIndex(10);
		doSearches();
	}

	public static void doSearches() throws IOException, ParseException {
		searchIndex("mushrooms");
		searchIndex("steak");
	}

	public static void createIndex(int maxFieldLength) throws CorruptIndexException, LockObtainFailedException,
			IOException {

		Analyzer analyzer = new StandardAnalyzer();
		boolean recreateIndexIfExists = true;
		IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);

		System.out.println("\nmaxFieldLength:" + maxFieldLength);
		// Only this many terms per field will be indexed for documents added from here on
		indexWriter.setMaxFieldLength(maxFieldLength);

		File dir = new File(FILES_TO_INDEX_DIRECTORY);
		File[] files = dir.listFiles();
		for (File file : files) {
			Document document = new Document();

			// Store the file path so it can be displayed with the search results
			String path = file.getCanonicalPath();
			document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

			// Index (but do not store) the file contents
			Reader reader = new FileReader(file);
			document.add(new Field(FIELD_CONTENTS, reader));

			indexWriter.addDocument(document);
		}
		indexWriter.optimize();
		indexWriter.close();
	}

	public static void searchIndex(String searchString) throws IOException, ParseException {
		System.out.println("Searching for '" + searchString + "'");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexReader indexReader = IndexReader.open(directory);
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);

		Analyzer analyzer = new StandardAnalyzer();
		QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
		Query query = queryParser.parse(searchString);
		Hits hits = indexSearcher.search(query);
		System.out.println("Number of hits: " + hits.length());

		Iterator<Hit> it = hits.iterator();
		while (it.hasNext()) {
			Hit hit = it.next();
			Document document = hit.getDocument();
			String path = document.get(FIELD_PATH);
			System.out.println("Hit: " + path);
		}

	}

}

If we look at the main() method of LuceneMaxFieldLengthDemo, we can see that it first creates an index with a maxFieldLength value of 8 and then performs searches for "mushrooms" and "steak". Next, a new index is created with a maxFieldLength of 9, and the same searches are performed. Finally, a new index is created with a maxFieldLength of 10, and the same searches are executed.

Notice that the maxFieldLength is set in the createIndex() method via the call to indexWriter.setMaxFieldLength(maxFieldLength).

Let's look at the console output from the execution of LuceneMaxFieldLengthDemo.

Console Output


maxFieldLength:8
Searching for 'mushrooms'
Number of hits: 0
Searching for 'steak'
Number of hits: 0

maxFieldLength:9
Searching for 'mushrooms'
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt
Searching for 'steak'
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

maxFieldLength:10
Searching for 'mushrooms'
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt
Searching for 'steak'
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

In the console output, we can see that when maxFieldLength was set to 8, our searches returned no hits for "mushrooms" or "steak". When maxFieldLength was set to 9, we had one hit for "mushrooms" in nicole-foods.txt and one hit for "steak" in deron-foods.txt. When maxFieldLength was set to 10, we had two hits for "mushrooms" (in deron-foods.txt and nicole-foods.txt) and one hit for "steak" in deron-foods.txt.

Let's number the words in deron-foods.txt and nicole-foods.txt to see if these results make sense. Below, I have numbered the words in each of these documents. Since stop words don't count towards maxFieldLength, I have placed asterisks (*) next to them instead of numbers.

deron-foods.txt with words numbered

Here(1) are(*) some(2) foods(3) that(*) Deron(4) likes:(5) hamburger(6) french(7) fries(8) steak(9) mushrooms(10) artichokes(11)

nicole-foods.txt with words numbered

Here(1) are(*) some(2) foods(3) that(*) Nicole(4) likes:(5) apples(6) bananas(7) salad(8) mushrooms(9) cheese(10)

If we compare the console output with the numbered words, we can see that the results indeed make sense given the maxFieldLength values that we set. When maxFieldLength was 8, "fries" was the last word to be indexed in deron-foods.txt and "salad" was the last word to be indexed in nicole-foods.txt, so the searches for "mushrooms" and "steak" didn't yield any results. When maxFieldLength was 9, "steak" was the last word to be indexed in deron-foods.txt and "mushrooms" was the last word to be indexed in nicole-foods.txt, so it makes sense that the search for "steak" yielded one hit from deron-foods.txt and the search for "mushrooms" returned one hit from nicole-foods.txt. When maxFieldLength was 10, "mushrooms" was the last word to be indexed in deron-foods.txt and "cheese" was the last word to be indexed in nicole-foods.txt. As a result, the search for "mushrooms" returned two hits (from deron-foods.txt and nicole-foods.txt), and the search for "steak" returned one hit from deron-foods.txt.

This tutorial has been a short introduction to IndexWriter's setMaxFieldLength() method, showing how it can be used to limit the amount of document content that gets indexed by Lucene.