How do I optimize a Lucene index after deleting documents from the index?
Author: Deron Eriksson
Description: This Java tutorial shows how to optimize a Lucene index after deletions have been performed on the index.
Tutorial created using:
Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)
In a previous tutorial, we saw how to delete documents from a Lucene index via the deleteDocuments() method of an IndexReader object. After performing deletions, we can optimize the index via a call to an IndexWriter's optimize() method, which according to its javadocs does the following:

"Requests an "optimize" operation on an index, priming the index for the fastest available search. Traditionally this has meant merging all segments into a single segment as is done in the default merge policy, but individual merge policies may implement optimize in different ways."

We'll use the same project structure from the previous tutorial, but with a different class, LuceneDeleteOptimizeDemo.

If we examine the main() method of LuceneDeleteOptimizeDemo, we can see that it performs three steps:

1. It creates an index (based on the files in the filesToIndex directory) using the createIndex() method.
2. It deletes the document based on the nicole-foods.txt file from the index via the deleteDocumentsFromIndexUsingTerm() method.
3. It calls its optimize() method to optimize the index.

After each of these steps, it calls its status() method to display the values returned by the hasDeletions(), isOptimized(), maxDoc(), and numDocs() methods of an IndexReader opened on the index. Note that the searchIndex() method of LuceneDeleteOptimizeDemo is not used in this example, but it can be used to search the index based on the contents of the indexed files.
LuceneDeleteOptimizeDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDeleteOptimizeDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";

    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        // Step 1: build the index and report its state.
        createIndex();
        System.out.println("Status after index created");
        status();

        // Step 2: delete the document whose path field matches this term exactly.
        Term term = new Term(FIELD_PATH, "C:\\projects\\workspace\\demo\\filesToIndex\\nicole-foods.txt");
        deleteDocumentsFromIndexUsingTerm(term);
        System.out.println("Status after deletion");
        status();

        // Step 3: optimize the index and report its state again.
        optimize();
        System.out.println("Status after optimization");
        status();
    }

    public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        Analyzer analyzer = new StandardAnalyzer();
        boolean recreateIndexIfExists = true;
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();

            // The path is stored and indexed as a single untokenized value,
            // so it can later be matched exactly by a Term.
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

            // The file contents are indexed from a Reader.
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));

            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    // Not called from main(); available for searching the contents field.
    public static void searchIndex(String searchString) throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Analyzer analyzer = new StandardAnalyzer();
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());

        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }

    public static void deleteDocumentsFromIndexUsingTerm(Term term) throws IOException, ParseException {
        System.out.println("Deleting documents with field '" + term.field() + "' with text '" + term.text() + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        indexReader.deleteDocuments(term);
        indexReader.close();
    }

    // Prints the deletion and optimization state of the index on one line.
    public static void status() throws IOException {
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        System.out.print("Has deletions:" + indexReader.hasDeletions());
        System.out.print(", Is optimized:" + indexReader.isOptimized());
        System.out.print(", maxDoc:" + indexReader.maxDoc());
        System.out.println(", numDocs:" + indexReader.numDocs());
        indexReader.close();
    }

    // Opens a writer on the existing index and optimizes it.
    public static void optimize() throws CorruptIndexException, LockObtainFailedException, IOException {
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, new StandardAnalyzer());
        indexWriter.optimize();
        indexWriter.close();
    }
}

The console output of executing LuceneDeleteOptimizeDemo is shown here:

Console Output

Status after index created
Has deletions:false, Is optimized:true, maxDoc:2, numDocs:2
Deleting documents with field 'path' with text 'C:\projects\workspace\demo\filesToIndex\nicole-foods.txt'
Status after deletion
Has deletions:true, Is optimized:false, maxDoc:2, numDocs:1
Status after optimization
Has deletions:false, Is optimized:true, maxDoc:1, numDocs:1

The createIndex() method optimizes the index as soon as it has been built. In the call to status() after index creation, we see the following:

Has deletions:false, Is optimized:true, maxDoc:2, numDocs:2

As indicated by the output, the index does not contain any deletions and it has been optimized. The maxDoc() value is 2, meaning that 2 is the document number that the next document would receive if one were added. The numDocs() value is 2, meaning that the index currently contains 2 documents.

After deleting a document, we see the following:

Has deletions:true, Is optimized:false, maxDoc:2, numDocs:1

After the deletion, the index has deletions and is no longer optimized. Notice that maxDoc() still returns 2 but numDocs() returns 1. The index now contains only 1 document, but the next document number is still 2 (rather than 1) because the index hasn't been optimized.

After optimizing the index, we obtain the following status:

Has deletions:false, Is optimized:true, maxDoc:1, numDocs:1

The index no longer contains deletions, and it has been optimized. Both maxDoc() and numDocs() now return 1: the next available document number is 1 (from maxDoc()), and the index contains 1 document (from numDocs()).
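The relationship between maxDoc() and numDocs() described above can be mimicked with a small toy model in plain Java. This is only a sketch and not Lucene code: it makes no claim about how Lucene stores documents internally. The class ToyIndex, its methods, and the file name other-file.txt are hypothetical names introduced for illustration. The model assumes a single list of slots in which a deletion only marks a slot, and optimization rewrites the list without the marked slots.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy stand-in for an index (hypothetical, not part of Lucene). */
public class ToyIndex {
    private List<String> docs = new ArrayList<String>(); // slot i holds document number i
    private BitSet deleted = new BitSet();               // bit i set = slot i is marked deleted

    /** Appends a document and returns the document number it received. */
    public int addDocument(String path) {
        docs.add(path);
        return docs.size() - 1;
    }

    /** Marks every live document with the given path as deleted; returns how many were marked. */
    public int deleteDocuments(String path) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).equals(path)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    /** Rewrites the slots without the marked entries, renumbering the survivors. */
    public void optimize() {
        List<String> live = new ArrayList<String>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted = new BitSet();
    }

    public boolean hasDeletions() { return !deleted.isEmpty(); }

    /** Number of slots, including slots marked deleted. */
    public int maxDoc() { return docs.size(); }

    /** Number of slots that are not marked deleted. */
    public int numDocs() { return docs.size() - deleted.cardinality(); }

    public String status() {
        return "Has deletions:" + hasDeletions() + ", maxDoc:" + maxDoc() + ", numDocs:" + numDocs();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("other-file.txt");   // hypothetical second file
        index.addDocument("nicole-foods.txt");
        System.out.println(index.status());

        index.deleteDocuments("nicole-foods.txt");
        System.out.println(index.status());

        index.optimize();
        System.out.println(index.status());
    }
}
```

Running main() prints three status lines whose maxDoc and numDocs values follow the same 2/2, 2/1, 1/1 progression as the console output above. In this model the deleted slot keeps occupying a document number until optimize() rewrites the list.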
In this short tutorial, we've seen how we can use an IndexWriter's optimize() method to optimize an index following document deletion from the index via an IndexReader's deleteDocuments() method.