How do I use Lucene to index and search text files?
Author: Deron Eriksson
Description: This Java tutorial shows how to use Lucene to create an index based on text files in a directory and search that index.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)



(Continued from page 1)

We perform several searches using the following queries: "mushrooms", "steak", "steak AND cheese", "steak and cheese", and "bacon OR cheese". The console output from executing LuceneDemo is the following:

Console Output

Searching for 'mushrooms'
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Searching for 'steak'
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Searching for 'steak AND cheese'
Number of hits: 0
Searching for 'steak and cheese'
Number of hits: 2
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt
Searching for 'bacon OR cheese'
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

According to the search results, both files list "mushrooms" but only mine (deron-foods.txt) lists "steak". Additionally, notice the use of "AND" versus "and". The uppercase "AND" results in a boolean query, which requires "steak" and "cheese" to both be in a document in order for it to be a hit. No single file contains both, so there are 0 hits. The lowercase "and" is treated as a stop word and dropped, so "steak and cheese" is like searching for "steak cheese". Since the query parser combines terms with OR by default, that query returns 2 hits: "steak" is found in one text file and "cheese" is found in the other.
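The difference between requiring all terms and requiring any term can be sketched in plain Java. This is not Lucene code; it is a minimal model of the matching rule only. The class name, the `search` helper, and the word lists standing in for the two files are assumptions made for illustration (the real file contents appear earlier in the tutorial).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BooleanQuerySketch {

    // Hypothetical stand-ins for the two indexed files (assumed contents).
    static final Map<String, List<String>> DOCS = new LinkedHashMap<String, List<String>>();
    static {
        DOCS.put("nicole-foods.txt", Arrays.asList("mushrooms", "cheese", "bacon"));
        DOCS.put("deron-foods.txt", Arrays.asList("mushrooms", "steak"));
    }

    // requireAll = true models "a AND b"; false models "a OR b" (or "a b").
    static List<String> search(boolean requireAll, String... terms) {
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, List<String>> doc : DOCS.entrySet()) {
            int matched = 0;
            for (String term : terms) {
                if (doc.getValue().contains(term)) {
                    matched++;
                }
            }
            boolean isHit = requireAll ? matched == terms.length : matched > 0;
            if (isHit) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("steak AND cheese: " + search(true, "steak", "cheese"));
        System.out.println("steak cheese:     " + search(false, "steak", "cheese"));
        System.out.println("bacon OR cheese:  " + search(false, "bacon", "cheese"));
    }
}
```

With these assumed word lists, the AND case matches no file, while the OR case matches both, which mirrors the hit counts in the console output.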

In case you're interested, an examination of the StandardAnalyzer's STOP_WORDS shows that it uses the StopAnalyzer class's ENGLISH_STOP_WORDS. As you can see below, "and" is listed as one of the stop words, which explains why it was ignored in the "steak and cheese" query.

  public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };
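To make the effect of the stop list concrete, here is a small plain-Java sketch that lowercases a query, splits it on whitespace, and drops any token found in the list above. It only approximates the analysis step and does not call Lucene; the class and method names are hypothetical. It also does not model the query parser, which recognizes uppercase operators such as AND and OR before the text reaches the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    // Same words as the ENGLISH_STOP_WORDS array shown above.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Lowercases the text, splits on whitespace, and keeps non-stop-word tokens.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("steak and cheese"));
    }
}
```

Running `main` shows that only "steak" and "cheese" survive, so the remaining terms are what actually get searched.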

If we examine our project after LuceneDemo has executed, we can see that index files have been created in the "indexDirectory" directory.

[Screenshot: index files created in the indexDirectory directory]

Linking to the Lucene javadocs (as shown in the project build path) can be extremely useful when trying to figure out how to use Lucene, since the javadocs are very well written.

[Screenshot: Lucene javadocs in Eclipse for IndexWriter]

In addition, I find it very useful to link to the Lucene source code, since you can do things such as open a declaration, as shown here for StandardAnalyzer.

[Screenshot: Open Declaration for StandardAnalyzer]

Here, Open Declaration has taken us to the StandardAnalyzer constructor, which is well documented.

[Screenshot: StandardAnalyzer source code]

Hopefully, this quick introduction gives you a working example that you can use to start becoming familiar with Lucene.
