How do I query for words near each other with a phrase query?
Author: Deron Eriksson
Description: This Java tutorial shows how to use Lucene's PhraseQuery and QueryParser classes to perform queries related to how close words are to each other.
Tutorial created using:
Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)
With Lucene, a PhraseQuery can be used to query for a sequence of terms, where the terms do not necessarily have to be next to each other or in order. The PhraseQuery object's setSlop() method can be used to set how many words can be between the various words in the query phrase. Here are the current javadocs (at the time of this writing) for PhraseQuery's setSlop() method:

Sets the number of other words permitted between words in query phrase. If zero, then this is an exact phrase search. For larger values this works like a WITHIN or NEAR operator.

The QueryParser class can be used to generate a PhraseQuery if the search string that is passed to its parse() method is formatted for it. This formatting involves putting double quotes around the words of the phrase query. The 'slop' can be set by following the closing double quote with a tilde (~) and then the slop number.

This tutorial uses a project with two directories of interest. The "filesToIndex" directory contains two text files that will be indexed, and the "indexDirectory" directory will contain a file system index that we will create based on the text files.

The first text file, deron-foods.txt, lists some foods that I like.

deron-foods.txt
Here are some foods that Deron likes:
hamburger french fries steak mushrooms artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt
Here are some foods that Nicole likes:
apples bananas salad mushrooms cheese

Here is the LucenePhraseQueryDemo class. It first creates an index from the text files mentioned above. Following this, it performs 5 phrase queries using the PhraseQuery class directly. After this, it performs another 4 queries using the QueryParser class.
LucenePhraseQueryDemo.java

package avajava;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LucenePhraseQueryDemo {

	public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
	public static final String INDEX_DIRECTORY = "indexDirectory";

	public static final String FIELD_PATH = "path";
	public static final String FIELD_CONTENTS = "contents";

	public static void main(String[] args) throws Exception {

		createIndex();

		// Five queries built directly with PhraseQuery
		searchIndexWithPhraseQuery("french", "fries", 0);
		searchIndexWithPhraseQuery("hamburger", "steak", 0);
		searchIndexWithPhraseQuery("hamburger", "steak", 1);
		searchIndexWithPhraseQuery("hamburger", "steak", 2);
		searchIndexWithPhraseQuery("hamburger", "steak", 3);

		// Four queries built by QueryParser from a search string
		searchIndexWithQueryParser("french fries"); // BooleanQuery
		searchIndexWithQueryParser("\"french fries\""); // PhraseQuery
		searchIndexWithQueryParser("\"hamburger steak\"~1"); // PhraseQuery
		searchIndexWithQueryParser("\"hamburger steak\"~2"); // PhraseQuery

	}

	// Indexes every file in filesToIndex, storing its path and tokenizing its contents
	public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
		Analyzer analyzer = new StandardAnalyzer();
		boolean recreateIndexIfExists = true;
		IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
		File dir = new File(FILES_TO_INDEX_DIRECTORY);
		File[] files = dir.listFiles();
		for (File file : files) {
			Document document = new Document();

			String path = file.getCanonicalPath();
			document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

			Reader reader = new FileReader(file);
			document.add(new Field(FIELD_CONTENTS, reader));

			indexWriter.addDocument(document);
		}
		indexWriter.optimize();
		indexWriter.close();
	}

	// Lets QueryParser decide the query type from the format of the search string
	public static void searchIndexWithQueryParser(String searchString) throws IOException, ParseException {
		System.out.println("Searching for '" + searchString + "' using QueryParser");
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		QueryParser queryParser = new QueryParser(FIELD_CONTENTS, new StandardAnalyzer());
		Query query = queryParser.parse(searchString);
		System.out.println("Type of query: " + query.getClass().getSimpleName());
		displayQuery(query);
		Hits hits = indexSearcher.search(query);
		displayHits(hits);
	}

	// Builds a two-term PhraseQuery on the "contents" field with the given slop
	public static void searchIndexWithPhraseQuery(String string1, String string2, int slop) throws IOException,
			ParseException {
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		Term term1 = new Term(FIELD_CONTENTS, string1);
		Term term2 = new Term(FIELD_CONTENTS, string2);
		PhraseQuery phraseQuery = new PhraseQuery();
		phraseQuery.add(term1);
		phraseQuery.add(term2);
		phraseQuery.setSlop(slop);

		displayQuery(phraseQuery);
		Hits hits = indexSearcher.search(phraseQuery);
		displayHits(hits);
	}

	public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
		System.out.println("Number of hits: " + hits.length());

		Iterator<Hit> it = hits.iterator();
		while (it.hasNext()) {
			Hit hit = it.next();
			Document document = hit.getDocument();
			String path = document.get(FIELD_PATH);
			System.out.println("Hit: " + path);
		}
		System.out.println();
	}

	public static void displayQuery(Query query) {
		System.out.println("Query: " + query.toString());
	}

}

Let's talk briefly about the searchIndexWithPhraseQuery() method. It takes two strings representing words in the document and a slop value. It constructs a PhraseQuery by adding two Term objects based on the "contents" field and the string1 and string2 parameters. Following that, it sets the slop value of the PhraseQuery object using the setSlop() method of PhraseQuery. The search is conducted by passing the PhraseQuery object to IndexSearcher's search() method.

	public static void searchIndexWithPhraseQuery(String string1, String string2, int slop) throws IOException,
			ParseException {
		Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
		IndexSearcher indexSearcher = new IndexSearcher(directory);

		Term term1 = new Term(FIELD_CONTENTS, string1);
		Term term2 = new Term(FIELD_CONTENTS, string2);
		PhraseQuery phraseQuery = new PhraseQuery();
		phraseQuery.add(term1);
		phraseQuery.add(term2);
		phraseQuery.setSlop(slop);

		displayQuery(phraseQuery);
		Hits hits = indexSearcher.search(phraseQuery);
		displayHits(hits);
	}

The console output from the execution of LucenePhraseQueryDemo is shown here.
Console Output

Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"
Number of hits: 0

Query: contents:"hamburger steak"~1
Number of hits: 0

Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Query: contents:"hamburger steak"~3
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for 'french fries' using QueryParser
Type of query: BooleanQuery
Query: contents:french contents:fries
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"french fries"' using QueryParser
Type of query: PhraseQuery
Query: contents:"french fries"
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Searching for '"hamburger steak"~1' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~1
Number of hits: 0

Searching for '"hamburger steak"~2' using QueryParser
Type of query: PhraseQuery
Query: contents:"hamburger steak"~2
Number of hits: 1
Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

Let's talk briefly about the console output. The first phrase query searches for "french" and "fries" with a slop of 0, meaning that the phrase search ends up being a search for "french fries", where "french" and "fries" are next to each other. Since this exists in deron-foods.txt, we get 1 hit.

In the second query, we search for "hamburger" and "steak" with a slop of 0. Since "hamburger" and "steak" don't exist next to each other in either document, we get 0 hits. The third query also involves a search for "hamburger" and "steak", but with a slop of 1. These words are not within 1 word of each other, so we get 0 hits.

The fourth query searches for "hamburger" and "steak" with a slop of 2. In the deron-foods.txt file, we have the words "... hamburger french fries steak ...". Since "hamburger" and "steak" are within two words of each other, we get 1 hit. The fifth phrase query is the same search but with a slop of 3. Since "hamburger" and "steak" are within three words of each other (only two words separate them), we get 1 hit.

The next four queries utilize QueryParser. Notice that in the first of the QueryParser queries, we get a BooleanQuery rather than a PhraseQuery. This is because we passed QueryParser's parse() method "french fries" rather than "\"french fries\"". If we want QueryParser to generate a PhraseQuery, the search string needs to be surrounded by double quotes. The next query does search for "\"french fries\"", and we can see that it generates a PhraseQuery (with the default slop of 0) and gets 1 hit in response to the query.

The last two QueryParser queries demonstrate setting slop values. We can see that the slop value can be set by following the closing double quote of the search string with a tilde (~) followed by the slop number.

As we have seen, phrase queries are a great way to produce queries that have a degree of leeway to them in terms of the proximity and ordering of the words to be searched. The total allowed spacing between words can be controlled using the setSlop() method of PhraseQuery.
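To make the slop arithmetic from the console output easier to follow, here is a minimal sketch in plain Java that does not use Lucene at all. It only models the simple case discussed in this tutorial: two terms that appear in query order, with slop read as the number of words allowed between them. The class name SlopSketch, its methods, and the crude tokenizer are all hypothetical names invented for this illustration. They are not Lucene APIs, and the sketch does not model reordered terms, stop words, or anything else the real analyzer and scorer do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: a simplified model of slop, not Lucene's implementation.
class SlopSketch {

    // Crude stand-in for an analyzer: lowercases and splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^a-z0-9]+"));
    }

    // Returns the smallest number of words found between 'first' and a later 'second',
    // or -1 if 'second' never appears after 'first'.
    static int minGap(List<String> tokens, String first, String second) {
        int best = -1;
        int lastFirst = -1; // position of the most recent occurrence of 'first'
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals(second) && lastFirst >= 0) {
                int gap = i - lastFirst - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
            if (token.equals(first)) {
                lastFirst = i;
            }
        }
        return best;
    }

    // In this simplified model, the pair matches when the gap does not exceed the slop.
    static boolean matches(List<String> tokens, String first, String second, int slop) {
        int gap = minGap(tokens, first, second);
        return gap >= 0 && gap <= slop;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
                "Here are some foods that Deron likes: hamburger french fries steak mushrooms artichokes");

        System.out.println(minGap(tokens, "french", "fries"));        // prints 0
        System.out.println(minGap(tokens, "hamburger", "steak"));     // prints 2
        System.out.println(matches(tokens, "hamburger", "steak", 1)); // prints false
        System.out.println(matches(tokens, "hamburger", "steak", 2)); // prints true
    }
}
```

The results line up with the hits reported above for deron-foods.txt: a slop of 0 is enough for "french fries", while "hamburger steak" needs a slop of at least 2.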
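The search-string format that QueryParser expects for a phrase query (double quotes around the words, optionally followed by a tilde and the slop number) can be assembled with a small helper. The sketch below is plain Java with no Lucene dependency. The class PhraseQueryStrings and the method phraseWithSlop() are hypothetical names made up for this example, and the helper simply rejects phrases containing a double quote rather than attempting to escape them.

```java
// Hypothetical helper for illustration; it only builds the search string and does not call Lucene.
class PhraseQueryStrings {

    // Wraps the phrase in double quotes and appends ~slop when slop is greater than 0.
    static String phraseWithSlop(String phrase, int slop) {
        if (slop < 0) {
            throw new IllegalArgumentException("slop must be 0 or greater");
        }
        if (phrase.indexOf('"') >= 0) {
            // Assumption for this sketch: embedded quotes are not supported.
            throw new IllegalArgumentException("phrase must not contain double quotes");
        }
        String quoted = "\"" + phrase + "\"";
        return slop == 0 ? quoted : quoted + "~" + slop;
    }

    public static void main(String[] args) {
        System.out.println(phraseWithSlop("french fries", 0));    // prints "french fries"
        System.out.println(phraseWithSlop("hamburger steak", 2)); // prints "hamburger steak"~2
    }
}
```

A string produced this way could then be passed to a method such as searchIndexWithQueryParser() from the demo class, for example searchIndexWithQueryParser(PhraseQueryStrings.phraseWithSlop("hamburger steak", 2)).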