How do I perform a wildcard query?
Author: Deron Eriksson
Description: This Java tutorial shows how to perform Lucene index wildcard queries using QueryParser and WildcardQuery.
Tutorial created using:
Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)
With Lucene, you can perform wildcard searches on an index using a WildcardQuery directly or using a QueryParser, which can generate a WildcardQuery. In a wildcard query, an asterisk (*) represents zero or more characters and a question mark (?) represents exactly one character. By default, QueryParser does not let you begin a query with a wildcard, and in general it's a bad idea to do so with a WildcardQuery either, since every term in the searched field would need to be examined in order for the query to be performed.

This tutorial uses a project with two directories. The "filesToIndex" directory contains two text files that will be indexed, and the "indexDirectory" will contain a file system index that we will create based on the text files.

The first text file, deron-foods.txt, lists some foods that I like.

deron-foods.txt

    Here are some foods that Deron likes:
    hamburger
    french fries
    steak
    mushrooms
    artichokes

The second text file, nicole-foods.txt, lists some foods that Nicole likes.

nicole-foods.txt

    Here are some foods that Nicole likes:
    apples
    bananas
    salad
    mushrooms
    cheese

Now we'll look at the LuceneWildcardQueryDemo class. This class creates an index via the createIndex() method based on the text files mentioned above, and then it attempts to perform eight wildcard searches against this index. Four of the searches are performed using the WildcardQuery class, and the other four are performed using the QueryParser class.
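To make the wildcard rules described above concrete before looking at the Lucene code, here is a minimal sketch that mimics them with java.util.regex. It is only an illustration of the semantics, not how Lucene matches terms internally, and the class and method names (WildcardSketch, toRegex, matches) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustration only: imitates wildcard semantics with regular expressions, not Lucene's matcher. */
final class WildcardSketch {

    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // asterisk: zero or more characters
            } else if (c == '?') {
                regex.append('.');    // question mark: exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // any other character is literal
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole term matches the wildcard pattern (case-sensitive). */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        // Hypothetical terms, lowercased the way an analyzer might store them.
        List<String> terms = Arrays.asList("deron", "nicole", "dn");
        String[] patterns = {"d*n", "der?n", "nic*", "*eron", "D*n"};
        for (String pattern : patterns) {
            for (String term : terms) {
                if (matches(pattern, term)) {
                    System.out.println(pattern + " matches " + term);
                }
            }
        }
    }
}
```

In this sketch "D*n" does not match "deron" because the comparison is case-sensitive. The same consideration applies to the demo below, where the search strings are lowercase and the indexed terms were lowercased by StandardAnalyzer.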
The WildcardQuery queries are performed in the searchIndexWithWildcardQuery() method, using the following code:

    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexSearcher indexSearcher = new IndexSearcher(directory);
    Term term = new Term(whichField, searchString);
    Query query = new WildcardQuery(term);
    Hits hits = indexSearcher.search(query);

The QueryParser queries are performed in the searchIndexWithQueryParser() method via:

    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    IndexSearcher indexSearcher = new IndexSearcher(directory);
    QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer());
    Query query = queryParser.parse(searchString);
    Hits hits = indexSearcher.search(query);

LuceneWildcardQueryDemo.java

    package avajava;

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.Iterator;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.LockObtainFailedException;

    public class LuceneWildcardQueryDemo {

        public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
        public static final String INDEX_DIRECTORY = "indexDirectory";

        public static final String FIELD_PATH = "path";
        public static final String FIELD_CONTENTS = "contents";

        public static void main(String[] args) throws Exception {

            createIndex();

            // Each pattern is run twice: once as a WildcardQuery, once through QueryParser.
            searchIndexWithWildcardQuery(FIELD_CONTENTS, "d*n");
            searchIndexWithQueryParser(FIELD_CONTENTS, "d*n");
            searchIndexWithWildcardQuery(FIELD_CONTENTS, "nic*");
            searchIndexWithQueryParser(FIELD_CONTENTS, "nic*");
            searchIndexWithWildcardQuery(FIELD_CONTENTS, "der?n");
            searchIndexWithQueryParser(FIELD_CONTENTS, "der?n");
            searchIndexWithWildcardQuery(FIELD_CONTENTS, "*eron");
            try {
                // QueryParser rejects a leading wildcard, so catch the exception.
                searchIndexWithQueryParser(FIELD_CONTENTS, "*eron");
            } catch (ParseException pe) {
                pe.printStackTrace();
            }
        }

        public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
            Analyzer analyzer = new StandardAnalyzer();
            boolean recreateIndexIfExists = true;
            IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer, recreateIndexIfExists);
            File dir = new File(FILES_TO_INDEX_DIRECTORY);
            File[] files = dir.listFiles();
            for (File file : files) {
                Document document = new Document();

                // Store the path as-is so it can be displayed for each hit.
                String path = file.getCanonicalPath();
                document.add(new Field(FIELD_PATH, path, Field.Store.YES, Field.Index.UN_TOKENIZED));

                // The file contents are analyzed and indexed.
                Reader reader = new FileReader(file);
                document.add(new Field(FIELD_CONTENTS, reader));

                indexWriter.addDocument(document);
            }
            indexWriter.optimize();
            indexWriter.close();
        }

        public static void searchIndexWithQueryParser(String whichField, String searchString) throws IOException,
                ParseException {
            System.out.println("\nSearching for '" + searchString + "' using QueryParser");
            Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
            IndexSearcher indexSearcher = new IndexSearcher(directory);

            QueryParser queryParser = new QueryParser(whichField, new StandardAnalyzer());
            Query query = queryParser.parse(searchString);
            // Show which Query subclass the parser produced.
            System.out.println("Type of query: " + query.getClass().getSimpleName());
            Hits hits = indexSearcher.search(query);
            displayHits(hits);
        }

        public static void searchIndexWithWildcardQuery(String whichField, String searchString) throws IOException,
                ParseException {
            System.out.println("\nSearching for '" + searchString + "' using WildcardQuery");
            Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
            IndexSearcher indexSearcher = new IndexSearcher(directory);

            Term term = new Term(whichField, searchString);
            Query query = new WildcardQuery(term);
            Hits hits = indexSearcher.search(query);
            displayHits(hits);
        }

        public static void displayHits(Hits hits) throws CorruptIndexException, IOException {
            System.out.println("Number of hits: " + hits.length());

            Iterator<Hit> it = hits.iterator();
            while (it.hasNext()) {
                Hit hit = it.next();
                Document document = hit.getDocument();
                String path = document.get(FIELD_PATH);
                System.out.println("Hit: " + path);
            }
        }

    }

The LuceneWildcardQueryDemo class performs eight wildcard searches, as we can see from its main() method:

    searchIndexWithWildcardQuery(FIELD_CONTENTS, "d*n");
    searchIndexWithQueryParser(FIELD_CONTENTS, "d*n");
    searchIndexWithWildcardQuery(FIELD_CONTENTS, "nic*");
    searchIndexWithQueryParser(FIELD_CONTENTS, "nic*");
    searchIndexWithWildcardQuery(FIELD_CONTENTS, "der?n");
    searchIndexWithQueryParser(FIELD_CONTENTS, "der?n");
    searchIndexWithWildcardQuery(FIELD_CONTENTS, "*eron");
    try {
        searchIndexWithQueryParser(FIELD_CONTENTS, "*eron");
    } catch (ParseException pe) {
        pe.printStackTrace();
    }

Let's look at the console output from executing LuceneWildcardQueryDemo to see the results of the queries.
Console Output

    Searching for 'd*n' using WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

    Searching for 'd*n' using QueryParser
    Type of query: WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

    Searching for 'nic*' using WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

    Searching for 'nic*' using QueryParser
    Type of query: PrefixQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\nicole-foods.txt

    Searching for 'der?n' using WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

    Searching for 'der?n' using QueryParser
    Type of query: WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

    Searching for '*eron' using WildcardQuery
    Number of hits: 1
    Hit: C:\projects\workspace\demo\filesToIndex\deron-foods.txt

    Searching for '*eron' using QueryParser
    org.apache.lucene.queryParser.ParseException: Cannot parse '*eron': '*' or '?' not allowed as first character in WildcardQuery
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:152)
        at avajava.LuceneWildcardQueryDemo.searchIndexWithQueryParser(LuceneWildcardQueryDemo.java:81)
        at avajava.LuceneWildcardQueryDemo.main(LuceneWildcardQueryDemo.java:46)

The first query uses a WildcardQuery object with "d*n". The search string starts with a lowercase "d" because the StandardAnalyzer lowercased "Deron" to "deron" when the index was created, and the term in a WildcardQuery is compared against the index as-is. Since "d*n" matches "deron" in the index, this query returns 1 hit.

The second query uses a QueryParser to query "d*n". The QueryParser parse() method returns a WildcardQuery, and the query returns 1 hit, since it's basically identical to the first query.

The third query uses a WildcardQuery object with "nic*". This wildcard query gets one hit, since "nic*" matches "nicole".

The fourth query uses QueryParser with "nic*". Notice, however, that QueryParser's parse() method returns a PrefixQuery rather than a WildcardQuery, since the only wildcard is the asterisk at the end of "nic*". This PrefixQuery for "nic*" gets one hit, since "nic*" matches "nicole".

The fifth query is a WildcardQuery that uses a question mark in its search word, "der?n". The question mark matches exactly one character. Since "der?n" matches "deron", the search returns 1 hit.

The sixth query uses a QueryParser with "der?n". The QueryParser parse() method returns a WildcardQuery, and it gets one hit, just like the fifth query.

The seventh query is a WildcardQuery for "*eron". It receives one match, since "*eron" matches "deron". In general, it's not a good idea to perform queries where the first character is a wildcard.

The eighth query is a QueryParser query for "*eron". Notice that the QueryParser object does not even allow us to perform a query where the first character is an asterisk. It throws a ParseException.

In this tutorial, we've seen some examples of wildcard queries with QueryParser and also with WildcardQuery directly.
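The query types that QueryParser reported in the console output can be summarized in a small decision sketch. This is a simplified model based only on the four results seen in this tutorial, not the parser's actual source code. The class and method names (QueryTypeSketch, queryTypeFor) are invented here, and the "TermQuery" result for a word with no wildcards is an assumption that the demo does not exercise.

```java
/** Simplified model of the query types observed in this tutorial; not Lucene's real parser logic. */
final class QueryTypeSketch {

    /** Returns the name of the query class we would expect for a single search word. */
    static String queryTypeFor(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("empty search word");
        }
        char first = word.charAt(0);
        if (first == '*' || first == '?') {
            // Mirrors the ParseException message seen for "*eron".
            throw new IllegalArgumentException(
                    "'*' or '?' not allowed as first character in WildcardQuery");
        }
        int firstAsterisk = word.indexOf('*');
        boolean hasQuestionMark = word.indexOf('?') >= 0;
        if (firstAsterisk < 0 && !hasQuestionMark) {
            return "TermQuery";      // assumption: no wildcards at all
        }
        if (!hasQuestionMark && firstAsterisk == word.length() - 1) {
            return "PrefixQuery";    // single asterisk at the very end, e.g. "nic*"
        }
        return "WildcardQuery";      // any other use of * or ?, e.g. "d*n" or "der?n"
    }

    public static void main(String[] args) {
        for (String word : new String[] {"d*n", "nic*", "der?n"}) {
            System.out.println(word + " -> " + queryTypeFor(word));
        }
        try {
            queryTypeFor("*eron");
        } catch (IllegalArgumentException e) {
            System.out.println("*eron -> rejected: " + e.getMessage());
        }
    }
}
```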
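As a closing illustration of why a leading wildcard is discouraged, the sketch below models an index as a sorted set of terms. When a pattern has literal characters before its first wildcard, only the terms sharing that prefix need to be considered. With a leading wildcard there is no prefix, so every term is a candidate. This is a rough model under that assumption rather than Lucene's implementation, and the term list and names (LeadingWildcardCostSketch, literalPrefix, candidateTerms) are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Rough model of wildcard cost over a sorted term list; not Lucene's actual term enumeration. */
final class LeadingWildcardCostSketch {

    /** Returns the literal characters that precede the first wildcard in the pattern. */
    static String literalPrefix(String wildcard) {
        int i = 0;
        while (i < wildcard.length() && wildcard.charAt(i) != '*' && wildcard.charAt(i) != '?') {
            i++;
        }
        return wildcard.substring(0, i);
    }

    /** Counts the terms that share the pattern's literal prefix and so must be tested. */
    static int candidateTerms(TreeSet<String> sortedTerms, String wildcard) {
        String prefix = literalPrefix(wildcard);
        int candidates = 0;
        // Jump straight to the first term >= prefix, then stop once the prefix no longer matches.
        for (String term : sortedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            candidates++;
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Hypothetical lowercased terms resembling the words in the two sample files.
        TreeSet<String> terms = new TreeSet<String>(Arrays.asList(
                "apples", "artichokes", "bananas", "cheese", "deron", "foods", "french",
                "fries", "hamburger", "likes", "mushrooms", "nicole", "salad", "steak"));
        for (String pattern : new String[] {"nic*", "d*n", "der?n", "*eron"}) {
            System.out.println(pattern + " -> candidate terms: " + candidateTerms(terms, pattern));
        }
    }
}
```

With only fourteen terms the difference is trivial, but on a large index the gap between testing a handful of terms and testing all of them is what makes leading wildcards expensive.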