module analysis::text::search::Lucene
Simple interface to the Lucene text analysis library.
Usage
import analysis::text::search::Lucene;
Description
This module wraps the Apache Lucene framework for text analysis.
- It integrates deeply by providing the interfaces for the analysis extension points of Lucene via Rascal callback functions: Analyzers, Tokenizers, Filters.
- It provides access to the full library of Lucene's text analyzers via their class names.
- It is a work in progress. Some configurability of Lucene is not yet exposed, for example programmable weights for fields and per-field similarity functions. Query expressions are also not yet exposed.
- This wrapper provides full abstraction over source locations: both the directory of the index and the locations of the input documents are expressed using any existing Rascal loc.
data Document
A Lucene document has a src and an open set of keyword fields which are also indexed.
data Document
= document(loc src, real score=.0)
;
A Lucene document has a src origin and an open set of keyword fields. Add as many keyword fields to a document as you want; they will be added to the Lucene document as "Fields":
- fields of type str will be stored and indexed as-is
- fields of type loc will be indexed but not stored
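For instance, extra keyword fields can be attached by extending the Document type; the author field below is purely illustrative:

```rascal
// Sketch: extend Document with a hypothetical "author" keyword field.
// Because it is of type str, it will be stored and indexed as-is.
data Document(str author = "");

d = document(|home:///notes.txt|, author="D. Knuth");
```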
data Analyzer
data Analyzer
= analyzerClass(str analyzerClassName)
| analyzer(Tokenizer tokenizer, list[Filter] pipe)
;
data Analyzer
A fieldsAnalyzer uses keyword fields to declare which Analyzer to use for which Document field.
data Analyzer
= fieldsAnalyzer(Analyzer src)
;
The src parameter of fieldsAnalyzer aligns with the src parameter of a Document: this analyzer is used to analyze the src field. Any other keyword fields of type Analyzer are applied to the contents of the Document keyword field of type loc or str with the same name.
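A minimal sketch, assuming a hypothetical author keyword field on the documents: the src field gets the standard analyzer, while the author field is analyzed by the simple analyzer.

```rascal
// Sketch: declare an Analyzer keyword field whose name matches a
// (hypothetical) Document keyword field, then configure per-field analysis.
data Analyzer(Analyzer author = standardAnalyzer());

myAnalyzer = fieldsAnalyzer(standardAnalyzer(), author=simpleAnalyzer());
```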
data Term
data Term
= term(str chars, loc src, str kind)
;
data Tokenizer
data Tokenizer
= tokenizer(list[Term] (str input) tokenizerFunction)
| tokenizerClass(str tokenizerClassName)
;
data Filter
data Filter
= \editFilter(str (str term) editor)
| \removeFilter(bool (str term) accept)
| \splitFilter(list[str] (str term) splitter)
| \synonymFilter(list[str] (str term) generator)
| \tagFilter(str (str term, str current) tagger)
| \filterClass(str filterClassName)
;
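For example, a lower-casing filter can be written as a Rascal callback via editFilter (similar in spirit to the built-in lowerCaseFilter):

```rascal
import String;

// Sketch: an editFilter rewrites each term; this one lower-cases them.
Filter lowercase = \editFilter(str (str term) { return toLowerCase(term); });
```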
function createIndex
Creates a Lucene index at a given folder location from the given set of Documents, using a given set of text analyzers.
void createIndex(loc index, set[Document] documents, Analyzer analyzer = standardAnalyzer(), str charset="UTF-8", bool inferCharset=!(charset?))
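A minimal indexing sketch; all locations are illustrative:

```rascal
// Sketch: build an index over two text files in an (assumed) home folder,
// using the default standardAnalyzer.
createIndex(|home:///myIndex|, {
    document(|home:///a.txt|),
    document(|home:///b.txt|)
});
```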
function searchIndex
Searches a Lucene index indicated by the indexFolder by analyzing a query with a given set of text analyzers and then matching the query to the index.
set[Document] searchIndex(loc index, str query, Analyzer analyzer = standardAnalyzer(), int max = 10)
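A search sketch against an index previously built with createIndex; the index location and query are illustrative:

```rascal
import IO;

// Sketch: query an existing index and print each hit with its score.
for (Document d <- searchIndex(|home:///myIndex|, "hello world"))
    println("<d.src> scored <d.score>");
```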
function searchDocument
Searches a document for a query by analyzing it with a given analyzer and listing the hits inside the document, for debugging and reporting purposes.
list[loc] searchDocument(loc doc, str query, Analyzer analyzer = standardAnalyzer(), int max = 10, str charset="UTF-8", bool inferCharset=!(charset?))
function analyzeDocument
Simulate analyzing a document source location like createIndex
would do, for debugging purposes.
list[Term] analyzeDocument(loc doc, Analyzer analyzer = standardAnalyzer())
function analyzeDocument
Simulate analyzing a document source string like createIndex
would do, for debugging purposes.
list[Term] analyzeDocument(str doc, Analyzer analyzer = standardAnalyzer())
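Debugging an analyzer on a literal string might look like this:

```rascal
// Sketch: inspect the terms the default standardAnalyzer produces
// for a string; each Term carries its characters, origin, and kind.
list[Term] terms = analyzeDocument("Hello, Lucene World!");
```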
function listTerms
Inspect the terms stored in an index for debugging purposes (what did the analyzers do to the content of the documents?).
rel[str chars, int frequency] listTerms(loc index, str field, int max = 10)
function listFields
Inspect the fields stored in an index for debugging purposes (which fields have been indexed, for how many documents, and how many terms?).
rel[str field, int docCount, int sumTotalTermFreq] listFields(loc index)
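Together, listFields and listTerms give a quick sanity check of an index; the location and field name below are illustrative:

```rascal
// Sketch: which fields exist in the index, and what are the most
// frequent terms stored for the "src" field?
fields = listFields(|home:///myIndex|);
topTerms = listTerms(|home:///myIndex|, "src");
```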
function classicAnalyzer
Analyzer classicAnalyzer()
function simpleAnalyzer
Analyzer simpleAnalyzer()
function standardAnalyzer
Analyzer standardAnalyzer()
function whitespaceAnalyzer
Analyzer whitespaceAnalyzer()
function classicTokenizer
Tokenizer classicTokenizer()
function lowerCaseTokenizer
Tokenizer lowerCaseTokenizer()
function lowerCaseFilter
Filter lowerCaseFilter()