Lucene Component of OFBiz, A Technical Overview

OFBiz and Lucene Integration

We are very pleased to contribute this updated integration with Lucene for OFBiz! Also, the timing is good, as I am excited to be attending the Lucene/Solr Revolution EU 2013 conference in Dublin, Ireland November 4-7. (If you will be in attendance and would like to get together, please feel free to contact me through the website.)

Many thanks to Scott Gray and the others within HotWax Media who worked with me on this effort. Also thanks to the Lucene Foundation for their cooperation.

I hope you enjoy the following technical overview of the Lucene component of OFBiz. Let me know what you think!

The “lucene” component in OFBiz provides:

a framework for the efficient management of Lucene indexes and for the definition, preparation and indexing of documents for information stored in the OFBiz data model

an implementation of an index and documents for Product searches

This document describes the main features of the Product index/documents and provides details about the framework and how to use it to implement custom indexes and documents.

Lucene Index for Product Information

A dedicated Lucene index is used to index and search product related information. The index is directory based (FSDirectory), but the type of index can easily be changed programmatically. The parent folder for directory based indexes in OFBiz is runtime/indexes/ and the directory name of the product index is “products”, so the actual index files maintained by Lucene are in runtime/indexes/products/. This is the “data” folder: you can back it up (or clear it) when needed.

You can specify the path to the parent index folder by setting the property
defaultIndex=runtime/indexes
in the file lucene/config/search.properties.
For example you can set it to a path to a shared folder where the index can be used by other applications or other OFBiz instances.
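As a rough illustration of how the parent folder and the index name combine into the actual index location, here is a minimal sketch. The class and method names are hypothetical (they are not part of OFBiz), the shared path is made up, and the fallback to runtime/indexes when the property is missing is an assumption made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not an OFBiz class: it only shows how the
// "defaultIndex" property and an index name form the index directory.
final class IndexPathSketch {

    // Resolves <defaultIndex>/<indexName>. The fallback value is an assumption.
    static Path resolveIndexDir(Properties searchProps, String indexName) {
        String parent = searchProps.getProperty("defaultIndex", "runtime/indexes");
        return Paths.get(parent, indexName);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Same syntax as a line in search.properties; the path is illustrative.
        props.load(new StringReader("defaultIndex=/mnt/shared/ofbiz-indexes\n"));
        System.out.println(resolveIndexDir(props, "products"));
    }
}
```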

In the “lucene” component there is also code for another directory based index, named “content” (the path is runtime/indexes/content), that is used to index and search generic Content related information; this code is out of scope for this document. Note that, even though the code has been upgraded to use the new Lucene framework integrated in OFBiz, like the “products” index, the Content document layout is old and the extraction logic may need to be reviewed.

The Lucene index named “products” contains Lucene documents of the same type: the Product Document described in the next section.

You can easily create a custom directory based index with the following call:

How to create a new index named “customIndex”

DocumentIndexer indexer = DocumentIndexer.getInstance(delegator, "customIndex");

This is actually the same code you can use to get an indexer object, i.e. an object you can use to submit for indexing your documents: if the system cannot locate a directory based index named “customIndex” it creates one (under runtime/indexes/customIndex/); the next calls to DocumentIndexer.getInstance(…) will then return an indexer for it.

The first time the DocumentIndexer.getInstance(…) method is invoked for an index (e.g. “products”) the system starts a thread that implements an efficient and thread safe Producer/Consumer queue: the queue contains the LuceneDocument objects that need to be indexed (submitted by client/Producer code with the DocumentIndexer.queue(…) call) and the thread is the Consumer that indexes each LuceneDocument in the Lucene index.
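The Producer/Consumer pattern described above can be sketched with plain JDK classes. This is not the DocumentIndexer source: the class name, the stop sentinel and the use of plain strings in place of LuceneDocument objects are all assumptions made to keep the example self-contained.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: one consumer thread drains a thread safe queue
// that any number of producer threads can fill.
final class IndexQueueSketch {
    private static final String STOP = "__STOP__"; // sentinel used only in this sketch

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> indexed = new CopyOnWriteArrayList<>();
    private final Thread consumer;

    IndexQueueSketch() {
        consumer = new Thread(() -> {
            try {
                while (true) {
                    String docId = queue.take();      // blocks until a document is queued
                    if (STOP.equals(docId)) {
                        return;
                    }
                    indexed.add(docId);               // stands in for writing to the Lucene index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve the interrupt and exit
            }
        }, "index-consumer");
        consumer.setDaemon(true);
        consumer.start();
    }

    // Producer side: returns immediately, the consumer does the work later.
    void queue(String docId) {
        queue.add(docId);
    }

    // Asks the consumer to stop after the pending documents and waits for it.
    List<String> shutdownAndGetIndexed() {
        queue.add(STOP);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed;
    }
}
```

The point of the design is that the caller of queue(…) never waits for the index to be written: the only shared state is the queue, so client code stays fast and a single thread writes to the index.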

Layout of the Product Document

This section describes the internal structure of the product document in the “products” index. This information is a useful reference for the implementation of advanced search user interfaces (e.g. faceted searches, weighted searches etc…).

The client code doesn’t need to know these details in order to submit a product for indexing, because the information extraction logic and the Lucene document preparation are encapsulated in the ProductDocument class: an example of usage of this class is provided in the section “How to programmatically add documents to the index”. You can also implement custom document builders by implementing the LuceneDocument interface. In OFBiz there are currently two implementations of the interface: ProductDocument (the topic of this section) and ContentDocument.

The interface declares two methods:

getDocumentIdentifier(…) should return a Lucene Term that uniquely identifies the document in the index: this is used to locate and recreate a document in the index

prepareDocument(…) contains the information extraction logic to prepare a Lucene document

Once you have written your custom implementation of LuceneDocument you can submit it for indexing; see this sample code:

DocumentIndexer indexer = DocumentIndexer.getInstance(delegator, "customIndex");
// CustomDocument is a custom implementation of a LuceneDocument
LuceneDocument document = new CustomDocument("ABC"); // ABC is the unique identifier of the document in the index
indexer.queue(document);

In the Product Document, the unique identifier of the document in the index is the productId field.

The productId field is the only field whose content is stored with the document in the index (and thus returned in Lucene search results): in this way the size of the index is kept as small as possible.

The fields that have the “Boosted” column set to Y in the table below can be boosted by setting a weight (i.e. boost factor) in the file applications/product/config/productsearch.properties; for example, the following line sets a boost factor of 1 for the field Product.description:

index.weight.Product.description=1

If in this file the weight of a field is set to 0 then the field is not added to the document (not indexed).
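The rule above (a weight of 0 means the field is left out of the document) can be illustrated with a small sketch. The class and method names are hypothetical and this is not the ProductDocument code; what happens when a key is missing is not stated here, so the default weight is an explicit parameter rather than a guess.

```java
import java.util.Properties;

// Hypothetical helper, not an OFBiz class: reads "index.weight.<field>"
// entries and decides whether a field should be added to the document.
final class FieldWeightSketch {

    // Returns the configured weight, or defaultWeight when the key is absent.
    static float weightOf(Properties props, String field, float defaultWeight) {
        String raw = props.getProperty("index.weight." + field);
        return raw == null ? defaultWeight : Float.parseFloat(raw.trim());
    }

    // A field with weight 0 is not added to the document (not indexed).
    static boolean isIndexed(Properties props, String field, float defaultWeight) {
        return weightOf(props, field, defaultWeight) > 0f;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("index.weight.Product.description", "1");
        props.setProperty("index.weight.Product.longDescription", "0");
        System.out.println(isIndexed(props, "Product.description", 0f));
        System.out.println(isIndexed(props, "Product.longDescription", 0f));
    }
}
```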

The fields with type “id” are added to the index as is (without parsing/tokenization); the fields with type “text” are parsed and tokenized.

“fullText” is the main field of the document and should be used for generic searches: in fact the content of several other fields is added to it. The other fields are useful for faceted searches or for weighted searches.

Name	Description	Type	Multi Valued	In fullText	Boosted	Stored
productId	The unique identifier of the Document in the Lucene Index; it matches the Product.productId, i.e. the unique identifier in the OFBiz transactional database. This field is primarily used to find and recreate the document in the index when the product information is updated and it needs to be reindexed. This is the only field whose content is stored with the document in the index.	id	N	N	N	Y
fullText	This field holds the aggregated content of several other fields (the ones with “In fullText” set to “Y” in this table). This is the primary field to be used in free text searches.	text	N	–	N	N
productName		text	N	Y	Y	N
internalName		text	N	Y	Y	N
brandName		text	N	Y	Y	N
description		text	N	Y	Y	N
longDescription		text	N	Y	Y	N
introductionDate	Long value representing a date; the time information has been removed from the original field.	quantized date	N	N	N	N
salesDiscontinuationDate	Long value representing a date; the time information has been removed from the original field.	quantized date	N	N	N	N
isVariant		id	N	N	N	N
productFeatureId		id	Y	N	N	N
productFeatureCategoryId		id	Y	N	N	N
productFeatureTypeId		id	Y	N	N	N
featureDescription		text	Y	Y	Y	N
featureAbbreviation		text	Y	Y	Y	N
featureCode		text	Y	Y	Y	N
productFeatureGroupId		id	Y	N	N	N
attributeName		text	Y	Y	Y	N
attributeValue		text	Y	Y	Y	N
goodIdentificationTypeId		id	Y	N	N	N
goodIdentificationIdValue		id	Y	N	N	N
${goodIdentificationTypeId}_GoodIdentification	Dynamic field: different documents representing different products with different identification types may have different field names in the index.	id	Y	N	N	N
identificationValue		text	Y	Y	Y	N
variantProductId		text	Y	Y	Y	N
content		text	Y	Y	Y	N
${productPriceTypeId}_${productPricePurposeId}_${currencyUomId}_${productStoreGroupId}_price	Dynamic field associated to the double value term of a specific price type/purpose/currency/store group. If a product has several different prices the document will have one field for each.	double	Y	N	N	N
supplierPartyId		id	Y	N	N	N
prodCatalogId		id	Y	N	N	N
prodCategoryId		id	Y	N	N	N
directProductCategoryId		id	Y	N	N	N
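To make the dynamic field names and the “quantized date” type more concrete, here is a minimal sketch. It is not taken from ProductDocument: the class and method names are hypothetical, the name parts are assumed to be joined with underscores as the patterns in the table suggest, and truncating to the start of the UTC day is only one possible way to drop the time information (the exact quantization used by OFBiz may differ). The sample identifiers in main are illustrative.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helpers illustrating the table above; not OFBiz code.
final class ProductFieldSketch {

    // Builds a name following ${productPriceTypeId}_${productPricePurposeId}_${currencyUomId}_${productStoreGroupId}_price
    static String priceFieldName(String priceTypeId, String pricePurposeId,
                                 String currencyUomId, String storeGroupId) {
        return String.join("_", priceTypeId, pricePurposeId, currencyUomId, storeGroupId, "price");
    }

    // Builds a name following ${goodIdentificationTypeId}_GoodIdentification
    static String goodIdentificationFieldName(String goodIdentificationTypeId) {
        return goodIdentificationTypeId + "_GoodIdentification";
    }

    // Assumption: the time of day is dropped by truncating to midnight UTC.
    static long quantizeToDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(priceFieldName("LIST_PRICE", "PURCHASE", "USD", "GROUP1"));
        System.out.println(goodIdentificationFieldName("SKU"));
        System.out.println(quantizeToDay(System.currentTimeMillis()));
    }
}
```

Because each price combination gets its own field name, a search on a price has to build the same field name first, for example to run a numeric range query on a single price type, purpose, currency and store group.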

Product Index Synchronization

A product document is submitted for indexing or re-indexing every time the information about the product is added/updated/removed; this is done in real time using entity Event-Condition-Action (ECA) rules.

In particular, the following events trigger the indexing (or re-indexing) of a product document:

creation, update or removal of records in the entities:
1. Product
2. ProductFeatureAppl
3. ProductAttribute
4. GoodIdentification
5. ProductContent
6. ProductCategoryMember
7. ProductPrice
8. SupplierProduct
when a Product Feature is updated (including its associations with Feature Groups) all the products that are linked to that feature are submitted for re-indexing
when a Data Resource or a Content record is updated all the products that are linked to that content are submitted for re-indexing
when a Product Category tree is updated all the products that are associated to the category are submitted for re-indexing
when an association (ProductAssoc) between two products is created/updated/removed then the two products are submitted for re-indexing

How to programmatically add documents to the index

In addition to the automatic synchronization described in the previous section, there are two easy ways to submit a single product for indexing; there is also a script that submits all products for indexing. The first way uses the indexer directly:

How to submit for indexing in index “products” the product “ABC”

String productId = "ABC";
// get an instance of the indexer for the index named "products"
DocumentIndexer indexer = DocumentIndexer.getInstance(delegator, "products");
// submit the product document for indexing
indexer.queue(new ProductDocument(productId));

The second way is equivalent and it is based on a service call for service “indexProduct”, passing in the productId:

How to submit for indexing the product “ABC” (service call)

dispatcher.runSync("indexProduct", UtilMisc.toMap("productId", "ABC"));

There is also a Groovy script that submits for indexing all the products in the database. The script location is:

 lucene/webapp/content/WEB-INF/actions/IndexProducts.groovy

The easiest way to run it is through the user interface, following steps 1 and 2 in the section “How to Test – Admin User Interface”.

How to Test – Admin User Interface

point your browser to: https://localhost:8443/content/control/AdminIndex
in this page, press the button “Index Products”: this will run a script that submits all the products in the db for Lucene indexing; after a few seconds all the documents should be indexed (you can check the logs if you like); the index is created in the runtime/indexes/products folder
now you can submit Lucene queries in the “Search Products” tab: https://localhost:8443/content/control/ProductSearch
here are some examples of queries that you can run:
1. *:*
  returns all the products
2. mi*
  returns all the products that contain words starting with “mi” in some of the description/names/feature fields
3. black
  returns all the products that contain the word “black” in some of the description/names/feature fields
4. description:black
  returns all the products that contain the word “black” in the description field
5. …
6. you can actually run all Lucene queries; feel free to ping me for questions etc…
if you change product descriptions etc. the document in the index will be recreated automatically
however, if you want to refresh all the documents in the index, simply run step 2 again