Lucene Component of OFBiz, A Technical Overview

Jacopo Cappellato

We are very pleased to contribute this updated integration with Lucene for OFBiz! Also, the timing is good, as I am excited to be attending the Lucene/Solr Revolution EU 2013 conference in Dublin, Ireland, November 4-7. (If you will be in attendance and would like to get together, please feel free to contact me through the website.)

Many thanks to Scott Gray and the others within HotWax Media who worked with me on this effort. Also thanks to the Lucene Foundation for their cooperation.

I hope you enjoy the following technical overview of the Lucene component of OFBiz. Let me know what you think!

The “lucene” component in OFBiz provides:

  1. a framework for the efficient management of Lucene indexes and for the definition, preparation and indexing of documents built from information stored in the OFBiz data model
  2. an implementation of an index and documents for Product searches

This document describes the main features of the Product index/documents and provides details about the framework and how to use it to implement custom indexes and documents.

Lucene Index for Product Information

A specific Lucene index is used to index and search product related information. The index is directory based (FSDirectory), but the type of index can easily be changed programmatically. The parent folder for directory based indexes in OFBiz is runtime/indexes/ and the directory name of the product index is “products”, so the actual index files maintained by Lucene will be in runtime/indexes/products/. This is the “data” folder: you can back it up (or clear it) when needed.

Note: You can specify the path to the parent index folder by setting the property
defaultIndex=runtime/indexes
in the file lucene/config/search.properties.
For example, you can set it to the path of a shared folder where the index can be used by other applications or other OFBiz instances.
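
As a rough illustration of how the parent folder and the index name combine into the index directory, here is a minimal sketch. It is not OFBiz code: the IndexPathSketch class, its method and the fallback value are assumptions made for this example; only the defaultIndex property name and the runtime/indexes default come from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of OFBiz: shows how a parent-folder
// property and an index name could combine into the index directory.
class IndexPathSketch {

    // Assumed fallback when the property is missing.
    static final String DEFAULT_PARENT = "runtime/indexes";

    // Returns <parent folder>/<index name>, reading the parent from "defaultIndex".
    static Path resolveIndexDir(Properties searchProps, String indexName) {
        String parent = searchProps.getProperty("defaultIndex", DEFAULT_PARENT).trim();
        return Paths.get(parent, indexName);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Stand-in for loading lucene/config/search.properties from disk.
        props.load(new StringReader("defaultIndex=/mnt/shared/indexes\n"));
        System.out.println(resolveIndexDir(props, "products"));
    }
}
```

With a shared parent folder configured this way, every index name resolves to a sub-folder of the same shared location.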

Note: In the “lucene” component there is also code for another directory based index, named “content” (the path is runtime/indexes/content), that is used to index and search generic Content related information; this code is out of scope for this document. Keep in mind that, even though the code has been upgraded to use the new Lucene framework integrated in OFBiz, like the “products” index, the Content document layout is old and the extraction logic may need to be reviewed.

The Lucene index named “products” contains Lucene documents of a single type: the Product Document described in the next section.

Note: You can easily create a custom directory based index with a DocumentIndexer.getInstance(…) call that names the new index:

How to create a new index named “customIndex”

This is actually the same code you can use to get an indexer object, i.e. an object you can use to submit your documents for indexing: if the system cannot locate a directory based index named “customIndex”, it creates one (under runtime/indexes/customIndex/); subsequent calls to DocumentIndexer.getInstance(…) will then return an indexer for it.

Note: The first time the DocumentIndexer.getInstance(…) method is invoked for an index (e.g. “products”), the system starts a thread that implements an efficient and thread safe Producer/Consumer queue. The queue contains the LuceneDocument objects that need to be indexed (submitted by client/Producer code with the DocumentIndexer.queue(…) call), and the thread is the Consumer that adds each LuceneDocument to the Lucene index.
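
The Producer/Consumer arrangement described above can be sketched with the standard java.util.concurrent classes. This is only an illustration of the pattern, not the actual DocumentIndexer implementation: the QueueIndexerSketch class is hypothetical, document ids stand in for LuceneDocument objects, and an in-memory list stands in for the Lucene index.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an indexer; the real DocumentIndexer may differ.
class QueueIndexerSketch {

    // One indexer (and one consumer thread) per index name.
    private static final Map<String, QueueIndexerSketch> INSTANCES = new ConcurrentHashMap<>();

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    // Stand-in for the Lucene index: document ids in the order they were consumed.
    private final List<String> indexed = new CopyOnWriteArrayList<>();

    private QueueIndexerSketch(String indexName) {
        Thread consumer = new Thread(this::consume, "indexer-" + indexName);
        consumer.setDaemon(true);
        consumer.start();
    }

    // The first call for a name creates the indexer; later calls return the same one.
    static QueueIndexerSketch getInstance(String indexName) {
        return INSTANCES.computeIfAbsent(indexName, QueueIndexerSketch::new);
    }

    // Producer side: returns immediately, the work happens on the consumer thread.
    void queue(String documentId) {
        pending.add(documentId);
    }

    private void consume() {
        try {
            while (true) {
                String documentId = pending.take(); // blocks until a document is queued
                indexed.add(documentId);            // a real indexer would write to Lucene here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // stop consuming when interrupted
        }
    }

    // Waits until at least `count` documents were consumed or the timeout expires.
    boolean awaitIndexed(int count, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (indexed.size() < count && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.size() >= count;
    }

    List<String> indexed() {
        return indexed;
    }

    public static void main(String[] args) {
        QueueIndexerSketch indexer = QueueIndexerSketch.getInstance("products");
        indexer.queue("ABC");
        indexer.queue("DEF");
        indexer.awaitIndexed(2, 1000);
        System.out.println(indexer.indexed());
    }
}
```

Because a single consumer thread drains a FIFO queue, documents are processed in submission order and the producers never block on index writes.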

Layout of the Product Document

This section describes the internal structure of the product document in the “products” index. This information is a useful reference for the implementation of advanced search user interfaces (e.g. faceted searches, weighted searches etc…).

Note: The client code doesn’t need to know these details in order to submit a product for indexing because the information extraction logic and Lucene document preparation are encapsulated in the ProductDocument class; an example of usage of this class is provided in the section “How to programmatically add documents to the index”. You can also implement custom document builders by implementing the LuceneDocument interface. In OFBiz there are currently two implementations of the interface: ProductDocument (the topic of this section) and ContentDocument.

A custom implementation provides these methods:

getDocumentIdentifier(…) should return a Lucene Term that uniquely identifies the document in the index; it is used to locate and recreate the document in the index

prepareDocument(…) contains the information extraction logic that prepares the Lucene document

Once you have written your custom implementation of LuceneDocument you can submit it for indexing by passing an instance to the DocumentIndexer.queue(…) call.
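
To make the division of responsibilities between the two methods more concrete, here is a self-contained sketch. It deliberately avoids the Lucene and OFBiz classes, so everything in it is an assumption made for illustration: IndexableDocument is a hypothetical stand-in for LuceneDocument, a String replaces the Lucene Term, a Map of field names to values replaces the Lucene Document, and SupplierDocument is a made-up custom document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DocumentSketch {

    // Hypothetical stand-in for the LuceneDocument interface; plain Java types
    // replace Lucene's Term and Document so the sketch runs without Lucene.
    interface IndexableDocument {
        String getDocumentIdentifier();        // unique key of the document in the index
        Map<String, String> prepareDocument(); // extracted fields: name -> value
    }

    // Made-up custom document for a supplier record.
    static final class SupplierDocument implements IndexableDocument {
        private final String partyId;
        private final String groupName;

        SupplierDocument(String partyId, String groupName) {
            this.partyId = partyId;
            this.groupName = groupName;
        }

        @Override
        public String getDocumentIdentifier() {
            return "partyId:" + partyId;
        }

        @Override
        public Map<String, String> prepareDocument() {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("partyId", partyId);
            fields.put("fullText", groupName);
            return fields;
        }
    }

    // Stand-in index: dropping the old entry by identifier and adding the new one
    // mimics "locate and recreate" when a record is reindexed.
    static void submit(Map<String, Map<String, String>> fakeIndex, IndexableDocument doc) {
        String id = doc.getDocumentIdentifier();
        fakeIndex.remove(id);
        fakeIndex.put(id, doc.prepareDocument());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> fakeIndex = new LinkedHashMap<>();
        submit(fakeIndex, new SupplierDocument("S1", "Acme"));
        submit(fakeIndex, new SupplierDocument("S1", "Acme Corporation"));
        System.out.println(fakeIndex);
    }
}
```

Submitting the same identifier twice leaves a single entry holding the most recent fields, which is the behavior the identifier is meant to enable.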

The unique identifier of the document in the index is the productId field.

The productId field is the only field whose content is stored with the document in the index (and thus returned in Lucene search results): in this way the size of the index is kept as small as possible.

The fields that in the table have the “Boosted” column set to Y can be boosted by setting a weight (i.e. boost factor) in the file applications/product/config/productsearch.properties; for example, the following line sets a boost factor of 1 for the field Product.description:

If in this file the weight of a field is set to 0 then the field is not added to the document (not indexed).
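
The effect of the weights can be sketched as follows. This is not the OFBiz implementation: the BoostConfigSketch class is hypothetical, and the "index.weight." key prefix is an assumption made for this example (check productsearch.properties for the real key names). The sketch only demonstrates the rule stated above, i.e. a weight of 0 excludes the field and any other weight is used as its boost factor.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper; the key prefix below is an assumption made for this sketch.
class BoostConfigSketch {

    static final String PREFIX = "index.weight.";

    // Returns field name -> boost factor for every configured field whose weight is not 0.
    static Map<String, Float> indexedFieldWeights(Properties props) {
        Map<String, Float> weights = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith(PREFIX)) {
                continue; // unrelated property
            }
            float weight = Float.parseFloat(props.getProperty(key).trim());
            if (weight > 0f) {
                weights.put(key.substring(PREFIX.length()), weight);
            }
            // weight == 0: the field is left out, so it is never added to the document
        }
        return weights;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PREFIX + "Product.description", "1");
        props.setProperty(PREFIX + "Product.internalName", "0");
        System.out.println(indexedFieldWeights(props));
    }
}
```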

Fields with type “id” are added to the index as is (without parsing/tokenization); fields with type “text” are parsed and tokenized.

“fullText” is the main field of the document and should be used for generic searches: in fact the content of several other fields is added to it. The other fields are useful for faceted searches or for weighted searches.

| Name | Description | Type | Multi Valued | In fullText | Boosted | Stored |
|---|---|---|---|---|---|---|
| productId | The unique identifier of the Document in the Lucene Index; it matches the Product.productId, i.e. the unique identifier in the OFBiz transactional database. This field is primarily used to find and recreate the document in the index when the product information is updated and it needs to be reindexed. This is the only field whose content is stored with the document in the index. | id | N | N | N | Y |
| fullText | The field is associated to the aggregated content from several other fields (the ones with “In fullText” set to “Y” in this table). This is the primary field to be used in free text searches. | text | N | | N | N |
| productName | | text | N | Y | Y | N |
| internalName | | text | N | Y | Y | N |
| brandName | | text | N | Y | Y | N |
| description | | text | N | Y | Y | N |
| longDescription | | text | N | Y | Y | N |
| introductionDate | Long value representing a date; the time information has been removed from the original field. | quantized date | N | N | N | N |
| salesDiscontinuationDate | Long value representing a date; the time information has been removed from the original field. | quantized date | N | N | N | N |
| isVariant | | id | N | N | N | N |
| productFeatureId | | id | Y | N | N | N |
| productFeatureCategoryId | | id | Y | N | N | N |
| productFeatureTypeId | | id | Y | N | N | N |
| featureDescription | | text | Y | Y | Y | N |
| featureAbbreviation | | text | Y | Y | Y | N |
| featureCode | | text | Y | Y | Y | N |
| productFeatureGroupId | | id | Y | N | N | N |
| attributeName | | text | Y | Y | Y | N |
| attributeValue | | text | Y | Y | Y | N |
| goodIdentificationTypeId | | id | Y | N | N | N |
| goodIdentificationIdValue | | id | Y | N | N | N |
| ${goodIdentificationTypeId}_GoodIdentification | Dynamic field: different documents representing different products with different identification types may have different field names in the index. | id | Y | N | N | N |
| identificationValue | | text | Y | Y | Y | N |
| variantProductId | | text | Y | Y | Y | N |
| content | | text | Y | Y | Y | N |
| ${productPriceTypeId}_${productPricePurposeId}_${currencyUomId}_${productStoreGroupId}_price | Dynamic field associated to the double value term of a specific price type/purpose/currency/store group. If a product has several different prices the document will have one field for each. | double | Y | N | N | N |
| supplierPartyId | | id | Y | N | N | N |
| prodCatalogId | | id | Y | N | N | N |
| prodCategoryId | | id | Y | N | N | N |
| directProductCategoryId | | id | Y | N | N | N |
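
Two details of the layout above lend themselves to a short sketch: the names of the dynamic fields and the “quantized date” values. The code below is an illustration only, not taken from ProductDocument. The ProductFieldSketch class is hypothetical, the separators assume that the field name patterns contain no spaces, and truncating to whole days in UTC is an assumption about how the time information is removed.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helpers illustrating the dynamic field names and quantized dates.
class ProductFieldSketch {

    // One field per price, named after its type, purpose, currency and store group.
    static String priceFieldName(String priceTypeId, String pricePurposeId,
                                 String currencyUomId, String storeGroupId) {
        return priceTypeId + "_" + pricePurposeId + "_" + currencyUomId + "_" + storeGroupId + "_price";
    }

    // One field per identification type, e.g. "SKU_GoodIdentification".
    static String goodIdentificationFieldName(String goodIdentificationTypeId) {
        return goodIdentificationTypeId + "_GoodIdentification";
    }

    // Drops the time-of-day part (in UTC, an assumption) and returns epoch milliseconds.
    static long quantizeToDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(priceFieldName("LIST_PRICE", "PURCHASE", "USD", "_NA_"));
        System.out.println(goodIdentificationFieldName("SKU"));
        System.out.println(quantizeToDay(System.currentTimeMillis()));
    }
}
```

Quantizing dates this way means that all products introduced on the same day share the same indexed value, which keeps the number of distinct terms low and makes range queries by day straightforward.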

Product Index Synchronization

A product document is submitted for indexing or re-indexing every time information about the product is added, updated or removed; this is done in real time using entity Event-Condition-Action (ECA) rules.

In particular, the following events trigger a product document indexing (or re-indexing):

  1. creation, update or removal of records in the entities:
    1. Product
    2. ProductFeatureAppl
    3. ProductAttribute
    4. GoodIdentification
    5. ProductContent
    6. ProductCategoryMember
    7. ProductPrice
    8. SupplierProduct
  2. when a Product Feature is updated (including its associations with Feature Groups) all the products that are linked to that feature are submitted for re-indexing
  3. when a Data Resource or a Content record is updated all the products that are linked to that content are submitted for re-indexing
  4. when a Product Category tree is updated all the products that are associated to the category are submitted for re-indexing
  5. when an association (ProductAssoc) between two products is created/updated/removed then the two products are submitted for re-indexing

How to programmatically add documents to the index

In addition to the automatic synchronization described in the previous section, there are two easy ways to submit a single product for indexing; there is also a script that submits all products. The first way is to queue a ProductDocument for the product on the indexer of the “products” index:

How to submit for indexing in index “products” the product “ABC”

The second way is equivalent: it calls the “indexProduct” service, passing in the productId:

How to submit for indexing the product “ABC” (service call)

There is also a Groovy script that submits for indexing all the products in the database. The script location is:
The easiest way to run it is through the user interface, following steps 1 and 2 in the section “How to Test – Admin User Interface”.

How to Test – Admin User Interface

  1. point your browser to: https://localhost:8443/content/control/AdminIndex
  2. on this page, press the button “Index Products”: this will run a script that submits all the products in the db for Lucene indexing; after a few seconds all the documents should be indexed (you can check the logs if you like); the index is created in the runtime/indexes/products/ folder
  3. now you can submit Lucene queries in the “Search Products” tab: https://localhost:8443/content/control/ProductSearch
  4. here are some examples of queries that you can run:
    1. *:*
      returns all the products
    2. mi*
      returns all the products that contain words starting with “mi” in some of the description/names/feature fields
    3. black
      returns all the products that contain the word “black” in some of the description/names/feature fields
    4. description:black
      returns all the products that contain the word “black” in the description field
    5. you can actually run all Lucene queries; feel free to ping me for questions etc…
  5. if you change product descriptions etc., the document in the index will be recreated automatically
  6. however, if you want to refresh all the documents in the index, simply run step 2 again

About Jacopo Cappellato
Jacopo Cappellato is VP of Technology at HotWax Media and has been involved with the OFBiz project since 2003. He is an OFBiz Project Committer and a member of both the OFBiz Project Management Committee and the Apache Software Foundation.
