
Asynchronous Services in Apache OFBiz: Power, Pitfalls, and Production Lessons

by Deepak Dixit

Asynchronous processing is one of the most powerful capabilities in Apache OFBiz. It allows long-running or non-critical work to move out of the critical path so that core business services can complete quickly. For developers, async often feels like an obvious win: better responsiveness, higher throughput, and improved scalability.

But after building and operating multiple large-scale enterprise automation systems on Apache OFBiz at HotWax Systems, we learned something the hard way: async is powerful, but misused async is one of the fastest ways to destabilize a production system.

Most async-related problems do not appear during development or early testing. They surface later, under real production load, when order volumes spike, batch imports run for hours, background jobs pile up, and interactive users still expect the system to remain responsive. At that point, design choices that once looked harmless begin to reveal their true cost.

This blog is not a theoretical overview of asynchronous services. It is a collection of production lessons learned over time: when async is genuinely the right tool, when it quietly becomes a bottleneck, why batching and scheduling often outperform async in high-volume workflows, how persisted async jobs can overwhelm JobSandbox, how execution order and service transaction boundaries can break correctness, and how enterprise deployments must be designed to support async safely.

If you are an architect or senior developer working with Apache OFBiz at scale, this is the blog we wish we had read earlier.

Why do asynchronous services exist in the first place?

Let’s start with first principles.

In enterprise systems, not all work is equal:

  • Some work must happen now, inside the user transaction.
  • Some work can happen later, without blocking the business flow.

Examples of work that should not block a transaction:

  • sending emails,
  • calling third-party APIs,
  • long calculations,
  • bulk updates,
  • background validations.

Apache OFBiz’s service engine supports asynchronous execution so that:

  • the main service can finish quickly,
  • the user experience remains responsive,
  • heavy work happens in the background.

All of this sounds reasonable and it is. The problems start not because async exists, but because it is often used as the default solution rather than a carefully chosen one.
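Mechanically, offloading work is often a one-attribute change. As a sketch, a Service ECA rule can hand follow-up work to the service engine's job manager instead of the request thread (the service names here are hypothetical):

```xml
<!-- secas.xml sketch: fire a follow-up service asynchronously after createOrder commits.
     Service names are hypothetical; mode="async" is what moves the work off the caller's thread. -->
<eca service="createOrder" event="commit">
    <action service="sendOrderConfirmationEmail" mode="async"/>
</eca>
```

The same switch is available programmatically through LocalDispatcher's runAsync methods.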

Production realities where async services broke for us

Production issue #1: Async can overload threads and starve UI traffic

The Scenario:

  • OMS implementation on top of Apache OFBiz.
  • 5,000–8,000 orders downloaded from eCommerce in one hour and created immediately in OMS.
  • For each order created, the system triggers one async service, plus consequent async services for emails, routing, and notifications.
  • Async jobs are created faster than they are consumed.

What goes wrong:

  • async jobs accumulate in queues,
  • threads get saturated,
  • CPU and memory spike,
  • when capacity frees up, the system aggressively drains the queue.

The side effect:

  • other synchronous services slow down,
  • UI actions feel laggy,
  • users experience delays even for simple operations,
  • the system may even crash.

Key takeaway:

Async processes compete with interactive workloads. This is why async must be used with volume awareness, not just functional correctness.

Async execution works best when the work being offloaded does not affect the core business flow. As we already discussed, typical examples include sending confirmation emails, pushing data to analytics systems, or notifying downstream systems after a workflow completes. In these cases, the core business state is already valid and committed, and the system can tolerate delays or even failures in the follow-up work. In our use case, that means orders should be created by a synchronous service, while consequent operations like emails can run as async services. If an async task in this category fails, it does not invalidate the primary transaction or disrupt the core business flow.

Another important point: creating orders in bulk in an OMS is a database-heavy operation. It involves multiple inserts, updates, validations, and often downstream allocations or reservations. Running hundreds or thousands of such operations asynchronously does not make them faster; it forces them to compete aggressively for the same database resources. The result is higher contention, slower commits, and reduced overall throughput.

So in these cases, controlled batch processing with synchronous execution often outperforms massive async parallelism, delivering steadier performance and far more predictable outcomes. We will discuss this strategy in detail later.

Production issue #2: Async can break correctness due to execution order

Another subtle but critical issue we’ve seen:

The scenario:

A synchronous service:

  • creates an order,
  • writes order headers and items,
  • triggers an async service to send notifications or update downstream systems.

The async service:

  • reads order data from the database.

What goes wrong:

  • async runs in a separate thread,
  • it may execute before the sync service commits,
  • database reads return no data or partial data,
  • async service fails with “record not found”.

The side effect:

  • intermittent failures,
  • retries that magically “fix” the problem,
  • hard-to-reproduce bugs.

Key takeaway:

Async does not mean “run later”. It means “run concurrently”. Because an async service starts in a separate thread with its own transaction, it may execute before the triggering service commits. This creates a "Data Visibility Gap" where the follow-up logic fails with a “Record Not Found” error because it cannot see uncommitted data.

The solution: commit boundaries. To ensure correctness and proper execution order, you must explicitly bind your async service to a commit event. In Apache OFBiz, you have two primary tools for this:

  1. event="commit": Use this when you want the async service to trigger as soon as the current service’s transaction is successful. This is ideal for isolated logic or services that are not part of a deep nested chain.

  2. event="global-commit": Use this when you need absolute certainty that the entire business process is finalized. This is the safest choice for services that depend on data created across multiple nested calls.

A Note of Caution on global-commit: While global-commit is the safest, you must use it carefully because it respects the Apache OFBiz Transaction Stack. For example, if Service A calls Service B, which calls Service C, and Service C triggers Service D on global-commit, Service D will be held in a "wait state" until the very first service (Service A) is entirely committed. If Service A is a heavy process (like a large batch job), Service D might wait a long time before it ever starts.

Best Practice: Use global-commit for mission-critical syncs (like sending emails or pushing to ERP) to ensure they only happen after the "Grandparent" transaction is safe. For lighter, local dependencies, commit is usually sufficient. When possible, passing all required data directly in the service context is the best way to avoid database visibility issues entirely.
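The two commit events map directly to Service ECA rules. A sketch, with hypothetical service names (exact attributes should be checked against the service-eca XSD for your OFBiz version):

```xml
<!-- Fires once this service's own transaction commits: good for local, isolated follow-ups. -->
<eca service="createOrder" event="commit">
    <action service="sendOrderConfirmationEmail" mode="async"/>
</eca>

<!-- Fires only when the outermost (global) transaction commits: safest for follow-ups
     that depend on data written across multiple nested service calls. -->
<eca service="createOrder" event="global-commit">
    <action service="pushOrderToErp" mode="async"/>
</eca>
```

Because both rules defer the async dispatch until after a commit, the follow-up service can no longer hit the "Data Visibility Gap" described above.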

Production issue #3: Async + persist can overwhelm JobSandbox

Apache OFBiz supports persisted async services so that jobs:

  • survive server restarts,
  • are guaranteed to execute.

This is powerful and dangerous at scale.

The scenario:

Retailers often need to import historical orders from eCommerce for processing returns.

Example:

  • 5,000 orders per day
  • 90 days of history
  • ~1.3 order items per order
  • ~500,000 records to create

If each order is created via a persisted asynchronous service, then Apache OFBiz creates:

  • ~500,000 JobSandbox records.

What goes wrong:

  • JobSandbox table becomes massive,
  • DB I/O increases,
  • service engine spends time managing jobs rather than executing business logic,
  • other async and scheduled jobs get delayed,
  • user-facing activity slows down.

Async + persistence was used for reliability, but it became a system-wide bottleneck.
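For reference, the pattern that produces this looks deceptively small. A sketch with hypothetical service names; the persist flag (`persist="true"` on the ECA action, or the boolean persist argument to LocalDispatcher.runAsync) is what turns every invocation into a JobSandbox row:

```xml
<!-- secas.xml sketch: one persisted JobSandbox record per imported order.
     Service names are hypothetical. -->
<eca service="importOrderRow" event="commit">
    <action service="createOrder" mode="async" persist="true"/>
</eca>
```

At roughly 450,000 historical orders, that single attribute translates into roughly half a million JobSandbox rows.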

Key takeaway:

Persisted async execution works well only when the rate at which jobs are created is lower than, or at least comparable to, the rate at which the system can process them. When this balance is maintained, job queues remain shallow, thread pools stay healthy, and background execution does not interfere with other workloads. Once this balance is lost, persisted async processes stop being an optimization and become a source of systemic slowdown.

Persisted async execution is a reliability feature, but at high volume, it becomes a scaling concern.
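The rate at which persisted jobs are drained is bounded by the job manager settings in serviceengine.xml. A sketch of the relevant knobs (values are illustrative; verify attribute names against your OFBiz version):

```xml
<!-- serviceengine.xml sketch (illustrative values):
     jobs           = max jobs fetched from JobSandbox per poll cycle
     max-threads    = upper bound on concurrent job-executor threads
     poll-db-millis = how often the job poller queries JobSandbox -->
<thread-pool send-to-pool="pool" purge-job-days="4" failed-retry-min="3"
             ttl="120000" jobs="100" min-threads="2" max-threads="5"
             poll-enabled="true" poll-db-millis="30000">
    <run-from-pool name="pool"/>
</thread-pool>
```

These settings cap the drain rate; they do nothing to cap the creation rate, which is why job production must be throttled at the design level.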

Data imports at scale: why batching beats async execution

One of the most important lessons we learned came from handling high-volume order imports during peak retail seasons. In real OMS deployments, orders arrive in bursts, often under strict same-day fulfillment expectations. It’s not uncommon to see 10,000–20,000 orders in a day, with 5,000 orders landing within a single hour. While eCommerce platforms typically expose APIs to fetch orders, at scale we deliberately chose to pull orders in controlled batches and group them into files of manageable size before processing them inside the OMS.

The initial design choice many teams make feels reasonable at first: each file is read, and for every row in the file, the createOrder service is invoked asynchronously. The logic seems simple: order creation is heavy, so running it asynchronously should keep the system responsive.

At low volumes, this approach works. Under peak load, it quickly falls apart.

When thousands of orders arrive within a short window, this design creates thousands of asynchronous service executions almost instantly. Each async invocation consumes threads, memory, CPU time, and database connections. Very quickly, the system becomes CPU- and memory-bound. Routing processes slow down, background jobs pile up, and fulfillment workflows start falling behind. In the worst cases, we even observed JVM instability and crashes during sustained peak traffic.

What looked like a scalable design on paper turned into an operational liability in production.

The breakthrough came when we stepped back and questioned the assumption that more parallelism meant better performance. For order imports, it doesn’t.

We redesigned the import pipeline around batch processing instead of uncontrolled async execution. Orders were downloaded and grouped into files of manageable size. The OMS processed one file at a time, and within each file, orders were created synchronously, one by one. Each order ran in its own transaction, ensuring isolation and correctness without overwhelming the system.
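A sketch of the service definitions behind this design, with hypothetical names and class locations: the file-level wrapper runs without a long enclosing transaction, and each order is created in its own fresh transaction so that one failure cannot roll back its neighbors.

```xml
<!-- services.xml sketch (names and locations are hypothetical) -->

<!-- Wrapper: iterates the file row by row; use-transaction="false" means it never
     holds one huge transaction open across thousands of rows. -->
<service name="importOrderFile" engine="java"
         location="com.example.order.OrderImportServices" invoke="importOrderFile"
         use-transaction="false">
    <attribute name="filePath" type="String" mode="IN"/>
</service>

<!-- Per-row worker: require-new-transaction="true" isolates each order's commit. -->
<service name="createImportedOrder" engine="java"
         location="com.example.order.OrderImportServices" invoke="createImportedOrder"
         require-new-transaction="true">
    <attribute name="orderData" type="Map" mode="IN"/>
</service>
```

The wrapper calls the worker synchronously for each row, logging failures per row instead of failing the whole file.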

The results were counterintuitive but consistent. Even though fewer orders were processed in parallel, overall throughput improved. Memory usage stabilized. Thread explosions disappeared. Most importantly, order routing and fulfillment timelines became predictable again. Sequential batch processing routinely delivered higher real-world throughput than massive async parallelism ever did.

This redesign also eliminated an entire class of problems related to persisted async processing. Because orders were created synchronously, there was no JobSandbox explosion, no background backlog, and no hidden competition between import workloads and other system processes.

Error handling improved dramatically as well. In the batch model, a failure in one order did not affect others. Errors were logged row by row and written to a separate error file along with clear failure reasons. Operational teams could fix data issues and re-import only the failed records. Successful orders were never blocked or rolled back because of unrelated errors.

By contrast, when thousands of orders were created asynchronously, failures were scattered across threads and jobs. Retries became noisy, diagnosing root causes took longer, and operational clarity suffered.

For bulk data movement, this experience taught us a clear architectural lesson: batching is usually the right choice. It favors predictability over raw concurrency, stability over bursty execution, and business outcomes over theoretical parallelism.

Infrastructure matters: separating background processing

Even with careful use of asynchronous services, batching, and throttling, there will be periods when background processing load becomes heavy. In enterprise Apache OFBiz deployments, this is not an edge case, it is expected. At that point, service design alone is not enough, and infrastructure architecture starts to matter.

A common and proven enterprise pattern is to separate interactive workloads from background processing, i.e., to deploy Apache OFBiz on separate groups of servers with distinct responsibilities. One set of servers handles user-facing traffic and synchronous business services, while another set is dedicated to background work such as asynchronous services, scheduled jobs, and batch processing. This separation ensures that spikes in background activity never block users from performing daily operations, and it allows background capacity to be scaled independently as volumes grow.

Apache OFBiz supports this deployment model cleanly through configuration, making it possible to treat async execution not just as a coding choice, but as an architectural and operational decision.
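As a sketch, the split can be expressed almost entirely in each node's serviceengine.xml (pool names are hypothetical): the user-facing node sends persisted jobs to a pool but never polls it, while the background node runs jobs from that pool.

```xml
<!-- User-facing node: accepts traffic and queues persisted jobs, but executes none. -->
<thread-pool send-to-pool="batch" poll-enabled="false"
             min-threads="2" max-threads="5" poll-db-millis="30000"/>

<!-- Background node: polls JobSandbox and executes jobs from the "batch" pool. -->
<thread-pool send-to-pool="batch" poll-enabled="true"
             min-threads="5" max-threads="15" jobs="100" poll-db-millis="20000">
    <run-from-pool name="batch"/>
</thread-pool>
```

With this layout, a JobSandbox backlog can slow the background node down without ever touching the threads that serve interactive users.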

Final thoughts: async is a tool, not a strategy

Apache OFBiz offers a remarkably flexible service engine. It supports synchronous and asynchronous execution, persisted jobs, callbacks, commit and rollback hooks, as well as scheduling and batch processing. This flexibility is one of Apache OFBiz’s greatest strengths, but only when it is applied deliberately.

Our experience at HotWax Systems has shown that most performance and stability issues in production systems are architectural rather than functional. Async-related problems rarely appear at low volume. They emerge gradually as traffic grows, data volumes increase, and background workloads start competing with interactive users for shared resources.

When used thoughtfully, asynchronous execution can unlock scalability and responsiveness. When used as a default, it becomes a silent bottleneck, one that erodes reliability, predictability, and operational confidence over time.

The difference is not the framework. The difference is discipline, experience, and design choices.

Before marking a service as async, it is worth pausing and asking:

  • Does this work need to block the user?
  • Does it depend on data created by another service’s transaction?
  • What is the expected volume per hour or per day?
  • Can failures be delayed or retried safely?
  • Is batch processing a better fit?
  • Should this run only after commit?
  • Does the infrastructure support the async load?

If these questions cannot be answered clearly, async is probably the wrong default.

Async is a powerful tool in Apache OFBiz, but like all powerful tools, it delivers the best results when used with intention, not instinct.

At HotWax Systems, our Apache OFBiz experts have spent years building and scaling high-volume enterprise systems where performance, reliability, and operational stability matter every day. From order orchestration and fulfillment workflows to large-scale async processing and batch automation, we help businesses design enterprise architectures that perform reliably under real production load. If you are planning to scale your Apache OFBiz implementation or optimize complex background processing workflows, connect with HotWax Systems to build systems engineered for stability, speed, and growth.



Deepak Dixit
Deepak is responsible for building microservices-based, API-first enterprise software on Apache OFBiz and Moqui. He leads the design and development of new features and solutions at HotWax Systems, ensuring the platform remains robust and scalable. He also heads the innovation lab at HotWax Systems, where he and his team rapidly experiment with and evaluate new technologies and frameworks to continuously evolve the product. Deepak is also an active contributor and committer in the Apache OFBiz community and has deep expertise in it.