Fixing Lost Runs in Worker 1.15.0

Today’s Worker update changes how your Run uses memory. The result is more stability for everyone, with fewer runs being Lost.

After updating, some users may notice that Runs are consistently being killed (we don’t think this affects anyone running on app.openfn.org). This happens when the memory required to compile the workflow exceeds the memory limit allocated to the run.

The Worker is the component which executes your workflows in a secure, sandboxed environment. It’s a separate process to the platform (Lightning), which is precisely what makes it so secure.

Within the worker, every step in a Workflow goes through a compilation process, where your job code gets converted into portable, executable JavaScript.

Prior to 1.15.0, compilation occurred in the main thread of the worker - which meant that any memory and CPU consumed in compilation affected any other runs being processed by the Worker at that time. After compilation, the Run was sent into its own process and allocated its own memory limit.

This meant that large steps could consume a lot of memory from the main process, triggering a heap exception and causing all runs on that Worker instance to fail.

In 1.15.0, we moved the compilation step out of the main process and into the run’s own dedicated process, where it is subject to that run’s memory limits. So if a run has 250MB of memory, but compiling a large step consumes 200MB, the run is likely to be killed with an Out Of Memory (OOM) exception.

In some cases, this means that workflows with a lot of step code but a low memory limit may suddenly start failing to compile on every run. Increasing the run memory should fix these issues.
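As a rough way to reason about this, here is a minimal sketch of the memory arithmetic. It assumes a simplified model in which the run’s process needs some baseline memory plus the peak memory used by compilation. The `checkMemoryBudget` helper, the `baselineMb` figure and the model itself are illustrative assumptions, not part of the Worker.

```javascript
// Hypothetical helper, not part of the Worker: checks whether a run's memory
// limit covers an assumed process baseline plus the peak cost of compilation.
// All figures are in MB and are illustrative, not measured values.
function checkMemoryBudget({ limitMb, baselineMb, compileMb }) {
  const requiredMb = baselineMb + compileMb;
  return {
    requiredMb,
    headroomMb: limitMb - requiredMb,
    fits: requiredMb <= limitMb,
  };
}

// The example from the post: a 250MB limit and a step that needs 200MB to compile.
// The 60MB baseline is an assumption made for this sketch.
const tight = checkMemoryBudget({ limitMb: 250, baselineMb: 60, compileMb: 200 });

// The same step with a higher run memory limit.
const roomy = checkMemoryBudget({ limitMb: 500, baselineMb: 60, compileMb: 200 });

console.log(tight.fits, tight.headroomMb); // false -10
console.log(roomy.fits, roomy.headroomMb); // true 240
```

Real memory usage is harder to predict than this, so treat the numbers as a way to think about headroom rather than a sizing formula.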

The kinds of steps which are long enough to cause this problem tend to have a large amount of data embedded within them - usually JSON objects and usually for the purposes of mapping values. These structures can unfortunately be expensive to compile - but also make job code harder to read and write. We recommend using Collections to save large mapping objects outside of your job code.
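To illustrate the general idea, here is a minimal plain JavaScript sketch of a step that reads its mapping at runtime instead of embedding it. The `mapFacilities` function, the `state.mappings.facilityCodes` location and the sample records are all hypothetical, and the Collections call that would load the mapping onto state is deliberately left out.

```javascript
// Before: a large lookup table embedded in the step, which has to be compiled
// along with the rest of the job code.
// const facilityCodes = { "Clinic A": "F001", /* ...thousands more entries... */ };

// After: the step contains only logic, and the table arrives at runtime.
// `state.mappings.facilityCodes` is a hypothetical location chosen for this sketch.
function mapFacilities(state) {
  const codes = state.mappings.facilityCodes;
  const mapped = state.data.map((record) => ({
    ...record,
    // Unknown facilities map to null rather than throwing.
    facilityCode: codes[record.facility] ?? null,
  }));
  return { ...state, data: mapped };
}

// Sample state, standing in for a mapping that was loaded from outside the job code.
const exampleState = {
  mappings: { facilityCodes: { "Clinic A": "F001", "Clinic B": "F002" } },
  data: [
    { id: 1, facility: "Clinic A" },
    { id: 2, facility: "Unknown Clinic" },
  ],
};

const result = mapFacilities(exampleState);
console.log(result.data.map((r) => r.facilityCode)); // [ 'F001', null ]
```

The step itself stays a few lines long no matter how large the mapping grows, so there is much less source to compile.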

We’ve been tracking Lost runs for a while now. Lost means that a Run lost communication with the main Lightning server, so we don’t really know what happened to it. This is a very unsatisfactory situation and we’ve been working to understand why runs get lost - connectivity issues, system crashes and memory limits are the major offenders. Lost runs are rare, but every single one is significant to us.

Today’s update should result in even fewer lost runs, both on the main app.openfn.org platform and on self-hosted and local Lightning deployments.

If you’ve been having trouble with Lost runs, hopefully this will fix it. Let us know either way!

@joe how do users access the latest version of the worker? Is it bundled in the latest version of openfn/lightning?

Ah that depends a bit!

Our platform at app.openfn.org is running the latest worker right now - so cloud users benefit from this straight away.

Yes, there is a version bundled with Lightning. The `main` branch now includes 1.15.0, so users running from latest source will be on it. The latest release doesn’t include this yet.

The setup is probably a bit different for deployed versions, where users might be hosting their own worker. I expect those users would know how to get set up :slight_smile:

Users can also run `RTM=false mix phx.server` to start Lightning without the bundled worker, and then spin up their own worker version (from a Docker image, straight out of the kit repo, or wrapped in their own application using the npm package).