Managing Lost Runs

A run is marked as lost when a worker claims it but never informs lightning that the run has completed.
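As a rough mental model only, the rule can be sketched in a few lines of Python. The function name `looks_lost` and its parameters are hypothetical and this is not lightning's actual implementation. It assumes the deadline is simply the claim time plus the maximum run duration (the `WORKER_MAX_RUN_DURATION_SECONDS` setting covered further down in this post) and ignores any extra grace period lightning may apply.

```python
from datetime import datetime, timedelta, timezone


def looks_lost(claimed_at, finished, now, max_run_duration_seconds=3600):
    """Return True if a claimed run has gone quiet for too long.

    Hypothetical helper for illustration: a run that was claimed, has not
    reported completion, and is past its deadline is treated as lost.
    """
    if finished:
        # The worker reported completion, so the run is not lost.
        return False
    deadline = claimed_at + timedelta(seconds=max_run_duration_seconds)
    return now > deadline


if __name__ == "__main__":
    claimed = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    two_hours_later = claimed + timedelta(hours=2)
    print(looks_lost(claimed, finished=False, now=two_hours_later))
```

With the default of 3600 seconds, a run claimed at 12:00 that has still not reported back at 14:00 would count as lost under this sketch.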

What causes lost runs?

At the time of writing, we have narrowed the causes down to the following:

  1. DB connection timeouts: when the database is under heavy load, it may take longer than expected to respond. See the docs for the recommended database specs: Requirements | OpenFn/docs
  2. Runs consuming a lot of memory: a run may sometimes use too much memory and take the whole worker down.
  3. Timeouts when calling fetch:credential: in older versions, a timeout would sometimes occur when fetching a credential during a run due to misconfiguration. This bug occurred in ws-worker:v1.13.0 and has been fixed in recent versions.

How to reduce lost runs

Try to keep up with the latest versions.

We are constantly improving and fixing bugs in lightning and the worker. New versions are released at least twice a month. Here’s lightning’s changelog: lightning/CHANGELOG.md at main · OpenFn/lightning · GitHub and here’s the worker’s changelog: kit/packages/ws-worker/CHANGELOG.md at main · OpenFn/kit · GitHub. You can also find updates in Product Updates - OpenFn Community

Configure lightning and the worker to tolerate slow response times from the db.

# (lightning only) controls how long (in milliseconds) lightning waits for the db to respond.
DATABASE_TIMEOUT=60000
# (lightning and worker) the maximum duration (in seconds) that workflows are allowed to run.
# Once a run is claimed, if the worker hasn't responded by the end of this period, the run is marked as lost
WORKER_MAX_RUN_DURATION_SECONDS=3600

# worker v1.14.3 and above
WORKER_CLAIM_TIMEOUT_SECONDS=3600
WORKER_MESSAGE_TIMEOUT_SECONDS=70
# worker v1.14.2 and below
WORKER_SOCKET_TIMEOUT_SECONDS=70
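
Because the timeout variable names differ between worker versions, it can help to pick them programmatically when generating an env file. The following is a minimal sketch, assuming plain `major.minor.patch` version strings (no pre-release suffixes). The helper names `parse_version` and `worker_timeout_vars` are hypothetical, and the values are simply the example settings from this post, not recommendations for every deployment.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v1.14.3' or '1.14.3' into (1, 14, 3) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def worker_timeout_vars(worker_version: str) -> dict[str, str]:
    """Return the timeout env vars that apply to the given worker version."""
    # Compare as integer tuples so that 1.9.0 sorts below 1.14.3.
    if parse_version(worker_version) >= (1, 14, 3):
        # worker v1.14.3 and above
        return {
            "WORKER_CLAIM_TIMEOUT_SECONDS": "3600",
            "WORKER_MESSAGE_TIMEOUT_SECONDS": "70",
        }
    # worker v1.14.2 and below
    return {"WORKER_SOCKET_TIMEOUT_SECONDS": "70"}


if __name__ == "__main__":
    # Print env-file lines for an older worker.
    for name, value in worker_timeout_vars("v1.14.2").items():
        print(f"{name}={value}")
```

Comparing versions as integer tuples rather than strings avoids treating `1.9.0` as newer than `1.14.3`.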
