Taylor's Friday Musings

Happy Friday!! To celebrate, here’s a note from Taylor, our CEO, that he shared on our Slack yesterday. It’s a little “rough”, but we’re sharing it (with his permission, of course) for two reasons…

First, we’re extremely proud of OpenFn v2—it’s an amazing piece of 100% free and open source software; a critical DPI “building block” that’s being used as the data-exchange/AI/automation layer in life-saving technological solutions in 40+ countries right now.

And second, we think his post does a good job summarizing why working towards and supporting the holy grail of “interoperability” is really hard, but necessary if we want this next generation of digital public infrastructure to be successful. Without further ado, here’s Taylor:

Taylor’s note

Hi @team, we’ve just found (thanks @Rory for the expert sleuthing) that we may be over-reporting lost runs. Yesterday, 2 of Pius’s runs in his African Development Bank project triggered a “lost run” alert, but were in fact quite happily completed by the worker. The final state, logs, steps, etc… everything was there, all neat and tidy. The reasons are complicated, but we know them and I’ve just pushed a fix for them that’s being reviewed as we speak.

We’re still not at zero. We have had 2 lost runs in the last 8 days. More work to do, and I’ll keep you updated.

Oh and one more thing… it’s time for a tiny little (and yes, pretty weird and nerdy) celebration for Lightning :tada: :bottle_with_popping_cork:. This week, as happens in software, we encountered a bug that led to our application being OOM-killed and restarted on a large, multi-tenant deployment. Multiple times. To really understand the :sparkles: magic :sparkles: of what happened next, you’ve got to have some context… namely that v1 never handled the kind of scale that we’re happily chugging through right now: 90k+ runs enqueued, 10 worker slots banging away concurrently. Then consider this… in v1, a crash/restart meant that we lost everything. Each time the app crashed, we’d lose every run that was in progress.

Now with Lightning, despite the app restarting 6 or 7 times over the span of an hour, you could barely see a blip in how we handled this enormous, mission-critical workload for our customers. The workers simply didn’t care. They kept on working. When Lightning rose from the dead, time and time again, they simply said, “Oh hey man, [yawn] I’ve got some absolutely critical customer run data for you [still not even looking up from their desks, while working on the next thing] and I’d appreciate it if you’d bring it over to your database.” Nobody freaked out, nobody lost a step. Lightning and the websocket workers, just straight chillin’ and getting on with business in the middle of a storm. :clap:

Oh and one more “one more thing”… automating disparate systems is almost always a storm. The systems we integrate with are always being changed. They’re being run on underfunded infra. They’re being patched (or not being patched!) and have all sorts of bugs and inconsistencies.

I used to describe OpenFn (doing high-performance data integration with international development systems) as a messenger, passing messages between two small boats on the open ocean. During a storm. Where the medium they use to pass messages is paper airplanes. So we’ve got to catch a paper airplane that someone threw at us from a small boat in the middle of the storm. Read it (maybe they’ve got terrible handwriting), translate it to some other little-known language, fold up a new paper airplane, then throw it at the other small boat and hope they catch it. It’s hard. These boats are unreliable to begin with, and they’re being tossed around by the storm. The paper gets soggy. Everything is salty. The only way they’ve got a chance in hell of making this system work is if that middle-man (OpenFn) is an ABSOLUTE. FREAKING. ROCK. :rock:

We need to be a lighthouse. Built out of tungsten. On a seamount made of gabbro that runs unbroken directly into the bedrock of the Atlantic abyssal plains somewhere on the African plate. We need to be stable as f***.

Is it fair that all this responsibility rests on us? No. But I don’t care. And our customers don’t care. This is our lot in life, and we are happy warriors. We are Sisyphus, but with a knowing smile. We’re doing the needful and we’re happy doing it. (“It’s the journey, man! :victory_hand: :dashing_away:”… But it really is.) In fact, we’re doing the needful with one hand tied behind our back because we’ve got to “move fast” and “build shiny new features” with the other. No, it’s not easy. But I think we can do it.
