I put a mostly-finished technical overview of Frank’s codebase up on github here.
Mostly focused on the design and operation of the bot itself, not the cool machine learning parts.
I don’t know quite who the audience for this is, or whether there even is one, but it feels nice to have a description of the bot written down that’s not immensely out of date.
I actually really like this API design! It’s very similar to what I would consider the “good” version of this architecture, which would use a database like Redis or similar to power a jobs queue that the ML machines can read from and write to.
How do you handle making sure that two ML machines don’t work on the same task concurrently? This is a very common issue with distributed job queues like this one. Does the main server mark a task as “handed out” when the GET /pollml from the ML machine comes in, or does it just hand out the same task to as many machines as will take it? Feels like you might end up with a lot of duplicate work if you don’t have something like this set up. I guess maybe it’ll Just Work Out if you always have “number of candidate posts generated” divisible by “number of ML machines running”? (Not sure how that works out for the score tasks though….)
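For what it’s worth, the claim-on-poll idea described in the question is easy to sketch. This is just a toy illustration under my own assumptions (the class and method names are hypothetical, not Frank’s actual API): the poll endpoint atomically marks a task as handed out, so a second machine polling afterward gets nothing instead of a duplicate.

```python
import threading
import uuid

class TaskQueue:
    """Toy sketch of claim-on-poll semantics (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}      # task_id -> task payload
        self._handed_out = {}   # task_id -> worker_id

    def submit(self, payload):
        task_id = str(uuid.uuid4())
        with self._lock:
            self._pending[task_id] = payload
        return task_id

    def poll(self, worker_id):
        """What a GET /pollml handler could do: atomically claim one
        unclaimed task so no two ML machines receive the same one."""
        with self._lock:
            for task_id, payload in self._pending.items():
                if task_id not in self._handed_out:
                    self._handed_out[task_id] = worker_id
                    return task_id, payload
        return None  # nothing available right now

q = TaskQueue()
q.submit({"kind": "score", "post": "hello"})
first = q.poll("ml-1")   # claims the task
second = q.poll("ml-2")  # None: the only task is already handed out
```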
Good question.
When there are multiple ML machines, and they’re doing the scoring tasks, they do in fact perform some redundant overlapping work. It’s suboptimal, but these tasks are relatively fast so it’s not a huge source of overhead.
The biggest time sink by far is the “write posts” task, but that is also a special case where this is a non-issue. The ML machines all receive the same instructions (“write a single post and send it over”), and it’s actively good for them to do this concurrently.
---
Now that I think about it, I realize I erred slightly in my description of this one – I’ll have to go back and edit later.
During the “write posts” task, the main process is responsible for deciding when to stop (which it communicates via “/done”). From the perspective of the bridge service, the task is just “keep writing posts until I hear /done.”
This lets me write logic in the main process that rejects some posts (like very short ones) while writing is still happening, while guaranteeing we still get N at the end, and keeping the bridge service simple.
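The main-process side of that can be sketched as a simple loop; this is my own hedged reconstruction, not the real code, and `generate` / `accept` are stand-ins for the bridge round-trip and whatever filtering logic is in play:

```python
def collect_posts(generate, accept, n):
    """Keep requesting candidate posts until n of them pass the filter.
    In the real system the loop exit would also trigger the /done signal
    to the bridge service; here we just return the kept posts."""
    kept = []
    while len(kept) < n:
        post = generate()
        if accept(post):
            kept.append(post)
    return kept

# Toy run: reject very short posts, keep going until we have 2.
candidates = iter(["hi", "a longer candidate post",
                   "another sufficiently long post"])
result = collect_posts(lambda: next(candidates),
                       lambda p: len(p) > 10,
                       2)
```

The nice property is exactly the one described above: the rejection rule lives entirely in the main process, and the writers just keep producing until told to stop.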
However, this has the annoying consequence that the bridge service learns when we’re done slightly after the decision is made, which means that it might have already sent ML machines off to write additional posts we don’t need.
To smooth over this case, I recently added a state called “almostdone,” set when we get close to N. This tells the ML machines to wait longer between each POST and the next GET, anticipating that a /done may occur in between.
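The ML-machine side of the “almostdone” backoff might look something like the loop below. Everything here is a hypothetical stand-in for the real bridge API (`get_task`, `do_work`, `post_result` are assumptions of mine); the point is just the extra delay between a POST and the next GET when the server reports the “almostdone” state:

```python
import time

def worker_loop(get_task, do_work, post_result,
                base_delay=0.0, almostdone_delay=0.01):
    """Sketch of an ML machine's write loop with 'almostdone' backoff."""
    while True:
        task = get_task()
        if task["state"] == "done":
            break  # the /done signal: stop writing posts
        post_result(do_work(task))
        # When we're close to N, wait longer before the next GET,
        # anticipating that a /done may arrive in the meantime.
        delay = (almostdone_delay if task["state"] == "almostdone"
                 else base_delay)
        time.sleep(delay)

# Toy run: two normal write tasks, one almostdone, then done.
states = iter([{"state": "write"}, {"state": "write"},
               {"state": "almostdone"}, {"state": "done"}])
results = []
worker_loop(lambda: next(states), lambda t: "post", results.append)
```

The trade-off is the one named above: the backoff doesn’t eliminate wasted writes after the Nth accepted post, it just makes the window for them smaller.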
