Machine Affinity with LiveView and fly-replay
I’ve been building an application to exercise some features of Phoenix/LiveView and Fly Machines. It’s a button.
You push the button. It makes an API call. A new virtual machine pops up on some Fly.io metal in the data centre nearest to you.
Once the new VM is up, the LiveView redirects your browser to it. Your very own Fly Machine serves you some content, then shuts itself down and gets destroyed forever.
The whole schtick collapses unless I can ensure that each button-pusher interacts only with the Machine they launched.
This looks like a type of session affinity, or sticky sessions, problem.
The key to sticky sessions within a Fly app is fly-replay
Backing up: the Fly.io load balancer is a Rust program called fly-proxy. When you visit an app’s public URL, fly-proxy relays your request from the edge to a Machine with a service configured on the right port and the concurrency headroom to fulfil it. In the simplest case, that’s the closest such Machine in the app. Machine state and autostart config also factor in, when they’re relevant.
There’s no way to tell fly-proxy beforehand that it should deliver an HTTP request to Machine B with a service exposed on port 80, and not the nearer Machine A in the same app with a service on port 80. But application code can examine the request and respond with a fly-replay header, telling fly-proxy to deliver the request again: to a different app, to a specific region, or to anywhere but this Machine. fly-proxy then applies its load-balancing logic within those constraints. Or we can pin the replay down to a specific Machine.
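To make those options concrete, here is a small sketch of how the header value could be assembled. The field names (app, region, elsewhere, instance) are the ones fly-proxy documents for fly-replay; the FlyReplayHeader module and the example IDs are made up for illustration.

```elixir
defmodule FlyReplayHeader do
  @moduledoc "Hypothetical helper: builds the value of a fly-replay response header."

  # The fly-replay fields this sketch knows about; anything else is dropped.
  @fields [:app, :region, :instance, :elsewhere]

  @doc "Joins the given fields as key=value pairs separated by semicolons."
  def value(targets) when is_list(targets) do
    targets
    |> Keyword.take(@fields)
    |> Enum.map_join(";", fn {key, val} -> "#{key}=#{val}" end)
  end
end

# Example IDs below are invented.
FlyReplayHeader.value(app: "another-app")          # to a different app
FlyReplayHeader.value(region: "syd")               # to a specific region
FlyReplayHeader.value(elsewhere: true)             # to anywhere but this Machine
FlyReplayHeader.value(instance: "0123456789abcd")  # to one specific Machine
```

The last form, instance=<machine id>, is the one that matters for the rest of this post.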
This makes the proxy sort-of programmable without actually making it programmable.
An app per customer may be better than fly-replay
If the Closest Machine is frequently the Wrong Machine, and you’re constantly replaying requests, it may be worth putting each distinct Machine into its own Fly app with its own .fly.dev URL.
If you’re isolating workloads/users on their own VMs for security reasons, putting them into their own apps too is even better, because you can wall apps off into separate custom IPv6 private networks.
In my case, though, fly-replay is just the ticket. fly-proxy has a high chance of hitting the correct Machine on the first try. The whole visitor experience is short and I don’t want to bog it down with more API operations and waiting for DNS. (And while I could generate disposable app names with a vanishing likelihood of collision, burning globally unique app names just feels stinky.)
So from here I just have to implement the logic in my LiveView app to issue a fly-replay for every request that needs one. It’s not quite straightforward.
Peter Ullrich solved this exact thing already
As it happens, Peter Ullrich wrote an article on using fly-replay for sticky sessions on Fly.io with a Phoenix/LiveView application. It’s pretty neat, and he explains both the why and the how; I’ll recap what stood out to me, but you should read his version for the good stuff.
Ullrich wrote a module plug that checks each incoming request for a query parameter, or failing that, a cookie, matching the ID of the current Machine. If the query parameter matches, it puts that into a session cookie so that it’s passed in with all subsequent requests from that client (until that cookie gets changed). If there’s a parameter or a cookie, but it doesn’t match, the plug responds with a fly-replay header and a redirect status (307).
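A plug along those lines might look roughly like the sketch below. This is not Ullrich’s code: the module name, the instance query parameter and the fly-machine-id cookie name are all placeholders, and it assumes a plain browser-session cookie set with put_resp_cookie/3.

```elixir
defmodule MyAppWeb.Plugs.FlyReplay do
  @moduledoc "Sketch of a sticky-session plug. Names are placeholders, not Ullrich's."
  @behaviour Plug
  import Plug.Conn

  # Hypothetical names for the query parameter and the cookie.
  @param "instance"
  @cookie "fly-machine-id"

  @impl true
  def init(opts), do: opts

  @impl true
  def call(conn, _opts) do
    conn = conn |> fetch_query_params() |> fetch_cookies()
    this_machine = System.get_env("FLY_MACHINE_ID")
    param = conn.query_params[@param]
    cookie = conn.req_cookies[@cookie]

    cond do
      # The parameter names this Machine: remember it for later requests.
      is_binary(param) and param == this_machine ->
        put_resp_cookie(conn, @cookie, param)

      # The parameter names some other Machine: send the request there.
      is_binary(param) ->
        replay(conn, param)

      # No parameter, but the cookie names some other Machine.
      is_binary(cookie) and cookie != this_machine ->
        replay(conn, cookie)

      # Cookie matches, or there is nothing to go on: carry on as usual.
      true ->
        conn
    end
  end

  # Ask fly-proxy to deliver this request again, to one specific Machine.
  defp replay(conn, machine_id) do
    conn
    |> put_resp_header("fly-replay", "instance=#{machine_id}")
    |> send_resp(307, "")
    |> halt()
  end
end
```

The parameter is checked before the cookie, so a fresh link to a new Machine wins over a stale cookie from an earlier visit.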
The tricky part in a LiveView application is getting every request to go through that plug. WebSocket connection requests don’t go through the router, nor through the regular plugs in the endpoint, so just adding a plug in either of those modules doesn’t cut it.
To ensure everything, everything, goes through a particular plug, you can override the definition of the endpoint’s Plug.call/2 callback to run through that plug first.
The plug either issues a fly-replay or dumps the conn into the front end of the stock endpoint to be handled in the usual way.
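As a sketch of that override, assuming a plug module called MyAppWeb.Plugs.FlyReplay that halts the conn when it replays (both the app name and the plug name are placeholders), the endpoint could look something like this:

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # Run the sticky-session plug before anything else the endpoint does,
  # socket dispatch included.
  def call(conn, opts) do
    conn = MyAppWeb.Plugs.FlyReplay.call(conn, MyAppWeb.Plugs.FlyReplay.init([]))

    if conn.halted do
      # The plug already responded with a fly-replay; stop here.
      conn
    else
      # Hand the conn to the stock endpoint pipeline.
      super(conn, opts)
    end
  end

  socket "/live", Phoenix.LiveView.Socket

  # ... the rest of the generated endpoint plugs go here ...

  plug MyAppWeb.Router
end
```

The super/2 call works because the call/2 that use Phoenix.Endpoint brings in (by way of Plug.Builder) is overridable.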
Again, I didn’t figure this out myself; go read the original post.
I tried to squirm out of it
When I first read Ullrich’s article, I refused to believe that my use case wasn’t simpler.
I thought I should be able to start with something like a vanilla LiveView authentication/authorization setup and simplify that down to some path-based logic with a live_session and an on_mount hook to gate access to the LiveView and fly-replay any requests for a path indicating a different Machine.
This almost worked!
Just kidding; of course it didn’t.
An on_mount hook can’t manipulate connections the way a plug can, and it can’t send a fly-replay header. It can do a redirect to a route in the router module, though.
For a moment I thought it would be clever to get the on_mount callback to check the path against the FLY_MACHINE_ID environment variable (this part is fine) and if needed, cycle the connection back to the router and through a plug, which can set headers and respond.
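For illustration only, a hook of that shape might look like the following. Everything named here is hypothetical: the module, the machine_id path parameter, and the /replay/:machine_id route, which stands in for a router route whose plug would set the fly-replay header.

```elixir
defmodule MyAppWeb.MachineGate do
  @moduledoc "Sketch of the dead-end on_mount hook; all names are hypothetical."
  import Phoenix.LiveView, only: [redirect: 2]

  # Assumes every route in the live_session carries a :machine_id path parameter.
  def on_mount(:default, %{"machine_id" => wanted}, _session, socket) do
    if wanted == System.get_env("FLY_MACHINE_ID") do
      # Right Machine: let the LiveView mount.
      {:cont, socket}
    else
      # Wrong Machine: bounce through a router route whose plug can set headers.
      {:halt, redirect(socket, to: "/replay/#{wanted}")}
    end
  end
end
```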
If it’s not obvious, what this accomplishes, when you hit the wrong Machine, is an infinite loop of fly-replay “redirects”.
Getting a LiveView rendered and connected involves two requests, running its mount function twice. The first on_mount catches a request meant for another Machine, and replays to that Machine. But the WebSocket upgrade needs one more HTTP request.
If fly-proxy thought the wrong Machine was a good choice the first time, it probably thinks the same thing this time too, and sends the upgrade request there, where the on_mount callback sends it back to the router on the right Machine, where we start all over again.
Tweaks
I used Ullrich’s solution wholesale, with two adjustments to the plug:
- I added a condition to let requests to /health pass; localhost needs to reach it. A plug in the router pipeline blocks non-localhost connections.
- Instead of passing all connections that don’t match one of the conditions, I block them. There’s no resource that you should be looking for on this Machine if it’s not the Machine you created.
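Boiled down to a pure function, the decision order after those two changes could be modelled like this. It’s a model of the logic, not the plug itself; the StickyDecision module and the return values are invented for the sketch.

```elixir
defmodule StickyDecision do
  @moduledoc "Hypothetical model of the plug's decision order; not the real plug."

  @doc """
  `path` is the request path; `param` and `cookie` are the Machine IDs found in
  the query string and the cookie (or nil); `this_machine` is FLY_MACHINE_ID.
  """
  # Health checks always get through to the router.
  def decide("/health", _param, _cookie, _this_machine), do: :pass

  # A query parameter wins over a cookie.
  def decide(_path, param, _cookie, this_machine) when is_binary(param) do
    if param == this_machine, do: {:set_cookie, param}, else: {:replay, param}
  end

  # No parameter: fall back to the cookie.
  def decide(_path, nil, cookie, this_machine) when is_binary(cookie) do
    if cookie == this_machine, do: :pass, else: {:replay, cookie}
  end

  # Neither a parameter nor a cookie: nothing here is meant for this visitor.
  def decide(_path, nil, nil, _this_machine), do: :block
end
```

For example, StickyDecision.decide("/", nil, nil, "m1") returns :block, where the unmodified logic would have let the request through.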
My adaptation is in this gist.
Push the button at https://where.fly.dev. See if it works for you.