To count all Machines across all of our apps:

fly apps list --json | jq -r '.[].ID' | xargs -n 1 fly m list -q -a | awk NF | wc -l

Fly has some of its own autoscaling features, but we don't use them. (Their autoscaling only applies to process groups that serve HTTP connections, and it doesn't behave properly when websockets are mixed in.)
Our homegrown autoscaler pays attention to individual process groups. Each process group can be configured with up to three strategies:

Utilization
Aiming for 80% utilization, allowing 10% on either side of that before scaling up or down (see the sketch after this list)
Latency
Latency in excess of x results in scaling up
History
Our load patterns are very regular, and because Mechanic in particular is highly latency-sensitive, we use this strategy to scale up in anticipation of higher load based on the historical record
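To make the utilization band concrete: an 80% target with a 10% band means we scale up above 90% and scale down below 70%. A minimal illustrative sketch of that check (not our actual autoscaler code):

# illustrative band check: prints the scaling decision for a utilization sample
# usage: band_check 0.93   # => scale up
band_check() {
  local u=$1
  if (( $(echo "$u > 0.90" | bc -l) )); then echo "scale up"
  elif (( $(echo "$u < 0.70" | bc -l) )); then echo "scale down"
  else echo "hold"
  fi
}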
Scaling down is implemented as sending the "quiet" instruction to a Sidekiq process. In general, we run one Sidekiq process per Machine. Once a quieted Sidekiq process has finished its work, it's safe to stop the corresponding Machine.
Our Sidekiq leader is configured to monitor for quiet Sidekiq processes that are performing no work. Whenever such a process is detected, the leader uses flyctl to stop the corresponding Machine.
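As a rough sketch of that leader behavior (not our actual implementation; assumes each Sidekiq process's hostname matches its Fly Machine ID, which is Fly's default):

# find Sidekiq processes that are quiet and idle, then stop their Machines
for machine_id in $(bin/rails runner 'Sidekiq::ProcessSet.new.each { |p| puts p["hostname"] if p.stopping? && p["busy"] == 0 }'); do
  fly m stop "$machine_id" -a "$FLY_APP_NAME"
done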
We don't have this implemented for web stuffs yet. We're just very over-provisioned, instead. :)
This command generates restart commands. If you copy and execute its output, you'll restart all of an app's Fly Machines individually and in parallel. Watch for failures — it's on you to address them.

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

Or, because Isaac just found out about pbcopy:

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }' | pbcopy

I couldn't get the above to work while also showing status/results of each restart, so this is Jed's version of it:

fly m list -q -a $APP | xargs -P500 -n1 fly m restart

And to generate restart commands for a single process group:

fly m list -a $APP | grep $GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

GitHub is the source of truth for our environment variables, whether they be sensitive "secrets" or less sensitive "variables".
Fly has its own secret store, which contains protected values to be used as environment variables on deployed Machines. We use Fly's secret store to get our secrets onto deployed Machines, but it is not the source of truth for those values. Instead, we use Fly's secret store as an automatically-maintained mirror of whatever GitHub secrets and variables are in effect for a given environment.
In GitHub, secrets and variables can live at any of the following levels. Each subsequent level inherits from the preceding levels, overriding them in case of conflict.
The organization level
The repo level, within the org
The environment level, within the repo
Secrets are populated automatically, during a repo-level GitHub workflow. Every deployable repo has its own fly-secrets.yml workflow.
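The net effect is roughly this, per secret (a sketch, with a hypothetical secret name; the real workflow iterates over everything in effect for the environment):

# mirror one GitHub-managed value into Fly's secret store;
# --stage defers the change to the next deploy instead of restarting Machines
fly secrets set -a $FLY_APP_NAME --stage SOME_API_KEY="$SOME_API_KEY"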
Authorization tokens are strings used to identify and authorize us to some external service.
Locate the external service's config area for the token in question.
Example: FLY_API_TOKEN comes from the "Tokens" config, within a Fly app
Locate the secret's canonical location within GitHub.
Without revoking the old token, generate a new token for the secret with the vendor.
Copy the new token value, and update the corresponding GitHub secret (see the sketch after this list).
Deploy to whatever deployment environments receive and use this secret.
Verify that the new token is working in its deployed environment(s).
Revoke the original token.
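For the update-the-GitHub-secret step, the gh CLI is one option (a sketch; assumes the secret lives at the environment level, run from a checkout of the repo):

# update an environment-level GitHub secret with the newly-generated token
gh secret set FLY_API_TOKEN --env production --body "$NEW_TOKEN"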
We deploy our stuff on Fly. (We ran on Heroku for more than a decade, but its spirit appears to have moved on, and the energy I'm chasing appears to be going by the name "Fly" these days.)

Our heavy-hitting projects (Locksmith and Mechanic) each get two Fly apps per environment*: a UI app, and an API app.

*"Environment" isn't a Fly term. Each of our projects has a production environment, a staging environment, and maybe a handful of others. We construct an environment out of specifically-provisioned Fly apps, Crunchy Bridge databases, and whatever other services are warranted.

Fly is fantastic. Super happy to be on it.
These are the rough edges we've bumped up against, and (when applicable) how we handle them.
auto-stop
doesn't seeeeeem to work properly when websockets are in the mix

restart
doesn't support --process-group
workaround (including backgrounding each Machine's individual restart command):
fly m list -a $APP | grep $PROCESS_GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'
slow for restarting large numbers of Machines, and halts if any individual restart fails
workaround: use fly m restart $ID & instead; addressed in Restarting apps

status
no machine-readable output; we regex our way through it to get Machine status
nb: --display-config exists, but that's for something else
doesn't include healthchecks
workaround: fly checks list -a $app | grep $machine_id

count
it seems to grab a lease on all Machines at once, even when scoped by --process-group, which means fly scale count commands can't be run concurrently
no workaround
A rough edge: fly ssh console doesn't support addressing a specific Machine.

$ fly ssh console -a $FLY_APP_NAME -s
$ bin/rails c

This will display an interactive list of Machines to choose from. Good for small numbers of Machines, not great for large ones.

When an app has hundreds of Machines, it's faster on average to just look up the IP address of the desired Machine and pass that back to fly ssh console.

# get the Machine's IPv6 address
$ fly m status $MACHINE_ID

# use that address here
$ fly ssh console -a $FLY_APP_NAME -A $IP_ADDRESS
Let's say you have an image constructed from .. who knows where.
Let's say you have a repo that uses a given Fly app to do a fly deploy --build-only thing, prepping an image for use elsewhere.
Let's say you want to run a console using that image in a Fly app environment which is destined to receive that image (i.e. destined to have its machines updated to use this image). Let's say you want to do this before that glorious destiny arrives. Maybe you want to run some helpers that this image contains, or maybe you want to run a migration that this image contains, or or or or or or.
Assuming the build happened using --image-label $IMAGE_TAG, this may help you on your quest:

fly console -a $EXALTED_APP_NAME -i registry.fly.io/$HUMBLE_APP_NAME:$IMAGE_TAG
You might need to destroy the Fly builder app. It'll get auto-created again when you retry, which is what you should do after destroying the builder app.
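Something like this (a sketch; the builder app's actual name will differ):

# find the builder app; its name starts with "fly-builder-"
fly apps list | grep fly-builder
# destroy it; a fresh one is auto-created on the next build
fly apps destroy $BUILDER_APP_NAME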
Just retry. It's fine. :)
Start by surveying the scene, to see how many machines are on the new image vs the old one, or in replacing vs failed vs created status.
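If your flyctl supports JSON output here, this is one way to get that tally (a sketch; field names per current flyctl output):

# count Machines by state and image
fly m list -a $FLY_APP_NAME --json | jq -r '.[] | "\(.state) \(.config.image)"' | sort | uniq -c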
If you're here, the app is probably online but no longer processing background jobs (because all the Sidekiq processes were instructed to enter quiet mode during the release command).
Handle this by rebooting one of the worker_autoscale machines. That should be enough to start bringing machines back online.
Once you've verified that the app is doing work again, wait for it to catch up on the run backlog, and then retry the deploy.
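To watch the backlog drain, something like this works (a sketch; Sidekiq::Stats is Sidekiq's API, and -C runs a one-off command over SSH):

# check how many jobs are still enqueued
fly ssh console -a $FLY_APP_NAME -C "bin/rails runner 'puts Sidekiq::Stats.new.enqueued'"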
Manually redo the deploy.
Do this using a "Manual deploy 🛠️" workflow run, using the Docker image URI from the build step.
Manually update the rest of the machines.
Start by examining fly m list -a $FLY_APP_NAME, and build a list of machine IDs that are stuck on the old image.
For each one, do something like this:
fly m destroy MACHINE_ID
Add --force if the machine is stubborn and won’t stop.
and then use fly scale count to scale back up to the desired Machine count. Search fly scale count in the internal Slack and you'll see example usage (or see the sketch below).
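For instance (group name and count are hypothetical):

# scale the worker process group back up to 20 Machines
fly scale count worker=20 -a $FLY_APP_NAME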
Human autonomy and responsibility go hand in hand.
Our deploy practices reflect this, by acknowledging that there are some scenarios in which human autonomy is necessary, and ensuring that the human (1) can be nimbly responsive in those scenarios, and (2) is fully responsible for what happens in those scenarios.
If we have a situation where we actively don't want a human to be responsible, we also take away human autonomy. You can't mess around in a place where you're not responsible for the results.
Fly monitors its own ability to deploy well. :) (Thanks Fly!) See https://atc.fly.dev/.
Our regular deploys are all initiated through GitHub Actions.
To initiate a regular deploy to a production environment, we publish a new repo release. This manual action kicks off an automatic Actions workflow, which invokes flyctl deploy.
Our releases are auto-prepped using Release Drafter. This means that publishing a new release is as simple as editing the latest release draft, and hitting the big green "Publish release" button.
Regular deploys to non-production environments are triggered however's appropriate. Usually, it happens via a push to main, which kicks off an Actions workflow, which invokes flyctl deploy.
Each repo has two GHA workflows that can be manually called through the GitHub UI: one called "Manual secrets 🛠️", and one called "Manual deploy 🛠️".
Use these as needed.
This should reeeeeeaally only ever be done in an emergency situation. If you're reaching for this in a non-emergency, take a minute first, and have a think on why you're here.
Some of our apps are on the larger end. Mechanic uses upwards of 500 Machines, for example. Lots of things can go wrong. Here's some documentation on that:
We use "immediate" in environments where deploys are manually initiated, and "bluegreen" wherever deploys are automatically initiated.
Immediate deploys finish quickly, but the actual Machine updates happen asynchronously, and may take longer. Usually they're fast, but I've seen them take more than 15min on occasion.
Our GitHub org has an org-level variable in place: FLY_DEPLOY_STRATEGY=bluegreen. This makes it the default value for all repos and their environments.
Each repository's production environment has an env-level variable in place: FLY_DEPLOY_STRATEGY=immediate. This makes it the effective value for that environment, and that environment alone.
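In the shared deploy workflow, that variable presumably ends up on the flyctl invocation as the strategy flag, along the lines of:

flyctl deploy --app $FLY_APP_NAME --strategy "$FLY_DEPLOY_STRATEGY"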
Fly supports "release commands", which are automatically invoked during deploy, right before updating Machines with new images.
In apps that run Sidekiq, we use this feature to issue "quiet" commands to all of our Sidekiq processes.
Once this happens, no new jobs will be picked up. Jobs will be automatically resumed as Machines come back online after the deploy.
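A sketch of what that quieting amounts to (our bin/rake sidekiq:quiet task, shown in the fly.toml below, may differ in detail):

# ask every Sidekiq process to stop picking up new jobs
bin/rails runner 'Sidekiq::ProcessSet.new.each(&:quiet!)'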
To find and restart a worker Machine:

$ fly m list -a $FLY_APP_NAME
$ fly m list -a $FLY_APP_NAME | grep worker_autoscale
$ fly m restart MACHINE_ID

To manually update a single Machine to a new image:

fly m update 328756e9f52758 \
  --env RELEASE_LABEL=v62-3-ga38cb23a \
  --image registry.fly.io/$FLY_APP_NAME:locksmith-api.v62-3-ga38cb23a

To manually redo a deploy using a prebuilt image:

flyctl deploy \
  --app $FLY_APP_NAME \
  --strategy immediate \
  --env RELEASE_LABEL=v37 \
  --image registry.fly.io/$FLY_APP_NAME:$REPO_NAME.v37 \
  --update-only

And the fly.toml configuration for the Sidekiq-quieting release command:

[deploy]
# deployment is done (and configured) via shared workflow. see:
# https://github.com/lightward/.github-private/blob/main/.github/workflows/fly-deploy.yml
# except for this part, where we have an app-specific interest in quieting sidekiq before release
release_command = "bin/rake sidekiq:quiet"