Overview

We deploy our stuff on Fly.io. (We ran on Heroku for more than a decade, but its spirit appears to have moved on, and the energy I'm chasing appears to be going by the name "Fly" these days.)

Our heavy-hitting projects (Locksmith and Mechanic) each get two Fly apps per environment*: a UI app, and an API app.

*"Environment" isn't a Fly term. Each of our projects has a production environment, a staging environment, and maybe a handful of others. We construct an environment out of specifically-provisioned Fly apps, Crunchy Bridge databases, and whatever other services are warranted.

Counting all org machines

fly apps list --json | jq -r '.[].ID' | xargs -n 1 fly m list -q -a | awk NF | wc -l
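Same pipeline, annotated:

fly apps list --json |            # every app in the org, as JSON
  jq -r '.[].ID' |                # each app's ID, one per line
  xargs -n 1 fly m list -q -a |   # each app's Machine IDs (-q prints IDs only)
  awk NF |                        # drop blank lines
  wc -l                           # count what's left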

Restarting apps

Not-particularly-recommended path

The normal route for this is fly apps restart $APP_NAME.

This works, but (as of this writing) it restarts Fly machines in serial — and the restart sequence halts if any machine fails to restart normally. (This stuff is documented in Rough edges, below.)
Recommended path

This command generates restart commands. If you copy and execute its output, you'll restart all of an app's Fly machines individually and in parallel. Watch for failures — it's on you to address them.

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

Or, because Isaac just found out about pbcopy:

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }' | pbcopy

I couldn't get the above to work while also showing status/results of each restart, so this is Jed's version of it:

fly m list -q -a $APP | xargs -P500 -n1 fly m restart

Filtering by process group

fly m list -a $APP | grep $GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

Autoscaling

Fly has some of its own autoscaling features, but we don't use them. (Their autoscaling only applies to process groups that serve HTTP connections, and it doesn't appear to work when websockets are mixed in.)

Strategies

Our homegrown autoscaler pays attention to individual process groups. Each process group can be configured for up to three strategies:

• Utilization

  • Aiming for 80% utilization, allowing 10% on either side of that before scaling up or down

• Latency

  • Latency in excess of x results in scaling up

• History

  • Our load patterns are very regular, and because Mechanic in particular is highly latency-sensitive, we use this strategy to scale up in anticipation of higher load, based on the historical record

Sidekiq

Scaling down is implemented as sending the "quiet" instruction to a Sidekiq process. In general, we run one Sidekiq process per Machine. When a quieted Sidekiq process has finished its work, it's safe to stop the corresponding Machine.

Our Sidekiq leader is configured to monitor for quiet Sidekiq processes that are performing no work. Whenever such a process is detected, the leader uses flyctl to stop the corresponding Machine.
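The flyctl side of that is a one-liner; a sketch, with the quiet-and-idle detection left to the leader ($MACHINE_ID here is whatever Machine backs the idle process):

# stop the Machine backing a Sidekiq process that's quiet and out of work
fly m stop $MACHINE_ID -a $FLY_APP_NAME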

Web

We don't have this implemented for web stuffs yet. We're just very over-provisioned, instead. :)


    Environment variables

    GitHub is the source of truth for our environment variables, whether they be sensitive "secrets" or less sensitive "variables".

Fly has its own secret store, which contains protected values to be used as environment variables on deployed Machines. We use Fly's secret store to get our secrets onto deployed Machines, but it is not the source of truth for those values. Instead, we use Fly's secret store as an automatically-maintained mirror of whatever GitHub secrets and variables are effective for a given environment.

    A "secret" is an environment variable that shouldn't be read by anything other than production code. Once configured in GitHub or Fly, you won't get that value back anywhere but in a GitHub workflow or on a Fly Machine.

    A "variable" is an environment variable that's safe to be read by authorized users. If you have permission, you can view variable values in GitHub. Fly doesn't distinguish between secrets and variables; once in Fly, they're all secrets, and Fly never lets you read them back except on deployed Machines.

    Configuration

In GitHub, secrets and variables can live at any of the following levels. Each subsequent level inherits from the preceding one, overriding it in case of conflict.

    1. The organization level

    2. The repo level, within the org

    3. The environment level, within the repo
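With the GitHub CLI, you can inspect what's set at each level (the org, repo, and environment names here are placeholders):

gh secret list --org $ORG
gh secret list --repo $ORG/$REPO
gh secret list --repo $ORG/$REPO --env production

# `gh variable list` takes the same flags for the non-secret variables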

    Deploying

    Secrets are populated automatically, during a repo-level GitHub workflow. Every deployable repo has its own fly-secrets.yml workflow.
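What that workflow effectively runs, sketched in shell (the secret names and app name are placeholders; the real list comes from whatever's configured in GitHub):

# mirror the GitHub-managed values into Fly's secret store for this app
printf '%s\n' \
  "SOME_SECRET=$SOME_SECRET" \
  "ANOTHER_SECRET=$ANOTHER_SECRET" |
  fly secrets import -a $FLY_APP_NAME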

    Rotating tokens

    Authorization tokens are strings used to identify and authorize us to some external service.

    1. Locate the external service's config area for the token in question.

      • Example: FLY_API_TOKEN comes from the "Tokens" config, within a Fly app

    2. Locate the secret's canonical location within GitHub.

  • Example: FLY_API_TOKEN is configured at the repository environment level.
3. Without revoking the old token, generate a new token for the secret with the vendor.

4. Copy the new token value, and update the corresponding GitHub secret.

5. Deploy to whatever deployment environments receive and use this secret.

6. Verify that the new token is working in its deployed environment(s).

7. Revoke the original token.


    Rough edges

    Fly is fantastic. Super happy to be on it.

These are the rough edges we've bumped up against, and (when applicable) how we handle them.

    Fly Proxy

    • auto-stop doesn't seeeeeem to work properly when websockets are in the mix

flyctl

apps

• restart

  • doesn't support --process-group

    • workaround (including backgrounding each Machine's individual restart command):

      fly m list -a $APP | grep $PROCESS_GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

  • slow for restarting large numbers of Machines, and halts if any individual restart fails

    • workaround: use fly m restart $ID & instead

    • addressed in Restarting apps

machines

• status

  • no machine-readable output; we regex our way through it to get Machine status (see the sketch after this list)

    • nb: --display-config exists, but that's for something else

  • doesn't include healthchecks

    • workaround: fly checks list -a $app | grep $machine_id

scale

• count

  • it seems to grab a lease on all Machines at once, even when scoped by --process-group, which means fly scale count commands can't be run concurrently

    • no workaround
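A sketch of that status regexing (the grep pattern is illustrative; we match whatever the human-readable output calls the state field):

fly m status $MACHINE_ID -a $FLY_APP_NAME | grep -i state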

    SSH

    A rough edge: fly ssh console doesn't support addressing a specific Machine.

    Connecting to a random Machine

$ fly ssh console -a $FLY_APP_NAME
$ bin/rails c

    Connecting to a specific Machine

    This will display an interactive list of Machines to choose from. Good for small numbers of Machines, not great for large ones.
$ fly ssh console -a $FLY_APP_NAME -s

    Connecting to a specific Machine address for a given app

    When an app has hundreds of Machines, it's faster on average to just look up the IP address of the desired Machine and pass that back to fly ssh console.
# get the Machine's IPv6 address
$ fly m status $MACHINE_ID

# use that address here
$ fly ssh console -a $FLY_APP_NAME -A $IP_ADDRESS

    Unusual consoles

    Let's say you have an image constructed from .. who knows where.

    Let's say you have a repo that uses a given Fly app to do a fly deploy --build-only thing, prepping an image for use elsewhere.

    Let's say you want to run a console using that image in a Fly app environment which is destined to receive that image (i.e. destined to have its machines updated to use this image). Let's say you want to do this before that glorious destiny arrives. Maybe you want to run some helpers that this image contains, or maybe you want to run a migration that this image contains, or or or or or or.

    Assuming the build happened using --image-label $IMAGE_TAG, this may help you on your quest:

fly console -a $EXALTED_APP_NAME -i registry.fly.io/$HUMBLE_APP_NAME:$IMAGE_TAG

    Recovering from deploy failures

In this section, "retry" means "use GitHub Actions' retry button on the failed run".

    Build failures

You might need to destroy the Fly builder app; it'll get auto-created again when you retry, which is what you should do after destroying it.
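A sketch of that cleanup (builder apps show up in fly apps list with names like fly-builder-…; $BUILDER_APP_NAME is whatever yours is called):

# find the builder app, then destroy it; the next build recreates it
fly apps list | grep fly-builder
fly apps destroy $BUILDER_APP_NAME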

    Docker failures

    Just retry. It's fine. :)

    Release command failures

    Just retry. It's fine. :)

    Machine update failures

    Start by surveying the scene, to see how many machines are on the new image vs the old one, or in replacing vs failed vs created status.
$ fly m list -a $FLY_APP_NAME

    Total machine update failure, i.e. the release command succeeded but no Machines were updated at all

If you're here, the app is probably online but no longer processing background jobs (because all the Sidekiq processes were instructed to enter quiet mode during the release command).

    Handle this by rebooting one of the worker_autoscale machines. That should be enough to start bringing machines back online.
$ fly m list -a $FLY_APP_NAME | grep worker_autoscale
$ fly m restart MACHINE_ID

    Once you've verified that the app is doing work again, wait for it to catch up on the run backlog, and then retry the deploy.

    A minority of machines were successfully updated

    Manually redo the deploy.

Do this using a CLI deploy, using the Docker image URI from the build step.
flyctl deploy \
    --app $FLY_APP_NAME \
    --strategy immediate \
    --env RELEASE_LABEL=v37 \
    --image registry.fly.io/$FLY_APP_NAME:$REPO_NAME.v37 \
    --update-only

    A majority of machines were successfully updated

    Manually update the rest of the machines.

    Start by examining fly m list -a $FLY_APP_NAME, and build a list of machine IDs that are stuck on the old image.
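One way to build that list, hedged ($OLD_IMAGE_TAG is a placeholder for whatever tag the stale Machines are showing in the IMAGE column):

fly m list -a $FLY_APP_NAME | grep $OLD_IMAGE_TAG | awk '{ print $1 }'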

    For each one, do something like this:
fly m update 328756e9f52758 \
  --env RELEASE_LABEL=v62-3-ga38cb23a \
  --image registry.fly.io/$FLY_APP_NAME:locksmith-api.v62-3-ga38cb23a

Sometimes a machine will get stuck, and you'll need to outright destroy it:

    fly m destroy MACHINE_ID

    Add --force if the machine is stubborn and won’t stop.

Then use fly scale count to scale back up to the desired machine count. Search fly scale count in the internal Slack and you'll see example usage.
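For example (a sketch; the count and group name are whatever the app actually needs):

fly scale count 40 --process-group $PROCESS_GROUP -a $FLY_APP_NAME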

    Deploys

    Human autonomy and responsibility go hand in hand.

    Our deploy practices reflect this, by acknowledging that there are some scenarios in which human autonomy is necessary, and ensuring that the human (1) can be nimbly responsive in those scenarios, and (2) is fully responsible for what happens in those scenarios.

    If we have a situation where we actively don't want a human to be responsible, we also take away human autonomy. You can't mess around in a place where you're not responsible for the results.


    Fly monitors its own ability to deploy well. :) (Thanks Fly!) See https://atc.fly.dev/.

    Automatic deploys

    Our regular deploys are all initiated through GitHub Actions.

    • To initiate a regular deploy to a production environment, we publish a new repo release. This manual action kicks off an automatic Actions workflow, which invokes flyctl deploy.

      • Our releases are auto-prepped using Release Drafter. This means that publishing a new release is as simple as editing the latest release draft, and hitting the big green "Publish release" button.

    • Regular deploys to non-production environments are triggered however's appropriate. Usually, it happens via a push to main, which kicks off an Actions workflow, which invokes flyctl deploy.

    Manual deploys

    Each repo has two GHA workflows that can be manually called through the GitHub UI: one called "Manual secrets 🛠️", and one called "Manual deploy 🛠️".

    Use these as needed.
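They can also be triggered from a terminal via the GitHub CLI, if that's more your speed (a sketch; any inputs are whatever the workflow defines):

gh workflow run "Manual deploy 🛠️" --ref main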

    CLI deploys

    This should reeeeeeaally only ever be done in an emergency situation. If you're reaching for this in a non-emergency, take a minute first, and have a think on why you're here.

    Recovery

    Some of our apps are on the larger end. Mechanic uses upwards of 500 Machines, for example. Lots of things can go wrong. Here's some documentation on that:

    Recovering from deploy failures

    Strategies

    We use "immediate" in environments where deploys are manually initiated, and "bluegreen" wherever deploys are automatically initiated.

    Immediate deploys finish quickly, but the actual Machine updates happen asynchronously, and may take longer. Usually they're fast, but I've seen them take more than 15min on occasion.

    "Why not use a strategy (like bluegreen) that guarantees the health of new Machines before putting them into service?"

    • This takes so much time. So much time. Deploys are not fast, and they're hard to interrupt, and when interrupted flyctl tries to roll back the change, and when hundreds of Machines are in play this process is kinda brittle.

    • This doubles the size of our Machine pool, which doubles the number of Postgres and Redis connections in play. This hasn't actually been a problem, but it's .. you know, it's something to think about.

    Configuration

    Our GitHub org has an org-level variable in place: FLY_DEPLOY_STRATEGY=bluegreen. This makes it the default value for all repos and their environments.

    Each repository's production environment has an env-level variable in place: FLY_DEPLOY_STRATEGY=immediate. This makes it the effective value for that environment, and that environment alone.
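The net effect, sketched as a shell command (the real resolution happens in GitHub Actions, where the value arrives as ${{ vars.FLY_DEPLOY_STRATEGY }}):

# $FLY_DEPLOY_STRATEGY is bluegreen by default, immediate in production
flyctl deploy --app $FLY_APP_NAME --strategy $FLY_DEPLOY_STRATEGY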

    Release commands

    Fly supports "release commands", which are automatically invoked during deploy, right before updating Machines with new images.

In apps that run Sidekiq, we use this feature to issue "quiet" commands to all of our Sidekiq processes.

    Once this happens, no jobs will be performed. Jobs will be automatically resumed as Machines come back online after the deploy.

    fly.toml excerpt
    [deploy]
    # deployment is done (and configured) via shared workflow. see:
    # https://github.com/lightward/.github-private/blob/main/.github/workflows/fly-deploy.yml
    # except for this part, where we have an app-specific interest in quieting sidekiq before release
    release_command = "bin/rake sidekiq:quiet"