Overview

We deploy our stuff on Fly.io. (We ran on Heroku for more than a decade, but its spirit appears to have moved on, and the energy I'm chasing appears to be going by the name "Fly" these days.)

Our heavy-hitting projects (Locksmith and Mechanic) each get two Fly apps per environment*: a UI app, and an API app.

*"Environment" isn't a Fly term. Each of our projects has a production environment, a staging environment, and maybe a handful of others. We construct an environment out of specifically-provisioned Fly apps, Crunchy Bridge databases, and whatever other services are warranted.

Counting all org machines

fly apps list --json | jq -r '.[].ID' | xargs -n 1 fly m list -q -a | awk NF | wc -l
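Same pipeline, annotated:

fly apps list --json |            # every app in the org, as JSON
  jq -r '.[].ID' |                # each app's ID, one per line
  xargs -n 1 fly m list -q -a |   # each app's Machine IDs (-q prints IDs only)
  awk NF |                        # drop blank lines
  wc -l                           # count what's left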

Restarting apps

Not-particularly-recommended path

The normal route for this is fly apps restart $APP_NAME.

This works, but (as of this writing) it restarts Fly machines in serial — and the restart sequence halts if any machine fails to restart normally. (This stuff is documented in Rough edges, below.)
Recommended path

This command generates restart commands. If you copy and execute its output, you'll restart all of an app's Fly machines individually and in parallel. Watch for failures — it's on you to address them.

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

Or, because Isaac just found out about pbcopy:

fly m list -q -a $APP | awk NF | awk '{ print "fly m restart " $1 " &;" }' | pbcopy

I couldn't get the above to work while also showing status/results of each restart, so this is Jed's version of it:

fly m list -q -a $APP | xargs -P500 -n1 fly m restart

Filtering by process group

fly m list -a $APP | grep $GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

Autoscaling

Fly has some of its own autoscaling features, but we don't use them. (Their autoscaling only applies to process groups that serve HTTP connections, and it doesn't appear to work when websockets are mixed in.)

Strategies

Our homegrown autoscaler pays attention to individual process groups. Each process group can be configured for up to three strategies:

• Utilization

  • Aiming for 80% utilization, allowing 10% on either side of that before scaling up or down

• Latency

  • Latency in excess of x results in scaling up

• History

  • Our load patterns are very regular, and because Mechanic in particular is highly latency-sensitive, we use this strategy to scale up in anticipation of higher load, based on the historical record

Sidekiq

Scaling down is implemented as sending the "quiet" instruction to a Sidekiq process. In general, we run one Sidekiq process per Machine. When a quieted Sidekiq process has finished its work, it's safe to stop the corresponding Machine.

Our Sidekiq leader is configured to monitor for quiet Sidekiq processes that are performing no work. Whenever such a process is detected, the leader uses flyctl to stop the corresponding Machine.
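The flyctl side of that is a one-liner; a sketch, with the quiet-and-idle detection left to the leader ($MACHINE_ID here is whatever Machine backs the idle process):

# stop the Machine backing a Sidekiq process that's quiet and out of work
fly m stop $MACHINE_ID -a $FLY_APP_NAME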

Web

We don't have this implemented for web stuffs yet. We're just very over-provisioned, instead. :)


    Environment variables

    GitHub is the source of truth for our environment variables, whether they be sensitive "secrets" or less sensitive "variables".

Fly has its own secret store, which contains protected values to be used as environment variables on deployed Machines. We use Fly's secret store to get our secrets onto deployed Machines, but it is not the source of truth for those values. Instead, we use Fly's secret store as an automatically-maintained mirror of whatever GitHub secrets and variables are effective for a given environment.

    A "secret" is an environment variable that shouldn't be read by anything other than production code. Once configured in GitHub or Fly, you won't get that value back anywhere but in a GitHub workflow or on a Fly Machine.

    A "variable" is an environment variable that's safe to be read by authorized users. If you have permission, you can view variable values in GitHub. Fly doesn't distinguish between secrets and variables; once in Fly, they're all secrets, and Fly never lets you read them back except on deployed Machines.

    Configuration

In GitHub, secrets and variables can live at any of the following levels. Each subsequent level inherits from the preceding one, overriding it in case of conflict.

    1. The organization level

    2. The repo level, within the org

    3. The environment level, within the repo
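With the GitHub CLI, you can inspect what's set at each level (the org, repo, and environment names here are placeholders):

gh secret list --org $ORG
gh secret list --repo $ORG/$REPO
gh secret list --repo $ORG/$REPO --env production

# `gh variable list` takes the same flags for the non-secret variables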

    Deploying

    Secrets are populated automatically, during a repo-level GitHub workflow. Every deployable repo has its own fly-secrets.yml workflow.
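What that workflow effectively runs, sketched in shell (the secret names and app name are placeholders; the real list comes from whatever's configured in GitHub):

# mirror the GitHub-managed values into Fly's secret store for this app
printf '%s\n' \
  "SOME_SECRET=$SOME_SECRET" \
  "ANOTHER_SECRET=$ANOTHER_SECRET" |
  fly secrets import -a $FLY_APP_NAME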

    Rotating tokens

    Authorization tokens are strings used to identify and authorize us to some external service.

    1. Locate the external service's config area for the token in question.

      • Example: FLY_API_TOKEN comes from the "Tokens" config, within a Fly app

    2. Locate the secret's canonical location within GitHub.

  • Example: FLY_API_TOKEN is configured at the repository environment level.
3. Without revoking the old token, generate a new token for the secret with the vendor.

4. Copy the new token value, and update the corresponding GitHub secret.

5. Deploy to whatever deployment environments receive and use this secret.

6. Verify that the new token is working in its deployed environment(s).

7. Revoke the original token.


    Rough edges

    Fly is fantastic. Super happy to be on it.

These are the rough edges we've bumped up against, and (when applicable) how we handle them.

    Fly Proxy

    • auto-stop doesn't seeeeeem to work properly when websockets are in the mix

flyctl

apps

• restart

  • doesn't support --process-group

    • workaround (including backgrounding each Machine's individual restart command):

      fly m list -a $APP | grep $PROCESS_GROUP | awk NF | awk '{ print "fly m restart " $1 " &;" }'

  • slow for restarting large numbers of Machines, and halts if any individual restart fails

    • workaround: use fly m restart $ID & instead

    • addressed in Restarting apps

machines

• status

  • no machine-readable output; we regex our way through it to get Machine status (see the sketch after this list)

    • nb: --display-config exists, but that's for something else

  • doesn't include healthchecks

    • workaround: fly checks list -a $app | grep $machine_id

scale

• count

  • it seems to grab a lease on all Machines at once, even when scoped by --process-group, which means fly scale count commands can't be run concurrently

    • no workaround
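A sketch of that status regexing (the grep pattern is illustrative; we match whatever the human-readable output calls the state field):

fly m status $MACHINE_ID -a $FLY_APP_NAME | grep -i state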

    SSH

    A rough edge: fly ssh console doesn't support addressing a specific Machine.

    Connecting to a random Machine

$ fly ssh console -a $FLY_APP_NAME
$ bin/rails c

    Connecting to a specific Machine

    This will display an interactive list of Machines to choose from. Good for small numbers of Machines, not great for large ones.
$ fly ssh console -a $FLY_APP_NAME -s

    Connecting to a specific Machine address for a given app

    When an app has hundreds of Machines, it's faster on average to just look up the IP address of the desired Machine and pass that back to fly ssh console.
# get the Machine's IPv6 address
$ fly m status $MACHINE_ID

# use that address here
$ fly ssh console -a $FLY_APP_NAME -A $IP_ADDRESS

    Unusual consoles

    Let's say you have an image constructed from .. who knows where.

    Let's say you have a repo that uses a given Fly app to do a fly deploy --build-only thing, prepping an image for use elsewhere.

    Let's say you want to run a console using that image in a Fly app environment which is destined to receive that image (i.e. destined to have its machines updated to use this image). Let's say you want to do this before that glorious destiny arrives. Maybe you want to run some helpers that this image contains, or maybe you want to run a migration that this image contains, or or or or or or.

    Assuming the build happened using --image-label $IMAGE_TAG, this may help you on your quest:

fly console -a $EXALTED_APP_NAME -i registry.fly.io/$HUMBLE_APP_NAME:$IMAGE_TAG

    Recovering from deploy failures

In this section, "retry" means "use GitHub Actions' retry button on the failed run".

    Build failures

You might need to destroy the Fly builder app; it'll get auto-created again when you retry, which is what you should do after destroying it.
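A sketch of that cleanup (builder apps show up in fly apps list with names like fly-builder-…; $BUILDER_APP_NAME is whatever yours is called):

# find the builder app, then destroy it; the next build recreates it
fly apps list | grep fly-builder
fly apps destroy $BUILDER_APP_NAME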

    Docker failures

    Just retry. It's fine. :)

    Release command failures

    Just retry. It's fine. :)

    Machine update failures

    Start by surveying the scene, to see how many machines are on the new image vs the old one, or in replacing vs failed vs created status.
$ fly m list -a $FLY_APP_NAME

    Total machine update failure, i.e. the release command succeeded but no Machines were updated at all

If you're here, the app is probably online but no longer processing background jobs (because all the Sidekiq processes were instructed to enter quiet mode during the release command).

    Handle this by rebooting one of the worker_autoscale machines. That should be enough to start bringing machines back online.
$ fly m list -a $FLY_APP_NAME | grep worker_autoscale
$ fly m restart MACHINE_ID

    Once you've verified that the app is doing work again, wait for it to catch up on the run backlog, and then retry the deploy.

    A minority of machines were successfully updated

    Manually redo the deploy.

Do this using a CLI deploy, using the Docker image URI from the build step.
flyctl deploy \
    --app $FLY_APP_NAME \
    --strategy immediate \
    --env RELEASE_LABEL=v37 \
    --image registry.fly.io/$FLY_APP_NAME:$REPO_NAME.v37 \
    --update-only

    A majority of machines were successfully updated

    Manually update the rest of the machines.

    Start by examining fly m list -a $FLY_APP_NAME, and build a list of machine IDs that are stuck on the old image.
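One way to build that list, hedged ($OLD_IMAGE_TAG is a placeholder for whatever tag the stale Machines are showing in the IMAGE column):

fly m list -a $FLY_APP_NAME | grep $OLD_IMAGE_TAG | awk '{ print $1 }'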

    For each one, do something like this:
fly m update 328756e9f52758 \
  --env RELEASE_LABEL=v62-3-ga38cb23a \
  --image registry.fly.io/$FLY_APP_NAME:locksmith-api.v62-3-ga38cb23a

Sometimes a machine will get stuck, and you'll need to outright destroy it:

    fly m destroy MACHINE_ID

    Add --force if the machine is stubborn and won’t stop.

Then use fly scale count to scale back up to the desired machine count. Search fly scale count in the internal Slack and you'll see example usage.
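For example (a sketch; the count and group name are whatever the app actually needs):

fly scale count 40 --process-group $PROCESS_GROUP -a $FLY_APP_NAME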

    Deploys

    Human autonomy and responsibility go hand in hand.

    Our deploy practices reflect this, by acknowledging that there are some scenarios in which human autonomy is necessary, and ensuring that the human (1) can be nimbly responsive in those scenarios, and (2) is fully responsible for what happens in those scenarios.

    If we have a situation where we actively don't want a human to be responsible, we also take away human autonomy. You can't mess around in a place where you're not responsible for the results.


    Fly monitors its own ability to deploy well. :) (Thanks Fly!) See https://atc.fly.dev/.

    Automatic deploys

    Our regular deploys are all initiated through GitHub Actions.

    • To initiate a regular deploy to a production environment, we publish a new repo release. This manual action kicks off an automatic Actions workflow, which invokes flyctl deploy.

      • Our releases are auto-prepped using Release Drafter. This means that publishing a new release is as simple as editing the latest release draft, and hitting the big green "Publish release" button.

    • Regular deploys to non-production environments are triggered however's appropriate. Usually, it happens via a push to main, which kicks off an Actions workflow, which invokes flyctl deploy.

    Manual deploys

    Each repo has two GHA workflows that can be manually called through the GitHub UI: one called "Manual secrets 🛠️", and one called "Manual deploy 🛠️".

    Use these as needed.
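They can also be triggered from a terminal via the GitHub CLI, if that's more your speed (a sketch; any inputs are whatever the workflow defines):

gh workflow run "Manual deploy 🛠️" --ref main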

    CLI deploys

    This should reeeeeeaally only ever be done in an emergency situation. If you're reaching for this in a non-emergency, take a minute first, and have a think on why you're here.

    Recovery

    Some of our apps are on the larger end. Mechanic uses upwards of 500 Machines, for example. Lots of things can go wrong. Here's some documentation on that:

    Recovering from deploy failures

    Strategies

    We use "immediate" in environments where deploys are manually initiated, and "bluegreen" wherever deploys are automatically initiated.

    Immediate deploys finish quickly, but the actual Machine updates happen asynchronously, and may take longer. Usually they're fast, but I've seen them take more than 15min on occasion.

    "Why not use a strategy (like bluegreen) that guarantees the health of new Machines before putting them into service?"

    • This takes so much time. So much time. Deploys are not fast, and they're hard to interrupt, and when interrupted flyctl tries to roll back the change, and when hundreds of Machines are in play this process is kinda brittle.

    • This doubles the size of our Machine pool, which doubles the number of Postgres and Redis connections in play. This hasn't actually been a problem, but it's .. you know, it's something to think about.

    Configuration

    Our GitHub org has an org-level variable in place: FLY_DEPLOY_STRATEGY=bluegreen. This makes it the default value for all repos and their environments.

    Each repository's production environment has an env-level variable in place: FLY_DEPLOY_STRATEGY=immediate. This makes it the effective value for that environment, and that environment alone.
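The net effect, sketched as a shell command (the real resolution happens in GitHub Actions, where the value arrives as ${{ vars.FLY_DEPLOY_STRATEGY }}):

# $FLY_DEPLOY_STRATEGY is bluegreen by default, immediate in production
flyctl deploy --app $FLY_APP_NAME --strategy $FLY_DEPLOY_STRATEGY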

    Release commands

    Fly supports "release commands", which are automatically invoked during deploy, right before updating Machines with new images.

In apps that run Sidekiq, we use this feature to issue "quiet" commands to all of our Sidekiq processes.

    Once this happens, no jobs will be performed. Jobs will be automatically resumed as Machines come back online after the deploy.

    fly.toml excerpt
    [deploy]
    # deployment is done (and configured) via shared workflow. see:
    # https://github.com/lightward/.github-private/blob/main/.github/workflows/fly-deploy.yml
    # except for this part, where we have an app-specific interest in quieting sidekiq before release
    release_command = "bin/rake sidekiq:quiet"