Moving blog to EC2 Spot Instance
We recently moved from Chicago to Des Moines, and we're staying in an AirBnB for a couple of months while we look for the right house to buy. In the meantime, most of our stuff (including critical components of my homelab) is in storage, which means my blog hasn't been running. In this transient period, I figured I would try to run my blog in the cloud, and while there are easier and even cheaper options, I decided to try running it on EC2 in order to learn a bit more about traditional Linux system administration. This post documents the approach I arrived at.
My blog is just a static site served by Caddy, all packaged up into a Docker image. The goal is to run my blog as well as logs and metrics exporters on a single EC2 spot instance to keep costs down. Since this is just one instance, if it goes down (and it likely will, because it's a spot instance), my blog will be unavailable, which is to say that this approach is not highly available and thus not a good production architecture¹.
Further, I intended to bind a static IP address directly to the spot instance rather than proxying traffic through a $20/month (or whatever the cost is these days) load balancer. This means that if my host goes down, a new instance will be brought up to replace it; however, that instance will not automatically get the Elastic IP address bound to it (the binding will die with the original host). As far as I can tell, AWS doesn't offer any kind of automation for this sort of scenario (apart from load balancers). This is a risk I'm willing to accept, and if it becomes overly inconvenient, I could build a little lambda function that periodically checks to make sure the elastic IP address is bound properly and rebinds it if the spot instance goes down; however, in my experience these kinds of interruptions are rare.
Base Image
I decided to use Ubuntu for my base VM image, mostly because I'm more familiar with it; however, as we'll see later, Amazon Linux 2 would likely have been the better choice. I'm also doing all of the configuration management² through cloud-init, whereas I'm guessing more professional sysadmins would just use cloud-init to bootstrap some other configuration management system like Ansible. But I don't know Ansible or Puppet or the like, I'm not particularly interested in learning them, and my use case seems simple enough.
Process Management
A process manager is basically a controller whose job is to make sure that the desired processes are running on the host. If a process terminates unexpectedly, the process manager should respawn it. On Linux, this role is played by the init system: the first process started at boot, which spawns and supervises all other processes. Since I picked Ubuntu, I'm using systemd.
The application's systemd configuration is in a blog.service file.
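A minimal sketch of that unit is below; the image name (weberc2/blog) and the published ports are illustrative rather than the exact values from my deployment:

```ini
[Unit]
Description=Blog (static site served by Caddy in Docker)
After=docker.service
Requires=docker.service

[Service]
# Pull the latest image before starting so restarts pick up new versions.
ExecStartPre=/usr/bin/docker pull weberc2/blog:latest
# Image name and ports are illustrative.
ExecStart=/usr/bin/docker run --rm --name blog -p 80:80 -p 443:443 weberc2/blog:latest
ExecStop=/usr/bin/docker stop blog
Restart=always
# Without a small delay between restarts, the unit tends to trip systemd's
# start-rate limiting and end up in a failed state.
RestartSec=5
# Append (rather than replace) so logs survive restarts; the CloudWatch
# agent tails this file.
StandardOutput=append:/var/log/blog.log
StandardError=append:/var/log/blog.log

[Install]
WantedBy=multi-user.target
```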
The ExecStartPre line just makes sure the latest image is pulled (in case the host has previously pulled an older version). Since I wrote this, it seems the same is achievable with docker run ... --pull=always, such that the ExecStartPre line could be elided (I'll probably make this change in the future). This is necessary because, for the time being, I'm just deploying the latest version of my blog image, which means deploying my blog just means publishing a new image and restarting the application; however, in the future I may deploy explicit image tags rather than using latest.
The RestartSec=5 bit is necessary to keep the service from failing. I don't understand why, and it's annoying that I have to add this, but if Linux tools were intuitive then just anyone could run their own instances 🙃.
The StandardOutput and StandardError directives are important--they write log output to a /var/log/blog.log file, which will be read by our log exporter as discussed below. Ideally our log exporter would be able to pull logs directly from journald (the logging complement to systemd), but the agent I chose seems not to support that. There might be a better way to export logs than writing them to a file, but I couldn't figure it out for my log exporter.

Note the append: prefix. Systemd also has a file: prefix, but if your service is restarted, systemd will silently stop writing logs to the specified file. Append behaves correctly, and I can't imagine why anyone would want the file: behavior.
Logging and Monitoring
I'm only running one host at a time, but I still want a better and more durable way to access logs and metrics than SSH-ing onto the host and grepping files. Specifically, if the host goes down, I don't want to lose its logs and metrics. This means running exporter utilities for shipping the logs and metrics to some other service where they can be analyzed. The default log/metric analysis tool in the AWS world is CloudWatch, and I'd rather use a managed service than try to operate my own (in my limited experience, CloudWatch seems much better than the self-hosted options anyway). This means running the amazon-cloudwatch-agent utility on the host in addition to the application.
Since I'm using Ubuntu rather than Amazon Linux 2, the agent's installation package isn't available in the system repository, so I needed to write a small script to download, install, and start the package:
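(What follows is a sketch rather than my exact script; the download URL follows the pattern from the CloudWatch agent documentation, and the file paths are assumptions.)

```bash
#!/usr/bin/env bash
set -euo pipefail

# Download the arm64 Debian package for the CloudWatch agent.
# URL per the CloudWatch agent docs; adjust the architecture as needed.
curl -fsSL -o /tmp/amazon-cloudwatch-agent.deb \
    https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/arm64/latest/amazon-cloudwatch-agent.deb

# Install the package; this also installs the agent's systemd unit.
dpkg -i /tmp/amazon-cloudwatch-agent.deb

# Load the configuration written by cloud-init and start the agent.
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
    -s
```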
Note that I'm using the arm64 version of the package, because my spot instance is based on the cheaper (better performance/cost) ARM instances. The Debian package handles installing the systemd configuration for the agent, so I don't have to write or install my own systemd .service file.
The configuration for the agent is attached below. Note that I'm collecting the syslog as well as the aforementioned /var/log/blog.log file, and running as the root user--I could probably run as the cwagent user, but I'd need to find a way to grant that user read access to the syslog. Note also that I'm collecting disk, memory, and CPU metrics.
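Something along these lines (a sketch, not my exact configuration; the log group names and the specific metric measurements are illustrative):

```json
{
  "agent": {
    "run_as_user": "root"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "blog-syslog",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/blog.log",
            "log_group_name": "blog",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["usage_user", "usage_system", "usage_idle"]
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      }
    }
  }
}
```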
As we will see, I'm going to ship this script and the agent configuration file in the cloud-init user-data file. CloudWatch also requires configuring a role for the instance, giving it permissions to write to CloudWatch--see the section on infrastructure below.
User Data
When an instance boots for the first time, cloud-init runs, and one of its first activities is to find the "user-data", which is the configuration file provided by the cloud provider for configuring that instance. In the case of AWS, cloud-init calls the metadata endpoint to download this file. This file is so-called because it's something that we (users of AWS) provide when we request an AWS EC2 instance. Mine looks like this (note this YAML was generated by Terraform--more on that below--hence the formatting and order of keys):
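(The following is a simplified, hand-written sketch of its shape rather than the Terraform-generated output; file contents are elided, and the paths, package names, and user details are assumptions based on the description below.)

```yaml
#cloud-config
package_update: true
packages:
  - docker.io
  - curl
ssh_pwauth: false
users:
  - name: weberc2
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: [sudo, docker]
    # Pull my public key from GitHub rather than embedding it here.
    ssh_import_id:
      - gh:weberc2
write_files:
  - path: /etc/systemd/system/blog.service
    content: |
      # ... the systemd unit shown above ...
  - path: /opt/install-cloudwatch-agent.sh
    permissions: "0755"
    content: |
      # ... the installation script shown above ...
  - path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
    content: |
      # ... the agent configuration shown above ...
runcmd:
  - [systemctl, daemon-reload]
  - [systemctl, enable, --now, blog.service]
  - [bash, /opt/install-cloudwatch-agent.sh]
```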
Note that the write_files section contains the aforementioned blog.service, install-cloudwatch-agent.sh, and amazon-cloudwatch-agent.json files and specifies where to write them to disk. The user-data also specifies which packages are to be installed from the system package repository--in this case, I'm installing Docker and curl (required by the application systemd unit and the cloudwatch agent installation script, respectively).
The user-data also contains some stuff for configuring users and how they can SSH onto the instance (notably password authentication is disabled, my weberc2 user is a sudoer, and my user's pubkey is pulled from GitHub).
Lastly, the runcmd section tells cloud-init to start the blog application and run the cloudwatch agent installation script.
Infrastructure
A major goal for any project I work on is that the infrastructure is immutable and reproducible. I should be able to tear down my infrastructure and stand it back up again with relative ease. I certainly don't want "a human pokes around the AWS console or SSHes onto an instance" to be part of the process, because I will make mistakes, and I want to minimize tedium (this is for fun, after all). To that end, I'm using Terraform to describe my infrastructure and reconcile that description with the current state.
The infrastructure contains a security group for allowing traffic to reach the instance (ports 80, 443, and 22 for HTTP, HTTPS, and SSH traffic, respectively).
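A sketch of the security group, assuming the default VPC (the resource name and description are illustrative):

```hcl
resource "aws_security_group" "blog" {
  name        = "blog"
  description = "Allow HTTP, HTTPS, and SSH traffic to the blog instance"

  # One ingress rule per allowed port.
  dynamic "ingress" {
    for_each = [80, 443, 22]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }

  # Allow all outbound traffic.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```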
It also defines the Elastic IP address and the Route53 DNS record descriptions:
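(A sketch; the domain name is illustrative, and the `domain`/`vpc` argument depends on the AWS provider version.)

```hcl
resource "aws_eip" "blog" {
  # `domain = "vpc"` in newer provider versions; `vpc = true` in older ones.
  domain = "vpc"
}

# Look up the hosted zone for the (illustrative) domain.
data "aws_route53_zone" "main" {
  name = "weberc2.com."
}

# Point the DNS record at the Elastic IP.
resource "aws_route53_record" "blog" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "weberc2.com"
  type    = "A"
  ttl     = 300
  records = [aws_eip.blog.public_ip]
}
```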
And the IAM stuff for granting the instance permissions to send logs and metrics to CloudWatch:
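(Roughly the following; the role and profile names are illustrative, and the AWS-managed CloudWatchAgentServerPolicy is one straightforward way to grant the agent's permissions.)

```hcl
# Role that EC2 instances can assume.
resource "aws_iam_role" "blog" {
  name = "blog-instance"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS-managed policy that lets the CloudWatch agent put logs and metrics.
resource "aws_iam_role_policy_attachment" "cloudwatch_agent" {
  role       = aws_iam_role.blog.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

# Instance profile so the role can be attached to the instance.
resource "aws_iam_instance_profile" "blog" {
  name = "blog-instance"
  role = aws_iam_role.blog.name
}
```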
Lastly, it contains the description of the spot instance itself:
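(Something along these lines; the AMI lookup, instance type, and persistent spot settings are my guesses at reasonable values rather than my exact configuration. I've also included the Elastic IP association here, since it references the instance.)

```hcl
# Latest Ubuntu 22.04 arm64 AMI from Canonical.
data "aws_ami" "ubuntu_arm64" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-*"]
  }
}

resource "aws_spot_instance_request" "blog" {
  ami                    = data.aws_ami.ubuntu_arm64.id
  instance_type          = "t4g.nano" # illustrative size
  iam_instance_profile   = aws_iam_instance_profile.blog.name
  vpc_security_group_ids = [aws_security_group.blog.id]
  user_data              = local.user_data

  # Keep the request open so AWS launches a replacement if the instance
  # is reclaimed (the Elastic IP won't follow it automatically, though).
  spot_type            = "persistent"
  wait_for_fulfillment = true
}

# Bind the Elastic IP to whatever instance currently fulfills the request.
resource "aws_eip_association" "blog" {
  instance_id   = aws_spot_instance_request.blog.spot_instance_id
  allocation_id = aws_eip.blog.id
}
```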
Note the user_data is set to local.user_data. Previously, this was just a reference to a local user-data.yaml file contained on disk (user_data = file("user-data.yaml")), but as we will see in the next section, I've abstracted the user-data to make this Terraform more flexible for applications besides my blog.
Abstracting User-Data
The user-data referenced above is only suitable for my blog, but conceivably I may want to configure other instances running a different suite of applications. Indeed, I have another instance that I was also running with its own hard-coded user-data, and I wanted to abstract out the similarities. To that end, I created a module (Terraform verbiage for a template) and factored out the bits that differ into parameters ("input variables" in Terraform parlance):
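(The module's inputs look roughly like this sketch; the module path, variable names, and object attributes are illustrative rather than my exact definitions.)

```hcl
# modules/spot-instance/variables.tf

variable "name" {
  description = "Name of the instance; used for DNS records, log groups, etc."
  type        = string
}

variable "domain" {
  description = "DNS name to point at the instance's Elastic IP."
  type        = string
}

variable "services" {
  description = "Services to run on the instance: a systemd unit plus a log file to ship to CloudWatch."
  type = list(object({
    name          = string
    unit_contents = string
    log_file      = string
  }))
}
```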
This allows me to stamp out multiple Spot Instances complete with DNS and logging/metrics support by just passing in a few parameters of information about each service that will run on the instance (excluding the logging agent, which is provided by default). Further, it gives me one place to make changes which can then benefit all of my spot instances.
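For example, instantiating the module for my blog might look something like the following (again, the names and paths are illustrative):

```hcl
module "blog" {
  source = "./modules/spot-instance"

  name   = "blog"
  domain = "weberc2.com"
  services = [{
    name          = "blog"
    unit_contents = file("${path.module}/blog.service")
    log_file      = "/var/log/blog.log"
  }]
}
```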
This services variable is used to dynamically generate the user-data and the cloudwatch agent configuration files.
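Roughly, the idea is something like the following sketch (simplified; the attribute names match the illustrative variables above, and the metrics portion of the agent configuration is omitted):

```hcl
locals {
  # Render the cloud-init user-data from the list of services.
  user_data = join("\n", ["#cloud-config", yamlencode({
    write_files = concat(
      # One systemd unit per service...
      [for svc in var.services : {
        path    = "/etc/systemd/system/${svc.name}.service"
        content = svc.unit_contents
      }],
      # ...plus the generated CloudWatch agent configuration.
      [{
        path    = "/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json"
        content = local.cloudwatch_agent_config
      }]
    )
    runcmd = [for svc in var.services : "systemctl enable --now ${svc.name}.service"]
  })])

  # Ship each service's log file to its own log group.
  cloudwatch_agent_config = jsonencode({
    logs = {
      logs_collected = {
        files = {
          collect_list = [for svc in var.services : {
            file_path       = svc.log_file
            log_group_name  = svc.name
            log_stream_name = "{instance_id}"
          }]
        }
      }
    }
  })
}
```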
Since I've modularized my concept of a spot instance application, I've also ported my older EC2 instance to it so that it can benefit from the logging changes.
Next Steps
Now that I have my EC2 spot instance module, the next thing I'd like to add to it is some sort of alerting automation so I can detect when it goes down (at a minimum) and possibly if it looks like it's about to run out of resources (e.g., excessive use of CPU, memory, or disk) although I'm not very worried because Caddy is pretty simple/reliable. Mostly, my biggest concern is that AWS will temporarily take down my spot instance, and I'll need to manually re-associate the Elastic IP address with the new instance it brings up--and an automated alert means I can know about it immediately rather than needing to periodically check the blog myself.
Another improvement would be to use Tailscale (a VPN) for SSH access rather than exposing port 22 to the public Internet.
At some point, I would also like to build a CI job to automatically apply my Terraform changes, because at the moment my process is just running these Terraform applies on my local laptop (which also holds the Terraform state). This works well enough for my single-developer flow, and if I'm concerned about losing the Terraform state, Terraform has first class support for writing it to a cloud data store (which is also a precursor for building a CI job).
1. This could be remedied easily enough by running multiple hosts behind a load balancer.
2. For some reason, all of the documentation I'd seen in the past for "configuration management" seems to imply that the term is self-describing or that everyone already knows what it means, but as far as I can tell, it specifically refers to preparing a host (i.e., installing packages, configuration files, etc.) at boot time to run the application. So we're specifically talking about hosts (as opposed to all of the other stuff we configure in the infrastructure world), and for some reason it seems more common to do this at boot time rather than baking it into the VM image as we do in the container world (presumably because VM image building tooling is even worse than Docker image building tooling).