automating AMIs with Ansible

Automating the creation and updating of AWS AMIs with Ansible.

the problem

One of the first problems when deploying anything is managing the base image. The gold load. The template. The thing that all systems are created from.

Since we’re operating in AWS, our story begins with AMIs. We use AMIs. From those AMIs, instances are built and configured. We want to optimize this configuration process, so it makes sense to pre-configure our AMIs as much as possible: keep them updated, patched, modified, and loaded with the latest version of our code. That means shorter deployment times, since there is less patching, installing, and configuring to do at launch. The less time the better.

Now, to figure out how we would implement and automate it. There are some really, really awesome tools out there (and we may use or mimic them); this was just our first step. It turned out to be a pretty interesting exercise. It allowed us to develop our own method with any peculiar customizations we wanted or needed. Your mileage may vary.

prerequisites

  • Ansible is the tool of choice, and a playbook handles this process. Nuff said.
  • In this example, there is a base AMI that all AMIs are created from. There is one specific role, nginx, that we want to configure. This could be any number of customized roles.
  • All the AWS account configuration has been taken care of. Proper credentials, proper policies, etc. Information on an example policy can be found in the README.md. (The variables the playbook expects are sketched just after this list.)
  • A ‘base’ AMI has already been created and documented. This was one of the few manual steps (have to start somewhere). In this demo, a CentOS AMI from the AWS marketplace was used. In our workflow, a custom built VM was imported into AWS.
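
For reference, the tasks below lean on a handful of variables. Here is a minimal sketch of what those might look like, assuming a group_vars file and an Ansible Vault file; the layout and example values are guesses, only the variable names come from the playbook itself:

# group_vars/all.yml - layout and example values are illustrative only
ami_prefix: amibuilder        # prefix used when naming and tagging ROLE AMIs
ROLE: nginx                   # the role being baked into the AMI
ami_base_Name: centos-base    # Name tag of the manually created BASE AMI
ami_base_Role: base           # Role tag of the BASE AMI

# vault.yml - encrypted with ansible-vault
vault:
  aws_access_key: "AKIA..."   # placeholder
  aws_secret_key: "..."       # placeholder
  region: us-east-1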

playbook steps - create instance

  • Create the instance - role aws.createinstance.
  • Find the ROLE AMI. Use the BASE AMI if the ROLE AMI doesn’t exist.
  • Increment the counter if the ROLE AMI exists. If it doesn’t, a new one will be created from the BASE AMI.
  • Set an Ansible fact for the documented SSH fingerprint. If we’re using the BASE AMI, use its fingerprint; if the ROLE AMI, use that one instead.
  • Find the latest version of the ROLE AMI, to set the increment counter.
  • Find the FIRST ROLE AMI created, if it exists (to avoid an AMI created from a chain of updates).
  • If the role AMI was found, use it. Else, use the base AMI.
  • Create the instance and wait for it to start up (a sketch of this task appears at the end of this section).

There may be multiple ROLE AMIs created. It is important to uniquely identify them via the increment counter:

- name: Find AMI. Sort descending to get the last iteration count.
  ec2_ami_find:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    owner: self
    ami_tags:
      Name: "{{ ami_prefix }}_{{ ROLE }}"
      Role: "{{ ROLE }}"
    sort: tag
    sort_tag: Increment
    sort_order: descending
    no_result_action: success
  register: fact_ami_info

- name: Set Increment IF custom AMI already exists.
  set_fact:
    fact_Increment: "{{ fact_ami_info.results.0.tags.Increment | int + 1 }}"
  when: fact_ami_info.results.0 is defined

It is also important to build from the first AMI, increment 0, to avoid a chain of AMIs built from each other. It keeps the process less stateful. At some point, it will be important to update that “0” AMI, to reduce the number of updates and changes required between versions. If it isn’t found, use the BASE image, because the ROLE image must not exist:

- name: Get the first role AMI. All incremented versions of the AMI are based off of the 0 version.
  ec2_ami_find:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    ami_tags:
      Name: "{{ ami_prefix }}_{{ ROLE }}"
      Role: "{{ ROLE }}"
      Increment: '0'
    no_result_action: success
    owner: self
    sort: tag
    sort_tag: Increment
  register: fact_ami_info
  when: fact_ami_info.results.0 is defined

- name: Find the base AMI.
  ec2_ami_find:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    ami_tags:
      Name: "{{ ami_base_Name }}"
      Role: "{{ ami_base_Role }}"
    no_result_action: fail
    owner: self
    sort_tag: Increment
    sort_order: descending
  register: fact_base_ami_info

- name: Set fact AMI to use.
  set_fact:
    fact_ami_selected: "{{ fact_ami_info.results.0.ami_id if fact_ami_info.results.0 is defined else fact_base_ami_info.results.0.ami_id }}"
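
The last step of this role, launching the instance from the selected AMI and waiting for it to start, isn’t shown above. A minimal sketch using the classic ec2 module might look like the following; the instance type, key name, and security group are placeholders, and exact_count/count_tag are just one way to end up with the tagged_instances data that later tasks read from fact_instance_info:

# Sketch only - instance_type, key_name, and group are placeholders.
- name: Create the build instance from the selected AMI and wait for it.
  ec2:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    image: "{{ fact_ami_selected }}"
    instance_type: t2.micro
    key_name: amibuilder
    group: amibuilder-sg
    instance_tags:
      Name: "{{ ami_prefix }}_{{ ROLE }}_build"
      Role: "{{ ROLE }}"
    count_tag:
      Role: "{{ ROLE }}"
    exact_count: 1
    wait: yes
  register: fact_instance_info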

playbook steps - SSH fingerprint

  • Verify the SSH fingerprint - role localhost.verifyssh.
  • Change the SSH fingerprint fact, depending on whether the BASE or ROLE image is being used.
  • SSH keyscan the IP address. If the fingerprint is incorrect, the playbook will error.
  • Import the SSH key into the user’s known_hosts file.
  • Add the IP address of the newly created host to an Ansible group (a rough sketch of these tasks follows this list).
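
A rough sketch of those tasks, assuming the documented key lives in a variable like fact_ssh_host_key and the in-memory group is named amibuilder (both names are assumptions, not lifted from the actual role):

# Sketch only - fact_ssh_host_key and the 'amibuilder' group name are assumptions.
- name: Scan the SSH host key of the new instance.
  command: ssh-keyscan -t rsa {{ fact_instance_info.tagged_instances.0.public_ip }}
  register: fact_keyscan
  changed_when: false

- name: Error out if the scanned key does not match the documented one.
  fail:
    msg: "SSH host key does not match the documented fingerprint."
  when: fact_ssh_host_key not in fact_keyscan.stdout

- name: Import the key into the user's known_hosts file.
  known_hosts:
    name: "{{ fact_instance_info.tagged_instances.0.public_ip }}"
    key: "{{ fact_keyscan.stdout }}"

- name: Add the new instance to an in-memory Ansible group.
  add_host:
    name: "{{ fact_instance_info.tagged_instances.0.public_ip }}"
    groups: amibuilder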

I’ve seen references to disabling host checking for SSH. This seems like a 3rd rail to me and something I do NOT want to do. I’d rather take the time to find the SSH fingerprint and document it. There are still issues with this method, as we have a limited number of pre-baked AMIs, so our instances will have the same SSH key (at least instances in the same role).

playbook steps - configure the instance

Since we’ve created the instance and added it to the hostgroup, we can now connect to it via Ansible and manage it.
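
The play for this stage just targets that in-memory group and runs the roles described below. A minimal sketch (the group name is an assumption, the role names are real):

# Sketch - the 'amibuilder' group name is an assumption.
- hosts: amibuilder
  become: yes
  roles:
    - amibuilder.nginx
    - amibuilder.cleanup
    - amibuilder.deletesshkeys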

  • Update YUM, install EPEL, and install NGINX - role amibuilder.nginx (sketched after this list).
  • Clean log files - role amibuilder.cleanup.
  • Delete and recreate SSH host keys - role amibuilder.deletesshkeys.
  • SSH keys are deleted in /etc/ssh/.
  • The SSH service is restarted.
  • ssh-keyscan is run against localhost to get the newly created fingerprint.
  • Old entries from the localhost ~/.ssh/known_hosts are deleted.
  • The newly created SSH fingerprint is added to localhost ~/.ssh/known_hosts.
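
The amibuilder.nginx role boils down to a few yum tasks; something along these lines (the exact task breakdown is a guess):

# Sketch of amibuilder.nginx - the task breakdown is a guess.
- name: Update all packages.
  yum:
    name: '*'
    state: latest

- name: Install EPEL.
  yum:
    name: epel-release
    state: present

- name: Install NGINX.
  yum:
    name: nginx
    state: present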

In this example, NGINX is just installed and nothing is really configured, because anything can be done at this point. It’s a basic Ansible deployment: update, install, configure, deploy developed code. After that, clean up logs, delete SSH keys, and document the new SSH fingerprint.

A quick note on bracket notation. With AWS, I find I am CONSTANTLY reaching back to localhost variables. Example:

- name: Delete SSH host keys
  shell: 'rm -f ssh_host_*'
  args:
    chdir: "/etc/ssh"
  when: hostvars['127.0.0.1']['fact_ami_selected'] == hostvars['127.0.0.1']['fact_base_ami_info']['results'][0]['ami_id']

This task accesses the localhost group variable, but it’s not within the localhost group. The when statement compares the “fact_ami_selected” variable (which was set when the instance was created in aws.createinstance) with “fact_base_ami_info”, which tells us whether we need to regenerate new SSH host keys. By using hostvars with bracket notation, other hosts’ variables can be used. Handy. It’s used all over.

playbook steps - create the AMI and cleanup

  • Stop the instance, create the AMI tags, and create the AMI - role aws.createami.
  • Stop the instance (sketched just after this list).
  • Create the AWS tag dictionary fact.
  • Union them together into one variable.
  • Create the AMI with the proper tags.
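
Stopping the instance is a single task against the instance ID registered when it was created; a hedged sketch with the classic ec2 module:

# Sketch - stops the build instance before imaging it.
- name: Stop the build instance.
  ec2:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    instance_ids:
      - "{{ fact_instance_info.tagged_instances.0.id }}"
    state: stopped
    wait: yes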

Constructing the dictionary was a bit messy. Maybe I’ll figure out a better way to do this in the future:

- name: Get AMI name.
  set_fact:
    fact_ami_name: "{{ hostvars[groups[item][0]]['ami_tag_Name'] }}_{{ fact_Increment }}"
  with_items: "{{ ami_instance_tags.Name }}"
  when: fact_ami_name is not defined

- name: Create AMI name dictionary.
  set_fact:
    fact_name_tag: {"Name": "{{ ami_tag_Name }}"}
    fact_role_tag: {"Role": "{{ ami_tag_Role }}"}
    fact_increment_tag: {"Increment": "{{ fact_Increment }}"}
  with_items: "{{ ami_instance_tags.Name }}"

- name: Set fact - union the new AMI name with the tags.
  set_fact:
    fact_ami_tag_list: "{{ fact_name_tag | combine(fact_role_tag) | combine(fact_increment_tag) }}"
  with_items: "{{ ami_instance_tags.Name }}"
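
For what it’s worth, the same dictionary could probably be built in a single set_fact, skipping the union step entirely (untested sketch, same variables as above):

# Possible tidier alternative - build the tag dictionary in one task.
- name: Build the AMI tag dictionary.
  set_fact:
    fact_ami_tag_list:
      Name: "{{ ami_tag_Name }}"
      Role: "{{ ami_tag_Role }}"
      Increment: "{{ fact_Increment }}"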

- name: Create AMI.
  ec2_ami:
    aws_secret_key: "{{ vault.aws_secret_key }}"
    aws_access_key: "{{ vault.aws_access_key }}"
    region: "{{ vault.region }}"
    instance_id: "{{ fact_instance_info.tagged_instances.0.id }}"
    name: "{{ ami_prefix }}_{{ ROLE }}_{{ fact_Increment }}"
    # If you share AMIs to other accounts, create the list and they will be automatically shared.
    # launch_permissions:
    #   user_ids: "{{ account_share_list }}"
    tags: "{{ fact_ami_tag_list }}"
    wait: "yes"
    wait_timeout: 900

summary

With this playbook, custom AMIs can be constructed in an automated fashion. Really, it just does a few things: create an instance from an initial AMI (BASE or ROLE), document and verify the SSH fingerprint, configure the instance, clean the logs, shut down the instance, and create an AMI image with the appropriate tags. These AMIs can be shared as necessary as well; just add a list of AWS accounts to share them with. Rundeck is a nice way to schedule and run jobs, too.

At some point in the future, I’d bet this playbook will do what Netflix does: mount a volume and chroot into it, saving the startup/shutdown time. For now, it’s pretty good at what it does, and it’s expandable and flexible.

Here’s the playbook for those interested - https://github.com/bonovoxly/playbook/tree/playbook2.0/old_format/amibuilder

-b