Lessons from a Releng Sprint

I will recount our original CI process and the work that went into refactoring it. Although my role is quality assurance (QA), I had to don my release engineering (releng) hat for a few sprints. I could have let things be as they were, but they were a hindrance to my work of testing and test automation. Rather than wait for the DevOps team to tackle the technical debt in our releng processes as time permitted, I decided to take this head-on as "step zero" in the long list of steps needed to make our continuous integration (CI) process efficient and fast.

Due to design decisions made well before I joined the team -- mostly by people who had left the team long ago -- the QA team was not able to provide any metrics on what the developers produced, what was tested with automation (or manually), and what could be released to production. This was the driving force behind the refactor: if the QA team was going to sign off on things moving to production, it needed a better handle on the CI pipeline.

This post marks the culmination of the first stage of the refactor. A lot more is still left to be done. Hopefully, as those steps get completed, I'll post more of the lessons learned.

I'll use the word "team" to describe the entire company, since I believe we are all part of a single team. Nevertheless, there are smaller teams within the company tasked with primary duties. Because we're small enough to know and work with each other repeatedly on many different things, it's easy for these smaller teams to cross-train in other aspects of the team's work.

Current CI Pipeline

Our git workflow is fairly uncomplicated. We have a master branch that should never break and all feature work is done in feature branches. When features are ready to be merged, pull requests (PRs) are created and reviewed by peers before merging.

Each project has its own git repository (repo). Due to the limited number of repos available to us on GitHub, some things were mashed together into a single repo. The idea was that anything our team would ship to production, or that a customer would receive, would eventually get extracted out into its own repo. Our team managed to get rid of overcome-by-events (OBE) repos fairly quickly, so we were never completely blocked from one-repo-per-project, although we were hindered numerous times.

The QA team works hard to create test automation. Each project the team creates has some measure of test automation, even if it's just a handful of smoke tests. These tests keep pace with changes in design and code because the QA team believes in keeping CI green as much, and for as long, as possible.

The DevOps team is responsible for creating development environment automation as well as taking care of the production life cycle. The CI pipeline's infrastructure is their responsibility, while all the software engineering (SWE) teams (development, QA, DevOps) use that infrastructure to do their jobs.

Problems

Our team identified a whole lot of inefficiencies and hurdles that slowed down our velocity. Technical debt was a major contributor to our woes. We were writing code faster than we were able to package it in a useful way. We were testing ad hoc configurations that were far from what production would look like. We were building artifacts with Jenkins and then had no proper way of managing them. Tracing breakages back to the commits that introduced them was a tedious, manual time sink. We were making regular releases, but it took the whole team a full day to figure out what was getting released. Jenkins jobs diverged wildly, even for building similar things. Building Docker images would take tens of minutes. In one area of development, the turnaround for a developer to commit, build, and deploy was more than two hours. There were more issues we could list, but these were the major things we wanted to tackle first.

Lessons Learned

We went through a few sprints of cleaning things up, and these are the lessons we learned. I tweeted them as well.

For every new project you start, go through its entire life cycle of commit, build, package, deploy, and test before you write a lot of code. This will help you work out any kinks in your understanding of how the application will live through this life cycle. Otherwise, what often happens is that the bulk of the time is spent writing code and then the remaining steps are rushed with sub-par quality.
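
As a rough illustration, the first pass at that life cycle can be as crude as a shell script run by hand. This is only a minimal sketch; the project name, spec file, test host, and smoke-test script are placeholders, not our actual setup.

    # Hypothetical walk through commit, build, package, deploy, and test
    # before the project has accumulated much code.
    git clone git@github.com:example/myapp.git && cd myapp

    # Build and package, even if the application is still a stub.
    make build
    rpmbuild -bb packaging/myapp.spec

    # Deploy the package to a disposable host and run whatever smoke tests exist.
    scp ~/rpmbuild/RPMS/x86_64/myapp-*.rpm test-host:
    ssh test-host 'sudo yum -y localinstall myapp-*.rpm && sudo systemctl start myapp'
    ./tests/smoke.sh test-host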

Provide a default working config with your code. Package it for the platform you'll be deploying to. I'm in favor of building packages for OS package managers as the default option. In this case, for example, I'd create two rpms on the Red Hat family of OSes: one for the code and one for the config.

The reason to build a separate config rpm is that you can create a large number of config packages, each of which can be installed easily. This reduces your deployment time by shifting the burden to artifact creation. Of course, not all situations are suited to this. In our case, however, we knew all the various configs our customers would need, and we had an obligation to test all of them.
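
In practice this can be as simple as looping over the known configurations in the build job. A minimal sketch, assuming a hypothetical spec file that reads the customer name from an rpm macro; the customer names are made up:

    # Build one config rpm per known customer configuration.
    # myapp-config.spec is assumed to package configs/<customer>/ into /etc/myapp/.
    for customer in acme globex initech; do
        rpmbuild -bb --define "customer ${customer}" packaging/myapp-config.spec
    done
    # Result: myapp-config-acme, myapp-config-globex, myapp-config-initech,
    # any one of which can be installed to configure the app for that customer.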

Once you have a bundle of config packages, testing them becomes a cycle of removing one config package, installing another, and running the tests. Sometimes configs contain sensitive information, such as TLS certificates and login credentials. In those cases I prefer to use config management tools, like Ansible, to provide these sensitive pieces after installing a config package.
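
The test cycle then looks something like the following sketch; the package names, playbook, and host are placeholders, and the secrets playbook stands in for whatever drops certificates and credentials onto the box:

    # Swap config packages and re-run the smoke tests.
    sudo yum -y remove 'myapp-config-*'
    sudo yum -y install myapp-config-globex

    # Sensitive bits (TLS certs, credentials) are not in the rpm; a config
    # management run provides them after the package is installed.
    ansible-playbook -i inventory secrets.yml --limit test-host

    ./tests/smoke.sh test-host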

I had an innovative idea: use config management at build time in addition to deploy time. Just as Ansible (or Chef, Puppet, or Salt) would be used to manage the configuration of a server, it could be used to manage configs in a way that lets them be packaged. For example, we had a config file that had different values on different servers. We did a proof of concept where we used config management to create the config file during a Jenkins job and then packaged the file in an rpm.
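
The proof of concept boiled down to two build steps. This is a sketch rather than our exact Jenkins job; the playbook, variables, and spec file are hypothetical names:

    # Render the config with Ansible at build time, then package it.
    # render-config.yml is assumed to use the template module to expand
    # app.conf.j2 with per-environment variables.
    ansible-playbook render-config.yml -e target_env=staging -e dest="$(pwd)/build/app.conf"
    rpmbuild -bb --define "_sourcedir $(pwd)/build" packaging/myapp-config.spec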

The benefit of this is that you use a tool meant for config management to do its job, and you retain flexibility in when to use it. When it's faster to package all your configs upfront, you build n rpm packages. When it's faster to deploy the config as the application is deployed, you still use the same tool. You reduce duplication in where you store configs, and you get to decide at which stage of the life cycle those configs become actionable.

Many of our config files were identical except for a few parts. We built the CI pipeline so that we could quickly test a handful of configs and declare the whole batch ready for deployment. Of course, we still tested all of them periodically, but we no longer had to test them all the time.

We also instituted a principle: each Jenkins job that produces an artifact copies it to a consumable repo, so rpms go to rpm repos, debs to deb repos, and so on. Similarly, whenever a Jenkins job needed a package, it could decide which repo to pull from. This simplified how the team thought about chaining builds, since consumable artifacts became our level of abstraction. One Jenkins job did not have to care about other jobs, because all it consumed was an artifact from a repo and anything it produced was shipped off to a repo as well.
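
To make the principle concrete, here is a minimal sketch of the producing and consuming sides, assuming hypothetical paths and package names, and a yum .repo file named myapp-unstable already pointing at the repo directory:

    # Producer side of a Jenkins job: publish the rpm it just built.
    cp ~/rpmbuild/RPMS/x86_64/myapp-*.rpm /srv/repos/myapp/unstable/
    createrepo --update /srv/repos/myapp/unstable/

    # Consumer side of another job: decide which repo to draw from.
    sudo yum -y --disablerepo='*' --enablerepo=myapp-unstable install myapp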

We sped up our Docker image builds by reducing the number of layers each build had to create. This made our Dockerfiles harder for a newcomer to read, but the speed we gained was tremendous. Remember the project where one set of changes took more than two hours for a developer to test on their own? We reduced that to less than 20 minutes, and we have plans to reduce it even further.
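
The main trick was chaining commands so that several steps produce one layer instead of many. A tiny illustrative Dockerfile, with made-up package names and assuming our internal repo is already configured in the base image:

    FROM centos:7

    # Three separate RUN instructions would create three layers:
    #   RUN yum -y install myapp myapp-config-acme
    #   RUN myapp-setup
    #   RUN yum clean all
    # Chaining them produces a single layer and a noticeably faster build.
    RUN yum -y install myapp myapp-config-acme && \
        myapp-setup && \
        yum clean all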

Docker is amazing for building test environments. Use it to test your code and your various configs. Create lean images that build fast, push to the hub quickly, and pull in a jiffy. The more speed you can provide in this cycle, the happier your team and your CI will be.

Your Dockerfiles should be simple: start from a base image, install artifacts (code and config) from a repo, expose some ports, add a cmd or entrypoint, and you're done. If they get more complicated than that, it's probably better to create more OS packages than to stuff everything into the Dockerfile. Those packages then become reusable artifacts for other environments.
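
The shape we aimed for looks roughly like this sketch; the base image, repo file, package names, port, and binary are all placeholders:

    # Keep it boring: base image, packages from our own repo, ports, entrypoint.
    FROM centos:7

    # The code and config rpms come from our internal repo; anything more
    # elaborate than an install belongs in the packages, not in the Dockerfile.
    COPY myapp-unstable.repo /etc/yum.repos.d/
    RUN yum -y install myapp myapp-config-acme && yum clean all

    EXPOSE 8080
    ENTRYPOINT ["/usr/bin/myapp"]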

You should really mirror dependencies on your own servers. This helps you avoid issues where a dependency disappears from the internet, and it helps you track exactly what your dependencies are. Locking down dependency versions is a further step in avoiding surprises: you can be much more confident that what you built and tested will work the same way anywhere, as long as the dependencies remain exactly the same.
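
On the Red Hat side of the house, that combination might look like the following sketch; the mirror URL and package version are made up, and the version pin relies on the yum versionlock plugin:

    # Point builds at our own mirror instead of the internet.
    sudo yum-config-manager --add-repo https://mirror.example.internal/centos/base.repo

    # Pin the exact versions we built and tested against.
    sudo yum -y install yum-plugin-versionlock
    sudo yum versionlock add 'libfoo-1.2.3-1.el7'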

Always create source packages, like source rpms and debs. These can carry meta information, such as git commits, dependencies and their versions, and any other pertinent details. Source packages are a godsend when it comes to rebuilding the same artifact you built two years ago. They do not necessarily have to be built deterministically themselves, but they help in deterministically building the code and config packages.
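
With rpm, the source package falls out of the same tooling; a minimal sketch with hypothetical file names:

    # Build the source rpm alongside the binary packages and publish it too.
    rpmbuild -bs packaging/myapp.spec

    # Inspect the metadata carried by the source package.
    rpm -qpi ~/rpmbuild/SRPMS/myapp-*.src.rpm

    # Years later, rebuild the same artifact from the source package alone.
    rpmbuild --rebuild myapp-1.0-1.el7.src.rpm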

When it comes to systemd, I'm happy it exists because it has reduced our piles of init scripts to a clump of service files. We can trust it to provide a base set of common functionality, and we no longer need to write code to manage services; we write config files instead. For a small team that's a great boost in productivity.
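
A typical service file of the sort that replaced our init scripts is only a few lines; the binary path, user, and config path below are placeholders:

    [Unit]
    Description=myapp service
    After=network.target

    [Service]
    User=myapp
    ExecStart=/usr/bin/myapp --config /etc/myapp/app.conf
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target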

Finally, communication is key. Share your plans and progress with your team every step of the way. The more your team feels involved in making the changes, the better they will respond to things changing. Ask for opinions and war stories from other jobs. Listen, and be open to conceding your idea for a better one. A team is more than the sum of its parts only when it executes the same set of plans.