Inside the OpenGov Cloud: Evolution of Infrastructure and Operations—What’s Next
This is Part Three in a three-part series of posts about the OpenGov Engineering team’s journey in 2018 and the evolution of its infrastructure and operations. Part One focused on the “why and what” (i.e., the problem space). Part Two focused on the “how” (i.e., the solution space) and the wins that we secured. This part focuses on our plans for the continuation of that journey in 2019 and my reflections on lessons learned.
The EngOps team’s overarching goal for 2019 is accelerating value delivery for our business and application developers, which we intend to achieve more or less along the following lines.
In support of our developers: Look back at the three vectors of diversification that I described in this series—development, build, and deployment. We have only partially addressed build and deployment—there’s more work still to be done for these two vectors. For example, we intend to switch all of our applications to use a shared Jenkinsfile, Dockerfile, and Helm Chart; we intend to retire the Kubernetes deployment pipeline and switch to Spinnaker; we intend to introduce a stage in our CI pipeline for artifact vulnerability scanning using JFrog Xray; we intend for LaunchDarkly to be used consistently by all of our services (our legacy applications currently use a homegrown feature flagging implementation in one of our monoliths). But as far as diversification in development workflows goes, we are only getting started in earnest—we want to switch to a master-centric branching workflow for all of our code repositories; we want to standardize our “local development” environments. And I’m sure we will identify other opportunities. We now have a concerted “developer experience” program underway that is tackling these problems.
In support of our applications: At some point in 2019, we intend to start taking a serious look at a service mesh such as Istio to support cross-cutting concerns in a polyglot application world. With a service mesh, we can offload concerns such as inter-service encryption, circuit breaking, and request retries from application frameworks to the network itself. Another major benefit we intend to provide with this effort is improved observability (e.g., service graph, edge metrics, request tracing). We are also investigating how we can run Windows applications alongside our Linux ones in our Kubernetes clusters to support any potential inorganic expansion in our application ecosystem.
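To make the "offloading" point concrete, here is a minimal sketch of the kind of retry and circuit-breaking logic that applications otherwise have to carry in each language and framework. It is an illustration only: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries` are hypothetical, the thresholds are arbitrary, and this is neither our production code nor how Istio implements these features.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    """Hypothetical in-process circuit breaker (illustration only)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: allow one trial call; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def call_with_retries(breaker, fn, attempts=3, backoff=0.0, sleep=time.sleep):
    """Retry fn through the breaker, with exponential backoff between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not keep hammering a dependency that is known to be down
        except Exception as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(backoff * (2 ** attempt))
    raise last_error
```

Every service in a polyglot ecosystem would need its own equivalent of this code, tuned and maintained separately. With a mesh, the same behavior is expressed as configuration on the network layer instead.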
In support of our infrastructure: Our EngOps Kubernetes cluster will soon be running multiple applications—Jenkins, JFrog Artifactory, JFrog Xray, and Spinnaker. Jenkins workloads are purely driven by developer (PR and merge) activity. As you will see from the chart below, our cluster is scaling up and down fairly nicely based on demand. We are working with our application developers on enhancing our applications in production and testing clusters to use the autoscaler framework to a similar effect.
We are deploying the Aviatrix VPN solution to give internal users secure access to our non-prod and pre-prod assets. We need to simplify our networking a bit more by moving some of our legacy databases to new production VPCs. As we better characterize our workloads with the autoscaler, we intend to save additional dollars by purchasing AWS EC2 Reserved Instances. And, yes, we are keeping an eye on expanding our infrastructure internationally, though this is entirely business driven.
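For readers unfamiliar with demand-driven scaling of application workloads, the sketch below shows the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler. It assumes that the autoscaler framework applied to an application is something like the Horizontal Pod Autoscaler, which the text above does not specify. The function name, the replica bounds, and the example numbers are hypothetical, and the real controller handles more cases (missing metrics, pod readiness, stabilization windows) than this sketch does.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Approximate the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    ratio = current_metric / target_metric
    # Within the tolerance band, leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

Replica counts that track demand in this way are also what make usage patterns visible enough to decide how much steady baseline capacity is worth reserving.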
Engineering Learnings Beyond Tech
We covered a lot of technical ground in 2018, and 2019 is expected to be no different. But how we do what we do is equally important. Speaking of those general engineering practices, the following are some of the things I took away from 2018:
- Don’t over-optimize to move in a straight line, but instead focus on moving in the general direction of where you want to get to. Optimize to deliver big value, and if there is “throwaway work” along the way then so be it. (For example in our case: Kops to EKS, host-based routing to path-based routing, Kubernetes deployment pipeline in Jenkins to Spinnaker.) Make quick decisions and adjust them as needed and as time permits.
- Avoid cross-linking of projects. This often comes as a corollary of over-optimization. We end up linking one piece of work to another and before we know it, failure of one project leads to the failure of an entire effort. A good “top-down” or “bottom-up” analysis can generally show you the way. For example, in our case we limped along on the development vector while we addressed the build and deployment vectors, and even those two in different measures.
- Deliver big efforts with specific milestones. Rome wasn’t built in a day. Measurable progress keeps stakeholders engaged and keeps you motivated even when the end may not be in sight. Maintaining momentum is really important.
- Change is hard for many. Building consensus across many is harder. Welcome to the world of a “horizontal function” such as infrastructure and DevOps. Effective cross-team collaboration is key. Your company’s culture will guide you there. Clear accountability is equally important, and unfortunately, accountability often gets lost with collaboration. Specifying clear boundaries and contracts on who owns which part is key.
- “Build vs. Buy” decisions are important. You may not always be able to build everything that you need. Consider throwing money at “supporting pieces” of tech, either by contracting out or by using paid services instead of adopting open source equivalents. There will always be an opportunity (and reason) to swap such things out.
- I often like to say that much of what the EngOps team does is not only under the hood, but close to the bottom of the machinery. Work like this requires building support for your efforts with key stakeholders, which is best done by showing business value for customers, developers, applications, or business partners (e.g., finance, support).
- Unless you are a large company, and maybe not even then, dissociate tech patterns from team structures as much as possible. Team structures are especially fluid in relatively smaller organizations.
Lastly, I want to acknowledge the efforts of the following members and “honorary” members of the EngOps team for a remarkable 2018, and for supporting an equally inspiring vision for 2019: Cade Markegard, Craig Rueda, Diego Rodriguez, Jayson Barley, John Terry, Jonathan Hinds, Jono Spiro, Leena Joshi, Pushkala Pattabhiraman, Sako Mammadov, and Sid Dange.
Interested in contributing to OpenGov’s Engineering culture of innovation, leading-edge technology adoption, and quality? Check out our current job openings.