Introduction to DevOps

Lance Watanabe
16 min read · Dec 4, 2021

Whenever I hear DevOps, I think about deployments. Many people have different definitions and interpretations of DevOps. In my opinion, DevOps is simply a paradigm to improve the development lifecycle. In a technical sense, DevOps is the marriage between Developers and Operations. Some universal themes across DevOps are leveling up the performance of each human on the team, eliminating inefficiencies, and delivering product faster and with fewer bugs.

With the power of today’s computers, people are typically the biggest obstacle to development. Since each individual in each department of the development cycle has separate goals, there is often friction in the process. Project Managers are judged on the quantity of features released. Designers are judged on the user experience and aesthetics. Developers are judged on the number of features produced and the number of errors caused. Database admins are judged on database integrity. Operations is judged on the availability and reliability of the app. Management is judged on profitability, speed to market, employee turnover, and risk management.

With all of these conflicting motivations, we must implement policies that make processes more collaborative. Although a developer specializes in coding, he should be involved in the design phase so that he can understand the purpose of the design and help craft documentation. He could also sit in on project management meetings so he understands which projects and features are most urgent, and participate in the monitoring phase so that if the app goes down, he can assist a site reliability engineer in fixing the problem. This type of cross-integration breaks down the “wall of confusion” and builds trust between departments.

Let’s take a look at the entire Software Development Lifecycle (SDLC). The development life cycle is a never-ending loop. The developers plan, code, and test. Then, the app is deployed and monitored by the operations department. During monitoring, we find bugs and track users’ needs. Then, we plan to address these needs and start the cycle again.

To be able to level up all parties in this lifecycle, DevOps needs to have a hand in every step of the process. Let’s dive into these steps:

Plan:

Unified Vision: Have a unified vision that states who you are, what you do, and where you need to go

Timeline: Determine a timeline based on business needs. Such needs might include a marketing presentation, an investor meeting, or user demand.

Sizing: Estimate the capacity of each department. Sizing, or estimating the time it will take to build a feature, is nearly impossible. Many features that seem simple are extremely complex, and other priorities might arise that derail any estimate.

Story Points/Velocity: We can use Story Points to approximate the effort a particular project will require. Story points are often based on the Fibonacci sequence (1, 1, 2, 3, 5, 8, etc.), with a higher number requiring more effort. Based on historical performance, we can determine a “velocity” that estimates how many Story Points a developer can finish during a sprint.

Sprints: Using the aggregate velocity of the development team, we can approximate how long the project will take. Divide the project into “sprints” (often weekly) so you can track progress.
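
A minimal sketch of that arithmetic, assuming a made-up backlog size and team velocity:

```python
# Hypothetical numbers: 120 story points in the backlog, 25 points per sprint.
import math

backlog_points = 120
team_velocity = 25  # points the team historically completes per sprint

sprints_needed = math.ceil(backlog_points / team_velocity)
print(f"Estimated sprints: {sprints_needed}")  # -> Estimated sprints: 5
```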

Address Needs: Based on market research and user feedback, determine how to solve user problems

Prioritization: Categorize the projects based on “important” and “urgent.” If a project is important and urgent, we make it top priority. Conversely, if the project is unimportant and non-urgent, it goes to the bottom of the queue.
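
A toy sort along those two axes (the project names and flags are invented):

```python
# Sort so that (important, urgent) comes first and (unimportant, non-urgent) last.
projects = [
    {"name": "dark mode", "important": False, "urgent": False},
    {"name": "security patch", "important": True, "urgent": True},
    {"name": "report export", "important": True, "urgent": False},
]

queue = sorted(projects, key=lambda p: (not p["important"], not p["urgent"]))
print([p["name"] for p in queue])  # ['security patch', 'report export', 'dark mode']
```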

Minimum Viable Product (MVP): The team should be working toward a product with the basic features. Before you add all the bells and whistles, you should develop the basic idea to test if it’s worth investing more time and money into the project.

Architecture: Architects determine which coding languages, libraries, APIs, microservices, documentation, hardware, and software to use for your app.

Design: Some design principles:

  • Visibility of system status. The system should always keep users informed about what is going on. This includes validation and error messages.
  • Intuitive: You should put the elements in your interface near the things they affect.
  • Fail gracefully: It’s inevitable that the user will encounter errors. Whether the app raises the error or the user causes it, we need to instruct the user on how to proceed and prevent the error from happening again (a toy example follows this list).
  • Don’t allow the user to make difficult decisions: If you force the user to make difficult decisions, many times, he will unknowingly choose an option that will hurt his experience.
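
As a minimal illustration of failing gracefully (the file name and messages here are invented):

```python
import json

# Hypothetical example: catch a predictable error and tell the user what to do next.
def load_profile(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        # Fail gracefully: explain the problem and a way forward instead of crashing.
        print(f"Couldn't find '{path}'. Check the file name or create it with default settings.")
        return {}

profile = load_profile("settings.json")  # "settings.json" is a made-up file name
```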

Code:

Monolithic Architecture: A monolithic application is a unified app. The database, server-side application, and client-side application are wrapped into the same app. On one hand, it is simple to develop, test, and deploy. However, as the app grows, this architecture can 1) cause the application to underperform, 2) make upgrades and redeployments large, and 3) allow a tiny bug in one part of the app to bring down the entire app.

Microservices: In contrast to the monolithic architecture, microservices are a style of architecture that separates logic into loosely coupled services. Each component in the microservice is portable, replaceable, and less dependent on other parts of the application. We can target a particular part of the app that needs more resources (i.e. computing power, memory, etc.). When using microservices, create libraries containing shared logic that any service can access. Be sure to adopt a message-queuing solution (such as RabbitMQ or Kafka, often passing JSON messages) to notify each component of a change. Keep in mind that, for all these benefits, microservices are more complex and require more effort to construct. But the long-term gain can be beneficial if you plan to scale your app.
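
As a toy stand-in for that message-passing pattern (a real system would use a broker such as RabbitMQ or Kafka; this in-process queue only illustrates the decoupling):

```python
import json
import queue

# One service publishes an event; another consumes it. Neither calls the other directly.
events = queue.Queue()

def order_service_place_order(order_id: int):
    events.put(json.dumps({"event": "order_placed", "order_id": order_id}))

def email_service_handle_next():
    msg = json.loads(events.get())
    if msg["event"] == "order_placed":
        print(f"Sending confirmation email for order {msg['order_id']}")

order_service_place_order(42)
email_service_handle_next()  # -> Sending confirmation email for order 42
```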

Compiled Languages: Compiled languages translate a program into the machine code of the computer on which the program runs. The architecture of the computer must support the language into which the program has been compiled. A compiled language typically performs faster because it uses the native language of the computer (C, C++, C#, Go, Haskell, Rust).

Interpreted Languages: An interpreted language requires no consideration of infrastructure beyond having the interpreter installed (JavaScript, Ruby, Python).

Concurrency: Concurrency means making progress on multiple tasks during overlapping time periods, but not necessarily simultaneously. For example, two tasks can execute concurrently on a 1-core CPU: the CPU decides to run one task first and then the other, or to interleave parts of each. If the computer has only one CPU, the application never makes progress on more than one task at exactly the same instant, but more than one task is in flight at a time. It does not completely finish one task before it begins the next.
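
A minimal sketch of concurrency with Python threads (on CPython, threads interleave rather than run truly in parallel for CPU-bound work, which makes the point):

```python
import threading
import time

def task(name: str):
    for i in range(3):
        print(f"{name} step {i}")
        time.sleep(0.1)  # yield so the other thread can run in between

a = threading.Thread(target=task, args=("A",))
b = threading.Thread(target=task, args=("B",))
a.start(); b.start()
a.join(); b.join()  # A's and B's output interleaves: concurrency, not parallelism
```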

Parallelism: Parallelism means that an application splits its tasks up into smaller subtasks which can be processed in parallel, for instance on multiple CPU cores at the exact same time. Parallelism does not require two tasks to exist: it literally runs parts of one task, or multiple tasks, at the same time using the multi-core infrastructure of the CPU, by assigning one core to each task or subtask. Parallelism essentially requires hardware with multiple processing units. On a single-core CPU, you may get concurrency but NOT parallelism.
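
And a matching sketch of parallelism using multiple processes, which can genuinely run on separate cores (the workload is made up):

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n  # a CPU-bound subtask

if __name__ == "__main__":
    with Pool(processes=4) as pool:           # up to 4 subtasks run at once
        results = pool.map(square, range(8))  # split the work across cores
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```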

Code Reviews: Code reviews help to level up junior engineers, reduce errors, standardize formatting, and enable people to become familiar with other coding techniques.

Versioning: Every release should have a new version number (major.minor.patch). Starting from version 1.3.4, for example (see the sketch after this list):

  • A patch update (bug fixes) would make the current release 1.3.5
  • A minor update (new features) would increment to 1.4.0
  • A major update would put the release version at 2.0.0
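
A small sketch of those bump rules (the function name is invented):

```python
def bump(version: str, kind: str) -> str:
    major, minor, patch = map(int, version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # patch: bug fixes only

print(bump("1.3.4", "patch"))  # 1.3.5
print(bump("1.3.4", "minor"))  # 1.4.0
print(bump("1.3.4", "major"))  # 2.0.0
```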

Test:

Test-Driven Development: In this approach, you first write a test that describes the behavior you need, and then you write just enough code to make the test pass. This approach is effective but heavy-handed enough that many avoid it.
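
A minimal TDD-style sketch using Python’s built-in unittest (the slugify function is invented for this example):

```python
import unittest

# Step 1: write the test first; it fails until slugify exists and behaves correctly.
class TestSlugify(unittest.TestCase):
    def test_spaces_become_dashes(self):
        self.assertEqual(slugify("Hello DevOps World"), "hello-devops-world")

# Step 2: write just enough code to make the test pass.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

if __name__ == "__main__":
    unittest.main()
```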

Types of Tests: You can use tools like Selenium, Mocha, Jest, or Cypress to set up automated tests.

  • Unit tests: test an individual function
  • Integration tests: validate that multiple components are communicating properly
  • Regression tests: confirm that all previously working functionality and performance have been preserved
  • Visual/UI tests: the tool takes a baseline screenshot and compares it against the current screenshot
  • Performance tests: verify that the app remains fast and stable under load, for example by simulating many concurrent users

Deployments:

Continuous Integration/Continuous Delivery (CI/CD): CI/CD is a coding philosophy that drives development teams to implement small changes and update repositories frequently. The technical goal of CI is to establish a consistent and automated way to build, package, and test applications. CD automates the delivery of applications to selected infrastructure environments. Most teams work with multiple environments other than production, such as development and testing environments, and CD ensures there is an automated way to push code changes to them.

The cloud: The cloud means “someone else’s servers.” Cloud providers allow you to select only the services you need, when you need them. They make releasing code an automated, repeatable process, and they typically have high security standards.

Public Cloud: Public clouds are by far the most prevalent and relevant to DevOps. You accrue almost no overhead or up-front costs. You pay only for what you use, and you can scale up and down at will. However, it has multiple tenants (shared users). Tenants share hardware, storage, and networking with other users. The main advantages of a public cloud are lowered costs, lack of server maintenance, extremely flexible and capable scalability, and high reliability because of a large network of servers.

Private Cloud: A private cloud offers resources like a public cloud but for use exclusively by a single business. Only one user can access all services and infrastructure. No hardware is shared, and the private cloud eliminates security and compliance concerns for companies with extremely specific requirements, including governments and banks. Private clouds are more expensive and can require maintenance, but they do permit more flexibility in customizing the cloud environment.

N + 2 servers: Any application or service with 99.999% availability needs to exist on N + 2 servers. If you have one main server in use, you actually need to have three instances of that server. You have to allow for one machine to be down for scheduled maintenance, which leaves you redundancy if one of the remaining machines goes down because of an unforeseen issue.
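
A back-of-the-envelope sketch of why redundancy buys availability (the 99% per-server uptime figure and the independence assumption are invented for illustration):

```python
# If each of 3 instances is independently up 99% of the time, all three are
# down simultaneously only 0.01 ** 3 of the time.
per_server_uptime = 0.99
servers = 3

availability = 1 - (1 - per_server_uptime) ** servers
print(f"{availability:.6f}")  # 0.999999 -> roughly "six nines" across 3 instances
```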

Infrastructure as a Service (IaaS): IaaS provides rented IT infrastructure — low-level network infrastructure via abstracted APIs. You can spin up servers and VMs, storage, backups, and networks. Every service is set up to be pay-as-you-go. You pay for only the resources you use.

Platform as a Service: Platform as a Service (PaaS) is designed to increase the speed at which engineers develop, test, and release their code. With PaaS, developers can develop, test, release, and maintain their applications despite having little to no knowledge of the underlying infrastructure. PaaS abstracts servers, storage, databases, middleware, and network resources.

Serverless Apps: Enable you to run logic without pushing code to a server that you control (serverless functions, aka Lambda functions on AWS).

App deployment: Platform as a Service (PaaS) solution for deploying applications in a variety of languages, including Java, .NET, Python, Node.js, C#, Ruby, and Go

  • Azure: Azure Cloud Services
  • AWS: AWS Elastic Beanstalk
  • GCP: Google App Engine

Virtual machine (VM) management: Infrastructure as a Service (IaaS) option for running virtual machines (VMs) with Linux or Windows

  • Azure: Azure Virtual Machines
  • AWS: Amazon EC2
  • GCP: Google Compute Engine

Blue-Green Deployments: Let’s say your current production is on v1.0.4. You’re ready to release a minor update, which will take you to v1.1.0. Before you release the new version, only v1.0.4 is running in production. To ensure that the new version behaves as expected in your production environment, you release v1.1.0 to production but route all traffic to the stable v1.0.4. Both versions are running in production, but nothing has changed for customers. After you’re confident that the new version of your software is stable and ready for customer traffic, it’s time to make the switch.

Canary Deployments: Canary releases ship software changes to select customers as a way of testing functionality and reliability in production while limiting the number of customers potentially impacted. You can select the percentage of customers the release is delivered to, or target customers based on demographic information or location.
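
A toy sketch of percentage-based canary routing (the hashing scheme, version labels, and 10% figure are invented; real systems typically do this at the load balancer):

```python
import hashlib

CANARY_PERCENT = 10  # route 10% of users to the new version

def route(user_id: str) -> str:
    # Hash the user ID so each user consistently lands on the same version.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v1.1.0 (canary)" if bucket < CANARY_PERCENT else "v1.0.4 (stable)"

for user in ["alice", "bob", "carol"]:
    print(user, "->", route(user))
```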

Rolling Deployments: A rolling deployment is a deployment strategy that slowly replaces previous versions of an application with new versions of an application by completely replacing the infrastructure on which the application is running. For example, in a rolling deployment in Amazon ECS, containers running previous versions of the application will be replaced one-by-one with containers running new versions of the application. A rolling deployment is generally faster than a blue/green deployment; however, unlike a blue/green deployment, in a rolling deployment there is no environment isolation between the old and new application versions.

Cloud Providers / Distributed Systems: The majority of companies are beginning to take advantage of pay-as-you-go cloud hosting. In large part, this move to cloud hosting is happening because running applications at scale requires efficient use of infrastructure. The costs of underutilizing hardware add up quickly. Distributed systems have become the norm, mainly because of cloud services. Multitenancy allows multiple customers to take advantage of shared resources, which keeps costs low by maximizing the use of those resources. If you use a cloud provider like Azure or AWS, the components of your system run on machines spread across a particular region (or regions).

Containers: Lightweight environments in which you can run your application.

  • Temporary: You can create and destroy containers within seconds. The life span of a container is brief, sometimes only a few hours.
  • Immutable: Containers can’t be updated. After an image is built, it can never be changed. Instead, a new image must replace it.
  • Scalable: The scalability of containers is an enormous advantage, but it also drastically increases the number of machines in your environment.
  • No storage: Unlike VMs or bare-metal servers, application data can’t be stored directly in a container.
  • Require Monitoring: The performance and security of containers require management through the use of an orchestrator or monitoring tool.

Container Life Cycle: Containers have five states: defined, tested, built, deployed, and destroyed. At the start of the life cycle, the container is defined via a Dockerfile that includes runtime, frameworks, and application components. Next, the source code is pushed through a CI system to be tested. The container is built and exposed to the orchestration system, where it is replicated and distributed throughout the cluster. Finally, because containers can never be patched, a container is destroyed and replaced.

Image: An image is a snapshot of a container. You can create a container from an image. Containers have isolated CPU, memory, and network resources while sharing the operating system kernel; they hold source code, system tools, and libraries. In this respect, containers behave like lightweight virtual machines. You can’t change or update the snapshot. Images are stored in a registry and ideally layered to save disk space. Image layers are immutable instructions that allow a container to be created using references to shared information. For example, imagine building two containers that are identical up until the last two lines of instructions. Instead of building two images from scratch, layers let you reference cached layers and rebuild only the last two.

Orchestrators: Orchestrators help you manage sets of containers for applications running in production on multiple containers or using a microservice architecture. An orchestrator is essentially a manager that you can use to automatically scale (add additional resources) your cluster with multiple instances of each image, specify memory limits for each container, instantiate new containers, suspend or kill instances when required, and control each container’s access to resources such as storage and secrets.

Kubernetes: Kubernetes, also known as K8s or Kube, is the most popular Docker container orchestrator. It is also open source. You can use Kubernetes to manage containerized applications as well as automate deployments. By sorting containers into groups referred to as “pods,” Kubernetes streamlines workload scheduling. Kubernetes maximizes resources, controls deployments, and enables your applications to self-heal through autoplacement, autorestart, and autoreplication.

Azure Kubernetes: Azure Kubernetes Service (AKS) is a managed Kubernetes orchestrator.

OpenShift: OpenShift is Red Hat’s enterprise container application platform. Built on Kubernetes, OpenShift adds features to enable rapid application development, easy deployment, and life cycle maintenance. It leverages automation and dynamically provisions storage.

Docker Swarm: Docker Swarm is the native clustering and scheduling tool for Docker containers. It uses the Docker CLI to deploy and manage containers while clustering nodes, allowing users to treat nodes as a single system. Users create a primary manager instance and multiple replicas. This redundancy ensures continued uptime in case of failure. Manager and worker nodes can be deployed at runtime. It’s a fast and scalable orchestrator; Swarm has been successfully scaled up to 30,000 containers. Swarm is included in Docker Engine and, unlike other solutions, doesn’t require initial setup and installation.

Amazon ECS: ECS helps run Docker containers across Amazon Elastic Cloud Compute (EC2). ECS is compatible with a serverless architecture, and you can use the built-in scheduler to trigger container deployment based on resource availability and demand. ECS is capable of scaling clusters to more than 10,000 containers, which can be created and destroyed within seconds. Amazon ECS is ideal for small teams who rely heavily on Amazon and don’t have the resources to manage bespoke orchestration and infrastructure.

Orchestration Configuration: All configuration logic lives in a definition file, referred to (depending on the tool) as a “cookbook” (Chef), “playbook” (Ansible), or “manifest.” Configuration management tools include Chef, Ansible, and Puppet.

Secrets: Secrets are objects that contain sensitive information such as a username, password, token, key, or SSL certificate. This type of data should never be stored unencrypted in a Dockerfile or source code. Containers are lightweight because they contain less information than a traditional VM, which is great for efficiency but requires additional security considerations.

Deployment Automation:

  • Integration: merging code from multiple developers
  • Deployment: Publishing the code onto services
  • Infrastructure: Configuring the hardware to run the code
  • SSL renewals: keeping TLS/SSL certificates up to date
  • Domain Name System (DNS) resolutions
  • Load balancer health checks
  • Data / Error logs

Automation Tools:

  • Jenkins: Jenkins is probably the most well-known automation tool. It is an open-source automation tool written in Java with plugins built for Continuous Integration purposes. With Jenkins, you must set up your own servers. Jenkins is configured using a yaml config file in the repo. Of note, since Jenkins is open-source, there is no official support for this configuration. Once configured, Jenkins is used to build and test your software projects continuously, making it easier for developers to integrate changes to the project and for users to obtain a fresh build. Plugins allow the integration of various services. As of this writing, Jenkins has 1,800 plugins. If you want to integrate a particular tool, you need to install the plugins for that tool (for example, Git, Maven, Amazon EC2, or HTML Publisher). Jenkins supports parallel builds.
  • CircleCI: CircleCI is proprietary software for CI/CD. Unlike Jenkins, CircleCI sets up your servers (cloud hosting) and infrastructure. As such, users are reliant on CircleCI’s service; if it goes down, you won’t be able to run builds. It uses plugins (called Orbs) for additional functionality. Similar to Jenkins, CircleCI is configured using a yaml config file stored inside the repo. As proprietary software, CircleCI provides support for its service.
  • Zeet: Zeet streamlines and automates the deployment process in a 4-step flow with a simple user interface: just connect your GitHub repo or Docker image. It is a proprietary platform that aggregates multiple automation technologies so that your DevOps team does not need to configure and maintain services such as Terraform or Ansible. Since both Terraform and Ansible are open-source, there is no support team to handle your needs; Zeet, however, will support your specific needs.
  • Netlify: An app that streamlines deployment automation with a simple UI. You just upload your code and Netlify will build your app. There is a free plan that allows one concurrent build, 100GB/month of bandwidth, 300 build minutes, and 125,000 serverless function requests per month.
  • Terraform: Open-source service that specializes in provisioning, i.e. setting up your infrastructure. Terraform uses a declarative approach, which means it keeps track of a configuration “state.” If the current state differs from the desired state, Terraform will update the infrastructure to reflect the desired state (a toy reconciliation loop is sketched after this list). Let’s say you instruct Terraform to create a database, a Kubernetes cluster, and a virtual private cloud (VPC). Then, you instruct Terraform to bind the database to the Kubernetes cluster. Since you are using a declarative approach (implicit dependency), Terraform will know that it needs to create the Kubernetes cluster before binding the database to it.
  • Ansible: Open-source service that specializes in configuring the steps needed to run your app. Ansible executes in a top-to-bottom manner. You instruct Ansible to start the virtual machine => install app code => install dependencies => start app.
  • GitHub Actions: Integrated directly into github.com and configured with a yaml file. Since GitHub Actions is much newer (released in 2019), it doesn’t have as many features as Jenkins.
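
To illustrate the declarative idea Terraform relies on, here is a toy reconciliation loop (invented for illustration; this is not Terraform’s actual engine):

```python
# Compare desired state to current state and apply only the difference.
desired = {"database": 1, "k8s_cluster": 1, "vpc": 1}
current = {"vpc": 1}

def reconcile(current: dict, desired: dict):
    for resource, count in desired.items():
        have = current.get(resource, 0)
        if have < count:
            print(f"creating {count - have} x {resource}")
            current[resource] = count
    for resource in list(current):
        if resource not in desired:
            print(f"destroying {resource}")
            del current[resource]

reconcile(current, desired)  # creates database and k8s_cluster; vpc is untouched
```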

Block storage: Storing data on a hard drive in fixed-size blocks

Virtual Private Cloud (VPC): Logically isolated but shared compute services

Content Delivery Network (CDN): Delivers content from servers near the user’s location, using caching and load balancing

Domain Name System (DNS): Translates domain names to IP addresses

Single Sign-On (SSO): Access to multiple systems like Google, Twitter, and GitHub using the same login

Identity and Access Management (IAM): Role-based user access management.

Cloud Shell: Access from a command line within a browser

Operate:

Contracts: Many of the performance standards that the company will need to achieve are written into a Service Level Agreement.

  • Service Level Agreement (SLA): As agreed upon with a client, the percent of time the service is up and running. This is typically 99.999%.
  • Service Level Objective (SLO): The internal target for the percent of time the service is up and running. This percent must be stricter than the SLA.
  • Service Level Indicator (SLI): A measurement (such as the fraction of successful requests) used to judge whether the SLO is being met.
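
A tiny sketch of checking an SLI against an SLO (the request counts and target are invented):

```python
slo = 0.9999                      # internal target: 99.99% of requests succeed
total_requests = 1_000_000
failed_requests = 73

sli = (total_requests - failed_requests) / total_requests  # the indicator
print(f"SLI: {sli:.6f}", "meets SLO" if sli >= slo else "SLO violated")
```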

Monitor: We need to monitor to determine if our app is underperforming. We also need to track the productivity of our team.

Telemetry: Collecting data on your systems. Telemetry is also handy in the case of a service-level agreement (SLA), which is essentially your promise of availability to customers. An SLA is typically a legal contract that promises a certain level of performance, such as 99.999% availability.

Types of data to track

  • Disk usage
  • Number of exceptions
  • Server Traffic
  • Load times
  • Deployment lead time: the time to go from committing code to running code in production. Top performers average about 1 hour, while the industry average is about a month.
  • Deployment frequency
  • Response times
  • Mean Time to Recover (MTTR): How long to restore a service after an incident
  • Change failure rate: whenever there is a change, how often did it cause a problem
