Many don’t understand DevOps. It is more than just development and operations. Different aspects of it need to be understood and respected.
Let’s go through them, as Gene Kim has already demonstrated.
Myths include the following:
DevOps is only for startups
- Pioneered by internet unicorns like Google, Amazon, Netflix & Etsy
Key problems these companies once faced include:
- Hazardous code releases prone to failure
- Inability to release features fast enough to beat the competition
- High levels of distrust between Development and Operations.
DevOps replaces Agile
- DevOps principles and practices are compatible with Agile.
- DevOps is the logical continuation of the agile journey that began in 2001.
- Agile serves to enable DevOps
- DevOps practices emerge when we keep our code always in a deployable state
- Developers check into the trunk daily
- Demonstrate features in production-like environments
DevOps is incompatible with information security and compliance
- Security and compliance are integrated into every stage of daily work in the software development lifecycle
- Security and compliance integration results in better quality, security and compliance outcomes.
DevOps means eliminating IT operations, or “NoOps.”
- IT Ops work will naturally change with DevOps, but it remains as important as ever.
- IT Ops collaborates far earlier with Development, which continues to work with IT Ops long after code has been deployed into production.
- IT Ops enables developer productivity through APIs and self-service platforms that create environments, test & deploy code, monitor and display production telemetry, etc…
- IT Ops becomes more like development, where the product is the platform developers use to safely, quickly, and securely test, deploy and run their IT services in production.
DevOps is just “Infrastructure as code” or automation
- DevOps requires cultural norms and an architecture that allows shared goals to be achieved throughout the IT value stream.
DevOps is only for Open Source Software
- Achieving DevOps outcomes is independent of the technology being used.
Dev & Ops become DevOps
By working together toward a common goal, they enable
- The fast flow of planned work into production
- World-class stability, reliability, availability and security
Toward a common goal
- Cross-functional teams test their hypotheses about which features will delight users and advance the organizational goals.
- Cross-functional teams actively ensure their work flows smoothly and frequently throughout the entire value stream without causing chaos and disruption to IT Ops or any other internal or external customer.
- QA, IT Ops, and InfoSec work on ways to reduce friction for the team, creating work systems that enable developers to be more productive and get better outcomes.
- By integrating QA, IT Ops, and InfoSec, as well as delivery teams and automated self-service tools and platforms, teams can use that expertise in their daily work without being dependent on other groups.
- Organizations create a safe system of work where small teams can quickly and independently develop, test and deploy code and value quickly, safely, securely and reliably to customers.
Outcomes
- Maximize developer productivity;
- Enable organizational learning;
- Create high employee satisfaction;
- Win in the marketplace.
How it is today
- Dev and IT Ops are adversaries
- Testing and InfoSec happen only at the end of a project –> too late to correct errors that are found
- Critical activities require too much manual effort and too many handoffs, leaving people always waiting
Consequences
- Extremely long lead times
- Quality of work, especially production deployments, is also problematic and chaotic
- Negative impacts are produced on our customers and our business
- We fall short of our goals
- The whole organization is dissatisfied with IT performance
- Budgets are reduced
- Unhappy employees feel powerless to change the process and its outcomes
Manufacturing revolution in the 1980s
Adopted lean principles and practices
Make improvements to the following:
- Plant productivity
- Customer lead times
- Product Quality
- Customer satisfaction
Introduction of DevOps in 2010
- With faster hardware, software and cloud deployments, features (and even entire startup companies) can be created in just weeks.
- Deployment to production in just hours or minutes
- Deployments become routine and low-risk
- Businesses able to test new ideas and run experiments
- Businesses discover which ideas create the most value for customers more efficiently and effectively
- Rapid, safe and secure deployment to production
Organizations unable to deploy quickly to the market are destined to lose to more nimble competitors. Regardless of the industry, how we acquire customers and deliver value depends on the technology value stream.
The Problem and Chronic Conflict
- Most organizations are unable to deploy production changes in minutes or hours
- Production deployments are not routine
- Production deployments involve outages, chronic firefighting and heroics
- A core conflict exists within these technology organizations
Chronic conflict
The conflict between Dev + Ops creates a downward spiral resulting in:
- slower time-to-market
- reduced quality
- increased outages
- increasing technical debt
Technical debt describes how decisions we make lead to problems that become increasingly difficult to fix over time:
- Reduces available options in the future
- There are often competing goals between Dev & Ops
Goals of IT Organisations
- Respond to the rapidly changing competitive landscape
- Provide stable, reliable and secure service to the customer
Development objectives
- Development takes responsibility for responding to changes in the market
- Development deploys features and changes into production as fast as possible
Operations objectives
- Responsible for providing customers with IT service that is stable, reliable and secure –> consequence: makes it virtually impossible for anyone to introduce production changes that could jeopardize production.
Dev + Ops have opposed goals and incentives.
Core conflict: when organizational measurements and incentives across different silos prevent the achievement of organizational goals –> prevent achieving desired business outcomes. These chronic conflicts put technology workers into situations that lead to:
- poor software and service quality, lousy customer outcomes
- the daily need for workarounds
- firefighting
- heroics
The downward spiral in 3 acts
IT Operations
Goal: Keep applications and infrastructure running so that our organization can deliver value to customers
Many problems are due to applications and infrastructure that are:
- Complex
- Poorly documented
- Incredibly fragile
Outcome:
- Lots of technical debt and workarounds
- Systems most prone to failure are also our most important and at the epicentre of our most urgent changes.
- When our most urgent changes fail, we may risk the following and jeopardize our most critical organizational promises: availability to customers, revenue goals, security of customer data, accurate financial reporting, etc.
Compensation for the last broken promise
Cause: product managers promise bigger, bolder features to impress customers
Outcome:
- Oblivious to what the technology can and can’t do, they commit the technology organization to promises it can’t keep
- Development is tasked with another urgent project that requires solving new technical challenges and cutting corners to meet the promised release date, further adding to technical debt.
Getting busier -> for what?
Outcome: loss of market share
Consequence: When IT fails, the organization fails
How the downward spiral starts
- Everybody gets a little busier
- Work takes a little more time
- Communications become slower
- Work queues get a little longer
- Work becomes more tightly coupled
- Smaller actions cause bigger failures
- Become fearful and less tolerant of changes
- Work requires more communication, coordination and approvals
- Teams must wait longer for their dependent work to get done
- Quality keeps getting worse
- Production code deployments are taking longer to complete
- Deployment outcomes have become problematic
- The ever-increasing number of customer outages
- More heroics and firefighting in operations
- Inability to pay down technical debt
- Product delivery cycles slower and slower
- Fewer projects are undertaken, and those that are become less ambitious
- Feedback becomes slower and weaker
- Feedback from customers slows down
- Things seem to get worse
- No longer able to respond quickly to the changing competitive landscape
- Inability to provide stable, reliable service to our customers
Two facts:
- Every IT organization has two opposing goals
- Every company is a technology company
Benefits of DevOps
DevOps enables organisations to improve
- Organizational performance
- The achievement of goals across all functional technology roles: Dev, Ops, InfoSec, QA
- The human condition
Core advantages and general checklist to observe
- Developers independently implement their features.
- Developers validate the correctness of their features in production-like environments.
- Developers have their code deployed to production quickly, safely and securely.
- Code deployments are routine and predictable.
- Deployments occur throughout the business day when everyone is already in the office without customers noticing.
- Everyone can see the effects of their actions by creating fast feedback loops at every step of the process.
- When changes are committed to version control, fast, automated tests are run in production-like environments.
- DevOps gives continual assurance that the code and environments operate as designed.
- Deployments are always secure.
- Automated testing helps developers discover their mistakes quickly, enabling faster fixes and genuine learning.
- Code and environments operate as designed and are always secure and deployable.
- Automated tests help developers discover their mistakes quickly.
- Instead of accruing technical debt, problems are fixed as they are found.
- Global goals outweigh local goals.
- Pervasive production telemetry in our code and production environments ensures that problems are detected and corrected quickly.
- The architecture allows small teams to work safely and decoupled from the work of other teams.
- Teams work independently and productively in small batches, quickly and frequently delivering new value to customers.
- High-profile product and feature releases become routine by using dark launch techniques.
- Instead of firefighting for days, we merely change a feature toggle or configuration setting.
- Features can be automatically rolled back if something goes wrong.
- Releases are controlled, predictable, reversible, and low-stress.
- All sorts of problems are being found and fixed early when they are smaller, cheaper and easier to correct.
- With every fix, we generate organizational learnings, allowing us to prevent the problem from recurring.
- Everyone is learning, fostering a hypothesis-driven culture where the scientific method is used to ensure nothing is taken for granted.
- We treat product development and process improvements as experiments.
- We keep long-term teams intact so they can keep iterating and improving, using those learnings to achieve their goals.
- Instead of a culture of fear, we have a high-trust, collaborative culture where people are rewarded for taking risks.
- People can fearlessly talk about problems.
- Everyone wholly owns the quality of their work.
- People use peer reviews to gain confidence that problems are addressed long before they impact the customer.
- When something goes wrong, we conduct blameless postmortems to understand what caused the accident and how to prevent it.
- We reinforce a culture of learning.
- We care about quality so much that we even inject faults into our production environment so we can learn how our system fails in a planned manner.
- We conduct planned exercises to practise large-scale failures, randomly killing processes and compute servers in production.
- We inject network latencies and other nefarious acts to ensure we grow resilient.
- We enable organizational learning and improvement.
- Everyone owns their work, regardless of their role in the technology organization.
- Employees have confidence that their work matters and meaningfully contributes to organizational benefits.
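The feature-toggle and rollback items in the checklist above can be illustrated with a minimal sketch. The `FeatureToggles` class, the flag name `new_checkout` and the percentage-based rollout are hypothetical names chosen for illustration; they are not taken from the source or from any particular library.

```python
import hashlib


class FeatureToggles:
    """In-memory feature toggles (hypothetical; real systems usually load these from config)."""

    def __init__(self):
        # flag name -> percentage of users (0-100) who see the feature
        self._rollout = {}

    def set_rollout(self, flag, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag, user_id):
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        # Hash flag and user together so each user lands in a stable bucket 0..99
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


def checkout(toggles, user_id):
    # The new code path is deployed but stays dark until the toggle is raised
    if toggles.is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

In this sketch, setting the rollout to 0 keeps the deployed code dark, raising it exposes the feature to a growing share of users, and setting it back to 0 acts as the rollback, with no redeployment needed.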
The Business Value of DevOps
DevOps Practices
- Throughput metrics
- Code and change deployments (30x more frequent)
- Reliability metrics
- Production deployments (60x higher change success rate)
- Mean time to restore services (168x faster)
- Organizational performance metrics
- Productivity, market share, and profitability goals (2x more likely to succeed)
- Market capitalization growth (50% higher over three years)
Value of DevOps
- High performers were both agile and more reliable, empirical evidence that DevOps enables us to break the core, chronic conflict
- “Code committed” to “successfully running in prod” was 200x faster. Lead time is measured in minutes or hours instead of weeks or months.
- High performers twice as likely to exceed profitability, market share and productivity goals
- Higher employee job satisfaction
- Lower rates of employee burnout
- Employees 2x more likely to recommend their employer to friends as a great place to work
- Better information security outcomes: 50% less time spent remediating security issues, achieved by integrating security into all stages of the development and operations processes.
DevOps helps scale developer productivity
Increasing the number of developers for a project significantly decreases developer productivity due to overhead in communication, integration, and testing.
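One common way to picture the communication overhead (an assumption added here, not something the source states) is to count the possible pairwise communication paths in a team, which grows as n(n-1)/2. The function name `communication_paths` is hypothetical.

```python
def communication_paths(team_size):
    """Number of distinct pairs of people who may need to coordinate.

    Assumes every team member may need to talk to every other member,
    which gives n * (n - 1) / 2 paths for a team of n people.
    """
    if team_size < 0:
        raise ValueError("team_size must not be negative")
    return team_size * (team_size - 1) // 2


for size in (5, 10, 50):
    print(size, "developers ->", communication_paths(size), "paths")
# prints:
# 5 developers -> 10 paths
# 10 developers -> 45 paths
# 50 developers -> 1225 paths
```

Under this simple model, a tenfold increase in team size produces roughly a hundredfold increase in coordination paths, which is one reason small, independent teams are attractive.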
The following combination enables small teams of developers to act quickly, safely and independently to develop, integrate, test and deploy changes into production.
- The right architecture
- The right technical practices
- The right cultural norms
Problems to overcome:
- Catastrophic deployments
- Problems with availability
- Problems with security
- Problems with compliance
DevOps is the result of applying the following:
- Flow: accelerate delivery of work from Dev+Ops
- Feedback: Create safer systems of work
- Continual learning & experimentation: foster a high-trust culture and a scientific approach to improvement
History of DevOps
DevOps is the combination of the following knowledge
- Knowledge from lean
- Theory of constraints
- Toyota production system
- Resilience engineering
- Learning organisations
- Safety culture
- Human factors
Valuable contexts that DevOps draws from
- High-trust management cultures
- Servant leadership
- Organisational change management
The outcome of implementing DevOps
- World-class quality, reliability and security
- Lower cost and effort
- Accelerated flow
- Reliability throughout the technology value stream, including product management, development, QA, IT operations and InfoSec
DevOps is the logical continuation of the agile software journey that began in 2001
History of Lean
Value Stream Mapping, Kanban Boards and Total Productive Maintenance were codified for the Toyota Production System in the 1980s
3 of Lean’s major tenets are the following:
Manufacturing Lead Time
- The lead time required to convert raw materials into finished goods is the best predictor of quality, customer satisfaction, and employee satisfaction
Lean Principles
Focus on how to create value for the customer through systems thinking
- Creating constancy of purpose
- Embracing scientific thinking
- Creating flow and pull
- Assuring quality at the source
- Leading with humility
- Respecting every individual
Value streams
- The sequence of activities an organisation undertakes to deliver upon a customer request
History of Agile
The Agile Manifesto was created in 2001 by 17 of the leading thinkers in software development
Focus: create a lightweight set of values and principles as an alternative to heavyweight software development processes such as waterfall and methodologies like the Rational Unified Process
The key principle of agile: “deliver working software frequently, from a couple of weeks to a couple of months, with a preference for the shorter timescale”
- The desire for small batch sizes
- Incremental releases
- Need for small, self-motivated teams
- Work in high-trust management model
Agile is credited for dramatically increasing the productivity of many development organisations
Agile Infrastructure and Velocity Movement
Patrick Debois and Andrew Shafer introduced the idea of applying agile principles to infrastructure rather than application code.
In 2009, John Allspaw and Paul Hammond presented “10+ Deploys per Day: Dev and Ops Cooperation at Flickr”, which described:
- Creation of shared goals between Dev & Ops
- Using continuous integration practices to make deployment part of everyone’s daily work
The term “DevOps” was coined in 2009 at the first DevOpsDays, organized by Patrick Debois in Ghent, Belgium
Continuous Delivery Movement
Continuous delivery
- Creation of a deployment pipeline
- Ensure that code and infrastructure are always in a deployable state
- All code checked into the trunk can be safely deployed into production.
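The deployment pipeline idea above can be sketched as a sequence of automated gates that every change must pass before it counts as deployable. The stage names and the dictionary keys (`builds`, `unit_tests_pass`, `acceptance_tests_pass`) are hypothetical and stand in for real build and test steps.

```python
def run_pipeline(change, stages):
    """Run each stage in order and stop at the first failure.

    Returns (deployable, results), where results lists the stages
    that actually ran and whether each one passed.
    """
    results = []
    for name, check in stages:
        passed = bool(check(change))
        results.append((name, passed))
        if not passed:
            return False, results  # later stages never see a broken change
    return True, results


# Hypothetical stages: each inspects a dict describing the change.
EXAMPLE_STAGES = [
    ("build", lambda change: change.get("builds", False)),
    ("unit tests", lambda change: change.get("unit_tests_pass", False)),
    ("acceptance tests", lambda change: change.get("acceptance_tests_pass", False)),
]

good = {"builds": True, "unit_tests_pass": True, "acceptance_tests_pass": True}
broken = {"builds": True, "unit_tests_pass": False, "acceptance_tests_pass": True}
print(run_pipeline(good, EXAMPLE_STAGES)[0])    # prints True
print(run_pipeline(broken, EXAMPLE_STAGES)[0])  # prints False
```

Stopping at the first failed stage is a deliberate choice in this sketch: it gives the developer the fastest possible feedback and keeps the trunk from drifting out of a deployable state.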
Toyota Kata
- Codification of the Toyota Production System
- Mike Rother helped develop the Lean Toolkit
The term “improvement Kata” means:
- Every organisation has work routines
- Improvement kata requires creating a structure for the daily, habitual practice of improvement work
- Daily practice improves outcomes
- Setting weekly target outcomes and continual improvement of daily work is what guided the improvement of Toyota
Manufacturing Value Streams
Value Streams
The sequence of activities required to design, produce and deliver a good or service to a customer, including the dual flows of information and material
- The customer order is received
- Raw materials are released onto the plant floor
Achieve Relentless Focus
- Use small batch sizes
- Reduce work in process (WIP)
- Prevent rework to ensure we don’t pass defects to downstream work centres
- Constantly improve and optimise our system toward global goals
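Limiting work in process, as listed above, can be sketched as a work centre that refuses new work once it is full. The `KanbanColumn` class and its method names are hypothetical, chosen only to illustrate the idea.

```python
class KanbanColumn:
    """A work centre that refuses new work once its WIP limit is reached."""

    def __init__(self, name, wip_limit):
        if wip_limit < 1:
            raise ValueError("wip_limit must be at least 1")
        self.name = name
        self.wip_limit = wip_limit
        self.items = []

    def pull(self, item):
        """Accept an item only if there is capacity; return whether it was accepted."""
        if len(self.items) >= self.wip_limit:
            return False
        self.items.append(item)
        return True

    def finish(self, item):
        """Mark an item as done, freeing capacity for the next one."""
        self.items.remove(item)


testing = KanbanColumn("testing", wip_limit=2)
print(testing.pull("story A"))  # prints True
print(testing.pull("story B"))  # prints True
print(testing.pull("story C"))  # prints False: the column is full
```

Refusing the third item makes the bottleneck visible immediately, rather than letting work pile up silently in a queue.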
Technology Value Streams
In DevOps, the technology value stream is the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer
Inputs include:
- Formulation of business objectives
- Formulation of concepts
- Formulation of an idea
- Formulation of a hypothesis
Outcome: adding inputs to our committed backlog of work
Development teams follow an agile process
- Transform idea into a user story
- Implement code into an application
- Code is checked into the version control
- Change is integrated
- Testing is conducted with the rest of the software system
Value generation
- Value is created only when our services are running in production
- We must ensure that we deliver fast flow and that our deployments can be performed without causing chaos and disruptions such as service outages, service impairments, or security or compliance failures.
Deployment Lead Time in Minutes
Success with DevOps lead times looks like the following:
- Developers receive fast, constant feedback on their work
- Developers are enabled to quickly and independently implement, integrate, and validate their code, and have the code deployed into a production environment
- Developers check small code changes into the version control repository, perform automated and exploratory testing against them, and deploy them into production.
- This enables a high degree of confidence that our changes will operate as designed in production and that any problems can be quickly detected and corrected.
Deployment lead time is measured in minutes or, in the worst case, hours.
Observing “% C/A” as a measure of rework
- A key metric in the technology value stream is percent complete and accurate (“% C/A”)
- Reflects the quality of the output of each step in our value stream
- % C/A can be obtained by asking downstream customers what percentage of the time they receive work that is “usable as is”
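As a small worked example of the metric, % C/A for one step is the share of handoffs the downstream customer could use without rework. The function names are hypothetical. The second function multiplies the per-step values into a single "rolled" figure, a convention from value stream mapping that the text above does not describe, so treat it as an assumption.

```python
def percent_complete_accurate(usable_as_is, total_received):
    """% C/A for one step: share of received work usable without rework."""
    if total_received <= 0:
        raise ValueError("total_received must be positive")
    if not 0 <= usable_as_is <= total_received:
        raise ValueError("usable_as_is must be between 0 and total_received")
    return 100.0 * usable_as_is / total_received


def rolled_percent_complete_accurate(step_percentages):
    """Chance (in %) that work passes every step with no rework.

    Assumes the steps are independent, so the per-step fractions multiply.
    """
    fraction = 1.0
    for percent in step_percentages:
        fraction *= percent / 100.0
    return fraction * 100.0


# QA received 40 builds from Dev and could test 30 of them as delivered.
print(percent_complete_accurate(30, 40))  # prints 75.0
```

A low value at one step points to where rework enters the value stream, which is usually a better place to improve than the step that merely notices the defect.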
Target ideal deployment lead times
Step 1 – Commit stage (automated), passed on by automated approval
Step 2 – Automated testing (10 minutes), passed on by manual approval
Step 3 – Exploratory testing (10 minutes)
Step 4 – Production deployment (5 minutes)
Focus on deployment lead times – The value stream begins when any engineer (Dev, QA, Ops, InfoSec) checks a change into version control and ends when that change is successfully running in production
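Measured this way, deployment lead time is simply the gap between two timestamps: the check-in and the moment the change is running in production. The sketch below assumes both timestamps are available (for example from the version control system and the deployment tool); the function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def deployment_lead_time(committed_at, running_in_production_at):
    """Time from check-in to version control until the change runs in production."""
    if running_in_production_at < committed_at:
        raise ValueError("a change cannot reach production before it is committed")
    return running_in_production_at - committed_at


# Hypothetical change: committed at 09:00, live in production at 09:25.
committed = datetime(2024, 1, 1, 9, 0)
live = datetime(2024, 1, 1, 9, 25)
print(deployment_lead_time(committed, live))  # prints 0:25:00
```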
Phase 1: Design & Development
Design and development is similar to Lean Product Development and is highly variable and uncertain. It requires high degrees of creativity and work that may never be performed again, resulting in high variability of process times.
Phase 2: Testing & Operations
- Akin to lean manufacturing.
- Requires creativity and expertise
- Strives to be predictable and mechanistic with the goal of achieving work outputs with minimised variability (e.g. short and predictable lead times, near-zero defects)
Phase 3: Remove large batches of work
- The goal is to have testing and operations happening simultaneously with design/development, enabling fast flow and high quality
- The method succeeds when we work in small batches and build quality into every part of our value stream.
Lead Time vs. Processing Time
Lead Time
- Used to measure performance in value streams
- The clock starts when the request is made and ends when it is fulfilled
- Because lead time is what the customer experiences, we focus our process improvement there instead of process time
- Achieving fast flow and short lead times almost always requires reducing the time our work is waiting in queues.
Process Time
- Starts only when we begin work on the customer’s request; it omits the time the work spends waiting in a queue
- The ratio of process time to lead time serves as an important measure of efficiency.
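The relationship between the two measures can be shown with a short calculation. The function names and the sample figures (4 hours of work inside a 40-hour lead time) are hypothetical.

```python
def queue_time(lead_time_hours, process_time_hours):
    """Time the work spent waiting rather than being worked on."""
    if process_time_hours > lead_time_hours:
        raise ValueError("process time cannot exceed lead time")
    return lead_time_hours - process_time_hours


def process_to_lead_time_ratio(lead_time_hours, process_time_hours):
    """Fraction of the lead time during which work was actually being done."""
    if lead_time_hours <= 0:
        raise ValueError("lead_time_hours must be positive")
    if process_time_hours > lead_time_hours:
        raise ValueError("process time cannot exceed lead time")
    return process_time_hours / lead_time_hours


# A request took 40 hours end to end, of which 4 hours were hands-on work.
print(queue_time(40, 4))                  # prints 36
print(process_to_lead_time_ratio(40, 4))  # prints 0.1
```

In this example, 90% of what the customer experienced was waiting, so shortening the queues would help far more than speeding up the 4 hours of work.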
Common Scenario – Deployment Lead Times Requiring Months
Common in large, complex organisations that are working with:
- Tightly coupled monolithic applications
- Often, with scarce integration test environments
- High reliance on manual testing
- Multiple required approval processes
- Heroics required
- High risks occur after merging all development team changes, resulting in code that no longer builds correctly or passes our tests.
- Fixing each problem requires days or weeks
- Extensive investigation is needed to determine who broke the code and how to fix it
Result: Poor customer outcomes
Enabling organisational learning and a safety Culture
Complex systems make it impossible to predict all the outcomes of our actions.
- When accidents occur, the root cause is often declared to be human error, and a cycle of “name, blame, shame” begins for the person who caused the problem.
- More processes and approvals are created to prevent errors from happening
- How management reacts to failures and accidents leads to a culture of fear, making it unlikely that problems and failure signals are ever reported. Problems remain hidden until a catastrophe occurs.
- Dr. Westrum defined 3 types of culture: pathological, bureaucratic, and generative.
- In the technology value stream, we need to create a “generative culture”
Westrum organisational typology (2004)
Pathological organisations
- Information is hidden
- Messengers are “shot”
- Responsibilities are shirked
- Bridging between teams is discouraged
- Failure is covered up
- New ideas are crushed.
Bureaucratic organisations
- Information may be ignored
- Messengers are tolerated
- Responsibilities are compartmented
- Bridging between teams is allowed but discouraged
- The organisation is just and merciful
- New ideas create problems.
Generative organisations
- Information is actively sought
- Messengers are trained
- Responsibilities are shared
- Bridging between teams is rewarded
- Failure causes inquiry
- New ideas are welcomed
The goal of Technology Value Stream
- Establish the foundations of a generative culture by striving to create a safe system of work.
- We look for how we can redesign the system to prevent the accident from happening again.
- We conduct a blameless postmortem after every incident to understand how the accident occurred and agree upon the best countermeasures to improve the system. We want to prevent the problem from occurring again and to enable faster detection and recovery.
Result:
- Create organizational learning
- Help customers
- Ensure quality
- Create competitive advantage
- Energised workforce
- Committed workforce
- We can uncover the truth.
Institutionalise the improvement of daily work
Problems:
- In the absence of improvements, processes don’t stay the same. Due to chaos and entropy, processes actually degrade over time.
- When we avoid fixing problems and rely on daily workarounds, our problems and technical debt accumulate until all we do is perform workarounds, trying to avoid disaster, with no cycles left for productive work.
Solutions:
- Reserve time to pay down technical debt, fix defects and refactor
- Improve problematic areas for our code and environments
- We need to reserve cycles in each deployment interval.
- Schedule kaizen blitzes – periods when engineers self-organize into teams to work on fixing any problems they want.
Outcome: As we make our system of work safer, we find and fix problems from ever-weaker failure signals.
Transform local discoveries into global improvements
When teams or individuals have experiences that create expertise, we aim to convert that tacit knowledge into explicit, codified knowledge, which becomes someone else’s expertise through practice.
Result: When people do similar work, they do so with the cumulative and collective experience of everyone in the organization who has ever done the same work.
We convert individual expertise into artefacts that the rest of the organization can use.
What we must do: We must create global knowledge by making all blameless post-mortem reports searchable by teams trying to solve similar problems.
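Making post-mortem reports searchable can be as simple as a keyword index. The sketch below is a minimal in-memory version; the `PostmortemIndex` class, the report IDs and the sample texts are hypothetical, and a real organization would more likely use an existing wiki or search tool.

```python
import re
from collections import defaultdict


def _tokenize(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class PostmortemIndex:
    """Keyword index over post-mortem reports (in memory, for illustration only)."""

    def __init__(self):
        self._reports = {}              # report id -> full text
        self._index = defaultdict(set)  # word -> ids of reports containing it

    def add(self, report_id, text):
        self._reports[report_id] = text
        for word in _tokenize(text):
            self._index[word].add(report_id)

    def search(self, query):
        """Return the sorted ids of reports that contain every word in the query."""
        words = _tokenize(query)
        if not words:
            return []
        matches = set.intersection(*(self._index.get(w, set()) for w in words))
        return sorted(matches)


index = PostmortemIndex()
index.add("pm-001", "Database failover caused checkout outage")
index.add("pm-002", "Expired TLS certificate caused API outage")
print(index.search("outage"))              # prints ['pm-001', 'pm-002']
print(index.search("certificate outage"))  # prints ['pm-002']
```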
Greenfield vs. Brownfield Services
Greenfield Development
- Build on undeveloped land
- No existing structures that need demolishing
- New software project/initiative