Reliability Enablers

Ash Patel & Sebastian Vietz

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com
Technology

Episodes

#56 Resolving DORA Metrics Mistakes
4d ago
We're already well into 2024, and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas.

Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations.

Nathen Harvey is no stranger to this problem.

He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. His focus has been on questions like:

How do we help teams get better at delivering and operating software?

You and I can agree that this is an important question to ask. I’d listen to what he has to say about DORA because he’s got a wealth of experience behind him, having also run community engineering at Chef Software.

Before we continue, let’s explore “What is DORA?” in Nathen’s (paraphrased) words:

DORA is a software research program that's been running since 2015.

This research program looks to figure out:

How do teams get good at delivering, operating, building, and running software?

The researchers were able to draw out the concept of the metrics based on correlating teams that have good technology practices with highly robust software delivery outcomes.

They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.

Essentially, all those things that matter to the business.

One of the challenges the researchers found over the last decade was working out: how do you measure something like software delivery? It's not the same as a factory system where you can go and count the widgets that you're delivering.

The unfortunate problem is that the factory mindset, I think, still leaks in. I’ve personally noted some silly metrics over the years, like lines of code.

Imagine being asked constantly: “How many lines of code did you write this week?”

You might not have to imagine.
It might be a reality for you.

DORA’s researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. They settled on and validated 4 key measures for software delivery performance.

Nathen elaborated that 2 of these measures look at throughput:

[Those] two [that] look at throughput really ask two questions:

* How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, to go from the developer's workstation right through to production?

And then the second question on throughput is:

* How frequently are you updating production?

In plain English, these 2 metrics are:

* Deployment Frequency: Measures how often code is deployed to production. This metric reflects the team's ability to deliver new features or updates quickly.
* Lead Time for Changes: Measures the time it takes from code being committed to being deployed to production.

Nathen recounted his experience of working at organizations that differed in how often they update production, from once every six months to multiple times a day. They're very different types of organizations, so their perspectives on throughput metrics will be wildly different. This has some implications for the speed of software delivery.

Of course, everyone wants to move faster, but there’s this other thing that comes in, and that's stability.

And so, the other two stability-oriented metrics look at:

What happens when you do update production and... something's gone horribly wrong. “Yeah, we need to roll that back quickly or push a hot fix.”

In plain English, they are:

* Change Failure Rate: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs).
* Failed Deployment Recovery Time: Measures how long it takes to recover from a failure in production.

You might be thinking the same thing as me.
These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.

But keep in mind, it’s about balancing all 4 metrics.

Nathen believes it’s fair to say that many organizations today look at throughput and stability as tradeoffs of one another: we can either be fast or we can be stable.

But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't tradeoffs of one another.

They tend to move together. The researchers have seen organizations of every shape and size, in every industry, doing well across all four of those metrics. They are the best performers.

The interesting thing is that neither the size of your organization nor the industry you're in matters.

Whether you’re working in a highly regulated or unregulated industry, it doesn't matter.

The key insight that Nathen thinks we should be searching for is: how do you get there?

To him, it's about shipping smaller changes. When you ship small changes, they're easier to move through your pipeline. They're easier to reason about. And when something goes wrong, they're easier to recover from and restore service.

But along with those small changes, we need to think about those feedback cycles.

Every line of code that we write is in reality a little bit of an experiment. We think it's going to do what we expect and it's going to help our users in some way, but we need to get feedback on that as quickly as possible.

Underlying all of this, both small changes and getting fast feedback, is a real climate for learning. Nathen drew up a few thinking points from this:

So what is the learning culture like within our organization? Is there a climate for learning? And are we using things like failures as opportunities to learn, so that we can keep improving?

I don’t know if you’re thinking the same as me already, but we're already learning that DORA is a lot more than just metrics.
To Nathen (and me), the metrics should be one of the least interesting parts of DORA, because DORA digs into useful capabilities like small changes and fast feedback. That’s what truly helps determine how well you're going to do against those performance metrics.

It's not about saying, “We are a low to medium performer. Now go and improve the metrics!”

I think the issue is that a lot of organizations emphasize the metrics because they are something that can sit on an executive dashboard. But the true reason we have metrics is to help drive conversations.

Through those conversations, we drive improvement.

That’s important because, according to Nathen, an unfortunately noticeable number of organizations are currently doing this:

I've seen organizations [where it’s like]: “Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.” That doesn't change anything. We have to go the step further and put those metrics into action.

We should be treating the metrics as a kind of compass on a map. You can use those metrics to help orient yourself and understand, “Where are we heading?”

But then you have to choose how you are going to make progress toward whatever your goal is.

The capabilities enabled by the DORA framework should help answer questions like:

* Where are our bottlenecks?
* Where are our constraints?
* Do we need to do some improvement work as a team?

We also talked about the SPACE framework, which is a follow-on tool from DORA metrics. It is a framework for understanding developer productivity. It encourages teams or organizations to look at five dimensions when trying to measure something from a productivity perspective.

It stands for:

* S: satisfaction and well-being
* P: performance
* A: activity
* C: communication and collaboration
* E: efficiency and flow

What the SPACE framework recommends is that you first pick metrics from two to three of those five categories.
(You don't need a metric from every one of those five, but find something that works well for your team.) Then write down those metrics and start measuring them.

Here’s the interesting thing: DORA is an implementation of SPACE. You can correlate each metric with the SPACE acronym!

* Lead time for changes is a measure of Efficiency and flow.
* Deployment frequency is an Activity.
* Change fail rate is about Performance.
* Failed deployment recovery time is about Efficiency and flow.

Keep in mind that SPACE itself has no metrics. It is a framework for identifying metrics.

Nathen reiterated that you can't use “the SPACE metrics” because there is no such thing.

I mentioned earlier how DORA is a means of identifying the capabilities that can improve the metrics.

These can be technical practices like using continuous integration.

But they can also be capabilities like collaboration and communication.

As an example, you might look at what your change approval process looks like. You might look at how collaboration and communication have failed when you’ve had to send changes off to an external approval board like a CAB (change advisory board).

DORA’s research backs the above up:

What our research has shown through collecting data over the years is that, while they do exist, on the whole an external change approval body will slow you down.

That's no surprise. So your change lead time is going to increase, your deployment frequency will decrease. But, at best, they have zero impact on your change fail rate. In most cases, they have a negative impact on your change fail rate. So you're failing more often.

It goes back to the idea of smaller changes, faster feedback, and being able to validate that. Building in audit controls and so forth.

This is something that reliability-focused engineers should be able to help with, because one of the things Sebastian and I talk about a lot is embracing and managing risk effectively and not trying to mitigate it through stifling measures like CABs.
In short, DORA and software reliability are not mutually exclusive concepts.

They're certainly in the same universe.

Nathen went as far as to say that some SRE practices necessarily go a little deeper than the capability level that DORA works at, and provide even more specific guidance on how to do things.

He clarified a doubt I had, because a lot of people have argued with me (mainly at conferences) that DORA is this thing that developers do earlier in the SDLC, and that SRE is completely different because it focuses on the production side.

The worst possible situation could be turning to developers and saying, “These 2 throughput metrics, they’re yours. Make sure they go up no matter what,” and then turning to our SREs and saying, “Those stability metrics, they're yours. Make sure they stay good.”

All that does is put false incentives in place, and we end up just fighting against each other.

We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.

Here are some useful links from Nathen for further research:

* DORA online community of practice
* DORA homepage
* [Article] The SPACE of Developer Productivity
* Nathen Harvey's Linktree

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
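To make the four measures from the DORA episode notes above a little more concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record, its field names, and the `dora_summary` function are assumptions made up for illustration; they are not part of DORA's research or any official tooling, and real pipelines would pull this data from a CI/CD system and an incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments, window_days):
    """Summarise the four software delivery measures over one time window."""
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        # Throughput
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        # Stability
        "change_failure_rate": len(failures) / len(deployments) if deployments else None,
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


if __name__ == "__main__":
    t0 = datetime(2024, 9, 2, 9, 0)
    day, hour = timedelta(days=1), timedelta(hours=1)
    week = [
        Deployment(t0, t0 + 2 * hour),
        Deployment(t0 + day, t0 + day + 4 * hour, failed=True,
                   recovered_at=t0 + day + 5 * hour),
        Deployment(t0 + 2 * day, t0 + 2 * day + hour),
        Deployment(t0 + 3 * day, t0 + 3 * day + 3 * hour),
    ]
    for name, value in dora_summary(week, window_days=7).items():
        print(f"{name}: {value}")
```

In keeping with Nathen's point, numbers like these are best treated as a conversation starter about capabilities (smaller changes, faster feedback), not as targets to hit.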
#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards
27-08-2024
We’ll explore 3 use cases for monitoring data. They are:

* Analyzing long-term trends
* Comparing over time or experiment groups
* Conducting ad hoc retrospective analysis

Analyzing long-term trends

You can ask yourself a couple of simple questions as a starting point:

* How big is my database?
* How fast is the database growing?
* How quickly is my user count growing?

As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:

* How is the database performance evolving? Are there signs of degradation?
* Is there consistent growth in data volume that may require future infrastructure adjustments?
* How is overall resource utilization trending over time across different services?
* How is the cost of cloud resources evolving, and what does that mean for budget forecasting?
* Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?

Sebastian mentioned that it's a part of observability he enjoys doing. I can understand why. It’s exciting to see how components are changing over a period and to work out solutions before you end up in an incident response nightmare.

Analyzing trends effectively requires the right data retention settings. If you're throwing out your logs, traces, and metrics too early, you will not have enough historical data to do this kind of work.

Doing this right means keeping enough data to analyze those trends over time, and how much that is will of course depend on the period you want to look at.

Comparing over time or experiment groups

Google’s definition

You're comparing the results for different groups that you want to contrast. Using a few examples from the SRE (2016) book:

* Are your queries faster in this version of this database or this version of that database?
* How much better is my memcache hit rate with an extra node?
* Is my site slower than it was last week?
You're comparing across different buckets of time and different types of products.

A proper use case for comparing groups

Sebastian did this particular use case recently because he had to compare two different technologies for deploying code: AWS Lambda vs. Amazon ECS on AWS Fargate. He took those two services and played around with different memory and virtual CPU settings. Then he ran different numbers of requests against those settings and tried to figure out which one was the more cost-effective technology option.

His need for this went beyond engineering work: it was about enabling product teams with the right decision-making data. He wrote a knowledge base article to give them guidance for a more educated decision on the right AWS service.

Having the data to compare the two services allowed him to answer questions like:

* When should you be using either of these technologies?
* What use cases would either technology be more suitable for?

This data-based decision support is based mainly on monitoring or observability data. The idea of using monitoring data to compare tools and technologies for guiding product teams is something I think reliability folk can gain a lot of value from doing.

Conducting ad hoc retrospective analysis (debugging)

Debugging is a bread-and-butter responsibility for a software engineer of any level. It’s something that everybody should know a little bit more about than other tasks, because there are very effective and also very ineffective ways of going about debugging. Monitoring data can help make the debugging process fall on the effective side.

There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you’ll get another fragment. And so on for all the different systems. Then you have to correlate these pieces of information in your head and hope you get enough clarity out of the fragments to form some kind of insight.
Monitoring data that is brought together into one data stream can help correlate and combine all these pieces of information. With it, you can:

* Pinpoint slow-running queries or functions by analyzing execution times and resource usage, helping you identify inefficiencies in your code
* Correlate application logs with infrastructure metrics to determine if a performance issue is due to code errors or underlying infrastructure problems
* Track memory leaks or CPU spikes by monitoring resource usage trends, which can help you identify faulty code or services
* Set up detailed error tracking that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster
* Monitor system load alongside application performance to see if scaling issues are related to traffic spikes or inefficient code paths

Being able to do all this makes the insight part easier for you. And so your debugging approach becomes very different. It becomes much more effective. It becomes much less time-consuming. It potentially makes the debugging task fun, because you get to the root cause of the thing that is not working much faster.

Your monitoring/observability data setup can make debugging nice and fun to a certain degree, or it can make it downright miserable. If it's done well, it's just one of those things you don't even have to think about. It's just part of your job. You do it. It's very effective and you move on.

Wrapping up

So we've covered three more use cases for monitoring data, other than the usual alerts and dashboards.

They are once again:

* analyzing long-term trends
* comparing over time or experiment groups and
* conducting ad hoc retrospective analysis, aka debugging

Next time your boss asks you what all these systems do, you now have three more reasons to focus on your monitoring and use it more effectively.

Until next time, happy monitoring.
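The long-term trend questions from this episode ("How big is my database?", "How fast is it growing?") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are made-up (day, size in GB) pairs, growth is assumed to be roughly linear, and the function names `fit_linear_trend` and `days_until_capacity` are hypothetical. A real setup would read the samples from whatever metrics store you retain long enough.

```python
def fit_linear_trend(samples):
    """Least-squares fit of size = slope * day + intercept.

    `samples` is a list of (day, size_gb) pairs with at least two distinct days.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need samples from at least two distinct days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(samples, capacity_gb):
    """Days after the latest sample until the fitted trend reaches capacity.

    Returns None when the data is flat or shrinking (no forecastable limit).
    """
    slope, intercept = fit_linear_trend(samples)
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - last_day)


if __name__ == "__main__":
    # Hypothetical daily database sizes in GB
    sizes = [(0, 100.0), (1, 102.0), (2, 104.0), (3, 106.0)]
    growth, _ = fit_linear_trend(sizes)
    print(f"growth: {growth:.1f} GB/day")
    print(f"days until 200 GB: {days_until_capacity(sizes, 200.0):.0f}")
```

The retention point from the notes applies directly here: the fit is only as good as the history you kept, so a forecast over months needs months of samples.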
#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity
20-08-2024
Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much of his talk time at DevOps events to a topic less covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer.

I’ve broken them down into the following areas:

* Avoid the heroic efforts
* Mind + heart > Mind alone
* Curiosity > Credentials
* Experience > Certifications
* Thinking for complexity

When I saw him in Toronto, I thought he would talk about pre-production observability. It would only make sense after watching the previous presenter do a deep dive into Kubernetes tooling.

But surprisingly, he started with culture and the need to prevent burnout among engineers, a topic that is as important today as it was 2 years ago when he did the talk. Here’s a look into Shlomo’s philosophy and the practices he champions.

Avoid the heroic efforts

Shlomo's perspective on heroics in engineering and operations challenges a traditional mindset that often glorifies excessive individual efforts at the cost of long-term sustainability. He emphasizes that relying on heroics, where individuals consistently go above and beyond to save the day, creates an unhealthy work environment.

"We shouldn't be rewarding people for pulling all-nighters to save a project; we should be asking why those all-nighters were necessary in the first place."

This approach not only burns out engineers but also masks underlying systemic issues that need to be addressed. So, instead of celebrating these heroic efforts, Shlomo advocates for creating processes and metrics that ensure smooth operations without the need for constant intervention.

Mind + Heart > Mind alone

One of the challenges Shlomo has faced recently is scaling his engineering organization amidst rapid growth. His approach to hiring is unique; he doesn’t just look for technical skills but prioritizes self-awareness and kindness.

"Hiring with heart means looking for individuals who bring empathy and integrity to the team, not just expertise."

When he joined The Score, a subsidiary of Penn Interactive, Shlomo immediately revamped the hiring practices by integrating the values above into the process. He favors role-playing scenarios over relying solely on behavioral interviews to evaluate candidates, as this method reveals how individuals might react in real production situations.

I tend to agree with this approach, as seeing how people do the work is more enlightening than only asking them how they behaved in a past situation.

Curiosity > credentials

How it plays into career progression

When it comes to career progression, Shlomo places little value on traditional markers like education or years of experience. Instead, he values adaptability, resilience, and curiosity. This last trait is the one he doubles down on.

According to Shlomo, curiosity is the cornerstone of continuous growth and innovation. It’s not just about asking questions. It’s about fostering a mindset that constantly seeks to understand the 'why' behind everything. Shlomo advocates for a deep, insatiable curiosity that drives engineers to explore beyond the surface of problems, looking for underlying causes and potential improvements.

He believes that this kind of curiosity is what separates good engineers from great ones, as it leads to discovering solutions that aren’t immediately obvious and pushes the boundaries of what’s possible.

How it plays into teamwork

For Shlomo, curiosity also plays a crucial role in building a cohesive and forward-thinking team. He encourages leaders to cultivate an environment where questions are welcomed and no stone is left unturned. This approach not only sparks creativity but also ensures that everyone is engaged in a continuous learning process, which is vital in a field that evolves as rapidly as DevOps and SRE.

By nurturing curiosity, teams can stay ahead of the curve.
They can anticipate challenges before they arise and develop right-fit solutions that keep their work relevant and impactful.

Shlomo advises engineers not to let their current organization limit them and to always seek out new challenges and learning opportunities. This mindset will make them valuable to any organization they may work with.

Experience > Certifications

Shlomo’s stance on certifications is clear: they don’t necessarily lead to career advancement. He argues that the best engineers are those who are too busy doing the work to focus on accumulating certifications. Instead, he encourages engineers to network with industry leaders, demonstrate their skills, and seek mentorship opportunities. Experience and mentorship, he believes, are far more critical to growth than any piece of paper.

Thinking for complexity

It’s a well-worn saying now, almost a cliché, but still very relevant to standing out in a crowded engineering talent market. Shlomo and I talked about the issue of many engineers being trained to think in terms of best practices. I feel that this emphasis will reduce over time, especially for more senior roles. Best practices are not directly applicable to solving today’s problems, which are increasing in complexity.

Shlomo tests potential hires to see if they can handle the complexity. During interviews, he presents candidates with unreasonable scenarios to test their ability to think outside the box. This approach not only assesses their problem-solving skills but also helps them understand the interconnectedness of the challenges they will face.

Wrapping up

The insights Shlomo shared with me underscore a crucial point:

The most successful engineers are those who combine technical prowess with a strong sense of curiosity, a commitment to continuous improvement, and a genuine understanding of their role within the team.
By embracing these qualities, you not only enhance your current contributions but also set yourself on a path for long-term growth and success.

The takeaway is clear: to truly stand out and advance in your career, it's not just about doing your job well; it's about constantly seeking to learn more, improve processes, and connect with your team on a deeper level.

These are the traits that make you not just a good engineer, but a valuable one.
#53 What's Missing in Incident Response Processes?
15-08-2024
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because the processes supporting the incident response are not robust.

Incident response software alone isn't going to fix bad incident processes. It's going to help, for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-technology integration.

Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting. According to Vladislav, the beginning of your SRE journey is not going to be focused on setting up an incident response process. It is going to be focused on core SRE artifacts like SLIs, availability measurement, and SLOs, and on questions such as whether it is now safe to invest more in customer-facing features. Those are the core SRE concepts. A formal incident response process comes later, once these things are more or less established in the organization.

Understanding and Leveraging SLOs

Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they’ve been validated through production. Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.

Implementing a Formal Incident Response

Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. Without this, the process might not be as effective.
When the foundational SLOs and organizational culture are strong, a well-structured incident response process can be significantly more effective.

Coordinating During Major Incidents

When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. Consider appointing incident commanders and coordinators, as recommended in PagerDuty’s documentation, to manage this coordination. Develop a lightweight process to guide how incidents are handled.

Classifying Incidents

Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.

Deriving Actions from Incident Classification

Based on the incident classification, outline specific actions. For example, Priority One incidents might require immediate involvement from an incident commander. They might take the following actions:

* Create a communication channel, assemble relevant teams, and start coordination.
* Simultaneously inform stakeholders according to their priority group.
* Define stakeholder groups and establish protocols for notifying them as the situation evolves.

Keep Incident Response Processes Simple and Accessible

Ensure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper. Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram.
This approach ensures that the process is practical and can be followed effectively during an incident.

Preparing Your Organization

An effective incident response process relies on an organization’s readiness for such rigor. Attempting to implement this process in an organization that is not yet mature enough may result in poor adherence during critical times. Make sure your organization is prepared to follow the established procedures.

For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn.
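The classification and action rules described in this episode are small enough to write down as data, which is one way to keep the process on "a single sheet of paper". The Python sketch below is an assumption-laden illustration, not the process from the book: the `Priority` enum, the `classify` and `actions_for` helpers are hypothetical names, the Priority One actions paraphrase the notes above, and the Priority Two and Three actions are placeholders invented for the example.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower number means more severe (Priority One, Two, Three)."""
    P1 = 1
    P2 = 2
    P3 = 3


# Actions per priority. The P1 entries follow the episode notes; the P2 and P3
# entries are hypothetical placeholders, not recommendations from the source.
ACTIONS = {
    Priority.P1: [
        "involve an incident commander immediately",
        "create a communication channel",
        "assemble relevant teams and start coordination",
        "inform stakeholders according to their priority group",
    ],
    Priority.P2: ["notify the owning team's on-call engineer"],  # assumption
    Priority.P3: ["file a ticket for the owning team"],          # assumption
}


def classify(candidates):
    """Resolve an ambiguous classification to the most severe candidate.

    If responders cannot decide between P1 and P2, this returns P1.
    """
    if not candidates:
        raise ValueError("at least one candidate priority is required")
    return min(candidates)


def actions_for(candidates):
    """Look up the actions for the resolved priority."""
    return ACTIONS[classify(candidates)]


if __name__ == "__main__":
    unsure = [Priority.P2, Priority.P1]
    print(classify(unsure).name)
    for step in actions_for(unsure):
        print("-", step)
```

Keeping the rules in one small table like this makes the ambiguity guideline explicit and easy to review with the teams who have to follow it.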
Can ITIL Benefit from Site Reliability Engineering?
13-08-2024
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale.

However, the problem is that many in the industry are NOT using SRE principles but are instead handing over complex services to a more traditional IT function.

Dr. Vladislav Ukis is well qualified to talk about reliability: he works at Siemens Healthineers, where he leads 250 people globally who deliver the company's cloud platform running on Microsoft Azure.

We discussed key concepts from his book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.

Unlike other technical books in this field, Dr Ukis’ book is aimed at technology professionals who are beginners on the reliability journey. This is different from the Site Reliability Engineering (2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book requires a degree of prior knowledge and also prior experience in the field. Vlad wanted to make it more accessible:

What I did with my book is to say, ‘Okay, so now you've never done operations, but you now are thrown in the world of online services where you have to operate them. How do you get started?’ So this is what the book is for. So for people who want to learn how to get started in the world of operating online services.

ITIL was originally developed by the UK government in the 80s to improve IT governance. It is best related to SRE through its service management and incident management components. But it’s for managing systems that are more predictable and can be handled through strict process control.

Modern product delivery doesn’t have the luxury of bureaucratic levels of predictability that older IT services have.
It requires a more engineer-oriented approach to solving problems/incidents and providing services.

So how was Vlad’s experience bringing SRE into an organization that previously had run solely on the ITIL model?

Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. The company would ship the physical software product to its hospital customers, and then those hospitals would have the software operated and supported by their IT departments.

The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software on discs to customers, but provide online services in the cloud centrally for the customers to use.

The early days were handled haphazardly, and the software was deployed to the cloud with no major issues. Not many customers were on the cloud platform, so the team could get away with “handcrafted operating procedures”.

But as traffic and service count started to rise rapidly, the Healthineers team learned that they needed a more professional approach. They began to understand that their initial approach to operations could not continue as-is.

This is when Vladislav began to drive SRE practices in the organization.

This was a sub-30-minute conversation that covered a lot of ground relevant to organizations looking to transition to product delivery of online services at scale. Have a listen.
#50 Making Better Sense of Observability Data
09-07-2024
Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data.

Into just under 25 minutes, we crammed ideas like these 7 takeaways:

* Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management.
* Prioritize Customer Health: In Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact.
* Apply Mathematical Techniques: Incorporate advanced mathematical concepts, like the Nyquist–Shannon sampling theorem and the t-digest algorithm, to enhance data accuracy and observability metrics.
* Build Accurate Percentiles: Implement techniques to accurately reproduce percentiles from raw data to ensure reliable performance metrics.
* Manage High Cardinality Data: Develop strategies to handle high cardinality data without overwhelming your resources, ensuring you extract valuable insights.
* Standardize Log Records: Use readily available frameworks to emit standardized log records, which makes the data easier to process and visualize.
* Handle High-Velocity Data Efficiently: Develop methods for collecting and processing high-velocity data without incurring prohibitive costs.

Watch Jack’s Monitorama talk via this link: https://vimeo.com/843996971
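To make the "Build Accurate Percentiles" takeaway concrete, here is a minimal Python sketch of why percentiles need to come from the raw data (or from a mergeable summary of it) rather than from averaging pre-computed percentiles. The host names and latency values are made up for illustration, and the nearest-rank method is just one common percentile definition, not necessarily the one Jack uses.

```python
def percentile(samples, q):
    """Nearest-rank percentile (q is an integer from 1 to 100) of raw samples."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * q // 100)  # ceiling of len * q / 100, in integer math
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds for two hosts behind one service.
busy_host = [20] * 980 + [900] * 20   # 1,000 requests, 2% of them slow
quiet_host = [20] * 10                # 10 requests, all fast

# Averaging per-host p99s gives every host equal weight, regardless of traffic.
averaged_p99 = (percentile(busy_host, 99) + percentile(quiet_host, 99)) / 2

# Computing p99 over the merged raw samples reflects what users experienced.
merged_p99 = percentile(busy_host + quiet_host, 99)

print(f"average of per-host p99s: {averaged_p99} ms")
print(f"p99 of the merged raw data: {merged_p99} ms")
```

In this made-up case the averaged figure is roughly half the merged one, because the quiet host hides the slow tail of the busy host. Keeping every raw sample is expensive at scale, which is where a mergeable sketch such as a t-digest comes in: it approximates the merged percentile without retaining all samples.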
#49 Alert Fatigue is Still an Issue - Here's How We Fix it
02-07-2024
Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone, who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder!

Here are 9 takeaways from our conversation:

* Regularly Review and Update Monitoring Systems: Don’t set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective.
* Focus on Relevant Alerts: Ensure your alerting system is tailored to indicate real problems. Avoid relying on outdated criteria such as high CPU or memory usage unless they directly impact user experience.
* Adopt a User-Centric Approach: Develop alerts based on how issues affect the user experience rather than purely technical metrics. This helps prioritize what truly matters to the end user.
* Evaluate Alert Value: Critically assess each alert for its value. Ask whether the alert provides actionable information and whether it impacts the user or business. Eliminate or adjust alerts that don’t meet these criteria.
* Reduce Alert Noise: Strive to minimize unnecessary alerts that contribute to noise and obscure real issues. This makes it easier to detect and respond to genuine problems.
* Understand the User Journey: Document the user journey and create Service Level Objectives (SLOs) to align alerts with user-impacting events. This ensures alerts are meaningful and actionable.
* Secure Leadership Support: Gain buy-in from leadership by demonstrating the long-term benefits of an effective alerting system. Emphasize how it can improve user satisfaction and operational efficiency.
* Improve Documentation and Preparedness: Ensure thorough documentation for all systems and alerts. This reduces stress and increases efficiency, particularly for engineers handling on-call duties.
* Automate Alert Responses: Implement automation to handle routine alerts. This reduces the manual burden on engineers and allows them to focus on more complex issues.
#48 Cutting Down "Toil" aka Manual Work in Software
25-06-2024
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil.

We hit the jackpot with concepts like:

* what toil is, according to a 5-point set of criteria
* why even care about toil?
* where you can find toil in your software system
* Google’s goal for how much work (%) should be toil
* the fact that toil isn’t always all that bad

Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email.

But first…

Before we jump into the takeaways, here’s a new segment I’m trying out for newsletters. I’ll highlight a new reliability tool that I think could help you.

Do you struggle to visualize your Kubernetes workloads? In that case, have you heard of kube-ops-view? It helps you visualize your complex K8s clusters and everything inside them. For a deeper rundown, visit the LinkedIn post I made about kube-ops-view, which shares a few more details.

Back to our original programming…

Here are key takeaways from our chat:

* Define and Identify Toil: Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.
* Prioritize Automation: Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.
* Embrace the Role of an SRE: Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.
* Address Common Sources of Toil: Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.
* Adopt a Toil Elimination Mindset: Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.
* Develop a Culture of Continuous Improvement: Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.

Until next time, happy toil hunting!
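For anyone who wants to put a number on toil, here is a small Python sketch of how a team might track its toil share for a week. The task names and hours are hypothetical, and deciding which tasks count as toil is a judgment call each team has to make. The 50% ceiling reflects the goal stated in the Site Reliability Engineering book of keeping toil below half of each SRE's time.

```python
TOIL_CEILING = 0.50  # goal from the SRE book: toil below 50% of each SRE's time


def toil_share(hours_by_task, toil_tasks):
    """Return the fraction of logged hours spent on tasks classified as toil."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    toil = sum(hours for task, hours in hours_by_task.items() if task in toil_tasks)
    return toil / total


# Hypothetical week of logged hours for one engineer.
week = {
    "manual deploys": 10,
    "ticket triage": 6,
    "on-call pages": 6,
    "automation project": 12,
    "design review": 6,
}
# Assumption: this team classifies these three tasks as toil.
toil_tasks = {"manual deploys", "ticket triage", "on-call pages"}

share = toil_share(week, toil_tasks)
over_ceiling = share >= TOIL_CEILING
print(f"toil share this week: {share:.0%} (over ceiling: {over_ceiling})")
```

A single week over the ceiling is not alarming on its own. The trend across several weeks is a better prompt for the automation conversations mentioned above.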
#47 How to Grow Team Impact Through Learning Culture
18-06-2024
The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture.

We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning?

Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under her own banner.

Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture. We tackled issues like the value of certifications, comparing technical with non-technical skills, and more.

You can connect with Sorrel via LinkedIn
Learn more about what Sorrel does via LaaS.consulting

Here’s a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture:

1. Slack Outage (February 2023)
Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement.

2. Twitter Algorithm Glitch (April 2023)
A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly.

3. Microsoft Azure AD Outage (March 2023)
Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly.

4. Google Cloud Platform Networking Issue (May 2023)
Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions.

5. GitHub Outage (June 2023)
GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures.
#46 Platform Team Design According to Team Topologies
11-06-2024
I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In this second part, we talk about platform teams.

A quick refresher on what platform teams do

In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity. They do this by abstracting away common infrastructure and operational concerns, so that stream-aligned teams can focus on delivering business value.

Here are the key takeaways from our conversation

For those who don’t have time to listen to this episode (though you’d be missing out on a great conversation):

* Focus on User-Centric Design: Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points.
* Build and Maintain Trust: Establish and nurture trust with your platform’s users. Trust is crucial for platform adoption and can prevent resistance, which helps sustain use.
* Justify Platform Value: Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support.
* Understand Adoption Lifecycle: Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases.
* Enhance Collaboration: Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships.
* Manage Cognitive Load: Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency.
* Use Tools to Measure Cognitive Load: Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload.
* Leverage Experienced Product Managers: Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users.

I think the uncommon takeaway here is the last one, in that platform teams should treat their platform as a product. Product Managers like Paweł Huryn and Marty Cagan are doing great work in laying out the roadmap for product management.

Did you end up checking out the reliability workstreams map I published last week? It’s free and can help you stay focused on the right priorities at work. Check it out via this link
#45 How Team Topologies Can Guide Enabling Teams
04-06-2024
I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains 2 of the most relevant team topologies for reliability work in this 2-part series. In this first part, we talk about enabling teams.

A quick refresher on what enabling teams do

In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas. This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills.

In other news… This podcast has a new name

What more fitting moment to announce the renaming of the SREpath podcast to “The Reliability Enablers” podcast? This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond.

Before we get to the 8 takeaways

Here’s something relevant to enabling reliability work: a reliability workstreams map I’ve had in my private notes for years, now going public.

What is a workstream? 🤔 You might have heard of “value streams”. They show the end-to-end journey of creating and delivering value to a customer. Workstreams support your value streams. They cover the activities you carry out to deliver that value. In summary: Value streams are the goals and workstreams are the activities you do to achieve those goals.

Okay, now time for the erudite takeaways that Manuel gave me from our talk.

Takeaways from the episode

Here are the key takeaways from our conversation for those who don’t have time to listen (though you’d be missing out on a great audio conversation):

* Create Enabling Teams: Form SRE-focused enabling teams to facilitate technical training, optimize cloud architecture, improve documentation, and overall help other teams build their capabilities.
* Work to Minimize Cognitive Load: Minimize the cognitive load on engineers by centralizing complex and repetitive tasks, allowing engineers to concentrate on innovation and high-value work. You can measure cognitive load and manage it through the Teamperature tool.
* Facilitate Learning and Adoption of Best Practices: Use SRE enabling teams to educate product teams on critical practices like error budgets and service level objectives, making the learning process gradual and manageable.
* Collaborate among Topologies for Effective Tooling: Enabling teams should work with platform teams to inform their plans to develop and co-evolve tools and services that support reliability and observability practices, like automated dashboards and alerting systems.
* Adapt Approaches Based on Organizational Capacity: Tailor the mix of enabling and platform support based on the organization’s resources and constraints, ensuring flexibility and efficiency.
* Avoid Traditional Ops Work for SRE Teams: Ensure SRE teams focus on empowering product teams rather than performing traditional operations tasks, promoting a culture of shared responsibility.
* Build an Effective Learning Culture: Foster a culture of continuous learning and improvement, integrating learning opportunities into the daily workflow rather than relying solely on formal training programs.
* Scale Capabilities Across the Organization: When needed, scale enabling efforts to build organization-wide capabilities, ensuring that expertise is distributed and not bottlenecked within specialized departments.
#43 - SLOs: a Deeper Dive into its Mechanics
28-05-2024
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs.

Here are 5 takeaways from the show:

* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.
* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.
* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs.
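One way to give an SLO the "real consequences" mentioned above is to track its error budget, which is the amount of failure the target allows. The Python sketch below shows the arithmetic under simple assumptions: a request-based SLO, and request counts that are made up for illustration. The function names are hypothetical and not taken from the book.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_events


def budget_consumed(slo_target, total_events, bad_events):
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        return float("inf") if bad_events else 0.0
    return bad_events / budget


# Hypothetical window: 1,000,000 requests, 250 of which failed.
total_requests = 1_000_000
failed_requests = 250

# Start with a loose target, then see what tightening it would mean.
for target in (0.99, 0.999):
    used = budget_consumed(target, total_requests, failed_requests)
    print(f"SLO {target:.1%}: budget of about "
          f"{error_budget(target, total_requests):,.0f} failures, {used:.1%} used")
```

Tightening the target from 99% to 99.9% shrinks the budget tenfold, so the same 250 failures use up ten times as much of it. That is why refining a target is a decision to make with stakeholders rather than a number to pick in isolation.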
#42 - Hitting Software SLA Targets through SLOs and SLIs
21-05-2024
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).

Here are 7 takeaways from the show:

* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
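To see how the three terms fit together, here is a minimal Python sketch. It measures one SLI as the share of good requests, then compares it with an internal SLO and a contractual SLA. All numbers are hypothetical. Setting the SLO stricter than the SLA is a common practice assumed here for illustration, not something prescribed in the episode.

```python
def availability_sli(good_requests, total_requests):
    """SLI: the measured fraction of requests that were served successfully."""
    return good_requests / total_requests if total_requests else 1.0


def assess(sli, slo_target, sla_target):
    """Classify a measured SLI against the internal SLO and the contractual SLA."""
    if sli < sla_target:
        return "SLA breached"
    if sli < slo_target:
        return "SLO missed, SLA still met"
    return "SLO met"


# Hypothetical targets: the SLO acts as an early warning ahead of the SLA.
SLO_TARGET = 0.9995
SLA_TARGET = 0.9990

sli = availability_sli(good_requests=99_920, total_requests=100_000)
status = assess(sli, SLO_TARGET, SLA_TARGET)
print(f"measured SLI: {sli:.4%} -> {status}")
```

Here the service has missed its own objective but has not yet broken the contract. That gap leaves the team room to react before a breach.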
#38 The Real Cost of Software Reliability & Downtime
23-04-2024
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and the costs of not doing it well, or at all.

Here are key takeaways from our conversation:

* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.
* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization.
* Advocate Continuously: Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success.
* Explore Alternative Metrics: Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance.
* Embrace Regional Focus: Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly.
* Navigate Regulatory Challenges: Be mindful of regulatory challenges, such as GDPR, and understand their implications for service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency.
* Align Reliability with Revenue: Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams.
* Tier Services Strategically: Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives.
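The difference between time-based, event-based, and regional availability is easier to see with numbers. The Python sketch below uses a made-up month with a single 30-minute outage and hypothetical request counts for two regions. The region names and figures are assumptions for illustration only.

```python
def time_based_availability(up_minutes, down_minutes):
    """Traditional view: the share of wall-clock time the service was up."""
    return up_minutes / (up_minutes + down_minutes)


def request_based_availability(ok_requests, total_requests):
    """Event-based view: the share of requests that succeeded."""
    return ok_requests / total_requests if total_requests else 1.0


# Hypothetical 30-day month with one 30-minute outage.
month_minutes = 30 * 24 * 60
time_view = time_based_availability(month_minutes - 30, 30)

# Hypothetical (successful, total) request counts; the outage hit "eu" at peak.
requests_by_region = {
    "eu": (990_000, 1_000_000),
    "us": (499_900, 500_000),
}

per_region = {
    region: request_based_availability(ok, total)
    for region, (ok, total) in requests_by_region.items()
}
total_ok = sum(ok for ok, _ in requests_by_region.values())
total_all = sum(total for _, total in requests_by_region.values())
global_view = request_based_availability(total_ok, total_all)
worst_region = min(per_region, key=per_region.get)

print(f"time-based:    {time_view:.3%}")
print(f"request-based: {global_view:.3%} (global)")
for region, value in per_region.items():
    print(f"  {region}: {value:.3%}")
print(f"worst region: {worst_region}")
```

In this made-up scenario the time-based figure looks healthy, while the request-based figure for the affected region is noticeably lower because the outage landed during peak traffic. The global request-based number sits in between and hides the regional difference.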