GithubHelp home page GithubHelp logo

sre-hands-on's Introduction

SRE Hands On

This tutorials is a hands on to Site Reliability Engineering.

First, we will focus on learning some basic aspects of SRE. Then, we will learn how to collect Service Level Indicators (SLIs) and define Service Level Objectives (SLOs).

If you wish to learn more about the teory behind SRE, I recommend you read the SRE Workbook from Google

Performance

Observing how a system performs has drastically changed over the years.

I remember a time when we had no idea if a system was running under good conditions. How did we know if something was failing? We didn't. Customers used to be our alert system. If everything was silent, it was a sign that everything was working fine. Otherwise, customers would flood the project manager's inbox. At that time, we had scripts reading the last lines of log files looking for words like ERROR and Exception.

Nowadays, we have plenty of tools and strategies to parse, collect, and analyze all the information we can extract from systems. This wide and diverse range of tools may confuse even experienced developers. Which tools do we need? Which metrics must we collect? Will these numbers show how buggy my code is?

If you are working in a healthy environment, you should not worry about whether these numbers will impact your reasoning about your team's code quality. Measuring should serve one purpose only: understanding systems. And when I say understand, I'm talking about awareness. We should not need to keep our eyes on log parsers, charts, or email inboxes to check if all systems are up and running. Our job, as developers, is to automate these processes.

Understanding SLOs and SLIs

SLIs are metrics that quantify the performance and behavior of your system or service.

SLOs define the acceptable levels of performance or reliability based on SLIs.

SLOs are a tool to:

  • Help you decide what the next steps are in evolving your systems
  • Identify which parts of the system need more attention
  • Negotiate future improvements with stakeholders

All of this is driven by data.

A Service Level Objective is a target, and as such, it must be designed and defined with all the people responsible for the target, from the infrastructure teams to stakeholders. It is an important commitment, and transparency is mandatory.

Availability

Availability is the word used to define whether a system is able to complete a request from a client. We use the word client because we don't care whether the request comes from humans or tools.

If a GET request to the /todos route works, it means the route works, and the request completes. But does it mean that the application is available?

Availability is often measured using the following formula:

$successful / (successful + failed)$

Let's check some examples:

  • Our app received 1000 requests in the past minute
  • 100 out of 1000 requests failed
  • 900 / (900 + 100) = 0.
  • Our app has 90% availability

Latency

Latency measures how long a request takes to complete. How long does it take to process a job?

We use latency to measure the overall user experience.

Slow responses will force users to leave our websites, and that's something we don't want. Availability is important. We want our applications to respond with HTTP 200 OK, but a good status can't come with high latency. We don't want our applications to respond fast with HTTP 5xx errors, just as we don't want HTTP 2xx responses taking 10 seconds to complete. Keep in mind that slow errors are even worse than fast errors.

Measuring latency is a real challenge. At first glance, it may seem like a good idea to store every single response time and then calculate the average time of all the requests. However, it turns out that this is not a good idea. Let's take a closer look at the following example:

The Just Code company launches a fresh new app. On the first day, it only has a few users navigating the site. Upset about the poor numbers, the marketing team decides to invest in advertising. And then, like magic, the app starts to receive a tremendous amount of requests. A month later, the CTO becomes concerned about the costs in the cloud. All the metrics are working fine, but they seem to be very expensive.

Why do you think they are not as cheap as some people might wonder?

One of the best ways to collect latency metrics without sacrificing too much information and paying less for it is by storing the information in buckets and calculating percentiles of latency.

sre-hands-on's People

Contributors

mauricioabreu avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.