linkedin / school-of-sre

At LinkedIn, we are using this curriculum for onboarding our entry-level talent into the SRE role.

Home Page: https://linkedin.github.io/school-of-sre/

License: Other

HTML 84.59% CSS 15.41%
sre linux networking git python mysql nosql hadoop system-design security

school-of-sre's Introduction

School of SRE

Site Reliability Engineers (SREs) sit at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, third-party, open systems, run on cloud/on-prem infrastructure, etc. It is particularly important to gain a deep understanding of how these areas of systems and infrastructure relate to and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software.

SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to SLAs, re-architect or scale out to mitigate or avoid SLA breaches, add these learnings as feedback to new systems or projects and thereby reduce operational toil. Hence SREs play a vital role right from the day 0 design of the system.

In early 2019, we started visiting campuses across India to recruit the best and brightest minds to make sure that LinkedIn, and all the services that make up its complex technology stack, are always available for everyone. This critical function at LinkedIn falls under the purview of the Site Engineering team and its Site Reliability Engineers (SREs), who are software engineers specializing in reliability.

As we continued on this journey, we started getting a lot of questions from these campuses about what exactly the site reliability engineering role entails and how someone could learn the skills and disciplines needed to become a successful site reliability engineer. Fast forward a few months, and a few of these campus students had joined LinkedIn, either as interns or as full-time engineers, to become part of the Site Engineering team; we also had a few lateral hires who joined our organization and were not from a traditional SRE background. That's when a few of us got together and started to think about how we could onboard new graduate engineers to the Site Engineering team.

There are very few resources out there guiding someone on the basic skill sets one has to acquire as a beginner SRE. Because of this lack of resources, we felt that individuals have a tough time getting into open positions in the industry. We created the School of SRE as a starting point for anyone wanting to build their career as an SRE. In this course, we focus on building strong foundational skills. The course is structured to provide real-life examples and to show how each of these topics plays an important role in the day-to-day job responsibilities of an SRE. Currently we are covering the following topics under the School of SRE:

We believe continuous learning will help you acquire deeper knowledge and competencies and expand your skill sets, so every module includes references that can guide further learning. Our hope is that by going through these modules, you will be able to build the essential skills required of a Site Reliability Engineer.

At LinkedIn, we are using this curriculum for onboarding our non-traditional hires and new college grads into the SRE role. We have had multiple rounds of successful onboarding experiences with new employees, and the course helped them become productive in a very short period of time. This motivated us to open source the content to help other organizations onboard new engineers into the role and to provide guidance for aspiring individuals who want to get into the role. We realize that the initial content we created is just a starting point, and we hope that the community can help in the journey of refining and expanding the content. Check out the contributing guide to get started.

school-of-sre's People

Contributors

aayush1205, afrazchelsea, akbarkm, andrewpollack, apoorv1393, arunt12, ashkapow, ayan-b, ayushman17, bhavyatyagi, bogay, chris-bateman, codophobia, coryjamesfisher, dictcp, fuzzyticks, givemeroot, jtr109, kalyanceg, n704, poudyalamit, rajalakshmi-v15, ritwik12, saikiranrgda, sanketplus, seemaupadhya, stormic-nomad-nishant, sumeshpremraj, sumit419, tusharjain0022

school-of-sre's Issues

Standards for images and diagrams

Let's add some guidelines on what tools to use to create diagrams and images.

Having diagrams and images as code has multiple advantages over directly committing images.
Thoughts on using mkdocs-diagrams wherever possible?

Consistency and CAP theorem in System Design

Currently, we have Availability and Fault Tolerance in Level 101 but nothing about Consistency, which is also an important topic.

With these three in the picture, we can also cover the CAP theorem, maybe under Level 102.

Add SOS to Up-For-Grabs.net

It would be good to add this repo to the project list at https://up-for-grabs.net/#/ for people who are new to open source. A lot of people have great skills and knowledge but start late in open source, which makes it hard for them to contribute initially. SOS is a good place where they can contribute their knowledge and skills without much open source experience.

Let me know if this is good to add and whether someone is going to do it, or if you want me to.

Linux Networking Fundamentals needs work

In the intro.md prerequisites, it's stated that readers need to have knowledge of jargon in the TCP/IP stack such as DNS, HTTP, etc. It's not clear if these protocols are the jargon being referred to, or jargon associated with these protocols. Anyway, I would suggest that rather than worrying about jargon, readers should have a grounding in the basic principles of these protocols. I recommend providing a link to Peterson and Davie's Computer Networks: A Systems Approach, an open-source textbook covering the fundamentals of computer networking from a multi-layered, system of components perspective. I believe after reading (at least) the first three chapters, your readers will have enough of a grasp of computer networking fundamentals that they will better understand how to perform SRE networking tasks on Linux (and other) systems.

Links not provided

Description:

Links for the course topics are not provided in the course content for the Python and The Web course, as they are in all the other courses, so it is difficult to navigate to those topics.

Add Course Content for Flutter

As Flutter is rising in popularity and even Google Pay is using Flutter for their product, I believe it's time we introduce students and enthusiasts to the world of cross-platform mobile application development.
I have structured course material to add, through which anyone can get acquainted with the Flutter framework.

I kindly request the maintainers to share their insights and suggestions on this course material.

SQL operations page not included in the nav

The SQL operations document is in the repo but not added to the nav, throwing up this log during build:

$ mkdocs build
INFO     -  Cleaning site directory
INFO     -  Building documentation to directory: /Users/spremraj/code/school-of-sre/site
INFO     -  The following pages exist in the docs directory, but are not included in the "nav" configuration:
              - level101/databases_sql/operations.md
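
A check like this could also be run before the build. The following is a minimal sketch, not mkdocs' own logic: it assumes the nav has already been loaded into nested Python lists and dicts (the same shape as the `nav` key in mkdocs.yml) and that page paths are relative to the docs directory. The function names and the `intro.md` entry are hypothetical.

```python
def flatten_nav(nav):
    """Collect every page path referenced in a nested mkdocs-style nav."""
    pages = set()
    if isinstance(nav, str):
        pages.add(nav)
    elif isinstance(nav, list):
        for item in nav:
            pages |= flatten_nav(item)
    elif isinstance(nav, dict):
        for value in nav.values():
            pages |= flatten_nav(value)
    return pages


def pages_missing_from_nav(doc_pages, nav):
    """Return pages present in the docs directory but absent from the nav."""
    return sorted(set(doc_pages) - flatten_nav(nav))


# Hypothetical nav and page list, shaped like the situation in the log above.
nav = [
    {"Home": "index.md"},
    {"Level 101": [
        {"Databases SQL": [
            {"Introduction": "level101/databases_sql/intro.md"},
        ]},
    ]},
]
doc_pages = [
    "index.md",
    "level101/databases_sql/intro.md",
    "level101/databases_sql/operations.md",
]

print(pages_missing_from_nav(doc_pages, nav))
```

With the sample data, the only page reported is the operations document, matching the build log.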

setup CD

A push to main should deploy the website to the gh-pages branch.

Add observability

What do you think about adding an observability course?

The definition? What is a metric? How to get, store, correlate, and show it?
Why?
Tools, e.g. Prometheus, Grafana, ELK, Jaeger, etc.

Improving CLI screenshots

Hello,

I was working on a similar SRE manual and recently discovered your repo. If you allow me, I would like to contribute a few ideas, so please let me know if you have any guidelines for submitting pull requests.

As a first contribution, I think that changing all CLI snippets from screenshot images to asciinema recordings would help. Here is an example: https://asciinema.org/

Let me know your thoughts.

Suggestion to improve readability of shell commands

I was looking at the shell commands here and I found the user prompt very distracting. It currently looks like this.

spatel1-mn1:school-of-sre spatel1$ git branch b1
spatel1-mn1:school-of-sre spatel1$ git log --oneline --graph
* 7f3b00e (HEAD -> master, b1) adding file 2
* df2fb7a adding file 1

but it would be better in my opinion if it looked like this

$ git branch b1
$ git log --oneline --graph
* 7f3b00e (HEAD -> master, b1) adding file 2
* df2fb7a adding file 1

or, since the current branch is important for understanding what a git command would do:

(master)$ git branch b1
(master)$ git log --oneline --graph
* 7f3b00e (HEAD -> master, b1) adding file 2
* df2fb7a adding file 1

I think this is better and easier to understand since the git commands stand out
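
If the snippets were cleaned up in bulk, the prompt rewriting could be scripted. Here is a rough sketch, assuming every prompt has the `host:directory user$ ` shape shown in the example above; the pattern and the function name `simplify_prompts` are illustrative, not a general shell-prompt parser.

```python
import re

# Matches a prompt of the form "host:directory user$ " at the start of a line.
# This pattern is an assumption based on the example above.
PROMPT = re.compile(r"^\S+:\S+ \S+\$ ")


def simplify_prompts(snippet, replacement="$ "):
    """Replace long user prompts with a short one; leave output lines alone."""
    return "\n".join(PROMPT.sub(replacement, line) for line in snippet.splitlines())


snippet = """spatel1-mn1:school-of-sre spatel1$ git branch b1
spatel1-mn1:school-of-sre spatel1$ git log --oneline --graph
* 7f3b00e (HEAD -> master, b1) adding file 2
* df2fb7a adding file 1"""

print(simplify_prompts(snippet))
```

Passing `replacement="(master)$ "` would produce the branch-aware variant, although the branch name would have to be supplied by hand for each snippet.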

What do you think?

I'm also willing to make a PR for this if this sounds like a reasonable idea.

Thanks for sharing free knowledge, I wish I came across this when I was in uni.
Cheers!

Add more topics to Signal

Add more topics to Signal under Linux Advanced. Refer #136

  • Signal Groups: realtime and standard signals.
  • Signal Overview
  • Improved a few terms and sentences

A cloud-flavored school of SRE?

Hello there!

First of all, a big thank you on behalf of the community for this effort! :)

I just started looking at the challenge of onboarding early talent or people making a lateral move into the world of SRE, and this is sometimes a steep learning curve! Resources that you all contribute to in the open make everyone's life easier.

Now coming to the point: While I would love to continue using the school of SRE repo, it is missing something that's key to SRE at some companies: knowledge of cloud architecture and some containers/Kubernetes.

I completely understand if LinkedIn doesn't use Containers with Kube (not sure if that has changed) and thus the additions wouldn't be relevant to this repo, but I wanted to reach out and see if we can somehow make that extension :)

Do let me know what you think! Cheers!

Something I can't understand in chapter System Design -> Scalability

Usually, web applications can be scaled by adding resources unless there is no state stored inside the application.

I can't understand this line at all. In my opinion, stateless systems are easier to scale horizontally. Why does the article say "unless there is no state stored inside the application"? 🤔
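
A toy example may make the point concrete. This is only a sketch under simple assumptions: requests are spread round-robin over two replicas, and a plain dict stands in for an external shared store. The class and function names are made up for illustration.

```python
import itertools


class StatefulCounterApp:
    """Keeps per-user visit counts in this instance's own memory."""

    def __init__(self):
        self.visits = {}

    def handle(self, user):
        self.visits[user] = self.visits.get(user, 0) + 1
        return self.visits[user]


class StatelessCounterApp:
    """Keeps no state itself; counts live in a store shared by all instances."""

    def __init__(self, store):
        self.store = store

    def handle(self, user):
        self.store[user] = self.store.get(user, 0) + 1
        return self.store[user]


def serve(instances, user, requests):
    """Send requests for one user round-robin across the given instances."""
    balancer = itertools.cycle(instances)
    return [next(balancer).handle(user) for _ in range(requests)]


# Two replicas with in-memory state: each replica sees only half the traffic.
print(serve([StatefulCounterApp(), StatefulCounterApp()], "alice", 4))

# Two replicas backed by one shared store (a dict stands in for a database
# or cache here): the count stays consistent across replicas.
shared_store = {}
print(serve([StatelessCounterApp(shared_store), StatelessCounterApp(shared_store)], "alice", 4))
```

The stateful replicas return 1, 1, 2, 2 for four visits, while the stateless ones return 1, 2, 3, 4. Adding replicas only works cleanly when the application itself holds no state.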

Katacoda.com is now closed

Hello,
I really love how helpful this manual is. While going through it, I found some mentions of katacoda.com for hands-on labs in the Kubernetes section.
Katacoda.com is now closed; it has actually become exclusive to O'Reilly.
References:

An alternative to this can be the labs at play with kubernetes: https://labs.play-with-k8s.com/

Changes should be done primarily in the file: orchestration_with_kubernetes.md

If approved, I would like to update this and send in a PR.
Cheers!

Contribution in Python

I would be happy to contribute Python and HTML content for the course section. Kindly let me know which sections or topics are missing or still need to be covered so that I can contribute.

Commands for Viewing Files

In addition to the cat, head, and tail commands, we should add less and more too, as these are commonly used in everyday cases and also provide faster access.

Add use cases/issues with solutions

In addition to the material on basic concepts, we should give practical examples of issues that an SRE deals with in their day-to-day work.

A few common cases:

  • Dealing with information in logs: scraping logs, finding errors and warnings.

  • Troubleshooting an issue with a web app: using the browser's inspect tool to find network or console errors.

  • Working with servers: checking health, the status of an app, CPU usage, memory utilization, etc.

These are just a few common use cases that come to mind. We can dig deeper and find more appropriate practical cases to put on the site.
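
As an illustration of the first case, a practical exercise could start from something as small as counting errors and warnings in a log. This is a minimal sketch; the log format and the sample lines are made up, and real logs will need their own pattern.

```python
import re
from collections import Counter

# Assumed log format: "<date> <time> <LEVEL> <message>". Real logs vary.
LEVEL = re.compile(r"\b(ERROR|WARNING)\b")


def count_levels(lines):
    """Count how many lines mention ERROR or WARNING."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    "2021-01-01 10:00:00 INFO service started",
    "2021-01-01 10:00:05 WARNING disk usage at 85%",
    "2021-01-01 10:00:09 ERROR connection to database timed out",
    "2021-01-01 10:00:12 ERROR connection to database timed out",
]

counts = count_levels(sample_log)
print(counts["ERROR"], counts["WARNING"])
```

For the sample lines this reports two errors and one warning.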

How to integrate versions in other languages

Hi folks,
I've been translating this into Chinese and have almost finished all the courses in level101 in my branch. I'll also continue to translate more courses. I just wonder if there's any guide on how to integrate the courses in other languages into the current site?

Thanks!

Minor issue with redirection to GFG site and a typo

Inside
school-of-sre\courses\level102\containerization_and_orchestration\intro_to_containers.md
Line 177: clicking on the link in the line "If you want to try out a more in-depth exercise on cgroups, check out this tutorial from Geeks for Geeks." leads to an Error 404 Not Found page. Adding 'https://' as a prefix to the link will fix this issue.
Page Link: https://linkedin.github.io/school-of-sre/level102/containerization_and_orchestration/intro_to_containers/

Inside
school-of-sre\courses\level102\linux_intermediate\bashscripting.md
the monitoring script is missing the first '#' in the shebang line:
it should be #!/bin/bash instead of !/bin/bash
Page Link: https://linkedin.github.io/school-of-sre/level102/linux_intermediate/bashscripting/

Do let me know about the issues and suggested changes...

Suggestions in Metrics & Monitoring section

First of all, I would like to say a huge thanks to you guys for sharing SRE knowledge with the community. It is really useful and brings visibility to how important SREs are for a company and to the expectations of this role.

I have looked at the Metrics and Monitoring section and I have some suggestions. Please check.

The statement "Monitoring is a process of collecting real-time performance metrics from a system" might not be correct for all use cases. There are certain ML or offline jobs which are measured once a day or once an hour, so we cannot call those real-time performance metrics.

The statement "What gets measured, gets fixed" might not be true. For instance, if an e-commerce system is experiencing huge traffic because of a lot of requests from a single IP (a DDoS attack), the operators will throttle or block the requests after a certain threshold, but that is mitigating the problem rather than fixing it. Similarly, if an e-commerce system is expecting high traffic during a sale event, the operators might add hosts prior to the event (based on projections) to accommodate the traffic, but that does not mean we are fixing the problem; we are finding a way to handle it.

In the four golden signals of monitoring, I think we should also have Availability as one of the key metrics, which would help us understand what percentage of the time the service is available.
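
As a sketch of what such a metric could look like, availability over a period can be computed from the total time and the downtime. The function name and the sample numbers below are illustrative assumptions, not taken from the course.

```python
def availability_percent(total_seconds, downtime_seconds):
    """Percentage of the period during which the service was available."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


# A 30-day month with 43 minutes and 12 seconds of downtime.
month = 30 * 24 * 60 * 60
downtime = 43 * 60 + 12
print(round(availability_percent(month, downtime), 3))
```

Here the downtime is exactly one thousandth of the month, so the service was available 99.9% of the time.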

In the basic terminology of monitoring, we should also explain what a percentile is, because percentiles are among the most frequently used measurements in monitoring and engineers often get confused by them.
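
A short example might help here. The sketch below uses the nearest-rank definition, which is only one of several percentile definitions in use; monitoring systems may interpolate or estimate from histograms instead. The latency values are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 250]

print(percentile(latencies_ms, 50))   # median
print(percentile(latencies_ms, 90))
print(percentile(latencies_ms, 100))  # maximum
print(sum(latencies_ms) / len(latencies_ms))  # mean, pulled up by the outlier
```

With this data the 50th percentile is 14 ms and the 90th is 16 ms, while the mean is 37.2 ms because of the single 250 ms request. That gap is why percentiles are usually preferred over averages for latency.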

In Command line tools, we should also add the du command to get the disk usage of directories, as df shows free space at the file system level. We should also add the ping, telnet, vmstat, and lsof commands, as I see these commonly used in the operations world.
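
To show the difference in granularity, here is a rough sketch of a per-directory total. It is not a reimplementation of du: it sums file sizes, whereas du reports the disk blocks actually used, so the numbers will usually differ. The function name and the throwaway directory tree are illustrative.

```python
import os
import tempfile


def directory_size_bytes(path):
    """Sum the sizes of regular files under path (apparent size, in bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that a file is not counted via its link.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Build a small throwaway directory tree to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "logs"))
    with open(os.path.join(tmp, "logs", "app.log"), "wb") as f:
        f.write(b"x" * 1000)
    with open(os.path.join(tmp, "notes.txt"), "wb") as f:
        f.write(b"y" * 24)
    demo_size = directory_size_bytes(tmp)

print(demo_size)
```

The two files add up to 1024 bytes, a per-directory figure that df, which works at the file system level, cannot provide.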

In Best Practices for Monitoring, we should call out that when a production problem happens, we should try to bring the system to a stable state rather than trying to fix the problem, because getting the service under control is more important than fixing the problem itself.

In Best Practices for Monitoring, we should also add "Never hesitate to escalate to the right team if needed". As every issue mitigation has its own SLA, we should escalate to the right owner when needed rather than deep diving and breaching the SLA, which could impact the customer.

Correct sentence in nosql db intro

This should be "These are a simpler type of databases where each item contains keys and values. A value can typically only be retrieved by referencing its key"
