Have you read the Site Reliability Engineering book? Read it first, then read this post.
One disclaimer: I'm writing this doctrine from the perspective of working in a large enterprise that is on its own SaaS transformation. The kind of work I do drives the perspective and conclusions here. I won't claim this applies to all SREs, but I hope it's useful to some.
For each SRE or DevOps or DevSecOps or DevXOps (SRE, from here on) in an Enterprise there are four primary stakeholders:
- Outside customers or end users of the service
- Inside customers building the service: developers, engineers, testers, evangelists, etc. (developers, from here on)
- Service owners (business, product, project)
- SRE team
The SRE role is required to satisfy the requirements of these stakeholders.
End users demand the following from the service's SREs:
- Availability
- Privacy & Security
- Stability
End users expect the service to be always available and usable.
SREs must make the service available at all times.
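To make "available at all times" something that can be checked, it helps to turn an availability target into numbers. The following is a minimal sketch of that arithmetic, not a prescribed method: the function names, the 30-day period, and the request-based definition of availability are all my assumptions for illustration.

```python
def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of downtime a period can absorb while still meeting `target`.

    `target` is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - target)


def measured_availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (hypothetical definition)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


if __name__ == "__main__":
    # A target of 1.0 ("at all times") leaves no room for downtime at all.
    for target in (0.99, 0.999, 1.0):
        minutes = allowed_downtime_minutes(target)
        print(f"target {target}: {minutes:.1f} minutes of downtime per 30 days")
```

A target of 1.0 yields zero allowed minutes, which shows how demanding the end users' expectation is compared with any target below it.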
Privacy & Security
End users expect the service to be secure against all threats.
End users expect the service to respect and uphold their privacy.
SREs must make the service secure and ensure privacy.
End users expect the service to be stable at all times. This means the service is free of errors while performing its responsibilities and does not change so often that end users constantly have to re-learn how to use it.
SREs must keep the service stable.
Developers demand the following from the service's SREs:
Developers expect the platform (provided by SREs) to give them agility to:
- Add new capabilities
- Modify existing capabilities
- Run experiments
- Fix instability
SREs must provide developers agility.
Velocity is defined as the speed with which developers can build on the platform with agility.
Developers expect the platform (provided by SREs) to not restrict their velocity by arbitrary or artificial barriers.
Developers expect the platform (provided by SREs) to assist them with increasing their velocity.
SREs must provide developers velocity.
Owners demand the following from the service's SREs:
Owners expect the service to cost as little as possible to operate at scale.
SREs must reduce the cost to operate the service across all levels of scale.
Owners expect the service to operate in compliance with various standards and regulations, such as (but not limited to) PCI, SOC2, FedRAMP, and Federal Aviation Regulations.
SREs must operate the service while remaining compliant with all relevant standards identified by Owners.
Owners expect the service to operate with minimal exposure to risks such as (but not limited to) security breaches, data exfiltration, and non-compliance.
SREs must perform a proper risk assessment of the entire service and develop policies and plans accordingly.
Owners expect the service to be governed properly with the right policies and procedures.
SREs must develop policies and procedures together with other stakeholders to meet the Owners' expectations.
Owners expect each stakeholder to have appropriate accountability.
SREs must develop the right framework to assess whether all stakeholders meet the accountability standards set by the Owners.
SREs demand the following from themselves:
SREs must constantly look for new ideas to:
- Embrace risk
- Reduce toil
- Avoid failures
These ideas can come from:
- Daily work and toil
- Resume Driven Development
- Following examples of others
SREs must execute with discipline to deliver all aforementioned demands promptly and with integrity. Delays will cascade and must be avoided at all times.
SREs must measure everything.
SREs must track these measurements over time.
SREs must use these measurements as feedback for all aforementioned demands.
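As a minimal sketch of "measure, track over time, use as feedback", the snippet below keeps a rolling window of one measurement and compares its average against a target. The `MetricTracker` class, the window size, and the "higher is better" comparison are hypothetical choices of mine, not part of any particular monitoring tool.

```python
from collections import deque
from statistics import mean


class MetricTracker:
    """Tracks the most recent samples of one measurement (hypothetical helper)."""

    def __init__(self, target: float, window: int = 7) -> None:
        self.target = target
        # Old samples fall out automatically once the window is full.
        self.samples: deque[float] = deque(maxlen=window)

    def record(self, value: float) -> None:
        """Store a new sample, e.g. one day's availability."""
        self.samples.append(value)

    def rolling_mean(self) -> float:
        """Average over the samples currently in the window."""
        if not self.samples:
            raise ValueError("no samples recorded yet")
        return mean(self.samples)

    def meets_target(self) -> bool:
        """Feedback signal: assumes higher values are better."""
        return self.rolling_mean() >= self.target


if __name__ == "__main__":
    tracker = MetricTracker(target=0.99, window=3)
    for daily_availability in (1.0, 1.0, 0.95):
        tracker.record(daily_availability)
    print(tracker.rolling_mean(), tracker.meets_target())
```

The same shape works for any of the demands above (cost, velocity, compliance findings), as long as the measurement and the direction of "better" are defined up front.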
This doctrine is subject to change over time as my own ideas and understanding evolve. With that said, this is a good place to start. These broad strokes can be used to guide policy, strategy, and tactics.