Site reliability engineering (SRE) is on the wish list of IT operations leaders at Global 2000 enterprises, and they are asking their internal teams or managed services partners how they can adopt the SRE operating model. Meanwhile, application portfolios are adopting agile and DevOps for newer initiatives, and the current IT operations model is an impediment: it slows feature release velocity, and the silo between development and operations hampers service reliability.
There is a lot of information in the public domain from Google and other SRE pioneers on how to adopt SRE, but a typical enterprise faces significant challenges in meeting these aspirations. To understand the problem more deeply, we need to look at the current operating model of these enterprises. Typically, they:
- Have a mature ITSM process implementation based on the ITIL framework. [An excellent process-centric model, with service levels defined for the execution of the processes.]
- Have a tower-based operating model, with service level agreements (SLAs) for each tower, operational level agreements (OLAs) managing the relationships between parts of the IT organization, and service integration and management (SIAM) to manage multiple suppliers and present a single business-facing IT organization. [Several towers, such as compute (Windows, Linux, Unix, mainframe), storage (SAN, NAS), network (DC LAN, WAN, firewalls), database (SQL, Oracle), application server (J2EE, .NET) and so on, are needed due to the scale of the systems and the skills required.]
Though it is called ‘site reliability engineering’, the ‘site’ is really the ‘service’: the service as experienced by a consumer, whether an end user (accessing a ‘website’ or a ‘mobile application’) or another application (calling an ‘API’, or maybe a ‘database’, or even using the ‘network’). An end-user service target such as 99.9% ‘availability’ over a 30-day month permits only 43.2 minutes of downtime. Yet even the best SLA in a P1 incident management process would have a resolution time of 2 hours at the tower level, and since an end-user experience can involve multiple towers, the resolution times add up. So moving to the SRE model is not a simple change for these enterprises.
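The arithmetic behind this mismatch can be checked with a short Python sketch. It assumes a 30-day month; the function name and the count of three towers in the incident path are hypothetical, chosen only for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Downtime (in minutes) permitted by an availability target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# 99.9% availability over an assumed 30-day month.
budget_min = allowed_downtime_minutes(99.9)

# Hypothetical P1 incident that crosses three towers one after another,
# each working to a 2-hour (120-minute) resolution SLA.
TOWERS_IN_PATH = 3          # assumption for illustration only
TOWER_RESOLUTION_MIN = 120  # the 2-hour tower-level SLA from the text
worst_case_min = TOWERS_IN_PATH * TOWER_RESOLUTION_MIN

print(f"Monthly downtime budget: {budget_min:.1f} min")
print(f"Worst-case additive resolution time: {worst_case_min} min")
print(f"Ratio of worst case to budget: {worst_case_min / budget_min:.1f}")
```

Even a single tower resolving an incident exactly at its 2-hour SLA would use nearly three times the monthly downtime budget.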
A well-planned, staged approach is required, transforming through progressive changes to reach the target state without significant disturbance along the way. Cloud infrastructure (public or private) is not essential for SRE-style operations, though its software-defined nature aids faster transformation. The first 4 stages of the transformation can be executed with no or minimal disturbance to existing operations, by adding a task force of people knowledgeable in SRE principles and the SRE operating model, complemented by an understanding of the enterprise’s business portfolio.
- Define the service and the service levels
This is the key first step, with no impact on operations, but one of the toughest to accomplish. Understand the service from the consumer’s perspective and identify the ‘service level indicators’ (SLIs) through their eyes. These are the parameters that matter to consumers; then define ‘service level objectives’ (SLOs) that describe the experience the consumer should get. [With ‘response time’ as the SLI of a website, the SLO could be <2s at the 99.9th percentile and <2.5s at the 99.95th. For a system with 50 transactions per second, i.e. 180K transactions/hour, just 180 transactions per hour can be >2s and 90 can be >2.5s.] Start by identifying the ‘services’ that are of topmost priority to the business and that represent a good mix of technologies and service requirements. This assures full support for the initiative, means the outcome is highly valued, heads off ‘this service is unique’ demands later, and creates a cascading effect across the enterprise. [SRE implementation is ‘invest to gain’, so start with strong use cases.]
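The percentile example above can be reproduced with a few lines of Python. This is a minimal sketch; the function name and the one-hour window are assumptions made for illustration.

```python
def slow_request_budget(tps: int, percentile: float, window_s: int = 3600) -> int:
    """Requests per window allowed to exceed the threshold set at `percentile`."""
    total_requests = tps * window_s
    # Round to the nearest whole request to absorb floating-point noise.
    return round(total_requests * (100.0 - percentile) / 100.0)


TPS = 50  # transactions per second, from the example in the text

for percentile, threshold_s in [(99.9, 2.0), (99.95, 2.5)]:
    allowed = slow_request_budget(TPS, percentile)
    print(f"<{threshold_s}s at p{percentile}: up to {allowed} slower requests per hour")
```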
- Build the dependency tree and allocate SLO
Systems in the enterprise are interconnected, and a data-driven approach is needed to reduce the towers and silos. [Get strong proof points on dependencies; they will be needed later to change the tower model.] For the identified ‘service’ to meet its SLO, create the dependency tree of internal and external components/services and allocate the SLO across the components based on processing complexity; the total of the allocations should be less than or equal to the overall ‘service’ SLO. [E.g. network latency <400ms, authentication service <200ms, application processing <250ms, database time <500ms, messaging <100ms, recommendation service (external) <500ms, adding up to 1,950ms against an overall SLO of 2s. Contrast this with the process SLAs in the tower model if each were a separate system.]
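A budget check of this kind is easy to automate. The sketch below uses the figures from the example and assumes the components sit on a single sequential path, so their allocations simply add up; calls made in parallel would need a different roll-up. The function name is hypothetical.

```python
OVERALL_SLO_MS = 2000  # overall response-time SLO of 2s

# Per-component allocations in milliseconds, taken from the example in the text.
allocations_ms = {
    "network latency": 400,
    "authentication service": 200,
    "application processing": 250,
    "database time": 500,
    "messaging": 100,
    "recommendation service (external)": 500,
}


def check_allocation(allocations: dict[str, int], overall_ms: int) -> int:
    """Return the unallocated headroom in ms; raise if the budget is overspent."""
    total = sum(allocations.values())
    if total > overall_ms:
        raise ValueError(f"allocated {total}ms exceeds the {overall_ms}ms SLO")
    return overall_ms - total


headroom_ms = check_allocation(allocations_ms, OVERALL_SLO_MS)
print(f"Allocated {sum(allocations_ms.values())}ms, headroom {headroom_ms}ms")
```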
- Implement the observability platform
The new way of working based on SLOs requires an observability platform that captures the required data on SLO allocations and associated parameters across all components. It must handle ‘volume’ (as more data points are collected) and ‘velocity’ (as data is captured at a more granular level) while ensuring ‘veracity’, in line with the monitoring needs of the SLIs and SLOs.
The observability platform should aid symptom-level detection and help drive proactive action before symptoms turn into problems. This requires significant change to the existing stack and a careful choice of technologies, and it is best to implement the change in a manner that does not intrude on current operations.
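As a rough illustration of symptom-level detection, the sketch below classifies one window of response times against a latency SLO and raises a warning before the objective is actually breached. It is not tied to any particular observability product; the function name, the status labels and the 50% early-warning level are all assumptions.

```python
from typing import Sequence


def slo_status(latencies_s: Sequence[float],
               threshold_s: float = 2.0,
               allowed_slow_ratio: float = 0.001,
               warn_fraction: float = 0.5) -> str:
    """Classify one window of response times against a latency SLO.

    allowed_slow_ratio=0.001 corresponds to a 99.9th-percentile objective.
    warn_fraction is a hypothetical early-warning level, not a standard value.
    """
    if not latencies_s:
        return "no-data"
    allowed_slow = len(latencies_s) * allowed_slow_ratio
    slow = sum(1 for t in latencies_s if t > threshold_s)
    if slow == 0:
        return "ok"
    if slow > allowed_slow:
        return "breach"
    if slow >= warn_fraction * allowed_slow:
        return "warn"  # symptom: budget is being consumed, act before a breach
    return "ok"


# Synthetic window of 10,000 requests, 6 of them slower than 2s.
window = [0.4] * 9994 + [2.3] * 6
print(slo_status(window))
```

Here the ‘warn’ state is the point at which proactive action would be triggered, while ‘breach’ means the SLO for the window has already been missed.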
- Plan the change for process and operating construct
Examine the components’ adherence to their SLO budget allocations and the extent of any deviations, look at the towers’ boundaries, and study the existing SLAs and OLAs of the critical components in the execution chain for a ‘service’.
- Evaluate the minimal order of change: can each tower agree on the new allocation of SLOs, irrespective of the old SLAs? Can the OLAs between towers be tuned to meet the new SLOs? [Largely, this is very difficult to achieve.]
- Consider consolidating towers to reduce hand-offs. [For example, the towers in the infrastructure space.]
- Adopt the SRE execution approach (a single team across incident handling and improvement initiatives) and ways of working (data-driven; a collaborative mindset; a system-flow-based approach that tackles an issue right from the source through every connecting point; a customer-centric approach that addresses their pain points and service impact), and induct automation tools (no manual task repetition, everything as code).
- Enumerate the skillsets required, with the primary focus on a multi-skilled talent force to operate in the target state. [Automation and consumer understanding should be common skills.]
[Note: Up to this point, no major change has been made to any of the existing systems/processes.]
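The adherence check that opens this stage can be sketched as a comparison of measured component times against their allocations. The measured values below are invented purely for illustration, and the function and component names are hypothetical; in practice the measurements would come from the observability platform.

```python
def deviations(allocated_ms: dict[str, int],
               measured_ms: dict[str, int]) -> list[tuple[str, int]]:
    """Components whose measured time exceeds their allocation, worst first."""
    over = [
        (name, measured_ms[name] - budget)
        for name, budget in allocated_ms.items()
        if measured_ms.get(name, 0) > budget
    ]
    return sorted(over, key=lambda item: item[1], reverse=True)


allocated = {"network": 400, "authentication": 200, "database": 500}
# Invented measurements, for illustration only.
measured = {"network": 380, "authentication": 260, "database": 720}

for name, excess_ms in deviations(allocated, measured):
    print(f"{name}: over allocation by {excess_ms}ms")
```

Ranking the overruns this way points to the towers whose SLAs and OLAs deserve the closest study.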
- Operationalize the new process and operating construct
Changes across technology, processes and the operating model, with the right people, are needed to address the deviations in SLO budget allocation. The task force that has worked so far would operationalize these by bringing in structured changes, starting with deploying enhanced observability and automation, then realigning teams and staffing them with the required skill levels, establishing ‘service’-centric engagement models with the ‘consumer’, and reporting on service excellence.
Execution Approach
- Move from a process-centric to a ‘service’-centric operating model, while keeping the right level of process for industry and regulatory compliance needs. [Adherence to process should be seamless, with guard rails embedded in the automation and instrumentation.]
- Evolve from an approach of avoiding failures to one that accepts failure as normal, and create the capabilities to handle failure gracefully and recover faster with the right processes, people and tools such as automation. [This is a significant change for the service owners and the teams managing the services.]
Technology – Managed services to Engineered systems
- Engineer for self-managing (proactive) systems: self-service provisioning, workload modeling and dynamic thresholds, auto-scaling, zero-downtime deployments, partial failure handling, chaos engineering.
- Automate for auto-healing (reactive): patch management to reduce downtime, batch monitoring and remediation to reduce failures, incident handling (potentially with AI) to reduce MTTR (improved observability reduces MTTD and provides information for better AI-based resolution), and a robust mechanism to capture the effort and time spent on operations and incident handling.
Talent & operating model
- Digitally oriented, multi-skilled talent placed in multi-disciplinary teams aligned with ‘services’. [Provide a learning platform for the existing team to broaden their domain knowledge and upskill on new technologies and automation; complement this with the required external expertise through hiring and external partners.]
- Reduce organizational silos and create distinct ‘services’ teams based on the maturity and complexity of the services. [Existing towers, managed services partners and contracts need restructuring for singular responsibility.]
- A single integrated team should carry out ‘incident handling’ (toil) and ‘automate for auto-healing’, and work with upstream teams on ‘engineer for self-managing’ (project work), with time balanced between them. [Attract the best multi-skilled talent and keep them motivated.]
- Secure executive alignment for the new operating construct, with an appropriate risk-reward model for SLO adherence.
As SRE is operationalized for the initial set of ‘services’, co-existence with the pre-existing tower structures is inevitable during the transition period. A close watch on the OLAs and active participation by the SIAM teams will improve the chances of success. Although it will be evident that the tower structure will be disbanded in the medium to long term, the tower teams need to be incentivized to actively support the transition, and more of the existing team members should be encouraged to upskill and take on new responsibilities in the SRE teams.
Measuring the success with SRE
- Increase in the release velocity of business features to consumers, as the SRE operating model closes the loop on the DevOps implementation for an agile organization.
- Adherence to SLOs, with no distinction between planned and unplanned activities/downtime.
- Ability to handle a volumetric increase in the services/resources/transactions managed with a less-than-proportional expansion of the SRE team, or to shrink the SRE team for the same volumetrics.
- Happy SRE team members, free of ‘incident handling’ fatigue and not worrying about the pager.