Setting Expectations with SLAs
Outsourcing SLA - Setting Expectations
In a new industry, where defined metrics do not yet exist, expectations are unclear: what should a consumer expect with regard to a Service Level Agreement (SLA)? Is it a mere insurance policy against failures, is it nothing more than a discount structure applied through punitive means, or is there room for a true SLA to provide competitive advantage to a business customer? This document tells businesses what to expect when navigating the many issues surrounding Service Level Agreements in the outsourcing marketplace.
Online backup is an example of an outsourced service and some of the criteria below may apply to this simple, yet critical business process. This document does not, however, imply that the criteria below are strictly relevant to online backup nor that all the criteria are applied by Backup Direct™ in relation to its online backup service.
Why ask for an SLA anyway?
The first question that must be answered in dealing with any SLA is ‘why bother?’ While it may seem trite, understanding why a business wants an SLA is fundamental to the mutual success of both provider and consumer. At its core, an SLA is a punitive document. It is part marketing brochure, part boast. It is a statement about what capabilities a business believes it can offer, and what performance it can sustain. But at its heart, the document boils down to punitive measures enforced when performance does not meet promises. Remedies are typically financial and can seldom repair the damage that arises from non-performance. Consider, for example, a mission-critical application hosted by a third-party service provider that goes down due to infrastructure failures at the service provider. While the clock may start ticking against the SLA-warranted downtimes, the damage incurred by the business consumer will far outweigh the credits given. These credits are generally applied against the hosting service charges on a particular month’s bill. Credits offered against minuscule monthly hosting fees are insignificant compared with the lost revenues of a business consumer, the damage to business reputation and credibility, and the potential career damage to the decision maker who authorized the outsourcing contract in the first place. Since no financial remedy will ever be adequate, focusing on the punitive aspects of an SLA is one of the pitfalls to avoid when crafting this document. The business consumer should avoid paying too much attention to remedies for failure, and focus more energy on the mechanisms that prevent failure from occurring in the first place. A successful SLA will remove doubt from the consumer’s mind, and will do more to help ensure business continuity than to offer financial compensation for substandard performance.
What approach makes the most sense?
One of the contested issues around SLAs in the Online Backup marketplace is an ongoing philosophical difference between component-based agreements and holistic, service-based agreements. Providers are more often able to measure elements or components of discrete services, and thus tend to offer remedies and documentation that address a given service component by component, without ever tying it all together. Other providers have opted for a more holistic approach at the service level, but the pitfalls have been non-specific warranties, generic language, and mismatched expectations between provider and business consumer. The most successful SLA implementations begin by looking at the service as a whole, from the customer’s point of view. Identifying the key aspects of a solution that must function properly serves as a first step in identifying how to tie these elements together under a complete SLA. Providers must realize that a partial solution is no solution, and use component-based capabilities to build a complete overview that warrants all the components of a service and gives the business consumer assurance that the overall solution will be delivered as promised.
What should be in an effective SLA?
There are three key areas that are generally addressed in a successful SLA. The first is categorized as ‘infrastructure warranties’. In the Online Backup marketplace this category tends to include performance characteristics around facilities, connectivity, hardware reliability and general availability of discrete technologies. The second key category of an SLA is ‘process warranties’. This category includes items like turnaround times for work process events such as adding a new user, deleting a user, or setting up a new account. However, it may be extended to include items such as developing a new scripting module to accommodate ‘x’ requirement in ‘y’ period of time. The third key category of a successful SLA is ‘escalation warranties’. This category is designed to give as much assurance as possible during unforeseen failures, acts of God, external contributor failures, etc. As no one, perhaps excluding God, can guarantee perfection, this category is designed to outline how a failure is resolved, what time frames to expect, what percentage of failures may fall within a given level of disaster, and so on.
The infrastructure warranties section of an SLA is the easiest section for a business consumer to become enamoured with. Providers are generally quick to throw out impressive quality standard numbers such as ‘five nines’ (99.999%) uptime. Some vendors only count the nines behind the decimal point; some imply it. But the net effect is to offer an extremely high level of availability to the consumer, and ideally lift that consumer’s feelings about how the vendor will perform given the guarantee. Herein lies another pitfall to avoid: the number of nines in a given guarantee can quickly be negated by two factors. The first is exclusions; the second is relevance.
Most service providers who offer such high availability standards provide a laundry list of exclusions, periods of time that are exempt from the overall SLA measurement. Common exclusions are ‘scheduled maintenance windows’, which may involve anything from upgrading equipment, to periodic reboots, to backing up critical information. For example, depending on the application technology being offered and the proficiency and skills of the provider, an NT platform is often placed on scheduled reboots to ‘clean up’ a given system’s performance and thereby reduce the number of ‘unplanned’ negative events. If this reboot is scheduled more than once a week, and takes 15-30 minutes or more to complete, the ability to offer true five nines performance is negated. Other exclusions may involve the infamous acts of God, warfare, or terrorism, but more importantly may exclude SLA provisions for the failure of a third party that the service provider does not directly control. This limitation is often rightly used to exclude a local ISP of the business consumer, where the provider does not directly control the ISP’s operations. It may also be deployed against software vendors for ‘anomalies in the code base’ that require the software vendor itself to fix. Sometimes connectivity providers fall into this arena for the sections of the network they operate between the consumer and the ultimate hosting provider. Business consumers should be wary, closely examine the exclusions of an agreement, and ensure that the ‘real’ availability of the service matches the perceived warranties.
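To make the effect of a maintenance exclusion concrete, the arithmetic can be sketched in a few lines of Python. This is an illustrative calculation only: the function names are hypothetical, and the schedule assumed here (a single 15-minute reboot per week, counted as downtime rather than excluded) is an assumption chosen for the example.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Downtime budget implied by an availability percentage over a period."""
    return period_minutes * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_minutes: float, period_minutes: float) -> float:
    """Availability percentage actually delivered for a given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


# Downtime that a 'five nines' warranty leaves per week, in seconds.
five_nines_budget_seconds = allowed_downtime_minutes(99.999, MINUTES_PER_WEEK) * 60

# Assumed schedule: one 15-minute reboot per week, counted as downtime.
with_weekly_reboot = achieved_availability_pct(15, MINUTES_PER_WEEK)

print(f"Five nines allows about {five_nines_budget_seconds:.1f} seconds of downtime per week")
print(f"One 15-minute weekly reboot yields {with_weekly_reboot:.3f}% availability")
```

Under these assumptions the weekly budget for five nines is roughly six seconds, so even one short scheduled reboot leaves the delivered figure below three nines unless the agreement excludes that window from measurement.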
A business consumer should also take special note of the ‘Application’ wording in the outsourced marketplace with respect to exclusionary provisions. A close examination of the software license agreement of any off-the-shelf product shows language that excludes the software from providing any benefits at all, except, it would appear, by sheer accident. Every possible liability is disclaimed. Given that primary software providers disclaim all warranties on their software, Online Backup providers have a difficult task attempting to warrant a service that relies at its core on products the original manufacturers refuse to warrant in any way. While premium service providers will attempt to ‘own’ the entire service experience for their customer base, at least from a single-point-of-contact or responsibility point of view, no service provider can warrant software code beyond what the developer will take responsibility for.
The second negating factor to examine against multi-nine performance warranties is relevance. The most resilient component of a given service is generally the one touted in a multi-nines warranty claim. This is commonly the resiliency of the facility, or data centre building itself. The second most common measurement is network uptime or connectivity, which, if the provider is worth their salt, generally spans more than one vendor, involves peering arrangements, and is redundant enough to warrant true ‘high availability’. Whether it is the data centre or the network layer that is being measured, in the Online Backup marketplace the truly relevant measure is several layers up the OSI model. Ultimately the only measure that truly counts is at the application tier. This implies performance of the application itself, middleware platforms (if any), the operating system, the hardware, the network, and yes, the building itself. It is not common to offer multi-nine warranties when measured at this layer, since the complexity of the environment (variability in the implementation of the OSI model) makes it hard to predict. Business consumers should probe what measurement statements are being made about the service and ensure that the most relevant measures are in place to warrant overall successful service delivery.
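A short sketch shows why a multi-nine figure for one layer says little about the application tier. It assumes that the service is only available when every layer is available and that layers fail independently, in which case the end-to-end figure is the product of the per-layer figures. The layer names and percentages below are invented for illustration and are not taken from any provider.

```python
from math import prod


def composite_availability(layer_availabilities) -> float:
    """End-to-end availability when every layer must be up (independent failures assumed)."""
    return prod(layer_availabilities)


# Hypothetical per-layer availability, expressed as fractions.
layers = {
    "facility": 0.99999,
    "network": 0.9999,
    "hardware": 0.999,
    "operating system": 0.999,
    "application": 0.995,
}

end_to_end = composite_availability(layers.values())

print(f"Best single layer:  {max(layers.values()):.3%}")
print(f"Application tier:   {end_to_end:.3%}")
```

With these assumed numbers the facility alone reaches five nines, yet the figure the end user experiences is closer to 99.3%, lower than the weakest single layer.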
Another common pitfall to avoid in the infrastructure sections of an SLA is lack of attention to frequency and capacity issues. It seems like a ‘no-brainer’, but it is surprising how little information service providers publish in SLA documents about the frequency of the measurements taken against a given component of the service. For example, it is easy (albeit risky) to tout higher availability if a service characteristic is only measured once per day or, as is more commonly seen, once per hour. A single successful response can mask 59 minutes of downtime that may or may not have gotten attention from staff. Polling at smaller intervals makes overall service response time much better, focuses attention on problem areas more quickly, and effectively ‘insures’ better system performance, even though statistically the SLA results could look similar to the end consumer. A reasonable measurement interval is generally around 5 minutes. More frequent measurement than that can burden the system (the theory that we affect what we watch), while less frequent measurement tends to lose responsiveness to potential issues. Business consumers should require from providers a list of the tools used to measure the system, and should ask what frequency of measurement is used to determine the numbers provided.
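The effect of polling frequency on the reported number can be sketched as a small simulation. The function below is a hypothetical illustration, not a description of any monitoring tool: it treats availability as the fraction of polls that find the service up, and the outage (59 minutes beginning one minute after an hourly poll) is an assumed worst case.

```python
def measured_availability(outages, poll_interval, period):
    """Fraction of polls that find the service up.

    outages: list of (start, end) minute offsets, end exclusive.
    poll_interval and period are in minutes; polls start at minute 0.
    """
    polls = range(0, period, poll_interval)
    failed = sum(
        1 for t in polls if any(start <= t < end for start, end in outages)
    )
    return 1 - failed / len(polls)


DAY = 24 * 60
outage = [(601, 660)]  # assumed: 59 minutes down, starting just after the 10:00 poll

hourly = measured_availability(outage, 60, DAY)
five_minute = measured_availability(outage, 5, DAY)
actual = 1 - 59 / DAY

print(f"Hourly polling reports    {hourly:.2%}")
print(f"5-minute polling reports  {five_minute:.2%}")
print(f"Actual availability       {actual:.2%}")
```

In this scenario the hourly poll never lands inside the outage and reports a perfect day, while the 5-minute poll catches eleven failed checks and reports a figure close to the actual one.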
Generally it is easier for a service provider to achieve better uptime results over a longer measurement period. This is not necessarily a disadvantage to either the service provider or the business consumer, as the goal is to raise performance statistics over time (i.e. the longer a service runs, the better off a business consumer is). However, the remedies listed in an SLA should be tied to the billing frequency of the business consumer. For example, if the service being provisioned is billed monthly, which is most common in the Online Backup sector, the SLA measurement periods should also be monthly. This allows the business consumer to review on each month’s invoice the line item credits associated with any breach in SLA performance. It also facilitates a tie-in between the value the service provider offers and the components of the service that exceed, match, or fall short of the consumer’s performance expectations. Business consumers should ensure that the measurement periods of the service correspond to the billing cycle, and that line item credits for SLA non-performance are easy to understand, clearly tie to discrete system performance, and are clearly articulated by the service provider on the billing invoice.
Capacity concerns are sometimes overlooked in an SLA, with negative results. It is sometimes difficult to warrant certain characteristics of a given solution, for example that CPU utilization or RAM consumption will not exceed ‘x’ percent. However, a provider can take certain steps to ensure, for example, that peak network usage never exceeds a sustained 70% of overall available bandwidth. A provider can warrant in the process section of an SLA that hard drive capacity utilization will be escalated to the consumer as key targets are reached: 50% filled, 75% filled, and 90% filled, for example. It is important when considering capacity utilization issues that the business consumer does not attempt to warrant ‘best practices’ as part of an SLA. Best practices tend to evolve and change over time, and a service provider should be free to manage the solution for the customer by learning on a continual basis. Forcing a service provider to keep a given component’s utilization at some arbitrary number may in fact do the system more harm than good. Business consumers should be careful to avoid constricting language that does not allow a service provider to implement improvements in capacity management as they evolve over time.
The statistical sample set used in the measurement is also significant to the business consumer. For example, it is easier to maintain higher uptime reporting standards when measuring downtime over a large base of servers, than on an individualized (my server only) basis. A service provider should be able to measure ‘dedicated’ servers, or those used specifically by a single customer, and provide reporting to that individual customer regarding their discrete performance. However measuring common components of a solution becomes much more difficult. A fax gateway, or e-mail gateway translation service for example, may be deployed for a large group of customers where throughput of messages is likely measured at the gateway itself, rather than by an individual customer. The higher value added providers however, will find a way to offer the business consumer visibility into their discrete usage of common infrastructure utilities.
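The difference between fleet-wide and per-customer reporting can be illustrated with a small calculation. Everything in this sketch is hypothetical: a fleet of 50 servers, a 30-day month, and a single dedicated server that is down for eight hours while the rest run without interruption.

```python
def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Availability as a fraction of the measurement period."""
    return 1 - downtime_minutes / period_minutes


MONTH = 30 * 24 * 60  # minutes in an assumed 30-day billing month

# Hypothetical fleet of 50 servers with no downtime...
downtime_by_server = {f"server-{i:02d}": 0 for i in range(1, 51)}
# ...except one customer's dedicated server, down for 8 hours.
downtime_by_server["server-07"] = 480

# Fleet-wide figure: total downtime over total server-minutes.
fleet = availability(
    sum(downtime_by_server.values()), MONTH * len(downtime_by_server)
)
# Per-server figures: what each individual customer actually experienced.
per_server = {
    name: availability(minutes, MONTH)
    for name, minutes in downtime_by_server.items()
}

print(f"Fleet-wide availability: {fleet:.3%}")
print(f"server-07 availability:  {per_server['server-07']:.3%}")
```

Averaged across the fleet the month still looks like better than three nines, while the customer on the affected server saw less than 99%. This is why reporting on the individual customer's discrete performance matters.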
The old adage, you get what you pay for, tends to ring true with respect to heading off potential service outages by purchasing additional resiliency in the components of any given hosted solution. For example, a hosted solution may only require a 300 MHz CPU, a 4 GB hard drive, and 64 MB of RAM in order to function properly. Consumers will readily purchase faster CPUs, since the going rate of speed as of this writing is 800+ MHz; they will not think twice about spending the additional hardware costs in order to achieve ‘theoretically’ better system performance, or in an effort to anticipate additional solution requirements at some future time. Purchasing resiliency, however, may meet the future performance objectives while at the same time offering immensely better protection against potential service failures. Instead of buying the additional MHz in the CPU, it may be better to buy a second server entirely that could be mirrored or load balanced to protect against outages. Load balancing with a second machine may wind up offering much better overall performance than simply upgrading a single box. Service providers should be able to provide the business customer with detailed system hardware and software requirements for optimum solution performance; this should include single-configuration solutions as well as load-balanced solutions. If the consumer opts to purchase more resiliency, the metrics for SLA performance of technical components of the system should be markedly higher.
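The case for buying a second server rather than a faster one can be sketched numerically. The calculation assumes that the mirrored servers fail independently and that the service stays up as long as at least one of them is up; the 99% single-server figure is invented for the example, and real failures are often correlated, so this is an upper bound rather than a prediction.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of a mirrored group that is up while any one copy is up.

    Assumes independent failures: the group is down only when all copies are down.
    """
    return 1 - (1 - single) ** copies


single_server = 0.99  # hypothetical availability of one server
mirrored_pair = redundant_availability(single_server, 2)

print(f"Single server: {single_server:.2%}")
print(f"Mirrored pair: {mirrored_pair:.2%}")
```

Under these assumptions a second identical machine moves the figure from two nines to four, which a faster CPU in a single box cannot do.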
Technical jargon and detailed explanations of infrastructure measurement techniques can help assure a user that the service provider understands how to run the solution. However, they do nothing to assure the consumer that the provider understands the criticality of the full solution to the business they serve. This is the area where process warranties make all the difference. When specific tasks are warranted on specific timelines, the consumer knows, for example, that setting up a new account will occur in ‘x’ minutes, and that deleting a terminated end-user will occur throughout the system in ‘y’ minutes. These assurances allow the business consumer to develop business practices that the solution provider can warrant performance against. Knowing that requests for change, or work orders if you will, are warranted for turnaround in defined time periods takes the guesswork out of planning, and assures both the provider and the consumer that the important functionality of the application is being addressed in a way that is meaningful to the consumer. It also gives the provider a productivity target against which to develop better performance over time.
Process warranties are not limited to work-order related tasks. They can also include dissemination of business information related to utilization of the solution resources themselves. For example, a process warranty that notifies the consumer when 50%, 75%, and 90% of a hard drive’s free capacity has been exhausted allows the consumer to address the issue. It may be that the consumer enacts utilization policies for the end-user base that uses the solution, such as ‘please limit your online storage to 5 MB per user’, ‘please delete information older than “x” days’, or ‘please delete these sections of data within the solution itself’. Proactive notification becomes a competitive advantage, allowing the consumer to craft the appropriate response for the business. When data must be stored over the long term, and growth is high or unpredictable, the business consumer may want more notifications between 50% and full. These become items of negotiation in the deployment of a successful SLA that, again, keep the provider and consumer focused on the key aspects of solution performance.
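A capacity notification warranty of this kind reduces to a simple rule: notify once each time usage crosses a negotiated threshold. The sketch below is a hypothetical illustration of that rule. The function name is invented, the default thresholds are the 50%, 75%, and 90% figures used as examples above, and how the provider samples disk usage is left out.

```python
def crossed_thresholds(previous_pct, current_pct, thresholds=(50, 75, 90)):
    """Return the thresholds passed between two consecutive usage readings.

    A threshold counts as crossed when the previous reading was below it
    and the current reading is at or above it, so each one fires only once
    as usage grows.
    """
    return [t for t in thresholds if previous_pct < t <= current_pct]


# Usage grew from 48% to 76% between readings: two notifications are due.
print(crossed_thresholds(48, 76))

# A consumer with unpredictable growth might negotiate extra checkpoints.
print(crossed_thresholds(48, 76, thresholds=(50, 60, 70, 75, 80, 90)))
```

Passing the thresholds as a parameter reflects the point that the number of notifications between 50% and full is itself an item of negotiation.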
Response Times
Process warranties can also include response times of the overall solution to stimuli (most often web page retrievals or mouse clicks). While response times can also appear in the infrastructure section of an SLA, identifying them in the process section sends a message to the business consumer that this service provider knows the truly important items to be measured in the SLA. Who cares if the turnaround time on a database transaction is less than 3 seconds, for example, if it takes 12 seconds to refresh the screen? Measurements of key response times to system stimuli should cover the most important aspects of the solution, and should take into account variability in sections of the solution that the provider may not be able to control directly. For example, a 3 second response time to process a transaction (including the screen refresh) is common, but what if the user base extends to international locations? Will the system perform acceptably from Tokyo to London to New York? Should it? Response time warranties force the business consumer and the service provider to think outside the box, and to consider end-user scenarios that may not be ‘normal’ to business operations but may arise from growth or other factors. This will help avoid problems with differing expectations later in the relationship.
Measurement within the process section is as critical as it is in the infrastructure sections of the document. Who will monitor that the processes are functioning according to the specifications? Normally, the provider will assign this task to internal personnel, but it is equally important for the provider to document who will monitor compliance attributes, how, and when, and how they will be reported to the business consumer. It is not unreasonable for the business consumer to negotiate outside auditing of the process section of an SLA. Indeed, these functions are the easiest for third parties to measure, document, and report against. The infrastructure section of an SLA may rely on proprietary techniques or inside technical information that forms the basis of a competitive advantage for the service provider. But process warranties are formed at a higher tier, generally involve business processes only, and are therefore easier for internal or external parties to validate. The business consumer should be explicit with the service provider as to who will audit, how often an audit will be conducted, and of course how disparities in reporting will be addressed.
Sometimes referred to as the ‘customer care’ portion of an SLA, the escalation warranties section deals primarily with what to expect when the unforeseeable occurs. It is in this section of the SLA that a service provider has the opportunity to distinguish itself from the pack. The SLA should contain language to help set expectations regarding failure classifications and frequency, and then define escalations both inside and outside the provider’s direct control. For example, a business consumer of Online Backup services should expect that 80% of the support calls to the service provider will be quickly and efficiently diagnosed as ‘client’ problems. History and statistics demonstrate that the piece of any application solution most likely to fail is the one where variability is highest. The PC is truly a ‘personal’ device, generally subject to the end user downloading applications from the Internet, changing configurations to accommodate games, or engaging in other equally ‘personal’ behaviour, despite the fact that the PC is generally a corporate-owned asset. Variability is therefore generally highest at the client at any given point in time, driving the 80% causal factor for failure. There is less variability in connectivity, typically the next known culprit in perceived system non-performance; this is equalled by lack of training on product functionality and feature sets, followed by true errors in the system infrastructure, and lastly by real errors in the code base of a given application. This then becomes a hierarchical pecking order for system failures at an Online Backup provider. It is important for business consumers to understand this as they negotiate escalation procedure warranties.
Avoid the pitfall of requesting the 80% expected client issues be escalated and reported immediately throughout the management chain at the provider and the business consumer sites – focus on the remaining 20% of issues that can be far more significant in terms of impact on the solution, and potentially far longer lasting in terms of outage time.
Escalations should be complete (covering both internal and external resources), and should describe the expertise hierarchy of the provider. For instance, a common escalation chain may start at the tier 1 level of customer support; failing resolution within a defined period of time, it would move to the tier 2 level of customer support. This pattern generally spans 2 to 4 levels within support, but should then reflect an escalation to the Operations and/or Engineering staff. Escalations should further detail the levels within Operations or Engineering (how many, at what time intervals, to what groups in parallel, if any, etc.). Lastly, the service provider should identify what types of support relationships it has with the third-party providers upon which it relies. For instance, if a service provider relies upon MCI for Internet circuits, it should give the business consumer an outline of how an escalation takes place from the provider to MCI. There may be special expedited support mechanisms in place with third-party providers that distinguish a given Online Backup provider from the rest. The business consumer should pay particular attention to these relationships and escalation paths, as the most significant failures of a solution will inevitably wind up here. Defined relationships, a history of collaboration with vendors on problem resolution, and the like will go a long way toward shortening the time to resolution and restoring the service to the consumer.
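An escalation chain of this shape can be written down as an ordered list of levels, each with a time limit after which the issue moves on. The sketch below is purely illustrative: the level names follow the chain described above, but the time limits (30, 60, and 120 minutes) and the function name are assumptions, and a real agreement would state its own figures and any parallel notifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    minutes_before_escalation: int  # time this level holds the issue


# Hypothetical chain; the final level has no further escalation.
CHAIN = [
    Level("Tier 1 support", 30),
    Level("Tier 2 support", 60),
    Level("Operations / Engineering", 120),
    Level("Third-party provider", 0),
]


def owner_after(elapsed_minutes: int, chain=CHAIN) -> str:
    """Return which level should own an unresolved issue after a given time."""
    deadline = 0
    for level in chain[:-1]:
        deadline += level.minutes_before_escalation
        if elapsed_minutes < deadline:
            return level.name
    return chain[-1].name


for minutes in (10, 45, 100, 240):
    print(f"{minutes:>3} min unresolved -> {owner_after(minutes)}")
```

Writing the chain as data rather than prose makes it straightforward to check afterwards whether an incident was escalated within the warranted intervals.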
Status during a crisis is the most important element of this section of the SLA. A business consumer has a right to expect regular updates on where a given open item is within the resolution processes of the service provider. The best mechanism for providing this information is generally a secured Internet site, as this is the most accessible option. Business consumers should expect to see trending information related to the performance of the service provider in resolving open trouble tickets, keeping in mind that the customer support organization cannot mandate the quality of a service solution; only engineering can do this. Trending data will show the business consumer how quickly items are resolved, how many items are submitted and open, and, when examined over a 12 to 15 month period, the overall quality of the service. The number of tickets should go down over time when a system performs as expected, assuming no new features are deployed and the environment remains consistent.
Reporting of status information and escalation compliance is something the business consumer should take note of as well. How does the Online Backup provider report against prescribed escalation procedures? What mechanisms are in place to ensure that this information is accurate, and not produced only after a challenge from the business consumer? Ideally the consumer should see a monthly report that matches SLA information against services billed within the period. It should provide highlights of compliance against objectives, and show credits due for any non-performance issues that may have arisen. Reporting more frequently than monthly becomes a costly proposition for the service provider, and while the provider may be able to accommodate the request, the business consumer should expect to pay a premium for this type of additional reporting.
So how does a business derive competitive advantage from an SLA?
The savvy business consumer should examine a service provider’s capabilities regarding the SLA, as well as its vision regarding how SLA information is presented to the customer. The end goal is to present information to the consumer on a timely basis that allows the customer to make intelligent business decisions around the usage of a given solution. Trending data is critical to proper analysis, and is indicative of advanced systems and capabilities on the part of a service provider. It implies a graphical presentation (charts), it implies storage of historical data, and it implies the ability to archive data over time (or the Online Backup provider will drown in data overflow). The business consumer does not need real-time feeds of this information, as ‘real-time’ implies costs significantly higher than the value it provides, with the exception of crisis status or service outage notification. One-day-old information is generally the most valuable and the easiest to collect and present. While the Online Backup provider may not yet have developed the actual tools for the delivery of this information, it is critical to the business consumer that the provider shares this vision and is actively engaged in making it a reality.
Most Online Backup providers are still struggling with building intelligent capabilities into their service offerings and monitoring tool sets. Implementing auto-fix scripts that take into account information from the network, hardware, OS, and application layers, analyze it from a composite point of view, and correct errors automatically is still more a lofty goal than common practice at this point. But again, the business consumer must examine the Online Backup provider’s vision with regard to SLA implementation and ensure that this capability is an end goal of the monitoring and measurement systems put in place to collect and report against SLA performance. This type of capability is new to the industry as a whole, but represents the best hope for ensuring stability in highly complex computing environments.
Mutually beneficial outcomes
Structuring a ‘win/win’ agreement may motivate both parties to perform. Although an overused cliché, the ‘win/win’ agreement financially motivates both the service provider and the business consumer to achieve common goals. Online Backup providers tend to use this concept to gain additional revenue from the services provided if they hit all the performance objectives. But this is simply added cost to the consumer, the reverse of punitive discount structures. It does not represent a true incentive to perform, only additional revenue to the provider. True ‘win/win’ agreements will expand into revenue sharing opportunities for both parties. For example, if the reliability of a service wins industry awards for the business consumer that truly distinguish its solution from others, the value of these awards to the marketing organization could be reflected in financial compensation. Setting up reciprocal service agreements, where service provider and business consumer refer each other’s customers for cross-sell opportunities, is another way to achieve a real ‘win/win’. The implication of such terms and conditions is that performance will have to meet or exceed expectations or revenue flow would cease for both parties.
What is online data backup?
An online data backup service is a proven alternative to traditional optical and tape backup solutions and can be considered a perfect solution to use on an outsourced basis - as it is a critical but non-core business function.
Most businesses understand the need to protect their most valuable business asset. It is too easy to become victim to human error, PC crash, a virus, malicious actions, flood, fire, theft or loss of a PC.
Traditional backup solutions can be effective, but require capital expenditure and internal staff to maintain and operate them. They are generally considered a hassle, especially during a crisis when key data needs to be recovered quickly.