I’ve worked on services with 5 nines of availability (i.e. 99.999% available, about 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.
Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.
A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
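For anyone who wants to check the arithmetic above, here is a minimal sketch of the downtime budgets. The function name and the 365.25-day year are assumptions made for illustration, not anything from a real SLA tool.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365.25 * MINUTES_PER_DAY  # average year length (assumption)


def downtime_budget_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime for `nines` nines of availability (2 -> 99%, 5 -> 99.999%)."""
    unavailability = 10.0 ** -nines
    return period_minutes * unavailability


if __name__ == "__main__":
    for nines in range(2, 6):
        per_year = downtime_budget_minutes(nines)
        per_day = downtime_budget_minutes(nines, MINUTES_PER_DAY)
        print(f"{nines} nines: {per_year:9.2f} min/year, {per_day:7.3f} min/day")
```

Under those assumptions, two nines works out to roughly 14.4 minutes per day, four nines to roughly 53 minutes per year, and five nines to a bit over 5 minutes per year.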
I call bullshit on “Gamers wouldn’t put up with a two-nines system for a video game”.
Elder Scrolls Online has a weekly scheduled outage of about 8 hours. Every Monday. Players have been complaining about it for years, but the game is still popular.
How often is it offline outside the maintenance windows?
Yeah, maintenance windows are annoying, but they don’t really count when describing the availability of a system. Many government systems are only available during normal business hours. That means they’re offline for 16 hours per day. What matters is how available they are when they’re supposed to be working.
For Elder Scrolls, two nines would mean that the game was allowed to be down for more than an hour a week outside of those maintenance windows. Or, if measured by quarters, which is more typical, the game would still have those maintenance windows, but, in addition, it might be down for a full day once per quarter.
Basically, the 8-hour window every Monday is a trade-off so that the rest of the week is uninterrupted. They probably manage three nines the rest of the week by shifting any serious maintenance into the weekly downtime.
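A rough sketch of the "maintenance windows don't count" arithmetic, assuming availability is measured only against the hours the service is scheduled to be up. The function name and the 13-week quarter are illustrative assumptions.

```python
HOURS_PER_WEEK = 7 * 24


def unplanned_downtime_budget_hours(
    availability: float,
    weeks: int,
    maintenance_hours_per_week: float = 0.0,
) -> float:
    """Unplanned downtime allowed over `weeks`, excluding scheduled maintenance."""
    scheduled_hours = weeks * (HOURS_PER_WEEK - maintenance_hours_per_week)
    return scheduled_hours * (1.0 - availability)


if __name__ == "__main__":
    # Two nines with an 8-hour weekly window, measured per week and per quarter.
    print(unplanned_downtime_budget_hours(0.99, weeks=1, maintenance_hours_per_week=8))
    print(unplanned_downtime_budget_hours(0.99, weeks=13, maintenance_hours_per_week=8))
    # Three nines for the same schedule, per week.
    print(unplanned_downtime_budget_hours(0.999, weeks=1, maintenance_hours_per_week=8))
```

With these numbers, two nines allows about 1.6 hours of unplanned downtime per week or about 21 hours per quarter, while three nines allows under 10 minutes per week.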
And, as for the game being “still popular”, one site says that there are currently 7199 players in Elder Scrolls but more than 161k in World of Warcraft. It could be that part of the reason that World of Warcraft is more popular is that it doesn’t have 8 hour maintenance windows every week, but it does often have 2+ hour windows. The number of players who are willing to put up with 8 hour maintenance windows every week seems pretty small.
Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem
No, it means you don’t have outages. Ever.
Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.
It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.
Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.
Yes, you have automated failover systems. But, if something happens which causes those systems to fail over, you need to immediately investigate what happened and why. Even at four nines you have automatic failover, redundant systems, hot spares, etc. But, you accept that sometimes not everything will work as planned and you’ll need to fix something. Five nines is just that and more.
It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.
Right, which is why I said that four nines is your realistic maximum if you’re going to have people on call who aren’t actually at their desks. To get better than four nines you need to have around the clock coverage with people at their desks so when a system breaks you have eyes on it in something like 30s.
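To put rough numbers on the on-call versus at-desk argument, here is a small sketch that divides the yearly downtime budget by the time a single incident costs. The per-incident durations (10 minutes for an on-call response, 2.5 minutes with someone already at a desk) are illustrative assumptions, not measurements.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year length (assumption)


def incidents_per_year(nines: int, minutes_per_incident: float) -> float:
    """How many incidents of a given length fit in the yearly downtime budget."""
    budget_minutes = MINUTES_PER_YEAR * 10.0 ** -nines
    return budget_minutes / minutes_per_incident


if __name__ == "__main__":
    # On call: assume ~5 minutes to start working plus ~5 minutes to fix.
    print(incidents_per_year(nines=4, minutes_per_incident=10))
    print(incidents_per_year(nines=5, minutes_per_incident=10))
    # At a desk: assume ~30 seconds to eyes-on plus ~2 minutes to fix.
    print(incidents_per_year(nines=5, minutes_per_incident=2.5))
```

Under those assumptions, four nines tolerates about five on-call incidents a year (one every couple of months), while at five nines a single 10-minute on-call incident already uses nearly twice the yearly budget.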
It’s not impossible. Large reliable websites do it all the time. It’s called 100% uptime.
No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.
Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.
Sure, it’s measured per year, and sometimes they have some outage that breaks the record. But, it is possible to have 100% uptime throughout the year.
If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.
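As a quick sanity check of that point, here is a sketch that tests one year's total downtime against a yearly availability target. The function names and the 365.25-day year are assumptions for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year length (assumption)


def achieved_availability(downtime_minutes: float) -> float:
    """Fraction of the year the service was up."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


def meets_yearly_slo(downtime_minutes: float, nines: int) -> bool:
    """True if total downtime for the year fits inside the budget for `nines` nines."""
    budget_minutes = MINUTES_PER_YEAR * 10.0 ** -nines
    return downtime_minutes <= budget_minutes


if __name__ == "__main__":
    outage = 60  # a single one-hour outage, nothing else all year
    print(f"availability: {achieved_availability(outage):.4%}")
    print("meets three nines:", meets_yearly_slo(outage, 3))
    print("meets four nines:", meets_yearly_slo(outage, 4))
```

A single 60-minute outage in an otherwise perfect year lands at roughly 99.989%, which clears three nines but misses four.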
No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.
Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.
In 2022. In the other years, they had 100% uptime.
Also, yes, there are plenty of clients that ask for five-nines. Is it realistic? Probably not. But, they definitely ask.
If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.
I understand how SLO targets work. If somebody is asking for five nines as an SLO, they are basically asking for 100% uptime, because there is no such thing as a “five minute outage”, especially not one that is fixable without total automation.
Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.
Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem
There’s no detecting and fixing something that fast. When you’re talking about less than 5 minutes of outage time a year, it basically means you can’t have outages. Which is possible for some, but only for large reliable websites that have the resources to pull that off, and they still don’t always make the mark.
I’m not sure why that simple premise is disagreeable with the OP.
No, that’s infinite nines, which isn’t possible.
Dude, why do you keep referencing that people won’t get out of bed in time, when that’s exactly what the OC originally said XD
This whole thread started with:
There’s no detecting and fixing something that fast. When you’re talking about less than 5 minutes of outage time a year, it basically means you can’t have outages. Which is possible for some, but only for large reliable websites that have the resources to pull that off, and they still don’t always make the mark.
I’m not sure why that simple premise is disagreeable with the OP.
Why don’t you go back and actually read what I wrote?
You clearly have no idea what you’re talking about.
I’m used to environments where they expect five nines, get 3 (maybe 4) nines, and fund for 1 nine.