• merc@sh.itjust.works
    English
    arrow-up
    44
    ·
    1 day ago

    I’ve worked on services with 5 nines of availability (i.e. 99.999% available, a little over 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.

    Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.

    A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
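
    For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes a 365-day year and treats "N nines" as an allowed unavailability of 10^-N; the function names and the "6 incidents per year" figure (one reading of "once every couple of months") are only illustrative.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumption: ignores leap years


def unavailability(nines: int) -> float:
    """Fraction of time a service with this many nines is allowed to be down."""
    return 10.0 ** -nines


def downtime_budget_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime, in minutes, over the given period."""
    return period_minutes * unavailability(nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        per_year = downtime_budget_minutes(n)
        per_day = downtime_budget_minutes(n, MINUTES_PER_DAY)
        print(f"{n} nines: {per_year:9.2f} min/year, {per_day:7.3f} min/day")

    # Hypothetical: one incident every two months at four nines.
    incidents_per_year = 6
    per_incident = downtime_budget_minutes(4) / incidents_per_year
    print(f"four nines, {incidents_per_year} incidents/year: {per_incident:.1f} min each")
```

    With those assumptions, two nines works out to roughly 14.4 minutes a day, four nines to about 52.6 minutes a year, and five nines to about 5.3 minutes a year.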

    • HrabiaVulpes@europe.pub
      English
      arrow-up
      4
      ·
      18 hours ago

      I call bullshit on “Gamers wouldn’t put up with a two-nines system for a video game”

      Elder Scrolls Online has a weekly scheduled outage of about 8 hours, every Monday. Players have been complaining about it for years, but the game is still popular.

      • merc@sh.itjust.works
        English
        arrow-up
        1
        ·
        14 hours ago

        How often is it offline outside the maintenance windows?

        Yeah, maintenance windows are annoying, but they don’t really count when describing the availability of a system. Many government systems are only available during normal business hours. That means they’re offline for 16 hours per day. What matters is how available they are when they’re supposed to be working.

        For Elder Scrolls, two nines would mean that the game was allowed to be down for more than an hour a week outside of those maintenance windows. Or, if measured by quarters, which is more typical, the game would still have those maintenance windows, but, in addition, it might be down for a full day once per quarter.
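
        A rough sketch of that calculation, assuming availability is measured only against the hours the game is supposed to be up, and assuming a quarter is 13 weeks. The function name and parameters are made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24
WEEKS_PER_QUARTER = 13  # assumption: a quarter is treated as 13 weeks


def allowed_unplanned_downtime_hours(
    target: float, period_hours: float, maintenance_hours: float = 0.0
) -> float:
    """Unplanned downtime allowed when scheduled maintenance does not count.

    Only the hours outside the maintenance windows are measured against
    the availability target.
    """
    scheduled_service_hours = period_hours - maintenance_hours
    return scheduled_service_hours * (1.0 - target)


if __name__ == "__main__":
    weekly = allowed_unplanned_downtime_hours(0.99, HOURS_PER_WEEK, 8)
    quarterly = allowed_unplanned_downtime_hours(
        0.99, WEEKS_PER_QUARTER * HOURS_PER_WEEK, WEEKS_PER_QUARTER * 8
    )
    print(f"two nines: {weekly:.1f} h/week or {quarterly:.1f} h/quarter unplanned")
```

        Under those assumptions two nines allows about 1.6 hours of unplanned downtime a week, or a bit under 21 hours in one go per quarter, on top of the Monday windows.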

        Basically, the 8-hour window every Monday is a trade-off so that the rest of the week is uninterrupted. They probably manage three nines the rest of the week by shifting any serious maintenance into the weekly downtime.

        And, as for the game being “still popular”, one site says that there are currently 7199 players in Elder Scrolls but more than 161k in World of Warcraft. It could be that part of the reason that World of Warcraft is more popular is that it doesn’t have 8 hour maintenance windows every week, but it does often have 2+ hour windows. The number of players who are willing to put up with 8 hour maintenance windows every week seems pretty small.

    • P03 Locke@lemmy.dbzer0.com
      English
      arrow-up
      13
      arrow-down
      4
      ·
      edit-2
      1 day ago

      Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem

      No, it means you don’t have outages. Ever.

      Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.

      It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.

      • merc@sh.itjust.works
        English
        arrow-up
        14
        arrow-down
        1
        ·
        1 day ago

        No, it means you don’t have outages. Ever.

        No, that’s infinite nines, which isn’t possible.

        Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.

        Yes, you have automated failover systems. But if something happens that causes those systems to fail over, you need to immediately investigate what happened and why. Even at four nines you have automatic failover, redundant systems, hot spares, etc. But you accept that sometimes not everything will work as planned and you’ll need to fix something. Five nines is just that and more.

        It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.

        Right, which is why I said that four nines is your realistic maximum if you’re going to have people on call who aren’t actually at their desks. To get better than four nines you need to have around the clock coverage with people at their desks so when a system breaks you have eyes on it in something like 30s.

        • P03 Locke@lemmy.dbzer0.com
          English
          arrow-up
          1
          arrow-down
          10
          ·
          1 day ago

          No, that’s infinite nines, which isn’t possible.

          It’s not impossible. Large reliable websites do it all the time. It’s called 100% uptime.

          Sure, it’s measured per year, and sometimes they have some outage that breaks the record. But, it is possible to have 100% uptime throughout the year.

          • merc@sh.itjust.works
            English
            arrow-up
            9
            arrow-down
            1
            ·
            1 day ago

            It’s not impossible. Large reliable websites do it all the time. It’s called 100% uptime.

            No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.

            Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.

            Sure, it’s measured per year, and sometimes they have some outage that breaks the record. But, it is possible to have 100% uptime throughout the year.

            If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.
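
            To put numbers on that, here is a small sketch assuming a 365-day measurement window and a single 60-minute outage. The helper names are made up for illustration and are not from any SLO tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignores leap years


def achieved_availability(
    downtime_minutes: float, period_minutes: float = MINUTES_PER_YEAR
) -> float:
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_minutes / period_minutes


def meets_slo(
    downtime_minutes: float, target: float, period_minutes: float = MINUTES_PER_YEAR
) -> bool:
    """True if the measured availability reaches the target for the period."""
    return achieved_availability(downtime_minutes, period_minutes) >= target


if __name__ == "__main__":
    outage = 60  # one hour of downtime, nothing else all year
    print(f"availability: {achieved_availability(outage):.6%}")
    print("meets four nines: ", meets_slo(outage, 0.9999))
    print("meets three nines:", meets_slo(outage, 0.999))
```

            One bad hour leaves you at roughly 99.9886% for the year: comfortably three nines, but short of four, no matter how clean the rest of the year was.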

            • P03 Locke@lemmy.dbzer0.com
              English
              arrow-up
              2
              arrow-down
              1
              ·
              1 day ago

              No, no website does it. There is no such thing as 100% uptime. If it happens, great, but I can guarantee you that no website even aims for 5 nines of uptime.

              Google is the benchmark for website availability and in 2022 they had an outage that lasted an hour, meaning they didn’t meet 4 nines for the year.

              In 2022. In the other years, they had 100% uptime.

              Also, yes, there are plenty of clients that ask for five-nines. Is it realistic? Probably not. But, they definitely ask.

              If you miss your SLO target for the year, then you missed your SLO target. If you’re down for 60 minutes but fine for the other 11 months, 29 days and 23 hours, you still missed your yearly SLO.

              I understand how SLO targets work. If somebody is asking for five nines as an SLO, they are basically asking for 100% uptime, because there is no such thing as a “five minute outage”, especially not one that is fixable without total automation.

              Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.

              • YeahToast@aussie.zone
                English
                arrow-up
                5
                ·
                18 hours ago

                Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.

                Dude, why do you keep referencing that people won’t get out of bed in time, when that’s exactly what the OC originally said XD

                • P03 Locke@lemmy.dbzer0.com
                  English
                  arrow-up
                  1
                  arrow-down
                  2
                  ·
                  17 hours ago

                  This whole thread started with:

                  Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem

                  There’s no detecting and fixing something that fast. When you’re talking about less than 5 minutes of outage time a year, it basically means you can’t have outages. Which is possible for some, but only for large reliable websites that have the resources to pull that off, and they still don’t always make the mark.

                  I’m not sure why the OP disagrees with that simple premise.

              • merc@sh.itjust.works
                English
                arrow-up
                1
                ·
                15 hours ago

                Again, a human hasn’t even gotten paged and out of bed in 5 minutes’ time.

                Why don’t you go back and actually read what I wrote?

                You clearly have no idea what you’re talking about.

    • Waraugh@lemmy.dbzer0.com
      English
      arrow-up
      5
      ·
      edit-2
      1 day ago

      I’m used to environments where they expect five nines, get 3 (maybe 4) nines, and fund for 1 nine.