Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch

24 11 2009

“Every time we’re down for an hour, that’s about 2,500 people inconvenienced,” Smit said. “They’re blaming my people for it and [state IT officials] have an obligation to fix it.”

Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch.

One of the things we’ve been grappling with lately is some unfortunate unplanned outages of services.  You know what those are … random event blips caused by butterflies flapping their wings in the South Pacific that stir up turbulence which creates a small wind, that then turns into a hurricane, which rampages over a submarine cable used by the crucial bit of networking that connects you with the rest of the Internet civilization.

A blip.

Sometimes they’re momentary, sometimes they’re bad.  What all blips have in common is that they affect a class of your customers in a way that inconveniences them.  The hard part of dealing with an outage is understanding and quantifying what the business impact really is.  When a database server goes out, you implicitly understand that it potentially affects all database users plus all services and users downstream that depend upon the database being up.  So how do you realistically quantify that into a valuable metric?

I bring this up because, as a person in the trenches, I’m able to better understand the impact of something (and therefore, provide a better mitigation plan) if I can understand the size, length, and number of ripples in the fabric that spread out from the blip.

At large companies, this impact may be described as thousands of dollars per minute of cost charged against the bottom line.  Some places, like Virginia, point to the number of people per hour that an outage prevented from successfully interacting with the DMV.  Websites may see it as the number of advertising impressions that don’t go out due to the site being unavailable.

Whatever metric is used, it needs to be understandable and of an order of magnitude that someone can comprehend.  I understand impacting 2,500 people per hour of downtime.  I understand costing a company $1 million per minute that the factory is unable to reach its control network.  I understand an outage costing an engineering team a day’s worth of work (which can ultimately affect the bottom line due to downstream slippage in timelines).  What that metric comes down to is being able to understand, in measurable terms, how the blip impacts either people or money.
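
To make the arithmetic concrete, here is a minimal sketch in Python of turning an outage's duration into a people or money figure.  The `Outage` class, the function names, and the "dmv-network" label are all hypothetical; the only numbers taken from above are the 2,500 people per hour and the $1 million per minute, and a real metric would need rates measured in your own environment.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One unplanned outage (hypothetical record, for illustration only)."""
    service: str
    minutes: float


def people_affected(outage: Outage, people_per_hour: float) -> float:
    """Estimate how many people were inconvenienced, given an hourly rate."""
    return outage.minutes / 60 * people_per_hour


def dollars_lost(outage: Outage, dollars_per_minute: float) -> float:
    """Estimate the cost of the outage, given a per-minute rate."""
    return outage.minutes * dollars_per_minute


# A 90-minute outage at the 2,500-people-per-hour rate quoted above.
dmv = Outage(service="dmv-network", minutes=90)
print(people_affected(dmv, people_per_hour=2500))

# A 2-minute loss of the factory control network at $1 million per minute.
factory = Outage(service="control-network", minutes=2)
print(dollars_lost(factory, dollars_per_minute=1_000_000))
```

This assumes the impact scales linearly with duration, which is a simplification: a five-minute blip and a five-hour outage rarely hurt in strict proportion.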

It’s important to understand these things.  Why?  Because it allows you to more adequately assess your risk of the (unplanned) outage and design your environment appropriately.  If you can point to a solid metric and show how it materially affects people or money, it’s certainly a lot easier to go to management and provide justification for improvements in your environment.  If you can only say, with vague hand waving, that there’s AN effect but no data to back that up, you’re just waffling.

So.  Have you created your appropriately detailed outage impact metrics?

I haven’t.  But I’m working on it.




Four Short Links, Oct 14, 2009

14 10 2009
  • Larry Ellison hates cloud computing – funny clip of Ellison lambasting the idea of clouds. Yes, really, clouds have been around for over a decade, we just didn’t know it (or realize it).
  • Dynamic general and slow query log before MySQL 5.1 – This is an interesting way of handling the slow and general query logs on pre-5.1 MySQL instances. We don’t need this for the slow log, but there have been occasions when we’ve needed the general query log, and enabling or disabling it requires a full restart of the service on 5.0 and earlier. You still take a performance hit because you’re always logging, but I would expect it to be fairly small on modern fast hardware.
  • Watch out for your CRON jobs – Over at the MySQL Performance Blog, Peter Zaitsev gives some good guidelines on things to pay attention to when designing your cron jobs. Not just for databases. I like the idea of keeping historical run time information so you can see when large jumps in run time occur (which could be a problem).
  • How Did Danger Not Backup Its Servers? How Did Microsoft Allow Such A Failure? – Oy. A few days late on this one, but really? Total data loss from an upgrade. Scary. This is a reminder: we all test our backups, but how many of us test our restores?
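
On the cron item above, keeping run time history can be sketched in a few lines of Python.  This is my own illustration, not something taken from Peter's post: the function names, the 2x-the-median threshold, and the five-run minimum are all assumptions, and where you store the history (a flat file, a database table) is left out.

```python
import statistics
import subprocess
import time


def timed_run(command: list[str]) -> float:
    """Run a job and return how long it took, in seconds."""
    start = time.monotonic()
    subprocess.run(command, check=True)
    return time.monotonic() - start


def is_runtime_jump(history: list[float], latest: float,
                    factor: float = 2.0, min_runs: int = 5) -> bool:
    """Flag the latest run if it took more than `factor` times the median
    of earlier runs.  The threshold and minimum history are assumptions."""
    if len(history) < min_runs:
        return False  # not enough history to judge yet
    return latest > factor * statistics.median(history)


# Past run times in seconds (made-up numbers), then a suspiciously slow run.
past = [10.0, 11.0, 9.0, 10.0, 12.0]
print(is_runtime_jump(past, latest=30.0))
```

The median is used rather than the mean so that one earlier slow run doesn't drag the baseline up.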



    The Duct Tape Programmer – Joel on Software

    24 09 2009

    Jamie Zawinski is what I would call a duct-tape programmer. And I say that with a great deal of respect. He is the kind of programmer who is hard at work building the future, and making useful things so that people can do stuff. He is the guy you want on your team building go-carts, because he has two favorite tools: duct tape and WD-40. And he will wield them elegantly even as your go-cart is careening down the hill at a mile a minute.

    [...]

    And the duct-tape programmer is not afraid to say, “multiple inheritance sucks. Stop it. Just stop.”

    You see, everybody else is too afraid of looking stupid because they just can’t keep enough facts in their head at once to make multiple inheritance, or templates, or COM, or multithreading, or any of that stuff work. So they sheepishly go along with whatever faddish programming craziness has come down from the architecture astronauts who speak at conferences and write books and articles and are so much smarter than us that they don’t realize that the stuff that they’re promoting is too hard for us.

    The Duct Tape Programmer – Joel on Software.

    Oy.  How closely this parallels the sysadmin world!  It’s interesting to see unique and cool frameworks come together in these amazing bread-slicing-new-wheel-building-ultra-rad panaceas that get things done.  Assuming you have time to complete them.  One of the things I’ve had to force myself to learn is what I’ve termed “rational perfectionism in technology”.  Otherwise known as “Is it good enough?  Ship it!”

    It’s a difficult thing to accept, I know.  As a geek, sometimes it just kills me to let something go out that isn’t up to the uber-high standards I generally have.  But, sometimes you just have to.  You’ve got a job to do and the customer is waiting on you to do it.  Very often we will sit on something trying to achieve this golden-age of usefulness in a piece of software instead of taking a step back and trying to figure out if what we have now will work for the customer.  It’s the technological equivalent to gilding the lily.

    But many times we are faced with a task that cannot or should not be delayed if at all possible.  Many times, it’s perfectly acceptable to put something out there that hits only 80% of the need so the customer can start doing the job they need to do.  From there, you can iteratively improve as necessary.  And, more often than not, I’ve found that the customer is perfectly happy with that 80% solution and may not even notice the warts that you’ve laboriously fretted over.

    An artist is never satisfied with his work.  An art lover very often is.

    Of course, we now approach this with the devil’s advocate voice in mind.  I’m not saying half-ass the job.  We should always strive to do our best when trying to give the customer what they want (or need).  Just be balanced in it.  It’s a constant juggle for me (which makes it kind of fun) to understand just where that technological center of gravity is and hover around it for as long as I can.

    Just remember:  just as you don’t knock the duct tape if it gets the job done, don’t knock the 80% mark if it makes your customer happy.  After all, that’s why we’re here … to do what we can to help the customer.

    -Update-

    Joel’s post references the Worse-is-better concept. I thought this quote was rather interesting (and man do I agree with it …)

    The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing.