Losing Big, Which is Really the First Step to Winning Big, if you think about it…

Today we had a huge failure! Yeah! A major campus-wide project didn’t succeed and our services didn’t come back online as expected! Whoopee! Although we were largely back in business by 8 a.m., one service didn’t go live until 9:30 and only about 1/3 of the improvements we’d hoped to deploy went live. Everyone here is very sleep-deprived and unhappy. Here are my takeaways for the moment:

  1. A 5-hour downtime (3 a.m.-8 a.m.) in the middle of the week is too long. We added buffer time, but not enough. We should have broken down the steps into more discrete projects with shorter outages (my preference) or scheduled the work for the weekend.
  2. If your test environment doesn’t actually reflect your production environment, then it’s not an adequate test environment. Right?
  3. Before embarking on a major outage, take care of the “little stuff” (see below).
  4. Do not underestimate the disruption an expired certificate can cause.
  5. I need to be more hands-on with major campus projects. I don’t like to micromanage but sometimes it’s not micromanaging, it’s Taking Care of Business.
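
One piece of “little stuff” that can be checked before an outage window is certificate expiry. Below is a minimal Python sketch of such a check, using only the standard library. It is an illustration under assumptions, not the tooling used in this project: the function names, the 30-day warning threshold, and any hostnames you pass in are hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Feb 27 12:00:00 2009 GMT". A negative result means the
    certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_hosts(hosts, warn_days=30):
    """Return (host, message) pairs for certificates that need attention.

    The 30-day default is an arbitrary threshold chosen for illustration.
    """
    problems = []
    for host in hosts:
        try:
            remaining = days_until_expiry(fetch_not_after(host))
        except (OSError, ValueError) as exc:
            # An already-expired or otherwise invalid certificate fails the
            # TLS handshake here, which is itself worth flagging.
            problems.append((host, f"check failed: {exc}"))
            continue
        if remaining < warn_days:
            problems.append((host, f"expires in {remaining} days"))
    return problems
```

Calling `check_hosts` with a list of service hostnames a week or two before the maintenance window would return the ones whose certificates are close to expiring or cannot be verified at all, so they can be renewed before the outage rather than during it.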

Time to lick the stress ball…

5 Responses to “Losing Big, Which is Really the First Step to Winning Big, if you think about it…”

  1. kdghty said:

    Jan 28, 09 at 3:19 pm

    http://davemarvin.com/tcb.jpg

  2. Crazycatlady said:

    Jan 28, 09 at 7:31 pm

    We used to have regular 3am-10am (sometimes midnight-10am) code rolls/upgrade windows at Motricity, always on weekdays because the cell carriers had their heaviest traffic during evenings and weekends. No one is at their best after getting a few hours of sleep and waking up at 2:30am to drag their butts into work at 3am. Give me a nice long weekend outage window any day, then go out for a big ole brunch afterwards. 🙂

  3. slack said:

    Jan 29, 09 at 9:19 am

    I just got finished watching the new Cyrus IMAP guy flip on automated account creates. Not knowing how it works, what it does, or…what did you say? Test? What’s that? Obviously, all the account creates on the new busted failed.

    We’ve lost big…now…how is it we turn it around? 😉

  4. admin said:

    Jan 30, 09 at 5:26 pm

    Kdghty–How did you find that?

    @slack: yes, how do we turn it around? My boss said to me, “it’s important to succeed so people have confidence in you.” I replied, “I think it’s important to succeed so we can continue to do our jobs…”

    There is no substitute for Doing Things Right. In our case, we had some glitches and learned a few lessons. But the project plan as a whole wasn’t fundamentally broken. The IMAP example, now that just sounds Bad. I mean, like everyone should know better there.

  5. etselec said:

    Jan 31, 09 at 11:28 am

    Amen to #2. Wish there was anything at all I could do about that for my group, but in this economy…

