The “Enterprise” …

21 01 2010

From a discussion with a few peers in the industry.  I was entertained.

peer> Now when I hear someone use the word “enterprise”
   as an adjective, I have to ask them which of the four meanings
   they intend:
peer> 1.  defunct and destroyed (the Enterprise aircraft
   carrier from WW2)
peer> 2.  ancient and nearly dead (the Enterprise nuclear
    aircraft carrier)
peer> 3.  a nonfunctional mockup (the Enterprise space shuttle)
peer> 4.  imaginary (the starship Enterprise)
peer> Typically “enterprise software” fits perfectly in one of
   those four categories.

Name withheld to, of course, protect the guilty.




Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch

24 11 2009

“Every time we’re down for an hour, that’s about 2,500 people inconvenienced,” Smit said. “They’re blaming my people for it and [state IT officials] have an obligation to fix it.”

Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch.

One of the things we’ve been grappling with lately is some unfortunate unplanned outages of services.  You know what those are … random event blips caused by butterflies flapping their wings in the South Pacific that stir up turbulence which creates a small wind, that then turns into a hurricane, which rampages over a submarine cable used by the crucial bit of networking that connects you with the rest of the Internet civilization.

A blip.

Sometimes they’re momentary, sometimes they’re bad.  What all blips have in common is that they inconvenience a class of your customers in some manner.  The hard part of dealing with an outage is understanding and quantifying what the business impact really is.  When a database server goes out, you implicitly understand that it potentially affects all database users plus all services and users downstream that depend upon the database being up.  So how do you realistically quantify that into a valuable metric?

I bring this up because, as a person in the trenches, I’m able to better understand the impact of something (and therefore, provide a better mitigation plan) if I can understand the size, length, and number of ripples in the fabric that spread out from the blip.

At large companies, this impact may be described as thousands of dollars per minute of cost charged against the bottom line.  Some places, like Virginia, count the number of people per hour that an outage prevented from successfully interacting with the DMV.  Websites may see it as the number of advertising impressions that don’t go out due to the site being unavailable.

Whatever metric is used, it needs to be understandable and of an order of magnitude that someone can comprehend.  I understand impacting 2,500 people per hour of downtime.  I understand costing a company $1 million per minute that the factory is unable to reach its control network. I understand an outage costing an engineering team a day’s worth of work (which can ultimately affect the bottom line due to downstream slippage in timelines). What that metric comes down to is being able to understand, in measurable terms, how the blip impacts either people or money.

It’s important to understand these things.  Why?  Because it allows you to more adequately assess your risk of the (unplanned) outage and design your environment appropriately.  If you can point to a solid metric and show how it materially affects people or money, it’s certainly a lot easier to go to management and provide justification for improvements in your environment.  If you can only say, with vague hand waving, that there’s AN effect but no data to back that up, you’re just waffling.
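As a minimal sketch of the arithmetic, here is a shell helper that scales a "per unit of time" rate to the length of an outage. The function name `outage_impact` and its argument order are made up for illustration, and the outage lengths are invented; the rates are the ones quoted above (2,500 people per hour, $1 million per minute).

```shell
#!/bin/sh
# outage_impact RATE UNIT_SECONDS OUTAGE_SECONDS
# Scale a "cost per unit of time" rate to the length of an outage.
#   RATE           - people (or dollars) affected per unit of time
#   UNIT_SECONDS   - length of that unit in seconds (3600 for "per hour")
#   OUTAGE_SECONDS - how long the blip lasted
# (Hypothetical helper, not from any existing tool.)
outage_impact() {
    awk -v rate="$1" -v unit="$2" -v secs="$3" \
        'BEGIN { printf "%d\n", rate * secs / unit }'
}

# A 90-minute DMV outage at 2,500 people per hour:
outage_impact 2500 3600 5400      # prints 3750
# A 12-minute factory outage at $1 million per minute:
outage_impact 1000000 60 720      # prints 12000000
```

The point is less the math than the habit: once the rate is written down, every outage report can carry a number that management can compare against the cost of fixing the underlying problem.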

So.  Have you created your appropriately detailed outage impact metrics?

I haven’t.  But I’m working on it.




Never change anything on Friday at 5pm.

20 11 2009
SEVERE: Socket accept failed
java.net.SocketException: SSL handshake error
javax.net.ssl.SSLException: No available certificate or key corresponds to the
SSL cipher suites which are
enabled.
        at
org.apache.tomcat.util.net.jsse.JSSESocketFactory.acceptSocket(JSSESocketFactory.java:150)
        at
org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:310)
        at java.lang.Thread.run(Unknown Source)

This is the reason you never change anything on Friday at 5pm. While attempting to update the SSL certificate for the MySQL Enterprise Monitor (for which the process has no documentation), I managed to break it in a way that caused a few hundred megs of these errors to dump to the catalina log for MEM. Oh, and it meant no monitoring was taking place for a few minutes.

Sigh.

Lesson re-learned. Glad I made a backup of the keystore before I started mucking with it. Now we wait for MySQL to provide me with the correct documentation (after they write it some time this weekend). You would think someone would have already encountered this with their product considering how long it’s been out there already.
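For what it's worth, the "back it up first" step can be sketched as a tiny shell helper. The function name `backup_file` and the keystore path in the usage comment are placeholders of mine, not anything from the MEM documentation.

```shell
#!/bin/sh
# backup_file PATH
# Copy PATH to PATH.bak.<timestamp>, verify the copy is byte-identical,
# and print the name of the backup. (Hypothetical helper.)
backup_file() {
    src=$1
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$src" "$dest" || return 1
    cmp -s "$src" "$dest" || return 1
    printf '%s\n' "$dest"
}

# Usage (the path is a placeholder, not MEM's real layout):
#   saved=$(backup_file /path/to/keystore)
#   ... change the certificate ...
#   cp -p "$saved" /path/to/keystore    # roll back if it goes sideways
```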

At least we have monitoring back.




Resizing a multipathed SAN LUN under RHEL5

12 11 2009

So today I found myself staring at the Red Hat LVM documentation wherein I began to silently cuss.  You see, I needed to double the LUN size for MySQL’s usage on one of my servers.  These servers are our first that are directly attached into the SAN and running Linux, so we’re working a bit off the map here.  To make matters more difficult, we’re using multipathd to manage the connections between the system and the SAN, so it wasn’t very clear what exactly I should do in this case.  The LVM docs are somewhat … lacking.

Thanks to the wonder of Google and sheer dumb luck, I ran across this post on one of the RHEL5 mailing lists.  In that case, the author wasn’t sure if it was the right method.  But, I had a secondary system and was willing to run with scissors for a moment since we’re not fully in production yet.

In short, the basic flow of operations is:

  1. Figure out the multipath I/O device name.
  2. Figure out the underlying device IDs (or device names).
  3. Issue the resize of the LUN in your SAN.
  4. Tell the kernel to rescan the underlying device IDs so it sees the new LUN size.
  5. Tell multipathd that a resize has occurred.
  6. Issue a pvresize so LVM knows it has more extents to work with now.
  7. Issue an lvresize to increase the logical volume size.
  8. Run resize2fs and do an online resize of the filesystem.
  9. Make popcorn.
  10. Watch a movie.

The first time I attempted the process (in a similar, but not quite identical, fashion), I caused the system to hang on all LVM commands.  I turned off multipathd thinking that it would need to be off while I did the resize.  This appears to not be a healthy way to do it because I ended up having to warm cycle the system.  This is the point where I stopped reading the LVM documentation and found the mailing list post.  Tried it out and it worked.

So, without further ado …

#  multipath -ll mpath0
mpath0 (360060160dac711004c6fa9d07c7cde11) dm-2 DGC,RAID 10
[size=50G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 1:0:1:0 sdc 8:32  [active][ready]
 \_ 2:0:1:0 sde 8:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:0 sdb 8:16  [active][ready]
 \_ 2:0:0:0 sdd 8:48  [active][ready]

#  for i in `multipath -ll mpath0 | grep sd | awk '{print $2}'`; do
> echo $i ; done
1:0:1:0
2:0:1:0
1:0:0:0
2:0:0:0

#  for i in `multipath -ll mpath0 | grep sd | awk '{print $3}'`; do
> blockdev --rereadpt /dev/$i ; done
BLKRRPART: Input/output error
BLKRRPART: Input/output error

#  multipathd -k"resize multipath mpath0"
ok

#  multipath -ll mpath0
mpath0 (360060160dac711004c6fa9d07c7cde11) dm-2 DGC,RAID 10
[size=100G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][enabled]
 \_ 1:0:1:0 sdc 8:32  [active][ready]
 \_ 2:0:1:0 sde 8:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:0 sdb 8:16  [active][ready]
 \_ 2:0:0:0 sdd 8:48  [active][ready]

#  pvresize /dev/mapper/mpath0
  Physical volume "/dev/mpath/mpath0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized

#  lvresize -L 50G /dev/VolGroupMySQL/mysql-san 
  Extending logical volume mysql-san to 50.00 GB
  Logical volume mysql-san successfully resized

#  resize2fs /dev/VolGroupMySQL/mysql-san 
resize2fs 1.39 (29-May-2006)
Filesystem at /dev/VolGroupMySQL/mysql-san is mounted on /var/lib/mysql; on-line resizing required
Performing an on-line resize of /dev/VolGroupMySQL/mysql-san to 13107200 (4k) blocks.
The filesystem on /dev/VolGroupMySQL/mysql-san is now 13107200 blocks long.

#  df
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      9.7G  3.3G  6.0G  36% /
/dev/sda1             122M   13M  103M  11% /boot
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/VolGroupMySQL-mysql--san
                       50G  795M   46G   2% /var/lib/mysql

One thing to note is the BLKRRPART errors from blockdev.  This appears to be “normal” as far as I can tell.  The kernel threw some log messages (included below), but they appear harmless as far as I can discern.  The SCSI notices occurred when I issued the blockdev command.  The device-mapper multipath warning came from the multipathd resize command.

SCSI device sdc: 209715200 512-byte hdwr sectors (107374 MB)
sdc: Write Protect is off
sdc: Mode Sense: 87 00 00 08
SCSI device sdc: drive cache: write through
sdc: detected capacity change from 53687091200 to 107374182400
 sdc: unknown partition table
SCSI device sde: 209715200 512-byte hdwr sectors (107374 MB)
sde: Write Protect is off
sde: Mode Sense: 87 00 00 08
SCSI device sde: drive cache: write through
sde: detected capacity change from 53687091200 to 107374182400
 sde: unknown partition table
SCSI device sdb: 209715200 512-byte hdwr sectors (107374 MB)
sdb: test WP failed, assume Write Enabled
sdb: asking for cache data failed
sdb: assuming drive cache: write through
sdb: detected capacity change from 53687091200 to 107374182400
 sdb:<6>sd 1:0:0:0: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdb, sector 0
printk: 68 messages suppressed.

Buffer I/O error on device sdb, logical block 0
 unable to read partition table
SCSI device sdd: 209715200 512-byte hdwr sectors (107374 MB)
sdd: test WP failed, assume Write Enabled
sdd: asking for cache data failed
sdd: assuming drive cache: write through
sdd: detected capacity change from 53687091200 to 107374182400
 sdd:<6>sd 2:0:0:0: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdd, sector 0
 unable to read partition table
device-mapper: multipath emc: long trespass command will be send
device-mapper: multipath emc: honor reservation bit will not be set (default)
device-mapper: multipath: Using dm hw handler module emc for failover/failback and device management.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
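
The host-side part of the walkthrough above (rescan each path, tell multipathd, pvresize) can be sketched as a dry-run helper that only prints the commands so you can eyeball them first. The function names `path_devices` and `plan_resize` are my own, and the parsing assumes the RHEL5-era `multipath -ll` layout shown above; other multipath-tools versions format their output differently. The lvresize and resize2fs steps are left out because the sizes are site-specific.

```shell
#!/bin/sh
# path_devices: read `multipath -ll <map>` output on stdin and print the
# sd* device name of each path, one per line. Assumes path lines look like:
#    \_ 1:0:1:0 sdc 8:32  [active][ready]
# (Hypothetical helper; the third field is taken as the device name.)
path_devices() {
    awk '$3 ~ /^sd[a-z]+$/ { print $3 }'
}

# plan_resize MAP: given `multipath -ll MAP` output on stdin, print
# (but do not run) the rescan, multipathd, and pvresize commands.
plan_resize() {
    map=$1
    for dev in $(path_devices); do
        echo "blockdev --rereadpt /dev/$dev"
    done
    echo "multipathd -k\"resize multipath $map\""
    echo "pvresize /dev/mapper/$map"
}

# Usage: multipath -ll mpath0 | plan_resize mpath0
```

Printing instead of executing is deliberate: given how easily I hung LVM the first time around, I would rather read the plan before running it.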



audun.ytterdal.net » Tshark to the rescue (MySQL snooping)

11 11 2009

And while I was at it. I figured out I could do something similar with mysql queries. Instead of turning on full Query-logging in mysql (which probably means a restart of a running production mysql) I could just sniff it

tshark -i eth0 -aduration:60 -d tcp.port==3306,mysql -T fields \
   -e mysql.query 'port 3306'

audun.ytterdal.net » Tshark to the rescue

I’ve been using wireshark for a while now.  I ran across this the other day while trying to debug a non-MySQL issue.  It was a good reminder of wireshark’s ability to do live dissection of protocols from the command line.  I’ve had to do this before while trying to debug NFS issues (specifically, trying to work backwards from a packet to figure out the name of an open file descriptor from a file handle and fsid).  Wireshark’s ability to do this on the fly and from the command line is pretty powerful.

Right now I have a need to figure out why MySQL is showing a bunch of aborted connections.  I was hoping that mysql.error would give me something intelligent to poke after, but I think this occurs all within the MySQL instance.  More digging to do.  It’s good to know that I can do some impromptu poking around in the MySQL packets without firing up a full wireshark session on the server if I need to.
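
One way to make the sniffed output more digestible is to count which statements show up most often. This is a small sketch of my own, not part of the quoted post: the function name `top_queries` is made up, and it assumes the capture emits one statement per line, with packets that carry no query showing up as empty lines.

```shell
#!/bin/sh
# top_queries [N]: read one SQL statement per line on stdin, drop blank
# lines, and print the N most frequent statements with their counts.
# (Hypothetical helper; N defaults to 10.)
top_queries() {
    n=${1:-10}
    grep -v '^[[:space:]]*$' | sort | uniq -c | sort -rn | head -n "$n"
}

# Usage, reusing the capture command quoted above:
#   tshark -i eth0 -aduration:60 -d tcp.port==3306,mysql -T fields \
#      -e mysql.query 'port 3306' | top_queries 5
```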

If you have a need to do this, these two things will be helpful.

Happy sniffing!




Four short links: Tuesday, Nov 3rd 2009

3 11 2009



Seth’s Blog: Ms. In-between

3 11 2009

If the only reason you’re only wearing one hat is because you’ve always only worn one hat, that’s not a good reason.

Seth’s Blog: Ms. In-between.

Tradition for tradition’s sake is never a good reason to continue doing something in a technical environment.  The landscape is ever changing.  You should always be re-evaluating what you’re doing while keeping an eye on the bleeding edge of your profession. It’s the difference between being relevant and being the (only) caretaker of a legacy system.

When was the last time you challenged a tradition?




Your Cloud Needs a Sys Admin – O’Reilly Broadcast

19 10 2009

The programmer-managed infrastructure suffers from a death by a thousand cuts. The programmer is competent with technology and fully capable of setting up a system that can support the application being built. The programmer, however, lacks a detailed understanding of ongoing infrastructure management. Consequently, the programmer-managed infrastructure ultimately leads to an environment incapable of adjusting to changing demands and potentially opens vulnerabilities to hackers through discreet channels.

The reverse is true of the sys admins who fancy themselves programmers. They can craft Perl programs to do just about any task. Those programs, however, ultimately lack the solid architecture that programming skills provide.

Your Cloud Needs a Sys Admin – O’Reilly Broadcast.

Fair warning:  I’m a sysadmin so my opinion might be slightly biased on this idea.

Yes, you do need a sysadmin.  Just like the blog post suggests.  Like everything in business, you want to use the right tool for the right job.  Sure, you can use a wrench to hammer a nail in, but in the process you’re likely going to smash a finger, chip the wood, and spend excessive time frustrating yourself when pounding that nail into place doesn’t go fast enough to meet your nail-hammerin’ schedule.

Sysadmins are hammers (the right tool) when it comes to nails (managing systems and infrastructures).  A good sysadmin, as the blog post suggests, has the broad knowledge needed to run a system (or set of systems) effectively and safely (read:  securely).  More often than not, we have the experience needed to tell you exactly when you can cut corners, when you shouldn’t, and why doing so may or may not be the right thing for your environment.

There’s an old AI koan that I’m reminded of.

A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing spoke sternly: “You can not fix a machine by just power-cycling it with no understanding of what is going wrong.” Knight turned the machine off and on. The machine worked.




Ubuntu Linux adds private cloud backing | Open Source – InfoWorld

14 10 2009

Ubuntu Linux adds private cloud backing

Canonical’s upcoming server upgrade supports the Eucalyptus project’s open source system for cloud implementation using hardware and software already in place

Canonical is touting private cloud capabilities in an upgrade to its Ubuntu Linux OS being announced on Tuesday.

Available for free download on October 29, Ubuntu 9.10 Server Edition introduces UEC (Ubuntu Enterprise Cloud), an open source cloud computing environment based on the same APIs as Amazon EC2 (Elastic Compute Cloud). Businesses can take advantage of private clouds, Canonical said.

Ubuntu Linux adds private cloud backing | Open Source – InfoWorld.

This should prove interesting.  If we were able to leverage something like this, we could build out a private cloud for researchers.  The Eucalyptus system certainly looks useful.  Especially if they’re touting it as API compatible with other external cloud vendors.  We’d certainly need to do some heavy investigation to figure out what running our own cloud would actually mean.  I can certainly see it as being completely different than running a classic high performance computing grid.

You, too, can have a cloud in the privacy of your own home!  Time to keep up with the Joneses again!




Four Short Links, Oct 14, 2009

14 10 2009
  • Larry Ellison hates cloud computing – funny clip of Ellison lambasting the idea of clouds. Yes, really, clouds have been around for over a decade, we just didn’t know it (or realize it).
  • Dynamic general and slow query log before MySQL 5.1 – This is an interesting way of handling the slow and general query logs on pre-5.1 MySQL instances. We don’t need this for the slow log, but there have been occasions when we’ve needed the general query log, and enabling or disabling it requires a full restart of the service on 5.0 and earlier. You still take a performance hit because you’re always logging, but I would expect it to be fairly small on modern fast hardware.
  • Watch out for your CRON jobs – Over at the MySQL Performance Blog, Peter Zaitsev gives some good guidelines on things to pay attention to when designing your cron jobs. Not just for databases. I like the idea of keeping historical run time information so you can see when large jumps in run time occur (which could be a problem).
  • How Did Danger Not Backup Its Servers? How Did Microsoft Allow Such A Failure? – Oy. A few days late on this one, but really? Total data loss from an upgrade. Scary. This is a reminder: we all test our backups, but how many of us test our restores?
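
On the cron item above: keeping historical run times does not need much. Here is a minimal sketch of a wrapper, under my own assumptions: the function name `timed_run`, the log format, and the paths in the example crontab line are all hypothetical, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# timed_run LOGFILE COMMAND [ARGS...]
# Run COMMAND, then append one line to LOGFILE:
#   <start epoch> <duration in seconds> <exit status> <command>
# and return the command's exit status. (Hypothetical helper.)
timed_run() {
    log=$1; shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    printf '%s %s %s %s\n' "$start" "$((end - start))" "$status" "$*" >> "$log"
    return "$status"
}

# Hypothetical crontab entry (script name and paths are placeholders):
#   15 2 * * * . /usr/local/lib/timed_run.sh && timed_run /var/log/cron-times.log /usr/local/bin/nightly-report
```

With one line per run, a sudden jump in the second column is easy to spot with sort or awk long before the job starts overlapping with its next scheduled run.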