Query partially completed on the master (error on master: 1053)

8 07 2009

So, I walked into the office this morning to the following lovely alert from our MySQL Enterprise Monitor installation (MEM) which, thankfully, was working again after I had to rebuild it last week.

Server: seawasp:3306
Time: 2009-07-08 01:33:28 GMT
Advisor: Replication – Slave Has Experienced A Replication Error

Problem Description
When a slave receives updates from its master it must apply those updates locally so the data on the slave matches that on the server. If an error occurs while applying an update on a slave, the data on the slave may not match that on the master and it is an indication that replication may be broken.

Advice
Investigate why the following error occurred on slave seawasp:3306. Look in the slaves error log for this and other errors around the time of this alert.

Last_Errno: 1053
Last_Error: Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: ALTER TABLE ‘foo_student’ ADD ‘full_name’ VARCHAR(200) NULL DEFAULT NULL AFTER ‘surname_mother’

That’s. Just. Awesome.

So, let’s start from the beginning. What is error 1053?  According to the manual this is ER_SERVER_SHUTDOWN.  Not very helpful, is that?  What this really means is that the something bad happened and the slave chose to stop talking to the master.  In this particular case, a client appeared to issue an ALTER TABLE statement on a table in the master.  This statement got replicated to the slave.  Before the master could complete the query, it was aborted.  The slave noted this and aborted also, thinking there was some form of corruption on the master.

So, what to do?

I went and poked at the table definitions on both the master and slave to figure out what had happened.  It appears that the master had two extra columns in this table now, one of which was the aborted fullname column from above.  Because it appeared to have completed on the master, I went ahead and re-ran it on the slave.  No problems there.

Next, we have to get replication running again.  MEM suggests that to do this, we have to skip the ALTER TABLE query that was already replicated to the slave.  The MySQL docs on Troubleshooting Replication have the note:

If you decide that you can skip the next statement from the master, issue the following statements:

mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N;
mysql> START SLAVE;

The value of N should be 1 if the next statement from the master does not use AUTO_INCREMENT or LAST_INSERT_ID(). Otherwise, the value should be 2. The reason for using a value of 2 for statements that use AUTO_INCREMENT or LAST_INSERT_ID() is that they take two events in the binary log of the master.

So, in my case, after issuing the ALTER TABLE, I felt it safe enough to skip that in the relay log and restart the slave.  This query wasn’t dealing with an auto incrementing value, so I chose 1 for the skip value.  That note above about the 1 vs 2 is a good thing to heed if you’re ever needing to do this.

As soon as I started the slave, the relay log began draining of all of last night’s updates.  It’s important to note that replication never really stopped in this case.  The master was successfully sending updates to the slave, as far as I can tell.  The slave just wasn’t applying those updates to it’s local copy because it felt there was some sort of corruption.

The last thing to answer is:  what caused the original ALTER TABLE to fail?  No idea at this point. I sent an email to the database owner to try and figure that out, but I haven’t heard anything back yet.




Solaris logins work!

8 06 2009

What a frustrating afternoon. I spent today working on the Solaris LDAP/Kerberos configuration for authenticating against AD. The short story is that I can now successfully login to my Solaris 10 zone using my windows credentials. Just like on the linux box. The long story (with details) will be posted later because Alex and I are still trying to work through an issue with pam_ldap on Solaris. Basically, Windows implements a function called VLV in it’s LDAP config that causes Solaris to choke whenever you attempt to enumerate a large set of directory entries. The “fix” involves disabling it on the domain controllers, but that breaks other things in the Windows environment.

Oh, and my frustration today? Solved with five minutes of work after I had banged on it for three hours. My zone is a test zone. And as a test zone, it never got a DNS entry created for it. Yep. Creating a hostname for it got everything working. What confused me is that I got everything to work for a single login about half way through the day but promptly broke it and had no idea why it worked.

Remember, DNS is an important part of a healthy diet. Who knew?




Whoops. Account lockouts baaaaaaad!

5 06 2009

So I found one downside to using this AD/LDAP configuration. Ok, not really a downside, just a really big caveat. The account used for binding to the LDAP server can get locked out if it authenticates too many times with the wrong password. Discovered this yesterday when I inadvertently changed the password in my configuration while doing some other testing of search options. When things started mysteriously failing soon after, I thought I’d broken my search configuration.

Whoops.

So what did we learn? Be very careful with your bind password. Because of how often we’re binding to the domain controller (and because the bind user is subject to AD policies), it’d be very easy to completely disable your entire authentication environment if you mess this up. Wondering if there’s an alternative way for us to bind to the domain controller, such as using a public/private key instead. New things to investigate.




Woot! Unix group enumeration from AD groups.

3 06 2009

Well, that was easy enough. Just needed to understand a bit more of the AD OU structure here. (Sanitized a bit for now).

-bash-3.2$ touch foo bar baz quux
-bash-3.2$ ls -l
total 0
-rw-r--r-- 1 hcoyote UNIXTEST-test 0 Jun  3 16:59 bar
-rw-r--r-- 1 hcoyote UNIXTEST-test 0 Jun  3 16:59 baz
-rw-r--r-- 1 hcoyote UNIXTEST-test 0 Jun  3 16:59 foo
-rw-r--r-- 1 hcoyote UNIXTEST-test 0 Jun  3 16:59 quux
-bash-3.2$ id
uid=66000(hcoyote) gid=66000(UNIXTEST-test) groups=66000(UNIXTEST-test)
-bash-3.2$ getent group UNIXTEST-test
UNIXTEST-test:*:66000:hcoyote,member2,member3
-bash-3.2$ getent group
root:x:0:root
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
sys:x:3:root,bin,adm
adm:x:4:root,adm,daemon
tty:x:5:
disk:x:6:root
lp:x:7:daemon,lp
.
.
.
stapdev:x:101:
stapusr:x:102:
avahi-autoipd:x:103:
UNIXTEST-test:*:66000:hcoyote,effie,csoto

UNIXTEST-test is the group name for gid 66000 in Active Directory. Everything listed before this group comes straight from the local group file because we’re using the appropriate configuration in nsswitch.conf.

This was solved by adding the following to the ldap.conf:

nss_base_group		ou=Departments,?sub?&(objectCategory=group)(gidNumber=*)

Also, you need to modify nsswitch.conf to be:

group: files ldap

One step closer. Next: account authorization via group membership. In other words, only let someone use a resource if they exist in a specific group. Need to figure out if this should be done via netgroup or unix group membership. Off to research!




Ha ha! SSL success for AD/LDAP.

3 06 2009

Ha ha! Further success on the Linux -> Active Directory integration front. I got SSL working for the underlying ldap bind user. What’s this mean? Protection of the directory information over the wire as it travels from the domain controller to the client host where it will be used.

So what’s the necessary setup bits?  There are three options that need to be added to the ldap.conf that I originally came up with.  They are:

ssl yes
tls_cacertfile /etc/ssl/certs/ca_bundle.crt
tls_checkpeer no
uri ldaps://austin.utexas.edu/ (this is a modification from the previous config)

The tls_cacertfile defines the location of the file that contains the certificate authority information used to create the SSL certificate on the Active Directory domain controllers. You need this to verify the authenticity of the dc’s. The file should be .pem formatted and must be converted from the file you retrieve from the internal certificate authority at UT.

Once you’ve downloaded the file, you get it in the DER format which needs to be converted using something like the following.

openssl x509 -in downloadedcert.cer -inform DER -out rootca.pem -outform PEM

Next, you copy the contents of the rootca.pem to the tls_cacertfile file.

Once you’ve configured the ldap.conf with the updated options, you should now be accessing LDAP over SSL. You’ll have to verify this by running something like wireshark and watching the tcp traffic going across the wire. It’ll look something like the stream in the image above.

If you see errors in /var/log/messages that look like the following, then you’ve got something wrong in your configuration still.

Jun  3 14:09:49 fedex getent: nss_ldap: failed to bind to LDAP server ldaps://austin.utexas.edu/: Can't contact LDAP server
Jun  3 14:09:49 fedex getent: nss_ldap: reconnecting to LDAP server (sleeping 4 seconds)...
Jun  3 14:09:53 fedex getent: nss_ldap: failed to bind to LDAP server ldaps://austin.utexas.edu/: Can't contact LDAP server
Jun  3 14:09:53 fedex getent: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...
Jun  3 14:10:01 fedex getent: nss_ldap: failed to bind to LDAP server ldaps://austin.utexas.edu/: Can't contact LDAP server
Jun  3 14:10:01 fedex getent: nss_ldap: reconnecting to LDAP server (sleeping 16 seconds)...

Now that we have SSL tackled, time to get group lookups working. I’ll leave that for another posting.




Authenticating to Austin AD from Linux

2 06 2009

Woot!  With the help of barthag, I got one of our linux boxes configured to provide passwd file map backend via AD/LDAP and authentication via AD/Kerberos.  Most of the problems stem from permissions issues on the AD side and making sure things are open “enough” to let us through to query for information.

On the Linux side (specifically, Red Hat Enterprise Linux 5 Update 2), there are four files you have to touch to make this work: /etc/krb5.conf, /etc/ldap.conf, /etc/nsswitch.conf, and /etc/pam.d/system-auth.

In our configuration, I’m using LDAP to provide the transport for the actual directory information lookups and Kerberos to manage the authenticating of users to the system.  It’s an easy enough configuration for this.  For testing, we’re not encrypting the LDAP connection because we haven’t yet figured out the correcting pathing and formats of the various certificate files necessary for nss_ldap to work correctly (yay underdocumented features!).  Plus, since my test box is close to the AD servers network-wise, I know the connection is secure enough.  Production usage of this config will very likely enforce the use of SSL since we’d be providing these configs for people outside of our local ITS network (but still on-campus).

Also, for this initial test, we’re only working with the passwd map.

So, without further adieu.

/etc/nsswitch.conf

passwd:  files ldap
shadow:  files ldap

/etc/krb5.conf

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = AUSTIN.UTEXAS.EDU
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 forwardable = yes

[realms]
 AUSTIN.UTEXAS.EDU = {
  kdc = austin.utexas.edu:88
  admin_server = austin.utexas.edu:749
  default_domain = austin.utexas.edu
 }

[domain_realm]
 .austin.utexas.edu = AUSTIN.UTEXAS.EDU
 austin.utexas.edu = AUSTIN.UTEXAS.EDU

[appdefaults]
 pam = {
   debug = false
   ticket_lifetime = 36000
   renew_lifetime = 36000
   forwardable = true
   krb4_convert = false
 }

/etc/ldap.conf

###
### ldap config for binding to AD with a read-only bind account over
### cleartext.
###
base dc=austin,dc=utexas,dc=edu
uri ldap://austin.utexas.edu/

binddn thebinduser
bindpw thebindpasswd
scope sub
timelimit 120
bind_timelimit 120
idle_timelimit 3600

nss_initgroups_ignoreusers root,ldap,named,avahi,haldaemon,dbus,radvd,tomcat,radiusd,news,mailman,nscd,gdm
# limit to the Austinites->People OU that have a uid set.
nss_base_passwd		ou=Austinites,ou=People,?sub?&(objectCategory=user)(uid=*)
#nss_base_group		ou=Group,dc=example,dc=com?one
#nss_base_hosts		ou=Hosts,dc=example,dc=com?one
#nss_base_services	ou=Services,dc=example,dc=com?one
#nss_base_networks	ou=Networks,dc=example,dc=com?one
#nss_base_protocols	ou=Protocols,dc=example,dc=com?one
#nss_base_rpc		ou=Rpc,dc=example,dc=com?one
#nss_base_ethers	ou=Ethers,dc=example,dc=com?one
#nss_base_netmasks	ou=Networks,dc=example,dc=com?ne
#nss_base_bootparams	ou=Ethers,dc=example,dc=com?one
#nss_base_aliases	ou=Aliases,dc=example,dc=com?one
#nss_base_netgroup	ou=Netgroup,dc=example,dc=com?one

nss_map_objectclass posixAccount user
nss_map_objectclass shadowAccount user
nss_map_attribute uid sAMAccountName
nss_map_attribute homeDirectory unixHomeDirectory
nss_map_attribute shadowLastChange pwdLastSet
nss_map_objectclass posixGroup group
nss_map_attribute uniqueMember member
pam_login_attribute sAMAccountName
pam_filter objectclass=User
pam_password ad

pam_password_prohibit_message "Sorry, you must change your password using the UTDirect EID interface."

# we don't use referrals at UT.
referrals no

# return more than 10 thousand entries when iterating over the entire
# map.
nss_paged_results yes
page_size 1000

/etc/pam.d/system-auth

%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so nullok try_first_pass
auth        sufficient    pam_krb5.so
auth        requisite     pam_succeed_if.so uid >= 500 quiet
auth        required      pam_deny.so

account     sufficient    pam_unix.so
account     sufficient    pam_krb5.so
account     sufficient    pam_succeed_if.so uid < 500 quiet
account     required      pam_permit.so

password    requisite     pam_cracklib.so try_first_pass retry=3
password    sufficient    pam_unix.so md5 shadow nullok try_first_pass use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so

Great.  So with this configuration, I can verify that I can login to the Linux box via ssh using my Windows username and do things as myself.  There are some caveats to this:  you must be in the Austinites OU at this point AND you must have the appropriate posixAccount attributes populated.  At a minimum, that’s uidNumber, gidNumber, loginShell, unixHomeDirectory, and uid.  You may have noticed that we’re actually using sAMAccountName for the login name above.  We’re still playing with that.  We might use uid instead because that’s something we can more easily filter on to limit on for different pieces of the account lookups.

So, what’s next?

  • need to beat on the system and make sure no one WITHOUT the attributes set can login.
  • need to work on the group map and get that working since we’re currently only doing passwd file lookups.
    • this includes understanding how to deal with secondary groups.
  • need to work on the netgroup map and come up with a standardized way of handling that so we can (under Linux), configure people in the /etc/security/access.conf for access restrictions.
  • need to figure out if there are other access restriction mechanisms we need to pay attention to.
  • need to look at all the pam configs on a default box and see if there’s something that isn’t covered by the system-auth template.
  • need to figure out just how far into Kerberos we want to go.  What we’re doing now is good for authenticating a single session.  Need to determine if we want to go to single sign-on or not and do all the extra bits associated with using kerberos.
  • need to figure out if we should do authentication using LDAP only (honestly, I’d prefer not to, but that depends on what others in the group need).
  • need to get this setup on a Solaris 10 system for testing.
  • need to address the issue of a person’s loginShell and coming up with some standards for those, in order to deal with departments that have differing shell policies but with overlapping accounts (e.g., researcher that has accounts in two Unix areas but that have differing policies on where or which shells should be used).
  • need to address the issue of a person’s unixHomeDirectory possibly differing between two Unix areas.  Same issue as above, but stickier because it’s generally a lot harder to consolidate this down to a consistent path name than it is to consolidate shells down to a consistent path name.

There are problably others, but these are my known unknowns at this point.

Things that were referenced to make all this work:




Confusing service types

20 05 2009

One of the problems we face in ITS is promoting, yet differentiating, the types of MySQL services we provide.  I hear you thinking, “But you’re ITS!  You only provide one type … it’s just mysql!”  Yes and no.  I’m in this weird spot where I’m tasked with being the lead MySQL person for ITS Systems but also as one of the lead technical people for the campus-wide MySQL Enterprise support agreement.  What does that mean?

I wear two hats basically.  Under one hat, my role is to support, maintain, promote, and market the centralized MySQL services that we provide.  As part of that service, I make sure our hosts are a part of the campus-wide agreement.  We foot the bill for making sure that our hosts are under that agreement.  The cost to the customer is just spread out across lots of customers.

On the other hat, my goal is to also help promote the platinum support agreement and get customers to buy in to it.  But how are these two different?

In the centralized and fully managed environment (my first hat), we provide everything so you don’t have to.  That means we have a datacenter, with highly reliable hardware, using enterprise versions of the software, all managed by staff that are highly trained and experienced in various aspects of systems administration.  The benefit to you is that you don’t need to have any of these resources set up in your department or group.  You just come to us and get the databases you need.  This is great for customers that don’t want to sink expenses into setting all these things up on their own.

In the self-managed model (my term for those groups that would utilize the support contract directly), you may have everything that we have (staff, hardware, software, data center space, budget, and so on).  The reliability of your MySQL services is important to you, so being able to access commercial support is a win.  Especially when you consider that the agreement that ITS has worked out makes the barrier to entry (mainly the cost) significantly lower than if you had to purchase this on your own.  It’s a few thousand dollars per server if you wanted to get gold or platinum support from MySQL and do it on your own.  It’s $2500 from ITS and that covers any server you control in your department.  Sweet, eh?

The problem I’m facing right now is that the web pages we have promoting these two services are reasonably confusing when looked at in the context of each other.  It’s hard for a customer to tell which method of service they require, so I’ve been spending some time doing a rewrite to clear things up.  Hopefully I can get this posted soon because I think it will be really helpful to everyone, especially those customers who may not be entirely sure what they really want (or need).




The Trouble with MySQL Tribbles

19 05 2009

Well, I’ve been back at UT for almost two months now and finding it to be a pretty enjoyable experience.  For those who don’t know me, my name is Travis Campbell.  I work in ITS Systems Unix managing the hosted MySQL database services for the University.  As I get deeper into UT’s MySQL infrastructure, I’ll be blogging here with general updates and related information suitable for the UT MySQL community at large.

Things I’m working on:

  • Streamlining the hosted MySQL database request process.
  • Consolidating the hosted and self-managed MySQL documentation on the ITS website.
  • Modifying the hosted MySQL database services so they meet the minimum ISO requirements for storing Category 1 data.
  • Looking to bring high-availability to the hosted MySQL database services (right now we have warm standbys).
  • Looking at ways to streamline the backup processes using Zmanda ZRM (community edition).
  • Slowly moving myself into the position as the University’s global MySQL resource (thanks to our recent purchase of a campus-wide platinum support contract from Sun/MySQL).
  • Elevate hosted MySQL services to the point of allowing connections from non-UTS/WebCentral systems (for those who need hosted databases but still want to use their own web services, for example).

Ultimately, I have quite a few lofty goals and a commitment to bringing MySQL into even more of an enterprise class of service for the University.  There’s lots of work and a long road ahead, but it’s going to be fun.

[Ok, no tribbles were harmed in the making of this blog posting.  Sorry to get your hopes up!]