Frank's Activity Log: August 2009 Archives

August 27, 2009

27. August 2009 -- Thursday

Admin

Backup
  • EMC 31051592: no nsrck's hung this morning.
  • EMC 31099632: no word from EMC -- have to refresh their memory tomorrow.
  • Scouting for word on whether 7.4sp5 is safe or not -- it appears to be good.
  • Tapes and Disks picked up by SecurShred for destruction.
  • Start to lay out design for scripted staging.
  • Investigations into why backups on reindeer are so much slower than they should be.
  • Replaced batteries in tape fire safes.
LDAP
  • Privacy flags
  • Thoughts about how to separate the SOR and WP information
  • Research how students find out their id number

August 26, 2009

26. August 2009 -- Wednesday

Admin

Backup
  • EMC 31051592: Forced to level 1 to get a call back, then spent two hours on the phone with support. Re-defined 42 clients that had been removed over the course of the last 15 years; nsrim -X now runs without causing nsrck to hang. The acid test will be whether there is a hung nsrck tomorrow morning.
  • EMC 31099632: Support thinks the fix may be an SQL sp3 hotfix, which COMIS is leery of because they are not seeing any indication that they have this issue. I agree. Sent support off to find out why the VSS backup is ignoring the directive to skip .mdf files.

August 25, 2009

25. August 2009 -- Tuesday

Admin

Backup
  • Manually clean tape drive, EMC 23189374... still broken since 14. April 2008!
  • EMC 31051592:
    • kill off the nsrck that started and hung at midnight.
    • Discover the "undef" volume problem is a side effect of this daily process not working.
    • Continue manual nsrck -L6's against all clients in /nsr/index
  • EMC 31099632 -- open, sev 2 -- med104 failing VSS SYSTEM SERVICES:\ backup due to SQL
  • Weekly cloning script finished. So, once the nightly backups finally finish, I'll be able to recycle networker on the server to see if that clears up the issue with 31051592.
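
The manual nsrck -L6 pass over every client in /nsr/index could be scripted along these lines. This is only a sketch: it assumes each subdirectory of /nsr/index is named after one client, and the function names are made up for illustration. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess


def build_nsrck_commands(index_dir="/nsr/index", level=6):
    """Return one nsrck argv list per client directory under index_dir.

    Assumption: every subdirectory of index_dir is a client index.
    """
    clients = sorted(
        name for name in os.listdir(index_dir)
        if os.path.isdir(os.path.join(index_dir, name))
    )
    return [["nsrck", f"-L{level}", client] for client in clients]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Returns the list of client names whose nsrck exited non-zero.
    """
    failed = []
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            result = subprocess.run(argv)
            if result.returncode != 0:
                failed.append(argv[-1])
    return failed
```

Calling run_all(build_nsrck_commands()) prints the command list for review; passing dry_run=False would actually run the checks one client at a time.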

August 24, 2009

24. August 2009 -- Monday

Admin

Backup
  • EMC 31051592: worked for an hour on the phone with Wallace on Friday and ran commands on Sunday afternoon. The "nsrck -m" ran without error. However, the "nsrim -X" appears to have hung up trying to work on a client that was removed in March of 2008. Will spend more time working with EMC this morning. Much work on this issue today, no resolution.
  • Tape/Drive destruction assembly
LDAP
  • Alumni email-for-life additions

August 21, 2009

21. August 2009 -- Friday

Admin

Backup
  • 31051592: Sev 2 -- nightly nsrim/nsrck pair starts up (nsrim forks and starts nsrck) but nsrck never finishes... one was running for 17 days. Does not use any cpu time, just sits there.... EMC is trying to force me to upgrade to the brand new 7.4.5 release.
  • Volumes going "undef" instead of "expired" -- 5 more this morning, recycled them manually
  • Replace broken handle on fire safe.
LDAP
  • Privacy Flags

August 18, 2009

18. August 2009 -- Tuesday

Admin

Backup
  • EMC SR 30744500: Discovered many nsrck -MXv (long string of client names) processes running yesterday when I shut down networker for monthly maintenance. Could those (some running from as far back as August 1) be related to the ongoing corruption problems in daemon.raw?
  • Found more tapes that have volretent value of "undef" instead of "expired". These appear to be tapes that had previously had the volretent set to the ssretent value instead of the clretent value. At least I will assume that until I find one that was never on that list of tapes. Opened issue for tracking in sharepoint -- issue #104
  • EMC SR 30964166: Working with Greg using norton1 as our guinea pig to verify that ASR recovery with AFTD's involved is bad news -- discovered that storage node ordering is critical to having it actually work.
  • Notify Cambridge that I would like info about the "dual power" option for the Qualstar XLS
LDAP
  • Privacy Flags
  • Alumni Email-For-Life updates
  • Double check that backup of data directory is working

August 17, 2009

17. August 2009 -- Monday

Admin

Backup
Monthly Backup Maintenance
  • Qualstar XLS: Upgraded firmware from version 2 to version 3. Process took 7 hours.
  • AFTD: fsck'd /disk1 on storage nodes 5, 7, and 8.
  • LTO4 drives: Upgraded firmware to 94D0 to address a code "A" issue with rewinds.
  • Rebooted all storage nodes and NSR Server.
LDAP
  • Privacy flags
Accounts
  • Automated Purge Process test (dry) run.

August 15, 2009

15. August 2009 -- Saturday

Admin

Calendar
  • OS Maintenance applied and Calendar database maintenance run
Backup
  • Rebooted gnu: read-only file system error -- more fallout from the Day of Disaster!
  • Changed nsrports range on rh4-build. Discovered the real problem in NW7.5 is that some piece of nsrexecd is hard coded to open port 7937. So if 7937 is in the range of ports nsrexecd is allowed to use, some other piece of nsrexecd is 60% likely to get it before the piece that was written incorrectly. Amazing!!! Tell the client code it cannot use port 7937 and it still uses it... and this pile of sh*t passed quality assurance testing! Wow!!!
  • Restart of 'UVM Fullsave 2' for above two issues
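
Since the hard-coded piece of nsrexecd takes 7937 regardless, one way to work out a safe range is to carve 7937 out of whatever range is being considered, so nothing else can grab it first. The helper below is just an illustrative sketch. The function names are hypothetical, and how the resulting ranges are handed to nsrports is left out.

```python
def exclude_port(ranges, reserved=7937):
    """Split inclusive (low, high) port ranges so none contains `reserved`."""
    result = []
    for low, high in ranges:
        if not (low <= reserved <= high):
            # Range does not touch the reserved port; keep it unchanged.
            result.append((low, high))
            continue
        if low < reserved:
            result.append((low, reserved - 1))
        if reserved < high:
            result.append((reserved + 1, high))
    return result


def format_ranges(ranges):
    """Render ranges as space-separated 'low-high' (or a single port)."""
    return " ".join(
        str(low) if low == high else f"{low}-{high}" for low, high in ranges
    )
```

For example, exclude_port([(7900, 8000)]) gives two ranges, 7900-7936 and 7938-8000, leaving 7937 free for the piece that insists on it.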

August 14, 2009

14. August 2009 -- Friday

Admin

Disaster Day
The postmortem remains to be done and will require the assistance of IBM and EMC (VMware), but we've had a fun day of it. It started on Thursday (while I was out) during work with IBM on an issue that had caused the replication connection between two SAN systems to break. IBM determined that there was an Efix specifically for this condition. It was tested during the day, without incident, on the target SAN with a VMware cluster connected to it. Therefore, somewhere between 9 and 10 PM Thursday evening, the first part of the Efix was applied to the source SAN. Well, the production VMware cluster got really pissed and several guest systems lost access to their disks. I am not sure, but I suspect that all guests saw some interruption to their activities. Not only did these systems lose their disks, but VMware appeared to have lost access to the files that represented their disks. I'm not sure what was tried in the next 3.5 hours, but I was paged at 1:30 am Friday morning to deal with recovering our primary LDAP server from backups (essentially to perform a bare metal restoration). We spent a couple of hours on it, during which time we discovered that our support contract with EMC/VMware had expired. Despite guarantees from the sales people that we'd be sent an automatic renewal notice, we had no support contract. It took time to convince the support structure that we were not bums and would be contacting sales to negotiate a renewal, and finally VMware was working to help resolve the issue.

At about 2:30 I started the bare metal recovery of the LDAP server (create a new VMware guest and recover all the files from the NetWorker backup system) and I was amazed that it completed in less than two hours. It wasn't perfect, because the backup had been done before the previous day's batch update and it was missing some information that was needed. I retrieved the LDAP database from one of the replica servers and got the LDAP server back up and running. I didn't realize that I was missing the previous day's batch run data until after I'd manually kicked off the day's batch run, which had lots of errors because it was comparing today's authoritative feed data to that from two days ago instead of just yesterday's. To correct this, I've updated the nightly batch script so it issues a NetWorker save of the batch update data directory when the nightly batch process is complete. So, as long as the batch process runs to completion, the backup system will have the information.
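
The idea behind that change can be sketched as follows. This is not the actual batch script: the function name, paths and server name are placeholders, and the only NetWorker call assumed is the client-side save command with its -s server option.

```python
import subprocess


def run_batch_then_save(batch_cmd, data_dir, server, runner=subprocess.run):
    """Run the nightly batch; save data_dir to NetWorker only if it succeeded.

    `runner` is injectable so the logic can be exercised without NetWorker.
    Returns True only when both the batch and the save exit with status 0.
    """
    batch = runner(batch_cmd)
    if batch.returncode != 0:
        # Batch did not run to completion, so there is nothing consistent to save.
        return False
    save = runner(["save", "-s", server, data_dir])
    return save.returncode == 0
```

Tying the save to the batch's exit status means the backup always holds the data directory as it stood after the last successful run, which is what the next run needs to compare against.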

VMware did come through eventually, with not one but two filesystem dd patches to correct the corruption. The files that had been unavailable could all eventually be seen again and their guests could be started. Which was good.

Sadly, the incident was not free of problems. We discovered that the ASR recovery process on the Windows 2003 printer server tripped over a bug in NetWorker. An issue is open with NetWorker about that. It was opened as a Sev 1, but has been lowered because VMware was able to get the VMFS corruption fixed and we were able to get the printer server running again before the 1 hour call back time on the Sev 1 had expired. We will build a test Windows system that we will do an ASR recovery on and work through this problem with EMC/NetWorker. In short, the problem is that in our environment we send Full saves to one set of disk and incrementals to another set of disk. Those different sets of disk are mounted on different backup servers (storage nodes). NetWorker's ASR recovery process reads the information from the full save and then demands that the incremental disk be removed from the storage node it is attached to and moved to the same storage node the full save disk is attached to. That's just not physically possible with direct attached disk.
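
To see which clients are exposed to this, the layout can be modelled as a toy check: flag any client whose save sets live on devices attached to more than one storage node. The device names, node names and the function itself are hypothetical; the real information would have to come from the NetWorker media database.

```python
from collections import defaultdict


def clients_split_across_nodes(device_node, save_sets):
    """Return clients whose save sets sit on more than one storage node.

    device_node: dict mapping a device name to the storage node it is attached to.
    save_sets:   iterable of (client, level, device) tuples.
    """
    nodes_by_client = defaultdict(set)
    for client, _level, device in save_sets:
        nodes_by_client[client].add(device_node[device])
    return sorted(
        client for client, nodes in nodes_by_client.items() if len(nodes) > 1
    )
```

A client with fulls on a disk attached to one node and incrementals on a disk attached to another would be flagged, and is a candidate for hitting the same ASR failure.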

The final notice that all was recovered was sent out by Geoff at 3:50pm Friday afternoon.