Wednesday, September 29, 2010

Major service interruption update and review at 2:05 pm

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by many NJIT applications. We have been working since that time to restore services. As of noon today, approximately 70% of the consistency checks had completed. It is difficult to predict an exact time, but we expect the entire process to continue at least until this evening so we can ensure data integrity.

A full account to date is below.

Problem Description:

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by NJIT applications. This particular disk array provides one or more services to about 70% of academic and administrative applications at NJIT. Normally this type of hardware failure does not cause a disruption of service due to internal redundant components (in fact, systems administrators have exercised this internal redundancy many times in the past to replace failed components with no impact on service). However, this time, due to multiple failed components, the internal redundancy could not handle the hardware failure.

Systems Affected:

Systems affected by this hardware problem include the NJIT Web site, administrative e-mail (ADM), authentication middleware to cloud services (e.g. Moodle, Webmail by Google student e-mail), Highlander Pipeline, AFS academic file system, DFS departmental file system, and a number of ancillary components of the Banner ERP system (e.g. Cognos reporting, Banner job submission, e-print).

Current Status:

At this time, we do not expect all services to be fully restored until later this evening. A more detailed discussion of the restoration process is included below.

The following services have been temporarily restored on alternate hardware and are available:
  • NJIT website
  • Webmail by Google student e-mail
  • Banner Job Submission
Staff are working to restore Moodle in a similar fashion.

Faculty and staff who normally access Internet Native Banner (INB) through Highlander Pipeline may log in at http://bninbts1.njit.edu:9090 (this requires VPN from off-campus). Self-Service Banner is not available.

Fixing the Problem:

Oracle system engineers have been on-site and working remotely with NJIT systems administrators since yesterday morning. Replacing failed hardware components is the simplest part of the solution, and Oracle has proactively made an ample supply of replacement parts available.

The disk array under discussion totals approximately 54 TB of data split among 224 separate disk drives. The recovery process involves consistency checking and rebuilding parity checks on each of the disk drives. This is necessary to make sure no data has been lost or corrupted, and must be completed on all 224 separate disk drives before the entire disk array can be brought back online.
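For readers curious about what a parity consistency check involves, here is a minimal sketch in Python. It assumes simple XOR parity across a stripe of equally sized blocks. This is an illustration only: the actual parity scheme and tooling used by this disk array are not described in this update, and the function names and sample data below are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when the recomputed parity matches the stored parity."""
    return xor_parity(data_blocks) == parity_block


def rebuild_missing(surviving_blocks, parity_block):
    """Recover one missing block by XOR-ing the survivors with the parity block."""
    return xor_parity(surviving_blocks + [parity_block])


# Hypothetical stripe of three data blocks plus one parity block.
data = [b"NJIT", b"disk", b"data"]
parity = xor_parity(data)

print(stripe_is_consistent(data, parity))
print(rebuild_missing([data[0], data[2]], parity) == data[1])
```

In this simplified model, every stripe on every drive has to be read and compared, which is one reason a check across 224 drives takes many hours.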

These consistency checks are a slow process. As of noon today, approximately 70% of the consistency checks had completed. We expect this process to continue at least through the business day, at which time Oracle engineers will remotely run further diagnostics and manually recover any disk drives that have not passed the automated consistency checks. We expect this final review to last into this evening, after which all services can be brought back online. It is difficult to predict an exact time because the consistency checks for each of the 224 separate disks do not require the same amount of time. The good news is that an analysis of log files by Oracle engineers and the results of all consistency checks so far indicate no data loss or data corruption. Maintaining data integrity is the highest priority in this slow process.

Continuing Updates:

Updates will be posted at this site approximately every 3 hours, or as needed. You may also follow updates on Twitter (http://twitter.com/njit), Blogspot (http://njit.blogspot.com), and Facebook (http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825).

You can also contact the IST Computing Helpdesk at (973) 596-2900.

Thank you for your patience.
