Wednesday, September 29, 2010

Service interruption update at 7:00 pm

Normal Moodle services now available. Disk array data recovery proceeding successfully, but slowly.

We will no longer be posting detailed service interruptions on this general NJIT blog. Instead, we have a dedicated NJIT SOS (Service Outage System) blog at http://njitsos.wordpress.com.

Major service interruption update and review at 2:05 pm

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by many NJIT applications. We have been working since that time to restore services. As of noon today, approximately 70% of the consistency checks had completed. It is difficult to predict an exact time, but we expect the entire process to continue at least until this evening so we can ensure data integrity.

A full account to date is below.

Problem Description:

Yesterday, at approximately 4 am, we experienced a major hardware failure on one of the primary storage disk arrays used by NJIT applications. This particular disk array provides one or more services to about 70% of academic and administrative applications at NJIT. Normally this type of hardware failure does not cause a disruption of service due to internal redundant components (in fact, systems administrators have exercised this internal redundancy many times in the past to replace failed components with no impact on service). However, this time, due to multiple failed components, the internal redundancy could not handle the hardware failure.

Systems Affected:

Systems affected by this hardware problem include the NJIT Web site, administrative e-mail (ADM), authentication middleware to cloud services (e.g. Moodle, Webmail by Google student e-mail), Highlander Pipeline, AFS academic file system, DFS departmental file system, and a number of ancillary components of the Banner ERP system (e.g. Cognos reporting, Banner job submission, e-print).

Current Status:

At this time, we do not expect all services to be fully restored until later this evening. A more detailed discussion of the restoration process is included below.

The following services have been temporarily restored on alternate hardware and are available:
  • NJIT website
  • Webmail by Google student email
  • Banner Job Submission
Staff are working to restore Moodle in a similar fashion.

Faculty and staff who normally access Internet Native Banner (INB) through Highlander Pipeline may log in at http://bninbts1.njit.edu:9090 (this requires VPN from off-campus). Self-Service Banner is not available.

Fixing the Problem:

Oracle system engineers have been on-site and working remotely with NJIT systems administrators since yesterday morning. Replacing failed hardware components is the simplest part of the solution and Oracle has been proactive in having available an ample supply of replacement parts.

The disk array under discussion totals approximately 54 TB of data split among 224 separate disk drives. The recovery process involves consistency checking and rebuilding parity checks on each of the disk drives. This is necessary to make sure no data has been lost or corrupted, and must be completed on all 224 separate disk drives before the entire disk array can be brought back online.
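
As a rough illustration of what a parity consistency check and rebuild involve, here is a minimal Python sketch. It assumes simple single-parity XOR protection (RAID-5 style), which is an assumption made for illustration only: the array's actual RAID level and parity scheme are not stated here. The function names and the sample stripe contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when stored parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity_block


def rebuild_missing_block(surviving_blocks, parity_block):
    """Recover one lost data block by XOR-ing the parity with the surviving blocks."""
    return xor_blocks(surviving_blocks + [parity_block])


# Hypothetical stripe spread over three data disks (made-up contents).
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(stripe)
```

In this simplified model every stripe on every drive has to be read and compared, which is one reason a check across hundreds of drives takes many hours.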

These consistency checks are a slow process. As of noon today, approximately 70% of the consistency checks had completed. We expect this process to continue at least through the business day, at which time Oracle engineers will remotely run further diagnostics and manually recover any disk drives that have not passed the automated consistency checks. We expect this final review to last into this evening, after which all services can be brought back online. It is difficult to predict an exact time because the consistency checks for each of the 224 separate disks do not require the same amount of time. The good news is that an analysis of log files by Oracle engineers and the results of all consistency checks so far indicate no data loss or data corruption. Maintaining data integrity is the highest priority in this slow process.

Continuing Updates

Continuing updates, approximately every 3 hours or as needed, will be posted at this site. You may also follow updates on Twitter (http://twitter.com/njit), Blogspot (http://njit.blogspot.com), and Facebook (http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825).

You can also contact the IST Computing Helpdesk at (973) 596-2900.

Thank you for your patience.

Major service interruption update at 9:15 am

The procedure to reconstruct data damaged in NJIT’s enterprise disk storage array is taking much longer than initially expected. Fortunately, there are no indications of any data loss, and data integrity is the highest priority. The Gmail authentication service has been restored, and restoration of the Moodle authentication service is under investigation.

At this time, the data recovery process that began at around 2:00 pm yesterday is approximately 2/3 complete. Unfortunately, services may remain unavailable for the better part of the day today if not longer. When there is more specific information available, it will be communicated.

Your understanding is appreciated.

Please continue to visit these sites for the latest: http://ist.njit.edu, http://twitter.com/njit, http://njit.blogspot.com, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST computing Helpdesk at (973) 596-2900.

Tuesday, September 28, 2010

Major service interruption status update at 10:45 pm

Unfortunately, after careful analysis of the disk and data rebuild progress, and after discussions with Oracle engineers about the safest strategy for the remaining processes, we've determined that this service restoration might not be completed before 8:00 am Wednesday.

Our highest priority continues to be the integrity of the data. Our work with the Oracle engineers will continue throughout the night, and every effort will be made to restore all services as quickly as possible.

Your understanding is appreciated.

Major service interruption update at 6:30 pm

Due to the complexities of a multiple component failure, it looks like we are still several hours away from service restoration. Our current estimation is around midnight.

Thanks for your continued patience.

Major service interruption status update at 2:40 pm

Repairs are continuing. Thus far there are no signs of data loss or corruption; however, the consistency checking is proceeding more slowly than anticipated. The new estimated time for service restoration is 5 pm.

We thank you for your continued patience.

Major service interruption status update at 12:30 pm

Equipment repairs on the enterprise disk storage array are expected to be completed by 1:00 pm. Then the 100-plus systems impacted by this failure will be evaluated and brought up in priority order, with academic-related services being the highest priority. Assuming there are no data loss or corruption issues that would require file restores from backups, the estimated time for service restoration is about 3:30 pm.

Additional status updates will be posted on http://ist.njit.edu, http://twitter.com/njit, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST computing Helpdesk at (973) 596-2900.

Thank you for your patience.

Major service interruption

At about 4:00 am today, an enterprise disk storage system failed, affecting many critical services, including administrative email, Gmail authentication, AFS and DFS file systems, and Highlander Pipeline, among others. Field engineers are expected on-site by 10:30 am with replacement parts. We expect to know by 11:00 am whether the service interruption will be extended.

Banner production services are available at http://bninbts1.njit.edu:9090/ for NJIT users.

Additional status updates will be posted on http://ist.njit.edu, http://twitter.com/njit, and http://www.facebook.com/pages/Newark-NJ/NJIT/7185471825. You can also contact the IST computing Helpdesk at (973) 596-2900.

Thank you for your patience.

Thursday, September 23, 2010

The College of Architecture and Design Gallery's Fall Exhibit


Details, Details, Details: Roberto Osti, Daniel Brophy, Gocha Tsinadze
Three artists whose use of fine lines and uncompromising attention to detail offer a common linkage between artwork that would otherwise not appear so similar.

Friday, September 24, 2010, from 5-9 pm
open to the public with refreshments and an opportunity to meet the artists.

This exhibition will also be part of Newark’s Open Doors event, a weekend celebration of the arts with a Friday night gallery crawl and a Sunday afternoon open studio tour. For more information on Newark Open Doors, go to: http://www.newarkarts.org/.

The College of Architecture and Design Gallery (COAD) is located on the corner of MLK Blvd. and Warren St. in Newark, NJ – 367 MLK Blvd.

Wednesday, September 15, 2010

Evening Road Closure

The Essex County Department of Public Works has informed us that Central Avenue will be closed for resurfacing between Dr. MLK Blvd. and Norfolk St. from 7:00 PM, Thursday, September 16 to 6:00 AM, Friday, September 17.

Please adjust your travel plans accordingly. We will advise you if any further updates concerning the project are received from the county.