[CCoE Notice] Computer and network outage update

Charles John Young Jr. cjyoung at EGR.UH.EDU
Sat Dec 31 14:53:54 CST 2011


Dear Colleagues,

As many of you already know, the College experienced a major outage of 
computing services beginning at about 5:30 p.m. Wednesday.  This outage 
affected all of the services in Engineering Computing, including all of 
the College websites.  All primary websites were restored to operation by 
Thursday afternoon, and most servers were restored by noon today.

This was a major unanticipated outage that resulted from multiple 
failures.  First, there was a short, planned electrical outage as 
announced by Physical Plant operations.  This occurred around 5:30 p.m. on 
Wednesday.  Normally, such outages have no significant impact on College 
core computer and network services, since we have excellent UPS 
capabilities in our server room.  However, the IT router that connects the 
ECC to the campus network is served by a separate, dedicated UPS system, 
and this system failed.  As a result, although all of the computer 
services continued to operate normally, communications with the external 
network were lost.  Consequently, there was no way to notify the College 
faculty and staff through our normal means such as Engi-Dist.

I contacted central IT's Availability Center (ITAC) within minutes of the 
outage.  However, they were operating with limited staffing, and were 
unable to respond until the following (Thursday) morning.  I arranged to 
come in early Thursday to clear the network problem, which we assumed was 
related to the router needing a manual reset.

When I arrived at the ECC, I discovered that not only had the router's UPS 
failed, but that the chilled water main that feeds Engineering 2 had 
broken, and that the server room was well over temperature.  Just like 
the central computer center, we depend on external chilled water for 
maintaining cooling in our server room.  As soon as I confirmed with 
Physical Plant that repairs were not expected to be completed until the 
following morning, I had no option but to power off nearly all of the 
computing equipment to prevent possible heat-related damage.

Although we keep a spare UPS for the IT router, it also was defective. 
Fortunately, Bryan Bales was able to rig a direct connect from the Cisco 
router to the 208VAC outlet, so network connectivity was reestablished 
late Thursday afternoon.  This was sufficient to restore our primary 
websites to service.  However, without chilled water, most other services 
remained powered off.

This morning, I returned to the ECC and began the process of bringing most 
of our services back online.  As of noon today, nearly all essential 
services and most research-related servers are again operational.

I apologize for the inconveniences caused by this outage.  However, the 
cascading failures resulting from both UPS equipment failure and the 
chilled water outage compounded the impact on our operations and 
significantly prolonged the time to restore services.

If anyone is still experiencing a problem with services, please let me 
know.

Let us hope for a better new year!

John

