[CCoE Notice] Computer and network outage update
Charles John Young Jr.
cjyoung at EGR.UH.EDU
Sat Dec 31 14:53:54 CST 2011
Dear Colleagues,
As many of you already know, the College experienced a major outage of
computing services beginning at about 5:30 p.m. Wednesday. This outage
affected all of the services in Engineering Computing, including all of
the College websites. All primary websites were restored to operation by
Thursday afternoon, and most servers were restored by noon today.
This was a major unanticipated outage that resulted from multiple
failures. First, there was a short, planned electrical outage as
announced by Physical Plant operations. This occurred around 5:30 p.m. on
Wednesday. Normally, such outages have no significant impact on College
core computer and network services, since we have excellent UPS
capabilities in our server room. However, the IT router that connects the
ECC to the campus network is served by a separate, dedicated UPS system,
and this system failed. As a result, although all of the computer
services continued to operate normally, communications with the external
network were lost. Consequently, there was no way to notify the College
faculty and staff through our normal means such as Engi-Dist.
I contacted central IT's Availability Center (ITAC) within minutes of the
outage. However, they were operating with limited staffing, and were
unable to respond until the following (Thursday) morning. I arranged to
come in early Thursday to clear the network problem, which we assumed was
related to the router needing a manual reset.
When I arrived at the ECC, I discovered that not only had the router's UPS
failed, but that the chilled water main that feeds Engineering 2 had
broken, and that the server room had badly overheated. Just like
the central computer center, we depend on external chilled water for
maintaining cooling in our server room. As soon as I confirmed with
Physical Plant that repairs were not expected to be completed until the
following morning, I had no option but to power off nearly all of the
computing equipment to prevent possible heat-related damage.
Although we keep a spare UPS for the IT router, it too was defective.
Fortunately, Bryan Bales was able to rig a direct connection from the Cisco
router to the 208VAC outlet, so network connectivity was reestablished
late Thursday afternoon. This was sufficient to restore our primary
websites to service. However, without chilled water, most other services
remained powered off.
This morning, I returned to the ECC and began the process of bringing most
of our services back online. As of noon today, nearly all essential
services and most research-related servers are again operational.
I apologize for the inconveniences caused by this outage. However, the
cascading failures resulting from both UPS equipment failure and the
chilled water outage compounded the impact on our operations and
significantly prolonged the time to restore services.
If anyone is still experiencing a problem with services, please let me
know.
Let us hope for a better new year!
John