NETCOM downtime a programming error. (fwd)

Christopher X. Candreva (chris@westnet.com)
Wed, 3 Jul 1996 10:33:00 -0400 (EDT)

From an ISP mailing list: why exactly Netcom went down last week:
----

IA: Take us through what happened...

GARRISON: Think about the network in three layers. The first layer is
Network Access Points, where we have peering agreements with other
carriers. It's the entry point to the Internet. (There are about a half
dozen across the U.S.) The next level down is the hubs, which are our
internal virtual private network routing hubs that look at traffic and
direct it along the speediest line available. The third level down is
where the customer actually logs on, at an access POP (Point Of
Presence). At each of those levels there are routers made by Cisco and
others that have instruction tables on them -- "IF THEN" statements
that tell the traffic where to go, what route to take to get to its
destination.
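
As a rough illustration of the "IF THEN" tables described above, here is
a minimal Python sketch of a route lookup. It is a toy, not Netcom's or
Cisco's actual configuration: the prefixes and the next-hop names
(hub-west, pop-sanjose, nap-peer) are made up for the example, and the
longest-prefix rule is the general way IP routers pick among matching
entries rather than anything stated in the interview.

```python
import ipaddress

# Hypothetical table of (destination prefix, next hop).
# All prefixes and hop names are illustrative assumptions.
ROUTES = [
    (ipaddress.ip_network("10.1.0.0/16"), "hub-west"),
    (ipaddress.ip_network("10.1.2.0/24"), "pop-sanjose"),
    (ipaddress.ip_network("0.0.0.0/0"), "nap-peer"),  # default route
]

def next_hop(destination):
    """Return the next hop for the most specific matching prefix."""
    addr = ipaddress.ip_address(destination)
    # IF the destination falls inside a prefix THEN that rule is a candidate.
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # When several rules match, the longest (most specific) prefix wins.
    best = max(matches, key=lambda m: m[0].prefixlen)
    return best[1]
```

With this table, next_hop("10.1.2.7") picks the /24 entry over the /16,
and an address outside 10.1.0.0/16 falls through to the default route.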

At the network access level, you have some pretty complex code that
says, 'If the traffic comes from this party, then do the following
thing with it.' And because of the number of new access providers or
changes in the access providers, there are daily changes made at the
network access layer. And these are changes that are made in software
to the routers. It's done in a language called BGP, or Border Gateway
Protocol. So, there was one line of code that said, literally, "No
redist bgp access list 25 in," just a line of code that revised an
instruction. Because the two sentences were put together as opposed to
being done on separate lines, the network read it as an "AND"
statement instead of an "OR" or an "IF" statement.

So, what happens is the network automatically replicates the
instruction set from the network access point from where this was
entered, which was Washington, DC, and it replicated itself to the
other network access points. Because of the way the code was written,
it then said, 'ah hah, it's a network instruction, not a peering
instruction -- I'd better send it out to the hubs.' The hubs saw it,
and said, 'ah hah, I'd better send it out to the POPs.' Well, the POPs
-- the routers at the lower levels of the network -- do not
have the memory or capacity for the peering instructions because they
don't interface with anybody else, so they don't need that capacity.
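
The cascade described above can be sketched as a small Python model. This
is only an illustration under stated assumptions: the Router class, the
capacity figures, and the router names are all hypothetical, and the
model treats "too many entries for the router's memory" as the sole
reason a router freezes.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    name: str
    capacity: int  # max routing entries it can hold (made-up units)
    children: list = field(default_factory=list)
    frozen: bool = False

def replicate(router, entries):
    """Push `entries` routing entries to a router and everything below it."""
    if entries > router.capacity:
        router.frozen = True  # the table does not fit, so the router stops
    for child in router.children:
        replicate(child, entries)

def frozen_names(router):
    """Collect the names of frozen routers, top of the tree first."""
    names = [router.name] if router.frozen else []
    for child in router.children:
        names.extend(frozen_names(child))
    return names

# Three tiers: one access point, one hub, two POPs (hypothetical sizes).
pop_a = Router("pop-a", capacity=1_000)
pop_b = Router("pop-b", capacity=1_000)
hub = Router("hub", capacity=50_000, children=[pop_a, pop_b])
nap = Router("nap-dc", capacity=50_000, children=[hub])

replicate(nap, entries=40_000)
print(frozen_names(nap))  # ['pop-a', 'pop-b']
```

The upper tiers have room for the peering table and simply pass it on;
only the small routers at the bottom are overwhelmed.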

So, when they got it, it basically froze the routers down at the third
level of the network. Meantime, we're sitting there reprogramming the
routers, but as fast as we can reprogram, the replication feature of
the intelligent network overwhelms our ability to keep up.
Basically our decision was to shut down the network to reboot the
routers, to put in a fresh instruction set.
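
Why manual repair could not win that race can be shown with a
back-of-the-envelope Python sketch. The function name and every number
in it are assumptions chosen for illustration; the interview gives no
actual rates or router counts.

```python
def routers_still_frozen(total, fixed_per_minute, refrozen_per_minute, minutes):
    """Toy model: operators fix routers while replication re-breaks them."""
    frozen = total
    for _ in range(minutes):
        # Operators reprogram some frozen routers by hand.
        frozen = max(0, frozen - fixed_per_minute)
        # Replication pushes the bad instructions back out to healthy ones.
        healthy = total - frozen
        frozen += min(healthy, refrozen_per_minute)
    return frozen

# Replication keeps pace with the repairs: nothing stays fixed.
print(routers_still_frozen(100, 5, 5, 60))  # 100
# Replication stopped (e.g. network shut down first): repairs stick.
print(routers_still_frozen(100, 5, 0, 60))  # 0
```

As long as replication undoes repairs at least as fast as they are made,
the frozen count never drops, which is consistent with the decision to
shut the network down before loading a fresh instruction set.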

That's a long-winded explanation, but because your readers are more
technical, it's worthwhile!

============================== ISP Mailing List ==============================
Email ``unsubscribe'' to inet-access-request@earth.com to be removed.
inet-access archives are at ftp://ftp.earth.com/pub/archive/inet-access/