December 21st, 2006

Our favorite vendor sucks anew

We replaced our main campus border routers, which had been old Foundry routers, with brand spanking new ones, each the Mighty Cisco 6500. The old boxes had performed very well for several years, but Foundry is a company relatively inexperienced with cutting-edge things like interdomain multicast, the boxes were getting pretty old, and it would have been a pain to upgrade them to 10 gigabit Ethernet, so we decided to get rid of them. And you can't go wrong with the equipment "the Internet runs on", right?

Well, as soon as we powered them up, they went to 100% CPU. These big expensive things had insufficient TCAM to handle the size of our routing table and were thus trying to process the routing of all campus Internet traffic in software (a hopeless task). The solution was to repartition the TCAM to steal areas reserved for features we are not currently using (like MPLS and IPv6) and give them over to IP forwarding. Given that we count the need to manually manage TCAM resources as a thumbs-down on our Foundry equipment, this rather amused us.
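For the curious, the trade-off can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the table sizes and route counts are invented (they are not the real figures from these routers), and the function names are hypothetical. It only shows the shape of the problem: a fixed-size table split into per-protocol regions, where anything that does not fit in its region falls through to software.

```python
# Toy model of a fixed-size hardware forwarding table (TCAM) that is
# split into per-protocol regions. All numbers are made up for
# illustration; they are NOT real Cisco 6500 values.

def software_switched(partition, demand):
    """Return, per protocol, how many routes do not fit in hardware."""
    return {
        proto: max(0, demand.get(proto, 0) - partition.get(proto, 0))
        for proto in demand
    }

def repartition(partition, donors, recipient, keep=0):
    """Move capacity from donor regions to the recipient region,
    leaving `keep` entries behind in each donor."""
    new = dict(partition)
    for donor in donors:
        spare = max(0, new[donor] - keep)
        new[donor] -= spare
        new[recipient] += spare
    return new

# Hypothetical default split and a hypothetical routing table size.
default = {"ipv4": 192_000, "mpls": 32_000, "ipv6": 32_000}
demand = {"ipv4": 200_000, "mpls": 0, "ipv6": 0}

before = software_switched(default, demand)
tuned = repartition(default, donors=["mpls", "ipv6"], recipient="ipv4", keep=1_000)
after = software_switched(tuned, demand)

print("overflow before:", before)
print("tuned partition:", tuned)
print("overflow after: ", after)
```

The total table size never changes; repartitioning only moves the boundaries, which is why the space has to come out of features that are not in use.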

Even after tuning the hardware to run our application well (something I was NOT expecting to have to do; it implies that we're running this platform close to the edge), there was persistently high CPU utilization that no one could explain. One box was up around 25-40% and the other ran at about 10-15%. It doesn't sound like much, but these are pretty fast CPUs, and all of the packet forwarding is supposed to be done in hardware. After all, our old Foundry boxes with much slower processors ran at about 0-1% CPU unless they were busy with something like a BGP table reconvergence. We asked Cisco, and after a couple of ineffective suggestions they fell silent on the matter, as usual, like a sullen teenager.

The 6500s had shipped with Cisco's latest "modular IOS" code, which supposedly allows one to upgrade and restart one aspect of router operation (say, the BGP routing protocol, or the SSH server) without having to restart the entire device and thus cause a service interruption. Anecdotally, we were being told that modular IOS was not something to be run in production, which surprised us because, well, this is how the routers were shipped to us.

Bright and early, Ryan and I spent about an hour doing other cleanup on the configurations of those routers, and also changed their operating software to "monolithic IOS", which is the old-fashioned single piece of software that runs the entire router. While we were working, the CPU utilizations were kind of crazy since we were doing all sorts of things to them that made them think, but the before and after CPU graphs kind of speak for themselves.

IBM, Microsoft, Cisco... all the vendors you love to hate.



