Our favorite vendor sucks anew

We replaced our main campus border routers, which had been old Foundry routers, with brand spanking new ones, each a Mighty Cisco 6500. The old boxes had performed very well for several years, but Foundry is a company relatively inexperienced with cutting-edge things like interdomain multicast, the boxes were getting pretty old, and it would have been a pain to upgrade them to 10 gigabit Ethernet, so we decided to get rid of them. And you can't go wrong with the equipment "the Internet runs on", right?

Well, as soon as we powered them up, they went to 100% CPU. These big expensive things had insufficient TCAM to handle the size of our routing table and were thus trying to process the routing of all campus Internet traffic in software (a hopeless task). The solution was to repartition the TCAM to steal areas reserved for features we are not currently using (like MPLS and IPv6) and give them over to IP forwarding. Given that we count the need to manually manage TCAM resources as a thumbs-down on our Foundry equipment, this rather amused us.
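
For anyone who wants the repartitioning idea in concrete terms, here is a toy Python sketch. It is only a model of the general principle (a fixed pool of hardware slots carved into per-protocol regions), not how the 6500 actually implements it. The slot counts, the function names `software_routes` and `repartition`, and the `keep` parameter are all made up for illustration.

```python
# Toy model of a fixed-size forwarding TCAM shared between protocols.
# All sizes are hypothetical illustrative numbers, NOT Cisco 6500 specifications.

def software_routes(partition, demand):
    """Return, per protocol, how many routes do not fit in their hardware region."""
    return {proto: max(0, demand.get(proto, 0) - slots)
            for proto, slots in partition.items()}

def repartition(partition, donors, recipient, keep=0):
    """Move all but `keep` slots from each donor region to the recipient region."""
    new = dict(partition)
    for proto in donors:
        freed = max(0, new[proto] - keep)
        new[proto] -= freed
        new[recipient] += freed
    return new

TOTAL_SLOTS = 1000                                   # hypothetical TCAM size
default = {"ipv4": 500, "mpls": 250, "ipv6": 250}    # hypothetical factory split
demand = {"ipv4": 700, "mpls": 0, "ipv6": 0}         # hypothetical routing table

before = software_routes(default, demand)
tuned = repartition(default, donors=["mpls", "ipv6"], recipient="ipv4", keep=50)
after = software_routes(tuned, demand)

print("overflow before:", before)
print("tuned partition:", tuned)
print("overflow after: ", after)
```

In this model, any route that overflows its region is one the CPU has to handle itself, which is the situation we found ourselves in at power-up. Giving the unused MPLS and IPv6 regions to IPv4 makes the overflow go away without changing the total size of the table.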

Even after tuning the hardware to run our application well (something I was NOT expecting to have to do; it implies that we're running this platform close to the edge), there was persistently high CPU utilization that no one could explain. One box was up around 25-40% and the other ran at about 10-15%. It doesn't sound like much, but these are pretty fast CPUs, and all of the packet forwarding is supposed to be done in hardware. After all, our old Foundry boxes with much slower processors ran at about 0-1% CPU unless they were busy with something like a BGP table reconvergence. We asked Cisco, and after a couple of ineffective suggestions they fell silent on the matter, as usual, like a sullen teenager.

The 6500s had shipped with Cisco's latest "modular IOS" code, which supposedly allows one to upgrade and restart one aspect of router operation (say, the BGP routing protocol, or the SSH server) without having to restart the entire device and cause a service interruption. Anecdotally, we were being told that modular IOS was not something to be run in production, which surprised us because, well, this is how they shipped them to us.

Ryan and I spent about an hour bright and early doing other cleanup on the configurations of those routers, and also changed their operating software to "monolithic IOS", which is the old-fashioned single piece of software that runs the entire router. While we were working, the CPU utilizations were kind of crazy since we were doing all sorts of things to them that made them think, but the before and after CPU graphs kind of speak for themselves.

IBM, Microsoft, Cisco... all the vendors you love to hate.

Comments

gdmusumeci
Dec. 22nd, 2006 04:23 am (UTC)
Mmm... some reason why you didn't go with the OSR aka 7600? I have found it to do a MUCH better job of routing than the 6500 (especially with the Sup720-3B, a quite potent device).

Also, IOS SUX.
szasz
Dec. 22nd, 2006 04:30 am (UTC)
These things do have Sup720-3B's. For what we do, the OSR wouldn't really be any different. It's almost the same base platform except for those ultra-expensive Service Processors, and we just didn't need that kind of on-card processing. Also, the 10GE SPA is a one-port card.

Sure, we probably could have done this with 7600s at a much, much higher price, but come on: this is a campus exit border application, not an ISP core. They have to push maybe 2-4 Gb/s and each have something like 8 BGP peers.
tomato_sutra
Dec. 22nd, 2006 02:06 pm (UTC)
Bleagh re: the vendor's response (WTF?), but the "after" part of the graph is quite beautiful. And I got a big tingle when you said "interdomain multicast." Rowr.
paulpomes
Dec. 25th, 2006 07:55 pm (UTC)
Curious that all three people I know at Cisco are jumping ship. One is changing fields entirely to (surprise) veterinary medicine, another is moving to Portland, and the last is waiting another month for her options to vest, such as they are. What they all have in common is the view that Cisco hasn't been fun in a long time.