During a recent project, we encountered an apparent software bug. The issue occurred on a Cisco 3850 running a 16.6 release, and again on 16.5 after rolling back to the original version, and it took a significant amount of time to resolve.
The project used a 3850 stack in a collapsed-core deployment, where it served as both core and distribution. The 3850 was connected by 10 Gb single-mode fiber to 11 different stacks, and used copper interfaces for the MPLS and Internet connections. Several servers, with roles including domain controller, DNS, and DHCP, were also attached to copper interfaces on the 3850.
When the topology went live, there were no initial issues, and everything tested successfully. Between two and three hours after the launch, the interfaces between the 3850 and the 2960X switches began to go into the “err disable” state due to UDLD. The interfaces that went into the err disable state were on the 2960X side of the links, so a quick refresher on UDLD was in order. UDLD works by sending what are called “hello” packets between both sides of the link. When one side fails to receive a “hello”, it concludes that the link may be unidirectional and puts the interface on that side into the err disable state. This suggested that the 3850 was not sending the required “hello” packets to the 2960Xs.
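The timeout behavior described above can be illustrated with a small simulation. This is only a toy sketch of the idea, not the real UDLD protocol: the `UdldPort` class, the `first_err_disable` function, and the 15-second hello interval and 45-second hold time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UdldPort:
    """Toy model of one side of a link running a UDLD-like check."""
    name: str
    hold_time: float = 45.0   # assumed: seconds without a hello before giving up
    last_hello: float = 0.0   # time the last hello was received
    err_disabled: bool = False

    def receive_hello(self, now: float) -> None:
        # A hello from the neighbor refreshes the timer.
        self.last_hello = now

    def check(self, now: float) -> None:
        # No hello within the hold time: assume the link may be
        # unidirectional and err-disable this side only.
        if now - self.last_hello > self.hold_time:
            self.err_disabled = True


def first_err_disable(core_stops_sending_at, duration=300, interval=15):
    """Return (port name, time) of the first err-disable, or None."""
    core = UdldPort("3850")
    access = UdldPort("2960X")
    for now in range(0, duration + 1, interval):
        # The access switch sends a hello every interval.
        core.receive_hello(now)
        # The core sends hellos only until it silently stops.
        if now < core_stops_sending_at:
            access.receive_hello(now)
        for port in (core, access):
            port.check(now)
            if port.err_disabled:
                return port.name, now
    return None


if __name__ == "__main__":
    # The core stops sending hellos 60 seconds in.
    print(first_err_disable(core_stops_sending_at=60))
```

In this model the side that stops hearing hellos is the one that err-disables, which matches the symptom appearing on the 2960X side while the 3850 was the device that had gone quiet.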
Troubleshooting began by bouncing the interfaces on the 2960Xs to see if they would recover, but the issue quickly returned. During this initial troubleshooting phase, the 3850 quit responding to all management access, including both SSH and console. The 3850 stack was then reloaded by removing its power connections. Once everything had reloaded, it seemed to be in a normal state, so no immediate action was taken other than watching for unusual logs. Nothing abnormal ever showed in the logs, but around one hour later the issue occurred again and the 3850 was reloaded. This time, UDLD was disabled immediately after the reload. Since UDLD err-disabled interfaces were the first symptom, this seemed like an appropriate next step.
This time the err disable symptom did not occur, but the 3850 still quit passing traffic. Originally, we thought traffic was failing only because the uplinks were in the “err disable” state. However, with that removed as a possibility, the 3850 would still stop responding to pings, routing traffic, and responding to management attempts via SSH or console. No logs or alerts of any kind were present on the 3850, and the cutover to the new network equipment was taking much longer than expected. The next thought was that one 3850 in the stack might be bad, with a CPU, memory, or other hardware issue. We separated the stack to focus on a single switch and chose the 12-port SFP+ 3850, since those connections were vital while the copper interfaces could be moved to a 2960X. The appropriate configuration changes were made and the single 3850 was reloaded on its own. The issue still occurred within 20 minutes. Luckily, there was a spare 3850 with 12 SFP+ interfaces, so we decided to give it a try.
Unfortunately, the issue also occurred on the spare 3850, which was running a slightly different software version, and only after a considerable amount of time had been spent retrieving it from spares, racking it, moving connections, and applying the configuration. With no logs being generated, a very straightforward configuration on the new switch, no Internet connection, and being in a foreign country, which limited other communication channels, we had limited time and resources to find the solution. With the issue still persisting, a new engineer came on board to look at it with a fresh perspective. After about 45 minutes of talking things out and walking through items in the configuration, a solution was reached: disable DHCP snooping.
This seemed like a very straightforward solution, as DHCP snooping has been linked to a few issues in the past, but the customer was using it successfully in several other locations. The 3850 was reloaded again, and DHCP snooping was quickly disabled before management access could fail. Up to this point, the 3850 had needed a reload about every 15 minutes. This time it passed the 15-minute mark and was still running. It then passed the 30-, 45-, and 60-minute marks, and we started to gain confidence in the solution. Several days later, everything was still working.
A Cisco bug ID does exist for DHCP snooping issues on these releases, and it is said to be fixed in 16.7 versions. That bug ID mentions that DHCP packets may not be forwarded by the device. Given the known Cisco bug and the issue we encountered, we recommend thoroughly testing DHCP snooping on any 3850 running 16.3-16.6 and being prepared to turn off DHCP snooping if issues arise.
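As a rough aid for applying that recommendation across an inventory, the release check can be sketched in a few lines. The function names and the version-string parsing here are assumptions for illustration. The 16.3-16.6 range is taken from the recommendation above and not from any Cisco tool, so confirm exposure against the actual bug ID.

```python
import re

# Range taken from the recommendation above (16.3 through 16.6, inclusive).
AFFECTED_MIN = (16, 3)
AFFECTED_MAX = (16, 6)


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '16.6.3'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_snooping_test(version: str) -> bool:
    """True if the release falls in the range worth testing carefully."""
    return AFFECTED_MIN <= major_minor(version) <= AFFECTED_MAX


if __name__ == "__main__":
    for v in ("16.5.1a", "16.6.3", "16.7.1"):
        print(v, needs_snooping_test(v))
```

Comparing `(major, minor)` tuples avoids the mistake of comparing version strings lexically, where "16.12" would sort before "16.6".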