Cisco adds two new high-end programmable Silicon One devices that can support massive GPU clusters for AI/ML workloads.

Cisco is taking the wraps off new high-end programmable Silicon One processors aimed at underpinning large-scale artificial intelligence (AI) and machine learning (ML) infrastructure for enterprises and hyperscalers.

The company has added the 5nm, 51.2Tbps Silicon One G200 and 25.6Tbps G202 to its now 13-member Silicon One family, which can be customized for routing or switching from a single chipset, eliminating the need for different silicon architectures for each network function. This is accomplished with a common operating system, P4 programmable forwarding code, and an SDK.

The new devices, positioned at the top of the Silicon One family, bring networking enhancements that make them ideal for demanding AI/ML deployments or other highly distributed applications, according to Rakesh Chopra, a Cisco Fellow in the vendor’s Common Hardware Group.

“We are going through this huge shift in the industry where we used to build these sorts of reasonably small high-performance compute clusters that seemed large at the time but nothing compared to the absolutely huge deployments required for AI/ML,” Chopra said. AI/ML models have grown from needing a few GPUs to needing tens of thousands linked in parallel and in series. “The number of GPUs and the scale of the network is unheard of.”

The new Silicon One enhancements include a P4-programmable parallel packet processor capable of launching more than 435 billion lookups per second.

“We have a fully shared packet buffer where every port has full access to the packet buffer regardless of what’s going on,” Chopra said. This is in contrast with allocating buffers to individual input and output ports, where the buffer available depends on which port the packets traverse. “That means that you’re less capable of riding through traffic bursts and more likely to drop a packet, which really decreases AI/ML performance,” he said.

In addition, each Silicon One device can support 512 Ethernet ports, letting customers build a 32K 400G GPU AI/ML cluster with 40% fewer switches than would be needed with other silicon devices, Chopra said.

Core to the Silicon One system is its support for enhanced Ethernet features such as improved flow control, congestion awareness, and congestion avoidance. The system also includes advanced load balancing and “packet spraying,” which spreads traffic across multiple GPUs or switches to avoid congestion and improve latency. Hardware-based link-failure recovery also helps ensure the network operates at peak efficiency, the company stated.

Combining these enhanced Ethernet technologies and taking them a step further ultimately lets customers set up what Cisco calls a Scheduled Fabric. In a Scheduled Fabric, the physical components (chips, optics, switches) are tied together like one big modular chassis and communicate with each other to provide optimal scheduling behavior, Chopra said. “Ultimately what it translates to is much higher bandwidth throughput, especially for flows like AI/ML, which lets you get much lower job-completion time, which means that your GPUs run much more efficiently.”

With Silicon One devices and software, customers can deploy as many or as few of these features as they need, Chopra said.
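The packet-spraying idea is easier to see with a toy model. The sketch below is illustrative Python only, not Cisco code, and the link counts and flow sizes are made-up assumptions; it contrasts conventional per-flow ECMP hashing, where a few large AI/ML flows can land on the same links, with per-packet spraying, which distributes every packet across all available links.

```python
import zlib
from collections import Counter

NUM_LINKS = 8            # hypothetical number of parallel uplinks between switches
NUM_FLOWS = 4            # a few large "elephant" flows, typical of AI/ML training traffic
PACKETS_PER_FLOW = 10_000

flows = [f"flow-{i}" for i in range(NUM_FLOWS)]

# Per-flow ECMP: every packet of a flow hashes to the same link,
# so a handful of large flows can pile onto a few links while others sit idle.
ecmp_load = Counter()
for flow in flows:
    link = zlib.crc32(flow.encode()) % NUM_LINKS
    ecmp_load[link] += PACKETS_PER_FLOW

# Per-packet spraying: each packet is distributed across all links
# (round-robin here), evening out utilization regardless of flow count.
spray_load = Counter()
next_link = 0
for flow in flows:
    for _ in range(PACKETS_PER_FLOW):
        spray_load[next_link] += 1
        next_link = (next_link + 1) % NUM_LINKS

print("Per-flow ECMP load per link: ", dict(sorted(ecmp_load.items())))
print("Per-packet spray load per link:", dict(sorted(spray_load.items())))
```

With only a handful of high-bandwidth flows, hash-based placement can saturate some links while leaving others empty; spraying keeps utilization even, which is the behavior Cisco is targeting, though in practice the fabric or endpoints must then tolerate out-of-order packet delivery.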
Cisco is part of a growing AI networking market, alongside Broadcom, Marvell, Arista and others, that is expected to hit $10 billion by 2027, up from about $2 billion today, according to a recent blog from 650 Group. “AI networks have already been thriving for the past two years. In fact, we have been tracking AI/ML networking for nearly two years and see AI/ML as a massive opportunity for networking and one of the main drivers for data-center networking growth in our forecasts,” the 650 blog stated. “The key to AI/ML’s impact on networking is the tremendous amount of bandwidth AI models need to train, new workloads, and the powerful inference solutions that appear in the market. In addition, many verticals will go through multiple digitization efforts because of AI during the next 10 years.”

The Cisco Silicon One G200 and G202 are being tested by unnamed customers and are available now on a sampling basis, according to Chopra.