Michael Cooney
Senior Editor

Arista lays out AI networking plans

News
09 Apr 2024 | 5 mins
Networking

Arista’s Etherlink technology will be supported across a range of products, including 800G systems and line cards, and will be compatible with specifications from the Ultra Ethernet Consortium.


Arista Networks has offered a look at how it expects to roll out Ethernet technology that will underpin the networks required to handle the demands of AI-based workloads.

The new Arista Etherlink platform will include a broad range of 800G systems and line cards based on the company’s EOS operating system – which ultimately will include supercharged Ethernet features compatible with specifications from the Ultra Ethernet Consortium (UEC), according to Arista CEO Jayshree Ullal, who authored a recent blog post. “As the UEC completes its extensions to improve Ethernet for AI workloads, Arista assures customers that we can offer UEC-compatible products, easily upgradable to the standards as UEC firms up in 2025,” Ullal wrote.

The UEC was founded last year by AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft, and it now includes more than 50 vendors. The consortium is developing technologies aimed at increasing the scale, stability, and reliability of Ethernet networks to satisfy AI’s high-performance networking requirements. Later this year, it plans to release official specifications that will focus on a variety of scalable Ethernet improvements, including better multi-path and packet delivery options as well as modern congestion and telemetry features.

Across the Arista Etherlink portfolio, UEC-compatible features would include dynamic load balancing, congestion control, and reliable packet delivery, Ullal stated.

“AI workloads push the ‘collective’ operation, where allreduce and all-to-all are the dominant collective types. Today’s models are already moving from billions to one trillion parameters with GPT-4. Of course, we have others such as Google Gemini, open source Llama and xAI’s Grok,” Ullal wrote. “During the compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown due to a poor network can critically impact the AI application performance. The Arista Etherlink AI topology will allow every flow to simultaneously access all paths to the destination with dynamic load balancing at multi-terabit speeds.”
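The per-flow multipath behavior Ullal describes can be illustrated with a short sketch. This is a toy model of dynamic load balancing, not Arista's implementation: instead of pinning a flow to one uplink with a static hash, the least-loaded path is chosen, with the static hash used only as a tie-breaker so stable flows are not needlessly moved. The function name and load representation are assumptions for illustration.

```python
import hashlib

def pick_path(flow_id: str, path_loads: list[float]) -> int:
    """Steer a flow to the least-loaded of the equal-cost paths.

    path_loads holds the current fractional utilization of each uplink.
    """
    # Static ECMP would hash the flow identifier once and stick with it:
    static_choice = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16) % len(path_loads)
    # Dynamic load balancing instead ranks paths by current load,
    # preferring the static choice only to break ties.
    return min(range(len(path_loads)),
               key=lambda i: (path_loads[i], i != static_choice))

loads = [0.9, 0.2, 0.5, 0.2]  # fractional utilization per uplink
print(pick_path("gpu7->gpu42:allreduce", loads))
```

A real fabric makes this decision in hardware per flow (or per packet group) at multi-terabit speeds; the sketch only shows why load-aware path selection avoids the slow-flow straggler problem the quote describes.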

“Arista Etherlink supports a radix from 1,000 to 100,000 GPU nodes today, which will go to more than one million GPUs in the future,” Ullal added.

According to Ullal, two additional key features of Arista’s Etherlink platforms are:

  • Predictable latency: “Rapid and reliable bulk transfer from source to destination is key to all AI job completion. Per-packet latency is important, but the AI workload is most dependent on the timely completion of an entire processing step. In other words, the latency of the whole message is critical. Flexible ordering mechanisms use all Etherlink paths from the NIC to the switch to guarantee end-to-end predictable communication.”
  • Congestion management: “Managing AI network congestion is a common ‘incast’ problem. It can occur on the last link of the AI receiver when multiple uncoordinated senders simultaneously send traffic to it. To avoid hotspots or flow collisions across expensive GPU clusters, algorithms are being defined to throttle, notify, and evenly spread the load across multipaths, improving the utilization and TCO of these expensive GPUs with a VoQ fabric,” Ullal wrote. The Arista Virtual Output Queuing (VoQ) fabric features a distributed scheduling mechanism that guarantees traffic flow delivery in congested switch ports.  
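The incast scenario above can be sketched in a few lines. This is a deliberately simple receiver-driven throttle in the spirit of the "throttle, notify, and evenly spread" algorithms Ullal mentions, not Arista's actual mechanism: each sender converging on one receiver link is granted at most an even share of that link's capacity, so the last hop can never be oversubscribed.

```python
def grant_rates(sender_demands_gbps: dict[str, float],
                receiver_link_gbps: float) -> dict[str, float]:
    """Cap each sender at an even share of the receiver's link capacity."""
    fair_share = receiver_link_gbps / len(sender_demands_gbps)
    return {sender: min(demand, fair_share)
            for sender, demand in sender_demands_gbps.items()}

# Four uncoordinated senders target one 800G receiver port:
demands = {"gpu1": 400.0, "gpu2": 400.0, "gpu3": 100.0, "gpu4": 50.0}
print(grant_rates(demands, 800.0))
# the two 400G senders are each capped at 200 Gbps; the link is never overrun
```

A production scheme would redistribute the headroom left by senders below their share (max-min fairness), which is what a VoQ fabric's distributed scheduler effectively achieves per egress port.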

Arista AI networking also depends on a combination of the vendor’s core EOS operating system and its natural-language, generative AI-based Autonomous Virtual Assist (AVA) system for delivering network insights, Ullal wrote.   

“Arista AVA imitates human expertise at cloud scale through an AI-based expert system that automates complex tasks like troubleshooting, root cause analysis, and securing from cyber threats,” Ullal wrote. “It starts with real-time, ground-truth data about the network devices’ state and, if required, the raw packets. AVA combines our vast expertise in networking with an ensemble of AI/ML techniques, including supervised and unsupervised ML and NLP (Natural Language Processing). Applying AVA to AI networking increases the fidelity and security of the network with autonomous network detection and response and real-time observability.”

Regarding Arista’s EOS software stack, Ullal said it can help customers build resilient AI clusters. “EOS offers improved load balancing algorithms and hashing mechanisms that map traffic from ingress host ports to the uplinks so that flows are automatically re-balanced when a link fails,” Ullal wrote. “Our customers can now pick and choose packet header fields for better entropy and efficient load-balancing of AI workloads.”
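The mechanism Ullal describes can be sketched as follows (an illustrative model, not EOS code; the field names and port labels are assumptions): operator-selected header fields feed a hash that maps each flow onto one of the currently live uplinks, so when a link fails the same flow simply hashes over the surviving links and traffic re-balances automatically.

```python
import hashlib

def select_uplink(headers: dict, fields: list[str],
                  live_uplinks: list[str]) -> str:
    """Hash the chosen header fields onto the set of live uplinks."""
    key = "|".join(str(headers[f]) for f in fields)
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return live_uplinks[digest % len(live_uplinks)]

pkt = {"src_ip": "10.0.0.7", "dst_ip": "10.0.1.9",
       "src_port": 49152, "dst_port": 4791, "proto": "UDP"}
uplinks = ["Et1", "Et2", "Et3", "Et4"]

# Operators "pick and choose" which fields feed the hash for entropy:
print(select_uplink(pkt, ["src_ip", "dst_ip", "src_port", "dst_port"], uplinks))
# After a link failure, the same flow maps onto the surviving links:
print(select_uplink(pkt, ["src_ip", "dst_ip", "src_port", "dst_port"], uplinks[:3]))
```

Including transport-layer ports in the hash matters for AI traffic (often RDMA over a small set of IP pairs), because IP addresses alone give too little entropy to spread flows across uplinks.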

AI network visibility is another critical aspect in the training phase for large datasets used to improve the accuracy of LLMs, according to Ullal. “In addition to the EOS-based Latency Analyzer that monitors buffer utilization, Arista’s AI Analyzer monitors and reports traffic counters at microsecond-level windows. This is instrumental in detecting and addressing microbursts which are difficult to catch at intervals of seconds,” Ullal wrote. 
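Why microsecond-level windows matter can be shown with a small sketch (synthetic numbers, not Arista's AI Analyzer): byte counters sampled per microsecond expose bursts that a per-second average would completely hide.

```python
def find_microbursts(bytes_per_window: list[int],
                     window_us: float,
                     line_rate_gbps: float,
                     threshold: float = 0.9) -> list[int]:
    """Return the indices of windows whose utilization exceeds the threshold."""
    # Bytes the link can carry in one window:
    capacity_bytes = line_rate_gbps * 1e9 / 8 * (window_us * 1e-6)
    return [i for i, b in enumerate(bytes_per_window)
            if b > threshold * capacity_bytes]

# An 800 Gb/s link carries 100,000 bytes per 1 µs window.
samples = [2_000, 3_000, 99_500, 98_000, 2_500]  # bytes observed per window
print(find_microbursts(samples, 1.0, 800.0))     # -> [2, 3]
```

Averaged over these five windows, the link looks only about 41% utilized; the two near-line-rate microseconds, which are what fill buffers and drop packets, are visible only at this resolution.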

In general, AI training clusters require a fundamentally new approach to building networks, “given the massively parallelized workloads” that can cause congestion, according to Ullal. “Traffic congestion in any single flow can lead to a ripple effect slowing down the entire AI cluster, as the workload must wait for that delayed transmission to complete. AI clusters must be architected with massive capacity to accommodate these traffic patterns from distributed GPUs, with deterministic latency and lossless deep buffer fabrics designed to eliminate unwanted congestion,” she wrote.
