Michael Cooney
Senior Editor

Optical networking challenges gain attention as AI networking demands rise

News | 26 Jun 2024 | 8 mins
Network Switches | Networking | Networking Devices

Demand for higher speeds, AI network development and energy efficiency are driving advances in optical networking technologies.


As large enterprise and hyperscaler networks process increasingly greater AI workloads and other applications that require high-bandwidth performance, the demand for optical connectivity technologies is growing as well.

Ultimately, optical is the only type of connectivity technology that can deliver the capacity organizations require, over the distances needed, to connect data centers, servers, routers, switches and all of the distributed components that make up today’s network architectures, said Bill Gartner, senior vice president and general manager of Cisco’s optical systems and optics group.

But greater use of fiber optics in networks is not without its challenges.

Providers are making plans to effectively and sustainably move to higher speeds such as 400G Ethernet, 800G Ethernet and beyond, while at the same time they’re also trying to develop advanced technologies to support AI networks. Efforts to develop more energy efficient technologies for optical networks and interfaces are also in the works.

Optical circuit switches are currently being offered or developed by vendors including Cisco, Calient Networks, Broadcom, Nvidia, and Telescent. Google, too, has built its own optical circuit switching platform, called Apollo, and runs it in production. While the need to support high bandwidth and speed is critical, these players are also focused on improving energy usage.

Google advances Apollo optical circuit switching

In a recent blog about Apollo, Google stated that traditional networks use a “Clos” topology, also known as a spine and leaf configuration, to connect all servers and racks within a data center, while its Apollo platform uses optical circuit switching (OCS) for data center networking:

“In a spine and leaf architecture, compute resources – racks of servers equipped with CPUs, GPUs, FPGAs, storage, and/or ASICs – are connected to leaf or top-of-rack switches, which then connect through various aggregation layers to the spine,” Google wrote. “Traditionally, the spine of this network uses Electronic Packet Switches (EPS), which are standard network switches provided by companies like Broadcom, Cisco, Marvell, and Nvidia. However, these EPS consume a significant amount of power.”
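To put rough numbers on why the spine layer matters, the back-of-the-envelope sketch below counts the ports and transceivers a leaf-spine fabric needs just to reach the spine. The rack count, uplink count, and the assumption that every leaf-to-spine link uses a transceiver at both ends are illustrative choices for this example, not Google's figures:

```python
# Illustrative sketch: port and transceiver counts in a leaf-spine fabric.
# All input values are assumptions for the example, not measured data.

def leaf_spine_ports(racks: int, servers_per_rack: int, uplinks_per_leaf: int):
    """Count server-facing ports, leaf-to-spine links, spine ports, and optics."""
    leaf_downlink_ports = racks * servers_per_rack   # server-facing leaf ports
    leaf_uplink_ports = racks * uplinks_per_leaf     # leaf-to-spine links
    spine_ports = leaf_uplink_ports                  # each uplink terminates on a spine port
    transceivers = 2 * leaf_uplink_ports             # an optical module at both ends of each link
    return leaf_downlink_ports, leaf_uplink_ports, spine_ports, transceivers

if __name__ == "__main__":
    down, up, spine, optics = leaf_spine_ports(racks=128, servers_per_rack=32, uplinks_per_leaf=16)
    print(f"server ports: {down}, leaf uplinks: {up}, spine ports: {spine}, transceivers: {optics}")
```

Even at this modest scale, thousands of electrically switched spine ports and spine-side optics sit in the path, which is the power and cost that Apollo targets.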

“Apollo is believed to be the first large-scale deployment of optical circuit switching (OCS) for data center networking. The Apollo OCS platform includes a homegrown, internally developed OCS, circulators, and customized wavelength-division-multiplexed (WDM) optical transceiver technology that supports bidirectional links through the OCS and circulators. Apollo has served as the backbone of all Google data center networks, having been in production for nearly a decade, supporting all data center use cases.

“Incorporating the Apollo OCS layer replaces the spine blocks, resulting in significant cost and power savings by eliminating the electrical switches and optical interfaces used in the spine layer. Google uses these optical switches in a direct connect architecture to link the leaves through a patch panel. This method is not packet switching; it functions as an optical cross-connect,” Google stated.

“OCS switches offer high bandwidth and low network latency, along with a significant reduction in capital expenditures. This is due to their ability to reduce the number of required electrical switches, thereby eliminating costly optical-to-electrical-to-optical conversions,” said Sameh Boujelbene, vice president with the Dell’Oro Group. “Moreover, unlike electrical switches, OCS switches do not need frequent upgrades when servers adopt next-generation optical transceivers.”
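A rough sketch of where those savings come from is below. Every power figure in it is an assumed placeholder chosen to make the comparison concrete, not a Google, Dell'Oro, or vendor number: replacing an electrical spine removes the spine switch ports and the spine-side transceivers, and with them one optical-electrical-optical conversion per link.

```python
# Illustrative comparison of spine power: electrical packet switches vs. OCS.
# All wattages and link counts are assumptions for the example only.

LINKS = 2048            # leaf-to-spine links in a hypothetical fabric
EPS_PORT_W = 15.0       # assumed power per electrical spine switch port
TRANSCEIVER_W = 12.0    # assumed power per spine-side optical transceiver
OCS_PORT_W = 0.1        # an OCS port steers light passively; near-zero steady-state power

eps_spine_power = LINKS * (EPS_PORT_W + TRANSCEIVER_W)  # spine ports + spine-side optics
ocs_spine_power = LINKS * OCS_PORT_W                    # mirrors hold position between reconfigurations

print(f"electrical spine: ~{eps_spine_power / 1000:.1f} kW")
print(f"OCS spine:        ~{ocs_spine_power / 1000:.2f} kW")
```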

However, OCS is still an emerging technology. “To date, only Google has managed to deploy them at scale in its data center networks after many years of development. Additionally, OCS switches may necessitate changes to the existing fiber infrastructure, depending on the cloud service provider,” Boujelbene said.

“OCS switches have been deployed at Google in the spine layer, but with the emergence of AI applications we see them being deployed more inside AI clusters because of the benefits they bring,” Boujelbene said.

Standardizing optical transport technologies

Requirements for higher speed Ethernet networking equipment are evolving as AI networks expand. For example, there’s rising demand for 800G Ethernet employing 800ZR high-speed optical transmission technology and OpenZR+, the industry initiative to develop interoperable standards for coherent optical transceivers.

At the 400G Ethernet level, 400ZR has been “a great success for the coherent pluggable industry with multiple suppliers and a tremendous volume of 400ZR QSFP-DD and OSFP modules deployed in metro DCI [data center interconnect] applications,” according to Cisco’s Acacia website. (Cisco acquired optical maker Acacia Communications in a $4.5 billion deal in 2021.)

“Network grade pluggable optics such as 400ZR and others will see significant uptick in deployments in 2024 in communication service provider networks,” IDC reported recently. 

Effectively tying together dispersed data centers via DCI will be a key driver for AI and fiber optic networks as the distance between AI data centers becomes an issue, Gartner said.

The capacity of these links will need to increase with AI applications, Cisco’s Gartner said. “Right now, we’ve got 400 gig on one wavelength, but the industry wants much better performance, lower costs, better density, and that will happen too,” Gartner said. “So, what comes out initially might be optimized for five nanometer. We need to do better, and there is going to be progression on this technology.”

AI cluster sizes growing

The scale of emerging AI applications appears to be expanding exponentially, with the number of parameters that these applications have to process growing 1000X every 2 to 3 years, according to Boujelbene. “Consequently, the average size of AI clusters in terms of number of accelerators is quadrupling every 2 years, evolving from a typical size of 256 to 1000, then rapidly to 4K, and now some clusters boast 32K and 64K accelerators.”
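Those growth rates compound quickly. A quick check of the arithmetic shows that quadrupling every two years takes a 256-accelerator cluster into the 64K range in roughly eight years, which lines up with the progression Boujelbene describes:

```python
# Quick check of the quoted growth rate: cluster size quadrupling every two years.
size = 256
for year in range(0, 10, 2):
    print(f"year {year}: ~{size} accelerators")
    size *= 4   # 4x every two years
# Output: 256, 1024, 4096, 16384, 65536 at years 0, 2, 4, 6, 8.
```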

At the Optical Fiber Communication Conference (OFC) in 2023, vendors introduced numerous 1.6 Tbps optical components and transceivers based on 200G per lambda, and a number of those 1.6 Tbps products were demonstrated at OFC 2024, Boujelbene wrote in a blog about the show.
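The module speeds follow from simple lane arithmetic: per-module throughput is the number of wavelengths (lambdas) multiplied by the per-lane rate. The 3.2 Tbps combinations in the sketch below are illustrative possibilities of the kind the industry is weighing, not announced products:

```python
# Lane arithmetic behind pluggable module speeds: lanes (lambdas) x per-lane rate.
def module_gbps(lanes: int, gbps_per_lane: int) -> int:
    return lanes * gbps_per_lane

print(module_gbps(8, 100))   # 800G from 8 x 100G lanes
print(module_gbps(8, 200))   # 1.6T from 8 x 200G lanes
print(module_gbps(16, 200))  # one possible 3.2T path: more lanes
print(module_gbps(8, 400))   # another possible path: faster lanes (400G per lambda)
```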

“While we don’t anticipate volume shipment of 1.6 Tbps until 2025/2026, the industry must already begin efforts towards achieving 3.2 Tbps and exploring various paths and options to reach this milestone,” Boujelbene wrote.

“This sense of urgency arises from a combination of factors, including the exponential growth in bandwidth demand within AI clusters and the escalating power and cost concerns associated with higher speeds.”

In Dell’Oro’s recently published “AI Networks for AI Workloads” report, the researchers forecast that by 2025 the majority of ports in AI networks will be 800 Gbps, and by 2027 the majority will be 1,600 Gbps, showing very fast adoption of the highest speeds available in the market, Boujelbene stated.

However, the increase in optic speed is challenged by a significant increase in cost and power consumption. Substantial investments in AI infrastructure are accelerating the development of innovative optical connectivity solutions tailored to meet the demands of AI clusters while solving some of the cost and power consumption challenges, Boujelbene said.

LPOs vs. CPOs

While many optics and AI networking issues are still on the horizon, a more immediate one is the contest between Linear Drive Pluggable Optics (LPOs) and Co-Packaged Optics (CPOs). LPOs drive the optical module directly from the switch ASIC’s SerDes, eliminating the digital signal processor (DSP) found in conventional pluggable modules. CPOs integrate the optical components directly into the switch ASIC package.

Both technologies have their place in optical networks, experts say, as both promise to reduce power consumption and support improved bandwidth density. Both have advantages and disadvantages as well – CPOs are more complex to deploy given the amount of technology included in a CPO package, whereas LPOs promise more simplicity. 
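For a rough sense of the trade-off, the sketch below models per-port power for the three approaches. Every wattage in it is an assumed placeholder chosen only to show the shape of the comparison, not a measured or vendor-supplied figure:

```python
# Illustrative per-port power comparison: DSP-based pluggable vs. LPO vs. CPO.
# All wattages are placeholder assumptions, not vendor or measured data.

PORTS = 64  # hypothetical switch port count

approaches = {
    "DSP pluggable": {"module_w": 14.0, "host_serdes_w": 2.0},  # re-timing DSP lives in the module
    "LPO":           {"module_w": 8.0,  "host_serdes_w": 2.5},  # no module DSP; host SerDes works harder
    "CPO":           {"module_w": 6.0,  "host_serdes_w": 1.5},  # optics co-packaged, short electrical path
}

for name, w in approaches.items():
    per_port = w["module_w"] + w["host_serdes_w"]
    print(f"{name:14s} ~{per_port:4.1f} W/port, ~{per_port * PORTS:6.1f} W across {PORTS} ports")
```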

Backers of LPO have been pushing that technology hard this year. For example, in March a group of 12 core optics vendors, including Cisco, Broadcom, Intel, Nvidia, Arista and AMD, formed the Linear Pluggable Optics Multi-Source Agreement group to further LPO technology development.

The LPO group is developing specifications for linear pluggable optics used across a range of networking equipment, such as switches, NICs, and GPU-based systems, aimed at high-speed, high-volume applications such as AI and high-performance computing.

“There is an urgent need to reduce the network power consumption for AI and other high-performance applications,” Mark Nowell, LPO MSA Chair said in a statement. “LPO materially reduces power consumption both for the module and the system while maintaining a pluggable interface, providing the economics and flexibility that customers need for high-volume deployments.”

Indeed, both LPO and CPO aim to reduce power and potentially the cost of optics when moving to higher speeds. However, multivendor support, time-to-market, serviceability, manufacturability, and testability are critical requirements for volume adoption, Boujelbene said. “LPO appears to be ahead of CPO in meeting these requirements because it retains a pluggable form factor (only the DSP is removed). Therefore, we expect LPO to achieve volume deployment before CPO.”
