NVIDIA's supercomputer in the cloud delivers full-stack AI development platform
NVIDIA DGX Cloud is an AI supercomputer in the cloud, designed for enterprise users with demanding needs and deep pockets. The offering comes as a complete software and hardware package for large-scale AI development, accessible via web browser.
DGX Cloud gives enterprises the power to train modern AI workloads such as generative AI and large language models, says Charlie Boyle, NVIDIA’s vice president of DGX Platforms. It combines an AI developer suite, workflow software, a high-performance infrastructure, direct access to NVIDIA AI experts, and 24/7 support.
Market impact of generative AI
Generative AI’s arrival has sparked a rapid increase in demand for AI-based products and services. As a result, companies are racing to acquire the skills and infrastructure needed to leverage AI in their product development processes and business operations.
With DGX Cloud, enterprises can obtain nearly instant access to a full-stack AI supercomputing environment without having to worry about software compatibility, optimization, data center space, power, cooling, or the expertise needed to install and maintain a supercomputer cluster, Boyle says. “It lets them focus on innovation rather than infrastructure and gets them working in days instead of months.”
Vladislav Bilay, a cloud solution engineer with Aquiva Labs, an app and software development services company, adds, “It enables researchers, developers, and data scientists to access and utilize NVIDIA’s DGX systems remotely, eliminating the need for costly on-premises hardware.”
Bilay says that DGX Cloud provides a seamless and scalable environment for training and deploying AI models, allowing users to leverage NVIDIA technologies and accelerate their workflows in a flexible and convenient manner.
One of DGX Cloud’s key advantages is its tight integration with popular AI frameworks and tools. “It supports frameworks like TensorFlow, PyTorch, and MXNet, allowing users to leverage their preferred libraries and APIs.” Bilay adds that DGX Cloud also provides access to NVIDIA’s comprehensive software stack, which includes drivers, libraries, and frameworks tailored for AI development.
Scott Lard, general manager and partner at IS&T, a Houston-based information systems and technology retained search and contingency staffing firm, adds that DGX Cloud provides an opportunity to leverage the power of high-performance computing (HPC) and AI without the need for expensive hardware investments.
“Users can tap into NVIDIA’s robust infrastructure, accessing powerful GPU resources remotely and accelerating their workloads, be it deep learning, data analytics, or scientific simulations,” he explains. “It’s like having a virtual AI powerhouse at your fingertips, ready to revolutionize your computing capabilities.”
Multiple components
DGX Cloud incorporates multiple, integrated components. Users access DGX Cloud from a web browser using NVIDIA Base Command Platform software. “This is the central hub of DGX Cloud, where multiple users manage their complete AI development workflows,” Boyle says. “It eliminates the complexity of resource sharing for large-scale AI training, leveraging multiple instances, known as ‘multi-node training’, which is often difficult to achieve, with an easy to use graphical user interface and integrated monitoring and reporting tools.”
DGX Cloud also incorporates NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform, which includes over 100 pretrained models, optimized frameworks and accelerated data science software libraries. These add-ins give developers an additional jump-start to their AI projects, Boyle notes.
Organizations rent multiple DGX Cloud instances and, in return, get dedicated, full-time access during the rental period, Boyle says. The instances automatically appear in Base Command Platform software, allowing users to submit and run jobs.
Each instance includes eight NVIDIA H100 or A100 80GB Tensor Core GPUs, for a total of 640GB of GPU memory per node. Boyle says that a high-performance, low-latency fabric, built with NVIDIA networking, ensures that workloads can scale across clusters of interconnected systems, allowing multiple instances to meet the performance requirements of advanced AI training. High-performance storage is also integrated within DGX Cloud.
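The per-node memory figure is simple multiplication, and the same arithmetic shows how aggregate GPU memory grows when several instances are rented. The sketch below is illustrative only: the function name and the four-instance example are assumptions, while the GPU count and per-GPU memory come from the figures above.

```python
# Figures from the article: eight GPUs per instance, 80GB of memory per GPU.
GPUS_PER_INSTANCE = 8
MEMORY_PER_GPU_GB = 80


def total_gpu_memory_gb(instances: int = 1) -> int:
    """Return aggregate GPU memory in GB across the given number of instances."""
    return instances * GPUS_PER_INSTANCE * MEMORY_PER_GPU_GB


print(total_gpu_memory_gb())   # one instance: 640
print(total_gpu_memory_gb(4))  # hypothetical four-instance rental: 2560
```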
From a financial angle, DGX Cloud offers several significant benefits. The approach eliminates the need for customers to invest in and manage their own expensive hardware infrastructure. “This translates to cost savings, increased flexibility, and scalability in their AI and deep learning endeavors,” Bilay explains.
DGX Cloud integrates with popular AI frameworks and tools, simplifying the development workflow. It also prioritizes security and data privacy, ensuring adopters can confidently work with sensitive data and models. “Overall, DGX Cloud empowers adopters by providing a high-performance, flexible, and user-friendly cloud platform tailored to their AI and deep learning needs,” Bilay says.
Serving a need, but not inexpensive
Boyle says that by providing dedicated AI supercomputing instances, DGX Cloud meets a critical need by allowing enterprises to stand up services rapidly and affordably. NVIDIA is partnering with leading cloud service providers including Oracle Cloud Infrastructure, Microsoft Azure and Google Cloud to host the DGX Cloud infrastructure.
DGX Cloud instances start at $36,999 per instance per month, with no additional fees for AI software or data transfers. That works out to roughly $444,000 a year for a single instance, and it’s a recurring cost.
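The yearly figure follows directly from the monthly starting price. As a rough sketch, assuming a flat rate with no discounts or term pricing (the article does not describe any), the cost for a given number of instances and months can be estimated as follows. The function name and defaults are illustrative, not part of any NVIDIA tooling.

```python
# Starting price quoted in the article, in US dollars per instance per month.
MONTHLY_PRICE_PER_INSTANCE_USD = 36_999


def rental_cost_usd(instances: int = 1, months: int = 12) -> int:
    """Estimate rental cost, assuming a flat monthly rate per instance."""
    return MONTHLY_PRICE_PER_INSTANCE_USD * instances * months


# One instance for one year: $443,988, which rounds to about $444,000.
print(f"${rental_cost_usd():,}")
```

Because the article notes that organizations rent multiple instances, the total scales linearly under this assumption: `rental_cost_usd(instances=4)` gives the estimate for a four-instance cluster over a year.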
When a user initiates a task, such as training an AI model, their work is processed on available DGX systems in the cloud. These systems feature high-performance NVIDIA GPUs specifically optimized for deep learning workloads. User data and models are securely transferred to DGX systems, where the computation takes place.
DGX Cloud supports major AI platforms and tools, ensuring compatibility with the user’s preferred libraries and APIs. This allows users to seamlessly develop and deploy their AI models in the cloud, Bilay says.
Getting started
Boyle says that customers and their teams can get up to speed quickly. NVIDIA offers eight interconnected GPUs per instance and provides access at scale in every region where DGX Cloud is hosted. The service’s network fabric is based on NVIDIA’s own technology, which Boyle claims delivers a high-bandwidth, low-latency interconnect that’s optimized for multi-node training. He also points to a simple user interface that allows users to run multi-node training jobs.
A multi-cloud approach avoids lock-in with any one cloud provider, Boyle says. “The DGX Cloud Base Command Platform provides a single pane view for hybrid cloud management across cloud and on-prem resources.”
Other considerations and caveats
DGX Cloud isn’t the only player offering this type of service. Major competitors include Google Cloud AI Platform, Amazon AWS Deep Learning AMIs, Microsoft Azure Machine Learning, and IBM Watson Studio. “These platforms provide similar capabilities, such as scalable computing resources, integration with popular AI frameworks, and support for deep learning workflows,” Bilay says.
The cost of deploying and using DGX Cloud varies depending on factors such as the subscription plan, resource allocation, and usage duration. NVIDIA offers different pricing models and plans tailored to the specific needs of users, Bilay says.
Embracing a cloud solution makes users dependent on the service provider’s infrastructure and support, Bilay cautions. Failures and technical issues on the provider’s end can reduce platform availability and performance, potentially disrupting a project’s execution and timing.
Perhaps more ominously, particularly for organizations with strict data privacy or compliance requirements, using a cloud platform can raise data security and privacy concerns. “While NVIDIA DGX Cloud implements security measures, it’s important for users to evaluate the platform’s security protocols and ensure they meet their specific compliance requirements,” Bilay advises.