
Artificial intelligence is remaking every aspect of technology. As the large language models (LLMs) behind AI grow in size, gain more parameters, and produce ever more realistic responses, they create a parallel need for more computing power. That power can no longer be provided by a single computer, so AI/ML companies have turned to supercomputers and cloud infrastructure to train their large-scale language models.
As a software engineer, Soumya Pani has worked extensively on cloud infrastructure and on integrating AI/ML into it. He currently works for Google in Seattle, Washington, where he is responsible for the rapid integration of graphics processing unit (GPU) technologies and for supporting AI/ML customers in deploying their workloads on the Google Cloud Platform. Some of these clients, like Anthropic and MosaicML, have sizable LLMs that require more than 5,000 GPUs to train.
Training at that scale requires hundreds of machines and complex network setups. At Google, Pani has developed an open-source tool that makes it easier to onboard AI/ML customers. Called the Cluster Provisioning Tool (CPT), it has already been widely adopted by Google Cloud Platform customers, earning Pani quite a few accolades.
We recently sat down with Pani to discuss the ways in which AI/ML is impacting cloud infrastructure, and how the tool he has developed can be used to overcome current challenges. Below is an edited transcript of that interview, in Pani's words. The specific issue I was trying to address is this: AI/ML models can be trained on a single machine, but the number of parameters, and hence the size of the model, is limited when you rely on a single machine alone.
However, if you train AI/ML models on the cloud, you can utilize thousands of machines, each with multiple GPUs. As a result, you can train on a large dataset and create highly accurate, large-scale AI/ML models. To streamline this process, I decided to create an open-source tool that builds a multi-machine AI/ML cloud infrastructure on the Google Cloud Platform.
It is already used by AI/ML companies like MosaicML and Anthropic. As the scale of a Google Kubernetes Engine (GKE) cluster hosting multi-node AI/ML workloads increases, the complexity of the environment setup grows dramatically. The CPT tool I developed makes cluster provisioning straightforward and standardized.
This involves dynamic cluster autoscaling: the tool integrates with GKE cluster and node pool provisioning to support both horizontal scaling, where GPU nodes are added and removed based on workload demand, and vertical scaling, which allows ML workloads to dynamically request larger machine types with more virtual CPUs, memory, and accelerators. For example, if a training job suddenly requires 1,000 GPUs, the cluster auto-scales by provisioning additional nodes while maintaining cost efficiency. The CPT also supports GPU-aware scheduling and multi-GPU workload distribution; hybrid and multi-zone scaling for global workloads, which enables AI training across multiple GCP regions; and integration with the Google Cloud AI Hypercomputer, allowing workload distribution across global GPU clusters and enabling massive-scale distributed model training.
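For illustration, the horizontal-scaling behavior described above could be sketched with the standard Terraform google provider. This is not the CPT tool itself, and all names, regions, and counts here are illustrative assumptions:

```hcl
# Hypothetical sketch of a GPU node pool with horizontal autoscaling
# (standard Terraform google provider; cluster name and sizes are illustrative).
resource "google_container_node_pool" "gpu_pool" {
  name     = "a100-training-pool"
  cluster  = "ml-training-cluster"   # assumed pre-existing GKE cluster
  location = "us-central1"

  autoscaling {
    min_node_count = 0     # scale to zero when no training jobs run
    max_node_count = 125   # e.g. 125 nodes x 8 GPUs = 1,000 GPUs
  }

  node_config {
    machine_type = "a2-highgpu-8g"   # A2 family: 8 NVIDIA A100 GPUs per node
    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 8
    }
  }
}
```

With min/max bounds like these, GKE can provision up to the 1,000-GPU figure mentioned above on demand and release the nodes afterward, which is what keeps the setup cost-efficient.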
The key optimizations the tool provides are resource utilization and cost optimization. The CPT tool is integrated with GKE node pools, which dynamically allocate and release resources based on workload requirements. For data pipeline and storage optimization, the tool is integrated with Google Cloud Storage and parallel file systems for high-speed data access across regions.
The tool also supports network optimization for the GPU communication involved in multi-node training. This is essential for enabling large-scale training workloads to make progress on a multi-node, multi-GPU cluster. There are always new GPU virtual machine types, new NVIDIA drivers, and improved networking to support larger numbers of new NVIDIA GPUs on the Google Cloud Platform.
It's a real challenge to keep the tool integrated with GCP offerings. The way we address this problem is by dogfooding the tool ourselves, using it to create and test integration environments within the team. That way, new functionality is integrated into the tool as soon as it becomes publicly available.
We also restrict the tool to GPU virtual machine families. There are many specialized setups required for GPU clusters, and by limiting the tool to GPU clusters, we capture the nuances of GPU-specific cluster setup.
To accomplish this, new NVIDIA drivers and network drivers are first integrated and tested in the tool so that we can provide performance guarantees by enforcing optimal setup steps via the tool. The CPT tool automated much of the setup process, which significantly reduced the time it takes to create AI/ML clusters for these customers. Otherwise, cluster setup would be a complicated and error-prone manual process.
The CPT tool still gives customers the flexibility they need by allowing them to configure, recreate, or update existing AI/ML clusters via simple configuration. The configuration-driven cluster provisioning process the CPT tool supports is also repeatable, so it's easy for Google engineers to use a customer's configuration to recreate their cluster and investigate issues when the customer runs into problems with their environment setup.
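Configuration-driven, repeatable provisioning of this kind might look like the following minimal Terraform sketch, where the customer's inputs are captured as variables so the same cluster can be replayed for debugging. CPT's actual configuration format is not public; every name and default below is an illustrative assumption:

```hcl
# Hypothetical sketch: customer-supplied values captured as variables,
# so the identical cluster can be reproduced from configuration alone.
variable "cluster_name" { default = "customer-ml-cluster" }
variable "region"       { default = "us-central1" }

resource "google_container_cluster" "ml_cluster" {
  name     = var.cluster_name
  location = var.region

  # Drop the default node pool; GPU node pools are managed separately.
  remove_default_node_pool = true
  initial_node_count       = 1
}
```

Because the cluster is fully described by declarative configuration, re-running the same inputs yields the same environment, which is what makes customer-issue reproduction straightforward.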
The CPT tool exposes observability signals via GKE, Ray, and Slurm to make it easy to investigate AI/ML workload execution. It is integrated with many GPU-native technologies to manage resource allocation and ensure the efficient use of resources. Because it is integrated with GKE node pools, it dynamically allocates and releases resources based on workload requirements.
The tool is also integrated with managed instance groups (MIGs), a GCP-native technology, for the A2 and A3 virtual machine families. MIG integration provides autoscaling of virtual machines, allowing nodes to scale for AI/ML training workloads. Finally, the tool is integrated with Google Cloud Storage and parallel file systems for high-speed data access across regions.
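A MIG with an attached autoscaler, as described above, could be sketched with the standard Terraform google provider. This assumes a GPU instance template defined elsewhere, and all names and thresholds are illustrative, not CPT's actual settings:

```hcl
# Hypothetical sketch: a managed instance group (MIG) of GPU VMs plus an
# autoscaler. Assumes google_compute_instance_template.gpu_template exists.
resource "google_compute_instance_group_manager" "gpu_mig" {
  name               = "a3-training-mig"
  zone               = "us-central1-a"
  base_instance_name = "a3-worker"
  version {
    instance_template = google_compute_instance_template.gpu_template.id
  }
}

resource "google_compute_autoscaler" "gpu_autoscaler" {
  name   = "a3-training-autoscaler"
  zone   = "us-central1-a"
  target = google_compute_instance_group_manager.gpu_mig.id
  autoscaling_policy {
    min_replicas = 1
    max_replicas = 64
    cpu_utilization {
      target = 0.7   # illustrative: scale out above ~70% average CPU
    }
  }
}
```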
The CPT tool uses GCP-native resources and hence is integrated with state-of-the-art GCP security systems. Individual network interface cards and virtual networks are set up for each cluster the tool creates. This ensures the right firewall rules are configured and blocks external access to the AI/ML cluster.
Google Cloud Platform's identity and access management (IAM) ensures that only authorized users and services can access GPU resources. Google Cloud Storage provides encryption at rest and in transit to protect the training data throughout AI/ML workload execution. And GKE and MIGs provide additional security.
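The security posture described here — an internal-only firewall on a per-cluster network plus IAM-scoped access — could be sketched like this with the standard Terraform google provider. Network names, project IDs, and the service account are illustrative assumptions:

```hcl
# Hypothetical sketch: internal-only firewall rule on an assumed
# per-cluster VPC, blocking external access to the AI/ML cluster.
resource "google_compute_firewall" "cluster_internal_only" {
  name    = "ml-cluster-internal"
  network = "ml-cluster-vpc"       # assumed per-cluster virtual network
  allow {
    protocol = "tcp"
  }
  source_ranges = ["10.0.0.0/8"]   # internal traffic only; nothing external
}

# IAM binding limiting GPU cluster access to an authorized service account
# ("example-ml-project" and the account name are illustrative).
resource "google_project_iam_member" "training_access" {
  project = "example-ml-project"
  role    = "roles/container.developer"   # least-privilege GKE access
  member  = "serviceAccount:trainer@example-ml-project.iam.gserviceaccount.com"
}
```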
The CPT tool provides a single-step, configuration-based solution to the otherwise complicated and error-prone process of setting up multi-node, multi-GPU AI/ML clusters. There are many planned future developments for the tool. It supports Terraform natively and can be integrated with existing Terraform modules.
So I'm planning on integrating the tool with broader Google tools for AI/ML workloads. It is currently being integrated with Vertex AI, which provides integration with pre-existing, pre-trained models.