AI on EKS: Revolutionizing Scalable AI Workloads on Amazon EKS

Introduction to AI on EKS

The landscape of Artificial Intelligence (AI) and Machine Learning (ML) is evolving at an unprecedented pace, driving a growing demand for specialized solutions and faster innovation. To meet this demand and streamline the deployment of complex AI/ML workloads, Amazon Web Services (AWS) has introduced "AI on EKS." This initiative is designed to provide organizations with robust, scalable, and optimized environments for running AI/ML applications directly on Amazon Elastic Kubernetes Service (EKS). By leveraging the power of Kubernetes and the extensive capabilities of AWS, AI on EKS aims to simplify the infrastructure challenges associated with AI/ML, enabling teams to focus on developing and deploying cutting-edge AI solutions.

Addressing the Challenges of AI/ML Workloads on Kubernetes

As AI/ML models become more sophisticated and data volumes increase, organizations encounter significant infrastructure and operational hurdles when deploying these workloads on Kubernetes. While Amazon EKS offers a powerful and flexible managed Kubernetes platform, running large-scale AI systems—particularly those involving massive models, specialized accelerators like GPUs, and distributed processing—requires more than just a standard Kubernetes setup. Common pain points include the complexities of managing GPU infrastructure, the rising costs associated with inefficient resource scaling, and the integration of various fragmented open-source tools that may not always work seamlessly together. "AI on EKS" directly addresses these challenges by offering deeply integrated blueprints, proven best practices, and Infrastructure as Code (IaC) templates.

Key Capabilities of AI on EKS

AI on EKS provides a comprehensive suite of features to empower AI/ML development and deployment on Amazon EKS. These capabilities are meticulously crafted to address the end-to-end lifecycle of AI/ML workloads:

  • Provisioning EKS Clusters: Facilitates the creation of secure, scalable EKS clusters that are specifically configured to support multiple accelerators, such as GPUs, essential for AI/ML tasks.
  • Reference Architectures: Offers pre-tested and validated architectures that serve as proven environments, enabling teams to quickly start their AI/ML workloads with a reliable foundation.
  • Composable Environments: Supports the creation of flexible and customizable architectures by allowing components to be configured to meet specific workload requirements, all built upon a foundational infrastructure designed for a wide range of AI/ML applications.
  • Optimizing Kubernetes for Accelerators: Focuses on the installation and configuration of critical components necessary for efficient AI/ML training and inference, ensuring optimal performance when utilizing hardware accelerators.
  • Scalable Model Inference and Multi-Model Deployment: Enables the deployment of large language models (LLMs) and multi-model inference workloads using popular frameworks like Ray Serve, NVIDIA Triton, AIBrix, vLLM, and NVIDIA Dynamo. This includes support for advanced features such as multi-model caching, automatic scaling, and low-latency serving to ensure responsive inference.
  • Distributed Training at Scale: Facilitates the execution of high-performance training jobs across multiple nodes. It supports frameworks such as Ray, PyTorch, DeepSpeed, and NVIDIA NeMo, incorporating features like tensor/pipeline parallelism, efficient multi-threaded model downloading, integration with shared file systems (like EFS and FSx for Lustre), and data locality optimization for faster training cycles.
  • Cost Optimization and Monitoring: Provides tools and strategies for gaining visibility and control over resource usage. This includes leveraging spot instance integration, implementing sophisticated autoscaling policies, and offering right-sizing recommendations specifically for AI workloads to minimize operational costs without compromising performance.
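As a concrete illustration of the spot-instance strategy above, Karpenter is one common way to implement capacity-aware autoscaling for GPU workloads on EKS. The NodePool manifest below is an assumption for illustration only, not a file shipped with AI on EKS: it lets Karpenter provision `g5` GPU nodes, preferring Spot capacity with On-Demand as a fallback.

```shell
# Illustrative only: write a Karpenter NodePool manifest that allows GPU
# workloads to run on Spot capacity with On-Demand fallback. Instance family,
# limits, and the "default" EC2NodeClass name are assumptions, not values
# taken from the AI on EKS templates.
cat > gpu-spot-nodepool.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer Spot, fall back to On-Demand
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5"]                  # NVIDIA A10G GPU instances
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: 8                     # cap total GPUs this pool may provision
EOF
echo "wrote gpu-spot-nodepool.yaml"
```

In a real cluster this manifest would be applied with `kubectl apply -f gpu-spot-nodepool.yaml`; capping `limits` is what keeps runaway autoscaling from inflating costs.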

Project Structure: Infra and Blueprints

The "AI on EKS" project is logically organized into two primary categories to facilitate modularity and ease of use:

Infra

The Infra section comprises modular Terraform templates. These templates are engineered to deploy and configure the essential foundational infrastructure required for scalable AI/ML workloads on Amazon EKS. They provide a robust starting point for setting up the underlying resources, ensuring a stable and efficient environment for AI/ML operations.

Blueprints

The Blueprints section offers pre-built, end-to-end reference implementations tailored for specific AI/ML use cases. Each blueprint is a complete package designed to accelerate deployment and experimentation:

  • Training: Includes blueprints for Ray on EKS and PyTorch Distributed Data Parallel (DDP) training with FSx for Lustre, optimized for large-scale model training.
  • Inference: Provides blueprints for Ray with vLLM and NVIDIA Triton Server with TensorRT, designed for efficient and low-latency model serving.
  • Multi-model Serving: Features a blueprint for NVIDIA Triton with shared Amazon EFS caching, enabling efficient serving of multiple models.
  • Agentic AI & RAG: Offers blueprints for Retrieval-Augmented Generation (RAG) pipelines using frameworks like LlamaIndex and LangChain, with support for vector stores such as PGVector (upcoming).

Each blueprint comes with a comprehensive documentation package. This includes detailed architectural design specifications, step-by-step deployment instructions, insights into observability and cost metrics, performance benchmarks for various models, and recommended best practices for deploying on Amazon EKS.

Deployment Example: Mistral-7B-Instruct-v0.2 with RayServe and vLLM

To illustrate the ease of deployment, consider the example of deploying the Mistral-7B-Instruct-v0.2 model using RayServe and vLLM. The process involves a few straightforward steps:

First, clone the AI on EKS repository:

git clone https://github.com/awslabs/ai-on-eks.git

Navigate to the JARK deployment directory and execute the install script to deploy the necessary infrastructure:

cd ai-on-eks/infra/jark-stack/terraform && chmod +x install.sh
./install.sh

This deployment typically takes about 15-20 minutes to complete. Upon successful installation, the script will output a configured kubectl command. Use this command to configure your EKS cluster access:

# Creates k8s config file to authenticate with EKS
aws eks --region us-west-2 update-kubeconfig --name jark-stack

With your cluster access configured, you are ready to deploy your first model workload using one of the available blueprints. For the Mistral-7B-Instruct-v0.2 model with RayServe and vLLM, navigate to the blueprint directory and apply the service configuration:

export HUGGING_FACE_HUB_TOKEN=$(echo -n "Your-Hugging-Face-Hub-Token-Value" | base64)
cd ai-on-eks/blueprints/inference/vllm-rayserve-gpu

envsubst < ray-service-vllm.yaml | kubectl apply -f -
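Two details in the commands above are worth unpacking. The token is base64-encoded because Kubernetes Secret `data` values must be base64 (the manifest presumably injects the token into a Secret), and `envsubst` replaces the `${HUGGING_FACE_HUB_TOKEN}` placeholder in the manifest before piping it to `kubectl`. The standalone sketch below reproduces that pipeline with a dummy token and a toy manifest; it is not the real ray-service-vllm.yaml.

```shell
# Standalone sketch of the substitution pipeline above, using a dummy token.
export HUGGING_FACE_HUB_TOKEN=$(echo -n "dummy-token" | base64)

# Toy Secret manifest with the same placeholder the blueprint uses; the
# single-quoted heredoc delimiter keeps the shell from expanding it here.
cat > toy-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
data:
  hf-token: ${HUGGING_FACE_HUB_TOKEN}
EOF

# envsubst fills in ${HUGGING_FACE_HUB_TOKEN}; in the real deployment the
# result is piped to `kubectl apply -f -`. Guarded so the sketch still runs
# where envsubst (GNU gettext) is not installed.
command -v envsubst >/dev/null && envsubst < toy-secret.yaml

# Round-trip check: decoding recovers the original token value.
echo "$HUGGING_FACE_HUB_TOKEN" | base64 --decode
```

If the substituted output still shows a literal `${HUGGING_FACE_HUB_TOKEN}`, the variable was not exported in the current shell, which is the most common reason the blueprint's apply step fails.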

Getting Involved and Community Contributions

"AI on EKS" is more than just a set of tools; it

