Harnessing Hugging Face Models on AWS Lambda for Serverless Inference

Introduction

The proliferation of large and sophisticated machine learning models has, in many ways, raised the barrier to entry for developers and organizations looking to leverage cutting-edge AI, because the computational infrastructure required to train these models can be substantial. Fortunately, the open-source Hugging Face Transformers project significantly democratizes access to these powerful tools: the library supports more than 30 pre-trained Transformer-based model architectures through a simple Python package. In addition, the Hugging Face Hub hosts an extensive collection of over 10,000 community-developed models, so users can integrate modern Transformer models into their applications without the intensive process of training them from scratch.

This tutorial aims to guide you through the process of hosting these Hugging Face models on AWS Lambda, enabling serverless inference. By utilizing container images for Lambda functions and Amazon Elastic File System (EFS) for model caching, we can achieve efficient, low-latency inference suitable for a variety of applications.

Architectural Overview

The solution architecture is designed to provide serverless inference capabilities through AWS Lambda. Key components and their functions are:

  • AWS Lambda Functions with Container Images: Lambda functions are packaged as container images, allowing for larger deployment packages and more complex dependencies required by machine learning models.
  • Automatic Model Downloading: Pre-trained models are automatically downloaded from the Hugging Face Hub the first time a Lambda function is invoked. This ensures that the function has the necessary model artifacts available.
  • Amazon EFS for Model Caching: To reduce inference latency after the first invocation, pre-trained models are cached in Amazon Elastic File System (EFS) storage. EFS provides persistent, scalable file storage that can be mounted by Lambda functions (the caching behavior is sketched below).

This setup is particularly beneficial for NLP tasks, and the provided solution includes Python scripts for common use cases such as sentiment analysis and text summarization.
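
The caching relies on the standard TRANSFORMERS_CACHE environment variable recognized by the Transformers library. The sketch below illustrates the effect inside a function; in the deployed solution the variable is set on the Lambda function by the CDK rather than in code, and the mount path /mnt/hf_models_cache is an assumption:

import os

# Point the Hugging Face cache at the EFS mount before the library resolves it,
# so model weights downloaded on the first (cold) invocation land on EFS.
os.environ["TRANSFORMERS_CACHE"] = "/mnt/hf_models_cache"  # assumed mount path

from transformers import pipeline

# First invocation: downloads the default model into the EFS cache.
# Later invocations (and other function instances): load it from the shared cache.
nlp = pipeline("sentiment-analysis")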

Prerequisites

Before you begin, ensure you have the following prerequisites in place:

  • An AWS account with appropriate permissions to create Lambda functions, EFS file systems, and related resources.
  • The AWS Command Line Interface (CLI) installed and configured.
  • The AWS Cloud Development Kit (CDK) installed.
  • Docker installed and running on your local machine.
  • Python 3.8 or later installed.

Deploying the Example Application

The deployment process leverages the AWS CDK to provision and configure the necessary infrastructure. Follow these steps:

1. Clone the Repository

Begin by cloning the sample project from GitHub:

git clone https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face.git

2. Install Dependencies

Navigate into the cloned directory and install the required Python dependencies:

cd zero-administration-inference-with-aws-lambda-for-hugging-face
pip install -r requirements.txt

3. Bootstrap the CDK

Bootstrap your AWS environment for the CDK. This one-time command provisions the resources (such as a staging S3 bucket and IAM roles) that the CDK needs to deploy stacks into your account:

cdk bootstrap

4. Deploy the CDK Application

Execute the deployment command to provision the infrastructure defined in the CDK script:

cdk deploy

During deployment, the CDK toolkit prints progress as it builds the container images and provisions resources. This step creates the Lambda functions and the EFS file system, and configures the necessary networking and permissions.

Understanding the Code Structure

The project follows a structured organization to manage the inference logic and deployment configurations:

  • inference/: This directory contains the core logic for machine learning inference.
  • inference/Dockerfile: This file defines the Docker image used to build the Lambda function container. It specifies the base image, installs dependencies, and sets up the environment for running PyTorch-based Hugging Face inference (a sketch follows this list).
  • inference/sentiment.py: A Python script that performs sentiment analysis using a Hugging Face model.
  • inference/summarization.py: A Python script for text summarization tasks.
  • app.py: The main CDK script responsible for defining and provisioning the AWS infrastructure, including Lambda functions, EFS, and VPC configurations.
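
The Dockerfile builds on the AWS Lambda Python base image so that the container satisfies the Lambda runtime interface. A minimal sketch is shown below; the exact base image tag, dependency versions, and handler names in the repository may differ:

FROM public.ecr.aws/lambda/python:3.8

# Install the inference dependencies into the image
RUN pip install --no-cache-dir torch transformers

# Copy the handler scripts into the Lambda task root
COPY sentiment.py summarization.py ./

# Default handler; the CDK stack overrides this per function via the image CMD
CMD ["sentiment.handler"]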

Inference Scripts Example (sentiment.py)

The Python scripts within the inference directory contain the actual machine learning inference code. For instance, the sentiment.py script might look like this:

import json
from transformers import pipeline

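# Loaded once at module import time (cold start); warm invocations reuse the pipeline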
nlp = pipeline("sentiment-analysis")

def handler(event, context):
    response = {
        "statusCode": 200,
        "body": json.dumps(nlp(event['text'])[0])
    }
    return response

This script initializes a sentiment analysis pipeline when the container starts, and defines a handler that reads the text field from the incoming event, runs sentiment analysis on it, and returns the top prediction as a JSON response.
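
After deployment, you can exercise the function directly; a minimal sketch using boto3 is shown below. The function name here is hypothetical, so substitute the name the CDK stack actually created (visible in the Lambda console or in the cdk deploy output):

import json
import boto3

client = boto3.client("lambda")

response = client.invoke(
    FunctionName="sentiment",  # hypothetical name; use the function created by the stack
    Payload=json.dumps({"text": "I love serverless machine learning!"}).encode("utf-8"),
)

result = json.loads(response["Payload"].read())
print(result)
# Something like: {'statusCode': 200, 'body': '{"label": "POSITIVE", "score": 0.99...}'}

Expect the first invocation to take noticeably longer while the model is downloaded into the EFS cache; subsequent invocations load the model from the cache and return much faster.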

CDK Script Details (app.py)

The app.py script orchestrates the deployment using the AWS CDK. It defines:

  • VPC Configuration: A virtual private cloud (VPC) is created to provide a private network for the Lambda functions and EFS file system.
  • EFS File System and Access Point: An Elastic File System (EFS) is provisioned to serve as a persistent cache for the downloaded Hugging Face models. An access point is configured to provide a specific mount path and POSIX user permissions for the Lambda function.
  • Lambda Function Creation: The script iterates through the Python files in the inference directory. For each file, it builds a Docker image and creates a Lambda function. These functions are configured with sufficient memory and timeout settings, mounted with the EFS file system, and have the TRANSFORMERS_CACHE environment variable set to the EFS mount point.

Example snippets from the CDK script:

import os
from pathlib import Path
import aws_cdk as cdk
from aws_cdk import (aws_ec2 as ec2, aws_efs as efs, aws_lambda as lambda_)

class LambdaMLStack(cdk.Stack):
    def __init__(self, scope: cdk.App, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # VPC configuration
        vpc = ec2.Vpc(self, 'Vpc', max_azs=2)  # construct ID and AZ count are illustrative
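
        # Illustrative continuation of the stack; resource names, sizes, and paths
        # below are assumptions that follow the pattern described above, not the
        # repository's exact values.

        # EFS file system and access point used as the shared model cache
        fs = efs.FileSystem(self, 'FileSystem',
                            vpc=vpc,
                            removal_policy=cdk.RemovalPolicy.DESTROY)
        access_point = fs.add_access_point(
            'MLAccessPoint',
            create_acl=efs.Acl(owner_gid='1001', owner_uid='1001', permissions='750'),
            path='/export/models',
            posix_user=efs.PosixUser(gid='1001', uid='1001'))

        # One container-image Lambda function per script in inference/, with the
        # EFS cache mounted and TRANSFORMERS_CACHE pointing at the mount path
        mount_path = '/mnt/hf_models_cache'
        for script in Path('inference').glob('*.py'):
            lambda_.DockerImageFunction(
                self, script.stem,
                code=lambda_.DockerImageCode.from_image_asset(
                    'inference', cmd=[f'{script.stem}.handler']),
                memory_size=8096,
                timeout=cdk.Duration.minutes(10),
                vpc=vpc,
                filesystem=lambda_.FileSystem.from_efs_access_point(access_point, mount_path),
                environment={'TRANSFORMERS_CACHE': mount_path})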
