Fine-Tuning the Audio Spectrogram Transformer (AST) with 🤗 Transformers for Enhanced Audio Classification


Introduction to Fine-Tuning the Audio Spectrogram Transformer (AST)

Fine-tuning a pre-trained model like the Audio Spectrogram Transformer (AST) offers a significant advantage over training from scratch, particularly when dealing with limited datasets. This approach leverages the audio-specific features already learned by the model during its extensive pre-training phase. By adapting these learned features to your specific dataset, you can achieve better performance and greater data efficiency for your audio classification tasks. This guide provides a comprehensive, step-by-step walkthrough of the fine-tuning process using the powerful tools available within the Hugging Face ecosystem.

Step 1: Setting Up Your Environment

Before we begin, ensure you have the necessary libraries installed. You can install them using pip:

pip install transformers[torch] datasets audiomentations

Step 2: Loading and Preparing Your Audio Data

The first crucial step is to load your audio data into a format that the Hugging Face Dataset object can readily process. The expected structure includes audio and labels features.

Loading Data from the Hugging Face Hub

If you have an existing audio dataset on the Hugging Face Hub, you can conveniently load it using the load_dataset function. For demonstration purposes, we will use the ESC50 dataset:

from datasets import load_dataset

esc50 = load_dataset("ashraq/esc50", split="train")

Loading Local Audio Files and Labels

Alternatively, you can load your local audio files and their corresponding labels. This can be done using a dictionary or a pandas DataFrame. You can also specify class mappings using ClassLabel for better management of categorical data.

from datasets import Dataset, Audio, ClassLabel, Features

# Define class labels
class_labels = ClassLabel(names=["bang", "dog_bark"])

# Define features with audio and label columns
features = Features({
    "audio": Audio(),  # Define the audio feature
    "labels": class_labels  # Assign the class labels
})

# Construct the dataset from a dictionary
dataset = Dataset.from_dict({
    "audio": ["/audio/fold1/7061-6-0-0.wav", "/audio/fold1/7383-3-0-0.wav"],
    "labels": [0, 1],  # Corresponding labels for the audio files
}, features=features)

The Audio feature class automatically handles the loading and processing of audio files, while ClassLabel aids in managing categorical labels.

Inspecting the Dataset

After loading, it is essential to inspect the dataset to ensure all data has been loaded correctly. Each audio sample is represented by an Audio feature, which loads data into memory only when needed, optimizing resource usage.

print(dataset[0])

The output will typically show the audio file path, the waveform data as a NumPy array, and the sampling rate, along with its associated label.
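For a local dataset like the one constructed above, the printed sample resembles the structure below. Note this is only illustrative: the waveform values and sampling rate are placeholders, and the real array length depends on your audio files.

```python
import numpy as np

# Illustrative structure of dataset[0]; the array values here are placeholders
sample = {
    "audio": {
        "path": "/audio/fold1/7061-6-0-0.wav",
        "array": np.array([0.0021, -0.0047, 0.0033]),  # decoded waveform
        "sampling_rate": 16000,
    },
    "labels": 0,  # integer id, resolved via the ClassLabel mapping
}
print(sorted(sample.keys()))
```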

Step 3: Preprocessing the Audio Data

This stage involves transforming the raw audio data into a format suitable for the AST model, which requires spectrogram inputs. We will also handle necessary data type casting.

Casting Data Types

If your dataset is from the Hugging Face Hub, you’ll need to cast the audio and labels columns to their correct feature types. This includes resampling the audio to the sampling rate expected by the ASTFeatureExtractor (typically 16kHz).

import numpy as np
from datasets import Audio, ClassLabel

# Assuming the dataset has "audio" and "labels" columns:
# resample the audio to the 16kHz expected by the ASTFeatureExtractor
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# cast the label column to ClassLabel for proper categorical handling
dataset = dataset.cast_column("labels", class_labels)

