Fine-Tuning the Audio Spectrogram Transformer (AST) with 🤗 Transformers for Enhanced Audio Classification


Introduction to Fine-Tuning the Audio Spectrogram Transformer (AST)

Fine-tuning a pre-trained model like the Audio Spectrogram Transformer (AST) offers a significant advantage over training from scratch, particularly when dealing with limited datasets. This approach leverages the audio-specific features already learned by the model during its extensive pre-training phase. By adapting these learned features to your specific dataset, you can achieve better performance and greater data efficiency for your audio classification tasks. This guide provides a comprehensive, step-by-step walkthrough of the fine-tuning process using the powerful tools available within the Hugging Face ecosystem.

Step 1: Setting Up Your Environment

Before we begin, ensure you have the necessary libraries installed. You can install them using pip:

pip install transformers[torch] datasets audiomentations

Step 2: Loading and Preparing Your Audio Data

The first crucial step is to load your audio data into a format that the Hugging Face Dataset object can readily process. The expected structure includes audio and labels features.

Loading Data from the Hugging Face Hub

If you have an existing audio dataset on the Hugging Face Hub, you can conveniently load it using the load_dataset function. For demonstration purposes, we will use the ESC50 dataset:

from datasets import load_dataset

esc50 = load_dataset("ashraq/esc50", split="train")

Loading Local Audio Files and Labels

Alternatively, you can load your local audio files and their corresponding labels. This can be done using a dictionary or a pandas DataFrame. You can also specify class mappings using ClassLabel for better management of categorical data.

from datasets import Dataset, Audio, ClassLabel, Features

# Define class labels
class_labels = ClassLabel(names=["bang", "dog_bark"])

# Define features with audio and label columns
features = Features({
    "audio": Audio(),  # Define the audio feature
    "labels": class_labels  # Assign the class labels
})

# Construct the dataset from a dictionary
dataset = Dataset.from_dict({
    "audio": ["/audio/fold1/7061-6-0-0.wav", "/audio/fold1/7383-3-0-0.wav"],
    "labels": [0, 1],  # Corresponding labels for the audio files
}, features=features)

The Audio feature class automatically handles the loading and processing of audio files, while ClassLabel aids in managing categorical labels.

Inspecting the Dataset

After loading, it is essential to inspect the dataset to ensure all data has been loaded correctly. Each audio sample is represented by an Audio feature, which loads data into memory only when needed, optimizing resource usage.

print(dataset[0])

The output will typically show the audio file path, the waveform data as a NumPy array, and the sampling rate, along with its associated label.
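For a local dataset like the one constructed above, the printed sample resembles the structure below. Note this is only illustrative: the waveform values and sampling rate are placeholders, and the real array length depends on your audio files.

```python
import numpy as np

# Illustrative structure of dataset[0]; the array values here are placeholders
sample = {
    "audio": {
        "path": "/audio/fold1/7061-6-0-0.wav",
        "array": np.array([0.0021, -0.0047, 0.0033]),  # decoded waveform
        "sampling_rate": 16000,
    },
    "labels": 0,  # integer id, resolved via the ClassLabel mapping
}
print(sorted(sample.keys()))
```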

Step 3: Preprocessing the Audio Data

This stage involves transforming the raw audio data into a format suitable for the AST model, which requires spectrogram inputs. We will also handle necessary data type casting.

Casting Data Types

If your dataset is from the Hugging Face Hub, you’ll need to cast the audio and labels columns to their correct feature types. This includes resampling the audio to the sampling rate expected by the ASTFeatureExtractor (typically 16kHz).

import numpy as np
from datasets import Audio, ClassLabel

# Assuming the dataset has "audio" and "labels" columns:
# resample the audio to the 16kHz expected by the ASTFeatureExtractor
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# cast the label column to ClassLabel for proper categorical handling
dataset = dataset.cast_column("labels", class_labels)

