Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output

Introduction: Beyond Text-In, Text-Out

The standard paradigm of Large Language Model (LLM) applications, often characterized by a simple "text in, text out" flow, is rapidly evolving. To deliver tangible value in real-world scenarios, applications must transcend this limitation. They need the capability to interpret visual information, engage in complex reasoning, and produce outputs that are readily consumable by other systems. This article explores how to build such advanced applications by integrating three core capabilities: multimodal input, sophisticated reasoning, and structured output.

We will illustrate these concepts through a practical, hands-on example: constructing a time-series anomaly detection system for e-commerce order data. By leveraging OpenAI's o3 model, we will demonstrate how to combine its powerful reasoning abilities with image analysis and generate validated JSON outputs. Our goal is to create an application that can effectively "see" by analyzing charts, "think" by identifying unusual patterns, and "integrate" by outputting a structured anomaly report that downstream systems can easily process.

1. Case Study: Time-Series Anomaly Detection

Our case study focuses on identifying abnormal patterns within e-commerce order time-series data. For this demonstration, we generated three distinct sets of synthetic daily order data, each representing a different profile over approximately one month. To visually emphasize seasonality, weekends are shaded in the accompanying charts. The x-axis displays the day of the week.
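For illustration, one such profile could be synthesized along these lines. This is a sketch only: the dates, baseline level, and anomaly placement are assumptions, not the article's actual datasets.

```python
import numpy as np
import pandas as pd

# Sketch: ~1 month of daily orders with weekly seasonality and one injected spike.
rng = np.random.default_rng(0)
dates = pd.date_range("2025-06-02", periods=30, freq="D")  # starts on a Monday
orders = 200 + 10 * rng.standard_normal(len(dates))
orders[dates.weekday >= 5] -= 60   # weekly seasonality: weekends run lower
orders[14] += 150                  # injected spike anomaly mid-series
df = pd.DataFrame({"date": dates, "orders": orders.round().astype(int)})
```

The other two profiles would be built the same way, each with a different anomaly type (a drop, a level shift, or a seasonal outlier) injected instead of the spike.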

Figure 1: Dataset 1 with Shaded Weekends

(Imagine a chart here showing Dataset 1 with shaded weekend regions)

Figure 2: Dataset 2 with Shaded Weekends

(Imagine a chart here showing Dataset 2 with shaded weekend regions)

Figure 3: Dataset 3 with Shaded Weekends

(Imagine a chart here showing Dataset 3 with shaded weekend regions)

Each of these figures contains a specific type of anomaly that we aim to detect. We will use these charts to test our anomaly detection solution and verify its accuracy.

2. Our Solution: A Multimodal, Reasoning-Driven Approach

2.1 Solution Overview

Unlike traditional machine learning approaches that often require extensive feature engineering and model training, our method is significantly simpler. It follows these key steps:

  1. Prepare a visual representation (a chart) of the e-commerce order time-series data.
  2. Prompt the o3 reasoning model, providing it with the time-series chart as input, and instruct it to identify any unusual patterns.
  3. The o3 model then outputs its findings in a predefined JSON format, which can be easily consumed by other systems.

This streamlined process enables us to leverage the model's advanced capabilities without the complexities of conventional ML pipelines. The core challenge lies in enabling the o3 model to accept image input and produce structured output.

2.2 Setting Up the Reasoning Model (o3)

We will utilize OpenAI's o3 model, a state-of-the-art reasoning model capable of handling complex, multi-step problems. For this tutorial, we will access the model via an Azure OpenAI endpoint. Ensure your Azure endpoint, API key, and deployment name are configured in an `.env` file before proceeding with the LLM client setup.
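For reference, the `.env` file might look like the following. The values are placeholders; the variable names match those read by `os.getenv` in the setup code.

```shell
api_base=https://<your-resource>.openai.azure.com/
o3_API_KEY=<your-azure-openai-api-key>
deployment_name=<your-o3-deployment-name>
```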

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from openai import AzureOpenAI
from dotenv import load_dotenv
import os

load_dotenv()

# Setup LLM client
endpoint = os.getenv("api_base")
api_key = os.getenv("o3_API_KEY")
api_version = "2025-04-01-preview"
model_name = "o3"
deployment = os.getenv("deployment_name")

LLM_client = AzureOpenAI(
    api_key=api_key,  
    api_version=api_version,
    azure_endpoint=endpoint
)

The system message for the o3 model is crucial for guiding its analysis. It defines the model's role, the task, and specific rules to follow. For our anomaly detection task, the instruction is crafted as follows:

instruction = f"""

[Role]
You are a meticulous data analyst.

[Task]
You will be given a line chart image related to daily e-commerce orders. 
Your task is to identify prominent anomalies in the data.

[Rules]
The anomaly kinds can be spike, drop, level_shift, or seasonal_outlier.
A level_shift is a sustained baseline change (≥ 5 consecutive days), not a single point.
A seasonal_outlier happens if a weekend/weekday behaves unlike peers in its category.
For example, weekend orders are usually lower than weekday orders.
Read dates/values from axes; if you can't read exactly, snap to the nearest tick and note uncertainty in explanation.
The weekends are shaded in the figure.
"""

This instruction clearly outlines the expected anomaly types (spike, drop, level_shift, seasonal_outlier) with precise definitions, removing ambiguity. It also incorporates domain-specific knowledge, such as the expectation of lower order volumes on weekends compared to weekdays, to better guide the model's analytical process.
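One way to pin down the predefined JSON format is a Pydantic model that mirrors these rules. The field names below are illustrative assumptions rather than a definitive schema:

```python
from typing import List, Literal
from pydantic import BaseModel

class Anomaly(BaseModel):
    # Allowed kinds mirror the [Rules] section of the instruction
    kind: Literal["spike", "drop", "level_shift", "seasonal_outlier"]
    start_date: str   # read from the chart's x-axis (nearest tick)
    end_date: str
    explanation: str  # should note any reading uncertainty

class AnomalyReport(BaseModel):
    anomalies: List[Anomaly]
```

A model like this can be handed to the openai-python library's `client.beta.chat.completions.parse(..., response_format=AnomalyReport)` helper, which validates the model's JSON output against the schema.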

2.3 Image Preparation for Multimodal Input

To enable o3's multimodal capabilities, input images must be provided in a specific format: either as publicly accessible web URLs or as base64-encoded data URLs. Since our figures are generated locally, we will use the latter approach.

Base64 Encoding is a method for representing binary data, such as image files, using only text characters that are safe for internet transmission. It converts binary data into a string of letters, numbers, and symbols.

A Data URL embeds the file content directly within the URL string, rather than pointing to an external file location. This is achieved by prefixing the encoded data with a specific scheme.
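A quick illustration of both ideas, using a tiny stand-in byte string in place of real image bytes:

```python
import base64

# Base64-encode a short byte string (a stand-in for binary image data)
encoded = base64.b64encode(b"hi").decode("utf-8")   # → "aGk="

# Wrap it in a data URL: scheme, MIME type, encoding marker, then the payload
data_url = f"data:image/png;base64,{encoded}"       # → "data:image/png;base64,aGk="
```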

The following Python function handles the conversion of a Matplotlib figure into a base64 data URL without needing to save the figure to disk:

import io
import base64

def fig_to_data_url(fig, fmt="png"):
    """
    Converts a Matplotlib figure to a base64 data URL without saving to disk.

    Args:
    -----
    fig (matplotlib.figure.Figure): The figure to convert.
    fmt (str): The format of the image ("png", "jpeg", etc.)

    Returns:
    --------
    str: The data URL representing the figure.
    """

    buf = io.BytesIO()
    fig.savefig(buf, format=fmt, bbox_inches="tight")
    buf.seek(0)
    
    base64_encoded_data = base64.b64encode(buf.read()).decode("utf-8")
    mime_type = f"image/{fmt.lower()}"
    
    return f"data:{mime_type};base64,{base64_encoded_data}"

This function saves the Matplotlib figure to an in-memory buffer, encodes the binary image data as base64 text, and then formats it as a data URL.
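Once we have a data URL, it can be attached to a chat message as image content. The helper below is a sketch (the `build_messages` name and user text are assumptions), following the standard chat-completions format in which a user message carries a list of `text` and `image_url` content parts:

```python
def build_messages(instruction, image_url):
    """Assemble a multimodal chat payload: system instruction plus chart image."""
    return [
        {"role": "system", "content": instruction},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Identify anomalies in this chart."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]
```

These messages would then be passed to `LLM_client.chat.completions.create(model=deployment, messages=build_messages(...))`.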

Assuming we have the synthetic daily order data, the following function generates the plot and converts it into a data URL:

def create_fig(df):
    """
    Create a Matplotlib figure and convert it to a base64 data URL.
    Weekends (Sat–Sun) are shaded.

    Args:
    -----
    df: dataframe containing one profile of the daily order time series,
        with "date" and "orders" columns.

    Returns:
    --------
    image_url: The data URL representing the figure.
    """

    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(df["date"], df["orders"], marker="o")

    # Shade each weekend (span from Saturday 00:00 to Monday 00:00)
    for d in df.loc[df["date"].dt.weekday == 5, "date"]:
        ax.axvspan(d, d + pd.Timedelta(days=2), color="gray", alpha=0.2)

    # Label the x-axis ticks with the day of the week
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%a"))

    image_url = fig_to_data_url(fig)
    plt.close(fig)
    return image_url
