Automating Atomic Force Microscopy: A Guide to LLM Agents and the AILA Framework

Introduction to Automating Atomic Force Microscopy with LLM Agents

The field of materials research is on the cusp of a significant transformation, driven by advancements in laboratory automation and the integration of artificial intelligence. Large Language Models (LLMs) are emerging as powerful tools capable of orchestrating complex experimental procedures, paving the way for self-driving laboratories (SDLs) that promise to accelerate the pace of discovery. However, current SDL implementations often rely on predefined, rigid protocols that do not fully capture the nuanced adaptability and intuitive decision-making skills characteristic of expert scientists operating in dynamic experimental settings.

This article focuses on evaluating the efficacy of LLM agents in automating Atomic Force Microscopy (AFM), a critical technique for characterizing material surfaces at the nanoscale. We introduce the Artificially Intelligent Lab Assistant (AILA) framework, a system designed to leverage LLM agents for controlling and managing AFM experiments. To systematically assess the capabilities of these AI agents, we have developed AFMBench, a comprehensive evaluation suite. AFMBench challenges LLM agents across the entire scientific workflow, from the initial design of experiments to the final analysis of results. This rigorous testing methodology allows us to identify both the strengths and weaknesses of current LLM technology in a practical laboratory context.

The AILA Framework: Architecture and Operation

The Artificially Intelligent Lab Assistant (AILA) framework is engineered to automate AFM experiments through a multi-agent system. At its core, AILA uses an LLM planner that interprets user requests, breaks them down into actionable tasks, and routes them to specialized agents. The framework is built on the LangChain framework, incorporating essential components such as prompts, LLMs, memory management, agents, and tools. AILA employs two primary categories of prompts: system prompts, which define ethical guidelines and agent responsibilities, and user prompts, which are dynamic inputs from the end-user.

The backbone of AILA comprises several LLMs, including GPT-4o, GPT-3.5-turbo-0125, Llama-3.3-70B-versatile, and Claude-3.5-sonnet-20241022. These models process natural language inputs and generate text-based outputs. Interactions and agent states are managed within a shared memory system, typically a Python dictionary, ensuring seamless communication and state persistence across agents. AILA features two specialized agents designed for AFM operations: the AFM Handler Agent (AFM-HA) and the Data Handler Agent (DHA). The AFM-HA is responsible for direct instrument control, utilizing tools such as a Document Retrieval Tool and a Code Executor Tool to interact with the AFM hardware via a Python API. The DHA focuses on image optimization and data analysis, equipped with tools for tasks like PID gain tuning and feature extraction. Agent-to-agent communication is structured, with the prefix "NEED HELP" indicating a task escalation and "FINAL ANSWER" signaling the completion of a query.
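The shared-memory design and message-prefix convention described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`afm_handler`, `data_handler`) and the dictionary fields are hypothetical, not AILA's actual API.

```python
# Illustrative sketch of AILA-style agent communication (hypothetical
# names, not the framework's real API). Agents read and write a shared
# Python dict and signal status via "NEED HELP" / "FINAL ANSWER" prefixes.

NEED_HELP = "NEED HELP"
FINAL_ANSWER = "FINAL ANSWER"

def afm_handler(state: dict) -> dict:
    """AFM Handler Agent: simulates running an instrument command."""
    query = state["query"]
    if "analyze" in query.lower():
        # Escalate tasks outside this agent's responsibility.
        state["message"] = f"{NEED_HELP}: analysis requested"
    else:
        state["message"] = f"{FINAL_ANSWER}: scan completed for '{query}'"
    return state

def data_handler(state: dict) -> dict:
    """Data Handler Agent: simulates analyzing acquired data."""
    state["message"] = f"{FINAL_ANSWER}: analysis of '{state['query']}' done"
    return state

state = {"query": "Scan a 1x1 um region"}
state = afm_handler(state)
print(state["message"])  # starts with "FINAL ANSWER"
```

The shared dictionary plays the role of AILA's memory system: because every agent mutates the same state object, context persists across agent hand-offs without explicit message passing.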

The operational flow within AILA involves dynamic routing mechanisms. When presented with a query, AILA determines the most appropriate agent or tool to handle the task. If neither the AFM-HA nor the DHA can sufficiently address the request, AILA formulates a response and selects a "FINISH" option. This modular design and structured communication protocol enable AILA to manage complex experimental workflows efficiently and adaptively.
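The routing behavior described above can be sketched as a dispatch loop that selects an agent for each query, stops when an agent returns a "FINAL ANSWER", and falls back to "FINISH" when no agent applies. All names here are illustrative assumptions, not AILA's implementation.

```python
# Illustrative router for an AILA-style planner (hypothetical names and
# keyword heuristics). It picks an agent for the query and selects
# "FINISH" when neither agent can address the request.

def pick_agent(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("scan", "approach", "tip")):
        return "AFM-HA"   # instrument-control tasks
    if any(k in q for k in ("analyze", "roughness", "pid")):
        return "DHA"      # data-analysis tasks
    return "FINISH"       # no suitable agent: respond and terminate

def route(query: str, max_steps: int = 5) -> list:
    trace = []
    for _ in range(max_steps):
        agent = pick_agent(query)
        trace.append(agent)
        if agent == "FINISH":
            return trace
        # In the real framework the agent would invoke its tools here;
        # we simulate a completed task with a FINAL ANSWER reply.
        reply = "FINAL ANSWER: done"
        if reply.startswith("FINAL ANSWER"):
            return trace
    return trace

print(route("Scan a 1x1 um region"))   # ['AFM-HA']
print(route("Summarize the weather"))  # ['FINISH']
```

A production router would use the LLM planner itself (rather than keyword matching) to choose the agent, but the termination conditions are the same.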

AFMBench: A Comprehensive Evaluation Suite

To rigorously assess the performance of LLM agents in automating AFM experiments, we developed AFMBench, a comprehensive evaluation suite. Unlike purely simulation-based benchmarks, AFMBench is designed to challenge AI agents on a physical AFM instrument, thereby incorporating real-world constraints such as timing, hardware limitations, and experimental variability. The dataset comprises 100 meticulously curated tasks that mirror the diverse requirements of real-world AFM experiments.

AFMBench tasks are categorized by complexity and the number of tools or agents required for their resolution. A significant portion of the tasks (69%) necessitate multi-tool integration, highlighting the importance of coordination capabilities. Furthermore, 83% of the tasks can be addressed by a single agent, while the remaining 17% require multi-agent collaboration. The complexity spectrum spans basic operations (56% of tasks) to more advanced procedures (44% of tasks). These tasks encompass a wide range of scientific activities, including looking up information in documentation, performing data analysis, and executing mathematical calculations. Many tasks are designed to blend these demands within a single prompt, simulating the multifaceted nature of how expert microscopists approach their work.

The evaluation metrics for AFMBench focus on both accuracy and efficiency. Accuracy is measured by a success rate, where a score of 1 is assigned to fully correct answers and 0.5 to partially correct answers. Efficiency is assessed through metrics such as average response time, number of steps per task, and token usage. This detailed evaluation framework provides a robust methodology for benchmarking LLM agents in the context of scientific automation.
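The scoring scheme above can be expressed as a small helper. This is a sketch under stated assumptions: the outcome labels and result fields (`steps`, `tokens`) are illustrative, not AFMBench's actual data format.

```python
# Sketch of AFMBench-style scoring (illustrative field names). Each task
# outcome is "correct" (1.0), "partial" (0.5), or "wrong" (0.0);
# efficiency is tracked as mean steps and tokens per task.

SCORES = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}

def evaluate(results: list) -> dict:
    n = len(results)
    return {
        "success_rate": sum(SCORES[r["outcome"]] for r in results) / n,
        "avg_steps": sum(r["steps"] for r in results) / n,
        "avg_tokens": sum(r["tokens"] for r in results) / n,
    }

runs = [
    {"outcome": "correct", "steps": 3, "tokens": 900},
    {"outcome": "partial", "steps": 5, "tokens": 1400},
    {"outcome": "wrong",   "steps": 7, "tokens": 2100},
]
print(evaluate(runs))  # success_rate 0.5, avg_steps 5.0
```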

Performance Analysis of Leading LLM Agents

Our evaluation of state-of-the-art LLM agents on AFMBench reveals significant disparities in their performance and capabilities. Across the diverse set of tasks, GPT-4o demonstrated superior overall performance, particularly excelling in tasks requiring documentation retrieval, where it achieved an 88.3% success rate. Its performance in analysis and calculation tasks was also notable, indicating a strong capacity for executing scientific workflows.

Claude 3.5 Sonnet, while competitive in standalone documentation tasks (85.3% success rate), showed a marked decline in performance on more complex, cross-domain challenges. This suggests that proficiency in domain-specific question-answering does not necessarily translate to effective experimental execution or agentic coordination. GPT-3.5-turbo-0125 exhibited poor performance across most tasks, especially in multi-domain challenges, registering null success rates in scenarios demanding simultaneous expertise across different scientific areas. This highlights a fundamental limitation in its cross-functional reasoning capabilities.

The open-source Llama-3.3-70B-versatile model showed accuracy superior to GPT-3.5 in standalone tasks but completely failed in tasks requiring cross-domain analysis or expertise. This underscores a common challenge: while models may possess broad knowledge, their ability to integrate this knowledge for complex, multi-step experimental procedures remains limited.

Further investigation using the Model Context Protocol (MCP) confirmed that the observed performance limitations are inherent to the LLM models themselves rather than artifacts of the evaluation framework. This finding is critical for understanding the current state of LLM capabilities in scientific automation.

Single-Agent vs. Multi-Agent Architectures

A key aspect of our evaluation involved comparing the performance of single-agent and multi-agent AILA architectures. By systematically assessing a representative subset of AFMBench questions across both configurations, we observed significant framework-dependent performance variations. Notably, GPT-4o demonstrated superior performance in the multi-agent configuration, achieving a 70% success rate compared to 58% with direct tool integration (single-agent). This suggests that for advanced models capable of complex reasoning, the enhanced coordination and communication facilitated by a multi-agent system provide measurable performance gains.

For other models, the performance differences between single-agent and multi-agent architectures were less pronounced. This is largely attributed to their fundamental limitations in handling cross-domain tasks that inherently require sophisticated multi-agent coordination. While single-agent architectures may offer computational efficiency, the findings indicate that multi-agent systems are crucial for unlocking the full potential of advanced LLMs in complex scientific automation scenarios.

Error Analysis and Model-Specific Limitations

A detailed error analysis of LLM agent performance on AFMBench revealed distinct failure modes and model-specific limitations. GPT-3.5-turbo-0125 exhibited a high total error rate of 66.6%, with errors predominantly concentrated in code generation (32%) and agent selection (27.3%). While its natural language processing capabilities appeared robust, the frequent code generation errors and failures in agent or tool selection pointed to deficiencies in translating comprehension into actionable experimental protocols.

Llama-3.3-70B-versatile and Claude-3.5-sonnet-20241022 also presented substantial error rates (60.6% and 51.6%, respectively), but with different failure patterns. Llama-3.3-70B-versatile frequently produced non-functional code or incorrectly formulated arguments for tool execution, whereas Claude-3.5-sonnet-20241022's errors followed a distinct distribution.

AI Summary

The integration of Large Language Model (LLM) agents into laboratory automation promises to accelerate scientific discovery, particularly in fields like materials research. However, existing self-driving laboratory (SDL) systems often rely on rigid protocols that fall short of the adaptability and intuitive problem-solving capabilities of human scientists in dynamic experimental environments. This article examines the application of LLM agents to automating Atomic Force Microscopy (AFM) through the Artificially Intelligent Lab Assistant (AILA) framework. It also introduces AFMBench, a comprehensive evaluation suite designed to rigorously test LLM agents across the entire scientific workflow, encompassing experimental design, multi-tool coordination, decision-making, execution of open-ended experiments, and data analysis. The findings reveal that current state-of-the-art LLMs encounter significant difficulties with fundamental tasks and complex coordination scenarios. Notably, models that excel in materials science question-answering do not necessarily translate this expertise into practical laboratory capabilities. A critical observation is the phenomenon of "sleepwalking," where LLM agents deviate from specified instructions, posing potential safety concerns for SDL applications. Ablation studies indicate that multi-agent frameworks outperform single-agent approaches, although both remain susceptible to minor variations in instruction formatting or prompting.
