How to automatically test AI agent skills against rapidly changing APIs

In the vm0-ai/vm0-skills repository, we have developed dozens of skills for integrating with various third-party SaaS platforms. These skills enable Claude Code and Codex agents to interact seamlessly with services like GitHub, Slack, Discord, and many others.

While these integrations are incredibly valuable, they present a significant testing challenge. Without proper testing infrastructure, we cannot reliably verify whether skills function as expected or detect breaking changes when third-party APIs evolve.

Why testing third-party AI agent skills is hard

Testing third-party integrations is inherently difficult. Each skill depends on external APIs that may change without notice, requiring constant vigilance to maintain reliability. Traditional unit tests often fall short because they can't replicate real-world API behavior, authentication flows, and edge cases that only emerge in production environments.
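
As a small illustration of that gap, the sketch below uses an entirely hypothetical API response. A parser is unit-tested against a frozen fixture, so the test keeps passing even after the live service renames a field. The field names (`login`, `username`) and the `parse_user` helper are invented for this example and do not come from any real service.

```python
# Hypothetical example: none of these field names come from a real API.

def parse_user(payload: dict) -> str:
    """Extract the user name the way the skill was originally written."""
    return payload["user"]["login"]


# Fixture recorded when the integration was first built.
RECORDED_FIXTURE = {"user": {"login": "octocat"}}

# What the live service might return after an unannounced change.
LIVE_RESPONSE_AFTER_CHANGE = {"user": {"username": "octocat"}}


def unit_test_with_fixture() -> bool:
    """The traditional unit test: it only ever sees the frozen fixture."""
    return parse_user(RECORDED_FIXTURE) == "octocat"


def check_against_live_shape() -> bool:
    """The same check run against the changed response shape."""
    try:
        return parse_user(LIVE_RESPONSE_AFTER_CHANGE) == "octocat"
    except KeyError:
        # The renamed field surfaces only when real-shaped data is used.
        return False
```

The fixture-based test reports success while the check against the changed shape fails, which is the kind of drift that only shows up when a skill is exercised against the real service.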

Without comprehensive testing, these problems stay hidden: a skill can silently stop working, and a breaking change in a third-party API can go unnoticed. The result is a significant maintenance burden and reliability issues that can affect production workflows.

Using AI agents to test AI agent skills in real environments

Since these skills are specifically designed for Claude Code and Codex agents, the most natural and effective approach is to use these same agents to test them. This creates a self-validating ecosystem where the tools test themselves in their intended environment.

VM0 provides the cloud infrastructure necessary to run Claude Code and Codex agents reliably, making it an ideal platform for implementing this testing strategy.

An end-to-end automated workflow for testing AI agent skills

The complete workflow for automated skill testing is defined in the agent's instruction file, AGENTS.md, reproduced below. It tells the agent to test every skill in the repository, generate a report, and notify the team through multiple channels.

# Skills Tester Agent

## Overview

This agent performs automated testing of all skills in the vm0-skills repository.

## Critical Requirements

**MANDATORY: Complete All Tests Without Exception**

- No matter how long the task takes, it MUST be completed in full
- Continue until ALL items in `TODO.md` are tested - no early termination
- **NO skipping tasks** - every skill must be tested
- **NO selective testing** - do not cherry-pick which skills to test
- **Every example MUST have a result** - each example command in every skill's SKILL.md must be executed and recorded
- If a test fails, record the failure and continue to the next test
- Do not stop or pause until the entire test suite is complete

## Instructions

1. **Clone and Initialize**
   - Clone the repo `vm0-ai/vm0-skills`
   - Create a `TODO.md` file to track testing progress

2. **Generate Todo List**
   - For each skill folder in the repo, add a todo item to `TODO.md`

3. **Test Each Skill**
   - Create a sub-agent for each skill to test
   - Each sub-agent should:
     - Verify all required environment variables exist
     - Test each example command in the skill's SKILL.md
     - Write a temporary test result markdown file
     - Record whether the test passed, and specifically note any shell command failures or jq parsing errors

4. **Summarize Results**
   - Aggregate all test results into `result.md`

5. **Update README**
   - Based on `result.md`, update the `README.md`
   - Update or insert a skill list section with:
     - Brief description of each skill's capabilities
     - Test status (passed/failed)

6. **Commit and Push**
   - Only commit `README.md`
   - Push to the repository using `GITHUB_TOKEN` for authentication

7. **Report Issues**
   - For skills with test failures, create a GitHub issue summarizing all problems

8. **Notify Slack**
   - Post a message to Slack channel `#dev` with:
     - Total number of skills
     - Number of passed tests
     - Number of failed tests
     - Brief summary of issues
     - Link to the GitHub issue (if created)

9. **Notify Discord**
   - Post a message to the Discord `skills` channel with:
     - Confirmation that routine testing is complete
     - Number of skills that passed
     - Total number of skills tested
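
The per-skill checks in step 3 of the instructions above can be pictured with a short Python sketch. This is not the agent's actual implementation, since the agent decides for itself how to carry out the instructions. The names `ExampleResult`, `missing_env_vars`, `run_example`, and `check_skill` are hypothetical, and the example commands are assumed to have been extracted from SKILL.md already.

```python
import os
import subprocess
from dataclasses import dataclass


@dataclass
class ExampleResult:
    command: str
    passed: bool
    detail: str


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def run_example(command, timeout=60):
    """Run one example command in a shell and record pass/fail without raising."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ExampleResult(command, False, f"timed out after {timeout}s")
    if proc.returncode != 0:
        # Keep stderr so shell or jq errors end up in the report.
        detail = proc.stderr.strip() or f"exit code {proc.returncode}"
        return ExampleResult(command, False, detail)
    return ExampleResult(command, True, "")


def check_skill(required_env, commands, environ=None):
    """Check environment variables first, then run every example command."""
    missing = missing_env_vars(required_env, environ)
    if missing:
        reason = "missing env: " + ", ".join(missing)
        return [ExampleResult(cmd, False, reason) for cmd in commands]
    return [run_example(cmd) for cmd in commands]
```

A failed example is recorded rather than raised, which mirrors the instruction to record the failure and continue to the next test. Note that for a pipeline such as `curl ... | jq ...`, the shell reports the exit status of the last command only, so a real runner may also want to inspect the output.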

Configuring the agent with vm0.yaml

Next, you just need to schedule VM0 to run this workflow. Create a vm0.yaml file to describe the agent container configuration. This file specifies which skills the agent needs, what environment variables to inject, and how to run the testing workflow.

version: "1.0"

agents:
  skills-tester:
    image: skills-tester:latest
    provider: claude-code
    instructions: AGENTS.md
    skills:
      - https://github.com/vm0-ai/vm0-skills/tree/main/github
      - https://github.com/vm0-ai/vm0-skills/tree/main/slack
      - https://github.com/vm0-ai/vm0-skills/tree/main/discord
    environment:
      CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
      DISCORD_BOT_TOKEN: ${{ secrets.DISCORD_BOT_TOKEN }}
      # ... additional environment variables as needed

For the complete configuration file, refer to vm0-skills/.vm0/vm0.yaml. Some environment variables are omitted in this example for brevity.
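
Before composing the agent, it can be useful to check that every secret referenced in the file is actually available. The sketch below is a hypothetical helper, not part of the VM0 CLI: it scans the configuration text for the `${{ secrets.NAME }}` pattern shown above and reports which names are missing from a given set of variables. How VM0 itself resolves secrets is not covered here.

```python
import os
import re

# Matches references such as ${{ secrets.GITHUB_TOKEN }} as written in the excerpt.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def referenced_secrets(config_text):
    """Return secret names referenced in the config, in order, without duplicates."""
    seen = []
    for name in SECRET_REF.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_secrets(config_text, available=None):
    """Return referenced secret names that are unset or empty in `available`."""
    available = os.environ if available is None else available
    return [name for name in referenced_secrets(config_text) if not available.get(name)]
```

Calling `missing_secrets(open("vm0.yaml").read())` would then list any names that still need to be set in the current environment, assuming the secrets are mirrored there.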

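This agent configuration includes three essential skills: github, for pushing the updated README and opening issues; slack, for the team notification; and discord, for the community notification.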

Creating the Docker image

You'll also need to configure a Docker image that installs the necessary dependencies, particularly the GitHub CLI (gh) that the agent uses for repository operations.

Create a Dockerfile:

FROM node:20-slim

# Base tools used by the skills: git, curl, Python, and jq for JSON parsing
RUN apt-get update && apt-get install -y \
    git \
    curl \
    python3 \
    python3-pip \
    python3-venv \
    jq \
    && rm -rf /var/lib/apt/lists/*

# Install the GitHub CLI (gh) from GitHub's apt repository
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
    && chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
    && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
    && apt-get update \
    && apt-get install -y gh \
    && rm -rf /var/lib/apt/lists/*

# Install Claude Code globally
RUN npm install -g @anthropic-ai/claude-code

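This Dockerfile creates a lightweight container with Node.js 20 on a slim Debian base, git, curl, Python 3 (with pip and venv), jq, the GitHub CLI, and Claude Code installed globally through npm.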

Putting the AI skill testing system together

That's all you need! With these three files in place (AGENTS.md, Dockerfile, and vm0.yaml), you have a complete automated testing system. You can see the full implementation at vm0-skills/.vm0.

Execute the following commands in your project directory to build and deploy the agent:

$ vm0 image build -f Dockerfile --name skills-tester
$ vm0 compose vm0.yaml

The first command builds the Docker image with all necessary dependencies. The second command registers the agent configuration with VM0's platform.

Running the workflow

Now you can run the entire testing workflow with a single command:

$ vm0 run skills-tester "do the job"

The agent will autonomously:

  1. Clone the vm0-skills repository
  2. Generate a testing checklist for all skills
  3. Execute tests for each skill systematically
  4. Compile comprehensive results
  5. Update the repository README
  6. Create GitHub issues for failures
  7. Send notifications to Slack and Discord
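
To make steps 4 and 5 more concrete, here is one way the per-skill outcomes could be compiled into a markdown summary. The table layout and the `summarize` and `render_results` helpers are hypothetical. The actual format of `result.md` is whatever the agent chooses to write.

```python
def summarize(results):
    """Count passed and failed skills from a {skill_name: passed} mapping."""
    passed = sum(1 for ok in results.values() if ok)
    return {"total": len(results), "passed": passed, "failed": len(results) - passed}


def render_results(results):
    """Render the mapping as a markdown document with totals and a status table."""
    counts = summarize(results)
    lines = [
        "# Skill test results",
        "",
        f"Total: {counts['total']} | Passed: {counts['passed']} | Failed: {counts['failed']}",
        "",
        "| Skill | Status |",
        "| --- | --- |",
    ]
    # Sort by name so repeated runs produce stable, diff-friendly output.
    for name in sorted(results):
        status = "passed" if results[name] else "failed"
        lines.append(f"| {name} | {status} |")
    return "\n".join(lines) + "\n"
```

Sorting the rows keeps the README diff small between runs, so a change in test status stands out in the commit history.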

Debugging step-by-step

If you want to debug the workflow incrementally or test a single skill first, you can use targeted prompts:

$ vm0 run skills-tester "Only do the first step, using a single skill."

After the agent completes the first step, you can continue the session based on the session ID provided in the output:

$ vm0 run continue SESSION_ID "Do the next step."

This interactive approach lets you verify each step before moving on and catch problems with a single skill before running the full suite.

Results and notifications

After the workflow completes, you'll receive notifications across multiple channels confirming the testing results.

[Figure: Discord community notification showing test completion summary]

[Figure: Slack team notification with detailed test results]
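
For reference, a notification like the Slack one can be expressed as a single call to Slack's `chat.postMessage` Web API method. The sketch below only builds the request and does not send it. The message wording and the `build_summary_text` and `build_slack_request` helpers are hypothetical, and the agent in this post relies on the slack skill rather than on this code.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_summary_text(total, passed, failed, issue_url=None):
    """Compose a one-line summary; the wording here is illustrative only."""
    text = f"Skill tests complete: {passed}/{total} passed, {failed} failed."
    if issue_url:
        text += f" Details: {issue_url}"
    return text


def build_slack_request(token, channel, text):
    """Build (but do not send) a chat.postMessage request with a bot token."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would return a JSON body whose `ok` field indicates whether Slack accepted the message.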

For any skills that fail testing, the agent automatically creates a GitHub issue with comprehensive failure details. See Skill Test Failures - Issue #2 for an example of the generated issue format.

Key lessons from automating AI agent skill testing

Implementing automated skill testing with VM0 agents has clear practical benefits.

By leveraging VM0's cloud infrastructure and Claude's agent capabilities, you can maintain reliable integrations with external services while minimizing the ongoing maintenance burden. This approach transforms skill testing from a manual, error-prone process into a fully automated quality assurance system.

Get started with VM0 today

Ready to automate your own workflows with AI agents? VM0 makes it easy to deploy production-ready agents in minutes, not weeks.

Visit vm0.ai to create your free account and deploy your first agent today. Join our Discord community to connect with other builders, share your workflows, and get help from the team.

Start building the future of automated workflows.
