Image Vision Skill

Analyze images using LLM vision APIs.

Overview

The Image Vision skill teaches Amplifier how to:

  • Understand image content
  • Describe visual elements
  • Answer questions about images
  • Compare multiple images
  • Extract text (OCR)

Loading the Skill

> Load the image-vision skill

Once loaded, Amplifier knows how to work with images.

When to Use

Task Use Image Vision
Describe what's in an image
Extract text from screenshot
Compare two images
Answer questions about image
Edit or modify images ❌ Different tools
Generate images ❌ Different tools

Supported Providers

Provider Model Vision Capable
Anthropic Claude 3+
OpenAI GPT-4V, GPT-4o
Google Gemini Pro Vision
Azure OpenAI GPT-4V

Core Patterns

Describe Image

> What's in this image?
[attach: screenshot.png]

Extract Text (OCR)

> Extract all text from this image
[attach: document-scan.png]

Answer Questions

> How many people are in this photo?
[attach: group-photo.jpg]

Compare Images

> What's different between these two screenshots?
[attach: before.png]
[attach: after.png]

Image Sources

Local Files

> Analyze the image at ./screenshots/error.png

URLs

> Describe the image at https://example.com/photo.jpg

Base64

For programmatic use:

import base64

with open("image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

# Include in prompt with data URI

Best Practices

Be Specific

# Vague
> What's this?

# Specific
> What error message is shown in this screenshot?

Provide Context

> This is a screenshot of our checkout page.
> Is the "Complete Purchase" button visible?

Ask Focused Questions

# Too broad
> Tell me everything about this image

# Focused
> What color is the submit button in this form?

Common Tasks

UI Validation

> Does this screenshot show a successful login?
> Is the error message displayed in red?
> Are all form fields filled in?

Content Understanding

> What products are shown in this catalog image?
> Summarize the key points from this infographic
> What's the main headline on this webpage?

Data Extraction

> Extract the table data from this spreadsheet screenshot
> What are the values in the pie chart?
> Read the receipt amounts from this photo

Comparison

> Has the layout changed between these two versions?
> What elements are missing in the second screenshot?
> Which design looks more professional?

Limitations

Cannot Do

  • Edit or modify images
  • Generate new images
  • Process video (frame by frame only)
  • Guarantee 100% accurate OCR

Image Size

Large images may be resized. For text extraction: - Use highest resolution available - Ensure text is readable at smaller sizes

Complex Documents

For complex documents with many elements: - Ask about specific sections - Break into multiple queries

Try It Yourself

Exercise 1: Describe Image

> Load image-vision skill
> Describe what you see at https://picsum.photos/800/600

Exercise 2: Extract Text

Take a screenshot of any webpage and ask:

> Extract the main heading from this screenshot

Exercise 3: Compare

Take two screenshots and ask:

> What changed between these two screenshots?

Provider Configuration

Ensure your provider supports vision:

providers:
  - module: provider-anthropic
    config:
      model: claude-sonnet-4-20250514  # Supports vision

Troubleshooting

"Vision not supported"

  • Check model supports vision
  • Update to vision-capable model

"Image too large"

  • Reduce image resolution
  • Crop to relevant area

"Cannot read text"

  • Improve image quality
  • Increase contrast
  • Try different angle/lighting

Source

robotdad/skills/image-vision/
├── SKILL.md     # Core workflow
├── patterns.md  # Advanced patterns
└── setup.md     # Provider setup