Image Vision Skill¶
Analyze images using LLM vision APIs.
Overview¶
The Image Vision skill teaches Amplifier how to:
- Understand image content
- Describe visual elements
- Answer questions about images
- Compare multiple images
- Extract text (OCR)
Loading the Skill¶
> Load the image-vision skill
Once loaded, Amplifier knows how to work with images.
When to Use¶
| Task | Use Image Vision |
|---|---|
| Describe what's in an image | ✅ |
| Extract text from screenshot | ✅ |
| Compare two images | ✅ |
| Answer questions about image | ✅ |
| Edit or modify images | ❌ Different tools |
| Generate images | ❌ Different tools |
Supported Providers¶
| Provider | Model | Vision Capable |
|---|---|---|
| Anthropic | Claude 3+ | ✅ |
| OpenAI | GPT-4V, GPT-4o | ✅ |
| Google | Gemini Pro Vision | ✅ |
| Azure OpenAI | GPT-4V | ✅ |
Core Patterns¶
Describe Image¶
> What's in this image?
[attach: screenshot.png]
Extract Text (OCR)¶
> Extract all text from this image
[attach: document-scan.png]
Answer Questions¶
> How many people are in this photo?
[attach: group-photo.jpg]
Compare Images¶
> What's different between these two screenshots?
[attach: before.png]
[attach: after.png]
Image Sources¶
Local Files¶
> Analyze the image at ./screenshots/error.png
URLs¶
> Describe the image at https://example.com/photo.jpg
Base64¶
For programmatic use:
import base64

# Read the image and base64-encode its bytes
with open("image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

# Include in the prompt as a data URI
data_uri = f"data:image/png;base64,{image_data}"
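A data URI needs a MIME type that matches the file, so hard-coding one only works for a single format. The following is a minimal sketch, assuming the type can be guessed from the file extension with the standard-library `mimetypes` module. The helper name `to_data_uri` is hypothetical and is not part of the skill.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png"
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    # Base64-encode the raw bytes and wrap them in the data URI scheme
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Calling `to_data_uri("image.png")` returns a string starting with `data:image/png;base64,` that can be placed in the prompt payload. Guessing by extension does not inspect the file contents, so a mislabeled file would still get the wrong type.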
Best Practices¶
Be Specific¶
# Vague
> What's this?
# Specific
> What error message is shown in this screenshot?
Provide Context¶
> This is a screenshot of our checkout page.
> Is the "Complete Purchase" button visible?
Ask Focused Questions¶
# Too broad
> Tell me everything about this image
# Focused
> What color is the submit button in this form?
Common Tasks¶
UI Validation¶
> Does this screenshot show a successful login?
> Is the error message displayed in red?
> Are all form fields filled in?
Content Understanding¶
> What products are shown in this catalog image?
> Summarize the key points from this infographic
> What's the main headline on this webpage?
Data Extraction¶
> Extract the table data from this spreadsheet screenshot
> What are the values in the pie chart?
> Read the receipt amounts from this photo
Comparison¶
> Has the layout changed between these two versions?
> What elements are missing in the second screenshot?
> Which design looks more professional?
Limitations¶
Cannot Do¶
- Edit or modify images
- Generate new images
- Process video directly (only individual frames, supplied as images)
- Guarantee 100% accurate OCR
Image Size¶
Large images may be resized. For text extraction:
- Use highest resolution available
- Ensure text is readable at smaller sizes
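To see in advance whether an image is likely to be resized, its pixel dimensions can be checked before sending. This is a rough sketch for PNG files only, reading the width and height from the IHDR chunk with the standard library. The `max_edge` default of 2048 is an illustrative assumption, not a documented limit of any provider, and both function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR" tag,
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def may_be_resized(data: bytes, max_edge: int = 2048) -> bool:
    """True if the longest edge exceeds max_edge (an assumed threshold)."""
    return max(png_size(data)) > max_edge
```

Only the first 24 bytes of the file are needed, so `may_be_resized(open("scan.png", "rb").read(24))` is enough for the check. If it returns `True`, cropping to the region that contains the text is usually a better fix than downscaling the whole image.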
Complex Documents¶
For complex documents with many elements:
- Ask about specific sections
- Break into multiple queries
Try It Yourself¶
Exercise 1: Describe Image¶
> Load the image-vision skill
> Describe what you see at https://picsum.photos/800/600
Exercise 2: Extract Text¶
Take a screenshot of any webpage and ask:
> Extract the main heading from this screenshot
Exercise 3: Compare¶
Take two screenshots and ask:
> What changed between these two screenshots?
Provider Configuration¶
Ensure your provider supports vision:
providers:
  - module: provider-anthropic
    config:
      model: claude-sonnet-4-20250514  # Supports vision
Troubleshooting¶
"Vision not supported"¶
- Check that the configured model supports vision
- Switch to a vision-capable model
"Image too large"¶
- Reduce image resolution
- Crop to relevant area
"Cannot read text"¶
- Improve image quality
- Increase contrast
- Try a different angle or lighting
Source¶
robotdad/skills/image-vision/
├── SKILL.md # Core workflow
├── patterns.md # Advanced patterns
└── setup.md # Provider setup