Merge pull request #21 from divitkashyap/feat/vision-analysis

feat(vision-analysis): add image analysis skill with OCR, UI review, and chart extraction
2026-03-27 20:47:30 +08:00
parent 37046a3edb 0a61f2be6a
commit f87b423670
3 changed files with 179 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -22,6 +22,7 @@ Development skills for AI coding agents. Plug into your favorite AI coding tool
 | `pptx-generator` | Generate, edit, and read PowerPoint presentations. Create from scratch with PptxGenJS (cover, TOC, content, section divider, summary slides), edit existing PPTX via XML workflows, or extract text with markitdown. | Official |
 | `minimax-xlsx` | Open, create, read, analyze, edit, or validate Excel/spreadsheet files (.xlsx, .xlsm, .csv, .tsv). Covers creating new xlsx from scratch via XML templates, reading and analyzing with pandas, editing existing files with zero format loss, formula recalculation, validation, and professional financial formatting. | Official |
 | `minimax-docx` | Professional DOCX document creation, editing, and formatting using OpenXML SDK (.NET). Three pipelines: create new documents from scratch, fill/edit content in existing documents, or apply template formatting with XSD validation gate-check. | Official |
 | `vision-analysis` | Analyze, describe, and extract information from images using vision AI models. Supports describe, OCR, UI mockup review, chart data extraction, and object detection. Powered by MiniMax VL API with OpenAI GPT-4V fallback. | Community |
 | `minimax-multimodal-toolkit` | Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases. Covers TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract) via FFmpeg. | Official |
 ## Installation
--- a/README_zh.md
+++ b/README_zh.md
@@ -22,7 +22,11 @@
 | `pptx-generator` | 生成、编辑和读取 PowerPoint 演示文稿。支持用 PptxGenJS 从零创建（封面、目录、内容、分节页、总结页），通过 XML 工作流编辑现有 PPTX，或用 markitdown 提取文本。 | Official |
 | `minimax-xlsx` | 打开、创建、读取、分析、编辑或验证 Excel/电子表格文件（.xlsx、.xlsm、.csv、.tsv）。支持通过 XML 模板从零创建 xlsx、使用 pandas 读取分析、零格式损失编辑现有文件、公式重算与验证、专业财务格式化。 | Official |
 | `minimax-docx` | 基于 OpenXML SDK（.NET）的专业 DOCX 文档创建、编辑与排版。三条流水线：从零创建新文档、填写/编辑现有文档内容、应用模板格式并通过 XSD 验证门控检查。 | Official |
 | `vision-analysis` | 使用视觉 AI 模型分析、描述和提取图像信息。支持描述、OCR 文字识别、UI 界面审查、图表数据提取和物体检测。基于 MiniMax VL API，OpenAI GPT-4V 作为备选。 | Community |
 | `minimax-multimodal-toolkit` | 通过 MiniMax API 生成语音、音乐、视频和图片内容 — MiniMax 多模态使用场景的统一入口。涵盖 TTS（文字转语音、声音克隆、声音设计、多段合成）、音乐（带词歌曲、纯音乐）、视频（文生视频、图生视频、首尾帧、主体参考、模板、长视频多场景）、图片（文生图、图生图含角色参考），以及基于 FFmpeg 的媒体处理（格式转换、拼接、裁剪、提取）。 | Official |
 ## 安装
--- a/skills/vision-analysis/SKILL.md
+++ b/skills/vision-analysis/SKILL.md
@@ -0,0 +1,174 @@
 ---
 name: vision-analysis
 description: >
  Analyze, describe, and extract information from images using the MiniMax vision MCP tool.
  Use when: user shares an image file path or URL (any message containing .jpg, .jpeg, .png,
  .gif, .webp, .bmp, or .svg file extension) or uses any of these words/phrases near an image:
  "analyze", "analyse", "describe", "explain", "understand", "look at", "review",
  "extract text", "OCR", "what is in", "what's in", "read this image", "see this image",
  "tell me about", "explain this", "interpret this", in connection with an image, screenshot,
  diagram, chart, mockup, wireframe, or photo.
  Also triggers for: UI mockup review, wireframe analysis, design critique, data extraction
  from charts, object detection, person/animal/activity identification.
  Triggers: any message with an image file extension (jpg, jpeg, png, gif, webp, bmp, svg),
  or any request to analyze/describe/understand/review/extract text from an image, screenshot,
  diagram, chart, photo, mockup, or wireframe.
 license: MIT
 metadata:
  version: "1.0"
  category: ai-vision
  sources:
    - MiniMax Token Plan MCP (understand_image tool)
 ---
 # Vision Analysis
 Analyze images using the `MiniMax_understand_image` MCP tool, which is available with the MiniMax Token Plan.
 ## Prerequisites
 - MiniMax Token Plan subscription with valid `MINIMAX_API_KEY`
 - MiniMax MCP configured (`MiniMax_understand_image` tool available)
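The two prerequisites above can be partly checked before any image is analyzed. The following is a minimal sketch, not part of the skill: `check_prerequisites` is a hypothetical helper, and it only confirms that `MINIMAX_API_KEY` is set and that `uvx` (used by the setup commands in this document) is on `PATH`. It cannot tell whether the key belongs to a Token Plan or whether the MCP tool is actually registered.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `which` are injectable so the check can be exercised
    without touching the real environment (a testing convenience).
    """
    env = os.environ if env is None else env
    problems = []
    # The skill needs a Token Plan key exposed as MINIMAX_API_KEY.
    if not env.get("MINIMAX_API_KEY"):
        problems.append("MINIMAX_API_KEY is not set")
    # The setup commands in this document launch the MCP server through uvx.
    if which("uvx") is None:
        problems.append("uvx was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

An empty result only means the local environment looks plausible; verifying with `/mcp` after a restart is still the authoritative check.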
 ### If MCP tool is not configured
 **Step 1:** The agent should fetch setup instructions from:
 **https://platform.minimaxi.com/docs/token-plan/mcp-guide**
 **Step 2:** Detect the user's environment (OpenCode, Cursor, Claude Code, etc.) and output the exact commands needed. Common examples:
 **OpenCode**: add to `~/.config/opencode/opencode.json` or a project-level `opencode.json`:
 ```json
 {
  "mcp": {
    "MiniMax": {
      "type": "local",
      "command": ["uvx", "minimax-coding-plan-mcp", "-y"],
      "environment": {
        "MINIMAX_API_KEY": "YOUR_TOKEN_PLAN_KEY",
        "MINIMAX_API_HOST": "https://api.minimaxi.com"
      },
      "enabled": true
    }
  }
 }
 ```
 **Claude Code**:
 ```bash
 claude mcp add -s user MiniMax --env MINIMAX_API_KEY=your-key --env MINIMAX_API_HOST=https://api.minimaxi.com -- uvx minimax-coding-plan-mcp -y
 ```
 **Cursor** — add to MCP settings:
 ```json
 {
  "mcpServers": {
    "MiniMax": {
      "command": "uvx",
      "args": ["minimax-coding-plan-mcp"],
      "env": {
        "MINIMAX_API_KEY": "your-key",
        "MINIMAX_API_HOST": "https://api.minimaxi.com"
      }
    }
  }
 }
 ```
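Step 2 asks the agent to output exact commands for the detected environment. One way to avoid retyping the snippets is to render them from the user's key, as in the sketch below. The function names are hypothetical; the values (`uvx`, `minimax-coding-plan-mcp`, the `-y` flag, and the host URL) are copied from the examples above rather than from any other source.

```python
import json
import shlex

PACKAGE = "minimax-coding-plan-mcp"
DEFAULT_HOST = "https://api.minimaxi.com"


def cursor_config(api_key, host=DEFAULT_HOST):
    """Return the Cursor MCP settings shown above as a JSON string."""
    config = {
        "mcpServers": {
            "MiniMax": {
                "command": "uvx",
                "args": [PACKAGE],
                "env": {"MINIMAX_API_KEY": api_key, "MINIMAX_API_HOST": host},
            }
        }
    }
    return json.dumps(config, indent=2)


def claude_code_command(api_key, host=DEFAULT_HOST):
    """Return the `claude mcp add` command shown above as one shell line."""
    parts = [
        "claude", "mcp", "add", "-s", "user", "MiniMax",
        "--env", f"MINIMAX_API_KEY={api_key}",
        "--env", f"MINIMAX_API_HOST={host}",
        "--", "uvx", PACKAGE, "-y",
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(parts)


if __name__ == "__main__":
    print(claude_code_command("your-key"))
    print(cursor_config("your-key"))
```

Building the command from a list and joining it with `shlex.join` keeps a key that contains spaces or quotes from breaking the shell line.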
 **Step 3:** After configuration, tell the user to restart their app and verify with `/mcp`.
 **Important:** If the user does not have a MiniMax Token Plan subscription, inform them that the `understand_image` tool requires one. It does not work with free-tier API keys or with keys from other plans.
 ## Analysis Modes
 | Mode | When to use | Prompt strategy |
 |---|---|---|
 | `describe` | General image understanding | Ask for detailed description |
 | `ocr` | Text extraction from screenshots, documents | Ask to extract all text verbatim |
 | `ui-review` | UI mockups, wireframes, design files | Ask for design critique with suggestions |
 | `chart-data` | Charts, graphs, data visualizations | Ask to extract data points and trends |
 | `object-detect` | Identify objects, people, activities | Ask to list and locate all elements |
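The table says when each mode applies but not how to choose one from a user message. A rough sketch of one possible heuristic follows. The keyword lists and the `select_mode` helper are assumptions made for illustration; the skill itself does not prescribe them, and an agent may well choose the mode by reading the request directly.

```python
import re

# Hypothetical keyword hints per mode, checked in order. Not part of the skill.
MODE_HINTS = [
    ("ocr", ("ocr", "extract text", "transcribe")),
    ("ui-review", ("mockup", "wireframe", "ui", "design")),
    ("chart-data", ("chart", "graph", "plot")),
    ("object-detect", ("detect", "objects", "identify", "locate")),
]


def select_mode(request):
    """Return the first mode whose hint appears as a whole word, else 'describe'."""
    text = request.lower()
    for mode, hints in MODE_HINTS:
        for hint in hints:
            # Word boundaries keep "ui" from matching inside "build" or "guide".
            if re.search(r"\b" + re.escape(hint) + r"\b", text):
                return mode
    # General image understanding is the fallback mode in the table.
    return "describe"
```

Falling back to `describe` mirrors the table, where that mode covers general image understanding.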
 ## Workflow
 ### Step 1: Auto-detect image
 The skill triggers automatically when a message contains an image file path or URL ending in one of these extensions:
 `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`, `.svg`
 Extract the image path from the message.
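The extraction step can be sketched as below. This is an illustration under stated assumptions, not the skill's implementation: `find_image_refs` is a hypothetical helper, it splits on whitespace (so paths containing spaces are not handled), and the punctuation it trims around each token is a guess at how paths usually appear in prose.

```python
import os
from urllib.parse import urlparse

# Extensions listed in Step 1.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".svg"}

# Punctuation that commonly wraps a path or URL in prose (an assumption).
LEADING = "\"'`([<"
TRAILING = "\"'`)]>.,;:!?"


def find_image_refs(message):
    """Return whitespace-delimited tokens that look like image paths or URLs."""
    refs = []
    for token in message.split():
        candidate = token.lstrip(LEADING).rstrip(TRAILING)
        # For URLs, inspect only the path so a query string does not hide the extension.
        path = urlparse(candidate).path if "://" in candidate else candidate
        if os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS:
            refs.append(candidate)
    return refs
```

The leading `.` is deliberately not trimmed, so relative paths such as `./shots/login.png` survive intact.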
 ### Step 2: Select analysis mode and call MCP tool
 Use the `MiniMax_understand_image` tool with a mode-specific prompt:
 **describe:**
 ```
 Provide a detailed description of this image. Include: main subject, setting/background,
 colors/style, any text visible, notable objects, and overall composition.
 ```
 **ocr:**
 ```
 Extract all text visible in this image verbatim. Preserve structure and formatting
 (headers, lists, columns). If no text is found, say so.
 ```
 **ui-review:**
 ```
 You are a UI/UX design reviewer. Analyze this interface mockup or design. Provide:
 (1) Strengths — what works well, (2) Issues — usability or design problems,
 (3) Specific, actionable suggestions for improvement. Be constructive and detailed.
 ```
 **chart-data:**
 ```
 Extract all data from this chart or graph. List: chart title, axis labels, all
 data points/series with values if readable, and a brief summary of the trend.
 ```
 **object-detect:**
 ```
 List all distinct objects, people, and activities you can identify. For each,
 describe what it is and its approximate location in the image.
 ```
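The five prompts above can be kept in a single lookup so that an unknown mode fails loudly instead of silently sending an empty prompt. The sketch below only builds the prompt text. How the prompt and image are passed to `MiniMax_understand_image` depends on the host agent and is not shown here. `PROMPTS` and `build_prompt` are hypothetical names.

```python
# Prompt text copied from the mode-specific prompts in Step 2.
PROMPTS = {
    "describe": (
        "Provide a detailed description of this image. Include: main subject, "
        "setting/background, colors/style, any text visible, notable objects, "
        "and overall composition."
    ),
    "ocr": (
        "Extract all text visible in this image verbatim. Preserve structure and "
        "formatting (headers, lists, columns). If no text is found, say so."
    ),
    "ui-review": (
        "You are a UI/UX design reviewer. Analyze this interface mockup or design. "
        "Provide: (1) Strengths — what works well, (2) Issues — usability or design "
        "problems, (3) Specific, actionable suggestions for improvement. "
        "Be constructive and detailed."
    ),
    "chart-data": (
        "Extract all data from this chart or graph. List: chart title, axis labels, "
        "all data points/series with values if readable, and a brief summary of the trend."
    ),
    "object-detect": (
        "List all distinct objects, people, and activities you can identify. For each, "
        "describe what it is and its approximate location in the image."
    ),
}


def build_prompt(mode):
    """Return the prompt for `mode`, or raise ValueError for an unknown mode."""
    try:
        return PROMPTS[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(PROMPTS)}") from None
```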
 ### Step 3: Present results
 Return the analysis clearly. For `describe`, use readable prose. For `ocr`, preserve structure. For `ui-review`, use a structured critique format.
 ## Output Format Example
 For describe mode:
 ```
 ## Image Description
 [Detailed description of the image contents...]
 ```
 For ocr mode:
 ```
 ## Extracted Text
 [Preserved text structure from the image]
 ```
 For ui-review mode:
 ```
 ## UI Design Review
 ### Strengths
 - ...
 ### Issues
 - ...
 ### Suggestions
 - ...
 ```
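The output formats above could be produced by a small formatter like the one sketched here. Both helper names are hypothetical. Only the three headings shown above are known, so the other modes are returned without a heading rather than given an invented one.

```python
# Headings taken from the output format examples; other modes have none defined.
HEADINGS = {
    "describe": "## Image Description",
    "ocr": "## Extracted Text",
    "ui-review": "## UI Design Review",
}


def format_result(mode, body):
    """Prefix the analysis text with the heading for `mode`, if one is defined."""
    text = body.strip()
    heading = HEADINGS.get(mode)
    return f"{heading}\n{text}" if heading else text


def format_ui_review(strengths, issues, suggestions):
    """Render the three-section critique layout from lists of short strings."""
    lines = [HEADINGS["ui-review"]]
    sections = (("Strengths", strengths), ("Issues", issues), ("Suggestions", suggestions))
    for title, items in sections:
        lines.append(f"### {title}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)
```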
 ## Notes
 - Images up to 20 MB are supported (JPEG, PNG, GIF, WebP)
 - Local file paths work if MiniMax MCP is configured with file access
 - The `MiniMax_understand_image` tool is provided by the `minimax-coding-plan-mcp` package
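The size and format limits in the notes can be checked locally before calling the tool. This is a sketch under two assumptions: "20MB" is read as 20 MiB, and only the four formats named in the notes are accepted. The trigger list earlier in this file also includes `.bmp` and `.svg`, so those would be rejected here even though they trigger the skill. The helper names are hypothetical.

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # assumption: "20MB" interpreted as 20 MiB
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_image(name, size_bytes):
    """Return None if the image looks acceptable, else a short reason string."""
    ext = os.path.splitext(name)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {ext or 'no extension'}"
    if size_bytes > MAX_BYTES:
        return f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit"
    return None


def check_image_file(path):
    """Same check for a local file, reading its size from disk."""
    return check_image(path, os.path.getsize(path))
```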