---
name: minimax-multimodal-toolkit
description: >
  MiniMax multimodal model skill. Use MiniMax Multi-Modal models for speech, music, video, and image.
  Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design,
  multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame,
  subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character
  reference), and media processing (convert, concat, trim, extract).
  Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI,
  MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
license: MIT
metadata:
  version: "1.0"
  category: media-generation
---

# MiniMax Multi-Modal Toolkit

Generate voice, music, video, and image content via MiniMax APIs. This skill is the unified entry point for **MiniMax multimodal** use cases (audio + music + video + image). It includes voice cloning and voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.

## Output Directory

**All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.

**Rules:**
1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create it if needed: `mkdir -p minimax-output`)
2. Always use absolute paths or paths relative to the agent's working directory: `--output minimax-output/video.mp4`
3. **Never** `cd` into the skill directory to run scripts. Run them from the agent's working directory using the full script path
4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. Clean them up when they are no longer needed: `rm -rf minimax-output/tmp`
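
The rules above can be wrapped in a small helper. This is a minimal sketch, not part of the toolkit: the function name `output_path` is hypothetical, and it only creates the directory and prints an absolute path suitable for `--output`.

```bash
# Hypothetical helper (not shipped with the toolkit).
# Creates minimax-output/ under the current working directory and prints
# the absolute path for the given file name.
output_path() {
  local name="$1"
  mkdir -p "minimax-output" || return 1
  printf '%s/minimax-output/%s\n' "$PWD" "$name"
}
```

It could then be used as `-o "$(output_path hello.mp3)"` when calling a script by its full path from the agent's working directory.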

## Prerequisites

```bash
brew install ffmpeg jq  # macOS (or apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh
```

No Python or pip required. All scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.

### API Host Configuration

MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:

| Region | Platform URL | API Host Value |
|--------|-------------|----------------|
| China Mainland | https://platform.minimaxi.com | `https://api.minimaxi.com` |
| Global | https://platform.minimax.io | `https://api.minimax.io` |

```bash
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"

# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
```

**IMPORTANT, when the API host is missing:**
Before running any script, check whether `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Help the user set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the Global variant), either in their terminal or in their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
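
The host check described above might look like the following sketch. Both function names and the region keywords `cn` and `global` are illustrative assumptions, not toolkit options; only the two host URLs come from the table.

```bash
# Hypothetical helper: map a region keyword to the documented API host.
api_host_for_region() {
  case "$1" in
    cn)     echo "https://api.minimaxi.com" ;;
    global) echo "https://api.minimax.io" ;;
    *)      echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Returns 0 when MINIMAX_API_HOST is set and non-empty, 1 otherwise.
require_api_host() {
  if [ -z "${MINIMAX_API_HOST:-}" ]; then
    echo "MINIMAX_API_HOST is not set; ask the user for their region first." >&2
    return 1
  fi
}
```

Once the user has answered, `export MINIMAX_API_HOST="$(api_host_for_region global)"` sets the value for the current shell session.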

### API Key Configuration

Set the `MINIMAX_API_KEY` environment variable before running any script:

```bash
export MINIMAX_API_KEY="your-api-key-here"
```

The key starts with `sk-api-` or `sk-cp-` and can be obtained from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global).

**IMPORTANT, when the API key is missing:**
Before running any script, check whether `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Help the user set it via `export MINIMAX_API_KEY="sk-..."`, either in their terminal or in their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
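
A quick local sanity check on the key format can catch copy-paste mistakes before any API call. The sketch below assumes only the two prefixes stated above; the function name is hypothetical and the check never contacts the API, so a passing key may still be rejected by the server.

```bash
# Hypothetical check: succeed only if the argument starts with one of the
# documented prefixes (sk-api- or sk-cp-) followed by at least one character.
api_key_looks_valid() {
  case "${1:-}" in
    sk-api-?*|sk-cp-?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `api_key_looks_valid "${MINIMAX_API_KEY:-}" || echo "ask the user for a key"` covers both the unset and the malformed case.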

## Plan Limits & Quotas

**IMPORTANT: Always check the user's plan limits before generating content.** If the user's quota is exhausted or insufficient, warn them before proceeding.

### Standard Plans

| Capability | Starter | Plus | Max |
|---|---|---|---|
| M2.7 (chat) | 600 req/5h | 1,500 req/5h | 4,500 req/5h |
| Speech 2.8 | n/a | 4,000 chars/day | 11,000 chars/day |
| image-01 | n/a | 50 images/day | 120 images/day |
| Hailuo-2.3-Fast 768P 6s | n/a | n/a | 2 videos/day |
| Hailuo-2.3 768P 6s | n/a | n/a | 2 videos/day |
| Music-2.5 | n/a | n/a | 4 songs/day (≤5 min each) |

### High-Speed Plans

| Capability | Plus-HS | Max-HS | Ultra-HS |
|---|---|---|---|
| M2.7-highspeed (chat) | 1,500 req/5h | 4,500 req/5h | 30,000 req/5h |
| Speech 2.8 | 9,000 chars/day | 19,000 chars/day | 50,000 chars/day |
| image-01 | 100 images/day | 200 images/day | 800 images/day |
| Hailuo-2.3-Fast 768P 6s | n/a | 3 videos/day | 5 videos/day |
| Hailuo-2.3 768P 6s | n/a | 3 videos/day | 5 videos/day |
| Music-2.5 | n/a | 7 songs/day (≤5 min each) | 15 songs/day (≤5 min each) |

**Key quota constraints:**
- **Video resolution: 768P only.** 1080P is not available on any plan
- **Video duration: 6s.** All plan quotas are counted in 6-second units
- **Video quota is very limited** (2–5/day depending on plan). Always confirm with the user before generating video
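
As an illustration of the confirmation step, the video rows of the two tables can be turned into a lookup. This is a sketch under stated assumptions: the plan keys (`max`, `max-hs`, and so on) are made up for the example, plans without a video entry are treated as 0 videos/day, the figure is read as the quota of a single video model row, and the number of videos already used today must be supplied by the caller because the sketch cannot query real usage.

```bash
# Daily video quota per plan, copied from the Hailuo rows of the tables.
# Plan keys are illustrative; plans without a video entry are assumed to be 0.
video_quota_per_day() {
  case "$1" in
    max)      echo 2 ;;
    max-hs)   echo 3 ;;
    ultra-hs) echo 5 ;;
    starter|plus|plus-hs) echo 0 ;;
    *) echo "unknown plan: $1" >&2; return 1 ;;
  esac
}

# Succeeds when REQUESTED more 6s videos still fit into today's quota.
# Usage: fits_video_quota PLAN REQUESTED [USED_TODAY]
fits_video_quota() {
  local plan="$1" requested="$2" used_today="${3:-0}" quota
  quota="$(video_quota_per_day "$plan")" || return 1
  [ $((used_today + requested)) -le "$quota" ]
}
```

When `fits_video_quota` fails, the agent should warn the user instead of starting the generation.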

## Key Capabilities

| Capability | Description | Entry point |
|------------|-------------|-------------|
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.sh` |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | `scripts/tts/generate_voice.sh clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.sh design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.sh` |
| Image Generation | Text-to-image, image-to-image with character reference | `scripts/image/generate_image.sh` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.sh` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.sh` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | `scripts/media_tools.sh` |

## TTS (Text-to-Speech)

Entry point: `scripts/tts/generate_voice.sh`

### IMPORTANT: Single voice vs Multi-segment (choose the right approach)

| User intent | Approach |
|-------------|----------|
| Single voice / no multi-character need | `tts` command: generate the entire text in one call |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |

**Default behavior:** When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split the text into segments or use the multi-segment pipeline. Just pass the full text to `tts` in one call.

Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds **10,000 characters** (API limit per request). In this case, split it into segments that all use the same voice
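
The decision rule above can be sketched as a small function. The name `choose_tts_command` and its arguments are hypothetical; the 10,000-character threshold is the limit stated above. Note that `${#text}` counts characters only in a UTF-8 locale (in the C locale it counts bytes), which matters for Chinese text.

```bash
# Hypothetical decision helper mirroring the rules above.
# $1 = text to synthesize, $2 = number of distinct voices needed (default 1).
choose_tts_command() {
  local text="$1" voices="${2:-1}"
  if [ "$voices" -gt 1 ] || [ "${#text}" -gt 10000 ]; then
    echo "generate"   # multi-segment pipeline with segments.json
  else
    echo "tts"        # single call with the full text
  fi
}
```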

### Single-voice generation (DEFAULT)

```bash
bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3
```

### Multi-segment generation (multi-voice / audiobook / podcast)

**Complete workflow (follow ALL steps in order):**

1. **Write segments.json**: split the text into segments with voice assignments (see format and rules below)
2. **Run the `generate` command**: this reads segments.json, generates audio for EACH segment via the TTS API, then merges the segments into a single output file with crossfade

```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json (this is the CRITICAL step).
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200
```

**Do NOT skip Step 2.** Writing segments.json alone does nothing. You MUST run the `generate` command to actually produce audio.

### Voice management

```bash
# List all available voices
bash scripts/tts/generate_voice.sh list-voices

# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice

# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
```

### Audio processing

```bash
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3
```

### TTS Models

| Model | Notes |
|-------|-------|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |

### segments.json Format

Default crossfade between segments: **200ms** (`--crossfade 200`).

```json
[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```

Leave `emotion` empty for speech-2.8 models (auto-matched from text).

### IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.

**Rule: Narration and dialogue are ALWAYS separate segments.**

A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`

**Example: audiobook with narrator + 2 characters:**

```json
[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```

**Key principles:**
1. **Narrator** uses a consistent neutral narrator voice throughout
2. **Each character** has a dedicated voice_id, maintained consistently across all their dialogue
3. **Split at dialogue boundaries**: `"He said:"` belongs to the narrator, the quoted content belongs to the character
4. **Do NOT merge** narrator text and character speech into a single segment
5. For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
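
The split at a dialogue boundary can be illustrated with plain bash parameter expansion. This is a deliberately naive sketch: the function name is hypothetical, it only handles the `speaker: dialogue` pattern from the example above, and real text needs more care (quotation marks, colons inside dialogue, lines without attribution).

```bash
# Hypothetical splitter: cut a line at the first ": " into a narrator part
# and a character part, printed as tab-separated "role<TAB>text" lines.
split_dialogue() {
  local line="$1"
  case "$line" in
    *": "*)
      printf 'narrator\t%s\n' "${line%%: *}:"   # text before the first ": "
      printf 'character\t%s\n' "${line#*: }"    # text after the first ": "
      ;;
    *)
      printf 'narrator\t%s\n' "$line"           # no dialogue marker found
      ;;
  esac
}
```

Each output line would then become one entry in segments.json with the matching `voice_id`.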

## Music Generation

Entry point: `scripts/music/generate_music.sh`

### IMPORTANT: Instrumental vs Lyrics (when to use which)

| Scenario | Mode | Action |
|----------|------|--------|
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |

**When adding background music to video or voice content**, always default to instrumental mode (`--instrumental`). Do not ask the user: BGM should never have vocals competing with the main content.

**When the user explicitly asks to create/generate music as the primary task**, ask them whether they want:
- Instrumental (pure music, no vocals)
- With lyrics (song with vocals; the user provides the lyrics or you help write them)

```bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download

# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download

# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download
```

### Music Model

Default model: `music-2.5`

`music-2.5` does **not** support instrumental output directly. When `--instrumental` is passed, the script automatically applies a workaround:
- Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals) and appends `pure music, no lyrics` to the prompt

This produces instrumental-style output without manual intervention. You can always use `--instrumental` and let the script handle the rest.
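
To make the workaround concrete, the transformation can be sketched as follows. The function name is hypothetical and the real script may differ in detail; in particular, the `, ` separator before the appended phrase is an assumption.

```bash
# Sketch of the instrumental workaround described above.
# Prints the placeholder lyrics on line 1 and the adjusted prompt on line 2.
instrumental_workaround() {
  local prompt="$1"
  printf '%s\n' "[intro] [outro]"                    # structural tags only, no vocals
  printf '%s\n' "${prompt}, pure music, no lyrics"   # separator is an assumption
}
```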

## Image Generation

Entry point: `scripts/image/generate_image.sh`

Model: `image-01`, photorealistic image generation from text prompts, with optional character reference for image-to-image.

### IMPORTANT: Mode Selection (t2i vs i2i)

| User intent | Mode |
|-------------|------|
| Generate image from text description (default) | `t2i` (text-to-image) |
| Generate image with a character reference photo (keep same person) | `i2i` (image-to-image) |

**Default behavior:** When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.

### IMPORTANT: Aspect Ratio (infer from user context)

Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:

| User intent / context | Recommended ratio | Resolution |
|-----------------------|-------------------|------------|
| Avatar, icon, social media profile picture | `1:1` | 1024×1024 |
| Landscape, banner, desktop wallpaper | `16:9` | 1280×720 |
| Traditional photo, classic proportions | `4:3` | 1152×864 |
| Photography, magazine cover | `3:2` | 1248×832 |
| Vertical portrait photo, poster | `2:3` | 832×1248 |
| Tall poster, book cover | `3:4` | 864×1152 |
| Phone wallpaper, social media story, reel | `9:16` | 720×1280 |
| Ultra-wide panorama, cinematic ultrawide | `21:9` | 1344×576 |
| No specific requirement / ambiguous | `1:1` | 1024×1024 |
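
A keyword matcher gives a rough idea of how the table above could be applied. This is only a sketch: the function name is hypothetical, it matches English keywords as substrings (so `story` also matches `history`), and an agent should still weigh the full context of the request rather than rely on keywords alone.

```bash
# Hypothetical keyword matcher for the aspect-ratio table.
# Matching is case-insensitive and the first matching branch wins, so more
# specific keywords are listed before broader ones.
infer_aspect_ratio() {
  local req
  req="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$req" in
    *panoram*|*ultrawide*|*ultra-wide*)          echo "21:9" ;;
    *"phone wallpaper"*|*story*|*reel*)          echo "9:16" ;;
    *"book cover"*|*"tall poster"*)              echo "3:4" ;;
    *portrait*|*poster*)                         echo "2:3" ;;
    *magazine*|*photography*)                    echo "3:2" ;;
    *"classic photo"*)                           echo "4:3" ;;
    *landscape*|*banner*|*"desktop wallpaper"*)  echo "16:9" ;;
    *)                                           echo "1:1" ;;  # avatars, icons, ambiguous
  esac
}
```

The result can be passed straight to `--aspect-ratio`.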

### IMPORTANT: Image Count (when to generate multiple images)

| User intent | Count (`-n`) |
|-------------|--------------|
| Default / single image request | `1` (default) |
| User says "a few", "several", "some" | `3` |
| User asks for "variations", "options", "alternatives" | `3`–`4` |
| User specifies an exact number | Use the specified number (1–9) |

### Text-to-Image Examples

```bash
# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png

# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png

# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png

# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png

# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png
```

### Image-to-Image (Character Reference)

Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).

```bash
# Character reference: place the same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png

# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png
```

### Aspect Ratio Reference

| Ratio | Resolution | Best for |
|-------|------------|----------|
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
| `16:9` | 1280×720 | Landscape, banner, desktop wallpaper |
| `4:3` | 1152×864 | Classic photo, presentations |
| `3:2` | 1248×832 | Photography, magazine layout |
| `2:3` | 832×1248 | Portrait photo, poster |
| `3:4` | 864×1152 | Book cover, tall poster |
| `9:16` | 720×1280 | Phone wallpaper, social story/reel |
| `21:9` | 1344×576 | Ultra-wide panoramic, cinematic |

### Key Options

| Option | Description |
|--------|-------------|
| `--prompt TEXT` | Image description, max 1500 chars (required) |
| `--aspect-ratio RATIO` | Aspect ratio (see table above). Infer from user context |
| `--width PX` / `--height PX` | Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` if both set |
| `-n N` | Number of images to generate, 1–9 (default 1) |
| `--seed N` | Random seed for reproducibility. Same seed + same params → similar results |
| `--prompt-optimizer` | Enable automatic prompt optimization by the API |
| `--ref-image FILE` | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| `--no-download` | Print image URLs instead of downloading files |
| `--aigc-watermark` | Add AIGC watermark to generated images |
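
The constraints on `--width` and `--height` are easy to get wrong, so a pre-flight check may help. The sketch below is an assumption about how one might validate a value locally; the function name is hypothetical, and the script or API may apply further rules of its own.

```bash
# Hypothetical pre-flight check for a --width / --height value:
# an integer from 512 to 2048 that is a multiple of 8.
valid_dimension() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty and non-numeric input
  esac
  # 10# forces base 10 so a leading zero is not parsed as octal
  [ "$1" -ge 512 ] && [ "$1" -le 2048 ] && [ $(( 10#$1 % 8 )) -eq 0 ]
}
```

Both values must pass, since the two options are required together.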

## Video Generation

### IMPORTANT: Single vs Multi-Segment (choose the right script)

| User intent | Script to use |
|-------------|---------------|
| Default / no special request | `scripts/video/generate_video.sh` (single segment, **6s, 768P**) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` (multi-segment) |

**Default behavior:** Always use single-segment `generate_video.sh` with **duration 6s and resolution 768P** unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments: a single 6s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.

Entry point (single video): `scripts/video/generate_video.sh`
Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`

### Video Model Constraints (MUST follow)

**Supported resolutions and durations by model:**

| Model | Resolution | Duration |
|-------|-----------|----------|
| MiniMax-Hailuo-2.3 | 768P only | 6s or 10s |
| MiniMax-Hailuo-2.3-Fast | 768P only | 6s or 10s |
| MiniMax-Hailuo-02 | 512P, 768P (default) | 6s or 10s |
| T2V-01 / T2V-01-Director | 720P | 6s only |
| I2V-01 / I2V-01-Director / I2V-01-live | 720P | 6s only |
| S2V-01 (ref) | 720P | 6s only |

**Key rules:**
- **Default: 6s + 768P.** Plan quotas are counted in 6-second units; use 6s unless the user explicitly requests 10s
- **1080P is NOT supported** on any plan. Always use 768P for Hailuo-2.3/2.3-Fast
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
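
The constraints table can be encoded as a validator that runs before any generation call. This is a sketch: the function name is hypothetical, the model names, resolutions, and durations are copied from the table, and the duration is passed as a bare number of seconds.

```bash
# Hypothetical validator built from the constraints table above.
# Usage: valid_video_params MODEL RESOLUTION DURATION_SECONDS
valid_video_params() {
  local model="$1" res="$2" dur="$3"
  case "$model" in
    MiniMax-Hailuo-2.3|MiniMax-Hailuo-2.3-Fast)
      [ "$res" = "768P" ] || return 1
      [ "$dur" = "6" ] || [ "$dur" = "10" ] ;;
    MiniMax-Hailuo-02)
      { [ "$res" = "512P" ] || [ "$res" = "768P" ]; } || return 1
      [ "$dur" = "6" ] || [ "$dur" = "10" ] ;;
    T2V-01|T2V-01-Director|I2V-01|I2V-01-Director|I2V-01-live|S2V-01)
      [ "$res" = "720P" ] && [ "$dur" = "6" ] ;;
    *) return 1 ;;   # unknown model
  esac
}
```

A failing check is a signal to fall back to the 6s, 768P default or to ask the user.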

### IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.

**Optimization steps:**

1. **Apply the Professional Formula**: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
   - BAD: `"A puppy in a park"`
   - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`

2. **Add camera instructions** using `[指令]` (bracketed instruction) syntax: `[推进]` (push in), `[拉远]` (pull out), `[跟随]` (follow), `[固定]` (fixed), `[左摇]` (pan left), etc.

3. **Include aesthetic details**: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)

4. **Keep to 1–2 key actions** for 6–10 second videos; do not overcrowd the clip with events

5. **For i2v mode** (image-to-video): Focus the prompt on **movement and change only**, since the image already establishes the visual. Do NOT re-describe what's in the image.
   - BAD: `"A lake with mountains"` (just repeating the image)
   - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`

6. **For multi-segment long videos**: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
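
As a mechanical illustration of the Professional Formula, the five parts can be joined into one prompt string. This sketch assumes a plain comma-separated join, which is only one reasonable reading of the formula; the function name is hypothetical, and writing a good prompt still requires judgment that no helper provides.

```bash
# Hypothetical helper: join the five formula parts into one prompt string.
# Usage: build_video_prompt SUBJECT SCENE MOVEMENT CAMERA AESTHETIC
# CAMERA may include a bracketed camera instruction as described above.
build_video_prompt() {
  if [ "$#" -ne 5 ]; then
    echo "expected 5 parts: subject scene movement camera aesthetic" >&2
    return 1
  fi
  printf '%s, %s, %s, %s, %s\n' "$@"
}
```

The check on the argument count makes a missing part (for example, no camera motion) fail loudly instead of silently producing a weaker prompt.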

```bash
# Text-to-video (default: 6s, 768P)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4

# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4
```

### Long-form Video (Multi-scene)

Multi-scene long videos chain segments together: the first segment is generated via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. The default is 6 seconds per segment.

**Workflow:**
1. Segment 1: t2v, generated purely from the optimized text prompt
2. Segment 2+: i2v, where the previous segment's last frame becomes `first_frame_image` and the prompt describes **motion and change from that ending state**
3. All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
4. Optional: AI-generated background music is overlaid

**Prompt rules for each segment:**
- Each segment prompt MUST be independently optimized using the Professional Formula
- Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
- Segment 2+ (i2v): Focus on **what changes and moves** from the previous ending frame. Do NOT repeat the visual description, because the first frame already provides it
- Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
- Each segment covers only 6 seconds of action, so keep it focused
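
For planning purposes, the length of the final video can be estimated from the segment count. The sketch below assumes each crossfade overlaps the two adjacent segments, so every transition shortens the result by the crossfade length; whether the script's output matches this exactly is not confirmed here. The function name is hypothetical, and times are in milliseconds because bash arithmetic is integer-only.

```bash
# Rough duration estimate for a multi-segment video (a sketch, not toolkit behavior).
# Usage: long_video_duration_ms SEGMENTS [SEGMENT_MS] [CROSSFADE_MS]
long_video_duration_ms() {
  local segments="$1" segment_ms="${2:-6000}" crossfade_ms="${3:-500}"
  [ "$segments" -ge 1 ] || return 1
  # n segments have n-1 transitions, each overlapping by the crossfade length
  echo $(( segments * segment_ms - (segments - 1) * crossfade_ms ))
}
```

Under that assumption, three default segments give about 17 seconds. If each segment also counts as one 6-second quota unit (an inference from the quota notes, not a stated rule), a three-scene video would need more video generations than the 2/day listed for the Max plan, which is another reason to confirm with the user first.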

```bash
# Example: 3-segment story with optimized per-segment prompts (default: 6s/segment, 768P)
bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4

# With custom settings
bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 6 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4
```

### Add Background Music

```bash
bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4
```

### Template Video

```bash
bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4
```

### Video Models

| Mode | Default Model | Default Duration | Default Resolution | Notes |
|------|--------------|-----------------|-------------------|-------|
| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |

## Media Tools (Audio/Video Processing)

Entry point: `scripts/media_tools.sh`

Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via the MiniMax API.

### Video Format Conversion

```bash
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov

# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30
```

### Audio Format Conversion

```bash
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2
```

### Video Concatenation

```bash
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4

# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
```

### Audio Concatenation

```bash
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3

# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1
```

### Extract Audio from Video

```bash
# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k
```

### Video Trimming

```bash
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15

# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
```

### Add Audio to Video (Overlay / Replace)

```bash
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3

# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace
```

### Media File Info

```bash
bash scripts/media_tools.sh probe input.mp4
```

## Script Architecture

```
scripts/
├── check_environment.sh            # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh                  # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh           # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh           # Music generation CLI
├── image/
│   └── generate_image.sh           # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh           # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh      # Multi-scene long video
    ├── generate_template_video.sh  # Template-based video
    └── add_bgm.sh                  # Background music overlay
```

## References

Read these for detailed API parameters, voice catalogs, and prompt engineering:

- [tts-guide.md](references/tts-guide.md): TTS setup, voice management, audio processing, segment format, troubleshooting
- [tts-voice-catalog.md](references/tts-voice-catalog.md): Full voice catalog with IDs, descriptions, and parameter reference
- [music-api.md](references/music-api.md): Music generation API: endpoints, parameters, response format
- [image-api.md](references/image-api.md): Image generation API: text-to-image, image-to-image, parameters
- [video-api.md](references/video-api.md): Video API: endpoints, models, parameters, camera instructions, templates
- [video-prompt-guide.md](references/video-prompt-guide.md): Video prompt engineering: formulas, styles, image-to-video tips