Generate voice, music, video, and image content via MiniMax APIs — the unified entry for **MiniMax multimodal** use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
## Output Directory
**All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.
**Rules:**
1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create if needed: `mkdir -p minimax-output`)
2. Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
3. **Never** `cd` into the skill directory to run scripts — run from the agent's working directory using the full script path
4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
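The rules above amount to a short preflight pattern (the commented image-script call is illustrative; any generation script follows the same shape):

```shell
# Rule 1: create the output directory in the agent's working directory
mkdir -p minimax-output

# Rules 2-3: pass an explicit output path and use the full script path, e.g.:
# bash scripts/image/generate_image.sh --prompt "..." -o minimax-output/out.png

# Rule 4: clean up intermediates when no longer needed
rm -rf minimax-output/tmp
```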
### API Host Configuration
Before running any script, check whether `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
- **China Mainland** → `https://api.minimaxi.com`
- **Global** → `https://api.minimax.io`
2. Help the user set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the Global variant) in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
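The host check can be sketched as follows (the Global endpoint is used here as an example; substitute the China Mainland URL as appropriate):

```shell
# Detect a missing host before any script runs
if [ -z "${MINIMAX_API_HOST:-}" ]; then
  echo "MINIMAX_API_HOST is not set; ask the user which endpoint to use"
  # Global endpoint shown; use https://api.minimaxi.com for China Mainland
  export MINIMAX_API_HOST="https://api.minimax.io"
fi
echo "Using host: $MINIMAX_API_HOST"
```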
### API Key Configuration
Set the `MINIMAX_API_KEY` environment variable before running any script:
```bash
export MINIMAX_API_KEY="your-api-key-here"
```
The key starts with `sk-api-` or `sk-cp-` and can be obtained from https://platform.minimaxi.com (China Mainland) or https://platform.minimax.io (Global).
**IMPORTANT — When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Help the user set it via `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
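Both required variables can be verified in one preflight check (a minimal sketch; the function name is illustrative):

```shell
# Succeeds only when both required variables are non-empty
minimax_env_ok() {
  [ -n "${MINIMAX_API_HOST:-}" ] && [ -n "${MINIMAX_API_KEY:-}" ]
}

if ! minimax_env_ok; then
  echo "Ask the user for the API key and host before running any script"
fi
```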
**IMPORTANT — Always respect the user's plan limits before generating content.** If the user's quota is exhausted or insufficient, warn them before proceeding.
## Speech Generation
**Default behavior:** When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to `tts` in one call.
Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds **10,000 characters** (API limit per request) — in this case, split into segments with the same voice
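The 10,000-character threshold can be checked before choosing a mode (sketch; `text` is a placeholder for the user's input):

```shell
text="The full narration text goes here..."
# Bash counts characters with ${#var}; API limit is 10,000 per request
if [ "${#text}" -gt 10000 ]; then
  echo "multi-segment: split into <=10000-char segments with the same voice"
else
  echo "single tts call"
fi
```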
**Complete workflow — follow ALL steps in order:**
1. **Write segments.json** — split text into segments with voice assignments (see format and rules below)
2. **Run `generate` command** — this reads segments.json, generates audio for EACH segment via TTS API, then merges them into a single output file with crossfade
```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
```
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
**Rule: Narration and dialogue are ALWAYS separate segments.**
A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`
**Example — Audiobook with narrator + 2 characters:**
```json
[
{ "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
{ "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
{ "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```
**Key principles:**
1. **Narrator** uses a consistent neutral narrator voice throughout
2. **Each character** has a dedicated voice_id, maintained consistently across all their dialogue
3. **Split at dialogue boundaries** — `"He said:"` is narrator, the quoted content is the character
4. **Do NOT merge** narrator text and character speech into a single segment
5. For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
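Before running `generate`, the segments file can be sanity-checked. A minimal sketch (the sample voice IDs are placeholders):

```shell
mkdir -p minimax-output
# Sample two-segment file: narrator line + character line, per the rule above
cat > minimax-output/segments.json <<'EOF'
[
  { "text": "Tom said:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is great today!", "voice_id": "tom-voice", "emotion": "happy" }
]
EOF
# Every segment needs text + voice_id; emotion may be an empty string
python3 - <<'EOF'
import json
segs = json.load(open("minimax-output/segments.json"))
assert all("text" in s and "voice_id" in s for s in segs), "malformed segment"
print(f"{len(segs)} segments OK")
EOF
```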
## Music Generation
Entry point: `scripts/music/generate_music.sh`
### IMPORTANT: Instrumental vs Lyrics — When to use which
| Scenario | Mode | Action |
|----------|------|--------|
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
**When adding background music to video or voice content**, always default to instrumental mode (`--instrumental`). Do not ask the user — BGM should never have vocals competing with the main content.
**When the user explicitly asks to create/generate music as the primary task**, ask them whether they want:
- Instrumental (pure music, no vocals)
- With lyrics (song with vocals — user provides or you help write lyrics)
```bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
--instrumental \
--prompt "ambient electronic, atmospheric" \
--output minimax-output/ambient.mp3 --download
# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
--prompt "indie folk, melancholic" \
--output minimax-output/song.mp3 --download
# With style fields
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nLyrics here" \
--genre "pop" --mood "upbeat" --tempo "fast" \
--output minimax-output/pop_track.mp3 --download
```
### Music Model
Default model: `music-2.5`
`music-2.5` does **not** support `--instrumental` directly. When instrumental music is needed, the script automatically applies a workaround:
- Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals), appends `pure music, no lyrics` to the prompt
This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.
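The workaround is equivalent to building the prompt yourself. A sketch of the transformation the script applies (string handling only; the actual script internals may differ):

```shell
prompt="ambient electronic, atmospheric"
lyrics="[intro] [outro]"                  # structural tags only, no vocals
prompt="$prompt, pure music, no lyrics"   # suffix the script appends
echo "$prompt"
```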
## Image Generation
Entry point: `scripts/image/generate_image.sh`
Model: `image-01` — photorealistic image generation from text prompts, with optional character reference for image-to-image.
### IMPORTANT: Mode Selection — t2i vs i2i
| User intent | Mode |
|-------------|------|
| Generate image from text description (default) | `t2i` — text-to-image |
| Generate image with a character reference photo (keep same person) | `i2i` — image-to-image |
**Default behavior:** When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.
### IMPORTANT: Aspect Ratio — Infer from user context
Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio and resolution for the context (see the Aspect Ratio Reference table below).
```bash
# Text-to-image with an explicit aspect ratio and prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (width and height must be multiples of 8)
bash scripts/image/generate_image.sh \
--prompt "Product photo of a luxury watch on marble surface" \
--width 1024 --height 768 \
-o minimax-output/watch.png
```
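Custom `--width`/`--height` values must be multiples of 8; a quick guard (sketch):

```shell
w=1024; h=768
if [ $((w % 8)) -ne 0 ] || [ $((h % 8)) -ne 0 ]; then
  echo "Invalid: width and height must be multiples of 8"
else
  echo "Dimensions OK: ${w}x${h}"
fi
```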
### Image-to-Image (Character Reference)
Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
```bash
# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A girl looking into the distance from a library window, warm afternoon light" \
--ref-image face.jpg \
--aspect-ratio 16:9 \
-o minimax-output/girl_library.png
# Multiple character variations
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A woman in a red dress at a gala event, elegant, cinematic" \
--ref-image face.jpg -n 3 \
-o minimax-output/gala.png
```
### Aspect Ratio Reference
| Ratio | Resolution | Best for |
|-------|------------|----------|
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
## Video Generation
Entry points: `scripts/video/generate_video.sh` (single segment), `scripts/video/generate_long_video.sh` (multi-scene)
**Default behavior:** Always use single-segment `generate_video.sh` with **duration 6s and resolution 768P** unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
### IMPORTANT: Prompt Optimization (MUST follow before generating any video)
Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.
**Optimization steps:**
1. **Apply the Professional Formula**: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
- BAD: `"A puppy in a park"`
- GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`
2. **Add camera instructions** using `[指令]` syntax: `[推进]`, `[拉远]`, `[跟随]`, `[固定]`, `[左摇]`, etc.
3. **Keep to 1-2 key actions** for 6-10 second videos — do not overcrowd with events
4. **For i2v mode** (image-to-video): Focus prompt on **movement and change only**, since the image already establishes the visual. Do NOT re-describe what's in the image.
- BAD: `"A lake with mountains"` (just repeating the image)
- GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`
5. **For multi-segment long videos**: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
```bash
# Text-to-video
bash scripts/video/generate_video.sh \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --output minimax-output/petals.mp4

# Another optimized t2v prompt
bash scripts/video/generate_video.sh \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --output minimax-output/garden.mp4
```
### Multi-Scene Long Videos
Entry point: `scripts/video/generate_long_video.sh`
Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 6 seconds per segment.
1. Segment 1: t2v — generated purely from the optimized text prompt
2. Segment 2+: i2v — the previous segment's last frame becomes `first_frame_image`, prompt describes **motion and change from that ending state**
3. All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
4. Optional: AI-generated background music is overlaid
**Prompt rules for each segment:**
- Each segment prompt MUST be independently optimized using the Professional Formula
- Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
- Segment 2+ (i2v): Focus on **what changes and moves** from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
- Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
**Example — three-segment prompt sequence:**
1. Segment 1 (t2v): `"A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere"`
2. Segment 2 (i2v): `"The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure"`
3. Segment 3 (i2v): `"The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale"`
## Media Tools
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via the MiniMax API.
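For example, the 0.5s crossfade used when joining video segments maps to FFmpeg's `xfade` filter, whose `offset` is the first clip's duration minus the fade duration. A sketch (assumes two 6s clips; the commented command is illustrative, not one of the bundled tools):

```shell
dur=6; fade=0.5
# offset = where the fade starts within the first clip
offset=$(awk -v d="$dur" -v f="$fade" 'BEGIN{printf "%.1f", d - f}')
echo "xfade offset: $offset"
# ffmpeg -i seg1.mp4 -i seg2.mp4 \
#   -filter_complex "xfade=transition=fade:duration=${fade}:offset=${offset}" \
#   minimax-output/joined.mp4
```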