minimax-skills/skills/minimax-multimodal-toolkit/SKILL.md

---
name: minimax-multimodal-toolkit
description: >
  MiniMax multimodal model skill — use MiniMax Multi-Modal models for speech, music, video, and image.
  Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design,
  multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame,
  subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character
  reference), and media processing (convert, concat, trim, extract).
  Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI,
  MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
license: MIT
metadata:
  version: "1.0"
  category: media-generation
---

# MiniMax Multi-Modal Toolkit

Generate voice, music, video, and image content via MiniMax APIs — the unified entry for **MiniMax multimodal** use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.

## Output Directory

**All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.

**Rules:**
1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create if needed: `mkdir -p minimax-output`)
2. Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
3. **Never** `cd` into the skill directory to run scripts — run from the agent's working directory using the full script path
4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`

## Prerequisites

```bash
brew install ffmpeg jq              # macOS (or apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh
```

No Python or pip required — all scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.

### API Host Configuration

MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:

| Region | Platform URL | API Host Value |
|--------|-------------|----------------|
| China Mainland（中国大陆） | https://platform.minimaxi.com | `https://api.minimaxi.com` |
| Global（全球） | https://platform.minimax.io | `https://api.minimax.io` |

```bash
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"

# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
```

**IMPORTANT — When API Host is missing:**
Before running any script, check if `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Instruct and help user to set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence

### API Key Configuration

Set the `MINIMAX_API_KEY` environment variable before running any script:

```bash
export MINIMAX_API_KEY="your-api-key-here"
```

The key starts with `sk-api-` or `sk-cp-`, obtainable from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global)

**IMPORTANT — When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Instruct and help user to set it via `export MINIMAX_API_KEY="sk-..."` in their terminal or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence

## Plan Limits & Quotas

**IMPORTANT — Always respect the user's plan limits before generating content.** If the user's quota is exhausted or insufficient, warn them before proceeding.

### Standard Plans

| Capability | Starter | Plus | Max |
|---|---|---|---|
| M2.7 (chat) | 600 req/5h | 1,500 req/5h | 4,500 req/5h |
| Speech 2.8 | — | 4,000 chars/day | 11,000 chars/day |
| image-01 | — | 50 images/day | 120 images/day |
| Hailuo-2.3-Fast 768P 6s | — | — | 2 videos/day |
| Hailuo-2.3 768P 6s | — | — | 2 videos/day |
| Music-2.5 | — | — | 4 songs/day (≤5 min each) |

### High-Speed Plans

| Capability | Plus-HS | Max-HS | Ultra-HS |
|---|---|---|---|
| M2.7-highspeed (chat) | 1,500 req/5h | 4,500 req/5h | 30,000 req/5h |
| Speech 2.8 | 9,000 chars/day | 19,000 chars/day | 50,000 chars/day |
| image-01 | 100 images/day | 200 images/day | 800 images/day |
| Hailuo-2.3-Fast 768P 6s | — | 3 videos/day | 5 videos/day |
| Hailuo-2.3 768P 6s | — | 3 videos/day | 5 videos/day |
| Music-2.5 | — | 7 songs/day (≤5 min each) | 15 songs/day (≤5 min each) |

**Key quota constraints:**
- **Video resolution: 768P only** — 1080P is not available on any plan
- **Video duration: 6s** — all plan quotas are counted in 6-second units
- **Video quota is very limited** (2–5/day depending on plan) — always confirm with the user before generating video

## Key Capabilities

| Capability | Description | Entry point |
|------------|-------------|-------------|
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.sh` |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | `scripts/tts/generate_voice.sh clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.sh design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.sh` |
| Image Generation | Text-to-image, image-to-image with character reference | `scripts/image/generate_image.sh` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.sh` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.sh` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | `scripts/media_tools.sh` |

## TTS (Text-to-Speech)

Entry point: `scripts/tts/generate_voice.sh`

### IMPORTANT: Single voice vs Multi-segment — Choose the right approach

| User intent | Approach |
|-------------|----------|
| Single voice / no multi-character need | `tts` command — generate the entire text in one call |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |

**Default behavior:** When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to `tts` in one call.

Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds **10,000 characters** (API limit per request) — in this case, split into segments with the same voice

### Single-voice generation (DEFAULT)

```bash
bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3
```

### Multi-segment generation (multi-voice / audiobook / podcast)

**Complete workflow — follow ALL steps in order:**

1. **Write segments.json** — split text into segments with voice assignments (see format and rules below)
2. **Run `generate` command** — this reads segments.json, generates audio for EACH segment via TTS API, then merges them into a single output file with crossfade

```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200
```

**Do NOT skip Step 2.** Writing segments.json alone does nothing — you MUST run the `generate` command to actually produce audio.

### Voice management

```bash
# List all available voices
bash scripts/tts/generate_voice.sh list-voices

# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice

# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
```

### Audio processing

```bash
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3
```

### TTS Models

| Model | Notes |
|-------|-------|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |

### segments.json Format

Default crossfade between segments: **200ms** (`--crossfade 200`).

```json
[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```

Leave `emotion` empty for speech-2.8 models (auto-matched from text).

### IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.

**Rule: Narration and dialogue are ALWAYS separate segments.**

A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`

**Example — Audiobook with narrator + 2 characters:**

```json
[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```

**Key principles:**
1. **Narrator** uses a consistent neutral narrator voice throughout
2. **Each character** has a dedicated voice_id, maintained consistently across all their dialogue
3. **Split at dialogue boundaries** — `"He said:"` is narrator, the quoted content is the character
4. **Do NOT merge** narrator text and character speech into a single segment
5. For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments

## Music Generation

Entry point: `scripts/music/generate_music.sh`

### IMPORTANT: Instrumental vs Lyrics — When to use which

| Scenario | Mode | Action |
|----------|------|--------|
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |

**When adding background music to video or voice content**, always default to instrumental mode (`--instrumental`). Do not ask the user — BGM should never have vocals competing with the main content.

**When the user explicitly asks to create/generate music as the primary task**, ask them whether they want:
- Instrumental (pure music, no vocals)
- With lyrics (song with vocals — user provides or you help write lyrics)

```bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download

# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download

# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download
```

### Music Model

Default model: `music-2.5`

`music-2.5` does **not** support `--instrumental` directly. When instrumental music is needed, the script automatically applies a workaround:
- Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals), appends `pure music, no lyrics` to the prompt

This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.

## Image Generation

Entry point: `scripts/image/generate_image.sh`

Model: `image-01` — photorealistic image generation from text prompts, with optional character reference for image-to-image.

### IMPORTANT: Mode Selection — t2i vs i2i

| User intent | Mode |
|-------------|------|
| Generate image from text description (default) | `t2i` — text-to-image |
| Generate image with a character reference photo (keep same person) | `i2i` — image-to-image |

**Default behavior:** When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.

### IMPORTANT: Aspect Ratio — Infer from user context

Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:

| User intent / context | Recommended ratio | Resolution |
|-----------------------|-------------------|------------|
| 头像、图标、社交媒体头像、avatar、icon、profile pic | `1:1` | 1024×1024 |
| 风景、横幅、桌面壁纸、landscape、banner、desktop wallpaper | `16:9` | 1280×720 |
| 传统照片、经典比例、classic photo | `4:3` | 1152×864 |
| 摄影作品、杂志封面、photography、magazine | `3:2` | 1248×832 |
| 人像竖图、海报、portrait photo、poster | `2:3` | 832×1248 |
| 竖版海报、书籍封面、tall poster、book cover | `3:4` | 864×1152 |
| 手机壁纸、社交媒体故事、phone wallpaper、story、reel | `9:16` | 720×1280 |
| 超宽全景、电影画幅、panoramic、cinematic ultrawide | `21:9` | 1344×576 |
| 未指定特定需求 / ambiguous | `1:1` | 1024×1024 |

### IMPORTANT: Image Count — When to generate multiple images

| User intent | Count (`-n`) |
|-------------|--------------|
| Default / single image request | `1` (default) |
| 用户说"几张"、"多张"、"一些" / "a few", "several" | `3` |
| 用户说"多种方案"、"备选" / "variations", "options" | `3`–`4` |
| 用户明确指定数量 | Use the specified number (1–9) |

### Text-to-Image Examples

```bash
# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png

# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png

# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png

# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png

# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png
```

### Image-to-Image (Character Reference)

Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).

```bash
# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png

# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png
```

### Aspect Ratio Reference

| Ratio | Resolution | Best for |
|-------|------------|----------|
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
| `16:9` | 1280×720 | Landscape, banner, desktop wallpaper |
| `4:3` | 1152×864 | Classic photo, presentations |
| `3:2` | 1248×832 | Photography, magazine layout |
| `2:3` | 832×1248 | Portrait photo, poster |
| `3:4` | 864×1152 | Book cover, tall poster |
| `9:16` | 720×1280 | Phone wallpaper, social story/reel |
| `21:9` | 1344×576 | Ultra-wide panoramic, cinematic |

### Key Options

| Option | Description |
|--------|-------------|
| `--prompt TEXT` | Image description, max 1500 chars (required) |
| `--aspect-ratio RATIO` | Aspect ratio (see table above). Infer from user context |
| `--width PX` / `--height PX` | Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` if both set |
| `-n N` | Number of images to generate, 1–9 (default 1) |
| `--seed N` | Random seed for reproducibility. Same seed + same params → similar results |
| `--prompt-optimizer` | Enable automatic prompt optimization by the API |
| `--ref-image FILE` | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| `--no-download` | Print image URLs instead of downloading files |
| `--aigc-watermark` | Add AIGC watermark to generated images |

## Video Generation

### IMPORTANT: Single vs Multi-Segment — Choose the right script

| User intent | Script to use |
|-------------|---------------|
| Default / no special request | `scripts/video/generate_video.sh` (single segment, **6s, 768P**) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` (multi-segment) |

**Default behavior:** Always use single-segment `generate_video.sh` with **duration 6s and resolution 768P** unless the user explicitly asks for a long video or multi-scene video. Do NOT automatically split into multiple segments — a single 6s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.

Entry point (single video): `scripts/video/generate_video.sh`
Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`

### Video Model Constraints (MUST follow)

**Supported resolutions and durations by model:**

| Model | Resolution | Duration |
|-------|-----------|----------|
| MiniMax-Hailuo-2.3 | 768P only | 6s or 10s |
| MiniMax-Hailuo-2.3-Fast | 768P only | 6s or 10s |
| MiniMax-Hailuo-02 | 512P, 768P (default) | 6s or 10s |
| T2V-01 / T2V-01-Director | 720P | 6s only |
| I2V-01 / I2V-01-Director / I2V-01-live | 720P | 6s only |
| S2V-01 (ref) | 720P | 6s only |

**Key rules:**
- **Default: 6s + 768P** — plan quotas are counted in 6-second units; use 6s unless user explicitly requests 10s
- **1080P is NOT supported** on any plan — always use 768P for Hailuo-2.3/2.3-Fast
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P

### IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.

**Optimization steps:**

1. **Apply the Professional Formula**: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
   - BAD: `"A puppy in a park"`
   - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`

2. **Add camera instructions** using `[指令]` syntax: `[推进]`, `[拉远]`, `[跟随]`, `[固定]`, `[左摇]`, etc.

3. **Include aesthetic details**: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)

4. **Keep to 1-2 key actions** for 6-10 second videos — do not overcrowd with events

5. **For i2v mode** (image-to-video): Focus prompt on **movement and change only**, since the image already establishes the visual. Do NOT re-describe what's in the image.
   - BAD: `"A lake with mountains"` (just repeating the image)
   - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`

6. **For multi-segment long videos**: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.

```bash
# Text-to-video (default: 6s, 768P)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4

# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4
```

### Long-form Video (Multi-scene)

Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 6 seconds per segment.

**Workflow:**
1. Segment 1: t2v — generated purely from the optimized text prompt
2. Segment 2+: i2v — the previous segment's last frame becomes `first_frame_image`, prompt describes **motion and change from that ending state**
3. All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
4. Optional: AI-generated background music is overlaid

**Prompt rules for each segment:**
- Each segment prompt MUST be independently optimized using the Professional Formula
- Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
- Segment 2+ (i2v): Focus on **what changes and moves** from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
- Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
- Each segment covers only 6 seconds of action — keep it focused

```bash
# Example: 3-segment story with optimized per-segment prompts (default: 6s/segment, 768P)
bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4

# With custom settings
bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 6 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4
```

### Add Background Music

```bash
bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4
```

### Template Video

```bash
bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4
```

### Video Models

| Mode | Default Model | Default Duration | Default Resolution | Notes |
|------|--------------|-----------------|-------------------|-------|
| t2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 6s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |

## Media Tools (Audio/Video Processing)

Entry point: `scripts/media_tools.sh`

Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via MiniMax API.

### Video Format Conversion

```bash
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov

# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30
```

### Audio Format Conversion

```bash
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2
```

### Video Concatenation

```bash
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4

# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
```

### Audio Concatenation

```bash
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3

# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1
```

### Extract Audio from Video

```bash
# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k
```

### Video Trimming

```bash
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15

# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
```

### Add Audio to Video (Overlay / Replace)

```bash
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3

# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace
```

### Media File Info

```bash
bash scripts/media_tools.sh probe input.mp4
```

## Script Architecture

```
scripts/
├── check_environment.sh         # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh               # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh        # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh        # Music generation CLI
├── image/
│   └── generate_image.sh        # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh        # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh   # Multi-scene long video
    ├── generate_template_video.sh # Template-based video
    └── add_bgm.sh              # Background music overlay
```

## References

Read these for detailed API parameters, voice catalogs, and prompt engineering:

- [tts-guide.md](references/tts-guide.md) — TTS setup, voice management, audio processing, segment format, troubleshooting
- [tts-voice-catalog.md](references/tts-voice-catalog.md) — Full voice catalog with IDs, descriptions, and parameter reference
- [music-api.md](references/music-api.md) — Music generation API: endpoints, parameters, response format
- [image-api.md](references/image-api.md) — Image generation API: text-to-image, image-to-image, parameters
- [video-api.md](references/video-api.md) — Video API: endpoints, models, parameters, camera instructions, templates
- [video-prompt-guide.md](references/video-prompt-guide.md) — Video prompt engineering: formulas, styles, image-to-video tips