Complete Guide: LongCat-Video-Avatar ComfyUI Workflow
Step-by-step tutorial for setting up the perfect AI avatar generation pipeline in ComfyUI.
Audio-Driven Talking Avatar Video: AT2V / ATI2V / Video Continuation, Single & Multi-Person, Long-Form Lip Sync.
The most advanced open-source 13.6B parameter model for audio-driven avatar video generation. Create unlimited length talking head videos with perfect lip synchronization, natural dynamics, and stunning realism. MIT Licensed.
⚠️ Unofficial Community Guide. Trademarks belong to their respective owners. Model weights: MIT (excludes trademark rights).
Upload an image and audio to generate ultra-realistic AI talking avatar videos. Powered by Hugging Face Spaces.
Select your scenario for optimized prompts, parameters, and templates
- Professional spokesperson, product demos, corporate videos (AT2V / ATI2V)
- Lip-sync to songs, music videos, karaoke content (ATI2V)
- Long-form conversations, talk shows, Q&A sessions (Multi-Person)
- Ad creatives, product pitches, testimonial videos (AT2V)
- Two speakers, debates, dual audio conversations (Dual Audio)
- 5+ minute lectures, tutorials, extended presentations (Video Continuation)

Generate professional audio-driven avatar videos in just 3 simple steps
Upload any portrait photo or character image. LongCat Avatar supports real humans, anime characters, and AI-generated images.
Upload an audio file in any language or use our built-in text-to-speech. LongCat Avatar delivers perfect lip synchronization.
Click generate and download your ultra-realistic talking avatar video. Export in HD quality up to 720P at 30fps.
See real examples of LongCat-Video-Avatar in action from official and community sources
💡 More examples available on @Meituan_LongCat Twitter/X
Discover why LongCat Avatar is the most advanced open-source talking head generator
Generate complete talking avatar videos from just audio and text description. No reference image required for LongCat Avatar generation.
Upload one portrait image with audio to create ultra-realistic talking head videos. Perfect lip sync with natural head movements.
Create infinitely long videos with seamless continuation. No color drift, no quality degradation. Perfect for podcasts and long-form content.
Generate multi-person conversation videos from multiple audio streams. Perfect for interviews, dialogues, and group presentations.
Breakthrough AI technologies powering the most realistic lip sync avatar generator
Separates speech signals from full-body motion, maintaining natural poses even during silence. No more awkward frozen frames.
Prevents identity drift in long videos while avoiding rigid copy-paste effects. Your avatar stays consistent throughout.
Eliminates VAE encode-decode cycles for seamless video continuation. Generate unlimited length without quality loss.
See how LongCat Avatar compares to paid AI avatar generators
| Feature | LongCat Avatar | InfiniteTalk | HeyGen | Synthesia |
|---|---|---|---|---|
| Video Length | Unlimited ✓ | Unlimited | 5 minutes max | 10 minutes max |
| Price | 100% Free ✓ | Free | $24/month | $29/month |
| Open Source | MIT License ✓ | Apache 2.0 | ✗ | ✗ |
| Local Deployment | Full Support ✓ | Full Support | Cloud Only | Cloud Only |
| Body Dynamics | Highly Natural ✓ | Good | Limited | Limited |
| Multi-Person | ✓ | N/A | N/A | N/A |
| Parameters | 13.6B ✓ | N/A | N/A | N/A |
Generate talking avatar videos from just audio and text description
A professional [man/woman] in [age range] with [hair description],
wearing [clothing], sitting in [environment].
The person is speaking directly to camera with natural gestures.
High quality, 4K, professional lighting, shallow depth of field.
A professional woman in her 30s with short black hair, wearing a navy blazer, sitting in a modern office. Speaking confidently to camera.
A friendly male teacher in his 40s with glasses, wearing a casual sweater, in front of a whiteboard. Explaining concepts enthusiastically.
A young woman with long brown hair, wearing a winter coat, standing in a snowy landscape. Speaking with visible breath in cold air.
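The bracketed template above can be filled in programmatically when you need many prompt variations. A minimal sketch, assuming nothing about LongCat's API; the field names are illustrative:

```python
# Sketch: fill the prompt template's bracketed slots from keyword fields.
# The template text mirrors the guide above; field names are illustrative.
TEMPLATE = (
    "A professional {gender} in {age_range} with {hair}, "
    "wearing {clothing}, sitting in {environment}. "
    "The person is speaking directly to camera with natural gestures. "
    "High quality, 4K, professional lighting, shallow depth of field."
)

def build_prompt(**fields):
    return TEMPLATE.format(**fields)

prompt = build_prompt(
    gender="woman", age_range="her 30s", hair="short black hair",
    clothing="a navy blazer", environment="a modern office",
)
print(prompt)
```

Generating prompts this way keeps the fixed quality keywords consistent across a whole batch while only the subject description varies.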
Use your reference image for consistent character appearance
The person in the image is speaking/talking/presenting.
[Add scene description: office, studio, outdoor, etc.]
Natural head movements, professional lighting.
Looking directly at camera.
Create dialogues, podcasts, and interviews with two speakers
python run_demo_avatar_multi_audio_to_video.py \
--audio_path_1 speaker_a.wav \
--audio_path_2 speaker_b.wav \
--ref_img_path_1 person_a.jpg \
--ref_img_path_2 person_b.jpg \
--audio_merge_mode concat \
--resolution 720 \
--output_path podcast_output.mp4
Create unlimited length videos without quality degradation
`--num_segments`: Number of video chunks, each covering ~4-8 seconds. For a 5-minute video, use ~40-75 segments.
`--ref_img_index`: Controls reference frame selection. Range: 0-1. Higher = more variety but potential drift.
`--mask_frame_range`: Overlap between chunks for smooth transitions. The default works well for most cases.

Tip: keep `ref_img_index` between 0.3 and 0.7 for the best balance between consistency and variety. Values too low = rigid; too high = drift.
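Given the ~4-8 seconds each chunk covers, you can estimate `--num_segments` from your audio's duration up front. A sketch (the helper is not part of the project; the real script just takes the flag directly):

```python
import math

# Estimate --num_segments from audio duration, assuming each chunk
# covers roughly 4-8 seconds (per the parameter notes above).
def estimate_segments(duration_s: float, seconds_per_segment: float = 5.0) -> int:
    if not 4.0 <= seconds_per_segment <= 8.0:
        raise ValueError("chunks cover roughly 4-8 seconds each")
    return math.ceil(duration_s / seconds_per_segment)

# A 5-minute (300 s) lecture at ~5 s per chunk:
print(estimate_segments(300))  # 60
```

At the longest chunk length (8 s) the same lecture needs ~38 segments, which is where the ~40-75 rule of thumb above comes from.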
python run_demo_avatar_single_audio_to_video.py \
--audio_path lecture_5min.wav \
--ref_img_path presenter.jpg \
--num_segments 60 \
--ref_img_index 0.5 \
--resolution 720 \
--output_path long_video.mp4
Fine-tune your generation with these key parameters
| Parameter | Recommended | Range | Effect |
|---|---|---|---|
| `audio_cfg` | 3.0 - 5.0 | 1.0 - 10.0 | Lip sync strength. Higher = stronger sync but may look unnatural |
| `text_cfg` | 7.5 | 1.0 - 20.0 | Prompt adherence. Higher = follows text more strictly |
| `ref_img_index` | 0.3 - 0.7 | 0.0 - 1.0 | Reference selection. Lower = consistent, Higher = varied |
| `resolution` | 720 | 480 / 720 / 1080 | Output resolution. 720P = balanced quality/speed |
| `num_inference_steps` | 30-50 | 20-100 | Denoising steps. More = higher quality but slower |
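The ranges in this table can be checked before launching a long render so that an out-of-range value fails fast instead of wasting GPU time. A sketch; the parameter names follow the table, but the checker itself is not part of the project:

```python
# Validate generation parameters against the recommended ranges above.
# These bounds come from the parameter table; the checker is illustrative.
RANGES = {
    "audio_cfg": (1.0, 10.0),
    "text_cfg": (1.0, 20.0),
    "ref_img_index": (0.0, 1.0),
    "num_inference_steps": (20, 100),
}

def validate(params: dict) -> dict:
    for name, (lo, hi) in RANGES.items():
        if name in params and not lo <= params[name] <= hi:
            raise ValueError(f"{name}={params[name]} outside [{lo}, {hi}]")
    if params.get("resolution") not in (480, 720, 1080, None):
        raise ValueError("resolution must be 480, 720, or 1080")
    return params

validate({"audio_cfg": 4.0, "text_cfg": 7.5, "resolution": 720})
```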
Copy these JSON templates for batch processing
AT2V (audio + text only):

{
"audio_path": "./inputs/speech.wav",
"prompt": "A professional woman speaking to camera in office",
"resolution": 720,
"num_inference_steps": 30,
"audio_cfg": 4.0,
"text_cfg": 7.5,
"output_path": "./outputs/at2v_result.mp4"
}
ATI2V (audio + reference image):

{
"audio_path": "./inputs/speech.wav",
"ref_img_path": "./inputs/portrait.jpg",
"prompt": "The person is speaking naturally",
"resolution": 720,
"num_inference_steps": 30,
"audio_cfg": 4.0,
"output_path": "./outputs/ati2v_result.mp4"
}
Multi-Person (two speakers):

{
"audio_path_1": "./inputs/speaker_a.wav",
"audio_path_2": "./inputs/speaker_b.wav",
"ref_img_path_1": "./inputs/person_a.jpg",
"ref_img_path_2": "./inputs/person_b.jpg",
"audio_merge_mode": "concat",
"resolution": 720,
"output_path": "./outputs/dialogue.mp4"
}
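These templates can drive a simple batch loop that shells out to the demo scripts shipped with the repo. A sketch, assuming the single-audio script name shown earlier in this guide and JSON files whose keys mirror its CLI flags:

```python
import json
import subprocess
import sys
from pathlib import Path

# Batch-run sketch: each *.json config's keys become --key value flags
# for the demo script. Script name is taken from this guide; adjust it
# to match your checkout.
SCRIPT = "run_demo_avatar_single_audio_to_video.py"

def config_to_args(config: dict) -> list:
    args = []
    for key, value in config.items():
        args += [f"--{key}", str(value)]
    return args

def run_batch(config_dir: str) -> None:
    for path in sorted(Path(config_dir).glob("*.json")):
        config = json.loads(path.read_text())
        cmd = [sys.executable, SCRIPT, *config_to_args(config)]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

# run_batch("./configs")  # render every config in ./configs, in filename order
```

Keeping one JSON file per video makes it easy to re-render a single failed job without touching the rest of the batch.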
Quick solutions for frequent LongCat-Video-Avatar issues
ERROR: Could not build wheels for flash-attn
pip install flash-attn --no-build-isolation
Or try pre-built wheels from GitHub releases. Ensure CUDA toolkit matches your PyTorch version.
RuntimeError: CUDA out of memory
Lower num_inference_steps to 20, or run with --enable_cpu_offload.

FileNotFoundError: [Errno 2] No such file: 'ffmpeg'
sudo apt install ffmpeg (Linux), brew install ffmpeg (macOS), or choco install ffmpeg (Windows).
ValueError: Audio duration mismatch
For multi-person mode, ensure both audio files have similar duration or use --audio_merge_mode concat for sequential playback.
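For uncompressed WAV inputs, this mismatch can be caught before a long render using the standard-library wave module. A sketch; the 1-second tolerance is an illustrative default, not a project constant:

```python
import wave

# Duration of an uncompressed PCM .wav file, in seconds.
def wav_duration(path: str) -> float:
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# Raise before launching generation if the two speaker tracks differ
# by more than `tolerance_s` seconds.
def check_pair(path_a: str, path_b: str, tolerance_s: float = 1.0) -> None:
    da, db = wav_duration(path_a), wav_duration(path_b)
    if abs(da - db) > tolerance_s:
        raise ValueError(
            f"duration mismatch: {da:.2f}s vs {db:.2f}s; "
            "trim the audio or use --audio_merge_mode concat"
        )

# check_pair("speaker_a.wav", "speaker_b.wav")
```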
Mouth movements don't match audio
Increase audio_cfg to 4.0-5.0.

Character makes same movement repeatedly
Adjust ref_img_index between 0.3-0.7. Also try mask_frame_range adjustments for better chunk transitions.
Flexible format support for seamless LongCat Avatar video generation workflow
Choose the right quality settings for your LongCat Avatar project
Transform your educational content with AI-powered talking avatars. Create engaging online courses, employee training videos, and tutorial content without expensive video production.
Generate unlimited lecture videos with consistent presenter avatars
Scale training content across 140+ languages with natural lip sync
Create step-by-step guides with professional talking head presentation
Education Demo Video
Create personalized marketing videos at scale. Generate product demos, social media content, and UGC-style ads with realistic AI avatars that convert.
Showcase products with natural-looking presenter videos
Generate TikTok, YouTube Shorts, and Instagram Reels at scale
Rapidly iterate ad variations with different scripts and avatars
Marketing Demo Video
Perfect for YouTubers, podcasters, and VTubers who want professional video content without facing the camera. Create unlimited content with your AI avatar.
Generate faceless YouTube content with realistic talking avatars
Turn audio podcasts into engaging video content with lip-synced avatars
Build your virtual influencer identity with customizable AI avatars
Creator Demo Video
Reach global audiences by localizing your video content into 140+ languages. LongCat Avatar automatically syncs lip movements to any language audio.
Support for major world languages with accurate pronunciation
Natural mouth movements matched to translated audio
Same avatar across all language versions for brand consistency
Localization Demo Video
Everything you need to know about the best free AI avatar video generator
Get started with LongCat-Video-Avatar in just 5 minutes
The easiest way to create AI avatar videos. No installation required.
Download the model for local deployment with full control.
# Clone the repository
git clone https://huggingface.co/meituan-longcat/LongCat-Video-Avatar
# Install dependencies
pip install -r requirements.txt
# Run inference
python inference.py --image photo.jpg --audio speech.mp3
Professional workflow for AI video generation with full parameter control
Adjust Audio CFG (3-5 optimal), resolution, frame rate, and more
Generate multiple videos in sequence for production workflows
Combine with other AI models for enhanced results
Seamlessly extend videos to unlimited length
ComfyUI Workflow Preview
AT2V / ATI2V / Video Continuation modes
Run LongCat-Video-Avatar on your own hardware for maximum privacy and control
# Step 1: Create conda environment
conda create -n longcat python=3.10
conda activate longcat
# Step 2: Install PyTorch with CUDA
pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Step 3: Install FlashAttention for speed
pip install flash-attn --no-build-isolation
# Step 4: Clone LongCat-Video-Avatar
git clone https://huggingface.co/meituan-longcat/LongCat-Video-Avatar
cd LongCat-Video-Avatar
# Step 5: Install requirements
pip install -r requirements.txt
# Step 6: Run inference
python inference.py --image input.jpg --audio speech.wav --output video.mp4
Run LongCat Avatar on 12GB VRAM GPUs with these optimization tips
Lower resolution significantly reduces VRAM usage while maintaining quality for testing
Trade compute for memory by recomputing activations during backward pass
Half-precision inference cuts memory usage in half with minimal quality loss
Generate longer videos by processing smaller chunks and stitching together
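The chunked-processing idea above amounts to slicing the audio into overlapping windows and rendering each window separately. A sketch of the index arithmetic only; the window and overlap sizes are illustrative, and stitching the rendered clips back together is a separate ffmpeg step:

```python
# Split a track of `total_s` seconds into windows of `chunk_s` seconds
# that overlap by `overlap_s` seconds, returning (start, end) times.
def chunk_windows(total_s: float, chunk_s: float = 8.0, overlap_s: float = 1.0):
    step = chunk_s - overlap_s
    windows, start = [], 0.0
    while start < total_s:
        windows.append((start, min(start + chunk_s, total_s)))
        start += step
    return windows

print(chunk_windows(20.0))
# → [(0.0, 8.0), (7.0, 15.0), (14.0, 20.0)]
```

The overlap gives each chunk a shared second with its neighbor, which is what makes the transitions blendable when the clips are stitched.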
| Configuration | VRAM Required | Max Resolution | Speed |
|---|---|---|---|
| Standard (FP32) | 24GB+ | 1080P | Baseline |
| Optimized (FP16) | 16GB | 720P | 1.5x faster |
| Low VRAM Mode | 12GB | 480P | 0.8x |
Integrate LongCat Avatar into your applications with our REST API
POST /api/v1/generate
Generate a talking avatar video from image and audio
{
"image": "base64_encoded_image",
"audio": "base64_encoded_audio",
"resolution": "720p",
"fps": 30,
"audio_cfg": 4.0
}
GET /api/v1/status/{job_id}
Check the status of a video generation job
{
"job_id": "abc123",
"status": "completed",
"progress": 100,
"video_url": "https://..."
}
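A submit-and-poll client for these two endpoints can be sketched with the standard library alone. Everything here is an assumption layered on the request/response shapes shown above: the BASE_URL is hypothetical, and your deployment may differ.

```python
import base64
import json
import time
import urllib.request

# Hypothetical host; point this at wherever you deploy the service.
BASE_URL = "http://localhost:8000"

# Build the request body shown above, base64-encoding the raw media bytes.
def build_payload(image_bytes: bytes, audio_bytes: bytes, audio_cfg: float = 4.0) -> dict:
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "resolution": "720p",
        "fps": 30,
        "audio_cfg": audio_cfg,
    }

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Poll the status endpoint until the job reports completion.
def wait_for_video(job_id: str, poll_s: float = 5.0) -> str:
    while True:
        with urllib.request.urlopen(f"{BASE_URL}/api/v1/status/{job_id}") as resp:
            status = json.load(resp)
        if status["status"] == "completed":
            return status["video_url"]
        time.sleep(poll_s)

# payload = build_payload(open("portrait.jpg", "rb").read(), open("speech.wav", "rb").read())
# job = post_json(f"{BASE_URL}/api/v1/generate", payload)
# print(wait_for_video(job["job_id"]))
```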
See how much you can save with free LongCat Avatar vs traditional video production
Join thousands of content creators who trust LongCat Avatar for their video needs
"LongCat Avatar changed my YouTube workflow completely. I can now create 10x more content without ever being on camera. The lip sync is incredibly realistic!"
"As a non-native English speaker, LongCat Avatar helps me create professional English content with perfect pronunciation. The 140+ language support is a game-changer."
"We saved $50,000 in the first month alone by switching from Synthesia to LongCat Avatar. The open-source model gives us complete control and unlimited usage."
"The unlimited video length feature is amazing for podcasts. I can create hour-long episodes with consistent avatar quality throughout. No other tool does this for free."
Built by Meituan's AI research team, open-sourced for the community
LongCat-Video-Avatar is a state-of-the-art audio-driven video generation model developed by Meituan's AI research team. Released in December 2025, it represents a major breakthrough in digital human technology.
Join thousands of creators, developers, and AI enthusiasts
Learn tips, tricks, and best practices for AI avatar video generation
Step-by-step tutorial for setting up the perfect AI avatar generation pipeline in ComfyUI.
Detailed comparison of the top AI avatar generators including features, pricing, and quality.
Expert tips to get the most realistic lip synchronization from your LongCat Avatar videos.
Track the latest releases and improvements
See what the community is creating with LongCat Avatar
Your Creation Here
Share your best LongCat Avatar videos
Podcast Demo
10-minute interview generated
Music Video
Full song performance
Education
Language learning content
Not the AI model? Here are VRChat longcat assets and avatars
⚠️ Different Product: These are 3D avatar assets for VRChat, not related to the LongCat-Video-Avatar AI model by Meituan.
Important information about responsible use
LongCat-Video-Avatar model weights are released under MIT License, which permits:
Note: MIT License does NOT grant trademark rights. "Meituan" and "LongCat" are trademarks of their respective owners.
When using AI avatar generation:
This is an unofficial community resource site.
We are not affiliated with, endorsed by, or officially connected to Meituan or the LongCat-Video-Avatar project team.
All trademarks, logos, and brand names are the property of their respective owners.
Official GitHub →