AVTR-1
AVTR-1 is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio (speech + listening track), it renders lip-synced speech and active listening at 25 fps on a single GPU.
- Project page: https://avtr-1.avaturn.live/
- Code: https://github.com/avaturn-live/avtr-1/
- Managed API: https://avaturn.live
- Repository contents: model weights, reference assets, and runtime plugin (see Contents below)
Quick start
This repo holds model weights, reference assets, and a TensorRT runtime plugin only. The inference and streaming code lives in the GitHub repo. The minimal path:
git clone https://github.com/avaturn-live/avtr-1.git
cd avtr-1
pixi install
pixi run python scripts/download_artifacts.py # pulls this HF repo
pixi run -e build python scripts/build_engines.py # builds TRT engines
pixi run -e streamer python scripts/run_local_stream.py
Full instructions, prerequisites (CUDA 12.x, TensorRT 10.x, Ampere+ GPU), and streaming setup are in the GitHub README.
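The prerequisites above can be checked before building engines. The following is a minimal sketch, not part of the AVTR-1 repo: the function name `unmet_prerequisites` is hypothetical, it assumes you already have the CUDA and TensorRT version strings and the GPU compute capability in hand, and it interprets "Ampere+" as compute capability 8.0 or higher.

```python
def unmet_prerequisites(cuda_version, tensorrt_version, compute_capability):
    """Return human-readable descriptions of unmet prerequisites.

    Hypothetical helper (not from the AVTR-1 repo). Assumes "Ampere+"
    means compute capability >= 8.0 (A100 is sm80).
    """
    problems = []

    # Only the major version is compared: CUDA 12.x, TensorRT 10.x.
    cuda_major = int(cuda_version.split(".")[0])
    trt_major = int(tensorrt_version.split(".")[0])

    if cuda_major != 12:
        problems.append(f"CUDA 12.x required, found {cuda_version}")
    if trt_major != 10:
        problems.append(f"TensorRT 10.x required, found {tensorrt_version}")
    if tuple(compute_capability) < (8, 0):
        major, minor = compute_capability
        problems.append(f"Ampere or newer GPU required, found sm{major}{minor}")

    return problems


if __name__ == "__main__":
    # Example values for an Ampere consumer GPU (compute capability 8.6).
    for problem in unmet_prerequisites("12.4", "10.0.1", (8, 6)):
        print(problem)
```

An empty list means all three listed prerequisites are met; the GitHub README remains the authoritative source for the full requirements.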
Performance
Per-chunk latency
AVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps that is 200 ms of output per chunk, so any GPU whose per-chunk latency stays below 200 ms runs in real time.
| GPU | Latency / 5-frame chunk | Real-time factor |
|---|---|---|
| L40 | 84 ms | 2.4× |
| A100 | 91 ms | 2.2× |
| RTX 4060 Ti | 166 ms | 1.2× |
| RTX 3070 | 181 ms | 1.1× |
| L4 | 202 ms | 0.99× |
| RTX 3060 Ti | 206 ms | 0.97× |
| RTX 4060 | 232 ms | 0.86× |
Real-time factor = 200 ms / latency. ≥ 1.0× means the GPU keeps up with 25 fps.
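The arithmetic behind the table can be reproduced in a few lines. This is an illustrative sketch: the latencies are copied from the table above, while the names `CHUNK_BUDGET_MS`, `LATENCY_MS`, and `real_time_factor` are made up for this example and do not come from the AVTR-1 code.

```python
FPS = 25
FRAMES_PER_CHUNK = 5

# Output duration covered by one chunk: 5 frames at 25 fps = 200 ms.
CHUNK_BUDGET_MS = 1000 * FRAMES_PER_CHUNK / FPS

# Per-chunk latencies in milliseconds, copied from the table above.
LATENCY_MS = {
    "L40": 84,
    "A100": 91,
    "RTX 4060 Ti": 166,
    "RTX 3070": 181,
    "L4": 202,
    "RTX 3060 Ti": 206,
    "RTX 4060": 232,
}


def real_time_factor(latency_ms):
    """Real-time factor = chunk budget / per-chunk latency."""
    return CHUNK_BUDGET_MS / latency_ms


# GPUs that finish a chunk within the time it takes to play it back.
realtime_gpus = [
    gpu for gpu, ms in LATENCY_MS.items() if real_time_factor(ms) >= 1.0
]

if __name__ == "__main__":
    for gpu, ms in LATENCY_MS.items():
        rtf = real_time_factor(ms)
        status = "real-time" if rtf >= 1.0 else "below real-time"
        print(f"{gpu:12s} {ms:4d} ms  {rtf:.2f}x  {status}")
```

A factor above 1.0× is also headroom: at 84 ms per chunk, roughly 116 ms of each 200 ms window is left over for other pipeline work.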
Training data
AVTR-1 was trained on a diverse collection of publicly available video content depicting human speakers in conversational and presentation settings. The training corpus was used to learn audio-visual correspondence between speech and facial motion; the model does not memorize or retrieve specific training examples at inference time. Output is conditioned solely on the user-provided portrait image and audio inputs.
Bias, risks, and limitations
The training data reflects the demographic distribution of available online video content, which over-represents certain languages, ages, ethnicities, lighting conditions, and recording setups. As a result the model may perform unevenly across under-represented groups: for example, less accurate lip synchronization on phoneme inventories that are under-represented in training, or degraded matting on lighting and skin-tone combinations outside the training distribution.
Other known limitations:
- Designed for 25 fps output from a near-frontal portrait input; out-of-distribution poses, occlusions, or extreme head angles can degrade quality.
- The model may produce subtle motion artifacts during long monologues or on audio with heavy background noise.
- Real-time performance depends on the GPU; as the latency table above shows, some consumer GPUs (RTX 3060 Ti, RTX 4060) fall below real time.
Out-of-scope use
The model must not be used to generate likenesses of identifiable individuals without their explicit, informed consent. The LICENSE (Attachment A) prohibits, among other things:
- non-consensual deepfakes and impersonation,
- generation of content intended to harm or harass,
- exploitation of minors,
- automated decisions affecting an individual's legal rights, and
- military, warfare, or weapons applications.
Deployers and users are responsible for confirming that the reference portrait and audio inputs are used with the appropriate consent of any depicted individuals, and for complying with applicable privacy, biometric, and right-of-publicity laws in their jurisdiction.
Contents
build_artifacts/
avtr1.scripted.pt AVTR-1 motion generator (TorchScript)
hubert-lbs-avtr1.onnx Fine-tuned HuBERT audio encoder
modnet.onnx MODNet portrait matting
renderer_runtime_artifacts/
libgrid_sample_3d_plugin.so 3D grid_sample TensorRT plugin
avatars_artifacts/
reference_frames/ Default avatar portraits
backgrounds/ Default background images
pasteback_mask.png Compositing mask
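After downloading, it can be useful to confirm that every listed artifact is present before building engines. The sketch below is an illustration only: `missing_artifacts` is a hypothetical helper, and the relative paths assume that the three directories in the listing sit side by side at the top level of the download directory, which the listing above does not state explicitly.

```python
from pathlib import Path

# Relative paths taken from the Contents listing.
# ASSUMPTION: build_artifacts/, renderer_runtime_artifacts/ and
# avatars_artifacts/ are all top-level directories of the download.
EXPECTED_ARTIFACTS = [
    "build_artifacts/avtr1.scripted.pt",
    "build_artifacts/hubert-lbs-avtr1.onnx",
    "build_artifacts/modnet.onnx",
    "renderer_runtime_artifacts/libgrid_sample_3d_plugin.so",
    "avatars_artifacts/reference_frames",
    "avatars_artifacts/backgrounds",
    "avatars_artifacts/pasteback_mask.png",
]


def missing_artifacts(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_artifacts.py <download_dir>
    download_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_artifacts(download_dir):
        print(f"missing: {rel}")
```

If the actual layout nests the directories differently, only the entries in `EXPECTED_ARTIFACTS` need to change.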
License
AVTR-1 model weights and avatar assets are released under the AVTR-1 Community License; see LICENSE. Bundled third-party components (HuBERT, MODNet, grid_sample plugin) are Apache-2.0; see THIRD-PARTY-NOTICES.md.
Non-commercial dependency
The pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark model, which are licensed for non-commercial research use only. To use AVTR-1 commercially you must either obtain a commercial license from InsightFace (deepinsight@gmail.com) or replace these models with permissively licensed alternatives (e.g., MediaPipe). A future release is planned to make this replacement.
Commercial use (once InsightFace is replaced)
The AVTR-1 Community License permits commercial use by entities below USD 10,000,000 in annual revenue, subject to Attachment A of the LICENSE. Entities at or above that threshold require a Commercial Use Agreement (hello@avaturn.me).
The Avaturn Streamer (in the GitHub repo's src/avaturn_live_streamer/) is separately licensed under the PolyForm Noncommercial License 1.0.0 and is not included in this HF repo; commercial use of the Streamer requires a separate Streamer Commercial License regardless of revenue.
Acceptable use
Use of AVTR-1 is subject to the Use Restrictions in Attachment A of the LICENSE, which prohibit (among other things) generating non-consensual deepfakes, harassment, harm to minors, and use in fully automated legal-rights decisions.
Acknowledgements
The AVTR-1 motion generator was initially prototyped starting from thuhcsi/S2G-MDDiffusion (MIT, © 2024 Xu He). Its broader provenance chain includes EDGE and the lucidrains diffusion implementations. The released AVTR-1 model has since diverged substantially from S2G-MDDiffusion; see THIRD-PARTY-NOTICES.md for the full lineage statement.
The AVTR-1 renderer pipeline is built on LivePortrait (MIT, © 2024 Kuaishou Visual Generation and Interaction Center). LivePortrait ONNX checkpoints (appearance extractor, motion extractor, warp network, decoder, stitch network) are pulled from digital-avatar/ditto-talkinghead (Apache-2.0), which provides repackaged ONNX graphs suitable for TensorRT conversion. The upstream LivePortrait release does not ship portable ONNX.
Bundled third-party components (HuBERT, MODNet, the grid_sample TensorRT plugin) retain their original upstream attribution and Apache-2.0 licensing; see THIRD-PARTY-NOTICES.md.
Citation
@misc{avtr1_2026,
title = {AVTR-1: Flow-Matching Autoregressive Model for Live Dialogue Avatars},
author = {Goodsize Inc.},
year = {2026},
url = {https://github.com/avaturn-live/avtr-1}
}
Contact
- Licensing & commercial: hello@avaturn.me
- Web: https://avaturn.live
- Issues / code: https://github.com/avaturn-live/avtr-1/issues