AVTR-1

AVTR-1 is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio (speech + listening track), it renders lip-synced speech and active listening at 25 fps on a single GPU.

📄 Project page: https://avtr-1.avaturn.live/
💻 Code: https://github.com/avaturn-live/avtr-1/
☁️ Managed API: https://avaturn.live
📦 Repository contents: model weights + reference assets + runtime plugin (see Contents below)

Quick start

This repo holds only the model weights, reference assets, and the runtime plugin; the inference and streaming code lives in the GitHub repo. The minimal path:

git clone https://github.com/avaturn-live/avtr-1.git
cd avtr-1
pixi install
pixi run python scripts/download_artifacts.py        # pulls this HF repo
pixi run -e build python scripts/build_engines.py    # builds TRT engines
pixi run -e streamer python scripts/run_local_stream.py

Full instructions, prerequisites (CUDA 12.x, TensorRT 10.x, Ampere+ GPU), and streaming setup are in the GitHub README.
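
As a rough illustration of the prerequisites listed above, the check below encodes them as a small Python function. It is a sketch, not part of the AVTR-1 codebase: the function name is hypothetical, and "Ampere+" is interpreted here as CUDA compute capability 8.0 or higher, which is an assumption about what the project means.

```python
# Hypothetical helper, not part of the AVTR-1 repo.
# Assumption: "Ampere+" means CUDA compute capability >= 8.0.


def meets_prerequisites(
    cuda_version: str,
    tensorrt_version: str,
    compute_capability: tuple[int, int],
) -> bool:
    """Return True if the versions match CUDA 12.x, TensorRT 10.x, Ampere+."""
    cuda_major = int(cuda_version.split(".")[0])
    trt_major = int(tensorrt_version.split(".")[0])
    return (
        cuda_major == 12
        and trt_major == 10
        and compute_capability >= (8, 0)  # tuple comparison: (major, minor)
    )


if __name__ == "__main__":
    # Example values only; substitute the versions reported by your system.
    print(meets_prerequisites("12.4", "10.0.1", (8, 6)))
```

If PyTorch and the TensorRT Python package are installed, `torch.version.cuda`, `tensorrt.__version__`, and `torch.cuda.get_device_capability()` are possible sources for the three arguments. The GitHub README remains the authoritative list of requirements.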

Performance

Per-chunk latency

AVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps each chunk covers 200 ms of output, so any GPU with a per-chunk latency below 200 ms runs in real time.

GPU	Latency / 5-frame chunk	Real-time factor
L40	84 ms	2.4×
A100	91 ms	2.2×
RTX 4060 Ti	166 ms	1.2×
RTX 3070	181 ms	1.1×
L4	202 ms	0.99×
RTX 3060 Ti	206 ms	0.97×
RTX 4060	232 ms	0.86×

Real-time factor = 200 ms / latency. ≥ 1.0× means the GPU keeps up with 25 fps.
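
The real-time factor column can be reproduced from the latency figures. The Python sketch below uses the latencies from the table above; the function and variable names are illustrative and not part of the AVTR-1 codebase.

```python
# Illustrative only; these names are not part of the AVTR-1 codebase.
FPS = 25
FRAMES_PER_CHUNK = 5
CHUNK_DURATION_MS = 1000 * FRAMES_PER_CHUNK / FPS  # 200 ms of video per chunk


def real_time_factor(latency_ms: float) -> float:
    """Output duration of one chunk divided by the time taken to generate it."""
    return CHUNK_DURATION_MS / latency_ms


# Per-chunk latencies in milliseconds, copied from the table above.
LATENCY_MS = {
    "L40": 84,
    "A100": 91,
    "RTX 4060 Ti": 166,
    "RTX 3070": 181,
    "L4": 202,
    "RTX 3060 Ti": 206,
    "RTX 4060": 232,
}

if __name__ == "__main__":
    for gpu, latency in LATENCY_MS.items():
        rtf = real_time_factor(latency)
        status = "real time" if rtf >= 1.0 else "below real time"
        print(f"{gpu:12s} {latency:4d} ms  {rtf:.2f}x  {status}")
```

The same function applies to a latency measured on any other GPU: divide 200 ms by the measured per-chunk latency and compare the result with 1.0.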

Training data

AVTR-1 was trained on a diverse collection of publicly available video content depicting human speakers in conversational and presentation settings. The training corpus was used to learn audio-visual correspondence between speech and facial motion; the model does not memorize or retrieve specific training examples at inference time. Output is conditioned solely on the user-provided portrait image and audio inputs.

Bias, risks, and limitations

The training data reflects the demographic distribution of available online video content, which over-represents certain languages, ages, ethnicities, lighting conditions, and recording setups. As a result the model may perform unevenly across underrepresented groups — for example, less accurate lip synchronization on phoneme inventories that are under-represented in training, or degraded matting on lighting and skin-tone combinations outside the training distribution.

Other known limitations:

Designed for 25 fps output at near-frontal portrait input; out-of-distribution poses, occlusions, or extreme head angles can degrade quality.
The model may produce subtle motion artifacts during long monologues or on audio with heavy background noise.
Real-time performance varies by GPU (see the latency table above); several consumer GPUs, such as the RTX 3060 Ti and RTX 4060, fall below real time.

Out-of-scope use

The model must not be used to generate likenesses of identifiable individuals without their explicit, informed consent. The LICENSE (Attachment A) prohibits, among other things:

non-consensual deepfakes and impersonation,
generation of content intended to harm or harass,
exploitation of minors,
automated decisions affecting an individual's legal rights, and
military, warfare, or weapons applications.

Deployers and users are responsible for confirming that the reference portrait and audio inputs are used with the appropriate consent of any depicted individuals, and for complying with applicable privacy, biometric, and right-of-publicity laws in their jurisdiction.

Contents

build_artifacts/
  avtr1.scripted.pt           AVTR-1 motion generator (TorchScript)
  hubert-lbs-avtr1.onnx       Fine-tuned HuBERT audio encoder
  modnet.onnx                 MODNet portrait matting
renderer_runtime_artifacts/
  libgrid_sample_3d_plugin.so 3D grid_sample TensorRT plugin
avatars_artifacts/
  reference_frames/           Default avatar portraits
  backgrounds/                Default background images
  pasteback_mask.png          Compositing mask
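
After downloading, the listing above can serve as a checklist. The Python sketch below reports which of the listed paths are missing under a local directory. It is a hypothetical helper, not part of the AVTR-1 repo, and it takes the download location as a parameter because the destination used by scripts/download_artifacts.py is not stated here.

```python
from pathlib import Path

# Relative paths copied from the listing above.
EXPECTED_ARTIFACTS = [
    "build_artifacts/avtr1.scripted.pt",
    "build_artifacts/hubert-lbs-avtr1.onnx",
    "build_artifacts/modnet.onnx",
    "renderer_runtime_artifacts/libgrid_sample_3d_plugin.so",
    "avatars_artifacts/reference_frames",
    "avatars_artifacts/backgrounds",
    "avatars_artifacts/pasteback_mask.png",
]


def missing_artifacts(root: str) -> list[str]:
    """Return the expected relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (base / rel).exists()]


if __name__ == "__main__":
    # "." is a placeholder; pass the directory the artifacts were downloaded to.
    for rel in missing_artifacts("."):
        print(f"missing: {rel}")
```

An empty result means every listed file and directory is present; it does not validate file contents or checksums.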

License

AVTR-1 model weights and avatar assets are released under the AVTR-1 Community License — see LICENSE. Bundled third-party components (HuBERT, MODNet, grid_sample plugin) are Apache-2.0 — see THIRD-PARTY-NOTICES.md.

Non-commercial dependency

The pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark model, which are licensed for non-commercial research use only. To use AVTR-1 commercially you must either obtain a commercial license from InsightFace (deepinsight@gmail.com) or replace these models with permissively licensed alternatives (e.g., MediaPipe). A future release will make this swap.

Commercial use (once InsightFace is replaced)

The AVTR-1 Community License permits commercial use by entities below USD 10,000,000 in annual revenue, subject to Attachment A of the LICENSE. Entities at or above that threshold require a Commercial Use Agreement (hello@avaturn.me).

The Avaturn Streamer (in the GitHub repo's src/avaturn_live_streamer/) is separately licensed under the PolyForm Noncommercial License 1.0.0 and is not included in this HF repo; commercial use of the Streamer requires a separate Streamer Commercial License regardless of revenue.

Acceptable use

Use of AVTR-1 is subject to the Use Restrictions in Attachment A of the LICENSE, which prohibit (among other things) generating non-consensual deepfakes, harassment, harm to minors, and use in fully automated legal-rights decisions.

Acknowledgements

The AVTR-1 motion generator was initially prototyped from thuhcsi/S2G-MDDiffusion (MIT, © 2024 Xu He). Its broader provenance chain includes EDGE and the lucidrains diffusion implementations. The released AVTR-1 model has since diverged substantially from S2G-MDDiffusion; see THIRD-PARTY-NOTICES.md for the full lineage statement.

The AVTR-1 renderer pipeline is built on LivePortrait (MIT, © 2024 Kuaishou Visual Generation and Interaction Center). LivePortrait ONNX checkpoints (appearance extractor, motion extractor, warp network, decoder, stitch network) are pulled from digital-avatar/ditto-talkinghead (Apache-2.0), which provides repackaged ONNX graphs suitable for TensorRT conversion. The upstream LivePortrait release does not ship portable ONNX.

Bundled third-party components (HuBERT, MODNet, the grid_sample TensorRT plugin) retain their original upstream attribution and Apache-2.0 licensing — see THIRD-PARTY-NOTICES.md.

Citation

@misc{avtr1_2026,
  title  = {AVTR-1: Flow-Matching Autoregressive Model for Live Dialogue Avatars},
  author = {Goodsize Inc.},
  year   = {2026},
  url    = {https://github.com/avaturn-live/avtr-1}
}

Contact

Licensing & commercial: hello@avaturn.me
Web: https://avaturn.live
Issues / code: https://github.com/avaturn-live/avtr-1/issues
