LinaCodec: Highly compressive audio tokenizer for speech models.


LinaCodec is an audio tokenizer that compresses audio to just 12.5 tokens per second (171 bps) and decodes it to 48 kHz audio!
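
As a quick sanity check on those two figures, dividing the bitrate by the token rate gives the average number of bits each token carries. The sketch below is plain arithmetic on the numbers quoted above; the function name is illustrative and is not part of the LinaCodec package.

```python
def bits_per_token(bitrate_bps: float, tokens_per_sec: float) -> float:
    """Average number of bits carried by one token."""
    return bitrate_bps / tokens_per_sec


# Figures quoted in this card: 171 bps at 12.5 tokens/sec.
print(round(bits_per_token(171, 12.5), 2))  # 13.68
```

So each token encodes a little under 14 bits of information on average.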

Key benefits

  • Compression: 12.5 tokens/sec (roughly 60x fewer tokens per second than DAC).
  • Audio Quality: 48 kHz output (much clearer than the standard 16 kHz/24 kHz).
  • Encoder Speed: 200x realtime.
  • Decoder Speed: 400x realtime (even faster with batching).
  • Many Tasks: Also indirectly supports voice conversion, audio super-resolution, and audio denoising!
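
To put the speed figures in concrete terms, the sketch below converts a real-time factor into wall-clock time. It assumes "Nx realtime" means N seconds of audio processed per second of compute; the hardware behind the quoted numbers is not stated here, and the helper name is illustrative only.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to process a clip at a given real-time factor."""
    return audio_seconds / realtime_factor


one_hour = 3600.0
print(processing_seconds(one_hour, 200))  # encoder at 200x realtime: 18.0
print(processing_seconds(one_hour, 400))  # decoder at 400x realtime: 9.0
```

Under that assumption, an hour of audio would take about 18 seconds to encode and about 9 seconds to decode.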

Why is this even useful?

Audio tokenizers directly affect the speed, quality, and capability of TTS/ASR models. LinaCodec improves substantially on previous codecs in these areas.

  • Inference Speed: Enables TTS models to run at 800x realtime, 8x faster than MiraTTS!
  • Fast training: High-quality TTS models can be trained in less than 1 day.
  • Versatile: Works for both Text-to-Speech and Speech-to-Text, unlike most other codecs.

Comparisons

Model       Total Tokens/Sec   Sample Rate
LinaCodec   12.5               48 kHz
DAC         774                44.1 kHz
EnCodec     300                24 kHz
Xcodec2     50                 16 kHz
Mimi        200                24 kHz
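
The token rates in the table translate directly into sequence length for a downstream model. The sketch below uses only the table's numbers to compute how many tokens a clip produces with each codec and how that compares with LinaCodec; the helper names and the 10-second clip length are illustrative choices, not part of any codec's API.

```python
# Token rates copied from the comparison table (tokens per second).
TOKENS_PER_SEC = {
    "LinaCodec": 12.5,
    "DAC": 774,
    "EnCodec": 300,
    "Xcodec2": 50,
    "Mimi": 200,
}


def tokens_for_clip(tokens_per_sec: float, seconds: float) -> float:
    """Number of tokens a codec emits for a clip of the given length."""
    return tokens_per_sec * seconds


def ratio_vs_linacodec(name: str) -> float:
    """How many times more tokens per second a codec uses than LinaCodec."""
    return TOKENS_PER_SEC[name] / TOKENS_PER_SEC["LinaCodec"]


clip_seconds = 10  # illustrative clip length
for name, rate in TOKENS_PER_SEC.items():
    print(
        f"{name}: {tokens_for_clip(rate, clip_seconds):g} tokens, "
        f"{ratio_vs_linacodec(name):.1f}x LinaCodec"
    )
```

By these figures DAC uses about 62 times as many tokens per second as LinaCodec, which matches the rounded "60x" claim in the key benefits.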

Please check the repo for usage: https://github.com/ysharma3501/LinaCodec

The licence is CC-BY-4.0, meaning you can use it for any use case (commercial or non-commercial) provided you credit the original creator. Thank you.
