---
license: apache-2.0
tags:
- hubert
- speech
- audio
- feature-extraction
- higgs-audio
library_name: transformers
pipeline_tag: feature-extraction
---

# Boson AI HuBERT Base

A general-purpose HuBERT-Base checkpoint released by Boson AI. It serves as the semantic teacher inside the [Higgs Audio Tokenizer](https://github.com/boson-ai/higgs-audio).

## What it is

- Standard HuBERT-Base architecture (12 transformer layers, hidden size 768, ~95M params)
- 16 kHz audio input
- Loadable via `AutoModel` with `trust_remote_code=True`
- Returns 768-dim hidden states for every layer when called with `output_hidden_states=True`

## How it is used in Higgs Audio

The Higgs Audio Tokenizer distills semantic features from this HuBERT into its semantic branch. From [`boson_multimodal/audio_processing/higgs_audio_tokenizer.py`](https://github.com/boson-ai/higgs-audio/blob/main/boson_multimodal/audio_processing/higgs_audio_tokenizer.py) (`semantic_techer="hubert_base_general"`):

```python
from transformers import AutoModel

semantic_model = AutoModel.from_pretrained("bosonai/hubert_base", trust_remote_code=True)
# 16 kHz, 768-dim semantic features, all hidden layers consumed by the tokenizer
```

## Direct usage

```python
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("bosonai/hubert_base", trust_remote_code=True).eval()

waveform, sr = torchaudio.load("audio.wav")  # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
# Downmix to mono so that audio channels are not treated as separate batch items
waveform = waveform.mean(dim=0, keepdim=True)  # (1, samples)

with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# out.last_hidden_state: (B, T, 768)
# out.hidden_states: tuple of 13 tensors of shape (B, T, 768)
#   (embedding output + 12 transformer blocks)
```

## License

Apache 2.0.
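
## Frame count

The `(B, T, 768)` shapes under Direct usage leave the frame count `T` implicit. The sketch below estimates `T` from the number of input samples. It assumes this checkpoint keeps the default convolutional feature-extractor settings of `HubertConfig` in `transformers` (kernels `10, 3, 3, 3, 3, 2, 2` and strides `5, 2, 2, 2, 2, 2, 2`). The model is loaded with remote code, so its actual configuration may differ. The helper `hubert_num_frames` is a hypothetical name used only for illustration.

```python
# Assumed defaults from transformers' HubertConfig; check the checkpoint's
# config.json before relying on these values.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def hubert_num_frames(num_samples: int) -> int:
    """Estimate the number of output frames T for a 16 kHz waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # input too short to produce a frame
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


one_second = hubert_num_frames(16000)
print(one_second)  # 49
```

Under these assumptions the strides multiply to 320 samples, so the model emits roughly one frame every 20 ms at 16 kHz.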