---
license: apache-2.0
tags:
  - hubert
  - speech
  - audio
  - feature-extraction
  - higgs-audio
library_name: transformers
pipeline_tag: feature-extraction
---

# Boson AI HuBERT Base

A general-purpose HuBERT-Base checkpoint released by Boson AI, used inside the [Higgs Audio Tokenizer](https://github.com/boson-ai/higgs-audio) as the semantic teacher.

## What it is

- Standard HuBERT-Base architecture (12 transformer layers, hidden size 768, ~95M params)
- 16 kHz audio input
- Loadable via `AutoModel` with `trust_remote_code=True`
- Returns 768-dim hidden states for every layer when called with `output_hidden_states=True`
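
To get a feel for the time axis `T` of those hidden states, the sketch below estimates how many frames a 16 kHz clip yields. It assumes this checkpoint keeps the convolutional feature encoder of the standard HuBERT-Base architecture (seven layers, total stride 320, roughly 50 frames per second); the `CONV_LAYERS` table and the `num_frames` helper are illustrative names, not part of the released code.

```python
# (kernel, stride) per conv layer in the standard HuBERT-Base feature encoder.
# Assumption: this checkpoint uses the same layout unchanged.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Estimate the number of output frames for a clip of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of all layer strides, i.e. input samples per output frame."""
    stride = 1
    for _, s in CONV_LAYERS:
        stride *= s
    return stride


if __name__ == "__main__":
    print(total_stride())      # samples per frame
    print(num_frames(16000))   # frames for one second of 16 kHz audio
```

Under these assumptions one second of audio maps to slightly fewer than 50 frames, because the unpadded convolutions trim the edges of the clip.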

## How it is used in Higgs Audio

The Higgs Audio Tokenizer distills semantic features from this HuBERT into its semantic branch. From [`boson_multimodal/audio_processing/higgs_audio_tokenizer.py`](https://github.com/boson-ai/higgs-audio/blob/main/boson_multimodal/audio_processing/higgs_audio_tokenizer.py) (`semantic_techer="hubert_base_general"`):

```python
from transformers import AutoModel

semantic_model = AutoModel.from_pretrained("bosonai/hubert_base", trust_remote_code=True)
# 16 kHz, 768-dim semantic features, all hidden layers consumed by the tokenizer
```
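
Since all hidden layers are consumed, they have to be combined into a single target at some point. The sketch below shows one simple option, an unweighted average over layers. Whether the tokenizer aggregates the layers exactly this way is an assumption, so check the linked source file for the actual logic. Random tensors stand in for `hidden_states` so the snippet runs without downloading the model.

```python
import torch

# Stand-in for `model(..., output_hidden_states=True).hidden_states`:
# 13 tensors of shape (B, T, 768). Random values, for shape illustration only.
batch, frames, dim = 2, 49, 768
hidden_states = tuple(torch.randn(batch, frames, dim) for _ in range(13))

# Stack along a new layer axis, then average over it.
# Assumption: an unweighted mean; the real tokenizer may combine layers differently.
stacked = torch.stack(hidden_states, dim=1)  # (B, 13, T, 768)
semantic_target = stacked.mean(dim=1)        # (B, T, 768)

print(stacked.shape, semantic_target.shape)
```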

## Direct usage

```python
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("bosonai/hubert_base", trust_remote_code=True).eval()

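# torchaudio returns (channels, samples); the model treats dim 0 as the batch axis
waveform, sr = torchaudio.load("audio.wav")
# Downmix to mono so a stereo file is not interpreted as a batch of two clips
waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.no_grad():
    out = model(waveform, output_hidden_states=True)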

# out.last_hidden_state:  (B, T, 768)
# out.hidden_states:      tuple of 13 tensors, each (B, T, 768): the embedding output plus one per transformer block
```

## License

Apache 2.0.