---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---

### 🐁POWSM

<p align="left">
<a href="https://arxiv.org/abs/2510.24992"><img src="https://img.shields.io/badge/Paper-2510.24992-red.svg?logo=arxiv&logoColor=red"/></a>
<a href="https://huggingface.co/espnet/powsm"><img src="https://img.shields.io/badge/Model-powsm-yellow.svg?logo=huggingface&logoColor=yellow"/></a>
<a href="https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1"><img src="https://img.shields.io/badge/Recipe-powsm-blue.svg?logo=github&logoColor=black"/></a>
</p>

POWSM is the first phonetic foundation model that can perform four phone-related tasks: phone recognition (PR), automatic speech recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G).

Based on the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained on [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

> [!TIP]
> Check out our new model, [🐁POWSM-CTC](https://huggingface.co/espnet/powsm_ctc), an encoder-only variant based on the OWSM-CTC architecture,
> and [💎PRiSM](https://arxiv.org/abs/2601.14046): Benchmarking Phone Realization in Speech Models!

To use the pre-trained model, install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```

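Before loading the model, it can help to confirm that the listed requirements are importable in the current environment. The sketch below is only an illustration: `missing_packages` is a hypothetical helper (not part of ESPnet), and it assumes each requirement is importable under the same name used in the list above.

```python
from importlib.util import find_spec

# Requirement names taken from the list in this model card.
REQUIRED = ("torch", "espnet", "espnet_model_zoo")


def missing_packages(names=REQUIRED):
    """Return the top-level packages that cannot be located (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    # Suggest an install command for whatever is absent.
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```
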

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script for PR/ASR/G2P/P2G

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, make sure the input speech is sampled at 16 kHz, and pad or truncate it to 20 s.

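One way to meet the 16 kHz / 20 s requirement is a small NumPy helper applied to the waveform before inference. This is a minimal sketch, not the recipe's own preprocessing: `fit_to_window` is a hypothetical name, and zero-padding at the end and averaging channels to mono are assumptions, since the card does not specify how to pad or how to handle multi-channel audio.

```python
import numpy as np

TARGET_RATE = 16_000   # Hz, as stated in this model card
TARGET_SECONDS = 20    # fixed input duration, as stated in this model card


def fit_to_window(speech: np.ndarray, rate: int) -> np.ndarray:
    """Return mono audio of exactly TARGET_SECONDS at TARGET_RATE (hypothetical helper)."""
    if rate != TARGET_RATE:
        # Resampling is left to the caller (for example with librosa).
        raise ValueError(f"expected {TARGET_RATE} Hz audio, got {rate} Hz")
    if speech.ndim > 1:
        # soundfile returns (frames, channels); averaging to mono is an assumption.
        speech = speech.mean(axis=1)
    target_len = TARGET_RATE * TARGET_SECONDS
    speech = speech[:target_len]              # truncate if longer than 20 s
    pad = target_len - len(speech)            # 0 when already long enough
    return np.pad(speech, (0, pad))           # zero-pad at the end (assumption)
```

With `speech, rate = sf.read("sample.wav")`, the result of `fit_to_window(speech, rate)` can be passed to the model in place of `speech`.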

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat each one as a special token. For example, /pʰɔsəm/ is tokenized as /pʰ//ɔ//s//ə//m/.

> [!NOTE]
> Jan 2026: We released a retrained version with improved ASR text normalization.
> It is located in the subfolder `textnorm_retrained` and has the same structure as the main model.
> Additional details are provided in the updated arXiv appendix.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf  # or librosa

task = "<pr>"
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym="<eng>",  # ISO 639-3; set to <unk> for unseen languages
    task_sym=task,  # <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = sf.read("sample.wav")
prompt = "<na>"  # G2P: set to ASR transcript; P2G: set to phone transcription with slashes
pred = s2t(speech, text_prev=prompt)[0][0]

# post-processing for better format
pred = pred.split("<notimestamps>")[1].strip()
if task == "<pr>" or task == "<g2p>":
    pred = pred.replace("/", "")
print(pred)
```
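
The post-processing above flattens the phone output into a plain string. If individual phones are needed, or a P2G prompt has to be built from a phone list, the slash-delimited format can be handled with two small helpers. This is a sketch under stated assumptions: `split_phones` and `join_phones` are hypothetical names, and the code assumes every phone is wrapped in exactly one pair of slashes, as in the /pʰ//ɔ//s//ə//m/ example.

```python
import re

# One phone = any run of non-slash characters enclosed in a pair of slashes.
PHONE_RE = re.compile(r"/([^/]+)/")


def split_phones(slashed: str) -> list[str]:
    """Split a slash-delimited phone string into a list of phones."""
    return PHONE_RE.findall(slashed)


def join_phones(phones: list[str]) -> str:
    """Wrap each phone in slashes, e.g. to build a P2G prompt for text_prev."""
    return "".join(f"/{phone}/" for phone in phones)


phones = split_phones("/pʰ//ɔ//s//ə//m/")
print(phones)
print(join_phones(phones))
```

To keep phone boundaries in the PR or G2P output, apply `split_phones` to the prediction before the `pred.replace("/", "")` step.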

#### Other tasks

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

Language identification (LID) is learned implicitly during training, and you can run it with the script below:

```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf  # or librosa

s2t = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,  # number of possible languages to return
    first_lang_sym="<afr>",  # fixed; defined in vocab list
    last_lang_sym="<zul>",  # fixed; defined in vocab list
)

speech, rate = sf.read("sample.wav")
pred = s2t(speech)[0]  # a list of lang-prob pairs
print(pred)
```
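
The LID output can be used to choose the `lang_sym` for the main inference script, falling back to `<unk>` when the prediction is weak. The following is a sketch only: `pick_lang_sym` and the `min_prob` threshold of 0.5 are hypothetical, and it assumes the predictions are available as a list of (language symbol, probability) pairs.

```python
def pick_lang_sym(predictions, min_prob=0.5):
    """Choose a lang_sym from LID predictions (hypothetical helper).

    predictions: list of (language symbol, probability) pairs (assumed format).
    min_prob: illustrative confidence threshold, not a value from the paper.
    """
    if not predictions:
        return "<unk>"
    symbol, prob = max(predictions, key=lambda pair: pair[1])
    if prob < min_prob:
        # Low confidence: treat the language as unseen.
        return "<unk>"
    # Accept both "eng" and "<eng>" and normalize to the bracketed form.
    return f"<{symbol.strip('<>')}>"


print(pick_lang_sym([("<eng>", 0.9), ("<deu>", 0.1)]))
```

The returned string can then be passed as `lang_sym` when constructing `Speech2Text`.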

### Citations

```BibTex
@article{powsm,
  title={POWSM: A Phonetic Open Whisper-Style Speech Foundation Model},
  author={Chin-Jou Li and Kalvin Chang and Shikhar Bharadwaj and Eunjung Yeo and Kwanghee Choi and Jian Zhu and David Mortensen and Shinji Watanabe},
  year={2025},
  eprint={2510.24992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.24992},
}
```