Update README.md #1
by sanchit-gandhi - opened

README.md CHANGED
@@ -25,6 +25,31 @@ Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to
 
 Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
 
+## How to Get Started With the Model
+
+Use the code below to convert text into a mono 16 kHz speech waveform.
+
+```python
+from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+from datasets import load_dataset
+import torch
+import soundfile as sf
+
+processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
+model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
+vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+
+inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
+
+# load xvector containing speaker's voice characteristics from a dataset
+embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+
+speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+
+sf.write("speech.wav", speech.numpy(), samplerate=16000)
+```
+
 ## Intended Uses & Limitations
 
 You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.

@@ -45,28 +70,3 @@ Currently, both the feature extractor and model support PyTorch.
   pages={5723--5738},
 }
 ```
-
-## How to Get Started With the Model
-
-Use the code below to convert text into a mono 16 kHz speech waveform.
-
-```python
-from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
-
-processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
-model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
-vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
-
-inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
-
-# load xvector containing speaker's voice characteristics from a file
-import numpy as np
-import torch
-speaker_embeddings = np.load("xvector_speaker_embedding.npy")
-speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)
-
-speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
-
-import soundfile as sf
-sf.write("speech.wav", speech.numpy(), samplerate=16000)
-```
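A note on the design behind the added snippet: `generate_speech` predicts a log-mel spectrogram from the tokenized text and the 512-dimensional x-vector speaker embedding, and the `vocoder=vocoder` argument additionally runs the SpeechT5HifiGan vocoder so a waveform is returned directly. The following is a minimal sketch of the same pipeline run as two explicit stages; the zero-valued speaker embedding is a placeholder assumption used only to keep the sketch self-contained and will not produce a natural-sounding voice.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")

# placeholder speaker embedding (assumption): the model expects a 512-dim x-vector,
# so a zero vector keeps the sketch self-contained but does not match any real speaker
speaker_embeddings = torch.zeros(1, 512)

# stage 1: acoustic model predicts a log-mel spectrogram from text + speaker embedding
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

# stage 2: the HiFi-GAN vocoder turns the spectrogram into a mono 16 kHz waveform
with torch.no_grad():
    speech = vocoder(spectrogram)

sf.write("speech_two_stage.wav", speech.numpy(), samplerate=16000)
```

Keeping the two stages separate makes it easy to inspect the intermediate spectrogram or to swap in a different vocoder.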
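The card's "Intended Uses & Limitations" section points readers to the model hub for fine-tuned versions. For anyone who prefers to run that lookup from Python, here is a small sketch using `huggingface_hub`; the `search` and `limit` arguments and the `id` attribute are assumptions about that library, not something stated in the card.

```python
from huggingface_hub import HfApi

api = HfApi()

# roughly the same query as the model-hub link in the card
for model in api.list_models(search="speecht5", limit=10):
    print(model.id)
```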