Model Card for Kyutai TTS
See also the project page, the Colab example, and the GitHub repository. A pre-print research paper is coming soon!
This is a model for streaming text-to-speech (TTS). Unlike offline text-to-speech, where the model needs the entire text to produce the audio, our model starts to output audio as soon as the first few words from the text have been given as input.
Model Details
The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi (see the Moshi paper). The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use fewer tokens at inference time for faster generation. The backbone model has 1B parameters, and the depth transformer has 600M parameters and uses partial weight sharing similar to Hibiki. The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2.
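As a quick sanity check on the numbers above, the frame rate, per-frame token count, and text-audio shift combine as follows (simple arithmetic restating the figures in this card, not a benchmark):

```python
# Arithmetic implied by the architecture description above.
frame_rate_hz = 12.5          # audio frames per second
tokens_per_frame = 32         # audio tokens per frame (fewer can be used at inference)
text_audio_shift_frames = 16  # the audio lags the text by 16 frames

print(frame_rate_hz * tokens_per_frame)         # 400 audio tokens per second of speech
print(text_audio_shift_frames / frame_rate_hz)  # 1.28 seconds of text look-ahead
```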
Model Description
Kyutai TTS is a decoder-only model for streaming text-to-speech. It leverages the multistream architecture of Moshi to model the audio stream based on the text stream. The audio stream is shifted w.r.t. the text stream so that the model predicts audio tokens from the text it has already received.
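To make the streaming behaviour concrete, below is a minimal conceptual sketch of how the shifted streams could be consumed frame by frame; the one-token-per-frame layout and the generate_frame callback are illustrative assumptions, not the actual implementation (see the GitHub repository for that).

```python
# Conceptual sketch of delayed-streams generation (names are hypothetical).
TEXT_AUDIO_SHIFT = 16  # frames by which the audio stream lags the text stream

def stream_tts(text_tokens, generate_frame):
    """Emit one audio frame per step; the audio for frame t is conditioned on
    the text available up to frame t + TEXT_AUDIO_SHIFT, so playback can start
    as soon as the first few words have been fed to the model."""
    audio_frames = []
    for t in range(len(text_tokens)):
        visible_text = text_tokens[: t + TEXT_AUDIO_SHIFT]
        audio_frames.append(generate_frame(visible_text, audio_frames))
    return audio_frames
```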
Developed by: Kyutai
Model type: Streaming Text-To-Speech.
Language(s) (NLP): English and French
License: Model weights are licensed under CC-BY 4.0
Repository: GitHub
Uses
Direct Use
This model is able to perform streaming text-to-speech generation, including dialogs. The model supports voice conditioning through pre-computed embeddings used via cross-attention; embeddings for a number of voices are provided in our tts-voices repository. This model does not support Classifier Free Guidance (CFG) directly, but was trained with CFG distillation for improved speed (no need to double the batch size). It is easy to batch and can reach a throughput of 75x generated audio per unit of compute time.
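To give a feel for the throughput figure, the back-of-the-envelope below spells out what 75x generated audio per unit of compute time means in practice; the card does not specify what a "compute unit" is, so the batch below is purely illustrative.

```python
# Rough reading of the 75x throughput figure quoted above (illustrative only).
throughput_x_real_time = 75
audio_seconds_requested = 20 * 60                # e.g. twenty one-minute utterances in a batch
compute_seconds = audio_seconds_requested / throughput_x_real_time
print(compute_seconds)                           # 16.0 seconds of compute for 20 minutes of audio
```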
This model does not perform watermarking for two reasons:
watermarking can easily be deactivated for open source models,
our early experiments show that all watermarking systems used by existing TTS models are removed by simply encoding and decoding the audio with Mimi.
Instead, we preferred to restrict the voice cloning ability to the use of pre-computed voice embeddings.
How to Get Started with the Model
See the GitHub repository.
Training Details
The model was trained for 750k steps, with a batch size of 64, and a segment duration of 120 seconds. Then, CFG distillation was performed for 24k updates.
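For a sense of scale, assuming every step processes a full batch of 120-second segments, the total audio seen during pretraining works out as follows (a back-of-the-envelope estimate, not a number reported in this card):

```python
# Back-of-the-envelope estimate of audio consumed during pretraining.
steps, batch_size, segment_seconds = 750_000, 64, 120
total_hours = steps * batch_size * segment_seconds / 3600
print(total_hours)               # 1,600,000 hours of audio
print(total_hours / 2_500_000)   # ~0.64 passes over the 2.5M-hour collection below
```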
Training Data
Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running whisper-timestamped with whisper-medium.
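As an illustration, synthetic word-aligned transcripts of this kind can be obtained roughly as in the sketch below; the file path, device, and language are placeholders, and the exact options used for the pretraining corpus are not specified in this card.

```python
# Minimal sketch of transcribing audio with whisper-timestamped (medium model).
import whisper_timestamped as whisper

audio = whisper.load_audio("example.wav")            # placeholder path
model = whisper.load_model("medium", device="cuda")
result = whisper.transcribe(model, audio, language="en")

# Each segment carries word-level timestamps.
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["text"], word["start"], word["end"])
```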
Compute Infrastructure
Pretraining was done with 32 H100 Nvidia GPUs. CFG distillation was done on 8 such GPUs.
Model Card Authors
Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez