Update docs etc. (#524)
* fully support ormsgpack

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* dependency

* torch==2.4.1 windows compilable

* Update docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove autorerank

* api usage

* back slash

* fix docs

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
AnyaCoder and pre-commit-ci[bot] committed Sep 12, 2024
1 parent b186c98 commit 9cb84a6
Showing 11 changed files with 125 additions and 549 deletions.
125 changes: 65 additions & 60 deletions docs/en/index.md
@@ -27,66 +27,70 @@

## Windows Setup

Windows professional users may consider WSL2 or Docker to run the codebase.

Non-professional Windows users can consider the following methods to run the codebase without a Linux environment (with model compilation support, i.e. `torch.compile`):

<ol>
<li>Unzip the project package.</li>
<li>Click <code>install_env.bat</code> to install the environment.
<ul>
<li>You can decide whether to use a mirror site for downloads by editing the <code>USE_MIRROR</code> item in <code>install_env.bat</code>.</li>
<li><code>USE_MIRROR=false</code> downloads the latest stable version of <code>torch</code> from the original site. <code>USE_MIRROR=true</code> downloads the latest version of <code>torch</code> from a mirror site. The default is <code>true</code>.</li>
<li>You can decide whether to enable the compiled environment download by editing the <code>INSTALL_TYPE</code> item in <code>install_env.bat</code>.</li>
<li><code>INSTALL_TYPE=preview</code> downloads the preview version with the compiled environment. <code>INSTALL_TYPE=stable</code> downloads the stable version without the compiled environment.</li>
</ul>
</li>
<li>If you set <code>INSTALL_TYPE=preview</code> in step 2, execute this step (optional; it activates the environment for model compilation):
<ol>
<li>Download the LLVM compiler using the following links:
<ul>
<li><a href="https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true">LLVM-17.0.6 (original site download)</a></li>
<li><a href="https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true">LLVM-17.0.6 (mirror site download)</a></li>
<li>After downloading <code>LLVM-17.0.6-win64.exe</code>, double-click to install it, choose an appropriate installation location, and most importantly, check <code>Add Path to Current User</code> to add it to the environment variables.</li>
<li>Confirm the installation is complete.</li>
</ul>
</li>
<li>Download and install the Microsoft Visual C++ Redistributable package to resolve potential .dll missing issues.
<ul>
<li><a href="https://aka.ms/vs/17/release/vc_redist.x64.exe">MSVC++ 14.40.33810.0 Download</a></li>
</ul>
</li>
<li>Download and install Visual Studio Community Edition to obtain MSVC++ build tools, resolving LLVM header file dependencies.
<ul>
<li><a href="https://visualstudio.microsoft.com/zh-hans/downloads/">Visual Studio Download</a></li>
<li>After installing Visual Studio Installer, download Visual Studio Community 2022.</li>
<li>Click the <code>Modify</code> button as shown below, find the <code>Desktop development with C++</code> option, and check it for download.</li>
<p align="center">
<img src="../assets/figs/VS_1.jpg" width="75%">
</p>
</ul>
</li>
<li>Install <a href="https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64">CUDA Toolkit 12</a></li>
</ol>
</li>
<li>Double-click <code>start.bat</code> to open the Fish-Speech training/inference configuration WebUI.
<ul>
<li>(Optional) Want to go directly to the inference page? Edit the <code>API_FLAGS.txt</code> in the project root directory and modify the first three lines as follows:
<pre><code>--infer
# --api
# --listen ...
...</code></pre>
</li>
<li>(Optional) Want to start the API server? Edit the <code>API_FLAGS.txt</code> in the project root directory and modify the first three lines as follows:
<pre><code># --infer
--api
--listen ...
...</code></pre>
</li>
</ul>
</li>
<li>(Optional) Double-click <code>run_cmd.bat</code> to enter the conda/python command line environment of this project.</li>
</ol>
Professional Windows users may consider using WSL2 or Docker to run the codebase.

```bash
# Create a Python 3.10 virtual environment; you can also use virtualenv
conda create -n fish-speech python=3.10
conda activate fish-speech

# Install pytorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install fish-speech
pip3 install -e .

# (Enable acceleration) Install triton-windows
pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
```
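
After installation, a quick sanity check can confirm that CUDA and `torch.compile` are usable. The snippet below is a minimal sketch and not part of the project; on Windows, compilation relies on the `triton-windows` wheel installed above.

```python
# Minimal sanity-check sketch (not part of the project):
# verifies that CUDA is visible and that torch.compile can compile a trivial function.
import torch

print(torch.__version__, "CUDA available:", torch.cuda.is_available())

@torch.compile
def square(x: torch.Tensor) -> torch.Tensor:
    return x * x

device = "cuda" if torch.cuda.is_available() else "cpu"
print(square(torch.randn(4, device=device)))
```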

Non-professional Windows users can consider the following basic methods to run the project without a Linux environment (with model compilation capabilities, i.e., `torch.compile`):

1. Extract the project package.
2. Click `install_env.bat` to install the environment.
3. If you want to enable compilation acceleration, follow this step:
1. Download the LLVM compiler from the following links:
- [LLVM-17.0.6 (Official Site Download)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
- [LLVM-17.0.6 (Mirror Site Download)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
- After downloading `LLVM-17.0.6-win64.exe`, double-click to install, select an appropriate installation location, and most importantly, check the `Add Path to Current User` option to add the environment variable.
- Confirm that the installation is complete.
2. Download and install the Microsoft Visual C++ Redistributable to solve potential .dll missing issues:
- [MSVC++ 14.40.33810.0 Download](https://aka.ms/vs/17/release/vc_redist.x64.exe)
3. Download and install Visual Studio Community Edition to get MSVC++ build tools and resolve LLVM's header file dependencies:
- [Visual Studio Download](https://visualstudio.microsoft.com/zh-hans/downloads/)
- After installing Visual Studio Installer, download Visual Studio Community 2022.
- As shown below, click the `Modify` button and find the `Desktop development with C++` option to select and download.
4. Download and install [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
4. Double-click `start.bat` to open the training/inference WebUI management interface. If needed, you can modify `API_FLAGS` as prompted below.

!!! info "Optional"

Want to start the inference WebUI?

Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:
```
--infer
# --api
# --listen ...
...
```

!!! info "Optional"

Want to start the API server?

Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:

```
# --infer
--api
--listen ...
...
```

!!! info "Optional"

Double-click `run_cmd.bat` to enter the conda/python command line environment of this project.

## Linux Setup

@@ -107,6 +111,7 @@ apt install libsox-dev

## Changelog

- 2024/09/10: Updated Fish-Speech to version 1.4, increased the dataset size, and changed the quantizer's n_groups from 4 to 8.
- 2024/07/02: Updated Fish-Speech to version 1.2, removed the VITS decoder, and greatly enhanced the zero-shot ability.
- 2024/05/10: Updated Fish-Speech to version 1.1 and implemented a VITS decoder to reduce WER and improve timbre similarity.
- 2024/04/22: Finished Fish-Speech version 1.0 with significant modifications to the VQGAN and LLAMA models.
51 changes: 11 additions & 40 deletions docs/en/inference.md
@@ -90,51 +90,22 @@ python -m tools.post_api \

The above command synthesizes the desired audio from the reference audio information and returns it as a stream.
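
For context, a rough sketch of consuming such a streamed response over plain HTTP is shown below. The endpoint path, port, and payload fields are illustrative assumptions only; `tools.post_api` above remains the supported client.

```python
# Rough sketch of reading a streamed TTS response; the URL and payload fields
# are assumptions for illustration, not the project's documented API.
import requests

payload = {"text": "Text to be input", "streaming": True}

with requests.post("http://127.0.0.1:8000/v1/tts", json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("generated.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                out.write(chunk)  # write each audio chunk as it arrives
```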

If you need to randomly select reference audio based on `{SPEAKER}` and `{EMOTION}`, configure it according to the following steps:

### 1. Create a `ref_data` folder in the root directory of the project.

### 2. Create a directory structure similar to the following within the `ref_data` folder.

```
.
├── SPEAKER1
│   ├── EMOTION1
│   │   ├── 21.15-26.44.lab
│   │   ├── 21.15-26.44.wav
│   │   ├── 27.51-29.98.lab
│   │   ├── 27.51-29.98.wav
│   │   ├── 30.1-32.71.lab
│   │   └── 30.1-32.71.flac
│   └── EMOTION2
│       ├── 30.1-32.71.lab
│       └── 30.1-32.71.mp3
└── SPEAKER2
    └── EMOTION3
        ├── 30.1-32.71.lab
        └── 30.1-32.71.mp3
```

That is, first place `{SPEAKER}` folders in `ref_data`, then place `{EMOTION}` folders under each speaker, and place any number of `audio-text pairs` under each emotion folder.

### 3. Enter the following command in the virtual environment

```bash
python tools/gen_ref.py
```

### 4. Call the API.
```bash
python -m tools.post_api \
    --text "Text to be input" \
    --speaker "${SPEAKER1}" \
    --emotion "${EMOTION1}" \
    --streaming True
```

The above example is for testing purposes only.

The following example demonstrates that you can use **multiple** reference audio paths and reference audio texts at once. Separate them with spaces in the command.

```bash
python -m tools.post_api \
    --text "Text to input" \
    --reference_audio "reference audio path1" "reference audio path2" \
    --reference_text "reference audio text1" "reference audio text2" \
    --streaming False \
    --output "generated" \
    --format "mp3"
```

The above command synthesizes the desired `MP3` format audio based on the information from multiple reference audios and saves it as `generated.mp3` in the current directory.
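
Since this commit's message mentions ormsgpack support on the API side, here is a hedged sketch of posting a msgpack-encoded request directly; the endpoint, header, and field names are assumptions for illustration and may not match the actual server schema.

```python
# Hedged sketch of a msgpack-encoded TTS request; endpoint and field names are assumptions.
import ormsgpack
import requests

with open("reference audio path1", "rb") as f:
    ref_audio = f.read()

request = {
    "text": "Text to input",
    "references": [{"audio": ref_audio, "text": "reference audio text1"}],
    "format": "mp3",
}

resp = requests.post(
    "http://127.0.0.1:8000/v1/tts",
    data=ormsgpack.packb(request),
    headers={"Content-Type": "application/msgpack"},
)
resp.raise_for_status()
with open("generated.mp3", "wb") as out:
    out.write(resp.content)
```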

## GUI Inference
[Download client](https://github.com/AnyaCoder/fish-speech-gui/releases/tag/v0.1.0)

## WebUI Inference

24 changes: 18 additions & 6 deletions docs/zh/index.md
@@ -29,15 +29,26 @@

Professional Windows users may consider WSL2 or Docker to run the codebase.

```bash
# Create a Python 3.10 virtual environment; you can also use virtualenv
conda create -n fish-speech python=3.10
conda activate fish-speech

# Install pytorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install fish-speech
pip3 install -e .

# (Enable compilation acceleration) Install triton-windows
pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
```

Non-professional Windows users can consider the following basic methods to run the project without a Linux environment (with model compilation support, i.e. `torch.compile`):

1. Extract the project package.
2. Click `install_env.bat` to install the environment.
- You can decide whether to use a mirror site for downloads by editing the `USE_MIRROR` item in `install_env.bat`.
- `USE_MIRROR=false` downloads the latest stable `torch` from the original site; `USE_MIRROR=true` downloads the latest `torch` from a mirror site. The default is `true`.
- You can decide whether to enable the compiled-environment download by editing the `INSTALL_TYPE` item in `install_env.bat`.
- `INSTALL_TYPE=preview` downloads the preview version with the compilation environment; `INSTALL_TYPE=stable` downloads the stable version without it.
3. If `INSTALL_TYPE=preview` was set in step 2, perform this step (optional; it activates the model-compilation environment):
3. If you want to enable compilation acceleration, perform this step:
1. Download the LLVM compiler from the following links:
- [LLVM-17.0.6 (original site download)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
- [LLVM-17.0.6 (mirror site download)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
@@ -49,7 +60,7 @@ Non-professional Windows users can consider the following basic methods to run the project without a Linux environment
- [Visual Studio Download](https://visualstudio.microsoft.com/zh-hans/downloads/)
- After installing Visual Studio Installer, download Visual Studio Community 2022.
- As shown below, click the `Modify` button, find the `Desktop development with C++` option, and check it to download.
4. Download and install [CUDA Toolkit 12](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
4. Download and install [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
4. Double-click `start.bat` to open the training/inference WebUI management interface. If needed, you can modify `API_FLAGS` as prompted below.

!!! info "Optional"
@@ -158,6 +169,7 @@ apt install libsox-dev

## Changelog

- 2024/09/10: Updated Fish-Speech to version 1.4, increased the dataset size, and changed the quantizer's n_groups from 4 to 8.
- 2024/07/02: Updated Fish-Speech to version 1.2, removed the VITS decoder, and greatly enhanced the zero-shot ability.
- 2024/05/10: Updated Fish-Speech to version 1.1 and introduced a VITS decoder to reduce hallucination and improve timbre similarity.
- 2024/04/22: Finished Fish-Speech version 1.0 with significant modifications to the VQGAN and LLAMA models.
50 changes: 10 additions & 40 deletions docs/zh/inference.md
@@ -100,52 +100,22 @@ python -m tools.post_api \

The above command synthesizes the desired audio from the reference audio information and returns it as a stream.

If you need to randomly select a reference audio by `{SPEAKER}` and `{EMOTION}`, configure it according to the following steps:

### 1. Create a `ref_data` folder in the project root directory.

### 2. Create a directory structure like the following inside the `ref_data` folder.

```
.
├── SPEAKER1
│   ├── EMOTION1
│   │   ├── 21.15-26.44.lab
│   │   ├── 21.15-26.44.wav
│   │   ├── 27.51-29.98.lab
│   │   ├── 27.51-29.98.wav
│   │   ├── 30.1-32.71.lab
│   │   └── 30.1-32.71.flac
│   └── EMOTION2
│       ├── 30.1-32.71.lab
│       └── 30.1-32.71.mp3
└── SPEAKER2
    └── EMOTION3
        ├── 30.1-32.71.lab
        └── 30.1-32.71.mp3
```

That is, first place `{SPEAKER}` folders in `ref_data`, then place `{EMOTION}` folders under each speaker, and place any number of `audio-text pairs` under each emotion folder.

### 3. Enter the following command in the virtual environment

```bash
python tools/gen_ref.py
```

This generates the reference directory.

### 4. Call the API.
```bash
python -m tools.post_api \
    --text "Text to be input" \
    --speaker "Speaker 1" \
    --emotion "Emotion 1" \
    --streaming True
```

The above example is for testing purposes only.

The following example shows that you can use **multiple** `reference audio paths` and `reference audio texts` at once; just separate them with spaces in the command.

```bash
python -m tools.post_api \
    --text "Text to input" \
    --reference_audio "reference audio path 1" "reference audio path 2" \
    --reference_text "reference audio text 1" "reference audio text 2" \
    --streaming False \
    --output "generated" \
    --format "mp3"
```

The above command synthesizes the desired `MP3` format audio based on the information from multiple reference audios and saves it as `generated.mp3` in the current directory.

## GUI Inference
[Download client](https://github.com/AnyaCoder/fish-speech-gui/releases/tag/v0.1.0)

## WebUI Inference

2 changes: 2 additions & 0 deletions fish_speech/train.py
@@ -1,4 +1,6 @@
import os

os.environ["USE_LIBUV"] = "0"
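# Note (assumption): torch >= 2.4 makes libuv the default TCPStore backend for
# distributed init, and Windows builds may lack libuv support; forcing USE_LIBUV=0
# here, before torch is imported, avoids that failure on Windows.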
import sys
from typing import Optional

8 changes: 5 additions & 3 deletions fish_speech/webui/manage.py
@@ -1,9 +1,11 @@
from __future__ import annotations

import os

os.environ["USE_LIBUV"] = "0"
import datetime
import html
import json
import os
import platform
import shutil
import signal
@@ -862,15 +864,15 @@ def llama_quantify(llama_weight, quantify_mode):
minimum=1,
maximum=32,
step=1,
value=4,
value=2,
)
llama_data_max_length_slider = gr.Slider(
label=i18n("Maximum Length per Sample"),
interactive=True,
minimum=1024,
maximum=4096,
step=128,
value=1024,
value=2048,
)
with gr.Row(equal_height=False):
llama_precision_dropdown = gr.Dropdown(