In the several test cases I ran, it always emits non-existent characters at the beginning of the audio:
[zh_prompt.zip](https://github.com/user-attachments/files/17404857/zh_prompt.zip)
The given model's...
@SWivid something I missed:
here is also a video of it in action, all working great
https://github.com/user-attachments/assets/ebf25212-2cc5-4570-b5ff-41b63a6f0f96
BTW: when you have free time please test
@SWivid just another great update that also fixes some stuff.
First, I added creating a vocab from the dataset, as you can see:
![image](https://github.com/user-attachments/assets/9784504d-b772-4369-b275-5e8dc...
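
For anyone wondering what "creating a vocab from the dataset" could look like, here is a minimal sketch, not the actual code from this update. It assumes a character-level vocab written one token per line; the function names `build_char_vocab` and `save_vocab` and the file layout are hypothetical.

```python
def build_char_vocab(transcripts):
    """Collect the unique characters seen across all transcripts, sorted."""
    chars = set()
    for text in transcripts:
        chars.update(text)
    # Newlines would break a one-token-per-line file, so drop them.
    chars.discard("\n")
    return sorted(chars)


def save_vocab(vocab, path):
    """Write the vocab one token per line (assumed layout, not confirmed)."""
    with open(path, "w", encoding="utf-8") as f:
        for token in vocab:
            f.write(token + "\n")


if __name__ == "__main__":
    # Hypothetical transcripts standing in for the dataset's text column.
    vocab = build_char_vocab(["hello world", "你好"])
    print(len(vocab), "tokens")
```

Sorting keeps the token order stable between runs, which matters if token indices are tied to an embedding table.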
> Is it possible to merge the weights of 2 models, so there is no need for the dataset: English, Chinese ...
Not sure if that will work.
A more feasible solution is to do LLaMA-Adapter-style finetuning.
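
To make the "merge 2 models' weights" idea concrete, here is a rough sketch of naive linear interpolation between two checkpoints. It assumes both checkpoints share the same architecture and parameter names, and it uses plain lists of floats instead of tensors so it runs without any dependency. `merge_weights` and `alpha` are hypothetical names, and as said above, there is no guarantee the merged model produces anything useful.

```python
def merge_weights(state_a, state_b, alpha=0.5):
    """Interpolate two checkpoints: alpha * a + (1 - alpha) * b per parameter.

    Both inputs map parameter names to flat lists of floats (a stand-in
    for real tensors).
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints have different parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


if __name__ == "__main__":
    # Toy checkpoints, e.g. one English and one Chinese model.
    en = {"layer.weight": [0.0, 2.0]}
    zh = {"layer.weight": [2.0, 4.0]}
    print(merge_weights(en, zh))
```

Note that if the two models were trained with different vocabs, the embedding tables would not line up and this kind of averaging would not even be well defined for them.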
@huutuongtu I would recommend a smaller model size.
And just train longer, at least 200k updates though, to hear something reasonable, because we have no phoneme-level forced alignment
(if you're int...