> So it's possible for DynamicBatchSampler to sometimes exceed the frames_threshold.
`self.get_frame_len()` is intended to return the duration for exactly the queried index, while `__getitem__` does ca...
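To make the frame-budget idea concrete, here is a minimal sketch of greedy batching under a frame threshold. It is not the project's actual `DynamicBatchSampler`; the function name `greedy_frame_batches`, the `frame_len` callable, and the toy lengths are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List


def greedy_frame_batches(
    indices: List[int],
    frame_len: Callable[[int], int],
    frames_threshold: int,
) -> List[List[int]]:
    """Greedily group indices so each batch's reported frame total fits the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_frames = 0
    # Sort by reported length so similar-length samples end up together.
    for idx in sorted(indices, key=frame_len):
        n = frame_len(idx)
        # Close the current batch if adding this sample would exceed the budget.
        if current and current_frames + n > frames_threshold:
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += n
    if current:
        batches.append(current)
    return batches


# Hypothetical per-index frame counts, for illustration only.
lengths = {0: 300, 1: 500, 2: 400, 3: 900}
batches = greedy_frame_batches(list(lengths), lengths.__getitem__, frames_threshold=1000)
```

Note that the budget is only enforced against the length reported at batching time. If the sample actually loaded for an index has a different length than the one reported, the real total of a batch can differ from the planned one. In this sketch, a single sample longer than the threshold would also form its own over-budget batch.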
Our reproduced E2 model isn't trained with that scheme. Check the E2 paper:
![image](https://github.com/user-attachments/assets/cad5427e-9794-4b85-8116-8d7411d07ccd)
We just use characters, no random...
Use lowercase as we suggested in the README; otherwise you are telling the model to read letter by letter.
Also check whether the reference audio uploaded correctly; a waveform will be shown if so.
When I try to generate text with ARPAbet phones in parentheses, as shown in the "Specifying the pronunciation without model re-training" section of https://www.microsoft.com/en-us/research/project/e2-tt...