https://github.com/microsoft/unilm/blob/master/Diff-Transformer/multihead_diffattn.py#L99
I noticed the line `attn_weights = torch.nan_to_num(attn_weights)`, which looks odd.
When I try to t...
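For context, a minimal sketch of one common way NaNs can appear in attention weights, which that line would guard against (this is an assumption about the intent, not taken from the repo): if every position in a row is masked to `-inf` before the softmax, the softmax of that row is all NaN, and `torch.nan_to_num` replaces those NaNs with zeros.

```python
import torch

# A fully masked attention row: every score is -inf before the softmax.
scores = torch.full((1, 4), float("-inf"))

# softmax over an all--inf row produces NaN in every entry.
attn_weights = torch.softmax(scores, dim=-1)
print(torch.isnan(attn_weights).all())  # tensor(True)

# nan_to_num replaces the NaNs with 0.0, so downstream matmuls stay finite.
attn_weights = torch.nan_to_num(attn_weights)
print(attn_weights)  # tensor([[0., 0., 0., 0.]])
```

Whether this is the actual failure mode the authors were guarding against, or something specific to their training setup, is what the question above is asking.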
Hi, thank you for raising this. I think the NaN values in your attention weights may be related to loading pretrained weights and then fine-tuning the model.
In our paper, all models were trained from ...
Thanks for the clarification!
I did not use the official code exactly, but my own implementation, just to see whether we can inherit the weights of a pretrained model.
Maybe I introduced some bugs in my code. Training is...
**Describe**
I'm trying to download the large version of pretrained VLMo, but the link seems to have expired. Could you please update the download link?
https://github.com/wenhui0924/vlmo_ckpts/releases/down...