Ecosyste.ms: Timeline
Browse the timeline of events for every public repo on GitHub. Data updated hourly from GH Archive.
DefTruth created a branch on DefTruth/CUDA-Learn-Notes
opt-hgemm-mma - 🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
DefTruth pushed 2 commits to opt-hgemm-mma DefTruth/CUDA-Learn-Notes
DefTruth pushed 1 commit to main DefTruth/CUDA-Learn-Notes
- [HGEMM] update HGEMM benchmark option (#95) * update hgemm benchmark option * update hgemm benchmark option * ... 0c29631
DefTruth pushed 1 commit to opt-hgemm-mma DefTruth/CUDA-Learn-Notes
- update hgemm benchmark option f899cd7
DefTruth pushed 1 commit to opt-hgemm-mma DefTruth/CUDA-Learn-Notes
- update hgemm benchmark option 2984f19
DefTruth pushed 1 commit to opt-hgemm-mma DefTruth/CUDA-Learn-Notes
- update hgemm benchmark option d57932a
DefTruth closed an issue on DefTruth/CUDA-Learn-Notes
Hello, I have a question about the reduce-related code
1. `sum = warp_reduce_sum<NUM_WARPS>(sum);` 2. `if(warp==0) sum = warp_reduce_sum<NUM_WARPS>(sum);` 0x03 warp/block reduce sum/max, 0x09 softmax, and softmax + vec4 use the first form when computing the final sum; 0x04 bl...
DefTruth closed an issue on DefTruth/CUDA-Learn-Notes
Purpose of `__threadfence()`
Have you tested the `__threadfence()` in 0x09 softmax? It does not seem able to synchronize threads at grid level.
DefTruth closed an issue on DefTruth/CUDA-Learn-Notes
Layer norm implementation
Is the layer norm implementation in the README actually batch norm?
DefTruth created a branch on DefTruth/CUDA-Learn-Notes
opt-hgemm-mma - 🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
github-actions[bot] created a comment on an issue on DefTruth/CUDA-Learn-Notes
This issue is stale because it has been open for 30 days with no activity.
DefTruth pushed 2 commits to opt-hgemm-mma DefTruth/CUDA-Learn-Notes
DefTruth pushed 1 commit to main DefTruth/CUDA-Learn-Notes
- [HGEMM] Add GeForce RTX 3080 Laptop benchmark (#94) * update hgemm benchmark * update hgemm benchmark ce095b5
DefTruth closed a pull request on DefTruth/CUDA-Learn-Notes
[HGEMM] Add GeForce RTX 3080 Laptop benchmark
DefTruth opened a pull request on DefTruth/CUDA-Learn-Notes
[HGEMM] Add GeForce RTX 3080 Laptop benchmark
DefTruth created a branch on DefTruth/CUDA-Learn-Notes
opt-hgemm-mma - 🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
DefTruth pushed 1 commit to main DefTruth/CUDA-Learn-Notes
- [Docs] rename mat_transpose -> mat-transpose (#93) * Update sgemm_wmma_tf32_stage.cu * Update sgemm.py * Updat... 523a610