Ecosyste.ms: Timeline

Browse the timeline of events for every public repo on GitHub. Data updated hourly from GH Archive.

huggingface/text-generation-inference

drbh pushed 13 commits to support-qwen2-vl huggingface/text-generation-inference
  • CI job. Gpt awq 4 (#2665) * add gptq and awq int4 support in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@... 153ff37
  • Make handling of FP8 scales more consistent (#2666) Change `fp8_quantize` so that we can pass around reciprocals ever... 5e0fb46 (see the sketch after this commit list)
  • Test Marlin MoE with `desc_act=true` (#2622) Update the Mixtral GPTQ test to use a model with `desc_act=true` and `... 7f54b73
  • break when there's nothing to read (#2582) Signed-off-by: Wang, Yi A <[email protected]> 058d306
  • Add `impureWithCuda` dev shell (#2677) * Add `impureWithCuda` dev shell This shell is handy when developing some ... 9c9ef37
  • Make moe-kernels and marlin-kernels mandatory in CUDA installs (#2632) f58eb70
  • feat: natively support Granite models (#2682) * feat: natively support Granite models * Update doc 03c9388
  • hotfix: fix flashllama 27ff187
  • feat: allow any supported payload on /invocations (#2683) * feat: allow any supported payload on /invocations * u... 41c2623
  • flashinfer: reminder to remove contiguous call in the future (#2685) 1b914f3
  • Fix Phi 3.5 MoE tests (#2684) PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix t... 14a0df3
  • Add support for FP8 KV cache scales (#2628) * Add support for FP8 KV cache scales Since FP8 only has limited dyna... eab07f7
  • feat: add support for qwen2 vl model b735e79
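
One of the commits above changes `fp8_quantize` so that reciprocal scales are passed around. As a hedged illustration of why that helps (a minimal numpy sketch; the function name matches TGI's but the body, the constant, and the float32 stand-in for float8 are assumptions, since numpy has no float8 dtype):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value

def fp8_quantize(weight: np.ndarray):
    """Per-tensor symmetric quantization returning the *reciprocal* scale.

    Returning 1/scale lets every downstream consumer dequantize with a
    multiply instead of a slower divide. Illustrative only.
    """
    scale = FP8_E4M3_MAX / np.abs(weight).max()
    q = np.clip(weight * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, np.float32(1.0 / scale)

w = np.random.randn(4, 8).astype(np.float32)
q, scale_recip = fp8_quantize(w)
w_approx = q * scale_recip  # dequantize: multiply by the reciprocal
```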

drbh created a branch on huggingface/text-generation-inference

support-qwen2-vl - Large Language Model Text Generation Inference

danieldk pushed 1 commit to feature/cc89-cutlass-w8a8 huggingface/text-generation-inference
  • Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on pa... c6281a4

danieldk opened a pull request on huggingface/text-generation-inference
Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
# What does this PR do? Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. **D...
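
For context on what a w8a8 scaled matmul computes (not the fbgemm-gpu or marlin kernels themselves; a hedged numpy sketch with made-up helper names), the core idea is an int8 matmul with int32 accumulation followed by per-row and per-column scales:

```python
import numpy as np

def quantize_per_row(x: np.ndarray):
    # Symmetric int8 quantization with one scale per row.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def w8a8_scaled_matmul(a_q, a_scale, b_q, b_scale):
    # Accumulate in int32, then apply the outer product of scales.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * a_scale * b_scale

a = np.random.randn(16, 64).astype(np.float32)
b = np.random.randn(64, 32).astype(np.float32)
a_q, a_s = quantize_per_row(a)    # activations: per-row scales
b_q, b_s = quantize_per_row(b.T)  # weights: per-output-column scales
out = w8a8_scaled_matmul(a_q, a_s, b_q.T, b_s.T)
# `out` approximates `a @ b`; the real kernels fuse this into one pass.
```
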
danieldk pushed 1 commit to feature/cc89-cutlass-w8a8 huggingface/text-generation-inference
  • Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on pa... 197d45e

danieldk pushed 4 commits to feature/cc89-cutlass-w8a8 huggingface/text-generation-inference
  • flashinfer: reminder to remove contiguous call in the future (#2685) 1b914f3
  • Fix Phi 3.5 MoE tests (#2684) PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix t... 14a0df3
  • Add support for FP8 KV cache scales (#2628) * Add support for FP8 KV cache scales Since FP8 only has limited dyna... eab07f7
  • Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels bee95a3

mfuntowicz pushed 1 commit to feat-backend-llamacpp huggingface/text-generation-inference
  • feat(backend): build and link through build.rs 84854a7

danieldk deleted a branch huggingface/text-generation-inference

feature/fp8-kv-cache-scale

danieldk pushed 1 commit to main huggingface/text-generation-inference
  • Add support for FP8 KV cache scales (#2628) * Add support for FP8 KV cache scales Since FP8 only has limited dyna... eab07f7

danieldk closed a pull request on huggingface/text-generation-inference
Add support for FP8 KV cache scales
# What does this PR do? Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the...
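
The idea in that PR description can be sketched as follows (a minimal illustration under assumed names; in real TGI `k_scale`/`v_scale` come from checkpoints or calibration and the cache holds actual float8 tensors, while numpy's lack of a float8 dtype means the cast is only hinted at by the clip):

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def store_kv(cache_k, cache_v, slot, k, v, k_scale, v_scale):
    # Scale down before storing so values fit FP8's narrow dynamic range.
    cache_k[slot] = np.clip(k / k_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    cache_v[slot] = np.clip(v / v_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def load_kv(cache_k, cache_v, slot, k_scale, v_scale):
    # Unscale in attention, so the cache itself never needs rescaling.
    return cache_k[slot] * k_scale, cache_v[slot] * v_scale
```
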
mht-sharma created a review on a pull request on huggingface/text-generation-inference
LGTM thanks!

drbh opened a pull request on huggingface/text-generation-inference
fix: improve find_segments via numpy diff
This PR updates `find_segments` to use a numpy array rather than a list, which allows us to find the segments more efficiently via masking. When testing the method in isolation, this results in ...
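
A plausible shape for that change (hedged: the actual signature and return values live in TGI's source; this sketch only demonstrates the vectorized-diff idea):

```python
import numpy as np

def find_segments(adapter_indices: np.ndarray):
    # Positions where consecutive values differ mark segment starts.
    change = np.where(np.diff(adapter_indices) != 0)[0] + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(adapter_indices)]))
    segment_ids = adapter_indices[starts]
    return starts, ends, segment_ids

# e.g. [0, 0, 1, 1, 1, 2] -> starts [0, 2, 5], ends [2, 5, 6], ids [0, 1, 2]
starts, ends, ids = find_segments(np.array([0, 0, 1, 1, 1, 2]))
```
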
drbh created a branch on huggingface/text-generation-inference

improve-find-segments-function - Large Language Model Text Generation Inference

nhplwww starred huggingface/text-generation-inference
danieldk deleted a branch huggingface/text-generation-inference

maintenance/phi35-test-fix

danieldk pushed 1 commit to main huggingface/text-generation-inference
  • Fix Phi 3.5 MoE tests (#2684) PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix t... 14a0df3

danieldk closed a pull request on huggingface/text-generation-inference
Fix Phi 3.5 MoE tests
# What does this PR do? PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix this.
danieldk deleted a branch huggingface/text-generation-inference

maintenance/flashinfer-noncontiguous-reminder

danieldk pushed 1 commit to main huggingface/text-generation-inference
  • flashinfer: reminder to remove contiguous call in the future (#2685) 1b914f3

danieldk closed a pull request on huggingface/text-generation-inference
flashinfer: reminder to remove contiguous call in the future
danieldk opened a pull request on huggingface/text-generation-inference
flashinfer: reminder to remove contiguous call in the future
danieldk created a review comment on a pull request on huggingface/text-generation-inference
Ah yes, good catch! I added an additional condition to `can_scale` to check that `ATTENTION=="flashinfer"`.
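
A sketch of the kind of guard being described, with assumed names (`ATTENTION`, `can_scale`'s signature, and the dtype string are illustrative, not TGI's exact code):

```python
import os

ATTENTION = os.environ.get("ATTENTION", "flashinfer")  # backend selector

def can_scale(kv_dtype: str) -> bool:
    # Only the flashinfer backend consumes KV cache scales; flash decoding
    # and paged attention do not accept them yet (see the review question
    # further down this feed).
    return kv_dtype == "fp8_e4m3" and ATTENTION == "flashinfer"
```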

danieldk created a review on a pull request on huggingface/text-generation-inference

danieldk pushed 1 commit to feature/fp8-kv-cache-scale huggingface/text-generation-inference
  • `can_scale`: check that the attention is flashinfer a68fae0

danieldk created a comment on an issue on huggingface/text-generation-inference
Made mandatory and installed through a `make install` in #2632, so this should be fixed in the next release. Feel free to reopen if the issue still occurs after the next release.

danieldk closed an issue on huggingface/text-generation-inference
No module named moe_kernel in Flash Attention Installation while compiling TGI2.3.1
**Task: Flash Attention installation from source. [Completed] Run TGI 2.3.1 with models that support Flash Attention.** [Issue does not occur in TGI 2.2.0] **Error:** 202...
danieldk opened a pull request on huggingface/text-generation-inference
Fix Phi 3.5 MoE tests
# What does this PR do? PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix this.
danieldk created a branch on huggingface/text-generation-inference

maintenance/phi35-test-fix - Large Language Model Text Generation Inference

mht-sharma created a review comment on a pull request on huggingface/text-generation-inference
Should this be only done when ATTENTION is `flashinfer`, since scales are not passed to flash decoding and paged attention yet?
