Ecosyste.ms: Timeline

Browse the timeline of events for every public repo on GitHub. Data updated hourly from GH Archive.

huggingface/text-generation-inference

Bihan created a comment on an issue on huggingface/text-generation-inference
> Any chance you could try `docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm`? ROCm FP8 support was improved yesterday: #2588

@danieldk Yes sure.

danieldk created a comment on an issue on huggingface/text-generation-inference
Any chance you could try `docker pull ghcr.io/huggingface/text-generation-inference:latest-rocm`? ROCm FP8 support was improved yesterday: https://github.com/huggingface/text-generation-inferenc...

danieldk pushed 25 commits to maintenance/reshape-and-cache huggingface/text-generation-inference
  • enable mllama in intel platform (#2610) Signed-off-by: Wang, Yi A <[email protected]> 57f9685
  • Upgrade minor rust version (Fixes rust build compilation cache) (#2617) * Upgrade minor rust version (Fixes rust bui... 8b295aa
  • Add support for fused MoE Marlin for AWQ (#2616) * Add support for fused MoE Marlin for AWQ This uses the updated... 6414248
  • nix: move back to the tgi-nix main branch (#2620) 6db3bcb
  • CI (2599): Update ToolType input schema (#2601) * Update ToolType input schema * lint * fix: run formatter ... 8ad20da
  • nix: add black and isort to the closure (#2619) To make sure that everything is formatted with the same black versio... 9ed0c85
  • AMD CI (#2589) * Only run 1 valid test. * TRying the tailscale action quickly. * ? * bash spaces. * Remo... 43f39f6
  • feat: allow tool calling to respond without a tool (#2614) * feat: process token stream before returning to client ... e36dfaa
  • Update documentation to most recent stable version of TGI. (#2625) Update to most recent stable version of TGI. d912f0b
  • Intel ci (#2630) * Intel CI ? * Let's try non sharded gemma. * Snapshot rename * Apparently container can b... 3dbdf63
  • Fixing intel Supports windowing. (#2637) 0c47884
  • Small fixes for supported models (#2471) * Small improvements for docs * Update _toctree.yml * Updating the do... ce28ee8
  • Cpu perf (#2596) * break when there's nothing to read Signed-off-by: Wang, Yi A <[email protected]> * Differ... 3ea82d0
  • Clarify gated description and quicktour (#2631) Update quicktour.md 51f5401
  • update ipex to fix incorrect output of mllama in cpu (#2640) Signed-off-by: Wang, Yi A <[email protected]> 7a82ddc
  • feat: enable pytorch xpu support for non-attention models (#2561) XPU backend is available natively (without IPEX) i... 58848cb
  • Fixing linters. (#2650) cf04a43
  • Use flashinfer for Gemma 2. ce7e356
  • Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651) As spotted by @philschmid, the payload ... ffe05cc
  • Fp8 e4m3_fnuz support for rocm (#2588) * (feat) fp8 fnuz support for rocm * (review comments) Fix compression_con... 704a58c
  • and 5 more ...

danieldk created a comment on an issue on huggingface/text-generation-inference
Thanks for reporting! I updated the title to reflect that this issue only occurs on ROCm. It looks like we have to expand the shapes when dispatching to Torch scaled mm (for CUDA we don't use the T...
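
As a rough sketch of the shape expansion alluded to above (assuming a recent PyTorch where `torch._scaled_mm` returns a single tensor; the function and argument names below are hypothetical, not TGI's code): `torch._scaled_mm` only accepts 2D operands, so leading batch dimensions have to be flattened before dispatch and restored afterwards.

```python
import torch

# Hypothetical sketch: flatten leading dims to 2D for torch._scaled_mm,
# then reshape the result back. Not TGI's implementation.
def scaled_mm_nd(x, w_t, scale_x, scale_w):
    *batch, in_features = x.shape
    x2d = x.reshape(-1, in_features)  # expand/flatten to the 2D shape the op requires
    out = torch._scaled_mm(
        x2d,                          # fp8 activations, row-major
        w_t,                          # fp8 weights (in_features, out_features), column-major
        scale_a=scale_x,
        scale_b=scale_w,
        out_dtype=torch.bfloat16,
    )
    return out.reshape(*batch, out.shape[-1])
```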

danieldk created a review on a pull request on huggingface/text-generation-inference

Grey4sh created a comment on an issue on huggingface/text-generation-inference
Get it. Thank you for your nice suggestions.

Grey4sh closed an issue on huggingface/text-generation-inference
TGI included marlin kernel is missing padding code (REOPEN)
### System Info ### TGI version tgi-2.3.1 docker image ### OS version ```shell torch install path ............... ['/home/chatgpt/.local/lib/python3.10/site-packages/torch'] torch version ....
Narsil created a review comment on a pull request on huggingface/text-generation-inference
```suggestion
# Get prefill logprobs with inplace softmax (avoid copying the `out` tensor (max_batch_prefill_tokens * vocab_size))
```
There is no batch size anymore, per se.
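
For illustration, an in-place log-softmax along these lines avoids materializing a second `(max_batch_prefill_tokens, vocab_size)` tensor; only `(rows, 1)` temporaries are allocated. A hand-rolled sketch, not the PR's code:

```python
import torch

def log_softmax_(x: torch.Tensor) -> torch.Tensor:
    # Mutates `x` in place instead of allocating a full-size copy;
    # the only temporaries are (rows, 1) reductions.
    x.sub_(x.amax(dim=-1, keepdim=True))              # for numerical stability
    x.sub_(torch.logsumexp(x, dim=-1, keepdim=True))  # x - logsumexp(x)
    return x
```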

Narsil created a review on a pull request on huggingface/text-generation-inference

Narsil created a review on a pull request on huggingface/text-generation-inference

Narsil created a comment on a pull request on huggingface/text-generation-inference
> For instance, when using meta-llama/Meta-Llama-3.1-8B-Instruct on an L4, this change allows running the model with --max-batch-prefill-tokens increased from 7192 to 9874 without exceeding memory ...
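
For a sense of scale (illustrative numbers only, since the comment is truncated: assuming a Llama 3.1 vocabulary of 128,256 tokens and 16-bit values), a full prefill-logprobs copy at `--max-batch-prefill-tokens 7192` would be about 7192 × 128256 × 2 bytes ≈ 1.8 GB, which is roughly the kind of headroom that avoiding the copy frees up.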

Narsil created a comment on an issue on huggingface/text-generation-inference
TGI will always use all the allowed memory for the KV cache, to allow MANY users on the same machine. MAX_BATCH_SIZE is not used on Nvidia targets, as mentioned in the docs: https://huggi...
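
A back-of-the-envelope sketch of why the KV cache simply absorbs whatever memory is allowed (all numbers assumed, Llama-3-8B-like geometry, not read from any config):

```python
# Assumed model geometry: 32 layers, 8 KV heads, head size 128, fp16.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

# Key + value, per token, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
free_vram = 20 * 1024**3  # e.g. 20 GiB left after weights (made up)

print(kv_bytes_per_token)               # 131072 bytes = 128 KiB per token
print(free_vram // kv_bytes_per_token)  # ~163k tokens of KV cache
```

The more leftover VRAM goes to the cache, the more concurrent sequences fit, which is why TGI claims it all rather than capping the batch size.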

Narsil created a comment on an issue on huggingface/text-generation-inference
Also, using 2x A100 should be more efficient in general if it works (less communication overhead between shards). If you have trouble with your current settings on 4 shards, there are some new f...

Narsil created a comment on an issue on huggingface/text-generation-inference
Okay. This is a won't-fix for us. Having odd-sized dimensions is an issue in many kernels, and padding is costly, wasting precious GPU resources (you would essentially be computing 25% too much...
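
To make the 25% figure concrete (made-up dimensions, purely illustrative):

```python
import math

def padding_overhead(dim: int, align: int) -> float:
    """Extra compute fraction from zero-padding `dim` up to a multiple of `align`."""
    padded = math.ceil(dim / align) * align
    return padded / dim - 1.0

# An awkward 3277-wide dimension padded up to 4096 wastes ~25% of the matmul.
print(f"{padding_overhead(3277, 4096):.0%}")
```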

Narsil closed a pull request on huggingface/text-generation-inference
fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process
reason: we added a docker wrapper script a while ago to fix missing .so issues encountered when spawning tgi in some cloud providers that add shared libs, related to cuda for example, but do not re...
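
Presumably the fix is the usual shell idiom, `exec text-generation-launcher "$@"`. For illustration, the same idea in Python: replacing the wrapper process instead of forking a child means the launcher keeps PID 1 and receives container signals directly.

```python
import os
import sys

# Illustrative analogue of the entrypoint fix: os.execvp *replaces* this
# process with the launcher (no child is spawned), so the launcher inherits
# PID 1 in the container and gets SIGTERM/SIGINT without any forwarding.
os.execvp("text-generation-launcher", ["text-generation-launcher", *sys.argv[1:]])
# Never returns; raises OSError if the binary cannot be found.
```
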
Narsil created a review on a pull request on huggingface/text-generation-inference
LGTM

Narsil opened a pull request on huggingface/text-generation-inference
Fixing "deadlock" when python prompts for trust_remote_code by always
specifiying a value. # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes ...
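
The "deadlock" in the title: with `trust_remote_code` left unset, `transformers` can fall back to an interactive stdin prompt for models that ship custom code, and a non-interactive server then hangs waiting for input. A minimal sketch of the idea (the model id is a placeholder):

```python
from transformers import AutoTokenizer

# Passing an explicit boolean means transformers never needs to prompt on
# stdin, so a headless server process cannot hang on the question.
tokenizer = AutoTokenizer.from_pretrained(
    "org/model-with-custom-code",  # placeholder id
    trust_remote_code=False,       # always specify a value, per the PR title
)
```
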
Narsil created a branch on huggingface/text-generation-inference

fixup_tokenizer_trust

oOraph opened a pull request on huggingface/text-generation-inference
fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process
reason: we added a docker wrapper script a while ago to fix missing .so issues encountered when spawning tgi in some cloud providers that add shared libs, related to cuda for example, but do not re...
Narsil created a comment on an issue on huggingface/text-generation-inference
Thanks a lot for reopening with a lot more information; it helps us narrow down the issue much faster.

Narsil deleted a branch huggingface/text-generation-inference

maintenance/simplify-attention

Narsil pushed 1 commit to main huggingface/text-generation-inference
  • Simplify the `attention` function (#2609) * Simplify the `attention` function - Use one definition rather than mu... 59ea38c

Narsil closed a pull request on huggingface/text-generation-inference
Simplify the `attention` function
# What does this PR do? - Use one definition rather than multiple (will make it easier to do shared things once, such as calculating the FP8 KV cache reciprocal). - Add `key`/`value` arguments,...
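
Schematically (argument names follow the PR description; the body is a naive stand-in for the real paged/flash kernels, shown only to illustrate the single shared signature):

```python
import torch

def attention(query, key, value, *, softmax_scale):
    # One shared definition across backends; shared work such as FP8
    # KV-cache scaling now has a single place to live. Shapes assumed
    # to be (seq_len, num_heads, head_dim).
    scores = torch.einsum("qhd,khd->hqk", query, key) * softmax_scale
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), value)
```
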
Narsil deleted a branch huggingface/text-generation-inference

feature/kv-cache-e4m3

Narsil pushed 1 commit to main huggingface/text-generation-inference
  • Support `e4m3fn` KV cache (#2655) * Support `e4m3fn` KV cache * Make check more obvious 5bbe1ce

Narsil closed a pull request on huggingface/text-generation-inference
Support `e4m3fn` KV cache
# What does this PR do? Add support for `e4m3fn` KV caches as well. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case)...
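
A rough illustration of what an `e4m3fn` KV cache means (simplified per-tensor scaling, not TGI's implementation):

```python
import torch

k = torch.randn(16, 8, 128, dtype=torch.bfloat16)           # made-up KV block
scale = k.abs().amax() / torch.finfo(torch.float8_e4m3fn).max

k_fp8 = (k / scale).to(torch.float8_e4m3fn)  # write path: 1 byte per element
k_read = k_fp8.to(torch.bfloat16) * scale    # read path: dequantize for attention
```

Halving the bytes per cached token relative to fp16 roughly doubles how many tokens fit in the same VRAM.
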
Narsil created a review on a pull request on huggingface/text-generation-inference

Narsil created a review on a pull request on huggingface/text-generation-inference
LGTM

josephrocca created a comment on an issue on huggingface/text-generation-inference
> With that in mind, it'll be much easier to assess a correct caching solution.

Gotcha, makes sense. For reference, I use sticky sessions, and it's not much of a can of worms in my case, sinc...

danieldk created a review comment on a pull request on huggingface/text-generation-inference
Should be fixed now, tested Llama & Mistral with `paged`, `flashattention` and `flashinfer`.
