### Feature request
Hi all, the README mentions testing on the Qualcomm 8 Elite (Gen4) platform with all models running on the NPU. Is there an early demo available for testing?
Which p...
I plan to write a blog later to explain the algorithm details in depth. The algorithm in the paper is written quite concisely, so feel free to ask more questions!
Indeed, our code has a somewhat research-oriented style, haha, so some parts might not be very clear. Let me quickly explain:
1. **`quant_data`**: This is used to compute the proxy error for eval...
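In case it helps later readers, here is a minimal sketch of what a Hessian-weighted proxy error looks like in second-order PTQ methods. This is my simplified reconstruction, not the exact VPTQ code; the shapes and the assumption that `quant_data` supplies calibration activations `x` are mine:

```python
import torch

def proxy_error(w: torch.Tensor, w_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Hessian-weighted proxy error tr((W - W_hat) H (W - W_hat)^T).

    x: calibration activations of shape (n_samples, in_features);
       assumed here to be what `quant_data` provides.
    """
    h = x.T @ x                                    # proxy Hessian H = X^T X
    dw = w - w_hat                                 # quantization error
    return torch.einsum('oi,ij,oj->', dw, h, dw)   # trace(dW H dW^T)
```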
1. In the two-stage quantization, two pairs of similar functions, `init_centroids_indices` / `init_res_centroids_indices` and `quantize_vector` / `quantize_residual_vector`, are defined res...
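For readers following this thread, here is a minimal sketch of the two-stage (residual) scheme those paired functions correspond to, as I understand it from the paper. `kmeans_codebook` and `two_stage_quantize` are simplified stand-ins, not the repo's actual API:

```python
import torch

def kmeans_codebook(vectors: torch.Tensor, k: int, iters: int = 25) -> torch.Tensor:
    """Toy k-means codebook init (stand-in for init_centroids_indices /
    init_res_centroids_indices; the real code differs)."""
    centroids = vectors[torch.randperm(vectors.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(vectors, centroids).argmin(dim=1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(dim=0)
    return centroids

def two_stage_quantize(vectors, centroids, res_centroids):
    """Stage 1 quantizes each vector against the first codebook; stage 2
    quantizes what remains (stand-in for quantize_vector /
    quantize_residual_vector)."""
    idx = torch.cdist(vectors, centroids).argmin(dim=1)           # stage 1
    residual = vectors - centroids[idx]
    res_idx = torch.cdist(residual, res_centroids).argmin(dim=1)  # stage 2
    dequant = centroids[idx] + res_centroids[res_idx]             # reconstruction
    return idx, res_idx, dequant
```

The pairing exists because the residual codebook is initialized on, and queried with, the stage-1 residuals rather than the raw vectors, which is why the two halves of each pair look so similar.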
Additionally, I speculate that a 70B model at 2-bit might achieve stronger performance on certain benchmarks. Although I can’t prove this yet, I plan to conduct a thorough analysis on this in the f...
Yes, I really appreciate your question—it’s thought-provoking, and I’m seriously reflecting on it while trying to pursue rigorous research. If you're interested, we could collaborate directly. You ...
Thank you for your insightful and thought-provoking work. I have a question regarding the motivation behind low-bit quantization and its potential as a solution for enabling extremely low-bit quant...
Yes, in the paper, we included layer-wise fine-tuning. However, we recently found that running end-to-end fine-tuning performs better than layer-wise fine-tuning. I removed the code for layer-wise ...
The Llama 3-8B models in the paper have been fine-tuned; especially around ~2-bit, even just a few hundred iterations of end-to-end fine-tuning significantly improved model accuracy. We plan ...
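For anyone curious what that can look like in practice, here is a minimal sketch of end-to-end fine-tuning under my assumptions: the codebook indices stay frozen, only the centroid tables (and, say, norm layers) receive gradients, and the model exposes a Hugging Face-style causal-LM interface. The parameter-name patterns below are hypothetical:

```python
import itertools
import torch

def finetune_end_to_end(model, dataloader, steps: int = 300, lr: float = 1e-4):
    """Freeze everything except the centroid tables (and, optionally,
    norm layers), then minimize the ordinary LM loss end to end."""
    for name, p in model.named_parameters():
        # 'centroids' / 'norm' are hypothetical parameter-name patterns.
        p.requires_grad = ('centroids' in name) or ('norm' in name)
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for batch in itertools.islice(itertools.cycle(dataloader), steps):
        loss = model(input_ids=batch['input_ids'],
                     labels=batch['input_ids']).loss  # HF-style causal LM
        loss.backward()
        opt.step()
        opt.zero_grad()
```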
![Image](https://github.com/user-attachments/assets/68a478b6-e9c4-433d-89ea-3a08c8cbefdd)
I tried this config to quantize the model but didn't get as good res...
> Thank you for your quick response. I set `--vector_lens -1 12` because line 226 of `./vptq/quantizer.py` notes:
>
> `if num_centroids == -1:  # Do not quantize, keep original data`
>
> I ass...
This configuration looks a bit odd. When we set `--npercent 1`, the quantizer extracts the top 1% of weights as outliers and builds a separate lookup table for them. However, with `--vector_lens -1 12`, the vector length of the outlie...
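To make the interaction concrete, here is a minimal sketch of how I read the two flags together: `--npercent` carves the top 1% of columns into an outlier part with its own lookup table, while a vector length of -1 (matching the quoted branch on `num_centroids`) means that part is kept in its original precision. The function names and the importance score are illustrative assumptions, not the actual quantizer code:

```python
import torch

def split_outliers(w: torch.Tensor, importance: torch.Tensor, npercent: float):
    """Split the top-npercent columns (by an importance score, e.g. the
    Hessian diagonal) into an outlier part with its own lookup table."""
    n_out = max(1, int(round(w.shape[1] * npercent / 100)))
    mask = torch.zeros(w.shape[1], dtype=torch.bool)
    mask[importance.topk(n_out).indices] = True
    return w[:, mask], w[:, ~mask], mask  # outliers, main part, column mask

def quantize_part(part: torch.Tensor, vector_len: int):
    # Mirrors the quoted branch: -1 means "do not quantize, keep original data".
    if vector_len == -1:
        return part
    vectors = part.reshape(-1, vector_len)  # chop into length-vector_len vectors
    ...                                     # then quantize the vectors as above
```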
> > Hi [@ShawnzzWu](https://github.com/ShawnzzWu)
> > Would you mind sharing your quantized model so I can debug into it?
>
> Sorry, for information security reasons, I'm not allowed to share my f...
> I've been trying to quantize and run the Meta-Llama-3.1-8B-Instruct-2.3bit model with the group number set to 4, and I successfully ran the model when k1 (centroids) is 4096, as in the paper. However, an...
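A back-of-envelope bit-budget check may help frame that question (my own arithmetic, not from the repo): with vector length v and k1 centroids, each index costs log2(k1) bits for v weights, and a residual codebook adds log2(k2)/v more.

```python
import math

def bits_per_weight(v: int, k1: int, k2: int = 0) -> float:
    """Index cost per weight: log2(k1)/v for the main codebook, plus
    log2(k2)/v for an optional residual codebook (outlier columns and
    the centroid-table storage itself are not counted here)."""
    bpw = math.log2(k1) / v
    if k2:
        bpw += math.log2(k2) / v
    return bpw

print(bits_per_weight(12, 4096))        # 1.0: main indices alone
print(bits_per_weight(12, 4096, 4096))  # 2.0: with an equal-size residual codebook
```

By this accounting, the quoted ~2.3 bits presumably also covers outliers and codebook storage; that is my assumption, not something stated above.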