Talk on optimizing matrix multiplication with Triton kernels, focusing on low-bit processing and efficient quantization for high-performance AI models.
It addresses key challenges in multimodal AI development:
- Managing diverse inputs
- Scaling Generative AI apps
- Ensuring extensibility
Built on Ray for seamless scaling, Aana offers a unified framework for multiple data types, easy integration with popular ML frameworks, and a modular architecture.
We are releasing new 2-bit Mixtral models. These use a mixed HQQ 4-bit/2-bit configuration, resulting in a significantly improved model (perplexity 4.69 vs. 5.90) with a negligible 0.20 GB VRAM increase.
No data calibration needed, extremely fast, and it works on both language and vision models!
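The trade-off behind such a mixed configuration is easy to see from packed weight sizes alone. The sketch below estimates the memory of a 4-bit/2-bit mix; the layer names, shapes, and `BITS` map are illustrative assumptions, not the actual released configuration:

```python
# Hypothetical sketch of a mixed-bit budget: 4-bit for attention
# projections, 2-bit for the (much larger) expert weights. Layer names
# and parameter counts are illustrative, not Mixtral's exact shapes.
BITS = {"attn": 4, "expert": 2}

def packed_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store num_params weights packed at bit-width bits."""
    return num_params * bits // 8

layers = [
    ("attn.q_proj", 4096 * 4096, "attn"),
    ("expert.w1", 4096 * 14336, "expert"),
]

total_bytes = sum(packed_bytes(n, BITS[kind]) for _, n, kind in layers)
```

Because the expert matrices dominate the parameter count, pushing only them to 2-bit captures most of the memory savings while the 4-bit attention weights protect quality.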
* Why does it matter?
Quantization significantly reduces GPU memory requirements but degrades the quality of the models. Having faster and more accurate quantization methods is extremely valuable for the ML community.
* Approach:
Sparsity-based error formulation between the original weights and their dequantized version. We used a Half-Quadratic solver to derive a closed-form solution that is 100x faster than backprop via PyTorch's Autograd.
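The alternating structure of such a solver can be sketched in a few lines: shrink the sparse quantization error, then refit the zero-point in closed form, with no gradient descent anywhere. This is a simplified NumPy illustration of the general idea under my own assumptions (function names, the shrinkage form, and hyperparameters are mine), not the library's actual code:

```python
import numpy as np

def shrink(x, beta, p=0.7):
    # Generalized soft-thresholding: proximal step for an lp norm (p < 1),
    # which promotes a sparse error between W and its dequantized version.
    mag = np.abs(x) + 1e-8  # epsilon avoids 0 ** (p - 1)
    return np.sign(x) * np.maximum(mag - (p / beta) * mag ** (p - 1), 0.0)

def hqq_style_solve(W, s, bits=2, iters=20, beta=10.0, p=0.7):
    # Alternating half-quadratic updates: shrink the sparse quantization
    # error, then update the zero-point z in closed form (no autograd).
    qmax = 2**bits - 1
    z = np.zeros((W.shape[0], 1))
    We = np.zeros_like(W)
    for _ in range(iters):
        Wq = np.clip(np.round(W / s + z), 0, qmax)             # quantize
        Wdq = s * (Wq - z)                                     # dequantize
        We = shrink(W - Wdq, beta, p)                          # sparse error step
        z = np.mean(Wq - (W - We) / s, axis=1, keepdims=True)  # closed-form z
        beta *= 1.02                                           # anneal penalty
    return Wq, z
```

Every step is a vectorized closed-form update over the whole weight matrix, which is why this kind of solver runs orders of magnitude faster than optimizing the same objective with backprop.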
* Quantization speed:
~ 1 minute for Llama2-13B
~ 4 minutes for Llama2-70B (over 50x faster than GPTQ)
* Findings:
- Larger models quantized to 3/2-bit outperform smaller full-precision models with similar or lower memory requirements.
- Successful 2-bit quantization requires a lower group-size (e.g., 32 or 16) and compression of both the zero-point and the scaling factor for lower memory usage.
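To illustrate why the group-size and metadata compression matter, here is a minimal group-wise affine quantizer plus the effective bits-per-weight cost. The function names and layout are my own sketch, not the released implementation:

```python
import numpy as np

def quantize_groupwise(w, bits=2, group_size=32):
    # Split flat weights into groups and fit per-group affine (scale, min)
    # parameters; smaller groups track local weight statistics better,
    # which is what makes 2-bit viable.
    g = w.reshape(-1, group_size)
    wmin = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - wmin) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.round((g - wmin) / scale).astype(np.uint8)
    return q, scale, wmin

def bits_per_weight(bits, group_size, meta_bits):
    # Effective storage cost: payload bits plus per-group scale and
    # zero-point metadata amortized over the group.
    return bits + 2 * meta_bits / group_size
```

At group-size 32 with fp16 metadata, a "2-bit" model really costs 3 bits per weight; quantizing the scale and zero-point down to 8-bit brings that to 2.5, which is exactly why compressing them is needed at low group-sizes.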
While we acknowledge our view might be slightly biased, we genuinely believe this work will significantly benefit the open-source (OSS) machine learning community. The code and models are released under the permissive Apache license.
It is not distilling the model; it reduces the model weights on the fly and uses LoRA for training/fine-tuning. After the training phase, we explain how to merge the LoRA weights with the pruned weights to achieve faster inference speed.
In a nutshell, we've managed to reduce the model's parameter count by up to 50%, double the training speed, and increase inference speed by 1.25 times.
For those interested in the technical details or looking to replicate our results, the code is openly available for community use and contributions.
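The merge step described above amounts to folding the low-rank update back into the (pruned) base matrix. A minimal sketch, assuming the usual LoRA parameterization W + (alpha/r)·B·A and a hypothetical pruning mask (names and shapes are illustrative):

```python
import numpy as np

def merge_pruned_lora(W, mask, A, B, alpha, rank):
    # Fold the LoRA update into the masked base weights so inference
    # runs a single matmul with no adapter overhead.
    # Shapes: W, mask: (out, in); B: (out, r); A: (r, in).
    return W * mask + (alpha / rank) * (B @ A)
```

After merging, the adapter matrices can be discarded entirely, which is where the inference speedup comes from.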
Automatic door sensors are, in my experience, universally infrared. In fact, I don't think I've ever seen any camera technology used in that context. Are you saying they used cameras to open the doors?
It's video conferencing software. It makes sense that they might put together imagery suggesting people meeting from different corners of the planet. But sure, I get your point. I didn't notice this myself, but I have been living on the side of the planet opposite from where I was born for the past decade.
While I agree with the general principle, incentive structures are wired quite differently in academia vs. end-consumer-oriented gig/service industries.
Publications (number, recency, venue, citations) are the primary currency by which one is judged among peers in academia, and reputation outside the immediate academic community carries much lower weight. For online marketplaces, by contrast, solid revenue is the first priority, and reputation comes second (as a means to higher revenue). In academia it is the reverse: reputation (within a small clique) is the primary motivator, and funding is the means to gather it.