Hacker News | ibuildthings's comments

Talk on optimizing matrix multiplication with Triton kernels, focusing on low-bit processing and efficient quantization for high-performance AI models.


Aana SDK is an open-source toolkit for building cutting-edge multimodal AI applications: https://github.com/mobiusml/aana_sdk

It addresses key challenges in multimodal AI development:

- Managing diverse inputs
- Scaling generative AI apps
- Ensuring extensibility

Built on Ray for seamless scaling, Aana offers a unified framework for multiple data types, easy integration with popular ML frameworks, and a modular architecture.


Currently leading LLM benchmarks in the 3B model category


We are releasing new 2-bit Mixtral models. These use a mixed HQQ 4-bit/2-bit configuration, resulting in a significantly improved model (ppl 4.69 vs. 5.90) with a negligible 0.20 GB VRAM increase.

Base: https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-a...

Instruct: https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-...

Shout-out to Artem Eliseev and Denis Mazur for suggesting this idea ( https://github.com/mobiusml/hqq/issues/2 )


We are releasing 2-bit and 4-bit quantized versions of Mixtral utilizing the HQQ method that we just published https://mobiusml.github.io/hqq_blog/ and https://github.com/mobiusml/hqq.

The 2-bit version can run on a 24GB Titan RTX.

In terms of perplexity scores on the wikitext2 dataset, the results (VRAM / perplexity) are as follows:

- Mixtral: 26 GB / 3.79
- Llama2-70B: 26.37 GB / 4.13
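For reference, the perplexity being compared here is the exponential of the mean per-token negative log-likelihood over the evaluation set. A minimal sketch (the NLL values below are toy numbers, not wikitext2 measurements):

```python
import math

def perplexity(nlls, n_tokens):
    """Perplexity = exp(total negative log-likelihood / token count)."""
    return math.exp(sum(nlls) / n_tokens)

# Toy per-token NLLs (natural log); real scores come from a full eval run.
nlls = [1.2, 1.5, 1.3, 1.4]
ppl = perplexity(nlls, len(nlls))
print(round(ppl, 2))  # exp(1.35) ≈ 3.86
```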


Sharing our work on model quantization.

- Blog: https://mobiusml.github.io/hqq_blog/
- Code: https://github.com/mobiusml/hqq
- Models: https://huggingface.co/mobiuslabsgmbh/

No data calibration needed, extremely fast, and it works on both language and vision models!

* Why does it matter? Quantization significantly reduces GPU memory requirements but degrades the quality of the models. Having faster and more accurate quantization methods is extremely valuable for the ML community.

* Approach: Sparsity-based error formulation between the original weights and their dequantized version. We used a Half-Quadratic solver to derive a closed-form solution that is 100x faster than backprop via Pytorch's Autograd.

* Quantization speed: ~1 minute for Llama2-13B, ~4 minutes for Llama2-70B (over 50x faster than GPTQ)

* Findings:

- Larger models quantized to 3/2-bit outperform smaller full-precision models with similar or lower memory requirements.
- Successful 2-bit quantization requires a lower group-size (e.g., 32 or 16) and compression of both the zero-point and the scaling factor for lower memory usage.
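The memory cost of a lower group-size can be estimated directly: each group of g weights stores its own scale and zero-point, so if those are compressed to 8-bit the amortized overhead is 16/g bits per weight. A quick sketch (the 8-bit metadata width is an illustrative assumption):

```python
def effective_bits(w_bits, group_size, meta_bits=8):
    """Average bits per weight: payload plus scale + zero-point amortized per group."""
    return w_bits + 2 * meta_bits / group_size

for g in (128, 64, 32, 16):
    print(f"group_size={g}: {effective_bits(2, g)} bits/weight")
# group_size=16 at 2-bit: 2 + 16/16 = 3.0 bits/weight
```

This shows the trade-off behind the finding: smaller groups track the weight distribution more closely but pay a metadata overhead, which is why compressing the zero-point and scale matters.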

While we acknowledge our view might be slightly biased, we genuinely believe this work will significantly benefit the open-source (OSS) machine learning community. The code and models are released under the permissive Apache license.
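The half-quadratic solver mentioned above can be sketched in plain numpy: alternate a closed-form shrinkage step on the quantization error with a closed-form (mean) update of the zero-point, with no gradients involved. This is a simplified single-group illustration of the idea, not the actual HQQ implementation; the `beta`, `p`, and iteration count are illustrative assumptions:

```python
import numpy as np

def shrink_lp(x, beta, p=0.7, eps=1e-8):
    # Closed-form proximal step for the l_p norm (generalized soft-threshold).
    return np.sign(x) * np.maximum(np.abs(x) - (p / beta) * (np.abs(x) + eps) ** (p - 1), 0.0)

def hqq_quantize(W, bits=2, iters=20, beta=10.0):
    # Alternate two closed-form updates: a shrinkage step on the quantization
    # error and a mean update of the zero-point -- no backprop needed.
    qmax = 2 ** bits - 1
    s = (W.max() - W.min()) / qmax                 # scale
    z = -W.min() / s                               # zero-point
    Wq = np.clip(np.round(W / s + z), 0, qmax)
    for _ in range(iters):
        We = shrink_lp(W - s * (Wq - z), beta)     # sparse error estimate
        z = np.mean(Wq - (W - We) / s)             # closed-form zero-point update
        Wq = np.clip(np.round(W / s + z), 0, qmax)
    return Wq, s, z

W = np.random.default_rng(0).normal(size=(64,))
Wq, s, z = hqq_quantize(W)
err = np.abs(W - s * (Wq - z)).mean()
print(f"mean abs error at 2-bit: {err:.3f}")
```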


Github repo should be visible now.

It is not distilling the model; it reduces the model weights on the fly and uses LoRA for training/fine-tuning. After the training phase, we explain how to merge the LoRA weights with the pruned weights to achieve faster inference speed.
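The merge step can be sketched as folding the adapter into the base weight, W' = W + (alpha/r) · B · A, so inference afterwards needs only a single matmul per layer. The shapes, `alpha`, and the zero-initialized `B` below are illustrative assumptions, not the blog's exact setup:

```python
import numpy as np

def merge_lora(W, A, B, alpha=16):
    # Fold the adapter into the base weight: W' = W + (alpha / r) * B @ A.
    r = A.shape[0]                      # LoRA rank
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(1)
d_out, d_in, r = 8, 8, 2
W = rng.normal(size=(d_out, d_in))      # pruned base weight
A = rng.normal(size=(r, d_in))          # LoRA down-projection
B = np.zeros((d_out, r))                # LoRA up-projection (zero-init)
merged = merge_lora(W, A, B)
assert np.allclose(merged, W)           # zero-init B: merging is a no-op
```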


I'm sharing a blog post https://mobiusml.github.io/low-rank-llama2/ on our approach to pruning the Llama2 model by leveraging low-rank structures.

In a nutshell, we've managed to reduce the model's parameter count by up to 50%, double the training speed, and increase inference speed by 1.25 times.

For those interested in the technical details or looking to replicate our results, the code is openly available for community use and contributions.
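One way to see where a ~50% reduction can come from: factoring a square d×d weight matrix through rank d/4 halves its parameter count, since two factors of shape (d, r) and (r, d) cost r(2d) parameters. A truncated-SVD sketch with an illustrative rank (the blog's actual factorization targets and ranks may differ):

```python
import numpy as np

def low_rank_factor(W, rank):
    # Truncated SVD: W (m x n) ~= U_r @ V_r with U_r (m x r), V_r (r x n).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

m = n = 1024
rank = 256                               # keep a quarter of the directions
W = np.random.default_rng(2).normal(size=(m, n))
U_r, V_r = low_rank_factor(W, rank)
orig, factored = m * n, rank * (m + n)
print(f"params: {orig} -> {factored} ({factored / orig:.0%})")
```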


In this particular case, why is testing/demonstrating on a diverse set of individuals a bad thing?

A personal anecdote is that a few years back the automatic door sensors in my university did not work on my skin tone.


Automatic door sensors are, in my experience, universally infrared. In fact, I don't think I've ever seen camera technology used in that context. Are you saying they used cameras to open the doors?


It's not a bad thing, I just dislike that I noticed it.


It's video conferencing software, so it makes sense that they might put together imagery suggesting people meeting from different corners of the planet. But sure, I get your point. I didn't notice this myself, but I have been living on the side of the planet opposite from where I was born for the past decade.


While I agree with the general principle, incentive structures are wired quite differently in academia versus end-consumer-oriented gig/service industries.

Publications (number, timing, venue, citations) are the primary currency by which one is judged by one's peers in academia, and reputation outside the immediate academic community carries much lower weight. For online marketplaces, solid revenue is the first priority, and reputation comes second (as a means to higher revenue). In academia it is the reverse, with reputation (within a small clique) being the primary motivator and funding being the means to gather it.

