Coding with Llama 3.1, New DeepSeek Coder and Mistral Large

anotherpaulg · on July 25, 2024

Five noteworthy models have been released in the last few days, with a wide range of code editing capabilities. Here are their results on aider’s code editing leaderboard, with Sonnet and GPT-3.5 included for scale.

  77% claude-3.5-sonnet
  73% DeepSeek Coder V2 0724
  66% llama-3.1-405b-instruct
  60% Mistral Large 2 (2407)
  59% llama-3.1-70b-instruct
  58% gpt-3.5-turbo-0301
  38% llama-3.1-8b-instruct

tarasglek · on July 25, 2024

Been experimenting with aider for a while. Really fun using it for basically free with deepseek. Thanks for getting that to work.

My question is about your whole-file fallback approach. Since you already have everything mapped out with treesitter, have you considered doing a function at a time instead? It's just as easy for a model to ask for a function as it is for a file.
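To make the function-at-a-time idea concrete, here is a rough sketch of what it could look like. It is not how aider works: it swaps in Python's standard `ast` module in place of treesitter so that it runs without extra dependencies, and the helper names (`find_function`, `get_function`, `replace_function`) and the `SAMPLE` source are made up for illustration.

```python
import ast

def find_function(source, name):
    """Return the (start, end) line span of the first function named `name`.

    The span is 0-based and end-exclusive, so it can be used directly as a
    slice over source.splitlines(). Returns None if nothing matches.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return node.lineno - 1, node.end_lineno
    return None

def get_function(source, name):
    """Return only the text of one function: what the model would be sent."""
    span = find_function(source, name)
    if span is None:
        return None
    lines = source.splitlines()
    return "\n".join(lines[span[0]:span[1]])

def replace_function(source, name, new_text):
    """Splice the model's rewritten function back into the full file."""
    span = find_function(source, name)
    if span is None:
        raise KeyError(name)
    lines = source.splitlines()
    lines[span[0]:span[1]] = new_text.splitlines()
    return "\n".join(lines) + "\n"

# Hypothetical file contents with a bug in add().
SAMPLE = '''\
def add(a, b):
    return a - b

def mul(a, b):
    return a * b
'''

if __name__ == "__main__":
    print(get_function(SAMPLE, "add"))
    print(replace_function(SAMPLE, "add", "def add(a, b):\n    return a + b"))
```

The model would only see and return the `add` function, and the rest of the file is left untouched. A real version would also have to deal with decorators, nested definitions, methods that share a name, and non-Python languages, which is where treesitter would come in.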

WiSaGaN · on July 25, 2024

Exciting! So many models dropped on the 24th. Really hard to know whether the numerical suffix means YYMM or MMDD!
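A quick way to see the ambiguity is to check which readings a four-digit suffix could plausibly have. This is only a rough sketch: `plausible_readings` is a made-up helper, and it only range-checks the month and day without validating real calendar dates.

```python
def plausible_readings(suffix):
    """Return which of 'YYMM' and 'MMDD' a four-digit suffix could be."""
    first, second = int(suffix[:2]), int(suffix[2:])
    readings = []
    # YYMM: any two-digit year, month must be 1..12.
    if 1 <= second <= 12:
        readings.append("YYMM")
    # MMDD: month must be 1..12, day roughly 1..31 (no per-month check).
    if 1 <= first <= 12 and 1 <= second <= 31:
        readings.append("MMDD")
    return readings

if __name__ == "__main__":
    # Suffixes taken from the model names in the list above.
    for suffix in ["0724", "2407", "0301"]:
        print(suffix, plausible_readings(suffix))
```

With the suffixes from the list above, `0724` only fits MMDD and `2407` only fits YYMM, but something like `0301` fits both, so the suffix alone does not always settle it.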

Workaccount2 · on July 25, 2024

Is there a reason people seem to exclude google models? I see lots of talk about llama, mistral, deepseek, and then ofc gpt and claude, but not much about gemma and gemini despite them benchmarking well.

anotherpaulg · on July 25, 2024

The main aider leaderboard includes gemini models. I benchmarked gemma-2-27b-it, but it did so badly that it wasn't worth adding to the leaderboard.

https://aider.chat/docs/leaderboards/

Lash-L · on July 26, 2024

Interesting - why do you think that is?

can-ai-code seems to have Gemma 2-27b as pretty capable - beating out some of the other models that perform really well on aider: https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

22c · on July 26, 2024

Not an expert, but I believe it's related to this difference:

> LLMs which are good at editing code, not just good at writing code

So it may be that Gemma can write new code, but is not particularly good at editing or refactoring code.