
It's 2025 and a code editor is now 600MB.


Agreed. For immersion at home, reading manga/webtoons with an OCR-translated layer and watching Japanese vlogs with dual subtitles has been effective for me.


Found this repo the other day in case that's useful: https://github.com/Xuanwo/acp-claude-code


Agreed. If there are provider-specific differences, we should be able to add a CLAUDE.md/GEMINI.md that takes precedence, but most of what goes in there is basic or general information and should apply to all providers (same for .cursor/, .windsurfrules/, ...)


Still fails to solve this one:

Prompt:

> Complete the following Python program:

```
len, print = print, len

def print_len(x):
    "Print the length of x"
```

gpt2-chatbot:

```
len, print = print, len # Swap the functionalities of print and len

def print_len(x):
    "Print the length of x"
    len(x)  # This 'len' is actually 'print' because of the swap above

print_len("Hello, world!") # This will output: 13
```


That's the same output as gpt-4-turbo for me, so no real improvement for that particular prompt.


That's not bad though!


Have you tried replicating via the API with a temp of 0?


No, I did not.


Using a temp of zero usually returns garbage results from most models, so it would likely do so in the case of GPT-4 as well. Any other great ideas?


The point isn't that temp 0 should be used, the point is that anyone surprised that they get different results should realise that there is an element of randomness involved by default.

Even repeating the same question in a single chat can have GPT-4 vary its output, though it will often settle on a particular answer due to the context informing the output (which is why adding context is so important for these models).
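
If you want to pin that down as much as possible, something like this sets the temperature explicitly. A minimal sketch using the current OpenAI Python client; the model name and prompt are just illustrative, and it assumes OPENAI_API_KEY is set in the environment:

```
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4-turbo",    # illustrative model name
    temperature=0,          # least random sampling; still not guaranteed deterministic
    messages=[{"role": "user", "content": "Complete the following Python program: ..."}],
)
print(resp.choices[0].message.content)
```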


Temp of 0 gives the least random and most predictable results


That's true, but those results are rarely the correct ones, at least for v1 Llama models. In my experience each model has an optimal temperature at which it performs vastly better. I'm sure OpenAI have the best config they know set up for ChatGPT, but let people generate trash through the API if they want to waste their credits on it.


Why would the accuracy decrease with lower temperature? Setting temperature to 0 just means at each step the model will emit the token with the highest likelihood.
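
As a toy illustration of what the temperature does to the logits (not any model's actual decoding loop): as T shrinks the softmax collapses onto the argmax token, i.e. greedy decoding.

```
import numpy as np

def sample_token(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Toy sampler: temperature rescales logits before the softmax."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))        # greedy: highest-likelihood token
    scaled = logits / temperature            # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.5, 0.1]                     # made-up scores for three tokens
print(sample_token(logits, temperature=0))   # always token 0
print(sample_token(logits, temperature=1.0)) # occasionally token 1 or 2
```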


Yes, that's what I'm saying. To reiterate: the likeliest token does not lead to the highest-performing result; otherwise temperature wouldn't even be an option. I would imagine things like word frequency in the language affect the token probabilities a lot while having nothing to do with the task at hand beyond producing a correctly formatted answer, but that's probably not the whole story.

OpenAI (and others that know what they're doing) always do their benchmarks in a multi-sampled way, by running 5 or 20 times at optimal temp. Using a wrapper that runs these samples and then another pass that judges self-consistency for a final answer can give you a correct answer 100% of the time for a question that would be wrong 100% of the time with temp at zero.
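
The wrapper I mean is roughly this. A sketch only: `ask_model` is a hypothetical stand-in for whatever API call you use, and a plain majority vote stands in for the judging pass.

```
from collections import Counter

def self_consistent_answer(ask_model, prompt, n_samples=5, temperature=0.7):
    """Sample several answers at a non-zero temp, then keep the most consistent one."""
    answers = [ask_model(prompt, temperature=temperature) for _ in range(n_samples)]
    # Simplest possible "judge": majority vote over the sampled answers;
    # a second model pass scoring self-consistency could replace this.
    best, _count = Counter(answers).most_common(1)[0]
    return best
```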


I had a conversation with a friend about this exact question, and my understanding is that the model is trained to fit the distribution of all text. When you restrict it to deterministic sampling, which is not representative of that distribution, you select a slice of the distribution the model learned, one that conveys much less information than the full distribution and hence gives poorer results.


Not in my experience; in fact, I find that when I need precise, realistic, and reliable results, temp 0 is needed. For example: here is a bunch of names, gather the names of specific plastics under headings matching their common acronym. If I don't use temp 0 I might get nonsense out. Temp 0? Reliably correct.


Interesting, that's the exact opposite of my experience.


What do you mean? It works fine for me when I’ve tried it


Oh neat, thanks for sharing, wanted to add an interpreter to that test


I used the API from Together[0].

Thanks for sharing your results, they're indeed pretty different. I looked at the source again and I did append a "# " before every prompt sent to those 10 `code` models (during testing I thought that formatting the prompt as a Python comment might help them).

Will re-run the script without that to see if it matches your results.

[0] https://docs.together.ai/docs/models-inference#code-models
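
For context, the prefixing was roughly this. A hypothetical sketch: `complete` stands in for the actual Together API call and is not a real function here.

```
def run_code_models(complete, prompt, code_models):
    """Run the same prompt against each code model, prefixed as a Python comment."""
    results = {}
    for model in code_models:
        # The "# " prefix was the experiment: format the prompt as a Python comment.
        results[model] = complete(model=model, prompt="# " + prompt)
    return results
```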


Thanks. I haven’t put it online yet, but will try to clean it (removing API keys & all) tonight/tomorrow and publish it


:-) that's awesome. Thanks! Nice work on this.


Interesting, what about it makes you sleep better?

