
> With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

That's not how you would do it with Cerebras. The 44GB is SRAM, i.e. on-chip memory, not the HBM where you would store most of the params. For reference, one GB200 has only 126MB of SRAM. If you tried to estimate how many GB200s you would need for a 2TB model just by looking at the L2 cache size, you would get 16k GB200s, aka ~$600M, which is obviously way off.
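
To make the arithmetic behind both estimates concrete, here is a rough sketch. It only divides model size by on-chip SRAM, using the figures quoted in this thread (2TB model, 44GB per Cerebras chip, 126MB per GB200). The helper name is made up for illustration, and decimal units are an assumption.

```python
# Naive "everything must fit in SRAM" estimate, using only figures quoted in the thread.
# sram_only_chip_count is a hypothetical helper, not part of any vendor tooling.

MB = 10**6  # decimal units assumed throughout
GB = 10**9
TB = 10**12


def sram_only_chip_count(model_bytes: int, sram_bytes_per_chip: int) -> float:
    """Chips needed if every parameter had to live in on-chip SRAM."""
    return model_bytes / sram_bytes_per_chip


model_bytes = 2 * TB
cerebras_chips = sram_only_chip_count(model_bytes, 44 * GB)
gb200_chips = sram_only_chip_count(model_bytes, 126 * MB)

print(f"Cerebras (44GB SRAM): ~{cerebras_chips:.0f} chips")
print(f"GB200 (126MB SRAM):   ~{gb200_chips:,.0f} chips")
```

The same division that gives ~45 Cerebras chips gives roughly 16k GB200s, which is why counting SRAM alone is not a useful sizing method for either system.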

Cerebras uses a different architecture than Nvidia: the HBM is not packaged directly with the chips. It is handled by a separate system, so you can scale memory and compute separately. Specifically, you can use something like MemoryX to act as your HBM, connected to the chip's SRAM over a high-speed interconnect, see [1]. I'm not at all an expert in Cerebras, but IIRC you can connect up to like 2PB of memory to a single Cerebras chip, so almost 1000x the FP16 model.

[1]: https://www.cerebras.ai/blog/announcing-the-cerebras-archite...



> That's not how you would do it with Cerebras. 44GB is SRAM, so on chip memory, not HBM memory where you would store most of the params. For reference one GB200 has only 126MB of SRAM, if you tried to estimate how many GB200 you would need for a 2TB model just by looking at the L2 cache size you would get 16k GB200 aka ~600M$, obviously way off.

Yes but Cerebras achieves its speed by using SRAM.


There is no way not to use SRAM on a GPU/Cerebras/most accelerators. This is where the cores fetch their data from.

But that doesn’t mean you are only using SRAM; that would be impractical. It would be like using a CPU by storing everything in the L3 cache and never going to RAM. Unless I am missing something from the original link, I don’t know how you got to the conclusion that they only used SRAM.


> Just like using a CPU just by storing stuff in the L3 cache and never going to the RAM. Unless I am missing something from the original link, I don’t know how you got to the conclusion that they only used SRAM.

That's exactly how Graphcore's current chips work, and I wouldn't be surprised if that's how Cerebras's wafer works. It's probably even harder for Cerebras to use DRAM because each chip in the wafer is "landlocked" and doesn't have an easy way to access the outside world. You could go up or down, but down is used for power input and up is used for cooling.

You're right, it's not a good way to do things for memory-hungry models like LLMs, but all of these chips were designed before it became obvious that LLMs are where the money is. Graphcore's next chip (if they are even still working on it) can access a mountain of DRAM with very high bandwidth. I imagine Cerebras will be working on that too. I wouldn't be surprised if they abandon WSI entirely due to needing to use DRAM.


I know Groq chips load the entire model into SRAM. That's why it can be so fast.

So if Cerebras uses HBM to store the model but streams weights into SRAM, I really don't see the long-term advantage over smaller chips like GB200, since both architectures use HBM.

The whole point of having a wafer chip is that you limit the need to reach out to external parts for memory since that's the slow part.


> I really don't see the advantage long term over smaller chips like GB200 since both architectures use HBM.

I don’t think you can look at those things in binary terms. 44GB of SRAM is still a massive amount. You don’t need infinite SRAM to get better performance. There is a reason Nvidia increases the L2 cache size with every generation: if having a bit more really changed nothing, they would just stick with 32MB. The more SRAM you have, the more you are able to mask communication behind computation. You can imagine with 44GB being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).
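
A toy timing model of that overlap, as a sketch only: it assumes every layer has the same compute time and the same weight-load time, and that exactly one layer can be prefetched ahead. The function name and the numbers are invented for illustration, not measurements of any real system.

```python
# Toy model of hiding weight loads behind compute (double buffering).
# forward_pass_time is a hypothetical helper; all timings are made-up inputs.


def forward_pass_time(n_layers: int, compute_s: float, load_s: float, prefetch: bool) -> float:
    """Time for one forward pass over n_layers identical layers."""
    if not prefetch:
        # Load layer i, then compute layer i, strictly one after the other.
        return n_layers * (load_s + compute_s)
    # The first load cannot be hidden. After that, loading layer i+1 runs
    # concurrently with computing layer i, so each step costs the slower of the two.
    return load_s + (n_layers - 1) * max(compute_s, load_s) + compute_s


# Load faster than compute: the memory traffic is almost entirely hidden.
print(forward_pass_time(4, compute_s=2.0, load_s=1.0, prefetch=False))
print(forward_pass_time(4, compute_s=2.0, load_s=1.0, prefetch=True))

# Load slower than compute: the pass becomes bound by the load time.
print(forward_pass_time(4, compute_s=1.0, load_s=3.0, prefetch=True))
```

In this model the penalty only disappears when loading a layer takes no longer than computing it; otherwise the pass runs at the speed of the memory link.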


> You can imagine with 44GB being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).

You would have to have an insanely fast bus to prevent I/O stalls with this. With a 235B fp16 model you’d be streaming 470GB of data every graph execution. To do that at 1000tok/s, you’d need a bus that can deliver a sustained ~500 TB/s. Even if you do a 32-wide MoE model, that’s still about 15 TB/s of bandwidth you’d need from the HBM to avoid stalls at 1000tok/s.
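
For anyone who wants to check those numbers, a back-of-the-envelope sketch. It assumes batch size 1, that every active weight is read exactly once per generated token, and that "32-wide MoE" means 1/32 of the weights are active per token. The function name is made up, and none of these are confirmed details of the actual deployment.

```python
# Sustained bandwidth needed to stream weights once per generated token.
# weight_stream_bandwidth is a hypothetical helper for this estimate only.

TB = 10**12   # decimal terabyte
TIB = 2**40   # binary tebibyte


def weight_stream_bandwidth(params: float, bytes_per_param: int,
                            tokens_per_s: float, active_fraction: float = 1.0) -> float:
    """Bytes per second if each active weight is read once per token (batch size 1)."""
    return params * bytes_per_param * active_fraction * tokens_per_s


dense = weight_stream_bandwidth(235e9, 2, 1000)
moe = weight_stream_bandwidth(235e9, 2, 1000, active_fraction=1 / 32)

print(f"dense:    {dense / TB:.0f} TB/s ({dense / TIB:.0f} TiB/s)")
print(f"1/32 MoE: {moe / TB:.1f} TB/s ({moe / TIB:.1f} TiB/s)")
```

Batching several requests together would amortize each weight read over more tokens, so the per-user figure here is the worst case under these assumptions.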

It would seem like this either isn’t fp16 or this is indeed likely running completely out of SRAM.

Of course, Cerebras doesn’t use a dense representation, so these memory numbers could be way off and maybe that is an SRAM+DRAM combo.


> I don’t know how you got to the conclusion that they only used SRAM.

Because they are doing 1,500 tokens per second.


What are the bandwidth and latency of MemoryX? Those are the key parameters for inference.


Well, comparing MemoryX to H100 HBM3, the key details are that MemoryX has lower latency, but also far lower bandwidth. However, memory on Cerebras scales a lot further than on Nvidia. You need a cluster of H100s to create a model, as that is the only way to scale the memory; Cerebras is more suited to that aspect. Nvidia does its scaling in tooling, while Cerebras does it in design, via its silicon approach.

That's my take on it all. The two systems are apples and oranges, and there aren't many comparisons to work from, even for rolling them down the same slope.


No way an off-chip HBM has the same or better bandwidth than on-chip.


> MemoryX has lower latency, but also far lower bandwidth


Yeah sure, but if you do that you are heavily dropping the token/s for a single user. The only way to recover from that is continuous batching. This could still be interesting if the KV caches of all users fit in SRAM though.


> but if you do that you are heavily dropping the token/s for a single user.

I don’t follow what you are saying and what “that” is specifically. Assuming it refers to using HBM and not just SRAM, this is not optional on a GPU: SRAM is many orders of magnitude too small. Data is constantly flowing between HBM and SRAM by design, and to get data in or out of your GPU you have to go through HBM first. You can’t skip that.

And while it is quite massive on a Cerebras system it is also still too small for very large models.



