The Ivy Bridge and Haswell BTB (xania.org)
79 points by ingve on Feb 23, 2016 | 10 comments


Cool article, but if I may say, rainbow graphs are hard for me (and many other people) to parse.

http://betterfigures.org/2014/11/18/end-of-the-rainbow/

https://eagereyes.org/basics/rainbow-color-map

Not that it makes the article significantly weaker, but still.


Thanks for the feedback! I must say graphs and whatnot are not my forte - I spent nearly two evenings trying to get them half-decent. Patches welcome, of course :)


"Why Should Engineers and Scientists Be Worried About Color?"

http://www.research.ibm.com/people/l/lloydt/color/color.HTM


That was an interesting and insightful article about CPU branch target buffers. Learned something new about Haswell and Intel BTBs in general.

Haswell's branch prediction seems to be pretty nice. I bet that makes quite a bit of difference in branchy code. No need to care as much about the arcane branching rules of the past.

Haswell+ Xeons will likely be good with branchy code, especially at avoiding those old pathological cases. Predicting 4096 branch targets with such accuracy is very good.


Yes, Haswell's branch prediction is so good that it obsoleted the previous "best strategies" for fast interpreter loops.

See https://hal.inria.fr/hal-01100647/document

But unfortunately a lot of real code is so big these days that it thrashes all the caches, including the branch prediction structures.

Big is slow.
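
For anyone curious, here's a minimal sketch (mine, not from the paper; the opcodes and bytecode are invented) of the two dispatch styles involved: a single switch, i.e. one shared indirect branch, versus "threaded" dispatch with the indirect jump replicated into every handler. Replicating the dispatch branch was the classic trick to give older predictors per-opcode history; the point above is that Haswell narrows the gap. Needs GCC or Clang for computed gotos.

    /* Two classic interpreter dispatch styles. Opcodes and handlers are made up. */
    #include <stddef.h>
    #include <stdint.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    /* Style 1: one switch = a single shared indirect branch for all opcodes. */
    static int64_t run_switch(const uint8_t *code) {
        int64_t acc = 0;
        for (size_t pc = 0;; pc++) {
            switch (code[pc]) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Style 2: threaded dispatch, indirect jump replicated into each handler
       (GCC/Clang computed gotos). */
    static int64_t run_threaded(const uint8_t *code) {
        static void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
        int64_t acc = 0;
        size_t pc = 0;
        goto *labels[code[pc]];
    do_inc:  acc++; pc++; goto *labels[code[pc]];
    do_dec:  acc--; pc++; goto *labels[code[pc]];
    do_halt: return acc;
    }

    int main(void) {
        const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        return (int)(run_switch(prog) + run_threaded(prog));  /* both return 1 */
    }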


There are just 512 cache lines in each of the L1 code and L1 data caches.

512 * 64 = 32 kB. 512 is 8 * 8 * 8; I'd guess that if they added more L1 cache lines (up to 4096), access would take one clock longer, which would very likely be a net performance loss.

Maybe it's time to increase the cache line size to 128 bytes. That could of course hurt a lot of old code performance-wise: it'd cause more false sharing [1] and double the memory bandwidth requirements for random access. Maybe we should start to organize shared data with 128-byte alignment.

It'd also break some read-bandwidth-saving code that assumes 64-byte streaming (non-temporal) stores eliminate the need for RFO [2] access.

Writing a single byte to RAM would then require the CPU core to first read a 128-byte cache line (RFO [2]) and eventually write that cache line back. Currently "only" 64 bytes need to be read and written in the same scenario.
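
To make that concrete, a minimal sketch of the streaming-store trick (mine, assuming x86 with SSE2; the function names, buffer and sizes are made up): filling whole 64-byte lines with non-temporal stores lets the core skip the RFO read, while a plain fill reads every line before overwriting it. With 128-byte lines, four 16-byte streaming stores would only cover half a line, so the RFO could no longer be skipped - which is the breakage described above.

    /* Regular stores vs. non-temporal (streaming) stores. Assumes x86 with SSE2
       and a 64-byte-aligned buffer, so each group of four 16-byte stores covers
       exactly one cache line. */
    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void fill_plain(uint8_t *buf, size_t len) {
        /* regular stores: each touched line is RFO-read before being overwritten
           (note: libc memset may itself switch to NT stores for huge buffers) */
        memset(buf, 0, len);
    }

    void fill_streaming(uint8_t *buf, size_t len) {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i + 64 <= len; i += 64) {
            /* four 16-byte non-temporal stores = one full 64-byte cache line,
               so the core can skip the RFO read for that line */
            _mm_stream_si128((__m128i *)(buf + i +  0), zero);
            _mm_stream_si128((__m128i *)(buf + i + 16), zero);
            _mm_stream_si128((__m128i *)(buf + i + 32), zero);
            _mm_stream_si128((__m128i *)(buf + i + 48), zero);
        }
        _mm_sfence();  /* order the NT stores before later loads/stores */
    }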

128-byte cache lines would mean the L1C and L1D caches are both 64 kB, up from the current 32 kB. That'd definitely help with monstrous codebases.

CPU design is full of compromises...

[1]: Atomic ops unintentionally touching the same cache line cause performance loss through false sharing. The effect can be very significant, up to 2 orders of magnitude. https://en.wikipedia.org/wiki/False_sharing

[2]: Read for Ownership. https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownersh...
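
To make the false-sharing point in [1] concrete, a self-contained sketch (mine; the struct names, iteration count and layout are arbitrary): two threads increment two counters that either share a 64-byte line or sit on separate lines. On typical multi-core hardware the packed version is several times slower.

    /* False-sharing demo: two threads hammer two counters. In `packed` both
       counters share one 64-byte line, so the line ping-pongs between cores;
       in `padded` each counter gets its own line. Build: cc -O2 -pthread */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL

    struct padded { _Alignas(64) atomic_long a; _Alignas(64) atomic_long b; };
    struct packed { atomic_long a; atomic_long b; };  /* a and b share a line */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void *bump(void *p) {
        atomic_long *ctr = p;
        for (unsigned long i = 0; i < ITERS; i++)
            atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);
        return NULL;
    }

    static void run(atomic_long *a, atomic_long *b, const char *name) {
        pthread_t t1, t2;
        double t0 = now();
        pthread_create(&t1, NULL, bump, a);
        pthread_create(&t2, NULL, bump, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%s: %.2f s\n", name, now() - t0);
    }

    int main(void) {
        static struct padded pad;                   /* separate cache lines */
        static _Alignas(64) struct packed pack;     /* whole struct in one line */
        run(&pad.a,  &pad.b,  "padded (no false sharing)");
        run(&pack.a, &pack.b, "packed (false sharing)   ");
        return 0;
    }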


The L1 cache could use a different line size than everything else. If only part of a line is valid, trigger an abort slightly later in the pipeline.

But if the limit is in how far you can transfer data, rather than in multiplexing, you won't be able to add any more bytes even if you stay at 512 lines.


For DRAM reads, to my knowledge you can't read only 64 kB anyway. You get much larger pages from DRAM.


> For DRAM reads, to my knowledge you can't read only 64 kB anyway. You get much larger pages from DRAM.

I didn't say 64 kB; I said 64 bytes, which is the read and store transaction size.

I think DRAM page sizes are 2^n * 512 bytes, where n is a small non-negative integer. Typical DRAM page sizes are 512 bytes, 1 kB or 2 kB.

So, with interleaved memory channels the DRAM page changes every (DRAM page size) * (number of memory channels) bytes.

I think the typical page-switch interval for a laptop with 2 memory channels is every 1 kB (DDR4 minimum), 2 kB (DDR3 minimum, probably typical) or 4 kB.

DRAM page sizes are irrelevant in this context, though.

I was talking about the smallest possible cached DRAM transaction (read or store), which is the same as the cache line size: 64 bytes.

Simplified:

Read 1 byte from memory and the CPU will fetch 64 bytes.

Write 1 byte to memory and the CPU will first fetch 64 bytes (RFO), modify the cache line, and eventually, when the cache line is evicted (which can be quite a while later), write the 64 bytes back.
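
A tiny sketch of that point (mine; the buffer size and strides are arbitrary): both loops below touch the same number of cache lines, so the DRAM traffic (one 64-byte RFO read plus one 64-byte writeback per touched line) is roughly the same even though the second loop writes 64x fewer bytes. Measure each loop with perf or similar to see it.

    /* Writing 1 byte per cache line generates about the same memory traffic as
       writing every byte: each touched line is RFO-read and later written back
       in full. Buffer is 64 MiB so it doesn't fit in any cache. */
    #include <stdint.h>
    #include <stdlib.h>

    #define N (64u * 1024u * 1024u)

    int main(void) {
        volatile uint8_t *buf = malloc(N);   /* volatile: keep the stores */
        if (!buf) return 1;
        for (size_t i = 0; i < N; i++)      buf[i] = 1;  /* write all 64 bytes per line */
        for (size_t i = 0; i < N; i += 64)  buf[i] = 2;  /* write 1 byte per line       */
        free((void *)buf);
        return 0;
    }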


DRAM pages are indeed larger; but a single bus transaction does not have to read the whole DRAM page. Tons more info here: https://www.akkadia.org/drepper/cpumemory.pdf



