Hacker News

Not sure if that’s relevant, but when I do micro-benchmarks like that measuring time intervals way smaller than 1 second, I use __rdtsc() compiler intrinsic instead of standard library functions.

On all modern processors, that instruction measures wallclock time with a counter that increments at the CPU's base frequency, unaffected by dynamic frequency scaling.

Apart from the great resolution, this method has the upside of being very cheap: a couple of orders of magnitude faster than an OS kernel call.



Isn't gettimeofday implemented with vDSO to avoid kernel context switching (and therefore, most of the overhead)?

My understanding is that using tsc directly is tricky. The rate might not be constant, and the rate differs across cores. [1]

[1]: https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...


I think most current systems have an invariant TSC. I skimmed your article and was surprised to see an offset (though not totally shocked), but the rate looked the same.

You could CPU-pin the thread that's reading the TSC, except you can't pin threads on OpenBSD :p


But just to be clear (for others), you don't need to do that because using RDTSC/RDTSCP is exactly how gettimeofday and clock_gettime work these days, even on OpenBSD. Where using the TSC is practical and reliable, the optimization is already there.

OpenBSD actually only implemented this optimization relatively recently. Though most TSCs will be invariant, they still need to be trained across cores, and there are other minutiae (sleeping states?) that made it a PITA to implement in a reliable way, and OpenBSD doesn't have as much manpower as Linux. Some of those non-obvious issues would be relevant to someone trying to do this manually, unless they could rely on their specific hardware behavior.


Out of interest, does training across cores result in any residual offset? If so, is the offset nondeterministic?


I was curious myself, poked around, and found some references. But I'm still woefully incapable of answering that with any confidence and don't want to risk saying anything misleading, so here's the code and some other breadcrumbs:

1. Apparently OpenBSD gave up on trying to fix desync'd TSCs. See https://github.com/openbsd/src/commit/78156938567f79506a923c...

2. Relevant OpenBSD kernel code: https://github.com/openbsd/src/blob/master/sys/arch/amd64/am...

3. Relevant Linux kernel code: https://github.com/torvalds/linux/blob/master/arch/x86/kerne..., https://github.com/torvalds/linux/blob/master/arch/x86/kerne...

4. Linux kernel doc (out-of-date?): https://www.kernel.org/doc/Documentation/virtual/kvm/timekee...

5. Detailed SUSE blog post with many links: https://www.suse.com/c/cpu-isolation-nohz_full-troubleshooti...

6. Linux patch (uncommitted?) to attempt to directly sync TSCs: https://lkml.rescloud.iu.edu/2208.1/00313.html


Wizardly workarounds for broken APIs persist long after those APIs are fixed. People still avoid things like flock(2) because at one time NFS didn't handle file locking well. CLOCK_MONOTONIC_RAW is fine these days with the vDSO.


Sadly GPFS still doesn’t support flock(2), so I still avoid it.


Doesn't it? https://sambaxp.org/archive-data-samba/sxp09/SambaXP2009-DAT...

It would be weird, even for AIX, to support POSIX byte range locks and not the much simpler flock.


It doesn't, at least on the version I have access to, as it is configured on that cluster.

I’m using Linux rather than AIX.

fcntl(2) locks are supported (as long as they aren't OFD), but flock(2) locks don't work across nodes.


It was a while ago (2009-10ish) but I ran into an exceptionally interesting performance issue that was partly identified with RDTSC. For a course project in grad school I was measuring the effects of the Python GIL when running multi-threaded Python code on multi-core processors. I expected the overhead/lock contention to get worse as I added threads/cores but the performance fell off a cliff in a way that I hadn't expected. Great outcome for a course project, it made the presentation way more interesting.

The issue ended up being that my multi-threaded code when running on a single core pinned that core at 100% CPU usage, as expected, but when running it across 4 cores it was running 4 cores at 25% usage each. This resulted in the clock governor turning down the frequency on the cores from ~2GHz to 900MHz and causing the execution speed to drop even worse than just the expected lock contention. It was a fun mystery to dig into for a while.


If you have something newer than a Pentium 4, the rate will be constant.

I'm not sure of the details for when cores end up with different numbers.


TSC is about cycles consumed by a core. Not about actual time. And so for microbenchmarking, it actually makes sense, because you are often much more interested in CPU benchmarks than network benchmarks in microbenchmarking.


You have to benchmark tsc against a fixed CPU speed, say 1000Mhz, then you have a reliable comparison.


This does not account for frequency scaling on laptops, context switches, core migrations, time spent in syscalls (if you don’t want to count it), etc. On Linux, you can get the kernel to expose the real (non-“reference”) cycle counter for you to access with __rdpmc() (no syscall needed) and put the corrective offset in a memory-mapped page. See the example code under cap_user_rdpmc on the manpage for perf_event_open() [1] and NOTE WELL the -1 in rdpmc(idx-1) there (I definitely did not waste an hour on that).

If you want that on Windows, well, it’s possible, but you’re going to have to do it asynchronously from a different thread and also compute the offsets your own damn self[2].

Alternatively, on AMD processors only, starting with Zen 2, you can get the real cycle count with __aperf() or __rdpru(__RDPRU_APERF) or manual inline assembly depending on your compiler. (The official AMD docs will admonish you not to assign meaning to anything but the fraction APERF / MPERF in one place, but the conjunction of what they tell you in other places implies that MPERF must be the reference cycle count and APERF must be the real cycle count.) This is definitely less of a hassle, but in my experience the cap_user_rdpmc method on Linux is much less noisy.

[1] https://man7.org/linux/man-pages/man2/perf_event_open.2.html

[2] https://www.computerenhance.com/p/halloween-spooktacular-day...


> does not account for frequency scaling on laptops

Are you sure about that?

> time spent in syscalls (if you don’t want to count it)

The time spent in syscalls was the main objective the OP was measuring.

> cycle counter

While technically interesting, most of the time I do my micro-benchmarks I only care about wallclock time. Contrary to what you see in search engines and ChatGPT, the RDTSC instruction is not a cycle counter, it’s a high resolution wallclock timer. That instruction counted CPU cycles some 20 years ago; it doesn’t anymore.


>> does not account for frequency scaling on laptops

> Are you sure about that?

> [...] RDTSC instruction is not a cycle counter, it’s a high resolution wallclock timer [...]

So we are in agreement here: with RDTSC you’re not counting cycles, you’re counting seconds. (That’s what I meant by “does not account for frequency scaling”.) I guess there are legitimate reasons to do that, but I’ve found organizing an experimental setup for wall-clock measurements to be excruciatingly difficult: getting 10–20% differences depending on whether your window is open or AC is on, or on how long the rebuild of the benchmark executable took, is not a good time. In a microbenchmark, I’d argue that makes RDTSC the wrong tool even if it’s technically usable with enough work. In other situations, it might be the only tool you have, and then sure, go ahead and use it.

> The time spent in syscalls was the main objective the OP was measuring.

I mean, of course I’m not covering TFA’s use case when I’m only speaking about Linux and Windows, but if you do want to include time in syscalls on Linux that’s also only a flag away. (With a caveat for shared resources—you’re still not counting time in kswapd or interrupt handlers, of course.)


Cycles are often not what you're trying to measure with something like this. You care about whether the program has higher latency, higher inverse throughput, and other metrics denominated in wall-clock time.

Cycles are a fine thing to measure when trying to reason about pieces of an algorithm and estimate its cost (e.g., latency and throughput tables for assembly instructions are invaluable). They're also a fine thing to measure when frequency scaling is independent of the instructions being executed (since then you can perfectly predict which algorithm will be faster independent of the measurement noise).

That's not the world we live in though. Instructions cause frequency scaling -- some relatively directly (like a cost for switching into heavy avx512 paths on some architectures), some indirectly but predictably (physical limits on moving heat off the chip without cryo units), some indirectly but unpredictably (moving heat out of a laptop casing as you move between having it on your lap and somewhere else). If you just measure instruction counts, you ignore effects like the "faster" algorithm always throttling your CPU 2x because it's too hot.

One of the better use cases for something like RDTSC is when microbenchmarking a subcomponent of a larger algorithm. You take as your prior that no global state is going to affect performance (e.g., not overflowing the branch prediction cache), and then the purpose of the measurement is to compute the delta of your change in situ, measuring _only_ the bits that matter to increase the signal to noise.

In that world, I've never had the variance you describe be a problem. Computers are fast. Just bang a few billion things through your algorithm and compare the distributions. One might be faster on average. One might have better tail latency. Who knows which you'll prefer, but at least you know you actually measured the right thing.

For that matter, even a stddev of 80% isn't that bad. At $WORK we frequently benchmark the whole application even for changes which could be microbenchmarked. Why? It's easier. Variance doesn't matter if you just run the test longer.

You have a legitimate point in some cases. E.g., maybe a CLI tool does a heavy amount of work for O(1 second). Thermal throttling will never happen in the real world, but a sustained test would have throttling (and also different branch predictions and whatnot), so counting cycles is a reasonable proxy for the thing you actually care about.

I dunno; it's complicated.


Fascinating how each “standard” or intrinsic that gets added actually totally fails to give you the real numbers promised.


While __rdtsc() is fast, be cautious with multi-core benchmarks as TSC synchronization between cores isn't guaranteed on all hardware, especially older systems. Modern Intel/AMD CPUs have "invariant TSC" which helps, but it's worth checking CPU flags first.


Rdtsc is fast-ish, but it's still like 10 ns. Something to be aware of if you're trying to measure really small durations (measure 10s or 100s and amortize).


Invariant TSC has been around for over 15 years, probably more.


Successive rdtsc calls, even on the same CPU, are not guaranteed to be executed in the expected order by the CPU - [1].

1 - https://lore.kernel.org/all/da9e8bee-71c2-4a59-a865-3dd6c5c9...


> I use __rdtsc() compiler intrinsic

What do you do on ARM?


Read `cntvct_el0` with the `mrs` instruction. [1]

[1]: https://developer.arm.com/documentation/102379/0104/The-proc...



See manual page and changelog.

* https://jdebp.uk/Softwares/djbwares/guide/commands/clockspee...

* https://github.com/jdebp/djbwares/commit/8d2c20930c8700b1786...

Yes, 27 years later it now compiles on a non-Intel architecture. (-:


Just use gettimeofday/clock_gettime via vDSO.

  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
On arm64 it directly uses the cntvct_el0 register under the hood but with a standard/easy to use API instead of messing about with inline assembly. Also avoids a context switch because it's vDSO.


Use a Raspberry Pi or something.


I don't think it is (guaranteed to be) synchronized across cores. I might be wrong about that though.


rdtsc isn't available on all platforms, for what it's worth. It's sometimes disabled, as there's a CPU flag controlling its use from user space, and it's well known to not be so accurate.


What platforms disable rdtsc for userspace? What accuracy issues do you think it has?


rdtsc instruction access is gated by a permission bit. Sometimes it's allowed from userspace, sometimes it's not. There were issues with it in the past, I forget which off the top of my head.

It's also not as accurate as the High Precision Event Timer (HPET). I'm not sure which platforms gate/expose which these days but it's a grab bag.


Personally I'm not aware of any platform blocking rdtsc, so I was curious to learn which ones do.


> It's also not as accurate as the High Precision Event Timer (HPET)

This hasn't been true for about 10 years.


You're right, I was thinking about the interrupt precision over the default APIC timer.

My point about it being disabled on some platforms has historically been true, however.


I think you're confusing this and the kernel's blacklisting of the TSC for timekeeping if it is not synchronized across CPUs; but while there's a knob to block userspace's access to the TSC, I am not sure that has been used anywhere except for debugging reasons (e.g. record/replay).


They could've just used `clock_gettime(CLOCK_MONOTONIC)`



