shanemhansen · 2026-02-11T03:29:39 1770780579

I've thought of this quote a bunch and I came up with my own addon.

"Some people think that the magic of something wondrous is diminished when it's understood. I feel bad for those people." -- Shanemhansen

dullcrisp · 2026-02-11T04:03:16 1770782596

I pity the fool.

— Mr. T

pstuart · 2026-02-11T06:54:42 1770792882

"Magic is the inducement of awe." -- pstuart

shanemhansen · 2026-01-01T18:35:55 1767292555

Going to write a real banger of a paper called "latency numbers considered harmful is all you need" and watch my academic cred go through the roof.

AnonymousPlanet · 2026-01-01T22:02:45 1767304965

" ... with an Application to the Entscheidungsproblem"

shanemhansen · 2025-12-05T05:22:41 1764912161

> AV1 streaming sessions achieve VMAF scores¹ that are 4.3 points higher than AVC and 0.9 points higher than HEVC sessions. At the same time, AV1 sessions use one-third less bandwidth than both AVC and HEVC, resulting in 45% fewer buffering interruptions.

Just thought I'd extract the part I found interesting as a performance engineer.

slhck · 2025-12-05T12:21:27 1764937287

This VMAF comparison is to be taken with a grain of salt. Netflix's primary goal was to reduce bitrate consumption, as can be seen, while roughly keeping the same nominal quality of the stream. This means that, ignoring all other factors and limitations of H.264 with higher resolutions, VMAF scores for all their streaming sessions should roughly be the same, or in a comparable range, because that's what they're optimizing for. (See the Dynamic Optimizer Framework they publicly posted a few years ago.)

Still impressive numbers, of course.

shanemhansen · 2025-11-24T22:34:38 1764023678

This actually seems like a simple example of memory request vs limit.

Request the amount of memory needed to be healthy, you can potentially set the limit higher to account for "reclaimable cache".

Another way to approach it, if you find that there are too many limiting metrics to model things accurately, is to let the workers grab more segments until you determine that they are overloaded. For this to work, though, you ideally have some signal that the node is approaching saturation. For example: keep adding segments as long as the nth percentile response time is under some threshold.

The advantage of this approach is you don't necessarily have to know which resource (memory, filehandles, etc) is at capacity. You don't even necessarily have to have deep knowledge of linux memory management. You just have to be able to probe the system to determine if it's healthy.
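
A minimal sketch of that admission loop, assuming a hypothetical `probe_latencies_ms` callback that returns recent response times for a trial set of segments. The percentile and the 50 ms threshold are placeholders, not values from the system being discussed.

```python
import math
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def assign_segments(
    candidates: List[str],
    probe_latencies_ms: Callable[[List[str]], Sequence[float]],  # hypothetical health probe
    pct: float = 99.0,            # assumed percentile
    threshold_ms: float = 50.0,   # assumed latency threshold
) -> List[str]:
    """Take segments one at a time and stop at the first one that pushes
    the chosen percentile over the threshold."""
    owned: List[str] = []
    for segment in candidates:
        trial = owned + [segment]
        if percentile(probe_latencies_ms(trial), pct) > threshold_ms:
            break  # overloaded with this segment, keep the previous set
        owned = trial
    return owned
```

The loop never needs to know which resource ran out; it only needs a probe whose result degrades as the node saturates.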

You can even go backwards with a binary split mechanism. You bring up a node that owns [A-H] (8 segments in this case). If that fails, bring up 2 nodes that own [A-D] and [E-H]; if that fails, keep splitting, all the way down to one segment per node.
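
The binary split can be sketched recursively. `is_healthy` is a hypothetical probe that reports whether a node owning the given segments stays healthy. Unlike the description above, this variant splits each unhealthy node independently rather than halving every node at once.

```python
from typing import Callable, List, Sequence


def place_segments(
    segments: Sequence[str],
    is_healthy: Callable[[Sequence[str]], bool],  # hypothetical health probe
) -> List[List[str]]:
    """Return one segment list per node, halving any node that is unhealthy."""
    segments = list(segments)
    # A single segment cannot be split further, so it gets its own node regardless.
    if len(segments) <= 1 or is_healthy(segments):
        return [segments]
    mid = len(segments) // 2
    return place_segments(segments[:mid], is_healthy) + place_segments(segments[mid:], is_healthy)
```

With a probe that only tolerates two segments per node, `place_segments("ABCDEFGH", lambda s: len(s) <= 2)` ends up with four nodes of two segments each.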

man8alexd · 2025-11-25T09:18:59 1764062339

mmap'ed memory counts as that "reclaimable cache", which isn't always reclaimable (dirty or active pages are not immediately reclaimable). But Kubernetes memory accounting assumes that the page cache is always reclaimable. This creates a lot of surprises and unexpected OOMs. https://github.com/kubernetes/kubernetes/issues/43916

shanemhansen · 2025-11-24T22:12:40 1764022360

I'd think something like rump kernels is a closer analogue: https://en.wikipedia.org/wiki/Rump_kernel

spragl · 2025-11-25T11:15:04 1764069304

That sent me looking it up. It seems that NetBSD, as the only one, has a rump kernel, but it also looks like work on it stagnated around 10 years ago. That could be because the guy doing a thesis on them, moved on. There is quite some bitrot when following links. Do you know what happened? Were they a failure? Maybe they were surpassed by other OS architectures?

shanemhansen · 2025-11-11T15:09:35 1762873775

You nerd sniped me a little and I'll admit I'm not 100% sure what a reduction is but I've understood it to be a measurement of work for scheduling purposes.

A bit of googling indicates that you can actually use performance monitoring instrumentation to generate an interrupt every n instructions. https://community.intel.com/t5/Software-Tuning-Performance/H...

Which is part of the solution. Presumably the remainder of the solution is then deciding what to schedule next in a way that matches erlang.

Disclaimer: this is based off some googling that makes it seem like hardware support for the desired feature exists, not any actual working code.

jacquesm · 2025-11-12T07:45:03 1762933503

Oh that's a really neat find. I'm not sure how 'instructions' map to 'reductions' in the sense that if you stop when a reduction is completed the system is in a fairly well defined state so you can switch context quickly, but when you stop in mid reduction you may have to save a lot more state. The neat thing about the BEAM is that it is effectively a perfect match for Erlang and any tricks like that will almost certainly come with some kind of price tag attached. An interrupt is super expensive compared to a BEAM context switch to another thread of execution, you don't see the kernel at all, it is the perfect balance between cooperative and preemptive multitasking. You can pretend it is the second but under the hood it is the first, the end result is lightning fast context switches.

But: great find, I wasn't aware of this at all and it is definitely an intriguing possibility.

shanemhansen · 2025-10-30T21:05:08 1761858308

The unreasonable effectiveness of profiling and digging deep strikes again.

hinkley · 2025-10-31T20:58:31 1761944311

The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.

Negitivefrags · 2025-10-31T22:11:58 1761948718

I think the biggest tool is higher expectations. Most programmers really haven't come to grips with the idea that computers are fast.

If you see a database query that takes 1 hour to run, and only touches a few gb of data, you should be thinking "Well nvme bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
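
As a back-of-the-envelope check, the floor on scan time is just data size over bandwidth. The 4 GB and 3 GB/s figures below are assumptions standing in for "a few gb" and "multiple gigabytes per second".

```python
GB = 10**9


def scan_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read the data once at the given bandwidth."""
    return data_bytes / bandwidth_bytes_per_s


floor_s = scan_seconds(4 * GB, 3 * GB)  # assumed: 4 GB of data, 3 GB/s sequential read
observed_s = 3600.0                     # the 1 hour query from the example
slowdown = observed_s / floor_s         # how far the query is from the hardware floor
print(f"floor: {floor_s:.2f} s, observed/floor: {slowdown:.0f}x")
```

The ratio is the useful number: it says how much of the runtime is not explained by moving the bytes.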

The idea that anyone would accept a request to a website taking longer than 30ms, (the time it takes for a game to render its entire world including both the CPU and GPU parts at 60fps) is insane, and nobody should really accept it, but we commonly do.

azornathogron · 2025-10-31T23:00:06 1761951606

Pedantic nit: At 60 fps the per frame time is 16.66... ms, not 30 ms. Having said that a lot of games run at 30 fps, or run different parts of their logic at different frequencies, or do other tricks that mean there isn't exactly one FPS rate that the thing is running at.

Negitivefrags · 2025-10-31T23:20:03 1761952803

The CPU part happens on one frame, the GPU part happens on the next frame. If you want to talk about the total time for a game to render a frame, it needs to count two frames.
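
The two numbers in this exchange are both right for what they measure. A small sketch, assuming a simple two-stage pipeline where the CPU works on frame N+1 while the GPU renders frame N:

```python
def frame_time_ms(fps: float) -> float:
    """Time between presented frames (the throughput budget per stage)."""
    return 1000.0 / fps


def pipeline_latency_ms(fps: float, stages: int = 2) -> float:
    """Time for one frame to pass through every stage (CPU, then GPU)."""
    return stages * frame_time_ms(fps)


print(f"{frame_time_ms(60):.2f} ms per frame")             # per-stage budget at 60 fps
print(f"{pipeline_latency_ms(60):.2f} ms start to finish")  # CPU frame plus GPU frame
```

A new frame still appears every 1/60 s; it is the work for any single frame that spans two of those intervals.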

azornathogron · 2025-11-01T10:12:10 1761991930

If latency of input->visible effect is what you're talking about, then yes, that's a great point!

wizzwizz4 · 2025-11-01T00:34:59 1761957299

Computers are fast. Why do you accept a frame of lag? The average game for a PC from the 1980s ran with less lag than that. Super Mario Bros had less than a frame between controller input and character movement on the screen. (Technically, it could be more than a frame, but only if there were enough objects in play that the processor couldn't handle all the physics updates in time and missed the v-blank interval.)

Negitivefrags · 2025-11-01T00:58:22 1761958702

If Vsync is on, which was my assumption in my previous comment, then if your computer is fast enough, you might be able to run CPU and GPU work entirely in a single frame if you use Reflex to delay when simulation starts to lower latency, but regardless, you still have a total time budget of 1/30th of a second to do all your combined CPU and GPU work to get to 60fps.

mjevans · 2025-11-01T06:14:07 1761977647

30 ms for a website is a tough bar to clear considering the speed of light (or rather, electrons in copper / light in fiber)

https://en.wikipedia.org/wiki/Speed_of_light

Just as an example, round trip delay from where I rent to the local backbone is about 14 ms alone, and the average for a webserver is 53 ms. Just as a simple echo reply. (I picked it because I'd hoped that was in Redmond or some nearby datacenter, but it looks more likely to be in a cheaper labor area.)

However it's only the bloated ECMAScript (javascript) trash web of today that makes a website take longer than ~1 second to load on a modern PC. Plain old HTML, images on a reasonable diet, and some script elements only for interactive things can scream.

    mtr -bzw microsoft.com
    6. AS7922        be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)         0.0%    10   12.9  13.9  11.5  18.7   2.6
    7. AS7922        be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)           0.0%    10   11.8  13.3  10.6  17.2   2.4
    8. AS7922        2001:559:0:80::101e                                                    0.0%    10   15.2  20.7  10.7  60.0  17.3
    9. AS8075        ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)            0.0%    10   41.1  23.7  14.8  41.9  10.4
    10. AS8075        be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)                  0.0%    10   53.1  53.1  50.2  57.4   2.1
    11. AS8075        2603:1060:0:10::f536                                                   0.0%    10   82.1  55.7  50.5  82.1   9.7
    12. AS8075        2603:1060:0:10::f3b1                                                   0.0%    10   54.4  96.6  50.4 147.4  32.5
    13. AS8075        2603:1060:0:10::f51a                                                   0.0%    10   49.7  55.3  49.7  78.4   8.3
    14. AS8075        2a01:111:201:f200::d9d                                                 0.0%    10   52.7  53.2  50.2  58.1   2.7
    15. AS8075        2a01:111:2000:6::4a51                                                  0.0%    10   49.4  51.6  49.4  54.1   1.7
    20. AS8075        2603:1030:b:3::152                                                     0.0%    10   50.7  53.4  49.2  60.7   4.2
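
The propagation floor behind the speed-of-light point can be estimated directly. This sketch assumes light in fiber travels at roughly 2/3 of c and ignores routing detours, queuing, and server time, so real round trips are always worse.

```python
C_VACUUM_KM_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2.0 / 3.0      # assumed velocity factor for optical fiber


def min_rtt_ms(distance_km: float, velocity_factor: float = FIBER_FACTOR) -> float:
    """Best-case round-trip time over a straight fiber path of this length."""
    one_way_s = distance_km / (C_VACUUM_KM_S * velocity_factor)
    return 2.0 * one_way_s * 1000.0


def max_distance_km(rtt_budget_ms: float, velocity_factor: float = FIBER_FACTOR) -> float:
    """Farthest a server can be if the whole budget is spent on propagation."""
    return rtt_budget_ms / 1000.0 / 2.0 * C_VACUUM_KM_S * velocity_factor


print(f"{max_distance_km(30.0):.0f} km reachable within a 30 ms round trip")
```

Under these assumptions a 30 ms budget covers a bit under 3,000 km of fiber each way with nothing left over for the server, which is why a 30 ms target in practice implies serving from somewhere near the user.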

hinkley · 2025-11-01T18:08:14 1762020494

In the cloud era this gets a bit better but my last job I removed a single service that was adding 30ms to response time and replaced it with a consul lookup with a watch on it. It wasn’t even a big service. Same DC, very simple graph query with a very small response. You can burn through 30 ms without half trying.

javier2 · 2025-10-31T22:50:39 1761951039

It's also about cost. My game computer has 8 cores + 1 expensive GPU + 32GB RAM for me alone. We don't have that per customer.

oivey · 2025-10-31T23:15:34 1761952534

This is again a problem understanding that computers are fast. A toaster can run an old 3D game like Quake at hundreds of FPS. A website primarily displaying text should be way faster. The reasons websites often aren’t have nothing to do with the user’s computer.

paulryanrogers · 2025-10-31T23:57:54 1761955074

That's a dedicated toaster serving only one client. Websites usually aren't backed by bare metal per visitor.

oivey · 2025-11-01T00:28:07 1761956887

Right. I’m replying to someone talking about their personal computer.

Aeolun · 2025-11-01T02:55:55 1761965755

If your websites take less than 16ms to serve, you can serve 60 customers per second with that. So you sorta do have it per customer?

vlovich123 · 2025-11-01T04:16:50 1761970610

That’s per core assuming the 16ms is CPU bound activity (so 100 cores would serve 100 customers). If it’s I/O you can overlap a lot of customers since a single core could easily keep track of thousands of in flight requests.
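
Both cases can be put in numbers. A rough sketch: the CPU-bound case is cores divided by service time, and the I/O-bound case uses Little's law (in-flight requests = arrival rate × latency). The 5,000 req/s and 200 ms figures in the usage line are made up for illustration.

```python
def cpu_bound_rps(cores: int, service_ms: float) -> float:
    """Requests per second when each request occupies a core for service_ms."""
    return cores * 1000.0 / service_ms


def in_flight(arrival_rps: float, latency_ms: float) -> float:
    """Little's law: average number of requests in the system at once."""
    return arrival_rps * latency_ms / 1000.0


print(cpu_bound_rps(cores=1, service_ms=16.0))        # one core, 16 ms of CPU per request
print(in_flight(arrival_rps=5000.0, latency_ms=200.0))  # hypothetical I/O-bound load
```

The second number is the count of requests a core has to keep track of while they wait on I/O, not the amount of CPU work it has to do.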

OJFord · 2025-11-01T09:16:38 1761988598

With a latency of up to 984ms

javier2 · 2025-11-01T22:57:25 1762037845

I'm just saying that we don't have gaming PC specs per customer to chug that 7GB of data for every request in 30ms

avidiax · 2025-10-31T23:02:51 1761951771

It's also about revenue.

Uber could run the complete global rider/driver flow from a single server.

It doesn't, in part because all of those individual trips earn $1 or more each, so it's perfectly acceptable to the business to be more inefficient and use hundreds of servers for this task.

Similarly, a small website taking 150ms to render the page only matters if the lost productivity costs less than the engineering time to fix it, and even then, only makes sense if that engineering time isn't more productively used to add features or reliability.

hinkley · 2025-11-01T17:17:40 1762017460

Practically, you have to parcel out points of contention to a larger and larger team to stop them from spending 30 hours a week just coordinating for changes to the servers. So the servers divide to follow Conway’s Law, or the company goes bankrupt (why not both?).

Microservices try to fix that. But then you need bin packing so microservices beget kubernetes.

onethumb · 2025-11-01T07:09:55 1761980995

Uber could not run the complete global rider/driver flow from a single server.

avidiax · 2025-11-01T16:50:28 1762015828

I'm saying you can keep track of all the riders and drivers, matchmake, start/progress/complete trips, with a single server, for the entire world.

Billing, serving assets like map tiles, etc. not included.

Some key things to understand:

* The scale of Uber is not that high. A big city surely has < 10,000 drivers simultaneously, probably less than 1,000.

* The driver and rider phones participate in the state keeping. They send updates every 4 seconds, but they only have to be online to start a trip. Both mobiles cache a trip log that gets uploaded when network is available.

* Since driver/rider send updates every 4 seconds, and since you don't need to be online to continue or end a trip, you don't even need an active spare for the server. A hot spare can rebuild the world state in 4 seconds. State for a rider and driver is just a few bytes each for id, position and status.

* Since you'll have the rider and driver trip logs from their phones, you don't necessarily have to log the ride server side either. It's also OK to lose a little data on the server. You can use UDP.
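
The points above can be turned into a rough capacity estimate. Only the 4 second update interval comes from the list; the client count and record size below are hypothetical placeholders, not Uber figures.

```python
from typing import Dict


def world_state(active_clients: int, bytes_per_client: int, update_interval_s: float) -> Dict[str, float]:
    """Rough load for a single server tracking every active rider and driver."""
    updates_per_s = active_clients / update_interval_s
    return {
        "updates_per_s": updates_per_s,
        "state_bytes": active_clients * bytes_per_client,
        "ingress_bytes_per_s": updates_per_s * bytes_per_client,
    }


# Hypothetical inputs: 1,000,000 active phones, 32 bytes each for id, position and status.
estimate = world_state(active_clients=1_000_000, bytes_per_client=32, update_interval_s=4.0)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

Plugging in different guesses is the point of the exercise; the claim only holds if the resulting numbers stay within what one machine's memory and network interface can handle.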

Don't forget that in the olden times, all the taxis in a city like New York were dispatched by humans. All the police in the city were dispatched by humans. You can replace a building of dispatchers with a good server and mobile hardware working together.

hinkley · 2025-11-01T17:22:10 1762017730

You could envision a system that used one server per county and that’s 3k servers. Combine rural counties to get that down to 1000, and that’s probably fewer servers than Uber runs.

What the internet will tell me is that uber has 4500 distinct services, which is more services than there are counties in the US.

exe34 · 2025-11-01T09:33:39 1761989619

I believe the argument was that somebody competent could do it.

lazide · 2025-11-02T01:02:36 1762045356

The reality is that, no, that is not possible. If a single core can render and return a web page in 16ms, what do you do when you have a million requests/sec?

The reality is most of those requests (now) get mixed in with a firehose of traffic, and could be served much faster than 16ms if that is all that was going on. But it’s never all that is going on.

hinkley · 2025-10-31T23:34:42 1761953682

Lowered expectations come in part from people giving up on theirs. Accepting versus pushing back.

antonymoose · 2025-10-31T23:38:46 1761953926

I have high hopes and expectations, unfortunately my chain of command does not, and is often an immovable force.

hinkley · 2025-11-01T00:35:22 1761957322

This is a terrible time to tell someone to find a movable object in another part of the org or elsewhere. :/

I always liked Shaw’s “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.”

pdimitar · 2025-11-03T15:52:08 1762185128

The unreasonable man also gets scolded and later fired for rocking the boat.

That adage only applies to people with resources and connections, not the average programmer who can't afford to lose a job.

zahlman · 2025-10-31T22:01:38 1761948098

> The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.

> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.

It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.

hinkley · 2025-10-31T23:36:03 1761953763

Architecture, cache eviction, memory bandwidth, thermal throttling.

All of which have gotten perhaps an order of magnitude worse in the time since I started on this theory.

hinkley · 2025-11-01T08:58:08 1761987488

And Amdahl’s Law. Perf charts will complain about how much CPU you’re burning in the parallel parts of code and ignore that the bottleneck is down in 8% of the code that can’t be made concurrent.
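
For reference, Amdahl's Law with the 8% figure from this comment. A minimal sketch; the worker count in the usage line is arbitrary.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Overall speedup when only the non-serial part is spread over `workers`."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def amdahl_limit(serial_fraction: float) -> float:
    """Speedup ceiling as the number of workers goes to infinity."""
    return 1.0 / serial_fraction


print(f"{amdahl_speedup(0.08, 16):.2f}x on 16 workers")
print(f"{amdahl_limit(0.08):.1f}x ceiling with unlimited workers")
```

With 8% of the work serial, no amount of parallel hardware gets past 12.5x, which is why the serial slice can matter more than the parts that dominate a CPU chart.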

zahlman · 2025-11-01T15:53:14 1762012394

I meant architecture of the codebase, to be clear. (I'm sure that the increasing complexity of hardware architecture makes it harder to figure out how to write optimal code, but it isn't really degrading the performance of naive attempts, is it?)

hinkley · 2025-11-01T17:38:56 1762018736

The problem Windows had during its time of fame is the developers always had the fastest machines money could buy. That shortened the code-build-test cycle for them, but it also made it difficult for the developers to visualize how their code would run on normal hardware. Add the general lack of empathy inspired by their toxic corporate culture of “we are the best in the world” and it's small wonder that Windows 95 and 98 ran more and more dogshit on older hardware.

My first job out of college, I got handed the slowest machine they had. The app was already half done and was dogshit slow even with small data sets. I was embarrassed to think my name would be associated with it. The UI painted so slowly I could watch the individual lines paint on my screen.

My friend and I in college had made homework into a game of seeing who could make their homework assignment run faster or use less memory. Such as calculating the Fibonacci of 100, or 1000. So I just started applying those skills and learning new ones.

For weeks I evaluated improvements to the code by saying “one Mississippi, two Mississippi”. Then how many syllables I got through. Then the stopwatch function on my watch. No profilers, no benchmarking tools, just code review.

And that’s how my first specialization became optimization.

jesse__ · 2025-10-31T23:05:07 1761951907

Broadly agree.

I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.

hinkley · 2025-10-31T23:38:50 1761953930

Clinic.js tried and lost steam. I have a recollection of a profiler called JProfiler that represented space and time as a graph, but also a recollection they went under. And there is a company selling a product of that name that has been around since that time, but doesn’t quite look how I recalled and so I don’t know if I was mistaken about their demise or I’ve swapped product names in my brain. It was 20 years ago which is a long time for mush to happen.

The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.

never_inline · 2025-11-01T16:48:24 1762015704

what do you think about speedscope's sandwich view?

hinkley · 2025-11-01T17:55:15 1762019715

More of the same. JetBrains has an equivalent, though it seems to be broken at present. The sandwich keeps dragging you back to the flame graph. Call stack depth has value but width is harder for people to judge and it’s the wrong yardstick for many of the concerns I’ve mentioned in the rest of this thread.

The sandwich view hides invocation count, which is one of the biggest things you need to look at for that remaining 3x.

Also you need to think about budgets. Which is something game designers do and the rest of us ignore. Do I want 10% of overall processing time to be spent accessing reloadable config? Reporting stats? If the answer is no then we need to look at that, even if data retrieval is currently 40% of overall response time and we are trying to get from 2 seconds to 200 ms.

That means config and stats have a budget of 20ms each and you will never hit 200ms if someone doesn’t look at them. So you can pretend like they don’t exist until you get all the other tent poles chopped and then surprise pikachu face when you’ve already painted them into a corner with your other changes.
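
The budget arithmetic can be made explicit. The 200 ms target and the 10% shares come from the comment; the measured timings in the usage lines are invented for illustration.

```python
from typing import Dict, List


def budgets_ms(target_ms: float, share_pct: Dict[str, float]) -> Dict[str, float]:
    """Turn percentage shares of the target response time into per-component budgets."""
    return {name: target_ms * pct / 100.0 for name, pct in share_pct.items()}


def over_budget(measured_ms: Dict[str, float], budget_ms: Dict[str, float]) -> List[str]:
    """Components whose measured time exceeds their budget."""
    return [name for name, limit in budget_ms.items() if measured_ms.get(name, 0.0) > limit]


budget = budgets_ms(200.0, {"config": 10, "stats": 10})
measured = {"config": 45.0, "stats": 12.0}   # hypothetical measurements
print(budget)
print(over_budget(measured, budget))
```

Checking against the budget for the target, rather than the share of today's 2 second response, is what surfaces the small components before the big ones are fixed.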

When we have a lot of shit that all needs to get done, you want to get to transparency, look at the pile and figure out how to do it all effectively. Combine errands and spread the stressful bits out over time. None of the tools and none of the literature supports this exercise, and in fact most of the literature is actively hostile to this exercise. Which is why you should read a certain level of reproval or even contempt in my writing about optimization. It’s very much intended.

Most advice on writing fast code has not materially changed for a time period where the number of calculations we do has increased by 5 orders of magnitude. In every other domain, we re-evaluate our solutions at each order of magnitude. We have marched past ignorant and into insane at this point. We are broken and we have been broken for twenty years.

never_inline · 2025-11-02T08:02:48 1762070568

I would like to know where I can read more in depth about profiling and performance analysis techniques.

seg_lol · 2025-11-03T03:55:12 1762142112

Unreasonable effectiveness of looking.

shanemhansen · 2025-10-07T14:03:17 1759845797

Tcl was my first "general purpose" programming language (after TI-basic and Matlab).

When I started that job I didn't know the difference between Tcl and TCP. I spent a couple months studying Philip Greenspun's books. It also made me a better engineer because unlike PHP I couldn't just Google how to do basic web server stuff so I had to learn from first principles. That's how I ended up building my first asset minification pipeline that served the "$file.gz" if it existed with content-encoding: gzip.
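
The "$file.gz" trick looks roughly like this. This is a Python sketch of the idea, not the original Tcl; `pick_static_file` is a hypothetical helper, and it ignores q-values in Accept-Encoding for brevity.

```python
import os
from typing import Dict, Tuple


def pick_static_file(path: str, accept_encoding: str) -> Tuple[str, Dict[str, str]]:
    """Return (file to send, extra response headers) for a static asset request."""
    # Vary tells caches that the response depends on the request's Accept-Encoding.
    headers = {"Vary": "Accept-Encoding"}
    # Note: "gzip;q=0" would still count as accepted here; q-values are not parsed.
    encodings = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}
    gz_path = path + ".gz"
    if "gzip" in encodings and os.path.isfile(gz_path):
        headers["Content-Encoding"] = "gzip"
        return gz_path, headers
    return path, headers
```

The compression happens once at build time, so the server only does a file existence check per request.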

Nearly 20 years later and I'm basically an HTTP specialist (well, CDN/Ingress/mesh/proxy/web performance).

Tcl is still kind of neat in a hacky way (no other language I've run across regularly uses upvars so creatively).

Shout-out to ad_proc and aolserver.

pjmlp · 2025-10-08T08:17:52 1759911472

AOLServer was the inspiration to the product I worked on, during my first experience working at a dotcom startup.

We had something similar, however it would plug into Apache and IIS, was more configurable across several UNIXes and RDBMSs, and eventually even got an IDE coded in VB, for those folks not wanting to use the Emacs based tooling.

Eventually we also became a victim of the dotcom burst, however many of those ideas were the genesis of the OutSystems platform, then rebooted on top of .NET, and still going strong on the market nowadays.

shanemhansen · 2025-09-30T18:26:56 1759256816

The closeness of this syntax to graphviz dot is very interesting.

Having dgsh output a Graphviz file in dry-run mode would be a neat feature.

shanemhansen · 2025-08-21T14:52:49 1755787969

Fundamentally it's a programming language so all the normal ways of running it apply:

Use their library in your application to evaluate policies.

Run it from the cli.

Embed it in some service like nginx.

The language itself is pretty focused on some prolog-ish describing of what constitutes an allow/deny decision.