After being a Java programmer for the greater part of a decade, then spending a bunch of years doing Objective-C/Swift programming, I don't really understand why the Automatic Reference Counting approach hasn't won out over all the others. In my opinion it's basically the best of all possible worlds:
1. Completely deterministic, so no worrying about pauses or endlessly tweaking GC settings.
2. Manual ref counting is a pain, but again, ARC basically makes all that tedious and error-prone bookkeeping go away. As a developer I felt like I had to think about memory management in Obj-C about as much as I did in garbage-collected languages, namely very little.
3. True, you need to worry about stuff like ref cycles, but in my experience I had to worry about potential memory leaks in Obj-C about as much as I did in Java. E.g. I've seen Java programs crash with OOMs because weak references weren't used when they were needed, and I've seen similar in Obj-C.
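To make the comparison concrete, here's a minimal Swift sketch (class names are made up) of the usual parent/child cycle and the weak back-reference that fixes it - essentially the same shape as reaching for WeakReference in Java:

    class Parent {
        var children: [Child] = []
        deinit { print("Parent freed") }
    }

    class Child {
        // A strong `parent` here would form a retain cycle and leak both objects;
        // `weak` breaks the cycle, much like a WeakReference would in Java.
        weak var parent: Parent?
        deinit { print("Child freed") }
    }

    var p: Parent? = Parent()
    let c = Child()
    c.parent = p
    p?.children.append(c)
    p = nil   // Parent is freed right here; with a strong back-reference both would leak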
On the contrary, as others have pointed out, automatic reference counting seems to be the worst of all worlds.
1. Automatic reference counting adds a barrier every time you pass around an object, even if only for reading, unlike both manual memory management and tracing GC.
2. ARC still requires thought about ownership, since you have to ensure there are no cycles in the ownership graph unlike tracing/copying/compacting GC.
3. ARC still means that you can't look at a piece of code and tell how much work it will do, since you never know if a particular piece of code is holding the last reference to a piece of data and must free the entire object graph, unlike manual memory management. In the presence of concurrency, cleanup is actually non-deterministic, as 'last one to drop its reference' is a race (see the sketch below).
4. ARC doesn't solve the problem of memory fragmentation, unlike arena allocators and compacting/copying GC.
5. ARC requires expensive allocations, it can't use cheap bump-pointer allocation like copying/compacting GC can. This is related to the previous point.
6. ARC cleanup still takes time proportional to the number of unreachable objects, unlike tracing GC (proportional to number of live objects) or arena allocation (constant time).
Reference counting is a valid strategy in certain pieces of manually memory managed code (particularly in single-threaded contexts), but using it universally is almost always much worse than tracing/copying/compacting GC.
Note that there are (soft) realtime tracing GCs, but this can't be achieved with ARC.
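To make point 3 concrete, here's a rough Swift sketch (class and queue names are made up). Both references are only used for reading, yet whichever thread happens to drop the last one pays for tearing down everything reachable from it:

    import Dispatch

    final class Chunk {
        let data = [UInt8](repeating: 0, count: 1 << 16)   // 64 KiB each
    }

    final class Graph {
        let chunks = (0..<1_000).map { _ in Chunk() }       // ~64 MB reachable from one reference
    }

    func demo() {
        var graph: Graph? = Graph()
        let worker = DispatchQueue(label: "worker")

        let ref = graph!                 // a second strong reference, read-only use
        worker.async {
            _ = ref.chunks.count         // `ref` is released when this closure is destroyed
        }

        graph = nil                      // release on the calling thread
        // Whichever of those two releases happens to run last frees the Graph and
        // all 1,000 Chunks on that thread - a race decides who pays the cost.
    }

    demo()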
It depends on the implementation. In traditional ARC implementations these are all issues to a degree, but even then they tend toward much lower overhead with more predictable behavior. Though I agree tracing GCs can be really fast and better in many ways.
1. I'm guessing you mean lock-based implementations? There are several non-lock, non-atomics-based ARC designs that still handle threading safely, which means a refcount update is little more than a single integer operation.
2. True, but in many contexts this is easy to do and makes it easier to understand data flows. In other cases it's possible to use index-based references pretty readily, like in Rust (sketched below). Or add a cycle collector.
3. In theory you can't, but in practice it's often pretty easy to tell, at least in non-OOP languages. I use Nim with its ARC (1) on an RTOS, and this is really only a concern for large lists or large dynamic structures. It can be managed using the same patterns as RAII, where you call child functions that know they won't be the last ones holding the bag. You can also use the same trick as some Rust code does, where you pass the memory to another thread to dispose of (2).
4/5. It depends on the implementation, but you can use pools or arenas or other options. Nim provides an allocator algorithm (TLSF) with proven O(1) times and known fragmentation properties (3). Still, it's true that tracing GCs can make better use of short-lifetime arenas, though with ARC you get similar benefits using stack-based objects.
6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often. ARCs only need to touch a root object during usage and only walk the entire graph when deallocating.
Your last point isn't accurate, as you can use an appropriately designed ARC in a hard-realtime context. I've found it quite easy to do; granted, it takes a little bit of care, but any realtime system does. For items like interrupt handlers I ensure no memory is allocated or destroyed.
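Here's roughly what I mean by index-based references, as a hedged sketch (in Swift for concreteness; the types are made up). Links are plain integers into a pool, so back-references can't form a cycle for the reference counter to leak, and dropping the pool frees everything at once:

    struct NodeID { let raw: Int }

    struct Tree {
        private struct Node {
            var value: Int
            var parent: NodeID?          // plain index, not a counted reference
            var children: [NodeID] = []
        }
        private var pool: [Node] = []

        mutating func addNode(_ value: Int, parent: NodeID?) -> NodeID {
            let id = NodeID(raw: pool.count)
            pool.append(Node(value: value, parent: parent))
            if let p = parent { pool[p.raw].children.append(id) }
            return id
        }

        func value(of id: NodeID) -> Int { pool[id.raw].value }
    }

    var tree = Tree()
    let root = tree.addNode(1, parent: nil)
    _ = tree.addNode(2, parent: root)
    // Parent/child links go both ways, yet there are no weak references and
    // nothing for a cycle collector to do; dropping `tree` frees the whole pool.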
I mostly agree with all of your points, and the core message - it's all tradeoffs.
Just one nitpick though:
> 6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often.
This is not accurate: tracing GCs always start from the roots and only ever visit live objects. By definition, the objects that they free are not reachable from anywhere. "Full scans" typically come up because of the various optimization strategies tracing GCs use to avoid scanning all roots every time (e.g. generations, per-thread scans), which do still rely on occasionally doing a full scan (scanning all live objects in all generations, or in all threads).
There are a couple of methods, like deferred counting, differential counting, or biased references. They're usually not completely atomics-free, but they generally provide guaranteed constant overhead or tweak when or how the memory can be shared.
- Nim's ARC only allows one thread to manage a reference count at a time, but enables an isolated graph of memory to be moved to another thread. The concept is called `Isolate`, and is very similar to Rust's single owner of mutable references. There's still WIP to have the compiler automate the checks, but it's usable now (I used it with FreeRTOS's xQueue mechanism just fine). https://github.com/nim-lang/RFCs/issues/244
I think the M1 CPU lowers some of the ARC retain/release to silicon -- it doesn't remove the problem, but it does seem to reduce the relative overhead significantly.
Isn't ARC absolutely worst-case for most multi-threading patterns? You'll thrash objects between cores just when you reference them. Every object becomes a mutable object!
Reference counting is a form of garbage collection. Remember that reference counting and tracing garbage collection are simply two extremes of a continuum of approaches. Must-read paper: "A Unified Theory of Garbage Collection".
Can you explain these or point me to any links? I'd like to learn more.
> has bad cache behavior for multithreading
Why is this? Doesn't seem like that would be something inherent to ref counting.
> Also, not deterministic
I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?
Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.
Determinism: you reset a pointer variable. Once in a while, you're holding the last reference and now have to free the object. That takes more instructions and causes cache invalidation.
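A rough Swift sketch of that cache scenario (purely illustrative, and the optimizer may elide some of the retain/release pairs). Both workers only read the object, yet every pass still does an atomic write to the same refcount word, so that cache line bounces between cores:

    import Dispatch

    final class Table {
        let values = [Int](repeating: 42, count: 256)
    }

    let shared = Table()
    let group = DispatchGroup()

    for _ in 0..<2 {
        DispatchQueue.global().async(group: group) {
            for _ in 0..<1_000_000 {
                let local = shared      // retain: atomic increment of the shared count
                _ = local.values[0]     // the read itself needs no synchronisation
            }                           // release: atomic decrement of the same word
        }
    }
    group.wait()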
> Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.
Thank you! This was the response that made the cache issues clearest to me.
> Why is this? Doesn't seem like that would be something inherent to ref counting.
Once I know I can read the data, it usually doesn't matter that another thread is also reading it. Reference counting changes that because we both need to write to the count every time either of us takes or drops a reference to the data, and in the latter case we need to know what's happened on the other core, too. This means a lot more moving of changing data between processor cores.
> > Also, not deterministic
> I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?
That's my understanding, but is that "deterministic" as we mean it here? It's true that the same program state leads to the same behavior, but it's non-local program state, and it leads to you doing that work - potentially a lot of work, if you (e.g.) wind up freeing all of a huge graph structure - at relatively unpredictable places in your code.
There are good workarounds (free lists, etc.), but "blindly throw language-level reference counting at everything" isn't a silver bullet (or maybe even a good idea) for getting low latency from your memory management.
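One such workaround, sketched in Swift (the helper name and queue choice are made up): hand the presumed-last reference to a background queue so the cascading release doesn't run on the latency-sensitive thread:

    import Dispatch

    final class BigGraph {
        var nodes: [[Int]] = []
    }

    // Hypothetical helper: the closure takes over what is assumed to be the last
    // reference, so the whole graph is torn down on the utility queue instead.
    func destroyLater(_ graph: BigGraph) {
        DispatchQueue.global(qos: .utility).async {
            _ = graph
        }
    }

    func latencySensitiveWork(_ graph: inout BigGraph?) {
        // ... time-critical work ...
        if let doomed = graph {
            graph = nil            // drop the caller's reference first
            destroyLater(doomed)   // cascading release now happens off this thread
        }
    }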
> Doesn't seem like that would be something inherent to ref counting.
It is inherently so. A reference count is an atomic mutable value that must be updatable by any thread.
A significant selling point of ARC compared to traditional ObjC reference counting was that the compiler could elide retain/release calls better than the programmer could, thus preventing a ton of stalls.
Writing to the refcount field even for read-only access dirties cache lines and causes cache-line ping-pong in the case of multi-core access, where you also need to use slower, synchronised refcount updates so as not to corrupt the count. Other GC strategies don't require dirtying cache lines when accessing the objects.
Determinism: the time taken is not deterministic because (1) malloc/free, which ARC uses but other GCs usually don't, are not deterministic - both can do arbitrary amounts of work, like coalescing or defragmenting allocation arenas, or performing system calls that reconfigure process virtual memory - and (2) cascading deallocations occur as objects hitting refcount 0 trigger refcount decrements and deallocation of other objects.
was what parent wrote (emphasis added), I assume referring to the problem that when an object is destroyed, an arbitrarily large number of other objects -- a subset of the first object's members, recursively -- may need to be destroyed as a consequence.
Because (mainstream) refcounting GCs are just slower than modern tracing GCs. GC pauses are virtually gone these days (Java gives you <1ms maximum pause for heaps up to 16TB), and modern tracing GCs are actually more deterministic than refcounting.
But RC GC is not deterministic either. All you know is that deallocation will be triggered when some reference is cleared. You don't know which one or when, and you don't know how much work it will do. With modern tracing GCs, though, there are no more pauses, just a mostly constant and small CPU tax, paid in the background.
The only significant cost tracing has these days is in memory footprint.
> The only significant cost tracing has these days is in memory footprint.
And that's not insignificant. The top-of-the-line Pixel 6 Pro has twice as much RAM as the top-of-the-line iPhone 13 Pro. Maybe the Android camp is just more eager to play the specs game, but I've long speculated that iOS simply needs less RAM because of its refusal to use tracing GC.
I can tell you I'll come fix your plumbing between 10:00 and 14:00, and it will take between 30 minutes and two hours, or that I'll make a delivery at 9:45.
The destruction of a graph of several billion nodes and about five times as many edges took several seconds in my C++ program. I constructed it in main(), and the delay between a message printed just before the "return 0;" statement and the program actually exiting was quite long.
"Determministic" is a double edged sword. You get deterministic allocation and release times, but they might be bigger than what really is achievable.