After being a Java programmer for the greater part of a decade, then spending a bunch of years doing Objective-C/Swift programming, I don't really understand why the Automatic Reference Counting approach hasn't won out over all the others. In my opinion it's basically the best of all possible worlds:
1. Completely deterministic, so no worrying about pauses or endlessly tweaking GC settings.
2. Manual ref counting is a pain, but again, ARC basically makes all that tedious and error-prone bookkeeping go away. As a developer I felt like I had to think about memory management in Obj-C about as much as I did in garbage-collected languages, namely very little.
3. True, you need to worry about stuff like ref cycles, but in my experience I had to worry about potential memory leaks in Obj-C about as much as I did in Java. E.g. I've seen Java programs crash with OOMs because weak references weren't used when they were needed, and I've seen similar in Obj-C.
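To make the comparison concrete, here's a minimal Swift sketch (class names are made up) of the usual parent/child cycle and the weak back-reference that fixes it - essentially the same shape as reaching for WeakReference in Java:

    class Parent {
        var children: [Child] = []
        deinit { print("Parent freed") }
    }

    class Child {
        // A strong `parent` here would form a retain cycle and leak both objects;
        // `weak` breaks the cycle, much like a WeakReference would in Java.
        weak var parent: Parent?
        deinit { print("Child freed") }
    }

    var p: Parent? = Parent()
    let c = Child()
    c.parent = p
    p?.children.append(c)
    p = nil   // Parent is freed right here; with a strong back-reference both would leak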
On the contrary, as others have pointed out, automatic reference counting seems to be the worst of all worlds.
1. Automatic reference counting adds a barrier every time you pass around an object, even if only for reading, unlike both manual memory management and tracing GC.
2. ARC still requires thought about ownership, since you have to ensure there are no cycles in the ownership graph unlike tracing/copying/compacting GC.
3. ARC still means that you can't look at a piece of code and tell how much work it will do, since you never know if a particular piece of code is holding the last reference to a piece of data and must free the entire object graph, unlike manual memory management. In the presence of concurrency, cleanup is actually non-deterministic, as 'last one to drop its reference' is a race (see the sketch below).
4. ARC doesn't solve the problem of memory fragmentation, unlike arena allocators and compacting/copying GC.
5. ARC requires expensive allocations, it can't use cheap bump-pointer allocation like copying/compacting GC can. This is related to the previous point.
6. ARC cleanup still takes time proportional to the number of unreachable objects, unlike tracing GC (proportional to number of live objects) or arena allocation (constant time).
Reference counting is a valid strategy in certain pieces of manually memory managed code (particularly in single-threaded contexts), but using it universally is almost always much worse than tracing/copying/compacting GC.
Note that there are (soft) realtime tracing GCs, but this can't be achieved with ARC.
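To make point 3 concrete, here's a rough Swift sketch (class and queue names are made up). Both references are only used for reading, yet whichever thread happens to drop the last one pays for tearing down everything reachable from it:

    import Dispatch

    final class Chunk {
        let data = [UInt8](repeating: 0, count: 1 << 16)   // 64 KiB each
    }

    final class Graph {
        let chunks = (0..<1_000).map { _ in Chunk() }       // ~64 MB reachable from one reference
    }

    func demo() {
        var graph: Graph? = Graph()
        let worker = DispatchQueue(label: "worker")

        let ref = graph!                 // a second strong reference, read-only use
        worker.async {
            _ = ref.chunks.count         // `ref` is released when this closure is destroyed
        }

        graph = nil                      // release on the calling thread
        // Whichever of those two releases happens to run last frees the Graph and
        // all 1,000 Chunks on that thread - a race decides who pays the cost.
    }

    demo()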
It depends on the implementation. In traditional ARC implementations these are all issues to a degree, but even then they tend toward much lower overhead with more predictable behavior. Though I agree tracing GCs can be really fast and better in many ways.
1. I'm guessing you mean lock-based implementations? There are several non-lock, non-atomics-based ARC designs that still handle threading safely, which means a refcount update is little more than a single integer operation.
2. True, but in many contexts this is easy to do and makes it easier to understand data flows. In other cases it's possible to use index-based references pretty readily, like in Rust (sketched below). Or add a cycle collector.
3. In theory you can't, but in practice it's often pretty easy to tell, at least in non-OOP languages. I use Nim with its ARC (1) on an RTOS, and this is really only a concern for large lists or large dynamic structures. It can be managed using the same patterns as RAII, where you call child functions that know they won't be the last ones holding the bag. You can also use the same trick as some Rust code does, where you pass the memory to another thread to dispose of (2).
4/5. It depends on the implementation, but you can use pools or arenas or other options. Nim provides an allocator algorithm (TLSF) with proven O(1) times and known fragmentation properties (3). Still, it's true that tracing GCs can make better use of short-lifetime arenas, though with ARC you get similar benefits using stack-based objects.
6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often. ARCs only need to touch a root object during usage and only walk the entire graph when deallocating.
Your last point isn't accurate, as you can use an appropriately designed ARC in a hard-realtime context. I've found it quite easy to do; granted, it takes a little bit of care, but any realtime system does. For items like interrupt handlers I ensure no memory is allocated or destroyed.
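Here's roughly what I mean by index-based references, as a hedged sketch (in Swift for concreteness; the types are made up). Links are plain integers into a pool, so back-references can't form a cycle for the reference counter to leak, and dropping the pool frees everything at once:

    struct NodeID { let raw: Int }

    struct Tree {
        private struct Node {
            var value: Int
            var parent: NodeID?          // plain index, not a counted reference
            var children: [NodeID] = []
        }
        private var pool: [Node] = []

        mutating func addNode(_ value: Int, parent: NodeID?) -> NodeID {
            let id = NodeID(raw: pool.count)
            pool.append(Node(value: value, parent: parent))
            if let p = parent { pool[p.raw].children.append(id) }
            return id
        }

        func value(of id: NodeID) -> Int { pool[id.raw].value }
    }

    var tree = Tree()
    let root = tree.addNode(1, parent: nil)
    _ = tree.addNode(2, parent: root)
    // Parent/child links go both ways, yet there are no weak references and
    // nothing for a cycle collector to do; dropping `tree` frees the whole pool.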
I mostly agree with all of your points, and the core message - it's all tradeoffs.
Just one nitpick though:
> 6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often.
This is not accurate: tracing GCs always start from the roots and only ever visit live objects. By definition, the objects that they free are not reachable from anywhere. "Full scans" typically come up because of the various optimization strategies tracing GCs use to avoid scanning all roots every time (e.g. generations, per-thread scans), which do still rely on occasionally doing a full scan (scanning all live objects in all generations, or in all threads).
There are a couple of methods, like deferred counting, differential counting, or biased references. They're usually not completely atomics-free, but they generally provide guaranteed constant overhead or tweak when or how the memory can be shared.
- Nim's ARC only allows one thread to manage a reference count at a time, but enables an isolated graph of memory to be moved to another thread. The concept is called `Isolate`, and is very similar to Rust's single owner of mutable references. There's still WIP to have the compiler automate the checks, but it's usable now (I used it with FreeRTOS's xQueue mechanism just fine). https://github.com/nim-lang/RFCs/issues/244
I think the M1 CPU lowers some of the ARC retain/release to silicon -- it doesn't remove the problem, but it does seem to reduce the relative overhead significantly.
Isn't ARC absolutely worst-case for most multi-threading patterns? You'll thrash objects between cores just when you reference them. Every object becomes a mutable object!
Reference counting is a form of garbage collection. Remember that reference counting and tracing garbage collection are simply two extremes of a continuum of approaches. Must-read paper: "A Unified Theory of Garbage Collection".
Can you explain these or point me to any links? I'd like to learn more.
> has bad cache behavior for multithreading
Why is this? Doesn't seem like that would be something inherent to ref counting.
> Also, not deterministic
I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?
Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.
Determinism: you reset a pointer variable. Once in a while, you're holding the last reference and now have to free the object. That takes more instructions and causes cache invalidation.
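A rough Swift sketch of that cache scenario (purely illustrative, and the optimizer may elide some of the retain/release pairs). Both workers only read the object, yet every pass still does an atomic write to the same refcount word, so that cache line bounces between cores:

    import Dispatch

    final class Table {
        let values = [Int](repeating: 42, count: 256)
    }

    let shared = Table()
    let group = DispatchGroup()

    for _ in 0..<2 {
        DispatchQueue.global().async(group: group) {
            for _ in 0..<1_000_000 {
                let local = shared      // retain: atomic increment of the shared count
                _ = local.values[0]     // the read itself needs no synchronisation
            }                           // release: atomic decrement of the same word
        }
    }
    group.wait()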
> Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.
Thank you! This was the response that made the cache issues clearest to me.
> Why is this? Doesn't seem like that would be something inherent to ref counting.
Once I know I can read the data, it usually doesn't matter that another thread is also reading it. Reference counting changes that because we both need to write to the count every time either of us takes or drops a reference to the data, and in the latter case we need to know what's happened on the other core, too. This means a lot more moving of changing data between processor cores.
> > Also, not deterministic
> I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?
That's my understanding, but is that "deterministic" as we mean it here? It's true that the same program state leads to the same behavior, but it's non-local program state, and it leads to you doing that work - potentially a lot of work, if you (e.g.) wind up freeing all of a huge graph structure - at relatively unpredictable places in your code.
There are good workarounds (free lists, etc.), but "blindly throw language-level reference counting at everything" isn't a silver bullet (or maybe even a good idea) for getting low latency from your memory management.
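One such workaround, sketched in Swift (the helper name and queue choice are made up): hand the presumed-last reference to a background queue so the cascading release doesn't run on the latency-sensitive thread:

    import Dispatch

    final class BigGraph {
        var nodes: [[Int]] = []
    }

    // Hypothetical helper: the closure takes over what is assumed to be the last
    // reference, so the whole graph is torn down on the utility queue instead.
    func destroyLater(_ graph: BigGraph) {
        DispatchQueue.global(qos: .utility).async {
            _ = graph
        }
    }

    func latencySensitiveWork(_ graph: inout BigGraph?) {
        // ... time-critical work ...
        if let doomed = graph {
            graph = nil            // drop the caller's reference first
            destroyLater(doomed)   // cascading release now happens off this thread
        }
    }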
> Doesn't seem like that would be something inherent to ref counting.
It is inherently so. A reference count is an atomic mutable value that must be updatable by any thread.
A significant selling point of ARC compared to traditional ObjC reference counting was that the compiler could elide retain/release calls better than the programmer could, thus preventing a ton of stalls.
Writing to the refcount field even for read-only access dirties cache lines and causes cache-line ping-pong in the case of multi-core access, where you also need to use slower, synchronised refcount updates so as not to corrupt the count. Other GC strategies don't require dirtying cache lines when accessing the objects.
Determinism: the time taken is not deterministic because (1) malloc/free, which ARC uses but other GCs usually don't, are not deterministic - both can do arbitrary amounts of work, like coalescing or defragmenting allocation arenas, or performing system calls that reconfigure process virtual memory - and (2) cascading deallocations occur as objects hitting refcount 0 trigger refcount decrements and deallocation of other objects.
was what parent wrote (emphasis added), I assume referring to the problem that when an object is destroyed, an arbitrarily large number of other objects -- a subset of the first object's members, recursively -- may need to be destroyed as a consequence.
Because (mainstream) refcounting GCs are just slower than modern tracing GCs. GC pauses are virtually gone these days (Java gives you <1ms maximum pause for heaps up to 16TB), and modern tracing GCs are actually more deterministic than refcounting.
But RC GC is not deterministic either. All you know is that deallocation will be triggered when some reference is cleared. You don't know which one or when, and you don't know how much work it will do. With modern tracing GCs, though, there are no more pauses, just a mostly constant and small CPU tax, paid in the background.
The only significant cost tracing has these days is in memory footprint.
> The only significant cost tracing has these days is in memory footprint.
And that's not insignificant. The top-of-the-line Pixel 6 Pro has twice as much RAM as the top-of-the-line iPhone 13 Pro. Maybe the Android camp is just more eager to play the specs game, but I've long speculated that iOS simply needs less RAM because of its refusal to use tracing GC.
I can tell you I'll come fix your plumbing between 10:00 and 14:00, and it will take between 30 minutes and two hours, or that I'll make a delivery at 9:45.
The destruction of a graph of several billion nodes and about five times as many edges took several seconds in my C++ program. I constructed it in main(), and the delay between a message printed just before the "return 0;" statement and the program actually exiting was quite long.
"Determministic" is a double edged sword. You get deterministic allocation and release times, but they might be bigger than what really is achievable.