Yeah, I'm arguing that the situation you describe (accurately) is better than baked-in sizes all over the source code.
Use the platform-native types (whatever size they may be) unless you have a reason not to; when you do have a reason, use <stdint.h> or an analogue. If CHAR_BIT is 13, let char be 13 bits: the platform probably chose that for a reason. When you have to pack it into a TCP header, do your strict fixed-width stuff there.
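To sketch what I mean (the struct and the 4-byte "header" layout here are made up for illustration, not any real protocol):

    #include <stddef.h>
    #include <stdint.h>

    /* Internal code: platform-native types, whatever width they happen to be. */
    struct connection {
        int    retries;      /* plain int is fine here */
        size_t bytes_sent;
    };

    /* Wire boundary: strict fixed-width types, packed into an explicit
       byte layout (this 4-byte header is hypothetical). */
    static void pack_header(unsigned char out[4], uint16_t port, uint16_t len)
    {
        out[0] = (unsigned char)(port >> 8);   /* big-endian / network byte order */
        out[1] = (unsigned char)(port & 0xFF);
        out[2] = (unsigned char)(len >> 8);
        out[3] = (unsigned char)(len & 0xFF);
    }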
I completely agree with you when it comes to function return types.
For data structure fields, especially structs used many, many times, like in very large arrays, I'd say it's sometimes worth using fixed-size types to get better control over memory use. Using a 64-bit int for a field a 16-bit integer can handle will use 4x as much memory. And if you've got ten or a hundred million structs of that type, it really adds up.
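As a rough sketch (the exact sizes depend on the ABI, but on a typical x86_64 system it works out like this):

    #include <stdint.h>
    #include <stdio.h>

    struct wide   { int64_t count; };   /* 8 bytes per element */
    struct narrow { int16_t count; };   /* 2 bytes per element */

    int main(void)
    {
        /* One field: 8 vs 2 bytes, i.e. 4x the memory per array element. */
        printf("%zu vs %zu bytes\n", sizeof(struct wide), sizeof(struct narrow));
        return 0;
    }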
For example, size_t is 64-bit on my x86_64 system. But modern x86_64 systems only use 48-bit virtual address spaces, so an object whose size actually needed all 64 bits couldn't even be addressed. Even worse, my CPU and motherboard max out at 32 GB of RAM (an effective 35-bit physical addressing limit). size_t is supposed to be able to store the size of any object in memory, but on this platform it can represent sizes that could never fit in memory, so most of those 64 bits are wasted on modern systems. For files you can use off_t if you're on POSIX, but the C standard doesn't require size_t to be able to store the size of any file in a filesystem.
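To make the mismatch concrete, a trivial check (the values in the comment are what I'd expect on a typical x86_64 box, not anything the standard guarantees):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* On a typical x86_64 system this prints 8 and 18446744073709551615,
           even though only 48 bits of virtual address space are usable. */
        printf("sizeof(size_t) = %zu\n", sizeof(size_t));
        printf("SIZE_MAX       = %zu\n", (size_t)SIZE_MAX);
        return 0;
    }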
Just using size_t is not enough to make your code actually work correctly on 64-bit sized quantities either. Say your strlen implementation returns size_t, and you use size_t everywhere you do anything with a string. Can your application really handle strings bigger than the system RAM, or the hardware address space? Are your algorithms even efficient enough to handle the 4,294,967,295-byte maximum string size on a 32-bit system?
The effort to get your program to work efficiently on >32-bit quantities is often much harder than just using size_t instead of int.
So to me, when I see 64-bit size_ts being used everywhere in code that won't actually be able to handle working with >32-bit quantities, it just feels a little useless. Of course this is really more a complaint about how big size_t is on the x86_64 platform than it is a complaint about the idea of size_t in general. If only we had 48-bit size_ts (a 24-bit one would be handy too!).
If you wrote an application that's as efficient as possible, with no wasted bits in its size_t type, then it only works on your machine.
If I wanted to run such an application on my supercomputer with 2TB of RAM (such machines exist), I would then have to recompile for a 41-bit size_t.
We use machine-neutral (but architecture-specific) size_t for these kinds of things explicitly to avoid recompiling on different machines that are instances of the same platform.
Said another way, binary distributions could not exist if everything was made efficient for the underlying hardware. It would stink to have to recompile the world after upgrading RAM.
I'd rather have a few bits of wasted space (which are typically lost anyway due to struct padding) than lose intra-platform compatibility.
A 48-bit size_t like I suggested could address up to 256TB of RAM. All modern x86_64 CPUs are limited to 48 bits of address space, so you're not losing any portability here.
Also consider my strlen example. Say you compute strlen by iterating through the whole string until you find a 0, then you return a size_t for the number of bytes you iterated through. That operation is O(n) in the length of the string. If you used that strlen on a string whose length doesn't fit into a 32-bit integer, say a 1TB string, the call would take so long that it would be useless. To be efficient, you'd probably have to redesign your program to handle 1TB strings specially, with different algorithms or some kind of indexing. Returning a 64-bit integer type does not mean the function can actually handle working with 64-bit sized quantities. So if you have a lot of data structures where you keep a string length, why store it as a 64-bit size_t when your application would be completely unable to handle strings of that size without keeling over?
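Concretely, the kind of strlen I'm talking about is just a linear scan, something like this (a sketch, not a tuned libc implementation):

    #include <stddef.h>

    size_t my_strlen(const char *s)
    {
        const char *p = s;
        /* Walk byte by byte until the terminating 0. On a 1TB string that's
           about a trillion iterations: the 64-bit return type doesn't make
           the scan any more practical. */
        while (*p)
            p++;
        return (size_t)(p - s);
    }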
Of course you wouldn't really use a 48-bit size_t, because x86_64 cpus don't work well with 48-bit quantities.
Even a hundred million integers still adds up to only 300 MB extra for 64- vs. 16-bit. On any reasonable modern server, laptop, or desktop, that kind of memory usage will probably be the least of your worries, all the more if you really have an application that needs to hold hundreds of millions of ints in memory at the same time.
And if you are programming for a very specific embedded or otherwise constrained system, then you want full control over the exact sizes of your types anyway, as discussed elsewhere here.
Is this "wasting resources", as you say? Probably yes. Is it worth the extra development effort to fine-tune that on modern machines? Probably not - and it might even be premature optimization. (Yes - I agree there are corner case where it indeed will make sense, but those are the exception, not the norm.)
I'm not sure where you are getting the 300 MB figure from. ((64 - 16) / 8) x 10^8 / 2^20 gives me 572.2 MB of wasted space. What's even more interesting is looking at the percentage of wasted space: (((64 - 16) / 8) x 10^8 / 2^20) / ((64 / 8) x 10^8 / 2^20) means a whopping 75% of our program's memory use is completely useless wasted space.
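Spelling the arithmetic out:

    wasted per int:   (64 - 16) / 8  = 6 bytes
    total wasted:     6 bytes x 10^8 = 600,000,000 bytes ~= 572.2 MB
    wasted fraction:  (64 - 16) / 64 = 75%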
Using more space than you need will also impact performance. First there are the cache issues. A 64KB L1 cache can fit 32768 16-bit integers, but only 8192 64-bit integers. Other cache layers will also fit fewer 64-bit than 16-bit integers, causing 4x more hits to the slow RAM backing store. Hitting RAM is very slow in comparison with CPU operations, so this will make your program a lot slower.
There are also computational speed issues. Let's say your problem can be implemented using the AVX/AVX2 instructions. These instructions operate on multiple values at once, in parallel. The AVX registers are 256 bits, which means they can hold 16 16-bit integers at once. In comparison, they can only hold 4 64-bit integers at once. So there's another potential 4x improvement, although the cache problems are probably going to be a bigger issue in practice.
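For instance, with AVX2 intrinsics (just a sketch; I'm assuming the lengths are multiples of the vector width and skipping remainder handling):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Adds two arrays of 16-bit integers: 16 lanes per 256-bit operation. */
    void add_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi16(va, vb));
        }
    }

    /* Same thing with 64-bit integers: only 4 lanes per operation. */
    void add_i64(int64_t *dst, const int64_t *a, const int64_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi64(va, vb));
        }
    }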
Sorry for the 400-100=300MB mix-up, you are completely right, it should have been 800-200=600MB.
If you need to crunch all those numbers at once, then yes, your cache will become your bottleneck. But if you were, say, keeping counts on 100 million things, most of the time you won't be worried about that or about your RAM usage, since counts typically follow a power-law distribution. Contention to make sure you are tracking every count will probably become a far worse problem for your app than your ability to shuffle data to and from the CPUs. Only in the corner case where you crunch a matrix or vectors of that size at once will you start to worry. But as I said, I think the tensor-crunching use case is the exception, not the norm.