>NULL/nil is just one of many invalid memory addresses, and in practice most invalid memory addresses are not NULL.
I want a language that can check at compile/check time that none of my pointers will have invalid addresses. Recognizing null as just another invalid value makes it more obvious to me that I want a language to handle it differently than how C does it.
In memory-safe languages, it's already unthinkable (or at least very rare outside of situations where you're deliberately doing unsafe/native code integrations) to get non-null invalid pointers that point to a different type of thing than you want. When was the last time you had a Java program crash because a typed reference actually referred to a different type at runtime somehow? Isn't it great how that basically never happens? -- But if you consider nulls a different type of reference, then it does actually happen sometimes. It would be great if we could close off that issue too.
I'm a huge fan of languages with non-nullable references as the default, like Kotlin, TypeScript, and Rust. In my experience, it's much easier as a programmer to understand how a codebase uses nulls when it's written in a language with non-nullable references by default, and so null-related issues, where a future programmer passes a nullable value somewhere it shouldn't go, happen much less often.
Right, but another option is to eschew pointer arithmetic. Iterators in many languages address many usages of pointer arithmetic, and can be designed to compile to the same sort of code that C using pointer arithmetic compiles to.
C wouldn't ever be able to take out pointer arithmetic, so I worry anyone envisioning a new language as some diff from C is probably going to get stuck on that sort of thing too. I'm a big fan of the original referenced article by Eevee for bringing this sort of thing up.
Pointers and pointer arithmetic are a physical reality. If you forbid them you are necessarily limiting yourself. It's a tradeoff.
Iterators may work only very locally, or require garbage collection, for example.
Static option types add complexity to the type system and require many typecasts in practice (which can be made safe with runtime checks and panics, but they are a hassle).
The problem with pointer arithmetic in C is that it's the default. Every pointer implicitly supports it, even though the vast majority of them point to a single object of a given type, and so it doesn't really make sense for them. So it's much easier to get an invalid pointer than it ought to be.
The obvious fix, as seen in e.g. Zig, is to have different types for pointers to objects and pointers to arrays. But once you have that, you can relegate the latter to the "unsafe" opt-in subset of the language.
I've worked on a hobby programming language where I've made that design decision, too. But it makes pointers less general. Single "Objects" are arrays of length 1, but not in the type system. And in practice it often happens that I want to treat an object as an array of length 1.
On the other hand I can't remember that I've ever given a pointer to an object that was treated as an array. I figure it happens about as often, and is about as easy or hard to debug, as swapped function call arguments. Which is pretty rare in my experience. And my philosophy is that making separate types for things that are structurally the same is usually a bad idea. Because it splits your world in two.
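To make that tradeoff concrete, here is a rough C++-flavoured sketch (names are entirely mine) of the two-pointer-types idea: the single-object pointer simply exposes no arithmetic, and treating one object as an array of length 1 becomes an explicit step:

    #include <cstddef>

    // Sketch only: a "pointer to exactly one T", with no arithmetic or indexing,
    // so you cannot accidentally walk off the object.
    template <typename T>
    struct single_ptr {
        T* p;
        T& operator*() const { return *p; }
        T* operator->() const { return p; }
    };

    // A "pointer to many T" that carries its length, i.e. essentially a slice.
    template <typename T>
    struct many_ptr {
        T* p;
        std::size_t len;
        T& operator[](std::size_t i) const { return p[i]; }  // a bounds check could go here
    };

    // The friction mentioned above: viewing one object as an array of length 1
    // requires an explicit conversion instead of just working.
    template <typename T>
    many_ptr<T> as_array_of_one(single_ptr<T> s) { return {s.p, 1}; }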
It's entirely possible for a compiler to reject all programs that it can't verify as safe at compile time, including for array bounds checking, including pointer arithmetic. It's true that the uncomputability of the halting problem means that the compiler must reject some safe programs to achieve this. Doing this in practice probably requires using dependent types.
If you consider the C specification beautiful it can only be because anything ugly is simply left unspecified. Large parts are simply missing. It only seems straightforward because so many of the weird architectures and platforms have failed and we're left with 32/64 Intel/Arm as the only ones most authors consider.
The C spec allows one's complement arithmetic, but how many programs would break horribly on such a machine?
I neither consider the C specification beautiful, nor do I think that large parts are missing. In fact, the specification goes to great lengths to allow implementations on a wide range of computers.
I will claim that most other languages specifications only seem so straightforward because they have a simpler execution model, where some of the things you can do in C are simply not possible, or where the spec is exclusively targeted at more modern machines. (And that's a good thing for most applications).
I think you're probably right. I think it may be difficult to design a dependent type system that ⓐ easily checks common pointer arithmetic idioms, ⓑ rejects programs predictably rather than unpredictably, and ⓒ upon rejecting a program, offers clear guidance on how to modify the program so that it can pass (perhaps at the cost of adding an impossible error case or a redundant run-time check). But I'm no expert, so I could be wrong about that.
However, C has quite a bit of ugliness in its own definition, and it might be possible to remove a lot of that.
I don't mean anything different. Right now the C specification is nearly 700 pages long. In the meanwhile you can easily define a dependent type system within two pages.
You can trivially initialize one to null by setting it to reference a dereference of a null pointer.
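Something like this, for concreteness (a deliberately broken sketch):

    int main() {
        int* p = nullptr;
        int& r = *p;   // binds a "null reference" through a dereference of a null pointer
        return r;      // typically only blows up here, when the reference is finally read
    }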
That's undefined behavior, but the big weakness of C is that undefined behavior nearly always compiles without even a warning.
(I was going to suggest that a dialect of C in which undefined behavior would never compile would be very useful, but then I remembered that it wouldn't be able to do integer arithmetic)
An additional problem is that C++ references will happily continue to reference an object that is deleted, and whose memory has been reallocated to an object of a different type.
Seems like an odd standard of what’s allowed to be called a weakness. If something in a language makes writing well-functioning code harder for a lot of people, then they are probably justified in calling it a weakness.
The horror stories about compilers optimizing away code based on statically detected UB are frightening, and largely the compiler vendors seem to be to blame.
But there's also a good amount of FUD in play. Personally I'm not sure I've ever encountered any problems with UB. Sure, I've had my share of NULL dereferences, but in practice they manifest as segfaults, which are akin to exceptions in other languages.
And while the list of UBs in C can still be considered a weakness, the fact that compilers usually don't issue a warning is not, because UB is a runtime thing, so compilers cannot do anything about it (in general, and typically). That is something that seems to be not well understood. And it was the point I was making.
Undefined behavior is not strictly limited to runtime. For example, defining a function called toilet is undefined behavior in C, but function names are hardly a runtime construct (except as far as dynamic linking is concerned).
> If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.
My best guess is that compilers simply cannot be expected to catch that problem statically. Think about the way the header files that come with the standard library interact with the program source code, for example. It's hard to know where exactly the use of a reserved identifier originated from, and which parts are considered "the implementation" (which is allowed to use reserved identifiers).
Messing with system headers might still result in a compilable translation unit, with unpredictable behaviour. Likewise, defining symbols with reserved identifiers might result in a linkable program (statically and/or dynamically), but again if you mess with the implementation the runtime behaviour is unpredictable.
I agree that there are cases where a naive implementation would have a hard time distinguishing implementation from program (and even in case of a non-naive implementation, it can be a line drawn in water), so this sort of thing might be warranted. Though that was just one example among many, and there are many instances of undefined behavior that are very much compile-time. For example:
> The same identifier has both internal and external linkage in the same translation unit
> Two declarations of the same object or function specify types that are not compatible
> An attempt is made to use the value of a void expression, or an implicit or explicit conversion (except to void) is applied to a void expression
> An unmatched ' or " character is encountered on a logical source line during tokenization
> Two identifiers differ only in nonsignificant characters
> The identifier __func__ is explicitly declared
> The characters ', \, ", //, or /* occur in the sequence between the < and > delimiters, or the characters ', \, //, or /* occur in the sequence between the " delimiters, in a header name preprocessing token
> An expression that is required to be an integer constant expression does not have an integer type
You get the idea. That's just what I gathered from quickly skimming the first screenful (out of about 4) in the list of UB in N1256.
If we used languages with 1-based arrays, then the first element would be 1, the last would be n, and find(x) could return 0 as an “item not found” marker instead of returning -1 or, worse, UINT_MAX. Reasoning about counting from the end would be one less off-by-one hazard: n+1-i instead of n-i-1, n-1-i or n-(i+1), whichever nonsense you like better. Looping: i=1, i<=n. Setting the next element: a[n+1]=x, where the capacity allows.
But when you mention one of these languages, you get a bunch of “oh, 1-based arrays, so uncool, not an option”.
257, 300, 100k elements? You don't index them with one byte, it's that simple. If that is a hard requirement (8-bit chip, low RAM, etc.), leave it to C or to a special syntax: __ptr_offset(p, n), p+n, p: array[0..n] of x, option base 0, you name it.
I mean, we can ask ourselves tens of such in-the-box questions for any feature that doesn’t fit it, but the out-of-the-box world doesn’t really crave all of that by default.
No, what I'm trying to show is that offset-indexing is the right way to do it: it makes sense, mathematically. If you need more and better arguments, have a look at Dijkstra's "Why numbering should start at zero" (it also argues for left-inclusive and right-exclusive bounds).
>...unnatural by the time the sequence has shrunk to the empty one. That is ugly, ...
I reread his article for clarity. He first makes the “ugly” argument, which originates from 1) natural-number domain definition issues, and 2) only then does he go to the range-definition argument, which is based on 1). It may be valid when you’re working in your mind on math problems over natural numbers starting* with 0, which is a self-referential argument btw. But it doesn’t have to be applied to software arrays of items. Dijkstra’s argument is repeated as a mantra, but the general consensus is that the representation of natural numbers (unsigned ints) brings more issues than it ought to solve, so we don’t use them today, which defeats 1), and then 2) loses its causality.
>>find(x) could return 0 as an “item not found” marker instead of returning -1 or worse uint_max
Let’s test that against Dijkstra’s statement on range unnaturalness. Which is more unnatural: 0, -1, or NSNotFound?
Programmatically it doesn’t make any sense except in salary gains from mastering off-by-ones.
* Edit: natural numbers start differently in different countries, but you may s/0/1/ there and it still holds, because he speaks about domain borderlines. The same holds true for e.g. [321..500].
It's not about number definition issues. The argument is convincing that the difference of upper and lower range bounds should equal the number of contained elements, and that it's ugly to represent empty ranges with an upper bound that is lower than the lower bound. (I agree and I held that opinion before even reading this article).
Note that I rewrote his argument a little here, because it is not really about natural vs unnatural, but more about looking at the difference between upper and lower bound. Now, if I may presume that we can agree that the difference of the bounds should equal the number of elements, leading to left-inclusive and right-exclusive bounds, the question is would you rather have a lower bound of 0 (which is an entirely "natural" number that is already in the game for subsequences of length 0), or would you introduce an entirely new number for the upper bound, (size + 1), which also needs an addition operation to compute?
I don't care about definitions of natural numbers. As demonstrated, numbering elements as offsets makes a lot of sense for purposes of indexing, and if you insist on starting with 1, then you need to either lower your base pointer by 1, or subtract 1 at every indexing operation (both is not exactly simple), and you frequently need to add 1 to the size.
I have no issues with zero-based indexing, and I don't think I've ever had to write quirky code. The most "ugly" thing is that the last element must be indexed as (size - 1) (which also makes some sense, since the last element need not necessarily exist. The subtraction makes clear that this is dangerous).
> Which is more unnatural, 0, -1, NSNotFound?
I normally handle that as -1 which is absolutely ok, especially given that this value is special. I concede that you might prefer 0 since that aligns well with evaluating truthiness of integers. It also makes a lot of sense (mathematically/programmatically) to use "size" (i.e. one-past-last index) as not-found, but that breaks when arrays are resized.
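As a tiny illustration of the "difference of the bounds equals the number of elements" point above (a trivial sketch, names my own):

    // With left-inclusive, right-exclusive bounds [lo, hi), the element count is
    // simply hi - lo, and the empty range is just [k, k) -- no special cases.
    int count_in_range(int lo, int hi) {
        int n = 0;
        for (int i = lo; i < hi; ++i) ++n;   // visits lo, lo+1, ..., hi-1
        return n;                            // equals hi - lo whenever hi >= lo
    }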
I think the 'declaration follows use' insight is far more useful than the 'spiral rule', which manages to get the same result but completely obscures the intuition.
Yup. "Declaration follows use" was Ritchie's idea of making it work[0] (I think it was his own). Unfortunately, while it is often useful, it also has its problems. Lacking a keyword, a type table is needed in the parser to recognize declarations. Thus, need to parse everything from the start. The syntax is very concise and easy to read in simple to moderate cases, but difficult to read for more complicated declarations, like functions returning function pointers.
And few people know how it works, so they get confused. I wonder why that knowledge isn't a lot more common.
Looking at it this way also makes it easier to remember/understand function pointer syntax, as well as attributes like const and volatile when placed between asterisks.
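For example (all names made up), reading each declaration the way the identifier would be used in an expression:

    /* "Declaration follows use": the declaration mirrors an expression using the name. */
    int *p;             /* *p is an int, so p is a pointer to int                        */
    int a[10];          /* a[i] is an int, so a is an array of int                       */
    int (*fp)(void);    /* (*fp)() is an int, so fp is a pointer to a function           */
    int *const *q;      /* *q is an "int *const", so q points to a constant pointer     */

    /* The hard case mentioned above: a function taking a char and returning a pointer
       to a function taking an int and returning an int. Read it inside out:
       (*handler('x'))(3) is an int. */
    int (*handler(char c))(int);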
Hadn't read the original essay, but Eevee is always a great blog.
Honestly, Rust hits a lot of good points for me. My only concern so far has been `.await` and a minor concern that they'll keep adding junk and end up like Perl with too many features.
Rust has a standardized 'edition' system, so they can deprecate superseded/junk features from newer editions of the language while still playing nice with legacy code targeting an older 'edition'.
It's sugar for existing features. "await"ing existed before but required some really messy syntax to achieve the same thing. It's very much an incredible improvement over the status quo.
I'm aware. I just believe that the design decision around `future.await` was dumb.
Preface: this is a minor syntactical annoyance that aesthetically and in principle annoys me a great deal, but in practice is unlikely to cause much confusion. The rest of this should be considered a light hearted rant.
`future.await` looks like it should be field access, not running arbitrary code for a future. It should have either been a keyword (a la `await future`), or some language built-in trait method (like `future.await()`), for which there is precedent among things that require compiler magic. The arguments against macro- or method-like syntax were "but it's not a function in the mathematical sense", which is ironic given they chose something that looks like field access, and irrelevant in that I could have a function call inline assembly and just jump elsewhere, ignoring their concept of mathematical functions and stacks anyway. I'll admit that the post-`future` syntaxes have the benefit of chaining, which is why I prefer `future.await()` over `await future` or `await!(future)`, but I still stand by the claim that `future.await` was the wrong choice.
So my main problem is that they made a poor design choice from what seemed an obvious pool of candidates (to me). The worry is that it'll happen again for more bizarre features that happen to be lobbied for.
I'm also in this camp. I did read through some of the arguments for this syntax at the time, and it allows for some elegant chaining of awaits and other things I can't recall, but I still can't get over the fact that it looks like a member access. Also, for something with such important meaning to the current line of code, it isn't boldface right at the start screaming "I'm async"; nope, it's at the end of the line and doesn't stand out in any meaningful way, since it looks like a member access. I find it very unfortunate.
`.await` is just a postfix keyword. You seem to be concerned that it reuses the `.` sigil, but there are only a few distinct sigils available in the ASCII character set so having some reuse is quite inevitable even in principle.
I do not agree with all of them. I think assignment expressions is good, and textual inclusion is good (but should not be the only kind of inclusion), and increment/decrement operators is good, and macros is good (although the way C does it is not good enough; there are many things it doesn't do), and pointer arithmetic is good, and goto is good. But I agree with them that the C syntax for octal numbers is bad, and the C syntax for types is bad. Identifiers should not have hyphens if the same sign is also used as an operator. Strings should be byte strings which have no "default" encoding other than ASCII (although UTF-8 and other encodings are still usable too, if they are compatible with ASCII). Another problem with C is that it does not allow null declarations, and does not allow duplicate declarations even if they are the same. There are many other problems with C as well; some kind of low-level stuff is too difficult with C.
>> some kind of low-level stuff is too difficult with C.
Surprised to hear this. I do C for a living, including some low level stuff (not x86) and there aren't a lot of low level things I feel I couldn't do. Can you expand a bit?
I'm having trouble following the negative modulo wording or formula. It says % and %% are identical for unsigned integers, then below defines it as
a %% b == ((a % b) + b) % b
If that's the case I can't figure out how they are identical for unsigned integers. Or is that example only a hint for how it would work with negatives?
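For what it's worth, here is how that kind of floored/Euclidean-style modulo is usually written out in C-like code (helper name is mine, and it assumes b > 0):

    // For a >= 0 and b > 0, a % b is already in [0, b), so adding b and reducing
    // again changes nothing -- which is why % and %% agree for unsigned operands.
    // For negative a, the extra step shifts the result back into [0, b).
    int modulo_floored(int a, int b) {
        return ((a % b) + b) % b;
    }
    // modulo_floored(-1, 3) == 2, whereas -1 % 3 == -1 under C/C++ truncated division.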
I hadn't given Odin a look before I read this post. As a fellow general purpose language author (Gosu) I applaud the author's pragmatic, albeit unpopular at times, point of view. For instance, his position regarding Null is spot on re the "streetlight effect" reference. Others include:
* pascal-style declarations
* type system
* multiple returns
* strings
* switch
Also I would not downplay the advantages of operator tokens && ||. More than just familiarity with C-family developers, in my experience, they are more generally effective as expression delimiters. They stand out better than 'and' 'or', which is better for readability.
> foo bar; // Is this an expression or a declaration? You need to check the symbol table to find out*
Checking the symbol table is a simple one-liner.
Pushing all semantic information into the syntax so that the meaning of everything is known without any lookup is not realistic and will result in a bad language.
Lisp:
(a b) ;; what is this?
You need the full surrounding context to know whether it's b applied to a, value of b being bound to variable a, or a list of base classes a and b in a defclass or what, ... and it's good that way.
* Leading zero for octal: C++ has custom literals, so you could define a _octal or _hex etc and get the value determined at compile-time.
* No power operator: Same thing in Odin and C and C++, IIANM.
* Switch with default fallthrough: Since C++17, there's an official [[fallthrough]] annotation. You can make your compiler fail in cases where you implicitly fall through. So, not as elegant, but you can ensure you don't mess up and get the wrong behavior.
* (Type syntax: Nope, C++ has the contrived syntax of C.)
* Weak typing: With a library, you can have strong aliases which are not interchangeable with the aliased type. See this blog post:
https://foonathan.net/2016/10/strong-typedefs/
and the library it links to. Things would be better, though, if the C++ standard committee allowed overloading operator. (the dot operator).
* Bytestrings: This is a library issue really. And both C and C++ have libraries which deal with wider characters, with UTF-8 and what-not.
* ++ and -- : You don't _have_ to use them... and it's possible to have static tools which forbid them in source files.
* ! Operator: C++ has !, && and || , but also "not", "and" and "or". I like the latter.
* Multiple returns: It's easy to return a tuple in C++, and with C++17 you can even construct and initialize multiple variables like that, e.g. `auto [index, name] = get_index_and_name(whatever)`.
* Errors: C++ is multi-paradigmatic here, supporting the traditional status return, an expected<T> type (either an actual T or an error; not yet standardized but available via a widely-used library), and exceptions. Monadic-style programming is not yet supported and will likely not be in C++20, but there are discussions about this.
* Nulls: With `std::optional` and `gsl::not_null`, and especially with no return types necessary, you can comfortably avoid using `nullptr` in C++ code.
So - with some choices and a little discipline (really not much!), you can have all of this "Odinish" behavior you like; a small sketch of a few of these together follows below.
Of course - the price of C backwards compatibility and supporting multiple programming paradigms is complexity of the formal language definition, grammatic ambiguity, and a syntax that is not always pleasant.
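To put a few of those bullets together, a minimal C++17 sketch (all names illustrative): [[fallthrough]], multiple returns unpacked with structured bindings, and std::optional instead of a nullable pointer:

    #include <optional>
    #include <string>
    #include <tuple>

    // Multiple returns via a tuple, unpacked with structured bindings at the call site.
    std::tuple<int, std::string> get_index_and_name() {
        return {42, "widget"};
    }

    // Explicit [[fallthrough]]: with -Wimplicit-fallthrough (and -Werror), forgetting
    // either the annotation or the break becomes a build failure, not a silent bug.
    int category(int code) {
        switch (code) {
            case 0:  return 0;
            case 1:  [[fallthrough]];
            case 2:  return 1;
            default: return -1;
        }
    }

    // std::optional instead of a nullable pointer: "no result" is a distinct state.
    std::optional<std::string> lookup(int code) {
        if (code == 42) return "widget";
        return std::nullopt;
    }

    int main() {
        auto [index, name] = get_index_and_name();   // structured bindings (C++17)
        auto found = lookup(index);                  // may or may not hold a value
        return category(index) + static_cast<int>(name.size()) + (found ? 1 : 0);
    }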
> * Textual inclusion: C++20 has modules, where you include semantically, not textually.
... provided that the libraries that you want to use are available as C++20 modules, not header files.
> * Leading zero for octal: C++ has custom literals,
... but no way to redefine the built-in "0" prefixed octal literal.
> * Weak typing: With a library, you can have strong aliases which are not interchangeable with the aliased type.
... you can have that in theory, I agree it's very useful, but can you name one widely used C++ library that e.g. defines custom integer or string types purely for the purpose of type safety?
> * Bytestrings: This is a library issue really.
... indeed, and the standard library's std::string is quite unhelpful, there's no way provided to iterate code points if you put UTF-8 in std::string.
> * Errors: C++ is multi-paradigmatic here,
... but none of the paradigms available with just the language and standard library have any static checks that the caller of a function actually handles an error, whether return value or exception.
> * Nulls: With `std::optional` and `gsl::not_null`, and especially with no return types necessary, you can comfortably avoid using `nullptr` in C++ code.
... provided you never use any library from the C++ ecosystem, none of which currently use std::optional or gsl::not_null.
One of the problems I have with C++ is that re-using existing libraries vs. structuring your code to statically avoid sources of bugs is a trade-off; it shouldn't be.
> ... provided that the libraries that you want to use are available as C++20 modules, not header files.
Yes. This will take time. But if you're writing most of the world from scratch (like with a new programming language), then you can write libraries in modules.
> ... but no way to redefine the built-in "0" prefixed octal literal.
True, but why use that? It's confusing if you don't know the convention.
> ... can you name one widely used C++ library that e.g. defines custom integer or string types purely for the purpose of type safety?
Well, custom string types? QString is used in tons of apps; but custom string types are rarely about safety. Integers - there's Boost's safe_numerics library. Granted, I'm not sure it's very popular or just mildly popular, but still.
> ... indeed, and the standard library's std::string is quite unhelpful, there's no way provided to iterate code points if you put UTF-8 in std::string.
Yes, Unicode support in the C++ standard library is lacking, or where it isn't, it might as well have been. But again - it's not the language. So you can write your string library (or port ICU) in Odin or in C++
> ... but none of the paradigms available with just the language and standard library have any static checks that the caller of a function actually handles an error, whether return value or exception.
Static checks are a compiler/IDE thing. But - there _is_ a [[nodiscard]] annotation in C++17, so that you can't just ignore a returned status altogether.
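For instance (names and stub body are mine), a status return marked [[nodiscard]]:

    #include <string>

    // C++17: a caller that silently drops this return value gets a compiler warning
    // (a hard error with -Werror). Stub body just for the example.
    [[nodiscard]] bool write_all(const char* /*path*/, const std::string& /*data*/) {
        return true;
    }

    int main() {
        write_all("out.txt", "hello");   // warning: ignoring return value marked [[nodiscard]]
        return 0;
    }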
> ... provided you never use any library from the C++ ecosystem, none of which currently use std::optional or gsl::not_null.
Touché, and the backwards compatibility is a bitch here. But
1. At those libraries' boundary, you have either silent or explicit conversion, which are safe, as long as you act safely on the outside.
2. Most templated standard library code will use optional just fine...
> One of the problems I have with C++ is that re-using existing libraries vs. structuring your code to statically avoid sources of bugs is a trade-off; it shouldn't be.
Agreed. Having said that - writing lightweight wrappers for an existing library is a middle-of-the-road solution which is a lot less bug prone than just writing unsafe, old-style C++ all over.
The optionals / non_null won't play well with external libraries though, right? As in, if I have to interface with a third party library taking plain pointers, I would have to either unwrap on every usage, or create an extra interface in between which unpacks the arguments and wraps the result - or did I miss some better solution?
> won't play well with external libraries though, right?
1. Library use boundaries are indeed a place where you often need to "unpack" more complex types. But then - external library APIs do tell you whether they take ownership of pointers and generally what you can expect from them. And it's not so bad to write
2. Libraries slowly adopt features of language standards, and typically lag behind. To be honest, though - this is going slower than I would have liked or expected.
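For what it's worth, a sketch of the kind of thin boundary wrapper point 1 is talking about (everything here is made up): the std::optional is unpacked exactly once, where the plain-pointer library is called:

    #include <optional>
    #include <string>

    // Hypothetical third-party C-style API taking a plain, possibly-null pointer.
    extern "C" int legacy_send(const char* message);

    // Thin wrapper: the rest of the code base keeps passing optionals around; the
    // unpacking happens only here, at the library boundary.
    int send_message(const std::optional<std::string>& message) {
        return legacy_send(message ? message->c_str() : nullptr);
    }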
Right, but C++ is evil, remember? It's better to make your own language from the ground up than using something that thousands of people have worked on over decades.
They are right about null. To say it is a billion dollar mistake is completely overblown. Maybe it was at the time, but now a null pointer exception is extremely easy to find and fix. It doesn't compare to the other complex bugs one has to deal with most of the time.
>They are right about null. To say it is a billion dollar mistake is completely overblown. Maybe it was at the time, but now a null pointer exception is extremely easy to find and fix.
The whole idea is not to have to "find it and fix it" at runtime... Whether it's "easy" or not is a moot point, as it comes after your program has already crashed or got into an undefined state, while your server is running or your desktop user is using it...
Yup, plus depending on the case it might be anything up to insanely hard to find that bug (imagine a >100k-line code base where it causes a subtle wrong state once in a thousand requests, only when built with optimizations and under load, which causes a chain of other subtle errors crashing the production server once an hour).
C will not throw an exception when you try to dereference a null pointer. C does not have exceptions. Instead, dereferencing a null pointer results in undefined behaviour, which is often tough to deal with. Having an exception thrown is a lot nicer. In C the compiler is free to make optimizations, based on the programmer's vigilance about which pointers may or may not be null, which you would have never expected on your own[1]. Debugging these issues is hell.
I think their point was that in modern languages (not C), having null in the language isn't so bad, as attempts to dereference null are handled fairly safely, generally with exceptions.
Same thing goes with modern languages throwing on signed overflow, where, again, C gives you the horrors of undefined behaviour.
I think the main issue is that the program should never reach a state where you're in a position to dereference a null. If it does, that means you didn't handle an error condition earlier in the program. Sure you can handle the null safely but that doesn't handle the real cause of the error.
The solution is for the language to offer 'option types', which essentially force you to check for null. Zig does this, and the result is much nicer than how things work in C: https://ziglang.org/documentation/master/#Optionals
More importantly, they let you safely not check for null (because with nullable types, you should really always be handling null, because anything can go from never-null to null-in-this-one-case without warning)
And options help with solidifying input contract as well as the output. Knowing what's an acceptable input via a type system is just as relevant as knowing what is a potential output.
While an exception might be marginally better than undefined behavior, its actual occurrence can still leave your database in an inconsistent state or have other random detrimental effects. By the time you see the exception reported the damage has already been done (program flow interrupted unexpectedly). Due to the large amount of possible program states you would have to do an impossible amount of testing to make sure such surprises won't happen once the software gets into the hands of a user.
>an exception might be marginally better than undefined behavior
The difference isn't marginal, it's night-and-day. Undefined behaviour means the program can behave in unpredictable ways, either now, or at some other time. If you're lucky, your whole process explodes immediately, but that's not guaranteed. Undefined behaviour is the root of many a nightmarish hidden bug.
> its actual occurrence can still leave your database in an inconsistent state or have other random detrimental effects
Not if your exception-handling code is correct, surely?
I agree though that there are good arguments to be made against exceptions as they exist in many languages, particularly regarding how it interferes with control flow 'from a distance' as it were. I defer to the excellent Raymond Chen: https://devblogs.microsoft.com/oldnewthing/?p=36693
The interesting question is not what you do once a catastrophe has happened (program has reached invalid state) but how to avoid it in the first place. Exceptions are better than undefined behavior in coping with the former, but they still don't help you with the latter. The idea to “make invalid states unrepresentable” would give you an actual shot at this.
> Exceptions are better than undefined behavior in coping with the former, but they still don't help you with the latter
That's not right. Exceptions are thrown reliably and are always 'noisy'. Undefined behaviour may go completely unnoticed for a long time. Code that wrongly throws exceptions tends to get fixed.
I've seen undefined behaviour in code samples in respected technical books. The code happened to work fine when using the MSVC C++ compiler and x86/AMD64 targets, and will probably continue to do so.
> The idea to “make invalid states unrepresentable” would give you an actual shot at this.
That's a pretty good summary of what type-systems research is aiming for.
Not in cases where the optimizer did something tricky based on the assumption that null pointer dereferences never happen (which is allowed because they are undefined behavior so any behavior is correct).
Oh wow, I didn't know you were on here! I'm a huge fan of D so I just wanted to say thanks for all the work you've done. D is an absolute pleasure to work with.
This does not save you if that null pointer is supposed to point to an array, and subsequent index operations move it out of the protected first page (or the first couple, I forget how modern Unix does it).
This is indeed how certain language runtimes generate null pointer exceptions. It would be too slow to add comparisons against null to every use of a pointer. So they handle SIGSEGV and raise the exception if the faulting access is on the null page.
Interesting side question: if you have a struct bigger than a page, and you dereference the end of it... How big does it have to be to not get this behavior?
It's less about null specifically and more about taking advantage of a type system. A null pointer, or, really, any pointer address used as a special value, is an in-bound value. This makes the type system unaware of it.
By using optionals, the idea of a special "missing" state becomes an out-of-bound value, meaning the type system can participate in helping the programmer properly deal with missing values when the time comes.
So it doesn't even have to be null. Let's say you're writing a function that returns an integer, or MAX_INT indicating that something failed. That's the same problem. The error indicator is an in-bound value, which means the type system is unaware that MAX_INT is special, and won't be able to catch mistakes, such as using the return value without checking if it is the special value. It would also be a subtle bug if the integer size changed from int to long, and now MAX_INT isn't even the correct value anymore. If the type were an optional, everything would have continued working correctly.
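A small C++ sketch of that contrast (function names are mine): the sentinel version returns an in-range integer that the type system can't distinguish from a real index, while the optional version makes "not found" a separate state the caller must unpack:

    #include <climits>
    #include <cstddef>
    #include <optional>
    #include <vector>

    // Sentinel style: INT_MAX is just another int, so nothing forces callers to check,
    // and the "special" value silently becomes wrong if the index type ever widens.
    int find_sentinel(const std::vector<int>& v, int x) {
        for (std::size_t i = 0; i < v.size(); ++i)
            if (v[i] == x) return static_cast<int>(i);
        return INT_MAX;
    }

    // Optional style: the missing case is part of the type, and must be unpacked.
    std::optional<std::size_t> find_optional(const std::vector<int>& v, int x) {
        for (std::size_t i = 0; i < v.size(); ++i)
            if (v[i] == x) return i;
        return std::nullopt;
    }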
Sentinel values are things the compiler has no idea about and I agree with you here. If the type system has no idea about these sentinel values, the compiler cannot help you with those edge cases.
One aspect I think many people are bringing up, implicitly, is that they want a language that enforces a form of "correctness" so that you cannot put your programming into an "invalid state". I understand the appeal of this but it is not free, it does come at a cost.
This implicit view assumes that all values are a form of "object" and "singular". It assumes pointers are "references" to a "singular object", rather than just a pigeonhole to a piece of memory which may have a type associated with it. This may seem like it is expressing the same concept, but the former is kind of like Aristotelian Object Orientated Ontology (OOO) naïvely applied to things which are not actually objects in the real world. Thinking of these things as "objects" is an abstraction and is not actually what is happening. It may have a lot of utility, but it is not the reality at hand.
In many regards, most forms of Object Orientated Programming (OOP) are in reality an application of OOO (even the term "virtual" is from Aristotle). And Rust's ownership and lifetime semantics are another application too. But explaining this is in itself a long article. I know I won't convince people here about this, and that's absolutely fine as I don't expect to. It's just interesting to read many people's implicit assumptions about how things "ought to be" with regard to a programming language.
Ownership semantics also suffer from the misapplication of what ownership is. Property cannot own property. In this case, it just becomes a hierarchical dependency system of responsibility. So "ownership" itself is the wrong term.
"This object owns this object which in turn is owned by another object."
but rather
"This object is responsibly for this object which in turn is responsible for this object."
In sum, you are artificially applying a hierarchy where one has not arisen naturally, and adding the artificial concept of an "object" as an abstraction where there is no real "object" in reality. All of these abstractions that we apply to programming are just tools that we (hope to) get utility from.
In order to interpret anything, we must use a model. Some models are better than others, and some models are downright wrong.
If when you say “null pointer exception” you’re thinking of Java: do you find any value in nullable / non-null annotations? Or if you’ve used Kotlin, its non-nullable references?
I find them very useful. I don’t just want null pointer exceptions to be easy to debug, I want to avoid them completely in the first place.
You're right. The mistake is not in having null pointers, it's lacking non-nullable references which is the problem. If you can write a function which only accepts a non-null reference, then you don't have to check in every function called from it.
+1. I think this is something that gets lost in discussions around null. Generally, everyone complaining about C's handling of nulls will recommend languages that make nullable-pointers opt-in by having non-nullable references and making them the default. The argument people make is not that C should somehow drop the entire concept of null; the argument is that people should use languages where a type has to opt-in to accepting null.
A really effective way to handle nullable fields is an Optional<T>, which forces the user to handle a missing value for that field. This is very useful when working in a team and dealing with objects that someone else spent time creating. It won't get rid of null, but it definitely hints to the user: "hey, this field may be null, so I should go ahead and handle this before I even get to testing my code."
Edit: You would want to implement these in your getters for fields that may return null
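Roughly the same idea, sketched here with C++'s std::optional standing in for Java's Optional (class and field names are mine):

    #include <optional>
    #include <string>
    #include <utility>

    class Customer {
        std::string name_;
        std::string nickname_;   // may legitimately be missing
    public:
        explicit Customer(std::string name, std::string nickname = "")
            : name_(std::move(name)), nickname_(std::move(nickname)) {}

        // The getter's type says "this may be absent", so callers have to deal with
        // the missing case up front instead of discovering it at test (or crash) time.
        std::optional<std::string> nickname() const {
            if (nickname_.empty()) return std::nullopt;
            return nickname_;
        }
    };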
Note that in C++ today, with static analysis tools plus classes such as `std::unique_ptr`, `gsl::owner` and `gsl::not_null` (similar to the Kotlin mechanism you described), it is possible to avoid null pointer dereferencing.
That's not to say the same thing is possible in C, but one _can_ go a long way with static analysis at least, and suspicion vs provability regarding pointer dereferences.
You can configure your IDE to warn of incorrect usage, and you can configure your compiler to flag warnings as errors. That will catch a lot of potential problems at compile time. It’s not perfect but it can be lot better than nothing.
Java annotations are better than Optional when you’re interoperating with Kotlin code, such as when you’re partway through converting a large codebase to Kotlin.
@NonNull etc are not completely meaningless ... if you run SpotBugs it can be set up to fail your build if the contract specified in the annotations isn't met. I love static analysis!
I have to agree (ignoring the whole C doesn’t throw exceptions).
I have been looking at our bugs, logs and exceptions from the past 6 years recently, and an enormous amount of bugs are caused by methods/functions that have multiple parameters of the same type (Java).
This happens because (my theory) we use Java, and Java doesn't have type aliases or value types, nor easy destructuring. It also doesn't have named parameters (well, there is a compiler option to retain the parameter name, but it's not like OCaml labeled parameters or Python kwargs).
So oftentimes in boring business programming you are dealing with methods taking five to six strings, so it's very easy to mix up the parameter order.
Very few “hard” bugs were caused by NPEs where as the previous problem caused serious pain.
Agreed. I've been saying that for too many decades.
My version of that is "The big safety questions are 'how big is it', 'who owns it', and 'who locks it'. C helps with none of those."
Most of the things done with pointer arithmetic are really array slices without the right syntax. If you're doing something with pointer arithmetic that can't be represented an array slice, you're probably doing something wrong. I once proposed adding slice syntax and array sizes to C. This was discussed on comp.lang.c at some length, and looks backwards compatible. But the political problems are huge.
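For comparison, C++20's std::span is pretty much the "array slice" idea: a pointer plus a length travelling together, so most pointer-arithmetic loops become bounds-aware slice operations (a minimal sketch):

    #include <span>

    // Sum a slice: the length travels with the pointer instead of living in the
    // programmer's head, and sub-slicing replaces pointer arithmetic.
    int sum(std::span<const int> xs) {
        int total = 0;
        for (int x : xs) total += x;
        return total;
    }

    int main() {
        int data[5] = {1, 2, 3, 4, 5};
        return sum(std::span<const int>(data).subspan(1, 3));   // the slice data[1..3]
    }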
Politics are a big issue regarding security improvements with C, every attempt to improve the language's security has failed, including the now optional Annex K.
Hardware solutions like Solaris SPARC ADI, ARM MTE used by iOS and a requirement for future Android ARM devices, CHERI CPU seem to be the only way to tame those developers.
Sadly Intel dropped the ball on MPX, leaving x86/x64 as the only CPU family not pursuing this kind of endeavour.
So memory tagging seems to be the only way.
Thankfully the C++ community did not inherit this part of C culture, and several activities are in progress to at least minimize the security impact of such features, e.g. the lifetime profile, reducing the amount of UB, security guidelines, library types instead of raw C ones, ...
This is why Odin does not have first class pointer arithmetic and has first class support for slices. I have found in practice that I don't actually need pointer arithmetic most of the time, as slices serve that function.
The reason pointer arithmetic is so useful in C is because of its lack of array types, like slices. `x[n]` is the same as `*(x + n)` which means in C, you can do the "wonderful" trick of `n[x]`.
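That is, as a tiny (if pointless) demonstration:

    #include <cassert>

    int main() {
        int x[4] = {10, 20, 30, 40};
        int n = 2;
        assert(x[n] == *(x + n));   // x[n] is defined as *(x + n)
        assert(x[n] == n[x]);       // addition commutes, so n[x] is the same element
        return 0;
    }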
I feel like the resistance is people thinking pointers are C's secret sauce. The reality is that the only advantage is that it makes it easy to write brain-damaged, unoptimized compilers for C. Which no one does. I remember back in the 90's my boss gave me shit for using array indexes instead of pointers. Being a brat, I looked at the assembly and there was no difference.
Ditto the cultural proscription on passing/returning small structs by value. In practice it's no worse than passing the arguments separately. And likely easier for the compiler to optimize around.
> Ditto the cultural proscription on passing/returning small structs by value.
Which really should be a compiler decision. Depends on the target CPU. If you pass something as const ref, the compiler should copy it if that's faster. For AMD64, it probably is. If something was just written, it's in the L1 cache and is really cheap to copy. On-chip data buses today can be as wide as 64 bytes. Anything not bigger than that is better copied.
NULL / nil isn't bad; however any case where encountering one is a problem is most often an issue of under-specified program design.
The better question is, what were you expecting there and how can it be described without a bare pointer? I often find a list is better, particularly in languages with syntax sugar for iterating through a list.
Agreed. Non-null pointers also come with their own set of problems:
- Increased language complexity. Initialization of (arrays of) structs with non-null pointer members is more complicated because there is no straightforward "default" value that can be used.
- Performance tradeoffs. Initialization of arrays and other containers is more costly for non-null pointers (or structs containing them) because we cannot simply zero out a block of memory.
- In memory unsafe languages it is possible for a non-null pointer to become null if its memory is overwritten (e.g. in custom allocator scenarios (also mentioned by the author of the article), interop scenarios, etc.). The type system now makes a false guarantee, and it becomes possible for a "non-null" pointer to trigger a null pointer crash. (It's worth mentioning that enums generally have a similar problem).
- I'd like do a more rigorous evaluation on this, but my gut feeling is that most null pointer bugs that I've encountered usually happen in situations where the pointer would have to be declared as nullable anyway, because I'm using null to represent a possible state.
I don't believe any of those tradeoffs are true for rust, which has references that may never be null (in safe rust).
I'll respond to each in turn:
- language complexity for arrays of nullable things
In rust you can easily type 'let v: Vec<Option<&MyStruct>> = vec![None; 10]'
Having a vector of 10 'None' option types is no more difficult than anything else. Also, Option<T> implements the Default trait; it defaults to 'None'.
- Performance
Rust's 'Option<Box<T>>' takes up just as much space as 'Box<T>'. The null value of the pointer is used to represent the None variant, so in that case the Option wrapper adds zero size and zero overhead. You can read about that here, and in various other places: https://doc.rust-lang.org/std/option/#options-and-pointers-n...
- In memory unsafe languages it is possible for a non-null pointer to become null
Yeah. In memory unsafe languages. Don't use one of those. In Rust, Haskell, etc the Option type can't lie like that without explicitly opting in to memory unsafety. Which hardly anyone does.
- my gut feeling is that most null pointer bugs that I've encountered usually happen in situations where the pointer would have to be declared as nullable anyway
Null pointer exceptions happen when a pointer is nullable, but some location forgets that it is. If the type system encodes this in the form of Option<T>, it's impossible to forget that. If I have an 'Option<String>' and I have a function that expects a 'String' (but not a null string), with the option type I know I have to do something like 'call_func(val.unwrap_or_default())', or I have to match, or I have to do 'if let Some(val) = optional_val { call_func(val) }' before I can use that thing.
The fact that I have to do a null-check is built into the language and the compiler yells at me if I forget about it.
Being able to encode in the type system that null should encode some valid state (e.g. 'None' in the option type), vs nullable pointers, where you use pointers that may be null even when you don't want nulls and the compiler can't check your work... It's a night and day difference.
I'm curious; I had an idea (in fact, it's part of my Master's degree project) to create a language that brings over concepts from functional programming to systems programming. What would be beneficial for that, and what wouldn't be?
clarity: this is arguably not FP specific, but I think that having declarations for binary layouts would save a lot of very confusing shifting and masking (see the sketch after this list)
concurrency: to the extent that you can get by with pure functions, they also tend to be more trivially multithreaded. you also have quite a bit more flexibility wrt things like STM, or lazy evaluation, or other more novel notions of concurrency scheduling
regions: I think GC is pretty much a no-no, at least for some critical sections. but there is a lot you can do with static analysis to both make allocation more statically safe and more performant than generic heaps
closures: again, not specific to FP..but using closures makes asynchronous programming really quite nice
runtime: having a proper runtime with maps, higher order functions and real string functions when not in the performance path really does save a lot of time.
immutability: not sure it's a win, but I find it super instructive to think about what actually has to be mutable in a big system. C really loses the distinction since, with the exception of the verbose and viral const, everything is mutable by default. you can also play lovely tricks with explicit time a la MVCC and Daedelus
even the sum of that I think doesn't really justify a project...but if you're doing it anyways...why not try to make something a little nicer than the huge and difficult to debug standard C business
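On the binary-layouts point specifically, a small C++ sketch of the difference (made-up field names): today it's shifting and masking by hand, and bit-fields only approximate a declared layout because bit order and padding are implementation-defined:

    #include <cstdint>

    // By hand: pack a 4-bit version and a 12-bit length into a 16-bit header.
    uint16_t pack_header(uint16_t version, uint16_t length) {
        return static_cast<uint16_t>(((version & 0xFu) << 12) | (length & 0xFFFu));
    }

    // The "declared layout" wish, approximated with bit-fields -- which is exactly
    // the approximation a first-class layout declaration would improve on, since
    // bit order and padding here are left to the implementation.
    struct Header {
        uint16_t length  : 12;
        uint16_t version : 4;
    };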
I don't have much to add but I do think you are correct about binary layouts. Correct about closures. I think with closures you can fix a lot of C's jankiness. Immutability. I've never been happy with 'const' as a 'bandaid' for mutability issues.