I know this article is about bootloaders, but more generally there are two ways I've tackled x86 as a total noob: compile a simple language down to ~5 instructions [0, 1] and, separately, write an emulator for ~5 instructions that can run basic C programs [2, 3].
I think it's fairly common to recommend that beginners write a compiler, but less common to recommend trying to emulate parts of x86. I think it's a particularly easy way to get started, just because it's the architecture you already know, or because it's the architecture that all compiler tutorials use (if they don't use LLVM). And if you are a programmer you probably already have gcc, gdb, and objdump on your system ready to help you out.
Doing both compilers _and_ emulators really helped my understanding of x86 and C (even if I wasn't writing C).
My background is in web development and my reason for doing these projects/writing is purely educational.
I sometimes wonder whether assembly code would even be considered "scary" today if IBM had picked the Motorola 68000 instead of the Intel 8088 for their PC.
The x86 instruction set was a cobbled-together mess from day one, while 68k assembly coding was pure joy because of its elegant and consistent instruction set.
It's "scary" on almost every architecture because there's no fault tolerance, recovery, exception handling, or reporting. If you make a mistake you get to reassemble the smashed plate of your memory image in order to work out what happened - assuming you can get at it at all. Some platforms will just reboot on you.
Hmm, I don't agree: stepping through assembly code in most (C/C++) debuggers works just as well as with high-level languages, and process isolation in operating systems applies to assembly code too, so it's just as unlikely to crash the whole computer with assembly as with a high-level language (otherwise we'd have a massive problem if operating system security depended on banning low-level programming).
Bare metal embedded coding is a different topic of course, but regular application development with assembly works just fine.
Inline assembly is certainly the least scary case and may be the easiest to get started with. But the original article targets the 16-bit DOS-era platform as a bootloader, where your code _is_ the operating system and you have no help.
Sure, that's the case today, but in the 80's and early 90's it wasn't all that uncommon to write big applications completely in assembly code. With a proper macro assembler and IDE-like coding environment that wasn't as bad as it sounds today. The focus has just shifted away from assembly programming to high-level languages, and the tools moved along. At least reading assembly code is still important today though.
> The game was developed in a small village near Dunblane over the course of two years.[2][5] Sawyer wrote 99% of the code for RollerCoaster Tycoon in x86 assembly language, with the remaining one percent written in C.[3]
Even into the late 90s 2D PC games were being written with large chunks of assembler. 3D APIs killed that. I'm sure there were huge slabs of machine code in console games until the early 2000s, especially systems that needed a lot of specialized code to fully exploit them, e.g. PlayStation 2.
> 68k assembly coding was pure joy because of its elegant and consistent instruction set
That's maybe a little spun. The separated address/data registers (there are two kinds of registers, and memory addressing has to go through the address registers while arithmetic wants the data registers) played hell with optimizer strategies for years.
In fact 68k compiler output was significantly sub-par pretty much throughout its lifetime. The "cobbled-together mess" had (post-386 anyway) a significantly more orthogonal instruction set and was just plain easier to optimize, even for humans.
The 68k was certainly great fun after 8bit systems, and the x86 was almost just more of the same.
ARM32 is also nice, apart from some corner cases.
I think everything has corner cases, apart from x86 where the whole thing is an ugly mess.
Yes. But what you cannot do is a load with the sum of any two GPRs. They need to be in the right partitions. That makes register assignment a huge pain for the optimizer, and historically hurt the architecture.
That's the kind of complexity that really hurts software. Compare that with the commonly cited x86 nonsense (the REP prefix, say), which complicates silicon implementations but generally makes software easier to write (cf. decades of optimized inline memcpy implementations).
The point being the 68k was a dead end in a different direction. It was a "clean" architecture from the perspective of a 1970s assembly programmer, but not from that of a late-80s compiler writer.
I don't want to be too blunt, but have you looked at early 90's era 68k compiler output? It was crap. When Sun launched SPARC, like half the advantage of the platform was that the compiler was suddenly generating this amazingly clean code. The phantom spills and intra-GPR movs everyone was used to disappeared overnight.
As an owner of the original 1984 Mac, I bought a copy of the classic Lance Leventhal 68000 assembly book. I never got much into actually coding in it, but I remember how "this makes sense, mostly" the book felt.
And likely far more by now, had it become the PC instruction set architecture. The x86 instruction set was also simpler back in the late 70s, in the 8086 era when the Motorola 68000 was first released, than it is now.
Back then it only had the one register width (16-bit, e.g. "ax"), whereas now it has the 32-bit registers (e.g. "eax") and the 64-bit registers (e.g. "rax"). It also now has SIMD extensions (SSE/AVX), virtualization support, and other technologies. Back then it had just one operating mode (real mode), whereas now it has protected mode, long mode, system management mode, and a few other intermediate modes (e.g. "unreal mode").
So a lot of the complexity that x86 has now was introduced after that decision was made. It's definitely conceivable that the 68000 line would have developed similarly had it been chosen instead of x86 for the PC.
8086 let you address the upper and lower halves of the 16-bit registers as well, so don't trick yourself into thinking everything could only be treated as 16-bit words.
Totally agree. I learnt a bit of 6502/6809 in my early teens and into college/uni. I couldn't be bothered with the x86 - a complete pita, although to be fair that may have been the not-so-great manuals I had at the time.
But I started my first job disassembling 68000 - very easy to work with.
Ideally, IBM would have created and used their own microprocessor based on the System/370 architecture. But no - for them the PC was a glorified typewriter: even the PS/2 (based on the 80286) was mostly positioned for use merely as a "smart" terminal for mainframes. So today the entire world basically runs on faster "typewriters" (sometimes enhanced to include "windows" - even on the server side).
Yeah, X64 is only used by Windows, that's like ~80% of the desktop computing market, plus nearly every server and cloud instance out there, so not much at all. /s
Why do some users assume that the whole world revolves around Apple's iOS/M1 Mac ecosystem as if it exists in a vacuum?
Because well over 90% of shipped computing devices are ARM based.
Desktop computers are a declining and relatively small market. Your home, car and office are full of ARM devices. Potentially hundreds.
x86 lives in the data centre (Linux, not Windows, so contrary to grandparent post, but whatever) and in some desktop systems. Not the majority of systems.
ARM assembly is easy to learn, because of its relatively small and orthogonal instruction set. However, load store architectures are annoying to program in, since you're constantly having to juggle memory. The lack of a convenient way to spill registers is also really really annoying! (you can only spill and load registers two at a time)
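To illustrate the two-at-a-time point (assuming AArch64 is meant here, where STP/LDP are the pair store/load instructions), a typical prologue/epilogue spill looks roughly like this - a hand-written sketch, not from any particular codebase:

    func:
        stp x29, x30, [sp, #-32]!   // push frame pointer + link register, pre-decrementing SP
        stp x19, x20, [sp, #16]     // spill two callee-saved registers in one instruction
        mov x29, sp
        // ... body ...
        ldp x19, x20, [sp, #16]     // reload them, again two at a time
        ldp x29, x30, [sp], #32     // restore FP/LR and pop the frame
        ret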
Except, of course, that RISC-V snippet doesn't actually load 0x7FFFFFFF into a0 because Reasons, it has to be
    LUI  a0, 0x80000      ; a0 = 0x80000000
    ADDI a0, a0, 0xFFF    ; the 12-bit immediate 0xFFF sign-extends to -1, giving 0x7FFFFFFF
"But the assembler has the LI pseudoinstruction so it will properly do this calculation for you!". Right, so much for "nice assembly language": you need an actual smart macroassembler to write it.
Talkin' out of my ass here, but is this an artifact of the parameter having to fit in the same instruction word as the opcode ('cause RISC) and because the register is the size of a word (which is partially used by the opcode now), you can't actually load a whole register with an immediate in one go?
Absolutely, and different RISCs coped with it in their own ways. ARM has 12-bit immediates which it treats as having an 8-bit part and a 4-bit part: the 8-bit part is extended into 32 bits and then rotated right by twice the number in the 4-bit part. MIPS has 16-bit immediates and the 32-bit load is generally done with LUI then ORI (since MIPS zero-extends the immediates, unlike RISC-V which sign-extends them). And RISC-V has 12-bit (for the lower part of the word) and 20-bit (for the upper part) immediates.
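To make that concrete, here's the same constant from upthread (0x7FFFFFFF) built both ways - a rough sketch in GNU-as-style syntax, 32-bit registers assumed:

    # MIPS: LUI sets the upper 16 bits, ORI zero-extends its 16-bit immediate
    lui  $t0, 0x7FFF          # t0 = 0x7FFF0000
    ori  $t0, $t0, 0xFFFF     # t0 = 0x7FFFFFFF

    # RISC-V: ADDI sign-extends its 12-bit immediate, so LUI has to overshoot by one
    lui  t0, 0x80000          # t0 = 0x80000000
    addi t0, t0, -1           # -1 is the 0xFFF bit pattern; t0 = 0x7FFFFFFF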
The funny part is, there is now the "C" extension to RISC-V which introduces 16-bit instructions that are allowed to freely mix with 32-bit instructions — so now those 32-bit instructions can be 16-bit aligned and even be split between two physical memory pages, which kinda kills the whole "but at least fixed-length encoding prevents Spectre-like exploits" argument.
> "But the assembler has the LI pseudoinstruction so it will properly do this calculation for you!". Right, so much for "nice assembly language": you need an actual smart macroassembler to write it.
In x86 the mnemonic "MOV" can be translated into instructions with different opcodes according to the addressing mode, immediate value size, or target register.
Most x86 instructions have a similar issue, while RISC-V macroassemblers are pretty simple.
Therefore, the x86 assembler must actually contain much more intelligence than the RISC-V assembler to make it look "simple" and "nice".
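For example (NASM, 32-bit; the exact bytes can differ where an assembler has more than one legal encoding to choose from):

    mov eax, 1            ; B8 01 00 00 00   (B8+rd, imm32)
    mov eax, ebx          ; 89 D8            (89 /r, register to register)
    mov eax, [ebx]        ; 8B 03            (8B /r, memory to register)
    mov byte [ebx], 1     ; C6 03 01         (C6 /0, imm8 to memory)

One mnemonic, four different opcodes - the assembler picks the encoding, which is exactly the "intelligence" being described above.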
Well, sure, it's awkward but also consistent. So once you've learned the rules/pattern for this kind of thing, you can read/write code without having to look up a whack of different instructions and modes and register sets, which x86 is notorious for.
Assembly isn't supposed to be convenient and expressive to write; for that we have high-level languages. But some consistency makes for less error-prone code and easier analysis.
x86 is also consistent, just in a different way. There are basic moves and arithmetic, basic branching, and basic stack-related stuff. Next, there are extensions: a string-manipulating extension (STOS/LODS/etc.), a multiplication/division extension (MUL/DIV with their idiosyncratic use of DX:AX), a floating-point extension, control-register extensions (tons of those), vectorized extensions, etc. Inside any one set of instructions, things are pretty consistent. It's just that Intel's Software Developer's Manual is not structured this way; it lumps all of those instructions together.
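The DX:AX convention is a good example of that within-extension consistency; a quick 16-bit sketch (NASM syntax, values arbitrary):

    mov ax, 1234h
    mov bx, 10h
    mul bx            ; unsigned 16x16 multiply, 32-bit result in DX:AX
    div bx            ; DX:AX / BX -> quotient in AX, remainder in DX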
But the RISC-V specification is explicitly structured around describing several basic cores and the extensions to those, so it looks like it's all very unified and consistent: and indeed, it mostly is, since it was developed mostly in one continuous effort with consistency in mind. But there are still some inconsistencies between how things are done in different extensions, for example the "C" extension uses zero-extended immediates in half of its instructions, unlike the rest of the ISA and the other half of this very extension, for pragmatic reasons: nobody would want negative offsets in those shortened instructions, so they are unsigned.
To be clear, that happens simply because ADDI sign-extends its immediate argument. This saves the need for separate immediate opcodes, and is reasonably consistent with the use of sign-extension elsewhere in the ISA.
Super dumb question (haven't taken a microprocessors class in over a decade):
How hard would it really be to custom-build a chip that was really simple, but modern? I'm thinking like the 6502 in the Commodore, but much faster. Or is the complexity in x86 an inherent property of modern performance? I guess what I'm getting at is: could you build something that is actually pretty darn fast if you don't need to run a modern OS like Windows or Linux on it, but keep it drastically simple?
I've been thinking it might be neat to have a blazing fast NUC sized computer that just boots into some barebones forth sitting on top of a few assembly words. Maybe with just enough peripherals to do some actual work (like load from SD card).
There's inherent complexity in the x86 instruction decoder, but after decoding, modern x86 CPUs don't particularly resemble legacy x86 CPUs internally. The native machine code is all proprietary "micro ops".
A lot of the performance gains over the last couple of decades haven't come from machine-code changes, but rather from various forms of pipelining and superscalar execution. Rather than run one instruction at a time, CPUs keep hundreds of instructions in flight at once. The complexity of doing that is that many instructions depend on the results of instructions immediately before them, so the CPU needs a lot of shortcut paths internally to keep from stalling out.
You could get to GHz speeds with a custom ASIC, and you could use base RISC-V as a modern, not-crufty ISA. It still won't be anywhere near as fast as a high-end x86 or ARM CPU, though, unless it's similarly pipelined and superscalar.
Yes, but specifically only for a general purpose CPU.
You could make a CPU optimized specifically for Forth or whatever, and likely achieve better performance for the number of transistors than otherwise. For example, if virtual memory isn't helpful, then you can omit that whole subsystem.
It's called an FPGA: a software-defined microprocessor (most higher-tier processors can be reprogrammed to a certain extent through microcode, but an FPGA is a specialised chip designed specifically for this). You can get ridiculously high performance for your specific application, but it will never excel at general-purpose computing. Good enough for Forth, great for ML, DSP, and number crunching, bad for running random apps from GitHub.
I've used FPGAs before (also ages ago). I guess I could create a simple chip on an FPGA and run a Forth on that, but I was thinking something more physical.
But would you not need to use an FPGA anyway for experimentation and prototyping? If you want something more physical, fast and small you need a spot at an actual fab. There is no other alternative. You might also want to take a look at the "Minimal Fab" technology. I found out about it a year or so ago and it looked really interesting!
I find it kind of funny that the majority of FPGA uses are indeed to implement a simple CPU (basically a programmable state machine), and the problem at hand is then solved by writing a program for it.
The convergence of RISC & CISC has shown that both things are true. A really simple ISA with a modern design can be pretty darn fast. At the same time, additional ISA complexity helps you wring the utmost out of your hardware, and IMO also helps in the bazaar environment - macro ops let a menagerie of hardware do work in the way most efficient for itself without hardware-specific binaries.
Under the covers, however, the hardware complexity is an inherent property of modern performance. For example, branch prediction.
I feel like X64 doesn't get enough love these days. I think new tutorials should be X64-first, and then after you get the hang of it talk about X32. It's harder the other way around, especially when it comes to calling conventions.
On the contrary, I felt that understanding where x86 had come from helped me understand the reason why the registers were named what they were, why some instructions are longer than others, etc.
It's confusing as a beginner to encounter the "di" and "dx" registers, and to understand why the 8-bit version of "r13" is "r13b" but the 8-bit version of "rdx" is either "dh" or "dl".
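For concreteness, this is how the aliases line up in 64-bit mode (NASM syntax; the value is arbitrary):

    mov rdx, 0x1122334455667788
    mov eax, edx      ; low 32 bits: 0x55667788 (writing a 32-bit reg also zeroes the upper half of rax)
    mov ax,  dx       ; low 16 bits: 0x7788
    mov al,  dl       ; low 8 bits:  0x88
    mov ah,  dh       ; bits 8-15:   0x77
    mov r13, rdx
    mov al,  r13b     ; the "new" registers r8-r15 only get a low-byte alias; there is no "r13h"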
Wouldn't it be better if you knew about all the current registers first and then learned about the older ones? Like most people, I learned X32 first, and I feel like I still read X64 relative to X32.
I suppose it depends how you learn! For some it is likely better to do what you suggested and start with what's current and then work backwards - but for me I needed to know why certain limitations were in place, and that required me to start at the beginning and work forwards.
As a newer developer this is a problem I've had with computing in general - a lot of newer languages, web frameworks, etc. are solving problems that I only understand through seeing what came before me. For example - why should I care about memory ownership if I've never tried to write a large-scale C program?
New tutorials? Don't they all date from when 32 bit registers were new, 16 bit registers mainstream, and MMX and x87 the new complicated instruction set additions?
It would certainly be a smaller tutorial, but the historical perspective might be valuable for understanding the cruft layers of design decisions that made sense decades ago and addressed problems that have ceased to exist.
I started with x86, after that x64 felt simply like a different flavor. There was more Googling to do, but I did that a ton already for x86, so I was used to the type of game I was playing.
You also need to define the system when talking about assembly; I think this is more relevant than the difference between architecture variants. I would recommend Linux because of the available documentation. Without doing any system-specific syscalls and learning about calling conventions, you won't get very far.
I think it depends on your goals. I am with you for writing assembly. For reading compiler-generated code, however, I think Windows is ideal as a starting point because you don't often get spoiled by the availability of source code and debugging symbols.
Umm, did anyone read the reference manual for x86_64? Intel's is big enough to kill a goat (~2000 pages). Add AMD's on top of that and you can kill a small tribe of goats.
There are not that many compiler-generated instructions, for either X32 or X64, if you look at it as a percentage of code. You have to look up the manual for rare instructions no matter what; you hardly need to read more than maybe a dozen pages to understand the common operations and control flow. My argument is that if you start with the current arch, that will be your frame of reference instead of X32 (why not X16, if earlier is better?).
I think it's much easier to begin with x86. Especially when the assembled code is being run the way it is on this website. x64 is a little harder because just the switch to x64 mode alone almost fills the boot sector.
And then you lose the ability to temporarily switch back to real mode to call BIOS functions, so you'd better load the rest of your program before switching to x64.
x86's purely stack-oriented C ABI calling convention makes things far less tedious. Sure, x86_64 executes faster with the register passing, but what a miserable thing to read/write as a human.
Maybe this is just a case of having learned and written 32-bit x86 assembler for years in my youth, but I strongly prefer it to x86_64.
C's stack-oriented ABI calling conventions also started with the PDP-11, for which the first portable C compiler was written, and the convention then tagged along to nearly every other architecture as pcc was progressively ported to other platforms.
If a function return value could fit into %r0 on the PDP-11 (%ax/%eax/%rax on x86), it would be returned in there. %r1-%r4 would be used to pass function parameters in, if they could fit, and/or spill over into the stack.
Heck, even UNIX system call conventions on x86 can be traced back to PDP-11, i.e. the syscall number is passed in %r0 (%eax) followed by a TRAP (INT on x86) instruction (can't remember which TRAP number, though).
But would you still prefer X32 if you had started with X64? My argument is that some would not find it tedious if that was the first thing they learned. Similar to how a Python programmer would share your sentiment about C because they learned Python first. I learned C first, so while I accept that it is more verbose and more work in general, I enjoy writing C more.
I think so. It's nice to have more registers and all, but the consistency of an entirely stack-oriented calling convention is simply more elegant and ergonomic IMHO. The register passing overflowing into stack-oriented at some arbitrary limit is just hideously warty, and obnoxious if you're actually writing assembly with C ABI function calling.
Changing the signature of a function requires rearranging which registers are being populated, and if there are enough parameters some go on the stack, and if you've rearranged the order, now you're moving some from the stack back into registers and vice versa. At least when everything always went through the stack, you just rearranged their positions on the stack. x86_64 will always be more annoying in this regard; it's not an ABI decision made with humans in mind at all. The assumption is (rightfully) that compilers are doing this work, and the perf win is significant.
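For anyone who hasn't written both by hand, here's the same call under each convention - a sketch, with a made-up function int add3(int a, int b, int c):

    ; 32-bit cdecl: everything goes through the stack
    push dword 3          ; arguments pushed right to left
    push dword 2
    push dword 1
    call add3
    add  esp, 12          ; caller removes the arguments
    ; result in eax

    ; x86_64 System V: first integer args in rdi, rsi, rdx, rcx, r8, r9
    mov edi, 1
    mov esi, 2
    mov edx, 3
    call add3             ; (the ABI also wants RSP 16-byte aligned at the call)
    ; result in eax

Add a fourth or a seventh parameter and it's easy to see why rearranging a signature is more annoying in the second form.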
I am fascinated with bootloaders and kernel writing. I am not very good at it, but I am fascinated by it, and every so often I try to learn some more. It feels a bit like a useless skill to learn (legacy BIOS bootloaders, that is) given UEFI dominance. But it connects me with my childhood playing with 286es and wondering how to program them.
I love articles like this that break it down. The biggest challenges so far have been getting the assembler to output the correct format (i.e. 16-bit real mode) and learning inline assembler in C (and getting GCC to output the correct format).
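For anyone stuck on the "correct format" part: with NASM the trick is the flat binary output plus bits/org directives. A minimal sketch (assumes NASM and QEMU are installed; not taken from the article itself):

    ; boot.asm - assemble with: nasm -f bin boot.asm -o boot.img
    ; run with:                 qemu-system-i386 -drive format=raw,file=boot.img
    bits 16                     ; real mode
    org  0x7c00                 ; the BIOS loads the boot sector here

    start:
        xor ax, ax
        mov ds, ax              ; org 0x7c00 assumes DS = 0
        mov si, msg
    .print:
        lodsb                   ; AL = next byte at DS:SI, SI++
        test al, al
        jz .hang
        mov ah, 0x0e            ; BIOS teletype output
        int 0x10
        jmp .print
    .hang:
        hlt
        jmp .hang

    msg db "Hello from the boot sector", 0

        times 510 - ($ - $$) db 0   ; pad to 510 bytes
        dw 0xaa55                   ; boot signature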
I definitely don't think it is a waste of time. I found it really fascinating to find out how a PC boots up from zero and how a very basic kernel is coded.
A long time ago I coded much of my professional stuff in assembly; I was hired specifically to optimize stuff. But that was 20 years ago, and compilers were not very smart.
How smart are compilers these days? Say, to optimize a small function, for example computing a scalar product or applying a 3D matrix transformation to a set of points.
IME: "it depends". You'll have to check the generated assembly code and then tweak your "high-level" C code to appease the compiler to generate code that's acceptable. Sometimes a small change in the high-level code is enough to break the "pattern matching" in optimizer passes.
Here's an example that looks like magic at first glance where the compiler converts manual bit twiddling code to a popcnt instruction (with the right compiler setting), but do the bit counting any other way, and the whole thing falls apart:
Pretty smart. The examples you gave are math-heavy, so to get the best performance you need to use some kind of SIMD instructions. For these you need to drop down a level, although not really to assembly - there are compiler intrinsics that you can use. And for simple functions, compilers are getting fairly good at autovectorization, meaning they introduce SIMD instructions automatically. But it's not something you can rely on.
Generally, they do lots of inlining, and then once you inline you can get some more optimizations in, rinse and repeat. Ends up pretty optimal. (This is C++, can't speak for other languages.)
We work on very perf-sensitive code and we never drop down to assembly. For hot loops, we usually inspect the generated assembly and if it's not great, it's fairly easy to "nudge" the compiler towards the better-performing solutions by tweaking the source code. Also some manual unrolling might be needed to better saturate the vector processing cores of modern CPUs.
And when you're working with signed integers, you still have to do stuff like a >> 1 instead of a / 2 :)
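The reason being that C's signed division truncates toward zero while an arithmetic shift rounds toward negative infinity, so for a possibly-negative int the compiler has to emit a fix-up. Roughly (Intel syntax, sketched by hand - actual output depends on compiler and flags):

    ; int a / 2, with a in edi
    mov eax, edi
    shr eax, 31        ; 1 if a is negative, else 0
    add eax, edi       ; bias negative values by +1 so the shift rounds toward zero
    sar eax, 1
    ; int a >> 1, with a in edi
    mov eax, edi
    sar eax, 1

If the value is known to be non-negative (or the type is unsigned), the fix-up disappears.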
If it was on StackOverflow, I would give it a "correct answer" flag :-)
But if it's now just down to some nudging, that's vastly better to me.
I provided math stuff and vectorisable stuff on purpose :-) Happy to see that vectorisation is somewhat automatic. I remember the MMX days and those were not that fun :-)
Someone well-versed in assembly can still do much better than the compiler in many cases over a reasonably small function. You could of course do better for large functions too, but at some point the cost becomes prohibitive: if you're going to do this, stick to a few hotspots which are as compact as possible.
I do know x86, for example, has things like STOS/STOSB/STOSW/STOSD, which specify a lot of behavior in a single instruction, and combined with the REP prefixes they're a pretty elegant way to do memory block operations. I don't think the 68000 had anything like that.
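For the curious, the whole "memset in a handful of instructions" trick looks roughly like this (16-bit NASM; "buffer" is just a placeholder label, and ES is assumed to already point at the right segment):

    cld                  ; make string ops move forward
    mov di, buffer       ; destination in ES:DI
    xor ax, ax           ; AL = fill byte (0)
    mov cx, 512          ; byte count
    rep stosb            ; store AL at ES:DI, advance DI, repeat CX times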
To be fair, Intel nerfed those instructions for a long time. See, DOS was using REPNZ STOSW for timing things. So as the processors got faster, those instructions didn't, so as to make DOS BIOS continue to operate correctly. It was a terrible time for x86.
At the same time, I used a clever 12-byte sequence of those opcodes to swap software interrupt vectors on process switch. So each process could have its own floating-point exception handlers etc. We called it the 'soft vector chain' and it was among the least known parts of our kernel.
A colleague (John McGinty) suggested in our old age we could consult by scratching our beards and saying "Ah! It must be the soft vector chain!" for every problem.
> its obvious why x86 won - you could get the critical jobs done.
It is a bit less obvious, in fact.
Motorola never looked at the 68k CPUs as a serious business - more like toying around with the CPUs all the time, or treating them as a less important spin-off of the main business. Their main sources of income were defence contracts (e.g. specialised or hardened microchips), microcontrollers, DSPs (which were pretty cool, by the way - all implementing the Harvard architecture), memory chips (I think), radios and field radio equipment, and later mobiles.
They carried largely the same attitude over to the 88k RISC and PowerPC CPU lines (albeit trying to compete more seriously for a while), but ultimately failed to catch up, leading to PowerPC's eventual demise. After that failure, they spun off anything CPU-, DSP- and microcontroller-related into Freescale, and the rest is now history.
It doesn't matter that much, really. What truly is ugly is interfacing with the rest of the computer, ugh. You can't use the VGA BIOS from x64, so go-o-o-od luck doing it via PCI. And properly setting up IOAPIC?
The problem I've always had with learning assembly (ARM in my case) was that I never really had a project to work on that would use any of that knowledge.
There was a recent thread [1] on optimizing some very primitive trig functions.
I recently watched a djb interview where he talked about the importance of fully utilizing available hardware [2 at 5:15]. That can be a good starting point, although at work it's usually easier to just consume more resources than to use what you've got more efficiently.
As my professor once said, "The only difference between programming in assembler and a high-level language is that you have to type more." I can confirm - early in my career I was able to sustainably produce more than a thousand lines of working x86 assembly code a day.
[0] https://notes.eatonphil.com/compiler-basics-lisp-to-assembly...
[1] https://notes.eatonphil.com/compiler-basics-an-x86-upgrade.h...
[2] https://notes.eatonphil.com/emulating-amd64-starting-with-el...
[3] https://notes.eatonphil.com/emulator-basics-a-stack-and-regi...