I can understand the public cloud argument; if the cloud provider insists on you delivering an entire operating system to run your workloads, a unikernel indeed slashes the number of layers you have to care about.
Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.
Technically, in the former case, besides exploiting the application, a hypothetical attacker will also have to exploit a flaw in QEMU to start processes or gain further privileges on the host, but that's just a layer of indirection, which you can also get without resorting to hardware virtualization. Once inside QEMU, the entire assortment of your host's system calls and services is exposed, just as if you ran your code as a regular user space process.
This is the level at which you want to block exec() and other functionality your application doesn't need, so that neither QEMU nor your code, when run directly, can do anything outside its scope. Adding a layer of indirection while still leaving the user/kernel or unikernel/hypervisor junction points unsupervised will only stop unmotivated attackers looking for low-hanging fruit.
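To make that concrete, here's a minimal sketch of the kind of host-level filter I mean, using Linux seccomp via libseccomp (the blocked syscall list is illustrative only; a real sandbox policy for QEMU or your app would be much larger):

    /* Minimal sketch: deny exec() to this process and anything it spawns.
     * Assumes libseccomp (build with -lseccomp); SCMP_ACT_KILL_PROCESS
     * needs libseccomp >= 2.4.  Illustrative, not a complete policy. */
    #include <seccomp.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Allow everything by default, then carve out what we never need. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (ctx == NULL)
            return 1;

        /* Kill the process on any attempt to exec. */
        seccomp_rule_add(ctx, SCMP_ACT_KILL_PROCESS, SCMP_SYS(execve), 0);
        seccomp_rule_add(ctx, SCMP_ACT_KILL_PROCESS, SCMP_SYS(execveat), 0);

        if (seccomp_load(ctx) < 0) {   /* also sets no_new_privs by default */
            seccomp_release(ctx);
            return 1;
        }
        seccomp_release(ctx);

        printf("filter installed in pid %d\n", getpid());

        /* Any exec attempt from here on terminates the process. */
        execlp("true", "true", (char *)NULL);
        return 0;
    }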
> Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker.
Some unikernels are intended to run under a hypervisor or on bare metal. Bare metal means you need some drivers, but if you have a use case for a unikernel on bare metal, you probably don't need to support the vast universe of devices, maybe only a few instances of a couple of device types.
I've got a not-at-all-production-ready hobby OS that's adjacent to a unikernel; it runs in virtio hypervisors and on bare metal, with support for one NIC. In its intended hypothetical use, it would boot from PXE, with storage on nodes running a traditional OS, so supporting a handful of NICs would probably be sufficient. Modern NICs tend to be fairly similar in interface, so if the manufacturer provides documentation, it shouldn't take too long to add support at least once you've got one driver doing multiple tx/rx queues and all that jazz... plus or minus optimization.
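To give a flavor of why they end up looking alike: the transmit path on most modern NICs boils down to a DMA descriptor ring plus a doorbell register. A rough, hypothetical sketch of that common shape (field names and flag values are invented, not any particular NIC's layout):

    /* Hypothetical sketch of a NIC transmit queue: a ring of DMA
     * descriptors plus a doorbell register.  Every real NIC differs in
     * the details, but the structure is roughly this. */
    #include <stdint.h>

    struct tx_desc {            /* one entry in the DMA descriptor ring */
        uint64_t buf_addr;      /* physical address of the packet buffer */
        uint16_t length;        /* bytes to send */
        uint16_t flags;         /* e.g. end-of-packet, request interrupt */
        uint32_t status;        /* written back by the NIC when done */
    };

    struct tx_queue {
        struct tx_desc *ring;        /* descriptor ring in DMA-visible memory */
        uint32_t        size;        /* number of descriptors (power of two) */
        uint32_t        head;        /* next slot software will fill */
        uint32_t        tail;        /* last slot hardware has completed */
        volatile uint32_t *doorbell; /* MMIO register: tell the NIC "go" */
    };

    /* Queue one packet for transmission; returns 0 on success, -1 if full. */
    static int tx_enqueue(struct tx_queue *q, uint64_t buf_phys, uint16_t len)
    {
        uint32_t next = (q->head + 1) & (q->size - 1);
        if (next == q->tail)
            return -1;                   /* ring full */

        struct tx_desc *d = &q->ring[q->head];
        d->buf_addr = buf_phys;
        d->length   = len;
        d->flags    = 0x3;               /* end-of-packet + interrupt (made up) */
        d->status   = 0;

        q->head = next;
        *q->doorbell = q->head;          /* notify the NIC of new work */
        return 0;
    }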
For storage, you can probably get by with two drivers, one for sata/ahci and one for nvme. And likely reuse an existing filesystem.
Do you usually publish your hobby code publicly? If not, consider this an appeal to do so (:
> Modern NICs tend to be fairly similar in interface, so if the manufacturer provides documentation, it shouldn't take too long to add support at least once you've got one driver ... For storage, you can probably get by with two drivers
I take it that there aren't any pluggable drivers for NICs like there are for nvme/sata disks?
> I take it that there aren't any pluggable drivers for NICs like there are for nvme/sata disks?
I mean, there is NDIS / NDISWrapper. Or, I think it wouldn't be too hard to run netbsd drivers... but I'm crazy and want my drivers in userland, in Erlang, so none of that applies. :)
As a fair warning, there are some concurrency errors in the kernel, which I haven't tracked down, that sometimes result in getting stuck before the shell prompt comes up; the tcp stack is just ok enough to mostly work, and the dhcp client only works if everything goes right.
Erlang! Indeed a crazy idea (in a good way!), and while I'm not normally a big fan of unikernels, now you've got me seriously intrigued :)
I've been dabbling in Erlang and OS development myself, my biggest inspirations being Microsoft Singularity and QNX. The former is a C# lookalike of what you're making, or at least that's how it seems from my perspective.
The readme mentions a FreeBSD-like system call interface, but then the drivers and the network stack are written in Erlang and, as you've mentioned, run in userland. Is that actually a unikernel design with BEAM running in the kernel, or more of a microkernel hosting BEAM, with BEAM providing device handling and the user space?
The original plan was BEAM on metal, but I had a hard time getting that started... so I pivoted to BEAM from pkg, running on a just enough kernel that exposes only the FreeBSD syscalls that actually get called.
Where that fits in the taxonomy of life, I'm not sure. There is a kernel/userspace boundary (and also a C-code/Erlang-code boundary in userspace), so it's not quite a unikernel. I wouldn't really call it a microkernel either; there's none of the usual microkernel stuff... I just let userspace do i/o ports with the x86 task structure (the I/O permission bitmap) and do memory mapped i/o by letting it mmap anything (more or less). The kernel manages timers/timekeeping and interrupts; Erlang drivers open a socket to get notified when an interrupt fires (level-triggered interrupts would be an issue). The kernel also does thread spawning and mutex support, connects pipes, early/late console output, etc.
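In generic C terms (not my actual code; the real driver side is Erlang and the kernel interface differs), the interrupt handoff is basically the "interrupt as a readable fd" pattern:

    /* Generic sketch of the pattern: the kernel's interrupt handler writes
     * one byte to its end of a socketpair, the userland driver blocks on
     * read(), services the device, and re-enables the IRQ.  irq_open() and
     * irq_ack() are hypothetical kernel-provided helpers, shown only to
     * illustrate the shape of the interface. */
    #include <stdint.h>
    #include <unistd.h>

    int  irq_open(int irq);   /* hypothetical: returns read end of a socketpair */
    void irq_ack(int irq);    /* hypothetical: unmask/re-enable after handling */

    void driver_loop(int irq, void (*handle_device)(void))
    {
        int fd = irq_open(irq);
        uint8_t tick;

        for (;;) {
            /* Blocks until the kernel's interrupt handler writes a byte. */
            if (read(fd, &tick, 1) != 1)
                break;
            handle_device();  /* talk to the device via MMIO or port I/O */
            irq_ack(irq);     /* fine for edge-triggered interrupts;
                                 level-triggered needs masking first */
        }
        close(fd);
    }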
If I get through my roadmap (networked demo, uefi/amd64 support, maybe arm64 support, run the otp test suite), I might look again and see if I can eliminate the kernel/userspace divide now that I understand the underneath, but the minimal kernel approach lets me play around with the fun parts, so I'm pretty happy with it. I've got a slightly tweaked dist working and can hotload userspace Erlang code over the network, including the tcp stack, which was the itch I wanted to scratch... nevermind that the tcp stack isn't very good at the moment ;)
Really cool! Will definitely take a closer look in my spare time.
>I just let userspace do i/o ports [...] and do memory mapped i/o by letting it mmap anything (more or less). The kernel manages timers/timekeeping and interrupts [...]
This is how QNX does it too, allowing privileged processes to use MAP_PHYS and port I/O instructions on x86, and to handle interrupts as if they were POSIX signals. It all boils down to how you structure your design, but personally, I think that's not a bad approach at all. The cool thing about it is that, after the initial setup, you can drop the privileges for creating further mappings and handlers, reducing the attack surface.
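From memory, the QNX Neutrino side of that setup looks roughly like this (the exact signatures are my recollection of the QNX docs, so double-check them; the physical address and IRQ line are placeholders):

    /* Rough sketch of a QNX userspace driver setup: grant the process
     * I/O privileges, map the device registers, and have its interrupt
     * delivered as an ordinary signal.  Addresses and IRQ are made up. */
    #include <sys/neutrino.h>
    #include <sys/mman.h>
    #include <signal.h>
    #include <stdint.h>

    #define DEV_PHYS_ADDR 0xfeb00000u   /* placeholder MMIO address */
    #define DEV_MMIO_LEN  0x1000u
    #define DEV_IRQ       11            /* placeholder IRQ line */

    int attach_device(volatile uint32_t **regs)
    {
        /* Ask the kernel for I/O privileges (port I/O, interrupt attach). */
        if (ThreadCtl(_NTO_TCTL_IO, 0) == -1)
            return -1;

        /* Map the device's registers into our address space. */
        void *base = mmap_device_memory(NULL, DEV_MMIO_LEN,
                                        PROT_READ | PROT_WRITE | PROT_NOCACHE,
                                        0, DEV_PHYS_ADDR);
        if (base == MAP_FAILED)
            return -1;
        *regs = (volatile uint32_t *)base;

        /* Deliver the device's interrupt to us as a plain signal. */
        struct sigevent ev;
        SIGEV_SIGNAL_INIT(&ev, SIGUSR1);
        int id = InterruptAttachEvent(DEV_IRQ, &ev, _NTO_INTR_FLAGS_TRK_MSK);
        return id;  /* the signal handler services the device, then
                       calls InterruptUnmask() to re-enable the IRQ */
    }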
Unless you're trying to absolutely minimize the cost and number of context switches, I think moving BEAM into the kernel would be a downgrade, but again, I'm a big proponent of microkernels :)
I can't speak for all the various projects, but imo these aren't made for bare metal - if you want true bare metal (metal you can physically touch), use Linux.
One of the things that might not be so apparent is that when you deploy these to something like AWS, all the user/process management/etc. gets shifted up and out of the instance you control and into the cloud layer. I feel that would be hard to do with physical boxen, because it becomes a slippery slope of certain operations (such as updates) needing auth, for instance.
> Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.
Toro can run on bare metal, although I stopped supporting that a few years ago. I tagged the commit in master where this happened. Also, I removed the TCP/IP stack in favor of VSOCK. Those changes, though, could be reversed if there is interest in those features.
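For anyone unfamiliar with VSOCK: from the host's point of view it's just a stream socket addressed by (CID, port) instead of an IP address. A rough Linux-side example (the guest CID and port below are placeholders for whatever the VM and application actually use):

    /* Rough example of the host side of a VSOCK connection on Linux. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        struct sockaddr_vm addr;
        memset(&addr, 0, sizeof(addr));
        addr.svm_family = AF_VSOCK;
        addr.svm_cid    = 3;        /* placeholder: the guest's context ID */
        addr.svm_port   = 1234;     /* placeholder: the service's port     */

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }

        const char msg[] = "ping\n";
        write(fd, msg, sizeof(msg) - 1);
        close(fd);
        return 0;
    }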
> In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls?
Hypervisors expose a much smaller API surface area to their tenants than an operating system does to its processes, which makes them much easier to secure.
That is an artifact of implementation. Monolithic operating systems with tons of shared services expose a lot to their tenants. Austere hypervisors, the ones with small API surface areas, basically implement a microkernel interface, yet still expose significantly more surface area and offer a significantly worse guest experience than microkernels. That is why high security systems designed for multi-level security for shared tenants that need to protect against state actors use microkernels instead of hypervisors.
> That is why high security systems designed for multi-level security for shared tenants
When you say "high security" do you mean Confidential Computing workloads run by Trusty (Enclave) / Virtee (Realm) etc? If so, aren't these systems limited in what they can do, as in, there usually is another full-blown OS that's running the user-facing bits?
> that need to protect against state actors
This is a very high bar for a software-only solution (like a microkernel) to meet? In my view, open hardware specifications, like OpenTitan, in combination with a small-ish software TCB, make it hard for state actors (even if not impossible).
No. I am talking about multi-level security [1], which allows a single piece of hardware to handle top secret and unclassified materials simultaneously via software protection. This protection is limited to software attempts to access top secret materials from the unclassified domain; hardware and physical attacks are out of scope.
There have been many such systems verified to be secure against state actors according to the TCSEC Orange Book Level A1 standard and the subsequent Common Criteria SKPP standard, which requires both full formal proofs of security and an NSA-run, multi-month penetration test that must identify zero vulnerabilities before usage is allowed in NSA and DoD systems.