> Honestly, I'd expect the exact opposite. Filesystems are really good at storing files. Why not leverage all that work?
There are lots of different file systems out there, and you won't always get a say in which one your cloud vendor offers. However, if you can launch a container that layers an abstraction on top of the file system, one that keeps its best parts and papers over its shortcomings in a mostly standardized way, then you can benefit from it.
That's not always the right approach: it seems to work nicely for relational databases and how they store data, whereas for storing larger pieces of binary data there are advantages and shortcomings to either option. At the end of the day, it comes down to tradeoffs: what workload you're dealing with, what you want to achieve, and so on.
> That's a misconfiguration issue though, not a reason to not store blobs as files on disk. Ext4 can handle 2^32 files. ZFS can handle 2^128(?).
Modern file systems are pretty good and can support lots of files, but getting a VPS from provider X doesn't mean they'll be configured to. Or maybe you have to use a system that your clients or employer gave you, one that with such an abstraction would be capable of doing what you want, but currently isn't. I agree that it's a misconfiguration in a sense, but not always one that you can rectify yourself.
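You don't have to guess whether inode limits are an issue on a given box; you can just look. A quick sketch using Python's standard library `os.statvfs`, which exposes the total (`f_files`) and free (`f_ffree`) inode counts for a mount point (some filesystems, e.g. btrfs, report 0 because they allocate inodes dynamically):

```python
import os

# Inspect inode capacity and usage on a given mount point.
st = os.statvfs("/")
total, free = st.f_files, st.f_ffree
if total:
    used = total - free
    print(f"inodes: {used} used of {total} ({100 * used / total:.1f}% full)")
else:
    # Filesystems like btrfs don't have a fixed inode table.
    print("filesystem does not report a fixed inode count")
```

The same numbers are what `df -i` shows; the point is that the limit is a property of how the volume was created, not something an application can raise after the fact.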
> * This requires tuning to actually reduce the number of inodes used for certain datasets. E.g., if I'm storing large media files, that chunking would _increase_ the number of files on disk, not reduce it. At which point, if inode limits are the issue, we're just making it worse.
This is an excellent point, thank you for making it! However, it's not necessarily a dealbreaker: you can probably gauge what sort of data you're working with (e.g. PDF files around 100 KB in size, or video files around 1 GB each) and tune accordingly, or perhaps let such a system rebalance data into chunks dynamically as needed.
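To make "tune accordingly" concrete, here's a hypothetical sizing rule (the constants `MIN_CHUNK` and `MAX_CHUNKS` are assumptions, not taken from any particular system): small objects stay in a single chunk, so chunking never increases the file count for them, and the chunk size grows with the object so the per-object chunk count stays bounded.

```python
MIN_CHUNK = 4 * 1024 * 1024   # 4 MiB floor (assumed tunable)
MAX_CHUNKS = 64               # assumed cap on chunks per object

def chunk_size(object_size: int) -> int:
    """Pick a chunk size so an object is stored in at most MAX_CHUNKS pieces."""
    if object_size <= MIN_CHUNK:
        return object_size or 1          # tiny objects: one chunk, one file
    # grow the chunk with the object so the chunk count stays bounded
    return max(MIN_CHUNK, -(-object_size // MAX_CHUNKS))  # ceiling division

print(chunk_size(100 * 1024))   # a 100 KB PDF: one chunk, one file on disk
print(chunk_size(1024 ** 3))    # a 1 GiB video: split into at most 64 chunks
```

Under this sketch the 100 KB PDF costs exactly one inode, and the 1 GiB video costs 64 rather than thousands, which is the kind of rebalancing trade the comment above is gesturing at.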
> * It adds additional complexity. Now you need to account for these chunks, and, if you care about the data, check it periodically.
As long as things keep working, many people won't care (which admittedly isn't the best stance to take). How many care what happens inside their database when they run SQL queries against it, or what happens under the hood of their S3-compatible store of choice? I personally like keeping things as simple as possible in most cases, but the popularity of something like Kubernetes shows that simplicity isn't always what we go for as an industry.
I could say the same about using PostgreSQL for workloads where SQLite might be sufficient, or about reaching for a huge enterprise framework for a boring CRUD app when something with a codebase a tenth the size would do. But hey, as long as people don't constantly get burned by these choices and can solve the problems they need to solve to make money, good for them. Sometimes an abstraction or a piece of provided functionality reasonably outweighs its drawbacks and is therefore a viable choice.
> * You need specific tooling to work with it. Files on a filesystem are.. files on a filesystem. Easy to backup, easy to view. Arbitrary chunking and such requires tooling to perform operations on it. Tooling that may break, or have the wrong versions, or.. etc.
This is actually the only point where I'll disagree.
You're always one directory traversal attack away from having a really bad time. That's not to say it will always happen, or that accessing unintended data can't happen with other storage solutions (the adjacent example of relational databases will make anyone recall SQL injection, and S3 has its stories of insecure buckets leaking confidential information). But being told that you can just use the file system will have many people reaching for files as an abstraction in the programming language of their choice, without always considering the risks of sub-optimal engineering, like directory traversal attacks or badly set file permissions.
Contrast this with a scenario where you're given a (presumably) black box that exposes an API: what's inside the box is code written by people more clever than you (the "you" here being an average engineer), code that handles many concerns you might not even have thought of. And if there are ever serious issues, or good reasons to peel back that complexity, you can look up the source code of that black box on GitHub and start diving in. In the case of MinIO and many other storage solutions, that's already what you get, and it's good enough. That's why I or others might use something S3-compatible, or something that hands out signed URLs for downloading files: so you don't have to think about (or mess up) how the signing works. It's also why I and many others would be okay with a system that eases the burden of thinking about file systems by at least partially abstracting them away. Edit: removed unnecessary snarky bits about the admittedly leaky abstractions you often get.
Honestly, that's why I like databases that let you pick whichever storage engine suits your workload, similar to how object storage solutions might approach the issue: give the user the freedom to choose how their blobs are stored at the lower level, with sane defaults otherwise. Those defaults might as well be plain files on a filesystem. And with object storage, that's before we even get into file names (especially across different OSes), potential conflicts and versioning, or the maximum file size supported by whichever file systems you need to support.
To put it pretty bluntly: you were off the rails at "getting a VPS from provider X doesn't mean that they will". You're talking in terms of not having a custom kernel, and that's just the wrong layer of abstraction if we're talking about "cloud"; this whole discussion is really about VM and colo levels of abstraction anyway ("Cloud" advice would be "Just use your vendor's S3 blobstore").
Base Ubuntu has xfs support. If your VPS provider won't run plain old Ubuntu with some cloudinit stuff, get a new VPS provider.