
> DO NOT COMPRESS ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

I know every open source project (and quite a lot of expensive proprietary ones!) comes with a "btw this software might wipe your computer, and if it does that's your fault lol" clause in its license, but I can't imagine trying to convince anyone that using this for anything remotely serious is a good idea with that line in the README.



Hi! Tool author here.

Almost every single open source compression tool contains a clause like this; the one you see in the README was lifted directly from the bzip2 README. Almost all open source projects (7-Zip, zstandard, xz-utils, etc.) ship under the same no-warranty terms, as exemplified by this quote from the MIT license:

> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

If you were willing to sign a commercial support contract with me on negotiated terms, I would be willing to provide you a warranty.

If you were not aware, this is essentially the business model of WinRAR. The reason tools like 7-Zip are not used by the public sector or by companies (at least here) is that they provide no warranty in case of data loss. If you actually buy WinRAR, however, you can hold them liable for damage to your archives. The "infinite 40-day trial" of WinRAR does not entitle you to compensation for damages, so corporate and public entities have to buy WinRAR licenses. WinRAR has never cared about personal customers.

In general, having to cope with imperfect reliability is part of using software you paid nothing for - you already get more than you paid for. Not to say that my tool is unreliable - I put a lot of effort into it - but it would put you in a bad light to complain about something you generously received for free :).


I did acknowledge that.

My point was more that if you went into a store to buy cereal and had two options, "Cornflakes" and "Cornflakes 2 - they're better!", and you noticed that while both packets carried the standard legal nonsense, Cornflakes 2 also said "This cereal almost certainly does not contain broken glass", human nature would make me go with the packet that didn't bring up broken glass in the first place - even if both of them have exactly the same chance of containing it.


Simple enough to be safe, at the cost of performance: uncompress and compare to the original.
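In shell terms, something like this (a sketch; I'm assuming bzip3 takes bzip2-style flags, and the filenames are placeholders):

    bzip3 -e -k important.dat                 # writes important.dat.bz3, keeps the input
    bzip3 -d -c important.dat.bz3 > roundtrip.dat
    cmp important.dat roundtrip.dat && echo "round trip OK"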


You could have bugs that show up on different hardware or compiler versions. So the round trip is table stakes but not a guarantee.

Edit: someone deleted a response saying that if you can read the data back, then the data is there. In a data-recovery sense that's definitely true, if it's consistent across inputs. But building something that simulates the undefined behavior or race condition - if it's even symmetrical between read and write - could be pretty tricky. And you'd have to decode based on the file's version number to know whether you need the emulation. So it's possible, but terrible to maintain, and the interim versions created between introducing and discovering the bug would still be broken.


And what do you do if it doesn't match?


Isn't it obvious? Warn the user, who can now use something else instead.


That only works if the "user" is an interactive TTY with a human on the other end of it, though. What if I were using this to compress automatic backups? Do I need an error handling routine that falls back to something else?
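For what it's worth, that routine can be a few lines of shell - a sketch, again assuming bzip2-style flags and a hypothetical backup.tar:

    bzip3 -e -k backup.tar
    if ! bzip3 -d -c backup.tar.bz3 | cmp -s - backup.tar; then
        echo "bzip3 round trip failed, falling back to gzip" >&2
        rm -f backup.tar.bz3
        gzip -k backup.tar
    fi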


A backup system should be reliable and able to report errors, no matter what they may be.


Your automatic backups may already be corrupted by random bit flips. This happens quite a lot with ZFS NAS systems where the admin forgot to set up a scrub job and still uses incremental backups.

Any read or write can fail for a multitude of reasons. The chance of an entire file being lost is rather small, but it's still an edge case that can happen in the real world if the FS flips to read-only at just the wrong time. Hell, on platforms like macOS, you can't even assume that fsync returning success means the data actually reached storage!
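(Tangent, but the scrub job is one cron line; "tank" here is a placeholder pool name:

    # e.g. in /etc/crontab: scrub the pool every Sunday at 03:00
    0 3 * * 0  root  /sbin/zpool scrub tank

Incremental backups then at least start from data ZFS has recently re-verified.)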


It reads to me more like: DON'T USE EXPERIMENTAL COMPRESSION TOOLS FOR BACKING UP YOUR IMPORTANT DATA. USE THEM FOR TRANSFERRING DATA, AND CHECK HASHES.
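In other words, something like this sketch (placeholder names; hashing the original means the check covers both the transfer and the compressor):

    sha256sum release.tar > release.tar.sha256
    bzip3 -e -k release.tar
    # transfer release.tar.bz3 and release.tar.sha256, then on the other end:
    bzip3 -d -k release.tar.bz3
    sha256sum -c release.tar.sha256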


The author of lzip goes into considerable - and rather excited - detail on the reliability and recoverability of the lzip format compared to xz.

https://www.nongnu.org/lzip/xz_inadequate.html

I personally back up about a terabyte each week, and I use 7-zip because it has built-in encryption, which is required because of the HR data in the backup. Thank heavens for xargs -P.

I could use "openssl enc" combined with any pure compression utility, but I don't want to make decompression impossible if I get hit by a bus.
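(For the curious, the pipeline I mean would look roughly like this - a sketch assuming bzip3 filters stdin to stdout like bzip2 does, and an openssl new enough for -pbkdf2, i.e. 1.1.1+:

    tar -cf - hr_data/ | bzip3 -e \
        | openssl enc -aes-256-cbc -pbkdf2 -salt -out backup.tar.bz3.enc
    # and the reverse:
    openssl enc -d -aes-256-cbc -pbkdf2 -in backup.tar.bz3.enc | bzip3 -d | tar -xf -

It works, but whoever restores it has to know the exact cipher and KDF flags, which is the bus-factor problem.)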


> https://www.nongnu.org/lzip/xz_inadequate.html

I have replaced all my previous uses of xz with lzip ever since I read that page (via https://news.ycombinator.com/item?id=32210438), but for some reason lzip never seems to rise to the same level of fame as xz. bzip3 also wasn't benchmarked against lzip.


I think you should just skip both xz and lzip, because that essay is in my opinion technically correct but deals only with a rather minor concern [1]. If you want recoverability out of archives, use a dedicated format like PArchive (example below) rather than an ordinary archive with half-baked recovery bolted on.

[1] Previously: https://news.ycombinator.com/item?id=39873122
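With par2 (the usual PArchive implementation) that looks roughly like this; -r sets the redundancy percentage, filename is a placeholder:

    par2 create -r10 backup.tar.xz     # writes backup.tar.xz.par2 plus recovery volumes
    par2 verify backup.tar.xz.par2
    par2 repair backup.tar.xz.par2     # rebuilds the file if the damage fits in the 10%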


For my personal backups, I actually use RAR with recovery records, optionally with encryption if the data is sensitive. I was only using xz in places where I couldn't use RAR (e.g. for work), and those places tend to also have lzip available.
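The relevant switches, for reference (a sketch: -rr adds the recovery record, -hp encrypts both data and headers and prompts for a password; the paths are placeholders):

    rar a -rr5% -hp backup.rar ~/documents
    rar t backup.rar      # test the archive
    rar r backup.rar      # repair using the recovery record if it ever gets damaged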


Honestly, I think this may be part of why in-flight compression has been a growth area lately. If your transport encoding becomes garbled, people notice immediately. If you manage to serve a few billion compressed requests without any incident reports, then maybe you can trust the codec for data at rest.


I had bzip2 eat some files some years back and ended up losing them. I don't trust bzip2 for backups, and as a result I wouldn't trust bzip3 either.


A shell script that checksums the original file, compresses, decompresses, and verifies the output's integrity should be trivial.
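Something along these lines (an untested sketch, bzip2-style flags assumed):

    #!/bin/sh
    set -e
    orig="$1"
    sum=$(sha256sum "$orig" | cut -d' ' -f1)
    bzip3 -e -k "$orig"
    roundtrip=$(bzip3 -d -c "$orig.bz3" | sha256sum | cut -d' ' -f1)
    if [ "$sum" = "$roundtrip" ]; then
        echo "verified: $orig.bz3"
    else
        echo "MISMATCH: $orig.bz3" >&2
        exit 1
    fi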


I wonder why this isn't built into the program.


I mean, hosting downloads that have been compressed with this seems fine. I have the original; I just want a smaller version for those downloading.



