Ask HN: How do you choose a checksum algorithm for serialized data structures?
4 points by packetlost on May 24, 2023 | 4 comments
I'm building a library that serializes blocks of data to disk in roughly 4 MB increments. What would be a sufficient number of bits to allocate for checksums (e.g. CRC32, CRC64, MD5, etc.) on those blocks such that corruption, torn writes, etc. can be detected?
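For concreteness, here's a minimal sketch (Python, with a hypothetical "CRC prefix" block layout, not anything from the library in question) of stamping each block with a CRC-32 on write and verifying it on read:

    import struct
    import zlib

    HEADER = struct.Struct("<I")  # hypothetical layout: 4-byte little-endian CRC-32 prefix

    def seal_block(payload: bytes) -> bytes:
        # Prepend the CRC-32 of the payload so corruption/torn writes are detectable.
        return HEADER.pack(zlib.crc32(payload)) + payload

    def open_block(block: bytes) -> bytes:
        # Recompute the CRC-32 and compare against the stored value.
        (stored,) = HEADER.unpack_from(block)
        payload = block[HEADER.size:]
        if zlib.crc32(payload) != stored:
            raise ValueError("block checksum mismatch (corruption or torn write)")
        return payload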


I would use SHA-256 or SHA-512. As far as I know, MD5 and SHA-1 can be considered broken from a security standpoint.
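If you go the cryptographic-hash route, the standard library already covers it; a sketch (the 16-byte truncation is just an illustration of trading header space against strength, not a recommendation from the thread):

    import hashlib

    def block_digest(payload: bytes, size: int = 16) -> bytes:
        # Full SHA-256 is 32 bytes; truncating keeps per-block overhead down
        # while remaining far stronger than needed for accidental corruption.
        return hashlib.sha256(payload).digest()[:size]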

I think more information is needed. If you're storing files, PAR2 can help. If you're hashing passwords, bcrypt and scrypt should be investigated.

Securing the database against bitrot, etc. would be another question entirely.


Right, I'm asking purely from a data corruption and bitrot standpoint. There are no security requirements; the blocks are part of a larger data structure (so a subset of a file, or maybe of a raw block device). CRC is common for smaller data structures, but 4 MB is "large" compared to, say, a Postgres data page (usually 8 kB), which uses a 16-bit checksum (a modified FNV-1a hash rather than a true CRC).
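As a back-of-envelope figure (assuming corruption behaves like random bit flips, which is an assumption, not a guarantee CRCs make for multi-megabyte inputs), an n-bit checksum misses roughly 2^-n of random corruptions regardless of block size:

    # Rough chance that a random corruption slips past an n-bit check.
    for bits in (16, 32, 64):
        print(f"{bits}-bit checksum: ~1 in {2**bits:,} undetected")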


You could also delegate this task to ZFS.

https://en.m.wikipedia.org/wiki/ZFS

ZFS is a file system that is available on FreeBSD, Linux, and more!


Offloading onto the filesystem isn't really applicable here; this is for a "page" or block of data within a larger data structure that may or may not even live in a file (on a filesystem).



