home3-Hero.jpg

Big Archive Blog

Data Integrity Checksums

Share this post:   

Versity’s most recent VSM 1.6.0 release introduces several new types of checksum algorithms, which are used to ensure data correctness. It inspired us to step back and think about what checksums are, why people use different checksum algorithms, and the role they play in long term data archiving.


Introducing checksums


Checksums are basically a small piece of computed information about a larger piece of digital data, usually a file. They can be thought of as data fingerprints because the checksum output changes if the data changes. The small piece of information, the checksum, is used to check data after it’s been transferred or stored, to verify that the information is still the same as it was before. The computation used to compute the checksum is referred to as the checksum algorithm. There are several different types of algorithms, which mostly perform the same task (computing a checksum) but are very unique in their implementation details.


The main characteristics of checksums are performance, collision likelihood, and size.


Performance in this case is measured by a combination of what compute resources are needed and how long the computation takes to complete. Too fast and it is likely that the algorithm is not very strong which can lead to security issues. Too slow and it causes problems with users or admins choosing to turn checksumming off completely rather than wait for them to be calculated. More compute power can help computations complete more quickly, but there are algorithms where even more or faster processors do not increase the performance of a single checksum significantly.


Collisions occur when two distinct inputs (two different files) produce the same checksum output. When used to verify data correctness this is a problem because the checksum is delivering a false positive verification. Interestingly, all checksum algorithms produce collisions due to something called the pigeonhole principleThe goal is to select an algorithm that is highly unlikely to produce collisions but still computationally efficient. In February 2017, Marc Stevens et al published a paper that demonstrated the ability to generate  SHA-1 collision with the equivalent of 150 days of computation on a single 3.2 GHz Haswell Quad Core i5 processor. This work demonstrated that SHA-1 was not only theoretically a weak algorithm, but also vulnerable in a practical sense. The SHA-256 algorithm by comparison would require a super computer capable of processing 15 trillion SHA-256 computations per second approximately 650 million years to generate a single collision. It is interesting to note that collisions are also a problem because they can be used to overwrite data, derive data, or even create load balancing issues that can lead to denial of service - all scenarios to be aware of from a security point of view. For archiving, the biggest malicious threat is that collisions could be used to maliciously corrupt data without detection from checksum verification.


The size of a checksum is the physical amount of space the checksum takes on the storage device, usually reported in bits. In general the more advanced algorithms, as discussed above tend to make bigger and therefore less collision prone checksums. The lower likelihood of collisions comes from an increase in the total hash space.  The quality of the algorithm determines how evenly the hashes are spread across the available space. For some well done visualizations of the hash space and how the algorithms spread over it see the top answer on this this stack exchange thread. The size can be indicated in the name of the hash, for example, SHA-256 makes a resulting checksum that is 256 bits.


The checksum algorithm is really a special kind of hash function. A hash function is a function, or process, that can be used to map data of arbitrary size to data of a fixed size. The types of hashes used for data integrity are distinguished by the presence or absence of keys and cryptographic properties. Keyed hashes require an additional piece of data, a key, to compute the checksum. This makes it inconvenient for data integrity checks as there is an additional piece that is required for operations. Non-cryptographic hashes tend to be more performant than cryptographic hashes, which are designed to deliver more guarantees when it comes to the output.


Cryptographic checksum algorithms are designed with several different security characteristics in mind. Most importantly they are built to be one way, which means you cannot derive the data from the checksum. They are also built to avoid collisions, ensure that the hash changes significantly if the data changes, and to be resistant to brute force attacks such as pre-image. As a result of these design decisions, cryptographic algorithms are often compute intensive and tend to be slower than non-cryptographic algorithms. For a nice overview on how cryptograhic hashes work, see this illustrated guide. If data is sensitive or protected by regulations, cryptographic functions are likely worth the performance tradeoff. If performance is of paramount importance and the goal is to simply look for unintentional errors like bit flips that occured in the data transfer, non-cryptographic checksums are probably a good choice.


Checksums and archiving


In archiving, checksums are used to make sure that transferred or stored copies of data match the data that was originally created. The application or system where the data was created computes a checksum which is stored, and then used later for comparison. The place where the data is stored or transferred need only apply  the same checksum algorithm which should produce the same checksum. If differences are detected after comparing checksums, it can be concluded that the data was somehow corrupted, either accidentally or maliciously. Corrective action can then be taken to repair the data, or replace it with a correct copy.


In VSM, a user or admin can choose to generate a checksum to be stored in the filesystem inode, which relieves the user from the responsibility of storing the checksum in a separate file or database. VSM also optionally automates the check, not returning success until the entire file is retrieved from archive and the checksum is verified.


Types of checksums; and which one to pick


As mentioned previously, checksum algorithm implementation details are really what distinguish different types of checksums. Which one to pick depends on requirements for size of the checksums, time to compute the checksums, tolerance for collisions, and need for security. The complete list of checksums is beyond the scope of this article, so we will stick to the ones available in VSM. These were chosen for their ability to detect data corruption, their performance since they must be able to keep up with tape drives which are up to 300MB/s in the current generation, and to some degree their overall popularity. There are both cryptographic and non-cryptographic offerings but all are keyless, meaning that no additional information other than the original data is needed to compute the checksum.


The first supported checksum in VSM is MD5.  MD5 stands for message-digest algorithm which produces a 128 bit checksum. It is a cryptographic algorithm. MD5 is fast but not considered to be as secure as the SHA series. While the method to craft MD5 collisions is known and the algorithm in general has been declared as not cryptographically secure, the collisions are extremely unlikely to hit accidentally so the algorithm may still be used with a reasonable degree of certainty for many data integrity applications. In addition to the existing MD5 algorithm currently offered by VSM, the VSM 1.6.0 release includes a vastly improved MD5 algorithm which performs the calculation much more quickly - delivering a little over a factor of 4 improvement in our internal testing. This new and improved checksum is a great choice for VSM customers who were using the previous version of MD5.


SHA is a family of cryptographic checksum algorithms that are published by NIST. There are 14 variants, 3 of which are supported in VSM: SHA-1, SHA-256, and SHA-512. SHA-1 produces a 160 bit checksum and is the highest performing checksum in this family, followed by the 256 and then 512 versions. This family of checksums is a great choice for VSM users who have requirements from their organization for cryptographic capabilities.  


Murmur3 is a non-cryptographic hash offered in VSM. The supported version is MurmurHash3_x64_128, which is optimized for 64 bit architectures and produces a 128-bit checksum. It typically performs more quickly than cryptographic functions, but should not be used for security applications. This is a great choice for VSM users who were previously using the “simple” algorithm as this has shown almost a factor of 3 improvement in performance over “simple” and is the fastest checksum algorithm offered by VSM.  This might be of interest to customers using object or disk archiving with high performance requirements.


Checksums are an important piece of the overall data protection strategy. There are many different types and which one to use depends on requirements from the organization and users. At Versity we deliver rock solid data protection for large data collections, and as part of that delivery we offer several different kinds of checksums to meet our customer’s needs.

Learn More about VSM and how it protects data.

Share this post:   

More about the author: