Checksums and Robust Backup

Checksums

If you are ever responsible for moving files around a workflow (perhaps offloading camera storage to a computer or transferring data between facilities), you should be using checksums to protect your assets.

A checksum is a short value, computed from a file’s contents, that identifies that file. If even a single bit of data changes within a file, a freshly computed checksum will not match the original one, and you will know something went wrong. Because different files contain different data, it is extremely unlikely that two different files will share a checksum, meaning you can confirm that a file is authentic and complete after you receive it. So, checksums are like fingerprints. They follow files wherever they go, and tell you if those files are what they say they are (or not).

Checksums can be used for any type of digital file (or even for collections of files), but they are particularly useful in post-production. They help safeguard your footage as it’s handled by dozens of different users and let you know when there is a problem.

Because video files are made up of millions and billions (and sometimes trillions) of individual bits, handling them is a complex process. And complexity breeds errors. If just a single one of those bits gets corrupted or lost, then your files may be completely useless. For video production, losing files often equals losing hundreds or thousands of hours of work. Lost footage from just a single take can cost a production tremendous amounts of money. While checksums won’t protect against everything (like a drive being stolen or a nervous PA spilling coffee on the DIT cart), they are an important part of a safe workflow because they allow you to notice quickly when something has gone wrong.

Any time you are offloading footage from a day’s shoot, creating backups, delivering dailies, or sending footage to another team (basically any time you are moving files between machines), you should be using checksums to validate data integrity. While incorporating checksums into your workflow does require a small investment, the value they add in terms of safety and security is well worth it. Checksums detect problems as they happen, which lets you act immediately. This could potentially save you a lot of money (and countless grey hairs).

The actual mathematics behind checksums is pretty complicated (if you know what a hash is, the checksums discussed here work the same way). Different algorithms use very different methods to generate checksums for files, but in general, the process is the same. Let’s start with a simple example.

Imagine a file as a number, for instance, 2341842. Because the data of this number is so simple, we can manually create a checksum by adding the digits of the number together.

2+3+4+1+8+4+2=24

Now that we have the checksum value, we can just tack it on to the end of the number, so the number is now 234184224. Since we did this process ourselves, we know the last two digits are not part of the original data, but are merely the checksum value.

Now imagine we mail this number to a friend. Unfortunately, the letter gets slightly wet along the way, and the 8 smears so that it now looks like a 6. This is a problem. However, the good news is that we also wrote in the letter that the last two digits of the number are the checksum value. So, when our friend reads the letter, they will know to add the digits together, and will immediately see that there is a problem.

2+3+4+1+6+4+2=22≠24
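
The digit-sum example above can be sketched in a few lines of Python. This is only an illustration of the toy scheme, not how real checksum tools work. The function names are invented for this sketch, and it assumes the checksum always fits in two digits.

```python
def digit_sum(number: int) -> int:
    """Add up the decimal digits of a non-negative whole number."""
    return sum(int(digit) for digit in str(number))


def append_checksum(number: int) -> str:
    """Tack the two-digit checksum onto the end of the number."""
    return f"{number}{digit_sum(number):02d}"


def verify(message: str) -> bool:
    """Recompute the digit sum and compare it with the last two digits."""
    data, check = message[:-2], message[-2:]
    return digit_sum(int(data)) == int(check)


sent = append_checksum(2341842)   # the number with its checksum attached
damaged = "234164224"             # the 8 was smeared into a 6 in transit

print(sent)
print(verify(sent))     # the intact message passes
print(verify(damaged))  # the smeared message fails
```

The toy scheme also has blind spots. If two digits trade places (2341482 instead of 2341842), the sum is still 24 and the error goes unnoticed, which is one reason real algorithms are far more involved.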

This is a very simple checksum algorithm, but it gives you an idea of how data verification works. Obviously, real computer files are far more complex, so computers employ much more rigorous methods. For example, the popular checksum algorithm MD5 produces a 128-bit checksum value, which is usually written as a string of 32 hexadecimal characters (the digits 0-9 and the letters a-f). That makes MD5’s validation process very precise. It is extremely unlikely that a file could change by accident but still have the same checksum value.
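
To see what the 32-character MD5 value looks like in practice, here is a minimal sketch using Python’s standard `hashlib` module. The two byte strings are stand-ins for file contents and simply reuse the numbers from the example above.

```python
import hashlib

# Stand-ins for an original file and a copy with one character changed.
original = b"2341842"
damaged = b"2341642"

original_sum = hashlib.md5(original).hexdigest()
damaged_sum = hashlib.md5(damaged).hexdigest()

print(original_sum)                 # 32 hexadecimal characters
print(damaged_sum)                  # a completely different 32 characters
print(len(original_sum))            # length of the printed checksum
print(original_sum == damaged_sum)  # the change is detected
```

Unlike the digit-sum example, changing one character here changes most of the checksum rather than nudging it by a small amount.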

To use checksums in your workflow, you will need software that supports them as a feature. By default, Windows and macOS do not check for bit-level errors when copying files, so you will need specialty software. ShotPut Pro and Hedge are popular options, though more mainstream software like DaVinci Resolve has now integrated checksum validation. If you know how to use the Terminal/Command Prompt, generating checksums yourself is actually quite simple. But no matter which software you use, the basic steps are the same.

First, the software reads the file and generates a checksum. This is known as the source checksum, and it is attached to the file. Next, your computer makes a copy of the file you want to move, and then transfers the copy to another location. Once the copy has been fully written to the new location, the software reads it back and generates a second checksum, which is called the destination checksum. Finally, the software compares the source and destination checksums to make sure they match exactly. If they do, the transfer was successful. If they don’t, the software will sound the alarm.
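
The steps above can be sketched with Python’s standard library. This is a simplified illustration, not how any particular offload tool is implemented. The names `file_checksum` and `verified_copy` are hypothetical, the file names are placeholders, and MD5 is used only because it is the algorithm discussed here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "md5") -> str:
    """Read a file in 1 MB chunks and return its checksum as hex text."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source: Path, destination: Path) -> str:
    """Copy a file, then compare source and destination checksums."""
    source_sum = file_checksum(source)            # step 1: source checksum
    shutil.copyfile(source, destination)          # step 2: make the copy
    destination_sum = file_checksum(destination)  # step 3: destination checksum
    if source_sum != destination_sum:             # step 4: compare
        raise OSError(f"Checksum mismatch while copying {source.name}")
    return source_sum


# Demonstration with a small throwaway file standing in for camera footage.
with tempfile.TemporaryDirectory() as folder:
    clip = Path(folder) / "clip.mov"
    clip.write_bytes(b"stand-in for camera footage" * 1000)
    backup = Path(folder) / "clip_backup.mov"
    checksum = verified_copy(clip, backup)
    copy_matches = file_checksum(backup) == checksum

print(checksum)
print(copy_matches)
```

Reading in chunks keeps memory use low even for very large video files. This sketch does not try to bypass the operating system’s file cache, so a dedicated tool may verify the destination more strictly than this does.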

If you’re storing a hard drive on a shelf for a long time (which can result in small bits of information being lost), then it’s a good idea to generate checksums for the files at the start, so you can double-check that your files are intact later on.
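
One way to do this is to record a checksum for every file before the drive goes on the shelf, then recompute and compare them later. The sketch below is a minimal illustration under that assumption. The function names, the file names, and the idea of keeping the record as JSON text are all choices made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the MD5 checksum of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's relative path to its checksum."""
    return {
        path.relative_to(folder).as_posix(): file_checksum(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def find_problems(folder: Path, manifest: dict) -> list:
    """List files that are missing or no longer match the manifest."""
    problems = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif file_checksum(target) != expected:
            problems.append(f"changed: {name}")
    return problems


# Demonstration: a temporary folder stands in for the archive drive.
with tempfile.TemporaryDirectory() as tmp:
    drive = Path(tmp)
    (drive / "A001.mov").write_bytes(b"take one")
    (drive / "A002.mov").write_bytes(b"take two")

    # Keep this text somewhere other than the archive drive itself.
    manifest_text = json.dumps(build_manifest(drive), indent=2)

    problems_before = find_problems(drive, json.loads(manifest_text))
    (drive / "A002.mov").write_bytes(b"take twp")  # simulate silent corruption
    problems_after = find_problems(drive, json.loads(manifest_text))

print(problems_before)
print(problems_after)
```

Storing the manifest separately from the drive matters, because a record kept only on the drive it describes can be damaged along with the files.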

Other than picking software, you will also need to decide which algorithm you want to use. Different algorithms have different inherent qualities and limitations, but most software allows you to pick according to your needs. As mentioned above, MD5 is very popular in the video industry, in large part due to its speed and cross-compatibility with Windows, macOS, and Linux. However, MD5 dates from the early 1990s and is no longer considered secure against deliberate tampering, so many people are using newer algorithms like xxHash, SHA-2, or SHA-3.
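
As a rough illustration of how these algorithms differ, the sketch below computes checksums of the same data with three algorithms from Python’s standard `hashlib` module and reports the size of each result. SHA-256 and SHA3-256 are used here as representatives of the SHA-2 and SHA-3 families. xxHash is not part of the standard library, so it is left out of this sketch.

```python
import hashlib

data = b"example footage bytes"  # placeholder for real file contents

# Checksum size in bits for each algorithm (each hex character is 4 bits).
sizes = {}
for name in ("md5", "sha256", "sha3_256"):
    checksum = hashlib.new(name, data).hexdigest()
    sizes[name] = len(checksum) * 4
    print(f"{name:9} {sizes[name]:4d} bits  {checksum}")
```

A longer checksum makes an accidental match even less likely, though in practice speed and software support tend to drive the choice.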

If this seems complicated, don’t worry. Software automates most of the process. In normal circumstances, using checksums is not much more difficult than the traditional “drag and drop” method of file transfer. So really, there aren’t many reasons not to be using checksums in your workflow. They offer peace of mind at very little cost to your time or budget.

However, as useful as checksums are for ensuring data integrity, they are only part of the solution to securing your video files. The bigger piece of the puzzle is building a robust backup system.

Robust Backups

Not backing up your video files is like playing Russian roulette with your job and reputation. You might think your odds of survival without sufficient backups are good, but even just one lost file can cost you clients or end contracts. With that in mind, you need to make a plan for backing up all of your data.

There is no one-size-fits-all scheme for backing up video files. Every workflow will have unique requirements that change the level of backup you need, but there are always limited resources to go around. The key task is to balance the level of protection against how likely a loss is and how much it costs to guard against it. You don’t want to lose data, but you also don’t want to spend your whole budget securing your footage. So, there are a few general rules you should follow when planning your backup system.

Rule 1: You should always (always) have at least one extra copy of every single file in your project. That means everything from the most important shot of the entire film to seemingly inconsequential metadata files for the credit sequence. If someone in the workflow needs it, then it’s important and should be backed up.

Rule 2: Backups must not be connected to the originals. Putting a copy and original on the same computer or network is a recipe for disaster. One virus or bug could delete both files. If the files are directly connected by a wired or wireless connection, then it’s not a sufficient backup.

Rule 3: Backups should be in a different physical location than originals. Power goes out, thieves break in, and moth and rust destroy. When disaster strikes your office or studio, all your computer systems and storage could be ruined. The best way to prepare for this issue is to have offsite backups.

Rule 4: All backup storage must be independently secure and stable. This rule dictates what type of storage you should actually use to back up your critical files. Sure, you might have terabytes of files in offsite backups, but if all that data is being kept on 10-year-old hard drives, then those files are probably not that safe. The same can be said of certain RAID configurations that have a relatively low fault tolerance. While you might think RAID arrays are generally more secure than single drives, this is not always true. RAID 0, for example, provides no redundancy for your files, so if even one disk fails, you lose everything. Because every additional disk is another point of failure, RAID 0 is actually less reliable than a single hard drive. Always keep your backup files on the most reliable, durable storage media you can find.

Rule 5: Backups must be organized. If you follow the above rules, but don’t keep a detailed record of where you put your backup files and what they are called, then you might as well not have bothered. If disaster strikes your production, and you need to pull your backup files, it is absolutely necessary that you know where everything is. Otherwise, the backups are practically nonexistent.

Rule 6: Backups must be accessible. Just like the previous rule, backup files are no good unless you have them in hand. You might have an offsite backup in a safety deposit box somewhere, but if the bank isn’t open 24/7, then you may not be able to restore files in time to fix a problem or meet a deadline. Even if you keep offsite backups in the cloud, a slow internet connection can get you in trouble. You might have to wait multiple days to download your terabyte-sized backup. This is not to say all of your backups must be accessible at all times, but at least one complete backup should be.
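
To put rough numbers on the download problem, here is a back-of-the-envelope sketch. The function name is made up for this example, the connection speeds are hypothetical, and the calculation assumes a decimal terabyte and a connection that sustains its full rated speed the whole time, which real connections rarely do.

```python
def download_days(size_terabytes: float, speed_megabits_per_second: float) -> float:
    """Estimate how many days a download takes at a steady speed."""
    total_bits = size_terabytes * 1e12 * 8  # terabytes -> bytes -> bits
    seconds = total_bits / (speed_megabits_per_second * 1e6)
    return seconds / 86400                  # 86,400 seconds in a day


# Hypothetical backup sizes (TB) and connection speeds (Mbps).
for size, speed in [(1, 50), (4, 100), (4, 1000)]:
    print(f"{size} TB at {speed} Mbps: {download_days(size, speed):.1f} days")
```

Under these assumptions, a 1 TB backup takes almost two days to pull down at 50 Mbps, so it is worth running this kind of estimate before relying on a cloud backup to meet a deadline.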

Following these rules will make your backup procedures much more robust. Of course, there are always risks, but you can do your part to mitigate them.
