Video files keep getting bigger. Every year we get more pixels, more colors, and more data. Thankfully, computer storage has generally kept pace, with larger and larger drives available at lower and lower prices. However, in the era of digital video workflows the problem is not just where to fit all our footage, but how to work with it efficiently. This is where RAID comes in handy.
RAID stands for Redundant Array of Independent Disks. As you can probably guess from the name, RAIDs are a method for connecting multiple physical drives to act as a single virtual storage device. This lets the drives work together to increase performance, add redundancy, or a mix of both. RAID is not the only way to use a bunch of drives together, but it is the most common.
Given the choice of a single large hard drive or a RAID of the same capacity, you should almost always choose the RAID. Why? Because RAIDs far exceed the capabilities of a single drive. Most individual hard drives are just too slow for smooth editing of 4K+ files, but multiple hard drives in a RAID can split up the work to deliver much faster read/write speeds. Not only that, but many RAIDs offer at least some measure of fault tolerance (the ability to survive drive failure). If you have one drive and it fails, that’s it, your data is lost. But certain RAID configurations can have up to half of their drives fail without you losing anything. That’s a big deal.
The only aspect of RAID that is worse than using a single drive is the cost. If you need 8TB of storage, an 8TB RAID will be more expensive than an 8TB hard drive. That said, you’re not really getting the same thing from both these options. Yes, both options will store the same amount of data, but what you’re really paying for is the added benefits of RAID. If you wanted to buy a single drive that was as fast and high capacity as a RAID, it would be much more expensive (if you could find one at all). And of course, there is no way for a single disk to be redundant by itself, so if that’s what you need then RAID is the better (only) option. In general, RAIDs deliver the best performance and highest redundancy for the least cost.
RAIDs come in a wide variety of (physical) shapes and (storage) sizes, so you can add the benefits of RAID to practically any workflow. Headed into the field for a remote editing job and only need a couple terabytes just for yourself? Take a small mobile RAID along to edit faster or with a little more redundancy (the smallest RAIDs can’t offer both benefits at once, but that’s still better than a typical external drive). Running a studio with a dozen editors and several hundred terabytes of footage? Configure a rack-mounted RAID server to meet your team’s particular technical requirements.
Different configurations of RAID, called levels, are each optimized for certain attributes that trade between cost, redundancy, and speed in a particular way. Even though RAIDs all do the same basic thing (store files), there are distinct advantages and disadvantages to each level. So, you will need to pick RAID options carefully for each task in your workflow. But before you can decide on the best options for your production, you first need to understand the basics of how RAID works.
All RAIDs are made up of roughly the same parts: multiple drives (from two to hundreds) inside an enclosure of some kind, and a controller. The controller instructs the array of drives how to handle data, so in essence, the controller determines what level the RAID will be. The controller can be purpose-built hardware that actually attaches to the drives, or it can be software on the computer connected to the array. Controllers use three fundamental techniques to handle the data on the RAID: data striping, data mirroring, and data parity.
Data striping is a technique for cutting up sequential data and storing the cut-up pieces on separate drives. This allows for faster performance, because the multiple drives work together to read/write the file, so the task is completed in less time. For example, a RAID can split a 1GB video into smaller pieces, and write those pieces to separate drives. When the file needs to be accessed again, each drive only has to retrieve its portion of the data for the controller. That means a striped array of two drives can deliver close to double the performance of a single drive, and adding more drives can raise that further. However, as appealing as this increased speed is, data striping also brings risk. If any of the file’s data is lost on one drive, then the entire file is lost. Even if there are 10 drives, with data striped equally across them, one drive failure will cause total data loss. The files simply can’t be put back together if there is a missing piece, no matter how small. This is a considerable problem, because the more drives in an array, the greater the chance of a drive failure occurring.
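To make the idea concrete, here is a minimal Python sketch of striping. It is only an illustration: the function names `stripe` and `unstripe`, the 4-byte chunk size, and the use of in-memory lists to stand in for drives are all assumptions, and real controllers work on much larger blocks at the hardware or driver level.

```python
def stripe(data: bytes, num_drives: int, chunk_size: int = 4) -> list[list[bytes]]:
    """Cut data into fixed-size chunks and deal them round-robin across drives."""
    drives: list[list[bytes]] = [[] for _ in range(num_drives)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        drives[chunk_index % num_drives].append(data[offset:offset + chunk_size])
    return drives


def unstripe(drives: list[list[bytes]]) -> bytes:
    """Reassemble the data by reading one chunk from each drive in turn."""
    pieces = []
    rows = max(len(drive) for drive in drives)
    for row in range(rows):
        for drive in drives:
            if row < len(drive):
                pieces.append(drive[row])
    return b"".join(pieces)


video = b"frame-0001 frame-0002 frame-0003"
array = stripe(video, num_drives=3)
print(unstripe(array) == video)   # every drive present: data comes back intact

array[1] = []                     # simulate losing one drive
print(unstripe(array) == video)   # pieces are missing: the file cannot be rebuilt
```

Each simulated drive holds only a fraction of the chunks, which is where the speed comes from, and also why losing any one of them leaves gaps throughout the file.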
This risk is why some RAID levels also employ data mirroring. Data mirroring is a technique that stores a file (or piece of a file) identically across multiple drives. This does not increase the speed of writing data, but it does provide huge gains in redundancy. If that same 1GB video file were stored on a RAID that used data mirroring across two disks, then one of those drives could fail completely without any data being lost, because the full file is stored on both drives. And even though file write times are not improved, file read times are faster, since the controller can still read a file from separate drives at the same time. The biggest downside of data mirroring is that it does not increase the total usable storage of the array. Two 1TB drives mirrored to each other will create a RAID with a capacity of only 1TB. That means this array is twice the cost of a single 1TB drive, but has the same capacity. Redundancy is useful because it reduces risk, but in this regard it’s also expensive.
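A small Python sketch can illustrate both sides of that trade. The class `MirroredArray` and the helper `mirrored_capacity_tb` are hypothetical names made up for this example, and dictionaries stand in for physical drives.

```python
class MirroredArray:
    """Toy model of mirroring: every write goes to every working drive."""

    def __init__(self, num_drives: int = 2):
        self.drives = [{} for _ in range(num_drives)]
        self.failed = set()

    def write(self, name: str, data: bytes) -> None:
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                drive[name] = data

    def fail(self, index: int) -> None:
        """Simulate a dead drive by discarding its contents."""
        self.failed.add(index)
        self.drives[index] = {}

    def read(self, name: str) -> bytes:
        for index, drive in enumerate(self.drives):
            if index not in self.failed and name in drive:
                return drive[name]
        raise IOError(f"no surviving copy of {name}")


def mirrored_capacity_tb(drive_sizes_tb: list[float]) -> float:
    """A mirror can only hold as much as its smallest drive."""
    return min(drive_sizes_tb)


array = MirroredArray(num_drives=2)
array.write("clip.mov", b"video data")
array.fail(0)
print(array.read("clip.mov"))            # still readable from the second drive
print(mirrored_capacity_tb([1.0, 1.0]))  # two 1TB drives give 1TB usable
```

The read succeeds after one failure because a full copy exists on the other drive, while the capacity helper shows what that protection costs.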
Data parity is a way to keep an internal backup without requiring double the number of drives. Using a simple calculation across the drives (typically a bitwise XOR), parity lets one drive’s worth of capacity protect multiple other drives at the same time. Parity data is not a full backup, however. On its own, parity data is meaningless, but it’s possible to reconstruct a lost drive by combining the parity data with the other drives in the array. Parity adds an extra level of complexity, requires a higher-performance controller, and increases wear on the drives, which introduces new risks that have to be accounted for by the user.
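Single parity is commonly computed as a byte-by-byte XOR across the data drives. The sketch below shows that idea in Python; the helper name `xor_blocks` and the three tiny sample blocks are assumptions chosen for illustration, not how any particular controller lays out its data.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives, each holding one small block (sample values).
data_drives = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data_drives)

# Suppose drive 1 dies. XOR-ing the survivors with the parity block
# cancels out everything except the missing block.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_drives[1])
```

The parity block is the same size as one data block, which is why a single drive's worth of space can cover the loss of any one drive, but never two at once.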
If you think this is all too complicated for your workflow, don’t worry. You don’t need to be a computer engineer or programmer to use RAID. The controller does most of the hard work managing the data, so you can relax. That said, you should decide whether hardware or software-based RAID fits your needs best. Hardware RAID controllers are generally more expensive, but the performance is often better. You also won’t need to install any special software on your workstation. That means you can plug the RAID into basically any machine, and it will just work. Software RAID, on the other hand, is more affordable, and still provides great performance. And because the controller is software, you can change the RAID configuration as you need from project to project. Not all hardware controllers have this same level of flexibility. That said, software-based arrays are not as easy to set up. Most RAID levels are possible with either hardware or software controllers; it just depends on your budget and technical requirements.
Now that you know what RAIDs do with your data, you can pick RAID solutions that meet your needs. You don’t even need to start from square one to decide. Several RAID levels are already proven for certain parts of post-production workflows.
RAID 0 is an array of drives with striped data. This offers wonderful speed and simplicity, as using only two drives can double your read and write speeds (compared to a single drive). However, as mentioned before, using multiple drives to stripe data also increases the chance of data loss. If either drive fails, then you lose all of your data. A good use case for RAID 0 is as an editing drive or scratch disk, where maximum performance is necessary, but the media is backed up somewhere else. RAID 0 is especially useful for mobile editing solutions, because you only need a small, portable array of two disks. That said, only use this configuration if you make frequent backups of the data you’re putting on the array.
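The growing risk can be put into rough numbers. The Python sketch below assumes every drive fails independently with the same probability over some period; the 3% figure is a made-up example value, not a measured failure rate, and the function name is hypothetical.

```python
def striped_array_failure_probability(per_drive_failure: float, num_drives: int) -> float:
    """Chance that at least one drive fails, which is fatal for a striped array.

    Assumes drive failures are independent and equally likely.
    """
    return 1 - (1 - per_drive_failure) ** num_drives


ASSUMED_FAILURE_RATE = 0.03  # hypothetical chance a single drive fails in a year

for drives in (1, 2, 4, 10):
    risk = striped_array_failure_probability(ASSUMED_FAILURE_RATE, drives)
    print(f"{drives:>2} drives: {risk:.1%} chance of losing the array")
```

Under these assumptions a two-drive RAID 0 is nearly twice as likely to lose data as a single drive, which is the reason a separate backup matters so much here.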
RAID 1 is an array of mirrored drives. This delivers excellent redundancy, while also being easy to maintain. You only get write performance equal to a single drive, but read speeds can be up to double, because the array can still read files from multiple disks at once. RAID 1 needs only 2 drives, so these units can also be quite portable, making them very useful for camera offloads and field backups. The biggest downside is the capacity and performance you get for the cost. In order to mirror data, the total array capacity can only be as big as the smallest drive. So, for a 2-drive RAID 1 you’ve paid for two drives, but can only use the capacity of one.
RAID 10 (1+0) is a combination of RAID 1 and RAID 0, where the drives are grouped into mirrored pairs, and then data is striped across those pairs. This gives you good read/write performance, while also allowing for a significant level of redundancy. In the best case scenario, up to half the drives in a RAID 10 can fail without any data loss (as long as no mirrored pair loses both of its drives), and you still get double the performance of any single drive. RAID 10 is very useful for many different applications. Just like the lower levels of RAID, it can be used as an editing drive for one user. But where RAID 10 really shines is in input-output intensive tasks, like media servers and databases for multiple simultaneous users. The biggest downside of this configuration is that you only get half the total capacity of the arrayed drives. RAID 10 requires more drives from the start (at least 4), so you may not like paying for storage capacity that you cannot use. Overall, RAID 10 gives you a fantastic balance of performance and redundancy. It is an excellent solution for many workflow tasks that need high data throughput.
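Whether a RAID 10 survives depends on which drives fail, not just how many. The following Python sketch models that, assuming a layout where drives 0 and 1 form one mirrored pair, drives 2 and 3 the next, and so on; that numbering and the function names are assumptions for illustration.

```python
from itertools import combinations


def raid10_survives(num_drives: int, failed: set[int]) -> bool:
    """Data survives unless both drives of some mirrored pair have failed."""
    if num_drives < 4 or num_drives % 2:
        raise ValueError("this sketch expects an even number of drives, at least 4")
    pairs = [(i, i + 1) for i in range(0, num_drives, 2)]
    return not any(a in failed and b in failed for a, b in pairs)


def surviving_fraction(num_drives: int, num_failures: int) -> float:
    """Share of all possible failure combinations that the array survives."""
    combos = list(combinations(range(num_drives), num_failures))
    survived = sum(raid10_survives(num_drives, set(c)) for c in combos)
    return survived / len(combos)


print(raid10_survives(4, {0, 2}))  # one drive lost from each pair: data is safe
print(raid10_survives(4, {0, 1}))  # both halves of one pair lost: data is gone
print(surviving_fraction(4, 2))    # share of two-drive failures that are survivable
```

This is why "up to half the drives" is a best case: the only guarantee is surviving one failure, and anything beyond that depends on where the failures land.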
Pretty much any post-production team can benefit from these RAID levels. They are easy to understand, simple to use, and widely available from a large number of vendors. However, beyond a certain scale, these levels can incur unnecessary cost. 100TB in RAID 10 would require 200TB of drives. Do you really want to pay for all that? It would certainly be a robust system, but that’s money you could be using elsewhere in your project.
The good news is that there are other RAID levels that let you use more of the total storage capacity within an array. The bad news is that these RAID levels are more complicated, and introduce new risks that should be carefully considered before you use them in your workflow.
RAID 5 is a striped array with distributed parity. That means that every time a file is written to this kind of array it is striped across the drives, and then a parity calculation is performed, which is also written to the array. Parity data is not all kept on the same drive, but is distributed equally across them. Striping gives RAID 5 improved performance, while distributed parity allows it to survive the failure of a single drive. But the main advantage of RAID 5 is that it gives you more storage for your money. If you had five 8TB hard drives, RAID 10 would only yield 20TB of storage (50% capacity utilization), while RAID 5 would give you 32TB of storage (80% capacity utilization). And as more drives are added to a RAID 5 array, the capacity utilization improves. For this example, doubling the number of disks to 10 would give you 72TB in RAID 5 (90% capacity utilization). With RAID 5 you can use more of the capacity you pay for, while also enjoying performance gains and some level of redundancy.
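The capacity figures above follow from simple arithmetic, sketched here in Python. The function names and the level labels are hypothetical, and the sketch assumes all drives in the array are the same size.

```python
def usable_capacity_tb(level: str, num_drives: int, drive_tb: float) -> float:
    """Usable space for an array of identical drives."""
    raw = num_drives * drive_tb
    if level == "raid0":
        return raw                          # striping only: all space is usable
    if level == "raid1":
        return drive_tb                     # every drive holds the same copy
    if level == "raid10":
        return raw / 2                      # half the space goes to mirrors
    if level == "raid5":
        return (num_drives - 1) * drive_tb  # one drive's worth goes to parity
    raise ValueError(f"unknown level: {level}")


def utilization(level: str, num_drives: int, drive_tb: float) -> float:
    """Usable capacity as a fraction of raw capacity."""
    return usable_capacity_tb(level, num_drives, drive_tb) / (num_drives * drive_tb)


for level, drives in (("raid10", 5), ("raid5", 5), ("raid5", 10)):
    usable = usable_capacity_tb(level, drives, 8)
    share = utilization(level, drives, 8)
    print(f"{level} with {drives} x 8TB: {usable:g}TB usable ({share:.0%})")
```

Because RAID 5 always gives up exactly one drive's worth of space, the share lost to parity shrinks as the array grows, while RAID 10 stays at 50% no matter how many drives are added.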
But all this comes at a cost. The controller must work much harder to constantly distribute parity data across the array. Because of this, software controllers are usually not sufficient for good performance on RAID 5, which means you will have to pay more for dedicated hardware controllers. Also, RAID 5 is not for the lazy or technically naive. These arrays need careful attention, because there is only one point of fault tolerance. A failed drive must be replaced very quickly, or else the entire array is in jeopardy. Even if you do catch a drive failure and replace it immediately, the rebuild times of large arrays can be lengthy, so if another drive fails during that process, you will have completely lost all of your data. RAID 5 is definitely a cost-effective option that delivers performance and redundancy at scale, but you need to understand the risks and technical obligations such a configuration entails. If you are prepared for that, then RAID 5 is a good option for media servers and databases in distributed production settings.
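To get a feel for how lengthy a rebuild can be, here is a back-of-the-envelope Python estimate. It assumes the replacement drive must be rewritten in full at a steady rate; the 100 MB/s figure is a placeholder, not a benchmark, and real rebuilds are often slower because the array is still serving users.

```python
def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    """Rough lower bound on rebuild time for one replaced drive.

    Assumes the whole drive is rewritten at a constant rate, with
    1 TB counted as 1,000,000 MB.
    """
    total_mb = drive_tb * 1_000_000
    return total_mb / rebuild_mb_per_s / 3600


ASSUMED_RATE_MB_PER_S = 100  # hypothetical sustained rebuild speed

for size_tb in (4, 8, 16):
    hours = rebuild_hours(size_tb, ASSUMED_RATE_MB_PER_S)
    print(f"{size_tb}TB drive: about {hours:.1f} hours with no fault tolerance")
```

Under these assumptions an 8TB drive takes the better part of a day to rebuild, and the array has no protection against a second failure for that entire window.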
RAID 6 is a striped array with double distributed parity. It works almost identically to RAID 5, but it writes two independent sets of parity data across the array rather than just one. That means it can suffer 2 drive failures without total data loss, while still giving you better capacity utilization than a mirrored array. RAID 6 is also more practical than RAID 5 for arrays with a large number of drives, because double parity removes some of the stress of long rebuild times. That said, RAID 6 still requires expensive hardware controllers to achieve consistent performance, and you will need to pay attention to its operation to spot drive failures. If you do that, then RAID 6 is an excellent option for media servers, databases, and other tasks that need high-availability data.
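A quick side-by-side shows what the second set of parity costs and what it buys. This Python sketch assumes identical drives and uses a hypothetical `summarize` helper; for RAID 10 it reports only the number of failures that are guaranteed to be survivable, since more depends on which drives fail.

```python
def summarize(level: str, num_drives: int, drive_tb: float) -> tuple[float, int]:
    """Return (usable TB, number of drive failures guaranteed survivable)."""
    if level == "raid5":
        return (num_drives - 1) * drive_tb, 1   # one drive's worth of parity
    if level == "raid6":
        return (num_drives - 2) * drive_tb, 2   # two drives' worth of parity
    if level == "raid10":
        return num_drives * drive_tb / 2, 1     # worst case: both halves of a pair
    raise ValueError(f"unknown level: {level}")


for level in ("raid10", "raid5", "raid6"):
    usable, tolerated = summarize(level, num_drives=10, drive_tb=8)
    print(f"{level}: {usable:g}TB usable, survives any {tolerated} drive failure(s)")
```

With ten 8TB drives, RAID 6 gives up one more drive's worth of space than RAID 5 in exchange for surviving any two failures, and still leaves far more usable space than RAID 10.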
RAIDs are awesome. They give you more performance, more capacity, and more redundancy than single drives. Odds are, you’ll use lots of different RAIDs in almost every stage of your workflow. But keep this in mind: no single RAID is a backup. Period. Why? Because a RAID is still one device. A RAID 1 with 10 disks might store 10 copies of your data, but it still keeps all those drives in a single enclosure. Yes, in this case your data is protected against drive failure, but it is not protected against someone knocking that enclosure off your desk. If that happens, you’ll still lose everything, no matter how many copies there are inside the box. So, treat RAIDs like they’re supposed to be treated: as a single drive. And for goodness’ sake, don’t put anything on a vulnerable RAID (like RAID 0 and RAID 5) without keeping a backup somewhere else first.