As large archive sites around the world plan their future data storage systems, one of the key areas of concern is the scalability of the namespace. Data collections with over one billion files become difficult to manage with current technology that typically relies on a central server for metadata management.
In 2016, Versity started working on a new archiving filesystem specifically designed to manage ultra large namespaces by using a cluster of nodes to scale out metadata handling. The scale out architecture led us to the name “scale out filesystem” or ScoutFS.
Today, we have reached an important milestone in the development of ScoutFS. We are releasing the technology under the GPLv2 open source license.
ScoutFS is a POSIX compliant scalable clustered file system designed and implemented for archiving large data sets to low cost external storage resources such as tape, disk, object, and cloud. Key areas of innovation within the ScoutFS project include increasing the capacity of POSIX namespaces while being performant, eliminating the need for file system scans, and harnessing the power of multiple nodes to reach extremely high file creation rates.
If you would like to go straight to the project to learn more, here are the links:
Motivation and requirements
One might wonder why we decided to tackle the development of a new filesystem. The short answer is that no existing filesystem met our needs and no existing archiving file system was GPL.
- An open source GPL archiving file system is an inherently safer and more user-friendly long-term solution for storing archival data, where accessibility over very large time scales is a key consideration. Placing archival data in proprietary file systems is costly and subjects the owner of the data to many vendor-specific risks, including discontinuation of the product. As in other technology verticals, it is likely that open source archival file system software will come to dominate the landscape.
- Versity customers need us to support massive namespaces - billions of files today, and tens to hundreds of billions in the near future. With exascale workloads quickly approaching, we set a goal to efficiently operate on at least one trillion files in a single namespace. We did not see an archiving filesystem technology on the horizon that would meet our namespace goals.
- The Versity archival user space application must efficiently find and apply archive policy to any added or modified files in the filesystem namespace. Additionally, there may be service level constraints that require any new or modified files to be archived within a set amount of time. The cost of a traditional namespace sweep using readdir() and stat() grows in proportion to the total number of files and directories in the namespace, so at some scale, depending on hardware and filesystem implementation, this method would exceed the service level requirements. While some filesystems might handle this workload better than others, it was clear that no filesystem could scale it to one trillion files while maintaining the service level requirements.
- Enterprise archive systems must be highly available and resilient to individual component failures in order to meet enterprise data protection needs. A clustered filesystem has the dual benefit of remaining available during individual node failures and scaling out I/O performance across multiple hosts. This allows the use of commodity servers, optimizing the system's price/performance and price/capacity while remaining highly available.
- No two archive deployments are the same. Every site has different requirements to meet its individual business needs. Versity needs a flexible filesystem that can run on various hardware and storage devices depending on the specific deployment requirements of the site. This means we need a filesystem that works well across the full range of devices, from slower rotational drives to the latest solid state technologies.
- Versity supports multiple archive targets, including tape, object, and disk archives. These all have demanding streaming performance requirements, but tape has even more constraints than the others. Tape devices must be written with a single data stream and rely on a fixed-size buffer to maintain tape performance. The performance of a tape archive is strictly limited by the number of tape drives deployed, and the economics of large tape archive systems demand that each drive run as close to peak speed as possible to maximize the value of the system. If per-drive performance is not maximized, then more costly tape drives must be deployed to meet the overall performance requirements of the system. Therefore Versity requires a filesystem capable of single-stream performance in excess of tape drive speeds while multiple streams to multiple tape devices are in flight.
- For an archive system, data correctness is paramount. The Wikipedia article on data corruption outlines several cases where underlying storage devices have been found to return corrupted data without any notification of error. Because we cannot rely on block devices to return correct data 100% of the time, an archive filesystem must be able to detect when the underlying data or metadata has been corrupted and either correct the data or return an error to the user.
- The typical archive capacity far exceeds the reasonable capacity of a block storage filesystem. The Versity archive application copies file data to the archive storage tiers for long term storage. To make space for new incoming data, the file data for inactive files needs to be removed from the filesystem. The POSIX metadata, however, must remain online and accessible for immediate access. This requires an interface that allows for the removal of file data without changing any of the POSIX attributes of the file. When the inactive files become active again, the Versity archive application must retrieve the data from the archive tier and have an interface to be able to reconstitute the file data payload within the filesystem.
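The release/reconstitute interface described above has no direct POSIX equivalent, which is part of why a purpose-built filesystem is needed. The toy model below illustrates the concept only; the class and method names (`archive`, `release`, `stage`) are hypothetical illustrations, not the ScoutFS or Versity interface.

```python
import time

class ArchivedFile:
    """Toy model of the release/stage concept: release() discards the
    data payload to free filesystem capacity while preserving the
    POSIX-visible attributes (size, mtime), and stage() reconstitutes
    the payload from the archive copy. Hypothetical names throughout."""

    def __init__(self, data):
        self.data = data
        self.size = len(data)     # POSIX size, preserved across release
        self.mtime = time.time()  # POSIX mtime, preserved across release
        self.archive_copy = None
        self.online = True

    def archive(self):
        # Copy file data to the archive tier (tape, object, cloud, ...).
        self.archive_copy = bytes(self.data)

    def release(self):
        # Drop the data payload; metadata stays online and unchanged.
        assert self.archive_copy is not None, "must archive before release"
        self.data = None
        self.online = False

    def stage(self):
        # Retrieve the payload from the archive tier on renewed access.
        self.data = self.archive_copy
        self.online = True
```

The essential property is that `size` and `mtime` never change during release or stage, so applications stating the file see the same metadata whether the data payload is online or not.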
- Modern storage systems are comprised mostly of small files. A few large files take most of the capacity, but small files take the bulk of the metadata space and are arguably used the most. Distributed filesystems typically don't handle small files well; in fact, small files are often called the 'Achilles heel' of distributed filesystems. Single node filesystems, meanwhile, aren't capable of modern scale workloads. We needed a filesystem that handles both small and large files quickly and efficiently.
Given the motivations for our extreme scale archiving product, we decided to invest in building new technology that meets our needs. Two years ago we embarked on a journey to build a filesystem that is scalable, POSIX compliant, supports large scale archiving workloads, and most importantly keeps data safe and correct.
ScoutFS is built to address all the needs previously discussed. It is architected to support one trillion files in a single namespace, covering the at-scale archive workloads that Versity supports today and anticipates in the future. ScoutFS runs on a cluster of commodity Linux hardware. Some of the key features and design decisions follow.
- ScoutFS maintains a metadata and data transaction sequence number index, which allows an application to retrieve files roughly in order of modification time starting at any arbitrary index. This allows the application to retrieve all files that have changed since a given index. These queries are accomplished through the Accelerated Query Interface (AQI). Knowing what has changed since the previous query allows our archiving application to apply archive policy to newly created and modified files without scanning the filesystem. This graph shows AQI performance and how it stays flat even as the number of files increases. Other application workflows will also be able to take advantage of the query interface.
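The idea behind a sequence-number index can be sketched in a few lines. The in-memory model below is a conceptual illustration with hypothetical names (`record_change`, `changed_since`); it is not the AQI itself or its API. Every change stamps the file with the next global sequence number, so a query only pays for the changes since the last query, not for the size of the namespace.

```python
import itertools
from bisect import bisect_right

class SeqIndex:
    """Toy model of a modification-sequence index: each create or
    update stamps a path with the next global sequence number, and
    queries return entries newer than the caller's last-seen number."""

    def __init__(self):
        self._next_seq = itertools.count(1)
        self._by_seq = []   # (seq, path) pairs, kept sorted by seq
        self._current = {}  # path -> its current seq

    def record_change(self, path):
        seq = next(self._next_seq)
        old = self._current.get(path)
        if old is not None:
            # Drop the stale entry; O(n) here, fine for a toy model.
            self._by_seq.remove((old, path))
        self._current[path] = seq
        self._by_seq.append((seq, path))  # new seq is always the max
        return seq

    def changed_since(self, last_seen):
        """Return (seq, path) pairs with seq > last_seen. The cost is
        proportional to the number of changes, not the namespace size."""
        i = bisect_right(self._by_seq, (last_seen + 1,))
        return self._by_seq[i:]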
- The shared-block filesystem is designed to minimize any synchronization or messaging the cluster may need, enabling POSIX compliance while still being performant. Each node having access to the same storage has a distinct advantage: whatever one node writes, another node can see. So if a node dies or is unreachable for any reason, another node simply picks up the work where the first node left off. The journaled and atomic design ensures that work is written to storage where other nodes can read it.
- The shared block design enables 'local' workloads and preallocates metadata, which makes small file operations much faster while still handling large files well.
- Checksums are computed at the block level and include both data and metadata, ensuring that data are written and read correctly.
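The general pattern of block-level checksumming can be illustrated with a short sketch. This is a hypothetical example using CRC32 and a 4-byte header, not ScoutFS's actual on-disk format or checksum algorithm: a checksum of the block payload is stored alongside it on write and verified on read, so silent corruption surfaces as an error instead of bad data.

```python
import struct
import zlib

BLOCK_SIZE = 4096
HEADER = struct.Struct("<I")  # 4-byte little-endian CRC stored per block

def write_block(payload):
    """Prefix the payload with a CRC32 of its contents. Illustrative
    only; ScoutFS's real on-disk block format will differ."""
    assert len(payload) <= BLOCK_SIZE - HEADER.size
    return HEADER.pack(zlib.crc32(payload)) + payload

def read_block(block):
    """Verify the stored CRC before returning the payload, raising on
    mismatch rather than silently returning corrupted data."""
    (stored,) = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    if zlib.crc32(payload) != stored:
        raise IOError("block checksum mismatch: corruption detected")
    return payload
```

Applying the same scheme to metadata blocks as well as data blocks is what lets the filesystem catch corruption anywhere in the storage path, matching the data correctness requirement discussed earlier.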
For more information on how ScoutFS is architected and implemented, download our white paper or see the community or github site: