The Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) are ways in which books are organized in a library. A catalog can be searched by author or title, and a classification number is given so one can go to the library stacks and retrieve the book. Stacks are simply shelves that anyone can access. Every library also has an archive, where documents and books are kept that are not readily accessible; if one wants a document from the archives, typically a request must be made before access is granted. The library system is a well-designed and well-understood classification and organization system. Unfortunately, our digital “libraries” of files are not so easily searchable or accessible. There are no standards for classifying and organizing data, nor are there standards for how we refer to data and where we store it. The following is my attempt to define data types and storage types to make the discussion around data more cohesive.
We might think that ‘data types’ refers to text versus images, or one application format versus another, but data types really come down to two questions:
1) Does this data change or does it persist in its original or final state?
It is a common mistake to assume that if data is no longer changing, it must be old or irrelevant or won’t be accessed, and may therefore be moved into an archive with the expectation that no one will access it. The reality is that much unstructured data that is not changing gets referenced for long periods after creation. As an example, a high-definition microscope creates data that may be used in different studies, or newer algorithms may be applied to analyze it. In this situation, the original data may be accessed many times, though never changed. When data is static, we don’t want to apply traditional data protection tools such as backup and restore, because we would be backing up the same file repeatedly. Instead, we want a storage platform that ensures appropriate data durability without that overhead to the environment.
2) Does the data get accessed or is it an ‘insurance policy’ and is likely to remain untouched once stored?
Archiving has become a default term for storage where data is not being accessed. To differentiate between data that will never be accessed and active data that doesn’t change, the industry has adopted the term ‘Active Archive.’ The term is misleading on several levels. First, it is an oxymoron: an archive is assumed to be something stored in a vault without regular access, and active is the opposite. When speaking of accessed versus untouched data, it is best to refer to data as either ‘active’ or ‘passive.’ The drivers in terms of storage platforms are then determined by the performance requirements for data access.
Storage systems are defined by access performance and resiliency (durability). Systems with higher access performance cost more per GB. In general, holding capacity constant, as performance and resiliency demands increase, so does cost. When selecting a storage system, it is critical to understand both the access and the resiliency requirements.
Data that doesn’t change requires the storage system itself to provide higher levels of resiliency, since backup is not being applied. If performance is required, the system should most likely be disk-based. Most commonly, these systems are object-based and use erasure coding to provide the necessary resiliency.
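As a simplified illustration of the idea behind erasure coding: real systems use schemes such as Reed-Solomon across many chunks, but the single-parity case (XOR across equal-length chunks, as in RAID-4/5) shows the principle that redundancy computed over the data lets the system rebuild a lost chunk. This sketch is illustrative only, not how any particular object store implements it:

```python
def make_parity(chunks):
    # XOR all data chunks byte-wise to produce one parity chunk.
    parity = bytes(len(chunks[0]))
    for c in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, c))
    return parity

def recover(surviving, parity):
    # XOR the parity with the surviving chunks to rebuild the lost one.
    lost = parity
    for c in surviving:
        lost = bytes(a ^ b for a, b in zip(lost, c))
    return lost

data = [b"chunk-one!", b"chunk-two!", b"chunk-3!!!"]  # equal-length chunks
parity = make_parity(data)
rebuilt = recover([data[0], data[2]], parity)  # pretend chunk 2 was lost
assert rebuilt == data[1]
```

With single parity, any one lost chunk is recoverable; production erasure codes generalize this to tolerate several simultaneous losses at a modest capacity overhead, which is why they suit large static data sets better than keeping full second copies.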
Data that doesn’t change and has a low performance requirement may be well served by a tape system; such a system would be referred to as an archive. Data resiliency is still paramount: tape systems may be deployed with data stored in two copies or, using a more recent development, with erasure coding within a tape or across tapes.
Data that does change and requires performance is protected with backup and restore operations. Architecturally, it could use either RAID or erasure coding, along with high-performance drives such as SSD or NVMe.
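The two questions above, does the data change, and is it accessed, can be sketched as a small decision helper. The tier names and the mapping below are my own illustrative assumptions based on the cases just described, not industry-standard terminology:

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    changes: bool  # does the data still change?
    active: bool   # is the data still accessed ('active' vs 'passive')?

def suggest_tier(p: DataProfile) -> str:
    """Map the two questions to an illustrative storage tier."""
    if p.changes:
        # Changing data: protect with backup/restore; RAID or erasure
        # coding on high-performance media (SSD/NVMe).
        return "high-performance block/NAS with backup"
    if p.active:
        # Static but actively read: disk-based object storage with
        # erasure coding for durability, no repeated backups.
        return "erasure-coded object store"
    # Static and untouched: a true archive, e.g. tape with two copies.
    return "tape archive"

print(suggest_tier(DataProfile(changes=False, active=True)))
```

The point of the sketch is that two yes/no questions are enough to separate the three cases, which is why conflating them under a single word like ‘archive’ causes so much confusion.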
The high-performance system is often defined by its access protocols: NFS/SMB is NAS, iSCSI/FC is block. When it comes to systems for static data with low access-performance requirements, there is no specific term. Some call it archive, object, active archive, data repository… but none of these truly conveys the capability and use case of the system. Maybe it would be better to find new terms, and maybe we can look to traditional libraries, the original data repositories, for inspiration. Maybe it is not an active archive but storage stacks. And maybe ‘archive’ gets to keep its designation as a storage medium requiring effort to extract data. These may not be the best terms, but we must start somewhere.