This research paper was presented at the 29th ACM Symposium on Operating Systems Principles (SOSP 2023), the premier forum for the theory and practice of computer systems software.
Data growth demands a sustainable archival solution
For millennia, data has woven itself into every facet of our lives, from business and academia to personal spheres. Our production of data is staggering, encompassing personal photos, medical records, financial data, scientific insights, and more. By 2025, it’s estimated that we will generate a massive 175 zettabytes of data annually. Amidst this deluge, a substantial portion is vital for preserving our collective heritage and personal histories.
Presently, magnetic technologies like tape and hard disk drives provide the most economical storage, but they come with limitations. Magnetic media lacks the longevity and durability essential for enduring archival storage, so data must be periodically migrated to new media: roughly every five years for hard disk drives and around every ten for magnetic tape. Moreover, ensuring data longevity on magnetic media requires regular “scrubbing,” a process of reading data to identify corruption and fix any errors, which consumes substantial energy. We need a sustainable solution, one that preserves our digital heritage without imposing an ongoing environmental and financial burden.
Project Silica: Sustainable and durable cloud archival storage
Our paper, “Project Silica: Towards Sustainable Cloud Archival Storage in Glass,” presented at SOSP 2023, describes Project Silica, a cloud-based storage system underpinned by quartz glass. Quartz glass is a durable, chemically inert, resilient, and low-cost medium that is impervious to electromagnetic interference. With a data lifetime of thousands of years, it is ideal for archival storage, offering a sustainable solution and eliminating the need for periodic data refreshes.
Writing, reading, and decoding data
Ultrafast femtosecond lasers enable the writing process. Data is written inside a square glass platter, similar in size to a DVD, as voxels: permanent modifications to the physical structure of the glass made with femtosecond-scale laser pulses. Each voxel encodes multiple bits of data, and voxels are written in 2D layers across the XY plane; hundreds of these layers are then stacked along the Z axis. To achieve high write throughput, we rapidly scan the laser pulses across the length of the media using a scanner similar to those used in barcode readers.
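To make the layered layout concrete, the sketch below shows one way a byte stream could be mapped onto voxel positions, filling an XY layer and then stacking layers along Z. The bits-per-voxel count and layer dimensions are placeholder assumptions for illustration, not the parameters used by Project Silica.

```python
from dataclasses import dataclass
from typing import Iterator

# Illustrative geometry only: real platters pack hundreds of layers and far
# denser voxel grids; these numbers are placeholders.
BITS_PER_VOXEL = 4          # assumption: each voxel encodes several bits
VOXELS_PER_ROW = 1000       # assumption: voxels per scan line (X)
ROWS_PER_LAYER = 1000       # assumption: scan lines per layer (Y)

@dataclass
class Voxel:
    layer: int   # position along the Z axis
    row: int     # Y position within the layer
    col: int     # X position along the scan line
    symbol: int  # multi-bit value encoded by the voxel's physical structure

def bits_from_bytes(data: bytes) -> Iterator[int]:
    """Yield BITS_PER_VOXEL-wide symbols from a byte stream (MSB first)."""
    for byte in data:
        for shift in range(8 - BITS_PER_VOXEL, -1, -BITS_PER_VOXEL):
            yield (byte >> shift) & ((1 << BITS_PER_VOXEL) - 1)

def layout_voxels(data: bytes) -> Iterator[Voxel]:
    """Map symbols onto voxel positions: fill a 2D XY layer, then move up in Z."""
    for index, symbol in enumerate(bits_from_bytes(data)):
        col = index % VOXELS_PER_ROW
        row = (index // VOXELS_PER_ROW) % ROWS_PER_LAYER
        layer = index // (VOXELS_PER_ROW * ROWS_PER_LAYER)
        yield Voxel(layer, row, col, symbol)

# Example: lay out a small payload and inspect the first voxel.
first = next(layout_voxels(b"archival data"))
print(first)  # Voxel(layer=0, row=0, col=0, symbol=...)
```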
To read data, we employ polarization microscopy to image the platter. The read drive scans sectors in a single swift Z-pattern, and the resulting images are processed for decoding. Different read drive options offer varying throughput, balancing cost and performance.
Data decoding relies on ML models that analyze the images captured by the read drive, accurately converting the analog signal back into digital data. The glass library design includes independent read, write, and storage racks. Platters are stored in power-free storage racks and moved by free-roaming shuttles, keeping resource consumption for passive storage to a minimum, as shown in Video 1. A one-way system between the write racks and the rest of the library ensures that a written platter can never be overwritten, enforcing data integrity.
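As a rough illustration of the decode step, the sketch below classifies per-voxel feature vectors against per-symbol reference points and packs the recovered symbols back into bytes. It is a deliberately simplified stand-in, assuming hypothetical feature dimensions and a nearest-centroid rule; the actual ML models used by Project Silica are described in the paper.

```python
import numpy as np

# Simplified stand-in for the learned decoder: classify each voxel's measured
# polarization features against per-symbol reference centroids. The feature
# dimensionality, symbol count, and centroid rule are illustrative assumptions,
# not the models described in the paper.
NUM_SYMBOLS = 16          # e.g., 4 bits per voxel
FEATURE_DIM = 8           # hypothetical features extracted per voxel image patch

rng = np.random.default_rng(0)
centroids = rng.normal(size=(NUM_SYMBOLS, FEATURE_DIM))  # stands in for trained parameters

def decode_voxels(features: np.ndarray) -> np.ndarray:
    """Map each voxel's feature vector (N x FEATURE_DIM) to the nearest symbol."""
    # Distance from every voxel to every symbol centroid, then argmin per voxel.
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1)

def symbols_to_bytes(symbols: np.ndarray, bits_per_symbol: int = 4) -> bytes:
    """Pack decoded symbols back into a byte stream (two 4-bit symbols per byte)."""
    per_byte = 8 // bits_per_symbol
    out = bytearray()
    for i in range(0, len(symbols) - per_byte + 1, per_byte):
        value = 0
        for s in symbols[i : i + per_byte]:
            value = (value << bits_per_symbol) | int(s)
        out.append(value)
    return bytes(out)

# Example: decode noisy measurements of known symbols.
true_symbols = rng.integers(0, NUM_SYMBOLS, size=32)
noisy = centroids[true_symbols] + 0.05 * rng.normal(size=(32, FEATURE_DIM))
print(symbols_to_bytes(decode_voxels(noisy)))
```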
Azure workload analysis informs Silica’s design
To build an optimal storage system around the core Silica technology, we extensively studied cloud archival data workloads from Azure Storage. Surprisingly, we discovered that small read requests dominate the read workload, yet a small percentage of requests constitute the majority of read bytes, creating a skewed distribution, as illustrated in Figure 1.
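The sketch below shows the kind of analysis behind that observation: sort requests by size and ask how few of the largest requests cover a given share of the bytes read. The heavy-tailed synthetic sizes are an assumption for illustration; the real numbers come from the Azure Storage traces reported in the paper and summarized in Figure 1.

```python
import numpy as np

# Illustration of the skew analysis described above: given per-request read
# sizes, how small a fraction of requests accounts for most of the bytes?
# The heavy-tailed synthetic sizes below are placeholders, not Azure data.
rng = np.random.default_rng(1)
request_bytes = rng.pareto(a=1.2, size=100_000) * 4096  # assumed heavy-tailed sizes

sizes = np.sort(request_bytes)[::-1]          # largest requests first
cumulative_share = np.cumsum(sizes) / sizes.sum()

for target in (0.5, 0.9, 0.99):
    # Smallest count of (largest) requests whose bytes reach the target share.
    count = int(np.searchsorted(cumulative_share, target) + 1)
    print(f"{count / len(sizes):6.2%} of requests account for {target:.0%} of bytes read")
```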
This implies that minimizing the latency of mechanical movement in the library is crucial for performance. Silica glass is well suited to these requirements: as a random-seeking storage medium, it eliminates the need for spooling, unlike magnetic tape. Figure 2 illustrates substantial differences in read demand across datacenters, suggesting that we need a flexible library design that can scale resources to each datacenter’s workload. Studying these archival workloads has been instrumental in establishing the core design principles of the Silica storage system.
Project Silica’s versatile storage system
We designed and evaluated a comprehensive storage system that manages error correction, data layout, request scheduling, and shuttle traffic management. Our design handles IOPS-intensive workloads while meeting the expected service level objective (SLO) of an archival storage tier, approximately 15 hours. Interestingly, even in volume-intensive scenarios where a large number of bytes are read, the system serves requests efficiently using read drives with low throughput; in both cases, throughput demands are significantly below those of traditional tape drives, as shown in Figure 3. The paper provides an extensive description of this system, and the video above shows our prototype library’s capabilities.
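To illustrate the scheduling problem, here is a minimal sketch of a deadline-aware scheduler that batches reads by platter so one shuttle trip can serve several requests. The request structure, the SLO constant, and the batching rule are assumptions for illustration only; the paper describes the actual scheduling and traffic-management design.

```python
import heapq
from dataclasses import dataclass, field
from collections import defaultdict

# Hypothetical illustration of deadline-aware scheduling with per-platter
# batching: amortize shuttle and drive-load time across requests for the same
# platter while honoring each request's SLO deadline. Not the paper's scheduler.

SLO_HOURS = 15.0  # archival-tier service level objective mentioned above

@dataclass(order=True)
class ReadRequest:
    deadline: float                      # hours remaining until the SLO expires
    platter_id: str = field(compare=False)
    sector: int = field(compare=False)

def schedule(requests: list[ReadRequest]) -> list[tuple[str, list[int]]]:
    """Return (platter_id, sectors) batches ordered by earliest deadline."""
    by_platter: dict[str, list[ReadRequest]] = defaultdict(list)
    for req in requests:
        by_platter[req.platter_id].append(req)

    # Order platters by the most urgent request they contain.
    heap = [(min(r.deadline for r in reqs), pid) for pid, reqs in by_platter.items()]
    heapq.heapify(heap)

    batches = []
    while heap:
        _, pid = heapq.heappop(heap)
        sectors = sorted(r.sector for r in by_platter[pid])  # one sequential pass over the platter
        batches.append((pid, sectors))
    return batches

# Example: three requests, two sharing a platter, all due within the SLO.
reqs = [ReadRequest(2.0, "P1", 40), ReadRequest(9.5, "P2", 7), ReadRequest(6.0, "P1", 3)]
print(schedule(reqs))  # [('P1', [3, 40]), ('P2', [7])]
```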
Diverse applications for sustainably archiving humanity’s data
Project Silica holds promise in numerous sectors, such as healthcare, scientific research, and finance, where secure and durable archival storage of sensitive data is crucial. Research institutions could benefit from Silica’s ability to store vast datasets generated from experiments and simulations, ensuring the integrity and accessibility of research findings over time. Similarly, healthcare organizations could securely archive patient records, medical imaging data, and research outcomes for long-term reference and analysis.
As the volume of globally generated data grows, traditional storage solutions will continue to face challenges in scalability, energy efficiency, and long-term durability. Moreover, as technologies like AI and advanced analytics progress, the need for reliable and accessible archival data will continue to intensify. Project Silica is well-positioned to play a pivotal role in supporting these technologies by providing a stable, secure, and sustainable repository for the vast amounts of data we create and rely on.