Topic: Hypothetically Archiving the first 5 Million posts on e621

I'm not super good with data storage, but I was curious how much it would cost to archive and maintain the first 5,000,000 posts on this site, presuming you are storing and maintaining them physically yourself. I honestly feel like it wouldn't be that bad; I heard that all of Wikipedia is only like 800 TB including the images and videos (obviously that's a lot, but totally doable by some rich nerd). I know a lot of furs also happen to be rich nerds, so I thought I would propose the idea of archiving the largest archive we have.

rex_is_bottom_dog said:
I'm not super good with data storage, but I was curious how much it would cost to archive and maintain the first 5,000,000 posts on this site, presuming you are storing and maintaining them physically yourself. I honestly feel like it wouldn't be that bad; I heard that all of Wikipedia is only like 800 TB including the images and videos (obviously that's a lot, but totally doable by some rich nerd). I know a lot of furs also happen to be rich nerds, so I thought I would propose the idea of archiving the largest archive we have.

I used to have 3-4 million images on my HDD before I downsized.

Aacafah

Moderator

llewark said:
Currently the size of all posts is 10.3TB (check here; total file size), so it's pretty doable.

Although this figure covers only the image/video/SWF files themselves & not the rest of the post data or the rest of e6's data, it is representative of the entirety of the site's storage, as everything else is dwarfed by the size of the actual media files. If the goal is to only preserve the media, this figure is 100% correct. If the goal is to preserve all post data, including sources, tags, descriptions, etc., it's a little bit more: the daily database export currently shows the compressed size of this data as an additional ~1.5 GB*, & the compressed size is ~41.6% of the uncompressed size, so ~3.8 GB. If the goal is to preserve as much of the site as possible, it'd be a little more than that (again, the current compressed size is ~1.6 GB, so ~3.9 GB uncompressed).
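The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the rounded figures quoted in the post (1.5 GB, 1.6 GB, 41.6%, 10.3 TB); the function name `uncompressed_size_gb` is made up for the example. With these rounded inputs the results come out slightly below the ~3.8 GB and ~3.9 GB quoted above, presumably because the figures in the post were rounded from more precise export sizes.

```python
def uncompressed_size_gb(compressed_gb: float, ratio: float) -> float:
    """Estimate uncompressed size, given compressed size and the
    compressed/uncompressed ratio (e.g. 0.416 for 41.6%)."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return compressed_gb / ratio


RATIO = 0.416    # compressed size as a fraction of uncompressed (from the post)
MEDIA_TB = 10.3  # total media file size quoted in the thread

posts_only_gb = uncompressed_size_gb(1.5, RATIO)   # post data export
full_export_gb = uncompressed_size_gb(1.6, RATIO)  # all exported site data

# Share of the media total taken up by the full export (1 TB = 1000 GB).
share_of_media = full_export_gb / (MEDIA_TB * 1000)

print(f"post data:   ~{posts_only_gb:.1f} GB uncompressed")
print(f"full export: ~{full_export_gb:.1f} GB uncompressed")
print(f"share of media total: {share_of_media:.3%}")
```

Either way, the export is well under a tenth of a percent of the media total, which is why it can be treated as a rounding error.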

* Ignore the file sizes from yesterday, the export wasn't done properly that time. No, this isn't common.

Tl;dr, if you're buying storage to store the media files, you can easily store the rest of the site's data in the remainder (it's basically a rounding error), so 10.3 TB basically covers all the site's data except thumbnails (& maybe downscaled video samples, though those might be included).

The real cost would be in serving the files, though that's not much of a concern for an archive. In fact, it's worth noting that the 10.3 TB figure is the size of the files as-is, without any additional compression. If you were to store them in a losslessly compressed archive format, you could drop the size further (though the rates are unlikely to be anywhere near as good as compressing pure text data, any savings on 10 TB are likely to be somewhat significant). If you losslessly recompressed them individually (specifically optimizing PNGs) before/instead of archive compression, you'd likely be able to drop it a lot: as shown in the stats, half of the posts are PNGs, few if any are likely to have been optimized already, let alone optimally, & most are extremely likely to get excellent compression from PNG's pre-compression filtering.
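To illustrate why archive-level compression tends to help text far more than media, here is a rough sketch using Python's standard `zlib` module. The inputs are stand-ins, not real site data: a repetitive byte string plays the role of text-like export data, and pseudo-random bytes play the role of media whose contents are already compressed (an assumption; real PNGs and videos will behave somewhat differently, and this says nothing about how much a dedicated PNG optimizer could save).

```python
import random
import zlib


def compressed_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)


# Stand-in for text-like data such as tags and descriptions (hypothetical content).
text_like = b"tag_name another_tag source_url description\n" * 5000

# Stand-in for already-compressed media: high-entropy pseudo-random bytes.
rng = random.Random(0)  # fixed seed so the run is repeatable
media_like = bytes(rng.getrandbits(8) for _ in range(len(text_like)))

text_ratio = compressed_ratio(text_like)
media_ratio = compressed_ratio(media_like)

print(f"text-like data:  {text_ratio:.3f} of original size")
print(f"media-like data: {media_ratio:.3f} of original size")
```

The text-like input shrinks to a small fraction of its size, while the high-entropy input barely changes, which is the gap the post describes between compressing the database export and compressing the media itself.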

I can share some of my numbers, since I started down the path of intentional data archival after accidentally nuking the drive that contained a bunch of commissions/paid content I had accrued over the years. It cost me about $1,000 USD to build a system with 22 TB of storage (4 x 8 TB HDDs in RAID 5) back in 2021. You can definitely get that cost down by opting for smaller drives, especially if you compress your media. I also overbuilt the system so that I could use it to host dedicated game servers for friends, something you can skip to save some more money. Electricity-wise, my system runs 24/7 and costs me <$3.00 USD a month, though that cost is a bit higher than it needs to be because I use the system for more than just network storage.
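For anyone sizing a similar build, here is a rough sketch of the capacity math. RAID 5 spends one drive's worth of space on parity, so usable space is (drive count - 1) x drive size. The helper names are made up for the example, and reading the 22 TB figure above as the binary (TiB) size an operating system would report is my assumption, not something stated in the post.

```python
def raid5_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable RAID 5 capacity in decimal TB: one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drive_count - 1) * drive_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


BUILD_COST_USD = 1000  # whole system, including the overbuilt parts
MEDIA_TB = 10.3        # total media size quoted earlier in the thread

usable_tb = raid5_usable_tb(4, 8.0)
usable_tib = tb_to_tib(usable_tb)
cost_per_usable_tb = BUILD_COST_USD / usable_tb
headroom_tb = usable_tb - MEDIA_TB

print(f"usable: {usable_tb:.1f} TB ({usable_tib:.1f} TiB)")
print(f"cost per usable TB: ${cost_per_usable_tb:.2f}")
print(f"headroom after storing the media: {headroom_tb:.1f} TB")
```

By this estimate a build like the one described would hold the 10.3 TB of media with more than half the array left over.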

Original page: https://e621.net/forum_topics/58672