Topic: Help with a local installation of e621

Posted under e621 Tools and Applications

Greetings. I am not familiar with Docker or with hosting any other booru site, but out of curiosity I have managed to follow the instructions here: https://github.com/e621ng/e621ng and now I have a local installation of e621. I have populated it with the

docker exec -it e621ng-e621-1 /app/bin/populate command 

and now it has a small number of posts. I have also uploaded a few random miscellaneous images into it. But now I am curious as to where these images are stored on my computer. I have not been able to find any of them; they do not appear when I search the e621ng directory for any *.jpg or *.png file. I have tried running this same search on my home folder and on the root directory, and likewise I can't find them. I know the search program (KFind) is working just fine because it finds PNGs and JPGs in various random locations on my computer, but none of them are the images that I have on e621ng. I have also checked the e621ng/public/data/ folder, but it is completely empty.
So to summarize, my questions are as follows:

Where are images stored in a local instance of e621 running on Docker?
Where are the tags for each post stored?

Donovan DMC

Former Staff

Look up on Google where Docker volumes are located for your operating system. On Linux they're in /var/lib/docker/volumes (you need sudo to access this), and the volume name will be e621ng_post_data, with the files further down in the _data folder (/var/lib/docker/volumes/e621ng_post_data/_data). You're on your own for Windows and Mac.
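If you don't want to dig for the path manually, you can also just ask Docker for it. A minimal Python sketch (assuming the docker CLI is on your PATH; the volume name e621ng_post_data is the one above, and you may need root just like for the folder itself):

import json
import subprocess

# Ask Docker where the volume is mounted on the host.
# Assumes the compose volume name "e621ng_post_data" described above.
result = subprocess.run(
    ["docker", "volume", "inspect", "e621ng_post_data"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
print(info[0]["Mountpoint"])  # e.g. /var/lib/docker/volumes/e621ng_post_data/_data on Linux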

donovan_dmc said:
Look up on Google where Docker volumes are located for your operating system. On Linux they're in /var/lib/docker/volumes (you need sudo to access this), and the volume name will be e621ng_post_data, with the files further down in the _data folder (/var/lib/docker/volumes/e621ng_post_data/_data). You're on your own for Windows and Mac.

This is precisely it. Thank you very much.

donovan_dmc said:
Look up on Google where Docker volumes are located for your operating system. On Linux they're in /var/lib/docker/volumes (you need sudo to access this), and the volume name will be e621ng_post_data, with the files further down in the _data folder (/var/lib/docker/volumes/e621ng_post_data/_data). You're on your own for Windows and Mac.

Hello, I am not a programmer, but I have a luxury Windows server with a 60 TB hard drive. I want to build a local e621 for myself. Is there any manual or guide for normal people? plzzzzzzzzzzzz

mirovichka said:
Hello, I am not a programmer, but I have a luxury Windows server with a 60 TB hard drive. I want to build a local e621 for myself. Is there any manual or guide for normal people? plzzzzzzzzzzzz

This is one of my personal gripes too: in my case, localhost would always give a failbooru, even through multiple fresh installations of my OS. Because of this, I'm currently working on a Python script that lets users download ALL (~9.6 TB) of E621's posts, along with the applicable metadata for each file, with the goal that any non-techie can simply launch it as they would any other program. One challenge I have yet to solve is how to get a visually pleasing and familiar interface for E6's users; for that matter, I'm inexperienced with creating an interface at all, so I've no clue how big an undertaking that next step will be, but I already have the downloading code in an operational state.

Donovan DMC

Former Staff

If you try importing all posts from here, you'll quickly find issues with many files: as bugs get fixed, files that previously worked fine here no longer work when uploading (large APNGs, WebMs with a non-1:1 SAR, etc.).
You'll also likely run into an insane amount of throttling from the static server eventually; the most I've sustained is ~1,000,000 posts over 24 hours, and that was just previews because I was building an iqdb clone.

Also, if you're running an e621ng instance locally, just... don't use Windows and it will work fine 99.9% of the time.
If you must use Windows, use WSL2.

donovan_dmc said:
If you try importing all posts from here, you'll quickly find issues with many files: as bugs get fixed, files that previously worked fine here no longer work when uploading (large APNGs, WebMs with a non-1:1 SAR, etc.).
You'll also likely run into an insane amount of throttling from the static server eventually; the most I've sustained is ~1,000,000 posts over 24 hours, and that was just previews because I was building an iqdb clone.

Also, if you're running an e621ng instance locally, just... don't use Windows and it will work fine 99.9% of the time.
If you must use Windows, use WSL2.

Also, if your plan is just a read-only archive, running an entire copy of e6ng seems a bit overkill, given that there are db exports. Then again, I don't know if there's any publicly available software to deal with them, and good luck opening a multi-GB CSV file in Excel (my query/suggestion/subscription thing takes a bit to run through it, and I've got a relatively decent CPU; then again, it's the most performance-optimised JS I've ever written, but there's still room for improvement).

And of course, getting the actual media files is a whole other story.
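For what it's worth, you don't actually need Excel for the exports; streaming the CSV one row at a time keeps memory flat. A rough Python sketch, with posts.csv as a placeholder name for whichever export you grabbed:

import csv

# Stream the posts export row by row instead of loading the whole file.
# "posts.csv" is a placeholder filename; check your export's header for column names.
csv.field_size_limit(10_000_000)  # some description/source fields can be long
count = 0
with open("posts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        count += 1  # do per-row work here (filtering, picking columns, etc.)
print(count, "rows")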

thicketsafe said:
This is one of my personal gripes too: in my case, localhost would always give a failbooru, even through multiple fresh installations of my OS. Because of this, I'm currently working on a Python script that lets users download ALL (~9.6 TB) of E621's posts, along with the applicable metadata for each file, with the goal that any non-techie can simply launch it as they would any other program. One challenge I have yet to solve is how to get a visually pleasing and familiar interface for E6's users; for that matter, I'm inexperienced with creating an interface at all, so I've no clue how big an undertaking that next step will be, but I already have the downloading code in an operational state.

So do you happen to have that downloading code on github by any chance? ^^; I would like to download e621 in its entirety and have been struggling haha

Aacafah

Moderator

chromiboi said:
So do you happen to have that downloading code on github by any chance? ^^; I would like to download e621 in its entirety and have been struggling haha

1. Download & extract the desired db exports
2. Copy the posts DB file.
3. Open it in a CSV reader
4. Wait a year
5. Delete every column but the one with the post's static url (if needed, save & reopen to clear memory)
6. Add another column that takes the static url column & chops off the http://static1.e621proxy.ru/data/ part
7. Copy/paste that new column's output into a plaintext file & save it.
8. Pick a directory & run a script that, for each line in the plaintext file, downloads http://static1.e621proxy.ru/data/ + the url stub, saving it under the destination directory + the remaining path of the url stub (see the sketch at the end of this post).

After a decade or two, you'll be done.

For bonus points, only do x posts at a time & delete those entries from the list so you can split it up over multiple sessions.
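If anyone genuinely wants to attempt step 8, a very rough Python sketch is below; url_stubs.txt and e621_data are made-up names, the base URL is the one from the steps above, and the sleep is just a crude politeness delay so the static server doesn't throttle you instantly:

import os
import time
import urllib.request

BASE = "http://static1.e621proxy.ru/data/"   # base URL from the steps above
STUB_LIST = "url_stubs.txt"                   # plaintext file from step 7 (made-up name)
DEST = "e621_data"                            # destination directory (made-up name)

with open(STUB_LIST, encoding="utf-8") as f:
    for line in f:
        stub = line.strip()
        if not stub:
            continue
        target = os.path.join(DEST, stub)
        if os.path.exists(target):            # skip files you already have, so runs can resume
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urllib.request.urlretrieve(BASE + stub, target)
        time.sleep(1)                          # crude politeness delay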

aacafah said:
1. Download & extract the desired db exports
2. Copy the posts DB file.
3. Open it in a CSV reader
4. Wait a year
5. Delete every column but the one with the post's static url (if needed, save & reopen to clear memory)
6. Add another column that takes the static url column & chops off the http://static1.e621proxy.ru/data/ part
7. Copy/paste that new column's output into a plaintext file & save it.
8. Pick a directory & run a script that, for each line in the plaintext file, downloads http://static1.e621proxy.ru/data/ + the url stub to the destination directory + the remaining path of the url stub.

After a decade or two, you'll be done.

For bonus points, only do x posts at a time & delete those entries from the list so you can split it up over multiple sessions.

it wouldn't take that long tho?

donovan_dmc said:
Man's doubled down on not getting the joke

No, I got the joke after a tiny bit, but I replied to him because I don't know why he linked to a subreddit instead of saying "it's a joke".

Donovan DMC

Former Staff

funkwolfie said:
No, I got the joke after a tiny bit, but I replied to him because I don't know why he linked to a subreddit instead of saying "it's a joke".

Go look at the subreddit, it's literally about people not getting the joke
That's the point
You're doubling down on not getting the joke about not getting the joke

donovan_dmc said:
Go look at the subreddit, it's literally about people not getting the joke
That's the point
You're doubling down on not getting the joke about not getting the joke

I'm not doubling down; I just don't understand why he linked the subreddit instead of saying "it's a joke", but this isn't the place for this.

funkwolfie said:
I'm not doubling down; I just don't understand why he linked the subreddit instead of saying "it's a joke", but this isn't the place for this.

"Woosh" and its variations have been shorthand for "you missed the joke" (or, more specifically, "the joke went over your head") both on and off the internet for decades.

using "r/wooosh" specifically infers that the instance was so worthy of ridicule that it would be worth posting it on the subreddit in question.

Aacafah

Moderator

funkwolfie said:
...it wouldn't take that long tho?

Clearly no, but it'll feel that way; there are a loooooot of rows in that CSV, and it takes a while for programs to process that. I worked with the tag export, and it was practically unusable w/o massive trimming or waiting 8 minutes for it to load. That was with ~1 million tags, with the largest column being the name. There are more than 5 million posts, all with much more text, & after ditching deleted posts (forgot that, do that), there's... just shy of 5 million posts.

It will take a while just to prepare to download them; it will take significantly longer to actually download the images/videos/SWFs (+ you will get rate-limited if you're not careful).
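Ditching the deleted posts can at least happen in a single streaming pass. A rough Python sketch, assuming the export has an is_deleted column with t/f values (check your copy's header, I'm going from memory):

import csv

csv.field_size_limit(10_000_000)
# Copy the posts export, skipping rows flagged as deleted.
# Assumes an "is_deleted" column with "t"/"f" values; verify against your export's header.
with open("posts.csv", newline="", encoding="utf-8") as src, \
     open("posts_active.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row.get("is_deleted") != "t":
            writer.writerow(row)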

For future reference, you can both index a large CSV file and split it into only specific rows or columns. There are people dealing with CSVs with 100M+ rows having to do this. :shudders:

It's also faster to use an actual database format, as they have far better support at scale.
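SQLite is probably the lowest-friction version of that: Python ships with it, and a one-time import gets you indexed lookups instead of rescanning a multi-GB CSV. A rough sketch (filenames are placeholders, and the md5 index assumes the export actually has an md5 column):

import csv
import sqlite3

csv.field_size_limit(10_000_000)
con = sqlite3.connect("posts.db")
with open("posts.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    con.execute(f"CREATE TABLE IF NOT EXISTS posts ({columns})")
    con.executemany(f"INSERT INTO posts VALUES ({placeholders})", reader)
# Index whatever you plan to look things up by (md5 here is an assumption).
con.execute('CREATE INDEX IF NOT EXISTS idx_md5 ON posts ("md5")')
con.commit()
con.close()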

Hilariously, I'm at the point where I need to just use a huge list of MD5s since I have a good portion of the content of Paheal collected over... not quite a year. I guess my effort is the anti-AI method. Doesn't hurt/help (depending on POV) that I did it over a hotspot. XD

We need to update the Popular Posts torrents to add 2024-2025, eventually. Those are getting into the hundreds of GBs per year.

Aacafah

Moderator

alphamule said:
For future reference, you can both index a large CSV file and split it into only specific rows or columns. There are people dealing with CSVs with 100M+ rows having to do this. :shudders:...

I'm sure there are ways to index into a colossal CSV without it taking years, ways I wish I'd thought of in the past instead of being stubborn, but in the context of these instructions (setting up a simple Bash script), idk how performant it would be, especially with a simple linear buffered input method (a different method could be less approachable to the layperson).
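One approachable middle ground would be a plain byte-offset index: a single linear pass that records where each line starts, so later reads can seek straight to line N. A rough Python sketch (with the caveat that any rows with embedded newlines inside quoted fields will throw the line numbering off):

# Build a byte-offset index of "posts.csv" (placeholder name) in one linear pass.
offsets = []
with open("posts.csv", "rb") as f:
    pos = f.tell()
    line = f.readline()
    while line:
        offsets.append(pos)
        pos = f.tell()
        line = f.readline()

# Later: seek straight to an arbitrary line instead of rereading the file.
with open("posts.csv", "rb") as f:
    f.seek(offsets[len(offsets) // 2])        # e.g. jump to the middle line
    print(f.readline().decode("utf-8", errors="replace"))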

alphamule said:
...It's also faster to use an actual database format, as they have far better support at scale...

I don't recall where, but there was forum discussion about changing the format of the db_exports for user convenience, but since the server has built-in support for exporting to CSV & like one other contextually worse format (among other reasons), it probably wouldn't happen.

aacafah said:
I'm sure there are ways to index into a colossal CSV without it taking years, ways I wish I'd thought of in the past instead of being stubborn, but in the context of these instructions (setting up a simple Bash script), idk how performant it would be, especially with a simple linear buffered input method (a different method could be less approachable to the layperson).

I don't recall where, but there was forum discussion about changing the format of the db_exports for user convenience, but since the server has built-in support for exporting to CSV & like one other contextually worse format (among other reasons), it probably wouldn't happen.

No, please, that's how we get horrible shit like DB4O outputs! Oh, the obscurity! Won't someone think of the child processes?! What, you thought I was going to mention Access DB? :P

Hmm, for laughs, I looked up "worst database formats of all time" and found this glorious thread on Reddit: https://old.reddit.com/r/dataengineering/comments/qh2r4i/whats_the_most_annoying_file_format_youve_had_to/

"maybe toss in some embedded CRLF" Ouch, source fields do that if I remember right. I noticed Paheal has embedded linefeeds in their metadata. I ended up just escape-coding that to prevent insanity.


Original page: https://e621.net/forum_topics/54813