Topic: Mass Download of Metadata

Posted under e621 Tools and Applications

I'm building an application for some experimentation, and I'll release it if it works out. I intend to do what many have talked about before: a content recommendation system that works on tags and other users' favourites, as well as what the user has voted on, to predict and suggest 'good' content. I must admit that most people do not know the correct algorithms or statistics to apply, whilst I do.

However, to download this data I've started using the API, which is useful, but downloading all of e621 this way is not practical for me, and I suspect not practical for the server either. My current script may take over 160 hours to download all the data, whilst I estimate the real database is only ~200 MB (metadata only, so not the actual post files: no images, Flash files, or the like).

So I was wondering if the admins at e621 could make a raw database dump (in whatever form) containing information similar to what the API provides, and upload it to a file-sharing service.

For my purposes, I know I need a Posts table with PostID, PostURL, PostTags, PostsFavoritedByUsers, etc. Sets and Pools data too, if possible.

Also, if anyone is actively scraping the e621 database and already has a near up-to-date copy with the info I requested, sharing your database instead would work equally well.

Thanks. I hope someone can help me with this, and I hope to release the content recommendation system once I've got it set up.

Updated by mrox

The only information that is really hard to pull is the favorites, which I have provided here: https://dl.dropboxusercontent.com/u/15777004/temp/e6/favorites.csv.xz
It's a standard CSV file in the format of "post_id,username"
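For anyone who wants to read that file directly, here is a minimal sketch using only the Python standard library. It assumes the file is UTF-8 text with exactly two columns per row as described; the function names and the sample rows (alice, bob) are made up for illustration.

```python
import csv
import io
import lzma
from collections import defaultdict


def load_favorites(stream):
    """Read "post_id,username" rows from a text stream into {username: set of post ids}."""
    favorites = defaultdict(set)
    for row in csv.reader(stream):
        if len(row) != 2 or not row[0].isdigit():
            continue  # skip blank lines or a header row, if the file has one
        post_id, username = row
        favorites[username].add(int(post_id))
    return favorites


def load_favorites_xz(path):
    """Stream an .xz-compressed CSV without unpacking it to disk first."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as stream:
        return load_favorites(stream)


# Tiny in-memory sample with made-up rows, so the sketch runs without the real file.
sample = io.StringIO("585020,alice\n585021,alice\n585020,bob\n")
favs = load_favorites(sample)
print({user: sorted(posts) for user, posts in favs.items()})
```

With the real download, `load_favorites_xz("favorites.csv.xz")` should give the same kind of mapping, though for a file that size you may want to write rows into a database rather than hold everything in memory.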

Votes are considered private, and thus are not exported.
The actual size of the data sets is quite a bit larger than most imagine them to be.

For obtaining the post tags and other information, I would suggest using a variation of https://gist.github.com/zwagoth/77b394fe6fa8834fa5549e42bf786bba which I wrote a while ago for purposes like this. It usually only takes a few hours to pull the information for all posts, and it is much easier than exporting it manually.

Updated by anonymous

The JSON metadata for all the image info is 3.5 GB. That includes descriptions but not posts or other user content. It is likely highly compressible, but I'm not going to test that now. Example record (shown as a Python dict):

{ 'artist': ['redemption3445'],
'author': 'Circeus',
'change': 7358253,
'children': '',
'created_at': {'json_class': 'Time', 'n': 30789000, 's': 1420912033},
'creator_id': 49579,
'description': '',
'fav_count': 22,
'file_ext': 'jpg',
'file_size': 309722,
'file_url': 'http://static1.e621proxy.ru/data/c9/ec/c9ec00ac632e22b0b8ab49aed24febc5.jpg',
'has_children': False,
'has_comments': False,
'has_notes': False,
'height': 923,
'id': 585020,
'md5': 'c9ec00ac632e22b0b8ab49aed24febc5',
'parent_id': None,
'preview_height': 138,
'preview_url': 'http://static1.e621proxy.ru/data/preview/c9/ec/c9ec00ac632e22b0b8ab49aed24febc5.jpg',
'preview_width': 150,
'rating': 'e',
'sample_height': 923,
'sample_url': 'http://static1.e621proxy.ru/data/c9/ec/c9ec00ac632e22b0b8ab49aed24febc5.jpg',
'sample_width': 1000,
'score': 5,
'source': 'http://mista-red.tumblr.com/post/101564669539/oh-gosh-dont-let-kimahri-see-you-like-that',
'sources': ['http://mista-red.tumblr.com/post/101564669539/oh-gosh-dont-let-kimahri-see-you-like-that'],
'status': 'active',
'tags': 'anus balls bottomless butt clothed clothing cub erection horn humanoid_penis jewelry legs_up looking_at_viewer male necklace open_mouth penis perineum presenting presenting_hindquarters redemption3445 ronso smile solo spread_legs spreading yellow_eyes young',
'width': 1000}
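
As a rough sketch of how a record like this could be reduced to the fields a recommender needs, the snippet below splits the space-separated tag string and converts the timestamp. It assumes `created_at['s']` is Unix seconds in UTC, which the record does not state; `to_row` and its output keys are hypothetical names, and the tag string is trimmed to a handful of the tags from the example.

```python
from datetime import datetime, timezone

# Trimmed copy of the example record (only the fields used here, shortened tag list).
record = {
    "id": 585020,
    "artist": ["redemption3445"],
    "rating": "e",
    "score": 5,
    "fav_count": 22,
    "created_at": {"json_class": "Time", "n": 30789000, "s": 1420912033},
    "tags": "horn jewelry necklace redemption3445 ronso smile solo yellow_eyes",
}


def to_row(post):
    """Flatten one post record into a small dict suitable for a Posts table."""
    # Assumption: 's' holds seconds since the Unix epoch, interpreted as UTC.
    created = datetime.fromtimestamp(post["created_at"]["s"], tz=timezone.utc)
    return {
        "post_id": post["id"],
        "tags": frozenset(post["tags"].split()),  # tags are space-separated
        "artists": tuple(post["artist"]),
        "rating": post["rating"],
        "score": post["score"],
        "fav_count": post["fav_count"],
        "created_at": created,
    }


row = to_row(record)
print(row["post_id"], len(row["tags"]), row["created_at"].isoformat())
```

Dropping the URL and preview fields this way should shrink the data considerably, since they can be rebuilt from the md5 and file extension.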

I suggest you only grab a few bits of data (real or faked), get your recommendation system working, then download the full data set and apply it to your system.
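
To illustrate that workflow, here is a small sketch that generates fake favorites and runs a very simple user-overlap recommender on them. This is not the original poster's algorithm, just a placeholder (Jaccard-weighted voting between users) to show the pipeline running end to end; every name and parameter here is made up.

```python
import random
from collections import Counter


def fake_favorites(n_users=50, n_posts=200, per_user=15, seed=0):
    """Generate random {username: set of post ids} data for prototyping."""
    rng = random.Random(seed)
    return {
        f"user{u}": set(rng.sample(range(1, n_posts + 1), per_user))
        for u in range(n_users)
    }


def recommend(favorites, user, top_n=5):
    """Suggest posts favorited by users whose favorites overlap with `user`'s."""
    mine = favorites[user]
    scores = Counter()
    for other, theirs in favorites.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared favorites, so this user contributes nothing
        weight = overlap / len(mine | theirs)  # Jaccard similarity of the two users
        for post_id in theirs - mine:
            scores[post_id] += weight
    return [post_id for post_id, _ in scores.most_common(top_n)]


favorites = fake_favorites()
print(recommend(favorites, "user0"))
```

Random data will not produce meaningful suggestions, but it is enough to check that the code runs and scales before swapping in the real favorites.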

Updated by anonymous
