Forum - DB Export Parsing Suite (for Python 3.10+)

wat8548

Member

If you've ever wanted to do anything with the vast amount of data e621 holds, nothing compares to the daily database exports. Over the last year or so, I've slowly accumulated an extensive suite of homegrown software designed to ease parsing the idiosyncratic not-technically-CSV format the data is provided in, and today I voluntarily spent most of my Sunday rewriting most of what I had written and expanding its capabilities to handle every field available, even the boring ones. This is because I don't do drugs.

This supports every export currently available except the wiki. All functions are generator iterators which take a file handle, and return a dictionary object where the key names correspond to the first line of the file. The types of the values vary depending on the column. Some non-exhaustive notes:

Stuff like post IDs and fav counts are integers, or None where not applicable.
Datestamps use Python's datetime library, and also may be None for early entries.
Some fields have custom enums.
Tags are returned as a list of strings, and pool contents as a list of integers.
Boolean values (like is_deleted) may be True, False or None, because the early years of e621's database are full of Fun.

Some examples of usage follow in the replies.

e6csv.py

import sys
from enum import Enum, IntEnum
from datetime import datetime
import io

class Rating(IntEnum):
    SAFE = 0
    QUESTIONABLE = 1
    EXPLICIT = 2

ratingtable = {"s": Rating.SAFE, "q": Rating.QUESTIONABLE, "e": Rating.EXPLICIT}

class PoolCategory(IntEnum):
    SERIES = 0
    COLLECTION = 1

# ACTIVE is assigned 0 here to ease boolean comparisons, since in most cases
# it's the only one you care about
class RelationStatus(IntEnum):
    ACTIVE = 0
    PENDING = 1
    DELETED = 2
    APPROVED = 3
    QUEUED = 4
    PROCESSING = 5
    RETIRED = 6

class TagCategory(IntEnum):
    GENERAL = 0
    ARTIST = 1
    # 2 is unused
    COPYRIGHT = 3
    CHARACTER = 4
    SPECIES = 5
    INVALID = 6
    META = 7
    LORE = 8

Schema = Enum("Schema", "INT STR TIME MQUOTE SQUOTE RATING BOOL STRLIST " \
    + "INTLIST FLOAT PCAT RSTAT TCAT")

def commaorend(l, o):
    n = l.find(",", o)
    if n == -1:
        n = len(l)-1
    return n

def readcsv(f, schema):
    lineno = 1
    try:
        colnames = f.readline().strip().split(",")
        l = f.readline()

        # e621's string mangling algorithm works as follows:
        # If the first character is not a double quote, the string does not
        # contain any newlines, commas or double quotes, and is terminated by
        # the end of the field as normal.
        # If the first character is a double quote, the string continues until
        # the next unescaped double quote.
        # All characters are valid including commas and newlines, and double
        # quotes are escaped by following each one with a second double quote.
        # The string returned from this function replaces escaped double quotes
        # within the string with unescaped double quotes, and does not include
        # the enclosing double quotes (if any).
        def unescape(o, multiline):
            nonlocal l, lineno
            buf = io.StringIO()
            if l[o] != '"':
                n = commaorend(l, o)
                buf.write(l[o:n])
            else:
                while True:
                    o += 1
                    n = l.find('"', o)
                    while n == -1:
                        if not multiline:
                            raise Exception("Expected \"")
                        buf.write(l[o:])
                        l = f.readline()
                        lineno += 1
                        o = 0
                        n = l.find('"')
                    buf.write(l[o:n])
                    n += 1
                    if l[n] != '"':
                        break
                    buf.write('"')
                    o = n
            return (buf.getvalue(), n)

        while len(l) > 0:
            row = {}
            o = 0
            for col in zip(colnames, schema, strict=True):
                if o != 0:
                    o += 1
                if o == len(l):
                    raise Exception("Reached end of line unexpectedly")
                match col[1]:
                    case Schema.INT:
                        # Integer (may be blank)
                        n = commaorend(l, o)
                        v = int(l[o:n]) if o != n else None
                    case Schema.STR:
                        # String (cannot contain commas, newlines or quotes)
                        n = commaorend(l, o)
                        v = l[o:n]
                    case Schema.TIME:
                        # Datestamp (these SUCK)
                        # The microsecond part may not be present or may have
                        # fewer than 6 decimal places, which causes the
                        # datetime library to throw a wobbly.
                        # May also be blank.
                        n = commaorend(l, o)
                        if n == o:
                            v = None
                        else:
                            datestr = l[o:n]
                            if n-o == 19:
                                datestr += ".000000"
                            else:
                                datestr += "0" * (26-(n-o))
                            v = datetime.fromisoformat(datestr)
                    case Schema.MQUOTE:
                        # Arbitrary data (may contain anything including
                        # newlines)
                        v, n = unescape(o, True)
                    case Schema.SQUOTE:
                        # Arbitrary line (may contain anything except newlines)
                        v, n = unescape(o, False)
                    case Schema.RATING:
                        # Rating
                        # Must be one of "s", "q" or "e", and cannot be blank
                        n = commaorend(l, o)
                        v = ratingtable[l[o:n]]
                    case Schema.BOOL:
                        # Boolean
                        # Must be "t", "f", or blank
                        n = commaorend(l, o)
                        boolstr = l[o:n]
                        match boolstr:
                            case "t":
                                v = True
                            case "f":
                                v = False
                            case "":
                                v = None
                            case _:
                                raise Exception("Invalid bool: "+boolstr)
                    case Schema.STRLIST:
                        # Space-separated string list
                        v, n = unescape(o, False)
                        v = v.split(" ") if n != o else []
                    case Schema.INTLIST:
                        # Comma-separated integer list in curly brackets
                        # Also enclosed in double quotes unless the list has 1
                        # or 0 entries (i.e. no commas).
                        if l[o] == '"':
                            o += 1
                        if l[o] != "{":
                            raise Exception("Expected {")
                        o += 1
                        n = l.find("}", o)
                        if n == -1:
                            raise Exception("Expected }")
                        v = [int(s) for s in l[o:n].split(",")] if n != o else []
                        n += 1
                        if l[n] == '"':
                            n += 1
                    case Schema.FLOAT:
                        # Float (may be blank)
                        n = commaorend(l, o)
                        v = float(l[o:n]) if o != n else None
                    case Schema.PCAT:
                        # Pool category (cannot be blank)
                        n = commaorend(l, o)
                        v = PoolCategory[l[o:n].upper()]
                    case Schema.RSTAT:
                        # Relation status (cannot be blank)
                        n = commaorend(l, o)
                        v = RelationStatus[l[o:n].upper()]
                    case Schema.TCAT:
                        # Tag category (numeric, cannot be blank)
                        n = commaorend(l, o)
                        v = TagCategory(int(l[o:n]))
                    case _:
                        raise Exception("Unknown schema: "+str(col[1]))
                row[col[0]] = v
                o = n
            yield row
            l = f.readline()
            lineno += 1
    except:
        raise Exception("Error at line "+str(lineno)) from sys.exc_info()[1]

def readposts(f):
    return readcsv(f, [Schema.INT, Schema.INT, Schema.TIME, Schema.STR, \
        Schema.MQUOTE, Schema.RATING, Schema.INT, Schema.INT, Schema.STRLIST, \
        Schema.STRLIST, Schema.INT, Schema.STR, Schema.INT, Schema.INT, \
        Schema.INT, Schema.INT, Schema.INT, Schema.MQUOTE, Schema.FLOAT, \
        Schema.TIME, Schema.BOOL, Schema.BOOL, Schema.BOOL, Schema.INT, \
        Schema.INT, Schema.INT, Schema.BOOL, Schema.BOOL, Schema.BOOL])

def readpools(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.TIME, Schema.TIME, \
        Schema.INT, Schema.MQUOTE, Schema.BOOL, Schema.PCAT, Schema.INTLIST])

def readrelations(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.SQUOTE, Schema.TIME, \
        Schema.RSTAT])

def readtags(f):
    return readcsv(f, [Schema.INT, Schema.SQUOTE, Schema.TCAT, Schema.INT])

Updated over 1 year ago

wat8548

Member

almost 2 years ago

Given a posts CSV and a pools CSV (in that order), the following code will rank every one of Fuze's comics and other pools in descending order of average post score, in an attempt to determine once and for all what the all-time most popular Fuze comic is:

import sys
import os
import e6csv

print("Reading posts file...")

f = open(sys.argv[1], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
fuzeposts = {}

for post in e6csv.readposts(f):
    linecount += 1
    if linecount == 10000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if post['is_deleted'] or "fuze" not in post['tag_string']:
        continue
    fuzeposts[post['id']] = post['score']

f.close()
print("\nReading pools file...")

f = open(sys.argv[2], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
fuzepools = {}

for pool in e6csv.readpools(f):
    linecount += 1
    if linecount == 1000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    posts = pool['post_ids']
    if len(posts) == 0 or posts[0] not in fuzeposts:
        continue
    totalscore = sum([fuzeposts[post] for post in posts])
    fuzepools[pool['name']] = totalscore / len(posts)

print()

for pool in sorted(fuzepools.items(), key=lambda p: p[1], reverse=True):
    print(str(pool[1])+" "+pool[0])

Note: This only works because none of Fuze's posts have ever been added to a pool containing the work of a different artist. A more cautious approach may be needed for other artists.

Note 2: This didn't work when I ran it because it turned out pool #12687 had a deleted post in it. Naughty.

Note 3: It's the incest comic, BTW. Wish I could say I was surprised.

wat8548

Member

almost 2 years ago

Given a tags CSV and a posts CSV (in that order), the following code will produce an accurate post count of every tag in the invalid category (ignoring deleted posts) and name and shame all the tags for which the claimed tag count is the most inaccurate:

import sys
import os
import e6csv
from e6csv import TagCategory

class TagCount:
    pass

f = open(sys.argv[1], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
invtags = {}

for tag in e6csv.readtags(f):
    linecount += 1
    if linecount == 1000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if tag['category'] == TagCategory.INVALID:
        tc = TagCount()
        tc.claimed = tag['post_count']
        tc.actual = 0
        invtags[tag['name']] = tc

f.close()
print()

f = open(sys.argv[2], "r")
filesize = os.fstat(f.fileno()).st_size
linecount = 0
tags = {}

for post in e6csv.readposts(f):
    linecount += 1
    if linecount == 10000:
        linecount = 0
        print(str((f.tell()*100)/filesize)+"%            ",end="\r")
    if post['is_deleted']:
        continue
    for tag in post['tag_string']:
        if tc := invtags.get(tag):
            tc.actual += 1

f.close()
print()

for name, tc in sorted(invtags.items(), key=lambda i: abs(i[1].claimed - i[1].actual), reverse=True):
    if tc.claimed != tc.actual:
        print(str(abs(tc.claimed-tc.actual))+" "+name)

wat8548

Member

almost 2 years ago

Given an implication CSV and an alias CSV (in that order), the following code will produce a list of every tag you could possibly write that would result in pokemon_(species) being automatically added to a post:

import sys
import os
import e6csv

f = open(sys.argv[1], "r")
tree = {}

for implication in e6csv.readrelations(f):
    if implication['status']:
        continue
    antecedent = implication['antecedent_name']
    consequent = implication['consequent_name']
    if implicators := tree.get(consequent):
        implicators.append(antecedent)
    else:
        tree[consequent] = [antecedent]

f.close()

def recurse(tags, implication):
    tags.add(implication)
    if implicators := tree.get(implication):
        for i in implicators:
            recurse(tags, i)

poketags = set()
recurse(poketags, "pokemon_(species)")

f = open(sys.argv[2], "r")

for alias in e6csv.readrelations(f):
    if alias['status']:
        continue
    if alias['consequent_name'] in poketags:
        poketags.add(alias['antecedent_name'])

f.close()

for t in poketags:
    print(t)

Emserdalf

Member

almost 2 years ago

Ayo, this is kinda based, probably. I just skimmed your posts and might use this, because I'm wrangling the posts CSV myself right now. Thanks.

Since the tags aren't in quotes and I don't really care about the sources and description fields right now, but I really just want a file where every individual line is a post's data for easy grepping...

So earlier, I did this cursed business:

# --- thanks stack overflow (this processes lines that don't start with "^\d+,\d*,\d+-" (pcre) and tacks them onto the end of the previous; something like that)
awk '{printf (/^[0-9]+,[0-9]*,[0-9]+-/&&NR>1?RS:x)"%s ",$0} END {print ""}' ~/e6db/posts-2023-02-15.csv > ~/e6db/fixed-2023-02-15.csv

# --- surely nothing could go wrong here
sed 's/"[^"]*"//g' ~/e6db/fixed-2023-02-15.csv > ~/e6db/nostrings-2023-02-15.csv

# --- my metric for "it's probably a nice entry with trivial string contents" was whether or not there's 28 commas to delimit at.......
wc -l ~/e6db/fixed-2023-02-15.csv
awk '{x=$0}gsub(",","",x)==28' ~/e6db/nostrings-2023-02-15.csv | wc -l

# --- or if strings without commas are fine...
awk '{x=$0}gsub(",","",x)==28' ~/e6db/fixed-2023-02-15.csv > ~/e6db/almostnostrings-2023-02-15.csv
# --- for the rest, yank out all the strings once again so that (hopefully??) every line will always contain 28 commas
awk '{x=$0}gsub(",","",x)!=28' ~/e6db/fixed-2023-02-15.csv | sed 's/"[^"]*"//g' >> ~/e6db/almostnostrings-2023-02-15.csv

$ wc -l ~/e6db/posts-2023-02-15.csv ~/e6db/fixed-2023-02-15.csv ~/e6db/nostrings-2023-02-15.csv ~/e6db/almostnostrings-2023-02-15.csv
   10870056 ~/e6db/posts-2023-02-15.csv
    3865980 ~/e6db/fixed-2023-02-15.csv
    3865980 ~/e6db/nostrings-2023-02-15.csv
    3865980 ~/e6db/almostnostrings-2023-02-15.csv

Just kinda grepping the input file for roughly what entry starts look like, and for roughly what the ends of lines look like (three bools, mostly ig) I got to a point where I had the same number for those and in my "fixed" files, so I kinda consider the cursed business done and will use this as my CSV to do lookups on.

Maybe someone else will find my mildly cursed awk/sed magic useful at some point, idk!

[...] This is because I don't do drugs.

also lmao so true

Edit: Aaaand I've just realized that there's a few rare tags on e6 that have double quotes in them and so there's about 1200 or so posts where the tags will be quoted and get yanked in some of my clever filtering. Oh well.

Updated almost 2 years ago

wat8548

Member

almost 2 years ago

emserdalf said:
Edit: Aaaand I've just realized that there's a few rare tags on e6 that have double quotes in them and so there's about 1200 or so posts where the tags will be quoted and get yanked in some of my clever filtering. Oh well.

Was going to point that out, but you beat me to it. Also, wait until you find the tags with commas in them (although currently none have any posts).

SCTH

Member

over 1 year ago

Thank you for this, the DB dumps are a mess.

Used it to find long post chains

import datetime
import os
import e6csv
import gzip
from typing import Dict, Tuple
import shutil


def download_export():
    print("Download starting")
    url = f"https://e621.net/db_export/posts-{datetime.datetime.today().strftime('%Y-%m-%d')}.csv.gz"
    # far faster than any Python-based method I could find
    os.system(f"curl {url} > db_export.csv.gz")
    print("Download complete, unzipping")
    with gzip.open("db_export.csv.gz") as f:
        with open("db_export.txt", "wb") as f_out:
            shutil.copyfileobj(f, f_out)
    print("Unzipping complete")


def parse_csv():
    posts: Dict[int, int] = dict()
    print("Parsing start")
    with open("db_export.txt", mode="rt", encoding="utf-8") as f:
        for post in e6csv.readposts(f):
            if post["parent_id"] is not None:
                posts[post["id"]] = post["parent_id"]
    print(f"Parsing complete. Posts with parents: {len(posts)}")
    return posts


def get_chains(posts: Dict[int, int]):
    print("Chains start")
    chains: Dict[int, Tuple[int, int]] = dict()
    for post_id, parent in posts.items():
        current = post_id
        chain = {post_id}
        while parent is not None and parent not in chain:
            current = parent
            chain.add(parent)
            parent = posts.get(parent, None)
        if (current not in chains or len(chain) > chains[current][0]) and len(chain) > 10:
            chains[current] = len(chain), post_id
    print("Chains end")
    return chains


def main():
    download_export()
    posts = parse_csv()
    chains = get_chains(posts)
    print("Writing output file")
    with open("output.txt", "w") as f:
        for length, chain in sorted(chains.values(), reverse=True):
            print(f"{chain} : {length}", file=f)
    print("Done")


if __name__ == '__main__':
    main()

Fifteen

Member

over 1 year ago

wat8548 said:
If you've ever wanted to do anything with the vast amount of data e621 holds, nothing compares to the daily database exports. Over the last year or so, I've slowly accumulated an extensive suite of homegrown software designed to ease parsing the idiosyncratic not-technically-CSV format the data is provided in, and today I voluntarily spent most of my Sunday rewriting most of what I had written and expanding its capabilities to handle every field available, even the boring ones. This is because I don't do drugs.

The CSV format is pretty standard, though. Quote-doubling to escape quotation marks in text is fairly standard, and is interpreted without any issues by SQlite's .import command (.import --csv posts-2023-07-08.csv posts), Libre Office Calc (if you feel like running out of RAM), or even Python's built-in CSV reader:

def posts_csv_gen():
    with open('posts-2023-07-08.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

If you're willing to download libraries built for that kind of data manipulation, Pandas or Polars are both great choices (though they might feel intimidating at first):

import pandas as pd

# https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
e6_posts_types = {
    "id": int,
    "uploader_id": int,
    "created_at": str,
    "md5": str,
    "source": "string",  # nullable str
    "rating": "category",  # auto-detect enum
    "image_width": 'Int32',  # nullable int
    "image_height": 'Int32',  # nullable int
    "tag_string": str,
    "locked_tags": "string",  # nullable string
    "fav_count": int,
    "file_ext": "category",  # auto-detect enum
    "parent_id": "Int32",  # nullable int
    "change_seq": int,
    "approver_id": "Int32",  # nullable int
    "file_size": int,
    "comment_count": int,
    "description": "string",
    "duration": "Float32",  # nullable float
    "updated_at": str,
    "is_deleted": bool,
    "is_pending": bool,
    "is_flagged": bool,
    "score": int,
    "up_score": int,
    "down_score": int,
    "is_rating_locked": bool,
    "is_status_locked": bool,
    "is_note_locked": bool,
}

post_data = pd.read_csv(
    'posts-2023-07-08.csv.gz', compression='gzip',
    dtype=e6_posts_types, parse_dates=["created_at", "updated_at"],
    true_values=["t"], false_values=["f"], date_format="ISO8601"
)

print(post_data.nlargest(5, 'score'))

[/tr][/tr][/tr][/tr][/table]

	id	uploader_id	created_at	md5	source	rating	image_width	image_height	tag_string	locked_tags	fav_count	file_ext	parent_id	change_seq	approver_id	file_size	comment_count	description	duration	updated_at	is_deleted	is_pending	is_flagged	score	up_score	down_score	is_rating_locked	is_status_locked	is_note_locked
2840617	2848682	112829	2021-07-26 20:52:16.162415	0619ff5a8270aed7b22ca3981b783224	https://www.zonkpunch.wtf/ https://www.furaffinity.net/...	e	1138	640	2021 2d_animation 3_toes 4_toes 5_fingers abs against_surface against_wall ahegao ...	<NA>	33232	webm	<NA>	49170881	35757	53152323	406	"Airlock Lust"	281.791992	2023-07-08 06:10:20.198667	False	False	False	18635	19202	-567	True	False	False
2338206	2346230	112829	2020-07-27 00:19:22.369913	8cc585f75d4c0e909535d53a9541c9fe	https://zonkpunch.wtf	e	1102	620	2020 2d_animation 3_toes <3 <3_eyes all_fours amber_eyes anal anal_penetration ...	<NA>	28027	webm	<NA>	49202939	18776	43909874	297	Enjoy, internet. I hope you like the switch from After Effects to Clip Studio.	200.039993	2023-07-08 07:13:20.605546	False	False	False	15090	15330	-240	True	False	False
3997734	4005902	396514	2023-04-21 17:50:45.007871	0308f480274d60986d871335892ecd43	https://dimwitdog.com/umbreedon/ https://www.newgrounds.com/...	e	1870	1052	2023 2d_animation 4_toes 5_fingers <3 after_anal after_sex after_vaginal ...	<NA>	27958	webm	<NA>	49621484	38571	63758946	286	<NA>	277.873993	2023-07-08 06:58:12.931268	False	False	False	14340	14460	-120	False	False	False
2601276	2609325	112829	2021-02-10 19:04:52.489237	ab71a84c62b9d77fa0eb68c5e64dc165	https://zonkpunch.wtf https://www.furaffinity.net/...	e	1102	620	2021 2d_animation 4_fingers 4_toes 5_fingers <3 <3_eyes abdominal_bulge ...	<NA>	22699	webm	<NA>	49500293	136079	50485923	279	"Saving The Species"	230.039993	2023-07-08 06:30:43.954312	False	False	False	12010	12255	-245	True	False	False
2306776	2314797	142160	2020-07-01 19:02:05.735966	a51d55b12e7a16a37253783e52c4ec21	https://inkbunny.net/s/2195419 https://www.furaffinity.net/...	e	1200	848	2020 2d_animation 3_toes age_difference anal anal_knotting anal_orgasm ...	conditional_dnp -cub -young -shota	23425	webm	<NA>	49269648	38571	20748665	168	Finished animation, with sound!	182.0	2023-07-08 05:00:14.034218	False	False	False	11977	12179	-202	True	False	False

wat8548

Member

over 1 year ago

Well, it's been an embarrassingly long time, but I finally found the rather obvious bug in the rating code. OP updated, but if you already saved the file, just replace the one occurrence of "ratingstr" with "v".

News - Jan 07, 2025 (5 days ago) Click to show.

Topic: DB Export Parsing Suite (for Python 3.10+)

wat8548

wat8548

wat8548

wat8548

Emserdalf

wat8548

SCTH

Fifteen

wat8548

Login to respond »

Over 18?