Topic: db_export missing data?

Posted under e621 Tools and Applications

Hey, does anyone know why the posts CSV files at https://e621.net/db_export/ are missing data? I downloaded posts-2023-12-20.csv.gz and extracted it, but it only goes up to post number 479700. That's nowhere near complete, seeing as there are currently 4486683 posts, almost 10 times more than in the db export.
I tried downloading another posts file, since there were multiple, but they seem identical.

Also is there some sort of list of "artist" tags that aren't artists? Like "conditional_dnp", "third-party_edit" and so on?

Thanks

I just downloaded posts-2023-12-20.csv.gz and the final line in the CSV file is post #4483340, which sounds about right to me.

4483340,639610,2023-12-20 07:37:44.543204,990144f4cd79c1762fe25391eaefdbf2,https://twitter.com/moraysupreme,e,1000,1050,andromorph andromorph/male animal_genitalia animal_penis anthro avian balls blush bodily_fluids butt cake_the_gryphon canid canine canine_genitalia canine_penis canis cum cum_in_pussy cum_inside digital_media_(artwork) doggystyle domestic_dog duo eggman_logo ejaculation electronics female female_penetrated from_behind_position fur genital_fluids genitals gryphon hair intersex intersex/male knot knotting male male/female male_penetrating male_penetrating_female mammal moraysupreme_(artist) mythological_avian mythology nude pawpads penetration penile penis pussy sega sex simple_background sonic_the_hedgehog_(series) tail trans_(lore) trans_man_(lore) vaginal vaginal_penetration,,1,png,4483338,52357491,,82376,0,"",,2023-12-20 07:37:55.128468,f,t,f,2,2,0,f,f,f

Are you opening the file in plaintext or trying to parse it with something? There are a few little quirks in the data that upset quite a few parsers that aren't exactly RFC 4180 compliant.

Also is there some sort of list of "artist" tags that aren't artists? Like "conditional_dnp", "third-party_edit" and so on?

The closest thing you've got to an official list is the exclusion list used for the humanized page titles. That is, however, missing third-party_edit.
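For anyone filtering locally, here is a minimal sketch of applying such an exclusion list. The set below is only a placeholder containing the two names mentioned in this thread, not the real exclusion list, and `real_artists` is a hypothetical helper name.

```python
# Placeholder only: the two non-artist names mentioned in this thread.
# The actual exclusion list is longer and should be substituted here.
NON_ARTIST_TAGS = frozenset({"conditional_dnp", "third-party_edit"})


def real_artists(artist_tags, excluded=NON_ARTIST_TAGS):
    """Return the artist-category tags that are not in the exclusion set,
    preserving their original order."""
    return [tag for tag in artist_tags if tag not in excluded]
```

Passing the exclusion set as a parameter makes it easy to swap in a fuller list later without touching the function.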

Damn, I am blind apparently; I had the wrong file open. I had set up a script that throws away everything except the ID and tags, and for some reason it had only processed up to post 479700, so the issue is with that script, not the database export from e621. I was looking at that trimmed CSV file instead of the original. My bad.
Turned out my script was hitting a field limit. Sorry if I wasted anyone's time.
Thanks for the help and the exclusion list.
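
In case it helps someone else, here is a rough sketch of the trimming step with the limit raised. It assumes Python's built-in csv module (whose default per-field limit is 131072 characters) and assumes the export's header names the columns `id` and `tag_string`; both names are parameters, so adjust them if the header differs. `extract_id_and_tags` is just an illustrative name.

```python
import csv


def extract_id_and_tags(src, dst, id_col="id", tags_col="tag_string"):
    """Copy only the ID and tag columns from src to dst.

    src and dst are open text file objects. Returns the number of
    data rows written (the header is not counted).
    """
    # Raise the per-field limit so very long fields (for example long
    # descriptions) do not trigger "field larger than field limit".
    # 2**31 - 1 also fits in a C long on platforms where sys.maxsize does not.
    csv.field_size_limit(2**31 - 1)

    reader = csv.DictReader(src)  # the first line is treated as the header
    writer = csv.writer(dst)
    writer.writerow([id_col, tags_col])

    count = 0
    for row in reader:
        writer.writerow([row[id_col], row[tags_col]])
        count += 1
    return count
```

To run it on the compressed export directly, the input can be opened with `gzip.open(path, "rt", newline="", encoding="utf-8")` and the output with `open(out_path, "w", newline="", encoding="utf-8")`. Comparing the returned count and the last ID written against the source file is a quick check that nothing was cut short.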

Reminds me - time for another backup.
We probably should have diffs that are incremental backups, but meh...
