xml | 6 days ago

    > Specifically, we collected new data created after January 2025, including: [...] new fiction on Archive of Our Own (Various, 2025),

Not sure how to feel about this. From a researcher's point of view, reproducibility is important, but the last time someone publicly collected data from AO3, the community was not very fond of that.

https://huggingface.co/datasets/nyuuzyou/archiveofourown/dis...

Aedelon|6 days ago

Yeah, that HF dataset page is rough: 247+ threads, mostly DMCA reports, with archive-locked fics scraped without consent and the dataset reuploaded after a takedown. The AO3 community had every reason to be furious.

This isn't RWKV-specific, though. Most large corpora include the same sources; they just don't list them explicitly. Whether the transparency makes it better or worse is a real question.