top | item 44038359

searchcord|9 months ago

Hey,

Thanks for your suggestions.

> 1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)

The original iteration of Searchcord worked similarly to that: the username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, that made chats pretty hard to follow. I will try your idea and see how it works, though.
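Roughly what I have in mind for trying it: derive the readable words from the same `sha256(userid+guildid)` digest, so the pseudonym stays stable within a server but can't be reversed. (The word lists and suffix format here are just placeholders, not what Searchcord actually ships.)

```python
import hashlib

# Placeholder word lists; a real deployment would use much larger ones
ADJECTIVES = ["amber", "brisk", "coral", "dusty", "eager", "frosty", "gentle", "hazel"]
NOUNS = ["falcon", "harbor", "lantern", "meadow", "otter", "pebble", "quill", "summit"]

def word_pseudonym(user_id: str, guild_id: str) -> str:
    """Stable, human-readable pseudonym derived from the same
    sha256(userid+guildid) digest the original 8-char scheme used."""
    digest = hashlib.sha256((user_id + guild_id).encode()).digest()
    # Pick each word (plus a short numeric suffix) from separate digest bytes
    adj = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    suffix = int.from_bytes(digest[2:4], "big") % 100
    return f"{adj}-{noun}-{suffix}"
```

Same input always yields the same name within a guild, and the same user gets a different name in a different guild, which keeps the cross-server-tracking protection of the hash while being easier to follow in a conversation.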

> 2) You do not seem to have clear rate limit documentation.

This is a good idea. The rate limit varies by endpoint, and I haven't gotten around to documenting each one.

> If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.

I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.

> 3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.

I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.

sReinwald|9 months ago

The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

And "privacy preserving"? With a one-click opt-out that 99.999% of the affected users will never even know exists, because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

Look, I get the frustration that (likely) motivated this. Discord has become an information black hole for many communities, and the shift away from open, searchable forums for project support is a genuine problem I've been incredibly frustrated with myself. But this "solution" - creating a massive, non-consensual archive that tramples over user privacy (and platform terms) - creates far graver ethical and practical issues than the one it purports to solve.

xk_id|9 months ago

> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context

Honestly, maybe they should. Maybe we need more stuff like this, until people finally wake up about the privacy catastrophe. The now-defunct service spy.pet used to sell this kind of data with the stated purpose of doxxing people. There are black markets for this. And it's the same kind of data the service providers themselves have full access to.

searchcord|9 months ago

> The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

Not really; it is not free to host and serve this data. If they want the data for free, they can get it directly from Discord. I did that work for them.

> And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.

> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

I believe people need to realize that their messages were already being logged by many different moderation bots, just not publicized. This also happens on platforms like Telegram; look at SangMata_BOT, for example. Unless messages are end-to-end encrypted, it was just a matter of time before they were scooped up and archived.

Thanks for your input, though, I really do want to build a platform that balances privacy and usability.

deakam|9 months ago

Ridiculous take. If you're posting in a server that's intentionally open to the public and accessible to anyone with a link, or even indexed by server discovery, you shouldn't expect privacy. That's just the basic reality of the internet.

johnQdeveloper|9 months ago

> In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.

I suggest you do, since tbh you are likely (as others have said) violating privacy laws with your current implementation + the Discord ToS. If it's anonymized better, you are less likely to become a target of someone who gets angry about not knowing you exist.

Up to you; your life, your circus, y'know?

> I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.

LLM data collection, if it's not already being bought directly from Discord.

Same reason I'd want to use highly anonymized and curated data from the roleplay / writing discords as training data. It's just that I'd have to anonymize and curate/clean up your data before I'd dare send it to an LLM, for legal reasons.

If I send/share PII, I'd be screwed just like you will be if someone gets upset.
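To give a rough idea of what that scrubbing pass looks like on my end before anything goes near an LLM (the patterns and replacement tokens here are illustrative, nowhere near exhaustive, and regex alone isn't sufficient for real PII removal):

```python
import re

# Illustrative patterns only; real curation needs far more than regex
PATTERNS = [
    (re.compile(r"<@!?\d+>"), "@user"),                   # Discord user mentions
    (re.compile(r"<#\d+>"), "#channel"),                  # Discord channel mentions
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[email]"),  # email addresses
    (re.compile(r"https?://\S+"), "[link]"),              # URLs, incl. invite links
]

def scrub(text: str) -> str:
    """Best-effort removal of obvious identifiers from a message."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Even with a pass like this, I'd still want a human curation step on top; this only catches the mechanical stuff.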

> I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.

Fair. For me, this is for hobby implementations of solo roleplaying content, similar to AI Dungeon, so it's not commercial. But my use case (for your purposes) would be better served by being able to download a database dump (properly anonymized, either by you or by me) for specific servers, since most of the data you collect is useless to me. I've got a specific goal in mind and want to minimize data collection for legal liability reasons. (i.e. non-commercial roleplaying with no PII or other privacy-risky info is likely a safe use case)

EDIT:

I'd also consider dropping attachments + links and recording only text, because of CSAM and other abusive material. I doubt you have the moderation in place to protect yourself.

Pictures, videos, and whatnot are a lot more dangerous to you than text would be. (i.e. despite what people say about it, realistically, most text in a public forum on the internet w/o PII is not going to get you hit with fines)

That said, personally, I would not publish this as you have, because I don't have that kind of risk tolerance, but I can see it being "safe enough" for some people. The images/attachments, though, are in "are you really sure you want to do that? You could go bankrupt" territory.