camkego | 11 days ago

The real cherry on top is that the link in the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.

https://www.kaggle.com/datasets/shubhammaindola/harry-potter...

More than just using the data, it seems linking to a copy that claims the dataset is public domain would be problematic copyright-wise.

Also interesting: this blog post has been up since November 2024. It's very surprising to me that Microsoft hasn't taken it down yet.

throwaway2037|10 days ago

Wow, that is a great catch. I looked at the Kaggle page. It has been up for two years. From the hamburger menu (top right), I tried: Report Dataset. When I click the button "Report illegal content", I am redirected to a Google page (huh?): https://support.google.com/legal/troubleshooter/1114905?prod...

When I try to fill out the questionnaire, my request is rejected with this message:

    We understand that you are not legally authorized to file a copyright complaint on behalf of the copyright owner.

    In accordance with applicable copyright laws, we only accept copyright complaints from copyright owners or their authorized representatives. If you have legal questions about copyright law, please consult your own legal counsel.

    We are sorry we cannot assist you further.

Hysterical. What a farce. That data set is pure theft.

throawayonthe|10 days ago

I'm not sure why you think it's a farce though, not allowing third parties to file complaints.

(E.g. see YouTube, where this is (used to be?) poorly enforced; it's a mess.)

Sohcahtoa82|10 days ago

Allowing third parties to open copyright complaints on behalf of the copyright owner opens a massive can of worms and is incredibly ripe for abuse.

nonfamous|10 days ago

Kaggle is part of Google.

ChoGGi|9 days ago

Welp, somebody certainly noticed now.

fxwin|11 days ago

> it seems linking to a copy that claims the dataset is public domain would be problematic copyright-wise.

Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

DSMan195276|10 days ago

> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Yes, there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle: the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.

If you want to get more specific on the legal side, copyright infringement does not require that you _knew_ you were infringing on the copyright. It's still infringement either way, and you can be made to pay damages. It's entirely on you to verify the license.

Retr0id|11 days ago

> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Why wouldn't that apply?

rob_c|10 days ago

The article author and the uploader should _BOTH_ be sentient enough to engage brain and not just ignore it because they feel "it's an abstract concept I'd not get in trouble for when not working in the US or EU".

pavon|10 days ago

Copyright infringement is a strict liability tort in the US. Willful infringement can result in harsher penalties, but being mistaken about the copyright status is not a valid defense.