Show HN: CC Search – search engine for 300M CC-licensed images

375 points| kgodey | 7 years ago |search.creativecommons.org | reply

87 comments

[+] kgodey|7 years ago|reply
I'm Director of Engineering at Creative Commons and part of the team that is working on CC Search.

We've been working on the product for over a year and we are just out of beta today! One of CC's goals is to encourage the use and remixing of CC-licensed content, and we hope that CC Search will help make that content more discoverable. The current version is very much an MVP and only searches images, but we plan to add more content types in the future and index the ~1.4 billion works out there under a CC license. We would love any feedback you might have.

Also, CC Search, the associated API, and the scripts we use to index data are all open-source and developed completely openly. Our sprints and roadmap are public and we welcome contributions from the community.

Relevant links:

CC Search code: https://github.com/creativecommons/cccatalog-frontend/

CC Catalog API code: https://github.com/creativecommons/cccatalog-api/

CC Catalog code: https://github.com/creativecommons/cccatalog/

2019 Vision: https://creativecommons.org/2019/03/19/cc-search/

Roadmap: https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egz...

Active sprint: https://github.com/orgs/creativecommons/projects/7

Backlog: https://github.com/orgs/creativecommons/projects/10

How to contribute to CC projects: https://creativecommons.github.io/contributing-code/

[+] detaro|7 years ago|reply
If I put images on the internet, how can I tell your search about them and about their license?
[+] spectaclepiece|7 years ago|reply
Excellent project, this simplifies the process I currently have of searching many different sources and applying a cc filter (mostly Flickr and Google images).

The two image filters I use most often there are image size (larger than) and orientation (portrait or landscape). If these would be included here it would be perfect.

[+] Ao7bei3s|7 years ago|reply
Nice project! Thanks for working on it!

It would be great if there were more search filters. Especially size ("larger than") and date ("within x-y", "last 1 day/week/month/year").

And when you have filters anyway, filtering by EXIF data (geoposition, and typical photography details like camera model / aperture / focal length) would be pretty cool.

[+] mendeza|7 years ago|reply
This is nice. I can see this being used by computer vision researchers and practitioners to train CNNs. Have you thought of implementing a reverse image search engine? That would be a great feature or project to work on, since tags tend to be noisy and it would help users find the images they are interested in.
[+] pbhjpbhj|7 years ago|reply
Do you mind me asking how many man-years the project took? Is there an article on/diary covering the dev decisions?
[+] scanny|7 years ago|reply
Out of curiosity, are there any checks run on these images to see if they really are CC?

This search result for 'satellites' is a screenshot of a google maps result, which includes the copyright information for the imagery within the CC image [see the bottom right attribution text in the imagery].

How can that really be in CC?

https://search.creativecommons.org/photos/3f0eddb0-55b4-46e0...

How could we be sure that the images themselves are genuinely licensed rather than someone copying it from another source and slapping on the licence?

[+] kgodey|7 years ago|reply
We do our best to verify the license but cannot 100% guarantee that the image is CC-licensed. We have custom scripts that we use to ingest content from each content provider. For Flickr, we use their API which contains license information for each work. If you click through to the source image, you can see that the user who uploaded the image did license it under CC BY-SA 2.0.

We don't yet have a way of dealing with content that the user licensed incorrectly. I think we could add a "report image" function that would help us identify and remove this type of content. There's a disclaimer if you scroll down that says "Verify at the source: Flickr" with an explanation of why we can't guarantee the license. Maybe we should make that more prominent.
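
To make the Flickr step concrete, here is a minimal sketch of how a per-provider ingestion script might turn the license field on a Flickr photo record into a CC license code. It is not the actual CC Catalog code. The numeric IDs are an assumption based on the list Flickr publishes through `flickr.photos.licenses.getInfo` and should be checked against that call before being relied on; the function name and record shape are hypothetical.

```python
# Assumed mapping from Flickr's numeric license IDs to (license code, version).
# Verify against flickr.photos.licenses.getInfo; IDs not listed here
# (e.g. "0", All Rights Reserved) are treated as not openly licensed.
FLICKR_LICENSES = {
    "1": ("by-nc-sa", "2.0"),
    "2": ("by-nc", "2.0"),
    "3": ("by-nc-nd", "2.0"),
    "4": ("by", "2.0"),
    "5": ("by-sa", "2.0"),
    "6": ("by-nd", "2.0"),
    "9": ("cc0", "1.0"),
    "10": ("pdm", "1.0"),
}


def license_from_flickr(photo):
    """Return (code, version) for a photo record, or None if it should be skipped.

    `photo` is a hypothetical dict shaped like one entry of an API response
    that includes the license field, e.g. {"id": "123", "license": "5"}.
    """
    return FLICKR_LICENSES.get(str(photo.get("license")))


if __name__ == "__main__":
    print(license_from_flickr({"id": "123", "license": "5"}))  # CC BY-SA 2.0
    print(license_from_flickr({"id": "456", "license": "0"}))  # skipped
```

Note that this only records what the uploader declared. It cannot detect the case raised above, where the uploader applied a license to an image they had no right to license.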

[+] antpls|7 years ago|reply
The same question could be asked of code posted on GitHub, yet it doesn't seem to be a problem in practice. Everyone just trusts the license header in the files.
[+] Sommer|7 years ago|reply
Definitely appreciate the programmatic access, but in terms of straight search results it's very hard to improve on Google Image Search with the Usage Rights filter. Info: https://support.google.com/websearch/answer/29508?p=ws_image...
[+] kgodey|7 years ago|reply
Thanks for the feedback! A big chunk of our upcoming work is going to be towards improving search relevance. We also plan to add content types other than images this year, starting with open textbooks and audio, so that will differentiate us.
[+] rixrax|7 years ago|reply
Thank you for doing this. I hope one day we will see a similar search for CC-licensed music where you can search based on license type and, e.g., music style, beats per minute, length, audio format, etc.

I was recently trying to find music for a side project, and searching for CC-licensed music appears to have gone from somewhat onerous to very onerous on sites like Jamendo.com, Free Music Archive, etc. This is off topic, but if someone can point to a website with CC-licensed music where it’s possible to search based on license type (e.g. BY-NC-SA), I would love to know.

[+] kgodey|7 years ago|reply
Thanks for your feedback! We plan to add audio to CC Search later this year (probably in Q4). We will have searching by license type built in from the beginning and those all sound like great filters to add.
[+] tomcam|7 years ago|reply
Remember that for commercial use you still need a model release in the USA when faces are identifiable. It is totally separate from the licensing of the image.
[+] pbhjpbhj|7 years ago|reply
What's the relevant USC for that please?
[+] not2b|7 years ago|reply
I arbitrarily searched for "hula hoop" and most of the top images have people in them. People using this facility to find images to freely use would be advised to avoid images of recognizable people, because you don't know if they signed model releases or consent to being a part of your web site.
[+] kgodey|7 years ago|reply
Thanks for the feedback! We will figure out if there is a way to communicate this better.
[+] z92|7 years ago|reply
Excellent job. Thanks for all the hard work.

A few usability issues: 1/ It doesn't work without JavaScript, and 2/ at a screen resolution of 800x600 the search field gets completely hidden, as shown below.

https://imgur.com/a/bPkdOoT

[+] kgodey|7 years ago|reply
Thanks for your feedback!
[+] pbhjpbhj|7 years ago|reply
It's interesting to me how many images are lost when you filter by "right to modify". I wonder if "CC-SA" should be the default licence, ie should be what "CC" means.

Similarly if the default CC license were "NC" then I imagine many more shared images would be excluded from commercial use.

My suggestion is that people probably wouldn't mind modification of their CC images as a default.

I usually use CC-BY-SA.

Edit: looking afresh at the CC license material, it appears my understanding of the licenses is weak (e.g. see down-thread) and that the default does allow modification. Which makes it weirder that people would go out of their way to specify that their pretty poor quality images could only be used as-is and not modified.
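
For anyone else unsure which licenses survive the "modify" and "commercial use" filters, here is a rough sketch of the rule in code. It rests only on the meaning of the license elements: ND (NoDerivatives) rules out modification, NC (NonCommercial) rules out commercial use, and SA (ShareAlike) allows modification but requires derivatives to carry the same license. The function names and record shape are hypothetical, not how CC Search implements its filters.

```python
def permissions(license_code):
    """Derive what a license allows from its elements, e.g. 'CC-BY-NC-SA'."""
    code = license_code.lower()
    if code.startswith("cc-"):
        code = code[len("cc-"):]  # accept both 'cc-by-sa' and 'by-sa'
    parts = code.split("-")
    return {
        "attribution_required": "by" in parts,
        "commercial_use": "nc" not in parts,
        "modification": "nd" not in parts,
        "share_alike": "sa" in parts,
    }


def filter_images(images, modify=False, commercial=False):
    """Keep only images whose license permits the requested uses."""
    kept = []
    for image in images:
        allowed = permissions(image["license"])
        if modify and not allowed["modification"]:
            continue
        if commercial and not allowed["commercial_use"]:
            continue
        kept.append(image)
    return kept


if __name__ == "__main__":
    sample = [
        {"title": "a", "license": "by"},
        {"title": "b", "license": "by-sa"},
        {"title": "c", "license": "by-nd"},
        {"title": "d", "license": "by-nc-nd"},
        {"title": "e", "license": "cc0"},
    ]
    # Only the ND-licensed images drop out when asking for the right to modify.
    print([i["title"] for i in filter_images(sample, modify=True)])
```

So an image only disappears under the "modify" filter if its uploader picked one of the ND variants; plain BY and BY-SA both pass.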

[+] puranjay|7 years ago|reply
I've always wondered: what does "modification" include? Is resizing modification? Cropping? If I slap some text on the raw image, would that constitute modification as well?
[+] empressplay|7 years ago|reply
There are _a lot_ of CC licensed images coming up through Flickr search that are nowhere to be found in CC Search and since the vast majority of CC Search's images come from Flickr, you would be better off just searching on Flickr...
[+] kgodey|7 years ago|reply
Thanks for the feedback! We are working with Flickr to ensure that we are able to index all their CC-licensed images; currently there's an issue with their API that hides some images. It should be resolved by the end of the summer.

We have images from a lot of other collections, such as the Met, Rijksmuseum, Behance, Thingiverse etc. Flickr has more images than any of them by a couple of orders of magnitude, though.

[+] aurelwu|7 years ago|reply
Nice. I have mostly been using Bing as a search engine for public domain / CC0 images and it worked reasonably well, but a search engine specialized in this area is very useful. What I immediately noticed is that when you search for something including "icon" it shows lots of images of religious icons, which makes sense in a way but is probably not what most people want if they search for "fireball icon" or "mail icon" or "car icon".
[+] kgodey|7 years ago|reply
Thanks for the feedback! A big chunk of our upcoming work will focus on improving search results.
[+] Amorymeltzer|7 years ago|reply
Pretty neat! Another great resource is the 53 million or so files on Wikimedia Commons — https://commons.wikimedia.org/wiki/ — which don't appear to be searched by this. Commons has plenty of material that is PD, so I suppose that is technically out of scope, but it also restricts licenses to CC-BY or CC-BY-SA whereas some of these are NC.
[+] kgodey|7 years ago|reply
Thank you! Wikimedia Commons is our next big priority. We do also index PD content.
[+] cannedslime|7 years ago|reply
It is always nice to have search engines for free stuff. I have minor gripes with the engine.

- Browser navigation doesn't work. Going back doesn't take you to your previous search query.

- Search seems to be solely keyword based? And keywords are kind of hit and miss on many images.

[+] kgodey|7 years ago|reply
Thanks for the feedback, we'll add browser navigation to our list of things to fix. Our search is still pretty naive; a big chunk of our upcoming work will be focused on improving results.
[+] DOsinga|7 years ago|reply
This is cool and it is great that the software is all open source, but randomly clicking around it does seem that an awful lot of images all come from Flickr, with the images even being hosted by Flickr. It seems to me the interface could make the source more clear.
[+] kgodey|7 years ago|reply
Thanks for your feedback! We don't host any of the images, we link people directly to the source. Flickr does have orders of magnitude more images than other providers. We will add making the source more prominent to our list of issues to fix.
[+] 200_OK|7 years ago|reply
This is great. Is this all CC0 or do some of the images have conditions?

My wishlist would also include videos and music/audio, plus an API to access it all. I'm sure that's a big ask though.

[+] s_y_n_t_a_x|7 years ago|reply
I'd remove the unnecessary animations when loading the images.
[+] kgodey|7 years ago|reply
Thanks for the feedback! This is on our list to do soon, we've had other complaints.
[+] mikece|7 years ago|reply
Neat... though searching on "F-14" returned a lot of yoga images instead of the naval fighter (searching "F14" returned what I was looking for).
[+] kgodey|7 years ago|reply
Thanks for the feedback! Currently the search query parsing is pretty naive and some search terms work far better than others. A big chunk of our work over the next few months is going to be focused on improving search relevance.
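
As an illustration of the kind of query parsing improvement this could involve, here is a minimal sketch in which hyphenated and unhyphenated spellings of a term ("F-14" and "F14") are normalized to the same token before matching. This is an assumption about one possible approach, not a description of how CC Search parses queries or why "F-14" returned unrelated results; the function names are hypothetical.

```python
import re


def normalize_terms(text):
    """Lowercase the text and join hyphenated tokens, so 'F-14' becomes 'f14'."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [token.replace("-", "") for token in tokens]


def matches(query, title):
    """True if every normalized query term appears among the title's terms."""
    title_terms = set(normalize_terms(title))
    return all(term in title_terms for term in normalize_terms(query))


if __name__ == "__main__":
    print(normalize_terms("F-14 Tomcat"))
    print(matches("F-14", "Grumman F14 Tomcat"))
    print(matches("F-14", "Yoga pose 14"))
```

The trade-off is that joining on hyphens also turns ordinary compounds such as "well-known" into a single token, so a real system would likely index both the joined and the split forms.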
[+] tsumnia|7 years ago|reply
Agreed, "ninja" came up with a lot of dogs and cats (presumably named Ninja). And "test" provides a weird assortment of non-tests. I'd say filename keyword matching might be dangerous and some measure of image analysis should be applied.
[+] reiinakano|7 years ago|reply
I did a search for Flamingo and there were no flamingos...