Leveraging machine learning to fuel new discoveries with the ArXiv dataset

[+] techbio|5 years ago|reply

This is a long time coming. Glad that they've done the lifting to expose these resources to higher-order analysis.

I took a look at this potential a couple of years ago but on PubMed:

https://techbio.org/b-tracing-psych-signals-lit.php

[+] gabcoh|5 years ago|reply

Is there any potential for something like this to be added to GPT-4's training data set? Being far from an expert, I assume the figures and any non-text data could pose a problem, but it also seems like providing it with a huge, high-quality source of scientific reasoning could lead to some pretty amazing and revolutionary results: GPT inspiring research directions or even coauthoring. Does anyone who knows more about this have any thoughts? Have I just drunk the GPT Kool-Aid, or is there some potential for it to revolutionize science quite soon?

[+] phreeza|5 years ago|reply

My money is on the next iteration of GPT being multi-modal (see image-gpt), so this kind of thing would fit right in. I don't think this would lead to the kind of scientific revolution you are thinking of, given GPT's tendency to confabulate things instead of basing itself on facts. That may be entering Kool-Aid territory :)

[+] woah|5 years ago|reply

Could be interesting, and maybe useful, but the distinctive thing about scientific papers is that they tend to be reports on things that happened in the real world. Might be useful on something self-referential like math or philosophy though.

[+] jessriedel|5 years ago|reply

I thought the arXiv was already included in GPT-3's training data.

[+] mellosouls|5 years ago|reply

For anybody on mobile thinking the link has redirected to the index, you need to scroll way down to read the actual article.

[+] ajflores1604|5 years ago|reply

Thank you

[+] physicsgraph|5 years ago|reply

The value of content tagging (e.g., PhysML) and keyword tagging (e.g., ScienceWise) is apparent in aggregate, like for searching. That benefit is to consumers while the burden is currently on the content creators.

I don't know of any incentives within academia or grant processes that would motivate content authors to tag their content, with the exception of the creators of the tagging systems. That means bulk analysis (whether using a grammar or a natural language machine learning method) is key.

Citation graphs (which is mostly what's been done previously[0]) pale in comparison to text analysis. The possibility of enabling complex searches would be a big leap forward in science.

[0] https://physicsderivationgraph.blogspot.com/2020/05/literatu...

[+] physicsgraph|5 years ago|reply

There have been efforts to tag keywords in arXiv [0] and to identify sections of articles [1]. A conference on Mathematical Knowledge Management [2] was held last week; some participants have already been analyzing ArXiv.

Hopefully integration with Kaggle expands the number of teams taking advantage of the knowledge in the corpus.

[0] http://sciencewise.info/ [1] https://github.com/OMdoc/OMDoc/wiki/PhysML [2] https://cicm-conference.org/2020/cicm.php

[+] iandanforth|5 years ago|reply

This is really cool. I worked on the AI Index Report and had to bug karpathy to get his copy of arxiv papers to do analysis. (He's been slowly collecting papers for arxiv sanity for years). Getting them all from the arxiv API would have taken months.

This will enable tons of useful stats gathering about the fields represented on arxiv. Hopefully it will lead to new scientific insights as well!
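To make the "stats gathering" idea concrete, here is a minimal sketch that counts papers per category. It assumes the metadata comes as one JSON object per line with a space-separated `categories` string; that layout, the field name, and the sample records are assumptions for illustration, not something confirmed in this thread.

```python
import json
from collections import Counter
from typing import Iterable


def count_categories(lines: Iterable[str]) -> Counter:
    """Count papers per category from JSON-lines metadata records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: "categories" is a space-separated string, e.g. "hep-th cs.LG".
        for category in record.get("categories", "").split():
            counts[category] += 1
    return counts


# Tiny made-up sample standing in for the real metadata file.
sample = [
    '{"id": "a", "categories": "cs.LG stat.ML"}',
    '{"id": "b", "categories": "hep-th"}',
    '{"id": "c", "categories": "cs.LG"}',
]
top = count_categories(sample).most_common(1)
```

Since the function takes any iterable of lines, an open file handle can be passed in directly, so a large file is read one line at a time instead of being loaded into memory.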

[+] newman8r|5 years ago|reply

arXiv also offers bulk access to the papers; you can download them via S3 "requester pays":

https://arxiv.org/help/bulk_data
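For anyone who hasn't used a requester-pays bucket before, a rough sketch with boto3 follows. The bucket and key are passed in by the caller and are placeholders here (check the bulk data page above for the real layout); the helper names are made up for this example. The one part specific to requester pays is the `RequestPayer` argument, which means your own AWS account is billed for the transfer.

```python
def requester_pays_request(bucket: str, key: str) -> dict:
    """Build keyword arguments for an S3 GetObject call on a requester-pays bucket."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def fetch(bucket: str, key: str, dest: str) -> None:
    """Download one object to dest; needs AWS credentials, and the caller pays."""
    import boto3  # third-party; imported lazily so the helper above works without it

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_request(bucket, key))
    with open(dest, "wb") as out:
        # Stream the body in chunks rather than reading it all into memory.
        for chunk in response["Body"].iter_chunks():
            out.write(chunk)


# Hypothetical call; bucket and key are placeholders, not real arXiv paths:
# fetch("example-bucket", "path/to/object.tar", "object.tar")
```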

[+] etaioinshrdlu|5 years ago|reply

Did you process the full text of the papers?

[+] vansul|5 years ago|reply

Very cool, I really look forward to seeing science move forward along these paths. All the same, I have to post the obligatory (and recent) xkcd rebuttal: https://xkcd.com/2341/

[+] bryanrasmussen|5 years ago|reply

yeah, right, getting graphs from Polaroid photos into excel, nobody would find that an interesting task to do.

on edit: obviously it also sounds quite a bit like you might need a research team and 5 years to do it right https://xkcd.com/1425/

[+] canjobear|5 years ago|reply

I'm a scientist and my problems are a lot more like the first panel than the second.

[+] djaque|5 years ago|reply

Wow, that's the perfect XKCD for my work right now. I'm integrating ML into my very traditional field of science and it seems like the most fitting place is on the "boring" problems.
