Covid-19 Open Research Dataset

[+] axegon_|6 years ago|reply

Call me cynical but isn't this a bit... Redundant? I mean sure, nlp, parse some papers and make some sort of search/q&a type of thing? Fine, whatever.

Nice work on organizing the data and to a fair degree taking care of the tedious pre-processing associated with any ml task but... Idk, I fail to see how this could help. Personally I'll fiddle around with it in my spare time but mostly for fun, I don't expect anything significantly useful to come out of it.

On a higher level I feel like the success of the IT industry has made a lot of people feel like it is the answer to every possible question. And sure, bioinformatics is an incredible subject and I've invested a lot in studying it in my spare time in the last couple of years, but __purely__ out of curiosity. But my 2 cents on the subject is that we (developers, ML engineers, etc.) cannot provide the answer to the meaning of life. In this situation our safest course of action is to work from home as much as possible and avoid making decisions that could potentially make the situation worse. Basically protect ourselves and those around us as much as possible. And by doing so, let people who have the adequate training, knowledge and experience take care of the situation. Sure, play around with it in your spare time, and if you come up with something - share it. But the whole "stand back boys, let us men handle it" mentality is what really bugs me. If anything, history has taught us that this goes badly 990 out of 1000 times (to be more in line with Bayesians). We all want the underdog to win the game but... Come on...

[+] gillesjacobs|6 years ago|reply

> I mean sure, nlp, parse some papers and make some sort of search/q&a type of thing? Fine, whatever.

I think you severely underestimate the effort and expertise required for getting decent content-aware search, let alone a fully functional question-answering pipeline.

These are their own subfields of research in text mining. The fact that you conflate these as if they were some trivial task on a new dataset shows your lack of understanding of the field.

Biomedical text mining [1] is a wide subfield with plenty of open datasets and competitions such as the bi-annual ACL BioNLP workshop [2]. Furthermore, existing knowledge-base creation and information extraction pipelines such as protein-protein interaction extraction, NER, event extraction, drug-drug interaction mining, etc. could be applied to this novel dataset and provide useful insights for researchers and staff.

[+] scastillo|6 years ago|reply

Yeah, there's some criticism about this feeling improvised, but guess what? It IS improvisation.

Remember what Hickey says about improvising? (https://www.infoq.com/presentations/Design-Composition-Perfo...)

It is art and years of preparation right there performing real-time.

So that guy criticizing is only declaring himself incapable of improvising on this matter.. and it's ok, it's only for the best ones after all. It is called a challenge for a reason.

[+] markholmes|6 years ago|reply

Redundant how? You say you don’t see how this could help, but the opposite - not doing it - is certain not to help. I don’t understand your criticism and cynicism here.

[+] PierredeFermat|6 years ago|reply

I very much agree here. Have recently run part of the data through some NLP (incl. AWS Comprehend) and nothing significant came out. Ended up doing simple free-text or keyword search and only landed on one interesting finding so far: https://twitter.com/yazijys/status/1240465780715683841

[+] _kbp8|6 years ago|reply

It takes downloading one of these files, gunzipping them, extracting the tar and opening up a JSON file at random to really understand just how distant the title of the project is from its contents. I fully realize this is about natural language processing, but... this is beyond reach. You'd have to teach a computer to become a doctor first.
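For anyone who wants to repeat that experiment without clicking around, here is a minimal Python sketch using only the standard library. The archive path is a placeholder you would supply yourself, and nothing is assumed about the JSON layout beyond it being valid JSON; the helper names are made up for illustration.

```python
import json
import random
import tarfile


def random_json_member(archive_path, seed=None):
    """Open a .tar.gz archive and return (member name, parsed JSON)
    for one randomly chosen .json file inside it."""
    rng = random.Random(seed)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # Only regular files ending in .json are candidates.
        members = [m for m in tar.getmembers()
                   if m.isfile() and m.name.endswith(".json")]
        if not members:
            raise ValueError("no JSON files found in archive")
        member = rng.choice(members)
        with tar.extractfile(member) as fh:
            return member.name, json.load(fh)


def top_level_keys(doc):
    """List the top-level keys of a parsed JSON object (empty if not an object)."""
    return sorted(doc) if isinstance(doc, dict) else []
```

Calling `random_json_member("some_archive.tar.gz")` and then `top_level_keys(...)` on the result is a quick way to see what a single record actually contains before deciding whether any NLP is worth attempting.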

Feels like where AI (and computing in general) should go in case of Covid-19 and other illnesses is analyzing and simulating how the heck we work to begin with. Take a few minutes to watch these videos of supercomputer simulations that show -- fragments -- of the fundamentals of how we exist.

Multi Scale Modeling of Chromatin and Nucleosomes - https://www.youtube.com/watch?v=4Z4KwuUfh0A

DNA animation showing realtime DNA replication - https://www.youtube.com/watch?v=7Hk9jct2ozY

Somewhere in these mind-boggling processes is where the disruption called Covid-19 puts a stick in our wheels. Compared to the complexity of the simulations, the abstract ideas contained in these files are so macroscopic. It's both humbling and awe-inspiring.

[+] typon|6 years ago|reply

Even though the Multi Scale Modeling video is extremely impressive, it is still an MD simulation that uses classical mechanics. A full atomistic quantum simulation of such a large system is out of reach for even the largest supercomputers. We barely know anything about biology.

[+] clircle|6 years ago|reply

How can applying a machine learning algorithm to this data (a collection of research papers) help fight Covid? It’s possible most papers here are bogus or low quality. Garbage in, garbage out?

[+] m3kw9|6 years ago|reply

Yeah I don’t see any current AI models that can combine all the knowledge from a subject and do something really profound with it. I do see you can generate more AI covid19 articles from it.

[+] google234123|6 years ago|reply

I imagine they have some algorithm that takes an input consisting of the title, abstract, and authors with the information about how impactful their previous work has been (this is probably the most important factor) and outputs some ranking or a likelihood that the research will be cited/clicked on.

[+] ksk|6 years ago|reply

Might be useful to cluster the papers into various topics. E.g. a person interested in drug discovery needs a different (biochem-biased) set of papers than one doing vaccine development (immunology-biased). *(caveat: obvious overlaps)
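A rough sketch of what that could look like, in plain Python. This is single-pass "leader" clustering over bag-of-words vectors, a deliberately crude stand-in for real topic modeling; the threshold value and function names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def leader_cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose leader (its first member)
    is similar enough; otherwise start a new cluster.
    Returns clusters as lists of indices into texts."""
    leaders, clusters = [], []
    for i, text in enumerate(texts):
        vec = bag_of_words(text)
        for leader, members in zip(leaders, clusters):
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            # No existing leader was close enough.
            leaders.append(vec)
            clusters.append([i])
    return clusters
```

The result depends on input order and on the threshold, which is exactly the overlap caveat: a paper that sits between drug discovery and immunology lands in whichever cluster it meets first.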

[+] gpm|6 years ago|reply

Create a generative model that makes more covid-19 papers. Bonus points if it's possible to manipulate the model into creating papers that tell us how to create a cure or vaccine /s.

[+] MuffinFlavored|6 years ago|reply

Where are the most up to date, most reliable case numbers? I'm tracking US day-by-day case growth.

These all have different numbers:

http://covid19.fyi/

https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_t...

https://coronavirus.1point3acres.com/

[+] ddeck|6 years ago|reply

For global numbers, the WHO daily situation reports:

https://www.who.int/emergencies/diseases/novel-coronavirus-2...

The same data is also presented graphically here:

https://www.who.int/redirect-pages/page/novel-coronavirus-(c...

[+] MrAlexey|6 years ago|reply

I recently had to deal with this problem when building out Covidly (www.covidly.com)

Initially I tried using WHO and JHU, but quickly found their data to be riddled with discrepancies, occasional bugs, and direct contradictions with official statements from various countries.

I ended up aggregating multiple sources (including WHO/JHU/etc), performing some sanity checks to remove outliers, then doing my best to merge the remaining results.
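To make the idea concrete, a minimal Python sketch of that kind of merge for a single region and day. This is not the actual Covidly pipeline; the median-based outlier rule, the `tolerance` parameter and the source names are assumptions chosen for illustration.

```python
from statistics import median


def merge_counts(reports, tolerance=0.5):
    """Merge one region/day's case counts reported by several sources.

    reports maps a source name to its count, or None if it has no figure.
    Returns (merged count or None, sorted names of rejected sources).
    """
    # Sanity check 1: a negative count is never valid.
    dropped = [name for name, count in reports.items()
               if count is not None and count < 0]
    valid = {name: count for name, count in reports.items()
             if count is not None and count >= 0}
    if not valid:
        return None, sorted(dropped)

    # Sanity check 2: reject counts too far from the cross-source median.
    centre = median(valid.values())
    kept = []
    for name, count in valid.items():
        if abs(count - centre) <= tolerance * centre:
            kept.append(count)
        else:
            dropped.append(name)

    # If the sources disagree too much, refuse to produce a number.
    if not kept:
        return None, sorted(dropped)
    return median(kept), sorted(dropped)
```

For example, `merge_counts({"source_a": 100, "source_b": 104, "source_c": 1000})` keeps the two figures that agree, rejects `source_c`, and reports the median of what is left.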

Happy to share this data publicly if there's interest!

[+] sureshv|6 years ago|reply

For US Data try: https://covidtracking.com/

[+] doctoboggan|6 years ago|reply

I have been using NYT: https://www.nytimes.com/interactive/2020/us/coronavirus-us-c...

They update it many times a day.

[+] gregsadetsky|6 years ago|reply

https://github.com/CSSEGISandData/COVID-19 ?

[+] dantheman|6 years ago|reply

Here are the CDC's #s - they're released @ noon for up to 4pm the previous day.

[+] anon1253|6 years ago|reply

It might be fun to check out https://covid19.doctorevidence.com/: they have loaded the CORD dataset with a dashboard-like interface and a query language. E.g. https://search.doctorevidence.com/search?query=ss(6f1da786-6... (user/pass covid19/covid19) for a direct link, and it provides integration with the other medically relevant feeds.

[+] m3kw9|6 years ago|reply

Without proper search assists like Elasticsearch/Lucene etc., it’s not useful for most people trying to read it. Maybe someone can set up a site with Elasticsearch with the articles on there?
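For a sense of what such a search layer does at its core, here is a toy inverted index in plain Python. It is only a sketch of the basic data structure behind Lucene-style search, with none of the ranking, stemming or scaling a real engine provides; the document ids and function names are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

Build the index once over titles or abstracts with `build_index`, then `search(index, "spike protein")` returns only the documents that mention both words.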

[+] anon1253|6 years ago|reply

Check out https://covid19.doctorevidence.com/: they have loaded the CORD dataset with a dashboard-like interface and a query language. E.g. https://search.doctorevidence.com/search?query=ss(6f1da786-6... (user/pass covid19/covid19)

[+] shardinator|6 years ago|reply

I thought so too, I’m working on this. If anyone wants to collaborate, please let me know how I can get in touch (email, Twitter).

[+] jonny_eh|6 years ago|reply

Maybe Algolia could help?

[+] ISNIT|6 years ago|reply

Have you put this in https://coronavirustechhandbook.com/data?

[+] gillesjacobs|6 years ago|reply

Many of the criticisms voiced in this thread stem from a lack of expertise in biomedical Natural Language Processing and text mining.

Various annotated datasets and models already exist within the field which can extract potentially useful information and be used in downstream tasks for targeted document and information retrieval. Biomedical text mining [1] is a wide subfield with plenty of open datasets and competitions such as the bi-annual ACL BioNLP workshop [2].

- Biomedical Named Entity Recognition: extract names of proteins, drugs, diseases, symptoms, etc. and classify their biomedical category [3]. Extracting symptom terms is crucial for document discovery, modeling and knowledge-base creation. Several open datasets can be found here [4].

- Biomedical relation and event extraction: Traditionally focused on extracting protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. Recently interest has shifted to the extraction of complex relations such as biomolecular events. [2] These methods can detect and classify the causal relations between the genes and proteins in a sentence like "TNF-alpha is a rapid activator of IL-8 gene expression by...".

- Document retrieval: Helping researchers and medical staff find relevant topic-specific papers by improving search with topic modeling, document similarity, named entities, etc.

These are only some examples of common biomedical text mining tasks and there are plenty more. Now of course, relying on previously annotated data is an issue because the tagged categories might not be relevant for many of the issues related to COVID19. However, even unsupervised modeling, like using SciBERT to create topic models or clusters of related documents, can be helpful for scientific discovery.
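To give a feel for the simplest end of the entity recognition task, here is a dictionary-lookup tagger in plain Python. It is far cruder than the trained models referred to above and is only a sketch; the lexicon entries and category labels are made up for illustration and do not come from any of the cited datasets.

```python
import re


def tag_entities(text, lexicon):
    """Return (start, end, surface form, category) for each lexicon term
    found in text. lexicon maps a term to its category label."""
    if not lexicon:
        return []
    # Longest terms first, so a longer name wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])",
        re.IGNORECASE,
    )
    lookup = {term.lower(): category for term, category in lexicon.items()}
    return [
        (m.start(), m.end(), m.group(1), lookup[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]
```

Run on the example sentence with a lexicon containing "TNF-alpha" and "IL-8", it picks out both mentions. It says nothing about the relation between them, which is where the event extraction work comes in.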

1. https://en.wikipedia.org/wiki/Biomedical_text_mining

2. https://aclweb.org/aclwiki/BioNLP_Workshop

3. https://www.hindawi.com/journals/cmmm/2015/571381/

4. http://gcancer.org/clstmdata/

5. https://bmcbioinformatics.biomedcentral.com/articles/10.1186...

[+] dougb|6 years ago|reply

This would be the perfect time for IBM to apply all Watson technologies and resources to develop new insight into Covid-19.

[+] heyitsguay|6 years ago|reply

Watson is basically a brand name for IBM's data analytics consulting services. My understanding is they're not that great at it; they haven't scored any major wins outside of that Jeopardy run. I don't have any articles on hand, but I seem to recall reading about some failures with a medical partner in particular. Then again, that's been a tough field for other big names like Google, too.

47 comments