1. Collect and process the PDF data dump from the COVID-19 Open Research Dataset Challenge (CORD-19):
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
2. Analyze the data and provide publication statistics, such as (but not limited to) the number of publications by time and by location. Provide a visualization of the results; any type is acceptable.
3. Learn sentence embeddings from the articles’ abstracts and from their main content, respectively.
4. Build a tool for question answering: given a user input sentence or query, it outputs the 10 most relevant sentences from the data together with their source, i.e., the article each sentence comes from. The tool can be command-line based or have a simple Web-based interface.
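As a starting point for task 1, the following is a minimal sketch of pulling the abstract and main content out of one parsed article. It assumes each article is available as a JSON record with `paper_id`, `metadata.title`, `abstract` and `body_text` fields, where the last two are lists of paragraphs with a `text` key; check this against the files you actually download. The function name `extract_texts` and the sample record are hypothetical.

```python
import json


def extract_texts(record):
    """Return (paper_id, title, abstract, body) from one parsed-article dict.

    Assumes 'abstract' and 'body_text' are lists of {"text": ...} paragraphs;
    missing fields fall back to empty strings.
    """
    paper_id = record.get("paper_id", "")
    title = record.get("metadata", {}).get("title", "")
    abstract = " ".join(p.get("text", "") for p in record.get("abstract", []))
    body = " ".join(p.get("text", "") for p in record.get("body_text", []))
    return paper_id, title, abstract, body


# Hypothetical record, for illustration only (not taken from the dataset).
sample = json.loads("""
{
  "paper_id": "abc123",
  "metadata": {"title": "An example title"},
  "abstract": [{"text": "First abstract paragraph."}],
  "body_text": [{"text": "Intro paragraph."}, {"text": "Methods paragraph."}]
}
""")

paper_id, title, abstract, body = extract_texts(sample)
print(paper_id, "|", title)
print("abstract:", abstract)
print("body:", body)
```

Keeping the abstract and the body as separate strings makes it straightforward to learn embeddings from each of them separately, as task 3 asks.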
For the Report (Publication Analysis):
Use suitable algorithm(s)/framework(s) to analyze the publications.
Explain clearly how the algorithms are applied.
Give clear and informative visualizations of the results.
Describe the results clearly and discuss them where applicable.
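One simple publication statistic is the number of articles per year. The sketch below assumes the publication dates are available as strings that start with a four-digit year (for example `2020-03-15`), skips entries that do not, and prints a plain-text bar chart. The function name `publications_per_year` and the sample dates are hypothetical; a plotting library would give a richer visualization.

```python
from collections import Counter


def publications_per_year(publish_times):
    """Count publications per year from date strings such as '2020-03-15'.

    Entries that do not start with a four-digit year are skipped.
    """
    counts = Counter()
    for value in publish_times:
        year = value.strip()[:4]
        if len(year) == 4 and year.isdigit():
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical dates, for illustration only.
sample_dates = ["2020-03-15", "2020-04-01", "2019-12-30", "", "2020"]
per_year = publications_per_year(sample_dates)

# Minimal text visualization: one '#' per publication.
for year, n in per_year.items():
    print(f"{year} {'#' * n} {n}")
```

The same counting pattern applies to other groupings, such as month, journal, or author location, once the corresponding field has been extracted.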
For the Tool:
Use suitable algorithm(s)/framework(s) for sentence embedding and for retrieving relevant sentences. Justify your choice.
Provide an easy-to-use tool that presents the sentences most relevant to the user input.
Provide an in-depth discussion of the results and findings.
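The retrieval step of the tool can be sketched as ranking sentences by cosine similarity to the query and returning the best matches with their sources. In the sketch below, `embed` is only a stand-in that builds bag-of-words count vectors; it is an assumption made to keep the example self-contained and would be replaced by whichever learned sentence-embedding model you choose. The names `embed`, `cosine`, `top_k` and the three-sentence corpus are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k=10):
    """Return the k highest-scoring (score, sentence, source) triples."""
    q = embed(query)
    scored = [(cosine(q, embed(sent)), sent, src) for sent, src in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical (sentence, source article) pairs, for illustration only.
corpus = [
    ("Masks reduce transmission of the virus.", "paper_A"),
    ("The vaccine trial enrolled adult volunteers.", "paper_B"),
    ("Virus transmission occurs through droplets.", "paper_C"),
]

results = top_k("How does virus transmission occur?", corpus, k=2)
for score, sentence, source in results:
    print(f"{score:.3f}  [{source}]  {sentence}")
```

Storing the source identifier next to each sentence is what lets the tool report which article a result comes from. For a corpus of several thousand articles, the sentence vectors would normally be computed once and cached rather than recomputed for every query.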
Note that the dataset is large. If you have difficulty processing all the articles provided, you may work on a subset of no fewer than 5000 articles. In that case, justify the number of articles you chose to work on.

