Home
Blog
Johns Hopkins University Data Analyst or Data Scientist Python Project

Johns Hopkins University Data Analyst or Data Scientist Python Project

Daniel Kevins

0 comments

Homework1

Goal
The goal of this homework assignment is to work on some real world problems that you might
encounter in a take-home coding project for an interview on the job as a Data Analyst or Data
Scientist.
Instructions
● This is an individual assignment.
● You can either turn in a Jupyter Notebook (.ipynb extension) or a python script (.py
extension).
● Do not use Excel or any spreadsheet software.
● You should be able to answer all of the questions using basic Python data structures and
syntax. Do not import any Python packages unless explicitly stated in the question. For
instance, you are not allowed to use Pandas, Numpy, etc…
● The first thing we will do when evaluating your assignment is running it end-to-end.
Therefore, before submitting your assignment, we strongly recommend that you:
○ make sure your code runs end-to-end without errors,
○ restart the kernel and run all cells in case you are working in a Jupyter Notebook.
● Please write clean code and include comments where applicable.
● Clearly identify the question number you are answering.
Section 1
Suppose the healthcare company you work for is evaluating a potential partnership with an
insurance company. Under the proposed agreement, the insurance company would refer some
of their members to the services provided by your company. As part of the evaluation, the
insurance company sent a dataset that your manager has tasked you with analyzing.
The data is stored in the file `patient_visits_partnership_prospect.psv`, in pipe-delimited format.
Here is a sample record:

875432|2021-04-16|Brooklyn|NY|11238|A|80
It contains the following fields:
● patient_id: a numeric identifier
● visit_date: YYYY-MM-DD format
● city
● state
● zip_code
● visit_type: the visit type, eg an annual checkup, or a follow-up visit, etc.. encoded by a
single character A, B, C, ….
● charge: the amount charged for the visit
For the example record, the patient_id is 875432, the visit occurred on April 16 2021, in
Brooklyn NY with zip code 11238. The visit-type was “A” and the charge incurred was $80.
Disclaimer – this is a synthetic dataset that I generated although it is inspired by real data that I
have worked with.
Questions:
You will need to load the file into memory and answer the following questions:
1. How many rows are there in the dataset?
2. How many unique patients are there?
3. How many unique states are there?
4. Find the minimum and maximum visit date. Note – you do not need any fancy date
parsing packages here, you should be able to answer this using basic Python syntax.
5. Find the visit count per state and print them in descending order.
6. Some of the patients have multiple visits. Calculate the number of patients with a single
visit and the number of patients with multiple visits. Calculate the ratio of patients with
multiple visits to total patients.
7. The majority of repeat patients had multiple visits in one city, but a small minority of
repeat patients had multiple visits across different states. Calculate the number of
patients that had multiple visits across different states.
8. Calculate the visit counts by year-month and display the results by increasing order of
date.
9. Find the top three busiest cities. Take care to consider city and state since some cities
might exist across multiple states.

10. Find the distribution of visit-types. For instance, a hypothetical answer could be: 40% of
visits are of type A and 60% of visits are of type B.
11. Calculate the average, minimum and maximum charges per visit-type

About the Author

Daniel Kevins

Follow me