Discovery and Learning with Big Data
Allyn Moeller, MBA
Midterm Project
Overview
The project covers all the topics that have been discussed until the end of Module 4 of the course. The materials in any format that have been posted for the class activities should be considered and used for the project. Additionally, the student can use any other source of information that he/she can gather.
IMPORTANT NOTES:
The student can use any source of information that he/she considers the best fit for his/her work on the project.
Sources can include class lectures, assignments, and any other materials.
Images can include the screenshots that the student has taken while working on the class assignments.
Screenshots without an explanation of what they show and what they are for are considered incomplete.
All students are free to discuss with their classmates while working on the midterm project.
The midterm project is an individual assignment. All the submitted documents for the midterm project are the work done only by the student.
All the datasets used in the lecture and the assignments are posted on Canvas
PART I: Big Data, Artificial Intelligence, and Machine Learning (100 Points)
SUBMISSION REQUIREMENTS:
Research and discuss (Min: 2 pages, Max: 3 Pages – including images) the history of artificial intelligence until now, focusing on the recent advancement of the field.
Select three different sectors of the U.S. economy, do research, and discuss (Min: 2 pages, Max: 3 Pages – including images) the impacts of big data and machine learning on each of them.
Discuss in detail the three major styles of learning in machine learning (Min: 1 paragraph each):
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
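As a rough illustration of how the three styles differ in practice, the sketch below fits one model of each kind to the same synthetic two-cluster data. The data, the choice of algorithms (LogisticRegression, KMeans, LabelPropagation), and all variable names are assumptions made for illustration only; they are not prescribed by the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

# Assumption: synthetic 2-D data with two well-separated groups.
X, y = make_blobs(
    n_samples=300, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)

# Supervised: every training example comes with its label.
supervised = LogisticRegression().fit(X, y)
print("Supervised accuracy on the training data:", supervised.score(X, y))

# Unsupervised: no labels are used; the algorithm groups similar points.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))

# Semi-supervised: only 20 labels are kept; scikit-learn marks the
# unlabeled examples with -1.
rng = np.random.default_rng(0)
y_partial = np.full_like(y, -1)
labeled_idx = rng.choice(len(y), size=20, replace=False)
y_partial[labeled_idx] = y[labeled_idx]
semi = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("Labels inferred for all points:", semi.transduction_.shape[0])
```

The only difference between the three calls is how much of `y` the algorithm is allowed to see: all of it, none of it, or a small labeled subset.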
IMPORTANT NOTES:
The student can select any sector in which he/she is interested. For example, he/she can choose high-tech, retail, and transportation, or healthcare, education, and manufacturing, to name a few.
Best submitted as a Word document
PART II: Machine Learning: Supervised – Linear Regression (60 Points)
TO-DO
Following the steps discussed in class, train a machine learning model using the linear regression algorithm on the full dataset (all columns) housing_boston.csv with the Python library Scikit-Learn.
This machine learning project includes the following steps:
Load the data
Preprocess the dataset
Perform the exploratory data analysis (EDA) on the dataset
Separate the dataset into the input and output NumPy arrays
Split the input/output arrays into the training/testing datasets
Build and train the model
Calculate the R2 value
Predict the “Median value of owner-occupied homes in 1000 dollars”
It is assumed that two new suburbs/towns/developments have been established in the Boston area. The agency has collected the housing data of these two new suburbs/towns/developments.
Make up two housing records to be used as predictors (all the variables except MEDV)
Use these two new records as the new data, feed them into the model to predict the median value of owner-occupied homes in 1000’s of dollars
Evaluate the model using the 10-fold cross-validation
IMPORTANT NOTES
For the univariate data visualization in the Exploratory Data Analysis (EDA), the chart of each applicable variable must be displayed in its own plot.
Run the code for each step and show the results.
For Step 8 (Prediction): the student should clearly present the value used for each predictor.
If uncertain what values to use for the prediction, one approach is to use the mean from summary statistics
Best submitted as a Markdown document exported as PDF
Creating Markdown Documents in Jupyter
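The split, train, score, predict, and cross-validate steps listed above might be sketched as follows. This is a minimal illustration, not the required solution: it uses synthetic data from make_regression as a stand-in for housing_boston.csv, and the split ratio, random seeds, and variable names are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Assumption: synthetic stand-in data with 13 predictors; the real project
# loads housing_boston.csv and separates MEDV as the output array.
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=7)

# Split the input/output arrays into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

# Build and train the model, then calculate R^2 on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")

# Two made-up records: the column means, and the means shifted up by one
# standard deviation (illustrative values only).
means = X.mean(axis=0)
new_records = np.vstack([means, means + X.std(axis=0)])
predictions = model.predict(new_records)
print("Predictions for the two new records:", predictions)

# Evaluate with 10-fold cross-validation, scored by R^2.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
print(f"10-fold CV R^2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```

Building the new records from the summary statistics keeps every predictor value within the range the model saw during training.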
PART III: Machine Learning: Supervised – Logistic Regression (60 Points)
TO-DO
Following the steps discussed in the videos, train a machine learning model using the logistic regression algorithm on the dataset pima_diabetes.csv with the Python library Scikit-Learn.
This machine learning project includes the following steps:
Load the data
Preprocess the dataset
Perform the exploratory data analysis (EDA) on the dataset
Separate the dataset into the input and output NumPy arrays
Split the input/output arrays into the training/testing datasets
Build and train the model
Score the accuracy of the model
Predict the outcome (having diabetes or not) of two new records:
It is assumed that new data has been collected from two persons whose information has not yet been included in the existing dataset.
Make up two new records consisting of the predictors (all the variables except “class”) to represent the data of these two new persons, using the existing records of the dataset as samples.
Use these two records as the new data, feed them into the model to predict the outcome, i.e., having diabetes or not
Evaluate the model using the 10-fold cross-validation
IMPORTANT NOTES
For the univariate data visualization in the Exploratory Data Analysis (EDA), the chart of each applicable variable must be displayed in its own plot.
Run the code for each step and show the results.
For Step 8 (Prediction): the student should clearly present the value used for each predictor.
If uncertain what values to use for the prediction, one approach is to use the mean from summary statistics
Best submitted as a Markdown document exported as PDF
Creating Markdown Documents in Jupyter
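For the classification case, the same sequence of steps might look like the sketch below. It is only an illustration under stated assumptions: the data is synthetic (make_classification with 8 predictors and a binary label) rather than pima_diabetes.csv, and the seeds, max_iter setting, and variable names are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Assumption: synthetic stand-in data; the real project loads
# pima_diabetes.csv and separates the "class" column as the output array.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=7,
)

# Split the input/output arrays into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

# Build and train the model, then score its accuracy on the test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy on the test set: {accuracy:.3f}")

# Two made-up records built from the summary statistics: the column means
# and the column medians (illustrative values only).
new_records = np.vstack([X.mean(axis=0), np.median(X, axis=0)])
predicted = model.predict(new_records)              # class labels, 0 or 1
probabilities = model.predict_proba(new_records)    # one column per class
print("Predicted classes:", predicted)
print("Class probabilities:", probabilities)

# Evaluate with 10-fold cross-validation, scored by accuracy.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="accuracy"
)
print(f"10-fold CV accuracy: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```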
PART IV: ML: Supervised: Regression: Decision Tree Regression (60 Points)
TO-DO
Following the steps discussed in the videos, train a machine learning model using the decision tree regression algorithm on the full dataset housing_boston.csv with the Python library Scikit-Learn.
This machine learning project includes the following steps:
Load the data
Preprocess the dataset
Perform the exploratory data analysis (EDA) on the dataset
Separate the dataset into the input and output NumPy arrays
Split the input/output arrays into the training/testing datasets
Build and train the model
Calculate the R2 value
Predict the “Median value of owner-occupied homes in 1000 dollars”
It is assumed that two new suburbs/towns have been established in the Boston area. The agency has collected the housing data of these two new suburbs/towns.
Make up two housing records consisting of the predictors (all the variables except MEDV) to represent the housing data of these new towns, using the existing records of the dataset as a reference
Use these two records as the new data, feed them into the model to predict the median value of owner-occupied homes in 1000’s of dollars
Evaluate the model using the 10-fold cross-validation
IMPORTANT NOTES
For the univariate data visualization in the Exploratory Data Analysis (EDA), the chart of each applicable variable must be displayed in its own plot.
Run the code for each step and show the results.
For Step 8 (Prediction): the student should clearly present the value used for each predictor.
If uncertain what values to use for the prediction, one approach is to use the mean from summary statistics
Best submitted as a Markdown document exported as PDF
Creating Markdown Documents in Jupyter
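A decision tree version of these steps could be sketched as below. Again this is a hedged illustration on synthetic data (make_regression standing in for housing_boston.csv), with hypothetical variable names and seeds. It also prints the training R2 next to the test R2, since a fully grown tree typically fits its training data almost perfectly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in data with 13 predictors; the real project
# loads housing_boston.csv and separates MEDV as the output array.
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=7)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

# Build and train the model; random_state makes tie-breaking reproducible.
model = DecisionTreeRegressor(random_state=7)
model.fit(X_train, y_train)

# R^2 on the training data versus the held-out test data.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on training set: {train_r2:.3f}, on test set: {test_r2:.3f}")

# Two made-up records using the existing data as a reference: the
# per-column 50th and 75th percentiles (illustrative values only).
new_records = np.percentile(X, [50, 75], axis=0)
predictions = model.predict(new_records)
print("Predictions for the two new records:", predictions)

# Evaluate with 10-fold cross-validation, scored by R^2.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(
    DecisionTreeRegressor(random_state=7), X, y, cv=kfold, scoring="r2"
)
print(f"10-fold CV R^2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```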
PART V: ML: Supervised: Classification: K-Nearest Neighbors (60 Points)
TO-DO
Following the steps discussed in the videos, train a machine learning model using the K-Nearest Neighbors (KNN) algorithm on the dataset pima_diabetes.csv with the Python library Scikit-Learn.
This machine learning project includes the following steps:
Load the data
Preprocess the dataset
Perform the exploratory data analysis (EDA) on the dataset
Separate the dataset into the input and output NumPy arrays
Split the input/output arrays into the training/testing datasets
Build and train the model
Score the accuracy of the model
Predict the outcome (having diabetes or not) of two new records:
It is assumed that new data has been collected from two persons whose information has not yet been included in the existing dataset.
Make up two new records consisting of the predictors (all the variables except “class”) to represent the data of these two new persons, using the existing records of the dataset as samples.
Use these two records as the new data, feed them into the model to predict the outcome, i.e., having diabetes or not.
Evaluate the model using the 10-fold cross-validation technique.
IMPORTANT NOTES
For the univariate data visualization in the Exploratory Data Analysis (EDA), the chart of each applicable variable must be displayed in its own plot.
Run the code for each step and show the results.
For Step 8 (Prediction): the student should clearly present the value used for each predictor.
If uncertain what values to use for the prediction, one approach is to use the mean from summary statistics
Best submitted as a Markdown document exported as PDF
Creating Markdown Documents in Jupyter
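With K-Nearest Neighbors the workflow has the same shape; only the estimator changes. The sketch below is an illustration under assumptions: synthetic data instead of pima_diabetes.csv, n_neighbors=5 chosen arbitrarily, and new records made by copying two existing rows and scaling them slightly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data; the real project loads
# pima_diabetes.csv and separates the "class" column as the output array.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=7,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

# Build and train the model; n_neighbors=5 is an arbitrary illustrative choice.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy on the test set: {accuracy:.3f}")

# Two made-up records using existing records as samples: copy the first
# two rows and scale every value by 5% (illustrative values only).
new_records = X[[0, 1]] * 1.05
predicted = model.predict(new_records)
print("Predicted classes for the two new records:", predicted)

# Evaluate with 10-fold cross-validation, scored by accuracy.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=kfold, scoring="accuracy"
)
print(f"10-fold CV accuracy: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```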
PART VI: Evaluate and Compare Machine Learning Models (60 Points)
Regression Models: Linear Regression vs. Decision Tree (CART) Regression
TO-DO
Based on the results obtained in Step 7 of the above exercises (Calculate R2 value), make observations and compare the values. (1-2 paragraphs)
Based on the results obtained in Step 8 (Prediction), make observations and compare the results. (1-2 paragraphs)
Based on the results obtained in Step 9 (Evaluate the model using 10-fold cross-validation), make observations and compare the results. (1-2 paragraphs)
IMPORTANT NOTES
Use these values to evaluate the quality of the models. If the results are not the same, what is the difference?
Make a conclusion, if possible, on which model has higher quality in predicting outcomes and should be selected as the model to make predictions on the new data
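One way to make the cross-validation comparison fair is to score both regressors on exactly the same 10 folds. The sketch below shows the idea on synthetic data; the dataset, seeds, and the `results` dictionary are assumptions for illustration, and the ranking it prints says nothing about which model will do better on housing_boston.csv.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic stand-in data, not the course dataset.
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=7)

# A fixed random_state gives both models identical train/test folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree (CART)": DecisionTreeRegressor(random_state=7),
}

# Collect the mean and standard deviation of R^2 across the 10 folds.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2={scores.mean():.3f}, std={scores.std():.3f}")

# Pick the model with the higher mean R^2.
best = max(results, key=lambda name: results[name][0])
print("Higher mean R^2:", best)
```

Reporting the standard deviation alongside the mean helps judge whether a difference between the two models is larger than the fold-to-fold variation.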
Classification Models: Logistic Regression vs. K-Nearest Neighbors
TO-DO
Based on the results obtained in Step 7 (Score the accuracy level of the model), make observations and compare the values.
Based on the results obtained in Step 8 (Prediction), make observations and compare the results.
Based on the results obtained in Step 9 (Evaluate the model using 10-fold cross-validation), make observations and compare the results.
IMPORTANT NOTES
Use these values to evaluate the quality of the models. If the results are not the same, what is the difference?
Make a conclusion, if possible, on which model has higher quality in predicting outcomes and should be selected as the model to make predictions on the new data
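The two classifiers can be compared on identical folds in the same way, using accuracy as the score. This is a minimal sketch on synthetic data with hypothetical names and settings; the outcome on pima_diabetes.csv may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data, not the course dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=7,
)

# A fixed random_state gives both models identical train/test folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Collect the mean and standard deviation of accuracy across the 10 folds.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy={scores.mean():.3f}, std={scores.std():.3f}")

# Pick the model with the higher mean accuracy.
best = max(results, key=lambda name: results[name][0])
print("Higher mean accuracy:", best)
```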
Best submitted as a Word document
Grading Criteria
The mid-term project is graded based on the following grade components:
25%
- mid-term project Python exercises:
75%
Mid-term Project Report (400 Points)
PART I: Big Data, Artificial Intelligence, and Machine Learning (100 Points)
PART II: ML: Supervised Regression – Linear Regression (60 Points)
PART III: ML: Supervised Classification – Logistic Regression (60 Points)
PART IV: ML: Supervised Regression – Decision Tree (CART) (60 Points)
PART V: ML: Supervised Classification – K-Nearest Neighbors (60 Points)
PART VI: Evaluate and Compare Machine Learning Models (60 Points)
HOW to Submit
Project, Reports and All Related Documents
The student is required to submit the final project report and all related documents to Canvas

