GCU Missing Values in Data Analysis Discussion

Daniel Kevins

Discussion 1: Arcelia

For the example mentioned above, option four is the best solution if the dataset is large and representative of our customer population. As noted by Shmueli et al. (2017), one advantage of Classification and Regression Trees (CART) is that the method requires no preprocessing of the data and handles missing values without deleting records or estimating the missing data. Therefore, if we assume that our dataset is large and only 5% of the records are missing this value, CART should give us the best outcome.

If we were to select option one, we would not be able to use certain machine learning methods, as many require a dataset without NULLs. Option two is also not ideal, because deleting records can remove potentially important data and leave us with a biased model (Maladkar, 2018). Option three is better than one or two, but as Maladkar (2018) notes, this option only works well with smaller datasets because it adds “variance and bias” (Replacing With Mean/Median/Mode section).
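
A minimal sketch of options two and three, assuming a hypothetical list of guest records in which a missing yearly income is stored as None. The field names and values are invented for illustration and do not come from the exercise.

```python
from statistics import mean, median

# Hypothetical guest records; None marks a missing yearly income.
records = [
    {"guest": "A", "income": 52000},
    {"guest": "B", "income": None},
    {"guest": "C", "income": 61000},
    {"guest": "D", "income": 48000},
    {"guest": "E", "income": 150000},
]

def delete_missing(rows, field):
    """Option two: drop every record whose field is missing."""
    return [row for row in rows if row[field] is not None]

def impute(rows, field, statistic=mean):
    """Option three: fill missing values with a summary statistic
    computed from the observed values of the same field."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = statistic(observed)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]

deleted = delete_missing(records, "income")        # guest B is dropped
mean_filled = impute(records, "income", mean)      # guest B gets the mean
median_filled = impute(records, "income", median)  # guest B gets the median
```

With these made-up numbers the single high income pulls the mean well above the median, so the choice of statistic changes the value filled in for guest B.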

Overall, a data analyst should opt for a data mining technique that can tolerate missing values before considering any technique that involves ignoring, removing, or estimating missing values.

References

Maladkar, K. (2018, September 2). 5 ways to handle missing values in machine learning datasets. Analytics India Magazine. Retrieved November 6, 2021, from https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/

Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C., Jr. (2017). Data mining for business analytics: Concepts, techniques, and applications in R (1st ed.). Wiley.

Discussion 2: Royce

When it comes to missing data and doing an analysis, you will very rarely have complete data in every table and column. It is simply a fact that we cannot collect every single piece of information. According to Sauro (2015), the reasons data go missing fall into three categories:

“-Missing Completely at Random: There is no pattern in the missing data on any variables. This is the best you can hope for.

-Missing at Random: There is a pattern in the missing data but not on your primary dependent variables such as likelihood to recommend or SUS Scores.

-Missing Not at Random: There is a pattern in the missing data that affect your primary dependent variables. For example, lower-income participants are less likely to respond and thus affect your conclusions about income and likelihood to recommend. Missing not at random is your worst-case scenario. Proceed with caution.”

Of our four options for handling missing data, ignoring the variable altogether is often workable, especially if the field was not mandatory. But if you do include that variable in the analysis, I would say the best option is option 2, deleting the records that have the missing value. That only holds in this circumstance, where the number of records missing the information is small. Once that number grows, you have to go with option 3 or 4. When it comes to imputing the data, we could use the average, since filling in the average should not change the overall mean of that variable and keeps the data consistent.
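
The point about the average can be checked with a small sketch. The incomes below are hypothetical, and the share of missing values is deliberately exaggerated to half of the records so the effect is easy to see. Filling with the mean leaves the mean unchanged, but it also shrinks the spread of the variable, which is worth keeping in mind when many values are missing.

```python
from statistics import mean, pstdev

# Hypothetical observed incomes; assume five more records are missing.
observed = [40000, 50000, 60000, 70000, 80000]
n_missing = 5

# Mean imputation: every missing record receives the observed mean.
fill = mean(observed)
imputed = observed + [fill] * n_missing

mean_before = mean(observed)
mean_after = mean(imputed)        # identical to mean_before

spread_before = pstdev(observed)  # population standard deviation
spread_after = pstdev(imputed)    # smaller: filled values sit on the mean
```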

Reference:

Sauro, J. (2015). 7 ways to handle missing data. Retrieved from https://measuringu.com/handle-missing-data/

Discussion 3: Tyler

When deciding how to approach the objects with missing data, you should always consider how many objects have missing data and how relevant the missing data might be. For example, if the data is missing because a certain group of people tends not to share that information, you may not want to throw those objects out, since even that 5% could be fairly relevant to your predictions. If the missing data does not appear to follow any pattern compared to other objects (basically random), you might not worry as much about that 5%, and you could likely throw it out without much impact. Whether it is best to ignore or delete the objects with missing data likely depends on which fields contain the missing data, how relevant those fields are to your end goals, and whether your process can work with NAs in the dataset.
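
One simple way to look for such a trend is to compare the share of missing values across groups. The sketch below assumes hypothetical records of an age group and a yearly income, with None standing for an income that was not provided. The grouping field is an assumption chosen for illustration.

```python
from collections import defaultdict

# Hypothetical records: (age_group, yearly_income); None = not provided.
records = [
    ("18-34", 42000), ("18-34", None), ("18-34", None), ("18-34", 39000),
    ("35-54", 68000), ("35-54", 71000), ("35-54", 65000), ("35-54", None),
    ("55+", 83000), ("55+", 79000), ("55+", 90000), ("55+", 76000),
]

def missing_rate_by_group(rows):
    """Return the share of missing incomes within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, income in rows:
        totals[group] += 1
        if income is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by_group(records)
```

If the rates differ a lot between groups, as they do in this made-up data, the missingness is probably not random and deleting those records deserves more caution.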

In our particular case, I would be least likely to just ignore the data. I could see the argument for deleting the rows where those fields are missing necessary data since it is probably more likely that those who do not wish to enter their yearly income do so randomly and no particular income group would wish to ignore this field more than another. However, knowing the data and the industry could change this perspective. I could also see there being value in entering values for the rows that are missing data. Entering the median or mean could be good options for a method such as this.

However, I think that this data could likely benefit from the CART data mining technique. This technique uses classification and regression trees to predict the missing values from the known data (Li, 2021). It pursues the same goal as our third option but lets machine learning predict better values than a single blanket value applied across all of the missing data. Simply put, this method would look for trends among the known fields that seem to affect the income of someone who stays at the hotel, and then use those same fields in the records with an unknown income to predict that income based on these trends. I believe that this could be a great method for our missing data, but it could also be simpler to delete the records if there does not appear to be a trend among visitors who just do not fill out that part of the form. Either of these options could prove effective.
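
The idea of predicting a missing income from known fields can be sketched with a depth-1 regression tree (a "stump"), which is the simplest case of the splitting step CART uses. This is not a full CART implementation, which would keep splitting and then prune. The single predictor (nightly room rate) and all values are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the threshold on x that
    minimises the total squared error around the two leaf means.
    Assumes xs contains at least two distinct values."""
    best = None
    # Skip the largest x: splitting there would leave the right leaf empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict(stump, x):
    """Return the mean income of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical guests: (nightly_rate, yearly_income); None = missing income.
guests = [
    (90, 40000), (100, 44000), (110, 42000),
    (250, 95000), (270, 105000), (300, 100000),
    (105, None), (280, None),
]

# Train on complete records only, then fill in the missing incomes.
known = [(x, y) for x, y in guests if y is not None]
stump = fit_stump([x for x, _ in known], [y for _, y in known])
completed = [
    (x, y if y is not None else predict(stump, x)) for x, y in guests
]
```

In this made-up data the two guests with unknown income receive different predictions depending on their room rate, whereas a blanket mean would assign both the same value.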

References

Li, R. (2021, July 31). CART data mining algorithm in Plain English. Hacker Bits. Retrieved November 12, 2021, from https://hackerbits.com/data/cart-data-mining-algorithm/
