University of Cumberland Week 9 Frequency Encoding & Guided Encoding Discussion

Techniques for handling categorical attributes

Creating dummies

This method creates one binary (0/1) column for each category. It is most useful when the data has few categorical columns, each with few categories, and it is fast and easy to apply. Its disadvantage is that with many categorical columns, or columns with many categories, the number of new columns grows quickly.
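As a minimal sketch of dummy creation, the snippet below builds one 0/1 column per category using only the Python standard library. The function name `one_hot` and the `colors` column are hypothetical and chosen only for illustration.

```python
def one_hot(values):
    """Return one 0/1 column per distinct category, keyed by category name."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical categorical column with three categories.
colors = ["red", "green", "red", "blue"]
dummies = one_hot(colors)

for name, column in dummies.items():
    print(name, column)
```

Three categories produce three new columns, which shows why the approach becomes unwieldy as the number of categories grows.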

Ordinal number encoding

This method is used to replace ordinal categorical attributes with integers based on their rank. Its advantage is that it is the simplest way to handle an ordinal feature in a dataset. Its disadvantage is that it is not suited to nominal features, because it imposes an order that the categories do not have.
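A short sketch of rank-based encoding follows. The ranking in `size_order` is an assumption supplied by hand; in practice the order has to come from knowledge of the data, since it cannot be inferred from the labels themselves.

```python
# Assumed ranking of the categories, from lowest to highest.
size_order = ["small", "medium", "large"]

# Map each category to its rank, starting at 1.
rank = {category: position + 1 for position, category in enumerate(size_order)}

# Hypothetical ordinal column.
sizes = ["medium", "small", "large", "medium"]
encoded_sizes = [rank[s] for s in sizes]

print(encoded_sizes)
```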

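Frequency encoding

In this technique, each category is replaced with the frequency with which it occurs in the column. The advantage of this technique is that it is easy to implement and does not add any new features. The disadvantage is that it does not create a monotonic relationship between the categories and the target, and categories that occur equally often receive the same value.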
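The sketch below illustrates frequency encoding with the standard library. It uses relative frequency (count divided by the number of rows), which is one reasonable reading of "frequency"; raw counts would work the same way. The name `frequency_encode` and the `cities` column are hypothetical.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Hypothetical categorical column.
cities = ["paris", "rome", "paris", "oslo", "paris", "rome"]
encoded_cities = frequency_encode(cities)

print(encoded_cities)
```

The column stays a single column, so the feature space does not grow, but any two categories with the same count would be indistinguishable after encoding.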

Guided encoding

In this method, each category in the column is replaced with its rank when the categories are ordered by their relationship to the target column, for example by the probability of the target within each category. Its advantage is that it does not increase the volume of the data; its disadvantage is that, because the encoding is derived from the target, it can lead to overfitting.
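One way to sketch guided encoding is to rank categories by the mean of a binary target within each category. That ranking criterion is an assumption made for this example, and the names `guided_encode`, `plans` and `churned` are hypothetical.

```python
from collections import defaultdict


def guided_encode(values, target):
    """Rank categories by mean target value and replace each with its rank."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Lowest mean gets rank 0; ties are broken by category name.
    ordered = sorted(means, key=lambda cat: (means[cat], cat))
    rank = {cat: position for position, cat in enumerate(ordered)}
    return [rank[v] for v in values], rank


# Hypothetical column and binary target.
plans = ["basic", "pro", "basic", "team", "pro", "team"]
churned = [1, 0, 1, 0, 1, 0]
encoded_plans, plan_rank = guided_encode(plans, churned)

print(plan_rank)
print(encoded_plans)
```

Because the ranks come from the target itself, a sketch like this would normally be fitted on training data only and then applied to unseen data.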

Mean encoding

This technique replaces each category with the mean value of the target for that category. Its advantage is that it creates a monotonic relationship between the encoded attribute and the target. Its disadvantage is that it can lead to overfitting.
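A minimal sketch of mean encoding is shown below, assuming the mean is taken over a numeric target grouped by category. The names `mean_encode`, `shops` and `sales` are hypothetical.

```python
from collections import defaultdict


def mean_encode(values, target):
    """Replace each category with the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[v] for v in values], means


# Hypothetical column and numeric target.
shops = ["north", "south", "north", "east", "south", "east"]
sales = [10, 4, 20, 6, 8, 6]
encoded_shops, shop_means = mean_encode(shops, sales)

print(shop_means)
print(encoded_shops)
```

A category seen only a few times gets a mean based on very little data, which is one way the overfitting mentioned above can arise.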

Probability ratio encoding

In this technique, each category in the column is replaced by a probability ratio: the probability that the target is 1 for that category divided by the probability that it is 0. The advantage of this technique is that it captures information within each category, hence creating more predictive features.
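The sketch below assumes a binary target and computes, for each category, P(target = 1) divided by P(target = 0). Raising an error when a category has no target = 0 rows is a design choice made for this example, since the ratio is undefined in that case. The names `probability_ratio_encode`, `segments` and `bought` are hypothetical.

```python
from collections import defaultdict


def probability_ratio_encode(values, target):
    """Replace each category with P(target=1) / P(target=0) within it."""
    ones = defaultdict(int)
    totals = defaultdict(int)
    for v, t in zip(values, target):
        ones[v] += t
        totals[v] += 1
    ratio = {}
    for cat, n in totals.items():
        p1 = ones[cat] / n
        p0 = 1 - p1
        if p0 == 0:
            # The ratio is undefined when the category has no target=0 rows.
            raise ValueError(f"ratio undefined for {cat!r}: no target=0 rows")
        ratio[cat] = p1 / p0
    return [ratio[v] for v in values], ratio


# Hypothetical column and binary target.
segments = ["a", "a", "a", "a", "b", "b", "b", "b"]
bought = [1, 1, 1, 0, 1, 0, 0, 0]
encoded_segments, segment_ratio = probability_ratio_encode(segments, bought)

print(segment_ratio)
```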

Differences between continuous and categorical attributes

Continuous attributes

They are numerical and can take an infinite number of values.

They are obtained by measuring.

They are economical to sample, because the samples needed are usually smaller.

Categorical attributes

They contain a finite number of distinct categories.

They are obtained by counting.

They are expensive to sample, because the samples needed are usually larger.

Concept hierarchy
