Working with Categorical Variables

Divij Sharma
4 min read · Aug 9, 2022

The world is divided into many categories, and so is data. Thinking in categories is the first step a child takes to make sense of the world. We teach children about the categories in the world: cat, dog, tiger, and deer are animals. Then we teach that cat, dog, and sheep are domestic animals, while lion, tiger, and deer are wild animals. Red, yellow, and blue are colors. When the child is a bit older, s/he is introduced to the world of ordinal categories: first, second, third, and also cold, warm, and hot temperatures. Categories are very important for us, as humans, to make sense of the world. Unfortunately, most machine learning algorithms are not able to handle categorical variables. Some algorithms can work with categorical data; Decision Trees, depending on the implementation, can learn directly from categorical data with no transformation required.

So what should be done?

The first step for machine learning algorithms to make sense of categorical variables is to convert them to numerical values.

What are categorical variables?

Before we start converting categorical variables to numerical values, it is important to understand the various types of categorical variables. In Data Science there are two types of categorical variables: Nominal and Ordinal.

Nominal variables have no particular order, and there is no way to compare one value to another. Examples: cat, dog, goat, cow, lion, tiger (animals); pen, eraser, pencil (writing instruments); or herbivore, carnivore, omnivore (animals by eating habits). There is no way to arrange these values in any meaningful order.

Ordinal variables have some order. Examples: first, second, third (rank in an exam); A+, B, C- (grades in an exam); or cold, warm, hot (temperature of an object). These values have an order and can be arranged in a sequence, such as cold < warm < hot for the temperature of an object. User feedback is also generally an ordinal categorical variable (excellent, good, satisfactory, bad; or Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree).

Encoders

Converting categorical values to numeric values for machine learning algorithms


There are many ways to convert a categorical variable to numeric values.

  1. One-Hot Encoding
  2. Label Encoding
  3. Ordinal Encoding
  4. Helmert Encoding
  5. Binary Encoding
  6. Frequency Encoding
  7. Mean Encoding
  8. Weight of Evidence Encoding
  9. Probability Ratio Encoding
  10. Hashing Encoding
  11. Backward Difference Encoding
  12. Leave One Out Encoding
  13. James-Stein Encoding
  14. M-estimator Encoding
  15. Thermometer Encoding

Label Encoding

Say a categorical variable in the data has n classes (distinct values). In Label Encoding, the variable is encoded with values between 0 and n-1. There is no inherent relationship between these assignments. scikit-learn's LabelEncoder assigns values based on alphabetical order.

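A minimal sketch with scikit-learn's LabelEncoder (the column name Temperature and the sample values are assumptions for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Temperature": ["Hot", "Very Hot", "Warm", "Cold"]})

# LabelEncoder sorts the distinct values, then assigns 0..n-1
le = LabelEncoder()
df["Temp_encoded"] = le.fit_transform(df["Temperature"])
print(df)
#   Temperature  Temp_encoded
# 0         Hot             1
# 1    Very Hot             2
# 2        Warm             3
# 3        Cold             0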

The numbers are assigned in alphabetical order. Cold is encoded as 0, Hot as 1, Very Hot as 2, and so on.


If a new category is added, the encoding changes, again based on alphabetical order.

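Continuing the same sketch, with the assumed data extended by a new category:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df2 = pd.DataFrame({"Temperature": ["Hot", "Lukewarm", "Very Hot", "Warm", "Cold"]})

print(LabelEncoder().fit_transform(df2["Temperature"]))
# [1 2 3 4 0]  i.e. Cold=0, Hot=1, Lukewarm=2, Very Hot=3, Warm=4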

A new encoding of 2 is introduced for Lukewarm, and Very Hot changes to 3.


The pandas function factorize() also performs label encoding. In this case the encoding is based on the order in which values first appear in the column.

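A sketch with pandas.factorize(), reusing the assumed Temperature data:

import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Very Hot", "Warm", "Cold"]})

# factorize() assigns codes in order of first appearance
codes, uniques = pd.factorize(df["Temperature"])
print(codes)    # [0 1 2 3]
print(uniques)  # Index(['Hot', 'Very Hot', 'Warm', 'Cold'], dtype='object')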

Hot is encoded as 0 because it is the first value to appear, then come Very Hot, Warm, and Cold.


When a new categorical value is introduced, the assigned labels change. The new value Lukewarm is assigned 1 because it is the second value to appear.

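The same sketch with Lukewarm appearing second in the assumed data:

import pandas as pd

df2 = pd.DataFrame({"Temperature": ["Hot", "Lukewarm", "Very Hot", "Warm", "Cold"]})

codes, uniques = pd.factorize(df2["Temperature"])
print(codes)    # [0 1 2 3 4]
print(uniques)  # Index(['Hot', 'Lukewarm', 'Very Hot', 'Warm', 'Cold'], dtype='object')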

One-Hot Encoding

One of the disadvantages of Label Encoding is that even though there is no relation or order between the encodings, a machine learning algorithm might infer one. It may consider 0 < 1 < 2 < 3 < 4, and therefore Hot < Lukewarm < Very Hot < Warm < Cold, which is not a correct relationship. This may result in poor performance and unexpected results.

In such cases, One-Hot Encoding can be applied. This encoding is needed when feeding categorical data to many scikit-learn estimators, such as linear models, SVMs with the standard kernels, and neural networks, as well as clustering algorithms. In fact, OHE can be adopted for any machine learning algorithm that looks at all the features simultaneously during training.

In OHE, each category is converted to a new column, and each row is assigned 1 in the column matching its category and 0 in the others. The implementation is as follows.

Using get_dummies()

In the implementation below, the column Color is one-hot encoded.
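A minimal sketch (the Color values are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Yellow", "Blue", "Yellow"]})

# dtype=int forces 0/1 columns; recent pandas otherwise defaults to booleans
dummies = pd.get_dummies(df, columns=["Color"], dtype=int)
print(dummies)
#    Color_Blue  Color_Red  Color_Yellow
# 0           0           1             0
# 1           0           0             1
# 2           1           0             0
# 3           0           0             1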

In this implementation, the column Color is automatically removed and a new column is created for each of the 3 distinct values of the category. The value of 0 or 1 depends on the value of the category in that observation (row).

Using the scikit-learn library

Using OHE with the scikit-learn library is a two-step process: in step 1 the column is encoded, and in step 2 the encoded values are added back to the DataFrame. A sketch of the Python implementation is below.
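A minimal sketch (note that the OneHotEncoder arguments differ slightly across scikit-learn versions, as flagged in the comments):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Color": ["Red", "Yellow", "Blue", "Yellow"]})

# Step 1: encode the column
# (sparse_output=False requires scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[["Color"]])

# Step 2: add the encoded values back to the DataFrame
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names_out(["Color"]),
    index=df.index,
)
df = pd.concat([df.drop(columns=["Color"]), encoded_df], axis=1)
print(df)
#    Color_Blue  Color_Red  Color_Yellow
# 0         0.0        1.0           0.0
# 1         0.0        0.0           1.0
# 2         1.0        0.0           0.0
# 3         0.0        0.0           1.0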

Keep watching this space for details on the other encoders.
