Handling Categorical Features in Machine Learning

Introduction: Datasets generally contain two types of variables: continuous (numerical) and categorical. Regression-based algorithms can use both kinds of features to build models, but in most ML libraries you can’t fit categorical variables into a regression equation in their raw form. Leaving them out of the model usually costs accuracy, so it is crucial to learn how to deal with such variables. Machine learning libraries handle categorical variables in various ways. How best to transform and use them in model training depends on several conditions, including the algorithm being used and the relationship between the response variable and the categorical variable(s). Here I demonstrate the methods for handling categorical variables that are built into Spark’s popular machine learning library, MLlib.

Challenges with categorical variables:

* A categorical variable may have too many levels, which hurts model performance. For example, in rent prediction a zip code field has numerous levels.

* A categorical variable may have levels that rarely occur. Many of these levels have minimal chance of making a real impact on model fit.

* One level may occur almost always, i.e. most observations in the data set share a single level. Such variables fail to make a positive impact on model performance because of their very low variation.

* We can’t fit categorical variables into a regression equation in their raw form.

* Most algorithms (or ML libraries) produce better results with numerical variables.
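The first three challenges can be checked before any modeling. Below is a minimal plain-Python sketch of such a check. The function name summarize_levels and the thresholds (max_levels, rare_share, dominant_share) are hypothetical choices for illustration, not part of Spark or of the original analysis, and the zip codes are made-up values.

```python
from collections import Counter


def summarize_levels(values, max_levels=50, rare_share=0.01, dominant_share=0.95):
    """Flag the level-related problems of one categorical column.

    All thresholds are illustrative assumptions; tune them per dataset.
    """
    counts = Counter(values)
    total = len(values)
    # Share of observations that fall into each level
    shares = {level: n / total for level, n in counts.items()}
    return {
        "n_levels": len(counts),
        "too_many_levels": len(counts) > max_levels,
        "rare_levels": sorted(l for l, s in shares.items() if s < rare_share),
        "dominant_level": next(
            (l for l, s in shares.items() if s >= dominant_share), None
        ),
    }


# Made-up zip code column: one level dominates, two levels are rare
zip_codes = ["10001"] * 97 + ["10002"] * 2 + ["94105"]
summary = summarize_levels(zip_codes, max_levels=2, rare_share=0.05)
print(summary)
```

A column flagged this way is a candidate for special treatment before it is indexed or encoded.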

Different approaches available in Spark ML:

The three methods below are generally used to deal with categorical variables in Spark’s MLlib library.

1. StringIndexer: StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequency, so the most frequent label gets index 0. If the input column is numeric, it is cast to string and the string values are indexed. When a downstream pipeline component uses this string-indexed column, you must set the input column of that component to the string-indexed column name; in many cases you can do this with setInputCol. To control how unseen labels are handled, set the handleInvalid option to “error” or “skip”: with “error” an exception is thrown, while with “skip” rows containing an unseen label are skipped.

Examples: Assume that we have the following DataFrame with columns id and gender.

id | gender


0 | M

1 | F

2 | F

3 | M

4 | M

5 | M

Gender is a string column with two labels: “M” and “F”. Applying StringIndexer with gender as the input column and genderIndex as the output column, we should get the following:

id | gender | genderIndex


0 | M | 0.0

1 | F | 1.0

2 | F | 1.0

3 | M | 0.0

4 | M | 0.0

5 | M | 0.0

“M” gets index 0 because it is the most frequent, followed by “F” with index 1.

from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame([(0, "M"), (1, "F"), (2, "F"), (3, "M"), (4, "M"), (5, "M")],["id", "gender"])

indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

indexed = indexer.fit(df).transform(df)
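To make the indexing rule concrete without a Spark session, here is a minimal plain-Python sketch of the same logic. It is not Spark's implementation: the function names fit_string_index and transform_labels are hypothetical, and breaking frequency ties alphabetically is an assumption made for this sketch.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to a 0-based index, most frequent label first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_labels(labels, index, handle_invalid="error"):
    """Replace labels by their indices, mimicking the "error"/"skip" options."""
    result = []
    for label in labels:
        if label in index:
            result.append(index[label])
        elif handle_invalid == "skip":
            continue  # drop rows whose label was not seen during fitting
        else:
            raise ValueError("Unseen label: %r" % (label,))
    return result


genders = ["M", "F", "F", "M", "M", "M"]
gender_index = fit_string_index(genders)
indexed_genders = transform_labels(genders, gender_index)
print(gender_index)
print(indexed_genders)
```

With the gender column from the table above, "M" maps to 0.0 and "F" to 1.0, matching the genderIndex column.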


2. One-hot Encoding: One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features. For example, with 5 categories, an input value of 2.0 maps to the output vector [0.0, 0.0, 1.0, 0.0]. The last category is dropped by default (dropLast=True), because including it would make the vector entries always sum to one and therefore be linearly dependent; an input value of 4.0 thus maps to [0.0, 0.0, 0.0, 0.0]. Note that this differs from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([(0, "M"), (1, "F"), (2, "F"), (3, "M"), (4, "M"), (5, "M")],["id", "gender"])

stringIndexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

model = stringIndexer.fit(df)

indexed = model.transform(df)

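# dropLast=False keeps all categories; by default the last one is dropped
encoder = OneHotEncoder(dropLast=False, inputCol="genderIndex", outputCol="genderVec")

# Spark 1.x/2.x: OneHotEncoder is a transformer. In Spark 3.0+ it is an estimator,
# so use encoder.fit(indexed).transform(indexed) instead.
encoded = encoder.transform(indexed)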

encoded.select("id", "genderVec").show()
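The effect of dropping the last category can be sketched in plain Python. This is only an illustration of the mapping described above, using dense lists instead of Spark's sparse vectors; the function name one_hot and its drop_last argument are hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a 0-based category index.

    With drop_last=True the last category is encoded as an all-zero vector.
    """
    position = int(index)
    if not 0 <= position < num_categories:
        raise ValueError("index %r out of range" % (index,))
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if position < size:
        vector[position] = 1.0
    return vector


print(one_hot(2.0, 5))                   # third of five categories
print(one_hot(4.0, 5))                   # last category, dropped by default
print(one_hot(1.0, 2, drop_last=False))  # "F" in the gender example
```

The first two calls reproduce the 5-category example from the text; the third corresponds to dropLast=False in the gender example.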

3. VectorIndexer: VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

* Takes an input column of type Vector and a parameter maxCategories.

* Decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical.

* Computes 0-based category indices for each categorical feature.

* Indexes categorical features and transforms original feature values to indices.

Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.

from pyspark.ml.feature import VectorIndexer

data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)

indexerModel = indexer.fit(data)

# Create new column "indexed" with categorical values transformed to indices

indexedData = indexerModel.transform(data)
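The decision rule behind maxCategories can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: rows are plain lists, the function names are hypothetical, and assigning indices in ascending order of the distinct values is an assumption of this sketch.

```python
def fit_vector_index(rows, max_categories):
    """Build a value-to-index map for every feature judged categorical.

    A feature is categorical if it has at most max_categories distinct values.
    Indices follow ascending value order (an assumption of this sketch).
    """
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            category_maps[j] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def transform_rows(rows, category_maps):
    """Replace categorical feature values by indices; keep the rest as-is."""
    return [
        [category_maps[j][v] if j in category_maps else v for j, v in enumerate(row)]
        for row in rows
    ]


# Feature 0 has 3 distinct values, feature 1 has 4
rows = [[1.0, 0.37], [3.0, 1.52], [1.0, 2.80], [5.0, 0.91]]
category_maps = fit_vector_index(rows, max_categories=3)
indexed_rows = transform_rows(rows, category_maps)
print(category_maps)
print(indexed_rows)
```

With max_categories=3 only feature 0 is treated as categorical; feature 1 has four distinct values and is left unchanged as a continuous feature.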



I have worked through an example to show how to apply these concepts: predicting the value of a car from a variety of characteristics such as mileage, make, model, engine size, interior style, and cruise control. The dataset is available here: http://ww2.amstat.org/publications/jse/v16n3/datasets.kuiper.html

The dataset contains the following variables:

Categorical: Make, Model, Trim, Type

Continuous: Price, Mileage, Cylinder, Liter, Doors, Cruise, Sound, Leather

The dataset is split 70:30 into training and test sets. StringIndexer is fitted on the union of the train and test data frames, so every label is guaranteed to be present. Below are the results for each approach tried. The implementation of all the above concepts is available at https://github.com/rukamesh/CarPricePrediction.git
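Why fit the indexer on the union? A minimal plain-Python sketch, assuming a frequency-ordered index like StringIndexer's, shows the difference. The make values below are hypothetical placeholders, not rows from the dataset, and fit_index is an illustrative helper rather than a Spark call.

```python
from collections import Counter


def fit_index(labels):
    """Frequency-ordered 0-based index; ties broken alphabetically (assumption)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical Make values after a train/test split
train_makes = ["Buick", "Buick", "Chevrolet", "Saab"]
test_makes = ["Chevrolet", "Cadillac"]

train_only_index = fit_index(train_makes)
union_index = fit_index(train_makes + test_makes)

# Labels in the test set that an index fitted on train alone cannot map
unseen_in_train = sorted(set(test_makes) - set(train_only_index))
unseen_in_union = sorted(set(test_makes) - set(union_index))
print(unseen_in_train)
print(unseen_in_union)
```

An index fitted on the training labels alone has no entry for a make that appears only in the test set, which would trigger the "error" or "skip" handling described earlier; the union index covers every label.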

Results without categorical variables:

RMSE | (unlabeled) | Error%(0-1) | Error%(1-2) | Error%(2-3) | Error%(3-4) | Error%(4-5) | Error%(5-10) | Error%(>10)

3843.47 | 11.13 | 7.87 | 9.44 | 3.93 | 10.23 | 7.87 | 24.80 | 35.81

Results with categorical variables:

RMSE | (unlabeled) | Error%(0-1) | Error%(1-2) | Error%(2-3) | Error%(3-4) | Error%(4-5) | Error%(5-10) | Error%(>10)

952.06 | 2.51 | 20.07 | 16.14 | 18.11 | 16.53 | 12.20 | 15.35 | 1.57

Formulas used:

%Error = (data$prediction - data$Price) * 100 / data$Price

RMSE = sqrt(mean((data$prediction - data$Price)^2))
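The same metrics can be sketched in plain Python. The prices and predictions below are made-up numbers, not results from the car dataset, and two details are assumptions of this sketch: errors are bucketed by absolute value, and each bucket includes its lower edge but not its upper edge.

```python
import math


def percent_errors(predictions, prices):
    """Signed percentage error of each prediction."""
    return [(p - y) * 100.0 / y for p, y in zip(predictions, prices)]


def rmse(predictions, prices):
    """Root of the mean of the squared differences."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, prices)]
    return math.sqrt(sum(squared) / len(squared))


def bucket_shares(errors):
    """Percentage of observations per absolute-error bucket (edges assumed)."""
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 10)]
    counts = {"Error%%(%d-%d)" % edge: 0 for edge in edges}
    counts["Error%(>10)"] = 0
    for error in errors:
        magnitude = abs(error)
        for low, high in edges:
            if low <= magnitude < high:
                counts["Error%%(%d-%d)" % (low, high)] += 1
                break
        else:
            counts["Error%(>10)"] += 1
    return {name: 100.0 * n / len(errors) for name, n in counts.items()}


# Made-up values for illustration only
prices = [10000, 20000, 30000, 40000]
predictions = [10050, 20500, 28500, 46000]

errors = percent_errors(predictions, prices)
shares = bucket_shares(errors)
print(errors)
print(rmse(predictions, prices))
print(shares)
```

Each bucket share is the percentage of test rows whose error falls in that range, so the shares in one row add up to 100.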

