The whys and hows of data preparation is a series of blog posts explaining the importance of data nowadays and how it can be processed to extract as much value as possible. You can find the first post of the series here.
First things first, let's summarise the highlights of the last post. We have seen how raw data, that is, data that has just been collected and not yet manipulated, has great potential that can only be revealed through preprocessing. It presents a series of problems, from missing values to noisy features, which make it difficult for a Machine Learning model to extract information from it. Data preparation refers to a series of processes designed to transform raw data into good quality data. In the next paragraphs we will take a closer look at some of the most frequent problems with raw data and the techniques to mitigate them, focusing on the preprocessing of tabular data.
Raw data is very often incomplete, presenting instances with missing values for certain columns (also called null values or NaNs). This problem can be approached from different points of view. A first and intuitive approach is to drop such instances from the dataset, so that only instances with all their features well defined remain. In this case it is possible to act at the level of a single instance, eliminating the rows that have missing values, or at the column level, dropping all those columns that have a majority of null values. A second possible approach is to estimate the value of a missing feature, especially if only a small group of values is missing. In this case, imputation techniques can be used to fill the missing values with the mean, median or mode of that specific feature.
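As a minimal sketch of these two approaches, assuming pandas and a made-up toy dataset (column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "salary": [30000, 45000, np.nan, 52000],
    "city": ["Rome", None, None, None],
})

# Approach 1a: drop every row that contains at least one missing value
df_rows_dropped = df.dropna(axis=0)

# Approach 1b: drop the columns where a majority of values are missing,
# i.e. keep only columns with a majority of non-null values
threshold = len(df) // 2 + 1
df_cols_dropped = df.dropna(axis=1, thresh=threshold)

# Approach 2: impute missing values with a statistic of each column
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())
df_filled["salary"] = df_filled["salary"].fillna(df_filled["salary"].median())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])
```

Note how the choice of statistic depends on the feature: mean or median for numerical columns, mode for categorical ones.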
Mixed data types and mixed data values
A problem often present with newly collected data, especially if they come from different sources, is that of mixed data values and mixed data types. In the first case the same value for a given feature is represented differently in different instances. For example, we can think of a column representing a country using different categorical values with the same meaning (such as "Italy", "Italia", "IT", etc.). In the second case, on the other hand, different data types are even used to express the same value. We could have instances that represent boolean values through strings and instances that represent them through numerical values. In this context, the most appropriate solution is to map the affected features to a single chosen format, in order to make the data consistent.
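A small sketch of this mapping idea with pandas, using a made-up example that mirrors the country and boolean cases above:

```python
import pandas as pd

# Hypothetical raw data with mixed values ("Italy" vs "IT")
# and mixed types (booleans as strings and as numbers)
df = pd.DataFrame({
    "country": ["Italy", "Italia", "IT", "France", "FR"],
    "is_active": ["true", 1, "false", 0, "true"],
})

# Mixed values: map all spellings of the same country to one chosen format
country_map = {"Italy": "IT", "Italia": "IT", "France": "FR"}
df["country"] = df["country"].replace(country_map)

# Mixed types: map every boolean representation to a single type
bool_map = {"true": True, 1: True, "false": False, 0: False}
df["is_active"] = df["is_active"].map(bool_map)
```

After the mapping, every instance uses the same representation, so downstream steps can treat each column uniformly.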
To maximise the extraction of patterns and valuable information by an ML model, it is important to transform all the data that contains unnecessary or irrelevant information, or that makes the learning phase of a model more difficult. These types of data are generally referred to as noisy data. The first and most intuitive step to solve this problem is to carry out a feature selection, manually eliminating all the useless columns. In this context, by 'useless' we mean all the features that do not provide adequate information for the problem we want to solve.
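In pandas, this manual feature selection boils down to dropping the columns judged useless; a tiny sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical dataset: 'row_id' and 'notes' carry no information
# useful for the problem at hand
df = pd.DataFrame({
    "row_id": [1, 2, 3],
    "size": ["S", "M", "L"],
    "price": [10.0, 12.5, 15.0],
    "notes": ["ok", "ok", "check"],
})

# Manual feature selection: keep only the informative columns
df_selected = df.drop(columns=["row_id", "notes"])
```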
Subsequently, a series of transformations can be carried out on the chosen columns to make them easier to feed into an ML algorithm. These transformations are often referred to as feature encoding and can be distinguished according to the type of column on which they are applied:
- Categorical feature encoding. Most ML algorithms do not accept nominal features as inputs, so these must be transformed into numerical values. For this purpose it is possible to follow different strategies depending on the nature of the feature itself; some of the most used practices are described below. Label Encoding is a very common technique, in which each category is converted into a numerical value following the alphabetical order of the possible values that the column can take. The Mapping technique is very similar to Label Encoding, but in this case the assignment of the numerical values can be forced regardless of the alphabetical order. These techniques are recommended in cases where the categorical values follow an order (think for example of the possible sizes of clothes, where categories such as XS precede categories such as L). Finally we have the One Hot Encoding technique, where a single categorical column is split into as many columns as there are unique values that the category can assume. In this way we obtain a binary vector for each instance, in which all the columns have a value of 0, except the one corresponding to the category of the instance, which has a value of 1. This technique is very useful when dealing with categorical columns for which the order is not important and the multitude of columns obtained does not pose particular problems.
- Numerical feature encoding. Most of the problems related to continuous features concern the distribution of the feature itself, i.e. its numerical range. On the one hand, there are very often columns with very wide ranges that make it difficult to extract information; on the other hand, within the same column there can be values outside the average range, known as outliers, which make the learning process of an ML algorithm difficult. In these cases there are several techniques that can be applied to mitigate the possible problems. Binning and clustering can be used to group large ranges of values into smaller groups, where each group represents an entire range of values. The first technique creates regular subgroups (all having the same width) given a certain range of values and the number of segments to be created; the second uses optimisation techniques that group the data on the basis of their "similarities". Normalisation is another method widely used in the presence of numerical values, where all the values in a column are scaled to a predefined range. An example of a very popular scaler is the MinMaxScaler, which normalises any numeric range to the range [0, 1].
Moving on to code
In this blog post we have addressed the problems that frequently occur when working with raw data, along with the main techniques to mitigate them. In the next post of this series we will take a closer look at how to use these techniques in real cases, implementing a ColumnTransformer in Python for the preprocessing of tabular data.