What is data preparation?
Data is one of the key assets today, capable of bringing immense value to its owners if used well. As with many physical assets (gold, precious stones, oil, etc.), one of the fundamental steps in extracting value from data, a digital asset, is polishing and refinement. The term data preparation (sometimes loosely called data cleaning) refers to this process of transforming raw data into quality data, through a series of techniques aimed at cleaning, organizing and preparing the data for its final use.
Why data preparation?
In the typical data scientist workflow, data preparation takes place immediately after the data is collected. Let's see why this step matters. First of all, it's good to define what is meant by raw data: the term refers to datasets that have just been collected and have not yet undergone any type of transformation. This kind of data is, in most cases, incomplete and noisy (for example, some fields are missing or some values are outliers), and therefore inconsistent and inaccurate. As you can imagine, it's difficult to use data in this "primitive" form to extract information and therefore value. In the Machine Learning field, feeding an unclean dataset to an ML model makes it hard, and often impossible, for the model to learn from the data, identify patterns and make meaningful predictions. Most ML techniques require specific data formats: in many cases they accept only numerical values, so other kinds of values must be converted first, and very often they cannot handle null values.
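To make the idea of raw data more concrete, here is a minimal sketch of how the problems described above could be spotted with pandas. The dataset, its column names and the 0-120 plausibility range for age are all made up for illustration; they are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset: names and values are invented for this example.
raw = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 230],          # one missing value, one implausible value
    "city": ["Rome", "Milan", None, "Rome", "Turin"],
    "income": [30000.0, 42000.0, 38000.0, np.nan, 51000.0],
})

# Count the missing (null) values in each column.
missing_per_column = raw.isna().sum()

# List the columns that are not numeric and would need a conversion
# before most ML models could accept them.
non_numeric_columns = raw.select_dtypes(exclude="number").columns.tolist()

# Flag ages outside a plausible range (the 0-120 bounds are an assumption).
implausible_age = raw[(raw["age"] < 0) | (raw["age"] > 120)]

print(missing_per_column)
print(non_numeric_columns)
print(implausible_age)
```

Even on five rows, the checks surface the three issues mentioned above: null values, a non-numeric column and an outlier. How to handle each of them depends on the dataset and the model.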
Ok... let's focus on tabular data
There are different types of data, ranging from images to time series, etc. Although everything described in the previous paragraphs applies regardless of the type of data, in this series of blog posts the focus will be on tabular data. By this term we mean all the datasets that can be described in terms of columns (or features) and rows. Each column represents a feature of the dataset, while each row is an instance of the dataset itself, described by the values it has for each column. The features of a tabular dataset can generally be divided into numerical and categorical features. The former are those columns whose values are numbers that can be placed on a scale, such as integers, floats, percentages, etc.; the latter, on the other hand, are all those columns whose values are taken from a predefined and limited set of possible values. Obviously, different types of columns require different approaches to preprocessing, depending on whether they are categorical or numerical.
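As a rough illustration of the distinction between numerical and categorical columns, the sketch below splits the columns of a small DataFrame by their pandas dtype. The dataset and its column names are hypothetical, and relying on the dtype is only a heuristic chosen for this example.

```python
import pandas as pd

# Hypothetical tabular dataset: each column is a feature, each row an instance.
df = pd.DataFrame({
    "height_cm": [170.5, 182.0, 165.2],
    "num_children": [0, 2, 1],
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
})

# Columns with a numeric dtype are treated as numerical features.
numerical_columns = df.select_dtypes(include="number").columns.tolist()

# Every other column is treated as categorical (an assumption of this sketch).
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

# For each categorical column, collect its limited set of possible values.
categories = {col: sorted(df[col].unique()) for col in categorical_columns}

print(numerical_columns)
print(categorical_columns)
print(categories)
```

Note that the dtype alone can be misleading: a categorical feature stored as integer codes (for example, a numeric ID for each country) would land among the numerical columns here, so it is worth checking the split by hand.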
Moving on to practice
In this blog post we looked at what data preparation is and why it is important, especially in the world of ML. In the second part, we'll get our hands dirty and look at some practical techniques for preprocessing our data, based on its type and the problems it presents.