February 20th, 2024

How to create a train and test dataset

by Luca Gilli

In this introductory blog post, we delve into the foundational step 0 of any machine learning project: creating a train/test dataset. This necessary step lays the groundwork for training models effectively, ensuring they can learn from one set of data (the training set) and then be evaluated on a separate, unseen set of data (the test set).

Train/test splitting strategies

Decisions on train/test splitting strategies play an important role in the lifecycle of a machine learning model, significantly impacting its performance in a production environment. How we partition our data into training and testing sets affects the model's ability to learn from the data and its capability to generalize to new, unseen data. This step is critical for developing robust models that deliver reliable predictions in real-world applications, where the stakes of predictive accuracy can be high.

When preparing your dataset for a machine learning project, several important aspects should guide your approach to splitting it into training and testing sets. The following considerations are essential for ensuring that your model is trained effectively and can generalize well to new data.

Size of the dataset

The size of your dataset plays a crucial role in deciding how you split it into training and testing portions. With larger datasets, a conventional split such as 80/20 or 70/30 works well, because both portions still contain substantial data. With smaller datasets, you must be more cautious: you may want to keep a larger share for training (e.g., a 90/10 split) to ensure the model has enough data to learn from, keeping in mind that a small test set gives a noisier performance estimate. When the dataset is too small to support a reliable hold-out split, you should also consider using a cross-validation approach.
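As a minimal sketch of the two options, the snippet below uses scikit-learn's `train_test_split` for a hold-out split and `cross_val_score` for 5-fold cross-validation. The synthetic data, the logistic regression model, and the specific split ratio are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data for illustration only: 200 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold-out split: 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Alternative for small datasets: 5-fold cross-validation, where every row
# is used for testing exactly once and for training in the other folds.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Cross-validation costs more compute because the model is fitted once per fold, but it makes better use of scarce data than a single small test set.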

Imbalance in Target Labels

In datasets where the target label is rare (imbalance between classes), stratified sampling becomes particularly important. Stratified sampling ensures that your training and testing sets contain a proportional representation of each class. This is crucial for preventing situations where the rare class is underrepresented or even absent from your test set, which could lead to misleadingly high performance metrics during evaluation and poor performance in real-world applications.
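One possible way to do this is the `stratify` argument of scikit-learn's `train_test_split`. The sketch below assumes a synthetic dataset with a 5% positive class, invented only to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration: 1000 rows, 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# stratify=y keeps the class proportions the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("positive rate in train:", y_train.mean())
print("positive rate in test: ", y_test.mean())
```

Without `stratify`, the number of rare-class rows in the test set would vary from one random split to the next.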

Stratified sampling helps maintain the class distribution across training and testing sets, which is vital when working with imbalanced datasets. More generally, properly addressing the size of the dataset, the distribution of target labels, and the presence of temporal data (discussed next) helps in developing models that perform well not just on paper but in practical, real-world scenarios.

Data drift

Data drift refers to the change in the distribution of model input data over time. This phenomenon can gradually decrease the model's performance because the assumptions the model was trained on no longer hold true. Data drift can occur due to various factors, such as changes in user behavior, seasonal variations, or shifts in the broader economic or social environment. Monitoring for data drift and regularly updating the model with new data are essential practices to ensure sustained model accuracy and relevance. When splitting a dataset into train and test sets, we should always take data drift into account.

If your data includes a time component (e.g., sales data over several years), it's often critical to split your data based on chronological order rather than randomly. This approach helps simulate a real-world scenario where the model will predict future events based on past data. It ensures that the test set represents future conditions that the model hasn't seen during training, providing a more accurate assessment of its predictive capabilities.
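A chronological split can be sketched with pandas as follows. The `date` and `sales` column names, the synthetic daily data, and the 80/20 cut-off are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales table, shuffled to mimic unordered raw data.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=1000, freq="D"),
        "sales": rng.poisson(20, size=1000),
    }
).sample(frac=1.0, random_state=0)

# Sort by time, then cut: the oldest 80% trains, the most recent 20% tests.
df = df.sort_values("date").reset_index(drop=True)
split_idx = int(len(df) * 0.8)
train_df = df.iloc[:split_idx]
test_df = df.iloc[split_idx:]

# Every test row is strictly later than every training row.
print(train_df["date"].max(), "<", test_df["date"].min())
```

For repeated evaluation over several time windows, scikit-learn's `TimeSeriesSplit` follows the same principle of always testing on data that comes after the training data.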

### Reproducibility

Ensuring the reproducibility of your train/test splits is crucial for the scientific integrity of your work and for collaborative projects. By fixing a random seed when splitting your dataset, you make sure that the same split can be recreated in the future, which is essential for debugging, model comparison, and peer review. This practice contributes to the transparency and trustworthiness of your modeling process.
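As a small illustration, the sketch below splits the same toy arrays twice with the same `random_state` and checks that the partitions are identical. The `SEED` constant and the toy data are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

SEED = 42  # hypothetical project-wide seed, stored alongside the code

# Two independent calls with the same seed on the same data.
split_a = train_test_split(X, y, test_size=0.2, random_state=SEED)
split_b = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Compare X_train, X_test, y_train, y_test pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Note that the seed only reproduces the split if the input rows arrive in the same order, so it is worth fixing the row order (or versioning the data) as well.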

### Data normalization

It's critical to perform data normalization (scaling of features) using only the statistics (mean, standard deviation) from the training set. This practice prevents information leakage from the test set and ensures that the model is not inadvertently exposed to data it shouldn't have access to during training. Applying the same transformation to the test set as derived from the training set simulates real-world conditions where the model is applied to unseen data.
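One way to follow this rule is to fit scikit-learn's `StandardScaler` on the training partition only and then reuse it on the test partition. The synthetic feature matrix below is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(500, 4))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; the test set is never fitted on.
X_test_scaled = scaler.transform(X_test)

print("train mean after scaling:", X_train_scaled.mean(axis=0).round(3))
print("test mean after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: it is being treated like unseen data.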

## Useful tools for data preparation and data management

Here are a couple of suggestions to further enhance the robustness and reliability of your machine learning models, especially in the context of preparing and managing your datasets.

Drift Classifier

To address potential issues with data drift between your training and testing sets, you can employ a drift classifier. This approach involves training a model to distinguish between your training and testing data. If the drift classifier performs significantly better than random guessing, it indicates that the two datasets come from different distributions, signaling potential data drift. Detecting and correcting for data drift early can help ensure that your model remains accurate and reliable.
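The idea can be sketched as follows: label each row by its origin (train or test), fit a classifier on that label, and look at the cross-validated ROC AUC. The `drift_auc` helper, the random forest, and the synthetic data are assumptions for this example, not a reference implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def drift_auc(X_train, X_test, seed=0):
    """Cross-validated AUC of a classifier separating train rows from test rows."""
    X = np.vstack([X_train, X_test])
    # Origin label: 0 for rows from the training set, 1 for rows from the test set.
    origin = np.concatenate(
        [np.zeros(len(X_train), dtype=int), np.ones(len(X_test), dtype=int)]
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, origin, cv=5, scoring="roc_auc").mean()


# Synthetic data for illustration: one test set matches the training
# distribution, the other has its feature means shifted.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 3))
test_same = rng.normal(0.0, 1.0, size=(500, 3))
test_shifted = rng.normal(1.5, 1.0, size=(500, 3))

auc_same = drift_auc(train, test_same)
auc_shifted = drift_auc(train, test_shifted)
print(f"no drift: AUC = {auc_same:.2f}")
print(f"drift:    AUC = {auc_shifted:.2f}")
```

An AUC close to 0.5 means the classifier cannot tell the two sets apart, while a value well above 0.5 suggests the distributions differ. Where to set the alarm threshold is a judgment call that depends on the dataset size and the application.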

Data Versioning

Using tools like **Data Version Control (DVC)** for data versioning allows you to track changes in your datasets over time, similar to how Git tracks changes in source code. This approach facilitates experiment tracking, model versioning, and rollback to previous states of your data, enhancing the manageability and reproducibility of your machine learning projects.

In a few words

The creation of a well-structured train and test dataset is fundamental for the success of machine learning projects. The decisions made during the splitting process significantly influence a model's ability to generalize and perform reliably in real-world situations.

Dataset size, target label imbalance, data drift, and reproducibility are essential considerations for building robust models. Tools like drift classifiers and data versioning further contribute to the adaptability and sustainability of machine learning models in dynamic environments. Overall, a thoughtful approach to dataset preparation lays the groundwork for models that prove their effectiveness in practical applications.


Luca Gilli, PhD, is CTO and co-founder of Clearbox AI, where he leads R&D and product development. Expert in generative AI, uncertainty quantification, and ML model validation, he is the inventor of Clearbox AI’s core synthetic data technology.