Our open source Data Profiling library is now on GitHub
Published on June 20, 2022
By Luca Gilli



Dear data scientists, developers and data practitioners,

Active participation in the Open Source Software (OSS) community enables collaborative, transparent and faster innovation. Since Clearbox AI's philosophy closely reflects these principles, we did not want to miss the chance to leave our own OSS mark. In our ongoing effort to understand and solve your everyday data needs, we are happy to announce that the first release of our Data Profiling library is available on GitHub!

This is the first step of a broader OSS project built around data libraries that we use regularly in our work and have decided to publish as a contribution to the community. The release roadmap foresees sharing different libraries covering different data needs, starting with profiling, continuing with preparation, and ending with privacy. Let's start with the first one.

Structured Data Profiling library

This is a Python library that automatically profiles your data, saving you precious time when assessing structured datasets. It also makes it easy to identify relations between variables and facilitates the creation of data tests.

The library creates data tests in the form of Expectations using the great_expectations framework. Expectations are 'declarative statements that a computer can evaluate and that are semantically meaningful to humans'.

An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'.
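
As a rough sketch of what such statements look like in code, here is how the two examples above could be expressed with the great_expectations pandas interface (the toy data and column names are ours, and the exact expectation names may vary across versions of the framework):

import great_expectations as ge
import pandas as pd

# Toy dataset used only to illustrate the two example expectations
df = pd.DataFrame({"a": [0.4, 0.7], "b": [0.6, 0.3], "c": [1, 2]})
gdf = ge.from_pandas(df)

# 'the values in column c should be non-negative'
gdf.expect_column_values_to_be_between("c", min_value=0)

# 'the sum of columns a and b should be equal to one'
gdf.expect_multicolumn_sum_to_equal(column_list=["a", "b"], sum_total=1)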

The Structured Data Profiling library runs a series of tests aimed at identifying statistics, rules, and constraints characterising a given dataset. The information generated by the profiler is collected by performing the following operations:

  • Characterise uni- and bi-variate distributions;
  • Identify data quality issues;
  • Evaluate relationships between attributes (e.g. column C is the difference between columns A and B; see the sketch after this list);
  • Understand ontologies characterising categorical data (e.g. column A contains names, while column B contains geographical places).
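
To give an idea of the third operation, a relationship of that kind can be checked with plain pandas as in the sketch below (the toy data and column names are ours and do not reflect the library's internals):

import numpy as np
import pandas as pd

# Toy dataset where column C happens to be the difference between A and B
df = pd.DataFrame({"A": [5.0, 9.0], "B": [2.0, 4.0], "C": [3.0, 5.0]})

# Flag the relationship if it holds (up to numerical tolerance) on every row
print(np.allclose(df["C"], df["A"] - df["B"]))  # True for this toy dataset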

You can install the library by cloning the repository or simply by using the pip package manager; you can find the instructions on the library README page.
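
Assuming the package is published on PyPI under the same name as the repository, the pip route would look like this:

pip install structured-data-profiling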

How to use the library in practice

Once the library is installed, you can import the profiler class by using

from structured_data_profiling.profiler import DatasetProfiler

You can use the DatasetProfiler class to import any CSV file

profiler = DatasetProfiler('./csv_path.csv')

Once a dataset is imported, you can use the profile() method to start the profiling process

profiler.profile()

Finally, once profiling has finished, you can generate the data expectations by calling the following method

profiler.generate_expectations()
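
Putting the three steps together, a minimal end-to-end run looks like this (the CSV path is a placeholder for your own file):

from structured_data_profiling.profiler import DatasetProfiler

# Load the dataset, profile it, then export the resulting expectations
profiler = DatasetProfiler('./csv_path.csv')
profiler.profile()
profiler.generate_expectations()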

For a practical example please visit this tutorial notebook.

Your opinion matters!

We’re excited to keep up our OSS contributions to ease your data ops workload. To continue doing so, your feedback is very important: it helps us understand what is useful for you and how we can improve our work. Simply answer a question 👇 and we would be glad to consider your feedback for the next iterations! And, of course, if you like what we have been doing so far, do give us a star on GitHub ;)

Tags:

news
Dr. Luca Gilli is the CTO and the chief scientist at Clearbox AI. He is the inventor of Clearbox AI’s core technology and leads the R&D and product development at the company.