This post is an introduction to the topic everyone's talking about: Data-Centric AI.
Data-Centric AI has been one of the hot topics of 2021. Even the NeurIPS conference hosted a thematic workshop on the topic for the first time in its history. Data-Centric AI is gaining traction within the AI field, and a dedicated workshop at such a prestigious conference confirms that the topic is here to stay. But what is Data-Centric AI? According to Andrew Ng, it is the growing discipline of 'systematically engineering the data needed to build successful AI systems'.
When building AI tools, raw data undergoes a number of transformations to become production ready, i.e. to reach a format that AI models can learn from. These data pipelines are often built manually, which can be very inefficient and cumbersome for data scientists.
Such manual work usually translates into technical debt, meaning that it is bound to cause technical headaches as projects scale up. How can we prevent this from happening? Practically speaking, implementing a Data-Centric AI approach largely means applying the concepts we already use in software development, such as Continuous Integration/Continuous Delivery (CI/CD), to data pipelines.
The bulk of this work translates into applying two important software practices: version control and automated testing. The former ensures that data pipelines are reproducible, so that if something goes wrong we can revert to the most recent working pipeline. The latter ensures that pipelines are continuously tested against dynamic, real-world data. Applying these practices will not only improve the success rate of machine learning projects but also make monitoring models in real life more robust.
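To give a feel for what automated data testing can look like in practice, here is a minimal pytest-style sketch that could run in a CI pipeline whenever the pipeline code or the data changes. The file path, column names and checks are hypothetical, chosen only to illustrate the idea.

```python
# test_pipeline_output.py - a minimal, hypothetical data test run in CI.
# The file path, column names and thresholds are illustrative assumptions.
import pandas as pd


def load_pipeline_output() -> pd.DataFrame:
    # In a real project this would read the latest output of the data pipeline.
    return pd.read_parquet("output/customers.parquet")


def test_no_missing_ids():
    df = load_pipeline_output()
    assert df["customer_id"].notna().all(), "customer_id must never be null"


def test_amounts_are_non_negative():
    df = load_pipeline_output()
    assert (df["purchase_amount"] >= 0).all(), "purchase_amount must be non-negative"
```

Keeping such tests under version control together with the pipeline code means every change to the pipeline is validated against the same set of expectations.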
A popular tool helping in this direction is Great Expectations, which is built upon the idea of defining an open standard for data quality, with automated tests that can be readily implemented in existing CI/CD pipelines. Expectations are defined as 'declarative statements that a computer can evaluate, and that are semantically meaningful to humans'. An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'. Being able to define such expectations at different steps of a data pipeline means that we can test it more easily while maintaining high quality data documentation. The library is continuously improved, and it recently started offering automatic data profiling.
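As a rough sketch, the two example expectations above could be expressed with the classic pandas-based Great Expectations API as follows; method names and return types vary across releases, and the sum constraint is expressed here through a derived column so that only a basic expectation type is needed.

```python
# A minimal sketch using the classic pandas-based Great Expectations API;
# the exact API may differ in more recent releases of the library.
import great_expectations as ge
import pandas as pd

raw = pd.DataFrame({
    "a": [0.2, 0.7, 0.5],
    "b": [0.8, 0.3, 0.5],
    "c": [10, 4, 0],
})
df = ge.from_pandas(raw)

# 'the values in column c should be non-negative'
result_c = df.expect_column_values_to_be_between("c", min_value=0)

# 'the sum of columns a and b should be equal to one', checked on a
# derived column so we only rely on a basic expectation type
df["a_plus_b"] = df["a"] + df["b"]
result_sum = df.expect_column_values_to_be_between(
    "a_plus_b", min_value=1.0, max_value=1.0
)

# Each call returns a validation result that includes a success flag.
print(result_c)
print(result_sum)
```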
As a matter of fact, at Clearbox AI we deal with data quality on a daily basis, since we offer a solution to generate synthetic data in the context of machine learning in production. This synthetic data can be used to kick-start projects, augment datasets, and test models, and it should faithfully represent the real-world data it is synthesized from. It is therefore subject to the same data constraints and expectations, which need to be profiled before setting up the generative model. A constraint could be, for example, a chronological relationship between two columns of the same table: a synthetic dataset that loses such a relationship quickly becomes useless.
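To make the chronological example concrete, here is a small, hypothetical check that flags synthetic rows violating a relationship that holds in the original data; the column names are invented for illustration and do not come from an actual Clearbox AI pipeline.

```python
# Hypothetical example: verify that a chronological constraint observed in the
# real data ('signup_date' never comes after 'last_order_date') still holds in
# the synthetic dataset. Column names and values are invented for illustration.
import pandas as pd

synthetic = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-05", "2021-03-10"]),
    "last_order_date": pd.to_datetime(["2021-02-01", "2021-03-09"]),
})

violations = synthetic[synthetic["signup_date"] > synthetic["last_order_date"]]
if not violations.empty:
    print(f"{len(violations)} synthetic row(s) break the chronological constraint")
```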
Identifying these rules and constraints not only ensures that the synthetic data is distributed as closely as possible to the original data, but also makes training the generative model easier, since it reduces the cardinality of the problem we are modeling.
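As a toy illustration of what reducing the cardinality can mean (our own simplified example, not Clearbox AI's actual method): if two columns must always sum to one, the generative model only needs to learn one of them, and the other can be derived deterministically afterwards.

```python
# Toy illustration: if columns a and b must sum to one, we only need to model
# column a; column b can be enforced as a deterministic post-processing step.
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is the output of a generative model trained on column a only.
synthetic_a = rng.beta(2.0, 5.0, size=5)

# Enforce the constraint a + b == 1 after generation.
synthetic_b = 1.0 - synthetic_a

print(np.allclose(synthetic_a + synthetic_b, 1.0))  # True by construction
```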
We are constantly working on making these data profiling steps as automatic as possible, and we will be happy to keep sharing our progress in the coming weeks, detailing specific challenges and solutions!