‘My thesis work @ClearboxAI’ is a blog post series that summarises the graduate research projects conducted at Clearbox AI. These experimental works are carried out by Master's students from Italian and European universities who collaborate with Clearbox AI to dive deep into advanced Machine Learning topics and put R&D results into practice.
Artificial intelligence (AI) and machine learning (ML) are now an integral part of many tech companies' workflows. Consequently, services based on these technologies must adapt, in both speed and reliability, to the fast-paced environments where they are used.
Many ML workflows rely on unstructured data, which requires enormous computational effort to process. The long analysis times involved make it practically impossible to use tools for interpretability assessment or data quality assessment. The main research question behind this thesis was: would it be possible to adapt unstructured data so as to extend the capabilities of existing MLOps platforms such as the Clearbox AI Control Room?
The objective of this work is to create an interface for users to enhance Clearbox AI Control Room with unstructured data capabilities, without modifying the core of the product and maintaining its performance.
Unstructured vs structured data
Examples of structured data are tabular data, such as CSV files and relational database tables, but also more common data that accompany our daily online existence, such as website cookies.
The most common unstructured data, on the other hand, are images, texts and videos. It is important to note that the data source does not determine whether data is structured or unstructured. Intuitively, an image can be thought of as structured because it is a matrix of pixels. However, the meaning of the picture lies in the information that it carries, not in the pixel grid itself. Image recognition extracts the meaning of the picture, and this metadata is then used in analytics.
Unstructured data are difficult to prepare and analyze. However, their prevalence in everyday life and the intrinsic need for the information they carry make them essential.
Methodology and results
Having established what distinguishes structured data from unstructured data, we can dive into model deployment.
Two methods were investigated for this work, both using single-channel images as input.
The first one involves the wavelet transform, in particular the discrete wavelet transform. Statistical features are computed over the wavelet coefficients, and these features have been tested as structured input for the Clearbox AI Control Room.
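To make the idea concrete, here is a minimal sketch of how a single-channel image could be turned into a row of tabular features with a discrete wavelet transform. It is an illustration under stated assumptions, not the implementation used in the thesis: the choice of the Haar wavelet, the single decomposition level, the three statistics (mean, standard deviation, energy) and the function names `haar_dwt2` and `band_statistics` are all hypothetical.

```python
from statistics import mean, pstdev


def haar_dwt2(image):
    """Single-level 2D Haar wavelet transform (assumed wavelet choice).

    `image` is a list of rows with even height and width. Each 2x2 block
    of pixels yields one coefficient in each of the four sub-bands.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands = {"approx": [], "row_diff": [], "col_diff": [], "diag": []}
    for r in range(0, rows, 2):
        out = {name: [] for name in bands}
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            out["approx"].append((a + b + d + e) / 2)    # local average
            out["row_diff"].append((a + b - d - e) / 2)  # top row vs bottom row
            out["col_diff"].append((a - b + d - e) / 2)  # left column vs right column
            out["diag"].append((a - b - d + e) / 2)      # diagonal detail
        for name in bands:
            bands[name].append(out[name])
    return bands


def band_statistics(bands):
    """Flatten each sub-band into a few summary statistics (one tabular row)."""
    features = {}
    for name, band in bands.items():
        values = [v for row in band for v in row]
        features[f"{name}_mean"] = mean(values)
        features[f"{name}_std"] = pstdev(values)
        features[f"{name}_energy"] = sum(v * v for v in values)
    return features


# Toy 4x4 single-channel image with a vertical edge in the middle.
toy_image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
row = band_statistics(haar_dwt2(toy_image))
for key, value in row.items():
    print(key, value)
```

Each image becomes one fixed-length row of named numeric columns, which is the kind of structured input a tabular tool can consume. In practice a library such as PyWavelets and several decomposition levels would likely be used instead of this hand-written single-level transform.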
The second method uses a convolutional classifier, starting from the idea that the neurons of each filter hold condensed and structured information about the input. The outputs of these neurons, taken at different levels of network depth, were also tested as input for the Clearbox AI Control Room.
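The sketch below illustrates the general idea of reading filter activations at different depths and flattening them into tabular features. It is only a toy under loud assumptions: the thesis used a trained convolutional classifier, whereas here the filters are hand-picked stand-ins, each filter is shared across all input maps, and each filter's activation map is summarised by its global average. The names `conv_layer` and `activation_features` are hypothetical.

```python
def correlate_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation,
    which is what CNN frameworks call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


def conv_layer(maps, kernels):
    """One simplified convolutional layer followed by ReLU.

    Assumption: each kernel is applied to every input map and the
    responses are summed, instead of learning one kernel per input map.
    """
    outputs = []
    for kernel in kernels:
        partial = [correlate_valid(m, kernel) for m in maps]
        height, width = len(partial[0]), len(partial[0][0])
        outputs.append([
            [max(0.0, sum(p[r][c] for p in partial)) for c in range(width)]
            for r in range(height)
        ])
    return outputs


def activation_features(image, layers):
    """Collect one feature per filter per depth via global average pooling."""
    features = {}
    maps = [image]
    for depth, kernels in enumerate(layers, start=1):
        maps = conv_layer(maps, kernels)
        for index, fmap in enumerate(maps):
            values = [v for row in fmap for v in row]
            features[f"layer{depth}_filter{index}"] = sum(values) / len(values)
    return features


# Hand-picked stand-in filters (a trained network would supply learned ones).
vertical_edge = [[-1, 1], [-1, 1]]
blur = [[0.25, 0.25], [0.25, 0.25]]
toy_image = [[0, 0, 1, 1] for _ in range(4)]

row = activation_features(toy_image, layers=[[vertical_edge, blur], [blur]])
print(row)
```

The resulting dictionary is again a fixed-length row of numeric columns, one per filter and depth. With a real trained network, the same role would be played by activations extracted from chosen intermediate layers.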
We observed that the wavelet-based features did not yield significant results, while the results achieved using the convolutional network are the most promising. The tests carried out with the latter method, both quantitative and qualitative, recorded excellent results in terms of precision and robustness. To evaluate these results, a comparison was made with state-of-the-art techniques for unstructured data.
Custom interpretable AI techniques and data quality assessment methods have been used to evaluate the performance of the implemented methods. Although the scope of this work was experimental within the context of Clearbox AI Control Room, the results achieved may represent an excellent starting point for future work.
Building on this, we already have follow-up research projects in mind: methodology improvements to accept multi-channel images and other kinds of unstructured data such as signals, time series or text.