‘My thesis @ClearboxAI is a blog post series that summarises the graduate research projects conducted at Clearbox AI. These experimental works are carried out by Master's students from Italian and European universities who collaborate with Clearbox AI to explore advanced topics in Machine Learning and apply R&D results in practice.’
Interest in synthetic data has grown rapidly in recent years. This growth of interest can be attributed, on the one hand, to the increasing demand for large amounts of data to train AI/ML models and, on the other hand, to the recent development of effective methods for generating high-quality synthetic data. For example, generative AI models have demonstrated excellent capabilities in synthesising complex datasets. Unfortunately, many of the processes of interest are rare events or edge cases. Therefore, the amount of real data that can be used to train generative models is often insufficient, hence limiting their applicability. Furthermore, in the case of processes involving dynamical systems, generative models often fail to capture the underlying laws governing the dynamics, thus resulting in low-fidelity synthetic data. A possible strategy to overcome these limitations is to generate synthetic data using a physics-informed approach, that is, incorporating the knowledge of the governing physical laws into the generative model.
My thesis work explored a possible approach for generating high-fidelity synthetic data using physics-informed ML. Specifically, the approach investigated in this work uses the SINDy Autoencoder network introduced by Champion et al. as a synthetic data generator. The generative models under study were tested on two datasets generated by nonlinear dynamical systems.
What is Physics-Informed ML?
As stated by Hao et al., the seamless integration of (noisy) data and mathematical physics models can guide ML models towards physically plausible solutions, improving accuracy and efficiency even in partially understood and high-dimensional contexts. Physics-Informed ML (PIML) is a learning paradigm that aims to build models that leverage both empirical data and physical knowledge to improve performance on tasks involving a physical mechanism. According to Karniadakis et al., no predictive model can be built without assumptions and, consequently, no generalisation performance can be expected from ML models without appropriate biases. In PIML, there are currently three pathways, which can be followed separately or in tandem, to accelerate training and improve the generalisation of ML models by embedding physics into them. Making a learning algorithm physics-informed amounts to introducing appropriate observational, inductive or learning biases that steer the learning process towards physically consistent solutions. The three approaches are described below.
Observational bias
This method builds on the vast amounts of observational data available thanks to the rapid advancement of sensor networks. Because these data reflect the physical laws that govern their generation, they can be used to infuse those laws into an ML model during training. However, over-parameterised Deep Learning (DL) models typically need a large amount of data, which can be expensive to acquire.
Inductive bias
This approach designs specialised Neural Network (NN) architectures that inherently embed prior knowledge and inductive biases related to a given predictive task. The ML model is adapted to ensure that the desired predictions comply with certain physical laws, typically expressed as mathematical constraints. However, this approach is often limited to tasks with relatively simple and well-understood physics or symmetry groups and may require elaborate implementation. The extension to more complex tasks is difficult because the invariances or conservation laws that characterise many physical systems are often poorly understood or challenging to implicitly encode into an NN architecture.
Learning bias
Rather than creating a specialised architecture, this method uses loss functions, constraints, and inference algorithms that can guide the training phase of an ML model to favour convergence towards solutions that align with the underlying physics. This is viewed as a multi-task learning application, where the learning algorithm fits the observed data while also providing predictions that approximately satisfy specific physical constraints. These biases are not mutually exclusive and can be combined to yield a very broad class of hybrid approaches.
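To make the idea of a learning bias concrete, here is a minimal sketch of a loss that adds a physics penalty to the usual data misfit. It is not taken from the thesis: the toy law dx/dt = -k x, the function name physics_informed_loss and the weight value are all assumptions chosen for illustration.

```python
import numpy as np

def physics_informed_loss(x_pred, x_obs, t, k=1.0, weight=0.1):
    """Data misfit plus a penalty on the residual of the toy law dx/dt = -k * x."""
    data_term = np.mean((x_pred - x_obs) ** 2)
    dxdt = np.gradient(x_pred, t)      # finite-difference time derivative
    residual = dxdt + k * x_pred       # close to zero when the law is satisfied
    physics_term = np.mean(residual ** 2)
    return data_term + weight * physics_term

t = np.linspace(0.0, 2.0, 201)
x_obs = np.exp(-t)                                   # observations that follow the law
consistent = np.exp(-t)                              # prediction that obeys the law
inconsistent = np.exp(-t) + 0.05 * np.sin(20 * t)    # close to the data, but violates the law

loss_consistent = physics_informed_loss(consistent, x_obs, t)
loss_inconsistent = physics_informed_loss(inconsistent, x_obs, t)
print(loss_consistent, loss_inconsistent)
```

The second prediction stays close to the observations point by point, yet it is penalised much more heavily because its time derivative disagrees with the assumed law. In a real model the same penalty would be added to the training loss and minimised by gradient descent.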
Use case definition
The thesis focused on a particular type of data, namely time-series. This choice was made as this type of data often comes from sensor readings and is therefore associated with physical systems. We decided to test the approach explored in this work on two datasets: a dataset artificially built around the Lorenz system and an experimental dataset acquired on a full-scale F-16 aircraft.
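For readers unfamiliar with the Lorenz system, the sketch below shows one way a set of trajectories could be produced. The classical parameter values (sigma = 10, rho = 28, beta = 8/3), the time step, the number of trajectories and the range of initial conditions are assumptions for illustration, not the settings used in the thesis.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classical parameters assumed)."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def simulate_lorenz(initial_state, dt=0.01, n_steps=1000):
    """Integrate one trajectory with a fixed-step fourth-order Runge-Kutta scheme."""
    trajectory = np.empty((n_steps + 1, 3))
    trajectory[0] = initial_state
    for i in range(n_steps):
        s = trajectory[i]
        k1 = lorenz_rhs(s)
        k2 = lorenz_rhs(s + 0.5 * dt * k1)
        k3 = lorenz_rhs(s + 0.5 * dt * k2)
        k4 = lorenz_rhs(s + dt * k3)
        trajectory[i + 1] = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return trajectory

# Five trajectories from random initial conditions (hypothetical sampling range).
rng = np.random.default_rng(0)
initial_states = rng.uniform(-10.0, 10.0, size=(5, 3)) + np.array([0.0, 0.0, 25.0])
dataset = np.stack([simulate_lorenz(s0) for s0 in initial_states])
print(dataset.shape)  # (trajectories, time steps, state variables)
```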
Synthetic data generation with SINDyAE
After an extensive literature review, we decided to perform the synthetic data generation step using a method based on SINDy. The Sparse Identification of Nonlinear Dynamics (SINDy) algorithm, proposed by Brunton et al., extracts parsimonious dynamics from time-series data. It assumes that the governing equations are sparse in a high-dimensional nonlinear function space, meaning that only a few terms define the dynamics. The algorithm uses sparse regression to find the fewest terms that accurately represent the data, balancing complexity with accuracy and preventing overfitting.
One of the main limitations of SINDy is that it relies on an effective coordinate system in which the dynamics have a simple representation. To address this, the SINDy Autoencoder (SINDyAE), proposed by Champion et al., discovers both interpretable, sparse dynamical models and the coordinates that enable these simple representations. This approach combines a SINDy model and a deep Autoencoder network in a joint optimisation, discovering intrinsic coordinates associated with a parsimonious nonlinear dynamical model.
SINDyAE combines the parsimony and interpretability of SINDy with the universal approximation capabilities of deep NNs to produce interpretable, generalisable models suitable for extrapolation and forecasting, removing the need to know the sparsifying measurement coordinates in advance.
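The sparse regression step can be sketched in a few lines of NumPy. This is an illustrative toy, not the thesis implementation: it uses sequentially thresholded least squares on a two-dimensional damped oscillator, and the candidate library, the threshold value and the function names polynomial_library and stlsq are assumptions.

```python
import numpy as np

def polynomial_library(X):
    """Candidate terms up to degree 2 for a two-dimensional state (x, y)."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit the rest."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dXdt.shape[1]):
            active = ~small[:, j]
            if active.any():
                Xi[active, j] = np.linalg.lstsq(
                    Theta[:, active], dXdt[:, j], rcond=None
                )[0]
    return Xi

# Toy data from dx/dt = -0.1 x + 2 y, dy/dt = -2 x - 0.1 y (closed-form solution).
t = np.arange(0.0, 25.0, 0.01)
decay = np.exp(-0.1 * t)
X = np.column_stack([2 * decay * np.cos(2 * t), -2 * decay * np.sin(2 * t)])
dXdt = np.gradient(X, t, axis=0, edge_order=2)  # derivatives estimated from the data

Xi = stlsq(polynomial_library(X), dXdt)
terms = ["1", "x", "y", "x^2", "xy", "y^2"]
for name, row in zip(terms, Xi):
    print(f"{name:>4}: dx/dt {row[0]: .3f}   dy/dt {row[1]: .3f}")
```

Each column of Xi holds the coefficients of one governing equation. Only the terms that survive the thresholding remain non-zero, which is what makes the recovered model easy to read.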
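In outline, the joint optimisation can be written as a single loss, following my reading of the formulation in Champion et al. The notation is an assumption made here for illustration: x is the measured state, φ the encoder, ψ the decoder, z the latent coordinates, Θ the library of candidate functions, Ξ the sparse coefficient matrix and λ₁, λ₂, λ₃ weighting hyperparameters. Transposes and averaging over samples are omitted.

```latex
\begin{aligned}
z &= \varphi(x), \qquad \hat{x} = \psi(z), \qquad \dot{z} \approx \Theta(z)\,\Xi, \\
\mathcal{L} &= \lVert x - \psi(z) \rVert_2^2
  + \lambda_1 \lVert \dot{x} - (\nabla_z \psi(z))\,\Theta(z)\,\Xi \rVert_2^2
  + \lambda_2 \lVert (\nabla_x z)\,\dot{x} - \Theta(z)\,\Xi \rVert_2^2
  + \lambda_3 \lVert \Xi \rVert_1
\end{aligned}
```

The first term is the usual Autoencoder reconstruction error. The second and third terms ask the SINDy model to predict the time derivatives in the original and latent coordinates respectively, and the last term promotes sparsity in Ξ.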
Synthetic data evaluation
One of the most important aspects of synthetic data generation is evaluation. We used two approaches. The first is based on training a classifier to discriminate between real and synthetic trajectories. The score achieved by the classifier serves as a proxy for synthetic data quality: the closer it is to chance level, the harder the synthetic trajectories are to tell apart from the real ones. In particular, we decided to use InceptionTime, as it is currently one of the best DL models for time-series classification. It is an ensemble of deep Convolutional NN models, inspired by the Inception-v4 architecture.
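The logic of the classifier-based check can be illustrated without any deep learning dependency. The sketch below swaps InceptionTime for a leave-one-out 1-nearest-neighbour classifier, purely as a lightweight stand-in. The function name discriminator_accuracy and the noisy sine data are hypothetical.

```python
import numpy as np

def discriminator_accuracy(real, synthetic):
    """Leave-one-out 1-nearest-neighbour accuracy at telling real from synthetic."""
    n = len(real) + len(synthetic)
    X = np.concatenate([real, synthetic]).reshape(n, -1)
    labels = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    sq_norms = np.sum(X ** 2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T  # squared distances
    np.fill_diagonal(dists, np.inf)  # a trajectory may not vote for itself
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(labels[nearest] == labels))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 100)

def noisy_sines(n, offset=0.0):
    """Hypothetical toy trajectories: sine waves with random phase and noise."""
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n, 1))
    return np.sin(t + phases) + offset + 0.1 * rng.normal(size=(n, t.size))

real = noisy_sines(50)
acc_similar = discriminator_accuracy(real, noisy_sines(50))              # same distribution
acc_shifted = discriminator_accuracy(real, noisy_sines(50, offset=3.0))  # obviously different
print(acc_similar, acc_shifted)
```

An accuracy near 0.5 suggests the classifier cannot separate the two sets, while an accuracy near 1.0 means the synthetic trajectories are easy to spot.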
The second approach is implemented to measure the novelty of the synthetic trajectories and is based on the use of the pairwise_distance function included in the sktime library. This function is used to compute the 2D pairwise distance matrix between the real and synthetic trajectories, which is then represented as a heatmap. Such heatmaps are qualitatively interpreted as a measurement of the novelty introduced by the synthetic trajectories compared to the real trajectories.
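As a rough illustration of the distance matrix behind these heatmaps, the sketch below computes it with plain NumPy instead of sktime, so that it runs without extra dependencies. The Euclidean metric, the function name euclidean_distance_matrix and the toy trajectories are assumptions, not the thesis configuration.

```python
import numpy as np

def euclidean_distance_matrix(real, synthetic):
    """Entry (i, j) is the Euclidean distance between real[i] and synthetic[j]."""
    R = real.reshape(len(real), -1)
    S = synthetic.reshape(len(synthetic), -1)
    diff = R[:, None, :] - S[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

t = np.linspace(0.0, 1.0, 50)
real = np.stack([np.sin(2 * np.pi * (t + p)) for p in (0.0, 0.1, 0.2)])
# The first synthetic trajectory is a copy of a real one, the second is new.
synthetic = np.stack([real[0], np.sin(2 * np.pi * (t + 0.05))])

D = euclidean_distance_matrix(real, synthetic)
nearest_real = D.min(axis=0)  # distance from each synthetic trajectory to its closest real one
print(D.shape, nearest_real)
```

A column minimum of zero flags a synthetic trajectory that merely reproduces a real one, while larger values indicate novelty. The matrix D can be passed to any heatmap plotting routine for the qualitative inspection described above.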
Conclusions
If you are interested in reading more about the thesis, you can find it here! It includes a detailed discussion of the implementation of SINDyAE and the quality evaluation library, together with an analysis of the application to the two datasets mentioned above. Feel free to ping me for comments and feedback!
Previous blog posts from My Thesis Work @Clearbox AI:
- How to use AutoML to optimize generative models by Daniele Genta
- Adapting unstructured data to AI Control Room engine by Chiara Lanza
- Why you should defend your ML models against adversarial attacks by Ludovico Bessi