Artificial Intelligence (AI) and Machine Learning (ML) are increasingly becoming mainstream. Every day, we hear about new applications based on these concepts, such as ChatGPT for text generation or MidJourney for image generation.
Underlying all these projects is a common element: data. To train an ML model, from the most trivial to the most complex, we need data from which to extract statistics or make predictions.
Many market players, including those in the media, medical, insurance, and banking sectors, are working to integrate AI projects into their business. However, these companies often hold sensitive data about their customers and users, such as names, dates of birth, or social security numbers. Using this data to train ML models comes with many privacy risks. In a previous post, we discussed the risks of direct identification, linkage, and inference, and their negative impacts on individuals.
To mitigate these privacy risks and train ML models safely, we turn to Privacy Enhancing Technologies, such as synthetic data.
Synthetic data are artificial data, often generated with AI generative models, designed to reproduce the same statistics as the original data. This makes it possible to train ML models to make predictions without compromising the privacy of the original data. For more information on synthetic data, please refer to this post.
One of the goals of Clearbox AI is precisely to protect the privacy of this personal information, or Personally Identifiable Information (PII), using synthetic data. To do so, we have created an open-source Python library, Nerpii, which performs Named Entity Recognition (NER) on structured data and generates synthetic PII.
How does the Nerpii library work?
Nerpii is a Python library that performs Named Entity Recognition (NER) on a dataset and generates synthetic PII. Named Entity Recognition is an important task in the fields of Natural Language Processing and Natural Language Understanding. It consists of assigning a named entity, such as LOCATION, ORGANIZATION, or PERSON, to words in a text.
In our case, we assign a named entity to the columns of a dataset to obtain information about their contents. To do this, we use Presidio, an SDK from Microsoft, and an NLP model available on Hugging Face. These two tools, both trained to perform NER, check each row in a column and try to assign an entity to it; the final entity assigned to the column is the most frequent one.
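To make the idea concrete, here is a minimal sketch of this majority-vote approach using Presidio directly. It is not Nerpii's internal code, and the toy column and values are made up for illustration.

# Sketch: assign an entity to a column by majority vote over its values (not Nerpii's actual code).
from collections import Counter

import pandas as pd
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
df = pd.DataFrame({"city": ["Turin", "Paris", "Berlin"]})  # toy example

entities = []
for value in df["city"].astype(str):
    # Presidio analyzes each cell as text and returns the entities it recognizes.
    results = analyzer.analyze(text=value, language="en")
    entities.extend(result.entity_type for result in results)

# The column entity is the most frequent entity found across its rows.
column_entity = Counter(entities).most_common(1)[0][0] if entities else None
print(column_entity)  # e.g. LOCATION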
Once we know which entity types the columns of the dataset contain, we regenerate the data in the columns containing PII, such as names, addresses, and phone numbers. For this generation we use Faker, a Python package that produces fake data to anonymize PII.
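As a rough illustration of this step (again, not Nerpii's actual implementation), Faker can be used to replace detected PII column by column; the column-to-provider mapping below is a simplified assumption for the sake of the example.

# Sketch: replace PII columns with Faker-generated values (simplified, not Nerpii's actual code).
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.DataFrame({"name": ["Alice Rossi", "Bob Bianchi"], "phone": ["555-0100", "555-0101"]})

# Assumed mapping from detected PII column to a Faker provider, for illustration only.
providers = {"name": fake.name, "phone": fake.phone_number}

for column, provider in providers.items():
    # Overwrite the original PII with newly generated fake values.
    df[column] = [provider() for _ in range(len(df))]

print(df)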
How to use the Nerpii library
You can install the library by cloning the GitHub repository or simply by executing the following line in your terminal:
pip install nerpii
Named Entity Recognition
Once you have installed the library, you can import the NamedEntityRecognizer class with the following line of code:
from nerpii.named_entity_recognizer import NamedEntityRecognizer
Then, you can create a recognizer by passing a path to a CSV file or a pandas DataFrame as a parameter:
recognizer = NamedEntityRecognizer('./csv_path.csv')
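If your data is already loaded in memory, you can instead pass a pandas DataFrame, as described above:

import pandas as pd

df = pd.read_csv('./csv_path.csv')
recognizer = NamedEntityRecognizer(df)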
Once you have created your recognizer, you can perform NER using the following functions:
recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()
These functions assign an entity to most of the columns. The final output is a dictionary whose keys are the column names and whose values are the assigned entities together with a confidence score.
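To see what was detected, you can print the recognizer's dict_global_entities attribute (the same dictionary passed to the generator in the next section); the commented output below is only illustrative, and the exact format may differ.

print(recognizer.dict_global_entities)
# Illustrative output only (exact format may differ), e.g.:
# {'city': ['LOCATION', 0.95], 'company': ['ORGANIZATION', 0.80]}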
PII Generation
After performing NER on a dataset, you can generate new PII using Faker. You can import the FakerGenerator class with the following command:
from nerpii.faker_generator import FakerGenerator
Then, you can create a generator as follows:
generator = FakerGenerator(dataset, recognizer.dict_global_entities)
Finally, to generate new PII you can run this command:
generator.get_faker_generation()
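If you want to save the anonymized result, the snippet below assumes the generator exposes the updated data through a dataset attribute; this attribute name is a guess for illustration, so check the tutorial notebook linked below for the actual API.

# 'generator.dataset' is assumed here for illustration; the actual attribute may differ.
generator.dataset.to_csv('./anonymized_dataset.csv', index=False)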
For a practical example, please visit this tutorial notebook.
Conclusions
By using Nerpii, we are able to preserve the privacy of personal information contained in datasets. In this way, anyone using the library can obtain anonymized data to include in their data pipeline and develop their ML projects without putting data privacy at risk.
If you use Nerpii and you like it, give us a star on GitHub!