Have you ever run into Google’s ‘People also ask’ section and found the questions you were about to ask? Periodically, we are going to answer the most researched questions about AI and Machine Learning with our guests on Clearbox AI’s new interview series ‘People also Ask...and we answer’. Enjoy!
A few interviews ago, we talked about how to harness the power of AI in companies and we discussed how important it is to have the right expectations and to foster the right culture about AI and data in companies. We also highlighted the importance of data strategies and data governance to get the most out of AI processes. Today we will dig deeper into the world of data strategies and in particular into the role of data teams to discover how it’s done in practice.
Introducing our guest
Alberto Danese is the Head of Data Science at Nexi, which provides services and infrastructure for digital payments to banks, companies, institutions and public administration. After earning an MS degree in Computer Engineering, he worked in cybersecurity and then moved to data science, applied mainly to financial services. He now leads a team of data scientists and data engineers.
Why is it important to create a company data strategy?
Alberto: In the last few years, companies have realized that data is a strategic asset. It's challenging for a company to create a massive amount of data, and once you have it, its potential is not easy to replicate. Data gives a strategic advantage that puts a company on a different level from the competition. If you think about the top 10 companies worldwide, many of them are data companies. For instance, Google's primary asset is data, and it's not easy for other companies to collect data on that scale.
So, once you realize that data is no longer just a source of information but the core of your business, you need a strategy to grow it, manage it, and extract value from it. I work in a company that deals with millions of daily transactions, and we have always used data for operational reasons: for example, we have to produce credit card balances. But in the last few years, we have also leveraged this wealth of information for strategic purposes.
What are the roles of the different professionals in a data strategy? Do you have the same functions at Nexi?
Alberto: Three years ago, we created a 'data area', a central unit made of four teams dealing with data from different points of view. We have a dedicated team for data architecture, concerned with storing data and moving it across systems. Then we have a core team responsible for the data warehouse, accountable for everything from reporting to regular operations on data. Another team takes care of operations, because when data is at the core of your business, you also have to run processes like data loading, ingestion, and processing; even with all the automatic procedures, operations still need people to take care of them.
Finally, there is the data science team that I lead, which develops algorithms, data products and insights, but also takes care of some engineering, especially on the cloud. These are the four teams that make up the data area at Nexi.
As for the professionals involved in these teams, I will speak specifically about the one I lead. You can broadly divide data professionals into two groups: those closer to IT and those closer to the business. When they have good expertise, they're usually able to work on both sides: you may work with a specific business team most of the time, but you should also understand the engineering processes, and vice versa.
Behind a single job title, you can find different professions. I also discussed this topic in my book 'La cultura del dato' ('Data culture'), published by Franco Angeli. My co-author Stefano Gatti and I talked about the fact that, for instance, 'data scientist' can have multiple meanings: data scientists could be researchers, they could work closely with the business, and so on. I'm not a fan of introducing too many job titles.
Which tools do you suggest for a data strategy in a big company?
Alberto: It does not matter whether you mainly work with engineers, the IT department, or the business side: any person with a 'data-something' role (a data analyst, a data scientist, a data engineer) uses technology to support their work, so I think we can define three main areas of tools.
First of all, development tools. In my experience at Nexi, a great part of our codebase is written in Python and Spark, plus all the software that sits around the development languages, like development environments and versioning tools such as Git. Second, computational tools. At Nexi, for example, we adopted a public cloud, and it was an excellent choice for our needs because it gives us access to scalable compute and many managed tools. Other companies may decide to work on premises. Still, I think the cloud as a computational platform is almost a must for data scientists and, generally, for anyone who works with data.
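To give a flavour of what "Python and Spark on scalable cloud compute" means in practice, here is a minimal, purely illustrative PySpark sketch (not Nexi's actual code); the column names and storage paths are hypothetical.

```python
# Hypothetical example of a Python + Spark batch job a payments data team might run:
# aggregating card transactions into daily totals per merchant.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-merchant-totals").getOrCreate()

# Hypothetical input path; in practice this would point to the company's data lake.
transactions = spark.read.parquet("s3://example-bucket/transactions/")

daily_totals = (
    transactions
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("merchant_id", "day")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("n_transactions"),
    )
)

# Hypothetical output path for the aggregated result.
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/daily_totals/")
```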
Last but not least, compliance and security tools: data obfuscation, access control tools that limit who can access data, tokenization, cryptography, and anything that restricts the information you can reach. This area is fundamental because security measures are crucial. Suppose I work in a manufacturing company and only with IoT data: in that case, there won't be many problems accessing data produced by a machine, but when the data is produced by a person, it's an entirely different situation.
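As an illustration of the tokenization idea mentioned above, here is a minimal Python sketch (not any specific vendor's tool) that replaces a direct identifier with a keyed hash; the secret key and field names are hypothetical, and a real deployment would keep the key in a secrets manager or HSM.

```python
import hmac
import hashlib

# Hypothetical secret; in production this would come from a secrets manager, never source code.
SECRET_KEY = b"replace-with-a-managed-secret"

def tokenize(value: str) -> str:
    """Replace a direct identifier (e.g. a customer id) with a stable, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Example record with a personal identifier that should not circulate in clear text.
record = {"customer_id": "IT000123", "amount": 42.50}
record["customer_id"] = tokenize(record["customer_id"])
print(record)  # the amount is untouched, the identifier is now an opaque token
```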
Which tools do you suggest for a data strategy in a small company or a startup?
Luca: First of all, I would like to talk from the perspective of a small startup. Nowadays, you hear about startups with thousands of employees, which are, in my opinion, big companies. From the perspective of a small startup (say, 10-20 employees), the situation is interesting: you are flexible in making exciting choices about tools and technology stacks because you face fewer constraints.
If you're using cloud infrastructure, the first choice is whether to use the native tools provided by cloud providers like Amazon, Google, and Microsoft, which all offer data processing and extraction tools native to their platforms, or to build them yourself. Sometimes it's advantageous to start with their tools: as a startup you have access to cloud computing credits, and the learning curve is gentler than building everything from scratch. On the other hand, you may find yourself locked into a particular cloud provider.
Another fascinating aspect is that many new, extremely interesting open-source libraries and tools focused on data are appearing. As a small startup, we can experiment with them a bit more than a more structured company could.
For example, instead of using Airflow to orchestrate data extraction, you might experiment with newer tools such as Dagster, which was created two years ago. On the other hand, this presents some risks: there are so many tools around that it is not always easy to tell which of them will still exist in two years. Every time you try this kind of experiment, you might end up with a powerful, modern, and efficient tool, or with one that never gets adopted at scale. In my opinion, these are the main trade-offs when we talk about data tools from a startup's point of view.
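For a concrete feel of what such an experiment looks like, here is a minimal sketch of a pipeline written with Dagster's op/job API; the extract, transform and load steps are hypothetical placeholders, not an actual Clearbox AI pipeline.

```python
# Minimal Dagster pipeline sketch: three ops wired together into one job.
from dagster import op, job

@op
def extract():
    # In a real pipeline this would pull records from a database or an API.
    return [{"customer_id": "c1", "amount": 10.0}, {"customer_id": "c2", "amount": 7.5}]

@op
def transform(records):
    # Example transformation: keep only transactions above a threshold.
    return [r for r in records if r["amount"] > 8.0]

@op
def load(records):
    # In a real pipeline this would write to a warehouse; here we just print.
    print(f"loading {len(records)} records")

@job
def example_pipeline():
    load(transform(extract()))

if __name__ == "__main__":
    # Run the whole job locally, without a deployed Dagster instance.
    example_pipeline.execute_in_process()
```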
What obstacles may data scientists encounter along data pipelines?
Luca: I'm going back to what Alberto said about roles. In a smaller company, these definitions are even more blurred, because sometimes data scientists have to be data engineers. Data engineers need to adopt the same best practices used in software engineering to write good data pipelines: proper data documentation, data versioning and testing. At Clearbox AI, we have worked with many organizations that run these kinds of data pipelines, and one of their main issues was testing. When you version data and automatically deploy new data pipelines, you need tests to make sure you don't break anything and that your pipelines behave as expected. When the pipelines process personal data, it is not trivial to come up with suitable test data. As a synthetic data provider, we can supply data that depicts, for example, people, but is far less exposed to the risk of re-identification. In general, I think the same best practices you would adopt to write good software should also be applied to writing good data pipelines, which is not easy.
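To make the testing point concrete, here is a minimal sketch, assuming a pandas-based pipeline step checked with a pytest-style test; the transformation, column names and synthetic records are hypothetical examples, not Clearbox AI's actual product.

```python
import pandas as pd

def monthly_totals(transactions: pd.DataFrame) -> pd.DataFrame:
    """Example pipeline step: aggregate card transactions per customer per month."""
    out = transactions.copy()
    out["month"] = out["timestamp"].dt.to_period("M")
    return out.groupby(["customer_id", "month"], as_index=False)["amount"].sum()

def test_monthly_totals_on_synthetic_data():
    # Synthetic records standing in for personal data, so there is no re-identification risk.
    synthetic = pd.DataFrame({
        "customer_id": ["c1", "c1", "c2"],
        "timestamp": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-01"]),
        "amount": [10.0, 5.0, 7.5],
    })
    result = monthly_totals(synthetic)
    # c1 has two January transactions that should be summed; c2 has a single row.
    assert result.loc[result["customer_id"] == "c1", "amount"].iloc[0] == 15.0
    assert len(result) == 2
```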
Watch the previous episodes: