Inside the black box: data in the age of machine learning

“There are no bias-free machine learning algorithms. This is all the more true when algorithms are trained on user-generated data that was not collected and processed for that purpose, as is the case with the texts behind ChatGPT,” declares Professor Elena Baralis, coordinator of the Database and Data Mining Group at Politecnico di Torino.

Today, the leading technologies in artificial intelligence are machine learning and deep neural networks. These systems are hungry for data, and that data can generate inequalities in the algorithms' performance, with more or less severe consequences.

In 2018, African-American researcher Joy Buolamwini of the MIT Media Lab was working on her thesis project. She aimed to build an Aspire Mirror, a mirror designed to reflect her face alongside those of her role models. To do so, she placed a camera on the mirror and installed computer vision software able to recognize human faces. Yet when she stood in front of the mirror, the system did not recognize her face: it only worked when she wore a white mask. “We teach machines to recognize facial features by showing them lots of sample images with and without faces,” Buolamwini explained in the documentary Coded Bias, “but when I analyzed the databases, I realized that most of them only contained images of white men.”

Buolamwini's work is among the best known to have demonstrated how machine learning algorithms can contain bias and misrepresentations. These algorithms are built on data: if that data is distorted or unbalanced, the algorithm's results will be, too.

A rather similar issue was detected in voice recognition systems in 2020. A group of researchers from Stanford showed that the voice assistants of five of the world's largest technology companies – Amazon, Apple, Google, IBM and Microsoft – made far fewer mistakes with white users than with black ones. According to the authors of the study, this disparity is due to the fact that the systems' machine learning algorithms are trained on insufficiently diverse datasets, containing mostly voices of white people.

Understanding when this occurs and how to solve it is no easy task. At Politecnico di Torino, several research groups are dedicating their efforts to this challenge, which is crucial for technological evolution, inclusion and social equality.

Artificial neural networks are composed of nodes and connections, vaguely inspired by the architecture of the human brain. The most commonly used classes of neural networks are organized in consecutive layers, where the nodes of each layer can only be influenced by those of the preceding one. In deep neural networks there are many layers, each containing a vast number of nodes and connections. Training a neural network means adjusting the strength of its connections so that the algorithm performs the intended task as well as possible. When there are millions, billions or trillions of connections, as in the case of ChatGPT, training the network becomes extremely demanding and requires a substantial amount of data and powerful computing infrastructure.

Deep neural network sample
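
As a rough illustration of the layered structure and training loop described above, here is a minimal sketch in PyTorch (not taken from the article): a small feed-forward network whose layers only feed the next one, trained on toy random data by repeatedly adjusting its connection weights.

```python
# Minimal sketch of a layered feed-forward network: each layer's nodes are
# influenced only by the previous layer, and "training" adjusts the connection
# strengths (weights) to reduce a task-specific error. Data here is random toy data.
import torch
import torch.nn as nn

model = nn.Sequential(              # consecutive layers
    nn.Linear(20, 64), nn.ReLU(),   # hidden layer 1: 20 -> 64 nodes
    nn.Linear(64, 64), nn.ReLU(),   # hidden layer 2
    nn.Linear(64, 2),               # output layer: 2 classes
)

x = torch.randn(128, 20)            # toy input batch
y = torch.randint(0, 2, (128,))     # toy labels
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):             # training loop
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)     # how badly the network performs the task
    loss.backward()                 # compute how each connection should change
    optimizer.step()                # update the connection strengths
```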

“Basically, only the big tech companies, such as Meta, Google, Amazon, Microsoft and a few others, can afford to carry out training procedures,” explains Baralis. “Universities and research centers generally don’t have this capability; they can only intervene during the fine-tuning phase, when a model already trained to perform general tasks is further trained for a specific task in a given application domain.”

Hunting for bias

“At Politecnico, we also engage in another phase of the life-cycle of deep learning algorithms, known as debugging or bias detection. These models are essentially black boxes. We can only understand their behavior by analyzing input and output data, as their internal operating mechanisms cannot be examined,” explains Baralis. “We assess their performance by using appropriately designed open databases.”

This is the case of a recent research project led by Baralis in collaboration with Amazon AGI (Artificial General Intelligence).

“The goal of this project was to create a method for identifying groups of people whose speech is not accurately recognized by automatic voice recognition systems,” says Baralis. People belonging to the same group can be related by various characteristics, such as age, gender and geographical origin. “However, this is ethically and legally sensitive information. It is highly likely that such data is not available for the voice samples used to assess the model's performance, i.e. the test set,” explains Baralis.

To address this challenge, Baralis' team has designed a two-step approach. In the first step, an algorithm uses voice samples that are freely accessible to researchers and for which gender, age and geographical origin are already known. By analyzing these speakers' patterns, it learns the characteristics that define the way each group communicates. “The gender of the speaker affects the sound frequency, but there are other less obvious aspects, such as the speed of speech, the frequency of words, or the length of pauses.”

In the second step, this algorithm assigns each voice sample in the test dataset to one of the penalized categories (or to none of them), albeit with a certain degree of uncertainty. This makes it possible to evaluate the model's performance on these subgroups.
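
To make the two-step idea concrete, here is a hypothetical, heavily simplified sketch, not the team's actual algorithm: a group predictor is trained on open data with known demographics from generic speech descriptors, then used to label the test samples so that the recognizer's word error rate can be compared across predicted groups. All feature names and numbers are toy placeholders.

```python
# Hypothetical sketch of the two-step idea described above (not the team's code).
# Step 1: on open data with known demographics, learn to predict the group from
#         generic speech descriptors (pitch, speaking rate, pause length, ...).
# Step 2: label the demographics-free test samples with that predictor and
#         compare the speech recognizer's word error rate across predicted groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Step 1: open dataset with known group labels (toy random features) ---
open_features = rng.normal(size=(500, 4))    # e.g. [mean pitch, speech rate, pause length, word frequency]
open_groups = rng.integers(0, 3, size=500)   # e.g. 0, 1, 2 = demographic groups
group_predictor = LogisticRegression(max_iter=1000).fit(open_features, open_groups)

# --- Step 2: test set of the ASR model, with per-sample word error rates ---
test_features = rng.normal(size=(200, 4))
test_wer = rng.uniform(0.0, 0.5, size=200)   # word error rate of the recognizer per sample

predicted_group = group_predictor.predict(test_features)
for g in np.unique(predicted_group):
    print(f"group {g}: mean WER = {test_wer[predicted_group == g].mean():.3f}")
```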

“This approach can be adopted for any speech recognition model, using the data on which it was trained,” explains Baralis. “We have currently confirmed its effectiveness on open data, since the training data of proprietary models, such as those behind Alexa or Siri, are not public.”

Once a bias is identified, ad hoc strategies can correct or at least mitigate it.

“We can basically adopt two different approaches,” says Baralis. “The easy way would be to increase the fraction of training data related to the penalized categories, but this is a costly strategy because it requires data acquisition and labeling processes. Moreover, it means knowing and explicitly using demographic characteristics, which are considered sensitive attributes and are therefore discouraged, and in some cases, prohibited. The alternative option is to modify the algorithm so that, during the training phase, errors involving the discriminated categories are penalized more heavily. This method can work without explicitly using the definition of the categories. However, this path is also challenging,” explains Baralis, “because it requires careful design of the loss function.”
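
As a hedged illustration of the second approach, the sketch below shows one common way to penalize errors on a given group more heavily: a cross-entropy loss weighted per sample. For simplicity it uses explicit group labels, whereas the approach Baralis describes can avoid them; the weights and names are purely illustrative.

```python
# Illustrative sketch (not the specific method described in the article): a loss
# that weights mistakes more heavily for samples from a penalized group, so that
# training is pushed to reduce errors on that group.
import torch
import torch.nn.functional as F

def group_weighted_loss(logits, labels, group_ids, group_weights):
    """Cross-entropy where each sample is scaled by the weight of its group."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = group_weights[group_ids]      # e.g. higher weight for the penalized group
    return (weights * per_sample).mean()

# toy usage: errors on group 1 count twice as much as errors on group 0
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
group_ids = torch.randint(0, 2, (8,))
loss = group_weighted_loss(logits, labels, group_ids, torch.tensor([1.0, 2.0]))
loss.backward()
```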

Detecting biases and taking action against them is crucial. While it might seem less significant that a voice assistant performs poorly for certain groups of people, there are other cases where biases can significantly affect individuals' lives.

“Just think of the decision support systems used in companies to select personnel,” explains Baralis. “Our approach can be applied to similar systems and enables us to detect biases without relying on sensitive attributes that may not always be available or directly used by the model. Additionally, it does not require an initial hypothesis about the characteristics for which the algorithm may perform poorly.”

A notable example is the COMPAS algorithm, whose biases were revealed by a group of ProPublica journalists and data experts in 2016. COMPAS is a system used by numerous US courts to estimate the risk of reoffending for people arrested on suspicion of a crime. Based on this estimate, judges decide whether or not to validate the arrest and whether the defendant must await trial in prison.

“ProPublica found that the COMPAS algorithm often overestimated the risk of reoffending for Black individuals while underestimating it for white ones,” explains Baralis. “This occurred even though the algorithm did not explicitly include ethnicity as a factor. Instead, other variables, particularly the place of residence, strongly correlated with ethnicity and thus served as proxies for this sensitive variable.”

ProPublica was able to conduct this investigation by obtaining records related to 7,000 arrests that occurred in Broward County, Florida. This data included the risk scores assigned by COMPAS, information about the individuals' conduct in the two years following their arrests, and importantly, the ethnicity of the suspects. In Florida, as in many American states, citizens have the right to request access to documents, data or information held by public bodies and administrations.

Society evolves, data evolves. What about algorithms?

Photo by and machines on Unsplash

However, it is not always possible to replicate what ProPublica did with COMPAS, as there is often a lack of complete open datasets that include sensitive information.

This is the focus of a collaboration between Professor Tania Cerquitelli, also a member of the Database and Data Mining Group, Nokia Bell Labs in Cambridge and University College London. “In our initial study, we focused on models that analyze texts in natural language, commonly known as Natural Language Processing (NLP) models,” Cerquitelli explains. “Our goal was to understand how much these models depend on sensitive data and to identify ways to minimize this influence on their algorithmic decisions.”

Cerquitelli and her research team analyzed BERT, a natural language processing (NLP) model developed and released by Google in 2018. They focused on specific tasks, such as identifying phrases that contain offensive language. They then designed three algorithms to be applied sequentially.

The first algorithm identifies which words are most important in deciding whether a sentence is offensive or not. “This first algorithm is based on an explainability technique, which is part of a set of tools designed to help people use artificial intelligence systems, such as neural networks, which are inherently complex,” explains Cerquitelli. She adds, “These tools are essential for increasing awareness of and confidence in this technology, and for guiding its proper use.”

The second algorithm determines whether these words belong to the set of protected information. The third and final algorithm processes the training data to reduce the importance of protected information when the data-driven model performs its specific task.
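
The following sketch is a simplified, hypothetical rendition of this three-step pipeline, not the actual NLPGuard implementation: a leave-one-word-out importance score stands in for the explainability technique, a small hand-written vocabulary stands in for the protected-attribute check, and masking stands in for the mitigation step.

```python
# Simplified illustration of the three-step idea (not the actual NLPGuard code):
# 1) score how much each word pushes the classifier towards "offensive",
# 2) flag the important words that refer to protected attributes,
# 3) mask those words in the training data before re-training the model.
PROTECTED_TERMS = {"black", "gay", "homosexual", "muslim", "woman"}  # illustrative list

def word_importance(classifier, sentence):
    """Step 1: leave-one-word-out importance - drop each word and see how the score changes."""
    words = sentence.split()
    base = classifier(sentence)
    return {w: base - classifier(" ".join(words[:i] + words[i + 1:]))
            for i, w in enumerate(words)}

def flag_protected(importances, threshold=0.1):
    """Step 2: keep only the important words that belong to the protected vocabulary."""
    return {w for w, score in importances.items()
            if score > threshold and w.lower() in PROTECTED_TERMS}

def mask_training_corpus(corpus, flagged):
    """Step 3: neutralize flagged words so the re-trained model relies on them less."""
    return [" ".join("[MASK]" if w.lower() in flagged else w for w in text.split())
            for text in corpus]

# toy classifier returning a fake "toxicity" score, just to make the sketch runnable
toy_classifier = lambda text: 0.9 if "hate" in text else 0.1
imp = word_importance(toy_classifier, "i hate this")
print(flag_protected(imp), mask_training_corpus(["gay people are welcome"], {"gay"}))
```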

Cerquitelli also contributes to the research field known as concept-drift detection.

“Machine learning models are trained on static databases, but the phenomena they model can change over time,” explains Cerquitelli.

One notable example is the machine learning algorithm developed by Amazon for personnel selection, which favored male candidates over female ones. The case came to light at the end of 2018, when the company evaluated the system's performance.

“The discrimination in this case arose because the algorithm had primarily been trained on data from male candidates. As a result, it struggled to effectively evaluate experiences more frequently found in women's resumes, such as communication skills or involvement in volunteer activities.” Society was evolving, but the algorithm was not prepared to keep up.

Potential toxicity P(T) for four phrases predicted as toxic by a classifier. The first three phrases were wrongly classified, while the last one was identified correctly.

Words affecting the toxicity classification of the four phrases. The more intense the colour of a word, the more important its contribution to the toxicity (non-toxicity) classification. The most significant words used by the classifier to make these predictions are in red.

Words such as ‘black’, ‘gay’ or ‘homosexual’ are used to classify texts as toxic or non-toxic, even though they should not influence these classifications.

Greco S. et al., NLPGuard: A Framework for Mitigating the Use of Protected Attributes by NLP Classifiers. Proceedings of the ACM on Human-Computer Interaction, Volume 8, Issue CSCW2 (2024)

We can find cases of concept drift in the manufacturing sector, too. “The algorithm for monitoring a machine is trained on data produced by the machine at the beginning of its life, but the machine's performance deteriorates over time. As a result, the algorithm will gradually become less and less able to properly monitor the machine's functioning.”
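
As a generic illustration (an assumption, not Cerquitelli's method), a drift check of this kind can be as simple as comparing the distribution of recent sensor readings against the data the monitoring model was originally trained on.

```python
# Minimal, generic concept-drift check: compare the distribution of a sensor
# reading in a recent window against the window the monitoring model was trained
# on, and flag drift when the two diverge. Data here is synthetic toy data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(loc=0.0, scale=1.0, size=2000)  # data from the machine when new
recent = rng.normal(loc=0.6, scale=1.2, size=500)      # data after wear: shifted and noisier

statistic, p_value = ks_2samp(reference, recent)       # two-sample Kolmogorov-Smirnov test
if p_value < 0.01:
    print(f"drift detected (KS={statistic:.2f}): the monitoring model may need retraining")
```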

Cerquitelli is also working on the idea of data cooperatives. “Today, the development of algorithms is closely linked to the availability of data. Much of this data is generated by people who share it, whether consciously or not, with big tech companies in exchange for the services they offer. The economic value generated by this data therefore remains with a few market players, and the value chain is interrupted,” explains Cerquitelli. “We are exploring the possibility of tackling this issue by building data cooperatives, where groups of users or organizations voluntarily make their data available to generate social value,” she continues.

One example is financial inclusion. “The segment of the population that has no access to credit could provide data showing their financial reliability, which would otherwise not be available precisely because historically this group has lacked access to credit,” states Cerquitelli. The Data Governance Act, approved by the European Parliament in 2022 and applicable since September 2023, explicitly mentions data cooperatives as a useful tool for redistributing part of the social value generated by the data itself.

“We could generate synthetic data that shares the relevant statistical characteristics of the confidential data. Another option would be using retrieval-augmented generation techniques that extract significant information from a data sample without revealing sensitive details,” concludes Cerquitelli.
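
As a rough sketch of the first option (illustrative only, not a method described by the researchers), one can fit simple summary statistics to confidential records and then sample synthetic records that preserve those statistics without exposing any real individual's data.

```python
# Hypothetical sketch: generate synthetic records that preserve the statistical
# characteristics (means and correlations) of a confidential dataset.
import numpy as np

rng = np.random.default_rng(2)
confidential = rng.normal(size=(300, 3)) @ np.array([[1.0, 0.3, 0.0],
                                                     [0.0, 1.0, 0.5],
                                                     [0.0, 0.0, 1.0]])  # stand-in for real, private data

mean = confidential.mean(axis=0)
cov = np.cov(confidential, rowvar=False)                  # shared statistical characteristics
synthetic = rng.multivariate_normal(mean, cov, size=300)  # records that could be shared instead

print(np.round(cov, 2), np.round(np.cov(synthetic, rowvar=False), 2), sep="\n")
```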

“At Politecnico di Torino we are working to make AI models transparent and interpretable. We are operating in the field of explainable AI.”

- Eliana Pastor -
