Specialists from around the world have issued an appeal demanding greater transparency in the evaluation of artificial intelligence systems.
“Recent advances in artificial intelligence based on systems that require huge amounts of data and computation, such as GPT-4, have highlighted the difficulties in understanding the capabilities and weaknesses of these systems. We do not know where it is safe to use them or how they could be improved, and this is due to the way in which artificial intelligence is evaluated today, which requires urgent change.”
Behind these words are 16 of the leading experts in artificial intelligence from around the world, including José Hernández-Orallo, Fernando Martínez Plumed and Wout Schellaert, researchers at the VRAIN Institute of the Polytechnic University of Valencia (UPV) in Spain.
Coordinated by Professor Hernández-Orallo, the 16 researchers recently published a letter in the academic journal Science in which they argued for the need to “rethink” the evaluation of artificial intelligence tools, moving towards more transparent models in order to establish their real effectiveness: what they can and cannot do.
In their paper, the authors propose a roadmap for evaluating artificial intelligence models in which results are presented in a more nuanced way and case-by-case evaluation results are made publicly available.
Fernando Martínez Plumed, one of the authors of the appeal demanding greater transparency in the evaluation of artificial intelligence systems. (Photo: UPV)
As Hernández-Orallo explains, the performance of an artificial intelligence model is measured with aggregate statistics, and this carries a risk: while such statistics can give an impression of good overall performance, they can also hide low reliability or usefulness in specific, minority cases, “and yet the model is presented as equally valid in all cases when it really isn’t.”
In the document, the signatories illustrate this with the case of artificial intelligence models that assist clinical diagnosis, pointing out that these systems could fail when analysing people from a specific ethnic or demographic group, because such cases constituted only a small proportion of their training data.
“What we ask is that every time an artificial intelligence result is published, it is broken down as much as possible, because otherwise it is not possible to know its real usefulness or to reproduce the analysis. In the article published in Science we also discuss an artificial intelligence facial recognition system that reported a 90% success rate; it was later found that for white men the success rate was 99.2%, but for black women it only reached 65.5%. For this reason, on some occasions, the claims made about the usefulness of an artificial intelligence tool are not completely transparent or reliable. If you are not given the details, you think the models work very well, when that is not the reality. Not having this breakdown, with all the possible information on the artificial intelligence model, means that applying it could entail risks,” points out José Hernández-Orallo.
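As a minimal sketch of the point being made (this is not code from the Science letter), the example below shows how a single aggregate accuracy figure can mask large per-group differences. The group names, sizes and error counts are hypothetical, chosen only to echo the facial recognition numbers quoted above.

```python
# Minimal sketch (not from the Science letter): how an aggregate metric
# can hide per-group disparities. All data below is synthetic/hypothetical.
from collections import defaultdict

# Each record: (demographic group, whether the prediction was correct).
# Sizes are made up: a large majority group with near-perfect accuracy
# and a small minority group with much lower accuracy.
results = (
    [("group_a", True)] * 992 + [("group_a", False)] * 8 +   # 99.2% correct
    [("group_b", True)] * 131 + [("group_b", False)] * 69    # 65.5% correct
)

# Aggregate accuracy: the single number usually reported.
overall = sum(correct for _, correct in results) / len(results)
print(f"overall accuracy: {overall:.1%}")  # ~93.6%, looks reassuring

# Disaggregated accuracy: the case-by-case breakdown the authors ask for.
per_group = defaultdict(lambda: [0, 0])  # group -> [correct, total]
for group, correct in results:
    per_group[group][0] += correct
    per_group[group][1] += 1

for group, (hits, total) in sorted(per_group.items()):
    print(f"{group}: {hits / total:.1%} (n={total})")
```

Producing this breakdown requires exactly what the letter calls for: instance-level evaluation results being made publicly available, not just the headline figure.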
The VRAIN-UPV researcher highlights that the proposed changes can help improve knowledge of the real effectiveness of artificial intelligence systems, and can also reduce the “voracious” competition that currently exists between artificial intelligence laboratories to announce that their model improves on previous systems by some percentage.
“There are laboratories that want to go from 93% to 95% no matter what, and that goes against the ultimate applicability and reliability of artificial intelligence. What we want, in short, is to help us all better understand how artificial intelligence works and what the limitations of each model are, in order to guarantee good use of this technology,” concludes Hernández-Orallo.
The researchers from the VRAIN Institute of the Polytechnic University of Valencia were joined by researchers from the University of Cambridge, Harvard University, the Massachusetts Institute of Technology (MIT), Stanford University, Google, Imperial College London, the University of Leeds, the Alan Turing Institute in London, DeepMind, the US National Institute of Standards and Technology (NIST), the Santa Fe Institute, Tongji University in Shanghai, and Shandong University in Jinan. (Source: UPV)