FAU's CA-AI Makes AI Smarter by Cleaning Up Bad Data Before It Learns
In the world of machine learning and artificial intelligence, clean data is everything. Even a small number of mislabeled examples, known as label noise, can derail a model's performance, especially for models like Support Vector Machines (SVMs) that rely on a few key data points to make decisions.
SVMs are a widely used type of machine learning algorithm, applied in everything from image and speech recognition to medical diagnostics and text classification. These models operate by finding a boundary that best separates different categories of data. They rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundaries can be flawed, leading to poor performance on real-world data.
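The sensitivity described above can be seen in a small sketch. This is purely illustrative, using a toy 2-D dataset and scikit-learn's `SVC` as a stand-in (not the authors' models or data): flipping the label of a single support vector visibly moves the learned boundary.

```python
# Illustrative only: a toy 2-D dataset and scikit-learn's linear SVC,
# not the authors' models or data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2.0, 0.0], [-1.5, 0.5], [-1.0, 0.0],   # class 0
              [ 1.0, 0.0], [ 1.5, -0.5], [ 2.0, 0.0]]) # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vectors:", clf.support_)  # the few boundary-defining points

# Flip the label of one support vector and retrain.
y_noisy = y.copy()
i = clf.support_[0]
y_noisy[i] = 1 - y_noisy[i]
clf_noisy = SVC(kernel="linear", C=10.0).fit(X, y_noisy)

# The separating hyperplane w.x + b = 0 has shifted.
print("clean :", clf.coef_[0], clf.intercept_[0])
print("noisy :", clf_noisy.coef_[0], clf_noisy.intercept_[0])
```

Mislabeling a non-support point would barely matter; mislabeling a support vector redefines the boundary itself, which is exactly the vulnerability the FAU method targets.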
Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University and collaborators have developed an innovative method to automatically detect and remove faulty labels before a model is ever trained – making AI smarter, faster and more reliable. Before the AI even starts learning, the researchers clean the data using a math technique that looks for odd or unusual examples that don’t quite fit. These “outliers” are removed or flagged, making sure the AI gets high-quality information right from the start.
“SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering,” said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow. “What makes them especially effective – but also uniquely vulnerable – is that they rely on just a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled – for example, if a malignant tumor is incorrectly marked as benign – it can distort the model’s entire understanding of the problem. The consequences of that could be serious, whether it’s a missed cancer diagnosis or a security system that fails to flag a threat. Our work is about protecting models – any machine learning and AI model including SVMs – from these hidden dangers by identifying and removing those mislabeled cases before they can do harm.”
The data-driven method that “cleans” the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class purely based on how well they fit with the rest of the group.
“Data points that appear to deviate significantly from the rest – often due to label errors – are flagged and removed,” said Pados. “Unlike many existing techniques, this process requires no manual tuning or user intervention and can be applied to any AI model, making it both scalable and practical.”
The process is robust, efficient and fully automatic – even handling the notoriously tricky task of rank selection (choosing how many dimensions to keep during analysis) without user input.
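The cleaning idea can be sketched in a simplified form. The following is not the authors' exact algorithm: it computes an exact rank-1 L1-norm principal component by the known equivalence to a search over binary sign vectors (feasible only for small datasets), then flags the points in a class that fit that subspace worst. The flagging fraction and the toy data are illustrative assumptions.

```python
# Simplified sketch (not the authors' exact algorithm): flag points in a
# class that fit poorly with the rank-1 L1-norm-PCA subspace of that class.
import itertools
import numpy as np

def l1_pca_rank1(X):
    """Exact rank-1 L1-norm PCA of the d x N matrix X.

    Maximizes sum_i |w . x_i| over unit vectors w, via the known
    equivalence to maximizing ||X b||_2 over sign vectors b in {-1,+1}^N.
    Exhaustive search, so only feasible for small N.
    """
    _, N = X.shape
    best_norm, best_w = -1.0, None
    for b in itertools.product((-1.0, 1.0), repeat=N):
        v = X @ np.array(b)
        nrm = np.linalg.norm(v)
        if nrm > best_norm:
            best_norm, best_w = nrm, v / nrm
    return best_w

def flag_worst_fitting(X, frac=0.1):
    """Indices of the frac of columns with the largest residual after
    projecting onto the rank-1 L1-PCA subspace."""
    w = l1_pca_rank1(X)
    resid = np.linalg.norm(X - np.outer(w, w @ X), axis=0)
    k = max(1, int(round(frac * X.shape[1])))
    return np.argsort(resid)[-k:]

# One "class": nine points along a line, plus one injected point that
# deviates strongly from the class trend (e.g., a mislabeled example).
rng = np.random.default_rng(0)
t = rng.normal(size=9)
clean = np.vstack([t, 2.0 * t + 0.05 * rng.normal(size=9)])  # 2 x 9
bad = np.array([[3.0], [-6.0]])                              # off-trend
X = np.hstack([clean, bad])                                  # 2 x 10
print(flag_worst_fitting(X, frac=0.1))  # flags index 9, the injected point
```

The design point mirrors the article: because the L1 norm downweights large deviations, the fitted subspace tracks the bulk of the class rather than being dragged toward the odd point, so the misfit point stands out by its residual with no noise model or tuning assumed.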
The researchers extensively tested their technique on real and synthetic datasets with various levels of label contamination. Across the board, it produced consistent and notable improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.
“What makes our approach particularly compelling is its flexibility,” said Pados. “It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it’s not just theoretical – extensive testing on both noisy and clean datasets, including well-known benchmarks like the Wisconsin Breast Cancer dataset, showed consistent improvements in classification accuracy. Even in cases where the original training data appeared flawless, our new method still enhanced performance, suggesting that subtle, hidden label noise may be more common than previously thought.”
Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science such as reducing data bias and improving the completeness of datasets.
“As machine learning becomes deeply integrated into high-stakes domains like health care, finance and the justice system, the integrity of the data driving these models has never been more important,” said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science. “We’re asking algorithms to make decisions that impact real lives – diagnosing diseases, evaluating loan applications, even informing legal judgments. If the training data is flawed, the consequences can be devastating. That’s why innovations like this are so critical. By improving data quality at the source – before the model is even trained – we’re not just making AI more accurate; we’re making it more responsible. This work represents a meaningful step toward building AI systems we can trust to perform fairly, reliably and ethically in the real world.”
This work will appear in IEEE Transactions on Neural Networks and Learning Systems. Co-authors, all IEEE members, are Shruti Shukla, a Ph.D. student in CA-AI and the FAU Department of Electrical Engineering and Computer Science; George Sklivanitis, Ph.D., Charles E. Schmidt Research Associate Professor in CA-AI and the Department of Electrical Engineering and Computer Science, and I-SENSE faculty fellow; Elizabeth Serena Bentley, Ph.D.; and Michael J. Medley, Ph.D., United States Air Force Research Laboratory.
-FAU-