Failed Machine Learning (FML)
High-profile real-world examples of failed machine learning projects
“Success is not final, failure is not fatal. It is the courage to continue that counts.” - Winston Churchill
If you are looking for examples of how ML can fail despite all its incredible potential, you have come to the right place. Beyond the wonderful success stories of applied machine learning, here is a list of failed projects from which we can learn a lot.
A team at Princeton University compiled 20 reviews across 17 scientific fields that found significant errors (e.g., data leakage, no train-test split) in 329 papers applying ML-based science
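To make the most common of these pitfalls concrete, here is a minimal, hypothetical sketch (random placeholder data, scikit-learn) of preprocessing-based leakage: fitting a scaler on the full dataset before splitting lets test-set statistics seep into training, which can inflate reported performance.

```python
# Hypothetical illustration of data leakage via preprocessing.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 20)        # placeholder features
y = np.random.randint(0, 2, 1000)    # placeholder labels

# Leaky pattern: the scaler is fit on ALL rows, so statistics from the
# future test set inform preprocessing of the training set.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# Correct pattern: split first, then fit the scaler on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```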
Hundreds of predictive models were developed to diagnose or triage COVID-19 patients faster, but ultimately none of them were fit for clinical use and some were potentially harmful
Allegheny County, Pennsylvania's predictive algorithm (which helps identify which families are to be investigated by social workers for child abuse and neglect) flagged a disproportionate number of Black children for 'mandatory' neglect investigations.
Oregon's AI algorithm for child welfare, a predictive tool similar to the one in Pennsylvania, was also dropped a month after the report on the Allegheny County algorithm
A widely used algorithm for predicting healthcare needs exhibited racial bias: at a given risk score, Black patients were considerably sicker than white patients. The root cause was the choice of label, as the algorithm predicted healthcare costs as a proxy for health needs, and unequal access to care means less money is spent on equally sick Black patients
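Below is a hedged, fully synthetic sketch of that mechanism (all numbers and column names are invented for illustration): when spending is suppressed for one group, a cost-based risk score gives that group equal scores at higher levels of true need.

```python
# Synthetic illustration of label-choice bias: predicting cost as a proxy
# for health need. All data below is made up for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
race = rng.choice(["Black", "white"], n)
need = rng.poisson(3, n)                       # true chronic-condition count
access = np.where(race == "Black", 0.7, 1.0)   # assumed access barrier
cost = need * access + rng.normal(0, 0.3, n)   # proxy label a model would learn

df = pd.DataFrame({"race": race, "need": need, "risk_score": cost})
df["decile"] = pd.qcut(df["risk_score"], 10, labels=False, duplicates="drop")
# Within each risk-score decile, the group facing access barriers is sicker.
print(df.groupby(["decile", "race"])["need"].mean().unstack())
```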
Apple's new credit card (created in partnership with Goldman Sachs) is being investigated by financial regulators after customers complained that the card's lending algorithms discriminated against women: in one reported case, a male customer's Apple Card credit line was 20 times higher than the one offered to his spouse
Amazon's facial recognition technology (Rekognition) falsely matched 28 members of Congress with mugshots of arrested people, and the false matches fell disproportionately on people of color, revealing racial bias in the algorithm
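One technical detail worth noting: the ACLU ran its test at Rekognition's default 80% similarity threshold, whereas Amazon recommends a 99% threshold for law-enforcement use. Here is a brief sketch of the boto3 call involved (image bytes are placeholders):

```python
# Sketch of Rekognition face comparison with an explicit similarity threshold.
import boto3

client = boto3.client("rekognition")

def face_matches(source_bytes: bytes, target_bytes: bytes, threshold: float):
    # compare_faces returns only matches with Similarity >= SimilarityThreshold;
    # at 80 (the default) far more weak matches pass than at 99.
    resp = client.compare_faces(
        SourceImage={"Bytes": source_bytes},
        TargetImage={"Bytes": target_bytes},
        SimilarityThreshold=threshold,
    )
    return [m["Similarity"] for m in resp["FaceMatches"]]
```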
An algorithm (PULSE, based on StyleGAN) designed to generate depixelated faces showed signs of racial bias, with outputs skewed towards white faces; most famously, it reconstructed a pixelated image of Barack Obama as a white man
The Gender Shades study revealed that Microsoft's and IBM's face-analysis services for identifying the gender of people in photos frequently erred when analyzing images of dark-skinned women, with error rates far higher than for light-skinned men
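The methodological takeaway, sketched below with fabricated results, is to evaluate error rates per intersectional subgroup rather than reporting a single aggregate accuracy that can hide large gaps.

```python
# Disaggregated evaluation sketch; all outcomes below are fabricated.
import pandas as pd

results = pd.DataFrame({
    "skin_type": ["darker", "darker", "lighter", "lighter"] * 50,
    "gender":    ["female", "male"] * 100,
    "correct":   [0, 1, 1, 1] * 50,    # fabricated per-image outcomes
})
results["group"] = results["skin_type"] + "-" + results["gender"]
print(1 - results.groupby("group")["correct"].mean())  # per-group error rate
```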
A false facial recognition match by New Jersey police landed an innocent Black man, Nijeer Parks, in jail even though he was 30 miles away from the crime scene
Google Health's retina scanning tool for detecting diabetic retinopathy fared much worse in real-life clinical settings in Thailand than in controlled experiments, with issues such as rejected scans (due to poor scan image quality) and delays from intermittent internet connectivity when uploading images to the cloud for processing
Facebook's AI negotiation system was shut down after its chatbots abandoned English and began negotiating in a shorthand language they had created themselves
Project December, a GPT-3-based chatbot built by indie game developer Jason Rohrer (and used by one man to simulate conversations with his dead fiancée), was shut down by OpenAI after Rohrer refused their request to insert an automated monitoring tool amid concerns that the chatbot could be racist or overtly sexual
A core problem with Meta's Galactica was that it could not distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. It was found to fabricate papers (sometimes attributing them to real authors), and it generated articles about the history of bears in space as readily as ones about protein complexes.
In its first public demo advertisement, Google's Bard chatbot made a factual error, claiming that the James Webb Space Telescope took the very first pictures of a planet outside our solar system.
The recommender system that won Netflix's $1 million challenge improved on the Cinematch baseline by 8.43%. However, this performance gain did not justify the engineering effort needed to bring it into a production environment.