Stanford researchers have found child pornography images in material used to train popular AI image generators, with the illegal images having been identified since April.
The Stanford Internet Observatory discovered more than 1,000 child sexual abuse images in LAION-5B, an open dataset used to train the Stable Diffusion AI (artificial intelligence) image generator. Stable Diffusion belongs to Stability AI, a London-based artificial intelligence company that offers a text-to-image AI generator and has used LAION-5B to train it.
The abusive images made their way into the dataset because it was compiled by scraping the open internet, sweeping up content from social media platforms and pornographic websites alike.
The researchers reported their findings to the relevant nonprofits in the United States and Canada. They did not view abusive content directly; the research was done primarily with Microsoft’s PhotoDNA, a tool that matches the hashes, or digital fingerprints, of images against the nonprofits’ databases of known abusive material. Microsoft developed PhotoDNA specifically to identify child abuse photos.
The researchers recommended that future datasets use a detection tool such as PhotoDNA to filter out abusive images, but noted that open datasets are difficult to clean when no central authority hosts the data.
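The filtering approach described here amounts to hashing each image and comparing that hash against lists of known abusive material supplied by child-protection organizations. PhotoDNA itself is proprietary, so the short Python sketch below only illustrates the general idea using the open-source imagehash and Pillow packages as stand-ins; the directory name, threshold, and hash value are hypothetical placeholders, not anything drawn from the Stanford report.

```python
# Illustrative sketch of hash-based dataset filtering. This is NOT PhotoDNA,
# which is proprietary; an open-source perceptual hash stands in to show the
# general technique. All paths and hash values below are hypothetical.
from pathlib import Path

import imagehash            # pip install imagehash
from PIL import Image       # pip install Pillow

# Hypothetical perceptual hashes of known abusive images, as would be supplied
# (in hashed form only) by a child-protection nonprofit.
KNOWN_BAD_HASHES = {
    imagehash.hex_to_hash("d1c4a8b2e3f09176"),
}

HAMMING_THRESHOLD = 5  # small distances indicate a near-duplicate image


def is_flagged(image_path: Path) -> bool:
    """Return True if the image is a near match to any known-bad hash."""
    candidate = imagehash.phash(Image.open(image_path))
    return any(candidate - bad <= HAMMING_THRESHOLD for bad in KNOWN_BAD_HASHES)


def filter_dataset(image_dir: Path) -> list[Path]:
    """Keep only images that do not match the known-bad hash list."""
    return [p for p in sorted(image_dir.glob("*.jpg")) if not is_flagged(p)]


if __name__ == "__main__":
    kept = filter_dataset(Path("dataset_images"))  # hypothetical directory
    print(f"{len(kept)} images passed the hash filter")
```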
On the eve of the release of the Stanford Internet Observatory’s report, LAION told The Associated Press it was temporarily removing its datasets.
LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, said in a statement that it “has a zero tolerance policy for illegal content and in an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them.”
While the images account for just a fraction of LAION’s index of some 5.8 billion images, the Stanford group says their presence likely influences the ability of AI tools to generate harmful outputs and reinforces the prior abuse of real victims who appear multiple times.
This is because only a handful of abusive images are needed for an AI tool to produce thousands more deepfakes, endangering children and young people across the globe.
‘Rushed to Market’
Many generative AI projects were “effectively rushed to market” and made widely accessible, said Stanford Internet Observatory’s chief technologist David Thiel, who authored the report.
“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” Mr. Thiel said in an interview.
According to victim reports, millions of such fake explicit images have already been produced and circulated on the internet, while authorities have identified only 19,000 victims.
A prominent LAION user who helped shape the dataset’s development is London-based startup Stability AI, maker of the Stable Diffusion text-to-image models. New versions of Stable Diffusion have made it much harder to create harmful content, but an older version introduced last year—which Stability AI says it didn’t release—is still baked into other applications and tools and remains “the most popular model for generating explicit imagery,” according to the Stanford report.
“We can’t take that back. That model is in the hands of many people on their local machines,” said Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection, which runs Canada’s hotline for reporting online sexual exploitation.
Stability AI said it only hosts filtered versions of Stable Diffusion and that “since taking over the exclusive development of Stable Diffusion, Stability AI has taken proactive steps to mitigate the risk of misuse.”
“Those filters remove unsafe content from reaching the models,” the company said in a prepared statement. “By removing that content before it ever reaches the model, we can help to prevent the model from generating unsafe content.”
LAION was the brainchild of a German researcher and teacher, Christoph Schuhmann, who told the AP earlier this year that part of the reason to make such a huge visual database publicly accessible was to ensure that the future of AI development isn’t controlled by a handful of powerful companies.
“It will be much safer and much more fair if we can democratize it so that the whole research community and the whole general public can benefit from it,” he said.
AI Legislation Worldwide
The U.S. will launch an AI safety institute to evaluate known and emerging risks of so-called “frontier” AI models, Secretary of Commerce Gina Raimondo said in November during the AI Safety Summit in Britain.
President Joe Biden issued an executive order on Oct. 30 to require developers of AI systems that pose risks to U.S. national security, the economy, public health or safety to share the results of safety tests with the government.
The U.S. Federal Trade Commission opened an investigation into OpenAI in July on claims that it has run afoul of consumer protection laws.
Australia will make search engines use new algorithms to prevent the sharing of child sexual abuse material created by AI and the production of deepfake versions of the same material.
Leading AI developers agreed in November, at the first global AI Safety Summit held in Britain, to work with governments to test new frontier models before they are released in order to help manage the risks of AI.
More than 25 countries present at the summit, including the United States and India, as well as the European Union, signed a “Bletchley Declaration” to work together and establish a common approach to oversight.
Britain’s data watchdog said in October it had issued Snap Inc’s Snapchat with a preliminary enforcement notice over a possible failure to properly assess the privacy risks of its generative AI chatbot to users, particularly children.
The Associated Press and Reuters contributed to this report.