Adobe’s ‘ethical’ Firefly AI was trained on Midjourney images

When Adobe Inc released its Firefly image-generating software last year, the company said the artificial intelligence model was trained mainly on Adobe Stock, its database of hundreds of millions of licensed images. Firefly, Adobe said, was a “commercially safe” alternative to competitors like Midjourney, which learned by scraping pictures from across the Internet.

But behind the scenes, Adobe also was relying in part on AI-generated content to train Firefly, including from those same AI rivals. In numerous presentations and public posts about how Firefly is safer than the competition due to its training data, Adobe never made clear that its model actually used images from some of these same competitors.

Massive amounts of data are needed to train AI models underlying popular content creation products, and there is increasing scrutiny on AI technology companies over their use of copyrighted materials in this process. Companies like Midjourney, Dall-E maker OpenAI and Stable Diffusion maker Stability AI built their media-generating models with datasets that pull imagery from across the Internet, a practice that has led to outrage and lawsuits from a number of artists.

“This shows the murkiness of the definition of responsible AI, and it also illustrates the difficulties of getting away from, if not the legal, then the social and cultural problems, or ethical problems, with generated content,” said Luke Stark, an assistant professor at Western University in Ontario, who studies the social and ethical impacts of AI.

Adobe’s decision to build Firefly with content that the company holds the rights to or that is in the public domain was meant to differentiate its AI image tool in the fast-growing market for generative artificial intelligence. The company promoted it as a more ethical, legally sound option for customers interested in conjuring images from just a few words but wary of potential copyright issues. It won’t generate content based on the intellectual property of other people or brands, Adobe has said, and will avoid producing harmful images, too.

AI-generated content made it into Firefly’s training set because creators were allowed to submit millions of images into Adobe’s stock marketplace that used the technology from other companies. “Generative AI images from the Adobe Stock collection are a small part of the Firefly training dataset,” wrote Adobe representative Michelle Haarhoff in September on a Discord group for photographers and artists who contribute to the marketplace.

Adobe said a relatively small amount – about 5% – of the images used to train its AI tool was generated by other AI platforms. “Every image submitted to Adobe Stock, including a very small subset of images generated with AI, goes through a rigorous moderation process to ensure it does not include IP, trademarks, recognisable characters or logos, or reference artists’ names,” a company spokesperson said.

Criticism of the practice has come from inside the company: Since the early days of Firefly, there has been internal disagreement on the ethics and optics of ingesting AI-generated imagery into the model, according to multiple employees familiar with its development who asked not to be named because the discussions were private. Some have suggested weaning the system off generated images over time, but one of the people said there are no current plans to do so.

Adobe has taken shots at competitors over their data collection practices. Other models are built on data that is “openly scraped”, Chief Strategy Officer Scott Belsky said last year. One way that Firefly is better than OpenAI’s comparable model is that it shows respect for the creative community by training only on licensed or freely available data, Adobe says on its website. And in a blog post last March titled “Responsible Innovation in the Age of Generative AI,” general counsel Dana Rao pointed out that generative AI “is only as good as the data on which it’s trained.”

“Training on curated, diverse datasets inherently gives your model a competitive edge when it comes to producing commercially safe and ethical results,” he wrote, while pointing out that Adobe trained Firefly on Adobe stock images, licensed content and public domain content in which the copyright has run out.

“Our enterprise customers came to us when we launched Firefly and said, ‘We love what you’re doing, we really appreciate that you’re not stealing all of our intellectual property out on the open Internet’,” Ashley Still, an Adobe senior vice president, said earlier this month during a Bloomberg Intelligence event.

Still, Adobe never made clear publicly that Firefly had trained in part on images from competitors’ tools that are supposedly less ethical. It did, however, outline such details in at least two online discussion groups the company runs on Discord – one for Adobe Stock and another devoted to Firefly – according to messages Bloomberg has viewed.

In March 2023, Adobe unveiled Firefly as a “beta” product. That month, Raúl Cerón, who works with the Adobe Stock community, posted on Discord that the company wasn’t planning to use generated images to train the forthcoming public version of Firefly.

“Once we go live out of beta, we will have a new training database for it, leaving Gen AI content out of it,” he wrote in a post in June.

When Adobe announced the public release of Firefly on Sept 13, the company also paid a special “Firefly bonus” to Adobe Stock contributors “whose content was used to train the first commercial Firefly model”. Contributors who used generative AI were among those who received the bonus payment, according to a Discord message from Mat Hayward, who also works with the Adobe Stock community.

AI-generated imagery in Adobe Stock “enhances our dataset training model, and we decided to include this content for the commercially released version of Firefly,” Hayward wrote.

Brian Penny, a writer and stock image contributor who has submitted thousands of AI-generated images – mostly made with Midjourney – to Adobe Stock, was surprised to get the bonus. He figured as an AI contributor he wouldn’t be eligible. Despite the financial gain, Penny thinks the decision to train Firefly on content such as his is a bad one, and said the company should be more candid about how it’s training the software for creating images.

“They need to be ethical, they need to be more transparent, they need to do more,” he said.

Adobe Stock’s library has boomed since it began formally accepting AI content in late 2022. Today, there are about 57 million images, or about 14% of the total, tagged as AI-generated. Artists who submit AI images must specify that the work was created using the technology, though they don’t need to say which tool they used. To feed its AI training set, Adobe has also offered to pay contributors to submit large batches of photos for AI training – such as images of bananas or flags.

Training on AI-generated content probably wouldn’t make Adobe’s Firefly image generator less commercially safe, and the company isn’t required to say what it’s training on as long as it isn’t misleading consumers, said Harvard professor Rebecca Tushnet, who focuses on copyright and advertising law. But training on AI images, such as those created by Midjourney, undermines the idea that Firefly is distinct from competing services, she said.

“Adobe basically wants to position itself as the superior alternative, but it also wants really cheap inputs, and AI is a really good way to get cheap inputs,” she said. – Bloomberg
