Google hit with lawsuit alleging it stole data from millions of users to train its AI tools

The Google logo on their building at 111 Eighth Avenue in New York on Tuesday, July 19, 2022.

Richard B. Levine/Sipa USA/Reuters

CNN —

Google was hit with a wide-ranging lawsuit on Tuesday alleging the tech giant scraped data from millions of users without their consent and violated copyright laws in order to train and develop its artificial intelligence products.

The proposed class action suit against Google, its parent company Alphabet, and Google’s AI subsidiary DeepMind was filed in a federal court in California on Tuesday, and was brought by Clarkson Law Firm. The firm previously filed a similar suit against ChatGPT-maker OpenAI last month. (OpenAI did not previously respond to a request for comment on the suit.)

The complaint alleges that Google “has been secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans” and using this data to train its AI products, such as its chatbot Bard. The complaint also claims Google has taken “virtually the entirety of our digital footprint,” including “creative and copywritten works” to build its AI products.

Halimah DeLaine Prado, Google’s general counsel, called the claims in the suit “baseless” in a statement to CNN. “We’ve been clear for years that we use data from public sources — like information published to the open web and public datasets — to train the AI models behind services like Google Translate, responsibly and in line with our AI Principles,” DeLaine Prado said.

“American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims,” the statement added.

Alphabet and DeepMind did not immediately respond to a request for comment.

The complaint points to a recent update to Google’s privacy policy that explicitly states the company may use publicly accessible information to train its AI models and tools such as Bard.

In response to an earlier Verge report on the update, the company said its policy “has long been transparent” about this practice and “this latest update simply clarifies that newer services like Bard are also included.”

The lawsuit comes as a new crop of AI tools have gained tremendous attention in recent months for their ability to generate written work and images in response to user prompts. The large language models underpinning this new technology are able to do this by training on vast troves of online data.

In the process, however, companies are also drawing mounting legal scrutiny over copyright issues from works swept up in these data sets, as well as their apparent use of personal and possibly sensitive data from everyday users, including data from children, according to the Google lawsuit.

“Google needs to understand that ‘publicly available’ has never meant free to use for any purpose,” Tim Giordano, one of the attorneys at Clarkson bringing the suit against Google, told CNN in an interview. “Our personal information and our data is our property, and it’s valuable, and nobody has the right to just take it and use it for any purpose.”

The suit is seeking injunctive relief in the form of a temporary freeze on commercial access to and commercial development of Google’s generative AI tools like Bard. It is also seeking unspecified damages and payments as financial compensation to people whose data was allegedly misappropriated by Google. The firm says it has lined up eight plaintiffs, including a minor.

Giordano contrasted the benefits and alleged harms of how Google typically indexes online data to support its core search engine with the new allegations of it scraping data to train AI tools.

With its search engine, he said, Google can “serve up an attributed link to your work that can actually drive somebody to purchase it or engage with it.” Data scraping to train AI tools, however, is creating “an alternative version of the work that radically alters the incentives for anybody to need to purchase the work,” Giordano added.

While some internet users may have grown accustomed to their digital data being collected and used for search results or targeted advertising, the same may not be true for AI training. “People could not have imagined their information would be used this way,” Giordano said.

Ryan Clarkson, a partner at the law firm, said Google needs to “create an opportunity for folks to opt out” of having their data used for training AI while still maintaining their ability to use the internet for their everyday needs.

CNN values your feedback

CNN Business Videos