Google's Unrestricted Use of Publisher Content for AI Raises Concerns Amid Antitrust Scrutiny

Google Search products may use content from publishers even if those publishers have opted out of AI training. This emerged during testimony by a Google DeepMind executive in the U.S. Justice Department's ongoing antitrust lawsuit against Google. The executive clarified that the content in question is not incorporated into the AI models created by DeepMind. The Mountain View-based company is said to have indicated that content used for Search features is governed by a separate process that relies on the robots.txt web standard.

Update: Following the publication of this article, Google’s global PR team contacted Gadgets 360 with further clarification. They stated that the publisher opt-out provision is relevant solely to its Google-Extended product and has never been applicable to Google Search. Google-Extended serves as a separate product token, enabling web publishers to control whether the content crawled by Google from their sites can be used to train future generations of Gemini models, which back Gemini Apps and the Vertex AI API for Gemini. Furthermore, Google-Extended does not affect a site’s presence in Google Search and is not a ranking factor for Google Search.

Google’s Distinct Approach for AI Models and Search Products

A Bloomberg report states that Eli Collins, the Vice President of Product at Google DeepMind, acknowledged that the protocols for respecting publishers’ choices to opt out of AI training differ between DeepMind’s AI models and the company’s Search products.

Google-Extended has not applied, and will not apply, to Google Search.

Diana Aguilar, the attorney for the Department of Justice in the antitrust matter, is cited as having produced a document showing that 80 billion of the 160 billion tokens used to train Google's AI systems came from publishers who had opted out of AI training. Collins reportedly asserted that once publishers opt out of AI training, their content is not used by DeepMind's models.

Nonetheless, when Aguilar inquired if the Gemini AI model could access the same material if used in the Search product, Collins confirmed this was “correct,” so long as it was utilized within the context of Search. This specifically pertains to Gemini models that drive Google’s AI Overviews and the newly introduced AI Mode.

This indicates that standard opt-out measures are insufficient to prevent Google from leveraging publisher content. In June 2023, the tech giant revised its privacy policy to state that it would use all publicly available internet data for training its language models. Here, publicly available internet data means content on any website that has no paywall or mandatory sign-up restricting public access.

A Google spokesperson later informed Bloomberg that the rules for Search-based AI tools are distinct, noting that publishers can “only refuse to allow their data to be used in Search AI if they opt out of being indexed for search.” Publishers can do this through the robots.txt web standard, by adding rules that block Google’s crawler bots from accessing their content for indexing in search results.

However, doing so would also mean that those web pages no longer appear in Google’s search results when a user searches for a topic. This essentially leaves publishers no choice but to consent to the company’s use of their data to train AI models.
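The trade-off described above can be illustrated with a small sketch. It uses Python's standard-library robots.txt parser to evaluate two hypothetical robots.txt files: one that blocks only the Google-Extended token, and one that blocks Googlebot itself. The file contents, the example.com URL, and the helper name is_allowed are assumptions made for illustration. The parser only evaluates rules per token; it is not a model of how Google actually processes these files.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: opts out of Gemini training only.
BLOCK_GOOGLE_EXTENDED = """\
User-agent: Google-Extended
Disallow: /
"""

# Hypothetical robots.txt: blocks Google's search crawler entirely.
BLOCK_GOOGLEBOT = """\
User-agent: Googlebot
Disallow: /
"""

URL = "https://example.com/article"  # placeholder URL

# Blocking Google-Extended leaves the search crawler's rules untouched.
extended_blocked = is_allowed(BLOCK_GOOGLE_EXTENDED, "Google-Extended", URL)
googlebot_still_allowed = is_allowed(BLOCK_GOOGLE_EXTENDED, "Googlebot", URL)

# Blocking Googlebot is the only rule here that keeps the page from the
# search crawler, which also removes it from search results.
googlebot_blocked = is_allowed(BLOCK_GOOGLEBOT, "Googlebot", URL)

print(extended_blocked, googlebot_still_allowed, googlebot_blocked)
```

Under the first file, the page remains crawlable for Search and so remains available to Search-based AI features; only the second file keeps it out, at the cost of search visibility.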

The ongoing antitrust litigation seeks to establish that Google maintains a monopoly in the search and AI sectors. Amit Mehta, a U.S. District Judge overseeing the case, is being urged by the Department of Justice to compel the tech giant to divest Google Chrome and share the data used for search result generation. However, no similar recommendations have been made concerning the company’s AI offerings.
