Close Menu
  • Home
  • Market News
    • Crude Oil Prices
    • Brent vs WTI
    • Futures & Trading
    • OPEC Announcements
  • Company & Corporate
    • Mergers & Acquisitions
    • Earnings Reports
    • Executive Moves
    • ESG & Sustainability
  • Geopolitical & Global
    • Middle East
    • North America
    • Europe & Russia
    • Asia & China
    • Latin America
  • Supply & Disruption
    • Pipeline Disruptions
    • Refinery Outages
    • Weather Events (hurricanes, floods)
    • Labor Strikes & Protest Movements
  • Policy & Regulation
    • U.S. Energy Policy
    • EU Carbon Targets
    • Emissions Regulations
    • International Trade & Sanctions
  • Tech
    • Energy Transition
    • Hydrogen & LNG
    • Carbon Capture
    • Battery / Storage Tech
  • ESG
    • Climate Commitments
    • Greenwashing News
    • Net-Zero Tracking
    • Institutional Divestments
  • Financial
    • Interest Rates Impact on Oil
    • Inflation + Demand
    • Oil & Stock Correlation
    • Investor Sentiment

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

What's Hot

ADNOC CEO meets Japan PM Takaichi as Strait of Hormuz tensions mount

March 5, 2026

Hydrogen Europe

March 5, 2026

Maritime Groups Comment on Middle East Situation

March 5, 2026
Facebook X (Twitter) Instagram Threads
Oil Market Cap – Global Oil & Energy News, Data & Analysis
  • Home
  • Market News
    • Crude Oil Prices
    • Brent vs WTI
    • Futures & Trading
    • OPEC Announcements
  • Company & Corporate
    • Mergers & Acquisitions
    • Earnings Reports
    • Executive Moves
    • ESG & Sustainability
  • Geopolitical & Global
    • Middle East
    • North America
    • Europe & Russia
    • Asia & China
    • Latin America
  • Supply & Disruption
    • Pipeline Disruptions
    • Refinery Outages
    • Weather Events (hurricanes, floods)
    • Labor Strikes & Protest Movements
  • Policy & Regulation
    • U.S. Energy Policy
    • EU Carbon Targets
    • Emissions Regulations
    • International Trade & Sanctions
  • Tech
    • Energy Transition
    • Hydrogen & LNG
    • Carbon Capture
    • Battery / Storage Tech
  • ESG
    • Climate Commitments
    • Greenwashing News
    • Net-Zero Tracking
    • Institutional Divestments
  • Financial
    • Interest Rates Impact on Oil
    • Inflation + Demand
    • Oil & Stock Correlation
    • Investor Sentiment
Oil Market Cap – Global Oil & Energy News, Data & Analysis
Home » There’s a Looming AI Data Shortage. Google Researchers Have a New Fix.
U.S. Energy Policy

There’s a Looming AI Data Shortage. Google Researchers Have a New Fix.

omc_adminBy omc_adminSeptember 15, 2025No Comments4 Mins Read
Share
Facebook Twitter Pinterest Threads Bluesky Copy Link


Google DeepMind researchers have an idea for how to solve the AI data drought, and it might involve your Social Security number.

The large language models powering AI require vast amounts of training data pulled from webpages, books, and other sources. When it comes to text specifically, the amount of data on the web considered fair game for training AI models is being scraped faster than new data is being created.

However, a large portion of the data isn’t used because it’s deemed toxic, inaccurate, or it contains personally identifiable information.

In a newly published paper, a group of Google DeepMind researchers claim to have found a way to clean up this data and make it usable for training, which they claim could be a “powerful tool” for scaling up frontier models.

They refer to the idea as Generative Data Refinement, or GDR. The method uses pretrained generative models to rewrite the unusable data, effectively purifying it so it can be safely trained on. It’s not clear if this is a technique Google is using for its Gemini models.

Minqi Jiang, one of the paper’s researchers who has since left the company to Meta, told Business Insider that a lot of AI labs are leaving usable training data on the table because it’s intermingled with bad data. For example, if there’s a document on the web that contains something considered unusable, such as someone’s phone number or an incorrect fact, labs will often discard the entire thing.

“So you essentially lose all those tokens inside of that document, even if it was a small single line that contained some personally identifying information,” said Jiang. Tokens are the units of data, processed by AI, which make up words within text.

Related stories

Business Insider tells the innovative stories you want to know

Business Insider tells the innovative stories you want to know

The authors give an example of raw data that included someone’s Social Security number or information that may soon be out of date (“the incoming CEO is…”). In these instances, the GDR would swap or remove the numbers, ignore the information that risks becoming obsolete, and retain the remainder of usable data.

The paper was written more than a year ago and was only published this month. A Google DeepMind spokesperson did not respond to a request for comment about whether the researcher’s work was being applied to the company’s AI models.

The authors’ findings could prove helpful for labs as the usable well of data runs dry. They cite a research paper from 2022 that predicted AI models could soak up all the human-generated text between 2026 and 2032. This prediction was based upon the amount of indexed web data, using statistics from Common Crawl, a project that continuously scrapes web pages and makes them openly available for AI labs to use.

For the GDR paper, the researchers performed a proof of concept by taking over one million lines of code and having human expert labelers annotate the data line by line. They then compared the results with the GDR method.

“It completely crushes the existing industry solutions being used for this kind of stuff,” said Jiang.

The authors also said their method is better than the use of synthetic data (data generated by AI models for the purpose of training themselves or other models), which has been a topic of exploration among AI labs. However, using synthetic data can degrade the quality of model output and, in some cases, lead to “model collapse.”

The authors compared the GDR data against synthetic data created by an LLM and discovered that their approach created a better dataset for training AI models.

They also said further testing could be conducted on other complicated types of data considered a no-go, such as copyrighted materials and personal data that is inferred across multiple documents rather than explicitly spelled out.

The paper has not been peer reviewed, said Jiang, adding that this is common in the tech industry and that all papers are reviewed internally.

The researchers only tested GDR on text and coding. Jiang said that it could also be tested on other modalities, such as video and audio. However, given the rate at which new videos are generated each day, they’re still providing a firehose of data for AI to train on.

“With video, you’re just going to have a lot more of it, just because there’s a constant stream of millions of hours of video generated each day,” said Jiang. “So I do think, going across new modalities beyond text, video, and images, we’re going to unlock a lot more data.”

Have something to share? Contact this reporter via email at hlangley@businessinsider.com or Signal at 628-228-1836. Use a personal email address and a non-work device; here’s our guide to sharing information securely.



Source link

Share. Facebook Twitter Pinterest Bluesky Threads Tumblr Telegram Email
omc_admin
  • Website

Related Posts

Palo Alto Networks Founder Raises Funding for New AI Hardware Security Startup

March 5, 2026

Ex-Khosla Investor Launches $52 Million Fund for Contrarian AI Bets

March 5, 2026

Kalshi Amends Rules Amid Khamenei ‘Death Market’ Controversy

March 5, 2026
Add A Comment
Leave A Reply Cancel Reply

Top Posts

Federal Reserve cuts key rate for first time this year

September 17, 202513 Views

Inflation or jobs: Federal Reserve officials are divided over competing concerns

August 14, 20259 Views

Oil tanker rates to stay strong into 2026 as sanctions remove ships for hire – Oil & Gas 360

December 16, 20258 Views
Don't Miss

ADNOC CEO meets Japan PM Takaichi as Strait of Hormuz tensions mount

By omc_adminMarch 5, 2026

(Bloomberg) – Abu Dhabi National Oil Co. Chief Executive Officer Sultan Al Jaber met with…

Svante Acquires Carbon Alpha To Expand Carbon Removal Projects In Canada

March 5, 2026

Carlsberg Raises Climate Ambition With Updated Brewing Tomorrow ESG Programme

March 5, 2026

South Korea Moves Toward Mandatory ISSB-Aligned Climate Disclosures For Large KOSPI Firms

March 5, 2026
Top Trending

Galvanize Raises $370 Million for Real Estate Decarbonization Fund

By omc_adminMarch 5, 2026

RIFT Raises $132 Million to Decarbonize Industrial Heat with “Iron Fuel”

By omc_adminMarch 5, 2026

Korea Plans Mandatory Sustainability Reporting Beginning in 2028

By omc_adminMarch 5, 2026
Most Popular

The 5 Best 65-Inch TVs of 2025

July 3, 202515 Views

AI’s Next Bottleneck Isn’t Just Chips — It’s the Power Grid: Goldman

November 14, 202514 Views

The Layoffs List of 2025: Meta, Microsoft, Block, and More

May 9, 202510 Views
Our Picks

Maritime Groups Comment on Middle East Situation

March 5, 2026

JP Morgan Warns DFC Insurance Cap is Too Small for the Risk

March 5, 2026

Britain’s National Grid Awards $4B Supply Contracts for EGL3

March 5, 2026

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Facebook X (Twitter) Instagram Pinterest
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2026 oilmarketcap. Designed by oilmarketcap.

Type above and press Enter to search. Press Esc to cancel.