Oil Market Cap – Global Oil & Energy News, Data & Analysis

There’s a Looming AI Data Shortage. Google Researchers Have a New Fix.

By omc_admin, September 15, 2025

Google DeepMind researchers have an idea for how to solve the AI data drought, and it might involve your Social Security number.

The large language models powering AI require vast amounts of training data pulled from webpages, books, and other sources. When it comes to text specifically, the amount of data on the web considered fair game for training AI models is being scraped faster than new data is being created.

However, a large portion of the data isn’t used because it’s deemed toxic or inaccurate, or because it contains personally identifiable information.

In a newly published paper, a group of Google DeepMind researchers say they have found a way to clean up this data and make it usable for training, which they claim could be a “powerful tool” for scaling up frontier models.

They refer to the idea as Generative Data Refinement, or GDR. The method uses pretrained generative models to rewrite the unusable data, effectively purifying it so it can be safely trained on. It’s not clear if this is a technique Google is using for its Gemini models.

Minqi Jiang, one of the paper’s researchers, who has since left the company for Meta, told Business Insider that a lot of AI labs are leaving usable training data on the table because it’s intermingled with bad data. For example, if there’s a document on the web that contains something considered unusable, such as someone’s phone number or an incorrect fact, labs will often discard the entire thing.

“So you essentially lose all those tokens inside of that document, even if it was a small single line that contained some personally identifying information,” said Jiang. Tokens are the units of text that AI models process, typically whole words or fragments of words.
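The cost Jiang describes can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not any lab's actual pipeline: the two-document corpus is made up, the phone-number regex stands in for a real PII detector, and whitespace splitting stands in for a real tokenizer.

```python
import re

# Hypothetical PII detector: a simple US-style phone number pattern.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def count_tokens(text: str) -> int:
    # Whitespace splitting is a crude stand-in for a real tokenizer.
    return len(text.split())


def filter_documents(docs):
    """Document-level filtering: drop any document that contains a match."""
    return [d for d in docs if not PHONE.search(d)]


# Made-up corpus: only one line of the first document contains PII.
docs = [
    "The pump runs at night.\nCall 555-010-0199 for service.\nIt restarts at dawn.",
    "Valves are inspected weekly.",
]

kept = filter_documents(docs)
total = sum(count_tokens(d) for d in docs)
retained = sum(count_tokens(d) for d in kept)
lost = total - retained

print(f"kept {len(kept)} of {len(docs)} documents; lost {lost} of {total} tokens")
```

In this toy corpus the whole first document is discarded, including the two lines that contain no phone number at all.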

The authors give an example of raw data that included someone’s Social Security number or information that may soon be out of date (“the incoming CEO is…”). In these instances, GDR would swap or remove the numbers, ignore the information that risks becoming obsolete, and retain the remainder of the usable data.
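The refine-rather-than-discard idea can be sketched in a few lines. This is only an illustration of the principle, not the authors' method: the article says GDR uses a pretrained generative model to do the rewriting, whereas the hypothetical `refine` function below swaps in a plain regular expression, which can catch a fixed number pattern but could not handle cases like soon-to-be-outdated facts. The sample text and placeholder are made up.

```python
import re

# Stand-in for the rewriting step. GDR, as described, would use a
# generative model here; a regex is used only to keep the sketch runnable.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def refine(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace SSN-shaped numbers and keep the rest of the text intact."""
    return SSN.sub(placeholder, text)


# Made-up record; 000-12-3456 is not a valid Social Security number.
raw = "Applicant: J. Doe. SSN: 000-12-3456. Prefers email contact."
clean = refine(raw)

print(clean)
```

The document stays in the dataset with only the sensitive span replaced, rather than being dropped whole.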

The paper was written more than a year ago and was only published this month. A Google DeepMind spokesperson did not respond to a request for comment about whether the researchers’ work was being applied to the company’s AI models.

The authors’ findings could prove helpful for labs as the usable well of data runs dry. They cite a research paper from 2022 that predicted AI models could soak up all the human-generated text between 2026 and 2032. This prediction was based upon the amount of indexed web data, using statistics from Common Crawl, a project that continuously scrapes web pages and makes them openly available for AI labs to use.

For the GDR paper, the researchers performed a proof of concept by taking over one million lines of code and having human expert labelers annotate the data line by line. They then compared those annotations with the output of the GDR method.

“It completely crushes the existing industry solutions being used for this kind of stuff,” said Jiang.

The authors also said their method is better than using synthetic data (data generated by AI models for the purpose of training themselves or other models), which AI labs have been exploring. Synthetic data, however, can degrade the quality of model output and, in some cases, lead to “model collapse.”

The authors compared the GDR data against synthetic data created by an LLM and discovered that their approach created a better dataset for training AI models.

They also said further testing could be conducted on other complicated types of data considered a no-go, such as copyrighted materials and personal data that is inferred across multiple documents rather than explicitly spelled out.

The paper has not been peer reviewed, said Jiang, adding that this is common in the tech industry and that all papers are reviewed internally.

The researchers only tested GDR on text and code. Jiang said that it could also be tested on other modalities, such as video and audio. However, given the rate at which new videos are generated each day, video still provides a firehose of data for AI to train on.

“With video, you’re just going to have a lot more of it, just because there’s a constant stream of millions of hours of video generated each day,” said Jiang. “So I do think, going across new modalities beyond text, video, and images, we’re going to unlock a lot more data.”

Have something to share? Contact this reporter via email at hlangley@businessinsider.com or Signal at 628-228-1836. Use a personal email address and a non-work device; here’s our guide to sharing information securely.


