Oil Market Cap – Global Oil & Energy News, Data & Analysis
There’s a Looming AI Data Shortage. Google Researchers Have a New Fix.

By omc_admin, September 15, 2025


Google DeepMind researchers have an idea for how to solve the AI data drought, and it might involve your Social Security number.

The large language models powering AI require vast amounts of training data pulled from webpages, books, and other sources. When it comes to text specifically, the amount of data on the web considered fair game for training AI models is being scraped faster than new data is being created.

However, a large portion of the data isn’t used because it’s deemed toxic, inaccurate, or it contains personally identifiable information.

In a newly published paper, a group of Google DeepMind researchers say they have found a way to clean up this data and make it usable for training, which they claim could be a “powerful tool” for scaling up frontier models.

They refer to the idea as Generative Data Refinement, or GDR. The method uses pretrained generative models to rewrite the unusable data, effectively purifying it so it can be safely trained on. It’s not clear if this is a technique Google is using for its Gemini models.

Minqi Jiang, one of the paper’s researchers who has since left the company for Meta, told Business Insider that a lot of AI labs are leaving usable training data on the table because it’s intermingled with bad data. For example, if there’s a document on the web that contains something considered unusable, such as someone’s phone number or an incorrect fact, labs will often discard the entire thing.

“So you essentially lose all those tokens inside of that document, even if it was a small single line that contained some personally identifying information,” said Jiang. Tokens are the units of data processed by AI models, which make up the words within text.
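The token loss Jiang describes can be illustrated with a small sketch. This is not the paper's pipeline: the phone-number pattern, the whitespace token count, and the function names are all assumptions made for illustration. It compares how many tokens survive when whole documents are discarded versus when only the offending lines are dropped.

```python
import re

# Hypothetical pattern for US-style phone numbers; real PII filters are far broader.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def count_tokens(text: str) -> int:
    """Crude whitespace token count, standing in for a real tokenizer."""
    return len(text.split())


def tokens_kept_by_discarding(docs: list[str]) -> int:
    """Document-level filtering: drop any document that contains a match."""
    return sum(count_tokens(d) for d in docs if not PHONE.search(d))


def tokens_kept_by_line_filtering(docs: list[str]) -> int:
    """Finer-grained filtering: drop only the lines that contain a match."""
    return sum(
        count_tokens(line)
        for d in docs
        for line in d.splitlines()
        if not PHONE.search(line)
    )


docs = [
    "Pumps need weekly checks.\nCall Bob at 555-123-4567.\nLog every reading.",
    "Valves are inspected monthly.",
]

print(tokens_kept_by_discarding(docs))      # 4
print(tokens_kept_by_line_filtering(docs))  # 11
```

In this toy corpus of 15 tokens, one phone number costs 11 tokens under document-level discarding but only 4 under line-level filtering.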


The authors give an example of raw data that included someone’s Social Security number or information that may soon be out of date (“the incoming CEO is…”). In these instances, GDR would swap or remove the numbers, ignore the information that risks becoming obsolete, and retain the remainder of usable data.
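To make the "swap the numbers, keep the rest" idea concrete, here is a minimal sketch. It uses a regular expression as a rule-based stand-in for the rewriting step; according to the article, GDR itself uses pretrained generative models, so this is only an analogy. The pattern, placeholder, and function name are hypothetical, and the sketch does not attempt the time-sensitive rewriting the authors also describe.

```python
import re

# Assumed format for a US Social Security number (NNN-NN-NNNN).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text: str, placeholder: str = "XXX-XX-XXXX") -> str:
    """Swap every SSN-shaped number for a placeholder, leaving other text intact."""
    return SSN.sub(placeholder, text)


raw = "Employee record: Jane Doe, SSN 123-45-6789, joined the pipeline team."
clean = redact_ssns(raw)
print(clean)
# Employee record: Jane Doe, SSN XXX-XX-XXXX, joined the pipeline team.
```

A generative rewriter would in principle handle cases a fixed pattern misses, such as numbers written out in words, which is presumably part of the appeal of the approach.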

The paper was written more than a year ago and was only published this month. A Google DeepMind spokesperson did not respond to a request for comment about whether the researchers’ work was being applied to the company’s AI models.

The authors’ findings could prove helpful for labs as the usable well of data runs dry. They cite a research paper from 2022 that predicted AI models could soak up all the human-generated text between 2026 and 2032. This prediction was based upon the amount of indexed web data, using statistics from Common Crawl, a project that continuously scrapes web pages and makes them openly available for AI labs to use.

For the GDR paper, the researchers performed a proof of concept by taking over one million lines of code and having human expert labelers annotate the data line by line. They then compared the results with the GDR method.
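A comparison against line-by-line human annotations could be scored along the following lines. The article does not say which metrics the researchers used; precision and recall are an assumption here, and the labels below are made-up illustrative values, not results from the paper.

```python
def precision_recall(human: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Score per-line flags against human labels (True means the line has a problem)."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = sum(h and p for h, p in zip(human, predicted))        # flagged by both
    fp = sum((not h) and p for h, p in zip(human, predicted))  # flagged by method only
    fn = sum(h and (not p) for h, p in zip(human, predicted))  # missed by method
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels for five lines of code; not data from the paper.
human = [True, False, True, True, False]
predicted = [True, False, False, True, True]

precision, recall = precision_recall(human, predicted)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```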

“It completely crushes the existing industry solutions being used for this kind of stuff,” said Jiang.

The authors also said their method is better than the use of synthetic data (data generated by AI models for the purpose of training themselves or other models), which has been a topic of exploration among AI labs. However, using synthetic data can degrade the quality of model output and, in some cases, lead to “model collapse.”

The authors compared the GDR data against synthetic data created by an LLM and discovered that their approach created a better dataset for training AI models.

They also said further testing could be conducted on other complicated types of data considered a no-go, such as copyrighted materials and personal data that is inferred across multiple documents rather than explicitly spelled out.

The paper has not been peer reviewed, said Jiang, adding that this is common in the tech industry and that all papers are reviewed internally.

The researchers only tested GDR on text and code. Jiang said that it could also be tested on other modalities, such as video and audio. However, new videos are generated at such a rate each day that video still provides a firehose of data for AI to train on.

“With video, you’re just going to have a lot more of it, just because there’s a constant stream of millions of hours of video generated each day,” said Jiang. “So I do think, going across new modalities beyond text, video, and images, we’re going to unlock a lot more data.”

Have something to share? Contact this reporter via email at hlangley@businessinsider.com or Signal at 628-228-1836. Use a personal email address and a non-work device; here’s our guide to sharing information securely.


