CryptoABC.net

NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

January 10, 2025

Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.

NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English-language dataset for pretraining large language models (LLMs). Derived from Common Crawl, the dataset aims to improve the accuracy and training efficiency of LLMs through new data curation techniques, and it includes 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of the pretraining dataset plays a pivotal role. While recent models such as Meta’s Llama series have been trained on datasets of up to 15 trillion tokens, the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by giving the wider community a high-quality dataset that supports training over both short and long token horizons.

Existing datasets often discard up to 90% of their source data to improve benchmark accuracy, which limits their usefulness for long training runs. Nemotron-CC, by contrast, shows how Common Crawl data can be turned into a dataset that lets trained models surpass even Llama 3.1 8B, using methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC’s efficacy is reflected in its benchmark results. When 8B-parameter models are trained on one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets such as DCLM, raising MMLU scores by 5.6 points. The complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B on multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC rested on several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader range of high-quality tokens. Rephrasing techniques reduced noise and errors and produced diverse, useful variants of the data. Disabling traditional heuristic filters for high-quality data further increased the yield of high-quality tokens without hurting accuracy.
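
To make the ensembling idea concrete, here is a minimal sketch. It is not NVIDIA's implementation: the two scorers are hypothetical toy stand-ins for model-based quality classifiers, and combining them by taking the maximum score is an assumption chosen for illustration. It only shows why a union of classifiers can retain documents that any single classifier would reject.

```python
from typing import Callable, List

# A scorer maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Hypothetical stand-in classifier: longer documents score higher (capped at 1.0)."""
    return min(len(text.split()) / 20.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Hypothetical stand-in classifier: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Assumed combination rule: keep the highest score any classifier assigns."""
    return max(scorer(text) for scorer in scorers)


def select(docs: List[str], scorers: List[Scorer], threshold: float) -> List[str]:
    """Keep documents whose ensemble score reaches the threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]


docs = [
    "a b c d",                      # short but varied
    " ".join(["word"] * 30),        # long but repetitive
    "spam spam spam spam",          # short and repetitive
]

only_length = select(docs, [length_scorer], 0.8)
only_vocab = select(docs, [vocabulary_scorer], 0.8)
both = select(docs, [length_scorer, vocabulary_scorer], 0.8)
```

In this toy run each single scorer keeps one document, while the ensemble keeps two and still rejects the third, which mirrors the article's point that ensembling selects a broader set of tokens.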

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying language filtering, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed approximately 1.9 trillion tokens to the dataset.
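
The three curation stages named above can be sketched in plain Python. This does not use the NeMo Curator API. Every function here is a hypothetical stand-in: a stopword check replaces a real language-identification model, exact hashing replaces whatever deduplication NVIDIA applied, and a word-count heuristic replaces a learned quality classifier.

```python
import hashlib
from typing import Iterable, Iterator, List

# Hypothetical stopword list used as a crude English detector.
ENGLISH_STOPWORDS = {"the", "and", "of", "is", "to"}


def looks_english(text: str) -> bool:
    """Stand-in for language ID: true if any common English stopword appears."""
    return bool(set(text.lower().split()) & ENGLISH_STOPWORDS)


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Drop documents whose case- and whitespace-normalized text was already seen."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def quality_score(text: str) -> float:
    """Stand-in for a quality classifier: word count scaled to [0, 1]."""
    return min(len(text.split()) / 10.0, 1.0)


def curate(docs: Iterable[str], min_quality: float = 0.5) -> List[str]:
    """Run the stages in order: language filter, deduplication, quality filter."""
    english = (doc for doc in docs if looks_english(doc))
    unique = dedup_exact(english)
    return [doc for doc in unique if quality_score(doc) >= min_quality]


raw = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over THE lazy dog",            # duplicate after normalization
    "Der schnelle braune Fuchs springt über den faulen Hund",  # not English
    "the end",                                                 # too short to pass quality
]
kept = curate(raw)
```

Running the cheap language filter first and the quality scoring last is a design choice of this sketch, not something the article specifies.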

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.
