CryptoABC.net

NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

May 7, 2025
in Blockchain

Joerg Hiller
May 07, 2025 15:38

NVIDIA has integrated the pipeline behind Nemotron-CC, a 6.3-trillion-token dataset for large language models, into NeMo Curator. The pipeline is designed to balance data quality against data quantity for AI model training.





NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a new approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl, and it is intended to significantly improve the accuracy of LLMs, according to NVIDIA.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation methods, whose heuristic filtering often discards potentially useful data. By combining classifier ensembling with synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data and, according to NVIDIA, recovers up to 90% of the content lost to filtering.

Innovative Pipeline Features

The pipeline’s data curation process begins with HTML-to-text extraction using tools like jusText, followed by language identification with FastText. It then applies deduplication to remove redundant data, using NVIDIA RAPIDS libraries for efficient processing. The process also includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.
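
The deduplication and heuristic filtering stages described above can be illustrated with a minimal, CPU-only sketch in plain Python. It is not the NeMo Curator API or the Nemotron-CC implementation: the function names, the two example filters, and their thresholds are all assumptions chosen for illustration, and the real pipeline runs on GPUs with RAPIDS and applies 28 filters rather than two.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Two hypothetical heuristic filters; the thresholds are illustrative only.
def min_words(doc, threshold=5):
    """Reject documents with fewer than `threshold` whitespace-separated words."""
    return len(doc.split()) >= threshold


def max_symbol_ratio(doc, threshold=0.3):
    """Reject documents dominated by non-alphanumeric, non-space characters."""
    if not doc:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= threshold


HEURISTIC_FILTERS = [min_words, max_symbol_ratio]


def curate(docs):
    """Deduplicate first, then keep only documents that pass every filter."""
    deduped = exact_dedup(docs)
    return [d for d in deduped if all(f(d) for f in HEURISTIC_FILTERS)]
```

Running deduplication before filtering means each filter is evaluated only once per distinct document, which is one plausible reason to order the stages this way.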

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
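
As a rough illustration of quality labeling with an ensemble, the sketch below combines several scorers and maps the combined score to a discrete quality level. Everything in it is an assumption made for illustration: the two toy scorers stand in for trained classifiers, taking the maximum score is just one possible way to combine them, and the five-level bucketing is a hypothetical choice rather than a confirmed detail of NVIDIA's pipeline.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str], float]  # returns a quality score in [0, 1]


def length_scorer(doc: str) -> float:
    """Toy stand-in for a classifier: longer documents score higher, capped at 1."""
    return min(len(doc.split()) / 50.0, 1.0)


def vocabulary_scorer(doc: str) -> float:
    """Toy stand-in for a classifier: fraction of distinct words in the document."""
    words = doc.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(doc: str, scorers: List[Scorer]) -> float:
    """Combine scorers by taking the maximum (an assumed combination rule)."""
    return max(scorer(doc) for scorer in scorers)


def quality_bucket(score: float, num_buckets: int = 5) -> int:
    """Map a score in [0, 1] to an integer level from 0 to num_buckets - 1."""
    score = min(max(score, 0.0), 1.0)
    return min(int(score * num_buckets), num_buckets - 1)


def label_documents(
    docs: List[str], scorers: List[Scorer], num_buckets: int = 5
) -> Dict[int, List[str]]:
    """Group documents by quality level so each level can be handled differently."""
    buckets: Dict[int, List[str]] = {b: [] for b in range(num_buckets)}
    for doc in docs:
        level = quality_bucket(ensemble_score(doc, scorers), num_buckets)
        buckets[level].append(doc)
    return buckets
```

Grouping by level is what allows targeted treatment downstream, for example sending one level to rephrasing and another to QA-pair generation.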

Impact on LLM Training

According to NVIDIA, training LLMs with the Nemotron-CC dataset yields significant improvements. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC achieved a 5.6-point increase in its MMLU score compared to models trained on traditional datasets. Furthermore, models trained over a long token horizon on data that includes Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available for developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to optimize the pipeline for specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.
