CryptoABC.net

IBM Research Unveils Cost-Effective AI Inferencing with Speculative Decoding

June 24, 2024
in Blockchain
IBM Research has announced an advance in AI inferencing that combines speculative decoding with paged attention to improve the cost performance of large language models (LLMs). According to IBM Research, the development promises to make customer care chatbots more efficient and cost-effective.

In recent years, LLMs have improved the ability of chatbots to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have hindered broader AI adoption. Speculative decoding is an optimization technique that accelerates AI inferencing by generating tokens faster; it can cut latency by a factor of two to three, improving the customer experience.

Lower latency has traditionally come with a trade-off: reduced throughput, meaning fewer users can be served by the model at the same time, which raises operating costs. IBM Research reports that it has addressed this challenge by cutting the latency of its open-source Granite 20B code model in half while quadrupling its throughput.

Speculative Decoding: Efficiency in Token Generation

LLMs are built on the transformer architecture, which generates text inefficiently: each new token normally requires its own forward pass over the tokens generated so far. Speculative decoding modifies this process so that several candidate tokens are evaluated at once. If the candidates are validated, a single forward pass yields multiple tokens, which increases inferencing speed.

The candidate tokens can be proposed by a smaller, more efficient model or by part of the main model itself. By processing tokens in parallel, speculative decoding makes fuller use of each GPU and can double or triple inferencing speed. The original speculative decoding methods from DeepMind and Google researchers relied on a separate draft model, while newer methods, such as the Medusa speculator, eliminate the need for a secondary model.
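The draft-and-verify idea can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not IBM's or Medusa's implementation: the names `speculative_step`, `generate`, `toy_target` and `toy_draft` are hypothetical, both "models" are simple greedy next-token functions, and verification is written as a loop, whereas a real system would score all candidate positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """Propose k tokens with the draft model, then keep the prefix the target agrees with."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. This loop stands in
    #    for what would be a single batched forward pass in a real system.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's own next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens and count how many verification passes were needed."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[:len(prompt) + n_tokens], passes


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after a 7."""
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = generate(toy_target, toy_draft, prompt=[0], n_tokens=12)
    print(tokens, passes)
```

Because every accepted token is one the target model would have produced anyway, the output in this greedy setting matches ordinary decoding with the target alone; the saving comes from needing fewer verification passes than tokens whenever the draft guesses well.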

IBM researchers adapted the Medusa speculator by conditioning future tokens on each other rather than on the model's next predicted token. This approach, combined with an efficient fine-tuning method that uses small and large batches of text, aligns the speculator's output closely with the LLM's and significantly boosts inferencing speed.

Paged Attention: Optimizing Memory Usage

Reducing LLM latency often compromises throughput because it puts more strain on GPU memory. Dynamic batching can mitigate this, but not when speculative decoding is also competing for memory. IBM researchers addressed the problem with paged attention, an optimization technique inspired by the virtual memory and paging concepts used in operating systems.

Traditional attention algorithms store key-value (KV) sequences in contiguous memory, leading to fragmentation. Paged attention, however, divides these sequences into smaller blocks, or pages, that can be accessed as needed. This method minimizes redundant computation and allows the speculator to generate multiple candidates for each predicted word without duplicating the entire KV-cache, thus freeing up memory.
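The block-table bookkeeping behind this idea can be sketched in a few lines of Python. This is an illustrative model only, with assumptions labeled: the class `PagedKVCache` and its methods are hypothetical names, blocks are tracked by integer IDs rather than real GPU tensors, and sharing between candidates is modeled with reference counts and copy-on-write, which is one common way to avoid duplicating a cache and not necessarily how IBM's system does it.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy bookkeeping for a paged KV cache (no real tensors are stored)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))  # unused physical blocks
        self.refcount: Dict[int, int] = {}              # block id -> number of sequences using it
        self.tables: Dict[str, List[int]] = {}          # sequence id -> its physical block ids
        self.lengths: Dict[str, int] = {}               # sequence id -> tokens stored

    def _alloc(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def add_sequence(self, seq: str) -> None:
        self.tables[seq] = []
        self.lengths[seq] = 0

    def append_token(self, seq: str) -> None:
        """Reserve room for one more token's keys and values."""
        table = self.tables[seq]
        n = self.lengths[seq]
        if n % self.block_size == 0:
            # The last block is full (or there is none yet): take a new page.
            table.append(self._alloc())
        elif self.refcount[table[-1]] > 1:
            # Copy-on-write: the partly filled last block is shared, so this
            # sequence gets a private block (a real system would copy the data).
            self.refcount[table[-1]] -= 1
            table[-1] = self._alloc()
        self.lengths[seq] = n + 1

    def fork(self, parent: str, child: str) -> None:
        """Create a candidate that shares all of the parent's blocks."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refcount[block] += 1

    def free_sequence(self, seq: str) -> None:
        for block in self.tables.pop(seq):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        del self.lengths[seq]

    def blocks_in_use(self) -> int:
        return len(self.refcount)
```

In this sketch, forking a sequence into several speculative candidates costs no extra blocks up front; a candidate only claims a new block when it writes into a page it shares with another sequence, and rejected candidates return their private blocks to the free list.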

Future Implications

IBM has integrated speculative decoding and paged attention into its Granite 20B code model. The IBM speculator has been open-sourced on Hugging Face, enabling other developers to adapt these techniques for their LLMs. IBM plans to implement these optimization techniques across all models on its watsonx platform, enhancing enterprise AI applications.
