CryptoABC.net

IBM Research Unveils Cost-Effective AI Inferencing with Speculative Decoding

June 24, 2024
in Blockchain





IBM Research has announced an advance in AI inferencing that combines speculative decoding with paged attention to improve the cost performance of large language models (LLMs). According to IBM Research, the development promises to make customer care chatbots more efficient and cost-effective.

In recent years, LLMs have improved the ability of chatbots to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have hindered broader AI adoption. Speculative decoding is an optimization technique that accelerates AI inferencing by generating tokens faster, which can cut latency by a factor of two to three and thereby improve the customer experience.

Reducing latency traditionally comes with a trade-off, however: lower throughput, meaning fewer users can use the model at the same time, which increases operational costs. IBM Research has tackled this challenge by cutting the latency of its open-source Granite 20B code model in half while quadrupling its throughput.

Speculative Decoding: Efficiency in Token Generation

LLMs use a transformer architecture, which is inefficient at generating text. Typically, each new token requires its own forward pass over all previously generated tokens. Speculative decoding modifies this process so that several prospective tokens are evaluated simultaneously. If these tokens are validated, one forward pass can yield multiple tokens, increasing inferencing speed.

The prospective tokens can be proposed by a smaller, more efficient model or by part of the main model itself. By processing tokens in parallel, speculative decoding maximizes the efficiency of each GPU, potentially doubling or tripling inferencing speed. The first versions of speculative decoding, introduced by DeepMind and Google researchers, used a separate draft model, while newer methods, such as the Medusa speculator, eliminate the need for a secondary model.
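The draft-then-verify loop described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not IBM's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the large model and the speculator, decoding is greedy, and the "single forward pass" that verifies all candidates is simulated by a list comprehension.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical large model: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Hypothetical speculator: cheap, and usually agrees with the target."""
    guess = target_next(context)
    # Deliberately wrong whenever the context length is a multiple of 4.
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(next_fn: NextTokenFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(prompt: List[Token], n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1) The speculator proposes k candidate tokens, one after another.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) The target scores all k + 1 positions. A real system does this
        #    in one batched forward pass; here it is simulated position by position.
        target_passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest prefix of the draft that the target agrees with,
        #    then append the target's own token at the first disagreement.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        out.append(verified[n_ok])
    return out[:goal], target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, n_new=20)
print(tokens == greedy_decode(target_next, prompt, 20), passes)
```

Because every accepted token is checked against the target model, the output matches plain greedy decoding exactly; the saving comes from needing fewer target passes than generated tokens whenever the speculator guesses well.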

IBM researchers adapted the Medusa speculator by conditioning future tokens on each other rather than on the model’s next predicted token. This approach, combined with an efficient fine-tuning method using small and large batches of text, aligns the speculator’s responses closely with the LLM, significantly boosting inferencing speeds.

Paged Attention: Optimizing Memory Usage

Reducing LLM latency often compromises throughput because it puts more strain on GPU memory. Dynamic batching can mitigate this, but not when speculative decoding is also competing for memory. IBM researchers addressed this by employing paged attention, an optimization technique inspired by virtual memory and paging concepts from operating systems.

Traditional attention algorithms store key-value (KV) sequences in contiguous memory, leading to fragmentation. Paged attention, however, divides these sequences into smaller blocks, or pages, that can be accessed as needed. This method minimizes redundant computation and allows the speculator to generate multiple candidates for each predicted word without duplicating the entire KV-cache, thus freeing up memory.
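The paging idea can be sketched as a small block-table allocator. This is an illustrative toy, not IBM's code or any library's API: the class `PagedKVCache`, the page size `BLOCK_SIZE`, and the string placeholders standing in for key-value tensors are all assumptions. It shows how several speculative candidates can share the pages of a common prefix and copy only the page they write to.

```python
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per page (assumed value for illustration)


class PagedKVCache:
    """Toy KV-cache that stores each sequence as a table of fixed-size pages."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: List[List[str]] = [[] for _ in range(num_blocks)]
        self.ref_count: List[int] = [0] * num_blocks
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> page ids

    def _alloc(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of KV-cache pages")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def _release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.blocks[block].clear()
            self.free_blocks.append(block)

    def append(self, seq: str, kv: str) -> None:
        """Append one token's KV entry, allocating or copying a page if needed."""
        table = self.block_tables.setdefault(seq, [])
        if not table or len(self.blocks[table[-1]]) == BLOCK_SIZE:
            table.append(self._alloc())  # last page is full: start a new one
        elif self.ref_count[table[-1]] > 1:
            # Copy-on-write: the last page is shared, so copy only that page.
            fresh = self._alloc()
            self.blocks[fresh] = list(self.blocks[table[-1]])
            self._release(table[-1])
            table[-1] = fresh
        self.blocks[table[-1]].append(kv)

    def fork(self, parent: str, child: str) -> None:
        """Create a candidate that shares all of the parent's pages."""
        self.block_tables[child] = list(self.block_tables[parent])
        for block in self.block_tables[child]:
            self.ref_count[block] += 1

    def free(self, seq: str) -> None:
        for block in self.block_tables.pop(seq):
            self._release(block)

    def read(self, seq: str) -> List[str]:
        return [kv for b in self.block_tables[seq] for kv in self.blocks[b]]

    def used_blocks(self) -> int:
        return len(self.blocks) - len(self.free_blocks)


cache = PagedKVCache(num_blocks=16)
for i in range(10):
    cache.append("base", f"kv{i}")      # a 10-token prefix occupies 3 pages
for name in ("a", "b", "c"):
    cache.fork("base", name)            # candidates share the prefix pages
    cache.append(name, f"kv_{name}")    # each copies only its last page
print(cache.used_blocks())
```

In this toy, three candidates plus the shared prefix occupy six pages, whereas giving each of the four sequences its own full copy of the prefix would take twelve.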

Future Implications

IBM has integrated speculative decoding and paged attention into its Granite 20B code model. The IBM speculator has been open-sourced on Hugging Face, enabling other developers to adapt these techniques for their LLMs. IBM plans to implement these optimization techniques across all models on its watsonx platform, enhancing enterprise AI applications.


CryptoABC.net

This is an Australian online news/education portal that aims to provide the latest crypto news, real-time updates, education and reviews within Australia and around the world. Feel free to get in touch with us!


© 2021 cryptoabc.net - All rights reserved!
