CryptoABC.net

IBM Research Unveils Cost-Effective AI Inferencing with Speculative Decoding

June 24, 2024
in Blockchain
IBM Research has announced a significant breakthrough in AI inferencing, combining speculative decoding with paged attention to enhance the cost performance of large language models (LLMs). This development promises to make customer care chatbots more efficient and cost-effective, according to IBM Research.

In recent years, LLMs have improved chatbots' ability to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have hindered broader AI adoption. Speculative decoding has emerged as an optimization technique that accelerates AI inferencing by generating tokens faster. It can cut latency by a factor of two to three, improving the customer experience.

Despite these advantages, reducing latency has traditionally come with a trade-off: lower throughput (the number of users who can use the model simultaneously), which raises operating costs. IBM Research tackled this challenge by cutting the latency of its open-source Granite 20B code model in half while quadrupling its throughput.

Speculative Decoding: Efficiency in Token Generation

LLMs use a transformer architecture, which is inefficient at generating text. Decoding is autoregressive: producing each new token requires a full forward pass conditioned on all previously generated tokens. Speculative decoding modifies this process by proposing several candidate tokens and evaluating them simultaneously. If the candidates are validated, a single forward pass yields multiple tokens, increasing inferencing speed.
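A minimal sketch of the draft-and-verify loop, assuming greedy decoding and toy "models" that are plain Python functions returning the next token for a given prefix. The names `speculative_decode` and `greedy_decode` and the parameter `k` (number of drafted tokens) are illustrative, not IBM's code. A real system verifies all drafted positions in one batched forward pass of the large model; the sketch simulates that pass with several calls and counts it as one.

```python
from typing import Callable, List, Tuple

# Toy "model": maps a token prefix to the token it would pick next (greedy).
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target forward pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k candidate tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target scores every drafted position. In a real transformer this
        #    is ONE forward pass over prefix + proposal; here it is simulated.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # All k drafts matched, so the same pass also yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_passes


if __name__ == "__main__":
    target = lambda s: (s[-1] * 3 + 1) % 17           # hypothetical "large" model
    draft = lambda s: 0 if s[-1] % 2 else target(s)    # hypothetical, sometimes wrong
    tokens, passes = speculative_decode(target, draft, [5], max_new=12)
    print(tokens == greedy_decode(target, [5], 12), passes)
```

Because every emitted token is one the target model would have chosen on its own, the output is identical to plain greedy decoding; only the number of expensive target passes changes.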

The candidate tokens can be proposed by a smaller, more efficient draft model or by part of the main model itself. Because the candidates are checked in parallel, speculative decoding makes fuller use of each GPU, potentially doubling or tripling inferencing speed. The original formulations of speculative decoding, by DeepMind and Google researchers, used a separate draft model, while newer methods, such as the Medusa speculator, eliminate the need for a secondary model.
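How much a verification pass buys depends on how often the target accepts the drafted tokens. Under a common simplifying assumption, that each of the k drafted tokens is accepted independently with probability alpha and verification stops at the first rejection, the expected number of tokens produced per target forward pass is 1 + alpha + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha). The helpers below (hypothetical names) simply evaluate that formula two ways; they ignore the cost of running the drafter, so real wall-clock gains are lower.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each of k drafted
    tokens is accepted independently with probability alpha (a simplification)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + alpha + ... + alpha**k in closed form.
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_tokens_by_sum(alpha: float, k: int) -> float:
    """Same quantity as an explicit sum: drafted token i is kept only if it and
    every earlier draft are accepted (probability alpha**i); the leading 1 is the
    token every pass emits (a correction or a bonus)."""
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

For example, with alpha = 0.8 and k = 4 the formula gives about 3.36 tokens per target pass, versus exactly one without speculation.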

IBM researchers adapted the Medusa speculator by conditioning future tokens on each other rather than on the model’s next predicted token. This approach, combined with an efficient fine-tuning method using small and large batches of text, aligns the speculator’s responses closely with the LLM, significantly boosting inferencing speeds.
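The distinction in the paragraph above can be pictured structurally. In the sketch below (all names and head functions are hypothetical toys, not IBM's or Medusa's code), a Medusa-style speculator's extra heads each guess a future token from the same base-model state, independently of one another, whereas in a chained variant each head also receives the token guessed by the head before it. This is our reading of the description, not a reproduction of IBM's architecture.

```python
from typing import Callable, List

# Hypothetical toy heads over integer "states" and tokens; real speculator heads
# are small neural networks reading the base model's hidden state.
IndependentHead = Callable[[int], int]        # sees only the base model state
ConditionedHead = Callable[[int, int], int]   # also sees the previous guess


def medusa_style(state: int, next_token: int, heads: List[IndependentHead]) -> List[int]:
    """Each head guesses its own offset from the same state, independently."""
    return [next_token] + [head(state) for head in heads]


def conditioned_chain(state: int, next_token: int, heads: List[ConditionedHead]) -> List[int]:
    """Each head's guess is fed to the next head, so guesses depend on each other."""
    proposal = [next_token]
    for head in heads:
        proposal.append(head(state, proposal[-1]))
    return proposal


if __name__ == "__main__":
    indep = [lambda s: (s + 1) % 10, lambda s: (s + 2) % 10]
    chain = [lambda s, prev: (s + prev) % 10] * 2
    for first in (3, 4):
        print(medusa_style(7, first, indep), conditioned_chain(7, first, chain))
```

Changing the first token leaves the independent guesses untouched but shifts every chained guess, which illustrates why conditioning on earlier guesses can keep a multi-token proposal self-consistent.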

Paged Attention: Optimizing Memory Usage

Reducing LLM latency often compromises throughput because it increases pressure on GPU memory. Dynamic batching can mitigate this, but not when speculative decoding is also competing for memory. IBM researchers addressed the problem by employing paged attention, an optimization technique inspired by the virtual memory and paging concepts of operating systems.
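To see why memory becomes the bottleneck, it helps to size the key-value (KV) cache. For standard attention, each layer stores one key and one value vector per token per KV head, so a dense cache takes 2 × layers × kv_heads × head_dim × tokens × batch × bytes per element. The function name and every number in the example are illustrative assumptions, not Granite 20B's actual configuration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
                   batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Size of a dense KV cache: one key and one value vector (the factor 2)
    per layer, per KV head, per token, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_elem


if __name__ == "__main__":
    # Illustrative model shape only; not Granite 20B's real configuration.
    shape = dict(layers=40, kv_heads=8, head_dim=128, tokens=4096)
    for batch in (1, 8, 32):
        mib = kv_cache_bytes(**shape, batch=batch) / 2**20
        print(f"batch={batch:>2}: {mib:,.0f} MiB")
```

With these hypothetical values (40 layers, 8 KV heads of dimension 128, a 4,096-token context, 16-bit values), a single sequence needs 640 MiB, and the total grows linearly with batch size; any per-candidate copy of the cache made during speculation would multiply it again.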

Traditional attention implementations store each sequence's key-value (KV) cache in contiguous memory, which leads to fragmentation. Paged attention instead divides the cache into smaller blocks, or pages, that are allocated and accessed as needed. This minimizes redundant computation and lets the speculator generate multiple candidates for each predicted word without duplicating the entire KV cache, freeing up memory.
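A minimal sketch of the bookkeeping behind paged attention, assuming a simple reference-counted block allocator (`BlockAllocator` and `PagedSequence` are hypothetical names; production systems such as vLLM differ in detail). Each sequence keeps a block table mapping its logical positions to physical blocks. Forking a sequence for a speculative candidate copies only the table and bumps reference counts, and a shared block is replaced by a private one only when a candidate needs to write into it (copy-on-write). The sketch tracks block IDs only, not the key/value tensors themselves.

```python
from typing import Dict, List


class BlockAllocator:
    """Hands out fixed-size physical KV blocks and counts how many sequences use each."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:  # last user gone: block returns to the pool
            del self.refcount[block]
            self.free.append(block)

    def used(self) -> int:
        return len(self.refcount)


class PagedSequence:
    """A sequence whose KV entries live in non-contiguous blocks via a block table."""

    def __init__(self, allocator: BlockAllocator, block_size: int) -> None:
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: List[int] = []  # logical block i -> physical block id
        self.length = 0

    def append_token(self) -> None:
        if self.length % self.block_size == 0:
            # Last block is full (or none exists yet): grab exactly one new block.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partly filled last block is shared with another
            # sequence, so take a private block before writing into it. A real
            # system would also copy the existing key/value entries here.
            shared = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(shared)
        self.length += 1

    def fork(self) -> "PagedSequence":
        """Start a speculative candidate that shares every existing block."""
        child = PagedSequence(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.length = self.length
        for block in self.block_table:
            self.allocator.share(block)
        return child

    def release_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.length = 0
```

For instance, with 4-token blocks, three candidates forked from an 8-token prefix and extended by two tokens each occupy 5 physical blocks in total (2 shared prefix blocks plus 1 new block per candidate), rather than 3 blocks per candidate if each held a private copy.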

Future Implications

IBM has integrated speculative decoding and paged attention into its Granite 20B code model. The IBM speculator has been open-sourced on Hugging Face, enabling other developers to adapt these techniques for their LLMs. IBM plans to implement these optimization techniques across all models on its watsonx platform, enhancing enterprise AI applications.

© 2021 cryptoabc.net - All rights reserved!

No Result
View All Result
  • Live Crypto Prices
  • Crypto News
    • Worldwide
      • Bitcoin
      • Ethereum
      • Altcoin
      • Blockchain
      • Regulation
    • Australian Crypto News
  • Education
    • Cryptocurrency For Beginners
    • Where to Buy Cryptocurrency
    • Where to Store Cryptos
    • Cryptocurrency Tax in Australia 2021

© 2021 cryptoabc.net - All rights reserved!

Welcome Back!

Login to your account below

Forgotten Password?

Create New Account!

Fill the forms below to register

All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In
Please enter CoinGecko Free Api Key to get this plugin works.