Josef Mayrhofer

Performance Optimization in the LLM World

Attention is all you need. The title of a well-known Google paper also describes one of the biggest hypes in tech history: ChatGPT and its overnight success. When Large Language Models shot to fame, ChatGPT needed just five days after its launch to reach more than 1 million users.


In May 2024, I attended a workshop on performance optimization in the LLM world at the International Conference on Performance Engineering (ICPE) in London. This blog post summarizes my conference notes and shares insights into how performance engineering practices can be applied to optimize LLMs.


Adoption of LLMs


The overnight success of LLMs and ChatGPT comes at a high price. Operators like Microsoft reportedly spend 700,000 USD per day running these platforms, and businesses using GPT-4 can spend up to 21,000 USD per month on these services. It is still fascinating how human-like text appears after we ask ChatGPT a simple question. Looking behind the scenes, however, we realize there is little intelligence involved, only text pattern matching and recognition built into LLMs. I don't like their artificial text and am not so impressed by the ChatGPT hype. Their CO2 footprint also feels like a waste of resources: training GPT-3, the model behind the original ChatGPT, is estimated to have produced 552 tons of carbon dioxide.
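To put these figures into perspective, a back-of-the-envelope cost model helps. The Python sketch below estimates a monthly GPT-4-class API bill from token volumes; all prices and volumes are illustrative assumptions, not official rates.

```python
# Back-of-the-envelope monthly cost estimate for an LLM API.
# All prices and volumes are illustrative assumptions, not official rates.

PRICE_PER_1K_INPUT_TOKENS = 0.03   # USD, assumed GPT-4-class input rate
PRICE_PER_1K_OUTPUT_TOKENS = 0.06  # USD, assumed GPT-4-class output rate

def monthly_cost(requests_per_day: int,
                 input_tokens: int,
                 output_tokens: int,
                 days: int = 30) -> float:
    """Estimate the monthly API bill in USD."""
    per_request = (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
                   + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)
    return per_request * requests_per_day * days

# Example: 20,000 requests/day, ~500 input and ~300 output tokens each
print(f"~{monthly_cost(20_000, 500, 300):,.0f} USD per month")
```

With these assumed volumes, the bill lands near 20,000 USD per month, the same ballpark as the figure quoted above.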


The Good Side of LLMs


Looking at the global use of LLMs and their implementation, it is impressive how fast customers and organizations have adopted these new capabilities. We call it prompt engineering when we craft intelligent questions for ChatGPT. Instead of using search engines, where we receive thousands of links to pages that may contain what we are looking for, ChatGPT and similar LLMs answer our questions directly. Prompt engineering is more convenient and faster than the good old search engine. Behind the scenes, LLMs are trained on petabytes of text to learn about our complex world; a prompt is quickly mapped to relevant content, and the model responds by generating so-called tokens, small chunks of text.
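To make the notion of tokens concrete, you can inspect how a prompt is split before the model ever sees it. A minimal sketch, assuming the open-source tiktoken package (the tokenizer family used by GPT-style models):

```python
# Inspect how a prompt is split into tokens before the model sees it.
# Assumes the tiktoken package: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models

prompt = "Performance engineering makes LLMs faster and cheaper."
token_ids = enc.encode(prompt)

print(len(token_ids), "tokens:", token_ids)
# Decode each id individually to see the text fragment behind each token
print([enc.decode([t]) for t in token_ids])
```

Notice that tokens rarely line up with whole words; common words map to one token, while rarer ones are split into several fragments.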


The Bad Side of LLMs


  • Trustworthiness - Is the output correct, and what was its source?

  • Bias - Is the output neutral?

  • Transparency - Do we have insights into the LLMs used?

  • Explainability - Can we understand how the output was created?

  • Data security and privacy - How are our prompts handled?

  • Excessive training and inference costs - Do we get a good ROI?


The Ugly Side of LLMs


  • Hallucination - Sometimes we get made-up answers

  • Self-reflection - Continuous improvement is not built in

  • Affordability - Only the largest enterprises in the world have the financial power to invest billions in training and creating new LLMs


LLM Performance Engineering


Performance tuning becomes prominent when business services such as ChatGPT have massive hardware requirements and users expect responses in near real time. As a performance engineer specializing in LLM tuning, you focus on the following targets (a measurement sketch follows the list):

  • Initial response (time to first token) within 100 to 150 ms

  • Throughput of 250 to 1,000 words per minute

  • Single-request latency

  • Concurrent-request latency

  • Reduce latency

  • Reduce memory

  • Reduce I/O

  • Horizontal scalability

  • Vertical scalability
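How would you verify targets like these? The sketch below measures time to first token and token throughput for a streaming response. The stream_tokens() generator is a hypothetical stand-in; replace it with your provider's actual streaming client.

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client.
    Replace with your provider's real streaming API."""
    for word in "Performance engineering keeps LLM latency low".split():
        time.sleep(0.05)  # simulated per-token generation delay
        yield word

def measure(prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    print(f"Time to first token: {(first_token_at - start) * 1000:.0f} ms "
          f"(target: 100-150 ms)")
    print(f"Throughput: {n_tokens / total:.1f} tokens/s")

measure("Why does latency matter for LLMs?")
```

The same harness works for concurrent-request latency: run measure() from several threads or processes and compare the per-request numbers against the single-request baseline.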


When transaction volumes are high, every millisecond of overhead adds up to significant extra cost. Google, for instance, integrated LLMs into its search engine, which reportedly had a 40 million USD impact on its earnings.
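A quick calculation shows why single milliseconds matter at this scale; the request volume and GPU price below are illustrative assumptions.

```python
# Why every millisecond counts at scale (all figures are assumptions).
requests_per_day = 1_000_000_000   # assumed daily request volume
overhead_ms = 10                   # extra processing time per request
gpu_cost_per_hour = 4.0            # assumed price of one GPU-hour in USD

wasted_gpu_hours = requests_per_day * overhead_ms / 1000 / 3600
daily_cost = wasted_gpu_hours * gpu_cost_per_hour
print(f"{wasted_gpu_hours:,.0f} wasted GPU-hours/day ≈ "
      f"{daily_cost:,.0f} USD/day, {daily_cost * 365:,.0f} USD/year")
```

Under these assumptions, ten milliseconds of avoidable overhead burns roughly 2,800 GPU-hours and 11,000 USD per day, which is over 4 million USD per year.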


In my next blog post, I will share how you can optimize LLMs' infrastructure costs and response times.


Happy Performance Engineering! Keep up the great work!


