As discussed in our previous post, large language models (LLMs) are the building blocks for many GenAI applications. Whether you’re looking to improve efficiency or need to fine-tune an LLM for a specific use case, it’s important to understand the inner workings of this technology. Possessing foundational knowledge of LLMs can help you drive innovation and make smart decisions when it comes to tailoring AI solutions. You'll be equipped to ask the right questions and consider vital factors as you choose how to implement LLMs into your operations.
An attention mechanism is the technique an LLM uses to decide which parts of a piece of text are more or less relevant or important. As humans, when we read a piece of text or hear a statement spoken to us, we naturally assign different levels of importance to its different parts. We don’t treat every word in a sentence equally.
In the same way, the attention mechanism of an LLM assigns different levels of importance to different words based on the context.
As an example, consider the following sentence: The cat sat on the mat because it was warm. The attention mechanism helps the model understand that "it" refers to "the mat" and not "the cat" by considering the context provided by the surrounding words. With this ability to focus on relevant parts of the text, an LLM can generate more accurate and contextually appropriate responses.
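Under the hood, this weighting is typically computed as scaled dot-product attention: each word is compared against every other word, and the resulting scores become importance weights. Below is a minimal NumPy sketch of that computation; the vectors are random stand-ins for real token embeddings, so treat it as an illustration of the idea rather than any particular model’s implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how relevant each word is to every other word
    weights = softmax(scores, axis=-1)   # importance weights that sum to 1 for each word
    return weights @ V, weights          # each word becomes a weighted mix of the others

# 10 tokens (e.g. "The cat sat on the mat because it was warm"), 16-dim embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(10, 16))
output, weights = attention(Q, K, V)
print(weights[7].round(2))  # how much the token "it" (index 7) attends to each token
```

In a trained model, the weights in that row would concentrate on the tokens that resolve what “it” refers to; here they are spread randomly because the embeddings are random.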
Grouped-query attention (GQA) is closely related to the concept of attention mechanism. Think of it like going through a stack of questions that you need to answer, with many of them being similar. Instead of handling each question one by one, you group the similar ones together and answer them all at once. This saves time and makes your process more efficient.
In an LLM, GQA applies the same idea inside the attention mechanism itself. Instead of giving every query head its own set of key and value heads, the model groups several query heads together so that they share a single key-value head. This makes the model’s response time faster and takes less memory than giving each query head its own keys and values, while preserving most of the quality of standard multi-head attention.
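For the technically curious, here is a minimal NumPy sketch of that grouping, assuming 8 query heads that share 2 key-value heads (so 4 query heads per group); the sizes and random values are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, head_dim = 10, 16
n_q_heads, n_kv_heads = 8, 2           # standard multi-head attention would use 8 KV heads
group_size = n_q_heads // n_kv_heads   # here, 4 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, seq_len, head_dim))
K = rng.normal(size=(n_kv_heads, seq_len, head_dim))  # 4x less key-value memory
V = rng.normal(size=(n_kv_heads, seq_len, head_dim))

outputs = []
for h in range(n_q_heads):
    kv = h // group_size                         # pick the shared KV head for this group
    scores = Q[h] @ K[kv].T / np.sqrt(head_dim)
    outputs.append(softmax(scores) @ V[kv])

print(np.stack(outputs).shape)  # (8, 10, 16): all query heads, but a smaller KV cache
```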
Sliding-window attention (SWA) is another variation of attention mechanism. It’s like reading a long book but only focusing on a few pages at a time. Imagine you have a very long document to read. Instead of trying to understand the whole thing at once, you break it down into smaller sections, or “windows”. You read one section, understand it, and then “slide” over to the next, slightly overlapping section, all the while maintaining context.
SWA breaks long texts into smaller, manageable segments. The model processes each segment separately while ensuring that each segment overlaps with the next. This overlap helps the model maintain the overall context and understand the document better.
SWA is a particularly useful technique when an LLM has tasks such as summarizing a long document, where the model needs to keep track of the information spread across many pages.
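One common way to implement this is with an attention mask that restricts each token to a fixed window of preceding tokens, so the window slides forward as the text does. The sketch below uses an assumed sequence length of 8 and a window of 3; it illustrates the masking pattern rather than any specific model’s implementation.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]   # query (current token) positions
    j = np.arange(seq_len)[None, :]   # key (attended-to token) positions
    # Allowed only if the key is not in the future and falls within the window.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))
# Each row shows which earlier positions that token may attend to; for example,
# the token at position 5 sees only positions 3, 4, and 5. Positions outside the
# window are masked out, which keeps memory and compute bounded for long texts.
```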
Related to all of these concepts is a term that you will probably see most often: context window.
The context window is often listed in the technical specifications for an LLM. It defines the maximum length of text (measured in tokens) that the model can consider at once. This is basically how much text the model can process in a single input.
For example, an LLM with a context window of 32,000 tokens can handle and generate text that includes up to ~32,000 tokens of context at a time. A larger context window is crucial for tasks where the model needs to understand or generate long pieces of text.
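In practice, this means checking whether a prompt (plus the room you want to leave for the model’s answer) fits before sending it. The sketch below assumes a hypothetical 32,000-token limit and uses the tiktoken library as a stand-in tokenizer; your model’s actual limit and tokenizer will vary.

```python
import tiktoken

CONTEXT_WINDOW = 32_000  # model-specific; check your model's spec
enc = tiktoken.get_encoding("cl100k_base")  # a common tokenizer encoding

def fits_in_context(prompt: str, max_output_tokens: int = 1_000) -> bool:
    prompt_tokens = len(enc.encode(prompt))
    # The prompt and the generated output share the same context window.
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("Summarize the attached quarterly report."))  # True for a short prompt
```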
Here’s how context window numbers break down for some popular models:
Many organizations take a pre-built LLM and use it as is for their GenAI applications. This is the simplest and least resource-intensive route. With good prompt engineering, and possibly more advanced techniques like retrieval-augmented generation (RAG), the LLM is sufficient for their needs without any modifications. However, you may have a business use case where an LLM needs further optimization and customization to be more effective.
Fine-tuning adjusts the weights of a pre-trained model by using additional, task-specific training data. Weights are a type of parameter in a model, and they determine the strength of connections between units of “knowledge” in a model. By refining the weights, you can retain the general language understanding that the model gained during its initial training, but you enhance its ability to perform specialized tasks.
For example, you can take a general LLM and fine-tune it with medical texts. This will improve its accuracy and relevance in healthcare applications.
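To make that concrete, here is a heavily simplified sketch of a fine-tuning loop, assuming the Hugging Face transformers and PyTorch libraries, a small open model (distilgpt2) as a stand-in base model, and two in-memory sentences in place of a real medical corpus and training pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # stand-in for whichever base model you fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny illustrative "medical" examples; a real fine-tune needs far more data.
texts = [
    "Patient presents with elevated blood pressure and mild tachycardia.",
    "The recommended dosage is 500 mg twice daily with food.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LMs, passing labels=input_ids computes the next-token loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()        # gradients nudge the pre-trained weights
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("fine-tuned-medical-llm")  # hypothetical output directory
```

A real project would also involve curating a much larger dataset and holding out evaluation data; the loop above only shows the core idea of adjusting pre-trained weights with task-specific examples.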
Fine-tuning an LLM brings several benefits, especially if your business use case is specialized enough that a general LLM won’t cut it.
Fine-tuning can be powerful for enhancing the capabilities of an LLM to help you meet your specific needs. However, it can be a resource-intensive process. Even though you may not be training an LLM from scratch, fine-tuning one still requires significant computational power and expertise. If you’re pursuing GenAI application development and thinking about the potential benefits of fine-tuning, you’ll need to carefully weigh the resources and expertise needed against the performance benefits.
When you start to dig into the lower-level technical processes of an LLM, you’ll also encounter a term called inference. Inference is the phase in which an LLM makes predictions or generates responses based on new input data.
Inference involves using the model to analyze new input data and produce a relevant output. Recall the concept of generalization from our earlier discussion of training data: an LLM is trained on a vast amount of data precisely so it can apply that knowledge to data it has never seen. Inference is that application and response-generation process, and it happens every time you ask an LLM a question. This phase leverages the model’s learned knowledge to understand and respond to new queries.
Practically speaking, let’s consider a chatbot that is used in a customer service context. When a customer submits an inquiry, the model processes the customer’s question and then uses inference to generate a helpful answer in real time.
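As an illustration, here is a minimal sketch of that inference step, again assuming the transformers library with distilgpt2 as a stand-in for a production chatbot model; the prompt and settings are hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # inference mode: no weight updates, just predictions

prompt = "Customer: My order arrived damaged. What should I do?\nAgent:"
inputs = tokenizer(prompt, return_tensors="pt")

# Inference: apply the model's learned weights to new input to predict a response.
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

How quickly a call like this returns is exactly the latency concern discussed next.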
Efficient inference is important for real-time applications. In most use cases, you’ll need a model that can respond quickly (and, of course, accurately) to user queries. If we take the customer service example from above, you can imagine how needing to wait 20 seconds for a response can cripple the user experience. LLMs need fast and efficient performance to be useful.
When an enterprise builds a GenAI application and it’s time to deploy the LLM, there are special considerations to bear in mind for efficient inference.
Unless you’re doing the low-level design or creation of an LLM, ensuring optimized inference mostly comes down to investing in the right hardware and software resources. This is essential for applications that demand real-time or near-real-time responses.
Understanding attention mechanisms, the importance of context windows, and the benefits of fine-tuning can help you balance performance, cost, and resource availability when building GenAI applications.
Curious about some of the challenges of using LLMs out-of-the-box? Find out more about how LLMs can be successfully adopted in security operations.