
Can you really run LLMs locally on your laptop?

Are you fascinated by large language models (LLMs) like ChatGPT? These incredible tools are already transforming how we interact with information and reshaping our approach to problem-solving in many fields.

Ever wondered if you could run an LLM right on your own laptop? Well, I’ve got some exciting news for you: you absolutely can! And not just on a laptop, but also on other consumer-grade devices you already own, such as your smartphone.

In this post, I’m excited to walk you through the whole process. Not only is it totally doable, but it’s also super enjoyable and, let’s be honest, pretty darn cool.

There have been notable advancements in open-source LLMs recently. The best part? You don’t need a supercomputer or a high-end GPU costing you tens of thousands of dollars to get started.

With modest hardware, you can run these models on your own device without the need to rely on any external services. But what is the advantage of doing so over existing proprietary models like those from OpenAI, Google, or Anthropic, which we can already access via API?

Well, with a local LLM you get:

  • Privacy: Your data never leaves your device.
  • Customization: Experiment and fine-tune models for your own needs.
  • Learning: Gain deeper insights into the technology.

I’ve found running LLMs on your laptop really can be an awesome gateway to further discovery. So without any further ado, let’s dive right in!

What is a large language model?

A large language model (LLM) is a type of artificial intelligence program designed to understand, generate, and interact using human language. It is built using a vast number of parameters, which are the fundamental elements that enable the model to learn from and process large datasets of text, improving its ability to mimic human-like language patterns.

When you go to acquire these models for personal use, it’s fascinating to realize the scale of investment behind their creation. Developing and training them often involves extraordinary sums of money, spent on vast computational resources and extensive datasets. Llama 2, for example, may have cost more than 20 million dollars to train.

This effort results in LLMs that can understand and generate language with remarkable accuracy and nuance. However, thanks to the open-source community and advancements in research, experimenting with these models has become feasible for everyone. Even without the resources of a major tech company, you can take advantage of these models for personal and professional use.

Understanding model precision

When dealing with LLMs, it’s crucial to understand the concept of model precision and its impact on performance and efficiency. Precision refers to the level of detail with which information is stored within the model, particularly its weights and activations.

During the training phase of an LLM, precision is key. The standard practice is to use 32-bit floating-point precision (float32). Here’s why:

  • High Precision: Float32 offers a balance of range and precision, crucial for capturing the subtle nuances in data during the training process.
  • Training Accuracy: Higher precision ensures that the model learns effectively, as even small errors in training can lead to significant deviations in learning outcomes.
  • Widely Supported: Most machine learning frameworks and hardware are optimized for float32 operations.

Once the model is trained, the focus shifts to inference (making predictions). Here, efficiency becomes as important as accuracy, leading to the use of lower precision formats like 16-bit.

Formats like float16 or bfloat16 are commonly used. They reduce the model’s memory footprint and speed up computations, allowing for faster responses and lower resource consumption. This shift usually provides a good balance between maintaining adequate model performance and increasing computational efficiency.

As we push the boundaries of running LLMs on conventional hardware, further quantization becomes a consideration. This involves reducing the precision of the model’s numerical representation even more, such as down to 4-bit.

Reducing the bit precision can significantly decrease the model’s memory requirements and computational load, making it feasible to run on standard consumer-grade hardware. Although lower precision like 4-bit, for example, might lead to a reduction in model accuracy, it can offer a practical compromise when running complex models on less powerful devices.
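
To put rough numbers on this, consider a model with 7 billion parameters. Counting the weights alone (the context cache and runtime overhead add more on top):

  • float32: 7B parameters × 4 bytes ≈ 28 GB
  • float16: 7B parameters × 2 bytes ≈ 14 GB
  • 4-bit: 7B parameters × 0.5 bytes ≈ 3.5 GB

This is why a 4-bit quantized 7B model fits comfortably in the memory of a typical laptop, while the same model at full precision does not.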

A key challenge is to reduce precision without substantially compromising the model’s ability to generate accurate and coherent outputs. Ultimately, it’s a balancing act that may require experimentation and optimization to find the sweet spot between efficiency and performance.

Introducing llama.cpp

llama.cpp is a tool designed to run quantized versions of large language models like Llama on local hardware, offering support for various levels of integer quantization and optimizations for different architectures. In other words, this is a program you can use to efficiently operate complex language models on standard consumer devices, such as your laptop.

There are several other viable tools available for similar purposes, but for the scope of this article we’ll zoom in on llama.cpp. I find it particularly effective and robust in its implementation, making it an ideal choice for exploration.

Let’s break down the steps of what we’ll need to do to get up and running:

  1. Download and compile llama.cpp
  2. Select and download a model
  3. Convert model format for compatibility
  4. Quantize your model
  5. Run the model

Download and compile llama.cpp

The first step is to download and compile llama.cpp. Clone the repository:

git clone https://github.com/ggerganov/llama.cpp.git

If you’re on a MacBook, llama.cpp is configured to automatically build with Metal support. This optimizes the tool for Apple silicon, leveraging ARM NEON, the Accelerate framework, and Metal for maximum efficiency.

cd llama.cpp
make -j
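
If the build succeeds, the binaries used in the remaining steps should appear in the repository root. As a quick sanity check (binary names as produced by the Makefile build at the time of writing):

ls ./main ./quantize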

If you’re not on a MacBook, refer to the README in the repository for instructions tailored to your hardware. llama.cpp supports various configurations, including AVX, AVX2, and AVX512 for x86 architectures, and offers CUDA, Metal, and OpenCL GPU backends.
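
For example, on a Linux machine with an NVIDIA GPU, the README (at the time of writing) describes enabling the cuBLAS backend with a build flag roughly like this; double-check the README for the exact flags that match your version and setup:

make clean
make -j LLAMA_CUBLAS=1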

Select and download a model

Next, you need to choose a language model that is compatible with llama.cpp. This typically involves visiting repositories like Hugging Face or GitHub, where you can find a range of models suitable for different purposes.

If your system has 32GB or more of RAM, I highly recommend Mixtral 8x7B. In my opinion, it’s the most capable open model currently available, offering multilingual capabilities and a large 32k token context window. Its high performance and fast inference make it suitable for a wide array of applications.

If you have less than 32GB of RAM, and especially if you’re limited to 8GB, you’ll want to consider lighter models. Options like Mistral 7B or Llama 2 in either the 7B or 13B configurations are more suitable for that scenario.

For the purpose of demonstration, let’s go with the Llama 2 7B Chat model. This choice offers a decent balance between performance and resource requirements for most standard applications.

mkdir -p ./models/7B
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf ./models/7B

Note that if you opt for a Llama 2 model, you’ll more than likely need to add your SSH keys to your Hugging Face account. In addition, you’ll need to request access to the weights and tokenizer, and the email you enter in the request form must match the email on your Hugging Face account.
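
Also keep in mind that Hugging Face stores the large weight files with Git LFS, so make sure it’s installed and initialized before cloning. The install command below assumes Homebrew on macOS; use your platform’s package manager otherwise:

brew install git-lfs
git lfs install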

Convert model format for compatibility

Once you have selected and downloaded your model, the next step is to ensure it is in a format compatible with llama.cpp. This typically involves converting the model file using a provided Python script. However, before proceeding with the conversion, you’ll need to set up a virtual environment and install the required dependencies.

A virtual environment in Python is a self-contained directory that contains a Python installation for a particular version of Python, plus a number of additional packages. Using a virtual environment allows you to manage dependencies for different projects separately. Here’s how you set it up:

python -m venv .venv

Once the environment is created, activate it:

source .venv/bin/activate

With your virtual environment activated, install the necessary packages with:

pip install -r requirements.txt

This is where the convert.py script comes in handy. Here’s a simple command to perform this conversion:

python convert.py models/7B
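
When the script finishes, a converted GGUF file should appear alongside the original weights. The exact filename can vary with the convert.py version, but it’s the file we’ll quantize in the next step:

ls -lh ./models/7B/ggml-model-f16.gguf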

Quantize your model

As described above, quantizing involves reducing the precision of the model’s numerical representation to decrease its memory requirements and computational load. To see the list of available quantization types, you can use the following command:

./quantize --help

This command will display various quantization options, each with different characteristics in terms of model size and performance metrics. Based on this information, you can make an informed decision about which quantization type best suits your needs.

Once you’ve decided on a quantization type, you can proceed with the quantization process. For example, if you choose Q5_K_M for its balance between size and performance, you can quantize your model as follows:

./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q5_k_m.gguf q5_k_m

This command takes the model in its higher-precision format (e.g., float16) and converts it to a quantized version, which is more suitable for running on less powerful hardware.
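
To get a feel for the savings, compare the two files on disk; the quantized version should come in at roughly a third of the size of the f16 original:

ls -lh ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q5_k_m.gguf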

Run the model

Finally, it’s time to run the model! Here is a basic command to start the language model on your device using llama.cpp:

./main -ngl 32 -m ./models/7B/ggml-model-q5_k_m.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins

This command launches the model in interactive instruct mode (-i -ins), loading the quantized file with -m, offloading up to 32 layers to the GPU with -ngl, setting a 4096-token context window with -c, and shaping the output with --temp and --repeat_penalty (-n -1 lets generation continue until you stop it). The flags and values can be adjusted based on your specific requirements and the capabilities of your hardware.
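
If that turns out to be too demanding for your machine, a lighter invocation using the same flags (a smaller context window and fewer layers offloaded to the GPU) might look like this:

./main -ngl 16 -m ./models/7B/ggml-model-q5_k_m.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins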

Going forward

The open-source community has played a pivotal role in making these powerful tools more accessible. With the advancements in LLMs and the increasing availability of user-friendly platforms, the door is wide open for us to learn, experiment, and innovate.

Whether you’re a professional aiming to enhance your workflow, a student exploring AI, or just someone intrigued by the potential of language models, the opportunities seem boundless. Right now, it feels like we are standing on the brink of a new era.

For me, the journey with LLMs is as much about exploration and learning as it is about achieving specific outcomes. Don’t hesitate to experiment with different models, play around with them, and push boundaries. Remember, these models are not just about algorithms and data. They represent a new way of interaction between humans and machines.

In my upcoming articles, I’ll delve deeper into more nuanced applications. Stay tuned for more insights, and feel free to reach out to me on Twitter or LinkedIn with your thoughts and experiences!