Are you fascinated by large language models (LLMs) like ChatGPT? These incredible tools are already transforming how we interact with information and reshaping our approach to problem-solving in various fields.
Ever wondered if you could run an LLM right on your own laptop? Well, I’ve got some exciting news for you – you absolutely can! And not just on your laptop, but on other consumer-grade devices you already own, such as your smartphone.
In this post, I’m excited to walk you through the whole process. Not only is it totally doable, but it’s also super enjoyable and, let’s be honest, pretty darn cool.
There have been notable advancements in open-source LLMs recently. The best part? You don’t need a supercomputer or a high-end GPU costing you tens of thousands of dollars to get started.
With modest hardware, you can run these models on your own device without relying on any external services. But what’s the advantage of doing so over proprietary models, like those from OpenAI, that we can already access via API?
With a local LLM you get:
- Privacy: Your data never leaves your device.
- Customization: Experiment and fine-tune models for your own needs.
- Learning: Gain deeper insights into AI technology.
I’ve found running LLMs on your laptop really can be an awesome gateway to further discovery. So without any further ado, let’s dive right in!
What is a large language model?
A large language model (LLM) is a type of artificial intelligence program designed to understand, generate, and interact using human language. It is built from a vast number of parameters, the fundamental elements that enable the model to learn from and process large datasets of text, improving its ability to mimic human-like language patterns.
When you go to acquire these models for personal use, it’s fascinating to realize the scale of investment behind their creation. Developing and training them often involves extraordinary sums of money to cover vast computational resources and extensive datasets. Llama 2, for example, may have cost more than 20 million dollars to train.
This effort results in the sophisticated abilities of LLMs to understand and generate language with remarkable accuracy and nuance. However, thanks to the open-source community and advancements in technology, accessing and experimenting with these models has become more feasible for individuals and smaller organizations. Despite not having the same level of resources as major tech companies, you can still experience the power of these models firsthand.
Understanding different model types
There are at least three types of models you might encounter: base models, instruct models, and chat models. Each of these serves different purposes and is suited for different applications.
A base model is the foundational layer of a language model: a blank slate, equipped with general language understanding but not specialized for specific tasks or interaction styles. These models are ideal for those who want to engage in fine-tuning.
Fine-tuning involves further training the base model on a specialized dataset. This process customizes the model for specific tasks or industries, like legal analysis, creative writing, or technical support. For example, a researcher in medical science could fine-tune a base model on medical literature to develop a tool adept in medical language.
Instruct models are base models that have been further trained to follow explicit instructions. This makes them particularly useful for detailed Q&A, specific content generation tasks, or scenarios where precise instruction following is crucial. They are the go-to choice for enhanced productivity and specialized problem-solving.
Chat models, as the name suggests, are designed for conversational interaction. They are trained to simulate human-like dialogues, making them suitable for chatbots or conversational agents. Chat models are about engaging in a wide-ranging, free-flowing conversation, often in a more casual or general-purpose context.
Given the complexity and resource requirements of fine-tuning base models, you’ll probably find greater immediate value and ease of use in opting for chat or instruct models. These models are already optimized for specific tasks and interactions, making them more accessible and practical for a wide range of tasks.
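To make the distinction concrete, chat and instruct models typically expect their input wrapped in a specific prompt template, whereas a base model is simply handed raw text to continue. As a rough illustration (this is just the convention used by the Llama 2 chat models we’ll run later in this post; other models use different templates), a prompt might look like this:

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Explain quantization in one paragraph. [/INST]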
Model file format types
Models are typically available in several file formats, each compatible with different machine learning frameworks. The most common formats include:
- .bin: Often used for storing the weights of the model. These are raw binary files and are generally framework-agnostic.
- .pth: In PyTorch, the file extensions .pt and .pth are used for serialized models. These formats can store either the entire model, including its architecture and parameters, or just the model’s state dictionary, which contains weights and biases.
- .h5: Used in TensorFlow and Keras to store models, including both their architecture and weights, leveraging the HDF5 standard for efficient and structured data management.
These files can be quite large, often several gigabytes or more. This is important to consider when running models locally, as you’ll need sufficient storage space. The file size is a reflection of the model’s complexity and the amount of training data it has processed.
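If you want a quick sense of how much space a model takes on disk, a simple listing does the job (this assumes you’ve placed the files under ./models/7B, the directory we’ll use later in this post):

# Show the model files with human-readable sizes
ls -lh ./models/7B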
Understanding model precision
When dealing with LLMs, it’s crucial to understand the concept of model precision and its impact on performance and efficiency. Precision refers to how much detail is preserved in the numbers the model stores, particularly its weights and activations.
During the training phase of an LLM, precision is key. The standard practice is to use 32-bit floating-point precision (float32). Here’s why:
- High Precision: Float32 offers a balance of range and precision, crucial for capturing the subtle nuances in data during the training process.
- Training Accuracy: Higher precision ensures that the model learns effectively, as even small errors in training can lead to significant deviations in learning outcomes.
- Widely Supported: Most machine learning frameworks and hardware are optimized for float32 operations.
Once the model is trained, the focus shifts to inference (making predictions). Here, efficiency becomes as important as accuracy, leading to the use of lower precision formats like 16-bit.
Formats like float16 or bfloat16 are commonly used. They reduce the model’s memory footprint and speed up computations, allowing for faster responses and lower resource consumption. This shift usually provides a good balance between maintaining adequate model performance and increasing computational efficiency.
As we push the boundaries of running LLMs on conventional hardware, further quantization becomes a consideration. This involves reducing the precision of the model’s numerical representation even more, such as down to 4-bit.
Reducing the bit precision can significantly decrease the model’s memory requirements and computational load, making it feasible to run on standard consumer-grade hardware. Although lower precision like 4-bit, for example, might lead to a reduction in model accuracy, it can offer a practical compromise when running complex models on less powerful devices.
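To put rough numbers on this: a 7-billion-parameter model stored in float32 needs about 4 bytes per parameter, or roughly 28 GB just for the weights. The same model in float16 drops to around 14 GB, and a 4-bit quantized version comes in somewhere around 3.5 to 4 GB. That difference is what turns “needs a dedicated GPU server” into “fits comfortably in a laptop’s RAM.”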
A key challenge is to reduce precision without substantially compromising the model’s ability to generate accurate and coherent outputs. Ultimately, it’s a balancing act that may require experimentation and optimization to find the sweet spot between efficiency and performance.
llama.cpp is a tool designed to run quantized versions of large language models like Llama on local hardware, offering support for various levels of integer quantization and optimized for different architectures. In other words, this is a program you can use to efficiently operate complex language models on standard consumer devices, such as your laptop.
There are several other viable tools available for similar purposes, but for the scope of this article we’ll zoom in on llama.cpp. I find it particularly effective and robust in its implementation, making it an ideal choice for exploration.
Let’s break down the steps of what we’ll need to do to get up and running:
- Download and compile llama.cpp
- Select and download a model
- Convert model format for compatibility
- Quantize your model
- Run the model
Download and compile llama.cpp
The first step is to download and compile llama.cpp. Clone the repository:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
If you’re on a MacBook, llama.cpp is configured to automatically utilize the Metal framework during compilation. This optimizes the tool for Apple silicon, leveraging the ARM NEON, Accelerate, and Metal frameworks for maximum efficiency. To compile, simply run:
make -j
If you’re not on a MacBook, refer to the README in the repository for specific instructions tailored to your hardware. llama.cpp supports various configurations, including AVX, AVX2, and AVX512 for x86 architectures, and has CUDA, Metal, and OpenCL GPU backend support.
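Once the build finishes, you should find the compiled binaries, including main and quantize, which we’ll use below, in the root of the repository (newer versions of llama.cpp may name or place these differently, so check the README if you don’t see them). A quick sanity check is to print the usage information:

./main --help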
Select and download a model
Next, you need to choose a language model that is compatible with llama.cpp. This typically involves visiting repositories like Hugging Face or GitHub, where you can find a range of models suitable for different purposes.
If your system has 32GB or more of RAM, I highly recommend Mixtral 8x7B. In my opinion, it’s the most capable open model currently available, offering multilingual capabilities and a large 32k token context window. Its high performance and fast inference make it suitable for a wide array of applications.
If you’re on less than 32GB of RAM, and especially if you’re limited to 8GB of RAM, you’ll want to consider lighter models. Options like Mistral 7B or Llama 2 in either the 7B or 13B configurations are more suitable for that scenario.
For the purpose of demonstration, let’s go with the Llama 2 7B Chat model. This choice offers a decent balance between performance and resource requirements for most standard applications.
mkdir -p ./models/7B
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf ./models/7B
Note that if you opt for a Llama 2 model, you’ll more than likely need to add your SSH keys to your Hugging Face account. In addition, you’ll want to request access to download the weights and tokenizer (and the email you fill in on the request form must match the email of your Hugging Face account).
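One more practical note: the actual weight files in the repository are stored with Git LFS, so make sure it’s installed and initialized before cloning, otherwise you’ll end up with small pointer files instead of the multi-gigabyte weights:

# Install git-lfs via your package manager first, then:
git lfs install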
Convert model format for compatibility
Once you have selected and downloaded your model, the next step is to ensure it is in a format compatible with llama.cpp. This typically involves converting the model file using a provided Python script. However, before proceeding with the conversion, you’ll need to set up a virtual environment and install the required dependencies.
A virtual environment in Python is a self-contained directory that contains a Python installation for a particular version of Python, plus a number of additional packages. Using a virtual environment allows you to manage dependencies for different projects separately. Here’s how you set it up:
python -m venv .venv
Once the environment is created, activate it:
source .venv/bin/activate
With your virtual environment activated, install the necessary packages with:
pip install -r requirements.txt
This is where the convert.py script comes in handy. Here’s a simple command to perform this conversion:
python convert.py models/7B
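If everything goes well, the script writes a converted GGUF file into the model directory; in this example, that’s the ggml-model-f16.gguf file we’ll feed to the quantizer in the next step.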
Quantize your model
As described above, quantizing involves reducing the precision of the model’s numerical representation to decrease its memory requirements and computational load. To see the list of available quantization types, you can use the following command:
./quantize --help
This command will display various quantization options, each with different characteristics in terms of model size and performance metrics. Based on this information, you can make an informed decision about which quantization type best suits your needs.
Once you’ve decided on a quantization type, you can proceed with the quantization process. For example, if you choose Q5_K_M for its balance between size and performance, you can quantize your model as follows:
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q5_k_m.gguf q5_k_m
This command takes the model in its higher-precision format (e.g., float16) and converts it to a quantized version, which is more suitable for running on less powerful hardware.
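As a rough point of reference, the 7B model weighs in at around 13 GB in float16, while the Q5_K_M version comes out closer to 5 GB, a reduction that makes a real difference on a RAM-constrained laptop.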
Run the model
Finally, it’s time to run the model! Here is a basic command to start the language model on your device using llama.cpp:
./main -ngl 32 -m ./models/7B/ggml-model-q5_k_m.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
This command initiates the model with specified parameters like context length, temperature, and repeat penalty. The flags and values can be adjusted based on your specific requirements and the capabilities of your hardware.
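To help you tune these, here’s roughly what the flags in the command above mean:
- -ngl 32: offload up to 32 layers to the GPU (Metal on Apple silicon); lower this or drop it if you’re running purely on CPU.
- -m: path to the quantized model file.
- -c 4096: context window size in tokens.
- --temp 0.7: sampling temperature; higher values produce more varied output.
- --repeat_penalty 1.1: discourages the model from repeating itself.
- -n -1: keep generating until you stop it, rather than for a fixed number of tokens.
- -i -ins: run in interactive, instruction-following mode so you can chat with the model from your terminal.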
The open-source community has played a pivotal role in making these powerful tools more accessible. With the advancements in LLMs and the increasing availability of user-friendly platforms, the door is wide open for us to learn, experiment, and innovate.
Whether you’re a professional aiming to enhance your workflow, a student exploring AI, or just someone intrigued by the potential of language models, the opportunities seem boundless. Right now, it feels like we are standing on the brink of a new era.
For me, the journey with LLMs is as much about exploration and learning as it is about achieving specific outcomes. Don’t hesitate to experiment with different models, play around with them, and push boundaries. Remember, these models are not just about algorithms and data. They represent a new way of interaction between humans and machines.