Running Large Language Models (LLMs) Locally

Large Language Models (LLMs) have revolutionized natural language processing by providing state-of-the-art performance in tasks such as text generation, translation, summarization, and more. These models, like OpenAI's GPT, Google's BERT, Meta's LLaMA, or Mistral AI's Mixtral, are deep neural networks with billions of parameters, trained on vast amounts of data to understand and generate human-like text. Running LLMs locally allows developers to leverage their power without relying on cloud services, keeping data under their own control and reducing latency. It also gives you full control over the model itself, which matters especially if you want to fine-tune it.

Parts of daily-digest.net run on a local LLM. Instead of a big model like Mixtral, we chose OpenChat (https://github.com/openchatai/OpenChat), which we have fine-tuned to fit our needs.

Running LLMs Locally

Running LLMs on your local machine involves several steps, from choosing the right model to setting up the necessary software and hardware. Here’s a detailed guide to help you get started, focusing on Mixtral and using Ollama to run it. There are of course other ways to get a model running locally, but Ollama is by far the easiest.

Tools and Frameworks

  1. Model Selection: Choose an appropriate LLM such as Mixtral or LLaMA (see https://ollama.com/library ). For this guide, we'll use Mixtral.
  2. Frameworks: Ollama is a user-friendly tool designed to make running LLMs locally as simple as possible. It provides an easy setup and an OpenAI-compatible web API for interacting with the models.

Setting Up and Using Ollama

Ollama simplifies the process of running LLMs on your local machine. It allows you to pull and run various models with minimal configuration.

Steps to Install Ollama on a Unix Machine

  1. Install Ollama:

    curl -sSL https://ollama.com/install.sh | sh
  2. Pull the Model: Ollama supports pulling a wide range of pre-trained models, including Mixtral (see https://ollama.com/library ).

    ollama pull mixtral
  3. Run the Model: Start the Ollama server and interact with the model via the provided web API (a quick test follows below).

    ollama serve
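
For a quick sanity check before touching the API, you can also talk to the model directly from the terminal with ollama run. The prompt below is just an example; any text will do.

    ollama run mixtral "Explain in one sentence what a local LLM is."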

Using Ollama’s Web API

Ollama provides a convenient web API to interact with the model. Here’s how you can use it:

  1. Start Ollama’s Web Server:

    ollama serve
  2. Access the Web API: By default, the server listens on http://localhost:11434. You can send POST requests to this endpoint to generate text.

  3. Example API Request:

    curl -X POST "http://localhost:11434/api/generate" -H "Content-Type: application/json" -d '{"model": "mixtral", "prompt": "Once upon a time", "stream": false, "options": {"num_predict": 50}}'
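
Since Ollama also exposes an OpenAI-compatible API (as mentioned above), you can alternatively send a chat-completion style request. This is a sketch: the endpoint and field names follow the OpenAI chat format, and "mixtral" is assumed to be the model you pulled earlier.

    curl -X POST "http://localhost:11434/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "mixtral", "messages": [{"role": "user", "content": "Once upon a time"}], "max_tokens": 50}'

This is handy because existing OpenAI client libraries can simply be pointed at http://localhost:11434/v1 instead of the hosted API.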

Benefits

Running a model locally keeps your data on your own infrastructure, removes the dependency on a cloud provider, reduces latency, and gives you full control over the model, including the option to fine-tune it for your use case.

Pitfalls

The main pitfall is hardware: inference speed depends heavily on whether a GPU is available.

CPU Usage

Running LLMs on a CPU is possible but comes with significant drawbacks in terms of speed. CPUs are not optimized for the parallel processing required by deep learning models, leading to longer inference times. Most servers out there do not have a dedicated GPU, though, so often the CPU is all you have.

GPU Usage

GPUs are designed for parallel processing, making them ideal for running LLMs. They significantly speed up both training and inference.

Speed Comparison: Mixtral on CPU vs. GPU

The graph below shows the inference time for a Mixtral model running on both CPU and GPU.

I ran the test on a Unix machine with 8 CPU cores and 32 GB of RAM, once with GPU mode disabled and once with GPU mode enabled. As expected, the difference is quite dramatic.

From the graph, it is evident that GPUs can be up to 10-20 times faster than CPUs for LLM inference, depending on the model size and specific hardware.
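
If you want to reproduce a rough version of this comparison on your own hardware, the response of the /api/generate endpoint includes timing fields you can use. The snippet below is a minimal sketch; it assumes jq is installed for filtering the JSON and that the mixtral model is already pulled.

    # total_duration and eval_duration are reported in nanoseconds
    curl -s "http://localhost:11434/api/generate" -d '{"model": "mixtral", "prompt": "Once upon a time", "stream": false}' | jq '{total_duration, eval_count, eval_duration}'

Dividing eval_count by eval_duration gives a rough tokens-per-second figure for your machine.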

What to Do If You Only Have a CPU

If you're limited to using a CPU, consider the following strategies:

  1. Model Optimization: Use techniques like model quantization and distillation to reduce the model size and improve inference speed.
  2. Batch Processing: Process multiple inputs simultaneously to make better use of the CPU's capabilities.
  3. Smaller Models: Opt for smaller models that require less computational power, for example a 7B model such as Mistral instead of Mixtral (see the example below).
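
As a sketch for options 1 and 3: the Ollama library offers both smaller models and more aggressively quantized variants of larger ones. The tags below are examples and may differ; check https://ollama.com/library for the tags that actually exist.

    # A 7B model that is far lighter than Mixtral
    ollama pull mistral

    # A quantized Mixtral variant (tag name is illustrative; verify it in the library)
    ollama pull mixtral:8x7b-instruct-v0.1-q4_0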

Conclusion

Running LLMs locally can provide significant benefits in terms of privacy, latency, and control. However, it requires careful consideration of the hardware and computational resources available. By using tools like Ollama, you can simplify the setup and management of LLMs like Mixtral, making it easier to harness their power on your local machine, even with limited resources.

In one of our next blog posts, we will explain how to fine-tune a specific model.