Sharon Machlis
Executive Editor, Data & Analytics

5 easy ways to run an LLM locally

how-to
Apr 25, 2024 | 23 mins
Artificial Intelligence | Generative AI | Software Development

Deploying a large language model on your own system can be surprisingly simple—if you have the right tools. Here’s how to use LLMs like Meta’s new Llama 3 on your desktop.

Four Llamas on the range - LLMs
Credit: Noe Besso/Shutterstock

Chatbots like ChatGPT, Claude.ai, and Meta.ai can be quite helpful, but you might not always want your questions or sensitive data handled by an external application. That’s especially true on platforms where your interactions may be reviewed by humans and otherwise used to help train future models.

One solution is to download a large language model (LLM) and run it on your own machine. That way, an outside company never has access to your data. This is also a quick way to try new models such as Meta’s Llama 3, as well as specialty models like Code Llama, which is tuned for coding, and SeamlessM4T, which is aimed at speech and text translation.

Running your own LLM might sound complicated, but with the right tools, it’s surprisingly easy. And the hardware requirements for many models aren’t extreme. I’ve tested the options presented in this article on two systems: a Dell PC with an Intel i9 processor, 64GB of RAM, and an Nvidia GeForce GPU with 12GB of VRAM (which likely wasn’t engaged by much of this software), and a Mac with an M1 chip but just 16GB of RAM.

Be advised that it may take a little research to find a model that performs reasonably well for your task and runs on your desktop hardware. And few are likely to be as good as what you’re used to with a tool like ChatGPT (especially with GPT-4) or Claude.ai. Simon Willison, creator of the command-line tool LLM, argued in a presentation last summer that running a local model could be worthwhile even if its responses are wrong:

[Some of] the ones that run on your laptop will hallucinate like wild— which I think is actually a great reason to run them, because running the weak models on your laptop is a much faster way of understanding how these things work and what their limitations are.

It’s also worth noting that open source models keep improving, and some industry watchers expect the gap between them and commercial leaders to narrow.

Run a local chatbot with GPT4All

If you want a chatbot that runs locally and won’t send data elsewhere, GPT4All offers a desktop client for download that’s quite easy to set up. It includes options for models that run on your own system, and there are versions for Windows, macOS, and Ubuntu.

When you open the GPT4All desktop application for the first time, you’ll see options to download around 10 models (as of this writing) that can run locally. Among them is Llama-2-7B chat, a model from Meta AI. You can also set up OpenAI’s GPT-3.5 and GPT-4 (if you have access) for non-local use if you have an API key.

The model-download portion of the GPT4All interface was a bit confusing at first. After I downloaded several models, I still saw the option to download them all. That suggested the downloads didn’t work. However, when I checked the download path, the models were there.

Image shows a description of two models on the left and options to download on the right. Screenshot by Sharon Machlis for IDG.

A portion of the model download interface in GPT4All. Once I opened the usage portion of the application, my downloaded models automatically appeared.

Once the models are set up, the chatbot interface itself is clean and easy to use. Handy options include copying a chat to the clipboard and regenerating a response.

Image shows query 'What is a good way to learn American Sign Language?' and response. Screenshot by Sharon Machlis for IDG.

The GPT4All chat interface is clean and easy to use.

There’s also a beta LocalDocs plugin that lets you “chat” with your own documents locally. You can enable it in the Settings > Plugins tab, where you’ll see a “LocalDocs Plugin (BETA) Settings” header and an option to create a collection at a specific folder path.

The plugin is a work in progress, and documentation warns that the LLM may still “hallucinate” (make things up) even when it has access to your added expert information. Nevertheless, it’s an interesting feature that’s likely to improve as open-source models become more capable.

In addition to the chatbot application, GPT4All also has bindings for Python, Node, and a command-line interface (CLI). There’s also a server mode that lets you interact with the local LLM through an HTTP API structured very much like OpenAI’s. The goal is to let you swap in a local LLM for OpenAI’s by changing a couple of lines of code.
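
For instance, here’s a minimal sketch of the Python bindings, assuming you’ve run pip install gpt4all; the model file name below is just an example (the library will download it to its models directory if it isn’t already there), so substitute one listed in the GPT4All app:

from gpt4all import GPT4All

# Load a GPT4All model by file name; it is downloaded on first use if needed.
# The file name is an example -- pick one from the GPT4All download list.
model = GPT4All("mistral-7b-openorca.gguf2.Q4_0.gguf")

with model.chat_session():
    reply = model.generate("What is a good way to learn American Sign Language?", max_tokens=200)
    print(reply)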

LLMs on the command line

LLM by Simon Willison is one of the easier ways I’ve seen to download and use open source LLMs locally on your own machine. While you do need Python installed to run it, you shouldn’t need to touch any Python code. If you’re on a Mac and use Homebrew, just install with


brew install llm

If you’re on a Windows machine, use your favorite way of installing Python libraries, such as


pip install llm

LLM defaults to using OpenAI models, but you can use plugins to run other models locally. For example, if you install the gpt4all plugin, you’ll have access to additional local models from GPT4All. There are also plugins for Llama, the MLC project, and MPT-30B, as well as additional remote models.

Install a plugin on the command line with llm install plugin-name:


llm install llm-gpt4all

You can see all available models, both remote and the ones you’ve installed, along with brief info about each one, with the command llm models list.

Results of llm models list command shows model source, name, size, and RAM needed. Screenshot by Sharon Machlis for IDG.

The display when you ask LLM to list available models.

To send a query to a local LLM, use the syntax:


llm -m the-model-name "Your query"

I then asked falcon-q4_0 a ChatGPT-like question without issuing a separate command to download the model:


llm -m ggml-model-gpt4all-falcon-q4_0 "Tell me a joke about computer programming"

This is one thing that makes the LLM user experience so elegant. If the GPT4All model doesn’t exist on your local system, the LLM tool automatically downloads it for you before running your query. You’ll see a progress bar in the terminal as the model is downloading.

Model downloading in the terminal with a progress bar. Screenshot by Sharon Machlis for IDG.

LLM automatically downloaded the model I used in a query.

The joke itself wasn’t outstanding—”Why did the programmer turn off his computer? Because he wanted to see if it was still working!”—but the query did, in fact, work. And if results are disappointing, that’s because of model performance or inadequate user prompting, not the LLM tool.

You can also set aliases for models within LLM, so that you can refer to them by shorter names:


llm aliases set falcon ggml-model-gpt4all-falcon-q4_0

To see all your available aliases, enter: llm aliases.

The LLM plugin for Meta’s Llama models requires a bit more setup than GPT4All does. Read the details on the LLM plugin’s GitHub repo. Note that the general-purpose llama-2-7b-chat did manage to run on my work Mac with the M1 Pro chip and just 16GB of RAM. It ran rather slowly compared with the GPT4All models optimized for smaller machines without GPUs, and performed better on my more robust home PC.

LLM has other features, such as an argument flag that lets you continue from a prior chat and the ability to use it within a Python script. And in early September, the app gained tools for generating text embeddings, numerical representations of what the text means that can be used to search for related documents. You can see more on the LLM website. Willison, co-creator of the popular Python Django framework, hopes that others in the community will contribute more plugins to the LLM ecosystem.
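
Here’s a minimal sketch of that Python usage, based on the llm package’s documented Python API; it assumes you’ve installed the llm-gpt4all plugin and uses the same Falcon model as above:

import llm

# Load a model by the same name you'd pass to `llm -m` on the command line.
model = llm.get_model("ggml-model-gpt4all-falcon-q4_0")

response = model.prompt("Tell me a joke about computer programming")
print(response.text())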

Llama models on your desktop: Ollama

Ollama is an even easier way to download and run models than LLM. However, the project was limited to macOS and Linux until mid-February, when a preview version for Windows finally became available. I tested the Mac version. 

Final setup screen saying 'Run your first model with ollama: run llama2'. Screenshot by Sharon Machlis for IDG.

Setting up Ollama is extremely simple.

Installation is an elegant experience via point-and-click. And although Ollama is a command-line tool, there’s just one command with the syntax ollama run model-name. As with LLM, if the model isn’t on your system already, it will automatically download.

You can see the list of available models at https://ollama.ai/library, which as of this writing included several Llama-based models such as Llama 3, Code Llama, CodeUp, and medllama2, which is fine-tuned to answer medical questions.

The Ollama GitHub repo’s README includes a helpful list of some model specs and advice that “You should have at least 8GB of RAM to run the 3B models, 16GB to run the 7B models, and 32GB to run the 13B models.” On my 16GB RAM Mac, the 7B Code Llama performance was surprisingly snappy. It will answer questions about bash/zsh shell commands as well as programming languages like Python and JavaScript.

Terminal window screen showing download progress bars and answers to questions about shell commands. Screenshot by Sharon Machlis for IDG.

How it looks running Code Llama in an Ollama terminal window.

Despite being the smallest model in the family, Code Llama was pretty good if imperfect at answering an R coding question that tripped up some larger models: “Write R code for a ggplot2 graph where the bars are steel blue color.” The code was correct except for two extra closing parentheses in two of the lines of code, which were easy enough to spot in my IDE. I suspect the larger Code Llama could have done better.

Ollama has some additional features, such as LangChain integration and the ability to run with PrivateGPT, which may not be obvious unless you check the GitHub repo’s tutorials page.
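
As a rough sketch of the LangChain side (not taken from Ollama’s own docs), the pattern looks something like this, assuming Ollama is running in the background, you’ve pulled the codellama model, and you’ve installed the langchain-community package:

from langchain_community.llms import Ollama

# LangChain talks to the local Ollama service, which listens on localhost by default.
llm = Ollama(model="codellama")  # any model you've pulled with `ollama pull` or `ollama run`

print(llm.invoke("Write R code for a ggplot2 graph where the bars are steel blue color."))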

You could have PrivateGPT running in a terminal window and pull it up every time you have a question. And now that the Windows preview is out, I’m looking forward to trying Ollama on my home PC.

Chat with your own documents: h2oGPT

H2O.ai has been working on automated machine learning for some time, so it’s natural that the company has moved into the chat LLM space. Some of its tools are best used by people with knowledge of the field, but instructions to install a test version of its h2oGPT chat desktop application were quick and straightforward, even for machine learning novices.

You can access a demo version on the web (obviously not using an LLM local to your system) at gpt.h2o.ai, which is a useful way to find out if you like the interface before downloading it onto your own system.

You can download a basic version of the app, with limited ability to query your own documents, by following the project’s setup instructions.


Screen shows the questions, the AI's response, and a link to source document. Screenshot by Sharon Machlis for IDG.

A local Llama model answers questions based on VS Code documentation.

Without adding your own files, you can use the application as a general chatbot. Or, you can upload some documents and ask questions about those files. Compatible file formats include PDF, Excel, CSV, Word, text, markdown, and more. The test application worked fine on my 16GB Mac, although the smaller model’s results didn’t compare to paid ChatGPT with GPT-4 (as always, that’s a function of the model and not the application). The h2oGPT UI offers an Expert tab with a number of configuration options for users who know what they’re doing. This gives more experienced users the option to try to improve their results.

Tab options include system pre-context, query pre-prompt, system prompt, number of chunks, and more. Screenshot by Sharon Machlis for IDG.

Exploring the Expert tab in h2oGPT.

If you want more control over the process and options for more models, download the complete application. There are one-click installers for Windows and macOS, for systems with a GPU or with CPU only. Note that my Windows antivirus software was unhappy with the Windows version because it was unsigned. I’m familiar with H2O.ai’s other software and the code is available on GitHub, so I was willing to download and install it anyway.

Rob Mulla, now at H2O.ai, posted a YouTube video on his channel about installing the app on Linux. Although the video is several months old and the application’s user interface appears to have changed, it still has useful info, including helpful explanations of H2O.ai’s LLMs.

Easy but slow chat with your data: PrivateGPT

PrivateGPT is also designed to let you query your own documents using natural language and get a generative AI response. The documents in this application can include several dozen different formats. And the README assures you that the data is “100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!”

PrivateGPT features scripts to ingest data files, split them into chunks, create “embeddings” (numerical representations of the meaning of the text), and store those embeddings in a local Chroma vector store. When you ask a question, the app searches for relevant documents and sends just those to the LLM to generate an answer.
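
That ingest-embed-retrieve pattern is worth seeing in miniature even if you never read PrivateGPT’s code. The sketch below is not PrivateGPT’s implementation; it just shows the general idea using the chromadb library with Chroma’s built-in default embedding function and a couple of made-up document chunks:

import chromadb

# Store document chunks in a local Chroma collection; Chroma computes embeddings
# with its default embedding function unless you supply your own.
client = chromadb.Client()
collection = client.create_collection("my_docs")
collection.add(
    documents=["Penpot is an open-source design tool.", "It supports SVG natively."],
    ids=["chunk-1", "chunk-2"],
)

# At question time, retrieve the most relevant chunks and send just those to the LLM.
results = collection.query(query_texts=["What file format does Penpot support?"], n_results=1)
print(results["documents"])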

If you’re familiar with Python and how to set up Python projects, you can clone the full PrivateGPT repository and run it locally. If you’re less knowledgeable about Python, you may want to check out a simplified version of the project that author Iván Martínez set up for a conference workshop, which is considerably easier to set up.

That version’s README file includes detailed instructions that don’t assume Python sysadmin expertise. The repo comes with a source_documents folder full of Penpot documentation, but you can delete those and add your own.

PrivateGPT includes the features you’d likely most want in a “chat with your own documents” app in the terminal, but the documentation warns it’s not meant for production. And once you run it, you may see why: Even the small model option ran very slowly on my home PC. But just remember, the early days of home internet were painfully slow, too. I expect these types of individual projects will speed up.

More ways to run a local LLM

There are more ways to run LLMs locally than just these five, ranging from other desktop applications to writing scripts from scratch, all with varying degrees of setup complexity.

Jan

Jan is a relatively new open-source project that aims to “democratize AI access” with “open, local-first products.” The app is simple to download and install, and the interface is a nice balance between customizability and ease of use. It’s an enjoyable app to use.

Choosing models to use in Jan is pretty painless. Within the application’s hub, shown below, there are descriptions of more than 30 models available for one-click download, including some with vision, which I didn’t test. You can also import others in the GGUF format. Models listed in Jan’s hub show up with “Not enough RAM” tags if your system is unlikely to be able to run them.

Jan's model hub. Created by Sharon Machlis for IDG.

Jan’s model hub hosts more than 30 models available for download.

Jan’s chat interface includes a right-side panel that lets you set system instructions for the LLM and tweak parameters. On my work Mac, a model I had downloaded was tagged as “slow on your device” when I started it, and I was advised to close some applications to try to free up RAM. Whether or not you’re new to LLMs, it’s easy to forget to free up as much RAM as possible when launching genAI applications, so that is a useful alert. (Chrome with a lot of tabs open can be a RAM hog; closing it solved the issue.)

Jan's chat interface. Screenshot by Sharon Machlis for IDG.

Jan’s chat interface is detailed and easy to use.

Once I freed up the RAM, streamed responses within the app were pretty snappy.

Jan also lets you use OpenAI models from the cloud in addition to running LLMs locally. And, you can set up Jan to work with remote or local API servers.

Jan’s project documentation was still a bit sparse when I tested the app in March 2024, although the good news is that much of the application is fairly intuitive to use—but not all of it. One thing I missed in Jan was the ability to upload files and chat with a document. After searching on GitHub, I discovered you can indeed do this by turning on “Retrieval” in the model settings to upload files. However, I couldn’t upload either a .csv or a .txt file. Neither was supported, although that wasn’t obvious until I tried it. A PDF worked, though. It’s also notable, although not Jan’s fault, that the small models I was testing did not do a great job of retrieval-augmented generation.

A key advantage of Jan over LM Studio (see below) is that Jan is open source under the AGPLv3 license, a copyleft license that allows commercial use as long as any derivative works are also released as open source. LM Studio is free for personal use, but the site says you should fill out the LM Studio @ Work request form to use it on the job. Jan is available for Windows, macOS, and Linux.

Opera

If all you want is a super easy way to chat with a local model from your current web workflow, the developer version of Opera is a possibility. It doesn’t offer features like chat with your files. You also need to be logged into an Opera account to use it, even for local models, so I’m not confident it’s as private as most other options reviewed here. However, it’s a convenient way to test and use local LLMs in your workflow.

Opera Aria Screenshot by Sharon Machlis for IDG.

Clicking on the Aria button opens a default chat using OpenAI. You’ll need to click the Get Started button to see the option for a local LLM.

Local LLMs are available on the developer stream of Opera One, which you can download from its website.

To start, open the Aria Chat side panel—that’s the top button at the bottom left of your screen. That defaults to using OpenAI’s models and Google Search.

To opt for a local model, you first have to click Start, as if you were using the default, and then an option appears near the top of the screen to “Choose local AI model.”

Select that, then click “Go to settings” to browse or search for models, such as Llama 3 in 8B or 70B.

LLM models for Opera Screenshot by Sharon Machlis for IDG.

You can search for available models for download such as Llama 3 in the Opera chat interface.

For those with very limited hardware, Opera suggests Gemma:2b-instruct-q4_K_M.

After your model downloads, it is a bit unclear how to go back and start a chat. Click the menu at the top left of your screen and you’ll see a button for “New chat.” Make sure to once again click “Choose local AI model,” then select the model you downloaded; otherwise, you’ll be chatting with the default OpenAI model.

Start a new chat. Screenshot by Sharon Machlis for IDG.

You can choose to use a local model by clicking that option at the top of the screen.

What’s most attractive about chatting in Opera is using a local model within the now-familiar copilot-in-your-side-panel generative AI workflow. Opera is based in Norway and says it’s GDPR compliant for all users. Still, I’d think twice about using this option for anything highly sensitive as long as logging into a cloud account is required.

Chat with RTX

Nvidia’s Chat with RTX demo application is designed to answer questions about a directory of documents. As of its February launch, Chat with RTX can use either a Mistral or Llama 2 LLM running locally. You’ll need a Windows PC with an Nvidia GeForce RTX 30 Series or higher GPU with at least 8GB of video RAM to run the application. You’ll also want a robust internet connection. The download was a hefty 35GB zipped.

Chat with RTX presents a simple interface that’s extremely easy to use. Clicking on the icon launches a Windows terminal that runs a script to launch an application in your default browser.

Chat with RTX interface Screenshot by Sharon Machlis for IDG.

The Chat with RTX interface is simple and easy to use.

Select an LLM and the path to your files, wait for the app to create embeddings for your files—you can follow that progress in the terminal window—and then ask your question. The response includes links to documents used by the LLM to generate its answer, which is helpful if you want to make sure the information is accurate, since the model may answer based on other information it knows and not only your specific documents. The application currently supports .txt, .pdf, and .doc files as well as YouTube videos via a URL.

Chat with RTX response screen Screenshot by Sharon Machlis for IDG.

A Chat with RTX response with links to response documents.

Note that Chat with RTX doesn’t look for documents in subdirectories, so you’ll need to put all your files in a single folder. If you want to add more documents to the folder, click the refresh button to the right of the data set to re-generate embeddings.

llamafile

Mozilla’s llamafile, unveiled in late November, lets developers turn a model’s weights and the code needed to run them into a single executable file. It also comes with software that can download LLM files in the GGUF format, import them, and run them in a local in-browser chat interface.

To run llamafile, the project’s README suggests downloading the current server version with


curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1 > llamafile
chmod +x llamafile

Then, download a model of your choice. I’ve read good things about Zephyr, so I found and downloaded a version from Hugging Face.

Now you can run the model in a terminal with

./llamafile --model ./zephyr-7b-alpha.Q4_0.gguf

Replace zephyr with the name and path of whatever model you downloaded, wait for it to load, and open it in your browser at http://127.0.0.1:8080. You’ll see an opening screen with various options for the chat:

llamafile opening screen with chat options. Screenshot by Sharon Machlis for IDG.

Enter your query at the bottom, and the screen will turn into a basic chatbot interface:

llamafile chatbot interface. Screenshot by Sharon Machlis for IDG.
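
The same local server can also be queried from code rather than through the browser. Here’s a rough sketch using Python’s requests library against the /completion endpoint provided by llama.cpp, which llamafile builds on; the endpoint path and JSON fields are assumptions based on llama.cpp’s server API, so check the llamafile README for your version:

import requests

# Send a prompt to the llamafile server started above (default: http://127.0.0.1:8080).
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Tell me a joke about computer programming", "n_predict": 128},
)
print(resp.json().get("content"))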

You can test out running a single executable with one of the sample files on the project’s GitHub repository: mistral-7b-instruct, llava-v1.5-7b-server, or wizardcoder-python-13b.

On the day that llamafile was released, Simon Willison, author of the LLM project profiled in this article, said in a blog post, “I think it’s now the single best way to get started running large language models (think your own local copy of ChatGPT) on your own computer.”

While llamafile was extremely easy to get up and running on my Mac, I ran into some issues on Windows. For now, like Ollama, llamafile may not be the top choice for plug-and-play Windows software.

LocalGPT

A PrivateGPT spinoff, LocalGPT, includes more options for models and has detailed instructions as well as three how-to videos, including a 17-minute detailed code walk-through. Opinions may differ on whether this installation and setup is “easy,” but it does look promising. As with PrivateGPT, though, documentation warns that running LocalGPT on a CPU alone will be slow.

LM Studio

Another desktop app I tried, LM Studio, has an easy-to-use interface for running chats, but you’re more on your own with picking models. If you know what model you want to download and run, this could be a good choice. If you’re just coming from using ChatGPT and you have limited knowledge of how best to balance precision with size, all the choices may be a bit overwhelming at first. Hugging Face Hub is the main source of model downloads inside LM Studio, and it has a lot of models.

Unlike the other LLM options, which all downloaded the models I chose on the first try, I had problems downloading one of the models within LM Studio. Another didn’t run well, which was my fault for maxing out my Mac’s hardware, but I didn’t immediately see a suggested minimum non-GPU RAM for model choices. If you don’t mind being patient about selecting and downloading models, though, LM Studio has a nice, clean interface once you’re running the chat. As of this writing, the UI didn’t have a built-in option for running the LLM over your own data.

Interface shows chat list at left, main chat bot UI in the center, and config options at right. Screenshot by Sharon Machlis for IDG.

The LM Studio interface.

LM Studio does have a built-in server that can be used “as a drop-in replacement for the OpenAI API,” as the documentation notes, so code that was written to use an OpenAI model via the API will run instead on the local model you’ve selected.
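
In practice, that usually means changing little more than the base URL in your client code. Here’s a minimal sketch using the openai Python package; it assumes LM Studio’s local server is running on its default port (1234 when I tried it) with a model already loaded in the app:

from openai import OpenAI

# Point existing OpenAI client code at LM Studio's local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # the key is ignored locally

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model you've loaded in the app
    messages=[{"role": "user", "content": "Tell me a joke about computer programming"}],
)
print(response.choices[0].message.content)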

Like h2oGPT, LM Studio throws a warning on Windows that it’s an unverified app. LM Studio code is not available on GitHub and isn’t from a long-established organization, though, so not everyone will be comfortable installing it.

In addition to using a pre-built model download interface through apps like h2oGPT, you can also download and run some models directly from Hugging Face, a platform and community for artificial intelligence that includes many LLMs. (Not all models there include download options.) Mark Needham, developer advocate at StarTree, has a nice explainer on how to do this, including a YouTube video. He also provides some related code in a GitHub repo, including sentiment analysis with a local LLM.

Hugging Face provides some documentation of its own about how to install and run available models locally.
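
If you go that route, Hugging Face’s transformers library and its pipeline interface are the usual starting point. A minimal sketch, assuming you’ve installed transformers and PyTorch and have enough RAM for the model you pick (the model ID below is just one small example):

from transformers import pipeline

# The model is downloaded from Hugging Face on first run and cached locally after that.
generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example of a small chat model; swap in your choice
)

result = generator("Tell me a joke about computer programming", max_new_tokens=60)
print(result[0]["generated_text"])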

LangChain

Another popular option is to download and use LLMs locally in LangChain, a framework for creating end-to-end generative AI applications. That does require getting up to speed with writing code using the LangChain ecosystem. If you know LangChain basics, you may want to check out the documentation on Hugging Face Local Pipelines, Titan Takeoff (requires Docker as well as Python), and OpenLLM for running LangChain with local models. OpenLLM is another robust, standalone platform designed for deploying LLM-based applications into production.
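
For a flavor of the Hugging Face Local Pipelines approach, here’s a minimal sketch based on LangChain’s documented HuggingFacePipeline wrapper; the model ID is again just a small example, and it assumes you have the langchain-community and transformers packages installed:

from langchain_community.llms import HuggingFacePipeline

# Wrap a local Hugging Face model so it can be used anywhere LangChain expects an LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example small model
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 60},
)

print(llm.invoke("Tell me a joke about computer programming"))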


Sharon Machlis is Director of Editorial Data & Analytics at Foundry (the IDG, Inc. company that publishes websites including Computerworld and InfoWorld), where she analyzes data, codes in-house tools, and writes about data analysis tools and tips. She holds an Extra class amateur radio license and is somewhat obsessed with R. Her book Practical R for Mass Communication and Journalism was published by CRC Press.
