4 Easy Ways to run LLAMA 3 on your PC or Mac

Updated: May 02 2024 01:58


Add the 4th way using llamafile to run LLAMA 3 - May 02 2024


  1. [Simplest] Use Ollama - MacOS, Ubuntu, Windows (Preview)
  2. [Powerful] Use LM Studio - MacOS, Windows, Linux (Beta)
  3. [Flexible] Use GPT4All - MacOS, Ubuntu, Windows
  4. [SingleFile] Use llamafile - MacOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD



1) [Simplest] Use Ollama - MacOS, Ubuntu, Windows (Preview)


Download Ollama and run Llama 3:

ollama run llama3

The initial release of Llama 3 comes in two sizes, 8B and 70B parameters, each available as an Instruct model and a pre-trained (base) model. The Instruct model is fine-tuned for chat/dialogue use cases and outperforms many of the available open-source chat models on common benchmarks. The command ‘ollama run llama3’ downloads the 8B Instruct model by default. You can specify a different model by adding a tag, like ‘ollama run llama3:70b-text’; see below for more details:

Instruct 8B Parameters
ollama run llama3

Instruct 70B Parameters
ollama run llama3:70b

Pre-trained 8B Parameters
ollama run llama3:text 

Pre-trained 70B Parameters
ollama run llama3:70b-text
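
While the Ollama app is running, it also exposes a local REST API (http://localhost:11434 by default), so you can script prompts instead of typing into the interactive CLI. A minimal curl sketch, assuming the llama3 model has already been pulled:

# one-shot generation against the local Ollama server; "stream": false returns a single JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'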

Running the prompt "Show me the list of most popular cities for tourist" with the Instruct 8B model on my M3 MacBook Pro delivers about 26.56 tokens/second. See the result below:
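
To reproduce this kind of tokens/second measurement yourself, ollama can print timing statistics (prompt and generation eval rates) when you pass the --verbose flag:

# prints load time, prompt eval rate and generation eval rate after the response
ollama run llama3 --verbose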

2) [Powerful] Use LM Studio - MacOS, Windows, Linux (Beta)


LM Studio is a tool you can use to experiment with local and open-source LLMs on your PC, Mac, or Ubuntu machine. There are two ways to discover, download, and run these LLMs locally: through the in-app Chat UI or through an OpenAI-compatible local server. Here are the supported endpoints:

GET /v1/models
POST /v1/chat/completions
POST /v1/embeddings
POST /v1/completions
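
For example, once you start the local server from LM Studio's Local Server tab (port 1234 by default), you can hit the chat completions endpoint with curl. This is just a sketch; the request omits the model field because the server answers with whichever model you have loaded, so adjust the port and payload to your setup:

# chat completion request against the LM Studio local server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "List three popular tourist cities."}
    ],
    "temperature": 0.7
  }'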


  1. After downloading from https://lmstudio.ai/ and installing LM Studio, open the application.

  2. Search for "llama-3" and download either the Llama 3 8B or 70B model, e.g. Meta-Llama-3-8B-Instruct-GGUF or Meta-Llama-3-70B-Instruct-GGUF.

  3. Navigate to the chat section on the left-hand side of the interface.

  4. Click on "New Chat". At the top of the chat window, you should be able to select and load the model you want to use from a dropdown menu.

  5. After selecting the model, you can start interacting with it by typing in the chat box that appears. Using the QuantFactory/Meta-Llama-3-8B-Instruct-GGUF model, I can achieve around 20 tokens/second on my M3 MacBook Pro. RAM usage is around 6.65GB and CPU utilization is around 60%-70%. See the LM Studio running Llama 3 demo video below for more details:




3) [Flexible] Use GPT4All - MacOS, Ubuntu, Windows


GPT4All is an ecosystem for running powerful, customized large language models locally on consumer-grade CPUs and any GPU. Note that your CPU needs to support AVX or AVX2 instructions. A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All open-source ecosystem software. GPT4All is optimized to run inference of 3-13 billion parameter LLMs on the CPUs of laptops, desktops, and servers.

  1. You can download the GPT4All installer from this website: https://gpt4all.io/

  2. The file will be downloaded to your computer. Double-click the file and follow the installation process. On a Mac, you should see the following screens during the installation process:


  3. Open the GPT4All app and you should see a dialog box as shown below, allowing you to download models, e.g. the Llama 3 Instruct model.


  4. After downloading the model, you can start a conversation just like in ChatGPT. You can also click on Settings in the top-right corner to change the generation parameters. On my M3 MacBook Pro using the Llama 3 Instruct 8B model, I got about 27 tokens/second.
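
GPT4All also has an optional local API server that mimics the OpenAI API; if you enable it in the app's settings, you can query the downloaded model from the command line. A hedged sketch, assuming the default port 4891 and using an illustrative model name (check the app for the exact name and port it exposes):

# chat completion request against GPT4All's OpenAI-compatible local server (if enabled in settings)
curl http://localhost:4891/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama 3 Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'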





4) [SingleFile] Use llamafile - MacOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability: a single file runs on six operating systems without needing to be installed. llamafile goes 2x faster than llama.cpp and 25x faster than ollama for some use cases like CPU prompt evaluation. It has a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface.

  • Runs across CPU microarchitectures: llamafiles can run on a wide range of CPU microarchitectures, supporting newer Intel systems using modern CPU features while being compatible with older computers.

  • Runs across CPU architectures: llamafiles run on multiple CPU architectures such as AMD64 and ARM64, and are compatible with Win32 and most UNIX shells.

  • Cross-Operating System: llamafiles run on six operating systems: MacOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD.

  • Weight embedding: LLM weights can be embedded into llamafiles, allowing uncompressed weights to be mapped directly into memory, similar to a self-extracting archive.
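
As a note on the weight-embedding bullet above, the weights don't have to be baked in: the bare llamafile runtime from the project's releases can also load a separate GGUF file. A rough sketch (binary and weights filenames are illustrative):

# run the standalone llamafile runtime against external GGUF weights
./llamafile -m Meta-Llama-3-8B-Instruct.Q4_0.gguf -ngl 9999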


Follow these steps:

  1. Download a llamafile from Hugging Face here: https://huggingface.co/jartine


  2. Choose the latest version, 0.8, which added support for Llama 3, Grok, and Mixtral 8x22B, along with several performance improvements. The file will be downloaded to your computer. I chose Meta-Llama-3-8B-Instruct.Q4_0.llamafile.

  3. If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file (you only need to do this once):

     chmod +x Meta-Llama-3-8B-Instruct.Q4_0.llamafile

  4. Run the downloaded llamafile (the -ngl flag sets how many model layers to offload to the GPU):

     ./Meta-Llama-3-8B-Instruct.Q4_0.llamafile -ngl 9999

  5. You should see the following page open in your default browser. (If it doesn't, just open your browser and point it at http://localhost:8080.) Start entering your question in the chat box!

  6. When you're done chatting, return to your terminal and hit Control-C to shut down llamafile.
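
While llamafile is running, the same process also serves the OpenAI-compatible API mentioned above on the same port, so you can query it from scripts alongside the web GUI. A minimal sketch against the default http://localhost:8080 (the model field is just a placeholder; the server uses the embedded weights):

# chat completion request against the llamafile server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "Show me a list of popular tourist cities"}]
  }'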




