
Saturday, 12 October 2024

Llama3 on a shoestring Part 2: Upgrading the CPU

Generated locally with Stable Diffusion on a shoestring setup


In Part 1, llama3 ran but the resulting chatbot was frustratingly slow. A key problem is that the pre-built docker container requires the CPU to support AVX instructions before it will use the GPU. You can get by without an AVX CPU if you have a GPU, but that means rebuilding from source, much like what I did when installing GPU support for tensorflow on the same CPU.

Docker provided a very quick and tempting way to test various LLM models without interfering with other python installs, so it was worth having a quick look at what AVX is.

AVX instructions from 2011 CPUs
 

AVX stands for Advanced Vector Extensions, first shipped in Intel CPUs in 2011. My AMD Phenom II X4 was bought in 2009 and so missed the boat. The Phenom II sits in an AM3+ socket, however, so there was hope that a later AM3+ CPU might have AVX support. That turned out to be the AMD Bulldozer series, sold as the FX-4000 to FX-8000 series, which do support AVX.
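If you are not sure whether a CPU has AVX, a quick way to check on Linux is to grep the flags in /proc/cpuinfo; on a Phenom II this comes back empty:

$grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u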

AMD Bulldozer FX-4000 to FX-8000 series


Incredibly they are still on sale online, with a Chinese vendor offering the FX-4100 for just RM26.40 (about USD6) and going up to an FX-6350 for RM148.50 (USD34). That fits my idea of a shoestring budget, so I plumped for the mid-range FX-6100 at RM49.50 (USD11.50).


 

AMD FX-6100 is now just RM49.50

The next thing to do was to check whether my equally ancient mainboard, an Asus M5A78LE, supports the FX-6100. The manual says it does support the FX series.

And since LLM programs require lots of memory, I might as well push my luck and fill it up. The M5A78LE takes a maximum of 32GB of DDR3 DRAM, twice my current 16GB. I picked up 4 x 8GB Kingston HyperX Fury Blue (1600MHz) for RM181.50 (USD42), so the whole upgrade cost me RM231 (USD53).
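Once the new parts are in, a quick sanity check that the FX-6100 and the full 32GB are recognised (just standard Linux tools, nothing specific to this upgrade) is:

$lscpu | grep -i 'model name'
$free -h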


 

Happily both worked without trouble, and the gpu-enabled docker container, which previously failed, now ran:

$docker run -it --rm --gpus=all -v /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama

2024/10/12 07:32:48 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"

time=2024-10-12T07:32:48.673Z level=INFO source=images.go:753 msg="total blobs: 11"
time=2024-10-12T07:32:48.820Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-10-12T07:32:48.822Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"
time=2024-10-12T07:32:48.885Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 cpu]"
time=2024-10-12T07:32:48.907Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-10-12T07:32:49.342Z level=INFO source=types.go:107 msg="inference compute" id=GPU-49ab809b-7b47-3fd0-60c1-f03c4a8959bd library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="11.7 GiB" available="11.2 GiB"

You can query it with curl:

$curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Tell me about Jeeves the butler","stream": true,"options": {"seed": 123,"top_k": 20,"top_p": 0.9,"temperature": 0}}'
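The reply comes back as a stream of JSON fragments. If that is awkward to read, setting stream to false should return the whole reply in a single JSON object:

$curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Tell me about Jeeves the butler","stream": false}'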

And the speed went up quite a bit. 
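To put a rough number on the speed-up, ollama's run command takes a --verbose flag (assuming it is available in this version), which prints prompt and response token rates after each reply:

$docker exec -it ollama ollama run llama3 --verbose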

Friday, 27 September 2024

Llama 3 on a Shoestring Part 1 of 2: 2011-vintage 3GHz AMD Phenom II 16GB RAM and RTX3060 12GB



Llama working at his workstation. This image was generated locally using Stable Diffusion on a 2011 desktop with an Nvidia RTX3060 12GB GPU

Llama 3 is an 'AI model', i.e. a large language deep-learning model comparable to Google's Gemini.

Sean Zheng's excellent post details a very quick way of installing and running Llama3 on a local desktop. He had good results with an Intel i9 with 128GB RAM and an Nvidia RTX 4090 with 24GB VRAM. However, my desktop dates back to 2011 and is just a 3GHz AMD Phenom II with only 16GB DRAM and an Nvidia RTX 3060 GPU with 12GB VRAM. The hope is that since the RTX 3060 is not too far behind his RTX 4090, Llama3 can run, or at least hobble along, in some fashion.

Sean's desktop runs Red Hat's RHEL 9.3 but mine runs Ubuntu 22.04 LTS. Both of us had already installed the Nvidia graphics driver as well as the CUDA Toolkit; in my case the driver is 560.35.03 and CUDA is 12.6. Sean's method was to run llama3 from a Docker image. This is an excellent sandbox for a beginner like me to try out Llama3 without risking upsetting other large AI installs like Stable Diffusion or Keras.
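A quick way to confirm both the driver and CUDA versions before starting is nvidia-smi, which prints them in its header:

$nvidia-smi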

Sean's post is mostly complete; the instructions are replicated here for convenience. First the system updates:

$sudo apt update

$sudo apt upgrade

We then need to update the ubuntu repository for docker:
$sudo apt install apt-transport-https ca-certificates curl software-properties-common
$curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
$echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Then the actual Docker install:
$sudo apt update
$apt-cache policy docker-ce
$sudo apt install docker-ce

And I have a running docker daemon:
$sudo systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset>
     Active: active (running) since Thu 2024-09-26 11:05:47 +08; 22s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 56585 (dockerd)
      Tasks: 10
     Memory: 22.2M
        CPU: 729ms
     CGroup: /system.slice/docker.service
             └─56585 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/con>

A quick test seems fine:
$docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Pull complete
Digest: sha256:91fb4b041da273d5a3273b6d587d62d518300a6ad268b28628f74997b93171b2
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
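Note that the docker commands below are run without sudo; if your user is not yet in the docker group, adding it (and logging out and back in) is one way to allow that:

$sudo usermod -aG docker $USER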

Next I just use Docker to pull in ollama:
$docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
Digest: sha256:e458178cf2c114a22e1fe954dd9a92c785d1be686578a6c073a60cf259875470
Status: Downloaded newer image for ollama/ollama:latest
c09a5a60d5aa9120175c52f7b13b59420564b126005f4e90da704851bbeb9308

A quick check shows everything seems OK:
$docker ps -a
CONTAINER ID   IMAGE           COMMAND               CREATED         STATUS                   PORTS                                           NAMES
c09a5a60d5aa   ollama/ollama   "/bin/ollama serve"   9 minutes ago   Up 9 minutes             0.0.0.0:11434->11434/tcp, :::11434->11434/tcp   ollama
75beaa5bac23   hello-world     "/hello"              2 hours ago     Exited (0) 2 hours ago                                                   amazing_ptolemy

OK, now for the GPU version of Ollama. We first stop and remove the existing container:
$docker stop c09a5a60d5aa
c09a5a60d5aa
$docker rm c09a5a60d5aa
c09a5a60d5aa

Make the local directory for ollama:
$mkdir ~/ollama

Oops:
$docker run -it --rm --gpus=all -v /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

I think that means it cannot find the GPU. From here, I think I need the Nvidia Container Toolkit. The install guide is here.

Update the repository:

$curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

$ sudo apt-get update

Now the actual install:

$sudo apt-get install -y nvidia-container-toolkit
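The install guide also has a step to register the toolkit as a Docker runtime; I take it this is wanted before the restart:

$sudo nvidia-ctk runtime configure --runtime=docker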

Then just restart Docker:

$ sudo systemctl restart docker

Now ollama runs:

$docker run -it --rm --gpus=all -v  /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama
2024/09/26 13:12:23 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"

But then it refused to use the GPU, as my CPU does not support AVX or AVX2 instructions:
time=2024-09-26T13:12:23.496Z level=WARN source=gpu.go:224 msg="CPU does not have minimum vector extensions, GPU inference disabled" required=avx detected="no vector extensions"
time=2024-09-26T13:12:23.496Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="15.6 GiB" available="13.2 GiB"


Now that was a setback, but ollama runs. Let's see if it loads llama 3.

$docker exec -it ollama ollama pull llama3

For good measure let's pull in llama2:

$docker exec -it ollama ollama pull llama2

$docker exec -it ollama ollama list
NAME             ID              SIZE      MODIFIED
llama3:latest    365c0bd3c000    4.7 GB    15 seconds ago
llama2:latest    78e26419b446    3.8 GB    24 hours ago

And indeed llama3 runs on a 2011 AMD CPU with just 16GB RAM:

$docker exec -it ollama ollama run llama3
>>> Send a message (/? for help)

>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /clear          Clear session context
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /show info
  Model
    architecture        llama
    parameters          8.0B
    context length      8192
    embedding length    4096
    quantization        Q4_0

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024


In response to the prompt

>>> How are you today?

The reply was:

I'm just an AI, I don't have feelings or emotions like humans do. However, 
I am functioning properly and ready to assist with any questions or tasks 
you may have! Is there something specific you'd like to talk about or ask 
for help with?

It was excruciatingly slow, and nvtop showed the GPU was indeed not used, but ollama seems to be all there. So there you have it: Llama3 running on a 16GB AMD Phenom II without GPU acceleration.

Happy Trails.