Saturday, 12 October 2024

Llama3 on a shoestring Part 2: Upgrading the CPU

Generated locally, on a shoestring, with Stable Diffusion


In Part 1, llama3 ran but the resulting chatbot was frustratingly slow. A key problem was the pre-built Docker container, which requires the CPU to support AVX instructions before it will use the GPU. If you have a GPU you can do without an AVX-capable CPU, but that requires a rebuild from source, like what I did when installing GPU support for TensorFlow on the same CPU.

Docker provides a very quick and tempting way to test various LLMs without interfering with other Python installs, so it was worth having a quick look at what AVX is.

AVX instructions from 2011 CPUs
 

AVX stands for Advanced Vector Extensions, first shipped on Intel CPUs in 2011. My AMD Phenom II X4 was bought in 2009 and thus missed the boat. The Phenom II sits in an AM3+ socket, though, so there is hope that a later AM3+ CPU might have AVX support. This turned out to be the AMD Bulldozer series, sold as the AMD FX-4000 to FX-8000 series, which do support AVX.
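Before spending any money it is worth checking what your own CPU reports. The kernel exposes the instruction set flags in /proc/cpuinfo, so a short Python check is enough (a sketch; avx and avx2 are the standard Linux flag names):

# check_avx.py - does this CPU advertise AVX/AVX2?
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('flags'):
            flags = set(line.split(':', 1)[1].split())
            print('AVX :', 'yes' if 'avx' in flags else 'no')
            print('AVX2:', 'yes' if 'avx2' in flags else 'no')
            break

On the Phenom II this prints 'no' twice, which is the whole problem.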

AMD Bulldozer FX-4000 to FX-8000 series


Incredibly they are still on sale online, with a Chinese vendor offering the FX-4100 for just RM26.40 (about USD6) up to an FX-6350 for RM148.50 (USD34). That fits my idea of a shoestring budget, so I plumped for the mid-range FX-6100 at RM49.50 (USD11.50).


 

AMD FX-6100 is now just RM49.50

The next thing to do was to check whether my equally ancient mainboard, the Asus M5A78LE, supports the FX-6100. The manual says it does support the FX series.

And since LLM programs require lots of memory, I might as well push my luck and fill it up. The M5A78LE takes a maximum of 32GB of DDR3 DRAM, twice my current 16GB. I picked up 4 x 8GB Kingston HyperX Fury Blue (1600MHz) for RM181.50 (USD42), so the whole upgrade cost me RM231 (USD53).


 

Happily both worked without trouble, and the GPU-enabled Docker container, which previously failed, now ran:

$docker run -it --rm --gpus=all -v /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama

2024/10/12 07:32:48 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"

time=2024-10-12T07:32:48.673Z level=INFO source=images.go:753 msg="total blobs: 11"

time=2024-10-12T07:32:48.820Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"

time=2024-10-12T07:32:48.822Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"

time=2024-10-12T07:32:48.885Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 cpu]"

time=2024-10-12T07:32:48.907Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"

time=2024-10-12T07:32:49.342Z level=INFO source=types.go:107 msg="inference compute" id=GPU-49ab809b-7b47-3fd0-60c1-f03c4a8959bd library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="11.7 GiB" available="11.2 GiB"

You can query it with curl:

$curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Tell me about Jeeves the butler","stream": true,"options": {"seed": 123,"top_k": 20,"top_p": 0.9,"temperature": 0}}'
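If you would rather talk to it from Python than from curl, the same endpoint works with the requests library. A minimal sketch (assuming the container is still listening on port 11434 and the model has been pulled); with "stream": true the reply arrives as one JSON object per line:

# query_ollama.py - stream a reply from the local ollama container (sketch)
import json
import requests

payload = {"model": "llama3", "prompt": "Tell me about Jeeves the butler", "stream": True}
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # partial text as it streams
        if chunk.get("done"):
            break
print()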

And the speed went up quite a bit. 

Friday, 27 September 2024

Llama 3 on a Shoestring Part 1 of 2: 2011-vintage 3GHz AMD Phenom II 16GB RAM and RTX3060 12GB



Llama working at his workstation. This image was generated locally using Stable Diffusion on a 2011 desktop with an Nvidia RTX3060 12GB GPU

Llama 3 is an 'AI model', ie a large language model comparable to Google's Gemini.

Sean Zheng's excellent post details a very quick way of installing and running Llama3 on a local desktop. He had good results with an Intel i9 with 128GB RAM and an Nvidia RTX 4090 with 24GB VRAM. My desktop, however, dates back to 2011 and is just a 3GHz AMD Phenom II with only 16GB DRAM and an Nvidia RTX 3060 GPU with 12GB VRAM. The hope is that since the RTX 3060 is not hopelessly far behind his RTX 4090, Llama3 can run, or at least hobble along, in some fashion.

Sean's desktop runs Red Hat's RHEL 9.3 but mine runs Ubuntu 22.04 LTS. Both of us had already installed the Nvidia graphics driver as well as the CUDA Toolkit; in my case the driver is 560.35.03 and CUDA is 12.6. Sean's method was to run llama3 from a Docker image. This is an excellent sandbox for a beginner like me to try out Llama3 without risking upsetting other large AI installs like Stable Diffusion or Keras.

Sean's post is mostly complete; the instructions are replicated here for convenience. First the system updates:

$sudo apt update

$sudo apt upgrade

We then need to add the Docker repository to Ubuntu:
$sudo apt install apt-transport-https ca-certificates curl software-properties-common
~$curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
$echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Then the actual Docker install:
$sudo apt update
$apt-cache policy docker-ce
$sudo apt install docker-ce

And I have a running docker daemon:
$sudo systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset>
     Active: active (running) since Thu 2024-09-26 11:05:47 +08; 22s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 56585 (dockerd)
      Tasks: 10
     Memory: 22.2M
        CPU: 729ms
     CGroup: /system.slice/docker.service
             └─56585 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/con>

A quick test seems fine:
$docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Pull complete
Digest: sha256:91fb4b041da273d5a3273b6d587d62d518300a6ad268b28628f74997b93171b2
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

Next I just use Docker to pull in ollama:
$docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
Digest: sha256:e458178cf2c114a22e1fe954dd9a92c785d1be686578a6c073a60cf259875470
Status: Downloaded newer image for ollama/ollama:latest
c09a5a60d5aa9120175c52f7b13b59420564b126005f4e90da704851bbeb9308

A quick check shows everything seems OK:
$docker ps -a
CONTAINER ID   IMAGE           COMMAND               CREATED         STATUS                   PORTS                                           NAMES
c09a5a60d5aa   ollama/ollama   "/bin/ollama serve"   9 minutes ago   Up 9 minutes             0.0.0.0:11434->11434/tcp, :::11434->11434/tcp   ollama
75beaa5bac23   hello-world     "/hello"              2 hours ago     Exited (0) 2 hours ago                                                   amazing_ptolemy

OK, now for the GPU version of Ollama. We first stop ollama:
$docker stop c09a5a60d5aa
c09a5a60d5aa
$docker rm c09a5a60d5aa
c09a5a60d5aa

Make the local directory for ollama:
$mkdir ~/ollama

Oops:
$docker run -it --rm --gpus=all -v /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

I think that means it cannot find the GPU. From here, it looks like I need the Nvidia Container Toolkit; the install guide is here.

Update the repository:

$curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

$ sudo apt-get update

Now the actual install:

sudo apt-get install -y nvidia-container-toolkit

Then just restart Docker:

$ sudo systemctl restart docker

Now ollama runs:

$docker run -it --rm --gpus=all -v  /home/heong/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama
2024/09/26 13:12:23 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"

But it then transpired that while ollama detected my GPU, it refused to use it because my CPU does not support AVX or AVX2 instructions:
time=2024-09-26T13:12:23.496Z level=WARN source=gpu.go:224 msg="CPU does not have minimum vector extensions, GPU inference disabled" required=avx detected="no vector extensions"
time=2024-09-26T13:12:23.496Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="15.6 GiB" available="13.2 GiB"


Now that was a setback, but ollama runs. Let's see if it loads llama 3.

$docker exec -it ollama ollama pull llama3

For good measure let's pull in llama2:

$docker exec -it ollama ollama pull llama2

$docker exec -it ollama ollama list
NAME             ID              SIZE      MODIFIED
llama3:latest    365c0bd3c000    4.7 GB    15 seconds ago
llama2:latest    78e26419b446    3.8 GB    24 hours ago

And indeed llama3 runs on a 2011 AMD CPU with just 16GB RAM:

$docker exec -it ollama ollama run llama3
>>> Send a message (/? for help)

>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /clear          Clear session context
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /show info
  Model
    architecture        llama
    parameters          8.0B
    context length      8192
    embedding length    4096
    quantization        Q4_0

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024


In response to the prompt

>>> How are you today?

The reply was:

I'm just an AI, I don't have feelings or emotions like humans do. However, 
I am functioning properly and ready to assist with any questions or tasks 
you may have! Is there something specific you'd like to talk about or ask 
for help with?

It was excruciatingly slow, and nvtop showed the GPU was indeed not being used, but ollama seems to be all there. So there you have it: Llama3 running on a 16GB AMD Phenom II with no GPU acceleration.

Happy Trails.



Monday, 17 June 2024

Optimus under the Hood: OpenCV with CUDA for Nvidia GT 640M GPU and Slackware 14.2

Optimus Prime stepping forth from laptop - AI-generated image from getimg.ai

I never thought much about my laptop GPUs, and even less about Nvidia GPUs, as I gave up on proprietary software 20 years ago. I was quite happy with the open source nouveau driver, until Nvidia's cuDNN allowed OpenCV imaging programs to use deep neural nets - AI.

Installing CUDA

Slowly, for it was a little cumbersome to hold your nose at the same time, I loaded the CUDA Linux toolkit onto my GeForce GT 710 desktop. The process was as unpleasant as ever - 10-year-old proprietary software starts to look like abandonware - but the results were amazing. The GPU heated up like crazy and my desktop blew up, but OpenCV flew.

Acer Aspire M3-581TG


Suddenly there were low-cost possibilities for AI-enabled imaging systems - surveillance video, even augmented reality. And some of my old laptops (defenestrated, of course) had Nvidia GPUs. I started with an old Acer Aspire M3-581TG - it has an Nvidia GeForce 640M, or so the sticker on the keyboard says. 

lspci came up with a surprise - the GPU was an Intel GPU:

root@aspireM3:/$lspci

00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)

00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)

00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)

00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)

00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)

00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)

00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)

00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)

00:1c.3 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 4 (rev c4)

00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)

00:1f.0 ISA bridge: Intel Corporation HM77 Express Chipset LPC Controller (rev 04)

00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)

00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)

01:00.0 VGA compatible controller: NVIDIA Corporation GK107M [GeForce GT 640M] (rev a1)

07:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 PCI Express Card Reader (rev 01)

0d:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)

0e:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetLink BCM57780 Gigabit Ethernet PCIe (rev 01)

Now if I had read all the lines instead of stopping after the first 3, I would have noticed it also had an Nvidia GPU: the GK107M, or GeForce GT 640M. It took quite a few weeks to recover from the shock - two GPUs in a laptop? The GPUs are switched in and out depending on whether graphics performance or power consumption is being prioritised. Nvidia calls this its Optimus system.
GPU Switching


Now the GT 640M is quite an old GPU, and the best way would be to install CUDA/cuDNN/OpenCV on a matching Ubuntu distribution. But my M3-581TG had been defenestrated 10 years ago; it ran Slackware 14.2-current, and it was too much work to install a new distribution on it.

The Nvidia GPU driver, CUDA Toolkit, cuDNN and OpenCV are notoriously finicky and you need to get the versions just right - not to mention your gcc, libraries and various other Linux bits. CUDA and cuDNN are proprietary blobs, so it is a matter of installing the various versions until one works. The first thing to do is to get past the Nvidia marketing guff and find out the GT 640M's GPU architecture: its real name is the GK107 and the architecture is Kepler.

Then you need to find the GK107's Compute Capability, which according to Nvidia is 3.0. From the cuDNN Support Matrix, the chances of it working with cuDNN 7.6.4, CUDA 10.1.243 and a Linux driver of at least r418.39 seem promising.

First the driver. I started with the SlackBuild version, r460.67. Normally you do a SlackBuild with the Nvidia blob, but I had good results with the Nvidia installer on the GT 710, so I downloaded it from Nvidia and ran it directly:

#sh NVIDIA-Linux-x86_64-460.67.run

Now if you select the dkms option the installer will fail, and you will need to SlackBuild dkms first.
$sh ./dkms.SlackBuild
$upgradepkg --install-new /tmp/dkms-2.8.4-x86_64-1_SBo.tgz
After which it needs to be run as a service, so
$vi /etc/rc.d/rc.modules.local

# Enable DKMS module rebuilding
if [ -x /usr/lib/dkms/dkms_autoinstaller ]; then
  echo "Running DKMS autoinstaller"
  /usr/lib/dkms/dkms_autoinstaller start
fi

dkms may result in build errors, so in the end I deselected it. After the installer finished, the original nouveau driver was blacklisted and the Nvidia driver loaded, but my X Windows would not start. It turned out I first needed to lspci for the GPU bus number:
# lspci
01:00.0 VGA compatible controller: NVIDIA Corporation GK107M [GeForce GT 640M] (rev a1)

And enter it into a new xorg.conf:
# cat /etc/X11/xorg.conf

Section "Module"
    Load "modesetting"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver "nvidia"
    BusID "PCI:1:0:0"
    Option "AllowEmptyInitialConfiguration"
EndSection

With X up, check the loaded driver:
$nvidia-smi
Sat Jun  8 21:53:52 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 640M     Off  | 00000000:01:00.0 N/A |                  N/A |
| N/A   62C    P8    N/A /  N/A |    149MiB /   981MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Next is CUDA. The cuDNN compatibility matrix says 10.1.243, but I had good luck with CUDA 10.2.89 before and it is very close to 10.1.243, so:
$sh ./cuda_10.2.89_440.33.01_linux.run
Note I took care not to install the included GPU driver as I already had a working 460.67.

After that you will need to include the CUDA path in your bash profile:
$cat ~/.bash_profile
PATH=$HOME/utils:/usr/local/cuda-10.2/bin:$PATH
export PS1="\u@\h:\w\$"

To test, there is a neat little test program:
$nvcc -o check_cuda check_cuda.c -lcuda
$./check_cuda
Found 1 device(s).
Device: 0
  Name: GeForce GT 640M
  Compute Capability: 3.0
  Multiprocessors: 2
  Concurrent threads: 4096
  GPU clock: 708.5 MHz
  Memory clock: 900 MHz
  Total Memory: 981 MiB
  Free Memory: 723 MiB

Next is cuDNN. The SlackBuild version to use is 8.0, but that did not work out with OpenCV, so I dialled it down a notch to cuDNN 7.6.5. This time I went with the SlackBuild, with a few mods to get it to work:
$cp cudnn.SlackBuild cudnn.SlackBuild-v8.0_11.0
$cat cudnn.SlackBuild

PRGNAM=cudnn
VERSION=${VERSION:-v7.6_10.2}
BUILD=${BUILD:-1}
TAG=${TAG:-_SBo}

CUDNN_VERSION=${VERSION%_*}
CUDA_VERSION=${VERSION#*_}
$ln -s cudnn-10.2-linux-x64-v7.6.5.32.tgz cudnn-10.2-linux-x64-v7.6.tgz
$./cudnn.SlackBuild
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.6.5
cuda/lib64/libcudnn_static.a

Slackware package /tmp/cudnn-v7.6_10.2-x86_64-1_SBo.tgz created.
$upgradepkg --install-new /tmp/cudnn-v7.6_10.2-x86_64-1_SBo.tgz


We have suffered losses, but we will install OpenCV ...

Next is the biggie, OpenCV. This usually means lots of iterations; Amos Stailey-Young's page is a good place to start. What worked for me was OpenCV 4.3.0 and opencv_contrib 4.3.0. Untar them into their respective subdirectories.

The cmake is:
heong@aspireM3:~/cuda/opencv/build$cmake -D CUDA_NVCC_FLAGS="-D_FORCE_INLINES -gencode=arch=compute_35,code=sm_35" -D CMAKE_BUILD_TYPE=RELEASE -D OPENCV_GENERATE_PKGCONFIG=ON -DBUILD_SHARED_LIBS=OFF -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_EXAMPLES=OFF -D WITH_OPENEXR=OFF -D WITH_CUDA=ON -D WITH_CUBLAS=ON -D WITH_CUDNN=ON -D CUDA_ARCH_BIN=3.0 -D OPENCV_DNN_CUDA=ON -D OPENCV_EXTRA_MODULES_PATH=~/cuda/opencv/opencv_contrib-4.3.0/modules -D LDFLAGS="-pthread -lpthread" -D CUDNN_VERSION="7.6" ~/cuda/opencv/opencv-4.3.0/

Note the use of the Compute Capability number. The cuDNN version also has to be explicitly specified, as cmake persistently fails to extract it from the cuDNN include files.

Then it is 
$make -j 4
and then
$su -c "make install"

And it seemed to have resulted in 2 files:
root@aspireM3:/$ls -lh /usr/local/lib/python3.6/site-packages/cv2/python-3.6
total 255M
-rwxr-xr-x 1 root root 255M Jun 16 22:58 cv2.cpython-36m-x86_64-linux-gnu.so
root@aspireM3:/$ls -lh /usr/local/lib/python2.7/site-packages/cv2/python-2.7
total 255M
-rwxr-xr-x 1 root root 255M Jun 16 22:57 cv2.so

And I simply did
$ln -s /usr/local/lib/python3.6/site-packages/cv2/python-3.6/cv2.cpython-36m-x86_64-linux-gnu.so /usr/local/lib/python3.6/site-packages/cv2/python-3.6/cv2.so
$export PYTHONPATH="/usr/local/lib/python3.6/site-packages/cv2/python-3.6/"

A very quick test is
$python3
Python 3.6.8 (default, Jan 13 2019, 13:36:07) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> 
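Beyond a bare import, the build can be asked whether it actually sees the GPU. A minimal check (a sketch; cv2.cuda.getCudaEnabledDeviceCount() should report 1 if the CUDA modules above were compiled in):

# check_cv2_cuda.py - confirm the custom OpenCV build sees the CUDA device
import cv2

print("OpenCV version:", cv2.__version__)
print("CUDA-enabled devices:", cv2.cuda.getCudaEnabledDeviceCount())
print("CUDA mentioned in build info:", "CUDA" in cv2.getBuildInformation())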

Amos Stailey-Young's sample code did not work for me, but sr6033's code is very similar and worked well.
$python3 detect_faces_video.py  --prototxt prototxt.txt --model res10_300x300_ssd_iter_140000.caffemodel
[INFO] loading model...
[INFO] starting video stream...
[ WARN:0] global /home/heong/cuda/opencv/opencv-4.3.0/modules/videoio/src/cap_gstreamer.cpp
(935) open OpenCV | GStreamer warning: Cannot query video position: status=0, value=-1, dura
tion=-1

For python2:
$export PYTHONPATH="/usr/local/lib/python2.7/site-packages/cv2/python-2.7/"
heong@aspireM3:~/cuda/opencv/build$python
Python 2.7.15 (default, Jun 17 2018, 22:57:51) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> 

"No sacrifice, no victory ..."


And there you have it: OpenCV 4.3.0 with CUDA 10.2.89 and cuDNN 7.6.5 running on the Nvidia GT 640M of an ancient Aspire M3-581TG laptop. Maybe my next laptop will have an Nvidia GPU with 8GB RAM ... what was it that Optimus Prime said? "Hang on to your dreams, Chip. The future is built on dreams."

Tuesday, 23 January 2024

Internet Server Blues: Serveo, Public IP, CGNAT and Accessing Your Servers from the Internet

Connection timeout

For over 2 decades I ran servers from my home. Before GitHub and the weblog, a personal website was a handy way to keep documents you might need to access. An IP camera might also need to act as a home server. An ssh server, when reachable over the Internet, turned out to be a very handy way of piercing firewalls at work. Later, IoT devices also needed a server.

In practice this worked because whenever your modem-router logs into the Internet, your service provider assigns it a public IPv4 address.

Then came NAT, a real blessing. Suppose you have several home computers all using the Internet at the same time. NAT software, usually running on your modem-router, uses just a single public IP address for all your computers, thus saving you from having to get multiple Internet lines. 

NAT or Network Address Translation


The Internet servers replying to your computers think there is just one computer, represented by your public IP. Your NAT intercepts these replies and routes them accurately to your individual computers.

Your internal servers have the problem in reverse: to a device on the Internet, all of them have the same address (your public IP). This is resolved by having each server use a unique number, a port (one of 65536 available), to identify itself - kind of like having room numbers in your house for every occupant. Based on the port, an incoming request is forwarded by the router to the correct server. The router also watches for the resulting replies and forwards them back to the (potentially numerous) Internet devices. This is called Port Forwarding.

Port Forwarding

Thus all servers implicitly use different ports: for example, http servers use port 80, https uses port 443 and ssh uses port 22.
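If you want something to practise port forwarding on, any program listening on a port of its own will do as an 'occupant'. A throwaway stand-in home server (a sketch, assuming port 8080 is free) is Python's built-in http.server:

# mini_server.py - a disposable web server on port 8080 to practise port forwarding with
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the current directory on all interfaces, port 8080
HTTPServer(("0.0.0.0", 8080), SimpleHTTPRequestHandler).serve_forever()

Forward an external port to 8080 on this machine at your router and it behaves just like the Apache example further below.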

Sometime in 2022, outside access to my servers was blocked. My service provider Unifi had implemented CGNAT, or Carrier Grade NAT. This means the service provider groups anything from tens to hundreds of subscribers behind one public IP using its own NAT upstream.

Carrier Grade Network Address Translation, or CGNAT

One immediate effect is that many commercial websites now see a great deal of traffic coming from a single IP, which triggers their DDoS protection, and they often want confirmation or verification before you can access their site.

The other problem is that my provider Unifi has chosen not merely to limit but to block Port Forwarding, unless I pay extra for a public or static IP. Requests from the Internet no longer work; internally, on my private LAN, they still work as before.

The obvious alternative is to pay for a cloud server with a Public IP, like AWS, Google Cloud, Microsoft Azure, etc.

Another alternative is ngrok, which will forward ports to you for free using an ssh trick called reverse tunnelling; if you want to use your own domain name there is a small fee.

But best of all is Trevor Dixon's serveo. It does ssh reverse tunnelling for free and will also allow unique, readable names. Buy Trevor a coffee sometime - he deserves it.

Say you already have an Apache webserver on port 80 - an insecure (ie not https) webserver. With serveo there is no need for logins or registrations; you just dive straight in with a reverse tunnel:

$ ssh -R cmheong:80:localhost:80 serveo.net  

The authenticity of host 'serveo.net (138.68.79.95)' can't be established.

RSA key fingerprint is SHA256:07jcXlJ4SkBnyTmaVnmTpXuBiRx2+Q2adxbttO9gt0M.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'serveo.net,138.68.79.95' (RSA) to the list of known hosts.

To request a particular subdomain, you first need to generate a key. Use the command ssh-keygen to generate your key. For more information about generating and using ssh keys, see https://www.ssh.com/academy/ssh/keygen. Once you've generated a key, try again, and these instructions will be replaced with instructions on how to register your key with serveo.

Forwarding HTTP traffic from https://afc2076be26e6b5cc4b2ff5c4348336f.serveo.net


Over at your browser, http now works:

http://afc2076be26e6b5cc4b2ff5c4348336f.serveo.net:80

The bonus is that https works too, without modification, and the browser will not flag it as insecure:

https://afc2076be26e6b5cc4b2ff5c4348336f.serveo.net:443

The icing on the cake is subdomains. You just make an ssh key pair (if you do not already have one):

$ ssh-keygen -t rsa 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/heong/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/heong/.ssh/id_rsa.
Your public key has been saved in /home/heong/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:AbCdEfGhIjKlMnOpQr123456789 cmheong@webserver

With your new key you now do:

$ ssh -R cmheong:80:localhost:80 serveo.net                                             
To request a particular subdomain, you first need to register your SSH public key.
To register, visit one the addresses below to login with your Google or GitHub account.                            
After registering, you'll be able to request your subdomain the next time you connect                              
to Serveo.                                                                                                         

Google: https://serveo.net/verify/google?fp=SHA256%3AAbCdEfGhIjKlMnOp%2BQr123456789
GitHub: https://serveo.net/verify/github?fp=SHA256%3AAbCdEfGhIjKlMnOp%2BQr123456789

So you need to register your key with serveo. I used my Google account. Notice that serveo has URL-encoded your key fingerprint slightly (the %3A and %2B), so just paste serveo's URL (not your ssh-keygen output) into your browser. Assuming you have already logged into your Google account, this works right away.

If you now redo your reverse tunnel:

$ ssh -R cmheong:80:localhost:80 serveo.net
Forwarding HTTP traffic from https://cmheong.serveo.net

Now https://cmheong.serveo.net will work, just like that. After that head over to https://serveo.net and buy Trevor Dixon that cup of coffee. The man deserves it.
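One last practical note: the tunnel only lives as long as the ssh session, so if your line drops the name goes dead with it. A crude supervisor that simply relaunches the same ssh command is enough for a hobby server (a sketch; the 30-second pause and the ServerAliveInterval value are arbitrary choices):

# keep_tunnel.py - restart the serveo reverse tunnel whenever it exits (sketch)
import subprocess
import time

CMD = ["ssh", "-o", "ServerAliveInterval=60",
       "-R", "cmheong:80:localhost:80", "serveo.net"]

while True:
    subprocess.run(CMD)   # blocks until the tunnel dies
    time.sleep(30)        # pause briefly before reconnecting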

Happy Trails


Wednesday, 3 May 2023

Tensorflow and Keras for the Nvidia Geforce GT 710

 

... alas! either the locks were too large, or the key was too small, but at any rate it would not open any of them. 
Seemingly against the odds, CUDA and cuDNN ran on the GT 710, and I could run an AI super resolution inference program to upscale images and video. While it was gratifying to finally bump up the GPU temperature, it hardly broke a sweat, wandering from 38 degrees Celsius to 40, more from the time of day than from workload. After all, my Raspberry Pi could do the same.

Training an AI might stretch it a little more; one of the biggest and baddest of them all, an SRGAN, might make an impression. There seem to be two main frameworks, PyTorch and TensorFlow. A very cursory search showed that PyTorch may require my Ubuntu 18.04 Python 3.6 to first be upgraded to 3.8. This is quite possible, but having spent 3 weeks building Python 3.6 for CUDA, I decided it might be time to try TensorFlow.

There is the usual dilemma of juggling TensorFlow, CUDA, gcc and Python versions, but using Fan Leng-Yoon's and the TensorFlow sites, I settled on TensorFlow 2.2.0.




The instructions culled from tensorflow site are:

$sudo apt-get update
$sudo pip3 install "tensorflow==2.2.*"
$sudo pip3 install keras

It installed surprisingly smoothly, except when it was time for the tensorflow test:

$python3
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Illegal instruction (core dumped)

That was not good. A hint came from the tensorflow repository: I may be missing the AVX instructions. And indeed my CPU, an Athlon II X3 440, does not have them:

$cat /proc/cpuinfo
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 6020.19
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

The obvious thing to do would be to downgrade TensorFlow to version 1.5, from before AVX instructions were baked into the TensorFlow binaries.

$sudo pip3 uninstall tensorflow
$sudo pip3 install "tensorflow_gpu==1.5.*"

But now I get a different failure:

$python3
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

 It looks like it wants an older CUDA:
$sudo ls -lR /usr/ | grep -e libcublas
[sudo] password for heong:
lrwxrwxrwx  1 root root        15 Aug 10  2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx  1 root root        23 Aug 10  2019 libcublas.so.10 -> libcublas.so.10.2.1.243
-rw-r--r--  1 root root  62459056 Aug 10  2019 libcublas.so.10.2.1.243

I guess I will need to build TensorFlow without the AVX instructions.

$sudo -H pip3 uninstall tensorflow_gpu

To build from source I used the TensorFlow instructions, which call for first installing an enormous build tool, Bazel. And since I had recently got ChatGPT, why not give it a try:

chatGPT convincingly gave the wrong answer

chatGPT very convincingly gave the wrong answer, version 0.26.0. The correct answer is 3.1.0. When I pointed this out it immediately gave another equally convincing (and correct) answer:

chatGPT quickly changed its mind to version 3.1.0

For now ChatGPT 3 seems to have the credibility of a used-car salesman. They say an SRGAN hallucinates all those extra pixels in an up-scaled image. ChatGPT is known for a few flights of fancy, like ... the Mad Hatter?

“Have I gone mad? I'm afraid so. You're entirely bonkers. But I will tell you a secret: all the best people are.”


$sudo pip3 uninstall keras
$sudo apt-get update
$sudo apt-get install curl gnupg
$curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg
$sudo mv bazel.gpg /etc/apt/trusted.gpg.d/
$echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$cat  /etc/apt/sources.list.d/bazel.list
deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8

$sudo apt-get install bazel-2.0.0
$bazel --version
bazel 2.0.0
git clone https://github.com/tensorflow/tensorflow.git
$cd tensorflow
$git checkout v2.2.0
$./configure
Extracting Bazel installation...
You have bazel 2.0.0 installed.
Please specify the location of python. [Default is /usr/bin/python3]:


Found possible Python library paths:
 /usr/lib/python3/dist-packages
 /home/heong/opencv_build/opencv/build/lib/python3/
 /usr/local/lib/python3.6/dist-packages
Please input the desired Python library path to use.  Default is [/usr/lib/python3/dist-packages]
/usr/local/lib/python3.6/dist-packages
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Found CUDA 10.1 in:
   /usr/local/cuda/lib64
   /usr/local/cuda/include
Found cuDNN 7 in:
   /usr/lib/x86_64-linux-gnu
   /usr/include

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

$bazel build --config=opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package

$sudo link /usr/bin/python3 /usr/bin/python
$bazel build  //tensorflow/tools/pip_package:build_pip_package

The build took the Athlon all night but completed successfully.

With a little bit of help from Isaac Lascasas

$./tensorflow/tools/pip_package/build_pip_package.sh /tmp/tensorflow_pkg

$pip3 install --upgrade --force-reinstall /tmp/tensorflow_pkg/tensorflow-2.2.0-cp36-cp36m-linux_x86_64.whl

And it worked:
$cd ..
$python3
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'2.2.0'

$sudo pip3 install keras
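Before committing the Athlon to another all-night run, it is worth confirming that this hand-built TensorFlow actually sees the GT 710. A minimal check (a sketch; tf.config.list_physical_devices is available in TensorFlow 2.2):

# check_tf_gpu.py - confirm the custom TensorFlow 2.2 build can use the GPU
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible:", tf.config.list_physical_devices("GPU"))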

Training HasnainRaz's Fast-SRGAN bumped the GT 710's temperature up to 60 degrees Celsius. Running inference on it on several hundred frames raised it further to 67. Somehow that felt more like real work.



Happy Trails.

Saturday, 22 April 2023

Nvidia GeForce GT 710: Down the Rabbit Hole of Proprietary Obsolescence

 

" ... down went Alice after it, never once considering how in the world she was to get out again." - Lewis Carroll, 'Alice's Adventures in Wonderland'

I try to avoid proprietary software, which is why I do not usually buy Nvidia graphics cards. If I did, I would use the nouveau open source driver. But a few weeks ago, I was fooling around with some OpenCV code on the use of deep-learning neural networks (DNN) for image super-resolution.

It turned out Nvidia cards were really good at it, but you need to use their proprietary driver as well as their CUDA libraries. In particular, the OpenCV dnn module uses the Nvidia cuDNN library, which uses CUDA, which in turn uses the Nvidia binary driver.

I started with Google Colab, a free cloud service that offers Nvidia GPUs. That was great for development, but it can take many hours to upscale a video, and Colab kept kicking me out after 2 hours for hogging the GPU.

The normal way would be to buy a desktop with, say, an Nvidia RTX 3060 12GB card for RM4200 (less than USD950), but installing and using proprietary systems is bad enough; paying good money for it really hurts. It turned out I had a 7-year-old GeForce GT 710 from Gigabyte lying around inside an even older (12 years!) Asus Crosshair IV Formula with an Athlon II at 3GHz.

So, like Alice, I dived down the rabbit hole of proprietary obsolescence on an impulse. Ubuntu 22.04 installed and ran like a breeze. A default install (just like Colab's) using Nvidia CUDA 12 and cuDNN 8.9.0 did not work. Actually, none of the three parts (card driver, CUDA and cuDNN) worked.

Time to do my homework. Gigabyte lists my card as GV-N710SL-2GL, still on sale. The 'specs' listed were mostly marketing guff and quite useless. Techpowerup came up with the goods: its real name is GK208, the architecture is Kepler and, crucially, the CUDA Compute Capability is 3.5. The official Nvidia CUDA Compute Capability page does not mention the GT 710 at all.

Gigabyte GeForce GT 710


Now not all the websites agree on the GT 710, least of all Nvidia's. The cuDNN Support Matrix excludes the Kepler architecture and implies a minimum CUDA Compute Capability of 5.0.

cuDNN 8.9.0 does not support Kepler 


Kepler not included

Yet the 2019 version of the same document, now archived and no longer linked from the main Nvidia cuDNN site, says otherwise:


Kepler supported by cuDNN 7.6.x

What this feels like is that the GeForce GT 710 is abandonware, probably for marketing reasons. Did I mention I do not like proprietary systems? But there is one more hurdle for Kepler: was CUDA support for OpenCV's DNN module written only after Kepler was abandoned? Luckily it was released in the Google Summer of Code of the same year (2019), so the chances are excellent.

So what I need is cuDNN v7.6.4, CUDA 10.1.243 and CUDA driver r418.39. cuDNN v7.6.4 is still available at the Nvidia cuDNN Archive; I chose the Ubuntu version as it was the same as Colab's, which means regressing to the much older Ubuntu 18.04. There are 3 packages: the runtime library, the developer library and the code samples. CUDA 10.1 is available from Nvidia, and I chose CUDA 10.1 Update 2.

And since I had only ever used Ubuntu in virtual machines, on Docker, AWS or Google Colab, I had never had to install it, so here are the instructions:

Make the Ubuntu boot DVD thus:
$sudo growisofs -speed=1 -dvd-compat -Z /dev/sr0=ubuntu-18.04.6-desktop-amd64.iso

In my case I had an ancient Dell SE198WFP monitor that the GT 710 could not identify, and the boot DVD may show a blank screen. By rebooting and pressing various keys (e?) as the GRUB bootloader is starting up, it is possible to invoke the config menu and turn on the 'nomodeset' kernel parameter. I then got a very basic 640x480 setup for Ubuntu 18.04.

After the install, if you want a static IP address you need to do something like:
$sudo vi /etc/network/interfaces

And add in your IP address:
auto enp5s0
iface enp5s0 inet static
 address your.ip.addr.here
 netmask 255.255.255.0
 gateway your.router.addr.1
 dns-nameservers 8.8.8.8

After that, an ssh server is always handy:
$sudo apt install openssh-server
$sudo systemctl status ssh
$sudo systemctl enable ssh
$sudo systemctl start ssh
$sudo ufw allow ssh
$sudo nano /etc/ssh/sshd_config
$sudo service ssh restart

To set your computer host name:
$sudo hostnamectl set-hostname MyAIcomputer

Annoyingly, Ubuntu 18.04 kept setting my DNS server address to 127.0.0.53, so I did:

sudo vi /etc/systemd/resolved.conf

And added the line
DNS=8.8.8.8

And lastly, Ubuntu 18.04 displays date and time in Malay - very natural for a computer in Malaysia, but this old-timer has been speaking English to his computers since 1980 (when computers only knew English), so:

$sudo localectl set-locale LC_TIME=en_US.utf8

To prepare Ubuntu 18.04 to build OpenCV I used changx03's instructions, reproduced here for convenience:
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install build-essential cmake pkg-config unzip yasm git checkinstall
$ sudo apt install libavcodec-dev libavformat-dev libswscale-dev libavresample-dev 
$ sudo apt install libgstreamer1.0-dev libgstreamer-plugins-base1.0-dev 
$ sudo apt install libxvidcore-dev x264 libx264-dev libfaac-dev libmp3lame-dev libtheora-dev 
$ sudo apt install libfaac-dev libmp3lame-dev libvorbis-dev
$ sudo apt install libopencore-amrnb-dev libopencore-amrwb-dev
$ sudo apt-get install libgtk-3-dev
$ sudo apt-get install python3-dev python3-pip 
$ sudo -H pip3 install -U pip numpy 
$ sudo apt install python3-testresources
$ sudo apt-get install libtbb-dev
$ sudo apt-get install libatlas-base-dev gfortran

"Follow the White Rabbit" - Trinity, in "The Matrix" 1999

Following the White Rabbit


$sudo apt-get install linux-headers-$(uname -r)
$wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
$sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
$sudo apt-get update
$sudo init 3
$sudo apt-get -y install cuda

And after it is all done, reboot the computer to load the new Nvidia graphics driver:
$sudo reboot

CUDA 10.1 seems fine, but there is a problem with the Nvidia driver: it does not load:

$nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

There is an uninstall method in the Nvidia documentation, but it did not work:
$sudo /usr/bin/nvidia-uninstall
sudo: /usr/bin/nvidia-uninstall: command not found

What Nvidia thinks I should use: Gigabyte RTX3090 24GB


I guess we will have to do it the Ubuntu way, with apt. Now since the graphics card driver was packaged with CUDA 10.1 you will need to find its version, and it looks like 418.87.00:

$sudo apt list --installed | less
nvidia-compute-utils-418/unknown,now 418.87.00-0ubuntu1 amd64 [installed,automatic]
nvidia-dkms-418/unknown,now 418.87.00-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-418/unknown,now 418.87.00-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-418/unknown,now 418.87.00-0ubuntu1 amd64 [installed,automatic]

This makes the uninstall command thus:
$sudo apt remove --purge nvidia-driver-418

Now I tried quite a few graphics drivers in the Ubuntu repository. Version 390 worked very well but was incompatible with CUDA 10.1. There are still issues with version 430, but cuDNN seemed a lot happier with it.

$sudo apt install nvidia-driver-430

It loads, is recognized by the X server and can be configured, but at a much reduced resolution instead of my Dell's 1440x900. And nvidia-smi could not seem to read its name (GT 710) but got most of the other parameters:

$nvidia-smi
/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for
      usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for
      usage information.

Sat Apr 22 11:18:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0 N/A |                  N/A |
| 33%   38C    P8    N/A /  N/A |     65MiB /  2000MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note the CUDA version is listed as 11.4; I took 10.1 to be the runtime version number.

Next is cuDNN. 

$sudo init 3
$sudo dpkg -i libcudnn7_7.6.4.38-1+cuda10.1_amd64.deb
$sudo dpkg -i libcudnn7-dev_7.6.4.38-1+cuda10.1_amd64.deb
$sudo dpkg -i libcudnn7-doc_7.6.4.38-1+cuda10.1_amd64.deb

I used the latest version of OpenCV, which at the time of installation was version 4.7.0-dev:
git clone https://github.com/opencv/opencv.git
git clone https://github.com/opencv/opencv_contrib.git

After many trials, these build options seem to work. Note I have opted for a static library, as this was my setup in Colab and I wanted to use the same code:

~/opencv_build/opencv$mkdir build && cd build
~/opencv_build/opencv/build$cmake -D CUDA_NVCC_FLAGS="-D_FORCE_INLINES -gencode=arch=compute_35,code=sm_35" -D CMAKE_BUILD_TYPE=RELEASE -D OPENCV_GENERATE_PKGCONFIG=ON -DBUILD_SHARED_LIBS=OFF -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_EXAMPLES=OFF -D WITH_OPENEXR=OFF -D WITH_CUDA=ON -D WITH_CUBLAS=ON -D WITH_CUDNN=ON -D CUDA_ARCH_BIN=3.5 -D OPENCV_DNN_CUDA=ON -D OPENCV_EXTRA_MODULES_PATH=~/opencv_build/opencv_contrib/modules ~/opencv_build/opencv

A key part of the cmake output is that both CUDA and cuDNN are included:

--   NVIDIA CUDA:                   YES (ver 10.1, CUFFT CUBLAS)
--     NVIDIA GPU arch:             35
--     NVIDIA PTX archs:
--
--   cuDNN:                         YES (ver 7.6.4)

The actual make command is:
~/opencv_build/opencv/build$make -j5

The output is

~/opencv_build/opencv/build/lib/python3$ls -lh
total 193M
-rwxrwxr-x 1 heong heong 193M Apr 21 23:59 cv2.cpython-36m-x86_64-linux-gnu.so

The One


"He's the One" - Morpheus, "The Matrix" 1999

To prove that the setup supports the Geforce GT 710:
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery$sudo make
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery$sudo ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GT 710"
 CUDA Driver Version / Runtime Version          11.4 / 10.1
 CUDA Capability Major/Minor version number:    3.5
 Total amount of global memory:                 2001 MBytes (2098003968 bytes)
 ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
 GPU Max Clock rate:                            954 MHz (0.95 GHz)
 Memory Clock rate:                             800 Mhz
 Memory Bus Width:                              64-bit
 L2 Cache Size:                                 524288 bytes
 Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
 Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
 Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 65536
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  2048
 Maximum number of threads per block:           1024
 Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
 Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
 Run time limit on kernels:                     Yes
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
 Device supports Compute Preemption:            No
 Supports Cooperative Kernel Launch:            No
 Supports MultiDevice Co-op Kernel Launch:      No
 Device PCI Domain ID / Bus ID / location ID:   0 / 8 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

To run the super resolution program you will also need:

$sudo pip3 install numpy
$sudo pip3 install imutils

Finally:
$export PYTHONPATH="/home/fred/opencv_build/opencv/build/lib/python3/"
~/sr$python3 sr.py --model FSRCNN_2x.pb --input 3coyote-10s.webm --fps 25 --useCUDA
Output video will be 3coyote-10s-FSRCNN_2x.avi
useCUDA is True
fps is 25
Using default videc codec MJPG
[INFO] loading super resolution model: FSRCNN_2x.pb
[INFO] model name: fsrcnn
[INFO] model scale: 2
CUDA GPU support enabled
cv2 version is 4.7.0-dev
sys.path is ['/home/heong/sr', '/home/heong/opencv_build/opencv/build/lib/python3', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/home/heong/.local/lib/python3.6/site-packages', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages']
[INFO] starting video stream...
Opening input video file 3coyote-10s.webm
Waiting 2s to stabilize stream ...
Opening output video file 3coyote-10s-FSRCNN_2x.avi
upscaled.shape=(720, 960, 3)
Opening output video file 3coyote-10s-FSRCNN_2x.avi
upscaled h x w is 720x960
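The sr.py script itself is not reproduced here, but its core is OpenCV's dnn_superres module from opencv_contrib. A minimal single-image sketch (assuming the same FSRCNN_2x.pb model file and a hypothetical input.png; sr.py adds the video handling on top of this):

# superres_sketch.py - upscale one image 2x with dnn_superres on the CUDA backend
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("FSRCNN_2x.pb")                       # pre-trained FSRCNN, 2x scale
sr.setModel("fsrcnn", 2)
sr.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # route inference through CUDA
sr.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

img = cv2.imread("input.png")                      # hypothetical input image
upscaled = sr.upsample(img)
cv2.imwrite("output_2x.png", upscaled)
print("upscaled shape:", upscaled.shape)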

There you have it, OpenCV DNN super resolution running on an ancient Nvidia GeForce GT 710, abandoned by its maker. The archives are spotty and it still has software issues. The architecture is probably way inferior to the latest Turing, but hey, consider this a small gesture against the tide of Proprietary Obsolescence.

Did I mention I dislike proprietary software? Happy Trails.