Another Fresh Machine

Tue Oct 10, 2023

So it turns out that installing Linux on a machine with a newer motherboard and GPU is nontrivial. Or, at least, the process doesn't seem to be documented anywhere yet. Also, on a completely unrelated topic, I now know a lot more about docker and GPU setup than I ever thought I'd need to.

TLDR: Read and run this.

Setting Up A Machine

I got a build based on one of these bad boys. It's probably not the hottest thing available, but it's compatible with recent AMD processors, can run the modest 3050 I got to start this build out with, and can expand to hold four 4090s with PCIe extenders if this "running models" thing ends up being a going concern for me. Memory was cheap enough that I just ended up paying for 64GB of it. It's modest, but this thing still outclasses the next most powerful physical computer I've owned by about 2x. The storage was scrounged up from various stricken laptops/machines I've got lying around. In the end, I put in ~500GB worth of high-speed SSD and another 500GB of M.2 (I didn't manage to seat that second drive properly though, so it isn't in use yet, but I'm working on it).

That SecureBoot bullshit I remember signing petitions over seems to be in force, but only enough to annoy people. You can disable it by going into your BIOS menu and turning it off. However, even after doing so, my live disk failed to boot the machine. I spent about two days trying to figure out whether I'd misconfigured something, whether my BIOS version was too old to support Ubuntu, or whether it was something else. No, the answer is that even Ubuntu doesn't support NVIDIA drivers out of the gate. At least, not before v23. So what I had to do was twiddle the startup commands it was using to boot in order to force it into default-video-drivers mode, and then add the proprietary drivers as part of the installation process. Ubuntu 23.04 and later do this by default, apparently, but I didn't know that at the time and was installing the LTS version first.

Incidentally, this driver malarkey is why I'm not running Debian this time around. I love the project, and my main personal machine is probably always going to be a Debian box running stumpwm, but I wanted to minimize headaches on this build and it still cost me several days of life.

Ok, so we've got a working machine with a working GPU and OS installation. Can we just run a cog model now? If you know anything about any of the pieces involved, you know the answer to that question is "Hahahahaha! Fuck you!"

Installing docker

Naively executing cog predict on any model from Replicate still gives a GPU-not-found error. In the best case, the models try to fail over to CPU inference mode, but the end result for all the ones I've tried is an explosion, with the only real difference being how long it takes for the explosion to manifest.

The first thing I tried here was making sure that I had the proper Nvidia GPU drivers: through this method, through the Ubuntu graphical "Additional Drivers" interface, and through the 100% manual "download a file from the Nvidia website and run it locally" approach. After borking my machine a couple of times, forcing re-installs, I determined that I did in fact have working drivers the first time. The problem is that docker, which underpins cog, usually runs in a sandboxed mode that doesn't allow it to take advantage of the underlying machine's GPU.

An aside here: I get that it's taken over the world, but in my mind, docker is trash. It's this weird middle ground between a reproducible build system with shell-level sandboxing and full virtualization. I think the idea was to have a system that includes the advantages of both, but it very much looks to me like it managed to pick up the disadvantages of both while managing a meagre subset of the advantages of the first. I'm not going to spend much more time whinging about it, but from my perspective: if you're deploying a system into the wild, you get all of the bang and none of the bite out of something like guix/nix, and if you actually want to sandbox apps for the purposes of walling off some hard-to-clean-up-after app, then you're much better off fully virtualizing. The only place docker is the right choice is if you and your team already know about docker and don't know about the other options. Which, now that it's taken over, is fair enough, but I'm still annoyed by the extent to which this feels like a completely suboptimal coordination outcome.

Oh, right, also, by default it requires sudo on Ubuntu.

You can resolve that by adding yourself to the docker group and running newgrp docker. The bigger issue, as stated above, is that by default it also doesn't take advantage of your GPU. Luckily, there's an nvidia toolkit built to do this, which you can install with apt install nvidia-container-toolkit. I'm glossing over the couple of hours it took to a) figure out that this was the root cause of the issue, and b) figure out what I had to do. But once that was done, the solution was that extra install line.

While I was already playing with all this, I thought I'd figure out how to checkpoint a running docker container for my own nefarious purposes. It turns out there's a tool named criu that docker checkpoint relies on, but it requires your docker to have the experimental flag set (which you can either do manually, or automate as part of a script using jq). I'm not going to detail this process because, while it does sound interesting, it doesn't turn out to have much of a benefit for my use cases. I was going to use it to swap out models so that I could conserve GPU memory, but this turns out to be thorny for reasons I'll mention in the conclusion. Same goes for cog itself.
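For reference, here is a minimal sketch of what setting the experimental flag amounts to. It swaps in Python's json module for jq, so it is not the approach the setup script takes. On Linux, Docker reads its daemon settings from /etc/docker/daemon.json by default, and the daemon has to be restarted before a change takes effect. The function name and the sample config below are hypothetical, chosen only for illustration.

```python
import json


def enable_experimental(daemon_json_text: str) -> str:
    """Return daemon.json content with "experimental" set to true.

    Existing keys are preserved; an empty or missing file is treated as {}.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["experimental"] = True
    return json.dumps(config, indent=2, sort_keys=True)


# Illustrative input only: a config that already registers a GPU runtime.
sample = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}'
print(enable_experimental(sample))
```

Writing the result back to the config file and restarting the daemon both need root, which is why this sketch only transforms text and leaves the file handling to whatever script calls it.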

Because I am who I am, all of the steps I ended up keeping are now part of my machine-setup script. Specifically, here. If you're someone looking to run my exact setup, you may as well read that code block instead of the entire post so far, which is why it's up in the TLDR. I'm only writing the rest of this out of anger.

Run the basic llama_tiny model through a container

Now that we've got everything set up, it's finally possible to run Replicate models. According to the docker section of the llama-2-7b-chat model api doc, we should go to our target machine and run

docker run -d -p 5000:5000 --gpus=all

I also added a name though, because after a while, managing these by UUID gets tedious. So,

docker run -d -p 5000:5000 --name=llama_tiny --gpus=all

On our client machine:

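import json

import requests


def _llama_msg(msg):
    # Wrap user turns in [INST] markers; pass every other role through unchanged.
    if msg["role"] == "user":
        return f"[INST]{msg['content']}[/INST]"
    return msg["content"]


def _llama_chat(messages):
    # Accept either a ready-made prompt string or a list of role/content messages.
    if isinstance(messages, str):
        return messages
    return "\n".join(_llama_msg(m) for m in messages)


def llama(prompt, target=""):
    # `target` is the address of the machine running the container.
    # Assumes cog's standard HTTP endpoint (POST /predictions) on the
    # port mapped by the docker run command above.
    resp = requests.post(
        f"http://{target}:5000/predictions",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"input": {"prompt": _llama_chat(prompt), "max_new_tokens": "1800"}}),
    )
    if resp.status_code == 200:
        return "".join(resp.json()["output"]).strip()
    # Hand back the raw response so the caller can inspect the failure.
    return resp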

On my local network, the model-running machine happened to come up at; change it as appropriate for yours. With all that in place, drumroll...

>>> main.llama("Hello there, Llama")
"Hello there! *adjusts glasses* I'm here to help with any questions you may have, but please keep in mind that I'm just an AI and can only provide information within my knowledge base and ethical guidelines. I will always strive to be respectful, honest, and safe in my responses, and I won't participate in discussions that promote harmful or discriminatory content. Is there anything else you would like to know?"

My experience briefly chatting with this model tells me that it really likes adjusting its glasses for some reason. I haven't tested it as a drop-in replacement for chat-gpt-3.5-turbo in my aidev library, but it's a logical thing to try out.

Notes and Conclusions

This is a bit rushed because it's at what I thought would be the tail end of about four days wall-clock time of hitting my head against metaphorical walls. It turns out to be a mixed bag; I did get my machine up and running, and I did manage to run a model inference on it, but I didn't manage to extend this to arbitrary models and tasks.

Firstly, larger models, like the huge Llama2, run out of memory during provisioning. Which makes sense, since I've only got 8GB of GPU room right now. Secondly, some models, like tortoise-tts, just outright error. If I docker logs --follow them when making remote requests, I see odd memory errors causing server faults even when there is ample GPU space lying around for them to use. I'm not sure what to think about this.
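A rough back-of-the-envelope sketch of why the largest model can't be provisioned: the memory needed for the weights alone is roughly the parameter count times the bytes per parameter. The 70-billion-parameter figure (the largest Llama 2 variant) and the two precisions are assumptions picked for illustration, and the estimate ignores activations, caches, and framework overhead, so real usage would be higher.

```python
GPU_MEMORY_GB = 8  # the GPU room available on this build


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    A billion parameters at one byte each is one GB, so the billions cancel.
    """
    return params_billions * bytes_per_param


# Assumed precisions: 16-bit floats (2 bytes) and 4-bit quantization (0.5 bytes).
for label, bytes_per_param in [("16-bit", 2.0), ("4-bit", 0.5)]:
    needed = weight_memory_gb(70, bytes_per_param)
    verdict = "fits" if needed <= GPU_MEMORY_GB else "does not fit"
    print(f"70B at {label}: ~{needed:.0f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")
```

Even under aggressive quantization, the weights alone come out several times larger than the card, which matches the out-of-memory failures during provisioning.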

For these reasons, I'm going to dedicate the next little while to running models through huggingface directly. I've only got limited experience there, nowhere near enough to do a writeup yet, but I've already got some rudimentary TTS models running properly. Because this completely ditches docker in favor of memory locality, I'm going to provisionally toss all the criu-and-cog-related stuff from my machine-setup scripts. docker itself is staying, but I'm not going to put any effort into mitigating its memory sandboxing consequences.

I'll let you know how the new model running plan goes.

Creative Commons License

all articles at langnostic are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License

Reprint, rehost and distribute freely (even for profit), but attribute the work and allow your readers the same freedoms.
