
Mostly stuff to get things done quicker. AI/ML tooling is an ephemeral beast. It changes quickly. I’ve tried to generalize as much as possible to reduce the chances of what I wrote a month ago being worthless tomorrow.

As of this writing, CUDA 12.9 is the latest CUDA. What that means to me is that I should be running at least a few revs lower than whatever the latest is, no matter who calls it stable. The same goes for Python: I try to run ~3.11 with PyTorch 2.8, FA 2 or SA 2, and Triton 3.2+ if needed. Python 3.12 incorporated a lot of changes and, for me at least, broke a lot of things that used to work: some small, like distutils being gone, and some bigger annoyances with Conda once you're past 3.12.3. These issues seem attenuated on Windows WSL2, all things being equal.
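A minimal sketch of the kind of version guard I mean at the top of a script, assuming a CUDA build of torch is installed (the pins are my preferences above, not requirements):

```python
import sys

import torch

# Hypothetical guard for the pins described above (Python ~3.11, PyTorch 2.8).
# Adjust to taste; the point is to fail fast before a long job starts.
assert (3, 11) <= sys.version_info[:2] < (3, 12), f"Unexpected Python {sys.version}"
assert torch.__version__.startswith("2.8"), f"Unexpected PyTorch {torch.__version__}"
# torch.version.cuda is None on CPU-only builds.
print(f"Python {sys.version.split()[0]}, PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
```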

When at all possible, I suggest using wheels to skip compilation, especially in scripts. There are dozens of repos on GH and HF with pre-built wheels for all sorts of configurations. This is especially the case for those using WSL2, as the g++/ldd world can get sloppy with linking issues when compiling. Flash Attention 2 is particularly annoying to compile from source, so I'd recommend a wheel. Another tip is to run a newer base Python than your venv to avoid ABI mismatches (libpython being shared and all).
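Before grabbing a prebuilt wheel, it's worth confirming the local Python/torch/CUDA combo matches the tags in the wheel's filename. A quick sketch (the wheel filename is illustrative, not a real release):

```python
import sys

import torch

# Print the pieces that have to line up with a wheel's cp3xx / torch / cu tags.
print(f"python tag : cp{sys.version_info.major}{sys.version_info.minor}")
print(f"torch      : {torch.__version__}")   # e.g. 2.8.0+cu124
print(f"cuda       : {torch.version.cuda}")  # toolkit torch was built against
# A matching wheel would look something like (hypothetical filename):
#   flash_attn-2.x.x+cu124torch2.8-cp311-cp311-linux_x86_64.whl
```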

Kijai's wheels let you skip compiling the attention tools (FA, SA), and he builds Triton wheels too (Linux only). His wheels are in all of my scripts for spinning up cloud instances. Unless you're doing something critically specific to your environment, I'd just use wheels. Note that pip installs SA 1.x!
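To verify which SA actually landed after installing, something like the following, assuming the module imports as sageattention and exposes __version__ (I haven't confirmed the latter, hence the getattr):

```python
# SageAttention's PyPI package lags at 1.x, so confirm what pip actually gave you.
import sageattention

version = getattr(sageattention, "__version__", "unknown")
print(f"sageattention {version}")
assert not version.startswith("1."), "pip gave you SA 1.x; install a 2.x wheel instead"
```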

While I’m on it, Kijai deserves an AI/ML award for his support of the community. Truly selfless. His contribution to ComfyUI alone has been indispensable to many.

“woct0rdho” provides precompiled Triton wheels for WINDOWS!

The bitsandbytes project is kind enough to build its own wheels.

Windows DeepSpeed wheels (oldish, up to CUDA 12.1)

Extremely thorough repo of DeepSpeed wheels (current as of 09/25). Mandatory for diffusion-pipe.

Other optimizations

I pull and push a lot of binaries to and from S3. For some reason I used the AWS CLI for way too long, and before that S3Fuse, which worked but was wonky. I should have switched to rclone for S3 a long time ago, but you live, you learn. The default multipart part sizes are too small in many tools; I'd use 256MB to speed things up.
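A sketch of bumping the part size when using boto3 directly (rclone has an equivalent knob in --s3-chunk-size); bucket, key, and concurrency are placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# 256MB parts instead of boto3's 8MB default; fewer, bigger parts go faster
# for large binaries.
config = TransferConfig(
    multipart_threshold=256 * 1024 * 1024,  # don't bother multiparting below this
    multipart_chunksize=256 * 1024 * 1024,
    max_concurrency=16,
)
s3 = boto3.client("s3")
s3.upload_file("model.safetensors", "my-bucket", "models/model.safetensors", Config=config)
```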

If you’re pulling from HF, you should always use hf download (huggingface_hub[cli]) for the fastest downloads. I use Runpod, and I believe they may edge-cache HF. The speeds are incredible even by modern standards. I think there’s no point squirreling away models on your own storage for scripts if you can download an ~80GB model in two minutes. People paying the bandwidth bills may disagree.
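The same thing from Python instead of the CLI, via huggingface_hub (the repo id and path are illustrative). Setting HF_HUB_ENABLE_HF_TRANSFER=1, with the hf_transfer package installed, enables the Rust downloader for extra throughput:

```python
from huggingface_hub import snapshot_download

# Pull a whole model repo; returns the local path it landed in.
path = snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",      # placeholder repo
    local_dir="/workspace/models/qwen2.5-7b",  # placeholder path
)
print(f"downloaded to {path}")
```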

Nvidia has an annoying tendency to bury links to things, front them with login-walls, and so on. CUDA and cuDNN don’t change very often, so I’ll pull .run files to S3 if I need to (you don’t have to do this with RP because most templates have the core binaries installed). If you want to use .run files, don’t forget --silent. Depending on the package, --accept or --accept-eula will probably keep your script from breaking.
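A hedged sketch of an unattended toolkit install from a runfile; the filename is a placeholder, and flags beyond --silent/--toolkit vary by version, so check `sh <file>.run --help` for your release:

```python
import subprocess

# Assumes the runfile was already pulled down (e.g. from S3) and that the
# script is running as root; --silent skips the interactive prompts and
# --toolkit installs only the CUDA toolkit, not the driver.
runfile = "cuda_12.4.0_550.54.14_linux.run"  # placeholder version
subprocess.run(
    ["sh", runfile, "--silent", "--toolkit"],
    check=True,  # fail the script loudly if the installer bails
)
```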

WSL2 specific

(More and more, I don’t recommend using WSL2 for AI/ML/GPU workloads; at least, that’s what I think today.)

WSL2 is slower than native Linux for GPU work; that probably goes without saying. Having used both often, I’d say it’s only marginally slower, maybe 5-10%. I don’t use Windows (proper) for anything GPU except display (shocking). I always run LTS Ubuntu in WSL2 with my own kernels, which are a bunch of revs behind and have my ZFS requirements built in.

I think WSL2 is great, but it can require extra care with GPU things, especially GPU passthrough (WSL2 uses the Windows display driver from Linux). For me it’s worth the trouble because I use a lot of Windows tools like Visio and Word. That sounds like heresy to Linux people, but I don’t like messing around with X and UIs; I’ll dual-boot first.

I’d be extra cautious, honestly leery, of using WSL’s NTFS abstraction layer (drvfs) for heavy workloads. I’d recommend keeping as much data as possible off of /mnt, TMPDIR included. I would also turn off swap unless you have a compelling reason for it (.wslconfig); see the sketch below.
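A minimal sketch of the swap suggestion in .wslconfig (the file lives at %UserProfile%\.wslconfig on the Windows side); the memory cap is a placeholder example, not a recommendation:

```ini
[wsl2]
# no swap file
swap=0
# optionally cap WSL2's RAM
memory=48GB
```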

I would recommend against using the latest Windows CUDA display driver; find a version at least a few revs back with the fewest complaints. Remember, inside WSL2 Linux, nvidia-smi reports the Windows display driver version; the CUDA version is whatever toolkit you have installed in Linux.
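A quick way to see both sides at once from inside WSL2, assuming a CUDA build of torch is installed:

```python
import subprocess

import torch

# The driver version nvidia-smi reports here comes from Windows; torch
# reports the Linux-side toolkit it was built against.
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(f"Windows driver (via nvidia-smi): {smi.stdout.strip()}")
print(f"Linux-side CUDA (via torch)    : {torch.version.cuda}")
```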

WSL2 and disk. WSL2 Linux is sensitive to FS problems that Windows would otherwise ignore or obfuscate. That’s a given, but sometimes we (or just I) forget. I once spent three days troubleshooting all sorts of errors in WSL2 Linux that turned out to be a degraded RAID on the Windows side, one that otherwise “looked” and performed fine on the Windows side. That taught me to stop using VMD (I know, I know) and to look at event logs (they had been spammed with critical errors for days that Windows was ‘correcting’ silently). So there’s a push and pull between Windows and WSL2. It’s very possible that dual booting, or running a dedicated Linux machine for GPU tasks and Windows for Windows, is a much better idea.