WIP 090325
A self-indulgent trek into complex problem solving. This is “My 909”. Your 909 may vary and may be much more complex.
101: A light bulb stops working, you get positive feedback, you replace the bulb (or socket, or fuse). Simple. Black and white.
202ish: Your windshield wipers stop working, you get positive feedback, you dig up the manual, find the fuse for the wipers, check the fuse. Say the fuse is fine. Check the internet next for specific problems with x vehicle with windshield wipers. Often the answer is buried somewhere in a website. The motor burned out, the electrical wiring was wrapped poorly on x model in x year due to parts shortages during Covid. The wiper motor is connected to something else that broke that’s inline that they both share. And on and on. Still, usually after a little bit of time you’ll figure it out.
Let’s say you’re real smart and you are a masochist, for example. Suppose you are seeking a dopamine treat for solving complex, difficult problems. This problem has so many parts, elements, ons, offs, unknowns, false positives, and false negatives that it drags you down to the depths of hell where the Red Herrings swim in lakes of chaotic lava. Where the solution is so complex that perhaps it looks solved when it is not. That it is mostly solved. That it was solved until it wasn’t solved. Where the human brain chews away at itself in angst as more details that must be important come into consideration. As desperation grows you wonder if the barometric pressure must have something to do with it. That a butterfly, of course, has accidentally gotten caught up in a swarm of migrating hummingbirds, and the impact from that solitary butterfly alone has been enough to shift the migration of the birds six inches. Those six inches happen to be exactly how much closer to that turbine, and past the fairing, the flock had to be for the engine to ingest those birds and cause an emergency landing. And so forth. In the movie Magnolia the narrator says,
“This was not just a matter of chance. Oh, these strange things happen all the time.”
Most people that have been in tech or any sort of engineering anything can probably spend a minute thinking of something this complex. Where all of your logic and expertise barely made a dent in solving the problem. But you persisted and persisted and swore and kicked walls and defenestrated CRT monitors and on and on. And eventually you solved it.
Below is an example of my strange tech thing that happens rarely... but often enough to keep my troubleshooting confidence in check. I think complexity keeps you on your toes. To some that’s good, to others it’s a recipe for permanent anxiety. YMMV.
For those uninterested in tech jargon, I have had AI smooth this out below.
Overview: I have a computer. By gaming standards, probably an underpowered toy good only for 4K/240Hz at only 160fps. By my standards, pretty good. Not used for gaming but for AI almost exclusively. WSL2 on Windows runs AI very, very well in most cases so that’s what I have. I’ve had minimal passthrough/overlay/driver/CUDA problems. In general I’ve been running this setup for 2.5+ years with minor issues. 95% of my time is spent in WSL2. Intel i9-14900K, 96 GB of DDR5, a few GPUs squashed into a tower and, you know, a decent system.
Problem: Linux in WSL2 is seemingly randomly failing. Seg faults, CRC errors. The Linux guest, Ubuntu, sits on top of WSL2, which in turn runs on Hyper-V. System has been stable for 6+ months. No visible errors or complaints from Windows.
Step 1- Eliminate low hanging fruit
- Seg faults (memory), CRC errors (checksum failures) both have a strong chance of being connected to hardware so,
- I removed all seemingly irrelevant hardware where possible. I tested. Tested memory with Memtest for an extended period. Tested CPU for errors. Tested storage at the Windows OS layer for errors. Ran this for an extended period. No problems here that I uncovered. Fairly confident that memory and CPU are ok.
- Tested power supplies, put in-line Kill A Watt meters on them for a day or two and looked at logs, checked motherboard BIOS for stuff that looked shady, storage firmware. Firmware, microcode, and code powering hardware. Validate the setup. Motherboard and associated integrated controllers. No known CPU problems. No known BIOS bugs or conflicts. Power supplies are getting 120V consistently from UPS. No sags or surges during monitoring.
- Go up a few layers. WSL2. Hyper-V. Verify Hyper-V config both in system and BIOS if it must be turned on (BIOS button cell dead? No). Check the WSL2 GitHub: kernel, source, version mismatches, incompatible kernels with x version, etc. No. Half a day later.
- Step back. Could be simpler. Rebuild a vanilla Linux. Test again. Errors still there. Rebuild with a known working kernel and setup. Still there. So, something that used to work no longer works.
- What has changed. I don’t know. Ask AI. It tells me it’s “100%” a GPU driver incompatibility that by way of the kernel overlay/passthrough is corrupting the OS. Chase that dragon for an hour, downgrade to “more stable” GPU drivers. Problem continues. Check the validity of AI’s claims. It’s wrong. That’s a harmless error. Move on.
- Journalctl and dmesg in more depth. Don’t see much. Crank up log limit to infinity. Do more stuff, watch the logs. Not seeing much.
- Step back again. Still no silver bullet errors to chase. Windows 10 > Hyper-V > WSL2 > Linux. Try another flavor of Linux. Same old story, seg faults, CRC errors.
- Maybe data corruption in transit (Red Herrings unite, thrown off by a few common corrupted downloads). VPN/mobile network connected to USB tethered Ethernet. All kinds of potential problems here. Phone, USB Y splitter (power/data), router, VPN, etc. Simplify. Remove all. Direct to router via Ethernet. Be safe, check for CRCs or data errors on interface. None to be seen. MTU/jumbo frames/duplex. I’m reaching. No problems.
- Download something in Ubuntu guest that has a hash to test against. A big cloud based ISO sounds good. Verify hash. Hash is good. Do it three more times with three different images. Hashes are good. Mostly confident that... Wait. Do the same things on Windows with curl and certutil... maybe... No... Everything still fine. Why is the same thing faster to download on WSL2 than on the supporting OS? Hmm. Noted.
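The download-and-verify check in the last bullet can be sketched in Python. This is a minimal sketch, assuming the image publisher lists a SHA-256 digest; the function names `sha256_of_file` and `matches_published_hash` are made up for illustration and are not the tools named above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file.

    The file is read in chunks so a multi-GB ISO never has to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file's digest with a published one.

    Case and surrounding whitespace in the published value are ignored,
    since checksum files are not consistent about either.
    """
    return sha256_of_file(path) == published_hex.strip().lower()
```

Hashing the same downloaded file more than once is a cheap extra check: if the digest of an unchanged file ever differs between runs, that would point away from the network and toward whatever sits between the file and the reader (storage, cache, memory).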
Okay, this is taking too long. Breathe. Step back. Again. It’s all stacks. What haven’t I checked.
- Windows event logs. Okay. 19,000 messages. Ugh. Sort by critical. Unscheduled power offs and hibernation problems. Hibernation is problematic to me, why do I even have it on? No idea. Turn off hibernation, wipe out 100 GB pagefile. Ugh. Problems with pagefiles and WSL2, Hyper-V, Linux? None that I can find. Wait, is there some weird write caching happening in Windows that’s selectable? I doubt it. I’d never turn that on. It’s off anyway. Keep digging. Sort errors by critical.
- Disk/IO. Hmm. Oh that reminds me. My R10 (RAID 10) is degraded. I saw that a month ago when I looked at the BIOS screen while it was booting; it stays on the screen for about 500ms. But it skips ahead and boots fine. Nah, can’t be. A degraded R10 won’t break anything. Will it? Nah. Slower? Yeah. CRC errors, seg faults only showing up in a Linux guest? Naaaah.
- More event logs. Sort by critical. It’s all disk’ish. Maybe it’s my RAID. Okay, what else? What kind of RAID? It’s Intel VMD RAID provisioned in the BIOS. Hmm. Why can’t I see the RAID in Windows, now that I think about it? Okay, Windows hides it. I need special software to view the RAID, that’s annoying. What is that? Seems maybe hacky. But it’s worked solidly for 6 months.
- Event logs show delayed writes in the range of 18,000 (!) milliseconds. Okay. What disk is Disk69? I don’t have 69 disks. Who knows. Doesn’t matter, it’s a problem. When did this start? Event logs rotated a month ago. A month ago I didn’t see much. Okay. When did I see that degraded RAID error? Many months ago. Noted but let’s move on for now.
- Is this a write caching thing? Why won’t Windows let me see this RAID from the OS? Ugh. I’m getting stuck in a loop. Stop. Starting to go crazy. Chkdsk all to hell. No problems. RAID is “healthy” according to Disk Management. Don’t trust that, of course. Nothing weird or unique in Windows settings for the volume.
- A few Hyper-V errors. What are these? They are runner errors, process errors. But only a few and intermittent. They are not listed as critical but as warnings. Noted but let’s move on. Let’s go back to Linux.
- Pip install blah while monitoring the journal. Seg fault. Dump shows Python 3.11 mad on 22.04- That should be rock solid. Why so mad? Reevaluate sanity- Spin up cloud instance and Conda env and replicate- zero problems.
- Okay try a Hail Mary- First, make a strong drink. Download new cloud image, test the hash, fine, import it into WSL from an external device and run it from that device. Sorta stupid, still touching root volume and I don’t know what I don’t know. Move tmp to vmdk instead of /mnt . A Hail Mary indeed. Nope. Still broken. Same problems.
- Ok absolutely RAID/volume related to me, I think. Try to find something to look at RAID set on Windows. Nah. Sketchy. Doesn’t matter anyway. Go back into BIOS. Test memory again just in case. Still fine. What kind of drives? Samsung 990s. Solid drives. 4 drives, RAID 10. 3 out of 4 showing. Pull the bad one. Check SMART (Samsung tool) on ghosted drive via an external NVMe enclosure on a laptop. 15K hours, very very low. Rated for way more than that. SMART has ONE reset, otherwise says “good”. Check for complaints for this gen of drives. Almost none to chase. Drive is dead. Or is it? Testing some more. Will report SMART via Samsung util but otherwise it is a ghost. It’s dead, for sure. Why is SMART still working? Whatever, ghost in the machine. It’s dead.
- Am I sure a degraded RAID will work fine except for being slower, which I know about? Not really, but I always thought that it would just be slower. That could be just plain wrong. But Windows has zero visible performance problems, I think; all apps work that aren’t WSL2. But Windows may be obscuring obvious problems. More reading online. Actually, turns out Windows is background fixing and essentially hiding the problems. Ugh. My fault though because Event Viewer was announcing it in so many words. Why didn’t I check Event Viewer sooner? Ugh.
- Is Linux pickier about flipped bits and bytes than Windows? Yes. Plus, all that stacking. RAID, Hyper-V, WSL2, and all that. VMD is suspect. Why? I’m going to blame it for now because I’m shady on Intel RAID in general and especially RAID baked into the chipset of a consumer’ish motherboard.
- Linux crash dumps always point to something IO but also disk, and video, and okay. Probably not going to sort it that way.
- To be continued! Standby.
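Scrolling through 19,000 Event Viewer messages by hand is how the disk errors above stayed hidden for so long. Here is a minimal sketch of tallying them instead, assuming the log has been exported to CSV and that the export has `Level` and `Source` columns. Both column names are assumptions, and `tally_events` and `noisiest` are hypothetical helpers, not part of any Windows tooling.

```python
import csv
import io
from collections import Counter


def tally_events(csv_text, level_column="Level", source_column="Source"):
    """Count events per (level, source) pair from exported CSV text.

    The column names are assumptions; adjust them to match the real export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        level = (row.get(level_column) or "").strip()
        source = (row.get(source_column) or "").strip()
        counts[(level, source)] += 1
    return counts


def noisiest(counts, levels=("Critical", "Error", "Warning"), top=5):
    """Return the most frequent (level, source) pairs among the given levels."""
    filtered = Counter(
        {key: n for key, n in counts.items() if key[0] in levels}
    )
    return filtered.most_common(top)
```

Grouping by source rather than reading entries in time order makes it obvious when one subsystem (disk, in this story) is producing most of the noise, even when none of the individual entries looks like a silver bullet.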