
I started working mostly with LLMs as an amateur (still am, actually). It was grueling work and cost me a lot of money and time (more time than money). I was only fine-tuning, and wow did it take a long time for a little return: tuning itty-bitty hyperparameters with Kohya wasn’t going to fly for what I wanted to do. And I’m only talking about 2023, not 2015. Media was even further off. However, after working on some projects with the NFL I realized that if you were very careful you could use AWS, ffmpeg, Transformers, etc. to get some work done without getting in trouble for spend. Luckily we had a co-sponsorship, so the AI/ML was largely covered.

Today I think it’s a lot easier with quantization and smaller models, “SLMs” (edge models: Moondream2 for an LLM, Qwen 2.5 VL for a multimodal LM, and say Wan 2.2 5B for strictly vision, etc.), and people adopting LoRAs. For amateur work (e.g. I have a few 4090s in a server) the idea of doing anything but low-rank training isn’t very popular with us poor people. A ton of people have built amazing tools on GitHub in the last few years that have literally dropped the barrier to entry down to a person with a 3090 Ti. I mean, that’s an entry-level card these days. That’s great. I really, really respect people (technology aside) who build things that make things more accessible. Henry Petroski, who was a Duke professor of engineering, wrote a bunch of books on engineering failure for laymen. And I am a layman when it comes to mechanical and structural engineering. So he made much more complex stuff more accessible, and I just think that’s awesome. We can’t be excellent at everything, so… if we have a curiosity and someone else is capable of distilling down something we may not have enough “life” time to truly understand, I’ll take the condensed version.

Multimodal has always been attractive to me due to my interest in media and specifically music (and playing it). Some years ago I was talking to a guy on HF who created a music POC that would generate music from the wav files it was trained on, and although it sounded pretty bad, it was super cool to me and you could see how it worked (and didn’t). AI generating music. It was not because I thought AI stood a chance at making music I wanted to hear, nor did I really want it to, but that it could generate music that was at least listenable seemed amazing to me. This is more than just understanding what is harmonic and nice to hear, because that’s not extremely difficult. It had a taste of the importance of arrangements and choruses and so on. Anyway, super cool stuff.

I got distracted from music when Stable Diffusion came out. I remember when I first came across 1.4 (around Halloween, I think) and I was amazed at how easily I could generate terrifying images (which I ended up printing on transparencies and projecting, to the chagrin of neighborhood kids). That set my mind ablaze with other ideas, especially w/r/t shortcutting long and tedious processes, something I’ve always loved. Some people just call it automation 🙂

The thing was, though, I’m not a hugely technically skilled person when it comes to CGI/Adobe/etc. I can get by, but I was sort of aimlessly following the interest. The barrier to entry was low enough to get into trouble but not low enough to actually get done any of the things I thought I could do. That caused me to realize, in a much more humbling and macro way, just how little I knew, and needed to know, to even be able to talk to people about vision and other topics. So the deep dive started, and I’ve come up for air here and there, but it’s looking like I’m going to be underwater for a loooong time. Maybe I’ll find some scallops.

Storm chasing and intelligent flea market eyes.

The main driver that got me obsessively into multimodal, and what can dramatically drive my motivation, is whether I can do tech OUTSIDE. If only I could use tech along with not being inside at a computer all day. This was absolutely key, as I’m very much uninterested in staring at a screen all day, especially if the tech is out there and I can bring it home to wrench on; plus, a lot of the tech I was researching needed ‘boots on the ground’ to gather data and so on. Think helping botanists differentiate two types of very very VERY different plants (one being not a friend) for their students. This is something that I loved when tornado chasing. We’d go chasing, gather a ton of info, maybe or maybe not see a tornado (probably not), but capture a bunch of windfield/doppler/pressure etc. information to come home and crunch on. Overlay that with the surface maps at the time, signatures, CAPE, SRH, etc. (which at the beginning had to be pulled with acoustic couplers (modems) attached to payphones), plus how the storm actually concluded, and you can paint a picture of what happened (or didn’t happen). To me that was just about as awesome as I could ask for. Marrying all these bits and pieces into something at least somewhat less chaotic was attractive, and I was learning hands-on all the while.

But… what does this have to do with multimodal anything? Not much, although I have seen some really cool models that will put any scale tornado from EF0 to EF5 in your living room (that’s a scale model, not keeping EF5 power and blowing it up to an 8×10′ tornado; naturally, that would not end well). I mean, that’s important, right?


Swainsboro Flea Market

But the flea market part was practical. A town a few hours from Savannah is Swainsboro, which contains my name, so I’ll bite. We cruised through there about 8 years ago and it was dead as dead. I mean antique stores, mostly closed, not a single commercial shop like a Subway or a Pizza Hut, but of course a Dollar Gen within rock-throwing distance. Real small town. We rambled through the antique stores and they were packed wall to wall with stuff. A lot of the objects were impossibly small and most had price tags that were slightly readable, maybe. All of the stuff in this joint was on consignment, so nobody is usually there watching over their stuff. There’ll be a few folks up front who will ring Sally with the “Live Laugh Love” whiteboard to see if she wants to let it go for whatever. This also means all of the objects are sorted and labeled by different people. Different setups, booths, and so on. I had my phone and said to myself, hm. At the time most vision models were at most 336x for detection. My phone is… I had no idea… 10MP?

Doing the math, it looked like my D4 at ~40MP would need to be at about 1′ from the object to get a solid 336x crop. That’s purely optical, which I think is critical for what I was thinking of doing (identifying, correctly).
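Here’s that napkin math in code form. The 50 mm lens, the 36 mm full-frame sensor width, and the coin-ish object size below are assumed numbers just to show the relationship (not a record of my actual setup), and the answer swings a lot with the lens and with how tiny the trinket is:

```python
def max_distance_m(object_width_m: float,
                   focal_length_mm: float,
                   sensor_width_mm: float,
                   image_width_px: int,
                   target_px: int = 336) -> float:
    """Farthest distance at which an object still spans target_px optical pixels.

    Thin-lens approximation (distance >> focal length): the object's width on
    the sensor is object_width * focal_length / distance, and the pixel count
    scales linearly with that width.
    """
    px_per_mm = image_width_px / sensor_width_mm   # sensor pixel density
    return object_width_m * focal_length_mm * px_per_mm / target_px

# Assumed numbers: ~40MP 3:2 frame (~7700 px wide), 36 mm sensor, 50 mm lens,
# a 3 cm knick-knack, and a 336 px detection crop.
print(round(max_distance_m(0.03, 50, 36, 7700), 2), "m")
```

With those assumed numbers it lands around a meter; shrink the object to a centimeter or so and it drops to roughly a foot, which is where the really tiny stuff puts you.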

Yeah okay, what are you thinking of doing? I was thinking of automatically going through all the images, separating each unique object, and cross-referencing each with an existing object in a database. To provide general information, maybe, but mostly I just wanted to see if an item was under- or overpriced. There was no way you could look through 100,000 things at this place, hunting for something valuable, without spending a month there. Of course there’s the fairly well-known understanding that the actually valuable items end up on eBay. But you never know. Additionally, one man’s trash and so forth. That little teapot may be one of a set of 6 that is worth $5.99 if you have five, or $199.99 if you have six. I’m not trying to get rich here, I’m just trying to use tech for real-world stuff that makes things easier. There’s all sorts of potential for people who have bad vision, people who can see but can’t reach down, and so forth. I don’t really want a valuable teapot, though.
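For the cross-referencing piece, one plausible shape (a sketch, not the exact thing I ran) is embedding similarity: run each cropped object through an off-the-shelf image encoder, CLIP here as an example, and take the nearest neighbor from a pre-embedded catalog. The catalog files and the crop filename below are made up for illustration.

```python
import json

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative encoder choice; any image-embedding model would work here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding for one crop."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical catalog: one embedding per known object (built with embed()),
# plus a parallel list of records (name, the set it belongs to, rough price, ...).
catalog_vecs = torch.load("catalog_embeddings.pt")        # shape (N, 512)
catalog_meta = json.loads(open("catalog_meta.json").read())

crop = Image.open("booth12_item_0042.jpg")
scores = (embed(crop) @ catalog_vecs.T).squeeze(0)        # cosine similarity
best = int(scores.argmax())
print(catalog_meta[best], float(scores[best]))
```

Whether the top hit is trustworthy is another matter; a similarity threshold and a human glance at the top few matches is probably the honest version.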

So I took the images home, ran some scripts to separate everything and then identify it, threw the metadata into sidecar files, and kept the IPTC/GPS data for good measure.
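Roughly the shape of those scripts (a sketch with placeholder paths and thresholds, not the originals): threshold-and-contour each photo to cut out individual objects, then drop a JSON sidecar next to every crop carrying its bounding box and whatever EXIF/GPS the camera recorded.

```python
import json
from pathlib import Path

import cv2
from PIL import Image, ExifTags   # recent Pillow for Exif.get_ifd()

SRC = Path("flea_market_raw")      # placeholder input/output folders
OUT = Path("flea_market_crops")
OUT.mkdir(exist_ok=True)

for photo in sorted(SRC.glob("*.jpg")):
    # Keep whatever EXIF the camera wrote, GPS IFD (tag 0x8825) included.
    exif = Image.open(photo).getexif()
    meta = {ExifTags.TAGS.get(k, str(k)): str(v) for k, v in exif.items()}
    gps = {str(k): str(v) for k, v in exif.get_ifd(0x8825).items()}

    # Crude object separation: Otsu threshold + external contours.
    img = cv2.imread(str(photo))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(blur, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    for i, c in enumerate(contours):
        x, y, w, h = cv2.boundingRect(c)
        if w < 336 or h < 336:         # skip anything below a solid optical crop
            continue
        stem = f"{photo.stem}_item_{i:04d}"
        cv2.imwrite(str(OUT / f"{stem}.jpg"), img[y:y + h, x:x + w])
        sidecar = {"source": photo.name, "bbox": [x, y, w, h],
                   "exif": meta, "gps": gps}
        (OUT / f"{stem}.json").write_text(json.dumps(sidecar, indent=2))
```

A detector or segmentation model would separate a cluttered booth far better than Otsu thresholding, but the sidecar-per-crop idea stays the same.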

TBC