More

mark_l_watson · 2026-04-24T16:05:17 1777046717

The flash version is smaller, I think around 200B parameters and is cheap to run.

mark_l_watson · 2026-04-24T16:03:19 1777046599

I used the flash version on a tricky Common Lisp coding problem this morning. The first cut of the new library had a runtime error. I was running in a simple REPL using:

ollama run deepseek-v4-flash:cloud

so I had to feed the generated code and the error back into the REPL manually, but it nailed it the second time, and the Common Lisp code was very good.

mark_l_watson · 2026-04-22T17:31:08 1776879068

I have been running the slightly larger 31B model for local coding:

ollama launch claude --model qwen3.6:35b-a3b-nvfp4

This has been optimized for Apple Silicon and runs well on a 32G ram system. Local models are getting better!

yougotwill · 2026-04-22T18:01:12 1776880872

Can I ask how much RAM of the 32GB does it use? For example can I run a browser and VS Code at the same time?

mswphd · 2026-04-22T22:54:37 1776898477

1. the 35B model is a "Mixture of Experts" model. So the earlier commenter's point that it is "larger" does not mean it is more capable. Those types of models only have certain parts of themselves active (for 35b-A3b, it's only 3 billion parameters at a time, vs 27 billion for the model this post is about) at a time to speed up inference. So if you're interested in these things for the first time, Qwen3.6-35B-A3B is a good choice, but it is likely not as capable as the model this thread is about.

2. its hard to cite precise numbers because it depends heavily on configuration choices. For example

2a. on a macbook with 32GB unified memory you'll be fine. I can load a 4 bit quant of Qwen3.6-35B-A3B supporting max context length using ~20GB RAM.

2b. that 20GB ram would not fit on many consumer graphics cards. There are still things you can do ("expert offloading"). On my 3080, I can run that same model, at the same quant, and essentially the same context length. This is despite the 3080 only having ~10GB VRAM, by splitting some of the work with the CPU (roughly).

Layer offloading will cause things to slow down compared to keeping layers fully resident in memory. It can still be fast though. Iirc I've measured my 3080 as having ~55 tok/s, while my M4 pro 48GB has maybe ~70 tok/s? So a slowdown but still usable.

If you want to get your feet wet with this, I'd suggest trying out

* Lmstudio, and * the zed.dev editor

they're both pretty straightforward to setup/pretty respectable. zed.dev gives you very easy configuration to get something akin to claude code (e.g. an agent with tool calling support) in relatively little time. There are many more fancy things you can do, but that pair is along the lines of "setup in ~5 minutes", at least after downloading the applications + model weights (which are likely larger than the applications). This is assuming you're on mac. The same stack still works with nvidia, but requires more finnicky setup to tune the amount of expert offloading to the particular system.

It's plausible you could do something similar with LMstudio + vscode, I'm just less familiar with that.

mark_l_watson · 2026-04-22T12:58:35 1776862715

I have an old Mac Mini with 32G of integrated RAM, and the following works for me for small local code changes:

ollama launch claude --model qwen3.6:35b-a3b-nvfp4

In addition to not having an integrated web search tool, one drawback is that it runs more slowly than using cloud servers. I find myself asking for a code or documentation change, and then spending two minutes on my deck getting fresh air waiting for a slower response. When using a fast cloud service I can be a coding slave, glued to my computer. Still, I like running local when I can!

mark_l_watson · 2026-04-22T12:48:37 1776862117

I am on Google's $20/month plan, and I usually get about three half-hour coding sessions a week with AntiGravity using the Claude models. The limit using Gemini Pro models is much higher. I am retired so Google's $20 plan is sufficient for me, but I understand that people who are still working would need higher limits.

I am also on a $10/month plan with Nous Research for supplying open models for their open source Hermes Agent. I run Hermes inside a container, on a dedicated VPS as a coding agent for complex tasks and so far I find the $10/month plan is enough for about five to ten major tasks a month. I think it is also a good deal.

mark_l_watson · 2026-04-21T16:37:48 1776789468

I hope this is not off topic, too much: with the current geopolitical situation I expect reduced capacity to manufacture both memory chips and all types of CPUs/GPUs. I base this on news I read from: Japan, South Korea, and Singapore.

If I am correct (and I hope that I am wrong!) this will drastically increase the cost of building these new data centers.

mark_l_watson · 2026-04-20T12:24:34 1776687874

I ran OpenClaw in a container, on a VPS without connection to messaging systems, so perhaps that is why I didn't get value.

Similarly, I have been using Hermes Agent also inside a container, and on a VPS with only access to a local directory in the VPS with a dozen active projects on GitHub. I don't give it access to my GitHub credentials, but allow it to work in whatever branch is checked out.

This setup is fabulously productive. I use it about every other day to perform some meaningful task for me. It is inexpensive also. A task might take 20 minutes and cost $0.25 in GLP-5.1 API costs.

So TLDR: out of the box, I use Hermes at least one hour a week and find it to be a wonderful tool.

mark_l_watson · 2026-04-20T11:40:37 1776685237

The new movie Mercy is a good take in this, as fiction.

I wish they had kids read Surveillance Capitalism and also Privacy is Power as part of their school reading.

mark_l_watson · 2026-04-19T15:53:25 1776614005

Well, I started a long time ago: I had written several books and readers occasionally contacted me for doing gig work.

mark_l_watson · 2026-04-17T16:10:28 1776442228

I agree and I am amazed at how much money some individuals and also a friend's company burn on token costs. I get huge benefits from this tech just using gemini-cli and Antigravity a few times a week, briefly. I also currently invest about $15/month in GLM-5.1 running Hermes Agent on a small dedicated VPS - fantastically good value for getting stuff done and this requires little of my time besides planning what I need done.

I think the token burners are doing it wrong. I think that long term it is better to move a little slower, do most analysis and thinking myself, and just use AI when the benefits are large while taking little of my time and money to use the tools.