I used the flash version on a tricky Common Lisp coding problem this morning. The first cut of the new library had a runtime error. I was running in a simple REPL using:
ollama run deepseek-v4-flash:cloud
so I had to feed the generated code and the error back into the REPL manually, but it nailed it the second time, and the Common Lisp code was very good.
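For anyone who wants to automate that manual round trip, here is a rough Python sketch of the loop I was doing by hand: generate code, run it, and if it fails, send the code plus the error back. All names here (`build_repair_prompt`, `repair_loop`, the `generate` and `run` callables) are hypothetical, not part of ollama or any library; you would supply your own `generate` (something that sends a prompt to the model) and `run` (something that executes the code and returns the error text, or None on success).

```python
from typing import Callable, Optional, Tuple


def build_repair_prompt(task: str, code: str, error: str) -> str:
    """Combine the original task, the failing code, and the error into one prompt."""
    return (
        f"Task:\n{task}\n\n"
        f"Your previous code:\n{code}\n\n"
        f"Running it produced this error:\n{error}\n\n"
        "Please return a corrected version of the full code."
    )


def repair_loop(
    task: str,
    generate: Callable[[str], str],       # hypothetical: prompt -> generated code
    run: Callable[[str], Optional[str]],  # hypothetical: code -> error text, or None if it ran cleanly
    max_rounds: int = 3,
) -> Tuple[str, int, bool]:
    """Return (last code, rounds used, whether the last run succeeded)."""
    prompt = task
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        error = run(code)
        if error is None:
            return code, round_no, True
        # Feed the failing code and its error back, as I did manually in the REPL.
        prompt = build_repair_prompt(task, code, error)
    return code, max_rounds, False


if __name__ == "__main__":
    # Stand-in model and runner so the sketch runs without any external tools.
    canned = iter(['(defun f () (error "boom"))', "(defun f () 42)"])
    result = repair_loop(
        "write f",
        generate=lambda prompt: next(canned),
        run=lambda code: "boom" if "error" in code else None,
    )
    print(result)
```

In my case this succeeded on the second round, which is what the stand-in demo imitates. A cap like `max_rounds` is worth keeping so a model that never converges doesn't loop forever.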
1. The 35B model is a "Mixture of Experts" model, so the earlier commenter's point that it is "larger" does not mean it is more capable. These models keep only part of their parameters active for each token to speed up inference (for 35B-A3B, only about 3 billion at a time, vs. 27 billion for the model this post is about). So if you're interested in these things for the first time, Qwen3.6-35B-A3B is a good choice, but it is likely not as capable as the model this thread is about.
2. It's hard to cite precise numbers because they depend heavily on configuration choices. For example:
2a. On a MacBook with 32GB unified memory you'll be fine. I can load a 4-bit quant of Qwen3.6-35B-A3B supporting max context length using ~20GB RAM.
2b. That 20GB would not fit on many consumer graphics cards, but there are still things you can do ("expert offloading"). On my 3080, I can run that same model, at the same quant, with essentially the same context length. This works even though the 3080 only has ~10GB VRAM, by splitting some of the work with the CPU (roughly).
Offloading layers will slow things down compared to keeping them fully resident in GPU memory, but it can still be fast. IIRC I've measured my 3080 at ~55 tok/s, while my M4 Pro 48GB gets maybe ~70 tok/s? So a slowdown, but still usable.
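To make the arithmetic behind those numbers concrete, here is a back-of-the-envelope Python sketch. It assumes a nominal 4 bits per weight (real quant formats carry some extra overhead), treats the gap between the weights and my ~20GB figure as context cache and runtime overhead, and reserves 1.5GB of VRAM for everything else on the card. The function names and the reserve value are my own assumptions for illustration, not anything LM Studio or ollama exposes.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def offload_split(total_gb: float, vram_gb: float, reserve_gb: float = 1.5):
    """Split a memory footprint into (GB kept on the GPU, GB pushed to CPU RAM).

    reserve_gb is an assumed allowance for the display, driver, and scratch buffers.
    """
    usable = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(total_gb, usable)
    return on_gpu, total_gb - on_gpu


def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of a Mixture of Experts model's parameters used per token."""
    return active_billion / total_billion


if __name__ == "__main__":
    weights = weight_gb(35, 4)            # 35B parameters at a nominal 4 bits each
    gpu, cpu = offload_split(20.0, 10.0)  # ~20GB total footprint, ~10GB VRAM on a 3080
    print(f"weights alone: {weights:.1f} GB")
    print(f"on GPU: {gpu:.1f} GB, offloaded to CPU: {cpu:.1f} GB")
    print(f"active per token: {active_fraction(3, 35):.1%}")
```

Under these assumptions the weights alone come to 17.5GB, and on a ~10GB card more than half of the ~20GB footprint has to live in system RAM. Since only about 3 of the 35 billion parameters are active per token, that split hurts less than it would for a dense model of the same size.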
If you want to get your feet wet with this, I'd suggest trying out
* LM Studio, and
* the zed.dev editor
They're both pretty straightforward to set up and pretty respectable. zed.dev gives you very easy configuration to get something akin to Claude Code (e.g. an agent with tool calling support) in relatively little time. There are many fancier things you can do, but that pair is along the lines of "set up in ~5 minutes", at least after downloading the applications + model weights (which are likely larger than the applications). This is assuming you're on a Mac. The same stack still works with Nvidia, but requires more finicky setup to tune the amount of expert offloading to the particular system.
It's plausible you could do something similar with LM Studio + VS Code; I'm just less familiar with that.
I have an old Mac Mini with 32GB of unified memory, and the following works for me for small local code changes:
ollama launch claude --model qwen3.6:35b-a3b-nvfp4
In addition to not having an integrated web search tool, one drawback is that it runs more slowly than using cloud servers. I find myself asking for a code or documentation change, and then spending two minutes on my deck getting fresh air waiting for a slower response. When using a fast cloud service I can be a coding slave, glued to my computer. Still, I like running local when I can!
I am on Google's $20/month plan, and I usually get about three half-hour coding sessions a week with AntiGravity using the Claude models. The limit using Gemini Pro models is much higher. I am retired so Google's $20 plan is sufficient for me, but I understand that people who are still working would need higher limits.
I am also on a $10/month plan with Nous Research for supplying open models for their open source Hermes Agent. I run Hermes inside a container, on a dedicated VPS as a coding agent for complex tasks and so far I find the $10/month plan is enough for about five to ten major tasks a month. I think it is also a good deal.
I hope this is not too off topic: with the current geopolitical situation I expect reduced capacity to manufacture both memory chips and all types of CPUs/GPUs. I base this on news I read from Japan, South Korea, and Singapore.
If I am correct (and I hope that I am wrong!) this will drastically increase the cost of building these new data centers.
I ran OpenClaw in a container, on a VPS without connection to messaging systems, so perhaps that is why I didn't get value.
I have also been running Hermes Agent inside a container on a VPS, where it only has access to a local directory on the VPS containing a dozen of my active GitHub projects. I don't give it access to my GitHub credentials, but I allow it to work in whatever branch is checked out.
This setup is fabulously productive. I use it about every other day to perform some meaningful task for me. It is also inexpensive: a task might take 20 minutes and cost $0.25 in GLM-5.1 API costs.
So TLDR: out of the box, I use Hermes at least one hour a week and find it to be a wonderful tool.
I agree and I am amazed at how much money some individuals and also a friend's company burn on token costs. I get huge benefits from this tech just using gemini-cli and Antigravity a few times a week, briefly. I also currently invest about $15/month in GLM-5.1 running Hermes Agent on a small dedicated VPS - fantastically good value for getting stuff done and this requires little of my time besides planning what I need done.
I think the token burners are doing it wrong. I think that long term it is better to move a little slower, do most analysis and thinking myself, and just use AI when the benefits are large while taking little of my time and money to use the tools.