The inference code and model architecture IS open source[0] and there are many other high quality open source implementations of the model (in many cases contributed by Google engineers[1]). To your point: they do not publish the data used to train the model so you can't re-create it from scratch.
If for some reason you had the training data, is it even possible to create an exact (possibly same hash?) copy of the model? Seems like there are a lot of other pieces missing like the training harness, hardware it was trained on, etc?
Yes, this is true. A lot of the time labs will hold back infrastructure pieces that are necessary to train huge models reliably and on a practical time scale. For example, many have custom alternatives to Nvidia's NCCL library for fast multi-GPU collective communication.
DeepSeek published a lot of their work in this area earlier this year, and as a result the barrier isn't as high as it used to be.
To be entirely fair, that's quite a high bar even for most "traditional" open source.
And even if you had the same data, there's no guarantee the random perturbations during training are driven by a PRNG and done in a way that is reproducible.
Reproducibility does not make something open source. Reproducibility doesn't even necessarily make something free software (under the GNU interpretation). I mean hell, most docker containers aren't even hash-reproducible.
That is a completely different discussion. Otherwise, by this logic even Gemini 2.5 Pro would be open source, since the clients for interacting with the cloud APIs are open source.
Yes!! But I doubt many of these are truly open source models, since most people just confuse open source with open weights and the definition has really been changed, smh.
Are you sure? On a quick look, it appears to use its own bespoke license, not the Apache 2.0 license. And that license appears to have field of use restrictions, which means it would not be classified as an open source license according to the common definitions (OSI, DFSG, FSF).
I suspect the difference is in the training data. Gemini is much more locked down, and if it tries to repeat something from the training data verbatim you will get a 'recitation error'.
Gemma 4? I feel that one was incredibly obvious. Let us please just increase the version numbers.
Anthropic is better about this, but then shifted their ordering with the v4 models. Arguably better, but still quite annoying since everything pre-4 uses a different naming scheme.
What do you mean by Anthropic shifting their ordering? It seems to still be consistently Opus > Sonnet > Haiku. They didn't release 4 Haiku, but they also didn't release 3.5 Opus, and pricing wise Sonnet 4 lines up with earlier Sonnets.
As for this Gemma release, I don't think Gemma 4 would be an appropriate name. 3n is limited to very small versions (like 8B total parameters) and is therefore likely less powerful than Gemma 3.
From my impression this is more like a "Gemma 3 Lite" that provides a better speed/quality tradeoff than the smaller Gemma 3 models.
Tried the E4B model in Ollama and it's totally broken when interpreting images. The output depends only on the text and is consistent in that way, but otherwise completely wrong.
Works fine with regular Gemma 3 4B, so I'll assume it's something on Ollama's side. Edit: yep, text-only for now[1]; it would be nice if that was a bit more prominent than buried in a ticket...
Don't feel like compiling llama.cpp myself, so I'll have to wait to try your GGUFs there.
> It’s supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others
Does anybody know how to actually run these using MLX? mlx-lm does not currently seem to support them, so I wonder what Google means exactly by "MLX support".
Do you know if these actually preserve the structure of Gemma 3n that makes these models more memory efficient on consumer devices? I feel like the modified inference architecture described in the article is what makes this possible, but it probably needs additional software support.
But given that they were uploaded a day ago (together with the blog post), maybe these are actually the real deal? In that case, I wish Google could just link to these instead of to https://huggingface.co/mlx-community/gemma-3n-E4B-it-bf16.
Edit: Ah, these are just non-MLX models. I might give them a try, but not what I was looking for. Still, thank you!
That's a great question that is beyond my technical competency in this area, unfortunately. I fired up LM Studio when I saw this HN post, and saw it updated its MLX runtime [0] for gemma3n support. Then went looking for an MLX version of the model and found that one.
I'd genuinely like to know how these small models are useful for anyone. I've done a lot of experimenting, and anything smaller than 27B is basically unusable, except as a toy. All I can say for smaller models is that they sometimes produce good answers, which is not enough for anything except monkeying around.
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
Qwen2.5-VL 7B is pretty impressive at turning printed or handwritten maths lecture notes into LaTeX code, and is small enough to run (slowly) on a laptop without enough VRAM. Gemma 3 4B was useless at this though, and got stuck in loops or tried to solve the maths problems instead of just converting the working to LaTeX (but it was much faster as it fit into VRAM).
It sounds like you're trying to use them like ChatGPT, but I think that's not what they're for.
I use the gemma3:1b model (well, gemma3n:e2b since today) to summarize articles in my RSS reader. It works extremely well for such a simple task and runs on CPU on my Hetzner server, so I don't have to pay the electricity bill for running it on a GPU at home.
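For anyone curious, a minimal sketch of that summarization step, assuming a local Ollama server with the model pulled (the prompt and timeout are just illustrative):

```python
# Minimal sketch: summarize an article body with a small local model through
# Ollama's HTTP API. Assumes `ollama serve` is running locally and the
# gemma3n:e2b model has already been pulled; prompt and timeout are illustrative.
import requests

def summarize(article_text: str, model: str = "gemma3n:e2b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Summarize this article in 3 short bullet points:\n\n" + article_text,
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(summarize("(paste an article body here)"))
```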
I am sure as ideation devices these can work fine. I treat this more like basic infra. I would absolutely love the future where most phones have some small LLM built in, kind of like a base layer of infra
There are use cases where even low accuracy could be useful. I can't predict future products, but here are two that are already in place today:
- On the iPhone keyboard, some sort of tiny language model suggests what it thinks are the most likely follow-up words as you write. You just pick a suggested next word if it matches what you were planning to type.
- Speculative decoding is a technique which utilizes smaller models to speed up inference for bigger models (see the toy sketch after this comment).
I'm sure smart people will invent other future use cases too.
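To make the speculative decoding point concrete, here is a toy sketch of the greedy variant. `draft_next` and `target_next` are hypothetical stand-ins for a small and a large model; a real implementation verifies all k draft positions in a single forward pass of the big model, which is where the speed-up comes from.

```python
# Toy sketch of (greedy) speculative decoding: a cheap draft model proposes k
# tokens, the expensive target model checks them, and we keep the longest
# agreeing prefix plus the target's own token at the first disagreement.
# `draft_next` and `target_next` are hypothetical stand-ins for real models.
# A real implementation scores all k draft positions in ONE forward pass of the
# target model; this toy calls it per token just to keep the logic readable.
from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # small model: greedy next token
    target_next: Callable[[List[int]], int],  # big model: greedy next token
    k: int = 4,
    max_new_tokens: int = 32,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1. Draft k tokens cheaply.
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(out + drafted))
        # 2. Verify: accept draft tokens while the target agrees; at the first
        #    disagreement, take the target's token instead and start a new round.
        for tok in drafted:
            target_tok = target_next(out)
            if target_tok == tok:
                out.append(tok)
            else:
                out.append(target_tok)
                break
            if len(out) - len(prefix) >= max_new_tokens:
                break
    return out
```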
The best use case I've found for tiny models (<5bn params) is as a reference tool for when I don't have WiFi. I've been using Qwen on my MacBook Air as a replacement for Google while I'm writing code on flights. They work great for asking basic questions about syntax and documentation.
Tiny models (4B or less) are designed for fine-tuning on narrow tasks; that way they can outperform large commercial models at a tiny fraction of the price. They're also great for code autocomplete.
7B-8B models are great coding assistants if all you need is dumb, fast refactoring that can't quite be done with macros and standard editor functionality but is still primitive, such as "rename all methods having at least one argument of type SomeType by prefixing their names with ST_".
12B is the threshold where models start writing coherent prose, such as Mistral Nemo or Gemma 3 12B.
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
I read the general parts and skimmed the inner workings but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmark/tasks did it improve upon?
Before it goes into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to those that use 10GB+ of RAM, if I understand that right?
Huh? I'm pretty sure I ran Gemma on my phone last month. Or is there a difference between downloadable (you get the weights because it's necessary to run the thing) and "open" weights?
I think the other poster is confused. Both Gemma 3 and Gemma 3n are open-weight models.
Google's proprietary model line is called Gemini. There is a variant that can be run offline called Gemini Nano, but I don't think it can be freely distributed and is only allowed as part of Android.
As for what's new, Gemma 3n seems to have some optimizations done to it that lead it to be better than the 'small' Gemma 3 models (such as 4B) at similar speed or footprint.
What are some use cases for these small local models, for individuals? Seems like for programming-related work the proprietary models are significantly better, and that's all I really use LLMs for personally.
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
For me? Handling data like private voice memos, pictures, videos, calendar information, emails, some code, etc. Stuff I wouldn't want to share on the internet / have a model potentially slurp up and regurgitate as part of its memory when the data is invariably used in some future training process.
I’m thinking about building a pipeline to mass generate descriptions for the images in my photo collection, to facilitate search. Object recognition in local models is already pretty good, and perhaps I can pair it with models to recognize specific people by name as well.
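A rough sketch of what that pipeline could look like, assuming a local Ollama server with a vision-capable model pulled (the model tag, paths, and prompt are placeholders):

```python
# Rough sketch: generate a one-line searchable description for each photo with
# a local vision-capable model via Ollama's HTTP API. Assumes `ollama serve` is
# running and a multimodal model has been pulled; model tag, directory and
# prompt are placeholders.
import base64
import json
import pathlib
import requests

PHOTO_DIR = pathlib.Path("~/Pictures").expanduser()
MODEL = "gemma3:4b"  # placeholder: any locally pulled vision-capable model tag

def describe(image_path: pathlib.Path) -> str:
    img_b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": "Describe this photo in one short sentence for search indexing.",
            "images": [img_b64],  # Ollama accepts base64-encoded images here
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    index = {str(p): describe(p) for p in sorted(PHOTO_DIR.glob("*.jpg"))}
    pathlib.Path("photo_index.json").write_text(json.dumps(index, indent=2))
```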
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
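In the meantime, a rough DIY version is to call a search API yourself and stuff the top results into the local model's prompt. A minimal sketch, assuming a Google Programmable Search Engine key and ID (placeholders here) plus a local Ollama server:

```python
# Rough sketch of DIY search grounding: fetch a few results from the Custom
# Search JSON API, then ask a local model to answer using those snippets.
# GOOGLE_API_KEY and SEARCH_ENGINE_ID are placeholders you must provide;
# the local model tag is illustrative.
import os
import requests

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
SEARCH_ENGINE_ID = os.environ["SEARCH_ENGINE_ID"]

def web_search(query: str, n: int = 5) -> list:
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": GOOGLE_API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "num": n},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

def grounded_answer(question: str, model: str = "gemma3n:e4b") -> str:
    snippets = "\n".join(
        f"- {item['title']}: {item.get('snippet', '')}" for item in web_search(question)
    )
    prompt = (
        "Answer the question using only these search results:\n"
        f"{snippets}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```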
Can you recommend a specific piece of software for using an MCP integration for web searches with local LLMs? That’s the first time I’ve heard of this.
Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.
> Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.
What I meant is that paying for 10k queries per day does not make sense compared to a simple $20/month ChatGPT subscription.
If you actually did 10k queries a day with any free search service you'd quickly find yourself banned.
ChatGPT deep research does a lot of searches, but it's also heavily rate-limited even on paid accounts.
Building a search index, running a lot of servers, storing and querying all that data costs money. Even a cheap search API is gonna cost a bit for 10k queries if the results are any good.
Kagi charges me 15 USD and last month I've done 863 searches with it, more than worth it for the result quality. For me the Google API would be cheaper. I'm pretty sure Kagi would kick me out if I did 10k searches a day even if I'm paying for "unlimited searches".
A similar API from Bing costs between 15-25 USD for 1000 searches. Exa.ai costs 5 dollars for 1000 searches, and it goes up to 25 dollars if you want more than 25 results for your query.
It depends on your idea of decent speeds and what you would use it for. I just tried it on a laptop with an AMD HX 370 running on battery in power save mode and it's not especially impressive, although it runs much better in balanced or performance mode. I gave it the prompt "write a fizzbuzz program in rust" and it took almost a minute and a half. I expect it to be pretty terrible on an SBC. Your best bet is to try it out on the oldest hardware you have and figure out if you can tolerate worse performance.
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact I'm thinking of trying to install it on a VPS so I can have an alternative to pricy APIs.
Well, see it the other way, there is something positive: commenters here on HN claim that AI is useless. You can now also join the bandwagon of people who have free time.
It seems way worse than other small models, including responding with complete non sequiturs. I think my favorite small model is still DeepSeek distilled with Llama 8B.
Anyone have any idea on the viability of running this on a Pi5 16GB? I have a few fun ideas if this can handle working with images (or even video?) well.
The 4-bit quant weighs 4.25 GB, and then you need space for the rest of the inference process. So, yeah, you can definitely run the model on a Pi; you may just have to wait some time for results.
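For a rough sense of where that 4.25 GB comes from, a back-of-envelope estimate (assuming roughly 8B total parameters, as mentioned elsewhere in the thread):

```python
# Back-of-envelope: a 4-bit quant of an ~8B-parameter model (parameter count is
# an assumption based on the "like 8B total parameters" mentioned in the thread).
params = 8e9
pure_4bit_gb = params * 4 / 8 / 1e9          # 4.0 GB at exactly 4 bits/weight
print(f"~{pure_4bit_gb:.1f} GB before overhead")
# Real GGUF quants keep some tensors at higher precision and add metadata,
# which is roughly how you end up at the quoted ~4.25 GB file size.
```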
You're right, I should have checked the model settings. For some reason the default model profile in Ollama had temperature set to 0. Changing the temperature and repeat penalty worked much better than it did when I tried to correct similar behavior in the smallest phi4 reasoning model.
Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
If I download it, run it on Pixel Fold, actual 2B model which is half the size of the ones the 60 fps claim is made for, it takes 6.2-7.5 seconds to begin responding (3 samples, 3 diff photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that inter alia, wraps llama.cpp on all platforms)
So, *0.16* frames a second, not 60 fps.
The blog post is so jammed up with so many claims re: this is special for on-device and performance that just...seemingly aren't true. At all.
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something, big or
B) they are lying, repeatedly, big time, in a way that would be shown near-immediately when you actually tried building on it because it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
I agree that's the most likely interpretation - does it read as a shell game to you? Like, it can do that but once you get the thing that can use the output involved it's 1/100th of that? Do they have anything that does stuff with the outputs from just MobileNet? If they don't, how are they sure I can build 60 fps realtime audiovisual experiences they say I can?
Classify/similarity/clustering works fine with just an encoder, doesn't it?
I guess there's benefit to running that step without subsampling to the initial 256 tokens per image/frame (https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_...) to go on from that. https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which makes the 60 Hz frame digestion rate produce just under 31.5 million floats-of-your-chosen-precision per second.
At least at the high (768x768) input resolution, this is a bit less than one float per pixel.
I guess maybe with very heavy quantizing to like 4 bit that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 token/s at (streaming) prefill stage?
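As a quick sanity check, the figures above work out as follows (assuming 256 tokens per frame and 2048 dims per token, which is what the 31.5M number implies):

```python
# Quick check of the throughput figures above, assuming 256 tokens per frame
# and 2048-dimensional tokens at 60 fps.
tokens_per_frame = 256
dims_per_token = 2048
fps = 60
pixels_per_frame = 768 * 768

floats_per_second = tokens_per_frame * dims_per_token * fps
tokens_per_second = tokens_per_frame * fps
floats_per_pixel = floats_per_second / (pixels_per_frame * fps)

print(f"{floats_per_second / 1e6:.1f} M floats/s")  # ~31.5 M
print(f"{tokens_per_second} tokens/s")              # 15360
print(f"{floats_per_pixel:.2f} floats per pixel")   # ~0.89, a bit under 1
```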
Or I could imagine just local on-device visual semantic search by expanding the search query into a bunch of tokens that have some signed desire/want-ness each and where the search tokens get attended to the frame's encoded tokens, activation function'd, scaled (to positive/negative) by the search token's desire score, and then just summed over each frame to get a frame score which can be used for ranking and other such search-related tasks.
(For that last thought, I asked Gemini 2.5 Pro to calculate the FLOPS load, and it came out to 1.05 MFLOPS per frame per search token; Reddit suggests the current Pixel's TPU does around 50 TOPS, so if these reasonably match terminology-wise, and assuming we're spending about 20% of its compute on the search/match aspect, it comes out to an unreasonable-seeming ~190k tokens that the search query could get expanded to.
I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before encountering throughput issues in this.)
> Mostly because 3P support has not been an engineering priority.
Got it: assuming you're at Google, in eng. parlance, it's okay if it's not Prioritized™ but then product/marketing/whoever shouldn't be publishing posts around the premise it's running 60 fps multimodal experiences on device.
They're very, very lucky that the ratio of people vaguely interested in this to people who follow through on using it is high, so comments like mine end up at -1.
Man this is a funny situation. Ty for sharing, more or less confirms my understanding. Couldn't quite believe it when I was in Google, or out of Google. This should be a big scandal afaict. What is going on???
(n.b. to readers, if you click through, the Google Pixel Tensor API is coming soon. So why in the world has Google been selling Tensor chips in Pixel as some big AI play since...idk, at least 2019?)
Yes, you can use first-party models on the Pixel NPUs or you're stuck with NNAPI which is self-admittedly deprecated by Google and doesn't work all that well.
On third party model workloads, this is what you will get:
Google is clearly not serious on Pixels in practice, and the GPU performance is also behind by quite a lot compared to flagships, which really doesn't help. CPUs are also behind by quite a lot too...
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
I made a simple website[0] to check online model MMLU quickly (runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on LMArena Leaderboard is just some form of gaming the metric.
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if it means it is a better talker.
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
My post politely describing how this blog post does not match Google's own app running inference on Pixel is downvoted to -1, below dead posts with one-off short jokes.
I am posting again because I've been here 16 years now, it is very suspicious that happened, and given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge on this, who notes this won't work as advertised unless you're Google (i.e. internally, they have it binding to a privileged system process that can access the Tensor core, and this isn't available to third parties. Anyone else is getting 1/100th of the speeds in the post)
This post promises $150K in prizes for on-device multimodal apps and tells you it's running at up to 60 fps, they know it runs at 0.1 fps, Engineering says it is because they haven't prioritized 3rd parties yet, and somehow, Google is getting away with this.
https://github.com/Mozilla-Ocho/llamafile not working ;(
gemma-3n-E4B-it-Q8_0
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: hipcc not found on $PATH
[...]
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'gemma-3n-E4B-it-Q8_0.gguf'
main: error: unable to load model
How do you use it for ASR? Translating voice to text?
This model is fully compatible with anything previously done with gemma3. Just passed it to one of my VLM fine-tuning scripts and it started without issues (HF Transformers code). On a single GPU with LoRA, the E4B model takes 18 GB of VRAM at batch size 1, where Gemma 3 4B was 21 GB. Nice one from DeepMind; the Gemma 3 family tops the open-weights VLMs.
Fix: it's the E2B
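For reference, a minimal sketch of that kind of single-GPU LoRA setup, assuming the checkpoint loads through the standard Transformers/PEFT text path as the comment above suggests (model id, target modules, and hyperparameters are illustrative):

```python
# Minimal sketch of a single-GPU LoRA setup for Gemma 3n, assuming the checkpoint
# loads through the standard Hugging Face Transformers path as the parent comment
# describes. Model id, target modules, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # or the E4B variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the base weights at ~2 bytes/param
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```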
I tried my "Generate an SVG of a pelican riding a bicycle" prompt against Gemma 3n 7.5GB from Ollama and 15GB for mlx-vlm and got a pleasingly different result for the two quantization sizes: https://simonwillison.net/2025/Jun/26/gemma-3n/
Is that actually a useful benchmark, or is it just for the laughs? I've never really understood that.
It was supposed to be a joke. But weirdly it turns out there's a correlation between how good a model is and how good it is at my stupid joke benchmark.
I didn't realize quite how strong the correlation was until I put together this talk: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
Always loved this example, what do you think of ASCII art vs SVG?
Since it's not a formal encoding of geometric shapes, it's fundamentally different I guess, but it shares some challenges with the SVG task? Correlating phrases/concepts with an encoded visual representation, but without using imagegen, that is.
Do you think that "image encoding" is less useful?
It's a thing I love to try with various models for fun, too.
I'm talking about illustration-like content, neither classic text-based ASCII art nor abusing ASCII for rasterization.
The results have been interesting, too, but I guess it's less predictable than SVG.
I've had disappointing results with ASCII art so far. Something I really like about SVG is that most models include comments, which give you an idea of what they were trying to do.
For me, it shows whether LLMs are generalising from their training data. LLMs understand all of the words in the prompt. They understand the spec for SVG better than any human. They know what a bird is. They know what a bike is. They know how to draw (and, given access to computer use, could probably ace this test). They can plan and execute on those plans.
Everything here should be trivial for an LLM, but they're quite poor at it because there's almost no "how to draw complex shapes in SVG" type content in their training set.
It's been useful, though given the author's popularity I suspect it's only a matter of time before new LLMs become "more aware" of it.
It's useful because it's SVG so it's different than other image generation methods.
I think in 5 years we might have some ultra-realistic pelicans and this benchmark will turn out quite interesting.
And then the author will try the "Pelican tries to swallow the capybara as-is". And it will fall apart again.
It's this part where it gets interesting... how exactly it falls apart :D
Kinda feel like the content is a much better reason to visit than the pelicans. Though I suppose the pelicans are part of the content.
I'm quite happy that there's someone with both the time to keep up with all the LLM/AI stuff, that is also good enough at writing amusing stuff that I want to keep reading it.
> Kinda feel like the content is a much better reason to visit than the pelicans.
That's how the pelicans get ya.
Scored a whole two upvotes here, my scheme is clearly working great!
Leave some upvotes for the rest of us
Given how primitive that image is, what's the point of even having an image model at this size?
This isn't an image model. It's a text model, but text models can output SVG so you can challenge them to generate a challenging image and see how well they do.
>Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs.
But I understood your point, Simon asked it to output SVG (text) instead of a raster image so it's more difficult.
It can handle image and audio inputs, but it cannot produce those as outputs - it's purely a text output model.
Yeah you're right. Also, you're Simon :)
I still don't understand the difference between Gemma and Gemini for on-device, since both don't need network access. From https://developer.android.com/ai/gemini-nano :
"Gemini Nano allows you to deliver rich generative AI experiences without needing a network connection or sending data to the cloud." -- replace Gemini with Gemma and the sentence still valid.
Licensing. You can't use the Gemini Nano weights directly (at least commercially) and must interact with them through Android ML Kit or similar Google-approved runtimes.
You can use Gemma commercially using whatever runtime or framework you can get to run it.
It's not even clear you can license language model weights, though.
I'm not a lawyer but the analysis I've read had a pretty strong argument that there's no human creativity involved in the training, which is an entirely automatic process, and as such it cannot be copyrighted in any way (the same way you cannot put a license on a software artifact just because you compiled it yourself, you must have copyright ownership on the source code you're compiling).
IANAL either but the answer likely depends on the jurisdiction
US standards for copyrightability require human creativity and model weights likely don’t have the right kind of human creativity in them to be copyrightable in the US. No court to my knowledge has ruled on the question as yet, but that’s the US Copyright Office’s official stance.
By contrast, standards for copyrightability in the UK are a lot weaker than in the US - and while no court has ruled on the issue in the UK yet either, it seems likely a UK court would hold model weights to be copyrightable.
So from Google/Meta/etc’s viewpoint, asserting copyright makes sense, since even if the assertion isn’t legally valid in the US, it likely is in the UK - and not just the UK, many other major economies too. Australia, Canada, Ireland, New Zealand tend to follow UK courts on copyright law not US courts. And many EU countries are closer to the UK than the US on this as well, not necessarily because they follow the UK, often because they’ve reached a similar position based on their own legal traditions
Finally: don’t be surprised if Congress steps in and tries to legislate model weights as copyrightable in the US too, or grants them some sui generis form of legal protection which is legally distinct from copyright but similar to it-I can already hear the lobbyist argument, “US AI industry risks falling behind Europe because copyrightability of AI models in the US is legally uncertain and that legal uncertainty is discouraging investment”-I’m sceptical that is actually true, but something doesn’t have to be true for lobbyists to convince Congress that it is
>don’t be surprised if Congress steps in and tries to legislate model weights as copyrightable in the US too
"Your Honor i didn't copy their weights, i used them to train my models weights"
> US standards for copyrightability require human creativity and model weights likely don’t have the right kind of human creativity in them to be copyrightable in the US. No court to my knowledge has ruled on the question as yet, but that’s the US Copyright Office’s official stance.
Has the US copyright office said that about model weights? I've only heard them saying that about images produced entirely from a prompt to a model.
I thought I read something by them explicitly addressing the question but I can’t find it now.
However, read page 22 of https://www.copyright.gov/comp3/chap300/ch300-copyrightable-... - it is their settled position that the output of a mechanical process cannot be copyrightable unless there was substantial human creative input into it - and it is pretty clear that AI training doesn’t involve human creative input in the relevant sense. Now, no doubt there is lots of human skill and art in picking the best hyperparameters, etc - but that’s not input of the right kind. An analogy - a photocopier does not create a new copyright in the copy, even though there is skill and art in picking the right settings on the machine to produce the most faithful copy. The human creativity in choosing hyperparameters isn’t relevant to copyrightability because it isn’t directly reflected in the creative elements of the model itself
A model with RLHF fine-tuning could be a different story - e.g. Anthropic went to a lot of effort to make Claude speak with a distinctive “voice”, and some of that involved carefully crafting data to use for fine-tuning, and the model may contain some of the copyright of that training data.
But, even if that argument also applies to Gemma or Llama - if someone intentionally further fine-tunes the model in order to remove that distinctive “voice”, then you’ve removed the copyrightable element from the model and what is left isn’t copyrightable. Because the really expensive part of building a model is building the foundation model, and that’s the part least likely to be copyrightable; whereas, fine-tuning to speak with a distinctive voice is more likely to be copyrightable, but that’s the easy part, and easy to rip out (and people have motivation to do so because a lot of people desire a model which speaks with a different voice instead)
A very good lawyer could argue that creating the data sets for training, doing the evals, and RLHF, constitutes -human creativity- and not a mechanical endeavor.
But who knows, judges can be weird about tech.
Right, but it isn’t legally enough for there to be creativity in the supervision of the mechanical process - that creativity has to take the form of creative elements which survive in some identifiable form in the end product. The technical skill of managing a mechanical process can involve a great deal of creativity, but that doesn’t legally count as “creative” unless that is directly surfaced in the model output
I think the case is the strongest with RLHF - if your model speaks with a distinctive “voice”, and to make it do so you had to carefully craft training data to give it that voice, such that there are obvious similarities (shared turns of speech, etc) between your RLHF training input and the model outputs - that aspect of the model likely is copyrightable. But if you are trying to improve a model’s performance at mathematics problems, then no matter how much creativity you put into choosing training data, it is unlikely identifiable creative elements from the training data survive in the model output, which suggests that creativity didn’t actually make it into the model in the sense relevant to US copyright law
In that line of reasoning, does it really matter how "close" jurisdictions are to each other (also considering that what courts rule doesn't matter as much in countries governed by civil law), rather than merely the enforcement of the Berne convention? As in, if something is considered to be under copyright in any one of all the signatory countries, the others have to respect that?
No, the Berne convention doesn’t work that way. It requires you to extend copyright protection to the works of the nationals of the other parties on the same terms as you offer it to the works of your own nationals; but if a certain category of works are excluded from copyright for your own nationals, it doesn’t require you to recognise copyright in those works when authored by foreign nationals, even if their own country’s laws do
Real example: UK law says telephone directories are eligible for copyright, US law says they aren’t. The US is not violating the Berne convention by refusing to recognise copyright in UK phone directories, because the US doesn’t recognise copyright in US phone directories either. A violation would be if the US refused to recognise copyright in UK phone directories but was willing to recognise it in US ones
Makes sense. Thanks!
> It's not even clear you can license language model weights, though.
It is clear you can license (give people permissions to) model weights, it is less clear that there is any law protecting them such that they need a license, but since there is always a risk of suit and subsequent loss in the absence of clarity, licenses are at least beneficial in reducing that risk.
That's one of the reasons why they gate Gemini Nano with the "Gemini Nano Program Additional Terms of Service". Even if copyright doesn't subsist in the weights or if using them would be fair use, they still have recourse in breach of contract.
I've wondered about this for a while now (where e.g. some models on Hugging Face require clickwrap license agreements to download, which try to prohibit you from using the model in certain ways.)
It seems to me that if some anonymous ne'er-do-well were to publicly re-host the model files for separate download; and you acquired the files from that person, rather than from Google; then you wouldn't be subject to their license, as you never so much as saw the clickwrap.
(And you wouldn't be committing IP theft by acquiring it from that person, either, because of the non-copyrightability.)
I feel that there must be something wrong with that logic, but I can't for the life of me think of what it is.
The problem is that contracts don't bind subsequent recipients; copyright does.
Google gives the model to X who gives it to Y who gives it to Z. X has a contract with Google, so Google can sue X for breach of contract if they violate its terms. But do Y and Z have such a contract? Probably not. Of course, Google can put language in their contract with X to try to make it bind Y and Z too, but is that language going to be legally effective? More often than not, no. The language may enable Google to successfully sue X over Y and Z’s behaviour, but not successfully sue Y and Z directly. Whereas, with copyright, Y and Z are directly liable for violations just as X is
Thank you, this is a nice point to consider. I don't know if using the weights could be considered equivalent to, or as implying, acceptance of the terms of service from the weights' creators.
Contracts require agreement (a "meeting of the minds")... if X makes a contract with Google, that contract between Google and X can't create a contract between Google and Y without Y's agreement. Of course, Google's lawyers will do all they possibly can to make the contract "transitive", but the problem is contracts fundamentally don't have the property of transitivity.
Now, if you are aware of a contract between two parties, and you actively and knowingly cooperate with one of them in violating it, you may have some legal liability for that contractual violation even though you weren’t formally party to the contract, but there are limits - if I know you have signed an NDA, and I personally encourage you to send me documents covered by the NDA in violation of it, I may indeed be exposed to legal liability for your NDA violation. But, if we are complete strangers, and you upload NDA-protected documents to a file sharing website, where I stumble upon them and download them - then the legal liability for the NDA violation is all on you, none on me. The owner of the information could still sue me for downloading it under copyright law, but they have no legal recourse against me under contract law (the NDA), because I never had anything to do with the contract, neither directly nor indirectly
If you download a model from the vendor’s website, they can argue you agreed to the contract as a condition of being allowed to make the download. But if you download it from elsewhere, what is the consideration (the thing they are giving you) necessary to make a binding contract? If the content of the download is copyrighted, they can argue the consideration is giving you permission to use their copyrighted work; but if it is an AI model and models are uncopyrightable, they have nothing to give when you download it from somewhere else and hence no basis to claim a contractual relationship
What they’ll sometimes do, is put words in the contract saying that you have to impose the contract on anyone else you redistribute the covered work to. And if you redistribute it in full compliance with those terms, your recipients may find themselves bound by the contract just as you are. But if you fail to impose the contract when redistributing, the recipients escape being bound for it, and the legal liability for that failure is all yours, not theirs
Thanks for such a clear and logical explanation; it is a pleasure to read explanations like this. Anyway, I am always skeptical about how the law is applied - sometimes the spirit of the law is bent by the weight of powerful organizations. Perhaps there are some books which explain how the spirit of the law is not applied when powerful organizations are able to tame it.
Why not? Training isn't just "data in/data out". The process for training is continuously tweaked and adjusted. With many of those adjustments being specific to the type of model you are trying to output.
The US copyright office’s position is basically this-under US law, copyrightability requires direct human creativity, an automated training process involves no direct human creativity so cannot produce copyright. Now, we all know there is a lot of creative human effort in selecting what data to use as input, tinkering with hyperparameters, etc - but the copyright office’s position is that doesn’t legally count - creative human effort in overseeing an automated process doesn’t change the fact that the automated process itself doesn’t directly involve any human creativity. So the human creativity in model training fails to make the model copyrightable because it is too indirect
By contrast, UK copyright law accepts the “mere sweat of the brow” doctrine; the mere fact that you spent money on training is likely sufficient to make its output copyrightable, because UK law doesn’t impose the same requirement of a direct human creative contribution.
Doesn't that imply just the training process isn't copyrightable? But weights aren't just training, they're also your source data. And if the training set shows originality in selection, coordination, or arrangement, isn't that copyrightable? So why wouldn't the weights also be copyrightable?
The problem is, can you demonstrate that originality of selection and arrangement actually survives in the trained model? It is legally doubtful.
Nobody knows for sure what the legal answer is, because the question hasn’t been considered by a court - but the consensus of expert legal opinion is copyrightability of models is doubtful under US law, and the kind of argument you make isn’t strong enough to change that. As I said, different case for UK law, nobody really needs your argument there because model weights likely are copyrightable in the UK already
> The problem is, can you demonstrate that originality of selection and arrangement actually survives in the trained model? It is legally doubtful.
It's particularly perilous since the AI trainers are at the same time in a position where they want to argue that copyrighted works included in the training data don't actually survive in the trained model.
For the same reason GenAI output isn't copyrightable regardless of how much time you spend tweaking your prompts.
Also I'm pretty sure none of the AI companies would really want to touch the concept of having the copyright of source data affect the weights' own copyright, considering all of them pretty much hoover up the entire Internet without caring about those copyrights (and IMO claiming that they should be able to ignore the copyrights of training data, and that GenAI output is not under copyright, but at the same time claiming copyright for the weights, is dishonest, if not outright leechy).
The weights are mathematical facts. As raw numbers, they are not copyrightable.
A computer program is just 0s and 1s. Harry Potter books are just raw letters, or raw numbers if an ebook.
(The combination is what makes it copyrightable).
In practice it's not the combination that is copyrighted (you cannot claim copyright over a binary just because you zipped it, or over a movie because you re-encoded it, for instance).
It's the “actual creativity” inside. And it is a fuzzy concept.
`en_windows_xp_professional_with_service_pack_3_x86_cd_vl_x14-73974.iso` is also just raw numbers, but I believe Windows XP was copyrightable
Interesting.
From what I understand, copyright only applies to the original source code, GUI and bundled icon/sound/image files. Functionality etc. would fall under patent law. So the compiled code on your .ISO for example would not only be "just raw numbers" but uncopyrightable raw numbers.
According to the Gemma 3n preview blog, Gemma 3n shares the same architecture as the upcoming version of Gemini Nano.
The ‘n’ presumably stands for Nano.
Nano is a proprietary model that ships with Android. Gemma is an open model that can be adapted and used anywhere.
Sources: https://developers.googleblog.com/en/introducing-gemma-3n/
Video in the blog linked in this post.
Gemma is open source and apache 2.0 licensed. If you want to include it with an app you have to package it yourself.
Gemini Nano is an Android API that you don't control at all.
> Gemma is open source and apache 2.0 licensed
Closed source but open weights. Let’s not ruin the definition of the term to the advantage of big companies.
Your reply adds more confusion, imo.
The inference code and model architecture IS open source[0] and there are many other high quality open source implementations of the model (in many cases contributed by Google engineers[1]). To your point: they do not publish the data used to train the model so you can't re-create it from scratch.
[0] https://github.com/google-deepmind/gemma [1] https://github.com/vllm-project/vllm/pull/2964
If for some reason you had the training data, is it even possible to create an exact (possibly same hash?) copy of the model? Seems like there are a lot of other pieces missing like the training harness, hardware it was trained on, etc?
Yes, this is true. A lot of times labs will hold back necessary infrastructure pieces that allow them to train huge models reliably and on a practical time scale. For example, many have custom alternatives to Nvidia’s NCCL library to do fast distributed matrix math.
Deepseek published a lot of their work in this area earlier this year and as a result the barrier isn’t as high as it used to be.
To be entirely fair, that's quite a high bar even for most "traditional" open source.
And even if you had the same data, there's no guarantee the random perturbations during training are driven by a PRNG and done in a way that is reproducible.
Reproducibility does not make something open source. Reproducibility doesn't even necessarily make something free software (under the GNU interpretation). I mean hell, most docker containers aren't even hash-reproducible.
I am not sure if this adds even more confusion. The linked library is about fine-tuning, which is a completely different process.
Their publications about producing Gemma are not precise enough that you would get the same results even if you had the data.
In the README of the linked library they have a code snippet showing how to have a conversation with the model.
Also, even if it were for fine tuning, that would require an implementation of the model’s forward pass (which is all that’s necessary to run it).
That is a completely different discussion. Otherwise, even Gemini 2.5 Pro would be open source by this logic, since the clients for interacting with the cloud APIs are open source.
Yes!! But I doubt many of these are truly open source models, since most people just confuse open source with open weights, and the definition has really been watered down, smh.
> Gemma is open source and apache 2.0 licensed.
Are you sure? On a quick look, it appears to use its own bespoke license, not the Apache 2.0 license. And that license appears to have field of use restrictions, which means it would not be classified as an open source license according to the common definitions (OSI, DFSG, FSF).
I suspect the difference is in the training data. Gemini is much more locked down, and if it tries to repeat something from the training data verbatim you will get a 'recitation error'.
Perplexity.ai gave an easier to understand response than Gemini 2.5 afaict.
Gemini nano is for Android only.
Gemma is available for other platforms and has multiple size options.
So it seems like Gemini nano might be a very focused Gemma everywhere to follow the biology metaphor instead of the Italian name interpretation
The fact that you need HN and competitors to explain your offering should make Google reflect …
The Gemini billing dashboard makes me feel sad and confused.
I'm not a fan of this anarchic naming convention that OpenAI has apparently made standard across the industry.
What would you have called it?
I wouldn't have added a random letter and I would have chosen a name that's less easy to conflate with Gemini.
Gemma 4? I feel that one was incredibly obvious. Let us please just increase the version numbers.
Anthropic is better about this, but then shifted their ordering with the v4 models. Arguably better, but still quite annoying since everything pre-4 uses a different naming scheme.
What do you mean by Anthropic shifting their ordering? It seems to still be consistently Opus > Sonnet > Haiku. They didn't release 4 Haiku, but they also didn't release 3.5 Opus, and pricing wise Sonnet 4 lines up with earlier Sonnets.
As for this Gemma release, I don't think Gemma 4 would be an appropriate name. 3n is limited to very small versions (like 8B total parameters) and is therefore likely less powerful than Gemma 3.
From my impression this is more like a "Gemma 3 Lite" that provides a better speed/quality tradeoff than the smaller Gemma 3 models.
Made some GGUFs if anyone wants to run them!
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
I'm also working on an inference + finetuning Colab demo! I'm very impressed since Gemma 3N has audio, text and vision! https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
Tried the E4B model in Ollama and it's totally broken when interpreting images. The output depends only on the text and is consistent in that way, but otherwise completely wrong.
Works fine with regular Gemma 3 4B, so I'll assume it's something on Ollama's side. Edit: yep, text-only for now[1]; would be nice if that was a bit more prominent than buried in a ticket...
Don't feel like compiling llama.cpp myself, so I'll have to wait to try your GGUFs there.
[1]: https://github.com/ollama/ollama/issues/10792#issuecomment-3...
Oh I don't think multimodal works yet - it's text only for now!
Literally was typing out "Unsloth, do your thing!!" but you are way ahead of me. You rock <3 <3 <3
Thank you!
:) Thanks!
Thanks! What kind of rig do I need?
Likely nothing crazy. My RTX 2080 is pumping out 45 tok/s.
What is `jinja` in this context?
The chat template is stored as a Jinja template.
https://jinja.palletsprojects.com/en/stable/
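For the curious, here is a rough Python sketch of what that means in practice: the GGUF ships a Jinja chat template in its metadata, and --jinja tells llama-cli to render prompts through it. The template below only approximates Gemma's turn format; the real one is more elaborate.

  # Rough sketch: render a chat through a Gemma-like Jinja template.
  # The real template lives in the GGUF metadata and handles more cases.
  from jinja2 import Template

  gemma_like_template = Template(
      "{% for m in messages %}"
      "<start_of_turn>{{ 'model' if m.role == 'assistant' else m.role }}\n"
      "{{ m.content }}<end_of_turn>\n"
      "{% endfor %}"
      "<start_of_turn>model\n"
  )

  messages = [{"role": "user", "content": "Why is the sky blue?"}]
  print(gemma_like_template.render(messages=messages))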
> It’s supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others
Does anybody know how to actually run these using MLX? mlx-lm does not currently seem to support them, so I wonder what Google means exactly by "MLX support".
I had success using the models from lmstudio-community here.
https://huggingface.co/lmstudio-community
Thank you!
Do you know if these actually preserve the structure of Gemma 3n that make these models more memory efficient on consumer devices? I feel like the modified inference architecture described in the article is what makes this possible, but it probably needs additional software support.
But given that they were uploaded a day ago (together with the blog post), maybe these are actually the real deal? In that case, I wish Google could just link to these instead of to https://huggingface.co/mlx-community/gemma-3n-E4B-it-bf16.
Edit: Ah, these are just non-MLX models. I might give them a try, but not what I was looking for. Still, thank you!
That's a great question that is beyond my technical competency in this area, unfortunately. I fired up LM Studio when I saw this HN post, and saw it updated its MLX runtime [0] for gemma3n support. Then went looking for an MLX version of the model and found that one.
[0]: https://github.com/lmstudio-ai/mlx-engine
I'd genuinely like to know how these small models are useful for anyone. I've done a lot of experimenting, and anything smaller than 27B is basically unusable, except as a toy. All I can say for smaller models is that they sometimes produce good answers, which is not enough for anything except monkeying around.
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
Qwen2.5-VL 7B is pretty impressive at turning printed or handwritten maths lecture notes into Latex code, and is small enough to run slowly on a laptop without enough VRAM. Gemma3 4B was useless at this though, and got stuck in loops or tried to solve the maths problems instead of just converting the working to Latex (but it was much faster as it fit into VRAM).
It sounds like you're trying to use them like ChatGPT, but I think that's not what they're for.
I use the gemma3:1b model (well, gemma3n:e2b since today) to summarize articles in my RSS reader. It works extremely well for such a simple task and runs on CPU on my Hetzner server, so I don't have to pay the electricity bill for running it on a GPU at home.
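In case anyone wants to replicate it, the plumbing for that kind of task is tiny. A minimal sketch against Ollama's HTTP API (the article text is a placeholder; assumes `ollama serve` is running and the model has been pulled):

  # Minimal sketch: summarize one article with a small local model via Ollama.
  import requests

  article = "...full article text from the RSS feed..."

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "gemma3n:e2b",
          "prompt": f"Summarize this article in 3 sentences:\n\n{article}",
          "stream": False,
      },
      timeout=300,
  )
  print(resp.json()["response"])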
I am sure these can work fine as ideation devices. I treat this more like basic infra. I would absolutely love a future where most phones have some small LLM built in, kind of like a base layer of infra.
There are use cases where even low accuracy could be useful. I can't predict future products, but here are two that are already in place today:
- On the keyboard on iPhones, some sort of tiny language model suggests what it thinks are the most likely follow-up words while you write. You only pick a suggested next word if it matches what you were planning on typing.
- Speculative decoding is a technique that uses a smaller model to speed up inference for a bigger model (rough toy sketch after this list).
I'm sure smart people will invent other future use cases too.
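For the speculative decoding point above, here is a toy greedy sketch of the idea, with stand-in functions instead of real models. It only illustrates the draft-and-verify logic; the actual speedup comes from the big model checking all drafted tokens in one batched forward pass, which is elided here.

  # Toy sketch of (greedy) speculative decoding. draft_next / target_next
  # stand in for a small and a large model's next-token functions.
  def speculative_decode(prompt, draft_next, target_next, k=4, max_new=12):
      out = list(prompt)
      while len(out) - len(prompt) < max_new:
          # 1. The cheap draft model proposes k tokens.
          ctx, draft = list(out), []
          for _ in range(k):
              t = draft_next(ctx)
              draft.append(t)
              ctx.append(t)
          # 2. Keep the longest prefix the target model agrees with...
          for t in draft:
              if target_next(out) == t:
                  out.append(t)          # accepted "for free"
              else:
                  break
          # 3. ...then take the target model's own token at the divergence.
          out.append(target_next(out))
      return out

  # Trivial stand-in "models": the draft always guesses "la", target alternates.
  draft = lambda ctx: "la"
  target = lambda ctx: "la" if len(ctx) % 2 == 0 else "di"
  print(speculative_decode(["<s>"], draft, target))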
The best use case I've found for tiny models (<5bn params) is as a reference tool for when I don't have WiFi. I've been using Qwen on my MacBook Air as a replacement for Google while I'm writing code on flights. They work great for asking basic questions about syntax and documentation.
Tiny models (4B or less) are designed for fine-tuning on narrow tasks; that way they can outperform large commercial models at a tiny fraction of the price. They're also great for code autocomplete.
7B-8B models are great coding assistants if all you need is dumb, fast refactoring that can't quite be done with macros and standard editor functionality but is still primitive, such as "rename all methods having at least one argument of type SomeType by prefixing their names with ST_".
12B is the threshold where models such as Mistral Nemo or Gemma 3 12B start writing coherent prose.
Kevin Kwok did a great job taking it apart: https://github.com/antimatter15/reverse-engineering-gemma-3n
The Y-axis in that graph is fucking hilarious
LM Studio has MLX variants of the model out: http://huggingface.co/lmstudio-community/gemma-3n-E4B-it-MLX...
However it's still 8B parameters and there are no quantized models just yet.
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
I read the general parts and skimmed the inner workings but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmark/tasks did it improve upon?
Until the post gets into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions something new is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to those that use 10GB+ of RAM, if I understand that right?
> What does this concretely do that Gemma didn't already do
Open weights
Huh? I'm pretty sure I ran Gemma on my phone last month. Or is there a difference between downloadable (you get the weights because it's necessary to run the thing) and "open" weights?
I think the other poster is confused. Both Gemma 3 and Gemma 3n are open-weight models.
Google's proprietary model line is called Gemini. There is a variant that can be run offline called Gemini Nano, but I don't think it can be freely distributed and is only allowed as part of Android.
As for what's new, Gemma 3n seems to have some optimizations done to it that lead it to be better than the 'small' Gemma 3 models (such as 4B) at similar speed or footprint.
Wasn't it a preview version?
Oh, that could be. So this is the first on-device model that Google releases, that's the news?
What are some use cases for these local small models, for individuals? For programming-related work, the proprietary models seem significantly better, and that's all I really use LLMs for personally.
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
For me? Handling data like private voice memos, pictures, videos, calendar information, emails, some code, etc. Stuff I wouldn't want to share on the internet / have a model potentially slurp up and regurgitate as part of its memory when the data is invariably used in some future training process.
I think speech-to-text is the highlight use case for local models, because they are now really good at it and there's no network latency.
How does it compare to Whisper? Does it hallucinate less, or is it more capable?
I just like having quick access to reasonable model that runs comfortably on my phone, even if I'm in a place without connectivity.
I’m thinking about building a pipeline to mass generate descriptions for the images in my photo collection, to facilitate search. Object recognition in local models is already pretty good, and perhaps I can pair it with models to recognize specific people by name as well.
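A rough sketch of what the captioning step of such a pipeline could look like, using Ollama's generate API (model name, paths, and prompt are placeholders; note that Gemma 3n in Ollama is currently text-only per the ticket above, so this assumes some other local vision-capable model):

  # Rough sketch: caption every photo in a folder with a local vision model,
  # writing one description per file for later text search.
  import base64, json, pathlib, requests

  photos = pathlib.Path("~/Pictures").expanduser()
  out = {}

  for img in sorted(photos.glob("*.jpg")):
      b64 = base64.b64encode(img.read_bytes()).decode()
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={
              "model": "gemma3:4b",  # any local vision-capable model
              "prompt": "Describe this photo in one detailed sentence.",
              "images": [b64],       # Ollama expects base64-encoded images
              "stream": False,
          },
          timeout=600,
      )
      out[img.name] = resp.json()["response"]

  pathlib.Path("captions.json").write_text(json.dumps(out, indent=2))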
Hoping to try it out with home assistant.
filtering out spam SMS messages without sending all SMS to the cloud
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
I have been using Ollama + Open WebUI. Open WebUI already has a web search tool; all you need to do is click the toggle for it under the chat.
Unfortunately the OWUI web search is really slow and just not great overall. I would suggest using an MCP integration instead.
Can you recommend a specific piece of software for using an MCP integration for web searches with local LLMs? That’s the first time I’ve heard of this.
I think the ones I've heard mentioned on YouTube use an MCP that interfaces with SearXNG.
Google do have an API for this. It has limits but is perfectly good for personal use.
https://developers.google.com/custom-search/v1/overview
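Calling it is a single GET request per query; a minimal sketch (the API key and Programmable Search Engine ID are placeholders you create in the Google console; the free tier allows 100 queries/day):

  # Minimal sketch of the Custom Search JSON API.
  import requests

  API_KEY = "your-api-key"
  ENGINE_ID = "your-search-engine-id"

  def google_search(query: str, num: int = 5):
      resp = requests.get(
          "https://www.googleapis.com/customsearch/v1",
          params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": num},
          timeout=30,
      )
      resp.raise_for_status()
      return [
          {"title": i["title"], "link": i["link"], "snippet": i.get("snippet", "")}
          for i in resp.json().get("items", [])
      ]

  print(google_search("gemma 3n benchmarks"))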
Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.
You're not limited to 100 queries per day though. You're limited to 10,000 queries per day.
> Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.
What I meant is that paying for 10k queries per day does not make sense compared to a simple $20/month ChatGPT subscription.
10k queries would cost $50/day
If you actually did 10k queries a day with any free search service you'd quickly find yourself banned.
ChatGPT deep research does a lot of searches, but it's also heavily rate-limited even on paid accounts.
Building a search index, running a lot of servers, storing and querying all that data costs money. Even a cheap search API is gonna cost a bit for 10k queries if the results are any good.
Kagi charges me 15 USD and last month I've done 863 searches with it, more than worth it for the result quality. For me the Google API would be cheaper. I'm pretty sure Kagi would kick me out if I did 10k searches a day even if I'm paying for "unlimited searches".
A similar API from Bing costs between 15-25 USD for 1000 searches. Exa.ai costs 5 dollars for 1000 searches, and it goes up to 25 dollars if you want more than 25 results for your query.
Good web search is quite expensive.
The Google programmable search engine is unlimited and can search the web: https://programmablesearchengine.google.com/about/
That's intended for human users. If you try to use it for automated requests, you'll get banned for botting fairly quickly.
If I wanted to run this locally at somewhat decent speeds, is an RK3588S board (like OrangePi 5) the cheapest option?
It depends on your idea of decent speeds and what you would use it for. I just tried it on a laptop with an AMD HX 370 running on battery in power save mode and it's not especially impressive, although it runs much better in balanced or performance mode. I gave it the prompt "write a fizzbuzz program in rust" and it took almost a minute and a half. I expect it to be pretty terrible on an SBC. Your best bet is to try it out on the oldest hardware you have and figure out if you can tolerate worse performance.
good idea, will test that out
Tried with an S25+ (SD 8 Elite): 0.82 tok/s (4B L model). It's a barely usable speed, but pretty impressive nonetheless.
RK3588 uses a 7 year old CPU design and OrangePi 5 looks expensive (well over $100).
A used sub-$100 x86 box is going to be much better
you're right. For my purposes, I was thinking of something I could use if I wanted to manufacture a new (smallish) product
I'm going to attempt to get it running on the BeagleY-AI https://www.beagleboard.org/boards/beagley-ai
Similar form factor to raspberry pi but with 4 TOPS of performance and enough RAM.
We need tabular data somewhere on Google that lists the titles of the products and their descriptions or functions or what they do.
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact I'm thinking of trying to install it on a VPS so I can have an alternative to pricy APIs.
Updated Ollama to use this; now neither the old nor the new one works - much productivity.
Well, see it the other way, there is something positive: commenters here on HN claim that AI is useless. You can now also join the bandwagon of people who have free time.
It seems way worse than other small models, including responding with complete non sequiturs. I think my favorite small model is still DeepSeek distilled with Llama 8B.
The key here is multimodal.
Anyone have any idea on the viability of running this on a Pi5 16GB? I have a few fun ideas if this can handle working with images (or even video?) well.
The 4-bit quant weighs 4.25 GB, and then you need space for the rest of the inference process. So yeah, you can definitely run the model on a Pi; you may just have to wait some time for results.
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF
See here, long story short, this is another in a series of blog posts that would lead you to believe this was viable, but it isn't :/ https://news.ycombinator.com/item?id=44389793
I just tried gemma3 out and it seems to be prone to getting stuck in loops where it outputs an infinite stream of the same word.
Sounds a lot like an autoregressive sampling problem. Maybe try to set temperature and repeat penalty differently.
You're right, I should have checked the model settings. For some reason the default model profile in Ollama had temperature set to 0. Changing the temperature and repeat penalty worked much better than it did when I tried to correct similar behavior in the smallest phi4 reasoning model.
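For reference, the same kind of fix can also be applied per request instead of via the model profile; a small sketch against the Ollama API (the option values are illustrative, not tuned):

  # Sketch: overriding sampling parameters per request via the Ollama API,
  # instead of relying on the model profile's defaults.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "gemma3n:e4b",
          "prompt": "Write a limerick about pelicans.",
          "options": {"temperature": 0.7, "repeat_penalty": 1.1},
          "stream": False,
      },
  )
  print(resp.json()["response"])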
Thank you, this was affecting me too.
Is there a chance that we'll see an uncensored version of this?
Can you apply abliteration? I'm not sure if their MatFormer architecture is compatible with current techniques.
Any readily-available APKs for testing this on Android?
APK link here: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
Ah, I already had edge installed and it had gemma 3n-e4b downloaded... is this the same model that was previously released?
Seems like that was a preview model, unknown if this released version is different
I think it's only pulling the older model - I see it's using the liteRT models from May.
Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
You can download an APK from the official Google project for this, linked from the blogpost: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
If I download it and run it on a Pixel Fold with the actual 2B model, which is half the size of the ones the 60 fps claim is made for, it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)
So, *0.16* frames a second, not 60 fps.
The blog post is so jammed up with so many claims re: this is special for on-device and performance that just...seemingly aren't true. At all.
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something, big or
B) they are lying, repeatedly, big time, in a way that would be shown near-immediately when you actually tried building on it because it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
It looks to me from the marketing copy that the vision encoder can run at 60 FPS.
> MobileNet-V5-300M
Which makes sense, as it's 300M in size and probably far less complex, not a multi-billion-parameter transformer.
I agree that's the most likely interpretation - does it read as a shell game to you? Like, it can do that but once you get the thing that can use the output involved it's 1/100th of that? Do they have anything that does stuff with the outputs from just MobileNet? If they don't, how are they sure I can build 60 fps realtime audiovisual experiences they say I can?
Classify/similarity/clustering works fine with just an encoder, doesn't it?
I guess there's benefit to running that step without subsampling to the initial 256 tokens per image/frame ( https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_... ) to go on from that. https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which makes this 60 Hz frame digestion rate produce just under 31.5 million floats-of-your-chosen-precision per second. At least at the high (768x768) input resolution, this is a bit less than one float per pixel.
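Spelling the arithmetic out (token count and dimensionality taken from the links above):

  # Back-of-envelope check of the numbers above.
  tokens_per_frame = 256
  dims_per_token = 2048
  fps = 60
  pixels = 768 * 768

  print(tokens_per_frame * fps)                         # 15,360 tokens/s
  print(tokens_per_frame * dims_per_token * fps / 1e6)  # ~31.5 M floats/s
  print(tokens_per_frame * dims_per_token / pixels)     # ~0.89 floats/pixel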
I guess maybe with very heavy quantizing, to like 4 bit, that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 token/s at the (streaming) prefill stage?
Or I could imagine just local, on-device visual semantic search: expand the search query into a bunch of tokens that each carry some signed desire/want-ness, attend the search tokens to the frame's encoded tokens, run that through an activation function, scale it (to positive/negative) by the search token's desire score, and then just sum over each frame to get a frame score which can be used for ranking and other such search-related tasks.
(For that last thought, I asked Gemini 2.5 Pro to calculate the FLOPs load, and it came out to 1.05 MFLOPs per frame per search token; Reddit suggests the current Pixel's TPU does around 50 TOPS, so if these terminologies reasonably match, and assuming we spend about 20% of its compute on the search/match aspect, it comes out to an unreasonable-seeming roughly 190k tokens that the search query could be expanded to. I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before throughput issues.)
The APK that you linked, runs the inference on CPU and does not run it on Google Tensor.
That sounds fair, but opens up another N questions:
- Are there APK(s) that run on Tensor?
- Is it possible to run on Tensor if you're not Google?
- Is there anything at all from anyone I can download that'll run it on Tensor?
- If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
> Are there APK(s) that run on Tensor?
No. AiCore service internally uses the inference on Tensor (http://go/android-dev/ai/gemini-nano)
> Is there anything at all from anyone I can download that'll run it on Tensor?
No.
> If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
Mostly because 3P support has not been an engineering priority.
> Mostly because 3P support has not been an engineering priority.
Got it: assuming you're at Google, in eng. parlance, it's okay if it's not Prioritized™ but then product/marketing/whoever shouldn't be publishing posts around the premise it's running 60 fps multimodal experiences on device.
They're very, very lucky that the ratio of people vaguely interested in this to people who follow through on using it is high, so comments like mine end up at -1.
Tensor is essentially shipping subpar hardware while not even taking care of the software properly.
https://ai.google.dev/edge/litert/android/npu/overview has been identical for a year+ now.
In practice Qualcomm and MediaTek ship working NPU SDKs for third party developers, NNAPI doesn't count and is deprecated anyway.
Man this is a funny situation. Ty for sharing, more or less confirms my understanding. Couldn't quite believe it when I was in Google, or out of Google. This should be a big scandal afaict. What is going on???
(n.b. to readers, if you click through, the Google Pixel Tensor API is coming soon. So why in the world has Google been selling Tensor chips in Pixel as some big AI play since...idk, at least 2019?)
Yes: first-party models can use the Pixel NPUs; otherwise you're stuck with NNAPI, which is self-admittedly deprecated by Google and doesn't work all that well.
On third party model workloads, this is what you will get:
https://ai-benchmark.com/ranking.html
https://browser.geekbench.com/ai-benchmarks (NPU tab, sort w/ quantisation and/or half precision)
Google is clearly not serious on Pixels in practice, and the GPU performance is also behind by quite a lot compared to flagships, which really doesn't help. CPUs are also behind by quite a lot too...
How does their demo work then? It's been 3 months since 3n was first released publicly.
What demo?
The only one we have works as described; TL;DR, 0.1 fps.
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
Sure it can. The easiest way is to get Ollama, then `ollama run gemma3n`. You can pair it with tools like simonw's LLM to pipe stuff to it.
This should run fine on most hardware - CPU inference of the E2B model on my Pixel 8 Pro gives me ~9tok/second of decode speed.
Can popular sci-fi go 30 seconds without some lame wad naming themselves or a product after it?
I made a simple website[0] to check online models' MMLU quickly (it runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on the LMArena leaderboard is just some form of gaming the metric.
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it is a better talker.
0. https://mmlu.borgcloud.ai/
1. https://trashtalk.borg.games/
> for everything from safeguarding
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
Imagine the entire internet as just an on-the-fly UI; that would be pretty cool.
My post politely describing how this blog post does not match Google's own app running inference on Pixel is downvoted to -1, below dead posts with one-off short jokes.
I am posting again because I've been here 16 years now, it is very suspicious that happened, and given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge on this, who notes this won't work as advertised unless you're Google (i.e. internally, they have it binding to a privileged system process that can access the Tensor core, and this isn't available to third parties. Anyone else is getting 1/100th of the speeds in the post)
This post promises $150K in prizes for on-device multimodal apps and tells you it's running at up to 60 fps, they know it runs at 0.1 fps, Engineering says it is because they haven't prioritized 3rd parties yet, and somehow, Google is getting away with this.
[dead]
[flagged]
This is completely offtopic, but in case your question is genuine:
https://www.youtube.com/watch?v=F2X1pKEHIYw
> Why Some People Say SHTRONG (the CHRUTH), by Dr Geoff Lindsey
[flagged]