So in short, there's no guarantee about the output of any LLM, whether it's Gemma or anything else (ignoring details like setting a random seed or turning temperature down to 0). As you mentioned, though, libraries like Outlines can constrain the output, and hosted models often already include this in their API, but they can do so because they're a model plus some server-side code.
With Gemma, or any open model, you can use the open libraries in conjunction to get what you want. Some inference frameworks like Ollama include structured output as part of their functionality.
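For example, here's a minimal sketch using the Outlines 0.x API with a local Hugging Face checkpoint. The Gemma model name and the schema are placeholders I picked for illustration, not something from the docs:

    from pydantic import BaseModel
    import outlines

    class City(BaseModel):
        name: str
        country: str

    # Any local HF checkpoint works; Gemma is gated on HF, so swap in whatever you have.
    model = outlines.models.transformers("google/gemma-2-2b-it")

    # Constrained decoding happens client-side: at each step, tokens that would
    # break the JSON schema are masked out before sampling.
    generator = outlines.generate.json(model, City)
    city = generator("Name a city and the country it is in.")
    print(city)  # a validated City instance, guaranteed to match the schema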
But you mentioned all of this already in your question so I feel like I'm missing something. Let me know!
With OpenAI models, my understanding is that token output is restricted so that each next token must conform to the specified grammar (i.e. a JSON schema), so you're guaranteed to get either a function call or an error.
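Roughly, with the openai Python client it looks like this (the model name and the get_weather schema are made-up examples for illustration):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function, just for illustration
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=[tool],
        tool_choice="required",  # force a tool call rather than free-form text
    )

    call = resp.choices[0].message.tool_calls[0]
    print(call.function.name, call.function.arguments)  # arguments is JSON matching the schema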
Edit: per simonw's sibling comment, Ollama also has this feature.
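A quick sketch with the ollama Python client, assuming the server is running and you've pulled a model locally (the "gemma2" tag is just an example):

    import ollama

    resp = ollama.chat(
        model="gemma2",
        messages=[{"role": "user", "content": "Give me a city and its country as JSON."}],
        format="json",  # Ollama restricts decoding to valid JSON on the server side
    )
    print(resp["message"]["content"])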
Ah, there's a distinction here between the model and the model framework. The Ollama inference framework supports token output restriction. Gemma in AI Studio does too, as does Gemini (there's a toggle in the right-hand panel), but that's because both of those models are being served through an API where the constraint logic runs on the server.
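For the hosted case, the google-generativeai client exposes that same server-side constraint; a rough sketch (the API key and model name are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # key from AI Studio
    model = genai.GenerativeModel("gemini-1.5-flash")

    resp = model.generate_content(
        "Return a JSON object with `city` and `country` fields for Tokyo.",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",  # JSON mode enforced server-side
        ),
    )
    print(resp.text)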
The Gemma model by itself does not, though, nor does any "raw" model, but plenty of open libraries exist for you to plug into whatever local framework you decide to use.
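For example, llama-cpp-python accepts a GBNF grammar at decode time, so you can bolt the constraint onto a local GGUF model yourself (the model path here is a placeholder):

    from llama_cpp import Llama, LlamaGrammar

    # A toy grammar: the model can only ever emit "yes" or "no".
    grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

    llm = Llama(model_path="gemma-2-2b-it.gguf")  # any local GGUF file
    out = llm("Is the sky blue? Answer yes or no: ", grammar=grammar, max_tokens=2)
    print(out["choices"][0]["text"])  # guaranteed to be "yes" or "no"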