Gemini 3.1 Flash-Lite: Built for intelligence at scale

vlmutolo · 2026-03-03T18:03:29 1772561009

Lots of comments about the price change, but Artifical Analysis reports that 3.1 Flash-Lite (reasoning) used fewer than half of the tokens of 2.5 Flash-Lite (reasoning).

This will likely bring the cost below 2.5 flash-lite for many tasks (depends on the ratio of input to output tokens).

That said, AA also reports that 3.1 FL was 20% more expensive to run for their complete Intelligence index benchmark.

The overall point is that cost is extremely task-dependent, and it doesn’t work to just measure token cost because reasoning can burn so many tokens, reasoning token usage varies by both task and model, and similarly the input/output ratios vary by task.

XCSme · 2026-03-03T21:55:49 1772574949

> 3.1 Flash-Lite (reasoning)

(reasoning) doesn't say much. Is it low/med/high reasoning? I ran my own benchmarks, and 3.1 Flash-Lite on high costs A LOT: https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...

Do not use 3.1 Flash-Lite with HIGH reasoning, it reasons for almost max output size, you can quickly get to millions of tokens of reasoning in a few requests.

vlmutolo · 2026-03-04T01:02:35 1772586155

Wow, that’s very interesting. I wish more benchmarks were reported along with the total cost of running that benchmark. Dollars per token is kind of useless for the reasons you mentioned.

XCSme · 2026-03-04T01:05:27 1772586327

Yup, MiniMax M-2.5 is a standout in that aspect. It's $/token is very low, because it reasons forever (fun fact, that's also the reason why it's #1 on OpenRouter, because it simply burns through tokens, and OpenRouter ranking is based on tokens usage)...

XCSme · 2026-03-04T01:06:06 1772586366

https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...

msp26 · 2026-03-03T20:08:00 1772568480

many tasks don't need any reasoning

sync · 2026-03-03T17:05:30 1772557530

Unfortunate, significant price increase for a 'lite' model: $0.25 IN / $1.50 OUT vs. Gemini 2.5 Flash-Lite $0.10 IN / $0.40 OUT.

rohansood15 · 2026-03-03T17:33:38 1772559218

For the last 2 years, startup wisdom has been that models will continue to get cheaper and better. Claude first, and now Gemini has shown that it's not the case.

We priced an enterprise contract using Flash 1.5 pricing last summer, and today that contract would be unit economic negative if we used Flash 3. Flash 2.5 and now Flash 3.1 Lite barely breaks even.

I predict open-source models and fine-tuning are going to make a real comeback this year for economic reasons.

dktp · 2026-03-03T18:01:08 1772560868

Opus 4.5 became significantly cheaper than Opus 4.1

simianwords · 2026-03-03T17:42:12 1772559732

Not true. You just measure cost by amount of money spent per task. I would argue that this lite version is equivalent to older flash.

rohansood15 · 2026-03-03T17:52:33 1772560353

Yea but there is a whole world of tasks for which Flash 2.5-lite was sufficiently intelligent. Given Google's depreciation policy, there will soon be no way to get that intelligence at that price.

simianwords · 2026-03-03T18:16:35 1772561795

I hope they release models at every intelligence resolution although the thinking effort can be a good alternative

xnx · 2026-03-03T18:12:20 1772561540

> We priced an enterprise contract using Flash 1.5 pricing last summer,

Interesting. Flash 1.5 was already a year old at that point.

typs · 2026-03-03T17:37:01 1772559421

I mean the same level of intelligence does get cheaper. People just care about being on the frontier. But if you track a single level of intelligence the price just drops and drops.

rohansood15 · 2026-03-03T17:55:00 1772560500

What's the cheaper alternative from Gemini for Flash-2.5-lite level intelligence when it gets deprecated on 22nd July 2026?

GodelNumbering · 2026-03-03T17:52:23 1772560343

That's a 150% increase in the input costs and 275% increase on output costs over the same sized previous generation (2.5-flash-lite) model

k9294 · 2026-03-03T17:40:52 1772559652

You can test Gemini 3.1 Lite transcription capabilities in https://ottex.ai — the only dictation app supporting Gemini models with native audio input.

We benchmarked it for real-life voice-to-text use cases:

                <10s    10-30s   30s-1m    1-2m    2-3m
  Flash         2548     2732     3177     4583    5961
  Flash Lite    1390     1468     1772     2362    3499
  Faster by    1.83x    1.86x    1.79x   1.94x   1.70x

  (latency in ms, median over 5 runs per sample, non-streaming)

Key takeaways:

- 1.8x faster than Gemini 3 Flash on average

- ~1.4 sec transcription time for short to medium recordings

- ~$0.50/mo for heavy users (10h+ transcription)

- Close to SOTA audio understanding and formatting instruction following

- Multilingual: one model, 100+ languages

Gemini is slowly making $15/month voice apps obsolete.

simianwords · 2026-03-03T17:44:33 1772559873

You know what would be great? A light weight wrapper model for voice that can use heavier ones in the background.

That much is easy but what if you could also speak to and interrupt the main voice model and keep giving it instructions? Like speaking to customer support but instead of putting you on hold you can ask them several questions and get some live updates

k9294 · 2026-03-03T18:27:08 1772562428

It's actually a nice idea - an always-on micro AI agent with voice-to-text capabilities that listens and acts on your behalf.

Actually, I'm experimenting with this kind of stuff and trying to find a nice UX to make Ottex a voice command center - to trigger AI agents like Claude, open code to work on something, execute simple commands, etc.

stri8ted · 2026-03-03T17:56:53 1772560613

Can you show some comparisons for WER and other ASR models? Especially for non english.

k9294 · 2026-03-03T18:20:22 1772562022

I've been experimenting with Gemini 3.1 Flash Lite and the quality is very good.

I haven't found official benchmarks yet, but you can find Gemini 3 Flash word error rate benchmarks here: https://artificialanalysis.ai/speech-to-text/models/gemini — they are close to SOTA.

I speak daily in both English and Russian and have been using Gemini 3 Flash as my main transcription model for a few months. I haven't seen any model that provides better overall quality in terms of understanding, custom dictionary support, instruction following, and formatting. It's the best STT model in my experience. Gemini 3 Flash has somewhat uncomfortable latency though, and Flash Lite is much better in this regard.

zacksiri · 2026-03-03T17:07:30 1772557650

This is going to be a fun one to play with. I've been conducting tests on various models for my agentic workflow.

I was just wishing they would make a new flash-lite model, these things are so fast. Unfortunately 2.5-flash and therefore 2.5-flash-lite failed some of my agentic workflows.

If 3.1-flash-lite can do the job, this solves basically all latency issues for agentic workflows.

I publish my benchmarks here in case anyone is interested:

https://upmaru.com/llm-tests/simple-tama-agentic-workflow-q1...

P.S: The pricing bump is quiet significant, but still stomachable if it performs well. It is significant though.

sh4jid · 2026-03-03T17:00:55 1772557255

The Gemini Pro models just don't do it for me. But I still use 2.5 Flash Lite for a lot of my non-coding jobs, super cheap but great performance. I am looking forward to this upgrade!

simianwords · 2026-03-03T17:04:31 1772557471

same - pro is usually a miss for me.

xnx · 2026-03-03T17:56:26 1772560586

I'm still clinging to gemini-2.0-flash which I think is free free for API use(?!).

goldenarm · 2026-03-04T20:04:54 1772654694

Deprecated June 1st, which is a shame, it was the cheapest option and a great model

msp26 · 2026-03-03T19:48:45 1772567325

What the fuck is this price hike? It was such a nice low end, fast model. Who needs 10 years of reasoning on this model size??

I'm gonna switch some workflows to qwen3.5.

There's a lot of tasks that benefit from just having a mildly capable LLM and 2.5 Flash Lite worked out of the box for cheap.

Can we get flash lite lite please?

Edit: Logan said: "I think open source models like Gemma might be the answer here"

Implying that they're not interested in serving lower end Gemini models?

zzleeper · 2026-03-04T02:25:15 1772591115

Are there good open models out there that beat gemini 2.5 flash on price? I often run data extraction queries ("here is this article, tell me xyz") with structured output (pydantic) and wasn't aware of any feasible (= supports pydantic) cheap enough soln :/

kristianp · 2026-03-04T10:27:03 1772620023

You'll have to try out models on your use case. Openrouter makes that easy.

guerython · 2026-03-03T17:09:50 1772557790

[flagged]

zacksiri · 2026-03-03T17:32:57 1772559177

Yes, my workflows use caching intensively. It's the only way to keep things fast / economical.