The NVIDIA tax is a routing problem.
Enterprise AI today runs every query through a GPU, whether the query needs one or not. The result is a structural overpayment that shows up on every compute bill in the industry — a tax levied not by any government but by the absence of an alternative.
2Brains is that alternative. Most enterprise queries are fact retrieval, not language generation. Fact retrieval runs on a CPU. We route accordingly.
Power per query.
A fifty-fold reduction in power per query is not an optimization. It is the difference between architectures. The 300W figure assumes a query that genuinely requires generative inference. The 6W figure assumes a query that the system has correctly identified as fact retrieval — the majority of enterprise traffic.
What that means at scale.
Modeled annual compute cost for an enterprise running one billion queries per year, assuming 68% of queries are routable to deterministic retrieval.
Illustrative model. Actual savings depend on query mix, corpus design, and existing infrastructure.
For the majority of enterprise queries, generative inference is overkill. Deterministic retrieval gives the correct answer at a fraction of the wattage, on hardware buyers already own.
If 68% of traffic does not need a GPU, neither does 68% of an enterprise's procurement runway. AMD, ARM, and sovereign silicon become viable hosts for the work that remains.
Data center planning today assumes generative inference will eat the grid. When retrieval handles two-thirds of traffic at 6W, that assumption stops holding.
Modeling at your scale
We can run the cost model against your actual query mix under NDA.
contact@2brains.net