We Tested a New Model. It Was Worse. Here’s What We Learned.

At OddsRX, we don’t just build models and assume they work. We test them in the open, track every prediction, and let the data decide. This is a recap of our V2 model experiment — what we changed, what the numbers said, and why we’re shutting it down in favor of something better.


Why We Built V2

Our original model (V1) has been quietly accurate at picking winners, finishing the season above 60%. But we suspected it had two blind spots.

First, the spread picks felt soft. V1 was clustering too many games near 50/50, which meant it wasn’t expressing strong conviction even when the talent gap between teams was obvious. A model that thinks every game is a tossup isn’t much use for spread betting.

Second, the totals predictions were drifting. V1’s over/under calls were decent but we weren’t confident the underlying math was producing realistic game scores.

So we built V2 with three specific changes:

  • Linear grade tilt — removed a compression layer that was squashing the difference between good and bad teams
  • Probability calibration — added a shrinkage parameter to pull extreme probabilities toward 50%
  • Independent totals model — projected over/under separately from the winner simulation, with its own variance model
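The shrinkage change in the second bullet amounts to a one-line transform. A minimal sketch (the function name, parameter name lam, and example values are ours, not the production code):

```python
def shrink_toward_half(p: float, lam: float) -> float:
    """Pull a win probability toward 50% by a shrinkage factor lam in [0, 1]."""
    return 0.5 + (1.0 - lam) * (p - 0.5)

# With lam = 0.4, a fairly confident 70% pick softens to roughly 62%.
softened = shrink_toward_half(0.70, 0.4)
```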

We ran V2 in parallel with V1 for the entire second half of the season — 502 tracked games — and measured everything.


What the Numbers Said

The results were unambiguous. V2 was worse than V1 in every meaningful category.

Metric             V1       V2
Winner accuracy    60.0%    58.2%
Spread ATS         51.8%    46.8%
Totals O/U         49.0%    59.1%*

*The totals number needs an asterisk — more on that below.

The Compression Problem Got Worse, Not Better

The core goal of V2 was to give the model more conviction on lopsided matchups. It did the opposite. 88% of V2’s game predictions landed in the 45–55% win probability band — even tighter than V1. The calibration shrinkage we added to prevent overconfidence ended up compressing an already compressed signal.

The model was seeing a team 1.5 grade points better than their opponent and essentially shrugging.

The EV Signal Inverted

This was the most damaging finding. When we looked at edge — the gap between our model’s probability and the implied probability from the betting line — V2’s signal was backwards:

  • Positive EV picks hit at 50.4% (essentially a coin flip)
  • Negative EV picks hit at 68.1%

A model where fading your own picks is the winning strategy isn’t a model — it’s noise. This told us the calibration layer was actively destroying signal rather than refining it.

The Totals Number Is Misleading

V2’s 59.1% totals accuracy looks good on paper. But when we dug into the underlying simulation, we found the model was generating average game totals of 251 points, a full 22 points above the actual NBA average of 229. Every single game was inflated.

When your baseline is systematically wrong by 9%, you’re not picking totals accurately — you’re accidentally catching market movements that happen to align with your inflated projections. That’s not a repeatable edge.


The Root Causes

Post-mortem analysis identified three specific technical failures:

1. Over-compressed grade-to-outcome conversion. The math translating team quality differences into predicted score margins was applying a tanh scaling function that squeezed large grade differences into tiny outcome differences. A team that was genuinely 7 points better on paper was being treated as 3 points better in the simulation.

2. Calibration shrinkage destroyed the signal. The shrink-toward-50% adjustment was designed to fix overconfidence. But V2’s raw probabilities weren’t overconfident — they were already underconfident. Adding shrinkage on top of compression meant the model had almost no ability to express a strong opinion.

3. The base scoring rate was wrong. The fundamental shooting efficiency parameter powering the simulation was set too high, causing the ~22 point total inflation. This is a straightforward calibration error that compounded through every game.
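To make failures 1 and 2 concrete, here is a toy reconstruction of how the two layers stack. All constants (scale, cap, k, lam) are illustrative guesses, not the production values:

```python
import math

def squash_margin(grade_diff: float, scale: float = 6.0, cap: float = 4.0) -> float:
    # Failure 1: tanh squeezes a large on-paper edge into a small predicted margin.
    return cap * math.tanh(grade_diff / scale)

def margin_to_prob(margin: float, k: float = 0.12) -> float:
    # Rough logistic map from point margin to win probability.
    return 1.0 / (1.0 + math.exp(-k * margin))

def shrink(p: float, lam: float = 0.4) -> float:
    # Failure 2: shrinkage then pulls an already-soft probability toward 50%.
    return 0.5 + (1.0 - lam) * (p - 0.5)

# A team 7 points better on paper becomes a ~3.3-point favorite in the sim,
# and after shrinkage its win probability sits near 56%, inside the mushy
# 45-55% band the post-mortem describes.
p = shrink(margin_to_prob(squash_margin(7.0)))
```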


What We’re Doing Instead

We’re not patching V2. The problems are structural enough that a new architecture is the right answer. V3 addresses each failure directly:

  • Recalibrated base efficiency — corrected to target the actual NBA average of 229 points per game
  • Empirical grade-to-margin mapping — derived from regression on 502 real tracked games rather than theoretical assumptions (measured slope: 7.35 points per grade unit)
  • Bivariate pace variance — separates shared game-environment noise (affects totals) from team-specific noise (affects spreads), fixing a coupling problem that was inflating spread error
  • Data-driven calibration — replaced the shrinkage parameter with an isotonic calibration curve built from 812 V1 tracked games
  • Coherent player projections — player point totals now flow directly from the game simulation rather than being calculated independently, so projections are internally consistent
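The isotonic calibration curve in the fourth bullet can be fit with the classic pool-adjacent-violators algorithm. A stdlib-only sketch (the data layout and names are ours, not the V3 code):

```python
def isotonic_fit(probs, outcomes):
    """Fit a nondecreasing step function from (raw prob, 0/1 outcome) pairs
    via pool-adjacent-violators. Returns blocks of (lo, hi, calibrated value)."""
    pairs = sorted(zip(probs, outcomes))
    merged = []  # each block: [outcome sum, count, lo prob, hi prob]
    for p, y in pairs:
        merged.append([y, 1, p, p])
        # Merge while a block's mean is not below its successor's (a violation).
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, n, _, hi = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
            merged[-1][3] = hi
    return [(lo, hi, s / n) for s, n, lo, hi in merged]

def calibrate(curve, p):
    """Map a raw probability onto the fitted step function."""
    for lo, hi, v in curve:
        if p <= hi:
            return v
    return curve[-1][2]
```

Unlike a single shrinkage parameter, a curve fit on tracked history can stretch probabilities outward in exactly the regions where V1 was underconfident.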

V3 is running in parallel starting now. We’ll report back once we have enough games to measure.


Why We’re Publishing This

Most handicapping services don’t tell you when their model fails. They quietly update the algorithm and pretend nothing happened. We think that’s the wrong approach.

Transparency about failures is how you build a model worth trusting. V2 taught us exactly where our assumptions broke down, and every one of those lessons is now embedded in V3’s design. A model that’s never been stress-tested against real outcomes isn’t a model — it’s a hypothesis.

The data is the truth. We follow it wherever it goes.