Here’s the security practitioner I’ve watched cause the most damage on incidents: not the one who doesn’t know something, but the one who is sure they know it and is wrong. The person who confidently says “we’re not vulnerable to that,” “the logs would have caught it,” or “that’s not how this API works,” and is, in that moment, mistaken. Everyone on the bridge call takes their word for it. Two hours later, the assumption quietly breaks.

Knowing the right answer matters. Knowing when you don’t know it matters more. And that second skill is basically unmeasured in the cybersecurity field.

So we built a measurement for it. Every SecProve question now ships with an optional confidence rating (low, medium, high) that you set while answering. Over time, we compare what you said you knew against what you actually knew. The result is a second rating, alongside your Knowledge Rating: the Calibration Score.
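
Mechanically, the whole feature hangs off one extra field per answer. Here’s roughly the shape of the record involved (a sketch; the names, RatedAnswer included, are illustrative rather than SecProve’s actual schema):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RatedAnswer:
    question_id: str
    domain: str            # e.g. "Cloud Security"
    confidence: str        # "low" | "medium" | "high", set while answering
    correct: bool          # graded after submission
    answered_at: datetime
```

Everything below, from the score to the review queue, derives from comparing the stored confidence against the stored correctness.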

The analogy that motivated this

Every serious competitive practice discipline eventually develops two numbers:

  • Chess has Elo ratings (for how strong you are) and, at the grandmaster level, published accuracy stats (how well you actually played, move by move, versus an engine).
  • Forecasting has Brier scores.[1] Philip Tetlock’s “superforecasters” don’t just guess better than experts; they’re measurably more calibrated, meaning that when they say “70%,” the outcome actually happens about 70% of the time.[2]
  • Medicine has outcomes-based peer review for diagnostic confidence; the radiology literature is full of studies on miscalibration as a driver of misdiagnosis.

Cybersecurity, for a field obsessed with certifications and credentials, has never had the second number. You can pass the CISSP with 70% correct and walk away with a title that tells neither you nor a future employer where the missed 30% clustered, or whether you were sure when you missed them.

Why the “sure and wrong” question matters more than the accuracy one

There is a well-replicated finding in cognitive psychology called the hypercorrection effect: when people commit errors they were highly confident about, those errors are corrected faster and retained longer than low-confidence errors they simply guessed at.[3]

The intuition behind this is simple: a low-confidence wrong answer doesn’t disrupt your mental model, because you didn’t have one. A high-confidence wrong answer collides with an existing belief. Resolving that collision — actually updating the model — is where durable learning happens.

The working implication: the questions you got wrong while sure you were right are the single most valuable items in your practice history. They’re worth more than the ones you guessed and luckily got right, more than the ones you answered correctly with high confidence (those you already knew), and certainly more than the ones you skipped or answered without engaging.

SecProve’s new “sure and wrong” review queue is built around exactly this. When you open your review page, the first questions you see are the ones you marked “high confidence” and missed, prioritized even ahead of questions you answered seconds ago, because that is where the hypercorrection opportunity sits.
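
If you want the ordering spelled out, here’s a sketch using the hypothetical RatedAnswer record from earlier. The three-part sort key is my reading of the priority described above, not necessarily how SecProve implements it:

```python
_PRIORITY = {"high": 0, "medium": 1, "low": 2}


def review_order(answers: list[RatedAnswer]) -> list[RatedAnswer]:
    """Sort so "sure and wrong" items lead the review queue."""
    return sorted(answers, key=lambda a: (
        a.correct,                   # False (missed) sorts before True (correct)
        _PRIORITY[a.confidence],     # among misses, high confidence comes first
        -a.answered_at.timestamp(),  # recency only breaks ties within a tier
    ))
```

The key leans on Python ordering False before True, so a high-confidence miss outranks everything you got right, no matter how recently you answered.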

How the score is computed

Each of the three confidence levels has an implicit claim about your accuracy:

  • Low: “I’m guessing or having to work hard on this.” Target: about 30% correct.
  • Medium: “I’m pretty sure but not certain.” Target: about 60% correct.
  • High: “I know this cold.” Target: about 90% correct.

The Calibration Score is 100 minus the weighted mean absolute deviation, across your three buckets, between target and actual accuracy. A perfectly calibrated practitioner scores 100. A practitioner who marks 100% of questions “high” and gets 50% right scores 60. A practitioner who marks 100% “high” and gets 10% right scores 20.
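
In code, the rule is a few lines. One assumption to flag: the post says “weighted” without spelling out the weights, so this sketch weights each bucket by how many answers fell into it (the two all-“high” examples use a single bucket, so they don’t actually pin the weights down):

```python
from datetime import datetime

TARGETS = {"low": 30.0, "medium": 60.0, "high": 90.0}  # claimed accuracy per tier


def calibration_score(answers: list[RatedAnswer]) -> float:
    """100 minus the target-vs-actual deviation, weighted by bucket size."""
    deviation = 0.0
    for tier, target in TARGETS.items():
        bucket = [a for a in answers if a.confidence == tier]
        if not bucket:
            continue
        actual = 100.0 * sum(a.correct for a in bucket) / len(bucket)
        deviation += (len(bucket) / len(answers)) * abs(target - actual)
    return 100.0 - deviation


# The extreme case from above: 20 answers, all marked "high", half correct.
# Deviation is |90 - 50| = 40, so the score is 100 - 40 = 60.
now = datetime.now()
all_high = [RatedAnswer(str(i), "Cloud Security", "high", i % 2 == 0, now)
            for i in range(20)]
assert calibration_score(all_high) == 60.0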

What “calibrated” means in practice

Accuracy at each stated confidence level, for two hypothetical practitioners.

Calibrated (score 94)
  • High: 90% target → 91% actual
  • Medium: 60% target → 58% actual
  • Low: 30% target → 28% actual

Every tier’s actual accuracy is within a few points of the predicted target. This practitioner knows when they know it.

Overconfident (score 56)
  • High: 90% target → 54% actual
  • Medium: 60% target → 48% actual
  • Low: 30% target → 40% actual

The worst skew is in the “high” tier, exactly where the hypercorrection research says the biggest retention wins live.

The two practitioners could have the same Knowledge Rating. Only calibration tells you which one is dangerous when they show up on a bridge call and say “I’m confident we’re not exposed here.”

The score appears once you have 20 confidence-rated answers. Below that, the sample is too small to say anything about you personally — the answer is “keep practicing.” Above that, the score updates after every session and breaks out per-domain so you can see where your overconfidence concentrates. A SOC analyst might be calibrated in Detection Engineering and heavily overconfident in Cloud Security. That’s actionable in a way a single global score isn’t.
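
The per-domain breakdown is the same computation grouped by domain. A sketch, building on calibration_score above; whether the 20-answer floor applies per domain as well as globally is my assumption, since the post only states it for the global score:

```python
from collections import defaultdict

MIN_RATED = 20  # below this, the only honest output is "keep practicing"


def per_domain_calibration(answers: list[RatedAnswer]) -> dict[str, float | None]:
    """Calibration Score per domain; None where the sample is still too small."""
    by_domain: dict[str, list[RatedAnswer]] = defaultdict(list)
    for a in answers:
        by_domain[a.domain].append(a)
    return {domain: calibration_score(group) if len(group) >= MIN_RATED else None
            for domain, group in by_domain.items()}
```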

What it’s for, and what it isn’t

Calibration is not a substitute for knowledge. A brand new analyst who confidently marks everything “low” and gets 30% right is perfectly calibrated and also not yet a strong analyst. Knowledge Rating exists for that. The two numbers answer different questions:

  • Knowledge Rating: what fraction of the field you actually know.
  • Calibration Score: whether you have an accurate mental model of what you know.

The practitioners I’d bet on long-term have both. High Knowledge Rating, high Calibration Score. They know a lot, and they can be trusted when they say they know something.

The practitioners I’d watch carefully: high Knowledge Rating, low Calibration Score. They’re the most dangerous on a bridge call, because the rest of the team grants them authority on the strength of the knowledge score while missing that they routinely overstate certainty. Every team has at least one.

Where to see yours

  • Profile: your global Calibration Score alongside your Knowledge Rating.
  • Skills: per-domain calibration, so you can see where your blind spots concentrate.
  • Leaderboard: a new “Most Calibrated” view. Being top-rated matters; being most-calibrated says something different, and I think something more practically useful.
  • Review: your “sure and wrong” queue, sorted to surface the highest-leverage review items first.

The feature shipped today. If you’ve already got confidence ratings on recent answers, the score should be populated on your next session; if not, you’ll start seeing a number after about 20 rated answers. We’ll keep tuning the target percentages and the per-domain breakdown based on what the data actually shows. Feedback welcome.


References

  1. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3.
  2. Mellers, B., Stone, E., Murray, T., et al. (2015). Identifying and Cultivating Superforecasters as a Method of Improving Probabilistic Predictions. Perspectives on Psychological Science, 10(3), 267–281.
  3. Butterfield, B., & Metcalfe, J. (2001). Errors Committed with High Confidence Are Hypercorrected. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6), 1491–1494.