Why one verdict is a lie

Every detection engine is an opinion. Point several of them at the same file and you do not get a fact, you get a panel of opinions, and the interesting part is where they disagree. Most security tooling throws that part away: it collapses the opinions into a single score, slaps a label on it, and moves on. The single number feels decisive, and it gets there by omission.

Disagreement between engines is not noise to be smoothed out. It is the signal.

The problem with the average

Imagine four engines look at an attachment. Two call it malicious with high confidence, one calls it suspicious, one calls it clean. Average those and you land somewhere around "probably bad," which is exactly the verdict that helps no one: too uncertain to auto-act, too vague to explain. Worse, the average hides the shape of the disagreement. A 2-2 split between confident extremes is a very different situation from four engines clustered in the murky middle, but the mean can report the same number for both.
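To make that last point concrete, here is a minimal sketch using only Python's standard library. The 0-to-1 scores (0.0 for confidently clean, 1.0 for confidently malicious) and both panels are invented for illustration; they are not output from any real engine.

```python
from statistics import mean, pstdev

# Hypothetical scores in [0, 1]: 0.0 = confidently clean, 1.0 = confidently malicious.
split_panel = [1.0, 1.0, 0.0, 0.0]  # a 2-2 split between confident extremes
murky_panel = [0.5, 0.5, 0.5, 0.5]  # four engines clustered in the middle


def summarize(scores):
    """Return (mean, population standard deviation) for a panel of scores."""
    return mean(scores), pstdev(scores)


for name, panel in [("split", split_panel), ("murky", murky_panel)]:
    avg, sd = summarize(panel)
    print(f"{name}: mean={avg:.2f} stdev={sd:.2f}")
# split: mean=0.50 stdev=0.50
# murky: mean=0.50 stdev=0.00
```

Both panels average to 0.50. Only the second number tells you that one is a standoff and the other is a shrug.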

When you collapse the spread, you also collapse your options. You can only act on a number. You cannot act on the reasoning behind it, because you threw the reasoning away.

Keep the spread intact

Soarcery does the opposite. We keep every engine's verdict and its confidence, and we surface the whole spread inside the investigation, right where the investigating agent is working. The analyst sees who said what, and how sure each one was. Agreement and disagreement are both visible, because both are useful.

A fair question is whose opinions these are. The engines behind the spread come from established detection providers; inside the product each verdict is attributed by name, and in the examples here they are labelled by what they do: sandbox, static AV, ML model, reputation. Two things about engine panels are worth saying plainly. Engines are not independent jurors; vendors license each other's engines and copy detections, so unanimity is cheaper than it looks. And panels can be studied: there is an underground market in private scanning services that exists so attackers can tune a sample until every engine shrugs. Both facts are arguments for keeping the spread visible instead of trusting any single collapsed score, and both should shape how you set thresholds.

file: invoice_q2.xlsm (Contested)

sandbox: malicious
static_av: malicious
ml_model: suspicious
reputation: clean

confidence spread (illustrative): 0.62 · escalate

Turn the spread into a control

Once you keep the spread, you can compute something the average cannot give you: how much the engines disagree. We express that as a confidence spread value, a score from 0 (full agreement) to 1 (maximal disagreement), and it becomes a control you set. A tight, agreeing spread can be safe to act on automatically, within a threshold you choose and scoped to alert types where the blast radius is small. A wide, contested spread routes to a human, with the full evidence already assembled.

Tight spread, malicious: contain automatically, within your threshold.
Wide spread: escalate, because the engines themselves are not sure.
Tight spread, clean: the trap case. Read on.

That last row deserves its own paragraph, because consensus is not correctness. A spread can be unanimously clean because the file is fine, or because the sample was tuned against panels until every engine went quiet; the underground scanning services mentioned above exist to manufacture exactly this pattern. So a clean consensus is never the only thing an auto-close leans on. Soarcery corroborates it with prevalence, sample age, and source context, and puts a sampled human re-review behind everything it closes automatically. The spread is the strongest single signal we know how to surface. It is still one signal.

The threshold is yours, per use case. Triage on a low-risk alert type can act on a looser spread. A high-blast-radius action like disabling an account can demand near-unanimity or a human, every time.
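The routing rules above can be sketched in a few lines. This is an illustration under stated assumptions, not Soarcery's implementation: the spread formula (population standard deviation scaled to 0 to 1), the threshold values, the 0.8 and 0.2 cut-offs, and the `corroborated` flag standing in for prevalence, sample age, and source context are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Verdict:
    engine: str
    score: float  # 0.0 = confidently clean, 1.0 = confidently malicious


def spread(verdicts):
    """Illustrative disagreement score in [0, 1].

    Assumption: population stdev of scores in [0, 1] peaks at 0.5,
    so doubling it gives a 0-to-1 scale.
    """
    return min(1.0, 2 * pstdev([v.score for v in verdicts]))


# Hypothetical per-use-case limits: the widest spread each use case may auto-act on.
MAX_SPREAD = {"low_risk_triage": 0.30, "disable_account": 0.05}


def route(verdicts, use_case, corroborated=False):
    """Return 'contain', 'auto_close', or 'escalate' for one panel."""
    if spread(verdicts) > MAX_SPREAD[use_case]:
        return "escalate"  # wide spread: the engines themselves are not sure
    avg = mean(v.score for v in verdicts)
    if avg >= 0.8:
        return "contain"  # tight spread, malicious
    if avg <= 0.2:
        # Tight spread, clean: consensus alone never closes an alert.
        return "auto_close" if corroborated else "escalate"
    return "escalate"  # tight but murky: nobody is confident either way


contested = [Verdict("sandbox", 0.95), Verdict("static_av", 0.90),
             Verdict("ml_model", 0.60), Verdict("reputation", 0.05)]
tight_bad = [Verdict("sandbox", 0.95), Verdict("static_av", 0.90),
             Verdict("ml_model", 0.92), Verdict("reputation", 0.97)]
tight_clean = [Verdict("sandbox", 0.05), Verdict("static_av", 0.02),
               Verdict("ml_model", 0.04), Verdict("reputation", 0.03)]

print(route(contested, "low_risk_triage"))          # escalate
print(route(tight_bad, "low_risk_triage"))          # contain
print(route(tight_bad, "disable_account"))          # escalate
print(route(tight_clean, "low_risk_triage"))        # escalate
print(route(tight_clean, "low_risk_triage", True))  # auto_close
```

Note the third line: the same tight, malicious panel that is contained automatically during low-risk triage still goes to a human when the action is disabling an account, because that use case tolerates almost no disagreement.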

Drift, the part everyone misses

Because the spread is recorded on every investigation, drift becomes visible. An indicator that was unanimously clean a month ago and is drawing disagreement today is telling you something changed. Per-engine attribution is what makes this usable: you can see who moved, which separates a real change in the threat from one vendor shipping an aggressive heuristic that week. Engine churn is constant background noise; attribution is how drift stays a signal instead of becoming a second alert problem.
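A rough sketch of what per-engine attribution buys you, assuming each recorded investigation stores one score per engine. The snapshot format, the `min_shift` value, and the rule of thumb that a single mover points at one vendor while several movers point at a real change are assumptions made for illustration.

```python
def engines_that_moved(before, after, min_shift=0.3):
    """Return {engine: (old, new)} for engines whose score shifted by at least min_shift."""
    return {
        engine: (before[engine], after[engine])
        for engine in before.keys() & after.keys()
        if abs(after[engine] - before[engine]) >= min_shift
    }


def describe_drift(moved):
    """Hypothetical rule of thumb for reading the set of engines that moved."""
    if not moved:
        return "stable"
    if len(moved) == 1:
        return "single engine moved: check for a vendor heuristic change"
    return "several engines moved: treat as a possible change in the threat"


# Invented snapshots of the same indicator, a month apart.
last_month = {"sandbox": 0.05, "static_av": 0.03, "ml_model": 0.04, "reputation": 0.02}
today = {"sandbox": 0.06, "static_av": 0.03, "ml_model": 0.70, "reputation": 0.02}

moved = engines_that_moved(last_month, today)
print(moved)                  # {'ml_model': (0.04, 0.7)}
print(describe_drift(moved))  # single engine moved: check for a vendor heuristic change
```

Without the per-engine record, all you would see is an indicator that used to be clean and is now contested. With it, the first question to ask is already answered.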

None of this requires you to trust a black box. The spread is shown, the thresholds are yours, and every call is on the record with its evidence. The verdict can be trustworthy precisely because the disagreement behind it is never hidden.

The spread is easier to argue about with a real one in front of you. Request a demo and bring your noisiest alert type.
