The Third Verdict: Why "Inconclusive" Is the Most Honest Output an AI SOC Can Produce

Most AI SOC tools force a verdict to close tickets. But “inconclusive,” with explicit evidence gaps named, is often the most honest, useful, and defensible output a system can produce.

The modern SOC is not suffering from a lack of detections. It is suffering from incomplete investigations. And for most of the industry, that distinction has not yet registered in the tools being built to address the problem.

Think about what actually lands in an analyst’s queue on a given day. Cloud identities without clear ownership. Ephemeral infrastructure that no longer exists by the time it is flagged. Endpoint telemetry that dropped out mid-session. Threat intelligence connectors returning null because an API key rotated. Encrypted traffic with no payload visibility. Each of these is a gap, and in a real investigation, gaps are not edge cases. They are the environment. The job of a security investigation is not to process a clean dataset. It is to reason carefully about an incomplete one.

Most AI SOC products were not designed for that reality. They were designed to close tickets. They summarize alerts, assign confidence scores, and produce verdicts because the workflow demands closure. And in doing so, they solve the wrong problem with impressive efficiency.

The verdict that is actually missing from most of these systems is not “malicious” or “benign.” It is “inconclusive,” and making that a first-class output may be the most important architectural decision the next generation of AI SOC platforms gets right.

Detection Was Never the Same Thing as Investigation

There is a version of this story that gets told as progress. SIEMs improved correlation. EDRs added behavioral analytics. XDR promised unified visibility. SOAR automated the response layer. Each generation of tooling was genuinely useful, and each one left the same problem structurally unresolved.

Detection is not investigation.

A SIEM surfaces anomalies. A SOAR executes predefined workflows. Neither was ever designed to determine whether the available evidence is sufficient to support a defensible conclusion. That gap, between identifying a signal and understanding what the signal actually means, is where most AI SOC products still fall short.

One senior analyst at a financial services SOC put it plainly in a conversation about AI tooling adoption: “The useful ones reduce investigation time. The rest just reformat alerts.” The observation sounds simple, but it identifies something important. Reducing reading time is not the same as producing a trustworthy verdict. And a system that generates a verdict faster than a human analyst would, but without the underlying reasoning to justify it, has not improved the investigation. It has just accelerated the part that was never the bottleneck.

The audit pressure version of this problem is even sharper. When an examiner asks why a particular alert was closed, a confidence score does not constitute an answer. A defensible answer requires the hypothesis that was tested, the evidence collected, the reasoning chain connecting evidence to verdict, the telemetry that was absent, and the conditions under which the conclusion might need to change. That is investigation discipline, and it cannot be substituted with a percentage.
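One way to make that list concrete is to treat it as a record schema and check each closed alert against it. The following is a minimal sketch in Python, not a reference to any particular product: the class name, field names, and the `missing_elements` helper are all hypothetical, and the example values are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvestigationRecord:
    """Hypothetical audit record; every field name here is an assumption."""
    verdict: str
    hypothesis: str = ""
    evidence: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    # None means "never assessed"; an empty list means "assessed, nothing absent".
    absent_telemetry: Optional[List[str]] = None
    revisit_conditions: List[str] = field(default_factory=list)


def missing_elements(record: InvestigationRecord) -> List[str]:
    """Return the names of the audit elements the record fails to supply."""
    missing = []
    if not record.hypothesis.strip():
        missing.append("hypothesis")
    if not record.evidence:
        missing.append("evidence")
    if not record.reasoning:
        missing.append("reasoning")
    if record.absent_telemetry is None:
        missing.append("absent_telemetry")
    if not record.revisit_conditions:
        missing.append("revisit_conditions")
    return missing


# A closure that carries only a score supplies none of the five elements.
score_only = InvestigationRecord(verdict="malicious (84%)")

# Illustrative values only; they do not describe a real case.
documented = InvestigationRecord(
    verdict="inconclusive",
    hypothesis="The flagged activity is unauthorized",
    evidence=["alert details", "process tree"],
    reasoning=["Origin of the activity could not be verified"],
    absent_telemetry=["endpoint telemetry for the event window"],
    revisit_conditions=["missing telemetry is recovered"],
)

print(missing_elements(score_only))
print(missing_elements(documented))
```

The design choice worth noting is the `None` default on `absent_telemetry`: it separates "nobody checked for gaps" from "we checked and found none," which is the distinction an examiner is likely to probe.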

What an Incomplete Investigation Actually Looks Like

Consider a concrete scenario that plays out regularly in enterprise SOCs.

A PowerShell command executes on a workstation at 2:14 AM. The command attempts to download a file from an external IP address. The SIEM fires an alert and routes it to the AI investigation layer. The AI system begins collecting context.

Here is what it finds. The process tree shows PowerShell launched from a scheduled task, but the scheduled task was created six weeks ago and the creation event has already rolled off the endpoint log retention window. The external IP appears once in threat intelligence with low confidence, flagged by a single vendor feed eighteen months prior. The workstation’s EDR agent went offline twenty minutes before the alert fired and came back online four minutes after, leaving a gap in behavioral telemetry that covers exactly the window in question. The scheduled task runs under a service account shared by three different internal applications, none of which is documented in the asset inventory.

A conventional AI pipeline might still produce: 84% malicious probability. Medium-high confidence. Recommended action: isolate endpoint and escalate.

But every piece of that verdict rests on an assumption the evidence cannot support. Was the scheduled task created legitimately? Unknown. Was the EDR outage coincidental or deliberate? Unknown. Is the external IP currently associated with active infrastructure? Unknown. Does the service account behavior represent normal operations for those three applications? Unknown.

An investigation-centric system asks different questions. What hypotheses were tested? Which evidence supports them? Which evidence contradicts them? What telemetry is absent? What would need to be true for each hypothesis to hold?

If the answers to those questions reveal that the gaps are material, the correct output is not “probably malicious.” It is: “Inconclusive. Scheduled task origin unverifiable due to log retention gap. EDR outage during event window prevents behavioral confirmation. External IP attribution low confidence. Recommend manual review of service account baseline before escalation decision.”
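The decision rule described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific platform works: the `Verdict`, `EvidenceGap`, and `Finding` types, the `decide` function, and both thresholds are hypothetical. The only idea taken from the text is that a material evidence gap overrides the score.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    MALICIOUS = "malicious"
    BENIGN = "benign"
    INCONCLUSIVE = "inconclusive"


@dataclass
class EvidenceGap:
    description: str
    material: bool  # assumption: would resolving this gap change the verdict?


@dataclass
class Finding:
    verdict: Verdict
    gaps: List[str]
    recommendation: str


def decide(score: float, gaps: List[EvidenceGap],
           malicious_at: float = 0.8, benign_at: float = 0.2) -> Finding:
    """Material gaps win over the score; thresholds are illustrative only."""
    material = [g.description for g in gaps if g.material]
    if material:
        return Finding(Verdict.INCONCLUSIVE, material,
                       "Manual review of named gaps before escalation decision")
    if score >= malicious_at:
        return Finding(Verdict.MALICIOUS, [], "Escalate")
    if score <= benign_at:
        return Finding(Verdict.BENIGN, [], "Close with evidence attached")
    # No gaps, but the score itself does not support either conclusion.
    return Finding(Verdict.INCONCLUSIVE, [],
                   "Score alone does not support a verdict")


# Gaps taken from the PowerShell scenario above.
scenario_gaps = [
    EvidenceGap("Scheduled task origin unverifiable due to log retention gap", True),
    EvidenceGap("EDR outage during event window prevents behavioral confirmation", True),
    EvidenceGap("External IP attribution low confidence", True),
]

finding = decide(0.84, scenario_gaps)
print(finding.verdict.value)
for gap in finding.gaps:
    print("-", gap)
```

With the same 0.84 score and no material gaps, `decide` would return `Verdict.MALICIOUS`. The score does not change between the two cases; only the evidence it rests on does.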

That output is less decisive. It is also more useful, more honest, and more defensible. The analyst who receives it knows exactly what they are working with. The analyst who receives “84% malicious, isolate and escalate” either acts on a conclusion the evidence does not support, or spends the next thirty minutes reconstructing the investigation from scratch to verify whether it does.

What the Research Actually Shows

Two recent studies make the empirical case for why this matters.

A mixed-methods study by Rastogi et al. at the Rochester Institute of Technology, published ahead of ACM CCS 2025, examined how SOC analysts conceptualize AI-generated explanations across investigative roles. Surveying 248 SOC professionals and conducting 24 in-depth interviews across four continents, the researchers found that analysts were consistently willing to accept AI outputs even at lower predictive accuracy when explanations were perceived as relevant and evidence-backed. Analysts expressed a strong preference for contextual depth over a mere presentation of outcomes, repeatedly emphasizing the importance of understanding the rationale behind AI decisions rather than simply receiving a verdict.

That finding reframes the product question considerably. Analysts were not asking for more accurate predictions. They were asking for visible reasoning. The system they trusted was not necessarily the most confident one. It was the one that showed its work.

A longitudinal study by researchers at CSIRO’s Data61, conducted in collaboration with the managed detection and response firm eSentire, offers a complementary view from inside a live SOC environment. Analyzing 3,090 analyst queries submitted over ten months by 45 working analysts, the researchers found that analysts use LLMs as on-demand aids for sensemaking and context-building rather than for making high-stakes determinations. The study concludes that analysts retain final judgment while leveraging AI for rapid sensemaking, and explicitly recommends that well-designed systems surface evidence over recommendations for investigative tasks, preserving analyst autonomy and context.

Both studies point toward the same underlying principle. Analysts are not waiting for AI to take over their investigations. They are looking for AI that supports their reasoning without substituting for it. The system they want is one that makes the evidence clearer, not one that skips past the evidence to a verdict.

This preference for visible reasoning and the regulatory demand for auditability are not separate concerns. They are both expressions of the same need: decisions made under uncertainty need to show their work. An analyst asking “why did the system flag this?” and a regulator asking “why was this incident closed?” are asking the same question at different points in time.

The Regulatory Dimension

Frameworks like DORA and NIS2 increasingly focus not just on whether incidents were handled, but on whether organizations can explain how decisions were made and what limitations affected the outcome. In that environment, an inconclusive verdict with explicit evidence gaps is far more defensible than a fabricated high-confidence conclusion that collapses under scrutiny. Regulators do not expect perfect visibility. They do expect organizations to know the boundaries of their own knowledge and to say so clearly.

Operational maturity is not the elimination of uncertainty. It is knowing when uncertainty must be formally declared.

The Economics of Forced Certainty

There is a hidden cost buried inside AI SOC architectures that the throughput metrics do not capture. If analysts still spend 20 to 30 minutes verifying every AI-generated conclusion, the AI has not eliminated labor. It has shifted it into a higher-cost verification stage. The case gets closed faster on paper. The actual investigation time has not changed. What has changed is that the analyst is now working backward from a verdict rather than forward from evidence, which is a harder and more error-prone way to investigate anything.

The scenario described earlier illustrates this exactly. An analyst who receives “84% malicious, isolate and escalate” and then discovers the gaps in the evidence has to spend time not just investigating the original alert but also unwinding the AI’s conclusion. An analyst who receives a structured inconclusive output with the gaps explicitly named can begin the investigation at the right starting point: the missing scheduled task history, the EDR outage window, the service account baseline.

One produces the appearance of efficiency. The other produces actual efficiency.

What the Third Verdict Changes

When “inconclusive” becomes a first-class output rather than a system failure, several things shift.

The AI stops functioning as a prediction engine and starts functioning as an evidence-bound investigator. The analyst receives not just a verdict but a map of what is known, what is unknown, and what would need to be resolved to reach a defensible conclusion. The audit trail reflects the actual state of the investigation rather than a confidence score that conceals the gaps underneath it. And the organization can answer, under regulatory scrutiny, not just what decision was made but what evidence existed and what limitations constrained it.

Back to the PowerShell scenario. The analyst who receives the inconclusive output with named gaps has a clear next step: pull the scheduled task creation from backup logs, check the service account’s historical execution patterns, and verify the EDR outage against the infrastructure team’s maintenance records. That is a fifteen-minute targeted investigation. If those steps resolve the gaps, the verdict changes. If they do not, the inconclusive classification stands and gets escalated with the full evidence picture attached.
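The follow-up loop in that paragraph can be expressed as a small lookup plus a re-evaluation step. The sketch below is hypothetical throughout: the gap keys, the `FOLLOW_UPS` table, the `triage_after_follow_up` function, and the status strings are assumptions chosen for illustration.

```python
from typing import Dict, List, Union

# Each named gap maps to the targeted follow-up described above.
FOLLOW_UPS: Dict[str, str] = {
    "scheduled_task_origin": "Pull the scheduled task creation event from backup logs",
    "service_account_baseline": "Check the service account's historical execution patterns",
    "edr_outage": "Verify the EDR outage against infrastructure maintenance records",
}


def triage_after_follow_up(resolved: Dict[str, bool]) -> Dict[str, Union[str, List[str]]]:
    """Decide the next state once the follow-up steps have been attempted."""
    open_gaps = [gap for gap in FOLLOW_UPS if not resolved.get(gap, False)]
    if open_gaps:
        # Unresolved gaps: the inconclusive classification stands.
        return {
            "status": "inconclusive",
            "action": "Escalate with the full evidence picture attached",
            "open_gaps": open_gaps,
        }
    # Every gap closed: the verdict can be re-evaluated on complete evidence.
    return {
        "status": "ready_for_verdict",
        "action": "Re-evaluate the verdict on the completed evidence",
        "open_gaps": [],
    }


partial = triage_after_follow_up({"scheduled_task_origin": True, "edr_outage": True})
complete = triage_after_follow_up({gap: True for gap in FOLLOW_UPS})
print(partial["status"], partial["open_gaps"])
print(complete["status"])
```

Keeping the follow-ups keyed by gap means an escalation lists exactly what is still unknown, rather than restating the original alert.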

That is what investigation discipline looks like at scale. Not a system that maximizes closure rates. A system that correctly represents what it knows and what it does not, and gives analysts the information they need to close cases that can actually be closed.

The third verdict is not a fallback. It is proof that the system understood the assignment.


References:

Rastogi, N., Pant, S., Dhanuka, D., Saxena, A., and Mairal, P. (2025). Too Much to Trust? Measuring the Security and Cognitive Impacts of Explainability in AI-Driven SOCs. ACM CCS 2025. https://arxiv.org/abs/2503.02065

Singh, R., Tariq, S., Jalalvand, F., Baruwal Chhetri, M., Nepal, S., Paris, C., and Lochner, M. (2025). LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres. https://arxiv.org/abs/2508.18947