Is AI the Catalyst for Practical Software Verification?

3 minute read

Formal verification for software systems has been a hot topic in systems research over the past decade. The last two OSDI conferences have yielded four best-paper awards for work applying formal verification to real systems. The reason for this popularity is clear. Verification lets developers write programs that are bug-free by construction, with correctness established at compile time, rather than relying on flaky unit tests, cumbersome integration tests, and cryptic logs that frustrate developers.

However, industry adoption of verification remains limited. Beyond AWS and MongoDB, few tech companies actively use verification in their development process. Even when verification is used, it is typically offered as a bespoke service led by researchers rather than as a tool accessible to everyday developers. Existing verification tools are simply too restrictive in the domains they support, or demand so much specialized knowledge and manual effort that the return on investment is unjustifiable for most businesses.

Enter AI. With the explosion of coding assistants, one natural question to ask is: can artificial intelligence help crack the verification adoption barrier? I think the answer is likely yes.

In The Bitter Lesson, Richard Sutton observes that in problems like chess and Go, general search- and learning-based AI methods that leverage the scaling of computation have consistently outperformed approaches that encode human domain knowledge about the structure of these games. The same story has played out in domains such as speech recognition and computer vision.

Verification has striking parallels. Like chess and Go, it can be characterized as a search problem. Given a target program and a specification, the goal is to find a proof string that passes a black-box verifier, i.e., the proof “convinces” a verifier oracle that the target program satisfies the specification. Some form of proof is always needed to guide the verifier, because the general problem of deciding whether a program satisfies a specification is undecidable. The Dafny-style sketch below shows the three pieces: a target program, a specification, and a proof the verifier checks.

// Target program we want to prove: return a sorted copy of arr
function sort(arr: seq<int>) : seq<int> {
    // ...
}

// Specification: the sort function produces a sorted copy of any input
predicate sortIsCorrect(arr: seq<int>) {
    multiset(arr) == multiset(sort(arr))
    && forall i, j :: 0 <= i < j < |sort(arr)| ==> sort(arr)[i] <= sort(arr)[j]
}

// Proof that our target program is correct, over all possible inputs
lemma sortIsCorrectProof()
    ensures forall arr: seq<int> :: sortIsCorrect(arr)
{
    // < Proof code here >
}

Today, we rely on developers to manually derive this proof. Hence, the practicality of verification boils down to how easily that string can be found. Current research uses rule-based approaches to narrow the search space that human developers must explore. For instance, tools like Basilisk (OSDI’25) exploit recurring structures in these proofs to automatically generate proof fragments, but still require heavy human input to compose the fragments into a complete proof. In fact, it is unlikely we can invent a rule-based algorithm that fully automates proof derivation: the problem is too sensitive to the structure and correctness properties of the target program, and to the whims of the underlying black-box verifier.
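To make this concrete, here is a minimal, hypothetical example (not drawn from Basilisk or any other tool) of the kind of proof developers write by hand today: showing that summing a sequence distributes over concatenation. Even for a property this small, the developer must spell out the case split, the sequence-equality hints, and the inductive call the verifier needs.

// Summing an integer sequence, defined by recursion on the head
function Sum(s: seq<int>): int {
    if |s| == 0 then 0 else s[0] + Sum(s[1..])
}

// A hand-written proof that Sum distributes over concatenation;
// the case split and recursive call encode the induction the verifier needs
lemma SumAppend(a: seq<int>, b: seq<int>)
    ensures Sum(a + b) == Sum(a) + Sum(b)
{
    if |a| == 0 {
        assert a + b == b;                  // base case: a is empty
    } else {
        assert (a + b)[1..] == a[1..] + b;  // peel off the first element
        SumAppend(a[1..], b);               // apply the induction hypothesis
    }
}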

While difficult for humans, such a problem is perfect for AI. The verifier oracle provides a clear, binary objective function: a proof that passes the verifier is correct, while one that fails is incorrect. This eliminates the ambiguity that plagues many AI applications. Unlike code generation, where correctness can be subjective, verification provides a definitive oracle that the AI cannot “hallucinate” around. The AI is free to explore the search space, but it cannot produce a misleadingly convincing but incorrect result.
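As a toy, hypothetical illustration of that binary verdict (again Dafny-style, unrelated to the sort example above): the lemma below is either accepted or rejected outright, and nothing the proof author writes can talk the verifier into a wrong conclusion.

// The larger of two integers
function Max(a: int, b: int): int {
    if a >= b then a else b
}

// The verifier's verdict on this lemma is binary: accept or reject
lemma MaxIsUpperBound(a: int, b: int)
    ensures Max(a, b) >= a && Max(a, b) >= b
{
    // Accepted with an empty body: unfolding Max and linear arithmetic
    // close the goal. Strengthen the postcondition to Max(a, b) > a and
    // the same lemma is rejected, however plausible it might look.
}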

Moreover, proof code is untrusted code that serves only as input to the verifier, and does not require human review. Once the verifier accepts it, the proof’s job is done. This means AI can produce sloppy proofs without much practical downside, as long as they pass verification.

We’re still in the early days of software verification, and I’m incredibly excited about what’s coming once the power of AI and computational scaling is brought to bear. AI might just be the missing catalyst that tips it from niche research to standard practice.