The Triage Gap: How Valid Reports Become Public Zero-Days

AI multiplied the reports. No one multiplied the judgment. Triage just became security's biggest challenge.

The industry has a triage problem, and it stopped being invisible this spring. Lovable, ClickUp, and Microsoft each received valid vulnerability reports that were closed in triage before anyone understood what they contained. The cost ranged from months of exposed user data to a zero-day published on the open internet.

And the problem is about to get bigger, not smaller. AI has made finding vulnerabilities accessible to almost anyone, frontier models like Mythos are raising the ceiling on what automated research can surface, and report volume is climbing across every program. Many of those reports are AI-generated and mostly false positives, which buries the genuine ones and makes them hard to pick out. Most companies have no reliable way to handle that. When hundreds of reports arrive about a product that changes every week, how do you know which ones are right and which are wrong?

That's the question this post is about, and the one we built Pi to answer.

Three companies, one pattern

Lovable: This spring

The documentation described a product that no longer existed.

A backend refactor re-exposed chat logs on projects marked "public." Researchers caught it and reported it, and triage closed the reports as intended behavior — because the documentation said public projects were public. Nobody in that chain did anything unreasonable. The documentation was simply a year behind a product that had moved to private-by-default, so the reports were judged against a product that no longer existed. To their credit, Lovable published an honest postmortem and credited the researchers.

CLOSED AS INTENDED BEHAVIOR: Months of exposed chat logs

ClickUp: Two weeks later

Same surface, entirely new impact behind it.

893 customer emails and a live API token had sat in publicly readable feature-flag configs for over a year. The issue was reported three separate times, and each report was closed in triage before ClickUp's security team saw one. The decisive report was marked a duplicate of an informational finding from fifteen months earlier — same surface, entirely new impact behind it, and no realistic way for a triager working a queue to know the difference. ClickUp now re-reviews every closed report themselves.

CLOSED AS DUPLICATE: 893 emails · one live API token

Microsoft: Then the cost changed category

If this can happen there, it isn't about effort or talent.

A researcher's report was closed with a standard reply. The researcher published that reply, and a month later published Bitskrieg — a working BitLocker bypass contributed by a second researcher in solidarity — with a comment thread filling up with similar stories. MSRC handles more reports than almost any program on earth, with some of the best people in the industry. If this can happen there, it isn't about effort or talent. The model itself is at its limit.

CLOSED WITH A STANDARD REPLY: A public BitLocker bypass

Why this keeps happening

Take something as ordinary as an IDOR. A researcher reports that changing an ID in a request returns another customer's data. The report is three paragraphs long. The questions it opens are not.

Q1 - Is it even reachable, or does a gateway upstream already enforce tenancy?

Q2 - If it's real, where does the fix go: the endpoint handler, the shared data-access layer that six other services import, or the gateway itself?

Q3 - Who owns that code today, after last year's reorg, now that the engineer who wrote it is gone?

Q4 - And which mitigation actually fits this environment? The org has a tenant-scoping helper for exactly this, and a generic check would quietly break internal admin tooling that legitimately reads across tenants.

Every answer lives somewhere different: in the code as it runs today, in the cloud config, in an architecture doc, in a Slack thread from two years ago, in the head of someone who left. A triager working a queue has the report and nothing else, so the call gets made on resemblance. An AI assistant reading the same thin inputs makes the same call, faster, with a confidence score attached. And the engineer who could chase down every answer is looking at a few hundred reports behind this one.
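To make Q4 concrete, here is a minimal sketch of the same IDOR handled three ways. Everything in it is hypothetical and exists only for illustration: the `Caller` shape, the in-memory `INVOICES` store, and the function names are assumptions, not code from any real product. It shows why a generic tenancy check can close the bug and still be the wrong fix when internal admin tooling legitimately reads across tenants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """Who is making the request (hypothetical shape)."""
    tenant_id: str
    is_internal_admin: bool = False


# Toy in-memory store standing in for a real database (illustrative data).
INVOICES = {
    "inv-1": {"tenant_id": "acme", "total": 120},
    "inv-2": {"tenant_id": "globex", "total": 980},
}


def get_invoice_vulnerable(caller: Caller, invoice_id: str) -> dict:
    # IDOR: the record is looked up by id alone, so any caller can read
    # another tenant's invoice by changing the id in the request.
    return INVOICES[invoice_id]


def get_invoice_generic_check(caller: Caller, invoice_id: str) -> dict:
    # Generic fix: reject anything outside the caller's own tenant.
    # This closes the IDOR but also blocks legitimate admin reads.
    record = INVOICES[invoice_id]
    if record["tenant_id"] != caller.tenant_id:
        raise PermissionError("cross-tenant access denied")
    return record


def get_invoice_scoped(caller: Caller, invoice_id: str) -> dict:
    # Hypothetical tenant-scoping helper: enforces tenancy for customers
    # while deliberately allowing internal admin tooling to read across tenants.
    record = INVOICES[invoice_id]
    if caller.is_internal_admin or record["tenant_id"] == caller.tenant_id:
        return record
    raise PermissionError("cross-tenant access denied")
```

Which of the last two is correct cannot be read off the report. It depends on whether cross-tenant admin reads are an accepted design decision, and that knowledge lives outside the ticket.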

Worse, a wrong close doesn't expire. It becomes the reference point. The next time a researcher reports the same issue, triage matches it against the first verdict and closes it as a duplicate of something already judged harmless. That's exactly how ClickUp's decisive report disappeared. The bug ends up shielded from remediation by the very system meant to surface it, while staying fully exploitable for everyone who never intended to file a report.
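The mechanism above can be sketched in a few lines. This is an illustration under stated assumptions, not any platform's actual dedup logic: the `Report` fields, the report IDs, and both matching functions are hypothetical. It contrasts matching on surface alone with matching on surface and impact together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    report_id: str
    surface: str  # where the issue lives, e.g. an endpoint or a config file
    impact: str   # what the reporter demonstrated an attacker can obtain


def match_on_surface(new: Report, closed: list[Report]) -> Optional[Report]:
    # Queue-style dedup: any earlier report on the same surface is treated
    # as the original, whatever impact the new report demonstrates.
    for old in closed:
        if old.surface == new.surface:
            return old
    return None


def match_on_surface_and_impact(new: Report, closed: list[Report]) -> Optional[Report]:
    # Stricter dedup: a report is a duplicate only if both the surface
    # and the demonstrated impact match an earlier one.
    for old in closed:
        if old.surface == new.surface and old.impact == new.impact:
            return old
    return None


# Hypothetical history: one old informational close on a public config.
history = [Report("R-1", "public-config", "informational")]
incoming = Report("R-2", "public-config", "credential-exposure")

print(match_on_surface(incoming, history))             # matches R-1
print(match_on_surface_and_impact(incoming, history))  # None
```

The first function inherits the old verdict. The second forces the new impact to be judged on its own, which is the step a surface-only match skips.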

What Pi does differently

Pi's answer is what we call institutional product security memory. Picture onboarding an engineer who has read the design docs behind every service and every line of the codebase, and who knows the cloud infrastructure it all runs on. That engineer has followed the Slack and Teams threads where engineers make their real decisions, and knows not just how each system is built but why: which tradeoffs were weighed, what the team optimized for, which risks were accepted on purpose. Imagine a timeline for every service you've ever developed: every design decision, every change, every incident and its fix, in order, with the reasoning attached. That context is what separates a real vulnerability from a false alarm, and it's exactly what tools reading code in isolation can't see.

Now run that same IDOR through that memory. Pi knows the endpoint is internet-facing, knows the gateway doesn't enforce tenancy, and knows why: that check moved into the service layer two years ago as a deliberate tradeoff, the kind of decision a queue triager could never see. It investigates the actual code path and confirms the missing check, instead of matching report text against old tickets. It knows the handler is shared, which means this isn't one endpoint, it's a pattern, and it sweeps every service that imports it. It knows which team owns that path today, not last year. And it knows the mitigation this codebase already trusts, the same tenant-scoping helper the payments service uses, so the fix fits the environment instead of breaking it.
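The "it's a pattern, not one endpoint" step is at heart a reverse-dependency walk. Here is a minimal sketch, assuming a hypothetical import map; the module names and the `affected_by` function are made up for illustration and are not a description of how Pi is implemented.

```python
from collections import deque

# Hypothetical import map: module -> modules it imports directly.
IMPORTS = {
    "billing_api": ["shared_handler", "auth"],
    "reports_api": ["export_lib"],
    "export_lib": ["shared_handler"],
    "status_page": ["auth"],
}


def affected_by(target: str, imports: dict[str, list[str]]) -> set[str]:
    """Return every module that imports `target`, directly or transitively."""
    # Invert the map: module -> set of modules that import it.
    importers: dict[str, set[str]] = {}
    for module, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(module)

    # Breadth-first walk up the importer graph, starting from the target.
    affected: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in importers.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(affected_by("shared_handler", IMPORTS)))
# ['billing_api', 'export_lib', 'reports_api']
```

A report that names one endpoint in `billing_api` would, on this toy map, put `export_lib` and `reports_api` in scope as well, while `status_page` stays out because it never touches the shared handler.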

When the flaw is real, the output isn't a ticket. It's a draft fix that uses the helper this codebase already relies on, opened against every service that shares the pattern, not just the one named in the report. Pi checks that each fix actually closes the issue before the finding is resolved, and the pattern is recorded so the next PR that reintroduces it gets flagged in review. Customers tell us triage time drops by up to 80%, and the weeks engineers spent re-fixing the same recurring bugs come back.

Up to 80%

REPORTED BY CUSTOMERS: Drop in triage time, with the engineering weeks once lost to re-fixing recurring bugs coming back to the roadmap.

The deeper difference is that none of this resets. Every report triaged, every fix shipped, every decision recorded feeds a memory that compounds and never forgets. The knowledge that resolves today's issue is the same knowledge that catches the next one in design or review, before it ships. A scanner starts every run from zero. Pi starts from everything it has already learned.

Proof from a live program

We've already watched this play out in production. One of our customers runs their bug bounty program on a major platform, and platform triage had closed four reports as informational. When Pi ingested the program's history, it flagged all four. Not by rereading the report text, but by doing the work no queue can do: correlating the four against every other finding in the program, surfacing the pattern they shared, then tracing that pattern into the code itself. Root cause analysis confirmed each report was valid, showed exactly what had gone wrong, and rescored them against the product's real context. The four informationals turned out to be two mediums, a high, and a critical.

All four were reopened and resolved. That chain is the whole point: a report judged against one old ticket looks informational, and the same report judged against the program's history and the code behind it looks like what it actually is. And the win wasn't only the customer's. Four researchers had done real work, found real flaws, and been told they'd found nothing. Reopening those reports gave them the severity and the credit they deserved the first time. After the spring the industry just had, that matters: every wrong close teaches a good researcher to stop reporting to you, and every honest correction teaches the opposite.

The question worth asking

Would the researcher who got your last hundred closes say someone read their work? If you're not sure, we should talk.

Because what Pi brings is a whole different dimension: the knowledge of how things are done in your organization, serving every operator who acts on it, whether that operator is a human or an agent. The reports will keep multiplying. Now the judgment can too.
