Our false positive rate dropped from 38% to 4%

38%. That was our false positive rate when we shipped the first version of our intent detection system. More than one in three messages the model flagged as "hot prospect" turned out to be someone venting about a competitor, a student writing a paper, or a consultant publishing an opinion piece. Our customers were sending replies to people who had zero interest in buying anything. They came back angry. Fair.

The obvious move would have been to throw more data at it, use a bigger model, rent more compute. We didn't do that. What we changed was simpler and more uncomfortable.

The problem wasn't the model, it was the definition

When you tell an ML model to "detect conversations where someone is looking for a solution", you're asking it something it can't resolve without business context. "Looking for a solution" is ambiguous. Someone saying "I'm looking for alternatives to HubSpot" is shopping. Someone saying "I finally found a solution" is confirming a past purchase. Our model mixed these up in roughly 23% of cases.

So we scrapped the label and rebuilt it from scratch. Not "seeking a solution" but something operationally specific: is there a future-tense action verb? Is there any signal of budget or timeline? Is the author responding to suggestions from others in the thread, indicating active evaluation mode?
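To make "operationally specific" concrete, here is a minimal Python sketch of those three questions as a checklist. The cue lists, the `RubricResult` class, and the `apply_rubric` function are hypothetical names for illustration only; the real rubric lived in an annotation reference doc, not in a pair of regexes.

```python
import re
from dataclasses import dataclass

# Illustrative cue lists only (assumed, not the actual annotation rubric).
FUTURE_ACTION = re.compile(
    r"\b(looking for|need to|want to|planning to|going to|about to)\b",
    re.IGNORECASE,
)
BUDGET_OR_TIMELINE = re.compile(
    r"(\$\s?\d|\bbudget\b|\bper month\b|\bthis quarter\b|\bby (next|end of)\b)",
    re.IGNORECASE,
)


@dataclass
class RubricResult:
    future_action: bool
    budget_or_timeline: bool
    replying_to_suggestions: bool

    @property
    def criteria_met(self) -> int:
        # Booleans sum as 0/1, so this counts how many criteria fired.
        return sum(
            [self.future_action, self.budget_or_timeline, self.replying_to_suggestions]
        )


def apply_rubric(text: str, replies_to_suggestion: bool = False) -> RubricResult:
    """Answer the three rubric questions for one post.

    `replies_to_suggestion` is assumed to come from thread metadata,
    since it cannot be read off the text alone.
    """
    return RubricResult(
        future_action=bool(FUTURE_ACTION.search(text)),
        budget_or_timeline=bool(BUDGET_OR_TIMELINE.search(text)),
        replying_to_suggestions=replies_to_suggestion,
    )


shopping = apply_rubric("I'm looking for alternatives to HubSpot")
past_purchase = apply_rubric("I finally found a solution")
```

With these toy cues, the HubSpot sentence meets one criterion and the "finally found" sentence meets none, which is the distinction the old label could not express.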

This definitional work took three weeks. Three weeks of manually annotating hundreds of Reddit and LinkedIn posts with four people who disagreed constantly, then resolving those disagreements in a reference doc. Tedious. Non-negotiable.
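One way to put a number on "disagreed constantly" is an agreement statistic such as Cohen's kappa, computed for a pair of annotators. The sketch below is an assumption about how you might track it, not a description of what we actually ran; the labels are made-up toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the two annotators labeled differently."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Toy data: 1 = purchase intent, 0 = no intent.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohens_kappa(annotator_a, annotator_b)
to_resolve = disagreements(annotator_a, annotator_b)  # items to discuss
```

The list of disagreeing items is the useful part: those are the posts that go into the reference doc.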

The features we were ignoring

Version one looked almost exclusively at raw text. Makes sense, right? Someone writes something, you analyze what they wrote. Except we were ignoring a pile of contextual signals that, once we added them, cut the false positive rate from 38% to 19% in two sprints.

First ignored feature: position in the thread. A post in position 1 (original question) has a completely different intent distribution than a post in position 8 (reply to a reply). People opening a thread on r/SaaS asking "what's the best tool for X" are statistically far more likely to be in buying mode than people commenting at the bottom of a three-week-old thread.

Second ignored feature: author history on the platform. Someone who has posted 400 times on LinkedIn and mentions sales tools every week is almost certainly a consultant or content creator. Their post about "the best CRM solutions" is an opinion, not purchase intent. We started scoring the content-to-question ratio in author history. It wiped out an entire category of false positives.

Third ignored feature: the source subreddit or LinkedIn group. r/smallbusiness and r/sales behave completely differently. The same post in those two contexts has wildly different intent probabilities. We had been treating all sources as interchangeable. That was just wrong.
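Put together, the three signals can sit next to the text as plain features. This is a minimal sketch under stated assumptions: the `Post` fields and the exact ratio formula are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    thread_position: int        # 1 = original post that opens the thread
    source: str                 # e.g. "r/SaaS" or a LinkedIn group name
    author_post_count: int      # total posts by this author (assumed available)
    author_question_count: int  # how many of those posts were questions


def context_features(post: Post) -> dict:
    """Contextual features to feed the model alongside the raw text."""
    non_questions = post.author_post_count - post.author_question_count
    # Content-to-question ratio; the +1 avoids dividing by zero
    # for authors who have never asked a question.
    ratio = non_questions / (post.author_question_count + 1)
    return {
        "is_thread_opener": post.thread_position == 1,
        "thread_position": post.thread_position,
        "source": post.source,
        "content_to_question_ratio": ratio,
    }


prolific_author = Post(
    text="The best CRM solutions in 2024",
    thread_position=1,
    source="LinkedIn",
    author_post_count=400,
    author_question_count=3,
)
features = context_features(prolific_author)
```

A high ratio here points at a content creator rather than a buyer, even when the text alone reads like a shopping question.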

The moment we stopped optimizing for overall accuracy

Here's something nobody tells you about ML in production: optimizing for global accuracy is usually the wrong objective. You can hit 96% accuracy on an imbalanced dataset and still build something useless.

Genuine positives made up roughly 8% of our data. A model that always predicted "negative" would have achieved 92% accuracy. Great number. Garbage product.
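The arithmetic is easy to check. A small Python sketch with a synthetic dataset of 1,000 messages at the same 8% positive share (the data is made up purely to reproduce the numbers):

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of real positives that the model actually flagged."""
    positives = sum(y_true)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / positives if positives else 0.0


# Synthetic data: 80 real prospects out of 1,000 messages (8%).
y_true = [1] * 80 + [0] * 920
always_negative = [0] * 1000

baseline_accuracy = accuracy(y_true, always_negative)
baseline_recall = recall(y_true, always_negative)
print(baseline_accuracy)  # 0.92
print(baseline_recall)    # 0.0
```

92% accuracy, zero prospects found.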

We switched to precision and recall on the positive class only, with a heavy emphasis on precision because our customers would rather miss a few opportunities than blast irrelevant people. That metric shift forced completely different threshold decisions. The model got more conservative, flagged less, but what it flagged was right.
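In practice, "heavy emphasis on precision" comes down to how you pick the decision threshold. A minimal sketch, assuming the model outputs a score between 0 and 1 per message; the function names, scores, and the 0.9 target are illustrative, not our production values.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall on the positive class at a given threshold."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_pos = sum(flagged)
    precision = true_pos / len(flagged) if flagged else None  # undefined if nothing flagged
    total_pos = sum(labels)
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision):
    """Lowest threshold that meets the precision target.

    Recall can only fall as the threshold rises, so the lowest
    qualifying threshold keeps as much recall as possible.
    """
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision is not None and precision >= target_precision:
            return threshold, precision, recall
    return None  # no threshold reaches the target on this data


# Toy validation set: model scores and human labels (1 = real prospect).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

choice = lowest_threshold_for_precision(scores, labels, target_precision=0.9)
```

On this toy data the function settles on a threshold of 0.9: only two messages get flagged, both correct, and half the real prospects are missed. That is the trade described above, made explicit.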

That's when we understood we were building a precision instrument, not a volume machine. Customers who wanted 500 signals a week weren't our people. Customers who wanted 30 hyper-qualified signals and closed 20% of them? Those were our people.

Three months of feedback loop we should have built on day one

We shipped a simple feature: every signal we send to a customer can be marked "relevant" or "off-target" with one click. No form, no NPS survey, just two buttons. In three months we collected 4,200 human annotations from real customers on real signals.

That dataset is worth more than anything we could have bought or scraped. Because it's labeled by people who know exactly what they sell, who their ICP is, and who have a direct financial interest in the system being accurate. A $5/hr annotator doesn't have that alignment.

We retrained on that dataset every six weeks. Each cycle, the model absorbed industry-specific nuances that no public dataset could have taught it. That's how we went from 19% to 4%. Not by building a more complex architecture. By feeding it better human signal.
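The plumbing behind those two buttons can be very small. A sketch of how the clicks might be turned into both a running metric and training labels; the `FeedbackEvent` fields and function names are hypothetical. Note that the rate tracked here is the share of flagged signals marked off-target, which is how "false positive rate" is used throughout this post.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    signal_id: str
    customer_id: str
    verdict: str  # "relevant" or "off-target", from the one-click buttons


def off_target_rate(events):
    """Share of rated signals that customers marked off-target."""
    rated = [e for e in events if e.verdict in ("relevant", "off-target")]
    if not rated:
        return None  # nothing rated yet
    return sum(e.verdict == "off-target" for e in rated) / len(rated)


def to_training_labels(events):
    """Map each rated signal to a binary label for the next retraining cycle."""
    return {
        e.signal_id: 1 if e.verdict == "relevant" else 0
        for e in events
        if e.verdict in ("relevant", "off-target")
    }


# Toy cycle: 96 relevant clicks and 4 off-target clicks.
events = [FeedbackEvent(f"s{i}", "c1", "relevant") for i in range(96)]
events += [FeedbackEvent(f"s{i}", "c1", "off-target") for i in range(96, 100)]

rate = off_target_rate(events)
training_labels = to_training_labels(events)
print(rate)  # 0.04
```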

This is the core idea behind Novaseed: the model gets smarter through real customer usage, not through generic training data. But even we took too long to fully commit to that principle.

What we'd do differently from the start

Launching at 38% false positives and needing six months of production to get down to 4% is a lot. We lost a few customers over it. We also learned faster than any lab training models in isolation ever could.

If you're building an intent detection system, here's what I'd do from day one: manually annotate 500 examples with explicit disagreements before training anything. Define the business metric before the ML metric. Treat user feedback as a first-class feature, not an afterthought. And accept that the first weeks in production are learning weeks, not performance weeks.

The perfect model doesn't exist before you have real users. And real users won't wait around if the model is bad for too long. That's the core tension in every B2B ML product. We lived it, survived it, and the 4% we're at today has nothing to do with a brilliant algorithm. It came from six months of iteration we could have shortened if we'd been less impatient at the start.
