When Arithmetic Was Not Enough for News Clustering

Archive note, October 2025: This captures an early SoCalNomad design
decision. The production pipeline later moved toward deterministic
clustering with LLMs used more selectively. The experiment was still
valuable because it exposed the data model the system needed.

SoCalNomad’s news pipeline monitored dozens of Southern California
feeds. Filtering an individual article was manageable. Determining when
several outlets were covering the same story was harder.

My first approach was pairwise:

if article_a.entities & article_b.entities:
    if model_says_same_story(article_a, article_b):
        cluster.add(article_a)
        cluster.add(article_b)

It handled simple cases. Two headlines about the same festival lineup
shared entities, passed comparison, and formed a cluster.

Then the edge cases arrived.

The transitive problem

Suppose:

Article 1 matches Article 2
Article 3 matches Article 4
Article 1 matches Article 3

A process that immediately marks articles “clustered” can create two
clusters before discovering that all four belong together.

The mistake was treating clustering as a sequence of isolated pair
decisions. Story identity is a property of the group.
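
One way to avoid the ordering trap is to collect every pairwise match first and only then derive groups. The sketch below uses a union-find structure for that; the function name and the article ids are illustrative assumptions, not the code the pipeline ran. It addresses only the ordering problem: it treats every match as transitive, which is exactly what the roundup case in the next section shows to be too coarse.

```python
from collections import defaultdict


def connected_clusters(pairs):
    """Group article ids from pairwise matches, independent of pair order."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = defaultdict(set)
    for article_id in list(parent):
        groups[find(article_id)].add(article_id)
    return sorted(sorted(group) for group in groups.values())


# The three matches from the example above, in the order that caused trouble.
matches = [(1, 2), (3, 4), (1, 3)]
clusters = connected_clusters(matches)
print(clusters)  # [[1, 2, 3, 4]]
```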

The roundup problem

The more important case was an article covering multiple stories:

1: Artist announces a concert
2: Venue roundup mentions the concert and a renovation
3: Tour story mentions the same concert
4: Venue completes its renovation

Pairwise similarity can produce:

1 ↔ 2
2 ↔ 3
2 ↔ 4

One giant cluster would be wrong. There are two stories:

  • The concert announcement
  • The venue renovation

Article 2 legitimately participates in both.

That broke an assumption in the original schema: one article, one
cluster.

The first useful result was a data-model change

Instead of a Boolean clustered flag, the system needed a
many-to-many relationship:

CREATE TABLE cluster_articles (
  cluster_id INTEGER NOT NULL,
  article_id INTEGER NOT NULL,
  membership_confidence INTEGER,
  is_primary BOOLEAN,
  PRIMARY KEY (cluster_id, article_id)
);

The roundup article could be primary evidence for one story and
supporting evidence for another.

Even if the clustering algorithm changed later, that model remained a
better representation of reality.
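
To make the many-to-many shape concrete, here is a small sketch that loads the four example articles into the table above using Python's built-in sqlite3 module. The cluster ids (10 and 20), the 0 to 100 confidence scale, and the is_primary values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cluster_articles (
      cluster_id INTEGER NOT NULL,
      article_id INTEGER NOT NULL,
      membership_confidence INTEGER,
      is_primary BOOLEAN,
      PRIMARY KEY (cluster_id, article_id)
    )
    """
)

# Hypothetical ids: cluster 10 is the concert, cluster 20 is the renovation.
# Columns: cluster_id, article_id, membership_confidence, is_primary.
rows = [
    (10, 1, 95, True),
    (10, 2, 70, False),  # the roundup, as supporting evidence
    (10, 3, 90, False),
    (20, 2, 60, False),  # the same roundup, in a second story
    (20, 4, 95, True),
]
conn.executemany("INSERT INTO cluster_articles VALUES (?, ?, ?, ?)", rows)

# Article 2 now resolves to both stories instead of a single flag.
clusters_for_roundup = [
    row[0]
    for row in conn.execute(
        "SELECT cluster_id FROM cluster_articles "
        "WHERE article_id = ? ORDER BY cluster_id",
        (2,),
    )
]
print(clusters_for_roundup)  # [10, 20]
```

The composite primary key still prevents the same article from being recorded twice in the same cluster.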

Why I reached for an LLM

I had been trying to infer story boundaries by counting shared
entities:

  • Same artist
  • Same venue
  • Similar date
  • Similar headline terms

Those signals are useful, but they do not explain why an
article mentions something. A venue can be the subject of one article
and background context in another.

This was the point where I wrote, in effect: arithmetic is not easily
going to solve meaning.

The proposed answer was a multi-pass workflow.

Pass 1: candidate generation

Use inexpensive deterministic signals to avoid comparing every
article with every other article:

  • Shared normalized entities
  • Publication window
  • Headline similarity
  • Geographic relevance
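
A minimal sketch of this pass, covering only two of the four signals (shared entities and publication window). The Article class, the three-day window, and the sample data are assumptions made for the example. An inverted index from entity to articles means only articles that share at least one entity are ever paired.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Article:
    id: int
    entities: frozenset
    published: datetime


def candidate_pairs(articles, window=timedelta(days=3)):
    """Return sorted (id, id) pairs sharing an entity within the window."""
    by_entity = defaultdict(list)
    for article in articles:
        for entity in article.entities:
            by_entity[entity].append(article)

    pairs = set()
    for group in by_entity.values():
        for a, b in combinations(group, 2):
            if abs(a.published - b.published) <= window:
                pairs.add((min(a.id, b.id), max(a.id, b.id)))
    return sorted(pairs)


base = datetime(2025, 1, 10)  # arbitrary date for the example
articles = [
    Article(1, frozenset({"artist x", "venue y"}), base),
    Article(2, frozenset({"artist x", "venue y"}), base + timedelta(days=1)),
    Article(3, frozenset({"artist x"}), base + timedelta(days=2)),
    Article(4, frozenset({"venue y"}), base + timedelta(days=1)),
    Article(5, frozenset({"venue y"}), base + timedelta(days=30)),
]
pairs = candidate_pairs(articles)
print(pairs)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
```

Article 5 shares a venue but falls outside the window, so it produces no candidates. The pair (1, 4) is a deliberate over-match: the articles share a venue but not a story, and sorting that out is left to the later passes.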

Pass 2: story identification

Give the model the candidate graph and article context, then ask for
distinct stories and memberships:

{
  "stories": [
    {
      "summary": "Artist announces concert at venue",
      "article_ids": [1, 2, 3]
    },
    {
      "summary": "Venue completes renovation",
      "article_ids": [2, 4]
    }
  ]
}
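
Because a model can return malformed output, a response in this shape is worth checking before anything is stored. The sketch below is one possible validator, assuming the two fields shown above and a set of article ids known to the caller; the function name and error messages are hypothetical.

```python
import json


def parse_stories(raw, known_ids):
    """Validate a model response and return (story_index, article_id) pairs."""
    data = json.loads(raw)
    stories = data.get("stories") if isinstance(data, dict) else None
    if not isinstance(stories, list):
        raise ValueError("expected an object with a 'stories' list")

    memberships = []
    for index, story in enumerate(stories):
        if not isinstance(story, dict):
            raise ValueError(f"story {index}: expected an object")
        summary = story.get("summary")
        ids = story.get("article_ids")
        if not isinstance(summary, str) or not summary.strip():
            raise ValueError(f"story {index}: missing summary")
        if not isinstance(ids, list) or not ids:
            raise ValueError(f"story {index}: missing article_ids")
        # Reject ids the model invented rather than silently dropping them.
        unknown = [i for i in ids if i not in known_ids]
        if unknown:
            raise ValueError(f"story {index}: unknown article ids {unknown}")
        memberships.extend((index, i) for i in ids)
    return memberships


raw_response = """
{
  "stories": [
    {"summary": "Artist announces concert at venue", "article_ids": [1, 2, 3]},
    {"summary": "Venue completes renovation", "article_ids": [2, 4]}
  ]
}
"""
memberships = parse_stories(raw_response, known_ids={1, 2, 3, 4})
print(memberships)  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4)]
```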

Pass 3: validation

Ask whether every proposed member actually supports the story
summary. Low-confidence memberships could be removed or sent to
editorial review.
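
The routing described here can be sketched as a three-way threshold. The cutoffs (80 and 50) and the 0 to 100 scale are placeholder assumptions; real values would have to come from reviewing actual model output.

```python
def route_membership(confidence, accept_at=80, review_at=50):
    """Map a validation confidence to 'accept', 'review', or 'drop'."""
    if confidence >= accept_at:
        return "accept"
    if confidence >= review_at:
        return "review"
    return "drop"


# Hypothetical validation results for three proposed memberships.
proposed = [
    {"story": 0, "article_id": 1, "confidence": 95},
    {"story": 0, "article_id": 2, "confidence": 65},
    {"story": 1, "article_id": 2, "confidence": 30},
]
routed = {
    (m["story"], m["article_id"]): route_membership(m["confidence"])
    for m in proposed
}
print(routed)  # {(0, 1): 'accept', (0, 2): 'review', (1, 2): 'drop'}
```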

Pass 4: leftovers

Review unassigned articles for roundups or weakly connected stories
that candidate generation missed.
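
Finding the leftovers is a set difference between the articles in the batch and the ids the model assigned. A small sketch, assuming stories in the Pass 2 JSON shape; the function name is hypothetical.

```python
def unassigned_articles(all_ids, stories):
    """Return sorted article ids that appear in no proposed story."""
    assigned = {i for story in stories for i in story["article_ids"]}
    return sorted(set(all_ids) - assigned)


stories = [
    {"summary": "Artist announces concert at venue", "article_ids": [1, 2, 3]},
    {"summary": "Venue completes renovation", "article_ids": [2, 4]},
]
leftovers = unassigned_articles([1, 2, 3, 4, 5, 6], stories)
print(leftovers)  # [5, 6]
```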

What was attractive about the design

The model was being used for comprehension, not retrieval or state
management.

SQL remained responsible for:

  • Storing articles
  • Finding candidate relationships
  • Recording cluster membership
  • Enforcing idempotency

The LLM’s job was narrower: interpret context and propose story
boundaries.

Structured JSON made that output testable. Confidence scores created
a place for thresholds and human review.

Batching candidate comparisons also mattered. An LLM call per pair
would have been slow and expensive. Grouping related candidates into a
single request reduced overhead and gave the model more context.
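
As a rough illustration of the batching idea, the sketch below groups candidate pairs by the entity that linked them and emits one request payload per group, split when a group exceeds a size cap. Grouping by entity and the cap of 20 articles are assumptions for this example; the source does not say how batches were formed.

```python
from collections import defaultdict


def batch_by_entity(pairs_with_entity, max_articles=20):
    """Turn (entity, id_a, id_b) candidates into per-entity request batches."""
    ids_by_entity = defaultdict(set)
    for entity, a, b in pairs_with_entity:
        ids_by_entity[entity].update((a, b))

    batches = []
    for entity in sorted(ids_by_entity):
        ids = sorted(ids_by_entity[entity])
        # Split oversized groups so one request stays within a budget.
        for start in range(0, len(ids), max_articles):
            batches.append(
                {"entity": entity, "article_ids": ids[start:start + max_articles]}
            )
    return batches


candidates = [
    ("artist x", 1, 2),
    ("artist x", 1, 3),
    ("artist x", 2, 3),
    ("venue y", 2, 4),
]
batches = batch_by_entity(candidates)
print(len(candidates), "pairs ->", len(batches), "requests")
```

Here four pairwise comparisons collapse into two requests, and each request carries all the articles tied to one entity rather than two at a time.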

What I would phrase differently now

The original draft argued that traditional programming could not solve
the problem and that the LLM approach was more accurate by nature. That
was too absolute.

Deterministic systems can perform semantic clustering with
embeddings, graph algorithms, supervised classifiers, and carefully
designed rules. LLMs can also hallucinate relationships, vary between
runs, exceed budgets, or return malformed output.

The real decision was not arithmetic versus intelligence. It was
where to put ambiguity.

For an early system with limited labeled data, an LLM offered a fast
way to externalize editorial judgment and learn what the production
rules needed to represent. Later, SoCalNomad moved more clustering work
into deterministic code because repeatability, cost, and observability
mattered.

The durable lesson

The experiment changed how I approached AI-assisted architecture.

First, model the real domain before optimizing the algorithm.
Articles can belong to multiple stories, so the schema must allow that
regardless of who makes the classification.

Second, separate candidate generation from semantic judgment.
Deterministic code is good at reducing the search space. Models can be
useful where context changes the meaning of otherwise similar
signals.

Third, treat model output as a proposal. Validate it, store reasoning
when useful, and make retries idempotent.
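
One way to make retries idempotent is to let the composite primary key absorb repeated writes. The sketch below uses SQLite's INSERT OR REPLACE against the cluster_articles table from earlier; the helper name and sample rows are hypothetical, and other databases would need their own upsert syntax.

```python
import sqlite3


def record_memberships(conn, memberships):
    """Write memberships so that replaying the same batch changes nothing."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO cluster_articles "
            "(cluster_id, article_id, membership_confidence, is_primary) "
            "VALUES (?, ?, ?, ?)",
            memberships,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cluster_articles (
      cluster_id INTEGER NOT NULL,
      article_id INTEGER NOT NULL,
      membership_confidence INTEGER,
      is_primary BOOLEAN,
      PRIMARY KEY (cluster_id, article_id)
    )
    """
)

batch = [(10, 1, 95, True), (10, 2, 70, False), (20, 2, 60, False)]
record_memberships(conn, batch)
record_memberships(conn, batch)  # simulated retry of the same model output

row_count = conn.execute("SELECT COUNT(*) FROM cluster_articles").fetchone()[0]
print(row_count)  # 3
```

A retry that returns a different confidence for the same membership overwrites the earlier value instead of adding a second row.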

The most valuable result was not the multi-pass prompt. It was
recognizing that the original single-membership model had encoded the
wrong understanding of journalism.