Fifty-Five Published Stories That the Database Thought Were Unpublished

Archive note: This incident occurred in the SoCalNomad development
environment on December 29, 2025. Identifiers and credentials have been
omitted.

The publishing pipeline appeared to have one pending story.

The database disagreed with WordPress.

Fifty-five story clusters contained a WordPress post identifier,
which meant a post had been created, but their editorial status still
read “candidate” or “approved.” They had crossed the system boundary
successfully and then failed to record the final transition.

They were orphaned in an unusual sense: the public artifact existed,
but the internal workflow did not recognize what it had done.

The Dangerous Partial Success

Distributed workflows rarely fail in one clean piece.

A publishing operation can perform several steps:

  1. Claim an assignment.
  2. Generate or validate content.
  3. Create a WordPress post.
  4. Store the returned post identifier.
  5. Mark the cluster published.
  6. Mark the assignment complete.
  7. Emit an event for downstream systems.

If step three succeeds and step five fails, the post exists but the
cluster still looks unpublished to any workflow that checks only the
internal status. Retrying the whole operation from that state can
create a duplicate post.

That was the shape of this incident. WordPress had accepted the
posts, but the local state machine had not reached its terminal
state.

The Evidence Was Already There

The fix began with a simple invariant:

A cluster with a valid WordPress post identifier cannot honestly
remain an unpublished candidate.

The post identifier was stronger evidence than the stale status
field. It represented an acknowledgment from the external system.

After confirming the records, I reconciled the editorial state for
clusters that already contained post identifiers. The important part was
not the update statement. It was deciding which field represented
reality when two fields contradicted each other.

Blindly trusting the nominal status would have been wrong. Blindly
trusting any non-null identifier could also be wrong if posts had later
been deleted. Reconciliation needed to include verification against
WordPress or another durable publication log.
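
A minimal sketch of that reconciliation, under stated assumptions: the
clusters table and its id, status, and wp_post_id columns are
hypothetical, SQLite stands in for PostgreSQL so the example runs on its
own, and post_exists is a placeholder for whatever check confirms a post
against WordPress or a publication log. None of these names come from
the actual system.

```python
import sqlite3
from typing import Callable, List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clusters ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, wp_post_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO clusters (id, status, wp_post_id) VALUES (?, ?, ?)",
        [
            (1, "candidate", 101),   # orphan: post id present, status stale
            (2, "approved", 102),    # orphan whose post can no longer be verified
            (3, "approved", None),   # genuinely unpublished
            (4, "published", 104),   # consistent
        ],
    )
    conn.commit()
    return conn


def reconcile(
    conn: sqlite3.Connection, post_exists: Callable[[int], bool]
) -> Tuple[List[int], List[int]]:
    """Mark a cluster published only when its post id is verified.

    Returns (fixed_ids, unverified_ids). Unverified clusters are left
    untouched so a person can review them.
    """
    rows = conn.execute(
        "SELECT id, wp_post_id FROM clusters "
        "WHERE wp_post_id IS NOT NULL AND status != 'published' "
        "ORDER BY id"
    ).fetchall()
    fixed: List[int] = []
    unverified: List[int] = []
    for cluster_id, post_id in rows:
        if post_exists(post_id):
            conn.execute(
                "UPDATE clusters SET status = 'published' WHERE id = ?",
                (cluster_id,),
            )
            fixed.append(cluster_id)
        else:
            unverified.append(cluster_id)
    conn.commit()
    return fixed, unverified
```

The design choice worth noting is that an unverified identifier is
reported rather than trusted or cleared: neither field wins
automatically when the external system cannot confirm the post.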

Why Transactions Could Not Solve Everything

A database transaction can make local writes atomic. It cannot
include an ordinary WordPress REST request in the same transaction.

The workflow crossed two systems:

  • PostgreSQL tracked clusters and assignments.
  • WordPress created the public post.

There was no distributed transaction coordinator guaranteeing that
both committed together. The design therefore needed to tolerate partial
success.

The safer pattern was:

  1. Acquire a durable publication lock.
  2. Check whether a post identifier already exists.
  3. If it exists, verify and reconcile instead of posting again.
  4. If it does not exist, create the post with an idempotency strategy
    where possible.
  5. Persist the external identifier immediately.
  6. Advance remaining state transitions.
  7. Make every later step safe to retry.

The post identifier becomes a checkpoint, not merely metadata.
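
The steps above can be sketched roughly as follows. This is an
illustration, not the system's actual code: FakeWordPress is an
in-memory stand-in rather than the WordPress REST API, SQLite replaces
PostgreSQL, the locked column is a simplified lock with no lease expiry,
and every table, column, and function name is hypothetical. The
idempotency strategy from step 4 is not modeled.

```python
import sqlite3


class FakeWordPress:
    """In-memory stand-in for the external system (hypothetical)."""

    def __init__(self) -> None:
        self.posts = {}

    def create_post(self, title: str) -> int:
        post_id = len(self.posts) + 1
        self.posts[post_id] = title
        return post_id

    def exists(self, post_id: int) -> bool:
        return post_id in self.posts


def make_db() -> sqlite3.Connection:
    """Create a one-row demo table with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clusters (id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "status TEXT NOT NULL, wp_post_id INTEGER, "
        "locked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "INSERT INTO clusters (id, title, status) "
        "VALUES (1, 'Demo story', 'approved')"
    )
    conn.commit()
    return conn


def publish_cluster(conn: sqlite3.Connection, wp: FakeWordPress, cluster_id: int) -> str:
    """Publish once; on a retry, reconcile instead of posting again."""
    # Step 1: take the lock with a conditional update that only one caller wins.
    cur = conn.execute(
        "UPDATE clusters SET locked = 1 WHERE id = ? AND locked = 0",
        (cluster_id,),
    )
    conn.commit()
    if cur.rowcount != 1:
        return "busy"
    try:
        # Step 2: look for an existing checkpoint.
        title, post_id = conn.execute(
            "SELECT title, wp_post_id FROM clusters WHERE id = ?",
            (cluster_id,),
        ).fetchone()
        if post_id is not None:
            # Step 3: verify before trusting the identifier.
            if not wp.exists(post_id):
                return "needs_review"
            outcome = "reconciled"
        else:
            # Step 4: create the post (the external side effect).
            post_id = wp.create_post(title)
            # Step 5: persist the receipt in its own commit, immediately.
            conn.execute(
                "UPDATE clusters SET wp_post_id = ? WHERE id = ?",
                (post_id, cluster_id),
            )
            conn.commit()
            outcome = "created"
        # Steps 6 and 7: later transitions are plain updates, safe to repeat.
        conn.execute(
            "UPDATE clusters SET status = 'published' WHERE id = ?",
            (cluster_id,),
        )
        conn.commit()
        return outcome
    finally:
        conn.execute("UPDATE clusters SET locked = 0 WHERE id = ?", (cluster_id,))
        conn.commit()
```

Calling publish_cluster twice on the same cluster returns "created" and
then "reconciled", with only one post in the fake WordPress. A crash
between create_post and the following commit would still leave a post
without a receipt, which is the gap an idempotency strategy on the
external call is meant to close.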

Orphans Were a Monitoring Failure Too

The inconsistent records had accumulated because no routine query
treated them as an error condition.

Several checks should have existed:

  • Post identifier present while status is not published
  • Published status without a post identifier
  • Assignment marked complete while the cluster is not published
  • Assignment running beyond its expected duration
  • Multiple posts associated with one cluster
  • A publication event missing after a completed post

These are cross-field integrity checks. They catch failures that
individual workflow nodes cannot see.
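
The first three checks in the list could be expressed as queries that a
scheduled job runs and alerts on. The sketch below assumes hypothetical
clusters and assignments tables and uses SQLite in place of PostgreSQL
so it is runnable as written; the check names and status values are
illustrative only.

```python
import sqlite3
from typing import Dict, List

# Each check returns the ids of rows in an impossible state combination.
CHECKS: Dict[str, str] = {
    "post_id_but_not_published": (
        "SELECT id FROM clusters "
        "WHERE wp_post_id IS NOT NULL AND status != 'published' ORDER BY id"
    ),
    "published_without_post_id": (
        "SELECT id FROM clusters "
        "WHERE status = 'published' AND wp_post_id IS NULL ORDER BY id"
    ),
    "assignment_complete_cluster_unpublished": (
        "SELECT a.id FROM assignments a "
        "JOIN clusters c ON c.id = a.cluster_id "
        "WHERE a.status = 'complete' AND c.status != 'published' ORDER BY a.id"
    ),
}


def find_violations(conn: sqlite3.Connection) -> Dict[str, List[int]]:
    """Run every check and return only those that found offending rows."""
    violations: Dict[str, List[int]] = {}
    for name, sql in CHECKS.items():
        ids = [row[0] for row in conn.execute(sql)]
        if ids:
            violations[name] = ids
    return violations


def make_demo_db() -> sqlite3.Connection:
    """Sample data with one violation per check (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clusters ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, wp_post_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE assignments ("
        "id INTEGER PRIMARY KEY, cluster_id INTEGER NOT NULL, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO clusters VALUES (?, ?, ?)",
        [(1, "published", 11), (2, "approved", 12),
         (3, "published", None), (4, "candidate", None)],
    )
    conn.executemany(
        "INSERT INTO assignments VALUES (?, ?, ?)",
        [(1, 1, "complete"), (2, 4, "complete"), (3, 2, "running")],
    )
    conn.commit()
    return conn
```

An empty result from find_violations is the healthy case; any key in the
returned dictionary is a state the workflow should never have produced.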

The system already logged many events, but logs are not invariants. A
dashboard can show green executions while durable state quietly
diverges.

The Real Fix Was Idempotency

Correcting fifty-five rows repaired history. It did not prevent
recurrence.

The subsequent design added duplicate guards around assignments and
publication. Before creating a post, the Publishing Desk attempted to
claim the cluster and checked whether publication had already occurred.
A job processor could retry stalled assignments without assuming that
every previous step had failed.

That distinction is central:

  • A retryable workflow asks what has already succeeded.
  • A naive workflow assumes failure erased all prior effects.

External APIs make the second assumption dangerous.
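
The difference can be shown with a small, self-contained sketch. The
step names, the receipts dictionary, and the simulated failure are all
invented for illustration; in a real system the receipts would live in
durable storage rather than in memory.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], object]]


def run_workflow(steps: List[Step], receipts: Dict[str, object]) -> Dict[str, object]:
    """Run steps in order, skipping any step that already has a receipt."""
    for name, action in steps:
        if name in receipts:
            continue  # this step already succeeded on an earlier attempt
        receipts[name] = action()  # recorded only after the action returns
    return receipts


def make_demo_steps(created_posts: List[str], failures_left: List[int]) -> List[Step]:
    """Two steps: an external side effect, then a local step that fails once."""

    def create_post() -> str:
        created_posts.append(f"post-{len(created_posts) + 1}")
        return created_posts[-1]

    def mark_published() -> str:
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise RuntimeError("simulated failure after the post was created")
        return "published"

    return [("create_post", create_post), ("mark_published", mark_published)]


def attempt_twice(keep_receipts: bool) -> int:
    """Run, fail once, retry; return how many posts were created in total."""
    created_posts: List[str] = []
    steps = make_demo_steps(created_posts, failures_left=[1])
    receipts: Dict[str, object] = {}
    try:
        run_workflow(steps, receipts)
    except RuntimeError:
        pass  # the first attempt dies after the post already exists
    if not keep_receipts:
        receipts = {}  # naive retry: assumes the failure erased prior effects
    run_workflow(steps, receipts)
    return len(created_posts)
```

With keep_receipts=False the retry creates a second post; with
keep_receipts=True the retry skips creation and only finishes the
remaining step.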

What the Incident Changed

The investigation turned a vague publishing problem into a concrete
rule: public side effects must leave durable receipts, and those
receipts must participate in recovery.

The WordPress post identifier was not just a link. It was proof that
the system had crossed a point of no return.

Fifty-five inconsistent clusters looked like a data-cleanup task.
They were actually a design review. The system had been optimized for
the happy path and had no formal answer for “the post exists, but the
workflow did not finish.”

Once that question was explicit, the necessary controls became
obvious: locks, checkpoints, reconciliation queries, idempotent retries,
and alarms for impossible state combinations.