
Don't Back Up Your Vector Database

Embeddings are derived data — every fingerprint can be recomputed from the content it came from. So instead of snapshotting our vector store, we recover it by regenerating from source, hot content first. Here is why a rebuild button beats a backup.

MangoApps Engineering · 11 min read · Updated May 27, 2026
MangoApps explains why vector embeddings are derived data, not source data — and why regeneration beats backup for AI-powered search reliability.

It's 6:14 a.m. A closing manager at a distribution center opens the app and types, "What's the call-out policy if someone no-shows the early shift?" The answer comes back in under a second — pulled not by keyword match but by meaning, from a policy document somebody wrote eighteen months ago. The manager reassigns the shift and moves on. They never think about how that answer was found.

That "found by meaning" is vector search, and it sits underneath a lot of what makes the platform useful in the moment: Ask AI, the Service Desk knowledge base, help search, template suggestions. None of it is the kind of feature anyone puts on a slide. All of it breaks the same way if the math underneath drifts — and the manager at 6:14 a.m. is the person who pays for it.

This is a story about a decision that sounds reckless the first time you hear it: we don't back up our vector database. Not because we forgot, and not because the data doesn't matter. Because backing it up is the wrong tool for what this data actually is — and the right tool turned out to be something we already had running.

What's actually in there

A quick, no-code explanation of the thing we're choosing not to back up.

When a manager asks a question in plain language, we can't scan every policy, article, and document word by word — that doesn't scale and it doesn't understand synonyms. Instead, every piece of searchable content is converted, ahead of time, into a list of 1,536 numbers called an embedding. Think of it as a fingerprint of meaning: two documents that say similar things end up with numerically close fingerprints, even if they share no words. A question gets the same treatment, and search becomes "find the fingerprints nearest to this one."
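To make "find the fingerprints nearest to this one" concrete, here is a minimal sketch. The three-number vectors and document names are invented for illustration (real fingerprints here have 1,536 numbers), and cosine similarity is an assumption: this article does not say which distance measure the platform uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, fingerprints, k=2):
    """Return the k document names whose fingerprints are closest to the query."""
    ranked = sorted(
        fingerprints,
        key=lambda name: cosine_similarity(query, fingerprints[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional fingerprints (hypothetical values).
fingerprints = {
    "call-out policy": [0.9, 0.1, 0.0],
    "no-show procedure": [0.8, 0.3, 0.1],
    "forklift checklist": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]  # stands in for the embedded question
top = nearest(query, fingerprints)
```

The two attendance documents rank ahead of the forklift checklist because their vectors point in nearly the same direction as the query, not because of any shared words.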

We store those fingerprints in PostgreSQL using the pgvector extension — the same database that holds the source content, no separate vector product to operate. Today that's roughly 29 different kinds of content carrying embeddings: knowledge base entries, help articles, the chunked bodies of long documents, task templates, policies, toolbox talks, and more. Each fingerprint is about 6 KB on disk. Multiply that across every tenant's library and the vector data becomes one of the larger things in the database.
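The 6 KB figure follows from the vector width, assuming each of the 1,536 numbers is stored as a 4-byte single-precision float (which is how pgvector's vector type stores its elements) and ignoring small per-row overhead. The row count below is a made-up number used only to show how the total grows.

```python
DIMENSIONS = 1536
BYTES_PER_FLOAT32 = 4

bytes_per_fingerprint = DIMENSIONS * BYTES_PER_FLOAT32  # 6144 bytes
kib_per_fingerprint = bytes_per_fingerprint / 1024       # 6.0 KiB

# Hypothetical corpus size, for illustration only.
hypothetical_rows = 5_000_000
total_gib = hypothetical_rows * bytes_per_fingerprint / 1024**3
```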

Here's the property that decides everything else: none of it is original. Every embedding was computed from text we still have — the title and body of the article, the paragraphs of the policy. The fingerprint is derived from the source the way a thumbnail is derived from a photo. You don't carefully back up thumbnails. If you lose one, you regenerate it from the photo.

That single observation is the whole article.

The instinct we had to argue ourselves out of

The reflexive engineering move is: important data, big table, therefore back it up. Snapshot the vectors, ship them somewhere safe, restore them if something goes wrong. It feels responsible.

We walked through what a backup would actually buy us, and it kept coming up short:

  • A backup of derived data is stale the moment it's taken. The instant someone edits a policy, the backed-up fingerprint no longer matches the live text. Restore it later and you've faithfully recovered the wrong answer — a vector that describes a document that no longer exists in that form.
  • It only covers one failure. A snapshot protects against corruption, and only back to the moment it was taken. It does nothing for the other ways this data goes bad, and those turned out to be the ones that actually happen.
  • It's expensive in the least useful way. Vectors are roughly 10× the size of the text they came from. You'd be paying to store, ship, and secure a large, redundant, perishable copy of data you can already reconstruct exactly.
  • Restore paths rot in the dark. A recovery procedure you only run during a disaster is a procedure you don't actually know works. The first real test is the worst possible time to discover a gap.

Once we listed the ways embeddings actually go wrong in production, the backup looked even weaker. The real failure modes were:

  • We change the embedding model. New model, new math — old fingerprints aren't comparable to new ones, so a backup just preserves obsolete vectors.
  • The fingerprint size changes. A new model emits a different-width vector. The old ones literally don't fit.
  • A scoping bug points search at the wrong content. The vectors are fine; the query was wrong. Restoring vectors fixes nothing.
  • Standing up a new region or tenant. There's nothing to restore — the source content exists, the fingerprints were never built.
  • Actual data corruption. The one case a backup addresses — and even here, re-deriving from source is cleaner.
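The first two failure modes can be sketched as a guard: two fingerprints are only comparable when the same model produced them at the same width. The `Fingerprint` class, the `embed-v1` and `embed-v2` tags, and the helper names are hypothetical, chosen for illustration rather than taken from the platform.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fingerprint:
    model_version: str  # hypothetical tag, e.g. "embed-v1"
    values: tuple       # the embedding itself

def comparable(a: Fingerprint, b: Fingerprint) -> bool:
    """Distances are only meaningful within one model and one vector width."""
    return a.model_version == b.model_version and len(a.values) == len(b.values)

def dot(a: Fingerprint, b: Fingerprint) -> float:
    """Dot product that refuses to mix fingerprints from different models."""
    if not comparable(a, b):
        raise ValueError("fingerprints come from different models or widths")
    return sum(x * y for x, y in zip(a.values, b.values))

old = Fingerprint("embed-v1", (0.1, 0.2, 0.3))
new = Fingerprint("embed-v2", (0.1, 0.2, 0.3, 0.4))
```

A restored backup full of `embed-v1` vectors would fail this check against every query embedded by `embed-v2`, which is why restoring it recovers nothing useful.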

Four of those five are not corruption at all. Three are change, and the fourth is a query bug that no copy of the vectors, old or new, can fix. And you don't recover from change by restoring an older copy; you recover by regenerating against the new reality. Which meant the tool we needed wasn't a backup. It was a rebuild button.

The rebuild button was already there

The quietly satisfying part: we didn't have to build the recovery system. We'd been running it in production the whole time, for an unglamorous reason — keeping search fresh.

Every piece of searchable content already knows how to regenerate its own fingerprint from its own text. When a manager edits a policy, a background job recomputes that document's embedding from the new wording. That's not a recovery feature; it's just how search stays current. But it has exactly the properties disaster recovery needs:

The same pipeline, two jobs: in steady state, source content flows through the embedding pipeline into the vector store; for disaster recovery, the same pipeline is pointed back at whatever is missing or stale and rebuilds it.

Three things make this safe to lean on:

  1. It's idempotent. Running it again on a healthy record produces the same result and costs almost nothing. There's no "did I already run this?" anxiety — re-running is always safe.
  2. It knows what's missing. The system can ask, precisely, "which records have no fingerprint?" and "which fingerprints were built by an older model version?" — so recovery touches exactly what's broken and leaves healthy data alone. We tag every fingerprint with the model version that produced it specifically so a model change becomes a queryable question, not a guess.
  3. It rebuilds in priority order. This is the part that matters for the manager at 6:14 a.m.
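The "it knows what's missing" property can be sketched as two counting queries. This example uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere; the `kb_entries` table, its columns, and the `embed-v2` version tag are hypothetical names, not the platform's actual schema.

```python
import sqlite3

CURRENT_MODEL = "embed-v2"  # hypothetical version tag

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kb_entries ("
    "id INTEGER PRIMARY KEY, body TEXT NOT NULL, "
    "embedding BLOB, embedding_model TEXT)"
)
conn.executemany(
    "INSERT INTO kb_entries (id, body, embedding, embedding_model) VALUES (?, ?, ?, ?)",
    [
        (1, "Call-out policy", b"\x01", "embed-v2"),   # healthy
        (2, "Shift swap rules", None, None),           # never embedded
        (3, "PTO accrual", b"\x02", "embed-v1"),       # built by an older model
        (4, "Dress code", b"\x03", "embed-v2"),        # healthy
    ],
)

def measure_gap(conn, current_model):
    """Count records with no fingerprint and records with an outdated one."""
    missing = conn.execute(
        "SELECT COUNT(*) FROM kb_entries WHERE embedding IS NULL"
    ).fetchone()[0]
    stale = conn.execute(
        "SELECT COUNT(*) FROM kb_entries "
        "WHERE embedding IS NOT NULL AND embedding_model <> ?",
        (current_model,),
    ).fetchone()[0]
    return {"missing": missing, "stale": stale}

gap = measure_gap(conn, CURRENT_MODEL)
```

Because the version tag lives next to the fingerprint, "what did the model change invalidate?" is an ordinary `WHERE` clause rather than a guess.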

That last point deserves its own paragraph. When we analyzed how the content is actually used, the access pattern was sharply lopsided — a small, active set of knowledge base entries and help articles serves most of the real-time questions, while a long tail sits idle for weeks. So recovery doesn't rebuild alphabetically or by database id. It rebuilds the hot set first: active knowledge base entries newest-first, then help articles by view count, then the long tail of document chunks. Within minutes, the content that answers the most questions is back — and the rarely-touched material refills behind it, invisibly. The floor gets working answers long before the rebuild is "done."
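The hot-set ordering described above can be sketched as a sort key. The record shape and the `kind`, `updated`, and `views` field names are assumptions made for illustration, and the "active" filter on knowledge base entries is left out to keep the example short.

```python
from datetime import date

# Lower tier number means rebuilt earlier.
TIER = {"kb_entry": 0, "help_article": 1, "document_chunk": 2}

def rebuild_order(records):
    """Sort records: KB entries newest-first, help articles by views, then chunks."""
    def key(rec):
        kind = rec["kind"]
        if kind == "kb_entry":
            within = -rec["updated"].toordinal()  # newest first
        elif kind == "help_article":
            within = -rec["views"]                # most viewed first
        else:
            within = rec["id"]                    # long tail, any stable order
        return (TIER[kind], within)
    return sorted(records, key=key)

records = [
    {"id": 1, "kind": "document_chunk"},
    {"id": 2, "kind": "help_article", "views": 40},
    {"id": 3, "kind": "kb_entry", "updated": date(2026, 1, 5)},
    {"id": 4, "kind": "help_article", "views": 900},
    {"id": 5, "kind": "kb_entry", "updated": date(2026, 5, 20)},
]
order = [rec["id"] for rec in rebuild_order(records)]
```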

The whole procedure is four steps, and the only true prerequisite is the ordinary one:

  1. Restore the source content from the normal database backup. This is the system of record, and it's the one thing that genuinely must be backed up.
  2. Measure the gap — how many fingerprints are missing or stale, per content type.
  3. Rebuild, hot set first, so search comes back for the busiest content within minutes.
  4. Verify the gap is gone and spot-check real searches for relevance and correct tenant scoping.

Restore the text. Regenerate the math. Done.
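Steps 2 through 4 can be sketched as one function that is safe to run twice. `fake_embed` is a deterministic stand-in for a real model call, and the record fields and the `embed-v2` tag are hypothetical. Step 1, restoring the source text, is assumed to have happened already, and the hot-set ordering is omitted for brevity.

```python
import hashlib

MODEL = "embed-v2"  # hypothetical current model version

def fake_embed(text):
    """Deterministic stand-in for a real embedding model call."""
    return hashlib.sha256((MODEL + text).encode("utf-8")).hexdigest()[:8]

def needs_rebuild(rec):
    """A record is in the gap if it has no fingerprint or an outdated one."""
    return rec.get("embedding") is None or rec.get("model") != MODEL

def recover(records):
    before = sum(needs_rebuild(r) for r in records)  # step 2: measure the gap
    for rec in records:                              # step 3: rebuild what is broken
        if needs_rebuild(rec):
            rec["embedding"] = fake_embed(rec["body"])
            rec["model"] = MODEL
    after = sum(needs_rebuild(r) for r in records)   # step 4: verify the gap is gone
    return before, after

records = [
    {"body": "Call-out policy", "embedding": None, "model": None},
    {"body": "PTO accrual", "embedding": "deadbeef", "model": "embed-v1"},
    {"body": "Dress code", "embedding": fake_embed("Dress code"), "model": MODEL},
]
first_run = recover(records)
second_run = recover(records)  # re-running touches nothing
```

The second call finds no gap and changes no records, which is the idempotence the procedure relies on.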

The trade-offs we're not hiding

This isn't free, and pretending otherwise would undercut the point.

A full rebuild costs real compute. Regenerating every fingerprint across every tenant means a lot of model calls. We blunt this two ways. First, a content-addressed cache: under the same model, identical text produces an identical fingerprint, so unchanged content is served from cache rather than recomputed and re-billed. Second, the pipeline paces itself, working in small batches with a fixed delay between calls, so a large rebuild stays within provider rate limits instead of slamming into them. The honest caveat: a model change invalidates that cache by definition (different math, different fingerprints), so that specific scenario does incur the full cost. We accept it, because it's rare and because the alternative, comparing fingerprints from two different models, is simply wrong, not cheaper.
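A minimal sketch of the two cost controls, under stated assumptions: the cache key is a SHA-256 hash of the model version plus the text, the cache is an in-process dictionary, and `CachedEmbedder`, `batch_size`, and `delay_seconds` are hypothetical names. The delay defaults to zero here so the example runs instantly; a real pipeline would set it to respect provider limits.

```python
import hashlib
import time

class CachedEmbedder:
    def __init__(self, embed_fn, model_version, batch_size=2, delay_seconds=0.0):
        self.embed_fn = embed_fn
        self.model_version = model_version
        self.batch_size = batch_size
        self.delay_seconds = delay_seconds
        self.cache = {}       # content hash -> fingerprint
        self.model_calls = 0  # how many times we actually paid for a call

    def _key(self, text):
        # Including the model version means a model change misses the cache.
        raw = f"{self.model_version}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def embed_all(self, texts):
        results = []
        for start in range(0, len(texts), self.batch_size):
            called_model = False
            for text in texts[start:start + self.batch_size]:
                key = self._key(text)
                if key not in self.cache:
                    self.cache[key] = self.embed_fn(text)
                    self.model_calls += 1
                    called_model = True
                results.append(self.cache[key])
            if called_model and self.delay_seconds:
                time.sleep(self.delay_seconds)  # fixed pause between paid batches
        return results

def fake_model(text):
    """Stand-in for a real embedding call."""
    return [float(len(text))]

embedder = CachedEmbedder(fake_model, "embed-v1")
vectors = embedder.embed_all(["alpha", "beta", "alpha"])
```

Three texts cost two model calls because the repeated text hits the cache. Changing `model_version` changes every key, so the same text is embedded again, which is the full-cost scenario described above.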

Recovery time scales with how much content has to be re-embedded, not with how the failure happened. Lose one tenant's vectors and you rebuild fast. A platform-wide model migration is a longer, paced operation. The priority ordering means useful search returns quickly, but "every last fingerprint rebuilt" can take a while on the largest libraries. We'd rather have a slow tail on a correct rebuild than a fast restore of stale data.

Not every content type carries a version tag yet. Most do, which lets us surgically rebuild only what a model change invalidated. A few (the chunked bodies of long documents, in particular) don't yet, so a model change reprocesses all of them rather than just the stale ones. It works; it's just less surgical than we'd like. It's on the list.

And one hard-won lesson that has nothing to do with backups: a rebuild can't fix a query that was looking in the wrong place. We once chased a bug where the knowledge base returned confidently irrelevant answers. The fingerprints were perfect. The problem was that the similarity search ran across the whole library and only then narrowed to the right category. The result window filled up with near matches from other categories, the category filter discarded them afterward, and the genuinely relevant results never made it into the window at all. Regenerating embeddings would have changed nothing. The fix was to constrain the search to the correct content first, then rank by similarity. That is also, not coincidentally, how we keep one tenant's content from ever surfacing in another's results. Isolation is enforced in the query, before the math runs, not cleaned up afterward. No backup or rebuild touches that class of problem; only correct scoping does.
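The difference between the buggy order and the fixed order fits in a few lines. The documents, tenants, and two-number vectors below are invented, and a plain dot product stands in for whatever similarity measure the real search uses.

```python
def similarity(a, b):
    """Dot product as a simple stand-in for the real similarity measure."""
    return sum(x * y for x, y in zip(a, b))

docs = [
    {"id": 1, "tenant": "acme", "category": "safety", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "globex", "category": "hr", "vec": [0.95, 0.1]},
    {"id": 3, "tenant": "acme", "category": "hr", "vec": [0.8, 0.2]},
    {"id": 4, "tenant": "acme", "category": "hr", "vec": [0.1, 0.9]},
]
query = [1.0, 0.0]

def in_scope(doc, tenant, category):
    return doc["tenant"] == tenant and doc["category"] == category

def rank_then_filter(query, docs, tenant, category, k):
    """Buggy order: take the top k across everything, then discard out-of-scope hits."""
    top = sorted(docs, key=lambda d: similarity(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in top if in_scope(d, tenant, category)]

def filter_then_rank(query, docs, tenant, category, k):
    """Fixed order: restrict to the tenant and category first, then rank."""
    scoped = [d for d in docs if in_scope(d, tenant, category)]
    ranked = sorted(scoped, key=lambda d: similarity(query, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

buggy = rank_then_filter(query, docs, "acme", "hr", k=2)
fixed = filter_then_rank(query, docs, "acme", "hr", k=2)
```

With a window of two, the buggy order spends both slots on documents 1 and 2, which are out of scope, and returns nothing. The fixed order never scores another tenant's document at all.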

What this means if you're evaluating a platform

You probably don't operate a vector database. But if AI-assisted answers are going to sit in the path of a frontline manager's morning, the durability of that layer is a fair question to ask any vendor — and "how often do you snapshot it?" is the wrong one. Better questions:

  • Is your search index the system of record, or is it derived? If it's derived from content you already store, an honest answer to "how do you recover it?" is "we regenerate it" — and that's a stronger answer than "we restore a backup," because regeneration is always current. Be wary if the index is treated as precious, irreplaceable data; that usually means it's drifted from its source and nobody's sure they can rebuild it.
  • How fast does useful search come back, not how fast does recovery finish? The number that matters to the person on the floor is time-to-first-working-answer, not time-to-fully-rebuilt. Ask whether recovery prioritizes the content people actually use.
  • Is the recovery path the same machinery you run every day? A recovery procedure that only runs during incidents is one nobody's confident in. The strongest answer is "it's the same pipeline that keeps search fresh — we exercise it constantly, so it's never cold."
  • Where is tenant isolation enforced — in the query, or after results come back? This is the one that separates careful systems from hopeful ones. Filtering to the right scope before the similarity math is correct and safe. Filtering after is how the wrong content leaks into the wrong hands.

The broader principle travels well beyond vectors: don't back up data you can deterministically rebuild — invest in the rebuild instead. Thumbnails, caches, search indexes, computed aggregates, derived fingerprints. Treat them as precious and you pay to store stale copies you're afraid to test. Treat them as disposable, build a fast and well-exercised way to regenerate them, and you get something better than a backup: a recovery that's always current, covers far more than corruption, and is proven every single day by the ordinary work of keeping things fresh.

We don't back up our vector database. The 6:14 a.m. manager is the reason we're confident that's the right call — not despite the missing backup, but because of what we built instead.

The MangoApps Team

We're the product, research, and strategy team behind MangoApps — the unified frontline workforce management platform and employee communication and engagement suite trusted by organizations in healthcare, manufacturing, retail, hospitality, and the public sector to connect every employee — deskless or desk-based — to the people, tools, and information they need.

We write about enterprise AI for the workplace, internal communications, AI-powered intranets, workforce management, and the operating patterns behind highly engaged frontline teams. Our perspective is grounded in a decade of building for frontline-heavy industries and shipping AI agents, employee apps, and integrated HR workflows that real employees actually use.

For short-form takes, product news, and field notes from customer rollouts, follow Frontline Wire — our ongoing stream on AI, frontline work, and the modern digital workplace — or learn more about MangoApps.
