Enterprise RAG in production: 5 critical decisions your first POCs hide from you

A laptop RAG POC takes a day to stand up. The same system deployed for 500 users over 100,000 documents cracks within three weeks. Five structuring decisions (ingestion, chunking, search, observability, access control) determine whether your RAG will clear that bar. This is a decision grid for CTOs and Heads of Data at mid-market companies.

Transparency note. This article is written in line with IgnitionAI's editorial policy. Technical claims about the tools cited (LlamaIndex, Pinecone, Cohere Rerank, Langfuse) link to their official documentation with the date of access. Anthropic's figures on Contextual Retrieval are sourced to their original publication. Architecture recommendations are expert estimates tagged as such.

A RAG project rarely cracks on go-live day. It more often cracks after six weeks, when the volume of real users reveals what the POC never tested: latency under load, forgotten access control, drifting costs, silent regressions, and the impossibility of reindexing a modified document without breaking everything.

Five structuring decisions determine whether your RAG will clear that bar. Most are made by default, with no conscious trade-off, before the first POC even exists. Here is the decision grid we use on engagements, in the order these decisions should be handled.

Why the first POCs lie

A lab-condition RAG POC holds up on a laptop with a few dozen documents, a single user, controlled queries, and an unlimited token budget. Those conditions hide the constraints that surface at scale.

At 500 real users querying an index of 100,000 documents simultaneously:

  • Latency becomes an issue (an expected P95 under 2 seconds for interactive use).
  • Inference costs are measured in thousands of euros per month.
  • Questions of who sees what become legal before they become technical.
  • Quality regressions after a model or index update are invisible without observability.
  • A system that takes 14 hours to reindex blocks any evolution.

The five decisions below structure the whole thing. They're presented in the logical order of scoping, not the order of implementation.

Decision 1 · Ingestion strategy: batch, streaming, or hybrid

The first decision concerns data freshness. Three models:

Nightly batch. All documents are reindexed (or indexed as deltas) over a maintenance window, typically at night. Simple to implement, predictable on cost. The refresh latency is acceptable for most internal use cases: knowledge base, product documentation, contracts, procedures. If a document is modified at 10am, it'll be correctly queryable the next morning.

Event streaming. Each document change (creation, update, deletion) triggers a reindex via webhook or message queue. Refresh latency of a few seconds to a few minutes. Justified for living sources: support tickets, news articles, real-time operational data. The operational complexity is markedly higher (conflict handling, delivery guarantees, update idempotency).

Hybrid. A periodic full batch (weekly or monthly) that serves as the reference, plus streaming for day-to-day deltas. Costly to set up but often the best compromise for mid-market companies that have both stable and living sources.

The classic anti-pattern: the POC ingests documents with an ad-hoc script that's never re-run. In production, nobody knows how to reindex a modified document without breaking everything. The ingestion decision must anticipate the lifecycle: modification, deletion, GDPR right to be forgotten, versioning.
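The lifecycle concern above can be sketched as an idempotent event handler. This is a minimal illustration, not a reference implementation: `InMemoryIndex`, the event shape (`type`, `doc_id`, `text`) and the paragraph-based chunking are all assumptions made for the example, and a real system would call the vector store's own upsert and delete operations instead.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Toy stand-in for a vector store: doc_id -> (content_hash, chunks)."""
    docs: dict = field(default_factory=dict)

    def apply(self, event: dict) -> str:
        """Apply one change event idempotently and report what happened."""
        doc_id = event["doc_id"]
        if event["type"] == "deleted":
            # Deleting an absent document is a no-op, so replays are safe.
            return "deleted" if self.docs.pop(doc_id, None) else "noop"
        content_hash = hashlib.sha256(event["text"].encode("utf-8")).hexdigest()
        current = self.docs.get(doc_id)
        if current and current[0] == content_hash:
            return "noop"  # same content already indexed: nothing to redo
        # Replace all chunks of the document at once, leaving no orphans.
        chunks = [p for p in event["text"].split("\n\n") if p.strip()]
        self.docs[doc_id] = (content_hash, chunks)
        return "updated" if current else "created"
```

Because the handler keys on the document identifier and a content hash, the same event can be delivered twice (a common situation with webhooks and queues) without duplicating chunks, and a deletion request removes every chunk of the document in one step.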

Decision 2 · Chunking strategy: one per document type

Chunking is the splitting of documents into queryable pieces. The choice of strategy directly influences retrieval quality. Most projects use fixed-size splitting (1024 characters with a 20-character overlap), because it's the default in most frameworks. It's often unsuitable.

The official LlamaIndex documentation (accessed May 2026) lists several strategies, each suited to a content type:

  • MarkdownNodeParser for technical documentation, internal wikis, READMEs. Preserves the hierarchical structure (headings, subheadings) that then serves as valuable metadata.
  • CodeSplitter for internal codebases, parameterised by language. Splits along syntactic units (functions, classes) rather than line count.
  • SemanticSplitterNodeParser for long, continuous text (reports, articles, contracts). Adaptively chooses break points based on semantic similarity between sentences, which avoids cutting in the middle of an idea.
  • HierarchicalNodeParser for structured documents with several levels of granularity. Creates a hierarchy of chunks at several sizes, which then enables retrieval that climbs to the parent chunk when more context is needed.
  • SentenceSplitter as a reasonable default when the content type isn't known in advance.
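One way to apply "one strategy per document type" is a small routing table in front of the ingestion pipeline. The sketch below only returns the name of the parser to use; the suffix-to-parser mapping is a hypothetical example, not a LlamaIndex feature, and should be adapted to your own corpus.

```python
from pathlib import Path

# Hypothetical routing table: file suffix -> name of the LlamaIndex parser
# from the list above. The suffix choices are illustrative assumptions.
PARSER_BY_SUFFIX = {
    ".md": "MarkdownNodeParser",
    ".py": "CodeSplitter",
    ".ts": "CodeSplitter",
    ".pdf": "SemanticSplitterNodeParser",
    ".docx": "SemanticSplitterNodeParser",
}
DEFAULT_PARSER = "SentenceSplitter"


def choose_parser(filename: str) -> str:
    """Pick a chunking strategy from the file suffix, with a safe default."""
    return PARSER_BY_SUFFIX.get(Path(filename).suffix.lower(), DEFAULT_PARSER)
```

Keeping the mapping in data rather than in code makes the trade-off visible and reviewable: adding a new document type is a one-line change.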

Once the strategy is chosen, the second chunking decision concerns embedded metadata. Each indexed chunk must carry at minimum:

  • The source document identifier and the section it's extracted from
  • The ingestion date
  • The ACL tag(s) that determine who can view it (see decision 5)
  • A version identifier to allow targeted deletion

Without this metadata, you'll be able neither to apply clean access control, nor to answer a GDPR right to be forgotten, nor to audit a response after the fact.
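The minimum metadata listed above can be sketched as a small record attached to each chunk. The class and field names are illustrative assumptions, and `purge_document` is a hypothetical helper showing how a version identifier enables targeted deletion.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChunkMetadata:
    doc_id: str           # source document identifier
    section: str          # section the chunk is extracted from
    ingested_at: datetime
    acl_tags: frozenset   # groups allowed to view the chunk (decision 5)
    version: str          # enables targeted deletion


@dataclass(frozen=True)
class Chunk:
    text: str
    meta: ChunkMetadata


def purge_document(chunks, doc_id, version=None):
    """Return the chunks minus those of doc_id (optionally one version only)."""
    return [
        c for c in chunks
        if not (c.meta.doc_id == doc_id
                and (version is None or c.meta.version == version))
    ]
```

Calling `purge_document(chunks, "contract-42")` covers a right-to-be-forgotten request, while passing a `version` removes only a superseded revision.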

Decision 3 · Search strategy: pure vector, hybrid, or with reranking

This is the decision with the biggest impact on quality, and the one where published figures allow a defensible trade-off.

In September 2024 Anthropic published a study on what they call Contextual Retrieval (accessed May 2026), with figures on the top-20-chunk retrieval failure rate:

  • Dense embeddings alone: failure rate of 5.7 percent
  • Contextual Embeddings: 3.7 percent (a 35 percent reduction)
  • Contextual Embeddings combined with Contextual BM25: 2.9 percent (a 49 percent reduction)
  • With reranking on top: 1.9 percent (a 67 percent reduction)
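The relative reductions quoted above follow directly from the absolute failure rates. A quick arithmetic check, with variable names chosen for illustration:

```python
# Top-20-chunk retrieval failure rates (percent) as reported by Anthropic.
baseline = 5.7  # dense embeddings alone
rates = {
    "contextual_embeddings": 3.7,
    "contextual_embeddings_plus_bm25": 2.9,
    "with_reranking": 1.9,
}

# Relative reduction = (baseline - rate) / baseline, rounded to whole percent.
reductions = {
    name: round(100 * (baseline - rate) / baseline)
    for name, rate in rates.items()
}
print(reductions)
```

The three values come out at 35, 49 and 67 percent, matching the published figures.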

The practical reading is simple: for a production RAG system with a quality requirement, the combination of hybrid search (dense vector plus BM25) followed by reranking delivers a near-systematic quality gain. The absolute numbers come from Anthropic's own dataset and will differ on yours.

Hybrid search. Combines a vector similarity score (semantic) and a BM25 score (lexical, term frequency and rarity). Pinecone documents (accessed May 2026) a weighted convex-combination approach: combined = alpha * dense + (1 - alpha) * sparse. Typical alpha values:

  • 0.75 for conversational content or natural-language queries
  • 0.5 for a standard balance
  • 0.25 for very specific queries containing technical terms or identifiers

Pinecone notes there's no universally optimal value. Calibrating alpha must be done on your own representative data and queries.
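The convex combination can be sketched in a few lines. This is not Pinecone's client API: `hybrid_score` and `rank` are hypothetical helpers, and the example assumes dense and sparse scores have already been normalised to a comparable range (here 0 to 1), which is a precondition for the weighting to be meaningful.

```python
def hybrid_score(dense: float, sparse: float, alpha: float) -> float:
    """Convex combination: alpha weights the dense (semantic) score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense + (1 - alpha) * sparse


def rank(candidates: dict, alpha: float) -> list:
    """candidates maps doc_id -> (dense, sparse), both assumed in [0, 1]."""
    return sorted(
        candidates,
        key=lambda doc_id: hybrid_score(*candidates[doc_id], alpha),
        reverse=True,
    )


# Illustrative scores: one semantically close document, one exact-term match.
candidates = {"faq": (0.9, 0.2), "spec-REF-1234": (0.4, 0.95)}
print(rank(candidates, alpha=0.75))
print(rank(candidates, alpha=0.25))
```

With these made-up scores, the semantically close document ranks first at alpha 0.75 and the exact-term match ranks first at alpha 0.25, which is the behaviour the typical values above are meant to capture.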

Reranking. An additional layer that reorders the results retrieved by the first stage, using a dedicated model that's more precise but more costly. Cohere Rerank (accessed May 2026) is the commercial reference, billed per query. The added latency is measured in hundreds of milliseconds depending on the size of the list to rerank.
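The two-stage shape can be sketched as follows. The scorer here is a toy term-overlap function standing in for a real reranking model; it is an assumption for illustration only and does not reflect Cohere's API or its quality. In practice the `scorer` argument would wrap a call to the reranking service.

```python
from typing import Callable, List


def overlap_scorer(query: str, passage: str) -> float:
    """Toy stand-in for a reranker: share of query terms found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(
    query: str,
    candidates: List[str],
    scorer: Callable[[str, str], float] = overlap_scorer,
    top_n: int = 3,
) -> List[str]:
    """Reorder first-stage candidates by scorer and keep the best top_n."""
    return sorted(candidates, key=lambda p: scorer(query, p), reverse=True)[:top_n]
```

The first stage stays cheap and broad (for example the top 50 to 100 candidates), and the costly model only sees that short list, which is why the added latency grows with the size of the list to rerank.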

IgnitionAI estimate based on engagements: in 80 percent of the enterprise RAG cases we've observed, the target architecture is hybrid plus reranking, with an alpha around 0.5 then calibrated. Pure vector search remains justified for non-critical cases with a strong latency or cost constraint.

Decision 4 · Observability: what to log, how to alert

Without observability, you'll never know why your RAG answers a specific question badly. Nor will you know whether quality drifts after a model or index update. It's the most systematic blind spot of the first POCs, which make do with a console.log on the prompts.

Several dedicated tools cover this need: Langfuse, LangSmith, Helicone, Arize Phoenix. They all capture, per trace, the prompt sent to the model, the generated response, the tokens consumed, the latency per step and the estimated cost.

What you absolutely must log on every request:

  • The full prompt sent to the LLM, including the sources returned by the retriever
  • The list of retrieved chunks, with their scores and document identifiers
  • The response generated by the LLM
  • The user identifier (anonymised per your GDPR policy)
  • The latency per step: query embedding, retrieval, reranking, generation
  • The input and output tokens, and the estimated cost
  • Any errors or retries
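The per-request record above can be sketched as a single structured trace. The class and field names are illustrative assumptions, not the schema of Langfuse or any other tool; the point is that every item in the list has a named, queryable field.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    doc_id: str
    score: float


@dataclass
class RagTrace:
    user_id: str             # assumed already pseudonymised upstream
    prompt: str              # full prompt, retrieved sources included
    response: str
    chunks: list             # list of RetrievedChunk
    latency_ms: dict         # step name -> milliseconds
    input_tokens: int
    output_tokens: int
    estimated_cost_eur: float
    errors: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the trace as one JSON line for a log pipeline."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

One JSON line per request is enough to rebuild, after the fact, which chunks fed a given answer and what each step cost.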

Beyond collection, configure four minimal alerts:

  • P95 latency above the business threshold. Detects infra regressions or load spikes.
  • Average retrieval score dropping. A signal of quality drift, often tied to a change in the indexed data or the embedding model.
  • Rate of queries with no source returned above a threshold. Indicates either a documentation-coverage problem or a problem in how users phrase queries.
  • Daily spend above budget. Detects infinite loops, abuse, or simply unforeseen user behaviour.
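The four alerts can be sketched as one check over a day of traces. The trace keys and threshold names are hypothetical, and the quality alert is simplified to a fixed floor on the mean retrieval score, whereas a real setup would more likely compare against a rolling baseline to detect a drop.

```python
import math


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def check_alerts(traces, thresholds):
    """Return the names of the alerts that fire for one day of traces.

    Each trace is a dict with hypothetical keys: latency_ms (end to end),
    top_score (None when the retriever returned no source) and cost_eur.
    """
    if not traces:
        return []
    fired = []
    if p95([t["latency_ms"] for t in traces]) > thresholds["p95_latency_ms"]:
        fired.append("p95_latency")
    scores = [t["top_score"] for t in traces if t["top_score"] is not None]
    if scores and sum(scores) / len(scores) < thresholds["min_mean_score"]:
        fired.append("retrieval_score")
    no_source_rate = (len(traces) - len(scores)) / len(traces)
    if no_source_rate > thresholds["max_no_source_rate"]:
        fired.append("no_source_rate")
    if sum(t["cost_eur"] for t in traces) > thresholds["daily_budget_eur"]:
        fired.append("daily_spend")
    return fired
```

The thresholds belong in configuration owned by the business side: the 2-second P95 mentioned earlier would translate to `p95_latency_ms = 2000`.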

Observability is a prerequisite for production. A RAG with no traces is not auditable and, where the system qualifies as high-risk, does not meet the logging requirements of Article 12 of Regulation (EU) 2024/1689 on AI.

Decision 5 · Access control: the decision you make first or never

Access control comes fifth in this article because it is the fifth decision in the technical grid. In project scoping, though, it is often the one to settle first, because it conditions the architecture of the other four.

Three patterns dominate in enterprise RAG:

Post-retrieval filter. The system retrieves the relevant chunks then filters out the ones the user isn't allowed to see. Simple to code, but inefficient: you bring back results you throw away. If the index contains many documents a typical user has no access to, you waste compute.

Pre-retrieval filter with metadata filtering. At query time, you pass the user's permissions (Active Directory groups, business roles) to the vector store, which uses them as a filter predicate before the similarity search. This is the optimal pattern for most mid-market use cases.
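The pre-retrieval pattern can be sketched with an in-memory index. Real vector stores expose their own metadata-filter syntax, so the `search` function, the chunk layout (`id`, `vec`, `acl`) and the group names below are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, user_groups, top_k=3):
    """Apply the ACL predicate first, then rank only the permitted chunks."""
    allowed = [c for c in index if c["acl"] & user_groups]
    ranked = sorted(allowed, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return [c["id"] for c in ranked[:top_k]]


# Toy index with two-dimensional vectors and hypothetical group names.
index = [
    {"id": "hr-salaries", "vec": [1.0, 0.0], "acl": {"hr"}},
    {"id": "handbook", "vec": [0.9, 0.1], "acl": {"hr", "all-staff"}},
    {"id": "it-guide", "vec": [0.0, 1.0], "acl": {"all-staff"}},
]
print(search(index, [1.0, 0.0], {"all-staff"}, top_k=2))
```

Because the filter runs before the similarity ranking, a user in `all-staff` still receives a full `top_k` of permitted results, and the restricted chunk never enters the candidate list at all.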

Multi-tenant index. One physical index per tenant or per access group. Conceptually simple, but it explodes in cost and administration complexity beyond a few distinct tenants. Justified when isolation must be strong (distinct clients of a SaaS platform, for example).

We've published a full analysis of the five enterprise-RAG access-control architectures in this article, with their security guarantees, their implementation complexity and their link to the AI Act and GDPR obligations.

The cost of an after-the-fact migration is near-prohibitive. Without clean access control from the design stage, your only option after an incident is a full reindex with a new architecture, which means several weeks to restore service. The decision is made before the first line of ingestion code.

The grid on one page

The five decisions, in summary:

| Decision | Question | Default recommendation (mid-market) |
|---|---|---|
| 1. Ingestion | What data freshness? | Nightly batch + delta streaming for living sources |
| 2. Chunking | One strategy or several? | One strategy per document type, plus ACL and version metadata on each chunk |
| 3. Search | Pure vector, hybrid, or reranking? | Hybrid alpha around 0.5 plus reranking for quality-critical cases |
| 4. Observability | What traces, what alerts? | Full traces, four minimal alerts (latency, quality, coverage, cost) |
| 5. Access control | What filtering architecture? | Pre-retrieval with metadata filtering, to settle before any ingestion code |

This grid is a starting point. Each decision is then calibrated to your precise context: document volume, data sensitivity, user profile, latency and cost constraints, level of regulatory requirement.

Sources and methodology

Official documentation of the cited tools (accessed May 2026): LlamaIndex, Pinecone, Cohere Rerank, Langfuse.

Figures:

  • Anthropic Contextual Retrieval (September 2024): anthropic.com/news/contextual-retrieval. Reduction in top-20 retrieval failure rate, measured by Anthropic on their internal dataset. Absolute gains depend on the domain and the data.

Regulatory frameworks:

  • Regulation (EU) 2024/1689 on artificial intelligence, Article 12 (logging). EUR-Lex

IgnitionAI estimates: the recommendations marked "IgnitionAI estimate" draw on large-account engagement experience 2024-2026 on enterprise RAG projects. Variation possible depending on the precise context.


Are you working on the production rollout of an enterprise RAG and want an outside view on one of these five decisions? We offer a focused thirty-minute audit, with no sales pitch. Request a call.
