🧠 RAG Configuration & Strategies

Configure intelligent search and retrieval for optimal AI responses using advanced RAG strategies

🎯 Overview

Archon's RAG (Retrieval-Augmented Generation) system provides configurable search strategies to optimize how your AI agents find and use information from your knowledge base. This guide covers configuration options and optimization strategies.

Quick Configuration

Access RAG settings in the Web Interface → Settings → RAG Settings for easy configuration without code changes.

🛠️ RAG Configuration Options

Core Settings

| Setting | Description | Default | Impact |
|---|---|---|---|
| MODEL_CHOICE | Chat model for query enhancement | gpt-4o-mini | Response quality |
| EMBEDDING_MODEL | Model for vector embeddings | text-embedding-3-small | Search accuracy |
| LLM_PROVIDER | Provider (openai / google / ollama) | openai | Model availability |

Advanced Strategies

| Strategy | Purpose | Performance Impact | Use Cases |
|---|---|---|---|
| Contextual Embeddings | Enhanced embeddings with document context | +30% accuracy, +2x time | Technical docs, code |
| Hybrid Search | Vector + keyword combination | +20% accuracy, +50% time | Mixed content types |
| Agentic RAG | AI-powered query enhancement | +40% accuracy, +3x time | Complex queries |
| Reranking | AI-powered result reordering | +25% accuracy, +2x time | High-precision needs |

⚙️ Configuration Strategies

1. Basic Configuration (Fastest)

Best for: General documentation, simple queries

# Minimal settings for speed
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=false
USE_AGENTIC_RAG=false
USE_RERANKING=false

2. Balanced Configuration (Recommended)

Best for: Most production use cases

# Balanced performance and accuracy
USE_CONTEXTUAL_EMBEDDINGS=true
CONTEXTUAL_EMBEDDINGS_MAX_WORKERS=3
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=false

3. High-Accuracy Configuration

Best for: Critical applications, complex technical docs

# Maximum accuracy (slower)
USE_CONTEXTUAL_EMBEDDINGS=true
CONTEXTUAL_EMBEDDINGS_MAX_WORKERS=3
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true

🔍 RAG Strategies Explained

Contextual Embeddings

Enhances embeddings with surrounding document context

# Standard embedding
"authentication" → [0.1, 0.3, 0.7, ...]

# Contextual embedding (with document context)
"authentication in React components using JWT tokens" → [0.2, 0.4, 0.8, ...]

Benefits:

  • Better understanding of domain-specific terms
  • Improved accuracy for technical content
  • Context-aware search results
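
As a rough sketch of how such a contextual embedding could be produced, here is a minimal example using the OpenAI embeddings API. The embed_with_context helper and the way context is prepended are illustrative assumptions, not Archon's exact implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_with_context(chunk: str, doc_context: str) -> list[float]:
    # Prepend surrounding document context so the embedding captures
    # domain-specific meaning, not just the isolated chunk text.
    contextual_text = f"{doc_context}\n\n{chunk}"
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=contextual_text,
    )
    return response.data[0].embedding

# "authentication" alone is ambiguous; with context it embeds differently:
vector = embed_with_context(
    chunk="authentication",
    doc_context="Guide to React components using JWT tokens",
)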

Hybrid Search

Combines vector similarity with keyword matching

Use Cases:

  • Mixed content (docs + code + APIs)
  • Exact term matching needed
  • Better coverage of rare terms
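
A minimal sketch of one common fusion approach, a weighted sum of the two scores. The weighting and the normalized keyword score are assumptions; Archon's exact fusion formula may differ:

def hybrid_score(vector_score: float, keyword_score: float,
                 alpha: float = 0.7) -> float:
    # Weighted fusion: alpha weights semantic (vector) similarity,
    # the remainder rewards exact keyword matches.
    return alpha * vector_score + (1 - alpha) * keyword_score

# An exact term like "nomic-embed-text" may score poorly on pure vector
# similarity but perfectly on keywords; fusion still surfaces it.
print(hybrid_score(vector_score=0.55, keyword_score=1.0))  # 0.685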

Agentic RAG

AI-powered query enhancement and result interpretation
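
A rough sketch of the query-enhancement half: an LLM rewrites a vague query into a more searchable one before retrieval runs. The prompt wording is illustrative, not Archon's actual prompt:

from openai import OpenAI

client = OpenAI()

def enhance_query(query: str) -> str:
    # Ask the chat model to expand the query with likely technical terms
    # before it is embedded and searched.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the search query to be specific and include "
                        "relevant technical terms. Reply with only the query."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(enhance_query("auth"))  # e.g. "user authentication with JWT tokens"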

Reranking

AI-powered result reordering for optimal relevance

# Initial search results (vector similarity)
results = [
    {"content": "JWT basics", "score": 0.85},
    {"content": "React auth patterns", "score": 0.83},
    {"content": "Token validation", "score": 0.81},
]

# After AI reranking (considering query context)
reranked = [
    {"content": "React auth patterns", "score": 0.95},  # ↑ More relevant
    {"content": "Token validation", "score": 0.88},     # ↑ Contextually better
    {"content": "JWT basics", "score": 0.78},           # ↓ Too generic
]
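
In practice, this reordering is often done with a cross-encoder model that scores each (query, result) pair directly. A minimal sketch using the sentence-transformers library, reusing the results list above (the model name is an illustrative choice, not necessarily what Archon uses):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I add authentication to a React app?"
# Score each (query, document) pair jointly; reading both together captures
# contextual relevance better than comparing precomputed vectors.
scores = reranker.predict([(query, r["content"]) for r in results])
reranked = [r for r, _ in sorted(zip(results, scores),
                                 key=lambda pair: pair[1], reverse=True)]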

📊 Performance Optimization

Speed vs Accuracy Trade-offs

Development & Testing

# Fast iteration, basic accuracy
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=false
USE_AGENTIC_RAG=false
USE_RERANKING=false
# ~200ms average query time

Production Documentation

# Balanced performance
USE_CONTEXTUAL_EMBEDDINGS=true
CONTEXTUAL_EMBEDDINGS_MAX_WORKERS=3
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=false
# ~800ms average query time

Mission-Critical Applications

# Maximum accuracy
USE_CONTEXTUAL_EMBEDDINGS=true
CONTEXTUAL_EMBEDDINGS_MAX_WORKERS=2 # Conservative for reliability
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
# ~3000ms average query time

🔧 Provider-Specific Configuration

OpenAI (Default)

LLM_PROVIDER=openai
MODEL_CHOICE=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-small
# Pros: Best accuracy, reliable API
# Cons: Cost per query

Google Gemini

LLM_PROVIDER=google
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta
MODEL_CHOICE=gemini-2.5-flash
EMBEDDING_MODEL=text-embedding-004
# Pros: Good performance, competitive pricing
# Cons: Different API patterns

Ollama (Local/Private)

LLM_PROVIDER=ollama
LLM_BASE_URL=http://localhost:11434/v1
MODEL_CHOICE=llama2
EMBEDDING_MODEL=nomic-embed-text
# Pros: Privacy, no API costs
# Cons: Local compute requirements
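
Because Ollama exposes an OpenAI-compatible endpoint, the same client code can target it by pointing base_url at the LLM_BASE_URL value above. A quick sketch:

from openai import OpenAI

# Ollama serves an OpenAI-compatible API; only the base URL and a
# placeholder key change, so the rest of the code stays provider-agnostic.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Summarize hybrid search."}],
)
print(response.choices[0].message.content)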

📈 Monitoring & Analytics

Key Metrics to Track

# Query Performance
- Average response time
- Cache hit rate
- Error rate by strategy

# Search Quality
- Result relevance scores
- User interaction patterns
- Query refinement frequency

# System Health
- API rate limit usage
- Embedding generation time
- Memory usage patterns
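
One lightweight way to start collecting the latency metric is a decorator around whatever search function you expose. This is a generic sketch; the logger name and log format are placeholders, not part of Archon:

import logging
import time
from functools import wraps

logger = logging.getLogger("rag.metrics")

def timed_query(func):
    # Wraps a search function and logs its latency, giving a baseline
    # against the ~200ms-3000ms budgets quoted above.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("query_time_ms=%.0f func=%s", elapsed_ms, func.__name__)
    return wrapper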

Optimization Recommendations

If Queries Are Too Slow:

  1. Reduce CONTEXTUAL_EMBEDDINGS_MAX_WORKERS
  2. Disable USE_AGENTIC_RAG for simple queries
  3. Implement result caching (a minimal sketch follows this list)
  4. Use lighter embedding models
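
For item 3, a minimal in-memory TTL cache keyed on the query is often enough to skip repeated embedding and retrieval work. The TTL value and helper names here are assumptions:

import hashlib
import time

_cache: dict[str, tuple[float, list]] = {}
TTL_SECONDS = 300  # assumption: 5-minute freshness is acceptable

def cached_search(query: str, search_fn) -> list:
    # Hash the query so equivalent requests hit the cache instead of
    # re-running embeddings and retrieval.
    key = hashlib.sha256(query.encode()).hexdigest()
    entry = _cache.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]
    results = search_fn(query)
    _cache[key] = (time.time(), results)
    return results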

If Results Are Inaccurate:

  1. Enable USE_CONTEXTUAL_EMBEDDINGS
  2. Add USE_HYBRID_SEARCH for mixed content
  3. Consider USE_RERANKING for critical applications
  4. Improve source document quality

If Hitting Rate Limits:

  1. Reduce max workers: CONTEXTUAL_EMBEDDINGS_MAX_WORKERS=1
  2. Implement exponential backoff (sketched after this list)
  3. Use caching more aggressively
  4. Consider switching to local models (Ollama)
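
For item 2, a generic exponential-backoff wrapper looks like this. It is a sketch: in real code, narrow the except clause to your provider's rate-limit exception (e.g. openai.RateLimitError) rather than catching everything:

import random
import time

def with_backoff(call, max_retries: int = 5):
    # Retry a rate-limited API call with exponentially growing delays
    # plus jitter, so concurrent workers don't retry in lockstep.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # narrow this to the provider's rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())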

🎯 Best Practices

Content Optimization

  1. Document Structure: Use clear headings and sections
  2. Code Examples: Include working code snippets
  3. Context: Provide sufficient surrounding context
  4. Tags: Use descriptive tags for better categorization

Query Optimization

  1. Be Specific: "React authentication with JWT" vs "auth"
  2. Use Technical Terms: Include framework/library names
  3. Provide Context: Mention your specific use case
  4. Iterate: Refine queries based on initial results

System Tuning

  1. Start Simple: Begin with basic configuration
  2. Measure Impact: Enable one strategy at a time
  3. Monitor Performance: Track both speed and accuracy
  4. User Feedback: Collect feedback on result quality