🧠 Archon Knowledge: Complete Feature Reference

Everything you can do with Archon Knowledge - from web crawling to semantic search, document processing to code extraction.

🎯 Core Knowledge Management Features

Content Acquisition

  • Smart Web Crawling: Intelligent crawling with automatic content type detection
  • Document Upload: Support for PDF, Word, Markdown, and text files
  • Sitemap Processing: Crawl entire documentation sites efficiently
  • Recursive Crawling: Follow links to specified depth
  • Batch Processing: Handle multiple URLs simultaneously

Content Processing

  • Smart Chunking: Intelligent text splitting preserving context
  • Contextual Embeddings: Chunk embeddings enriched with surrounding document context for more accurate retrieval
  • Code Extraction: Automatic detection and indexing of code examples (see the sketch after this list)
    • Only extracts substantial code blocks (≥ 1000 characters by default)
    • Supports markdown code blocks (triple backticks)
    • Falls back to HTML extraction for <pre> and <code> tags
    • Generates AI summaries for each code example
  • Metadata Extraction: Headers, sections, and document structure
  • Source Management: Track and manage content sources
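
To make these extraction rules concrete, here is a minimal sketch of fence detection with the size filter. The function name, regex, and threshold constant are illustrative, not Archon's actual extractor:

```javascript
// Minimal sketch of code-block extraction; names and regex are illustrative.
const MIN_CODE_LENGTH = 1000; // "substantial" threshold from the list above
const FENCE = '`'.repeat(3);  // fence built dynamically to keep this sample valid markdown

function extractCodeBlocks(markdown, minLength = MIN_CODE_LENGTH) {
  const pattern = new RegExp(FENCE + '(\\w*)\\n([\\s\\S]*?)' + FENCE, 'g');
  const blocks = [];
  let match;
  while ((match = pattern.exec(markdown)) !== null) {
    const [, language, code] = match;
    if (code.length >= minLength) {
      blocks.push({ language: language || 'unknown', code });
    }
  }
  // An empty result is where a real pipeline could fall back to <pre>/<code> HTML extraction.
  return blocks;
}
```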

Search & Retrieval

  • Semantic Search: Vector similarity search with pgvector
  • Code Search: Specialized search for code examples
  • Source Filtering: Search within specific sources
  • Result Reranking: Optional ML-based result reranking
  • Hybrid Search: Combine keyword and semantic search (see the sketch below)
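
As a rough illustration of the hybrid approach, the snippet below blends a keyword score with a vector-similarity score into one ranking. The 0.3/0.7 weights and field names are assumptions for illustration, not Archon's actual values:

```javascript
// Illustrative hybrid ranking: blend keyword and semantic scores.
// Weights and field names are assumptions, not Archon's implementation.
function hybridRank(results, keywordWeight = 0.3, semanticWeight = 0.7) {
  return results
    .map((r) => ({
      ...r,
      score: keywordWeight * r.keywordScore + semanticWeight * r.similarity,
    }))
    .sort((a, b) => b.score - a.score);
}
```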

🤖 AI Integration Features

MCP Tools Available

Knowledge Tools

  • perform_rag_query - Semantic search
  • search_code_examples - Code-specific search
  • crawl_single_page - Index one page
  • smart_crawl_url - Intelligent crawling
  • get_available_sources - List sources
  • upload_document - Process documents
  • delete_source - Remove content
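
An MCP client calls these tools by name with JSON arguments. A minimal sketch of a perform_rag_query call follows; the parameter names (query, source, match_count) are assumptions for illustration, so check the schemas your Archon MCP server actually exposes:

```javascript
// Hypothetical MCP tool call; argument names are assumptions, not the
// confirmed schema. Verify against the server's advertised tool definitions.
const request = {
  method: 'tools/call',
  params: {
    name: 'perform_rag_query',
    arguments: {
      query: 'How do I configure authentication?',
      source: 'docs.example.com', // assumed optional source filter
      match_count: 5,             // assumed optional result limit
    },
  },
};
```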

AI Capabilities

  • AI can search your entire knowledge base
  • Automatic query refinement for better results
  • Context-aware answer generation
  • Code example recommendations
  • Source citation in responses

🎨 User Interface Features

Knowledge Table View

  • Source Overview: See all indexed sources at a glance
  • Document Count: Track content volume per source
  • Last Updated: Monitor content freshness
  • Quick Actions: Delete or refresh sources
  • Search Interface: Test queries directly in UI

Upload Interface

  • Drag & Drop: Easy file uploads
  • Progress Tracking: Real-time processing status
  • Batch Upload: Multiple files simultaneously
  • Format Detection: Automatic file type handling

Crawl Interface

  • URL Input: Simple URL entry with validation
  • Crawl Options: Configure depth, filters
  • Progress Monitoring: Live crawl status via Socket.IO
  • Stop Functionality: Cancel crawls immediately with proper cleanup
  • Result Preview: See crawled content immediately

🔧 Technical Capabilities

Backend Services

RAG Services

  • Crawling Service: Web page crawling and parsing
  • Document Storage Service: Chunking and storage
  • Search Service: Query processing and retrieval
  • Source Management Service: Source metadata handling

Embedding Services

  • Embedding Service: OpenAI embeddings with rate limiting
  • Contextual Embedding Service: Context-aware embeddings (see the sketch below)
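
The idea behind contextual embeddings is to prepend a short document-level summary to each chunk before embedding it, so chunks remain retrievable out of context. A minimal sketch, assuming a generic embed() callback rather than Archon's actual service API:

```javascript
// Contextual-embedding sketch: embed each chunk together with a short
// description of its source document. embed() is a stand-in for any
// embedding call, not Archon's service API.
async function embedWithContext(chunks, documentSummary, embed) {
  return Promise.all(
    chunks.map(async (chunk) => ({
      text: chunk,
      vector: await embed(`Document context: ${documentSummary}\n\n${chunk}`),
    }))
  );
}
```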

Storage Services

  • Document Storage Service: Parallel document processing
  • Code Storage Service: Code extraction and indexing

Processing Pipeline

  1. Content Acquisition: Crawl or upload
  2. Text Extraction: Parse HTML/PDF/DOCX
  3. Smart Chunking: Split into optimal chunks
  4. Embedding Generation: Create vector embeddings
  5. Storage: Save to Supabase with pgvector
  6. Indexing: Update search indices
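
Conceptually the pipeline is a straight composition of those six stages. The sketch below mirrors the steps with placeholder stage functions passed in, since these names are illustrative rather than Archon's internal API:

```javascript
// Placeholder pipeline mirroring the six steps above. The stage functions
// are injected because their names are illustrative, not Archon's API.
async function ingest(source, stages) {
  const raw = await stages.acquire(source);       // 1. crawl or upload
  const text = stages.extractText(raw);           // 2. parse HTML/PDF/DOCX
  const chunks = stages.smartChunk(text);         // 3. context-preserving splits
  const embedded = await stages.embedAll(chunks); // 4. vector embeddings
  await stages.store(embedded);                   // 5. Supabase + pgvector
  await stages.reindex(source);                   // 6. refresh search indices
}
```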

Performance Features

  • Rate Limiting: 200k tokens/minute cap on OpenAI usage (see the sketch after this list)
  • Parallel Processing: ThreadPoolExecutor optimization
  • Batch Operations: Process multiple documents efficiently
  • Socket.IO Progress: Real-time status updates
  • Caching: Intelligent result caching
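
A per-minute token budget like the 200k limit above is commonly enforced with a token-bucket style check. Archon's rate limiter lives in the backend services, so the JavaScript sketch below only illustrates the idea, not the actual implementation:

```javascript
// Conceptual token-budget check, illustrative only.
const BUDGET = 200_000; // tokens per minute
let windowStart = Date.now();
let spent = 0;

function tryConsume(tokens) {
  if (Date.now() - windowStart >= 60_000) {
    windowStart = Date.now(); // new minute, fresh budget
    spent = 0;
  }
  if (spent + tokens > BUDGET) return false; // caller should wait and retry
  spent += tokens;
  return true;
}
```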

Crawl Cancellation

  • Immediate Response: Stop button provides instant UI feedback
  • Proper Cleanup: Cancels both orchestration service and asyncio tasks
  • State Persistence: Uses localStorage to prevent zombie crawls on refresh (see the sketch after this list)
  • Socket.IO Events: Real-time cancellation status via crawl:stopping and crawl:stopped
  • Resource Management: Ensures all server resources are properly released
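
A guard against zombie crawls might look like the sketch below: record the active progress_id in localStorage when a crawl starts, and on the next page load cancel anything left over. The 'active_crawl' key name is an assumption; crawl_stop matches the event used in the Live Updates example further down:

```javascript
// Sketch of the localStorage guard; the key name is an assumption.
function startCrawl(socket, progressId) {
  localStorage.setItem('active_crawl', progressId);
  socket.emit('subscribe', { progress_id: progressId });
}

// Call once on page load to cancel a crawl orphaned by a refresh.
function cleanupZombieCrawl(socket) {
  const leftover = localStorage.getItem('active_crawl');
  if (leftover) {
    socket.emit('crawl_stop', { progress_id: leftover });
    localStorage.removeItem('active_crawl');
  }
}
```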

🚀 Advanced Features

Content Types

Web Crawling

  • Sitemap Support: Process sitemap.xml files
  • Text File Support: Direct .txt file processing
  • Markdown Support: Native .md file handling
  • Dynamic Content: JavaScript-rendered pages
  • Authentication: Basic auth support

Document Processing

  • PDF Extraction: Full text and metadata
  • Word Documents: .docx and .doc support
  • Markdown Files: Preserve formatting
  • Plain Text: Simple text processing
  • Code Files: Syntax-aware processing

Search Features

  • Query Enhancement: Automatic query expansion
  • Contextual Results: Include surrounding context
  • Code Highlighting: Syntax highlighting in results
  • Similarity Scoring: Relevance scores for ranking
  • Faceted Search: Filter by metadata

Monitoring & Analytics

  • Logfire Integration: Real-time performance monitoring
  • Search Analytics: Track popular queries
  • Content Analytics: Usage patterns
  • Performance Metrics: Query times, throughput

Debugging Tips

Code Extraction Not Working?

  1. Check code block size: Only blocks ≥ 1000 characters are extracted
  2. Verify markdown format: Code must be in triple backticks (```)
  3. Check crawl results: Look for backtick_count in logs (see the snippet after this list)
  4. HTML fallback: System will try HTML if markdown has no code blocks
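
Before blaming the extractor, it can help to count the fence markers in the crawled markdown yourself; an odd count usually means an unterminated code block. A throwaway diagnostic, not an Archon utility:

```javascript
// Count triple-backtick fence markers; an odd total suggests an
// unterminated block. Throwaway check, not part of Archon.
function backtickCount(markdown) {
  const fence = '`'.repeat(3); // built dynamically to keep this sample valid markdown
  return markdown.split(fence).length - 1;
}
```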

Crawling Issues?

  1. Timeouts: The system uses simple crawling without complex waits
  2. JavaScript sites: Content should render without special interactions
  3. Progress stuck: Check logs for batch processing updates
  4. Different content types:
    • .txt files: Direct text extraction
    • sitemap.xml: Batch crawl all URLs
    • Regular URLs: Recursive crawl with depth limit

📊 Usage Examples

Crawling Documentation

AI Assistant: "Crawl the React documentation"

The system will:

  1. Detect it's a documentation site
  2. Find and process the sitemap
  3. Crawl all pages with progress updates
  4. Extract code examples
  5. Create searchable chunks

Searching Knowledge

AI Assistant: "Find information about React hooks"

The search will:

  1. Generate embeddings for the query
  2. Search across all React documentation
  3. Return relevant chunks with context
  4. Include code examples
  5. Provide source links

Document Upload

User uploads "system-design.pdf"

The system will:

  1. Extract text from PDF
  2. Identify sections and headers
  3. Create smart chunks
  4. Generate embeddings
  5. Make searchable immediately
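
From a browser, an upload like this typically goes out as multipart form data. A minimal sketch follows; the '/api/documents/upload' path is a placeholder, not Archon's documented route:

```javascript
// Hypothetical upload call; the endpoint path is a placeholder.
async function uploadDocument(file) {
  const form = new FormData();
  form.append('file', file); // e.g. system-design.pdf
  const response = await fetch('/api/documents/upload', {
    method: 'POST',
    body: form,
  });
  return response.json(); // expected: processing status / document id
}
```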

🔗 Real-time Features

Socket.IO Progress

  • Crawl Progress: Percentage completion with smooth updates
  • Current URL: What's being processed in real-time
  • Batch Updates: Processing batches with accurate progress
  • Stop Support: Cancel crawls with immediate feedback
  • Error Reporting: Failed URLs with detailed errors
  • Completion Status: Final statistics and results

Live Updates

```javascript
// Connect to Socket.IO and subscribe to progress
import { io } from 'socket.io-client';

const socket = io('/crawl');
socket.emit('subscribe', { progress_id: 'uuid' });

// Handle progress updates
socket.on('progress_update', (data) => {
  console.log(`Progress: ${data.percentage}%`);
  console.log(`Status: ${data.status}`);
  console.log(`Current URL: ${data.currentUrl}`);
  console.log(`Pages: ${data.processedPages}/${data.totalPages}`);
});

// Handle completion
socket.on('progress_complete', (data) => {
  console.log('Crawl completed!');
  console.log(`Chunks stored: ${data.chunksStored}`);
  console.log(`Word count: ${data.wordCount}`);
});

// Stop a crawl
socket.emit('crawl_stop', { progress_id: 'uuid' });

// Handle stop events
socket.on('crawl:stopping', (data) => {
  console.log('Crawl is stopping...');
});

socket.on('crawl:stopped', (data) => {
  console.log('Crawl cancelled successfully');
});
```