🧠 Archon Knowledge: Complete Feature Reference
Everything you can do with Archon Knowledge, from web crawling and semantic search to document processing and code extraction.
🎯 Core Knowledge Management Features​
Content Acquisition​
- Smart Web Crawling: Intelligent crawling with automatic content type detection
- Document Upload: Support for PDF, Word, Markdown, and text files
- Sitemap Processing: Crawl entire documentation sites efficiently
- Recursive Crawling: Follow links to specified depth
- Batch Processing: Handle multiple URLs simultaneously
Content Processing​
- Smart Chunking: Intelligent text splitting preserving context
- Contextual Embeddings: Enhanced embeddings for better retrieval
- Code Extraction: Automatic detection and indexing of code examples
  - Only extracts substantial code blocks (≥ 1000 characters by default)
  - Supports markdown code blocks (triple backticks)
  - Falls back to HTML extraction for `<pre>` and `<code>` tags
  - Generates AI summaries for each code example
- Metadata Extraction: Headers, sections, and document structure
- Source Management: Track and manage content sources
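As a rough sketch of how markdown-first code extraction with a minimum-size filter and an HTML fallback might work (the 1000-character threshold comes from the default above; the function names here are illustrative, not Archon's actual API):

```python
import re

MIN_CODE_LENGTH = 1000  # default threshold for a "substantial" code block

def extract_code_blocks(markdown: str, min_length: int = MIN_CODE_LENGTH) -> list[str]:
    """Pull triple-backtick code blocks out of markdown, keeping only large ones."""
    blocks = re.findall(r"```[\w-]*\n(.*?)```", markdown, flags=re.DOTALL)
    return [b.strip() for b in blocks if len(b.strip()) >= min_length]

def extract_code_from_html(html: str, min_length: int = MIN_CODE_LENGTH) -> list[str]:
    """Fallback: pull code out of <pre>/<code> tags when markdown has none."""
    blocks = re.findall(r"<(?:pre|code)[^>]*>(.*?)</(?:pre|code)>", html, flags=re.DOTALL)
    return [b.strip() for b in blocks if len(b.strip()) >= min_length]
```

In practice, the fallback only runs when the markdown pass returns nothing, which keeps duplicate extraction from pages that contain both representations.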
Search & Retrieval​
- Semantic Search: Vector similarity search with pgvector
- Code Search: Specialized search for code examples
- Source Filtering: Search within specific sources
- Result Reranking: Optional ML-based result reranking
- Hybrid Search: Combine keyword and semantic search
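One common way to combine keyword and semantic rankings is reciprocal rank fusion; the sketch below is illustrative (the constant `k` and the merge strategy are assumptions, not Archon's documented implementation):

```python
def reciprocal_rank_fusion(keyword_ids: list[str],
                           semantic_ids: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked result lists; items ranked high in either list rise to the top."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking):
            # 1/(k + rank) rewards high ranks without letting one list dominate
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates score from each, so agreement between keyword and vector search is rewarded even when neither ranks it first.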
🤖 AI Integration Features​
MCP Tools Available​
Knowledge Tools
- `perform_rag_query` - Semantic search
- `search_code_examples` - Code-specific search
- `crawl_single_page` - Index one page
- `smart_crawl_url` - Intelligent crawling
- `get_available_sources` - List sources
- `upload_document` - Process documents
- `delete_source` - Remove content
AI Capabilities
- AI can search your entire knowledge base
- Automatic query refinement for better results
- Context-aware answer generation
- Code example recommendations
- Source citation in responses
🎨 User Interface Features​
Knowledge Table View​
- Source Overview: See all indexed sources at a glance
- Document Count: Track content volume per source
- Last Updated: Monitor content freshness
- Quick Actions: Delete or refresh sources
- Search Interface: Test queries directly in UI
Upload Interface​
- Drag & Drop: Easy file uploads
- Progress Tracking: Real-time processing status
- Batch Upload: Multiple files simultaneously
- Format Detection: Automatic file type handling
Crawl Interface​
- URL Input: Simple URL entry with validation
- Crawl Options: Configure depth, filters
- Progress Monitoring: Live crawl status via Socket.IO
- Stop Functionality: Cancel crawls immediately with proper cleanup
- Result Preview: See crawled content immediately
🔧 Technical Capabilities​
Backend Services​
RAG Services​
- Crawling Service: Web page crawling and parsing
- Document Storage Service: Chunking and storage
- Search Service: Query processing and retrieval
- Source Management Service: Source metadata handling
Embedding Services​
- Embedding Service: OpenAI embeddings with rate limiting
- Contextual Embedding Service: Context-aware embeddings
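Contextual embedding generally means prefixing each chunk with document-level context before embedding it, so the vector captures where the chunk lives. A minimal sketch (illustrative only):

```python
def contextualize(chunk: str, doc_title: str, section: str) -> str:
    """Prefix a chunk with its document and section so the embedding encodes that context."""
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk}"

# The contextualized string, not the raw chunk, is what gets sent to the embedding model.
```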
Storage Services​
- Document Storage Service: Parallel document processing
- Code Storage Service: Code extraction and indexing
Processing Pipeline​
- Content Acquisition: Crawl or upload
- Text Extraction: Parse HTML/PDF/DOCX
- Smart Chunking: Split into optimal chunks
- Embedding Generation: Create vector embeddings
- Storage: Save to Supabase with pgvector
- Indexing: Update search indices
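The pipeline above can be sketched as a simple chain of stages; everything here is a stand-in for the real services (naive chunking, a fake embedder), shown only to make the data flow concrete:

```python
def extract_text(raw: bytes, content_type: str) -> str:
    # Stand-in for real HTML/PDF/DOCX parsing
    return raw.decode("utf-8", errors="ignore")

def smart_chunk(text: str, max_chars: int = 800) -> list[str]:
    # Naive chunking on paragraph boundaries; real chunking preserves headers/context
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def embed(chunk: str) -> list[float]:
    # Stand-in for an OpenAI embeddings call
    return [float(len(chunk))]

def process_source(raw: bytes, content_type: str) -> list[dict]:
    """Extract -> chunk -> embed: produces storage-ready records (pgvector rows)."""
    text = extract_text(raw, content_type)
    return [{"content": c, "embedding": embed(c)} for c in smart_chunk(text)]
```

Each record maps directly onto a row in Supabase: the chunk text plus its embedding vector, ready for similarity search.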
Performance Features​
- Rate Limiting: 200k tokens/minute OpenAI limit
- Parallel Processing: ThreadPoolExecutor optimization
- Batch Operations: Process multiple documents efficiently
- Socket.IO Progress: Real-time status updates
- Caching: Intelligent result caching
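A token budget like 200k tokens/minute is typically enforced with a sliding window; this is a purely illustrative sketch, not Archon's actual limiter:

```python
import time
from collections import deque

class TokenRateLimiter:
    """Sliding-window limiter: blocks until spending `tokens` fits under the per-minute budget."""

    def __init__(self, tokens_per_minute: int = 200_000):
        self.budget = tokens_per_minute
        self.window: deque[tuple[float, int]] = deque()  # (timestamp, tokens spent)

    def _used(self, now: float) -> int:
        while self.window and now - self.window[0][0] > 60:
            self.window.popleft()  # drop spends older than one minute
        return sum(tokens for _, tokens in self.window)

    def acquire(self, tokens: int) -> None:
        while self._used(time.monotonic()) + tokens > self.budget:
            time.sleep(0.1)  # wait for the window to drain
        self.window.append((time.monotonic(), tokens))
```

Callers estimate the token cost of a batch before sending it, so embedding requests queue up instead of tripping the provider's rate limit.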
Crawl Cancellation​
- Immediate Response: Stop button provides instant UI feedback
- Proper Cleanup: Cancels both orchestration service and asyncio tasks
- State Persistence: Uses localStorage to prevent zombie crawls on refresh
- Socket.IO Events: Real-time cancellation status via `crawl:stopping` and `crawl:stopped` events
- Resource Management: Ensures all server resources are properly released
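Cancelling an in-flight crawl with asyncio usually means cancelling the task and cleaning up in a `finally` block. The event names below follow the `crawl:stopping`/`crawl:stopped` convention above; the rest is an illustrative sketch:

```python
import asyncio

async def run_crawl(urls: list[str], emit) -> None:
    """Crawl URLs one by one; a CancelledError triggers cleanup and a final stopped event."""
    try:
        for url in urls:
            await asyncio.sleep(0.01)  # stand-in for fetching/parsing one page
            emit("crawl:progress", url)
    except asyncio.CancelledError:
        emit("crawl:stopping", None)   # tell the UI cancellation is underway
        raise                          # re-raise so the task really ends
    finally:
        emit("crawl:stopped", None)    # resources released, state is clean

async def main() -> list[str]:
    events: list[str] = []
    task = asyncio.create_task(run_crawl(["a", "b", "c"], lambda e, _: events.append(e)))
    await asyncio.sleep(0.015)  # let the crawl start
    task.cancel()               # user hit Stop
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events
```

Re-raising `CancelledError` matters: swallowing it leaves the task "running" from asyncio's point of view, which is exactly the zombie-crawl state the localStorage guard protects against.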
🚀 Advanced Features​
Content Types​
Web Crawling​
- Sitemap Support: Process sitemap.xml files
- Text File Support: Direct .txt file processing
- Markdown Support: Native .md file handling
- Dynamic Content: JavaScript-rendered pages
- Authentication: Basic auth support
Document Processing​
- PDF Extraction: Full text and metadata
- Word Documents: .docx and .doc support
- Markdown Files: Preserve formatting
- Plain Text: Simple text processing
- Code Files: Syntax-aware processing
Search Features​
- Query Enhancement: Automatic query expansion
- Contextual Results: Include surrounding context
- Code Highlighting: Syntax highlighting in results
- Similarity Scoring: Relevance scores for ranking
- Faceted Search: Filter by metadata
Monitoring & Analytics​
- Logfire Integration: Real-time performance monitoring
- Search Analytics: Track popular queries
- Content Analytics: Usage patterns
- Performance Metrics: Query times, throughput
Debugging Tips​
Code Extraction Not Working?​
- Check code block size: Only blocks ≥ 1000 characters are extracted
- Verify markdown format: Code must be in triple backticks (```)
- Check crawl results: Look for `backtick_count` in logs
- HTML fallback: System will try HTML if markdown has no code blocks
Crawling Issues?​
- Timeouts: The system uses simple crawling without complex waits
- JavaScript sites: Content should render without special interactions
- Progress stuck: Check logs for batch processing updates
- Different content types:
  - `.txt` files: Direct text extraction
  - `sitemap.xml`: Batch crawl all URLs
  - Regular URLs: Recursive crawl with depth limit
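That per-content-type dispatch can be sketched as a simple URL classifier (illustrative only, not Archon's exact routing logic):

```python
def classify_crawl_target(url: str) -> str:
    """Decide how to process a URL: direct text, sitemap batch, or recursive crawl."""
    path = url.split("?", 1)[0].lower()  # ignore query strings when matching
    if path.endswith((".txt", ".md")):
        return "direct-text"       # fetch once, extract text directly
    if path.endswith("sitemap.xml"):
        return "sitemap-batch"     # parse the sitemap, batch-crawl every URL in it
    return "recursive-crawl"       # follow links up to the configured depth limit
```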
📊 Usage Examples​
Crawling Documentation​
AI Assistant: "Crawl the React documentation"
The system will:
- Detect it's a documentation site
- Find and process the sitemap
- Crawl all pages with progress updates
- Extract code examples
- Create searchable chunks
Searching Knowledge​
AI Assistant: "Find information about React hooks"
The search will:
- Generate embeddings for the query
- Search across all React documentation
- Return relevant chunks with context
- Include code examples
- Provide source links
Document Upload​
User uploads "system-design.pdf"
The system will:
- Extract text from PDF
- Identify sections and headers
- Create smart chunks
- Generate embeddings
- Make the content searchable immediately