🧠 Archon Knowledge: Complete Feature Reference

Everything you can do with Archon Knowledge - from web crawling to semantic search, document processing to code extraction.

🎯 Core Knowledge Management Features

Content Acquisition

  • Smart Web Crawling: Intelligent crawling with automatic content type detection
  • Document Upload: Support for PDF, Word, Markdown, and text files
  • Sitemap Processing: Crawl entire documentation sites efficiently
  • Recursive Crawling: Follow links to specified depth
  • Batch Processing: Handle multiple URLs simultaneously

Content Processing

  • Smart Chunking: Intelligent text splitting preserving context
  • Contextual Embeddings: Chunk embeddings enriched with surrounding document context for more accurate retrieval
  • Code Extraction: Automatic detection and indexing of code examples (see the sketch after this list)
    • Only extracts substantial code blocks (≥ 1000 characters by default)
    • Supports markdown code blocks (triple backticks)
    • Falls back to HTML extraction for <pre> and <code> tags
    • Generates AI summaries for each code example
  • Metadata Extraction: Headers, sections, and document structure
  • Source Management: Track and manage content sources
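
To make these extraction rules concrete, here is a minimal sketch of fence detection with the size filter. The function name, regex, and threshold constant are illustrative, not Archon's actual extractor:

```javascript
// Minimal sketch of code-block extraction; names and regex are illustrative.
const MIN_CODE_LENGTH = 1000; // "substantial" threshold from the list above
const FENCE = '`'.repeat(3);  // fence built dynamically to keep this sample valid markdown

function extractCodeBlocks(markdown, minLength = MIN_CODE_LENGTH) {
  const pattern = new RegExp(FENCE + '(\\w*)\\n([\\s\\S]*?)' + FENCE, 'g');
  const blocks = [];
  let match;
  while ((match = pattern.exec(markdown)) !== null) {
    const [, language, code] = match;
    if (code.length >= minLength) {
      blocks.push({ language: language || 'unknown', code });
    }
  }
  // An empty result is where a real pipeline could fall back to <pre>/<code> HTML extraction.
  return blocks;
}
```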

Search & Retrieval

  • Semantic Search: Vector similarity search with pgvector
  • Code Search: Specialized search for code examples
  • Source Filtering: Search within specific sources
  • Result Reranking: Optional ML-based result reranking
  • Hybrid Search: Combine keyword and semantic search (see the sketch below)
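
As a rough illustration of the hybrid approach, the snippet below blends a keyword score with a vector-similarity score into one ranking. The 0.3/0.7 weights and field names are assumptions for illustration, not Archon's actual values:

```javascript
// Illustrative hybrid ranking: blend keyword and semantic scores.
// Weights and field names are assumptions, not Archon's implementation.
function hybridRank(results, keywordWeight = 0.3, semanticWeight = 0.7) {
  return results
    .map((r) => ({
      ...r,
      score: keywordWeight * r.keywordScore + semanticWeight * r.similarity,
    }))
    .sort((a, b) => b.score - a.score);
}
```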

🤖 AI Integration Features

MCP Tools Available

Knowledge Tools

  • perform_rag_query - Semantic search
  • search_code_examples - Code-specific search
  • crawl_single_page - Index one page
  • smart_crawl_url - Intelligent crawling
  • get_available_sources - List sources
  • upload_document - Process documents
  • delete_source - Remove content
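
An MCP client calls these tools by name with JSON arguments. A minimal sketch of a perform_rag_query call follows; the parameter names (query, source, match_count) are assumptions for illustration, so check the schemas your Archon MCP server actually exposes:

```javascript
// Hypothetical MCP tool call; argument names are assumptions, not the
// confirmed schema. Verify against the server's advertised tool definitions.
const request = {
  method: 'tools/call',
  params: {
    name: 'perform_rag_query',
    arguments: {
      query: 'How do I configure authentication?',
      source: 'docs.example.com', // assumed optional source filter
      match_count: 5,             // assumed optional result limit
    },
  },
};
```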

AI Capabilities

  • AI can search your entire knowledge base
  • Automatic query refinement for better results
  • Context-aware answer generation
  • Code example recommendations
  • Source citation in responses

🎨 User Interface Features

Knowledge Table View

  • Source Overview: See all indexed sources at a glance
  • Document Count: Track content volume per source
  • Last Updated: Monitor content freshness
  • Quick Actions: Delete or refresh sources
  • Search Interface: Test queries directly in UI

Upload Interface

  • Drag & Drop: Easy file uploads
  • Progress Tracking: Real-time processing status
  • Batch Upload: Multiple files simultaneously
  • Format Detection: Automatic file type handling

Crawl Interface

  • URL Input: Simple URL entry with validation
  • Crawl Options: Configure depth, filters
  • Progress Monitoring: Live crawl status via Socket.IO
  • Stop Functionality: Cancel crawls immediately with proper cleanup
  • Result Preview: See crawled content immediately

🔧 Technical Capabilities

Backend Services

RAG Services

  • Crawling Service: Web page crawling and parsing
  • Document Storage Service: Chunking and storage
  • Search Service: Query processing and retrieval
  • Source Management Service: Source metadata handling

Embedding Services

  • Embedding Service: OpenAI embeddings with rate limiting
  • Contextual Embedding Service: Context-aware embeddings (see the sketch below)
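
The idea behind contextual embeddings is to prepend a short document-level summary to each chunk before embedding it, so chunks remain retrievable out of context. A minimal sketch, assuming a generic embed() callback rather than Archon's actual service API:

```javascript
// Contextual-embedding sketch: embed each chunk together with a short
// description of its source document. embed() is a stand-in for any
// embedding call, not Archon's service API.
async function embedWithContext(chunks, documentSummary, embed) {
  return Promise.all(
    chunks.map(async (chunk) => ({
      text: chunk,
      vector: await embed(`Document context: ${documentSummary}\n\n${chunk}`),
    }))
  );
}
```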

Storage Services

  • Document Storage Service: Parallel document processing
  • Code Storage Service: Code extraction and indexing

Processing Pipeline

  1. Content Acquisition: Crawl or upload
  2. Text Extraction: Parse HTML/PDF/DOCX
  3. Smart Chunking: Split into optimal chunks
  4. Embedding Generation: Create vector embeddings
  5. Storage: Save to Supabase with pgvector
  6. Indexing: Update search indices
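
Conceptually the pipeline is a straight composition of those six stages. The sketch below mirrors the steps with placeholder stage functions passed in, since these names are illustrative rather than Archon's internal API:

```javascript
// Placeholder pipeline mirroring the six steps above. The stage functions
// are injected because their names are illustrative, not Archon's API.
async function ingest(source, stages) {
  const raw = await stages.acquire(source);       // 1. crawl or upload
  const text = stages.extractText(raw);           // 2. parse HTML/PDF/DOCX
  const chunks = stages.smartChunk(text);         // 3. context-preserving splits
  const embedded = await stages.embedAll(chunks); // 4. vector embeddings
  await stages.store(embedded);                   // 5. Supabase + pgvector
  await stages.reindex(source);                   // 6. refresh search indices
}
```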

Performance Features

  • Rate Limiting: 200k tokens/minute cap on OpenAI usage (see the sketch after this list)
  • Parallel Processing: ThreadPoolExecutor optimization
  • Batch Operations: Process multiple documents efficiently
  • Socket.IO Progress: Real-time status updates
  • Caching: Intelligent result caching
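
A per-minute token budget like the 200k limit above is commonly enforced with a token-bucket style check. Archon's rate limiter lives in the backend services, so the JavaScript sketch below only illustrates the idea, not the actual implementation:

```javascript
// Conceptual token-budget check, illustrative only.
const BUDGET = 200_000; // tokens per minute
let windowStart = Date.now();
let spent = 0;

function tryConsume(tokens) {
  if (Date.now() - windowStart >= 60_000) {
    windowStart = Date.now(); // new minute, fresh budget
    spent = 0;
  }
  if (spent + tokens > BUDGET) return false; // caller should wait and retry
  spent += tokens;
  return true;
}
```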

Crawl Cancellation

  • Immediate Response: Stop button provides instant UI feedback
  • Proper Cleanup: Cancels both orchestration service and asyncio tasks
  • State Persistence: Uses localStorage to prevent zombie crawls on refresh (see the sketch after this list)
  • Socket.IO Events: Real-time cancellation status via crawl:stopping and crawl:stopped
  • Resource Management: Ensures all server resources are properly released
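
A guard against zombie crawls might look like the sketch below: record the active progress_id in localStorage when a crawl starts, and on the next page load cancel anything left over. The 'active_crawl' key name is an assumption; crawl_stop matches the event used in the Live Updates example further down:

```javascript
// Sketch of the localStorage guard; the key name is an assumption.
function startCrawl(socket, progressId) {
  localStorage.setItem('active_crawl', progressId);
  socket.emit('subscribe', { progress_id: progressId });
}

// Call once on page load to cancel a crawl orphaned by a refresh.
function cleanupZombieCrawl(socket) {
  const leftover = localStorage.getItem('active_crawl');
  if (leftover) {
    socket.emit('crawl_stop', { progress_id: leftover });
    localStorage.removeItem('active_crawl');
  }
}
```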

🚀 Advanced Features

Content Types

Web Crawling

  • Sitemap Support: Process sitemap.xml files
  • Text File Support: Direct .txt file processing
  • Markdown Support: Native .md file handling
  • Dynamic Content: JavaScript-rendered pages
  • Authentication: Basic auth support

Document Processing

  • PDF Extraction: Full text and metadata
  • Word Documents: .docx and .doc support
  • Markdown Files: Preserve formatting
  • Plain Text: Simple text processing
  • Code Files: Syntax-aware processing

Search Features

  • Query Enhancement: Automatic query expansion
  • Contextual Results: Include surrounding context
  • Code Highlighting: Syntax highlighting in results
  • Similarity Scoring: Relevance scores for ranking
  • Faceted Search: Filter by metadata

Monitoring & Analytics

  • Logfire Integration: Real-time performance monitoring
  • Search Analytics: Track popular queries
  • Content Analytics: Usage patterns
  • Performance Metrics: Query times, throughput

Debugging Tips

Code Extraction Not Working?

  1. Check code block size: Only blocks ≥ 1000 characters are extracted
  2. Verify markdown format: Code must be in triple backticks (```)
  3. Check crawl results: Look for backtick_count in logs (see the snippet after this list)
  4. HTML fallback: System will try HTML if markdown has no code blocks
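
Before blaming the extractor, it can help to count the fence markers in the crawled markdown yourself; an odd count usually means an unterminated code block. A throwaway diagnostic, not an Archon utility:

```javascript
// Count triple-backtick fence markers; an odd total suggests an
// unterminated block. Throwaway check, not part of Archon.
function backtickCount(markdown) {
  const fence = '`'.repeat(3); // built dynamically to keep this sample valid markdown
  return markdown.split(fence).length - 1;
}
```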

Crawling Issues?

  1. Timeouts: The system uses simple crawling without complex waits
  2. JavaScript sites: Content should render without special interactions
  3. Progress stuck: Check logs for batch processing updates
  4. Different content types:
    • .txt files: Direct text extraction
    • sitemap.xml: Batch crawl all URLs
    • Regular URLs: Recursive crawl with depth limit

📊 Usage Examples

Crawling Documentation

AI Assistant: "Crawl the React documentation"

The system will:

  1. Detect it's a documentation site
  2. Find and process the sitemap
  3. Crawl all pages with progress updates
  4. Extract code examples
  5. Create searchable chunks

Searching Knowledge

AI Assistant: "Find information about React hooks"

The search will:

  1. Generate embeddings for the query
  2. Search across all React documentation
  3. Return relevant chunks with context
  4. Include code examples
  5. Provide source links

Document Upload

User uploads "system-design.pdf"

The system will:

  1. Extract text from PDF
  2. Identify sections and headers
  3. Create smart chunks
  4. Generate embeddings
  5. Make searchable immediately
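
From a browser, an upload like this typically goes out as multipart form data. A minimal sketch follows; the '/api/documents/upload' path is a placeholder, not Archon's documented route:

```javascript
// Hypothetical upload call; the endpoint path is a placeholder.
async function uploadDocument(file) {
  const form = new FormData();
  form.append('file', file); // e.g. system-design.pdf
  const response = await fetch('/api/documents/upload', {
    method: 'POST',
    body: form,
  });
  return response.json(); // expected: processing status / document id
}
```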

🔗 Real-time Features

Socket.IO Progress

  • Crawl Progress: Percentage completion with smooth updates
  • Current URL: What's being processed in real-time
  • Batch Updates: Processing batches with accurate progress
  • Stop Support: Cancel crawls with immediate feedback
  • Error Reporting: Failed URLs with detailed errors
  • Completion Status: Final statistics and results

Live Updates

```javascript
// Connect to Socket.IO and subscribe to progress
import { io } from 'socket.io-client';

const socket = io('/crawl');
socket.emit('subscribe', { progress_id: 'uuid' });

// Handle progress updates
socket.on('progress_update', (data) => {
  console.log(`Progress: ${data.percentage}%`);
  console.log(`Status: ${data.status}`);
  console.log(`Current URL: ${data.currentUrl}`);
  console.log(`Pages: ${data.processedPages}/${data.totalPages}`);
});

// Handle completion
socket.on('progress_complete', (data) => {
  console.log('Crawl completed!');
  console.log(`Chunks stored: ${data.chunksStored}`);
  console.log(`Word count: ${data.wordCount}`);
});

// Stop a crawl
socket.emit('crawl_stop', { progress_id: 'uuid' });

// Handle stop events
socket.on('crawl:stopping', (data) => {
  console.log('Crawl is stopping...');
});

socket.on('crawl:stopped', (data) => {
  console.log('Crawl cancelled successfully');
});
```