Code Extraction Rules
This document describes the rules and algorithms that Archon's code extraction service uses to identify, extract, and validate code examples from crawled HTML, Markdown, and plain-text sources.
Overview
The code extraction service extracts code examples from crawled documents and filters out non-code content such as diagrams, prose, and malformed snippets. It combines dynamic length thresholds, language-specific patterns, and quality validation to decide which blocks to keep.
Key Features
1. Dynamic Minimum Length Calculation
Instead of using a fixed minimum length, the service calculates appropriate thresholds based on:
- Language characteristics: Different languages have different verbosity levels
- Context clues: Words like "example", "snippet", "implementation" adjust expectations
- Base thresholds by language:
  - JSON/YAML/XML: 100 characters
  - HTML/CSS/SQL: 150 characters
  - Python/Go: 200 characters
  - JavaScript/TypeScript/Rust/C: 250 characters
  - Java/C++: 300 characters
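As a rough illustration of the threshold calculation, the sketch below looks up the base value for a language and scales it by context words. The base values come from the list above; the function name, the multipliers (0.75 and 1.25), the fallback for unknown languages, and treating "snippet" like "example" are assumptions made for this example, not values taken from the service.

```python
# Base thresholds from the list above (characters).
BASE_MIN_LENGTH = {
    "json": 100, "yaml": 100, "xml": 100,
    "html": 150, "css": 150, "sql": 150,
    "python": 200, "go": 200,
    "javascript": 250, "typescript": 250, "rust": 250, "c": 250,
    "java": 300, "cpp": 300,
}
DEFAULT_MIN_LENGTH = 250               # assumed fallback for unknown languages
SHORTER_HINTS = ("example", "snippet")  # assumed to lower the threshold
LONGER_HINTS = ("implementation",)      # assumed to raise the threshold


def min_code_length(language: str, context: str = "") -> int:
    """Return a minimum length for a block, adjusted by context words."""
    base = BASE_MIN_LENGTH.get(language.lower(), DEFAULT_MIN_LENGTH)
    text = context.lower()
    factor = 1.0
    if any(word in text for word in SHORTER_HINTS):
        factor *= 0.75  # illustrative multiplier
    if any(word in text for word in LONGER_HINTS):
        factor *= 1.25  # illustrative multiplier
    return int(base * factor)
```

With these assumed multipliers, a Python block introduced as "a short example" would need 150 characters instead of 200.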
2. Complete Code Block Detection
The service extends code blocks to natural boundaries rather than cutting off at arbitrary character limits:
- Looks for closing braces, parentheses, or language-specific patterns
- Extends up to 5000 characters to find complete functions/classes
- Uses language-specific block end patterns (e.g., unindented line for Python)
- Recognizes common code boundaries like double newlines or next function declarations
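One way to picture the extension step is a bracket-balancing scan: starting from a truncated span, keep reading until every opened brace or parenthesis has been closed, and give up at the length limit. The function below is a minimal sketch under that assumption; its name and signature are hypothetical, and it does not account for brackets inside strings or comments.

```python
MAX_CODE_BLOCK_LENGTH = 5000  # extension limit described above


def extend_to_block_end(text: str, start: int, end: int,
                        max_length: int = MAX_CODE_BLOCK_LENGTH) -> int:
    """Return an end index at which all brackets opened in text[start:...]
    are closed, or the original end if none is found within max_length."""
    pairs = {"}": "{", ")": "(", "]": "["}
    openers = set(pairs.values())
    stack = []
    limit = min(len(text), start + max_length)
    for i in range(start, limit):
        ch = text[i]
        if ch in openers:
            stack.append(ch)
        elif ch in pairs and stack and stack[-1] == pairs[ch]:
            stack.pop()
        # Only stop at or after the original end, once nothing is left open.
        if i + 1 >= end and not stack:
            return i + 1
    return end
```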
3. HTML Span Handling
The service recovers plain code from syntax-highlighted HTML on documentation sites:
- Detects when spans are used for syntax highlighting (no spaces between </span><span>)
- Preserves code structure while removing HTML markup
- Handles various highlighting libraries: Prism.js, highlight.js, Shiki, CodeMirror, Monaco
- Special extraction for complex editors like CodeMirror that use nested divs
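The sketch below shows one possible shape for this step using only the Python standard library: a regular expression flags adjacent spans, and an `HTMLParser` subclass collects the text while dropping the markup. The names are hypothetical, and treating each closing `<div>` as a line break is an assumption about editors that render one `<div>` per line.

```python
import re
from html.parser import HTMLParser

# Adjacent spans with nothing between them suggest token-level highlighting.
ADJACENT_SPANS = re.compile(r"</span><span\b", re.IGNORECASE)


def uses_span_highlighting(fragment: str) -> bool:
    return bool(ADJACENT_SPANS.search(fragment))


class _TextCollector(HTMLParser):
    """Keeps text nodes, decodes entities, and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "div":  # assumption: one <div> per code line
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def highlighted_html_to_code(fragment: str) -> str:
    parser = _TextCollector()
    parser.feed(fragment)
    parser.close()  # flushes any trailing text
    return "".join(parser.parts).rstrip("\n")
```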
4. Enhanced Quality Validation
Multi-layer validation ensures only actual code is extracted:
Exclusion Filters
- Diagram languages (Mermaid, PlantUML, GraphViz)
- Prose detection (>15% prose indicators like "the", "this", "however")
- Excessive comments (>70% comment lines)
- Malformed code (concatenated keywords, unresolved HTML entities)
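The exclusion filters can be sketched as a function that returns the first reason for rejecting a block. The thresholds (15% and 70%) and the diagram languages come from the list above; the prose word list, the comment prefixes, the way the prose ratio is computed (indicator words divided by all words), and the function name are assumptions for illustration. Detection of concatenated keywords is not shown.

```python
import re
from typing import Optional

DIAGRAM_LANGUAGES = {"mermaid", "plantuml", "graphviz"}
# Illustrative subset; the service's full word list is not given here.
PROSE_WORDS = {"the", "this", "however", "that", "which", "because"}
MAX_PROSE_RATIO = 0.15
MAX_COMMENT_RATIO = 0.70
COMMENT_PREFIXES = ("#", "//", "/*")  # assumed; "#include" would be miscounted


def rejection_reason(code: str, language: str = "") -> Optional[str]:
    """Return why a block should be excluded, or None if it passes."""
    if language.lower() in DIAGRAM_LANGUAGES:
        return "diagram language"
    words = re.findall(r"[A-Za-z']+", code.lower())
    if words:
        prose = sum(1 for w in words if w in PROSE_WORDS)
        if prose / len(words) > MAX_PROSE_RATIO:
            return "prose"
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if lines:
        comments = sum(1 for ln in lines if ln.startswith(COMMENT_PREFIXES))
        if comments / len(lines) > MAX_COMMENT_RATIO:
            return "excessive comments"
    if re.search(r"&(?:lt|gt|amp|quot|#\d+);", code):
        return "unresolved HTML entities"
    return None
```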
Inclusion Requirements
- Minimum 3 code indicators from:
  - Function calls: function()
  - Assignments: var = value
  - Control flow: if, for, while
  - Declarations: class, function, const
  - Imports: import, require
  - Operators and brackets
- Language-specific indicators (at least 2 required)
- Reasonable structure (3+ non-empty lines, reasonable line lengths)
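A minimal sketch of the inclusion check: one regular expression per indicator category, a count of how many categories appear, and a simple structure test. The exact patterns and the 300-character line limit are assumptions for this example; only the minimum of 3 indicators and the 3-line rule come from the list above. Language-specific indicators are not covered here.

```python
import re

# One pattern per indicator category listed above (patterns are assumed).
INDICATOR_PATTERNS = {
    "function_call": r"\b\w+\s*\(",
    "assignment": r"\b\w+\s*=(?!=)\s*\S",
    "control_flow": r"\b(?:if|for|while)\b",
    "declaration": r"\b(?:class|function|const)\b",
    "import": r"\b(?:import|require)\b",
    "brackets": r"[{}\[\]]",
}
MIN_CODE_INDICATORS = 3
MAX_LINE_LENGTH = 300  # assumed reading of "reasonable line lengths"


def count_indicators(code: str) -> int:
    """Count how many indicator categories occur at least once."""
    return sum(1 for p in INDICATOR_PATTERNS.values() if re.search(p, code))


def has_reasonable_structure(code: str) -> bool:
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return len(lines) >= 3 and all(len(ln) <= MAX_LINE_LENGTH for ln in lines)


def looks_like_code(code: str) -> bool:
    return (count_indicators(code) >= MIN_CODE_INDICATORS
            and has_reasonable_structure(code))
```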
5. Language-Specific Patterns
Tailored extraction patterns for major languages:
// TypeScript/JavaScript
{
  block_start: /^\s*(export\s+)?(class|interface|function|const|type|enum)\s+\w+/,
  block_end: /^\}(\s*;)?$/,
  min_indicators: [':', '{', '}', '=>', 'function', 'class']
}
// Python
{
  block_start: /^\s*(class|def|async\s+def)\s+\w+/,
  block_end: /^\S/, // Unindented line
  min_indicators: ['def', ':', 'return', 'self', 'import', 'class']
}
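To show how such patterns might be applied, the sketch below restates the two pattern sets above as Python regular expressions and uses them to cut the first matching block out of a text. The helper names and the rule for whether the end line belongs to the block are assumptions; the required count of 2 matches the language-specific indicator rule under Inclusion Requirements.

```python
import re
from typing import Optional

# Python equivalents of the two pattern sets above.
LANGUAGE_PATTERNS = {
    "typescript": {
        "block_start": re.compile(
            r"^\s*(export\s+)?(class|interface|function|const|type|enum)\s+\w+"),
        "block_end": re.compile(r"^\}(\s*;)?$"),
        "min_indicators": [":", "{", "}", "=>", "function", "class"],
    },
    "python": {
        "block_start": re.compile(r"^\s*(class|def|async\s+def)\s+\w+"),
        "block_end": re.compile(r"^\S"),  # unindented line
        "min_indicators": ["def", ":", "return", "self", "import", "class"],
    },
}


def extract_first_block(source: str, language: str) -> Optional[str]:
    """Return the first block matched by the language's start/end patterns."""
    patterns = LANGUAGE_PATTERNS[language]
    lines = source.splitlines()
    for start, line in enumerate(lines):
        if patterns["block_start"].match(line):
            break
    else:
        return None
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if patterns["block_end"].match(lines[i]):
            # A closing brace belongs to the block; an unindented Python
            # line starts the next statement and is left out.
            end = i + 1 if lines[i].startswith("}") else i
            break
    return "\n".join(lines[start:end]).rstrip()


def has_language_indicators(code: str, language: str, required: int = 2) -> bool:
    tokens = LANGUAGE_PATTERNS[language]["min_indicators"]
    return sum(1 for tok in tokens if tok in code) >= required
```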
6. Context-Aware Extraction
The service uses the text surrounding a code block to guide extraction:
- Adjusts minimum length based on context words ("example" → shorter, "implementation" → longer)
- Uses context to detect language when not explicitly specified
- Preserves 1000 characters of context before/after for better summarization
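The context handling can be pictured as two small helpers: one slices up to 1000 characters on either side of a block, and one guesses a language from words in that context. The hint table and both function names are hypothetical; only the window size comes from the list above.

```python
from typing import Optional, Tuple

CONTEXT_WINDOW_SIZE = 1000  # characters kept before and after a block


def context_window(text: str, start: int, end: int,
                   size: int = CONTEXT_WINDOW_SIZE) -> Tuple[str, str]:
    """Return (before, after) context for the block text[start:end]."""
    before = text[max(0, start - size):start]
    after = text[end:end + size]
    return before, after


# Hypothetical hint table. Order matters: "javascript" contains "java",
# so the more specific names are checked first.
LANGUAGE_HINTS = {
    "typescript": ("typescript",),
    "javascript": ("javascript", "node.js"),
    "python": ("python",),
    "java": ("java",),
}


def guess_language(context: str) -> Optional[str]:
    lowered = context.lower()
    for language, hints in LANGUAGE_HINTS.items():
        if any(hint in lowered for hint in hints):
            return language
    return None
```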
Extraction Sources
HTML Code Blocks
Supports extraction from 30+ different HTML patterns including:
- GitHub/GitLab highlight blocks
- Docusaurus code blocks
- VitePress/Astro documentation
- Raw <pre><code> blocks
- Standalone <code> tags (if multiline)
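For the two simplest patterns in this list, raw `<pre><code>` blocks and multiline standalone `<code>` tags, a standard-library sketch could look like the following. The class and function names are hypothetical, and site-specific patterns such as GitHub or Docusaurus wrappers are not handled.

```python
from html.parser import HTMLParser


class CodeBlockParser(HTMLParser):
    """Collects the text of <pre> elements and multiline <code> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._stack = []   # currently open <pre>/<code> tags
        self._buffer = []  # text collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self._stack and self._stack[-1] == tag:
            outermost = self._stack[0]
            self._stack.pop()
            if not self._stack:  # the outermost element just closed
                text = "".join(self._buffer)
                self._buffer = []
                # Inline <code> is kept only when it spans several lines.
                if outermost == "pre" or "\n" in text.strip():
                    self.blocks.append(text)

    def handle_data(self, data):
        if self._stack:
            self._buffer.append(data)


def extract_html_code_blocks(page: str) -> list:
    parser = CodeBlockParser()
    parser.feed(page)
    parser.close()
    return parser.blocks
```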
Plain Text Files
Special handling for .txt and .md files:
- Triple backtick blocks with language specifiers
- Language-labeled sections (e.g., "TypeScript:", "Python example:")
- Consistently indented blocks (4+ spaces)
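Two of these rules are easy to sketch with the standard library: a regular expression for triple-backtick blocks with an optional language specifier, and a line scan for runs indented by four or more spaces. The function names and the three-line minimum for indented runs are assumptions; language-labeled sections are not shown.

```python
import re

FENCE = "`" * 3  # triple backtick, built here to keep this example readable
FENCE_RE = re.compile(
    r"^" + FENCE + r"([\w+-]*)[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def fenced_blocks(text: str) -> list:
    """Return (language or None, code) for each triple-backtick block."""
    return [(m.group(1) or None, m.group(2)) for m in FENCE_RE.finditer(text)]


def indented_blocks(text: str, min_lines: int = 3) -> list:
    """Return runs of lines indented by 4+ spaces, with the indent removed."""
    blocks, current = [], []

    def flush():
        while current and not current[-1]:
            current.pop()  # drop trailing blank lines
        if len(current) >= min_lines:
            blocks.append("\n".join(current))
        current.clear()

    for line in text.splitlines():
        if line.startswith("    "):
            current.append(line[4:])
        elif not line.strip() and current:
            current.append("")  # blank line inside an indented run
        else:
            flush()
    flush()
    return blocks
```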
Markdown Content
Falls back to markdown extraction when HTML extraction fails:
- Standard markdown code blocks
- Handles corrupted markdown (e.g., entire file wrapped in backticks)
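How the service recognizes a file that is wrapped in backticks is not described here. One possible heuristic, shown purely as an assumption, is to treat the outer fence as spurious when the file starts and ends with a fence and the first fence found inside carries a language tag, since that means a new block opens before the first one was closed. The function name is hypothetical.

```python
FENCE = "`" * 3  # triple backtick


def strip_outer_fence(markdown: str) -> str:
    """Remove a spurious fence pair wrapped around a whole document.

    Heuristic (an assumption for this sketch): the wrapper is considered
    spurious only if the first fence line inside it has a language tag.
    """
    lines = markdown.strip().splitlines()
    if (len(lines) < 3
            or not lines[0].startswith(FENCE)
            or lines[-1].strip() != FENCE):
        return markdown
    inner = lines[1:-1]
    fence_lines = [ln.strip() for ln in inner if ln.lstrip().startswith(FENCE)]
    if fence_lines and len(fence_lines[0]) > len(FENCE):
        return "\n".join(inner)
    return markdown
```

A wrapped file whose inner blocks carry no language tags would pass through unchanged under this heuristic, so it errs toward leaving content alone.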
Code Cleaning Pipeline
- HTML Entity Decoding: Converts &lt;, &gt;, etc. to actual characters
- Tag Removal: Strips HTML tags while preserving code structure
- Spacing Fixes: Repairs concatenated keywords from span removal
- Backtick Removal: Removes wrapping backticks if present
- Indentation Preservation: Maintains original code formatting
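A compact sketch of the pipeline follows, with two deliberate differences that are assumptions of this example rather than the service's behavior. It removes tags before decoding entities, so that a decoded `<` or `>` (for example in `List<String>`) is never mistaken for markup, and it strips only a fixed set of highlighting tags. The spacing-fix step is left out because its rules are not described here.

```python
import html
import re

# Only tags commonly produced by highlighters are removed (assumed set).
HIGHLIGHT_TAG_RE = re.compile(r"</?(?:span|div|pre|code)\b[^>]*>", re.IGNORECASE)
LINE_BREAK_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
FENCE = "`" * 3  # triple backtick


def strip_wrapping_backticks(text: str) -> str:
    """Remove a surrounding fence pair, including a language tag if present."""
    stripped = text.strip()
    if (len(stripped) >= 2 * len(FENCE)
            and stripped.startswith(FENCE) and stripped.endswith(FENCE)):
        body = stripped[len(FENCE):-len(FENCE)]
        first, newline, rest = body.partition("\n")
        if newline and re.fullmatch(r"[\w+-]*", first.strip()):
            return rest  # drop the opening fence line and its language tag
        return body
    return text


def clean_code(raw: str) -> str:
    text = LINE_BREAK_RE.sub("\n", raw)    # keep explicit line breaks
    text = HIGHLIGHT_TAG_RE.sub("", text)  # tag removal
    text = html.unescape(text)             # entity decoding
    text = strip_wrapping_backticks(text)  # backtick removal
    return text.strip("\n")                # indentation is left untouched
```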
Quality Metrics
The service logs detailed metrics for monitoring:
- Number of code blocks found per document
- Validation pass/fail reasons
- Language detection results
- Extraction source types (HTML vs markdown vs text)
- Character counts before/after cleaning
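As an illustration of how these metrics could be gathered for one document, the sketch below accumulates counts in a small dataclass and writes a single log line at the end. The class, its fields, the logger name, and the log format are all hypothetical; the actual log output of the service is not shown in this document.

```python
import logging
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("code_extraction")  # hypothetical logger name


@dataclass
class ExtractionMetrics:
    blocks_found: int = 0
    accepted: int = 0
    rejections: Counter = field(default_factory=Counter)  # reason -> count
    languages: Counter = field(default_factory=Counter)   # language -> count
    sources: Counter = field(default_factory=Counter)     # html/markdown/text
    chars_before: int = 0
    chars_after: int = 0

    def record(self, source: str, language: Optional[str], raw: str,
               cleaned: str, rejection: Optional[str] = None) -> None:
        """Record one candidate block and its validation outcome."""
        self.blocks_found += 1
        self.sources[source] += 1
        self.languages[language or "unknown"] += 1
        self.chars_before += len(raw)
        self.chars_after += len(cleaned)
        if rejection is None:
            self.accepted += 1
        else:
            self.rejections[rejection] += 1

    def log(self, document_url: str) -> None:
        logger.info(
            "code extraction url=%s found=%d accepted=%d rejected=%s "
            "languages=%s sources=%s chars=%d->%d",
            document_url, self.blocks_found, self.accepted,
            dict(self.rejections), dict(self.languages), dict(self.sources),
            self.chars_before, self.chars_after,
        )
```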
Best Practices for Content Creators
To ensure your code examples are properly extracted:
- Use standard markdown code blocks with language specifiers
- Include complete, runnable code examples rather than fragments
- Avoid mixing code with extensive inline comments (blocks with more than 70% comment lines are rejected)
- Ensure proper HTML structure if using custom syntax highlighting
- Keep examples focused: neither too short (under 100 characters) nor too long (over 5000 characters)
Configuration
The extraction behavior can be tuned through the Settings page in the UI:
Available Settings
Length Settings
- MIN_CODE_BLOCK_LENGTH: Base minimum length for code blocks (default: 250 chars)
- MAX_CODE_BLOCK_LENGTH: Maximum length before stopping extension (default: 5000 chars)
- CONTEXT_WINDOW_SIZE: Characters of context before/after code blocks (default: 1000)
Detection Features
- ENABLE_COMPLETE_BLOCK_DETECTION: Extend code blocks to natural boundaries (default: true)
- ENABLE_LANGUAGE_SPECIFIC_PATTERNS: Use language-specific patterns (default: true)
- ENABLE_CONTEXTUAL_LENGTH: Adjust minimum length based on context (default: true)
Content Filtering
- ENABLE_PROSE_FILTERING: Filter out documentation text (default: true)
- MAX_PROSE_RATIO: Maximum allowed prose percentage (default: 0.15)
- MIN_CODE_INDICATORS: Minimum required code patterns (default: 3)
- ENABLE_DIAGRAM_FILTERING: Filter out diagram languages (default: true)
Processing Settings
- CODE_EXTRACTION_MAX_WORKERS: Parallel workers for code summaries (default: 3)
- ENABLE_CODE_SUMMARIES: Generate AI summaries for code examples (default: true)
How Settings Work
- Dynamic Loading: Settings are loaded from the database on demand, not cached as environment variables
- Real-time Updates: Changes take effect on the next extraction run without server restart
- Graceful Fallbacks: If settings can't be loaded, sensible defaults are used
- Type Safety: Settings are validated and converted to appropriate types
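The loading behavior described above can be sketched as follows. The setting names and defaults are the ones listed under Available Settings; the `fetch` callable stands in for whatever database query the service performs and, like the function names, is an assumption of this example.

```python
from typing import Any, Callable, Dict

# Defaults as listed under Available Settings.
DEFAULTS: Dict[str, Any] = {
    "MIN_CODE_BLOCK_LENGTH": 250,
    "MAX_CODE_BLOCK_LENGTH": 5000,
    "CONTEXT_WINDOW_SIZE": 1000,
    "ENABLE_COMPLETE_BLOCK_DETECTION": True,
    "ENABLE_LANGUAGE_SPECIFIC_PATTERNS": True,
    "ENABLE_CONTEXTUAL_LENGTH": True,
    "ENABLE_PROSE_FILTERING": True,
    "MAX_PROSE_RATIO": 0.15,
    "MIN_CODE_INDICATORS": 3,
    "ENABLE_DIAGRAM_FILTERING": True,
    "CODE_EXTRACTION_MAX_WORKERS": 3,
    "ENABLE_CODE_SUMMARIES": True,
}


def coerce(value: Any, default: Any) -> Any:
    """Convert a raw value to the type of the setting's default."""
    kind = type(default)
    if kind is bool:  # checked first because bool is a subclass of int
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in ("true", "1", "yes", "on")
    return kind(value)


def load_settings(fetch: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Load settings via fetch(), falling back to defaults on any failure."""
    try:
        raw = fetch()  # hypothetical: e.g. a database query
    except Exception:
        raw = {}
    settings = {}
    for key, default in DEFAULTS.items():
        try:
            settings[key] = coerce(raw[key], default) if key in raw else default
        except (TypeError, ValueError):
            settings[key] = default  # invalid value: keep the default
    return settings
```

Calling `load_settings` at the start of each extraction run, rather than once at startup, is what would let changes take effect without a restart.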
Best Practices
- Start with defaults: The default values work well for most content
- Adjust gradually: Make small changes and test the results
- Monitor logs: Check extraction logs to see how settings affect results
- Language-specific tuning: Different content types may need different settings
Future Enhancements
Potential improvements under consideration:
- Machine learning-based code detection
- Support for notebook formats (Jupyter, Observable)
- API response extraction from documentation
- Multi-file code example correlation
- Language-specific AST validation