Code Extraction Rules

This document describes the rules and algorithms that Archon's code extraction service uses to identify, extract, and validate code examples from crawled sources.

Overview

The code extraction service extracts meaningful code examples from crawled documents and filters out non-code content such as diagrams, prose, and malformed snippets. It combines dynamic length thresholds, language-specific patterns, and quality validation so that only usable code examples are kept.

Key Features

1. Dynamic Minimum Length Calculation

Instead of using a fixed minimum length, the service calculates appropriate thresholds based on:

  • Language characteristics: Different languages have different verbosity levels
  • Context clues: Words like "example", "snippet", "implementation" adjust expectations
  • Base thresholds by language:
    • JSON/YAML/XML: 100 characters
    • HTML/CSS/SQL: 150 characters
    • Python/Go: 200 characters
    • JavaScript/TypeScript/Rust/C: 250 characters
    • Java/C++: 300 characters
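The threshold calculation above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name `min_code_length`, the fallback for unknown languages, and the context multipliers are all assumptions made for the example. Only the base thresholds come from the list above.

```python
from typing import Optional

# Base thresholds in characters, taken from the list above.
BASE_MIN_LENGTH = {
    "json": 100, "yaml": 100, "xml": 100,
    "html": 150, "css": 150, "sql": 150,
    "python": 200, "go": 200,
    "javascript": 250, "typescript": 250, "rust": 250, "c": 250,
    "java": 300, "c++": 300, "cpp": 300,
}

# Assumption: unknown languages fall back to the MIN_CODE_BLOCK_LENGTH default.
FALLBACK_MIN_LENGTH = 250

# Assumption: illustrative multipliers; the real adjustment factors are not documented here.
CONTEXT_FACTORS = {"example": 0.75, "implementation": 1.25}


def min_code_length(language: Optional[str], context: str = "") -> int:
    """Return the minimum accepted block length for a language and its surrounding text."""
    base = BASE_MIN_LENGTH.get((language or "").lower(), FALLBACK_MIN_LENGTH)
    lowered = context.lower()
    factor = 1.0
    for word, multiplier in CONTEXT_FACTORS.items():
        if word in lowered:
            factor *= multiplier
    return int(base * factor)
```

With these assumed factors, a JSON block introduced as "a short example" would need 75 characters, while a Java block described as a "full implementation" would need 375.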

2. Complete Code Block Detection

The service extends code blocks to natural boundaries rather than cutting off at arbitrary character limits:

  • Looks for closing braces, parentheses, or language-specific patterns
  • Extends up to 5000 characters to find complete functions/classes
  • Uses language-specific block end patterns (e.g., unindented line for Python)
  • Recognizes common code boundaries like double newlines or next function declarations
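One way to picture boundary extension is the sketch below. It assumes a tentative cut offset inside the text and looks forward for either a line holding only a closing brace or a blank line, never going past the 5000-character limit. The helper name `extend_to_boundary` and the exact boundary regex are hypothetical; the real service may recognize more patterns.

```python
import re

MAX_CODE_BLOCK_LENGTH = 5000

# A line holding only a closing brace (optionally followed by ';'), or a blank line.
BOUNDARY = re.compile(r"^\}[ \t]*;?[ \t]*$|\n[ \t]*\n", re.MULTILINE)


def extend_to_boundary(text, start, cut, max_length=MAX_CODE_BLOCK_LENGTH):
    """Move `cut` forward to the next natural boundary, staying within max_length of `start`."""
    limit = min(len(text), start + max_length)
    match = BOUNDARY.search(text, cut, limit)
    if match is None:
        return cut  # no boundary within the limit: keep the original cut
    if match.group().startswith("}"):
        return match.end()  # the closing brace belongs to the block
    return match.start()  # stop before the blank line
```

Given a block that starts at offset 0 and a cut that lands mid-function, the returned offset would include the closing brace but exclude any prose after the following blank line.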

3. HTML Span Handling

Syntax-highlighted code from documentation sites is handled as follows:

  • Detects when spans are used for syntax highlighting (no spaces between </span><span>)
  • Preserves code structure while removing HTML markup
  • Handles various highlighting libraries: Prism.js, highlight.js, Shiki, CodeMirror, Monaco
  • Special extraction for complex editors like CodeMirror that use nested divs
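As a rough illustration of the span handling described above, the following sketch detects back-to-back spans and recovers the plain code text with Python's standard `html.parser`. The names `looks_syntax_highlighted`, `CodeTextExtractor`, and `strip_highlighting` are made up for this example, and editor-specific handling (for instance CodeMirror's nested divs) is not covered.

```python
import re
from html.parser import HTMLParser

# Adjacent spans with nothing between them suggest syntax highlighting.
ADJACENT_SPANS = re.compile(r"</span><span\b")


def looks_syntax_highlighted(fragment: str) -> bool:
    return bool(ADJACENT_SPANS.search(fragment))


class CodeTextExtractor(HTMLParser):
    """Collects text nodes only; tags are dropped, entities are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_highlighting(fragment: str) -> str:
    parser = CodeTextExtractor()
    parser.feed(fragment)
    parser.close()
    # Joining without separators keeps whitespace-only spans intact.
    return "".join(parser.parts)
```

Using a parser rather than a tag-stripping regex means whitespace carried inside spans survives and escaped characters such as `&lt;` come back as real operators.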

4. Enhanced Quality Validation

Multi-layer validation ensures only actual code is extracted:

Exclusion Filters

  • Diagram languages (Mermaid, PlantUML, GraphViz)
  • Prose detection (>15% prose indicators like "the", "this", "however")
  • Excessive comments (>70% comment lines)
  • Malformed code (concatenated keywords, unresolved HTML entities)
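The exclusion filters might be approximated like this. The thresholds and the three sample prose words come from the list above; everything else is an assumption for illustration, including how the ratios are computed (per word and per non-empty line), the comment prefixes, and the function names. The comment check is deliberately naive: it would, for example, count C preprocessor lines as comments.

```python
import re

DIAGRAM_LANGUAGES = {"mermaid", "plantuml", "graphviz"}
PROSE_WORDS = {"the", "this", "however"}  # sample indicators; the real list is likely longer
MAX_PROSE_RATIO = 0.15
MAX_COMMENT_RATIO = 0.70
COMMENT_PREFIXES = ("#", "//", "/*")  # assumption: naive prefix check


def prose_ratio(code: str) -> float:
    """Share of words that are prose indicators."""
    words = re.findall(r"[A-Za-z']+", code.lower())
    if not words:
        return 0.0
    return sum(word in PROSE_WORDS for word in words) / len(words)


def comment_ratio(code: str) -> float:
    """Share of non-empty lines that start with a comment prefix."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(COMMENT_PREFIXES) for line in lines) / len(lines)


def exclusion_reason(code: str, language: str = ""):
    """Return why a block is rejected, or None if no exclusion filter applies."""
    if language.lower() in DIAGRAM_LANGUAGES:
        return "diagram language"
    if prose_ratio(code) > MAX_PROSE_RATIO:
        return "too much prose"
    if comment_ratio(code) > MAX_COMMENT_RATIO:
        return "too many comments"
    return None
```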

Inclusion Requirements

  • Minimum 3 code indicators from:
    • Function calls: function()
    • Assignments: var = value
    • Control flow: if, for, while
    • Declarations: class, function, const
    • Imports: import, require
    • Operators and brackets
  • Language-specific indicators (at least 2 required)
  • Reasonable structure (3+ non-empty lines, reasonable line lengths)
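A simplified version of the inclusion check could look like the sketch below. It assumes indicators are counted per category rather than per occurrence, and it picks 200 characters as a "reasonable" line length; both choices, along with the regexes and function names, are illustrative guesses rather than documented behavior. The language-specific indicator check is left out.

```python
import re

MIN_CODE_INDICATORS = 3

# One regex per indicator category listed above (simplified).
CODE_INDICATORS = {
    "function_call": re.compile(r"\w+\s*\("),
    "assignment": re.compile(r"\w+\s*=\s*[^=\s]"),
    "control_flow": re.compile(r"\b(if|for|while)\b"),
    "declaration": re.compile(r"\b(class|function|const)\b"),
    "import": re.compile(r"\b(import|require)\b"),
    "brackets": re.compile(r"[{}\[\]]"),
}


def matched_indicators(code: str) -> list:
    """Names of the indicator categories that appear at least once."""
    return [name for name, pattern in CODE_INDICATORS.items() if pattern.search(code)]


def has_reasonable_structure(code: str, max_line_length: int = 200) -> bool:
    """At least 3 non-empty lines, none longer than max_line_length (assumed limit)."""
    lines = [line for line in code.splitlines() if line.strip()]
    return len(lines) >= 3 and all(len(line) <= max_line_length for line in lines)


def passes_inclusion(code: str) -> bool:
    return (
        len(matched_indicators(code)) >= MIN_CODE_INDICATORS
        and has_reasonable_structure(code)
    )
```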

5. Language-Specific Patterns

Tailored extraction patterns for major languages:

// TypeScript/JavaScript
{
  block_start: /^\s*(export\s+)?(class|interface|function|const|type|enum)\s+\w+/,
  block_end: /^\}(\s*;)?$/,
  min_indicators: [':', '{', '}', '=>', 'function', 'class']
}

// Python
{
  block_start: /^\s*(class|def|async\s+def)\s+\w+/,
  block_end: /^\S/, // Unindented line
  min_indicators: ['def', ':', 'return', 'self', 'import', 'class']
}
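To show how such patterns could drive extraction, here is a small sketch that compiles the same `block_start` and `block_end` expressions with Python's `re` module and uses them to pull out the first block from a source string. The `extract_first_block` helper and its line-by-line scan are assumptions for illustration; `min_indicators` is omitted for brevity.

```python
import re
from typing import Optional

LANGUAGE_PATTERNS = {
    "typescript": {
        "block_start": re.compile(
            r"^\s*(export\s+)?(class|interface|function|const|type|enum)\s+\w+"
        ),
        "block_end": re.compile(r"^\}(\s*;)?$"),
    },
    "python": {
        "block_start": re.compile(r"^\s*(class|def|async\s+def)\s+\w+"),
        "block_end": re.compile(r"^\S"),  # unindented line
    },
}


def extract_first_block(source: str, language: str) -> Optional[str]:
    """Return the first block that starts at block_start and runs to block_end."""
    patterns = LANGUAGE_PATTERNS[language]
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if patterns["block_start"].match(line):
            break
    else:
        return None  # no block start found
    for j in range(i + 1, len(lines)):
        if patterns["block_end"].match(lines[j]):
            # A closing brace is part of the block; an unindented line is not.
            end = j + 1 if lines[j].startswith("}") else j
            return "\n".join(lines[i:end]).rstrip()
    return "\n".join(lines[i:]).rstrip()
```

Note the asymmetry between the two end patterns: the TypeScript end line closes the block and is kept, whereas the Python end line already belongs to whatever follows.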

6. Context-Aware Extraction

The service considers surrounding context to make intelligent decisions:

  • Adjusts minimum length based on context words ("example" → shorter, "implementation" → longer)
  • Uses context to detect language when not explicitly specified
  • Preserves 1000 characters of context before/after for better summarization
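Context preservation amounts to slicing a fixed window on each side of the block. A minimal sketch, assuming the block is identified by start and end character offsets (the function name is hypothetical):

```python
CONTEXT_WINDOW_SIZE = 1000


def surrounding_context(document: str, start: int, end: int,
                        window: int = CONTEXT_WINDOW_SIZE):
    """Return up to `window` characters before and after document[start:end]."""
    before = document[max(0, start - window):start]
    after = document[end:end + window]
    return before, after
```

Clamping the left edge with `max(0, ...)` matters because a negative slice start would wrap around to the end of the string.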

Extraction Sources

HTML Code Blocks

Supports extraction from 30+ different HTML patterns including:

  • GitHub/GitLab highlight blocks
  • Docusaurus code blocks
  • VitePress/Astro documentation
  • Raw <pre><code> blocks
  • Standalone <code> tags (if multiline)

Plain Text Files

Special handling for .txt and .md files:

  • Triple backtick blocks with language specifiers
  • Language-labeled sections (e.g., "TypeScript:", "Python example:")
  • Consistently indented blocks (4+ spaces)
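For the first case, triple backtick blocks, extraction can be sketched with a single multiline regex. This is an illustrative approach rather than the service's implementation; labeled sections and indented blocks are not handled here, and `extract_fenced_blocks` is a made-up name.

```python
import re

FENCE = "`" * 3  # built programmatically to keep this example readable

FENCED_BLOCK = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_fenced_blocks(text: str):
    """Return (language or None, code) pairs for each fenced block in the text."""
    return [
        (match.group(1) or None, match.group(2).rstrip("\n"))
        for match in FENCED_BLOCK.finditer(text)
    ]
```

Blocks without a language specifier come back with `None` as the language, which leaves room for context-based detection later.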

Markdown Content

Falls back to markdown extraction when HTML extraction fails:

  • Standard markdown code blocks
  • Handles corrupted markdown (e.g., entire file wrapped in backticks)

Code Cleaning Pipeline

  1. HTML Entity Decoding: Converts &lt;, &gt;, etc. to actual characters
  2. Tag Removal: Strips HTML tags while preserving code structure
  3. Spacing Fixes: Repairs concatenated keywords from span removal
  4. Backtick Removal: Removes wrapping backticks if present
  5. Indentation Preservation: Maintains original code formatting
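The pipeline order above could be wired together as in the sketch below. Several details are assumptions: the tag regex only strips a small allowlist of HTML tags so that decoded operators such as `a < b` are left alone, the spacing-fix step is omitted because its rules are not described here, and the function names are invented for the example.

```python
import html
import re

FENCE = "`" * 3

# Assumption: only strip tags that commonly wrap code, so '<' in code survives.
HTML_TAG = re.compile(r"</?(?:span|div|pre|code|br)\b[^>]*>")


def remove_wrapping_backticks(code: str) -> str:
    """Drop an enclosing fence (and its language line) if the whole text is wrapped."""
    stripped = code.strip()
    wrapped = (
        len(stripped) >= 2 * len(FENCE)
        and stripped.startswith(FENCE)
        and stripped.endswith(FENCE)
    )
    if not wrapped:
        return code
    inner = stripped[len(FENCE):-len(FENCE)]
    first_newline = inner.find("\n")
    if first_newline != -1:
        inner = inner[first_newline + 1:]  # skip the language specifier line
    return inner.rstrip("\n")


def clean_code(raw: str) -> str:
    text = html.unescape(raw)               # 1. entity decoding
    text = HTML_TAG.sub("", text)           # 2. tag removal
    # 3. spacing fixes are omitted in this sketch
    text = remove_wrapping_backticks(text)  # 4. backtick removal
    return text.strip("\n")                 # 5. leading indentation is left untouched
```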

Quality Metrics

The service logs detailed metrics for monitoring:

  • Number of code blocks found per document
  • Validation pass/fail reasons
  • Language detection results
  • Extraction source types (HTML vs markdown vs text)
  • Character counts before/after cleaning

Best Practices for Content Creators

To ensure your code examples are properly extracted:

  1. Use standard markdown code blocks with language specifiers
  2. Include complete, runnable code examples rather than fragments
  3. Avoid burying code in extensive comments; blocks with more than 70% comment lines are filtered out
  4. Ensure proper HTML structure if using custom syntax highlighting
  5. Keep examples focused: neither too short (under 100 characters) nor too long (over 5000 characters)

Configuration

The extraction behavior can be tuned through the Settings page in the UI:

Available Settings

Length Settings

  • MIN_CODE_BLOCK_LENGTH: Base minimum length for code blocks (default: 250 chars)
  • MAX_CODE_BLOCK_LENGTH: Maximum length before stopping extension (default: 5000 chars)
  • CONTEXT_WINDOW_SIZE: Characters of context before/after code blocks (default: 1000)

Detection Features

  • ENABLE_COMPLETE_BLOCK_DETECTION: Extend code blocks to natural boundaries (default: true)
  • ENABLE_LANGUAGE_SPECIFIC_PATTERNS: Use language-specific patterns (default: true)
  • ENABLE_CONTEXTUAL_LENGTH: Adjust minimum length based on context (default: true)

Content Filtering

  • ENABLE_PROSE_FILTERING: Filter out documentation text (default: true)
  • MAX_PROSE_RATIO: Maximum allowed prose ratio (default: 0.15, i.e. 15%)
  • MIN_CODE_INDICATORS: Minimum required code patterns (default: 3)
  • ENABLE_DIAGRAM_FILTERING: Filter out diagram languages (default: true)

Processing Settings

  • CODE_EXTRACTION_MAX_WORKERS: Parallel workers for code summaries (default: 3)
  • ENABLE_CODE_SUMMARIES: Generate AI summaries for code examples (default: true)

How Settings Work

  1. Dynamic Loading: Settings are loaded from the database on demand, not cached as environment variables
  2. Real-time Updates: Changes take effect on the next extraction run without server restart
  3. Graceful Fallbacks: If settings can't be loaded, sensible defaults are used
  4. Type Safety: Settings are validated and converted to appropriate types
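The fallback and type-conversion behavior can be illustrated with the following sketch. It assumes settings arrive from storage as a mapping of names to raw values (often strings) supplied by a caller-provided `fetch` function; that function, `load_settings`, and `coerce` are hypothetical names, and only a subset of the settings is shown.

```python
from typing import Any, Callable, Mapping

DEFAULTS = {
    "MIN_CODE_BLOCK_LENGTH": 250,
    "MAX_CODE_BLOCK_LENGTH": 5000,
    "CONTEXT_WINDOW_SIZE": 1000,
    "ENABLE_PROSE_FILTERING": True,
    "MAX_PROSE_RATIO": 0.15,
    "MIN_CODE_INDICATORS": 3,
    "CODE_EXTRACTION_MAX_WORKERS": 3,
}


def _to_bool(value: Any) -> bool:
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes", "on"}


def coerce(name: str, raw: Any) -> Any:
    """Convert a raw stored value to the type of its default, or fall back to the default."""
    default = DEFAULTS[name]
    try:
        # bool must be checked first because bool is a subclass of int.
        if isinstance(default, bool):
            return _to_bool(raw)
        return type(default)(raw)
    except (TypeError, ValueError):
        return default


def load_settings(fetch: Callable[[], Mapping[str, Any]]) -> dict:
    """Load settings on demand; use defaults if storage is unavailable."""
    try:
        stored = fetch()
    except Exception:
        stored = {}
    return {
        name: coerce(name, stored[name]) if name in stored else default
        for name, default in DEFAULTS.items()
    }
```

Calling `load_settings` at the start of each extraction run, rather than once at startup, is what would let changes take effect without a restart.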

Best Practices

  • Start with defaults: The default values work well for most content
  • Adjust gradually: Make small changes and test the results
  • Monitor logs: Check extraction logs to see how settings affect results
  • Language-specific tuning: Different content types may need different settings

Future Enhancements

Potential improvements under consideration:

  • Machine learning-based code detection
  • Support for notebook formats (Jupyter, Observable)
  • API response extraction from documentation
  • Multi-file code example correlation
  • Language-specific AST validation