Chunking Strategy

How CodePilot splits source code into semantic chunks for embedding.

Chunking is the process of splitting source code into meaningful segments that can be individually embedded and retrieved. CodePilot uses AST-aware chunking rather than naive line-based splitting to preserve semantic boundaries.

Why AST-Aware Chunking?

Naive chunking (splitting at fixed line counts) breaks code at arbitrary points, potentially splitting a function in half or separating a class from its methods. AST-aware chunking respects code structure:

| Approach | Pros | Cons |
| --- | --- | --- |
| Fixed-size (line-based) | Simple, predictable sizes | Breaks semantic boundaries |
| AST-aware | Preserves code meaning | Variable chunk sizes |
| Sentence-based | Good for prose | Meaningless for code |

CodePilot uses AST-aware chunking for TypeScript/JavaScript. Other file types fall back to section-based chunking (Markdown), full-file chunks (JSON), or line-based chunking (everything else).
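
The routing described above could be sketched as a small dispatch on file extension. This is only an illustration: the strategy names, the `pickStrategy` helper, and the extension lists are assumptions, not CodePilot's actual internals.

```typescript
// Hypothetical strategy names; the source does not name the internal chunkers.
type ChunkStrategy = "ast" | "markdown-sections" | "full-file" | "line-fallback";

// Assumed extension list; the real set of supported extensions may differ.
const AST_EXTENSIONS = new Set([".ts", ".tsx", ".js", ".jsx"]);

// Returns the lowercased extension including the dot, or "" if there is none.
function extensionOf(filePath: string): string {
  const fileName = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = fileName.lastIndexOf(".");
  // A leading dot (".gitignore") or no dot at all counts as "no extension".
  return dot > 0 ? fileName.slice(dot).toLowerCase() : "";
}

// Picks a chunking strategy for a file based only on its extension.
function pickStrategy(filePath: string): ChunkStrategy {
  const ext = extensionOf(filePath);
  if (AST_EXTENSIONS.has(ext)) return "ast";
  if (ext === ".md") return "markdown-sections";
  if (ext === ".json") return "full-file";
  return "line-fallback";
}
```

For example, `pickStrategy("src/App.tsx")` selects the AST path, while a Python script or a dotfile ends up in the line-based fallback.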

Chunk Types

Each chunk is tagged with a type that indicates what kind of code element it represents:

type ChunkType =
  | "function"
  | "class"
  | "component"
  | "hook"
  | "type"
  | "export"
  | "section"      // Markdown headings
  | "full-file"    // JSON and small files
  | "block";       // Fallback chunks

Chunking by File Type

TypeScript / JavaScript

ts-morph identifies structural boundaries and creates chunks at:

  • Function declarations — Each named function becomes one chunk
  • Arrow functions — Exported arrow functions are captured
  • Class declarations — Entire class (with methods) becomes one chunk
  • React components — Function components are identified by JSX returns
  • Hooks — Functions starting with use are extracted
  • Type definitions — Interfaces, types, enums become individual chunks
  • Exports — Top-level export statements
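
Once a function-like declaration has been located, it still needs a chunk type. The sketch below shows one way the hook and component rules from the list above might be applied. `FunctionInfo` and `classifyFunction` are hypothetical names, and `returnsJsx` stands in for whatever JSX detection the AST pass performs. Requiring an uppercase letter or digit after `use` follows the usual React naming convention and is an assumption here; the source only says the name starts with `use`.

```typescript
type ChunkType =
  | "function" | "class" | "component" | "hook" | "type"
  | "export" | "section" | "full-file" | "block";

// Hypothetical summary of a function found by the AST pass.
interface FunctionInfo {
  name: string;
  returnsJsx: boolean; // assumed to be computed elsewhere from the function body
}

// Assumption: "use" must be followed by an uppercase letter or a digit,
// so names like "useless" or "user" are not treated as hooks.
const HOOK_NAME = /^use[A-Z0-9]/;

function classifyFunction(info: FunctionInfo): ChunkType {
  // The order of these checks is a design choice, not something the source specifies.
  if (HOOK_NAME.test(info.name)) return "hook";
  if (info.returnsJsx) return "component";
  return "function";
}
```

Checking the hook rule first means a `useSomething` function that happens to return JSX is still tagged as a hook; swapping the two checks would give the opposite result.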

Large Functions

If a single function or class exceeds the maximum chunk size, it is split at logical sub-boundaries (method boundaries for classes, statement boundaries for functions) with overlapping context.
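
A minimal sketch of that splitting step, assuming the oversized unit has already been broken into statement (or method) texts. The `splitOversized` helper, the line-based size limit, and the one-statement overlap are illustrative assumptions; the source does not state the actual limit or how much context overlaps.

```typescript
// Counts the lines in one statement's text.
function countLines(text: string): number {
  return text.split("\n").length;
}

// Groups consecutive statements into parts of at most `maxLines` lines.
// Each part after the first starts with the last `overlap` statements of the
// previous part, so neighbouring parts share some context.
function splitOversized(
  statements: string[],
  maxLines: number,
  overlap = 1,
): string[][] {
  const parts: string[][] = [];
  let current: string[] = [];
  let currentLines = 0;
  let fresh = 0; // statements in `current` that are not carried-over overlap

  for (const stmt of statements) {
    const lines = countLines(stmt);
    // Close the current part only if it already holds new material.
    if (fresh > 0 && currentLines + lines > maxLines) {
      parts.push(current);
      current = overlap > 0 ? current.slice(-overlap) : [];
      currentLines = current.reduce((n, s) => n + countLines(s), 0);
      fresh = 0;
    }
    current.push(stmt);
    currentLines += lines;
    fresh++;
  }
  if (fresh > 0) parts.push(current);
  return parts;
}
```

Note that a single statement longer than `maxLines` still ends up in a part of its own that exceeds the limit; this sketch never splits inside a statement.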

Markdown

Markdown files are chunked at heading boundaries. Each section (from one heading to the next heading of the same or higher level) becomes a chunk, including:

  • The heading text (for context in retrieval)
  • All content under that heading
  • Nested sub-headings are included in the parent section
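
One way to read the rules above is to split only at the shallowest heading level found in the file, so deeper headings stay inside their parent section. The sketch below takes that reading as an assumption. It also only recognizes `#`-style headings and does not skip `#` lines inside fenced code blocks; `SectionChunk` and `chunkMarkdown` are hypothetical names.

```typescript
interface SectionChunk {
  name: string | undefined; // heading text; undefined for content before the first heading
  content: string;          // heading line plus everything under it
  startLine: number;        // 1-based, inclusive
  endLine: number;          // 1-based, inclusive
}

function chunkMarkdown(text: string): SectionChunk[] {
  const lines = text.split("\n");
  const headingRe = /^(#{1,6})\s+(.*)$/;

  // Find the shallowest heading level used in this file.
  let topLevel = 7;
  for (const line of lines) {
    const m = headingRe.exec(line);
    if (m) topLevel = Math.min(topLevel, m[1].length);
  }

  const chunks: SectionChunk[] = [];
  let start = 0;
  let name: string | undefined;

  // Emits lines [start, end) as one chunk, skipping whitespace-only spans.
  const flush = (end: number): void => {
    const content = lines.slice(start, end).join("\n");
    if (content.trim() !== "") {
      chunks.push({ name, content, startLine: start + 1, endLine: end });
    }
  };

  for (let i = 0; i < lines.length; i++) {
    const m = headingRe.exec(lines[i]);
    if (m && m[1].length === topLevel) {
      flush(i); // close the previous section
      start = i;
      name = m[2].trim();
    }
  }
  flush(lines.length);
  return chunks;
}
```

Keeping the heading line inside `content` means the heading text is embedded along with the body, which matches the first bullet above.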

JSON

JSON files like package.json and tsconfig.json are typically small and meaningful as complete units. They are stored as full-file chunks.

Fallback Chunker

Files that don't match a specific parser use a line-based fallback with:

  • Chunk size — approximately 60 lines per chunk
  • Overlap — 10 lines of context overlap between adjacent chunks
  • Boundary detection — The chunker attempts to break at blank lines or logical boundaries rather than mid-statement
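
The three rules above can be sketched as a sliding window over lines. The defaults of 60 and 10 come from the text; the `lookback` window used to search for a blank line, and the `chunkByLines` name itself, are assumptions made for illustration.

```typescript
interface LineChunk {
  content: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
}

function chunkByLines(
  text: string,
  targetSize = 60,
  overlap = 10,
  lookback = 10, // assumed: how far back to search for a blank line
): LineChunk[] {
  if (overlap >= targetSize) {
    throw new Error("overlap must be smaller than targetSize");
  }
  const lines = text.split("\n");
  const chunks: LineChunk[] = [];
  let start = 0;

  while (start < lines.length) {
    let end = Math.min(start + targetSize, lines.length); // exclusive index

    if (end < lines.length) {
      // Prefer to end the chunk on a nearby blank line, but never shrink it
      // so far that the next window would fail to move forward.
      for (let i = end; i > end - lookback && i > start + overlap + 1; i--) {
        if (lines[i - 1].trim() === "") {
          end = i;
          break;
        }
      }
    }

    chunks.push({
      content: lines.slice(start, end).join("\n"),
      startLine: start + 1,
      endLine: end,
    });

    if (end === lines.length) break;
    start = end - overlap; // the next chunk repeats the last `overlap` lines
  }
  return chunks;
}
```

With the defaults, a 130-line file with no blank lines yields chunks covering lines 1-60, 51-110, and 101-130. This is why the chunk size is only approximate: a blank line near the end of the window pulls the boundary earlier.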

Chunk Metadata

Every chunk stores metadata alongside its content:

interface CodeChunk {
  content: string;        // The actual code text
  filePath: string;       // Relative path in the repository
  startLine: number;      // Starting line number
  endLine: number;        // Ending line number
  type: ChunkType;        // Semantic type of the chunk
  name?: string;          // Function/class/component name
  language: string;       // Programming language
  repositoryId: string;   // Parent repository reference
}

This metadata enables:

  • Source attribution — Show users exactly where code came from
  • Filtered search — Query only specific file types or chunk types
  • Context building — Include file path and surrounding code in LLM prompts
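
As an illustration of the first two uses, the sketch below filters chunks by type or language and formats a source reference from the metadata. `filterChunks`, `formatAttribution`, and the `ChunkFilter` shape are hypothetical helpers, not part of CodePilot's documented API. The types are repeated so the example is self-contained.

```typescript
type ChunkType =
  | "function" | "class" | "component" | "hook" | "type"
  | "export" | "section" | "full-file" | "block";

interface CodeChunk {
  content: string;
  filePath: string;
  startLine: number;
  endLine: number;
  type: ChunkType;
  name?: string;
  language: string;
  repositoryId: string;
}

// Hypothetical filter shape: an omitted field means "no restriction".
interface ChunkFilter {
  types?: ChunkType[];
  languages?: string[];
}

// Keeps only the chunks that match every restriction in the filter.
function filterChunks(chunks: CodeChunk[], filter: ChunkFilter): CodeChunk[] {
  return chunks.filter(
    (c) =>
      (!filter.types || filter.types.includes(c.type)) &&
      (!filter.languages || filter.languages.includes(c.language)),
  );
}

// Builds a reference such as "src/auth.ts:10-12 (function login)".
function formatAttribution(chunk: CodeChunk): string {
  const label = chunk.name ? `${chunk.type} ${chunk.name}` : chunk.type;
  return `${chunk.filePath}:${chunk.startLine}-${chunk.endLine} (${label})`;
}
```

The same attribution string could be prepended to each chunk's `content` when assembling an LLM prompt, so the model sees where each snippet came from.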
