Chunking Strategy

How CodePilot splits source code into semantic chunks for embedding.

Chunking is the process of splitting source code into meaningful segments that can be individually embedded and retrieved. CodePilot uses AST-aware chunking rather than naive line-based splitting to preserve semantic boundaries.

Why AST-Aware Chunking?

Naive chunking (splitting at fixed line counts) breaks code at arbitrary points, potentially splitting a function in half or separating a class from its methods. AST-aware chunking respects code structure:

Approach	Pros	Cons
Fixed-size (line-based)	Simple, predictable sizes	Breaks semantic boundaries
AST-aware	Preserves code meaning	Variable chunk sizes
Sentence-based	Good for prose	Meaningless for code

CodePilot uses AST-aware chunking for TypeScript/JavaScript. Other file types use section-based (Markdown), full-file (JSON), or line-based (everything else) chunking.
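The routing by file type can be pictured as a small dispatcher. This is a minimal sketch, not CodePilot's actual code: the `ChunkStrategy` names, the `pickStrategy` function, and the exact extension list are assumptions made for illustration.

```typescript
// Hypothetical strategy names; only the file-type mapping comes from this page.
type ChunkStrategy = "ast" | "section" | "full-file" | "line-based";

// Assumed extension list for TypeScript/JavaScript sources.
const AST_EXTENSIONS = new Set(["ts", "tsx", "js", "jsx"]);

function pickStrategy(filePath: string): ChunkStrategy {
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot + 1).toLowerCase();

  if (AST_EXTENSIONS.has(ext)) return "ast";   // parsed into functions, classes, types
  if (ext === "md") return "section";          // split at headings
  if (ext === "json") return "full-file";      // stored whole
  return "line-based";                         // fallback chunker
}
```

Keeping the decision in one function makes it easy to see which files get structural treatment and which fall through to the fallback.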

Chunk Types

Each chunk is tagged with a type that indicates what kind of code element it represents:

type ChunkType =
  | "function"
  | "class"
  | "component"
  | "hook"
  | "type"
  | "export"
  | "section"      // Markdown headings
  | "full-file"    // JSON and small files
  | "block";       // Fallback chunks

Chunking by File Type

TypeScript / JavaScript

ts-morph identifies structural boundaries and creates chunks at:

Function declarations — Each named function becomes one chunk
Arrow functions — Exported arrow functions are captured
Class declarations — Entire class (with methods) becomes one chunk
React components — Function components are identified by JSX returns
Hooks — Functions whose names start with "use" (for example, useState) are extracted
Type definitions — Interfaces, types, enums become individual chunks
Exports — Top-level export statements
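To illustrate how a function might be tagged as a hook, a component, or a plain function, here is a rough sketch. The `classifyFunction` helper and its regular expressions are hypothetical; the page only states that hooks start with "use" and that components are identified by JSX returns. A real implementation would presumably inspect AST nodes rather than source text.

```typescript
type FunctionChunkType = "function" | "component" | "hook";

// Hypothetical text-based heuristic, for illustration only.
function classifyFunction(name: string, bodyText: string): FunctionChunkType {
  // Hooks: "use" followed by an uppercase letter or digit (useState, useAuth),
  // so ordinary names such as "userName" do not match.
  if (/^use[A-Z0-9]/.test(name)) return "hook";

  // Components: crude check for a JSX element or fragment right after `return`.
  const returnsJsx = /return\s*\(?\s*<[A-Za-z>]/.test(bodyText);
  if (returnsJsx) return "component";

  return "function";
}
```

The text check misses arrow functions with an implicit JSX return and can be fooled by generic type assertions, which is one reason an AST-based check is preferable in practice.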

Large Functions and Classes

If a single function or class exceeds the maximum chunk size, it is split at logical sub-boundaries (method boundaries for classes, statement boundaries for functions) with overlapping context.
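One way to picture this splitting is as a grouping of sub-units (methods or statements) under a size budget, with the tail of each group repeated at the start of the next. The sketch below is an assumption about how that could work: the `Unit` shape, `splitAtBoundaries`, and measuring size in lines are all hypothetical.

```typescript
// A sub-unit of an oversized declaration: a method in a class or a statement in a function.
interface Unit {
  name: string;
  lineCount: number;
}

// Groups consecutive units so each group stays near maxLines, carrying the last
// `overlapUnits` units of a group into the next one as overlapping context.
function splitAtBoundaries(units: Unit[], maxLines: number, overlapUnits = 1): Unit[][] {
  const groups: Unit[][] = [];
  let current: Unit[] = [];
  let size = 0;

  for (const unit of units) {
    if (current.length > 0 && size + unit.lineCount > maxLines) {
      groups.push(current);
      // Start the next group with trailing units from the previous one.
      current = overlapUnits > 0 ? current.slice(-overlapUnits) : [];
      size = current.reduce((total, u) => total + u.lineCount, 0);
    }
    // A unit is never cut in half, so a group may exceed maxLines
    // when a single unit (plus carried context) is larger than the budget.
    current.push(unit);
    size += unit.lineCount;
  }

  if (current.length > 0) groups.push(current);
  return groups;
}
```

With four 30-line methods and a 60-line budget, this yields three groups in which each adjacent pair shares one method.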

Markdown

Markdown files are chunked at heading boundaries. Each section (from one heading to the next) becomes a chunk, including:

The heading text (for context in retrieval)
All content under that heading
Any nested sub-headings, which stay with the parent section
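The heading-based split can be sketched as follows. This is an illustrative version under stated assumptions: `chunkMarkdown`, the `Section` shape, and the `maxSplitLevel` parameter (headings deeper than this level stay inside their parent) are hypothetical, and text before the first heading is ignored for brevity.

```typescript
interface Section {
  heading: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  content: string;
}

// Built with repeat() so this example contains no literal code fence.
const FENCE = "`".repeat(3);

function chunkMarkdown(text: string, maxSplitLevel = 2): Section[] {
  const lines = text.split("\n");

  // First pass: record where each section starts, skipping fenced code blocks
  // so that "# comment" lines inside code are not mistaken for headings.
  const starts: { index: number; heading: string }[] = [];
  let inFence = false;
  lines.forEach((line, index) => {
    if (line.trimStart().startsWith(FENCE)) {
      inFence = !inFence;
      return;
    }
    if (inFence) return;
    const match = /^(#{1,6})\s+(.+)$/.exec(line);
    const hashes = match?.[1] ?? "";
    if (hashes.length > 0 && hashes.length <= maxSplitLevel) {
      starts.push({ index, heading: (match?.[2] ?? "").trim() });
    }
  });

  // Second pass: each section runs from its heading to the line before the next one.
  return starts.map((start, i) => {
    const next = starts[i + 1];
    const end = next ? next.index : lines.length;
    return {
      heading: start.heading,
      startLine: start.index + 1,
      endLine: end,
      content: lines.slice(start.index, end).join("\n"),
    };
  });
}
```

Because the heading line is part of `content`, a retrieved chunk carries its own title even when shown in isolation.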

JSON

JSON files like package.json and tsconfig.json are typically small and meaningful as complete units. They are stored as full-file chunks.

Fallback Chunker

Files that don't match a specific parser use a line-based fallback with:

Chunk size — approximately 60 lines per chunk
Overlap — 10 lines of context overlap between adjacent chunks
Boundary detection — The chunker attempts to break at blank lines or logical boundaries rather than mid-statement
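The three settings above can be combined as in the sketch below. It is only an approximation of the behavior described: `chunkByLines`, the `LineChunk` shape, and the `lookback` window used to search for a blank line are assumptions, while the 60-line size and 10-line overlap come from this page.

```typescript
interface LineChunk {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  content: string;
}

function chunkByLines(text: string, targetSize = 60, overlap = 10, lookback = 10): LineChunk[] {
  if (overlap >= targetSize) {
    throw new Error("overlap must be smaller than targetSize");
  }
  const lines = text.split("\n");
  const chunks: LineChunk[] = [];
  let start = 0; // 0-based index of the first line in the current chunk

  while (start < lines.length) {
    let end = Math.min(start + targetSize, lines.length); // exclusive index

    if (end < lines.length) {
      // Prefer to end the chunk on a nearby blank line instead of mid-statement.
      for (let i = end; i > end - lookback && i > start + overlap; i--) {
        if ((lines[i - 1] ?? "").trim() === "") {
          end = i;
          break;
        }
      }
    }

    chunks.push({
      startLine: start + 1,
      endLine: end,
      content: lines.slice(start, end).join("\n"),
    });

    if (end >= lines.length) break;
    start = end - overlap; // re-read the last `overlap` lines as context
  }
  return chunks;
}
```

The `i > start + overlap` bound guarantees that every chunk moves forward, so the loop terminates even when a file contains many blank lines.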

Chunk Metadata

Every chunk stores metadata alongside its content:

interface CodeChunk {
  content: string;        // The actual code text
  filePath: string;       // Relative path in the repository
  startLine: number;      // Starting line number
  endLine: number;        // Ending line number
  type: ChunkType;        // Semantic type of the chunk
  name?: string;          // Function/class/component name
  language: string;       // Programming language
  repositoryId: string;   // Parent repository reference
}

This metadata enables:

Source attribution — Show users exactly where code came from
Filtered search — Query only specific file types or chunk types
Context building — Include file path and surrounding code in LLM prompts
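As a rough illustration of how this metadata might be used, the sketch below filters chunks by type and language and formats one chunk with its source location for a prompt. The `filterChunks` and `formatForPrompt` helpers and the header format are hypothetical; the types are repeated from this page so the example stands alone. A production system would more likely apply such filters inside the vector store query than in memory.

```typescript
type ChunkType =
  | "function" | "class" | "component" | "hook" | "type"
  | "export" | "section" | "full-file" | "block";

interface CodeChunk {
  content: string;
  filePath: string;
  startLine: number;
  endLine: number;
  type: ChunkType;
  name?: string;
  language: string;
  repositoryId: string;
}

// Filtered search: keep only chunks of the requested types (and language, if given).
function filterChunks(chunks: CodeChunk[], types: ChunkType[], language?: string): CodeChunk[] {
  return chunks.filter(
    (chunk) => types.includes(chunk.type) && (language === undefined || chunk.language === language),
  );
}

// Source attribution and context building: prefix the code with where it came from.
function formatForPrompt(chunk: CodeChunk): string {
  const label = chunk.name ? `${chunk.type} ${chunk.name}` : chunk.type;
  return `// ${chunk.filePath}:${chunk.startLine}-${chunk.endLine} (${label})\n${chunk.content}`;
}
```

For a hook named `useAuth` spanning lines 3 to 12 of `src/hooks/useAuth.ts`, the formatted header reads `// src/hooks/useAuth.ts:3-12 (hook useAuth)`.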
