Repository Ingestion

How CodePilot clones repositories and processes them into searchable embeddings.

Repository ingestion is the core pipeline that transforms a GitHub repository into searchable vector embeddings. This page covers each stage of the process.

Overview

When a user connects a repository, the following pipeline executes:

Clone → Scan → Parse → Chunk → Embed → Store

The entire pipeline runs as a BullMQ background job, so the user is not blocked while ingestion is in progress.
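
The stage order above can be pictured as a list of async steps that the job handler runs one after another. The sketch below is illustrative only: `PipelineContext`, `Stage`, `stubStage`, and `runPipeline` are hypothetical names rather than CodePilot's actual code, and the stage bodies are stubs.

```typescript
// Hypothetical context passed from stage to stage; the real shape is not documented here.
interface PipelineContext {
  repoFullName: string;
  completed: string[]; // names of the stages that have finished
}

interface Stage {
  name: string;
  run: (ctx: PipelineContext) => Promise<PipelineContext>;
}

// Stub stage that only records that it ran.
function stubStage(name: string): Stage {
  return {
    name,
    run: async (ctx) => ({ ...ctx, completed: [...ctx.completed, name] }),
  };
}

// Stage order taken from the overview: Clone, Scan, Parse, Chunk, Embed, Store.
const STAGES: Stage[] = ["clone", "scan", "parse", "chunk", "embed", "store"].map(stubStage);

// Run the stages strictly in order; a failure aborts the run and names the failing stage.
async function runPipeline(
  ctx: PipelineContext,
  stages: Stage[] = STAGES
): Promise<PipelineContext> {
  let current = ctx;
  for (const stage of stages) {
    try {
      current = await stage.run(current);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`Stage "${stage.name}" failed: ${reason}`);
    }
  }
  return current;
}
```

Tagging the error with the stage name is a design choice of this sketch; it makes a failed job easier to diagnose from the queue's failure record.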

Stage 1: Cloning

The worker clones the repository using simple-git with a GitHub App installation token for authentication:

import simpleGit from "simple-git";

const git = simpleGit();
await git.clone(
  `https://x-access-token:${installationToken}@github.com/${owner}/${repo}.git`,
  clonePath,
  ["--depth", "1", "--single-branch"]
);

Shallow Clones

CodePilot uses shallow clones (--depth 1) to minimize bandwidth and disk usage. Only the latest commit on the default branch is cloned.
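
For illustration, the clone URL and arguments could be assembled by small helpers like the ones below. `buildCloneUrl` and `buildCloneArgs` are hypothetical names, and passing `--branch` explicitly is an assumption for cases where a specific branch is wanted; the pipeline described here clones the default branch and is not documented to do this.

```typescript
// Build the authenticated HTTPS clone URL in the same shape as the cloning snippet.
function buildCloneUrl(owner: string, repo: string, installationToken: string): string {
  // Encode each part so unexpected characters cannot alter the URL structure.
  const token = encodeURIComponent(installationToken);
  const path = `${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`;
  return `https://x-access-token:${token}@github.com/${path}.git`;
}

// Arguments for a shallow, single-branch clone.
// When `branch` is omitted, git clones the remote's default branch.
function buildCloneArgs(branch?: string): string[] {
  const args = ["--depth", "1", "--single-branch"];
  if (branch) {
    args.push("--branch", branch);
  }
  return args;
}
```

The results can be passed straight to `git.clone(url, clonePath, args)`.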

Stage 2: File Scanning

After cloning, the worker scans the repository and filters files by supported types:

TypeScript (.ts, .tsx)
JavaScript (.js, .jsx)
Markdown (.md, .mdx)
JSON (package.json, tsconfig.json, etc.)

Files in node_modules, .git, dist, build, and other generated directories are excluded.
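
The filtering rule can be sketched as a single predicate over repository-relative paths. This is a minimal sketch, not CodePilot's scanner: `shouldIndex` and both constants are hypothetical names, every `.json` file is treated as supported (an assumption, since only config files are named above), and only the four directories named above are excluded.

```typescript
// Extensions taken from the supported file types listed above.
const SUPPORTED_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".md", ".mdx", ".json"];

// Directories named in the text; the "other generated directories" are not guessed here.
const EXCLUDED_DIRS = new Set(["node_modules", ".git", "dist", "build"]);

// Decide whether a repository-relative path (forward slashes) should be indexed.
function shouldIndex(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1] ?? "";
  const dirs = segments.slice(0, -1);

  // Skip anything nested at any depth under an excluded directory.
  if (dirs.some((dir) => EXCLUDED_DIRS.has(dir))) {
    return false;
  }
  return SUPPORTED_EXTENSIONS.some((ext) => fileName.endsWith(ext));
}
```

Only directory segments are checked against the exclusion set, so a file such as `build.ts` at the repository root is still indexed.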

Stage 3: AST Parsing

For TypeScript and JavaScript files, CodePilot uses ts-morph to parse each file into an Abstract Syntax Tree (AST) and identify its structural elements:

Functions and arrow functions
Classes and methods
React components
Custom hooks
Type definitions and interfaces
Module exports

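import { Project } from "ts-morph";

// One extracted chunk per structural element (illustrative shape).
interface CodeChunk {
  type: string;
  name: string | undefined; // anonymous default exports have no name
  content: string;
  startLine: number;
  endLine: number;
}

const chunks: CodeChunk[] = [];

const project = new Project();
const sourceFile = project.addSourceFileAtPath(filePath); // filePath comes from the scanning stage

// Extract function declarations (arrow functions are variable declarations and need separate handling)
for (const fn of sourceFile.getFunctions()) {
  chunks.push({
    type: "function",
    name: fn.getName(),
    content: fn.getFullText(), // includes leading comments and whitespace
    startLine: fn.getStartLineNumber(),
    endLine: fn.getEndLineNumber(),
  });
}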
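
React components and custom hooks are ordinary functions at the syntax level, so telling them apart needs an extra rule on top of the AST. How CodePilot does this is not stated here; the sketch below assumes the usual React naming conventions, and `ChunkType` and `classifyByName` are hypothetical names.

```typescript
type ChunkType = "function" | "component" | "hook";

// Classify a function-like declaration by its name alone.
// Assumption: React naming conventions (PascalCase components, `useXxx` hooks).
function classifyByName(name: string | undefined): ChunkType {
  if (!name) {
    return "function"; // anonymous functions cannot be classified by name
  }
  if (/^use[A-Z0-9]/.test(name)) {
    return "hook";
  }
  if (/^[A-Z]/.test(name)) {
    return "component";
  }
  return "function";
}
```

A name-only rule will mislabel PascalCase functions that are not components, so a fuller implementation would likely also check whether the body returns JSX.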

Job Structure

Each ingestion job carries a payload of the following shape:

interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string;     // "owner/repo"
  defaultBranch: string;    // "main"
}

BullMQ provides built-in retry logic, priority queuing, and stalled-job detection for these jobs.
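
Because job data crosses a queue boundary as plain JSON, a worker may want to validate it before use. The helpers below are a minimal sketch built around the `IngestJobData` interface; `parseIngestJobData` and `splitRepoFullName` are hypothetical names, and the non-empty-string rule is an assumption.

```typescript
interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string; // "owner/repo"
  defaultBranch: string; // "main"
}

// Read one required field, throwing if it is missing or not a non-empty string.
function requireString(record: Record<string, unknown>, field: string): string {
  const value = record[field];
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`Job data field "${field}" must be a non-empty string`);
  }
  return value;
}

// Narrow an untyped payload to IngestJobData.
function parseIngestJobData(raw: unknown): IngestJobData {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Job data must be an object");
  }
  const record = raw as Record<string, unknown>;
  return {
    repositoryId: requireString(record, "repositoryId"),
    installationId: requireString(record, "installationId"),
    repoFullName: requireString(record, "repoFullName"),
    defaultBranch: requireString(record, "defaultBranch"),
  };
}

// Split "owner/repo" into its two parts.
function splitRepoFullName(repoFullName: string): { owner: string; repo: string } {
  const [owner, repo, ...rest] = repoFullName.split("/");
  if (!owner || !repo || rest.length > 0) {
    throw new Error(`Expected "owner/repo", got "${repoFullName}"`);
  }
  return { owner, repo };
}
```

The `owner` and `repo` values produced by `splitRepoFullName` are the ones the cloning stage interpolates into the clone URL.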

Incremental Ingestion

For repositories that are already indexed, CodePilot supports incremental ingestion triggered by GitHub push webhooks:

Compare the commit diff to identify changed files
Remove existing chunks and embeddings for changed/deleted files
Re-parse and re-embed only the changed files
Insert updated vectors into the database

Because only the changed files are reprocessed, this is much faster than a full re-ingestion and keeps the index current.
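
The steps above can be sketched as a planning function that turns a push into two path lists. This sketch assumes the changed files are read from the `added`, `modified`, and `removed` arrays that GitHub includes on each commit in a push webhook payload; CodePilot may compute the diff differently. `PushCommit`, `IncrementalPlan`, and `planIncrementalUpdate` are hypothetical names.

```typescript
// Subset of a GitHub push webhook commit entry used by this sketch.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

interface IncrementalPlan {
  toDelete: string[]; // paths whose existing chunks and embeddings should be removed
  toReindex: string[]; // paths to re-parse and re-embed
}

// Fold the commits (oldest first) into the final state of each touched path.
function planIncrementalUpdate(commits: PushCommit[]): IncrementalPlan {
  const finalState = new Map<string, "present" | "removed">();
  for (const commit of commits) {
    for (const path of commit.added.concat(commit.modified)) {
      finalState.set(path, "present");
    }
    for (const path of commit.removed) {
      finalState.set(path, "removed");
    }
  }

  const toDelete: string[] = [];
  const toReindex: string[] = [];
  finalState.forEach((state, path) => {
    // Every touched path may have stale chunks, so clear them first.
    toDelete.push(path);
    if (state === "present") {
      toReindex.push(path);
    }
  });
  return { toDelete: toDelete.sort(), toReindex: toReindex.sort() };
}
```

Tracking the final state per path means a file that is added in one commit and removed in a later commit of the same push is deleted from the index rather than re-embedded.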
