Repository Ingestion

How CodePilot clones repositories and processes them into searchable embeddings.

Repository ingestion is the core pipeline that transforms a GitHub repository into searchable vector embeddings. This page covers each stage of the process.

Overview

When a user connects a repository, the following pipeline executes:

Clone → Scan → Parse → Chunk → Embed → Store

The entire pipeline runs as a background job via BullMQ, so the user is never blocked while a repository is being ingested.

Stage 1: Cloning

The worker clones the repository using simple-git with a GitHub App installation token for authentication:

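import simpleGit from "simple-git";

// installationToken, owner, repo, and clonePath are supplied by the worker.
const git = simpleGit();
await git.clone(
  // GitHub accepts an installation token as the password for the x-access-token user
  `https://x-access-token:${installationToken}@github.com/${owner}/${repo}.git`,
  clonePath,
  ["--depth", "1", "--single-branch"] // shallow clone: latest commit only
);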

Shallow Clones

CodePilot uses shallow clones (--depth 1) to minimize bandwidth and disk usage. Only the latest commit on the default branch is cloned.
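
If the worker should pin the clone to a specific branch rather than rely on the remote's default, the clone arguments could be built by a small helper. The following is a minimal sketch: `buildCloneArgs` is a hypothetical name that does not appear in CodePilot, while `--depth`, `--single-branch`, and `--branch` are standard `git clone` flags.

```typescript
// Hypothetical helper: builds the extra arguments passed to git.clone().
// `--depth`, `--single-branch`, and `--branch` are standard `git clone` flags.
function buildCloneArgs(branch?: string, depth: number = 1): string[] {
  if (!Number.isInteger(depth) || depth < 1) {
    throw new Error(`depth must be a positive integer, got ${depth}`);
  }
  const args = ["--depth", String(depth), "--single-branch"];
  // Without --branch, git clones the remote's default branch.
  if (branch) {
    args.push("--branch", branch);
  }
  return args;
}
```

The result can be passed as the third argument to `git.clone(...)`, for example `buildCloneArgs("main")`.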

Stage 2: File Scanning

After cloning, the worker scans the repository and filters files by supported types:

  • TypeScript (.ts, .tsx)
  • JavaScript (.js, .jsx)
  • Markdown (.md, .mdx)
  • JSON (package.json, tsconfig.json, etc.)

Files in node_modules, .git, dist, build, and other generated directories are excluded.
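
The filter described above can be sketched as a pure function over repository-relative paths. This is an illustration under stated assumptions: `shouldIngest` is a hypothetical name, every `.json` file is treated as supported, and only the four directories named above are excluded, since the full exclusion list is not given here.

```typescript
// Extensions taken from the supported-types list above.
const SUPPORTED_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".md", ".mdx", ".json"];

// Only the directories named in the text; a real deployment would likely add more.
const EXCLUDED_DIRS = new Set(["node_modules", ".git", "dist", "build"]);

// Returns true when a repository-relative path should be ingested.
function shouldIngest(relativePath: string): boolean {
  // Normalize Windows separators so the check works on any platform.
  const segments = relativePath.replace(/\\/g, "/").split("/");
  const fileName = segments[segments.length - 1] || "";
  const dirs = segments.slice(0, -1);

  // Reject the file if any parent directory is excluded, at any depth.
  if (dirs.some((dir) => EXCLUDED_DIRS.has(dir))) return false;
  return SUPPORTED_EXTENSIONS.some((ext) => fileName.endsWith(ext));
}
```

Checking every path segment, rather than only the top-level directory, also excludes nested cases such as `packages/app/dist/main.js` in a monorepo.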

Stage 3: AST Parsing

For TypeScript and JavaScript files, CodePilot uses ts-morph to perform AST (Abstract Syntax Tree) analysis. This identifies structural elements:

  • Functions and arrow functions
  • Classes and methods
  • React components
  • Custom hooks
  • Type definitions and interfaces
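  • Module exports

import { Project } from "ts-morph";

interface CodeChunk {
  type: "function";
  name: string;
  content: string;
  startLine: number;
  endLine: number;
}

const chunks: CodeChunk[] = [];
const project = new Project();
const sourceFile = project.addSourceFileAtPath(filePath);

// Extract top-level function declarations
for (const fn of sourceFile.getFunctions()) {
  chunks.push({
    type: "function",
    name: fn.getName() ?? "(anonymous)", // default-exported functions may be unnamed
    content: fn.getFullText(),           // includes leading comments and whitespace
    startLine: fn.getStartLineNumber(),
    endLine: fn.getEndLineNumber(),
  });
}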
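
The list above also mentions React components and custom hooks, which are ordinary functions at the syntax level. How CodePilot tells them apart is not described here; one possible approach, sketched below as an assumption, is to apply React's naming conventions (hooks start with `use` followed by an uppercase letter, components are PascalCase). `classifyByName` and `FunctionKind` are hypothetical names.

```typescript
type FunctionKind = "component" | "hook" | "function";

// Heuristic classification by React naming conventions. This is an assumption
// for illustration; it does not inspect the function body or its return type.
function classifyByName(name: string, fileExtension: string): FunctionKind {
  // Hooks are named `use` followed by an uppercase letter, e.g. useAuth.
  if (/^use[A-Z]/.test(name)) return "hook";
  // Components are PascalCase and, in this sketch, must live in a JSX-capable file.
  const isJsxFile = fileExtension === ".tsx" || fileExtension === ".jsx";
  if (isJsxFile && /^[A-Z]/.test(name)) return "component";
  return "function";
}
```

A name-based check is cheap but imprecise: a PascalCase factory function in a `.tsx` file would be labelled a component, so a stricter version might also check whether the function returns JSX.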

Stage 4: Chunking

Code is split into semantic chunks based on AST analysis. See Chunking Strategy for details.

Stage 5: Embedding

Each chunk is sent to Ollama's nomic-embed-text model to generate a 768-dimensional vector. See Embeddings for details.
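
As a rough sketch of this stage, the request to Ollama might look like the following. It assumes Ollama's `POST /api/embeddings` endpoint, which takes `{ model, prompt }` and returns `{ embedding }`, and the default local address `http://localhost:11434`. The helper names and the injected `fetchImpl` parameter are hypothetical, chosen so the function can be exercised without a running server.

```typescript
const EMBEDDING_MODEL = "nomic-embed-text";
const EMBEDDING_DIMENSIONS = 768;

// Minimal shape of the fetch function this sketch needs, so it can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the JSON body for Ollama's POST /api/embeddings endpoint.
function buildEmbeddingRequest(text: string): { model: string; prompt: string } {
  return { model: EMBEDDING_MODEL, prompt: text };
}

// Validates the response payload and returns the vector.
function parseEmbedding(payload: unknown): number[] {
  const embedding = (payload as { embedding?: unknown } | null)?.embedding;
  if (!Array.isArray(embedding) || embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected a ${EMBEDDING_DIMENSIONS}-dimensional embedding`);
  }
  return embedding as number[];
}

// baseUrl is an assumption: Ollama listens on http://localhost:11434 by default.
async function embedChunk(
  text: string,
  fetchImpl: FetchLike,
  baseUrl: string = "http://localhost:11434"
): Promise<number[]> {
  const response = await fetchImpl(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbeddingRequest(text)),
  });
  if (!response.ok) {
    throw new Error(`Ollama returned HTTP ${response.status}`);
  }
  return parseEmbedding(await response.json());
}
```

In a runtime with a global `fetch`, a call would look like `await embedChunk(chunk.content, fetch)`. Checking the vector length up front catches a misconfigured model before a wrongly sized vector reaches the database.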

Stage 6: Storage

Vectors are stored in PostgreSQL using the pgvector extension. See Vector Database for details.
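
To make the storage step concrete, here is a hedged sketch of how a vector could be written and queried with pgvector. The table name `code_chunks`, its columns, and the choice of cosine distance are assumptions for illustration, as the real schema is not shown on this page. The `[x,y,z]` text literal, the `::vector` cast, and the `<=>` operator are part of pgvector itself.

```typescript
// Formats a vector as a pgvector text literal, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(vector: number[]): string {
  if (vector.length === 0 || vector.some((v) => !Number.isFinite(v))) {
    throw new Error("vector must be non-empty and contain only finite numbers");
  }
  return `[${vector.join(",")}]`;
}

// Hypothetical table and column names; the real schema is not shown on this page.
const INSERT_CHUNK_SQL = `
  INSERT INTO code_chunks (repository_id, file_path, content, embedding)
  VALUES ($1, $2, $3, $4::vector)
`;

// `<=>` is pgvector's cosine-distance operator; smaller means more similar.
const SEARCH_SQL = `
  SELECT file_path, content, embedding <=> $1::vector AS distance
  FROM code_chunks
  WHERE repository_id = $2
  ORDER BY embedding <=> $1::vector
  LIMIT $3
`;
```

Both statements use positional parameters, so the vector is passed as `toVectorLiteral(embedding)` rather than being interpolated into the SQL string.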

Job Structure

Ingestion jobs contain all the metadata needed by the worker:

interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string;     // "owner/repo"
  defaultBranch: string;    // "main"
}

Jobs include built-in retry logic, priority queuing, and stall detection via BullMQ.
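
As an illustration of how such a job might be prepared, the sketch below splits `repoFullName` for the clone URL and builds a set of job options. `attempts`, `backoff`, `priority`, and `jobId` are real BullMQ job options, but the helper names and the specific values are assumptions, not CodePilot's actual configuration.

```typescript
interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string; // "owner/repo"
  defaultBranch: string; // "main"
}

// Subset of BullMQ job options used here: attempts, backoff, priority, jobId.
interface IngestJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  priority: number;
  jobId: string;
}

// Splits "owner/repo" into its parts for building the clone URL.
function parseRepoFullName(repoFullName: string): { owner: string; repo: string } {
  const [owner, repo, ...rest] = repoFullName.split("/");
  if (!owner || !repo || rest.length > 0) {
    throw new Error(`expected "owner/repo", got "${repoFullName}"`);
  }
  return { owner, repo };
}

// The values below are illustrative assumptions, not CodePilot's configuration.
function buildIngestJobOptions(data: IngestJobData): IngestJobOptions {
  return {
    attempts: 3, // retry a failed job up to three times in total
    backoff: { type: "exponential", delay: 5000 }, // base delay in milliseconds
    priority: 1, // in BullMQ, a lower number means higher priority
    // A stable ID lets BullMQ ignore a duplicate job for the same repository.
    jobId: `ingest-${data.repositoryId}`,
  };
}
```

With a BullMQ `Queue` instance, the job would then be enqueued as `queue.add("ingest", data, buildIngestJobOptions(data))`, where the job name `"ingest"` is again hypothetical.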

Incremental Ingestion

For repositories that are already indexed, CodePilot supports incremental ingestion triggered by GitHub push webhooks:

  1. Compare the commit diff to identify changed files
  2. Remove existing chunks and embeddings for changed/deleted files
  3. Re-parse and re-embed only the changed files
  4. Insert updated vectors into the database

Because only changed files are re-parsed and re-embedded, an incremental run does far less work than a full re-ingestion while keeping the index current.
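
Steps 1 to 3 can be sketched as a planning function. This version assumes the changed files are read from the `added`, `modified`, and `removed` arrays that each entry in a GitHub push event's `commits` array carries. CodePilot may instead diff the two commits directly. `planIncrementalUpdate` and the interface names are hypothetical.

```typescript
// Per-commit file lists, as found in the `commits` array of a GitHub push event.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

interface IncrementalPlan {
  toDelete: string[]; // files whose existing chunks and embeddings are removed
  toReindex: string[]; // files that are re-parsed and re-embedded
}

// Replays commits in order so that the final state of each path wins.
function planIncrementalUpdate(commits: PushCommit[]): IncrementalPlan {
  const finalState = new Map<string, "changed" | "removed">();
  for (const commit of commits) {
    for (const path of commit.added) finalState.set(path, "changed");
    for (const path of commit.modified) finalState.set(path, "changed");
    for (const path of commit.removed) finalState.set(path, "removed");
  }
  const touched = Array.from(finalState.keys()).sort();
  return {
    // Step 2: old chunks go away for every changed or deleted file.
    toDelete: touched,
    // Step 3: only files that still exist are re-parsed and re-embedded.
    toReindex: touched.filter((path) => finalState.get(path) === "changed"),
  };
}
```

Replaying commits in order matters when one push both adds and later deletes a file: the file ends up in `toDelete` only, so no embedding is generated for content that no longer exists.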
