Repository Ingestion
How CodePilot clones repositories and processes them into searchable embeddings.
Repository ingestion is the core pipeline that transforms a GitHub repository into searchable vector embeddings. This page covers each stage of the process.
Overview
When a user connects a repository, the following pipeline executes:
Clone → Scan → Parse → Chunk → Embed → Store
The entire pipeline runs as a background job via BullMQ, so the user experience is non-blocking.
Stage 1: Cloning
The worker clones the repository using simple-git with a GitHub App installation token for authentication:
import simpleGit from "simple-git";

const git = simpleGit();

await git.clone(
  `https://x-access-token:${installationToken}@github.com/${owner}/${repo}.git`,
  clonePath,
  ["--depth", "1", "--single-branch"]
);
Shallow Clones
CodePilot uses shallow clones (--depth 1) to minimize bandwidth and disk usage. Only the latest commit on the default branch is cloned.
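As a minimal sketch of how the clone arguments might be assembled, the helper below builds the authenticated URL together with a masked variant that is safe to write to logs. The name `buildCloneTarget` and its return shape are assumptions made for illustration; they are not taken from CodePilot's code.

```typescript
// Hypothetical helper: the name and return shape are illustrative assumptions.
interface CloneTarget {
  url: string;       // authenticated URL to pass to git.clone()
  safeUrl: string;   // same URL with the token masked, for logging
  options: string[]; // shallow-clone flags described on this page
}

function buildCloneTarget(repoFullName: string, installationToken: string): CloneTarget {
  const parts = repoFullName.split("/");
  const owner = parts[0];
  const repo = parts[1];
  if (parts.length !== 2 || !owner || !repo) {
    throw new Error(`Expected "owner/repo", got "${repoFullName}"`);
  }
  // Encode each segment so unexpected characters cannot alter the URL structure.
  const path = `github.com/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}.git`;
  return {
    url: `https://x-access-token:${encodeURIComponent(installationToken)}@${path}`,
    safeUrl: `https://x-access-token:***@${path}`,
    options: ["--depth", "1", "--single-branch"],
  };
}
```

Logging `safeUrl` rather than `url` keeps the installation token out of job logs and error reports.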
Stage 2: File Scanning
After cloning, the worker scans the repository and filters files by supported types:
- TypeScript (.ts, .tsx)
- JavaScript (.js, .jsx)
- Markdown (.md, .mdx)
- JSON (package.json, tsconfig.json, etc.)
Files in node_modules, .git, dist, build, and other generated directories are excluded.
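The filtering rule above can be sketched as a small predicate over repository-relative paths. This is an illustration under assumptions: the constant and function names are hypothetical, every `.json` file is treated as supported (the page only names package.json and tsconfig.json), and paths are assumed to use forward slashes.

```typescript
// Assumed lists: they mirror this page; "other generated directories" are not enumerated here.
const SUPPORTED_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".md", ".mdx", ".json"];
const EXCLUDED_DIRS = ["node_modules", ".git", "dist", "build"];

// Returns true when a repository-relative path should be passed on to parsing.
function isIndexable(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1] ?? "";
  const dirs = segments.slice(0, -1);

  // Reject the file if any parent directory is excluded, at any depth.
  if (dirs.some((dir) => EXCLUDED_DIRS.indexOf(dir) !== -1)) return false;

  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".gitignore"
  return SUPPORTED_EXTENSIONS.indexOf(fileName.slice(dot).toLowerCase()) !== -1;
}
```

Checking every path segment, rather than only the top-level directory, also excludes nested cases such as `packages/app/node_modules/`.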
Stage 3: AST Parsing
For TypeScript and JavaScript files, CodePilot uses ts-morph to perform AST (Abstract Syntax Tree) analysis. This identifies structural elements:
- Functions and arrow functions
- Classes and methods
- React components
- Custom hooks
- Type definitions and interfaces
- Module exports
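import { Project } from "ts-morph";

// Assumed chunk shape, declared here so the snippet type-checks on its own
interface CodeChunk {
  type: string;
  name: string | undefined; // getName() is undefined for anonymous functions
  content: string;
  startLine: number;
  endLine: number;
}
const chunks: CodeChunk[] = [];

const project = new Project();
const sourceFile = project.addSourceFileAtPath(filePath);

// Extract top-level function declarations
const functions = sourceFile.getFunctions();
for (const fn of functions) {
  chunks.push({
    type: "function",
    name: fn.getName(),
    content: fn.getFullText(),
    startLine: fn.getStartLineNumber(),
    endLine: fn.getEndLineNumber(),
  });
}
Stage 4: Chunking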
Code is split into semantic chunks based on AST analysis. See Chunking Strategy for details.
Stage 5: Embedding
Each chunk is sent to Ollama's nomic-embed-text model to generate a 768-dimensional vector. See Embeddings for details.
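A minimal sketch of the request and response handling for this stage, assuming Ollama's `POST /api/embed` endpoint, which accepts a `model` and an `input` array and returns an `embeddings` array. The base URL is Ollama's default local address; the helper names and the batching of several chunks into one request are assumptions for illustration.

```typescript
// Assumptions: default local Ollama address; helper names are hypothetical.
const OLLAMA_URL = "http://localhost:11434";
const EMBEDDING_DIMENSIONS = 768; // nomic-embed-text output size, per this page

// Build the URL and JSON body for one batch of chunk texts.
function buildEmbedRequest(texts: string[]): { url: string; body: string } {
  return {
    url: `${OLLAMA_URL}/api/embed`,
    body: JSON.stringify({ model: "nomic-embed-text", input: texts }),
  };
}

// Validate a parsed response body and return one vector per input text.
function parseEmbedResponse(body: unknown, expectedCount: number): number[][] {
  if (typeof body !== "object" || body === null) {
    throw new Error("Embedding response is not an object");
  }
  const embeddings = (body as { embeddings?: unknown }).embeddings;
  if (!Array.isArray(embeddings) || embeddings.length !== expectedCount) {
    throw new Error(`Expected ${expectedCount} embeddings`);
  }
  for (const vector of embeddings) {
    if (!Array.isArray(vector) || vector.length !== EMBEDDING_DIMENSIONS) {
      throw new Error(`Expected ${EMBEDDING_DIMENSIONS}-dimensional vectors`);
    }
  }
  return embeddings as number[][];
}
```

The request would typically be sent with `fetch(url, { method: "POST", body })` and the parsed JSON passed to `parseEmbedResponse`. Checking the dimension before storage surfaces a model mismatch early, instead of as a database error later.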
Stage 6: Storage
Vectors are stored in PostgreSQL using the pgvector extension. See Vector Database for details.
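To make the storage step concrete, here is a sketch of what a pgvector-backed table and its queries could look like. The table name, column names, and constants are hypothetical; the actual CodePilot schema is described on the Vector Database page. The `vector(768)` type, the `[x,y,z]` text form, and the `<=>` cosine-distance operator are provided by pgvector.

```typescript
// Hypothetical schema: table and column names are assumptions for illustration.
const CREATE_CHUNKS_TABLE = `
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS code_chunks (
    id            BIGSERIAL PRIMARY KEY,
    repository_id TEXT NOT NULL,
    file_path     TEXT NOT NULL,
    content       TEXT NOT NULL,
    embedding     vector(768) NOT NULL
  );
`;

// Parameterized insert; $4 is the vector in pgvector's text form.
const INSERT_CHUNK = `
  INSERT INTO code_chunks (repository_id, file_path, content, embedding)
  VALUES ($1, $2, $3, $4::vector)
`;

// Nearest-neighbor search by cosine distance (smaller means more similar).
const SEARCH_CHUNKS = `
  SELECT file_path, content
  FROM code_chunks
  WHERE repository_id = $1
  ORDER BY embedding <=> $2::vector
  LIMIT $3
`;

// Serialize a numeric vector into pgvector's text form, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((x) => !isFinite(x))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}
```

Passing the vector as a bound parameter with a `::vector` cast keeps the statements parameterized, so no values are interpolated into the SQL text.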
Job Structure
Ingestion jobs contain all the metadata needed by the worker:
interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string; // "owner/repo"
  defaultBranch: string; // "main"
}
Jobs include built-in retry logic, priority queuing, and stall detection via BullMQ.
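As a rough sketch of how retries and priority could be configured, the helper below assembles the arguments for a BullMQ job. The `attempts`, `backoff`, `priority`, and `removeOnComplete` fields are standard BullMQ job options; the job name, the numeric values, and the helper itself are assumptions, not CodePilot's actual settings.

```typescript
// Repeated from the interface above so this block is self-contained.
interface IngestJobData {
  repositoryId: string;
  installationId: string;
  repoFullName: string; // "owner/repo"
  defaultBranch: string; // "main"
}

// Hypothetical helper: the job name and all numeric values are illustrative.
function buildIngestJob(data: IngestJobData, priority: number = 10) {
  return {
    name: "ingest-repository",
    data,
    opts: {
      attempts: 3, // retry a failed job up to three times in total
      backoff: { type: "exponential", delay: 5000 }, // wait longer after each failure
      priority, // in BullMQ, lower numbers are processed first
      removeOnComplete: true, // drop finished jobs from Redis
    },
  };
}
```

With BullMQ, the result would be passed as `queue.add(job.name, job.data, job.opts)`. Stall detection is configured on the worker rather than on the job, through worker options such as `stalledInterval` and `maxStalledCount`.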
Incremental Ingestion
For repositories that are already indexed, CodePilot supports incremental ingestion triggered by GitHub push webhooks:
- Compare the commit diff to identify changed files
- Remove existing chunks and embeddings for changed/deleted files
- Re-parse and re-embed only the changed files
- Insert updated vectors into the database
Because only changed files are re-parsed and re-embedded, this is significantly faster than full re-ingestion and keeps the index current.
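The steps above can be sketched as a planning function over the commits in a GitHub push webhook payload, where each commit lists its `added`, `modified`, and `removed` paths. The helper name and the `IngestPlan` shape are hypothetical, and the sketch assumes the commits arrive oldest first; CodePilot may derive the changed files from a commit comparison instead.

```typescript
// Subset of a commit entry in GitHub's push webhook payload.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

// Hypothetical plan shape for one incremental run.
interface IngestPlan {
  toDelete: string[];  // files whose existing chunks and embeddings are removed
  toReindex: string[]; // files that are re-parsed and re-embedded
}

function planIncrementalIngest(commits: PushCommit[]): IngestPlan {
  // Track the final state of each path; later commits override earlier ones.
  const finalState: Record<string, "changed" | "removed"> = {};
  for (const commit of commits) {
    for (const path of commit.added.concat(commit.modified)) {
      finalState[path] = "changed";
    }
    for (const path of commit.removed) {
      finalState[path] = "removed";
    }
  }

  const paths = Object.keys(finalState).sort();
  return {
    // Stale chunks are removed for every touched file, changed or deleted.
    toDelete: paths,
    // Only files that still exist after the push are embedded again.
    toReindex: paths.filter((path) => finalState[path] === "changed"),
  };
}
```

Resolving each path to its final state means a file that is added in one commit and deleted in a later commit of the same push is never re-embedded.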