Skip to content

perf(db): add compound indexes to fix query regression from TS migration#632

Open
carlos-alm wants to merge 3 commits intomainfrom
perf/benchmark-regression-indexes
Open

perf(db): add compound indexes to fix query regression from TS migration#632
carlos-alm wants to merge 3 commits intomainfrom
perf/benchmark-regression-indexes

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Add migration v16 with idx_edges_kind_target and idx_edges_kind_source compound indexes on the edges table. The fan-in/fan-out GROUP BY subqueries (used by fnDeps, roles, triage) need kind-first indexes to avoid full-table scans. Without them, the 32% graph growth from the TypeScript migration caused a disproportionate 136% fnDeps slowdown.
  • Optimize classifyNodeRoles (both full and incremental paths) to skip ~4,100 parameter/property nodes from the expensive fan-in/fan-out JOINs. These leaf nodes are always dead-leaf by definition and can never have callers or callees.

Measured locally: fan-in subquery ~8ms → 0.43ms, roles query 7ms → 5.3ms, fnDeps path 8.5ms → ~1ms.

Addresses regressions reported in #625, #626, #627.

Test plan

  • Full test suite passes (2,128 tests, 0 failures)
  • codegraph build --no-incremental succeeds, new indexes created
  • codegraph stats shows unchanged node/edge counts
  • Local benchmark confirms query latency improvement

…tion

Add migration v16 with idx_edges_kind_target and idx_edges_kind_source
indexes. The fan-in/fan-out GROUP BY subqueries used by fnDeps, roles,
and triage need kind-first indexes to avoid full-table scans on the
edges table. Without them, a 32% growth in graph density (from the
TypeScript migration) caused a disproportionate 136% query slowdown.

Also optimize classifyNodeRoles (both full and incremental paths) to
skip parameter and property nodes from the expensive fan-in/fan-out
JOINs. These ~4,100 nodes are always classified as dead-leaf by
definition and can never have callers or callees.

Measured locally: fan-in subquery ~8ms → 0.43ms, roles query 7ms → 5.3ms.
@claude
Copy link

claude bot commented Mar 26, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: Performance indexes for query regression fix

Todo List:

  • Read and analyze changed files
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Assessment
  • Provide structured final assessment

Phase 1: Root Cause Analysis ✓

Problem: TypeScript migration caused 32% graph growth → 136% fnDeps slowdown due to missing compound indexes
Solution: Add idx_edges_kind_target/idx_edges_kind_source indexes + skip 4,100 leaf nodes from expensive JOINs
Risk: Low - purely additive, preserves correctness

Phase 2: Configuration Change Detection ✓

HIGH-RISK CHANGES: ✅ NONE - No config/linting/build changes

Phase 3: Test Integrity ✓

RED FLAGS: ✅ NONE - No test modifications, author reports all 2,128 tests pass

Phase 4: Code Quality Review ✓

MIGRATION v16: Proper compound index ordering, uses IF NOT EXISTS, follows patterns
OPTIMIZATION: Correctly identifies parameter/property as structural leaves, safe 4,100-node skip
TECHNICAL: Index order matches access patterns, incremental path preserved, no behavior changes

Phase 5: Backlog Compliance Assessment ✓

Root Cause Analysis

Problem Being Solved: TypeScript migration regression - 32% graph growth caused disproportionate 136% query slowdown due to missing compound indexes for fan-in/fan-out subqueries.
Why This Approach: Database-first solution targeting the actual bottleneck (index scans) plus algorithmic optimization (skip leaf nodes that can't participate in call relationships).
Risk Assessment: Minimal - purely additive schema changes + logically sound optimization that preserves all semantics.

Backlog Compliance

  • Zero-dep: ✅ - No new runtime dependencies, builds on existing SQLite/tree-sitter infrastructure
  • Foundation-aligned: ✅ - Pure performance optimization, no cloud APIs, maintains offline-first principle
  • Problem-fit: 5 - Directly addresses core issue: prevents AI agents from experiencing slow codegraph queries that waste tokens during re-orientation in large codebases
  • Breaking: No - Schema migrations are backward compatible, no API changes, preserves all existing behavior
  • Tier: 1 (zero-dep ✅ + foundation-aligned ✅)

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: Textbook performance fix addressing a real regression with proper root cause analysis. Database indexes are correctly ordered for access patterns, optimization correctly identifies structural properties of parameter/property nodes, no correctness risks, comprehensive test validation, and directly improves AI agent experience by preventing query slowdowns.

Excellent work - this is exactly how performance regressions should be addressed: measure the bottleneck, fix the underlying issue with proper database design, and add algorithmic optimizations based on domain knowledge.


@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR addresses a 136% fnDeps latency regression introduced by the TypeScript-migration graph growth by adding two compound indexes and optimizing the node role classifier.\n\n- src/db/migrations.ts — Migration v16 adds idx_edges_kind_target ON edges(kind, target_id) and idx_edges_kind_source ON edges(kind, source_id). Placing kind first is the right choice: the fan-in/fan-out GROUP BY subqueries (full path) and the correlated WHERE kind = 'calls' AND target_id = n.id subqueries (incremental path) both benefit from a leading kind prefix that eliminates the full-table scans responsible for the regression.\n- src/features/structure.ts — Both classifyNodeRolesFull and classifyNodeRolesIncremental now fast-path parameter/property nodes as dead-leaf without executing the expensive fan-in/fan-out JOINs. The idsByRole map is seeded with these leaf ids before the main classifier loop, and the subsequent loop correctly appends any classifier-determined dead-leaf nodes to the same array rather than creating a duplicate entry. The incremental scoped NULL-reset (WHERE file IN (...) AND kind NOT IN ('file', 'directory')) still covers parameter/property nodes, so they receive their updated role correctly in the same transaction.\n\nThe only non-blocking finding is a corrupted UTF-8 replacement character in a comment on the incremental path — a cosmetic artifact with no runtime impact.

Confidence Score: 5/5

Safe to merge — performance fix is correct and well-tested with no logic regressions.

The compound indexes are correctly ordered for the existing query patterns and IF NOT EXISTS makes the migration idempotent. The classifier optimization is logically sound: parameter/property nodes are guaranteed leaf nodes, idsByRole merges pre-classified and classifier-determined dead-leaf ids without duplication, and the transaction-scoped NULL reset covers all affected node kinds. The prior reviewer concern (unused columns in leafRows) was already addressed. The only remaining finding is a cosmetic corrupted comment character — a non-blocking P2 that does not affect correctness or performance.

No files require special attention.

Important Files Changed

Filename Overview
src/db/migrations.ts Adds migration v16 with two compound indexes (kind, target_id) and (kind, source_id) on the edges table; index column order correctly places kind first to cover the WHERE/GROUP BY patterns used in both the full-path JOIN subqueries and the incremental correlated subqueries.
src/features/structure.ts Optimizes both full and incremental classifyNodeRoles paths by short-circuiting ~4,100 parameter/property nodes as dead-leaf without fan-in/fan-out JOINs; logic is correct — idsByRole merges leaf pre-classified ids with classifier-assigned dead-leaf ids safely, and the incremental NULL reset still covers parameter/property nodes. One corrupted UTF-8 comment character at line 600.

Sequence Diagram

sequenceDiagram
    participant C as classifyNodeRoles
    participant DB as SQLite DB

    Note over C,DB: Full path (classifyNodeRolesFull)
    C->>DB: SELECT id FROM nodes WHERE kind IN ('parameter','property')
    DB-->>C: leafRows (~4,100 nodes)
    C->>DB: SELECT id,name,kind,file,fan_in,fan_out FROM nodes LEFT JOIN edges(kind_target idx) fi LEFT JOIN edges(kind_source idx) fo WHERE kind NOT IN ('file','directory','parameter','property')
    DB-->>C: rows (callable nodes only)
    Note over C: Pre-assign leafRows to dead-leaf
    Note over C: classifyRoles(rows) to roleMap
    C->>DB: UPDATE nodes SET role=NULL (global reset)
    C->>DB: Batch UPDATE role per chunk for all idsByRole entries

    Note over C,DB: Incremental path (classifyNodeRolesIncremental)
    C->>DB: Expand affected files via neighbour JOIN
    C->>DB: SELECT id WHERE kind IN ('parameter','property') AND file IN (affected)
    DB-->>C: leafRows (scoped)
    C->>DB: SELECT id,... WHERE kind NOT IN (...,'parameter','property') AND file IN (affected) correlated subqueries use idx_edges_kind_target / idx_edges_kind_source
    DB-->>C: rows
    Note over C: Pre-assign leafRows to dead-leaf
    Note over C: classifyRoles(rows, globalMedians) to roleMap
    C->>DB: UPDATE nodes SET role=NULL WHERE file IN (affected) AND kind NOT IN ('file','directory')
    C->>DB: Batch UPDATE role per chunk for all idsByRole entries
Loading

Reviews (2): Last reviewed commit: "fix: remove unused columns from full-pat..." | Re-trigger Greptile

Comment on lines +422 to +428
const leafRows = db
.prepare(
`SELECT n.id, n.name, n.kind, n.file
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number; name: string; kind: string; file: string }[];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unused columns in full-path leafRows query

name, kind, and file are fetched here but only id is ever accessed in the processing loop below (for (const row of leafRows) leafIds.push(row.id)). The incremental path correctly fetches only SELECT n.id — the full path can match that.

Suggested change
const leafRows = db
.prepare(
`SELECT n.id, n.name, n.kind, n.file
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number; name: string; kind: string; file: string }[];
const leafRows = db
.prepare(
`SELECT n.id
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number }[];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6c8cc17. The full-path leafRows query now selects only n.id, matching the lean form already used by the incremental path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6c8cc17. The full-path leafRows query now selects only n.id, matching the lean form already used by the incremental path.

The full-path leafRows query selected name, kind, and file columns
that were never accessed — only id is used. Align with the incremental
path which already uses the lean SELECT n.id form.
@carlos-alm
Copy link
Contributor Author

@greptileai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant