# Skill Evaluation Test Harness

## Quick Start

```bash
cd tests
pnpm install
pnpm harness --list                                # List available skills
pnpm harness azure-ai-projects-py --mock --verbose # Run evaluation
pnpm test                                          # Run unit tests
```

## Overview

A TypeScript test framework for evaluating AI-generated code against acceptance criteria defined in skill files. Powered by the GitHub Copilot SDK.

Workflow:

1. Load acceptance criteria from `.github/skills/<skill>/references/acceptance-criteria.md`
2. Run test scenarios from `tests/scenarios/<skill>/scenarios.yaml`
3. Generate code using the GitHub Copilot SDK (or mock responses)
4. Evaluate code against correct/incorrect patterns
5. Report results via console, markdown, or JSON

## Architecture

```
tests/
├── harness/
│   ├── types.ts              # Type definitions
│   ├── criteria-loader.ts    # Parses acceptance-criteria.md
│   ├── evaluator.ts          # Validates code against patterns
│   ├── copilot-client.ts     # Wraps Copilot SDK (with mock fallback)
│   ├── runner.ts             # Main CLI runner
│   ├── ralph-loop.ts         # Iterative improvement loop
│   ├── feedback-builder.ts   # LLM-actionable feedback generator
│   ├── index.ts              # Package exports
│   └── reporters/
│       ├── console.ts        # Pretty console output
│       └── markdown.ts       # Markdown report generation
│
├── scenarios/
│   └── <skill-name>/
│       └── scenarios.yaml    # Test scenarios for the skill
│
├── fixtures/                 # Test fixtures
├── package.json              # Dependencies (pnpm)
├── tsconfig.json             # TypeScript config
└── vitest.config.ts          # Test configuration
```

## CLI Usage

```bash
# Basic usage
pnpm harness <skill-name>

# Options (combine as needed):
#   --mock                  Use mock responses (no Copilot SDK)
#   --verbose               Show detailed output
#   --filter basic          Filter scenarios by name/tag
#   --output json           Output format (text/json)
#   --output-file report.json
pnpm harness azure-ai-projects-py --mock --verbose --output json --output-file report.json

# Ralph Loop (iterative improvement):
#   --ralph                 Enable iterative improvement
#   --max-iterations 5      Max iterations per scenario
#   --threshold 80          Quality threshold (0-100)
pnpm harness azure-ai-projects-py --ralph --max-iterations 5 --threshold 80
```

## Ralph Loop

The Ralph Loop iteratively improves generated code: each failing attempt is re-generated with feedback until a quality threshold is met:

```
Generate → Evaluate → Analyze → Re-generate (with feedback)
    ↑                                    │
    └────────────────────────────────────┘
         (Loop until threshold met)
```

Stop conditions:

- Quality threshold met (default: 80)
- Perfect score (100)
- Max iterations reached (default: 5)
- No improvement between iterations
- Score regression
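
A minimal sketch of how these conditions might compose, assuming a plain score history (names here are illustrative; the actual logic lives in `harness/ralph-loop.ts`):

```typescript
// Illustrative only: field and function names are assumptions, not the
// harness's real API (see harness/ralph-loop.ts and harness/types.ts).
interface LoopState {
  iteration: number; // 1-based count of completed iterations
  scores: number[];  // score history, most recent last
}

function shouldStop(state: LoopState, maxIterations = 5, threshold = 80): boolean {
  const current = state.scores[state.scores.length - 1];
  const previous = state.scores[state.scores.length - 2];
  if (current >= 100) return true;                   // perfect score
  if (current >= threshold) return true;             // quality threshold met
  if (state.iteration >= maxIterations) return true; // max iterations reached
  if (previous !== undefined && current <= previous) {
    return true; // no improvement between iterations, or a regression
  }
  return false;
}
```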

## Programmatic Usage

```typescript
import {
  AcceptanceCriteriaLoader,
  CodeEvaluator,
  SkillEvaluationRunner,
  RalphLoopController,
  createRalphConfig,
} from './harness';

// Simple evaluation
const loader = new AcceptanceCriteriaLoader();
const criteria = loader.load('azure-ai-projects-py');
const evaluator = new CodeEvaluator(criteria);

const result = evaluator.evaluate(code, 'my-test');
console.log(`Score: ${result.score}`);

// Full runner
const runner = new SkillEvaluationRunner({ useMock: true });
const summary = await runner.run('azure-ai-projects-py');

// With Ralph Loop
const ralphSummary = await runner.runWithLoop('azure-ai-projects-py', undefined, {
  maxIterations: 5,
  qualityThreshold: 80,
});
```
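
Between Ralph Loop iterations, `feedback-builder.ts` turns evaluation findings into an LLM-actionable prompt for the next generation pass. A hedged sketch of that step (the function and the finding shape are assumptions, not the real exports):

```typescript
// Hypothetical helper: the real generator is harness/feedback-builder.ts,
// and the finding shape here is an assumption.
interface Finding {
  severity: 'error' | 'warning';
  message: string;
}

function buildFeedback(findings: Finding[]): string {
  const errors = findings.filter((f) => f.severity === 'error');
  return [
    'The previous attempt failed these checks:',
    ...errors.map((f) => `- ${f.message}`),
    'Regenerate the code and fix each item above.',
  ].join('\n');
}
```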

## Adding Tests for a New Skill

### 1. Create Acceptance Criteria

Create `.github/skills/<skill-name>/references/acceptance-criteria.md`:

````markdown
# Acceptance Criteria: skill-name

## Imports

### ✅ Correct
```python
from azure.ai.mymodule import MyClient
```

### ❌ Incorrect
```python
from azure.ai.mymodule.models import MyClient  # Wrong location
```
````
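
The loader in `criteria-loader.ts` parses these ✅/❌ fenced blocks into pattern lists for the evaluator. A sketch of the kind of structure it plausibly produces (the type name and fields are assumptions; the real definitions are in `harness/types.ts`):

```typescript
// Assumed shape only; see harness/types.ts for the authoritative types.
interface CriteriaSection {
  heading: string;     // e.g. "Imports"
  correct: string[];   // code blocks under "### ✅ Correct"
  incorrect: string[]; // code blocks under "### ❌ Incorrect"
}
```

Matches against `correct` snippets add to the score, while matches against `incorrect` ones subtract from it (see Evaluation Scoring below).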

### 2. Create Test Scenarios

Create `tests/scenarios/<skill-name>/scenarios.yaml`:

```yaml
config:
  model: gpt-4
  max_tokens: 2000
  temperature: 0.3

scenarios:
  - name: basic_usage
    prompt: |
      Create a basic example using the SDK.
    expected_patterns:
      - "DefaultAzureCredential"
    forbidden_patterns:
      - "hardcoded-endpoint"
    tags:
      - basic
    mock_response: |
      from azure.identity import DefaultAzureCredential
      # ... working example
```
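
Each YAML entry corresponds to one scenario object in the harness. A sketch of a plausible TypeScript shape, mirroring the fields above (the authoritative definition lives in `harness/types.ts`):

```typescript
// Assumed shape, mirroring scenarios.yaml; harness/types.ts is authoritative.
interface Scenario {
  name: string;                  // e.g. "basic_usage"
  prompt: string;                // sent to the Copilot SDK
  expected_patterns?: string[];  // patterns the generated code must contain
  forbidden_patterns?: string[]; // patterns that must not appear
  tags?: string[];               // matched by --filter
  mock_response?: string;        // returned instead of a real SDK call under --mock
}
```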

### 3. Run Tests

```bash
pnpm harness <skill-name> --mock --verbose
pnpm test
```

## Evaluation Scoring

| Factor | Impact |
| --- | --- |
| Syntax error | -100 |
| Incorrect pattern found | -15 each |
| Error finding | -20 each |
| Warning finding | -5 each |
| Correct pattern matched | +5 each |

A result passes if it has no error-severity findings.
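
As a worked example of the table, a hedged sketch of the arithmetic (the baseline of 100 and the clamp to [0, 100] are assumptions; `evaluator.ts` is authoritative):

```typescript
// Illustrative only: the deltas come from the table above; the base score
// of 100 and the [0, 100] clamp are assumptions (see harness/evaluator.ts).
interface FindingCounts {
  syntaxErrors: number;
  incorrectPatterns: number;
  errorFindings: number;
  warningFindings: number;
  correctPatterns: number;
}

function score(c: FindingCounts): { score: number; passed: boolean } {
  let s = 100
    - c.syntaxErrors * 100
    - c.incorrectPatterns * 15
    - c.errorFindings * 20
    - c.warningFindings * 5
    + c.correctPatterns * 5;
  s = Math.max(0, Math.min(100, s));
  // Passing depends only on severity, not on the numeric score.
  return { score: s, passed: c.errorFindings === 0 };
}
```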

## Test Coverage

123 skills with 1114 test scenarios

| Language | Skills | Scenarios |
| --- | --- | --- |
| Core | 5 | 51 |
| Python | 41 | 333 |
| .NET | 28 | 286 |
| TypeScript | 23 | 249 |
| Java | 26 | 195 |

```bash
pnpm harness --list  # See all available skills
```

## Troubleshooting

| Issue | Solution |
| --- | --- |
| No skills found | Check `acceptance-criteria.md` exists in `references/` |
| Copilot SDK unavailable | Use the `--mock` flag or set up PAT authentication (see below) |
| Tests fail with real Copilot | Mock responses are hand-crafted; review criteria flexibility |

## Real SDK Evaluation

The harness supports two authentication methods for real Copilot SDK evaluation:

### Local Development (Copilot CLI)

1. Install the Copilot CLI: `npm install -g @github/copilot`
2. Run `copilot` and authenticate via `/login`
3. Run without `--mock`: `pnpm harness azure-ai-projects-py --verbose`

### CI/CD (PAT Authentication)

For automated pipelines, use a Personal Access Token:

1. Create a fine-grained PAT at https://github.com/settings/personal-access-tokens/new
2. Add the "Copilot Requests" permission
3. Set the token as the environment variable `GH_TOKEN` or `GITHUB_TOKEN`:

```bash
export GH_TOKEN="your-pat-with-copilot-requests-permission"
pnpm harness azure-ai-projects-py --verbose
```
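
A sketch of the token lookup the client plausibly performs (the exact precedence and the mock fallback are assumptions; see `copilot-client.ts` for the actual behavior):

```typescript
// Assumption: GH_TOKEN is preferred over GITHUB_TOKEN, and the client
// falls back to mock responses when neither is set (copilot-client.ts
// is authoritative).
const token = process.env.GH_TOKEN ?? process.env.GITHUB_TOKEN;
if (!token) {
  console.warn('No Copilot token found; run with --mock or authenticate.');
}
```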

## GitHub Actions Workflows

| Workflow | Trigger | Mode | Purpose |
| --- | --- | --- | --- |
| `test-harness.yml` | PR, push to main | Mock | Fast, deterministic CI |
| `skill-evaluation.yml` | Nightly, manual | Real SDK | Quality measurement |

To enable real SDK evaluation in GitHub Actions:

1. Create the repository secret `COPILOT_TOKEN` with a PAT that has the "Copilot Requests" permission
2. Set the repository variable `ENABLE_REAL_EVAL=true`
3. Trigger manually via the Actions tab, or wait for the nightly run