cd tests
pnpm install
pnpm harness --list # List available skills
pnpm harness azure-ai-projects-py --mock --verbose # Run evaluation
pnpm test # Run unit testsA TypeScript test framework for evaluating AI-generated code against acceptance criteria defined in skill files. Powered by the GitHub Copilot SDK.
Workflow:
- Load acceptance criteria from
.github/skills/<skill>/references/acceptance-criteria.md - Run test scenarios from
tests/scenarios/<skill>/scenarios.yaml - Generate code using GitHub Copilot SDK (or mock responses)
- Evaluate code against correct/incorrect patterns
- Report results via console, markdown, or JSON
tests/
├── harness/
│ ├── types.ts # Type definitions
│ ├── criteria-loader.ts # Parses acceptance-criteria.md
│ ├── evaluator.ts # Validates code against patterns
│ ├── copilot-client.ts # Wraps Copilot SDK (with mock fallback)
│ ├── runner.ts # Main CLI runner
│ ├── ralph-loop.ts # Iterative improvement loop
│ ├── feedback-builder.ts # LLM-actionable feedback generator
│ ├── index.ts # Package exports
│ └── reporters/
│ ├── console.ts # Pretty console output
│ └── markdown.ts # Markdown report generation
│
├── scenarios/
│ └── <skill-name>/
│ └── scenarios.yaml # Test scenarios for the skill
│
├── fixtures/ # Test fixtures
├── package.json # Dependencies (pnpm)
├── tsconfig.json # TypeScript config
└── vitest.config.ts # Test configuration
# Basic usage
pnpm harness <skill-name>
# Options
pnpm harness azure-ai-projects-py \
--mock # Use mock responses (no Copilot SDK)
--verbose # Show detailed output
--filter basic # Filter scenarios by name/tag
--output json # Output format (text/json)
--output-file report.json
# Ralph Loop (iterative improvement)
pnpm harness azure-ai-projects-py \
--ralph # Enable iterative improvement
--max-iterations 5 # Max iterations per scenario
--threshold 80 # Quality threshold (0-100)The Ralph Loop enables iterative code improvement by re-generating code until quality thresholds are met:
Generate → Evaluate → Analyze → Re-generate (with feedback)
↑ │
└────────────────────────────────────┘
(Loop until threshold met)
Stop conditions:
- Quality threshold met (default: 80)
- Perfect score (100)
- Max iterations reached (default: 5)
- No improvement between iterations
- Score regression
import {
AcceptanceCriteriaLoader,
CodeEvaluator,
SkillEvaluationRunner,
RalphLoopController,
createRalphConfig,
} from './harness';
// Simple evaluation
const loader = new AcceptanceCriteriaLoader();
const criteria = loader.load('azure-ai-projects-py');
const evaluator = new CodeEvaluator(criteria);
const result = evaluator.evaluate(code, 'my-test');
console.log(`Score: ${result.score}`);
// Full runner
const runner = new SkillEvaluationRunner({ useMock: true });
const summary = await runner.run('azure-ai-projects-py');
// With Ralph Loop
const ralphSummary = await runner.runWithLoop('azure-ai-projects-py', undefined, {
maxIterations: 5,
qualityThreshold: 80,
});Create .github/skills/<skill-name>/references/acceptance-criteria.md:
# Acceptance Criteria: skill-name
## Imports
### ✅ Correct
\`\`\`python
from azure.ai.mymodule import MyClient
\`\`\`
### ❌ Incorrect
\`\`\`python
from azure.ai.mymodule.models import MyClient # Wrong location
\`\`\`Create tests/scenarios/<skill-name>/scenarios.yaml:
config:
model: gpt-4
max_tokens: 2000
temperature: 0.3
scenarios:
- name: basic_usage
prompt: |
Create a basic example using the SDK.
expected_patterns:
- "DefaultAzureCredential"
forbidden_patterns:
- "hardcoded-endpoint"
tags:
- basic
mock_response: |
from azure.identity import DefaultAzureCredential
# ... working examplepnpm harness <skill-name> --mock --verbose
pnpm test| Factor | Impact |
|---|---|
| Syntax error | -100 |
| Incorrect pattern found | -15 each |
| Error finding | -20 each |
| Warning finding | -5 each |
| Correct pattern matched | +5 each |
A result passes if it has no error-severity findings.
123 skills with 1114 test scenarios
| Language | Skills | Scenarios |
|---|---|---|
| Core | 5 | 51 |
| Python | 41 | 333 |
| .NET | 28 | 286 |
| TypeScript | 23 | 249 |
| Java | 26 | 195 |
pnpm harness --list # See all available skills| Issue | Solution |
|---|---|
| No skills found | Check acceptance-criteria.md exists in references/ |
| Copilot SDK unavailable | Use --mock flag or set up PAT authentication (see below) |
| Tests fail with real Copilot | Mock responses are hand-crafted; review criteria flexibility |
The harness supports two authentication methods for real Copilot SDK evaluation:
- Install Copilot CLI:
npm install -g @github/copilot - Run
copilotand authenticate via/login - Run without
--mock:pnpm harness azure-ai-projects-py --verbose
For automated pipelines, use a Personal Access Token:
- Create a fine-grained PAT at https://github.com/settings/personal-access-tokens/new
- Add the "Copilot Requests" permission
- Set the token as environment variable
GH_TOKENorGITHUB_TOKEN
export GH_TOKEN="your-pat-with-copilot-requests-permission"
pnpm harness azure-ai-projects-py --verbose| Workflow | Trigger | Mode | Purpose |
|---|---|---|---|
test-harness.yml |
PR, push to main | Mock | Fast, deterministic CI |
skill-evaluation.yml |
Nightly, manual | Real SDK | Quality measurement |
To enable real SDK evaluation in GitHub Actions:
- Create repository secret
COPILOT_TOKENwith PAT (Copilot Requests permission) - Set repository variable
ENABLE_REAL_EVAL=true - Trigger manually via Actions tab, or wait for nightly run