How Claude Code will automatically score every developer each week and each month — converting raw GitHub metrics into a fair, weighted composite across all four KPI categories, with clear rating bands and anti-gaming safeguards.
From raw metric to a single, fair number — in four steps.
| Step | What happens |
|---|---|
| 1 · Measure | Claude Code audits GitHub and pulls each raw metric for the period. |
| 2 · Normalize | Each metric is converted to a 0–100 score vs its target (direction-aware). |
| 3 · Weight | Metric scores roll up to a category score; categories roll up to a composite. |
| 4 · Band | The composite maps to a rating band and a trend arrow vs last period. |
Every metric becomes a 0–100 score. Direction matters: some metrics are better high, others better low.
Scores are capped at 100 — beating target earns a full mark, not bonus inflation. A floor of 0 applies. "Exceeds" is recognised at the band stage (§6), not by letting one metric exceed 100.
| Metric | Category | Direction | Default target | Weight in cat. |
|---|---|---|---|---|
| Cycle time | Productivity | Lower | ≤ 2 days | 35% |
| Merged PR throughput | Productivity | Higher | baseline | 25% |
| Time to first review | Productivity | Lower | ≤ 4 hrs | 25% |
| PR size (median) | Productivity | Lower | ≤ 400 LOC | 15% |
| Change failure rate | Quality | Lower | < 15% | 35% |
| Review pass rate | Quality | Higher | ≥ 85% | 30% |
| Test coverage (changed) | Quality | Higher | ≥ 80% | 20% |
| 21-day rework ratio | Quality | Lower | baseline | 15% |
| AI-assisted PR share | AI adoption | Higher | growth | 45% |
| Active usage (days/wk) | AI adoption | Higher | baseline | 30% |
| Workflow contributions | AI adoption | Higher | qualitative | 25% |
| CI pass rate | Delivery | Higher | ≥ 90% | 30% |
| Deploy frequency* | Delivery | Higher | ≥ weekly | 25% |
| Lead time for changes* | Delivery | Lower | baseline | 25% |
| MTTR on incidents* | Delivery | Lower | < 24 hrs | 20% |
*Delivery metrics are largely team-level; individuals inherit the team score for these, so no one is graded on a deploy they didn't own.
Category scores combine into one composite using the policy weights.
| Category | Weight | Reads as |
|---|---|---|
| Code quality | 30% | Correctness is the highest-weighted aspect |
| Productivity & velocity | 25% | Flow of shipped value |
| Delivery & reliability | 25% | DORA-style stability |
| AI adoption | 20% | Capability, leading indicator |
A fast pulse — flow and quality signals only. Coaching-grade, not a formal rating.
AI adoption is reported weekly but lightly weighted in the weekly score — it's a slower-moving capability better judged monthly.
The official composite — all four categories, plus trend and a light qualitative overlay.
| Dimension | Weekly | Monthly |
|---|---|---|
| Categories scored | 3 (subset) | All 4 |
| Trend factor | Arrow only | ±5 points |
| Qualitative overlay | No | Yes (≤5) |
| Used for | Coaching | Review record |
The composite maps to one of four bands.
Bands describe performance against target, not against peers — everyone can land in "Exceeds" in a strong month. "Developing" and below trigger a support plan, never an automatic penalty.
One developer, one week. Placeholder numbers showing the math end-to-end.
| Metric | Actual | Target | Metric score |
|---|---|---|---|
| Cycle time (lower) | 2.4 d | 2.0 d | 83 |
| Throughput (higher) | 6 PRs | 5 PRs | 100 (capped) |
| Time to first review (lower) | 5 hrs | 4 hrs | 80 |
| Change failure rate (lower) | 10% | 15% | 100 (within) |
| Review pass rate (higher) | 78% | 85% | 92 |
| CI pass rate (higher) | 88% | 90% | 98 |
Self-rated 90% vs audited 93% → small, well-calibrated gap. No flag.
How Claude Code carries the scoring out automatically — proposed, not yet built.
Every Friday EOD, the audit + weekly rubric runs per developer and team; outputs the weekly scorecard for Monday 1:1s.
First working day of the month, the four weekly results roll into the monthly composite + trend + qualitative; outputs the review-record scorecard.
# Claude Code command: /kpi-score (period = week | month)
For AUTHOR=[user] (or --team) over PERIOD:
1. Pull metrics via the /weekly-audit data.
2. Normalize each to 0-100 (direction-aware, capped at 100).
3. Roll up to category scores, then the weighted composite.
4. Apply pairing rule, small-sample flags, leave exclusions.
5. (month only) add trend ±5 and qualitative ≤5.
6. Map to band; emit scorecard: metric | actual | target | score,
category scores, composite, band, trend, self-vs-audited gap.