Proposal · For Director Review

KPI Assessment & Scoring Criteria

How Claude Code will automatically score every developer each week and each month — converting raw GitHub metrics into a fair, weighted composite across all four KPI categories, with clear rating bands and anti-gaming safeguards.

Prepared by: Paul
Audience: Directors (approval)
Status: Plan (not yet implemented)
Feeds from: Weekly KPI Reporting Process

1 How Scoring Works

From raw metric to a single, fair number — in four steps.

Step | What happens
1 · Measure | Claude Code audits GitHub and pulls each raw metric for the period.
2 · Normalize | Each metric is converted to a 0–100 score against its target (direction-aware).
3 · Weight | Metric scores roll up to a category score; categories roll up to a composite.
4 · Band | The composite maps to a rating band and a trend arrow vs last period.

⚠️ Plan for approval. Thresholds and weights below are starting defaults; they will be calibrated against a 4-week baseline before any score counts.

2 Per-Metric Scoring (0–100)

Every metric becomes a 0–100 score. Direction matters: some metrics are better high, others better low.

2.1 Normalization formulas

Higher-is-better   →   score = min(100, (actual / target) × 100)
Lower-is-better   →   score = min(100, (target / actual) × 100)
Threshold (pass/fail band)   →   100 if within target, scaled down by distance outside

Scores are capped at 100 — beating target earns a full mark, not bonus inflation. A floor of 0 applies. "Exceeds" is recognised at the band stage (§6), not by letting one metric exceed 100.
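
The direction-aware formulas above can be sketched in Python as follows. This is a minimal illustration, not the planned implementation: the function name `normalize` and the `direction` strings are hypothetical, and treating an actual of 0 on a lower-is-better metric as a full score is an assumption the proposal does not state.

```python
def normalize(actual: float, target: float, direction: str) -> float:
    """Convert a raw metric to a 0-100 score against its target."""
    if target <= 0:
        raise ValueError("target must be positive")
    if actual < 0:
        raise ValueError("actual must be non-negative")
    if direction == "higher":
        raw = actual / target * 100
    elif direction == "lower":
        # Assumption: an actual of 0 (e.g. zero failures) earns the full score.
        raw = 100.0 if actual == 0 else target / actual * 100
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    # Cap at 100 and floor at 0, as described above.
    return max(0.0, min(100.0, raw))


print(round(normalize(2.4, 2.0, "lower")))   # cycle time: 83
print(round(normalize(6, 5, "higher")))      # throughput: 100 (capped)
```

The threshold (pass/fail band) variant is left out because the proposal does not yet define how scores scale with distance outside the band.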

2.2 Metric scoring table

Metric | Category | Direction | Default target | Weight in category
Cycle time | Productivity | Lower | ≤ 2 days | 35%
Merged PR throughput | Productivity | Higher | baseline | 25%
Time to first review | Productivity | Lower | ≤ 4 hrs | 25%
PR size (median) | Productivity | Lower | ≤ 400 LOC | 15%
Change failure rate | Quality | Lower | < 15% | 35%
Review pass rate | Quality | Higher | ≥ 85% | 30%
Test coverage (changed) | Quality | Higher | ≥ 80% | 20%
21-day rework ratio | Quality | Lower | baseline | 15%
AI-assisted PR share | AI adoption | Higher | growth | 45%
Active usage (days/wk) | AI adoption | Higher | baseline | 30%
Workflow contributions | AI adoption | Higher | qualitative | 25%
CI pass rate | Delivery | Higher | ≥ 90% | 30%
Deploy frequency* | Delivery | Higher | ≥ weekly | 25%
Lead time for changes* | Delivery | Lower | baseline | 25%
MTTR on incidents* | Delivery | Lower | < 24 hrs | 20%

*Delivery metrics are largely team-level; individuals inherit the team score for these, so no one is graded on a deploy they didn't own.
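
A category score is the weighted average of its metric scores, using the in-category weights from the table. The sketch below is illustrative only: `category_score` and the metric keys are hypothetical names, and renormalizing the weights when only some metrics are available (as in a weekly subset) is an assumption rather than a stated rule.

```python
PRODUCTIVITY_WEIGHTS = {
    "cycle_time": 0.35,
    "throughput": 0.25,
    "time_to_first_review": 0.25,
    "pr_size": 0.15,
}


def category_score(scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 metric scores for one category."""
    # Assumption: weights are renormalized over the metrics actually present.
    present = {m: w for m, w in weights.items() if m in scores}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no weighted metrics present")
    return sum(scores[m] * w for m, w in present.items()) / total


full = category_score(
    {"cycle_time": 80, "throughput": 100, "time_to_first_review": 80, "pr_size": 100},
    PRODUCTIVITY_WEIGHTS,
)
print(round(full))  # 0.35*80 + 0.25*100 + 0.25*80 + 0.15*100 = 88
```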

3 Category Weights & Composite

Category scores combine into one composite using the policy weights.

Category | Weight | Reads as
Code quality | 30% | Correctness is the highest-weighted aspect
Productivity & velocity | 25% | Flow of shipped value
Delivery & reliability | 25% | DORA-style stability
AI adoption | 20% | Capability, leading indicator

Composite = 0.30·Quality + 0.25·Productivity + 0.25·Delivery + 0.20·AI adoption

Quality-weighted on purpose: in an AI workflow, velocity is cheap and correctness is the constraint. The weights stop high output from masking high defect rates.
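
As a minimal sketch of the composite formula, assuming category scores are already on the 0-100 scale (the dictionary keys and the `composite` function name are hypothetical):

```python
CATEGORY_WEIGHTS = {
    "quality": 0.30,
    "productivity": 0.25,
    "delivery": 0.25,
    "ai_adoption": 0.20,
}


def composite(category_scores: dict) -> float:
    """Weighted sum of the four category scores."""
    missing = CATEGORY_WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(CATEGORY_WEIGHTS[c] * category_scores[c] for c in CATEGORY_WEIGHTS)


# Placeholder numbers: perfect Productivity cannot fully offset weak Quality.
fast_but_buggy = composite(
    {"quality": 50, "productivity": 100, "delivery": 80, "ai_adoption": 80}
)
print(round(fast_but_buggy))  # 15 + 25 + 20 + 16 = 76
```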

4 Weekly Assessment Rubric (every week)

A fast pulse — flow and quality signals only. Coaching-grade, not a formal rating.

Weekly score = 0.40·Quality(subset) + 0.35·Productivity(subset) + 0.25·Delivery(subset)

AI adoption is reported weekly but carries no weight in the weekly score (the three weights above already sum to 1.00); it is a slower-moving capability better judged monthly.

5 Monthly Assessment Rubric (every month)

The official composite — all four categories, plus trend and a light qualitative overlay.

Monthly score = Composite (§3)  ±  trend(≤5)  +  qualitative(≤5)

Dimension | Weekly | Monthly
Categories scored | 3 (subset) | All 4
Trend factor | Arrow only | ±5 points
Qualitative overlay | No | Yes (≤5)
Used for | Coaching | Review record
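
The monthly adjustment can be sketched as below. Several details are assumptions, since the proposal leaves them for sign-off: the trend is clamped to ±5, the qualitative overlay is treated as a bonus between 0 and 5, and the final score is clamped back to 0-100. The names `clamp` and `monthly_score` are hypothetical.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def monthly_score(composite: float, trend: float, qualitative: float) -> float:
    """Composite plus capped trend and qualitative adjustments."""
    trend_adj = clamp(trend, -5.0, 5.0)       # trend can raise or lower the score
    qual_adj = clamp(qualitative, 0.0, 5.0)   # assumption: overlay only adds points
    # Assumption: the adjusted score stays on the 0-100 scale.
    return clamp(composite + trend_adj + qual_adj, 0.0, 100.0)


print(monthly_score(82, 3, 2))    # 87
print(monthly_score(82, -9, 0))   # trend capped at -5, giving 77
```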

6 Rating Bands

The composite maps to one of four bands.

90–100 | Exceeds
75–89 | Meets
60–74 | Developing
< 60 | Needs attention

Bands describe performance against target, not against peers — everyone can land in "Exceeds" in a strong month. "Developing" and below trigger a support plan, never an automatic penalty.
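
A band lookup might look like the sketch below. The function names are hypothetical, and comparing fractional scores directly against each band's lower bound (so 89.9 is still "Meets") is an assumption; the proposal states the bands in whole numbers only.

```python
# (lower bound, band name), checked from the highest band down.
BANDS = [(90, "Exceeds"), (75, "Meets"), (60, "Developing")]


def rating_band(score: float) -> str:
    """Map a 0-100 composite to its rating band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    return "Needs attention"


def triggers_support_plan(band: str) -> bool:
    """'Developing' and below trigger a support plan, never a penalty."""
    return band in ("Developing", "Needs attention")


print(rating_band(93))                          # Exceeds
print(triggers_support_plan(rating_band(68)))   # True
```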

7 Worked Example

One developer, one week. Placeholder numbers showing the math end-to-end.

Metric | Actual | Target | Metric score
Cycle time (lower) | 2.4 d | 2.0 d | 83
Throughput (higher) | 6 PRs | 5 PRs | 100 (capped)
Time to first review (lower) | 5 hrs | 4 hrs | 80
Change failure rate (lower) | 10% | 15% | 100 (within)
Review pass rate (higher) | 78% | 85% | 92
CI pass rate (higher) | 88% | 90% | 98

Subset scores use the in-category weights from §2.2, renormalized over the metrics present:

Quality(subset) = 0.54·100 + 0.46·92 ≈ 96
Productivity(subset) = (0.35·83 + 0.25·100 + 0.25·80) / 0.85 ≈ 87
Delivery(subset) = 98
Weekly = 0.40·96 + 0.35·87 + 0.25·98 ≈ 93 → Exceeds ▲

Self-rated 90% vs audited 93% → small, well-calibrated gap. No flag.
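
The example can be reproduced end-to-end with a short script. It is a sketch under stated assumptions: the subset weights are the §2.2 in-category weights renormalized over the metrics present (35/30 for Quality, 35/25/25 for Productivity), and all names are hypothetical. With these weights it yields Quality ≈ 96, Productivity ≈ 87, Delivery ≈ 98 and a weekly score of about 93.

```python
def normalize(actual: float, target: float, direction: str) -> float:
    """0-100 score against target, capped at 100."""
    ratio = actual / target if direction == "higher" else target / actual
    return max(0.0, min(100.0, ratio * 100))


# (actual, target, direction, in-category weight) using the placeholder numbers.
EXAMPLE = {
    "quality": [(10, 15, "lower", 0.35), (78, 85, "higher", 0.30)],
    "productivity": [
        (2.4, 2.0, "lower", 0.35),
        (6, 5, "higher", 0.25),
        (5, 4, "lower", 0.25),
    ],
    "delivery": [(88, 90, "higher", 0.30)],
}
WEEKLY_WEIGHTS = {"quality": 0.40, "productivity": 0.35, "delivery": 0.25}


def subset_score(rows) -> float:
    """Weighted average with weights renormalized over the rows given."""
    total_weight = sum(weight for _, _, _, weight in rows)
    weighted = sum(normalize(a, t, d) * w for a, t, d, w in rows)
    return weighted / total_weight


category = {name: subset_score(rows) for name, rows in EXAMPLE.items()}
weekly = sum(WEEKLY_WEIGHTS[name] * category[name] for name in WEEKLY_WEIGHTS)

for name, score in category.items():
    print(f"{name}: {score:.1f}")
print(f"weekly: {weekly:.1f}")
```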

8 Guardrails & Overrides

9 Automation & Cadence

How Claude Code carries the scoring out automatically — proposed, not yet built.

Weekly run

Every Friday EOD, the audit + weekly rubric runs per developer and team; outputs the weekly scorecard for Monday 1:1s.

Monthly run

First working day of the month, the four weekly results roll into the monthly composite + trend + qualitative; outputs the review-record scorecard.

9.1 Proposed scoring command

# Claude Code command: /kpi-score  (period = week | month)
For AUTHOR=[user] (or --team) over PERIOD:
  1. Pull metrics via the /weekly-audit data.
  2. Normalize each to 0-100 (direction-aware, capped at 100).
  3. Roll up to category scores, then the weighted composite.
  4. Apply pairing rule, small-sample flags, leave exclusions.
  5. (month only) add trend ±5 and qualitative ≤5.
  6. Map to band; emit scorecard: metric | actual | target | score,
     category scores, composite, band, trend, self-vs-audited gap.

Decisions for sign-off: confirm default targets & weights, the ±5 trend/qualitative caps, the band thresholds, and whether scoring runs on-demand first or as a scheduled job from day one.