Using Feature Flags to Fix CLI Score Accuracy: Solving the False-Fail Problem

Solving the false-fail problem in MMU's checklist by designing a condition marker system with 17 feature flags, and protecting score accuracy with 48 regression tests.

The False-Fail Problem

When early testers ran MMU’s checklist, one piece of feedback came up repeatedly:

“I don’t use Stripe, but I’m getting billing failures.”

Specifically:

Scenario	Checklist Item	User’s Actual State	Result	Problem
Stripe webhook verification	“Are you verifying webhook signatures?”	Not using Stripe	Fail	false-fail
Refund policy	“Is a refund policy published?”	Free project	Fail	false-fail
GDPR consent banner	“Is there a GDPR cookie consent banner?”	Korea-only service	Fail	false-fail
iOS guidelines	“Are you compliant with App Store review guidelines?”	Web only	Fail	false-fail

Of the 534 checklist items, roughly 120 were conditional: they apply only under specific circumstances. When a project that doesn’t use payments gets flagged for payment-related failures, the credibility of the entire checklist erodes.

graph TB
    A["534 checklist items"] --> B{"Does the item\napply to\nevery project?"}
    B -->|"No"| C["Conditional items\n~120"]
    B -->|"Yes"| D["Universal items\n~414"]

    C --> E["False-fails occur\n(item doesn't apply but fails)"]
    E --> F["Score distortion\nChecklist credibility drops"]

    style E fill:#ffebee,stroke:#f44336
    style F fill:#ffebee,stroke:#f44336

The Solution: Condition Markers

Instead of applying every item to every project, we added a marker to each conditional item that declares “this item is valid only under these conditions.”

Design Principles

Explicit declaration: Add a `when` field to each conditional item
User selects conditions: Input project characteristics when running the CLI
Non-applicable items are skipped: Treated as not-applicable, not fail
Excluded from scoring: Skipped items are removed from the denominator too

Condition Marker Data Structure

{
  "id": "billing-webhook-verification",
  "title": "Are you verifying webhook signatures?",
  "category": "billing",
  "priority": "P0",
  "when": {
    "has_billing": true,
    "payment_provider": ["stripe", "lemonsqueezy", "paddle"]
  }
}

{
  "id": "gdpr-cookie-consent",
  "title": "Is there a GDPR cookie consent banner?",
  "category": "legal",
  "priority": "P1",
  "when": {
    "target_region": ["eu", "global"]
  }
}
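
To make the `when` field concrete, here is a minimal sketch of how such markers could be evaluated. The semantics are assumptions, not MMU's confirmed implementation: every key in `when` must match (AND), an array lists the accepted values (OR), and a flag the user never set does not exclude an item. `matchesWhen` is a hypothetical helper, and `auth-session-expiry` is a made-up universal item.

```javascript
// Hypothetical helper: does one `when` block match the user's flags?
// Assumed semantics: all keys must match; arrays list accepted values;
// a flag that was never set does not exclude the item.
function matchesWhen(when, flags) {
  if (!when) return true; // universal item: no conditions declared
  return Object.entries(when).every(([key, expected]) => {
    const actual = flags[key];
    if (actual === undefined) return true; // unset flag: keep the item
    return Array.isArray(expected)
      ? expected.includes(actual)
      : actual === expected;
  });
}

// Sketch of a filter that splits item ids into applicable and skipped.
function filterItems(items, flags) {
  const result = { applicable: [], skipped: [] };
  for (const item of items) {
    const bucket = matchesWhen(item.when, flags) ? result.applicable : result.skipped;
    bucket.push(item.id);
  }
  return result;
}

const items = [
  {
    id: 'billing-webhook-verification',
    when: { has_billing: true, payment_provider: ['stripe', 'lemonsqueezy', 'paddle'] },
  },
  { id: 'gdpr-cookie-consent', when: { target_region: ['eu', 'global'] } },
  { id: 'auth-session-expiry' }, // made-up item with no `when` field
];

const demo = filterItems(items, { has_billing: false, target_region: 'kr' });
console.log(demo.applicable); // [ 'auth-session-expiry' ]
console.log(demo.skipped);    // [ 'billing-webhook-verification', 'gdpr-cookie-consent' ]
```

Under these assumptions, passing an empty flags object keeps every item applicable, which is one way to preserve the old run-everything behavior.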

17 Feature Flags

We consolidated the project characteristics into 17 flags.

#	Flag	Values	Scope of Impact
1	`has_billing`	true/false	48 billing-related items
2	`payment_provider`	stripe/lemonsqueezy/paddle/none	Provider-specific items
3	`has_auth`	true/false	32 authentication-related items
4	`auth_provider`	supabase/firebase/auth0/custom	Provider-specific items
5	`target_region`	kr/us/eu/global	Regulatory compliance items
6	`platform`	web/mobile/desktop/cli	Platform-specific items
7	`has_api`	true/false	API security, documentation items
8	`has_database`	true/false	DB backup, migration items
9	`has_file_upload`	true/false	File processing, storage items
10	`has_email`	true/false	Email sending, bounce handling items
11	`has_analytics`	true/false	Analytics integration items
12	`has_ci_cd`	true/false	CI/CD pipeline items
13	`is_oss`	true/false	Open source items (license, README)
14	`team_size`	solo/small/medium	Team-size-dependent items
15	`has_i18n`	true/false	Internationalization items
16	`has_realtime`	true/false	WebSocket, SSE items
17	`has_ai_features`	true/false	AI/LLM items (prompt injection, etc.)
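
The table above can also be written down as data and used to reject typos in user input. The sketch below copies the flag names and values from the table; the `validateFlags` function and its error messages are illustrative assumptions rather than part of the actual CLI.

```javascript
// Flag names and allowed values copied from the table above.
const BOOLEAN_FLAGS = [
  'has_billing', 'has_auth', 'has_api', 'has_database', 'has_file_upload',
  'has_email', 'has_analytics', 'has_ci_cd', 'is_oss', 'has_i18n',
  'has_realtime', 'has_ai_features',
];
const ENUM_FLAGS = {
  payment_provider: ['stripe', 'lemonsqueezy', 'paddle', 'none'],
  auth_provider: ['supabase', 'firebase', 'auth0', 'custom'],
  target_region: ['kr', 'us', 'eu', 'global'],
  platform: ['web', 'mobile', 'desktop', 'cli'],
  team_size: ['solo', 'small', 'medium'],
};

// Hypothetical validator: returns a list of problems (empty when input is valid).
function validateFlags(flags) {
  const errors = [];
  for (const [name, value] of Object.entries(flags)) {
    if (BOOLEAN_FLAGS.includes(name)) {
      if (typeof value !== 'boolean') errors.push(`${name} must be true or false`);
      continue;
    }
    const allowed = ENUM_FLAGS[name];
    if (!Array.isArray(allowed)) {
      errors.push(`unknown flag: ${name}`);
    } else if (!allowed.includes(value)) {
      errors.push(`${name} must be one of ${allowed.join('/')}`);
    }
  }
  return errors;
}

console.log(validateFlags({ has_billing: false, target_region: 'kr' })); // []
console.log(validateFlags({ target_region: 'jp' }));
// [ 'target_region must be one of kr/us/eu/global' ]
```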

How We Landed on 17

Flags Considered	Result
5	Insufficient coverage (only 50% of false-fails resolved)
10	Major scenarios covered (80% resolved)
17	Nearly all scenarios covered (95%+ resolved)
25+	Excessive user input burden, risk of abandonment

17 is the balance point between coverage and usability. Answering the 17 setup questions (mostly Y/N, a few multiple-choice) in the CLI takes about 2 minutes, and those 2 minutes eliminate 95%+ of false-fails.

CLI Flow Changes

Before (Without Flags)

$ npx make-me-unicorn

Make Me Unicorn - SaaS Launch Checklist

Running all 534 items...

[FAIL] Webhook signature verification missing
[FAIL] Refund policy not found
[FAIL] GDPR cookie consent banner missing
...

Score: 312/534 (58%)

User reaction: “I don’t have billing or EU users — why am I at 58%?”

After (With Flags)

$ npx make-me-unicorn

Make Me Unicorn - SaaS Launch Checklist

Quick setup (2 min):
  Has billing?        [Y/n] n
  Has authentication? [Y/n] y
  Target region?      [kr/us/eu/global] kr
  Platform?           [web/mobile/desktop/cli] web
  Has AI features?    [Y/n] n
  ...

Applying 17 flags -> 89 items skipped (not applicable)
Running 445 applicable items...

[PASS] Auth session expiry configured
[FAIL] Password reset flow missing
[SKIP] Webhook verification (no billing)
...

Score: 380/445 (85%)
(89 items skipped as not applicable)

User reaction: “85% — almost there. Just 65 items left to fix.”
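
The arithmetic behind that score line can be sketched as follows. `summarize` is a hypothetical helper, not the CLI's real code; it only illustrates that skipped items are left out of both the numerator and the denominator, using the counts from the run above.

```javascript
// Hypothetical scoring helper: SKIP results count toward neither
// the numerator nor the denominator.
function summarize(results) {
  const counts = { PASS: 0, FAIL: 0, SKIP: 0 };
  for (const status of results) counts[status] += 1;
  const applicable = counts.PASS + counts.FAIL;
  const percent = applicable === 0 ? 0 : Math.round((counts.PASS / applicable) * 100);
  return { passed: counts.PASS, applicable, skipped: counts.SKIP, percent };
}

// Rebuild the run shown above: 380 passes, 65 fails, 89 skips.
const results = [
  ...Array(380).fill('PASS'),
  ...Array(65).fill('FAIL'),
  ...Array(89).fill('SKIP'),
];

const summary = summarize(results);
console.log(`Score: ${summary.passed}/${summary.applicable} (${summary.percent}%)`);
// Score: 380/445 (85%)
console.log(`(${summary.skipped} items skipped as not applicable)`);
// (89 items skipped as not applicable)

// For contrast: leaving skipped items in the denominator understates the score.
const naivePercent = Math.round((summary.passed / results.length) * 100);
console.log(`Naive: ${summary.passed}/${results.length} (${naivePercent}%)`);
// Naive: 380/534 (71%)
```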

graph LR
    A["534 total items"] -->|"17 flags applied"| B["445 applicable items"]
    A -->|"17 flags applied"| C["89 skipped items"]

    B --> D["380 PASS"]
    B --> E["65 FAIL"]

    F["Before: 312/534 = 58%"] -->|"False-fails removed"| G["After: 380/445 = 85%"]

    style F fill:#ffebee,stroke:#f44336
    style G fill:#e8f5e9,stroke:#4caf50

Test Design: 48 Regression Tests

To verify the feature flag system works correctly, we wrote 48 test cases.

Test Categories

Category	Test Count	What It Verifies
Flag combinations	12	Correct items included/excluded for representative 17-flag combinations
Edge cases	8	All flags false, all flags true, single flag true
Score calculation	10	Skipped items correctly excluded from denominator
Backward compatibility	6	Running without flags still executes all 534 items
Condition conflicts	8	Correct interpretation when multiple `when` conditions overlap
Output formatting	4	Skipped items displayed correctly in reports
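
As an illustration of the edge-case row, the flag sets it describes (all false, all true, a single flag true) can be generated rather than written by hand. This is a sketch under assumptions: it covers only the 12 boolean flags, the helper names are hypothetical, and it produces more sets than the 8 edge-case tests in the real suite, so it does not reproduce that suite.

```javascript
// The 12 boolean flags from the flag table; the 5 multi-value flags stay unset.
const BOOLEAN_FLAGS = [
  'has_billing', 'has_auth', 'has_api', 'has_database', 'has_file_upload',
  'has_email', 'has_analytics', 'has_ci_cd', 'is_oss', 'has_i18n',
  'has_realtime', 'has_ai_features',
];

// Builds a flag object with every boolean flag set to `value`.
function allFlags(value) {
  return Object.fromEntries(BOOLEAN_FLAGS.map((name) => [name, value]));
}

// Returns named flag sets: all false, all true, and one set per single true flag.
function edgeCaseFlagSets() {
  const sets = [
    { name: 'all-false', flags: allFlags(false) },
    { name: 'all-true', flags: allFlags(true) },
  ];
  for (const name of BOOLEAN_FLAGS) {
    sets.push({ name: `only-${name}`, flags: { ...allFlags(false), [name]: true } });
  }
  return sets;
}

const sets = edgeCaseFlagSets();
console.log(sets.length); // 14
console.log(sets[2].name); // only-has_billing
```

Each generated set could then be fed to the item filter inside a parameterized test, so that adding a new boolean flag extends the edge cases automatically.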

Key Test Examples

// Flag combination test: Korean web project without billing
test('billing=false, region=kr should skip billing + GDPR items', () => {
  const flags = {
    has_billing: false,
    target_region: 'kr',
    platform: 'web',
  };

  const result = filterItems(ALL_ITEMS, flags);

  // 48 billing-related items skipped
  expect(result.skipped).toContain('billing-webhook-verification');
  expect(result.skipped).toContain('refund-policy');

  // GDPR items skipped (Korea-only service)
  expect(result.skipped).toContain('gdpr-cookie-consent');

  // Korean privacy law items included
  expect(result.applicable).toContain('kr-privacy-policy');
});

// Score calculation test: skipped items excluded from denominator
test('score calculation excludes skipped items from denominator', () => {
  const flags = { has_billing: false };
  const result = runChecklist(ALL_ITEMS, flags, mockAnswers);

  // 534 - 48(billing) = 486 applicable
  expect(result.total).toBe(486);
  expect(result.score).toBe(result.passed / 486);
  // NOT result.passed / 534
});

Score Accuracy Is the Core of Trust

Why Accuracy Matters

When false-fails are high	When false-passes are high
Users leave: “This tool is useless”	Real problems ignored: “Everything passed!”
Distrust of the checklist itself	Post-launch incidents, legal issues
Users start ignoring results	The checklist loses its reason to exist

The most important metric for a checklist tool is accuracy — not features. Even with 100 features, if the results are inaccurate, nobody will use it. Feature flags are not a feature addition; they’re an accuracy improvement.

Improvement Results

Metric	Before	After	Change
False-fail rate	~22%	~3%	-19pp
User-perceived score	58% (understated)	85% (accurate)	+27pp
“Useless” feedback	Frequent	Nearly zero	Resolved
Completion rate (checking all results)	~40%	~75%	+35pp

Lessons Learned

1. A Checklist’s Value Comes From Relevance, Not Comprehensiveness

The assumption that all 534 items apply to everyone was wrong. Value comes not from “maximum items” but from “the right items for your project.” Feature flags don’t reduce the 534 items — they precisely select which of the 534 are right for you.

2. Accuracy and Usability Are a Trade-off

More flags increase accuracy but reduce usability. 17 is the current balance point, subject to adjustment based on user feedback.

3. Tests Are Essential for a Flag System

Conditional logic is a breeding ground for bugs. Without 48 tests backing the 17 flags, adding new flags could silently break existing combinations.

Summary

Key Point	Details
Problem	False-fails — items that don’t apply were marked as failures, distorting scores
Solution	Condition markers + 17 feature flags
Results	False-fail rate ~22% to ~3%, completion rate ~40% to ~75%
Lesson	A checklist’s value = accuracy, not item count