Using Feature Flags to Fix CLI Score Accuracy: Solving the False-Fail Problem

MJ · 3 min read

How MMU's checklist solved its false-fail problem: a condition marker system driven by 17 feature flags, verified by 48 regression tests.

The False-Fail Problem

When early testers ran MMU’s checklist, one piece of feedback came up repeatedly:

“I don’t use Stripe, but I’m getting billing failures.”

Specifically:

| Scenario | Checklist Item | User’s Actual State | Result | Problem |
| --- | --- | --- | --- | --- |
| Stripe webhook verification | “Are you verifying webhook signatures?” | Not using Stripe | Fail | false-fail |
| Refund policy | “Is a refund policy published?” | Free project | Fail | false-fail |
| GDPR consent banner | “Is there a GDPR cookie consent banner?” | Korea-only service | Fail | false-fail |
| iOS guidelines | “Are you compliant with App Store review guidelines?” | Web only | Fail | false-fail |

Of the 534 checklist items, roughly 120 were conditional: they apply only under specific circumstances. When a project that doesn’t use payments gets flagged for payment-related failures, the credibility of the entire checklist erodes.

graph TB
    A["534 checklist items"] --> B{"Are all items\napplicable to\nall projects?"}
    B -->|"No"| C["Conditional items\n~120"]
    B -->|"Yes"| D["Universal items\n~414"]

    C --> E["False-fails occur\n(item doesn't apply but fails)"]
    E --> F["Score distortion\nChecklist credibility drops"]

    style E fill:#ffebee,stroke:#f44336
    style F fill:#ffebee,stroke:#f44336

The Solution: Condition Markers

Instead of applying every item to every project, we added a marker to each conditional item that declares “this item is valid only under these conditions.”

Design Principles

  1. Explicit declaration: Add a when field to each conditional item
  2. User selects conditions: Input project characteristics when running the CLI
  3. Non-applicable items are skipped: Treated as not-applicable, not fail
  4. Excluded from scoring: Skipped items are removed from the denominator too

Condition Marker Data Structure

{
  "id": "billing-webhook-verification",
  "title": "Are you verifying webhook signatures?",
  "category": "billing",
  "priority": "P0",
  "when": {
    "has_billing": true,
    "payment_provider": ["stripe", "lemonsqueezy", "paddle"]
  }
}

{
  "id": "gdpr-cookie-consent",
  "title": "Is there a GDPR cookie consent banner?",
  "category": "legal",
  "priority": "P1",
  "when": {
    "target_region": ["eu", "global"]
  }
}
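
One way the `when` markers above could be evaluated is sketched below. The article does not show the internals of `filterItems`, so the semantics here are assumptions: all keys in `when` must match, an array means "any of these values", and a flag the user has not set never excludes an item. `isApplicable` and the `readme-present` item are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: decides whether an item applies under the given flags.
// Assumed semantics (not confirmed by the article): every `when` key must match,
// an array means "any of these values", and an unset flag never excludes an item.
function isApplicable(item, flags) {
  if (!item.when) return true; // universal item, no conditions
  return Object.entries(item.when).every(([flag, expected]) => {
    const actual = flags[flag];
    if (actual === undefined) return true; // flag not answered: keep the item
    return Array.isArray(expected)
      ? expected.includes(actual)
      : actual === expected;
  });
}

// Splits items into applicable and skipped id lists.
function filterItems(items, flags) {
  const applicable = [];
  const skipped = [];
  for (const item of items) {
    (isApplicable(item, flags) ? applicable : skipped).push(item.id);
  }
  return { applicable, skipped };
}

const items = [
  {
    id: "billing-webhook-verification",
    when: {
      has_billing: true,
      payment_provider: ["stripe", "lemonsqueezy", "paddle"],
    },
  },
  { id: "gdpr-cookie-consent", when: { target_region: ["eu", "global"] } },
  { id: "readme-present" }, // hypothetical universal item
];

const result = filterItems(items, { has_billing: false, target_region: "kr" });
console.log(result.skipped);    // billing and GDPR items
console.log(result.applicable); // the universal item only
```

Under these assumed semantics, calling `filterItems(items, {})` skips nothing, which is one way to get the "no flags means all items run" behavior.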

17 Feature Flags

We consolidated the project characteristics into 17 flags.

| # | Flag | Values | Scope of Impact |
| --- | --- | --- | --- |
| 1 | has_billing | true/false | 48 billing-related items |
| 2 | payment_provider | stripe/lemonsqueezy/paddle/none | Provider-specific items |
| 3 | has_auth | true/false | 32 authentication-related items |
| 4 | auth_provider | supabase/firebase/auth0/custom | Provider-specific items |
| 5 | target_region | kr/us/eu/global | Regulatory compliance items |
| 6 | platform | web/mobile/desktop/cli | Platform-specific items |
| 7 | has_api | true/false | API security, documentation items |
| 8 | has_database | true/false | DB backup, migration items |
| 9 | has_file_upload | true/false | File processing, storage items |
| 10 | has_email | true/false | Email sending, bounce handling items |
| 11 | has_analytics | true/false | Analytics integration items |
| 12 | has_ci_cd | true/false | CI/CD pipeline items |
| 13 | is_oss | true/false | Open source items (license, README) |
| 14 | team_size | solo/small/medium | Team-size-dependent items |
| 15 | has_i18n | true/false | Internationalization items |
| 16 | has_realtime | true/false | WebSocket, SSE items |
| 17 | has_ai_features | true/false | AI/LLM items (prompt injection, etc.) |
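
Because each flag has a closed set of values, user input can be checked before any filtering happens. The sketch below assumes a schema object keyed by flag name. `FLAG_SCHEMA` and `validateFlags` are hypothetical names, and only five of the 17 flags are listed to keep the example short. The allowed values are copied from the table above.

```javascript
// Hypothetical schema for a subset of the 17 flags; values taken from the table.
const FLAG_SCHEMA = {
  has_billing: [true, false],
  payment_provider: ["stripe", "lemonsqueezy", "paddle", "none"],
  target_region: ["kr", "us", "eu", "global"],
  platform: ["web", "mobile", "desktop", "cli"],
  team_size: ["solo", "small", "medium"],
};

// Returns a list of human-readable problems; an empty list means the input is valid.
function validateFlags(flags, schema = FLAG_SCHEMA) {
  const errors = [];
  for (const [name, value] of Object.entries(flags)) {
    if (!Object.prototype.hasOwnProperty.call(schema, name)) {
      errors.push(`unknown flag: ${name}`);
    } else if (!schema[name].includes(value)) {
      errors.push(`invalid value for ${name}: ${JSON.stringify(value)}`);
    }
  }
  return errors;
}

console.log(validateFlags({ has_billing: false, target_region: "kr" })); // []
console.log(validateFlags({ target_region: "jp", has_blockchain: true }));
```

Rejecting unknown flags up front matters if unset flags are treated as "does not exclude anything": a misspelled flag would otherwise be silently ignored.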

How We Landed on 17

| Flags Considered | Result |
| --- | --- |
| 5 | Insufficient coverage (only 50% of false-fails resolved) |
| 10 | Major scenarios covered (80% resolved) |
| 17 | Nearly all scenarios covered (95%+ resolved) |
| 25+ | Excessive user input burden, risk of abandonment |

17 is the balance point between coverage and usability. Answering the 17 prompts in the CLI (mostly yes/no, a few multiple-choice) takes about 2 minutes, and those 2 minutes eliminate 95%+ of false-fails.


CLI Flow Changes

Before (Without Flags)

$ npx make-me-unicorn

Make Me Unicorn - SaaS Launch Checklist

Running all 534 items...

[FAIL] Webhook signature verification missing
[FAIL] Refund policy not found
[FAIL] GDPR cookie consent banner missing
...

Score: 312/534 (58%)

User reaction: “I don’t have billing or EU users — why am I at 58%?”

After (With Flags)

$ npx make-me-unicorn

Make Me Unicorn - SaaS Launch Checklist

Quick setup (2 min):
  Has billing?        [Y/n] n
  Has authentication? [Y/n] y
  Target region?      [kr/us/eu/global] kr
  Platform?           [web/mobile/desktop/cli] web
  Has AI features?    [Y/n] n
  ...

Applying 17 flags -> 89 items skipped (not applicable)
Running 445 applicable items...

[PASS] Auth session expiry configured
[FAIL] Password reset flow missing
[SKIP] Webhook verification (no billing)
...

Score: 380/445 (85%)
(89 items skipped as not applicable)

User reaction: “85% — almost there. Just 65 items left to fix.”

graph LR
    A["534 total items"] -->|"17 flags applied"| B["445 applicable items"]
    A -->|"17 flags applied"| C["89 skipped items"]

    B --> D["380 PASS"]
    B --> E["65 FAIL"]

    F["Before: 312/534 = 58%"] -->|"False-fails removed"| G["After: 380/445 = 85%"]

    style F fill:#ffebee,stroke:#f44336
    style G fill:#e8f5e9,stroke:#4caf50
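
The score arithmetic in the two runs above can be reproduced with a few lines. This is a minimal sketch, assuming each item ends in one of three statuses ("pass", "fail", "skip"). `computeScore` is a hypothetical name, not MMU's actual function.

```javascript
// Hypothetical scoring helper: skipped items are removed from the denominator.
function computeScore(statuses) {
  const passed = statuses.filter((s) => s === "pass").length;
  const skipped = statuses.filter((s) => s === "skip").length;
  const total = statuses.length - skipped; // only applicable items count
  const percent = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { passed, total, skipped, percent };
}

// "After" run from the article: 380 pass, 65 fail, 89 skipped.
const afterRun = [
  ...Array(380).fill("pass"),
  ...Array(65).fill("fail"),
  ...Array(89).fill("skip"),
];
console.log(computeScore(afterRun));
// { passed: 380, total: 445, skipped: 89, percent: 85 }

// "Before" run: nothing skipped, so all 534 items stay in the denominator.
const beforeRun = [...Array(312).fill("pass"), ...Array(222).fill("fail")];
console.log(computeScore(beforeRun).percent); // 58
```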

Test Design: 48 Regression Tests

To verify the feature flag system works correctly, we wrote 48 test cases.

Test Categories

| Category | Test Count | What It Verifies |
| --- | --- | --- |
| Flag combinations | 12 | Correct items included/excluded for representative 17-flag combinations |
| Edge cases | 8 | All flags false, all flags true, single flag true |
| Score calculation | 10 | Skipped items correctly excluded from denominator |
| Backward compatibility | 6 | Running without flags still executes all 534 items |
| Condition conflicts | 8 | Correct interpretation when multiple when conditions overlap |
| Output formatting | 4 | Skipped items displayed correctly in reports |
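
The edge-case row (all flags false, all flags true, single flag true) lends itself to generated inputs rather than hand-written ones. The helper below is a sketch under that assumption. `edgeCaseFlagSets` and `SAMPLE_FLAGS` are hypothetical, and only three boolean flags are used for brevity.

```javascript
// Hypothetical subset of the boolean flags, kept short for the example.
const SAMPLE_FLAGS = ["has_billing", "has_auth", "has_api"];

// Builds the edge-case flag sets: all false, all true, then one set per flag
// in which only that flag is true.
function edgeCaseFlagSets(names) {
  const fill = (value) => Object.fromEntries(names.map((n) => [n, value]));
  const singles = names.map((n) => ({ ...fill(false), [n]: true }));
  return [fill(false), fill(true), ...singles];
}

const sets = edgeCaseFlagSets(SAMPLE_FLAGS);
console.log(sets.length); // 5
console.log(sets[2]);     // { has_billing: true, has_auth: false, has_api: false }
```

Each generated set can then be passed to the filtering function and checked against the expected skipped items for that scenario.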

Key Test Examples

// Flag combination test: Korean web project without billing
test('billing=false, region=kr should skip billing + GDPR items', () => {
  const flags = {
    has_billing: false,
    target_region: 'kr',
    platform: 'web',
  };

  const result = filterItems(ALL_ITEMS, flags);

  // 48 billing-related items skipped
  expect(result.skipped).toContain('billing-webhook-verification');
  expect(result.skipped).toContain('refund-policy');

  // GDPR items skipped (Korea-only service)
  expect(result.skipped).toContain('gdpr-cookie-consent');

  // Korean privacy law items included
  expect(result.applicable).toContain('kr-privacy-policy');
});

// Score calculation test: skipped items excluded from denominator
test('score calculation excludes skipped items from denominator', () => {
  const flags = { has_billing: false };
  const result = runChecklist(ALL_ITEMS, flags, mockAnswers);

  // 534 - 48(billing) = 486 applicable
  expect(result.total).toBe(486);
  expect(result.score).toBe(result.passed / 486);
  // NOT result.passed / 534
});

Score Accuracy Is the Core of Trust

Why Accuracy Matters

| When false-fails are high | When false-passes are high |
| --- | --- |
| “This tool is useless” (users leave) | “Everything passed!” (real problems ignored) |
| Distrust of the checklist itself | Post-launch incidents, legal issues |
| Users start ignoring results | The checklist loses its reason to exist |

The most important metric for a checklist tool is accuracy — not features. Even with 100 features, if the results are inaccurate, nobody will use it. Feature flags are not a feature addition; they’re an accuracy improvement.

Improvement Results

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| False-fail rate | ~22% | ~3% | -19pp |
| User-perceived score | 58% (understated) | 85% (accurate) | +27pp |
| “Useless” feedback | Frequent | Nearly zero | Resolved |
| Completion rate (checking all results) | ~40% | ~75% | +35pp |

Lessons Learned

1. A Checklist’s Value Comes From Relevance, Not Comprehensiveness

The assumption that all 534 items apply to everyone was wrong. Value comes not from “maximum items” but from “the right items for your project.” Feature flags don’t reduce the 534 items — they precisely select which of the 534 are right for you.

2. Accuracy and Usability Are a Trade-off

More flags increase accuracy but reduce usability. 17 is the current balance point, subject to adjustment based on user feedback.

3. Tests Are Essential for a Flag System

Conditional logic is a breeding ground for bugs. Without 48 tests backing the 17 flags, adding new flags could silently break existing combinations.
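
One inexpensive guard against that kind of silent breakage is a check that every `when` condition refers to a known flag and an allowed value. The sketch below is an illustration, not part of MMU. `lintConditions`, the two-flag schema, and the deliberately broken sample items are all hypothetical.

```javascript
// Hypothetical lint step: reports `when` conditions that reference a flag or a
// value missing from the schema, so a typo cannot quietly change which items apply.
function lintConditions(items, schema) {
  const problems = [];
  for (const item of items) {
    for (const [flag, expected] of Object.entries(item.when || {})) {
      if (!Object.prototype.hasOwnProperty.call(schema, flag)) {
        problems.push(`${item.id}: unknown flag "${flag}"`);
        continue;
      }
      const values = Array.isArray(expected) ? expected : [expected];
      for (const value of values) {
        if (!schema[flag].includes(value)) {
          problems.push(`${item.id}: "${flag}" cannot be ${JSON.stringify(value)}`);
        }
      }
    }
  }
  return problems;
}

const sampleSchema = {
  has_billing: [true, false],
  platform: ["web", "mobile", "desktop", "cli"],
};

const sampleItems = [
  { id: "billing-webhook-verification", when: { has_billing: true } },
  { id: "refund-policy", when: { has_biling: true } },        // typo on purpose
  { id: "app-store-guidelines", when: { platform: ["ios"] } }, // value not in schema
];

console.log(lintConditions(sampleItems, sampleSchema));
```

Running a check like this as one of the regression tests means a new flag or a renamed value fails loudly instead of shifting scores.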


Summary

| Key Point | Details |
| --- | --- |
| Problem | False-fails: items that don’t apply were marked as failures, distorting scores |
| Solution | Condition markers + 17 feature flags |
| Results | False-fail rate 22% to 3%, completion rate 40% to 75% |
| Lesson | A checklist’s value = accuracy, not item count |