Solving the false-fail problem in MMU's checklist by designing a condition-marker system with 17 feature flags, and improving score accuracy through 48 regression tests.
## The False-Fail Problem
When early testers ran MMU’s checklist, one piece of feedback came up repeatedly:
> “I don’t use Stripe, but I’m getting billing failures.”
Specifically:
| Scenario | Checklist Item | User’s Actual State | Result | Problem |
|---|---|---|---|---|
| Stripe webhook verification | “Are you verifying webhook signatures?” | Not using Stripe | Fail | false-fail |
| Refund policy | “Is a refund policy published?” | Free project | Fail | false-fail |
| GDPR consent banner | “Is there a GDPR cookie consent banner?” | Korea-only service | Fail | false-fail |
| iOS guidelines | “Are you compliant with App Store review guidelines?” | Web only | Fail | false-fail |
Out of 534 checklist items, some were conditional — applicable only under specific circumstances. When a project that doesn’t use payments gets flagged for payment-related failures, the credibility of the entire checklist erodes.
```mermaid
graph TB
A["534 checklist items"] --> B{"Are all items\napplicable to\nall projects?"}
B -->|"No"| C["Conditional items\n~120"]
B -->|"Yes"| D["Universal items\n~414"]
C --> E["False-fails occur\n(item doesn't apply but fails)"]
E --> F["Score distortion\nChecklist credibility drops"]
style E fill:#ffebee,stroke:#f44336
style F fill:#ffebee,stroke:#f44336
```
## The Solution: Condition Markers
Instead of applying every item to every project, we added a marker that declares “this item is valid only under these conditions.”
### Design Principles
- **Explicit declaration**: each conditional item gets a `when` field
- **User selects conditions**: project characteristics are entered when running the CLI
- **Non-applicable items are skipped**: treated as not-applicable, not as failures
- **Excluded from scoring**: skipped items are removed from the denominator as well
### Condition Marker Data Structure
```json
{
  "id": "billing-webhook-verification",
  "title": "Are you verifying webhook signatures?",
  "category": "billing",
  "priority": "P0",
  "when": {
    "has_billing": true,
    "payment_provider": ["stripe", "lemonsqueezy", "paddle"]
  }
}
```

```json
{
  "id": "gdpr-cookie-consent",
  "title": "Is there a GDPR cookie consent banner?",
  "category": "legal",
  "priority": "P1",
  "when": {
    "target_region": ["eu", "global"]
  }
}
```
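The matching semantics implied by these two examples (boolean fields must match exactly, an array means “any of these values,” and fields combine with AND) can be sketched as a small evaluator. This is a hypothetical illustration, not MMU’s actual code; `matchesWhen` and its types are assumed names:

```typescript
type FlagValue = boolean | string;
type Flags = Record<string, FlagValue>;
type WhenClause = Record<string, boolean | string[]>;

function matchesWhen(when: WhenClause | undefined, flags: Flags): boolean {
  if (!when) return true; // no `when` field => universal item, always applicable
  // AND across fields; an array value means "flag must be any of these" (OR within a field)
  return Object.entries(when).every(([key, cond]) => {
    const actual = flags[key];
    if (Array.isArray(cond)) return typeof actual === 'string' && cond.includes(actual);
    return actual === cond;
  });
}

// A Korea-only project without billing skips both example items above
const projectFlags: Flags = { has_billing: false, target_region: 'kr' };
console.log(matchesWhen({ has_billing: true, payment_provider: ['stripe'] }, projectFlags)); // false
console.log(matchesWhen({ target_region: ['eu', 'global'] }, projectFlags)); // false
```

Items whose `when` clause evaluates to `false` would be marked SKIP rather than FAIL, which is the whole point of the design.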
## 17 Feature Flags
We consolidated the project characteristics into 17 flags.
| # | Flag | Values | Scope of Impact |
|---|---|---|---|
| 1 | has_billing | true/false | 48 billing-related items |
| 2 | payment_provider | stripe/lemonsqueezy/paddle/none | Provider-specific items |
| 3 | has_auth | true/false | 32 authentication-related items |
| 4 | auth_provider | supabase/firebase/auth0/custom | Provider-specific items |
| 5 | target_region | kr/us/eu/global | Regulatory compliance items |
| 6 | platform | web/mobile/desktop/cli | Platform-specific items |
| 7 | has_api | true/false | API security, documentation items |
| 8 | has_database | true/false | DB backup, migration items |
| 9 | has_file_upload | true/false | File processing, storage items |
| 10 | has_email | true/false | Email sending, bounce handling items |
| 11 | has_analytics | true/false | Analytics integration items |
| 12 | has_ci_cd | true/false | CI/CD pipeline items |
| 13 | is_oss | true/false | Open source items (license, README) |
| 14 | team_size | solo/small/medium | Team-size-dependent items |
| 15 | has_i18n | true/false | Internationalization items |
| 16 | has_realtime | true/false | WebSocket, SSE items |
| 17 | has_ai_features | true/false | AI/LLM items (prompt injection, etc.) |
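Taken together, the table maps naturally onto a typed flags object. The interface below is a hypothetical TypeScript rendering: the field names come straight from the table, while the literal union types are inferred from the Values column and the example values are illustrative only:

```typescript
// Hypothetical model of the 17 flags; names from the table above, types assumed.
interface ProjectFlags {
  has_billing: boolean;
  payment_provider: 'stripe' | 'lemonsqueezy' | 'paddle' | 'none';
  has_auth: boolean;
  auth_provider: 'supabase' | 'firebase' | 'auth0' | 'custom';
  target_region: 'kr' | 'us' | 'eu' | 'global';
  platform: 'web' | 'mobile' | 'desktop' | 'cli';
  has_api: boolean;
  has_database: boolean;
  has_file_upload: boolean;
  has_email: boolean;
  has_analytics: boolean;
  has_ci_cd: boolean;
  is_oss: boolean;
  team_size: 'solo' | 'small' | 'medium';
  has_i18n: boolean;
  has_realtime: boolean;
  has_ai_features: boolean;
}

// Example: a Korean, no-billing web project (illustrative values)
const example: ProjectFlags = {
  has_billing: false,
  payment_provider: 'none',
  has_auth: true,
  auth_provider: 'custom',
  target_region: 'kr',
  platform: 'web',
  has_api: true,
  has_database: true,
  has_file_upload: false,
  has_email: false,
  has_analytics: false,
  has_ci_cd: true,
  is_oss: false,
  team_size: 'solo',
  has_i18n: false,
  has_realtime: false,
  has_ai_features: false,
};
```

A strict type like this also means the CLI can exhaustively prompt for every flag and fail compilation if a new flag is added without a default.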
### How We Landed on 17
| Flags Considered | Result |
|---|---|
| 5 | Insufficient coverage (only 50% of false-fails resolved) |
| 10 | Major scenarios covered (80% resolved) |
| 17 | Nearly all scenarios covered (95%+ resolved) |
| 25+ | Excessive user input burden, risk of abandonment |
17 is the “balance point between coverage and usability.” Answering 17 Y/N questions in the CLI takes about 2 minutes. Those 2 minutes eliminate 95%+ of false-fails.
## CLI Flow Changes
### Before (Without Flags)

```text
$ npx make-me-unicorn
Make Me Unicorn - SaaS Launch Checklist
Running all 534 items...
[FAIL] Webhook signature verification missing
[FAIL] Refund policy not found
[FAIL] GDPR cookie consent banner missing
...
Score: 312/534 (58%)
```
User reaction: “I don’t have billing or EU users — why am I at 58%?”
### After (With Flags)

```text
$ npx make-me-unicorn
Make Me Unicorn - SaaS Launch Checklist
Quick setup (2 min):
Has billing? [Y/n] n
Has authentication? [Y/n] y
Target region? [kr/us/eu/global] kr
Platform? [web/mobile/desktop/cli] web
Has AI features? [Y/n] n
...
Applying 17 flags -> 89 items skipped (not applicable)
Running 445 applicable items...
[PASS] Auth session expiry configured
[FAIL] Password reset flow missing
[SKIP] Webhook verification (no billing)
...
Score: 380/445 (85%)
(89 items skipped as not applicable)
```
User reaction: “85% — almost there. Just 65 items left to fix.”
```mermaid
graph LR
A["534 total items"] -->|"17 flags applied"| B["445 applicable items"]
A -->|"17 flags applied"| C["89 skipped items"]
B --> D["380 PASS"]
B --> E["65 FAIL"]
F["Before: 312/534 = 58%"] -->|"False-fails removed"| G["After: 380/445 = 85%"]
style F fill:#ffebee,stroke:#f44336
style G fill:#e8f5e9,stroke:#4caf50
```
## Test Design: 48 Regression Tests
To verify the feature flag system works correctly, we wrote 48 test cases.
### Test Categories
| Category | Test Count | What It Verifies |
|---|---|---|
| Flag combinations | 12 | Correct items included/excluded for representative combinations of the 17 flags |
| Edge cases | 8 | All flags false, all flags true, single flag true |
| Score calculation | 10 | Skipped items correctly excluded from denominator |
| Backward compatibility | 6 | Running without flags still executes all 534 items |
| Condition conflicts | 8 | Correct interpretation when multiple `when` conditions overlap |
| Output formatting | 4 | Skipped items displayed correctly in reports |
### Key Test Examples
```typescript
// Flag combination test: Korean web project without billing
test('billing=false, region=kr should skip billing + GDPR items', () => {
  const flags = {
    has_billing: false,
    target_region: 'kr',
    platform: 'web',
  };
  const result = filterItems(ALL_ITEMS, flags);

  // 48 billing-related items skipped
  expect(result.skipped).toContain('billing-webhook-verification');
  expect(result.skipped).toContain('refund-policy');

  // GDPR items skipped (Korea-only service)
  expect(result.skipped).toContain('gdpr-cookie-consent');

  // Korean privacy law items included
  expect(result.applicable).toContain('kr-privacy-policy');
});
```

```typescript
// Score calculation test: skipped items excluded from the denominator
test('score calculation excludes skipped items from denominator', () => {
  const flags = { has_billing: false };
  const result = runChecklist(ALL_ITEMS, flags, mockAnswers);

  // 534 - 48 (billing) = 486 applicable
  expect(result.total).toBe(486);
  expect(result.score).toBe(result.passed / 486);
  // NOT result.passed / 534
});
```
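The denominator rule the second test pins down (skipped items vanish from the denominator entirely) can be sketched as a tiny scoring helper. This is a hypothetical illustration; `computeScore` is an assumed name, not MMU’s actual API:

```typescript
// Hypothetical scoring sketch: 'skip' results leave the denominator entirely.
type ItemResult = 'pass' | 'fail' | 'skip';

function computeScore(results: ItemResult[]) {
  const applicable = results.filter((r) => r !== 'skip'); // skips drop out of the denominator
  const passed = applicable.filter((r) => r === 'pass').length;
  const total = applicable.length;
  const percent = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { passed, total, percent };
}

// Mirrors the CLI flow counts: 380 pass, 65 fail, 89 skip
const results: ItemResult[] = [
  ...Array<ItemResult>(380).fill('pass'),
  ...Array<ItemResult>(65).fill('fail'),
  ...Array<ItemResult>(89).fill('skip'),
];
console.log(computeScore(results)); // { passed: 380, total: 445, percent: 85 }
```

Feeding the CLI-flow counts through it reproduces the 85% score shown earlier; without the skip filter the same inputs would read 380/534 instead.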
## Score Accuracy Is the Core of Trust
### Why Accuracy Matters
| When false-fails are high | When false-passes are high |
|---|---|
| “This tool is useless” — users leave | “Everything passed!” — real problems ignored |
| Distrust of the checklist itself | Post-launch incidents, legal issues |
| Users start ignoring results | The checklist loses its reason to exist |
The most important metric for a checklist tool is accuracy — not features. Even with 100 features, if the results are inaccurate, nobody will use it. Feature flags are not a feature addition; they’re an accuracy improvement.
### Improvement Results
| Metric | Before | After | Change |
|---|---|---|---|
| False-fail rate | ~22% | ~3% | -19pp |
| User-perceived score | 58% (understated) | 85% (accurate) | +27pp |
| “Useless” feedback | Frequent | Nearly zero | Resolved |
| Completion rate (checking all results) | ~40% | ~75% | +35pp |
## Lessons Learned
### 1. A Checklist’s Value Comes From Relevance, Not Comprehensiveness
The assumption that all 534 items apply to everyone was wrong. Value comes not from “maximum items” but from “the right items for your project.” Feature flags don’t reduce the 534 items — they precisely select which of the 534 are right for you.
### 2. Accuracy and Usability Are a Trade-off
More flags increase accuracy but reduce usability. 17 is the current balance point, subject to adjustment based on user feedback.
### 3. Tests Are Essential for a Flag System
Conditional logic is a breeding ground for bugs. Without 48 tests backing the 17 flags, adding new flags could silently break existing combinations.
## Summary
| Key Point | Details |
|---|---|
| Problem | False-fails — items that don’t apply were marked as failures, distorting scores |
| Solution | Condition markers + 17 feature flags |
| Results | False-fail rate 22% to 3%, completion rate 40% to 75% |
| Lesson | A checklist’s value = accuracy, not item count |
## Related Posts

- **Why the Share Feature Comes First in PLG — Designing mmu share**
  Documenting the psychological mechanisms and viral loop strategies behind three sharing elements designed for MMU's PLG growth: score cards, badges, and custom checklist links.
- **The Code Was Done, But Everything Else Wasn't**
  Analyzing the difference between 'code complete' and 'product complete' through the 3-week post-feature work (payment stability, security, legal docs, SEO, etc.) which accounted for 88% of the effort.
- **Where the 534 Items Came From**
  Documenting how 534 SaaS launch checklist items were derived from 80 service analyses, 12 guidelines, and 5 real-world failures, including the P0-P3 priority logic.