What the score looks like
Every tool page on fewertools shows its Best Score at the top of the verdict block. Below the headline, the Stack Score component (verdict, ownership, pricing trajectory) is visible as the breakdown. Here is the legacy Stack Score badge so you can see the underlying anatomy:
The number on the left is the composite score. The pill is the tier. The bar visualises the score against 100. The line underneath names the three signals that produced it.
The tier ladder
Tiers are buckets, not the score itself. Two tools with very different breakdowns can land in the same tier. The tier is a quick read for "should I shortlist this." The score is for finer-grained comparison.
The Stack Score formula (the 30% component of Best Score)
Three components add up to 100. Each one rewards a different kind of evidence.
Verdict. The hands-on rating from our review of the tool. This is the dominant signal because everything else (ownership, pricing) only matters once we know the tool actually works.
Ownership. Founder-led tools have historically carried lower pricing risk and slower price creep. Private-equity-owned tools have historically done the opposite. Acquired tools sit in between because the trajectory is uncertain. We track this in our public ownership ledger.
Pricing trajectory. Drawn from our Pulse log of pricing events. Recent hikes pull the score down. Recent drops or new free tiers push it up. A long stretch with no events is treated as neutral, because no news is genuinely no news.
Why these three
A score is only useful if the inputs are things the user actually cares about. We started with a longer list (community size, integration count, GitHub stars, market share) and cut every signal that did not change a buying decision for a solo or bootstrapped founder.
What survived:
- Verdict answers "is the tool any good." Without that nothing else matters.
- Ownership answers "is the tool likely to still be the same product in two years." This is where most ranking sites refuse to take a position.
- Pricing trajectory answers "is my bill about to go up." This is where the editorial layer earns its keep, because it is unique to fewertools.
If we cannot defend a signal as decision-changing, it does not go in.
What the score is not
- Not a popularity score. Traffic, GitHub stars, and Twitter mentions are not inputs. The number does not move when a tool gets press.
- Not a vendor survey. No tool company is ever asked for input on its own ranking. There is no questionnaire and no opt-in.
- Not paid. No tool has paid for placement or for a higher score. Our affiliate links never change a verdict, and verdicts are the dominant input.
- Not the headline number on its own. Stack Score is one of two inputs to Best Score (the headline you see on every page) and accounts for 30% of it. The other 70% is Category Fit, which uses category-specific factors so a Solid CRM and a Solid form builder no longer end up at identical numbers.
How often it updates
The score is recomputed whenever any of its inputs change. In practice that means:
- A new Pulse drop for a tool re-runs the pricing component within a day.
- An ownership change (acquisition, IPO, founder exit) re-runs the ownership component immediately.
- A re-review or verdict flip re-runs the verdict component on the next deploy.
Every tool page shows its current score. There is no archived "this tool used to score 82" history yet, but we are adding score history alongside pricing history for every tool with enough events.
Verdict ranges, not buckets
Earlier versions of the score gave every tool with the same verdict the exact same number of points. Twelve "Recommended" tools all scored exactly 71. That was clean but unhelpful: tools with the same verdict were indistinguishable in the rankings.
So each verdict now carries a range rather than a fixed bucket. A tool's exact position within the range is determined by per-criterion scores against the five review factors documented on how we review: Functionality, Pricing Value, Ease of Use, Reliability, and Founder Fit. Each is scored 0 to 10; the average sets the position within the verdict's range.
| Verdict | Range | Was (fixed) |
|---|---|---|
| Our Pick | 44 to 56 | 50 |
| Recommended | 32 to 48 | 40 |
| Not yet | 22 to 38 | 30 |
| Skip | 8 to 22 | 15 |
| Replace | 0 to 10 | 5 |
So a "Recommended" tool that scores 9/10 on Functionality, 9/10 on Pricing Value, and 8/10 on the rest now sits at 47, not 40. A weaker "Recommended" with 6s and 7s sits at 38. Twenty different "Recommended" tools no longer share a single number. The midpoint of each range equals the old fixed bucket, so unscored tools don't drift.
Tools without per-criterion data fall to the range midpoint by design. Same principle as Category Fit: we score conservatively rather than fabricate.
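A minimal sketch of the range logic, for illustration only. The ranges and old fixed values are copied from the table above, and the midpoint fallback follows the rule for unscored tools. The page does not state the exact mapping from the five-criterion average to a position in the range, so the linear interpolation below is an assumption and need not reproduce the worked figures above. The function and variable names are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

# (low, high, old fixed value), copied from the verdict table.
VERDICT_RANGES = {
    "Our Pick": (44, 56, 50),
    "Recommended": (32, 48, 40),
    "Not yet": (22, 38, 30),
    "Skip": (8, 22, 15),
    "Replace": (0, 10, 5),
}


def verdict_points(verdict: str, criteria: Optional[Sequence[float]] = None) -> float:
    """Return verdict points for a tool.

    criteria holds the 0-10 review factor scores, or None when the tool
    has no per-criterion data.
    """
    low, high, _old_fixed = VERDICT_RANGES[verdict]
    if criteria is None or len(criteria) == 0:
        # Unscored tools sit at the range midpoint, as described above.
        return (low + high) / 2
    # ASSUMPTION: a linear map from the 0-10 average onto the range.
    # The real mapping is not documented on this page.
    average = mean(criteria)
    return low + (high - low) * average / 10


# Each midpoint equals the old fixed bucket, so unscored tools do not drift.
for name, (low, high, old_fixed) in VERDICT_RANGES.items():
    assert (low + high) / 2 == old_fixed, name
```

Calling `verdict_points("Recommended")` with no criteria returns the midpoint of 40, which is the old fixed value for that verdict.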
Category Fit Score
Stack Score is the same formula for every tool. That gives it a useful property for cross-category comparisons (Linear vs Stripe vs Notion), but on its own it would be blunt inside a single category (Cursor vs Claude Code vs GitHub Copilot all care about different things). Category Fit Score solves that by scoring each tool against criteria specific to what actually matters in its category.
So inside any category page that supports it, each tool also gets a Category Fit Score built from category-specific factors. For AI Coding, that's code quality, context awareness, debugging, workflow integration, speed, pricing value, and reliability. For AI Assistants, it's reasoning, writing, research, tool ecosystem, speed, pricing value, and reliability. Each factor is scored 0-10 and weighted; the result is a 0-100 number directly comparable to Stack Score.
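The weighted calculation can be sketched as follows. The factor names are the AI Coding factors listed above; the weights and the example scores are invented for illustration, since the real weights are not given here. Normalising by the total weight is also an assumption, chosen so that the result lands on a 0-100 scale.

```python
from typing import Mapping

# HYPOTHETICAL weights for the AI Coding factors; the real weights are not published here.
AI_CODING_WEIGHTS = {
    "code_quality": 25,
    "context_awareness": 20,
    "debugging": 15,
    "workflow_integration": 15,
    "speed": 10,
    "pricing_value": 10,
    "reliability": 5,
}


def category_fit(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine 0-10 factor scores into a 0-100 weighted score."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[factor] * weights[factor] for factor in weights)
    # The weighted average is on a 0-10 scale; multiply by 10 for 0-100.
    return 10 * weighted_sum / total_weight


# HYPOTHETICAL scores for an unnamed tool.
example_scores = {
    "code_quality": 9,
    "context_awareness": 8,
    "debugging": 7,
    "workflow_integration": 8,
    "speed": 6,
    "pricing_value": 7,
    "reliability": 8,
}
print(category_fit(example_scores, AI_CODING_WEIGHTS))  # 78.0
```

Because the result is normalised by the total weight, the weights do not have to sum to 100 for the output to stay on the same scale as Stack Score.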
The final position on a category leaderboard is the Category Rank Score:
Category Rank = 30% Stack Score + 70% Category Fit Score
The 30/70 weighting is deliberate: when you're on a category page, you mostly want to know how a tool performs in that category (the 70%). The Stack Score floor (the 30%) keeps tools with poor durability or aggressive pricing from outranking better-fit alternatives.
Tools that don't yet have category-fit scores fall back to Stack Score alone for ranking. We score one category at a time as we publish them, rather than backfilling the whole catalogue with low-confidence numbers.
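As a sketch, the 30/70 blend and the Stack Score fallback look like this. The function name and the example numbers are hypothetical; only the weighting and the fallback rule come from the text above.

```python
from typing import Optional


def category_rank(stack_score: float, category_fit: Optional[float] = None) -> float:
    """Blend Stack Score (30%) and Category Fit Score (70%).

    Falls back to Stack Score alone when the tool has no Category Fit Score yet.
    """
    if category_fit is None:
        return stack_score
    # Integer weights avoid small floating-point rounding surprises.
    return (30 * stack_score + 70 * category_fit) / 100


# HYPOTHETICAL tools: A fits the category better but has a weak Stack Score.
tool_a = category_rank(stack_score=40, category_fit=85)  # 71.5
tool_b = category_rank(stack_score=80, category_fit=75)  # 76.5
unscored = category_rank(stack_score=62)                 # 62, Stack Score alone
```

In this made-up comparison, tool B ranks above tool A despite the lower fit, which is the durability floor doing its job.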
Every scored tool also gets a one-sentence "Why it ranks here" explanation, either taken from a manual editorial note or auto-generated from its strongest and weakest sub-factors. That sentence is the defensible answer to "why is this tool 84 and not 71."
Best Score: the headline number across the site
Stack Score answers "is this tool safe to build around". Category Fit answers "how good is it inside this category". Neither alone tells you what to pick. So the headline number on every tool page, every category ranking, the homepage Top 10, and the main /rankings/ page is Best Score:
Best Score = 30% Stack Score + 70% Category Fit
Stack stays as the durability floor. Category Fit dominates because when you're choosing a tool, in-category performance is what you actually need. Tools whose category does not yet have a hand-built rubric are scored against a universal fallback rubric (functionality, pricing value, ease of use, reliability, founder fit) so every tool in the catalogue has a Best Score that means the same thing.
As we publish editorial coverage of more categories, the universal rubric is replaced by category-specific factor sets (Code quality and Repo awareness for AI Coding; Output quality and Prompt accuracy for AI Video; Design flexibility and SEO control for Website Builders; and so on) with weights tuned to what actually matters in that category. This is the layer the original rankings page was missing.
Confidence Score
We're ranking 1,600+ tools. Some have been tested deeply, some are listed and monitored. Hiding that gap would feel dishonest, so each tool now carries a Confidence label derived from how much editorial data we have on it:
- High: 5-criteria reviewed + category-fit scored + pricing verified + hands-on tested. Strongest editorial position.
- Medium: 5-criteria reviewed + (category-fit scored OR pricing verified). Solid review, not yet hands-on.
- Basic: 5-criteria reviewed only. Score has differentiation within its verdict range, but no category-specific or hands-on layers yet.
- Auto: no manual review yet. The score is auto-derived from the tool's verdict, pricing model, stage, ownership type, and description signals. Useful for ranking, but it should carry less weight than a score labelled Basic or above.
The Confidence label appears as a coloured pill on tool detail pages alongside the scores, and is the honest answer to "how much should I weight this score?"
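The four labels can be read as a small decision rule. The sketch below is one literal reading of the list above; the field and function names are hypothetical. One case the list does not spell out is a tool that is reviewed and hands-on tested but has neither category-fit scores nor verified pricing. This sketch treats it as Basic, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EditorialData:
    """Which editorial layers exist for a tool (hypothetical field names)."""
    criteria_reviewed: bool = False
    category_fit_scored: bool = False
    pricing_verified: bool = False
    hands_on_tested: bool = False


def confidence_label(data: EditorialData) -> str:
    if not data.criteria_reviewed:
        return "Auto"  # no manual review yet
    if data.category_fit_scored and data.pricing_verified and data.hands_on_tested:
        return "High"
    if data.category_fit_scored or data.pricing_verified:
        return "Medium"
    return "Basic"  # 5-criteria reviewed only


print(confidence_label(EditorialData(criteria_reviewed=True, pricing_verified=True)))  # Medium
```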
How auto-derived scores work
Most of the catalogue is "Auto" confidence. Manually reviewing 1,600 tools across five criteria each is years of work; we've done about 100 so far. For the rest, the system derives criteria scores from data we already have:
- Functionality from verdict tier and description richness (length, presence of "use when" / "avoid when" notes, capability keywords like API, automation, real-time).
- Pricing value from pricing model (open source, free, freemium, trial, paid, premium) plus descriptor signals (free forever, generous, per-seat, enterprise).
- Ease of use from simplicity keywords (simple, intuitive, no-code, drag and drop, fast, beautiful) minus complexity keywords (powerful, advanced, technical, dev-first, self-hosted).
- Reliability from stage (starting, building, growing, scaling), maturity keywords (industry standard, trusted by, established, since 2010), enterprise compliance signals (SOC 2, GDPR, HIPAA), minus newness keywords (beta, prototype, experimental).
- Founder fit from ownership type, verdict, pricing accessibility (free / open source pulls up, enterprise pulls down), and audience signals (solo, indie, founder, startup vs Fortune, mid-market, IPO).
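To make the keyword approach concrete, here is a rough sketch of the ease-of-use derivation only. The keyword lists are the ones given above. The neutral base of 5, the one-point step per keyword, and the plain substring matching are all assumptions, as is the function name; the real system may weight and match keywords differently.

```python
SIMPLICITY_KEYWORDS = ("simple", "intuitive", "no-code", "drag and drop", "fast", "beautiful")
COMPLEXITY_KEYWORDS = ("powerful", "advanced", "technical", "dev-first", "self-hosted")


def ease_of_use(description: str, base: float = 5.0, step: float = 1.0) -> float:
    """Derive a 0-10 ease-of-use score from description keywords.

    ASSUMPTIONS: base, step, and substring matching are illustrative only.
    Substring matching is crude: "fast" would also match "breakfast".
    """
    text = description.lower()
    ups = sum(keyword in text for keyword in SIMPLICITY_KEYWORDS)
    downs = sum(keyword in text for keyword in COMPLEXITY_KEYWORDS)
    # Clamp so the result stays on the 0-10 criterion scale.
    return max(0.0, min(10.0, base + step * (ups - downs)))


print(ease_of_use("A simple, intuitive no-code form builder"))   # 8.0
print(ease_of_use("Powerful, advanced, self-hosted analytics"))  # 2.0
```

The other four criteria would follow the same pattern with their own signal lists.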
To prevent tools with generic descriptions from clustering at the same score, an Auto-confidence tool also receives a small deterministic per-slug nudge (within ±2 points on each of the five criteria). The nudge is derived from the tool's slug, so the same tool always gets the same score, and it is applied only when no manual editorial review exists. Once a tool is manually reviewed, the nudge disappears and the score reflects the editorial criteria directly.
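One way such a nudge could work, as a sketch. The text above says only that the nudge is deterministic, derived from the slug, and bounded at ±2 points. Using SHA-256, taking one digest byte per criterion, and clamping to the 0-10 scale are assumptions made here for illustration. All names are hypothetical.

```python
import hashlib
from typing import Dict, Mapping

CRITERIA = ("functionality", "pricing_value", "ease_of_use", "reliability", "founder_fit")


def slug_nudges(slug: str, max_nudge: float = 2.0) -> Dict[str, float]:
    """Map a slug to one stable offset per criterion in [-max_nudge, +max_nudge]."""
    # ASSUMPTION: SHA-256 is used only as a stable source of bytes.
    digest = hashlib.sha256(slug.encode("utf-8")).digest()
    nudges = {}
    for index, name in enumerate(CRITERIA):
        unit = digest[index] / 255  # 0.0 to 1.0
        nudges[name] = round((2 * unit - 1) * max_nudge, 2)
    return nudges


def apply_nudges(scores: Mapping[str, float], slug: str, manually_reviewed: bool) -> Dict[str, float]:
    """Apply the nudge to auto-derived scores; reviewed tools are left untouched."""
    if manually_reviewed:
        return dict(scores)
    nudges = slug_nudges(slug)
    return {name: max(0.0, min(10.0, value + nudges[name])) for name, value in scores.items()}


# The same slug always yields the same offsets.
assert slug_nudges("example-tool") == slug_nudges("example-tool")
```

A hash is used rather than a random number so the ranking does not reshuffle between deploys.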
This is the mechanism that turns "everyone's at 71" into a genuinely differentiated ranking across 1,600 tools without faking deep reviews we haven't done. It's also why we tag the score "Auto" instead of pretending it's a hand-graded number.
Use Case Score (in development)
The same tool can be #1 for one job and #4 for another. Cursor is the right pick for a serious engineer; it's overkill for a beginner who'd be better served by Replit Agent. So the next layer in development is Use Case Score: each use case gets its own factor weights, and tools rank differently on each.
The infrastructure is shipped (see /assets/use-case-rubrics.js). Five seed use cases are defined: AI Coding for Beginners, AI Coding for Engineers, AI Assistant for Research, AI Image for Brand Design, AI Image for Faceless YouTube. Per-tool scoring is being filled in now; expect this layer to go live category-by-category.
Disagreements
If you think a score is wrong, two things are useful: tell us which input you would change (verdict, ownership, or pricing trajectory) and what evidence would support the change. The formula is fixed, but the inputs are editorial and we update them when we are wrong.
Send the evidence to clinton@fewertools.com or open an issue against the public site. We log corrections in the changelog.