Methodology
The two scores
Accessibility is whether you can open the drawer and reach what's inside. Can you tell what's in the drawer? Can you physically get the spoon?
Accessibility
When we talk about accessibility in user testing, we're asking:
- Can people navigate independently? Without needing someone else to guide them or explain what to do.
- Can they access enough information to make an informed decision? Not every word, but enough to understand what matters. In an insurance quote, that means understanding what's included, what your options are, and what it actually costs.
- Is the information perceivable and understandable? This includes assistive technology compatibility, yes. But it also means: can they comprehend the language well enough to make an informed decision?
Accessibility isn't just about assistive technology. Can the participant understand the critical information well enough to make a decision or access the service?
Usability
When we talk about usability, we're asking:
- Is it seamless and intuitive? Does the experience flow naturally, or does it feel clunky?
- Does it respond the way participants expect? When they click a button, does something happen? When they fill in a form, can they predict what comes next?
- Do they have to find workarounds? Or do things work the first time, the obvious way?
- Do they have to repeat actions? Click things multiple times? Re-enter information they've already given? Navigate in circles to find what they need?
Accessibility and usability often travel together, but not always. We measure them separately because they're answering different questions.
An icon with no label might be perfectly usable for a sighted participant who recognises the shape, but it's inaccessible to someone using a screen reader because there's nothing for the assistive technology to announce. Similarly, a form might be intuitive and easy to navigate visually but completely inaccessible to someone using a screen reader if the underlying code doesn't support it properly.
Comprehension is where this gets interesting. Comprehension friction around critical information (what's included in a quote, what your options are, what something costs) has a big impact on accessibility. If you can't understand what you're choosing between, you can't make an informed decision, full stop. But struggling to parse dense legal language on a product disclosure statement? That's usability friction. The information is technically accessible; it's just not intuitive or easy to parse. The distinction matters, because it shapes where we route that signal in the scoring.
We don't score individual participants and surface those to clients. That's not the point of what we do. Instead, we score:
- Task-level scores. One usability score and one accessibility score for each specific task or component we test. Homepage navigation, login, account creation, form completion, checkout: each gets its own pair.
- Project-level scores. One headline usability score and one headline accessibility score for the overall project. This is what a client sees at the top of their dashboard.
If a participant is blocked at a project-level dependency, the project score reflects that. It can drop to zero for that participant. But their blocked status doesn't contaminate task scores for later stages they didn't participate in. If they never got past the date picker, they don't have a meaningful experience of the checkout page, so they don't count in that task score. That way, every task score accurately reflects the experience of the people who actually attempted that task.
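The exclusion rule above can be sketched in a few lines. This is an illustration only: the text does not say how individual results are aggregated, so a simple mean is assumed here, and the names `task_score`, `project_score`, and the participant IDs are hypothetical.

```python
from statistics import mean
from typing import Optional

# Hypothetical data shape: one entry per participant for a single task.
# None means the participant never reached the task (blocked upstream).
TaskResults = dict[str, Optional[float]]


def task_score(results: TaskResults) -> Optional[float]:
    """Average only the participants who actually attempted the task."""
    attempted = [s for s in results.values() if s is not None]
    # Assumption: a plain mean stands in for the real aggregation.
    return mean(attempted) if attempted else None


def project_score(per_participant: dict[str, float], blocked: set[str]) -> float:
    """Participants blocked at a project-level dependency count as zero."""
    scores = [0.0 if p in blocked else s for p, s in per_participant.items()]
    return mean(scores)


# p3 never got past the date picker, so p3 has no checkout result.
checkout = {"p1": 80.0, "p2": 60.0, "p3": None}
overall = {"p1": 80.0, "p2": 60.0, "p3": 70.0}
```

In this example `task_score(checkout)` averages p1 and p2 only, while `project_score(overall, {"p3"})` counts p3 as zero.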
The four bands
| Band | Score | What it means |
|---|---|---|
| Ouch | 0–40 | Significant barriers. Immediate attention required. |
| Meh | 41–60 | The basics are there. Notable gaps remain. |
| Nice | 61–80 | Good practices in place. Room for refinement. |
| Awesome | 81–100 | Excellent experience. Industry-leading. |
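As a minimal sketch, the four bands above translate into a simple lookup. The function name `band` is hypothetical, and rounding fractional scores to the nearest whole number before banding is an assumption, since the ranges are only given for whole numbers.

```python
def band(score: float) -> str:
    """Map a 0-100 score to its band name."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    s = round(score)  # assumption: fractional scores are rounded first
    if s <= 40:
        return "Ouch"
    if s <= 60:
        return "Meh"
    if s <= 80:
        return "Nice"
    return "Awesome"
```

For example, `band(73)` returns `"Nice"`.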
Standardised classification turns individual observations into aggregate intelligence. Without common terminology, every project is an island. We can't spot patterns across participants, compare experiences across cohorts, prioritise fixes consistently, or benchmark how a product is improving over time.
Two dimensions, every observation
We classify friction across two dimensions:
| Dimension | What it captures |
|---|---|
| Severity | How significantly did this impede the participant? (Blocker → Low) |
| Type | What was the nature of the impediment? (The category of friction experienced) |
Every friction observation is tagged with both a severity level and a friction type. This lets us identify not just how bad issues are, but what kind of issues they are.
Every time a participant hits a snag during testing, we classify it by type. Seven categories, each with its own fingerprint.
| Friction type | What it looks like |
|---|---|
| Comprehension | Can't understand what's being said or asked. |
| Confidence | Understands the screen but unsure what to do next. |
| Accessibility | Doesn't work with assistive technology or input method. |
| Unresponsive interface | Action taken, no response, or response delayed. |
| Unexpected behaviour | Interface responds in a way the participant didn't see coming. |
| Content not found | Can't find information they need to decide. |
| Excessive effort | Too many steps, clicks, or repeats. |
Read the full friction taxonomy
All seven types, with definitions, examples, and the assistive-technology nuance that decides which signal we trust.
Not every friction type affects both scores equally. Here's how we route them:
| Friction type | Primarily affects | Why |
|---|---|---|
| Comprehension | Accessibility or usability | Critical information routes to accessibility. Non-critical content routes to usability. |
| Confidence | Usability | Uncertainty about what to do next is a usability issue. |
| Accessibility | Accessibility | Direct incompatibility with assistive technology. This is the core accessibility signal. |
| Unresponsive interface | Usability | Interface not responding is a usability breakdown. |
| Unexpected behaviour | Usability | Mismatch between expectation and outcome is usability. |
| Content not found | Usability | Navigation and findability are usability concerns. |
| Excessive effort | Usability | Cognitive and interaction overhead is usability friction. |
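The routing rules above can be expressed as a small function. This is a sketch, not the production logic: `Stream`, `route`, and the `critical_information` flag are hypothetical names, and how an analyst decides that a Comprehension friction concerns critical information is outside what the code shows.

```python
from enum import Enum


class Stream(Enum):
    ACCESSIBILITY = "accessibility"
    USABILITY = "usability"


# Friction types that always route to the usability stream.
USABILITY_TYPES = {
    "confidence",
    "unresponsive interface",
    "unexpected behaviour",
    "content not found",
    "excessive effort",
}


def route(friction_type: str, critical_information: bool = False) -> Stream:
    """Return the score stream a friction observation feeds."""
    key = friction_type.strip().lower()
    if key == "accessibility":
        return Stream.ACCESSIBILITY
    if key == "comprehension":
        # Critical information (what's included, the options, the cost)
        # routes to accessibility; everything else routes to usability.
        return Stream.ACCESSIBILITY if critical_information else Stream.USABILITY
    if key in USABILITY_TYPES:
        return Stream.USABILITY
    raise ValueError(f"unknown friction type: {friction_type!r}")
```

Comprehension is the only type whose destination depends on an extra judgement, which is why it takes a second argument.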
Some friction types carry more weight than others within their stream. We come back to that in the scoring section.
We don't just classify what kind of friction happened. We also classify how badly it impacted the participant. Six levels, from "fully blocked" to "mild annoyance," plus a positive moment for delight.
| Severity | What it means |
|---|---|
| Blocked (Project) | Participant blocked from the entire project outcome. |
| Task Blocker | Blocked from one task; rest of the journey still works. |
| Component Blocker | Blocked from a specific component; task completed via a workaround. |
| High Friction | Major difficulty; multiple failed attempts or extended time. |
| Medium Friction | Noticeable delay, hesitation, or confusion. |
| Low Friction | Minor inconvenience noticed but didn't slow them down. |
| Positive | Participant impressed or delighted by an interaction. |
How we decide which severity to apply
The decision tree below is what an analyst walks through for every friction observation. Each step's answer routes to either an outcome (a severity level) or the next question.
Step 1: Did the participant achieve the primary project outcome independently?
- No → Blocked (Project)
- Yes → Step 2
Step 2: Were any component tasks completely blocked?
- Yes → Task Blocker
- No → Step 3
Step 3: Did any component clearly fail or prove inaccessible (even if the participant worked around it)?
- Yes → Component Blocker
- No → Step 4
Step 4: How much effort or difficulty was required?
- Major difficulty, near-abandonment → High Friction
- Noticeable confusion or hesitation → Medium Friction
- Minor inconvenience → Low Friction
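The four steps above can be written as a short function. This is a sketch under assumptions: the severity attached to each branch is inferred from the severity definitions earlier in this section, and `Observation`, `Effort`, and `classify_severity` are hypothetical names rather than part of any published tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Effort(Enum):
    MAJOR = "major difficulty, near-abandonment"
    NOTICEABLE = "noticeable confusion or hesitation"
    MINOR = "minor inconvenience"


@dataclass
class Observation:
    achieved_project_outcome: bool  # Step 1: achieved independently?
    task_blocked: bool              # Step 2: any task completely blocked?
    component_failed: bool          # Step 3: component failed, even if worked around?
    effort: Effort                  # Step 4: how much effort was required?


def classify_severity(obs: Observation) -> str:
    """Walk the decision tree and return a severity level."""
    if not obs.achieved_project_outcome:
        return "Blocked (Project)"
    if obs.task_blocked:
        return "Task Blocker"
    if obs.component_failed:
        return "Component Blocker"
    return {
        Effort.MAJOR: "High Friction",
        Effort.NOTICEABLE: "Medium Friction",
        Effort.MINOR: "Low Friction",
    }[obs.effort]
```

The order of the checks matters: the most severe outcome is tested first, so an observation never receives a lower severity than the earliest question it fails.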
Learn more about severity thresholds
Six levels in detail, plus the three flavours of blocker (component, task, project) and how each one shapes the score.
For the curious
For those who want to understand the methodology at a deeper level, here's how the components fit together.
This is an empirical model, not a statistical one
We want to be honest about what our scores are and what they aren't.
Our scores are empirical. They reflect the experiences our participants actually had. Whether they could use the product. Whether they were blocked. What friction they hit and how severe it was. This is what makes the scores meaningful and actionable: they describe real human experiences of your product, captured across multiple data modalities.
They are not statistical in the sense of a rigorously sampled population study. The sample sizes are smaller than you'd use for hypothesis testing, and the goal is qualitative depth, not quantitative generalisation. A score of 73 isn't telling you "73% of your participants will succeed." It's telling you "across our diverse panel of participants, the pattern of experience we observed lands in the Nice range, with these specific friction points to address."
Why we don't rely on self-scores alone
Most user testing platforms ask participants to rate their experience (say, 1 to 5), average those ratings across participants, and publish that as the score. Participant self-ratings are valuable, but insufficient as a sole data source.
- Adaptation bias. Participants with disabilities often adapt to poor experiences and rate them higher than warranted because "it's better than most sites" (sadly, what they're used to).
- Expectation calibration. Participants may not know what "good" looks like if they've never experienced a genuinely accessible product. Without that benchmark, even mediocre experiences can read as positive.
- Social desirability. Some participants avoid giving harsh ratings, particularly in facilitated sessions, whether because they're pleased to be included in the research, because this is their first paid employment, or because they feel heard for the first time.
- Task completion disconnect. A participant might rate satisfaction highly despite being blocked from key tasks. "I couldn't finish but the parts I could do were nice" produces a high rating that hides a blocker.
- Severity blindness. Self-scores don't distinguish between a minor annoyance and a complete barrier. A 3-out-of-5 tells us nothing about whether the participant completed independently.
Our approach: self-scores form the baseline, but analyst-observed friction classifications provide the severity and type data that self-scores can't capture. Over time, additional data modalities (task completion rates, time-to-complete, error frequency) will further strengthen score validity.
Why we keep the survey
Surveys capture things participants don't naturally verbalise. When someone engages thoughtfully with a question, that's their explicit, considered judgement at a fixed moment in the experience, and it's genuinely valuable. We don't drop it; we contextualise it against the other modalities.
Adding more survey questions to clean up the data would make things worse, not better. The more questions participants face, the less meaningful engagement we get with any of them. The answer isn't more questions or fewer; it's better triangulation.
Why testers give one score per task, not separate accessibility and usability scores
Asking testers to rate accessibility and usability separately would introduce tester-level accessibility bias. A tester who has spent years developing strategies to navigate inaccessible forms might rate a combo-box failure 2 out of 5 for accessibility, while a less experienced tester rates the same failure 1 out of 5. The rating measures the tester's adaptation, not the product. This is the adaptation bias described above.
Accessibility and usability are also not cleanly separable in a live session. When a blind participant struggles with an enrolment form, is that an Accessibility issue (combo boxes don't work with screen readers) or an Excessive-effort issue (the form is too complicated)? From the tester's perspective, it's one experience. Asking them to split it introduces artificial precision.
The model resolves this by:
- Accepting one task rating from the tester (their holistic experience, no categorisation required)
- Using transcript analysis to identify the friction type (one of the seven categories above)
- Routing Accessibility-friction deductions, and the critical-information subset of Comprehension friction, to the accessibility stream only; the accessibility score is finalised here, with no further modification
- Routing the other five friction types, and the non-critical subset of Comprehension friction, to the usability stream only
- Applying a survey-sentiment calibration to the usability stream only, using two end-of-project questions: "How well did this product meet your expectations?" and "How well were you able to navigate and access the product?"
- Keeping the accessibility score tied to objective transcript evidence only; sentiment does not touch it
The model in plain English
We fuse multiple modalities of data:
- Survey scores. What the participant said explicitly in their reflective survey response.
- Transcript analysis. What the participant said out loud during the session: verbal feedback, confusion, frustration, relief, expressions of struggle.
- Observed behaviour. What the participant actually did: hesitation, backtracking, repeated attempts, abandoned actions.
- Time to complete. Relative to the participant's own baseline and their cohort's baseline.
- Participant context. Cohort, assistive technology, experience level.
No single modality is enough on its own. Surveys alone miss the in-the-moment signal. Transcripts alone miss what participants don't verbalise. Time alone is noisy. The signal comes from triangulating across modalities.
Our ethical commitment
Confidence ratings
Every friction insight our LLM surfaces carries a confidence rating (0.0 to 1.0). We use confidence as a multiplier on how much weight an insight carries. A high-confidence severe friction (0.9) carries substantially more weight than a low-confidence one (0.4). There's a minimum confidence threshold (currently 0.7) below which a friction is noted but doesn't affect the score.
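A minimal sketch of the confidence rule described above, assuming confidence multiplies a per-friction deduction directly. The function name `weighted_deduction` and the `base_deduction` values are hypothetical; only the 0.0 to 1.0 range and the current 0.7 threshold come from the text.

```python
MIN_CONFIDENCE = 0.7  # current minimum confidence threshold stated above


def weighted_deduction(
    base_deduction: float,
    confidence: float,
    threshold: float = MIN_CONFIDENCE,
) -> float:
    """Scale a friction's deduction by the confidence of the insight."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence < threshold:
        # Below the threshold the friction is noted but does not
        # affect the score.
        return 0.0
    # Assumption: confidence is applied as a straight multiplier.
    return base_deduction * confidence
```

With a hypothetical base deduction of 10 points, a 0.9-confidence insight contributes about 9 points, while a 0.4-confidence insight contributes nothing because it falls below the threshold.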
Friction-type weighting
The accessibility score is driven primarily by Accessibility friction and the critical-information subset of Comprehension friction. Other friction types don't feed the accessibility score directly. The usability score is driven by Confidence, Unresponsive interface, Unexpected behaviour, Content not found, Excessive effort, and the non-critical subset of Comprehension friction. Within each stream, severity levels carry different weight: a blocker drives the score dramatically lower; a low-severity friction registers but doesn't meaningfully change the score on its own.
Cohort-aware weighting
Cohorts experience friction differently. Comprehension friction for a participant who speaks English as a second language is often a direct language accessibility barrier. The same friction for a sighted, native English speaker is more likely to be a usability concern about interface clarity. Our model applies cohort-specific weightings to comprehension friction to reflect these structural differences.
Active development
This model is under active development. Current investments include longitudinal individual baselines, hesitation and filled-pause analysis, keyword and phrase detection, and refined confidence calibration through retrospective validation against expert review. A roadmap of planned enhancements will be published shortly.
If you'd like to understand how your scores were calculated, which friction points drove them, or how we'd recommend acting on them, talk to your See Me Please contact. We can walk through any individual score and explain what contributed to it, because if we can't explain it, it shouldn't be in the score.


