Methodology · Accessibility v Usability

The two scores

Accessibility score

Accessibility score: 81 out of 100, Awesome tier

Usability score

Usability score: 21 out of 100, Ouch tier

Think of it like a kitchen drawer

Accessibility is whether you can open the drawer and reach what's inside. Can you tell what's in the drawer? Can you physically get the spoon?

Accessibility

When we talk about accessibility in user testing, we're asking:

  1. Can people navigate independently? Without needing someone else to guide them or explain what to do.
  2. Can they access enough information to make an informed decision? Not every word, but enough to understand what matters. In an insurance quote, that means understanding what's included, what your options are, and what it actually costs.
  3. Is the information perceivable and understandable? This includes assistive technology compatibility, yes. But it also means: can they comprehend the language well enough to make an informed decision?

Accessibility isn't just about assistive technology. It also asks: can the participant understand the critical information they need to make a decision or access the service?

Usability

When we talk about usability, we're asking:

  1. Is it seamless and intuitive? Does the experience flow naturally, or does it feel clunky?
  2. Does it respond the way participants expect? When they click a button, does something happen? When they fill in a form, can they predict what comes next?
  3. Do they have to find workarounds? Or do things work the first time, the obvious way?
  4. Do they have to repeat actions? Click things multiple times? Re-enter information they've already given? Navigate in circles to find what they need?

How they overlap and where they don't

Accessibility and usability often travel together, but not always. We measure them separately because they're answering different questions.

An icon with no label might be perfectly usable for a sighted participant who recognises the shape, but it's inaccessible to someone using a screen reader because there's nothing for the assistive technology to announce. Conversely, a form might be intuitive and easy to navigate visually but completely inaccessible to someone using a screen reader if the underlying code doesn't support it properly.

Comprehension is where this gets interesting. Comprehension friction around critical information (what's included in a quote, what your options are, what something costs) has a big impact on accessibility. If you can't understand what you're choosing between, you can't make an informed decision, full stop. But struggling to parse dense legal language on a product disclosure statement? That's usability friction. The information is technically accessible; it's just not intuitive or easy to parse. The distinction matters, because it shapes where we route that signal in the scoring.

How we determine scores

We don't score individual participants and surface those to clients. That's not the point of what we do. Instead, we score:

  • Task-level scores. One usability score and one accessibility score for each specific task or component we test. Homepage navigation, login, account creation, form completion, checkout: each gets its own pair.
  • Project-level scores. One headline usability score and one headline accessibility score for the overall project. This is what a client sees at the top of their dashboard.

If a participant is blocked at a project-level dependency, the project score reflects that. It can drop to zero for that participant. But their blocked status doesn't contaminate task scores for later stages they didn't participate in. If they never got past the date picker, they don't have a meaningful experience of the checkout page, so they don't count in that task score. That way, every task score accurately reflects the experience of the people who actually attempted that task.
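As a minimal sketch of the exclusion rule above: participants who never reached a task are left out of that task's score. The record shape, the `attempted` flag, and the plain average are all assumptions made for illustration; they are not a description of the actual aggregation.

```python
from statistics import mean
from typing import Optional


def task_score(results: list[dict]) -> Optional[float]:
    """Average a task's scores over the participants who attempted it.

    Each record is a hypothetical dict with the keys
    "participant", "attempted" and "score". A simple mean is assumed
    here purely for illustration.
    """
    attempted = [r["score"] for r in results if r["attempted"]]
    # No attempts means there is no meaningful task score to report.
    return mean(attempted) if attempted else None


checkout = [
    {"participant": "A", "attempted": True, "score": 80.0},
    {"participant": "B", "attempted": True, "score": 60.0},
    # C was blocked at the date picker and never reached checkout,
    # so C is excluded rather than counted as a zero.
    {"participant": "C", "attempted": False, "score": 0.0},
]

checkout_score = task_score(checkout)
```

In this sketch participant C's block would still be reflected in the project-level score; it simply doesn't drag down a task C never experienced.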

The four bands

Ouch

0–40

Significant barriers. Immediate attention required.

Meh

41–60

The basics are there. Notable gaps remain.

Nice

61–80

Good practices in place. Room for refinement.

Awesome

81–100

Excellent experience. Industry-leading.
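The band boundaries above can be expressed as a small lookup. This is only a sketch: the function name `band_for` is hypothetical, and it assumes scores are whole numbers from 0 to 100.

```python
def band_for(score: int) -> str:
    """Map a 0-100 score to one of the four bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 40:
        return "Ouch"
    if score <= 60:
        return "Meh"
    if score <= 80:
        return "Nice"
    return "Awesome"
```

With the example scores at the top of this page, `band_for(81)` gives "Awesome" and `band_for(21)` gives "Ouch".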

Why common classifications matter

Standardised classification turns individual observations into aggregate intelligence. Without common terminology, every project is an island. We can't spot patterns across participants, compare experiences across cohorts, prioritise fixes consistently, or benchmark how a product is improving over time.

The principle: common terminology transforms individual observations into aggregate intelligence. Without it, every project is an island.

Two dimensions, every observation

We classify friction across two dimensions:

The two dimensions used to classify every friction observation

Dimension | What it captures
Severity | How significantly did this impede the participant? (Blocker → Low)
Type | What was the nature of the impediment? (The category of friction experienced)

Every friction observation is tagged with both a severity level and a friction type. This lets us identify not just how bad issues are, but what kind of issues they are.
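One way to picture a tagged observation is as a small record with two required labels. The sketch below uses the severity levels and friction types named elsewhere on this page; the class names and the `task` and `note` fields are hypothetical, not a description of our internal data model.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # The six friction severities; the separate "Positive" moment is
    # left out of this sketch because it is not a friction level.
    BLOCKED_PROJECT = "Blocked (Project)"
    TASK_BLOCKER = "Task Blocker"
    COMPONENT_BLOCKER = "Component Blocker"
    HIGH = "High Friction"
    MEDIUM = "Medium Friction"
    LOW = "Low Friction"


class FrictionType(Enum):
    COMPREHENSION = "Comprehension"
    CONFIDENCE = "Confidence"
    ACCESSIBILITY = "Accessibility"
    UNRESPONSIVE_INTERFACE = "Unresponsive interface"
    UNEXPECTED_BEHAVIOUR = "Unexpected behaviour"
    CONTENT_NOT_FOUND = "Content not found"
    EXCESSIVE_EFFORT = "Excessive effort"


@dataclass(frozen=True)
class FrictionObservation:
    task: str                    # hypothetical: the task being tested
    severity: Severity           # how badly it impeded the participant
    friction_type: FrictionType  # what kind of impediment it was
    note: str = ""               # hypothetical: free-text analyst note


obs = FrictionObservation(
    task="checkout",
    severity=Severity.HIGH,
    friction_type=FrictionType.EXCESSIVE_EFFORT,
    note="Re-entered the delivery address three times",
)
```

Making both labels required fields mirrors the rule that every observation carries a severity and a type, never just one.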

Types of friction we look for

Every time a participant hits a snag during testing, we classify it by type. Seven categories, each with its own fingerprint.

  • Comprehension

    Can't understand what's being said or asked.

  • Confidence

    Understands the screen but unsure what to do next.

  • Accessibility

    Doesn't work with assistive technology or input method.

  • Unresponsive interface

    Action taken, no response, or response delayed.

  • Unexpected behaviour

    Interface responds in a way the participant didn't see coming.

  • Content not found

    Can't find information they need to decide.

  • Excessive effort

    Too many steps, clicks, or repeats.

Read the full friction taxonomy

All seven types, with definitions, examples, and the assistive-technology nuance that decides which signal we trust.

How friction types influence accessibility and usability scoring

Not every friction type affects both scores equally. Here's how we route them:

How each friction type routes to accessibility versus usability scores

Friction type | Primarily affects | Why
Comprehension | Accessibility, Usability | Critical information routes to accessibility. Non-critical content routes to usability.
Confidence | Usability | Uncertainty about what to do next is a usability issue.
Accessibility | Accessibility | Direct incompatibility with assistive technology. This is the core accessibility signal.
Unresponsive interface | Usability | Interface not responding is a usability breakdown.
Unexpected behaviour | Usability | Mismatch between expectation and outcome is usability.
Content not found | Usability | Navigation and findability are usability concerns.
Excessive effort | Usability | Cognitive and interaction overhead is usability friction.

Some friction types carry more weight than others within their stream. We come back to that in the scoring section.
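The routing described above can be sketched as a single function. The name `stream_for` and the `critical_information` flag are hypothetical; in practice, deciding whether content counts as critical is an analyst judgement, not a boolean handed to us.

```python
# Friction types that route to usability regardless of content.
USABILITY_TYPES = {
    "Confidence",
    "Unresponsive interface",
    "Unexpected behaviour",
    "Content not found",
    "Excessive effort",
}


def stream_for(friction_type: str, critical_information: bool = False) -> str:
    """Return the score stream a friction observation feeds.

    `critical_information` only matters for Comprehension friction:
    critical content routes to accessibility, everything else to usability.
    """
    if friction_type == "Accessibility":
        return "accessibility"
    if friction_type == "Comprehension":
        return "accessibility" if critical_information else "usability"
    if friction_type in USABILITY_TYPES:
        return "usability"
    raise ValueError(f"unknown friction type: {friction_type!r}")
```

For example, confusion about what a quote includes would be `stream_for("Comprehension", critical_information=True)`, which lands in the accessibility stream, while dense legal wording in a disclosure statement would land in usability.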

How severe was the friction

We don't just classify what kind of friction happened. We also classify how badly it impacted the participant. Six levels, from "fully blocked" to "mild annoyance," plus a positive moment for delight.

Critical distinction: we differentiate between project-level outcomes (the overall goal) and task-level outcomes (component steps). A participant may achieve the project goal while still experiencing task-level blockers along the way.

  • Blocked (Project)

    Participant blocked from the entire project outcome.

  • Task Blocker

    Blocked from one task; rest of the journey still works.

  • Component Blocker

    Blocked from a specific component; task completed via a workaround.

  • High Friction

    Major difficulty; multiple failed attempts or extended time.

  • Medium Friction

    Noticeable delay, hesitation, or confusion.

  • Low Friction

    Minor inconvenience noticed but didn't slow them down.

  • Positive

    Participant impressed or delighted by an interaction.

How we decide which severity to apply

The decision tree below is what an analyst walks through for every friction observation. Each step's answer routes to either an outcome (a severity level) or the next question.

  1. Step 1: Did the participant achieve the primary project outcome independently?

    • No → Blocked (Project)
    • Yes → Step 2
  2. Step 2: Were any component tasks completely blocked?

    • Yes → Task Blocker
    • No → Step 3
  3. Step 3: Did any component clearly fail or prove inaccessible (even if the participant worked around it)?

    • Yes → Component Blocker
    • No → Step 4
  4. Step 4: How much effort or difficulty was required?

    • Major difficulty, near-abandonment → High Friction
    • Noticeable confusion or hesitation → Medium Friction
    • Minor inconvenience → Low Friction
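The four steps above can be written as a short function, with each early return corresponding to one branch of the tree. The parameter names and the three effort labels ("major", "noticeable", "minor") are hypothetical shorthand for the analyst's answers; the real judgement is made by a person, not by booleans.

```python
def classify_severity(
    achieved_project_outcome: bool,
    task_blocked: bool,
    component_failed: bool,
    effort: str,
) -> str:
    """Walk the four-step severity decision tree for one observation."""
    # Step 1: was the primary project outcome achieved independently?
    if not achieved_project_outcome:
        return "Blocked (Project)"
    # Step 2: was any component task completely blocked?
    if task_blocked:
        return "Task Blocker"
    # Step 3: did a component fail, even if a workaround succeeded?
    if component_failed:
        return "Component Blocker"
    # Step 4: otherwise grade by the effort or difficulty involved.
    levels = {
        "major": "High Friction",
        "noticeable": "Medium Friction",
        "minor": "Low Friction",
    }
    if effort not in levels:
        raise ValueError(f"unknown effort level: {effort!r}")
    return levels[effort]
```

Because the checks run in order, a participant who completed the project but needed a workaround for one component is classified as a Component Blocker, whatever effort level is recorded.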

Learn more about severity thresholds

Six levels in detail, plus the three flavours of blocker (component, task, project) and how each one shapes the score.

For the curious

Technical deep dive

For those who want to understand the methodology at a deeper level, here's how the components fit together.

This is an empirical model, not a statistical one

We want to be honest about what our scores are and what they aren't.

Our scores are empirical. They reflect the experiences our participants actually had. Whether they could use the product. Whether they were blocked. What friction they hit and how severe it was. This is what makes the scores meaningful and actionable: they describe real human experiences of your product, captured across multiple data modalities.

They are not statistical in the sense of a rigorously sampled population study. The sample sizes are smaller than you'd use for hypothesis testing, and the goal is qualitative depth, not quantitative generalisation. A score of 73 isn't telling you "73% of your participants will succeed." It's telling you "across our diverse panel of participants, the pattern of experience we observed lands in the Nice range, with these specific friction points to address."

Why we don't rely on self-scores alone

Most user testing platforms ask participants to rate their experience (say, 1 to 5), average those ratings across participants, and publish that as the score. Participant self-ratings are valuable, but insufficient as a sole data source.

  • Adaptation bias. Participants with disabilities often adapt to poor experiences and rate them higher than warranted because "it's better than most sites" (sadly, what they're used to).
  • Expectation calibration. Participants may not know what "good" looks like if they've never experienced a genuinely accessible product. Without that benchmark, even mediocre experiences can read as positive.
  • Social desirability. Some participants avoid giving harsh ratings, particularly in facilitated sessions. This can be because they're pleased to be included in the research, because this is their first paid employment, or because they feel heard for the first time.
  • Task completion disconnect. A participant might rate satisfaction highly despite being blocked from key tasks. "I couldn't finish but the parts I could do were nice" produces a high rating that hides a blocker.
  • Severity blindness. Self-scores don't distinguish between a minor annoyance and a complete barrier. A 3-out-of-5 tells us nothing about whether the participant completed independently.

Our approach: self-scores form the baseline, but analyst-observed friction classifications provide the severity and type data that self-scores can't capture. Over time, additional data modalities (task completion rates, time-to-complete, error frequency) will further strengthen score validity.

Non-negotiable: we don't override participant voice. We exist to amplify the voices of people with lived experience. The solution isn't to ignore the survey score; it's to contextualise it against other evidence.

The principle: a product where any participant is completely blocked from achieving their goal cannot score in the Nice or Awesome bands, regardless of how many other participants had good experiences. This creates a clear prioritisation incentive: fix blockers first.
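One possible reading of the blocker principle is a ceiling on the score. The sketch below assumes the rule is applied as a cap at 60, the top of the Meh band; the principle is stated on this page, but the cap mechanism, the function name, and the constant are assumptions for illustration.

```python
# Top of the Meh band (41-60): the highest score that is still
# outside the Nice and Awesome bands.
MEH_CEILING = 60


def apply_blocker_cap(raw_score: float, any_participant_blocked: bool) -> float:
    """Keep a score out of Nice/Awesome if any participant was fully blocked.

    Assumed mechanism: a simple ceiling. Scores already at or below
    the ceiling are left unchanged.
    """
    if any_participant_blocked:
        return min(raw_score, MEH_CEILING)
    return raw_score
```

Under this reading, a product that would otherwise score 85 lands at 60 while a single participant remains blocked, which is what makes fixing blockers the first priority.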

Why we keep the survey

Surveys capture things participants don't naturally verbalise. When someone engages thoughtfully with a question, that's their explicit, considered judgement at a fixed moment in the experience, and it's genuinely valuable. We don't drop it; we contextualise it against the other modalities.

The triangulation principle: the richest insight isn't in any single modality. It's in the tension between what someone said (survey), what they did (behaviour), and what they spoke aloud (transcript). A participant who selects "that was fine" while their recording shows four minutes lost on a single page: that contradiction is the signal.

Adding more survey questions to clean up the data would make things worse, not better. The more questions participants face, the less meaningful engagement we get with any of them. The answer isn't more questions or fewer; it's better triangulation.

Why testers give one score per task, not separate accessibility and usability scores

One score in → two scores out. The tester rates their holistic experience once. The model separates it into accessibility and usability dimensions based on what was found in the transcript. Two end-of-project survey questions then calibrate the usability score only. The accessibility score is finalised from objective transcript evidence and never modified by sentiment.

Asking testers to separately rate accessibility and usability would introduce tester-level accessibility bias. A tester who has spent years developing strategies to navigate inaccessible forms will rate the same combo-box failure as 2-out-of-5 for accessibility. A less experienced tester rates it 1-out-of-5. The rating measures the tester's adaptation, not the product. This is the adaptation bias described above.

Accessibility and usability are also not cleanly separable in a live session. When a blind participant struggles with an enrolment form, is that an Accessibility issue (combo boxes don't work with screen readers) or an Excessive-effort issue (the form is too complicated)? From the tester's perspective, it's one experience. Asking them to split it introduces artificial precision.

The model resolves this by:

  • Accepting one task rating from the tester (their holistic experience, no categorisation required)
  • Using transcript analysis to identify the friction type (one of the seven categories above)
  • Routing deductions from Accessibility friction (and the critical-information subset of Comprehension friction) to the accessibility stream only; the accessibility score is finalised here, with no further modification
  • Routing the other five friction types (and the non-critical subset of Comprehension friction) to the usability stream only
  • Applying a survey-sentiment calibration to the usability stream only, using two end-of-project questions: "How well did this product meet your expectations?" and "How well were you able to navigate and access the product?"
  • Keeping the accessibility score tied to objective transcript evidence only; sentiment does not touch it

The model in plain English

We fuse multiple modalities of data:

  • Survey scores. What the participant said explicitly in their reflective survey response.
  • Transcript analysis. What the participant said out loud during the session: verbal feedback, confusion, frustration, relief, expressions of struggle.
  • Observed behaviour. What the participant actually did: hesitation, backtracking, repeated attempts, abandoned actions.
  • Time to complete. Relative to the participant's own baseline and their cohort's baseline.
  • Participant context. Cohort, assistive technology, experience level.

No single modality is enough on its own. Surveys alone miss the in-the-moment signal. Transcripts alone miss what participants don't verbalise. Time alone is noisy. The signal comes from triangulating across modalities.

Our ethical commitment

We are conservative in correcting participant voice. Any mechanism that reduces what a participant stated in their survey is triggered only by strong, corroborating evidence from other modalities. The threshold for correction is high, and every correction is explainable and auditable. A participant should be able to ask us why their stated score of 75 was adjusted, and we should be able to give them an answer they could understand and that wouldn't make them feel their voice was disregarded.

Confidence ratings

Every friction insight our LLM surfaces carries a confidence rating (0.0 to 1.0). We use confidence as a multiplier on how much weight an insight carries. A high-confidence severe friction (0.9) carries substantially more weight than a low-confidence one (0.4). There's a minimum confidence threshold (currently 0.7) below which a friction is noted but doesn't affect the score.
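As a rough sketch of confidence acting as a multiplier with a cut-off: the 0.7 threshold comes from the paragraph above, while the function name and the idea of a numeric `base_weight` per insight are assumptions for illustration.

```python
MIN_CONFIDENCE = 0.7  # current threshold stated above


def effective_weight(base_weight: float, confidence: float) -> float:
    """Scale an insight's weight by its confidence rating.

    `base_weight` is a hypothetical stand-in for whatever weight the
    insight's severity and type would otherwise carry.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0], got {confidence}")
    # Below the threshold the friction is noted but does not score.
    if confidence < MIN_CONFIDENCE:
        return 0.0
    return base_weight * confidence
```

Read this way, the 0.4 example sits below the current threshold, so it would be recorded but contribute nothing, while the 0.9 example keeps 90% of its weight.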

Friction-type weighting

The accessibility score is driven primarily by Accessibility friction and the critical-information subset of Comprehension friction. Other friction types don't feed the accessibility score directly. The usability score is driven by Confidence, Unresponsive interface, Unexpected behaviour, Content not found, Excessive effort, and the non-critical subset of Comprehension friction. Within each stream, severity levels carry different weight: a blocker drives the score dramatically lower; a low-severity friction registers but doesn't meaningfully change the score on its own.

Cohort-aware weighting

Cohorts experience friction differently. Comprehension friction for a participant who speaks English as a second language is often a direct language accessibility barrier. The same friction for a sighted, native English speaker is more likely to be a usability concern about interface clarity. Our model applies cohort-specific weightings to comprehension friction to reflect these structural differences.

Active development

This model is under active development. Current investments include longitudinal individual baselines, hesitation and filled-pause analysis, keyword and phrase detection, and refined confidence calibration through retrospective validation against expert review. A roadmap of planned enhancements will be published shortly.

Questions?

If you'd like to understand how your scores were calculated, which friction points drove them, or how we'd recommend acting on them, talk to your See Me Please contact. We can walk through any individual score and explain what contributed to it, because if we can't explain it, it shouldn't be in the score.

Accessibility v Usability – Knowledge Hub – See Me Please