What is the difference between accessibility and usability in digital products?

Accessibility asks whether a task can be completed without encountering a barrier — for example, whether every interactive element is announced by a screen reader, or every image has meaningful alt text. Usability asks how well the task can be completed — whether the flow is intuitive, the language is clear, and the interaction model makes sense. A product can be technically accessible yet practically frustrating to use, which is why See Me Please scores both dimensions separately.

What does an SMP Accessibility Score represent?

An SMP Accessibility Score reflects the proportion of task attempts that succeeded without a blocking accessibility barrier across a tested cohort. It is grounded in observed user sessions rather than code inspection, capturing barriers that WCAG does not address and issues that automated tools miss.

What are friction types in the SMP methodology?

Friction types classify the nature of barriers testers encounter. The SMP taxonomy includes Task Blockers (barriers that prevent task completion entirely), Component Blockers (barriers tied to a specific UI element), and a range of lower-severity friction that slows users down without stopping them. Consistent classification across cohorts means findings from different disability groups can be directly compared.

Methodology

Estimated reading time, 9 minutes · May 2026

Why can automated WCAG scanning tools not replace user testing?

Automated tools analyse code against technical criteria and detect approximately 30% of WCAG issues. They cannot detect whether a screen reader announces content in a logical order, whether a user with a cognitive disability can follow a multi-step flow, or whether error messages are clear enough to act on. User testing with disabled participants surfaces friction that no code analysis tool can identify.

Accessibility score: 81 out of 100, Awesome tier

Usability score: 21 out of 100, Ouch tier

Think of it like a kitchen drawer

Accessibility is whether you can open the drawer and reach what's inside. Can you tell what's in the drawer? Can you physically get the spoon?

Accessibility

When we talk about accessibility in user testing, we're asking:

Can people navigate independently? Without needing someone else to guide them or explain what to do.
Can they access enough information to make an informed decision? Not every word, but enough to understand what matters. In an insurance quote, that means understanding what's included, what your options are, and what it actually costs.
Is the information perceivable and understandable? This includes assistive technology compatibility, yes. But it also means: can they comprehend the language well enough to make an informed decision?

Accessibility isn't just about assistive technology. It's also about whether the participant can understand the critical information well enough to make a decision or access the service.

Usability

When we talk about usability, we're asking:

Is it seamless and intuitive? Does the experience flow naturally, or does it feel clunky?
Does it respond the way participants expect? When they click a button, does something happen? When they fill in a form, can they predict what comes next?
Do they have to find workarounds? Or do things work the first time, the obvious way?
Do they have to repeat actions? Click things multiple times? Re-enter information they've already given? Navigate in circles to find what they need?

How they overlap and where they don't

Accessibility and usability often travel together, but not always. We measure them separately because they're answering different questions.

An icon with no label might be perfectly usable for a sighted participant who recognises the shape, but it's inaccessible to someone using a screen reader because there's nothing for the assistive technology to announce. Conversely, a form might be intuitive and easy to navigate visually but completely inaccessible to someone using a screen reader if the underlying code doesn't support it properly.

Comprehension is where this gets interesting. Comprehension friction around critical information (what's included in a quote, what your options are, what something costs) has a big impact on accessibility. If you can't understand what you're choosing between, you can't make an informed decision, full stop. But struggling to parse dense legal language on a product disclosure statement? That's usability friction. The information is technically accessible; it's just not intuitive or easy to parse. The distinction matters, because it shapes where we route that signal in the scoring.

How we determine scores

We don't surface scores for individual participants to clients. That's not the point of what we do. Instead, we score:

Task-level scores. One usability score and one accessibility score for each specific task or component we test. Homepage navigation, login, account creation, form completion, checkout: each gets its own pair.
Project-level scores. One headline usability score and one headline accessibility score for the overall project. This is what a client sees at the top of their dashboard.

If a participant is blocked at a project-level dependency, the project score reflects that. It can drop to zero for that participant. But their blocked status doesn't contaminate task scores for later stages they didn't participate in. If they never got past the date picker, they don't have a meaningful experience of the checkout page, so they don't count in that task score. That way, every task score accurately reflects the experience of the people who actually attempted that task.
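
The exclusion rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the production model: the `Attempt` record, the 0 to 100 per-participant task scores, and the use of a plain mean as the aggregate are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Attempt:
    """One participant's experience of one task (hypothetical record shape)."""
    participant: str
    task: str
    score: float  # assumed 0-100 task-level score for this participant


def task_scores(attempts: list[Attempt]) -> dict[str, float]:
    """Aggregate per task, counting only participants who attempted that task."""
    by_task: dict[str, list[float]] = {}
    for attempt in attempts:
        by_task.setdefault(attempt.task, []).append(attempt.score)
    # A plain mean is an assumption; the source does not state the aggregate.
    return {task: mean(values) for task, values in by_task.items()}


attempts = [
    Attempt("p1", "date picker", 80),
    Attempt("p1", "checkout", 70),
    Attempt("p2", "date picker", 0),  # blocked here, so never reached checkout
]
scores = task_scores(attempts)
print(scores)
```

Because p2 has no checkout record, the checkout score is built from p1 alone, while the date picker score still reflects p2's blocked attempt.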

The four bands

Band	Range	What it means
Ouch	0–40	Significant barriers. Immediate attention required.
Meh	41–60	The basics are there. Notable gaps remain.
Nice	61–80	Good practices in place. Room for refinement.
Awesome	81–100	Excellent experience. Industry-leading.
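
As a small illustration, the band boundaries above map onto a lookup like the one below. The function name `band_for` is hypothetical, and the sketch assumes whole-number scores, since the source does not say how fractional scores are rounded.

```python
# Inclusive band boundaries, taken from the four bands described above.
BANDS = [
    (0, 40, "Ouch"),
    (41, 60, "Meh"),
    (61, 80, "Nice"),
    (81, 100, "Awesome"),
]


def band_for(score: int) -> str:
    """Return the band name for a whole-number score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for low, high, name in BANDS:
        if low <= score <= high:
            return name
    raise AssertionError("unreachable: the bands cover 0 to 100")


print(band_for(81), band_for(21))
```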

Why common classifications matter

Standardised classification turns individual observations into aggregate intelligence. Without common terminology, every project is an island. We can't spot patterns across participants, compare experiences across cohorts, prioritise fixes consistently, or benchmark how a product is improving over time.

Two dimensions, every observation

We classify friction across two dimensions:

The two dimensions used to classify every friction observation
Dimension	What it captures
Severity	How significantly did this impede the participant? (Blocker → Low)
Type	What was the nature of the impediment? (The category of friction experienced)

Every friction observation is tagged with both a severity level and a friction type. This lets us identify not just how bad issues are, but what kind of issues they are.

Types of friction we look for

Every time a participant hits a snag during testing, we classify it by type. Seven categories, each with its own fingerprint.

Comprehension. Can't understand what's being said or asked.
Confidence. Understands the screen but unsure what to do next.
Accessibility. Doesn't work with assistive technology or input method.
Unresponsive interface. Action taken, no response, or response delayed.
Unexpected behaviour. Interface responds in a way the participant didn't see coming.
Content not found. Can't find information they need to decide.
Excessive effort. Too many steps, clicks, or repeats.

Read the full friction taxonomy

All seven types, with definitions, examples, and the assistive-technology nuance that decides which signal we trust.

How friction types influence accessibility and usability scoring

Not every friction type affects both scores equally. Here's how we route them:

How each friction type routes to accessibility versus usability scores
Friction type	Primarily affects	Why
Comprehension	Accessibility and Usability	Critical information routes to accessibility. Non-critical content routes to usability.
Confidence	Usability	Uncertainty about what to do next is a usability issue.
Accessibility	Accessibility	Direct incompatibility with assistive technology. This is the core accessibility signal.
Unresponsive interface	Usability	Interface not responding is a usability breakdown.
Unexpected behaviour	Usability	Mismatch between expectation and outcome is usability.
Content not found	Usability	Navigation and findability are usability concerns.
Excessive effort	Usability	Cognitive and interaction overhead is usability friction.

Some friction types carry more weight than others within their stream. We come back to that in the scoring section.
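
The routing table above can be expressed as a small lookup. This is a sketch only: the `Stream` enum, the `route` function, and the `critical_information` flag are hypothetical names, and how an analyst decides that information is critical is outside the example.

```python
from enum import Enum


class Stream(Enum):
    ACCESSIBILITY = "accessibility"
    USABILITY = "usability"


# Friction types with a single destination, as listed in the routing table.
FIXED_ROUTES = {
    "Confidence": Stream.USABILITY,
    "Accessibility": Stream.ACCESSIBILITY,
    "Unresponsive interface": Stream.USABILITY,
    "Unexpected behaviour": Stream.USABILITY,
    "Content not found": Stream.USABILITY,
    "Excessive effort": Stream.USABILITY,
}


def route(friction_type: str, critical_information: bool = False) -> Stream:
    """Return the score stream that a friction observation feeds."""
    if friction_type == "Comprehension":
        # Comprehension splits: critical information goes to accessibility,
        # everything else goes to usability.
        return Stream.ACCESSIBILITY if critical_information else Stream.USABILITY
    return FIXED_ROUTES[friction_type]


print(route("Comprehension", critical_information=True).value)
print(route("Content not found").value)
```

Keeping Comprehension as the only branching case mirrors the table: it is the one type whose destination depends on what the participant failed to understand.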

How severe was the friction

We don't just classify what kind of friction happened. We also classify how badly it impacted the participant. Six levels, from "fully blocked" to "mild annoyance," plus a positive moment for delight.

Critical distinction: we differentiate between project-level outcomes (the overall goal) and task-level outcomes (component steps). A participant may achieve the project goal while still experiencing task-level blockers along the way.

Blocked (Project). Participant blocked from the entire project outcome.
Task Blocker. Blocked from one task; rest of the journey still works.
Component Blocker. Blocked from a specific component; task completed via a workaround.
High Friction. Major difficulty; multiple failed attempts or extended time.
Medium Friction. Noticeable delay, hesitation, or confusion.
Low Friction. Minor inconvenience noticed but didn't slow them down.
Positive. Participant impressed or delighted by an interaction.

How we decide which severity to apply

The decision tree below is what an analyst walks through for every friction observation. Each step's answer routes to either an outcome (a severity level) or the next question.

Step 1: Did the participant achieve the primary project outcome independently?
- No → Blocked (Project)
- Yes → Step 2

Step 2: Were any component tasks completely blocked?
- Yes → Task Blocker
- No → Step 3

Step 3: Did any component clearly fail or prove inaccessible (even if the participant worked around it)?
- Yes → Component Blocker
- No → Step 4

Step 4: How much effort or difficulty was required?
- Major difficulty, near-abandonment → High Friction
- Noticeable confusion or hesitation → Medium Friction
- Minor inconvenience → Low Friction
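
The four steps above translate directly into a chain of checks. The sketch below is illustrative rather than the analyst tooling itself: the `Observation` fields and the `Effort` enum are hypothetical stand-ins for the answers an analyst gives at each step.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKED_PROJECT = "Blocked (Project)"
    TASK_BLOCKER = "Task Blocker"
    COMPONENT_BLOCKER = "Component Blocker"
    HIGH = "High Friction"
    MEDIUM = "Medium Friction"
    LOW = "Low Friction"


class Effort(Enum):
    MAJOR = "major difficulty, near-abandonment"
    NOTICEABLE = "noticeable confusion or hesitation"
    MINOR = "minor inconvenience"


@dataclass(frozen=True)
class Observation:
    """Analyst answers for one friction observation (hypothetical fields)."""
    achieved_project_outcome: bool  # Step 1
    task_blocked: bool              # Step 2
    component_failed: bool          # Step 3
    effort: Effort                  # Step 4


def classify(obs: Observation) -> Severity:
    """Walk the decision tree from Step 1 to Step 4."""
    if not obs.achieved_project_outcome:
        return Severity.BLOCKED_PROJECT
    if obs.task_blocked:
        return Severity.TASK_BLOCKER
    if obs.component_failed:
        return Severity.COMPONENT_BLOCKER
    return {
        Effort.MAJOR: Severity.HIGH,
        Effort.NOTICEABLE: Severity.MEDIUM,
        Effort.MINOR: Severity.LOW,
    }[obs.effort]


workaround = Observation(True, False, True, Effort.MAJOR)
print(classify(workaround).value)
```

The order of the checks matters: a component that failed is classified as a Component Blocker even when the participant also reported major effort, because Step 3 is answered before Step 4.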

Learn more about severity thresholds

Six levels in detail, plus the three flavours of blocker (component, task, project) and how each one shapes the score.

For the curious

Technical deep dive

For those who want to understand the methodology at a deeper level, here's how the components fit together.

Why automated scanning isn't enough on its own

Automated WCAG scanners catch roughly 30% of accessibility issues. By design, they check code against technical criteria, so they can't tell you whether a screen reader announces content in a logical order, whether a user with a cognitive disability can follow a multi-step flow, or whether error messages are clear enough to act on. That's the gap user testing fills.

This is an empirical model, not a statistical one

We want to be honest about what our scores are and what they aren't.

Our scores are empirical. They reflect the experiences our participants actually had. Whether they could use the product. Whether they were blocked. What friction they hit and how severe it was. This is what makes the scores meaningful and actionable: they describe real human experiences of your product, captured across multiple data modalities.

They are not statistical in the sense of a rigorously sampled population study. The sample sizes are smaller than you'd use for hypothesis testing, and the goal is qualitative depth, not quantitative generalisation. A score of 73 isn't telling you "73% of your participants will succeed." It's telling you "across our diverse panel of participants, the pattern of experience we observed lands in the Nice range, with these specific friction points to address."

Why we don't rely on self-scores alone

Most user testing platforms ask participants to rate their experience (say, 1 to 5), average those ratings across participants, and publish that as the score. Participant self-ratings are valuable, but insufficient as a sole data source.

Adaptation bias. Participants with disabilities often adapt to poor experiences and rate them higher than warranted because "it's better than most sites" (sadly, what they're used to).
Expectation calibration. Participants may not know what "good" looks like if they've never experienced a genuinely accessible product. Without that benchmark, even mediocre experiences can read as positive.
Social desirability. Some participants avoid giving harsh ratings, particularly in facilitated sessions. This can be because they're pleased to be included in the research, because this is their first paid employment, or because they feel heard for the first time.
Task completion disconnect. A participant might rate satisfaction highly despite being blocked from key tasks. "I couldn't finish but the parts I could do were nice" produces a high rating that hides a blocker.
Severity blindness. Self-scores don't distinguish between a minor annoyance and a complete barrier. A 3-out-of-5 tells us nothing about whether the participant completed independently.

Our approach: self-scores form the baseline, but analyst-observed friction classifications provide the severity and type data that self-scores can't capture. Over time, additional data modalities (task completion rates, time-to-complete, error frequency) will further strengthen score validity.

Non-negotiable: we don't override participant voice. We exist to amplify the voices of people with lived experience. The solution isn't to ignore the survey score; it's to contextualise it against other evidence.

The principle: a product where any participant is completely blocked from achieving their goal cannot score in the Nice or Awesome bands, regardless of how many other participants had good experiences. Clear prioritisation incentive: fix blockers first.
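
One way to read this principle is as a ceiling on the headline score. The sketch below assumes the ceiling sits at 60, the top of the Meh band. The source only says the score cannot land in Nice or Awesome, so the exact mechanism, the constant, and the `apply_blocker_cap` name are all assumptions.

```python
# Top of the Meh band (41 to 60); anything higher would be Nice or Awesome.
# Placing the ceiling exactly here is an assumption made for illustration.
MEH_CEILING = 60


def apply_blocker_cap(raw_score: float, any_participant_blocked: bool) -> float:
    """Keep a score out of the Nice and Awesome bands if anyone was fully blocked."""
    if any_participant_blocked:
        return min(raw_score, MEH_CEILING)
    return raw_score


print(apply_blocker_cap(85, any_participant_blocked=True))
print(apply_blocker_cap(85, any_participant_blocked=False))
```

A score that is already below the ceiling passes through unchanged, so the cap only ever removes credit that a blocked participant would contradict.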

Why we keep the survey

Surveys capture things participants don't naturally verbalise. When someone engages thoughtfully with a question, that's their explicit, considered judgement at a fixed moment in the experience, and it's genuinely valuable. We don't drop it; we contextualise it against the other modalities.

The triangulation principle: the richest insight isn't in any single modality. It's in the tension between what someone said (survey), what they did (behaviour), and what they spoke aloud (transcript). A participant who selects "that was fine" while their recording shows four minutes lost on a single page: that contradiction is the signal.

Adding more survey questions to clean up the data would make things worse, not better. The more questions participants face, the less meaningful engagement we get with any of them. The answer isn't more questions or fewer; it's better triangulation.

Why testers give one score per task, not separate accessibility and usability scores

One score in → two scores out. The tester rates their holistic experience once. The model separates it into accessibility and usability dimensions based on what was found in the transcript. Two end-of-project survey questions then calibrate the usability score only. The accessibility score is finalised from objective transcript evidence and never modified by sentiment.

Asking testers to separately rate accessibility and usability would introduce tester-level accessibility bias. A tester who has spent years developing strategies to navigate inaccessible forms will rate the same combo-box failure as 2-out-of-5 for accessibility. A less experienced tester rates it 1-out-of-5. The rating measures the tester's adaptation, not the product. This is the adaptation bias described above.

Accessibility and usability are also not cleanly separable in a live session. When a blind participant struggles with an enrolment form, is that an Accessibility issue (combo boxes don't work with screen readers) or an Excessive-effort issue (the form is too complicated)? From the tester's perspective, it's one experience. Asking them to split it introduces artificial precision.

The model resolves this by:

Accepting one task rating from the tester (their holistic experience, no categorisation required)
Using transcript analysis to identify the friction type (one of the seven categories above)
Routing Accessibility-friction deductions to the accessibility stream only; the accessibility score is finalised here, with no further modification
Routing the other six friction types (and the non-critical subset of Comprehension friction) to the usability stream only
Applying a survey-sentiment calibration to the usability stream only, using two end-of-project questions: "How well did this product meet your expectations?" and "How well were you able to navigate and access the product?"
The accessibility score reflects objective transcript evidence only. Sentiment does not touch it.

The model in plain English

We fuse multiple modalities of data:

Survey scores. What the participant said explicitly in their reflective survey response.
Transcript analysis. What the participant said out loud during the session: verbal feedback, confusion, frustration, relief, expressions of struggle.
Observed behaviour. What the participant actually did: hesitation, backtracking, repeated attempts, abandoned actions.
Time to complete. Relative to the participant's own baseline and their cohort's baseline.
Participant context. Cohort, assistive technology, experience level.

No single modality is enough on its own. Surveys alone miss the in-the-moment signal. Transcripts alone miss what participants don't verbalise. Time alone is noisy. The signal comes from triangulating across modalities.

Our ethical commitment

We are conservative in correcting participant voice. Any mechanism that reduces what a participant stated in their survey is triggered only by strong, corroborating evidence from other modalities. The threshold for correction is high, and every correction is explainable and auditable. A participant should be able to ask us why their stated score of 75 was adjusted, and we should be able to give them an answer they could understand and that wouldn't make them feel their voice was disregarded.

Confidence ratings

Every friction insight our LLM surfaces carries a confidence rating (0.0 to 1.0). We use confidence as a multiplier on how much weight an insight carries. A high-confidence severe friction (0.9) carries substantially more weight than a low-confidence one (0.4). There's a minimum confidence threshold (currently 0.7) below which a friction is noted but doesn't affect the score.
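
The multiplier and the threshold can be combined as in the sketch below. Only the 0.7 threshold and the idea of confidence as a multiplier come from the text; the severity weights are invented purely so the example has something to multiply, and `effective_weight` is a hypothetical name.

```python
MIN_CONFIDENCE = 0.7  # current minimum confidence threshold stated above

# Hypothetical severity weights, for illustration only.
SEVERITY_WEIGHT = {
    "High Friction": 10.0,
    "Medium Friction": 5.0,
    "Low Friction": 1.0,
}


def effective_weight(severity: str, confidence: float) -> float:
    """Scale a severity weight by confidence; below the threshold it counts for nothing."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0.0 and 1.0, got {confidence}")
    if confidence < MIN_CONFIDENCE:
        # The friction is still noted elsewhere, but it does not affect the score.
        return 0.0
    return SEVERITY_WEIGHT[severity] * confidence


print(effective_weight("High Friction", 0.9))
print(effective_weight("High Friction", 0.4))
```

Under the current threshold, a 0.4-confidence insight contributes nothing at all, so the multiplier only differentiates insights rated 0.7 and above.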

Friction-type weighting

The accessibility score is driven primarily by Accessibility friction and the critical-information subset of Comprehension friction. Other friction types don't feed the accessibility score directly. The usability score is driven by Confidence, Unresponsive interface, Unexpected behaviour, Content not found, Excessive effort, and the non-critical subset of Comprehension friction. Within each stream, severity levels carry different weight: a blocker drives the score dramatically lower; a low-severity friction registers but doesn't meaningfully change the score on its own.

Cohort-aware weighting

Cohorts experience friction differently. Comprehension friction for a participant who speaks English as a second language is often a direct language accessibility barrier. The same friction for a sighted, native English speaker is more likely to be a usability concern about interface clarity. Our model applies cohort-specific weightings to comprehension friction to reflect these structural differences.

Active development

This model is under active development. Current investments include longitudinal individual baselines, hesitation and filled-pause analysis, keyword and phrase detection, and refined confidence calibration through retrospective validation against expert review. A roadmap of planned enhancements will be published shortly.

Questions?

If you'd like to understand how your scores were calculated, which friction points drove them, or how we'd recommend acting on them, talk to your See Me Please contact. We can walk through any individual score and explain what contributed to it, because if we can't explain it, it shouldn't be in the score.
