
Why Realistic Scenarios Matter More Than More AI – Above the Law

Legal AI is often evaluated by scale. Bigger models. More data. Longer lists of capabilities. Demos emphasize volume: how many questions a system can answer, how many issues it can spot, how fast it can respond.

That framing misses the real constraint.

The problem with most legal AI tools is not that they are insufficiently powerful. It is that they are insufficiently grounded in realistic scenarios. More AI does not compensate for shallow context.

This became clear during a series of empirical classroom pilots run through Product Law Hub using an AI-based legal coach called Frankie. The pilots were designed to observe how users engage with AI while learning judgment-based legal skills. The findings draw on quantitative engagement data and qualitative interviews conducted during and after the course.

The signal was consistent. Fewer, richer scenarios produced deeper engagement, stronger reasoning, and higher trust than high-volume question sets ever did.


Volume Looks Impressive. Scenarios Do The Work.

In demos, volume is persuasive. A system that can answer dozens of questions in seconds feels powerful. Buyers infer competence from speed and breadth.

In the classroom, that illusion collapsed quickly.

When students were presented with large numbers of short, repetitive prompts, engagement dropped. Sessions shortened. Follow-up questions declined. Interviews revealed a common reaction: the interactions felt mechanical, even when the content was correct.

By contrast, when students were given fewer scenarios with richer context, they stayed longer and worked harder. They revisited assumptions, asked clarifying questions, and refined their analysis. The difference was not the sophistication of the model. It was the quality of the situation.


Ambiguity Invites Judgment

The most effective scenarios shared a common feature. They were ambiguous.

Exercises that included stakeholder disagreement, incomplete information, or competing incentives consistently outperformed cleaner hypotheticals. Students leaned in when they had to decide what mattered, not when they were asked to identify what applied.

Quantitative data showed higher completion rates and longer session times for these scenarios. Qualitative interviews confirmed that students found them more credible and more useful. They felt closer to real work.

Legal judgment does not emerge from clean facts. It emerges from tension. AI that avoids ambiguity to simplify interactions undermines the very skill it claims to support.


Repetition Erodes Trust Faster Than Difficulty

One of the more counterintuitive findings was how users responded to difficulty versus repetition. Hard problems did not drive disengagement. Repetitive ones did.

When scenarios reused the same structure or language, users quickly lost trust. Even minor variations felt shallow. The system appeared inattentive, as though it were pattern-matching rather than reasoning.

In contrast, users tolerated complexity and uncertainty when the scenario felt authentic. They did not expect the AI to make the problem easier. They expected it to take the problem seriously.

This distinction matters for buyers evaluating tools. A demo that showcases dozens of similar questions may signal capability, but it does not predict sustained use.


Realism Is Not About Polish

It is tempting to equate realism with polish. Better UX. Cleaner flows. More reassuring language. The pilot suggests the opposite.

Realism came from friction. Stakeholders who disagreed. Constraints that could not be optimized away. Tradeoffs that had no clean resolution. When the AI engaged with those elements instead of smoothing them over, users trusted it more.

This mirrors real legal work. Lawyers trust colleagues who acknowledge uncertainty and wrestle with it. They distrust those who offer tidy answers to messy problems.

AI that prioritizes smoothness over substance feels less credible, not more.


Scenario Quality Shapes Learning And Trust

The classroom setting made visible something that is harder to detect in practice. Scenario quality shapes not just learning outcomes, but trust in the system itself.

When scenarios felt generic, users disengaged cognitively. When scenarios felt grounded, users attributed more intelligence to the system, even when its responses were constrained.

Trust followed attention. Systems that appeared to understand the situation earned credibility. Systems that recycled patterns lost it.

This has implications beyond education. In firms, scenario quality influences whether lawyers treat AI as a serious tool or a novelty. High-volume outputs cannot compensate for shallow context.


Why Buyers Should Rethink Evaluation Criteria

Legal tech buyers often ask how many use cases a tool supports. A better question is how well it handles one difficult case.

The Product Law Hub pilot suggests that depth beats breadth when it comes to judgment-based work. Tools that invest in realistic, high-fidelity scenarios deliver more value than tools that chase coverage.

That may require different procurement thinking. Scenario design is harder to evaluate than feature lists. It does not demo well in five minutes. But it predicts long-term usefulness far better than model size.


The Quiet Cost Of Shallow Scenarios

The cost of shallow scenarios is not just wasted time. It is missed development.

Junior lawyers do not build judgment by answering dozens of simplified questions. They build it by grappling with realistic situations that force prioritization and explanation. AI that substitutes volume for realism accelerates output without accelerating growth.

The classroom data made this visible early. In practice, the cost shows up later as stalled development and diminished confidence.


The Takeaway Vendors Do Not Want To Hear

The uncomfortable takeaway from the pilot is that scenario design matters more than AI sophistication. Bigger models will not fix shallow context. Faster answers will not build judgment.

Legal AI that succeeds will not be defined by how much it can do, but by how well it can inhabit realistic situations and resist the urge to oversimplify them.

More AI is easy to sell. Better scenarios are harder to build. The data suggests they are worth the effort.




Olga V. Mack is the CEO of TermScout, where she builds legal systems that make contracts faster to understand, easier to operate, and more trustworthy in real business conditions. Her work focuses on how legal rules allocate power, manage risk, and shape decisions under uncertainty.

A serial CEO and former General Counsel, Olga previously led a legal technology company through acquisition by LexisNexis. She teaches at Berkeley Law and is a Fellow at CodeX, the Stanford Center for Legal Informatics.

She has authored several books on legal innovation and technology, delivered six TEDx talks, and her insights regularly appear in Forbes, Bloomberg Law, VentureBeat, TechCrunch, and Above the Law. Her work treats law as essential infrastructure, designed for how organizations actually operate.