
Harvey, the legal AI company whose valuation recently hit $11 billion, has released what it is calling the Legal Agent Benchmark, or LAB: an open-source evaluation framework designed to measure how well AI agents can perform extended, real-world legal work rather than the discrete reasoning tasks that have dominated legal AI benchmarks to date.

Announced May 6 in a post by Harvey researchers Niko Grupen, Gabe Pereyra (Harvey’s cofounder), and Julio Pereyra, the first version of LAB contains more than 1,200 tasks spanning 24 legal practice areas, graded against more than 75,000 expert-written rubric criteria. The code and a portion of the dataset are available on GitHub.

“The goal of LAB is to provide a clear picture of how agents can be deployed to support legal work in the real world,” the researchers write. “By articulating where agents can do all, some, or none of a task, LAB helps law firms measure the ROI of AI investments and where such investments can augment their teams’ work.”

Notably, Harvey is launching LAB without a leaderboard. The company says it will work with research partners over the coming weeks to produce baseline results and publish standards for normalizing submissions before any rankings appear.

“We’re intentionally launching LAB without a leaderboard because we expect the dataset to evolve over time and we want to work with the community to ensure results are clear and intuitive in how they convey agent performance,” Harvey says.

What LAB Tests

In creating LAB, Harvey says that existing legal AI benchmarks (including LegalBench, CUAD, LEXam, and Harvey’s own earlier BigLaw Bench) measure short-horizon reasoning, such as the ability to read a contract, answer a question, compare cases, or analyze an argument. LAB is meant to measure something closer to the unit of work that actually gets delegated inside a law firm.

Each LAB task is structured around four elements that mirror an associate’s assignment:

- An instruction written as a partner-to-associate request: short (averaging 50 words) and framed as what’s needed rather than how to produce it.
- An environment built as a client matter, with a closed universe of documents that the agent must sort through. Materials include both relevant files and peripheral ones the agent has to learn to ignore.
- An output that has to be reviewable legal work product, not just an answer.
- Verification through expert rubrics that break the deliverable into atomic pass/fail criteria covering facts, conclusions, citations, severity ratings, recommendations, deadlines, dollar amounts, and formatting.
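
For readers who think in data structures, the four elements can be pictured roughly as follows. This is only an illustrative sketch: the class names, field names, and file names are invented here and are not taken from Harvey's repository or dataset format.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One atomic pass/fail check from an expert rubric (hypothetical shape)."""
    description: str
    category: str  # illustrative labels, e.g. "fact", "citation", "deadline"


@dataclass
class Task:
    """Hypothetical container for the four elements described above."""
    instruction: str             # short partner-to-associate request
    documents: dict[str, bool]   # file name -> is it relevant to the task?
    deliverable: str             # the expected work product, e.g. "memo"
    rubric: list[Criterion] = field(default_factory=list)

    def peripheral_documents(self) -> list[str]:
        """Files in the matter that the agent should learn to ignore."""
        return sorted(name for name, relevant in self.documents.items() if not relevant)


# Invented example data, for illustration only.
task = Task(
    instruction="Review the data room and flag change-of-control provisions.",
    documents={
        "supply_agreement.pdf": True,
        "office_lease.pdf": True,
        "holiday_party_memo.pdf": False,
    },
    deliverable="memo",
    rubric=[Criterion("Identifies the consent requirement in the lease", "fact")],
)

print(task.peripheral_documents())
```

The point of the sketch is the shape, not the details: a short instruction, a closed set of documents with some noise mixed in, a named deliverable, and a list of independent checks.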

To illustrate the structure, Harvey uses a fictional corporate M&A example. It involves a $458 million all-equity acquisition of Crestview Software Solutions in which the agent must review a virtual data room containing eight material contracts plus adjacent documents such as a 10-K and a deferred compensation plan, identify change-of-control provisions across the matter, assess deal risk, recommend next steps, and produce a draft memorandum for the deal team and board. The rubric for that single task contains 57 criteria covering nine legal issues planted across the materials.

LAB uses what Harvey calls “all-pass” grading, meaning that a task is marked complete only if every rubric criterion passes. There is no partial credit. The rationale is that a deal memo that catches eight of 10 material risks is not 80% useful. One missed issue could blow up the transaction or surface as a problem post-closing.
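
The difference between all-pass grading and a partial-credit score is easy to see in a few lines of code. The sketch below is not Harvey's grader; the function names are made up for illustration, and treating an empty rubric as a failure is an assumption on my part.

```python
def all_pass(results: list[bool]) -> bool:
    """All-pass grading: complete only if every criterion passes.

    Assumption: an empty rubric counts as not complete.
    """
    return bool(results) and all(results)


def pass_rate(results: list[bool]) -> float:
    """Partial-credit view: the fraction of criteria that passed."""
    return sum(results) / len(results) if results else 0.0


# A memo that catches eight of 10 material risks.
memo = [True] * 8 + [False] * 2

print(pass_rate(memo))  # 0.8
print(all_pass(memo))   # False
```

Under a partial-credit score, that memo looks like a solid B. Under all-pass grading, it is simply not done, which is closer to how a supervising partner would treat it.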

The 24 practice areas in the initial release span transactional, advisory, regulatory and litigation work. Harvey says future versions will expand within those areas, add new practices, and eventually move beyond law firms to in-house legal work and adjacent professional services like asset management and banking.

Why a Benchmark?

Harvey’s thesis is that benchmarks have served as leading indicators of capability inflection points in other agentic domains, most visibly in software engineering, where benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 tracked the shift that AI researcher Andrej Karpathy summarized by saying coding agents “basically didn’t work before December and basically work since.” Harvey argues that similar benchmarks (GDPval, OSWorld-Verified, BrowseComp, FinanceAgent, and others) are now extending legibility to knowledge work, web research, financial analysis and professional services. Harvey positions LAB as the legibility layer for legal agents.

The use case Harvey describes for law firms is straightforward: identify the workflows where agents perform well enough to be delegated under a “review pattern,” identify the workflows where they don’t and need to stay heavily human-in-the-loop, and make deployment and ROI decisions accordingly. For most firms, that may matter more than technical details.

The legal industry has spent two years cycling through vendor demos and pilot programs without a shared way to answer the question every managing partner and innovation lead is being asked: Where, specifically, can we put these things to work? A credible, public benchmark, particularly one structured around actual deliverables rather than multiple-choice questions, could change that conversation. Of course, it could also complicate it, by revealing how far agents still are from autonomous practice in many areas.

Practical Applications of LAB

To my mind, a few practical applications of LAB jump out:

- For law firms, LAB offers a reference point for vendor evaluation. A firm evaluating competing products could, in theory, ask each vendor to report performance on specific LAB practice areas and compare results, rather than rely on vendor demos and case studies.
- For vendors, LAB offers a public yardstick for claims about agent capability. Harvey has acknowledged contributions from a substantial list of labs and companies (including Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain, Fireworks, Snorkel, Mercor, and Stanford LIFTLab), which suggests the major frontier labs see value in a shared evaluation context for legal agents.
- For researchers, LAB provides a longer-horizon, domain-specific task set that they can use for evaluation, fine-tuning and post-training work.
- For legal journalists and analysts, LAB could provide something more useful than vendor-supplied claims about their products: a way of actually putting those claims to the test.

The Bottom Line

It is worth noting that LAB is a benchmark built by a market participant. Harvey is a dominant and well-funded legal AI vendor, and the company has not been shy about its commercial positioning. The tasks and definitions of “legal work product” within LAB reflect choices about what good legal work looks like, and those choices were made by Harvey’s team in consultation with its research partners. None of that makes the benchmark unreliable, but it is something the legal community needs to keep in mind going forward.

There is also the question of what impact “open source” actually has in this context. In a post at Alt-Counsel, Houfu Ang argues that legal open source is not really a community but rather “a federation of solo-author archipelagos.” He points specifically to projects that come from well-funded vendors such as Harvey, whose repositories are maintained almost exclusively by in-house staff in what the Open Source Initiative calls “Open Source theatre.” Virtually none of these, Ang argues, graduate from individual showcase to sustained codebase with outside contributors.

Even so, LAB is the most ambitious public attempt yet to measure what legal AI agents can actually do on the kind of work law firms actually delegate. Whether it becomes the shared yardstick Harvey wants it to be will depend on how the leaderboard rolls out, how transparently submissions are normalized, and how much room the project leaves for outside contributors to shape what gets measured.
