Some Thoughts On Harvey’s Launch of ‘LAB,’ An Open-Source, Long-Horizon Benchmark for Legal AI Agents

Harvey, the legal AI company whose valuation recently hit $11 billion, has released what it is calling the Legal Agent Benchmark, or LAB, an open-source evaluation framework designed to measure how well AI agents can perform extended, real-world legal work rather than the discrete reasoning tasks that have dominated legal AI benchmarks to date.

Announced May 6 in a post by Harvey researchers Niko Grupen, Gabe Pereyra (Harvey’s cofounder), and Julio Pereyra, the first version of LAB contains more than 1,200 tasks spanning 24 legal practice areas, graded against more than 75,000 expert-written rubric criteria. The code and a portion of the dataset are available on GitHub.

“The goal of LAB is to provide a clear picture of how agents can be deployed to support legal work in the real world,” the researchers write. “By articulating where agents can do all, some, or none of a task, LAB helps law firms measure the ROI of AI investments and where such investments can augment their teams’ work.”

Notably, Harvey is launching LAB without a leaderboard. The company says it will work with research partners over the coming weeks to produce baseline results and publish standards for normalizing submissions before any rankings appear.

“We’re intentionally launching LAB without a leaderboard because we expect the dataset to evolve over time and we want to work with the community to ensure results are clear and intuitive in how they convey agent performance,” Harvey says.

What LAB Tests

In creating LAB, Harvey says that existing legal AI benchmarks (including LegalBench, CUAD, LEXam, and Harvey’s own earlier BigLaw Bench) measure short-horizon reasoning, such as the ability to read a contract, answer a question, compare cases, or analyze an argument. LAB is meant to measure something closer to the unit of work that actually gets delegated inside a law firm.

Each LAB task is structured around four elements that mirror an associate’s assignment:

  • An instruction written as a partner-to-associate request: short (averaging 50 words) and framed as what’s needed rather than how to produce it.
  • An environment built as a client matter, with a closed universe of documents that the agent must sort through. Materials include both relevant files and peripheral ones the agent has to learn to ignore.
  • An output that has to be reviewable legal work product, not just an answer.
  • Verification through expert rubrics that break the deliverable into atomic pass/fail criteria covering facts, conclusions, citations, severity ratings, recommendations, deadlines, dollar amounts, and formatting.
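For readers who think in data structures, the four elements above can be pictured as a simple record. The following is only an illustrative sketch: the class names, field names, file names, and sample text are all hypothetical and are not taken from Harvey's published code or dataset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One atomic pass/fail check from an expert rubric (hypothetical shape)."""
    category: str     # e.g. "facts", "citations", "deadlines"
    description: str  # what the grader looks for in the deliverable


@dataclass
class Task:
    """Hypothetical container for the four elements of a LAB-style task."""
    instruction: str           # short partner-to-associate request
    documents: dict[str, str]  # closed universe of files: name -> text
    deliverable: str           # the kind of work product expected back
    rubric: list[Criterion] = field(default_factory=list)

    def instruction_word_count(self) -> int:
        """Count whitespace-separated words in the instruction."""
        return len(self.instruction.split())


# Made-up example data, for illustration only.
task = Task(
    instruction="Please review the attached lease and summarize the key deadlines for the client.",
    documents={
        "lease_agreement.txt": "...",     # relevant file
        "parking_memo_2019.txt": "...",   # peripheral file the agent should ignore
    },
    deliverable="client summary memorandum",
    rubric=[
        Criterion("deadlines", "States the renewal notice deadline"),
        Criterion("formatting", "Presents deadlines in a dated list"),
    ],
)

print(task.instruction_word_count(), len(task.documents), len(task.rubric))
```

The point of the sketch is that the instruction says what is needed, the documents define everything the agent may consult, and the rubric, not the instruction, carries the detailed standard the deliverable is held to.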

To illustrate the structure, Harvey uses a fictional corporate M&A example. It involves a $458 million all-equity acquisition of Crestview Software Solutions in which the agent must review a virtual data room containing eight material contracts plus adjacent documents such as a 10-K and a deferred compensation plan, identify change-of-control provisions across the matter, assess deal risk, recommend next steps, and produce a draft memorandum for the deal team and board. The rubric for that single task contains 57 criteria covering nine legal issues planted across the materials.

LAB uses what Harvey calls “all-pass” grading, meaning that a task is marked complete only if every rubric criterion passes. There is no partial credit. The rationale is that a deal memo that catches eight of 10 material risks is not 80% useful. One missed issue could blow up the transaction or surface as a problem post-closing.
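To make the difference between all-pass grading and partial credit concrete, here is a minimal sketch. It assumes each rubric criterion has already been judged as a simple true/false value; the function names are hypothetical and this is not Harvey's grading code.

```python
def all_pass(results: list[bool]) -> bool:
    """A task counts as complete only if every rubric criterion passed.

    An empty rubric is treated as not complete (an assumption of this sketch).
    """
    return len(results) > 0 and all(results)


def pass_fraction(results: list[bool]) -> float:
    """Share of criteria passed; shown only for contrast with all-pass."""
    return sum(results) / len(results) if results else 0.0


# A memo that catches eight of 10 material risks.
memo = [True] * 8 + [False] * 2

print(pass_fraction(memo))  # 0.8 under partial credit
print(all_pass(memo))       # False under all-pass grading
```

Under partial credit the memo would look 80% done; under all-pass it simply does not count as complete, which is the behavior the article describes.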

The 24 practice areas in the initial release span transactional, advisory, regulatory and litigation work. Harvey says future versions will expand within those areas, add new practices, and eventually move beyond law firms to in-house legal work and adjacent professional services like asset management and banking.

Why a Benchmark?

Harvey’s thesis is that benchmarks have served as leading indicators of capability inflection points in other agentic domains, most visibly in software engineering, where benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 tracked the shift that AI researcher Andrej Karpathy summarized by saying coding agents “basically didn’t work before December and basically work since.”

Harvey argues that similar benchmarks (GDPval, OSWorld-Verified, BrowseComp, FinanceAgent, and others) are now extending legibility to knowledge work, web research, financial analysis and professional services.

Harvey positions LAB as the legibility layer for legal agents. The use case Harvey describes for law firms is straightforward: identify the workflows where agents perform well enough to be delegated under a “review pattern,” identify the workflows where they don’t and need to stay heavily human-in-the-loop, and make deployment and ROI decisions accordingly.

For most firms, that may matter more than technical details. The legal industry has spent two years cycling through vendor demos and pilot programs without a shared way to answer the question every managing partner and innovation lead is being asked: Where, specifically, can we put these things to work?

A credible, public benchmark, particularly one structured around actual deliverables rather than multiple-choice questions, could change that conversation. Of course, it could also complicate it, by revealing how far agents still are from autonomous practice in many areas.

Practical Applications of LAB

To my mind, a few practical applications of LAB jump out:

  • For law firms, LAB offers a reference point for vendor evaluation. A firm evaluating competing products could, in theory, ask each vendor to report performance on specific LAB practice areas and compare results, rather than rely on vendor demos and case studies.
  • For vendors, LAB offers a public yardstick for claims about agent capability. Harvey has acknowledged contributions from a substantial list of labs and companies (including Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain, Fireworks, Snorkel, Mercor, and Stanford LIFTLab), which suggests the major frontier labs see value in a shared evaluation context for legal agents.
  • For researchers, LAB provides a longer-horizon, domain-specific task set that they can use for evaluation, fine-tuning and post-training work.
  • For legal journalists and analysts, LAB could provide something more useful than vendor-supplied claims about their products: a way of actually putting those claims to the test.

The Bottom Line

It is worth noting that LAB is a benchmark built by a market participant. Harvey is a dominant and well-funded legal AI vendor, and the company has not been shy about its commercial positioning.

The tasks and definitions of “legal work product” within LAB reflect choices about what good legal work looks like, and those choices were made by Harvey’s team in consultation with its research partners. None of that makes the benchmark unreliable, but it is something the legal community needs to keep in mind going forward.

There is also the question of what impact “open source” actually has in this context. In a post at Alt-Counsel, Houfu Ang argues that legal open source is not really a community but rather “a federation of solo-author archipelagos.”

He points specifically to projects that come from well-funded vendors such as Harvey, whose repositories are maintained almost exclusively by in-house staff in what the Open Source Initiative calls “Open Source theatre.” Virtually none of these, Ang argues, graduate from individual showcase to sustained codebase with outside contributors.

Even so, LAB is the most ambitious public attempt yet to measure what legal AI agents can actually do on the kind of work law firms actually delegate. Whether it becomes the shared yardstick Harvey wants it to be will depend on how the leaderboard rolls out, how transparently submissions are normalized, and how much room the project leaves for outside contributors to shape what gets measured.