How speed and accuracy benchmarks misrepresent the real value of legal AI

Welcome to the era of the AI superlative. While the first two years of generative artificial intelligence (GenAI) development were an all-out sprint to create new models, establish proof-of-concept solutions, and define optimal use cases to deliver increased efficiency and better work product to clients, the next phase in the AI lifecycle will be dominated by marketing as well. Product claims of the fastest, most accurate large language model (LLM) or “hallucination-free” results have entered the marketplace.

As more companies develop AI solutions and start-ups seek capital investment in an increasingly crowded field, customers will seek benchmarks to evaluate the efficacy of these tools. For benchmarks to be valuable, they must test real-world problems that legal professionals face and measure what customers care about.

The challenge is that one-dimensional metrics do not offer a reliable representation of the real value of GenAI in the legal research process. No LLM-based legal research product on the market today provides answers with 100% accuracy, so users must engage in a two-step process of 1) getting the answer and 2) checking the answer for accuracy. It’s the end result of this two-step process that matters. Benchmarking just part of this process does not provide useful information, unless there is a part of the process that is completely broken.

In drag racing, cars need to accelerate as fast as they can and then brake quickly. For braking, they typically deploy a parachute behind the car to increase drag, in addition to traditional braking methods. What drag racers care about is how quickly and safely the car brakes. If we wanted to benchmark different braking systems, we’d test them from the time of deployment to the time the car stopped and measure time and distance. Imagine instead benchmarking braking systems only by measuring how fast the parachutes deployed.

Similarly, with a research product where all answers must be checked, what matters most is how quickly and accurately researchers can get to the end of that process. For instance, which legal research system would you prefer? One where:

a) LLM-generated answers are accurate 95% of the time, and researchers, on average, can verify accuracy within 25 minutes and get to an accurate answer 97% of the time, or

b) LLM-generated answers are accurate 85% of the time, and researchers, on average, can verify accuracy within 15 minutes and get to an accurate answer 100% of the time.

Since every researcher must complete this two-step process every time, Option B is clearly better: its end result arrives faster and is more accurate. So why would we benchmark only the first part of the process?
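
The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResearchSystem` class, the two ranking functions, and the `minutes_per_correct_answer` metric are hypothetical names invented for this example, and ranking by final accuracy first and verification time second is one possible way to score the whole process, not an established benchmark. The figures are the ones given for Options A and B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchSystem:
    # Hypothetical container for the three figures quoted in the example.
    name: str
    llm_accuracy: float    # share of LLM answers that are accurate before checking
    verify_minutes: float  # average minutes a researcher needs to verify an answer
    final_accuracy: float  # share of answers that are accurate after verification


def rank_by_llm_accuracy(systems):
    """One-dimensional benchmark: scores only the first step of the process."""
    return sorted(systems, key=lambda s: s.llm_accuracy, reverse=True)


def rank_end_to_end(systems):
    """Two-step benchmark (assumed scoring): highest final accuracy first,
    ties broken by the shortest verification time."""
    return sorted(systems, key=lambda s: (-s.final_accuracy, s.verify_minutes))


def minutes_per_correct_answer(system):
    """Assumed summary metric: verification minutes spent per accurate end result."""
    return system.verify_minutes / system.final_accuracy


option_a = ResearchSystem("Option A", 0.95, 25, 0.97)
option_b = ResearchSystem("Option B", 0.85, 15, 1.00)
systems = [option_a, option_b]

print(rank_by_llm_accuracy(systems)[0].name)  # Option A
print(rank_end_to_end(systems)[0].name)       # Option B
for s in systems:
    print(f"{s.name}: {minutes_per_correct_answer(s):.1f} minutes per accurate answer")
# Option A: 25.8 minutes per accurate answer
# Option B: 15.0 minutes per accurate answer
```

Scoring only the first step puts Option A on top, while scoring the full process puts Option B on top, which is the reversal the example is meant to show.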

Technology companies care deeply about benchmarking. However, benchmarks must measure products the way they’re designed to be used and should focus on results customers care about.

It makes sense that the legal field would become an early test bed for this type of analysis. From the earliest days of mainstream GenAI development, when ChatGPT aced the LSAT, legal use cases have been prime examples of both the power and the risks associated with AI. The legal field is no stranger to AI; leading companies have been using it for decades in their legal research platforms, and lawyers have likewise been benefiting from it.

Measuring the Full Scope

Working with our customers to continually improve legal research, we understand it is a multiphase process with many inputs and factors, with GenAI capabilities being just one part of it. The entire legal research process is detailed and complex, and lawyers must check sources and validate material; in essence, they must follow holistic, sound research practices to ensure their research is comprehensive and accurate. Benchmarking one part of this process cannot measure the full scope or true value of legal research.

“There is a widespread misperception around how law firms are using AI and how we conduct legal research. We are not bringing in AI and saying: ‘Go do all the research and write a brief,’ and then replacing all of our junior associates with automated results,” said Meredith Williams-Range, chief legal operations officer at Gibson, Dunn & Crutcher LLP. “We’re using AI-enabled tools that are integrated directly into the research and drafting tools we were using already, and, as a result, we’re getting deeper, more nuanced, and more comprehensive insights faster. We have highly trained professionals doing sophisticated information analysis and reporting, augmented by technology.”

Looking Beyond the Basics of AI Evaluation

To state the obvious, benchmark testing should evaluate solutions in accordance with their intended use. In legal research, GenAI has demonstrated significant benefits; however, it is meant to be integrated into a comprehensive workflow that includes reviewing primary law, verifying citations, and utilizing statute annotations to ensure a thorough understanding of the law.

“At Husch Blackwell, we have focused on end-to-end project efficiency in building and deploying our in-house AI tools,” said Blake Rooney, the firm’s chief information officer. “While performance metrics that focus on task efficiency can be helpful, project-level performance metrics for efforts such as contract drafting or discovery in litigation do a much better job at underscoring the efficiencies that resonate with both our lawyers and our clients because they provide a clearer picture of overall value and time savings. Time is a finite resource that we always wish we could have more of, and our lawyers understand that, when used properly and responsibly, AI tools enable them to finish projects faster (and oftentimes better) than they could without AI, thereby delivering true value to our clients and ultimately enabling our lawyers to do more work (or spend more time with family) with the time that they have.”

For legal research, accuracy, consistency, and speed do matter, but none of them is a reliable indicator of success on its own. When it comes to evaluating the performance of professional-grade solutions in specialized fields like law, it is critical not to let isolated snapshots of a single performance metric distort our perspective.

The value of legal AI, and of any technological innovation for that matter, is in how it gets used in the real world and how well all the different components come together to help lawyers do their jobs more effectively.

About the author

Raghu Ramanathan is president of Legal Professionals at Thomson Reuters.