The Robot Lawyer revolution remains on hold. Despite advances in generative AI and growing adoption throughout the legal industry, we must all continue to wait for a lawyerly version of Skynet to become fully self-aware… and sign its own surrender deal with the Trump administration. But in the meantime, legal applications of generative AI continue to produce remarkable, task-focused, time-saving tools.
As developers work to bring the latest advancements in LLMs to bigger and better legal applications, what does it even look like to build something that works in law? And, no, it’s not “feed it all court cases” unless you’re a deeply unserious person.
Thomson Reuters CTO Joel Hron opened up in a recent article about the company’s approach to benchmarking large language models as it builds out its AI offering. It’s a detailed yet approachable account of the philosophical and practical concerns that go into melding the almost daily shifting world of generative AI into a coherent product that produces usable results for attorneys.
One might think that one of the biggest factors in building a more sophisticated tool is being able to handle more content. And that’s true to a point. That said, as Hron points out, tokens ain’t everything and simply stuffing a million tokens into a model isn’t a magic spell for accuracy:
When GPT-4 was first released in 2023, it featured a context window of 8K tokens, equivalent to approximately 6,000 words or 20 pages of text. To process documents longer than this, it was necessary to split them into smaller chunks, process each chunk individually, and synthesize the final answer. Today, most major LLMs have context windows ranging from 128K to over 1M tokens. However, the ability to fit 1M tokens into an input window does not guarantee effective performance with that much text. Often, the more text included, the higher the risk of missing important details. To ensure CoCounsel’s effectiveness with long documents, we’ve developed rigorous testing protocols to measure long context effectiveness.
It’s a paradox that every lawyer on the wrong end of an irrelevant document dump knows well. Yet without sifting through the kitchen sink, there’s no way to get comfortable as an attorney. So developers need to get AI to a place where it can handle large amounts of information without missing the most important point.
But it’s also true that legal work is rarely about pulling a single needle out of a haystack so much as about identifying material strewn throughout a mass of text. Hron describes the initial round of benchmarking:
Our initial benchmarks measure LLM performance across key capabilities critical to our skills. We use over 20,000 test samples from open and private benchmarks covering legal reasoning, contract understanding, hallucinations, instruction following, and long context capability. These tests have easily gradable answers (e.g., multiple-choice questions), allowing for full automation and easy evaluation of new LLM releases.
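Full automation here just means the harness can score the model without a human in the loop. Below is a minimal sketch of what that looks like for multiple-choice samples, assuming a generic ask_model() placeholder and a made-up sample format; neither comes from the article.

```python
# Minimal sketch of automated multiple-choice benchmark grading.
# The MCSample format and ask_model() call are illustrative assumptions,
# not Thomson Reuters' actual evaluation harness.
from dataclasses import dataclass

@dataclass
class MCSample:
    question: str
    choices: dict[str, str]  # e.g., {"A": "...", "B": "..."}
    answer: str              # gold letter, e.g., "B"

def ask_model(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is under evaluation."""
    raise NotImplementedError

def grade(samples: list[MCSample]) -> float:
    """Return the model's accuracy over a set of multiple-choice samples."""
    correct = 0
    for s in samples:
        options = "\n".join(f"{k}. {v}" for k, v in s.choices.items())
        prompt = (
            f"{s.question}\n{options}\n"
            "Answer with the single letter of the best choice."
        )
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == s.answer
    return correct / len(samples)
```

Because every answer reduces to a letter match, a new model release can be swept through tens of thousands of samples like these with no attorney time spent on grading.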
Thomson Reuters employs a multi-LLM approach, so it’s not just evaluating potential AI “engines” with a binary “use/don’t use” test, but figuring out what tasks each model might be well-suited or ill-suited to perform and adjusting its role within the “secret sauce” accordingly. It’s not about crowning a winner, but building a functional team.
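In practice, a multi-LLM strategy tends to boil down to a routing layer that hands each task to whichever model benchmarked best at it. Here is a rough sketch of that idea, with invented task names, model names, and scores rather than anything Thomson Reuters has published.

```python
# Rough sketch of task-based routing in a multi-LLM setup.
# Task names, model names, and scores are invented for illustration.
BENCHMARK_SCORES = {
    "contract_review": {"model_a": 0.91, "model_b": 0.84},
    "long_context_qa": {"model_a": 0.73, "model_b": 0.88},
    "summarization":   {"model_a": 0.86, "model_b": 0.87},
}

def pick_model(task: str) -> str:
    """Route a task to whichever model scored best on its benchmark."""
    scores = BENCHMARK_SCORES[task]
    return max(scores, key=scores.get)

print(pick_model("long_context_qa"))  # -> "model_b"
```

Hron goes on to describe the long context benchmarks that feed decisions like that one: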
For our long context benchmarks, we use tests from LOFT, which measures the ability to answer questions from Wikipedia passages, and NovelQA, which assesses the ability to answer questions from English novels. Both tests accommodate up to 1M input tokens and measure key long context capabilities critical to our skills, such as multihop reasoning (synthesizing information from multiple locations in the input text) and multitarget reasoning (locating and returning multiple pieces of information). These capabilities are essential for applications like interpreting contracts or regulations, where the definition of a term in one part of the text determines how another part is interpreted or applied.
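To make “multihop reasoning” concrete, here is a toy probe in the spirit of those tests: plant a definition early in a very long document, a clause that depends on it near the end, and check whether the model connects the two. It is not an excerpt from LOFT or NovelQA, and the planted clauses, filler text, and ask_model() call are invented for illustration.

```python
# Toy multihop long-context probe in the spirit of LOFT/NovelQA-style tests.
# The planted clauses, filler text, and ask_model() call are invented.
def ask_model(prompt: str) -> str:
    raise NotImplementedError  # call to the LLM under evaluation

def build_probe(filler_paragraphs: int = 5000) -> tuple[str, str]:
    """Return a long document plus a question that requires hopping between
    a definition at the start and a clause at the end."""
    definition = 'Section 1.1: "Business Day" excludes Saturdays, Sundays, and federal holidays.'
    clause = "Section 47.3: Notice must be given within ten Business Days of closing."
    filler = "\n".join("Placeholder paragraph of unrelated contract boilerplate."
                       for _ in range(filler_paragraphs))
    document = f"{definition}\n{filler}\n{clause}"
    question = "Under Section 47.3, does a Saturday count toward the notice deadline?"
    return document, question

# Usage, once ask_model() is wired to a real LLM:
# document, question = build_probe()
# answer = ask_model(f"{document}\n\nQuestion: {question}")
# A correct answer says no: the model has to hop from Section 47.3 back to Section 1.1.
```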
After this round of testing, they run the models through skill-specific tests that they design to mimic rubber-meets-road legal tasks:
Once a skill flow is fully developed, it undergoes evaluation using LLM-as-a-judge against attorney-authored criteria. For each skill, our team of attorney subject matter experts (SMEs) has generated hundreds of tests representing real use cases. Each test includes a user query (e.g., “What was the basis of Panda’s argument for why they believed they were entitled to an insurance payout?”), one or more source documents (e.g., a complaint and demand for jury trial), and an ideal minimum viable answer capturing the key data elements necessary for the answer to be useful in a legal context.

Our SMEs and engineers collaborate to create grading prompts so that an LLM judge can score skill outputs against the ideal answers written by our SMEs. This is an iterative process, where LLM-as-a-judge scores are manually reviewed, grading prompts are adjusted, and ideal answers are refined until the LLM-as-a-judge scores align with our SME scores. More details on our skill-specific benchmarks are discussed in our previous post.
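Stripped to its core, the LLM-as-a-judge step means handing a second model the user query, the attorney-written ideal answer, and the skill’s output, and asking it for a score. A minimal sketch follows, with an invented rubric and a placeholder judge_model() call rather than Thomson Reuters’ actual grading prompts.

```python
# Minimal sketch of LLM-as-a-judge scoring against an SME-written ideal answer.
# The rubric, score scale, and judge_model() call are invented for illustration.
import json

def judge_model(prompt: str) -> str:
    raise NotImplementedError  # call to whichever LLM acts as the judge

GRADING_PROMPT = """You are grading a legal research assistant.
Question: {query}
Ideal answer written by an attorney: {ideal}
Candidate answer: {candidate}
Score the candidate from 1 (misses the key elements) to 5 (covers every key
element in the ideal answer). Respond as JSON: {{"score": <int>, "reason": "<str>"}}"""

def score_output(query: str, ideal: str, candidate: str) -> dict:
    """Ask the judge model to grade one skill output against the ideal answer."""
    prompt = GRADING_PROMPT.format(query=query, ideal=ideal, candidate=candidate)
    return json.loads(judge_model(prompt))
```

In the iterative loop the article describes, a sample of these judge scores would be reviewed against SME scores, and the grading prompt or ideal answers adjusted until the two line up.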
A takeaway from this process is that the advertised context windows from LLM designers don’t necessarily pan out in complex legal work. In fact, models with smaller windows can perform better on complex tasks, because larger context windows can lose effectiveness as they get stretched. For this reason, Thomson Reuters still employs a “split and synthesize” approach for some documents to avoid this problem. As Hron puts it:
When you look at the advertised context window for leading models today, don’t be fooled into thinking this is a solved problem. It is exactly the kind of complex, reasoning-heavy real-world problem where that effective context window shrinks. Our challenge to the model builders: keep stretching and stress-testing that boundary!
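The “split and synthesize” fallback mentioned above is conceptually simple, even if the production version is surely more involved: chunk the document, put the question to each chunk, then have the model merge the partial answers. A rough sketch under those assumptions; the chunk size, prompts, and ask_model() call are placeholders, not CoCounsel’s pipeline.

```python
# Rough sketch of a split-and-synthesize pass over a long document.
# Chunk size, prompts, and ask_model() are placeholders, not CoCounsel's pipeline.
def ask_model(prompt: str) -> str:
    raise NotImplementedError  # call to the underlying LLM

def split_and_synthesize(document: str, question: str, chunk_chars: int = 20_000) -> str:
    """Answer a question about a long document by querying chunks, then merging."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [
        ask_model(
            f"Excerpt {n} of {len(chunks)}:\n{chunk}\n\n"
            f"Question: {question}\n"
            "Answer using only this excerpt; say 'not addressed' if it does not help."
        )
        for n, chunk in enumerate(chunks, start=1)
    ]
    combined = "\n\n".join(partials)
    return ask_model(
        f"Question: {question}\n\n"
        f"Partial answers drawn from excerpts of the document:\n{combined}\n\n"
        "Synthesize one final answer, ignoring excerpts marked 'not addressed'."
    )
```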
After all this, human subject matter experts perform a manual review to capture nuanced issues that might get lost across everything else.
And that’s how they build an AI infrastructure with a multi-LLM strategy. It’s a buddy cop show: one AI is the straight-laced, by-the-book type, the other’s the unorthodox one. Together, they solve crimes, or at least contract reviews.