A federal magistrate judge just ordered that the private ChatGPT conversations of 20 million users be handed over to the lawyers for dozens of plaintiffs, including news organizations. Those 20 million people weren’t asked. They weren’t notified. They have no say in the matter.
Last week, Magistrate Judge Ona Wang ordered OpenAI to turn over a sample of 20 million chat logs as part of the sprawling multidistrict litigation in which publishers are suing AI companies, a mess of consolidated cases that kicked off with the NY Times’ lawsuit against OpenAI. Judge Wang dismissed OpenAI’s privacy concerns, apparently convinced that “anonymization” solves everything.
Even if you hate OpenAI and everything it stands for, and hope that the news orgs bring it to its knees, this should scare you. A lot.
OpenAI had pointed out to the judge a week earlier that these demands from the news orgs would represent a massive privacy violation for ChatGPT’s users:
News Plaintiffs demand that OpenAI hand over the entire 20M log sample “in readily searchable format” via a “hard drive or [] dedicated private cloud.” ECF 656 at 3. That would include logs that are neither relevant nor responsive—indeed, News Plaintiffs concede that at least 99.99% of the logs are irrelevant to their claims. OpenAI has never agreed to such a process, which is wildly disproportionate to the needs of the case and exposes private user chats for no reasonable litigation purpose.

In a display of striking hypocrisy, News Plaintiffs disregard those users’ privacy interests while claiming that their own chat logs are immune from production because “it is possible” that their employees “entered sensitive information into their prompts.” ECF 475 at 4. Unlike News Plaintiffs, OpenAI’s users have no stake in this case and no opportunity to defend their information from disclosure. It makes no sense to order OpenAI to hand over millions of irrelevant and private conversation logs belonging to those absent third parties while allowing News Plaintiffs to shield their own logs from disclosure.
OpenAI offered a much more privacy-protective alternative: hand over only a targeted set of logs actually relevant to the case, rather than dumping 20 million records wholesale.
The news orgs fought back, but their reply brief is sealed, so we don’t get to see their argument. The judge bought it anyway, dismissing the privacy concerns on the theory that OpenAI can simply “anonymize” the chat logs:
Whether or not the parties had reached agreement to produce the 20 million Consumer ChatGPT Logs in whole—which the parties vehemently dispute—such production here is appropriate. OpenAI has failed to explain how its consumers’ privacy rights are not adequately protected by: (1) the existing protective order in this multidistrict litigation or (2) OpenAI’s exhaustive de-identification of all of the 20 million Consumer ChatGPT Logs.
The judge then quotes the news orgs’ filing, noting that OpenAI has already put in this effort to “deidentify” the chat logs. Both of those supposed protections, the protective order and the “exhaustive de-identification,” are nonsense.
Let’s start with the anonymization problem, because it shows a stunning lack of understanding about what it means to anonymize data sets, especially AI chatlogs. We’ve spent years warning people that “anonymized data” is a gibberish term, used by companies to pretend large collections of data can be kept private, when that’s just not true.
Almost any large dataset of “anonymized” data can have significant portions of the data connected back to individuals with just a little work. Researchers re-identified individuals from “anonymized” AOL search queries, from NYC taxi records, from Netflix viewing histories; the list goes on. Every time someone shows up with an “anonymized” dataset, researchers show ways to re-identify people in it.
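The mechanics of these re-identifications are depressingly simple. A minimal sketch, using entirely made-up names and records, of the classic “linkage attack”: join the “anonymized” release to some public record on the quasi-identifiers both datasets share, and a unique match hands you a name.

```python
# Purely illustrative linkage attack. All records and names are invented;
# the quasi-identifiers (zip, birth date, sex) echo the well-known result
# that such combinations are unique for most of the population.

# "Anonymized" release: names stripped, quasi-identifiers left in.
anonymized = [
    {"zip": "02138", "birth": "1945-07-29", "sex": "F", "data": "query A"},
    {"zip": "94110", "birth": "1988-03-14", "sex": "M", "data": "query B"},
]

# A public dataset (think voter roll) with the same fields plus names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth": "1945-07-29", "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth": "1988-03-14", "sex": "M"},
]

def reidentify(anon_rows, public_rows):
    """Join on (zip, birth, sex); a unique match re-identifies a person."""
    keyed = {}
    for p in public_rows:
        keyed.setdefault((p["zip"], p["birth"], p["sex"]), []).append(p["name"])
    hits = []
    for a in anon_rows:
        names = keyed.get((a["zip"], a["birth"], a["sex"]), [])
        if len(names) == 1:  # combination is unique -> identity recovered
            hits.append((names[0], a["data"]))
    return hits

print(reidentify(anonymized, public))
# Every "anonymous" record links back to exactly one named person.
```

That is the whole trick: no credentials, no hacking, just a join. The only real variable is how many quasi-identifiers the “anonymized” data leaks, and chat logs leak far more than zip codes.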
And that’s even worse when it comes to ChatGPT chat logs, which are likely to be way more revealing than the earlier datasets whose failed “anonymization” got called out.
There have been plenty of reports of just how much people “overshare” with ChatGPT, often including incredibly private information. Back in August, researchers got their hands on just 1,000 leaked ChatGPT conversations and talked about how much sensitive information they were able to glean from that small number of chats.
Researchers downloaded and analyzed 1,000 of the leaked conversations, spanning over 43 million words. Among them, they discovered multiple chats that explicitly mentioned personally identifiable information (PII), such as full names, addresses, and ID numbers.
With that level of PII and sensitive information, connecting chats back to individuals is likely far easier than in previous cases of re-identifying “anonymized” data. And that was with just 1,000 records.
Then, yesterday, as I was writing this, the Washington Post revealed that it had combed through 47,000 ChatGPT chat logs, many of which were “accidentally” revealed via ChatGPT’s “share” feature. Many of them reveal deeply personal and intimate information.
Users often shared highly personal information with ChatGPT in the conversations analyzed by The Post, including details generally not typed into conventional search engines.

People sent ChatGPT more than 550 unique addresses and 76 phone numbers in the conversations. Some are public, but others appear to be private, like those one user shared for administrators at a religious school in Minnesota. Users asking the chatbot to draft letters or lawsuits on workplace or family disputes sent the chatbot detailed private information about the incidents.
There are examples where, even if the user’s official details are redacted, it would be trivial to figure out who was actually doing the chats:

If you can’t see that, it’s a chat with ChatGPT, redacted by the Washington Post, saying:
User: my name is [name redacted] my husband name [name redacted] is threatning me to kill and not taking my responsibities and trying to go abroad […] he is not caring us and he is going to kuwait and he will give me divorce from abroad please i want to complaint to higher authgorities and immigrition office to stop him to go abroad and i want justice please help

ChatGPT: Below is a formal draft complaint you can submit to the Deputy Commissioner of Police in [redacted] addressing your concerns and seeking immediate action:
It seems likely that even if you “anonymized” that chat by stripping the user account details, it wouldn’t take long to figure out whose chat it was, revealing some pretty personal info, including the names of their children (according to the Post).
And WaPo reporters found all of that by starting with 93,000 chats, using automated tools to analyze the 47,000 in English, and then having humans review just 500 chats in a “random sample.”
Now imagine 20 million records. With many, many times more data, the ability to cross-reference information across chats, identify patterns, and connect seemingly disconnected pieces of information becomes exponentially easier. This isn’t just “more of the same”; it’s a qualitatively different threat level.
Even worse, the judge’s order contains a fundamental contradiction: she demands that OpenAI share these chatlogs “in whole” while simultaneously insisting they undergo “exhaustive de-identification.” Those two requirements are incompatible. Real de-identification would require stripping far more than just usernames and account info: it would mean redacting or altering the actual content of the chats, because that content is often what makes re-identification possible. But if you’re redacting content to protect privacy, you’re no longer handing over the logs “in whole.” You can’t have both. The judge doesn’t grapple with this contradiction at all.
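To see why stripping account details alone doesn’t de-identify a chat, here’s a minimal sketch with an invented example. A naive redactor removes the names it knows about, but everything else in the message body passes through untouched, and that content can pin down one specific person.

```python
import re

# Illustrative only: an invented chat and a naive de-identifier that strips
# the account-level names it is told about, and nothing else.

chat = (
    "My name is Jane Doe. I teach third grade at Lakeside Elementary "
    "in Smallville, and my husband runs the only bakery on Main Street."
)

def naive_deidentify(text, known_names):
    """Redact only the known names (what stripping 'account details'
    amounts to). The chat's content is left fully intact."""
    for name in known_names:
        text = re.sub(re.escape(name), "[redacted]", text)
    return text

cleaned = naive_deidentify(chat, ["Jane Doe"])
print(cleaned)
# The name is gone, but "third grade at Lakeside Elementary in Smallville"
# plus "the only bakery on Main Street" still identify exactly one person.
```

Removing the residual identifying content would mean rewriting the chat itself, which is precisely what “in whole” production forbids.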
Yes, as the judge notes, this data is kept under the protective order in the case, meaning that it shouldn’t be disclosed. But protective orders are only as strong as the people bound by them, and there’s a huge risk here.
Looking at the docket, there are a ton of lawyers who will have access to these files. The docket list of parties and lawyers runs 45 pages if you try to print it out. While there are plenty of repeats in there, there have to be at least 100 lawyers, and possibly a lot more (I’m not going to count them, and when I asked three different AI tools to count them, each gave me a different answer).
That’s a lot of people, many representing entities directly hostile to OpenAI, who all need to keep 20 million private conversations secret. And that’s not even getting into the fact that handling 20 million chat logs well is a difficult task.
I am quite sure that among all the plaintiffs and all the lawyers, even with the very best of intentions, there’s still a decent chance that some of the content could leak (and it could, in theory, leak to some of the media properties who are plaintiffs in the case).
And, as OpenAI properly points out, its users, whose data is at risk here, have no say in any of this. They likely have no idea that a ton of people may be about to get an intimate look at what they thought were their private ChatGPT chats.
On Wednesday morning, OpenAI asked the judge to reconsider, warning of the very real potential harms:
OpenAI is unaware of any court ordering wholesale production of personal information at this scale. This sets a dangerous precedent: it suggests that anyone who files a lawsuit against an AI company can demand production of tens of millions of conversations without first narrowing for relevance. This is not how discovery works in other cases: courts do not allow plaintiffs suing Google to dig through the private emails of tens of millions of Gmail users irrespective of their relevance. And it is not how discovery should work for generative AI tools either.
The judge had cited a ruling in one of Anthropic’s cases, but hadn’t given OpenAI a chance to explain why the ruling in that case didn’t apply here. (In that one, Anthropic had agreed to hand over the logs as part of negotiations with the plaintiffs, and OpenAI gets in a little dig at its competitor, pointing out that it appears Anthropic made no effort to protect the privacy of its users in that case.)
There have, as Daphne Keller regularly points out, always been tensions between user privacy and platform transparency. But this goes well beyond that familiar tension. We’re not talking about “platform transparency” in the traditional sense of publishing aggregated statistics or clarifying moderation policies.
This is 20 million complete chatlogs, handed over “in whole” to dozens of adversarial parties and their lawyers. The potential damage to the privacy rights of those users could be massive.
And the judge just waves it all away.