
Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google. – Above the Law

(Photo by Kevin Carter/Getty Images)

Last week, Google filed suit against SerpApi, a scraping company that helps businesses pull data from Google search results. The lawsuit claims SerpApi violated DMCA Section 1201 by circumventing Google’s “technological protection measures” to access search results—and the copyrighted content within them—without permission.

There’s just one problem with this theory: Google built its entire business on scraping the web without asking permission first. And now it wants to use one of the most abused provisions in copyright law to stop others from doing something functionally similar to what made Google a tech giant in the first place.

The lawsuit comes on the heels of Reddit’s equally problematic anti-scraping suit from October—which we called an attack on the open internet. Reddit sued Perplexity and various scraping firms (including SerpApi), claiming they violated 1201 by circumventing… Google’s technological protections. Reddit was mad it had cut a multi-million dollar licensing deal with Google for access to Reddit content, and these firms were routing around both that deal and Google itself to provide similar results to users. The legal theory was bizarre: Reddit didn’t own the copyright on user posts, and the scrapers weren’t even touching Reddit directly—yet Reddit claimed standing to sue based on circumventing someone else’s TPMs.

So now, Google has filed its own, similar lawsuit, going after SerpApi directly, focused on how SerpApi gets around its attempts to block such scraping. Google released a blog post defending this lawsuit:


We filed a suit today against the scraping company SerpApi for circumventing security measures protecting others’ copyrighted content that appears in Google search results. We did this to ask a court to stop SerpApi’s bots and their malicious scraping, which violates the choices of websites and rightsholders about who should have access to their content. This lawsuit follows legal action that other websites have taken against SerpApi and similar scraping companies, and is part of our long track record of affirmative litigation to fight scammers and bad actors on the web.


Google follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content. Stealthy scrapers like SerpApi override those directives and give sites no choice at all. SerpApi uses shady back doors (like cloaking themselves, bombarding websites with massive networks of bots, and giving their crawlers fake and constantly changing names), circumventing our security measures to take websites’ content wholesale. This unlawful activity has increased dramatically over the past year.


SerpApi deceptively takes content that Google licenses from others (like images that appear in Knowledge Panels, real-time data in Search features and much more), and then resells it for a fee. In doing so, it willfully disregards the rights and directives of websites and providers whose content appears in Search.

Look, SerpApi’s behavior is sketchy. Spoofing user agents, rotating IPs to look like legitimate users, solving CAPTCHAs programmatically—Google’s complaint paints a picture of a company actively working to evade detection. But the legal theory Google is deploying to stop them threatens something far bigger than one shady scraper.

Google’s entire business is built on scraping as much of the web as possible without first asking permission. The fact that it now wants to invoke DMCA 1201—one of the most consistently abused provisions in copyright law—to stop others from scraping it exposes the underlying problem with these licensing-era arguments: they’re attempts to pull up the ladder after you’ve climbed it.

Just from a straight-up perception standpoint, it looks bad.

To be clear: this isn’t about defending SerpApi. They appear to be bad actors who built a business on evading detection systems. The problem is that Google chose to go after them using a legal weapon with a long history of collateral damage. When you invoke Section 1201 against web scraping, you’re not just targeting one sketchy company—you’re potentially rewriting the rules for how the entire open web functions. The choice of weapon matters, especially when that weapon has been repeatedly abused to stifle legitimate competition and could now be turned against the very openness that made the modern internet possible.

For many years, we’ve discussed the many, many problems of DMCA Section 1201. It’s the “anti-circumvention” part of the law, which says that merely attempting to get around a “technological protection measure” (or even just telling someone else how to get around one) could be deemed to violate the law, even if the TPMs in question were wholly ineffective, and even if the intent in getting around the TPM had nothing to do with copyright infringement.

That has led to years of abusive practices by companies that would put silly, pointless “TPMs” in place just in order to be able to use the law to limit competition. There were lawsuits over printer ink cartridges and garage door openers, among other things.

Here, Google is saying that it put in place a TPM in January of 2025 called “SearchGuard” (which sounds like an advanced CAPTCHA of some sort) to prevent SerpApi from scraping its search results, but SerpApi figured out a way around it:


When SearchGuard launched in January 2025, it effectively blocked SerpApi from accessing Google’s Search results and the copyrighted content of Google’s partners. But SerpApi immediately began working on a means to circumvent Google’s technological protection measure. SerpApi quickly discovered means to do so and deployed them.


SerpApi’s answer to SearchGuard is to mask the hundreds of millions of automated queries it is sending to Google each day to make them appear as if they are coming from human users. SerpApi’s founder recently described the process as “creating fake browsers using a multitude of IP addresses that Google sees as normal users.”


SerpApi’s fakery takes many forms. For example, when SerpApi submits an automated query to Google and SearchGuard responds with a challenge, SerpApi may misrepresent the device, software, or location from which the query is sent in order to solve the challenge and obtain authorization to submit queries. Additionally or alternatively, SerpApi may solve SearchGuard’s challenge with a “legitimate” request and then syndicate the resulting authorization, that is, share it with unauthorized machines around the world, to enable their “fake browsers” to generate automated queries that appear to Google as authorized. It also uses automated means to bypass CAPTCHAs, another aspect of SearchGuard that tests users to ensure they are humans rather than machines.

Getting around these protections eats up Google’s resources, and sure, that must be annoying for Google. But the real motivation shows up when Google gets to the economics of the situation. Google has started cutting licensing deals with content partners—most notably the multi-million dollar Reddit deal—and now those partners are pissed that SerpApi lets others access similar data without paying anyone:


For Google, SerpApi’s automated scraping not only consumes substantial computing resources without payment, but also disrupts Google’s content partnerships. Google licenses content so that it can enhance the Search results it provides to users and thereby boost its competitive standing. SerpApi undermines Google’s substantial investment in those licenses, making the content available to other services that need not incur similar costs.


SerpApi’s scraping of Google Search results also impacts the rights holders who license content to Google. Without permission or compensation, SerpApi takes their content from Google and widely distributes it for use by third parties. That, in turn, threatens to disrupt Google’s relationship with the rights holders who look to Google to prevent the misappropriation of the content Google displays. At least one Google content partner, Reddit, has already sued SerpApi for its misconduct.

This is where the 1201 theory becomes genuinely dangerous. Google’s argument, if accepted, provides a roadmap for any website operator who wants to lock down their content: slap on a trivial TPM—a CAPTCHA, an IP check, anything—and suddenly you can invoke federal law against anyone who figures out how to get around it, even if their purpose has nothing to do with copyright infringement.

The implications spiral outward quickly. If Google succeeds here, what stops every major website from deciding they want licensing revenue from the largest scrapers? Cloudflare could put bot detection on the huge swath of the internet it serves and demand Google pay up. WordPress could do the same across its massive network. The open web—built on the assumption that published content is publicly accessible for indexing and analysis—becomes a patchwork of licensing requirements, each enforced through 1201 threats.

That doesn’t seem good for the prospects of a continued open web.

Google’s legal theory has another significant problem: the requirement that a TPM must “effectively control” access. Just last week, a court rejected Ziff Davis’s attempt to turn robots.txt into a 1201 violation when OpenAI allegedly ignored its crawling restrictions. The court’s reasoning is directly applicable here:


Robots.txt files instructing web crawlers to refrain from scraping certain content do not “effectively control” access to that content any more than a sign requesting that visitors “keep off the grass” effectively controls access to a lawn. On Ziff Davis’s own telling, robots.txt directives are merely requests and do not effectively control access to copyrighted works. A web crawler need not “appl[y] . . . information, or a process or a treatment,” in order to gain access to web content on pages that include robots.txt directives; it may access the content without taking any affirmative step other than impertinently disregarding the request embodied in the robots.txt files. The FAC therefore fails to allege that robots.txt files are a “technological measure that effectively controls access” to Ziff Davis’s copyrighted works, and the DMCA section 1201(a) claim fails for this reason.
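The court’s “keep off the grass” analogy reflects what robots.txt actually is: a plain-text file of requests, with nothing in it that enforces anything. An illustrative example (not Ziff Davis’s actual file):

```
# Served at https://example.com/robots.txt (illustrative)
# Directives are plain text; nothing here technically blocks a fetch.
User-agent: *
Disallow: /private/
Allow: /
```

A crawler that ignores the file can still issue the exact same HTTP GET requests it would have issued anyway, which is why the court held the file does not “effectively control” access.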

Google will argue SearchGuard is different—it’s more than a polite request; it actively challenges and blocks scrapers. But if SerpApi can routinely bypass it by spoofing browsers and rotating IPs, does it really “effectively control” access? Or is it just a slightly more sophisticated “keep off the grass” sign that determined actors can ignore?
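To see why “effectively controls” is a genuine question, consider a deliberately simplified, hypothetical gatekeeper (not Google’s actual SearchGuard logic) that keys on the self-reported User-Agent header. Because that header is chosen entirely by the client, the “control” holds only as long as clients identify themselves honestly:

```python
# Hypothetical, deliberately simplified gatekeeper; not SearchGuard.
# It "controls" access based on a string the client itself supplies.
BLOCKED_AGENTS = {"ExampleScraperBot"}  # hypothetical bot name

def allow_request(headers: dict) -> bool:
    """Allow the request unless the client admits to being a blocked bot."""
    return headers.get("User-Agent", "") not in BLOCKED_AGENTS

# The same client, before and after changing one self-reported string:
print(allow_request({"User-Agent": "ExampleScraperBot"}))  # False
print(allow_request({"User-Agent": "Mozilla/5.0"}))        # True
```

Whether a measure a determined client can defeat by editing one header (or, at a higher level of sophistication, by spoofing browsers and rotating IPs) “effectively controls” access is exactly the line courts will have to draw.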

This question matters enormously because it determines whether the statute that was supposed to prevent piracy of CDs and DVDs now also governs every attempt to access publicly available web pages through automated means.

For decades, we’ve operated under a system where robots.txt represented a voluntary, good-faith approach to web crawling. The major players respected these directives not because they had to, but because maintaining that norm benefited everyone. That system is breaking down, not because of SerpApi, but because of the rise of scrapers focused on LLM training, mixed with other companies wanting to find licensing deals to get a cut of the money flows. Reddit and Google negotiating licensing deals over open web content was a warning sign of all of this, and now it’s spilling out into the courts with questionable 1201 claims.
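That voluntary norm is baked into standard crawling tooling. Python’s `urllib.robotparser`, for instance, exists precisely so a crawler can check a site’s requests before fetching; a minimal sketch (the robots.txt contents here are illustrative):

```python
# A well-behaved crawler consults robots.txt and declines disallowed URLs.
# Nothing technical prevents fetching them anyway; compliance is a choice.
from urllib import robotparser

# Illustrative robots.txt contents (normally fetched from the site itself).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_LINES)

print(rp.can_fetch("MyCrawler", "https://example.com/articles/1"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```

Note that `can_fetch` only reports the site’s stated preference; honoring the answer is entirely up to the caller, which is the whole point of calling the system voluntary.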

Both Reddit and Google frame this as protecting the open internet from bad actors. But pulling up the ladder after you’ve climbed it isn’t protection—it’s rent-seeking. Google built an empire on the assumption that publicly accessible web content could be freely scraped and indexed. Now it wants to rewrite the rules… using Hollywood’s favorite tool to block access to information.

The real problem isn’t that Google is fighting back against SerpApi’s evasive tactics. It’s that they chose to fight using a legal weapon that, if successful, fundamentally changes how we understand access to the open web. Section 1201 has already been wildly abused to stifle competition in everything from printer cartridges to garage door openers. Extending it to cover basic web scraping because SerpApi seems sketchy threatens the foundational assumption that published web content is accessible for indexing, research, and analysis.

Google has the resources to solve this problem through better engineering, or by raising the actual cost of evasion high enough that SerpApi’s business model fails. Instead, they’ve opted for a legal shortcut that, if it works, will reshape the internet in ways that go far beyond one sketchy scraping company.

The internet is changing, and legitimate questions exist about how web scraping should function in an era of large language models and AI training. But those questions won’t be answered well by stretching copyright law to cover something it was never designed for, and empowering every website operator to demand licensing fees simply by putting up a CAPTCHA.

That’s not protecting the open web. That’s closing it.


Click
here
to
see
the
docs
.

