From YouTube: IETF111-PEARG-20210726-1900
Description
PEARG meeting session at IETF111
2021/07/26 1900
https://datatracker.ietf.org/meeting/111/proceedings/
A
Okie doke, welcome everybody to IETF 111 and welcome to the PEARG meeting. We have the Note Well for everybody, the regular Note Well; and of course, the session is being recorded, as is usual for the IETF.
A
Please note that this is our agenda for today; blue sheets are automatic. The other chairs will be watching the chat, so I don't believe we need a chat scribe, and we already have a minute taker.
A
No,
if
everybody
is
happy
with
that
agenda,
we
will
move
forward.
So
the
agenda
today
is
a
couple
of
updates
on
drafts.
We've
got
a
very
brief
mention
of
the
address
privacy
draft
and
then
mallory
present
on
safe
measurement.
The
update
she's
done
recently.
A
Hopefully you've seen there was an email to the list with the first version of it, and there's also a GitHub repo if you're interested. Matthew and Luigi have provided PRs so far. At the moment it's a fairly skeleton structure with headings and placeholders; Matthew and Luigi are going to add more content over the coming weeks, but we are still looking for other contributors on this, particularly on the use case sections, on which there were a lot of presentations during the interim.
A
So if you are interested in working on that, please read the document and please let the chairs or the other authors know.
A
Okay, so that's just a brief overview on that. I will now pass over to Mallory to do a presentation on the safe measurements draft. So let me stop.
D
Right, okay, I clicked all the buttons; we see you, go ahead. Great, hey everybody. I've recently agreed to take on some of the work for this draft, which is a working group draft, or rather a research group draft, on guidelines for performing safe measurement on the internet. Iain Learmonth, and Gurshabad Grover from CIS India, have worked on this already.
D
I
just
very
few
changes
in
the
latest
version
that
went
up
right
before
the
doc
freeze,
so
I
want
to
just
go
over
those
and
give
you
all
an
overview
of
where
this
document's,
at
in
case
you'd
like
to
help,
because
I
think
it's
an
important
document
and
I
think
yeah,
the
research
group
should
continue
to
work
on
it.
It
fills
a
major
need
so
yeah.
I've
already
mentioned
the
authors,
the
goal
of
the
draft
at
overarching
goal.
D
We can discuss whether or not it's reaching that goal. So for industry and academia, and presumably others, anyone doing measurements of the internet and its use as part of their research, we are trying to describe some guidelines to ensure that those measurements can be carried out without violating user privacy; that's why this is in this research group. There's some interesting stuff around scope at the beginning of the draft. It's actually really important, I think, that it's there, and it's one of the things I strengthened in the last version.
D
With
some
of
these
so
yeah,
it's
not
meant
to
be
a
substitution
for
ethics
review.
In
fact,
I
think
it's
important
to
actually
just
complement
ethics
and
state
that,
rather
than
try
to
sound
like
it's
an
ethics
consideration,
I
think
it's
not.
It's
also
not
legal
advice,
of
course,
and
then
it
tries
to
also,
I
think,
define
better
what
internet
measurement
scope
means.
D
So
the
network,
its
hosts
and
links
or
its
traffic
I'd
be
interested
in
people
in
their
assessment
of
that
scope,
definition
and
then
the
last
thing
is
it
needs
to
identify
the
user
and
who
it's
being
safe.
For
so
that
would
be
an
individual
organization
whose
data
is
used
in
communications
over
the
internet
and
that
might
be
swept
up
in
measurements
of
the
internet.
D
So
in
three
parts
is
the
main
structure
of
this
document.
It
talks
about
consent
as
a
concept
we
could.
We
could
discuss
whether
or
not
meaningful
consent
can
be
gotten,
and
if
that
is
enough
to
really
if
it
actually
creates
safety
and
privacy
preserving
then.
The
second
piece
is
around
safety
considerations,
so
this
goes
into
a
depth
and
a
few
different
subtopics,
so
isolating
risk
with
test
beds.
D
There's
a
part
about
being
mindful
that
you're
maybe
encroaching
on
others
infrastructure,
an
important
piece
on
data
minimization,
which
I
don't
think
we
talked
enough
about
these
days,
not
creating
it.
If
you've
got
it,
get
rid
of
it
eventually
that
sort
of
thing
and
then
masking
data,
which
could
be
all
kinds
of
different
techniques
within
that
subsection
and
then
lastly,
there's
a
section
on
risk
analysis.
D
That's
the
main.
Those
are
the
broad
strokes.
Oh
dear,
that
looks
terrible.
This
is
the
table
of
contents
yeah,
you
can
see,
then
those
three
main
parts
are
there
and
then
the
subtopics
and
then
the
subtopics
there.
So
just
you
know
if
you
wanted
to
have
an
overview.
Unfortunately,
the
draft
at
the
moment
isn't
produced
like
the
xml
is
not
producing
properly
a
table
of
contents
in
the
draft.
So
this
I
made
by
hand
and
then
displayed
poorly
in
the
slide.
D
Sorry
so
yeah,
the
the
last
time
it
was
presented
which,
if
I'm
not
wrong,
was
maybe
108
or
109.
D
ian
had
been
working
on,
adding
us
or
in
gerschabot
as
well,
adding
a
section
on
responsible
disclosure.
So
there's
some
important,
then
stuff
that
followed
on
from
that
open
issue
that
turned
into
talking
about
accountability
and
transparency
of
any
measurement
research
project.
Then
it
was
all.
There
was
also
some
need
to
cite
some
research
that
aligned
very
closely
with
the
goals
of
this
discussion
of
ip
address
and
then
also
like
the
thing
that
I
mentioned
in
the
beginning.
Like
you
know,
safety
is
not
equal
to
ethics.
D
Can
we
to
align
with
ethics
efforts
and
be
complementary
to
them,
but
not
actually
pass
this
draft
off
as
ethics
and
measurements
so
yeah?
That's
essentially
what
was
what's
been
done
since
the
last
version,
as
well
since
04
05,
and
that's
really
it
so
the
things
that
are
still
open
in
github,
which
you
can
check
out
needing
to
do
a
better
job
of
the
responsible
disclosure
piece.
D
Disclosure,
the
appearance
issues
out
there,
undone
there's
a
lot
of
stuff
that
are
undone,
but
I
haven't
enumerated
them
here
to
think
about
what
future
computing
capabilities
might
mean
for
safe
measurement.
D
So
just
because
you
can
compute
that,
should
you
there's,
then,
then,
these
last
two
pieces
are
just
like
making
sure
that
the
I
think
the
citations
and
the
a
literature
review
essentially
is
present
so
to
look
at
more
resources
and
if
you
have
additional
resources
that
you
don't
see
cited
here,
but
you
think
we
should
please
open
an
issue
and
when
you
do
that,
if
you
can
be
specific
about
the
section
you
think
we
should
hone
in
on
or
even
you
know,
pull
requests
of
course
are
very
welcome
and
reviews.
D
I'm just going to go to the chat now to see if anybody's there; I'd be really happy if folks who've chatted while I was talking want to come on mic and say things to me.
A
I think your audio dropped for a couple of people just on the very last slide. That's okay; so if you could just...
D
...sort of repeat it? I'd be happy to go back. Is it this one? Then I think maybe the last thing I was saying was just that, you know, we want to make sure that we're citing and lifting up other efforts that are similar to this, or bringing in learnings, especially as they fit into the table of contents. So if you know of resources, especially for this section, that we should potentially be quoting from or referencing...
D
Please
open
an
issue
with
that
information,
but
even
better
than
that
right
would
be
a
pull
request.
A
So the chairs also want to thank Mallory for picking up this draft, because it had originated a while back and unfortunately the authors weren't able to work on it at that time, and Mallory's now picked it up. So we're hoping that with the latest update she's done, we can get some more reviews, we can get input, and actually get this work moving forward.
A
So
I
think
mallory
would
you
be
interested
if
there
are
other
people
who
want
to
to
contribute
and
help?
You
pick
the
document
up
right
now,
something
else
to
appeal.
Please,
research,
grateful.
D
Yeah, indeed. If you want to do pull requests, that's great; also, if you want to do substantial edits, just get on a call with me, or we can talk through changes you want to make, or you can make them directly in the repository as well. Just let me know. I should probably drop my email in case you don't have it; it should also be on the draft itself. But yeah, I'd love some help.
D
I
am
gonna
finish
this
one
way
or
another,
but
for
people
who
are
particularly
interested
in
this,
especially
some
of
the
you
know
some
of
the
more
interesting
topics
I
think
we'll
be
able
to
get
into
discussion
and
debate
in
the
future.
Once
the
draft
once
there's
like
what
I
would
consider
to
be
a
complete
and
full
draft,
because
there's
a
bunch
of
sections
that
are
just
kind
of
placeholders
but
yeah,
I
love.
D
Yeah, so I can actually voice Stephen. What he says is: it would be good to include examples of when that was handled well and when badly. And I already like the model of case studies, although I would probably rename them as examples instead of case studies; they're not complete enough, I think, to be considered case studies, and I'd also not stack them in the hierarchy the way they are. But anyway, I like that idea, Stephen. And then he goes on to say:
D
I
think
this
document
might
end
up
being
a
model
for
other
ietf
documents
that
mention
consent.
So
it
should
be
done
carefully.
I
think
it
should.
I
think
what
you
mean
by
carefully
or
safely
is
that
it
should
be
done
very
conservatively
and
not
giving
the
idea
of
user
consent
free
reign
to
to
do
a
lot,
because
I
don't
think,
there's
a
great
deal
of
possibility
for
meaningful
consent,
but
we
can
yeah
we'll
we'll
dive
into
that,
and
we
can
do
that
by
looking
at
examples.
So
yeah.
D
So it's a really good point; just to answer in the chat: I think yes, examples will work well. I think we have plenty of very good internet measurement researchers that participate in the IRTF in particular, and the IETF more broadly, whose methodology we can look at and read, and whom we can ask if we have questions that aren't answered in their literature; we can use those examples. I'm not sure that using a consent approach outside of this context will be helpful.
D
I think in general, right, there's an ethics consideration around how you get consent from people participating in your studies, but I think what we're trying to articulate here is that it's difficult: folks' internet traffic is not the same as, you know, being in a qualitative study or things like that. So I think, yeah, we'll look for examples.
I
Yes, I'm not sure why you think I'd do it, because I'm showing a video.
I
Okay, well, that was... sorry, I'm just catching up with the protocol. Okay.
A
Okay, next up we have Jean-Pierre Smith's presentation on website fingerprinting in the age of QUIC. Jean-Pierre, you should be getting your preloaded slides, hopefully.
K
Great, that's much clearer, thank you. So hello everyone, my name is Jean-Pierre Smith, I'm a PhD student at ETH Zurich, and I'm pleased to have been invited to share with you today our work on website fingerprinting in the age of QUIC, which was presented...
K
I
believe
two
weeks
ago
at
the
privacy
enhancing
technology
symposium,
and
this
is
joint
work
between
among
myself,
professor
patek
mitel
of
princeton
university
and
professor
adrian
perry,
get
it
to
her
zurich,
and
so,
as
I'm
sure
many
of
you
can
remember
in
the
early
days
of
the
internet.
K
If
an
adversary
was
interested
in
identifying
what
a
what
content
a
user
was
browsing
or
viewing
on
the
internet,
you
could
do
so
by
simply
observing
the
the
urls
the
get
requested,
data
etc
in
the
packets,
as
well
as
a
destination
ip
address
to
counteract
this.
Of
course,
ssl
tls
was
added
to
the
network,
and
now
the
content
was
no
longer
encrypted.
K
However,
there
are
still
the
certificates
which
are
exchanged
between
the
use
and
the
web
server
and
tags,
just
the
server
name
indicator
which
still
reveal
to
the
adversary,
what
a
user
is
browsing
or
what
a
user
is
viewing.
When
he's
browsing
the
internet,
because.
K
Unfortunately, this communication still leaks metadata: packet sizes, timings, and the directions of the packets. This metadata forms the basis of a slew of attacks called website fingerprinting attacks, which are what this work is based on. Essentially, in a network-based website fingerprinting attack, an adversary is interested in identifying what web page a user is viewing over some encrypted channel like the Tor network or a VPN, etc.
K
The metadata leaks which web page is being visited. Here, usually, the adversary is interested in a set of web pages: is the user visiting wikileaks.org, YouTube, or Facebook, etc.? Or they would like to conclude that this particular traffic trace is not anything they're interested in, so they can discard it. Now, there has actually been quite a lot of work in the website fingerprinting literature.
K
It shows that website fingerprinting is possible on anonymity networks like Tor, and it's certainly possible on regular browsing traffic, but most of this work has focused on the TCP, or Tor over TCP/TLS, setting.
K
Now, most of you, I imagine, are already familiar with QUIC, and there are a number of differences between QUIC and TCP which will likely impact website fingerprinting. For example, whereas with a TCP connection there is a single byte stream between the client and the server, there are now multiple individually flow-controlled streams, which affects the resulting image of the traffic on the network.
K
Moving
from
this
one
flow
per
connection
to
multiple
flows
inside
a
single
connection
which
are
all
sharing
this
one
flow
level,
this
one
congestion
control
as
well
as
this
connection
based
flow
control.
K
Additionally, due to QUIC's faster connection establishment, an adversary looking at the connection will see bursts of packets going in and out that are shifted in time when going from one protocol to the other. And finally, in the TCP-only internet we had the situation where, when an adversary observed traffic on the network which he knew, or believed, was web browsing traffic, then he was...
K
The
network
is
actually
a
mixture
even
for
a
single
web
page
load,
a
mixture
of
quick
and
tcp
connections,
and
so,
given
all
of
given
these
such
differences
between
tcp
and
quick.
The
question
we
sought
to
answer
in
this
paper
is:
how
might
how
might
website
think
of
printing
change,
given
this
transition
in
the
web
from
tcp
to
quick?
K
So
to
do
that,
we
essentially
coming
from
the
previous
presentation.
We
collected
a
quite
large
data
set
by
scanning
and
downloading
web
pages
from
a
large
number
of
web
servers,
so
we
emulated
what
the
adversary
would
be
doing
to
set
up
his
website,
fingerprinting
attack
and
in
our
setting.
The
adversary
was
located
on
a
wireguard
vpn,
with
connections
to
new
york,
frankfurt
and
bengaluru
india,
and
we
emulated
the
user
by
using
a
browser.
K
We then created a data set from this. In total we downloaded some 117,000 samples of web pages, of which 100 web pages were what we call the monitored web pages. These are the web pages which the adversary is interested in identifying, and for these we collected 20,000 samples, so about 200 loads of each web page using the browser.
K
We
then,
additionally,
we
also
collected
some
16
000
unwanted
web
pages,
and
this
gives
the
this
allows
adversary
to
have
an
to
train
his
classifier
to
get
a
feel
for
what
is
not
one
of
the
monitored
web
pages.
So
what
everything
else
in
the
internet
looks
like,
and
here
we
collected
those
six
samples
per
yeah-
is
that
right?
No,
that's,
not
right
one
sample
per
protocol
per
gateway.
There
are
two
protocols:
three
three
gateways,
so
yes,
six
samples
per
unmonitored
web
page
and
that
formed
the
bulk
of
order
to
set
across
these
two
protocols.
K
We
then
train
a
number
of
both
a
number
of
classifiers,
many
of
them
deep
learning,
classifiers
and
their
task
is
given
these
features
that
we've
downloaded
to
identify,
identify
the
the
web
page,
and
so
the
first
question
that
we
yeah-
and
we
here
we
measured,
recall
and
precision
so
for
those
of
you
unfamiliar
with
this
topic,
so
essentially
what
we
measured
is
of
the
fraction.
K
So
there
we
have
the
data
set,
and
some
of
the
samples
are
the
ones
that
the
adversary
is
interested
in.
These
are
the
monitor
samples
and
when
we
pass
this
to
the
classifier,
what
fraction
of
those
monitored
samples
is
the
classifier
actually
take?
That's
the
recall
so
of
you
know,
100,
monitored
samples.
Does
the
adversary
detect
80
of
them,
which
is
80
recall,
or
does
he
detect
only
15
of
them?
15
recall,
I
know
our
pers,
we
use
precision.
K
I
won't
get
into
the
detail
of
this
variant,
precision
which
we
used,
however,
so
what
precision
essentially
answers
is,
if
the
if
the
classifier
says
that
these
one,
that
these
1000
samples
are
all
monitored
web
pages
and
should
be
blocked
or
should
be
logged,
and
you
know
or
should
be
logged
what
fraction
of
those
1000
claims
were
actually
correct.
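The two metrics described in these turns can be sketched in code. This is a hypothetical illustration, not from the paper: the label "unmonitored" and the function name are assumptions for the example, and the paper's actual precision variant differs.

```python
# Hypothetical sketch of the open-world metrics described above.
# recall: of all truly monitored samples, what fraction did the classifier catch?
# precision: of all "monitored" claims the classifier made, what fraction were correct?

def recall_precision(true_labels, predicted_labels, unmonitored="unmonitored"):
    monitored_total = sum(1 for t in true_labels if t != unmonitored)
    caught = sum(1 for t, p in zip(true_labels, predicted_labels)
                 if t != unmonitored and p == t)
    claims = sum(1 for p in predicted_labels if p != unmonitored)
    correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                  if p != unmonitored and p == t)
    recall = caught / monitored_total if monitored_total else 0.0
    precision = correct / claims if claims else 0.0
    return recall, precision
```

With 100 monitored samples of which the classifier correctly identifies 80, this returns a recall of 0.8, matching the 80% example from the talk.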
K
If
an
adversary
is
in
the
setting
where
the
web
page,
where
quick
has
been
deployed,
the
web
pages
that
he's
interested
in
tracking
or
identifying
may
be
downloaded
with
quick.
However,
he
has
not.
He
has
not
adapted
to
this
setting
because
well,
it's
the
same
web
page
being
downloaded.
The
protocols
are
similar,
similar
condition,
control,
algorithm,
similar
similar
loss
detection
and
recovery,
etc.
K
He can suffer greatly in terms of recall. If you consider, for example, the Var-CNN classifier, which is one of these deep learning classifiers: where it was achieving near 100% precision and recall on the TCP samples, it was achieving less than four percent recall on the QUIC samples. That means over 96 percent of the QUIC samples were actually evading the adversary in this setting.
K
So
the
adversary,
given
that
the
web
is
transitioning
to
quit.
I'm
sorry
most
no
account
for
that
in
his
procedure
in
his
machine
learning
classification
in
his
attack.
Otherwise
he
will.
He
will
miss
visits
the
web
pages
okay.
K
So
the
next
question
we
sought
to
ask
is
perhaps
this
is
because
quick
is
somehow
inherently
more
difficult
to
fingerprint
than
tcp,
and
is
this
really
the
case
and
in
this
setting
we
just
we
tested,
we
assumed
a
world
where
there's
only
tcp,
and
this
is
our
tcp
scenario-
the
adversary,
trains
and
tests
and
evaluates
attempts
to
detect
web
pages
in
this
scenario-
and
we
do
the
same
for
quick,
a
world
which
has
only
quick
where
every
web
page
supports
quick
and
the
web
page.
The
main
web
page
will
first
be
loaded
over
quick.
K
The classifiers that we evaluated perform nearly equally well in both the QUIC and the TCP settings. Interestingly, though, we did find that some of the classifiers performed perhaps even a bit better in the QUIC setting than in the TCP setting, which may allude to QUIC being somewhat easier to fingerprint than TCP.
K
However,
we
suspect
that
this
was
due
to
the
fact
that,
at
the
time
of
running
these
experiments,
the
deployment
of
quick
was
still
in
its
infancy.
You
could
say
so.
The
variety
of
of
quick
server
stacks
was
very
limited,
and
so
the
amount
of
difference
in
what
kind
of
samples
would
be
seen
between
across
web
pages,
etc
was
was
was
different.
It
could
also
be
due
to
to
reduce
middle
box
interference
we
have,
but
regardless
they
are
more
or
less
similar
in
the
adversary's
ability.
K
To
fingerprint
I
mean
both
the
quick
and
tcp
settings
so
given
that
quick
is
no
harder
to
harder
to
fingerprint
than
tcp,
and
given
that
an
adversary
must
account
for
quick
moving
forward
once
account
for
web
page
is
being
loaded
with
quick
moving
forward.
How
may
he
go
about
doing
so,
and
so
this
is
the
next
thing
that
we
evaluated
and
we
evaluated
two
potential
approaches.
The
adversary
may
use
to
jointly
classify
quick
and
tcp.
K
So
in
the
first
approach
he
says,
okay,
web
pages
may
be
loaded
with
quick
web
pages
may
be
loaded
with
tcp
I'll
collect
samples
of
both.
I
will
attempt
to
figure
out
I'll
attempt
to
train
and
teach
my
machine
learning
classifier
on
both
samples
and
hope
that
when
it
does
see
a
sample
in
the
wild,
whether
it's
quick
or
tcp
will
make
the
right
decision-
and
this.
K
Mix
classification
approach:
we
also
considered
another
approach,
which
we
call
the
split
ensemble
approach
and
in
this
approach
the
idea
is
well
given
that
the
quick,
the
dedicated
setting
in
the
dedicated
settings
either
the
tcp
sitting
or
the
quick
setting
the
classifiers
are
the
classifiers
perform.
So
well.
If
an
adversary
is
actually
able
to
distinguish
between
the
two
between
the
two
between
the
two
settings,
then
he
could
just
pass
the
sample
to
the
dedicated
classifier
and
voila.
K
So
here
the
dotted
line
represents
the
mixed
setting
and
I'm
here
we're
showing
the
tcp
and
quick
settings
for
reference,
and
so
what
we
found
is
that,
as
expected,
due
to
the
increased
noise
in
these
samples,
so
because
now
the
adversary
has
some
samples
loaded
with
tcp
has
some
samples
with
loaded
with
quick.
This
is
more
variance
in
the
data
set,
and
this
means
that
performance
suffers
to
counteract
this
adversary
has
to
either
train
call.
K
So
what
about
this
other
approach
this
this
split,
this
splits
ensemble-based
approach.
So
the
first
question
we
had
to
answer
here
was
whether
or
not
actually
it's
possible
to
distinguish
between
quick
and
t-speed
traces
in
the
network,
and
so
we
we
created
a
classifier
to
do
that,
and
this
is
us
in
this
setting.
K
...it's binary classification: we take a sample and it's going to be either QUIC or TCP. We used essentially a random forest classifier which just spat out a confidence that a sample is TCP, and then with one minus that confidence we take the sample to be QUIC. We found that, after only around 150 samples, we were able to detect whether a trace was a QUIC trace or a TCP trace with over 99% accuracy.
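The shape of such a distinguisher can be sketched with scikit-learn. This is a hypothetical stand-in, not the paper's actual feature set or training code: the two features (first outgoing packet size and observed handshake round trips) are assumptions chosen to mirror the handshake differences the talk describes next.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-trace features: [size of first outgoing packet in bytes,
# round trips observed before application data]. A QUIC Initial is padded to
# a large packet while a TCP SYN is tiny, so the classes separate easily.
X = np.array([[1252, 1], [1300, 1], [1280, 1],
              [60, 2], [52, 2], [64, 2]] * 30)
y = np.array(["quic", "quic", "quic", "tcp", "tcp", "tcp"] * 30)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Confidence that an unseen trace is TCP; one minus that is taken as QUIC,
# as in the talk's description of the distinguisher's output.
proba = clf.predict_proba([[1240, 1]])[0]
p_tcp = proba[list(clf.classes_).index("tcp")]
p_quic = 1.0 - p_tcp
```

On features this cleanly separable, the forest's confidence is essentially certain, which is consistent with the over-99%-accuracy figure reported in the talk.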
K
Looking
into
this,
what
we
found
was
that
this
was
due
to
the
handshake,
so
the
differences
in
the
handshake,
so
both
in
terms
of
the
rtts
for
the
connection
establishment.
So
when
these
the
these
bursts
of
packets
were
going
out
as
well
as
due
to
the
fact
that
for
quick
for
quick,
the
initial
plant,
hello
is
actually
quite
large,
whereas
for
tcp
it's
actually
quite
small,
the
same,
the
synaptic
handshake
is
is
quite
small,
and
so
this
actually
made
them
made
it
quite
easy
for
the
classifier
to
distinguish
between
the
two
protocols.
K
So,
given
that
we
can
actually
distinguish
between
the
protocols,
we
can
go
ahead
and
create
this
ensemble,
and
so
in
the
ensemble.
The
adversary
is
going
to
trains,
his
tcp
classifier
and
only
tcp
samples.
He
trains,
his
quick
classifier,
only
quick
samples,
and
then
he
trains.
This
distinguish
on
both
quick
and
tcp
samples,
and
his
job
will
be
to
distinguish
between
the
two
the
sets
of
the
two
samples.
K
Then,
when
he's
he
encounters
some
sample
in
the
while
he
passes
it
to
all
three
classifiers
the
tcp
classifier
says.
Well,
I
think
this
is
some.
I
think
this
is
youtube
with
probability
p1.
The
quick
classifier
says
I
think
it's
youtube
probability
p2
and
the
distinguisher
says
well.
I
believe
this
is
more
of
a
quick
sample
than
a
tcp
sample,
and
then
we
do
the
weighted
average,
which
allows
us
to
it's
a
safer
approach
for
combining
the
combining,
though
the
predictions
from
both
classifiers
and
that
forms
a
prediction
for
a
particular
web
page.
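The weighted-average combination just described can be sketched as follows. The page names and probabilities are made up for illustration; only the combination rule (distinguisher confidence weighting the two dedicated classifiers) comes from the talk.

```python
# Hypothetical sketch of the split-ensemble combination: each dedicated
# classifier gives per-page probabilities, and the distinguisher's confidence
# that the trace is QUIC weights the two predictions.

def ensemble_scores(tcp_probs, quic_probs, w_quic):
    pages = set(tcp_probs) | set(quic_probs)
    return {page: (1.0 - w_quic) * tcp_probs.get(page, 0.0)
                  + w_quic * quic_probs.get(page, 0.0)
            for page in pages}

scores = ensemble_scores({"youtube": 0.2, "wikileaks": 0.8},   # p1 per page
                         {"youtube": 0.9, "wikileaks": 0.1},   # p2 per page
                         w_quic=0.75)  # distinguisher leans towards QUIC
prediction = max(scores, key=scores.get)
```

Here the distinguisher's lean towards QUIC lets the QUIC classifier's vote dominate, so the ensemble predicts "youtube" even though the TCP classifier preferred "wikileaks".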
K
So
this
was
our
setup
for
the
split
ensemble
and
we
tested
on
our
one-to-one
ratio
of
quicken
tcp
samples,
and
we
found
that
surprising
surprisingly,
despite
the
fact
that
both
the
dedicated
tcp
and
quick
classifiers
had
a
very
high
precision.
The
split
ensemble
based
approach
actually
performed
worse
than
the
mixed
mixed
than
the
mixed
approach.
K
But
on
thinking
about
this,
we
realized
that
well,
it
makes
sense,
because,
even
though
there
are
different
protocols,
there's
still
actually
some
shared
information
in
between
these
two.
These.
These
two
sets
of
samples
about
that
web
page.
So
there
are
different
protocols,
but
they
do
contain
some
shared
information
about
the
web
page
being
visited
in
the
mixed.
The
mixed
classifier
is
actually
able
to
leverage
that,
whereas,
when
an
adversary
splits
it
splits,
the
two
data
sets
he's
not
able
to
benefit
from
that.
K
Quick
is
not
more
difficult
to
fingerprint
than
tcp,
however,
if
an
adversary
must
account
for
it
or
or
or
users
will
be
able
to
evade
classification
by
simply
switching
to
quick,
join
classification
of
the
both
protocols
are
are
is
possible
over
this
comes
the
cost,
for
the
adversary
has
to
be
expected
from
the
addition
of
of
of
more
noise
or
more
variance
to
the
or
more
variability
to
the
network,
and
so
he
will
either
need
to
increase
the
number
of
samples
or
or
increase
his
trading
time
to
account
for
this
and
finally
determining
if
a
trace
begins
with
toothpick
or
tcp
is
currently
trivial,
and
this
can
lead
to
attacks
on.
K
I
use
the
moment,
for
example,
if
an
adversary
observes
is
interested
in
a
hundred
web
pages,
only
40
of
which
currently
support
quick.
Then,
if
he
observes
a
mind,
if
he
observes
a
web
page,
which
he
knows
is
if
he
observes
a
web
page,
he
can
already
tell
that
it
is
not
any
of
those
60
tcp
tcps.
Only.
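The anonymity-set shrinkage in that example can be made concrete. The page names are placeholders; the 40/60 split mirrors the numbers from the talk.

```python
# Sketch of the partial-deployment leak: seeing that a trace is QUIC
# immediately rules out every monitored page known to be TCP-only.

monitored = {f"page{i}": "quic" if i < 40 else "tcp-only" for i in range(100)}

def candidate_pages(observed_protocol):
    if observed_protocol == "quic":
        return {p for p, support in monitored.items() if support == "quic"}
    return set(monitored)  # a TCP trace could still be any of the 100 pages

quic_candidates = candidate_pages("quic")  # 60 TCP-only pages eliminated
```

Observing the protocol alone shrinks the candidate set from 100 pages to 40 before any fingerprinting classifier is even run, which is the privacy implication the talk points out.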
K
...and this can have further implications for the user's privacy. So with that, I thank you for your attention, and I welcome any questions. Thank you.
A
Thank you very much, Jean-Pierre. Do we have any questions in the queue?
N
Hello,
thanks
for
the
presentation
have
questions
regarding
the
data
that
you
use
for
classifying
your
slides.
You
present
the
training
set
and
the
test
on
which
you
you
do
the
determination
about
the
web
page.
How
large
is
the
training
set
compared
to
the
overall
data
that
you
are
using?
This
is
my
first
question
and
the
second
question
is:
are
you
doing
your
training
and
tests
on
a
page
before
it
changes,
or
is
there
a
drift
over
time?
K
So
to
your
first
question,
the
split
was,
if
I'm,
recalling
correctly,
it
was
90
90
to
10,
so
90
of
the
samples
were
used
for
training.
10
of
the
samples
were
used
for
testing,
and
this
accounted
to
something
like
on
the
order
of
so
100
webpages.
K
Ninety
percent
of
that,
so
it
was
usually
about
90
samples
per
web
page,
the
nine
nine
thousand
monitored
samples
for
training
and
something
around
the
same
unmonitored
sample.
So
a
total
of
like
18
000.
No,
that's
not
true,
nine
thousand
monitored
and
something
like,
I
think,
40
000
unmonitored
for
training
and
the
rest
for
testing.
K
There
was
something
some
stuff
there
with
making
sure
in
the
monitor
unmonitored
said
that
the
same
thing
doesn't
end
up
in
both
the
training
and
the
test
etc,
which
reduce
the
effective
size
of
the
data
set.
But
that's
the
first
question
to
the
second
question:
we
did
not
evaluate
that
with
the
quicken
in
the
quick
and
tcp
setting,
but
that
has
been
evaluated
by
a
number
of
authors
up
to
this
point.
K
By removing the oldest samples and incorporating new samples, one could still maintain quite a high precision and recall for performing the attack. There are also quite a number of recent works; I guess I can stop sharing my screen and just speak while I'm looking. There are, I think, two works; triplet fingerprinting is one, which looked at how to take a classifier, collect, like, 10 or 20 new samples, and get back up to the high precision, even though the original samples were collected months ago.
K
So
this
really
makes
it
easy
for
the
adversary
to
keep
the
data
set
still
relevant
well
into
t,
plus
one
month
t
plus
two
months
d,
plus
three
months
after
he
collected
his
original
large
data
set.
O
Yeah, I'm just curious: when were your measurements actually taken? Or the real question is, if this was taken a while ago, are you planning any follow-up work to see what the fingerprinting potential is as adoption of QUIC increases?
K
So
this
was
done.
This
was
done
summer
2020,
the
collection
is
there.
Are
there
plans
for
follow-up
work?
Yes,
but
like
most
of
my
plans,
things
don't
go
according
to
them.
However,
I
I
have
gotten
quite
quite
a
lot
of
interest
from
other
researchers
in
the
past
three
or
four
months,
so
maybe
four
or
five
different
research
groups
looking
to
also
look
at
website
the
website,
fingering,
printing
topic
and
quick.
So
I
would
also
expect
some
more
recent
works
to
be
coming
out
in
the
next
in
the
next
months
or
yeah.
M
Can everybody hear me okay? Yep. So we actually presented this at the last IRTF meeting and we're very excited to be back. I'm Kyle, working with Zach, who presented last time, Sacha and Ben, who are here today, and our advisors Cristina and Srini. This is work in progress and I'll be looking at the chat, so if anybody has questions I'd be very happy to get them and answer them as I'm going. All right.
M
So
what
let's
see
slides
are
not
changing
all
right,
so
what
short
order
is
doing
is
shorter
is
a
overlay
for
the
tor
network.
That's
intended
to
reduce
latency
between
the
relays
on
a
circuit
by
making
better
informed
routing
decisions,
and
what
we're
going
to
do
is
we'll
go
over
first.
M
The
design
of
how
this
is
working,
try
to
evaluate
it
and
say,
are
we
actually
getting
better
latencies
and,
of
course,
integration
with
tor
is
important
if
it's
supposed
to
increa
improve
latency
on
tour,
and
then
you
know,
as
for
any
tor
proposal,
security
is
also
a
goal
for
this
project,
and
a
reminder
is
where,
as
we're
going,
that
this
is
work
in
progress,
so
I'd
be
interested
in
in
hearing
feedback
and
reminder
that
our
measurement
data
set
is
incomplete.
M
So
this
is
like
just
the
client
server
they're
trying
to
talk
to
each
other,
and
this
black
path
is
going
across
and
some
of
the
links
are
slow
and
you
know
you,
as
a
normal
person,
cannot
just
change
the
path
you
take
across
the
internet,
so
instead,
what
a
cdns
will
do
is
they'll
in
the
middle
actually
connect.
M
What we're proposing to change here is not this component; we're not changing the onion routing at all. We're instead trying to say that maybe some of these relay-to-relay connections on Tor are actually going over slower paths than they could be, and if you can notice this, you can route them via a different Tor relay rather than directly to the next circuit relay. What this can do, similarly to what it does for CDNs, is boost your performance: you'll be able to find a faster route, avoid congestion, etc.
M
Okay,
so
this
is
a
good
point
for
anyone
who
who
doesn't
get
where
this
is
fitting
into
tour.
If
you
can,
you
know
you
can
put
a
question
in
the
chat
and
I'll
get
to
it,
so
we'll
move
on
then
to
when,
when
this
happens,
I
mentioned
it,
but
here's.
M
Here's
like
what
you'd
expect
it's
just
only
going
to
do
this
when
the
path
between
a
and
b,
which
are
relays
in
a
circuit,
is
actually
like
higher
latency
than
the
path,
if
relay
a
connected
to
relay
c
and
then
relay
c
connected
to
relay
b
and
remember.
This
is
not
part
of
the
onion
routing.
This
is
not
in
the
encryption
layer,
yeah,
so
triangle,
inequality.
Failure
is
correct,
so
this
is.
This
is
specifically
just
on
top
of
the
already
existing
torah
routing
and
to
evaluate
this.
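The via-routing decision described above, take the indirect path only when a triangle-inequality failure makes it faster than the direct link, can be sketched as follows. This is a minimal illustration with made-up relay names and latencies, not code or data from the actual proposal.

```python
# Sketch of the via-relay decision described above. Relay names and
# latencies are invented for illustration.

def best_path(direct_ms, via_candidates_ms):
    """Return ('direct', latency) or ('via', relay, latency), whichever is faster.

    direct_ms: measured latency of the direct A->B link, in milliseconds.
    via_candidates_ms: dict mapping a candidate via relay C to the summed
        A->C plus C->B latency.
    """
    best = ("direct", direct_ms)
    for relay, via_ms in via_candidates_ms.items():
        if via_ms < best[-1]:
            best = ("via", relay, via_ms)
    return best

# A triangle-inequality failure: going through C beats the direct link.
print(best_path(180.0, {"C": 120.0, "D": 200.0}))  # ('via', 'C', 120.0)
```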
M
We were, of course, not able to get Tor relays to participate in this measurement directly, so we are running our own Tor relays and creating circuits using those to do our measurements. We can't, you know, prevent them 100% from carrying other clients' traffic, but we have very restrictive policies and bandwidth advertisements that effectively prevent them from carrying anybody else's traffic, while also not observing, recording or doing anything with anything we see about other traffic that is not our own.
M
Okay, all right, a question: since we don't know the whole path and this optimization is on a per-hop basis, does it mean that a Tor relay has to encapsulate, so as to steer the packet through the right middle relay? Okay, so this is more of a design question on exactly how Tor is doing this, and the answer is: it sort of depends on how you want to implement it right now.
M
Our proposal is that there's actually going to be an additional control-plane packet, basically a new Tor cell with a command that's going to indicate that this is supposed to be a via packet, and these via packets are just going to update the routing information that relays already have. The only thing that really needs to get skipped, and this is important for people who are familiar with how Tor relays operate, is the onion encryption; we are not skipping normal queuing.
M
We do not want our traffic to sort of out-compete normal circuit traffic at a relay, and so we do actually want relays, for the most part, to process this special via traffic the same way they would process normal Tor traffic, with the exception that they should not be doing the onion encryption and decryption. Did that answer your question?
M
Okay, so then here's what we have so far for the evaluation. Our current data set is approximately 125,000 pairs of measurements. We're focusing right now on relays with the largest consensus weight to start with. Our reasoning for this is, first of all, that these are the relays most likely to be chosen for circuits: the top 1,000-ish relays in Tor are present on about 75% of circuits, so at least one of these top 1,000 relays is going to be on 75 percent of all Tor circuits.
M
We've actually been spot checking for now with some smaller relays and found that it's quite challenging to get them to actually answer an all-pairs data set. So we can get some measurements, but we're not able to get smaller relays connected to every other relay in Tor to get the pairwise latency completely, and that brings us to the future component of these measurements.
M
We're never, we think, going to actually achieve a full all-pairs measurement data set for Tor. Tor has very high churn: some relays are going to go offline prior to the completion of these measurements, so we might get partway through it and then.
M
Okay, all right, so here's our first set of graphs. What we're seeing here are the pairwise round-trip time relay distributions, ShorTor versus baseline Tor, and then below it is the first circuit plot; we'll get to the circuits in a second. The top one is just what we've been measuring directly, and what you're seeing is round-trip times between these pairs of relays. The dark green is round-trip times we've observed when just connecting directly, as would be done in Tor, and the yellow is what we're seeing for.
M
If we're able to connect indirectly and it's faster, we do that. So the yellow and the green data sets are the same size in terms of number of pairs; it's just that some of the pairs in the yellow data set are actually connecting via an additional third relay hop. I'll mention here, and this is more of a security point, that we only connect via one additional hop. CDNs sometimes connect via several; it's a little bit unusual, they're more likely to use one, but occasionally it is actually faster via several. But for Tor.
M
You need to avoid that, because you can't actually loop through the same location twice, so you start to violate the security assumptions, so we're only able to do one. Okay, so, Benjamin, your question, I'm assuming, is about the circuits: this is actually not a simulation.
M
This is just directly measured latencies between relay pairs, and then summing them up to get the triplet latency; we are not actually repeating the measurement through the third relay. So we're taking the measurement from relay A to relay C and from relay C to relay B, and if that's a shorter path than from A to B, then we'll insert that into this plot instead.
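The summation just described can be sketched as follows: sum the two measured legs through a candidate via relay and keep that value only if it beats the measured direct time. The pairwise numbers here are invented, not from the actual data set.

```python
# Sketch of deriving the plotted via latencies from pairwise measurements,
# as described above. The measurements below are invented for illustration.

pairwise_ms = {  # (src, dst) -> measured round-trip time in milliseconds
    ("A", "B"): 180.0,
    ("A", "C"): 60.0,
    ("C", "B"): 55.0,
}

def effective_latency(a, b, relays):
    """Minimum of the direct A->B time and every measured A->C + C->B sum."""
    direct = pairwise_ms[(a, b)]
    vias = [
        pairwise_ms[(a, c)] + pairwise_ms[(c, b)]
        for c in relays
        if c not in (a, b) and (a, c) in pairwise_ms and (c, b) in pairwise_ms
    ]
    return min([direct] + vias)

print(effective_latency("A", "B", ["A", "B", "C"]))  # 115.0
```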
M
Okay, yeah, so for the circuits, the part that's simulated is that we're using the Tor path-selection tool: we've picked about 125,000 circuits, importantly only from the relays we actually have measurements for, because otherwise we can't plot the data. We're not, right now, wanting to try to estimate any of the measurements, so if we don't have a measurement from our own data set, we're not building that circuit.
M
What we're seeing in the pairs tends to be lower-bandwidth relays that are chosen less frequently for the circuits. So when we're picking circuits, we tend to choose the bigger, more popular relays, and so just because something is showing up as really nice in the pairs data set doesn't mean it's actually going to be useful in practice, because maybe nobody ever chooses that pair.
M
Okay, so the question is: is the circuit building entirely based on latency, or is bandwidth also considered? For example, could a node with really fantastic latency but a 10-kilobit-per-second link cause degradation for clients during circuit building? Okay, so we are not changing circuit building, and that's going to be very important for security later. We're actually just strictly building circuits exactly the same way that Tor would, and then, after a circuit has already been built.
M
The relays in that circuit may choose to route between themselves, but not between the two ends of the client and the server, via an additional relay. So the circuits are only looking at, you know, Tor's selection, which is bandwidth and family (no two relays operated by the same operator should be chosen), things like this; it's not looking at latency. Specifically, when we are choosing to go via an additional relay, we are looking only (asterisk) at latency.
M
The asterisk here is that obviously no intermediate relay, no via relay, should ever take traffic that it can't hold. So if it's overwhelmed, it will not agree to be an intermediate relay; it will drop that request, and when the request is dropped, the requesting relay that says.
M
"Oh, I'd like you to be this transit via relay" will know not to use it; it can never be sort of forced to do it. And then, secondarily, we're planning to use Tor's scheduling to make our via traffic lower priority than circuit traffic, and so it should never out-compete circuits that are committed to this relay. These via connections are transient: if they start to slow down and not be fast, you just stop using them.
M
We got this question last time at this meeting, and so I'm specifically mentioning it now, because it is a little bit unusual compared to most proposals in this space, and the reason we think it's okay is going to be on the next slide. But here are the minimum requirements: the minimum requirement to actually benefit from this as a client is that you've picked a circuit where two adjacent relays have both updated to support this proposal, and that some other relay in Tor.
M
That has also updated to support this proposal can be used to provide a faster path between those two relays on your circuit than the direct route that was chosen. So it's really more about this probability of selecting things that have chosen to update, and of course this is higher when more of the network has updated. But I think, as everyone is familiar with, Tor doesn't update that quickly, and so we're doing a little bit of analysis on incremental deployment for our currently small data set.
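The minimum requirement above suggests a back-of-the-envelope check. This is my own simplification, not the presenters' analysis: if relays are chosen independently and a fraction f of the selection weight has upgraded, a given adjacent pair in a circuit both support the protocol with probability about f squared.

```python
# Back-of-the-envelope sketch of incremental deployment (my simplification,
# not the presenters' model): with independent relay selection and a
# fraction f of selection weight upgraded, a given adjacent pair on a
# circuit both support the via protocol with probability roughly f**2.

def adjacent_pair_upgraded(f):
    return f * f

for f in (0.1, 0.25, 0.5, 1.0):
    print(f"{f:.2f} of weight upgraded -> "
          f"{adjacent_pair_upgraded(f):.4f} of adjacent pairs eligible")
```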
M
The darkest color here is if every pair in the set, if everyone, supports this. Sorry, let me also back up one second: this is the reduction in round-trip times, no longer just the round-trip time itself, so here larger numbers are better. This is the speed-up we're seeing: if everyone supports it, we're seeing some quite large speed-ups, and as fewer relays support it, say like this.
M
Purple is only the top 100 relays in our data set, so if only the largest 100 relays in terms of consensus weight support this protocol, we're actually still seeing that some small but existing fraction of the pairs are getting like 500-millisecond speed-ups. If you go up to the top 250 relays, we're getting some 1,500-millisecond speed-ups. So we think that this protocol is actually going to be quite useful, even if really like.
M
Okay, so, finally, security. We're not changing how clients select their circuits, so, as a result, every part of this path, including our via relays, is completely independent of the client identity and their destination identity.
M
We're analyzing security using the MATor framework, which looks at basically observation points on a circuit and the probability that an observation point is useful in either identifying the client, their destination, or the connection between them. We're also looking at network share, and what we're referring to as network share is just the probability that some adversarial relay might actually see a larger fraction of Tor traffic when using our protocol than it would have as just a circuit relay in vanilla Tor.
M
Finally, because we're not touching the circuit selection process, we trivially support any alternate circuit selection processes that may be more security-focused than Tor's current protocol, specifically proposals that avoid passing through the same autonomous system twice, or things like this.
M
Okay, we'd like to finish our measurement collection. Obviously, right now, like I said, we have about... I think it's currently actually at 140,000. The top 1,000 relays all-pairs is a million pairs; the full set of Tor relays is closer to 50 million pairs. And the reason I'm not giving you concrete numbers for this security slide we just looked at is that currently our data is just not representative. So we'd like to, you know, finish evaluating the security and effectiveness of this proposal.
A
That's great, thank you so much for the presentation, and thank you for coming back to the group to follow up on your early work. We appreciate it, and thank you for yet again multitasking superbly and handling almost all the questions while you're speaking, doing my job for me. Do we have any more questions in the queue? Oh, I see one in the chat. Please go ahead, and, yeah, okay.
M
So, would the data set be used by clients, or is it used by the relays? The data set is currently a touchy subject, actually, so we are not sure if the data set is something that will ever be released in its complete form. There has been some recent work saying that latency across Tor can potentially be used to identify the exit relay that a client has chosen, and so, potentially, we're not sure in practice, and because of that, the client will never see this data set.
M
The client is actually totally agnostic to whether the relays in their circuit are using this protocol at all, and we think this is important, because we really don't want clients to change their behavior because of this. We don't want anybody to behave differently from the majority of Tor users as part of this protocol, because this could make them more identifiable.
M
Yeah, so currently the plan is: Tor cells have a header that includes the circuit ID and, you know, a command field, and what we're planning to do is include two new fields to support this, which are the prior relay on a circuit and the next relay on a circuit. So what an intermediate relay is going to see is that it will now be able to learn where to forward traffic based on the prior relay and the next relay on the circuit. That's now confused, like you mentioned; they don't... the connection is different.
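The extended cell just described could be sketched roughly as below. The field names, types and command value are my guesses for illustration, not the proposal's actual wire format.

```python
# Sketch of the extended cell described in the answer above: the usual
# circuit ID and command, plus prior-relay and next-relay identifiers so an
# intermediate via relay knows where to forward without onion-decrypting.
# Field names and the command value are guesses, not the actual format.

from dataclasses import dataclass

@dataclass
class ViaCell:
    circuit_id: int      # as in a normal Tor cell header
    command: int         # hypothetical new VIA command value
    prior_relay: bytes   # identity of the previous relay on the circuit
    next_relay: bytes    # identity of the next relay on the circuit
    payload: bytes       # forwarded as-is: no onion crypto at the via hop

def forward_target(cell: ViaCell) -> bytes:
    # An intermediate relay forwards based purely on the header fields.
    return cell.next_relay

cell = ViaCell(7, 99, b"relay-A", b"relay-B", b"opaque payload")
print(forward_target(cell))  # b'relay-B'
```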
M
Okay, so there's a question in the chat on the use of latency introducing a bias that could allow de-anonymization based on knowledge of the network topology. So this is some recent research that's been saying that latency is heavily correlated with geography, and so fast paths tend to be short paths, and this has been a problem with proposals that are selecting, especially proposals that are selecting, the circuit relays based on fast circuits, because now you're selecting circuit relays that are physically close to either the client or their destination.
M
We have less correlation, because the circuit relays are still being selected at random, just based on bandwidth, not on latency. However, what I believe this is sort of saying is that, by reducing the latency, it's becoming closer to the minimum, which is closer to the geographic distance. So, in that regard, yes, this is technically making the round-trip time of an entire circuit closer to the optimal, which is closer to the geographic distance between those two points.
A
Okay, thank you for that. I'm sure you'll want to follow up in the chat directly with the other questions there. Thank you once again for your presentation. Thanks.
A
So, moving on to the next talk, we now have a presentation by Tommy Pauly on Private Relay. Tommy, I believe you want me to run the.
P
Yes, lovely, thank you, all right. Yes, I'll be giving a quick introduction to a new service that we at Apple announced recently called iCloud Private Relay, and I think, given a lot of the discussions that have already happened in PEARG about IP privacy and some of these earlier talks, this will be interesting for this group, and we'd love to hear feedback on how people could see this evolving going forward. So, next slide, please. All right, so, you know, what is Private Relay?
P
This is kind of the name we've come up with for this service overall, but it includes several different aspects, several different pieces of IETF-based technology that we're trying to put together, and overall, the goal is to have a solution that's promoting user privacy, specifically by separating out any cases where you have client IP addresses that are identifying users or their location, and separating that out from what origin servers they're trying to access.
P
It's not a full Tor threat model, but as we were looking at things, this seemed to be a very, very common linkage of information that was used by many different parties to track users, and overall was really hurting user privacy, and so.
P
Here we have MASQUE, which is proxying over QUIC. The MASQUE working group is going to meet later today, so come to that if you want to talk more about it. We have some MASQUE proxies to fully protect traffic, and we're always putting these in multi-hop configurations, always more than one proxy, so kind of like what we were just talking about with Tor nodes. We're using Oblivious DoH, so Oblivious DNS over HTTPS, which is DoH that's being proxied, for all of the other traffic. So, essentially, you can have two modes.
P
Clients are authenticating using tokens that are RSA blind signatures, so we're able to vend out tokens to legitimate users and legitimate devices, but then, when they redeem these single-use tokens into the system, they cannot be tracked as to who the user was or anything about their device; it's a blinded token there. So, next slide. So, just practically, the scope of what we're applying this to now is iOS 15 and macOS Monterey, and these are the versions of the OS.
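The blinded-token redemption mentioned above rests on the RSA blinding trick: the signer signs a blinded value, so the later unblinded signature cannot be linked back to the signing request. This toy sketch with tiny textbook-RSA numbers shows only that algebra; it is not Apple's actual scheme (which uses a hardened blind-signature protocol with proper padding) and is completely insecure.

```python
# Toy illustration of the blind-signature idea behind the tokens described
# above, using textbook RSA with tiny numbers. Insecure; for the algebra only.

n, e, d = 3233, 17, 2753     # toy RSA modulus and key pair (p=61, q=53)
token = 42                   # the client's token message

# Client: blind the token with a random factor r before requesting a signature.
r = 7
blinded = (token * pow(r, e, n)) % n

# Signer: signs the blinded value without ever learning the token.
blind_sig = pow(blinded, d, n)

# Client: unblind; the result is a valid RSA signature on the original token.
sig = (blind_sig * pow(r, -1, n)) % n

assert pow(sig, e, n) == token   # verifies like an ordinary RSA signature
print("signature verifies; the signer never saw", token)
```

Because the signer only ever sees `blinded`, redeeming `sig` later reveals nothing about which signing request produced it, which is the unlinkability property the talk describes.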
P
That
apple
is
putting
out
that
are
currently
in
beta
and
for
those
versions
we're
putting
all
of
your
safari
browsing
traffic
through.
If
you
have
this
feature,
enabled
all
of
your
dns
traffic
and
then
all
of
the
unencrypted
http
traffic
on
the
system.
These
are
kind
of
identified
as
the
highest
vulnerability
traffic
for
both
security
and
privacy,
without
kind
of
going
full
bore
and
putting
everything
through
a
proxy
and
we're
also
using
this
as
an
underlying
technology
to
protect
against
pixel
trackers
in
mail.
P
Oh, did that go backwards? So, forwards. Thank you. All right, so, to get into some of the privacy goals and just talk about some of the design pieces we have here: I think we have three overall goals that I wanted to share with this group and kind of hear people's thoughts on. First.
P
Again, the main privacy thing is that we want no entity anywhere along the chain of your traffic to see both who the user is, based on their IP address, for example, as well as what they're accessing, be that the name of the origin server. So, for any entity along the network path: the user can, of course, log into an origin server, but no one can passively track who this person is and what they're accessing, not the ISP, the servers, or any of the relay infrastructure in between.
P
We also want to ensure that the performance through this is good enough for just generic web browsing and for anyone, rather than having it be something where someone says, "Oh, I have to turn on my VPN or Tor, and I know it's going to be slow, but that's okay." We want to push it to say, you know, at least for these web browsing cases.
P
You know, people should not have to notice a difference. And then, kind of along those lines, to really get a lot of these benefits over time, we wanted to make sure it's something that could be left on all the time as you're using it, and not something that you have to flip on and off. Many VPNs, at least on our platforms, often have to be manually controlled; users have to be aware of when they're on or off, and what we're trying to figure out is.
P
How can we get to a place where, just like, you know, using TLS in your web browser is a given and a default, and you don't have to say, "Oh, I want security now", how do we have privacy like this be a default? Next slide.
P
So here's one of the diagrams we were working with. In the status quo, you have a client talking through its access network, through the internet, all the routers, up to the server, and the two pieces of information we're particularly looking at here are the name of the server we're trying to access and the client's IP address, and right now this is shared pretty much with everyone, and that's not good.
P
The client IP address is only visible from the client, through the access network, up to the ingress proxy. It has its own encrypted connection there; it's forwarding along an encrypted connection to the egress proxy, and that only sees the server name, but not where the client IP address is coming from. And these two hops have to be operated by different entities, different companies, and the client is able to authenticate that, at least, you know, through what it knows about the key management for these different proxies. Next slide.
P
Yeah,
so
in
this,
the
client
is
responsible
for
selecting
what
hops
it
goes
through,
and
it's
controlling
the
fact
that
there's
nested
encryption
to
each
hop
and
it's
managing
that
handshake
and,
as
I
mentioned
before,
the
hops
have
to
be
run
by
separate
entities.
P
P
You
know
this
is
something
that
is
more
of
a
policy
through
the
system
that
we
kind
of
by
contract
are
not
going
to
be
sharing
the
data
here
and
I
see
jonathan
in
queue.
Do
you
have
a
question.
Q
I can hear you; I'll try and talk loudly, then. It's more of a comment, I guess: for a global passive adversary, unless you're adding cover traffic, they can track a user; they can identify them through both sets of the hops. Or at least, from a theory perspective, it's impossible to prevent that without cover traffic.
P
Yeah, absolutely, so that is kind of not part of the immediate threat model that we're trying to address here; it's certainly still possible for someone to stitch everything together. What we're trying to do is say, you know: is there a kind of incremental step that we can make available to pretty much everyone, that at least removes the fact that, today, multiple entities are seeing all of the data, and it's just trivial for them to log, as a web server, all of your information like that?
P
Good
clarification
all
right
next
slide,
all
right,
so
the
other
thing
that
we're
we're
trying
to
work
on
here
is
the
fact
that
oftentimes
people
assume
that
you
know
having
privacy
is
slow.
We
just
heard
a
great
presentation
about
trying
to
reduce
some
of
the
latencies
that
you
have
with
tor,
but
we
knew
that.
Try
to
you
know,
jumping
to
latencies
like
we
would
see
like
that.
Often
today
would
not
be
acceptable
for
a
default
stance
for
a
browser,
so
next
slide.
P
So
what
we're
doing
here
is
trying
to
take
advantage
of
a
lot
of
the
benefits
we
have
through
quick
and
through
mask
and
do
many
things
to
improve
the
latency
that
we
have
through
this
system.
Part
of
that
is
just
about
the
deployment
of
the
various
relays,
making
sure
that
they
are,
you
know,
optimally
routing
traffic
between
them
and
have
a
very
good
coverage
of
deployment
throughout
the
world.
P
Some
of
the
other
things,
though,
that
I
wanna
you
know
call
out
for
hey
mask,
is
very
cool
in
this
way
is
as
we're
doing
the
forwarding,
through
these
different
hops,
we're
able
to
take
advantage
of
the
fact
that
by
proxing
we
can
get
a
lot
of
fast,
open
connections,
we're
not
doing
full
ip
proxy
here
we're
proxying
at
the
http
3
stream
level.
P
So
if,
in
this
case,
if
I'm
talking
to
a
normal
tls,
tcp
origin
server,
I'm
essentially
just
forwarding
a
quick
datagram
through
the
ingress
proxy,
very,
very
little
processing
there.
It's
a
connect
request
to
the
egress
proxy,
which
also
includes
at
the
same
time
the
tls
client
load
that
we
want
end
to
end,
and
so
the
egress
proxy
is
able
to
set
up
tcp
to
the
server,
send
the
tls
handshake
such
that
in
essentially
one
rtt.
P
From
the
perspective
of
the
client,
it's
able
to
set
up
a
full
secure
connection
to
most
any
server,
and
it's
not
having
to
do
dns
for
that.
It's
not
having
to
wait
for
the
tcp
handshake
before
starting
tls
next
slide
yeah.
So
we're
able
to
pretty
much
always
do
a
fast
open,
and
this
really
brings
a
big
benefit.
P
So this is not having a bad impact on normal web browsing, and there are a couple of other side benefits here that you have by proxying at this level. You know, we're actually able to be using QUIC anytime your local network supports it, regardless of what the actual end server supports. So, even though, you know, most servers in the world have not adopted QUIC.
P
Clients can pretty much always use native v6 to talk to the proxy, as long as the network supports it, and so we're able to skip certain inefficiencies, like what you'd have on an IPv6-only network, where you'd have to re-encapsulate or translate IPv4 packets.
P
So, in general, there are actually a lot of wins that we see by switching to this type of proxy. Next slide. And then, last, I just want to talk about some of the design decisions we had to make to try to make this be something that you could have on all the time and break as little as possible, really with the goal of making sure that as many people as possible can use this type of privacy without having to turn it off. Next slide.
P
So
there
are
a
couple
considerations
we
had
for
creating
exceptions
for
traffic
that
should
not
go
through
this,
and
these
are
things
that
oftentimes
are
not
handled
by
vpns
that
are
on
by
default.
P
So
there's
no
impact
on
local
network
traffic
at
all.
So
if
we
detect
that
you're
using
a
private
ip
address
space,
that's
going
to
route
over
your
local
link
that
just
doesn't
go
through
this.
That's
not
part
of
the
threat
model
we're
trying
to
address.
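The local-traffic exception just described amounts to an address-space test. A minimal sketch with Python's standard `ipaddress` module is below; a real implementation would also consider routing tables and interface scope, and the example addresses are arbitrary.

```python
# Sketch of the local-traffic exception described above: destinations in
# private, link-local or loopback address space bypass the relay. This only
# shows the address test, not the full routing logic.

import ipaddress

def bypasses_relay(dest: str) -> bool:
    addr = ipaddress.ip_address(dest)
    return addr.is_private or addr.is_link_local or addr.is_loopback

print(bypasses_relay("192.168.1.10"))  # True: stays on the local network
print(bypasses_relay("8.8.8.8"))       # False: goes through the relays
```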
P
So
this
is
important
to
make
sure
that
if
you
are
in
an
enterprise
environment
on
an
internal
network,
you're
still
able
to
access
the
private
names,
because
we
always
would
first
check
with
the
relay
infrastructure-
and
if
it
says
this
is
not
resolvable
in
the
public
dns,
then
we
are
allowed
to
try
automatically
on
the
local
network
and
then,
similarly,
if
the
user
has
explicitly
installed
enterprise
software
to
have
a
vpn
or
a
proxy,
we
let
that
take
precedence
over
this.
So
this
can
just
be
kind
of
the
background
default
internet
privacy
level.
P
Next
slide,
there's
also
a
lot
of
interesting
things.
We
thought
about
for
compatibility
for
existing
servers
right
now,
we're
trying
to
maintain
your
rough
ipg
location.
P
So
it's
definitely
better
privacy
than
you
have
before,
but
the
user,
if
they
want,
can
preserve
some
location
and
as
we're
looking
at
this,
I
think
we
recognize
that
there
are
a
lot
of
there's
a
lot
of
need
for
more
standards,
work
to
be
done
in
the
area
of
geolocation
and
how
that
shared
with
permission
and
how
to
maintain
fraud
prevention
mechanisms
in
the
face
of
ip
address
privacy.
P
Next
slide,
yep
next
slide
so
going
forward.
You
know
we're
using
mask
for
this
we'd
love
to
see
more
expanded,
support
from
ask
by
other
vendors
we'd
love
to
see
an
open
and
integral
interoperable
network
of
these.
That
would
help
privacy
for
all
users
everywhere.
P
I
think
you
know
there's
interesting
models
we
can
imagine
of
having
ingress
proxies,
move
into
isp
and
carrier
networks
to
make
sure
that
they
are
very
optimized,
we're
already
hosting
some
egress
proxies
within
content
provider
networks
to
make
sure
that
it's
very
accessible
and
very
quick
to
get
to
your
content
and
there's
a
lot
of
interesting
work.
To
do
that.
I
think
we'd
love
to
learn
from
the
torah
community
about
how
you
discover
hops
and
choose
interesting
routes
through
the
network
and
next
slide
great.
L
Okay, the authentication of the origin server: is that still end-to-end, or is it, you know, authenticated by the proxy in some sense?
P
All
everything
is
intend,
the
only
thing
that
would
be
between
the
proxy
and
origin
would
be
a
tcp
handshake,
but
everything
on
top
of
that
is
end
to
end
and
in
the
case
of
quick
origins,
it's
fully
entanned.
F
The chat already... I don't know if my audio is going to work this time, but it's choppy.
P
So
I
I
see
his
comment
here.
How
would
you
characterize
the
longer
term
complexity,
trade-offs
between
this
approach
and
trying
to
eventually
move
to
something
simpler,
more
generic
but
harder
to
get
deployed
like
oliver
tor?
I
think
I
don't
actually
think
there's
a
terrible
amount
of
complexity
in
this.
It's
just
you
know
one
particular
protocol.
So
I
guess
what
I
would
like
to
see
is
you
know,
let's
try
to
figure
out
collectively
in
standards
and
industry.
P
What
solution
do
we
think
makes
sense
to
be
deployed
all
over
and
I
think
mask
proxies
is
a
good
starting
point
and
seeing
that
converge
with
something
that's
compatible
with
tor
would
be
very
interesting.
O
Yeah, so I have five questions; I'll start with one and a half. Are you using DNS for key management, and are you hard-coding the list of relays, or are you using something else?
P
Yeah
right
now
the
public
keys
used
for
authenticating
the
proxies
doing,
oblivious
dough.
All
of
that
is
coming
from
kind
of
a
control
plane
server
from
icloud
that
the
device
checks
into
that's
very
much
a
kind
of
a
short-term
practical
decision
for
this
feature-
and
I
think
one
of
the
interesting
areas
to
look
forward
to
especially
for
standardization-
is
making
them
more
open
and
discoverable
and
extensible
in
a
maybe
dns,
maybe
some
other
type
of
public
log.
A
Joseph, if you have a very quick question... but if you want to jump in real quick.
A
Oh, no, I think... or if you can take that to the chat, that would be great. Good.
R
All right, so I'm Josh Karlin, tech lead manager on the Privacy Sandbox team at Chrome, and thanks for having me out to present on our work; I really do appreciate the opportunity. For those that don't know, FLoC is a project in Chrome's Privacy Sandbox, and it just wrapped its origin trial. I'll walk you through the technology behind that trial, and we're also chewing on the feedback that we've received from the trial, so I'll talk about some of our thinking about what comes next.
R
So,
in
order
to
talk
about
flock,
you
have
to
understand
what
browsers
are
doing
and
where
they're
heading
research
has
shown
that
up
to
52
companies
can
theoretically
observe
up
to
91
of
the
average
user's
web
browsing
history
and
greater
than
600
companies
can
observe
at
least
50
of
the
user's
browsing
history.
R
It's
a
significant
amount
of
work.
We
need
to
partition
everything
in
the
browser
resource,
caches,
dns,
caches,
cookies,
javascript
storage,
socket
pools,
tls
session
identifiers.
Everything,
and
I
just
want
to
be
clear
that
when
I
say
we
are
building
walls
between
sites,
what
I
mean
is
like
the
registrable
domain
of
the
url.
That's
in
your
url
bar.
H
R
So
the
the
mainframe
of
the
site,
which
is
in
the
url
bar
it's
that
registrable
domain.
That
is
what
I
mean
by
the
site.
R
Flock
is
focused
on
interest-based
advertising
interest.
Advertising
collects
the
topics
of
the
page
that
the
users
visited
to
form
a
user
for
that
profile
of
interests,
which
then
allows
ads
to
target
a
user's
array
of
interests
instead
of
the
context
currently
available
on
the
page,
which
might
not
be
particularly
valuable.
R
The
goals
of
this
project
are
to
support
interest
based
advertising
in
an
easy
to
use
way,
while
still
making
it
hard
to
track
individual
users
online.
When
I
say
easy
to
use,
I
mean
that
advertisers
can
use
similar
technology
to
what
they're
already
using,
so
that
it's
easy
to
transition
to
which
greatly.
R
This
is
what
the
api
looks
like
it's
a
single
call,
document.interest
cohort
it's
async
and
returns
a
promise
when
it
resolves.
You
read
a
dictionary
value
that
has
a
version
and
a
cohort
id.
The
version
is
there
because
we
expect
flock
to
iterate
before
we
and
the
ecosystem
are
satisfied
plus.
Even
then,
we'll
still
need
to
update
things.
R
R
So this is a research group, so let's go into a little bit of detail about how a cohort's actually produced. To begin with, it's done entirely client-side, with the same data that's used to store user browsing history; no new data is collected, and the only part of the history of URLs that is used is the domains of the URLs. We're not looking at the path; we're not looking at the contents of the pages; we're just looking at the domains.
R
We then use SimHash to reduce this vector down to 50 bits; again, the distance between input items should be preserved in the output of the SimHash. And, finally, we apply a mapping, provided by a Chrome server, that groups adjacent SimHash values together to ensure that there are at least several thousand users in each cohort. This reduces the dimensions down to 16 bits. So that's it: no smarts, it's all client-side, no ad-tech logic, just a dumb dimensionality reduction and some merging of neighbors at the end to ensure that cohorts are large enough.
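The SimHash step above can be sketched minimally: hash each visited domain and let each hash bit vote on the corresponding output bit, so similar histories tend to land near each other. This is illustrative only; Chrome's actual FLoC feature encoding and hash function differ, and the domains are made up.

```python
# Minimal SimHash sketch of the client-side reduction described above:
# each visited domain is hashed, and each hash bit votes on the matching
# output bit. Illustrative only; not Chrome's actual implementation.

import hashlib

BITS = 50  # FLoC reduced histories to 50 bits before server-side grouping

def simhash(domains, bits=BITS):
    votes = [0] * bits
    for domain in domains:
        h = int.from_bytes(hashlib.sha256(domain.encode()).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

a = simhash(["news.example", "sports.example", "weather.example"])
b = simhash(["news.example", "sports.example", "recipes.example"])
# Overlapping histories tend to differ in only a few of the 50 bits:
print(bin(a ^ b).count("1"), "differing bits out of", BITS)
```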
R
IP addresses, and if the page hadn't opted out. Now, for the origin trial, we were concerned that the data wouldn't be representative for early adopters, since these early adopters would be the only ones triggering the API, and their intent was to try to figure out if this was an API that was useful to them. But since they were the only ones using it, and there were few of them, and origin trials specifically require that only 0.1 up to 0.5 percent of traffic be within the origin trial and be using the API, the usage would be severely limited.
R
Pages with ads are the ones that we considered most likely to use the API. All right, so let's talk about some of the characteristics we wanted. We wanted the cohorts to be k-anonymous: we ensured that there are at least 2000 Chrome Sync users per cohort.
R
Chrome's researchers anonymously gathered the sites that people in different cohorts visit, from Sync users, in an effort to reduce the amount of sensitive information that might leak. If a cohort is more strongly correlated with a sensitive topic than the general population, then the cohort is added to a list of revoked cohorts, which gets distributed to clients.
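The revocation step just described might look like the following sketch. This is hedged heavily: the real analysis uses t-closeness over anonymously gathered data, while the ratio test, threshold, function name, and field names here are all simplifying assumptions.

```javascript
// Illustrative sketch of the server-side sensitivity check described
// above. Assumption: a cohort is flagged when its rate of visits to
// sensitive sites exceeds the general population's rate by some factor.
// Chrome's actual analysis uses t-closeness; this ratio test is a
// deliberate simplification of the same idea.
function revokedCohorts(cohortStats, populationRate, maxRatio = 2.0) {
  const revoked = [];
  for (const [cohortId, s] of Object.entries(cohortStats)) {
    const cohortRate = s.sensitiveVisits / s.totalVisits;
    // A cohort much more correlated with sensitive topics than the
    // population as a whole goes on the revocation list sent to clients.
    if (cohortRate > populationRate * maxRatio) revoked.push(cohortId);
  }
  return revoked;
}
```

Clients receiving the resulting list would simply refuse to report a revoked cohort, which matches the talk's description of the list being distributed to clients.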
R
There are, in this origin trial, quite a few parameters, and for some reason several of them landed on seven: FLoC is updated once every seven days, including data from your last seven days, and you must have visited at least seven different sites in your history to get a cohort.
R
So now let's transition from talking about the completed origin trial to how FLoC might improve in future iterations, based on this feedback. These ideas are still being evaluated; none of this is locked in. We plan to discuss these ideas in the open, refine them, and test them to better understand their privacy and utility.
R
Second, cohorts are hard to understand for end users and even technologists. We started seeing comments along the lines of "oh, FLoC is going to reveal that you like green sports cars, mowing the lawn, and a particular kind of window shade." It simply was not nearly that specific, and couldn't be, but that was still hard to express and still hard to understand.
R
The topic taxonomy can be shorter, say 256 topics as opposed to the roughly 30,000 cohorts, so there's less fingerprinting surface as well, and users might be able to opt in or out of particular topics. I should note that we're not the only ones thinking about this: the Privacy CG is talking about Ad Topic Hints, which is an alternative API to FLoC, meant to address interest-based advertising, and is also based on providing interests via topics.
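To put rough numbers on that surface reduction, the entropy comparison works out as below. The assumption that every value is equally likely is unrealistic (real cohort and topic distributions are skewed), so treat these figures as an upper bound, not a measurement from the talk:

```javascript
// Back-of-the-envelope fingerprinting-surface comparison: ~30,000
// cohorts versus a 256-entry topic taxonomy. Measuring the surface as
// log2(number of values) assumes a uniform distribution, which real
// distributions are not; this is only an upper bound.
const bits = n => Math.log2(n);

const cohortBits = bits(30000); // about 14.9 bits per weekly cohort value
const topicBits = bits(256);    // 8 bits per topic value
const saved = cohortBits - topicBits; // about 6.9 fewer bits exposed
```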
R
The third issue is that FLoC adds new fingerprinting surface, and this is true. Note that today's fingerprinting surface, even without FLoC, is easily enough to uniquely identify users, and part of the Privacy Sandbox's goal is to reduce that fingerprinting surface and budget its usage. Nonetheless, we'd like to add as little new fingerprinting surface as possible and still be useful, and we believe that there is more that we can do. First.
R
Second, we can add in random topics with some probability. So for a given week, instead of the user's topic, with five percent chance you would say: actually, we're going to give you a totally random topic.
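That noising step might look like the following sketch. The five percent figure is from the talk; the function, its arguments, and the injectable random source are illustrative assumptions.

```javascript
// Sketch of the plausible-deniability mitigation described above: with
// 5% probability, a site is handed a uniformly random topic instead of
// the user's real one, so no observed topic can be attributed to the
// user with certainty.
function topicForSite(userTopic, allTopics, rand = Math.random) {
  if (rand() < 0.05) {
    // Noise branch: pick any topic from the taxonomy at random.
    return allTopics[Math.floor(rand() * allTopics.length)];
  }
  return userTopic;
}
```

Passing the random source as a parameter is only there to make the sketch testable; a browser would use its own entropy.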
R
Applied to different sites, this actually has a pretty significant impact on cross-site fingerprinting. For example, you could take the user's top five interests instead of just the top one and give each site one of them for the given week, at random; the chance that two sites see the same topic for a given user is then only 20 percent. Taken together, we think these mitigations could dramatically drop the usefulness of FLoC for cross-site fingerprinting.
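The 20 percent figure can be checked directly: if two sites each independently see one of the user's top k topics, chosen uniformly at random, they match with probability 1/k. A minimal sketch (the function name is my own):

```javascript
// Probability that two sites observe the same topic for one user, when
// each site is independently shown one of the user's top k topics,
// uniformly at random. Summing P(both pick topic i) over all k topics
// gives k * (1/k)^2 = 1/k; for the talk's k = 5 that is 20 percent.
function sameTopicProbability(k) {
  let p = 0;
  for (let i = 0; i < k; i++) p += (1 / k) * (1 / k);
  return p;
}
```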
R
Regarding sensitivities, this is significantly improved by having a small curated list of topics, and perhaps this list could be maintained by an external organization. To stay on the safe side, we'd likely continue to monitor topics for sensitivity with t-closeness analysis, as we did before.
R
This is beyond the scope of interests derived from third-party cookies, which are limited to the pages that the third party was present on. One option we're considering is to provide a per-third-party set of topics, based on the pages that the third party was on and then called document.interestCohort() from; then we could say that FLoC is 100 percent a subset of the capabilities of third-party cookies.
R
So we'd have to counter this by limiting the number of topics readable on a page, or rather a site, to two or three per week, and while mitigated this way, it does still increase the fingerprinting risk over global values.
R
So that's it. For more information, we have a GitHub page with our issue tracker and our explainer.
R
We intend to ramp up our conversations quite a bit more; I hope to have phone calls starting soon, every other week or so. The Privacy Sandbox overview is available at privacysandbox.com, and for more details about the FLoC origin trial and all its specifics and technical details, please see the Privacy Sandbox FLoC page. I'm happy to take questions.
A
Q
R
The client has its cohort determined by the browser, which then gets sent. If you wanted to override the JavaScript API and have it return a random value, you absolutely could. If you disable FLoC via the UX, then what happens is that you just don't have any FLoC sent; the promise rejects.
Q
Information about me, whereas just returning a random string... it's very difficult to put me in a group that says: oh, this is a person who always has it off.
R
...to being in this group of users that have it off, because that group is not small. And it makes it easier to train, on the advertising side, to understand whether these topics or cohorts have meaning; if you make it random, it deletes the meaning. We are doing some five percent random, and the number of users that actually can't have a cohort, for one of a myriad of reasons, is actually fairly high, so we don't want to completely drown out the signal.
A
Okay, thank you. Watson, you're next.
B
Thank you very much for your presentation; it was pretty informative. I'm not seeing the justification for leaking information about user history beyond "we're already doing it today and we need to support this use case." Well, it's not... you know, it's clear there's a considerable number of consumers who don't want to see this.
B
R
This is from...
R
Personalized advertising is a huge chunk of change, and something that we feel needs to be supported. If you don't want to have that, then feel free to disable it; it is very easy to disable.
B
Watson: I think I would need to see a lot more data; we can't just... I mean, display advertising worked for many, many years to support the news industry, and that doesn't require having information about user identity. And it's not clear to me that users understand the consequences, because there's a great incentive in making sure they don't understand the consequences of what FLoC is revealing. And you know, when we talk about free and open, that doesn't, to me, mean you can make money off it.
B
It means that you can put websites up and you don't have to worry about your privacy or security being violated. These are much more important questions than who exactly is making money, or how the slimiest people on the web continue to make a living. That's not, I don't think, as important.
A
Okay, Matthew, you're next up; please go ahead.
A
O
R
So the intent was... now, things are changing, since we're talking about new directions, but the intent of that mapping sent from Google to clients was to do two things. One was for cohorts that we deemed to be sensitive: we wanted them not to be used by clients, so that list would remove those cohorts. The other purpose of that list was to ensure that groups were a certain size, and so, you know, squish together neighboring values. And that is something that would have to be done on a regular basis, we expect; the web is dynamic. We expect sensitivities to change over time, we expect the sizes to change over time, so in our minds that would be a thing done on a regular basis and distributed.
A
So I'm just going to jump in; there's an almost very similar question in the... sorry, in the Jabber chat, which is from Stephen Farrell: what is to stop topics being used to censor or imprison people? E.g. an LGBT topic in many countries is okay, but in some it could get you locked up.
R
So I think... geographic, I was sort of looking for that word. Different geographies have different sensitivities, for sure. I think that's important; we've gotten that feedback, and I think that needs to be taken into consideration, absolutely, in how we choose which topics would be included. It may be geography-specific, or we just remove a topic altogether if there's any chance of it being sensitive, like this one. And that's a hard thing to say, because I understand lots of things can be sensitive, but that one in particular is easy.
A
Thank you. And the next question, please.
T
Thank you very much. So if I understand this from a very high level, in terms of what's eventually going to be offered to the user: the user won't have a choice to say "don't be tracked." So my fundamental question is, what's going to happen to "do not track" checkboxes? Are they actually going to go away, and the user will be forced to pick between full tracking with no anonymity, or tracking only through this aggregation process?
R
So that's the reality of it. What we are offering is: if you would like to use the Privacy Sandbox, and all of its tools or individual APIs, you can enable or disable them. If you don't want to have the Privacy Sandbox, you can disable it as a whole. We will still offer blocking all third-party storage and cookies, so you can customize it to your needs.
T
All right, so you still allow users to put up the walls, sort of the blockades. Yep.
A
Okay, I think that's all the questions from the queue. There was quite a bit of discussion in the chat, and we're at time now, but if anybody wants to follow up with anything specific there, please do so on the list; hopefully Josh is able to answer some of those questions via the chat or the list, if you're able to.
A
Thanks for the presentation. So that's a wrap for us today. Thank you for joining, everybody, and enjoy the rest of IETF.