From YouTube: CI WG demo: Google Cloud Public Datasets Program
Date: 2/1/2019
Presenter: Shane Glass
Institution: Google
West Big Data Hub
A: Shane is a program manager at Google in the Cloud Developer Relations group, where he leads the Public Datasets Program, and that program, as I said, is about facilitating high-demand public datasets in order to make it easier for researchers to access them, uncover new insights, and do things that they can't do otherwise. That's really the point, I think. I've seen several examples of BigQuery in action, and it has a lot of very nice capabilities. Before joining Google, Shane was a project manager on NOAA's Big Data Project, and he currently serves as a public affairs officer in the US Army Reserve. He received his bachelor's degree from the University of South Carolina and a master's from the University of Maryland University College. I'm looking forward to hearing all the details here, so without further ado, please take it away, Shane.
B: Thanks, Nile. Yeah, I think you really summarized the program nicely: enabling people to do things that they otherwise wouldn't be able to do — or maybe would be able to do, but only in a way that's really compute- and labor-intensive and probably not worth doing otherwise, just because of the amount of work that goes into it.
B: You know, BigQuery has some really great capabilities and some really great potential in serving some of these workloads, especially for structured data, but it can also do some cool things with unstructured data as well. I think we'll cover that a little in the demonstrations at the end — that'll be a nice teaser to hopefully keep people sticking around until the end. We'll also do an overview of some of the public datasets we have in the program and, of course, the fun part.
B: At the end, the demonstrations. I was telling Carl before everyone joined that I have tried to keep Murphy from attending this presentation, as Murphy's Law of tech demos is a well-known phenomenon to me. So most of those are pre-recorded, and hopefully we will get through this as smoothly as possible — but I'm sure I've managed to break my last demo, which is supposed to be live.
B: So, let's start by just going over the program itself — but I want to start with the scientific process. Not the eight-step scientific process that we all learned in middle school or elementary school, but the scientific process as we know it today. This is roughly what it looks like — I don't think this is a crazy, off-base way of describing it: you have to discover the dataset and where to access it.
B: In my time at NOAA, one of the things we continually heard from users was that in order to use NOAA data, you have to know what data you're looking for and where to look for it. NOAA does a fantastic job of sharing the data they produce for their mission, but it can be difficult at times to discover it, because there's no central catalog for all NOAA data, and even most of the individual line offices don't have a centralized catalog.
B: But let's say you get past that point: you've discovered your dataset and where to access it. Now you need to be able to access it in a bulk, machine-readable format. Hopefully the OPEN Government Data Act, which was recently signed into law, will help with that — it does have a requirement for machine-readable data publication — but in the meantime, data can be difficult to find and to access, and you have to write a program to parse all of your data into whatever analysis tool you use.
B
That
is,
if
it's
bigquery,
if
it's,
if
it's
Excel,
which
is
a
perfectly
legitimate
analysis
tool
with
its
if
it's
tableau
whatever
it
might
be,
you
don't
have
to
load
the
data
into
some
database.
Using
this
program,
you
just
wrote
you
need
to
manage.
You
need
to
update.
You
need
to
maintain
you
need
to
secure
this
data
and
this
database.
B: You need to update your data regularly. Then you probably want to link your data or your research with a private dataset — something you've produced in your lab and are looking to do some kind of unique analysis on — and now you have to go back through this whole process to find these public datasets and these private datasets again and bring them into the same place, or go through the process twice. And then you need to analyze it, which to me is the fun part.
B
You
need
to
share
your
data
and
then
you
need
to
visualize
and
communicate
your
results.
That
I
like
needing
a
cup
of
coffee
after
describing
this
like
this
is
a
long
kind
of
complicated
process
and
I.
Think
that
part
of
our
goal
here
in
the
public
data
sets
program
is
to
alleviate
a
lot
of
this
burden
from
the
scientist
and
I
guess.
My
question
to
you
would
be
well
what
if
someone
else
did
these
steps?
B: What if someone else discovered the data and where to access it, kept it updated, kept it maintained, and kept it in an easy-to-link place for public data — and then, on the back end, helped you share your data? That would sound pretty nice, right? That would allow you to do the analysis and to visualize and communicate your results, which is, I think, the fun stuff — you get to the fun part.
B: That's really the crux of it: the purpose of the Public Datasets Program is to take the burden of these steps away from you. What we found is that we have hundreds of users who are all doing these same steps in parallel, all repeating the same process. So we in the Public Datasets Program take on this process of working directly with data providers, onboarding these datasets, bringing them in, and keeping them updated.
B: Keeping them maintained, making sure they're well described, that they link out to legitimate metadata, that they link out to docs and to source pages for further questions — and helping people share data, so that they can focus on their analysis and on visualizing and communicating their results. We think this lets you do a lot more science.
So
almost
a
year
ago,
I
previously
managed
the
NOAA
Big
Data
project
working
for
Ag
currents
over
at
NOAA
and
still
work
with
them
pretty
closely.
Today,
as
one
of
our
providers
and
my
background
is
in
in
data
and
analytics
and
and
the
reason
I
I
like
to
include
that
here,
is
that
you
know
data,
the
difficulties
of
finding
a
data
set
are
very
personal
to
me
and
net
when
I
was
working
on
my
Master's,
it
took
me
to
find
a
dataset
that
would
work
for
the
type
of
analysis.
B
I
was
doing
was
big
enough
to
help
me
learn
how
to
build
these
models,
and
that
would
give
me
kind
of
a
full-featured
model,
but
wasn't
so
deeply
complex.
That
I
had
to
be
a
PhD,
climatologist
or
I
had
to
be
an
expert
in
that
field
to
really
know
how
to
work
with
the
data.
So
you
know
this
kind
of
challenge
of
finding
and
helping
people
discover
public
data
and
and
helping
people
share
their
public.
It
is
is
something
that's
very
personal
to
me
in
my
background.
B: There's this concept that data is the new oil: oil in its raw form has value, but the real value comes after refinement — much like data. When you refine data into insight, that's really where the value comes from. So making it as easy as possible for all users to get right to the refinement step is, I think, the crux of the value proposition we're offering here.
B: We think this is done best by providing scalable, centralized processes, vetted best practices, cross-functional launch teams, and unified messaging to help bring everybody in — and by lowering these barriers to entry, whether that's not knowing how to work with NetCDF, GRIB, or HDF files, or a lack of subject matter expertise. We can't make someone a subject matter expert in a dataset, but we can explain to them, "hey, here's what the shorthand in that column name means," and that goes a long way.
Sort
our
current
catalog
has
about
115
public
datasets.
It's
a
pretty
ballpark
estimate,
we're
kind
of
onboarding,
more
all
the
time
and
so
I
think
last
I
looked
as
I
was
preparing
these
slides.
It's
115!
You
could
see
this
like
fun,
scatter,
shot
of
other
people's
logos
that
I
get
to
take
credit
for
on
these
slides.
B
B
B: So, more than a thousand tables are currently available in BigQuery for public use; I've got some high-level metrics here on the next slide. Of those 115 datasets, roughly 100 are in BigQuery and the other 15 are in Google Cloud Storage, with about 2,000 tables across them. We have 42 billion rows across those more than 2,000 tables, and that number grows every day as we continue to keep datasets updated.
B
Some
was
as
frequently
as
hourly
and
more
than
six
petabytes
of
data
in
Google
Cloud
storage.
So
this
is
primarily
things
like
satellite
imagery
of
looking
back
down
on
earth.
We're
we're
also
typing
some
really
exciting
discussions
with
astronomers,
with
with
with
kind
of
different
different
subsets
and
different
users
around
different
communities,
and
so
so
there's
a
great
variety
of
datasets
in
here
that
they
surprisingly
could
have
some
like
really
exciting
impacts
for
our
users.
B
It
probably
knows
a
good
bit,
if
not
Stephen
your
secret's
safe
with
me
and
and
feel
free
to
to
take
notes
here
and
no
one
will
ever
know
so
the
just
kind
of
a
brief
overview
you
know
so
bigquery
is
this:
serverless,
sequel,
implementation
and
I.
Think
server
list
is
kind
of
a
misuse
kind
of
misnomer
in
the
community
right
now.
It's
not
that
there
are
no
servers
behind
it.
It's
that
you
don't
have
to
manage
the
servers.
You
don't
have
to
add
servers.
B
If
you
need
to
get
more
capacity,
you
don't
have
to
turn
off
servers
if
you
have
excess
capacity.
It's
that
this
kind
of
scales
seamlessly
with
what
your
needs
are
and
that's
why
I
kind
of
say,
seamlessly
scale
without
me
without
the
manual
management.
You
know
if
you
need
to
do
a
multi,
petabyte
analysis.
You
can
do
that
big
query
and
if
you
need
to
do
a
five
kilobyte
analysis,
you
can
do
that
in
bigquery.
B
Both
will
be
fast
and
pretty
responsive
and
and
both
will
kind
of
scale
to
meet
your
demand,
while
only
charging
for
what
your
actual
usage
is.
So
it's
it's
a
really
nice
data
warehouse
and
access
layer
with
a
really
nice
clean
sequel
implementation
to
either
slice
datasets
like
really
huge
datasets
that
can
either
be
locally,
downloaded
or
fed
into
GCP.
We've
got
some
great
demos
on
that
later
of
kind
of
some
best
practices
of
doing
that.
That
I
think
will
be
really
helpful
for
the
community.
B
One
of
the
other
benefits
of
bigquery
is
that
it
allows
you
to
simply
share
the
data
you
want,
but
it's
secure
enough
to
protect
the
data
you
don't
so.
These
kind
of
user
defined
access
control
logs
and
the
high
availability
and
redundancy
of
bigquery,
combined
with
this,
like
very
cheap
storage
and
the
pay
for
what
you
use
model.
B
So
if
all
of
your
users
currently
have
a
copy
of
the
same
data
set
downloaded
and
on
their
personal
hard
drives
right
now,
and
you
don't
have
to
worry
about
potentially
a
hundreds
of
different
users
or
tens
of
different
users
or
keeping
the
same
data
set
updated
having
to
pay
for
storage
for
the
same
data
set
10
20
30
times
having
to
having
to
say
well.
Are
you
using
the
latest
version
of
this?
Well?
No
that
this
version,
you
never
download
it.
B
This
kind
of
single,
centralized
copy
updates
for
everybody
and
and
that
kind
of
allows
you
to
to
more
effectively
and
more
quickly
share.
The
latest
version
of
the
data
and
I've
touched
on
this
a
little
bit.
It's
a
pricing
model,
but
it
Abel's
you
to
focus
on
science,
so
really
afford
it's:
affordable,
storage,
pretty
pretty
intense,
free
tears
and
usage
face
pricing,
so
10
gigabytes
a
month,
free
storage
and
bigquery.
Your
first
terabyte
of
scanned
data
is
free
and
it's
$5
a
terabyte
beyond
that
so
I.
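(A quick worked example with the rates quoted here, which may have changed since: a project that scans 3 TB of public data in a month pays nothing for the first terabyte and $5 for each of the other two — $10 in query charges for that month.)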
B: My opinion is that you get this really great performance for really large datasets, and you get integration with a lot of other scientific tools — whether it's Kaggle, whether it's pandas in Python, R, or Tableau, whether it's joining with other datasets that are either public or private — while letting you focus on the science and not have to worry about the infrastructure on the back end. In fact, I think just yesterday we rolled out a really exciting update for BigQuery: the BigQuery sandbox.
B
That
kind
of
locks
you
into
this
free
tier
bigquery
and
helps
you
be
sure
that
you
know
it's.
It's
designed
to
help
users
get
started
and
become
familiar
with
it
in
a
really
fast
easy
to
sign
up
way.
It
doesn't
require
you
to
put
down
a
credit
card,
so
it
keeps
you
in
this
free
tier
of
storage
and
of
queries.
I
mean
it
does
have
some
some
other
limitations
behind
it.
B
I
think
tables
are
only
persistent
for
kind
of
30
or
45
to
60
days,
and
the
number
of
top
I
had,
but
a
really
really
nice
tool
that
you
know
my
my
kind
of
former
colleagues
in
the
federal
government
have
told
me
is,
if
something's
really
exciting
for
them
not
having
to
put
down
a
government
Purchase
card
for
something
that
you
know
is
is
not
necessarily
a
set
price
or
a
the
same
skin
system
price.
Every
month,
we've
got
some
really
great
case
studies
available
on
our
website
that
you
know
you're
all
very
intelligent
people.
B
You
certainly
don't
need
me
to
read
them
to
you
on
kind
of
the
usage
of
science
data
for
genomics
for
chemistry,
for
climatology,
meteorology
economics,
patents
kind
of
so
a
lot
of
great
and
stuff
content
in
there.
I
mean
encourage
you
to
check
some
of
that
out.
That's
of
interest
to
you,
okay,
so
that's
I,
think
a
pretty
good
overview
of
what
bigquery
is
now
how
bigquery,
how
can
bigquery
support
open
data
use
I've
kind
of
grayed
out
the
business
one
on
here?
B
We
use
this
slide
to
talk
some
pretty
generalized
audiences
right,
I,
don't
know
that
this
is
particularly
applicable
to
this
there's
audience
I'm
more
than
happy
to
ungrate
if
I'm
wrong,
but
I
think
that
you
know
we
can
we
work
really
well
with
the
public
data
side.
Specifically
works
really
well
with
with
researchers
in
joining
public
data
from
multiple
sources,
with
with
each
other
or
with
kind
of
private
internal
data,
to
conduct
your
analysis
and
with
data
providers
we
talked
earlier
about.
B
You
know
providing
that
one
simple
copy,
providing
read
access
out
to
users
and
allowing
them
to
kind
of
have
this
really
fast,
really
simple
access
without
having
to
scale
the
oncome
services.
What's
really
nice
about
this,
the
challenge
we
found
when
I
was
at
NOAA
was
that
the
more
popular
a
data
set
was
the
harder
it
was
for
NOAA
to
share
that
data
set
the
reason
for
that
being.
If
one
user
wanted
to
copy
that
data
set,
let's
say
they
wanted
a
copy
of
a
one.
B
Terabyte
data
set,
no
one
had
to
send
out
one
terabyte
worth
of
data
to
that
user,
but
if
a
thousand
users
wanted
to
all
use
that
same
one,
terabyte
data
set
NOAA
then
had
to
send
out
a
thousand
copies,
which
ends
up
being
a
thousand
terabytes
of
that
same
data
set.
What's
nice
about
bigquery.
Is
that
when
you
make
this
available
and
read-only,
you
don't
have
to
worry
about
scaling
the
bandwidth
to
meet
that
thousand
person
demand
the
user.
Pay
is
for
the
query
costs
that
they
incur
so
the
project.
B
That's
the
billing
account
behind
the
project
that
accesses
your
data
is
charged
for
that
query.
So,
if
you
make
your
data
public
and
someone
you've
never
heard
of
before,
comes
in
and
queries
it,
you
don't
have
to
pay
for
their
access.
You've
only
paid
for
the
storage
and
they're
paying
for
their
access
in
their
analysis,
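(To make that split concrete, a rough illustration using list prices from around the time of this talk as an assumption: the owner of a 1 TB public table pays only for storage — at roughly $0.02 per GB per month, about $20 a month — while a researcher who scans the full table once pays about $5 for that query, or nothing if it fits within their free first terabyte for the month.)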
B: So, I just want to give a high-level overview of the public datasets we have in the program. If anybody sees their favorite dataset on here,
let me know! That's a line that only works with audiences that are really passionate about seeing their favorite dataset — which I love, because that's this group. And if you see your favorite dataset missing, shoot me a note: I'd love to talk and learn more about what the use cases are and how we might be able to support them. This is a generic snapshot of some of the datasets we have, out of our catalog; you can see the link there in the bottom right.
B
We
call
this
GCP
marketplace.
We
host
whole
ton
of
solutions
in
there
from
pre-built
virtual
machines
that
have
now
all
these
different
tools
built
into
that,
but
this
is
also
a
restore
our
lessening
of
the
public
data
sets
you
can
see.
This
is
a
really
diverse,
offering
here
from
from
Nitsa
traffic
fatality
data
when
we
were
supporting
them
on
their
solving
for
safety
challenge,
you
can
see
NEXRAD
level,
two
kind
of
radar
imagery
from
NOAA.
You
can
see
blockchain
data
from
some
of
the
most
popular
crypto
currencies
in
the
world
sunroofs.
B
So
you
know
it's
kind
of
solar
data
or
for
how
much?
How
about
some?
We
expect
your
your
your
roof
to
see
at
a
given.
You
know
kind
of
date
and
time
in
the
year,
and
we
even
have
Major
League
Baseball
pitch
by
pitch
data,
and
so
a
you
know
a
offering
of
pretty
diverse
data
sets
so
we're
pretty
excited
about,
and
in
particular,
or
we
find
that
our
weather
and
climate
data
set
so
really
popular
and
the
reason
for
that
being
that
users
understand
what
temperature
means.
B
I
can
promise
you
that
everyone
in
Pittsburgh
knows
what
a
zero
degree
temperature
means.
After
this
week
and
I
can
really
promise
you
that
everyone
jicama
knows
what
that
means,
assuming
they
got
as
warm
as
zero
degrees.
So
you
know
what
we
find
is
that
users
from
a
really
broad
range
of
industries,
from
retail,
from
from
hospitality,
all
the
way
over
to
climatology
and
meteorology,
find
use
cases
for
these
data
sets.
B
In
fact,
weather
and
climate
data
sets
from
NOAA
were
three
of
the
ten
most
heavily
used
data
sets,
and
we
measure
that
by
the
terabytes
of
data
scan
I'm,
estimating
that
those
data
sets
those
three
data
sets
or
maybe
a
couple
gigabytes
in
size,
maybe
10.
If
I'm
really
rounding
up,
we
saw
more
than
a
petabyte
of
data
scan
out
of
those
data
sets.
You
know
from
from
my
experience
or
know
we
found
that
that's
anywhere
between
you
know,
kind
of
30
and
300
times
more
data
sort
of
out
of
here.
B
Then
the
NOAA
serves
out
of
there.
It's
it's
tough
to
do.
The
kind
of
one-to-one
comparison
is
they're,
just
very
different
systems,
but
you
know
it's
they're,
really
popular
more.
You
know
we're
finding
and
NOAA's
finding
that
you
know
we're
helping
amplify
kind
of
that
accessibility
for
users
without
the
taxpayer
or
the
researcher
having
to
pay
for
that.
Whether
in
climate
data
sets
were
we're
also
our
most
popular
data
set
in
terms
of
the
average
daily
users
and
we're
two
of
our
six
most
frequently
used
data
sets.
B
So
it's
daily
average
average
queries
per
day,
yeah,
there's
a
little
bit,
I'll
freely
admit,
there's
a
little
bit
of
cherry-picking
going
out
in
the
sample
size.
Right,
like
you
know,
it's
it's
two
of
six
of
our
most
frequently
as
datasets.
It's
also
two
of
seven
or
else
I
would
have
said
three
of
seven,
but
I
think
that
this
really
clearly
illustrates
that
you
know
we're
we're.
Helping
NOAA
meet
this
kind
of
level
of
demand
that
existed
long
before
we
kind
of
started
working
with
that.
C: Perfect, thank you. So, my name is Florence Hudson. I work with all the hubs, and I also work for the NSF Cybersecurity Center of Excellence at Indiana University, where I lead a program called TTP, which is cybersecurity research transition to practice. As I'm working with cybersecurity researchers, some of them are saying — one of them in particular, at RIT, the Rochester Institute of Technology, said — "I need intrusion-alert data to test my machine learning algorithms for cybersecurity." So I'm kind of on the hunt for datasets like that.
C: Or, when I work with the smart grid folks, everybody wants PMU data — synchrophasor data — and a lot of this stuff is very confidential: DoD this, DoD that. So it's not readily available. But do you have, or expect you might have, datasets like this that are non-confidential, that can be public datasets? I'm thinking I would send this link — I was just looking at it, and I saw there were some datasets there.
C: There was one — this probably isn't what I need, but there's a "VM-Series next-gen firewall bundle," as an example. It says the word "firewall," and I go: maybe there's some security data. So I'm thinking that maybe I would send this to the researchers who are asking me for datasets and say, "do you see anything that might be useful?" Or, if you don't have anything: what do you think — is that a good idea or a bad idea?
B: I think one of the things we're seeing is that a really popular use case for public datasets is training machine learning models in general, because you just need a lot of data to do that, and most people don't have a lot of disk space just lying around to test ideas with. I think that's a really awesome use case for public datasets. My email is on the first slide — I'll be happy to go back to it later — and please do write.
C: And I've got to share this with you, because the other thing I found for one of these researchers is that there's actually a conference called CAMLIS, the Conference on Applied Machine Learning for Information Security — and, as you know, if machines are learning, there's data involved. So it's on my list to reach out to more of the people who present there. We could create something rather interesting that could really support some of this AI and machine learning for cybersecurity research going on.
B: And so these are some of our very recently added unstructured datasets, and they're all datasets I'm really excited about, despite the fact that I clearly forgot to replace the grainy National Water Model picture over here on the right with something a little less grainy. What we're seeing over here on the far left is the HRRR.
B: That's the High-Resolution Rapid Refresh model. I'm going to butcher the details here, but I'll try my best: I believe it's a two-and-a-half-kilometer-resolution, high-frequency weather model over the United States. We just started bringing this on, and I think there are some really cool use cases for it in the private weather enterprise, but of course in the research space as well. And you see this really nice, beautiful picture of planet Earth in the middle.
B
This
is
from
NOAA's
goes.
Seventeen
satellite
in
cooperation
with
NASA
NOAA
launched,
goes
sixteen
and
seventeen
within
the
last
few
years
to
replace
kind
of
their
previous
generation
of
geo
orbiting
satellites.
So
it
goes.
Sixteen
now
sits
over
the
eastern
half
of
the
u.s.
kind
of
focused
on
you
know
the
eastern
coast
in
the
Atlantic
Basin.
Whereas
go
seventeenth
now
over
the
western
half
of
the
US,
we
have
both
go.
B
Sixteen
and
seventeen
data
on
our
platform,
including
the
geostationary
Lightning
mapper,
which
is
really
really
cool,
I,
gives
you
you
know
kind
of
detailed,
lightning
strike
information.
That
is
the
first
time
kind
of
this
kind
of
instruments
flown
on
a
geostationary
satellite
on
the
far
right,
as
I
alluded
to
earlier,
is
representation
in
the
national
water
model.
This
is
two
and
a
half
million
points
of
kind
of
continuous
stream
flow,
soil,
moisture,
snowpack
data
over
the
continental
United
States
as
well.
B
So
these
are
all
of
the
we've
talked
a
lot
about
bigquery
and
we'll
talk
a
lot
about
how
we
can
use
bigquery
to
work
with
some
of
these
datasets
in
a
minute.
But
these
are
some
of
the
datasets
that
that
we're
hosting
in
Google
Cloud
storage
that
have
some
really
really
great
scientific
applications
as
well.
B: Okay, so you've listened to me going on for quite a long time — I appreciate that. If you've taken your headset off and walked away to do something more productive, I can't tell the difference, and you still get credit for it. But I think this is the fun part: I've got a handful of demonstrations here, and we'll go from what I think is easiest to most challenging, or most advanced.
B: Don't worry — you're all going to leave as experts in this, and you'll be able to wow all your friends, colleagues, family, and maybe even people you don't like. So let's start with SQL. I'm assuming that most people here, if not everyone on the call, are familiar with SQL, but if not, I want to start from a baseline just to make sure we're all on the same page about what I mean. SQL has been around since 1976, I think.
B: One of my coworkers is going to give me a hard time if I got that wrong. The real basic premise of SQL is the SELECT ... FROM statement: SELECT names the columns — the variables — you want, and FROM names the table you're selecting those variables from. That's the basics of what you need, in SQL and in BigQuery, to access and start working with data.
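As a minimal sketch of that shape — the table is one of the real BigQuery public datasets used later in this talk, but treat the column names as assumptions and check the schema in the UI:

    -- Minimal SELECT ... FROM: pick two columns from a public table.
    -- (Column names are assumed from the table's published schema.)
    SELECT
      species,
      plant_date
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    LIMIT 10;  -- keeps the result set small for a first look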
B: Okay, so you've mastered the SELECT ... FROM statement and you want to do something a little more advanced. You say: "this is really great, but what about data that I want to meet a certain condition?" That's where you would add a WHERE statement, which says: give me rows where these columns from your SELECT statement meet these given parameters. So you've stepped up to the WHERE statement and you're able to say, "I want these data, but only within a certain time period" — but boy, I'd love to see them in a certain order.
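Continuing the sketch — a WHERE filter plus the ORDER BY he's about to introduce (same assumed schema):

    -- Filter rows with WHERE, then sort them with ORDER BY.
    SELECT
      species,
      plant_date
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    WHERE
      plant_date >= TIMESTAMP '2015-01-01'  -- only newer plantings
    ORDER BY
      plant_date DESC;                      -- most recent first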
B: That's ORDER BY. Then there's aggregation: what would you be averaging, and how are you grouping the results? And let's say you want to get really fancy and add a JOIN statement. This is where you pull in data from two or three different tables and bring them all together on a common key — some kind of common identifier that's unique for each row — and you can still use all of these other statements that we talked about before.
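Those two steps might look like this — the aggregate runs against the same assumed table, and the JOIN skeleton uses placeholder tables rather than any specific dataset from the talk:

    -- Aggregate with GROUP BY: count and average per species.
    SELECT
      species,
      COUNT(*) AS tree_count,
      AVG(dbh) AS avg_diameter  -- dbh: trunk diameter (assumed column)
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    GROUP BY
      species;

    -- JOIN skeleton: combine two tables on a common key.
    -- (table_a and table_b are placeholders, not real public datasets.)
    SELECT
      a.id,
      a.measurement,
      b.label
    FROM table_a AS a
    JOIN table_b AS b
      ON a.id = b.id;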
B: Okay, so with that out of the way, we now have a clean baseline of SQL, and the first thing we're going to do is figure out what the most famous trees in San Francisco are — because I know everyone woke up this morning thinking, "you know what's really been bugging me? What are the most famous trees in San Francisco?" Good news: we have an answer for you. How do we do that?
B
B: You can click on the link, it'll load this query for you, you can run it, and you can learn what the most famous trees in San Francisco are. That's one of the things we've done in Marketplace to help people get started: providing these sample queries and letting users click through and run them. The idea is that it gets you a good start on the dataset — it gives you a place to start, and you can play out from there.
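A query in the spirit of that sample — "famous" here meaning landmark status, with the legal_status value being an assumption about this table's labels:

    -- San Francisco's landmark ("famous") street trees.
    SELECT
      species,
      address
    FROM
      `bigquery-public-data.san_francisco.street_trees`
    WHERE
      legal_status = 'Landmark tree'  -- assumed label for landmark status
    ORDER BY
      species;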
B: So, okay, what's next? That was the simplest version of this, so here's something I think is a little more advanced. What we're doing here is taking hurricane data from NOAA's IBTrACS, which is essentially the international community coming together and agreeing that this cyclone — this hurricane, this typhoon — was in this given place at this given time. Apparently that's shockingly difficult relative to how easy it sounds.
sounds.
B
The
query
was
actually
cashed
ahead
of
time.
That's
why
it
only
took
you
know
three
hundredths
of
a
second
to
run.
It
I
had
recently
loaded
it,
but
I
think
one
of
the
cool
things
that
that
bigquery
can
do
is
connecting
out
to
data
studio.
This
is
Google's
data
visualization
platform,
and
it
allows
you
to
kind
of
connect
out
and
visualize
this
data
very
quickly.
What
I'm
doing
here
is
I'm
just
going
in
and
I'm.
instead of leaving the hurricane center point as text, we're changing it to a latitude/longitude so that BigQuery interprets the data properly. Then, with a few clicks, I dragged in this new observation and we visualized the path of the hurricane very quickly. You can see the path the hurricane took, and the shading is done by distance to land at each point.
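A sketch of the kind of query behind that map, assuming the NOAA hurricanes public table and its column names (latitude, longitude, dist2land, and the season encoding are assumptions — check the schema before running):

    -- One row per observation for a single storm, ready for mapping.
    SELECT
      iso_time,
      latitude,
      longitude,
      dist2land  -- distance to land, used for the shading in the demo
    FROM
      `bigquery-public-data.noaa_hurricanes.hurricanes`
    WHERE
      name = 'MARIA'
      AND season = '2017'
    ORDER BY
      iso_time;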
B: You can drill down and see some of these individual points — there's a lot of overlap here — and see the individual parameters for a given point you might want to look at. So I think this is a really cool way to go from having data to having a visualization really quickly. And the nice thing about Data Studio is that it's entirely free, it's super easy to share, and it's entirely web-based.
B: All you have to do is send someone a link to your visualization; once they have the link, they'll be able to view and interact with it without having to download any additional software or tools. I love Data Studio — I don't say that because I work for Google, though it doesn't hurt — I love it mostly because I am a self-described data visualization enthusiast.
B: I even have a favorite data visualization professor, but let's move on from that before I embarrass myself more. Okay, so I think that was the intermediate step: we found the dataset we were looking for, we subsetted it to find just the data we want, and then we went out and visualized it — and that's a really important part of the scientific process, especially for those of us, like myself, who are really visual people. But what if I want to do something even more advanced than that?
B: So let's talk about the GOES-16 data — the GCS datasets, the unstructured data in Google Cloud — and how you would discover those datasets. I'm going to go out to a random search browser — I'll just pick Google; I don't know why I landed there — and you can see that one of the first search results is the Marketplace page we've talked a little bit about. Here we'll load the Marketplace page for GOES-16.
B: You can see in here we have a description of the dataset and some links out to related data and to some of the tools to work with it, and if you click on this link down here, it'll take you to the bucket that has the raw data in it — the raw NetCDF files as they're produced by NOAA. Which is great, except if you don't know exactly which dataset you want to use, or exactly what subset of the data you want to look for.
B: If you have the file naming convention memorized, and some of these other very particular things memorized, and you're willing to search through the bucket — by all means, don't let me stop you. But I think there's an easier way to do it. So if we go back — I'll go back just a little bit here — you can see, as we go through the Marketplace page,
I'll click on the big blue button at the top, because that's what we want you to click on — that's why I put it there — and you can see we're loading a metadata index of the GOES-16 data. What we've done is parse out the metadata from the self-describing NetCDF files and put it into BigQuery, to let you search and dig through it. You can see a preview of the data here, and one of the exciting things — you can see it at the very end —
is that it gives you the link to each file as it exists on Google Cloud. So, much like we did with the Hurricane Maria data, you can subset by the points that make up the corners of each image's bounding box, you can subset by time, you can subset by the channel of the dataset, and it'll give you a list of the files that match that result.
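A sketch of what that metadata query can look like. The table and column names here are assumptions about the GOES-16 metadata index described above — the real index may differ, so inspect its schema in the BigQuery UI first:

    -- Find Level-1b radiance files for one hour of GOES-16 data.
    SELECT
      dataset_name,  -- the NetCDF file name
      base_url       -- gs:// path to the file in Cloud Storage
    FROM
      `bigquery-public-data.noaa_goes16.abi_l1b_radiance`
    WHERE
      time_coverage_start BETWEEN '2017-09-20T00:00:00Z'
                              AND '2017-09-20T01:00:00Z'
    ORDER BY
      time_coverage_start;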
B: So you can go in, subset this data in BigQuery, grab these file names, and begin to create these really cool images. The most popular way I've seen people do this — and I think the way we've done it for a very long time — is to take the data and manually download it to your local computer. And if it ain't broke, don't fix it, right? But I think it might be broke, and here's why.
B: We've gone in, and I'm going to subset the data down to one hour's worth of files from GOES-16, just the Level-1b products. As I scroll down here, you can see — again, I cheated; it was cached, which is why the results came back so fast — you get the individual file names. What you're not seeing is that about a hundred files come up in this result, and that's just one hour's worth of data.
B
So
if
you
wanted
to
visualize
all
of
Hurricane
Marija,
for
instance,
you're
looking
at
days
worth
of
data,
yes
over
given
points,
but
you
know
you're
looking
for
much
a
much
higher
level,
a
much
higher
volume
of
data,
and
so
we
could
take
this
out.
We
could
go
to
our
terminal
and
we
could
use
the
GS
util
SDK
to
download
these
data
directly
to
our
to
our
computer
and
it
would
work
you
would
get
the
data
on
there.
B: But if you're working from home like I am today, because I don't like being cold, then you have to compete with your family members streaming Netflix or Hulu, or with your internet service provider maybe not having a great day, and it can take a while. So that's one way to do it, but I think there's a faster way, and that is manually downloading the data into a VM.
B: This works, and it's a very similar process. We'll go and subset the data again — again, I cheated; it was cached ahead of time, but this result comes back in a few seconds — we can copy the file names out, and we can pull them into a VM in Google Cloud. I'd click over to my console, to the part of the web UI where I can launch a new VM, and I'd create an instance; I really just need the basic parameters.
B: I'm going to enable some of these APIs to allow me to download this data, set a few other settings, and create a VM. Now, the cooking-show magic behind the screen is that it takes about a minute to spin up — to create — one of these VMs from thin air, but no one wants to watch that.
B
You
know
if
you're
watching
a
Martha,
Stewart
cooking
show
no
one
wants
to
watch
the
pie
sitting
in
the
oven
for
an
hour
and
that's
why
Martha
Stewart
magically
pulls
out
another
copy
of
the
another
pie
from
underneath
the
counter.
So
if
you
see
here,
you'll
see
that
I
magically
pull
out
my
VM
from
underneath
the
counter
so
about
a
minute
later,
I
can
come
in
and
I
would
use
the
exact
same
command.
I
used
on
my
terminal,
the
benefit
here
being
that
the
cloud
SDK
is
already
preloaded
on
all
these
VMs
and
I.
B: Well, what if I told you there was an even faster way to do it? Wouldn't it be nice if we could automatically create a bunch of VMs, run some code, and then have them shut themselves down? And wouldn't it be nice if each of these steps auto-scaled itself, much like BigQuery does — pulling in more resources when you need them and turning those resources off
when you don't? We've actually already looked at a lot of these steps, and I have taken credit for them — please don't tell Lak; well, he does know, but it doesn't hurt to not remind him. So we have these steps in here, and he goes through them: if you wanted to plot a step-by-step image of Hurricane Maria as it was captured by GOES-16, you could do that in Python using the pyresample package, and he includes the code on GitHub.
B: But what if, instead of having to spin up the VM manually — he talks about this a little bit in the blog post — and instead of having to preconfigure these VMs, every time a new image came in you could automatically stand up the VMs you need to process it and add to the image, instead of taking the time to create one JPEG at a time and string them together?
B: What if this all happened in the background? That's the benefit of Cloud Dataflow: it lets you connect to these datasets and use them in a more repeatable process. I don't have time, unfortunately, to go through all of this — I thought you'd probably enjoy the other demos more, and in retrospect I may have been wrong — but the code's available on GitHub.
B: If you could go from having to run all those steps to just doing your analysis and visualizing it, and let us take care of the rest on the back end, I think that helps everyone get where they want to be and get to the fun part of this. As I mentioned earlier, I'll share my contact information by email with Carl, and I'll be sure to include a link to this blog post and to the GitHub page as well.
B: It's linked throughout the post, but I think it'll just be easier to have it right in front of you. So with that, that's all I have. We've got about ten minutes left, and I'm more than happy to take some questions — and if we run out of time, of course, I'm also more than happy to take those over email.
A: We've got a couple of the executive directors on the line — I saw Melissa Cragin is on, and I believe Meredith Lee — so Midwest hub and West hub, perhaps they'd like to chime in first before the other folks jump in.

C: Sure, thanks, Leah. This is Meredith. I can definitely say that we have already benefited from partnerships and collaborations with Google Cloud and all the different public datasets that we worked together to put on that nice snapshot Shane showed — from the Department of Transportation, for example.
C: Amy Unruh from Google in Seattle actually flew out to a hackathon in the Midwest and was mentoring and serving as a topical chair for some of those efforts, so it's been really great so far. And actually, to build upon Steve's question and look at future collaborations, I wanted to ask Shane and any other Google folks on the line about that snapshot of all the logos that you showed at the very beginning — and apologies...
B: So I think we focused initially on these national-level datasets, one, because they tend to be really broadly useful — they tend to give you at least a good map of the entire country. We've onboarded a handful of city datasets, but frankly we've found that we're hit-or-miss on some of the usage for those, and we don't have infinite resources, so we tend to focus our efforts where we can have the most positive impact.
B
You
know
that
being
said,
you
know
we
recognize
that
they're
also
gaps
there
too,
and
and
we'd
love
to
work
with
you
guys
to
help
fill
some
of
those
in
and
and
I'd
love
to
hear,
obviously,
in
a
longer
forum,
then
the
next
few
minutes
kind
of
what
that
looks
like
and
how
we
can
help
work
with
you
guys
on
that
I
I'm
really
appreciate
you
calling
out
Amy,
because
she's
awesome,
I
love
working
with
Amy.
She
does
really
a
really
amazing
stuff
and
I
almost
forgot
to
mention
her.
C: We're looking forward to phase two, and given your self-professed enthusiasm for data visualization, I think it's a great match moving forward. And the sandbox is super exciting — I think that's going to go a long way in showing that early value proposition for some of the city, regional, and federal connections.
C: This is Steve. I can't so much comment on the regional datasets, but I can give you a little glimpse of what I've been doing, and that's been working on chemical and biological data. I've been working to get PubChem, from NIH, loaded; and through the EPA, the Environmental Protection Agency — they have what's called the DSSTox database, part of the ACToR datasets — we're getting all the toxicology data associated with molecular content.
C
I've
we've
already
got
the
Kemble
database
from
the
EBI
european
institute
of
bioinformatics.
Of
course,
we
have
all
the
short
Kemble
data
available,
which
is
the
molecular
content
from
patents.
So,
basically
most
of
the
molecules
that
have
been
patented
and
we
also
have
I'm
working
on.
We
have
the
orange
book
data
from
the
FDA
and
I'm
working
with
the
genus
to
try
to
get
that
updated,
which
would
be
like
an
international
or
global
dictionary.
C
You
might
say,
or
encyclopedia
of
drugs
available
globally
and
we're
also
getting
all
the
G
wast
data,
which
is
the
genetic
data.
So
we're
working
to
get
all
that
data
loaded
in
as
well.
We
hope
to
have
a
announcement
in
August
and
at
that
time,
we'll
be
presenting
a
lot
of
the
scientific
data,
sets
that
we're
focused
but
I've
personally
been
focused
on
working
with
partners
and
as
well
as
some
new
tools
such
as
crime.
A: Sounds good. So, I know we're reaching the top of the hour, so I just want to let everybody who needs to run off to the next meeting go ahead and do that. But I wanted to thank Shane and the Google folks for joining us today, and I'll stick around here if others want to have discussions about the path forward or other pieces. The only other thing I wanted to say is that we are swapping around the order, I think.
B: Well, thank you. I really appreciate the opportunity to speak today. I actually am one of those people who has to run to the next meeting, but like I said, I'm happy to share my contact information with Carl, so please feel free to reach out to me. I'd love to continue the conversation.