Description
Jennifer Marsman, a software developer and technical evangelist, discusses how Microsoft supports the Hadoop community and their deployment as part of Azure Data Lake.
About CodeConf
CodeConf improves the software community by providing a forum for thought-provoking talks and forging social connections. The third installment of the CodeConf series took place in Nashville in 2015. Attendees came together to discuss open source, best practices, documentation, and community.
For more information on this year's CodeConf, go to:
https://codeconf.com/
My name is Jennifer Marsman, and I have been with Microsoft for about 13 years now, so I am essentially a grandma at Microsoft. One of the things that I have noticed over my time at Microsoft is that there is a real culture change happening. Especially in the last two years, I think we're really seeing the battleship turning in a new direction, especially in regard to open source software.
So, some of the openness that we've got going with our cloud, which is specifically Microsoft Azure: Azure has been a very open platform from the beginning. When it was first launched, there was support to put up websites in this cloud using not only ASP.NET, like a good little .NET developer would, but also PHP and Ruby and Python and all these other languages as well. We have Node.js support.
We have all kinds of good stuff. Then there's also the capability to put operating systems up, so you can run whole virtual machines in our cloud. We give people the option to run several Windows operating systems, and there's a whole lot of different Windows options, but we've also supported Linux from the very beginning, and there are a number of different distributions we support. We support Ubuntu, we support CoreOS, OpenLogic, SUSE. So there are a number of different options up there and available, and different versions of each of these options.
So again, it's been doing that from the very beginning. Then there's also something called Azure Mobile Services, which is specifically a kind of back end as a service for when you want to do mobile development. From the beginning, we've supported not only Windows Phone but also Android and iOS devices. So there are APIs you can call, and code you can write in Objective-C, to call into Azure Mobile Services, and what that gives you is all the things that you typically need when you're doing phone development.
It gives you auth capabilities, and it gives you push notifications. Again, that's not just through our Windows push notification framework; we also integrate with Google Cloud Messaging, which is how you do push on Android, and with Apple's push notification service. So Azure has been very inclusive and open from the get-go, and we continue to add new things in there. One of the newer things we have added is support for big data.
In the big data world there is something called Hadoop. It is an Apache project, Apache Hadoop, and currently it is what the industry uses for big data. So for the rest of the talk, I'm going to focus on what we've done in our path and cultural journey, specifically with Hadoop. All right, so I mentioned Hadoop is used for big data. Now, just to level set.
Let's make sure we're all on the same page about what big data is. I know some people, when they're new to it, are kind of like, "Okay, where's the cutoff? Is it just data, and then that plus one is big data? How does that work?" When we're talking about big data, typically we're talking about the three V's of big data, and the first of those is data volume.
So when you're working at a certain quantity of data, that's considered big data. Obviously, if you have petabytes, exabytes, or even terabytes, just massive amounts of data, that's a lot of data to play with, and those kinds of things are considered a big data problem.
[...] like which fields to plant and those sorts of things. There's also a really amazing connected cow story that I really want to tell, but there's not time. I think I've already told some people at the conference, but go to YouTube and look up "connected cow." There's a guy, Joseph Sirosh; I think there's a video clip of him presenting it at Strata in New York, and it's awesome. It's amazing stuff. So volume is the first thing. The second is data velocity.
What I mean by that is when you have a problem where data is streaming in in real time, and you want to be processing it in real time or near real time. Those scenarios are also a big data problem. The scenario I always think of for this is in a hospital. I think it was about five years ago, I attended, fairly close to here, the Kentucky Celebration of Women in Computing, and one of the women that spoke there was a PhD.
I forget if she was in academia; there's a lot of healthcare stuff in Kentucky. But she told us how you can actually predict when a heart attack is going to happen in a hospital if the right sensors are hooked up. Think about when you go in the hospital: they put all those sensors on you, and they're measuring your blood pressure and your heart rate and millions of other things. So they actually have the knowledge to detect when cardiac failure is going to occur, but we weren't doing it because we couldn't keep up with the processing. That just infuriated me, the fact that we had the ability to potentially save lives. We could predict that people were going to have heart attacks, but we weren't doing it, because it was just too much processing overhead to hook up every person who came in to these types of things and then do all that monitoring and processing. So that's another thing that got me really, really passionate about big data five years ago, whenever this happened.
The last V is variety. It's also sometimes called versatility and some other V's, but essentially what I mean by this one is that you have data from lots of disparate sources. A scenario where you might want to do something like this is March Madness.
Let's say we want to write a machine learning algorithm. We're going to up our chances in the pool and do some machine learning to try to figure out who's going to win the March Madness tournament or whatever. Okay, so now my lack of basketball knowledge is going to start to show here, but we probably want a whole bunch of different sources of data for that.
Let me make it more generic, because my basketball knowledge is going to fail me here. Let's say football or something. So look at a football scenario. Since it's played outdoors, you might want to grab weather data, right? Because when a team from the South comes up to Michigan, where I live, and tries to play in the snow, you'd hope they're going to have some problems. So maybe weather is a factor in how well people play. Maybe things like injuries.
You want injury reports, so grab that too. You need individual stats. You need the team collectively, all of those different stats. Of course, the features that would go in differ depending on what sport you're looking at, but you know, all the stats that everyone tracks like crazy when you're trying to pick out your fantasy team, all of those things need to play in. Then another thing that might be an interesting factor is just raw emotion.
I know when Michigan plays Michigan State, and Michigan State won the previous year, Michigan comes back with a vengeance, right? They are mad. They want to win this year. So things like that. There are all these different factors that we might want to draw in and use together, and so that can also be considered a big data problem.
So all three of these V's, either by themselves or in combination, are what make up the big data realm, and this is the big data thing that we want to conquer. At this point, big data is out there. People are doing it; people are using it. So we're faced with a strategic decision. Microsoft wants to be involved with this. We want to be able to provide big data solutions to our customers. So what do we do? What do we do about big data?
So Microsoft is... oh, I got a few laughs. I did not think I would. I was like, Sex and the City, I'm not sure that that will fly with this target audience, but I'll give it a try. So, okay, we got like one laugh. That, by the way, is Matt Winkler, who is the head of the HDInsight team, which is the Hadoop-running-on-Azure team.
But in this scenario, Microsoft does have experience with big data. We have done stuff like this in the past. There's Bing: Bing runs one of the largest data centers in the world, where we're keeping, essentially, our copy of the internet for doing search engine stuff. We also have Azure, of course, and in Azure we've been monitoring the health of machines and bringing them up automatically when they fail, and all that stuff, for a long time.
We have all the telemetry data that comes in from Office and Windows, and that has a lot of users, so we've been managing that for a while. Then there are things like Xbox Live, oh my gosh, all of the Xbox Live users out there, and just handling that. So we did have experience in the space of big data. Because we had this expertise already, it really became a question of build versus buy, right?
We had the expertise, so we could potentially build something ourselves, or we could choose to buy something or adopt a solution that's already out there in the open source world. I'm really excited that we did actually make the decision of "let's use what the industry is already using." I think that speaks to some of the culture change that we're seeing at Microsoft, because ten years ago, I don't know if that would've been the decision. So we decided to go ahead and adopt Hadoop.
Now, Hadoop is an Apache project. It is open sourced in that manner. Essentially, what Hadoop is is distributed processing. You spin up a big cluster of machines, and you have a main controller, and it uses the MapReduce pattern: it farms out a whole lot of work to all these different worker nodes, and then reduces down and gives you some output. I'm way oversimplifying things.
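
As a rough illustration of the MapReduce pattern described in the talk, here is a minimal single-process Python sketch. It is not Hadoop code and uses no Hadoop API; the word-count task, the sample input, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are all assumptions chosen for illustration. In a real cluster, each chunk would be farmed out to a different worker node rather than processed in a loop.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk (sequentially here; a cluster would send
    each chunk to a worker node), then shuffle and reduce."""
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: each string stands in for one block of a large file.
    chunks = ["big data is big", "data is everywhere"]
    print(word_count(chunks))
```

The point of splitting the work this way is that the map step touches each chunk independently, so it can run on many machines at once, and only the grouped intermediate results need to be brought together for the reduce step.
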
If you want to talk about it in more depth, and talk about some of the other pieces of the Hadoop ecosystem, because there's a lot of other stuff in there (a NoSQL database, which is HBase, and Hive, and real-time processing, and all this other stuff), come find me later, because I'm happy to go geek out a little bit more on this.
So this Hadoop infrastructure was running alongside, at the same level, at Facebook scale, with best-of-breed things like Oracle and SQL Server. It just speaks very highly of the quality of open source software and the amazing things that people are doing with it. So I thought that was really awesome. All right, so we're going with Hadoop. We know that. The next question was: how do we do that? To fork or not to fork?
How do we move forward? Do we branch off our own distribution and maintain our own distribution of Hadoop from Apache? Do we go with one of the existing ones out there? I mentioned that Cloudera and Hortonworks were already maintaining their own enterprise-grade distributions of Hadoop, so those were options, and essentially we decided to go with that.
So again, Microsoft chose not to build it ourselves but to use an existing thing, which was kind of cool. We partnered with Hortonworks, and we're using their enterprise distribution and running that in Azure, in our data centers. One of the reasons we chose Hortonworks specifically is that I think we were very aligned in how we felt about this. Hortonworks has always said that their mantra is "Apache first": all the great stuff that they're doing, they make sure gets back into Apache Hadoop. That mentality of, you know, rising waters lift all boats, right? We're all in a community, we're going to help each other out and make it better for everyone. That was very much aligned with how we were feeling about Hadoop as well.
All right. So the next question, now that we'd decided to go with Hadoop, was how to make it run, because it doesn't run on Windows.
So we're going with this solution; we're going to work with Hadoop. Hadoop was written in Java. You know Java: write once, run anywhere, right? Uh-huh. Yeah, it didn't work on Windows. So the first thing we had to do was go forward and actually make it work on Windows, and that got us into the open source community step by step. The very first thing that had to happen: the team that formed the Big Data team at Microsoft was our existing data team.
And then we started submitting issues and patches and that sort of thing, and contributing a little bit. The first priority there was just ensuring that Hadoop would run on Windows, so doing the baseline work to get it working on Windows. Then, after we got it working on Windows, the next leg was to actually get it working well on Windows. Hadoop running on Linux still just massively outperformed Hadoop running on Windows, and so we were like [...]
So we got to the level that some folks at Microsoft are actually committers on Hadoop. It's great to have that vote of confidence that we were good contributors there. All right, so where was it hard? Let's talk about where we struggled a little bit. The first thing was just a complete culture change.
The other thing was around the timing of work and workflow and how that went. We, as a corporation, set deadlines, right? We have milestones or sprints or that sort of thing, dates that we want to try to get stuff done by. When you're working with the open source world, a lot of people are volunteers, and you can't push your timelines on other people. So we had to make some adjustments there and figure out how to make that work.
But it really grew this new culture at Microsoft. So we were getting there. We were fumbling a little bit, and I don't want to say that we have it perfect yet. I'm sure, no, positive, that we don't. But we're learning, and we've actually seen some things changing that make me really excited. The first thing, I think, was a really key turning point here.
It's helping us to make it run well on Windows, but there's an initiative called the Stinger project, and that's another Apache thing for trying to optimize and make some of these things run faster. There's a whole group of them; Tez and some of these other things are also in that camp. One of the goals of this initiative was to make Apache Hive run faster. Hive is a query language, kind of like SQL, for working with Hadoop and HBase and others in the Hadoop ecosystem, and it wasn't performing as well as maybe it could. So we were looking for ways to make it better.
What happened is we took, again, these PhDs in query optimization who had written SQL Server, and they were like, "Well, you know what? Based on all the stuff that we know from writing SQL Server, we know ways to make this faster." They wrote a paper called the Hive 100x, and essentially put together ways to make Hive run, for some queries, 100x faster. Then we took that and partnered with Hortonworks and Facebook and others and contributed all of that back into Hadoop. That is awesome when you think about it, right? Because that was essentially intellectual property.
This was some of the, I don't know, almost trade secrets, but it was intellectual property that we used to help make SQL Server so great, and we were giving it back to the open source community. That's something I'm really proud of, and another thing that I think signals that culture change; you wouldn't have seen that maybe ten years ago. So in summary, a bunch of SQL Server developers are writing Java code now to improve and support open source. That's awesome. That's really, really cool.
We look at: what are people using right now? What are the most popular tools? Where is the open source community going? And we were making decisions based on that. That's awesome: open source really is helping drive and shape this product. So in summary, there have just been so many cool things around the Hadoop story. I found these numbers. Don't quote me on these numbers.
I actually pulled kind of an old slide, so these numbers might be even bigger now, but at the time I made the slide, it was about ten thousand engineering hours that we had put in, and over thirty thousand lines of code contributed back into Hadoop. So that's awesome. We were responsible for helping get Hadoop on Windows working. We had the Hive 100x query speedup. Some of the people at Microsoft advanced to the point where they're committers on Hadoop.
We offered that HDFS service in Azure Data Lake, so that we're using what the open source community wanted to use in our system. All of these things together just have me so excited about how the battleship has turned and how Microsoft and open source are working together much better now. And actually, I think that the Big Data team, the Hadoop team here, is actually hiring.