From YouTube: Example: Data Cleanup
Description
Latest release version of this tutorial: https://intel.github.io/dffml/examples/data_cleanup/index.html
Current development version of this example: https://intel.github.io/dffml/master/examples/data_cleanup/index.html
During my tenure at Google Summer of Code I actually worked on two projects. One of the projects was related to accuracy scorers, and here I have created documentation for it.
So this was the accuracy scorer that I worked on. As you can see, all of these accuracy scorers actually come from scikit-learn's metrics methods, and all of these scorers are part of the set that I have integrated.
So we have scorers for regression, scorers for classification, and scorers for clustering, and we also have supervised scorers, which are the models' default scores. Every scikit-learn model comes with a default score, so if you would like to use the default score, you can use the scikit-learn model's score() method.
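To make that "default score" idea concrete, here is a minimal scikit-learn sketch on made-up toy data: every estimator exposes a score() method, which is R² for regressors and accuracy for classifiers.

```python
# Minimal sketch of the default scorer: every scikit-learn estimator has a
# .score() method (R^2 for regressors, accuracy for classifiers).
# The data below is made up purely for illustration.
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 on the training data, here 1.0
```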
As we know, most data scientists and data engineers spend something like 80 percent of their time on data cleanup, so this project was targeted at how we can preprocess and clean up data in the dataflow itself. I have written documentation for it here, and you can always visit the documentation with this link. I will be running some of the code from this documentation and showing you how we can actually preprocess data using the dataflow.
So, I think there are, yes, close to 21,000 entries in this data set, and we will be using it. We will be doing this in two steps.
First, we will train on this data set as it is and check what accuracy we are getting. Then we will apply some of our own preprocessing on top of it and see what accuracy we get out of that.
So here is the training part, the normal model training part. We are using a model from scikit-learn, we have given the features and the feature to predict, and we will be storing the model in a temporary directory. Our source is in the format of a CSV file, as you can see here, and we have also given the source file name.
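As a rough stand-in for that training step, the sketch below trains a plain scikit-learn linear regression from a CSV with pandas. The file name and column names are assumptions for illustration; the tutorial itself drives this through DFFML's model config rather than direct scikit-learn calls.

```python
# Hedged stand-in for the training step: a plain scikit-learn linear
# regression trained from a CSV. "kc_house_data.csv" and the column names
# are assumptions; the tutorial configures this through a DFFML model.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("kc_house_data.csv")
features = ["bedrooms", "bathrooms", "sqft_living"]  # assumed feature names

model = LinearRegression()
model.fit(df[features], df["price"])  # "price" is the feature we predict
```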
And before you start on any of the steps mentioned here, first make sure that you have installed all of the plugins we require for going through this documentation; make sure you have installed all of the packages first.
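As a minimal sketch of that setup step, the snippet below installs the core library and the scikit-learn model plugin from within Python. The exact plugin list for this tutorial may be longer, so check the linked documentation.

```python
# Minimal setup sketch: install the core library and the scikit-learn model
# plugin. The tutorial may require additional plugins; see the documentation.
import subprocess
import sys

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "dffml", "dffml-model-scikit"]
)
```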
So let's check the accuracy file. This is the accuracy run we are doing: we are taking the same scikit-learn model, and the scorer we are using is the exvscore method, which is the explained variance score. That scorer comes from here in the documentation, under the regression scorers, and that is what we will be using.
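The scorer named here is backed by scikit-learn's explained variance metric. Below is a small illustration of that metric on made-up numbers, separate from DFFML's wrapper around it.

```python
# Illustration of the explained variance score on made-up values; the
# tutorial uses DFFML's scorer wrapper around this same sklearn metric.
from sklearn.metrics import explained_variance_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(explained_variance_score(y_true, y_pred))  # 1.0 would be a perfect fit
```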
So this is the create command we have here. In the create command we are actually trying to create a dataflow, and we define how the dataflow should run, that is, which operation's output should become which operation's input; all of that we define here. We have also provided a config here, which points to the file with the KC house data set.
We have given it to one of the methods, which converts records to a list. Using this method, with the source file, we convert all of the records in the source into the form of a list, because all of the cleanup operations that we have here actually work on a matrix of data and not on a single row of data.
So what we need here is some operation to actually convert all of those records into the form of a list of lists, and that is what this does. It needs a config, so in that config we are providing the file, and we are also providing what type of file it is, namely a CSV file.
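Conceptually, that operation behaves like the small sketch below, which reads a CSV and returns the selected feature columns as a list of lists. This is an illustration of the idea only, not DFFML's actual implementation, and the file and column names are assumptions.

```python
# Conceptual sketch of a convert-records-to-list style operation: read a CSV
# and return the chosen feature columns as a matrix (list of lists).
import csv

def records_to_matrix(path, feature_names):
    with open(path, newline="") as f:
        return [
            [float(row[name]) for name in feature_names]
            for row in csv.DictReader(f)
        ]

matrix = records_to_matrix("kc_house_data.csv", ["bedrooms", "bathrooms"])
```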
Then we are providing the inputs here; these are the inputs for the preprocessing part. Now, here is the flow we have defined. The first operation we are performing on top of the data is the standard scaler. The standard scaler method is an operation that rescales the data so that it has zero mean and unit variance. Then we have also applied another linear operation, which is called principal component analysis. So what will principal component analysis do?
It will reduce the data to a smaller shape. Let's suppose we have a data set of 500 rows and 10 features; using principal component analysis we can reduce that data set to something like 500 rows and, let's say, three features. Here, though, we are not doing that: we are keeping the rows and columns of the data set the same and just trying to extract what the important features in the data set are.
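Here is a minimal scikit-learn sketch of those two preprocessing operations on random stand-in data: standard scaling to zero mean and unit variance, then PCA reducing 10 features down to 3 (whereas the tutorial's dataflow keeps the original number of components).

```python
# Minimal sketch of the two operations described above, on random stand-in
# data: standard scaling, then PCA reducing 10 features down to 3.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 10)  # stand-in for a 500-row, 10-feature data set

X_scaled = StandardScaler().fit_transform(X)      # zero mean, unit variance
X_reduced = PCA(n_components=3).fit_transform(X_scaled)
print(X_reduced.shape)  # (500, 3)
```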
So this is the merge command. Using the merge command, we take a source and a destination, and whatever records the source has, we move to the destination source. In this merge command our source is actually a dataflow source, and using the dataflow source we will run the dataflow we created earlier.
We run it on top of the data set we have, and on running the dataflow we will then have the preprocessed data set. That preprocessed data set we will store in a CSV file, the preprocessed CSV file, and we will store all of it in this file. These are the features that we want to be preprocessed; these are the features we have provided, and this is the dataflow we generated earlier.
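The net effect of that merge step looks roughly like the sketch below: run the preprocessing over the raw CSV and write the result to a new preprocessed CSV. File and column names are assumptions, and the number of components is kept unchanged, matching what the dataflow does here.

```python
# Rough sketch of the merge step's net effect: preprocess the raw CSV and
# write the result to a new file. Paths and column names are assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["bedrooms", "bathrooms", "sqft_living"]  # assumed feature names
raw = pd.read_csv("kc_house_data.csv")

scaled = StandardScaler().fit_transform(raw[features])
reduced = PCA(n_components=len(features)).fit_transform(scaled)

out = raw.copy()
out[features] = reduced
out.to_csv("preprocessed.csv", index=False)
```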
And these are the model features, and this is the file from which we will be predicting. So let's check what accuracy we are getting.
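To mirror the before/after comparison the video runs, here is a hedged sketch that trains and scores the same model on the raw and the preprocessed CSVs and prints both explained variance scores; paths and columns are again assumptions.

```python
# Hedged sketch of the before/after comparison: train and score the same
# model on the raw and preprocessed CSVs. Paths and columns are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score

def evaluate(path, features, target="price"):
    df = pd.read_csv(path)
    model = LinearRegression().fit(df[features], df[target])
    return explained_variance_score(df[target], model.predict(df[features]))

features = ["bedrooms", "bathrooms", "sqft_living"]
print("raw:         ", evaluate("kc_house_data.csv", features))
print("preprocessed:", evaluate("preprocessed.csv", features))
```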
So this was the second project I worked on during my tenure at Google Summer of Code, and I'm very thankful to John Andersen and Saksham Arora.