From YouTube: DevoWorm (2023, Meeting #28): DevoLearn/DevoGraph, Genomics of Differentiation II, Recent Papers
Description
GSoC updates (DevoLearn and DevoGraph). MedSAM, Persistent Homology, and TOGL. Genomics of Differentiation: protein alignments for C. elegans, Global vs. Mosaic Methylation and Control of Cell Differentiation. Differentiation codes and associated analysis. Attendees: Sushmanth Reddy Mereddy, Himanshu Chougule, Bradly Alicea, Susan Crawford-Young, Richard Gordon, Jyothi Swaroop, and Lukas.
A
B
Himanshu, hello, everyone. Well, we had quite a discussion about this paper on the genomics of differentiation. I think it's great. I know Lukas was working on it quite a bit this weekend, and we're going to go over a lot of that stuff. I did a deep dive into the literature on C. elegans differentiation, which is actually quite interesting. So yeah, we'll keep working on that.
B
I don't know if Lukas is going to be here today, but he sent me a version of the paper. And then, of course, we have our GSoC students. I also got the microSAM document from Sushmanth, and I have that in my tabs, or he can share his screen as well, just to go over that. And then Himanshu's here.
B
So let's get started. Who wants to go first with an update?
C
I hope you're fine. I started writing about the microscopy Segment Anything Model, so microSAM. In that paper I have just given a gentle introduction to using the Segment Anything Model for cell image segmentation.
B
Could you go ahead with your update while that's being set up?
D
D
Okay, so this weekend I was basically working on DevoGraph, trying to set up the persistent homology (topological data analysis) part and get the same results as before, because there was some issue in...
D
A
The repository, let's share it over here.
D
Thank you. Yeah, okay, so in the requirements.txt, what is listed is different from what is actually working right now. Since the project was about a year ago, it did not pin any versions that could be used, so I had to figure out what the exact version for each package was. I'll send an update to Jiang and tell him about this. The toolkit they were using has been updated as well: it has gone to version 1.0, and PyTorch has also gone to version 2.0, so compatibility between them is an issue right now. But then I got it to work, and stage two is now working on my PC as well, and I got the same results as before.
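The version problem described here can be checked mechanically. The following is a minimal sketch, assuming the requirements file uses simple `name==version` pins; the package names in the usage note are hypothetical and not taken from the DevoGraph repository.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # unpinned or range requirements are skipped in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Look up the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (pinned, found)} wherever the two versions differ."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` would list every pinned package whose installed version differs, which is one way to record "what is actually working right now" before updating the file.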
D
So my second stage, as of now, is to get some papers done, like this topological graph neural networks paper. I've been looking at the GitHub repository for a while, trying to figure out how exactly they used it and how we can try to get similar results for our task, that is, cell tracking. They have done it on classification tasks, like node and edge classification.
D
So, basically, what they have used is something called the torch persistent homology repository, which is different from what most people have been using, that is, the [unclear] TDA one. So right now I'm just figuring out how to integrate GNNs and TDA, and once I get a small example ready for our data set, I'll move on to the cell tracking part and try to solve the problems that Jiang had.
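For readers new to persistent homology, here is a minimal sketch of its simplest case: tracking when connected components (0-dimensional features) of a point cloud merge as a distance threshold grows. It uses plain Python and a union-find structure; it is an illustration only and is not the torch persistent homology code or the TOGL layer discussed above.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the scales at which connected components merge (H0 deaths).

    Every point is born as its own component at scale 0. Edges are added in
    order of increasing length; each edge that joins two different components
    kills one of them at that length. One component never dies.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths
```

For cell centroids, long-lived components (large death values) would indicate well-separated groups of cells, which is the kind of topological summary a graph neural network layer could take as an extra input.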
D
As also mentioned in the previous meetings, with our data there is some kind of rotational invariance or differentiability issue. The other issue was that this paper only considers static graphs and our graphs are dynamic, so we'll have to figure that out; that is one of the challenges as well. So right now I'm doing a literature review on that, and once I have an example, I'll move on to the next step. Okay.
A
D
My work this week would be to get a working example ready for TOGL, but on static graphs.
D
A
D
Which is what these guys have done as well, like the...
D
...is this year, and I'll train on that, and if it's useful, I'll send that as well.
D
C
C
I mean, we have never integrated them. You can run stage one separately and you can run stage two separately, if we want, but combined they have never been integrated, and that is the hard part, because stage 2 needs cell lineage analysis. That is the main issue with DevoLearn: we are not able to create the CSV file with cells, centroids, and cell lineage analysis.
D
And versions, and all of that. We also started to update the DevoLearn repository as well, changing the functions to use the scikit-image library functions for image processing and everything, and the dependencies were also updated. So I had to use a different environment to get my setup ready for stage two, so right now...
C
A
A
B
All right, that sounds great. Yeah, thanks for the update. Glad to see your computer is back up and running. Yep, I had trouble with that. Yeah, but...
A
B
This is good. So, Sushmanth, are you ready?
A
C
B
All right, so yeah, this is the microSAM document, and it's basically an overview of some of his proposed work on microscopy image segmentation with the Segment Anything Model. We're trying to build a pipeline of papers here, so this is the first one, I guess, where he's proposing that we have this version of SAM called microSAM that can work on microscopy images specifically. It's just using the SAM methods from Meta and applying them to this specific problem. So...
B
C
B
C
Compute units on Google Colab. I started training the model, and currently I am writing the whole paper for microscopy images of cells. Actually, I shared the doc about it. I wrote an abstract on what it is doing. After that, I am giving a general introduction with the corresponding images. I am comparing our model with U-Net structure segmentation; that's the idea I have in mind. Exactly, I will be writing a paper about cell segmentation using zero-shot segmentation, and currently I'm writing...
C
A
C
Now I bought the upgraded Colab version and I'm training the model there. There are some small bugs I need to work out. Actually, the bug is just a shape error in the loss function. It is small; maybe by tomorrow I'll sort it out and the model will work fine. But the main problem would be writing this paper and...
A
C
C
This whole week I couldn't work much because I was having a fever and could barely manage my work. But this week I will try to complete training the model, and I will try to complete the paper about what is happening with our model, how it is implemented, what the accuracy is compared to other models, etc. Currently, this doc, I will change it completely.
C
Please take a look at the end of the week and let me know what changes I need to make. After this paper, I'll start working on the table net, which I have proposed in my GSoC proposal. I will train that, and I will write a separate paper on that also. That would be one. And for DevoLearn, actually, Jyothi attended today's meeting for DevoLearn only. He started reading the docs which you shared, and he will start writing papers about DevoLearn, etc., from tomorrow, or maybe from next week onwards.
C
C
A
C
A
C
I will try to make drafts of all the papers and share them with you. After reading them, you decide whether each is up to the mark for a journal paper, or for a conference, or something like just publishing [unclear]. Okay, that would be it. And Bradly, I think there is something with the deadline thing. It was not updated on my side, but actually, oh...
B
C
A
A
A
A
C
C
B
The papers will be, I think, more iterative. It's fine for right now, because there are going to be a lot of changes made. We could even have multiple versions of the same thing, like the stuff you're working on for MedSAM. We can start with one, and then we might also make other documents. That's just the way this kind of works: you have the work, and you want to make sure that it gets out there, but you want to do it in a way that's interesting or relevant to the people you want to target. So if we want to, say, talk to a machine learning audience, we're going to have to use a lot of the jargon of the field. They want technical details. A general audience doesn't necessarily care about that.
B
A
B
B
If you read a lot of these scientific papers, they have all these supplemental graphs and data and figures. I mean, people read it, but some of these papers have like 240 supplemental documents now, and it's like, who reads all that? Sometimes technical experts are interested in it, but no one's really interested in every piece. So it's one of those things where we'll be creating things for specific purposes. Yeah.
C
B
Yeah, one thing we should have, of course, is a persistent URL. When you put up code or something, you want to have that. Well, GitHub, I guess, is a persistent URL, but a lot of times people will make a release. Probably near the end of GSoC we'll make a release for the code. We already have releases for DevoLearn, but we'll make another one, and then maybe we can even archive it on [unclear].
B
But I'm talking about the data sets and things, so those can be archived separately. There are usually data archives where you can make a stub, and it gives you a DOI, and you can update the files in it, so people can download it.
B
C
C
C
B
There should be a copy of the code with a license in that same repo, so people can have that information, but then we'll also have an archive of it. The archives are usually just the releases. So a group will work up to a certain point and then make a release. There are different strategies for doing that, but generally they'll say, we're going to make a release now, and we're going to call this 1.7.
B
Maybe it's a stable version, and then you put it up in an archive. Or there's a little tag in GitHub repos that says "latest release", and you can just click on that and get the stable version. The repo itself will have whatever changes are current, so the last push to that repo will be in the repo files, and that may not work stably for whatever reason, but the last release should be stable.
B
So people will do releases just to make sure everything's stable, and then people can download that, and people can still keep working on the open source files. They can keep working on the code, and that's not interrupted. You know, okay, so, okay.
C
Thanks. Next week, yeah.
C
C
C
Like you actually talked about in the other meetings, right, the documentation of the code and all this stuff, what is happening under the hood. Actually, we had a meeting, and I explained everything to him, whatever happened. We will make a draft of whatever is happening, which process we are following, etc.
B
Okay, that sounds good. Thank you for the update. Let's see, so Himanshu posted something in the chat here. Oh, okay, Himanshu posted JOSS, the Journal of Open Source Software. Yeah, that's a good venue for papers. It depends; they have a certain standard for applicability and use and things. I think I submitted there once, and they didn't want the paper at that point.
B
But basically it's the kind of journal where you prepare the code, push it to their repository, and put it in the template, and then they publish it directly from a GitHub push. It's kind of an interesting model for publication. It's usually papers on open source software where they're describing the work, but they have certain standards for what they find interesting.
B
So it's kind of like a regular journal. But then, yeah, arXiv is a preprint server, so that's obviously a place to go with the different paper versions. You can put papers there, but not necessarily code, although they have improved their support for code links and things like that.
B
There is Papers with Code and things like that, which are arXiv papers that also have the code, and I don't know how that's set up. But I know that the person in the other group, Ankit Grover, is familiar with it, because he got a paper on arXiv and Papers with Code and all that, so I might talk to him about that.
B
B
Oh, there are a lot of publications, but one of the things in our group is that we're at the intersection of cell biology and computational science, computational biology, so the journals there are going to be very different from what they publish in the cell biology journals. BioSystems is an interesting journal that kind of straddles all of that. It's even more theoretical, but they're different.
B
We can come up with a list of journals for a certain paper. But the point I'm trying to make here is that a lot of your cell biology journals are not going to publish a machine learning paper. It's just not their audience. Their audience is more like, "I did this experiment" or "I made this observation in the lab." It's...
B
It's kind of an odd thing, but there are other journals, too, where they do have this mix. This is a thing we'll have to talk about with some finesse. Okay, yeah, so thanks for all those updates. I guess we're going to extend the projects to October 30th. I'll try to get in touch with the people at INCF for that.
B
So, thanks again. Now I want to turn attention to the genomics of cell differentiation paper. There was a lot of activity on that this week. I think we're at version 10 now, right? And Lukas did a BLAST search, so I'd like to hear from Lukas, actually. If you don't mind, talk a little bit about that.
E
Good day. So, yeah, I actually had a couple of meetings with Mr. Gordon, and what we came up with was an idea which I'm...
E
A
E
E
Okay, so this is Figure 8.13 from Embryogenesis Explained by Natalie K. Gordon and
A
E
Richard Gordon. These are a bunch of proteins, like fcr, Wnt, and MAP (microtubule-associated protein). Based on what Mr. Gordon said, they seem to be somehow involved in differentiation in C. elegans, I guess. So let me stop sharing.
E
So what we came up with was that we might be able to run a BLAST search for, let's say, one of these proteins. So what I did, for example, let's say for PKC, protein kinase C:
E
I went on NCBI, the National Center for Biotechnology Information, and I searched the protein database for PKC in C. elegans, and it shows a bunch of results, most of them sequences for proteins of that family. So it gives you the protein sequence, and then what I did was run BLAST against it. BLAST, for those of you who don't know, is...
E
BLAST is basically a software tool that helps with DNA alignment, or sequence alignment in general. So I ran BLAST, and I specifically narrowed the results down to the taxonomy of C. elegans, so that we could find results there. Then I started counting. Based on what Richard said, when you keep count for each protein, that's going to be a specific enhancement to the paper. And how it works is...
A
E
It showed about 100 hits for each protein, and then I started counting the copies. So these copies are the n copies; technically, they are the ones that have drifted apart, and they correspond to the n edges of the differentiation tree. There is a limitation to this procedure, though: when you do this, it's not technically going to show all of the hits, and when you find significant similarity, the problem is that some of these... so, I could share.
E
I used, I think it was... I'm not sure, I have to look it up again, but it's the normal settings. I don't think it specifies stringency. The default settings, yeah. I have to look up what the percentage is. But let me show you what's going on. So, for example, I ran the BLAST search, and these are all of the results that it found. But the limitation here is that some of them are like...
E
You know, for example, if you look here, it's technically the same thing that it's showing us. And the problem here is that I think the NCBI database is, how could I put it, not that developed. So, for example, let me go to version 10. For DAAM1 (D-A-A-M-1, with a D as in "digital"), I could not find any results for this protein, and for APC, I think, since the amino acid sequence was too short, it didn't find any copy.
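The counting step, including the problem of the same subject showing up more than once, can be sketched as follows. This assumes the hits were exported in BLAST's tab-separated tabular format (query ID in the first column, subject ID in the second, E-value in the eleventh); the E-value cutoff is an arbitrary illustrative choice, not the setting used in the search described above.

```python
from collections import defaultdict


def count_unique_hits(tabular_text, max_evalue=1e-5):
    """Count distinct subject sequences per query in BLAST tabular output.

    Repeated rows for the same subject (several aligned segments of one
    sequence) are counted once. Identical sequences stored under different
    accessions are NOT merged by this sketch.
    """
    subjects = defaultdict(set)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            subjects[query].add(subject)
    return {query: len(found) for query, found in subjects.items()}
```

Run over one export per protein, this gives a copy count that does not double-count rows which are "technically the same thing".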
E
So for some of the proteins I could not even find the sequence. I would say that's a limitation of the procedure that we're doing, but yeah, that's technically what I did. Mr. Gordon also sent some interesting ideas, and I could read them. One, he said, look for regulatory elements, not genes, and I started looking out for those. The BLAST search is the same thing, but I have to look for those specific elements. Two, he said, reduce the stringency; I have to look that up. And three, he said, see if there is data on expression. I'm going to put this into the chat. Let me just... okay.
E
Morning. Yeah, again, I'm not sure if I could find the data for three, for the different stages of development. However, we could look at this protein. I left a note that, I think it was...
F
Can you answer that? For C. elegans, has anyone done the gene expression for individual cell types?
B
B
Yeah, so these are all good, especially the first one, I think, looking for regulatory elements. I mean, this is where a lot of the action's going on, in the regulatory elements, apparently.
F
It could be that, if there is a differentiation code, whatever codes it might be regulatory rather than these expressed genes. On the other hand, Lukas got hits with the default stringency, so you got hits up to 40.
B
F
B
F
Okay, four is something maybe you could answer. Are there other nematodes for which the whole of the cell lineage has been worked out, besides C. elegans?
B
Yeah, there are a couple, like, I think, C. briggsae and some other ones. They're different, like they have different cell numbers, and I don't know what the available data is.
B
Oh yeah, I don't know about smaller. I know there are some that are larger, but, well...
A
B
We could compare, yeah, across nematodes.
F
F
B
A
B
F
B
I think so. I don't know; we could look for it. Yeah, one word about the drift part. In BLAST, the results that you get are matches across the sequence. So when it gives you a percent match, it's really how much is predicted from the input. I don't know if that's going to be different. I mean, obviously there are going to be differences in the samples.
B
If you go from C. elegans to another species, there are obviously going to be differences between C. elegans and the other species, but there's also this issue of strength of prediction. I don't remember what parameter they use, but usually your match is pretty close with protein sequences. It's better than DNA, and the reason for that is that when they put together a genome, or they have DNA sequences, you often get missing bases, where they have to infer the bases using an algorithm or something, or there are a lot of repeats, and that can be a problem with BLAST. So using the protein sequence is generally better, and it's generally more stable. So, I mean, I don't know what those changes...
F
B
B
B
But one of the things about BLAST, and a lot of genomic stuff in general, is that organisms tend to share a lot of basic DNA that involves housekeeping genes and things like that, things that make things like cells. Those tend not to be different across species, so you could compare, for example, nematodes and humans, and there are a lot of pathways that are similar or the same.
B
So that's why we can do that with genomes and with proteins. Of course, the differences are in some of the other functional things that are specialized for that species. If you look at sequence homology between, say, bananas and humans, it's like 50 percent, and you would think, well, why is that? And the answer is, well, you have a lot of metabolic genes. You have a lot of genes that are involved in cell signaling and things that basically don't need to be reinvented by evolution. So...
D
B
That's all great. Thank you, Lukas, for the great presentation of the work. So, next steps: I'll take a look at the draft. I've taken a look at the earlier drafts, but I didn't see the latest draft. I don't know if I actually have the current draft. I didn't see anything in my inbox this morning, Lukas. So if...
E
I could send it to you after the meeting. Okay, yeah, that's good. And the six proteins that I sent there: for five of them, I actually couldn't find the protein sequence. So even finding the protein sequence for C. elegans would be a start. If we could find that, I could try BLASTing that as well, but I couldn't even find the protein sequence for five of those six.
E
B
Yeah, that's a common problem. Really what it's doing is taking a sequence, looking at all the other sequences, and trying to infer a match. And again, it'll make some account for some of the sampling error. When they put together a sequence, some of it is actual sequence and some of it is inference of what should be there. So when you get a sequence in the database, it's usually pretty clean.
B
But you do have this issue of getting an alignment, and this is where you'll have a certain degree of accuracy, but you also sometimes get matches that don't make any sense. It's just because they're very similar, but they may be false positives.
B
So there are a lot of things like that, but it's a very useful tool. And no matches sometimes means that no one's put it in the database yet, because we've only sampled so much biology. You'd think C. elegans is well studied. Well, not always.
E
And I have a question for Mr. Gordon. Why should we cross-compare with yeast and less complex nematodes? If you could explain the reason and how you came up with the idea.
B
A
B
B
Oh, okay, so you know what the phenotype is, or you know what it corresponds to. Yeah, there is the issue of annotation, like knowing what it does. I think BLAST will give you an annotation, but the annotations are generally very simple. It's not very detailed information.
B
You can look at yeast, if it's conserved between yeast and C. elegans, and see what the function is in yeast. Like I said, we share a lot of DNA across species, so you'd maybe have a corresponding function in C. elegans.
B
A
A
B
B
B
All right, yeah, I'll see what they have in the literature. And then, Lukas, the parameter values: sometimes, if you play around with them, you can get different results. I don't know how you played around with the numbers. Yeah, I don't...
B
Well, the E-value is like a significance value. Usually you don't worry about the E-value, because it's always pretty small, and what it's doing is just giving us a statistical significance. It's not exactly that, but it's basically the same. But I mean any parameters that you put in, like a criterion for a percent match, whether you put it in or you're just using the default.
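To make "not exactly a significance value" concrete: the E-value is the number of hits with at least a given score that would be expected by chance in a search of that size, so it scales with the query and database lengths. A minimal sketch using the relation commonly given in the BLAST statistics documentation, E = m * n * 2^(-S'), where S' is the bit score; the lengths in the usage note are made-up numbers, and real BLAST uses effective (adjusted) lengths.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate BLAST E-value from a bit score.

    E = m * n * 2**(-S'), with m the query length, n the total database
    length, and S' the bit score. Smaller values mean the hit is less
    likely to be a chance match.
    """
    return query_length * database_length * 2.0 ** (-bit_score)
```

For instance, `expected_chance_hits(50, 300, 10_000_000)` is far below 1, and every additional 10 bits of score shrinks it by a factor of 1024, which is why the E-values of strong hits are "always pretty small".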
E
I used the default settings, but I have to check that. But are you also referring to percentage identity, like when it matches?
A
B
A
A
B
If there are any input parameters... I don't know what the input parameters are in the window that you're using, but if you do searches with different input parameters, then you should make note of that and...
E
B
I mean, just as a note, when you're doing these analyses, if you do them under a certain set of conditions, make sure that everyone knows what they are, because when you go to publish them, it's like with the machine learning stuff: we have to have technical detail. It makes a difference sometimes, and then sometimes, I don't know if changing input parameters will make much of a difference.
B
I mean, you might have a very specific question with respect to input parameters, but generally just make sure that you know what you started out with and what the results are, and we'll probably make a table or something to show what we're getting.
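One lightweight way to keep the search conditions together with the results is to log each search as a row and export the rows as a table. The sketch below is only an illustration; the field names and the example values in the usage note are hypothetical, not the settings actually used.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SearchRecord:
    """One BLAST search: what was asked, under which settings, and the outcome."""
    protein: str
    database: str
    taxon: str
    evalue_cutoff: float
    hits: int


def records_to_csv(records):
    """Serialize search records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(SearchRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

A call such as `records_to_csv([SearchRecord("PKC", "nr", "Caenorhabditis elegans", 0.05, 100)])` produces a two-line table that can be pasted into the paper's supplement or extended as more proteins are searched.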
B
Yeah, that's great. And I'll look into the literature on the cell IDs and the different types of lineage trees across species. I know that in Caenorhabditis, which is the genus of interest, there are a number of nematodes that have been studied. They're not model organisms, but I think they understand the lineage tree. But again, the nomenclature is sometimes different.
B
So it may or may not be something that is easily comparable. I have to find out.
B
B
So I wanted to go over this deep dive that I did on differentiation. Last week I talked about the stuff with methylation, and this is a different type of data than what Lukas was looking at. Lukas is looking at the output proteins. What happens, of course, is we have transcription.
B
We have a promoter region and we have a gene. The promoter triggers expression of the gene, or of certain parts of the gene, and then you get a protein made from that. But what controls transcription? What controls transcription are these epigenetic things that control the openness or closedness of the promoter. So this is where methylation comes into play.
B
Now, what I talked about last week was the standard model in stem cell research, which is mostly on mammalian cells. What's interesting is that in mammalian cells there's this assumption of global regulation of methylation state. What that means is that all across the genome, in a certain organism, if you have a stem cell, you have a change in methylation state.
B
So these methyl marks will change their state to sort of drive the cell towards a certain differentiated state. We talked about bistability and all that; that's kind of an aside to the main idea, which is that, in general, in a cell, every gene will be regulated in the same way. In other words, they're going in the same direction: every promoter will be regulated or primed towards this differentiated state.
B
And so that's what we have in mammals, and it's very interesting that that's actually maybe not the case in C. elegans, although maybe it is. We don't know this for sure, because apparently the literature is a little bit scattered. But let me go through what we have in terms of genomes for C. elegans. The C. elegans genome was sequenced a couple of years before the human genome; there was a draft sequence, or a draft genome.
B
It was put out and published in 1998. So there's this genome sequence for the nematode C. elegans, from the C. elegans Sequencing Consortium. They put out a 97-megabase genomic sequence, which in the original version revealed over 19,000 genes.
B
So the number of human genes predicted by the Human Genome Project was something like 20,000 to 30,000, and they keep revising the numbers as they refine the sequence. What happens is they put a draft sequence out and then they refine it. So this is not that far off from what we have in humans. People thought a long time ago that there were a lot more genes in humans than in other organisms, especially what they call the lower organisms, but that's actually not true. It seems like C. elegans and humans have, within an order of magnitude, similar numbers of genes. Now, the size of the genome is different. If you look at the C. elegans genome, and I don't know if I have a copy of it here, it's a couple of chromosomes and, I think, a sex chromosome. So there are a couple of autosomes and a sex chromosome, and the C. elegans chromosomes themselves don't have centromeres. That has implications for this global regulation.
B
B
And then this is where we have a bunch of genes that we can sequence as DNA, and then we can infer proteins from this, or we can actually get the protein sequences; that's where we're getting that. But this is actually from ENCODE. There was a part of the ENCODE project called modENCODE, and that was the part of the project where they did a lot of...
B
They collected a lot of data on C. elegans and Drosophila, and so they make these comparisons between C. elegans and Drosophila, and humans and mice. So there's this broad comparability aspect, and so you can do experiments in C. elegans, say for aging or for other types of things.
B
We have the genetic pathways, and we kind of know what they look like. They're very similar in humans, because, like I said, we have a very high degree of similarity for such different organisms, and then we can make inferences. They have these things called homologs and paralogs, which is how you get genes that are similar in different species; they have different names, but they're basically doing the same thing.
B
This is an example of what we have with the modENCODE data set. These are gene expression data. So this is ChIP-seq data, which is where they...
A
B
This
next-gen
sequencing,
where
they
put
a
sample
on
a
chip,
they
sequence
it
for
each
oligonucleotide,
they
get
a
sequence
and
they
get
a
sort
of
a
an
amount
of
that
sequence,
that's
expressed,
and
then
they
compare
it
against
the
genome
and
they
try
to
find
these
little
stretches
of
DNA
and
how
intensely
they've
been
expressed
or
how
intensely
they're
in
the
sample.
So
we
can
say
a
lot
of
things
about
gene
expression,
using
ChIP-seq
data
and
other
types
of
data.
This
has
a
lot
of
ChIP-seq
data
in
it.
B
So
this
is
all
this
was
something
that
ENCODE
did
and
the
reason
they
did.
This
is
that
they
wanted
to
infer
function
from
The
genome
of
these
different
species.
So
the
methodology
was
that
if
there
is
a
transcription
factor
in
association
with
a
promoter
region,
then
that's
expression,
or
that's
function,
and
they
had
different,
there
was
a
controversy
about
how
they
Define
function,
but
basically
this
data
exist.
B
This
is
BLAST.
Of
course,
this
is
the
Wikipedia
stub
for
BLAST
I,
don't
know
how
much
I
need
to
go
through
this
for
people,
but
basically
this
is
the
the
origins
of
BLAST,
we're
trying
to
find
a
way
to
compare
DNA
sequences
and
make
this
comparison
between
similar
DNA
sequences.
So
this
has
been
around
a
long
time
and
you
can
do
this
through
the
GUI
that,
like
how
Lukas
did
it,
you
can
also
set
up
BLAST
on
like
a
cluster
or
even
on
your
I.
B
Don't
know
if
you
could
really
do
a
good
job
on
your
laptop
or
desktop
environment,
but
you
can
set
it
up
so
that
you
have
like
the
database
in
a
FASTA
file.
You
plug
it
into
the
program.
It
runs,
usually
a
command
line
thing,
and
then
it
will.
You
know,
give
you
yours
your
matches,
but
of
course
you
know
that's
going
to
take
a
lot
of
memory
so
using
the
GUI
is
probably
good
enough
for
a
lot
of
this
sort
of
thing.
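The local setup described above can be sketched roughly as follows. This assumes the NCBI BLAST+ command-line tools (`makeblastdb`, `blastp`) are installed and on the PATH; the file names and the database name are hypothetical placeholders, and the E-value cutoff is only an example.

```python
import shutil
import subprocess


def blast_commands(db_fasta, query_fasta, db_name="celegans_proteins",
                   evalue=1e-5, out_path="hits.tsv"):
    """Build BLAST+ command lines: index a protein FASTA file, then search it."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]
    return make_db, search


def run_blast(db_fasta, query_fasta):
    """Run both steps, but only if the BLAST+ executables can be found."""
    if shutil.which("makeblastdb") is None or shutil.which("blastp") is None:
        raise RuntimeError("BLAST+ not found on PATH")
    for cmd in blast_commands(db_fasta, query_fasta):
        subprocess.run(cmd, check=True)
```

Calling `run_blast("proteins.fasta", "query.fasta")` would write tab-separated hits to `hits.tsv`; for a handful of sequences the web GUI is probably simpler.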
B
But
if
you're
doing
like
a
you,
know
a
sort
of
a
genome-wide
assay
or
a
survey,
this
would
be
you
know
installing
it
on
your
machine
is
good,
and
this
just
explains
like
how
this
process
works.
So
it's
really,
you
know
comparing
two
sequences
and
finding
a
match.
It's
inferring
matches
from
this,
these
pairwise
comparisons
and
it's
calculating
a
score
which
is
then
the
degree
of
match,
and
then
it's
generating
the
E-value
we
talked
about,
where
it's
evaluating,
the
I
guess
the
significance
of
this.
B
You
know
the
match
where
the
result-
and
it
gives
you
a
score,
a
similarity
score,
which
is
how
similar
are
the
two
sequences
for
reasons
you
know,
like
I,
said,
for
reasons
of
sampling,
for
reasons
of
other
reasons,
these
aren't
always
going
to
be
a
hundred
percent,
so
we
wanna
take
note
of
when
they're,
not
100
and-
and
you
know
that
can
be
it's
not
usually
a
problem,
though
this
is
of
course,
in
WormBase.
So
this
is
WormBase
specific.
This
is
an
ncbi.
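To make the idea of a pairwise score concrete, here is a minimal sketch of a Smith-Waterman local alignment score plus a percent-identity helper. This is not how BLAST itself is implemented (BLAST uses seeding heuristics and substitution matrices); the match, mismatch, and gap values are illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def percent_identity(a, b):
    """Percent of identical positions in two equal-length, pre-aligned strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)
```

With these settings, two identical four-letter sequences score 8, and `percent_identity("ACGT", "ACGA")` is 75.0, which is the kind of less-than-100 match worth taking note of.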
B
If
you
have
trouble
Lukas,
finding
some
of
the
protein
sequences
on
NCBI,
you
might
try
WormBase,
and
this
is
wormbase.org,
and
this
is
Tools,
BLAST/BLAT.
So
this
is
a
BLAST/BLAT
search
specifically
for
C. elegans.
This
is
like
based
on
that
C
elegans
genome,
so
you
can
actually
choose
the
version
of
The
genome
that
you
want.
B
The
latest
is
WS288
and
you
can
do
the
the
sequence
search,
they're,
actually
different
bio
projects,
so
the
VC2010
genome
was
done
in
like
2019,
and
it
was
just
a
revision
of
that
1998
genome,
and
so
this
is
a
tool
specifically
for
so
that
you
have
the
e-value
threshold.
This
is
the
e-value
that
we
talked
about.
You
can
just
threshold.
It
at
I
think
there's
a
default
value,
but
this
is
the
number
the
significance
and
then
this
is
the
database.
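As a rough illustration of what the E-value threshold means, the expected number of chance hits can be approximated from a bit score as E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below uses raw lengths rather than the effective lengths BLAST computes, so treat the numbers as approximate; the function names are made up for this example.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: chance hits expected at this bit score or better."""
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_len, db_len, evalue_cutoff=1e-5):
    """True if the approximate E-value is at or below the chosen cutoff."""
    return expected_hits(bit_score, query_len, db_len) <= evalue_cutoff
```

For a 300-residue query against 10 million residues, a bit score of 50 passes a cutoff of 1e-5 while a bit score of 30 does not, which is why a stricter threshold returns fewer hits.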
B
You
know
it's
usually
BLASTP,
but
you
can
also
compare
nucleotides
versus
proteins
yeah
and
then
this
is.
Finally,
this
is
the
C
elegans
genome
assembly.
This
is
you
know
on
NCBI,
so
we
can
get
the
whole
genome
if,
if
needed,
but
I,
don't
think
we
need
the
whole
genome
just
to
let
you
know
what
the
state
of
that
is
a
bunch
of
comments
in
the
discussion
here.
So
we
had
okay
yeah
we're
talking
about
the
figure.
Then
Dick
has
two
citations
on
the
cell
state
splitter.
F
F
B
I,
remember
those
this
is
a
lukai
and
then
let's
see
yeah,
okay,
so
that's
and
then
so
then
I
had
this
other
thing
that
I
did
on
the
Deep
dive
where
I
talked
about
where
I
was
thinking
about
this
problem
of
these
different
methylation
patterns
across
the
genome.
So
so
what
happened?
Basically,
is
that
you
have
this
problem
where,
in
mammals
you
have
this
Global
control.
So
you
have
these
methylation
marks
on
the
on
the
promoters
and
they
all
sort
of
go
in
the
same
direction.
In
C. elegans,
B
However,
that's
not
the
case
necessarily-
and
this
is
the
same
in
drosophila,
so
C
elegans
has
what
we
call
mosaic
form
of
development,
which
means
that
instead
of
having
like
this,
these
cells
that
respond
to
cues
and-
and
you
know,
differentiation
cues
in
the
environment-
the
cells
are
deterministic
in
terms
of
what
they're
going
to
be.
So
you
can
take
a
lineage
tree
and
any
one
cell
from
a
developmental
state,
which
is
usually
a
stem
cell
state,
will
differentiate
into
a
certain
type
of
cell.
B
So
this
is
something
they
call
Mosaic
development,
but
apparently
there's
also
a
mosaic
form
of
methylation,
and
originally
they
didn't
think
that
C. elegans
had
methylation.
They
thought
that
it
was
restricted
to
mammalian
cells,
but
what
they
found
is
that
in
mammalian
cells
you
have
this
Global
regulation
and
then
in
C
elegans.
You
have
this
Mosaic
regulation,
so
the
this
is
kind
of
talk.
I
have
a
couple
papers
here:
I'm
not
going
to
go
over
too
much.
This
is
Mosaic
methylation
and
clonal
tissue.
B
This
talks
about
some
of
this,
where,
if
you
have
a
tissue
type,
you
can
have
this
Mosaic
regulation
of
methylation,
so
you
can
have
cells
within
the
tissue
that
are
sort
of
maybe
can
jump
to
different
states.
This
methylation
isn't
like
stable,
always
over
tissue.
So
you
know
you
have
this
Global
regulation,
the
genome
in
mammals,
but
even
in
mammals,
you
have
this
sort
of
variation
across
cell
cells
in
a
tissue,
and
so
that
brings
us
to
C
elegans,
where
they
have.
B
Apparently
they
have
these
clusters
of
methylation
marks
and
they
tend
to
be
in
the
promoters
of
genes,
and
this
will
allow,
for
you
know
this
sort
of
differentiation
in
in
different
cells.
But
it's
mosaically
regulated.
So
you
know
there
are
certain
places
in
the
genome
where
the
methylation
marks
are
in
one
state,
and
another
part
of
the
genome
where
they're
in
another
state
and
if
you
think
about
this
Mosaic
development
mode.
B
That
makes
sense
because
not
all
cells
are
going
to
end
up
in
the
same
state.
At
the
same
time,
sometimes
cells
will
differentiate
early
into
neurons,
for
example,
and
sometimes
they'll
differentiate
later
into
muscle
or
into
something
else,
and
so
this
this
paper
is
on
induced
neurons
from
germ
cells
in
C. elegans,
and
it
talks
about
actually
inducing
this
process
and
some
of
the
things
that
they
do
with
transcription
factors
and
they're.
B
There
are
also
these
what
they
call
HOT
regions
which
are
regions
in
the
C
elegans
genome
that
are
CpG-rich,
and
CpG
again
is
a
cytosine
followed
by
a
guanine,
and
that's
what
they
they
look
for
with
these
methylation
marks.
So
you
have
these
sequences
that
are
CG,
CG,
CG,
sometimes
they're.
You
know
in
this
kind
of
what
they
call
a
microsatellite
or
a
satellite,
and
sometimes
they're
just
in
the
genome.
B
Now
in
the
promoter
regions,
you
tend
to
get
these
satellites
where
you
get
these
long
repeats
of
CG,
and
that's
where
you
get
these
this
sort
of
methylation
activity
that
affects
differentiation,
because
it
changes
how
the
gene
is
accessed,
and
you
know
it
changes
what's
regulated.
So
this
talks
about
these
HOT
regions.
They
talk
about
them
in
C. elegans
and
humans,
and
this
kind
of
this
work
kind
of
sets
up.
B
This
difference
between
C
elegans
and
humans
in
that
respect,
so
this
they
they
find
that
there
are
these
regions
where
you
get
clusters
of
these
CG
repeats.
You
get
this
higher
potential
for
regulation,
that's
based
on
maybe
like
cell
differentiation,
and
that
you
get
differences
between
humans
and
C
elegans
in
terms
of
the
stability
across
the
genome.
So
there's
a
lot
of
work
in
this
and
this
paper
Okay.
So
that's
the
Mosaic
methylation
work.
B
This
is
the
genome
and
then
there's
some
other
papers.
I
got.
You
know,
this
one,
CpG
islands
and
the
regulation
of
transcription,
basically
driving
home
this
message
that
there
are
these
areas
of
the
genome,
or
these
methylation
marks,
which
are
epigenetic
that
regulate
the
promoters.
That
then
regulate
the
gene
gene
expression,
but
we
can
actually
identify
the
state
of
this
or
the
potential
state
of
of
methylation
and
and
this
change
from
the
sequence,
because
the
DNA
sequence
should
have
a
lot
of
these
CG
repeats.
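One way to look for CG-rich stretches directly in a sequence is sketched below. It scans fixed windows and applies commonly used CpG-island style thresholds (window of 200 bases, GC fraction above 0.5, observed/expected CpG ratio above 0.6). The window, step, and function names are assumptions for illustration, not the method used in the papers discussed.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this is the full count
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_rich_windows(seq, window=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Start positions of windows that exceed both thresholds."""
    hits = []
    for start in range(0, max(len(seq) - window + 1, 0), step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_ratio:
            hits.append(start)
    return hits
```

Running `cpg_rich_windows` over a promoter sequence would flag candidate regions, which could then be compared against whatever methylation data exist for that gene.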
B
So
the
idea
would
be
that
you
have
a
lot
of
CG
repeats
somewhere.
There
is
a
potential
for
differentiation
and
regulating
cell
State,
and
so
that's
that's
what
I
found
in
my
deep
dive.
I
was
really
interested
in
that
because
I
thought
well,
you
know
this
is
in
the
all
over
the
literature
and
I
I
knew
about
mammalian
cells
and
I
wasn't
familiar
with
C
elegans.
B
Look
at
you
know,
maybe
just
do
some
work
on
not
not
an
entire
genome
sequence
but
like
specific
genes
and
then
they'll
actually
look
at
the
function
between
the
cells,
but
I
don't
know
of
any
study.
I
don't
know
if
we
have
like
the
entire
genome
for
each
cell.
We
just
have
like
these
data
sets
that
are
kind
of
like,
for
you
know
whatever
people,
people
ask
a
question
and
they
generate
a
data
set
and
that's
what
exists.
B
So
that's
that's
it,
and
then
you
know
back
to
this
paper
with
the
with
the
vulval
cells.
They
were
able
to
show
that
these
different
methylation
states
actually
govern
the
sort
of
differences
between
vulval
cells.
So
there
were
actually
two
cells
in
one
state,
two
cells
were
in
another
state
and
it
you
know
they
were
both.
They
were
all
in
the
vulva,
but
they
had
different
functions.
B
So
this
is
again,
you
know
something
that
you
know
maybe
a
different
analysis
from
the
proteomics,
but
that's
that's
something
we
can
put
together
in
in
the
paper.
B
I
think
the
missing
part
of
this,
of
course,
is
the
differentiation
code,
and
we
did
some
work
actually
in
the
differentiation
code
in
2016
our
paper
in
2016
on
it
has
a
title:
that's
not
really
what
I'm
looking
for
in
this
paper,
but
I
we
did
do
some
differentiation
trees
for
C
elegans
and
for
Ciona intestinalis,
which
is
a
sea squirt,
and
we
generated
those
in
this
paper
and
we
evaluated
the
lineage
trees
with
respect
to
differentiation,
and
there
were
some
other
things
in
here.
B
But
the
thing
that
I
wanted
to
point
out
here
was
that
we
did
work
out
some
something
about
the
differentiation
code
in
these
type
of
organisms.
So
this
was
a
mosaic
organism
where
we
had
reorganized
the
lineage
tree,
and
then
we
did
this
cast
analysis,
which
is
kind
of
like
a
blast.
B
It's
just
analyzing,
like
I,
think
oh
yeah,
like
basically
the
differentiation
code,
is
this
binary
code,
where
you
have
these
binary
divisions
in
the
lineage
tree
and
you
attach
binary
numbers
to
them,
and
the
binary
numbers
get
larger.
As
you
go
down
the
tree
and
then
you
can
take
like
a
certain
level
of
that
tree
and
you
can
take
another
tree
and
you
can
compare
the
sequence
of
numbers.
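A minimal sketch of attaching binary numbers to a tree with binary divisions is shown below. The tuple-based tree layout and the function names are assumptions made for this example; the cell names are the standard early C. elegans blastomeres up to the four-cell stage.

```python
def assign_codes(tree, prefix=""):
    """Map each named cell to a binary string: 0 = first daughter, 1 = second.

    `tree` is (name, left_subtree, right_subtree), or (name,) for a leaf.
    """
    name = tree[0]
    codes = {name: prefix}
    if len(tree) == 3:
        codes.update(assign_codes(tree[1], prefix + "0"))
        codes.update(assign_codes(tree[2], prefix + "1"))
    return codes


def level(codes, depth):
    """Cells at a given depth, as (code, name) pairs ordered by code."""
    return sorted((c, n) for n, c in codes.items() if len(c) == depth)


four_cell = ("P0", ("AB", ("ABa",), ("ABp",)), ("P1", ("EMS",), ("P2",)))
codes = assign_codes(four_cell)
```

Here `level(codes, 2)` lists the four-cell stage as 00, 01, 10, 11, and each further round of division adds one more binary digit.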
B
So
if
you
have
like
you
know,
binary
numbers
they're,
they
kind
of
act
like
in
computationally
in
the
same
way
as
a
DNA
sequence
or
a
protein
sequence,
and
you
can
actually
align
those
and
you
can
get
a
score,
and
so
that's
what
we
were
doing
here,
we're
generating
a
code
for
the
different
nodes
of
of
the
differentiation
tree,
which
is
a
resorting
of
the
lineage
tree.
We
were
comparing
level
by
level
different
trees
or
different
formulations
of
the
tree,
and
then
we
were
getting
a
score
for
the
matches
between
those
two
trees.
B
So
we
could
actually
get
like
a
sequence
and
it's
alignment.
So
that
was
the
way
we
approached
in
that
paper.
I,
don't
know
if
that's,
maybe
that's
I'll
go
ahead.
E
So
that
might
help
us
with
the
what
life
was
looking
differentiation,
trees
yeah.
Would
it
work,
but
I
don't
know
I
could
not
find
the
software
like?
Did
you
find
like
a
specific
software?
We.
B
Didn't
yeah,
we
didn't
write
software
for
it,
we
did
I
mean
that's
not
going
to
be
the
same
as
like
a
blast,
but
we
didn't
write
software
or
a
formal
package
for
it.
We
just
kind
of
did
it
with
some
code
and
some
sorting.
You
know
as
we
could
write
up
some
code
or
we
could
write
up
some
software
for
it
for
what
we're
doing
here
but
I'm
just
saying
we
don't
have
the
software,
it
doesn't
really
exist.
B
It's
just
kind
of
like
you
know
software
operations,
it's
not
really
something
you
can
release
to
people,
but
but
yeah
I
think
that's
I
mean
that
might
be
a
good
method
going
forward,
but
I
don't
know
if
that's
the
best
method,
so
we
might
okay,
yeah
yeah.
B
Yeah
no
problem:
okay,
now
I'd
like
to
say
a
few
words
about
methylation
and
cell
differentiation.
B
So
the
first
thing
I
want
to
talk
about
as
I
mentioned
these
CpG islands,
so
CpG islands
are
these
very
small
motifs
of
C
to
G.
So
it's
like
C
and
G.
B
It's
a
two
base
Motif-
and
this
is
a
very
short
motif,
and
so
you
can
have
these
kind
of
paired
together
in
different
parts
of
the
genome
right
and
then
those
are
kind
of
all
over
the
place,
but
they're
clustered
in
the
promoter
regions.
As
I
mentioned,
these
HOT
regions,
That
You
observe
where
they're
clustered
in
functionally
relevant
places
like
a
promoter
region
and
they
open
and
close
the
promoter
so
that
you
can,
the
transcriptional
Machinery
can
get
access
to
the
DNA
in
the
gene.
B
Or
patterns
here
or
our
richness
of
these
cpgs-
and
that
is
where
you
get
something
like
this,
which
is
what
we
called
a
satellite
or
a
microsatellite,
and
this
is
a
term
from
genetics
where
they
talk
about
satellites
and
the
reason
they
talk
about
it.
This
way
is
because
the
way
they've
discovered
it
was
by
running
an
electrophoretic
gradient,
where
you
have
bulk
DNA.
B
And
then
you
have
these
bands
or
satellites
of
CpG
content,
and
so
this
is,
you
know
this
is
an
electrophoretic
gel,
so
it's
running
in
this
direction
and
things
segregate
out
along
that
gradient
according
to
their
molecular
weight,
so
you
can
actually
pull
out
sometimes
pull
out
different
types
of
proteins,
different
types
of
DNA
sequence-
and
this
is
the
way
they
used
to
do
this
before
a
lot
of
the
modern
sequencing.
Technologies
came
about.
B
So
if
it's
open,
that
means
that
you
have
this
transcription
of
a
gene
or
you
have
a
transcription
of
some
allele
or
something
from
a
coding
region
of
a
gene,
and
it
makes
a
product
it
makes
an
RNA,
an
mRNA.
If
you
have,
if
it's
closed
of
course,
then
nothing
it's
closed
down.
If
it's
bistable,
you
can
have
something
making
other
like
different
types
of
products,
and
so
this
is
something
where,
if
we
have
a
gene,
for
example,
that's
involved.
D
B
Making
muscle
like
MyoD,
you
can
have
this
switch
on
that
Gene
and
the
promoter
and
it
can
turn
it
on
and
off,
or
can
turn
it
on
and
on
in
different
ways,
so
that
it's
making
different
products
and
it's
it's
making
more
of
a
certain
mRNA
than
it
would
otherwise.
B
So
this
bistability
allows
it
to
be
poised
to
turn
on
and
off
during
the
process
of
development,
and
so
this
is
what
we
mean
by
when
we
say,
bistable,
that
it
can
be
in
in
either
State,
and
the
thing
I
mentioned
in
the
meeting
is
that
in
mammals
you
have
Global
regulation,
so
in
mammals.
B
B
And
this
makes
sense,
because
you
want
to
have
these-
you
want
to
have
this
coordinated
across
the
genome,
because
you
have
these
different
intermediate
States
and
you
have
this
complex
signaling
that
happens
in
cells
and
they
sort
of
they.
They
have
a
great
plasticity
as
to
what
they
are.
B
You
have
Mosaic
regulation,
which
means
that
there's
it's
local
I,
guess
you
get
local
regulation.
It's
a
local
regulation
is
where
it's
on
a
gene
by
Gene
basis,
and
so
this
is.
This
is
important
for
this
type
of
development,
because
in
this
type
of
development
we
have
a
lot
of
cases
where
we
have
these
deterministic
lineage
trees.
B
So
we
have
these
lineage
trees
that
might
have
like
you
know
two
daughter
cells
and
those
you
might
go
from
this
level
of
development
and
all
these
cells
coming
down
like
this
will
maybe
contribute
to
the
that
the
epidermis
you
know
different
parts
of
the
different
tissues,
so
fate
is
restricted
by
sort
of
the
level
of
Developmental
cell.
If
I
were
to
take
out
this
developmental
cell,
for
example,
I
take
up
this
entire
part
of
the
lineage
tree
and
I
would
basically
deprive
the
organism
of
maybe
one
half
of
its
body.
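The consequence of removing one developmental cell from a deterministic lineage tree can be sketched as pruning a subtree: every terminal cell below the removed cell is lost and nothing fills the gap. The tuple layout and function names are assumptions for illustration, and a four-cell tree stands in for the full lineage.

```python
def terminal_cells(tree):
    """List leaf names of a (name, left, right) or (name,) lineage tree."""
    if len(tree) == 1:
        return [tree[0]]
    return terminal_cells(tree[1]) + terminal_cells(tree[2])


def find(tree, name):
    """Return the subtree rooted at `name`, or None if it is absent."""
    if tree[0] == name:
        return tree
    if len(tree) == 3:
        return find(tree[1], name) or find(tree[2], name)
    return None


def ablate(tree, name):
    """Split terminal cells into (surviving, lost) when `name` is removed."""
    target = find(tree, name)
    lost = terminal_cells(target) if target else []
    surviving = [c for c in terminal_cells(tree) if c not in lost]
    return surviving, lost
```

Removing `P1` from the four-cell example leaves only the `AB` descendants, which is the "half a body" outcome described above.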
B
So
you'd
end
up
with
an
adult
looking
like
this,
instead
of
a
worm
that
we're
used
to
This
is
in
C. elegans.
So
this
is
so.
This
is
deterministic,
and
cells
can't
fill
in
the
Gap.
So
you
can't
produce
more
cells
here.
You
can't
proliferate
more
cells
to
make
the
back
end,
whereas
in
a
human
or
a
mouse
there's
a
mammalian
system,
you
could
do
that.
So
this
is
the
difference,
and
so
the
methylation
marks
are
just
ways
to
regulate
that
process.
B
To
keep
these
cells
deterministic
instead
of
you
know,
I,
don't
know
what
you
would
call
it.
Maybe
regulative
or
you
can
regulate
cells
to
a
new
fate
as
you
need
to
so.
This
is
all
kind
of
the
background
for
this
and
help
you
learn
something
all
right.
Finally,
I'd
like
to
talk
about
the
differentiation
code,
as
we
talked
about
in
the
meeting.
So
briefly
in
our
2016
paper,
we
defined
the
differentiation
code
as
the
outcome
of
a
reorganized
lineage
tree.
B
So
what
we
did
was
we
took
a
lineage
tree,
so
a
lineage
tree
has
you
know
we
have
mother,
so
we
have
the
daughter
cells.
We
have
these
binary
divisions,
I'm
just
going
to
do
a
four-cell
tree
to
give
you
the
idea.
Now
these
are
divided.
Usually
this
is
a
anterior
posterior,
basic
anterior,
posterior
orientation.
So
in
C
elegans
you
have
the
anterior
cell
and
the
posterior
cell,
and
then
in
this
four
cell
example,
you
have
two
anterior
cells,
two
posterior
cells,
and
they're
organized.
B
And
by
these,
like
one
of
these
posterior
cells
is
going
to
go
on
to
form
the
germline
and
another.
Posterior
cell
is
going
to
go
on
to
form
specialized
cells
in
the
intestines
and
the
muscle
and
some
other
things,
and
then
the
the
anterior
cells
are
going
to
form
most
of
the
epidermis,
while
some
muscle
a
lot
of
like
cuticle
and
things
like
that
and
neural
cells,
of
course.
So
this
is
how
it's
structured.
What
we
do
in
a
this
is
a
lineage
tree.
B
This
is
an
expansion
from
here
you're
either,
like
you
know,
the
the
shape
and
the
in
the
form
of
the
thing
is
either
an
expansion
or
a
contraction.
So
the
expansion
is
usually
on
this
side.
The
contraction
is
usually
on
the
side
in
this,
in
the
regular
or
in
the
Mosaic
embryo.
We
had
to
take
some
Liberty
with
that
to
say
that
the
larger
cells
are
on
one
side,
the
smaller
cells
are
available.
B
Well,
the
consequence
of
this
is
you
end
up
with
a
different
topology.
It
shifted
from
the
lineage
tree
so
that
you
don't
really
care
about
the
anterior,
posterior
orientation,
you
care
about
this
size
orientation
and
so
we're
just
using
single
cell
size
as
a
method
for
that,
at
least
before
we
get
tissues,
so
we
don't
have
tissues
at
this
stage
or
just
cells
and
so
tracking,
the
sort
of
the
developmental
Cells
versus
the
terminally
differentiated
cells.
So
this
is
the
way
we
did
this
in
2016.
now.
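The reordering step, where anterior-posterior order is dropped in favor of size order, might look something like the sketch below. The dictionary layout, the function names, and especially the size values are made up for illustration; they are not measurements from the 2016 paper.

```python
def reorder_by_size(tree):
    """Return a copy in which each node's daughters are ordered larger-first.

    Nodes are dicts: {"name": str, "size": float, "children": [left, right]}.
    """
    kids = [reorder_by_size(k) for k in tree.get("children", [])]
    kids.sort(key=lambda k: k["size"], reverse=True)
    return {"name": tree["name"], "size": tree["size"], "children": kids}


def names_at_depth(tree, depth):
    """Cell names at a given depth, read left to right."""
    if depth == 0:
        return [tree["name"]]
    out = []
    for kid in tree.get("children", []):
        out += names_at_depth(kid, depth - 1)
    return out


# Hypothetical relative sizes, chosen only to show a swap at the four-cell level.
lineage = {"name": "P0", "size": 1.0, "children": [
    {"name": "AB", "size": 0.6, "children": [
        {"name": "ABa", "size": 0.28}, {"name": "ABp", "size": 0.32}]},
    {"name": "P1", "size": 0.4, "children": [
        {"name": "EMS", "size": 0.25}, {"name": "P2", "size": 0.15}]},
]}
```

With these made-up sizes, the lineage order `ABa, ABp, EMS, P2` becomes `ABp, ABa, EMS, P2` in the reordered tree, so the topology shifts even though the cells are the same.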
B
So
this
is
the
differentiation
tree.
This
is
the
lineage
tree,
okay,
all
right!
So
then
what
we
can
do
is
then
we
can
take
those
for
trees
and
see
how
distant
is
each
level.
So
we
can
actually
look
at
this
level
here
too,
so
we
can
say
zero
one
and
zero
one.
Let's
say
that
there
was
no
change,
so
the
anterior
cell
was
actually
larger
than
the
posterior
cell.
B
Distance
is
zero
here,
which
is
great
because
we
did
you
know
it's
interesting,
because
we
there
is
no
difference
between
them.
In
this
case,
however,
there's
a
big
difference
and
that
big
difference
is
where
you
have
it's
basically
I
think
everything
has
changed
here.
So
there's
a
distance
of
four.
So
that
means
it's
maximally
distant
here.
I.
Don't
think
this
was
the
empirical
result,
but
I
can't
remember
I'm.
A
B
It
doesn't
have
to
be
the
lineage
tree
and
the
differentiation
tree;
it
could
be
two
different
differentiation
trees.
It
could
even
be
two
lineage
trees,
although
the
lineage
trees
don't
vary
in
this
way,
so
it's
really
useful
for
either
comparing
it
with
a
lineage
tree
or
comparing
different,
like
maybe
samples,
different
individuals,
different
species,
and
so
then
we
actually
have
this
code
that
we
can
compare
and
align,
and
this
this
code
increases.
So
it's
a
binary
state.
So,
as
you
get
to
the
eight
cell,
the
16
cells
32,
so
the
number
of
binary
digits
increases.
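The level-by-level comparison described here amounts to a Hamming distance between two equal-length binary strings. The sketch below assumes each tree has already been written out as one string of binary states per level; that representation and the function names are assumptions for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def level_distances(tree_a, tree_b):
    """Hamming distance per level; each tree is a list of per-level strings."""
    return [hamming(a, b) for a, b in zip(tree_a, tree_b)]
```

For the example in the talk, `level_distances(["01", "0101"], ["01", "1010"])` gives a distance of 0 at the two-cell level and 4 (the maximum) at the four-cell level.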
B
But
now
we
have
this
problem
where
these
are
just
the
cells
in
their
states.
So
what
we're
looking
at
in
the
paper
in
2016
was
characterizing
the
cell
size
and
it's
it's
reorganization
of
the
lineage
tree.
So
it's
order
from
left
to
right
all
right,
which
is
fine,
except
that
now
we
don't.
We
only
have
the
information
for
cell
size,
that's
our
sole
criterion
and
we
also
have
the
information
for
lineage,
but
that
sort
of
implied
in
the
structure.
B
But
what
we
need
in
this
case
and
we're
looking
at
genomes
and
we're
looking
at
protein
sequences
and
so
forth,
as
we
need
a
way
to
map
those
changes
onto
this
tree
structure
and
when
we
realign
them,
you
know
having
those
States
like
also
realigned,
so
the
realignment
isn't
the
problem.
It's
characterizing,
the
state
differences
where
the
things
that
define
each
cell-
and
so
this
is
where
we're
kind
of
at
a
sort
of
I
think
an
impasse
right
now
is
that
we
don't
know
how
to
make
that
mapping.
B
So
each
cell
has
like
this
I,
don't
know
this
content,
it's
like
an
n-tuple
of
what
so
traditionally,
we've
used
spatial
location,
XYZ
T.
This
is
our
5-tuple
for
usually
what
we,
how
we
model
these
trees.
So
we,
we
might
have
like
three
dimensions,
a
spatial
position,
one
dimension
with
temporal
information,
and
then
this
variable
that
measures-
maybe
some
other
Factor.
It
could
be
some
summary
of
molecular
data.
It
could
be
something
else,
but
that's
maybe
not
enough.
B
Maybe
we
need
to
have
multiple
entries
in
here
where
we
have
like
a
huge
list
of
attributes
that
are
at
the
molecular
level
and
to
be
able
to
summarize
those
into
this
parameter.
But
you
know,
maybe
we
need
just
to
pick
one
parameter
and
build
a
tree
each
tree
having
one
parameter
and
then
get
a
distance,
so
you're
getting
a
Hamming
distance
that
would
be
suitable,
I
guess
I'm,
trying
to
figure
out
how
to
model
this
here
in
my
head.
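One possible way to model this, offered only as a sketch: keep the 5-tuple per cell (three spatial coordinates, time, and one summary value), collapse several molecular attributes into that one value, threshold it into a binary state, and compare levels by Hamming distance. The class, the use of a mean as the summary, and the threshold are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Cell:
    name: str
    x: float
    y: float
    z: float
    t: float
    value: float  # fifth entry: one summary of molecular data (hypothetical)


def summarize(attributes):
    """Collapse several molecular measurements into one number (here, the mean)."""
    return mean(attributes)


def level_code(cells, threshold):
    """Binary string for one tree level: 1 if a cell's value meets the threshold."""
    return "".join("1" if c.value >= threshold else "0" for c in cells)


def distance(code_a, code_b):
    """Hamming distance between two equal-length level codes."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have equal length")
    return sum(a != b for a, b in zip(code_a, code_b))
```

Building one tree per parameter keeps each comparison simple; whether a single summary value loses too much molecular detail is the open question raised above.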
A
B
This
would
be
representative
of
some
molecular
attribute.
We
could
even
reorder
it
instead
of
left
to
right
or
largest
to
smallest.
B
It
would
be
like
presence
or
absence
of
a
certain
protein,
so
it
would
be
like
you
know
protein
that
we
don't
know
what
the
name
of
it
is.
Is
it
there
or
not
or
what's
the
state
or
whatever?
Now
this
complicates
things,
because
we
don't
have
single
cell
data,
as
we
mentioned
in
the
meeting,
so
this
might
be
a
problem,
but
I
hope
it's
not,
but
we
can
we
can
organize.
We
can
arrange
this
in
different
ways
and
get
a
result
that
maybe
is
informative
to
people.
So
do
we
have
anything
else?
A
B
Well,
it
just
means
I,
guess
that
there
would
be
like
fewer
hits
with
a
higher
stringency.
I
would
imagine,
but.
F
B
If
something
is
more,
maybe
more
common
across
like
if
it
can
identify
things
that
are
more
common
across
different
samples
in
the
database,
which
you
know
is
not
like
everything
in
biology,
but
it's
what's
in
their
database
yeah.
This
will
be
a
curve
that
rises
no
plateaus
yeah
it
should.
It
should
vary
based
on
like
how
common
that
is
in
in
this
yeah.
B
The
handle
yeah
and
then,
of
course
not
all
matches,
are
going
to
be
relevant
like
sometimes
you'll
get
matches
that
are
like
something's,
totally
different
has
a
different
function
and
you
don't
think
it's
like
relevant
to
what
you
were
getting,
because
you
know
you
you
think
about
like
sequences
or
like
combinations,
so
protein
sequences.
This
doesn't
happen
as
much
but
they're
kind
of
like
combinations
of
in
DNA
sequences.
There
are
combinations
of
four
characters,
so
you
get
like.
If
you
get
a
small
sequence,
you
can
get
a
lot
of
noise.
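The point about short sequences producing noise can be made numerically. Assuming a random sequence with equally likely characters (a simplification, since real genomes are not uniform), the expected number of exact chance matches for a query of length k is roughly the number of positions divided by 4^k for DNA, or 20^k for protein.

```python
def expected_random_matches(k, target_len, alphabet=4):
    """Expected exact matches of a length-k query in a random target sequence."""
    positions = max(target_len - k + 1, 0)
    return positions / alphabet ** k
```

Against a target of roughly 100 million bases (about the size of the C. elegans genome), an 8-base query is expected to match over a thousand times by chance, while a 20-base query is expected to match far less than once.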
F
E
Don't
know
if
I
can
well
but
yeah,
let's
see
what's
going
on
yeah,
if
if
there
was
a
software
for
person,
that
would
have
been
interesting,
I
mean
yeah,
but
this
the
thing
the
problem
with
the
matching
in
general,
is
even
when
you
run
like
protein-to-protein
BLAST.
Sometimes
it
shows
you.
Duplications
of
results
like
they're
the
same
thing
right
but
like
I
put
it
as
a
non-redundant.
I
want
non-redundant
results,
but
it
still
shows
me
the
same
thing
with
the
same
percentage
identity.
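If the duplicated rows come from several hits against the same subject, one workaround is to post-filter the results. The sketch below assumes BLAST+ tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject, and column 12 the bit score) and keeps the best-scoring row per query and subject pair. It would not merge identical sequences stored under different accessions; that would need a different key.

```python
def best_unique_hits(lines):
    """Keep one row per (query, subject) pair: the one with the highest bit score."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        key = (fields[0], fields[1])
        bits = float(fields[11])
        if key not in best or bits > best[key][1]:
            best[key] = (fields, bits)
    return [row for row, _ in best.values()]
```

Reading the hits file with `open("hits.tsv")` and passing the file object to `best_unique_hits` would give one row per pair, in the order each pair first appeared.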
E
F
Okay,
Lukas,
one
suggestion:
we're
writing
a
paper
here,
and
your
list
of
things
that
you
put
in
the
paper
needs
to
be
turned
into
proper
English.
E
B
All
right,
that's
that's
great
all
right!
Well!
Thank
you
for
attending
see
you
next
week,
okay,
great
good
session
yeah
thanks,
bye,
all.
B
Now
I'd
like
to
go
over
a
few
papers
that
have
come
out
have
to
do
with
some
of
the
things
we
talked
about
today.
So
actually,
this
paper
has
to
do
with
some
of
the
things
we've
talked
about
in
the
past
few
weeks
on
human
embryoids
and
sort.
B
Human
embryos
outside
of
the
normal
process
of
human
development,
so
there've
been
a
host
of
papers
in
this
area
and
it's
really
kind
of
in
a
breakthrough
recent
times
so
this,
but
this
paper
actually
focuses
on
some
of
the
things
going
on
in
that
stage
of
where
the
Inner
Cell
mass
is
forming
from
a
blastocyst.
So
you
see
the
blastocyst
here,
it's
kind
of
moving
out.
You
have
the
Inner
Cell
Mass,
the
trophectoderm,
and
they're
arguing
here
that
they're
able
to
find
using
live
Imaging
nuclear
DNA
shedding
during
blastocyst
expansion
and
biopsy.
B
So
this
is
the
blastocyst
up
at
the
top.
It's
it's
starting
to
form
this
trophectoderm
and
inner
cell
mass
and
then
that's
where
that's
the
the
sort
of
the
stage
of
development
that
we're
in,
and
so
this
is
the
what
I'm
pointing
to
here
is
the
graphical
abstract.
So
they're
actually
live
Imaging.
This
they're
putting
it
in
you
know,
they're,
live
doing.
This
live
Imaging
technique,
they're,
getting
images
of
this
they're
able
to
see
different
things
that
are
going
on
here.
B
So
the
contents
of
a
cell
split
apart
and
move
towards
the
poles,
as
you
get
cell
division,
because
you're
going
to
eventually
get
two
cells
and
they're
going
to
pull
apart
so
that
you
have
to
have
you
know
basically
a
copy
of
the
DNA
and
the
contents
of
the
cytoplasm
or
the
inside
of
the
cell,
and
so
that
segregation
process
is
happening
here.
So
you're
observing
errors
of
that.
You
also
get
this
nuclear
DNA
shedding
during
expansion.
So
there's
a
cell
structure
expands
as
you
see
here.
B
You
get
this
DNA
shedding
that
comes
nuclear
DNA,
shedding
that
comes
in
the
in
the
I
guess
in
the
nucleus
of
the
cell.
So
the
nuclear
DNA
kind
of
comes
off
and
sheds,
and
then
they
do
this
biopsy
and
you
have
more
shedding
here.
So
the
highlights
of
the
paper
they're
using
a
fluorescent
dye
assay,
which
means
they
introduce
this
dye
to
stain
the
things
that
they're
interested
in.
So
you
can
see
them
under
a
microscope.
Clearly,
fluorescent
dyes
enable
live
Imaging,
a
human
embryos
without
genetic
manipulation,
so
they're
able
to
actually
use
a
Dye.
B
They
have
things
if
you're
familiar
with
voltage
dyes
in
neurons,
so
they'll
sometimes
use
voltage
dyes
to
reveal
electrical
activity
instead
of
using
transgenes
or
instead
of
using
like
other
types
of
assays
like
recordings
or
electrodes.
So
these
dyes
are
actually
quite
flexible.
These
are
actually
introduced
into
the
sample
and
they're
able
to
pick
up
some
of
these
things.
The
alternative
would
be
using
a
GFP
or
YFP
transgene,
and
so
that
that
has
its
own
challenges
in
these
live
samples
live
Imaging
reveals
differences
between
human
and
mouse
embryo
morphogenesis.
B
So
there
are
differences
between
human
and
mouse
morphogenesis
that
they're
able
to
do
I
guess
they
also
sample
Mouse
cells
that
have
the
similar
mode
of
development,
but
you're
able
to
observe
the
differences
there,
and
we
talked
about
that
in
terms
of
genomics
today,
but
they're.
Also
in
cell
biology.
You
have
these
systems
that
you
can
compare.
They
have
different
processes
going
on,
but
in
like
human
and
mouse,
the
processes
are
similar
enough.
So
you
can
get
a
sense
of
the
underlying
sort
of
process.
B
Blastocyst
expansion
causes
trophectoderm
nuclear
budding
and
DNA
shedding.
So
you
get
this
nuclear
budding
here
in
this
image
and
then
you
get
the
shedding
that
comes
from
this
budding
so
and
then
mechanical
stress
from
blastocyst
expansion
or
biopsy
triggers
nuclear
DNA
loss.
So
basically,
what
they're
arguing
is
that
in
this
process,
you're
getting
nuclear
DNA
loss
you're
getting
these
mitosis
and
segregation
errors,
and
this
is
something
that
you.
B
Implications
for
you
know,
genetic
anomalies
and
development.
Perhaps
so
that's
what
they're
interested
in
this.
So
this
paper,
you
know,
there's
a
lot
of
technical
detail
here:
I'm
not
going
to
go
into
just
to
show
that
this
paper
exists.
This
is
from
Cell,
and
this
is
a
recent
paper
2023.
B
The
other
paper
I
want
to
talk
about.
Is
this
it's
from
the
bioRxiv.
That's
probably
I,
don't
know
what
conference
it's
going
to
be
at
it's
probably
going
to
be
at
a
conference.
This
is
called
synergizing,
geometric,
deep
learning
and
data
Centric
methods
for
improved
protein
structural
alignment.
B
So
we
were
talking
about
protein
structure
and
protein
sequence
alignment.
We've
talked
in
the
past
about
some
of
the
tools
that
these
for
protein
folding
and,
of
course,
AlphaFold,
which
is
machine
learning
technique
for
looking
at
protein
folding.
This
is
protein
structural
alignment
and,
in
this
case,
they're
using
geometric,
deep
learning
for
this.
B
So
the
abstract
reads:
structures
are
replacing
the
role
of
sequences,
so,
as
we
saw
in
the
meeting,
we
have
these
sequences
that
have
a
certain
amount
of
information,
they're
good
for
conveying
what
was
transcribed
and
translated
from
the
DNA.
So
the
DNA
structure
tells
us
what's
in
the
genome,
but
then
that
gets
transcribed
certain
parts
of
that
get
transcribed.
B
Generally.
The
sequences
that
of
interest
that
get
put
in
the
in
like
something
like
NCBI
or
some
other
centralized
database
are
things
that
are
biologically
interesting.
So.
D
B
Usually
things
that
get
transcribed,
and
so
we
get
this.
We
have
this
different
different
alleles
or
different
isoforms
of
of
a
gene
and
what's
being
expressed
by
the
gene.
B
But
then
we
also
have
translation,
which
means
that
it's
being
turned
into
a
protein
sequence
or
an
amino
acid
sequence,
and
so
you
know
we're
interested
in
the
work
that
Lukas
was
doing
on
the
amino
acid
sequence,
but
there's
also
the
structure,
and
then
we
have
the
structure
of
RNA
as
well,
where
there
are
folds
and
there
are
turns,
and
there
are
other
types
of
topological
features
that
are
functional.
They
have
functional
significance,
so
a
small
RNA
molecule
might
be
a
straight
line
and
it's
a
sequence,
but
in
larger
RNA
molecules
you
have
secondary
sequence.
B
There
is
secondary
structure
that
actually
the
sequence
in
in
alignment
with
its
structure
is
the
information
of
that
molecule.
So
this
is
what
they
mean
by
this
sentence.
Traditional
bioinformatics
research
focuses
on
sequences
because
they
were
reasonably
obtained,
and
so
this
is
again
the
sequence.
Is
you
can
get
them
from
studies?
There
are
ways
you
can
do
Mass
sequencing,
so
it's
it's
cheap
to
do
sequences.
B
It's
not
so
cheap
to
do
structural
analysis,
and
so
this
is
why
you
can
get
sequences
more
readily
than
structures.
Advances
in
techniques
like
cryo-electron
microscopy,
molecular
modeling,
docking,
algorithms
and
structure
prediction.
Software
have
shifted
the
focus
to
structures.
So
there's
there's
microscopy
that
that
happens,
that
you
can
get
the
part
you
can
actually
do
x-ray
crystallography
as
well.
B
But
then,
of
course,
once
you
have
the
structure
you
need
to
model
it,
you
need
to
understand
how
it
works
functionally,
and
so
that's
where
a
lot
of
this
stuff
comes
in
molecular
modeling,
which
is
where
you
have
a
three-dimensional
model
of
the
structure
docking
algorithms,
which
are
you
know
when
you
have
a
a
cleft
in
the
protein.
You
know,
there
are
different
things,
like
electrons,
that
dock
in
there
and
those
are
biologically
important
and
so
that's
important
to
know
how
those
work
and
then
structure
prediction
software
again.
B
This
is
protein
folding,
what's
the
conformational
state
of
that
shape,
and
is
it
biologically
viable?
So
all
those
things
are
necessary
to
know,
and
you
can
get
these
structures
actually
on
NCBI
or
some
other
resource,
and
you
can
model
them
in
software,
so
there
are
different
software
packages
for
protein
modeling.
There's
actually
also
something
called
Nanome,
which
is
a
VR-based
protein
modeling
platform,
where
you
can
actually
pull
protein
structures
up
in
front
of
you
and
play
with
them,
and
you
know
do
all
these
things
that
you
can
do
in
traditional
programs.
B
It's
really
interesting
stuff.
So
this
is
something
that
we're
moving
towards
now,
given
the
importance
of
deep
learning
in
many
of
these
breakthroughs,
it
makes
sense
to
also
explore
how
it
can
modernize
classic
bioinformatics
tools.
So
this
is
again
you
know
we
want
to
know
if
we
can
apply
deep
learning
or
graph
neural
networks,
even,
or
geometric
deep
learning,
I
guess,
in
this
paper,
to
some
of
these
older
techniques
and
they're
older
only
in
the
sense
of
relatively
old,
because
these
are
like
maybe
10
years
old,
B
a
lot
of
these
deep
learning
methods.
These
other
methods
like
molecular
modeling,
are
maybe
40
years
old
at
most
and
then
x-ray
crystallography
or
some
of
these
other
techniques
you
use
to
get
the
protein
structure,
maybe
it's
70
to
80
years
old
or
maybe
a
little
bit
older
than
that.
But
the
point
is
that
it's
not
a
really
old
science,
but
it's
moving
forward.
B
So,
however,
empirical
findings
have
shown
that
machine
learning
based
methods
have
many
pitfalls,
resulting
in
overoptimistic
conclusions,
including
data
leakage
between
test
and
training
data.
So
again,
in
in
our
typical
deep
learning
model,
we
have
test
and
training
data.
We
test
our
model
on
what
we've
trained
on,
and
our
model
is
only
as
good
as
the
training
data,
and
so
this
is
a
problem
that
you
know,
they're
trying
to
kind
of
get
around.
This
is
a
caveat,
especially
with
biological
data.
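One way to picture the data leakage problem mentioned above: if a test protein is nearly identical to something in the training set, the test score overstates how well the model generalizes. The sketch below flags such near-duplicates. It is only an illustration under stated assumptions: `difflib.SequenceMatcher` from the Python standard library stands in for a real sequence-identity measure, the 0.8 threshold is arbitrary, and the sequences are made up. The paper's own procedure is not described in this discussion.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for sequence identity."""
    return SequenceMatcher(None, a, b).ratio()


def find_leaks(train, test, threshold=0.8):
    """Return (train_seq, test_seq) pairs that look too similar to be
    treated as independent examples."""
    leaks = []
    for test_seq in test:
        for train_seq in train:
            if similarity(train_seq, test_seq) >= threshold:
                leaks.append((train_seq, test_seq))
    return leaks


# Hypothetical toy sequences: the first test entry differs from a
# training entry by a single residue.
train = ["MKTAYIAKQR", "GSHMLEDPVD"]
test = ["MKTAYIAKQK", "WWWWPPPPCC"]

leaks = find_leaks(train, test)
for train_seq, test_seq in leaks:
    print("possible leak:", train_seq, "~", test_seq)
```

In practice this kind of check is usually done by clustering sequences at an identity cutoff and keeping whole clusters on one side of the split; the pairwise loop here is just the simplest version of the idea.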
B
So,
in
this
paper
we
have
developed
Van
Gogh,
a
geometric
deep
learning-based
structural
alignment
approach
that
performs
on
par
with
the
state
of
the
art
without
ever
having
been
trained
on
a
pair
of
naturally
found
homologs,
so
we
talked
about
homologs
where
these
are
analogous
genes
or
analogous
proteins
in
different
species
or
they're
analogous
in
terms
of
being
duplicates.
So
these
are
your
homologs
are
where
they
have
a
similar
function,
they're
just
divergent
B
in
some
way.
We
adopted
a
data-centric
approach
to
address
deep
learning
and
data
limitations
by
augmenting
protein
templates
into
synthetic
homologs
for
training.
Our
method
allows
us
to
supplement
homolog
data
by
knowledge-driven
augmentation,
self-learning
of
relevant
structural
features
by
supervised
examples,
and
protein
alignment
that
is
competitive
with
state-of-the-art
methods.
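To make "augmenting protein templates into synthetic homologs" concrete, here is a small Python sketch. It is a simplified stand-in, not the paper's method: the paper works with structures and knowledge-driven augmentation, whereas this toy version only applies random point substitutions to an amino acid string. The function names, the substitution rate, and the template sequence are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def make_synthetic_homolog(template, sub_rate=0.1, rng=None):
    """Copy the template, replacing each residue with a different one
    with probability sub_rate (toy model of divergence)."""
    rng = rng or random.Random()
    out = []
    for residue in template:
        if rng.random() < sub_rate:
            choices = [aa for aa in AMINO_ACIDS if aa != residue]
            out.append(rng.choice(choices))
        else:
            out.append(residue)
    return "".join(out)


def augment(templates, n_per_template=3, sub_rate=0.1, seed=0):
    """Build (template, synthetic_homolog) training pairs."""
    rng = random.Random(seed)
    pairs = []
    for template in templates:
        for _ in range(n_per_template):
            pairs.append(
                (template, make_synthetic_homolog(template, sub_rate, rng))
            )
    return pairs


# Hypothetical template; each pair comes with a known correspondence
# between positions, which is what makes it usable as a supervised example.
for template, homolog in augment(["MKTAYIAKQR"], n_per_template=3, sub_rate=0.3):
    print(template, "->", homolog)
```

The useful property is that the correct alignment of each synthetic pair is known by construction, so a model can be trained on alignments without ever seeing a pair of naturally occurring homologs.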
B
So,
let's
break
this
down
a
bit,
so
they
have
homolog
data
which
is
kind
of
the
standard
in
proteomics,
where
you
have,
you
know,
comparisons
between,
say,
species,
and
it
gives
you
well
sort
of
what
we
saw
in
the
BLAST
searches
where
you
have
two
different
sequences
and
you're
trying
to
infer
the
relatedness
of
the
two
sequences
or
the
similarity,
and
so
it
doesn't
tell
you
a
lot
about
function.
It
doesn't
tell
you
a
lot
about
like
this
sort
of
evolutionary
homology
B
necessarily;
it's
just
that
similarity,
and
so
we
need
to
have
more
information
here.
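As a reminder of what a sequence-only comparison actually computes, here is a minimal Python sketch of a global alignment score (Needleman-Wunsch dynamic programming). This is a stand-in for illustration: BLAST itself uses a faster heuristic local alignment, and the scoring values and the function name `nw_score` are assumptions chosen for this note.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Keeps only one row of the dynamic programming table at a time.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Toy sequences: the score reflects letter-level similarity only.
print(nw_score("ACGT", "ACGT"))
print(nw_score("ACGT", "ACGA"))
```

The score says how similar two strings are; it carries no information about function, structure, or evolutionary history, which is why the extra sources of information discussed next are needed.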
So
knowledge-driven
augmentation
is
something
where
we
know
something
about
the
proteins
and
their
function
in
their
context.
So
we
can
apply
that
as
data
augmentation.
We
also
know
a
lot
more
about
its
structure.
We
can
supplement
with
sequence
information
or
functional
information,
and
we
can
even,
you
know,
backfill
this
with
molecular
simulation
or
with
other
sources,
so
we
can
actually
augment
our
data
set
in
that
way.
There's
also
self-learning
of
relevant
structural
features
by
supervised
examples.
B
B
This
is
very
basic
deep
learning
stuff,
so
this
is
not
like
anything
new,
except
that
in
this
field
it
would
be
an
advance,
and
then
this
results
in
protein
alignment
that
is
competitive
with
state-of-the-art
methods,
so
along
the
way,
you're
able
to
align
proteins,
you
know,
in
comparison
to
other
methods
using
these
sources
of
information,
and
it
will
give
you
a
result.
Now
they
don't
talk
about
the
improvements
that
they've
made
necessarily
in
the
abstract
to
traditional
deep
learning
methods.
B
So
there
are
some
caveats
that
they're
getting
around
I
guess;
they're
bootstrapping
this
with
training
and
with
data
augmentation
to
get
a
good
result.
So
this
is
the
neural
network
framework
that
they
mentioned.
This
is
github.com/DeepRank/deeprank-core/tree/main/deeprankcore
and
that's
the
place
where
you
can
find
the
code
for
this.
So
those
are
two
new
papers
that
just
came
out.
Thanks.