Description
Date: 2/2/18
Presenter: Jacob Bor
Institution: Boston University
Northeast Big Data Hub
A
His research applies the tools of economics, epidemiology, and data science to the study of population health, with a focus on HIV treatment and prevention in southern Africa. His current research interests include economic spillover effects of HIV treatment on patients, households, and communities; decision-making in an HIV-endemic risk environment; population health impacts of social policy; and causal inference in public health and epidemiologic research. His work has been published in journals such as Science, Epidemiology, The Lancet, and PLOS Medicine. Prior to his graduate training at Harvard School of Public Health, Dr. ...
A
B
A
B
As you can see, I actually come later than probably everyone else here to data science. My background is in epidemiology and economics, but I am based at a school of public health and work with a group of collaborators who have been working on questions around HIV treatment and prevention in southern Africa for quite some time now, and just thinking about the U.S. ...
B
So here are the big-picture questions. There are now very large chronic disease epidemics worldwide. Chronic diseases are the leading cause of morbidity and mortality in middle-income countries and present a major health systems challenge. In a lot of developing countries, health systems were really developed to provide acute care for infections and emergencies, and mom-and-kids care for
B
Vaccinations and births. And so we're now asking health systems to do something different in managing chronic disease, because it requires a longitudinal relationship with patients over a long period of time. There's obviously a key role there for data, both to manage care at the individual level and also to inform policy. So, in lower-resource settings,
B
Chronic disease management is essentially iterative: collection of biomarkers (in HIV, a CD4 count or viral load; in diabetes, an A1c or blood glucose, etc.) and then a clinical action on the basis of that biomarker. It's quite algorithmic, and the advantage of that is you can have lower-skilled cadres of health workers, with less training, provide evidence-based healthcare at scale and to large populations quite quickly. So there's an opportunity there to optimize those rules, and to do that one needs data and databases. So HIV is a manageable chronic disease.
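As a rough illustration of what "quite algorithmic" can mean here, the following is a minimal sketch of a biomarker-driven rule: a lab value comes in and a fixed rule maps it to the next clinical action. The thresholds, field names, and action strings are all assumptions made up for this example; they are not clinical guidance and not the rules used by any actual program.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoffs, for illustration only (not clinical guidance).
CD4_ELIGIBILITY_THRESHOLD = 500   # cells per microliter
VIRAL_LOAD_ELEVATED = 1000        # copies per milliliter


@dataclass
class LabResult:
    """Latest biomarkers for one patient (hypothetical structure)."""
    cd4_count: Optional[int] = None
    viral_load: Optional[int] = None
    on_treatment: bool = False


def next_action(result: LabResult) -> str:
    """Map the latest biomarkers to a next step using fixed rules."""
    if not result.on_treatment:
        # Before treatment, the CD4 count drives the decision.
        if result.cd4_count is None:
            return "order CD4 count"
        if result.cd4_count < CD4_ELIGIBILITY_THRESHOLD:
            return "start treatment workup"
        return "repeat CD4 count at next visit"
    # On treatment, the viral load drives the decision.
    if result.viral_load is None:
        return "order viral load"
    if result.viral_load >= VIRAL_LOAD_ELEVATED:
        return "counsel on adherence and repeat viral load"
    return "continue treatment"


if __name__ == "__main__":
    print(next_action(LabResult(cd4_count=320)))
    print(next_action(LabResult(viral_load=40, on_treatment=True)))
```

Because the rule is just a function of recorded lab values, changing a threshold is a one-line change, which is the sense in which such rules could be tuned once the underlying data are available.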
B
Now, there are 37 million people globally who have HIV, and one in five of those lives in South Africa. South Africa scaled up antiretroviral therapy (ART) for HIV starting in 2004. ART allows someone with HIV to have a near-normal life expectancy. It's really been a miracle drug, or set of drugs, and in recent years the evidence has accumulated that ARVs don't just improve someone's health but also essentially eliminate onward transmission of the virus.
B
That will be hard, and there are lots of reasons why each of those steps along this so-called treatment cascade is difficult. But I'll focus more on the data side, which is that currently there's no data set that provides a system-wide, longitudinal perspective on this HIV cascade of care: from the point of diagnosis and linkage to a facility, to initiation of treatment, and on to viral suppression, including following people as they move around the health system to different facilities in this highly mobile setting.
B
So the National Health Laboratory Service (NHLS) is the sole provider for South Africa's national HIV program. They have done about 40 million CD4 counts and viral loads since 2004, and essentially any HIV monitoring is done by NHLS. There's one province that joined in 2010, but other than that, we essentially have data for the entire public-sector HIV program.
B
The data are high-quality in the sense that they are the exact data used in patient care. In fact, what we have access to are the data before they've even been sent back to the clinic. A lot of times, data are extracted from charts, where they have to have been received from the lab, acted upon, and written down.
B
During all those steps, there's opportunity for error. So we have high-quality data in that it precedes those steps, it's system-wide, and it's continuously updated. The major limitation is that, up to this point, there's been no unique patient identifier. So our question was: can we build a national HIV cohort from these data?
B
We use the Fellegi-Sunter methods to score those edges. The traditional approach to determining whether something's a match or not is to identify a lower threshold, a higher threshold, and then an area in the middle which would be held back for manual review. Well, our problem is big enough that it was essentially not practical to do manual review here. So what we did instead was, we said: can we use information about the network structure, or graph structure, of these potential patients
B
To say: is this possibly an individual? What this does is prevent this sort of blowing up of large clusters, which can be identified as not possible patients. So here's a case where I happen to be falsely linked with Jason Bourne, and we know that we're not the same person. So we have ongoing work: we're working with some faculty in computer science at BU to develop some new graph-based linkage techniques.
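To make the linkage discussion above concrete, here is a minimal sketch of the two ideas: Fellegi-Sunter style scoring of record pairs with a lower and an upper threshold, followed by a simple graph check on the resulting clusters. The field names, the m- and u-probabilities, the thresholds, and the density rule are all assumptions chosen for illustration; the talk does not specify them, and the density check is a stand-in, not the graph-based technique the team is developing.

```python
import math
from collections import defaultdict

# Hypothetical per-field parameters (m, u):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "first_name": (0.90, 0.05),
    "surname": (0.92, 0.02),
    "birth_date": (0.95, 0.01),
}


def match_weight(a, b):
    """Fellegi-Sunter score: sum of log2 likelihood ratios over fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Traditional three-way decision with assumed thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"


def connected_components(edges):
    """Group record ids into clusters linked by accepted edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def edge_density(component, edges):
    """Share of possible pairs in a cluster that are actually linked."""
    n = len(component)
    if n < 2:
        return 1.0
    inside = sum(1 for a, b in edges if a in component and b in component)
    return inside / (n * (n - 1) / 2)


def plausible_patient(component, edges, min_density=0.6, max_size=50):
    """Assumed rule: one person should form a small, tightly linked cluster."""
    return len(component) <= max_size and edge_density(component, edges) >= min_density
```

A long chain of records, where each is linked only to its neighbor, has low density and would be flagged here, which is one simple way a false link between two different people can show up in the graph.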
B
B
On the one hand, we do see that patients are presenting for care earlier in infection than ever before. First CD4 count at presentation to care is a measure of patients' health, and so the health of patients when they first present is improving over time, although it is still at about 325 in 2017.
B
It's not that high, and there are many people who still present quite late, at lower CD4 counts. The recent policy is essentially to extend eligibility for treatment to people who present with CD4 counts above 500. What you can see from this histogram is that only maybe a third of patients are actually affected by that policy, because most patients are still presenting quite late.
B
This plot shows the relationship between first CD4 count and whether a patient proceeded to have a treatment lab workup. Before being initiated on treatment, there's a set of labs that are done to make sure that there are no adverse reactions. What you see here is that in the prior regime, before the implementation of this eligibility expansion, there is a significant discontinuity at 500, which is the prior eligibility threshold.
B
So if you presented below 500, about 40 percent went on to a lab workup, and if you presented above 500, about 15 percent had a lab workup. So extending eligibility to these people above 500 is going to have a substantial impact on their progression to initiation of therapy. So that's the good news, of course.
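The comparison just described can be sketched as a difference in workup rates in a window on either side of the threshold. This is a deliberately naive version under stated assumptions: the `(first_cd4, had_workup)` data layout, the function names, and the bandwidth are hypothetical, and a real regression discontinuity analysis would typically fit local regressions on each side rather than compare raw means.

```python
def workup_rate(patients, low, high):
    """Share of patients with first CD4 in [low, high) who had a lab workup.

    `patients` is a list of (first_cd4, had_workup) pairs; this layout is
    an assumption made for the example.
    """
    group = [had_workup for cd4, had_workup in patients if low <= cd4 < high]
    return sum(group) / len(group) if group else float("nan")


def discontinuity(patients, threshold=500, bandwidth=50):
    """Naive jump estimate: rate just below the threshold minus rate just above."""
    below = workup_rate(patients, threshold - bandwidth, threshold)
    above = workup_rate(patients, threshold, threshold + bandwidth)
    return below - above
```

With the rough figures quoted in the talk (about 40 percent below 500 and about 15 percent above), the jump would be about 25 percentage points.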
B
The bad news is that lots of patients are not starting, even though they're eligible. This highlights a major gap and a place where much more work needs to be done to figure out why patients are not starting and, again, to identify facilities that are doing better than those that are doing worse. One of the benefits of what we're able to do here is that we can
B
We can follow patients as they proceed through the ART program. The red line here shows the proportion who are retained at their initiating clinic, where they started treatment, and this falls off quite precipitously. It is a pretty bad-news story, frankly, that has colored a lot of the discussion around retention in care. But actually, if you look within the whole national treatment program and allow people to move across facilities, the picture is much better, and we're able to show this with these data.
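The contrast between clinic-level and system-wide retention can be sketched as follows. This is a simplified illustration with assumptions throughout: the `(patient_id, clinic_id, lab_date)` record layout is hypothetical, the first lab record stands in for treatment initiation, and "retained" is defined crudely as having any lab record at least `horizon_days` after that first record.

```python
from collections import defaultdict
from datetime import date


def retention(records, horizon_days=365):
    """Return (retained at initiating clinic, retained anywhere) as shares.

    `records` is a list of (patient_id, clinic_id, lab_date) tuples.
    """
    by_patient = defaultdict(list)
    for patient_id, clinic_id, lab_date in records:
        by_patient[patient_id].append((lab_date, clinic_id))
    if not by_patient:
        return 0.0, 0.0

    same_clinic = anywhere = 0
    for visits in by_patient.values():
        visits.sort()  # chronological order
        first_date, first_clinic = visits[0]
        later = [clinic for when, clinic in visits
                 if (when - first_date).days >= horizon_days]
        if later:
            anywhere += 1                 # still in care somewhere
            if first_clinic in later:
                same_clinic += 1          # still at the initiating clinic
    total = len(by_patient)
    return same_clinic / total, anywhere / total


if __name__ == "__main__":
    example = [
        (1, "A", date(2015, 1, 10)), (1, "A", date(2016, 2, 1)),  # stays at A
        (2, "A", date(2015, 3, 1)), (2, "B", date(2016, 6, 1)),   # moves to B
        (3, "A", date(2015, 5, 1)),                               # no follow-up
    ]
    print(retention(example))
```

A patient who moves from clinic A to clinic B counts as lost in the clinic-level figure but as retained in the system-wide figure, which is why the two curves can diverge so sharply.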
C
B
B
There's the methods development aspect: thinking about how to incorporate graph-based methods in record linkage, and thinking about statistical uncertainty when you have clustered data. There's potential additional patient linkage with other individual-level data sets and with existing clinical cohorts, and potential to integrate this cohort into systems of clinical care. We're already sending reports to clinics of patients who have elevated viral loads for them to act on, and that's a sort of intervention that we're looking to evaluate. There are clinical epidemiology
B
questions. It can be used for continuous quality improvement and for policy evaluations, to evaluate new policies that are rolled out. And finally, because the underlying denominator is essentially the complete population of South Africa, or at least the 90 percent that use the public sector, there's an opportunity to link this to external data sets and think about this cohort as actually the exposure: to look at health systems interventions and their effect on other outcomes at the population level, and actually think about population health impacts of health policy.
B
Finally, there are questions about ethics and security in all of the linkage to additional databases, and we're working with our South African colleagues to embed all of this in their systems there. So that is sort of a snapshot of this agenda. We've benefited from some funding from NIH and are looking for more.
B
A
C
C
We cleanse them before anybody else sees them. We cleanse them before we do anything on it, so that we can verify that the data is good. Beyond that, we don't really share the datasets. We partner with the people who are really well-versed at managing all of the policies. We provide the infrastructure, and we advise on architecture for scale and for fast access.
B
You know, NHLS owns the data, and we see our role as providing a service to NHLS. Everything, including the results of the linkage, is turned over to them, and if one needs access to the data, it goes through NHLS. And I think, particularly working internationally, there are, you know, sovereignty issues that arise that we ...
B
Of all the things in our work, to be honest, it's something that we take very seriously, but it's not something that we push much on, because, you know, it's their data. So I think there are questions in terms of the future of this project, when we talk about how to link with other ...
B
C
I'll give you a compelling problem that we are unable to do because of data sharing and privacy restrictions. Maternal mortality is a rare outcome: roughly 28 to 30 women in the state of Texas die every year, so, going roughly by scale, it's a very small percentage. But it's almost 100% preventable. I can get a hold of birth data, but it doesn't identify mom, and I can get a hold of death data, but it's scrubbed so that rare events don't appear.
C
C
C
This third party does, and the third party won't give access to it, because you will be able to identify individuals, even if you go through ... So they'll allow you access on their computers to do your things, on sort of a server-based model, but they won't allow you to take the data out so that you can try to link and match it and then do these studies. We were told that we should probably start collecting them as we see them, going forward.
C
Right, and basically what they said to us was that there are HIPAA laws, and then there are the US government laws that govern people who hold our national statistics. They are considered the keeper of national statistics, and they don't let that data go. And I said, "Well, then, how are you ever going to identify any rare outcomes or causes?" "We're not."
D
C
Now, in other countries, I think it's the Netherlands or Switzerland, somewhere there, they have a national registry. I know New Zealand does, because I've talked to them as well. They have a national registry where everyone participates. It's culturally expected that you contribute back for the betterment of society as a whole. We just don't subscribe to that policy. I think we subscribe more to privacy first. Yeah.
D
A
I think, Jacob, you just touched on it: this concept that maybe the sale to an American audience, or to audiences with similar values, where the individual's needs come before the greater good, versus, like, a Confucian-influenced society, is that we need to just demonstrate that your self-interest is best served by the sharing of data, so that you can reap the benefits of these discoveries and live a healthier, longer life.