Scientific Discovery as Link Prediction in Influence and Citation Graphs
Fan Luo, Marco Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu
Text Graphs Workshop, June 6, 2018
Background
[Figure: interacting domains underlying child development: Psychology, Biology, Economy, Environment]
Source: Nelson, et al. Effects of poverty on interacting biological systems underlying child development. The Lancet Child & Adolescent Health. https://doi.org/10.1016/S2352-4642(17)30024-X
Publications indexed by PubMed each year since 1995
If humans cannot keep up, machines must help!
• We implemented a machine reading system focused on influence statements in children’s health literature
Previous Work: Influence Search
1. Large-scale automated reading with Reach discovers new cancer driving mechanisms.
2. Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph.
Use Case
[Figure: influence graph for the children's health use case. Node labels include Campylobacter, diarrhea, infection, poverty, food insecurity, HIV, EBF, breastfeeding, stunting, malnutrition, under-nutrition, childhood obesity, adult obesity, CVD, pneumonia, birth asphyxia, rapid urbanisation, economic growth, government subsidy, improved water access, and cheap processed food. © Bill & Melinda Gates Foundation]
Constructed in 2 days (human + machine); normally, it takes 1 month (human alone).
Model courtesy of: Lyn Powell, HBGDki-qPM team
Motivation
Past vs. Future
• This system can only search past, published facts
• No information about what comes next in science…
Definition
• White spaces in science
  + Topics that are insufficiently studied, but
  + May lead to important scientific discoveries
Our Contributions
1. White space discovery = link prediction over the influence graph
• Predict whether an influence link will be added to the graph
[Figure: dietary fish oil reduces blood viscosity; blood viscosity promotes Raynaud’s disease; candidate link: dietary fish oil reduces Raynaud’s disease]
• Binary classification task:
  • positive, if the influence relation will be added to the influence graph in the future;
  • negative, otherwise
Swanson, D.R. Undiscovered public knowledge. The Library Quarterly, 56 (2), 1986.
Our Contributions
2. Features from multiple graphs!
• Citation graph (to understand community overlap)
• Influence graph (to understand influence connectivity)
Dataset
Complication: No "Back to the Future"
Dataset
Constructed through backtesting
[Figure: “fish oil”, “blood viscosity”, and “Raynaud’s disease”, with the candidate link marked “?” and dated >= t]
• Positive: t <= r3.year <= present
• Negative: r3 does not exist up to the present
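The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes r3 is the candidate relation being predicted, that its first publication year is known (or `None` if it has never been published), and that relations already published before t are excluded from the candidate set. The function name `backtest_label` and the value used for `PRESENT` are hypothetical; t = 2012 is the cutoff stated in the slides.

```python
from typing import Optional

def backtest_label(r3_year: Optional[int], t: int, present: int) -> Optional[int]:
    """Label a candidate relation r3 against the cutoff year t.

    r3_year is the year r3 first appears in the graph, or None if it has
    not appeared up to `present`.
    Returns 1 (positive), 0 (negative), or None when r3 already existed
    before t (assumption: such links are not prediction candidates).
    """
    if r3_year is None or r3_year > present:
        return 0          # r3 does not exist up to the present
    if r3_year >= t:
        return 1          # t <= r3.year <= present
    return None           # already known before the cutoff

CUTOFF = 2012    # t, as stated in the slides
PRESENT = 2018   # assumption for illustration only

labels = {
    "appears_2014": backtest_label(2014, CUTOFF, PRESENT),
    "never_appears": backtest_label(None, CUTOFF, PRESENT),
    "known_in_2005": backtest_label(2005, CUTOFF, PRESENT),
}
```

Features for each candidate would then be computed only from the graphs as they stood before t, so that no information from the future leaks into training.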
Dataset
Note: Transitivity Generally Not True!
[Figure: Hurricane, Rainfall, Crop yield]
Missing information impacts non-linear models!
Dataset
t = 2012
Features
• Extracted from two graphs
  • Influence graph (influence relations between concepts)
    • 1,564,748 distinct nodes
    • Connected by 2,395,944 influence relations
  • Citation graph (citations between papers)
    • 119K papers
    • 5,523,759 citation links
Feature groups
Feature Group | Intuition | From
Connectivity features | The more connected concepts are, the easier it is to discover a relation between them | influence graph
Community-based features | The larger the intersection of communities containing the two influence statements, the easier it is to make the connection | citation graph
Information retrieval features | The more distinct a concept or an influence statement is, the harder it is to make a discovery around it | papers containing influence statements
Connectivity Features
[Figure: concepts A, B, and C, annotated with out degree, connecting paths, and in degree]
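As a rough illustration of the three quantities named in the figure, the sketch below computes an out degree, an in degree, and a count of connecting paths on a toy influence graph. It is not the authors' implementation. Which node each degree is measured on, the path-length limit `max_edges`, and the two "platelet aggregation" edges are assumptions made for the example.

```python
from collections import defaultdict

def build_graph(edges):
    """Return (successors, predecessors) adjacency maps for directed edges."""
    succ, pred = defaultdict(set), defaultdict(set)
    for source, target in edges:
        succ[source].add(target)
        pred[target].add(source)
    return succ, pred

def count_paths(succ, source, target, max_edges=3):
    """Count simple directed paths from source to target using at most max_edges edges."""
    count = 0
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        if len(path) - 1 >= max_edges:
            continue  # this partial path has already used all allowed edges
        for nxt in succ.get(node, ()):
            if nxt == target:
                count += 1
            elif nxt not in path:
                stack.append((nxt, path + (nxt,)))
    return count

def connectivity_features(edges, a, c, max_edges=3):
    """Assumed feature set: out degree of A, in degree of C, paths from A to C."""
    succ, pred = build_graph(edges)
    return {
        "out_degree_a": len(succ.get(a, ())),
        "in_degree_c": len(pred.get(c, ())),
        "connecting_paths": count_paths(succ, a, c, max_edges),
    }

toy_edges = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),           # hypothetical edge
    ("platelet aggregation", "Raynaud's disease"),  # hypothetical edge
]
features = connectivity_features(toy_edges, "fish oil", "Raynaud's disease")
```

On a graph with millions of relations, exhaustive path enumeration would need a tight length limit or a sampling strategy; the depth-first search here is only meant to make the feature definition concrete.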
Community-based Features
The communities were detected using the Coda algorithm (Yang et al., 2014)
[Figure labels: A, B]
Community-based
F
eatur
es
A
B
“Bridging
”
int
er-disciplinary
paper
s
c
i
t
e
B
C
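One way to picture the community-overlap intuition is sketched below. It assumes community detection has already been run on the citation graph and produced a paper-to-communities map (overlapping membership allowed); the Coda algorithm itself is not implemented here. The function names, the toy membership, and the choice of a shared-community count plus a Jaccard ratio are all illustrative assumptions, not the features used by the authors.

```python
def communities_of(papers, membership):
    """Union of the communities that any of the given papers belongs to."""
    found = set()
    for paper in papers:
        found |= membership.get(paper, set())
    return found

def community_overlap(papers_ab, papers_bc, membership):
    """Compare the communities around two influence statements.

    papers_ab: papers containing the first statement (e.g. A -> B)
    papers_bc: papers containing the second statement (e.g. B -> C)
    membership: paper id -> set of community ids (precomputed elsewhere)
    """
    c_ab = communities_of(papers_ab, membership)
    c_bc = communities_of(papers_bc, membership)
    shared, union = c_ab & c_bc, c_ab | c_bc
    return {
        "shared_communities": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Toy, hypothetical membership: p2 sits in two communities and "bridges" them.
membership = {"p1": {0}, "p2": {0, 1}, "p3": {1}, "p4": {2}}
overlap = community_overlap(["p1", "p2"], ["p3", "p4"], membership)
```

In this toy case the statements share one community out of three, and that shared community is reachable only because of the bridging paper p2.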
Information Retrieval Features
• Inverse document frequency (IDF) score of lemmas in concept A
• IDF score of lemmas in concept B
• IDF score of lemmas in concept C
• Number of papers that mention A → B
• Number of papers that mention B → C
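The features listed above can be approximated with a small sketch. It makes several assumptions that the slides do not state: IDF is smoothed as log((1 + N) / (1 + df)), a concept's score is the mean IDF of its lemmas, and "papers that mention A → B" is approximated by papers containing all lemmas of both concepts. The toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each lemma, the number of documents that contain it."""
    df = Counter()
    for lemmas in docs:
        df.update(set(lemmas))
    return df

def idf(lemma, df, n_docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    return math.log((1 + n_docs) / (1 + df[lemma]))

def concept_idf(concept_lemmas, df, n_docs):
    """Assumption: a concept's score is the mean IDF of its lemmas."""
    return sum(idf(l, df, n_docs) for l in concept_lemmas) / len(concept_lemmas)

def co_mentions(concept_a, concept_b, docs):
    """Rough proxy: documents containing every lemma of both concepts."""
    needed = set(concept_a) | set(concept_b)
    return sum(1 for lemmas in docs if needed <= set(lemmas))

# Toy corpus of lemmatized papers (illustrative only).
docs = [
    ["fish", "oil", "reduce", "blood", "viscosity"],
    ["blood", "viscosity", "promote", "raynaud", "disease"],
    ["blood", "pressure"],
    ["fish", "diet"],
]
df = document_frequencies(docs)
n_docs = len(docs)

idf_a = concept_idf(["fish", "oil"], df, n_docs)
idf_b = concept_idf(["blood", "viscosity"], df, n_docs)
n_ab = co_mentions(["fish", "oil"], ["blood", "viscosity"], docs)
```

Here "fish oil" scores higher than "blood viscosity" because its lemmas occur in fewer documents, which matches the intuition that rarer concepts are more distinct.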
Evaluation
Evaluation Metrics
Unranked
• P = precision
• R = recall
• F1 = harmonic mean of P and R
Ranked
• P@10 = how many of the top 10 predicted links are correct
• MAP = mean average precision
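For reference, these metrics can be computed as in the sketch below. It uses the standard textbook definitions; how the predictions are grouped into ranked lists for MAP is not stated in the slides, so the `rankings` argument (one list of 0/1 relevance labels per ranked list) is an assumption, as are the toy inputs.

```python
def precision_recall_f1(gold, predicted):
    """Unranked metrics over parallel lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked predictions that are correct."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Mean of precision values taken at the rank of each correct prediction."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Average of average_precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Toy inputs (illustrative only).
p, r, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
ap = average_precision([1, 0, 1, 0])
p_at_2 = precision_at_k([1, 0, 1, 0], k=2)
```

The unranked metrics depend on a decision threshold, while P@10 and MAP only depend on the order of the classifier's scores, which is why the two families can give quite different pictures of the same model.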
Results
All Feature Groups Help
[Figure: F1 scores for feature ablation]
What Does the System Predict?
Conclusions
• Novel strategy for the identification of white spaces in scientific knowledge
• Operates over real-world graphs of influence relations and citations
• F1 score of 27 points, and a mean average precision of 68%
• Important to
  • Researchers: “What should I research next?”
  • Program officers: “What should I fund next?”
Thank you!
Resource Available:
• Data and code
  https://github.com/clulab/releases/tree/master/textgraphs2018-discovery
• Influence search engine
  http://influence.clulab.org/
Fan Luo, Marco Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu
{fanluo, marcov, hahnpowell, msurdeanu}@email.arizona.edu
Acknowledgements
• Marco Valenzuela-Escárcega, Gus Hahn-Powell, and Mihai Surdeanu declare a financial interest in lum.ai. This interest has been properly disclosed to the University of Arizona Institutional Review Committee and is managed in accordance with its conflict of interest policies.
This work was funded by the Bill and Melinda Gates Foundation HBGDki Initiative.