Scientific Discovery as Link Prediction in
Influence and Citation Graphs
Fan Luo, Marco Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu
Text Graphs Workshop, June 6, 2018
Background
Source: Nelson, et al. Effects of poverty on interacting biological systems underlying child development. The Lancet Child & Adolescent Health
https://doi.org/10.1016/S2352-4642(17)30024-X
[Figure labels: Psychology, Biology, Economy, Environment]
Publications indexed by PubMed each year since 1995
If humans cannot keep up, machines must help!
We implemented a machine reading system
focused on influence statements in children’s
health literature
Previous Work: Influence Search
1. Large-scale automated reading with Reach discovers new cancer driving mechanisms.
2. Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph.
Influence Search
Use Case
[Figure: influence model for the children's health use case. Node labels include Campylobacter, diarrhea, infection, stunting, breastfeeding (EBF), malnutrition, acute malnutrition, childhood obesity, poverty, food insecurity, economic growth, rapid urbanisation, government subsidy, nutrition education, pneumonia, HIV, and improved water access. © Bill & Melinda Gates Foundation]
Constructed in 2 days (human + machine);
Normally, it takes 1 month (human alone).
Model courtesy of: Lyn Powell, HBGDki-qPM team
Motivation
Past vs. Future
This system can only search past, published facts
No information about what comes next in science
Definition
White spaces in science
+ Topics that are insufficiently studied, but
+ May lead to important scientific discoveries
Our Contributions
1. White space discovery = link prediction over the influence
graph
Predict whether an influence link will be added to the graph
[Figure: dietary fish oil reduces blood viscosity; blood viscosity promotes Raynaud’s disease; hence dietary fish oil reduces Raynaud’s disease]
Binary classification task:
positive, if the influence relation will be added to the
influence graph in the future;
negative, otherwise
Swanson, D.R. Undiscovered public knowledge. The Library Quarterly, 56 (2), 1986.
Our Contributions
2. Features from multiple graphs!
Citation graph (to understand
community overlap)
Influence graph (to understand
influence connectivity)
Dataset
Complication: No "Back to the Future"
Dataset
Constructed through backtesting
[Figure: the concepts “fish oil”, “blood viscosity”, and “Raynaud’s disease”, with the candidate link r3 marked “? >= t”]
Positive: t <= r3.year <= present
Negative: r3 does not exist up to the present
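The backtesting labels above can be sketched in a few lines. This is only an illustration under stated assumptions: each influence relation is stored with the year it was first published, the two supporting links (A to B and B to C) are assumed to be known before the cutoff year t, and the function name `label_candidates` and the years in the toy data are hypothetical, not taken from the released dataset or code.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def label_candidates(first_year: Dict[Edge, int], t: int) -> List[Tuple[str, str, str, int]]:
    """Build (A, B, C, label) examples from links A->B and B->C known before year t.

    The label is 1 if the candidate link A->C first appears in year >= t
    (up to the end of the data, i.e. "the present"), and 0 if it never appears.
    """
    past = {edge for edge, year in first_year.items() if year < t}
    examples = []
    for (a, b) in sorted(past):
        for (b2, c) in sorted(past):
            # Keep only chains A->B->C whose closing link A->C is not yet known.
            if b2 != b or c == a or (a, c) in past:
                continue
            label = 1 if (a, c) in first_year else 0
            examples.append((a, b, c, label))
    return examples

# Toy data: the years are made up for illustration only.
toy_first_year = {
    ("fish oil", "blood viscosity"): 1980,
    ("blood viscosity", "Raynaud's disease"): 1982,
    ("fish oil", "Raynaud's disease"): 1986,
    ("hurricane", "rainfall"): 1990,
    ("rainfall", "crop yield"): 1991,
}
toy_examples = label_candidates(toy_first_year, t=1985)
early_examples = label_candidates(toy_first_year, t=1995)
```

With t = 1985 the fish oil chain becomes a positive example because its closing link shows up later in the toy data. With t = 1995 that link is already in the past, so only the hurricane chain remains, labeled negative.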
Dataset
Note: Transitivity Generally Not True!
[Figure: Hurricane → Rainfall → Crop yield]
Missing information impacts non-linear models!
t = 2012
Dataset
Features
Extracted from two graphs
Influence graph (influence relations between concepts)
1,564,748 distinct nodes
Connected by 2,395,944 influence relations
Citation graph (citations between papers)
119K papers
5,523,759 citation links
Feature groups
Feature Group | Intuition | From
Connectivity features | The more connected concepts are, the easier it is to discover a relation between them | influence graph
Community-based features | The larger the intersection of communities containing the two influence statements, the easier it is to make the connection | citation graph
Information retrieval features | The more distinct a concept or an influence statement is, the harder it is to make a discovery around it | papers containing influence statements
Connectivity Features
[Figure: concepts A, B, and C, annotated with out-degree, in-degree, and connecting paths]
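As a rough illustration of the connectivity features named on this slide, the sketch below computes an out-degree, an in-degree, and a count of connecting paths on a small directed graph. The exact feature definitions used in the system are not given on the slide, so the choices here (out-degree of A, in-degree of C, and two-hop paths A to X to C) and the names `build_adjacency` and `connectivity_features` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_adjacency(edges: Iterable[Tuple[str, str]]):
    """Return successor and predecessor sets for a directed influence graph."""
    succ: Dict[str, Set[str]] = defaultdict(set)
    pred: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
    return succ, pred

def connectivity_features(edges: Iterable[Tuple[str, str]], a: str, c: str) -> Dict[str, int]:
    """Compute simple connectivity features for a candidate link a -> c."""
    succ, pred = build_adjacency(edges)
    out_a = succ.get(a, set())
    in_c = pred.get(c, set())
    return {
        "out_degree_a": len(out_a),
        "in_degree_c": len(in_c),
        # Intermediate nodes X with a -> X and X -> c (paths of length two).
        "two_hop_paths": len(out_a & in_c),
    }

# Toy graph with made-up node names.
toy_edges = [("A", "B"), ("A", "D"), ("B", "C"), ("D", "C"), ("E", "C")]
toy_features = connectivity_features(toy_edges, "A", "C")
```

In the toy graph, A has two outgoing links, C has three incoming links, and two intermediate concepts (B and D) connect them.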
Community-based Features
The communities were detected using the CoDA algorithm (Yang et al., 2014)
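[Figure: papers containing the influence statement AB]
Community-based Features
[Figure: “bridging” inter-disciplinary papers cite papers containing AB and papers containing BC]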
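The community intuition can be illustrated with a small overlap computation. The sketch assumes that community detection has already been run on the citation graph and has assigned each paper a set of (possibly overlapping) community ids. The function name `community_overlap`, the use of a Jaccard ratio, and the toy ids are hypothetical choices, not the system's actual feature set.

```python
from typing import Dict, Iterable, Set

def community_overlap(
    paper_communities: Dict[str, Set[int]],
    papers_ab: Iterable[str],
    papers_bc: Iterable[str],
) -> Dict[str, float]:
    """Compare the communities of papers stating A->B with those stating B->C."""
    comm_ab: Set[int] = set()
    for paper in papers_ab:
        comm_ab |= paper_communities.get(paper, set())
    comm_bc: Set[int] = set()
    for paper in papers_bc:
        comm_bc |= paper_communities.get(paper, set())
    shared = comm_ab & comm_bc
    union = comm_ab | comm_bc
    return {
        "shared_communities": float(len(shared)),
        # Jaccard ratio: shared communities relative to all communities involved.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Toy community assignment with made-up paper ids.
toy_communities = {"p1": {0, 1}, "p2": {1}, "p3": {1, 2}, "p4": {3}}
toy_overlap = community_overlap(toy_communities, ["p1", "p2"], ["p3", "p4"])
```

Here the two groups of papers share one community out of four, which gives a Jaccard ratio of 0.25. A larger overlap would suggest that the same research community is already close to both statements.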
Information Retrieval Features
Inverse document frequency (IDF) score of lemmas in concept A
IDF score of lemmas in concept B
IDF score of lemmas in concept C
Number of papers that mention A B
Number of papers that mention B C
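For readers unfamiliar with IDF, the following sketch computes it over a toy collection of lemmatized papers. It assumes the common definition idf(lemma) = log(N / df(lemma)) and averages the scores of a concept's lemmas. Both the exact IDF variant and the averaging step are assumptions, as are the names `idf_table` and `concept_idf`.

```python
import math
from collections import Counter
from typing import Dict, List

def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Compute idf(lemma) = log(N / df) over a collection of lemmatized documents."""
    n_docs = len(docs)
    # Count each lemma once per document (document frequency).
    df = Counter(lemma for doc in docs for lemma in set(doc))
    return {lemma: math.log(n_docs / count) for lemma, count in df.items()}

def concept_idf(concept_lemmas: List[str], idf: Dict[str, float], default: float) -> float:
    """Average the IDF of a concept's lemmas; unseen lemmas get the default score."""
    if not concept_lemmas:
        return 0.0
    return sum(idf.get(lemma, default) for lemma in concept_lemmas) / len(concept_lemmas)

# Toy corpus of four lemmatized "papers" (made up for illustration).
toy_docs = [
    ["fish", "oil", "reduce", "blood", "viscosity"],
    ["blood", "pressure"],
    ["fish", "diet"],
    ["blood", "viscosity", "disease"],
]
toy_idf = idf_table(toy_docs)
# Treat an unseen lemma as if it occurred in a single document.
unseen_default = math.log(len(toy_docs))
fish_oil_score = concept_idf(["fish", "oil"], toy_idf, unseen_default)
```

Rare lemmas such as "oil" receive a higher score than frequent ones such as "blood", which matches the intuition that more distinctive concepts are harder to make discoveries around.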
Evaluation
Evaluation Metrics
Unranked:
F1 = harmonic mean of precision (P) and recall (R)
Ranked:
P@10 = fraction of the top 10 predicted links that are correct
MAP = mean average precision
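The three metrics can be written out as a short sketch. It assumes binary gold labels (1 = the link was later added), that P@k divides by k, and that MAP averages the average precision over several ranked lists. How the ranked lists are grouped in the actual evaluation is not stated on the slide, so that grouping is an assumption, and the function names are illustrative.

```python
from typing import List

def f1_score(gold: List[int], pred: List[int]) -> float:
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def precision_at_k(ranked_gold: List[int], k: int = 10) -> float:
    """Fraction of the top-k ranked predictions that are correct."""
    return sum(ranked_gold[:k]) / k

def average_precision(ranked_gold: List[int]) -> float:
    """Mean of the precision values at each rank where a correct link appears."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_gold, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(ranked_lists: List[List[int]]) -> float:
    """Average of average_precision over several ranked lists."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Toy predictions: gold labels listed in the order the system ranked the links.
toy_ranked = [1, 0, 1, 0]
toy_f1 = f1_score(gold=[1, 1, 0, 0], pred=[1, 0, 1, 0])
toy_p_at_2 = precision_at_k(toy_ranked, k=2)
toy_ap = average_precision(toy_ranked)
```

F1 only looks at the final yes/no decisions, while P@10 and MAP reward placing correct links near the top of the ranking.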
Results
All Feature Groups Help
F1 scores for feature ablation
What Does the System Predict?
Conclusions
Novel strategy for the identification of white spaces in
scientific knowledge
Operates over real-world graphs of influence relations and
citations
F1 score of 27 points, and a mean average precision of 68%
Important to
Researchers: “What should I research next?”
Program officers: “What should I fund next?”
Thank you!
Resources available:
Data and code
https://github.com/clulab/releases/tree/master/textgraphs2018-discovery
Influence search engine
http://influence.clulab.org/
Fan Luo, Marco Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu
{fanluo, marcov, hahnpowell, msurdeanu}@email.arizona.edu
Acknowledgements
Marco Valenzuela-Escárcega, Gus Hahn-Powell, and Mihai
Surdeanu declare a financial interest in lum.ai. This interest has
been properly disclosed to the University of Arizona
Institutional Review Committee and is managed in accordance
with its conflict of interest policies. This work was funded by
the Bill and Melinda Gates Foundation HBGDki Initiative.