TOWARDS THE ADVANCEMENT OF OPEN-DOMAIN
TEXTUAL QUESTION ANSWERING METHODS
Fan Luo
Nov 17, 2022
Dissertation Committee:
Mihai Surdeanu, Lila Bozgeyikli, Joshua A Levine, Chicheng Zhang
Overview
2
3
Open-Domain Question Answering (ODQA)
4
QA categories
Open-domain QA vs. Closed-domain QA
Textual QA vs. Knowledge-based QA vs. Table-based QA
5
A history of open-domain textual QA
- Simmons et al. (1964) were the first to explore answering questions based on matching dependency parses of a question and answer
- Murax (Kupiec, 1993) aimed to answer questions over an online encyclopedia using IR and shallow linguistic processing
- The NIST TREC QA track, begun in 1999, first rigorously investigated answering fact questions over a large collection of documents
- IBM's Jeopardy! system (DeepQA, 2011) used an ensemble of many methods
- Many neural approaches after 2015… (more later)
https://www.cs.princeton.edu/courses/archive/spring20/cos598C/lectures/lec10-open-qa.pdf
6
7
Recent Advancements in NLP
Transformers
8
Open Research Questions
01 Explainability: explain with evidence
02 Annotation cost: less human labeling effort
03 Complex questions: retrieve and synthesize info from multiple resources
9
Challenges in Multi-Hop QA
'Multi-hop' refers to the multiple pieces of information needed to tackle a multi-hop question.
Secondary hops are lexically or semantically distant from the question.
Providing explainability for the answer prediction is more challenging in multi-hop QA.
Qi, Peng, et al. "Answering complex open-domain questions through iterative query generation." arXiv preprint arXiv:1910.07000 (2019).
10
Architecture of Modern Textual QA systems
(1) Question Processor: analyze and reformulate questions
(2) Retriever-Reranker: collect relevant context
(3) Reader: extract the answer
11
My works
12
A STEP towards Interpretable Multi-Hop Reasoning:
Bridge Phrase Identification and Query Expansion
Complex questions
Explainability
Annotation
13
Bridge Phrase
14
Name of the executive producer of Alien:
Ronald Shusett
The film that has a score composed by Jerry
Goldsmith: Alien
No explicit lexical overlap between the answer sentence and the question
Application: query expansion
Approach
Graph-based Unsupervised Modular
15
Noun Phrase Extraction
Quotation Extraction (Q)
Phrase Grounding (G)
Named Entity Recognition (E)
Noun Chunks Extraction (C)
16
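A minimal sketch of how these extraction modules could be combined, assuming spaCy (en_core_web_sm) for named entities and noun chunks; quotation extraction is reduced to a simple regex and phrase grounding against document titles is omitted.

```python
# Sketch of the phrase extraction stage (assumes spaCy with en_core_web_sm;
# quotation extraction is a simple regex, phrase grounding (G) is omitted).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_phrases(text):
    """Collect candidate phrases: quotations (Q), named entities (E), noun chunks (C)."""
    doc = nlp(text)
    phrases = set()
    phrases.update(m.strip() for m in re.findall(r'"([^"]+)"', text))  # (Q)
    phrases.update(ent.text for ent in doc.ents)                       # (E)
    phrases.update(chunk.text for chunk in doc.noun_chunks)            # (C)
    return phrases

print(extract_phrases("Three Men on a Horse is a play by a playwright born in which year?"))
```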
Noun Phrase Graph Construction
Edge types: SENT-SENT, TITLE-SENT, TITLE-TITLE, Coreference
17
[Figure: example noun-phrase graph with nodes such as 'George Francis Abbott', 'playwright' [P2], 'Three Men on a Horse', 'play' [P1], 'Tomb Raider' (2013 video game), 'Ronald Shusett', and 'Shusett']
Question Phrase Identification & Graph Pruning
Question: [Three Men on a Horse]G is a [play]C by a [playwright]C born in which year?
18
STEP (Steiner Tree Phrase identification)
Algorithm: an approximate solution for the Steiner problem in graphs (Takahashi et al., 1980)
Steiner Tree: minimum spanning tree of the sub-graph that contains all question phrases
Steiner Points, i.e., identified bridge phrases: 'George Francis Abbott' and 'George Abbott'
19
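A sketch of the Steiner-tree step on a toy phrase graph. It uses networkx's generic Steiner tree approximation as a stand-in for the Takahashi et al. heuristic, and the edges below are hand-written for illustration.

```python
# Sketch: bridge phrases as Steiner points of an approximate Steiner tree
# (networkx's approximation stands in for the Takahashi et al. heuristic).
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Hypothetical phrase graph: nodes are extracted phrases; edges come from
# SENT-SENT / TITLE-SENT / TITLE-TITLE co-occurrence and coreference links.
G = nx.Graph()
G.add_edge("Three Men on a Horse", "George Abbott", weight=1)    # TITLE-SENT
G.add_edge("George Abbott", "George Francis Abbott", weight=1)   # coreference
G.add_edge("George Francis Abbott", "playwright", weight=1)      # SENT-SENT
G.add_edge("Three Men on a Horse", "play", weight=1)             # SENT-SENT

question_phrases = ["Three Men on a Horse", "play", "playwright"]
tree = steiner_tree(G, question_phrases)

# Bridge phrases = Steiner points: tree nodes that are not question phrases.
bridge_phrases = [n for n in tree.nodes if n not in question_phrases]
print(bridge_phrases)  # ['George Abbott', 'George Francis Abbott']
```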
Query Expansion and Retrieval
Identified Bridge Phrases: ‘George Francis Abbott’ and ‘George Abbott’
Query Expansion: Three Men on a Horse is a play by a playwright born in which year, George Abbott, George Francis Abbott
Retriever: BM25 and MSMARCO cross-encoder
20
Question: Three Men on a Horse is a play by a playwright born in which year?
Supporting Document 1: Three Men on a Horse
Three Men on a Horse is a play by George Abbott and John Cecil Holm. . . .
Supporting Document 2: George Abbott
George Francis Abbott (June 25, 1887 - January 31, 1995) was an American theater producer and director,
playwright, screenwriter, and film director and producer whose career spanned nine decades. ...
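A sketch of the query expansion and two-stage retrieval step, assuming the rank_bm25 and sentence-transformers packages; the corpus is a toy stand-in and the MS MARCO checkpoint name is one publicly available example.

```python
# Sketch: expand the question with bridge phrases, retrieve with BM25, then
# rerank with an MS MARCO cross-encoder (toy corpus; checkpoint is an example).
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

question = "Three Men on a Horse is a play by a playwright born in which year?"
bridge_phrases = ["George Abbott", "George Francis Abbott"]
expanded_query = question + " " + ", ".join(bridge_phrases)

corpus = [
    "Three Men on a Horse is a play by George Abbott and John Cecil Holm.",
    "George Francis Abbott was an American theater producer, director, and playwright.",
    "Tomb Raider is a 2013 video game developed by Crystal Dynamics.",
]

# First stage: sparse retrieval over the expanded query.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(expanded_query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:2]

# Second stage: rerank the candidates with the cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(expanded_query, corpus[i]) for i in candidates])
reranked = [corpus[i] for _, i in sorted(zip(ce_scores, candidates), reverse=True)]
print(reranked[0])
```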
Experiments
Evidence Retrieval: retrieval performance with and without our query expansion strategy
Answer Prediction: accuracy of the predicted answer using the retrieved context with and without our query expansion strategy
Bridge Phrases: manual evaluation of the accuracy of identified bridge phrases
Post-hoc Explanation: manual evaluation of the quality of the explanations
21
Dataset: HotpotQA (Yang et al., 2018) Development Set
5,918 bridge-type questions
Example: Alice David is the voice of Lara Croft in a video game developed by which company?
1,487 comparison-type questions
Example: Which American singer and songwriter has a mezzo-soprano vocal range, Tim Armstrong or Tori Amos?
22
When STEP is coupled with a retriever:
BM25: traditional information retrieval model
MSMARCO cross-encoder: a transformer-based neural dense retrieval model
evidence retrieval performance (evaluated against annotated supporting facts) increases
Results: Evidence Retrieval
Reader: Longformer Fine-tuned with HotpotQA training data
Input: concatenating the question and context sentences
[CLS] [Q] Question [/Q] [SEP] [T] title1 [/T] sent11 [/S] sent12 [/S] . . . [SEP] [T] title2 [/T] sent21 [/S] sent22 [/S] . . .
Context sentences:
Random: a set of k sentences randomly selected
Question-only: top ranked sentences without query expansion (i.e., using the original question)
SF only: the gold supporting sentences
Oracle: top ranked sentences with query expansion, using oracle bridge phrases extracted directly from the ground-truth supporting sentences
STEP: top ranked sentences with query expansion, using identified bridge phrases
Baseline
Ceiling
Retrieved w/ Original question
Retrieved w/ Query expansion
Results: Answer Prediction
23
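A minimal sketch of how the reader input string above could be assembled from the question and the retrieved (title, sentences) pairs; registering the [Q]/[T]/[/S] markers as special tokens in the tokenizer is omitted.

```python
# Sketch: build the reader input with the question/title/sentence markers
# shown above ([CLS] and the final [SEP] are normally added by the tokenizer).
def build_reader_input(question, docs):
    """docs: list of (title, [sentences]) pairs for the retrieved context."""
    parts = [f"[Q] {question} [/Q]"]
    for title, sentences in docs:
        parts.append(f"[T] {title} [/T] " + " [/S] ".join(sentences) + " [/S]")
    return " [SEP] ".join(parts)

docs = [("Three Men on a Horse",
         ["Three Men on a Horse is a play by George Abbott and John Cecil Holm."])]
print(build_reader_input(
    "Three Men on a Horse is a play by a playwright born in which year?", docs))
```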
Manually evaluate the quality of identified bridge phrases
100 randomly selected questions; 2 human annotators
3 annotation labels: correct, partial, incorrect
Average accuracy: 76.3%; Kappa agreement: 46%
Results: Bridge Phrases
24
Nodes in the Steiner Tree:
Bell Labs, headquarters, Ravi Sethi, computer scientist, Avaya Labs Research, American research and scientific development company
Steiner Points: Bell Labs, Avaya Labs Research
Query expansion: In Murray Hill city are the headquarters of the American research and scientific development company where Ravi Sethi worked as computer scientist located, Bell Labs, Avaya Labs Research
25
Post-hoc Explanations
1, 2, and 4 are the gold supporting facts
26
50 randomly sampled questions; top 10 candidate evidences
Results: Post-hoc Explanations
Manual evaluation (out of 50 questions):
Quality of explanations: 44.5 (89%)
Accuracy of post-hoc bridge phrases: 48 (96%)
Takeaways
Bridge phrase identification
We introduce a graph-based strategy for identifying bridge phrases for multi-hop QA.
Query expansion
Identified bridge phrases can be used to expand the query, improving evidence retrieval and answer extraction.
Post-hoc explanation
Post-hoc explanations can be made available to interpret the provided answers.
27
Divide & Conquer for Entailment-aware Multi-hop
Evidence Retrieval
Complex questions
Explainability
Annotation
28
Evidence Retrieval and Rerank
https://blog.griddynamics.com/question-answering-system-using-bert/
29
Evidence Ranking Subtasks
30
In this work, we propose to capture textual entailment and semantic equivalence in parallel with separate models, which produce different and potentially conflicting rankings.
The goal is to combine them into an aggregated ranking that promotes gold evidence sentences to the top of the list.
Base Models
Three off-the-shelf base models to capture the diverse relevance signals.
Sparse model
BM25
Statistical model relying on
lexical overlap
Dense model
MSMARCO CE
transformers pre-trained for
semantic search
Dense model
QNLI CE
transformers pre-trained for
question-answer entailment
CE (Cross-Encoder): the standard BERT design that benefits from all-to-all attention across tokens in the input sequence.
MS MARCO: a large-scale corpus consisting of about 500k real search queries with the 1,000 most relevant passages (Bajaj et al., 2016)
QNLI: the Question Natural Language Inference dataset introduced by the GLUE benchmark (Wang et al., 2018)
31
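A sketch of scoring the same candidate sentences with the three base models, assuming rank_bm25 and sentence-transformers; the two cross-encoder names are publicly available checkpoints used as examples, not necessarily the exact ones in the experiments.

```python
# Sketch: three complementary relevance signals for the same candidate sentences.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

question = "Three Men on a Horse is a play by a playwright born in which year?"
sentences = [
    "Three Men on a Horse is a play by George Abbott and John Cecil Holm.",
    "George Francis Abbott was an American theater producer and playwright.",
    "Tomb Raider is a 2013 video game.",
]

bm25 = BM25Okapi([s.lower().split() for s in sentences])        # lexical overlap
sts_ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # semantic search
qnli_ce = CrossEncoder("cross-encoder/qnli-electra-base")       # QA entailment

pairs = [(question, s) for s in sentences]
scores = {
    "bm25": bm25.get_scores(question.lower().split()),
    "sts": sts_ce.predict(pairs),
    "qnli": qnli_ce.predict(pairs),
}
# Each base model induces its own (possibly conflicting) ranking.
for name, s in scores.items():
    print(name, sorted(range(len(sentences)), key=lambda i: -s[i]))
```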
For 14% of questions, at least one evidence sentence is ranked within the top 3 by BM25 but beyond the top 3 by both MSMARCO CE and QNLI CE;
for 35% of questions, at least one evidence sentence is ranked within the top 3 by QNLI CE but beyond the top 3 by MSMARCO CE.
Base Models Comparison
A check mark indicates that an evidence sentence is ranked within the top k by the base model, while a cross indicates that it is ranked beyond the top k.
The three base models independently capture diverse relevance signals and complement each other.
Percentage of questions for which at least one evidence sentence is (or is not) ranked within the top k by each base model.
32
Similarity Combination (SimCom)
SimCom calculates hybrid relevance scores through a linear combination of scores from the base models.
Semantic Textual Similarity (STS) and Inference Similarity (IS) are the scores from MSMARCO CE and QNLI CE.
Average Ranking (AR)
AR simply sums all the ranks for each sentence and re-ranks all the sentences according to the summation of ranks.
Ensemble Baselines
33
Techniques to combine the results from base models
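A sketch of the two ensemble baselines described above. The exact SimCom combination is not spelled out on the slide, so a weighted sum of the two cross-encoder scores with weights α and β is assumed here (α = 3, β = 1 per the grid search mentioned with the results).

```python
# Sketch of SimCom (assumed: weighted sum of STS and IS scores) and AR (rank sums).
import numpy as np

def simcom(sts_scores, is_scores, alpha=3.0, beta=1.0):
    """Hybrid relevance score as a linear combination of base-model scores."""
    return alpha * np.asarray(sts_scores) + beta * np.asarray(is_scores)

def average_ranking(*rank_lists):
    """AR: sum each sentence's ranks across base models, re-rank by the sums."""
    rank_sums = np.sum([np.asarray(r) for r in rank_lists], axis=0)
    return np.argsort(rank_sums)  # new ordering, best (smallest rank sum) first

# Example with three sentences: scores from the two cross-encoders and
# rank lists (0 = best) from the three base models.
print(simcom([0.9, 0.2, 0.5], [0.1, 0.8, 0.4]))
print(average_ranking([0, 2, 1], [2, 0, 1], [1, 0, 2]))
```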
Entailment-Aware Ranking (EAR)
Idea
34
Jointly consider pairs of top-ranked
candidate evidence sentences by
base models with respect to
semantic equivalence and textual
entailment, respectively.
Goal
Combine complementary relevance
signals captured by base models to
retrieve candidate evidences for
multi-hop questions.
BM25: {Sa1, Sa2, Sa3, Sa4, Sa5, Sa6, . . .}
MSMARCO CE: {Sa4, Sa3, Sa1, Sa6, Sa2,. . .}
QNLI CE: {Sb1, Sb2, Sb3, Sb4, Sb5, Sb6, . . . }
Top ranked by semantic equivalence A = {Sa1,Sa2,Sa3,Sa4}
Top ranked by textual entailment B = {Sb1, Sb2, Sb3}
Pairs we consider are the Cartesian product of the two sets:
Pairs = A × B = {(a, b) | a ∈ A, b ∈ B}
Score pairs against question with a ranker
(q, a || b )
The top-scored sentence pair (Sai, Sbj) forms a compositional relevant context covering both signals
Re-rank the rest against q || a || b
(q || a || b, Si)
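A sketch of the EAR pairing and rescoring steps; `reranker` is assumed to be a cross-encoder-style model with a `predict` method over (query, passage) pairs, and `||` is realized as simple string concatenation.

```python
# Sketch of EAR: score every (a, b) pair from A x B against the question,
# keep the best pair, then re-rank the remaining sentences against q || a || b.
from itertools import product

def ear_select(question, set_a, set_b, reranker):
    pairs = list(product(set_a, set_b))                   # A x B
    inputs = [(question, a + " " + b) for a, b in pairs]  # (q, a || b)
    scores = reranker.predict(inputs)
    best = max(range(len(pairs)), key=lambda i: scores[i])
    return pairs[best]                                    # (S_ai, S_bj)

def rerank_rest(question, a, b, remaining, reranker):
    enriched = " ".join([question, a, b])                 # q || a || b
    scores = reranker.predict([(enriched, s) for s in remaining])
    return [s for _, s in sorted(zip(scores, remaining), reverse=True)]
```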
EARnest
Evidences for a multi-hop question should be intuitively related, and are often logically connected via a shared named entity.
EAR: Entailment-Aware Retrieval
NEST: Named Entity Similarity Term
Sim(): the scoring function of the reranker
NEST is a binary switch: if the two sentences share one or more named entities, the promotion mechanism is activated, because they are more likely to be connected to form a coherent context.
When scoring sentence pairs:
35
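A sketch of the NEST switch when scoring a sentence pair. How the promotion enters the final score is not specified on the slide, so an additive bonus λ on top of Sim(q, a || b) is assumed here; spaCy is used for named entity recognition.

```python
# Sketch: binary NEST switch on shared named entities, applied to the pair score.
import spacy

nlp = spacy.load("en_core_web_sm")

def nest(sent_a, sent_b):
    """1 if the two sentences share at least one named entity, else 0."""
    ents_a = {e.text.lower() for e in nlp(sent_a).ents}
    ents_b = {e.text.lower() for e in nlp(sent_b).ents}
    return int(bool(ents_a & ents_b))

def earnest_pair_score(question, sent_a, sent_b, reranker, lam=1.0):
    sim = reranker.predict([(question, sent_a + " " + sent_b)])[0]  # Sim(q, a || b)
    return sim + lam * nest(sent_a, sent_b)  # promote entity-linked pairs (assumed form)
```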
Evidence Ranking Results
[Results table (MAP): values include 75% and +10%; annotations: 'highest among the base models', 'ignoring other relevance signals', 'better than individual base models', 'ignoring interactions between the relevance signals', 'best model']
P@n and R@n fail to take into account the positions of the relevant sentences among the top n; MAP (mean average precision) is a more informative metric to examine.
SimCom uses α = 3 and β = 1, according to the grid-search results on 10% of the full dataset.
36
Learning Strategies for Question Answering with Fewer
Annotations
Annotation
38
A deep neural network to extract the answer to a question from the given context.
Challenges: suffers from "data hunger" and low robustness issues.
Answer extraction with a deep reader model
39
QA Dataset Annotation
Costly: intensive manual labor; tedious and time-consuming
Noisy: low agreement
40
Objectives
Less Annotation More Robust
41
Active Learning
42
Output a relatively good model
Make far fewer annotation requests
Query the most informative instances
DeepAL, the combination of Deep Learning (DL) and Active Learning (AL), leverages the complementary advantages of the two methods to achieve better results.
• DL achieves state-of-the-art results in QA tasks, but is limited by the high cost of labeling;
• AL maximizes the value of labeling a small set of examples.
DeepAL for the QA task
43
Deep Reader with BERT
🤖Language Models 🤖
Most modern textual QA systems have a deep reader model that performs reading comprehension (RC) to extract an answer from the given documents.
A question-answering head is applied on top of the BERT model to produce, for each token in the documents, the probability of being the answer start or end token.
44
question ="How many parameters does BERT-large have?"
Context ="BERT-large is really big... it has 24-layers and an embedding size of 1,024,
for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple
minutes to download to your Colab instance."
Answer: "340 ##m"
https://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/
45
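A sketch of the extraction step above using the HuggingFace question-answering pipeline with a publicly available SQuAD-fine-tuned BERT checkpoint (not the exact reader used in the experiments).

```python
# Sketch: span extraction with a QA head on top of BERT via the HF pipeline.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("BERT-large is really big... it has 24-layers and an embedding size "
           "of 1,024, for a total of 340M parameters!")
result = qa(question="How many parameters does BERT-large have?", context=context)
print(result["answer"], result["score"])  # highest-probability start/end span
```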
Key Idea
We hypothesize that a robust model should produce similar probability distributions over the original context after the context is perturbed with an additional distracting sentence.
distracting sentence ="BERT is designed to pre-train deep bidirectional representations
from unlabeled text by jointly conditioning on both left and right context."
46
Our approach: Perturbation-based AL
1. Creating perturbations for unlabeled candidates
The first step of our PAL acquisition strategy finds the distracting sentence from the context of the most similar labeled questions, using the embeddings of the fine-tuned model.
A perturbed instance is generated by appending the distractor sentence to the original context.
2. Scoring the robustness to perturbation
We compute the Kullback-Leibler divergence between the model's predictive probabilities for each candidate unlabeled question and its corresponding perturbed question as the perturbation-sensitivity score.
3. Selecting candidates to query
We then rank the unlabeled candidates according to their perturbation-sensitivity scores.
PAL selects the top n unlabeled questions with the highest scores to improve the robustness of the current model.
47
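A sketch of scoring step 2, assuming a SQuAD-fine-tuned checkpoint from HuggingFace; the distractor is taken as already chosen, and the alignment of token positions between the original and perturbed contexts is simplified for illustration.

```python
# Sketch: perturbation-sensitivity score as the KL divergence between the
# answer-span distributions for the original and the distractor-appended context.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "distilbert-base-cased-distilled-squad"   # example SQuAD-tuned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

def span_probs(question, context):
    enc = tok(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    return F.softmax(out.start_logits, dim=-1), F.softmax(out.end_logits, dim=-1)

def perturbation_sensitivity(question, context, distractor):
    s_orig, e_orig = span_probs(question, context)
    s_pert, e_pert = span_probs(question, context + " " + distractor)
    n = s_orig.shape[1]
    # Compare only over the original positions (alignment simplified) after renormalizing.
    s_pert = s_pert[:, :n] / s_pert[:, :n].sum(-1, keepdim=True)
    e_pert = e_pert[:, :n] / e_pert[:, :n].sum(-1, keepdim=True)
    kl = F.kl_div(s_pert.log(), s_orig, reduction="sum") \
       + F.kl_div(e_pert.log(), e_orig, reduction="sum")
    return kl.item()  # higher = less robust = more informative to annotate
```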
Experiments
We experiment with the pre-trained BERT-BASE model in combination with AL query strategies for selecting the next informative question examples with which to evolve the QA reader model.
The model predicts the start and the end position of the answer, and calculates the cross-entropy loss.
In each iteration, we continue fine-tuning the model on a newly labelled 10% of the remaining unlabeled dataset, selected by the active learning acquisition functions.
We used SQuAD as the benchmark dataset for the QA answer extraction task.
48
Stanford Question Answering Dataset (SQuAD)
- 107,785 question-answer pairs on
536 articles.
- The text passages are taken from
Wikipedia across a wide range of
topics
"SQuAD: 100,000+ Questions for Machine Comprehension of Text" 49
Uncertainty: select the unlabeled data samples with the least confidence (largest uncertainty), measured from the output predictions.
Density/Clustering-based: find representative data samples by clustering data in the embedding space and selecting the ones close to centroids.
Maximal Diversity: query the unlabeled samples that are maximally distant from the labeled ones.
Common Active Learning Strategies
50
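For comparison with PAL, a minimal sketch of the uncertainty (least-confidence) strategy for the QA reader; taking the confidence of an example as the product of the best start and end probabilities is one common choice, assumed here.

```python
# Sketch: least-confidence sampling for span-extraction QA.
def best_span_confidence(start_probs, end_probs):
    """Confidence of the predicted answer = max start prob * max end prob."""
    return float(start_probs.max() * end_probs.max())

def least_confidence_sample(confidences, n):
    """Return indices of the n least confident unlabeled examples."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i])[:n]
```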
In general, the uncertainty-based strategy outperforms the other two common sampling strategies, as it always searches for the "valuable" samples around the current decision boundary.
The clustering-based sampling strategy performs better when the number of labeled samples is very small, while the uncertainty-based criterion usually overtakes it afterwards.
Our PAL acquisition method utilizes both the input features and the model's output predictions to select the most informative instances.
Results
Fine-tuning the BERT-base model with various
AL acquisition strategies for the QA task.
The F1 scores are evaluated at every n-th
training step (with batch size of 12) on the
SQuAD dataset.
51
Future work
52
Knowledge Base as external resources
53
KBQA (Knowledge-Base Question Answering) uses structured knowledge graphs as
the knowledge sources.
Pros: High precision
Cons: Low coverage, expensive to obtain an extensive and high-quality KB
TextQA (often referred to as ODQA) leverages text (e.g. Wikipedia articles).
Pros: Vast amount of data; can readily use SoTA transformer models
Cons: Neglects valuable knowledge sources such as KBs and tables.
Unifying KBQA and TextQA has proven challenging
Human-in-the-loop Interactive learning
Can human supervision and intervention in the learning process of the model
help it learn faster and make better predictions and explanations?
https://hub.packtpub.com/what-is-interactive-machine-learning/
-Besides the answers as direct supervision,
would extra information (feedback) provided by
humans provide rich guidance to the model?
(User input: corrections, rankings, or
evaluations)
-How to incorporate the user feedback into an
existing QA model?
54
Combination of active learning and self-learning
Self-learning
Discovers highly reliable instances based on its
own predictions to teach itself.
Active Learning
Selects the most informative instances.
Hybrid
Updates the model with the most informative and highly reliable instances.
55
Yih, Wen-tau, and Hao Ma. "Question answering with knowledge base, Web and beyond."
More Challenging QA Tasks
56
QUESTIONS & DISCUSSION
57
Acknowledgements
I would like to express my heartfelt gratitude to the following people for
helping me realize my dream (in no particular order).
My advisors.
My committee.
My colleagues.
Most important of all, my family!
Thank You!
References
Abdalla, Muhammad Anwar, and Sameh Basha. Active Learning on Graph Neural Network for Enzymes Classification. Diss. Cairo University, 2021.
Yih, Wen-tau, and Hao Ma. "Question answering with knowledge base, Web and beyond." Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2016.
Abbasiantaeb, Z. and S. Momtazi (2021). Text-based question answering from information retrieval and deep neural network perspectives: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(6), p. e1412.
Allam, A. M. N. and M. H. Haggag (2012). The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3).
Beltagy, I., M. E. Peters, and A. Cohan (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
Chen, D., A. Fisch, J. Weston, and A. Bordes (2017). Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
Clark, C. and M. Gardner (2017). Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
Dasgupta, S. (2011). Two faces of active learning. Theoretical Computer Science, 412(19), pp. 1767-1781.
De Cao, N., W. Aziz, and I. Titov (2018). Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.
Esposito, M., E. Damiano, A. Minutolo, G. De Pietro, and H. Fujita (2020). Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering. Information Sciences, 514, pp. 88-105.
Fu, R., H. Wang, X. Zhang, J. Zhou, and Y. Yan (2021). Decomposing complex questions makes multi-hop QA easier and more interpretable. arXiv preprint arXiv:2110.13472.
Fu, Y., X. Zhu, and B. Li (2013). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2), pp. 249-283.
Guo, L., X. Su, L. Zhang, G. Huang, X. Gao, and Z. Ding (2018). Query expansion based on semantic related network. In Pacific Rim International Conference on Artificial Intelligence, pp. 19-28. Springer.
Guu, K., K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020a). Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929-3938. PMLR.
60
Backup Slides
61
RELATED WORK
62
Question Decomposition
(Min et al. 2019) proposed DecompRC, a system that learns to break
compositional multi-hop questions into simpler, single-hop sub-questions.
(Jiang and Bansal 2019) designed four types of language reasoning
modules, and proposed a controller RNN which decomposes the multi-
hop question into multiple single-hop sub-questions, and dynamically
infers a series of reasoning modules.
63
Multi-step (iterative) retrievers
(Feldman and El-Yaniv 2019)
- A joint vector representation of both a question and a paragraph.
- In each retrieval iteration, reformulate the search vector
GOLDEN Retriever introduced by (Qi et al. 2019)
- Generates queries given the question and available context for two steps to search
documents for HotpotQA full wiki.
(Asai et al. 2019)
- Iteratively retrieve a subsequent passage in the reasoning chain with RNN, until the
end-of-evidence symbol is selected.
- Beam search outputs the top reasoning paths with the highest scores and passes
them to the reader model.
64
Graph-based models
Recent studies build entity graphs from multiple paragraphs, and apply graph
neural networks to conduct reasoning across documents over the graphs (De Cao,
Aziz, and Titov 2019; Xiao et al. 2019).
DFGN (Qiu et al. 2019) also constructed an entity graph, and predicted a dynamic
mask to select a subgraph, so that in each reasoning step irrelevant entities are
softly masked out.
CogQA (Ding et al. 2019) iteratively extracted entities and answer candidate spans
for each hop and organized them as a cognitive graph.
65
Learning with Limited Annotations
(Celikyilmaz, Thint, and Huang 2009) implemented a SSL approach by creating a graph for
labeled and unlabeled data using match-scores of textual entailment features as similarity
weights between data points, and demonstrated that utilization of more unlabeled data
points can improve the answer-ranking task of QA.
(Dhingra, Danish, and Rajagopal 2018) showed that fine-tuning the pre-trained QA models
on the small set of labeled QA pairs improves the performance of the models significantly.
(Zhou, Chen, and Wang 2010) applied active learning in the semi-supervised learning
framework to identify reviews that should be labeled as training data for review sentiment
classification.
66
Early QA systems
67
Modern QA systems - Deep Neural Models
-Representation-based models
-Encode Q and A into fixed vectors (using BiLSTM and CNN) + similarity of these vectors
-Interaction-based models
-Capture the interaction between individual words in Q and A usually using attention
mechanisms (e.g., Transformer models)
68
Comparison against KB-based QA
KB-based QA also uses graph-based representations, but
Our approach uses a dynamically-constructed graph that is built on-the-fly from
the documents relevant for a query
More relevant information than a static KB
Smaller search space than a static KB
69