TOWARDS THE ADVANCEMENT OF OPEN-DOMAIN
TEXTUAL QUESTION ANSWERING METHODS
Fan Luo
Nov 17, 2022
Dissertation Committee:
Mihai Surdeanu, Lila Bozgeyikli, Joshua A Levine, Chicheng Zhang
Overview
2
3
Open-Domain Question Answering (ODQA)
4
QA categories
Open-domain QA vs. Closed-domain QA
Textual QA vs. Knowledge-based QA vs. Table-based QA
5
A history of open-domain textual QA
- Simmons et al. (1964) were the first to explore answering questions based on matching dependency parses of a question and answer
- Murax (Kupiec, 1993) aimed to answer questions over an online encyclopedia using IR and shallow linguistic processing
- The NIST TREC QA track, begun in 1999, first rigorously investigated answering fact questions over a large collection of documents
- IBM's Jeopardy! system (DeepQA, 2011) used an ensemble of many methods
- Many neural approaches after 2015… (more later)
https://www.cs.princeton.edu/courses/archive/spring20/cos598C/lectures/lec10-open-qa.pdf
6
7
Recent Advancements in NLP
Transformers
8
Open Research Questions
01 Explainability: explain with evidence
02 Annotation cost: less human labeling effort
03 Complex questions: retrieve and synthesize info from multiple resources
9
Challenges in Multi-Hop QA
'Multi-hop' refers to the multiple pieces of information needed to tackle a multi-hop question.
Secondary hops are lexically or semantically distant from the question.
Providing explainability for the answer prediction is more challenging in multi-hop QA.
Qi, Peng, et al. "Answering complex open-domain questions through iterative query generation." arXiv preprint arXiv:1910.07000 (2019).
10
Architecture of Modern Textual QA systems
(1) Question Processor: analyze and reformulate questions
(2) Retriever-Reranker: collect relevant context
(3) Reader: extract the answer
11
My works
12
A STEP towards Interpretable Multi-Hop Reasoning:
Bridge Phrase Identification and Query Expansion
Complex questions
Explainability
Annotation
13
Bridge Phrase
14
Name of the executive producer of Alien:
Ronald Shusett
The film that has a score composed by Jerry
Goldsmith: Alien
No explicit lexical overlap between the answer sentence and the question
Application: query expansion
Approach
Graph-based Unsupervised Modular
15
Noun Phrase Extraction
Quotation Extraction (Q)
Phrase Grounding (G)
Named Entity Recognition (E)
Noun Chunks Extraction (C)
16
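A minimal sketch of how these extraction modules could be combined, assuming spaCy (en_core_web_sm) for named entities and noun chunks; quotation extraction is reduced to a simple regex and phrase grounding against document titles is omitted.

```python
# Sketch of the phrase extraction stage (assumes spaCy with en_core_web_sm;
# quotation extraction is a simple regex, phrase grounding (G) is omitted).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_phrases(text):
    """Collect candidate phrases: quotations (Q), named entities (E), noun chunks (C)."""
    doc = nlp(text)
    phrases = set()
    phrases.update(m.strip() for m in re.findall(r'"([^"]+)"', text))  # (Q)
    phrases.update(ent.text for ent in doc.ents)                       # (E)
    phrases.update(chunk.text for chunk in doc.noun_chunks)            # (C)
    return phrases

print(extract_phrases("Three Men on a Horse is a play by a playwright born in which year?"))
```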
Noun Phrase Graph Construction
Edge types: SENT-SENT, TITLE-SENT, TITLE-TITLE, Coreference
17
[Figure: example noun-phrase graph with nodes such as 'George Francis Abbott', 'playwright' [P2], 'Three Men on a Horse', 'play' [P1], 'Tomb Raider' (2013 video game), 'Ronald Shusett', and 'Shusett']
Question Phrase Identification & Graph Pruning
Question: [Three Men on a Horse]G is a [play]C by a [playwright]C born in which year?
18
STEP (Steiner Tree Phrase identification)
Algorithm: an approximate solution for the Steiner problem in graphs (Takahashi et al., 1980)
Steiner Tree: minimum spanning tree of the sub-graph that contains all question phrases
Steiner Points, i.e., identified bridge phrases: 'George Francis Abbott' and 'George Abbott'
19
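A sketch of the Steiner-tree step on a toy phrase graph. It uses networkx's generic Steiner tree approximation as a stand-in for the Takahashi et al. heuristic, and the edges below are hand-written for illustration.

```python
# Sketch: bridge phrases as Steiner points of an approximate Steiner tree
# (networkx's approximation stands in for the Takahashi et al. heuristic).
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Hypothetical phrase graph: nodes are extracted phrases; edges come from
# SENT-SENT / TITLE-SENT / TITLE-TITLE co-occurrence and coreference links.
G = nx.Graph()
G.add_edge("Three Men on a Horse", "George Abbott", weight=1)    # TITLE-SENT
G.add_edge("George Abbott", "George Francis Abbott", weight=1)   # coreference
G.add_edge("George Francis Abbott", "playwright", weight=1)      # SENT-SENT
G.add_edge("Three Men on a Horse", "play", weight=1)             # SENT-SENT

question_phrases = ["Three Men on a Horse", "play", "playwright"]
tree = steiner_tree(G, question_phrases)

# Bridge phrases = Steiner points: tree nodes that are not question phrases.
bridge_phrases = [n for n in tree.nodes if n not in question_phrases]
print(bridge_phrases)  # ['George Abbott', 'George Francis Abbott']
```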
Query Expansion and Retrieval
Identified Bridge Phrases: ‘George Francis Abbott’ and ‘George Abbott’
Query Expansion: Three Men on a Horse is a play by a playwright born in which year, George Abbott, George Francis Abbott
Retriever: BM25 and MSMARCO cross-encoder
20
Question: Three Men on a Horse is a play by a playwright born in which year?
Supporting Document 1: Three Men on a Horse
Three Men on a Horse is a play by George Abbott and John Cecil Holm. . . .
Supporting Document 2: George Abbott
George Francis Abbott (June 25, 1887 - January 31, 1995) was an American theater producer and director,
playwright, screenwriter, and film director and producer whose career spanned nine decades. ...
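A sketch of the query expansion and two-stage retrieval step, assuming the rank_bm25 and sentence-transformers packages; the corpus is a toy stand-in and the MS MARCO checkpoint name is one publicly available example.

```python
# Sketch: expand the question with bridge phrases, retrieve with BM25, then
# rerank with an MS MARCO cross-encoder (toy corpus; checkpoint is an example).
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

question = "Three Men on a Horse is a play by a playwright born in which year?"
bridge_phrases = ["George Abbott", "George Francis Abbott"]
expanded_query = question + " " + ", ".join(bridge_phrases)

corpus = [
    "Three Men on a Horse is a play by George Abbott and John Cecil Holm.",
    "George Francis Abbott was an American theater producer, director, and playwright.",
    "Tomb Raider is a 2013 video game developed by Crystal Dynamics.",
]

# First stage: sparse retrieval over the expanded query.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(expanded_query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:2]

# Second stage: rerank the candidates with the cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(expanded_query, corpus[i]) for i in candidates])
reranked = [corpus[i] for _, i in sorted(zip(ce_scores, candidates), reverse=True)]
print(reranked[0])
```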
Experiments
Evidence Retrieval: retrieval performance with and without our query expansion strategy
Answer Prediction: accuracy of the predicted answer using the retrieved context with and without our query expansion strategy
Bridge Phrases: manual evaluation of the accuracy of identified bridge phrases
Post-hoc Explanation: manual evaluation of the quality of the explanations
21
Dataset: HotpotQA (Yang et al., 2018) Development Set
5,918 bridge-type questions
Example: Alice David is the voice of Lara Croft in a video game developed by which company?
1,487 comparison-type questions
Example: Which American singer and songwriter has a mezzo-soprano vocal range, Tim Armstrong or Tori Amos?
22
When STEP is coupled with a retriever:
BM25: traditional information retrieval model
MSMARCO cross-encoder: a transformer-based neural dense retrieval model
evidence retrieval performance (evaluated against annotated supporting facts) increases
Results: Evidence Retrieval
Reader: Longformer Fine-tuned with HotpotQA training data
Input: concatenating the question and context sentences
[CLS] [Q] Question [/Q] [SEP] [T] title1 [/T] sent11 [/S] sent12 [/S] . . . [SEP] [T] title2 [/T] sent21 [/S] sent22 [/S] . . .
Context sentences:
Random: a set of k sentences randomly selected
Question-only: top ranked sentences without query expansion (i.e., using the original question)
SF only: the gold supporting sentences
Oracle: top ranked sentences with query expansion, using oracle bridge phrases extracted directly from the ground-truth supporting sentences
STEP: top ranked sentences with query expansion, using identified bridge phrases
Baseline
Ceiling
Retrieved w/ Original question
Retrieved w/ Query expansion
Results: Answer Prediction
23
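A minimal sketch of how the reader input string above could be assembled from the question and the retrieved (title, sentences) pairs; registering the [Q]/[T]/[/S] markers as special tokens in the tokenizer is omitted.

```python
# Sketch: build the reader input with the question/title/sentence markers
# shown above ([CLS] and the final [SEP] are normally added by the tokenizer).
def build_reader_input(question, docs):
    """docs: list of (title, [sentences]) pairs for the retrieved context."""
    parts = [f"[Q] {question} [/Q]"]
    for title, sentences in docs:
        parts.append(f"[T] {title} [/T] " + " [/S] ".join(sentences) + " [/S]")
    return " [SEP] ".join(parts)

docs = [("Three Men on a Horse",
         ["Three Men on a Horse is a play by George Abbott and John Cecil Holm."])]
print(build_reader_input(
    "Three Men on a Horse is a play by a playwright born in which year?", docs))
```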
Manually evaluate the quality of identified bridge phrases
100 randomly selected questions; 2 human annotators
3 annotation labels: correct, partial, incorrect
Average accuracy: 76.3%; Kappa agreement: 46%
Results: Bridge Phrases
24
Nodes in the Steiner Tree:
Bell Labs, headquarters, Ravi Sethi, computer scientist, Avaya Labs Research, American research and scientific development company
Steiner Points: Bell Labs, Avaya Labs Research
Query expansion: In Murray Hill city are the headquarters of the American research and scientific development company where Ravi Sethi worked as computer scientist located, Bell Labs, Avaya Labs Research
25
Post-hoc Explanations
1, 2, and 4 are the gold supporting facts
26
50 randomly sampled questions; top 10 candidate evidences
Results: Post-hoc Explanations
Manual evaluation (out of 50 questions):
Quality of explanations: 44.5 (89%)
Accuracy of post-hoc bridge phrases: 48 (96%)
Takeaways
Bridge phrase identification
We introduce a graph-based strategy for identifying bridge phrases for multi-hop QA.
Query expansion
Identified bridge phrases can be used to expand the query, improving evidence retrieval and answer extraction.
Post-hoc explanation
Post-hoc explanations can be made available to interpret the provided answers.
27
Divide & Conquer for Entailment-aware Multi-hop
Evidence Retrieval
Complex questions
Explainability
Annotation
28
Evidence Retrieval and Rerank
https://blog.griddynamics.com/question-answering-system-using-bert/
29
Evidence Ranking Subtasks
30
In this work, we propose to capture textual entailment and semantic equivalence in parallel with separate models, which produce different and potentially conflicting rankings.
The goal is to combine them into an aggregated ranking that promotes gold evidence sentences to the top of the list.
Base Models
Three off-the-shelf base models to capture the diverse relevance signals.
Sparse model
BM25
Statistical model relying on
lexical overlap
Dense model
MSMARCO CE
transformers pre-trained for
semantic search
Dense model
QNLI CE
transformers pre-trained for
question-answer entailment
CE (Cross-Encoder): the standard BERT design that benefits from all-to-all attention across tokens in the input sequence.
MS MARCO: a large-scale corpus consisting of about 500k real search queries with the 1,000 most relevant passages (Bajaj et al., 2016)
QNLI: the Question Natural Language Inference dataset introduced by the GLUE benchmark (Wang et al., 2018)
31
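A sketch of scoring the same candidate sentences with the three base models, assuming rank_bm25 and sentence-transformers; the two cross-encoder names are publicly available checkpoints used as examples, not necessarily the exact ones in the experiments.

```python
# Sketch: three complementary relevance signals for the same candidate sentences.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

question = "Three Men on a Horse is a play by a playwright born in which year?"
sentences = [
    "Three Men on a Horse is a play by George Abbott and John Cecil Holm.",
    "George Francis Abbott was an American theater producer and playwright.",
    "Tomb Raider is a 2013 video game.",
]

bm25 = BM25Okapi([s.lower().split() for s in sentences])        # lexical overlap
sts_ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # semantic search
qnli_ce = CrossEncoder("cross-encoder/qnli-electra-base")       # QA entailment

pairs = [(question, s) for s in sentences]
scores = {
    "bm25": bm25.get_scores(question.lower().split()),
    "sts": sts_ce.predict(pairs),
    "qnli": qnli_ce.predict(pairs),
}
# Each base model induces its own (possibly conflicting) ranking.
for name, s in scores.items():
    print(name, sorted(range(len(sentences)), key=lambda i: -s[i]))
```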
For 14% of questions, at least one evidence sentence is ranked within the top 3 by BM25 but beyond the top 3 by both MSMARCO CE and QNLI CE;
for 35% of questions, at least one evidence sentence is ranked within the top 3 by QNLI CE but beyond the top 3 by MSMARCO CE.
Base Models Comparison
A check mark indicates that an evidence sentence is ranked within the top k by the base model, while a cross indicates that it is ranked beyond the top k.
The three base models independently capture diverse relevance signals and complement each other.
Percentage of questions for which at least one evidence sentence is (or is not) ranked within the top k by each base model.
32
Similarity Combination (SimCom)
SimCom calculates hybrid relevance scores through a linear combination of scores from the base models.
Semantic Textual Similarity (STS) and Inference Similarity (IS) are the scores from MSMARCO CE and QNLI CE.
Average Ranking (AR)
AR simply sums all the ranks for each sentence and re-ranks all the sentences according to the summation of ranks.
Ensemble Baselines
33
Techniques to combine the results from base models
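A sketch of the two ensemble baselines described above. The exact SimCom combination is not spelled out on the slide, so a weighted sum of the two cross-encoder scores with weights α and β is assumed here (α = 3, β = 1 per the grid search mentioned with the results).

```python
# Sketch of SimCom (assumed: weighted sum of STS and IS scores) and AR (rank sums).
import numpy as np

def simcom(sts_scores, is_scores, alpha=3.0, beta=1.0):
    """Hybrid relevance score as a linear combination of base-model scores."""
    return alpha * np.asarray(sts_scores) + beta * np.asarray(is_scores)

def average_ranking(*rank_lists):
    """AR: sum each sentence's ranks across base models, re-rank by the sums."""
    rank_sums = np.sum([np.asarray(r) for r in rank_lists], axis=0)
    return np.argsort(rank_sums)  # new ordering, best (smallest rank sum) first

# Example with three sentences: scores from the two cross-encoders and
# rank lists (0 = best) from the three base models.
print(simcom([0.9, 0.2, 0.5], [0.1, 0.8, 0.4]))
print(average_ranking([0, 2, 1], [2, 0, 1], [1, 0, 2]))
```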
Entailment-Aware Ranking (EAR)
Idea
34
Jointly consider pairs of top-ranked
candidate evidence sentences by
base models with respect to
semantic equivalence and textual
entailment, respectively.
Goal
Combine complementary relevance
signals captured by base models to
retrieve candidate evidences for
multi-hop questions.
BM25: {Sa1, Sa2, Sa3, Sa4, Sa5, Sa6, . . .}
MSMARCO CE: {Sa4, Sa3, Sa1, Sa6, Sa2,. . .}
QNLI CE: {Sb1, Sb2, Sb3, Sb4, Sb5, Sb6, . . . }
Top ranked by semantic equivalence A = {Sa1,Sa2,Sa3,Sa4}
Top ranked by textual entailment B = {Sb1, Sb2, Sb3}
Pairs we consider are the Cartesian product of the two sets:
Pairs = A × B = {(a, b) | a ∈ A, b ∈ B}
Score pairs against question with a ranker
(q, a || b )
The top-scored sentence pair (Sai, Sbj) forms a compositional relevant context covering both signals
Re-rank the rest against q || a || b
(q || a || b, Si)
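A sketch of the EAR pairing and rescoring steps; `reranker` is assumed to be a cross-encoder-style model with a `predict` method over (query, passage) pairs, and `||` is realized as simple string concatenation.

```python
# Sketch of EAR: score every (a, b) pair from A x B against the question,
# keep the best pair, then re-rank the remaining sentences against q || a || b.
from itertools import product

def ear_select(question, set_a, set_b, reranker):
    pairs = list(product(set_a, set_b))                   # A x B
    inputs = [(question, a + " " + b) for a, b in pairs]  # (q, a || b)
    scores = reranker.predict(inputs)
    best = max(range(len(pairs)), key=lambda i: scores[i])
    return pairs[best]                                    # (S_ai, S_bj)

def rerank_rest(question, a, b, remaining, reranker):
    enriched = " ".join([question, a, b])                 # q || a || b
    scores = reranker.predict([(enriched, s) for s in remaining])
    return [s for _, s in sorted(zip(scores, remaining), reverse=True)]
```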
EARnest
Evidences for a multi-hop question should be intuitively related, and are often logically connected via a shared named entity.
EAR: Entailment-Aware Retrieval
NEST: Named Entity Similarity Term
Sim(): the scoring function of the reranker
NEST is a binary switch: if the two sentences share one or more named entities, the promotion mechanism is activated, because they are more likely to be connected to form a coherent context.
When scoring sentence pairs:
35
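A sketch of the NEST switch when scoring a sentence pair. How the promotion enters the final score is not specified on the slide, so an additive bonus λ on top of Sim(q, a || b) is assumed here; spaCy is used for named entity recognition.

```python
# Sketch: binary NEST switch on shared named entities, applied to the pair score.
import spacy

nlp = spacy.load("en_core_web_sm")

def nest(sent_a, sent_b):
    """1 if the two sentences share at least one named entity, else 0."""
    ents_a = {e.text.lower() for e in nlp(sent_a).ents}
    ents_b = {e.text.lower() for e in nlp(sent_b).ents}
    return int(bool(ents_a & ents_b))

def earnest_pair_score(question, sent_a, sent_b, reranker, lam=1.0):
    sim = reranker.predict([(question, sent_a + " " + sent_b)])[0]  # Sim(q, a || b)
    return sim + lam * nest(sent_a, sent_b)  # promote entity-linked pairs (assumed form)
```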
Evidence Ranking Results
[Results table (MAP): values include 75% and +10%; annotations: 'highest among the base models', 'ignoring other relevance signals', 'better than individual base models', 'ignoring interactions between the relevance signals', 'best model']
P@n and R@n fail to take into account the positions of the relevant sentences among the top n; MAP (mean average precision) is a more informative metric to examine.
SimCom uses α = 3 and β = 1, according to the grid-search results on 10% of the full dataset.
36
Learning Strategies for Question Answering with Fewer
Annotations
Annotation
38
A deep neural network to extract the answer to a question from the given context.
Challenges: suffers from "data hunger" and low robustness issues.
Answer extraction with a deep reader model
39
QA Dataset Annotation
Costly: intensive manual labor; tedious and time-consuming
Noisy: low agreement
40
Objectives
Less Annotation More Robust
41
Active Learning
42
Output a relatively good model
Make far fewer annotation requests
Query the most informative instances
DeepAL, the combination of Deep Learning (DL) and Active Learning (AL), leverages the complementary advantages of the two methods to achieve better results.
• DL achieves state-of-the-art results in QA tasks, but is limited by the high cost of labeling;
• AL maximizes the value of labeling a small set of examples.
DeepAL for the QA task
43
Deep Reader with BERT
🤖Language Models 🤖
Most modern textual QA systems have a deep reader model that performs reading comprehension (RC) to extract an answer from the given documents.
A question-answering head is applied on top of the BERT model to produce, for each token in the documents, the probability of being the answer start or end token.
44
question ="How many parameters does BERT-large have?"
Context ="BERT-large is really big... it has 24-layers and an embedding size of 1,024,
for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple
minutes to download to your Colab instance."
Answer: "340 ##m"
https://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/
45
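A sketch of the extraction step above using the HuggingFace question-answering pipeline with a publicly available SQuAD-fine-tuned BERT checkpoint (not the exact reader used in the experiments).

```python
# Sketch: span extraction with a QA head on top of BERT via the HF pipeline.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("BERT-large is really big... it has 24-layers and an embedding size "
           "of 1,024, for a total of 340M parameters!")
result = qa(question="How many parameters does BERT-large have?", context=context)
print(result["answer"], result["score"])  # highest-probability start/end span
```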
Key Idea
We hypothesize that a robust model should produce similar probability distributions over the original context after the context is perturbed with an additional distracting sentence.
distracting sentence ="BERT is designed to pre-train deep bidirectional representations
from unlabeled text by jointly conditioning on both left and right context."
46
Our approach: Perturbation-based AL
1. Creating perturbations for unlabeled candidates
The first step of our PAL acquisition strategy finds the distracting sentence from the context of the most similar labeled questions, using the embeddings of the fine-tuned model.
A perturbed instance is generated by appending the distractor sentence to the original context.
2. Scoring the robustness to perturbation
We compute the Kullback-Leibler divergence between the model's predictive probabilities for each candidate unlabeled question and its corresponding perturbed question as the perturbation-sensitivity score.
3. Selecting candidates to query
We then rank the unlabeled candidates according to their perturbation-sensitivity scores.
PAL selects the top n unlabeled questions with the highest scores to improve the robustness of the current model.
47
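A sketch of scoring step 2, assuming a SQuAD-fine-tuned checkpoint from HuggingFace; the distractor is taken as already chosen, and the alignment of token positions between the original and perturbed contexts is simplified for illustration.

```python
# Sketch: perturbation-sensitivity score as the KL divergence between the
# answer-span distributions for the original and the distractor-appended context.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "distilbert-base-cased-distilled-squad"   # example SQuAD-tuned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

def span_probs(question, context):
    enc = tok(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    return F.softmax(out.start_logits, dim=-1), F.softmax(out.end_logits, dim=-1)

def perturbation_sensitivity(question, context, distractor):
    s_orig, e_orig = span_probs(question, context)
    s_pert, e_pert = span_probs(question, context + " " + distractor)
    n = s_orig.shape[1]
    # Compare only over the original positions (alignment simplified) after renormalizing.
    s_pert = s_pert[:, :n] / s_pert[:, :n].sum(-1, keepdim=True)
    e_pert = e_pert[:, :n] / e_pert[:, :n].sum(-1, keepdim=True)
    kl = F.kl_div(s_pert.log(), s_orig, reduction="sum") \
       + F.kl_div(e_pert.log(), e_orig, reduction="sum")
    return kl.item()  # higher = less robust = more informative to annotate
```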
Experiments
We experiment with the pre-trained BERT-BASE model in combination with AL query strategies for selecting the next informative question examples with which to evolve the QA reader model.
The model predicts the start and the end position of the answer, and calculates the cross-entropy loss.
In each iteration, we continue fine-tuning the model on a newly labelled 10% of the remaining unlabeled dataset, selected by the active learning acquisition functions.
We used SQuAD as the benchmark dataset for the QA answer extraction task.
48
Stanford Question Answering Dataset (SQuAD)
- 107,785 question-answer pairs on
536 articles.
- The text passages are taken from
Wikipedia across a wide range of
topics
"SQuAD: 100,000+ Questions for Machine Comprehension of Text" 49
Uncertainty: select the unlabeled data samples with the least confidence (largest uncertainty), measured from the output predictions.
Density/Clustering-based: find representative data samples by clustering data in the embedding space and selecting the ones close to centroids.
Maximal Diversity: query the unlabeled samples that are maximally distant from the labeled ones.
Common Active Learning Strategies
50
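For comparison with PAL, a minimal sketch of the uncertainty (least-confidence) strategy for the QA reader; taking the confidence of an example as the product of the best start and end probabilities is one common choice, assumed here.

```python
# Sketch: least-confidence sampling for span-extraction QA.
def best_span_confidence(start_probs, end_probs):
    """Confidence of the predicted answer = max start prob * max end prob."""
    return float(start_probs.max() * end_probs.max())

def least_confidence_sample(confidences, n):
    """Return indices of the n least confident unlabeled examples."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i])[:n]
```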
In general, the uncertainty-based strategy outperforms the other two common sampling strategies, as it always searches for the "valuable" samples around the current decision boundary.
The clustering-based sampling strategy performs better when the number of labeled samples is very small, while the uncertainty-based criterion usually overtakes it afterwards.
Our PAL acquisition method utilizes both the input features and the model's output predictions to select the most informative instances.
Results
Fine-tuning the BERT-base model with various
AL acquisition strategies for the QA task.
The F1 scores are evaluated at every n-th
training step (with batch size of 12) on the
SQuAD dataset.
51
Future work
52
Knowledge Base as external resources
53
KBQA (Knowledge-Base Question Answering) uses structured knowledge graphs as
the knowledge sources.
Pros: High precision
Cons: Low coverage, expensive to obtain an extensive and high-quality KB
TextQA (often referred to as ODQA) leverages text (e.g. Wikipedia articles).
Pros: Vast amount of data; can readily use SoTA transformer models
Cons: Neglects valuable knowledge sources such as KBs and tables.
Unifying KBQA and TextQA has proven challenging
Human-in-the-loop Interactive learning
Can human supervision and intervention in the learning process of the model
help it learn faster and make better predictions and explanations?
https://hub.packtpub.com/what-is-interactive-machine-learning/
-Besides the answers as direct supervision,
would extra information (feedback) provided by
humans provide rich guidance to the model?
(User input: corrections, rankings, or
evaluations)
-How to incorporate the user feedback into an
existing QA model?
54
Combination of active learning and self-learning
Self-learning
Discovers highly reliable instances based on its
own predictions to teach itself.
Active Learning
Selects the most informative instances.
Hybrid
Updates the model with the most informative and highly reliable instances.
55
Yih, Wen-tau, and Hao Ma. "Question answering with knowledge base, Web and beyond."
More Challenging QA Tasks
56
QUESTIONS & DISCUSSION
57
Acknowledgements
I would like to express my heartfelt gratitude to the following people for
helping me realize my dream (in no particular order).
My advisors.
My committee.
My colleagues.
Most important of all, my family!
Thank You!
References
Abdalla, Muhammad Anwar, and Sameh Basha. Active Learning on Graph Neural Network for Enzymes Classification. Diss. Cairo University, 2021.
Yih, Wen-tau, and Hao Ma. "Question answering with knowledge base, Web and beyond." Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2016.
Abbasiantaeb, Z. and S. Momtazi (2021). Text-based question answering from information retrieval and deep neural network perspectives: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(6), p. e1412.
Allam, A. M. N. and M. H. Haggag (2012). The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3).
Beltagy, I., M. E. Peters, and A. Cohan (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
Chen, D., A. Fisch, J. Weston, and A. Bordes (2017). Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
Clark, C. and M. Gardner (2017). Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
Dasgupta, S. (2011). Two faces of active learning. Theoretical Computer Science, 412(19), pp. 1767-1781.
De Cao, N., W. Aziz, and I. Titov (2018). Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920.
Esposito, M., E. Damiano, A. Minutolo, G. De Pietro, and H. Fujita (2020). Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering. Information Sciences, 514, pp. 88-105.
Fu, R., H. Wang, X. Zhang, J. Zhou, and Y. Yan (2021). Decomposing complex questions makes multi-hop QA easier and more interpretable. arXiv preprint arXiv:2110.13472.
Fu, Y., X. Zhu, and B. Li (2013). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2), pp. 249-283.
Guo, L., X. Su, L. Zhang, G. Huang, X. Gao, and Z. Ding (2018). Query expansion based on semantic related network. In Pacific Rim International Conference on Artificial Intelligence, pp. 19-28. Springer.
Guu, K., K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020a). Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929-3938. PMLR.
60
Backup Slides
61
RELATED WORK
62
Question Decomposition
(Min et al. 2019) proposed DecompRC, a system that learns to break
compositional multi-hop questions into simpler, single-hop sub-questions.
(Jiang and Bansal 2019) designed four types of language reasoning
modules, and proposed a controller RNN which decomposes the multi-
hop question into multiple single-hop sub-questions, and dynamically
infers a series of reasoning modules.
63
Multi-step (iterative) retrievers
(Feldman and El-Yaniv 2019)
- A joint vector representation of both a question and a paragraph.
- In each retrieval iteration, reformulate the search vector
GOLDEN Retriever introduced by (Qi et al. 2019)
- Generates queries given the question and available context for two steps to search
documents for HotpotQA full wiki.
(Asai et al. 2019)
- Iteratively retrieve a subsequent passage in the reasoning chain with RNN, until the
end-of-evidence symbol is selected.
- Beam search outputs the top reasoning paths with the highest scores and passes
them to the reader model.
64
Graph-based models
Recent studies build entity graphs from multiple paragraphs, and apply graph
neural networks to conduct reasoning across documents over the graphs (De Cao,
Aziz, and Titov 2019; Xiao et al. 2019).
DFGN (Qiu et al. 2019) also constructed an entity graph, and predicted a dynamic
mask to select a subgraph, so that in each reasoning step irrelevant entities are
softly masked out.
CogQA (Ding et al. 2019) iteratively extracted entities and answer candidate spans
for each hop and organized them as a cognitive graph.
65
Learning with Limited Annotations
(Celikyilmaz, Thint, and Huang 2009) implemented a SSL approach by creating a graph for
labeled and unlabeled data using match-scores of textual entailment features as similarity
weights between data points, and demonstrated that utilization of more unlabeled data
points can improve the answer-ranking task of QA.
(Dhingra, Danish, and Rajagopal 2018) showed that fine-tuning the pre-trained QA models
on the small set of labeled QA pairs improves the performance of the models significantly.
(Zhou, Chen, and Wang 2010) applied active learning in the semi-supervised learning
framework to identify reviews that should be labeled as training data for review sentiment
classification.
66
Early QA systems
67
Modern QA systems - Deep Neural Models
-Representation-based models
-Encode Q and A into fixed vectors (using BiLSTM and CNN) + similarity of these vectors
-Interaction-based models
-Capture the interaction between individual words in Q and A usually using attention
mechanisms (e.g., Transformer models)
68
Comparison against KB-based QA
KB-based QA also uses graph-based representations, but
Our approach uses a dynamically-constructed graph that is built on-the-fly from
the documents relevant for a query
More relevant information than a static KB
Smaller search space than a static KB
69