The following is a list of our currently active research topics. We would also love to collaborate on other topics in Natural Language Processing and (Robust and Accountable) Machine Learning.
Knowledge graphs (KGs) provide both open-world and domain-specific knowledge representations that are integral to many AI systems. However, constructing KGs is usually very costly and requires extensive effort. A widely attempted solution is to learn knowledge acquisition models that automatically induce structured knowledge from unstructured text. However, such models, developed through data-driven machine learning, are usually fragile to noise in learning resources and may fall short of providing reliable inference on large, heterogeneous real-world data. We are developing a general meta-learning framework that seeks to systematically improve the robustness of learning and inference for data-driven knowledge acquisition models. To accomplish this goal, we seek to solve several key problems: (i) how to identify incorrect training labels and prevent overfitting on noisy labels; (ii) how to detect invalid input instances at inference time (e.g., out-of-distribution ones) and allow prediction with abstention; (iii) how to automatically learn constraints that ensure globally consistent inference; (iv) how to make models robust against noise perturbation; (v) how to mitigate spurious correlations that models capture from biased training data.
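To make "prediction with abstention" in problem (ii) concrete, here is a minimal sketch. It assumes the simplest possible rule, thresholding the maximum softmax probability, which is a common baseline and not necessarily the method developed in this project. The label names, logits and the threshold of 0.7 are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_abstention(logits, labels, threshold=0.7):
    """Return (label, confidence), or (None, confidence) when abstaining.

    Assumption: confidence is the maximum softmax probability; the
    threshold is an illustrative value, not a recommended setting.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]  # abstain: the model is not confident enough
    return labels[best], probs[best]

# Hypothetical relation labels and logits for two input instances.
labels = ["born_in", "works_for", "no_relation"]
confident = predict_with_abstention([4.0, 0.5, 0.2], labels)
uncertain = predict_with_abstention([1.0, 0.9, 0.8], labels)

print(confident[0])  # the top label is returned
print(uncertain[0])  # None: the prediction is withheld
```

A flat score distribution, as in the second call, is one rough signal that an input may be invalid or out-of-distribution; more reliable detectors are part of the research question itself.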
Knowledge acquisition tasks (e.g., relation extraction, entity and event extraction and typing, consolidation) face challenges including extreme label spaces, insufficient annotations and out-of-distribution prediction. To this end, we study methods for leveraging indirect supervision signals from auxiliary tasks (e.g., natural language inference, abstractive summarization) to foster robust and generalizable inference for (open-domain) knowledge acquisition or information extraction. In the same context, we study methods for generating semantically rich label representations based on either gloss knowledge or structural knowledge from a well-populated lexical knowledge base, in order to better support learning with limited labels.
Natural language understanding and generation tasks need to handle equivariance properties in data. For example, the narrative structure of an article can be reorganized while still presenting the same content. In constrained NLG tasks with structural priors (e.g., data-to-text generation tasks), the structure of the prior can also be modified while presenting semantically equivalent content. However, the sequential modeling used by existing Transformer LMs causes downstream information extraction and NLG systems to be brittle to content-neutral transformations of input data. We are developing equivariance learning methods that allow Transformer models to give consistent output under any such content-neutral perturbations of input data.
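The brittleness described above can be measured with a simple consistency check. The sketch below is only an illustration under stated assumptions: the "models" are hypothetical stand-in functions rather than Transformers, the input is a toy table of (attribute, value) rows, and the content-neutral transformation is row reordering. Sorting rows into a canonical order is shown as a naive workaround, not as the equivariance learning method itself.

```python
import itertools

def order_sensitive_model(rows):
    """Hypothetical stand-in for a sequential model: its output depends
    on which row happens to come first in the linearized input."""
    return rows[0][1]

def order_invariant_model(rows):
    """Naive workaround: canonicalize the row order before predicting."""
    return order_sensitive_model(sorted(rows))

def consistency_rate(model, rows):
    """Fraction of row permutations on which the model's output matches
    its output on the original ordering."""
    reference = model(rows)
    perms = list(itertools.permutations(rows))
    agree = sum(model(list(p)) == reference for p in perms)
    return agree / len(perms)

# Toy structured input; every ordering of the rows carries the same content.
rows = [("name", "Ada Lovelace"), ("born", "1815"), ("field", "mathematics")]

sensitive_rate = consistency_rate(order_sensitive_model, rows)
invariant_rate = consistency_rate(order_invariant_model, rows)
print(sensitive_rate, invariant_rate)
```

The order-sensitive stand-in agrees with itself on only a fraction of the permutations, while the canonicalized one is consistent on all of them. Canonicalization works only when a canonical order exists, which is one reason to learn the property inside the model instead.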
Human languages evolve to communicate about events happening in the real world. Therefore, understanding events plays a critical role in natural language understanding (NLU). A key challenge lies in the fact that events are not just simple, standalone predicates. Rather, they are often described at different granularities, form temporal event processes, and are directed by specific central goals in a context. Our research in this line helps machines understand events described in natural language. This includes understanding how events are connected and how they form processes or structural complexes, as well as recognizing typical properties of events (e.g., space, time, salience, essentiality, implicitness, memberships).
Constructing KGs is usually very costly and requires extensive effort. Representation learning offers a solution to this problem by using relational embeddings to complete KGs. However, such embedding representations have typically been learned separately for each (general or domain-specific) KG. To address this, we study a novel direction of transferable representation learning for KGs, which seeks to associate the interrelated knowledge from different isolated KGs in a common embedding scheme and allows complementary knowledge to easily migrate across different KGs. This project will systematically solve several key technical challenges of leveraging incidental and auxiliary supervision to capture various types of knowledge association. The project will develop technologies to support robust inference in a multi-source knowledge transfer setting with noise-aware meta-learning and constrained inference, particularly for knowledge curated for low-resource domains and languages.
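As a rough illustration of how relational embeddings can complete a KG, the sketch below scores candidate triples with a translation-based model in the style of TransE, where a triple (head, relation, tail) is plausible when head + relation lands close to tail. TransE is chosen here only as a familiar example; the text above does not say which embedding model the project uses. The entities, the relation and the two-dimensional vectors are hand-set toy values, not learned embeddings.

```python
import math

# Hand-set toy embeddings (assumption: in practice these are learned).
entity_vec = {
    "Paris": (1.0, 0.0),
    "France": (1.0, 1.0),
    "Berlin": (3.0, 0.0),
    "Germany": (3.0, 1.0),
}
relation_vec = {"capital_of": (0.0, 1.0)}

def transe_score(head, relation, tail):
    """Negative Euclidean distance between (head + relation) and tail.
    Higher scores mean more plausible triples."""
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    translated = [hi + ri for hi, ri in zip(h, r)]
    return -math.dist(translated, t)

def predict_tail(head, relation):
    """Complete (head, relation, ?) by ranking all other entities."""
    candidates = [e for e in entity_vec if e != head]
    return max(candidates, key=lambda t: transe_score(head, relation, t))

print(predict_tail("Berlin", "capital_of"))
print(predict_tail("Paris", "capital_of"))
```

In this sketch all entities live in one vector space. The transfer setting described above concerns the harder case where each KG has its own space and the association between the spaces has to be learned.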
Various types of commonsense inference tasks remain challenging for state-of-the-art language models. Such tasks include inferring preconditions of facts, typical properties of entities and events (e.g., time, scales and numerical properties), and typical relations (e.g., ordering and membership of events, topological relations of entities). Because annotating data for those aspects of commonsense inference can be costly, we seek to rely as little as possible on expensive annotations. Instead, we develop linguistic pattern mining techniques that find large amounts of cheap (though possibly noisy) supervision data from the Web, leading towards a scalable and generalizable solution for improving commonsense inference based on distant supervision.
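A minimal sketch of the pattern mining idea, under stated assumptions: a single hand-written regular expression (hypothetical, not one of the project's patterns) extracts typical event durations from raw sentences, producing cheap and possibly noisy training signals without manual annotation. The three sentences stand in for Web text.

```python
import re

# Hypothetical pattern: "<verb>-ing ... [usually|typically] takes [about] <N> <unit>".
DURATION_PATTERN = re.compile(
    r"(?P<event>[a-z]+ing(?: [a-z]+)*?) (?:usually |typically )?takes "
    r"(?:about )?(?P<value>\d+) (?P<unit>minutes|hours|days)",
    re.IGNORECASE,
)

def mine_durations(sentences):
    """Return (event, value, unit) tuples for sentences matching the pattern."""
    mined = []
    for sentence in sentences:
        match = DURATION_PATTERN.search(sentence)
        if match:
            mined.append(
                (
                    match.group("event").lower(),
                    int(match.group("value")),
                    match.group("unit").lower(),
                )
            )
    return mined

# Toy sentences standing in for text collected from the Web.
corpus = [
    "Baking a cake usually takes about 40 minutes.",
    "Flying to Tokyo takes 12 hours from here.",
    "The weather was nice.",
]

mined = mine_durations(corpus)
for item in mined:
    print(item)
```

A single pattern like this has low recall and will also pick up atypical statements, which is why the mined data is treated as noisy distant supervision and not as gold labels.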
Multi-modal data such as images, videos and tables may contain rich information that is complementary to that in human language. To synthesize actionable knowledge from semi-structured and multimedia data that would help downstream NLU tasks (e.g., QA and fact verification), a system needs to be able to summarize the salient information and connect it well with natural language. However, this comes with several key challenges: (i) How do we find salient information in data of different structures or displays? (ii) Since different subparts of the structure or media represent different knowledge or facts, how do we foster controlled natural language generation that precisely describes the highlighted knowledge or fact? (iii) How do we enable effective aggregation of information in the generated summaries (e.g., finding the maximum or average, or identifying specific patterns)? (iv) How do we summarize in a way that helps downstream tasks such as question answering and fact verification?
We are always interested in applying our discoveries in NLU and data-driven machine learning in areas of biology, medicine, geology and social sciences.