Investigating machine moral judgement through the Delphi experiment
Liwei Jiang,
Jena D. Hwang,
Chandra Bhagavatula,
Ronan Le Bras,
Jenny T. Liang,
Sydney Levine,
Jesse Dodge,
Keisuke Sakaguchi,
Maxwell Forbes,
Jack Hessel,
Jon Borchardt,
Taylor Sorensen,
Saadia Gabriel,
Yulia Tsvetkov,
Oren Etzioni,
Maarten Sap,
Regina Rini,
Yejin Choi
As our society adopts increasingly powerful artificial intelligence (AI) systems for pervasive use, there are growing concerns about machine moralityâor lack thereof. Millions of users already rely on the outputs of AI systems, such as chatbots, as decision aids. Meanwhile, AI researchers continue to grapple with the challenge of aligning these systems with human morality and values. In response to this challenge, we build and test Delphi, an open-source AI system trained to predict the moral judgements of US participants. The computational framework of Delphi is grounded in the framework proposed by the prominent moral philosopher John Rawls. Our results speak to the promises and limits of teaching machines about human morality. Delphi demonstrates improved generalization capabilities over those exhibited by off-the-shelf neural language models. At the same time, Delphiâs failures also underscore important challenges in this arena. For instance, Delphi has limited cultural awareness and is susceptible to pervasive biases. Despite these shortcomings, we demonstrate several compelling use cases of Delphi, including its incorporation as a component within an ensemble of AI systems. Finally, we computationally demonstrate the potential of Rawlsâs prospect of hybrid approaches for reliable moral reasoning, inspiring future research in computational morality.
@article{jiang2025delphi,author={Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny T. and Levine, Sydney and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Hessel, Jack and Borchardt, Jon and Sorensen, Taylor and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin},date={2025/01/01},doi={10.1038/s42256-024-00969-6},id={Jiang2025},isbn={2522-5839},journal={Nature Machine Intelligence},number={1},pages={145--160},title={Investigating machine moral judgement through the Delphi experiment},url={https://doi.org/10.1038/s42256-024-00969-6},volume={7},year={2025}}
arXiv
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
Go Kamoda,
Benjamin Heinzerling,
Tatsuro Inaba,
Keito Kudo,
Keisuke Sakaguchi,
Kentaro Inui
According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the modelâs "inner vocabulary". Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
@article{kamoda2025detok,title={Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference},author={Kamoda, Go and Heinzerling, Benjamin and Inaba, Tatsuro and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro},journal={arXiv},year={2025},doi={10.48550/arXiv.2501.15754}}
arXiv
FinchGPT: a Transformer based language model for birdsong analysis
The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllables sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPTâs attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the modelâs outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
@article{kobayashi2025finch,title={FinchGPT: a Transformer based language model for birdsong analysis},author={Kobayashi, Kosei and Matsuzaki, Kosuke and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro and Abe, Kentaro},journal={arXiv},year={2025},doi={10.48550/arxiv.2502.00344}}
2024
EMNLP
First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning
"Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process.Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning."
@inproceedings{aoki2024heuristics,title={First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning},author={Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Sone, Shusaku and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro},editor={Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2024},address={Miami, Florida, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.emnlp-main.789/},doi={10.18653/v1/2024.emnlp-main.789},pages={14255--14271}}
Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement.
@inproceedings{brassard2024acorn,title={{ACORN}: Aspect-wise Commonsense Reasoning Explanation Evaluation},author={Brassard, Ana and Heinzerling, Benjamin and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro},booktitle={First Conference on Language Modeling},year={2024},url={https://openreview.net/forum?id=2oHnsM9M9D}}
arXiv
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning
This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to the modelâs internals. Specifically, we inspect when they internally determine their answers, particularly before or after CoT begins, to determine whether models follow a post-hoc "think-to-talk" mode or a step-by-step "talk-to-think" mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models; for example, simple subproblems are solved before CoT begins, and more complicated multi-hop calculations are performed during CoT.
@article{kudo2024think,author={{Kudo}, Keito and {Aoki}, Yoichi and {Kuribayashi}, Tatsuki and {Sone}, Shusaku and {Taniguchi}, Masaya and {Brassard}, Ana and {Sakaguchi}, Keisuke and {Inui}, Kentaro},title={{Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning}},journal={arXiv},year={2024},month=dec,doi={10.48550/arXiv.2412.01113}}
arXiv
Self-Training Meets Consistency: Improving LLMsâ Reasoning With Consistency-Driven Rationale Evaluation
Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
@article{lee2024crest,author={{Lee}, Jaehyeok and {Sakaguchi}, Keisuke and {Bak}, JinYeong},title={{Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale Evaluation}},journal={arXiv},year={2024},month=nov,doi={10.48550/arXiv.2411.06387}}
BEA
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
Natural language processing (NLP) technology has rapidly improved automated grammatical error correction (GEC) tasks, and the GEC community has begun to explore document-level revision. However, there are two major obstacles to going beyond automated \textitsentence-level GEC to NLP-based document-level revision support: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is infeasible to obtain all possible references and evaluate revision quality using such references because there are infinite revision possibilities. To address these challenges, this paper proposes a new document revision corpus, Text Revision of ACL papers (TETRA), in which multiple professional editors have revised academic papers sampled from the ACL anthology. This corpus enables us to focus on document-level and paragraph-level edits, such as edits related to coherence and consistency. Additionally, as a case study using the TETRA corpus, we investigate reference-less and interpretable methods for meta-evaluation to detect quality improvements according to document revisions. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.
@inproceedings{mita2024tetra,title={Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond},author={Mita, Masato and Sakaguchi, Keisuke and Hagiwara, Masato and Mizumoto, Tomoya and Suzuki, Jun and Inui, Kentaro},booktitle={Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)},month=jun,year={2024},address={Mexico City, Mexico},publisher={Association for Computational Linguistics},pages={251--265}}
SigDial
A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying
The evolution of large language models has enabled fluent dialogue, increasing interest in the coexistence of humans and avatars. An essential aspect of achieving this coexistence involves developing sophisticated dialogue systems that can influence user behavior. In this background, we propose an effective multimodal dialogue system designed to promote consensus building with humans. Our system employs a slot-filling strategy to guide discussions and attempts to influence users with suggestions through emotional expression and intent conveyance via its avatar. These innovations have resulted in our system achieving the highest performance in a competition evaluating consensus building between humans and dialogue systems. We hope that our research will promote further discussion on the development of dialogue systems that enhance consensus building in human collaboration.
@inproceedings{nozue2024multimodal,title={A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying},author={Nozue, Shinnosuk and Nakano, Yuto and Moriya, Shoji and Ariyama, Tomoki and Kokuta, Kazuma and Xie, Suchun and Sato, Kai and Sone, Shusaku and Kamei, Ryohei and Akama, Reina and Matsubayashi, Yuichiroh and Sakaguchi, Keisuke},booktitle={Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue},month=sep,year={2024},address={Kyoto, Japan},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.sigdial-1.57},doi={10.18653/v1/2024.sigdial-1.57},pages={669--673}}
arXiv
Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection
Subaru Kimura,
Ryota Tanaka,
Shumpei Miyawaki,
Jun Suzuki,
Keisuke Sakaguchi
We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
@article{kimura2024hijack,author={{Kimura}, Subaru and {Tanaka}, Ryota and {Miyawaki}, Shumpei and {Suzuki}, Jun and {Sakaguchi}, Keisuke},title={{Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection}},journal={arXiv},year={2024},month=aug,doi={10.48550/arXiv.2408.03554}}
arXiv
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit this https URL.
@article{llmjp2024,author={{LLM-jp} and {:} and {Aizawa}, Akiko and {Aramaki}, Eiji and {Chen}, Bowen and {Cheng}, Fei and {Deguchi}, Hiroyuki and {Enomoto}, Rintaro and {Fujii}, Kazuki and {Fukumoto}, Kensuke and {Fukushima}, Takuya and {Han}, Namgi and {Harada}, Yuto and {Hashimoto}, Chikara and {Hiraoka}, Tatsuya and {Hisada}, Shohei and {Hosokawa}, Sosuke and {Jie}, Lu and {Kamata}, Keisuke and {Kanazawa}, Teruhito and {Kanezashi}, Hiroki and {Kataoka}, Hiroshi and {Katsumata}, Satoru and {Kawahara}, Daisuke and {Kawano}, Seiya and {Keyaki}, Atsushi and {Kiryu}, Keisuke and {Kiyomaru}, Hirokazu and {Kodama}, Takashi and {Kubo}, Takahiro and {Kuga}, Yohei and {Kumon}, Ryoma and {Kurita}, Shuhei and {Kurohashi}, Sadao and {Li}, Conglong and {Maekawa}, Taiki and {Matsuda}, Hiroshi and {Miyao}, Yusuke and {Mizuki}, Kentaro and {Mizuki}, Sakae and {Murawaki}, Yugo and {Mousterou}, Akim and {Nakamura}, Ryo and {Nakamura}, Taishi and {Nakayama}, Kouta and {Nakazato}, Tomoka and {Niitsuma}, Takuro and {Nishitoba}, Jiro and {Oda}, Yusuke and {Ogawa}, Hayato and {Okamoto}, Takumi and {Okazaki}, Naoaki and {Oseki}, Yohei and {Ozaki}, Shintaro and {Ryu}, Koki and {Rzepka}, Rafal and {Sakaguchi}, Keisuke and {Sasaki}, Shota and {Sekine}, Satoshi and {Suda}, Kohei and {Sugawara}, Saku and {Sugiura}, Issa and {Sugiyama}, Hiroaki and {Suzuki}, Hisami and {Suzuki}, Jun and {Suzumura}, Toyotaro and {Tachibana}, Kensuke and {Takagi}, Yu and {Takami}, Kyosuke and {Takeda}, Koichi and {Takeshita}, Masashi and {Tanaka}, Masahiro and {Taura}, Kenjiro and {Tolmachev}, Arseny and {Ueda}, Nobuhiro and {Wan}, Zhen and {Yada}, Shuntaro and {Yahata}, Sakiko and {Yamamoto}, Yuya and {Yamauchi}, Yusuke and {Yanaka}, Hitomi and {Yokota}, Rio and {Yoshino}, Koichiro},title={{LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs}},journal={arXiv e-prints},keywords={Computer Science - Computation and Language, Computer Science - Artificial Intelligence},year={2024},month=jul,doi={10.48550/arXiv.2407.03963}}
arXiv
The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models
Ryosuke Takahashi,
Go Kamoda,
Benjamin Heinzerling,
Keisuke Sakaguchi,
Kentaro Inui
Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments.
@article{takahashi2024edit,author={{Takahashi}, Ryosuke and {Kamoda}, Go and {Heinzerling}, Benjamin and {Sakaguchi}, Keisuke and {Inui}, Kentaro},title={{The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models}},journal={arXiv},year={2024},month=jun,doi={10.48550/arXiv.2406.06032}}
MORPHON
J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema
We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the languageâs agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] + suru (do-PRS)). Morphologically, this inflection pattern is same as the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We will release J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.
@inproceedings{matsuzaki2024junimorph,title={{J}-{U}ni{M}orph: {J}apanese Morphological Annotation through the Universal Feature Schema},author={Matsuzaki, Kosuke and Taniguchi, Masaya and Inui, Kentaro and Sakaguchi, Keisuke},booktitle={Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology},month=jun,year={2024},address={Mexico City, Mexico},publisher={Association for Computational Linguistics},pages={7--19}}
LREC-COLING
Beam Decoding with Controlled Patience
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Dragomir Radev,
Yejin Choi,
Noah A. Smith
Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
May
2024
Text generation with beam search has proven successful in a wide range of applications. The commonly-used implementation of beam decoding follows a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. We introduce a patience factor, a simple modification to this decoding algorithm, that generalizes the stopping criterion and provides flexibility to the depth of search. Extensive empirical results demonstrate that the patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation.
@inproceedings{kasai2024beam,title={Beam Decoding with Controlled Patience},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Radev, Dragomir and Choi, Yejin and Smith, Noah A.},booktitle={{Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},year={2024},month=may}
ICLR
PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning
Procedural planning, which entails decomposing a high-level goal into a sequence of temporally ordered steps, is an important yet intricate task for machines. It involves integrating common-sense knowledge to reason about complex and often contextualized situations, e.g. âscheduling a doctorâs appointment without a phoneâ. While current approaches show encouraging results using large language models (LLMs), they are hindered by drawbacks such as costly API calls and reproducibility issues. In this paper, we advocate planning using smaller language models. We present PlaSma, a novel two-pronged approach to endow small language models with procedural knowledge and (constrained) language-based planning capabilities. More concretely, we develop symbolic procedural knowledge distillation to enhance the commonsense knowledge in small language models and aninference-time algorithm to facilitate more structured and accurate reasoning. In addition, we introduce a new related task, Replanning, that requires a revision of a plan to cope with a constrained situation. In both the planning and replanning settings, we show that orders-of-magnitude smaller models (770M-11B parameters) can compete and often surpass their larger teacher modelsâ capabilities. Finally, we showcase successful application of PlaSma in an embodied environment, VirtualHome.
@inproceedings{brahman2024plasma,title={PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning},author={Brahman, Faeze and Bhagavatula, Chandra and Pyatkin, Valentina and Hwang, Jena D. and Li, Xiang Lorraine and Arai, Hirona Jacqueline and Sanyal, Soumya and Sakaguchi, Keisuke and Ren, Xiang and Choi, Yejin},booktitle={The Twelfth International Conference on Learning Representations},year={2024},url={https://github.com/allenai/PlaSma},month=may}
We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open domain QA datasets and pursues, instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that RealTime QA will spur progress in instantaneous applications of question answering and beyond.
@inproceedings{kasai2023realtime,title={RealTime {QA}: What's the Answer Right Now?},author={Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Bras, Ronan Le and Asai, Akari and Yu, Xinyan Velocity and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro},booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},year={2023},url={https://openreview.net/forum?id=HfKOIPCvsv},month=dec}
EMNLP Findings
Test-time Augmentation for Factual Probing
Go Kamoda,
Benjamin Heinzerling,
Keisuke Sakaguchi,
Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2023
Dec
2023
Factual probing is a method that uses prompts to test if a language model âknowsâ certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
@inproceedings{kamoda2023tta,title={Test-time Augmentation for Factual Probing},author={Kamoda, Go and Heinzerling, Benjamin and Sakaguchi, Keisuke and Inui, Kentaro},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},month=dec,year={2023},address={Singapore},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2023.findings-emnlp.236},doi={10.18653/v1/2023.findings-emnlp.236},pages={3650--3661}}
ACL
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
Chandra Bhagavatula,
Jena D Hwang,
Doug Downey,
Ronan Le Bras,
Ximing Lu,
Keisuke Sakaguchi,
Swabha Swayamdipta,
Peter West,
Yejin Choi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jul
2023
Commonsense capabilities of pre-trained language models dramatically improve with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms?The key intellectual challenge is to design a learning algorithm that achieve a competitive level of commonsense acquisition, without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly.We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the modelâs own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.
@inproceedings{bhagavatula2023i2d2,title={I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation},author={Bhagavatula, Chandra and Hwang, Jena D and Downey, Doug and Bras, Ronan Le and Lu, Ximing and Sakaguchi, Keisuke and Swayamdipta, Swabha and West, Peter and Choi, Yejin},year={2023},booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=jul,address={Toronto, Canada},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.acl-long.535},pages={9614--9630}}
ACL
ELQA: A Corpus of Metalinguistic Questions and Answers about English
Shabnam Behzad,
Keisuke Sakaguchi,
Nathan Schneider,
Amir Zeldes
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jul
2023
We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorrect) usage examples. Unlike most NLP datasets, this corpus is metalinguisticâit consists of language about language. As such, it can facilitate investigations of the metalinguistic capabilities of NLU models, as well as educational applications in the language learning domain. To study this, we define a free-form question answering task on our dataset and conduct evaluations on multiple LLMs (Large Language Models) to analyze their capacity to generate metalinguistic answers.
@inproceedings{behzad2023elqa,title={ELQA: A Corpus of Metalinguistic Questions and Answers about English},author={Behzad, Shabnam and Sakaguchi, Keisuke and Schneider, Nathan and Zeldes, Amir},booktitle={{Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}},year={2023},month=jul,address={Toronto, Canada},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.acl-long.113},pages={2031--2047}}
arXiv
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMsâ potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
@article{kasai2023med,author={{Kasai}, Jungo and {Kasai}, Yuhei and {Sakaguchi}, Keisuke and {Yamada}, Yutaro and {Radev}, Dragomir},title={{Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.18027}}
arXiv
An Analysis of GPT-3âs Performance in Grammatical Error Correction
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of NaturalLanguage Processing tasks. However, there isa relative lack of detailed published analysisof their performance on the task of grammatical error correction (GEC). To address this,we perform experiments testing the capabilitiesof a GPT-3.5 model (text-davinci-003)and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shotand few-shot settings, analyzing intriguing orproblematic outputs encountered with differentprompt formats. We report the performance ofour best prompt on the BEA-2019 and JFLEGdatasets, finding that the GPT models can perform well in a sentence-level revision setting,with GPT-4 achieving a new high score on theJFLEG benchmark. Through human evaluation experiments, we compare the GPT modelsâcorrections to source, human reference, andbaseline GEC system sentences and observedifferences in editing strategies and how theyare scored by human raters.
@article{coyne2023gptgec,author={{Coyne}, Steven and {Sakaguchi}, Keisuke},title={{An Analysis of GPT-3's Performance in Grammatical Error Correction}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.14342}}
arXiv
Causal schema induction for knowledge discovery
Michael Regan,
Jena D. Hwang,
Keisuke Sakaguchi,
James Pustejovsky
Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning, however resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.
@article{regan2023causalschema,author={{Regan}, Michael and {Hwang}, Jena D. and {Sakaguchi}, Keisuke and {Pustejovsky}, James},title={{Causal schema induction for knowledge discovery}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.15381}}
EACL
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?
Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.
@inproceedings{Kudo2023eacl,title={Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?},author={Kudo, Keito and Aoki, Yoichi and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro},year={2023},booktitle={Proceedings of the 2023 Conference of the {E}uropean Chapter of the Association for Computational Linguistics},month=may,publisher={Association for Computational Linguistics},address={Dubrovnik, Croatia},pages={1351--1362},doi={10.18653/v1/2023.eacl-main.98}}
EACL Findings
Empirical Investigation of Neural Symbolic Reasoning Strategies
Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear.Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning.Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer.Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation.Our results indicate the importance of further exploring effective strategies for neural reasoning models.
@inproceedings{Aoki2023eacl,title={Empirical Investigation of Neural Symbolic Reasoning Strategies},author={Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro},year={2023},booktitle={Findings of the Association for Computational Linguistics: EACL 2023 },doi={10.18653/v1/2023.findings-eacl.86},pages={1154--1162},month=may,address={Dubrovnik, Croatia},publisher={Association for Computational Linguistics}}
Jxiv
Evaluating GPT in Japanese Bar Examination: Insights and Limitations
Large-scale language models like ChatGPT have been reported to exceed the accuracy of human experts in a wide range of tasks. Recent research reports that ChatGPT passed the Japanese National Medical Examination, confirming its high performance in Japanese. We evaluated the accuracy of GPT-3, GPT-4, and ChatGPT in the Japanese Bar Examination (the multiple-choice format section), focusing on Constitutional Law, Civil Law, and Criminal Law over the past five years. The results revealed that the current correct answer rate for these exams is only 30-40% (compared to the average pass rate of 70%), which is significantly low. This study went beyond just the correct answer rate, dissecting the necessary reasoning and knowledge for the responses, and examining the performance of large-scale language models from each perspective. The findings show that 1) large-scale language models possess extensive knowledge of many statutes, 2) they have a high correct answer rate for questions that require understanding of legal theories but not specific knowledge of law, and 3) they have a low correct answer rate for questions requiring knowledge of case law. The primary reason for their lower performance compared to the American Bar Examination is thought to be a lack of knowledge in Japanese law, especially in case law.
@techreport{choi_et_al_2023_j_bar_exam_en,author={Choi, Jungmin and Kasai, Jungo and Sakaguchi, Keisuke},title={{Evaluating GPT in Japanese Bar Examination: Insights and Limitations}},year={2023},month=dec,booktitle={Jxiv}}
2022
EMNLP
Twist Decoding: Diverse Generators Guide Each Other
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Hao Peng,
Ximing Lu,
Dragomir Radev,
Yejin Choi,
Noah A. Smith
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Dec
2022
Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.
@inproceedings{kasai2022twist,title={Twist Decoding: Diverse Generators Guide Each Other},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Peng, Hao and Lu, Ximing and Radev, Dragomir and Choi, Yejin and Smith, Noah A.},booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},month=dec,year={2022},address={Abu Dhabi, UAE},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2022.emnlp-main.326},pages={4909--4923}}
arXiv
Can Machines Learn Morality? The Delphi Experiment
Liwei Jiang,
Jena D. Hwang,
Chandra Bhagavatula,
Ronan Le Bras,
Jenny Liang,
Jesse Dodge,
Keisuke Sakaguchi,
Maxwell Forbes,
Jon Borchardt,
Saadia Gabriel,
Yulia Tsvetkov,
Oren Etzioni,
Maarten Sap,
Regina Rini,
Yejin Choi
As AI systems become increasingly powerful and pervasive, there are growing concerns about machinesâ morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it.
To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense.
Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.
@article{jiang2022delphi,title={Can Machines Learn Morality? The Delphi Experiment},author={Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Borchardt, Jon and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin},journal={arXiv},year={2022},volume={abs/2110.07574},doi={10.48550/ARXIV.2110.07574}}
NAACL
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Lavinia Dunagan,
Jacob Morrison,
Alexander R. Fabbri,
Yejin Choi,
Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Jul
2022
Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (BILLBOARDs), that simultaneously tracks progress in language generation tasks and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a BILLBOARD accepts both generators and evaluation metrics as competing entries. A BILLBOARD automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlations with human judgments. We release four BILLBOARDs for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
@inproceedings{Kasai2022BidimensionalLG,title={Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Dunagan, Lavinia and Morrison, Jacob and Fabbri, Alexander R. and Choi, Yejin and Smith, Noah A.},year={2022},booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},month=jul,pages={3540--3557},address={Seattle, United States},publisher={Association for Computational Linguistics}}
NAACL
Transparent Human Evaluation for Image Captioning
Jungo Kasai,
Keisuke Sakaguchi,
Lavinia Dunagan,
Jacob Morrison,
Ronan Le Bras,
Yejin Choi,
Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Jul
2022
We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
@inproceedings{Kasai2022TransparentHE,title={Transparent Human Evaluation for Image Captioning},author={Kasai, Jungo and Sakaguchi, Keisuke and Dunagan, Lavinia and Morrison, Jacob and Bras, Ronan Le and Choi, Yejin and Smith, Noah A.},year={2022},booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},pages={3464--3478},month=jul,address={Seattle, Washington},publisher={Association for Computational Linguistics}}
IMLW@AAAI
Interscript: A dataset for interactive learning of scripts through error feedback
Niket Tandon,
Aman Madaan,
Peter Clark,
Keisuke Sakaguchi,
Yiming Yang
The AAAI-22 Workshop on Interactive Machine Learning
2022
How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, INTERSCRIPT, containing user feedback on a deployed model that generates complex everyday tasks. INTERSCRIPT contains 8,466 data pointsâ the input is a possibly erroneous script and a user feedback and the output is a modified script. We posit two use-cases of INTERSCRIPT that might significantly advance the state-of-the-art in interactive.
@inproceedings{Tandon2021InterscriptAD,title={Interscript: A dataset for interactive learning of scripts through error feedback},author={Tandon, Niket and Madaan, Aman and Clark, Peter and Sakaguchi, Keisuke and Yang, Yiming},year={2022},booktitle={The AAAI-22 Workshop on Interactive Machine Learning}}
2021
arXiv
Improving Neural Model Performance through Natural Language Feedback on Their Explanations
Aman Madaan,
Niket Tandon,
Dheeraj Rajagopal,
Yiming Yang,
Peter Clark,
Keisuke Sakaguchi,
Eduard H. Hovy
A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIEan interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies as compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 points on accuracy on defeasible reasoning across all three domains.
@article{Madaan2021ImprovingNM,title={Improving Neural Model Performance through Natural Language Feedback on Their Explanations},author={Madaan, Aman and Tandon, Niket and Rajagopal, Dheeraj and Yang, Yiming and Clark, Peter and Sakaguchi, Keisuke and Hovy, Eduard H.},journal={arXiv},year={2021},volume={abs/2104.08765}}
arXiv
GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education
We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education. The model architecture enables it to learn from a small amount of texts annotated with spans and their labels, which 1) enables easier and more intuitive annotation, 2) supports overlapping spans, and 3) is less prone to error propagation, compared to complex hand-crafted rules defined on constituency/dependency parses. We show that we can bootstrap a grammar profiler model with F1 â 0.6 from only a couple hundred sentences both in English and Chinese, which can be further boosted via learning a multilingual model. With GrammarTagger, we also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features.
@article{Hagiwara2021GrammarTaggerAM,title={GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education},author={Hagiwara, Masato and Tanner, Joshua and Sakaguchi, Keisuke},journal={arXiv},year={2021},volume={abs/2104.03190}}
EMNLP Findings
proScript: Partially Ordered Scripts Generation
Keisuke Sakaguchi,
Chandra Bhagavatula,
Ronan Le Bras,
Niket Tandon,
Peter Clark,
Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2021
Nov
2021
Scripts â prototypical event sequences describing everyday activities â have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collect a large (6.4k) crowdsourced partially ordered scripts (named proScript), that is substantially larger than prior datasets, and develop models that generate scripts by combining language generation and graph structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 on task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.
@inproceedings{sakaguchi-etal-2021-proscript-partially,title={pro{S}cript: Partially Ordered Scripts Generation},author={Sakaguchi, Keisuke and Bhagavatula, Chandra and Le Bras, Ronan and Tandon, Niket and Clark, Peter and Choi, Yejin},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},month=nov,year={2021},address={Punta Cana, Dominican Republic},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2021.findings-emnlp.184},pages={2138--2149}}
CACM
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi,
Ronan Le Bras,
Chandra Bhagavatula,
Yejin Choi
Commonsense reasoning remains a major challenge in AI, and yet, recent progresses on benchmarks may seem to suggest otherwise. In particular, the recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important questionâwhether these models have truly acquired robust commonsense capabilities or they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense.To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WINOGRANDE compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset.Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
@article{10.1145/3474381,author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},year={2021},issue_date={September 2021},publisher={Association for Computing Machinery},address={New York, NY, USA},volume={64},number={9},issn={0001-0782},url={https://doi.org/10.1145/3474381},doi={10.1145/3474381},journal={Commun. ACM},month=aug,pages={99--106},numpages={8}}
AAAI
COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs
Jena D. Hwang,
Chandra Bhagavatula,
Ronan Le Bras,
Jeff Da,
Keisuke Sakaguchi,
Antoine Bosselut,
Yejin Choi
Proceedings of the AAAI Conference on Artificial Intelligence
May
2021
Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge.
In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them.
With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains  12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.
@article{Hwang2021COMETATOMIC2O,title={COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs},journal={Proceedings of the AAAI Conference on Artificial Intelligence},author={Hwang, Jena D. and Bhagavatula, Chandra and Le Bras, Ronan and Da, Jeff and Sakaguchi, Keisuke and Bosselut, Antoine and Choi, Yejin},volume={35},number={7},year={2021},month=may,pages={6384--6392}}
2020
EMNLP
A Dataset for Tracking Entities in Open Domain Procedural Text
Niket Tandon,
Keisuke Sakaguchi,
Bhavana Dalvi,
Dheeraj Rajagopal,
Peter Clark,
Michal Guerquin,
Kyle Richardson,
Eduard Hovy
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Nov
2020
We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OPENPI, a high-quality (91.5% coverage as judged by humans and completely vetted), and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on BLEU metric, leaving enough room for novel model architectures.
@inproceedings{tandon-etal-2020-dataset,title={A Dataset for Tracking Entities in Open Domain Procedural Text},author={Tandon, Niket and Sakaguchi, Keisuke and Dalvi, Bhavana and Rajagopal, Dheeraj and Clark, Peter and Guerquin, Michal and Richardson, Kyle and Hovy, Eduard},booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},month=nov,year={2020},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2020.emnlp-main.520},doi={10.18653/v1/2020.emnlp-main.520},pages={6408--6417}}
ACL
Uncertain Natural Language Inference
Tongfei Chen,
Zhengping Jiang,
Adam Poliak,
Keisuke Sakaguchi,
Benjamin Van Durme
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Jul
2020
We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.
@inproceedings{chen-etal-2020-uncertain,title={Uncertain Natural Language Inference},author={Chen, Tongfei and Jiang, Zhengping and Poliak, Adam and Sakaguchi, Keisuke and Van Durme, Benjamin},booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},month=jul,year={2020},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2020.acl-main.774},doi={10.18653/v1/2020.acl-main.774},pages={8772--8779}}
LREC
The Universal Decompositional Semantics Dataset and Decomp Toolkit
Aaron Steven White,
Elias Stengel-Eskin,
Siddharth Vashishtha,
Venkata Subrahmanyan Govindarajan,
Dee Ann Reisinger,
Tim Vieira,
Keisuke Sakaguchi,
Sheng Zhang,
Francis Ferraro,
Rachel Rudinger,
Kyle Rawlins,
Benjamin Van Durme
Proceedings of the 12th Language Resources and Evaluation Conference
May
2020
We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specificationâwith graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using sophisticated normalization procedures. The Decomp toolkit provides a suite of Python 3 tools for querying UDS graphs using SPARQL. Both UDS1.0 and Decomp0.1 are publicly available at http://decomp.io.
@inproceedings{white-etal-2020-universal,title={The Universal Decompositional Semantics Dataset and Decomp Toolkit},author={White, Aaron Steven and Stengel-Eskin, Elias and Vashishtha, Siddharth and Govindarajan, Venkata Subrahmanyan and Reisinger, Dee Ann and Vieira, Tim and Sakaguchi, Keisuke and Zhang, Sheng and Ferraro, Francis and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin},booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},month=may,year={2020},address={Marseille, France},publisher={European Language Resources Association},url={https://aclanthology.org/2020.lrec-1.699},pages={5698--5707},language={English},isbn={979-10-95546-34-4}}
ICLR
Abductive Commonsense Reasoning
Chandra Bhagavatula,
Ronan Le Bras,
Chaitanya Malaviya,
Keisuke Sakaguchi,
Ari Holtzman,
Hannah Rashkin,
Doug Downey,
Wen-tau Yih,
Yejin Choi
International Conference on Learning Representations
2020
@inproceedings{bhagavatula2020abductive,title={Abductive Commonsense Reasoning},author={Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin},booktitle={International Conference on Learning Representations},year={2020},url={https://openreview.net/forum?id=Byg1v1HKDB}}
AAAI
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi,
Ronan Le Bras,
Chandra Bhagavatula,
Yejin Choi
Proceedings of the AAAI Conference on Artificial Intelligence
Apr
2020
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4 â 79.1%, which are âŒ15-35% (absolute) below human performance of 94.0%, depending on the amount of the training data allowed (2% â 100% respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks â WSC (â 90.1%), DPR (â 93.1%), COPA(â 90.6%), KnowRef (â 85.6%), and Winogender (â 97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
@article{Sakaguchi-etal-2020-winogrande,title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},volume={34},url={https://ojs.aaai.org/index.php/AAAI/article/view/6399},doi={10.1609/aaai.v34i05.6399},number={05},journal={Proceedings of the AAAI Conference on Artificial Intelligence},author={Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin},year={2020},month=apr,pages={8732--8740}}
2019
EMNLP
WIQA: A dataset for âWhat if...â reasoning over procedural text
Niket Tandon,
Bhavana Dalvi,
Keisuke Sakaguchi,
Peter Clark,
Antoine Bosselut
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Nov
2019
We introduce WIQA, the first large-scale dataset of âWhat if...â questions over procedural text. WIQA contains a collection of paragraphs, each annotated with multiple influence graphs describing how one change affects another, and a large (40k) collection of âWhat if...?â multiple-choice questions derived from these. For example, given a paragraph about beach erosion, would stormy weather hasten or decelerate erosion? WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations. We find that state-of-the-art models achieve 73.8% accuracy, well below the human performance of 96.3%. We analyze the challenges, in particular tracking chains of influences, and present the dataset as an open challenge to the community.
@inproceedings{tandon-etal-2019-wiqa,title={{WIQA}: A dataset for {``}What if...{''} reasoning over procedural text},author={Tandon, Niket and Dalvi, Bhavana and Sakaguchi, Keisuke and Clark, Peter and Bosselut, Antoine},booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},month=nov,year={2019},address={Hong Kong, China},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D19-1629},doi={10.18653/v1/D19-1629},pages={6076--6085}}
2018
ACL
Efficient Online Scalar Annotation with Bounded Support
Keisuke Sakaguchi,
Benjamin Van Durme
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jul
2018
We describe a novel method for efficiently eliciting scalar annotations for dataset construction and system quality estimation by human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores derive from annotator comparison of items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels) proposed here. Our proposal leads to increased correlation with ground truth, at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.
@inproceedings{sakaguchi-van-durme-2018-efficient,title={Efficient Online Scalar Annotation with Bounded Support},author={Sakaguchi, Keisuke and Van Durme, Benjamin},booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=jul,year={2018},address={Melbourne, Australia},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P18-1020},doi={10.18653/v1/P18-1020},pages={208--218}}
2017
IJCNLP
Grammatical Error Correction with Neural Reinforcement Learning
Keisuke Sakaguchi,
Matt Post,
Benjamin Van Durme
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Nov
2017
We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
@inproceedings{sakaguchi-etal-2017-grammatical,title={Grammatical Error Correction with Neural Reinforcement Learning},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},month=nov,year={2017},address={Taipei, Taiwan},publisher={Asian Federation of Natural Language Processing},url={https://aclanthology.org/I17-2062},pages={366--372}}
BEA
GEC into the future: Where are we going and how do we get there?
Keisuke Sakaguchi,
Courtney Napoles,
Joel Tetreault
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Sep
2017
The field of grammatical error correction (GEC) has made tremendous bounds in the last ten years, but new questions and obstacles are revealing themselves. In this position paper, we discuss the issues that need to be addressed and provide recommendations for the field to continue to make progress, and propose a new shared task. We invite suggestions and critiques from the audience to make the new shared task a community-driven venture.
@inproceedings{sakaguchi-etal-2017-gec,title={{GEC} into the future: Where are we going and how do we get there?},author={Sakaguchi, Keisuke and Napoles, Courtney and Tetreault, Joel},booktitle={Proceedings of the 12th Workshop on Innovative Use of {NLP} for Building Educational Applications},month=sep,year={2017},address={Copenhagen, Denmark},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W17-5019},doi={10.18653/v1/W17-5019},pages={180--187}}
ACL
Error-repair Dependency Parsing for Ungrammatical Texts
Keisuke Sakaguchi,
Matt Post,
Benjamin Van Durme
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Jul
2017
We propose a new dependency parsing scheme which jointly parses a sentence and repairs grammatical errors by extending the non-directional transition-based formalism of Goldberg and Elhadad (2010) with three additional actions: SUBSTITUTE, DELETE, INSERT. Because these actions may cause an infinite loop in derivation, we also introduce simple constraints that ensure the parser termination. We evaluate our model with respect to dependency accuracy and grammaticality improvements for ungrammatical sentences, demonstrating the robustness and applicability of our scheme.
@inproceedings{sakaguchi-etal-2017-error,title={Error-repair Dependency Parsing for Ungrammatical Texts},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},month=jul,year={2017},address={Vancouver, Canada},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P17-2030},doi={10.18653/v1/P17-2030},pages={189--195}}
EACL
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction
Courtney Napoles,
Keisuke Sakaguchi,
Joel Tetreault
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Apr
2017
We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.
@inproceedings{napoles-etal-2017-jfleg,title={{JFLEG}: A Fluency Corpus and Benchmark for Grammatical Error Correction},author={Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel},booktitle={Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},month=apr,year={2017},address={Valencia, Spain},publisher={Association for Computational Linguistics},url={https://aclanthology.org/E17-2037},pages={229--234}}
AAAI
Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network
Keisuke Sakaguchi,
Kevin Duh,
Matt Post,
Benjamin Van Durme
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence
2017
Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise.Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers and character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.
@inproceedings{10.5555/3298023.3298045,author={Sakaguchi, Keisuke and Duh, Kevin and Post, Matt and Durme, Benjamin Van},title={Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network},year={2017},publisher={AAAI Press},booktitle={Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence},pages={3281--3287},numpages={7},location={San Francisco, California, USA},series={AAAI'17}}
2016
EMNLP
Universal Decompositional Semantics on Universal Dependencies
Aaron Steven White,
Drew Reisinger,
Keisuke Sakaguchi,
Tim Vieira,
Sheng Zhang,
Rachel Rudinger,
Kyle Rawlins,
Benjamin Van Durme
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Nov
2016
We present a framework for augmenting data sets from the Universal Dependencies project with Universal Decompositional Semantics. Where the Universal Dependencies project aims to provide a syntactic annotation standard that can be used consistently across many languages as well as a collection of corpora that use that standard, our extension has similar aims for semantic annotation. We describe results from annotating the English Universal Dependencies treebank, dealing with word senses, semantic roles, and event properties.
@inproceedings{white-etal-2016-universal,title={Universal Decompositional Semantics on {U}niversal {D}ependencies},author={White, Aaron Steven and Reisinger, Drew and Sakaguchi, Keisuke and Vieira, Tim and Zhang, Sheng and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin},booktitle={Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2016},address={Austin, Texas},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D16-1177},doi={10.18653/v1/D16-1177},pages={1713--1723}}
EMNLP
Thereâs No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction
Courtney Napoles,
Keisuke Sakaguchi,
Joel Tetreault
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Nov
2016
Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
@inproceedings{napoles-etal-2016-theres,title={There{'}s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction},author={Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel},booktitle={Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2016},address={Austin, Texas},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D16-1228},doi={10.18653/v1/D16-1228},pages={2109--2115}}
ACL
Phrase Structure Annotation and Parsing for Learner English
Ryo Nagata,
Keisuke Sakaguchi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aug
2016
There has been almost no work on phrase structure annotation and parsing specially designed for learner English despite the fact that they are useful for representing the structural characteristics of learner English. To address this problem, in this paper, we first propose a phrase structure annotation scheme for learner English and annotate two different learner corpora using it. Second, we show their usefulness, reporting on (a) inter-annotator agreement rate, (b) characteristic CFG rules in the corpora, and (c) parsing performance on them. In addition, we explore methods to improve phrase structure parsing for learner English (achieving an F -measure of 0.878). Finally, we release the full annotation guidelines, the annotated data, and the improved parser model for learner English to the public.
@inproceedings{nagata-sakaguchi-2016-phrase,title={Phrase Structure Annotation and Parsing for Learner {E}nglish},author={Nagata, Ryo and Sakaguchi, Keisuke},booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=aug,year={2016},address={Berlin, Germany},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P16-1173},doi={10.18653/v1/P16-1173},pages={1837--1847}}
TACL
Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality
Keisuke Sakaguchi,
Courtney Napoles,
Matt Post,
Joel Tetreault
Transactions of the Association for Computational Linguistics
2016
The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GECâs reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (Ï = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
@article{sakaguchi-etal-2016-reassessing,title={Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality},author={Sakaguchi, Keisuke and Napoles, Courtney and Post, Matt and Tetreault, Joel},journal={Transactions of the Association for Computational Linguistics},volume={4},year={2016},url={https://aclanthology.org/Q16-1013},doi={10.1162/tacl_a_00091},pages={169--182}}
arXiv
GLEU Without Tuning
Courtney Napoles,
Keisuke Sakaguchi,
Matt Post,
Joel R. Tetreault
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
@article{Napoles2016GLEUWT,title={GLEU Without Tuning},author={Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel R.},journal={arXiv},year={2016},volume={abs/1605.02592}}
2015
ACL
Ground Truth for Grammatical Error Correction Metrics
Courtney Napoles,
Keisuke Sakaguchi,
Matt Post,
Joel Tetreault
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Jul
2015
How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
@inproceedings{napoles-etal-2015-ground,title={Ground Truth for Grammatical Error Correction Metrics},author={Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},month=jul,year={2015},address={Beijing, China},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P15-2097},doi={10.3115/v1/P15-2097},pages={588--593}}
NAACL
Effective Feature Integration for Automated Short Answer Scoring
Keisuke Sakaguchi,
Michael Heilman,
Nitin Madnani
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
May
2015
A major opportunity for NLP to have a realworld impact is in helping educators score student writing, particularly content-based writing (i.e., the task of automated short answer scoring). A major challenge in this enterprise is that scored responses to a particular question (i.e., labeled data) are valuable for modeling but limited in quantity. Additional information from the scoring guidelines for humans, such as exemplars for each score level and descriptions of key concepts, can also be used. Here, we explore methods for integrating scoring guidelines and labeled responses, and we find that stacked generalization (Wolpert, 1992) improves performance, especially for small training sets.
@inproceedings{sakaguchi-etal-2015-effective,title={Effective Feature Integration for Automated Short Answer Scoring},author={Sakaguchi, Keisuke and Heilman, Michael and Madnani, Nitin},booktitle={Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies},month=may,year={2015},address={Denver, Colorado},publisher={Association for Computational Linguistics},url={https://aclanthology.org/N15-1111},doi={10.3115/v1/N15-1111},pages={1049--1054}}
2014
WMT
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi,
Matt Post,
Benjamin Van Durme
Proceedings of the Ninth Workshop on Statistical Machine Translation
Jun
2014
A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentencelevel comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection. We continue this line of work by adapting the TrueSkill TM algorithm â an online approach for modeling the relative skills of players in ongoing competitions, such as Microsoftâs Xbox Live â to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
@inproceedings{sakaguchi-etal-2014-efficient,title={Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the Ninth Workshop on Statistical Machine Translation},month=jun,year={2014},address={Baltimore, Maryland, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W14-3301},doi={10.3115/v1/W14-3301},pages={1--11}}
2013
ACL
Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Keisuke Sakaguchi,
Yuki Arase,
Mamoru Komachi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Aug
2013
We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learnersâ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learnersâ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
@inproceedings{sakaguchi-etal-2013-discriminative,title={Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners},author={Sakaguchi, Keisuke and Arase, Yuki and Komachi, Mamoru},booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},month=aug,year={2013},address={Sofia, Bulgaria},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P13-2043},pages={238--242}}
CoNLL
NAIST at 2013 CoNLL Grammatical Error Correction Shared Task
This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subjectverb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for preposition and determiner errors. As for subject-verb agreement errors, we show that the Treelet Language Model-based approach can correct errors in which the target verb is distant from its subject. Our system ranked fourth on the official run.
@inproceedings{yoshimoto-etal-2013-naist,title={{NAIST} at 2013 {C}o{NLL} Grammatical Error Correction Shared Task},author={Yoshimoto, Ippei and Kose, Tomoya and Mitsuzawa, Kensuke and Sakaguchi, Keisuke and Mizumoto, Tomoya and Hayashibe, Yuta and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task},month=aug,year={2013},address={Sofia, Bulgaria},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-3604},pages={26--33}}
BEA
NAIST at the NLI 2013 Shared Task
Tomoya Mizumoto,
Yuta Hayashibe,
Keisuke Sakaguchi,
Mamoru Komachi,
Yuji Matsumoto
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
Jun
2013
This paper describes the Nara Institute of Science and Technology (NAIST) native language identification (NLI) system in the NLI 2013 Shared Task. We apply feature selection using a measure based on frequency for the closed track and try Capping and Sampling data methods for the open tracks. Our system ranked ninth in the closed track, third in open track 1 and fourth in open track 2.
@inproceedings{mizumoto-etal-2013-naist,title={{NAIST} at the {NLI} 2013 Shared Task},author={Mizumoto, Tomoya and Hayashibe, Yuta and Sakaguchi, Keisuke and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Eighth Workshop on Innovative Use of {NLP} for Building Educational Applications},month=jun,year={2013},address={Atlanta, Georgia},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-1717},pages={134--139}}
MWE
Construction of English MWE Dictionary and its Application to POS Tagging
This paper reports our ongoing project for constructing an English multiword expression (MWE) dictionary and NLP tools based on the developed dictionary. We extracted functional MWEs from the English part of Wiktionary, annotated the Penn Treebank (PTB) with MWE information, and conducted POS tagging experiments. We report how the MWE annotation is done on PTB and the results of POS and MWE tagging experiments.
@inproceedings{shigeto-etal-2013-construction,title={Construction of {E}nglish {MWE} Dictionary and its Application to {POS} Tagging},author={Shigeto, Yutaro and Azuma, Ai and Hisamoto, Sorami and Kondo, Shuhei and Kose, Tomoya and Sakaguchi, Keisuke and Yoshimoto, Akifumi and Yung, Frances and Matsumoto, Yuji},booktitle={Proceedings of the 9th Workshop on Multiword Expressions},month=jun,year={2013},address={Atlanta, Georgia, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-1021},pages={139--144}}
2012
COLING
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
We propose an approach to correcting spelling errors and assigning part-of-speech (POS) tags simultaneously for sentences written by learners of English as a second language (ESL). In ESL writing, there are several types of errors such as preposition, determiner, verb, noun, and spelling errors. Spelling errors often interfere with POS tagging and syntactic parsing, which makes other error detection and correction tasks very difficult. In studies of grammatical error detection and correction in ESL writing, spelling correction has been regarded as a preprocessing step in a pipeline. However, several types of spelling errors in ESL are difficult to correct in the preprocessing, for example, homophones (e.g. *hear/here), confusion (*quiet/quite), split (*now a day/nowadays), merge (*swimingpool/swimming pool), inflection (*please/pleased) and derivation (*badly/bad), where the incorrect word is actually in the vocabulary and grammatical information is needed to disambiguate. In order to correct these spelling errors, and also typical typographical errors (*begginning/beginning), we propose a joint analysis of POS tagging and spelling error correction with a CRF (Conditional Random Field)-based model. We present an approach that achieves significantly better accuracies for both POS tagging and spelling correction, compared to existing approaches using either individual or pipeline analysis. We also show that the joint model can deal with novel types of misspelling in ESL writing.
@inproceedings{sakaguchi-etal-2012-joint,title={Joint {E}nglish Spelling Error Correction and {POS} Tagging for Language Learners Writing},author={Sakaguchi, Keisuke and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of {COLING} 2012},month=dec,year={2012},address={Mumbai, India},publisher={The COLING 2012 Organizing Committee},url={https://aclanthology.org/C12-1144},pages={2357--2374}}
This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the Helping Our Own (HOO) 2012 Shared Task. Our system targets preposition and determiner errors with spelling correction as a pre-processing step. The result shows that spelling correction improves the Detection, Correction, and Recognition F-scores for preposition errors. With regard to preposition error correction, F-scores were not improved when using the training set with correction of all but preposition errors. As for determiner error correction, there was an improvement when the constituent parser was trained with a concatenation of treebank and modified treebank where all the articles appearing as the first word of an NP were removed. Our system ranked third in preposition and fourth in determiner error corrections.
@inproceedings{sakaguchi-etal-2012-naist,title={{NAIST} at the {HOO} 2012 Shared Task},author={Sakaguchi, Keisuke and Hayashibe, Yuta and Kondo, Shuhei and Kanashiro, Lis and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Seventh Workshop on Building Educational Applications Using {NLP}},month=jun,year={2012},address={Montr{\'e}al, Canada},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W12-2033},pages={281--288}}