As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
@article{kasai2023med,author={{Kasai}, Jungo and {Kasai}, Yuhei and {Sakaguchi}, Keisuke and {Yamada}, Yutaro and {Radev}, Dragomir},title={{Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.18027}}
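The abstract's observation about higher API costs for Japanese comes down to tokenization. As a rough intuition only (the GPT APIs use byte-level BPE with learned merges, not raw UTF-8 byte counts), comparing bytes per character already hints at why non-Latin scripts consume more tokens; the example sentences are my own:

```python
# Illustrative only: the GPT APIs use byte-level BPE tokenizers, not raw
# UTF-8 bytes, but byte counts give a rough intuition for why non-Latin
# scripts consume more tokens (and thus cost more) per character.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed to encode one character."""
    return len(text.encode("utf-8")) / len(text)

english = "The patient has chest pain."
japanese = "患者は胸痛を訴えている。"  # roughly the same meaning

print(utf8_bytes_per_char(english))   # 1.0 -- ASCII characters are one byte each
print(utf8_bytes_per_char(japanese))  # 3.0 -- CJK characters are three bytes each
```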
arXiv
An Analysis of GPT-3's Performance in Grammatical Error Correction
GPT-3 models are very powerful, achieving high performance on a variety of natural language processing tasks. However, there is a relative lack of detailed published analysis on how well they perform on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3 model (text-davinci-003) against major GEC benchmarks, comparing the performance of several different prompts, including a comparison of zero-shot and few-shot settings. We analyze intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets using a combination of automatic metrics and human evaluations, revealing interesting differences between the preferences of human raters and the reference-based automatic metrics.
@article{coyne2023gptgec,author={{Coyne}, Steven and {Sakaguchi}, Keisuke},title={{An Analysis of GPT-3's Performance in Grammatical Error Correction}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.14342}}
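The zero-shot vs. few-shot comparison above reduces to prompt construction. A minimal sketch follows; the instruction wording and Input/Output format are my own illustration, not the paper's actual best-performing prompt:

```python
def gec_prompt(sentence: str, examples=()) -> str:
    """Build a zero-shot (no examples) or few-shot GEC prompt.

    The instruction wording and the Input/Output layout are illustrative,
    not the prompt the paper found to work best.
    """
    lines = ["Correct the grammatical errors in the sentence."]
    for source, corrected in examples:  # few-shot demonstrations, if any
        lines += [f"Input: {source}", f"Output: {corrected}"]
    lines += [f"Input: {sentence}", "Output:"]
    return "\n".join(lines)

# Zero-shot: the instruction plus the sentence to correct.
print(gec_prompt("He go to school."))
# Few-shot: the same prompt preceded by a worked example.
print(gec_prompt("He go to school.", [("She like cats.", "She likes cats.")]))
```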
arXiv
Causal schema induction for knowledge discovery
Michael Regan,
Jena D. Hwang,
Keisuke Sakaguchi,
and James Pustejovsky
Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning; however, resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.
@article{regan2023causalschema,author={{Regan}, Michael and {Hwang}, Jena D. and {Sakaguchi}, Keisuke and {Pustejovsky}, James},title={{Causal schema induction for knowledge discovery}},journal={arXiv},year={2023},doi={10.48550/arXiv.2303.15381}}
EACL
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?
Keito Kudo,
Yoichi Aoki,
Tatsuki Kuribayashi,
Ana Brassard,
Masashi Yoshikawa,
Keisuke Sakaguchi,
and Kentaro Inui
Proceedings of the 2023 Conference of the European Chapter of the Association for Computational Linguistics
(to appear)
2023
@inproceedings{Kudo2023eacl,title={Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?},author={Kudo, Keito and Aoki, Yoichi and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro},year={2023},booktitle={Proceedings of the 2023 Conference of the {E}uropean Chapter of the Association for Computational Linguistics},note={to appear},publisher={Association for Computational Linguistics}}
EACL Findings
Empirical Investigation of Neural Symbolic Reasoning Strategies
Yoichi Aoki,
Keito Kudo,
Tatsuki Kuribayashi,
Ana Brassard,
Masashi Yoshikawa,
Keisuke Sakaguchi,
and Kentaro Inui
Findings of the Association for Computational Linguistics: EACL 2023
(to appear)
2023
@inproceedings{Aoki2023eacl,title={Empirical Investigation of Neural Symbolic Reasoning Strategies},author={Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro},year={2023},booktitle={Findings of the Association for Computational Linguistics: EACL 2023},note={to appear},publisher={Association for Computational Linguistics}}
2022
arXiv
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
Chandra Bhagavatula,
Jena D Hwang,
Doug Downey,
Ronan Le Bras,
Ximing Lu,
Keisuke Sakaguchi,
Swabha Swayamdipta,
Peter West,
and Yejin Choi
@article{bhagavatula2022i2d2,title={I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation},author={Bhagavatula, Chandra and Hwang, Jena D and Downey, Doug and Bras, Ronan Le and Lu, Ximing and Sakaguchi, Keisuke and Swayamdipta, Swabha and West, Peter and Choi, Yejin},journal={arXiv},year={2022}}
arXiv
RealTime QA: What's the Answer Right Now?
Jungo Kasai,
Keisuke Sakaguchi,
Yoichi Takahashi,
Ronan Le Bras,
Akari Asai,
Xinyan Yu,
Dragomir Radev,
Noah A. Smith,
Yejin Choi,
and Kentaro Inui
We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open domain QA datasets and pursues instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that RealTime QA will spur progress in instantaneous applications of question answering and beyond.
@article{kasai2022realtimeqa,title={RealTime QA: What's the Answer Right Now?},author={Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Bras, Ronan Le and Asai, Akari and Yu, Xinyan and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro},journal={arXiv},year={2022},volume={abs/2207.13332},doi={10.48550/ARXIV.2207.13332}}
arXiv
Can Machines Learn Morality? The Delphi Experiment
Liwei Jiang,
Jena D. Hwang,
Chandra Bhagavatula,
Ronan Le Bras,
Jenny Liang,
Jesse Dodge,
Keisuke Sakaguchi,
Maxwell Forbes,
Jon Borchardt,
Saadia Gabriel,
Yulia Tsvetkov,
Oren Etzioni,
Maarten Sap,
Regina Rini,
and Yejin Choi
As AI systems become increasingly powerful and pervasive, there are growing concerns about machines' morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it.
To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense.
Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.
@article{jiang2022delphi,title={Can Machines Learn Morality? The Delphi Experiment},author={Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Borchardt, Jon and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin},journal={arXiv},year={2022},volume={abs/2110.07574},doi={10.48550/ARXIV.2110.07574}}
arXiv
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
Masato Mita,
Keisuke Sakaguchi,
Masato Hagiwara,
Tomoya Mizumoto,
Jun Suzuki,
and Kentaro Inui
Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community has begun to explore document-level revision as one of the next challenges. Going beyond sentence-level automated grammatical error correction to an NLP-based document-level revision assistant faces two major obstacles: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is not feasible to elicit all possible references and evaluate revision quality against them because there are infinitely many possible revisions. This paper tackles these challenges. First, we introduce a new document-revision corpus, TETRA, in which professional editors revised academic papers sampled from the ACL Anthology; these papers contain few trivial grammatical errors, enabling us to focus more on document- and paragraph-level edits such as coherence and consistency. Second, we explore reference-less and interpretable methods for meta-evaluation that can detect quality improvements from document revision. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle. This promising result will encourage the community to further explore automated document revision models and metrics in the future.
@article{mita2022tetra,title={Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond},author={Mita, Masato and Sakaguchi, Keisuke and Hagiwara, Masato and Mizumoto, Tomoya and Suzuki, Jun and Inui, Kentaro},journal={arXiv},year={2022},volume={abs/2205.11484},doi={10.48550/ARXIV.2205.11484}}
EMNLP
Twist Decoding: Diverse Generators Guide Each Other
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Hao Peng,
Ximing Lu,
Dragomir Radev,
Yejin Choi,
and Noah A. Smith
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Dec
2022
Natural language generation technology has recently seen remarkable progress with large-scale training, and many natural language applications are now built upon a wide range of generation models. Combining diverse models may lead to further progress, but conventional ensembling (e.g., shallow fusion) requires that they share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general inference algorithm that generates text while benefiting from diverse models. Our method does not assume that the vocabulary, tokenization, or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.
@inproceedings{kasai2022twist,title={Twist Decoding: Diverse Generators Guide Each Other},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Peng, Hao and Lu, Ximing and Radev, Dragomir and Choi, Yejin and Smith, Noah A.},booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},month=dec,year={2022},address={Abu Dhabi, UAE},publisher={Association for Computational Linguistics},url={https://arxiv.org/abs/2205.09273}}
arXiv
ELQA: A Corpus of Questions and Answers about the English Language
Shabnam Behzad,
Keisuke Sakaguchi,
Nathan Schneider,
and Amir Zeldes
We introduce a community-sourced dataset for English Language Question Answering (ELQA), which consists of more than 180k questions and answers on numerous topics about the English language, such as grammar, meaning, fluency, and etymology. The ELQA corpus will enable new NLP applications for language learners. We introduce three tasks based on the ELQA corpus: 1) answer quality classification, 2) semantic search for finding similar questions, and 3) answer generation. We present baselines for each task along with analysis, showing the strengths and weaknesses of current transformer-based models. The ELQA corpus and scripts are publicly available for future studies.
@article{behzad2022elqa,title={ELQA: A Corpus of Questions and Answers about the English Language},author={Behzad, Shabnam and Sakaguchi, Keisuke and Schneider, Nathan and Zeldes, Amir},journal={arXiv},year={2022},volume={abs/2205.00395},doi={10.48550/ARXIV.2205.00395}}
arXiv
Beam Decoding with Controlled Patience
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Dragomir Radev,
Yejin Choi,
and Noah A. Smith
Text generation with beam search has proven successful in a wide range of applications. The commonly-used implementation of beam decoding follows a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. We introduce a patience factor, a simple modification to this decoding algorithm, that generalizes the stopping criterion and provides flexibility to the depth of search. Extensive empirical results demonstrate that the patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can thus be readily incorporated in any implementation.
@article{kasai2022beam,title={Beam Decoding with Controlled Patience},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Radev, Dragomir and Choi, Yejin and Smith, Noah A.},journal={arXiv},year={2022},volume={abs/2204.05424}}
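The generalized stopping criterion described above can be sketched in isolation (this is only the stopping rule, not a full beam search; the function name is mine):

```python
def should_stop(num_finished: int, beam_size: int, patience: float = 1.0) -> bool:
    """Patience-controlled stopping rule for beam decoding.

    Vanilla beam decoding stops once `beam_size` hypotheses have finished
    (patience = 1.0), the first come, first served heuristic. A patience
    factor > 1 keeps searching deeper before stopping; a factor < 1 stops
    earlier.
    """
    return num_finished >= patience * beam_size

print(should_stop(5, 5))                # True: vanilla stop at beam_size finished
print(should_stop(5, 5, patience=2.0))  # False: keep searching until 10 finish
```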
NAACL
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai,
Keisuke Sakaguchi,
Ronan Le Bras,
Lavinia Dunagan,
Jacob Morrison,
Alexander R. Fabbri,
Yejin Choi,
and Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Jul
2022
Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (BILLBOARDs), that simultaneously tracks progress in language generation tasks and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a BILLBOARD accepts both generators and evaluation metrics as competing entries. A BILLBOARD automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlations with human judgments. We release four BILLBOARDs for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
@inproceedings{Kasai2022BidimensionalLG,title={Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},author={Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Dunagan, Lavinia and Morrison, Jacob and Fabbri, Alexander R. and Choi, Yejin and Smith, Noah A.},year={2022},booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},month=jul,pages={3540--3557},address={Seattle, United States},publisher={Association for Computational Linguistics}}
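One ingredient of the billboard described above, ranking metrics by their correlation with human judgments across generators, can be sketched as follows; the per-system scores and metric names are invented for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_metrics(metric_scores, human_scores):
    """Rank automatic metrics by how well they correlate with human
    judgments across submitted systems, as a bidimensional leaderboard
    ranks its metric entries."""
    ranked = sorted(metric_scores.items(),
                    key=lambda kv: pearson(kv[1], human_scores),
                    reverse=True)
    return [(name, pearson(scores, human_scores)) for name, scores in ranked]

# Hypothetical per-system scores for three generation systems.
metrics = {"bleu": [0.20, 0.30, 0.40], "learned_metric": [0.10, 0.40, 0.50]}
human = [0.15, 0.42, 0.48]
print(rank_metrics(metrics, human))  # learned_metric correlates best here
```

The full billboard additionally fits a linear ensemble over the top-ranked metrics; the ranking step above is the part the abstract describes explicitly.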
NAACL
Transparent Human Evaluation for Image Captioning
Jungo Kasai,
Keisuke Sakaguchi,
Lavinia Dunagan,
Jacob Morrison,
Ronan Le Bras,
Yejin Choi,
and Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Jul
2022
We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
@inproceedings{Kasai2022TransparentHE,title={Transparent Human Evaluation for Image Captioning},author={Kasai, Jungo and Sakaguchi, Keisuke and Dunagan, Lavinia and Morrison, Jacob and Bras, Ronan Le and Choi, Yejin and Smith, Noah A.},year={2022},booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},pages={3464--3478},month=jul,address={Seattle, Washington},publisher={Association for Computational Linguistics}}
IMLW@AAAI
Interscript: A dataset for interactive learning of scripts through error feedback
Niket Tandon,
Aman Madaan,
Peter Clark,
Keisuke Sakaguchi,
and Yiming Yang
The AAAI-22 Workshop on Interactive Machine Learning
2022
How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, INTERSCRIPT, containing user feedback on a deployed model that generates complex everyday tasks. INTERSCRIPT contains 8,466 data points: the input is a possibly erroneous script and user feedback, and the output is a modified script. We posit two use-cases of INTERSCRIPT that might significantly advance the state-of-the-art in interactive learning.
@inproceedings{Tandon2021InterscriptAD,title={Interscript: A dataset for interactive learning of scripts through error feedback},author={Tandon, Niket and Madaan, Aman and Clark, Peter and Sakaguchi, Keisuke and Yang, Yiming},year={2022},booktitle={The AAAI-22 Workshop on Interactive Machine Learning}}
2021
arXiv
Improving Neural Model Performance through Natural Language Feedback on Their Explanations
Aman Madaan,
Niket Tandon,
Dheeraj Rajagopal,
Yiming Yang,
Peter Clark,
Keisuke Sakaguchi,
and Eduard H. Hovy
A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIE, an interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies as compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 points on accuracy on defeasible reasoning across all three domains.
@article{Madaan2021ImprovingNM,title={Improving Neural Model Performance through Natural Language Feedback on Their Explanations},author={Madaan, Aman and Tandon, Niket and Rajagopal, Dheeraj and Yang, Yiming and Clark, Peter and Sakaguchi, Keisuke and Hovy, Eduard H.},journal={arXiv},year={2021},volume={abs/2104.08765}}
arXiv
GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education
Masato Hagiwara,
Joshua Tanner,
and Keisuke Sakaguchi
We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education. The model architecture enables it to learn from a small amount of texts annotated with spans and their labels, which 1) enables easier and more intuitive annotation, 2) supports overlapping spans, and 3) is less prone to error propagation, compared to complex hand-crafted rules defined on constituency/dependency parses. We show that we can bootstrap a grammar profiler model with F1 ≈ 0.6 from only a couple hundred sentences both in English and Chinese, which can be further boosted via learning a multilingual model. With GrammarTagger, we also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features.
@article{Hagiwara2021GrammarTaggerAM,title={GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education},author={Hagiwara, Masato and Tanner, Joshua and Sakaguchi, Keisuke},journal={arXiv},year={2021},volume={abs/2104.03190}}
EMNLP Findings
proScript: Partially Ordered Scripts Generation
Keisuke Sakaguchi,
Chandra Bhagavatula,
Ronan Le Bras,
Niket Tandon,
Peter Clark,
and Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2021
Nov
2021
Scripts, prototypical event sequences describing everyday activities, have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collect a large (6.4k) crowdsourced dataset of partially ordered scripts (named proScript) that is substantially larger than prior datasets, and develop models that generate scripts by combining language generation and graph structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 on task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.
@inproceedings{sakaguchi-etal-2021-proscript-partially,title={pro{S}cript: Partially Ordered Scripts Generation},author={Sakaguchi, Keisuke and Bhagavatula, Chandra and Le Bras, Ronan and Tandon, Niket and Clark, Peter and Choi, Yejin},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},month=nov,year={2021},address={Punta Cana, Dominican Republic},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2021.findings-emnlp.184},pages={2138--2149}}
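A partially ordered script like those above is naturally a DAG over events. A toy sketch with the standard library (the events are invented; real proScript scripts are crowdsourced):

```python
from graphlib import TopologicalSorter

# A toy partial-order script for the scenario "bake a cake".
# Each event maps to the set of events that must precede it.
script = {
    "preheat oven": set(),
    "mix batter": set(),
    "pour batter into pan": {"mix batter"},
    "bake": {"preheat oven", "pour batter into pan"},
    "serve": {"bake"},
}

# Any topological order of the DAG is a valid linearization of the script;
# "preheat oven" and "mix batter" are unordered relative to each other,
# which is exactly what "partially ordered" buys over a flat event list.
order = list(TopologicalSorter(script).static_order())
print(order)
```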
CACM
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi,
Ronan Le Bras,
Chandra Bhagavatula,
and Yejin Choi
Commonsense reasoning remains a major challenge in AI, and yet, recent progress on benchmarks may seem to suggest otherwise. In particular, recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question: whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WINOGRANDE compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset. Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
@article{10.1145/3474381,author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},year={2021},issue_date={September 2021},publisher={Association for Computing Machinery},address={New York, NY, USA},volume={64},number={9},issn={0001-0782},url={https://doi.org/10.1145/3474381},doi={10.1145/3474381},journal={Commun. ACM},month=aug,pages={99--106},numpages={8}}
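The bias-reduction idea behind AFLITE, filtering instances that held-out classifiers predict correctly too often, can be sketched in a heavily simplified form. The real algorithm trains linear classifiers on precomputed embeddings; here a midpoint threshold on a single scalar feature stands in for the classifier, and the data, thresholding scheme, and parameter values are all my own illustration:

```python
import random

def aflite_filter(features, labels, n_rounds=200, train_size=3, tau=0.75, seed=0):
    """Simplified AFLITE-style bias reduction (illustrative sketch).

    Repeatedly fit a toy classifier on a random subset and score every
    held-out instance; instances predicted correctly more than `tau` of the
    time are 'easy' (solvable from the spurious feature) and are removed.
    The toy classifier assumes positives tend to score higher on the feature.
    """
    rng = random.Random(seed)
    n = len(features)
    correct, seen = [0] * n, [0] * n
    for _ in range(n_rounds):
        train = set(rng.sample(range(n), train_size))
        pos = [features[i] for i in train if labels[i] == 1]
        neg = [features[i] for i in train if labels[i] == 0]
        if not pos or not neg:
            continue  # the toy classifier needs both classes to fit
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        for i in set(range(n)) - train:  # evaluate on held-out instances
            seen[i] += 1
            correct[i] += int(int(features[i] > threshold) == labels[i])
    # Keep only the instances that were not reliably predictable.
    return [i for i in range(n) if seen[i] == 0 or correct[i] / seen[i] <= tau]

# Items 0-5 have a strong feature-label association (easy to exploit);
# item 6 sits near the decision boundary and resists the spurious feature.
features = [0.90, 0.80, 0.85, 0.10, 0.20, 0.15, 0.50]
labels = [1, 1, 1, 0, 0, 0, 1]
print(aflite_filter(features, labels))  # only the hard item (index 6) survives
```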
AAAI
COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs
Jena D. Hwang,
Chandra Bhagavatula,
Ronan Le Bras,
Jeff Da,
Keisuke Sakaguchi,
Antoine Bosselut,
and Yejin Choi
Proceedings of the AAAI Conference on Artificial Intelligence
May
2021
Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge.
In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them.
With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains ~12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.
@article{Hwang2021COMETATOMIC2O,title={COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs},author={Hwang, Jena D. and Bhagavatula, Chandra and Le Bras, Ronan and Da, Jeff and Sakaguchi, Keisuke and Bosselut, Antoine and Choi, Yejin},journal={Proceedings of the AAAI Conference on Artificial Intelligence},volume={35},number={7},year={2021},month=may,pages={6384-6392}}
2020
EMNLP
A Dataset for Tracking Entities in Open Domain Procedural Text
Niket Tandon,
Keisuke Sakaguchi,
Bhavana Dalvi,
Dheeraj Rajagopal,
Peter Clark,
Michal Guerquin,
Kyle Richardson,
and Eduard Hovy
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Nov
2020
We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where, given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OPENPI, a high-quality (91.5% coverage as judged by humans and completely vetted) and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on the BLEU metric, leaving enough room for novel model architectures.
@inproceedings{tandon-etal-2020-dataset,title={A Dataset for Tracking Entities in Open Domain Procedural Text},author={Tandon, Niket and Sakaguchi, Keisuke and Dalvi, Bhavana and Rajagopal, Dheeraj and Clark, Peter and Guerquin, Michal and Richardson, Kyle and Hovy, Eduard},booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},month=nov,year={2020},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2020.emnlp-main.520},doi={10.18653/v1/2020.emnlp-main.520},pages={6408--6417}}
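The open-vocabulary (entity, attribute, before-state, after-state) tuples that the abstract describes can be pictured with a small sketch. This is illustrative only: the class name `StateChange` and the example values are invented here, not taken from the OPENPI release.

```python
# Minimal sketch of the per-step output the OPENPI task asks a model to
# generate: one tuple per state change, with all four slots drawn from an
# open vocabulary. Names and values below are hypothetical illustrations.
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    entity: str
    attribute: str
    before: str
    after: str


# For the fog-removal example in the abstract, one step might yield:
step_changes = [
    StateChange("car window", "clarity", "foggy", "clear"),
    StateChange("car window", "texture", "sticky", "smooth"),
]
```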
ACL
Uncertain Natural Language Inference
Tongfei Chen,
Zhengping Jiang,
Adam Poliak,
Keisuke Sakaguchi,
and Benjamin Van Durme
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Jul
2020
We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.
@inproceedings{chen-etal-2020-uncertain,title={Uncertain Natural Language Inference},author={Chen, Tongfei and Jiang, Zhengping and Poliak, Adam and Sakaguchi, Keisuke and Van Durme, Benjamin},booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},month=jul,year={2020},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2020.acl-main.774},doi={10.18653/v1/2020.acl-main.774},pages={8772--8779}}
LREC
The Universal Decompositional Semantics Dataset and Decomp Toolkit
Aaron Steven White,
Elias Stengel-Eskin,
Siddharth Vashishtha,
Venkata Subrahmanyan Govindarajan,
Dee Ann Reisinger,
Tim Vieira,
Keisuke Sakaguchi,
Sheng Zhang,
Francis Ferraro,
Rachel Rudinger,
Kyle Rawlins,
and Benjamin Van Durme
Proceedings of the 12th Language Resources and Evaluation Conference
May
2020
We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specification, with graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using sophisticated normalization procedures. The Decomp toolkit provides a suite of Python 3 tools for querying UDS graphs using SPARQL. Both UDS1.0 and Decomp0.1 are publicly available at http://decomp.io.
@inproceedings{white-etal-2020-universal,title={The Universal Decompositional Semantics Dataset and Decomp Toolkit},author={White, Aaron Steven and Stengel-Eskin, Elias and Vashishtha, Siddharth and Govindarajan, Venkata Subrahmanyan and Reisinger, Dee Ann and Vieira, Tim and Sakaguchi, Keisuke and Zhang, Sheng and Ferraro, Francis and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin},booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},month=may,year={2020},address={Marseille, France},publisher={European Language Resources Association},url={https://aclanthology.org/2020.lrec-1.699},pages={5698--5707},language={English},isbn={979-10-95546-34-4}}
ICLR
Abductive Commonsense Reasoning
Chandra Bhagavatula,
Ronan Le Bras,
Chaitanya Malaviya,
Keisuke Sakaguchi,
Ari Holtzman,
Hannah Rashkin,
Doug Downey,
Wen-tau Yih,
and Yejin Choi
International Conference on Learning Representations
2020
@inproceedings{bhagavatula2020abductive,title={Abductive Commonsense Reasoning},author={Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin},booktitle={International Conference on Learning Representations},year={2020},url={https://openreview.net/forum?id=Byg1v1HKDB}}
AAAI
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi,
Ronan Le Bras,
Chandra Bhagavatula,
and Yejin Choi
Proceedings of the AAAI Conference on Artificial Intelligence
Apr
2020
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are ~15-35% (absolute) below human performance of 94.0%, depending on the amount of the training data allowed (2%-100% respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (→ 90.1%), DPR (→ 93.1%), COPA (→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
@article{Sakaguchi-etal-2020-winogrande,title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},volume={34},url={https://ojs.aaai.org/index.php/AAAI/article/view/6399},doi={10.1609/aaai.v34i05.6399},number={05},journal={Proceedings of the AAAI Conference on Artificial Intelligence},author={Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin},year={2020},month=apr,pages={8732-8740}}
2019
EMNLP
WIQA: A dataset for "What if..." reasoning over procedural text
Niket Tandon,
Bhavana Dalvi,
Keisuke Sakaguchi,
Peter Clark,
and Antoine Bosselut
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Nov
2019
We introduce WIQA, the first large-scale dataset of "What if..." questions over procedural text. WIQA contains a collection of paragraphs, each annotated with multiple influence graphs describing how one change affects another, and a large (40k) collection of "What if...?" multiple-choice questions derived from these. For example, given a paragraph about beach erosion, would stormy weather hasten or decelerate erosion? WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations. We find that state-of-the-art models achieve 73.8% accuracy, well below the human performance of 96.3%. We analyze the challenges, in particular tracking chains of influences, and present the dataset as an open challenge to the community.
@inproceedings{tandon-etal-2019-wiqa,title={{WIQA}: A dataset for {``}What if...{''} reasoning over procedural text},author={Tandon, Niket and Dalvi, Bhavana and Sakaguchi, Keisuke and Clark, Peter and Bosselut, Antoine},booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},month=nov,year={2019},address={Hong Kong, China},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D19-1629},doi={10.18653/v1/D19-1629},pages={6076--6085}}
2018
ACL
Efficient Online Scalar Annotation with Bounded Support
Keisuke Sakaguchi,
and Benjamin Van Durme
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jul
2018
We describe a novel method for efficiently eliciting scalar annotations for dataset construction and system quality estimation by human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores derive from annotator comparison of items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels) proposed here. Our proposal leads to increased correlation with ground truth, at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.
@inproceedings{sakaguchi-van-durme-2018-efficient,title={Efficient Online Scalar Annotation with Bounded Support},author={Sakaguchi, Keisuke and Van Durme, Benjamin},booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=jul,year={2018},address={Melbourne, Australia},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P18-1020},doi={10.18653/v1/P18-1020},pages={208--218}}
2017
IJCNLP
Grammatical Error Correction with Neural Reinforcement Learning
Keisuke Sakaguchi,
Matt Post,
and Benjamin Van Durme
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Nov
2017
We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
@inproceedings{sakaguchi-etal-2017-grammatical,title={Grammatical Error Correction with Neural Reinforcement Learning},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},month=nov,year={2017},address={Taipei, Taiwan},publisher={Asian Federation of Natural Language Processing},url={https://aclanthology.org/I17-2062},pages={366--372}}
BEA@EMNLP
GEC into the future: Where are we going and how do we get there?
Keisuke Sakaguchi,
Courtney Napoles,
and Joel Tetreault
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Sep
2017
The field of grammatical error correction (GEC) has made tremendous bounds in the last ten years, but new questions and obstacles are revealing themselves. In this position paper, we discuss the issues that need to be addressed and provide recommendations for the field to continue to make progress, and propose a new shared task. We invite suggestions and critiques from the audience to make the new shared task a community-driven venture.
@inproceedings{sakaguchi-etal-2017-gec,title={{GEC} into the future: Where are we going and how do we get there?},author={Sakaguchi, Keisuke and Napoles, Courtney and Tetreault, Joel},booktitle={Proceedings of the 12th Workshop on Innovative Use of {NLP} for Building Educational Applications},month=sep,year={2017},address={Copenhagen, Denmark},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W17-5019},doi={10.18653/v1/W17-5019},pages={180--187}}
ACL
Error-repair Dependency Parsing for Ungrammatical Texts
Keisuke Sakaguchi,
Matt Post,
and Benjamin Van Durme
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Jul
2017
We propose a new dependency parsing scheme which jointly parses a sentence and repairs grammatical errors by extending the non-directional transition-based formalism of Goldberg and Elhadad (2010) with three additional actions: SUBSTITUTE, DELETE, INSERT. Because these actions may cause an infinite loop in derivation, we also introduce simple constraints that ensure the parser termination. We evaluate our model with respect to dependency accuracy and grammaticality improvements for ungrammatical sentences, demonstrating the robustness and applicability of our scheme.
@inproceedings{sakaguchi-etal-2017-error,title={Error-repair Dependency Parsing for Ungrammatical Texts},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},month=jul,year={2017},address={Vancouver, Canada},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P17-2030},doi={10.18653/v1/P17-2030},pages={189--195}}
EACL
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction
Courtney Napoles,
Keisuke Sakaguchi,
and Joel Tetreault
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Apr
2017
We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.
@inproceedings{napoles-etal-2017-jfleg,title={{JFLEG}: A Fluency Corpus and Benchmark for Grammatical Error Correction},author={Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel},booktitle={Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},month=apr,year={2017},address={Valencia, Spain},publisher={Association for Computational Linguistics},url={https://aclanthology.org/E17-2037},pages={229--234}}
AAAI
Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network
Keisuke Sakaguchi,
Kevin Duh,
Matt Post,
and Benjamin Van Durme
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence
2017
Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise. Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers and character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.
@inproceedings{10.5555/3298023.3298045,author={Sakaguchi, Keisuke and Duh, Kevin and Post, Matt and Durme, Benjamin Van},title={Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network},year={2017},publisher={AAAI Press},booktitle={Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence},pages={3281--3287},numpages={7},location={San Francisco, California, USA},series={AAAI'17}}
2016
EMNLP
Universal Decompositional Semantics on Universal Dependencies
Aaron Steven White,
Drew Reisinger,
Keisuke Sakaguchi,
Tim Vieira,
Sheng Zhang,
Rachel Rudinger,
Kyle Rawlins,
and Benjamin Van Durme
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Nov
2016
We present a framework for augmenting data sets from the Universal Dependencies project with Universal Decompositional Semantics. Where the Universal Dependencies project aims to provide a syntactic annotation standard that can be used consistently across many languages as well as a collection of corpora that use that standard, our extension has similar aims for semantic annotation. We describe results from annotating the English Universal Dependencies treebank, dealing with word senses, semantic roles, and event properties.
@inproceedings{white-etal-2016-universal,title={Universal Decompositional Semantics on {U}niversal {D}ependencies},author={White, Aaron Steven and Reisinger, Drew and Sakaguchi, Keisuke and Vieira, Tim and Zhang, Sheng and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin},booktitle={Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2016},address={Austin, Texas},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D16-1177},doi={10.18653/v1/D16-1177},pages={1713--1723}}
EMNLP
There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction
Courtney Napoles,
Keisuke Sakaguchi,
and Joel Tetreault
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Nov
2016
Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
@inproceedings{napoles-etal-2016-theres,title={There{'}s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction},author={Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel},booktitle={Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2016},address={Austin, Texas},publisher={Association for Computational Linguistics},url={https://aclanthology.org/D16-1228},doi={10.18653/v1/D16-1228},pages={2109--2115}}
ACL
Phrase Structure Annotation and Parsing for Learner English
Ryo Nagata,
and Keisuke Sakaguchi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aug
2016
There has been almost no work on phrase structure annotation and parsing specially designed for learner English despite the fact that they are useful for representing the structural characteristics of learner English. To address this problem, in this paper, we first propose a phrase structure annotation scheme for learner English and annotate two different learner corpora using it. Second, we show their usefulness, reporting on (a) inter-annotator agreement rate, (b) characteristic CFG rules in the corpora, and (c) parsing performance on them. In addition, we explore methods to improve phrase structure parsing for learner English (achieving an F-measure of 0.878). Finally, we release the full annotation guidelines, the annotated data, and the improved parser model for learner English to the public.
@inproceedings{nagata-sakaguchi-2016-phrase,title={Phrase Structure Annotation and Parsing for Learner {E}nglish},author={Nagata, Ryo and Sakaguchi, Keisuke},booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=aug,year={2016},address={Berlin, Germany},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P16-1173},doi={10.18653/v1/P16-1173},pages={1837--1847}}
TACL
Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality
Keisuke Sakaguchi,
Courtney Napoles,
Matt Post,
and Joel Tetreault
Transactions of the Association for Computational Linguistics
2016
The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC's reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
@article{sakaguchi-etal-2016-reassessing,title={Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality},author={Sakaguchi, Keisuke and Napoles, Courtney and Post, Matt and Tetreault, Joel},journal={Transactions of the Association for Computational Linguistics},volume={4},year={2016},url={https://aclanthology.org/Q16-1013},doi={10.1162/tacl_a_00091},pages={169--182}}
arXiv
GLEU Without Tuning
Courtney Napoles,
Keisuke Sakaguchi,
Matt Post,
and Joel R. Tetreault
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
@article{Napoles2016GLEUWT,title={GLEU Without Tuning},author={Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel R.},journal={arXiv},year={2016},volume={abs/1605.02592}}
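The core idea behind GLEU, rewarding candidate n-grams that match the reference while penalizing those that match only the source, can be illustrated with a toy metric. This is a deliberate simplification, not the official GLEU: the actual metric (Napoles et al., 2015; 2016) uses BLEU-style counting with a brevity penalty and different handling of multiple references. The function name `simple_gleu` and all parameters below are ours.

```python
# Toy illustration of the GLEU intuition: reward candidate n-grams found in
# the reference, penalize those found only in the source, and combine the
# per-n precisions with a geometric mean. Not the official GLEU formula.
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_gleu(source, reference, candidate, max_n=2):
    score = 1.0
    for n in range(1, max_n + 1):
        s, r, c = ngrams(source, n), ngrams(reference, n), ngrams(candidate, n)
        reward = sum((c & r).values())          # candidate n-grams also in the reference
        penalty = sum(((c & s) - r).values())   # candidate n-grams only in the source
        total = max(sum(c.values()), 1)
        score *= max(reward - penalty, 0) / total
    return score ** (1.0 / max_n)


src = "I has a dog".split()
ref = "I have a dog".split()
# Copying the reference scores perfectly; copying the uncorrected source does not:
assert simple_gleu(src, ref, ref) > simple_gleu(src, ref, src)
```

The penalty term is what distinguishes this family of metrics from plain BLEU: a system that parrots the erroneous source gets no credit for the n-grams it failed to correct.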
2015
ACL
Ground Truth for Grammatical Error Correction Metrics
Courtney Napoles,
Keisuke Sakaguchi,
Matt Post,
and Joel Tetreault
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Jul
2015
How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
@inproceedings{napoles-etal-2015-ground,title={Ground Truth for Grammatical Error Correction Metrics},author={Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},month=jul,year={2015},address={Beijing, China},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P15-2097},doi={10.3115/v1/P15-2097},pages={588--593}}
NAACL
Effective Feature Integration for Automated Short Answer Scoring
Keisuke Sakaguchi,
Michael Heilman,
and Nitin Madnani
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
May
2015
A major opportunity for NLP to have a real-world impact is in helping educators score student writing, particularly content-based writing (i.e., the task of automated short answer scoring). A major challenge in this enterprise is that scored responses to a particular question (i.e., labeled data) are valuable for modeling but limited in quantity. Additional information from the scoring guidelines for humans, such as exemplars for each score level and descriptions of key concepts, can also be used. Here, we explore methods for integrating scoring guidelines and labeled responses, and we find that stacked generalization (Wolpert, 1992) improves performance, especially for small training sets.
@inproceedings{sakaguchi-etal-2015-effective,title={Effective Feature Integration for Automated Short Answer Scoring},author={Sakaguchi, Keisuke and Heilman, Michael and Madnani, Nitin},booktitle={Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies},month=may,year={2015},address={Denver, Colorado},publisher={Association for Computational Linguistics},url={https://aclanthology.org/N15-1111},doi={10.3115/v1/N15-1111},pages={1049--1054}}
2014
WMT
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi,
Matt Post,
and Benjamin Van Durme
Proceedings of the Ninth Workshop on Statistical Machine Translation
Jun
2014
A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentence-level comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection. We continue this line of work by adapting the TrueSkill™ algorithm, an online approach for modeling the relative skills of players in ongoing competitions such as Microsoft's Xbox Live, to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
@inproceedings{sakaguchi-etal-2014-efficient,title={Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},author={Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},booktitle={Proceedings of the Ninth Workshop on Statistical Machine Translation},month=jun,year={2014},address={Baltimore, Maryland, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W14-3301},doi={10.3115/v1/W14-3301},pages={1--11}}
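The setting this paper studies, turning a stream of pairwise human judgments into a ranking of MT systems, can be illustrated with an Elo-style online update. To be clear, this is a stand-in, not TrueSkill: TrueSkill additionally maintains a per-player uncertainty that drives both its updates and the paper's non-uniform match sampling. The constants `k` and `scale` below are arbitrary illustrative choices.

```python
# Elo-style stand-in for online skill estimation from pairwise judgments.
# Each judgment ("judge preferred A's translation over B's") nudges the
# winner's rating up and the loser's down, in proportion to how surprising
# the outcome was under the current ratings.
def elo_update(r_winner, r_loser, k=16.0, scale=400.0):
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / scale))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta


ratings = {"sysA": 1000.0, "sysB": 1000.0}
# A judge prefers sysA's output over sysB's on one sentence:
ratings["sysA"], ratings["sysB"] = elo_update(ratings["sysA"], ratings["sysB"])
assert ratings["sysA"] > ratings["sysB"]
```

Sorting systems by their final ratings yields the kind of ranking the WMT shared task publishes; TrueSkill's extra uncertainty term is what lets the paper decide which system pairs are still worth showing to judges.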
2013
ACL
Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Keisuke Sakaguchi,
Yuki Arase,
and Mamoru Komachi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Aug
2013
We propose discriminative methods to generate semantic distractors of fill-in-the-blank quiz for language learners using a large-scale language learners' corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners' proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
@inproceedings{sakaguchi-etal-2013-discriminative,title={Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners},author={Sakaguchi, Keisuke and Arase, Yuki and Komachi, Mamoru},booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},month=aug,year={2013},address={Sofia, Bulgaria},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P13-2043},pages={238--242}}
CoNLL
NAIST at 2013 CoNLL Grammatical Error Correction Shared Task
Ippei Yoshimoto,
Tomoya Kose,
Kensuke Mitsuzawa,
Keisuke Sakaguchi,
Tomoya Mizumoto,
Yuta Hayashibe,
Mamoru Komachi,
and Yuji Matsumoto
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task
Aug
2013
This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subject-verb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for preposition and determiner errors. As for subject-verb agreement errors, we show that the Treelet Language Model-based approach can correct errors in which the target verb is distant from its subject. Our system ranked fourth on the official run.
@inproceedings{yoshimoto-etal-2013-naist,title={{NAIST} at 2013 {C}o{NLL} Grammatical Error Correction Shared Task},author={Yoshimoto, Ippei and Kose, Tomoya and Mitsuzawa, Kensuke and Sakaguchi, Keisuke and Mizumoto, Tomoya and Hayashibe, Yuta and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task},month=aug,year={2013},address={Sofia, Bulgaria},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-3604},pages={26--33}}
BEA@NAACL
NAIST at the NLI 2013 Shared Task
Tomoya Mizumoto,
Yuta Hayashibe,
Keisuke Sakaguchi,
Mamoru Komachi,
and Yuji Matsumoto
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
Jun
2013
This paper describes the Nara Institute of Science and Technology (NAIST) native language identification (NLI) system in the NLI 2013 Shared Task. We apply feature selection using a measure based on frequency for the closed track and try Capping and Sampling data methods for the open tracks. Our system ranked ninth in the closed track, third in open track 1 and fourth in open track 2.
@inproceedings{mizumoto-etal-2013-naist,title={{NAIST} at the {NLI} 2013 Shared Task},author={Mizumoto, Tomoya and Hayashibe, Yuta and Sakaguchi, Keisuke and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Eighth Workshop on Innovative Use of {NLP} for Building Educational Applications},month=jun,year={2013},address={Atlanta, Georgia},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-1717},pages={134--139}}
MWE@NAACL
Construction of English MWE Dictionary and its Application to POS Tagging
Yutaro Shigeto,
Ai Azuma,
Sorami Hisamoto,
Shuhei Kondo,
Tomoya Kose,
Keisuke Sakaguchi,
Akifumi Yoshimoto,
Frances Yung,
and Yuji Matsumoto
Proceedings of the 9th Workshop on Multiword Expressions
Jun
2013
This paper reports our ongoing project for constructing an English multiword expression (MWE) dictionary and NLP tools based on the developed dictionary. We extracted functional MWEs from the English part of Wiktionary, annotated the Penn Treebank (PTB) with MWE information, and conducted POS tagging experiments. We report how the MWE annotation is done on PTB and the results of POS and MWE tagging experiments.
@inproceedings{shigeto-etal-2013-construction,title={Construction of {E}nglish {MWE} Dictionary and its Application to {POS} Tagging},author={Shigeto, Yutaro and Azuma, Ai and Hisamoto, Sorami and Kondo, Shuhei and Kose, Tomoya and Sakaguchi, Keisuke and Yoshimoto, Akifumi and Yung, Frances and Matsumoto, Yuji},booktitle={Proceedings of the 9th Workshop on Multiword Expressions},month=jun,year={2013},address={Atlanta, Georgia, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W13-1021},pages={139--144}}
2012
COLING
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
Keisuke Sakaguchi,
Tomoya Mizumoto,
Mamoru Komachi,
and Yuji Matsumoto
We propose an approach to correcting spelling errors and assigning part-of-speech (POS) tags simultaneously for sentences written by learners of English as a second language (ESL). In ESL writing, there are several types of errors such as preposition, determiner, verb, noun, and spelling errors. Spelling errors often interfere with POS tagging and syntactic parsing, which makes other error detection and correction tasks very difficult. In studies of grammatical error detection and correction in ESL writing, spelling correction has been regarded as a preprocessing step in a pipeline. However, several types of spelling errors in ESL are difficult to correct in the preprocessing, for example, homophones (e.g. *hear/here), confusion (*quiet/quite), split (*now a day/nowadays), merge (*swimingpool/swimming pool), inflection (*please/pleased) and derivation (*badly/bad), where the incorrect word is actually in the vocabulary and grammatical information is needed to disambiguate. In order to correct these spelling errors, and also typical typographical errors (*begginning/beginning), we propose a joint analysis of POS tagging and spelling error correction with a CRF (Conditional Random Field)-based model. We present an approach that achieves significantly better accuracies for both POS tagging and spelling correction, compared to existing approaches using either individual or pipeline analysis. We also show that the joint model can deal with novel types of misspelling in ESL writing.
@inproceedings{sakaguchi-etal-2012-joint,title={Joint {E}nglish Spelling Error Correction and {POS} Tagging for Language Learners Writing},author={Sakaguchi, Keisuke and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of {COLING} 2012},month=dec,year={2012},address={Mumbai, India},publisher={The COLING 2012 Organizing Committee},url={https://aclanthology.org/C12-1144},pages={2357--2374}}
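The core idea of the joint model above is that each token is labeled with a (candidate word, POS tag) pair rather than a POS tag alone, so correction and tagging are decided together. The following is a minimal, hypothetical sketch of that joint label space using a toy lexicon and plain Levenshtein distance; it only enumerates candidates and omits the CRF scoring used in the actual paper.

```python
# Hypothetical sketch: expand each token into joint (word, POS) labels,
# so spelling correction and POS tagging share one label space.
# LEXICON and max_dist are illustrative assumptions, not the paper's setup.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

# Toy lexicon: surface form -> possible POS tags.
LEXICON = {"here": ["RB"], "hear": ["VB"], "the": ["DT"], "noise": ["NN"]}

def candidates(token, max_dist=1):
    """Joint (candidate word, POS) labels for a possibly misspelled token."""
    cands = []
    for word, tags in LEXICON.items():
        if edit_distance(token, word) <= max_dist:
            for tag in tags:
                cands.append((word, tag))
    return cands

# A real system scores whole label sequences with a CRF; here we just
# enumerate the joint label space for one sentence.
sentence = ["hear", "the", "noize"]
label_space = [candidates(tok) for tok in sentence]
```

Because in-vocabulary confusions such as *hear/here* have an edit distance of 2, the paper's point holds in this sketch too: distance alone cannot disambiguate them, and the sequence-level (CRF) score over joint labels is what resolves such cases.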
BEA@NAACL
NAIST at the HOO 2012 Shared Task
Keisuke Sakaguchi,
Yuta Hayashibe,
Shuhei Kondo,
Lis Kanashiro,
Tomoya Mizumoto,
Mamoru Komachi,
and Yuji Matsumoto
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP
Jun
2012
This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the Helping Our Own (HOO) 2012 Shared Task. Our system targets preposition and determiner errors with spelling correction as a pre-processing step. The result shows that spelling correction improves the Detection, Correction, and Recognition F-scores for preposition errors. With regard to preposition error correction, F-scores were not improved when using the training set with correction of all but preposition errors. As for determiner error correction, there was an improvement when the constituent parser was trained with a concatenation of treebank and modified treebank where all the articles appearing as the first word of an NP were removed. Our system ranked third in preposition and fourth in determiner error corrections.
@inproceedings{sakaguchi-etal-2012-naist,title={{NAIST} at the {HOO} 2012 Shared Task},author={Sakaguchi, Keisuke and Hayashibe, Yuta and Kondo, Shuhei and Kanashiro, Lis and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji},booktitle={Proceedings of the Seventh Workshop on Building Educational Applications Using {NLP}},month=jun,year={2012},address={Montr{\'e}al, Canada},publisher={Association for Computational Linguistics},url={https://aclanthology.org/W12-2033},pages={281--288}}