Keisuke Sakaguchi | publications

2025

Blackbox NLP
Understanding the Side Effects of Rank-One Knowledge Editing

Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP Nov 2025

Abs Bib PDF

This study conducts a detailed analysis of the side effects of rank-one knowledge editing using language models with controlled knowledge. The analysis focuses on each element of knowledge triples (subject, relation, object) and examines two aspects: “knowledge that causes large side effects when edited” and “knowledge that is affected by the side effects.” Our findings suggest that editing knowledge with subjects that have relationships with numerous objects or are robustly embedded within the LM may trigger extensive side effects. Furthermore, we demonstrate that the similarity between relation vectors, the density of object vectors, and the distortion of knowledge representations are closely related to how susceptible knowledge is to editing influences. The findings of this research provide new insights into the mechanisms of side effects in LM knowledge editing and indicate specific directions for developing more effective and reliable knowledge editing methods.
@inproceedings{takahashi2025understanding, title = {Understanding the Side Effects of Rank-One Knowledge Editing}, author = {Takahashi, Ryosuke and Kamoda, Go and Heinzerling, Benjamin and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Belinkov, Yonatan and Mueller, Aaron and Kim, Najoung and Mohebbi, Hosein and Chen, Hanjie and Arad, Dana and Sarti, Gabriele}, booktitle = {Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, doi = {10.18653/v1/2025.blackboxnlp-1.11}, pages = {189--205}, isbn = {979-8-89176-346-3} }
ACL
Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2025

Abs Bib PDF

The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik’s CUBE–an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code are available at https://github.com/RubriksCube/rubriks_cube.
@inproceedings{galvansosa2025rubriks, title = {Rubrik{'}s Cube: Testing a New Rubric for Evaluating Explanations on the {CUBE} dataset}, author = {Galvan-Sosa, Diana and Gaudeau, Gabrielle and Kavumba, Pride and Li, Yunmeng and Gu, Hongyi and Yuan, Zheng and Sakaguchi, Keisuke and Buttery, Paula}, editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher}, booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2025}, address = {Vienna, Austria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.acl-long.1160/}, doi = {10.18653/v1/2025.acl-long.1160}, pages = {23800--23839}, isbn = {979-8-89176-251-0} }
AIED
Annotating Errors in English Learners’ Written Language Production: Advancing Automated Written Feedback Systems

Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock, Keisuke Sakaguchi, Kentaro Inui

Artificial Intelligence in Education (AIED) Jul 2025

Best Paper Nomination

Abs Bib PDF Project

Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. We collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the performance of the systems investigated.
@inproceedings{coyne2025feedback, author = {Coyne, Steven and Galvan-Sosa, Diana and Spring, Ryan and Guerraoui, Cam{\'e}lia and Zock, Michael and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Cristea, Alexandra I. and Walker, Erin and Lu, Yu and Santos, Olga C. and Isotani, Seiji}, title = {Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems}, booktitle = {Artificial Intelligence in Education (AIED)}, year = {2025}, publisher = {Springer Nature Switzerland}, pages = {292--306}, isbn = {978-3-031-98459-4}, month = jul }
ICLR
Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches

Itsumi Saito, Haruto Yoshida, Keisuke Sakaguchi

The Thirteenth International Conference on Learning Representations Apr 2025

Abs Bib PDF Project

We address the challenge of automatically generating high-quality vector diagrams from hand-drawn sketches. Vector diagrams are essential for communicating complex ideas across various fields, offering flexibility and scalability. While recent research has progressed in generating diagrams from text descriptions, converting hand-drawn sketches into vector diagrams remains largely unexplored due to the lack of suitable datasets. To address this gap, we introduce SketikZ, a dataset comprising 3,231 pairs of hand-drawn sketches and thier corresponding TikZ codes as well as reference diagrams. Our evaluations reveal the limitations of state-of-the-art vision and language models (VLMs), positioning SketikZ as a key benchmark for future research in sketch-to-diagram conversion. Along with SketikZ, we present ImgTikZ, an image-to-TikZ model that integrates a 6.7B parameter code-specialized open-source large language model (LLM) with a pre-trained vision encoder. Despite its relatively compact size, ImgTikZ performs comparably to GPT-4o. This success is driven by using our two data augmentation techniques and a multi-candidate inference strategy. Our findings open promising directions for future research in sketch-to-diagram conversion and broader image-to-code generation tasks. SketikZ is publicly available.
@inproceedings{saito2025sketchdiagram, title = {Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches}, author = {Saito, Itsumi and Yoshida, Haruto and Sakaguchi, Keisuke}, booktitle = {The Thirteenth International Conference on Learning Representations}, year = {2025}, url = {https://openreview.net/forum?id=KvaDHPhhir}, month = apr, address = {Singapore} }
NAACL Findings
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: NAACL 2025 Apr 2025

Abs Bib PDF

According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model‘s “inner vocabulary”.Prior analysis of this *detokenization* stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior.Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps.Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2.Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects.By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
@inproceedings{kamoda2025detok, title = {Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference}, author = {Kamoda, Go and Heinzerling, Benjamin and Inaba, Tatsuro and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.findings-naacl.355/}, pages = {6324--6343}, isbn = {979-8-89176-195-7} }
NAACL
Self-Training Meets Consistency: Improving LLMs’ Reasoning with Consistency-Driven Rationale Evaluation

Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Apr 2025

Abs Bib PDF

Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
@inproceedings{lee2025crest, title = {Self-Training Meets Consistency: Improving {LLM}s' Reasoning with Consistency-Driven Rationale Evaluation}, author = {Lee, Jaehyeok and Sakaguchi, Keisuke and Bak, JinYeong}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.528/}, pages = {10519--10539}, isbn = {979-8-89176-189-6} }
NAACL
Language Models can Categorize System Inputs for Performance Analysis

Dominic Sobhani, Ruiqi Zhong, Edison Marrese-Taylor, Keisuke Sakaguchi, Yutaka Matsuo

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Apr 2025

Abs Bib PDF

Language model systems are used to process diverse categories of input requests, ranging from improving creative writing to solving programming challenges. It would be useful to know which categories they are good at. However, existing evaluations compare model performance on pre-defined categories, failing to reflect a system’s performance on finer-grained or novel ones. We propose to automatically search for finer-grained categories based on inputs where a system performs well or poorly, and describe them in natural language. To search for these categories, we propose a large number of candidate category descriptions, e.g. “Communication Improvement”, find the subset of inputs that match the category descriptions, and calculate the performance on these categories; then we sort these categories based on their performance, thereby highlighting those that score high or low. As one application, we apply our method to compare LLaMA 3-70B and Claude 3 Opus, which have similar Elo-ratings on Chatbot Arena; our method finds the former is weaker at making text more professional and humorous while better at providing psychological insights, depicting a more nuanced picture of model performance.
@inproceedings{sobhani2025categorize, title = {Language Models can Categorize System Inputs for Performance Analysis}, author = {Sobhani, Dominic and Zhong, Ruiqi and Marrese-Taylor, Edison and Sakaguchi, Keisuke and Matsuo, Yutaka}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.317/}, pages = {6241--6257}, isbn = {979-8-89176-189-6} }
Nature MI
Investigating machine moral judgement through the Delphi experiment

Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jack Hessel, Jon Borchardt, Taylor Sorensen, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi

Nature Machine Intelligence 2025

Abs Bib PDF

As our society adopts increasingly powerful artificial intelligence (AI) systems for pervasive use, there are growing concerns about machine morality—or lack thereof. Millions of users already rely on the outputs of AI systems, such as chatbots, as decision aids. Meanwhile, AI researchers continue to grapple with the challenge of aligning these systems with human morality and values. In response to this challenge, we build and test Delphi, an open-source AI system trained to predict the moral judgements of US participants. The computational framework of Delphi is grounded in the framework proposed by the prominent moral philosopher John Rawls. Our results speak to the promises and limits of teaching machines about human morality. Delphi demonstrates improved generalization capabilities over those exhibited by off-the-shelf neural language models. At the same time, Delphi’s failures also underscore important challenges in this arena. For instance, Delphi has limited cultural awareness and is susceptible to pervasive biases. Despite these shortcomings, we demonstrate several compelling use cases of Delphi, including its incorporation as a component within an ensemble of AI systems. Finally, we computationally demonstrate the potential of Rawls’s prospect of hybrid approaches for reliable moral reasoning, inspiring future research in computational morality.
@article{jiang2025delphi, author = {Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny T. and Levine, Sydney and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Hessel, Jack and Borchardt, Jon and Sorensen, Taylor and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin}, date = {2025/01/01}, doi = {10.1038/s42256-024-00969-6}, id = {Jiang2025}, isbn = {2522-5839}, journal = {Nature Machine Intelligence}, number = {1}, pages = {145--160}, title = {Investigating machine moral judgement through the Delphi experiment}, url = {https://doi.org/10.1038/s42256-024-00969-6}, volume = {7}, year = {2025} }
arXiv
FinchGPT: a Transformer based language model for birdsong analysis

Kosei Kobayashi, Kosuke Matsuzaki, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui, Kentaro Abe

arXiv 2025

Abs Bib PDF

The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllables sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT’s attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model’s outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
@article{kobayashi2025finch, title = {FinchGPT: a Transformer based language model for birdsong analysis}, author = {Kobayashi, Kosei and Matsuzaki, Kosuke and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro and Abe, Kentaro}, journal = {arXiv}, year = {2025}, doi = {10.48550/arxiv.2502.00344} }
COLING
Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment

Go Kamoda, Akari Asai, Ana Brassard, Keisuke Sakaguchi

Proceedings of the 31st International Conference on Computational Linguistics Jan 2025

Abs Bib PDF Project

Evaluating the outputs of large language models (LLMs) on long-form generative tasks remains challenging. While fine-grained, aspect-wise evaluations provide valuable diagnostic information, they are difficult to design exhaustively, and each aspect’s contribution to the overall acceptability of an answer is unclear. In this study, we propose a method to compute an overall quality score as a weighted average of three key aspects: factuality, informative- ness, and formality. This approach achieves stronger correlations with human judgments compared to previous metrics. Our analysis identifies factuality as the most predictive aspect of overall quality. Additionally, we release a dataset of 1.2k long-form QA answers annotated with both absolute judgments and relative preferences in overall and aspect-wise schemes to aid future research in evaluation practices.
@inproceedings{kamoda2025lfqa, title = {Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment}, author = {Kamoda, Go and Asai, Akari and Brassard, Ana and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 31st International Conference on Computational Linguistics}, month = jan, year = {2025}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.coling-main.588/}, pages = {8787--8808} }

2024

EMNLP
First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024

Abs Bib PDF

"Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process.Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning."
@inproceedings{aoki2024heuristics, title = {First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning}, author = {Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Sone, Shusaku and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.emnlp-main.789/}, doi = {10.18653/v1/2024.emnlp-main.789}, pages = {14255--14271} }
COLM
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

First Conference on Language Modeling 2024

Abs Bib PDF

Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement.
@inproceedings{brassard2024acorn, title = {{ACORN}: Aspect-wise Commonsense Reasoning Explanation Evaluation}, author = {Brassard, Ana and Heinzerling, Benjamin and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro}, booktitle = {First Conference on Language Modeling}, year = {2024}, url = {https://openreview.net/forum?id=2oHnsM9M9D} }
arXiv
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

arXiv Dec 2024

Abs Bib PDF

This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to the model’s internals. Specifically, we inspect when they internally determine their answers, particularly before or after CoT begins, to determine whether models follow a post-hoc "think-to-talk" mode or a step-by-step "talk-to-think" mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models; for example, simple subproblems are solved before CoT begins, and more complicated multi-hop calculations are performed during CoT.
@article{kudo2024think, author = {{Kudo}, Keito and {Aoki}, Yoichi and {Kuribayashi}, Tatsuki and {Sone}, Shusaku and {Taniguchi}, Masaya and {Brassard}, Ana and {Sakaguchi}, Keisuke and {Inui}, Kentaro}, title = {{Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning}}, journal = {arXiv}, year = {2024}, month = dec, doi = {10.48550/arXiv.2412.01113} }
BEA
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond

Masato Mita, Keisuke Sakaguchi, Masato Hagiwara, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024) Jun 2024

Abs Bib PDF Project

Natural language processing (NLP) technology has rapidly improved automated grammatical error correction (GEC) tasks, and the GEC community has begun to explore document-level revision. However, there are two major obstacles to going beyond automated \textitsentence-level GEC to NLP-based document-level revision support: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is infeasible to obtain all possible references and evaluate revision quality using such references because there are infinite revision possibilities. To address these challenges, this paper proposes a new document revision corpus, Text Revision of ACL papers (TETRA), in which multiple professional editors have revised academic papers sampled from the ACL anthology. This corpus enables us to focus on document-level and paragraph-level edits, such as edits related to coherence and consistency. Additionally, as a case study using the TETRA corpus, we investigate reference-less and interpretable methods for meta-evaluation to detect quality improvements according to document revisions. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.
@inproceedings{mita2024tetra, title = {Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond}, author = {Mita, Masato and Sakaguchi, Keisuke and Hagiwara, Masato and Mizumoto, Tomoya and Suzuki, Jun and Inui, Kentaro}, booktitle = {Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, pages = {251--265} }
SigDial
A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying

Shinnosuk Nozue, Yuto Nakano, Shoji Moriya, Tomoki Ariyama, Kazuma Kokuta, Suchun Xie, Kai Sato, Shusaku Sone, Ryohei Kamei, Reina Akama, Yuichiroh Matsubayashi, Keisuke Sakaguchi

Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue Sep 2024

Abs Bib PDF

The evolution of large language models has enabled fluent dialogue, increasing interest in the coexistence of humans and avatars. An essential aspect of achieving this coexistence involves developing sophisticated dialogue systems that can influence user behavior. In this background, we propose an effective multimodal dialogue system designed to promote consensus building with humans. Our system employs a slot-filling strategy to guide discussions and attempts to influence users with suggestions through emotional expression and intent conveyance via its avatar. These innovations have resulted in our system achieving the highest performance in a competition evaluating consensus building between humans and dialogue systems. We hope that our research will promote further discussion on the development of dialogue systems that enhance consensus building in human collaboration.
@inproceedings{nozue2024multimodal, title = {A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying}, author = {Nozue, Shinnosuk and Nakano, Yuto and Moriya, Shoji and Ariyama, Tomoki and Kokuta, Kazuma and Xie, Suchun and Sato, Kai and Sone, Shusaku and Kamei, Ryohei and Akama, Reina and Matsubayashi, Yuichiroh and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue}, month = sep, year = {2024}, address = {Kyoto, Japan}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.sigdial-1.57}, doi = {10.18653/v1/2024.sigdial-1.57}, pages = {669--673} }
arXiv
Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki, Jun Suzuki, Keisuke Sakaguchi

arXiv Aug 2024

Abs Bib PDF

We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
@article{kimura2024hijack, author = {{Kimura}, Subaru and {Tanaka}, Ryota and {Miyawaki}, Shumpei and {Suzuki}, Jun and {Sakaguchi}, Keisuke}, title = {{Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection}}, journal = {arXiv}, year = {2024}, month = aug, doi = {10.48550/arXiv.2408.03554} }

arXiv

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

LLM-jp, :, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Akim Mousterou, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino

arXiv e-prints Jul 2024

Abs Bib PDF

This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit this https URL.

@article{llmjp2024,
  author = {{LLM-jp} and {:} and {Aizawa}, Akiko and {Aramaki}, Eiji and {Chen}, Bowen and {Cheng}, Fei and {Deguchi}, Hiroyuki and {Enomoto}, Rintaro and {Fujii}, Kazuki and {Fukumoto}, Kensuke and {Fukushima}, Takuya and {Han}, Namgi and {Harada}, Yuto and {Hashimoto}, Chikara and {Hiraoka}, Tatsuya and {Hisada}, Shohei and {Hosokawa}, Sosuke and {Jie}, Lu and {Kamata}, Keisuke and {Kanazawa}, Teruhito and {Kanezashi}, Hiroki and {Kataoka}, Hiroshi and {Katsumata}, Satoru and {Kawahara}, Daisuke and {Kawano}, Seiya and {Keyaki}, Atsushi and {Kiryu}, Keisuke and {Kiyomaru}, Hirokazu and {Kodama}, Takashi and {Kubo}, Takahiro and {Kuga}, Yohei and {Kumon}, Ryoma and {Kurita}, Shuhei and {Kurohashi}, Sadao and {Li}, Conglong and {Maekawa}, Taiki and {Matsuda}, Hiroshi and {Miyao}, Yusuke and {Mizuki}, Kentaro and {Mizuki}, Sakae and {Murawaki}, Yugo and {Mousterou}, Akim and {Nakamura}, Ryo and {Nakamura}, Taishi and {Nakayama}, Kouta and {Nakazato}, Tomoka and {Niitsuma}, Takuro and {Nishitoba}, Jiro and {Oda}, Yusuke and {Ogawa}, Hayato and {Okamoto}, Takumi and {Okazaki}, Naoaki and {Oseki}, Yohei and {Ozaki}, Shintaro and {Ryu}, Koki and {Rzepka}, Rafal and {Sakaguchi}, Keisuke and {Sasaki}, Shota and {Sekine}, Satoshi and {Suda}, Kohei and {Sugawara}, Saku and {Sugiura}, Issa and {Sugiyama}, Hiroaki and {Suzuki}, Hisami and {Suzuki}, Jun and {Suzumura}, Toyotaro and {Tachibana}, Kensuke and {Takagi}, Yu and {Takami}, Kyosuke and {Takeda}, Koichi and {Takeshita}, Masashi and {Tanaka}, Masahiro and {Taura}, Kenjiro and {Tolmachev}, Arseny and {Ueda}, Nobuhiro and {Wan}, Zhen and {Yada}, Shuntaro and {Yahata}, Sakiko and {Yamamoto}, Yuya and {Yamauchi}, Yusuke and {Yanaka}, Hitomi and {Yokota}, Rio and {Yoshino}, Koichiro},
  title = {{LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs}},
  journal = {arXiv e-prints},
  keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence},
  year = {2024},
  month = jul,
  doi = {10.48550/arXiv.2407.03963}
}

arXiv
The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models

Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

arXiv Jun 2024

Abs Bib PDF

Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments.
@article{takahashi2024edit, author = {{Takahashi}, Ryosuke and {Kamoda}, Go and {Heinzerling}, Benjamin and {Sakaguchi}, Keisuke and {Inui}, Kentaro}, title = {{The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models}}, journal = {arXiv}, year = {2024}, month = jun, doi = {10.48550/arXiv.2406.06032} }
MORPHON
J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema

Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui, Keisuke Sakaguchi

Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology Jun 2024

Abs Bib PDF Project

We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language’s agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] + suru (do-PRS)). Morphologically, this inflection pattern is same as the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We will release J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.
@inproceedings{matsuzaki2024junimorph, title = {{J}-{U}ni{M}orph: {J}apanese Morphological Annotation through the Universal Feature Schema}, author = {Matsuzaki, Kosuke and Taniguchi, Masaya and Inui, Kentaro and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, pages = {7--19} }
LREC-COLING
A Call for Clarity in Beam Search: How It Works and When It Stops

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, Noah A. Smith

Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation May 2024

Abs Bib PDF Project

Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly-used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation, that generalizes the stopping criterion and provides flexibility to the depth of search. Empirical results demonstrate that adjusting this patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.
@inproceedings{kasai2024beam, title = {A Call for Clarity in Beam Search: How It Works and When It Stops}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Radev, Dragomir and Choi, Yejin and Smith, Noah A.}, booktitle = {{Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation}}, year = {2024}, month = may }
ICLR
PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning

Faeze Brahman, Chandra Bhagavatula, Valentina Pyatkin, Jena D. Hwang, Xiang Lorraine Li, Hirona Jacqueline Arai, Soumya Sanyal, Keisuke Sakaguchi, Xiang Ren, Yejin Choi

The Twelfth International Conference on Learning Representations May 2024

Abs Bib PDF

Procedural planning, which entails decomposing a high-level goal into a sequence of temporally ordered steps, is an important yet intricate task for machines. It involves integrating common-sense knowledge to reason about complex and often contextualized situations, e.g. “scheduling a doctor’s appointment without a phone”. While current approaches show encouraging results using large language models (LLMs), they are hindered by drawbacks such as costly API calls and reproducibility issues. In this paper, we advocate planning using smaller language models. We present PlaSma, a novel two-pronged approach to endow small language models with procedural knowledge and (constrained) language planning capabilities. More concretely, we develop symbolic procedural knowledge distillation to enhance the commonsense knowledge in small language models and an inference-time algorithm to facilitate more structured and accurate reasoning. In addition, we introduce a new related task, Replanning, that requires a revision of a plan to cope with a constrained situation. In both the planning and replanning settings, we show that orders-of-magnitude smaller models (770M-11B parameters) can compete and often surpass their larger teacher models’ capabilities. Finally, we showcase successful application of PlaSma in an embodied environment, VirtualHome.
@inproceedings{brahman2024plasma, title = {PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning}, author = {Brahman, Faeze and Bhagavatula, Chandra and Pyatkin, Valentina and Hwang, Jena D. and Li, Xiang Lorraine and Arai, Hirona Jacqueline and Sanyal, Soumya and Sakaguchi, Keisuke and Ren, Xiang and Choi, Yejin}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2024}, url = {https://github.com/allenai/PlaSma}, month = may }

2023

NeurIPS
RealTime QA: What’s the Answer Right Now?

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, Kentaro Inui

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track Dec 2023

Abs Bib PDF Project

We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open domain QA datasets and pursues, instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that RealTime QA will spur progress in instantaneous applications of question answering and beyond.
@inproceedings{kasai2023realtime, title = {RealTime {QA}: What's the Answer Right Now?}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Bras, Ronan Le and Asai, Akari and Yu, Xinyan Velocity and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro}, booktitle = {Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, year = {2023}, url = {https://openreview.net/forum?id=HfKOIPCvsv}, month = dec }
EMNLP Findings
Test-time Augmentation for Factual Probing

Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: EMNLP 2023 Dec 2023

Abs Bib PDF Project

Factual probing is a method that uses prompts to test if a language model “knows” certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
@inproceedings{kamoda2023tta, title = {Test-time Augmentation for Factual Probing}, author = {Kamoda, Go and Heinzerling, Benjamin and Sakaguchi, Keisuke and Inui, Kentaro}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023}, month = dec, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.236}, doi = {10.18653/v1/2023.findings-emnlp.236}, pages = {3650--3661} }
ACL
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation

Chandra Bhagavatula, Jena D Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2023

Abs Bib PDF

Commonsense capabilities of pre-trained language models dramatically improve with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms?The key intellectual challenge is to design a learning algorithm that achieve a competitive level of commonsense acquisition, without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly.We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model’s own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.
@inproceedings{bhagavatula2023i2d2, title = {I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation}, author = {Bhagavatula, Chandra and Hwang, Jena D and Downey, Doug and Bras, Ronan Le and Lu, Ximing and Sakaguchi, Keisuke and Swayamdipta, Swabha and West, Peter and Choi, Yejin}, year = {2023}, booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, doi = {10.18653/v1/2023.acl-long.535}, pages = {9614--9630} }
ACL
ELQA: A Corpus of Metalinguistic Questions and Answers about English

Shabnam Behzad, Keisuke Sakaguchi, Nathan Schneider, Amir Zeldes

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2023

Abs Bib PDF Project

We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorrect) usage examples. Unlike most NLP datasets, this corpus is metalinguistic—it consists of language about language. As such, it can facilitate investigations of the metalinguistic capabilities of NLU models, as well as educational applications in the language learning domain. To study this, we define a free-form question answering task on our dataset and conduct evaluations on multiple LLMs (Large Language Models) to analyze their capacity to generate metalinguistic answers.
@inproceedings{behzad2023elqa, title = {ELQA: A Corpus of Metalinguistic Questions and Answers about English}, author = {Behzad, Shabnam and Sakaguchi, Keisuke and Schneider, Nathan and Zeldes, Amir}, booktitle = {{Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}}, year = {2023}, month = jul, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, doi = {10.18653/v1/2023.acl-long.113}, pages = {2031--2047} }
arXiv
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

arXiv 2023

Abs Bib PDF

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs’ potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
@article{kasai2023med, author = {{Kasai}, Jungo and {Kasai}, Yuhei and {Sakaguchi}, Keisuke and {Yamada}, Yutaro and {Radev}, Dragomir}, title = {{Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.18027} }
arXiv
An Analysis of GPT-3’s Performance in Grammatical Error Correction

Steven Coyne, Keisuke Sakaguchi

arXiv 2023

Abs Bib PDF

GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of NaturalLanguage Processing tasks. However, there isa relative lack of detailed published analysisof their performance on the task of grammatical error correction (GEC). To address this,we perform experiments testing the capabilitiesof a GPT-3.5 model (text-davinci-003)and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shotand few-shot settings, analyzing intriguing orproblematic outputs encountered with differentprompt formats. We report the performance ofour best prompt on the BEA-2019 and JFLEGdatasets, finding that the GPT models can perform well in a sentence-level revision setting,with GPT-4 achieving a new high score on theJFLEG benchmark. Through human evaluation experiments, we compare the GPT models’corrections to source, human reference, andbaseline GEC system sentences and observedifferences in editing strategies and how theyare scored by human raters.
@article{coyne2023gptgec, author = {{Coyne}, Steven and {Sakaguchi}, Keisuke}, title = {{An Analysis of GPT-3's Performance in Grammatical Error Correction}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.14342} }
arXiv
Causal schema induction for knowledge discovery

Michael Regan, Jena D. Hwang, Keisuke Sakaguchi, James Pustejovsky

arXiv 2023

Abs Bib PDF

Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning, however resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.
@article{regan2023causalschema, author = {{Regan}, Michael and {Hwang}, Jena D. and {Sakaguchi}, Keisuke and {Pustejovsky}, James}, title = {{Causal schema induction for knowledge discovery}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.15381} }
EACL
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Proceedings of the 2023 Conference of the European Chapter of the Association for Computational Linguistics May 2023

Abs Bib PDF

Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.
@inproceedings{Kudo2023eacl, title = {Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?}, author = {Kudo, Keito and Aoki, Yoichi and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro}, year = {2023}, booktitle = {Proceedings of the 2023 Conference of the {E}uropean Chapter of the Association for Computational Linguistics}, month = may, publisher = {Association for Computational Linguistics}, address = {Dubrovnik, Croatia}, pages = {1351--1362}, doi = {10.18653/v1/2023.eacl-main.98} }
EACL Findings
Empirical Investigation of Neural Symbolic Reasoning Strategies

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: EACL 2023 May 2023

Best Paper Award at AACL-IJCNLP 2022 Student Research Workshop

Abs Bib PDF

Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear.Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning.Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer.Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation.Our results indicate the importance of further exploring effective strategies for neural reasoning models.
@inproceedings{Aoki2023eacl, title = {Empirical Investigation of Neural Symbolic Reasoning Strategies}, author = {Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro}, year = {2023}, booktitle = {Findings of the Association for Computational Linguistics: EACL 2023 }, doi = {10.18653/v1/2023.findings-eacl.86}, pages = {1154--1162}, month = may, address = {Dubrovnik, Croatia}, publisher = {Association for Computational Linguistics} }

Jxiv
Evaluating GPT in Japanese Bar Examination: Insights and Limitations

Jungmin Choi, Jungo Kasai, Keisuke Sakaguchi

Dec 2023

Abs Bib Project

Large-scale language models like ChatGPT have been reported to exceed the accuracy of human experts in a wide range of tasks. Recent research reports that ChatGPT passed the Japanese National Medical Examination, confirming its high performance in Japanese. We evaluated the accuracy of GPT-3, GPT-4, and ChatGPT in the Japanese Bar Examination (the multiple-choice format section), focusing on Constitutional Law, Civil Law, and Criminal Law over the past five years. The results revealed that the current correct answer rate for these exams is only 30-40% (compared to the average pass rate of 70%), which is significantly low. This study went beyond just the correct answer rate, dissecting the necessary reasoning and knowledge for the responses, and examining the performance of large-scale language models from each perspective. The findings show that 1) large-scale language models possess extensive knowledge of many statutes, 2) they have a high correct answer rate for questions that require understanding of legal theories but not specific knowledge of law, and 3) they have a low correct answer rate for questions requiring knowledge of case law. The primary reason for their lower performance compared to the American Bar Examination is thought to be a lack of knowledge in Japanese law, especially in case law.
@techreport{choi_et_al_2023_j_bar_exam_en, author = {Choi, Jungmin and Kasai, Jungo and Sakaguchi, Keisuke}, title = {{Evaluating GPT in Japanese Bar Examination: Insights and Limitations}}, year = {2023}, month = dec, booktitle = {Jxiv} }

2022

EMNLP
Twist Decoding: Diverse Generators Guide Each Other

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) Dec 2022

Abs Bib PDF Project

Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.
@inproceedings{kasai2022twist, title = {Twist Decoding: Diverse Generators Guide Each Other}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Peng, Hao and Lu, Ximing and Radev, Dragomir and Choi, Yejin and Smith, Noah A.}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, month = dec, year = {2022}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.emnlp-main.326}, pages = {4909--4923} }
arXiv
Can Machines Learn Morality? The Delphi Experiment

Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi

arXiv 2022

Abs Bib PDF Project

As AI systems become increasingly powerful and pervasive, there are growing concerns about machines’ morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense. Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.
@article{jiang2022delphi, title = {Can Machines Learn Morality? The Delphi Experiment}, author = {Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Borchardt, Jon and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin}, journal = {arXiv}, year = {2022}, volume = {abs/2110.07574}, doi = {10.48550/ARXIV.2110.07574} }
NAACL
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Jul 2022

Abs Bib PDF Project

Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (BILLBOARDs), that simultaneously tracks progress in language generation tasks and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a BILLBOARD accepts both generators and evaluation metrics as competing entries. A BILLBOARD automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlations with human judgments. We release four BILLBOARDs for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
@inproceedings{Kasai2022BidimensionalLG, title = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Dunagan, Lavinia and Morrison, Jacob and Fabbri, Alexander R. and Choi, Yejin and Smith, Noah A.}, year = {2022}, booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, month = jul, pages = {3540--3557}, address = {Seattle, United States}, publisher = {Association for Computational Linguistics} }
NAACL
Transparent Human Evaluation for Image Captioning

Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Jul 2022

Abs Bib PDF Project

We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
@inproceedings{Kasai2022TransparentHE, title = {Transparent Human Evaluation for Image Captioning}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Dunagan, Lavinia and Morrison, Jacob and Bras, Ronan Le and Choi, Yejin and Smith, Noah A.}, year = {2022}, booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, pages = {3464--3478}, month = jul, address = {Seattle, Washington}, publisher = {Association for Computational Linguistics} }
IMLW@AAAI
Interscript: A dataset for interactive learning of scripts through error feedback

Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, Yiming Yang

The AAAI-22 Workshop on Interactive Machine Learning 2022

Abs Bib PDF Project

How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, INTERSCRIPT, containing user feedback on a deployed model that generates complex everyday tasks. INTERSCRIPT contains 8,466 data points– the input is a possibly erroneous script and a user feedback and the output is a modified script. We posit two use-cases of INTERSCRIPT that might significantly advance the state-of-the-art in interactive.
@inproceedings{Tandon2021InterscriptAD, title = {Interscript: A dataset for interactive learning of scripts through error feedback}, author = {Tandon, Niket and Madaan, Aman and Clark, Peter and Sakaguchi, Keisuke and Yang, Yiming}, year = {2022}, booktitle = {The AAAI-22 Workshop on Interactive Machine Learning} }

2021

arXiv
Improving Neural Model Performance through Natural Language Feedback on Their Explanations

Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Yiming Yang, Peter Clark, Keisuke Sakaguchi, Eduard H. Hovy

arXiv 2021

Abs Bib PDF

A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIEan interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies as compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 points on accuracy on defeasible reasoning across all three domains.
@article{Madaan2021ImprovingNM, title = {Improving Neural Model Performance through Natural Language Feedback on Their Explanations}, author = {Madaan, Aman and Tandon, Niket and Rajagopal, Dheeraj and Yang, Yiming and Clark, Peter and Sakaguchi, Keisuke and Hovy, Eduard H.}, journal = {arXiv}, year = {2021}, volume = {abs/2104.08765} }
arXiv
GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education

Masato Hagiwara, Joshua Tanner, Keisuke Sakaguchi

arXiv 2021

Abs Bib PDF Project

We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education. The model architecture enables it to learn from a small amount of texts annotated with spans and their labels, which 1) enables easier and more intuitive annotation, 2) supports overlapping spans, and 3) is less prone to error propagation, compared to complex hand-crafted rules defined on constituency/dependency parses. We show that we can bootstrap a grammar profiler model with F1 ≈ 0.6 from only a couple hundred sentences both in English and Chinese, which can be further boosted via learning a multilingual model. With GrammarTagger, we also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features.
@article{Hagiwara2021GrammarTaggerAM, title = {GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education}, author = {Hagiwara, Masato and Tanner, Joshua and Sakaguchi, Keisuke}, journal = {arXiv}, year = {2021}, volume = {abs/2104.03190} }
EMNLP Findings
proScript: Partially Ordered Scripts Generation

Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi

Findings of the Association for Computational Linguistics: EMNLP 2021 Nov 2021

Abs Bib PDF Project

Scripts – prototypical event sequences describing everyday activities – have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collect a large (6.4k) crowdsourced partially ordered scripts (named proScript), that is substantially larger than prior datasets, and develop models that generate scripts by combining language generation and graph structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 on task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.
@inproceedings{sakaguchi-etal-2021-proscript-partially, title = {pro{S}cript: Partially Ordered Scripts Generation}, author = {Sakaguchi, Keisuke and Bhagavatula, Chandra and Le Bras, Ronan and Tandon, Niket and Clark, Peter and Choi, Yejin}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021}, month = nov, year = {2021}, address = {Punta Cana, Dominican Republic}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.findings-emnlp.184}, pages = {2138--2149} }
CACM
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Commun. ACM Aug 2021

Abs Bib PDF

Commonsense reasoning remains a major challenge in AI, and yet, recent progresses on benchmarks may seem to suggest otherwise. In particular, the recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question—whether these models have truly acquired robust commonsense capabilities or they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense.To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WINOGRANDE compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset.Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
@article{10.1145/3474381, author = {Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin}, title = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale}, year = {2021}, issue_date = {September 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {64}, number = {9}, issn = {0001-0782}, url = {https://doi.org/10.1145/3474381}, doi = {10.1145/3474381}, journal = {Commun. ACM}, month = aug, pages = {99--106}, numpages = {8} }
AAAI
COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, Yejin Choi

Proceedings of the AAAI Conference on Artificial Intelligence May 2021

Abs Bib PDF Project

Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge. In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them. With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains 12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.
@article{Hwang2021COMETATOMIC2O, title = {COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, author = {Hwang, Jena D. and Bhagavatula, Chandra and Le Bras, Ronan and Da, Jeff and Sakaguchi, Keisuke and Bosselut, Antoine and Choi, Yejin}, volume = {35}, number = {7}, year = {2021}, month = may, pages = {6384--6392} }

2020

EMNLP
A Dataset for Tracking Entities in Open Domain Procedural Text

Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Nov 2020

Abs Bib PDF Project

We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OPENPI, a high-quality (91.5% coverage as judged by humans and completely vetted), and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on BLEU metric, leaving enough room for novel model architectures.
@inproceedings{tandon-etal-2020-dataset, title = {A Dataset for Tracking Entities in Open Domain Procedural Text}, author = {Tandon, Niket and Sakaguchi, Keisuke and Dalvi, Bhavana and Rajagopal, Dheeraj and Clark, Peter and Guerquin, Michal and Richardson, Kyle and Hovy, Eduard}, booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, month = nov, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.emnlp-main.520}, doi = {10.18653/v1/2020.emnlp-main.520}, pages = {6408--6417} }
ACL
Uncertain Natural Language Inference

Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, Benjamin Van Durme

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics Jul 2020

Abs Bib PDF Project

We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.
@inproceedings{chen-etal-2020-uncertain, title = {Uncertain Natural Language Inference}, author = {Chen, Tongfei and Jiang, Zhengping and Poliak, Adam and Sakaguchi, Keisuke and Van Durme, Benjamin}, booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, month = jul, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.774}, doi = {10.18653/v1/2020.acl-main.774}, pages = {8772--8779} }
LREC
The Universal Decompositional Semantics Dataset and Decomp Toolkit

Aaron Steven White, Elias Stengel-Eskin, Siddharth Vashishtha, Venkata Subrahmanyan Govindarajan, Dee Ann Reisinger, Tim Vieira, Keisuke Sakaguchi, Sheng Zhang, Francis Ferraro, Rachel Rudinger, Kyle Rawlins, Benjamin Van Durme

Proceedings of the 12th Language Resources and Evaluation Conference May 2020

Abs Bib PDF Project

We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specification—with graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using sophisticated normalization procedures. The Decomp toolkit provides a suite of Python 3 tools for querying UDS graphs using SPARQL. Both UDS1.0 and Decomp0.1 are publicly available at http://decomp.io.
@inproceedings{white-etal-2020-universal, title = {The Universal Decompositional Semantics Dataset and Decomp Toolkit}, author = {White, Aaron Steven and Stengel-Eskin, Elias and Vashishtha, Siddharth and Govindarajan, Venkata Subrahmanyan and Reisinger, Dee Ann and Vieira, Tim and Sakaguchi, Keisuke and Zhang, Sheng and Ferraro, Francis and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin}, booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference}, month = may, year = {2020}, address = {Marseille, France}, publisher = {European Language Resources Association}, url = {https://aclanthology.org/2020.lrec-1.699}, pages = {5698--5707}, language = {English}, isbn = {979-10-95546-34-4} }

ICLR

Abductive Commonsense Reasoning

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, Yejin Choi

International Conference on Learning Representations 2020

Bib PDF Project

@inproceedings{bhagavatula2020abductive,
  title = {Abductive Commonsense Reasoning},
  author = {Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin},
  booktitle = {International Conference on Learning Representations},
  year = {2020},
  url = {https://openreview.net/forum?id=Byg1v1HKDB}
}

AAAI
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Proceedings of the AAAI Conference on Artificial Intelligence Apr 2020

Outstanding Paper Award (Single Best Paper)

MIT Technology Review

Import AI

Forbes

Abs Bib PDF Project

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4 – 79.1%, which are ∼15-35% (absolute) below human performance of 94.0%, depending on the amount of the training data allowed (2% – 100% respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks — WSC (→ 90.1%), DPR (→ 93.1%), COPA(→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
@article{Sakaguchi-etal-2020-winogrande, title = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale}, volume = {34}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/6399}, doi = {10.1609/aaai.v34i05.6399}, number = {05}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, author = {Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin}, year = {2020}, month = apr, pages = {8732--8740} }

2019

EMNLP
WIQA: A dataset for “What if...” reasoning over procedural text

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, Antoine Bosselut

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Nov 2019

Abs Bib PDF Project

We introduce WIQA, the first large-scale dataset of “What if...” questions over procedural text. WIQA contains a collection of paragraphs, each annotated with multiple influence graphs describing how one change affects another, and a large (40k) collection of “What if...?” multiple-choice questions derived from these. For example, given a paragraph about beach erosion, would stormy weather hasten or decelerate erosion? WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations. We find that state-of-the-art models achieve 73.8% accuracy, well below the human performance of 96.3%. We analyze the challenges, in particular tracking chains of influences, and present the dataset as an open challenge to the community.
@inproceedings{tandon-etal-2019-wiqa, title = {{WIQA}: A dataset for {``}What if...{''} reasoning over procedural text}, author = {Tandon, Niket and Dalvi, Bhavana and Sakaguchi, Keisuke and Clark, Peter and Bosselut, Antoine}, booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)}, month = nov, year = {2019}, address = {Hong Kong, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D19-1629}, doi = {10.18653/v1/D19-1629}, pages = {6076--6085} }

2018

ACL
Efficient Online Scalar Annotation with Bounded Support

Keisuke Sakaguchi, Benjamin Van Durme

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2018

Abs Bib PDF Project Slides

We describe a novel method for efficiently eliciting scalar annotations for dataset construction and system quality estimation by human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores derive from annotator comparison of items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels) proposed here. Our proposal leads to increased correlation with ground truth, at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.
@inproceedings{sakaguchi-van-durme-2018-efficient, title = {Efficient Online Scalar Annotation with Bounded Support}, author = {Sakaguchi, Keisuke and Van Durme, Benjamin}, booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2018}, address = {Melbourne, Australia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P18-1020}, doi = {10.18653/v1/P18-1020}, pages = {208--218} }

2017

IJCNLP
Grammatical Error Correction with Neural Reinforcement Learning

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers) Nov 2017

Abs Bib PDF Project Slides

We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
@inproceedings{sakaguchi-etal-2017-grammatical, title = {Grammatical Error Correction with Neural Reinforcement Learning}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)}, month = nov, year = {2017}, address = {Taipei, Taiwan}, publisher = {Asian Federation of Natural Language Processing}, url = {https://aclanthology.org/I17-2062}, pages = {366--372} }
BEA
GEC into the future: Where are we going and how do we get there?

Keisuke Sakaguchi, Courtney Napoles, Joel Tetreault

Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications Sep 2017

Abs Bib PDF

The field of grammatical error correction (GEC) has made tremendous bounds in the last ten years, but new questions and obstacles are revealing themselves. In this position paper, we discuss the issues that need to be addressed and provide recommendations for the field to continue to make progress, and propose a new shared task. We invite suggestions and critiques from the audience to make the new shared task a community-driven venture.
@inproceedings{sakaguchi-etal-2017-gec, title = {{GEC} into the future: Where are we going and how do we get there?}, author = {Sakaguchi, Keisuke and Napoles, Courtney and Tetreault, Joel}, booktitle = {Proceedings of the 12th Workshop on Innovative Use of {NLP} for Building Educational Applications}, month = sep, year = {2017}, address = {Copenhagen, Denmark}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W17-5019}, doi = {10.18653/v1/W17-5019}, pages = {180--187} }
ACL
Error-repair Dependency Parsing for Ungrammatical Texts

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) Jul 2017

Outstanding Paper Award (1.5% of the submissions)

Abs Bib PDF Project Slides

We propose a new dependency parsing scheme which jointly parses a sentence and repairs grammatical errors by extending the non-directional transition-based formalism of Goldberg and Elhadad (2010) with three additional actions: SUBSTITUTE, DELETE, INSERT. Because these actions may cause an infinite loop in derivation, we also introduce simple constraints that ensure the parser termination. We evaluate our model with respect to dependency accuracy and grammaticality improvements for ungrammatical sentences, demonstrating the robustness and applicability of our scheme.
@inproceedings{sakaguchi-etal-2017-error, title = {Error-repair Dependency Parsing for Ungrammatical Texts}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)}, month = jul, year = {2017}, address = {Vancouver, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P17-2030}, doi = {10.18653/v1/P17-2030}, pages = {189--195} }
EACL
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers Apr 2017

Grammarly blog

Abs Bib PDF Project

We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.
@inproceedings{napoles-etal-2017-jfleg, title = {{JFLEG}: A Fluency Corpus and Benchmark for Grammatical Error Correction}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel}, booktitle = {Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers}, month = apr, year = {2017}, address = {Valencia, Spain}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/E17-2037}, pages = {229--234} }
AAAI
Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network

Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence 2017

Abs Bib PDF Project Poster

Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise.Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers and character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.
@inproceedings{10.5555/3298023.3298045, author = {Sakaguchi, Keisuke and Duh, Kevin and Post, Matt and Durme, Benjamin Van}, title = {Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network}, year = {2017}, publisher = {AAAI Press}, booktitle = {Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence}, pages = {3281--3287}, numpages = {7}, location = {San Francisco, California, USA}, series = {AAAI'17} }

2016

EMNLP
Universal Decompositional Semantics on Universal Dependencies

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, Benjamin Van Durme

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing Nov 2016

Abs Bib PDF Project

We present a framework for augmenting data sets from the Universal Dependencies project with Universal Decompositional Semantics. Where the Universal Dependencies project aims to provide a syntactic annotation standard that can be used consistently across many languages as well as a collection of corpora that use that standard, our extension has similar aims for semantic annotation. We describe results from annotating the English Universal Dependencies treebank, dealing with word senses, semantic roles, and event properties.
@inproceedings{white-etal-2016-universal, title = {Universal Decompositional Semantics on {U}niversal {D}ependencies}, author = {White, Aaron Steven and Reisinger, Drew and Sakaguchi, Keisuke and Vieira, Tim and Zhang, Sheng and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin}, booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2016}, address = {Austin, Texas}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D16-1177}, doi = {10.18653/v1/D16-1177}, pages = {1713--1723} }
EMNLP
There’s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction

Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing Nov 2016

Abs Bib PDF Project

Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
@inproceedings{napoles-etal-2016-theres, title = {There{'}s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel}, booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2016}, address = {Austin, Texas}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D16-1228}, doi = {10.18653/v1/D16-1228}, pages = {2109--2115} }
ACL
Phrase Structure Annotation and Parsing for Learner English

Ryo Nagata, Keisuke Sakaguchi

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Aug 2016

Abs Bib PDF

There has been almost no work on phrase structure annotation and parsing specially designed for learner English despite the fact that they are useful for representing the structural characteristics of learner English. To address this problem, in this paper, we first propose a phrase structure annotation scheme for learner English and annotate two different learner corpora using it. Second, we show their usefulness, reporting on (a) inter-annotator agreement rate, (b) characteristic CFG rules in the corpora, and (c) parsing performance on them. In addition, we explore methods to improve phrase structure parsing for learner English (achieving an F -measure of 0.878). Finally, we release the full annotation guidelines, the annotated data, and the improved parser model for learner English to the public.
@inproceedings{nagata-sakaguchi-2016-phrase, title = {Phrase Structure Annotation and Parsing for Learner {E}nglish}, author = {Nagata, Ryo and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = aug, year = {2016}, address = {Berlin, Germany}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P16-1173}, doi = {10.18653/v1/P16-1173}, pages = {1837--1847} }
TACL
Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality

Keisuke Sakaguchi, Courtney Napoles, Matt Post, Joel Tetreault

Transactions of the Association for Computational Linguistics 2016

Abs Bib PDF Project Slides

The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC’s reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
@article{sakaguchi-etal-2016-reassessing, title = {Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality}, author = {Sakaguchi, Keisuke and Napoles, Courtney and Post, Matt and Tetreault, Joel}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, url = {https://aclanthology.org/Q16-1013}, doi = {10.1162/tacl_a_00091}, pages = {169--182} }
arXiv
GLEU Without Tuning

Courtney Napoles, Keisuke Sakaguchi, Matt Post, Joel R. Tetreault

arXiv 2016

Abs Bib PDF

The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
@article{Napoles2016GLEUWT, title = {GLEU Without Tuning}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel R.}, journal = {arXiv}, year = {2016}, volume = {abs/1605.02592} }

2015

ACL
Ground Truth for Grammatical Error Correction Metrics

Courtney Napoles, Keisuke Sakaguchi, Matt Post, Joel Tetreault

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) Jul 2015

Abs Bib PDF Project

How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
@inproceedings{napoles-etal-2015-ground, title = {Ground Truth for Grammatical Error Correction Metrics}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)}, month = jul, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P15-2097}, doi = {10.3115/v1/P15-2097}, pages = {588--593} }
NAACL
Effective Feature Integration for Automated Short Answer Scoring

Keisuke Sakaguchi, Michael Heilman, Nitin Madnani

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies May 2015

Abs Bib PDF Slides

A major opportunity for NLP to have a realworld impact is in helping educators score student writing, particularly content-based writing (i.e., the task of automated short answer scoring). A major challenge in this enterprise is that scored responses to a particular question (i.e., labeled data) are valuable for modeling but limited in quantity. Additional information from the scoring guidelines for humans, such as exemplars for each score level and descriptions of key concepts, can also be used. Here, we explore methods for integrating scoring guidelines and labeled responses, and we find that stacked generalization (Wolpert, 1992) improves performance, especially for small training sets.
@inproceedings{sakaguchi-etal-2015-effective, title = {Effective Feature Integration for Automated Short Answer Scoring}, author = {Sakaguchi, Keisuke and Heilman, Michael and Madnani, Nitin}, booktitle = {Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies}, month = may, year = {2015}, address = {Denver, Colorado}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/N15-1111}, doi = {10.3115/v1/N15-1111}, pages = {1049--1054} }

2014

WMT
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the Ninth Workshop on Statistical Machine Translation Jun 2014

Abs Bib PDF Project Slides

A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentencelevel comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection. We continue this line of work by adapting the TrueSkill TM algorithm — an online approach for modeling the relative skills of players in ongoing competitions, such as Microsoft’s Xbox Live — to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
@inproceedings{sakaguchi-etal-2014-efficient, title = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation}, month = jun, year = {2014}, address = {Baltimore, Maryland, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W14-3301}, doi = {10.3115/v1/W14-3301}, pages = {1--11} }

2013

ACL
Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners

Keisuke Sakaguchi, Yuki Arase, Mamoru Komachi

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) Aug 2013

Abs Bib PDF Project Poster

We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
@inproceedings{sakaguchi-etal-2013-discriminative, title = {Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners}, author = {Sakaguchi, Keisuke and Arase, Yuki and Komachi, Mamoru}, booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)}, month = aug, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P13-2043}, pages = {238--242} }
CoNLL
NAIST at 2013 CoNLL Grammatical Error Correction Shared Task

Ippei Yoshimoto, Tomoya Kose, Kensuke Mitsuzawa, Keisuke Sakaguchi, Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task Aug 2013

Abs Bib PDF

This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subjectverb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for preposition and determiner errors. As for subject-verb agreement errors, we show that the Treelet Language Model-based approach can correct errors in which the target verb is distant from its subject. Our system ranked fourth on the official run.
@inproceedings{yoshimoto-etal-2013-naist, title = {{NAIST} at 2013 {C}o{NLL} Grammatical Error Correction Shared Task}, author = {Yoshimoto, Ippei and Kose, Tomoya and Mitsuzawa, Kensuke and Sakaguchi, Keisuke and Mizumoto, Tomoya and Hayashibe, Yuta and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task}, month = aug, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-3604}, pages = {26--33} }
BEA
NAIST at the NLI 2013 Shared Task

Tomoya Mizumoto, Yuta Hayashibe, Keisuke Sakaguchi, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications Jun 2013

Abs Bib PDF

This paper describes the Nara Institute of Science and Technology (NAIST) native language identification (NLI) system in the NLI 2013 Shared Task. We apply feature selection using a measure based on frequency for the closed track and try Capping and Sampling data methods for the open tracks. Our system ranked ninth in the closed track, third in open track 1 and fourth in open track 2.
@inproceedings{mizumoto-etal-2013-naist, title = {{NAIST} at the {NLI} 2013 Shared Task}, author = {Mizumoto, Tomoya and Hayashibe, Yuta and Sakaguchi, Keisuke and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Eighth Workshop on Innovative Use of {NLP} for Building Educational Applications}, month = jun, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-1717}, pages = {134--139} }
MWE
Construction of English MWE Dictionary and its Application to POS Tagging

Yutaro Shigeto, Ai Azuma, Sorami Hisamoto, Shuhei Kondo, Tomoya Kose, Keisuke Sakaguchi, Akifumi Yoshimoto, Frances Yung, Yuji Matsumoto

Proceedings of the 9th Workshop on Multiword Expressions Jun 2013

Abs Bib PDF

This paper reports our ongoing project for constructing an English multiword expression (MWE) dictionary and NLP tools based on the developed dictionary. We extracted functional MWEs from the English part of Wiktionary, annotated the Penn Treebank (PTB) with MWE information, and conducted POS tagging experiments. We report how the MWE annotation is done on PTB and the results of POS and MWE tagging experiments.
@inproceedings{shigeto-etal-2013-construction, title = {Construction of {E}nglish {MWE} Dictionary and its Application to {POS} Tagging}, author = {Shigeto, Yutaro and Azuma, Ai and Hisamoto, Sorami and Kondo, Shuhei and Kose, Tomoya and Sakaguchi, Keisuke and Yoshimoto, Akifumi and Yung, Frances and Matsumoto, Yuji}, booktitle = {Proceedings of the 9th Workshop on Multiword Expressions}, month = jun, year = {2013}, address = {Atlanta, Georgia, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-1021}, pages = {139--144} }

2012

COLING
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing

Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi, Yuji Matsumoto

Proceedings of COLING 2012 Dec 2012

Abs Bib PDF Slides

We propose an approach to correcting spelling errors and assigning part-of-speech (POS) tags simultaneously for sentences written by learners of English as a second language (ESL). In ESL writing, there are several types of errors such as preposition, determiner, verb, noun, and spelling errors. Spelling errors often interfere with POS tagging and syntactic parsing, which makes other error detection and correction tasks very difficult. In studies of grammatical error detection and correction in ESL writing, spelling correction has been regarded as a preprocessing step in a pipeline. However, several types of spelling errors in ESL are difficult to correct in the preprocessing, for example, homophones (e.g. *hear/here), confusion (*quiet/quite), split (*now a day/nowadays), merge (*swimingpool/swimming pool), inflection (*please/pleased) and derivation (*badly/bad), where the incorrect word is actually in the vocabulary and grammatical information is needed to disambiguate. In order to correct these spelling errors, and also typical typographical errors (*begginning/beginning), we propose a joint analysis of POS tagging and spelling error correction with a CRF (Conditional Random Field)-based model. We present an approach that achieves significantly better accuracies for both POS tagging and spelling correction, compared to existing approaches using either individual or pipeline analysis. We also show that the joint model can deal with novel types of misspelling in ESL writing.
@inproceedings{sakaguchi-etal-2012-joint, title = {Joint {E}nglish Spelling Error Correction and {POS} Tagging for Language Learners Writing}, author = {Sakaguchi, Keisuke and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of {COLING} 2012}, month = dec, year = {2012}, address = {Mumbai, India}, publisher = {The COLING 2012 Organizing Committee}, url = {https://aclanthology.org/C12-1144}, pages = {2357--2374} }
BEA
NAIST at the HOO 2012 Shared Task

Keisuke Sakaguchi, Yuta Hayashibe, Shuhei Kondo, Lis Kanashiro, Tomoya Mizumoto, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Seventh Workshop on Building Educational Applications Using NLP Jun 2012

Abs Bib PDF Poster

This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the Helping Our Own (HOO) 2012 Shared Task. Our system targets preposition and determiner errors with spelling correction as a pre-processing step. The result shows that spelling correction improves the Detection, Correction, and Recognition F-scores for preposition errors. With regard to preposition error correction, F-scores were not improved when using the training set with correction of all but preposition errors. As for determiner error correction, there was an improvement when the constituent parser was trained with a concatenation of treebank and modified treebank where all the articles appearing as the first word of an NP were removed. Our system ranked third in preposition and fourth in determiner error corrections.
@inproceedings{sakaguchi-etal-2012-naist, title = {{NAIST} at the {HOO} 2012 Shared Task}, author = {Sakaguchi, Keisuke and Hayashibe, Yuta and Kondo, Shuhei and Kanashiro, Lis and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Seventh Workshop on Building Educational Applications Using {NLP}}, month = jun, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W12-2033}, pages = {281--288} }