Keisuke Sakaguchi | 発表論文

2025

AIED
Annotating Errors in English Learners’ Written Language Production: Advancing Automated Written Feedback Systems

Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock, Keisuke Sakaguchi, Kentaro Inui

Artificial Intelligence in Education (AIED) Jul 2025

Best Paper Nomination

Abs Bib PDF Project

Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. We collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the performance of the systems investigated.
@inproceedings{coyne2025feedback, author = {Coyne, Steven and Galvan-Sosa, Diana and Spring, Ryan and Guerraoui, Cam{\'e}lia and Zock, Michael and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Cristea, Alexandra I. and Walker, Erin and Lu, Yu and Santos, Olga C. and Isotani, Seiji}, title = {Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems}, booktitle = {Artificial Intelligence in Education (AIED)}, year = {2025}, publisher = {Springer Nature Switzerland}, pages = {292--306}, isbn = {978-3-031-98459-4}, month = jul }
ICLR
Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches

Itsumi Saito, Haruto Yoshida, Keisuke Sakaguchi

The Thirteenth International Conference on Learning Representations Apr 2025

Abs Bib PDF Project

We address the challenge of automatically generating high-quality vector diagrams from hand-drawn sketches. Vector diagrams are essential for communicating complex ideas across various fields, offering flexibility and scalability. While recent research has progressed in generating diagrams from text descriptions, converting hand-drawn sketches into vector diagrams remains largely unexplored due to the lack of suitable datasets. To address this gap, we introduce SketikZ, a dataset comprising 3,231 pairs of hand-drawn sketches and thier corresponding TikZ codes as well as reference diagrams. Our evaluations reveal the limitations of state-of-the-art vision and language models (VLMs), positioning SketikZ as a key benchmark for future research in sketch-to-diagram conversion. Along with SketikZ, we present ImgTikZ, an image-to-TikZ model that integrates a 6.7B parameter code-specialized open-source large language model (LLM) with a pre-trained vision encoder. Despite its relatively compact size, ImgTikZ performs comparably to GPT-4o. This success is driven by using our two data augmentation techniques and a multi-candidate inference strategy. Our findings open promising directions for future research in sketch-to-diagram conversion and broader image-to-code generation tasks. SketikZ is publicly available.
@inproceedings{saito2025sketchdiagram, title = {Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches}, author = {Saito, Itsumi and Yoshida, Haruto and Sakaguchi, Keisuke}, booktitle = {The Thirteenth International Conference on Learning Representations}, year = {2025}, url = {https://openreview.net/forum?id=KvaDHPhhir}, month = apr, address = {Singapore} }
NAACL Findings
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: NAACL 2025 Apr 2025

Abs Bib PDF

According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model‘s “inner vocabulary”.Prior analysis of this *detokenization* stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior.Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps.Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2.Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects.By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.
@inproceedings{kamoda2025detok, title = {Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference}, author = {Kamoda, Go and Heinzerling, Benjamin and Inaba, Tatsuro and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.findings-naacl.355/}, pages = {6324--6343}, isbn = {979-8-89176-195-7} }
NAACL
Self-Training Meets Consistency: Improving LLMs’ Reasoning with Consistency-Driven Rationale Evaluation

Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Apr 2025

Abs Bib PDF

Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
@inproceedings{lee2025crest, title = {Self-Training Meets Consistency: Improving {LLM}s' Reasoning with Consistency-Driven Rationale Evaluation}, author = {Lee, Jaehyeok and Sakaguchi, Keisuke and Bak, JinYeong}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.528/}, pages = {10519--10539}, isbn = {979-8-89176-189-6} }
NAACL
Language Models can Categorize System Inputs for Performance Analysis

Dominic Sobhani, Ruiqi Zhong, Edison Marrese-Taylor, Keisuke Sakaguchi, Yutaka Matsuo

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Apr 2025

Abs Bib PDF

Language model systems are used to process diverse categories of input requests, ranging from improving creative writing to solving programming challenges. It would be useful to know which categories they are good at. However, existing evaluations compare model performance on pre-defined categories, failing to reflect a system’s performance on finer-grained or novel ones. We propose to automatically search for finer-grained categories based on inputs where a system performs well or poorly, and describe them in natural language. To search for these categories, we propose a large number of candidate category descriptions, e.g. “Communication Improvement”, find the subset of inputs that match the category descriptions, and calculate the performance on these categories; then we sort these categories based on their performance, thereby highlighting those that score high or low. As one application, we apply our method to compare LLaMA 3-70B and Claude 3 Opus, which have similar Elo-ratings on Chatbot Arena; our method finds the former is weaker at making text more professional and humorous while better at providing psychological insights, depicting a more nuanced picture of model performance.
@inproceedings{sobhani2025categorize, title = {Language Models can Categorize System Inputs for Performance Analysis}, author = {Sobhani, Dominic and Zhong, Ruiqi and Marrese-Taylor, Edison and Sakaguchi, Keisuke and Matsuo, Yutaka}, editor = {Chiruzzo, Luis and Ritter, Alan and Wang, Lu}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.317/}, pages = {6241--6257}, isbn = {979-8-89176-189-6} }
Nature MI
Investigating machine moral judgement through the Delphi experiment

Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jack Hessel, Jon Borchardt, Taylor Sorensen, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi

Nature Machine Intelligence 2025

Abs Bib PDF

As our society adopts increasingly powerful artificial intelligence (AI) systems for pervasive use, there are growing concerns about machine morality—or lack thereof. Millions of users already rely on the outputs of AI systems, such as chatbots, as decision aids. Meanwhile, AI researchers continue to grapple with the challenge of aligning these systems with human morality and values. In response to this challenge, we build and test Delphi, an open-source AI system trained to predict the moral judgements of US participants. The computational framework of Delphi is grounded in the framework proposed by the prominent moral philosopher John Rawls. Our results speak to the promises and limits of teaching machines about human morality. Delphi demonstrates improved generalization capabilities over those exhibited by off-the-shelf neural language models. At the same time, Delphi’s failures also underscore important challenges in this arena. For instance, Delphi has limited cultural awareness and is susceptible to pervasive biases. Despite these shortcomings, we demonstrate several compelling use cases of Delphi, including its incorporation as a component within an ensemble of AI systems. Finally, we computationally demonstrate the potential of Rawls’s prospect of hybrid approaches for reliable moral reasoning, inspiring future research in computational morality.
@article{jiang2025delphi, author = {Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny T. and Levine, Sydney and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Hessel, Jack and Borchardt, Jon and Sorensen, Taylor and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin}, date = {2025/01/01}, doi = {10.1038/s42256-024-00969-6}, id = {Jiang2025}, isbn = {2522-5839}, journal = {Nature Machine Intelligence}, number = {1}, pages = {145--160}, title = {Investigating machine moral judgement through the Delphi experiment}, url = {https://doi.org/10.1038/s42256-024-00969-6}, volume = {7}, year = {2025} }
arXiv
FinchGPT: a Transformer based language model for birdsong analysis

Kosei Kobayashi, Kosuke Matsuzaki, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui, Kentaro Abe

arXiv 2025

Abs Bib PDF

The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllables sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT’s attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model’s outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
@article{kobayashi2025finch, title = {FinchGPT: a Transformer based language model for birdsong analysis}, author = {Kobayashi, Kosei and Matsuzaki, Kosuke and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro and Abe, Kentaro}, journal = {arXiv}, year = {2025}, doi = {10.48550/arxiv.2502.00344} }
COLING
Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment

Go Kamoda, Akari Asai, Ana Brassard, Keisuke Sakaguchi

Proceedings of the 31st International Conference on Computational Linguistics Jan 2025

Abs Bib PDF Project

Evaluating the outputs of large language models (LLMs) on long-form generative tasks remains challenging. While fine-grained, aspect-wise evaluations provide valuable diagnostic information, they are difficult to design exhaustively, and each aspect’s contribution to the overall acceptability of an answer is unclear. In this study, we propose a method to compute an overall quality score as a weighted average of three key aspects: factuality, informative- ness, and formality. This approach achieves stronger correlations with human judgments compared to previous metrics. Our analysis identifies factuality as the most predictive aspect of overall quality. Additionally, we release a dataset of 1.2k long-form QA answers annotated with both absolute judgments and relative preferences in overall and aspect-wise schemes to aid future research in evaluation practices.
@inproceedings{kamoda2025lfqa, title = {Quantifying the Influence of Evaluation Aspects on Long-Form Response Assessment}, author = {Kamoda, Go and Asai, Akari and Brassard, Ana and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 31st International Conference on Computational Linguistics}, month = jan, year = {2025}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.coling-main.588/}, pages = {8787--8808} }

NL研

日英翻訳タスクにおける大規模言語モデルのゼロ照応解析性能の影響

野末慎之介, 松林優一郎, 藤井諒, 岸波洋介, 森下睦, 坂口慶祐

2025年 jun月

若手奨励賞(学生)

Bib

@techreport{nozue_etal_nl2025_zero,
  author = {野末, 慎之介 and 松林, 優一郎 and 藤井, 諒 and 岸波, 洋介 and 森下, 睦 and 坂口, 慶祐},
  issue = {6},
  month = jun,
  note = {日英機械翻訳では，日本語文内に発生する主語や目的語等の省略（ゼロ照応）が翻訳性能に影響を及ぼすことが議論されてきた．最新の大規模言語モデル（LLM）のゼロ照応解析の性能には依然として課題が残るものの，その日英翻訳性能は非常に高く，従来議論されてきた内容との間にギャップが生じている．そこで本研究では，GPT-4oを対象として，ゼロ照応に関する理解と日英翻訳性能の関係を詳細に分析することを試みる．具体的には，日本語の文間ゼロ照応事例に焦点を当て，省略されている項をモデルが特定できる場合とできない場合について，それぞれの翻訳性能を調査した．分析では，総合的な翻訳性能に加えて，翻訳後の英文で省略された項が補完されているか，補完に代名詞が利用されているか等の観点で分類を行い，傾向を観察した．分析の結果，GPT-4oはゼロ照応解析が失敗した事例でも自然な日英翻訳を実現できることが分かった．また，ゼロ照応解析が失敗した事例では，成功事例と比較して項が補完されにくい傾向が見られた．項が補完されない事例では，受動態の使用や述語の名詞化などの文構造が取られており，それらの翻訳文は省略された項が正しく推測できる形で記述されていることが確認された．, In Japanese-to-English machine translation, the omission of subjects and objects (zero anaphora) in Japanese has long been discussed as a factor affecting translation quality. Although current large language models (LLMs) still face challenges in zero anaphora resolution, their JA-to-EN translation quality is remarkably high, creating a gap with earlier concerns. This study aims to investigate the relationship between zero anaphora understanding and JA-to-EN translation performance, focusing on GPT-4o. Specifically, we examine inter-sentential zero anaphora in Japanese, analyzing translation performance between cases where the model successfully identifies omitted arguments and where it does not. Our analysis targeted not only overall translation quality, but also some aspects related to zero anaphora, such as whether zero pronouns are supplemented in the translation, whether pronouns are used for supplementation, and whether the supplemented arguments in the translated text can be correctly inferred from the context. The results showed that GPT-4o can produce natural translations even in instances where zero anaphora resolution fails. Furthermore, we observed a tendency that omitted arguments tend to be less frequently supplemented in instances where zero anaphora resolution fails. In unsupplemented cases, the translated sentences often employ syntactic structures such as passive voice or nominalization of predicates. It was confirmed that even in such cases, the translated text still allowed the missing elements to be inferred correctly from the context.},
  title = {日英翻訳タスクにおける大規模言語モデルのゼロ照応解析性能の影響},
  year = {2025}
}

NLP

Sketch2Diagram: 視覚的指示を入力とするダイアグラム生成

斉藤いつみ, 吉田遥音, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

委員特別賞

Bib PDF

@inproceedings{saito_etal_nlp2025_sketch,
  title = {Sketch2Diagram: 視覚的指示を入力とするダイアグラム生成},
  author = {斉藤, いつみ and 吉田, 遥音 and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

線形判別分析の PU 学習による朝日歌壇短歌の分析

加藤真大, 浦川通, 田口雄哉, 新妻巧朗, 田森秀明, 羽根田賢和, 坂口慶祐, 持橋大地

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{kato_etal_nlp2025_pu,
  title = {線形判別分析の PU 学習による朝日歌壇短歌の分析},
  author = {加藤, 真大 and 浦川, 通 and 田口, 雄哉 and 新妻, 巧朗 and 田森, 秀明 and 羽根田, 賢和 and 坂口, 慶祐 and 持橋, 大地},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

定数精度浮動小数点 Transformer Decoder が認識する言語の有限性・余有限性

根岸直生, 谷口雅弥, 坂口慶祐, 乾健太郎

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{negishi_etal_nlp2025_transformer,
  title = {定数精度浮動小数点 Transformer Decoder が認識する言語の有限性・余有限性},
  author = {根岸, 直生 and 谷口, 雅弥 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

LM は日本の時系列構造をどうエンコードするか

佐々木睦史, 鴨田豪, 高橋良允, Benjamin Heinzerling, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

スポンサー賞(日本経済新聞社 CDIO室賞)

Bib PDF

@inproceedings{sasaki_etal_nlp2025_lm,
  title = {LM は日本の時系列構造をどうエンコードするか},
  author = {佐々木, 睦史 and 鴨田, 豪 and 高橋, 良允 and Benjamin, Heinzerling and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

Towards Equitable Translation: Gender Bias in Large Language Models

Hong Hai Ngo, Yunmeng Li, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{hong_etal_nlp2025_towards,
  title = {Towards Equitable Translation: Gender Bias in Large Language Models},
  author = {Hong, Hai Ngo and Yunmeng, Li and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

言語モデルのパラメータから探るDetokenizationメカニズム

鴨田豪, Benjamin Heinzerling, 稲葉達郎, 工藤慧音, 坂口慶祐, 乾健太郎

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{kamoda_etal_nlp2025_detokenization,
  title = {言語モデルのパラメータから探るDetokenizationメカニズム},
  author = {鴨田, 豪 and Benjamin, Heinzerling and 稲葉, 達郎 and 工藤, 慧音 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

言語モデルの内部表現における文法情報の局所性について

佐藤宏亮, 鴨田豪, Benjamin Heinzerling, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{sato_etal_nlp2025_grammar_locality,
  title = {言語モデルの内部表現における文法情報の局所性について},
  author = {佐藤, 宏亮 and 鴨田, 豪 and Benjamin, Heinzerling and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

認知言語学的イメージスキーマの生成と解釈における大規模言語モデルと画像生成モデルの評価

本田恭平, 松﨑孝介, 吉田遥音, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{honda_etal_nlp2025_image_schema,
  title = {認知言語学的イメージスキーマの生成と解釈における大規模言語モデルと画像生成モデルの評価},
  author = {本田, 恭平 and 松﨑, 孝介 and 吉田, 遥音 and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

ASCII CHALLENGE−LLMは画家になれるか−

吉田遥音, 羽根田賢和, 斉藤いつみ, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{yoshida_etal_nlp2025_ascii,
  title = {ASCII CHALLENGE−LLMは画家になれるか−},
  author = {吉田, 遥音 and 羽根田, 賢和 and 斉藤, いつみ and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

大規模言語モデルが持つ抽象推論能力の分析

清野輝風, 青木洋一, 斉藤いつみ, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{seino_etal_nlp2025_abstract_reasoning,
  title = {大規模言語モデルが持つ抽象推論能力の分析},
  author = {清野, 輝風 and 青木, 洋一 and 斉藤, いつみ and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

NLP

ダイアグラム理解に向けた大規模視覚言語モデルの内部表現の分析

吉田遥音, 工藤慧音, 青木洋一, 田中涼太, 斉藤いつみ, 坂口慶祐

言語処理学会第31回年次大会論文集 2025年 3月

Bib PDF

@inproceedings{yoshida_etal_nlp2025_diagram_understanding,
  title = {ダイアグラム理解に向けた大規模視覚言語モデルの内部表現の分析},
  author = {吉田, 遥音 and 工藤, 慧音 and 青木, 洋一 and 田中, 涼太 and 斉藤, いつみ and 坂口, 慶祐},
  year = {2025},
  month = {{3}},
  booktitle = {言語処理学会第31回年次大会論文集}
}

2024

EMNLP
First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024

Abs Bib PDF

"Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process.Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning."
@inproceedings{aoki2024heuristics, title = {First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning}, author = {Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Sone, Shusaku and Taniguchi, Masaya and Sakaguchi, Keisuke and Inui, Kentaro}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.emnlp-main.789/}, doi = {10.18653/v1/2024.emnlp-main.789}, pages = {14255--14271} }
COLM
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

First Conference on Language Modeling 2024

Abs Bib PDF

Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement.
@inproceedings{brassard2024acorn, title = {{ACORN}: Aspect-wise Commonsense Reasoning Explanation Evaluation}, author = {Brassard, Ana and Heinzerling, Benjamin and Kudo, Keito and Sakaguchi, Keisuke and Inui, Kentaro}, booktitle = {First Conference on Language Modeling}, year = {2024}, url = {https://openreview.net/forum?id=2oHnsM9M9D} }
arXiv
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

arXiv Dec 2024

Abs Bib PDF

This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to the model’s internals. Specifically, we inspect when they internally determine their answers, particularly before or after CoT begins, to determine whether models follow a post-hoc "think-to-talk" mode or a step-by-step "talk-to-think" mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models; for example, simple subproblems are solved before CoT begins, and more complicated multi-hop calculations are performed during CoT.
@article{kudo2024think, author = {{Kudo}, Keito and {Aoki}, Yoichi and {Kuribayashi}, Tatsuki and {Sone}, Shusaku and {Taniguchi}, Masaya and {Brassard}, Ana and {Sakaguchi}, Keisuke and {Inui}, Kentaro}, title = {{Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning}}, journal = {arXiv}, year = {2024}, month = dec, doi = {10.48550/arXiv.2412.01113} }
BEA
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond

Masato Mita, Keisuke Sakaguchi, Masato Hagiwara, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024) Jun 2024

Abs Bib PDF Project

Natural language processing (NLP) technology has rapidly improved automated grammatical error correction (GEC) tasks, and the GEC community has begun to explore document-level revision. However, there are two major obstacles to going beyond automated \textitsentence-level GEC to NLP-based document-level revision support: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is infeasible to obtain all possible references and evaluate revision quality using such references because there are infinite revision possibilities. To address these challenges, this paper proposes a new document revision corpus, Text Revision of ACL papers (TETRA), in which multiple professional editors have revised academic papers sampled from the ACL anthology. This corpus enables us to focus on document-level and paragraph-level edits, such as edits related to coherence and consistency. Additionally, as a case study using the TETRA corpus, we investigate reference-less and interpretable methods for meta-evaluation to detect quality improvements according to document revisions. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.
@inproceedings{mita2024tetra, title = {Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond}, author = {Mita, Masato and Sakaguchi, Keisuke and Hagiwara, Masato and Mizumoto, Tomoya and Suzuki, Jun and Inui, Kentaro}, booktitle = {Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, pages = {251--265} }
SigDial
A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying

Shinnosuk Nozue, Yuto Nakano, Shoji Moriya, Tomoki Ariyama, Kazuma Kokuta, Suchun Xie, Kai Sato, Shusaku Sone, Ryohei Kamei, Reina Akama, Yuichiroh Matsubayashi, Keisuke Sakaguchi

Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue Sep 2024

Abs Bib PDF

The evolution of large language models has enabled fluent dialogue, increasing interest in the coexistence of humans and avatars. An essential aspect of achieving this coexistence involves developing sophisticated dialogue systems that can influence user behavior. In this background, we propose an effective multimodal dialogue system designed to promote consensus building with humans. Our system employs a slot-filling strategy to guide discussions and attempts to influence users with suggestions through emotional expression and intent conveyance via its avatar. These innovations have resulted in our system achieving the highest performance in a competition evaluating consensus building between humans and dialogue systems. We hope that our research will promote further discussion on the development of dialogue systems that enhance consensus building in human collaboration.
@inproceedings{nozue2024multimodal, title = {A Multimodal Dialogue System to Lead Consensus Building with Emotion-Displaying}, author = {Nozue, Shinnosuk and Nakano, Yuto and Moriya, Shoji and Ariyama, Tomoki and Kokuta, Kazuma and Xie, Suchun and Sato, Kai and Sone, Shusaku and Kamei, Ryohei and Akama, Reina and Matsubayashi, Yuichiroh and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue}, month = sep, year = {2024}, address = {Kyoto, Japan}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.sigdial-1.57}, doi = {10.18653/v1/2024.sigdial-1.57}, pages = {669--673} }
arXiv
Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki, Jun Suzuki, Keisuke Sakaguchi

arXiv Aug 2024

Abs Bib PDF

We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
@article{kimura2024hijack, author = {{Kimura}, Subaru and {Tanaka}, Ryota and {Miyawaki}, Shumpei and {Suzuki}, Jun and {Sakaguchi}, Keisuke}, title = {{Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection}}, journal = {arXiv}, year = {2024}, month = aug, doi = {10.48550/arXiv.2408.03554} }

arXiv

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

LLM-jp, :, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Akim Mousterou, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino

arXiv e-prints Jul 2024

Abs Bib PDF

This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit this https URL.

@article{llmjp2024,
  author = {{LLM-jp} and {:} and {Aizawa}, Akiko and {Aramaki}, Eiji and {Chen}, Bowen and {Cheng}, Fei and {Deguchi}, Hiroyuki and {Enomoto}, Rintaro and {Fujii}, Kazuki and {Fukumoto}, Kensuke and {Fukushima}, Takuya and {Han}, Namgi and {Harada}, Yuto and {Hashimoto}, Chikara and {Hiraoka}, Tatsuya and {Hisada}, Shohei and {Hosokawa}, Sosuke and {Jie}, Lu and {Kamata}, Keisuke and {Kanazawa}, Teruhito and {Kanezashi}, Hiroki and {Kataoka}, Hiroshi and {Katsumata}, Satoru and {Kawahara}, Daisuke and {Kawano}, Seiya and {Keyaki}, Atsushi and {Kiryu}, Keisuke and {Kiyomaru}, Hirokazu and {Kodama}, Takashi and {Kubo}, Takahiro and {Kuga}, Yohei and {Kumon}, Ryoma and {Kurita}, Shuhei and {Kurohashi}, Sadao and {Li}, Conglong and {Maekawa}, Taiki and {Matsuda}, Hiroshi and {Miyao}, Yusuke and {Mizuki}, Kentaro and {Mizuki}, Sakae and {Murawaki}, Yugo and {Mousterou}, Akim and {Nakamura}, Ryo and {Nakamura}, Taishi and {Nakayama}, Kouta and {Nakazato}, Tomoka and {Niitsuma}, Takuro and {Nishitoba}, Jiro and {Oda}, Yusuke and {Ogawa}, Hayato and {Okamoto}, Takumi and {Okazaki}, Naoaki and {Oseki}, Yohei and {Ozaki}, Shintaro and {Ryu}, Koki and {Rzepka}, Rafal and {Sakaguchi}, Keisuke and {Sasaki}, Shota and {Sekine}, Satoshi and {Suda}, Kohei and {Sugawara}, Saku and {Sugiura}, Issa and {Sugiyama}, Hiroaki and {Suzuki}, Hisami and {Suzuki}, Jun and {Suzumura}, Toyotaro and {Tachibana}, Kensuke and {Takagi}, Yu and {Takami}, Kyosuke and {Takeda}, Koichi and {Takeshita}, Masashi and {Tanaka}, Masahiro and {Taura}, Kenjiro and {Tolmachev}, Arseny and {Ueda}, Nobuhiro and {Wan}, Zhen and {Yada}, Shuntaro and {Yahata}, Sakiko and {Yamamoto}, Yuya and {Yamauchi}, Yusuke and {Yanaka}, Hitomi and {Yokota}, Rio and {Yoshino}, Koichiro},
  title = {{LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs}},
  journal = {arXiv e-prints},
  keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence},
  year = {2024},
  month = jul,
  doi = {10.48550/arXiv.2407.03963}
}

arXiv
The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models

Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

arXiv Jun 2024

Abs Bib PDF

Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments.
@article{takahashi2024edit, author = {{Takahashi}, Ryosuke and {Kamoda}, Go and {Heinzerling}, Benjamin and {Sakaguchi}, Keisuke and {Inui}, Kentaro}, title = {{The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models}}, journal = {arXiv}, year = {2024}, month = jun, doi = {10.48550/arXiv.2406.06032} }
MORPHON
J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema

Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui, Keisuke Sakaguchi

Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology Jun 2024

Abs Bib PDF Project

We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language’s agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] + suru (do-PRS)). Morphologically, this inflection pattern is same as the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We will release J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.
@inproceedings{matsuzaki2024junimorph, title = {{J}-{U}ni{M}orph: {J}apanese Morphological Annotation through the Universal Feature Schema}, author = {Matsuzaki, Kosuke and Taniguchi, Masaya and Inui, Kentaro and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology}, month = jun, year = {2024}, address = {Mexico City, Mexico}, publisher = {Association for Computational Linguistics}, pages = {7--19} }
LREC-COLING
Beam Decoding with Controlled Patience

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, Noah A. Smith

Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation May 2024

Abs Bib PDF Project

Text generation with beam search has proven successful in a wide range of applications. The commonly-used implementation of beam decoding follows a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. We introduce a patience factor, a simple modification to this decoding algorithm, that generalizes the stopping criterion and provides flexibility to the depth of search. Extensive empirical results demonstrate that the patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation.
@inproceedings{kasai2024beam, title = {Beam Decoding with Controlled Patience}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Radev, Dragomir and Choi, Yejin and Smith, Noah A.}, booktitle = {{Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation}}, year = {2024}, month = may }
ICLR
PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning

Faeze Brahman, Chandra Bhagavatula, Valentina Pyatkin, Jena D. Hwang, Xiang Lorraine Li, Hirona Jacqueline Arai, Soumya Sanyal, Keisuke Sakaguchi, Xiang Ren, Yejin Choi

The Twelfth International Conference on Learning Representations May 2024

Abs Bib PDF

Procedural planning, which entails decomposing a high-level goal into a sequence of temporally ordered steps, is an important yet intricate task for machines. It involves integrating common-sense knowledge to reason about complex and often contextualized situations, e.g. “scheduling a doctor’s appointment without a phone”. While current approaches show encouraging results using large language models (LLMs), they are hindered by drawbacks such as costly API calls and reproducibility issues. In this paper, we advocate planning using smaller language models. We present PlaSma, a novel two-pronged approach to endow small language models with procedural knowledge and (constrained) language-based planning capabilities. More concretely, we develop symbolic procedural knowledge distillation to enhance the commonsense knowledge in small language models and aninference-time algorithm to facilitate more structured and accurate reasoning. In addition, we introduce a new related task, Replanning, that requires a revision of a plan to cope with a constrained situation. In both the planning and replanning settings, we show that orders-of-magnitude smaller models (770M-11B parameters) can compete and often surpass their larger teacher models’ capabilities. Finally, we showcase successful application of PlaSma in an embodied environment, VirtualHome.
@inproceedings{brahman2024plasma, title = {PlaSma: Procedural Knowledge Models for Language-based Planning and Re-Planning}, author = {Brahman, Faeze and Bhagavatula, Chandra and Pyatkin, Valentina and Hwang, Jena D. and Li, Xiang Lorraine and Arai, Hirona Jacqueline and Sanyal, Soumya and Sakaguchi, Keisuke and Ren, Xiang and Choi, Yejin}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2024}, url = {https://github.com/allenai/PlaSma}, month = may }

NLP

RLHFを用いた「面白い」短歌の自動生成の試み

羽根田賢和, 浦川通, 田口雄哉, 田森秀明, 坂口慶祐

言語処理学会第30回年次大会論文集 2024年 3月

スポンサー賞（博報堂テクノロジーズ賞）

Bib PDF

@inproceedings{haneda_etal_nlp2024_tanka,
  title = {RLHFを用いた「面白い」短歌の自動生成の試み},
  author = {羽根田, 賢和 and 浦川, 通 and 田口, 雄哉 and 田森, 秀明 and 坂口, 慶祐},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

検出器の判断に基づく大規模言語モデルの生成テキストの特徴分析

三浦東子, 谷口雅弥, 坂口慶祐, 乾健太郎

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF

@inproceedings{miura_etal_nlp2024_detection,
  title = {検出器の判断に基づく大規模言語モデルの生成テキストの特徴分析},
  author = {三浦, 東子 and 谷口, 雅弥 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

言語モデルの思考連鎖的推論における探索戦略の動的変化

青木洋一, 工藤慧音, 曾根周作, 栗林樹生, 谷口雅弥, 坂口慶祐, 乾健太郎

言語処理学会第30回年次大会論文集 2024年 3月

若手奨励賞（学生）

Bib PDF

@inproceedings{aoki_etal_nlp2024_strategy,
  title = {言語モデルの思考連鎖的推論における探索戦略の動的変化},
  author = {青木, 洋一 and 工藤, 慧音 and 曾根, 周作 and 栗林, 樹生 and 谷口, 雅弥 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

言語モデルからの知識削除：頻出実体の知識は副作用が破滅的

高橋良允, 鴨田豪, Benjamin Heinzerling, 坂口慶祐, 乾健太郎

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF

@inproceedings{takahashi_etal_nlp2024_knowledge,
  title = {言語モデルからの知識削除：頻出実体の知識は副作用が破滅的},
  author = {高橋, 良允 and 鴨田, 豪 and Benjamin, Heinzerling and 坂口, 慶祐 and 乾, 健太郎},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

算術推論問題における自己回帰型言語モデルの内部機序

工藤慧音, 青木洋一, 栗林樹生, 谷口雅弥, 曾根周作, 坂口慶祐, 乾健太郎

言語処理学会第30回年次大会論文集 2024年 3月

若手奨励賞（学生）

Bib PDF

@inproceedings{kudo_etal_nlp2024_numerical,
  title = {算術推論問題における自己回帰型言語モデルの内部機序},
  author = {工藤, 慧音 and 青木, 洋一 and 栗林, 樹生 and 谷口, 雅弥 and 曾根, 周作 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

日本の司法試験を題材としたGPTモデルの評価

チェジョンミン, 笠井淳吾, 坂口慶祐

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF Project

@inproceedings{jungmin_etal_nlp2024_barexam,
  title = {日本の司法試験を題材としたGPTモデルの評価},
  author = {チェ, ジョンミン and 笠井, 淳吾 and 坂口, 慶祐},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

自然画像で学習された画像埋め込みにダイアグラムを特徴づける情報は含まれているか？

吉田遥音, 工藤慧音, 青木洋一, 田中涼太, 斉藤いつみ, 坂口慶祐, 乾健太郎

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF

@inproceedings{yoshida_etal_nlp2024_junimorph,
  title = {自然画像で学習された画像埋め込みにダイアグラムを特徴づける情報は含まれているか？},
  author = {吉田, 遥音 and 工藤, 慧音 and 青木, 洋一 and 田中, 涼太 and 斉藤, いつみ and 坂口, 慶祐 and 乾, 健太郎},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

J-UniMorph: 日本語の形態論における意味分類の体系化

松﨑孝介, 谷口雅弥, 乾健太郎, 坂口慶祐

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF

@inproceedings{matsuzaki_etal_nlp2024_junimorph,
  title = {J-UniMorph: 日本語の形態論における意味分類の体系化},
  author = {松﨑, 孝介 and 谷口, 雅弥 and 乾, 健太郎 and 坂口, 慶祐},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

長文生成の多面的評価:人手評価と自動評価の向上を目指して

鴨田豪, 浅井明里, Brassard Ana, 坂口慶祐

言語処理学会第30回年次大会論文集 2024年 3月

優秀賞

Bib PDF

@inproceedings{kamoda_etal_nlp2024_lfqa,
  title = {長文生成の多面的評価:人手評価と自動評価の向上を目指して},
  author = {鴨田, 豪 and 浅井, 明里 and Brassard, Ana and 坂口, 慶祐},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

NLP

大規模言語モデルにおける日本語ゼロ照応解析能力の分析

野末慎之介, 石月由紀子, 松林優一郎, 坂口慶祐

言語処理学会第30回年次大会論文集 2024年 3月

Bib PDF

@inproceedings{nozuee_etal_nlp2024_anaphora,
  title = {大規模言語モデルにおける日本語ゼロ照応解析能力の分析},
  author = {野末, 慎之介 and 石月, 由紀子 and 松林, 優一郎 and 坂口, 慶祐},
  year = {2024},
  month = {{3}},
  booktitle = {言語処理学会第30回年次大会論文集}
}

2023

NeurIPS
RealTime QA: What’s the Answer Right Now?

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, Kentaro Inui

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track Dec 2023

Abs Bib PDF Project

We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open domain QA datasets and pursues, instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that RealTime QA will spur progress in instantaneous applications of question answering and beyond.
@inproceedings{kasai2023realtime, title = {RealTime {QA}: What's the Answer Right Now?}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Bras, Ronan Le and Asai, Akari and Yu, Xinyan Velocity and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro}, booktitle = {Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, year = {2023}, url = {https://openreview.net/forum?id=HfKOIPCvsv}, month = dec }
EMNLP Findings
Test-time Augmentation for Factual Probing

Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: EMNLP 2023 Dec 2023

Abs Bib PDF Project

Factual probing is a method that uses prompts to test if a language model “knows” certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
@inproceedings{kamoda2023tta, title = {Test-time Augmentation for Factual Probing}, author = {Kamoda, Go and Heinzerling, Benjamin and Sakaguchi, Keisuke and Inui, Kentaro}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023}, month = dec, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.236}, doi = {10.18653/v1/2023.findings-emnlp.236}, pages = {3650--3661} }
ACL
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation

Chandra Bhagavatula, Jena D Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2023

Abs Bib PDF

Commonsense capabilities of pre-trained language models dramatically improve with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms?The key intellectual challenge is to design a learning algorithm that achieve a competitive level of commonsense acquisition, without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly.We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model’s own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.
@inproceedings{bhagavatula2023i2d2, title = {I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation}, author = {Bhagavatula, Chandra and Hwang, Jena D and Downey, Doug and Bras, Ronan Le and Lu, Ximing and Sakaguchi, Keisuke and Swayamdipta, Swabha and West, Peter and Choi, Yejin}, year = {2023}, booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, doi = {10.18653/v1/2023.acl-long.535}, pages = {9614--9630} }
ACL
ELQA: A Corpus of Metalinguistic Questions and Answers about English

Shabnam Behzad, Keisuke Sakaguchi, Nathan Schneider, Amir Zeldes

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2023

Abs Bib PDF Project

We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorrect) usage examples. Unlike most NLP datasets, this corpus is metalinguistic—it consists of language about language. As such, it can facilitate investigations of the metalinguistic capabilities of NLU models, as well as educational applications in the language learning domain. To study this, we define a free-form question answering task on our dataset and conduct evaluations on multiple LLMs (Large Language Models) to analyze their capacity to generate metalinguistic answers.
@inproceedings{behzad2023elqa, title = {ELQA: A Corpus of Metalinguistic Questions and Answers about English}, author = {Behzad, Shabnam and Sakaguchi, Keisuke and Schneider, Nathan and Zeldes, Amir}, booktitle = {{Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}}, year = {2023}, month = jul, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, doi = {10.18653/v1/2023.acl-long.113}, pages = {2031--2047} }
arXiv
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

arXiv 2023

Abs Bib PDF

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs’ potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
@article{kasai2023med, author = {{Kasai}, Jungo and {Kasai}, Yuhei and {Sakaguchi}, Keisuke and {Yamada}, Yutaro and {Radev}, Dragomir}, title = {{Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.18027} }
arXiv
An Analysis of GPT-3’s Performance in Grammatical Error Correction

Steven Coyne, Keisuke Sakaguchi

arXiv 2023

Abs Bib PDF

GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of NaturalLanguage Processing tasks. However, there isa relative lack of detailed published analysisof their performance on the task of grammatical error correction (GEC). To address this,we perform experiments testing the capabilitiesof a GPT-3.5 model (text-davinci-003)and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shotand few-shot settings, analyzing intriguing orproblematic outputs encountered with differentprompt formats. We report the performance ofour best prompt on the BEA-2019 and JFLEGdatasets, finding that the GPT models can perform well in a sentence-level revision setting,with GPT-4 achieving a new high score on theJFLEG benchmark. Through human evaluation experiments, we compare the GPT models’corrections to source, human reference, andbaseline GEC system sentences and observedifferences in editing strategies and how theyare scored by human raters.
@article{coyne2023gptgec, author = {{Coyne}, Steven and {Sakaguchi}, Keisuke}, title = {{An Analysis of GPT-3's Performance in Grammatical Error Correction}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.14342} }
arXiv
Causal schema induction for knowledge discovery

Michael Regan, Jena D. Hwang, Keisuke Sakaguchi, James Pustejovsky

arXiv 2023

Abs Bib PDF

Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning, however resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.
@article{regan2023causalschema, author = {{Regan}, Michael and {Hwang}, Jena D. and {Sakaguchi}, Keisuke and {Pustejovsky}, James}, title = {{Causal schema induction for knowledge discovery}}, journal = {arXiv}, year = {2023}, doi = {10.48550/arXiv.2303.15381} }
EACL
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Proceedings of the 2023 Conference of the European Chapter of the Association for Computational Linguistics May 2023

Abs Bib PDF

Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.
@inproceedings{Kudo2023eacl, title = {Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?}, author = {Kudo, Keito and Aoki, Yoichi and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro}, year = {2023}, booktitle = {Proceedings of the 2023 Conference of the {E}uropean Chapter of the Association for Computational Linguistics}, month = may, publisher = {Association for Computational Linguistics}, address = {Dubrovnik, Croatia}, pages = {1351--1362}, doi = {10.18653/v1/2023.eacl-main.98} }
EACL Findings
Empirical Investigation of Neural Symbolic Reasoning Strategies

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Findings of the Association for Computational Linguistics: EACL 2023 May 2023

Best Paper Award at AACL-IJCNLP 2022 Student Research Workshop

Abs Bib PDF

Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear.Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning.Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer.Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation.Our results indicate the importance of further exploring effective strategies for neural reasoning models.
@inproceedings{Aoki2023eacl, title = {Empirical Investigation of Neural Symbolic Reasoning Strategies}, author = {Aoki, Yoichi and Kudo, Keito and Kuribayashi, Tatsuki and Brassard, Ana and Yoshikawa, Masashi and Sakaguchi, Keisuke and Inui, Kentaro}, year = {2023}, booktitle = {Findings of the Association for Computational Linguistics: EACL 2023 }, doi = {10.18653/v1/2023.findings-eacl.86}, pages = {1154--1162}, month = may, address = {Dubrovnik, Croatia}, publisher = {Association for Computational Linguistics} }

SLUD
Hagi bot: LLMを用いた対話状態追跡と人間らしい振る舞いで自然な議論を行うマルチモーダル対話システム

中野雄斗, 野末慎之介, 穀田一真, 有山知希, 佐藤魁, 曾根周作, 亀井遼平, 謝素春, 成田風香, 守屋彰二, 赤間怜奈, 松林優一郎, 坂口慶祐

2023年

対話システムライブコンペティション最優秀賞

Abs Bib

本稿では，対話システムライブコンペティション6に提出したシステムについて述べる．本システムは，応答生成機構とアバター制御機構を組み合わせたタスク指向型マルチモーダル対話システムである．応答生成機構では，対話履歴と話し合うべき議題を考慮しつつ，GPT-4を用いて発話内容と感情・動作ラベルを生成する．具体的には，スロットフィリングにより対話状態を監視し，状況に応じてプロンプトを変更し続けることで，自然な対話展開を実現している．アバター制御機構では，応答生成機構で生成された発話内容や感情・動作ラベルに応じた音声・表情・姿勢を，ラッセルの感情円環モデルなどを参考に著者らが事前に設計したルールに基づいて制御することで，人間らしい自然な振る舞いを実現する．これら2つの機構を組み合わせることで，対話の状況に応じた議論の展開と発話内容や感情に基づいた自然な話し方を実現した．本システムは予選一位で通過した．
@techreport{nakano_etal_slud2023_hagi, title = {Hagi bot: LLMを用いた対話状態追跡と人間らしい振る舞いで自然な議論を行うマルチモーダル対話システム}, author = {中野, 雄斗 and 野末, 慎之介 and 穀田, 一真 and 有山, 知希 and 佐藤, 魁 and 曾根, 周作 and 亀井, 遼平 and 謝, 素春 and 成田, 風香 and 守屋, 彰二 and 赤間, 怜奈 and 松林, 優一郎 and 坂口, 慶祐}, journal = {人工知能学会研究会資料言語・音声理解と対話処理研究会}, volume = {99}, pages = {102--107}, year = {2023}, doi = {10.11517/jsaislud.99.0_102} }

JLR

日本語言語資源の構築と利用性の向上 ―JLR2023 ワークショップ

浅原正幸, 河原大輔, 久保隆宏, 坂口慶祐, 柴田知秀, 松田寛

自然言語処理 2023年

Bib

@article{1050296808060981376,
  author = {浅原, 正幸 and 河原, 大輔 and 久保, 隆宏 and 坂口, 慶祐 and 柴田, 知秀 and 松田, 寛},
  title = {日本語言語資源の構築と利用性の向上 ―JLR2023 ワークショップ},
  journal = {自然言語処理},
  issn = {1340-7619},
  publisher = {言語処理学会},
  year = {2023},
  volume = {30},
  number = {2},
  pages = {857-860},
  url = {https://cir.nii.ac.jp/crid/1050296808060981376}
}

YANS

テキストに基づくダイアグラム生成タスクの提案

吉田遥音, 工藤慧音, 青木洋一, 坂口慶祐

NLP若手の会第18回シンポジウム 2023年 8月

奨励賞

Bib

@inproceedings{yoshida_etal_yans2023_diagram,
  title = {テキストに基づくダイアグラム生成タスクの提案},
  author = {吉田, 遥音 and 工藤, 慧音 and 青木, 洋一 and 坂口, 慶祐},
  year = {2023},
  month = {{8}},
  booktitle = {NLP若手の会 第18回シンポジウム}
}

YANS

日本語学習のための形態意味中心の動詞活用

松﨑孝介, 谷口雅弥, 坂口慶祐, 乾健太郎

NLP若手の会第18回シンポジウム 2023年 8月

スポンサー (Helpfeel) 賞

Bib

@inproceedings{matsuzaki_etal_yans2023_unimorph,
  title = {日本語学習のための形態意味中心の動詞活用},
  author = {松﨑, 孝介 and 谷口, 雅弥 and 坂口, 慶祐 and 乾, 健太郎},
  year = {2023},
  month = {{8}},
  booktitle = {NLP若手の会 第18回シンポジウム}
}

FIT
大規模言語モデルにおける暗黙の推論生成能力の評価

根岸直生, 坂口慶祐, 乾健太郎

第22回情報科学技術フォーラム（FIT2023） 2023年 9月

Abs Bib

暗黙の推論とは与えられた論証テキスト中の主張・前提間の隔たりを埋める推論であり，暗黙の推論を行う能力は日常の，あるいはディベートの論証等への理解に必要な能力である．自然言語処理では言語モデルによる暗黙の推論のためのデータセットやモデルがこれまで提案されてきた．本研究では近年の大規模言語モデルの急速な発展をうけ，大規模言語モデルによる暗黙の推論の妥当性についてクラウドソーシングを用いて評価を行った．実験の結果，人手によるアノテーションと同等かそれを上回る評価を全体で得たが，主張と前提に明確で形式的な因果関係が無い場合に妥当な推論を行えない傾向が見られた．暗黙の推論に関するベンチマークは設計が非常に難しく，しかし人間と同等の推論能力を持つ人工知能の実現に向けて極めて重要である．今回の評価に関するデータをコミュニティに提供することで，後続の研究にて新たな発見や知見を得られることを期待する．
@inproceedings{negishi_etal_fit2023_implicit, title = {大規模言語モデルにおける暗黙の推論生成能力の評価}, author = {根岸, 直生 and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{9}}, booktitle = {第22回情報科学技術フォーラム（FIT2023）} }
NLP
Developing a Typology for Language Learning Feedback

Coyne Steven, Galvan-Sosa Diana, 坂口慶祐, 乾健太郎

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

Writing is an important part of language learning. Withthe recent release of corpora containing feedback on learnerwriting, it has become easier for NLP researchers to examine this process and work towards such tasks as automaticfeedback comment generation. However, analysis and generation are hindered by a lack of a typology for such comments, and it is costly to determine frequency distributionsor generation error rates for different kinds of comments.In this paper, we discuss typologies from both NLP and educational research, and propose a system to combine themto create an annotation scheme for feedback comments.
@inproceedings{coyne_etal_nlp2023_typology, title = {Developing a Typology for Language Learning Feedback}, author = {Coyne, Steven and Galvan-Sosa, Diana and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {2552--2557} }
NLP
因果的プロンプトによる NLI の敵対的ロバスト性の強化

Kavumba Pride, Brassard Ana, Heinzerling Benjamin, 坂口慶祐, 乾健太郎

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

因果的プロンプトは，label because explanationというテンプレートを用いることで，与えられた入力に特定のラベルを割り当てるだけでなく，このラベルをサポートする説明を生成することができる．この種のプロンプトはもともとモデルの解釈可能性を向上させる目的で導入されたが，本論文では，因果的プロンプトが自然言語推論ベンチマークにおける敵対的摂動に対して，頑健性を向上させる効果があることを示す．
@inproceedings{kavumba_etal_nlp2023_causal, title = {因果的プロンプトによる NLI の敵対的ロバスト性の強化}, author = {Kavumba, Pride and Brassard, Ana and Heinzerling, Benjamin and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {2211--2216} }
NLP
算術問題におけるニューラルモデルの構成的推論能力

工藤慧音, 青木洋一, Brassard Ana, 栗林樹生, 吉川将司, 坂口慶祐, 乾健太郎

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

未知の問題への汎化を達成する上で，問題の構成要素を捉え既知の知識を組み合わせる能力が重要となる．しかし，近年の言語モデルがどの程度構成性を捉えられているかは記号推論の文脈ではまだ明らかでない．そこで本研究は，多段算術記号推論データセットを用いた実験を行い，構成的推論能力の検証を行なう．具体的には，問題の複雑さを整理したスキルツリーを定義し統制的に分析する．実験の結果，言語モデルは体系性の習得が最も困難であり，単純な数式の組み合わせに対する汎化でさえ難しかった．また追加の分析から，入力系列に書かれない知識へのアクセスが必要な問題で，体系性に対する汎化が特に困難である明らかとなった．
@inproceedings{kudo_etal_nlp2023_skilltree, title = {算術問題におけるニューラルモデルの構成的推論能力}, author = {工藤, 慧音 and 青木, 洋一 and Brassard, Ana and 栗林, 樹生 and 吉川, 将司 and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {2082--2087} }
NLP
Towards grammatically-informed feedback comments

Galvan-Sosa Diana, Coyne Steven, 坂口慶祐, 乾健太郎

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

Current writing assistants are good in error correctionand in helping users to change ungrammatical sentencesinto their correct grammatical form. However, they stillfall short on various dimensions, in particular error justification. While the current systems are useful when themain goal is expression, they are insufficient when the goalis the acquisition of a writing skill. It is clear that findingthe root of an error is key for improvement. The question is how to do this automatically? We present here anapproach that automatically aligns error annotations withgrammatical-category annotations made on grammaticalungrammatical sentence pairs. Our preliminary resultssuggest that such alignments provide a good hint concerning the specific grammar points a user should pay attentionto.
@inproceedings{galvan_sosa__etal_nlp2023_feedback, title = {Towards grammatically-informed feedback comments}, author = {Galvan-Sosa, Diana and Coyne, Steven and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {1339--1343} }
NLP
Test-time Augmentation for Factual Probing

Kamoda Go, Heinzerling Benjamin, Sakaguchi Keisuke, Inui Kentaro

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

Factual probing is a method for checking if a languagemodel “knows” certain world knowledge facts. A problemin factual probing is that small changes to prompts canresult in large output changes. Previous work aimed toalleviate this problem by optimizing prompts via text miningor finetuning. However, such approaches are relationspecificand do not generalize to unseen relations types.Here, we propose to use test-time augmentation (TTA) as arelation-agnostic method for reducing sensitivity to promptvariations by automatically augmenting and ensemblingprompts at test time. Experiments show that, while TTAreduces overconfidence in incorrect generations, accuracyincreases only in few cases. Error analysis reveals the difficultyof producing high-quality prompt variations as themain challenge for TTA.
@inproceedings{kamoda_etal_nlp2023_tta, title = {Test-time Augmentation for Factual Probing}, author = {Kamoda, Go and Heinzerling, Benjamin and Sakaguchi, Keisuke and Inui, Kentaro}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {1350--1355} }
NLP
ニューラル記号推論における推論過程の教示方法

青木洋一, 工藤慧音, Brassard Ana, 栗林樹生, 吉川将司, 坂口慶祐, 乾健太郎

言語処理学会第29回年次大会論文集 2023年 3月

Abs Bib PDF

ニューラルモデルを用いた記号推論能力は推論過程を生成することで向上する．しかし，推論過程の性質や形式がどのような影響を与えるかは分かっていない．本研究では，推論過程について出力戦略と推論戦略という2 つの軸から，ニューラルモデルで形式的な推論を行う際の適切な推論過程の教示方法を探る．我々は多段数値推論問題（A=1，B=3，C=A+3，C?）を用いた統制的実験を行い，各形式の選択が性能に大きな影響を与える事が分かった．加えて，少なくとも本実験設定範囲内では，ニューラルモデルによってほぼ完璧な推論が達成され，適切な戦略の選択が重要である事を示す．
@inproceedings{aoki_etal_nlp2023_strategy, title = {ニューラル記号推論における推論過程の教示方法}, author = {青木, 洋一 and 工藤, 慧音 and Brassard, Ana and 栗林, 樹生 and 吉川, 将司 and 坂口, 慶祐 and 乾, 健太郎}, year = {2023}, month = {{3}}, booktitle = {言語処理学会第29回年次大会論文集}, pages = {1140--1145} }

2022

EMNLP
Twist Decoding: Diverse Generators Guide Each Other

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) Dec 2022

Abs Bib PDF Project

Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.
@inproceedings{kasai2022twist, title = {Twist Decoding: Diverse Generators Guide Each Other}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Peng, Hao and Lu, Ximing and Radev, Dragomir and Choi, Yejin and Smith, Noah A.}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, month = dec, year = {2022}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.emnlp-main.326}, pages = {4909--4923} }
arXiv
Can Machines Learn Morality? The Delphi Experiment

Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi

arXiv 2022

Abs Bib PDF Project

As AI systems become increasingly powerful and pervasive, there are growing concerns about machines’ morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense. Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.
@article{jiang2022delphi, title = {Can Machines Learn Morality? The Delphi Experiment}, author = {Jiang, Liwei and Hwang, Jena D. and Bhagavatula, Chandra and Bras, Ronan Le and Liang, Jenny and Dodge, Jesse and Sakaguchi, Keisuke and Forbes, Maxwell and Borchardt, Jon and Gabriel, Saadia and Tsvetkov, Yulia and Etzioni, Oren and Sap, Maarten and Rini, Regina and Choi, Yejin}, journal = {arXiv}, year = {2022}, volume = {abs/2110.07574}, doi = {10.48550/ARXIV.2110.07574} }
NAACL
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Jul 2022

Abs Bib PDF Project

Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (BILLBOARDs), that simultaneously tracks progress in language generation tasks and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a BILLBOARD accepts both generators and evaluation metrics as competing entries. A BILLBOARD automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlations with human judgments. We release four BILLBOARDs for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
@inproceedings{Kasai2022BidimensionalLG, title = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Bras, Ronan Le and Dunagan, Lavinia and Morrison, Jacob and Fabbri, Alexander R. and Choi, Yejin and Smith, Noah A.}, year = {2022}, booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, month = jul, pages = {3540--3557}, address = {Seattle, United States}, publisher = {Association for Computational Linguistics} }
NAACL
Transparent Human Evaluation for Image Captioning

Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah A. Smith

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Jul 2022

Abs Bib PDF Project

We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
@inproceedings{Kasai2022TransparentHE, title = {Transparent Human Evaluation for Image Captioning}, author = {Kasai, Jungo and Sakaguchi, Keisuke and Dunagan, Lavinia and Morrison, Jacob and Bras, Ronan Le and Choi, Yejin and Smith, Noah A.}, year = {2022}, booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, pages = {3464--3478}, month = jul, address = {Seattle, Washington}, publisher = {Association for Computational Linguistics} }
IMLW@AAAI
Interscript: A dataset for interactive learning of scripts through error feedback

Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, Yiming Yang

The AAAI-22 Workshop on Interactive Machine Learning 2022

Abs Bib PDF Project

How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, INTERSCRIPT, containing user feedback on a deployed model that generates complex everyday tasks. INTERSCRIPT contains 8,466 data points– the input is a possibly erroneous script and a user feedback and the output is a modified script. We posit two use-cases of INTERSCRIPT that might significantly advance the state-of-the-art in interactive.
@inproceedings{Tandon2021InterscriptAD, title = {Interscript: A dataset for interactive learning of scripts through error feedback}, author = {Tandon, Niket and Madaan, Aman and Clark, Peter and Sakaguchi, Keisuke and Yang, Yiming}, year = {2022}, booktitle = {The AAAI-22 Workshop on Interactive Machine Learning} }

NLP
論述リビジョンのためのメタ評価基盤

三田雅人, 坂口慶祐, 萩原正人, 水本智也, 鈴木潤, 乾健太郎

言語処理学会第28回年次大会論文集 2022年 3月

言語処理学会年次大会優秀賞

Abs Bib PDF

論述やエッセイの作文のように文書単位で行うリビジョンには，従来の文単位文法誤り訂正の研究範囲では捉えきれない論述全体の流れや一貫性といった要素が含まれる．また，文書単位のリビジョンは妥当な参照が多岐にわたることから高精度な参照なし評価尺度の実現が大きな課題となる．本研究では，自動論述リビジョンの実現に向けて，高精度な自動評価尺度の開発促進を目的としたメタ評価基盤を提案する．そして，大規模言語モデルを用いたベースライン自動評価尺度を用いた自動評価の現状と実現可能性を示す．
@inproceedings{mita_etal_nlp2022_adr, title = {論述リビジョンのためのメタ評価基盤}, author = {三田, 雅人 and 坂口, 慶祐 and 萩原, 正人 and 水本, 智也 and 鈴木, 潤 and 乾, 健太郎}, year = {2022}, month = {{3}}, booktitle = {言語処理学会第28回年次大会論文集}, pages = {465--470} }

2021

arXiv
Improving Neural Model Performance through Natural Language Feedback on Their Explanations

Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Yiming Yang, Peter Clark, Keisuke Sakaguchi, Eduard H. Hovy

arXiv 2021

Abs Bib PDF

A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIEan interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies as compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 points on accuracy on defeasible reasoning across all three domains.
@article{Madaan2021ImprovingNM, title = {Improving Neural Model Performance through Natural Language Feedback on Their Explanations}, author = {Madaan, Aman and Tandon, Niket and Rajagopal, Dheeraj and Yang, Yiming and Clark, Peter and Sakaguchi, Keisuke and Hovy, Eduard H.}, journal = {arXiv}, year = {2021}, volume = {abs/2104.08765} }
arXiv
GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education

Masato Hagiwara, Joshua Tanner, Keisuke Sakaguchi

arXiv 2021

Abs Bib PDF Project

We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education. The model architecture enables it to learn from a small amount of texts annotated with spans and their labels, which 1) enables easier and more intuitive annotation, 2) supports overlapping spans, and 3) is less prone to error propagation, compared to complex hand-crafted rules defined on constituency/dependency parses. We show that we can bootstrap a grammar profiler model with F1 ≈ 0.6 from only a couple hundred sentences both in English and Chinese, which can be further boosted via learning a multilingual model. With GrammarTagger, we also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features.
@article{Hagiwara2021GrammarTaggerAM, title = {GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education}, author = {Hagiwara, Masato and Tanner, Joshua and Sakaguchi, Keisuke}, journal = {arXiv}, year = {2021}, volume = {abs/2104.03190} }
EMNLP Findings
proScript: Partially Ordered Scripts Generation

Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi

Findings of the Association for Computational Linguistics: EMNLP 2021 Nov 2021

Abs Bib PDF Project

Scripts – prototypical event sequences describing everyday activities – have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collect a large (6.4k) crowdsourced partially ordered scripts (named proScript), that is substantially larger than prior datasets, and develop models that generate scripts by combining language generation and graph structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 on task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.
@inproceedings{sakaguchi-etal-2021-proscript-partially, title = {pro{S}cript: Partially Ordered Scripts Generation}, author = {Sakaguchi, Keisuke and Bhagavatula, Chandra and Le Bras, Ronan and Tandon, Niket and Clark, Peter and Choi, Yejin}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021}, month = nov, year = {2021}, address = {Punta Cana, Dominican Republic}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.findings-emnlp.184}, pages = {2138--2149} }
CACM
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Commun. ACM Aug 2021

Abs Bib PDF

Commonsense reasoning remains a major challenge in AI, and yet, recent progresses on benchmarks may seem to suggest otherwise. In particular, the recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question—whether these models have truly acquired robust commonsense capabilities or they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense.To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WINOGRANDE compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset.Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
@article{10.1145/3474381, author = {Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin}, title = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale}, year = {2021}, issue_date = {September 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {64}, number = {9}, issn = {0001-0782}, url = {https://doi.org/10.1145/3474381}, doi = {10.1145/3474381}, journal = {Commun. ACM}, month = aug, pages = {99--106}, numpages = {8} }
AAAI
COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, Yejin Choi

Proceedings of the AAAI Conference on Artificial Intelligence May 2021

Abs Bib PDF Project

Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge. In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them. With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains 12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.
@article{Hwang2021COMETATOMIC2O, title = {COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, author = {Hwang, Jena D. and Bhagavatula, Chandra and Le Bras, Ronan and Da, Jeff and Sakaguchi, Keisuke and Bosselut, Antoine and Choi, Yejin}, volume = {35}, number = {7}, year = {2021}, month = may, pages = {6384--6392} }

YANS

文法誤り訂正を拡張した新タスクの提案

三田雅人, 萩原正人, 坂口慶祐, 水本智也, 鈴木潤, 乾健太郎

NLP若手の会第14回シンポジウム 2021年 8月

奨励賞

Bib

@conference{mita_etal_yans2020_tetra,
  title = {文法誤り訂正を拡張した新タスクの提案},
  author = {三田, 雅人 and 萩原, 正人 and 坂口, 慶祐 and 水本, 智也 and 鈴木, 潤 and 乾, 健太郎},
  year = {2021},
  month = {{8}},
  booktitle = {NLP若手の会 第14回シンポジウム}
}

NLP

論述リライトタスクの提案と自動評価の実現に向けて

三田雅人, 萩原正人, 坂口慶祐, 水本智也, 鈴木潤, 乾健太郎

言語処理学会第27回年次大会ワークショップ「文章の評価と品質推定」 2021年 3月

Bib

@inproceedings{mita_etal_nlp2021_tetra,
  title = {論述リライトタスクの提案と自動評価の実現に向けて},
  author = {三田, 雅人 and 萩原, 正人 and 坂口, 慶祐 and 水本, 智也 and 鈴木, 潤 and 乾, 健太郎},
  year = {2021},
  month = {{3}},
  booktitle = {言語処理学会第27回年次大会 ワークショップ「文章の評価と品質推定」}
}

2020

EMNLP
A Dataset for Tracking Entities in Open Domain Procedural Text

Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Nov 2020

Abs Bib PDF Project

We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OPENPI, a high-quality (91.5% coverage as judged by humans and completely vetted), and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on BLEU metric, leaving enough room for novel model architectures.
@inproceedings{tandon-etal-2020-dataset, title = {A Dataset for Tracking Entities in Open Domain Procedural Text}, author = {Tandon, Niket and Sakaguchi, Keisuke and Dalvi, Bhavana and Rajagopal, Dheeraj and Clark, Peter and Guerquin, Michal and Richardson, Kyle and Hovy, Eduard}, booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)}, month = nov, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.emnlp-main.520}, doi = {10.18653/v1/2020.emnlp-main.520}, pages = {6408--6417} }
ACL
Uncertain Natural Language Inference

Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, Benjamin Van Durme

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics Jul 2020

Abs Bib PDF Project

We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.
@inproceedings{chen-etal-2020-uncertain, title = {Uncertain Natural Language Inference}, author = {Chen, Tongfei and Jiang, Zhengping and Poliak, Adam and Sakaguchi, Keisuke and Van Durme, Benjamin}, booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, month = jul, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.774}, doi = {10.18653/v1/2020.acl-main.774}, pages = {8772--8779} }
LREC
The Universal Decompositional Semantics Dataset and Decomp Toolkit

Aaron Steven White, Elias Stengel-Eskin, Siddharth Vashishtha, Venkata Subrahmanyan Govindarajan, Dee Ann Reisinger, Tim Vieira, Keisuke Sakaguchi, Sheng Zhang, Francis Ferraro, Rachel Rudinger, Kyle Rawlins, Benjamin Van Durme

Proceedings of the 12th Language Resources and Evaluation Conference May 2020

Abs Bib PDF Project

We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specification—with graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using sophisticated normalization procedures. The Decomp toolkit provides a suite of Python 3 tools for querying UDS graphs using SPARQL. Both UDS1.0 and Decomp0.1 are publicly available at http://decomp.io.
@inproceedings{white-etal-2020-universal, title = {The Universal Decompositional Semantics Dataset and Decomp Toolkit}, author = {White, Aaron Steven and Stengel-Eskin, Elias and Vashishtha, Siddharth and Govindarajan, Venkata Subrahmanyan and Reisinger, Dee Ann and Vieira, Tim and Sakaguchi, Keisuke and Zhang, Sheng and Ferraro, Francis and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin}, booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference}, month = may, year = {2020}, address = {Marseille, France}, publisher = {European Language Resources Association}, url = {https://aclanthology.org/2020.lrec-1.699}, pages = {5698--5707}, language = {English}, isbn = {979-10-95546-34-4} }

ICLR

Abductive Commonsense Reasoning

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, Yejin Choi

International Conference on Learning Representations 2020

Bib PDF Project

@inproceedings{bhagavatula2020abductive,
  title = {Abductive Commonsense Reasoning},
  author = {Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin},
  booktitle = {International Conference on Learning Representations},
  year = {2020},
  url = {https://openreview.net/forum?id=Byg1v1HKDB}
}

AAAI
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Proceedings of the AAAI Conference on Artificial Intelligence Apr 2020

Outstanding Paper Award (Single Best Paper)

MIT Technology Review

Import AI

Forbes

Abs Bib PDF Project

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4 – 79.1%, which are ∼15-35% (absolute) below human performance of 94.0%, depending on the amount of the training data allowed (2% – 100% respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks — WSC (→ 90.1%), DPR (→ 93.1%), COPA(→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
@article{Sakaguchi-etal-2020-winogrande, title = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale}, volume = {34}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/6399}, doi = {10.1609/aaai.v34i05.6399}, number = {05}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, author = {Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin}, year = {2020}, month = apr, pages = {8732--8740} }

2019

EMNLP
WIQA: A dataset for “What if...” reasoning over procedural text

Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, Antoine Bosselut

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Nov 2019

Abs Bib PDF Project

We introduce WIQA, the first large-scale dataset of “What if...” questions over procedural text. WIQA contains a collection of paragraphs, each annotated with multiple influence graphs describing how one change affects another, and a large (40k) collection of “What if...?” multiple-choice questions derived from these. For example, given a paragraph about beach erosion, would stormy weather hasten or decelerate erosion? WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations. We find that state-of-the-art models achieve 73.8% accuracy, well below the human performance of 96.3%. We analyze the challenges, in particular tracking chains of influences, and present the dataset as an open challenge to the community.
@inproceedings{tandon-etal-2019-wiqa, title = {{WIQA}: A dataset for {``}What if...{''} reasoning over procedural text}, author = {Tandon, Niket and Dalvi, Bhavana and Sakaguchi, Keisuke and Clark, Peter and Bosselut, Antoine}, booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)}, month = nov, year = {2019}, address = {Hong Kong, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D19-1629}, doi = {10.18653/v1/D19-1629}, pages = {6076--6085} }

2018

ACL
Efficient Online Scalar Annotation with Bounded Support

Keisuke Sakaguchi, Benjamin Van Durme

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2018

Abs Bib PDF Project Slides

We describe a novel method for efficiently eliciting scalar annotations for dataset construction and system quality estimation by human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores derive from annotator comparison of items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels) proposed here. Our proposal leads to increased correlation with ground truth, at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.
@inproceedings{sakaguchi-van-durme-2018-efficient, title = {Efficient Online Scalar Annotation with Bounded Support}, author = {Sakaguchi, Keisuke and Van Durme, Benjamin}, booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2018}, address = {Melbourne, Australia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P18-1020}, doi = {10.18653/v1/P18-1020}, pages = {208--218} }

2017

IJCNLP
Grammatical Error Correction with Neural Reinforcement Learning

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers) Nov 2017

Abs Bib PDF Project Slides

We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
@inproceedings{sakaguchi-etal-2017-grammatical, title = {Grammatical Error Correction with Neural Reinforcement Learning}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)}, month = nov, year = {2017}, address = {Taipei, Taiwan}, publisher = {Asian Federation of Natural Language Processing}, url = {https://aclanthology.org/I17-2062}, pages = {366--372} }
BEA
GEC into the future: Where are we going and how do we get there?

Keisuke Sakaguchi, Courtney Napoles, Joel Tetreault

Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications Sep 2017

Abs Bib PDF

The field of grammatical error correction (GEC) has made tremendous bounds in the last ten years, but new questions and obstacles are revealing themselves. In this position paper, we discuss the issues that need to be addressed and provide recommendations for the field to continue to make progress, and propose a new shared task. We invite suggestions and critiques from the audience to make the new shared task a community-driven venture.
@inproceedings{sakaguchi-etal-2017-gec, title = {{GEC} into the future: Where are we going and how do we get there?}, author = {Sakaguchi, Keisuke and Napoles, Courtney and Tetreault, Joel}, booktitle = {Proceedings of the 12th Workshop on Innovative Use of {NLP} for Building Educational Applications}, month = sep, year = {2017}, address = {Copenhagen, Denmark}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W17-5019}, doi = {10.18653/v1/W17-5019}, pages = {180--187} }
ACL
Error-repair Dependency Parsing for Ungrammatical Texts

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) Jul 2017

Outstanding Paper Award (1.5% of the submissions)

Abs Bib PDF Project Slides

We propose a new dependency parsing scheme which jointly parses a sentence and repairs grammatical errors by extending the non-directional transition-based formalism of Goldberg and Elhadad (2010) with three additional actions: SUBSTITUTE, DELETE, INSERT. Because these actions may cause an infinite loop in derivation, we also introduce simple constraints that ensure the parser termination. We evaluate our model with respect to dependency accuracy and grammaticality improvements for ungrammatical sentences, demonstrating the robustness and applicability of our scheme.
@inproceedings{sakaguchi-etal-2017-error, title = {Error-repair Dependency Parsing for Ungrammatical Texts}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)}, month = jul, year = {2017}, address = {Vancouver, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P17-2030}, doi = {10.18653/v1/P17-2030}, pages = {189--195} }
EACL
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers Apr 2017

Grammarly blog

Abs Bib PDF Project

We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC.
@inproceedings{napoles-etal-2017-jfleg, title = {{JFLEG}: A Fluency Corpus and Benchmark for Grammatical Error Correction}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel}, booktitle = {Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers}, month = apr, year = {2017}, address = {Valencia, Spain}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/E17-2037}, pages = {229--234} }
AAAI
Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network

Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence 2017

Abs Bib PDF Project Poster

Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise.Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers and character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.
@inproceedings{10.5555/3298023.3298045, author = {Sakaguchi, Keisuke and Duh, Kevin and Post, Matt and Durme, Benjamin Van}, title = {Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network}, year = {2017}, publisher = {AAAI Press}, booktitle = {Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence}, pages = {3281--3287}, numpages = {7}, location = {San Francisco, California, USA}, series = {AAAI'17} }

JNLP
Phrase Structure Annotation and Parsing for Learner English

Keisuke Sakaguchi, Ryo Nagata

自然言語処理 2017年 6月

言語処理学会 2017年度論文賞

Abs Bib PDF

Learner English often contains grammatical errors with structural characteristics such as omissions, insertions, substitutions, and word order errors. These errors are not covered by the existing context-free grammar (CFG) rules. Therefore, it is not at all straightforward how to annotate learner English with phrase structures. Because of this limitation, there has been almost no work on phrase structure annotation for learner corpora despite its importance and usefulness. To address this issue, we propose a phrase structure annotation scheme for learner English, that consists of five principles. We apply the annotation scheme to two different learner corpora and show (i) its effectiveness at consistently annotating learner English with phrase struc- ture (i.e., high inter-annotator agreement); (ii) the structural characteristics (CFG rules) of learner English obtained from the annotated corpora; and (iii) phrase struc- ture parsing performance on learner English for the first time. We also release the annotation guidelines, the annotated data, and the parser model to the public.
@article{sakaguchi-nagata-2017, title = {Phrase Structure Annotation and Parsing for Learner English}, author = {Sakaguchi, Keisuke and Nagata, Ryo}, journal = {自然言語処理}, volume = {24 (3)}, pages = {491--514}, year = {2017}, month = {{6}}, publisher = {言語処理学会} }

ISS

Connecting the Dots Looking Backwards

坂口慶祐

情報・システムソサイエティ誌 2017年

Bib PDF

@article{sakaguchi_2017,
  title = {Connecting the Dots Looking Backwards},
  author = {坂口慶祐},
  journal = {情報・システムソサイエティ誌},
  volume = {21},
  number = {4},
  pages = {23--25},
  year = {2017},
  doi = {10.1587/ieiceissjournal.21.4_23}
}

2016

EMNLP
Universal Decompositional Semantics on Universal Dependencies

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, Benjamin Van Durme

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing Nov 2016

Abs Bib PDF Project

We present a framework for augmenting data sets from the Universal Dependencies project with Universal Decompositional Semantics. Where the Universal Dependencies project aims to provide a syntactic annotation standard that can be used consistently across many languages as well as a collection of corpora that use that standard, our extension has similar aims for semantic annotation. We describe results from annotating the English Universal Dependencies treebank, dealing with word senses, semantic roles, and event properties.
@inproceedings{white-etal-2016-universal, title = {Universal Decompositional Semantics on {U}niversal {D}ependencies}, author = {White, Aaron Steven and Reisinger, Drew and Sakaguchi, Keisuke and Vieira, Tim and Zhang, Sheng and Rudinger, Rachel and Rawlins, Kyle and Van Durme, Benjamin}, booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2016}, address = {Austin, Texas}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D16-1177}, doi = {10.18653/v1/D16-1177}, pages = {1713--1723} }
EMNLP
There’s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction

Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing Nov 2016

Abs Bib PDF Project

Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
@inproceedings{napoles-etal-2016-theres, title = {There{'}s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Tetreault, Joel}, booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2016}, address = {Austin, Texas}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/D16-1228}, doi = {10.18653/v1/D16-1228}, pages = {2109--2115} }
ACL
Phrase Structure Annotation and Parsing for Learner English

Ryo Nagata, Keisuke Sakaguchi

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Aug 2016

Abs Bib PDF

There has been almost no work on phrase structure annotation and parsing specially designed for learner English despite the fact that they are useful for representing the structural characteristics of learner English. To address this problem, in this paper, we first propose a phrase structure annotation scheme for learner English and annotate two different learner corpora using it. Second, we show their usefulness, reporting on (a) inter-annotator agreement rate, (b) characteristic CFG rules in the corpora, and (c) parsing performance on them. In addition, we explore methods to improve phrase structure parsing for learner English (achieving an F -measure of 0.878). Finally, we release the full annotation guidelines, the annotated data, and the improved parser model for learner English to the public.
@inproceedings{nagata-sakaguchi-2016-phrase, title = {Phrase Structure Annotation and Parsing for Learner {E}nglish}, author = {Nagata, Ryo and Sakaguchi, Keisuke}, booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = aug, year = {2016}, address = {Berlin, Germany}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P16-1173}, doi = {10.18653/v1/P16-1173}, pages = {1837--1847} }
TACL
Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality

Keisuke Sakaguchi, Courtney Napoles, Matt Post, Joel Tetreault

Transactions of the Association for Computational Linguistics 2016

Abs Bib PDF Project Slides

The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC’s reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
@article{sakaguchi-etal-2016-reassessing, title = {Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality}, author = {Sakaguchi, Keisuke and Napoles, Courtney and Post, Matt and Tetreault, Joel}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, url = {https://aclanthology.org/Q16-1013}, doi = {10.1162/tacl_a_00091}, pages = {169--182} }
arXiv
GLEU Without Tuning

Courtney Napoles, Keisuke Sakaguchi, Matt Post, Joel R. Tetreault

arXiv 2016

Abs Bib PDF

The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
@article{Napoles2016GLEUWT, title = {GLEU Without Tuning}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel R.}, journal = {arXiv}, year = {2016}, volume = {abs/1605.02592} }

2015

ACL
Ground Truth for Grammatical Error Correction Metrics

Courtney Napoles, Keisuke Sakaguchi, Matt Post, Joel Tetreault

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) Jul 2015

Abs Bib PDF Project

How do we know which grammatical error correction (GEC) system is best? A number of metrics have been proposed over the years, each motivated by weaknesses of previous metrics; however, the metrics themselves have not been compared to an empirical gold standard grounded in human judgments. We conducted the first human evaluation of GEC system outputs, and show that the rankings produced by metrics such as MaxMatch and I-measure do not correlate well with this ground truth. As a step towards better metrics, we also propose GLEU, a simple variant of BLEU, modified to account for both the source and the reference, and show that it hews much more closely to human judgments.
@inproceedings{napoles-etal-2015-ground, title = {Ground Truth for Grammatical Error Correction Metrics}, author = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)}, month = jul, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P15-2097}, doi = {10.3115/v1/P15-2097}, pages = {588--593} }
NAACL
Effective Feature Integration for Automated Short Answer Scoring

Keisuke Sakaguchi, Michael Heilman, Nitin Madnani

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies May 2015

Abs Bib PDF Slides

A major opportunity for NLP to have a realworld impact is in helping educators score student writing, particularly content-based writing (i.e., the task of automated short answer scoring). A major challenge in this enterprise is that scored responses to a particular question (i.e., labeled data) are valuable for modeling but limited in quantity. Additional information from the scoring guidelines for humans, such as exemplars for each score level and descriptions of key concepts, can also be used. Here, we explore methods for integrating scoring guidelines and labeled responses, and we find that stacked generalization (Wolpert, 1992) improves performance, especially for small training sets.
@inproceedings{sakaguchi-etal-2015-effective, title = {Effective Feature Integration for Automated Short Answer Scoring}, author = {Sakaguchi, Keisuke and Heilman, Michael and Madnani, Nitin}, booktitle = {Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies}, month = may, year = {2015}, address = {Denver, Colorado}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/N15-1111}, doi = {10.3115/v1/N15-1111}, pages = {1049--1054} }

2014

WMT
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme

Proceedings of the Ninth Workshop on Statistical Machine Translation Jun 2014

Abs Bib PDF Project Slides

A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentencelevel comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection. We continue this line of work by adapting the TrueSkill TM algorithm — an online approach for modeling the relative skills of players in ongoing competitions, such as Microsoft’s Xbox Live — to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
@inproceedings{sakaguchi-etal-2014-efficient, title = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation}, author = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin}, booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation}, month = jun, year = {2014}, address = {Baltimore, Maryland, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W14-3301}, doi = {10.3115/v1/W14-3301}, pages = {1--11} }

2013

ACL
Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners

Keisuke Sakaguchi, Yuki Arase, Mamoru Komachi

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) Aug 2013

Abs Bib PDF Project Poster

We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.
@inproceedings{sakaguchi-etal-2013-discriminative, title = {Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners}, author = {Sakaguchi, Keisuke and Arase, Yuki and Komachi, Mamoru}, booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)}, month = aug, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P13-2043}, pages = {238--242} }
CoNLL
NAIST at 2013 CoNLL Grammatical Error Correction Shared Task

Ippei Yoshimoto, Tomoya Kose, Kensuke Mitsuzawa, Keisuke Sakaguchi, Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task Aug 2013

Abs Bib PDF

This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subjectverb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for preposition and determiner errors. As for subject-verb agreement errors, we show that the Treelet Language Model-based approach can correct errors in which the target verb is distant from its subject. Our system ranked fourth on the official run.
@inproceedings{yoshimoto-etal-2013-naist, title = {{NAIST} at 2013 {C}o{NLL} Grammatical Error Correction Shared Task}, author = {Yoshimoto, Ippei and Kose, Tomoya and Mitsuzawa, Kensuke and Sakaguchi, Keisuke and Mizumoto, Tomoya and Hayashibe, Yuta and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task}, month = aug, year = {2013}, address = {Sofia, Bulgaria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-3604}, pages = {26--33} }
BEA
NAIST at the NLI 2013 Shared Task

Tomoya Mizumoto, Yuta Hayashibe, Keisuke Sakaguchi, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications Jun 2013

Abs Bib PDF

This paper describes the Nara Institute of Science and Technology (NAIST) native language identification (NLI) system in the NLI 2013 Shared Task. We apply feature selection using a measure based on frequency for the closed track and try Capping and Sampling data methods for the open tracks. Our system ranked ninth in the closed track, third in open track 1 and fourth in open track 2.
@inproceedings{mizumoto-etal-2013-naist, title = {{NAIST} at the {NLI} 2013 Shared Task}, author = {Mizumoto, Tomoya and Hayashibe, Yuta and Sakaguchi, Keisuke and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Eighth Workshop on Innovative Use of {NLP} for Building Educational Applications}, month = jun, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-1717}, pages = {134--139} }
MWE
Construction of English MWE Dictionary and its Application to POS Tagging

Yutaro Shigeto, Ai Azuma, Sorami Hisamoto, Shuhei Kondo, Tomoya Kose, Keisuke Sakaguchi, Akifumi Yoshimoto, Frances Yung, Yuji Matsumoto

Proceedings of the 9th Workshop on Multiword Expressions Jun 2013

Abs Bib PDF

This paper reports our ongoing project for constructing an English multiword expression (MWE) dictionary and NLP tools based on the developed dictionary. We extracted functional MWEs from the English part of Wiktionary, annotated the Penn Treebank (PTB) with MWE information, and conducted POS tagging experiments. We report how the MWE annotation is done on PTB and the results of POS and MWE tagging experiments.
@inproceedings{shigeto-etal-2013-construction, title = {Construction of {E}nglish {MWE} Dictionary and its Application to {POS} Tagging}, author = {Shigeto, Yutaro and Azuma, Ai and Hisamoto, Sorami and Kondo, Shuhei and Kose, Tomoya and Sakaguchi, Keisuke and Yoshimoto, Akifumi and Yung, Frances and Matsumoto, Yuji}, booktitle = {Proceedings of the 9th Workshop on Multiword Expressions}, month = jun, year = {2013}, address = {Atlanta, Georgia, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W13-1021}, pages = {139--144} }

2012

COLING
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing

Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi, Yuji Matsumoto

Proceedings of COLING 2012 Dec 2012

Abs Bib PDF Slides

We propose an approach to correcting spelling errors and assigning part-of-speech (POS) tags simultaneously for sentences written by learners of English as a second language (ESL). In ESL writing, there are several types of errors such as preposition, determiner, verb, noun, and spelling errors. Spelling errors often interfere with POS tagging and syntactic parsing, which makes other error detection and correction tasks very difficult. In studies of grammatical error detection and correction in ESL writing, spelling correction has been regarded as a preprocessing step in a pipeline. However, several types of spelling errors in ESL are difficult to correct in the preprocessing, for example, homophones (e.g. *hear/here), confusion (*quiet/quite), split (*now a day/nowadays), merge (*swimingpool/swimming pool), inflection (*please/pleased) and derivation (*badly/bad), where the incorrect word is actually in the vocabulary and grammatical information is needed to disambiguate. In order to correct these spelling errors, and also typical typographical errors (*begginning/beginning), we propose a joint analysis of POS tagging and spelling error correction with a CRF (Conditional Random Field)-based model. We present an approach that achieves significantly better accuracies for both POS tagging and spelling correction, compared to existing approaches using either individual or pipeline analysis. We also show that the joint model can deal with novel types of misspelling in ESL writing.
@inproceedings{sakaguchi-etal-2012-joint, title = {Joint {E}nglish Spelling Error Correction and {POS} Tagging for Language Learners Writing}, author = {Sakaguchi, Keisuke and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of {COLING} 2012}, month = dec, year = {2012}, address = {Mumbai, India}, publisher = {The COLING 2012 Organizing Committee}, url = {https://aclanthology.org/C12-1144}, pages = {2357--2374} }
BEA
NAIST at the HOO 2012 Shared Task

Keisuke Sakaguchi, Yuta Hayashibe, Shuhei Kondo, Lis Kanashiro, Tomoya Mizumoto, Mamoru Komachi, Yuji Matsumoto

Proceedings of the Seventh Workshop on Building Educational Applications Using NLP Jun 2012

Abs Bib PDF Poster

This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the Helping Our Own (HOO) 2012 Shared Task. Our system targets preposition and determiner errors with spelling correction as a pre-processing step. The result shows that spelling correction improves the Detection, Correction, and Recognition F-scores for preposition errors. With regard to preposition error correction, F-scores were not improved when using the training set with correction of all but preposition errors. As for determiner error correction, there was an improvement when the constituent parser was trained with a concatenation of treebank and modified treebank where all the articles appearing as the first word of an NP were removed. Our system ranked third in preposition and fourth in determiner error corrections.
@inproceedings{sakaguchi-etal-2012-naist, title = {{NAIST} at the {HOO} 2012 Shared Task}, author = {Sakaguchi, Keisuke and Hayashibe, Yuta and Kondo, Shuhei and Kanashiro, Lis and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji}, booktitle = {Proceedings of the Seventh Workshop on Building Educational Applications Using {NLP}}, month = jun, year = {2012}, address = {Montr{\'e}al, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/W12-2033}, pages = {281--288} }

NL研
英作文誤り訂正における複数の手法の利用に関する考察

水本智也, 林部祐太, 坂口慶祐, 小町守, 松本裕治

情報処理学会第208回自然言語処理研究会 2012年 9月

Abs Bib PDF

英語学習者の書く作文には様々な種類の文法誤りが含まれている.文法誤りの中にはヒューリスティックスを用いたルールで訂正できるものもあれば，長距離の依存関係や選択選好を考慮した統計的なモデルを用いないと訂正が難しいものもある.英語学習者の文法誤りの自動訂正に取り組んだ先行研究では，訂正する誤りの種類を数種類に限定して取り組んできたが，全ての種類の誤りを対象にした場合，1 つのモデルだけで十分に対処できるかは未だ分かっていない.そこで，本稿では全ての誤りを対象に，誤りの種類に応じて異なる誤り訂正システムを構築して誤り訂正を行ない，誤り訂正システムの適用の仕方による誤り訂正の結果について分析・考察を行なう.
@inproceedings{mizumoto_etal_nl_2012, title = {英作文誤り訂正における複数の手法の利用に関する考察}, author = {水本, 智也 and 林部, 祐太 and 坂口, 慶祐 and 小町, 守 and 松本, 裕治}, year = {2012}, month = {{9}}, booktitle = {情報処理学会第208回自然言語処理研究会}, pages = {1--7} }

CORE

Contextual Spelling Correction for Preposition and Determiner Error Correction

Keisuke Sakaguchi, Yuta Hayashibe, Shuhei Kondo, Lis Kanashiro, Tomoya Mizumoto, Mamoru Komachi, Yuji Matsumoto

Microsoft Research Asia, Japan CORE Project Workshop 2012年 6月

Bib

@article{sakaguchi-EtAl:2012:msra,
  author = {Sakaguchi, Keisuke and Hayashibe, Yuta and Kondo, Shuhei and Kanashiro, Lis and Mizumoto, Tomoya and Komachi, Mamoru and Matsumoto, Yuji},
  title = {Contextual Spelling Correction for Preposition and Determiner Error Correction},
  journal = {Microsoft Research Asia, Japan CORE Project Workshop},
  month = {{6}},
  year = {2012},
  address = {Tokyo, Japan}
}

NL研
英語スペリング訂正と品詞タグ付けの結合学習

坂口慶祐, 水本智也, 小町守, 松本裕治

情報処理学会第206回自然言語処理研究会 2012年 5月

Abs Bib PDF

近年，外国語学習者が書く作文に対する文法の自動誤り訂正が注目を集めているが，学習者作文の多くは文法的な誤りだけでなくスペリング誤りを多く含んでいる．その結果，学習者作文に対する品詞タグ付けや構文解析の精度が悪化し，誤りの訂正を阻害する大きな要因になっている．またスペリング誤り訂正と品詞タグ付けは従来独立したタスクとして扱われており，スペリング誤り訂正の結果が後続の品詞タグ付けや構文解析に影響する点が指摘されてきたが，近年ではこれまで直列に解析・処理されてきたタスクを統合し，解析の情報を互いに補完しながら同時に処理する結合学習が盛んになっている．そこで本論文では英語学習者コーパスに対してスペリング誤りと品詞タグ付けの結合学習を行いその効果を検討する．実験の結果，結合学習を用いた同時解析の方がそれぞれの解析を単独で行う場合，そしてそれらをパイプラインで処理する場合に比べて解析精度が高くなることを示す．
@inproceedings{sakaguchi_etal_nl_2012, title = {英語スペリング訂正と品詞タグ付けの結合学習}, author = {坂口, 慶祐 and 水本, 智也 and 小町, 守 and 松本, 裕治}, year = {2012}, month = {{5}}, booktitle = {情報処理学会第206回自然言語処理研究会}, pages = {1--7} }

2011

NLP

オークション検索クリックスルーログからの属性値抽出

水本智也, 坂口慶祐, 小町守, 内海慶, 河野洋志, 前澤敏之, 佐藤敏紀

言語処理学会第18回年次大会論文集 2011年 3月

Bib PDF

@inproceedings{mizumoto_etal_nlp_2011,
  title = {オークション検索クリックスルーログからの属性値抽出},
  author = {水本, 智也 and 坂口, 慶祐 and 小町, 守 and 内海, 慶 and 河野, 洋志 and 前澤, 敏之 and 佐藤, 敏紀},
  year = {2011},
  month = {{3}},
  booktitle = {言語処理学会第18回年次大会論文集},
  pages = {1023--1026}
}