Details of a Researcher


LEPAGE, Yves

Scopus Paper Info

Paper Count: 120 Citation Count: 582 h-index: 12

The data was downloaded from the Scopus API on July 31, 2026, via http://api.elsevier.com and http://www.scopus.com.

Google Scholar Information (Citations per year)

Citation Count: 2090 h-index: 23 i10-index: 47



Affiliation

Faculty of Science and Engineering, Graduate School of Information, Production, and Systems

Job title

Professor

Homepage URL

http://lepage-lab.ips.waseda.ac.jp/

Education Background

- 1985  Université de Grenoble, France, Mathematics and computer science school, computer science - natural language processing

- 1983  Ecole des Mines de Saint-Etienne, France, Graduate School, Division of Engineering, computer science

Committee Memberships

2008 - 2016  Reviewing committee of the Traitement automatique des langues (TAL) Journal, Editor-in-chief

Professional Memberships

Information Processing Society of Japan

ATALA French Natural Language Processing Association

Japanese Natural Language Processing Association

Reviewing Committee of the Traitement Automatique des Langues (TAL) Journal

Research Areas

Linguistics / Intelligent informatics

Research Interests

machine translation, analogy, multilingual alignment, multilingual large language models, foreign language aids

Awards

Waseda University Teaching Award (Spring semester 2016)

2016 Waseda University


Lecture in Natural Language Processing

Papers

Generalizing Analogical Inference from Boolean to Continuous Domains

Francisco Cunha, Yves Lepage, Miguel Couceiro, Zied Bouraoui

CoRR abs/2511.10416 2025.11


Analogical reasoning is a powerful inductive mechanism, widely used in human cognition and increasingly applied in artificial intelligence. Formal frameworks for analogical inference have been developed for Boolean domains, where inference is provably sound for affine functions and approximately correct for functions close to affine. These results have informed the design of analogy-based classifiers. However, they do not extend to regression tasks or continuous domains. In this paper, we revisit analogical inference from a foundational perspective. We first present a counterexample showing that existing generalization bounds fail even in the Boolean setting. We then introduce a unified framework for analogical reasoning in real-valued domains based on parameterized analogies defined via generalized means. This model subsumes both Boolean classification and regression, and supports analogical inference over continuous functions. We characterize the class of analogy-preserving functions in this setting and derive both worst-case and average-case error bounds under smoothness assumptions. Our results offer a general theory of analogical inference across discrete and continuous domains.

Mixup Helps Translation, But Do the Coefficients and the Selection Strategy Influence Translation Quality?

Yifei Zhou, Yves Lepage

ACM Transactions on Asian and Low-Resource Language Information Processing 24 ( 10 ) 108 - 20 2025.10

Dual-Perspective Evaluation of Knowledge Graphs for Graph-to-Text Generation

Haotong Wang, Liyan Wang, Yves Lepage

CMC-COMPUTERS MATERIALS & CONTINUA 84 ( 1 ) 305 - 324 2025

Citations (Scopus): 1

Label-Guided Scientific Abstract Generation with a Siamese Network Using Knowledge Graphs

Haotong Wang, Yves Lepage

CMC-COMPUTERS MATERIALS & CONTINUA 83 ( 3 ) 4141 - 4166 2025

Citations (Scopus): 1

Q&A-LF : A French Question-Answering Benchmark for Measuring Fine-Grained Lexical Knowledge.

Alexander Petrov, Alessandra Thais Mancas, Viviane Binet, Antoine Venant, François Lareau, Yves Lepage, Philippe Langlais

RANLP 962 - 969 2025
AnaScore: Understanding Semantic Parallelism in Proportional Analogies.

Liyan Wang, Haotong Wang, Yves Lepage

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies 1175 - 1188 2025

L'analogie numérique revisitée et étendue, précisée, rétrécie ou agrandie.

Yves Lepage, Miguel Couceiro

19èmes Journées d'Intelligence Artificielle Fondamentale et 20èmes Journées Francophones sur la Planification(JIAF-JFPDA) 133 - 142 2025
ALF: A Fine-Grained French Analogical Dataset for Evaluating Lexical Knowledge of Large Language Models.

Alexander Petrov, Antoine Venant, François Lareau, Yves Lepage, Philippe Langlais

ECAI 4346 - 4353 2025

Generative Resolution of Proportional Analogies between Sentences.

Liyan Wang, Zhicheng Pan, Haotong Wang, Yves Lepage

Vietnam Journal of Computer Science 12 ( 4 ) 449 - 468 2025

Eliciting analogical reasoning from language models in retrieval-augmented translation under low-resource scenarios.

Liyan Wang, Bartholomäus Wloka, Yves Lepage

Neurocomputing 630 129680 - 129680 2025

Extraction-Augmented Generation of Scientific Abstracts Using Knowledge Graphs.

Haotong Wang, Yves Lepage

IEEE Access 13 48775 - 48791 2025

Citations (Scopus): 3

Any four real numbers are on all fours with analogy

Yves Lepage, Miguel Couceiro

CoRR abs/2407.18770 2024.07

This work presents a formalization of analogy on numbers that relies on generalized means. It is motivated by recent advances in artificial intelligence and applications of machine learning, where the notion of analogy is used to infer results, create data and even as an assessment tool of object representations, or embeddings, that are basically collections of numbers (vectors, matrices, tensors). This extended use of analogy calls for mathematical foundations and a clear understanding of the notion of analogy between numbers. We propose a unifying view of analogies that relies on generalized means defined in terms of a power parameter. In particular, we show that any four increasing positive real numbers form an analogy in a unique suitable power. In addition, we show that any such analogy can be reduced to an equivalent arithmetic analogy and that any analogical equation has a solution for increasing numbers, which generalizes without restriction to complex numbers. These foundational results provide a better understanding of analogies in areas where representations are numerical.
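
The generalized-mean view described in this summary can be illustrated numerically. The following Python sketch is not taken from the paper: it assumes that a : b :: c : d holds in power p when the power-p mean of the extremes (a, d) equals the power-p mean of the middle terms (b, c), and it searches for that power by bisection. The names `balance` and `analogy_power`, the search bound and the number of steps are hypothetical choices made for illustration only.

```python
import math


def balance(p, a, b, c, d):
    """Quantity with the same sign as m_p(a, d) - m_p(b, c).

    m_p is the generalized (power) mean. The value at p = 0 is the
    limit case, which corresponds to the geometric mean.
    """
    la, lb, lc, ld = (math.log(x) for x in (a, b, c, d))
    if p == 0.0:
        return la + ld - lb - lc
    # expm1 avoids cancellation near p = 0; exp avoids it for large |p|.
    # The "-1" terms of expm1 cancel out, so both branches agree.
    f = math.expm1 if abs(p) < 1.0 else math.exp
    return (f(p * la) + f(p * ld) - f(p * lb) - f(p * lc)) / p


def analogy_power(a, b, c, d, bound=64.0, steps=200):
    """Bisection search for the power p in which a : b :: c : d holds.

    Assumes 0 < a < b < c < d and that p lies in [-bound, bound].
    Very large inputs may overflow math.exp; this is only a sketch.
    """
    if not (0 < a < b < c < d):
        raise ValueError("expected 0 < a < b < c < d")
    lo, hi = -bound, bound
    if balance(lo, a, b, c, d) > 0 or balance(hi, a, b, c, d) < 0:
        raise ValueError("power lies outside the search interval")
    for _ in range(steps):
        mid = (lo + hi) / 2
        if balance(mid, a, b, c, d) < 0:
            lo = mid  # extremes' mean still too small: increase p
        else:
            hi = mid
    return (lo + hi) / 2
```

Under these assumptions, 1, 2, 3, 4 should come out close to power 1 (arithmetic analogy, 1 + 4 = 2 + 3), 1, 2, 4, 8 close to power 0 (geometric analogy, 1 × 8 = 2 × 4), and 2, 6, 7, 9 close to power 2 (4 + 81 = 36 + 49).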

Leveraging Knowledge from Translation Memory for Globally and Locally Guiding Neural Machine Translation.

Ruibo Hou, Hengjie Liu, Yves Lepage

PACLIC 9 - 19 2024
A study of universal morphological analysis using morpheme-based, holistic, and neural approaches under various data size conditions

Rashel Fam, Yves Lepage

Annals of Mathematics and Artificial Intelligence 2024


We perform a study on the universal morphological analysis task: given a word form, generate the lemma (lemmatisation) and its corresponding morphosyntactic descriptions (MSD analysis). Experiments are carried out on the SIGMORPHON 2018 Shared Task: Morphological Reinflection Task dataset which consists of more than 100 different languages with various morphological richness under three different data size conditions: low, medium and high. We consider three main approaches: morpheme-based (eager learning), holistic (lazy learning), and neural (eager learning). Performance is evaluated on the two subtasks of lemmatisation and MSD analysis. For the lemmatisation subtask, under all three data sizes, experimental results show that the holistic approach predicted more accurate lemmata, while the morpheme-based approach produced lemmata closer to the answers when it produces the wrong answers. For the MSD analysis subtask, under all three data sizes, the holistic approach achieves higher recall, while the morpheme-based approach is more precise. However, the trade-off between precision and recall of the two systems leads to a very similar overall F1 score. On the whole, neural approaches suffer under low resource conditions, but they achieve the best performance in comparison to the other approaches when the size of the training data increases.

High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering.

Hengjie Liu, Ruibo Hou, Yves Lepage

CoRR abs/2408.12079 2024

Analogie et moyenne généralisée.

Yves Lepage, Miguel Couceiro

JIAF-JFPDA 114 - 124 2024
Continued Pre-training on Sentence Analogies for Translation with Small Data.

Liyan Wang, Haotong Wang, Yves Lepage

LREC/COLING 3890 - 3896 2024

This paper introduces Continued Pre-training on Analogies (CPoA) to equip pre-trained language models with analogical abilities, aiming at improving performance in low-resource translation without data augmentation. We continue training the models on sentence analogies retrieved from a translation corpus. Considering the sparsity of analogy in corpora, especially in low-resource scenarios, we propose exploring approximate analogies between sentences. We attempt to find sentence analogies that might not conform to formal criteria for entire sentences but do so for partial pieces. When training the models, we introduce a weighting scalar pertaining to the quality of analogies to adjust the influence: emphasizing closer analogies while diminishing the impact of far ones. We evaluate our approach on a low-resource translation task: German-Upper Sorbian. The results show that CPoA using 10 times fewer instances can effectively attain gains of +1.4 and +1.3 BLEU points over the original model in two translation directions. This improvement is more pronounced when there are fewer parallel examples.
Organising lexica into analogical grids: a study of a holistic approach for morphological generation under various sizes of data in various languages.

Rashel Fam, Yves Lepage

J. Exp. Theor. Artif. Intell. 36 ( 1 ) 1 - 26 2024.01

Example-Based Machine Translation with a Multi-Sentence Construction Transformer Architecture

Haozhe Xiao, Yifei Zhou, Yves Lepage

Proceedings of the 2023 CLASP Conference on Learning with Small Data, LSD 72 - 80 2023
Learning from masked analogies between sentences at multiple levels of formality

Liyan Wang, Yves Lepage

Annals of Mathematics and Artificial Intelligence 2023


This paper explores the inference of sentence analogies not restricted to the formal level. We introduce MaskPrompt, a prompt-based method that addresses the analogy task as masked analogy completion. This enables us to fine-tune, in a lightweight manner, pre-trained language models on the task of reconstructing masked spans in analogy prompts. We apply constraints which are approximations of the parallelogram view of analogy to construct a corpus of sentence analogies from textual entailment sentence pairs. In the constructed corpus, sentence analogies are characterized by their level of being formal, ranging from strict to loose. We apply MaskPrompt on this corpus and compare MaskPrompt with the basic fine-tuning paradigm. Our experiments show that MaskPrompt outperforms basic fine-tuning in solving analogies in terms of overall performance, with gains of over 2% in accuracy. Furthermore, we study the contribution of loose analogies, i.e., analogies relaxed on the formal aspect. When fine-tuning with a small number of them (several hundreds), the accuracy on strict analogies jumps from 82% to 99%. This demonstrates that loose analogies effectively capture implicit but coherent analogical regularities. We also use MaskPrompt with different schemes on masked content to optimize analogy solutions. The best masking scheme during fine-tuning is to mask any term: it exhibits the highest robustness in accuracy on all tested equivalent forms of analogies.

Citations (Scopus): 2

A Dual Reinforcement Method for Data Augmentation using Middle Sentences for Machine Translation

Wenyi Tang, Yves Lepage

MT Summit 2023 - Proceedings of 19th Machine Translation Summit 1 48 - 58 2023


This paper presents an approach to enhance the quality of machine translation by leveraging middle sentences as pivot points and employing dual reinforcement learning. Conventional methods for generating parallel sentence pairs for machine translation rely on parallel corpora, which may be scarce, resulting in limitations in translation quality. In contrast, our proposed method entails training two machine translation models in opposite directions, utilizing the middle sentence as a bridge for a virtuous feedback loop between the two models. This feedback loop resembles reinforcement learning, facilitating the models to make informed decisions based on mutual feedback. Experimental results substantiate that our proposed method significantly improves machine translation quality.
A Framework for Neural Machine Translation by Fuzzy Analogies.

Liyan Wang, Bartholomäus Wloka, Yves Lepage

IARML@IJCAI 3492 47 - 55 2023


This paper introduces a novel translation technique, driven by modeling fuzzy analogies that capture approximate conformity to parallel transformations between fragments in sentences. We conduct preliminary experiments on English-Japanese translations with a data set of limited size. The results show the potential of using fuzzy analogies for translation, achieving an increase of about 6 BLEU points compared to NMT.
Formulae for the solution of an analogical equation between Booleans using the Sheffer stroke (NAND) or the Pierce arrow (NOR).

Yves Lepage

IARML@IJCAI 3492 3 - 14 2023


This paper gives a formula for the solution of an analogical equation between Booleans using the Sheffer stroke (NAND). Naturally, a counterpart using the Pierce arrow (NOR) is also given. Although not so intuitive, these formulae are somewhat elegant. The formulae are obtained in the following way: a rapid review on analogies between sets is given. The result on sets is transposed to Booleans. This result is rewritten using solely the operators mentioned above and simplified.
Improving Sentence Embedding With Sentence Relationships From Word Analogies.

Qixuan Zhang, Yves Lepage

ICCBR Workshops 3438 43 - 53 2023

In this study, we introduce a novel approach to enhance sentence embedding by leveraging word analogy. Compared with past methods that use word analogy on sentence-level tasks, our method is less affected by sentence patterns and pays more attention to semantic relations. By fine-tuning pre-trained models such as BERT, RoBERTa and Sentence-BERT and evaluating their performance on inter-sentence downstream tasks, we demonstrate the efficiency of our method. Our experimental results show that each model, following fine-tuning using our approach, exhibits improvements across all inter-sentence tasks. In the STS task, our method increases the average result from 18.63% to 62.52% on BERT. This outcome substantiates that sentence relationships derived from word analogy contain valuable knowledge that can enhance the performance of sentence embedding models.
Embedding-To-Embedding Method Based on Autoencoder for Solving Sentence Analogies.

Weihao Mao, Yves Lepage

ICCBR Workshops 3438 15 - 26 2023


We propose a method for solving sentence analogies using an embedding-to-embedding method. The method involves the pretraining of an autoencoder with a denoising decoder that generates sentence embeddings and reconstructs sentences. To generate solutions to analogical equations in the sentence embedding space, we introduce a network architecture that learns analogy properties from the dataset instead of relying on predefined formulas. The embeddings of the solutions are then decoded back into sentences using the decoder of the pretrained autoencoder. We conduct experiments on a set of semantico-formal analogies and purely-formal analogies datasets in English, French, and German. The results show that our method achieves state-of-the-art performance in most cases and to some extent provides evidence of the limitations of the 3CosAdd formula in handling longer sentences.
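
For readers unfamiliar with the 3CosAdd formula that this summary contrasts with learned analogy solving, a minimal Python sketch follows. It uses the commonly stated form of 3CosAdd, in which the answer to a : b :: c : x is the candidate x that maximizes cos(x, b) - cos(x, a) + cos(x, c). The function names and the tiny three-dimensional vectors are hypothetical and chosen only for illustration; they are not data or code from the paper, which works with sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def three_cos_add(a, b, c, candidates):
    """Return the candidate name maximizing cos(x, b) - cos(x, a) + cos(x, c)."""
    def score(name):
        x = candidates[name]
        return cosine(x, b) - cosine(x, a) + cosine(x, c)
    return max(candidates, key=score)


# Hypothetical toy embeddings with axes (male, female, royal).
man, woman, king = [1, 0, 0], [0, 1, 0], [1, 0, 1]
candidates = {
    "queen": [0, 1, 1],
    "boy": [1, 0, 0.1],
    "crown": [0, 0, 1],
}
answer = three_cos_add(man, woman, king, candidates)  # man : woman :: king : ?
```

With these toy vectors, `answer` is `"queen"`. A fixed formula of this kind presupposes that the analogy relation is a simple offset in the embedding space, which is the assumption the summary reports as limiting for longer sentences.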
Resolution of Analogies Between Strings in the Case of Multiple Solutions.

Xulin Deng, Yves Lepage

ICCBR Workshops 3438 3 - 14 2023

The verification and resolution of formal analogies between strings focuses on the character sequences, disregarding the underlying semantics of the sequences. Our approach to these two tasks employs an algorithm based on edit distance. A previous version was limited in that it provided only a single solution for an analogy equation, even when multiple valid solutions existed. We enhance the algorithm to generate all possible solutions. The previous algorithm traversed edit distance matrices only once. Consequently, it could only yield one solution for an analogy puzzle, even in cases where multiple solutions were viable. In order to deliver all possible solutions for analogies, we introduce a recursive approach. By recursively exploring all traces in the edit distance matrices, our newer version is capable of generating and outputting all feasible solutions.
Analogies Between Short Sentences: A Semantico-Formal Approach

Yves Lepage

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 13212 LNAI 163 - 179 2022


The present article proposes a method to solve analogies between sentences by combining existing techniques to solve formal analogies between strings and semantic analogies between words. The method is applied on sentences from the Tatoeba corpus. Two datasets of more than five thousand semantico-formal analogies, in English and French, are released.

Citations (Scopus): 1

Large-scale AMR Corpus with Re-generated Sentences: Domain Adaptive Pre-training on ACL Anthology Corpus

Ming Zhao, Yaling Wang, Yves Lepage

Proceedings - ICACSIS 2022: 14th International Conference on Advanced Computer Science and Information Systems 19 - 24 2022

Abstract Meaning Representation (AMR) is a broad-coverage formalism for capturing the semantics of a given sentence. However, domain adaptation of AMR is limited by the shortage of annotated AMR graphs. In this paper, we explore and build a new large-scale dataset with 2.3 million AMRs in the domain of academic writing. Additionally, we show that 30% of them are of similar quality to the annotated data in the downstream AMR-to-text task. Our results outperform previous graph-based approaches by over 11 BLEU points. We provide a pipeline that integrates automated generation and evaluation. This can help explore other AMR benchmarks.

Citations (Scopus): 2

Introducing EM-FT for Manipuri-English Neural Machine Translation

Rudali Huidrom, Yves Lepage

6th Workshop on Indian Language Data: Resources and Evaluation, WILDRE 2022 - held in conjunction with the International Conference on Language Resources and Evaluation, LREC 2022 - Proceedings 1 - 6 2022

This paper introduces pretrained word embeddings for Manipuri, a low-resourced Indian language. The pretrained word embeddings based on fastText are capable of handling the highly agglutinative language Manipuri (mni). We then perform machine translation (MT) experiments using neural network (NN) models. In this paper, we confirm the following observations. Firstly, the reported BLEU score of the Transformer architecture with the fastText word embedding model EM-FT is better than without it in all the NMT experiments. Secondly, we observe that adding more training data from a domain different from that of the test data negatively impacts translation accuracy. The resources reported in this paper are made available in the ELRA catalogue to help the low-resourced languages community with MT/NLP tasks.
Analogies: from Theory to Applications

Miguel Couceiro, Esteban Marquer, Pierre Alexandre Murena, Pierre Monnin, Adrien Coulet, Jean Lieber, Henri Prade, Mehdi Kaytoue, Mathieu D'Aquin, Christophe Cerisara, Claire Gardent, Gilles Richard, Laurent Miclet, Steven Schockaert, Yves Lepage, Myriam Bounhas, Sebastien Destercke, Claudia D'Amato

CEUR Workshop Proceedings 3389 3 2022
Langues par défaut? Analyse contrastive et diachronique des langues non citées dans les articles de TALN et d'ACL (Contrastive and diachronic study of unmentioned (by default ?) languages in TALN and ACL).

Fanny Ducel, Karën Fort, Gaël Lejeune, Yves Lepage

Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale(TALN-RECITAL) 1 144 - 153 2022

We study the application of the #BenderRule in natural language processing articles, taking into account a contrastive and a diachronic dimension, by examining the proceedings of two NLP conferences, TALN and ACL, over time. A sample of articles was annotated manually and two classifiers were developed to automatically annotate the remaining articles. This allows us to quantify the extent to which the #BenderRule is applied and to show a slight advantage in favor of TALN.
A Study of Re-generating Sentences Given Similar Sentences that Cover Them on the Level of Form and Meaning.

Hsuan-Wei Lo, Yifei Zhou, Rashel Fam, Yves Lepage

Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation(PACLIC) 369 - 378 2022
Can the Translation Memory Principle Benefit Neural Machine Translation? A Series of Extensive Experiments with Input Sentence Annotation.

Yaling Wang, Yves Lepage

Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation(PACLIC) 243 - 252 2022
A corpus of drafts of NLP papers from non-native English speakers.

Haotong Wang, Liyan Wang, Yves Lepage

NLPIR 125 - 129 2022


We created an English parallel corpus of 3,005 sentence pairs, each containing a well-polished text from ACL Anthology Reference Corpus (ACL-ARC) [1] and corresponding restated drafts collected from 26 non-native writers. The purpose of this paper is to explore the writing features of the drafts from non-native English speakers, so as to benefit research in Academic Writing Aid Systems. We present a feature analysis of the corpus based on handcrafted features. To assess utility, we formulate a draft identification task to automatically recognize drafts from ground truth texts based on hybrid features. We show that the combination of deep semantic features with the optimal handcrafted features improves identification accuracy on the collected data, up to 84.57%.

Do we Name the Languages we Study? The #BenderRule in LREC and ACL articles.

Fanny Ducel, Karën Fort, Gaël Lejeune, Yves Lepage

Proceedings of the Thirteenth Language Resources and Evaluation Conference(LREC) 564 - 573 2022


This article studies the application of the #BenderRule in Natural Language Processing (NLP) articles according to two dimensions. Firstly, in a contrastive manner, by considering two major international conferences, LREC and ACL, and secondly, in a diachronic manner, by inspecting nearly 14,000 articles over a period of time ranging from 2000 to 2020 for LREC and from 1979 to 2020 for ACL. For this purpose, we created a corpus from LREC and ACL articles from the above-mentioned periods, from which we manually annotated nearly 1,000. We then developed two classifiers to automatically annotate the rest of the corpus. We show that LREC articles tend to respect the #BenderRule (80 to 90% of them respect it), whereas only around half of ACL articles do. Interestingly, over the considered periods, the results appear to be stable for the two conferences, even though a rebound in ACL 2020 could be a sign of the influence of the blog post about the #BenderRule.
Masked Prompt Learning for Formal Analogies beyond Words.

Liyan Wang, Yves Lepage

Proceedings of the Workshop on the Interactions between Analogical Reasoning and Machine Learning (International Joint Conference on Artificial Intelligence - European Conference on Artificial Intelligence (IJCAI-ECAI 2022))(IARML@IJCAI) 3174 1 - 14 2022

Prompt learning, a recent thread in few-shot learning for pre-trained language models (PLMs), has been explored for completing word analogies in an extractive way. In this paper, we reformulate the analogy task as a masked analogy completion task with the use of prompting to derive a generative model for analogies beyond words. We introduce a simple prompt-based fine-tuning paradigm for language modeling on answered prompts of analogies in the sequence-to-sequence framework. To convert discrete terms of analogies into linear sequences, we present a symbolic prompt template. The sequence-to-sequence model is fine-tuned to fill in the missing span of masked prompts deduced from different masking schemes on phrase analogies extracted from a small corpus. We analyze the out-of-distribution performance on sentence analogies which are unseen cases. Our experiments demonstrate that prompt-based fine-tuning with the objective of language modeling enables models to achieve significantly better performance on in-distribution cases than PLMs. Masked prompt learning with one-term masking exhibits the best out-of-distribution generalization on sentence analogies, with a difference of only 3 characters from references.
WAPITI - Web-based Assignment Preparation and Instruction Tool for Interpreters.

Bartholomäus Wloka, Yves Lepage, Werner Winiwarter

Information Integration and Web Intelligence - 24th International Conference(iiWAS) 13635 LNCS 295 - 306 2022


This paper proposes a framework to ease the workload of preparation for interpreters through quick and efficient discovery of relevant material. We describe a software architecture and present arguments why this combination of components and functionalities will result in an ideal assignment preparation tool for interpreters, which they currently are lacking. We draw from the rich professional experience from interpretation experts and teaching staff gathered through interviews and feedback over an extended period of time. We use this experience to add functionalities to enrich, share, and store data; anywhere, be it at home, or on a mobile device while on the go. This results in a multimodal, flexible, easy to use, mobile-ready Web application. The framework allows for incremental extension, export and import of the data collection, keeping in mind accessibility, mobility, interoperability, reusability, and sustainability.

Citations (Scopus): 1

Extraction of analogies between sentences on the level of syntax using parse trees.

Yifei Zhou, Rashel Fam, Yves Lepage

Workshop Proceedings of the 30th International Conference on Case-Based Reasoning co-located with the 30th International Conference on Case-Based Reasoning (ICCBR 2022) 3389 30 - 42 2022


Example-based machine translation by analogy is an alternative approach to machine translation. Its principle is relatively simple, but the absolute number of analogies between sentences contained in the corpus is crucial for the overall quality of translation. The relative number of analogies is called the analogical density. The goal of this paper is to measure the analogical density of different aligned corpora. To this end, we extract analogies between sentences. Now, we use parse trees to represent sentences on the level of syntax. We report analogical densities for five different languages in an aligned multilingual corpus extracted from the Tatoeba resource, at the level of characters, words or parse trees.
Sentence Analogies for Text Morphing.

Zhicheng Pan, Xinbo Zhao, Yves Lepage

Workshop Proceedings of the 30th International Conference on Case-Based Reasoning co-located with the 30th International Conference on Case-Based Reasoning (ICCBR 2022) 3389 4 - 13 2022


Text morphing is a Natural Language Processing (NLP) task which aims at generating sequences of fluent and smooth intermediate sentences between two input sentences, the start and end sentences. In this paper, we show how to use sentence analogies to augment data for this task. We rely on the notion of analogy to produce sequences of sentences exhibiting step-by-step transitions. We use these sequences to fine-tune a large-scale pre-trained language model that is used for text generation. The performance is evaluated by two criteria: fluency and transition smoothness on both the semantic and formal levels. Compared to a variational autoencoder generative model, our model is shown to generate smoother transitions, although the generated sentences are slightly less fluent.
ABCD: Analogy-Based Controllable Data Augmentation.

Shuo Yang, Yves Lepage

Theory and Practice of Natural Computing - 10th International Conference(TPNC) 69 - 81 2021

Covering a sentence in form and meaning with fewer retrieved sentences.

Yuan Liu, Yves Lepage

Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation(PACLIC) 513 - 522 2021
EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English.

Rudali Huidrom, Yves Lepage, Khogendra Khomdram

Proceedings of the 14th Workshop on Building and Using Comparable Corpora(BUCC@RANLP) 60 - 67 2021
A Study of Analogical Density in Various Corpora at Various Granularity.

Rashel Fam, Yves Lepage

Information(Inf.) 12 ( 8 ) 314 - 314 2021

Citations (Scopus): 3

Vector-to-Sequence Models for Sentence Analogies

Liyan Wang, Yves Lepage

ICACSIS 2020: 2020 12th International Conference on Advanced Computer Science and Information Systems (ICACSIS) 441 - 446 2020

Réseaux de neurones pour la résolution d'analogies entre phrases en traduction automatique par l'exemple (Neural networks for the resolution of analogies between sentences in EBMT ).

Valentin Taillandier, Liyan Wang, Yves Lepage

Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition) 108 - 121 2020
Typicality of Lexical Bundles in Different Sections of Scientific Articles.

Haotong Wang, Yves Lepage, Chooi-Ling Goh

SSPS 2020: 2020 2nd Symposium on Signal Processing Systems(SSPS) 56 - 60 2020

Citations (Scopus): 1

The French Correction: When Retrieval Is Harder to Specify than Adaptation.

Yves Lepage, Jean Lieber, Isabelle Mornard, Emmanuel Nauer, Julien Romary, Reynault Sies

Case-Based Reasoning Research and Development - 28th International Conference(ICCBR) 309 - 324 2020

Citations (Scopus): 4

Neural Morphological Segmentation Model for Mongolian.

Weihua Wang, Rashel Fam, Feilong Bao, Yves Lepage, Guanglai Gao

International Joint Conference on Neural Networks(IJCNN) 1 - 7 2019

Citations (Scopus): 6

An Approach to Case-Based Reasoning Based on Local Enrichment of the Case Base.

Yves Lepage, Jean Lieber

Case-Based Reasoning Research and Development - 27th International Conference(ICCBR) 235 - 250 2019

Improving automatic Chinese-Japanese patent translation using bilingual term extraction

Wei Yang, Yves Lepage

IEEJ Transactions on Electrical and Electronic Engineering 13 ( 1 ) 117 - 125 2018.01 [Refereed]


The identification of terms in scientific and patent documents is a crucial issue for applications like information retrieval, text categorization, and also for machine translation. This paper describes a method to improve Chinese-Japanese statistical machine translation of patents by re-tokenizing the training corpus with aligned bilingual multi-word terms. We automatically extract multi-word terms from monolingual corpora by combining statistical and linguistic filtering methods. An automatic alignment method is used to identify corresponding terms. The most promising bilingual multi-word terms are extracted by setting some threshold on translation probabilities and further filtering by considering the components of the bilingual multi-word terms in characters as well as the ratio of their lengths in words. We also use kanji (Japanese)-hanzi (Chinese) character conversion to confirm and extract more promising bilingual multi-word terms. We obtain a high quality of correspondence with 93% in bilingual term extraction and a significant improvement of 1.5 BLEU score in a translation experiment. (c) 2017 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.

DOI

Scopus

2

Citation

(Scopus)
Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

Wei Yang, Hanfei Shen, Yves Lepage

Journal of Information Processing 25 ( 0 ) 88 - 99 2017

　View Summary

Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.

DOI CiNii

Scopus

3

Citation

(Scopus)
A method of generating translations of unseen n-grams by using proportional analogy

Juan Luo, Yves Lepage

IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING 11 ( 3 ) 325 - 330 2016.05 [Refereed]

　View Summary

In recent years, statistical machine translation has gained much attention. The phrase-based statistical machine translation model has made significant advancement in translation quality over the word-based model. In this paper, we attempt to apply the technique of proportional analogy to statistical machine translation systems. We propose a novel approach to apply proportional analogy to generate translations of unseen n-grams from the phrase table for phrase-based statistical machine translation. Experiments are conducted with two datasets of different sizes. We also investigate two methods to integrate n-gram translations produced by proportional analogy into the state-of-the-art statistical machine translation system, Moses. The experimental results show that unseen n-gram translations generated using the technique of proportional analogy are rewarding for statistical machine translation systems with small datasets. (c) 2016 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.

DOI

Scopus

1

Citation

(Scopus)
Morphological predictability of unseen words using computational analogy

Fam, Rashel, Lepage, Yves

CEUR Workshop Proceedings 1815 51 - 60 2016.01

　View Summary

Copyright © 2016 for this paper by its authors. We address the problem of predicting unseen words by relying on the organization of the vocabulary of a language as exhibited by paradigm tables. We present a pipeline to automatically produce paradigm tables from all the words contained in a text. We measure how many unseen words from an unseen test text can be predicted using the paradigm tables obtained from a training text. Experiments are carried out in several languages to compare the morphological richness of languages, and also the richness of the vocabulary of different authors.
Solving analogical equations between strings of symbols using neural networks

Kaveeta, Vivatchai, Lepage, Yves

CEUR Workshop Proceedings 1815 67 - 76 2016.01

　View Summary

Copyright © 2016 for this paper by its authors. A neural network model to solve analogical equations between strings of symbols is proposed. The method transforms the input strings into two fixed size alignment matrices. The matrices act as the input of the neural network which predicts two output matrices. Finally, a string decoder transforms the predicted matrices into the final string output. By design, the neural network is constrained by several properties of analogy. The experimental results show a fast learning rate with a high prediction accuracy that can beat a baseline algorithm.
HSSA tree structures for BTG-based preordering in machine translation

Zhang, Yujia, Wang, Hao, Lepage, Yves

Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, PACLIC 2016 123 - 132 2016.01

　View Summary

The Hierarchical Sub-Sentential Alignment (HSSA) method is a method to obtain aligned binary tree structures for two aligned sentences in translation correspondence. We propose to use the binary aligned tree structures delivered by this method as training data for preordering prior to machine translation. For that, we learn a Bracketing Transduction Grammar (BTG) from these binary aligned tree structures. In two oracle experiments in English to Japanese and Japanese to English translation, we show that it is theoretically possible to outperform a baseline system with a default distortion limit of 6 by about 2.5 and 5 BLEU points and by 7 and 10 RIBES points, respectively, when preordering the source sentences using the learnt preordering model and using a distortion limit of 0. An attempt at learning a preordering model and its results are also reported.
Yet another symmetrical & real-time word alignment method: Hierarchical sub-sentential alignment using F-measure

Wang, Hao, Lepage, Yves

Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, PACLIC 2016 143 - 152 2016.01

　View Summary

Symmetrization of word alignments is the fundamental issue in statistical machine translation (SMT). In this paper, we describe a novel reformulation of the Hierarchical Sub-sentential Alignment (HSSA) method using F-measure. Starting with a soft alignment matrix, we use the F-measure to recursively split the matrix into two soft alignment submatrices. A direction is chosen at the same time on the basis of Inversion Transduction Grammar (ITG). In other words, our method simplifies the processing of word alignment as recursive segmentation in a bipartite graph, which is simple and easy to implement. It can be considered as an alternative to the grow-diag-final-and heuristic. We show its application on phrase-based SMT systems combined with the state-of-the-art approaches. In addition, by feeding it with word-to-word associations, it can also be a real-time word aligner. Our experiments show that, given a reliable lexicon translation table, this simple method can yield comparable results with state-of-the-art approaches.
Extraction of Potentially Useful Phrase Pairs for Statistical Machine Translation

Juan Luo, Yves Lepage

Journal of Information Processing 23 ( 3 ) 344 - 352 2015

　View Summary

Over the last decade, an increasing amount of work has been done to advance the phrase-based statistical machine translation model in which the method of extracting phrase pairs consists of word alignment and phrase extraction. In this paper, we show that, for Japanese-English and Chinese-English statistical machine translation systems, this method is indeed missing potentially useful phrase pairs which could lead to better translation scores. These potentially useful phrase pairs can be detected by looking at the segmentation traces after decoding. We choose to see the problem of extracting potentially useful phrase pairs as a two-class classification problem: among all the possible phrase pairs, distinguish the useful ones from the not-useful ones. As for any classification problem, the question is to discover the relevant features which contribute the most. Extracting potentially useful phrase pairs resulted in a statistically significant improvement of 7.65 BLEU points in English-Chinese and 7.61 BLEU points in Chinese-English experiments. A slight increase of 0.94 BLEU points and 0.4 BLEU points is also observed for the English-Japanese and Japanese-English systems, respectively.

DOI CiNii

Scopus

2

Citation

(Scopus)
Translation of unseen bigrams by analogy using an SVM classifier

Wang, Hao, Lyu, Lu, Lepage, Yves

29th Pacific Asia Conference on Language, Information and Computation, PACLIC 2015 16 - 25 2015.01

　View Summary

Detecting language divergences and predicting possible sub-translations is one of the most essential issues in machine translation. Because of translation divergences, it is impractical to translate straightforwardly from a source sentence into a target sentence while keeping a high degree of accuracy without additional information. In this paper, we investigate the problem from an emerging and special point of view: bigrams and the corresponding translations. We first profile corpora and explore the constituents of bigrams in the source language. Then we translate unseen bigrams based on proportional analogy and filter the outputs using a Support Vector Machine (SVM) classifier. The experiment results also show that even a small set of features from analogous can provide meaningful information in translating by analogy.
Chinese word segmentation based on analogy and majority voting

Zheng, Zongrong, Wang, Yi, Lepage, Yves

29th Pacific Asia Conference on Language, Information and Computation, PACLIC 2015 151 - 156 2015.01

　View Summary

This paper proposes a new method of Chinese word segmentation based on proportional analogy and majority voting. First, we introduce an analogy-based method for solving the word segmentation problem. Second, we show how to use majority voting to make the decision on where to segment. The preliminary results show that this approach compares well with other segmenters reported in previous studies. As an important and original feature, our method does not need any pretraining or lexical knowledge.
Analogies Between Binary Images: Application to Chinese Characters

Yves Lepage

COMPUTATIONAL APPROACHES TO ANALOGICAL REASONING: CURRENT TRENDS 548 25 - 57 2014 [Refereed]

　View Summary

The purpose of this chapter is to show how it is possible to efficiently extract the structure of a set of objects by use of the notion of proportional analogy. As a proportional analogy involves four objects, the very naive approach to the problem basically has a complexity of O(n^4) for a given set of n objects. We show, under some conditions on proportional analogy, how to reduce this complexity to O(n^2) by considering an equivalent problem, that of enumerating analogical clusters that are informative and not redundant. We further show how some improvements make the task tractable. We illustrate our technique with a task related to natural language processing, that of clustering Chinese characters. In this way, we re-discover the graphical structure of these characters.

DOI

Scopus

10

Citation

(Scopus)
Inflating a Training Corpus for SMT by Using Unrelated Unaligned Monolingual Data

Wei Yang, Yves Lepage

Advances in Natural Language Processing 8686 236 - 248 2014 [Refereed]

　View Summary

To improve the translation quality of less resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is to make those language pairs well resourced. But aligned data is not always easy to collect. In contrast, monolingual data are usually easier to access. In this paper we show how to leverage unrelated unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese where we use 70,000 sentences of unrelated unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points.
Improved Chinese-Japanese Phrase-based MT Quality Using an Extended Quasi-parallel Corpus

Hao Wang, Wei Yang, Yves Lepage

PROCEEDINGS OF 2014 IEEE INTERNATIONAL CONFERENCE ON PROGRESS IN INFORMATICS AND COMPUTING (PIC) 6 - 10 2014 [Refereed]

　View Summary

State-of-the-art phrase-based machine translation (MT) systems usually demand large parallel corpora in the step of training. The quality and the quantity of the training data exert a direct influence on the performance of such translation systems. The lack of open-source bilingual corpora for a particular language pair results in lower translation scores reported for such a language pair. This is the case of Chinese-Japanese. In this paper, we propose to build an extension of an initial parallel corpus in the form of quasi-parallel sentences, instead of adding new parallel sentences. The extension of the initial corpus is obtained by using monolingual analogical associations. Our experiments show that the use of such quasi-parallel corpora improves the performance of Chinese-Japanese translation systems.

DOI

Scopus

1

Citation

(Scopus)
Improving the Distribution of N-Grams in Phrase Tables Obtained by the Sampling-Based Method

Juan Luo, Adrien Lardilleux, Yves Lepage

HUMAN LANGUAGE TECHNOLOGY CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS 8387 419 - 431 2014 [Refereed]

　View Summary

We describe an approach to improve the performance of sampling-based sub-sentential alignment method on translation tasks by investigating the distribution of n-grams in the phrase tables. This approach consists in enforcing the alignment of n-grams. We compare the quality of phrase translation tables output by this approach and that of the state-of-the-art estimation approach in statistical machine translation tasks. We report significant improvements for this approach and show that merging phrase tables outperforms the state-of-the-art techniques.

DOI

Scopus
Marker-Based Chunking in Eleven European Languages for Analogy-Based Translation

Kota Takeya, Yves Lepage

HUMAN LANGUAGE TECHNOLOGY CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS 8387 432 - 444 2014 [Refereed]

　View Summary

An example-based machine translation (EBMT) system based on proportional analogies requires numerous proportional analogies between linguistic units to work properly. Consequently, long sentences cannot be handled directly in such a framework. Cutting sentences into chunks would be a solution. Using different markers, we count the number of proportional analogies between chunks in 11 European languages. As expected, the number of proportional analogies between chunks found is very high. These results, and preliminary experiments in translation, are promising for the EBMT system that we intend to build.

DOI

Scopus
Analogy-based machine translation using secability

Tatsuya Kimura, Jin Matsuoka, Yusuke Nishikawa, Yves Lepage

2014 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE (CSCI), VOL 2 2 297 - 298 2014 [Refereed]

　View Summary

The problem of reordering remains the main problem in machine translation. Computing structures of sentences and the alignment of substructures is a way that has been proposed to solve this problem. We use secability to compute structures and show its effectiveness in example-based machine translation.

DOI

Scopus

1

Citation

(Scopus)
Generalizing sampling-based multilingual alignment

Adrien Lardilleux, François Yvon, Yves Lepage

Machine Translation 27 ( 1 ) 1 - 23 2013.03

　View Summary

Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, in the course of training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on the use of various heuristics, that make it possible to align, first isolated words, then, by extension, groups of words. In this paper, we explore an alternative approach which relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks. © 2012 Springer Science+Business Media B.V.

DOI

Scopus

6

Citation

(Scopus)
Exploiting parallel corpus for handling out-of-vocabulary words

Luo, Juan, Tinsley, John, Lepage, Yves

27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27 399 - 408 2013.01

　View Summary

© 2013 by Juan Luo, John Tinsley, and Yves Lepage. This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that it is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in the Japanese-to-English machine translation tasks.
Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Sun, Jing, Lepage, Yves

Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 351 - 360 2012.12

　View Summary

Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remain some problems to be tackled. In this paper, we present an effective approach in extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand. © 2012 The PACLIC.
Hierarchical sub-sentential alignment with anymalign

Lardilleux, Adrien, Yvon, François, Lepage, Yves

Proceedings of the 16th Annual Conference of the European Association for Machine Translation, EAMT 2012 279 - 286 2012.01

　View Summary

© 2012 European Association for Machine Translation. We present a sub-sentential alignment algorithm that relies on association scores between words or phrases. This algorithm is inspired by previous work on alignment by recursive binary segmentation and on document clustering. We evaluate the resulting alignments on machine translation tasks and show that we can obtain state-of-the-art results, with gains up to more than 4 BLEU points compared to previous work, with a method that is simple, independent of the size of the corpus to be aligned, and directly computes symmetric alignments. This work also provides new insights regarding the use of "heuristic" alignment scores in statistical machine translation.
Improving sampling-based alignment by investigating the distribution of N-grams in phrase translation tables

Luo, Juan, Lardilleux, Adrien, Lepage, Yves

PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation 150 - 159 2011.12

　View Summary

This paper describes an approach to improve the performance of sampling-based multilingual alignment on translation tasks by investigating the distribution of n-grams in the translation tables. This approach consists in enforcing the alignment of n-grams. The quality of phrase translation tables output by this approach and that of MGIZA++ is compared in statistical machine translation tasks. Significant improvements for this approach are reported. In addition, merging translation tables is shown to outperform state-of-the-art techniques. © 2011 by Juan Luo, Adrien Lardilleux, and Yves Lepage.
Fully-automatic marker-based chunking in 11 European languages and counts of the number of analogies between chunks

Takeya, Kota, Lepage, Yves

PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation 567 - 576 2011.12

　View Summary

Analogy has been proposed as a possible principle for example-based machine translation. For such a framework to work properly, the training data should contain a large number of analogies between sentences. Consequently, such a framework can only work properly with short and repetitive sentences. To handle longer and more varied sentences, cutting the sentences into chunks could be a solution if the number of analogies between chunks is confirmed to be large. This paper thus reports counts of the number of analogies using different numbers of chunk markers in 11 European languages. These experiments confirm that the number of analogies between chunks is very large: several tens of thousands of analogies between chunks extracted from sentences among which only very few analogies, if any, were found. © 2011 by Kota Takeya and Yves Lepage.
Estimating the proximity between languages by their commonality in vocabulary structures

Yves Lepage, Julien Gosme, Adrien Lardilleux

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6562 127 - 138 2011

　View Summary

This article proposes a possible way of measuring proximity between languages: it consists in measuring the commonality of structures between the vocabularies of two languages. Experiments conducted on a multilingual lexicon of nine European languages acquired from the Acquis communautaire confirmed usual knowledge on the closeness or remoteness of these languages. © 2011 Springer-Verlag.

DOI

Scopus

1

Citation

(Scopus)
Ambiguity spotting using WordNet semantic similarity in support to recommended practice for software requirements specifications

Jin Matsuoka, Yves Lepage

NLP-KE 2011 - Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering 479 - 484 2011

　View Summary

Word Sense Disambiguation is a crucial problem in documents whose purpose is to serve as specifications for automatic systems. The combination of different techniques of Natural Language Processing can help in this task. In this paper, we show how to detect ambiguous terms in Software Requirements Specifications, and we propose a computer-aided method that alerts the reader to possibly ambiguous usage of terms. The method uses compound term measure (C-value), WordNet semantic similarity (WordNet wup-similarity) and a proposed semantic similarity measure between sentences. © 2011 IEEE.

DOI

Scopus

12

Citation

(Scopus)
The true score of statistical paraphrase generation

Chevelu, Jonathan, Putois, Ghislain, Lepage, Yves

Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference 2 144 - 152 2010.12

　View Summary

This article delves into the scoring function of the statistical paraphrase generation model. It presents an algorithm for exact computation and two applicative experiments. The first experiment analyses the behaviour of a statistical paraphrase generation decoder, and raises some issues with the ordering of n-best outputs. The second experiment shows that a major boost of performance can be obtained by embedding a true score computation inside a Monte-Carlo sampling based paraphrase generator.
The structure of unseen trigrams and its application to language models: A first investigation

Yves Lepage, Julien Gosme, Adrien Lardilleux

2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings 273 - 280 2010

　View Summary

In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish. © 2010 IEEE.

DOI

Scopus

1

Citation

(Scopus)

Research Projects

Theoretically founded algorithms for the automatic production of analogy tests in NLP

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2021.04

-

2024.03
Natural language processing for academic writing in English

Project Year :

2018.04

-

2021.03
Self-explainable and fast-to-train example-based machine translation using neural networks

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2018.04

-

2021.03

LEPAGE YVES

　View Summary

This research introduced self-explanation in example-based machine translation (EBMT) by analogy. It is thus positioned in explainable artificial intelligence (XAI). Self-explanation was implemented by tracing the analogies verified or solved during translation. The direct and indirect approaches to EBMT by analogy were merged in a system that uses an original neural network. The project studied how to retrieve sentences that cover a given sentence semantically and formally, and how dense corpora are relative to analogies. Datasets of analogies between sentences were released.
Language productivity: fast extraction of productive analogical clusters and their evaluation using statistical machine translation

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2015.04

-

2018.03

LEPAGE YVES, YANG Wei, FAM Rashel, SUSANTI GOJALI

　View Summary

The goal of the project was 1/ to build tools to produce analogical clusters from monolingual data, 2/ to use such clusters in the production of quasi-parallel corpora, 3/ to use such quasi-parallel corpora in addition to parallel corpora, and 4/ to obtain improvements in translation accuracy in statistical machine translation (SMT). Tools were built and publicly released. In addition to what was announced in the research plan, a new data structure, the analogical grid, was introduced. Data were produced in morphologically poor to rich languages: 11 European languages (N-grams from word to 6-grams), Chinese, Japanese (short sentences of less than 30 characters for SMT experiments), and additional languages (word forms in Arabic, Georgian, Navajo, Russian, Turkish, etc.). Part of the data has been publicly released. Various experiments showed that it is possible to improve translation accuracy thanks to quasi-parallel data produced by analogy, and filtered, in SMT for Chinese-Japanese.
Improvement of alignment for statistical and example-based machine translation and release of multilingual syntactic patterns

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2011.04

-

2014.03

LEPAGE Yves

　View Summary

Current machine translation systems rely on translation tables to translate. Translation tables look like usual dictionaries, but contain longer entries, with numerical values to assess the reliability of the entries. Translation tables are extracted automatically from translated texts by tools called subsentential aligners. In earlier research, we proposed a new subsentential aligner. It is simpler and faster than state-of-the-art tools. But it has lower quality scores in some translation tasks because its entries are not long enough. The goal of the project was to improve the tool so as to achieve results as good as those of the state-of-the-art tools. This has been achieved in various cases. Many of the translation tables output during the project have been made freely available to the community through a web site (all language pairs between the 11 languages on the common part of the Europarl corpus version 3).
SAMAR Arabic Multimedia Platform -- Arabic-French-English machine translation and alignment

Project Year :

2009

-

2012

Misc

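Parallel Data Augmentation for Style Transfer by Generating Intermediate Sentences

大澤功記, Yves Lepage

Proceedings of the Annual Meeting of the Association for Natural Language Processing (Web) 28th 2022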

J-GLOBAL
Fast BTG-Forest-Based Hierarchical Sub-sentential Alignment

Hao Wang, Yves Lepage

2017.11

　View Summary

In this paper, we propose a novel BTG-forest-based alignment method. Based on a fast unsupervised initialization of parameters using variational IBM models, we synchronously parse parallel sentences top-down and align hierarchically under the constraint of BTG. Our two-step method can achieve the same run-time and comparable translation performance as fast_align while it yields smaller phrase tables. Final SMT results show that our method performs even better in experiments on distantly related languages, e.g., English-Japanese.
Inflating a Small Parallel Corpus into a Large Quasi-parallel Corpus Using Monolingual Data for Chinese-Japanese Machine Translation

Wei Yang, Hanfei Shen, Yves Lepage

58 ( 1 ) 2017.01

CiNii
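A Study of Parameter Estimation Using Cross-Entropy in Statistical Machine Translation

川部友大, Yves Lepage

Proceedings of the Annual Meeting of the Association for Natural Language Processing (Web) 23rd 2017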

J-GLOBAL
An Investigation of the Sampling-Based Alignment Method and Its Contributions

Juan Luo, Yves Lepage

2013.08

　View Summary

By investigating the distribution of phrase pairs in phrase translation tables, the work in this paper describes an approach to increase the number of n-gram alignments in phrase translation tables output by a sampling-based alignment method. This approach consists in enforcing the alignment of n-grams in distinct translation subtables so as to increase the number of n-grams. A standard normal distribution is used to allot alignment time among translation subtables, which results in adjustment of the distribution of n-grams. This leads to better evaluation results on statistical machine translation tasks than the original sampling-based alignment approach. Furthermore, the translation quality obtained by merging phrase translation tables computed from the sampling-based alignment method and from MGIZA++ is examined.

DOI

Syllabus

Topics in Fundamental Science and Engineering A

School of Fundamental Science and Engineering

2026 spring semester
Master's Thesis (Information Architecture)(Fall)

Graduate School of Information, Production and Systems

2026 fall semester
Background and basics in distributional semantics

Graduate School of Information, Production and Systems

2026 fall semester
Master's Thesis (Information Architecture)(Spring)

Graduate School of Information, Production and Systems

2026 spring semester
Example-based machine translation/NLP Research (Fall)

Graduate School of Information, Production and Systems

2026 fall semester
Example-based machine translation/NLP Research (Spring)

Graduate School of Information, Production and Systems

2026 spring semester
Machine translation technology

Graduate School of Information, Production and Systems

2026 spring semester
Natural language processing (NLP)

Graduate School of Information, Production and Systems

2026 spring semester
Example-based machine translation/NLP Research (Doctor's Thesis)

Graduate School of Information, Production and Systems

2026 full year
Example-based machine translation/NLP Research (Fall)

Graduate School of Information, Production and Systems

2026 fall semester
Example-based machine translation/NLP Research (Spring)

Graduate School of Information, Production and Systems

2026 spring semester
Example-based machine translation/NLP B

Graduate School of Information, Production and Systems

2026 spring semester
Example-based machine translation/NLP C

Graduate School of Information, Production and Systems

2026 spring semester
Example-based machine translation/NLP D

Graduate School of Information, Production and Systems

2026 fall semester
Example-based machine translation/NLP A

Graduate School of Information, Production and Systems

2026 fall semester
Example-based machine translation/NLP

Graduate School of Information, Production and Systems

2026 fall semester
Topics in Fundamental Science and Engineering A

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering A

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering A

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering A

School of Fundamental Science and Engineering

2026 spring semester

Overseas Activities

Extension of analogy to algebraic structures; design and implementation of theoretically well-founded algorithms

2023.09

-

2024.09

Université de Montréal, Canada

LORIA, University of Nancy, France

Research Institute

2024

-

2026

Waseda Research Institute for Science and Engineering Concurrent Researcher

Internal Special Research Projects

Analogy between real numbers: mathematical definition and practical applications in machine learning

2025 Miguel Couceiro

Background: Analogy, through analogy-making or the analogical inference principle, is a fundamental cognitive process. Today's AI uses vector representations, which call for parallel processing or GPU architectures. Until now, analogy has been confined to symbolic AI, with Boolean or arithmetic analogy. The PI recently proposed a formalization of analogy between numbers that has the potential to be applied to the vector representations used in AI.

Goals:
(a) Consolidate the theoretical foundations of numerical analogy through adequate definitions and the formalization necessary for machine learning;
(b) Demonstrate the efficiency of numerical analogy in several AI tasks using vector representations.

Methods:
(a.1) Propose extensions of the formalization, study relations with machine learning, determine bounds and approximations for analogical powers (the main feature of numerical analogy);
(a.2) Design fast approximation algorithms and implement vectorization for GPU devices;
(b.1) Demonstrate the usefulness of numerical analogy in static word embedding spaces;
(b.2) Demonstrate the potential of numerical analogy in image processing.

Results: The grant enabled significant progress in the formalization of numerical analogy: crucial results have been obtained on bounds, approximations, and learnable functions. Fast computation of analogical powers, supported by these theoretical results, yielded promising results in lexical representation and in image reconstruction and classification. All these advances point to the possibility of fundamental breakthroughs in analogical inference and its application to core machine learning tasks such as classification. In particular, they open the door to applications in large language models for more effective language learning from less training data.
(a.1.i) Characterization of a class of functions compatible with the analogical inference principle, results presented in paper [1] (published);
(a.1.ii) Preliminary study of the relation between analogy and PAC-learnability, first results presented in paper [5] (submitted).
(a.2.i) Mathematical results on the determination of bounds, results presented in paper [2] (submitted);
(a.2.ii) Fast approximate computation of analogical powers
    - Determination of analytical formulae, inspection of the quality of approximations, results presented in paper [3] (submitted);
    - Tabulation techniques for the approximation of analogical powers, report under preparation.
(b.1) Successful demonstration of the efficiency of numerical analogy as a tool for the analysis of word analogies and for the enforcement of analogical structure in word representations through the design and implementation of analogy-based loss functions, results presented in paper [6] and in a journal paper in preparation.
(b.2) Successful demonstration of the potential of numerical analogy in image reconstruction and image classification through the definition of analogical pooling, results presented in paper [4] (submitted).
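As a rough illustration of what an analogy between numbers with a power parameter could look like, the sketch below assumes one possible definition: a : b :: c : d holds in power p when the generalized (power) mean of the extremes a and d equals that of the means b and c. This definition, the restriction to positive numbers, and the function names `generalized_mean` and `solve_analogy` are assumptions made for illustration; the summary above does not spell out the formalization used in the project.

```python
import math


def generalized_mean(x, y, p):
    """Power mean of two positive numbers; p == 0 gives the geometric mean."""
    if p == 0:
        return math.sqrt(x * y)
    return ((x ** p + y ** p) / 2) ** (1 / p)


def solve_analogy(a, b, c, p):
    """Return d > 0 with mean_p(a, d) == mean_p(b, c), or None if none exists.

    Assumed definition (illustrative only): a : b :: c : d holds in power p
    when the power mean of the extremes equals the power mean of the means.
    """
    if p == 0:
        # Geometric case: a * d == b * c
        return b * c / a
    s = b ** p + c ** p - a ** p
    if s <= 0:
        # No positive solution for this power
        return None
    return s ** (1 / p)


if __name__ == "__main__":
    print(solve_analogy(1, 2, 3, 1))  # arithmetic analogy: a + d == b + c
    print(solve_analogy(2, 4, 3, 0))  # geometric analogy: a * d == b * c
```

With p = 1 this reduces to the arithmetic analogy and with p = 0 to the geometric one; other values of p interpolate between such cases under the assumed definition.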
Reducing the development time of statistical machine translation systems: a study of sampling methods

2015

Background: Training a statistical machine translation (SMT) system is time-consuming. In 2013, a fast alignment method (Fast_align) was proposed for the probabilistic approach. It is 10 times as fast as the standard method (GIZA++).

Goal: The present research project addressed the problem of reducing the training time of SMT systems for the associative approach 1/ in word-to-word associations (Anymalign) and 2/ in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.

Method: 1/ For word-to-word association, we studied two improvements in sampling: a/ sampling given the knowledge of a test set to produce ad-hoc translation tables, for which two different techniques to estimate inverse translation probabilities were studied; b/ relying on whether a word is a hapax or not to build and sample sub-corpora. 2/ For sub-sentential alignment, we accelerated decisions in segmentation and reduced the search space. Core components were re-implemented in C and we introduced multi-processing.

Results: We report improvements in time and translation accuracy using three different language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared to our previous methods, our improved methods increased translation accuracy by one confidence interval on average. Compared with Fast_align, the same or lower training times yield similar translation accuracy in the two easiest language pairs.
A study of linguistic productivity for machine translation: analogy maps

2014

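This project addressed the general problem of structuring linguistic data, and the problem of improving translation quality in machine translation based on the results of that structuring. Structuring here means structuring based on analogical relations. To apply the method to European Union languages, beyond the Japanese-Chinese data used so far, a speed-up was necessary. A speed-up of more than five times was achieved, and experiments on English-French data, measuring various values of time and number of features, are in progress. The programs developed in this research were applied in the Japanese-Chinese translation experiments presented at the international conference PolTAL and at the domestic Annual Meeting of the Association for Natural Language Processing. The same programs were also used in the experiments of a paper presented at the international workshop Cogalex2014.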
A study of methods for simultaneous bilingual structural analysis for example-based machine translation

2013

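Background and goal: The purpose of this research is to study appropriate translation tables for the example-based translation engine developed in our laboratory. While research on statistical translation methods is currently flourishing, we are developing an example-based translation engine based on analogical relations. As the basic technology, we work on a formalization and implementation that allow a fourth sentence fragment to be computed from three given ones (example: 「風邪を」:「ひどい風邪が」::「熱さを」: x => x = 「ひどい熱さが」). As in statistical translation techniques, translation tables are required as translation knowledge.

To generate translation tables, this research applies the clustering method of (Zha et al., 2001) to word-to-word alignment results, so that parallel sentences are structurally analysed and aligned at the same time. Translation tables are generated automatically from this structural analysis and alignment. We also compared them with translation tables obtained with the previously proposed monolingual structural analysis method, secability, and measured translation quality. The main results of this research are as follows.

(1) We showed the feasibility of translating long sentences with an example-based translation engine based on analogical relations. Using secability for monolingual structural analysis, translation experiments showed that long sentences can be translated with the proposed method. Translation quality measured in BLEU is lower than that of a statistical translation system, but when the effect of sentence length is measured, the curves exhibit the same behaviour.

(2) We conducted experiments on several language pairs and released the translation tables obtained. Using the Europarl corpus, preliminary translation experiments were conducted on representative language pairs: French-English, Spanish-Portuguese and Finnish-English. In addition, translation tables for all language pairs among the 11 languages were generated with the secability method; these tables and the BLEU scores obtained with them were published on the laboratory website (http://133.9.48.109/index/analogy-based-ebmt/, see "Experiments with an in-house analogy-based EBMT system").

(3) We improved the tool for simultaneous bilingual structural analysis and alignment. Distinguishing general from special cases of computation reduced the number of basic operations and gave a 50-fold speed-up; multiprocessing gave a further speed-up of slightly less than half the number of cores; together, a 100-fold speed-up was obtained on 4 cores.

In the experiments conducted, the translation results obtained with simultaneous bilingual structural analysis and alignment are slightly lower than those obtained with secability. However, since the input sentences were analysed with the secability method in both experiments, the comparison is arguably unfair. As future work, translation methods that do not perform structural analysis of the input sentence should be studied when simultaneous structural analysis and alignment is used.

Use of the research funds:
(1) Registration fees for domestic and international conferences: Lepage (LTC 2013, Poland), 木村竜矢 (AISE 2013, Thailand), 西川裕介 and 尾美圭亮 (20th Annual Meeting of the Association for Natural Language Processing, Sapporo).
(2) Travel expenses for domestic and international conferences: 木村竜矢 (AISE 2013, Thailand), 西川裕介 and 尾美圭亮 (20th Annual Meeting of the Association for Natural Language Processing, Sapporo).
(3) The planned book purchases were made from a different budget owing to budget adjustments.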
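The analogical equation given as an example in this summary (「風邪を」:「ひどい風邪が」::「熱さを」: x) can be reproduced with a deliberately naive sketch. It is not the engine's algorithm, which the summary does not describe. It assumes that A and C differ only in one contiguous core, found by stripping their common prefix and suffix, and that this core occurs exactly once in B. The function names are hypothetical.

```python
def common_prefix_len(s, t):
    """Length of the longest common prefix of s and t."""
    n = 0
    while n < min(len(s), len(t)) and s[n] == t[n]:
        n += 1
    return n


def solve_string_analogy(a, b, c):
    """Solve a : b :: c : x under a strong simplifying assumption.

    Assumption (illustrative only): a and c differ in a single contiguous
    core, and the core of a occurs exactly once in b. Returns None when
    the assumption does not hold.
    """
    if a == c:
        return b
    p = common_prefix_len(a, c)
    # Common suffix is computed on the remainders so it cannot overlap the prefix
    s = common_prefix_len(a[p:][::-1], c[p:][::-1])
    core_a = a[p:len(a) - s]
    core_c = c[p:len(c) - s]
    if not core_a or b.count(core_a) != 1:
        return None
    return b.replace(core_a, core_c)


if __name__ == "__main__":
    print(solve_string_analogy("風邪を", "ひどい風邪が", "熱さを"))
```

Cases where the sketch returns None would need a more general resolution method.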
An example-based machine translation engine and an experimental application platform

2010

The final goal of this study is to produce an example-based machine translation engine that can be distributed to the research community on a site dedicated to example-based approaches to machine translation. The engine should use chunks to translate by analogy, and should be made fast by using C implementations of basic computations (resolution of analogical equations). The approach should be tested on various data, like the Europarl data.

1. Work on chunking has been done by implementing two methods: marker-based chunking (Gough and Way, 2004) (255 lines of Python code for chunking) and secability (Chenon, 2005) (170 lines of Python code). Tests on the Europarl corpus and an informal assessment of the relevance of the chunks produced by the two methods led us to prefer the marker-based chunking technique. In contrast to the standard method proposed by (Gough and Way, 2004), we automatically determine the markers as the most frequent, least informative words in a corpus (207 lines of Python code). The number of markers can be freely chosen by the user. Also in contrast to the standard method, we automatically determine whether to cut on the left or on the right of the markers, so as to have a truly language-independent method. There are still problems with this part of the computation, which is currently done by estimating the difference in entropies on the left and right of each marker. Improvements are under study.

1.1. We conducted experiments to compute the number of analogies between the chunks obtained (100,000 lines in 11 languages of the Europarl corpus, average sentence length in English: 30 words). This led to a paper at the Japanese Natural Language Processing Annual Conference (gengosyorigakkai) this year. My participation in gengosyorigakkai was charged to this budget.

1.2. The production of all chunks for each of the 11 languages of the Europarl corpus (300,000 lines in each language) has been done. The alignment of chunks by computation of lexical weights is currently being done. The corresponding programs have been written and tested (136 lines of code in Python). We determine the most reliable chunk segmentation between two languages by keeping the same average number of chunks for each sentence over the entire corpus. We are currently in the phase of producing the data.

1.3. Concerning language models, trigrams and analogy, related research on a new smoothing scheme for trigrams will be reported at the French Natural Language Processing Annual Conference. This technique has been shown to beat even Kneser-Ney smoothing on relatively small corpora: 300,000 lines from the Europarl corpus in all 11 languages except Finnish.

2. The translation engine

2.1. A new engine has been reimplemented in Python (511 lines of code). Its main feature is the use of threads to allow concurrent computation of different kinds. Each of the following tasks is performed in a different thread:
- generation of analogical equations,
- resolution of analogical equations,
- transfer from source language into target language, and
- linking between source text and translation.
This allows a clearer design. Work on the design is still in progress. In particular, the use of UML diagrams for class design made it possible to improve the code. The engine is now in its 3rd version. Two students are still working on the design of the engine through UML diagrams. Their part-time salaries were charged to this budget.

2.2. The resolution of analogical equations, written as a C library, has been integrated within the Python translation engine using SWIG. The same has been done for the efficient computation of distance or similarity between strings. The use of the C library leads to an acceleration of 5 to 10 times, measured on small examples in formal language theory (translation of the context-free language a^n b^n into the regular language (ab)^n).

3. The validation part of the work is ongoing research. The production of the alignment of chunks in all pairs for the 11 languages of the Europarl corpus is currently being done. The next step will be a systematic assessment of translation by analogy of the chunks in each of these pairs, using the standard scripts for assessment with various translation quality metrics: WER, BLEU, NIST and TER.

4. The disclosure of the translation engine on the example-based web site is unfortunately not yet possible. It is hoped that this will become possible in the next few months.
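The marker-based chunking described in point 1 can be sketched as follows. This is a minimal illustration, not the project's code: it assumes whitespace tokenization, takes the most frequent words as markers, and always cuts on the left of a marker, whereas the summary says the cutting side is decided automatically from entropy differences. The function names `find_markers` and `chunk` and the toy corpus are hypothetical.

```python
from collections import Counter


def find_markers(sentences, n_markers):
    """Take the n_markers most frequent words of the corpus as markers."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, _ in counts.most_common(n_markers)}


def chunk(sentence, markers):
    """Start a new chunk on the left of each marker.

    A cut is made only if the current chunk already contains a non-marker
    word, so that consecutive markers stay in the same chunk.
    """
    chunks, current = [], []
    for word in sentence.split():
        if word in markers and any(w not in markers for w in current):
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog slept on the sofa"]
    markers = find_markers(corpus, 2)
    for sentence in corpus:
        print(chunk(sentence, markers))
```

The number of markers is a free parameter here, in line with the statement above that it can be chosen by the user.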