LEPAGE Yves

Affiliation

Faculty of Science and Engineering, Graduate School of Information, Production and Systems

Title

Professor

Homepage

http://lepage-lab.ips.waseda.ac.jp/

University Research Institutes

  • 2020
    -
    2022

    Research Institute for Science and Engineering   Concurrent Researcher

Education

  •  
    -
    1985

    University of Grenoble (France), Graduate School of Informatics, Informatics (Natural Language Processing)

  •  
    -
    1983

    Mines Saint-Étienne (France, grande école), Graduate School of Engineering, Informatics

Academic Societies

  • Information Processing Society of Japan (IPSJ)

  • ATALA (French Natural Language Processing Association)

  • Association for Natural Language Processing (ANLP)

  • Editorial Board of the French Journal of Natural Language Processing (TAL)

 

Research Areas

  • Linguistics

  • Intelligent informatics

Research Keywords

  • Machine translation, multilingual alignment, analogical relations, paraphrasing, language models, foreign-language software

Papers

  • Improving automatic Chinese-Japanese patent translation using bilingual term extraction

    Wei Yang, Yves Lepage

    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING   13 ( 1 ) 117 - 125  January 2018  [Refereed]

     Abstract:

    The identification of terms in scientific and patent documents is a crucial issue for applications like information retrieval, text categorization, and also for machine translation. This paper describes a method to improve Chinese-Japanese statistical machine translation of patents by re-tokenizing the training corpus with aligned bilingual multi-word terms. We automatically extract multi-word terms from monolingual corpora by combining statistical and linguistic filtering methods. An automatic alignment method is used to identify corresponding terms. The most promising bilingual multi-word terms are extracted by setting a threshold on translation probabilities and by further filtering on the characters composing the bilingual multi-word terms as well as on the ratio of their lengths in words. We also use kanji (Japanese)-hanzi (Chinese) character conversion to confirm and extract more promising bilingual multi-word terms. We obtain a high quality of correspondence, 93%, in bilingual term extraction and a significant improvement of 1.5 BLEU points in a translation experiment.

    DOI
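The thresholding and length-ratio filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the candidate-tuple format, the probability threshold and the length-ratio bound are all assumptions.

```python
def filter_bilingual_terms(candidates, p_min=0.6, max_len_ratio=2.0):
    """Keep aligned multi-word term pairs whose translation probabilities
    in both directions exceed a threshold and whose lengths in words are
    not too different.

    `candidates` holds (src_term, tgt_term, p_src2tgt, p_tgt2src) tuples,
    a hypothetical format used here for illustration only."""
    kept = []
    for src, tgt, p_st, p_ts in candidates:
        len_src, len_tgt = len(src.split()), len(tgt.split())
        ratio = max(len_src, len_tgt) / min(len_src, len_tgt)
        if p_st >= p_min and p_ts >= p_min and ratio <= max_len_ratio:
            kept.append((src, tgt))
    return kept
```

In this toy setting, a pair is discarded as soon as either direction of the translation probability is weak or the terms differ too much in length, mirroring the two filters named in the abstract.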

  • Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

    Wei Yang, Hanfei Shen, Yves Lepage

    Journal of Information Processing   25 ( 0 ) 88 - 99  2017

     Abstract:

    Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.

    DOI CiNii
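The BLEU-based filtering of over-generated sentences can be illustrated with a toy stand-in: clipped n-gram precision of a generated sentence against a reference. The real method presumably uses full BLEU; the threshold and the bigram order here are assumptions for the sketch.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate sentence against a single
    reference; generated sentences scoring below some threshold would be
    filtered out."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # clip each candidate n-gram count by its count in the reference
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0
```

A well-formed generated sentence keeps most of its word order and scores close to 1, while a scrambled one scores near 0, which is what makes such a score usable as a fluency filter.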

  • A method of generating translations of unseen n-grams by using proportional analogy

    Juan Luo, Yves Lepage

    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING   11 ( 3 ) 325 - 330  May 2016  [Refereed]

     Abstract:

    In recent years, statistical machine translation has gained much attention. The phrase-based statistical machine translation model has made significant advances in translation quality over the word-based model. In this paper, we attempt to apply the technique of proportional analogy to statistical machine translation systems. We propose a novel approach that applies proportional analogy to generate translations of unseen n-grams from the phrase table for phrase-based statistical machine translation. Experiments are conducted with two datasets of different sizes. We also investigate two methods to integrate the n-gram translations produced by proportional analogy into the state-of-the-art statistical machine translation system Moses. The experimental results show that unseen n-gram translations generated using the technique of proportional analogy are rewarding for statistical machine translation systems with small datasets.

    DOI
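The core operation here, solving a proportional analogy A : B :: C : x between strings, can be illustrated with a deliberately simplified solver. Actual analogy solvers, including those behind these papers, handle multiple simultaneous edits; this sketch covers only the case of a single shared core whose prefix and suffix are rewritten.

```python
from difflib import SequenceMatcher

def solve_analogy(a, b, c):
    """Solve A : B :: C : x for the simple case where B rewrites the
    material around one shared core of A, e.g.
    'walk' : 'walked' :: 'talk' : 'talked'."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return None
    prefix_a, suffix_a = a[:m.a], a[m.a + m.size:]
    prefix_b, suffix_b = b[:m.b], b[m.b + m.size:]
    # apply the same prefix/suffix rewriting to C
    if c.startswith(prefix_a) and c.endswith(suffix_a):
        core = c[len(prefix_a):len(c) - len(suffix_a)]
        return prefix_b + core + suffix_b
    return None
```

The same mechanism covers the Japanese example cited elsewhere in this profile: 「風邪を」:「ひどい風邪が」::「熱さを」: x yields x = 「ひどい熱さが」.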

  • Morphological predictability of unseen words using computational analogy

    Fam, Rashel, Lepage, Yves

    CEUR Workshop Proceedings   1815   51 - 60  January 2016

     Abstract:

    We address the problem of predicting unseen words by relying on the organization of the vocabulary of a language as exhibited by paradigm tables. We present a pipeline to automatically produce paradigm tables from all the words contained in a text. We measure how many unseen words from an unseen test text can be predicted using the paradigm tables obtained from a training text. Experiments are carried out in several languages to compare the morphological richness of languages, and also the richness of the vocabulary of different authors.
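The idea of predicting unseen forms from paradigm tables can be sketched roughly as follows, under the simplifying assumption that a paradigm is a set of suffixes attested with a common stem; the paper's pipeline is more elaborate than this.

```python
from collections import defaultdict

def suffix_signatures(vocab, min_stem=3):
    """Map each candidate stem to the set of suffixes it is attested with."""
    stems = defaultdict(set)
    for word in vocab:
        for i in range(min_stem, len(word) + 1):
            stems[word[:i]].add(word[i:])
    # keep only stems showing a genuine alternation (>= 2 suffixes)
    return {s: fs for s, fs in stems.items() if len(fs) >= 2}

def predict_unseen(vocab, min_stem=3):
    """If a stem's suffix set is a proper subset of another attested
    suffix signature, the missing cells of its paradigm row are
    predicted as unseen words."""
    table = suffix_signatures(vocab, min_stem)
    signatures = {frozenset(fs) for fs in table.values()}
    predictions = set()
    for stem, suffixes in table.items():
        for sig in signatures:
            if suffixes < sig:
                predictions |= {stem + s for s in sig - suffixes}
    return predictions - set(vocab)
```

For example, a vocabulary containing walk/walks/walked and talk/talks exhibits the signature {"", "s", "ed"}, from which the unseen form "talked" is predicted.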

Awards

  • Waseda University Teaching Award (Spring 2016)

    2016   Waseda University

     Description:

    Natural language processing courses

Research Projects (Joint Research and Competitive Funding)

  • Natural language processing for academic writing in English

    Project period:

    April 2018
    -
    March 2021

     Abstract:

    In the 2nd fiscal year, research was carried out on the use of word embedding models to search for substitute words for academic writing. Human evaluation was carried out and the results were compared to a machine translation system (1 paper at an international conference with reviewing committee, PACLING 2019).

    N-grams from ACL-ARC were extracted and classified into true and false lexical bundles using machine learning models trained on manually checked bundles. 18,000 true lexical bundles were collected and publicly released (1 paper at an international conference with reviewing committee, ICACSIS 2019). They are useful for composing fluent academic texts, and they are plagiarism-free.

    Work on using sentence embeddings to search for similar sentences in Abstract sections was conducted. Similar sentences are presented to non-native writers to help them make corrections (1 paper at the 26th Annual Meeting of the Association for Natural Language Processing, no reviewing committee).

    A website was set up based on the prototype built in the 1st fiscal year. A part-time research assistant was hired to set up the server, create and administer the website, and design and implement the front-end user interface. The website is designed to help researchers compose their scientific articles: it includes a text-drafting pane, automatic translation into English when necessary, dictionary lookup, search for similar words and sentences, text generation, and plagiarism checking. Currently only the interface is provided; the main engines will be linked in the future.

    Improvements were made to the research on searching for similar words and sentences, and to the collection of plagiarism-free lexical bundles. The website will be put into operation when the text generation part is ready.

    In the third fiscal year, the main focus will be the text generation part. Following the current research trend, deep learning will be applied to generate new text based on the original text, the collection of lexical bundles, and the ACL-ARC knowledge base. The text generation engine must be able to combine possible chunks, lexical bundles, and discursive and argumentative connectors from already published articles, while preserving the original meaning of the text. Furthermore, the text style must be typical of the sections of a paper, which means that the typicality of phrases must be enforced.

    The final part of the research concerns plagiarism. Metrics used for plagiarism will be surveyed, and algorithms for detecting plagiarism will be determined.
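The similar-sentence search step mentioned above can be illustrated with cosine similarity over bag-of-words vectors. The project uses sentence embeddings; plain word counts are substituted here only to keep the sketch self-contained.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(query, corpus):
    """Return the corpus sentence most similar to the query sentence."""
    q = Counter(query.lower().split())
    return max(corpus, key=lambda s: cosine(q, Counter(s.lower().split())))
```

With real sentence embeddings the vectors come from an encoder instead of `Counter`, but the ranking by cosine similarity is the same.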

  • Self-explainable and fast-to-train example-based machine translation using neural networks

    Project period:

    April 2018
    -
    March 2021

     Abstract:

    After working on the direct approach in the first year, work on the indirect approach in the example-based machine translation (EBMT) system was performed in the second fiscal year. A system was implemented, and numerical approaches were introduced in adaptation and retrieval (1 paper at an international conference). In addition, we studied how to merge the direct and indirect approaches in EBMT by analogy. A model has been proposed; it has not yet been integrated in the final EBMT system. It exploits vector representations of words for monolingual comparison (results from neural NLP) and sub-sentential alignment for bilingual comparison (results from SMT) (1 paper at a national conference, accepted, to be published in fiscal year 2020). Work on sentence representations for retrieval and similarity computation also started.

    Data was collected: because we could not acquire the BTEC corpus, we use data from the Tatoeba corpus. A method to produce semantico-formal analogies between sentences was proposed (1 paper at an international conference), and the dataset was publicly released. Preliminary experiments on matrix representations of sentences and on the resolution of analogies between such representations were conducted; no paper has been published. Experiments on improving bilingual word embedding mapping were also conducted (1 paper published at an international conference).

    To run experiments, we could not buy another DeepLearning Box as planned because prices went up. Instead, one graphics card (GPU) was added to the DeepLearning Box already acquired in fiscal year 2018.

    The planning is basically kept. The work planned for the 2nd year was performed as expected: (1) the use of (a) word vector representations, coming from neural NLP, and (b) sub-sentential alignment, coming from statistical machine translation, was adopted for the monolingual and bilingual cases. The representation of the correspondence between sentences is made using similarity matrices. The use of sub-sentential alignment and bilingual word embedding mapping was compared in an experiment. (2) In order to go from formal and crisp analogies to softer and more semantically relevant analogies between sentences, a method to solve semantico-formal analogies between sentences was designed. A resource of semantico-formal analogies in English was produced automatically and publicly released.

    Work for the 3rd year was initiated: (1) the study of representations of the sentences themselves, by matrices of interpolated points in a word embedding space (original approach) or by direct sentence embeddings, started and is continuing. (2) A set of bilingual analogies between sentences extracted from the Tatoeba corpus has been produced; this dataset will be released.

    Some work was delayed: the work on self-explanation of translations was initially planned for the 3rd fiscal year. It was initiated in the 1st year but suspended in the 2nd year; it will resume in the 3rd fiscal year. In addition, integrating the resolution of soft analogies in the EBMT system has been slightly delayed.

    During the 3rd year, work on the prototype system will continue. The self-explainable functionality for tracing the recursive translation of fragments of sentences was addressed in the 2nd year in the model proposed for the indirect approach to example-based machine translation. A first interface has been designed. However, work on the visualisation of the traces is needed, because traces need to be shorter and more readable for the user. The explanation of how similar retrieved sentences match the sentence to be translated also needs to be inspected.

    One of the main tasks will be to conduct experiments measuring to what extent crisp vs. soft comparison of words, in the translation of shorter vs. longer sentences, using more dense vs. less dense corpora, is more efficient. For that, data should be prepared. Work on sentence representations and on representations of the correspondence between sentences will continue. Training times will also be measured and compared with training times in the neural approach to machine translation.

    Work will be conducted on the retrieval of similar sentences, a necessary component in an example-based machine translation system. The use of vector representations of sentences and cosine similarity will be compared with more classical methods using suffix arrays and pattern-matching techniques. Work on self-explanation will also resume: interfaces for the visualisation of traces will be improved, and the existing explanations need to be shorter and more easily understandable by a standard user.

  • Linguistic productivity: fast extraction of effective analogical clusters and their evaluation in statistical machine translation

    Project period:

    April 2015
    -
    March 2018

     Abstract:

    The goals of this research were to 1) build analogical clusters from monolingual data, 2) generate a quasi-parallel corpus from these clusters, and 3) add it to a parallel corpus so as to 4) improve the accuracy of statistical machine translation (SMT). To this end, various tools were implemented and released. A new data structure, the analogical grid, was also introduced. Data were built in a range of languages, from morphologically poor to morphologically rich: the 11 languages of the European Union, Chinese, Japanese, and additional languages (Arabic, Georgian, Navajo, Russian, Turkish). Part of the data was publicly released. Experiments showed that adding the quasi-parallel corpus improves the translation accuracy of Japanese-Chinese SMT.

  • Improving alignment for statistical and example-based machine translation, and releasing multilingual grammatical patterns

    Project period:

    April 2011
    -
    March 2014

     Abstract:

    In conventional machine translation systems, translation knowledge resides in a translation table. A translation table is similar to a bilingual dictionary, but its entries are longer than those of an ordinary dictionary and carry numerical values such as probabilities for each entry. Translation tables are generated automatically by a tool called a sub-sentential aligner. In this research, we improved the sub-sentential alignment method proposed in our previous work. It now outputs longer entries, and we showed that in specific settings it achieves state-of-the-art translation quality. We also released part of the resulting translation tables: translation tables for all language pairs over the intersection of the 11 languages of Europarl version 3, obtained under various experimental settings.

  • A multimedia platform for Arabic

    Project period:

    2009
    -
    2012
     

Specific Research Projects

  • Reducing the development time of statistical machine translation systems: a study of sampling methods

    2015

     Abstract:

    Background: training a statistical machine translation (SMT) system is time-consuming. In 2013, a fast alignment method for the probabilistic approach (fast_align) was proposed; it is 10 times as fast as the standard method (GIZA++).

    Goal: the present research project addressed the problem of reducing the training time of SMT systems for the associative approach 1) in word-to-word association (Anymalign) and 2) in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.

    Method: 1) for word-to-word association, we studied two improvements in sampling: a) sampling given knowledge of a test set, to produce ad hoc translation tables, with two different techniques to estimate inverse translation probabilities; b) relying on whether a word is a hapax or not to build and sample sub-corpora. 2) For sub-sentential alignment, we accelerated decisions in segmentation and reduced the search space. Core components were re-implemented in C, and we introduced multi-processing.

    Results: we report improvements in time and translation accuracy on three different language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared to our previous methods, our improved methods increased translation accuracy by one confidence interval on average. Compared with fast_align, the same or lower training times yield similar translation accuracy on the two easiest language pairs.
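The sampling idea behind the associative approach can be illustrated with a toy version of Anymalign-style sampling: in each small random sub-corpus, a source word and a target word occurring in exactly the same sentence pairs are credited as associated. All parameters below are illustrative, not those of the actual tool.

```python
import random
from collections import Counter, defaultdict

def sample_associations(bicorpus, n_samples=200, max_size=5, seed=0):
    """Count, over many random sub-corpora, how often a source word and
    a target word show identical occurrence profiles."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        size = rng.randint(2, min(max_size, len(bicorpus)))
        sub = rng.sample(bicorpus, size)
        src_occ, tgt_occ = defaultdict(set), defaultdict(set)
        for i, (src, tgt) in enumerate(sub):
            for w in src.split():
                src_occ[w].add(i)
            for w in tgt.split():
                tgt_occ[w].add(i)
        # identical occurrence profiles within the sample count as evidence
        for ws, occ_s in src_occ.items():
            for wt, occ_t in tgt_occ.items():
                if occ_s == occ_t:
                    counts[(ws, wt)] += 1
    return counts
```

Because small samples make perfect co-occurrence likely for true translation pairs and unlikely for spurious ones, repeated sampling separates the two without ever aligning the full corpus at once.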

  • A study of linguistic productivity for machine translation: analogical maps

    2014

     Abstract:

    This project addressed the general problem of structuring linguistic data and the problem of improving translation quality in machine translation based on the results of that structuring, where structuring means structuring by analogical relations. To apply the method to European Union languages in addition to the Japanese-Chinese data used so far, acceleration was necessary. A speed-up of more than 5 times was achieved; it was measured for various values of time and number of features, and experiments on English-French data are in progress. The programs developed in this project were used in the Japanese-Chinese translation experiments presented at the international conference PolTAL and at the Annual Meeting of the Association for Natural Language Processing, as well as in the experiments of a paper presented at the international workshop CogALex 2014.

  • A study of simultaneous bilingual structural analysis for example-based machine translation

    2013

     Abstract:

    Background and goal: the purpose of this project was to investigate translation tables suited to the example-based translation engine developed in our laboratory. While research on statistical translation methods is currently flourishing, we develop an example-based translation engine based on analogical relations. Its core technique is the formalization and implementation of computing a fourth string from three given strings (e.g. 「風邪を」:「ひどい風邪が」::「熱さを」: x => x = 「ひどい熱さが」). Like statistical translation, it requires a translation table as translation knowledge.

    To generate translation tables, we applied the clustering method of (Zha et al., 2001) to word alignment results, performing structural analysis and alignment of bilingual sentence pairs simultaneously, and generated translation tables automatically from the resulting analyses and alignments. We also compared translation quality against translation tables obtained with the previously proposed monolingual secability method.

    The main results are as follows.
    ① We showed that translating long sentences with an analogy-based example-based translation engine is feasible. Using secability for monolingual structural analysis, translation experiments showed that long sentences can be translated with the proposed method. Translation quality measured with BLEU is lower than that of a statistical translation system, but when measuring the influence of sentence length, the curves behave in the same way.
    ② We ran experiments on several language pairs and released the resulting translation tables. Using the Europarl corpus, preliminary experiments were run on representative language pairs: French-English, Spanish-Portuguese and Finnish-English. In addition, translation tables were generated with the secability method for all pairs of the 11 languages, and these tables and the BLEU scores obtained with them were released on our laboratory website (http://133.9.48.109/index/analogy-based-ebmt/, see "Experiments with an in-house analogy-based EBMT system").
    ③ We improved the simultaneous bilingual parsing and alignment tool. By distinguishing general and special computation cases, the number of basic operations was reduced, yielding a 50-fold speed-up; multi-processing added a speed-up of a little under half the number of cores, for a 100-fold speed-up overall on 4 cores.

    In our experiments, the translation results obtained with simultaneous bilingual parsing and alignment are slightly lower than those obtained with secability. However, since in both experiments the input sentences are parsed with the secability method, the comparison is in a sense unfair. As future work, when simultaneous parsing and alignment is used, translation methods that do not parse the input sentence should be investigated.

    Use of the research budget: ① domestic and international conference registration fees: Lepage (LTC 2013, Poland), 木村竜矢 (AISE 2013, Thailand), 西川裕介 and 尾美圭亮 (20th Annual Meeting of the Association for Natural Language Processing, Sapporo); ② domestic and international conference travel: 木村竜矢 (AISE 2013, Thailand), 西川裕介 and 尾美圭亮 (20th Annual Meeting, Sapporo); ③ the planned book purchases were made from a different budget for budget-adjustment reasons.

  • An example-based machine translation engine and an experimental application platform

    2010

     Abstract:

    The final goal of this study is to produce an example-based machine translation engine that can be distributed to the research community on a site dedicated to example-based approaches to machine translation. The engine should use chunks to translate by analogy, and should be made fast by using C implementations of basic computations (resolution of analogical equations). The approach should be tested on various data, like the Europarl data.

    1. Work on chunking was done by implementing two methods: marker-based chunking (Gough and Way, 2004) (255 lines of Python code) and secability (Chenon, 2005) (170 lines of Python code). Tests on the Europarl corpus and informal assessment of the relevance of the chunks produced by the two methods led us to prefer the marker-based chunking technique. In contrast to the standard method proposed by (Gough and Way, 2004), we automatically determine the markers as the most frequent, least informative words in a corpus (207 lines of Python code); the number of markers can be freely chosen by the user. Also in contrast to the standard method, we automatically determine whether to cut on the left or on the right of the markers, so as to obtain a truly language-independent method. There are still problems in this part of the computation, which is currently done by estimating the difference in entropies on the left and right of each marker. Improvements are under study.

    1.1. We conducted experiments to compute the number of analogies between the chunks obtained (100,000 lines in the 11 languages of the Europarl corpus; average sentence length in English: 30 words). This led to a paper at the Annual Meeting of the Association for Natural Language Processing this year. My participation in the meeting was charged to this budget.

    1.2. The production of all chunks for each of the 11 languages of the Europarl corpus (300,000 lines in each language) has been done. The alignment of chunks by computation of lexical weights is currently being done; the corresponding programs have been written and tested (136 lines of Python code). We determine the most reliable chunk segmentation between two languages by keeping the same average number of chunks per sentence over the entire corpus. We are currently producing the data.

    1.3. Relative to language models, trigrams and analogy, related research on a new smoothing scheme for trigrams will be reported at the French Natural Language Processing Annual Conference. This technique has been shown to beat even Kneser-Ney smoothing on relatively small corpora: 300,000 lines from the Europarl corpus in all 11 languages except Finnish.

    2. The translation engine.

    2.1. A new engine has been reimplemented in Python (511 lines of code). Its main feature is the use of threads to allow concurrent computation of different kinds. Each of the following tasks is performed in a different thread: generation of analogical equations, resolution of analogical equations, transfer from source language into target language, and linking between source text and translation. This allows a clearer design. Work on the design is still in progress; in particular, the use of UML diagrams for class design allowed us to improve the code. The engine is now in its 3rd version. Two students are still working on the design of the engine through UML diagrams; their part-time salaries were charged to this budget.

    2.2. The resolution of analogical equations as a C library has been integrated into the Python translation engine using SWIG. The same has been done for the efficient computation of distance or similarity between strings. The use of the C library leads to an acceleration of 5 to 10 times, measured on small examples in formal language theory (translation of the context-free language a^n b^n into the regular language (ab)^n).

    3. The validation part of the work is ongoing research. The production of the alignment of chunks in all pairs of the 11 languages of the Europarl corpus is currently being done. The next step will be a systematic assessment of translation by analogy of the chunks in each of these pairs, using the standard assessment scripts with various translation quality metrics: WER, BLEU, NIST and TER.

    4. The disclosure of the translation engine on the example-based web site is unfortunately not yet possible. It is hoped that it will become possible in the next few months.
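The marker-based chunking described in item 1 can be sketched as follows. This is the left-cutting variant only; the automatic left/right decision via entropy differences mentioned above is omitted, and the marker count is an arbitrary illustrative parameter.

```python
from collections import Counter

def learn_markers(corpus, k=5):
    """Take the k most frequent (hence least informative) words of the
    corpus as markers, as described in the text."""
    freq = Counter(w for line in corpus for w in line.split())
    return {w for w, _ in freq.most_common(k)}

def chunk(sentence, markers):
    """Start a new chunk at every marker word (left-cutting variant)."""
    chunks, current = [], []
    for w in sentence.split():
        if w in markers and current:
            chunks.append(" ".join(current))
            current = []
        current.append(w)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, with markers {"the", "of"}, the sentence "the house of the king" splits into the chunks "the house" / "of" / "the king".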

 

Courses Currently Taught

 

Committee Memberships

  • 2008
    -
    2016

    Traitement automatique des langues (TAL) journal, Editorial Board   Editor-in-Chief
