Updated on 2023/06/05


 
HAYAMIZU, Satoru
 
Scopus Paper Info  
Paper Count: 124  Citation Count: 1063  h-index: 15

The data were downloaded from the Scopus API on June 4, 2023, via http://api.elsevier.com and http://www.scopus.com .

Affiliation
Research Council (Research Organization), Green Computing Systems Research Organization
Job title
Senior Researcher (Professor)
Degree
Doctor of Engineering, Graduate School of Engineering, The University of Tokyo
Profile

I have resumed attending international conferences in person. In June I went to New Orleans to attend CVPR2022, and in August I went to Washington, DC to attend KDD2022. For both conferences, this was my first in-person participation since 2019. My impression was that there were few participants from Japan. Compared with participating online, I found in-person attendance far more efficient.

Research Experience

  • 2021.04
    -
    Now

    Waseda University   Green Computing Systems Research Organization

  • 2002.04
    -
     

    Gifu University   Faculty of Engineering

  • 2001.04
    -
    2002.03

    National Institute of Advanced Industrial Science and Technology

  • 1981.04
    -
    2001.03

    Electrotechnical Laboratory, Agency of Industrial Science and Technology, Ministry of International Trade and Industry

Education Background

  • 1979.04
    -
    1981.03

    The University of Tokyo   School of Engineering   Department of Mechanical Engineering  

  • 1974.04
    -
    1978.03

    The University of Tokyo   Faculty of Engineering   Department of Engineering Synthesis  

Research Areas

  • Intelligent informatics

Research Interests

  • Media informatics

  • social entrepreneur

 

Papers

  • Anomalous sound detection based on attention mechanism

    Hayato Mori, Satoshi Tamura, Satoru Hayamizu

    Proceedings of EUSIPCO    2021.08  [Refereed]

    Authorship: Last author

  • Proposal of failure prediction method of factory equipment by vibration data with Recurrent Autoencoder

    Shota ASAHI, Ayaka MATSUI, Satoshi TAMURA, Satoru HAYAMIZU, Ryosuke ISASHI, Akira FURUKAWA, Takayoshi NAITOU

    Transactions of the JSME (in Japanese)   86 ( 891 ) 20-00020  2020.10  [Refereed]

    DOI

  • Anomaly Detection in Mechanical Vibration Using Combination of Signal Processing and Autoencoder

    Ayaka Matsui, Shota Asahi, Satoshi Tamura, Satoru Hayamizu, Ryosuke Isashi, Akira Furukawa, Takayoshi Naitou

    Journal of Signal Processing   24 ( 4 ) 203 - 206  2020.07  [Refereed]

    DOI

  • 音響信号処理と3-IR照度差ステレオ法による嚥下機能評価

    児玉千紗, 加藤邦人, 田村哲嗣, 速水悟

    電子情報通信学会論文誌   J102-D ( 3 ) 173 - 184  2019.03  [Refereed]

  • Semantic Segmentation of Paved Road and Pothole Image Using U-Net Architecture

    Vosco Pereira, Satoshi Tamura, Satoru Hayamizu, Hidekazu Fukai

    2019 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA). IEEE    2019  [Refereed]

  • Swallowing function evaluation using deep-learning-based acoustic signal processing

    Chisa Kodama, Kunihito Kato, Satoshi Tamura, Satoru Hayamizu

    APSIPA ASC 2017     961 - 964  2017.12  [Refereed]

    DOI

    Scopus

  • Development of audio-visual speech corpus toward speaker-independent Japanese LVCSR

    Kazuto Ukai, Satoshi Tamura, Satoru Hayamizu

    2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016     12 - 15  2017.05  [Refereed]

     View Summary

    In the speech recognition literature, building corpora for Large Vocabulary Continuous Speech Recognition (LVCSR) is quite important. In addition, in order to overcome the performance decrease caused by noise, using visual information such as lip images is effective. In this paper, therefore, we focus on collecting speech and lip-image data for audio-visual LVCSR. Audio-visual speech data were obtained from 12 speakers, each of whom uttered ATR503 phonetically balanced sentences. These data were recorded in acoustically and visually clean environments. Using the data, we conducted recognition experiments. Mel Frequency Cepstral Coefficients (MFCCs) and eigenlip features were obtained, and multi-stream Hidden Markov Models (HMMs) were built. We compared the performance in the clean condition to that in noisy environments. It is found that visual information is able to compensate for the performance loss. In addition, it turns out that we should improve visual speech recognition for high-performance audio-visual LVCSR.

    DOI

    Scopus

    1 Citation (Scopus)

  • Toward Building Speech Databases in Timor Leste

    Borja L.C, Patrocinio Antonino, Satoshi Tamura, Hidekazu Fukai, Satoru Hayamizu

    The 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment    2017  [Refereed]

  • Investigation of DNN-based audio-visual speech recognition

    Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

    IEICE Transactions on Information and Systems   E99D ( 10 ) 2444 - 2451  2016.10

     View Summary

    © 2016 The Institute of Electronics, Information and Communication Engineers. Audio-Visual Speech Recognition (AVSR) is one of the techniques to enhance the robustness of a speech recognizer in noisy or real environments. On the other hand, Deep Neural Networks (DNNs) have recently attracted a lot of attention from researchers in the speech recognition field, because we can drastically improve recognition performance by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach an emission probability on each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is composed into a feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods to mainly clarify how we should incorporate audio and visual modalities using DNNs. We carried out recognition experiments using a corpus CENSREC-1-AV, and we discuss the results to find out the best DNN-based AVSR modeling. Then it turns out that a tandem-based method using audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.

    DOI

    Scopus

    4 Citations (Scopus)

  • Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

    Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

    2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015     575 - 582  2016.02

     View Summary

    © 2015 Asia-Pacific Signal and Information Processing Association. This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection in a visual modality. In our approach, many kinds of visual features are incorporated, subsequently converted into bottleneck features by deep learning technology. By using the proposed features, we successfully achieved 73.66% lipreading accuracy in the speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. It is found that VAD is useful in both audio and visual modalities, for better lipreading and AVSR.

    DOI

    Scopus

    36 Citations (Scopus)

  • Audio-visual processing toward robust speech recognition in cars

    Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

    7th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems and Safety 2015     31 - 34  2015

     View Summary

    This paper reports our recent efforts to develop robust speech recognition in cars. Speech recognition is expected to handle many devices in cars. However, many kinds of acoustic noises, e.g. engine noise and car stereo, are observed in in-car environments, degrading the performance of speech recognition. In order to overcome the degradation, we develop a high-performance audio-visual speech recognition method. Lip images are obtained from captured face images using our face detection scheme. Some basic visual features are computed, then converted into visual features for speech recognition using a deep neural network. Audio features are obtained as well. Audio and visual features are subsequently concatenated into audio-visual features. As a recognition model, a multi-stream hidden Markov model is employed, which can adjust the contributions of the audio and visual modalities. We evaluated our proposed method using an audio-visual corpus CENSREC-1-AV. In order to simulate the driving-car condition, we prepared driving and music noises. Experimental results show that our method can significantly improve recognition performance in the in-car condition.

  • Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

    Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

    2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA)     575 - 582  2015

     View Summary

    This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection in a visual modality. In our approach, many kinds of visual features are incorporated, subsequently converted into bottleneck features by deep learning technology. By using the proposed features, we successfully achieved 73.66% lipreading accuracy in the speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. It is found that VAD is useful in both audio and visual modalities, for better lipreading and AVSR.

  • MULTI-MODAL SERVICE OPERATION ESTIMATION USING DNN-BASED ACOUSTIC BAG-OF-FEATURES

    Satoshi Tamura, Takuya Uno, Masanori Takehara, Satoru Hayamizu, Takeshi Kurata

    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO)     2291 - 2295  2015  [Refereed]

     View Summary

    In service engineering it is important to estimate when and what a worker did, because such estimates include crucial evidence to improve service quality and working environments. For Service Operation Estimation (SOE), acoustic information is one of the useful and key modalities; in particular, environmental or background sounds include effective cues. This paper focuses on two aspects: (1) extracting powerful and robust acoustic features by using stacked-denoising-autoencoder and bag-of-feature techniques, and (2) investigating a multi-modal SOE scheme by combining the audio features and the other sensor data as well as non-sensor information. We conducted evaluation experiments using multi-modal data recorded in a restaurant. We improved SOE performance in comparison to conventional acoustic features, and the effectiveness of our multi-modal SOE scheme is also clarified.

  • IMPROVEMENT OF UTTERANCE CLUSTERING BY USING EMPLOYEES' SOUND AND AREA DATA

    Tetsuya Kawase, Masanori Takehara, Satoshi Tamura, Satoru Hayamizu, Ryuhei Tenmoku, Takeshi Kurata

    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)     3047 - 3051  2014  [Refereed]

     View Summary

    In this paper, we propose to use staying-area data toward the estimation of serving time for customers. Classifying utterances enables us to estimate conversation types between speakers. However, its performance becomes lower in real environments. We propose a method using area data with sound data to solve this problem. We also propose a method to estimate the conversation types using decision trees. They were tested with data recorded in a Japanese restaurant. In the experiment to classify utterances, the proposed method performed better than the method using only sound data. In the experiment to estimate the conversation types, we succeeded in recovering 70% of the misclassified conversations using both sound and area data.

    DOI

    Scopus

    1 Citation (Scopus)

  • Analysis of Customer Communication by Employee in Restaurant and Lead Time Estimation

    Masanori Takehara, Hiroya Nojiri, Satoshi Tamura, Satoru Hayamizu, Takeshi Kurata

    2014 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA)     1 - 5  2014  [Refereed]

     View Summary

    Human behavior sensing and its analysis play a great role in improving service quality and employee education. This paper presents novel frameworks for the detection of customer communication and for lead time estimation (LTE) using multi-sensor data, sound data, and accounting data in a restaurant. They are useful for managing work environments and identifying problems for employees. The lead time from order to delivery reflects the quality of the service for customers. We found that sound data of an employee's speech are useful for these techniques, through speech-ratio smoothing and POS sound detection.

    DOI

    Scopus

    1 Citation (Scopus)

  • AUDIO-VISUAL VOICE CONVERSION USING NOISE-ROBUST FEATURES

    Kohei Sawada, Masanori Takehara, Satoshi Tamura, Satoru Hayamizu

    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)    2014  [Refereed]

     View Summary

    Voice Conversion (VC) is a technique to convert speech data of a source speaker into those of a target speaker. VC has been investigated and statistical VC is used for various purposes. Conventional VC uses acoustic features; however, audio-only VC suffers from degradation in noisy or real environments. This paper proposes an Audio-Visual VC (AVVC) method using not only audio features but also visual information, i.e. lip images. An eigenlip feature is employed in our scheme as the visual feature. We also propose a feature selection approach for audio-visual features. Experiments were conducted to evaluate our AVVC scheme in comparison with audio-only VC, using noisy data. The results show that AVVC can improve the performance even in noisy environments, by properly selecting audio and visual parameters. It is also found that visual-only VC is successful. Furthermore, it is observed that visual dynamic features are more effective than visual static information.

  • Data Collection for Mobile Audio-visual Speech Recognition in Various Environments

    Satoshi Tamura, Takumi Seko, Satoru Hayamizu

    2014 17TH ORIENTAL CHAPTER OF THE INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDIZATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (COCOSDA)    2014  [Refereed]

     View Summary

    This paper introduces our recent activities for audio-visual speech recognition on mobile devices and data collection in various environments. Audio-visual automatic speech recognition is effective in noisy or real conditions to enhance the robustness of speech recognizer and to improve the recognition accuracy. We have developed an audio-visual speech recognition interface for mobile devices. In order to evaluate the recognizer and investigate issues related to audio-visual processing on mobile computers, we collected speech data and lip images of 16 subjects in eight conditions, where there were various audio noises and visual difficulties. Audio-only speech recognition and visual-only lipreading were then conducted. Through these experiments, we found some issues and future works not only for construction of audio-visual database but also for robust audio-visual speech recognition.

  • Probabilistic expression of Polynomial Semantic Indexing and its application for classification

    Kentaro Minoura, Satoshi Tamura, Satoru Hayamizu

    PATTERN RECOGNITION LETTERS   34 ( 13 ) 1485 - 1489  2013.10  [Refereed]

     View Summary

    We propose a probabilistic expression of PSI (Polynomial Semantic Indexing). PSI is a model which represents a latent semantic space in the polynomial form of input vectors. PSI expresses high-order relationships between more than two vectors in the form of extended inner products. PSI employs a low-rank representation, which enables us to treat high-dimensional data without processes such as dimension reduction and feature extraction explicitly. Our proposed pPSI also has the same advantages as PSI. The contribution of this paper is (1) to formulate a probabilistic expression of PSI (pPSI), (2) to propose a pPSI-based classifier, and (3) to show the potential of the pPSI classifier. The stochastic gradient descent training algorithm for pPSI is introduced, saving memory use as well as computational costs. Furthermore, pPSI has the potential to reach a better solution compared to PSI. The proposed pPSI method can perform model-based training and adaptation, such as MAP (Maximum A Posteriori)-based estimation according to the amount of data. In order to evaluate pPSI and its classifier, we conducted three experiments with artificial data and music data, comparing with multi-class SVM and boosting classifiers. Through the experiments, it is shown that the proposed method is feasible, especially for the case of small dimensions of latent concept spaces. (c) 2013 Elsevier B.V. All rights reserved.

    DOI

    Scopus

  • Improvement of lip reading performance in real environments using speaker and environmental adaptation

    Takuya Kawasaki, Naoya Ukai, Seko Takumi, Satoshi Tamura, Satoru Hayamizu

    Proceedings - 2nd IAPR Asian Conference on Pattern Recognition, ACPR 2013     346 - 350  2013  [Refereed]

     View Summary

    Lip reading technologies play a great role not only in image pattern recognition, e.g. computer vision, but also in audio-visual pattern recognition, e.g. bimodal speech recognition. However, one problem is that the recognition accuracy is still significantly low compared to that of speech recognition. Another problem is the performance degradation that occurs in real environments. To improve the performance, in this paper we employ two adaptation schemes: speaker adaptation and environmental adaptation. The speaker adaptation is applied to recognition models so as to prevent the degradation caused by differences between speakers. The environmental adaptation is also conducted to deal with environmental differences. We tested these adaptation schemes using a real-world audio-visual corpus, CENSREC-2-AV, which we built; it contains real-world data (speech signals and lip images) recorded in a driving car, in which subjects uttered Japanese connected digits. Experimental results show that the lip reading recognition performance was largely improved by the speaker adaptation, and further recovered by the environmental adaptation. © 2013 IEEE.

    DOI

    Scopus

    2 Citations (Scopus)

  • An audio-visual in-car corpus "CENSREC-2-AV" for robust bimodal speech recognition

    Takuya Kawasaki, Naoya Ukai, Takumi Seko, Satoshi Tamura, Satoru Hayamizu, Chiyomi Miyajima, Norihide Kitaoka, Kazuya Takeda

    6th Biennial Workshop on DSP for In-Vehicle Systems and Safety 2013, DSP 2013    2013

     View Summary

    The purpose of this study is to build an evaluation framework for robust bimodal speech recognition in real environments, such as in-car conditions. Bimodal speech recognition using lip images has been studied to prevent the deterioration of speech recognition performance in noisy environments. Lip reading technologies using lip images play a great role in bimodal speech recognition. Therefore, for bimodal speech recognition, a database containing both speech signals and lip images is necessary to build a bimodal speech recognizer and to evaluate its performance. An evaluation framework for noisy bimodal speech recognition (CENSREC-1-AV) was constructed by Tamura et al.; a subject on a blue-screen background spoke Japanese connected digits in a quiet office environment. CENSREC-1-AV was recorded in the clean condition; on the other hand, a database recorded in real environments is required to evaluate a bimodal speech recognizer. Therefore, we have constructed a new audio-visual corpus, CENSREC-2-AV, recorded in in-car environments; a subject sitting in the driver's seat of a car uttered Japanese connected digits in various driving conditions: for example, a tunnel situation with music background noises, and driving on an expressway while the window is open. By using CENSREC-2-AV, it is possible to realize a robust bimodal speech recognition method even in real environments.

  • Confidence estimation and keyword extraction from speech recognition result based on Web information

    Hara Kensuke, Sekiya Hideki, Kawase Tetsuya, Tamura Satoshi, Hayamizu Satoru

    2013 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA)    2013  [Refereed]

     View Summary

    This paper proposes to use Web information for a confidence measure and for keyword extraction from speech recognition results. Spoken document processing has been attracting attention, particularly for information retrieval and video (audiovisual) content systems. For example, measuring a confidence score which indicates how likely a document or a segmented document is to include recognition errors has been studied. It is well known that keyword extraction from recognition results is also an important issue. For these purposes, in this paper, pointwise mutual information (PMI) between two words is employed. PMI has been used to calculate a confidence measure for speech recognition, as a coherence measure based on the co-occurrence of words. We propose to further improve the method by using a Web query expansion technique with term triplets which consist of nouns in the same document. We also apply PMI to keyword estimation by summing a co-occurrence score (sumPMI) between a targeted keyword candidate and each term. The proposed methods were tested with 10 lectures in the Corpus of Spontaneous Japanese (CSJ) and 2 simulated movie dialogues. In the experiments it is shown that the estimated confidence score has a strong relationship with recognition accuracy, indicating the effectiveness of our method. The sumPMI scores for keywords also have higher values in the subjective tests.

  • Measurement and analysis of speech data toward improving service in restaurant.

    Masanori Takehara, Satoshi Tamura, Satoru Hayamizu, Ryuhei Tenmoku, Array,Tomohiro Fukuhara, Takeshi Kurata

    2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, November 25-27, 2013     1 - 4  2013  [Refereed]

    DOI

    Scopus

    1 Citation (Scopus)

  • CENSREC-2-AV: An evaluation framework for bimodal speech recognition in real environments

    Naoya Ukai, Takuya Kawasaki, Satoshi Tamura, Satoru Hayamizu, Chiyomi Miyajima, Norihide Kitaoka, Kazuya Takeda

    Proceedings of the 2012 International Conference on Speech Database and Assessments, Oriental COCOSDA 2012     88 - 91  2012

     View Summary

    In this paper, we introduce a bimodal speech recognition corpus recorded in real environments. In recent years, speech recognition technology has been used in noisy conditions. Therefore, it becomes necessary to achieve higher recognition accuracy in real environments. As one of the solutions, bimodal speech recognition using audio and non-audio information is being studied. However, there are few databases which can be used to evaluate bimodal speech recognition in real environments. In this paper, we introduce CENSREC-2-AV, which we have been building, as a new bimodal speech recognition corpus. CENSREC-2-AV is one of the databases of the CENSREC project; we provided a similar corpus, CENSREC-1-AV, as a database for bimodal speech recognition with additive noises. In these corpora, there are speech data and lip images. Researchers can evaluate a bimodal speech recognition method built using CENSREC-1-AV, which consists of clean data, in real environments by using CENSREC-2-AV. © 2012 IEEE.

    DOI

    Scopus

    2 Citations (Scopus)

  • GIF-SP: GA-based Informative Feature for Noisy Speech Recognition

    Satoshi Tamura, Yoji Tagami, Satoru Hayamizu

    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)    2012  [Refereed]

     View Summary

    This paper proposes a novel discriminative feature extraction method. The method consists of two stages; in the first stage, a classifier is built for each class, which categorizes an input vector into a certain class or not. From all the parameters of the classifiers, a first transformation can be formed. In the second stage, another transformation that generates a feature vector is subsequently obtained to reduce the dimension and enhance recognition ability. These transformations are computed by applying a genetic algorithm. In order to evaluate the performance of the proposed feature, speech recognition experiments were conducted. Results in the clean training condition show that GIF greatly improves recognition accuracy compared to conventional MFCC in noisy environments. Multi-condition results also clarify that our proposed scheme is robust against differences in conditions.

  • Multi-stream acoustic model adaptation for noisy speech recognition

    Satoshi Tamura, Satoru Hayamizu

    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)    2012  [Refereed]

     View Summary

    In this paper, a multi-stream-based model adaptation method is proposed for speech recognition in noisy or real environments. The proposed scheme comes from our experience with audio-visual model adaptation. At first, an acoustic feature vector is divided into several vectors (e.g. static, first-order and second-order dynamic vectors), namely streams. During adaptation, a stream showing relatively high recognition performance is updated using that stream only. Alternatively, a stream having less recognition power is adapted using all the streams that are superior to it. In order to evaluate the proposed technique, recognition experiments were conducted using each stream, and adaptation experiments were also investigated for various combinations of streams.

  • Statistical Voice Conversion using GA-based Informative Feature

    Kohei Sawada, Yoji Tagami, Satoshi Tamura, Masanori Takehara, Satoru Hayamizu

    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)    2012  [Refereed]

     View Summary

    In order to make voice conversion (VC) robust to noise, we propose VC using a GA-based informative feature (GIF), by adding a GIF extraction process to conventional VC. GIF has been proposed as a feature that can be applied not only in pattern recognition but also in related tasks. In speech recognition, furthermore, GIF could improve recognition accuracy in noisy environments. We evaluated the performance of VC using spectral segmental features (the conventional method) and GIF, respectively. Objective experimental results indicate that in noisy environments, the proposed method was better than the conventional method. A subjective experiment was also conducted to compare the performances. These results show that applying GIF to VC is effective.

  • GIF-LR:GA-based Informative Feature for Lipreading

    Naoya Ukai, Takumi Seko, Satoshi Tamura, Satoru Hayamizu

    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)    2012  [Refereed]

     View Summary

    In this paper, we propose a general and discriminative feature "GIF" (GA-based Informative Feature), and apply the feature to lipreading (visual speech recognition). The feature extraction method consists of two transforms, that convert an input vector to GIF for recognition. The transforms can be computed using training data and Genetic Algorithm (GA). For lipreading, we extract a fundamental feature as an input vector from an image; the vector consists of intensity values at all the pixels in an input lip image, which are enumerated from left-top to right-bottom. Recognition experiments of continuous digit utterances were conducted using an audio-visual corpus including more than 268,000 lip images. The recognition results show that the GIF-based method is better than the baseline method using eigenlip features.

  • The role of speech technology in service-operation estimation

    Masanori Takehara, Satoshi Tamura, Ryuhei Tenmoku, Takeshi Kurata, Satoru Hayamizu

    2011 International Conference on Speech Database and Assessments, Oriental COCOSDA 2011 - Proceedings     116 - 119  2011  [Refereed]

     View Summary

    This paper introduces our recent effort to develop a Service-Operation Estimation (SOE) system using speech and multi-sensor data as well as other acquired data. In SOE, it is essential to analyze employees' data in order to increase productivity in many service industries. Speech processing techniques, such as voice activity detection and keyword-spotting recognition, help the analysis and enhance the precision of the results: the beginning and end times of speech regions are used to detect work events, and recognized keywords are used to conduct work estimation. In our system all the results are visualized in a 3D model, which helps employers and employees improve their operations. © 2011 IEEE.

    DOI

    Scopus

    6 Citations (Scopus)

  • Template-based Spectral Estimation Using Microphone Array for Speech Recognition

    Satoshi Tamura, Eriko Hishikawa, Wataru Taguchi, Satoru Hayamizu

    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4     2050 - +  2010  [Refereed]

     View Summary

    This paper proposes a Template-based Spectral Estimation (TSE) method for noise reduction in microphone-array processing, aiming at improved speech recognition. In the proposed method, a noise template in the complex plane is calculated for each frequency bin using non-speech audio signals observed at the microphones. Then, for noise-overlapped speech signals, the speech signal can be reformed by applying the template and the gradient descent method. Experiments were conducted to evaluate not only the performance of noise reduction but also the improvement of speech recognition. An NRR improvement of 16.7 dB was achieved by combining the TSE and Spectral Subtraction (SS) methods. For speech recognition, a 44% relative recognition error reduction was obtained compared with the conventional SS method.

  • A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection

    Satoshi Tamura, Masato Ishikawa, Takashi Hashiba, Shin'ichi Takeuchi, Satoru Hayamizu

    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4     2702 - +  2010  [Refereed]

     View Summary

    This paper proposes a novel speech recognition method combining Audio-Visual Voice Activity Detection (AVVAD) and Audio-Visual Automatic Speech Recognition (AVASR). AVASR has been developed to enhance the robustness of ASR in noisy environments, using visual information in addition to acoustic features. Similarly, AVVAD increases the precision of VAD in noisy conditions, which detects presence of speech from an audio signal. In our approach, AVVAD is conducted as a preprocessing followed by an AVASR system, making a significantly robust speech recognizer. To evaluate the proposed system, recognition experiments were conducted using noisy audio-visual data, testing several AVVAD approaches. Then it is found that the proposed AVASR system using the model-free feature-fusion AVVAD method outperforms not only non-VAD audio-only ASR but also conventional AVASR.

  • Automatic metadata generation and video editing based on speech and image recognition for medical education contents

    Satoshi Tamura, Koji Hashimoto, Jiong Zhu, Satoru Hayamizu, Hirotsugu Asai, Hideki Tanahashi, Makoto Kanagawa

    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5     2466 - +  2006  [Refereed]

     View Summary

    This paper reports a metadata generation system as well as an automatic video edit system. Metadata are information that describes other data. In the audio metadata generation system, speech recognition using a general language model (LM) and a specialized LM is performed on input speech in order to obtain segments (event groups) and audio metadata (event information), respectively. In the video edit system, visual metadata obtained by image recognition and audio metadata are combined into audio-visual metadata. Subsequently, multiple videos are edited into one video using the audio-visual metadata. Experiments were conducted to evaluate event detection of the systems using medical education contents, ACLS and BLS. The audio metadata system achieved about 78% event detection correctness. In the edit system, 87% event correctness was obtained with audio-visual metadata, and a survey showed that the edited video is appropriate and useful.


Books and Other Publications

  • 製造業のAI活用を支える統計的機械学習&深層学習

    ( Part: Sole author)

    日経BP社  2020.12

  • 事例+演習で学ぶ機械学習 : ビジネスを支えるデータ活用のしくみ

    速水 悟( Part: Sole author)

    森北出版  2016.04 ISBN: 9784627880214

Misc

  • 製造業におけるAI活用の拡大:現状と課題

    速水悟

    日本経営学会全国大会    2021.09

    Authorship: Lead author

    Research paper, summary (national, other academic conference)  

  • 初等教育におけるテキスト型プログラミング言語 Python によるプログラミング教育の効果検証

    朝日翔太, 高橋和之, 村山聡江, 寺田和憲, 加藤邦人, 山口忠, 今井亜湖, 速水悟

    日本教育工学会 第34回全国大会    2018.09

  • 音響信号処理による嚥下タイミング推定手法

    児玉千紗, 加藤邦人, 田村哲嗣, 速水悟

    計測自動制御学会ライフエンジニアリング部門,LE2017     139 - 142  2017.09

  • A study for the robustness of multi-modal voice conversion

    KAWASHIMA Daiki, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report. Speech   114 ( 411 ) 7 - 12  2015.01

     View Summary

    Voice Conversion (VC) is a technique to convert the speech of a source speaker into that of a target speaker. VC has the issue that the quality of converted speech degrades in noisy conditions. Multi-modal VC employing speech data as well as lip images has been proposed. An audio-visual feature is obtained by incorporating audio and visual features. In this paper, we propose a method to combine audio and visual features for noise-robust multi-modal VC. We evaluated our feature integration schemes using parallel data consisting of several speakers' utterances. Subjective and objective experiments were conducted in various acoustic and visual noisy environments. As a result, the quality of the conventional VC decreased, whereas our method improved the quality. It is found that visual dynamic information contributes to the improvement.

    CiNii

  • 呼吸音区間に対する喀痰検出システムと実環境における個人適応

    山下達也, 田村哲嗣, 速水悟, 林賢二, 西本裕

    電子情報通信学会論文誌D: 情報・システム   97 ( 12 ) 1831 - 1838  2014.12

  • 音声情報と位置情報を用いた従業員の接客作業分析とその活用

    竹原正矩, 野尻弘也, 田村哲嗣, 速水悟, 蔵田武志

    研究報告音声言語情報処理(SLP)   2014 ( 5 ) 1 - 6  2014.10

     View Summary

    To support work improvement and employee training in the service industry, better techniques for behavior measurement and data analysis are needed. This paper focuses on employees' customer-service work in a restaurant. Because customer-service work involves many opportunities for speech, analyzing audio information is expected to yield indicators about the service situation and the work. In addition, by detecting service operations such as taking orders and serving dishes, the waiting time after a customer's order can be estimated. We therefore propose a framework that combines audio and position information to detect employees' service operations, and examine its use for estimating customer waiting time. Finally, we discuss how the detection of service operations and the estimation of waiting time can be applied to other stores and services.

    CiNii

  • E-022 Model Adaptation Using Audio-Visual Interactivity for Multi-Modal Speech Recognition

    Kinuta Takuya, Tamura Satoshi, Hayamizu Satoru

      13 ( 2 ) 257 - 260  2014.08

    CiNii

  • E-021 Classification of Environmental Noises for Service Operation Estimation

    Uno Takuya, Takehara Masanori, Tamura Satoshi, Hayamizu Satoru, Kurata Takeshi

    情報科学技術フォーラム講演論文集   13 ( 2 ) 253 - 256  2014.08

    CiNii

  • A study on multi-modal speech recognition using depth images

    UKAI Naoya, TAMURA Satoshi, HAYAMIZU Satoru

    Technical report of IEICE. PRMU   113 ( 493 ) 179 - 184  2014.03

     View Summary

    This paper presents a novel framework which uses depth information of human face and mouth movements as yet another modality for audio-visual speech recognition. We propose "eigenlip" features obtained by principal component analysis of depth maps in order to make them more robust to sensor noise. We conducted digit speech recognition experiments by incorporating audio information with depth maps of the facial 3D shape in a multi-stream HMM (hidden Markov model). By comparing recognition with depth only against audio-depth information, we show an improvement of accuracy in noisy environments.

    CiNii

  • Application of multi-modal speech interface in real environments

    SEKO Takumi, KAWASAKI Takuya, TAMURA Satoshi, HAYAMIZU Satoru

    Technical report of IEICE. PRMU   113 ( 493 ) 185 - 190  2014.03

     View Summary

    This paper proposes a multi-modal speech interface for mobile devices such as smartphones, based on multi-modal speech recognition using speech waveforms and mouth image sequences. In our multi-modal speech interface, a server-client model is employed; voice activity detection and feature extraction for every modality are conducted on the mobile device, and multi-modal speech recognition is subsequently performed on a recognition server. In addition, a model adaptation technique is also utilized in our framework. Experiments were conducted using our multi-modal speech interface. Japanese connected-digit audio and visual data from 16 subjects were recorded in various real environments, e.g. office, outdoors, in-car, and station. In audio-only experiments, model adaptation successfully improved recognition performance in some conditions; on the other hand, some issues were found in heavily noisy conditions. In visual-only experiments, we achieved performance improvements in lipreading by using speech recognition results and applying the model adaptation technique. We also investigate and discuss some issues concerning voice activity detection, visual features, and audio-visual integration.

    CiNii

  • D-14-4 DEVELOPMENT OF MULTI-MODAL SPEECH INTERFACE

    Tamura Satoshi, Seko Takumi, Hayamizu Satoru

    Proceedings of the IEICE General Conference   2014 ( 1 ) 131 - 131  2014.03

    CiNii

  • 接客時間推定に向けた従業員の位置・音声データによる発話クラスタリング(音声対話・合成,第15回音声言語シンポジウム)

    川瀬 徹也, 竹原 正矩, 田村 哲嗣, 天目 隆平, 蔵田 武志, 速水 悟

    電子情報通信学会技術研究報告. SP, 音声   113 ( 366 ) 89 - 95  2013.12

     View Summary

    We are studying utterance clustering for speech recorded in a restaurant. The utterance data include conversations among employees and conversations with customers; by clustering the target speakers, we expect to estimate work-related indicators such as an employee's customer-serving time. In this report we examine three-class utterance clustering: the microphone wearer, other employees, and customers. However, classification accuracy can degrade because the speakers are unspecified and because of noise. We therefore attempted to improve accuracy by integrating the employees' position information with the sound data. We also discuss the generality of the proposed method when applied to settings other than restaurants.

    CiNii

  • 接客時間推定に向けた従業員の位置・音声データによる発話クラスタリング

    川瀬徹也, 竹原正矩, 田村哲嗣, 天目隆平, 蔵田武志, 速水悟

    研究報告音声言語情報処理(SLP)   2013 ( 15 ) 1 - 7  2013.12

     View Summary

    We are studying utterance clustering for speech recorded in a restaurant. The utterance data include conversations among employees and conversations with customers; by clustering the target speakers, we expect to estimate work-related indicators such as an employee's customer-serving time. In this report we examine three-class utterance clustering: the microphone wearer, other employees, and customers. However, classification accuracy can degrade because the speakers are unspecified and because of noise. We therefore attempted to improve accuracy by integrating the employees' position information with the sound data. We also discuss the generality of the proposed method when applied to settings other than restaurants.

    CiNii

  • H-035 Toward Lead Time Estimation in a Japanese Restaurant using Position, Voice and POS Data

    Nojiri Hiroya, Takehara Masanori, Maeyama Kento, Tamura Satoshi, Kurata Takeshi, Hayamizu Satoru

    情報科学技術フォーラム講演論文集   12 ( 3 ) 171 - 174  2013.08

    CiNii

  • Comparison of classification methods for multi-modal voice activity detection

    OKUDA Hiroya, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report. Speech   112 ( 450 ) 31 - 32  2013.02

     View Summary

    Automatic Speech Recognition (ASR) technology has been developed and used in various situations, such as car navigation systems. Voice Activity Detection(VAD) is often used as a preprocessing for ASR in noisy environments. Multi-modal VAD using lip images has also been investigated. Hidden Markov Models (HMM), Support Vector Machine (SVM) and AdaBoost have been used in VAD. In this paper we compare, investigate and discuss these classification methods and audio-visual integration in multi-modal VAD.

    CiNii

  • Recent efforts for high-performance multi-modal speech recognition

    TAMURA Satoshi, SHEN Peng, OKUDA Hiroya, UKAI Naoya, KAWASAKI Takuya, SEKO Takumi, HAYAMIZU Satoru

    IEICE technical report. Speech   112 ( 369 ) 41 - 46  2012.12

     View Summary

    Regarding Multi-Modal Automatic Speech Recognition (MMASR) which uses acoustic and lip/mouth information, this paper describes recent efforts for high-performance real-time MMASR. At first, technical overviews as well as past works for fundamental technologies in MMASR, e.g. visual feature extraction and multi-modal voice activity detection, are introduced in order to discuss their technical issues. Our related works are also summarized. According to the discussion, we investigate speed-up methods for high-performance real-time MMASR, and build an MMASR system using the methods. Details of our system are then reported, and discussion as well as future works are finally described.

    CiNii

  • E-026 Noise Robust Voice Conversion using GA-based Informative Feature

    Sawada Kohei, Tagami Yoji, Tamura Satoshi, Takehara Masanori, Hayamizu Satoru

      11 ( 2 ) 217 - 218  2012.09

    CiNii

  • E-027 Voice Activity Detection using GA-based Informative Feature

    Okuda Hiroya, Tamura Satoshi, Hayamizu Satoru

      11 ( 2 ) 219 - 220  2012.09

    CiNii

  • Acoustic model adaptation choosing static and dynamic streams in noisy environments

    TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report. Speech   112 ( 141 ) 33 - 38  2012.07

     View Summary

    In this paper, an acoustic model adaptation method based on multiple streams is proposed for speech recognition in noisy or real environments. At first, an acoustic feature vector is divided into several vectors (e.g. static, first-order and second-order dynamic vectors), namely streams. Second, the order of the streams is determined according to the accuracies of pre-recognition results. During adaptation, the stream achieving the highest recognition performance is updated using that stream only. Alternatively, a stream is adapted using all the streams that are superior to it. In order to evaluate the proposed technique, recognition and adaptation experiments were conducted using the corpus CENSREC-1. Pre-recognition results show that dynamic features are more robust to noise than static parameters. Through the adaptation experiments it is found that our proposed method achieved the best performance compared to the conventional acoustic feature and its streams. These results show the effectiveness of our proposed adaptation scheme.

    CiNii

  • GIF-SP : Improvement of speech recognition using general and discriminative features

    TAMURA Satoshi, TAGAMI Yoji, HAYAMIZU Satoru

    IEICE technical report. Natural language understanding and models of communication   111 ( 364 ) 119 - 124  2011.12

     View Summary

    This paper proposes a general and discriminative feature "GIF". The feature extraction method proposed in this paper consists of two transforms from an input vector to an output feature for recognition, via an intermediate vector; the first transform is derived from binary classifiers for each class, and the second transform is obtained so as to maximize a variance of projected values, with orthogonalization and dimension reduction. These transforms can be computed using training data by the genetic algorithm. Recognition experiments were conducted using the evaluation corpus for speech recognition. The proposed feature achieved drastic improvements compared with conventional features, then the effectiveness of the proposed method is clarified.

    CiNii

  • Information Processing of Lung Sounds and its Application

    HAYAMIZU Satoru, TAMURA Satoshi

      60 ( 12 ) 706 - 712  2011.12

    CiNii

  • K-062 Service-Operation Estimation Based on Multi-sensor and POS data in the Japanese Restaurant Industry

    Tenmoku Ryuhei, Ueoka Ryoko, Makita Koji, Shinmura Takeshi, Takehara Masanori, Hayamizu Satoru, Kurata Takeshi

      10 ( 3 ) 859 - 860  2011.09

    CiNii

  • RO-008 Kensaku Shimbun : User adaptation using microblogs for news-paper-style retrieval system

    Sekiya Hideki, Sobue Sho, Tamura Satoshi, Hayamizu Satoru

      10 ( 4 ) 141 - 146  2011.09

    CiNii

  • Model Adaptation using Audio-visual Interaction for Multi-modal Speech Recognition

    OONISHI Masanao, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   111 ( 97 ) 17 - 22  2011.06

     View Summary

    This paper investigates a linear-regressive model adaptation method, i.e. MLLR (Maximum Likelihood Linear Regression), for multi-modal speech recognition focusing on audio-visual interaction, e.g. inter-modal influences. In the multi-modal adaptation, inter-modal information may contribute the performance of speech recognition. The influence and advantage of inter-modal elements, therefore, should be examined. Recognition experiments were conducted to evaluate several MLLR transformation matrices including or excluding inter-modal and intra-modal elements, using noisy data in an audio-visual corpus. From the experimental results, the importance of effective use of audio-visual interaction is clarified.

    CiNii

  • Decision Fusion using Boosting Method for Multi-Modal Voice Activity Detection

    TAKEUCHI Shin'ichi, HASHIBA Takashi, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   110 ( 81 ) 25 - 30  2010.06

     View Summary

    In this paper, we propose a multi-modal voice activity detection (VAD) system that uses audio and visual information. In multi-modal (speech) signal processing, there are two methods for fusing the audio and the visual information: concatenating the audio and visual features, or employing audio-only and visual-only classifiers and then fusing the unimodal decisions. We investigate the effectiveness of decision fusion based on the results from AdaBoost. AdaBoost is a machine learning method in which an effective classifier is constructed by combining weak classifiers; it classifies input data into two classes based on the weighted results from the weak classifiers. In the proposed method, this fusion scheme is applied to the decision fusion of multi-modal VAD. Experimental results show the proposed method to generally be more effective.

    CiNii

  • Human Activity Recognition Based on Acceleration Information

    TAKEUCHI Shinichi, ITOU Shinya, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   108 ( 453 ) 229 - 234  2009.02

     View Summary

    In this paper, we study a human activity recognition method based on acceleration information using hidden Markov models, for a support system for activities of daily living (ADL). Acceleration information is observed using a single tri-axial acceleration sensor placed on the subject's waist. As feature parameters, we compared the time series of the accelerometer signal and frequency-domain parameters which are obtained by speech signal processing. We use MFCC (Mel-Frequency Cepstral Coefficients), which emphasize low frequencies, and angular information. We also investigate the optimal frequency response for MFCC. The experimental results show that the combination of these parameters brings a better recognition rate: the recognition correct rate increases by 16.66 points from 71.3% to 87.96%, and the recognition accuracy increases by 20.75 points from 58.88% to 79.63%.

    CiNii

  • Synchronization of speech and image channels in multimodal speech recognition

    TAMURA Satoshi, ISHIKAWA Masato, HAYAMIZU Satoru

    IEICE technical report   108 ( 312 ) 1 - 6  2008.11

     View Summary

    Multimodal speech recognition, which uses acoustic and visual information, is one of the speech recognition methods robust against various noises. In most methods, acoustic and visual features are concatenated into audio-visual features, and then speech recognition is conducted using multi-stream HMMs. Acoustic and visual features are computed from speech signals and image sequences respectively, where the sampling or frame rate of visual features is normally lower than that of acoustic features. Furthermore, a misalignment between acoustic and visual vectors sometimes occurs due to the audio/image input devices. These phenomena might cause a degradation of the performance of multimodal speech recognition. This paper investigates and discusses the effect of these phenomena through recognition experiments.

    CiNii

  • Improvement of multimodal speech recognition by normalizing visual features

    ISHIKAWA Masato, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   108 ( 312 ) 7 - 12  2008.11

     View Summary

    Multimodal speech recognition (MMASR), which uses speech and lip images, has been developed as automatic speech recognition (ASR) robust against various noises. Visual features, such as optical-flow parameters or principal component analysis (PCA) coefficients, play a great role in MMASR, and their effectiveness is proven through experimental results. It is crucial for recognition accuracy not only which visual information is adopted but also how feature orthogonalization and normalization are applied. This paper compares conventional normalization methods for visual features and their performances; extracted visual features are converted into uncorrelated parameters using singular value decomposition or PCA, and using these features the recognition accuracy is improved.

    CiNii

  • Human motion detection with 3D-accelerometers using statistical voice activity detection method

    ITO Shinya, ASANO Shou, TAMURA Satoshi, HAYAMIZU Satoru

      70   225 - 226  2008.03

    CiNii

  • Multimodal speech recognition using audio and visual confusion networks

    KAMISAWA Tai, ISHIKAWA Masato, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   107 ( 356 ) 37 - 42  2007.11

     View Summary

    In multimodal speech recognition, hypotheses from the speech and visual recognizers are usually integrated afterwards, when both recognition processes have finished. As speech recognition and visual recognition are done separately, the intermediate representation of hypotheses for audio (speech) and visual information is a very important issue. Recently, confusion networks (CN) have been used as an intermediate representation of hypotheses in speech recognition. In addition, confusion network combination (CNC), which integrates multiple confusion networks, has been proposed as a method to integrate hypotheses derived from multiple recognition processes. Integration by CNC produces better recognition performance when each recognition process has different properties in its recognition errors. As multimodal speech recognition integrates audio and visual recognition processes, it is expected that CNC will improve recognition performance. Therefore, in this paper, audio and visual recognition results were integrated by CNC and applied to multimodal speech recognition in noisy environments to improve recognition performance. Two methods for combining CNs are described. The relationship between confidence scores and recognition correctness is discussed.

    CiNii

  • Multimodal Supporting System for Medical Information

    HAYAMIZU Satoru

      8 ( 3 ) 136 - 137  2006.12

    CiNii

  • Automatic correction of misidentification with application of digital pen character recognition system in home nursing

    SAWADA Go, HAYASHI Yujiro, TAMURA Satoshi, HAYAMIZU Satoru

    IEICE technical report   105 ( 594 ) 43 - 48  2006.02

     View Summary

    We are developing a home-nursing support system which reports a patient's condition, written by the family using a digital pen, to the visiting nurse and hospital. To enhance the system, this paper proposes an automatic correction method for misidentifications in handwritten character recognition. The patient's family can describe the patient's condition at any time by using a digital pen, which converts handwritten analog information into digital data. In the proposed method, two strategies are used for character correction: (1) using the history of character corrections, and (2) using morphological analysis to identify and correct misrecognized words. The recognition accuracy was improved by this function.

    CiNii

  • Note-taking support for nurses using digital pen character recognition system

    Yujiro Hayashi, Satoshi Tamura, Satoru Hayamizu, Yutaka Nishimoto

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   4270 LNCS   428 - 436  2006

     View Summary

    This study presents a novel system which supports nurses in note-taking by providing a digital pen and character recognition system, with emphasis on the user interface. The system applies the characteristics of a digital pen to improve the efficiency of tasks related to nursing records. The system aims at improving the efficiency of nursing activities and reducing the time spent on tasks for nursing records. In our system, first, notes are written on a check sheet using a digital pen, along with a voice that is recorded on a voice recorder; the pen and voice data are transferred to a PC. The pen data are then recognized automatically as characters, which can be viewed and manipulated with the application. We conducted an evaluation experiment to improve the efficiency and operation of the system, and its interface. The evaluation and test operations used 10 test subjects. Based on the test operation and the evaluation experiment of the system, it turned out that improvement for urgent situations, enhancement of portability, and further use of character recognition are required. © Springer-Verlag Berlin Heidelberg 2006.

  • Utterance Analysis in Medical Cases for Spoken Dialog System

    Kiyoshi Naganuma, Satoru Hayamizu, Yuzo Takahashi, Yutaka Nishimoto, Yoshimi Matsuda, Yukiko Takahashi

    Proceedings of VSMM2004   10   954 - 961  2004.11

     View Summary

    Aiming at acquiring medical and nursing information by voice, we analyzed utterances from medical interviews and patient information. For conversations between physicians and patients with fever, abdominal pain, or asthma, conversations between patients and nurses during vital-sign measurement, and conversations between physicians and nurses, we labeled words and phrases, analyzed their frequencies, and created a dictionary for automatic analysis. Based on this, we verified comprehension by computer for read-aloud scripts and for spontaneous speech. For read speech, 83.3% could be understood, whereas for spontaneous speech the rate was below 60%. Because the conversation content includes specialized material such as medical terms and abbreviations, information for parsing the structure of medical sentences is needed. In the future, we would like to process this information automatically.

  • Intellectual Resources for Research and Development

    HAYAMIZU Satoru (National Institute of Advanced Industrial Science and Technology (AIST))

    Journal of Japanese Society for Artificial Intelligence   17 ( 2 ) 167 - 170  2002.03

    CiNii

  • Detection of unknown words in large vocabulary speech recognition

    Hayamizu Satoru, Itou Katunobu, Tanaka Kazuyo

    Journal of the Acoustical Society of Japan (E)   16 ( 3 ) 165 - 171  1995

     View Summary

    This paper describes the relation between vocabulary size and detection errors of unknown words in large vocabulary speech recognition, through recognition and detection experiments. Although the relation between vocabulary size and recognition performance has been reported, the relation between vocabulary size and detection performance has not yet been studied, especially for vocabulary sizes of over 1,000 words. Experiments were conducted using the speech material of speaker MAU in the ATR word speech database. The dictionary used contains 40,000 words from the Shinmeikai Japanese Language Dictionary. It is shown that when the vocabulary size increases from 1,000 words to 40,000 words, the relation between vocabulary size and detection errors has a tendency similar to the relation between vocabulary size and recognition errors. Increases in detection errors caused by increases in vocabulary size are shown to be small for within-vocabulary words, compared with increases in detection errors for out-of-vocabulary words. These results should be taken into account in designing large vocabulary speech recognition systems that include unknown word processing.

    CiNii


 

Teaching Experience

  • Introduction to Information Processing (情報処理入門)

    Faculty of Engineering (day program)  

    2015.10
    -
    Now
     

  • Advanced Topics in Social Innovation (ソーシャルイノベーション特論)

    Graduate School of Engineering (doctoral program)  

    2017.10
    -
    2021.02
     

  • Machine Learning

    2017.04
    -
    2020.09
     

  • Introduction to Data Science

    2017.10
    -
    2020.02
     

  • Introduction to Management of Technology (技術経営概論)

    Faculty of Engineering  

    2016.10
    -
    2020.02
     

  • Media Content Theory (メディアコンテント論)

    Graduate School of Engineering  


 

Sub-affiliation

  • Faculty of Science and Engineering   Graduate School of Fundamental Science and Engineering