Researcher Details - Satoru Hayamizu (速水 悟)

Satoru Hayamizu (速水 悟)

Scopus Publication Information

Papers: 126 Citations: 1212 h-index: 18

The data were downloaded from the Scopus API on November 25, 2025 (via http://api.elsevier.com and http://www.scopus.com).

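Affiliation

Research Organizations, Green Computing Systems Research Organization

Job Title

Senior Researcher (Professor)

Degree

Doctor of Engineering (Graduate School of Engineering, The University of Tokyo)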

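Profile

I continued to attend international conferences in 2024: CVPR 2024 in Seattle in June and KDD 2024 in Barcelona in August. I think 2024 was a year in which applications expanded. In August, I published a book for the manufacturing industry.

AI in manufacturing has broad potential, from technology development and design support to safety and quality control. Beyond making operations more efficient, updating and using domain knowledge is important. The key to putting AI to use is developing highly skilled people.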

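Career

April 2021 - present: Waseda University, Green Computing Systems Research Organization, Senior Researcher (Professor)

April 2002 - (end date not given): Gifu University, Faculty of Engineering, Professor

April 2001 - March 2002: National Institute of Advanced Industrial Science and Technology (Independent Administrative Institution)

April 1981 - March 2001: Electrotechnical Laboratory, Agency of Industrial Science and Technology, Ministry of International Trade and Industry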

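Education

April 1979 - March 1981: The University of Tokyo, Graduate School of Engineering, Department of Mechanical Engineering

April 1974 - March 1978: The University of Tokyo, Faculty of Engineering, Department of Industrial Mechanical Engineering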

Research Areas

Intelligent informatics

Research Keywords

Media informatics
Social entrepreneur

Papers

Few-Shot Multi-Label Annotation of Causes for Incident Texts Using Large Language Models

Manato Nakamura, Kazunori Terada, Satoru Hayamizu, Hattori Masanori, Takafumi Fuseya, Hidetoshi Iwamatsu

Proceedings of the 28th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2024) September 2024 [Refereed]

Scopus citations: 2

Anomalous sound detection based on attention mechanism

Hayato Mori, Satoshi Tamura, Satoru Hayamizu

Proceedings of EUSIPCO 581 - 585 August 2021 [Refereed]

Author role: Last author

Scopus citations: 5

Multi-angle lipreading with angle classification-based feature extraction and its application to audio-visual speech recognition

Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh, Masaki Nose

Future Internet 13 ( 7 ) July 2021

Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to development in deep learning. Most VSR research works focus only on frontal face images. However, assuming real scenes, it is obvious that a VSR system should correctly recognize spoken contents from not only frontal but also diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces taken at any angle. Firstly, view classification is carried out to estimate face angles. Based on the results, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the features. We also developed audio-visual speech recognition (AVSR) using the VSR in addition to conventional ASR. Audio results were obtained from ASR, followed by incorporating audio and visual results in a decision fusion manner. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We then confirmed that our approach achieved the best performance among conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than ASR and VSR results.

Scopus citations: 11

Combination of temporal and spatial denoising methods for cine MRI

Tsubasa Maeda, Satoshi Tamura, Satoru Hayamizu, Keigo Kawaji

LifeTech 2021 - 2021 IEEE 3rd Global Conference on Life Sciences and Technologies 44 - 47 March 2021

In this paper, we propose a denoising method for cine MRI acquired by MoPS. MoPS-based cine MRI has a high FPS but contains reconstruction noise. DISPEL, a conventional method, performs denoising in the temporal domain. A neural network is further introduced to remove spatial noise. Unlike most such methods, which require both noisy and clean images, we choose an unsupervised scheme, N2N. We combine these two methods to perform temporal and spatial denoising for cine MRI. Experimental results show that the proposed method is able to remove noise from cine MRIs acquired by MoPS without removing tissue signal.

Scopus citations: 1

Speech recognition using deep canonical correlation analysis in noisy environments

Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu

ICPRAM 2021 - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods 63 - 70 2021

In this paper, we propose a method to improve the accuracy of speech recognition in noisy environments by utilizing Deep Canonical Correlation Analysis (DCCA). DCCA generates projections from two modalities into one common space, so that the correlation of the projected vectors is maximized. Our idea is to employ DCCA techniques with audio and visual modalities to enhance the robustness of Automatic Speech Recognition (ASR): A) noisy audio features can be recovered by clean visual features, and B) an ASR model can be trained using audio and visual features, as data augmentation. We evaluated our method using an audiovisual corpus CENSREC-1-AV and a noise database DEMAND. Compared to conventional ASR and feature-fusion-based audio-visual speech recognition, our DCCA-based recognizers achieved better performance. In addition, experimental results show that utilizing DCCA enables us to get better results in various noisy environments, thanks to the visual modality. Furthermore, it is found that DCCA can be used as a data augmentation scheme if only a small amount of training data is available, by incorporating visual DCCA features to build an audio-only ASR model, in addition to audio DCCA features.

Using Deep-Learning Approach to Detect Anomalous Vibrations of Press Working Machine

Kazuya Inagaki, Satoru Hayamizu, Satoshi Tamura

Conference Proceedings of the Society for Experimental Mechanics Series 229 - 232 2021

In recent years, there has been a demand for advanced maintenance in factories. Data collection from factory equipment is being carried out, and the collected sensor data is widely used for statistical analysis in quality control and failure prediction by machine learning. For example, if it is possible to detect an abnormality using vibration data obtained from equipment, an increase in the operation rate of the plant can be expected. In this research, we aim at early detection of equipment failure by finding signs of abnormality from vibration data, using a deep-learning technique, particularly an autoencoder. In this paper, the following two methods were tested. The first scheme is based on the reconstruction error in an autoencoder. An autoencoder is trained using normal data only. Looking at the difference between input data and reconstructed data, we can regard data with a larger difference as abnormal. In the second approach, given the input data, values of the middle layer of the autoencoder are extracted, and we calculate the degree of abnormality using a Gaussian Mixture Model (GMM), which represents a data set by a superposition of Gaussian distributions. In this framework, regarding the autoencoder structure, we tested both fully connected networks and convolutional networks. In this work, we chose a press machine. Frequency characteristics were acquired from the data in production mode of the press machine. Then, using each method, we evaluated whether an abnormality could be found by calculating the degree of abnormality. We employed two days of data without failure as training data, and another data set, obtained on the following days, was prepared as forecast data; on one of those days the machine stopped due to a sudden abnormality. Similar to time-series signal processing, we applied framing so that we can analyze data even when only a small amount is available. As a result, our method succeeded in finding the day when the abnormality occurred and the machine stopped. In addition, the degree of abnormality became higher before the abnormality occurred, indicating that we can detect signs of abnormality. In conclusion, the degree of abnormality could be calculated from vibration data during production using both the reconstruction error of an autoencoder and the GMM applied to the middle layer of the autoencoder. We consequently conclude that it is possible to detect a sudden abnormality in which the device stopped from actual vibration data. These results provide new solutions for equipment failure estimation.

Scopus citations: 2

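Proposal of a Failure Prediction Method for Factory Equipment from Vibration Data Using a Recurrent Autoencoder (再帰型オートエンコーダを用いた振動データによる工場設備の故障予測手法の提案)

Shota Asahi, Ayaka Matsui, Satoshi Tamura, Satoru Hayamizu, Ryosuke Isashi, Akira Furukawa, Takayoshi Naitou

The Japan Society of Mechanical Engineers 86 ( 891 ) 20 - 00020 October 2020 [Refereed]
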
Anomaly Detection in Mechanical Vibration Using Combination of Signal Processing and Autoencoder

Ayaka Matsui, Shota Asahi, Satoshi Tamura, Satoru Hayamizu, Ryosuke Isashi, Akira Furukawa, Takayoshi Naitou

Journal of Signal Processing 24 ( 4 ) 203 - 206 July 2020 [Refereed]

Multi-angle lipreading using angle classification and angle-specific feature integration.

Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh, Masaki Nose

International Conference on Communications, Signal Processing, and their Applications (ICCSPA) 1 - 5 2020

Scopus citations: 4

Toward a High Performance Piano Practice Support System for Beginners

Shota Asahi, Satoshi Tamura, Yuko Sugiyama, Satoru Hayamizu

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2018 - Proceedings 73 - 79 March 2019

In piano learning, it is difficult, especially for beginners, to judge by themselves whether their musical performances are appropriate in terms of rhythm and melody. Therefore, we have been developing a piano practice support system, which enables piano beginners to practice independently without their instructors. In this paper, we propose a system that uses a deep learning technique: Long Short-Term Memory (LSTM). Our system accepts raw piano sounds and extracts performance information. From this information, we evaluate the performance. We evaluated the scheme using actual beginners' performances, and found that the proposed system performed better than previous conventional methods. This paper also presents an application employing our methods. Subjective evaluation experiments for the proposed application showed that almost all the beginners found points to reflect on, and that they maintained their motivation for independent practice.

Scopus citations: 5

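Swallowing Function Evaluation Using Acoustic Signal Processing and 3-IR Photometric Stereo (音響信号処理と3-IR照度差ステレオ法による嚥下機能評価)

Chisa Kodama, Kunihito Kato, Satoshi Tamura, Satoru Hayamizu

The IEICE Transactions (Japanese Edition) J102-D ( 3 ) 173 - 184 March 2019 [Refereed]
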
Semantic Segmentation of Paved Road and Pothole Image Using U-Net Architecture

2019 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA). IEEE 2019 [Refereed]

Classification of Paved and Unpaved Road Image Using Convolutional Neural Network for Road Condition Inspection System

Vosco Pereira, Satoshi Tamura, Satoru Hayamizu, Hidekazu Fukai

ICAICTA 2018 - 5th International Conference on Advanced Informatics: Concepts Theory and Applications 165 - 169 November 2018

Image processing techniques have been actively used for research on road condition inspection and have achieved high detection accuracy. Many studies focus on the detection of cracks and potholes in roads. However, in some least developed countries, some stretches of road are still unpaved, and this has escaped the attention of researchers. Inspired by the spread and success of deep learning techniques in computer vision and other fields, and by the availability of various types of smartphone devices, we propose a low-cost method for classifying paved and unpaved road images using a convolutional neural network (CNN). Our model is trained with 13.186 images and validated with 3.186 images, which were collected using a smartphone in various road conditions such as wet, muddy, dry, dusty and shady, and with different types of road surface such as ground, rocks and sand. The experiment using 500 new testing images showed that our model can achieve high Precision (98.0%), Recall (98.4%) and F1-Score (98.2%) simultaneously.

Scopus citations: 26

A Deep Learning-Based Approach for Road Pothole Detection in Timor Leste

Vosco Pereira, Satoshi Tamura, Satoru Hayamizu, Hidekazu Fukai

Proceedings of the 2018 IEEE International Conference on Service Operations and Logistics, and Informatics, SOLI 2018 279 - 284 September 2018

This research proposes a low-cost solution for detecting road potholes in images using a convolutional neural network (CNN). Our model is trained entirely on images which were collected from several different places and vary in conditions such as wet, dry and shady. The experiment using the 500 testing images showed that our model can achieve Accuracy (99.80%), Precision (100%), Recall (99.60%), and F-Measure (99.60%) simultaneously.

Scopus citations: 74

Swallowing function evaluation using deep-learning-based acoustic signal processing

Chisa Kodama, Kunihito Kato, Satoshi Tamura, Satoru Hayamizu

APSIPA ASC 2017 961 - 964 December 2017 [Refereed]

Toward effective noise reduction for sub-Nyquist high-frame-rate MRI techniques with deep learning

Yudai Suzuki, Keigo Kawaji, Amit R. Patel, Satoshi Tamura, Satoru Hayamizu

APSIPA ASC 2017 1136 - 1139 December 2017 [Refereed]

Scopus citations: 1

Development of audio-visual speech corpus toward speaker-independent Japanese LVCSR

Kazuto Ukai, Satoshi Tamura, Satoru Hayamizu

2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016 12 - 15 May 2017 [Refereed]

In the speech recognition literature, building corpora for Large Vocabulary Continuous Speech Recognition (LVCSR) is quite important. In addition, in order to overcome the performance decrease caused by noise, using visual information such as lip images is effective. In this paper, therefore, we focus on collecting speech and lip-image data for audio-visual LVCSR. Audio-visual speech data were obtained from 12 speakers, each of whom uttered the ATR503 phonetically-balanced sentences. These data were recorded in acoustically and visually clean environments. Using the data, we conducted recognition experiments. Mel Frequency Cepstral Coefficients (MFCCs) and eigenlip features were obtained, and multi-stream Hidden Markov Models (HMMs) were built. We compared the performance in the clean condition to that in noisy environments. It is found that visual information is able to compensate for the performance loss. In addition, it turns out that we should improve visual speech recognition for high-performance audio-visual LVCSR.

Scopus citations: 1

Toward Building Speech Databases in Timor Leste

Borja L.C, Patrocinio Antonino, Satoshi Tamura, Hidekazu Fukai, Satoru Hayamizu

The 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment 2017 [Refereed]

Investigation of DNN-based audio-visual speech recognition

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

IEICE Transactions on Information and Systems E99D ( 10 ) 2444 - 2451 October 2016

© 2016 The Institute of Electronics, Information and Communication Engineers. Audio-Visual Speech Recognition (AVSR) is one of the techniques to enhance the robustness of speech recognizers in noisy or real environments. On the other hand, Deep Neural Networks (DNNs) have recently attracted a lot of attention from researchers in the speech recognition field, because we can drastically improve recognition performance by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach an emission probability on each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into a feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how we should incorporate audio and visual modalities using DNNs. We carried out recognition experiments using the corpus CENSREC-1-AV, and we discuss the results to find the best DNN-based AVSR modeling. It turns out that a tandem-based method using audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.

Scopus citations: 5

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 575 - 582 February 2016

© 2015 Asia-Pacific Signal and Information Processing Association. This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection (VAD) in a visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features by deep learning technology. By using the proposed features, we successfully achieved 73.66% lipreading accuracy in the speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. It is found that VAD is useful in both audio and visual modalities, for better lipreading and AVSR.

Scopus citations: 47

Audio-visual processing toward robust speech recognition in cars

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

7th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems and Safety 2015 31 - 34 2015

This paper reports our recent efforts to develop robust speech recognition in cars. Speech recognition is expected to handle many devices in cars. However, many kinds of acoustic noise, e.g. engine noise and car stereo, are observed in in-car environments, which degrades speech recognition performance. In order to overcome the degradation, we develop a high-performance audio-visual speech recognition method. Lip images are obtained from captured face images using our face detection scheme. Some basic visual features are computed, then converted into visual features for speech recognition using a deep neural network. Audio features are obtained as well. Audio and visual features are subsequently concatenated into audio-visual features. As a recognition model, a multi-stream hidden Markov model is employed, which can adjust the contributions of the audio and visual modalities. We evaluated our proposed method using an audio-visual corpus CENSREC-1-AV. In order to simulate driving conditions, we prepared driving and music noises. Experimental results show that our method can significantly improve recognition performance in in-car conditions.

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu

2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA) 575 - 582 2015

MULTI-MODAL SERVICE OPERATION ESTIMATION USING DNN-BASED ACOUSTIC BAG-OF-FEATURES

Satoshi Tamura, Takuya Uno, Masanori Takehara, Satoru Hayamizu, Takeshi Kurata

2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) 2291 - 2295 2015 [Refereed]

In service engineering it is important to estimate when and what a worker did, because this information includes crucial evidence for improving service quality and working environments. For Service Operation Estimation (SOE), acoustic information is one of the useful and key modalities; in particular, environmental or background sounds include effective cues. This paper focuses on two aspects: (1) extracting powerful and robust acoustic features by using stacked-denoising-autoencoder and bag-of-features techniques, and (2) investigating a multi-modal SOE scheme by combining the audio features with other sensor data as well as non-sensor information. We conducted evaluation experiments using multi-modal data recorded in a restaurant. We improved SOE performance in comparison to conventional acoustic features, and the effectiveness of our multi-modal SOE scheme is also clarified.

IMPROVEMENT OF UTTERANCE CLUSTERING BY USING EMPLOYEES' SOUND AND AREA DATA

Tetsuya Kawase, Masanori Takehara, Satoshi Tamura, Satoru Hayamizu, Ryuhei Tenmoku, Takeshi Kurata

2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) 3047 - 3051 2014 [Refereed]

In this paper, we propose to use staying area data toward the estimation of serving time for customers. Classifying utterances enables us to estimate conversation types between speakers. However, classification performance becomes lower in real environments. We propose a method using area data together with sound data to solve this problem. We also propose a method to estimate the conversation types using decision trees. They were tested with data recorded in a Japanese restaurant. In the experiment to classify utterances, the proposed method performed better than the method using only sound data. In the experiment to estimate the conversation types, we succeeded in recovering 70% of the mis-classified conversations using both sound and area data.

Scopus citations: 1

Analysis of Customer Communication by Employee in Restaurant and Lead Time Estimation

Masanori Takehara, Hiroya Nojiri, Satoshi Tamura, Satoru Hayamizu, Takeshi Kurata

2014 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA) 1 - 5 2014 [Refereed]

Human behavior sensing and analysis play a great role in improving service quality and employee education. This paper presents novel frameworks for detecting customer communication and for lead time estimation (LTE) using multi-sensor data, sound data and accounting data in a restaurant. They are useful for managing work environments and addressing problems for employees. Lead time from order to delivery indicates the quality of the service for customers. We found that sound data of an employee's speech is useful for these techniques through speech ratio smoothing and POS sound detection.

Scopus citations: 1

Multistream sparse representation features for noise robust audio-visual speech recognition

Shen Peng, Tamura Satoshi, Hayamizu Satoru

Acoustical Science and Technology 35 ( 1 ) 17 - 27 2014

In this paper, we propose to use exemplar-based sparse representation features for noise robust audio-visual speech recognition. First, we introduce a sparse representation technology and describe how noise robustness can be realized by the sparse representation for noise reduction. Then, feature fusion methods are proposed to combine audio-visual features with the sparse representation. Our work provides new insight into two crucial issues in automatic speech recognition: noise reduction and robust audio-visual features. For noise reduction, we describe a noise reduction method in which speech and noise are mapped into different subspaces by the sparse representation to reduce the noise. Our proposed method can be deployed not only on audio noise reduction but also on visual noise reduction for several types of noise. For the second issue, we investigate two feature fusion methods (late feature fusion and the joint sparsity model method) to calculate audio-visual sparse representation features to improve the accuracy of the audio-visual speech recognition. Our proposed method can also contribute to feature fusion for the audio-visual speech recognition system. Finally, to evaluate the new sparse representation features, a database for audio-visual speech recognition is used in this research. We show the effectiveness of our proposed noise reduction on both audio and visual cases for several types of noise and the effectiveness of audio-visual feature determination by the joint sparsity model, in comparison with the late feature fusion method and traditional methods.

Scopus citations: 5

AUDIO-VISUAL VOICE CONVERSION USING NOISE-ROBUST FEATURES

Kohei Sawada, Masanori Takehara, Satoshi Tamura, Satoru Hayamizu

2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) 2014 [Refereed]

Voice Conversion (VC) is a technique to convert speech data of a source speaker into that of a target speaker. VC has been widely investigated, and statistical VC is used for various purposes. Conventional VC uses acoustic features; however, audio-only VC suffers from degradation in noisy or real environments. This paper proposes an Audio-Visual VC (AVVC) method using not only audio features but also visual information, i.e. lip images. The eigenlip feature is employed in our scheme as the visual feature. We also propose a feature selection approach for audio-visual features. Experiments were conducted using noisy data to evaluate our AVVC scheme in comparison with audio-only VC. The results show that AVVC can improve the performance even in noisy environments, by properly selecting audio and visual parameters. It is also found that visual VC is successful. Furthermore, it is observed that visual dynamic features are more effective than visual static information.

Data Collection for Mobile Audio-visual Speech Recognition in Various Environments

Satoshi Tamura, Takumi Seko, Satoru Hayamizu

2014 17TH ORIENTAL CHAPTER OF THE INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDIZATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (COCOSDA) 2014 [Refereed]

This paper introduces our recent activities for audio-visual speech recognition on mobile devices and data collection in various environments. Audio-visual automatic speech recognition is effective in noisy or real conditions for enhancing the robustness of the speech recognizer and improving recognition accuracy. We have developed an audio-visual speech recognition interface for mobile devices. In order to evaluate the recognizer and investigate issues related to audio-visual processing on mobile computers, we collected speech data and lip images of 16 subjects in eight conditions with various audio noises and visual difficulties. Audio-only speech recognition and visual-only lipreading were then conducted. Through these experiments, we identified issues and future work not only for the construction of audio-visual databases but also for robust audio-visual speech recognition.

Probabilistic expression of Polynomial Semantic Indexing and its application for classification

Kentaro Minoura, Satoshi Tamura, Satoru Hayamizu

PATTERN RECOGNITION LETTERS 34 ( 13 ) 1485 - 1489 October 2013 [Refereed]

We propose a probabilistic expression of PSI (Polynomial Semantic Indexing). PSI is a model which represents a latent semantic space in the polynomial form of input vectors. PSI expresses high-order relationships between more than two vectors in the form of extended inner products. PSI employs a low-rank representation, which enables us to treat high-dimensional data without explicit processes such as dimension reduction and feature extraction. Our proposed pPSI has the same advantages as PSI. The contributions of this paper are (1) to formulate a probabilistic expression of PSI (pPSI), (2) to propose a pPSI-based classifier, and (3) to show the potential of the pPSI classifier. A training algorithm based on stochastic gradient descent is introduced for pPSI, saving memory use as well as computational costs. Furthermore, pPSI has the potential to reach a better solution than PSI. The proposed pPSI method can perform model-based training and adaptation, such as MAP (Maximum A Posteriori)-based estimation, according to the amount of data. In order to evaluate pPSI and its classifier, we conducted three experiments with artificial data and music data, comparing with multi-class SVM and boosting classifiers. Through the experiments, it is shown that the proposed method is feasible, especially for the case of a small dimension of latent concept spaces. (c) 2013 Elsevier B.V. All rights reserved.

Scopus citations: 1

Improvement of lip reading performance in real environments using speaker and environmental adaptation

Takuya Kawasaki, Naoya Ukai, Seko Takumi, Satoshi Tamura, Satoru Hayamizu

Proceedings - 2nd IAPR Asian Conference on Pattern Recognition, ACPR 2013 346 - 350 2013 [Refereed]

Lip reading technologies play a great role not only in image pattern recognition, e.g. computer vision, but also in audio-visual pattern recognition, e.g. bimodal speech recognition. However, the recognition accuracy is still significantly lower than that of speech recognition. Another problem is that performance degrades in real environments. To improve the performance, in this paper we employ two adaptation schemes: speaker adaptation and environmental adaptation. Speaker adaptation is applied to the recognition models so as to prevent the degradation caused by differences between speakers. Environmental adaptation is also conducted to deal with environmental differences. We tested these adaptation schemes using a real-world audio-visual corpus, CENSREC-2-AV. We built this corpus, which contains real-world data (speech signals and lip images) recorded in a moving car, in which subjects uttered Japanese connected digits. Experimental results show that the lip reading recognition performance was largely improved by the speaker adaptation, and further recovered by the environmental adaptation. © 2013 IEEE.

Scopus citations: 2

An audio-visual in-car corpus "CENSREC-2-AV" for robust bimodal speech recognition

Takuya Kawasaki, Naoya Ukai, Takumi Seko, Satoshi Tamura, Satoru Hayamizu, Chiyomi Miyajima, Norihide Kitaoka, Kazuya Takeda

6th Biennial Workshop on DSP for In-Vehicle Systems and Safety 2013, DSP 2013 2013

The purpose of this study is to build an evaluation framework for robust bimodal speech recognition in real environments, such as in-car conditions. Bimodal speech recognition using lip images has been studied to prevent the deterioration of speech recognition performance in noisy environments. Lip reading technologies using lip images play a great role in bimodal speech recognition. Therefore, a database containing both speech signals and lip images is necessary to build a bimodal speech recognizer and to evaluate its performance. An evaluation framework for noisy bimodal speech recognition (CENSREC-1-AV) was constructed by Tamura et al.; a subject in front of a blue screen background spoke Japanese connected digits in a quiet office environment. CENSREC-1-AV was recorded in the clean condition; on the other hand, a database recorded in real environments is required to evaluate a bimodal speech recognizer. Therefore, we have constructed a new audio-visual corpus, CENSREC-2-AV, recorded in in-car environments; a subject sitting in the driver's seat of a car uttered Japanese connected digits in various driving conditions: for example, in a tunnel with background music noise, and driving on an expressway with the window open. By using CENSREC-2-AV, it is possible to realize a robust bimodal speech recognition method even in real environments.

Confidence estimation and keyword extraction from speech recognition result based on Web information

Hara Kensuke, Sekiya Hideki, Kawase Tetsuya, Tamura Satoshi, Hayamizu Satoru

2013 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA) 2013年 [査読有り]

This paper proposes using Web information for confidence measurement and keyword extraction for speech recognition results. Spoken document processing has been attracting attention, particularly for information retrieval and video (audiovisual) content systems. For example, confidence scores that indicate how likely a document or a document segment is to include recognition errors have been studied. It is well known that keyword extraction from recognition results is also an important issue. For these purposes, this paper employs pointwise mutual information (PMI) between two words. PMI has been used to calculate a confidence measure for speech recognition, as a coherence measure based on the co-occurrence of words. We propose to further improve the method by using a Web query expansion technique with term triplets that consist of nouns in the same document. We also apply PMI to keyword estimation by summing a co-occurrence score (sumPMI) between a target keyword candidate and each term. The proposed methods were tested with 10 lectures from the Corpus of Spontaneous Japanese (CSJ) and 2 simulated movie dialogues. The experiments show that the estimated confidence score is strongly related to recognition accuracy, indicating the effectiveness of our method. In addition, keywords had higher sumPMI scores in the subjective tests.
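As a rough illustration of the PMI and sumPMI scores mentioned in the abstract above, the following minimal sketch computes PMI from document-level co-occurrence counts in a small local corpus. It is not the authors' implementation: the paper obtains its statistics from Web queries with term triplets, whereas here the function names `pmi_table` and `sum_pmi`, the toy documents, and the choice to score unseen pairs as 0 are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Map each unordered word pair to its PMI, from document-level co-occurrence."""
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        terms = sorted(set(doc))
        word_df.update(terms)
        pair_df.update(frozenset(pair) for pair in combinations(terms, 2))
    table = {}
    for pair, n_xy in pair_df.items():
        x, y = tuple(pair)
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        p_xy = n_xy / n_docs
        p_x = word_df[x] / n_docs
        p_y = word_df[y] / n_docs
        table[pair] = math.log2(p_xy / (p_x * p_y))
    return table


def sum_pmi(candidate, terms, table):
    """Sum PMI between a candidate and every other term (assumption: unseen pairs count as 0)."""
    return sum(
        table.get(frozenset((candidate, term)), 0.0)
        for term in terms
        if term != candidate
    )


# Toy "documents" (hypothetical data, not taken from the paper).
docs = [
    ["speech", "recognition", "error"],
    ["speech", "recognition", "lecture"],
    ["movie", "dialogue"],
    ["lecture", "speech"],
]
table = pmi_table(docs)
score = sum_pmi("speech", ["speech", "recognition", "movie"], table)
```

In this sketch a keyword candidate that co-occurs with many of the other terms receives a high summed score, which is the intuition behind using such a sum for keyword estimation.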
Measurement and analysis of speech data toward improving service in restaurant.

Masanori Takehara, Satoshi Tamura, Satoru Hayamizu, Ryuhei Tenmoku, Tomohiro Fukuhara, Takeshi Kurata

2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, November 25-27, 2013 1 - 4 2013年 [査読有り]

Citations (Scopus): 1
Sparse representation of audio features for sputum detection from lung sounds

Tatsuya Yamashita, Satoshi Tamura, Kenji Hayashi, Yutaka Nishimoto, Satoru Hayamizu

Proceedings - International Conference on Pattern Recognition 2005 - 2008 2012年

Medical staff must check for sputum accumulation in a patient's respiratory tract by lung sound auscultation at any time, which is a heavy burden for the staff. This paper aims to develop a system that notifies medical staff of the appropriate timing for tracheal suction by analyzing patients' lung sounds. We present a novel framework for automatic sputum detection from lung sounds. We propose a sparse representation of audio features to achieve robust detection in real environments. We show the effectiveness of the proposed method for three patients in an ICU of Gifu University Hospital, where the recorded lung sounds included electronic beeps, human voices, and various other noises. © 2012 ICPR Org Committee.
Toward polyphonic musical instrument identification using example-based sparse representation

Mari Okamura, Masanori Takehara, Satoshi Tamura, Satoru Hayamizu

2012 Conference Handbook - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2012 2012年

Musical instrument identification is one of the major topics in music signal processing. In this paper, we propose a musical instrument identification method based on sparse representation for polyphonic sounds. Such identification is still regarded as a challenging task, since it needs high-performance signal processing techniques. The proposed scheme can be applied without any signal processing such as source separation. Sample feature vectors for various musical instruments are used as the base matrix of the sparse representation. We conducted two experiments to evaluate the proposed method. First, musical instrument identification was tested on monophonic sounds using five musical instruments. An average accuracy of 91.9% was obtained, which shows the effectiveness of the proposed method. Second, we examined the instrument composition of polyphonic sounds containing two instruments. We found that the weight vector estimated by sparse representation indicates the mixture ratio of the two instruments. © 2012 APSIPA.
Feature reconstruction using sparse imputation for noise robust audio-visual speech recognition

Peng Shen, Satoshi Tamura, Satoru Hayamizu

2012 Conference Handbook - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2012 2012年

In this paper, we propose applying noise reduction to both the speech signal and the visual signal by using exemplar-based sparse representation features for audio-visual speech recognition. First, we introduce sparse representation classification and describe how to use sparse imputation to reduce noise not only in the audio signal but also in the visual signal. We use a normalization method to improve the accuracy of sparse representation classification, and propose a method to reduce the error rate for the visual signal when the normalization method is used. We show the effectiveness of the proposed noise reduction method: the audio features achieved up to 88.63% accuracy at -5dB, a 6.24% absolute improvement over the additive noise reduction method, and the visual features achieved a 27.24% absolute improvement under gamma noise. © 2012 APSIPA.
CENSREC-2-AV: An evaluation framework for bimodal speech recognition in real environments

Naoya Ukai, Takuya Kawasaki, Satoshi Tamura, Satoru Hayamizu, Chiyomi Miyajima, Norihide Kitaoka, Kazuya Takeda

Proceedings of the 2012 International Conference on Speech Database and Assessments, Oriental COCOSDA 2012 88 - 91 2012年

In this paper, we introduce a bimodal speech recognition corpus recorded in real environments. In recent years, speech recognition technology has been used in noisy conditions, so it has become necessary to achieve higher recognition accuracy in real environments. As one solution, bimodal speech recognition using audio and non-audio information is being studied. However, there are few databases that can be used to evaluate bimodal speech recognition in real environments. In this paper, we introduce CENSREC-2-AV, a new bimodal speech recognition corpus that we have been working to build. CENSREC-2-AV is one of the databases of the CENSREC project; we previously provided a similar corpus, CENSREC-1-AV, as a database for bimodal speech recognition under additive noise. These corpora contain speech data and lip images. Using CENSREC-2-AV, researchers can evaluate in real environments a bimodal speech recognition method built using CENSREC-1-AV, which consists of clean data. © 2012 IEEE.

Citations (Scopus): 2
GIF-SP: GA-based Informative Feature for Noisy Speech Recognition

Satoshi Tamura, Yoji Tagami, Satoru Hayamizu

2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 2012年 [査読有り]

This paper proposes a novel discriminative feature extraction method. The method consists of two stages. In the first stage, a classifier is built for each class, which determines whether or not an input vector belongs to that class. From all the parameters of the classifiers, a first transformation is formed. In the second stage, another transformation that generates a feature vector is obtained to reduce the dimension and enhance recognition ability. These transformations are computed using a genetic algorithm. To evaluate the performance of the proposed feature, speech recognition experiments were conducted. Results in the clean training condition show that GIF greatly improves recognition accuracy compared to conventional MFCC in noisy environments. Multi-condition results also show that our proposed scheme is robust against differences in conditions.
Multi-stream acoustic model adaptation for noisy speech recognition

Satoshi Tamura, Satoru Hayamizu

2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 2012年 [査読有り]

In this paper, a multi-stream-based model adaptation method is proposed for speech recognition in noisy or real environments. The proposed scheme comes from our experience with audio-visual model adaptation. First, an acoustic feature vector is divided into several vectors (e.g., static, first-order, and second-order dynamic vectors), called streams. During adaptation, a stream with relatively high recognition performance is updated using that stream only. In contrast, a stream with lower recognition power is adapted using all the streams that are superior to it. To evaluate the proposed technique, recognition experiments were conducted using every stream, and adaptation experiments were then conducted for various combinations of streams.
Statistical Voice Conversion using GA-based Informative Feature

Kohei Sawada, Yoji Tagami, Satoshi Tamura, Masanori Takehara, Satoru Hayamizu

2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 2012年 [査読有り]

To make voice conversion (VC) robust to noise, we propose VC using the GA-based informative feature (GIF), adding a GIF extraction process to conventional VC. GIF was proposed as a feature that can be applied not only in pattern recognition but also in related tasks. Furthermore, in speech recognition, GIF improved recognition accuracy in noisy environments. We evaluated the performance of VC using spectral segmental features (the conventional method) and GIF, respectively. The objective experimental results indicate that the proposed method was better than the conventional method in noisy environments. A subjective experiment was also conducted to compare their performance. These results show that applying GIF to VC was effective.
GIF-LR:GA-based Informative Feature for Lipreading

Naoya Ukai, Takumi Seko, Satoshi Tamura, Satoru Hayamizu

2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 2012年 [査読有り]

In this paper, we propose a general and discriminative feature, "GIF" (GA-based Informative Feature), and apply it to lipreading (visual speech recognition). The feature extraction method consists of two transforms that convert an input vector to GIF for recognition. The transforms can be computed using training data and a Genetic Algorithm (GA). For lipreading, we extract a fundamental feature from an image as the input vector; the vector consists of the intensity values of all the pixels in an input lip image, enumerated from top-left to bottom-right. Recognition experiments on continuous digit utterances were conducted using an audio-visual corpus including more than 268,000 lip images. The recognition results show that the GIF-based method is better than the baseline method using eigenlip features.
Audio-visual interaction in model adaptation for multi-modal speech recognition

Satoshi Tamura, Masanao Oonishi, Satoru Hayamizu

APSIPA ASC 2011 - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2011 875 - 878 2011年

This paper investigates audio-visual interaction, i.e., inter-modal influences, in linear-regressive model adaptation for multi-modal speech recognition. In multi-modal adaptation, inter-modal information may contribute to speech recognition performance. Thus, the influence and advantage of inter-modal elements should be examined. Experiments were conducted to evaluate several transformation matrices including or excluding inter-modal and intra-modal elements, using noisy data in an audio-visual corpus. The experimental results clarify the importance of effective use of audio-visual interaction.
The role of speech technology in service-operation estimation

Masanori Takehara, Satoshi Tamura, Ryuhei Tenmoku, Takeshi Kurata, Satoru Hayamizu

2011 International Conference on Speech Database and Assessments, Oriental COCOSDA 2011 - Proceedings 116 - 119 2011年 [査読有り]

This paper introduces our recent effort to develop a Service-Operation Estimation (SOE) system using speech and multi-sensor data as well as other acquired data. In SOE, it is essential to analyze employees' data in order to increase productivity in many service industries. Speech processing techniques, such as voice activity detection and keyword spotting, help the analysis and enhance the precision of the results: the beginning and end times of speech regions are used to detect work events, and recognized keywords are used to conduct work estimation. In our system, all the results are visualized in a 3D model, which helps employers and employees with their operations. © 2011 IEEE.

Citations (Scopus): 6
Topic-based generation of keywords and caption for video content

Masanao Okamoto, Kiichi Hasegawa, Sho Sobue, Akira Nakamura, Satoshi Tamura, Satoru Hayamizu

APSIPA ASC 2010 - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 605 - 608 2010年

This paper studies the use of both keywords and captions in a single scene of video content. Captions show the spoken content and are updated sentence by sentence. A method is proposed to extract keywords automatically from transcribed text. The method estimates topic boundaries, extracts keywords by Latent Dirichlet Allocation (LDA), and presents them in a speech balloon captioning system. The proposed method is evaluated experimentally in terms of ease of viewing and helpfulness for understanding the video content. Adding keywords and captions obtained favorable scores in subjective assessments.
Template-based Spectral Estimation Using Microphone Array for Speech Recognition

Satoshi Tamura, Eriko Hishikawa, Wataru Taguchi, Satoru Hayamizu

11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4 2050 - + 2010年 [査読有り]

This paper proposes a Template-based Spectral Estimation (TSE) method for noise reduction in microphone array processing, aimed at enhancing speech recognition. In the proposed method, a noise template in the complex plane is calculated for each frequency bin using non-speech audio signals observed at the microphones. Then, for every noise-overlapped speech signal, a speech signal can be reconstructed by applying the template and the gradient descent method. Experiments were conducted to evaluate not only noise reduction performance but also the improvement in speech recognition. A 16.7dB NRR improvement was achieved by combining the TSE and Spectral Subtraction (SS) methods. For speech recognition, a 44% relative reduction in recognition errors was obtained compared with the conventional SS method.
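For readers unfamiliar with the Spectral Subtraction (SS) baseline named in the abstract above, the following is a minimal sketch of magnitude-domain spectral subtraction for a single frame. It illustrates only the conventional SS idea, not the TSE method of the paper; the function name `spectral_subtraction`, the over-subtraction factor `alpha`, the spectral `floor`, and the sample values are assumptions chosen for the example.

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Subtract a scaled noise magnitude estimate from each frequency bin.

    noisy_mag: magnitude spectrum of one noisy frame (one value per bin)
    noise_mag: noise magnitude estimate for the same bins
    alpha:     over-subtraction factor (assumed parameter)
    floor:     fraction of the noisy magnitude kept as a lower bound,
               so that no bin becomes negative
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [
        max(y - alpha * n, floor * y)
        for y, n in zip(noisy_mag, noise_mag)
    ]


# Hypothetical three-bin example: the last bin is dominated by noise,
# so it is clamped to the spectral floor instead of going negative.
cleaned = spectral_subtraction([1.0, 0.5, 0.2], [0.2, 0.2, 0.3])
```

The noise estimate would typically be averaged over non-speech frames; how that estimate is obtained is outside this sketch.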
A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection

Satoshi Tamura, Masato Ishikawa, Takashi Hashiba, Shin'ichi Takeuchi, Satoru Hayamizu

11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4 2702 - + 2010年 [査読有り]

This paper proposes a novel speech recognition method combining Audio-Visual Voice Activity Detection (AVVAD) and Audio-Visual Automatic Speech Recognition (AVASR). AVASR has been developed to enhance the robustness of ASR in noisy environments, using visual information in addition to acoustic features. Similarly, AVVAD increases the precision of VAD, which detects the presence of speech in an audio signal, in noisy conditions. In our approach, AVVAD is conducted as a preprocessing step followed by an AVASR system, yielding a significantly more robust speech recognizer. To evaluate the proposed system, recognition experiments were conducted using noisy audio-visual data, testing several AVVAD approaches. We found that the proposed AVASR system using the model-free feature-fusion AVVAD method outperforms not only non-VAD audio-only ASR but also conventional AVASR.
The cardiac massage detection in the emergency medical care video

Hirotsugu Asai, Hideki Tanahashi, Satoru Hayamizu, Makoto Kanagawa

Proceedings of the 6th IASTED International Conference on Visualization, Imaging, and Image Processing, VIIP 2006 597 - 602 2006年

The purpose of our research is to summarize medical care videos through treatment detection and situation classification based on the positional relation between the patient and medical staff, exploiting the characteristics of medical care video. As a first step, we discuss the automated detection of treatment in emergency medical care video using only position and motion information, and propose a simple, low-cost way to detect cardiac massage.
Automatic metadata generation and video editing based on speech and image recognition for medical education contents

Satoshi Tamura, Koji Hashimoto, Jiong Zhu, Satoru Hayamizu, Hirotsugu Asai, Hideki Tanahashi, Makoto Kanagawa

INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5 2466 - + 2006年 [査読有り]

This paper reports a metadata generation system as well as an automatic video editing system. Metadata are information that describes other data. In the audio metadata generation system, speech recognition using a general language model (LM) and a specialized LM is applied to input speech to obtain segments (event groups) and audio metadata (event information), respectively. In the video editing system, visual metadata obtained by image recognition and audio metadata are combined into audio-visual metadata. Subsequently, multiple videos are edited into one video using the audio-visual metadata. Experiments were conducted to evaluate the event detection of the systems using medical education content, ACLS and BLS. The audio metadata system achieved about 78% event detection correctness. In the editing system, 87% event correctness was obtained with audio-visual metadata, and a survey showed that the edited video is appropriate and useful.


Books and other publications

製造業向け人工知能講義

速水悟 (Role: sole author)

日経BP社 2024年08月 ISBN: 4296205528
製造業のAI活用を支える統計的機械学習＆深層学習

(Role: sole author)

日経BP社 2020年12月
事例+演習で学ぶ機械学習 : ビジネスを支えるデータ活用のしくみ

速水悟 (Role: sole author)

森北出版 2016年04月 ISBN: 9784627880214

Joint research and competitively funded research projects

音響処理と画像処理の協調的統合による嚥下タイミング計測の研究

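Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research

Research period: April 2018 - March 2021

速水悟, 加藤邦人, 田村哲嗣, 木村暁夫, 西津貴久

To measure swallowing function non-invasively and quantitatively, this study modeled data on swallowing sounds and the movement of the thyroid cartilage. Subjects swallowed a sound-source food, and from recordings of that sound, acoustic signal processing and a neural network were used to estimate the timing at which the food was sent to the esophagus. In addition, using 3-IR photometric stereo, the movement of the thyroid cartilage was obtained from video of the area around the throat, and the movement of the epiglottis, which moves in synchrony with it, was estimated. Finally, swallowing function was measured by comparing the two. Experiments were conducted on subjects with normal swallowing function, and swallowing timing was estimated and evaluated by comparing the two time points obtained from the sound data and the video data.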
マルチモーダルサイレント音声認識技術に関する研究

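Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research

Research period: April 2016 - March 2020

齊藤剛史, 田村哲嗣, 桂田浩一, 速水悟, 永井秀利, 山崎敏正

This project aimed to establish silent speech recognition techniques, which recognize spoken content without using audio, for each type of data such as images, surface electromyography, and EEG, and further to develop multimodal silent speech recognition. In particular, for lip reading using image information, we actively worked on building and releasing a database, developing and releasing a lip-reading demo application, and holding competitions. By introducing deep learning techniques, we established methods that achieve high recognition accuracy. We also continued to hold workshops and exhibited at various events to disseminate the research results to society.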
”iPad”と”学内SNS”を活用した自主学習型音楽技能向上システムの開発

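Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research

Research period: April 2013 - March 2016

杉山祐子, 速水悟, 堅田明義, 菊池春秀

To improve piano skills, we built a self-study system using ICT. We created a scale for evaluating the rhythm of beginner pianists and, based on that scale, worked out an algorithm for an automatic rhythm evaluation system. We produced a rhythm learning method on the iPad that uses the automatic evaluation system. In addition, to promote autonomous learning in terms of motivation, we set up continuous "evaluative exchanges" between learners and instructors on an SNS and confirmed their effect. Compared with conventional piano skill training, which was one-to-one and time-consuming, this research on an ICT-based self-study support system showed the potential to resolve difficulties faced by beginner pianists and to support the improvement of similar musical skills efficiently and effectively.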
音声認識による話題語の継時的提示に関する研究

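Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research

Research period: 2010 - 2012

速水悟, 田村哲嗣

For continuous media content, we proposed a new method that changes the summary information shown at the point of playback over time, based on text obtained as speech recognition results. In particular, as a method that simultaneously addresses topic change and the selection of keywords to present, we proposed applying a statistical topic language model. We implemented it as a system that adds captions and keywords to video and indicates topic changes through multiple keywords, and demonstrated its effectiveness.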
ユーザと情報システムとの認知的調和のための確率的制御機構の研究

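Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research

Research period: 2002 - 2004

麻生英樹, 浅野太, 本村陽一, 秋葉友良, 伊藤克亘, 速水悟

Aiming at "cognitive harmony," in which an information system and its user understand each other's state and interact smoothly, we conducted research based on probabilistic and statistical methods and obtained the following results.
1. We proposed a probabilistic method that uses a microphone array and a stereo camera to estimate the situation of a user who speaks while moving, and confirmed its effectiveness using data collected in a real environment. An accuracy of about 85% was achieved for speech segment estimation.
2. We proposed a method for modeling user utterances that contain fixed phrases in a question-answering task, and applied it to a spoken question-answering system. We confirmed that speech recognition accuracy and the task completion rate improved significantly. We also proposed a method for collecting natural user utterance data in a spoken question-answering task, and collected data in line with the NTCIR-3 and NTCIR-4 information retrieval tasks.
3. We proposed a probabilistic method for estimating a user's level of proficiency and knowledge in a research laboratory guidance task. We built a system for collecting natural user utterance data with the Wizard of Oz method and collected data from 12 people. We evaluated the estimation method on the collected data and achieved an accuracy of about 87% on the problem of classifying users into four classes.
4. We collected natural utterance data from 20 users in an information appliance control task using the Wizard of Oz method, and transcribed and labeled the data.
These results support the view that probabilistic methods centered on dynamic Bayesian networks are effective for estimating user state in dialogue tasks in various situations.
Remaining issues include further refining the user state estimation methods and carrying out a more detailed performance evaluation with the collected data, and examining how to use user state estimation in the system's decision making and response generation so as to complete a dialogue system.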

Misc

製造業におけるAI活用の拡大：現状と課題

速水悟

日本経営学会全国大会 2021年09月

Role: first author

Conference paper or abstract (national conference, other academic meetings)
手順書による画像検索の性能改善の検討

三橋祐亮, 田村哲嗣, 速水悟

電子情報通信学会大会講演論文集(CD-ROM) 2021 2021年

初等教育におけるテキスト型プログラミング言語 Python によるプログラミング教育の効果検証

朝日翔太, 高橋和之, 村山聡江, 寺田和憲, 加藤邦人, 山口忠, 今井亜湖, 速水悟

日本教育工学会第34回全国大会 2018年09月
音響信号処理による嚥下タイミング推定手法

児玉千紗, 加藤邦人, 田村哲嗣, 速水悟

計測自動制御学会ライフエンジニアリング部門,LE2017 139 - 142 2017年09月
ピアノの自主学習を促すリズム自動評価システムの提案

杉山祐子, 臼田寛明, 田村哲嗣, 速水悟, 堅田明義

中部学院大学・中部学院大学短期大学部研究紀要 ( 18 ) 11 - 19 2017年03月

Investigation of DNN-Based Audio-Visual Speech Recognition (Special Section on Recent Advances in Machine Learning for Spoken Language Processing)

Tamura Satoshi, Ninomiya Hiroshi, Kitaoka Norihide, Osuga Shin, Iribe Yurie, Takeda Kazuya, Hayamizu Satoru

IEICE Transactions on Information and Systems 99 ( 10 ) 2444 - 2451 2016年10月

Audio-Visual Speech Recognition (AVSR) is one of the techniques for enhancing the robustness of speech recognizers in noisy or real environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted much attention from researchers in the speech recognition field, because recognition performance can be drastically improved by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach. In the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be incorporated using DNNs. We carried out recognition experiments using the CENSREC-1-AV corpus, and we discuss the results to find the best DNN-based AVSR modeling. The results show that a tandem-based method using audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.
診療所電子カルテデータを用いた診療プロセス可視化の試み

中嶋航大, 田村哲嗣, 速水悟

人工知能学会全国大会論文集 29 1 - 4 2015年

呼吸音区間に対する喀痰検出システムと実環境における個人適応

山下達也, 田村哲嗣, 速水悟, 林賢二, 西本裕

電子情報通信学会論文誌D: 情報・システム 97 ( 12 ) 1831 - 1838 2014年12月
音声情報と位置情報を用いた従業員の接客作業分析とその活用

竹原正矩, 野尻弘也, 田村哲嗣, 速水悟, 蔵田武志

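研究報告音声言語情報処理（SLP） 2014 ( 5 ) 1 - 6 2014年10月

To support operational improvement and employee training in service industries, better behavior measurement and data analysis techniques are needed. This paper focuses on the customer service work of restaurant employees. Because customer service work involves many opportunities to speak, analyzing speech information is expected to yield indicators of customer service situations and tasks. In addition, by detecting customer service tasks such as order taking and serving, the waiting time after a customer orders can be estimated. We therefore propose a framework that combines speech and location information to detect employees' customer service tasks, and examine its use for estimating customer waiting time. We then consider how the detection of customer service tasks and the estimation of waiting time can be extended to other stores and services.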
E-021 作業推定に向けた環境雑音のクラス分類(E分野:自然言語・音声・音楽,一般論文)

宇野太久哉, 竹原正矩, 田村哲嗣, 速水悟, 蔵田武志

情報科学技術フォーラム講演論文集 13 ( 2 ) 253 - 256 2014年08月

商品レビューを用いたプレゼント支援の検討

田口拓明, 田村哲嗣, 速水悟

人工知能学会全国大会論文集 28 1 - 4 2014年

接客時間推定に向けた従業員の位置・音声データによる発話クラスタリング(音声対話・合成,第15回音声言語シンポジウム)

川瀬徹也, 竹原正矩, 田村哲嗣, 天目隆平, 蔵田武志, 速水悟

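電子情報通信学会技術研究報告. SP, 音声 113 ( 366 ) 89 - 95 2013年12月

We are studying utterance clustering for speech recorded in a restaurant. The utterance data include conversations among employees and conversations with customers, and clustering the target speakers is expected to make it possible to estimate work-related indicators such as employees' customer service time. This paper examines three-class utterance clustering: the microphone wearer, other employees, and customers. However, classification accuracy may degrade because speakers are not known in advance and because of noise. We therefore attempted to improve accuracy by integrating employees' location information with the speech data. We also considered how well the proposed method generalizes to settings other than restaurants.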
接客時間推定に向けた従業員の位置・音声データによる発話クラスタリング

川瀬徹也, 竹原正矩, 田村哲嗣, 天目隆平, 蔵田武志, 速水悟

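研究報告音声言語情報処理（SLP） 2013 ( 15 ) 1 - 7 2013年12月

We are studying utterance clustering for speech recorded in a restaurant. The utterance data include conversations among employees and conversations with customers, and clustering the target speakers is expected to make it possible to estimate work-related indicators such as employees' customer service time. This paper examines three-class utterance clustering: the microphone wearer, other employees, and customers. However, classification accuracy may degrade because speakers are not known in advance and because of noise. We therefore attempted to improve accuracy by integrating employees' location information with the speech data. We also considered how well the proposed method generalizes to settings other than restaurants.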
H-035 位置・発話・会計データを用いた配膳待ち時間推定の試み(H分野:画像認識・メディア理解,一般論文)

野尻弘也, 竹原正矩, 前山賢人, 田村哲嗣, 蔵田武志, 速水悟

情報科学技術フォーラム講演論文集 12 ( 3 ) 171 - 174 2013年08月

肺音の情報処理と応用

速水悟, 田村哲嗣

非破壊検査 : journal of N.D.I 60 ( 12 ) 706 - 712 2011年12月

多次元尺度構成法と相関分析を用いた健診データの解析

山本けい子, 田村哲嗣, 浅野昌和, 金川誠, 紀ノ定保臣, 速水悟

システム制御情報学会第53回研究発表講演会 2009年05月

Conference paper or abstract (national conference, other academic meetings)
多次元尺度構成法を用いた健診データの解析

山本けい子, 田村哲嗣, 速水悟, 紀ノ定保臣, 浅野昌和, 金川誠

人工知能学会 2008年全国大会 2008年06月

Conference paper or abstract (national conference, other academic meetings)
GEMSIS - a novel application of speech recognition to emergency and disaster medicine

Satoshi Tamura, Kunihiko Takamatsu, Shinji Ogura, Satoru Hayamizu

INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4 4 2468 - + 2007年

This paper reports a novel application of speech recognition to emergency and disaster medicine. The emergency medical system named "GEMSIS" (Gifu Emergency Medical Supporting Intelligent System), which includes the speech recognition application, is also introduced. Speech recognition plays an important role in this system: when a paramedic team is sent to a disaster or accident site, a life-saving technician reports the situation on site using speech recognition. The recognized information is shared by all hospitals and critical care centers. This system can address a severe issue in emergency medical care, namely that pre-hospital medical care is insufficient due to lack of information. A prototype speech recognition interface was constructed to evaluate baseline performance and to support discussion with medical doctors. Through this work, we found that the applicable domain of speech processing technology can be extended.
マルチモーダル医療支援システムの開発

速水悟

Journal of Japan Society of Computer Aided Surgery : J.JSCAS 8 ( 3 ) 136 - 137 2006年12月

講義情報を用いた教材配信制御システム

河村高守, 速水悟, 田村哲嗣

人工知能学会 2006年全国大会 2006年06月

Conference paper or abstract (national conference, other academic meetings)
印象語のグループ化を用いた楽曲推薦システム

市川裕也, 速水悟, 田村哲嗣

人工知能学会 2006年全国大会 2006年06月

Conference paper or abstract (national conference, other academic meetings)
Note-taking support for nurses using digital pen character recognition system

Yujiro Hayashi, Satoshi Tamura, Satoru Hayamizu, Yutaka Nishimoto

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4270 LNCS 428 - 436 2006年

This study presents a novel system that supports nurses in note-taking by providing a digital pen and a character recognition system, with an emphasis on the user interface. The system exploits the characteristics of a digital pen to improve the efficiency of tasks related to nursing records. It aims to improve the efficiency of nursing activities and reduce the time spent on nursing record tasks. In our system, notes are first written on a check sheet using a digital pen, along with speech recorded on a voice recorder; the pen and voice data are transferred to a PC. The pen data are then recognized automatically as characters, which can be viewed and manipulated in the application. We conducted an evaluation experiment to improve the efficiency and operation of the system and its interface. The evaluation and test operations involved 10 test subjects. The test operation and evaluation experiment showed that improvements for urgent situations, better portability, and further use of character recognition are required. © Springer-Verlag Berlin Heidelberg 2006.
Utterance Analysis in Medical Cases for Spoken Dialog System

Kiyoshi Naganuma, Satoru Hayamizu, Yuzo Takahashi, Yutaka Nishimoto, Yoshimi Matsuda, Yukiko Takahashi

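Proceedings of VSMM2004 10 954 - 961 2004年11月

Aiming at acquiring medical and nursing information by speech, we analyzed utterances in medical interviews and about patient information. For conversations between doctors and patients with fever, abdominal pain, or asthma, conversations between patients and nurses during vital sign measurement, and conversations between doctors and nurses, we labeled words and phrases, analyzed their frequencies, and created a dictionary for automatic analysis. Based on this, we verified the computer's level of understanding for read-aloud scripts and spontaneous speech. For read speech, 83.3% was understood, but for spontaneous speech the figure was below 60%. Because the conversations contain specialized content such as medical terms and abbreviations, information for analyzing the structure of medical sentences is needed. In the future, we would like to process this information automatically.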
音声補完 : 音声入力インタフェースへの新しいモダリティの導入(<特集>インタラクティブシステムとソフトウェア)

後藤真孝, 伊藤克亘, 秋葉友良, 速水悟, Masataka Goto, Katunobu Itou, Tomoyosi Akiba, Satoru Hayamizu

コンピュータソフトウェア = Computer software 19 ( 4 ) 254 - 265 2002年07月

研究開発用知的資源(<特集>「RWC-実世界知能」)

速水悟, Satoru Hayamizu, National Institute of Advanced Industrial Science and Technology (AIST)

人工知能学会誌 = Journal of Japanese Society for Artificial Intelligence 17 ( 2 ) 167 - 170 2002年03月

音声補完の評価—ヒューマンインタフェース,音声言語情報処理合同研究報告

後藤真孝, 伊藤克亘, 速水悟

情報処理学会研究報告 = IPSJ SIG technical reports 2002 ( 10 ) 19 - 26 2002年02月

音声補完: 音声ワイルドカード補完機能の実現

後藤真孝, 伊藤克亘, 秋葉友良, 速水悟

日本音響学会研究発表会講演論文集 2001 ( 1 ) 141 - 142 2001年03月

単語発声の複数サンプルを利用した未知語の音韻系列の推定—音声情報処理:現状と将来技術論文特集

伊藤克亘, 速水悟, 田中和世

電子情報通信学会論文誌. D-2, 情報・システム. 2, パターン処理 = The IEICE transactions on information and systems. Pt. 2 / 電子情報通信学会編 83 ( 11 ) 2152 - 2159 2000年11月

自然発話中の有声休止箇所のリアルタイム検出システム—音声情報処理:現状と将来技術論文特集

後藤真孝, 伊藤克亘, 速水悟

電子情報通信学会論文誌. D-2, 情報・システム. 2, パターン処理 = The IEICE transactions on information and systems. Pt. 2 / 電子情報通信学会編 83 ( 11 ) 2330 - 2340 2000年11月

音声補完:"TAB"on Speech—音声言語情報処理研究報告

後藤真孝, 伊藤克亘, 速水悟

情報処理学会研究報告 = IPSJ SIG technical reports 2000 ( 64 ) 81 - 86 2000年07月

マルチモーダル情報統合システムの研究動向

速水悟, 竹澤寿幸, Satoru Hayamizu, Toshiyuki Takezawa

人工知能学会誌 = Journal of Japanese Society for Artificial Intelligence 13 ( 2 ) 206 - 211 1998年03月

複数サンプルを用いた未知語の音韻系列の推定

伊藤克亘, 速水悟, 田中和世

日本音響学会研究発表会講演論文集 1997 ( 1 ) 7 - 8 1997年03月

時間の扱いを考慮した対話システム制御手法

伊藤克亘, 速水悟, 田中和世

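電子情報通信学会技術研究報告. SP, 音声 95 ( 468 ) 1 - 6 1996年01月

This paper describes a system control method for a dialogue system with multiple modules that appropriately handles, from the viewpoint of time, the actions of each module and the behavior of the whole dialogue system that integrates them. In general, control methods for systems built by combining multiple modules are often discussed in terms of information flow and processing hierarchy. Even when temporal phenomena are handled, often only the relationships among inputs are addressed. However, to carry on a dialogue smoothly, it is necessary to consider not only inputs but also the temporal relationships among outputs and between outputs and inputs, as well as the temporal properties of the modules. In the model proposed in this paper, a server that manages time is provided, and the individual modules act as clients so that time is handled centrally. We show that this model makes it possible to handle interruptions and integrate multiple inputs while carrying on the dialogue smoothly.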
音声と画像のインターモーダル学習

速水悟

知能情報メディアシンポジウム講論集 1996年

大語彙音声認識における未知語の検出について

速水悟, 伊藤克亘, 田中和世

Journal of the Acoustical Society of Japan (E) 16 ( 3 ) 165 - 171 1995年

This paper describes the relation between vocabulary size and detection errors for unknown words in large-vocabulary speech recognition through recognition and detection experiments. Although the relation between vocabulary size and recognition performance has been reported, the relation between vocabulary size and detection performance has not yet been studied, especially for vocabulary sizes of over 1,000 words. Experiments were conducted using the speech material of speaker MAU in the ATR word speech database. The dictionary used contains 40,000 words from the Shinmeikai Japanese Language Dictionary. It is shown that when the vocabulary size increases from 1,000 words to 40,000 words, the relation between vocabulary size and detection errors shows a tendency similar to that between vocabulary size and recognition errors. The increase in detection errors caused by increasing the vocabulary size is shown to be small for in-vocabulary words, compared with the increase for out-of-vocabulary words. These results should be taken into account when designing large-vocabulary speech recognition systems that include unknown word processing.
電総研の研究用音声データベース

田中和世, 速水悟

日本音響学会誌 48 ( 12 ) 883 - 887 1992年12月

音声の音素片ネットワ-ク表現と時系列のセグメント化法を用いた自動ラベリング手法

田中和世, 速水悟, 太田耕三

日本音響学会誌 42 ( 11 ) p860 - 868 1986年11月

研究用音声データベースのためのVCV/CVCバランス単語セットの作成

速水悟

電子総合研究所彙報 49 804 - 834 1985年

研究用音声データベースのためのVCV/CVCバランス単語セットの作成

速水悟

電総研彙報 49 ( 10 ) 803 - 834 1985年



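Courses currently taught

Pattern Recognition

Graduate School of Fundamental Science and Engineering

Spring semester 2025
パターン認識特論 (Advanced Topics in Pattern Recognition)

Graduate School of Fundamental Science and Engineering

Spring semester 2025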

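Courses previously taught

情報処理入門 (Introduction to Information Processing)

Faculty of Engineering (daytime)
October 2015 - present
ソーシャルイノベーション特論 (Advanced Topics in Social Innovation)

Graduate School of Engineering (D)
October 2017 - February 2021
機械学習特論 (Advanced Topics in Machine Learning)

April 2017 - September 2020
データサイエンス入門 (Introduction to Data Science)

University-wide general education
October 2017 - February 2020
技術経営概論 (Introduction to Technology Management)

Faculty of Engineering
October 2016 - February 2020
メディアコンテント論 (Media Content Theory)

Graduate School of Engineering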


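Concurrent appointments in other faculties and graduate schools

Faculty of Science and Engineering, Graduate School of Fundamental Science and Engineering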