Updated 2024/09/30


Qi Feng (馮 起)
Affiliation
Faculty of Science and Engineering, Waseda Research Institute for Science and Engineering
Title
Junior Researcher (Research Lecturer)
Degrees
Doctor of Engineering (September 2022, Waseda University)
Master of Engineering (September 2019, Waseda University)
Bachelor of Engineering (September 2017, Waseda University)
Homepage
Profile

I conduct research on computer graphics (CG) and computer vision (CV) using deep learning and data-generation techniques. I also apply CG/CV methods to tackle important and challenging problems in virtual reality (VR) and augmented reality (AR).

 

Papers

  • Gaze-Driven Sentence Simplification for Language Learners: Enhancing Comprehension and Readability

    Taichi Higasa, Keitaro Tanaka, Qi Feng, Shigeo Morishima

    ACM International Conference Proceeding Series     292 - 296  October 2023

    Abstract

    Language learners should regularly engage in reading challenging materials as part of their study routine. Nevertheless, constantly referring to dictionaries is time-consuming and distracting. This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension while maintaining the learner's focus on the content. Our system incorporates machine learning models tailored to individual learners, combining eye gaze features and linguistic features to assess sentence comprehension. When the system identifies comprehension difficulties, it provides simplified versions by replacing complex vocabulary and grammar with simpler alternatives via GPT-3.5. We conducted an experiment with 19 English learners, collecting data on their eye movements while reading English text. The results demonstrated that our system is capable of accurately estimating sentence-level comprehension. Additionally, we found that GPT-3.5 simplification improved readability in terms of traditional readability metrics and individual word difficulty, while paraphrasing across different linguistic levels.

    Citations (Scopus): 1
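
    The comprehension-estimation step above can be pictured with a small, self-contained sketch: a per-learner classifier over combined gaze and linguistic features that flags sentences for simplification. The specific features, the logistic-regression model, and the use of scikit-learn are illustrative assumptions, not the paper's actual pipeline.

        # Minimal sketch: a per-learner comprehension classifier combining gaze
        # features with linguistic features. Feature names and model choice are
        # illustrative assumptions, not the paper's exact setup.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Hypothetical per-sentence features for one learner:
        # [total fixation time (s), regressions, mean fixation duration (ms),
        #  sentence length (tokens), mean word-frequency rank]
        X = np.array([
            [2.1, 3, 210.0, 18, 4200],
            [0.8, 0, 180.0,  9,  900],
            [3.4, 5, 260.0, 25, 6100],
            [1.0, 1, 190.0, 12, 1500],
        ])
        y = np.array([0, 1, 0, 1])  # 1 = sentence understood, 0 = not understood

        clf = make_pipeline(StandardScaler(), LogisticRegression())
        clf.fit(X, y)

        new_sentence = np.array([[2.8, 4, 240.0, 22, 5300]])
        if clf.predict(new_sentence)[0] == 0:
            # A sentence judged as not understood would be sent to an LLM
            # (GPT-3.5 in the paper) with a prompt asking for simpler
            # vocabulary and grammar while preserving meaning.
            print("trigger simplification for this sentence")
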
  • Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

    Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima

    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH   2023-August   3397 - 3401  2023

    Abstract

    This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map the input of two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.

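    The viseme-based objective described above can be sketched as a Kullback-Leibler term between the viseme distributions predicted for paired normal and silent clips of the same content. Tensor shapes, the pairing scheme, and the PyTorch formulation below are assumptions for illustration, not the paper's implementation.

        import torch
        import torch.nn.functional as F

        def viseme_kl_loss(logits_normal, logits_silent):
            """KL(P_normal || P_silent), averaged over the batch.

            logits_*: (batch, frames, num_visemes) outputs of a shared VSR
            backbone for paired clips with identical spoken content.
            """
            log_p_silent = F.log_softmax(logits_silent, dim=-1)
            p_normal = F.softmax(logits_normal, dim=-1)
            # F.kl_div expects log-probabilities as input and probabilities as target.
            return F.kl_div(log_p_silent, p_normal, reduction="batchmean")

        logits_n = torch.randn(2, 10, 40)                      # normal-speech branch output
        logits_s = torch.randn(2, 10, 40, requires_grad=True)  # silent-speech branch output
        loss = viseme_kl_loss(logits_n, logits_s)              # pulls the two closer in viseme space
        loss.backward()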

  • Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation

    Qi Feng, Hubert P.H. Shum, Shigeo Morishima

    Proceedings - 2023 IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2023     405 - 414  2023

    Abstract

    Pre-captured immersive environments using omnidirectional cameras provide a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention due to the difficulty of altering the perspective after finishing the capture process. To explore this influence, we first propose a pilot study that captures real environments with multiple eye heights and asks participants to judge the egocentric distances and immersion. If a significant influence is confirmed, an effective image-based approach to adapt pre-captured real-world environments to the user's eye height would be desirable. Motivated by the study, we propose a learning-based approach for synthesizing novel views for omnidirectional images with altered eye heights. This approach employs a multitask architecture that learns depth and semantic segmentation in two formats, and generates high-quality depth and semantic segmentation to facilitate the inpainting stage. With the improved omnidirectional-aware layered depth image, our approach synthesizes natural and realistic visuals for eye height adaptation. Quantitative and qualitative evaluation shows favorable results against state-of-the-art methods, and an extensive user study verifies improved perception and immersion for pre-captured real-world environments.

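    The geometric core of eye-height adaptation can be illustrated without any learning: lift each panorama pixel to 3D with its depth, move the camera vertically, and re-project. The sketch below does only this naive forward warp; the paper's multitask depth/segmentation network and layered-depth-image inpainting, which fill the resulting holes, are not reproduced, and the coordinate conventions are assumptions.

        import numpy as np

        def shift_eye_height(rgb, depth, dh):
            """rgb: (H, W, 3) panorama, depth: (H, W) metric depth, dh: height change in metres."""
            H, W, _ = rgb.shape
            v, u = np.mgrid[0:H, 0:W]
            lon = (u + 0.5) / W * 2.0 * np.pi - np.pi        # azimuth in [-pi, pi)
            lat = np.pi / 2.0 - (v + 0.5) / H * np.pi        # elevation in [-pi/2, pi/2]

            # Spherical to Cartesian (y up); raising the camera by dh moves
            # every scene point down by dh in camera coordinates.
            x = depth * np.cos(lat) * np.sin(lon)
            y = depth * np.sin(lat) - dh
            z = depth * np.cos(lat) * np.cos(lon)

            r = np.sqrt(x * x + y * y + z * z)
            new_lon = np.arctan2(x, z)
            new_lat = np.arcsin(np.clip(y / np.maximum(r, 1e-6), -1.0, 1.0))
            nu = np.clip(((new_lon + np.pi) / (2.0 * np.pi) * W).astype(int), 0, W - 1)
            nv = np.clip(((np.pi / 2.0 - new_lat) / np.pi * H).astype(int), 0, H - 1)

            # Nearest-neighbour forward splat: write far points first so nearer
            # points overwrite them; holes stay black (the paper fills them with
            # learned inpainting over a layered depth image).
            out = np.zeros_like(rgb)
            order = np.argsort(-r, axis=None)
            out[nv.ravel()[order], nu.ravel()[order]] = rgb.reshape(-1, 3)[order]
            return out

        pano = np.random.rand(64, 128, 3)
        flat_depth = np.full((64, 128), 2.0)                      # toy 2 m scene
        higher_view = shift_eye_height(pano, flat_depth, dh=0.3)  # as seen 0.3 m higher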

  • Pointing out Human Answer Mistakes in a Goal-Oriented Visual Dialogue

    Ryosuke Oshima, Seitaro Shinagawa, Hideki Tsunashima, Qi Feng, Shigeo Morishima

    Proceedings - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023     4665 - 4670  2023

    Abstract

    Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we focus on a setting where the agent points out unintentional mistakes for the interlocutor to review, better reflecting real-world situations. In this paper, we show that human answer mistakes depend on question type and QA turn in the visual dialogue by analyzing a previously unused data collection of human mistakes. We demonstrate the effectiveness of those factors for the model's accuracy in a pointing-human-mistake task through experiments using a simple MLP model and a Visual Language Model.

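    A lightweight model of the kind mentioned above can be sketched as an MLP over the question type and the QA turn. The number of question types, the feature encoding, and the layer sizes below are assumptions for illustration, not the paper's configuration.

        import torch
        import torch.nn as nn

        NUM_QUESTION_TYPES = 5   # e.g., colour / shape / spatial / object / other (assumed)
        MAX_TURNS = 10

        class MistakeMLP(nn.Module):
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(NUM_QUESTION_TYPES + 1, 32),
                    nn.ReLU(),
                    nn.Linear(32, 1),
                )

            def forward(self, qtype_onehot, turn_index):
                turn = turn_index.float().unsqueeze(-1) / MAX_TURNS  # normalise the turn
                x = torch.cat([qtype_onehot, turn], dim=-1)
                return self.net(x).squeeze(-1)                       # mistake logit

        model = MistakeMLP()
        qtype = nn.functional.one_hot(torch.tensor([2, 0]), NUM_QUESTION_TYPES).float()
        turn = torch.tensor([7, 1])
        prob_mistake = torch.sigmoid(model(qtype, turn))  # probability the answer is wrong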

  • 3D car shape reconstruction from a contour sketch using GAN and lazy learning

    Naoki Nozawa, Hubert P.H. Shum, Qi Feng, Edmond S.L. Ho, Shigeo Morishima

    Visual Computer   38 ( 4 ) 1317 - 1330  April 2022

    Abstract

    3D car models are heavily used in computer games, visual effects, and even automotive designs. As a result, producing such models with minimal labour costs is increasingly important. To tackle the challenge, we propose a novel system to reconstruct a 3D car using a single sketch image. The system learns from a synthetic database of 3D car models and their corresponding 2D contour sketches and segmentation masks, allowing effective training with minimal data collection cost. The core of the system is a machine learning pipeline that combines the use of a generative adversarial network (GAN) and lazy learning. GAN, being a deep learning method, is capable of modelling complicated data distributions, enabling the effective modelling of a large variety of cars. Its major weakness is that as a global method, modelling the fine details in the local region is challenging. Lazy learning works well to preserve local features by generating a local subspace with relevant data samples. We demonstrate that the combined use of GAN and lazy learning is able to produce high-quality results, in which different types of cars with complicated local features can be generated effectively with a single sketch. Our method outperforms existing ones using other machine learning structures such as the variational autoencoder.

    Citations (Scopus): 16
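
    The lazy-learning component described above can be sketched as a nearest-neighbour retrieval that builds a local subspace around the query sketch and blends the retrieved samples' latent codes. The feature extractor, the GAN decoder, and the distance weighting below are placeholders, not the paper's implementation.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)
        train_features = rng.normal(size=(500, 128))   # e.g., contour descriptors of training sketches
        train_latents = rng.normal(size=(500, 64))     # latent codes of the corresponding 3D car models

        knn = NearestNeighbors(n_neighbors=8).fit(train_features)

        def local_latent(query_feature):
            """Distance-weighted blend of the k nearest training latents (the local subspace)."""
            dist, idx = knn.kneighbors(query_feature[None, :])
            w = 1.0 / (dist[0] + 1e-6)
            w /= w.sum()
            return (w[:, None] * train_latents[idx[0]]).sum(axis=0)

        z_local = local_latent(rng.normal(size=128))
        # z_local would then be decoded (or used to condition the GAN generator)
        # to recover local details that a purely global model tends to smooth out.
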
  • 360 Depth Estimation in the Wild - The Depth360 Dataset and the SegFuse Network

    Qi Feng, Hubert P.H. Shum, Shigeo Morishima

    Proceedings - 2022 IEEE Conference on Virtual Reality and 3D User Interfaces, VR 2022     664 - 673  2022

    Abstract

    Single-view depth estimation from omnidirectional images has gained popularity with its wide range of applications such as autonomous driving and scene reconstruction. Although data-driven learning-based methods demonstrate significant potential in this field, scarce training data and ineffective 360 estimation algorithms are still two key limitations hindering accurate estimation across diverse domains. In this work, we first establish a large-scale dataset with varied settings called Depth360 to tackle the training data problem. This is achieved by exploring the use of a plenteous source of data, 360 videos from the internet, using a test-time training method that leverages unique information in each omnidirectional sequence. With novel geometric and temporal constraints, our method generates consistent and convincing depth samples to facilitate single-view estimation. We then propose an end-to-end two-branch multi-task learning network, SegFuse, that mimics the human eye to effectively learn from the dataset and estimate high-quality depth maps from diverse monocular RGB images. With a peripheral branch that uses equirectangular projection for depth estimation and a foveal branch that uses cubemap projection for semantic segmentation, our method predicts consistent global depth while maintaining sharp details at local regions. Experimental results show favorable performance against the state-of-the-art methods.

    Citations (Scopus): 12
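
    The two-branch idea can be sketched as a skeleton network: a peripheral branch regresses depth from the equirectangular image while a foveal branch segments cubemap faces, and the two are trained jointly. Backbones, fusion between branches, and losses are simplified assumptions rather than the actual SegFuse architecture.

        import torch
        import torch.nn as nn

        class TwoBranchNet(nn.Module):
            def __init__(self, num_classes=13):
                super().__init__()
                self.depth_branch = nn.Sequential(             # peripheral / equirectangular input
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 3, padding=1),
                )
                self.seg_branch = nn.Sequential(               # foveal / cubemap faces
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, num_classes, 3, padding=1),
                )

            def forward(self, equirect, cube_faces):
                depth = self.depth_branch(equirect)            # (B, 1, H, 2H)
                B, F_, C, Hc, Wc = cube_faces.shape            # (B, 6, 3, Hc, Wc)
                seg = self.seg_branch(cube_faces.view(B * F_, C, Hc, Wc))
                return depth, seg.view(B, F_, -1, Hc, Wc)

        net = TwoBranchNet()
        depth, seg = net(torch.randn(1, 3, 256, 512), torch.randn(1, 6, 3, 128, 128))
        # Training would combine a depth regression loss (e.g., L1 on valid pixels)
        # with a cross-entropy segmentation loss, so the two tasks regularise each other.
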
  • Audio–visual object removal in 360-degree videos

    Ryo Shimamura, Qi Feng, Yuki Koyama, Takayuki Nakatsuka, Satoru Fukayama, Masahiro Hamasaki, Masataka Goto, Shigeo Morishima

    The Visual Computer   36 ( 10-12 ) 2117 - 2128  October 2020

    Abstract

    We present a novel concept, audio–visual object removal in 360-degree videos, in which a target object in a 360-degree video is removed in both the visual and auditory domains synchronously. Previous methods have solely focused on the visual aspect of object removal using video inpainting techniques, resulting in videos with unreasonable remaining sounds corresponding to the removed objects. We propose a solution which incorporates direction acquired during the video inpainting process into the audio removal process. More specifically, our method identifies the sound corresponding to the visually tracked target object and then synthesizes a three-dimensional sound field by subtracting the identified sound from the input 360-degree video. We conducted a user study showing that our multi-modal object removal supporting both visual and auditory domains could significantly improve the virtual reality experience, and our method could generate sufficiently synchronous, natural and satisfactory 360-degree videos.

    Citations (Scopus): 6
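
    The audio side of the removal can be sketched under strong simplifying assumptions: given the identified (dry) object sound and its direction from visual tracking, encode it into first-order ambisonics (ACN/SN3D conventions assumed here) and subtract it from the recorded sound field. The paper's sound identification and sound-field synthesis are considerably more involved than this.

        import numpy as np

        def encode_foa(signal, azimuth, elevation):
            """Encode a mono signal at a fixed direction into FOA channels (W, Y, Z, X)."""
            w = signal
            y = signal * np.sin(azimuth) * np.cos(elevation)
            z = signal * np.sin(elevation)
            x = signal * np.cos(azimuth) * np.cos(elevation)
            return np.stack([w, y, z, x], axis=0)

        def remove_source(foa_mix, source_signal, azimuth, elevation):
            """Subtract the encoded target source from a (4, T) FOA recording."""
            return foa_mix - encode_foa(source_signal, azimuth, elevation)

        t = np.linspace(0, 1, 48000, endpoint=False)
        target = 0.3 * np.sin(2 * np.pi * 440 * t)                   # estimated object sound
        mix = encode_foa(target, np.pi / 4, 0.0) + 0.01 * np.random.randn(4, t.size)
        residual = remove_source(mix, target, np.pi / 4, 0.0)        # object largely removed
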
  • Resolving hand‐object occlusion for mixed reality with joint deep learning and model optimization

    Qi Feng, Hubert P. H. Shum, Shigeo Morishima

    Computer Animation and Virtual Worlds   31 ( 4-5 )  July 2020

    Abstract

    By overlaying virtual imagery onto the real world, mixed reality facilitates diverse applications and has drawn increasing attention. Enhancing physical in-hand objects with a virtual appearance is a key component for many applications that require users to interact with tools such as surgery simulations. However, due to complex hand articulations and severe hand-object occlusions, resolving occlusions in hand-object interactions is a challenging topic. Traditional tracking-based approaches are limited by strong ambiguities from occlusions and changing shapes, while reconstruction-based methods show a poor capability of handling dynamic scenes. In this article, we propose a novel real-time optimization system to resolve hand-object occlusions by spatially reconstructing the scene with estimated hand joints and masks. To acquire accurate results, we propose a joint learning process that shares information between two models and jointly estimates hand poses and semantic segmentation. To facilitate the joint learning system and improve its accuracy under occlusions, we propose an occlusion-aware RGB-D hand data set that mitigates the ambiguity through precise annotations and photorealistic appearance. Evaluations show more consistent overlays compared with the literature, and a user study verifies a more realistic experience.

    Citations (Scopus): 6
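
    The joint-learning idea above can be sketched as a shared encoder with a segmentation head and a pose head, where the predicted hand mask is fed back to the pose head so the two tasks inform each other. Layer sizes, the RGB-D input format, and the way information is shared are illustrative assumptions.

        import torch
        import torch.nn as nn

        class JointHandNet(nn.Module):
            def __init__(self, num_joints=21):
                super().__init__()
                self.num_joints = num_joints
                self.encoder = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU())  # RGB-D input
                self.seg_head = nn.Conv2d(32, 1, 1)                                      # hand-mask logits
                self.pose_head = nn.Sequential(
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(33, num_joints * 3)
                )

            def forward(self, rgbd):
                feat = self.encoder(rgbd)
                mask = self.seg_head(feat)
                # Share information between the two tasks: the pose head sees the
                # predicted hand mask alongside the shared features.
                joints = self.pose_head(torch.cat([feat, torch.sigmoid(mask)], dim=1))
                return mask, joints.view(-1, self.num_joints, 3)

        mask, joints = JointHandNet()(torch.randn(2, 4, 64, 64))   # two RGB-D crops
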
  • Foreground-aware Dense Depth Estimation for 360 Images

    Qi Feng, Hubert P. H. Shum, Ryo Shimamura, Shigeo Morishima

    Journal of WSCG   28 ( 1-2 ) 79 - 88  2020

    Abstract

    With 360 imaging devices becoming widely accessible, omnidirectional content has gained popularity in multiple fields. The ability to estimate depth from a single omnidirectional image can benefit applications such as robotics navigation and virtual reality. However, existing depth estimation approaches produce sub-optimal results on real-world omnidirectional images with dynamic foreground objects. On the one hand, capture-based methods cannot obtain the foreground due to the limitations of the scanning and stitching schemes. On the other hand, it is challenging for synthesis-based methods to generate highly-realistic virtual foreground objects that are comparable to the real-world ones. In this paper, we propose to augment datasets with realistic foreground objects using an image-based approach, which produces a foreground-aware photorealistic dataset for machine learning algorithms. By exploiting a novel scale-invariant RGB-D correspondence in the spherical domain, we repurpose abundant non-omnidirectional datasets to include realistic foreground objects with correct distortions. We further propose a novel auxiliary deep neural network to estimate both the depth of the omnidirectional images and the mask of the foreground objects, where the two tasks facilitate each other. A new local depth loss considers small regions of interest and ensures that their depth estimations are not smoothed out during the global gradient's optimization. We demonstrate the system using humans as the foreground due to their complexity and contextual importance, while the framework can be generalized to any other foreground objects. Experimental results demonstrate more consistent global estimations and more accurate local estimations compared with state-of-the-art methods.

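    The local depth loss mentioned above can be sketched as an L1 error computed per region of interest (for example, per foreground instance) and then averaged with equal weight per region, so small regions are not smoothed out by the global objective. The per-region weighting below is an assumption.

        import torch

        def local_depth_loss(pred, gt, instance_masks):
            """pred, gt: (H, W) depth maps; instance_masks: list of (H, W) boolean masks."""
            losses = []
            for m in instance_masks:
                if m.any():
                    losses.append((pred[m] - gt[m]).abs().mean())   # per-ROI L1 error
            if not losses:
                return pred.new_tensor(0.0)
            return torch.stack(losses).mean()                       # equal weight per ROI

        pred = torch.rand(64, 128, requires_grad=True)
        gt = torch.rand(64, 128)
        masks = [torch.zeros(64, 128, dtype=torch.bool) for _ in range(2)]
        masks[0][10:20, 30:40] = True     # e.g., one person in the panorama
        masks[1][40:50, 90:110] = True    # another foreground region
        loss = local_depth_loss(pred, gt, masks)
        loss.backward()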

  • Resolving occlusion for 3D object manipulation with hands in mixed reality

    Qi Feng, Hubert P.H. Shum, Shigeo Morishima

    Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST    November 2018

    Abstract

    Due to the need to interact with virtual objects, hand-object interaction has become an important element in mixed reality (MR) applications. In this paper, we propose a novel approach to handle the occlusion of augmented 3D object manipulation with hands by exploiting the nature of hand poses combined with tracking-based and model-based methods, achieving a complete mixed reality experience without the need for heavy computation, complex manual segmentation, or special gloves. The experimental results show a faster-than-real-time frame rate and high accuracy of the rendered virtual appearances, and a user study verifies a more immersive experience compared to past approaches. We believe that the proposed method can improve a wide range of mixed reality applications that involve hand-object interactions.

    Citations (Scopus): 9
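
    The per-pixel occlusion test underlying this line of work can be sketched directly: draw the virtual object only where its depth is closer to the camera than the estimated hand depth, so real fingers in front of the object remain visible. Estimating the hand depth and mask robustly is the hard part addressed by the papers above and is not reproduced here.

        import numpy as np

        def composite(frame, hand_depth, virt_rgb, virt_depth, virt_mask):
            """All inputs are (H, W[, 3]) arrays; depths in metres, inf where empty."""
            show_virtual = virt_mask & (virt_depth < hand_depth)
            out = frame.copy()
            out[show_virtual] = virt_rgb[show_virtual]
            return out

        H, W = 120, 160
        frame = np.zeros((H, W, 3), dtype=np.uint8)                    # camera image (toy data)
        hand_depth = np.full((H, W), np.inf); hand_depth[40:80, 60:100] = 0.4   # hand at 0.4 m
        virt_rgb = np.full((H, W, 3), 200, dtype=np.uint8)
        virt_depth = np.full((H, W), np.inf)
        virt_mask = np.zeros((H, W), dtype=bool)
        virt_mask[30:90, 50:110] = True; virt_depth[virt_mask] = 0.6            # object at 0.6 m
        result = composite(frame, hand_depth, virt_rgb, virt_depth, virt_mask)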
