Researcher Details - Keiji Kimura

Keiji Kimura (木村　啓二, キムラ　ケイジ)

Scopus Publication Information

Papers: 82 Citations: 392 h-index: 10

The data was downloaded from the Scopus API on July 21, 2026, via http://api.elsevier.com and http://www.scopus.com.

Google Scholar Information (Citations per year)

Citations: 1326 h-index: 18 i10-index: 34

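Affiliation

Faculty of Science and Engineering, School of Fundamental Science and Engineering

Job title

Professor

Degree

Doctor of Engineering (Waseda University)

Homepage

http://www.apal.cs.waseda.ac.jp/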

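Research Experience

2012 -
Professor, Department of Computer Science and Engineering, Faculty of Science and Engineering, Waseda University

2005 - 2012
Associate Professor, Department of Computer and Network Engineering, School of Science and Engineering, Waseda University

2004 - 2005
Lecturer (full-time), Department of Computer and Network Engineering, School of Science and Engineering, Waseda University

2002 - 2004
Visiting Lecturer (treated as full-time), Advanced Research Institute for Science and Engineering, Waseda University

1999 - 2002
Research Associate, Department of Electrical, Electronics and Computer Engineering, School of Science and Engineering, Waseda University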

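Education

April 1998 - March 2001
Doctoral Program, Department of Electrical Engineering, Graduate School of Science and Engineering, Waseda University
Doctor of Engineering

April 1996 - March 1999
Master's Program, Department of Electrical Engineering, Graduate School of Science and Engineering, Waseda University
Master of Engineering

April 1992 - March 1996
Department of Electrical Engineering, School of Science and Engineering, Waseda University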

Committee Memberships

April 2022 - October 2022
The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022)

2021 -
The 30th International Conference on Parallel Architectures and Compilation Techniques (PACT 2021)

2021 -
The 34th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2021)

2021 -
ACM Principles and Practice of Parallel Programming 2021 (PPoPP 2021), Extended Review Committee

2020 -
The 26th IEEE International Symposium on High-Performance Computer Architecture Program Committee

2018 - 2020
IEEE International Parallel & Distributed Processing Symposium (IPDPS 2018-2020) Program Committee

2019 -
The 37th IEEE International Conference on Computer Design (ICCD 2019) Program Track Chair (Processor Architecture)

2019 -
24th Asia and South Pacific Design Automation Conference (ASP-DAC 2019) Program Committee (On-chip Communication and Networks-on-Chip)

2018 -
Principles and Practice of Parallel Programming 2018 (PPoPP 2018) Publicity Chair

2018 -
IEEE COMPSAC 2018 Computer Architecture and Platforms Co-Chairs

2016 -
The 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS 2016) Program Vice Chair (Parallel / Distributed Algorithms and Applications)

2016 -
The 45th International Conference on Parallel Processing (ICPP-2016) Program Committee (Programming Models, Languages and Compilers)

2016 -
The 3rd International Workshop on Software Engineering for Parallel Systems (SEPS 2016) Program Committee

2015 -
The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT 2015) Program Committee

2015 -
27th International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2015) Program Committee (Software Track)

2015 -
15th International Symposium on High-Performance Computer Architecture (HPCA-15) Publicity Co-Chairs

April 2010 - March 2014
Information Processing Society of Japan (IPSJ), Special Interest Group on Computer Architecture, Secretary

2014 -
The 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS) Program Committee

2011 - 2014
The 24th-27th International Workshop on Languages and Compilers for Parallel Computing (LCPC) Program Committee, Program Chair (2012)

April 2010 - March 2013
IPSJ Special Interest Group on Embedded Systems, Steering Committee Member

2013 -
The 13th International Forum on Embedded MPSoC and Multicore (MPSoC2013) Finance Co-Chairs

2013 -
The 27th International Conference on Supercomputing (ICS 2013) Program Committee

2009 - 2013
IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips XII-XVII) Program Committee

2009 - 2013
XXVII-XXXII IEEE International Conference on Computer Design (ICCD) Program Committee (Computer System Design and Application Track)

2012 -
The 12th International Forum on Embedded MPSoC and Multicore (MPSoC2012) Program Co-Chairs

2011 -
Advanced Parallel Processing Technology Symposium (APPT) Program Committee

2011 -
The 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Program Committee (Multicore Computing and Parallel / Distributed Architecture)

April 2008 - March 2010
IPSJ Special Interest Group on Computer Architecture, Steering Committee Member

2010 -
22nd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD) Program Committee (System Software Track)

2010 -
IEEE International Symposium on Workload Characterization (IISWC-2010) Program Committee

April 2005 - March 2009
IPSJ Magazine Editorial Committee Member, SWG

April 2005 - March 2009
IPSJ Special Interest Group on System LSI Design Methodology (SLDM), Steering Committee Member

2005 - March 2009
IPSJ Transactions on Advanced Computing Systems (ACS), Editorial Committee

2009 -
The 38th International Conference on Parallel Processing (ICPP-2009) Program Committee (Programming Models, Languages and Compilers)

2006 - 2008
IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX-XI) Program Committee Vice Chair

2006 - 2008
IPSJ ComSys (Computer System Symposium) Program Committee

2007 -
IPSJ DA Symposium University Chair

2007 -
IPSJ SACSIS (Symposium on Advanced Computing Systems and Infrastructures) Program Committee Vice Chair

2006 -
IPSJ SACSIS (Symposium on Advanced Computing Systems and Infrastructures), 2008-2013 Program Committee

2003 - 2006
Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP), Organizing Committee Member

April 2001 - March 2005
IPSJ Special Interest Group on System Software and Operating Systems, Steering Committee Member

April 2001 - March 2005
IPSJ Magazine Editorial Committee Member, BWG (chief in the final year)

2004 -
SACSIS (Symposium on Advanced Computing Systems and Infrastructures) Finance Chair and Program Committee Member

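Professional Memberships

ACM

IEEE Computer Society

The Institute of Electronics, Information and Communication Engineers (IEICE)

Information Processing Society of Japan (IPSJ)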

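Research Areas

Computer systems

Research Keywords

Parallel computers, parallelizing compilers, computer science, secure computer systems

Awards

The Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology, Prizes for Science and Technology (Research Category)

April 2014, Ministry of Education, Culture, Sports, Science and Technology (MEXT)

Papers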

Towards GPU Passthrough in Intel TDX: Design Challenges and Early Baselines

Yoshi Sato, Hidetoshi Uranami, Akihiro Saiki, Keiji Kimura

2025 IEEE Conference on Dependable, Autonomic and Secure Computing (DASC) 144 - 146 October 2025 [Refereed]

Role: Last author

Enclave Application Cache for RISC-V Keystone

Takumu Umezawa, Akihiro Saiki, Keiji Kimura

2025 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW) 422 - 428 June 2025 [Refereed]

Role: Last author

Efficient Memory Protection Method for Large-Scale Host-Enclave Data Transfer on Keystone Enclave

Akihiro SAIKI, Keiji KIMURA

IEICE Transactions on Information and Systems 2025

Parallel Verification in RISC-V Secure Boot

Akihiro Saiki, Yu Omori, Keiji Kimura

2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) December 2023 [Refereed]

Role: Last author

Automatic Deep Learning Parallelization for Vector Multicore Chips with the OSCAR Parallelizing and the TVM Open-Source Deep Learning Compiler.

Fumiaki Onishi, Ryosei Otaka, Kazuki Fujita, Tomoki Suetsugu, Tohma Kawasumi, Toshiaki Kitamura, Hironori Kasahara, Keiji Kimura

LCPC 96 - 110 2023

Parallelizing Factory Automation Ladder Programs by OSCAR Automatic Parallelizing Compiler

Tohma Kawasumi, Tsumura Yuta, Hiroki Mikami, Tomoya Yoshikawa, Takero Hosomi, Shingo Oidate, Keiji Kimura, Hironori Kasahara

Proc. of the 35th International Workshop on Languages and Compilers for Parallel Computing (LCPC2022) October 2022 [Refereed]

Open-Source Hardware Memory Protection Engine Integrated With NVMM Simulator

Yu Omori, Keiji Kimura

IEEE Computer Architecture Letters 21 ( 2 ) 77 - 80 August 2022 [Refereed]

Role: Last author

Data stream clustering for low-cost machines

166 57 - 70 August 2022 [Refereed]

Citations (Scopus): 4

Open-Source RISC-V Linux-Compatible NVMM Emulator

Yu Omori, Keiji Kimura

Sixth Workshop on Computer Architecture Research with RISC-V (CARRV 2022) June 2022 [Refereed]

Role: Last author

Lightweight Array Contraction by Trace-Based Polyhedral Analysis

Hugo Thievenaz, Keiji Kimura, Christophe Alias

C3PO'22: Compiler-assisted Correctness Checking and Performance Optimization for HPC June 2022 [Refereed]

Rephrasing polyhedral optimizations with trace analysis

Hugo Thievenaz, Keiji Kimura, Christophe Alias

12th International Workshop on Polyhedral Compilation Techniques (IMPACT 2022) June 2022 [Refereed]

Trends in Parallelization Technology for Embedded Systems (組込みシステムにおける並列化技術動向)

Keiji Kimura, Dan Umeda, Hironori Kasahara

Systems, Control and Information (システム/制御/情報) 66 ( 1 ) 2022

Accelerating Data Dependence Profiling Through Abstract Interpretation of Loop Instructions

10 31626 - 31640 2022 [Refereed]

OSCAR Parallelizing and Power Reducing Compiler and API for Heterogeneous Multicores (Invited Paper)

November 2021 [Refereed] [Invited]

Parallelizing Compiler Translation Validation Using Happens-Before and Task-Set

November 2021 [Refereed]

Performance Evaluation of OSCAR Multi-target Automatic Parallelizing Compiler on Intel, AMD, Arm and RISC-V Multicores

Birk M. Magnussen, Tohma Kawasumi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

LCPC2021 October 2021 [Refereed]

Durable Queue Implementations Built on a Formally Defined Strand Persistency Model

29 823 - 838 2021 [Refereed]

Role: Last author

Secure Image Inference Using Pairwise Activation Functions

Jonas T. Agyepong, Mostafa Soliman, Yasutaka Wada, Keiji Kimura, Ahmed El-Mahdy

IEEE Access 9 118271 - 118290 2021 [Refereed]

Non-Volatile Main Memory Emulator for Embedded Systems Employing Three NVMM Behaviour Models

Yu OMORI, Keiji KIMURA

IEICE Transactions on Information and Systems E104-D ( 5 ) 697 - 708 2021 [Refereed]

Role: Last author

Scalable and Fast Lazy Persistency on GPUs

Ardhi Wiratama, Baskara Yudha, Keiji Kimura, Huiyang Zhou, Yan Solihin

2020 IEEE International Symposium on Workload Characterization (IISWC 2020) 252 - 263 October 2020 [Refereed]

Local Memory Mapping of Multicore Processors on an Automatic Parallelizing Compiler

Yoshitake OKI, Yuto ABE, Kazuki YAMAMOTO, Kohei YAMAMOTO, Tomoya SHIRAKAWA, Akimasa YOSHIDA, Keiji KIMURA, Hironori KASAHARA

IEICE Transactions on Electronics E103-C ( 3 ) 98 - 109 March 2020 [Refereed]

Compiler Software Coherent Control for Embedded High Performance Multicore

Boma A. ADHI, Tomoya KASHIMATA, Ken TAKAHASHI, Keiji KIMURA, Hironori KASAHARA

IEICE Transactions on Electronics E103-C ( 3 ) 85 - 97 March 2020 [Refereed]

Compiler-support for Critical Data Persistence in NVM

Reem Elkhouly, Mohammad Alshboul, Akihiro Hayashi, Yan Solihin, Keiji Kimura

ACM Transactions on Architecture and Code Optimization (TACO) 16 ( 4 ) December 2019 [Refereed]

Role: Last author

Software Cache Coherent Control by Parallelizing Compiler

Boma A. Adhi, Masayoshi Mase, Yuhei Hosokawa, Yohei Kishimoto, Taisuke Onishi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11403 17 - 25 November 2019 [Refereed]

Cascaded DMA Controller for Speedup of Indirect Memory Access in Irregular Applications

Tomoya Kashimata, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

9th Workshop on Irregular Applications: Architectures and Algorithms 71 - 76 November 2019 [Refereed]

Performance of Static and Dynamic Task Scheduling for Real-Time Control System on Embedded Multicore Processor

Yoshitake Oki, Hiroki Mikami, Hikaru Nishida, Dan Umeda, Keiji Kimura, Hironori Kasahara

32nd International Workshop on Languages and Compilers for Parallel Computing (LCPC) October 2019 [Refereed]

Performance Evaluation on NVMM Emulator Employing Fine-Grain Delay Injection

Yu Omori, Keiji Kimura

The 8th IEEE Non-Volatile Memory Systems and Applications Symposium (IEEE NVMSA 2019) 1 - 6 August 2019 [Refereed]

Role: Last author

Citations (Scopus): 3

Fast and Highly Optimizing Separate Compilation for Automatic Parallelization

Tohma Kawasumi, Ryota Tamura, Yuya Asada, Jixin Han, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

The 2019 International Conference on High Performance Computing & Simulation (HPCS 2019) 478 - 485 July 2019 [Refereed]

Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory

Mohammad Alshboul, Hussein Elnawawy, Reem Elkhouly, Keiji Kimura, James Tuck, Yan Solihin

ACM Transactions on Architecture and Code Optimization (TACO) 16 ( 2 ) May 2019 [Refereed]

Multicore Cache Coherence Control by a Parallelizing Compiler

Hironori Kasahara, Keiji Kimura, Boma A. Adhi, Yuhei Hosokawa, Yohei Kishimoto, Masayoshi Mase

Proceedings - International Computer Software and Applications Conference 1 492 - 497 September 2017 [Refereed]

Abstract

Recent developments in multicore technology have enabled processors with hundreds or thousands of cores. However, on such multicore processors, an efficient hardware cache coherence scheme becomes very complex and expensive to develop. This paper proposes a parallelizing-compiler-directed software coherence scheme for shared memory multicore systems without hardware cache coherence control. The general idea of the proposed method is that an automatic parallelizing compiler analyzes the control dependencies and data dependencies among coarse-grain tasks in the program. Then, based on the obtained information, task parallelization, false sharing detection, and data restructuring to prevent false sharing are performed. Next, the compiler inserts cache control code to handle the stale data problem. The proposed method is built on the OSCAR automatic parallelizing compiler and evaluated on the Renesas RP2 processor with 8 SH-4A cores. The hardware cache coherence scheme on the RP2 processor is only available for up to 4 cores, and hardware cache coherence can be completely turned off for a non-coherent cache mode. Performance evaluation is performed using 10 benchmark programs from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB), and MediaBench II. The proposed method performs as well as or better than the hardware cache coherence scheme. For example, 4 cores with the hardware coherence mechanism gave speedups of 2.52 times against 1 core for SPEC2000 'equake', 2.9 times for SPEC2006 'lbm', 3.34 times for NPB 'cg', and 3.17 times for the MediaBench II MPEG2 Encoder. The proposed software cache coherence control gave speedups of 2.63 times for 4 cores and 4.37 times for 8 cores for 'equake', 3.28 times for 4 cores and 4.76 times for 8 cores for 'lbm', and 3.71 times for 4 cores and 4.92 times for 8 cores for 'MPEG2 Encoder'.

Citations (Scopus): 9

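As a reading aid for the speedups quoted in the abstract above, parallel efficiency (speedup divided by core count) can be computed with the short Python sketch below. The names `REPORTED_SPEEDUPS`, `parallel_efficiency`, and `efficiency_table` are illustrative assumptions and do not come from the paper; the numbers are the software-coherence speedups as stated in the abstract.

```python
# Speedups under the proposed software coherence control, as quoted in the
# abstract: benchmark -> {core count: speedup versus 1 core}.
REPORTED_SPEEDUPS = {
    "equake": {4: 2.63, 8: 4.37},
    "lbm": {4: 3.28, 8: 4.76},
    "MPEG2 Encoder": {4: 3.71, 8: 4.92},
}


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return speedup per core; 1.0 would mean ideal linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def efficiency_table(speedups):
    """Map each benchmark to {core count: efficiency rounded to 3 places}."""
    return {
        name: {cores: round(parallel_efficiency(s, cores), 3)
               for cores, s in by_cores.items()}
        for name, by_cores in speedups.items()
    }


if __name__ == "__main__":
    # Print one row per benchmark, e.g. the 4-core and 8-core efficiencies.
    for name, row in efficiency_table(REPORTED_SPEEDUPS).items():
        print(name, row)
```

Read this way, the quoted figures show efficiency falling when moving from 4 to 8 cores for every listed benchmark, even though the absolute speedup keeps rising.
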
Automatic Local Memory Management for Multicores Having Global Address Space

Kouhei Yamamoto, Tomoya Shirakawa, Yoshitake Oki, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

Languages and Compilers for Parallel Computing, LCPC 2016 10136 282 - 296 2017 [Refereed]

Abstract

Embedded multicore processors for hard real-time applications like automobile engine control require the usage of local memory on each processor core to precisely meet the real-time deadline constraints, since cache memory cannot satisfy the deadline requirements due to cache misses. To utilize local memory, programmers or compilers need to explicitly manage data movement and data replacement for local memory considering the limited size. However, such management is extremely difficult and time consuming for programmers. This paper proposes an automatic local memory management method by compilers through (i) multi-dimensional data decomposition techniques to fit working sets onto limited size local memory (ii) suitable block management structures, called Adjustable Blocks, to create application specific fixed size data transfer blocks (iii) multi-dimensional templates to preserve the original multi-dimensional representations of the decomposed multi-dimensional data that are mapped onto one-dimensional Adjustable Blocks (iv) block replacement policies from liveness analysis of the decomposed data, and (v) code size reduction schemes to generate shorter codes. The proposed local memory management method is implemented on the OSCAR multi-grain and multi-platform compiler and evaluated on the Renesas RP2 8 core embedded homogeneous multicore processor equipped with local and shared memory. Evaluations on 5 programs including multimedia and scientific applications show promising results. For instance, speedups on 8 cores compared to single core execution using off-chip shared memory on an AAC encoder program, a MPEG2 encoder program, Tomcatv, and Swim are improved from 7.14 to 20.12, 1.97 to 7.59, 5.73 to 7.38, and 7.40 to 11.30, respectively, when using local memory with the proposed method. These evaluations indicate the usefulness and the validity of the proposed local memory management method on real embedded multicore processors.

Citations (Scopus): 2

Architecture design for the environmental monitoring system over the winter season

Koichiro Yamashita, Chen Ao, Takahisa Suzuki, Yi Xu, Hongchun Li, Jun Tian, Keiji Kimura, Hironori Kasahara

MobiWac 2016 - Proceedings of the 14th ACM International Symposium on Mobility Management and Wireless Access, co-located with MSWiM 2016 27 - 34 November 2016 [Refereed]

Abstract

One of the applications as a source of big data, there is a sensor network for the environmental monitoring that is designed to detect the deterioration of the infrastructure, erosion control and so on. The specific targets are bridges, buildings, slopes and embankments due to the natural disasters or aging. Basic requirement of this monitoring system is to collect data over a long period of time from a large number of nodes that installed in a wide area. However, in order to apply a wireless sensor network (WSN), using wireless communication and energy harvesting, there are not many cases in the actual monitoring system design. Because of the system must satisfy various conditions measurement location and time specified by the civil engineering communication quality and topology obtained from the network technology the electrical engineering to solve the balance of weather environment and power consumption that depends on the above-mentioned conditions. We propose the whole WSN design methodology especially for the electrical architecture that is affected by the network behavior and the environmental disturbance. It is characterized by determining recursively mutual trade-off of a wireless simulation and a power architecture simulation of the node devices. Furthermore, the system allows the redundancy of the design. In addition, we deployed the actual slope monitoring WSN that is designed by the proposed method to the snow-covered area. A conventional similar monitoring WSN, with 7 Ah Li-battery, it worked only 129 days in a mild climate area. On the other hand, our proposed system, deployed in the heavy snow area has been working more than 6 months (still working) with 3.2 Ah batteries. Finally, it made a contribution to the civil engineering succeeded in the real time observation of the groundwater level displacement at the time of melting snow in the spring season.

Citations (Scopus): 2

Reducing parallelizing compilation time by removing redundant analysis

Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, Moriyuki Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, Keiji Kimura, Hironori Kasahara

SEPS 2016 - Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, co-located with SPLASH 2016 1 - 9 October 2016 [Refereed]

Abstract

Parallelizing compilers employing powerful compiler optimizations are essential tools to fully exploit performance from today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time for such powerful compilers becomes larger and larger for real commercial application due to these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea of the proposed technique is based on an observation that parallelizing compilers apply multiple program analysis passes and restructuring passes to a source program but all program analysis passes do not have to be applied to the whole source program. Thus, there is an opportunity for compilation time reduction by removing redundant program analysis. We describe the removing redundant program analysis techniques considering the inter-procedural propagation of analysis update information in this paper. We implement the proposed technique into OSCAR automatically multigrain parallelizing compiler. We then evaluate the proposed technique by using three proprietary large scale programs. The proposed technique can remove 37.7% of program analysis time on average for basic analysis includes def-use analysis and dependence calculation, and 51.7% for pointer analysis, respectively.

Citations (Scopus): 2

An Android Systrace Extension for Tracing Wakelocks

Bui Duc Binh, Keiji Kimura

IEEE International Conference on Embedded and Ubiquitous Computing (EUC 2016) 146 - 149 August 2016 [Refereed]

Role: Corresponding author

Android Video Processing System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

IPSJ Journal (情報処理学会論文誌) 57 ( 4 ) April 2016

Abstract

The emergence of multi-core processors in smart devices promises higher performance and low power consumption. The parallelization of applications enables us to improve their performance. However, simultaneously utilizing many cores would drastically drain the device battery life. This paper shows a demonstration system of real-time video processing combined with power reduction controlled by the OSCAR automatic parallelization compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime with 4 ARM Cortex-A9 cores. In this paper, we exploited the DVFS framework, core partitioning, a profiling technique, and the OSCAR parallelization and power control algorithm to reduce the total consumption in a real-time video application. The demonstration results show that it can cut power consumption by 42.8% for the MPEG-2 Decoder application and 59.8% for the Optical Flow application by using 3 cores in both applications.

This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing Vol.24 (2016) No.3 (online).

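Multigrain Parallel Processing on Multicores Using Profile Information for Embedded Model-Based Development Applications (組み込み向けモデルベース開発アプリケーションのプロファイル情報を用いたマルチコア用マルチグレイン並列処理)

Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ Journal (情報処理学会論文誌) 57 ( 2 ) 1 - 12 February 2016 [Refereed]

Abstract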

Model-based development tools such as MATLAB/Simulink have recently become popular for developing embedded systems. As the developed models grow more complex, demand is rising for higher performance and lower power consumption when applications developed with such tools run on multicores. Several researchers have therefore proposed parallel processing of these applications that exploits parallelism among the blocks in a model. However, no method has been proposed that exploits the parallelism of the whole application, within blocks as well as among them. This paper proposes multigrain parallelization of C programs generated by Embedded Coder from MATLAB/Simulink, using the OSCAR automatic parallelizing compiler, which generates a parallelized C program from a sequential C program. The method exploits both coarse-grain task parallelism among blocks and loop parallelism inside blocks that contain vector operations or user-customized code. It also uses profiling information obtained on Simulink for task scheduling to improve scheduling accuracy. Using six cores of a Xeon X5670, the proposed method attains speedups over sequential execution of 4.21 times for a road tracking application, 5.80 times for a blood vessel extraction application, and 4.10 times for an anomaly detection application. For the road tracking application, it also attains a 4.81 times speedup relative to the worst-case sequential execution time.

Android video processing system combined with automatically parallelized and power optimized code by OSCAR compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

Journal of Information Processing 24 ( 3 ) 504 - 511 2016 [Refereed]

Abstract

The emergence of multi-core processors in smart devices promises higher performance and low power consumption. The parallelization of applications enables us to improve their performance. However, simultaneously utilizing many cores would drastically drain the device battery life. This paper shows a demonstration system of real-time video processing combined with power reduction controlled by the OSCAR automatic parallelization compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime with 4 ARM Cortex-A9 cores. In this paper, we exploited the DVFS framework, core partitioning, a profiling technique, and the OSCAR parallelization and power control algorithm to reduce the total consumption in a real-time video application. The demonstration results show that it can cut power consumption by 42.8% for the MPEG-2 Decoder application and 59.8% for the Optical Flow application by using 3 cores in both applications.

Multigrain parallelization for model-based design applications using the OSCAR compiler

Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9519 125 - 139 2016 [Refereed]

Abstract

Model-based design is a very popular software development method for developing a wide variety of embedded applications such as automotive systems, aircraft systems, and medical systems. Model-based design tools like MATLAB/Simulink typically allow engineers to graphically build models consisting of connected blocks for the purpose of reducing development time. These tools also support automatic C code generation from models with a special tool such as Embedded Coder to map models onto various kinds of embedded CPUs. Since embedded systems require real-time processing, the use of multi-core CPUs poses more opportunities for accelerating program execution to satisfy the real-time constraints. While prior approaches exploit parallelism among blocks by inspecting MATLAB/Simulink models, this may lose an opportunity for fully exploiting parallelism of the whole program because models potentially have parallelism within a block. To unlock this limitation, this paper presents an automatic parallelization technique for auto-generated C code developed by MATLAB/Simulink with Embedded Coder. Specifically, this work (1) exploits multi-level parallelism including inter-block and intra-block parallelism by analyzing the auto-generated C code, and (2) performs static scheduling to reduce dynamic overheads as much as possible. Also, this paper proposes an automatic profiling framework for the auto-generated code for enhancing static scheduling, which leads to improving the performance of MATLAB/Simulink applications. Performance evaluation shows 4.21 times speedup with six processor cores on Intel Xeon X5670 and 3.38 times speedup with four processor cores on ARM Cortex-A15 compared with uniprocessor execution for a road tracking application.

Citations (Scopus): 12

Coarse grain task parallelization of earthquake simulator GMS using OSCAR compiler on various cc-NUMA servers

Mamoru Shimaoka, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9519 238 - 253 2016 [Refereed]

Abstract

This paper proposes coarse grain task parallelization for a earthquake simulation program using Finite Difference Method to solve the wave equations in 3-D heterogeneous structure or the Ground Motion Simulator (GMS) on various cc-NUMA servers using IBM, Intel and Fujitsu multicore processors. The GMS has been developed by the National Research Institute for Earth Science and Disaster Prevention (NIED) in Japan. Earthquake wave propagation simulations are important numerical applications to save lives through damage predictions of residential areas by earthquakes. Parallel processing with strong scaling has been required to precisely calculate the simulations quickly. The proposed method uses the OSCAR compiler for exploiting coarse grain task parallelism efficiently to get scalable speed-ups with strong scaling. The OSCAR compiler can analyze data dependence and control dependence among coarse grain tasks, such as subroutines, loops and basic blocks. Moreover, locality optimizations considering the boundary calculations of FDM and a new static scheduler that enables more efficient task schedulings on cc-NUMA servers are presented. The performance evaluation shows 110 times speed-up using 128 cores against the sequential execution on a POWER7 based 128 cores cc-NUMA server Hitachi SR16000 VM1, 37.2 times speed-up using 64 cores against the sequential execution on a Xeon E7-8830 based 64 cores cc-NUMA server BS2000, 19.8 times speed-up using 32 cores against the sequential execution on a Xeon X7560 based 32 cores cc-NUMA server HA8000/RS440, 99.3 times speed-up using 128 cores against the sequential execution on a SPARC64 VII based 256 cores cc-NUMA server Fujitsu M9000, 9.42 times speed-up using 12 cores against the sequential execution on a POWER8 based 12 cores cc-NUMA server Power System S812L.

2-Step Power Scheduling with Adaptive Control Interval for Network Intrusion Detection Systems on Multicores

Lau Phi Tuong, Keiji Kimura

2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 69 - 76 2016 [Refereed]

Role: Last author

Abstract

Network intrusion detection system (NIDS) is becoming an important element even in embedded systems as well as in data centers since embedded computers have been increasingly exposed to the Internet. The demand for power budget of these embedded systems is a critical issue in addition to that for performance. In this paper, we propose a technique to minimize power consumption in the NIDS by 2-step power scheduling with the adaptive control interval. In addition, we also propose a CPU-core controlling algorithm so that our scheduling technique can preserve the performance for other applications and NIDS assuming the cases of multiplexing NIDS and them simultaneously on the same device such as a home server or a mobile platform. We implement our 2-step algorithm into Suricata, which is a popular NIDS, as well as a 1-step algorithm with the adaptive interval, and a simple fixed-interval algorithm for evaluations. Experimental results show that our 2-step scheduling with both the adaptive and the fixed 30-millisecond interval achieve 75% power saving comparing with the Ondemand governor and 87% comparing with the Performance governor in Linux, respectively, without affecting their performance capability on four ARM Cortex-A15 cores at the network traffic of 1,000 packets/seconds. In contrast, when the network traffic reaches to 17,000 packets/seconds, our 2-step scheduling and the Ondemand as well as the Performance governor can maintain the packet processing capacity while the fixed 30-milliseconds interval processes only 50% packets with two and three cores, and about 80% packets on four cores.

DOI

Scopus

1

被引用数

(Scopus)
Accelerating Multicore Architecture Simulation Using Application Profile

Keiji Kimura, Gakuho Taguchi, Hironori Kasahara

2016 IEEE 10TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC) 177 - 184 2016年 [査読有り]

担当区分：筆頭著者

　概要を見る

Architecture simulators play an important role in exploring frontiers in the early stages of architecture design. However, the execution time of simulators increases with the number of cores. The sampling simulation technique, originally proposed for simulating single-core processors, is a promising approach to reducing simulation time. Two main hurdles for multi/many-core are preparing sampling points and thread skewing at functional simulation time. This paper proposes a very simple and low-error sampling-based acceleration technique for multi/many-core simulators. For a parallelized application, an iteration of a large loop that includes a parallelizable program part is defined as a sampling unit. We apply the X-means method to a profile of the iterations collected on a real machine to form clusters of those iterations. Multiple iterations are then taken from these clusters as sampling points. We execute the simulation along the sampling points and calculate the total number of execution cycles. Results from a 16-core simulation show that our proposed simulation technique gives a maximum of 443x speedup with a 0.52% error, and 218x speedup with a 1.50% error on average.
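The estimation step described above can be pictured as follows: once iterations have been grouped into clusters, only a few iterations per cluster are simulated in detail, and the total cycle count is extrapolated from them. This is a minimal sketch under stated assumptions, not the authors' implementation: the cluster labels are taken as given (the paper obtains them with X-means, which is not reproduced here), and the function name and the use of a per-cluster mean are assumptions made for illustration.

```python
from collections import Counter
from statistics import fmean


def estimate_total_cycles(labels, sampled_cycles):
    """Extrapolate total cycles from a few simulated iterations per cluster.

    labels[i] is the cluster id of loop iteration i (assumed to come from
    clustering per-iteration profile data).
    sampled_cycles[c] is a list of cycle counts measured by detailed
    simulation of the sampling points chosen from cluster c.
    """
    counts = Counter(labels)
    missing = set(counts) - set(sampled_cycles)
    if missing:
        raise ValueError(f"no sampling point for clusters: {sorted(missing)}")
    # Each cluster contributes (number of iterations) x (mean sampled cycles).
    return sum(n * fmean(sampled_cycles[c]) for c, n in counts.items())


if __name__ == "__main__":
    labels = [0, 0, 1, 0, 1, 2]            # six iterations, three clusters
    samples = {0: [1000, 1200], 1: [2500], 2: [400]}
    print(estimate_total_cycles(labels, samples))
```

Only the sampled iterations need detailed simulation, which is where the time saving comes from; the accuracy depends on how similar the iterations within a cluster really are.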

DOI

Scopus

5

被引用数

(Scopus)
Annotatable systrace: An extended linux ftrace for tracing a parallelized program

Daichi Fukui, Mamoru Shimaoka, Hiroki Mikami, Dominic Hillenbrand, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

SEPS 2015 - Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems 21 - 25 2015年10月 [査読有り]

　概要を見る

Investigating runtime behavior is one of the most important steps in performance tuning on a computer system. Profiling tools have been widely used to detect hot spots in a program. In addition, tracing tools produce valuable information, especially for parallelized programs, such as thread scheduling, barrier synchronizations, context switching, thread migration, and jitter caused by interrupts. Users can optimize the runtime system and hardware configuration, in addition to the program itself, by using this information. However, existing tools provide information per process or per function. Finer information at task or loop granularity is required to understand program behavior more precisely. This paper proposes a tracing tool, Annotatable Systrace, for investigating the runtime execution behavior of a parallelized program, based on an extended Linux ftrace. Annotatable Systrace can add arbitrary annotations to a trace of a target program. As an evaluation, the proposed tool collects traces from 183.equake, 179.art, and mpeg2enc on Intel Xeon X7560 and ARMv7. The evaluation shows that the tool makes it possible to observe load imbalance over the course of program execution. It can also generate a trace with the inserted annotations even on a 32-core machine. The overhead of one annotation is 1.07 μs on Intel Xeon and 4.44 μs on ARMv7.

DOI

Scopus

6

被引用数

(Scopus)
Evaluation of Automatic Power Reduction with OSCAR Compiler on Intel Haswell and ARM Cortex-A9 Multicores

Tomohiro Hirano, Hideo Yamamoto, Shuhei Iizuka, Kohei Muto, Takashi Goto, Tamami Wake, Hiroki Mikami, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8967 239 - 252 2015年05月 [査読有り]
グリーンコンピューティングの展望 (特集スマートグリッドをささえる新技術)

木村啓二, 笠原博徳

スマートグリッド : 技術雑誌 = Smart grid : technical journal 4 ( 4 ) 3 - 8 2014年10月

CiNii
MATLAB/Simulinkで設計されたエンジン制御Cコードのマルチコア用自動並列化

梅田弾, 金羽木洋平, 見神広紀, 林明宏, 谷充弘, 森裕司, 木村啓二, 笠原博徳

情報処理学会論文誌 55 ( 8 ) 1817 - 1829 2014年08月 [査読有り]

　概要を見る

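Modern automobiles are required to be safe, comfortable and environmentally compliant, and automotive control software grows more sophisticated every year to meet these demands. Along with more advanced control, faster processors are needed to run this software in real time. However, because raising the operating frequency of a single core has become difficult, single-core performance gains have reached their limit, and automotive control systems are expected to move to multicores. Meanwhile, model-based design with MATLAB/Simulink has become widespread in automotive control to shorten development time and improve reliability. At present, however, source code generated automatically in such model-based design cannot yet be processed in parallel automatically on a multicore. This paper therefore proposes a parallelization method for running, on a multicore, engine control C code generated automatically by Embedded Coder from control models designed in MATLAB/Simulink. Using the proposed method, engine control C code consisting of conditional branches and arithmetic assignment statements, whose fine task granularity had made manual parallelization difficult, was automatically parallelized by the OSCAR automatic parallelizing compiler. Executed on embedded multicores such as RP2 and V850E2R, the code achieved speedups of up to 1.91 times on 2 cores and up to 3.76 times on 4 cores.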

CiNii
低消費電力コンピューティングを実現するマルチコア技術

木村啓二, 笠原博徳

電子情報通信学会誌 97 ( 2 ) 133 - 139 2014年02月 [招待有り]

担当区分：筆頭著者

　概要を見る

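Multicore processors are used in all kinds of IT equipment, from smartphones, personal computers and automobiles to cloud servers and supercomputers. This is because multicores allow performance to improve along with semiconductor integration while keeping power consumption down, and they have been adopted as the most promising technology for realizing environmentally friendly low-power computing, that is, green computing. This article introduces hardware and software cooperation centered on compilers for such low-power multicores, as well as various embedded applications.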

CiNii
OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

Hideo Yamamoto, Tomohiro Hirano, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2013 8664 155 - 168 2014年 [査読有り]

　概要を見る

In recent years, smart devices have been transitioning from single core processors to multicore processors to satisfy the growing demands for higher performance and lower power consumption. However, the power consumption of multicore processors is increasing as the usage of smart devices becomes more intense. This situation is one of the most fundamental and important obstacles that the mobile device industry faces in extending the battery life of smart devices. This paper evaluates the power reduction control by the OSCAR Automatic Parallelizing Compiler on an Android platform with a newly developed precise power measurement environment on the ODROID-X2, a development platform with the Samsung Exynos4412 Prime, which consists of 4 ARM Cortex-A9 cores. The OSCAR Compiler enables automatic exploitation of multigrain parallelism within a sequential program, and automatically generates parallelized code with the OSCAR Multi-Platform API power reduction directives for DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating. The paper also introduces a newly developed microsecond-order pseudo clock gating method that reduces power consumption using WFI (Wait For Interrupt). By inserting GPIO (General Purpose Input Output) control functions into programs, signals appear on the power waveform at the points where the GPIO control was inserted, which provides a precise power measurement of the specified program area. The results of the power evaluation for real-time Mpeg2 Decoder show 86.7% power reduction, namely from 2.79[W] to 0.37[W] and for real-time Optical Flow show 86.5% power reduction, namely from 2.23[W] to 0.36[W] on 3 core execution.

DOI

Scopus

3

被引用数

(Scopus)
モデルベース設計により自動生成されたエンジン制御Cコードのマルチコア用自動並列化

梅田弾, 金羽木洋平, 見神広紀, 谷充弘(デンソー), 森裕司(デンソー), 木村啓二, 笠原博徳

組み込みシステムシンポジウム（ESS2013） 2013 104 - 113 2013年10月

CiNii
OSCAR API v2.1: Extensions for an Advanced Accelerator Control Scheme to a Low-Power Multicore API

Keiji Kimura, Cecilia Gonzalez-Alvarez, Akihiro Hayashi, Hiroki Mikami, Mamoru Shimaoka, Jun Shirako, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013年07月 [査読有り]

担当区分：筆頭著者
Automatic Parallelization of Hand Written Automotive Engine Control Codes Using OSCAR Compiler

Dan Umeda, Yohei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013年07月 [査読有り]
Evaluation of power consumption at execution of multiple automatically parallelized and power controlled media applications on the RP2 low-power multicore

Hiroki Mikami, Shumpei Kitaki, Masayoshi Mase, Akihiro Hayashi, Mamoru Shimaoka, Keiji Kimura, Masato Edahiro, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7146 31 - 45 2013年

　概要を見る

This paper evaluates an automatic power reduction scheme of the OSCAR automatic parallelizing compiler, which has power reduction control capability, when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, an 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. The OSCAR compiler enables hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating and power gating for each processor core through the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of eight power controlled AAC encoder programs, each executed on one processor, was reduced by 47% (to 1.01W) against one AAC encoder execution on one processor without power control (from 1.89W). Second, when multiple intermediate computational load applications are executed, the power consumption of an AAC encoder executed on four processors with the power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (from 1.95W). The power consumption of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (from 1.99W). Finally, when a high computational load application program and an intermediate computational load application program are executed simultaneously, the consumed power was reduced by 21% by using twice the number of cores for each application. This paper confirmed that parallel processing and power reduction by the OSCAR compiler are efficient for multiple application executions.
In the execution of multiple light computational load applications, power consumption increases by only 12% per application. When parallel processing is applied to intermediate computational load applications, the power consumption of executing one application on one processor core (1.49W) is almost the same as that of two applications on eight processor cores (1.46W). © 2013 Springer-Verlag.
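The percentage reductions quoted in this abstract follow from the before and after wattages. The sketch below reproduces that arithmetic for three of the cases; the function name and the case labels are hypothetical and used only for illustration.

```python
def power_reduction_percent(before_w: float, after_w: float) -> float:
    """Return the reduction from before_w to after_w as a percentage."""
    if before_w <= 0:
        raise ValueError("before_w must be positive")
    return (before_w - after_w) / before_w * 100.0


# (label, power without control [W], power with control [W]) as quoted above.
CASES = [
    ("eight AAC encoders, one per core", 1.89, 1.01),
    ("AAC encoder on four cores", 1.95, 0.84),
    ("MPEG2 decoder on four cores", 1.99, 1.01),
]

if __name__ == "__main__":
    for label, before_w, after_w in CASES:
        pct = power_reduction_percent(before_w, after_w)
        print(f"{label}: {pct:.1f}% reduction")
```

Rounded to whole percentages, these three cases give the 47%, 57% and 49% stated in the abstract.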

DOI

Scopus

1

被引用数

(Scopus)
Automatic Design Exploration Framework for Multicores with Reconfigurable Accelerators

Cecilia Gonzalez-Alvarez, Haruku Ishikawa, Akihiro Hayashi, Daniel Jimenez-Gonzalez, Carlos Alvarez, Keiji Kimura, Hironori Kasahara

th Workshop on Reconfigurable Computing (WRC) 2013, held in conjunction with HiPEAC conference 2013 2013年01月 [査読有り]
Parallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR Compiler

Yohei Kanehagi, Dan Umeda, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

2013 IEEE COOL CHIPS XVI (COOL CHIPS) 2013年 [査読有り]
Automatic Parallelization, Performance Predictability and Power Control for Mobile-Applications

Dominic Hillenbrand, Akihiro Hayashi, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

2013 IEEE COOL CHIPS XVI (COOL CHIPS) 2013年 [査読有り]

　概要を見る

Currently, few mobile applications exploit the power and performance capabilities of multi-core architectures. As the number of cores increases, the challenges become more pressing. We picked three challenges: application parallelization, performance predictability and portability, and power control for mobile devices. We tackled these challenges with our auto-parallelizing compiler and operating system enhancements.
Reconciling application power control and operating systems for optimal power and performance

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip, ReCoSoC 2013 2013年

　概要を見る

In the age of dark silicon, on-chip power control is a necessity. Upcoming and state of the art embedded and cloud computer system-on-chips (SoCs) already provide interfaces for fine grained power control. In some cases, for example, both core and interconnect voltage and frequency can be scaled. To further reduce power consumption, SoCs often have specialized accelerators. Due to the rising specialization of hardware and software, general purpose operating systems require changes to exploit the power saving opportunities provided by the hardware. However, they lack detailed hardware-level and application-level information. Application-level power control, in turn, is still very uncommon and difficult to realize. Nowadays, vendors of mobile devices are forced to tweak and patch system-level software to enhance the power efficiency of each individual product. This manual process is time consuming and must be repeated for each new product. In this paper we explore the opportunities and challenges of automatic application-level power control using compilers. © 2013 IEEE.

DOI

Scopus

4

被引用数

(Scopus)
組込マルチコア用OSCAR APIを用いたTILEPro64上でのマルチメディアアプリケーションの並列処理

岸本耀平, 見神広紀, 中野恵一, 林明宏, 木村啓二, 笠原博徳

組み込みシステムシンポジウム（ESS2012） 2012 22 - 30 2012年10月

CiNii
OSCAR Parallelizing Compiler and API for Real-time Low Power Heterogeneous Multicores

Akihiro Hayashi, Mamoru Shimaoka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

6th Workshop on Compilers for Parallel Computing(CPC2012) 5 ( 1 ) 68 - 79 2012年01月 [査読有り]

　概要を見る

Heterogeneous multicores, which integrate accelerators that execute specific processing with high efficiency in addition to general purpose CPU cores, have come into wide use because they achieve high performance while keeping power consumption low. However, heterogeneous multicores are difficult to program, since programmers must write much of the task scheduling onto the various computing resources and the data transfer code themselves. To overcome this situation, this paper proposes a software development framework in which an automatic parallelizing compiler takes a sequential program as input and automatically distributes tasks to general purpose cores and accelerator cores, realizing high performance and low power. The method can make effective use of existing accelerator development environments such as accelerator compilers and accelerator libraries. This paper also evaluates the processing performance and the power reduction of the proposed framework on the RP-X heterogeneous multicore processor for consumer electronics, using an accelerator library with an AAC encoder and an optical flow calculation. The framework attains speedups of up to 32x for the optical flow program with eight general purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution on a single processor core, and up to 80% power reduction for real-time AAC encoding.

CiNii
重粒子線がん治療用線量計算エンジンの自動並列化

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

HPCS2012 - ハイパフォーマンスコンピューティングと計算科学シンポジウム 2012 135 - 143 2012年01月

CiNii
Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

Yasir I. M. Al-Dosary, Keiji Kimura, Hironori Kasahara, Seinosuke Narita

2012 17TH INTERNATIONAL CONFERENCE ON COMPUTER GAMES (CGAMES) 67 - 75 2012年 [査読有り]

　概要を見る

Video Games have been a very popular form of digital entertainment in recent years. They have been delivered in state of the art technologies that include multi-core processors that are known to be the leading contributor in enhancing the performance of computer applications. Since parallel programming is a difficult technology to implement, that field in Video Games is still rich with areas for advancements. This paper investigates performance enhancement in Video Games when using parallelizing compilers and the difficulties involved in achieving that. This experiment conducts several stages in attempting to parallelize a well-renowned sequentially written Video Game called ioquake3. First, the Game is profiled for discovering bottlenecks, then examined by hand on how much parallelism could be extracted from those bottlenecks, and what sort of hazards exist in delivering a parallel-friendly version of ioquake3. Then, the Game code is rewritten into a hazard-free version while also modified to comply with the Parallelizable-C rules, which crucially aid parallelizing compilers in extracting parallelism. Next, the program is compiled using a parallelizing compiler called OSCAR (Optimally Scheduled Advanced Multiprocessor) to produce a parallel version of ioquake3. Finally, the performance of the newly produced parallel version of ioquake3 on a Multi-core platform is analyzed.
The findings are as follows: (1) the game parallelized by the compiler from the revised sequential program achieves a 5.1 faster performance at 8-threads than the original one on an IBM Power 5+ machine equipped with 8 cores; (2) hazards are caused by thread contention over globally shared data as well as thread private data; (3) AI driven players are represented very similarly to Human players inside the ioquake3 engine, which gives an estimate of the costs of parallelizing Human driven sessions; and (4) 70% of the cost of the experiment is spent analyzing the ioquake3 code and 30% implementing the changes in the code.
ヘテロジニアスマルチコア向けソフトウェア開発フレームワーク及びAPI

林明宏, 和田康孝, 渡辺岳志, 関口威, 間瀬正啓, 白子準, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム(ACS36) 5 ( 1 ) 68 - 79 2011年11月 [査読有り]
A 45-nm 37.3 GOPS/W Heterogeneous Multi-Core SOC with 16/32 Bit Instruction-Set General-Purpose Core

Osamu Nishii, Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

IEICE TRANSACTIONS ON ELECTRONICS E94C ( 4 ) 663 - 669 2011年04月 [査読有り]

　概要を見る

We built a 12.4 mm x 12.4 mm, 45-nm CMOS chip that integrates eight 648-MHz general purpose cores, two matrix processor (MX-2) cores, four flexible engine (FE) cores and media IP (VPU5) to establish a heterogeneous multi-core chip architecture. The IPC (instructions per cycle) performance of the general purpose core was enhanced by adding 32-bit instructions to the existing 16-bit fixed-length instruction set and executing up to two 32-bit instructions per cycle. Considering embedded LSIs over the next five to seven years and the increasing number of access masters within an LSI, we predict that the memory usage of a single core will not exceed the 32-bit physical address space (i.e. 4 GB), but chip-total memory usage will exceed 4 GB. Based on this prediction, the physical address was expanded from 32-bit to 40-bit. The fabricated chip was tested, and parallel operation of eight general purpose cores, four FE cores and eight data transfer units (DTU) was achieved on AAC (Advanced Audio Coding) encode processing.

DOI

Scopus
Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING 6548 184 - 198 2011年 [査読有り]

　概要を見る

Heterogeneous multicores have been attracting much attention in a wide range of areas because they attain high performance while keeping power consumption low. However, heterogeneous multicores are very difficult to program, and the resulting long application development period lowers product competitiveness. To overcome this situation, this paper proposes a compilation framework that bridges the gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on the OSCAR compiler. It realizes coarse grain task parallel processing, data transfer using a DMA controller, and power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates the processing performance and the power reduction of the proposed framework on a newly developed 15 core heterogeneous multicore chip named RP-X, which integrates 8 general purpose processor cores and 3 types of accelerator cores and was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups of up to 32x for an optical flow program with eight general purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core, and 80% power reduction for real-time AAC encoding.
A parallelizing compiler cooperative heterogeneous multicore processor architecture

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6760 215 - 233 2011年

　概要を見る

Heterogeneous multicore architectures, which integrate several kinds of accelerator cores in addition to general purpose processor cores, have been attracting much attention as a way to realize high performance with low power consumption. To attain effective high performance, high application software productivity, and low power consumption on heterogeneous multicores, cooperation between the architecture and a parallelizing compiler is important. This paper proposes a compiler cooperative heterogeneous multicore architecture and a parallelizing compilation scheme for it. The performance of the proposed scheme is evaluated on a heterogeneous multicore integrating Hitachi and Renesas' SH4A processor cores and Hitachi's FE-GA accelerator cores, using an MP3 encoder. The heterogeneous multicore gives 14.34 times speedup with two SH4As and two FE-GAs, and 26.05 times speedup with four SH4As and four FE-GAs, against sequential execution with a single SH4A. The cooperation between the heterogeneous multicore architecture and the parallelizing compiler makes it possible to achieve high performance in a short development period. © 2011 Springer-Verlag Berlin Heidelberg.

DOI
Parallelizable C and Its Performance on Low Power High Performance Multicore Processors

Masayoshi Mase, Yuto Onozaki, Keiji Kimura, Hironori Kasahara

Proc. of 15th Workshop on Compilers for Parallel Computing (CPC 2010) 2010年07月 [査読有り]

CiNii
自動並列化のためのElement-Sensitiveポインタ解析

間瀬正啓, 村田雄太, 木村啓二, 笠原博徳

情報処理学会論文誌プログラミング(PRO) 3 ( 2 ) 36 - 47 2010年03月 [査読有り]
A 45nm 37.3GOPS/W heterogeneous multi-core SoC

Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Shigezumi Matsui, Osamu Nishii, Atsushi Hasegawa, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Koichi Terada, Tohru Nojiri, Makoto Satoh, Hiroyuki Mizuno, Kunio Uchiyama, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 53 100 - 101 2010年

　概要を見る

We develop a heterogeneous multi-core SoC for applications, such as digital TV systems with IP networks (IP-TV) including image recognition and database search. Figure 5.3.1 shows the chip features. This SoC is capable of decoding 1080i audio/video data using a part of SoC (one general-purpose CPU core, video processing unit called VPU5 and sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated and have a total theoretical performance of 41.5GOPS and power consumption of 0.76W. Two 1024-way matrix-processors called MX-2 [3] are integrated and have a total theoretical performance of 36.9GOPS and power consumption of 1.10W. Overall, the performance per watt of our SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4-6] excluding special-purpose codecs. The operation granularity of the CPU, FE and MX-2 are 32bit, 16bit, and 4bit respectively, and thus we can assign the appropriate processor for each task in an effective manner. A heterogeneous multi-core approach is one of the most promising approaches to attain high performance with low frequency, or low power, for consumer electronics application and scientific applications, compared to homogeneous multi-core SoCs [4]. For example, for image-recognition application in the IP-TV system, the FEs are assigned to calculate optical flow operation [7] of VGA (640x480) size video data at 15fps, which requires 0.62GOPS. The MX-2s are used for face detection and calculation of the feature quantity of the VGA video data at 15fps, which requires 30.6GOPS. In addition, general-purpose CPU cores are used for database search using the results of the above operations, which requires further enhancement of CPU. The automatic parallelization compilers analyze parallelism of the data flow, generate coarse grain tasks, schedule tasks to minimize execution time considering data transfer overhead for general-purpose CPU and FE. ©2010 IEEE.

DOI

Scopus

33

被引用数

(Scopus)
OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING 5898 188 - 202 2010年 [査読有り]

担当区分：筆頭著者

　概要を見る

OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed on a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. The scalability evaluation shows that, on average, the OSCAR compiler with the OSCAR API achieves 5.8 times speedup over sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores. In addition, the OSCAR compiler achieves up to 3.3 times speedup over an IBM XL Fortran compiler on the Power6 SMP server.
Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
マルチコア上でのOSCAR APIを用いた並列化コンパイラによる低消費電力化手法

間瀬正啓, 中川亮, 大國直人, 白子準, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム(ACS) 2 ( 3 ) 96 - 106 2009年09月 [査読有り]
マルチコア上でのOSCAR APIを用いた並列化コンパイラによる低消費電力化手法

中川亮, 間瀬正啓, 大國直人, 白子準, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2009) 3 - 10 2009年05月
Performance of OSCAR Multigrain Parallelizing Compiler on Multicore Processors

Hiroki Mikami, Jun Shirako, Masayoshi Mase, Takamichi Miyamoto, Hirofumi Nakano, Fumiyo Takano, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Proc. of 14th Workshop on Compilers for Parallel Computing(CPC 2009) 2009年01月 [査読有り]
Green multicore-SoC software-execution framework with timely-power-gating scheme

Masafumi Onouchi, Keisuke Toyama, Toru Nojiri, Makoto Sato, Masayoshi Mase, Jun Shirako, Mikiko Sato, Masashi Takada, Masayuki Ito, Hiroyuki Mizuno, Mitaro Namiki, Keiji Kimura, Hironori Kasahara

Proceedings of the International Conference on Parallel Processing 510 - 517 2009年

　概要を見る

We are developing a software-execution framework based on an octo-core chip multiprocessor named RP2 and an automatic multigrain-parallelizing compiler named OSCAR. The main purpose of this framework is to maintain good speed scalability and power efficiency over the number of processor cores under severe hardware restrictions for embedded use. The key to speed scalability is reducing the communication overhead among parallelized tasks. A data-categorization scheme enables small-overhead cache-coherency maintenance by using directives and instructions from the compiler. In this scheme, the number of cache flushes is minimized and parallelized tasks are quickly synchronized by using flags in local memory. As regards power efficiency, the power supply to processor cores waiting for other cores is cut off promptly and frequently, even in the middle of an application, by using a timely-power-gating scheme. In this scheme, to achieve quick mode transition between "NORMAL" mode and "RESUME POWER-OFF" mode, register values of the processor core are stored in core-local memory, which is active even in "RESUME POWER-OFF" mode and can be accessed in one or two clock cycles. Measured speed and power of an application show good speed scalability in execution time and high power efficiency simultaneously. In the case of a secure AAC-LC encoding program, execution speed when eight processor cores are used can be increased by 4.85 times compared to that of sequential execution. Moreover, power consumption under the same condition can be reduced by 51.0% by parallelizing and timely-power gating. The time for mode transition is less than 20 μsec, which is only 2.5% of the "RESUME POWER-OFF" period. © 2009 IEEE.

DOI

Scopus

1

被引用数

(Scopus)
情報家電用マルチコア並列化APIを生成する自動並列化コンパイラによる並列化の評価

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム(ACS) 1 ( 3 ) 83 - 95 2008年12月 [査読有り]

　概要を見る

Multicore processors are being adopted in embedded consumer electronics such as portable devices, car navigation systems, digital TVs and game consoles to obtain high performance with low power consumption. The OSCAR automatic parallelizing compiler has been developed to make effective use of these multicores. To apply the optimizations of the OSCAR compiler to multiple kinds of multicores, a parallelization API that connects the OSCAR compiler with the native compiler of each multicore was newly developed in the NEDO “Multicore-processor Technology for Real-Time Consumer Electronics” project. This paper evaluates the parallel processing performance of code generated by the OSCAR compiler with this API on the FR1000 multicore processor with four VLIW cores developed by Fujitsu Ltd., and on the RP1 multicore processor with four SH-4A cores jointly developed by Renesas Technology Corp., Hitachi Ltd. and Waseda University. The target applications are multimedia workloads such as codecs and graphics, for which speed is important on consumer electronics. The developed API gives a 3.28 times speedup on average using 4 cores against 1 core on the FR1000 multicore, and a 3.31 times speedup on average using 4 cores against 1 core on the RP1 multicore. Furthermore, on the FR1000 multicore the developed API gives a maximum 1.74 times speedup over code that uses only a parallelization API compliant with the OpenMP API.

Parallelizing Compiler Cooperative Heterogeneous Multicore

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP 2008) 2008年06月 [査読有り]
ヘテロジニアスマルチコア上でのスタティックスケジューリングを用いた MP3エンコーダの並列化

和田康孝, 林明宏, 益浦健, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム 1 ( 1 ) 105 - 119 2008年06月 [査読有り]

情報家電の市場拡大にともない，低消費電力でありながら高い性能を実現するプロセッサが求められるようになっている．この要求に対応するため，汎用プロセッサに加え，動的再構成可能プロセッサ（DRP）や信号処理用プロセッサ（DSP）等のアクセラレータを1チップ上に複数集積したヘテロジニアスマルチコアアーキテクチャが注目を集めている．このようなヘテロジニアスマルチコアにおいては，処理の特性やコア間のデータ転送を考慮して適切に各コアに処理を割り当てることが必要となる．本論文では，このようなヘテロジニアスマルチコア用の粗粒度タスクスタティックスケジューリング手法を提案する．本論文で提案するスタティックスケジューリング手法では，ループやサブルーチン，基本ブロック間の並列性を利用する粗粒度タスク並列処理において，各タスクがどのコアで実行可能か等の特性，各コア間でのデータ転送オーバヘッドを考慮して処理時間を最小とするように汎用コアあるいはアクセラレータに割り当て，さらにコア間でのデータ転送をDMAを用いてタスク処理とオーバラップして行う．これによりプログラムの階層的な並列性とチップ上のアクセラレータを有効に利用し，処理の高速化を図ることができる．本手法を用い，世界初のヘテロジニアス並列化コンパイラを開発しMP3エンコーダに適用し評価した結果，SH4A 1コアのみを用いた場合に対して，SH4A 4コアで3.99倍，SH4A 2コアとDRP 2コアで14.55倍，SH4A 4コアとDRP 4コアを用いたときに25.20倍の性能向上を得られることが確認できた．

Heterogeneous multicore architectures that integrate various kinds of accelerators, such as dynamically reconfigurable processors (DRPs) or digital signal processors (DSPs), in addition to general-purpose processor cores have attracted much attention as a way to realize high performance with low power consumption. These heterogeneous multicores require scheduling schemes that consider the characteristics of tasks on each core and on-chip data transfers. This paper proposes a static scheduling scheme for coarse grain task parallel processing on a heterogeneous multicore processor that overlaps data transfer with task execution. In the proposed scheme, the compiler extracts parallelism using coarse grain parallel processing and assigns tasks considering the characteristics of each core to minimize the execution time of an application. The performance of the proposed scheme is evaluated on a heterogeneous multicore processor using an MP3 encoder. Heterogeneous configurations give 14.55 times speedup with two SH4As and two DRPs and 25.20 times speedup with four SH4As and four DRPs against sequential execution on one SH4A core.

情報家電用マルチコア上におけるマルチメディア処理のコンパイラによる並列化

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

SACSIS2008 - 先進的計算基盤システムシンポジウム 2008年05月

Power-aware compiler controllable chip multiprocessor

Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEICE TRANSACTIONS ON ELECTRONICS E91C ( 4 ) 432 - 439 2008年04月 [査読有り]

A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.

被引用数 (Scopus): 1
Heterogeneous multi-core architecture that enables 54x AAC-LC stereo encoding

Hiroaki Shikano, Masaki Ito, Masafumi Onouchi, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEEE JOURNAL OF SOLID-STATE CIRCUITS 43 ( 4 ) 902 - 910 2008年04月 [査読有り]

This paper describes a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve exceptional performance as well as low-power consumption for the SoCs of embedded systems. The memory architectures of CPUs and ACCs were unified to improve programming and compiling efficiency. Advanced audio codec-low complexity (AAC-LC) stereo audio encoding was parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamically reconfigurable processor (DRP) ACC cores in a preliminary evaluation of the HMCP architecture. The performance evaluation revealed that 54x AAC encoding was achieved on the chip with two CPUs at 600 MHz and two DRPs at 300 MHz, which achieved encoding of an entire CD within 1-2 min.

被引用数 (Scopus): 16
An 8 CPU SoC with Independent Power-off Control of CPUs and Multicore Software Debug Function

Yutaka Yoshida, Masayuki Ito, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Toshihiro Hattori, Jun Sakiyama, Masashi Takada, Kunio Uchiyama, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Cool Chips XI: Symposium on Low-Power and High-Speed Chips 2008 2008年04月 [査読有り]
A 600MHz SoC with Compiler Power-off Control of 8 CPUs and 8 Onchip-RAMs

Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of International Solid State Circuits Conference (ISSCC2008) 90 - 91 2008年02月 [査読有り]
An 8640 MIPS SoC with independent power-off control of 8 CPUs and 8 RAMs by an automatic parallelizing compiler

Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 51 81 - 598 2008年 [査読有り]

A 104.8mm2 90nm CMOS 600MHz SoC integrates 8 processor cores and 8 user RAMs in 17 separate power domains and delivers 33.6GFLOPS. An automatic parallelizing compiler assigns tasks to each CPU and controls its power mode including power supply in accordance with its processing load and status. The compiler also uses barrier registers to achieve fast and accurate CPU synchronization. ©2008 IEEE.

被引用数 (Scopus): 37
Performance evaluation of compiler controlled power saving scheme

Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

HIGH-PERFORMANCE COMPUTING 4759 480 - 493 2008年 [査読有り]

Multicore processors, or chip multiprocessors, which allow us to realize low power consumption, high effective performance, good cost performance and short hardware/software development period, are attracting much attention. In order to achieve full potential of multicore processors, cooperation with a parallelizing compiler is very important. The latest compiler extracts multilevel parallelism, such as coarse grain task parallelism, loop parallelism and near fine grain parallelism, to keep parallel execution efficiency high. It also controls voltage and clock frequency of processors carefully to reduce energy consumption during execution of an application program. This paper evaluates performance of compiler controlled power saving scheme which has been implemented in OSCAR multigrain parallelizing compiler. The developed power saving scheme realizes voltage/frequency control and power shutdown of each processor core during coarse grain task parallel processing. In performance evaluation, when static power is assumed as one-tenth of dynamic power, OSCAR compiler with the power saving scheme achieved 61.2 percent energy reduction for SPEC CFP95 applu without performance degradation on 4 processors and 87.4 percent energy reduction for mpeg2encode, 88.1 percent energy reduction for SPEC CFP95 tomcatv and 84.6 percent energy reduction for applu with real-time deadline constraint on 4 processors.
Software-cooperative power-efficient heterogeneous multi-core for media processing

Hiroaki Shikano, Masaki Ito, Kunio Uchiyama, Toshihiko Odaka, Akihiro Hayashi, Takeshi Masuura, Masayoshi Mase, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

2008 ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2 712 - + 2008年 [査読有り]

A heterogeneous multi-core processor (HMCP) architecture, which integrates general purpose processors (CPU) and accelerators (ACC) to achieve high performance as well as low power consumption with the support of a parallelizing compiler, was developed. The evaluation was performed using an MP3 audio encoder on a simulator that accurately models the HMCP. It showed that 16-frame encoding on the HMCP with four CPUs and four ACCs yielded a 24.5-fold speed-up against sequential execution on one CPU. Furthermore, power saving by the compiler reduced energy consumption of the encoding to 0.17 J, namely, by 28.4%.
Power Reduction Control for Multicores in OSCAR Multigrain Parallelizing Compiler

Jun Shirako, Keiji Kimura, Hironori Kasahara

ISOCC: 2008 INTERNATIONAL SOC DESIGN CONFERENCE, VOLS 1-3 50 - 55 2008年 [査読有り]

Multicore processors have become the mainstream computer architecture to go beyond the performance and power efficiency limits of single-core processors. To achieve low power consumption and high performance on multicores, parallelizing compilers take on an important role. This paper describes the performance of a compiler-based power reduction scheme cooperating with the OSCAR multigrain parallelizing compiler on a newly developed 8-way SH4A low power multicore chip for consumer electronics, which supports DVFS (Dynamic Voltage and Frequency Scaling) and Clock/Power Gating. Using hardware parameters and parallelized program information, the OSCAR compiler determines a suitable voltage and frequency for each active processor core and an appropriate schedule of clock gating and power gating. Performance experiments show the compiler reduces consumed power by 88.3%, namely from 5.68 W to 0.67 W, for real-time secure AAC Encoding and 73.5%, namely from 5.73 W to 1.52 W, for real-time MPEG2 Decoding on 8 core execution.
Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

PROCEEDINGS OF THE 2008 INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS 600 - 607 2008年 [査読有り]

Multicore processors have been adopted for consumer electronics like portable electronics, mobile phones, car navigation systems, digital TVs and games to obtain high performance with low power consumption. The OSCAR automatic parallelizing compiler has been developed to utilize these multicores easily. Also, a new Consumer Electronics Multicore Application Program Interface (API) to use the OSCAR compiler with native sequential compilers for various kinds of multicores from different vendors has been developed in the NEDO (New Energy and Industrial Technology Development Organization) "Multicore Technology for Realtime Consumer Electronics" project with six Japanese IT companies. This paper evaluates the parallel processing performance of multimedia applications using this API by the OSCAR compiler on the FR1000 4 VLIW cores multicore processor developed by Fujitsu Ltd., and the RP1 4 SH-4A cores multicore processor jointly developed by Renesas Technology Corp., Hitachi Ltd. and Waseda University. As a result, the parallel codes generated by the OSCAR compiler using the API give us 3.27 times speedup on average using 4 cores against 1 core on the FR1000 multicore, and 3.31 times speedup on average using 4 cores against 1 core on the RP1 multicore.

被引用数 (Scopus): 6
情報家電用マルチコアSMP実行モードにおける制約付きCプログラムのマルチグレイン並列化

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

組込みシステムシンポジウム2007 2007年10月

MP3エンコーダを用いたOSCARヘテロジニアスチップマルチプロセッサの性能評価

鹿野裕明, 鈴木裕貴, 和田康孝, 白子準, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム Vol. 48, No. SIG8(ACS18), 141 - 152 2007年05月 [査読有り]
Power-aware compiler controllable chip multiprocessor

Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT 427 2007年 [査読有り]

被引用数 (Scopus): 1
A 4320MIPS four-processor core SMP/AMP with individually managed clock frequency for low power consumption

Yutaka Yoshida, Tatsuya Kamei, Kiyoshi Hayase, Shinichi Shibahara, Osamu Nishii, Toshihiro Hattori, Atsushi Hasegawa, Masashi Takada, Naohiko Irie, Kunio Uchiyama, Toshihiko Odaka, Kiwamu Takada, Keiji Kimura, Hironori Kasahara

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 95 - 590 2007年

A 4320MIPS four-core SoC that supports both SMP and AMP for embedded applications is designed in 90nm CMOS. Each processor-core can be operated with a different frequency dynamically including clock stop, while keeping data cache coherency, to maintain maximum processing performance and to reduce average operating power. The 97.6mm2 die achieves a floating-point performance of 16.8GFLOPS. © 2007 IEEE.

被引用数 (Scopus): 26
Heterogeneous multiprocessor on a chip which enables 54x AAC-LC stereo encoding

Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Hiroshi Tanaka, Tomoyuki Kodama, Hiroaki Shikano, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

2007 Symposium on VLSI Circuits, Digest of Technical Papers 18 - 19 2007年 [査読有り]

A heterogeneous multiprocessor on a chip has been designed and implemented. It consists of 2 CPUs and 2 DRPs (Dynamic Reconfigurable Processors). The design of DRP was intended to achieve high-performance in a small area to be integrated on a SoC for embedded systems. Memory architecture of CPUs and DRPs were unified to improve programming and compiling efficiency. 54x AAC-LC stereo encoding has been enabled with 2 DRPs at 300MHz and 2 CPUs at 600MHz.
マルチコアプロセッサにおけるコンパイラ制御低消費電力化手法

白子準, 吉田宗弘, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム Vol. 47(ACS15) 2006年09月 [査読有り]
マルチコアプロセッサにおけるコンパイラ制御低消費電力化手法

白子準, 吉田宗広, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2006) 467 - 476 2006年05月

Performance Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX) 349 - 363 2006年05月 [査読有り]

マルチコアにおけるプログラミング

木村啓二, 笠原博徳

情報処理 47 ( 1 ) 17 - 23 2006年01月 [招待有り]

担当区分：筆頭著者
マルチコア化するマイクロプロセッサ

笠原博徳, 木村啓二

情報処理 47 ( 1 ) 10 - 16 2006年01月 [査読有り]
Parallelizing Compilation Scheme for Reduction of Power Consumption of Chip Multiprocessors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of 12th Workshop on Compilers for Parallel Computers (CPC 2006), 2006年01月 [査読有り]
Compiler control power saving scheme for multi core processors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4339 362 - 376 2006年

With the increase of transistors integrated onto a chip, multi core processor architectures have attracted much attention to achieve high effective performance, shorten development period and reduce the power consumption. To this end, the compiler for a multi core processor is expected not only to parallelize program effectively, but also to control the voltage and clock frequency of processors and storages carefully inside an application program. This paper proposes a compilation scheme for reduction of power consumption under the multigrain parallel processing environment that controls Voltage/Frequency and power supply of each processor core on a chip. In the evaluation, the OSCAR compiler with the proposed scheme achieves 60.7 percent energy savings for SPEC CFP95 applu without performance degradation on 4 processors, and 45.4 percent energy savings for SPEC CFP95 tomcatv with real-time deadline constraint on 4 processors, and 46.5 percent energy savings for SPEC CFP95 swim with the deadline constraint on 4 processors. © 2006 Springer-Verlag Berlin Heidelberg.

被引用数 (Scopus): 19
マルチコアプロセッサ上でのデータローカライゼーション

中野啓文, 浅野尚一郎, 内藤陽介, 仁藤拓実, 田川友博, 宮本孝道, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-165-10 2005年12月
コンピュータのその先を見せる-早稲田大学コンピュータ・ネットワーク工学科における広報活動-

木村啓二

情報処理 46 ( 10 ) 1155 - 1157 2005年10月

チップマルチプロセッサ上でのMPEG2エンコードの並列処理

小高剛, 中野啓文, 木村啓二, 笠原博徳

情報処理学会論文誌 46 ( 9 ) 2311 - 2325 2005年09月 [査読有り]
ホモジニアスマルチコアにおけるコンパイラ制御低消費電力化手法

白子準, 押山直人, 和田康孝, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-164-10 (SWoPP2005) 2005年08月
Performance of OSCAR multigrain parallelizing compiler on SMP servers

K Ishizaka, T Miyamoto, J Shirako, M Obata, K Kimura, H Kasahara

LANGUAGES AND COMPILERS FOR HIGH PERFORMANCE COMPUTING 3602 319 - 331 2005年 [査読有り]

This paper describes the performance of the OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition to the loop parallelism. Also, it allows us global cache optimization over different loops, or coarse grain tasks, based on data localization technique with interarray padding to reduce memory access overhead. Current performance of the OSCAR compiler is evaluated on the above SMP servers. For example, the OSCAR compiler generating OpenMP parallelized programs from ordinary sequential Fortran programs gives us 5.7 times speedup, in the average of seven programs, such as SPEC CFP95 tomcatv, swim, su2cor, hydro2d, mgrid, applu and turb3d, compared with IBM XL Fortran compiler 8.1 on IBM pSeries 690 24 processors SMP server. Also, it gives us 2.6 times speedup compared with Intel Fortran Itanium Compiler 7.1 on SGI Altix 3700 Itanium 2 16 processors server, 1.7 times speedup compared with NEC Fortran Itanium Compiler 3.4 on NEC TX7/i6010 Itanium 2 8 processors server, 2.5 times speedup compared with Sun Forte 7.0 on Sun Ultra 80 UltraSPARC II 4 processors desktop workstation, and 2.1 times speedup compared with Sun Forte compiler 7.1 on Sun Fire V880 UltraSPARC III Cu 8 processors server.
Multigrain parallel processing on compiler cooperative chip multiprocessor

K Kimura, Y Wada, H Nakano, T Kodaka, J Shirako, K Ishizaka, H Kasahara

9TH ANNUAL WORKSHOP ON INTERACTION BETWEEN COMPILERS AND COMPUTER ARCHITECTURES, PROCEEDINGS 11 - 20 2005年 [査読有り]

担当区分：筆頭著者

This paper describes multigrain parallel processing on a compiler cooperative chip multiprocessor. The multigrain parallel processing hierarchically exploits multiple grains of parallelism such as coarse grain task parallelism, loop iteration level parallelism and statement level near-fine grain parallelism. The chip multiprocessor has been designed to attain high effective performance, cost effectiveness and high software productivity by supporting the optimizations of the multigrain parallelizing compiler, which is developed by Japanese Millennium Project IT21 "Advanced Parallelizing Compiler". To achieve full potential of multigrain parallel processing, the chip multiprocessor integrates simple single-issue processors having distributed shared data memory for both optimal use of data locality and scalar data transfer, and local data memory for processor private data, in addition to centralized shared memory for shared data among processors. This paper focuses on the scalability of the chip multiprocessor having up to eight processors on a chip by exploiting the multigrain parallelism from SPECfp95 programs. When a simple microSPARC-like processor core is used under the assumption of 90 nm technology and 2.8 GHz, the evaluation results show the speedups for eight processors and four processors reach 7.1 and 3.9, respectively. Similarly, when 400 MHz is assumed for embedded usage, the speedups reach 7.8 and 4.0, respectively.
Memory Management for Data Localization on OSCAR Chip Multiprocessor

Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) 82 - 88 2004年 [査読有り]
Parallel processing using data localization for MPEG2 encoding on OSCAR chip multiprocessor

T Kodaka, H Nakano, K Kimura, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS 119 - 127 2004年 [査読有り]

Currently, many people are enjoying multimedia applications with image and audio processing on PCs, PDAs, mobile phones and so on. With the popularization of multimedia applications, the need for low cost, low power consumption and high performance processors has been increasing. To this end, chip multiprocessor architectures which allow us to attain scalable performance improvement by using multigrain parallelism are attracting much attention. However, in order to extract higher performance on a chip multiprocessor, more sophisticated software techniques are required, such as decomposing a program into tasks of adequate grain, assigning them onto processors considering parallelism, data locality optimization and so on. This paper describes a parallel processing scheme for MPEG2 encoding on a chip multiprocessor using data localization, which improves execution efficiency by assigning coarse grain tasks that share the same data to the same processor consecutively. The performance evaluation on the OSCAR chip multiprocessor architecture shows that the proposed scheme gives us 6.97 times speedup using 8 processors and 10.93 times speedup using 16 processors against sequential execution time. Moreover, the proposed scheme gives us 1.61 times speedup using 8 processors and 2.08 times speedup using 16 processors against loop parallel processing, which has been widely used for multiprocessor systems, using the same number of processors.
Static coarse grain task scheduling with cache optimization using OpenMP

H Nakano, K Ishizaka, M Obata, K Kimura, H Kasahara

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING 31 ( 3 ) 211 - 223 2003年06月 [査読有り]

Effective use of cache memory is getting more important with increasing gap between the processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache after task and data decomposition considering cache size at compile time. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from the SPEC fp 95. As the results, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times on 4 processors for Tomcatv respectively against the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler.
Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture 'Jointly Worked'

Keiji Kimura, Yasutaka Wada, Hirofumi Nakano, Takeshi Kodaka, Jun Shirako, Kazuhisa Ishizaka, Hironori Kasahara

The IEICE Transactions on Electronics, Special Issue on High-Performance and Low-Power System LSIs and Related Technologies E86-C ( 4 ) 570 - 579 2003年02月 [査読有り]

担当区分：筆頭著者
Multigrain parallel processing on OSCAR CMP

K Kimura, T Kodaka, M Obata, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 56 - 65 2003年 [査読有り]

担当区分：筆頭著者

It seems that the Instruction Level Parallelism (ILP) approach, which has been used by various superscalar processors and VLIW processors for a long time, is reaching its limit of performance improvement. To obtain scalable performance improvement, cost effectiveness and high productivity even in the era of one billion transistors, the cooperative work between software and hardware is getting increasingly important. For this reason, the authors have developed OSCAR (Optimally SCheduled Advanced multiprocessoR) Chip Multiprocessor (OSCAR CMP) and OSCAR multigrain compiler simultaneously. To preserve the scalability in the future, OSCAR CMP has mechanisms for efficient use of parallelism and data locality, and for hiding data transfer overhead. These mechanisms can be fully controlled by the OSCAR multigrain compiler. In this paper, the authors focus on multigrain parallel processing on OSCAR CMP, which enables us to exploit loop iteration level parallelism and coarse grain task parallelism in addition to ILP from the entire program. Performance of multigrain parallel processing on OSCAR CMP architecture is evaluated using the SPEC fp 2000/95 benchmark suite. When a microSPARC-like single issue core is used, OSCAR CMP gives us from 1.77 to 3.96 times speedup for four processors against a single processor. In addition, OSCAR CMP is compared with a Sun UltraSPARC II like processor to evaluate cost effectiveness. As a result, OSCAR CMP gives us 1.66 times better performance on average under the condition that OSCAR CMP and UltraSPARC II are built from almost the same number of transistors.
シングルチップマルチプロセッサにおけるJPEGエンコーディングのマルチグレイン並列処理

小高剛, 内田貴之, 木村啓二, 笠原博徳

情報処理学会ハイパフォーマンスコンピューティングシステム論文誌 43 ( Sig 6(HPS5) ) 153 - 162 2002年09月 [査読有り]

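近年のJPEG,MPEGなどを用いたマルチメディアコンテンツの増加にともない,これらマルチメディアアプリケーションを効率良く処理できる低コスト,低消費電力かつ高性能なプロセッサの開発が望まれている.特に,複数のプロセッサコアを搭載したシングルチップマルチプロセッサは命令レベル以外の並列性も自然に引き出すことができ集積度向上に対しスケーラブルな性能向上が得られるアーキテクチャとして注目されている.本論文では,JPEGエンコーディングのシングルチップマルチプロセッサ用マルチグレイン並列処理手法を提案するとともに,その性能評価を行う.評価の結果,シンプルなシングルイシュープロセッサを4基搭載したOSCAR型シングルチップマルチプロセッサアーキテクチャでは逐次実行に対して約3.59倍の性能向上が得られスケーラブルな性能向上が得られることが確かめられた.

With the recent increase in multimedia content using JPEG, MPEG and similar formats, there is demand for low-cost, low-power, high-performance processors that can process these multimedia applications efficiently. In particular, single chip multiprocessors with multiple processor cores are attracting attention as an architecture that can naturally exploit parallelism beyond the instruction level and deliver performance that scales with increasing integration density. This paper proposes a multigrain parallel processing scheme for JPEG encoding on a single chip multiprocessor and evaluates its performance. The evaluation confirmed that an OSCAR-type single chip multiprocessor architecture with four simple single-issue processors achieves about 3.59 times speedup over sequential execution, showing scalable performance improvement.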

シングルチップマルチプロセッサにおける JPEGエンコーディングのマルチグレイン並列処理（共著）

小高剛, 内田貴之, 木村啓二, 笠原博徳

情報処理学会並列処理シンポジウム(JSPP2002) 2002年05月
Static coarse grain task scheduling with cache optimization using openMP

Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2327 479 - 489 2002年

Effective use of cache memory is getting more important with increasing gap between the processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache after task and data decomposition considering cache size at compile time. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP workstation, using Swim and Tomcatv from the SPEC fp 95. As the results, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times on 4 processors for Tomcatv respectively against the Sun Forte HPC 6 loop parallelizing compiler. © 2002 Springer Berlin Heidelberg.

被引用数 (Scopus): 2
Multigrain parallel processing for JPEG encoding on a single chip multiprocessor

T Kodaka, K Kimura, H Kasahara

INTERNATIONAL WORKSHOP ON INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 57 - 63 2002年 [査読有り]

With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia application have been expected. Particularly, single chip multiprocessor architecture having simple processor cores that will attain good scalability and cost effectiveness is attracting much attention. To exploit full performance of single chip multiprocessor architecture, multigrain parallel processing, which exploits coarse grain task parallelism, loop parallelism and instruction level parallelism, is attractive. This paper describes a multigrain parallel processing scheme for the JPEG encoding on a single chip multiprocessor and its performance. The evaluation shows an OSCAR type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up against sequential execution time.
Multigrain automatic parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler

H Kasahara, M Obata, K Ishizaka, K Kimura, H Kaminaga, H Nakano, K Nagasawa, A Murai, H Itagaki, J Shirako

PAR ELEC 2002: INTERNATIONAL CONFERENCE ON PARALLEL COMPUTING IN ELECTRICAL ENGINEERING 105 - 111 2002年 [査読有り]

This paper describes the OSCAR multigrain parallelizing compiler, which has been developed in the Japanese Millennium Project IT21 "Advanced Parallelizing Compiler" project, and its performance on SMP machines. The compiler realizes multigrain parallelization for chip-multiprocessors to high-end servers. It hierarchically exploits coarse grain task parallelism among loops, subroutines and basic blocks and near fine grain parallelism among statements inside a basic block in addition to loop parallelism. Also, it globally optimizes cache use over different loops, or coarse grain tasks, based on data localization technique to reduce memory access overhead. Current performance of the OSCAR compiler for SPEC95fp is evaluated on different SMPs. For example, it gives us 3.7 times speedup for HYDRO2D, 1.8 times for SWIM, 1.7 times for SU2COR, 2.0 times for MGRID, 3.3 times for TURB3D on 8 processor IBM RS6000, against XL Fortran compiler ver. 7.1, and 4.2 times speedup for SWIM and 2.2 times speedup for TURB3D on 4 processor Sun Ultra80 workstation against Forte6 update 2.
近細粒度並列処理用シングルチップマルチプロセッサにおけるプロセッサコアの評価

木村啓二, 加藤孝幸, 笠原博徳

情報処理学会論文誌 42 ( 4 ) 692 - 703 2001年04月 [査読有り]

担当区分：筆頭著者
Evaluation of Single Chip Multiprocessor Core Architecture with Near Fine Grain Parallel Processing

Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01) 2001年01月 [査読有り]

担当区分：筆頭著者
Near fine grain parallel processing using static scheduling on single chip multiprocessors

K Kimura, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 23 - 31 2000年 [査読有り]

With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of effective performance of a processor are getting important problems. However, it has been thought that popular superscalar and VLIW would have difficulty to obtain scalable improvement of effective performance in future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multi grain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought one of the most promising approaches. This paper evaluates effectiveness of the single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing as the first step of research on SCM architecture to support multi grain parallel processing. The evaluation shows OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture having distributed shared memory and local memory in addition to centralized shared memory and attachment of global register gives us significant speed up such as 13.8% to 143.8% for four processors compared with shared cache architecture for applications which have been difficult to extract parallelism effectively.
シングルチップマルチプロセッサ上での近細粒度並列処理

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会論文誌 40 ( 5 ) 1924 - 1934 1999年05月 [査読有り]

担当区分：筆頭著者
Near fine grain parallel processing using static scheduling on single chip multiprocessors

Keiji Kimura, Hironori Kasahara

Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems 1999- 23 - 31 1999年 [査読有り]

担当区分：筆頭著者

With the increase in the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of the effective performance of a processor are becoming important problems. However, it has been thought that popular superscalar and VLIW architectures would have difficulty obtaining scalable improvement of effective performance in the future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multi grain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought to be one of the most promising approaches. This paper evaluates the effectiveness of single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing as the first step of research on SCM architecture to support multi grain parallel processing. The evaluation shows that the OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture, which has distributed shared memory and local memory in addition to centralized shared memory and attached global registers, gives a significant speedup of 13.8% to 143.8% for four processors compared with the shared cache architecture for applications from which it has been difficult to extract parallelism effectively.

被引用数 (Scopus)： 7
OSCAR multi-grain architecture and its evaluation

H Kasahara, W Ogata, K Kimura, G Matsui, H Matsuzaki, M Okamoto, A Yoshida, H Honda

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS 106 - 115 1998年 [査読有り]

OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories in addition to local memory on each processor, with a data transfer controller for overlapping data transfer and task processing. Also, its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop iterations in a Doall loop, and near fine grain parallelism among statements. In coarse grain parallel processing, data localization (automatic data distribution) has been employed to minimize data transfer overhead. In near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock-level accurate code scheduling technique with architectural support. This paper describes OSCAR's architecture, its compiler and the performance for multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future high performance computers and single chip multiprocessors.
Data-Localization among Doall and Sequential Loops in Coarse Grain Parallel Processing

Akimasa Yoshida, Yasushi Ujigawa, Motoki Obata, Keiji Kimura, Hironori Kasahara

Seventh Workshop on Compilers for Parallel Computers Linkoping Sweden 266 - 277 1998年01月 [査読有り]
Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System

Wataru Ogata, Akimasa Yoshida, Masami Okamoto, Keiji Kimura, Hironori Kasahara

Proc. of Sixth Workshop on Compilers for Parallel Computers (Aachen Germany) 1996年12月 [査読有り]

講演・口頭発表等

OSCAR ベクトルマルチコアにおける推論処理の高速化に向けたローカルメモリ管理手法

権藤創太, 上林嶺, 朱允楷, 水本幸希, 野谷優仁, 北村俊明, 笠原博徳, 木村啓二

電子情報通信学会技術報告CPSY2025-69 (2026-03)

発表年月： 2026年03月
Jetson Orin Nano における不確実性駆動型予見予測モデルの評価とFP16 精度計算の検討

朱允楷, 昼間彪吾, 顧茗瑞, 伊藤洋, 尾形哲也, 北村俊明, 木村啓二

情報処理学会研究報告Vol.2026-EMB-71 No.38

発表年月： 2026年03月
OSCAR ベクトルマルチコアの性能・電力評価

野谷優仁, 水本幸希, 権藤創太, 上林嶺, 朱允楷, 北村俊明, 笠原博徳, 木村啓二

電子情報通信学会技術報告CPSY2025-64 (2026-03)

発表年月： 2026年03月
End-to-End Implementation of IOMMU-Based DMA Isolation for Trusted Execution Environments on RISC-V

Yanqing LIU, Haocheng XU, Hidetoshi URANAMI, Akihiro SAIKI, Keiji KIMURA

電子情報通信学会技術報告CPSY2025-59 (2026-03)

発表年月： 2026年03月
System-Level Integration and Evaluation of RISC-V Advanced Interrupt Architecture towards Secure I/O on Multicore Platforms

Xinyuan FU, Akihiro SAIKI, Keiji KIMURA

電子情報通信学会技術報告CPSY2025-54 (2026-03)

発表年月： 2026年03月
ユニバーサルTEE におけるIO デバイスのメモリ防御機構

浦浪英俊, 齊木昭大, 内山一秀, 五島正裕, 木村啓二

電子情報通信学会技術報告CPSY2025-53 (2026-03)

発表年月： 2026年03月
ソフトウェアへの依存を最小化するTEE 割り込み防御手法

齊木昭大, 浦浪英俊, 内山一秀, 五島正裕, 木村啓二

電子情報通信学会技術報告CPSY2025-52 (2026-03)

発表年月： 2026年03月
セキュアなVM上でGPGPUを利用する際のデータ転送オーバーヘッドの評価

浦浪英俊, 齊木昭大, 木村啓二

電子情報通信学会技術研究報告(Web)

発表年月： 2025年

開催年月： 2025年
FTQC における表面符号の物理量子ビットへのノイズ適応型配置の性能評価

丸茂直樹, 木村啓二

情報処理学会研究報告Vol.2026-HPC-203 No.58

発表年月： 2024年03月
Keystone Enclaveにおける高効率でセキュアなHost-Enclave間での大規模データ授受手法

齊木昭大, 木村啓二

電子情報通信学会技術研究報告(Web)

発表年月： 2024年

開催年月： 2024年
ホストとエンクレーブ間の効率的なデータ転送のためのメモリプール管理の強化

DING Jianxuan, KIJI Koichiro, SAIKI Akihiro, KIMURA Keiji

電子情報通信学会技術研究報告(Web)

発表年月： 2024年

開催年月： 2024年
OSCAR自動並列化コンパイラによる並列化オーバヘッド削減のためのタスク融合手法を用いた実ラダーアプリケーションの並列化

川角冬馬, 見神広紀, 吉川智哉, 細見武郎, 追立真吾, 木村啓二, 笠原博徳

情報処理学会論文誌ジャーナル(Web)

発表年月： 2024年

開催年月： 2024年
RISC-V SoCにおけるSecure Bootの実装と検証の高速化に向けた評価

齊木昭大, 大森侑, 木村啓二

情報処理学会研究報告(Web)

発表年月： 2023年

開催年月： 2023年
RISC-V KeystoneにおけるEnclaveアプリケーションキャッシュ機能の拡張

梅澤拓夢, 齊木昭大, 木村啓二

電子情報通信学会技術研究報告(Web)

発表年月： 2023年

開催年月： 2023年
Jetson Xavier NXにおけるORB-SLAM3の低消費電力化の検討

林頼人, 見神広紀, 納富昭, 木村貞弘, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告(Web)

発表年月： 2023年

開催年月： 2023年
各コアがローカルメモリを持つ組み込みベクトルマルチコアでの畳み込み層演算の評価

大高凌聖, 小池穂乃花, 磯野立成, 川角冬馬, 北村俊明, 見神広紀, 納富昭, 木村貞弘, 木村啓二, 笠原博徳

情報処理学会研究報告(Web)

発表年月： 2023年

開催年月： 2023年
深層学習コンパイラTVMのベクトルマルチコア向けコード生成手法の検討

大西文彬, 大高凌聖, 藤田一輝, 末次智貴, 川角冬馬, 北村俊明, 笠原博徳, 木村啓二

情報処理学会研究報告(Web)

発表年月： 2023年

開催年月： 2023年
OSCAR自動並列化コンパイラを用いたラダープログラムの並列性解析

津村雄太, 川角冬馬, 見神広紀, 川上大樹, 細見武郎, 追立真吾, 木村啓二, 笠原博徳

情報処理学会研究報告(Web)

発表年月： 2022年

開催年月： 2022年
Prototype Implementation of Non-Volatile Memory Support for RISC-V Keystone Enclave

信学技報, vol. 121, no. 116, CPSY2021-2 (SWoPP2021)

発表年月： 2021年07月
Sparse Neural NetworkにおけるSpMMの並列/ベクトル化による高速化

田處雄大, 木村啓二, 笠原博徳

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

発表年月： 2021年03月
整合性ツリーおよび暗号化機構を持つ不揮発性メインメモリエミュレータの実装

林知輝, 大森侑, 木村啓二

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

発表年月： 2021年03月
OSCARコンパイラによるMATLAB/Simulinkアプリケーションの自動並列化

古山凌, 津村雄太, 川角冬馬, 仲田優哉, 梅田弾, 木村啓二, 笠原博徳

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

発表年月： 2021年03月
Linuxが動作可能なRISC-V NVMMエミュレータの実装

大森侑, 木村啓二

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

発表年月： 2021年03月
OSCAR自動並列化コンパイラとNECベクトル化コンパイラの協調によるベクトル・パーソナルスパコン上での自動ベクトル並列化

田處雄大, 見神広紀, 細見岳生, 木村啓二, 笠原博徳

情報処理学会研究報告 2020-ARC-240 情報処理学会

発表年月： 2020年03月
マルチターゲット自動並列化コンパイラにおけるアクセラレータコスト推定手法の検討

山本一貴, 藤田一輝, 柏俣智哉, 高橋健, Boma A. Adhi, 北村俊明, 川島慧大, 納富昭, 森裕司, 木村啓二, 笠原博徳

情報処理学会研究報告 2020-ARC-240 情報処理学会

発表年月： 2020年03月
OSCARコンパイラのC++プログラム対応の検討

川角冬馬, TilmanPriesner, 野口真聖, 韓吉新, 見神広紀, 川島慧大, 田中啓士郎, 木村啓二, 笠原博徳

情報処理学会研究報告 2020-ARC-240 情報処理学会

発表年月： 2020年03月
NDCKPT:不揮発性メインメモリを用いたOSによる透過的なプロセスチェックポインティングの実現

西田耀, 木村啓二

電子情報通信学会技術報告 CPSY2019-102 電子情報通信学会

発表年月： 2020年03月
準同型暗号による行列積の高速化の検討

牧田哲也, 宍戸哲平, 和田康孝, 木村啓二

電子情報通信学会技術報告 CPSY2019-96 電子情報通信学会

発表年月： 2020年03月
Cascaded DMAC Enabling Efficient Data Transfer for Indirect Memory Access Applications

Keiji Kimura [招待有り]

4th International Symposium on Research and Education of Computational Science (RECS) RECS

発表年月： 2019年11月
DMAのカスケード接続による間接ロードの高速化

柏俣智哉, 北村俊明, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告

発表年月： 2019年

開催年月： 2019年
OSCARベクトルマルチコアプロセッサのための自動並列ベクトル化コンパイラフレームワーク

宮本一輝, 牧田哲也, 高橋健, 柏俣智哉, 河田巧, 狩野哲史, 北村俊明, 木村啓二, 笠原博徳

情報処理学会研究報告 2018-ARC-230 情報処理学会

発表年月： 2018年03月
OSCARベクトルアクセラレータのFPGA上での性能評価

柏俣智哉, ADHI Boma A., 狩野哲史, 宮本一輝, 河田巧, 高橋健, 牧田哲也, 北村俊明, 木村啓二, 笠原博徳

情報処理学会全国大会講演論文集

発表年月： 2018年

開催年月： 2018年
OSCARベクトルマルチコアアーキテクチャのコンパイルフロー構築及び評価

高橋健, 狩野哲史, 宮本一輝, 河田巧, 柏俣智哉, 牧田哲也, 北村俊明, 木村啓二, 笠原博徳

情報処理学会全国大会講演論文集

発表年月： 2018年

開催年月： 2018年
階層アジャスタブルブロックを用いた自動マルチコア・ローカルメモリ管理とその性能評価

白川智也, 阿部佑人, 大木吉健, 吉田明正, 木村啓二, 笠原博徳

情報処理学会研究報告 2017-ARC-220 情報処理学会

発表年月： 2017年11月
結果に再現性のある計算機システムエミュレータ

清水勇希, 高井峰生, 木村啓二

マルチメディア、分散、協調とモバイル(DICOMO2017)シンポジウム情報処理学会

発表年月： 2017年07月
大規模システムを想定したGem5 シミュレータの階層的インターコネクションネットワーク拡張

小野口達也, 林綾音, 宇高勝之, 松島裕一, 木村啓二, 笠原博徳

情報処理学会研究報告 2017-ARC-217 情報処理学会

発表年月： 2017年03月
自動車リアルタイム制御計算の複数クラスタ構成マルチコア上での並列処理

宮田仁, 島岡護, 見神広紀, 西博史, 鈴木均, 木村啓二, 笠原博徳

情報処理学会研究報告 2017-ARC-217 情報処理学会

発表年月： 2017年03月
自動並列化コンパイラのコンパイル時間短縮のための実行プロファイル・フィードバックを用いたコード生成手法

藤野里奈, 韓吉新, 島岡護, 見神広紀, 宮島崇浩, 高村守幸, 木村啓二, 笠原博徳

情報処理学会研究報告 2017-ARC-217 情報処理学会

発表年月： 2017年03月
OSCARベクトルマルチコアアーキテクチャのコンパイルフロー構築及び評価

高橋健, 狩野哲史, 宮本一輝, 河田巧, 柏俣智哉, 牧田哲也, 北村俊明, 木村啓二, 笠原博徳

情報処理学会第80回全国大会情報処理学会

発表年月： 2017年03月
OSCARベクトルアクセラレータのFPGA上での性能評価

柏俣智哉, Boma A. ADHI, 狩野哲史, 宮本一輝, 河田巧, 高橋健, 牧田哲也, 北村俊明, 木村啓二, 笠原博徳

情報処理学会第80回全国大会情報処理学会

発表年月： 2017年03月
LLVMを用いたベクトルアクセラレータ用コードのコンパイル手法

丸岡晃, 無州祐也, 狩野哲史, 持山貴司, 北村俊明, 神谷幸男, 高村守幸, 木村啓二, 笠原博徳

情報処理学会研究報告 2016-ARC-221 情報処理学会

発表年月： 2016年08月
OSCARコンパイラを用いた医用画像フィルタリングのマルチグレイン並列処理

奥村万里子, 柴崎大侑, 桑島昂平, 見神広紀, 木村啓二, 門下康平, 中野恵一, 笠原博徳

情報処理学会研究報告 2016-HPC-153 情報処理学会

発表年月： 2016年03月
OSCARコンパイラを用いた医用画像3Dノイズリダクションの自動マルチグレイン並列処理

柴崎大侑, 桑島昂平, 奥村万里子, 見神広紀, 木村啓二, 門下康平, 中野恵一, 笠原博徳

情報処理学会研究報告 2016-HPC-153 情報処理学会

発表年月： 2016年03月
OSCAR自動並列化コンパイラにおける解析時データ構造変換による並列性抽出手法

影浦直人, 和気珠実, 韓吉新, 木村啓二, 笠原博徳

情報処理学会研究報告 2016-HPC-153 情報処理学会

発表年月： 2016年03月
データ多次元整合分割によるマルチコア・ローカルメモリ管理手法

山本康平, 白川智也, 吉田明正, 木村啓二, 笠原博徳

情報処理学会研究報告 2016-SLDM-174 情報処理学会

発表年月： 2016年01月
計算機システムエミュレーションにおける再現性の評価

福意大智, 水本旭洋, 西本真介, 金田茂, 高井峰生, 木村啓二

マルチメディア、分散、協調とモバイル(DICOMO2015)シンポジウム情報処理学会

発表年月： 2015年07月
OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化

飯塚修平, 山本英雄, 平野智大, 岸本耀平, 後藤隆志, 見神広紀, 木村啓二, 笠原博徳

情報処理学会研究報告 2015-EMB-36 情報処理学会

発表年月： 2015年03月
動画像デコーディングのIntelおよびARMマルチコア上での並列処理の評価

和気珠実, 飯塚修平, 見神広紀, 木村啓二, 笠原博徳

研究報告組込みシステム（EMB）一般社団法人情報処理学会

発表年月： 2015年02月

開催年月： 2015年02月

本稿では，マルチコアプロセッサを用いて動画像デコーディング処理の高速化を実現する手法として 2 種類の並列化手法について性能評価を行った．1 つ目の並列化手法は並列化対象ループにループスキューイング/ループインターチェンジを適用する手法，2 つ目の並列化手法は wave-front 手法を適用する手法であり，どちらの場合もマクロブロック間の依存関係を満たしつつこれらの間の並列性を利用することで並列処理が可能となる．評価に用いる動画像コーデックは，MPEG2 と比較して約 2 倍の符号化効率を持ちワンセグ放送等に用いられている H.264/AVC と，H.264/AVC と同等の品質を持ち Youtube 等でも採用されている動画規格である WebM のビデオコーデック VP8 である．これらの規格により動画像デコーディングを行うプログラムに対して，上記 2 つの並列化手法をそれぞれ適用した．Snapdragon APQ8064 Krait 4 コアを搭載した Nexus7 上で評価を行った結果，ループスキューイング/ループインターチェンジ手法で並列化した場合，並列化箇所のみで逐次実行に比べ 3 コアで 1.33 倍速度向上し，その一方で wave-front 手法では 3 コアで 2.86 倍の速度向上が得られた．同様に Intel(R) Xeon(R) CPU X5670 プロセッサを搭載したマシンで評価を行った結果，ループスキューイング/ループインターチェンジ手法で並列化した場合，並列化箇所のみで逐次実行に比べ 6 コアで 1.82 倍速度向上し，一方で wave-front 手法では 6 コアで 4.61 倍の速度向上が得られた．
自動並列化・低消費電力化された複数アプリケーションに対するマルチコア用ダイナミックスケジューリング手法

後藤隆志, 武藤康平, 平野智大, 見神広紀, 高橋宇一郎, 井上栄, 木村啓二, 笠原博徳

研究報告組込みシステム（EMB）一般社団法人情報処理学会

発表年月： 2015年02月

開催年月： 2015年02月

本稿では，マルチコアを搭載したスマートフォン端末において，コンパイラにより自動並列化及び低消費電力化された複数のアプリケーションを実行する際に，全体の実行時間の短縮あるいは各アプリケーション毎に設定されたデッドラインを守りつつ電力削減を達成するダイナミックスケジューリング方式について提案する．本スケジューリング手法では，コンパイル時に指定した各アプリケーションの並列実行時の利用コア数に応じた実行時間や消費電力，及びデッドラインを用いて，3種類の方式に基づくスケジューリングを行う．ARM 4 コアの端末上で動画コーデックアプリケーションを対象に評価を行い，FIFO 方式と比べ速度向上率で 18.5％，電力削減率で -28.8％の結果が得られた．

This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power-reduced applications on multicore smart devices to gain higher performance and lower power consumption within each application's deadline. The scheduling algorithm uses information such as the execution time, power, deadline and number of cores for each application, and is composed of three types of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained an 18.5% speedup and a 28.8% power reduction compared with FIFO scheduling.
LTE無線基地局におけるレイヤ2信号処理のOSCARコンパイラによる自動並列化

田中優利, 小松裕樹, 影浦直人, 見神広紀, 松元映二, 横山正浩, 江崎孝斗, 箕輪守彦, 高村守幸, 木村啓二, 笠原博徳

情報処理学会研究報告(Web)

発表年月： 2015年

開催年月： 2015年
モデルベース開発向け画像処理ソフトウェアの並列化フレームワーク

梅田弾, 鈴木貴広, 見神広紀, 木村啓二, 笠原博徳

情報処理学会研究報告(Web)

発表年月： 2015年

開催年月： 2015年
自動並列化コンパイラによるソフトウェアキャッシュコヒーレンシ制御手法の評価

岸本耀平, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究報告 2014-ARC-213 情報処理学会

発表年月： 2014年12月
Android Movie Player System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

BuiDucBinh, Tomohiro Hirano, Dominic Hillenbrand, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

組込みシステムシンポジウム2014論文集

発表年月： 2014年10月

開催年月： 2014年10月
OSCARコンパイラを用いたH.264/AVCデコーダのAndroidマルチコアでの低消費電力化

飯塚修平, 山本英雄, 平野智大, 後藤隆志, 見神広紀, 高橋宇一郎, 井上栄, 高村守幸, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2014年09月

開催年月： 2014年09月

スマートフォンの普及と移動通信の高速化に伴い，モバイル端末における動画再生の頻度が増加している．H.264/AVC は高い圧縮率を実現することからワンセグ放送や YouTube など現在のメディア処理に広く利用されている動画像圧縮符号標準であるが，モバイル端末では動画再生時の膨大な演算に対する消費電力の増大がバッテリー持続時間の低下を招き，問題となっている．この問題に対して，現在では専用ハードウェアが用いられているが，モバイル端末に求められる多様なコーデックやアップデートへの柔軟な対応を考慮すると，今後ソフトウェアによる解決手法が有用であると考えられる．本研究では H.264/AVC デコーダのプログラムのうち最も負荷が大きいフレーム間予測及び，デブロッキングフィルタの処理に対して並列化を行った上で電力制御を適用し，ソフトウェアによる消費電力削減の有用性を検証した．OSCAR 自動並列化コンパイラを用いて LoopSkewing のアクセス順序からマクロブロックレベルでの並列性を抽出し，リアルタイム制約の保証内での DVFS 及び WFI を用いた擬似クロックゲーティングを適用した．Android 端末の開発ボードである ODROID-X2 の上で電力値の評価を行ったところ，1PE で 1.07[W] から 0.79[W] に，2PE で 1.69[W] から 0.57[W] に，3PE で 2.45[W] から 0.51[W] に消費電力を削減したことが確認された．
大規模無線センサネットワークにおける外乱を考慮したアーキテクチャ探索シミュレータの実装と評価

山下浩一郎, 鈴木貴久, 栗原康志, 大友俊也, 木村啓二, 笠原博徳

マルチメディア、分散協調とモバイルシンポジウム2014論文集

発表年月： 2014年07月

開催年月： 2014年07月
Android Demonstration System of Automatic Parallelization and Power Optimization by OSCAR Compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

情報処理学会研究報告 2014-ARC-211 情報処理学会

発表年月： 2014年07月
Linux ftrace を用いたマルチコアプロセッサ上での並列化プログラムのトレース手法

福意大智, 島岡護, 見神広紀, Dominic Hillenbrand, 木村啓二, 笠原博徳

情報処理学会研究報告 2014-ARC-211 情報処理学会

発表年月： 2014年07月
A Latency Reduction Technique for Network Intrusion Detection System on Multicores

Keiji Kimura [招待有り]

14th International Forum on Embedded MPSoC and Multicore MPSoC

発表年月： 2014年07月
小ポイントFFTのマルチコア上での自動並列化手法

古山祐樹, 見神広紀, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-ARC-201 情報処理学会

発表年月： 2014年03月
統計的手法を用いた並列化コンパイラ協調マルチコアアーキテクチャシミュレータ高速化手法

田口学豊, 木村啓二, 笠原博徳

電子情報通信学会技術報告 ETNET2014 電子情報通信学会

発表年月： 2014年03月
不正侵入検知システムにおけるマルチコア上でのシグネチャ割当によるレイテンシ削減手法

山田正平, 見神広紀, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2014年02月

開催年月： 2014年02月

企業や政府機関を標的としたサイバー攻撃が年々高度で大規模なものになっている．これらサイバー攻撃の有効策のひとつとして不正侵入検知システムが挙げられる．不正侵入検知システムはネットワークを監視し，IP パケットをフィルタリングすることで不審なアクセスをリアルタイムで検知する．一方で，膨大なパケットを処理するための処理性能が求められる．そこで本研究では，シグネチャ型の不正侵入検知システムにおいてシグネチャを分割し，マルチコアへの割当によるレイテンシ削減手法を提案する．本手法は，並列処理によってパケットあたりの検知処理時間の短縮が可能である．レイテンシ削減手法をオープンソースの不正侵入検知システムであるSuricataにおいて適用し，DARPA Intrusion Detection Evaluation Data Setなどのデータセットを入力とした際の検知処理性能を評価した．その結果，2 コア上でシグネチャを分割しない場合と比較して DARPA Intrusion Detection Evaluation Data Set において 4 コア上で最大 3.22 倍の検知処理時間の短縮を得ることができた．

Cyber attacks targeting companies and government organizations have been increasing and becoming more sophisticated. An Intrusion Detection System (IDS) is one of the effective countermeasures against those attacks. An IDS detects illegal network accesses in real time by monitoring the network and filtering suspicious IP packets. High processing performance is required for IDSs to process a large number of IP packets in real time. In order to satisfy this requirement, this paper proposes a latency reduction technique for signature-based IDSs that allocates decomposed signatures to multiple cores. The proposed technique is implemented in Suricata, an open source IDS, and evaluated with several data sets, such as the DARPA Intrusion Detection Evaluation Data Set. The evaluation results show that the proposed technique with four cores achieves up to a 3.22 times performance improvement compared with two cores without signature decomposition.
小ポイントFFTのマルチコア上での自動並列化手法

古山祐樹, 見神広紀, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2014年02月

開催年月： 2014年02月

高速フーリエ変換 (FFT) は，ディジタル信号処理や画像圧縮など様々な分野で使用される非常に応用性の高い計算アルゴリズムである．その中でも，LTE 等のベースバンド処理で用いられる小ポイントの FFT プログラムは，データ転送や制御のオーバーヘッドを伴う専用ハードウェアを使用しにくく，マルチコア上での並列化の要求が高まっている．本稿では，そのような小ポイントの FFT プログラムに対しコンパイラによる自動並列化及び，false sharing 回避を目的としたキャッシュ最適化を適用し，データキャッシュを持つ種々の共有メモリ型マルチコアアーキテクチャに向けて低オーバーヘッドな並列化コードを生成する自動並列化手法を提案する．提案手法を OSCAR 自動並列化コンパイラに実装し，32 ポイントから 256 ポイントまでの小ポイントFFTを並列化し，8 つの SH4A コアを集積した情報家電用マルチコアプロセッサ RP2 上で性能評価を行ったところ，256 ポイントの FFT プログラムで，逐次プログラムに対し 2 コア並列化で 1.97 倍，4 コア並列化で 3.9 倍というスケーラブルな速度向上を得ることが出来た．また，FFT と同様にバタフライ演算を行う高速アダマール変換のプログラムにも同手法を適用し評価を行い，256 ポイントのプログラムで 2 コア並列化で 1.91 倍，4 コア並列化で 3.32 倍という高い速度向上が得られ，提案手法の有用性が確認された．

Fast Fourier Transform (FFT), which computes the Discrete Fourier Transform (DFT), is one of the most frequently used algorithms in many applications including digital signal processing and image processing. Although small size FFT programs must be used in baseband signal processing such as LTE, it is difficult to use special hardware like DSPs for computing such small problems because of their relatively large data transfer and control overhead. This paper proposes an automatic parallelization method that generates parallelized programs with low overhead for small size FFTs suited to shared memory multicore processors by applying cache optimization to avoid false sharing between cores. The proposed method has been implemented in the OSCAR automatic parallelizing compiler, applied to small point FFT programs from 32 points to 256 points, and evaluated on the RP2 multicore processor having 8 SH-4A cores. It achieved a 1.97 times speedup on 2 SH-4A cores and a 3.9 times speedup on 4 SH-4A cores for a 256 point FFT program. In addition to the FFT programs, the proposed approach is applied to the Fast Hadamard Transform (FHT), which has computation similar to the FFT. The results are a 1.91 times speedup on 2 SH-4A cores and a 3.32 times speedup on 4 SH-4A cores. These results show the effectiveness of the proposed method and the ease of applying it to many kinds of programs.
プロファイル情報を用いたAndroid 2D描画ライブラリSKIAのOSCARコンパイラによる並列化

後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-ARC-207-12 情報処理学会

発表年月： 2013年12月
モデルベース設計により自動生成されたエンジン制御Cコードのマルチコア用自動並列化

梅田弾, 金羽木洋平, 見神広紀, 谷充弘(デンソー), 森裕司(デンソー), 木村啓二, 笠原博徳

組み込みシステムシンポジウム（ESS2013）情報処理学会

発表年月： 2013年10月
OSCAR API標準解釈系を用いた階層グルーピング対応ハードウェアバリア同期機構の評価

川島慧大, 金羽木洋平, 林明宏, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-ARC-206-16 情報処理学会

発表年月： 2013年08月
Androidベースマルチコア上での自動電力制御

平野智大, 武藤康平, 後藤隆志, 見神広紀, 山本英雄, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-ARC-206-23 情報処理学会

発表年月： 2013年08月
OSCAR API標準解釈系を用いた階層グルーピング対応ハードウェアバリア同期機構の評価

川島慧大, 金羽木洋平, 林明宏, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2013年07月

開催年月： 2013年07月

1 チップ内に搭載されるコア数の増加に伴い，アプリケーションからより多くの並列性を抽出し，低オーバーヘッドで利用することがこれらのコアを有効利用するために重要となっている．OSCAR コンパイラによる自動並列化ではより多くの並列性を利用するため，ループやサブルーチン内部の粗粒度並列性を解析し，階層的にタスク定義を行う．この階層的に定義されたタスクをコアを階層的にグルーピングし，コアグループに対して割り当てることにより並列処理を実現する．この階層的なグループ間で独立かつ低コストでバリア同期を実現できるハードウェアが提案され，SH4A プロセッサ 8 コア搭載の情報家電用マルチコア RP2 に実装されている．本稿では，OSCAR API 標準解釈系の階層グループバリア同期 API を RP2 のハードウェアバリア同期機構に対応し評価を行った結果について述べる．8 コアを使用した SPEC CPU 2000 の ART による評価ではソフトウェアでのバリア同期に対し 1.16 倍の性能向上が得られた．
OSCAR API v2.1 with Flexible Accelerator Control Facilities

Keiji Kimura [招待有り]

13th International Forum on Embedded MPSoC and Multicore MPSoC

発表年月： 2013年07月
マルチコア用並列化アプリケーション開発の基礎と実例

木村啓二 [招待有り]

ESEC 2013 専門セミナー Reed Exhibition Japan

発表年月： 2013年05月
Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

アルドーサリーヤーセル, 古山祐樹, ドミニクヒレンブランド, 木村啓二, 笠原博徳, 成田誠之助

情報処理学会研究報告 2013-OS-125 情報処理学会

発表年月： 2013年04月
マルチコア商用スマートディバイスの評価と並列化の試み

山本英雄, 後藤隆志, 平野智大, 武藤康平, 見神広紀, Hillenbrand Dominic, 林明宏, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-OS-124 情報処理学会

発表年月： 2013年02月
自動車エンジン制御ソフトウェアにおけるマルチコア上での並列処理

金羽木洋平, 梅田弾, 見神広紀, 林明宏, 沢田光男, トヨ, 木村啓二, 笠原博徳

情報処理学会研究報告 2013-ARC-203-2 情報処理学会

発表年月： 2013年01月
並列化アプリケーションを対象とした統計的手法によるメニーコアアーキテクチャシミュレーションの高速化

阿部洋一, 田口学豊, 木村啓二, 笠原博徳

情報処理学会研究報告 2012-ARC-203-13 情報処理学会

発表年月： 2013年01月
コンパイラと協調したシミュレーション精度切り換え可能なマルチコアアーキテクチャシミュレータ

田口学豊, 阿部洋一, 木村啓二, 笠原博徳

情報処理学会研究報告 2012-ARC-203-14 情報処理学会

発表年月： 2013年01月
Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

Gonzalez-Alvarez Cecilia, 金羽木洋平, 竹本昂生, 岸本耀平, 武藤康平, 見神広紀, 林明宏, 木村啓二, 笠原博徳

情報処理学会研究報告 2012-ARC-202HPC137-10 情報処理学会

発表年月： 2012年12月
地震動シミュレータGMSのOSCARコンパイラによる自動並列化

島岡護, 見神広紀, 林明宏, 和田康孝, 木村啓二, 森田秀和, 内山邦男, 笠原博徳

情報処理学会研究報告 2012-ARC-202HPC137-11 情報処理学会

発表年月： 2012年12月
Opportunities and Challenges of Application-Power Control in the Age of Dark Silicon

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

情報処理学会研究報告 2012-ARC-202 HPC137-26 情報処理学会

発表年月： 2012年12月
組込マルチコア用OSCAR APIを用いたTILEPro64上でのマルチメディアアプリケーションの並列処理

岸本耀平, 見神広紀, 中野恵一, 林明宏, 木村啓二, 笠原博徳

組み込みシステムシンポジウム（ESS2012）情報処理学会

発表年月： 2012年10月
エンジン基本制御ソフトウェアモデルのマルチコア上での並列処理

梅田弾, 金羽木洋平, 見神広紀, 林明宏, 谷充弘, 森裕司, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）

発表年月： 2012年07月

開催年月： 2012年07月

自動車の安全性・快適性・環境負荷の低減を目指し，自動車制御系は年々高度化している．これに伴い，制御プロセッサには高い性能が求められるが，シングルコアの動作周波数，及び命令レベル並列性の向上が困難となり，1 コアによる処理性能が限界に達したため，マルチコアへの移行が求められている．しかし，マルチコアではプログラムの並列化の困難なため，並列化プログラムの開発コスト・開発期間・信頼性等が問題となっている．本稿では従来シングルコアのみで動作していた基本エンジン制御ソフトウェアモデルのマルチコア上での並列化手法を提案する．具体的には基本エンジン制御 C プログラムをポインタ利用等に制限を加えた Parallelizable C によって記述されたプログラムに変換し，OSCAR 自動並列化コンパイラにより自動並列化を行う．その結果，従来タスク粒度が細かく手動では並列化ができなかった基本エンジン制御 C プログラムを情報家電用 RP2 上で 2 コアを用いて並列実行したところ，1 コアに対して 1.89 倍，V850 2 コア上で 1 コアに対して 2.06 倍の性能向上することに成功し，エンジン制御ソフトウェアモデルのマルチコア上での並列処理が可能であることを確認した．

The automobile control system is advancing from year to year to achieve safety, comfort and fuel efficiency. Accordingly, control systems need high performance. However, improvements in clock frequency and instruction-level parallelism are difficult, and the performance of a single-core processor has reached its limits. This paper proposes a parallelization method for a basic engine control software model, which has so far run only on single-core processors, for a multicore processor. On multicores, development cost, development period, and software reliability are problems because it is difficult to parallelize software. By developing a Parallelizable C program with some limitations on pointer usage, the OSCAR compiler allows us to perform automatic parallelization and generation of a parallel C program. Using the proposed method, the basic engine control program, which is difficult to parallelize because of its very fine granularity, is parallelized and gives a 1.89 times speedup using 2 cores on the RP2 multicore and a 2.06 times speedup using 2 cores of the V850 multicore. It is confirmed that parallelization of a basic engine control C program on a multicore processor is possible.
低消費電力マルチコアRP-Xを用いた1ワットWebサービスの実現

古山祐樹, 島岡護, 見神広紀, 林明宏, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）

発表年月： 2012年07月

開催年月： 2012年07月

一般に Web サーバの稼働時間の多くはアイドル状態であると言われるが，その際においても常に数十ワットの電力を消費し，大きな電力の浪費となっている。そこで本研究では，Web サーバのプロセッサに低消費電力情報家電用ヘテロジニアスマルチコア RP-X を用いることで，QoS (Quality of Service) を確保しつつ低消費電力で動作する Web サーバを開発した。評価の結果，提案する Web サーバを実際に早稲田大学笠原研究室の Web サーバとして運用した所，平均 1.04 ワットの低消費電力で動作可能なことを確認した．また，様々なアクセス頻度のワークロードでシミュレートした結果，Web サービスとしての QoS(Quality of Service) を満足しつつ，1.66 ワットで動作できることも確認した．本稿では Web サーバの電力をリアルタイムでモニタリングし，電力の可視化を行うシステムについても言及する．

Web servers are known to be in the idle state for most of their execution time, though they consume tens of watts even in this situation. This causes a significant waste of power. To both maintain QoS (Quality of Service) and achieve low power consumption for web servers, in this paper a web server is built upon RP-X, a low-power multicore processor for consumer electronics. Using the proposed server system as the web server of Kasahara Laboratory at Waseda University, power consumption was 1.04 watts on average. In addition, the power consumption of the web server is evaluated over several workloads with different access frequencies. As a result, the developed web server runs at 1.66 watts while satisfying QoS. This paper also presents a real-time power monitoring system that allows visualization of the web server's power consumption.
OSCAR API for Low-Power Multicores and Manycores, and API Standard Translator

Keiji Kimura [招待有り]

12th International Forum on Embedded MPSoC and Multicore MPSoC

発表年月： 2012年07月
並列化コンパイラを考慮したコーディング作法と並列化APIの現在

木村啓二 [招待有り]

ESEC 2012 専門セミナー Reed Exhibition Japan

発表年月： 2012年05月
Javaの自動並列化における例外フローとメソッドディスパッチのインライン化解析

田端啓一, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）

発表年月： 2012年03月

開催年月： 2012年03月

本稿では，Java プログラムを自動並列化するためのコンパイル手法を提案する．Java プログラムから複数粒度の並列性を抽出する上では，2 つの動的なメカニズムに対する解析の複雑さが問題となる．1 つは，例外によって生じる制御フローの解析である．もう 1 つは，メソッド呼び出しによって生じる動的ディスパッチの解析である．本稿の提案手法は，ランタイム環境におけるこれらの動的なメカニズムを，中間表現でのプリミティブな条件分岐にインライン展開し，解析を容易にする．提案手法を実装し評価したところ，Java で記述された optical flow など 3 つの画像処理プログラムから並列性の抽出が可能となり，IBM Power5+ 8 プロセッサにおける 1 プロセッサに対する速度向上率として，最低 7.84 倍の性能向上が得られた．

This paper proposes compilation methods for automatic parallelization of Java. Java programs have two dynamic mechanisms that complicate multiple-grain parallelism extraction. One is the implicit or possible control flow caused by exceptions. The other is dynamic dispatch for virtual method calls. The proposed methods inline these dynamic mechanisms into primitive conditional branches in the intermediate representation for easier analysis. The evaluation results show at least a 7.84x speedup on optical flow and two other image processing programs with IBM Power5+ 8 processors.
並列化メディアアプリケーションを対象としたメニーコアアーキテクチャシミュレーションの高速化の検討

阿部洋一, 石塚亮, 大胡亮太, 田口学豊, 木村啓二, 笠原博徳

情報処理学会研究報告 2012-ARC-199-3 情報処理学会

発表年月： 2012年03月
JISX0180:2011「組込みソフトウェア向けコーディング規約の作成方法」を用いたParallelizable Cの定義

木村啓二, 間瀬正啓, 笠原博徳

研究報告組込みシステム（EMB）

発表年月： 2012年02月

開催年月： 2012年02月

組込みソフトウェアの品質向上を目的として，JISX0180:2011「組込みソフトウェア向けコーディング規約の作成方法」が策定された．一方，自動並列化コンパイラによる並列性抽出を補助するための Parallelizable C が提案されている．本稿では，組込みソフトウェア開発者の自動並列化コンパイラ活用によるマルチコア用アプリケーション開発の生産性向上を目的とし，JISX0180:2011 による Parallelizable C の定義を提案する．本コーディング規約によるプログラムを商用 SMP 及び情報家電用マルチコア上で評価した結果，8 コアの IBM p5 550Q では平均 5.54 倍，4 コアの Intel Core i7 960 では平均 2.43 倍，4 コアの Renesas/Hitachi/Waseda RP2 では平均 2.79 倍の速度向上をそれぞれ得ることができた．

JISX0180:2011 "Framework of establishing coding guidelines for embedded system development" was established to improve the quality of embedded systems. Parallelizable C has also been proposed to support the exploitation of parallelism by a parallelizing compiler. This paper proposes a definition of Parallelizable C based on JISX0180:2011, aiming to improve productivity for embedded multicore developers using parallelizing compilers. An evaluation has been carried out using programs rewritten according to the defined coding guideline on ordinary SMPs and a consumer electronics multicore. As a result, a 5.54x speedup on IBM p5 550Q (8 cores), a 2.42x speedup on Intel Core i7 960 (4 cores), and a 2.79x speedup on Renesas/Hitachi/Waseda RP2 (4 cores) have been achieved, respectively.
重粒子線がん治療用線量計算エンジンの自動並列化

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

HPCS2012 - ハイパフォーマンスコンピューティングと計算科学シンポジウム情報処理学会

発表年月： 2012年01月
SMPサーバー上での粒子線がん治療用線量計算エンジンの自動並列化

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

情報処理学会研究報告 2011-ARC189HPC132-2 情報処理学会

発表年月： 2011年11月
SPECベンチマークプログラムのCUDAによる並列化の検討

平勇樹, 木村啓二, 笠原博徳

情報処理学会研究報告 2011-HPC-130-16 情報処理学会

発表年月： 2011年07月
科学技術計算プログラムの構造を利用したメニーコアアーキテクチャシミュレーション高速化手法の評価

石塚亮, 阿部洋一, 大胡亮太, 木村啓二, 笠原博徳

情報処理学会研究報告 2011-ARC-196-14 情報処理学会

発表年月： 2011年07月
並列化APIとコンパイラによるマルチコア用アプリケーションの開発

木村啓二 [招待有り]

ESEC 2011 専門セミナー Reed Exhibition Japan

発表年月： 2011年05月
メディアアプリケーションにおけるコンパイラによるI/Oオーバーヘッド隠蔽手法

林明宏, 関口威, 間瀬正啓, 和田康孝, 木村啓二, 笠原博徳

情報処理学会研究報告 2011-ARC-195-14 情報処理学会

発表年月： 2011年04月
低消費電力マルチコアRP2上での複数メディアアプリケーション実行時の消費電力評価

見神広紀, 北基俊平, 佐藤崇文, 間瀬正啓, 木村啓二, 石坂一久, 酒井淳嗣, 枝廣正人, 笠原博徳

情報処理学会研究報告 2011-ARC-194-1 情報処理学会

発表年月： 2011年03月
OSCAR API標準解釈系を用いたParallelizable Cプログラムの評価

佐藤卓也, 見神広紀, 林明宏, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究報告 2011-ARC-191-2 情報処理学会

発表年月： 2010年10月
情報家電用ヘテロジニアスマルチコアRP-Xにおけるコンパイラ低消費電力制御性能

和田康孝, 林明宏, 渡辺岳志, 関口威, 間瀬正啓, 白子準, 木村啓二, 伊藤雅之, 長谷川淳, 佐藤真琴, 野尻徹, 内山邦男, 笠原博徳

研究報告計算機アーキテクチャ（ARC）情報処理学会

発表年月： 2010年07月

開催年月： 2010年07月

本稿では，情報家電用ヘテロジニアスマルチコア RP-X 上で，コンパイラによる低消費電力制御を適用した結果について述べる．RP-X は NEDO の "情報家電用ヘテロジニアス・マルチコア技術の研究開発" プロジェクトにおいて開発された情報家電用のヘテロジニアスマルチコアであり，汎用 CPU コアとして SH-4A コアを 8 基，アクセラレータコアとして多目的 DRP コア FE-GA 4 基と画像処理用コア MX2 2 基，さらにメディア用コア VPU5 を搭載する．また，周波数制御・電圧制御等の低消費電力化のための機構を持つ．OSCAR コンパイラによって実現される低消費電力制御手法を RP-X の低消費電力機構に適用し，リアルタイム処理時の消費電力削減効果の評価を行った．その結果，SH-4A 8 コアと FE-GA 4 コアを用いた場合，制御を適用しない場合と比較して，オプティカルフロー演算において約 70[%]，AAC エンコーダにおいて約 80[%] の電力削減を得ることができた．

This paper reports the efficiency of the power reduction scheme by the OSCAR compiler applied to a heterogeneous multicore for consumer electronics, "RP-X". RP-X is a heterogeneous multicore developed in the NEDO "Heterogeneous Multicore for Consumer Electronics" project. RP-X includes eight SH-4A cores, four FE-GA DRPs, two MX2 matrix processors, and one VPU5 media processor. To satisfy strong demands for low power consumption, RP-X is also equipped with mechanisms to reduce power by changing the operating frequency and voltage, or by clock gating. The power reduction scheme implemented in the OSCAR compiler is applied to RP-X and evaluated under a real-time constraint using eight SH-4A cores and four FE-GA cores. As a result, power consumption was reduced by about 70% for an optical flow calculation and about 80% for an AAC encoder program.
プログラム構造に着目したメニーコアアーキテクチャシミュレータの高速化手法

石塚亮, 大友俊也, 大胡亮太, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）情報処理学会

発表年月： 2010年07月

開催年月： 2010年07月

本稿ではキャッシュやパイプラインまでシミュレーションする詳細シミュレーションと命令実行のみの高速な機能シミュレーションの両方を用いたシミュレーション精度切り替えによるメニーコアシミュレータの高速化手法を提案する．本手法はメニーコアシミュレータ上で並列化プログラムを実行することを前提としており，このプログラムの一部のみを詳細シミュレーションを行うことにより高速化を図る．このとき，詳細シミュレーションを行うサンプリング部分をプログラム構造から判断し，その分量を統計的手法により決定する．本手法を SPEC95 の TOMCATV，SWIM で及びルネサステクノロジ（当時）提供の AAC エンコーダプログラムを用いて評価したところ，64 コアを想定したシミュレーションで，TOMCATV で 3% 以下の誤差，SWIM で 6% 以下の誤差，AAC エンコーダで 5% 以下の誤差の実行サイクル数を 1/90～1/8 のサンプリング実行で得ることができた．

This paper proposes an acceleration technique for a many-core architecture simulator that dynamically changes the simulation mode. The detailed simulation mode, which considers architectural details such as cache and pipeline, is used for some essential portions of the target program, while the fast functional simulation mode, which only simulates instruction execution, is leveraged for the rest of the program. The key feature of the proposed technique is that the essential portion of the program that should be precisely simulated is identified from the program structure, and the appropriate sampling size for detailed simulation of that portion is determined with a statistical approach. The evaluation results show that the simulation method gives execution clock cycles within 3% error for TOMCATV, 6% error for SWIM, and 5% error for the AAC encoder with 1/90 to 1/8 of samplings in the simulation of 64 cores.
情報家電用ヘテロジニアスマルチコア用自動並列化コンパイラフレームワーク

林明宏, 和田康孝, 渡辺岳志, 関口威, 間瀬正啓, 木村啓二, 伊藤雅之, 長谷川淳, 佐藤真琴, 野尻徹, 内山邦男, 笠原博徳

研究報告計算機アーキテクチャ（ARC）情報処理学会

発表年月： 2010年07月

開催年月： 2010年07月

汎用 CPU コアに加え特定処理を高効率で実行可能なアクセラレータを搭載したヘテロジニアスマルチコアが広く普及している．しかしながら，ヘテロジニアスマルチコアでは様々な計算資源へのタスクスケジューリングやデータ転送コード挿入等多くの負担をプログラマが負う必要がある等プログラミングが困難である．そこで本稿では，複数 CPU 及びアクセラレータを持つヘテロジニアスマルチコアに対して，逐次プログラムを入力とし自動的に実行効率の良い並列プログラムを生成する，ヘテロジニアスマルチコア向け自動並列化コンパイラフレームワークを提案する．本フレームワークでは自動並列化コンパイラとアクセラレータコンパイラとのインターフェースとして新たに提案するヘテロジニアスマルチコア向け OSCAR API を利用することで，逐次 C プログラムを自動的に汎用コアとアクセラレータコアにタスクを配分し，高い性能を実現する．本手法を情報家電用ヘテロジニアスマルチコアプロセッサ RP-X をターゲットとして，AAC エンコーダ及び Optical Flow 計算の自動並列化性能を評価した．その結果，8 つの汎用 CPU コア及び 4 つのアクセラレータコアを使用した場合，逐次実行時と比較して Optical Flow 計算で約 12 倍（OSCAR コンパイラ+アクセラレータコンパイラ使用時），約 32 倍（OSCAR コンパイラ+既存ライブラリ使用時），AAC エンコーダで約 16 倍（OSCAR コンパイラ+既存ライブラリ使用時）の性能向上が得られ，ヘテロジニアスマルチコアを対象とした汎用的なコンパイラフレームワークを実現可能であることがわかった．

Heterogeneous multicores, which integrate multiple general purpose CPU cores and special purpose accelerator cores on a chip, have been widely used in order to attain high performance while keeping power consumption low. However, heterogeneous multicores require programmers to do very difficult coding for load distribution to CPU cores and accelerator cores, synchronization and data transfer using DMA controllers. To this end, this paper proposes a compiler framework that facilitates the development of programs for heterogeneous multicores. This framework parallelizes a sequential C program using the OSCAR parallelizing compiler and an accelerator compiler. The developed framework gives 12 times, 32 times and 16 times speedups with eight general purpose CPU cores and four accelerator cores on the RP-X processor for an optical flow calculation (using the accelerator compiler), an optical flow calculation (using a library) and an AAC audio encoder program (using a library), respectively, against sequential execution by a single CPU core.
組込みマルチコア用並列化APIと並列化コンパイラの現在

木村啓二 [招待有り]

ESEC 2010 専門セミナー Reed Exhibition Japan

発表年月： 2010年05月
並列化コンパイラによるソフトウェアコヒーレンシ制御

間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究会報告 2010-ARC-189, 2010-OS-114 情報処理学会

発表年月： 2010年04月
自動並列化技術を用いたメディア処理オフロード

石坂一久, 酒井淳嗣, 枝廣正人, 宮本孝道, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究会報告 2010-SLDM144, 2010-EMB16 情報処理学会

発表年月： 2010年03月
組込み向けマルチコア上での複数アプリケーション動作時の自動並列化されたアプリケーションの処理性能

宮本孝道, 間瀬正啓, 木村啓二, 石坂一久, 酒井淳嗣, 枝廣正人, 笠原博徳

研究報告計算機アーキテクチャ（ARC）情報処理学会

発表年月： 2010年02月

開催年月：
2010年02月

On embedded multicores, it is important to obtain high performance even when multiple sequential or parallel processes are launched, for example in response to user input. When multiple applications run simultaneously, reducing contention for shared resources is key to limiting performance degradation. This paper evaluates the parallel processing performance of programs generated by the OSCAR automatic parallelizing compiler when multiple applications run simultaneously on NaviEngine, developed by NEC Electronics Corporation. When a compiler-optimized MPEG2 decoder ran together with another application, its performance degradation was at most 0.91%. When multiple instances of SPEC95 CFP 101.tomcatv with compiler cache optimizations ran together, the degradation was at most 1.06%. The degradation was thus confirmed to be negligible.
瞬時電源遮断機構を用いたマルチコアSoC向け省電力ソフトウェア実行環境

小野内雅文, 十山圭介, 野尻徹, 佐藤真琴, 間瀬正啓, 白子準, 佐藤未来子, 高田雅士, 伊藤雅之, 水野弘之, 並木美太郎, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告. CST, コンカレント工学一般社団法人電子情報通信学会

発表年月： 2010年01月

開催年月：
2010年01月

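A software execution environment that combines high processing performance with low power consumption was built using RP2, a multicore SoC with eight CPU cores, and the OSCAR automatic parallelizing compiler. To make processing speed scale with the number of CPU cores, a data mapping method was developed that works with the OSCAR compiler to place data in memory according to its characteristics. This reduces the communication overhead between parallelized tasks running on different CPU cores, namely the time spent on cache coherency maintenance and inter-task synchronization. In addition, an instant power shut-off mechanism was developed that uses on-chip CPU-core local memory to power cores off and resume them quickly. Working with the OSCAR compiler, it shuts off power to idle CPU cores at fine granularity during program execution, eliminating wasted power. When secure AAC-LC compression was run on this environment, the data mapping method yielded a 5.00-times speedup as the number of CPU cores was increased from one to eight, and adding the instant power shut-off mechanism improved power efficiency by 10%.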
H.264/AVCエンコーダのマルチコアプロセッサにおける階層的並列処理

見神広紀, 宮本孝道, 木村啓二, 笠原博徳

情報処理学会研究会報告 2010-ARC-187 情報処理学会

発表年月： 2010年01月
自動並列化のためのElement-Sensitiveポインタ解析

間瀬正啓, 村田雄太, 木村啓二, 笠原博徳

情報処理学会研究報告情報処理学会

発表年月： 2009年10月
メニーコア・プロセッサとそれを支える要素技術

井上弘士, 木村啓二, 松谷宏紀 [招待有り]

組込システムシンポジウム 2009 情報処理学会

発表年月： 2009年10月
マルチコアにおけるParallelizable Cプログラムの自動並列化

間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究報告 2009-ARC-174-15(SWoPP2009) 情報処理学会

発表年月： 2009年08月
マルチコアにおけるParallelizable Cプログラムの自動並列化

間瀬正啓, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC）情報処理学会

発表年月： 2009年07月

開催年月：
2009年07月

This paper proposes Parallelizable C, a guideline for writing C programs that enables automatic parallelization by a compiler. Six sequential programs from scientific computing and multimedia processing, written in Parallelizable C, were automatically parallelized by the OSCAR compiler and evaluated on multicore systems. Compared with sequential execution, the parallelized programs achieved an average speedup of 5.54 times on the IBM p5 550Q, an 8-core server with four dual-core IBM Power5+ processors; 2.43 times on a PC with the quad-core Intel Core i7 920 processor; and 2.78 times using 4 cores in SMP execution mode on RP2 (Renesas/Hitachi/Waseda), a multicore for consumer electronics based on SH-4A cores.
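
Parallel efficiency (speedup divided by the number of cores used) puts the three results on a common scale. The short calculation below uses only the figures quoted in the abstract; the helper name `efficiency` and the dictionary layout are illustrative.

```python
def efficiency(speedup, cores):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / cores

# (average speedup, cores used), as quoted in the abstract.
results = {
    "IBM p5 550Q": (5.54, 8),
    "Intel Core i7 920": (2.43, 4),
    "RP2 (SMP mode)": (2.78, 4),
}

for system, (speedup, cores) in results.items():
    print(f"{system}: {efficiency(speedup, cores):.1%}")
```

On this measure the three platforms fall between roughly 60% and 70% of ideal scaling.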
マルチコアプロセッサ上での粗粒度タスク並列処理のためのコンパイラによるローカルメモリ管理手法

中野啓史, 桃園拓, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会論文誌コンピューティングシステム（ACS）情報処理学会

発表年月： 2009年07月

開催年月：
2009年07月

Multicore processors that integrate a small, fast local memory for each core in addition to an off-chip shared memory have been developed for consumer electronics, which demand real-time operation, high performance, and low power. However, optimizing data locality by hand while taking local memory capacity into account is extremely difficult, so automatic optimization by a compiler is needed to shorten program development time. This paper proposes a parallelizing compilation scheme that makes effective use of capacity-limited local memory. The scheme first extracts parallelism among loops and subroutines using coarse grain task parallel processing. It then applies loop aligned decomposition to split loops into coarse grain tasks that fit the local memory size. Conventional data localization assigns the decomposed data to fixed local memory addresses. The proposed scheme instead allocates and deallocates local memory after task decomposition according to the times at which data is defined or referenced, giving more flexible local memory management. In an evaluation using an AAC encoder, used for audio compression, the proposed scheme ran about 2.6 times faster than the conventional fixed-allocation data localization scheme on the RP1 multicore with four SH4A cores, and about 2.5 times faster on the RP2 multicore with eight SH4A cores.
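
The contrast between fixed allocation and allocation driven by definition and reference times can be sketched as follows. This is an illustrative model only, assuming each decomposed data block has a known size, a first-definition time, and a last-reference time; the names `Block`, `fixed_usage`, and `peak_usage` and the sample numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    size: int        # local memory the block needs
    first_def: int   # time step at which the block is first defined
    last_ref: int    # time step of its last reference

def fixed_usage(blocks):
    """Fixed allocation: every block keeps its own region for the whole run."""
    return sum(b.size for b in blocks)

def peak_usage(blocks):
    """Lifetime-based allocation: a block occupies memory only from its
    first definition to its last reference, so the requirement is the
    peak of the simultaneously live sizes."""
    events = []
    for b in blocks:
        events.append((b.first_def, 1, b.size))       # allocate at definition
        events.append((b.last_ref + 1, 0, -b.size))   # release after last use
    live = peak = 0
    # Sorting puts releases (flag 0) before allocations (flag 1) at equal times.
    for _, _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

blocks = [
    Block("a", 16, 0, 2),
    Block("b", 16, 1, 3),
    Block("c", 16, 3, 5),
    Block("d", 16, 4, 6),
]
print(fixed_usage(blocks), peak_usage(blocks))
```

In this toy case the lifetime-based requirement is half the fixed one, so the same local memory can hold larger decomposed blocks, which is the kind of flexibility the abstract attributes to the proposed scheme.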
組込みソフトウェアの信頼性，開発効率向上のためのコーディングガイドライン

木村啓二 [招待有り]

平成21年度 INSTAC成果報告会

発表年月： 2009年07月
マルチコア上でのOSCAR APIを用いた並列化コンパイラによる低消費電力化手法

中川亮, 間瀬正啓, 大國直人, 白子準, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2009) 情報処理学会

発表年月： 2009年05月
最新の組込みマルチコア用コンパイラ技術と並列API

木村啓二 [招待有り]

ESEC 2009 専門セミナー Reed Exhibition Japan

発表年月： 2009年05月
並列度・タスク実行時間の偏りを考慮した標準タスクグラフセットSTG Ver3を用いたスケジューリングアルゴリズムの評価

島岡護, 今泉和浩, 鷹野芙美代, 木村啓二, 笠原博徳

情報処理学会研究報告情報処理学会

発表年月： 2009年02月
マルチコアプロセッサ上での粗粒度タスク並列処理のためのコンパイラによるローカルメモリ管理手法

中野啓史, 桃園拓, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会論文誌トランザクション(CD-ROM)

発表年月： 2009年

開催年月：
2009年
マルチコア上でのOSCAR API を用いた低消費電力化手法

中川亮, 間瀬正啓, 白子準, 木村啓二, 笠原博徳

電子情報通信学会技術報告 ICD2008-145 電子情報通信学会

発表年月： 2009年01月
マルチコアのためのコンパイラにおけるローカルメモリ管理手法

桃園拓, 中野啓史, 間瀬正啓, 木村啓二, 笠原博徳

電子情報通信学会技術報告 ICD2008-141 電子情報通信学会

発表年月： 2009年01月
メディアアプリケーションを用いた並列化コンパイラ協調型ヘテロジニアスマルチコアアーキテクチャのシミュレーション評価

神山輝壮, 和田康孝, 林明宏, 間瀬正啓, 中野啓史, 渡辺岳志, 木村啓二, 笠原博徳

電子情報通信学会技術報告 ICD2008-140 電子情報通信学会

発表年月： 2009年01月
マルチコアのソフトウェア開発

木村啓二 [招待有り]

CEATEC JAPAN 2008 インダストリアルセッション(IS) JEITA

発表年月： 2008年10月
マルチコア用コンパイル技術の現在

木村啓二 [招待有り]

第10回組み込みシステム技術に関するサマーワークショップ (SWEST10) 情報処理学会

発表年月： 2008年09月
マルチコアプロセッサのソフトウェア

木村啓二 [招待有り]

第31回STARCアドバンスト講座システムアーキテクチャセミナー - SoCシステムアーキテクチャ - STARC

発表年月： 2008年07月
階層グルーピング対応バリア同期機構の評価

山田海斗, 間瀬正啓, 白子準, 木村啓二, 伊藤雅之, 服部俊洋, 水野弘之, 内山邦男, 笠原博徳

情報処理学会研究報告 2007-ARC-178-4 情報処理学会

発表年月： 2008年05月
ポインタ解析を用いた制約付きCプログラムの自動並列化

間瀬正啓, 馬場大介, 長山晴美, 村田雄太, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-178-14 情報処理学会

発表年月： 2008年05月
情報家電用マルチコア上におけるマルチメディア処理のコンパイラによる並列化

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

SACSIS2008 - 先進的計算基盤システムシンポジウム情報処理学会

発表年月： 2008年05月
ヘテロジニアスマルチコアプロセッサ上でのスタティックスケジューリングを用いたMP3エンコーダの並列化

和田康孝, 林明宏, 益浦健, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会論文誌トランザクション(CD-ROM)

発表年月： 2008年

開催年月：
2008年
マルチコアプロセッサ上でのマルチメディア処理の並列化

宮本孝道, 田村圭, 田野裕秋, 見神広紀, 浅香沙織, 間瀬正啓, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-175-15(デザインガイア2007) 情報処理学会

発表年月： 2007年11月
最新の組み込みマルチコア用コンパイラ技術

木村啓二 [招待有り]

システムLSIワークショップ情報処理学会

発表年月： 2007年11月
情報家電用マルチコアSMP実行モードにおける制約付きCプログラムのマルチグレイン並列化

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

組込みシステムシンポジウム (ESS2007) 情報処理学会

発表年月： 2007年10月
ヘテロジニアスマルチコア上でのコンパイラによる低消費電力制御

林明宏, 伊能健人, 中川亮, 松本繁, 山田海斗, 押山直人, 白子準, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-174-18(SWoPP2007) 情報処理学会

発表年月： 2007年08月
ヘテロジニアスマルチコア上での階層的粗粒度タスクスタティックスケジューリング手法

和田康孝, 林明宏, 伊能健人, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-174-17(SWoPP2007) 情報処理学会

発表年月： 2007年08月
54倍速AACエンコードを実現するヘテロジニアスマルチコアアーキテクチャの検討

鹿野裕明, 伊藤雅樹, 戸高貴司, 津野田賢伸, 兒玉征之, 小野内雅文, 内山邦男, 小高俊彦, 亀井達也, 永濱衛, 草桶学, 新田祐介, 和田康孝, 木村啓二, 笠原博徳

電子情報通信学会技術報告 ICD2007-71, Vol. 107 電子情報通信学会

発表年月： 2007年08月
マルチコア用コンパイラ技術

木村啓二 [招待有り]

165委員会主催研究会第46回研究会「マルチコアプロセッサSoCの現状と今後の展望」

発表年月： 2007年07月
組込マルチコアの動向

木村啓二 [招待有り]

JEITA 情報端末フェスティバル 2007 JEITA

発表年月： 2007年06月
情報家電用マルチコア SMP 実行モードにおけるマルチグレイン並列処理

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 亀井達也, 服部俊洋, 長谷川淳, 伊藤雅樹, 佐藤真琴, 内山邦男, 小高俊彦, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2007年05月

開催年月：
2007年05月

Multicore processors are being adopted in a wide range of information devices, from consumer electronics such as game consoles, car navigation systems, digital TVs, and mobile phones to PCs and supercomputers. This paper describes the parallelization of media processing and other programs written in restricted C by the OSCAR multigrain automatic parallelizing compiler, and evaluates their performance in SMP mode on RP1, a multicore processor for consumer electronics that integrates four SH-4A (SH-X3) cores. RP1 was developed by Renesas Technology Corp. and Hitachi, Ltd., based on the OSCAR standard multicore memory architecture, as part of the NEDO project "Research and Development of Multicore Technology for Real Time Consumer Electronics". The evaluation shows that the AAC audio encoder ran 3.34 times faster on 4 cores than on 1 core.
独立に周波数制御可能な 4320MIPS、SMP/AMP対応 4プロセッサLSIの開発

早瀬清, 吉田裕, 亀井達也, 芝原真一, 西井修, 服部俊洋, 長谷川淳, 高田雅士, 入江直彦, 内山邦男, 小高俊彦, 高田究, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-173-06 情報処理学会

発表年月： 2007年05月
情報家電用マルチコアSMP実行モードにおけるマルチグレイン並列処理

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 深津幸二, 宮本孝道, 白子準, 中野啓史, 木村啓二, 亀井達也, 服部俊洋, 長谷川淳, 佐藤真琴, 伊藤雅樹, 内山邦男, 小高俊彦, 笠原博徳

情報処理学会研究報告 2007-ARC-173-05 情報処理学会

発表年月： 2007年05月
マルチコアプロセッサ活用の勘所

木村啓二 [招待有り]

組み込みプロセッサ＆プラットホームワークショップ

発表年月： 2007年04月
マルチグレイン並列化コンパイラにおけるローカルメモリ管理手法

三浦剛, 田川友博, 村松裕介, 池見明紀, 中川正洋, 中野啓史, 白子準, 木村啓二, 笠原博徳

情報処理学会研究報告 2007-ARC-172/HPC-109-11 (HOKKE2007) 情報処理学会

発表年月： 2007年03月
マルチコア上でのマルチメディアアプリケーションの自動並列化

宮本孝道, 浅香沙織, 鎌倉信仁, 山内宏真, 間瀬正啓, 白子準, 中野啓史, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2007-171-13 情報処理学会

発表年月： 2007年01月
SMPサーバ及び組込み用マルチコア上でのOSCARマルチグレイン自動並列化コンパイラの性能

白子準, 田川友博, 三浦剛, 宮本孝道, 中野啓史, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2006年11月

開催年月：
2006年11月

Multicore processors are attracting much attention as a way to achieve scalable performance improvement, low power consumption, and good cost performance as semiconductor integration increases. Automatic parallelizing compilers play an important role in drawing out the full performance of such processors and in shortening software and hardware development periods. This paper describes the performance of the OSCAR multigrain automatic parallelizing compiler, which parallelizes whole programs using coarse grain task parallelization and near fine grain parallelization in addition to loop parallelization, on the latest SMP servers and on an embedded multicore processor. The OSCAR compiler determines the suitable number of processors and parallelization technique for each part of a program, and performs global cache memory optimization across multiple loops and coarse grain tasks. In an evaluation using all 10 SPEC CFP95 benchmark programs and 4 SPEC CFP2000 programs, the OSCAR compiler achieved an average of 2.74 times the automatic parallelization performance of the IBM XL Fortran compiler version 10.1 on an IBM p5 550Q Power5+ 8-processor server, and an average of 4.82 times that of the IBM XL Fortran compiler version 8.1 on an IBM pSeries690 Power4 24-processor server.
The OSCAR compiler was also applied to the NEC/ARM MPCore, a low-power embedded multicore integrating four ARMv6 processors, by supporting a subset of the OpenMP API and using the g77 compiler. In an evaluation using SPEC CFP95 benchmarks with data sets reduced for embedded use, it achieved speedups over sequential execution of 4.08 times for tomcatv, 3.90 for swim, 2.21 for su2cor, 3.53 for hydro2d, 3.85 for mgrid, 3.62 for applu, and 3.20 for turb3d.
OSCARコンパイラにおける制約付きCプログラムの自動並列化

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 深津幸二, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2006-170-01 （デザインガイア2006）情報処理学会

発表年月： 2006年11月
SMPサーバ及び組み込み用マルチコア上でのOSCARマルチグレイン自動並列化コンパイラの性能

白子準, 田川友博, 三浦剛, 宮本孝道, 中野啓史, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2006-170-02 （デザインガイア2006）情報処理学会

発表年月： 2006年11月
ソフトウェアもおもしろいこれからのプロセッサアーキテクチャ

木村啓二 [招待有り]

FIT2006イベント企画「これからが面白いプロセッサアーキテクチャ」（パネル）情報処理学会

発表年月： 2006年09月
OSCARマルチコア上でのローカルメモリ管理手法

中野啓史, 仁藤拓実, 丸山貴紀, 中川正洋, 鈴木裕貴, 内藤陽介, 宮本孝道, 和田康孝, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2006年07月

開催年月：
2006年07月

Multicore processors, which integrate multiple processor cores on one chip, are attracting attention as a next-generation processor architecture that can address the problems accompanying higher semiconductor integration: growing power consumption, slowing improvement in effective processor speed, and longer hardware and software development periods. On multicore processors too, the memory wall caused by the gap between processor and memory speeds is a serious problem, and effective use of fast memories close to the processor, such as caches and local memories, is key to improving effective performance. With this in mind, the authors have proposed the OSCAR multicore processor, which cooperates with an automatic multigrain parallelizing compiler to realize computer systems with high effective performance and good cost performance. In addition to centralized shared memory (CSM) accessible by all processor cores, the OSCAR multicore processor has local data memory (LDM) for processor-private data, dual-port distributed shared memory (DSM) used for synchronization and data transfer among processor cores, and a data transfer unit (DTU) that operates asynchronously with the processor cores and is intended to hide data transfer overhead. This paper describes an LDM/DSM management scheme for coarse grain task parallel processing with the OSCAR compiler.
In the performance evaluation, on 8 processor elements the scheme achieved speedups over sequential execution of about 7.1 times for an MP3 encoder, about 6.3 times for an MPEG2 encoder, and about 3.8 times for a JPEG2000 encoder.
マルチコアプロセッサにおけるコンパイラ制御低消費電力化手法

白子準, 吉田宗広, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2006) 情報処理学会

発表年月： 2006年05月
マルチコアプロセッサ上での粗粒度タスク並列処理におけるデータ転送オーバラップ方式

宮本孝道, 中川正洋, 浅野尚一郎, 内藤陽介, 仁藤拓実, 中野啓史, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC-2006-167, HPC-2006-105 情報処理学会

発表年月： 2006年02月
ヘテロジニアスチップマルチプロセッサにおける粗粒度タスクスタティックスケジューリング手法

和田康孝, 押山直人, 鈴木裕貴, 内藤陽介, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC-2006-166 情報処理学会

発表年月： 2006年01月
MP3エンコーダを用いたヘテロジニアスチップマルチプロセッサの性能評価

鹿野裕明, 鈴木裕貴, 和田康孝, 白子準, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC-2006-166 情報処理学会

発表年月： 2006年01月
マルチコアプロセッサ上でのデータローカライゼーション

中野啓文, 浅野尚一郎, 内藤陽介, 仁藤拓実, 田川友博, 宮本孝道, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-165-10 情報処理学会

発表年月： 2005年12月
ホモジニアスマルチコアにおけるコンパイラ制御低消費電力化手法

白子準, 押山直人, 和田康孝, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-164-10 (SWoPP2005) 情報処理学会

発表年月： 2005年08月
共有メモリ型マルチプロセッササーバー上におけるOSCAR マルチグレイン自動並列化コンパイラの性能評価

白子準, 宮本孝道, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2005年01月

開催年月：
2005年01月

As multiprocessor systems become widespread, high-performance automatic parallelizing compilers are increasingly important for improving effective performance, cost performance, and software productivity. However, loop parallelization, the most widely used technique, has largely matured, and substantial further gains require different parallelization methods. This paper describes the performance of the OSCAR multigrain automatic parallelizing compiler, which parallelizes whole programs by combining loop parallelization with coarse grain task parallelization, which exploits parallelism among coarse grain tasks such as basic blocks, loops, and subroutines, and near fine grain parallelization, which exploits statement-level parallelism inside basic blocks. The OSCAR compiler implements two key techniques: automatic determination of the suitable number of processors and parallelization granularity for each part of the program according to its shape and parallelism, and global cache memory optimization across multiple loops and coarse grain tasks. In an evaluation using SPEC95FP benchmarks, the OSCAR compiler achieved on average 4.78 times the automatic parallelization performance of the IBM XL Fortran compiler 8.1 on an IBM pSeries690 Power4 24-processor server, 2.40 times that of the Intel Fortran Itanium Compiler 7.1 on an SGI Altix3700 Itanium2 16-processor server, and 1.90 times that of the Sun Forte compiler 7.1 on a Sun Fire V880 UltraSPARC III Cu 8-processor server.
共有メモリ型マルチプロセッササーバ上におけるOSCARマルチグレイン自動並列化コンパイラの性能評価

白子準, 宮本孝道, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-161-5 (SHINING2005) 情報処理学会

発表年月： 2005年01月
配列間接アクセスを用いないコード生成法を用いた電子回路シミュレーション

黒田亮, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-161-1 (SHINING2005) 情報処理学会

発表年月： 2005年01月
OSCARチップマルチプロセッサ上でのMPEG2エンコードの並列処理

小高剛, 中野啓文, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2004-160-10 情報処理学会

発表年月： 2004年12月
OSCARチップマルチプロセッサ上でのデータ転送ユニットを用いたデータローカライゼーション

中野啓文, 内藤陽介, 鈴木貴久, 小高剛, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2004-159-20 (SWoPP2004) 情報処理学会

発表年月： 2004年08月
OSCARチップマルチプロセッサ上でのマルチグレイン並列性評価

和田康孝, 白子準, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2004-159-11 (SWoPP2004) 情報処理学会

発表年月： 2004年08月
OSCARチップマルチプロセッサ上でのデータ転送ユニットを用いたデータローカライゼーション

中野啓史, 内藤陽介, 鈴木貴久, 小高剛, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2004年07月

開催年月：
2004年07月

Chip multiprocessor (CMP) architectures, which integrate multiple processor cores on a single chip, are attracting much attention as a next-generation microprocessor architecture. Like conventional multiprocessor systems, CMPs still face the problem of making effective use of memory close to the processor cores, such as caches and local memory. To address this memory wall problem and to extract high parallelism for effective parallel processing, the authors have proposed OSCAR CMP, which works cooperatively with multigrain parallel processing to achieve high effective performance and good cost performance. In addition to centralized shared memory (CSM), OSCAR CMP has local data memory (LDM) for processor-private data, dual-port distributed shared memory (DSM) also used for synchronization and data transfer among processor cores, and a data transfer unit (DTU) that operates asynchronously with the processor cores. This paper describes a data localization scheme for coarse grain task parallel processing, in which the loops, subroutines, and basic blocks of a FORTRAN program are treated as coarse grain tasks. The scheme reorders coarse grain tasks during static scheduling, using liveness analysis of arrays, to exploit the data locality of the program. Data is transferred in burst mode using DTU transfer instructions generated automatically by the compiler.
OSCARチップマルチプロセッサ上でのマルチグレイン並列性評価

和田康孝, 白子準, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2004年07月

開催年月：
2004年07月

This paper evaluates multigrain parallelism on the compiler-cooperative OSCAR chip multiprocessor (OSCAR CMP). OSCAR CMP is designed to cooperate with the OSCAR multigrain parallelizing compiler to perform multigrain parallel processing, which hierarchically combines near fine grain parallel processing exploiting parallelism among statements, medium grain parallel processing exploiting loop iteration level parallelism, and coarse grain task parallel processing exploiting parallelism among loops, subroutines, and basic blocks. This cooperation between compiler and architecture enables effective use of on-chip resources and higher software productivity. The paper reports the multigrain parallel processing performance of SPEC CFP 95 benchmark programs on OSCAR CMP. An OSCAR CMP with 8 processor cores and centralized shared memory on one chip achieved speedups over sequential execution of 2.03 to 7.79 times at an assumed clock frequency of 400 MHz (embedded use) and 1.89 to 7.05 times at an assumed 2.8 GHz (high-end use).
IBM pSeries 690上でのOSCARマルチグレイン自動並列化コンパイラの性能評価

石坂一久, 白子準, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会第66回全国大会情報処理学会

発表年月： 2004年03月
データローカライゼーションを伴うMPEG2エンコーディングの並列処理

小高剛, 中野啓文, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2004-156-3 情報処理学会

発表年月： 2004年02月
IBM pSeries 690 SMPサーバー上でのOSCARマルチグレイン自動並列化コンパイラの性能評価

石坂一久, 白子準, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会全国大会講演論文集

発表年月： 2004年

開催年月：
2004年
SMPマシン上での粗粒度タスク並列処理におけるデータプリフェッチ手法

宮本孝道, 山口高弘, 飛田高雄, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2003-155-06 情報処理学会

発表年月： 2003年11月
OSCAR CMP 上でのスタティックスケジューリングを用いたデータローカライゼーション手法

中野啓文, 内藤陽介, 鈴木貴久, 小高剛, 石坂一久, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2003-154-14 (SWoPP2003) 情報処理学会

発表年月： 2003年08月
OSCAR チップマルチプロセッサ上でのMPEG2エンコーディングの並列処理

小高剛, 中野啓文, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2003-154-10 (SWoPP2003) 情報処理学会

発表年月： 2003年08月
チップマルチプロセッサ上での粗粒度タスク並列処理によるデータローカライゼーション

中野啓文, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2003-151-3 (SHINING2003) 情報処理学会

発表年月： 2003年01月
OSCAR 型シングルチップマルチプロセッサにおける動きベクトル探索処理

小高剛, 鈴木貴久, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2002-150-6 情報処理学会

発表年月： 2002年11月
OSCAR チップマルチプロセッサ上でのマルチグレイン並列処理

木村啓二, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2002-150-7 情報処理学会

発表年月： 2002年11月
SMPマシン上での粗粒度タスク並列処理オーバーヘッドの解析

和田康孝, 中野啓文, 木村啓二, 小幡元樹, 笠原博徳

情報処理学会研究報告 ARC2002-148-3 情報処理学会

発表年月： 2002年05月
シングルチップマルチプロセッサにおける JPEGエンコーディングのマルチグレイン並列処理（共著）

小高剛, 内田貴之, 木村啓二, 笠原博徳

情報処理学会並列処理シンポジウム(JSPP2002) 情報処理学会

発表年月： 2002年05月
OSCAR型シングルチップマルチプロセッサ上でのJPEGエンコーディングプログラムのマルチグレイン並列処理

小高剛, 内田貴之, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2002-146-4 情報処理学会

発表年月： 2002年02月
シングルチップマルチプロセッサにおけるマルチグレイン並列処理

内田貴之, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2002-146-3 情報処理学会

発表年月： 2002年02月
キャッシュ最適化を考慮したマルチプロセッサシステム上での粗粒度タスクスタティックスケジューリング手法

中野啓文, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2001-144-12 情報処理学会

発表年月： 2001年08月
シングルチップマルチプロセッサ上でのマルチメディアアプリケーションの近細粒度並列処理

小高剛, 宮下直久, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2001年07月

開催年月：
2001年07月

With the recent growth of multimedia content, there is demand for low-cost, low-power processors that can efficiently handle media applications such as JPEG and MPEG. Single chip multiprocessors with multiple simple processor cores are attracting attention as processors that meet these requirements. To examine the usefulness of single chip multiprocessors for media applications, this paper takes, as a first step, a JPEG encoding program for image compression. Noting that its basic unit of processing is ultimately a block of $8\times8$ pixels, near fine grain parallel processing is applied to the processing of each $8\times8$ pixel block, and performance is evaluated on an OSCAR type single chip multiprocessor. The evaluation shows that a single chip multiprocessor with four simple single-issue processor cores is about 2.32 times faster than a system with one four-issue superscalar processor core equivalent to the UltraSPARC-II.
キャッシュ最適化を考慮したマルチプロセッサシステム上での粗粒度タスクスタティックスケジューリング手法

中野啓史, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2001年07月

開催年月：
2001年07月

As the gap between processor speed and memory access speed widens, cache optimization based on data locality is becoming increasingly important. For parallel processing on multiprocessor systems, conventional loop-only parallelization is approaching its limits, so multigrain parallelization, including the combined use of coarse grain task parallel processing, is important for further performance gains. Coarse grain task parallel processing divides a Fortran program into three kinds of coarse grain tasks (loops, subroutines, and basic blocks) and extracts parallelism by analyzing control and data dependences among them. This paper describes a coarse grain task static scheduling scheme for such processing that performs cache optimization by taking into account the amount of data shared between coarse grain tasks. The scheme was implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP machine. For swim and tomcatv from SPEC 95fp, it achieved speedups of 4.56 times and 2.37 times, respectively, over automatic parallelization by Sun Forte HPC 6 update 1 on 4 processors, confirming the effectiveness of the scheme.
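
One way to picture scheduling that accounts for data shared between coarse grain tasks is a greedy scheduler that places each task on the processor whose earlier tasks touched the most of the same arrays. The sketch below is a simplified illustration under that assumption and is not the algorithm of the paper: it ignores dependences between tasks, and the function name `schedule`, the task names, costs, and array sets are all hypothetical.

```python
def schedule(tasks, n_procs):
    """Greedy static scheduling with a cache-affinity preference.

    tasks: list of (name, cost, set_of_arrays), assumed to be given in a
    dependence-safe order. Each task goes to the processor that shares
    the most arrays with it; ties are broken by the smallest load.
    """
    load = [0] * n_procs
    cached = [set() for _ in range(n_procs)]   # arrays touched per processor
    assignment = {}
    for name, cost, arrays in tasks:
        best = min(
            range(n_procs),
            key=lambda p: (-len(cached[p] & arrays), load[p], p),
        )
        assignment[name] = best
        load[best] += cost
        cached[best] |= arrays
    return assignment, load

tasks = [
    ("loop1", 4, {"u", "v"}),
    ("loop2", 4, {"p"}),
    ("loop3", 3, {"u", "v"}),   # reuses the arrays of loop1
    ("loop4", 3, {"p", "q"}),   # reuses the array of loop2
]
assignment, load = schedule(tasks, n_procs=2)
print(assignment, load)
```

A pure affinity rule like this can overload one processor when most tasks share the same data, so a practical scheduler would weigh shared data volume against load balance and critical path length.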
「分散オブジェクトにより実装された分散型アプリケーションサーバ」

山下関哉, 木村啓二

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 2001年05月

開催年月：
2001年05月

Since the Internet became widespread, and especially with the rapid growth in the use of WWW services, server software known as application servers has attracted attention as a way to make WWW services more efficient to develop and maintain and to distribute the load concentrated on WWW sites. This study proposes a distributed application server, which applies to application servers a distributed software architecture in which software is composed of distributed objects. The goal is to develop an application server with excellent scalability and load tolerance. This paper reports the proposal, the design policy, and the results of a preliminary evaluation of the distributed application server.
マルチプロセッサシステム上でのキャッシュ最適化を考慮した粗粒度タスクスタティックスケジューリング手法（共著）

中野啓文, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会第62回全国大会情報処理学会

発表年月： 2001年03月
マルチメディアアプリケーションのシングルチップマルチプロセッサ上での近細粒度並列処理

小高剛, 木村啓二, 宮下直久, 笠原博徳

情報処理学会第62回全国大会情報処理学会

発表年月： 2001年03月
近細粒度並列処理に適したシングルチップマルチプロセッサのメモリアーキテクチャの評価

松元信介, 木村啓二, 笠原博徳

情報処理学会第62回全国大会情報処理学会

発表年月： 2001年03月
マルチグレイン並列処理用シングルチップマルチプロセッサにおけるデータ転送ユニットの検討

宮下直久, 木村啓二, 小高剛, 笠原博徳

情報処理学会第62回全国大会情報処理学会

発表年月： 2001年03月
近細粒度並列処理用シングルチップマルチプロセッサにおけるプロセッサコアの構成

木村啓二, 内田貴之, 加藤孝幸, 笠原博徳

情報処理学会研究報告 ARC139-16(SWoPP2000) 情報処理学会

発表年月： 2000年08月
シングルチップマルチプロセッサの近細粒度並列処理に対する性能評価

加藤考幸, 尾形航, 木村啓二, 内田貴之, 笠原博徳

情報処理学会第60回全国大会情報処理学会

発表年月： 2000年03月
マルチグレイン並列化コンパイラのメモリアクセスアナライザ

岩井啓輔, 小幡元樹, 木村啓二, 天野英晴, 笠原博徳

電子情報通信学会技術報告 CPSY99-62 電子情報通信学会

発表年月： 1999年08月
シングルチップマルチプロセッサ上での近細粒度並列処理の性能評価

木村啓二, 間中邦之, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会研究報告 ARC134-4 情報処理学会

発表年月： 1999年08月
最早実行可能条件解析を用いたキャッシュ最適化手法

稲石大祐, 木村啓二, 藤本謙作, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会第58回全国大会情報処理学会

発表年月： 1999年03月
OSCAR FORTRAN Compilerを用いたマルチグレイン並列性の評価

小幡元樹, 松井巌徹, 松崎秀則, 木村啓二, 稲石大祐, 宇治川泰史, 山本晃正, 岡本雅巳, 笠原博徳

情報処理学会研究報告計算機アーキテクチャ（ARC）一般社団法人情報処理学会

発表年月： 1998年08月

開催年月：
1998年08月

Supercomputers now reach peak performance of several TFLOPS, and this is expected to keep growing, but their poor cost performance and difficulty of use prevent their market from expanding. In microprocessors, the limits of the instruction level parallelism exploited by superscalar and VLIW architectures are becoming apparent, and the single chip multiprocessor (SCM) is attracting attention as a next-generation processor. To raise the effective performance, that is the cost performance and ease of use, of SCMs, server machines, and supercomputers, the authors have proposed a multigrain automatic parallelizing compilation scheme. Multigrain parallel processing extracts as much as possible of the parallelism inherent in a program: fine grain parallelism at the instruction or statement level, medium grain parallelism at the loop iteration level, and coarse grain parallelism at the subroutine, loop, and basic block level. Using ARC2D, a two-dimensional fluid analysis program from the Perfect Benchmark, as an example, this paper shows the effectiveness of exploiting multigrain parallelism with the OSCAR multigrain FORTRAN parallelizing compiler.
最早実行可能条件解析を用いたキャッシュ利用の最適化

稲石大祐, 木村啓二, 藤本謙作, 尾形航, 笠原博徳

情報処理学会研究報告 ARC98-130-6 情報処理学会

発表年月： 1998年08月
シングルチップマルチプロセッサ上でのマルチグレイン並列処理

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会研究報告 ARC98-130-5 情報処理学会

発表年月： 1998年08月
マルチグレイン並列化コンパイラとそのアーキテクチャ支援

笠原博徳, 尾形航, 木村啓二, 小幡元樹, 飛田高雄, 稲石大祐

電子情報通信学会技術報告 ICD98-10 CPSY98-10 FTS98-10 電子情報通信学会

発表年月： 1998年04月
科学技術計算プログラムにおけるマルチグレイン並列性の評価

小幡元樹, 松井巌徹, 松崎秀則, 木村啓二, 稲石大祐, 宇治川泰史, 山本晃正, 岡本雅巳, 笠原博徳

全国大会講演論文集

発表年月： 1998年03月

開催年月：
1998年03月
FPGAを用いたマルチプロセッサシステムテストベッドの実装

尾形航, 山本泰平, 水尾学, 木村啓二, 笠原博徳

情報処理学会研究報告ARC128-14 HPC70-14 情報処理学会

発表年月： 1998年03月
マルチグレイン並列処理用シングルチップマルチプロセッサアーキテクチャ

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会第56回全国大会情報処理学会

発表年月： 1998年03月
マクロタスク最早実行可能条件解析を用いたキャッシュ最適化手法

稲石大祐, 木村啓二, 尾形航, 岡本雅巳, 笠原博徳

情報処理学会第56回全国大会情報処理学会

発表年月： 1998年03月
マルチグレイン並列処理用マルチプロセッサシステム

岩井啓輔, 藤原崇, 森村知弘, 天野英晴, 木村啓二, 尾形航, 笠原博徳

電子情報通信学会研究報告 CPSY97-46 電子情報通信学会

発表年月： 1997年08月
処理とデータ転送のオーバーラッピングを考慮したダイナミックスケジューリングアルゴリズム

木村啓二, 橋本茂, 古郷誠, 尾形航, 笠原博徳

電子情報通信学会研究報告 CPSY97-40 電子情報通信学会

発表年月： 1997年08月

研究シーズ

共同研究・競争的資金等の研究課題

信頼できる実行環境の利便性向上を低実行時オーバーヘッドで実現する方式に関する研究

日本学術振興会科学研究費助成事業

研究期間:

2023年04月

-

2026年03月

木村啓二

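In fiscal 2023, following the research plan, we worked on three topics: an Enclave cache that shortens the startup time of programs running in a TEE, a method that makes data exchange between the inside and outside of a TEE more flexible and convenient, and accelerator Enclaves. In addition, we studied faster secure boot, which is indispensable for guaranteeing the trustworthiness of a TEE, and a method for passing large-scale data to a TEE more efficiently.
For the Enclave cache, we implemented the basic scheme on Keystone, a representative TEE implementation for RISC-V, and evaluated it on a HiFive Unmatched, a real RISC-V machine. The evaluation confirmed a speedup of 40 to 50 times over the configuration without the cache. For data transfer, we proposed an exchange method based on a memory pool, implemented it for vector and list, and evaluated it on Intel SGX; transfer time was about 19 times faster than serializing the data structures. For accelerator Enclaves, we worked out the basic design policy and built an experimental platform on Chipyard, an open-source RISC-V SoC implementation. For secure boot, newly taken up this year, we proposed a scheme that boots a Linux system capable of running Keystone on a RISC-V multicore using four cores in parallel. Evaluated on the HiFive Unmatched, the scheme sped up verification, the key step of secure boot, by 4.51 times. Finally, for large-scale data exchange in RISC-V Keystone, we confirmed a 2.3-fold performance improvement on the HiFive Unmatched by extending how the memory protection mechanism used to realize RISC-V TEEs is operated.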
深層学習フレームワークでの利用を目指した完全準同型暗号による行列計算に関する研究

日本学術振興会科学研究費助成事業

研究期間:

2018年06月

-

2020年03月

木村啓二, 和田康孝

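This research explores techniques for accelerating matrix computation under homomorphic encryption, which allows computation on data that remains encrypted, with the aim of speeding up secure deep learning. In fiscal 2018, the first year, we surveyed HElib, a publicly available homomorphic encryption library, and publicly available deep learning models. In fiscal 2019, the final year, we carried out (1) acceleration of the bottlenecks in HElib, (2) development of a data transfer mechanism, and (3) an investigation of the trade-off between matrix size reduction and inference accuracy. For (1), we first identified the most time-consuming parts of matrix computation with HElib: the generation of the key-switching matrices needed during computation, and the ciphertext operations. To each we applied a reduction of the bit width required for the operations and parallel execution through SIMD. Evaluated on a server with Intel Xeon processors, the proposed method improved performance by 3.4 times for key-switching matrix generation and, for ciphertext operations, by 5.53 times for addition and 3.73 times for multiplication. For (2), we developed a data transfer mechanism that efficiently handles the indirect accesses required by sparse matrix computation. We implemented a multicore with the proposed data transfer mechanism and a vector accelerator on an FPGA and evaluated it first on ordinary sparse matrix-vector multiplication, obtaining a 17-fold speedup over CPU-driven transfers without the proposed mechanism. For (3), we proposed and examined the use of several small neural networks in parallel as a way to reduce matrix size. By partitioning a neural network, the method shrinks each individual network while preserving recognition accuracy. We implemented the method on an FPGA and ran inference with eight of the resulting networks in parallel; compared with a single network, recognition accuracy improved by about 8 points and recognition speed by roughly 54 percent.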
フラグによりCPUとアクセラレータが連携するヘテロジニアスマルチコアに関する研究

日本学術振興会科学研究費助成事業

研究期間:

2015年04月

-

2018年03月

木村啓二

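In this research we developed a compiler and an architecture for heterogeneous multicores that allow CPUs, accelerators, and data transfer units to cooperate flexibly. One of the main results is a compilation flow that includes an LLVM back-end compiler for the accelerator; programs compiled with it, evaluated on the FPGA testbed we developed, ran 24.91 times faster than execution on a single CPU.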
大規模非線形時空間パターン制御の実時間最適化アルゴリズムと応用

研究期間:

2012年04月

-

2016年03月

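We studied algorithms that solve nonlinear optimal control problems at high speed so that even large and complex systems can be controlled optimally, and examined applications in a variety of fields. Results include more efficient optimization for controlling large-scale systems, a method for tuning the control response in a transparent way, and a tool that automates the programming of the algorithms. We demonstrated the effectiveness of the algorithms on a wide range of problems, including control of temperature and flow velocity in thermal fluids, suppression of product variation in steel processes, water quality control in advanced sewage treatment plants, demand guidance in smart grids, and control of power output and motion in floating offshore wind turbines.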
プログラムの大域的構造を利用したメニーコア・シミュレーションの高速化に関する研究

日本学術振興会科学研究費助成事業

研究期間:

2011年04月

-

2014年03月

木村啓二

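This research proposes a method for simulating multicore and many-core architectures both quickly and accurately by switching the simulation precision as needed, on the premise that a parallelized application runs on the multicore. Evaluated with four applications of differing characteristics on an assumed 16-core architecture, the method achieved a maximum speedup of 443 times with an error of 0.52% and an average speedup of 218 times with an error of 2.76%.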
ソフトウェア協調型チップマルチプロセッサにおけるデータ利用最適化に関する研究

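This fiscal year we continued the previous year's work on data locality optimization and data transfer optimization for software-cooperative chip multiprocessors. The approach focuses on groups of tasks that share data: these tasks are divided with the sizes of the core-local caches and local memories in mind and assigned to processor cores so that caches and local memory are used effectively. The remaining data transfers are overlapped with the tasks assigned to the cores to hide data transfer overhead. Specifically, targeting multimedia applications such as MPEG2 encoding and JPEG2000 encoding, we developed and evaluated software-hardware cooperation techniques that automatically apply the data locality and data transfer optimizations to these applications so that they run efficiently on a chip multiprocessor. In the evaluation, MPEG2 encoding in particular achieved a speedup over sequential execution of 7.97 times on 8 processors at a 400 MHz clock and 6.54 times on 8 processors at a 2.8 GHz clock. For the MPEG2 encoding program, these optimizations are performed almost entirely automatically by the automatic parallelizing compiler. Applying the method automatically to more applications, and so widening the set of target applications, remains future work.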

産業財産権

並列化コンパイラ、並列化コンパイル装置、及び並列プログラムの生成方法

6600888

笠原博徳, 木村啓二, 梅田弾, 見神広紀

特許権
マルチプロセッサシステム

6335253

笠原博徳, 木村啓二

特許権
マルチプロセッサシステム

笠原博徳, 木村啓二

特許権
並列化コンパイル方法、並列化コンパイラ、並列化コンパイル装置、及び、車載装置

6018022

笠原博徳, 木村啓二, 林明宏, 見神広紀, 梅田弾, 金羽木洋平

特許権
並列性の抽出方法及びプログラムの作成方法

6319880

木村啓二, 林明宏, 笠原博徳, 見神広紀, 金羽木洋平, 梅田弾

特許権
マルチプロセッサシステムおよびマルチプロセッサシステムの同期方法

笠原博徳, 木村啓二

特許権
プロセッサシステム及びアクセラレータ

6103647

木村啓二, 笠原博徳

特許権
プロセッサによって実行可能なコードの生成方法、記憶領域の管理方法及びコード生成プログラム

5283128

笠原博徳, 木村啓二, 間瀬正啓

特許権
マルチプロセッサ

笠原博徳, 木村啓二

特許権
マルチプロセッサシステムおよびマルチプロセッサシステムの同期方法

笠原博徳, 木村啓二

特許権
マルチプロセッサ

4304347

笠原博徳, 木村啓二

特許権
メモリ管理方法、情報処理装置、プログラムの作成方法及びプログラム

5224498

笠原博徳, 木村啓二, 中野啓史, 仁藤拓実, 丸山貴紀, 三浦剛, 田川友博

特許権
マルチプロセッサ及びマルチプロセッサシステム

4784842

笠原博徳, 木村啓二

特許権
プロセッサ及びデータ転送ユニット

4476267

笠原博徳, 木村啓二

特許権
ヘテロジニアスマルチプロセッサ向けグローバルコンパイラ

4784827

笠原博徳, 木村啓二, 鹿野裕明

特許権
ヘテロジニアス・マルチプロセッサシステムの制御方法及びマルチグレイン並列化コンパイラ

4936517

笠原博徳, 木村啓二, 白子準, 和田康孝, 伊藤雅樹, 鹿野裕明

特許権
マルチプロセッサシステム及びマルチグレイン並列化コンパイラ

笠原博徳, 木村啓二, 白子準, 伊藤雅樹, 鹿野裕明

特許権
マルチプロセッサシステム及びマルチグレイン並列化コンパイラ

4082706

笠原博徳, 木村啓二, 白子準, 伊藤雅樹, 鹿野裕明

特許権
マルチプロセッサ

4784792

笠原博徳, 木村啓二

特許権

現在担当している科目

IoTシステム設計

基幹理工学部

2026年春学期
先端プロセッサ技術

基幹理工学部

2026年春学期
言語処理系

基幹理工学部

2026年春学期
卒業論文Ｂ（春学期）

基幹理工学部

2026年春学期
プロジェクト研究Ｂ

基幹理工学部

2026年秋学期
プロジェクト研究Ａ

基幹理工学部

2026年春学期
卒業論文Ａ　（集中）

基幹理工学部

2026年集中（春・秋学期）
卒業論文Ｂ

基幹理工学部

2026年秋学期
卒業論文Ａ（秋学期）

基幹理工学部

2026年秋学期
言語処理系　　【前年度成績S評価者用】

基幹理工学部

2026年春学期
卒業論文Ａ

基幹理工学部

2026年春学期
情報理工学実験Ａ　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
卒業論文Ｂ　18前再　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
情報理工学実験Ａ

基幹理工学部

2026年秋学期
卒業論文Ｂ（春学期）　18前再

基幹理工学部

2026年春学期
卒業論文Ａ　18前再　【前年度成績S評価者用】

基幹理工学部

2026年春学期
卒業論文Ａ（秋学期）　18前再

基幹理工学部

2026年秋学期
卒業論文Ａ　18前再

基幹理工学部

2026年春学期
高性能計算プログラミング　【前年度成績S評価者用】

基幹理工学部

2026年夏クォーター
高性能計算プログラミング

基幹理工学部

2026年夏クォーター
卒業論文Ｂ　18前再

基幹理工学部

2026年秋学期
情報理工学実験Ｂ【前年度成績S評価者用】

基幹理工学部

2026年春学期
コンピュータアーキテクチャＢ【前年度成績S評価者用】

基幹理工学部

2026年秋学期
コンピュータアーキテクチャＢ

基幹理工学部

2026年秋学期
コンピュータアーキテクチャＡ　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
コンピュータアーキテクチャＡ

基幹理工学部

2026年秋学期
情報理工学実験Ｂ

基幹理工学部

2026年春学期
IoTシステム設計

基幹理工学部

2026年春学期
先端プロセッサ技術

基幹理工学部

2026年春学期
情報通信実験Ｂ

基幹理工学部

2026年春学期
プロジェクト研究Ｂ

基幹理工学部

2026年秋学期
高性能計算プログラミング

基幹理工学部

2026年夏クォーター
情報通信実験Ｂ【前年度成績S評価者用】

基幹理工学部

2026年春学期
IoTシステム設計

基幹理工学部

2026年春学期
卒業論文Ａ（秋学期）

基幹理工学部

2026年秋学期
プロジェクト研究Ａ

基幹理工学部

2026年春学期
卒業論文Ｂ　18前再　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
卒業論文Ａ

基幹理工学部

2026年春学期
卒業論文Ａ　18前再　【前年度成績S評価者用】

基幹理工学部

2026年春学期
卒業論文Ａ（秋学期）　18前再

基幹理工学部

2026年秋学期
卒業論文Ａ　18前再

基幹理工学部

2026年春学期
卒業論文Ｂ（春学期）　18前再

基幹理工学部

2026年春学期
コンピュータアーキテクチャＢ

基幹理工学部

2026年秋学期
言語処理系

基幹理工学部

2026年春学期
卒業論文Ｂ　18前再

基幹理工学部

2026年秋学期
情報通信実験Ａ

基幹理工学部

2026年秋学期
情報通信実験Ａ　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
コンピュータアーキテクチャＡ　【前年度成績S評価者用】

基幹理工学部

2026年秋学期
コンピュータアーキテクチャＡ

基幹理工学部

2026年秋学期
Project Research Spring

基幹理工学部

2026年春学期
Project Research Fall

基幹理工学部

2026年秋学期
Introduction to Computers and Networks

基幹理工学部

2026年春学期
Computer Science and Communications Engineering Laboratory B

基幹理工学部

2026年春学期
Advanced Processor Architecture Technology

基幹理工学部

2026年春学期
Computer Architecture

基幹理工学部

2026年秋学期
Graduation Thesis A (Spring) [S Grade]

基幹理工学部

2026年春学期
Graduation Thesis B (Spring) [S Grade]

基幹理工学部

2026年春学期
Graduation Thesis A (Fall)

基幹理工学部

2026年秋学期
Graduation Thesis A (Fall) [S Grade]

基幹理工学部

2026年秋学期
Graduation Thesis B (Fall)

基幹理工学部

2026年秋学期
Graduation Thesis A　(Spring)[S Grade]【For students enrolled before 2022】

基幹理工学部

2026年春学期
Graduation Thesis A (Spring)

基幹理工学部

2026年春学期
Graduation Thesis B (Spring)

基幹理工学部

2026年春学期
Computer Science and Communications Engineering Laboratory A

基幹理工学部

2026年秋学期
Computer Science and Communications Engineering Laboratory A [S Grade]

基幹理工学部

2026年秋学期
卒業論文Ｂ

基幹理工学部

2026年秋学期
卒業論文Ｂ（春学期）

基幹理工学部

2026年春学期
卒業論文Ａ　（集中）

基幹理工学部

2026年集中（春・秋学期）
Graduation Thesis A　(Fall)[S Grade]【For students enrolled before 2022】

基幹理工学部

2026年秋学期
Graduation Thesis A　(Spring)【For students enrolled before 2022】

基幹理工学部

2026年春学期
Graduation Thesis B (Fall) [S Grade]

基幹理工学部

2026年秋学期
Graduation Thesis A　(Fall)【For students enrolled before 2022】

基幹理工学部

2026年秋学期
IoTシステム設計

大学院基幹理工学研究科

2026年春学期
Master's Thesis (Department of Computer Science and Communications Engineering)

大学院基幹理工学研究科

2026年通年
修士論文（情報・通信）

大学院基幹理工学研究科

2026年通年
先端プロセッサ構成演習D

大学院基幹理工学研究科

2026年秋学期
先端プロセッサ構成演習C

大学院基幹理工学研究科

2026年春学期
先端プロセッサ構成演習B

大学院基幹理工学研究科

2026年秋学期
先端プロセッサ構成演習A

大学院基幹理工学研究科

2026年春学期
情報理工・情報通信特別実験B

大学院基幹理工学研究科

2026年秋学期
先端プロセッサ技術

大学院基幹理工学研究科

2026年春学期
先端プロセッサ構成研究

大学院基幹理工学研究科

2026年通年
情報理工・情報通信特別演習Ａ

大学院基幹理工学研究科

2026年春学期
先端プロセッサ構成研究

大学院基幹理工学研究科

2026年通年
Seminar on Advanced Processor Architecture A

大学院基幹理工学研究科

2026年春学期
Seminar on Advanced Processor Architecture D

大学院基幹理工学研究科

2026年秋学期
Seminar on Advanced Processor Architecture C

大学院基幹理工学研究科

2026年春学期
Seminar on Advanced Processor Architecture B

大学院基幹理工学研究科

2026年秋学期
Special Laboratory B in Computer Science and Communications Engineering

大学院基幹理工学研究科

2026年秋学期
Advanced Processor Architecture

大学院基幹理工学研究科

2026年春学期
Special Laboratory A in Computer Science and Communications Engineering

大学院基幹理工学研究科

2026年春学期
Research on Advanced Processor Architecture

大学院基幹理工学研究科

2026年通年
情報理工・情報通信特別実験A

大学院基幹理工学研究科

2026年春学期
IoTシステム設計

大学院創造理工学研究科

2026年春学期
情報理工・情報通信特別演習Ｂ

大学院基幹理工学研究科

2026年秋学期
IoTシステム設計

大学院先進理工学研究科

2026年春学期

特別研究期間制度（学内資金）

新しいメモリ階層を考慮したソフトウェア・ハードウェアの構成法に関する研究

2017年08月

-

2018年02月

アメリカ North Carolina State University

他学部・他研究科等兼任情報

理工学術院大学院基幹理工学研究科

学内研究所・附属機関兼任歴

2024年

-

2026年

理工学術院総合研究所兼任研究員
2024年

-

2026年

ミッションオリエンテッド研究教育センター兼任センター員

特定課題制度（学内資金）

深層学習フレームワークでの利用を目指した完全準同型暗号による行列計算に関する研究

2020年

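In fiscal 2020 we used SEAL from Microsoft Research as the base software for the research, measured the time taken by each of the operations that make up matrix multiplication with it, and investigated their overhead and parallelism. First, we parallelized the matrix multiplication with OpenMP and ran it on an 8-core Intel Xeon W-2145 (3.70 GHz), obtaining about a 6-fold performance improvement over single-core execution. We then tried to accelerate the operations that make up matrix multiplication under homomorphic encryption with SIMD instructions (AVX-512). By narrowing the basic data type used inside the library from 64 bits to 32 bits and widening the SIMD operations accordingly, the key operations of the matrix computation ran 3.48 times faster than the original SIMD implementation.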
フラグによりCPUとアクセラレータが連係するヘテロジニアスマルチコアに関する研究

2014年

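This research develops techniques for reducing the overhead of accelerator control and data transfer on heterogeneous multicores equipped with accelerators. Specifically, we develop task partitioning and scheduling methods that hide this overhead by running the CPU, the data transfer unit (DTU), and the accelerator concurrently, and implement them in an automatic parallelizing compiler. This fiscal year we first fixed the basic specification of the accelerator the research assumes. We then developed a compiler module for this accelerator and an architecture simulator for it, establishing the basic evaluation environment for the research.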
コンパイラ解析情報と実機実行情報を利用したマルチコアシミュレーション高速化の研究

2009年

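In computer architecture research, software-based architecture simulation plays a major role in evaluating systems of many different configurations. A software simulator, however, takes thousands of times longer to run a program than real hardware. Evaluation times this long will be a serious obstacle to future many-core research and development. This research studies techniques for accelerating software simulation of multicores and many-cores to overcome the problem. For simulation acceleration in parallel architecture research in particular, earlier proposals assigned the cores of the virtual multicore or multiprocessor under study to the cores of the real multiprocessor running the simulator, but the parallel processing overhead on the real machine is large and no practical system has been realized so far. The distinguishing feature of this research is that it accelerates multicore and many-core software simulation using analysis information from a parallelizing compiler, such as loop structure and parallelization information, together with execution information from running the target application on real hardware. This information is used to identify which parts must be simulated in detail and which need not be. Using this additional information, which conventional acceleration techniques have not exploited, yields accurate performance figures at minimal execution cost. This fiscal year we ran preliminary experiments to examine the basic applicability of the technique. Specifically, we varied the core count of two multicore architectures up to 32 cores, varied the iteration count of each benchmark program's main loop, and examined whether the proposed performance estimation method could reproduce the performance at the original iteration count. The benchmarks were tomcatv and swim from SPEC95 and an AAC encoding program, a standard in audio compression. For every combination of architecture, core count, and benchmark, performance over the original several hundred iterations could be predicted from the performance of only a few iterations with an error of at most about 2%. We plan next to widen the set of applications and to automate the system.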
ソフトウェア協調型チップマルチプロセッサにおけるメモリ最適化に関する研究

2004年

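In this research we first selected the multigrain parallelizing compiler and the chip multiprocessor architecture platform that underpin data locality optimization and data transfer optimization, and prepared the evaluation infrastructure. The compiler is built around the OSCAR multigrain parallelizing compiler developed in the Advanced Parallelizing Compiler project of the Ministry of Economy, Trade and Industry's Millennium Project IT21. The chip multiprocessor architecture is the OSCAR-type chip multiprocessor, in which processing elements (PEs), each with a simple processor core, local data memory, two-port distributed shared memory, and a data transfer unit, are connected by an inter-PE network. We added a back end (code generator) for the OSCAR-type chip multiprocessor to the OSCAR multigrain parallelizing compiler. As a first step in developing the data locality and data transfer optimizations, we chose as target applications Tomcatv and Swim from the SPECfp95 benchmarks, typical examples of scientific computation. For these programs, tasks (the units of parallel processing) and data are scheduled onto PEs with both data locality and parallelism taken into account, and transfers between shared memory and processor-local memory (local data memory and distributed shared memory) are handled by the data transfer unit, which operates asynchronously with the processor; this makes both the use of data locality and data transfer more efficient. Evaluated on 8 PEs, the method achieved speedups of 1.56 times for Tomcatv and 1.38 times for Swim over the case without data locality optimization.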