Updated 2021/12/08

Kana Shimizu
清水 佳奈
Affiliation
Faculty of Science and Engineering, School of Fundamental Science and Engineering
Job title
Professor

Concurrent Post

  • Faculty of Science and Engineering   Graduate School of Fundamental Science and Engineering

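On-campus Research Institutes

  • 2020
    -
    2022

    Waseda Research Institute for Science and Engineering   Concurrent Researcher

  • 2020
    -
    2022

    Global Information and Telecommunication Institute   Concurrent Researcher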

Education

  • 2003.04
    -
    2006.03

    Waseda University   Graduate School of Science and Engineering   Major in Information Networks

  • 2001.04
    -
    2003.03

    Waseda University   Graduate School of Science and Engineering   Major in Information Science

  • 1997.04
    -
    2001.03

    Waseda University   School of Science and Engineering   Department of Information and Computer Science

Degree

  • Waseda University   Doctor of Engineering

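Research Experience

  • 2018.04
    -
    present

    Waseda University   Faculty of Science and Engineering   Professor

  • 2016.04
    -
    2018.03

    Waseda University   Faculty of Science and Engineering   Associate Professor

  • 2013.03
    -
    2016.03

    National Institute of Advanced Industrial Science and Technology (AIST)   Biotechnology Research Institute for Drug Discovery / Genome Information Research Center / Computational Biology Research Center   Senior Researcher

  • 2013.12
    -
    2015.04

    The Sloan-Kettering Institute at Memorial Sloan-Kettering Cancer Center   Visiting Investigator

  • 2009.04
    -
    2013.02

    National Institute of Advanced Industrial Science and Technology (AIST)   Computational Biology Research Center   Researcher

  • 2006.11
    -
    2009.03

    National Institute of Advanced Industrial Science and Technology (AIST)   Computational Biology Research Center (生命情報科学研究センター / 生命情報工学研究センター)   AIST Postdoctoral Researcher

  • 2006.01
    -
    2006.10

    National Institute of Advanced Industrial Science and Technology (AIST)   Computational Biology Research Center   Technical Staff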

Research Keywords

  • Algorithms

  • Privacy-preserving data mining

  • Data mining

  • Genome information analysis

  • Computational biology

  • Bioinformatics

Papers

  • Discovery of cryoprotective activity in human genome-derived intrinsically disordered proteins

    Naoki Matsuo, Natsuko Goda, Kana Shimizu, Satoshi Fukuchi, Motonori Ota, Hidekazu Hiroaki

    International Journal of Molecular Sciences   19 ( 2 ) E401  2018.02  [Refereed]

    Intrinsically disordered proteins (IDPs) are an emerging phenomenon. They may have a high degree of flexibility in their polypeptide chains, which lack a stable 3D structure. Although several biological functions of IDPs have been proposed, their general function is not known. The only finding related to their function is the genetically conserved YSK2 motif present in plant dehydrins. These proteins were shown to be IDPs with the YSK2 motif serving as a core region for the dehydrins’ cryoprotective activity. Here we examined the cryoprotective activity of randomly selected IDPs toward the model enzyme lactate dehydrogenase (LDH). All five IDPs that were examined were in the range of 35–45 amino acid residues in length and were equally potent at a concentration of 50 µg/mL, whereas folded proteins, the PSD-95/Dlg/ZO-1 (PDZ) domain, and lysozymes had no potency. We further examined their cryoprotective activity toward glutathione S-transferase as an example of the other enzyme, and toward enhanced green fluorescent protein as a non-enzyme protein example. We further examined the lyophilization protective activity of the peptides toward LDH, which revealed that some IDPs showed a higher activity than that of bovine serum albumin (BSA). Based on these observations, we propose that cryoprotection is a general feature of IDPs. Our findings may become a clue to various industrial applications of IDPs in the future.

  • Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and A Fast Implementation in WebAssembly

    Attrapadung, Nuttapong, Hanaoka, Goichiro, Mitsunari, Shigeo, Sakai, Yusuke, Shimizu, Kana, Teruya, Tadanori

    Proceedings of the ACM Asia Conference on Computer and Communications Security 2018 (AsiaCCS 2018)     685 - 697  2018  [Refereed]

  • Secure Wavelet Matrix: Alphabet-Friendly Privacy-Preserving String Search for Bioinformatics

    Sudo, Hiroki, Jimbo, Masanobu, Nuida, Koji, Shimizu, Kana

    IEEE/ACM Transactions on Computational Biology and Bioinformatics   to appear  2018  [Refereed]

  • Secure Division Protocol and Applications to Privacy-preserving Chi-squared Tests

    Morita, Hiraku, Attrapadung, Nuttapong, Ohata, Satsuya, Nuida, Koji, Yamada, Shota, Shimizu, Kana, Hanaoka, Goichiro, Asai, Kiyoshi

    Proceedings of the International Symposium on Information Theory and Its Applications 2018 (ISITA 2018)   accepted  2018  [Refereed]

  • An efficient private evaluation of a decision graph

    Sudo, Hiroki, Nuida, Koji, Shimizu, Kana

    Proceedings of the 21st International Conference on Information Security and Cryptology (ICISC 2018)   accepted  2018  [Refereed]

  • Differentially private Bayesian learning on distributed data

    Heikkila, Mikko, Lagerspetz, Eemil, Kaski, Samuel, Shimizu, Kana, Tarkoma, Sasu, Honkela, Antti

    Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (NIPS 2017)     3226 - 3235  2017  [Refereed]

  • Efficient privacy-preserving string search and an application in genomics

    Kana Shimizu, Koji Nuida, Gunnar Ratsch

    BIOINFORMATICS   32 ( 11 ) 1652 - 1661  2016.06  [Refereed]

    Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database.
    Approach: We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries.
    Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within approximately 4.6 s and approximately 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm.

  • Privacy-preserving search for chemical compound databases

    Kana Shimizu, Koji Nuida, Hiromi Arai, Shigeo Mitsunari, Nuttapong Attrapadung, Michiaki Hamada, Koji Tsuda, Takatsugu Hirokawa, Jun Sakuma, Goichiro Hanaoka, Kiyoshi Asai

    BMC BIOINFORMATICS   16 ( 18 ) S6  2015.12  [Refereed]

    Background: Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources.
    Results: In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation.
    Conclusion: We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.

  • A Method for Systematic Assessment of Intrinsically Disordered Protein Regions by NMR

    Natsuko Goda, Kana Shimizu, Yohta Kuwahara, Takeshi Tenno, Tamotsu Noguchi, Takahisa Ikegami, Motonori Ota, Hidekazu Hiroaki

    INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES   16 ( 7 ) 15743 - 15760  2015.07  [Refereed]

    Intrinsically disordered proteins (IDPs) that lack stable conformations and are highly flexible have attracted the attention of biologists. Therefore, the development of a systematic method to identify polypeptide regions that are unstructured in solution is important. We have designed an indirect/reflected detection system for evaluating the physicochemical properties of IDPs using nuclear magnetic resonance (NMR). This approach employs a chimeric membrane protein-based method using the thermostable membrane protein PH0471. This protein contains two domains, a transmembrane helical region and a C-terminal OB (oligonucleotide/oligosaccharide binding)-fold domain (named NfeDC domain), connected by a flexible linker. NMR signals of the OB-fold domain of detergent-solubilized PH0471 are observed because of the flexibility of the linker region. In this study, the linker region was substituted with target IDPs. Fifty-three candidates were selected using the prediction tool POODLE and 35 expression vectors were constructed. Subsequently, we obtained N-15-labeled chimeric PH0471 proteins with 25 IDPs as linkers. The NMR spectra allowed us to classify IDPs into three categories: flexible, moderately flexible, and inflexible. The inflexible IDPs contain membrane-associating or aggregation-prone sequences. This is the first attempt to use an indirect/reflected NMR method to evaluate IDPs and can verify the predictions derived from our computational tools.

  • On Limitations and Alternatives of Privacy-Preserving Cryptographic Protocols for Genomic Data

    Tadanori Teruya, Koji Nuida, Kana Shimizu, Goichiro Hanaoka

    ADVANCES IN INFORMATION AND COMPUTER SECURITY (IWSEC 2015)   9241   242 - 261  2015  [Refereed]

    The human genome can identify an individual and determine the individual's biological characteristics, and hence has to be securely protected in order to prevent privacy issues. In this paper we point out, however, that current standard privacy-preserving cryptographic protocols may be insufficient to protect genome privacy. This is mainly due to typical characteristics of genome information; it is immutable, and an individual's genome has correlations to those of the individual's progeny. Then, as an alternative, we propose to protect genome privacy by cryptographic protocols with everlasting security, which provides an appropriate mixture of computational and information-theoretic security. We construct a concrete example of a protocol with everlasting security, and discuss its practical efficiency.

  • Reference-free prediction of rearrangement breakpoint reads

    Edward Wijaya, Kana Shimizu, Kiyoshi Asai, Michiaki Hamada

    BIOINFORMATICS   30 ( 18 ) 2559 - 2567  2014.09  [Refereed]

    Motivation: Chromosome rearrangement events are triggered by atypical breaking and rejoining of DNA molecules, which are observed in many cancer-related diseases. The detection of rearrangement is typically done by using short reads generated by next-generation sequencing (NGS) and combining the reads with knowledge of a reference genome. Because structural variations and genomes differ from one person to another, intermediate comparison via a reference genome may lead to loss of information.
    Results: In this article, we propose a reference-free method for detecting clusters of breakpoints from the chromosomal rearrangements. This is done by directly comparing a set of NGS normal reads with another set that may be rearranged. Our method SlideSort-BPR (breakpoint reads) is based on a fast algorithm for all-against-all comparisons of short reads and theoretical analyses of the number of neighboring reads. When applied to a dataset with a sequencing depth of 100x, it finds approximately 88% of the breakpoints correctly with no false-positive reads. Moreover, evaluation on a real prostate cancer dataset shows that the proposed method predicts more fusion transcripts correctly than previous approaches, and yet produces fewer false-positive reads. To our knowledge, this is the first method to detect breakpoint reads without using a reference genome.

  • PDB-scale analysis of known and putative ligand-binding sites with structural sketches

    Jun-Ichi Ito, Yasuo Tabei, Kana Shimizu, Kentaro Tomii, Koji Tsuda

    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS   80 ( 3 ) 747 - 763  2012.03  [Refereed]

    Computational investigation of protein functions is one of the most urgent and demanding tasks in the field of structural bioinformatics. Exhaustive pairwise comparison of known and putative ligand-binding sites, across protein families and folds, is essential in elucidating the biological functions and evolutionary relationships of proteins. Given the vast amounts of data available now, existing 3D structural comparison methods are not adequate due to their computation time complexity. In this article, we propose a new bit string representation of binding sites called structural sketches, which is obtained by random projections of triplet descriptors. It allows us to use ultra-fast all-pair similarity search methods for strings with strictly controlled error rates. Exhaustive comparison of 1.2 million known and putative binding sites finished in approximately 30 h on a single core to yield 88 million similar binding site pairs. Careful investigation of 3.5 million pairs verified by TM-align revealed several notable analogous sites across distinct protein families or folds. In particular, we succeeded in finding highly plausible functions of several pockets via strong structural analogies. These results indicate that our method is a promising tool for functional annotation of binding sites derived from structural genomics projects.

  • PoSSuM: a database of similar protein-ligand binding and putative pockets

    Jun-Ichi Ito, Yasuo Tabei, Kana Shimizu, Koji Tsuda, Kentaro Tomii

    NUCLEIC ACIDS RESEARCH   40 ( D1 ) D541 - D548  2012.01  [Refereed]

    Numerous potential ligand-binding sites are available today, along with hundreds of thousands of known binding sites observed in the PDB. Exhaustive similarity search for such vastly numerous binding site pairs is useful to predict protein functions and to enable rapid screening of target proteins for drug design. Existing databases of ligand-binding sites offer databases of limited scale. For example, SitesBase covers only approximately 33,000 known binding sites. Inferring protein function and drug discovery purposes, however, demands a much more comprehensive database including known and putative-binding sites. Using a novel algorithm, we conducted a large-scale all-pairs similarity search for 1.8 million known and potential binding sites in the PDB, and discovered over 14 million similar pairs of binding sites. Here, we present the results as a relational database Pocket Similarity Search using Multiple-sketches (PoSSuM) including all the discovered pairs with annotations of various types. PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures, which provides important clues for characterizing protein structures with unclear functions. The PoSSuM database is freely available at http://possum.cbrc.jp/PoSSuM/.

  • SlideSort: all pairs similarity search for short reads

    Kana Shimizu, Koji Tsuda

    BIOINFORMATICS   27 ( 4 ) 464 - 470  2011.02  [Refereed]

    Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses.
    Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing.

  • SAHG, a comprehensive database of predicted structures of all human proteins

    Chie Motono, Junichi Nakata, Ryotaro Koike, Kana Shimizu, Matsuyuki Shirota, Takayuki Amemiya, Kentaro Tomii, Nozomi Nagano, Naofumi Sakaya, Kiyotaka Misoo, Miwa Sato, Akinori Kidera, Hidekazu Hiroaki, Tsuyoshi Shirai, Kengo Kinoshita, Tamotsu Noguchi, Motonori Ota

    NUCLEIC ACIDS RESEARCH   39 ( suppl_1 ) D487 - D493  2011.01  [Refereed]

    Most proteins from higher organisms are known to be multi-domain proteins and contain substantial numbers of intrinsically disordered (ID) regions. To analyse such protein sequences, those from human for instance, we developed a special protein-structure-prediction pipeline and accumulated the products in the Structure Atlas of Human Genome (SAHG) database at http://bird.cbrc.jp/sahg. With the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and Smith-Waterman profile-profile alignment), global-local alignment methods (FORTE) and prediction tools for ID regions (POODLE-S) and homology modeling (MODELLER). Conformational changes of protein models upon ligand-binding were predicted by simultaneous modeling using templates of apo and holo forms. When there were no suitable templates for holo forms and the apo models were accurate, we prepared holo models using prediction methods for ligand-binding (eF-seek) and conformational change (the elastic network model and the linear response theory). Models are displayed as animated images. As of July 2010, SAHG contains 42 581 protein-domain models in approximately 24 900 unique human protein sequences from the RefSeq database. Annotation of models with functional information and links to other databases such as EzCatDB, InterPro or HPRD are also provided to facilitate understanding the protein structure-function relationships.

  • POODLE-I: Disordered region prediction by integrating POODLE series and structural information predictors based on a workflow approach

    Shuichi Hirose, Kana Shimizu, Tamotsu Noguchi

    In Silico Biology   10 ( 3-4 ) 185 - 191  2010  [Refereed]

    Under physiological conditions, many proteins that include a region lacking well-defined three-dimensional structures have been identified, especially in eukaryotes. These regions often play an important biological cellular role, although they cannot form a stable structure. Therefore, they are biologically remarkable phenomena. From an industrial perspective, they can provide useful information for determining three-dimensional structures or designing drugs. For these reasons, disordered regions have attracted a great deal of attention in recent years. Their accurate prediction is therefore anticipated to provide annotations that are useful for a wide range of applications. POODLE-I (where "I" stands for integration) is a web-based disordered region prediction system. POODLE-I integrates prediction results obtained from three kinds of disordered region predictors (POODLEs) developed from the viewpoint that the characteristics of disordered regions change according to their length. Furthermore, POODLE-I combines that information with predicted structural information by application of a workflow approach. When compared with server teams that showed best performance in CASP8, POODLE-I ranked among the top and exhibited the highest performance in predicting unfolded proteins. POODLE-I is an efficient tool for detecting disordered regions in proteins solely from the amino acid sequence. The application is freely available at http://mbs.cbrc.jp/poodle/poodle-i.html.

  • Interaction between Intrinsically Disordered Proteins Frequently Occurs in a Human Protein-Protein Interaction Network

    Kana Shimizu, Hiroyuki Toh

    JOURNAL OF MOLECULAR BIOLOGY   392 ( 5 ) 1253 - 1265  2009.10  [Refereed]

    Intrinsic protein disorder is a widespread phenomenon characterised by a lack of stable three-dimensional structures and is considered to play an important role in protein-protein interactions (PPIs). This study examined the genome-wide preference of disorder in PPIs by using exhaustive disorder prediction in human PPIs. We categorised the PPIs into three types (interaction between disordered proteins, interaction between structured proteins, and interaction between a disordered protein and a structured protein) with regard to the flexibility of molecular recognition and compared these three interaction types in an existing human PPI network with those in a randomised network. Although the structured regions were expected to become the identifiers for binding recognition, this comparative analysis revealed unexpected results. The occurrence of interactions between disordered proteins was significantly frequent, and that between a disordered protein and a structured protein was significantly infrequent. We found that this propensity was much stronger in interactions between nonhub proteins. We also analysed the interaction types from a functional standpoint by using GO, which revealed that the interaction between disordered proteins frequently occurred in cellular processes, regulation, and metabolic processes. The number of interactions, especially in metabolic processes between disordered proteins, was 1.8 times as large as that in the randomised network. Another analysis conducted by using KEGG pathways provided results where several signaling pathways and disease-related pathways included many interactions between disordered proteins. All of these analyses suggest that human PPIs preferably occur between disordered proteins and that the flexibility of the interacting protein pairs may play an important role in human PPI networks.

  • POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix

    Kana Shimizu, Shuichi Hirose, Tamotsu Noguchi

    BIOINFORMATICS   23 ( 17 ) 2337 - 2338  2007.09  [Refereed]

    Protein disorder is characterized by a lack of a stable 3D structure, and is considered to be involved in a number of important protein functions such as regulatory and signalling events. We developed a web application, the POODLE-S, which predicts the disordered region from amino acid sequences by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.
    Availability: POODLE-S is available from http://mbs.cbrc.jp/poodle/poodle-s.html and can be used by both academic and commercial users.
    Contact: poodle@cbrc.jp.

  • POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions

    Shuichi Hirose, Kana Shimizu, Satoru Kanai, Yutaka Kuroda, Tamotsu Noguchi

    BIOINFORMATICS   23 ( 16 ) 2046 - 2053  2007.08  [Refereed]

    Motivation: Recent experimental and theoretical studies have revealed several proteins containing sequence segments that are unfolded under physiological conditions. These segments are called disordered regions. They are actively investigated because of their possible involvement in various biological processes, such as cell signaling, transcriptional and translational regulation. Additionally, disordered regions can represent a major obstacle to high-throughput proteome analysis and often need to be removed from experimental targets. The accurate prediction of long disordered regions is thus expected to provide annotations that are useful for a wide range of applications.
    Results: We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), a Support Vector Machine (SVM)-based method for predicting long disordered regions using 10 kinds of simple physico-chemical properties of amino acids. POODLE-L assembles the output of 10 two-level SVM predictors into a final prediction of disordered regions. The performance of POODLE-L for predicting long disordered regions, which exhibited a Matthews correlation coefficient of 0.658, was the highest when compared with eight well-established publicly available disordered region predictors.
    Availability: POODLE-L is freely available at http://mbs.cbrc.jp/poodle/poodle-l.html
    Contact: hirose-shuichi@aist.go.jp
    Supplementary information: Supplementary data are available at Bioinformatics online.

  • Predicting mostly disordered proteins by using structure-unknown protein data

    Kana Shimizu, Yoichi Muraoka, Shuichi Hirose, Kentaro Tomii, Tamotsu Noguchi

    BMC BIOINFORMATICS   8 ( 1 ) 78  2007.03  [Refereed]

    Background: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We know many more protein sequences than we do protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus it would be efficient to use the information of structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using spectral graph transducer and training with a huge amount of structure-unknown sequences as well as structure-known sequences.
    Results: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs. When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052 - 0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036 - 0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5% - 10% disordered sequences, 1.46% for the proteins with 10% - 20% disordered sequences and 16.57% for proteins with 20% - 40% disordered sequences.
    Conclusion: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness.

  • Angle: A sequencing errors resistant program for predicting protein coding regions in unfinished cDNA

    Kana Shimizu, Jun Adachi, Yoichi Muraoka

    Journal of Bioinformatics and Computational Biology   4 ( 3 ) 649 - 664  2006.06  [Refereed]

    In the process of making full-length cDNA, predicting protein coding regions helps both in the preliminary analysis of genes and in any succeeding process. However, unfinished cDNA contains artifacts including many sequencing errors, which hinder the correct evaluation of coding sequences. In particular, predictions for short sequences are difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach, which represents coding sequences better by maximizing the use of the limited information in input sequences. Our method utilizes not only codon usage but also protein structure information, which is difficult to use in stochastic model-based algorithms, and optimizes limited information from a short segment when deciding coding potential, with the result that predictive accuracy does not depend on the length of an input sequence. The performance of ANGLE is compared with ESTSCAN on four datasets, each with a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one dataset which has no error. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short sequence dataset (< 1000 bases). On the long sequence dataset, ANGLE achieves comparable performance.

  • Feature selection based on physicochemical properties of redefined N-term region and C-term regions for predicting disorder

    Shimizu, Kana, Muraoka, Yoichi, Hirose, Shuichi, Noguchi, Tamotsu

    Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB'05)     1 - 6  2005  [Refereed]

  • A Melody-Retrieval System on Parallelized Computers

    Sonoda, Tomonari, Ikenaga, Toshiya, Shimizu, Kana, Muraoka, Yoichi

    Entertainment Computing: Technologies and Applications, IFIP First International Workshop on Entertainment Computing (IWEC 2002)     265 - 272  2003  [Refereed]

  • The design method of a melody retrieval system on parallelized computers

    T Sonoda, T Ikenaga, K Shimizu, Y Muraoka

    SECOND INTERNATIONAL CONFERENCE ON WEB DELIVERING OF MUSIC, PROCEEDINGS     66 - 73  2002  [Refereed]

    This paper describes the design method of a WWW-based melody-retrieval system which takes a sung melody as a search clue and retrieves the music title from a music database of standard MIDI files (SMF) over the Internet. The most important thing in building a melody-retrieval system on the Internet is to achieve both high matching accuracy and quick search. It was, however, quite difficult to fulfill these two conditions simultaneously, since the matching process took a long time. We propose the design of a system which consists of parallelized melody-retrieval servers for building a high performance service on the Internet.

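Books and Other Publications

Awards

  • FY2018 Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology, Prize for Science and Technology (Research Category)

    April 2018

  • Informatics in Biology, Medicine and Pharmacology (IIBMP) 2016, Research Encouragement Award

    October 2016

  • FY2015 AIST President's Award (Research)

    April 2016

  • IIBMP 2015, Research Encouragement Award

    October 2015

  • IIBMP 2015, Best Oral Presentation Award

    October 2015

  • Computer Security Symposium 2014 (CSS2014), Excellent Demonstration Award

    October 2014

  • Computer Security Symposium 2013 (CSS2013), Excellent Demonstration Award

    October 2013

  • IIBMP 2012, Best Poster Award

    October 2012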

Research Projects (Joint Research and Competitive Funding)

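  • Development of privacy-preserving genome information analysis technology

    Project period: April 2019 - March 2022

    Handling personal genome data, whose volume has grown explosively in recent years, carries high privacy risks, so there is strong demand for methodologies that aggregate data safely and effectively and discover useful knowledge. Against this background, this project clarifies which parts of genomic information constitute personal privacy and then studies methodologies for analyzing the information while the parts that must be kept secret remain encrypted. The project focuses on two central problems, genome sequence search and genome-wide association studies, and develops methods that allow large-scale data analysis to be carried out securely.

    Because handling personal genome data carries high privacy risks, useful data are frequently locked inside individual organizations and isolated, a phenomenon known as siloing. When genomic information is analyzed with statistics or machine learning, results become more accurate as the variety of data and the number of samples increase, so there is strong demand for methodologies that aggregate siloed data safely and effectively and discover useful knowledge. The aim of this project is therefore to study methodologies for analyzing genomic information while keeping it secret. The project focuses on (1) genome sequence search and (2) genome-wide association studies, and aims to develop methods that allow large-scale data analysis to be carried out securely. In FY2019, for (1), we devised a cryptographic protocol for full-text search based on secret sharing and implemented a prototype. In experiments with the prototype, we confirmed that a search against a genome database of length ten million takes about 10 seconds even in a real Internet environment. For (2), we devised an information analysis platform that can perform genome-wide association studies (GWAS) using Intel SGX, one of the technologies that realize a Trusted Execution Environment (TEE), and implemented a prototype.

    In FY2019, the following progress was made on the two goals of the project, (1) privacy-preserving genome sequence search and (2) privacy-preserving genome-wide association studies. (1) Based on secret sharing, we devised a cryptographic protocol for privacy-preserving full-text search that is useful for analyzing genome sequences and medical documents, and implemented a prototype. Thanks to careful use of precomputation, the time complexity, communication volume, and number of rounds required for the online computation, from submitting a query to obtaining the search result, do not depend on the database length and depend only on the query length. In typical information retrieval the query is far shorter than the database, so the method runs very fast even on enormous data such as genome databases. In experiments with the prototype, we confirmed that a search against a genome database of length ten million takes about 10 seconds even in a real Internet environment. (2) We also developed a system that analyzes personal genome data using Intel SGX. The system can perform genome-wide association analysis and data clustering, and, by using Oblivious RAM, which hides data access patterns, it can also access huge data at high speed. Users describe the data analysis in a programming language such as JavaScript, and a virtual machine deployed inside the enclave on the server executes it without leaking information to the server side. In experiments using genomic variant data from more than 200 individuals, we confirmed that the analysis can be performed in time comparable to that of software without information protection.

    As described above, for the key component technologies needed to realize large-scale genome data analysis, we have progressed from devising the basic methodologies to implementing prototypes, and the project is proceeding as originally planned.

    Since the project has so far progressed largely smoothly, in FY2020 we will continue the research according to the original plan. For genome sequence search, we aim at an efficient implementation that includes the communication part of secret sharing, and we will consider further sophistication and efficiency improvements of the privacy-preserving full-text search algorithm. For genome-wide association studies, we will consider further enhancements such as protecting the output privacy of the TEE-based information analysis system.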
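
    The secret sharing that the search protocol builds on can be illustrated with a generic additive scheme. This is only a sketch under stated assumptions: the modulus, the function names, and the two-party default are hypothetical, and the code does not reproduce the project's full-text search protocol.

```python
import secrets

# Hypothetical modulus chosen for illustration; shares live in [0, MODULUS).
MODULUS = 2**61 - 1


def share(value: int, n_parties: int = 2) -> list[int]:
    """Split `value` into n_parties additive shares modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recover the secret by summing all shares modulo MODULUS."""
    return sum(shares) % MODULUS


def add_shares(a: list[int], b: list[int]) -> list[int]:
    """Each party adds its own two shares locally; no communication needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


if __name__ == "__main__":
    a, b = share(5), share(7)
    print(reconstruct(add_shares(a, b)))  # sum of the two secrets
```

    Any single share is uniformly random and reveals nothing about the secret on its own, while additions can be carried out share by share. Operations such as multiplication or comparison need extra interaction, which is typically where precomputation comes in.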

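  • Development of privacy-preserving technology to promote medical information analysis

    The Okawa Foundation for Information and Telecommunications  FY2017 (31st) Okawa Foundation Research Grant

    Project period: March 2018 - March 2019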

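  • Privacy-preserving genome information analysis for realizing personalized medicine

    Japan Science and Technology Agency (JST) / Japan Agency for Medical Research and Development (AMED)  Strategic International Cooperative Program (SICP), Japan-Finland (Tekes/AF) Research Exchange

    Project period: May 2014 - March 2017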

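  • Development of de novo genome analysis technology that does not miss differences between similar genomes

    Japan Society for the Promotion of Science (JSPS)  KAKENHI, Grant-in-Aid for Challenging Exploratory Research

    Project period: April 2014 - March 2017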

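    Recent studies have suggested that genome sequences are highly diverse. However, the currently mainstream analysis approach first maps the fragment sequences output by a sequencer to a reference genome and derives statistics from the result, so the analysis depends on the characteristics of the reference genome and genomic diversity can be overlooked. In this project, we therefore designed and implemented a method that compares multiple datasets directly and discovers genome sequence patterns whose characteristics differ between datasets.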

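  • Development and application of fundamental technologies for privacy-preserving bioinformatics

    Project period: April 2013 - March 2016

    Personal genome information and information on compounds that are seeds for drugs must be handled as confidential. From the standpoint of open science, on the other hand, it is important to use such information actively and to mine it together with other data. In this project, we developed methodologies for performing various kinds of data mining while keeping such important biological information secret. Specifically, we developed technologies for privacy-preserving search of chemical compound databases, privacy-preserving search of genomic information using hidden Markov models, and privacy-preserving sequence alignment.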

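  • Development of a fast comparison method for local protein structures toward substrate-binding site prediction

    Project period: April 2011 - March 2014

    In this project, we developed a new method that makes it possible to compare large numbers of binding sites by coarse-graining protein substrate-binding sites and applying a fast sorting algorithm. We applied the method to known and putative substrate-binding sites on the scale of the entire protein 3D structure database, and built and released PoSSuM, a database containing the comparison results. PoSSuM has now grown to contain 49 million similar binding-site pairs obtained by comparing 5.5 million known and putative substrate-binding sites. In addition, aiming at applications such as drug repositioning and side-effect prediction, we built and released a new database, PoSSuMds, with added links to ChEMBL assay information.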

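  • Development of fast analysis technology for giga-sequence data

    Japan Society for the Promotion of Science (JSPS)  KAKENHI, Grant-in-Aid for Young Scientists (B)

    Project period: April 2010 - March 2012

    Giga-sequencers output huge numbers of short fragment sequences (reads), so developing fast analysis techniques is an urgent task. In this project, we applied the pigeonhole principle with offsets to develop SlideSort, an algorithm that finds similar sequences among a large number of reads extremely quickly. Compared with conventional methods, SlideSort achieved a speedup of more than 1000-fold with a comparable amount of memory. As an application of the algorithm, we also developed software that builds a minimum spanning tree. Similar-pair search has a wide range of applications; besides the clustering mentioned above, it is expected to be useful for discovering common patterns and for making assembly more efficient.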
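
    The pigeonhole idea behind this kind of similar-pair search can be sketched as follows. This is not SlideSort itself: it is a simplified illustration that assumes equal-length reads, uses Hamming distance only, and omits the offsets that the project's algorithm uses to handle insertions and deletions. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(length: int, k: int) -> list[tuple[int, int]]:
    """Split range(length) into k contiguous, nearly equal blocks."""
    return [(i * length // k, (i + 1) * length // k) for i in range(k)]


def similar_pairs(reads: list[str], d: int) -> set[tuple[int, int]]:
    """Index pairs (i, j), i < j, whose Hamming distance is at most d.

    Pigeonhole argument: with d + 1 blocks and at most d mismatches,
    at least one block must match exactly, so only reads that share
    an identical block need to be verified.
    """
    if not reads:
        return set()
    result = set()
    for start, end in block_bounds(len(reads[0]), d + 1):
        buckets = defaultdict(list)
        for idx, read in enumerate(reads):
            buckets[read[start:end]].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if hamming(reads[i], reads[j]) <= d:
                    result.add((i, j))
    return result


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA"]
    print(sorted(similar_pairs(reads, d=1)))
```

    Bucketing by block content avoids comparing every pair of reads directly, which is what makes the approach attractive for very large read sets.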

Lectures and Oral Presentations

  • Privacy-aware computational genomics

    Kana Shimizu  [Invited]

    SPIEZ Convergence 2018   (Spiez)

    Presentation date: September 2018

  • Privacy-preserving genome sequence search

    Kana Shimizu  [Invited]

    2016 International Workshop on Spatial and Temporal Modeling from Statistical, Machine Learning and Engineering perspectives (STM2016)   (Tokyo)

    Presentation date: July 2016

  • Efficient Privacy-Preserving String Search and an Application in Genomics

    Kana Shimizu

    High Throughput Sequencing Algorithms & Applications (HitSeq 2015), A SIG of ISMB/ECCB 2015   (Dublin)

    Presentation date: July 2015

  • Privacy Preserving Similarity Search in Biomedical Data by Homomorphic encryption

    Kana Shimizu

    Biological Data Science   (Cold Spring Harbor)

    Presentation date: November 2014

  • Next generation sequencing data analyses by using ultra-fast all pairs similarity search

    Kana Shimizu  [Invited]

    International Symposium on Single Biomolecule Analysis 2013   (Kyoto)

    Presentation date: November 2013

  • Privacy-preserving search for a chemical compound database

    Kana Shimizu  [Invited]

    ISMB/ECCB 2013 Oral Poster Presentations Track   (Berlin)

    Presentation date: July 2013

  • Privacy-preserving search for a chemical compound database

    Kana Shimizu

    ISCB-Asia/SCCG 2012   (Shenzhen)

    Presentation date: December 2012

  • SlideSort: Fast and exact algorithm for Next Generation Sequencing data analysis

    Kana Shimizu

    ISMB/ECCB 2011 Highlights Track   (Boston)

    Presentation date: July 2011

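Special Research Projects

  • Development of privacy-preserving genome information analysis technology

    2018

    We examined in detail the functions and performance required of technologies for protecting genomic information processing. We also developed a secure computation protocol for decision graphs, studied applications of homomorphic encryption that allows only a single multiplication, and implemented a genome information search application.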

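  • Development of privacy-preserving genome information analysis technology

    2017

    We developed privacy-preserving technologies needed when searching databases that contain genomic information. In this project, we developed cryptographic protocols that use homomorphic encryption to perform the desired data analysis without the user or the database disclosing their information to each other. Specifically, we developed a protocol for genome-wide association studies by logistic regression and a protocol for classification with a trained decision tree.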

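  • Development of privacy-preserving genome information analysis technology using cryptography

    2016

    In database search, when both the query and the database contain private information, it is difficult to protect the privacy of both at the same time. To solve this problem, we developed a privacy-preserving search technique that presents only the search results to the user while keeping the data contents hidden. The proposed method performs privacy-preserving substring search, runs fast even when the alphabet is large, and was 10 to more than 100 times faster than conventional methods. The results are expected to help make database search more secure.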

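  • Study on identifying gene regions in cDNA

    2003

    With the arrival of the post-sequencing era, the need for genome information analysis is growing. Genomic information is redundant, and only a small part of the information read by sequencers is involved in biological function. To make genomic information useful for drug discovery, gene therapy, breed improvement, and similar purposes, it is therefore necessary first to identify gene regions in large amounts of data and to analyze protein function. Against this background, this study aimed to predict protein-coding regions from cDNA sequences. Previous work on identifying protein-coding regions in cDNA bases its predictions on codon usage frequencies, such as codon chains, and therefore cannot maintain prediction accuracy on sequences with biased codon usage. Genomic information has many exceptions, and sequences with biased codon usage are numerous. Robust prediction requires information from many kinds of biological knowledge, but because most previous work uses probabilistic models such as hidden Markov models, it has been difficult to use probabilistically dependent pieces of biological knowledge at the same time. In contrast, this study proposed a method that makes predictions by combining many kinds of biological knowledge considered useful, in addition to codon usage frequency. We implemented the proposed method and evaluated it on benchmark data, obtaining better accuracy than previous work. The system implemented in this study can also be run from the web and is scheduled to be made public on the web shortly. The results of this study can be applied not only to cDNA but also to predicting exon regions in DNA. We are currently modifying the system for DNA prediction and continuing the research so that it can contribute more broadly.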

 

Courses Currently Taught

 

Committee Memberships

  • 2017
    -
    2018

    Japanese Society for Bioinformatics (JSBi)  Auditor

  • 2011
    -
    2018

    International Society for Computational Biology (ISCB)  The affiliates committee

  • 2010
    -
    2011

    Japanese Society for Bioinformatics (JSBi)  Secretary

  • 2010
    -
    2011

    Japanese Society for Bioinformatics (JSBi)  Director