Updated on 2022/05/25

 
SHIMIZU, Kana
 
Affiliation
Faculty of Science and Engineering, School of Fundamental Science and Engineering
Job title
Professor

Concurrent Post

  • Faculty of Science and Engineering   Graduate School of Fundamental Science and Engineering

Research Institute

  • 2020
    -
    2022

    Waseda Research Institute for Science and Engineering   Concurrent Researcher

  • 2020
    -
    2022

    Global Information and Telecommunication Institute   Concurrent Researcher

Education

  • 2003.04
    -
    2006.03

    Waseda University   Graduate School of Science and Engineering   Major in Information Networks  

  • 2001.04
    -
    2003.03

    Waseda University   Graduate School of Science and Engineering   Major in Information Science  

  • 1997.04
    -
    2001.03

    Waseda University   School of Science and Engineering  

Degree

  • Waseda University   Dr.(Eng.)

Research Experience

  • 2018.04
    -
     

    Waseda University   Faculty of Science and Engineering   Professor

  • 2016.04
    -
    2018.03

    Waseda University   Faculty of Science and Engineering   Associate Professor

  • 2013.03
    -
    2016.03

    National Institute of Advanced Industrial Science and Technology   Biological Research Institute for Drug Discovery/Computational Biology Research Center   Senior Research Scientist

  • 2013.12
    -
    2015.04

    The Sloan-Kettering Institute at Memorial Sloan-Kettering Cancer Center   Visiting Investigator

  • 2009.04
    -
    2013.02

    National Institute of Advanced Industrial Science and Technology   Computational Biology Research Center   Research Scientist

  • 2006.11
    -
    2009.03

    National Institute of Advanced Industrial Science and Technology   Computational Biology Research Center   AIST Research Staff

  • 2006.01
    -
    2006.10

    National Institute of Advanced Industrial Science and Technology   Computational Biology Research Center   AIST Technical Staff

 

Research Areas

  • Life, health and medical informatics

Research Interests

  • Algorithm

  • Privacy preserving Datamining

  • Datamining

  • Genome data analysis

  • Computational Biology

  • Bioinformatics

Papers

  • Efficient Privacy-Preserving Variable-Length Substring Match for Genome Sequence.

    Yoshiki Nakagawa, Satsuya Ohata, Kana Shimizu

    21st International Workshop on Algorithms in Bioinformatics(WABI)     2 - 23  2021  [Refereed]

  • Discovery of cryoprotective activity in human genome-derived intrinsically disordered proteins

    Naoki Matsuo, Natsuko Goda, Kana Shimizu, Satoshi Fukuchi, Motonori Ota, Hidekazu Hiroaki

    International Journal of Molecular Sciences   19 ( 2 ) E401  2018.02  [Refereed]

    Intrinsically disordered proteins (IDPs) are an emerging phenomenon. They may have a high degree of flexibility in their polypeptide chains, which lack a stable 3D structure. Although several biological functions of IDPs have been proposed, their general function is not known. The only finding related to their function is the genetically conserved YSK2 motif present in plant dehydrins. These proteins were shown to be IDPs with the YSK2 motif serving as a core region for the dehydrins’ cryoprotective activity. Here we examined the cryoprotective activity of randomly selected IDPs toward the model enzyme lactate dehydrogenase (LDH). All five IDPs that were examined were in the range of 35–45 amino acid residues in length and were equally potent at a concentration of 50 µg/mL, whereas folded proteins, the PSD-95/Dlg/ZO-1 (PDZ) domain, and lysozymes had no potency. We further examined their cryoprotective activity toward glutathione S-transferase as an example of another enzyme, and toward enhanced green fluorescent protein as an example of a non-enzyme protein. We also examined the lyophilization protective activity of the peptides toward LDH, which revealed that some IDPs showed a higher activity than that of bovine serum albumin (BSA). Based on these observations, we propose that cryoprotection is a general feature of IDPs. Our findings may become a clue to various industrial applications of IDPs in the future.

  • Efficient Two-level Homomorphic Encryption in Prime-order Bilinear Groups and A Fast Implementation in WebAssembly

    Attrapadung, Nuttapong, Hanaoka, Goichiro, Mitsunari, Shigeo, Sakai, Yusuke, Shimizu, Kana, Teruya, Tadanori

    Proceedings of the ACM Asia Conference on Computer and Communications Security 2018 (AsiaCCS 2018)     685 - 697  2018  [Refereed]

  • Secure Wavelet Matrix: Alphabet-Friendly Privacy-Preserving String Search for Bioinformatics

    Sudo, Hiroki, Jimbo, Masanobu, Nuida, Koji, Shimizu, Kana

    IEEE/ACM Transactions on Computational Biology and Bioinformatics   16 ( 5 ) 1675 - 1684  2018  [Refereed]

  • Secure Division Protocol and Applications to Privacy-preserving Chi-squared Tests

    Morita, Hiraku, Attrapadung, Nuttapong, Ohata, Satsuya, Nuida, Koji, Yamada, Shota, Shimizu, Kana, Hanaoka, Goichiro, Asai, Kiyoshi

    Proceedings of the International Symposium on Information Theory and Its Applications 2018 (ISITA 2018)   accepted  2018  [Refereed]

  • An efficient private evaluation of a decision graph

    Sudo, Hiroki, Nuida, Koji, Shimizu, Kana

    Proceedings of the 21st International Conference on Information Security and Cryptology (ICISC 2018)   accepted  2018  [Refereed]

  • Differentially private Bayesian learning on distributed data

    Heikkila, Mikko, Lagerspetz, Eemil, Kaski, Samuel, Shimizu, Kana, Tarkoma, Sasu, Honkela, Antti

    Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (NIPS 2017)     3226 - 3235  2017  [Refereed]

  • Efficient privacy-preserving string search and an application in genomics

    Kana Shimizu, Koji Nuida, Gunnar Ratsch

    BIOINFORMATICS   32 ( 11 ) 1652 - 1661  2016.06  [Refereed]

    Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private, and the server would like to keep the database private.
    Approach: We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries.
    Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within approximately 4.6 s and approximately 10.8 s for the client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm.

  • Privacy-Preserving String Search for Genome Sequences with FHE bootstrapping optimization

    Yu Ishimaki, Hiroki Imabayashi, Kana Shimizu, Hayato Yamana

    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)     3989 - 3991  2016  [Refereed]

    Privacy-preserving string search is a crucial task for analyzing genomics-driven big data. In this work, we propose a cryptographic protocol that uses Fully Homomorphic Encryption (FHE) to enable a client to search a genome sequence database without leaking his/her query to the server. Though FHE supports both addition and multiplication over encrypted data, random noise inside ciphertexts grows with every arithmetic operation, especially multiplication, which results in incorrect decryption when the amount of noise exceeds a threshold called the level. There are two approaches to avoid incorrect decryption: one is setting a level sufficient to assure correct decryption within a limited number of operations, and the other is resetting the noise by a method called bootstrapping. It is important to find an optimal balance between the overhead caused by the level and the overhead caused by bootstrapping, since using a higher level degrades the performance of all arithmetic operations, while a larger number of bootstrappings incurs more overhead. In this study, we propose an efficient approach to minimize the number of bootstrappings while reducing the level as much as possible. Our experimental results show that it runs up to 10 times faster than a naive approach.

  • Privacy-preserving search for chemical compound databases

    Kana Shimizu, Koji Nuida, Hiromi Arai, Shigeo Mitsunari, Nuttapong Attrapadung, Michiaki Hamada, Koji Tsuda, Takatsugu Hirokawa, Jun Sakuma, Goichiro Hanaoka, Kiyoshi Asai

    BMC BIOINFORMATICS   16 ( 18 ) S6  2015.12  [Refereed]

    Background: Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources.
    Results: In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation.
    Conclusion: We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.
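
The additive-homomorphic property mentioned in the abstract above (addition can be performed on encrypted values) can be illustrated with a toy Paillier cryptosystem. This is only a sketch under stated assumptions: the abstract does not name the cryptosystem used in the paper, the function names are hypothetical, and the tiny hard-coded primes offer no security at all.

```python
import math
import secrets


def keygen(p: int = 1789, q: int = 1861):
    """Build toy Paillier keys from two small primes (illustrative, NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because we fix g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n as (1 + m*n) * r^n mod n^2 with fresh randomness r."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + m * n) * pow(r, n, n2) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Recover m from c using L(c^lam mod n^2) * mu mod n, where L(u) = (u-1)/n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return c1 * c2 % (n * n)


def scale_encrypted(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to a public constant k multiplies the plaintext by k."""
    return pow(c, k, n * n)
```

With these helpers, a party holding only ciphertexts can compute sums and weighted sums of hidden values, and only the key holder can decrypt the result. Multiplying two hidden values together is not possible, which is the restriction the abstract refers to.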

  • A Method for Systematic Assessment of Intrinsically Disordered Protein Regions by NMR

    Natsuko Goda, Kana Shimizu, Yohta Kuwahara, Takeshi Tenno, Tamotsu Noguchi, Takahisa Ikegami, Motonori Ota, Hidekazu Hiroaki

    INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES   16 ( 7 ) 15743 - 15760  2015.07  [Refereed]

    Intrinsically disordered proteins (IDPs) that lack stable conformations and are highly flexible have attracted the attention of biologists. Therefore, the development of a systematic method to identify polypeptide regions that are unstructured in solution is important. We have designed an indirect/reflected detection system for evaluating the physicochemical properties of IDPs using nuclear magnetic resonance (NMR). This approach employs a chimeric membrane protein-based method using the thermostable membrane protein PH0471. This protein contains two domains, a transmembrane helical region and a C-terminal OB (oligonucleotide/oligosaccharide binding)-fold domain (named NfeDC domain), connected by a flexible linker. NMR signals of the OB-fold domain of detergent-solubilized PH0471 are observed because of the flexibility of the linker region. In this study, the linker region was substituted with target IDPs. Fifty-three candidates were selected using the prediction tool POODLE and 35 expression vectors were constructed. Subsequently, we obtained N-15-labeled chimeric PH0471 proteins with 25 IDPs as linkers. The NMR spectra allowed us to classify IDPs into three categories: flexible, moderately flexible, and inflexible. The inflexible IDPs contain membrane-associating or aggregation-prone sequences. This is the first attempt to use an indirect/reflected NMR method to evaluate IDPs and can verify the predictions derived from our computational tools.

  • On Limitations and Alternatives of Privacy-Preserving Cryptographic Protocols for Genomic Data

    Tadanori Teruya, Koji Nuida, Kana Shimizu, Goichiro Hanaoka

    ADVANCES IN INFORMATION AND COMPUTER SECURITY (IWSEC 2015)   9241   242 - 261  2015  [Refereed]

    The human genome can identify an individual and determine the individual's biological characteristics, and hence has to be securely protected in order to prevent privacy issues. In this paper we point out, however, that current standard privacy-preserving cryptographic protocols may be insufficient to protect genome privacy. This is mainly due to typical characteristics of genome information; it is immutable, and an individual's genome has correlations to those of the individual's progeny. Then, as an alternative, we propose to protect genome privacy by cryptographic protocols with everlasting security, which provides an appropriate mixture of computational and information-theoretic security. We construct a concrete example of a protocol with everlasting security, and discuss its practical efficiency.

  • Reference-free prediction of rearrangement breakpoint reads

    Edward Wijaya, Kana Shimizu, Kiyoshi Asai, Michiaki Hamada

    BIOINFORMATICS   30 ( 18 ) 2559 - 2567  2014.09  [Refereed]

    Motivation: Chromosome rearrangement events are triggered by atypical breaking and rejoining of DNA molecules, which are observed in many cancer-related diseases. The detection of rearrangement is typically done by using short reads generated by next-generation sequencing (NGS) and combining the reads with knowledge of a reference genome. Because structural variations and genomes differ from one person to another, intermediate comparison via a reference genome may lead to loss of information.
    Results: In this article, we propose a reference-free method for detecting clusters of breakpoints from the chromosomal rearrangements. This is done by directly comparing a set of NGS normal reads with another set that may be rearranged. Our method SlideSort-BPR (breakpoint reads) is based on a fast algorithm for all-against-all comparisons of short reads and theoretical analyses of the number of neighboring reads. When applied to a dataset with a sequencing depth of 100x, it finds approximately 88% of the breakpoints correctly with no false-positive reads. Moreover, evaluation on a real prostate cancer dataset shows that the proposed method predicts more fusion transcripts correctly than previous approaches, and yet produces fewer false-positive reads. To our knowledge, this is the first method to detect breakpoint reads without using a reference genome.

  • PDB-scale analysis of known and putative ligand-binding sites with structural sketches

    Jun-Ichi Ito, Yasuo Tabei, Kana Shimizu, Kentaro Tomii, Koji Tsuda

    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS   80 ( 3 ) 747 - 763  2012.03  [Refereed]

    Computational investigation of protein functions is one of the most urgent and demanding tasks in the field of structural bioinformatics. Exhaustive pairwise comparison of known and putative ligand-binding sites, across protein families and folds, is essential in elucidating the biological functions and evolutionary relationships of proteins. Given the vast amounts of data available now, existing 3D structural comparison methods are not adequate due to their computation time complexity. In this article, we propose a new bit string representation of binding sites called structural sketches, which is obtained by random projections of triplet descriptors. It allows us to use ultra-fast all-pair similarity search methods for strings with strictly controlled error rates. Exhaustive comparison of 1.2 million known and putative binding sites finished in approximately 30 h on a single core to yield 88 million similar binding site pairs. Careful investigation of 3.5 million pairs verified by TM-align revealed several notable analogous sites across distinct protein families or folds. In particular, we succeeded in finding highly plausible functions of several pockets via strong structural analogies. These results indicate that our method is a promising tool for functional annotation of binding sites derived from structural genomics projects. Proteins 2011. © 2012 Wiley Periodicals, Inc.

  • Privacy preservation in information retrieval

    Hiromi Arai, Kana Shimizu, Michiaki Hamada

    Proceedings of the Annual Conference of JSAI   26   1 - 4  2012

  • PoSSuM: a database of similar protein-ligand binding and putative pockets

    Jun-Ichi Ito, Yasuo Tabei, Kana Shimizu, Koji Tsuda, Kentaro Tomii

    NUCLEIC ACIDS RESEARCH   40 ( D1 ) D541 - D548  2012.01  [Refereed]

    Numerous potential ligand-binding sites are available today, along with hundreds of thousands of known binding sites observed in the PDB. Exhaustive similarity search for such vastly numerous binding site pairs is useful to predict protein functions and to enable rapid screening of target proteins for drug design. Existing databases of ligand-binding sites are of limited scale. For example, SitesBase covers only approximately 33,000 known binding sites. Inferring protein function and supporting drug discovery, however, demand a much more comprehensive database including known and putative binding sites. Using a novel algorithm, we conducted a large-scale all-pairs similarity search for 1.8 million known and potential binding sites in the PDB, and discovered over 14 million similar pairs of binding sites. Here, we present the results as a relational database Pocket Similarity Search using Multiple-sketches (PoSSuM) including all the discovered pairs with annotations of various types. PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures, which provides important clues for characterizing protein structures with unclear functions. The PoSSuM database is freely available at http://possum.cbrc.jp/PoSSuM/.

  • SlideSort: all pairs similarity search for short reads

    Kana Shimizu, Koji Tsuda

    BIOINFORMATICS   27 ( 4 ) 464 - 470  2011.02  [Refereed]

    Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses.
    Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing.
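
For orientation, the problem stated in the abstract above (finding all pairs in a string pool within a given edit distance) can be written as a brute-force baseline. The sketch below is not SlideSort: it compares every pair with a standard dynamic-programming edit distance, which is exactly the quadratic cost that the k-mer chaining described in the abstract is designed to avoid. The function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similar_pairs(reads, max_dist: int):
    """Return index pairs (i, j), i < j, whose reads are within max_dist edits."""
    return [
        (i, j)
        for (i, x), (j, y) in combinations(enumerate(reads), 2)
        if edit_distance(x, y) <= max_dist
    ]
```

A baseline like this is mainly useful for checking the output of a faster method on small inputs, since its running time grows with the square of the number of reads.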

  • SAHG, a comprehensive database of predicted structures of all human proteins

    Chie Motono, Junichi Nakata, Ryotaro Koike, Kana Shimizu, Matsuyuki Shirota, Takayuki Amemiya, Kentaro Tomii, Nozomi Nagano, Naofumi Sakaya, Kiyotaka Misoo, Miwa Sato, Akinori Kidera, Hidekazu Hiroaki, Tsuyoshi Shirai, Kengo Kinoshita, Tamotsu Noguchi, Motonori Ota

    NUCLEIC ACIDS RESEARCH   39 ( suppl_1 ) D487 - D493  2011.01  [Refereed]

    Most proteins from higher organisms are known to be multi-domain proteins and contain substantial numbers of intrinsically disordered (ID) regions. To analyse such protein sequences, those from human for instance, we developed a special protein-structure-prediction pipeline and accumulated the products in the Structure Atlas of Human Genome (SAHG) database at http://bird.cbrc.jp/sahg. With the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and Smith-Waterman profile-profile alignment), global-local alignment methods (FORTE) and prediction tools for ID regions (POODLE-S) and homology modeling (MODELLER). Conformational changes of protein models upon ligand-binding were predicted by simultaneous modeling using templates of apo and holo forms. When there were no suitable templates for holo forms and the apo models were accurate, we prepared holo models using prediction methods for ligand-binding (eF-seek) and conformational change (the elastic network model and the linear response theory). Models are displayed as animated images. As of July 2010, SAHG contains 42 581 protein-domain models in approximately 24 900 unique human protein sequences from the RefSeq database. Annotation of models with functional information and links to other databases such as EzCatDB, InterPro or HPRD are also provided to facilitate understanding the protein structure-function relationships.

  • POODLE-I: Disordered region prediction by integrating POODLE series and structural information predictors based on a workflow approach

    Shuichi Hirose, Kana Shimizu, Tamotsu Noguchi

    In Silico Biology   10 ( 3-4 ) 185 - 191  2010  [Refereed]

    Under physiological conditions, many proteins that include a region lacking well-defined three-dimensional structures have been identified, especially in eukaryotes. These regions often play an important biological cellular role, although they cannot form a stable structure. Therefore, they are biologically remarkable phenomena. From an industrial perspective, they can provide useful information for determining three-dimensional structures or designing drugs. For these reasons, disordered regions have attracted a great deal of attention in recent years. Their accurate prediction is therefore anticipated to provide annotations that are useful for a wide range of applications. POODLE-I (where "I" stands for integration) is a web-based disordered region prediction system. POODLE-I integrates prediction results obtained from three kinds of disordered region predictors (POODLEs) developed from the viewpoint that the characteristics of disordered regions change according to their length. Furthermore, POODLE-I combines that information with predicted structural information by application of a workflow approach. When compared with server teams that showed best performance in CASP8, POODLE-I ranked among the top and exhibited the highest performance in predicting unfolded proteins. POODLE-I is an efficient tool for detecting disordered regions in proteins solely from the amino acid sequence. The application is freely available at http://mbs.cbrc.jp/poodle/poodle-i.html. © 2010 - IOS Press and Bioinformation Systems e.V. and the authors. All rights reserved.

  • Interaction between Intrinsically Disordered Proteins Frequently Occurs in a Human Protein-Protein Interaction Network

    Kana Shimizu, Hiroyuki Toh

    JOURNAL OF MOLECULAR BIOLOGY   392 ( 5 ) 1253 - 1265  2009.10  [Refereed]

    Intrinsic protein disorder is a widespread phenomenon characterised by a lack of stable three-dimensional structures and is considered to play an important role in protein-protein interactions (PPIs). This study examined the genome-wide preference of disorder in PPIs by using exhaustive disorder prediction in human PPIs. We categorised the PPIs into three types (interaction between disordered proteins, interaction between structured proteins, and interaction between a disordered protein and a structured protein) with regard to the flexibility of molecular recognition and compared these three interaction types in an existing human PPI network with those in a randomised network. Although the structured regions were expected to become the identifiers for binding recognition, this comparative analysis revealed unexpected results. The occurrence of interactions between disordered proteins was significantly frequent, and that between a disordered protein and a structured protein was significantly infrequent. We found that this propensity was much stronger in interactions between nonhub proteins. We also analysed the interaction types from a functional standpoint by using GO, which revealed that the interaction between disordered proteins frequently occurred in cellular processes, regulation, and metabolic processes. The number of interactions, especially in metabolic processes between disordered proteins, was 1.8 times as large as that in the randomised network. Another analysis conducted by using KEGG pathways provided results where several signaling pathways and disease-related pathways included many interactions between disordered proteins. All of these analyses suggest that human PPIs preferably occur between disordered proteins and that the flexibility of the interacting protein pairs may play an important role in human PPI networks. © 2009 Elsevier Ltd. All rights reserved.

  • POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix

    Kana Shimizu, Shuichi Hirose, Tamotsu Noguchi

    BIOINFORMATICS   23 ( 17 ) 2337 - 2338  2007.09  [Refereed]

    Protein disorder is characterized by a lack of a stable 3D structure, and is considered to be involved in a number of important protein functions such as regulatory and signalling events. We developed a web application, the POODLE-S, which predicts the disordered region from amino acid sequences by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.
    Availability: POODLE-S is available from http://mbs.cbrc.jp/poodle/poodle-s.html and can be used by both academic and commercial users.
    Contact: poodle@cbrc.jp.

  • POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions

    Shuichi Hirose, Kana Shimizu, Satoru Kanai, Yutaka Kuroda, Tamotsu Noguchi

    BIOINFORMATICS   23 ( 16 ) 2046 - 2053  2007.08  [Refereed]

    Motivation: Recent experimental and theoretical studies have revealed several proteins containing sequence segments that are unfolded under physiological conditions. These segments are called disordered regions. They are actively investigated because of their possible involvement in various biological processes, such as cell signaling, transcriptional and translational regulation. Additionally, disordered regions can represent a major obstacle to high-throughput proteome analysis and often need to be removed from experimental targets. The accurate prediction of long disordered regions is thus expected to provide annotations that are useful for a wide range of applications.
    Results: We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), a support vector machine (SVM)-based method for predicting long disordered regions using 10 kinds of simple physico-chemical properties of amino acids. POODLE-L assembles the output of 10 two-level SVM predictors into a final prediction of disordered regions. The performance of POODLE-L for predicting long disordered regions, which exhibited a Matthews correlation coefficient of 0.658, was the highest when compared with eight well-established publicly available disordered region predictors.
    Availability: POODLE-L is freely available at http://mbs.cbrc.jp/poodle/poodle-l.html
    Contact: hirose-shuichi@aist.go.jp
    Supplementary information: Supplementary data are available at Bioinformatics online.
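
The Matthews correlation coefficient reported in this abstract (and in several others on this page) is computed from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard formula, not code from the paper; the function name and the convention of returning 0.0 for an empty row or column are assumptions made here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn count positive (e.g. disordered) residues predicted correctly/incorrectly,
    tn/fp count negative (e.g. ordered) residues predicted correctly/incorrectly.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined cases (an empty row or column) map to 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and unlike plain accuracy it stays informative when one class is much larger than the other, which is typical for disorder prediction.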

  • Predicting mostly disordered proteins by using structure-unknown protein data

    Kana Shimizu, Yoichi Muraoka, Shuichi Hirose, Kentaro Tomii, Tamotsu Noguchi

    BMC BIOINFORMATICS   8 ( 1 ) 78  2007.03  [Refereed]

    Background: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We know many more protein sequences than we do protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus it would be efficient to use the information of structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using spectral graph transducer and training with a huge amount of structure-unknown sequences as well as structure-known sequences.
    Results: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs. When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052 - 0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036 - 0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5% - 10% disordered sequences, 1.46% for the proteins with 10% - 20% disordered sequences and 16.57% for proteins with 20% - 40% disordered sequences.
    Conclusion: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness.

    DOI PubMed
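The abstract above reports sensitivity, specificity and the Matthews correlation coefficient (MCC). As a reading aid, here is a minimal sketch of how these three metrics are computed from a binary confusion matrix. The function name `confusion_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) for a binary confusion matrix.

    tp/fn count disordered proteins predicted correctly/incorrectly,
    tn/fp count ordered proteins predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # fraction of positives that were found
    specificity = tn / (tn + fp)   # fraction of negatives that were found
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
sens, spec, mcc = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
print(sens, spec, mcc)
```

Unlike accuracy, MCC stays informative when the classes are imbalanced, which is the situation in the evaluation set described above (82 disordered versus 526 ordered proteins).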

  • Angle: A sequencing errors resistant program for predicting protein coding regions in unfinished cDNA

    Kana Shimizu, Jun Adachi, Yoichi Muraoka

    Journal of Bioinformatics and Computational Biology   4 ( 3 ) 649 - 664  2006.06  [Refereed]

    In the process of making full-length cDNA, predicting protein coding regions helps both in the preliminary analysis of genes and in any succeeding process. However, unfinished cDNA contains artifacts including many sequencing errors, which hinder the correct evaluation of coding sequences. Predictions for short sequences are especially difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low-quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach that represents coding sequences more effectively by making maximal use of the limited information in the input sequences. Our method utilizes not only codon usage but also protein structure information, which is difficult to use in stochastic model-based algorithms, and optimizes the use of limited information from a short segment when deciding coding potential, with the result that predictive accuracy does not depend on the length of an input sequence. The performance of ANGLE is compared with ESTSCAN on four datasets, each with a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one dataset that has no errors. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short-sequence dataset (< 1000 bases). On the long-sequence dataset, ANGLE achieves comparable performance. © 2006 Imperial College Press.

    DOI PubMed

  • Feature selection based on physicochemical properties of redefined N-term region and C-term regions for predicting disorder

    K Shimizu, Y Muraoka, S Hirose, T Noguchi

    PROCEEDINGS OF THE 2005 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY     262 - 267  2005  [Refereed]

    The prediction of intrinsic disorder from amino acid sequence has been gaining increasing attention because disordered regions have come to be known as important for protein function. The most common way of predicting disorder is based on binary classification with machine learning. Since amino acid composition has different propensities in the N-term, C-term, and internal regions, the accuracy of prediction increases by dividing training data into these three regions and predicting them separately. However, previous work has lacked discussion about a concrete definition of the N-term and C-term regions, and has only used a heuristic length from the terminal. Other previous work has shown that general physicochemical properties rather than specific amino acids are important factors contributing to disorder, and that a reduced amino acid alphabet can maintain excellent precision in predicting disorder. In this paper, we redefine a suitable length and position for the N-term and C-term regions for predicting disorder. Moreover, we show that each region has different physicochemical properties, which are important factors contributing to disorder. We also suggest a region-specific reduced set of amino acids and a modified PSSM based on that set for predicting disorder. We implemented our method and (1) compared it with the conventional division method and (2) compared our feature selection with all physicochemical features, on the CASP6 benchmark, a PDB dataset, and DisProt. The results support the effectiveness of the new data separation method and indicate that each region has different physicochemical properties that are important factors for predicting protein disorder.

    DOI

  • A melody-retrieval system on parallelized computers

    Tomonari Sonoda, Toshiya Ikenaga, Kana Shimizu, Yoichi Muraoka

    IFIP Advances in Information and Communication Technology   112   265 - 272  2003  [Refereed]

    This paper describes a method for a WWW-based melody-retrieval system that takes a melody sung by a user and sent over the Internet as a search clue and uses it to retrieve the song's title from a music database of standard MIDI files (SMF). Building a melody-retrieval service with a large database and many user accesses has been difficult because it is hard to build a system that achieves both quick search and high matching accuracy. We propose a method for a scalable melody-retrieval system that achieves 70% matching accuracy against more than 20,000 pieces of music with a search time of a few seconds. © 2003 by Springer Science+Business Media New York.

    DOI

  • The design method of a melody retrieval system on parallelized computers

    T Sonoda, T Ikenaga, K Shimizu, Y Muraoka

    SECOND INTERNATIONAL CONFERENCE ON WEB DELIVERING OF MUSIC, PROCEEDINGS     66 - 73  2002  [Refereed]

    This paper describes the design method of a WWW-based melody-retrieval system which takes a sung melody as a search clue and retrieves the music title from a music database of standard MIDI files (SMF) over the Internet. The most important thing in building a melody-retrieval system on the Internet is to achieve both high matching accuracy and quick search. It was, however, quite difficult to fulfill these two conditions simultaneously because the matching process took a long time. We propose the design of a system which consists of parallelized melody-retrieval servers for building a high-performance service on the Internet.

    DOI

Books and Other Publications

Misc

  • Efficient Two-level Homomorphic Encryption based on Pairings

    Nuttapong Attrapadung, Goichiro Hanaoka, Shigeo Mitsunari, Yusuke Sakai, Kana Shimizu, Tadanori Teruya

    Symposium on Cryptography and Information Security 2018 (SCIS 2018)   1A2 ( 4 ) 1 - 8  2018.01

    Research paper, summary (national, other academic conference)  

    A demonstration is available at https://herumi.github.io/she-wasm/she-demo.html .

  • Client-aided secure computation and basic tools (クライアント補助型秘匿計算および基本ツール)

    森田 啓, 大畑 幸矢, Nuttapong Attrapadung, 縫田 光司, 山田 翔太, 清水 佳奈, 花岡 悟一郎, 浅井 潔

    Proceedings of the 2018 Symposium on Cryptography and Information Security (SCIS2018)    2018.01

    Research paper, summary (national, other academic conference)  

  • Fast privacy-preserving genome search using fully homomorphic encryption (完全準同型暗号を用いた高速なゲノム秘匿検索)

    石巻 優, 清水 佳奈, 縫田 光司, 山名 早人

    Proceedings of the 2016 Symposium on Cryptography and Information Security (SCIS2016)    2016.01

    Research paper, summary (national, other academic conference)  

  • Privacy-Preserving Search for Chemical Compound Databases

    Kana Shimizu, Koji Nuida, Hiromi Arai, Shigeo Mitsunari, Nuttapong Attrapadung, Michiaki Hamada, Koji Tsuda, Takatsugu Hirokawa, Jun Sakuma, Goichiro Hanaoka, Kiyoshi Asai

    bioRxiv   ( 013995 )  2015.01

    Internal/External technical report, pre-print, etc.  

    DOI

  • An oblivious transfer protocol designed for genome privacy protection (ゲノムプライバシ保護を考慮した紛失通信プロトコル)

    照屋 唯紀, 縫田 光司, 清水 佳奈, 花岡 悟一郎

    Proceedings of the 2015 Symposium on Cryptography and Information Security (SCIS2015)    2015.01

    Research paper, summary (national, other academic conference)  

  • An efficient privacy-preserving database search protocol for range queries (範囲指定型問い合わせに対する効率的なデータベース秘匿検索プロトコル)

    Tadanori Teruya, Nuttapong Attrapadung, Masaki Inamura, Matsuda Takahiro, Sanami Nakagawa, Koji Nuida, Goichiro Hanaoka, Kana Shimizu

    Computer Security Symposium 2014 (CSS 2014), Demonstration (Poster)   ( DPS-02 )  2014.10

    Other  

    Awarded 1 / 8 demonstrations.

    For our demonstration, we used the library https://github.com/aistcrypt/Lifted-ElGamal (a newer version of this library is currently available at https://github.com/herumi/mcl).

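  • An efficient chemical compound database search protocol that keeps both parties' information private (双方向の情報を秘匿可能な効率的化合物データベース検索プロトコル)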

    縫田光司, 照屋唯紀, 花岡悟一郎, 清水佳奈, 松田隆宏, 矢内直人, 中川紗奈美

    Computer Security Symposium 2013 (CSS 2013), Demonstration (Poster) Session   ( DPS-07 )  2013.10

    Other  

    Awarded 1 / 9 demonstrations.

    For our demonstration, we used the library https://github.com/aistcrypt/Lifted-ElGamal (a newer version of this library is currently available at https://github.com/herumi/mcl).

  • Tomosaga: Safely finding mutual friends on a smartphone (トモサガ: スマホで安心「共ダチ」探し)

    照屋唯紀, 縫田光司, 花岡悟一郎, 清水佳奈, 松田隆宏, 矢内直人, 中川紗奈美

    Computer Security Symposium 2013 (CSS 2013), Demonstration (Poster) Session    2013.10

    Other  

    For our demonstration, we used the library https://github.com/aistcrypt/Lifted-ElGamal (a newer version of this library is currently available at https://github.com/herumi/mcl).

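  • A privacy-preserving search protocol for chemical compound databases using additively homomorphic encryption (加法準同型暗号を用いた化合物データベースの秘匿検索プロトコル)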

    縫田光司, 清水佳奈, 荒井ひろみ, 浜田道昭, 津田宏治, 広川貴次, 花岡悟一郎, 佐久間淳, 浅井潔

    Proceedings of the Computer Security Symposium 2012 (CSS2012)    2012.10

    Research paper, summary (national, other academic conference)  

    J-GLOBAL

  • Privacy protection in search behavior (検索行動におけるプライバシ保護)

    荒井ひろみ, 清水佳奈, 浜田道昭, 津田宏治, 広川貴次, 佐久間淳, 浅井潔

    Proceedings of the Annual Conference of the Japanese Society for Artificial Intelligence (CD-ROM)   26th   Paper No. 3I2-OS-20-1  2012

    J-GLOBAL

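  • Establishment of a comprehensive verification system for intrinsically disordered protein sequences using the PRESAT-vector (PRESAT‐vectorを用いた天然変性タンパク質配列の網羅的検証系の確立)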

    合田名都子, 清水佳奈, 桑原陽太, 天野剛志, 池上貴久, 太田元規, 廣明秀一

    Program and Abstracts of the Annual Meeting of the Protein Science Society of Japan   10th   68  2010.05

    J-GLOBAL

Awards

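  • Excellent Poster Award, 2021 Annual Meeting of the Japanese Society for Bioinformatics / 10th Informatics in Biology, Medicine and Pharmacology joint conference (IIBMP2021)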

    2021  

  • Encouragement Award, Computer Security Symposium 2019 (CSS2019)

    2019  

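  • FY2018 Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology: Prize for Science and Technology (Research Category)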

    2018.04  

  • Research Encouragement Award, Informatics in Biology, Medicine and Pharmacology (IIBMP) 2016

    2016.10  

  • FY2015 AIST President's Award (Research)

    2016.04  

  • Research Encouragement Award, Informatics in Biology, Medicine and Pharmacology (IIBMP) 2015

    2015.10  

  • Best Oral Presentation Award, Informatics in Biology, Medicine and Pharmacology (IIBMP) 2015

    2015.10  

  • Excellent Demonstration Award, Computer Security Symposium 2014 (CSS2014)

    2014.10  

  • Excellent Demonstration Award, Computer Security Symposium 2013 (CSS2013)

    2013.10  

  • Best Poster Award, Informatics in Biology, Medicine and Pharmacology (IIBMP) 2012

    2012.10  

Research Projects

  • Large-scale data processing by compressed secure computation (圧縮秘匿計算による大規模データ処理)

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (S)

    Project Year :

    2021.07
    -
    2026.03
     

  • Development of privacy-preserving genome information analysis technology (プライバシ保護ゲノム情報解析技術の開発)

    Project Year :

    2019.04
    -
    2022.03
     

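    Handling personal genome data, which has grown explosively in recent years, carries a high privacy risk, so there is strong demand for methodologies that aggregate data safely and effectively and discover useful knowledge. Against this background, this project first clarifies which parts of genomic information constitute personal privacy, and then studies methodologies for analyzing the information while the parts that must be kept secret remain encrypted. In particular, the project takes genome sequence search and genome-wide association studies as its two central problems and develops methods that allow large-scale data analysis to be carried out securely.

    Because of this privacy risk, useful data are frequently locked inside individual organizations and become isolated, a phenomenon known as siloing. When genomic information is analyzed with statistics or machine learning, richer data types and larger sample sizes yield more accurate results, so methodologies that aggregate siloed data safely and effectively are strongly desired. The aim of this project is therefore to study methodologies for analyzing genomic information while keeping it secret, focusing on (1) genome sequence search and (2) genome-wide association studies.

    In FY2019 the following progress was made. (1) Based on secret sharing, we devised a cryptographic protocol for privacy-preserving full-text search, useful for analyzing genome sequences and medical documents, and implemented a prototype. Thanks to a carefully designed precomputation, the time complexity, communication volume and number of rounds required for the online computation, from submitting a query to obtaining the search result, depend only on the query length and not on the database length. In typical information retrieval the query is far shorter than the database, so the method runs very fast even on data as large as a genome database. Experiments with the prototype confirmed that a search against a genome database of length 10 million takes about 10 seconds even in a real Internet environment. (2) We also developed a system that analyzes personal genome data using Intel SGX, one of the technologies that realize a Trusted Execution Environment (TEE), and implemented a prototype of this information analysis platform. The system can perform genome-wide association studies (GWAS) and data clustering, and by using Oblivious RAM, which hides data access patterns, it can also access huge datasets quickly. Users write their analyses in a programming language such as JavaScript, and a virtual machine deployed inside an enclave on the server executes them without leaking information to the server side. Experiments with genome variation data from more than 200 individuals confirmed that analyses complete in about the same time as with software that provides no information protection.

    As described above, for the elemental technologies that are important for realizing large-scale genome data analysis, we have progressed from devising the basic methodology to implementing prototypes, and the project is proceeding according to the original plan. Since progress so far has been largely smooth, research in FY2020 will continue to follow the original plan. For genome sequence search, we aim for an efficient implementation that includes the communication part of secret sharing, and will consider further sophistication and efficiency improvements of the privacy-preserving full-text search algorithm. For genome-wide association studies, we will consider further enhancements, such as protecting the output privacy of the TEE-based information analysis system.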
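The project summary above names secret sharing as the basis of its full-text search protocol. The following is a generic sketch of additive secret sharing, included only to illustrate the building block. It is not the project's protocol, and the modulus, function names and two-party default are assumptions made for this example.

```python
import secrets

# Illustrative modulus (an assumption); any agreed-upon modulus would do.
MODULUS = 2**61 - 1


def share(value, n_parties=2, modulus=MODULUS):
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def reconstruct(shares, modulus=MODULUS):
    """Recover the secret by adding all shares modulo `modulus`."""
    return sum(shares) % modulus


def add_shares(a, b, modulus=MODULUS):
    """Each party adds its own two shares locally; no communication needed."""
    return [(x + y) % modulus for x, y in zip(a, b)]


a_shares = share(5)
b_shares = share(7)
print(reconstruct(add_shares(a_shares, b_shares)))  # sum of the two secrets
```

Each individual share is uniformly random, so a single party learns nothing about the secret, while sums can be computed share-wise. Operations beyond addition, such as the comparisons a search needs, require interaction between the parties and are outside this sketch.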

  • Research on precise cancer genome sequence analysis methods using graphs (グラフを用いた精密な癌ゲノム配列解析法の研究)

    Kayamori Foundation of Informational Science Advancement  Research Grant

    Project Year :

    2021
    -
    2022
     

  • Development of privacy-preserving technology to promote medical information analysis (医療情報解析を促進するプライバシ保護技術の開発)

    The Okawa Foundation for Information and Telecommunications  FY2017 (31st) Research Grant

    Project Year :

    2018.03
    -
    2019.03
     

  • Privacy-preserving genome information analysis for realizing personalized medicine (個別化医療を実現するプライバシ保護ゲノム情報解析)

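    Japan Science and Technology Agency / Japan Agency for Medical Research and Development  Strategic International Cooperative Program (SICP), Japan-Finland (Tekes/AF) Research Exchange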

    Project Year :

    2014.05
    -
    2017.03
     

  • De novo approach to find differentially appearing genome sequence patterns from the two NGS datasets.

    Project Year :

    2014.04
    -
    2017.03
     

    High-throughput sequencing technology makes it possible to determine various genomes of the same species. Given such a variety of genomes, it is more natural to consider all of these variations. However, the majority of analysis methods first conduct mapping against only a single reference genome, which leads to the loss of important information through mis-mapping. To capture the features of individual datasets, we developed a new approach to analyzing NGS data that compares two different NGS datasets directly and discovers sequence patterns that appear in one of the two datasets but not in the other. The proposed approach can be applied to various problems, such as finding breakpoints in cancer genomes.

  • Development of basic technology for privacy-preserving bioinformatics and its application

    Project Year :

    2013.04
    -
    2016.03
     

    There is strong demand to handle personal genome and chemical compound information confidentially, because it is sensitive information that should not be leaked. On the other hand, from the viewpoint of "open" science, it is important to perform data mining by combining such sensitive information with other data. In this study, we developed several methods for performing data mining while keeping that information secret. Specifically, we developed (i) privacy-preserving search for chemical databases, (ii) privacy-preserving genome sequence search with a hidden Markov model (HMM) and (iii) privacy-preserving sequence alignment, all of which will be useful for open science in biology.

  • Development of ultra-fast comparison method of protein local structures towards the prediction of ligand-binding sites

    Project Year :

    2011.04
    -
    2014.03
     

    In this study, we have developed a novel method that can perform a large-scale comparison of protein-ligand binding sites by utilizing a coarse-grained representation of binding sites and a fast sorting algorithm. Using the method, we conducted a large-scale all-pairs similarity search for both known and potential binding sites in the PDB, and constructed the database, called PoSSuM, that listed the discovered similar site pairs. PoSSuM has grown to provide information related to 49 million pairs of similar binding sites discovered among 5.5 million known and putative binding sites. For pharmaceutical applications, such as predictions of side effects and drug repositioning, we have provided a new database, PoSSuMds (PoSSuM drug search) to catalog detected known and potential binding sites for approved drug compounds whose assay data were retrieved from ChEMBL

  • Developing fast algorithm for analyzing Giga-sequence data

    Project Year :

    2010.04
    -
    2012.03
     

    Next-generation sequencing (NGS) technology calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of data. In this study, we designed and implemented SlideSort, an exact algorithm that finds all similar pairs whose edit distance does not exceed a given threshold in NGS data, which helps many important analyses, such as de novo genome assembly, identification of frequently appearing sequence patterns and accurate clustering. In comparison to state-of-the-art methods, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional single-link clustering function, which is useful in summarizing NGS data for further processing.
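To make the problem statement in the summary above concrete, here is a naive baseline that finds all pairs of sequences within a given edit-distance threshold. This is explicitly not the SlideSort algorithm: it compares every pair with a textbook dynamic program and would not scale to millions of reads. The function names and example reads are illustrative assumptions.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def similar_pairs(seqs, max_dist):
    """Return index pairs (i, j), i < j, whose edit distance is <= max_dist."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(seqs), 2)
            if edit_distance(s, t) <= max_dist]


reads = ["ACGT", "ACGA", "TTTT"]  # hypothetical toy reads
print(similar_pairs(reads, max_dist=1))
```

The baseline performs a quadratic number of comparisons, which is the cost that an exact all-pairs method must avoid in order to scale to tens of millions of sequences as described above.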

Presentations

  • Privacy-aware computational genomics

    Kana Shimizu  [Invited]

    SPIEZ Convergence 2018  (Spiez) 

    Presentation date: 2018.09

  • Privacy-preserving genome sequence search

    Kana Shimizu  [Invited]

    2016 International Workshop on Spatial and Temporal Modeling from Statistical, Machine Learning and Engineering perspectives (STM2016)  (Tokyo) 

    Presentation date: 2016.07

  • Efficient Privacy-Preserving String Search and an Application in Genomics

    Kana Shimizu

    High Throughput Sequencing Algorithms & Applications (HitSeq 2015), A SIG of ISMB/ECCB 2015  (Dublin) 

    Presentation date: 2015.07

  • Privacy Preserving Similarity Search in Biomedical Data by Homomorphic encryption

    Kana Shimizu

    Biological Data Science  (Cold Spring Harbor) 

    Presentation date: 2014.11

  • Next generation sequencing data analyses by using ultra-fast all pairs similarity search

    Kana Shimizu  [Invited]

    International Symposium on Single Biomolecule Analysis 2013  (Kyoto) 

    Presentation date: 2013.11

  • Privacy-preserving search for a chemical compound database

    Kana Shimizu  [Invited]

    ISMB/ECCB 2013 Oral Poster Presentations Track  (Berlin) 

    Presentation date: 2013.07

  • Privacy-preserving search for a chemical compound database

    Kana Shimizu

    ISCB-Asia/SCCG 2012  (Shenzhen) 

    Presentation date: 2012.12

  • SlideSort: Fast and exact algorithm for Next Generation Sequencing data analysis

    Kana Shimizu

    ISMB/ECCB 2011 Highlights Track  (Boston) 

    Presentation date: 2011.07

Specific Research

  • Development of privacy-preserving genome information analysis technology (プライバシ保護ゲノム情報解析技術の開発)

    2018  

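    We examined in detail the functions and performance that technologies for protecting genomic information processing should provide. We also developed a secure computation protocol for decision graphs and, after repeated study of how to apply homomorphic encryption that permits only a single multiplication, implemented a genome information search application.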

  • Development of privacy-preserving genome information analysis technology (プライバシ保護ゲノム情報解析技術の開発)

    2017  

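    We developed the privacy-preserving technology needed when searching databases that contain genomic information. In this study, we developed cryptographic protocols based on homomorphic encryption with which a user and a database carry out the intended data analysis without disclosing their information to each other. Specifically, we developed a protocol for genome-wide association studies by logistic regression and a protocol for classification with a trained decision tree.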

  • Development of privacy-preserving genome information analysis technology using cryptography (暗号技術を用いたプライバシ保護ゲノム情報解析技術の開発)

    2016  

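    When both the query and the database contain private information in a database search, it is difficult to protect the privacy of both at the same time. To solve this problem, this study developed a privacy-preserving search technique that presents only the search results to the user while keeping the contents of the data hidden. The proposed method can perform privacy-preserving substring search, runs fast even when the alphabet is large, and was 10 to more than 100 times faster than conventional methods. The results are expected to help make database search more secure.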

  • Research on identifying gene regions in cDNA (cDNAにおける遺伝子領域の特定に関する研究)

    2003  

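    With the arrival of the post-sequencing era, the need for genome information analysis is growing. Genomic information is redundant, and only a small fraction of what a sequencer reads is involved in biological function. To make use of genomic information in drug discovery, gene therapy, breeding and similar fields, it is therefore first necessary to identify gene regions within large amounts of data and to analyze protein function. Against this background, this study aimed to predict protein-coding regions from cDNA sequences. Previous work on identifying protein-coding regions in cDNA bases its predictions on codon usage statistics, such as the frequencies of consecutive codons. As a result, it cannot maintain prediction accuracy for sequences with biased codon usage. Genomic information has many exceptions, and sequences with biased codon usage are numerous. Robust prediction requires information from many kinds of biological knowledge, but most previous work uses probabilistic models such as hidden Markov models, which makes it difficult to use pieces of biological knowledge that are stochastically dependent on one another at the same time. In contrast, this study proposed a method that can combine many pieces of biological knowledge considered useful, in addition to codon usage, for prediction. We implemented the proposed method and evaluated it on benchmark data, obtaining better accuracy than previous work. The system implemented in this study can also be run from the web and is scheduled to be made publicly available on the web shortly. The results of this study can be applied not only to cDNA but also to exon prediction in DNA. We are currently modifying the system for DNA prediction and continuing the research so that it can contribute more broadly.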

 

Syllabus

 

Committee Memberships

  • 2021.04
    -
    Now

    Japanese Society for Bioinformatics  Director

  • 2021
    -
    Now

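    Japan Science and Technology Agency (JST) PRESTO research area 「社会変革に向けたICT基盤強化」 (strengthening ICT foundations for social transformation)  Research Area Advisor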

  • 2021
    -
    Now

    International Society for Computational Biology (ISCB) EDI Committee  Committee member

  • 2020.07
    -
    Now

    Japan Science and Technology Agency (JST) NBDC "Database Integration Coordination Program"  Research Advisor

  • 2020
    -
    Now

    Tokyo High Court  Expert Commissioner

  • 2019.06
    -
    2021.06

    Information Processing Society of Japan  Director

  • 2017
    -
    2018

    Japanese Society for Bioinformatics  Auditor

  • 2011
    -
    2018

    International Society for Computational Biology (ISCB)  The affiliates committee

  • 2010
    -
    2011

    Japanese Society for Bioinformatics  Secretary

  • 2010
    -
    2011

    Japanese Society for Bioinformatics  Director
