Data-intensive analysis of HIV mutations

Ozahata et al. BMC Bioinformatics (2015) 16:35
DOI 10.1186/s12859-015-0452-0

RESEARCH ARTICLE Open Access

Mina Cintho Ozahata1*, Ester Cerdeira Sabino2, Ricardo Sobhie Diaz3, Roberto M Cesar-Jr1 and João Eduardo Ferreira2

Abstract

Background: In this study, clustering was performed using a bitmap representation of HIV reverse transcriptase and protease sequences, to produce an unsupervised classification of HIV sequences. The classification will aid our understanding of the interactions between mutations and drug resistance. 10,229 HIV genomic sequences from the protease and reverse transcriptase regions of the pol gene and antiretroviral resistance-related mutations represented in an 82-dimensional binary vector space were analyzed.

Results: A new cluster representation was proposed using an image inspired by microarray data, such that the rows in the image represented the protein sequences from the genotype data and the columns represented presence or absence of mutations in each protein position. The visualization of the clusters showed that some mutations frequently occur together and are probably related to an epistatic phenomenon.

Conclusion: We described a methodology based on the application of a pattern recognition algorithm using binary data to suggest clusters of mutations that can easily be discriminated by cluster viewing schemes.

Keywords: HIV, Mutation, Cluster

Background

The human immunodeficiency virus (HIV) shows extensive genetic variability that helps the selection of drug resistance mutations in response to antiretroviral therapy. Hence, it is important to understand the relationship between HIV genotype and phenotype (i.e., drug resistance) to increase the probability of treatment success.
*Correspondence: mina.cintho@usp.br
1 Department of Computer Science - DCC, University of São Paulo, Rua do Matão, 1010, CEP 05508-090 São Paulo, SP, Brazil
Full list of author information is available at the end of the article

To infer phenotypic resistance from HIV genomic sequences of infected patients who failed antiretroviral therapy, different groups have developed look-up tables [1,2] and rule-based systems [3,4]. In Brazil, a look-up table [2] was developed and used by the Brazilian Ministry of Health AIDS program to help the decision-making process for antiretroviral salvage therapy (http://algoritmo.aids.gov.br/). In Brazil, patients who fail antiretroviral therapy receive genotype tests for antiretroviral resistance through a network of laboratories [5]. This collection of HIV genomic sequences represents the variability of the HIV population in this country.

With this extensive amount of data, questions arise as to whether it is possible to classify the sequences, based on the occurrences of resistance-related mutations in the different amino acid positions, and whether it is possible to achieve a classification that can express current knowledge of the relationship between mutations and drug resistance. One possible way to answer these questions is to apply clustering algorithms on reverse transcriptase and protease sequences, to obtain clusters containing sequences that are similar. This similarity among the sequences may reveal some of the relationships among the mutations related to antiretroviral resistance. Nonetheless, extraction of a simple and compact representation of the dataset is complex because of the number and size of sequences. The clusters thus generated may provide a representation that contributes to the understanding of the classification and the relationships between mutations.
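As a rough illustration of the idea in the preceding paragraph, the sketch below encodes each sequence as a binary vector over a fixed list of resistance-related positions and groups the vectors with a simple greedy, leader-style clustering on Hamming distance. The shortened position list, the toy sequences, the distance threshold and all function names are assumptions made for illustration; this is not the clustering algorithm used in the study.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical, shortened position list; the study itself used 82 positions.
POSITIONS: List[str] = ["PR30", "PR88", "RT41", "RT67", "RT70", "RT210", "RT215", "RT219"]


def binarize(mutated: Set[str], positions: Sequence[str] = POSITIONS) -> List[int]:
    """Return 1 for each position carrying a resistance-related mutation, else 0."""
    return [1 if p in mutated else 0 for p in positions]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def leader_cluster(vectors: List[List[int]], max_dist: int = 1) -> List[int]:
    """Greedy clustering: join the first cluster whose leader is within
    max_dist, otherwise start a new cluster with this vector as leader."""
    leaders: List[List[int]] = []
    labels: List[int] = []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if hamming(v, leader) <= max_dist:
                labels.append(idx)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


# Toy input (invented for illustration, not taken from the dataset).
samples: Dict[str, Set[str]] = {
    "seq1": {"PR30", "PR88"},
    "seq2": {"PR30", "PR88", "RT41"},
    "seq3": {"RT41", "RT210", "RT215"},
    "seq4": {"RT67", "RT70", "RT219"},
    "seq5": {"RT41", "RT215"},
}

matrix = [binarize(m) for m in samples.values()]
labels = leader_cluster(matrix, max_dist=1)

# Print rows grouped by cluster, one column per position.
for cluster in sorted(set(labels)):
    for name, row, lab in zip(samples, matrix, labels):
        if lab == cluster:
            print(cluster, name, "".join(map(str, row)))
```

Printing the rows grouped by cluster label gives a crude text version of a matrix view in which each row is a sequence and each column a position. A greedy leader scheme depends on input order and on the chosen threshold, so it is only meant to show the data flow from mutations to binary vectors to groups.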
In the present study, a pipeline (see Figure 1) was introduced to represent clusters inspired by microarray data, in which extensive amounts of data are available. Microarray data were used as inspiration because such applications typically contain large volumes of information on gene patterns from thousands of genes at once. Thus, clusters were represented in an image corresponding to a matrix, such that the rows in the image represented each protein sequence and the columns indicated the presence or absence of resistance-related mutations. This image enabled us to summarize the dataset without losing any information about clustering, permitting the observation of important characteristics of each cluster and enabling cluster comparison, thus providing insights into the data.

Figure 1 Pipeline summarizing the proposed framework. 1) Protease and reverse transcriptase sequences were gathered from patients from all over Brazil, 2) binarization of the sequences, 3) clustering of the mutations, 4) characterization of the clusters and 5) comparison with the Brazilian look-up-table predictions.

© 2015 Ozahata et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Previous studies have attempted to identify common protease and reverse transcriptase mutation patterns [6-15] (as shown in Tables 1, 2 and 3). However, many previous studies searched only for pairs of mutations and therefore could not find larger mutation patterns, which are known to exist [11,16-21].
Furthermore, often only subtype B virus sequences are used, although mutations occur with different probabilities in the different subtypes [22]. Also, some of the previous works use only a small number of protein positions. Consequently, not all mutation patterns in the data are found and it is more difficult to compare results. Finally, the small datasets used in some of the related works do not represent all of the virus population variability and therefore also miss mutation patterns. As a result, there is no clear consensus on which are the important mutation patterns that arise in the protein sequences.

Nonetheless, some patterns have been reported in previous works, such as the simultaneous presence of mutations at positions 30 and 88 of the protease [7,9-12,23], selected by nelfinavir [24]. The same applies to thymidine analog mutations (TAMs) in reverse transcriptase, which can be divided into TAM1 and TAM2 profiles [11,16-21]. The TAM1 profile presents mutations at codons 41, 210 and 215, whereas TAM2 presents mutations at codons 67, 70 and 219.

Such studies on mutation patterns are important because the co-existence of mutations may result in different antiretroviral resistance profiles. For example, one mutation can restore the fitness lost through another mutation that confers drug resistance. However, some of the previous studies only investigated pairs of mutations, and most of them only analyzed subtype B HIV-1 sequences. Moreover, previous studies analyzed specific mutation profiles, making it difficult to compare results between different studies. Thus, mutation patterns have not been fully characterized in the protease and reverse transcriptase sequences. Characterization of these patterns may lead to a better understanding of the interactions among these mutations and to classification of the sequences.
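The TAM1 and TAM2 profiles mentioned above can be written down as sets of codons. The following minimal sketch labels a sequence by the profile whose codons it carries. The rule that all three codons of a profile must be mutated, and the function name `tam_profile`, are assumptions made for illustration rather than a definition taken from the cited studies.

```python
from typing import Set

# Codon sets as stated in the text.
TAM1 = frozenset({41, 210, 215})
TAM2 = frozenset({67, 70, 219})


def tam_profile(mutated_rt_codons: Set[int]) -> str:
    """Label a sequence by the TAM profile(s) it fully contains.

    Assumption: a profile counts as present only when all three of its
    codons are mutated; real classification schemes may be more lenient.
    """
    has_tam1 = TAM1 <= mutated_rt_codons  # subset test
    has_tam2 = TAM2 <= mutated_rt_codons
    if has_tam1 and has_tam2:
        return "TAM1+TAM2"
    if has_tam1:
        return "TAM1"
    if has_tam2:
        return "TAM2"
    return "none"


# Toy inputs (invented for illustration).
print(tam_profile({41, 184, 210, 215}))
print(tam_profile({67, 70, 219}))
print(tam_profile({41, 67}))
```

A set-based check like this only detects predefined profiles. Finding patterns that are not known in advance is the motivation for clustering the full binary vectors.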
In the present study, a large number of codons (38 from reverse transcriptase and 44 from protease, as shown in Table 4) from subtypes B, C and F were clustered, and the sequences were classified according to the mutation patterns. These clusters were compared with clusters reported in other studies.

Look-up tables and rule-based systems

Based on genotype-phenotype correlation studies on laboratory HIV-1 isolates, genotype-phenotype correlations on clinical isolates and genotype-treatment history correlations [25], some efforts have been made to try to understand the relationship between HIV genotype and phenotype. For example, look-up tables [1,2,26] have been compiled using information from the scientific literature,

Table 1 Related works

Author: Liu et al. 2008 [7]
Proteins: Protease
Drugs: PI
Protein positions: PR1 to PR99
Number of sequences: 7758+8761 (Subtype B and non-Subtype B)
Method: k-way clustering
Mutation patterns:
  (PR30 PR75 PR88)
  (PR1–PR9 PR12–PR15 PR17 PR19 PR20 PR22 PR25 PR26 PR28 PR31 PR35–PR42 PR45 PR49 PR52)
  (PR56 PR57 PR59 PR61 PR65 PR68–PR70 PR77 PR83 PR87 PR89 PR96–PR99)
  (PR1 PR2 PR9 PR26 PR30 PR40 PR45 PR56 PR59 PR75 PR81 PR88 PR98)
  (PR13–PR15 PR20 PR35–PR38 PR41 PR42 PR49 PR57 PR69 PR70 PR77 PR83 PR89)
  (PR10 PR23 PR24 PR27 PR32–PR34 PR43 PR46–PR48 PR50 PR53–PR55 PR58 PR71 PR76 PR80 PR82)
  (PR30 PR75 PR88)
  (PR1 PR2 PR9 PR26 PR40 PR45 PR59 PR87 PR98)
  (PR13–PR15 PR20 PR35–PR38 PR41 PR49 PR57 PR69 PR70 PR77 PR83 PR89)
  (PR10 PR23 PR24 PR27 PR32–PR34 PR42 PR43 PR46–PR48 PR50 PR53–PR55 PR58 PR71 PR76 PR80 PR82)

Author: Reuman et al. 2010 [8]
Proteins: Reverse transcriptase
Drugs: NNRTI
Protein positions: RT90, RT94, RT98, RT100, RT101, RT102, RT103, RT105, RT106, RT108, RT138, RT139, RT178, RT179, RT181, RT188, RT190, RT221, RT223, RT225, RT227, RT230, RT232, RT234, RT236, RT237, RT238, RT241, RT242, RT318
Number of sequences: 13039 (10504 Subtype B, 747 Subtype C, 363 CRF 01_AE, 210 Subtype A, 320 CRF 02_AG, 895 others)
Method: Jaccard similarity coefficient, Holm's correction, Poissoness plot
Mutation patterns:
  (RT101,RT181,RT190)
  (RT103,RT181,RT190)
  (RT108,RT181,RT221)
  (RT98,RT181,RT190)
  (RT181,RT190,RT221)
  (RT103,RT181,RT221)
  (RT103,RT108,RT221)
  (RT101,RT108,RT181)
  (RT101,RT108,RT190)
  (RT103,RT108,RT181)
  (RT108,RT190,RT221)
  (RT98,RT108,RT181)
  (RT98,RT101,RT190)
  (RT98,RT101,RT181)
  (RT101,RT181,RT190)
  (RT101,RT181,RT221)
  (RT98,RT103,RT108)
  (RT101,RT181,RT190)
  (RT108,RT181,RT190)
  (RT98,RT103,RT181)

Author: Wu et al. 2003 [10]
Proteins: Protease
Drugs: PI
Protein positions: PR1 to PR99
Number of sequences: 2244 (Subtype B)
Method: binomial correlation coefficients, PCA
Mutation patterns:
  (PR10 PR63 PR71 PR73 PR90)
  (PR10 PR63 PR71 PR90 PR93)
  (PR10 PR62 PR63 PR90 PR93)
  (PR10 PR62 PR63 PR73 PR90)
  (PR10 PR20 PR71 PR73 PR90)
  (PR10 PR20 PR62 PR73 PR90)
  (PR10 PR46 PR71 PR90 PR93)
  (PR10 (PR30) PR73 PR84 PR90)
  (PR10 (PR30) PR46 PR84 PR90)
  (PR10 PR71 PR73 PR84 PR90)
  (PR10 PR46 PR71 PR84 PR90)
  (PR10 PR24 PR46 PR10 PR46 PR90)
  (PR10 (PR30) PR46 PR54 PR82)
  (PR10 PR48 PR54 PR82)
  (PR10 PR24 PR46 PR54 PR82)
  (PR32 PR46 PR82)
  (PR10 PR46 PR53 PR54 PR71 PR82)
  (PR30 (PR82) PR88)
  (PR13 PR30 PR88)
  (PR30 PR75 PR88)
  (PR10 PR46 PR63 PR71 PR93)
  (PR20 PR36 PR54)
  (PR10 PR20 PR54 PR71)
  (PR63 (PR64) PR71)
  (PR10 PR77 PR93)
  (PR20 PR36 PR62)
  (PR20 PR35 PR36 (PR77))
  (PR15 PR20 PR36 (PR77))
  (PR10 PR24 PR89)
  (PR10 PR73 PR77)
  (PR10 PR20 PR73)

Protease positions are represented by the prefix PR and reverse transcriptase positions by the prefix RT.

Table 2 Related works

Author: Rhee et al. 2004 [9]
Proteins: Protease and Reverse transcriptase
Drugs: PI, NRTI, NNRTI
Protein positions: PR24, PR30, PR32, PR46, PR47, PR48, PR50, PR53, PR54, PR73, PR82, PR84, PR88, PR90, RT41, RT44, RT62, RT65, RT67, RT69, RT70, RT74, RT115, RT116, RT118, RT151, RT184, RT210, RT215, RT219
Number of sequences: 2795 (27 Subtype C, 15 Subtype A, 7 Subtype D, 2746 Subtype B)
Mutation patterns:
  (PR30, PR88)
  (PR46, PR90)
  (PR73, PR90)
  (PR54, PR82, PR90)
  (PR24, PR46, PR54, PR82)
  (PR73, PR84, PR90)
  (PR46, PR54, PR82, PR90)
  (PR84, PR90)
  (PR46, PR88)
  (PR46, PR73, PR90)
  (PR54, PR82)
  (PR46, PR84, PR90)
  (PR46, PR54, PR82, PR90)
  (PR46, PR73, PR84, PR90)
  (PR30, PR88, PR90)
  (PR48, PR54, PR82)
  (PR32, PR46, PR82, PR90)
  (PR24, PR46, PR54, PR82)
  (PR53, PR54, PR82, PR90)
  (PR24, PR46, PR82)
  (PR46, PR82)
  (PR46, PR90)
  (PR30, PR46, PR88)
  (RT41, RT184, RT215)
  (RT41, RT184, RT210)
  (RT41, RT215)
  (RT67, RT70, RT184, RT219)
  (RT70, RT184)
  (RT41, RT210, RT215)
  (RT184, RT215)
  (RT41, RT118, RT184)
  (RT210, RT215)
  (RT41, RT67, RT118, RT210, RT215)
  (RT74, RT184)
  (RT67, RT69, RT70, RT184, RT219)
  (RT41, RT67, RT184, RT210, RT215)
  (RT67, RT70, RT184)
  (RT41, RT184)
  (RT62, RT184)
  (RT41, RT44, RT67, RT118)
  (RT184, RT210, RT215)
  (RT67, RT70, RT184, RT215, RT219)
  (RT67, RT70, RT219)
  (RT67, RT70)
  (RT41, RT184, RT215)
  (RT41, RT118, RT210, RT215)
  (RT41, RT67, RT210, RT215)
  (RT69, RT70)
  (RT41, RT44, RT67, RT118, RT210, RT215)
  (RT41, RT74, RT184, RT215, RT69)
  (RT103 RT181)
  (RT100 RT103)
  (RT103 RT108)
  (RT101 RT190)
  (RT103 RT225)
  (RT103 RT181 RT190)
  (RT103 RT190)
  (RT181 RT190)
  (RT103 RT238)
  (RT101 RT103)
  (RT108 RT181)
  (RT101 RT181 RT190)
  (RT98 RT103)
  (RT103 RT108 RT181)
  (RT103 RT188)
  (RT103 RT230)

Author: Gonzales et al. 2003 [11]
Proteins: Protease and Reverse transcriptase
Drugs: PI, NRTI, NNRTI
Protein positions: RT41, RT62, RT65, RT67, RT69, RT70, RT74, RT75, RT77, RT115, RT116, RT151, RT184, RT210, RT215, and RT219; PR24, PR30, PR32, PR46, PR47, PR48, PR50, PR53, PR54, PR73, PR88, PR82, PR84, and PR90
Number of sequences: 487 (Subtype B)
Method: Fisher's exact test, Benjamini-Hochberg, K-medoids
Mutation patterns:
  (RT41,RT184,RT215)
  (RT41,RT184,RT210,RT215)
  (RT67,RT70,RT215,RT219)
  (RT41,RT67,RT69,RT210,RT215)
  (RT41,RT67,RT184,RT210,RT215,RT219)
  (RT41,RT67,RT69,RT70,RT184,RT215,RT219)
  (RT65,RT70,RT75,RT77,RT115,RT116,RT151,RT184,RT219)
  (PR54,PR73,PR84,PR90)
  (PR46,PR84,PR90)
  (PR24,PR46,PR54,PR82)
  (PR46,PR54,PR82,PR90)
  (PR48,PR54,PR82)

Author: Sing et al. 2005 [6]
Proteins: Reverse transcriptase
Drugs: NRTI
Protein positions: RT41, RT43, RT44, RT62, RT67, RT69, RT70, RT74, RT75, RT77, RT116, RT118, RT151, RT203, RT208, RT210, RT215, RT215, RT218, RT219, RT219, RT223, RT228, RT228
Number of sequences: 1355
Method: hierarchical clustering, Fisher's exact test
Mutation patterns:
  (RT41,RT210,RT215)
  (RT67,RT70,RT219)

Author: Brehm et al. 2012 [41]
Proteins: Reverse transcriptase
Drugs: NNRTI
Number of sequences: 12 (Subtype C)
Mutation patterns:
  (RT184,RT348)

Protease positions are represented by the prefix PR and reverse transcriptase positions by the prefix RT.

Table 3 Related works

Author: Hoffman et al. 2003 [12]
Proteins: Protease
Drugs: PI
Protein positions: PR10, PR12, PR13, PR14, PR15, PR19, PR20, PR30, PR32, PR35, PR36, PR37, PR41, PR46, PR48, PR54, PR57, PR60, PR62, PR63, PR64, PR69, PR71, PR72, PR73, PR77, PR82, PR84, PR88, PR90, PR93
Number of sequences: 1179 (Subtype B)
Method: Mutual information
Mutation patterns:
  (PR10,PR93) (PR12,PR19) (PR35,PR38) (PR63,PR64)
  (PR37,PR41) (PR62,PR71) (PR71,PR77) (PR71,PR93)
  (PR77,PR93) (PR12,PR19) (PR15,PR77) (PR20,PR36)
  (PR30,PR88) (PR35,PR36) (PR35,PR37) (PR36,PR62)
  (PR36,PR77) (PR46,PR82) (PR46,PR84) (PR48,PR54)
  (PR48,PR82) (PR54,PR82) (PR63,PR64) (PR63,PR90)
  (PR77,PR93) (PR84,PR90) (PR73,PR90)

Author: Alteri et al. 2009 [13]
Proteins: Reverse transcriptase
Drugs: PI, NRTI, NNRTI
Protein positions: RT41, RT65, RT67, RT69, RT70, RT74, RT75, RT77, RT100, RT101, RT103, RT106, RT115, RT116, RT151, RT181, RT184, RT188, RT190, RT210, RT215, RT219, RT225, RT230, RT236
Number of sequences: 213 (Subtype B)
Method: Binomial correlation coefficient, Benjamini-Hochberg method
Mutation patterns:
  (RT215,RT41,RT210)
  (RT60,RT103)

Author: Doherty et al. 2011 [14]
Proteins: Protease
Drugs: PI
Protein positions: PR10, PR24, PR30, PR32, PR33, PR43, PR46, PR47, PR48, PR50, PR53, PR54, PR71, PR73, PR74, PR76, PR82, PR83, PR84, PR88, PR90
Number of sequences: 398
Method: Optimal integer programming-based clustering
Mutation patterns:
  (PR10,PR32,PR33,PR46,PR47,PR54,PR71,PR73,PR84,PR90)
  (PR10,PR33,PR43,PR46,PR54,PR71,PR82,PR84,PR90)
  (PR10,PR24,PR46,PR54,PR71,PR74,PR82)
  (PR32,PR33,PR46,PR53,PR54,PR71,PR84,PR90)
  (PR10,PR30,PR32,PR33,PR46,PR54,PR71,PR84,PR88,PR90)
  (PR10,PR33,PR43,PR46,PR48,PR50,PR54,PR71,PR82)
  (PR10,PR32,PR46,PR71,PR82,PR84)
  (PR10,PR46,PR54,PR82,PR90)
  (PR10,PR48,PR54,PR71,PR73,PR76,PR84,PR90)
  (PR10,PR24,PR32,PR33,PR43,PR46,PR54,PR71,PR82,PR84)
  (PR10,PR24,PR30,PR33,PR43,PR53,PR88)
  (PR10,PR43,PR47,PR48,PR53,PR54,PR71,PR82,PR84)
  (PR10,PR32,PR46,PR47,PR71,PR82,PR90)
  (PR10,PR33,PR54,PR73,PR84,PR90)
  (PR10,PR46,PR71,PR84,PR90)
  (PR10,PR54,PR71,PR73,PR82,PR90)
  (PR10,PR32,PR33,PR47,PR71,PR82,PR90)
  (PR10,PR46,PR54,PR71,PR82,PR90)
  (PR10,PR24,PR33,PR46,PR54,PR71,PR82)
  (PR10,PR48,PR54,PR82,PR90)
  (PR10,PR32,PR43,PR46,PR47,PR82)
  (PR10,PR54,PR71,PR82)
  (PR10,PR46,PR47,PR71,PR88,PR90)
  (PR10,PR33,PR43,PR46,PR73,PR82,PR90)
  (PR10,PR33,PR46,PR50,PR54,PR71,