Quality checks for arrays are performed through RLE and NUSE plots. RLE plots show the deviation of each probeset from its median across all arrays; ideally, these deviations are centered on 0 when most genes are unchanged. NUSE plots show the standard errors estimated from the Probe Level Model fit, standardized so that the median standard error across arrays is 1 for each gene.
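As a rough illustration of the RLE computation (a sketch only, not AMDA's own R code), the following Python example assumes a small matrix of log2 expression values with probesets in rows and arrays in columns. The data, the function names, and the 0.5 flagging threshold are hypothetical.

```python
from statistics import median

def rle(log_expr):
    """Subtract each probeset's median across arrays from its values."""
    result = []
    for row in log_expr:
        row_median = median(row)
        result.append([value - row_median for value in row])
    return result

def array_medians(matrix):
    """Median of each column, i.e., one summary value per array."""
    return [median(column) for column in zip(*matrix)]

# Hypothetical log2 intensities: 3 probesets (rows) x 4 arrays (columns).
# The fourth array is shifted upward to mimic a problematic hybridization.
log_expr = [
    [8.0, 8.1, 7.9, 9.5],
    [5.0, 5.2, 4.8, 6.4],
    [10.0, 9.9, 10.1, 11.6],
]

rle_values = rle(log_expr)
medians = array_medians(rle_values)

# Flag arrays whose RLE median is far from 0 (threshold is an assumption).
flagged = [i for i, m in enumerate(medians) if abs(m) > 0.5]
print(medians, flagged)
```

In this toy data set only the fourth array (index 3) is flagged, because its RLE values are centered well above 0.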
PCA plots in three dimensions are useful for quality control: they help to identify outliers during pre-processing, show the DEG variance within each group of replicates, and give an overview of the internal structure of the data set.
A more flexible, data set-oriented filtering method based on the IQR function has been added for gene selection. IQR filtering was chosen as the automatic method because it is computationally simple and removes uninformative probesets that do not vary in the ES (Figure 3). Thus, signal-driven IQR gene filtering potentially reduces the number of false positives (Figure 3). The choice of probesets differs for Gene 1.0/1.1 ST, Exon 1.0 ST, and 3′ IVT arrays, and in each case depends on the probe-matching position on the mRNA and on the genome annotation (Figure 1B).
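The idea behind IQR filtering can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not the AMDA implementation: the probeset names, expression values, and the fixed threshold of 0.5 are hypothetical, and the source does not state how AMDA sets its cutoff.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def filter_by_iqr(expr, threshold):
    """Keep only probesets whose IQR across arrays exceeds the threshold."""
    return {probe: vals for probe, vals in expr.items() if iqr(vals) > threshold}

# Hypothetical log2 expression values for two probesets across six arrays.
expr = {
    "probe_flat": [7.0, 7.1, 7.0, 6.9, 7.0, 7.1],      # barely varies
    "probe_variable": [5.0, 5.2, 5.1, 8.0, 8.3, 8.1],  # differs between groups
}

kept = filter_by_iqr(expr, threshold=0.5)
print(sorted(kept))
```

Only the variable probeset survives the filter; the flat probeset carries little signal and would otherwise add to the multiple-testing burden.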
The comparison between a common baseline and an experimental condition in a data set containing paired samples is performed through the limma paired experimental design and DEG selection. The experimental design can be defined with an appropriate text file for Affymetrix, Agilent, and Illumina platforms. The implementation of a paired-sample test also makes AMDA 2.13 useful for analyzing data from clinical studies (e.g., human lymphocytes before and after drug treatment).
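To illustrate why pairing matters, the sketch below computes an ordinary paired t-statistic for a single gene in Python. It is a simplification: limma uses a moderated t-statistic that borrows information across genes, which is not reproduced here, and the expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Ordinary paired t-statistic computed on within-subject differences."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical log2 expression of one gene in four patients,
# measured before and after treatment (same patient order in both lists).
before = [7.0, 7.5, 6.8, 7.2]
after = [8.1, 8.4, 7.9, 8.5]

t_stat = paired_t(before, after)
print(round(t_stat, 2))
```

Because the statistic is computed on within-patient differences, baseline variation between patients does not inflate the error term.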
Through the addition of RP, AMDA 2.13 provides a straightforward and statistically meaningful way to determine DEG significance for two sample classes. Like Significance Analysis of Microarrays (SAM), the RP approach is nonparametric and therefore requires fewer assumptions about the data distribution; it is powerful both for identifying biologically relevant expression changes and for controlling the false discovery rate. By performing a permutation test on the set of replicates, it remains reliable when very few replicates are available or when the data are highly noisy.
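The rank product idea can be illustrated with a short Python sketch. It assumes a small table of log fold changes (genes in rows, replicates in columns) and estimates a permutation p-value for up-regulation; the data, function names, and number of permutations are hypothetical, and the sketch does not reproduce the exact procedure of any particular R package.

```python
import random
from math import prod

def ranks_descending(values):
    """Rank 1 is assigned to the largest value."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_products(log_fc):
    """Geometric mean of each gene's ranks across replicates."""
    k = len(log_fc[0])
    per_replicate = [ranks_descending(col) for col in zip(*log_fc)]
    return [
        prod(per_replicate[r][g] for r in range(k)) ** (1.0 / k)
        for g in range(len(log_fc))
    ]

def permutation_pvalues(log_fc, n_perm=500, seed=0):
    """Fraction of permuted rank products at or below each observed value."""
    rng = random.Random(seed)
    observed = rank_products(log_fc)
    n_genes = len(log_fc)
    counts = [0] * n_genes
    columns = [list(col) for col in zip(*log_fc)]
    for _ in range(n_perm):
        # Shuffle each replicate independently to break gene identity.
        shuffled = [rng.sample(col, len(col)) for col in columns]
        permuted = rank_products([list(row) for row in zip(*shuffled)])
        for g, rp in enumerate(observed):
            counts[g] += sum(1 for p in permuted if p <= rp)
    return [c / (n_perm * n_genes) for c in counts]

# Hypothetical log fold changes: 6 genes x 3 replicates.
# Gene 0 is consistently the most up-regulated.
log_fc = [
    [3.1, 2.8, 3.4],
    [0.2, -0.1, 0.4],
    [-0.3, 0.5, -0.2],
    [0.1, 0.0, -0.5],
    [-1.0, -0.7, 0.1],
    [0.4, 0.3, 0.0],
]

rp = rank_products(log_fc)
pvals = permutation_pvalues(log_fc)
print(rp[0], pvals[0])
```

A gene that ranks first in every replicate obtains a rank product of 1, which is rarely matched when ranks are assigned at random; this is why the method can work with only a few replicates.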
The implementation of novel algorithms for GO trimming eliminates local dependencies and points to relevant areas of the GO graph. The parent-child relationship, elim, and weight methods all address overlaps by computing statistical over-representation of DEG with counts weighted according to the hierarchical structure of the ontology.
To provide more meaningful knowledge through gene annotation, KEGG pathways were implemented for all platforms. DEG are tested for enrichment in KEGG gene sets to assess the potential functional convergence of gene signatures based on KEGG pathway modules. DEG that are significantly enriched in a pathway are mapped onto the KEGG graphical representation of the pathway, with down-regulated genes in blue and up-regulated genes in red. The topology of mapped DEG facilitates the selection of genes of interest in a specific pathway (Figure 4).
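Over-representation of DEG in a pathway is commonly assessed with a one-sided hypergeometric test. The Python sketch below shows that calculation under the assumption that this is the test applied; the source does not name the test used by AMDA, and all counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_in_set, n_deg, n_overlap):
    """P(X >= n_overlap) when n_deg genes are drawn without replacement
    from a universe in which n_in_set genes belong to the pathway."""
    upper = min(n_in_set, n_deg)
    total = comb(n_universe, n_deg)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_deg - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1,000 annotated genes, a pathway of 50 genes,
# 100 DEG in total, 15 of which fall in the pathway (5 expected by chance).
p_value = hypergeom_enrichment_p(1000, 50, 100, 15)
print(p_value)
```

When many pathways are tested, the resulting p-values would normally be corrected for multiple testing before pathways are reported as enriched.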
The evaluation of different gene sets through GSA can be useful to explore potentially interesting genes and to give insight for the biological interpretation of microarray data. If more than two conditions are assessed, the positive and negative GSA gene set scores are plotted as a heatmap, so that gene sets can be compared across conditions. A negative score indicates that lower expression of most genes in a gene set correlates with higher values of the compared condition; conversely, a positive score indicates that higher expression of most genes in the gene set correlates with higher values of the compared condition.
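One way to obtain a signed gene set score of this kind is the maxmean statistic proposed for GSA by Efron and Tibshirani. The Python sketch below assumes that per-gene scores (for example, t-like statistics) are already available; whether AMDA uses this particular summary is an assumption, and the scores shown are hypothetical.

```python
def maxmean(gene_scores):
    """Signed gene set score: the larger of the mean positive part and
    the mean negative part, carrying the sign of the dominant direction."""
    n = len(gene_scores)
    positive_part = sum(max(s, 0.0) for s in gene_scores) / n
    negative_part = sum(max(-s, 0.0) for s in gene_scores) / n
    return positive_part if positive_part >= negative_part else -negative_part

# Hypothetical per-gene scores for two gene sets.
mostly_up = [2.0, 1.5, -0.5, 1.0]
mostly_down = [-2.0, -1.0, 0.5, -1.5]

score_up = maxmean(mostly_up)
score_down = maxmean(mostly_down)
print(score_up, score_down)
```

The first set receives a positive score and the second a negative score of the same magnitude, matching the interpretation of positive and negative scores given above.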
We tested AMDA 2.13 on a microarray data set from a recent study (22), which was deposited in the ArrayExpress database (E-MEXP-2681). Total RNA extracted from muscle biopsies of patients with polymyositis (PM), dermatomyositis (DM), or juvenile myositis (JM), or of healthy individuals, was compared using the commonBaseline experimental design. The test data set and the generated report are available (Supplementary File 1). Quality assessment using RLE and NUSE showed no obvious artifacts. Distance and variance between replicates are shown by hierarchical clustering and PCA plots. A total of 1,578 probesets were identified using the limma Bayesian statistical model (18), 876 of which distinguish controls from JM samples, 756 distinguish controls from DM, and 1,035 distinguish controls from PM. As indicated on the Venn diagram, a substantial number of genes showed differential expression in more than one comparison (Supplementary File 1). The results obtained are in full concordance with the findings reported in the original publication (22).
A comparison of the tools available in AMDA 2.13 with the functionalities provided by well-known software packages for microarray data analysis is shown in Table 1. OneChannelGUI (11), dChip (23), AltAnalyze (12), GenePattern (7), and GEPAS (4) offer a full set of tools for microarray data analysis comparable to those of AMDA. While cross-platform normalization pre-processing is provided by OneChannelGUI, AltAnalyze, and ArrayMining (13), these applications do not offer different experimental designs or a comprehensive output report that documents data in a workflow. With AMDA 2.13, the analytical modules can also be used separately by invoking the relevant function as described (see Supplementary File 2). In addition, the steps and results of the whole analysis are collectively described in a .pdf report file together with a set of .txt and .png files, thus improving the readability of the global output.
AMDA 2.13 is not limited to the analysis of data from the most commonly used commercial human, mouse, and rat arrays, but also allows the analysis of arrays for plant and yeast gene expression, such as those for A. thaliana and S. cerevisiae. It is implemented in the R language in combination with Bioconductor, which makes it one of the most powerful and flexible command-driven solutions for microarray data analysis. It is also suitable for biologists lacking the computational and statistical knowledge needed to address all aspects of a typical analysis workflow. In addition, the report gives a comprehensive explanation of the results obtained, together with brief explanations of each approach, and is particularly useful for biologists who are not familiar with programming concepts or statistical methodologies.
AMDA 2.13 is freely available as a GPL-licensed R package on SourceForge (https://sourceforge.net/projects/automicroarray/files/) for Linux, Windows, and Mac OS operating systems. It has been tested on machines running Linux distributions (Debian GNU/Linux, Fedora, and OpenSUSE), Windows 7, and Mac OS X 10.5.
We thank Drs. Matteo Barcella and Antonella Farinaccio for testing the software independently. This work was supported in part by TOLERAGE EC 7FP HEALTH-F4-2008-202156 and FIGHT-MG EC 7FP HEALTH-F2-2010-242210.
The authors declare no competing interests.
Address correspondence to Dimos Kapetis, Fondazione IRCCS Istituto Neurologico Carlo Besta, Via Celoria 11, Milan 20133, Italy. Email: [email protected]
1.) Ashburner, M., C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
2.) Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27:29-34.
3.) Pelizzola, M., N. Pavelka, M. Foti, and P. Ricciardi-Castagnoli. 2006. AMDA: an R package for the automated microarray data analysis. BMC Bioinformatics 7:335.
4.) Tárraga, J., I. Medina, J. Carbonell, J. Huerta-Cepas, P. Minguez, E. Alloza, F. Al-Shahrour, S. Vegas-Azcárate. 2008. GEPAS, a web-based tool for microarray data analysis and interpretation. Nucleic Acids Res. 36:W308-314.
5.) Keller, A., C. Backes, M. Al-Awadhi, A. Gerasch, J. Küntzer, O. Kohlbacher, M. Kaufmann, and H.P. Lenhof. 2008. GeneTrailExpress: a web-based pipeline for the statistical evaluation of microarray experiments. BMC Bioinformatics 9:552.
6.) Hull, D., K. Wolstencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li, and T. Oinn. 2006. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34:W729-732.
7.) Reich, M., T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J.P. Mesirov. 2006. GenePattern 2.0. Nat. Genet. 38:500-501.
8.) Giardine, B., C. Riemer, R.C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg. 2005. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15:1451-1455.
9.) Pan, F., K. Kamath, K. Zhang, S. Pulapura, A. Achar, J. Nunez-Iglesias, Y. Huang, X. Yan. 2006. Integrative Array Analyzer: a software package for analysis of cross-platform and cross-species microarray data. Bioinformatics 22:1665-1667.
10.) Kapushesky, M., P. Kemmeren, A.C. Culhane, S. Durinck, J. Ihmels, C. Körner, M. Kull, A. Torrente. 2004. Expression Profiler: next generation—an online platform for analysis of microarray data. Nucleic Acids Res. 32:W465-470.
11.) Sanges, R., F. Cordero, and R.A. Calogero. 2007. oneChannelGUI: a graphical interface to Bioconductor tools, designed for life scientists who are not familiar with R language. Bioinformatics 23:3406-3408.
12.) Emig, D., N. Salomonis, J. Baumbach, T. Lengauer, B.R. Conklin, and M. Albrecht. 2010. AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Res. 38:W755-762.
13.) Glaab, E., J.M. Garibaldi, and N. Krasnogor. 2009. ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. BMC Bioinformatics 10:358.
14.) Gentleman, R.C., V.J. Carey, D.M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5:R80.
15.) Leek, J.T., R.B. Scharpf, H.C. Bravo, D. Simcha, B. Langmead, W.E. Johnson, D. Geman, K. Baggerly, and R.A. Irizarry. 2010. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11:733-739.
16.) Johnson, W.E., C. Li, and A. Rabinovic. 2007. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8:118-127.
17.) Yeung, K.Y., and W.L. Ruzzo. 2001. Principal component analysis for clustering gene expression data. Bioinformatics 17:763-774.
18.) Smyth, G.K. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:Article 3.
19.) Breitling, R., P. Armengaud, A. Amtmann, and P. Herzyk. 2004. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 573:83-92.
20.) Alexa, A., J. Rahnenfuhrer, and T. Lengauer. 2006. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22:1600-1607.
21.) Maglott, D.R., K.S. Katz, H. Sicotte, and K.D. Pruitt. 2000. NCBI's LocusLink and RefSeq. Nucleic Acids Res. 28:126-128.
22.) Cappelletti, C., F. Baggi, F. Zolezzi, D. Biancolini, O. Beretta, M. Severa, E.M. Coccia, P. Confalonieri. 2011. Type I interferon and Toll-like receptor expression characterizes inflammatory myopathies. Neurology 76:2079-2088.
23.) Amin, S.B., P.K. Shah, A. Yan, S. Adamia, S. Minvielle, H. Avet-Loiseau, N.C. Munshi, and C. Li. 2011. The dChip survival analysis module for microarray data. BMC Bioinformatics 12:72.