2Fondazione IRCCS Istituto Neurologico Carlo Besta, Via Celoria 11, Milan, Italy
3Department of Biotechnology and Biosciences, University of Milano-Bicocca, Piazza della Scienza 2, Milan, Italy
## Present address: Singapore Immunology Network (SIgN), 8A Biomedical Grove, 138648, Singapore
* Present address: Fondazione IRCCS Istituto Neurologico Carlo Besta, Via Celoria 11, Milan 20133, Italy
Microarray platforms require analytical pipelines with modules for data pre-processing including data normalization, statistical analysis for identification of differentially expressed genes, cluster analysis, and functional annotation. We previously developed the Automated Microarray Data Analysis (AMDA, version 2.3.5) pipeline to process Affymetrix 3′ IVT GeneChips. The availability of newer technologies that demand open-source tools for microarray data analysis has impelled us to develop an updated multi-platform version, AMDA 2.13. It includes additional quality control metrics, annotation-driven (annotation grade of Affymetrix NetAffx) and signal-driven (Inter-Quartile Range) gene filtering, and approaches to experimental design. To enhance understanding of biological data, differentially expressed genes have been mapped into KEGG pathways. Finally, a more stable and user-friendly interface was designed to integrate the requirements for different platforms. AMDA 2.13 allows the analysis of Affymetrix (cartridges and plates) and whole transcript probe design (Gene 1.0/1.1 ST and Exon 1.0 ST GeneChips), Illumina Bead Arrays, and one-channel Agilent 4×44 arrays. Relative to early versions, it supports various experimental designs and delivers more insightful biological understanding and up-to-date annotations.
The main issues for commercially available microarray platforms such as Affymetrix, GeneChip, Illumina BeadChip, and Agilent arrays are pre-processing, statistical data analysis, and subsequent extraction of biological knowledge. Analysis of microarray data should follow four main steps: (i) quality control, (ii) data pre-processing, (iii) identification of differentially expressed genes (DEG), and (iv) extraction of biological knowledge from DEG. The modules for quality control and data normalization are the most important steps of array evaluation and are platform-dependent. Quality controls help the user detect array artifacts or outliers and evaluate the uniformity of experimental groups in such a way as to choose the most appropriate pre-processing steps. Subsequently, the identification of DEG, generally the main objective of a microarray experiment, requires the application of statistical tests. Finally, to get an overview of the biological phenomena underlying the experiment, the identified DEG are functionally enriched by the use of public domain annotation databases like Gene Ontology (1) or pathway analysis terms, such as the Kyoto Encyclopedia Gene and Genomes (KEGG) (2).
Currently, many commercial or free tools exist to perform the common steps described above, such as the Automated Microarray Data Analysis [AMDA, version 2 .3.5; (3)] pipeline that we developed to process Affymetrix 3′ IVT GeneChips. However, few such tools provide the required modules for multiplatform analysis. Well-known web applications such as Gene Expression Pattern Analysis Suite (4), GeneTrailExpress (5), Taverna (6), GenePattern (7), or Galaxy (8) implement many tools for microarray or genomic analyses and for meta-analysis, but do not offer pre-processing for multiplatform analysis. Others, like Integrative Array Analyzer (9) and Expression Profiler (10), provide a user-friendly interface for different microarray platforms, but lack the pre-processing normalization and quality control steps.
Several solutions for pre-processing steps are available in the OneChannelGUI Bioconductor package (11), AltAnalyze (12), or ArrayMining web application (13). However, while some of these cover all the steps of data analysis and array platforms, none provides a comprehensive explanatory report of the analyzed data.
Here, we report an updated multi-platform version of AMDA, AMDA 2.13, which performs a fully automated microarray analysis for Affymetrix GeneChips, Illumina BeadChip, and Agilent one-color 4×44 gene expression platforms. AMDA 2.13 has been implemented for R-2.14 or higher; it includes all the major data analysis steps as well as up-to-date annotations.
Materials and methods
Hard-drive Space, Processor, and Memory Requirements
AMDA 2.13 is freely available for Linux, Windows, and Mac OS X operating systems (https://sourceforge.net/projects/automicroarray/files/). The software requires a minimum of 300MB of hard-drive space for installation of additional packages and annotation data; a minimum of 2GB of RAM and Intel Pentium III processor speed for Affymetrix 3′ IVT array analysis; an additional free hard-drive space of around 100MB for writing the output files; and additional RAM (up to 4GB) and hard-drive space (up to 5GB free) recommended for large numbers of arrays, when using Gene 1.0/1.1 ST and Exon 1.0 ST GeneChips.
The functions with command-line options are called through a GUI interface. The previous tcl/tk Widget interface was substituted with gWidget package of the CRAN repository (http://cran.r-project.org/).
AMDA 2.13 accepts BeadStudio version 3.0 software output and pre-processes Illumina BeadChips through lumi (14). Affymetrix Gene 1.0/1.1 ST and Exon 1.0 ST .CEL files can be imported and normalized with Oligo (14). The summarization step for Gene 1.0/1.1 ST and Exon 1.0 ST arrays is computed at transcript cluster level. Agilent 4×44 gene expression files exported by Agilent Feature Extraction (AFE) 22.214.171.124 (or later version) image analysis software can be read and analyzed though Agi4×44PreProcess (14). The latter package offers a number of different pre-processing steps: background correction and normalization between samples, filtering probes by quality flags, and summarization of replicated probes. The user can choose between the foreground signal (gProcessedSignal or gMeanSignal) and the background signal for the background correction (gBG-MedianSignal or gBGUsed). The user also decides the methods for background correction (“none,” “half,” “normexp”) and normalization between arrays (“none,” “quantile,” “vsn”). AFE provides a flag for each spot, which identifies different quantification errors of the signal. Quality filters are also implemented to remove controls, well-above background, well-above negative controls, saturation, population outliers, non-uniform outliers, and non-available values. Negative controls are removed by default from the ExpressionSet (ES). Additional normalization methods, such as frozen RMA (fRMA) and VSN, are available for 3′ IVT through Bioconductor (14) and have been implemented. fRMA algorithm is available for HG-U133A and HG-U133Aplus2.0 Affymetrix GeneChips. Pre-processing and annotation steps are also implemented for not widely used arrays such as Arabidopsis thaliana and Saccharomyces cerevisiae.
Batch effect correction: The importance of compensating for non-biological batch variation has been extensively documented recently (15). In order to reduce variations in experimental effects or batch effects, thereby empowering the detection of DEG through our pipeline, we have introduced empirical Bayesian Methods with ComBat (16).
Quality control metrics among arrays: Affymetrix Relative Log Expression (RLE) and Normalized Un-scaled Standard Error (NUSE) are calculated and plotted according to affyPLM, Oligo, or Agi4x44PreProcess Bioconductor packages (14). In order to assess quality of replicates and evaluate relations between classes after DEG selection, Principal Component Analysis (PCA) was added. PCA reduces data dimensionality through linear combinations of variables according to a criterion of variance maximization, thus creating new latent variables (17). The first three principal components are computed and plotted with respect to each other in the three possible combinations, together with relative histograms of explained variances. The hierarchical clustering among arrays is calculated and plotted with Easy Microarray Analysis package (http://cran.r-project.org/), with agglomerative average-linkage hierarchical-clustering algorithm as default. Pearson correlation was chosen as default similarity measure.
Inter-Quartile Range (IQR) and Affymetrix NetAffx annotation grade filter wrapper function: To address the well-known multiple-testing problem in microarray data analysis, a set of non-specific filtering functions was provided through genefilter (14) implemented in the IQR wrapper function (IQRFilter), which provides a measure of the variability of each probeset signal across the given data set.
The IQRFilter calculates an IQR for each probeset across all samples, and proceeds in two different ways: (i) if the IQR value is equal to “auto,” the function evaluates the optimal IQR value and provides a figure describing the IQR profile, a data set before and after filtering, and the IQR value automatically selected; and (ii) the IQR value can be a user-defined value (IQR value > 0).
Optimal IQR threshold evaluation is performed to identify probesets with the highest variability in signal across arrays. In particular, a 100-point vector (at a step h = 0.01) is created to compute and plot an IQR profile for each vector; the first derivative of such profile is then computed and its minimum point is identified (i.e., that corresponding to the steepest slope of IQR plot). This will be the point of maximum variation of the data set. A check for convergence on the estimated minimum (IQR_ opt) is then performed, by progressively reducing the step (thus dividing h by 0.1 at each iteration, IQR_opt-h/i>), in order to refine the search for minimum value.
Affymetrix NetAffx Analysis Center (www.affymetrix.com/analysis/index.affx) records five levels of relationships (Grades A, B, C, R, or E) between 3′ IVT probesets and the current transcript record. Each grade assignment provides information between probeset and pair-wise genomic alignment. The GradeSelection function using NetAffx annotation permits selection of the annotation grade record at the pre-processing stage.
Oligo package selects between probe cross-hybridization by selecting different probe groups divided in three categories, as follows: (i) core (Gene 1.0/1.1 ST and Exon 1.0 ST): probesets supported by RefSeq and full-length GeneBank transcripts with “unique” and “mixed” targeting genome positions; (ii) extended (Exon 1.0 ST): probesets with cDNA support information; and (iii) full (Exon 1.0 ST): probesets supported by gene predictions.
Wrapper Functions for DEG Selection: limma-paired Sample and RankProduct The limma test provided by AMDA 2.3.5 worked on the so-called “commonBaseline” experimental design for DEG identification, allowing comparison of samples to a common reference. Limma has been extended to the paired case, e.g., in cases where for each individual in a group, the same sample undergoes two different treatments. By using a paired t-test, we take pairing into account thus decreasing the variance of our estimates and eliminating spurious sources of variations such as sex, age, etc. The LimmaPairedSamples function applies the limma linear model (18) to identify the set of common DEG characteristic for a treatment group as compared with a control group. The RankProduct (RP) method (19), a powerful method for data set with limited sample size, was also integrated for two classes using the Rankprod package of Bioconductor (14). Venn Diagrams
For multiple comparisons between conditions, Venn diagrams are produced to visualize mutual intersections of up to three different lists of DEG in condition contrasts. An image reporting the number of DEG with respect to all possible intersections of categories is plotted, and a set of tab-delimited text files of the annotated DEG is provided. Gene Set Analysis
Single-gene statistical approaches for DEG identification in AMDA 2.3.5 is affected by major limitations such as statistical thresholds applied. Gene effects on pathways could therefore be missed when a single-gene statistical test is applied.
To improve the functional meaning of the DEG, Gene Set Analysis (GSA; http://www-stat.stanford.edu/∼tibs/GSA/) was added to AMDA 2.13. GSA aggregates functionally related genes using different resources such as biological pathways and chromosomal bands in relation to similar gene expression patterns. GSA was implemented for additional gene-set collections to identify DEG that share the same biological context. GSA allows the analysis of more comprehensive gene-set collections for over-representation; it uses the maxmean statistic and restandardization process to summarize gene sets. AMDA 2.13 uses a total of 6,769 gene sets of the Molecular Signatures Database (MSigDB 3.0, www.broadinstitute.org/gsea/msigdb), which are divided into five major collections (positional, curated, motif, computational, and gene ontology gene sets). Gene Ontology Functional Enrichment of DEG
With AMDA 2.3.5, enrichment analysis of DEG through Gene Ontology (GO) and KEGG pathways was performed using the hyper-geometric distribution through GOStats to assess over-representation of DEG in functional terms (3). The implementation of TopGO (20) on AMDA 2.13 integrates many algorithms and statistical tests for GO term enrichment analysis. It includes new statistical tests and new algorithms (such as eliminating genes, weighting genes, and parent-child relationships) that deal with GO graph structure, giving better understanding of the relationship between GO terms for calculating statistical significance (20). Graphical Representation of KEGG Pathways
DEG can be mapped into KEGG diagrams through Entrez Gene identifier (21). KEGGSOAP Bioconductor package was used to highlight DEG on KEGG pathways. The latest release of KEGG pathways, dated 15 March 2011, was based on mappings from KEGG GENOME ftp://ftp.genome.jp/pub/kegg/genomes. Results and discussion
As described in our previous work (3), AMDA 2.3.5 was entirely written in R language making intensive use of Bioconductor classes (such as ES, Affybatch, Annotated-Dataframe) and invoking other R/Bioconductor functions. The various modules for microarray data analysis were implemented only for 3′ IVT Affymetrix cartridges and could be also customized by experienced users. Additionally, filtering of ES was limited to MAS5 detection calls in order to eliminate data that are likely to be noisy. DEG identification, experimental design methods, and probe annotation strongly require the implementation of new and updated packages. AMDA 2.13 includes many important improvements: a new interface, application to additional platforms, quality control features, probeset filtering, new experimental design, DEG selection, and functional annotation.
The gWidget interface implemented for Affymetrix, Illumina, and Agilent platforms allows the definition of all possible settings in a user-friendly way (Figure 1A). The data analysis for Affymetrix starts from binary .CEL files and can be customized with different algorithms and options to be executed throughout the analysis; four separate workflows specialized for each platform are available (Figure 2). This part of the pipeline is based on the affy, Oligo, Agi4×44PreProcess, and lumi Bioconductor packages (14). The different pre-processing steps are platform-dependent and the user can choose various pre-processing methods. Illumina and Agilent platforms follow the same workflow as Affymetrix, but differ in the pre-processing steps.