to BioTechniques free email alert service to receive content updates.
Identification of artifactual microarray probe signals constantly present in multiple sample types
Shihong Mao1, Aletheia Lima Souza1,2, Robert J. Goodrich1, and Stephen A. Krawetz1
1Center for Molecular Medicine and Genetics, Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
2Department of Animal Science, Federal University of Ceara, Fortaleza, Ceara, Brazil
BioTechniques, Vol. 53, No. 2, August 2012, pp. 91–98
Full Text (PDF)
Supplementary Material
Table S1 (.pdf)
Table S2 (.pdf)
Table S3 (.pdf)
Table S4 (.pdf)
Table S5 (.pdf)

The detection, identification, and quantitation of transcripts have evolved from simple Northern analysis, cDNA cloning, and sequencing to RT-PCR, microarrays, and now digital gene expression using ultra–high-throughput RNA sequencing (RNA-Seq). During the course of our studies we observed that some microarray probes show very high signal intensity values yet are discordant when compared with RNA-Seq. A total of 99 probes from approximately 30,000 were identified as consistently discordant in four human tissues or cell lines. Interestingly, this set of discordant probes appears array-dependent. Among the 99 probes identified, 70 constantly exhibited a high signal in all 713 available samples surveyed using the Illumina HumanHT-12v4 platform. Some were discordant with additional probes that annotated the same genes. Absence of a number of these transcripts was confirmed by quantitative RT-PCR (qRT-PCR). Our findings suggest that one must be cautious, as some array probes do not capture the level of the target.

Microarray technology has matured into a routine high-throughput means of simultaneously querying the levels of thousands of transcripts. Although this technique remains a valuable transcriptomic tool, it is also well known that the accuracy of microarray results reflects several variables. These include experimental noise, the level of background, and platform sequence probe design. To date, many laboratory methods and analytical procedures have been developed to normalize the data so as to reduce the impact of these factors (1-4). With the exception of a few cases (5-7) where probe signals have been independently confirmed, the veracity of the signal intensities (SIs) of the remaining probes is assumed.

Ultra–high-throughput RNA sequencing (RNA-Seq) is rapidly emerging as an attractive alternative platform to microarrays for quantitative transcriptome profiling. The accuracy and the consistency between these two platforms have been widely reported (8-13). Many reports suggest that RNA-Seq is more accurate than microarrays (9, 12, 13) when validated using a third methodology. For example, using proteomic validation, Fu and colleagues (9) concluded that RNA-Seq is more accurate. While in general the correlation coefficient between RNA-Seq and microarrays are consistent (10, 12), the correlation coefficient for very low or very highly expressed genes is variable (12). Although this discordance in microarray data has been noticed, the underlying cause has yet to be determined. During the course of our studies, a series of discordant observations were noted when microarray and RNA-Seq data were compared. To reconcile this observation, we defined a set of criteria to determine the extent of discordance among currently available data retrieved from Gene Expression Omnibus (GEO). The analysis presented below reveals an apparent set of platform-specific discordant probes.

Materials and methods

Sample collection and RNA isolation

Semen samples were received from two donors denoted S1 and S2. Each semen sample was equally divided in half to yield a total of four samples (S1SCLB, S1PS, S2SCLB, and S2PS). Two methods were independently utilized for removing somatic cell contaminates. Samples S1PS and S2PS were enriched by PureSperm gradient centrifugation, and samples S1SCLB and S2SCLB were subject to somatic cell lysis (14). RNA was isolated using the sperm RNA isolation protocol as previously described (15). RT-PCR analysis using intron spanning primers confirmed the absence of DNA in all samples. Testis RNA was obtained from commercial libraries (lot no. 054P010702031A; Applied Biosystems/ Ambion, Austin, TX, USA).

Generation of microarray data and determination of expressed probes

RNA (50 ng) from each sample was subject to two rounds of amplification using the Message Amp II system (Ambion). Biotin-UTP was incorporated during the second round of amplification. Biotinylated aRNA (750 ng) from each sample was hybridized to HumanHT-12v4 bead arrays (Illumina, San Diego, CA, USA). Arrays were washed, stained, and scanned according to the manufacturer's instructions. Initial data analysis and background subtraction was performed using GenomeStudio version 1.7.0 (Illumina). SI and a coupled P value were determined for each probe. Only those probes with a P value of less than 0.01 were considered further.

RNA library preparation and sequencing

Standard RNA-Seq libraries were prepared using the TruSeq Sample Prep kit (part no. RS-930-2001; Illumina) according to the manufacturer's protocol. Briefly, 500 ng purified RNA were fragmented with divalent cations under elevated temperature. First-strand cDNA synthesis was then performed with random hexamers and reverse transcriptase. Second-strand cDNA synthesis was performed using RNase H and DNA Pol I. Following cDNA synthesis, double-stranded products were end-repaired, and PE adaptors (Illumina) were ligated onto the cDNA products followed by 15 cycles of PCR enrichment. All samples were subject to paired-end sequencing using the GAIIx Genome Analyzer (Illumina) for 34 cycles. Image analysis, base calling, FASTQ generation, and demultiplexing were performed using the genome analyzer pipeline software CASAVA version 1.8.1.

Short read mapping, assembly, and estimating transcript abundance

Short reads were mapped to National Center for Biotechnology (NCBI) build 37.2 of the human reference genome using version 1.3.2 TopHat paired-end base default parameters (16). Alignment results were confirmed independently using Novoalign (v.2.07; Novocraft Technologies, Selangor, Malaysia). The relative abundance of each transcript was calculated using Cufflinks version 1.1.0 (17) and presented as fragments per kilobase exon per million fragments mapped (FPKM). The relative abundance of each transcript was also confirmed using Genomatix RegionMiner (18).

Primer design for quantitative RT-PCR

In this study, the primer pairs for quantitative RT-PCR (qRT-PCR) were designed as close to the microarray probes as possible. The ideal primer pairs encompassed or contained the microarray probes. Primers were designed as previously described (19). In brief, the sequences of the gene of interest were retrieved from ENSEMBL genome browser. Candidate primer sequences were generated using Oligo7 (Molecular Biology Insight, Cascade, CO, USA) or Primer3Plus ( The specificity of the primer alignment along the genome was evaluated using the BLAST and BLAT prior to qRT-PCR as described (19).

Data sets

Sperm RNA data sets that describe both the microarray data GSE39526 and RNA-Seq data GSE39527 are available from GEO GSE39528. Additional HumanHT-12v4 microarray data sets were acquired from GEO (GPL10558). Human placenta and foreskin fibroblast cell line RNA-Seq data sets were obtained from GSE30554 (20).

Results and discussion

The correlation of the RNA profiles obtained by microarray analysis was initially compared with that obtained from RNA-Seq using a group of sperm data sets. The repertoire of RNAs in the mature male gamete is limited relative to other cell-types, and this reduced complexity should limit the possibility of cross-hybridization of the transcripts to several probes. However, a series of discordant probes of high SI by microarray, but of low abundance by RNA-Seq or qRT-PCR, were resolved. To determine whether this was a tissue/cell-specific phenomenon, the extent of discordance among other data sets was then determined.

GAIIx RNA-Seq and HumanHT-12 correlation in sperm and testes

We aligned the paired-end RNA-Seq reads to the human reference genome. The RNA-Seq metrics are presented in Supplementary Table S1. The majority of reads that could not be aligned are due to their low QC scores. The reads that aligned to multiple positions on the genome were not considered for further analysis. Among the 40% unique aligned reads, ~42% aligned to exon regions, which is about 12-fold enrichment compared with the genome. The size of each RNA fragment was calculated from the distance spanned from each of the paired reads (mean = ~160 bp, SD = ~50; see Supplementary Figure S1).

Approximately 30,000 HumanHT-12 platform probes were of identical sequence to the current genome build and thus could also be detected by RNA-Seq. A summary of the unsupervised hierarchical clustering of RNA profiles as determined by microarray (SI) and GAIIx deep sequencing (FPKM) is presented in Figure 1. Biological replicates of samples assayed on the same platform Array or Seq, were highly correlated. Interestingly, many probes that display a very high microarray SI value exhibited a very low FPKM when assessed by RNA-Seq. A series of criteria were developed to assess the generality of this observed discordance.

  1    2    3