Identification of artifactual microarray probe signals constantly present in multiple sample types
Shihong Mao1, Aletheia Lima Souza1,2, Robert J. Goodrich1, and Stephen A. Krawetz1
Figure 1.  Unsupervised clustering of the transcript profile generated from HumanHT-12 microarray and RNA-Seq using two sperm RNA samples and one testis RNA sample.

Two criteria were defined to identify discordant probes; a hybridizing probe (P < 0.01) that satisfied either criterion was deemed discordant. First, a probe with a signal above background was considered discordant if the FPKM of its annotated gene was <1. An FPKM of <1 corresponds to an average of less than one fragment per million aligned fragments mapped onto a 1-kb exon of the transcript, a level regarded as background arising from sequencing error(s) or a statistical mapping error. Second, the standard ratio (SR) was considered, defined as the average SI:FPKM ratio across the group of genes with the highest array SI values and RNA-Seq FPKM values when multiple samples are considered. If the SI:FPKM ratio of any probe was 100-fold higher than the SR, the observation was considered discordant. Based on these criteria, a total of 195 and 2391 discordant probes were identified from sperm and testis, respectively (see Supplementary Table S2).
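The two criteria can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Probe` record, the function names, the `top_n` size of the high-SI group used to estimate the SR, and the choice to estimate the SR only from probes with FPKM ≥ 1 are all assumptions made for illustration, and the numeric values are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Probe:
    probe_id: str
    si: float       # microarray signal intensity (SI)
    fpkm: float     # RNA-Seq FPKM of the annotated gene
    detected: bool  # True if the probe hybridized (detection P < 0.01)


def standard_ratio(probes, top_n):
    """Mean SI:FPKM ratio over the top_n highest-SI detected probes.

    Assumption: only probes with FPKM >= 1 contribute, so the ratio is
    always defined and is not inflated by background-level FPKM values.
    """
    usable = [p for p in probes if p.detected and p.fpkm >= 1.0]
    top = sorted(usable, key=lambda p: p.si, reverse=True)[:top_n]
    if not top:
        raise ValueError("no detected probe with FPKM >= 1 to estimate the SR")
    return mean(p.si / p.fpkm for p in top)


def discordant_probes(probes, top_n=100, fold=100.0):
    """Return IDs of detected probes that satisfy either criterion."""
    sr = standard_ratio(probes, top_n)
    flagged = []
    for p in probes:
        if not p.detected:
            continue  # only hybridizing probes are evaluated
        low_fpkm = p.fpkm < 1.0                                # criterion 1
        high_ratio = p.fpkm > 0 and p.si / p.fpkm > fold * sr  # criterion 2
        if low_fpkm or high_ratio:
            flagged.append(p.probe_id)
    return flagged


# Toy data with invented values, for illustration only.
toy_probes = [
    Probe("a", 20000.0, 200.0, True),  # concordant, ratio 100
    Probe("b", 15000.0, 150.0, True),  # concordant, ratio 100
    Probe("c", 3000.0, 0.2, True),     # criterion 1: FPKM < 1
    Probe("d", 12000.0, 1.0, True),    # criterion 2: ratio exceeds 100 x SR
    Probe("e", 90.0, 0.0, False),      # not detected, so not evaluated
]
flagged_ids = discordant_probes(toy_probes, top_n=2)
```

In this toy example the SR is estimated from probes `a` and `b`, and probes `c` and `d` are flagged. How the high-SI reference group is chosen strongly affects the SR, so `top_n` should be treated as a tunable assumption rather than a value taken from the text.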

qRT-PCR validation in sperm samples

Seven genes that exhibited discordant levels between the microarray and RNA-Seq results in these samples were selected for verification by qRT-PCR (Table 1). The positions of the qRT-PCR primer pairs relative to the microarray probe locations on the genome, together with the qRT-PCR amplification results, are summarized in Table 1. None of the seven genes was detected by qRT-PCR (Table 1). The FPKM values and coverage for each transcript isoform of these seven genes are listed in Supplementary Table S3. The SI values of each corresponding probe, the FPKM for each RNA-Seq sample, and the number of fragments mapped to the probe regions of each gene are shown in Table 2, with the PRM2 transcript providing a positive control. The SIs of these seven discordant genes detected by microarray were very strong. In comparison, these transcripts were underrepresented in the RNA-Seq data sets, and no sequence reads could be mapped to the microarray probe regions of these seven genes.

Table 1.  Summary of primer sequence design and qRT-PCR amplification


Table 2.  SI, FPKM values, and the number of fragments that mapped to the probe regions from two sperm samples.


Discordant genes in other HT-12v4 data sets

To determine whether discordant probes were correlated with specific tissues or with procedures that could be attributed to individual laboratories, the SIs of HumanHT-12 probes generated by different laboratories were examined against the corresponding RNA-Seq FPKM values. Based on the above criteria, 5780 discordant probes were identified in the human placenta samples, and 903 discordant probes were similarly identified in the human skin fibroblast cell lines. As illustrated in Figure 2, a four-way comparison of sperm, testis, placenta, and fibroblast cells showed that 99 probes were consistently identified in all four tissues or cell lines at P < 3.4 × 10^−6 (21). The other discordant probes appear to be tissue-specific and/or to reflect specific experimental conditions and were not considered further. The list of the 99 probe sequences is detailed in Supplementary Table S4. The set of discordant probes from all four data sets included probes that correspond to AHR, ANKRD30B, and MCM8 (Table 3), which were shown to be absent from sperm by qRT-PCR.
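The four-way comparison described above amounts to intersecting the per-data-set lists of discordant probes. A minimal sketch follows, assuming each data set is represented as a set of probe identifiers; the function names and the placeholder identifiers (`p1`, `p2`, ...) are hypothetical and do not correspond to real HumanHT-12 probe IDs.

```python
from collections import Counter


def shared_discordant(discordant_by_dataset):
    """Return the probes flagged as discordant in every data set."""
    sets = [set(ids) for ids in discordant_by_dataset.values()]
    return set.intersection(*sets) if sets else set()


def dataset_counts(discordant_by_dataset):
    """Count, for each probe, the number of data sets in which it is flagged."""
    counts = Counter()
    for ids in discordant_by_dataset.values():
        counts.update(set(ids))
    return counts


# Toy data with placeholder probe identifiers, for illustration only.
toy = {
    "sperm": {"p1", "p2", "p3"},
    "testis": {"p1", "p2", "p4", "p5"},
    "placenta": {"p1", "p2", "p5", "p6"},
    "fibroblast": {"p1", "p2", "p7"},
}
common = shared_discordant(toy)
```

Probes counted in fewer than four data sets by `dataset_counts` correspond to the outer regions of the Venn diagram, that is, the tissue-specific or condition-specific probes that were not considered further.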

Figure 2.  Four-way Venn diagram of the discordant probes from four tissues/cell lines.

Table 3.  Statistics of SI and FPKM values of the seven genes from human placenta tissue and human skin fibroblast cell line


The HumanHT-12v4 bead array can be retrieved from GEO as platform GPL10558. A total of 38 series of publications and experiments consisting of 718 samples were available in GEO. The accession number of each data series in GEO and the number of samples in each series are provided in Supplementary Table S5. The SI values of AHR, ANKRD30B, and MCM8 were determined first since they had been confirmed as absent by qRT-PCR. A stringent SI threshold of 150 (equivalent to P < 0.01) was then used to assess whether a given probe hybridized. As shown in Figure 3, among the 718 available samples, only five samples exhibited an SI below threshold; these were not considered further as the samples tested were predominantly derived from mouse.
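The threshold screen across GEO samples can be sketched as follows. The SI threshold of 150 and the three gene names come from the text; everything else is assumed for illustration, including the data layout (a mapping from sample ID to per-gene SI values), the sample IDs, the SI values, and the rule that a sample is flagged when at least one of the three probes falls below the threshold.

```python
SI_THRESHOLD = 150.0  # from the text: an SI of 150 is equivalent to P < 0.01
CONTROL_GENES = ("AHR", "ANKRD30B", "MCM8")  # confirmed absent by qRT-PCR


def samples_below_threshold(si_by_sample, genes=CONTROL_GENES,
                            threshold=SI_THRESHOLD):
    """Return sorted IDs of samples with at least one probe SI below threshold.

    Assumption: a sample is flagged if ANY of the listed probes falls below
    the threshold; the text does not state whether any or all were required.
    """
    flagged = []
    for sample_id, si_values in si_by_sample.items():
        if any(si_values[gene] < threshold for gene in genes):
            flagged.append(sample_id)
    return sorted(flagged)


# Toy data with invented sample IDs and SI values, for illustration only.
toy_samples = {
    "sample_A": {"AHR": 820.0, "ANKRD30B": 410.5, "MCM8": 963.2},
    "sample_B": {"AHR": 95.0, "ANKRD30B": 120.3, "MCM8": 88.7},
    "sample_C": {"AHR": 150.0, "ANKRD30B": 305.1, "MCM8": 512.4},
}
below = samples_below_threshold(toy_samples)
```

In this toy input only `sample_B` is flagged; `sample_C` sits exactly at the threshold and is treated as hybridized because the comparison is strict.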

