Identification of artifactual microarray probe signals constantly present in multiple sample types
Shihong Mao1, Aletheia Lima Souza1,2, Robert J. Goodrich1, and Stephen A. Krawetz1
Supplementary Material
Table S1 (.pdf)
Table S2 (.pdf)
Table S3 (.pdf)
Table S4 (.pdf)
Table S5 (.pdf)

Figure 3. Distribution of the number of samples as a function of SI values of three genes.

The SI values of the 99 discordant probes were then assessed in all the remaining samples. All 99 probes showed significant signal levels in more than 95% of the samples, and 70 probes were detected in all 713 samples (Supplementary Table S4), independent of tissue or cell type.
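The per-probe detection rate described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each probe comes with a per-sample detection p-value, and the 0.01 cutoff, the function names, and the toy values are all hypothetical.

```python
def detection_rates(detection_p, alpha=0.01):
    """Fraction of samples in which each probe is detected (p < alpha).

    detection_p maps a probe ID to its list of per-sample detection p-values.
    The alpha cutoff is an assumption made for this sketch.
    """
    return {
        probe: sum(1 for p in pvals if p < alpha) / len(pvals)
        for probe, pvals in detection_p.items()
    }


def constantly_present(detection_p, alpha=0.01, min_fraction=0.95):
    """Probes detected in more than min_fraction of the samples."""
    rates = detection_rates(detection_p, alpha)
    return sorted(probe for probe, rate in rates.items() if rate > min_fraction)


# Toy input with made-up probe IDs and p-values (four samples per probe).
toy_p = {
    "probe_A": [0.001, 0.002, 0.0, 0.004],  # detected in every sample
    "probe_B": [0.001, 0.5, 0.3, 0.002],    # detected in half of the samples
}
print(detection_rates(toy_p))
print(constantly_present(toy_p))
```

A probe that passes this filter is only a candidate; in the study, the decisive evidence was the absence of corresponding RNA-Seq signal.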

In some cases, several differentially spaced probes were designed to interrogate a single transcript over an extended region. Of the 99 probes identified, 74 target a gene that is also interrogated by at least one additional probe. Ideally, in the absence of alternative splicing, the SIs of probes representing the same transcript should be similar. However, the SIs of the constantly present probes differ from those of their companion probes. For example, the HT-12 microarray platform uses three different probe sequences to query the level of MCM8, yet in all available samples only probe ILMN_1798581 exhibits a signal level above background (Figure 4A). This is consistent with the view that this probe does not accurately measure the transcript present, as supported by the RNA-Seq (Table 2) and qRT-PCR (Table 1) data sets. Figure 4B shows the discordant probes in PNPT1. Probes ILMN_3251723 and ILMN_2051408 show high SI values in all available samples, yet no sequence reads could be mapped to these probes. Both are annotated as constantly present probes in Supplementary Table S4. Discordance among probes within a single gene supports the view that some probes cannot correctly capture mRNA levels. These findings are consistent with the observations of Marioni and colleagues (8), who, using another platform, compared Affymetrix array intensities with Illumina deep sequencing reads. In that study, a number of probes showed very high SI values (Figure 3 in Reference 8) with a very low number of sequence reads. Similarly, Malone and colleagues (11) showed that many probes have high SI values and a very low number of reads (Figure 3 in Reference 11).
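The within-gene comparison can be illustrated with a short sketch. It flags genes for which exactly one of several probes has a median SI above background, which is the pattern described for MCM8. The background value, the gene and probe names, and the use of the median are assumptions made for illustration only.

```python
from statistics import median


def lone_high_probes(gene_probes, si, background):
    """Find genes where exactly one of several probes is above background.

    gene_probes maps a gene to its probe IDs; si maps a probe ID to its
    per-sample signal intensities. The background cutoff is hypothetical.
    """
    flagged = {}
    for gene, probes in gene_probes.items():
        if len(probes) < 2:
            continue  # a single probe cannot be compared with companions
        above = [p for p in probes if median(si[p]) > background]
        if len(above) == 1:
            flagged[gene] = above[0]
    return flagged


# Toy data with made-up identifiers and intensities.
toy_gene_probes = {"GENE_X": ["p1", "p2", "p3"], "GENE_Y": ["p4", "p5"]}
toy_si = {
    "p1": [900, 1100, 1000],  # high in every sample
    "p2": [90, 110, 100],     # background level
    "p3": [95, 105, 98],      # background level
    "p4": [800, 900, 850],    # both GENE_Y probes agree
    "p5": [700, 950, 820],
}
print(lone_high_probes(toy_gene_probes, toy_si, background=150))
```

A flagged probe is not necessarily artifactual, since alternative splicing can also produce discordant probes; independent evidence such as RNA-Seq or qRT-PCR is still needed.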

Figure 4. Positions of microarray probes within MCM8 and PNPT1.

A total of 99 constantly present microarray probes were identified based on the lack of comparable FPKM RNA-Seq values in four different tissues/cell lines. The nature of these probes has yet to be determined. The discordant probes do not correspond to ribosomal RNA (rRNA), known microRNA (miRNA), or Piwi-interacting RNA (piRNA) sequences. They do not form any common pathway, nor do they have common biological functions or share a common motif. Mis-annotation is also unlikely, as analysis using the reannotation approach of Barbosa-Morais et al. (22) showed that almost all of the discordant probes mapped to the correct transcripts. Accordingly, the discordance may reflect an as yet unknown component of array platform technology. For example, Johnson and colleagues (23) created a custom-designed series of probes to detect a suite of human genes inserted within the transgenic mouse genome. Although the probes were designed to be unique to the human genome, comparative genomic hybridization (CGH) analysis showed, surprisingly, that some probes were preferentially bound by mouse sequences.
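The selection criterion, a high array signal without a comparable FPKM value in every tissue examined, might be expressed as in the sketch below. The thresholds `si_min` and `fpkm_max`, the record layout, and the toy values are hypothetical and are not the cutoffs used in the study.

```python
def discordant_probes(records, si_min, fpkm_max):
    """Probes with high SI but negligible FPKM in every tissue.

    records maps a probe ID to a list of (si, fpkm) pairs, one per tissue
    or cell line. Both thresholds are assumptions made for this sketch.
    """
    return sorted(
        probe
        for probe, pairs in records.items()
        if all(si >= si_min and fpkm <= fpkm_max for si, fpkm in pairs)
    )


# Toy records for four tissues; identifiers and numbers are made up.
toy_records = {
    "probe_A": [(1200, 0.0), (950, 0.1), (1100, 0.0), (1000, 0.2)],
    "probe_B": [(1300, 45.0), (900, 30.2), (1000, 52.1), (1150, 38.7)],
    "probe_C": [(80, 0.0), (1000, 0.0), (90, 0.1), (85, 0.0)],
}
print(discordant_probes(toy_records, si_min=500, fpkm_max=1.0))
```

Requiring the condition in every tissue keeps the filter conservative: a probe whose transcript is genuinely expressed in even one tissue is not flagged.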

The degree to which constantly present probes may affect downstream data analysis is significant. On the one hand, for a single-class data set (e.g., an up- or down-regulated data set), the effect will be particularly severe. The goal of one-class data analysis is to mine the genes that are constantly expressed in each sample. Including data derived from constantly present probes will distort this analysis, as the corresponding transcripts always appear to be present at significant levels. On the other hand, for a data set with two or more classes, fold change (4), P value, or permutation tests (1, 2) are typically used to identify differentially expressed genes. In this case, the constantly present probes may go unnoticed. However, as shown in Figure 3, the SI values of such probes can vary between samples, so a spurious difference may be mistaken for a significant result. Regardless, microarray technology remains a useful tool for an initial survey of annotated transcriptomes to infer the state of a cell, but care must be taken to avoid the constantly present probes.
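One way to guard a two-class analysis against this effect is to exclude a list of known constantly present probes before computing fold changes. The sketch below assumes such a list is available (for example, one compiled from Supplementary Table S4); the function name, the data layout, and the toy values are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_changes(class_a, class_b, exclude=frozenset()):
    """log2(mean SI in class_b / mean SI in class_a) for each shared probe.

    Probes listed in exclude (e.g., constantly present probes) are skipped.
    """
    result = {}
    for probe, values_a in class_a.items():
        if probe in exclude or probe not in class_b:
            continue
        result[probe] = log2(mean(class_b[probe]) / mean(values_a))
    return result


# Toy intensities for two classes; all values are made up.
toy_a = {"probe_A": [100, 100], "probe_B": [200, 200]}
toy_b = {"probe_A": [400, 400], "probe_B": [200, 200]}

# Without filtering, probe_A looks four-fold up-regulated.
print(log2_fold_changes(toy_a, toy_b))
# With probe_A on the exclusion list, it no longer enters the analysis.
print(log2_fold_changes(toy_a, toy_b, exclude={"probe_A"}))
```

Filtering before testing, rather than after, also keeps artifactual probes out of any multiple-testing correction applied downstream.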


This work was supported in part by the Charlotte B. Failing Professorship to S.A.K. and in part by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development Contract 25PM6 in collaboration with the LIFE Study Working Group, Division of Epidemiology, Statistics, and Prevention Research, who provided semen samples for analysis. A.L.S. was supported by a visiting scholar fellowship from the Brazilian Research Council (CAPES). We are grateful to Graham Johnson and Edward Sendler for their review of the manuscript. S.M. designed the analysis workflow, analyzed the data, and wrote the manuscript; A.L.S. designed primers, performed the qRT-PCR experiments, and reviewed the manuscript; R.J.G. prepared libraries for both RNA-Seq and HumanHT-12 bead arrays and reviewed the manuscript; and S.A.K. oversaw the project and edited the manuscript.

Competing interests

The authors declare no competing interests.

1.) Tusher, V.G., R. Tibshirani, and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5116-5121.

2.) Fisher, R. 1950. Statistical Methods for Research Workers, 11th ed. Oliver & Boyd, Edinburgh.

3.) Shi, L., L.H. Reid, W.D. Jones, R. Shippy, J.A. Warrington, S.C. Baker, P.J. Collins, F. de Longueville, et al. 2006. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol 24:1151-1161.

4.) Mariani, T.J., V. Budhraja, B.H. Mecham, C.C. Gu, M.A. Watson, and Y. Sadovsky. 2003. A variable fold change threshold determines significance for expression microarrays. FASEB J 17:321-323.

5.) Yu, H., K. Nguyen, T. Royce, J. Qian, K. Nelson, M. Snyder, and M. Gerstein. 2007. Positional artifacts in microarrays: experimental verification and construction of COP, an automated detection tool. Nucleic Acids Res 35:e8.

6.) Brodsky, L., A. Leontovich, M. Shtutman, and E. Feinstein. 2004. Identification and handling of artifactual gene expression profiles emerging in microarray hybridization experiments. Nucleic Acids Res 32:e46.

7.) Nelson, D.C., D.J. Wohlbach, M.J. Rodesch, V. Stolc, M.R. Sussman, and M.P. Samanta. 2007. Identification of an in vitro transcription-based artifact affecting oligonucleotide microarrays. FEBS Lett 581:3363-3370.

8.) Marioni, J.C., C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad. 2008. RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18:1509-1517.

9.) Fu, X., N. Fu, S. Guo, Z. Yan, Y. Xu, H. Hu, C. Menzel, W. Chen, et al. 2009. Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics 10:161.

10.) Brunskill, E.W., H.L. Lai, D.C. Jamison, S.S. Potter, and L.T. Patterson. 2011. Microarrays and RNA-Seq identify molecular mechanisms driving the end of nephron production. BMC Dev. Biol. 11:15.

11.) Malone, J.H., and B. Oliver. 2011. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9:34.

12.) Wang, Z., M. Gerstein, and M. Snyder. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet 10:57-63.

13.) Xiong, Y., X. Chen, Z. Chen, X. Wang, S. Shi, J. Zhang, and X. He. 2010. RNA sequencing shows no dosage compensation of the active X-chromosome. Nat. Genet 42:1043-1047.

14.) Goodrich, R., G. Johnson, and S.A. Krawetz. 2007. The preparation of human spermatozoal RNA for clinical analysis. Arch. Androl 53:161-167.

15.) Goodrich, R., E. Anton, and S.A. Krawetz. Isolating mRNA and small noncoding RNAs from human sperm. In K. Aston and D. Carrell (Eds.), Methods in Molecular Biology: Spermatogenesis and Spermiogenesis: Methods and Protocols. Humana Press, Totowa.

16.) Trapnell, C., L. Pachter, and S.L. Salzberg. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105-1111.

17.) Trapnell, C., B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold, and L. Pachter. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28:511-515.

18.) Sultan, M., M.H. Schulz, H. Richard, A. Magen, A. Klingenhoff, M. Scherf, M. Seifert, T. Borodina, et al. 2008. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321:956-960.

19.) Lima-Souza, A., E. Anton, S. Mao, W.J. Ho, and S.A. Krawetz. 2012. A platform for evaluating sperm RNA biomarkers: dysplasia of the fibrous sheath-testing the concept. Fertil. Steril 97:1061-1066.

20.) Cabili, M.N., C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev, and J.L. Rinn. 2011. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 25:1915-1927.

21.) Mao, S., C. Wang, and G. Dong. 2009. Evaluation of inter-laboratory and cross-platform concordance of DNA microarrays through discriminating genes and classifier transferability. J. Bioinform. Comput. Biol. 7:157-173.

22.) Barbosa-Morais, N.L., M.J. Dunning, S.A. Samarajiwa, J.F. Darot, M.E. Ritchie, A.G. Lynch, and S. Tavare. 2010. A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res 38:e17.

23.) Johnson, G.D., A.E. Platts, C. Lalancette, R. Goodrich, H.H. Heng, and S.A. Krawetz. 2011. Interrogating the transgenic genome: development of an interspecies tiling array. Syst. Biol. Reprod. Med 57:54-62.
