2Department of Computer Science, King's College London, London, UK
3Department of Biochemistry, School of Biomedical and Health Sciences, King's College London, London, UK
Mycoplasma contamination of cell cultures is a problem because it can adversely affect cellular behavior and physiology, and the problem is compounded by the difficulty of detecting colonization (1). A previous study of its effects on microarray data showed that mycoplasma contamination can alter patterns of human gene expression by upsetting host cell physiology (2). Miller et al. used Affymetrix Human Genome U133A GeneChip microarrays (Santa Clara, CA, USA), which contain mostly well-characterized human genes, and demonstrated that mycoplasma infection compromises the validity of any data generated from such samples (2).
We have found that probeset 1570561_at on the HG-U133 Plus 2.0 microarray, the successor to the HG-U133A chip, maps to the 16S-23S rRNA intergenic transcribed spacer (ITS) sequences of multiple mycoplasma species. Interestingly, this sequence is already used in PCR- and microarray-based protocols for detecting, screening and genotyping mycoplasma infections (3,4). In contrast to the HG-U133A array, HG-U133 Plus 2.0 arrays contain more probesets for less well-characterized sequences. The 176-nucleotide Affymetrix target sequence used to select probes for this probeset was designed to a single human expressed sequence tag (EST)-like sequence (GenBank accession no. AF241217); according to the company, this entry has no similarity to any known human transcript or genomic sequence.
However, we demonstrate that the target sequence bears overwhelming similarity to 16S-23S rRNA ITS sequences from various species of mycoplasma (Table 1). The table shows the 10 most similar alignments between the 176-nucleotide target sequence and entries in the GenBank multi-species ‘nr’ database; 9 of these 10 are mycoplasma rRNA gene sequences, while the remaining match is the AF241217 entry itself (5). We conclude that a probeset representing mycoplasma 16S-23S rRNA ITS has been included on a human microarray.
The GenBank entry AF241217 is a 249-nucleotide sequence annotated as “Homo sapiens unknown sequence” and was submitted to the high-throughput cDNA (HTC) division in 2000. We propose that the cDNA sequenced as part of this particular HTC project may have been derived from a mycoplasma-contaminated human cell line. If so, some of the cDNA clones in the library would have contained mycoplasma gene fragments, leading to the inclusion of mycoplasma sequences in the human database. Then, while designing the HG-U133 Plus 2.0 array, Affymetrix included the AF241217 GenBank entry, as it appeared to be human in origin. The HG-U133A array used in the previous mycoplasma study (2) does not contain any probesets designed to this particular sequence.
To find out whether any other mycoplasma genes were represented on HG-U133 Plus 2.0, we aligned all 54,675 of its target sequences against an example mycoplasma genome (Mycoplasma arthritidis 158L3-1; complete genome sequence; GenBank accession no. CP001047; 820,453 bp) using the FASTA program (6). Using Smith-Waterman score and overlap length as measures of similarity, we found no significant alignments other than 1570561_at (data not shown). We therefore conclude that 1570561_at is probably the only probeset on this particular array that aligns with mycoplasma sequences.
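As a rough illustration of the similarity measure used in this screen, the following Python sketch computes a Smith-Waterman local alignment score between a query and a subject sequence. The scoring parameters (match +2, mismatch -1, linear gap -2) and the toy sequences are assumptions chosen for illustration; they are not the settings of the FASTA program and not real target or genome sequences.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory grows with len(subject) only.
    """
    best = 0
    previous = [0] * (len(subject) + 1)
    for q in query:
        current = [0]
        for j, s in enumerate(subject, start=1):
            diagonal = previous[j - 1] + (match if q == s else mismatch)
            up = previous[j] + gap        # gap in the subject
            left = current[j - 1] + gap   # gap in the query
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Hypothetical toy sequences, not real target or genome sequences.
target = "ACGTTGCA"
genome_fragment = "TTTTACGTTGCATTTT"
unrelated = "GGGGGGGGGGGGGGGG"

score_related = smith_waterman_score(target, genome_fragment)
score_unrelated = smith_waterman_score(target, unrelated)
print(score_related, score_unrelated)
```

A target embedded in the subject scores far higher than one sharing only isolated bases, which is the kind of contrast a genome-wide screen of target sequences relies on.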
To assess whether high relative signals for this probeset could indicate mycoplasma contamination, we downloaded data from the Gene Expression Omnibus (GEO) database at NCBI (www.ncbi.nlm.nih.gov/geo), which stores data from microarray studies and makes them publicly available and searchable. We wanted to find out whether (i) any samples showed high expression levels for this probeset, and (ii) the majority of samples with high expression for 1570561_at were from cultured cells rather than from non-cultured cells or tissues, which are less likely to be contaminated. GEO contained expression data from 2757 samples from HG-U133 Plus 2.0 hybridizations at the time of download (February 2007).
We analyzed a randomly selected subset of these samples to estimate the background frequency of cultured samples. For each of the 1801 samples in this subset, we read the sample description and assigned the sample to either a ‘cultured’ or a ‘non-cultured’ group, depending on whether the cells in question had been subjected to standard culturing conditions. Cultured samples made up 34% of the total (Supplementary Table S1).
Next, we analyzed the downloaded GEO HG-U133 Plus 2.0 data (2757 samples) and quantile-normalized the probe intensities using the RNAnet facility from the University of Essex (http://bioinformatics.essex.ac.uk/users/wlangdon/rnanet), which allows detailed querying of GEO data sets at the CEL file level (7). The scatterplot in Figure 1 shows that there is a cluster of samples that have high relative expression for 1570561_at; signals from two of the 11 perfect match (PM) probes (the first and the third, PM1 and PM3) were chosen as representative of the probeset and plotted against each other.
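For readers unfamiliar with quantile normalization, the following Python sketch shows the general idea on a few made-up intensities. It is a generic textbook version, not the RNAnet implementation: the function name, the position-based tie handling and the example values are all assumptions for illustration.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length intensity lists, one list per array.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position, which is a simplification.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Mean intensity at each rank across all arrays.
    sorted_columns = [sorted(s) for s in samples]
    rank_means = [
        sum(col[r] for col in sorted_columns) / len(samples)
        for r in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three arrays and four probes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
result = quantile_normalize(arrays)
for row in result:
    print(row)
```

After normalization every array has the same distribution of values, so intensities for a single probe can be compared across samples from different studies.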
Using mean PM intensity as an expression measure, we ranked the samples by relative expression level of 1570561_at; the 33 samples with the highest overall signals for 1570561_at comprised 31 cultured and two non-cultured samples (Supplementary Table S2). This cultured fraction (94%) of high-expressing samples is significantly higher than the observed background frequency of 34% (chi-squared test; P < 0.001). Our data suggest that high expression of this probeset correlates with the act of physically culturing a sample. Coupled with the fact that the probeset has a high degree of similarity with rRNA sequences from many species of mycoplasma, we propose that some of the samples stored in GEO may have been derived from mycoplasma-contaminated cell cultures.
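The enrichment of cultured samples can be re-checked from the counts quoted above. The Python sketch below runs a chi-squared goodness-of-fit test with one degree of freedom, treating the 34% background frequency as a fixed expected rate. That treatment, and the function name, are assumptions; the exact form of the test used in the analysis is not stated.

```python
import math


def chi2_goodness_of_fit(observed_cultured, observed_total, background_fraction):
    """One-degree-of-freedom chi-squared test against a fixed background rate."""
    observed = [observed_cultured, observed_total - observed_cultured]
    expected = [
        observed_total * background_fraction,
        observed_total * (1.0 - background_fraction),
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Counts quoted in the text: 31 of the top 33 samples cultured, 34% background.
statistic, p_value = chi2_goodness_of_fit(31, 33, 0.34)
print(f"chi-squared = {statistic:.1f}, P = {p_value:.1e}")
```

With these counts the statistic comes out near 53 on one degree of freedom, so the P value is far below 0.001.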
In conclusion, we suggest that (i) the 1570561_at probeset originates not from the human genome but from mycoplasma cDNA, (ii) it may detect the presence of mycoplasma RNA in a human microarray sample, and (iii) high expression levels for this probeset may be used as a post-hybridization biomarker of mycoplasma infection in samples processed on HG-U133 Plus 2.0 arrays. Furthermore, it is possible that a subset (perhaps ~1%) of all the data stored in the GEO database has been compromised by mycoplasma infection in the manner detailed by Miller et al. (2). Proving these hypotheses would require further experiments, such as hybridizing different species of mycoplasma at known levels of contamination to HG-U133 Plus 2.0 arrays, or subjecting archived array hybridization cocktails to molecular mycoplasma detection assays such as nucleic acid amplification techniques (NAT) (3,4). However, we do not advocate this probeset as a replacement for standard screening and detection of mycoplasma in cell cultures.
The authors specifically wish to thank Tanya Barrett and Alexandra Soboleva from the NCBI GEO database team for their help in supplying custom-formatted expression data. We also acknowledge the scientific contribution of Adeel Riaz and Mahvash Tavassoli at King's College London in helping to identify the probeset.
The authors declare no competing interests.
Address correspondence to Matthew Arno, Genomics Centre, School of Biomedical and Health Sciences, King's College London, Franklin-Wilkins Building, 150 Stamford Street, London, SE1 9NH, UK. email: matthew.ar[email protected]