Since ReFLX files offer unique analysis challenges and opportunities not present in single-sample data collection modes, we implemented several additional features common to high-throughput multiwell-based RNAi screening for ReFLX file analysis. First, the mean of a user-selected parameter from each well is plotted in an 8 × 12 matrix heat map that is color-coded by well value (Figure 3B). This visualization strategy is a useful way to compare the data across a plate and often helps in the identification of plate edge effects, a common confounder in high-throughput RNAi screening (11). Second, instead of normalizing to a single negative control sample (as we do for single-sample data analysis), COPAmulti takes advantage of the large number of samples and uses the plate mean (calculated from the median 80% of nonzero value samples to remove effects of outliers) as the negative control value. This approach is a well-accepted data normalization strategy for multiwell plate assays that can be uniformly applied across all plates (11). In addition to this normalization strategy, we also implemented a second approach (COPAmulti V2) that allows users to define the well(s) that contain negative control data through the COPAmulti GUI (Figure 4B). Using these calculated negative control reference values, we implemented three common statistical tests for hit identification that have been previously utilized in RNAi screening formats: (i) mean ± k SD; (ii) median ± k MAD; and (iii) the multiple-comparison t-test with Bonferroni correction. The significance test and the threshold for each test are set within the user-adjustable GUI. Each test has specific strengths and weaknesses and in some cases may not represent the best statistical approach for data analysis. 
Nonetheless, these methods are among the most commonly used approaches for analysis of high-throughput RNAi screening data (11), and the best approach is usually to compare results obtained with each statistical method. In general, the mean ± k SD test is the most commonly used hit identification technique for RNAi screening, due to its ease of calculation (12,13). Most screeners utilize a 3-SD cutoff with this approach. However, this method is sensitive to outlier data and frequently misses weaker positives. Lowering the SD cutoff usually raises the false-positive rate to an unacceptable level. An alternative approach is the median ± k MAD test. Like the mean ± k SD test, the MAD test is relatively easy to calculate, but it is much less sensitive to outlier data. MAD also does a good job of identifying weak hits while controlling false positives (14). A shortcoming of MAD is that it is not easily linked to probability distributions and P values. Despite this shortcoming, others have recommended MAD as the method of choice for hit selection in high-throughput RNAi screens (14). Cutoffs of ≥2 MAD are commonly used for hit identification in genome-wide RNAi screens (14). A final common statistical test for RNAi screening is the multiple-comparison t-test. This statistic is easy to calculate (due to the large number of events in each well), but is extremely sensitive to outliers and requires multiple-comparison correction (11). For multiple-comparison t-tests, the simplest form of correction is the Bonferroni correction, which divides the desired P value threshold by the number of comparisons (here, the number of samples) to obtain an equivalent per-comparison P value threshold. Bonferroni-corrected P values for common thresholds are listed in Table 1. In general, users should analyze their data with each statistical approach and utilize the method or combination of methods that most frequently identifies known positive controls. 
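The plate-mean normalization and the first two hit-identification tests described above can be illustrated with a short sketch. This is written in Python for illustration only and is not the COPAmulti MATLAB code. Several details are assumptions: the "median 80%" is read here as the central 80% of sorted nonzero values (10% trimmed from each tail), both tests are applied directly to the normalized well values, and the MAD is multiplied by the conventional 1.4826 consistency factor (set `scale=1.0` for the raw MAD). All function names are hypothetical.

```python
from statistics import mean, median, stdev

def trimmed_plate_mean(values, keep=0.80):
    """Mean of the central `keep` fraction of nonzero well values.
    Assumption: equal trimming from both tails of the sorted values."""
    nonzero = sorted(v for v in values if v != 0)
    if not nonzero:
        raise ValueError("plate has no nonzero wells")
    drop = round(len(nonzero) * (1 - keep) / 2)  # wells dropped per tail
    return mean(nonzero[drop:len(nonzero) - drop])

def normalize_to_plate(values):
    """Express every well relative to the trimmed plate mean."""
    reference = trimmed_plate_mean(values)
    return [v / reference for v in values]

def hits_mean_sd(values, k=3.0):
    """Flag wells at least k sample standard deviations from the mean."""
    center, spread = mean(values), stdev(values)
    return [abs(v - center) >= k * spread for v in values]

def hits_median_mad(values, k=2.0, scale=1.4826):
    """Flag wells at least k (scaled) MADs from the median."""
    center = median(values)
    mad = scale * median([abs(v - center) for v in values])
    return [abs(v - center) >= k * mad for v in values]

# Synthetic 96-well plate (made-up numbers): 94 unremarkable wells,
# one weak hit (index 94) and one extreme outlier (index 95).
plate = [0.9 + 0.2 * i / 93 for i in range(94)] + [1.6, 6.0]
normalized = normalize_to_plate(plate)
sd_hits = [i for i, hit in enumerate(hits_mean_sd(normalized)) if hit]
mad_hits = [i for i, hit in enumerate(hits_median_mad(normalized)) if hit]
```

On this synthetic plate the extreme outlier inflates the standard deviation, so the 3-SD test flags only that well, whereas the 2-MAD test also flags the weaker well. This mirrors the outlier sensitivity discussed above, but it is a constructed example and not screen data.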
A major advantage of our software is that it allows users to rapidly adjust and test each of these statistical methods for hit identification through the simple GUI. For users who wish to perform statistical analysis of their data using other approaches, COPAmulti automatically exports both summarized and raw data to delimited text files for further analysis.
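For readers who want to reproduce the Bonferroni step on exported data, the arithmetic can be sketched as follows. This is an illustrative Python sketch and not the COPAmulti implementation. It assumes that one P value has already been computed per well by a t-test and that the number of comparisons equals the number of wells tested; the function names and the example P values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison P value cutoff that keeps the family-wise
    error rate at or below alpha across n_tests comparisons."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

def bonferroni_hits(p_values, alpha=0.05):
    """Flag wells whose P value passes the corrected cutoff.
    Assumption: every entry in p_values counts as one comparison."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p <= cutoff for p in p_values]

# Made-up P values for one 96-well plate: 93 clearly non-significant
# wells followed by three wells with progressively larger P values.
p_values = [0.5] * 93 + [1e-5, 0.0006, 0.01]
cutoff = bonferroni_threshold(0.05, len(p_values))  # 0.05 / 96
flagged = [i for i, hit in enumerate(bonferroni_hits(p_values)) if hit]
```

With 96 comparisons and a desired threshold of 0.05, the corrected cutoff is 0.05/96, or roughly 0.00052, so in this example only the well with P = 1e-5 is flagged even though all three low P values would pass an uncorrected 0.05 threshold.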
Following statistical analysis, hits meeting user-determined thresholds are binarized in an 8 × 12 matrix, with hits plotted in white and non-hits plotted in black (Figure 3C). We also visualize all data from all plates using a well index plot (Figure 3D). Such plots are useful indicators of screen phenotypic behavior among plates and can help identify plates with phenotypic drift or substantial variance. For example, data in Figure 3 demonstrate lower values toward the end of the plate as compared with the beginning of the plate. Finally, since some users may screen in duplicate, we implemented a separate algorithm, COPAcompare, that allows users to compare results between two plates (Figure 5). COPAcompare plots a user-selected parameter for each well of one user-selected plate against the same well of a second user-selected plate. The degree of overall plate-to-plate correlation is determined by calculating the Pearson correlation coefficient (R), where an R value of 1 indicates perfect correlation among all wells and −1 indicates perfect inverse correlation among all wells.
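The plate-to-plate comparison can be sketched with the standard Pearson formula. The Python below is an illustration only, not the COPAcompare code; it assumes that the two plates are supplied as equal-length lists of per-well values in the same well order, and the function name and example values are hypothetical.

```python
from math import sqrt

def pearson_r(plate_a, plate_b):
    """Pearson correlation coefficient between matched wells of two plates.
    Assumption: index i refers to the same well position in both lists."""
    if len(plate_a) != len(plate_b) or len(plate_a) < 2:
        raise ValueError("plates must have the same number of wells (>= 2)")
    mean_a = sum(plate_a) / len(plate_a)
    mean_b = sum(plate_b) / len(plate_b)
    # Sums of cross-products and squared deviations about each plate mean.
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(plate_a, plate_b))
    s_aa = sum((a - mean_a) ** 2 for a in plate_a)
    s_bb = sum((b - mean_b) ** 2 for b in plate_b)
    if s_aa == 0 or s_bb == 0:
        raise ValueError("R is undefined when a plate has no variance")
    return s_ab / sqrt(s_aa * s_bb)

# Made-up replicate values for four wells.
replicate_1 = [1.0, 2.0, 3.0, 4.0]
same_trend = [2.0, 4.0, 6.0, 8.0]      # rises with replicate_1
opposite_trend = [8.0, 6.0, 4.0, 2.0]  # falls as replicate_1 rises
r_same = pearson_r(replicate_1, same_trend)
r_opposite = pearson_r(replicate_1, opposite_trend)
```

Here `r_same` evaluates to 1 and `r_opposite` to −1 (up to floating-point rounding), matching the two limiting cases described above. Real replicate plates would fall between these extremes.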
We developed a suite of MATLAB-based programs to process large COPAS file data sets such as those associated with C. elegans RNAi screens. We implemented one program, COPAquant, for comparisons among data collected in the single-sample format, which is useful for small-scale screens with larger populations. We also implemented two additional programs, COPAmulti and COPAcompare, that provide more advanced filtering, normalization, and statistical analysis of data from 96-well plates obtained using the COPAS ReFLX system. Both programs allow users to rapidly move from raw COPAS data to graphical data representation, replicate plate comparison, and hit identification without extensive knowledge of or experience with the programming environment. Our software greatly simplifies the analysis of COPAS data and addresses a major need for data analysis tools for high-throughput screening using this platform. While we used this program in the validation steps of an RNAi screen for regulators of a heat shock–inducible reporter in C. elegans, the program is customized to the standard data format output by COPAS Biosort instruments and thus can be used in any type of COPAS application, including data obtained from other organisms.