^{1,2}, Anders Nordgaard^{2,3}, Birgitta Rasmusson^{2}, Ricky Ansell^{2,4}, and Peter Rådström^{1}

**A statistical quality model for forensic DNA profiles**

The tool developed to assess the quality of forensic DNA profiles is based on principal component analysis (PCA) (18) using three factors: (*i*) the total peak height (TPH) of the capillary electrophoresis electropherograms [i.e., the sum of the heights of the observed STR peaks, given in relative fluorescence units (rfu)], (*ii*) the mean local balance (MLB) (i.e., the mean of the intra-locus balances, which capture discrepancies between peak heights within a heterozygous STR marker), and (*iii*) the Shannon entropy (SH) (19) (i.e., a measure of inter-locus balance, capturing discrepancies between markers in their summed peak heights). The higher the TPH, the higher the quality of the forensic DNA profile, provided that fluorescence saturation is avoided by not overloading the sample. The intra-locus balance for a heterozygous marker is defined as the height of the lower peak divided by that of the higher peak, giving a marker-specific ratio between 0 and 1. For a homozygous marker, the measure is defined as 1, while for a false homozygous marker it is defined as the lowest balance obtained within the calibration set of electropherograms. High values of this measure therefore imply good balance. The MLB is a global measure of local balance, obtained by calculating the mean of these measures over all the markers analyzed. In this way, the MLB is consistent with the TPH in that the higher the value, the higher the quality. The Shannon entropy was defined as

$$\mathrm{SH} = -\sum_{i=1}^{M} p_i \ln p_i$$

where *p*_{i} is the proportion of the TPH contributed by the summed peak heights in STR marker *i*, and *M* is the number of STR markers investigated (in this case 10). SH varies between 0 and ln(*M*), where 0 is obtained when only one marker has observable peaks, and ln(*M*) is obtained when the sums of the peak heights in all markers are equal. Here, 10 STR markers are used, giving 2.30 as the highest possible value for SH.
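As an illustration, the three measures can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the profile layout (a dictionary mapping each marker to its list of observed peak heights), the function names, and the `false_homozygous` and `floor_balance` parameters are all assumptions made here. The entropy uses the standard Shannon form, the negative sum of *p*_{i} ln *p*_{i}, which reproduces the stated range of 0 to ln(*M*).

```python
import math

def total_peak_height(profile):
    """TPH: sum of all observed peak heights (rfu) over all markers."""
    return sum(sum(peaks) for peaks in profile.values())

def mean_local_balance(profile, false_homozygous=(), floor_balance=0.0):
    """MLB: mean intra-locus balance over all markers.

    Assumptions of this sketch: each marker has one or two peaks;
    `false_homozygous` names markers flagged as false homozygotes, and
    `floor_balance` stands in for the lowest balance in the calibration set.
    """
    balances = []
    for marker, peaks in profile.items():
        if marker in false_homozygous:
            balances.append(floor_balance)
        elif len(peaks) == 2:
            balances.append(min(peaks) / max(peaks))  # lower / higher peak
        elif len(peaks) == 1:
            balances.append(1.0)  # homozygous marker
        else:
            raise ValueError(f"marker {marker!r}: expected 1 or 2 peaks")
    return sum(balances) / len(balances)

def shannon_entropy(profile):
    """SH: -sum(p_i * ln p_i), p_i being each marker's share of the TPH."""
    tph = total_peak_height(profile)
    sh = 0.0
    for peaks in profile.values():
        p = sum(peaks) / tph
        if p > 0:  # a marker without peaks contributes nothing
            sh -= p * math.log(p)
    return sh

# Illustrative three-marker profile (the peak heights are made up).
example = {"D3S1358": [1200.0, 1000.0], "vWA": [900.0], "FGA": [400.0, 800.0]}
print(total_peak_height(example), mean_local_balance(example), shannon_entropy(example))
```

With ten equally loaded markers, `shannon_entropy` returns ln(10), about 2.30, which matches the maximum quoted in the text.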

The measures TPH, MLB, and SH can be used separately or combined to form a univariate measure, according to the linear combination

$$a_1 \cdot \mathrm{TPH} + a_2 \cdot \mathrm{MLB} + a_3 \cdot \mathrm{SH}$$

where *a*_{1}, *a*_{2}, and *a*_{3} are chosen constants. PCA was used as a data reduction method, and the first principal component was found to provide sufficient discrimination between higher- and lower-quality electropherograms/DNA profiles (i.e., the drop in eigenvalues between successive principal components was large enough that the first component would suffice). This was shown by carrying out PCA on a calibration set consisting of 446 representative DNA samples showing high-quality DNA profiles. Each of the original variables, TPH, MLB, and SH, was standardized (by subtracting its sample mean and dividing by its sample standard deviation in the calibration set) before PCA was applied. The standardized variables are denoted tph, mlb, and sh. The coefficients of the three standardized variables in the first principal component were all found to be positive, which confirms that they form a basis for a final measure. At this stage, the coefficients only reflect the correlations between the original measures TPH, MLB, and SH within the calibration set.

To enhance the discriminating power of this measure, we applied a manual grading scale from 1 to 20. This scale was based on the knowledge of SKL's experienced reporting officers of the relationship between TPH and a high-quality electropherogram/DNA profile, combined with a measure of how close MLB and SH are to their respective maxima. Each profile in the calibration set was graded using this scale, and cross-validation (20) was applied to ‘shrink’ the coefficients obtained for the retained principal component (PC) so that the component becomes an optimal linear predictor of the scores on the grading scale. More specifically, if

*a*_{1}, *a*_{2}, and *a*_{3} are the coefficients of the first principal component obtained from the calibration set, and *g* denotes the score on the grading scale for a profile, the modified version (PC_{s}) becomes

$$\mathrm{PC}_s = c_1 a_1 \cdot \mathrm{tph} + c_2 a_2 \cdot \mathrm{mlb} + c_3 a_3 \cdot \mathrm{sh}$$

where *c*_{1}, *c*_{2}, and *c*_{3} are constants chosen so that the sum of squared leave-one-out cross-validation prediction errors obtained from a linear prediction model *b*_{0} + *b*_{1} · PC_{s} applied to all profiles is minimized (*b*_{0} and *b*_{1} being the least-squares estimated parameters of a linear regression model).
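The calibration procedure described above (standardization, extraction of the first principal component, and leave-one-out tuning of the shrinkage constants) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' procedure: the text does not say how *c*_{1}, *c*_{2}, and *c*_{3} were searched for, so a coarse grid search is used here as a stand-in, the first principal component is obtained by power iteration on the correlation matrix, and all function names and the `grid` parameter are hypothetical.

```python
import math
from itertools import product

def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def first_pc_coefficients(columns, iterations=500):
    """Leading eigenvector of the correlation matrix (power iteration).

    `columns` must already be standardized.
    """
    n, k = len(columns[0]), len(columns)
    corr = [[sum(x * y for x, y in zip(columns[i], columns[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    vec = [1.0] * k
    for _ in range(iterations):
        vec = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def loo_squared_error(x, g):
    """Sum of squared leave-one-out errors for the model g = b0 + b1 * x."""
    total = 0.0
    for i in range(len(x)):
        xs, gs = x[:i] + x[i + 1:], g[:i] + g[i + 1:]
        mx, mg = sum(xs) / len(xs), sum(gs) / len(gs)
        sxx = sum((a - mx) ** 2 for a in xs)
        b1 = sum((a - mx) * (b - mg) for a, b in zip(xs, gs)) / sxx
        b0 = mg - b1 * mx  # least-squares fit without profile i
        total += (g[i] - (b0 + b1 * x[i])) ** 2
    return total

def fit_shrinkage(tph, mlb, sh, grades, grid=(0.25, 0.5, 0.75, 1.0)):
    """Grid-search c1, c2, c3 (an assumed search strategy) minimizing LOO error."""
    cols = [standardize(tph), standardize(mlb), standardize(sh)]
    a = first_pc_coefficients(cols)
    best_err, best_c = None, None
    for c in product(grid, repeat=3):
        pcs = [sum(ci * ai * col[r] for ci, ai, col in zip(c, a, cols))
               for r in range(len(grades))]
        err = loo_squared_error(pcs, grades)
        if best_err is None or err < best_err:
            best_err, best_c = err, c
    return a, best_c, best_err
```

Because the regression parameters *b*_{0} and *b*_{1} absorb any common scaling of PC_{s}, only the ratios between the three constants affect the cross-validation error in this sketch; the grid therefore contains redundant combinations, which is acceptable for illustration.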

As PC_{s} is calculated using standardized values of TPH, MLB, and SH, its values will vary around zero. To make the measure easier to interpret and to obtain a well-defined zero, the values of PC_{s} are transformed by adding the expression

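$$c_1 a_1 \frac{\overline{\mathrm{TPH}}}{s_{\mathrm{TPH}}} + c_2 a_2 \frac{\overline{\mathrm{MLB}}}{s_{\mathrm{MLB}}} + c_3 a_3 \frac{\overline{\mathrm{SH}}}{s_{\mathrm{SH}}}$$

where $\overline{\mathrm{TPH}}$, $\overline{\mathrm{MLB}}$, and $\overline{\mathrm{SH}}$ are the sample means and *s*_{TPH}, *s*_{MLB}, and *s*_{SH} are the sample standard deviations of the three original measures in the calibration set. We call the resulting transformed measure the forensic DNA profile index (FI). With the calibration set used in this study, FI was found to be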