The regulation of cytosine methylation in CpG dinucleotides of DNA has been recognized as an important control mechanism in the development and differentiation of an organism (1). Aberrant DNA methylation within CpG islands is one of the earliest and most pronounced epigenetic alterations in human malignancies (2). There is substantial evidence that changes in DNA methylation occur during the stages of carcinogenesis. Such studies cover both global changes in DNA methylation and changes in methylation pattern at restricted CpG dinucleotide sites in specific genes (3). A variety of diagnostic methods are available to determine the methylation status of DNA molecules. The bisulfite genomic sequencing technique (4) has found wide acceptance for determining DNA methylation status because it reveals methylation unambiguously at single-nucleotide resolution. The method is based on the selective chemical deamination of cytosine to uracil by bisulfite, whereas 5-methylcytosine (mC) residues remain unchanged under the same conditions. The bisulfite-modified DNA sequences are amplified by PCR and then sequenced by conventional methods. During PCR amplification, the uracil residues in the bisulfite-modified DNA fragments are read as thymine, which pairs with adenine in the complementary strand. Thus, CpG methylation status can be obtained by comparing the sequences of the PCR-amplified products with a computer-generated bisulfite-modified DNA sequence. Several internet services and computer software packages are available for DNA methylation studies (5,6,7,8,9,10).
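The conversion rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any of the cited packages: the function name `bisulfite_convert` and its `cpg_methylated` parameter are hypothetical, and the code treats only the given strand, assuming every CpG cytosine is either fully methylated or fully unmethylated.

```python
def bisulfite_convert(seq: str, cpg_methylated: bool = True) -> str:
    """Return the sequence expected after bisulfite treatment and PCR.

    Unmethylated cytosines are deaminated to uracil and read as T after PCR.
    Assumption: when cpg_methylated is True, every cytosine in a CpG
    dinucleotide is methylated and therefore stays C.
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        if base == "C":
            # A cytosine belongs to a CpG when the next base is G.
            in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
            converted.append("C" if (in_cpg and cpg_methylated) else "T")
        else:
            converted.append(base)
    return "".join(converted)


# Fully methylated template: only non-CpG cytosines become T.
methylated_template = bisulfite_convert("ACGTCCAG")
# Fully unmethylated template: every cytosine becomes T.
unmethylated_template = bisulfite_convert("ACGTCCAG", cpg_methylated=False)
```

Comparing a sample read against the two templates position by position then shows, for each CpG, whether the cytosine was protected (methylated) or converted (unmethylated).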
In our laboratory, we previously developed a software utility, CpG Analyzer, to help researchers generate bisulfite-modified DNA sequences, highlight CpG dinucleotides within the DNA sequence text, and obtain detailed information about the locations of CpG sites. This information is essential for choosing the segment to be studied and the primers to be used in the preliminary stages of DNA methylation studies (11).
As noted above, the CpG methylation status can be obtained by sequence alignment at the final analysis stage. Currently, sequence alignment is a tedious, time-consuming, and error-prone process. The data management in this process involves: (i) generating the template DNA sequence from the original reference sequence. This template is a methylated bisulfite-modified DNA sequence that is not immediately available. (ii) Checking each sample sequence manually to delete insertions and mark deletions. (iii) Finding disqualified sample sequences. Defective DNA sequence files that contain unannotated nucleotides (represented as DNA base n) in certain segments should be excluded from the analysis. Sample sequence files showing incomplete C to T conversion should be excluded as well, since including them would distort the picture of the methylation status of a gene. And (iv) aligning the sequences to find the CpG methylation pattern.
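Step (iii) can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the procedure of any particular program: the sample is assumed to be already aligned to the unconverted reference with the same length and no gaps, and the function names and the thresholds `max_n` and `min_efficiency` are hypothetical values chosen for the example.

```python
from typing import Optional


def non_cpg_c_positions(reference: str) -> list:
    """Indices (0-based) of cytosines that are not followed by G."""
    ref = reference.upper()
    return [i for i, b in enumerate(ref) if b == "C" and ref[i + 1:i + 2] != "G"]


def conversion_efficiency(reference: str, sample: str) -> Optional[float]:
    """Fraction of non-CpG cytosines read as T in the sample.

    Returns None when no non-CpG cytosine can be evaluated.
    """
    if len(reference) != len(sample):
        raise ValueError("sample must be pre-aligned to the reference")
    sample = sample.upper()
    converted = unconverted = 0
    for i in non_cpg_c_positions(reference):
        if sample[i] == "T":
            converted += 1
        elif sample[i] == "C":
            unconverted += 1
        # Any other base (for example N) is ignored here.
    total = converted + unconverted
    return converted / total if total else None


def is_qualified(reference: str, sample: str,
                 max_n: int = 0, min_efficiency: float = 0.95) -> bool:
    """Apply the two exclusion criteria: ambiguous bases and poor conversion.

    The default thresholds are illustrative assumptions only.
    """
    if sample.upper().count("N") > max_n:
        return False
    efficiency = conversion_efficiency(reference, sample)
    return efficiency is not None and efficiency >= min_efficiency
```

For the reference `ACGTCCAGCA`, the non-CpG cytosines are at indices 4, 5, and 8. A read `ACGTTTAGTA` converts all three and passes, whereas `ACGTCTAGTA` leaves one unconverted and is rejected at the assumed 95% threshold.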
One characteristic feature of the final stage is the large number of sample sequences to be compared; dozens of sequence files may come from a single DNA sample. To make the methylation analysis reliable, several bisulfite treatments should be performed for each sample, each followed by PCR. Generally speaking, the PCR products of these DNA samples cannot be sequenced directly because they are heterogeneous. Several factors may contribute to this. The sample may be obtained from a mixture of different cell types, each of which may have a different methylation status. In imprinted genes, maternal and paternal DNA strands with different methylation status coexist in the same sample. In certain cases, not all of the non-CpG-C residues in the DNA sample are modified during the bisulfite treatment, and the sample becomes a mixture of sequences with different levels of C to T conversion. All of these DNA strands can be amplified by the same pair of primers during PCR. Therefore, the PCR products should be cloned into suitable plasmids and the chimeric plasmids then sequenced. This procedure generates an enormous number of sample DNA sequence files and also adds extra plasmid-derived bases to each sequence.
As for the sequence alignment, some software packages available on the internet or commercially, such as BioEdit, ClustalW, and MegAlign of DNAStar, offer an option to align multiple DNA sequences for comparison and thus can be used for this purpose. However, since these packages are not designed specifically for DNA methylation analysis, they have several shortcomings: (i) information on the conversion of all non-CpG-C residues in a DNA sequence, which is essential for estimating the efficiency of bisulfite C to T conversion and for excluding disqualified sequence files from the analysis, is difficult to obtain. (ii) The CpG dinucleotides and the non-CpG-C residues in the template DNA sequence text cannot be highlighted in the alignment window, although highlighting is necessary for a quick visual inspection of CpG methylation and C to T conversion efficiency in a long DNA sequence file. (iii) The nucleotide numbering usually starts at 1 and therefore does not correspond to the numbering of the parent sequence, which creates confusion when trying to locate CpG dinucleotides in a DNA sequence file. (iv) The template sequence and the sample sequences to be aligned are located in the same text box and scroll together. When dozens of sequence files are aligned, the template sequence scrolls out of the window, so an individual sample sequence and the template cannot be seen in the same window. This makes visual comparison difficult. (v) In some of the software, every base of each sequence is shown in the alignment text, which makes differences at CpG positions harder to identify.
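To make points (iii) and (v) concrete, the following Python sketch reports only the CpG positions, numbered in the coordinates of the parent sequence. It is a hypothetical illustration, not a description of any existing package: the function names, the `start` parameter, and the M/U/? call codes are assumptions, and the samples are assumed to be pre-aligned to the unconverted reference without gaps.

```python
def cpg_positions(reference: str) -> list:
    """Indices (0-based) of the cytosine in every CpG dinucleotide."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def methylation_table(reference: str, samples: list, start: int = 1) -> dict:
    """Map each CpG cytosine, in parent numbering, to one call per sample.

    start is the parent-sequence coordinate of the first reference base
    (an assumption of this sketch; the fragment must not span a numbering
    discontinuity). Calls: "M" = C retained (methylated), "U" = read as T
    (unmethylated), "?" = any other base or a missing position.
    """
    calls = {"C": "M", "T": "U"}
    table = {}
    for i in cpg_positions(reference):
        row = []
        for sample in samples:
            base = sample[i].upper() if i < len(sample) else "?"
            row.append(calls.get(base, "?"))
        table[start + i] = "".join(row)
    return table


# Example: a fragment whose first base is position 201 of the parent sequence.
table = methylation_table(
    "TACGATCGA",
    ["TACGATTGA", "TATGATTGA"],
    start=201,
)
```

In this example the two CpG cytosines fall at parent positions 203 and 207. The first is methylated in the first clone only and the second is unmethylated in both, so the table shows the pattern without displaying every base of every sequence.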