An improved Huffman coding method for archiving text, images, and music characters in DNA
 
Menachem Ailenberg and Ori D. Rotstein
Departments of Surgery, University of Toronto, and St. Michael's Hospital, Li Ka Shing Knowledge Institute, Keenan Research Centre, Toronto, Ontario, Canada
BioTechniques, Vol. 47, No. 3, September 2009, pp. 747–754
Abstract

An improved Huffman coding method for information storage in DNA is described. The method uses a modified, unambiguous base assignment that enables efficient coding of characters. A plasmid-based library that allows efficient and reliable information retrieval and assembly with uniquely designed primers is described. We illustrate our approach by synthesizing DNA that encodes text, images, and music, which could easily be retrieved by DNA sequencing using the specific primers. The method is simple and lends itself to automated information retrieval.

Introduction

The increasing use of digital technology presents a challenge for existing storage capabilities. The need for a reliable and long-term solution for information storage is further heightened by the prediction that current magnetic and optical storage media will become unrecoverable within a century or less (1). DNA is a compact, long-term, and proven medium for information storage. Indeed, over the last few decades, a good case has been made for storing crucial information in DNA (2). Desirable properties of DNA include its capacity for long-term information storage and recovery, which are mostly independent of technological changes, the ability to conceal data in a miniaturized fashion, and its ability to be transferred, when required, via self-propagation (1,2,3,4,5,6). Various approaches for information coding in DNA have been reported, including the Huffman code, the comma code, and the alternating code (4), a straight coding based on 3 bases per letter (1,2,6), or sequential conversion of text to keyboard scan codes followed by conversion to hexadecimal code and then conversion to binary code with a designed nucleotide encryption key (5). Each approach offers advantages and inherent difficulties, and the approaches differ in how economically they use nucleotides. We sought to develop an alternate approach for information archiving in DNA. We used the principles of the Huffman code (4,7) to define DNA codons for the entire keyboard, for unambiguous information coding. The approach described in this manuscript is based on the construction of a plasmid library for information archiving, with specially designed primers embedded in the message segment in an exon/intron structure for rapid, reliable, and efficient information retrieval.

Materials and methods

The DNA coding was based on a modification of the Huffman code (2,4,7,8). We also adopted the nomenclature suggested by Cox (2), defining the DNA segment that represents a single character as a ‘codon’. DNA (844 bp; Figure 1A) was synthesized and inserted as a SacI/KpnI fragment into a pBluescript-based plasmid (Mr. Gene GmbH, Regensburg, Germany). The manufacturer confirmed the sequence of the supplied plasmid using a plasmid universal primer. For information retrieval, plasmid (300 ng/7 µL) was mixed with sequencing primer (5 pmol/0.7 µL; Sigma, Oakville, Ontario) (Figure 1B) and subjected to sequencing (service was performed by The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada). The chromatogram was created using the FinchTV 1.4 application (Geospiza Inc., Seattle, WA, USA). Sequences of designed and sequenced DNA were aligned using bl2seq (NCBI, Bethesda, MD, USA). PCR amplification was performed in an iQ5 cycler (Bio-Rad Laboratories, Mississauga, ON, Canada). The reaction mixture contained 2 units Taq polymerase with 1× reaction buffer (New England BioLabs, Pickering, ON, Canada), 0.2 mM each dNTP (Fermentas, Burlington, ON, Canada), 0.3 mM each primer, 200 ng plasmid DNA, and UltraPure distilled water (Invitrogen, Burlington, ON, Canada) to a volume of 20 µL. PCR conditions were 94°C for 3 min; 30 cycles of 94°C for 30 s, 55°C for 30 s, and 72°C for 60 s; then a final extension at 72°C for 7 min; and hold at 4°C. Ten microliters of the PCR reaction was mixed with 2 µL 6× loading buffer (Fermentas). DNA fragment size was determined by loading 5 µL 100-bp DNA ladder (Fermentas) in parallel and resolving the samples on a 1% agarose gel (Bioshop, Burlington, ON, Canada). The gel was visualized with UV transillumination, and the image was captured with a Biospectrum AC Imaging System (UVP, Upland, CA, USA).





Results and discussion

Rationale for the improved Huffman method

One of the prerequisites for a good DNA coding method is the economical use of nucleotides per character. In the improved Huffman coding described in this paper, the bases-to-character ratio was ~3.5. This ratio is more economical than those of previously described methods that enable coding of the entire keyboard [e.g., the comma or alternating codes (6 bases/character) (4) or sequential encryption (~5.3 bases/character) (5)]. It should be noted that other coding methods yielded lower bases-to-character ratios, but these approaches were limited to a small number of characters, usually sufficient only for encoding text in the English alphabet (1,2,6). Information stored in the DNA of living organisms can be lost through breakage by mutation, deletion, and insertion of DNA (9). Yachie et al. (5) described an alignment-based approach for prolonged and reliable information storage in DNA in living organisms. This approach provides efficient recovery of stored data even from damaged DNA (9). While the size of foreign DNA inserted into a living organism may be limited (2), recent studies have demonstrated successful insertion of large pieces of foreign DNA into living organisms (reviewed in Reference 9). Nevertheless, the unique codon assignment of the Huffman code, as described by Smith et al. (4) and extended in this study, readily identifies any frame shift that results from mutation in DNA or errors in sequencing. Also, information stored in naked or plasmid DNA is not subject to the mutations that can arise from the additional step of inserting the encoding DNA into a living organism. Therefore, we describe in this communication an improved Huffman coding with unique primer design using plasmid DNA libraries.
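The bases-to-character comparison above can be illustrated with a minimal sketch of a base-4 Huffman code. This is not the codon table used in this study: the sample text, the character frequencies derived from it, and the names build_dna_huffman, encode, decode, and fixed_length are all hypothetical and chosen only for illustration. The sketch shows that frequent characters receive shorter codons, that the resulting codons are prefix-free, and how the average number of bases per character compares with a fixed-length assignment for the same character set.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"


def build_dna_huffman(freqs):
    """Build a base-4 Huffman code {char: codon} from {char: weight}."""
    tie = count()  # unique tie-breaker so heap never compares tree nodes
    heap = [(w, next(tie), ch) for ch, w in freqs.items()]
    # Pad with zero-weight dummies so every merge combines exactly 4 nodes.
    while (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(g[0] for g in group)
        heapq.heappush(heap, (weight, next(tie), [g[2] for g in group]))

    code = {}

    def walk(node, prefix):
        if isinstance(node, list):          # internal node: one branch per base
            for base, child in zip(BASES, node):
                walk(child, prefix + base)
        elif node is not None:              # real character (dummies are None)
            code[node] = prefix

    walk(heap[0][2], "")
    return code


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(dna, code):
    inverse = {codon: ch for ch, codon in code.items()}
    longest = max(len(c) for c in code.values())
    out, buf = [], ""
    for base in dna:
        buf += base
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
        elif len(buf) >= longest:
            # No codon starts this way: may indicate a frame shift or read error.
            raise ValueError("unassigned codon encountered")
    if buf:
        raise ValueError("sequence ends mid-codon")
    return "".join(out)


def fixed_length(n_symbols):
    """Smallest codon length L with 4**L >= n_symbols."""
    length = 1
    while 4 ** length < n_symbols:
        length += 1
    return length


# Hypothetical sample message; frequencies are taken from the message itself.
SAMPLE = "an improved huffman coding method for archiving text in dna"
code = build_dna_huffman(Counter(SAMPLE))
dna = encode(SAMPLE, code)
bases_per_char = len(dna) / len(SAMPLE)
fixed_len = fixed_length(len(code))
print(f"Huffman: {bases_per_char:.2f} bases/char; fixed-length: {fixed_len}")
```

Because the frequencies here come from one short message, the ratio printed by this sketch is not comparable to the ~3.5 bases per character reported above for the entire keyboard. Note also that a decoder of this kind flags only those errors that lead to an unassigned or truncated codon; some insertions or deletions may still decode to valid but incorrect characters.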
