Abstract
In 1994, two independent groups extracted DNA from several Pleistocene epoch mammoths and noted differences among individual specimens1, 2. Subsequently, DNA sequences have been published for a number of extinct species. However, such ancient DNA is often fragmented and damaged3, and studies to date have typically focused on short mitochondrial sequences, never yielding more than a fraction of a per cent of any nuclear genome. Here we describe 4.17 billion bases (Gb) of sequence from several mammoth specimens, 3.3 billion (80%) of which are from the woolly mammoth (Mammuthus primigenius) genome and thus comprise an extensive set of genome-wide sequence from an extinct species. Our data support earlier reports4 that elephantid genomes exceed 4 Gb. The estimated divergence rate between mammoth and African elephant is half of that between human and chimpanzee. The observed number of nucleotide differences between two particular mammoths was approximately one-eighth of that between one of them and the African elephant, corresponding to a separation between the mammoths of 1.5–2.0 Myr. The estimated probability that orthologous elephant and mammoth amino acids differ is 0.002, corresponding to about one residue per protein. Differences were discovered between mammoth and African elephant in amino-acid positions that are otherwise invariant over several billion years of combined mammalian evolution. This study shows that nuclear genome sequencing of extinct species can reveal population differences not evident from the fossil record, and perhaps even discover genetic factors that affect extinction.
Vertebrate genome sequencing projects have thus far assembled data from at least 28 species5, including chromosomal assemblies of six placental mammals, namely human6, 7, chimpanzee8, rhesus macaque9, mouse10, rat11 and dog12. In contrast, kilobase (kb)-scale genomic sequence data from extinct species were first reported in 2005, with 27 kb from cave bear13 and 13 megabases (Mb) from mammoth14. More recently, two projects reported up to 1 Mb from Neanderthal15, 16, some of which may be modern human contamination17.
Whereas many ancient-DNA studies have used bone samples, in 2007 we showed that DNA with fewer damage-induced substitutions can be extracted from hair shafts collected from permafrost remains18. Moreover, use of hair permits a highly efficient decontamination protocol that leaves the keratin-encased endogenous DNA unharmed. The method resulted in 15 complete mammoth mitochondrial genomes at high coverage18, 19, identified in 947 Mb of total sequence (average read length, 93 bp) from a number of samples. Hair shafts are thus suitable for sequencing ancient nuclear DNA.
We selected M4, a Siberian mammoth specimen carbon-14 dated to 18,545 ± 70 years before present (roughly 20,000 years ago), for extensive sequencing, and generated 2.982 Gb of data from hair shafts using a Roche GS-FLX sequencing instrument. A second mammoth specimen, M25, yielded an additional 239 Mb. Together with our earlier mammoth data, this brought the total to 4.168 Gb of sequence. Given the abundance of hair available from M4 and M25, we were able to enrich the sequenced material for long DNA molecules, to overcome at least partly the high rate of breakage in ancient DNA. The average read length was 142 bp for M4 and 164 bp for M25. As a bonus, we obtained 4,430-fold coverage of the mitochondrial genome of M4, allowing us to determine error rates. (We assumed that errors in nuclear DNA equal those in mitochondrial DNA.) Specifically, for reads trimmed by aligning them to elephant sequence, the total error rate from post-mortem DNA damage and sequencing mistakes was 0.14%, neglecting errors that added or deleted bases (Table 1 and Methods).
To estimate how many of our reads are actual mammoth DNA, we determined the fraction of our sequence that aligns to the African savanna elephant (Loxodonta africana) genome (twofold assembly and sixfold reads), which indicated that 80% of our 4.17 Gb of sequence, that is, approximately 3.3 Gb, is from the mammoth. However, the yield varied substantially between specimens: M4 is 90% mammoth and M25 is only 58% mammoth (Fig. 1). As a negative control, read-sized intervals of the chicken genome20 were mapped to elephant and showed that the false-positive rate is very low. Reasons why the estimation of 80% may actually be low are given in Methods. Some microbial DNA is recognizable in the non-mammoth portion (Fig. 1), but essentially none of the DNA in these samples is human18.
Figure 1: Species composition of metagenomic DNA extracted from mammoth hair.

a, b, Pie charts for the M4 (a) and M25 (b) data sets show the percentage of sequencing reads assigned to taxa for mammoth, Archaea, Bacteria, virus, as well as the two unspecified categories 'other Eukaryota' and 'unidentified sequence'. The taxon distribution exemplifies the variability of the endogenous DNA content of ancient specimens.
The converse result, that is, the fraction of elephant DNA that aligns to our data, can tell us how much of the mammoth genome has been sequenced. Because typical genome sizes of placental mammals are around 3 Gb, our 3.3 Gb might be expected to provide at least onefold coverage, in which case—taking into account overlapping reads21—over 63% of the bases in the mammoth genome would be sequenced at least once. However, the African elephant genome has previously been estimated at between 4.2 and 4.8 Gb using the C-value technique4, which, although less accurate than genome sequencing, has consistently predicted the Afrotherian genomes to be larger than previously sequenced placental genomes. We estimated how much of the mammoth genome has been sequenced by searching for matches to a set of elephant genes in the Ensembl gene build of August 2006 (http://www.ensembl.org) that could be confidently mapped to unique positions on human chromosomes (Fig. 2), and by searching for the so-called ultraconserved regions22. In both cases, around 50% of the bases were found; accounting for multiple reads that include the same genomic position, this translates into 0.7-fold coverage, or that the total length of our true mammoth reads is 70% of the genome's length. Because some of our reads are very short and, hence, difficult to align reliably, this may be an underestimate.
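The coverage figures in this paragraph can be checked with a short sketch. It assumes an idealized Poisson model in which reads land uniformly at random, so that c-fold coverage leaves an expected fraction 1 − e^(−c) of bases sequenced at least once. The model and the function names are our assumptions for illustration; the calculation actually used in the paper (ref. 21 and Methods) may differ in detail.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases sequenced at least once at c-fold coverage,
    assuming reads are placed uniformly at random (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def coverage_from_fraction(fraction: float) -> float:
    """Invert the Poisson model: fold coverage implied by a covered fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must lie in [0, 1)")
    return -math.log(1.0 - fraction)


# Onefold coverage: a little over 63% of bases seen at least once.
print(f"{fraction_covered(1.0):.3f}")        # 0.632

# About 50% of gene and ultraconserved bases found: roughly 0.7-fold coverage.
print(f"{coverage_from_fraction(0.5):.3f}")  # 0.693
```

Under this model, both numbers quoted in the text (over 63% at onefold coverage, and 0.7-fold coverage from a 50% hit rate) follow from the same formula read in opposite directions.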
Figure 2: Sequenced mammoth orthologues of human genes.

a, Plot showing the number of RefSeq genes on each human chromosome (open white rectangles), the average fraction of protein-coding bases that align to Roche/454 reads from James D. Watson's genome30 (green), and the fraction of coding bases that align to one or more mammoth reads (red), using Ensembl-predicted elephant genes that map to the human chromosome—approximately 50% for each autosome, but only 31% for chromosome X as M4 was male (see Methods). b, Lengths of the chromosomes in a.
Our estimates that 80% of our 4.17 Gb of sequence is from mammoth and that we have obtained 0.7-fold coverage are consistent with a genome size of 4.7 Gb, as 4.17 × 0.8 ≈ 3.3 ≈ 4.7 × 0.7. However, this estimation of genome size is at best a rough approximation. On the other hand, we observed the probable cause of the expanded genome, namely an unusually high fraction of interspersed repeats (Supplementary Information).
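The consistency check behind this genome-size estimate is simple enough to sketch. The three inputs below are taken from the text; the function name is hypothetical. Note that the unrounded product gives a value nearer 4.8 Gb, whereas the 4.7 Gb in the text follows from using the rounded 3.3 Gb, which illustrates why the estimate is described as rough.

```python
def implied_genome_size_gb(total_gb: float,
                           mammoth_fraction: float,
                           fold_coverage: float) -> float:
    """Genome size implied by total sequence, endogenous fraction and coverage."""
    mammoth_gb = total_gb * mammoth_fraction  # sequence attributed to mammoth
    return mammoth_gb / fold_coverage         # coverage = mammoth_gb / genome size


total_gb = 4.17          # all reads (from the text)
mammoth_fraction = 0.80  # share aligning to African elephant (from the text)
fold_coverage = 0.7      # estimated coverage (from the text)

mammoth_gb = total_gb * mammoth_fraction
genome_gb = implied_genome_size_gb(total_gb, mammoth_fraction, fold_coverage)

print(f"mammoth sequence: {mammoth_gb:.2f} Gb")      # 3.34 Gb
print(f"implied genome size: {genome_gb:.2f} Gb")    # 4.77 Gb
print(f"from rounded 3.3 Gb: {3.3 / fold_coverage:.2f} Gb")  # 4.71 Gb
```

Either way, the result falls inside the 4.2–4.8 Gb range from the C-value estimates cited earlier.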
As currently understood, the evolutionary relationships among selected living and extinct elephantid species are sketched in Fig. 3; we show parallels with humans and some great apes to provide a widely familiar point of reference. Here we use estimated divergence times, which are times to the common ancestor averaged across the genome. This should be distinguished from population split times or, in the case of distinct species, speciation times. For instance, two modern European humans have a population split time of 0 yr but a mean divergence of at least 500,000 yr. This distinction is important for recent speciation events. For example, the mean divergence time between human and chimpanzee is at least 2 Myr longer than the speciation time23 (see Methods for details). We are interested in comparing sequence identity rates between elephantids and between apes.
Figure 3: Comparison of phylogenies.

a, Elephantids; b, hominoids. We show estimated divergence times, that is, times to the common ancestor averaged across autosomes (see Methods). Red circles at the leaves of the phylogenetic tree indicate discernable species. This distinction was not made for the two clades of mammoth (M4 and M25) based on the fossil record (merged red circles).
To estimate the level of nucleotide identity (ignoring gaps) between M4 and the African elephant, we analysed the large number of elephant positions that have more than one aligning mammoth read, to reduce the effect of errors in our sequence (Methods). The estimated identity is 99.4%. The 0.6% difference rate is approximately half of that estimated between human and chimpanzee (1.24%)24, despite the similarity in divergence times (Fig. 3). This indicates that nucleotide substitutions are fixed in recent elephantid lineages at only half of the rate in great apes and humans, mirroring an earlier observation about mitochondrial DNA25. Using a similar approach (Methods), we estimate that M4 and the African elephant are 99.78% identical at the amino-acid level.
Significantly, among the multiply sequenced differences between M4 and the African elephant, M25 agrees with the African elephant in 13.3% of the cases (that is, 327 of 2,451) in which we had a high-identity alignment to M25. Under the assumption that M4 and the African elephant differ by 15 Myr of evolution (7.5 Myr in each lineage), this corresponds to a separation of about 1.5–2.0 Myr between M4 and M25. We assume that only a small fraction of the differing positions are under selection. We note that a divergence of 1–2 Myr between M4 and M25 was estimated earlier19 on the basis of mitochondrial data, for which M25 agrees with the African elephant in 14.5% of the cases (114 of 792) where M4 disagrees with the African elephant. The concordance between nuclear and mitochondrial data is particularly noteworthy because population-genetic analysis of African elephants has shown that different relationships are inferred from mitochondrial sequence than from nuclear sequence26. M4 and M25 belong to differing clades of mammoths that were identified on the basis of short mitochondrial sequences19, 27. However, morphological criteria distinguishing the two clades have not been established, similar to the case of two phenotypically identical groups of extant brown bears in Sweden that have differing mitotypes and share the same territory28.
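A rough sketch of how the 13.3% figure translates into a separation time is given below. It rests on a simplified reading of the argument that is our assumption, not necessarily the procedure in Methods: substitutions accumulate evenly along the 15 Myr of evolution separating M4 from the African elephant, and positions where M25 sides with the elephant mark changes that arose after M4 and M25 separated. The function names are hypothetical.

```python
def agreement_fraction(agree_with_elephant: int, informative_sites: int) -> float:
    """Fraction of M4-versus-elephant differences at which M25 matches the elephant."""
    return agree_with_elephant / informative_sites


def separation_myr(fraction: float, total_path_myr: float = 15.0) -> float:
    """Scale the agreement fraction by the total evolutionary path length
    between M4 and African elephant (15 Myr assumed in the text)."""
    return fraction * total_path_myr


nuclear = agreement_fraction(327, 2451)  # counts from the text
print(f"nuclear agreement fraction: {nuclear:.3f}")               # 0.133
print(f"implied separation: {separation_myr(nuclear):.2f} Myr")   # 2.00 Myr
```

This simple scaling lands at the upper end of the 1.5–2.0 Myr range quoted in the text; it does not attempt to reproduce the lower bound, nor does it correct for sequencing error or selection.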
A major reason for sequencing the woolly mammoth is to identify functionally important amino-acid differences between mammoth and elephant. It is unclear what fraction of amino-acid differences have functional consequences, but it is likely to be rather small; for instance, one estimation29 is that ~20% of common human amino-acid variants are deleterious. To start looking for such differences, we combined computational criteria (designed to enrich for validity and functional importance) with PCR amplification and Sanger sequencing in M4, M25, African elephant and Indian elephant, Elephas maximus. Our initial screening yielded 92 putative differences (Fig. 4, Supplementary Information) that have also passed additional manual screening for undesirable attributes such as lack of conservation (notably homoplasy) at the critical mammoth position, potential confusion with paralogues, processed and unprocessed pseudogenes, and tandem or other duplicative debris. We found a number of cases in which mammoth differed from an amino acid that appeared to be otherwise invariant throughout placental evolution (Supplementary Information), which may suggest functional significance of the protein position and positive selection in the mammoth lineage.
Figure 4: Experimentally verified amino-acid differences among African elephant, Indian elephant, M4 and M25.

Non-synonymous coding single-nucleotide polymorphisms between African elephant and mammoth identified by computational means, termed single-amino-acid polymorphisms (SAPs), were experimentally verified through PCR amplification and Sanger sequencing. Six SAP categories for splits specific to African elephant (AE), Indian elephant (IE), mammoth, M4 and M25 are shown, together with one category for heterozygosity and other polymorphisms. Gene names are for the putative human orthologue.
From the data set presented here, we conclude that a high-fidelity, high-coverage mammoth genome will be feasible once the genome sequence for the African elephant has been completed and 10–30-fold (depending on the sequencing technology) more mammoth sequence has been generated. From our data, we estimate that mammoth and elephant differ on average at about one residue per protein (roughly 20,000 positions proteome-wide) and that 90% of those differences are potentially identifiable by means of higher-coverage short-read sequencing alone (that is, without enriching sequenced material for coding DNA or Sanger resequencing; see Methods). Apart from comparing protein sequences, we hope to pinpoint DNA differences between mammoth and elephant in the non-repetitive genomic intervals, so it may even be possible to detect differences in gene-regulatory signals. The catalogue of differences, along with computational predictions of the differences most likely to have functional consequences, will provide a resource to facilitate direct observation of genetically orchestrated changes over evolutionary time, for example those associated with adaptation to cold environments, dietary changes and so on. In addition, the determination of an even larger number of synonymous changes in those protein-coding intervals will permit identification of genes and gene families under selection during mammoth evolution. As we have demonstrated here, these studies can be carried out on both clades of mammoth despite the specimens' large difference in age. It will therefore become possible to conduct genome-wide studies on multiple individuals with the goal of understanding whether the observed coalescence time of 1.5–2.0 Myr between M4 and M25 in fact did generate two species of mammoth, or whether this process was terminated by premature extinction of the clade of M25.
Methods Summary
The 4.17 Gb of individual reads from our mammoth samples, along with the sequenced PCR products produced while studying SAPs, were placed in a freely Internet-accessible BLAST server (http://mammoth.psu.edu) that was used for some of the analyses described here, such as determining the fraction of putative elephant coding exons and mammalian ultraconserved intervals that align to a mammoth-sample read. In addition, sequence data from the African savanna elephant genome, produced by the Broad Institute, was a critical ingredient for our analysis. Early in the project, we used an assembly of the twofold-coverage data, downloaded from the University of California, Santa Cruz Genome Bioinformatics website (http://genome.ucsc.edu); later we employed individual Sanger reads that provide roughly sixfold coverage.
For whole-genome mammoth–elephant comparisons, we handled the problem of unmasked elephantid-specific interspersed repeats by aligning mammoth-sample reads to the twofold elephant scaffolds using the 'dynamic masking' feature of the BLASTZ alignment program (see Methods); only the reads that did not align in that preliminary step were aligned to the sixfold reads. Reads that aligned to a unique position in the twofold assembly, and specifically to a position itself aligned with high identity to a human RefSeq coding exon, were analysed to predict elephant/mammoth SAPs. We assumed that any given read is either all mammoth DNA or completely non-mammoth. One set of computational conditions, designed to enrich for substitutions of functional importance, was (1) that the putative amino-acid difference between mammoth and elephant occur in a run of five amino acids identical between human and elephant, and (2) that the substitution have a negative BLOSUM62 score.
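The two computational conditions can be illustrated with a minimal sketch. Several details are our assumptions rather than the paper's: the three protein sequences are taken to be pre-aligned and gap-free, the 'run of five' is interpreted as a window centred on the candidate position (two residues on each side), and only a four-entry excerpt of the BLOSUM62 matrix is embedded so that the example stays self-contained. The function names are hypothetical.

```python
# Excerpt of the symmetric BLOSUM62 matrix: only the pairs used below.
BLOSUM62_EXCERPT = {
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("L", "P"): -3,
    ("G", "W"): -2,
}


def blosum62(a: str, b: str) -> int:
    """Look up a substitution score in the excerpt, in either order."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]  # KeyError if the pair is not in the excerpt


def passes_sap_filter(human: str, elephant: str, mammoth: str,
                      pos: int, flank: int = 2) -> bool:
    """Apply conditions (1) and (2) to one aligned position (0-based)."""
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(elephant):
        return False  # not enough flanking sequence for a full window
    # (1) human and elephant are identical across the five-residue window
    if human[start:end] != elephant[start:end]:
        return False
    if mammoth[pos] == elephant[pos]:
        return False  # no amino-acid difference at this position
    # (2) the elephant-to-mammoth substitution has a negative BLOSUM62 score
    return blosum62(elephant[pos], mammoth[pos]) < 0


human = "ACDLGHK"
elephant = "ACDLGHK"
print(passes_sap_filter(human, elephant, "ACDPGHK", pos=3))  # L->P scores -3: True
print(passes_sap_filter(human, elephant, "ACELGHK", pos=2))  # D->E scores +2: False
```

A real screen would load the full matrix (for example from a bioinformatics library) and would also need the paralogue and pseudogene checks described in the main text, which this sketch omits.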
Full methods accompany this paper.
T and G
30). Once a read was found to align to the elephant assembly, it was not compared to subsequent elephant sequences, to avoid aligning each interspersed or tandem repeat segment thousands of times. Reads that did not align to the twofold elephant assembly were compared with the sixfold elephant (Sanger) reads, requiring a gap-free alignment scoring at least 30, with a match scored as 1 and a mismatch as -3. The reads that aligned to L. africana in one of these two steps, comprising 80.1% of the 4.17 Gb, were considered to be mammoth DNA. We also used this approach on just the M4 sample, just the M25 sample, and on read-sized intervals from the chicken genome