Chromosome-level genome assembly of the threatened resource plant Cinnamomum chago

Tao, Lidan; Guo, Shiwei; Xiong, Zizhu; Zhang, Rengang; Sun, Weibang

doi:10.1038/s41597-024-03293-1

Data Descriptor
Open access
Published: 03 May 2024

Lidan Tao ORCID: orcid.org/0000-0002-1396-0524^1,2,3^na1,
Shiwei Guo^1,2,3^na1,
Zizhu Xiong^1,2,3,
Rengang Zhang ORCID: orcid.org/0000-0002-8028-9208^1,2,3 &
Weibang Sun^1,2,4

Scientific Data volume 11, Article number: 447 (2024)

Abstract

Cinnamomum chago is a tree species endemic to Yunnan province, China, with potential economic value, phylogenetic importance, and conservation priority. We assembled the genome of C. chago using multiple sequencing technologies, resulting in a high-quality, chromosomal-level genome with annotation information. The assembled genome size is approximately 1.06 Gb, with a contig N50 length of 92.10 Mb. About 99.92% of the assembled sequences could be anchored to 12 pseudo-chromosomes, with only one gap, and 63.73% of the assembled genome consists of repeat sequences. In total, 30,497 genes were recognized according to annotation, including 28,681 protein-coding genes. This high-quality chromosome-level assembly and annotation of C. chago will assist us in the conservation and utilization of this valuable resource, while also providing crucial data for studying the evolutionary relationships within the Cinnamomum genus, offering opportunities for further research and exploration of its diverse applications.

Background & Summary

The Cinnamomum genus (family: Lauraceae) comprises 248 species of evergreen trees or shrubs with a wide distribution spanning Tropical and Subtropical Asia to the Western Pacific¹. Cinnamomum encompasses several economically important plant species that have versatile uses, including construction materials, furniture, spice production, pharmaceutical applications, and industrial oilseed purposes. Moreover, certain species from this genus, such as C. camphora and C. japonicum, are extensively cultivated as ornamental landscape trees^2,3.

C. chago B.S. Sun et H.L. Zhao is endemic to Yunnan province, China, and was initially discovered in La-Guo village, Yangbi county⁴ (Fig. 1a). Recent investigations have confirmed that C. chago is exclusively distributed in Dali Prefecture and Pu’er City of the province^5,6. In Yunlong and Yangbi County of Dali Prefecture, mature seeds of C. chago are collected by villagers and sold by the local Yi people as a traditional ethnic nut and health product⁵. Preliminary nutritional analysis revealed that C. chago seeds contain a high proportion of lauric acid, indicating high potential for economic utilization⁷. Furthermore, its exceptional wood is frequently harvested for furniture production, significantly impacting natural regeneration of the species⁶.

Due to its small population size and intensive human disturbance, C. chago is threatened and was assessed in 2021 as one of the Plant Species with Extremely Small Populations (PSESP) in southwest China requiring rescue protection^8,9. Additionally, it was designated as one of China’s nationally protected Grade II wild plants, safeguarded by law. Moreover, its unique morphological features indicate that C. chago is a key phylogenetic taxon between the two sections of Asian Cinnamomum plants (Sect. Camphora and Sect. Cinnamomum)^5,10. Therefore, a high-quality reference genome is crucial for promoting the conservation and utilization of C. chago, as well as for studying the phylogeny of the family Lauraceae.

In this study, we assembled and annotated the genome of C. chago using PacBio HiFi reads (91.73 Gb, 80×), ONT reads (33.27 Gb, 30×), NGS reads (58.83 Gb, 50×), Hi-C reads (124.18 Gb), RNA-seq (16.31 Gb), and Iso-Seq (18.54 Gb). The assembled contig size was close to the genome size of 1.1 Gb estimated from k-mer analysis, with a scaffold N50 length of 92.10 Mb. Approximately 99.92% of the assembled sequence was anchored onto 12 pseudo-chromosomes (Table 1; Fig. 1b,c; Supplementary Table S1). The chloroplast and mitochondrial genomes were 152,753 bp and 707,525 bp, respectively. A total of 1,366,885 repeat sequences were identified, with a cumulative length of approximately 676.3 Mb, accounting for 63.73% of the assembled genome. Of the identified repeats, long terminal repeats (LTRs) constituted the largest proportion, with 466,655 elements and a cumulative length of 431,972,996 bp, accounting for 40.71% of the C. chago genome assembly. The genome contained 30,497 genes, including 28,681 protein-coding genes (Table 2). The high-quality reference genome and annotation of C. chago will enhance our understanding of the evolutionary relationships within the genus Cinnamomum and support further research on and utilization of this economically valuable resource.

Table 1 Summary of Cinnamomum chago genome assembly.

Table 2 Summary of Cinnamomum chago genome annotations.

Methods

Sampling

For genomic DNA extraction, fresh young leaves of C. chago were collected from a single adult plant in Xincun village, Yangbi County, Dali Prefecture, Yunnan Province, China (25°33′37″N, 99°55′18″E). Additionally, for transcriptome RNA extraction, tender shoots, young leaves, current-year branches, and immature fruits were collected from the same adult plant. The transcriptome samples were immediately frozen in liquid nitrogen after collection and subsequently stored at −80 °C. DNA and RNA extraction and sequencing were performed by Wuhan Benagen Technology Co. Ltd. in Wuhan, China.

Genome sequencing

A modified CTAB method was used to extract total DNA from young C. chago leaves¹¹. The concentration of DNA was assessed using NanoDrop (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA). The purity and integrity of the resulting DNA were assessed using 1% agarose gel electrophoresis. The short-read library with a DNA-fragment insert size of 200–400 bp was prepared using 1 μg genomic DNA following the manufacturer’s instructions (BGI) and was subjected to paired-end (PE) sequencing on a DNBSEQ-T7 platform (BGI Inc., Shenzhen, China) in PE150 mode, which produced 58.83 Gb (~196 M reads, approximately 50×) of raw data (Supplementary Table S2).

Genomic DNA was purified using a DNeasy Plant Mini Kit (Qiagen, Germantown, MD, USA) before HiFi sequencing, and its integrity was assessed using a Femto Pulse instrument (Agilent Technologies, Santa Clara, CA, USA). Subsequently, Megaruptor 3 (Diagenode SA., Seraing, Belgium) was employed to fragment 8 μg of genomic DNA, and the resulting fragments were concentrated using AMPure PB magnetic beads (Pacific Biosciences, Menlo Park, CA, USA). Each PacBio single molecule real-time (SMRT) library was constructed using a SMRTbell express template prep kit 3.0 (Pacific Biosciences, Menlo Park, CA, USA), with insert sizes of 30 kb selected via the BluePippin system (Sage Science, Beverly, MA, USA). The library was then sequenced on a Pacific Biosciences Revio platform in CCS mode, and the raw data were processed into high-fidelity (HiFi) reads using the CCS workflow 7.0.0¹² with the parameters --streamed --log-level INFO --stderr-json-log --kestrel-files-layout --min-rq 0.9 --non-hifi-prefix fail --knrt-ada --pbdc-model. This process yielded approximately 91.73 Gb (~80×) of HiFi data with an average read length of about 18 kb and an N50 read length of approximately 18 kb (Supplementary Table S3).

The Nanopore DNA library was prepared using the SQK-LSK109 Kit (Oxford Nanopore Technologies, Oxford, UK), and the library was sequenced using a Nanopore PromethION sequencer. In total, approximately 33.27 Gb (~30×) of WGS ONT data were obtained (Supplementary Table S3).
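The approximate depths quoted for each read type follow from dividing the sequenced bases by the genome size. The sketch below is only an illustration of that arithmetic: it assumes the ~1.1 Gb k-mer estimate as the denominator, and the function and variable names are hypothetical rather than part of the study's workflow.

```python
# Illustrative arithmetic only: depth = sequenced bases / genome size.
# Yields are the values reported in the text; 1.1 Gb is the k-mer estimate.
GENOME_SIZE_GB = 1.1

yields_gb = {
    "PacBio HiFi": 91.73,
    "ONT": 33.27,
    "DNBSEQ short reads": 58.83,
}


def depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return mean sequencing depth (fold coverage)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


for name, gb in yields_gb.items():
    print(f"{name}: {depth(gb, GENOME_SIZE_GB):.1f}x")
```

The computed values land slightly above the rounded figures in the text (about 80×, 30× and 50×).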

Hi-C library construction and sequencing

Fresh leaf tissue was fixed in formaldehyde solution, and the cross-linked DNA was then digested and labelled with biotin. Subsequently, the DNA fragments were ligated using DNA ligase, and the ligated DNA was then reverse-cross-linked, sheared, and purified. After A-tailing and adapter ligation, the biotin-labelled fragments were enriched using streptavidin magnetic beads. The Hi-C libraries were PCR-amplified and then sequenced on the Illumina NovaSeq 6000 platform in PE150 mode (Supplementary Table S4).

Transcriptome sequencing

Total RNA from leaves, stems, fruits, and roots of the same plant was isolated. For NGS RNA-Seq, libraries were prepared using the VAHTS Universal V6 RNA-seq Library Prep Kit for Illumina. The libraries were then sequenced on the Illumina NovaSeq 6000 S4 platform. For full-length isoform sequencing (Iso-Seq), both SQK-PCS109 and SQK-PBK004 Kits (Oxford Nanopore Technologies, Oxford, UK) were used to prepare the library, and the library was sequenced using a Nanopore PromethION sequencer. Finally, a total of 16 Gb (~109 M reads) of NGS RNA-Seq data and 19 Gb (~17 M reads) of full-length Iso-Seq data were obtained for genome annotation (Supplementary Tables S5, S6, S7).

Genome size estimation

Both flow cytometry (FCM) analysis and k-mer frequency analysis were employed to estimate the genome size of C. chago. For FCM analysis, the DNA content was assessed using the BD FACScalibur (BD Biosciences, USA), with maize B73 as the reference standard. The frequencies of 19-mers, 25-mers, 29-mers, 39-mers and 49-mers were estimated with the software GCE v1.0.0¹³ using HiFi reads. The estimated genome size was ~1.1 Gb, with a genome heterozygosity of 0.8% (Supplementary Table S8).
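The basic idea behind k-mer based size estimation can be sketched as dividing the total number of k-mers by the depth of the main peak in the k-mer frequency histogram. This is a simplified illustration, not the model implemented in GCE, which additionally accounts for heterozygosity and repeats; the function name, the `min_depth` error cutoff and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(hist: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Return (peak_depth, estimated_size) from a k-mer depth histogram.

    hist maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an assumed, simplistic error filter).
    """
    solid = {d: n for d, n in hist.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by the peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (made-up numbers): low-depth error k-mers plus a peak at 20x.
toy_hist = {1: 5000, 2: 800, 19: 100, 20: 300, 21: 100}
print(estimate_genome_size(toy_hist))
```

In a heterozygous genome such as this one (0.8%), the tallest peak may correspond to heterozygous k-mers rather than the homozygous peak, which is one reason dedicated tools are used in practice.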

Chromosome-level genome assembly

PacBio HiFi reads, WGS ONT reads, and Hi-C reads were assembled into contigs using Hifiasm v0.19.5-r592¹⁴. The primary assembly was selected for subsequent analysis. Hi-C reads were aligned to the reference genome using Juicer 3, followed by initial Hi-C-assisted chromosome assembly using 3D-DNA v180922¹⁵ (with the parameters --early-exit -m haploid -r 0). Manual inspection and adjustment were performed using Juicebox v1.11.08¹⁶ (pre -n -q 0 or 1), primarily focusing on refining chromosome segment boundaries and correcting assembly errors. Chromosome scaffolding was then performed separately for each chromosome using 3D-DNA, followed by manual adjustments in Juicebox, including removal of erroneous insertions and orientation adjustments, aiming to correct visible errors as far as possible. After manual inspection, the final genome assembly consisted of 12 chromosomes and un-anchored sequences. Gaps with a fixed length of 100 bp were present; therefore, gap filling was performed using quarTeT v1.1.2¹⁷ based on HiFi reads.

Most chromosomal telomeres exhibited telomeric repeat sequences (TTTAGGG)n¹⁸; however, there were individual cases where this sequence was shorter or absent, suggesting incomplete assembly or insufficient extension. To address this, the HiFi reads were mapped back to the chromosomes, and reads mapping near the telomeres were selected. These reads were then assembled into contigs using Hifiasm v0.19.5-r592. The contigs were mapped to the chromosomes, and the chromosomes were extended outward to assemble the telomere sequences as completely as possible. GetOrganelle v1.7.5¹⁹ was used to assemble the chloroplast and mitochondrial genomes.
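A simple way to check whether a chromosome end carries the (TTTAGGG)n repeat is to count tandem copies of the unit, and of its reverse complement CCCTAAA, within a window at each end. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the window size, the minimum tandem run and all names are hypothetical.

```python
import re

TELOMERE_UNIT = "TTTAGGG"  # plant-type telomeric repeat unit named in the text


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def telomere_copies(seq: str, window: int = 1000, min_tandem: int = 2) -> dict[str, int]:
    """Count repeat units found in tandem runs near each end of a sequence.

    Only runs of at least `min_tandem` consecutive units are counted, on
    either strand; `window` is the number of bases inspected at each end.
    """
    fwd, rev = TELOMERE_UNIT, revcomp(TELOMERE_UNIT)
    pattern = re.compile(f"(?:{fwd}){{{min_tandem},}}|(?:{rev}){{{min_tandem},}}")

    def count(region: str) -> int:
        return sum(len(m.group()) // len(TELOMERE_UNIT) for m in pattern.finditer(region))

    seq = seq.upper()
    return {"start": count(seq[:window]), "end": count(seq[-window:])}


# Toy sequence: 5 units at the start (reverse-complement form), 3 at the end.
toy = "CCCTAAA" * 5 + "ACGT" * 50 + "TTTAGGG" * 3
print(telomere_copies(toy, window=100))
```

An end with a low or zero count would flag a telomere that is short or absent and might be a candidate for the read-based extension described above.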

The assembly was polished using Nextpolish2 v0.1.0²⁰ based on HiFi and NGS short reads. Then, redundancies including rDNA fragments and haplotigs were removed using Redundans v0.13c²¹ (with the parameters -identity 0.98 -overlap 0.8) with manual curation. About 99.92% of the assembled data was anchored to the 12 pseudochromosomes, and the chromosomes were numbered according to the published genome assembly of C. kanehirae²²; the mitochondrial and chloroplast genomes accounted for 0.07% and 0.01% of the assembled data, respectively (Table 1; Fig. 1b,c; Supplementary Table S1). Finally, we obtained a high-quality genome of C. chago.

Identification of repetitive elements

EDTA v1.9.9²³ was utilized for de novo identification of transposable elements (parameters: --sensitive 1 --anno 1) to generate a TE library. RepeatMasker v4.0.7²⁴ (with the parameters -no_is -xsmall) was then employed to identify repetitive regions in the genome. A total of 1,366,885 repetitive sequences were identified, comprising a cumulative length of 676,297,749 bp, accounting for 63.73% of the genome. Among these, the most abundant were LTR elements, with a total of 466,655 elements spanning 431,972,996 bp, making up 40.71% of the genome (Supplementary Table S9).

Gene identification and functional annotation

Homologous protein evidence was prepared by merging a total of 507,642 non-redundant protein sequences sourced from publicly available proteins for gene annotation, including Amborella trichopoda²⁵, Nymphaea colorata²⁶, Aristolochia fimbriata²⁷, Piper nigrum²⁸, Saururus chinensis²⁹, Annona glabra³⁰, Liriodendron chinense³¹, Magnolia sinica³², Chimonanthus salicifolius³³, Cinnamomum kanehirae²², Cinnamomum camphora³⁴, Litsea cubeba³⁵, Lindera megaphylla³⁶, Chloranthus sessilifolius³⁷, Acorus gramineus³⁸, Oryza sativa³⁹, Tetracentron sinense⁴⁰, and Arabidopsis thaliana⁴¹.

Transcript evidence preparation involved two approaches for NGS transcriptome data: 1) Trinity v2.0.6⁴² was employed to perform de novo assembly, and 2) hisat2 v2.1.0⁴³ was utilized to map reads to the genome, followed by assembly using StringTie v2.1.5⁴⁴. For Iso-Seq data, Minimap2 v2.24⁴⁵ (with the parameters -a -x splice --end-seed-pen=60 -G 200k) was used to map reads to the genome, which were subsequently assembled using StringTie v2.1.5 (with the parameters -L -t -f 0.05) (Supplementary Table S10). Gene structure annotation was performed using the PASA (Program to Assemble Spliced Alignments) pipeline v2.4.1⁴⁶ based on the transcript evidence obtained, and full-length genes were identified through comparison with reference proteins. To optimize gene prediction, AUGUSTUS v3.4.0⁴⁷ was trained using the full-length gene set, undergoing five rounds of optimization. SNAP⁴⁸ was also trained to further enhance gene prediction accuracy.

The MAKER2 v2.31.9⁴⁹ annotation workflow was employed to annotate genes based on ab initio prediction, transcript evidence, and homologous protein evidence. In this step, repetitive regions were first masked using RepeatMasker v4.0.7. AUGUSTUS v3.4.0 and SNAP were used for ab initio gene prediction. Then, the assembled transcript sequences were aligned with the genome using BLASTN, while protein sequences were aligned using BLASTX, and the alignments were optimized using Exonerate v2.2.0⁵⁰. Hints files were generated based on the evidence obtained, which were then integrated with AUGUSTUS and SNAP to predict gene models.

Further integration of MAKER and PASA annotations was performed using EVidenceModeler (EVM) v1.1.1⁵¹ to generate consistent gene annotations. TEsorter v1.4.1⁵² was utilized to identify TE protein domains in the genome, which were subsequently masked by EVM v1.1.1 to avoid introducing transposable element (TE) coding regions. Finally, PASA v2.4.1 was used to update and optimize the results obtained by EVM, adding UTRs and alternative splicing isoforms. Gene annotations with abnormal coding frames and those that were too short (<50 aa) were removed. Barrnap v0.9 (https://github.com/tseemann/barrnap) and tRNAscan-SE v1.3.1⁵³ were used to annotate rRNAs and tRNAs, respectively. Various non-coding RNAs (ncRNAs) were annotated using RfamScan v14.2⁵⁴.
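The final filtering step can be illustrated with a small check on a coding sequence. The text does not define "abnormal coding frames", so the criteria below (length divisible by three, ATG start, terminal stop codon, no internal stop) are assumptions chosen for illustration; only the 50 aa threshold comes from the text, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(cds: str, min_aa: int = 50) -> bool:
    """Return True if a CDS passes simple frame and length checks.

    Assumed criteria: length is a multiple of 3, starts with ATG, ends
    with a stop codon, has no in-frame internal stop, and encodes at
    least `min_aa` amino acids (the stop codon is not counted).
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if not codons or codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    return len(codons) - 1 >= min_aa


# Toy CDS encoding exactly 50 residues (Met + 49 Ala) followed by a stop.
good = "ATG" + "GCT" * 49 + "TAA"
short = "ATG" + "GCT" * 10 + "TAA"
print(is_valid_cds(good), is_valid_cds(short))
```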

Functional annotation of protein-coding genes was conducted using three strategies. 1) The predicted genes were aligned with the eggNOG v. 5.0 homologous gene database using eggNOG-mapper v. 2.0.0⁵⁵ (--target_taxa Viridiplantae -m diamond) for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation. 2) Sequence matching was performed using DIAMOND v0.9.24⁵⁶ (--evalue 1e-5 --max-target-seqs 5) (Identity >30%, E-value <1e-5), aligning the protein sequences with various databases such as Swiss-Prot, TrEMBL, NR (non-redundant protein), and Arabidopsis, to identify best gene matches. 3) InterProScan v5.27-66.0⁵⁷ was used to obtain the conserved amino acid sequences, motifs, and domains of the predicted proteins by searching for domain similarity in the sub-databases PRINTS, Pfam, SMART, PANTHER and CDD of the InterPro database (Table 3). Finally, 27,795 genes were functionally annotated in at least one of the above databases, accounting for 96.91% of the predicted protein-coding genes (Table 2; Supplementary Table S11).
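The identity and E-value thresholds used for the DIAMOND matches can be applied to tabular alignment output as sketched below. This assumes the standard 12-column BLAST tabular layout (query, subject, percent identity, ..., E-value, bit score) and picks the highest bit score per query as the "best" match; the tie-breaking rule, the function name and the toy identifiers are assumptions for illustration.

```python
def best_hits(lines, min_identity: float = 30.0, max_evalue: float = 1e-5) -> dict[str, str]:
    """Return {query: best subject} from BLAST-style tabular lines.

    Columns assumed (tab-separated): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    Hits must have identity > min_identity and E-value < max_evalue.
    """
    best: dict[str, tuple[str, float]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, evalue, bitscore = float(fields[2]), float(fields[10]), float(fields[11])
        if pident <= min_identity or evalue >= max_evalue:
            continue
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Toy rows with hypothetical identifiers.
rows = [
    "gene1\tP1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "gene1\tP2\t60.0\t200\t80\t0\t1\t200\t1\t200\t1e-20\t150",
    "gene2\tP3\t25.0\t100\t75\t0\t1\t100\t1\t100\t1e-8\t60",   # identity too low
    "gene3\tP4\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-3\t40",   # E-value too high
]
print(best_hits(rows))
```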

Table 3 Statistics of functional annotation result of Cinnamomum chago genome.

Mitochondrial and chloroplast genomes were also annotated using the OGAP pipeline (https://github.com/zhangrengang/ogap). In total, 61 genes and 108 genes were functionally annotated in the mitochondrial and chloroplast genomes, respectively (Supplementary Table S12).

Data Records

The relevant data reported in this paper have been deposited in the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under the BioProject accession number PRJCA022354, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh. BGI short reads, PacBio HiFi long reads, Hi-C reads, WGS ONT data, Iso-Seq data and RNA-Seq data have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession numbers CRR1001223⁵⁸, CRR1001224⁵⁹, CRR1001225⁶⁰, CRR1091096⁶¹, CRR1091097⁶² and CRR1001228⁶³. The final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHERBI00000000⁶⁴. GSA and GWH data are also available in NCBI SRA and GenBank under the accession numbers SRR27371173⁶⁵, SRR27371174⁶⁶, SRR27371175⁶⁷, SRR27371176⁶⁸, SRR28466993⁶⁹, SRR28466994⁷⁰, and GCA_038049695.1⁷¹. Annotation data are available in Figshare⁷².

Technical Validation

Genome assembly quality assessment

The final assembly was about 1.1 Gb, similar to the result of the k-mer analysis (Supplementary Table S8; Supplementary Figure S1). There was only one gap in the assembly, and the contig N50 reached 92.10 Mb, indicating good continuity. Short reads were mapped to the genome using BWA-MEM v0.7.17-r1188⁷³, while the third-generation reads were mapped using Minimap2 v2.24⁴⁵. Non-primary alignments were filtered out, and the mapping ratio and coverage percentage were calculated. The results are shown in Table 4, indicating a high level of sequence coverage for the genome. According to BUSCO (Benchmarking Universal Single-Copy Orthologs) v5.3.2⁷⁴, the proportion of complete core genes (including single-copy and duplicated genes) was 99.0%. The percentage of missing genes was 0.5%, indicating a high level of gene completeness.
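For readers unfamiliar with the N50 statistic quoted above, it is the length L such that sequences of length at least L together cover half or more of the assembly. A minimal sketch of the standard definition follows; the function name and the toy lengths are illustrative and not taken from the study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for positive lengths")


# Toy example: total is 20, and 6 + 5 = 11 reaches half of it.
print(n50([2, 3, 4, 5, 6]))
```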

Table 4 Mapping ratio and coverage percentage of different data sets.

According to the relationship between guanine-cytosine (GC) content and sequencing depth, there was significant GC bias in short reads but no obvious bias in long reads (Supplementary Figure S2). The Hi-C data were further mapped onto the final genome assembly using Juicer v1.5.6¹⁶, revealing clear chromosome clustering (Supplementary Figure S2) with no apparent chromosomal assembly errors.

The genome assembly quality was also assessed by the LTR assembly index (LAI)⁷⁵, consensus quality (QV)⁷⁶, contig/chromosome ratio (CC ratio)⁷⁷, and Clipping information for Revealing Assembly Quality (CRAQ)⁷⁸. The LAI of the assembled genome was 10.80 (>10), indicating that the assembly has reached reference-genome quality. The QV of the assembled genome was approximately 70.12, indicating an accuracy of over 99.99% in the assembly. The CC ratio of the assembly was 1.25, which reflects high continuity. According to CRAQ, the regional and structural assembly quality indicators (R-AQI and S-AQI) were approximately 95.31 and 97.73, respectively, which corresponds to low assembly errors (Supplementary Table S13).
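The link between QV and accuracy can be made explicit by assuming the usual Phred scaling, in which QV = -10 log10(per-base error probability). Under that assumption, a QV near 70 corresponds to roughly one error per ten million bases, comfortably beyond the 99.99% threshold (which equals QV 40). The function names below are illustrative.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def qv_to_accuracy_percent(qv: float) -> float:
    """Per-base accuracy, in percent, implied by a Phred-scaled QV."""
    return 100 * (1 - qv_to_error_rate(qv))


qv = 70.12  # value reported in the text
print(f"error rate ~ {qv_to_error_rate(qv):.2e}")
print(f"accuracy   ~ {qv_to_accuracy_percent(qv):.6f}%")
```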

The repetitive sequences were mapped to the genome to determine the position of the telomeres and other characteristic sequences on the chromosomes. Most of the chromosomes assembled complete telomere sequences (TTTAGGG), and only one telomere was missing. A putative centromere tandem repeat motif (GCGGCTCTAGAAAATTGTTGACTCTACACTGTGTTTCATGCGACTCTTGGTCCAAAGACTCCCTCTAGAAAAATCCGGGATCACGTTTTACTCTAAAAGGGGTTTCGGGTGTCCTTCTCTTGTCTTACGCCTCTAAATCCATTTGAAGGGATTCTGGGTTGAGATGCGCTTTTTAGGATATTTCGAGCTACTTTTCGGTTTAAAACGGGTTTCGGGTGAATCTTGGGTATGGAAAACACTTTCGGGGAGTTCAGTGTTTGTAAAGGCGAAAACCCGAACTTCGTGCGGGTCGTACGGTACTTTTGTACGAAAACACAATCTAT) was identified from HiFi reads using Centromics (https://github.com/zhangrengang/Centromics). Most chromosomes contained large tandem repeat regions as putative centromeres (Fig. 2). In addition, the 18S-5.8S-28S rDNA arrays were detected on three chromosomes (Chr10, Chr11 and Chr12), while 5S rDNA arrays were found on Chr01, Chr03 and Chr06 (Fig. 2). In summary, this assembly can be described as a nearly telomere-to-telomere genome.

Evaluation of the gene annotation

The integrated and annotated proteins were evaluated using BUSCO with the lineage dataset embryophyta_odb10. Among a total of 1,614 BUSCO groups, 98.6% were complete (including 52.1% single-copy and 46.5% duplicated genes), 0.3% were fragmented, and 1.1% were missing, indicating a high-quality annotation (Table 5).
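BUSCO reports these figures in a compact one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... The sketch below parses such a line and checks that the categories add up; the summary string is reconstructed here from the percentages in the text rather than copied from the actual BUSCO output, and the function names and rounding tolerance are assumptions.

```python
import re

BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict[str, float]:
    """Parse a BUSCO one-line summary into percentages plus the group count n."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    result = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    result["n"] = int(match["n"])
    return result


def is_consistent(busco: dict[str, float], tol: float = 0.15) -> bool:
    """Check S + D == C and C + F + M == 100, allowing for rounding."""
    return (
        abs(busco["S"] + busco["D"] - busco["C"]) <= tol
        and abs(busco["C"] + busco["F"] + busco["M"] - 100) <= tol
    )


# Summary string reconstructed from the values reported in the text.
summary = "C:98.6%[S:52.1%,D:46.5%],F:0.3%,M:1.1%,n:1614"
parsed = parse_busco(summary)
print(parsed, is_consistent(parsed))
```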

Table 5 BUSCO assessment result.

Code availability

All commands and pipelines used were performed according to the manuals or protocols of the tools used in this study. The software and tools used are publicly accessible, with the version and parameters specified in the Methods section. If no detailed parameters were mentioned, default parameters were used. No custom code was used in this study.

References

Cinnamomum Schaeff. http://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:328262-2 (2024).
Ravindran, P. N., Nirmal Babu, K. & Shylaja, M. Cinnamon and cassia: the genus Cinnamomum. (CRC Press, 2004).
Li, X. et al. Lauraceae. in Flora of China (eds. Wu, Z., Raven, P. H. & Hong, D.) vol. 7 (Science Press, Beijing, and Missouri Botanical Garden Press, St. Louis, 2008).
Sun, B. X. & Zhao, H. L. A New Species of Cinnamomum from Yunnan. Journal of Yunnan University 13, 93–94 (1991).
Dong, W. J. et al. Biological characteristics and conservation genetics of the narrowly distributed rare plant Cinnamomum chago (Lauraceae). Plant Diversity 38, 247–252 (2016).
Zhang, X. et al. Investigating the status of Cinnamomum chago (Lauraceae), a plant species with an extremely small population endemic to Yunnan, China. Oryx 54, 470–473 (2020).
Hou, M. et al. Nutritional composition analysis and evaluation of Cinnamomum chago. J. West China For. Sci. 48, 80–85, https://doi.org/10.16473/j.cnki.xblykx1972.2019.06.013 (2019).
Yang, J. & Sun, W. B. A new programme for conservation of Plant Species with Extremely Small Populations in south-west China. Oryx 51, 396–397, https://doi.org/10.1017/S0030605317000710 (2017).
Sun, W. B. List of Yunnan protected plant species with extremely small populations (2021). (Yunnan Science and Technology Press, 2021).
Yang, Z., Liu, B., Yang, Y. & Ferguson, D. K. Phylogeny and taxonomy of Cinnamomum (Lauraceae). Ecology and Evolution 12, e9378, https://doi.org/10.1002/ece3.9378 (2022).
Doyle, J. J. & Doyle, J. L. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19, 11–15 (1987).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162, https://doi.org/10.1038/s41587-019-0217-9 (2019).
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. ArXiv, 1308.2012 https://doi.org/10.48550/arXiv.1308.2012 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic. Res.-England 10, https://doi.org/10.1093/hr/uhad127 (2023).
Gao, D. et al. TAR30, a homolog of the canonical plant TTTAGGG telomeric repeat, is enriched in the proximal chromosome regions of peanut (Arachis hypogaea L.). Chromosome Res. 30, 77–90, https://doi.org/10.1007/s10577-022-09684-7 (2022).
Jin, J. J. et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 21, 241, https://doi.org/10.1186/s13059-020-02154-5 (2020).
Hu, J. et al. NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads. Genom. Proteom. & Bioinform. 22, qzad009, https://doi.org/10.1093/gpbjnl/qzad009 (2024).
Pryszcz, L. P. & Gabaldón, T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 44, e113, https://doi.org/10.1093/nar/gkw294 (2016).
Chaw, S. M. et al. Stout camphor tree genome fills gaps in understanding of flowering plant genome evolution. Nat. Plants 5, 63–73, https://doi.org/10.1038/s41477-018-0337-0 (2019).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275, https://doi.org/10.1186/s13059-019-1905-y (2019).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, 4.10.11–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Albert, V. A. et al. The Amborella genome and the evolution of flowering plants. Science 342, 1467, https://doi.org/10.1126/science.1241089 (2013).
Zhang, L. S. et al. The water lily genome and the early evolution of flowering plants. Nature 577, 79–84, https://doi.org/10.1038/s41586-019-1852-5 (2020).
Qin, L. Y. et al. Insights into angiosperm evolution, floral development and chemical biosynthesis from the Aristolochia fimbriata genome. Nat. Plants 7, 1239, https://doi.org/10.1038/s41477-021-00990-2 (2021).
Negi, A. et al. Rapid genome-wide location-specific polymorphic SSR marker discovery in black pepper by GBS approach. Front. Plant Sci. 13, https://doi.org/10.3389/fpls.2022.846937 (2022).
Xue, J. Y. et al. The Saururus chinensis genome provides insights into the evolution of pollination strategies and herbaceousness in magnoliids. Plant J. 113, 1021–1034, https://doi.org/10.1111/tpj.16097 (2023).
He, Z. W. et al. Evolution of coastal forests based on a full set of mangrove genomes. Nat. Ecol. Evol. 6, 738–749, https://doi.org/10.1038/s41559-022-01744-9 (2022).
Li, T. et al. Genome evolution and initial breeding of the Triticeae grass Leymus chinensis dominating the Eurasian Steppe. Proc. Natl. Acad. Sci. USA 120, e2308984120, https://doi.org/10.1073/pnas.2308984120 (2023).
Cai, L. et al. The chromosome-scale genome of Magnolia sinica (Magnoliaceae) provides insights into the conservation of plant species with extremely small populations (PSESP). GigaScience 13, https://doi.org/10.1093/gigascience/giad110 (2024).
Lv, Q. D. et al. The Chimonanthus salicifolius genome provides insight into magnoliid evolution and flavonoid biosynthesis. Plant J. 103, 1910–1923, https://doi.org/10.1111/tpj.14874 (2020).
Shen, T. F. et al. The chromosome-level genome sequence of the camphor tree provides insights into Lauraceae evolution and terpene biosynthesis. Plant Biotechnol. J. 20, 244–246, https://doi.org/10.1111/pbi.13749 (2022).
Chen, Y. C. et al. The Litsea genome and the evolution of the laurel family. Nat. Commun. 11, 1675, https://doi.org/10.1038/s41467-020-15493-5 (2020).
Tian, X. C. et al. Unique gene duplications and conserved microsynteny potentially associated with resistance to wood decay in the Lauraceae. Front. Plant Sci. 14, 1122549, https://doi.org/10.3389/fpls.2023.1122549 (2023).
Ma, J. X. et al. The Chloranthus sessilifolius genome provides insight into early diversification of angiosperms. Nat. Commun. 12, 6929, https://doi.org/10.1038/s41467-021-26931-3 (2021).
Ma, L. et al. Diploid and tetraploid genomes of Acorus and the evolution of monocots. Nat. Commun. 14, 3661, https://doi.org/10.1038/s41467-023-38829-3 (2023).
Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, D883–D887, https://doi.org/10.1093/nar/gkl976 (2007).
Liu, P. L. et al. The Tetracentron genome provides insight into the early evolution of eudicots and the formation of vessel elements. Genome Biol. 21, 291, https://doi.org/10.1186/s13059-020-02198-7 (2020).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant. J. 89, 789–804, https://doi.org/10.1111/tpj.13415 (2017).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652, https://doi.org/10.1038/nbt.1883 (2011).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360, https://doi.org/10.1038/NMETH.3317 (2015).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644, https://doi.org/10.1093/bioinformatics/btn013 (2008).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59, https://doi.org/10.1186/1471-2105-5-59 (2004).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7, https://doi.org/10.1186/gb-2008-9-1-r7 (2008).
Zhang, R. G. et al. TEsorter: An accurate and fast method to classify LTR-retrotransposons in plant genomes. Hortic. Res. 9, uhac017, https://doi.org/10.1093/hr/uhac017 (2022).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964, https://doi.org/10.1093/nar/25.5.955 (1997).
Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137, https://doi.org/10.1093/nar/gku1063 (2014).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122, https://doi.org/10.1093/molbev/msx148 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA014129/CRR1001223 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA014129/CRR1001224 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA014129/CRR1001225 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA015570/CRR1091096 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA015570/CRR1091097 (2024).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA014129/CRR1001228 (2024).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/83678/show (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27371173 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27371174 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27371175 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27371176 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28466993 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28466994 (2024).
NCBI Assembly https://identifiers.org/insdc.gca:GCA_038049695.1 (2024).
Tao, L. D., Guo, S. W., Xiong, Z. Z., Zhang, R. G. & Sun, W. B. Chromosome-level genome assembly of the threatened resource plant Cinnamomum chago. Figshare https://doi.org/10.6084/m9.figshare.c.7148167.v1 (2024).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv, 1303.3997 https://doi.org/10.48550/arXiv.1303.3997 (2013).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Ou, S. J., Chen, J. F. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126, https://doi.org/10.1093/nar/gky730 (2018).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Wang, P. & Wang, F. A proposed metric set for evaluation of genome assembly quality. Trends Genet. 39, 175–186, https://doi.org/10.1016/j.tig.2022.10.005 (2023).
Li, K. P., Xu, P., Wang, J. P., Yi, X. & Jiao, Y. N. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat. Commun. 14, 6556, https://doi.org/10.1038/s41467-023-42336-w (2023).

Acknowledgements

This work was supported by the Second Tibetan Plateau Scientific Expedition and Research Program (Grant No. 2019QZKK0502 to W.S.), the Yunnan Wildlife Protection Project (Grant No. 2021SJ14X-10 to L.T.), and the Science and Technology Basic Resources Investigation Program of China (Grant No. 2017FY100100 to W.S.).

Author information

These authors contributed equally: Lidan Tao, Shiwei Guo.

Authors and Affiliations

Yunnan Key Laboratory for Integrative Conservation of Plant Species with Extremely Small Populations, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
Lidan Tao, Shiwei Guo, Zizhu Xiong, Rengang Zhang & Weibang Sun
CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
Lidan Tao, Shiwei Guo, Zizhu Xiong, Rengang Zhang & Weibang Sun
University of Chinese Academy of Sciences, Beijing, 101408, China
Lidan Tao, Shiwei Guo, Zizhu Xiong & Rengang Zhang
Kunming Botanic Garden, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
Weibang Sun

Contributions

W.S. conceived the project and designed the experiments. L.T. and S.G. investigated wild populations of Cinnamomum chago and prepared the samples. L.T., S.G. and Z.X. drafted the manuscript. R.Z. performed the bioinformatic analyses. L.T., S.G., Z.X., R.Z. and W.S. revised the manuscript. All authors contributed to the article and approved the submitted version.

Corresponding author

Correspondence to Weibang Sun.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figures

Supplementary Tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Tao, L., Guo, S., Xiong, Z. et al. Chromosome-level genome assembly of the threatened resource plant Cinnamomum chago. Sci Data 11, 447 (2024). https://doi.org/10.1038/s41597-024-03293-1

Received: 26 January 2024
Accepted: 22 April 2024
Published: 03 May 2024
DOI: https://doi.org/10.1038/s41597-024-03293-1