Large-scale whole-genome resequencing unravels the domestication history of Cannabis sativa

Abstract

Cannabis sativa has long been an important source of fiber extracted from hemp and both medicinal and recreational drugs based on cannabinoid compounds. Here, we investigated its poorly known domestication history using whole-genome resequencing of 110 accessions from worldwide origins. We show that C. sativa was first domesticated in early Neolithic times in East Asia and that all current hemp and drug cultivars diverged from an ancestral gene pool currently represented by feral plants and landraces in China. We identified candidate genes associated with traits differentiating hemp and drug cultivars, including branching pattern and cellulose/lignin biosynthesis. We also found evidence for loss of function of genes involved in the synthesis of the two major biochemically competing cannabinoids during selection for increased fiber production or psychoactive properties. Our results provide a unique global view of the domestication of C. sativa and offer valuable genomic resources for ongoing functional and molecular breeding research.

INTRODUCTION

Few crops have been under the spotlight of controversy as much as Cannabis sativa. As one of the first domesticated plants, it has a long and fluctuating history interwoven with the economic, social, and cultural development of human societies. Once a major source for textiles, food, and oilseed as hemp, its exploitation to that end declined in the 20th century, while its use as a recreational drug (i.e., marijuana, which is illegal in many countries) has broadened. Although much debated in the past, it is currently widely accepted that the genus Cannabis comprises a single species, C. sativa L., hereafter also referred to as Cannabis [reviewed in (1)]. The plant is annual, wind-pollinated, and predominantly dioecious. It is diploid, with 10 pairs of chromosomes (2n = 20) and is characterized by an XY/XX chromosomal sex-determining system, with a genome size of about 830 Mb (2–4). On the basis of distribution and archaeobotanical data, a wide region ranging from West Asia through Central Asia to North China has often been suggested as the origin of cultivation for the plant, with its later spread worldwide coinciding with continuous artificial selection and extensive hybridization between locally adapted, traditional landraces and modern commercial cultivars. Clandestine drug breeding and the propensity of domestic plants to become feral (and possibly to have admixed with their wild ancestors) have contributed to the difficulty of reconstructing the species’ domestication history [reviewed in (3, 5, 6)].
Recently, there has been renewed global interest in the therapeutic potential of Cannabis, given its unique chemical components (7). Cannabis hemp and drug types also differ in their relative yield of cannabidiolic acid (CBDA) and Δ9-tetrahydrocannabinolic acid (THCA), the two most abundant and studied of at least 100 unique secondary metabolites known as cannabinoids (8). After decarboxylation, their bioactive forms (the well-known CBD and psychoactive THC) bind to endocannabinoid receptors in an animal’s central nervous system, eliciting a broad range of effects, some of which may alleviate symptoms of neurological disorders (9–14). Hemp cultivated for fiber typically produces higher concentrations of CBDA than THCA, whereas marijuana contains very high amounts of THCA and much higher overall levels of cannabinoids. Hybrid cultivars with high CBDA content are currently being developed for medical use. Hemp and marijuana have consequently been given separate statutory definitions, either based on a threshold of THC concentration (e.g., 0.3% dry weight in the European Union and the United States) or based on their chemical phenotype or chemotype [i.e., high, low, or intermediate ratio of THCA to CBDA characterizing, respectively, plants that contain predominantly THCA, predominantly CBDA, or both cannabinoids in approximately equivalent ratios (15)]. Despite an increasing need to produce varieties with specific cannabinoid profiles for therapeutic and recreational exploitation, and recent important contributions to our understanding of the structural and functional divergence as well as inheritance of their underlying synthase genes (16–20), the mechanisms mediating the evolution of these genes are still not clearly known.
Despite its ancient use dating back thousands of years, the genomic history of domestication of Cannabis has been understudied compared to other important crop species, largely due to legal restrictions. Recent genomic surveys applying genotyping-by-sequencing on mostly Western commercial cultivars highlighted a marked genome-wide differentiation between hemp and drug types, a result also shown by anonymous short tandem repeat markers (21–24). However, given the large gaps in our knowledge of the evolutionary history of domestication of Cannabis, a comprehensive reconstruction of the events responsible for the latter requires large-scale comparison of genomic data covering the full end use and geographic range, which is presently still lacking (6, 25). On the basis of an unprecedented global sampling effort, we provide such a framework here by compiling 110 whole genomes covering the full spectrum of wild-growing feral plants, landraces, historical cultivars, and modern hybrids from both hemp and drug types, with a particular focus on central and eastern Asia because of their hypothesized importance for the species’ origins of domestication (3, 5).

RESULTS AND DISCUSSION

Population genetic analyses

Our dataset combines new data (82 genomes) with publicly available whole genomes from 28 hemp and drug types (Fig. 1A and table S1). After mapping to the reference CBDRx genome (18), we identified 12,010,905 putative single-nucleotide polymorphisms (SNPs) that passed filtering criteria across the 104 Cannabis accessions retained for subsequent analyses (fig. S1; see Materials and Methods). We characterized the genetic relationships among all Cannabis accessions using maximum likelihood (ML) phylogeny (rooted on Humulus lupulus), as well as admixture and principal component analysis (PCA; Fig. 1). All our analyses show a strong clustering of Cannabis accessions into four well-separated genetic groups. The first group (hereafter Basal cannabis, group A; Fig. 1B and fig. S2) includes 14 feral plants and landraces collected in China and 2 feral plants from the United States [most likely originating from 19th-century Chinese landraces (5)]; this group is sister to all other Cannabis accessions. The second group (Hemp-type, group B) includes hemp varieties distributed worldwide (5 feral plants, 13 landraces, and 20 cultivars). The third group (Drug-type feral, group C) contains at its base 3 feral samples collected in southern China, 11 feral plants collected in India and Pakistan south of the Himalayas, and one drug cultivar from India. The fourth group (Drug-type, group D) includes cultivated drug varieties distributed worldwide (35 cultivars). We found complete congruence between the four phylogenetically defined clusters and the commercial labels, current or historical end-use designation and/or predominant geographic origin of the accessions. However, to avoid bias due to potential ancestry admixture, we also conducted most downstream analyses excluding admixed samples as identified by the structure analysis (Fig. 1, C and E; see Materials and Methods for further explanations; all results are in the Supplementary Materials).
FIG. 1. Population structure of Cannabis accessions.
(A) Geographic distribution (i.e., sampling sites of feral plants or country of origin of landraces and cultivars) of the samples analyzed in this study. Color codes correspond to the four groups obtained in the phylogenetic analysis and shapes indicate domestication types. The two empty red squares symbolize drug-type cultivars obtained from commercial stores located in Europe and the United States. For sample codes, see table S1. (B) Maximum likelihood phylogenetic tree based on single-nucleotide polymorphisms (SNPs) at fourfold degenerate sites, using H. lupulus as outgroup. Bootstrap values for major clades are shown. (C) Bayesian model–based clustering analysis with different number of groups (K = 2 to 4). Each vertical bar represents one Cannabis accession, and the x axis shows the four groups. Each color represents one putative ancestral background, and the y axis quantifies ancestry membership. (D) Nucleotide diversity and population divergence across the four groups. Values in parentheses represent measures of nucleotide diversity (π) for the group, and values between pairs indicate population divergence (FST). (E) Principal component analysis (PCA) with the first two principal components, based on genome-wide SNP data. Colors correspond to the phylogenetic tree grouping.
Contrary to a widely accepted view, which associates Cannabis with a Central Asian center of crop domestication [mostly based on feral plant distribution data, e.g., (26)], our results are consistent with a single domestication origin of C. sativa in East Asia, in line with early archaeological evidence (see below). The results also indicate that some of the current Chinese landraces and feral plants represent the closest descendants of the ancestral gene pool from which hemp and marijuana landraces and cultivars have since derived. East Asia has been shown to be an important ancient hot spot of domestication for several crop species, including rice, broomcorn and foxtail millet, soybean, foxnut, apricot, and peach [reviewed in (27–29)]; our results thus add another line of evidence for the importance of this domestication hot spot. Our analyses show that all hemp-type samples (group B) are reciprocally monophyletic to all drug-type samples (both feral and cultivars; groups C and D), indicative of independent breeding trajectories with remarkably little evidence for complex patterns of gene flow among end-use types during global expansion. More specifically, the phylogenetic tree topology suggests (i) a Chinese origin for modern hemp cultivars, illustrated by Chinese hemp landrace accessions (NER) at the most basal position of Hemp-type group B (fig. S2); (ii) substantial differentiation between drug-type feral plants and one cultivar from an area covering both sides of the Himalayan range (group C), and modern European and American marijuana cultivars (group D) that have arisen via intense recent selection for high THC content (as also indicated by reciprocally high FST values among drug groups C and D; Fig. 1D); and (iii) a distinct breeding history for marijuana samples from equatorial regions (MSA, PEU, SWD, HMW, and THD; for sample codes, see table S1), which tend to occupy a basal position among the group’s subclades compared to the majority of modern commercial drug-type cultivars. Archaeological and historical sources are overall consistent with our phylogenetic analyses (see below). In addition, similar levels of genetic diversity between basal group A and the other groups, the clustering of feral plants in basal group A together with cultivated landraces (NEB), and the presence of wild-growing feral plants from Central Asia nested within the Hemp-type group B (Fig. 1D and figs. S2 and S3) indicate that all feral plants studied here are not wild types, but historical escapes from domesticated forms. Although additional sampling of feral plants in these key geographical areas is still needed, our results, which are already based on very broad sampling, would suggest that pure wild progenitors of C. sativa have gone extinct (3, 5).
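The two summary statistics used in this section, nucleotide diversity (π) and population divergence (FST), can be illustrated with a minimal sketch. The code below assumes aligned, equal-length haplotype strings and uses a simple Hudson-style estimator (FST = 1 − mean within-group diversity / between-group diversity); this estimator, the function names, and the toy sequences are illustrative assumptions and are not necessarily what was used to produce Fig. 1D.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(seqs):
    """pi: mean per-site difference over all pairs within one group."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def between_group_diversity(group1, group2):
    """Mean per-site difference over all pairs drawn across two groups."""
    pairs = list(product(group1, group2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def fst_hudson(group1, group2):
    """Hudson-style FST: 1 - (mean within-group pi) / (between-group diversity)."""
    within = (nucleotide_diversity(group1) + nucleotide_diversity(group2)) / 2
    between = between_group_diversity(group1, group2)
    if between == 0:
        return 0.0  # identical groups: no divergence to measure
    return 1 - within / between


# Toy haplotypes (invented for illustration, not data from the study).
toy_group_1 = ["AAAA", "AAAT"]
toy_group_2 = ["TTTT", "TTTA"]

pi_1 = nucleotide_diversity(toy_group_1)
fst_12 = fst_hudson(toy_group_1, toy_group_2)
print(f"pi(group 1) = {pi_1:.3f}, FST(1, 2) = {fst_12:.3f}")
```

In this toy case the two groups share little variation, so FST is high even though each group has the same within-group diversity, which is the kind of pattern the text describes between groups C and D.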

Demographic history

The strong selection likely exerted on Cannabis through its long domestication process is expected to substantially affect the effective population size (Ne) of the existing genetic clusters. To address this issue, we estimated Ne using the pairwise sequentially Markovian coalescent (PSMC) method (30) and found that all four groups exhibited similar demographic trajectories (Fig. 2A and fig. S4). The ancestral Ne of Cannabis reached a peak at ~1 million years ago, followed by a continuous decline until the end of the last glacial maximum [~20,000 years before the present (B.P.)]. We further used coalescent simulations to model the recent demography of Cannabis. Drug-type feral and Drug-type genetic clusters were treated as one group to reduce model comparisons and parameters. Eighteen alternative models were defined to test bottlenecks and/or growth of the Basal cannabis group, Hemp-type group, and the integrated drug-type group with or without migration between these groups (fig. S5). The model involving a multistep domestication process (with changes in all population sizes and continuous post-domestication introgression from Basal cannabis/feral populations to both hemp and drug types) produced a significantly better fit than alternative models (Fig. 2B, figs. S6 and S7, and tables S2 and S3). The shared haplotypes between Basal cannabis and other groups were also shown in identity-by-descent analysis (fig. S8).
FIG. 2. Demographic history of C. sativa and selection signatures identified from comparison between hemp- and drug-type cultivars.
(A) Demographic history inferred from the PSMC method (30). (B) Graphical summary of the best-fitting demographic model inferred by fastsimcoal2 (65). Widths show the relative effective population sizes (Ne). Arrows and figures at the arrows indicate the average number of migrants per generation among different groups. The point estimates and 95% confidence intervals of demographic parameters are shown in table S3. Examples of genes with selection sweep signals in hemp-type cultivars (C) and drug-type cultivars (D). Three independent sets of signals (FST, π ratio, and XP-CLR) are shown along the genomic regions covering the four genes. Dashed lines represent the top 5% of the corresponding values. Below the three plot schemes are the gene models in the genomic regions. Below each gene model are the SNP allele distributions along each of the four genes for the two groups (green, heterozygous site; orange, homozygous site of reference allele; blue, homozygous site of alternative allele; gray, missing data).
Our genome-wide analyses corroborate the existing archaeobotanical, archaeological, and historical record [reviewed in (5, 6, 31–33)] and provide a detailed picture of the domestication of Cannabis and its consequences on the genetic makeup of the species. Our genomic dating suggests that early domesticated ancestors of hemp and drug types diverged from Basal cannabis ~12,000 years B.P. (95% confidence interval: 6458 to 15,728 years B.P.; Fig. 2B and table S3), indicating that the species had already been domesticated by early Neolithic times. This coincides with the dating of cord-impressed pottery from South China and Taiwan (12,000 years B.P.), as well as pottery-associated seeds from Japan (10,000 years B.P.). Archaeological sites with hemp-type Cannabis artifacts are consistently found from 7500 years B.P. in China and Japan, and pollen consistent with cultivated Cannabis was found in China more than 5000 years B.P. Only a small number of early domesticated Cannabis strains expanded to later form hemp and drug types ~4000 years B.P., a time when multiple fiber artifacts appear in East Asia, and when fiber-grown Cannabis was spreading westward into Europe and the Middle East, as shown by Bronze Age archaeological evidence. Ritualistic and inebriant use of Cannabis has in turn been documented in Western China from archaeological remains at least 2500 years B.P. (34, 35). The first archaeobotanical record of C. sativa in the Indian subcontinent dates back to ~3000 years B.P., the species likely being introduced from China together with other crops (36, 37). In contrast with East Asia, historical texts from India from as early as 2000 years B.P. indicate that the species was only exploited for drug use. Over the next centuries, drug-type Cannabis traveled to various world regions, including Africa (13th century) and Latin America (16th century), progressively reaching North America at the beginning of the 20th century and later, in the 1970s, from the Indian subcontinent.
Meanwhile, hemp-type cultivars were first brought to the New World by early European colonists during the 17th century and later replaced in North America by Chinese hemp landraces by the middle 1800s. Consistent with this history, our model shows a gradual increase in the Ne of hemp and drug types. On the basis of both demographic and phylogenetic analyses, we propose that early domesticated Cannabis was first used as a primarily multipurpose crop until ~4000 years B.P., before undergoing strong divergent selection for increased fiber or drug production.

Selection signatures during domestication and improvement

As with other crop species, the domestication and diversification of Cannabis involved several complex steps, leading to a geographical radiation and the deliberate breeding of varieties involving selection on traits to maximize yield and quality (38). We applied an integrative approach (π, FST, and XP-CLR; see Materials and Methods) to identify candidate genes involved in divergence of hemp and drug types after their early domestication. The three approaches combined allowed us to identify a total of 510 candidate genes in hemp-type samples and 689 in drug-type samples, when compared to the Basal cannabis group, of which 253 are overlapping (fig. S9), while 134 and 472 genes are specific to hemp- and drug-improved cultivars, respectively, when compared to each other (tables S4 to S9). Several genes bearing signals of positive selection in hemp-type–improved cultivars are involved in inhibiting branch formation (e.g., D14 and KNAT1), associated with flowering time and photoperiodism (e.g., FLK and EHD3) and involved in cellulose and lignin biosynthesis (e.g., SS and SPS1). In drug types, we infer selection on genes promoting branch formation (e.g., NDL2 and DTX48), associated with flowering time (e.g., HUA2 and FPF1) and involved in lignin biosynthesis (e.g., CSE and C4H; Fig. 2, C and D, and tables S10 and S11). In addition, we also detected signals of positive selection in drug-type cultivars when compared to hemp-type cultivars on the gene HDR (tables S5 and S10) coding for the last enzyme in the methylerythritol phosphate pathway (producing essential substrates for cannabinoid biosynthesis) and which has been shown to be potentially associated with variance in total cannabinoid content [i.e., potency (18)]. These results are consistent with traits expected to have been affected by selection during domestication of C. sativa, i.e., leading to unbranched, tall hemp plants maximizing cellulose-rich/lignin-poor bast fibers in the stems versus well-branched, short marijuana plants with lignin-rich woody cores, maximizing flower and resin production (3, 39, 40).
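The idea of combining several selection statistics can be sketched as an outlier intersection. The example below assumes that each statistic (FST, π ratio, XP-CLR) has already been computed per genomic window and oriented so that larger values are more extreme, and that a window counts as a candidate only if it falls in the top 5% for all three. That combination rule, the function names, and the numbers are assumptions made for illustration; the exact criteria are those given in the Materials and Methods.

```python
def top_fraction_cutoff(values, fraction=0.05):
    """Smallest value that still belongs to the top `fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


def candidate_windows(stats, fraction=0.05):
    """Return windows that are top-`fraction` outliers for every statistic.

    `stats` maps a statistic name to a dict of {window_id: value}.
    """
    selected = None
    for per_window in stats.values():
        cutoff = top_fraction_cutoff(list(per_window.values()), fraction)
        outliers = {w for w, v in per_window.items() if v >= cutoff}
        selected = outliers if selected is None else selected & outliers
    return selected or set()


# Invented values for 40 windows (not data from the study).
windows = range(40)
toy_stats = {
    "fst": {w: 0.10 for w in windows},
    "pi_ratio": {w: 1.0 for w in windows},
    "xpclr": {w: 2.0 for w in windows},
}
toy_stats["fst"].update({7: 0.90, 12: 0.80})
toy_stats["pi_ratio"].update({7: 5.0, 12: 4.0})
toy_stats["xpclr"].update({7: 30.0, 3: 25.0})

print(sorted(candidate_windows(toy_stats)))
```

Window 12 is an outlier for two statistics but not the third, so only window 7 survives the intersection. Requiring agreement across statistics trades sensitivity for fewer false positives.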

Loss of function of the two main cannabinoid synthase genes during domestication

The two main cannabinoids CBDA and THCA characterizing hemp- and drug-type varieties are produced in a biosynthetic reaction catalyzed by the enzymes CBDA and THCA synthase, which compete for the same substrate cannabigerolic acid (CBGA) [reviewed in (8)]. The two synthases are encoded by the genes CBDAS and THCAS, which belong to the berberine bridge enzyme (BBE)–like multigene family, from which they possibly arose by duplication and neofunctionalization [reviewed in (41)]. When involved in secondary metabolism, the homologs of these genes likely play a major role in chemical plant defense (8). Confirming earlier genetic studies, recent genome assemblies showed that CBDAS and THCAS (and their multiple pseudogenic copies) lie scattered within closely linked loci, in a retrotransposon-rich, highly repetitive region of the genome with suppressed recombination, and with a history of extensive rearrangement and tandem duplication/pseudogenization events (4, 16–19). Using strict filtering criteria, we mapped the reads of the 104 analyzed genomes to a reference CBDA/THCA hybrid cultivar genome [Jamaican Lion DASH (42)], in which full-length coding sequences for THCAS, CBDAS, and more than 30 pseudogene copies of these genes are assembled. The results (Fig. 3A) show that all marijuana cultivars from the Drug-type genetic group D always map a complete coding sequence for THCAS and two CBDAS pseudogenes (with 93 to 94% similarity to the full CBDAS; pseudogenes 1 and 2 in Fig. 3A; see Materials and Methods), with the exception of only five samples that also map a full CBDAS gene. Conversely, within the Hemp-type genetic group B, consisting of plants selected for fiber production, all accessions only map a complete sequence for CBDAS, with the exception of nine samples (mostly landraces; Fig. 3B) that either map both genes and the CBDAS pseudogenes or map THCAS and the CBDAS pseudogenes.
The main pattern inferred from our comparative analysis confirms previous structural data based on full genome sequencing of single cultivars (18, 19). It is also consistent with published chemotype inheritance models validated among a wide variety of Cannabis accessions (16, 17, 20, 43, 44), thus providing complementary evidence for the latter at the genomic sequence level and global validation across a comprehensive panel of Cannabis domestication types distributed worldwide. Although our results would require confirmation with associated phenotypic or expression data, they nevertheless provide support for a genetic model of inheritance based on CBDAS genotyping (20), in which plants that are homozygous for functional or nonfunctional alleles of CBDAS have the CBD-type or THC-type chemotype, respectively, whereas plants that are heterozygous have the intermediate-type chemotype (consistent with codominant Mendelian inheritance due to the documented physical linkage of the two synthase genes). The occurrence of five samples mapping full THCAS and two CBDAS pseudogenes (i.e., with a presumed THC chemotype) nested within the Hemp-type genetic group and, more generally, the scattered phylogenetic clustering of synthase gene combination (i.e., of more than one presumed chemotype class) across the Hemp-type and Drug-type genetic groups provide a compelling argument for the independence of cannabinoid synthase inheritance from a multitude of other positively selected traits differentiating fiber-type from drug-type Cannabis [see also the high-CBDA cultivar CBDRx, which has full CBDAS and lacks full THCAS (i.e., CBD chemotype) but clusters genetically among marijuana cultivars; figure 1 in (18)]. As such, the results call into question, from both a biological and functional point of view, the current binary categorization of Cannabis plants as “hemp” or “marijuana” derived from the assignment to a single phenotype [see also (20)].
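The genotype-to-chemotype model described above can be written down as a short sketch. It treats the linked synthase region as a single codominant locus and encodes a plant by its number of functional CBDAS alleles (0, 1, or 2), following the model cited in the text; the class and function names are hypothetical, and the sketch ignores any other loci or environmental effects on cannabinoid content.

```python
from collections import Counter
from enum import Enum
from fractions import Fraction
from itertools import product


class Chemotype(Enum):
    CBD_TYPE = "CBD-type"
    INTERMEDIATE = "intermediate-type"
    THC_TYPE = "THC-type"


def predict_chemotype(functional_cbdas_alleles):
    """Map the number of functional CBDAS alleles (0, 1, 2) to a chemotype."""
    mapping = {
        2: Chemotype.CBD_TYPE,      # homozygous functional
        1: Chemotype.INTERMEDIATE,  # heterozygous
        0: Chemotype.THC_TYPE,      # homozygous nonfunctional
    }
    if functional_cbdas_alleles not in mapping:
        raise ValueError("a diploid plant carries 0, 1, or 2 functional alleles")
    return mapping[functional_cbdas_alleles]


def offspring_chemotypes(parent_a, parent_b):
    """Expected chemotype proportions from a cross under Mendelian segregation.

    Each parent is given as its number of functional CBDAS alleles.
    """
    def gametes(n):
        if n not in (0, 1, 2):
            raise ValueError("a diploid plant carries 0, 1, or 2 functional alleles")
        return (1,) * n + (0,) * (2 - n)  # 1 = functional, 0 = nonfunctional

    counts = Counter(
        predict_chemotype(x + y)
        for x, y in product(gametes(parent_a), gametes(parent_b))
    )
    total = sum(counts.values())
    return {chemotype: Fraction(k, total) for chemotype, k in counts.items()}


# Crossing two heterozygous (intermediate-type) plants.
for chemotype, share in offspring_chemotypes(1, 1).items():
    print(chemotype.value, share)
```

Under these assumptions, two intermediate-type parents give the familiar 1:2:1 segregation of CBD-type, intermediate-type, and THC-type offspring.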
FIG. 3. Evolution of CBDAS and THCAS.
(A) Occurrence of CBDA-synthase gene (CBDAS), THCA-synthase gene (THCAS), and two CBDAS pseudogenes across 104 Cannabis accessions, based on mapping to a reference genome having both genes and many pseudogene copies of them [Jamaican Lion DASH (42)]. Cladogram on top and symbols are as in Fig. 1. For sample codes, see table S1. Below the cladogram, it is indicated for each gene whether reads from each sample mapped to the reference positions. The height of each gene box represents the length of the gene. The Jamaican Lion DASH genome sequence coordinates for the four genes are shown on the right. (B) Top left: Phytocannabinoids CBDA and THCA result from a biosynthetic reaction catalyzed respectively by the enzymes CBDA and THCA synthase from the common precursor CBGA. Bottom: The proportion of CBDAS and THCAS in each of the four groups. Top right: The proportion of CBDAS and THCAS in landraces versus cultivars within the Hemp-type group. Fisher’s exact test, *P < 0.05; ***P < 0.001. (C) Transcriptomic expression for the two genes and pseudogenes in different tissues and vegetative stages [data from (47)]. Wilcoxon rank-sum test, *P < 0.05.
In contrast with these results, samples belonging to the Basal cannabis group (and to a lesser extent to the Drug-type feral group) show a more variable pattern, with the presence of one or another synthase gene, or co-occurrence. Overall, our results point to a loss of complete coding THCAS or CBDAS sequence during intensive and recent selection for increased fiber production or psychoactive properties, respectively (Fig. 3B). They suggest the ancestral possession of both genes in a functional state, a polymorphic condition before or during the early stages of domestication with loss of function of one of the two synthase genes, and the extensive loss of full THCAS in hemp-type and CBDAS in drug-type cultivars due to strong selection for beneficial crop phenotypes (Fig. 3, A and B).
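The comparisons of gene presence between groups in Fig. 3B use Fisher’s exact test. As a minimal sketch of how such a test works on a 2 × 2 table (for example, accessions with and without a full-length gene in two groups), the function below sums the hypergeometric probabilities of all tables that are no more likely than the observed one. The function name is hypothetical and the counts are invented for illustration; they are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are gene present / gene absent.
    Returns the sum of probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * tolerance
    )


# Invented counts: group 1 has 3 accessions with the gene and 1 without,
# group 2 has 1 with and 3 without.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"P = {p_value:.4f}")
```

With samples this small the difference is not significant, which illustrates why an exact test is preferred over an asymptotic one when cell counts are low.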
The pseudogenization of CBDAS and exclusive presence of full THCAS in marijuana cultivars are consistent with artificial selection of high THCA synthesis through the suppression of competition between the two synthase enzymes for their common substrate CBGA [Fig. 3B; (45, 46)], possibly also because CBDA synthase has been shown to be a superior competitor for CBGA when both synthases are present (17). The predominant occurrence of CBDAS and loss of function of THCAS in hemp types, by contrast, is more puzzling. Our analysis of transcriptomics data (47) from a cultivar having both synthase genes and the two CBDAS pseudogenes reveals that the expression level of CBDAS is always significantly higher than that of THCAS, although both are expressed in all tissues and vegetative stages (Fig. 3C). A functional CBDAS does not seem a prerequisite for good quality fiber production in hemp [e.g., hemp cultivar Santhica 27, lacking both synthase genes (FSA in Fig. 3A) and known to mostly produce CBGA (48)], but it is plausible that CBDA-synthase activity (and/or the corresponding loss of that of THCA synthase) may have allowed increased bast fiber production via a physiological trade-off. Although such a trade-off might appear unlikely, it would resonate with the known role played not only in plant defense but also in the processes of cell wall biosynthesis and/or immunity by the primordial BBE-like enzymes from which cannabinoids evolved (49, 50). Of course, the loss of full THCAS sequence observed in modern hemp types may also simply reflect selective breeding of varieties with very low levels of THCA licensed for cultivation.

Conclusion

Together, our genomic, phylogenetic, and demographic analyses of 110 diverse C. sativa accessions have identified the time and origin of domestication, post-domestication divergence patterns and present-day genetic diversity, and genomic structure of an exhaustive worldwide panel of Cannabis wild-growing feral, landrace, and cultivar representatives. Our study thus provides new insights into the domestication and global spread of a plant with divergent structural and biochemical products at a time in which there is a resurgence of interest in its use (39, 51, 52), reflecting changing social attitudes and corresponding challenges to its legal status in many countries. Our analysis has detected genes putatively under divergent selection between hemp- and drug-use accessions and has specifically disentangled the effects of domestication on the evolution of the chief cannabinoid genes targeted for their medical properties. Our results provide support for an evolutionary scenario that accounts for the variability in cannabinoid composition among plants as a result of artificial selection by early farmers for loss-of-function mutations (53). Our results also offer an unprecedented base of genomic resources for ongoing molecular breeding and functional research, both in medicine and in agriculture.
