Genome assembly and annotation
The genome size of ‘Kau’ was estimated to be 890 Mb by flow cytometry (Supplementary Table 1), consistent with the previous k-mer-based estimate for macadamia of 896 Mb for ‘Mauka’21, a close relative of ‘Kau’22. We generated 89 Gb (100×) of long reads from the PacBio Sequel II platform and 46 Gb (50×) of short-read sequence data from Illumina NovaSeq (Supplementary Table 2). The initial contig-level assembly using CANU 1.7 yielded 1.10 Gb of assembled sequences, indicating that some heterozygous regions were assembled twice (Supplementary Table 3). To eliminate redundant sequences, Illumina reads were mapped to the assembled contigs to identify duplicated sequences, i.e. allelic haplotypes, resulting in the removal of 295 Mb of sequences from the initial contig assembly. The assembled genome was 794 Mb, with a contig N50 of 281 kb (Supplementary Tables 3 and 4). Chromosome-level assembly of the ‘Kau’ genome was achieved using high-throughput chromatin conformation capture (Hi–C) for physical mapping to anchor scaffolds, resulting in 14 pseudo-chromosomes that anchored 794 Mb (99.97%) of the genome (Fig. 1, Supplementary Fig. 1 and Supplementary Table 5).
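The depth figures and the contig N50 quoted above follow from simple definitions, which can be sketched in Python. This is only an illustration under stated assumptions: the function names are ours, and the toy contig lengths are hypothetical values, not data from the ‘Kau’ assembly.

```python
def sequencing_depth(bases_gb, genome_mb):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    return bases_gb * 1000 / genome_mb


def n50(lengths):
    """Length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Depth implied by the reported yields and the 890 Mb flow-cytometry estimate.
long_read_depth = sequencing_depth(89, 890)    # 100x
short_read_depth = sequencing_depth(46, 890)   # roughly 50x

# Hypothetical toy contig lengths (kb), NOT taken from the assembly.
toy_contigs_kb = [500, 400, 300, 200, 100]
toy_n50 = n50(toy_contigs_kb)
```

With the reported yields, the long-read depth works out to 100× and the short-read depth to just under 52×, which the text rounds to 50×.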
a Karyotype in Mbp. b Gene density; red indicates high density and green indicates low density. c Gene expression; red indicates high expression level and blue indicates low. d LTR distribution in chromosomes. e DNA transposable element distribution in chromosomes; red indicates high density and blue indicates low density.
BUSCO analysis of 1375 conserved single-copy plant genes revealed 92.1% completeness, with only 66 genes missing in the macadamia assembly (Supplementary Table 4). Alignment of RNA-seq assembled transcripts to the assembly showed 99.99% base accuracy (Supplementary Table 6). In addition, 99.5% (165.23/166.03 million) of Illumina short reads were mapped to the assembly, covering 99.0% of the genome (Supplementary Table 7).
Genome annotation resulted in 37,728 protein-coding genes with 88.4% BUSCO completeness and 113 microRNAs (Supplementary Tables 4 and 5). In addition, we predicted 461.07 Mb of repetitive sequences, accounting for 57.0% of the assembled genome, including 49.0% retrotransposons and 8.8% DNA transposons (Supplementary Table 8). Long terminal repeat (LTR) retrotransposons were the major components, containing 266.9 Mb of sequences and accounting for 33.0% of the genome, with 16.5% Gypsy and 6.4% Copia. The LINE retrotransposon content is unusually abundant and, at 11.5% of the genome, higher than that of Copia. A recent burst of Gypsy retrotransposons and an ancient burst of LINE elements were detected (Supplementary Fig. 2).
Comparative genomic analysis
Comparative genomic analysis of macadamia M. integrifolia and lotus Nelumbo nucifera showed fragmented conserved synteny (Fig. 2a and Supplementary Fig. 3), and identified 1:1 syntenic depth ratios in the macadamia-lotus and lotus-macadamia comparisons, respectively. Macadamia chromosome 1 aligned with parts of lotus chromosomes 2 and 7, while lotus chromosome 7 aligned with parts of macadamia chromosomes 1, 3, 10, 11, and 13. In general, each macadamia chromosome is aligned to parts of 2 or more of the 8 lotus chromosomes, and each lotus chromosome is aligned to parts of 4 or more of the 14 macadamia chromosomes. The close relationship of macadamia and sacred lotus is also confirmed in the maximum likelihood phylogeny of 898 gene orthologs (Fig. 2b). The divergence time between macadamia and lotus is estimated at 100.3 million years ago (MYA) (Fig. 2b), and a whole-genome duplication (WGD) in the macadamia lineage occurred about 42.3 MYA (Ks = 0.35; Fig. 2c and Supplementary Fig. 4).
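The link between a Ks peak and an age estimate can be illustrated with the commonly used relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below is a back-calculation under that assumption only. This section does not state the rate used for dating, so the value of r computed here is implied by the reported numbers rather than taken from the study, and the function names are ours.

```python
def age_from_ks(ks, rate_per_site_per_year):
    """Age of a duplication or divergence event from T = Ks / (2r)."""
    return ks / (2 * rate_per_site_per_year)


def implied_rate(ks, age_years):
    """Substitution rate implied by a Ks peak and an age, r = Ks / (2T)."""
    return ks / (2 * age_years)


# Reported WGD peak (Ks = 0.35) and age (42.3 MYA).
# The rate below is an inference from these two numbers, not a reported value.
rate = implied_rate(0.35, 42.3e6)

# Round trip: feeding the implied rate back in recovers the reported age.
wgd_age_years = age_from_ks(0.35, rate)
```

Under this assumption the implied rate is about 4.1 × 10⁻⁹ substitutions per synonymous site per year.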

a Inter-genomic comparison between M. integrifolia and N. nucifera. b Inferred phylogenetic tree across seven plant species including macadamia, calibrated using the divergence times of A. thaliana and C. papaya (68–72 million years ago) and of monocots and eudicots (120–140 million years ago). c Synonymous substitution rate (Ks) distributions of syntenic blocks for M. integrifolia paralogs and orthologs with N. nucifera, as shown by colored lines. d Shared gene families among At = A. thaliana, Os = O. sativa, Sly = S. lycopersicum, Vv = V. vinifera, Nu = N. nucifera, and Mi = M. integrifolia. The six species contain 8955 common gene families, and M. integrifolia has 42 specific gene families.
The analysis of gene families shared between ‘Kau’ and representatives of six other species of diverse lineages, including five eudicots and one monocot, resulted in 213,308 proteins (67.43% of the input sequences) clustered into 14,999 groups (Supplementary Table 9), with 8955 gene families shared across the six lineages (Fig. 2d and Supplementary Table 9). Of 37,728 macadamia proteins, 26,889 clustered into 13,183 groups, of which 42 clusters were macadamia-specific and contained 222 proteins (Fig. 2d). These species-specific genes were distributed across all 14 macadamia chromosomes (Supplementary Fig. 5). KEGG pathway analysis identified many species-specific genes related to environmental adaptation (Supplementary Fig. 6). There were also 10,853 singleton proteins unique to macadamia (Supplementary Table 10).
Macadamia shell development
There were 2735, 2641, 2337, 2201, and 2235 differentially expressed genes (DEGs) in the five stages of macadamia shell formation examined (Fig. 3a), with 1464 DEGs shared by all stages (Fig. 3b). Following grouping of 3845 DEGs correlated with shell development into 16 clusters based on their expression patterns (Fig. 3c, Supplementary Fig. 7 and Supplementary Table 11), DEGs from clusters 5 and 16 showed high expression levels in shells compared to other tissues (Fig. 3c).

a Phenotype of shell and kernel at different stages of fruit development. b Venn plot of up-regulated genes in Stage 1, Stage 2, Stage 3, Stage 4, and Stage 5 of shells. c Mfuzz clustering of differentially expressed transcripts in shell, kernel, and other tissues. d Schematic of shell development and hardening. A proposed model of STK, TT16, and Prx17 in regulation of shell development and hardening. e Phylogenetic tree of STK, AG, FUL, and SHP genes in several species, including M. integrifolia, N. nucifera, A. thaliana, V. vinifera, and C. papaya. STK, AG, FUL, and SHP clades are indicated in different colors. f Phylogenetic tree of Prx17 in M. integrifolia, N. nucifera, A. thaliana, V. vinifera, S. lycopersicum, Z. mays L., O. sativa, and S. bicolor.
Expression of genes in cluster 16 exhibited an upward trend and was highest in the late stage of shell development (Supplementary Table 12). Most of these genes are involved in histogenesis and development of phloem or xylem, secondary cell wall formation, and lignin biosynthesis (Supplementary Table 13). In contrast, genes in cluster 5 are highly expressed in every developmental stage of shells (Supplementary Table 14). Many are key transcription factors of organ formation and morphogenesis, essential genes in the phenylpropanoid or flavonoid biosynthesis pathways, or sugar transporters (Supplementary Table 15).
Notably, SEEDSTICK (STK) and TRANSPARENT TESTA 16 (TT16), which encode MADS-domain transcription factors that act as master regulators of seed coat development and metabolism, are highly expressed in shells (Supplementary Figs. 8, 9). Two paralogs of STK (MiSTK1 and MiSTK2) were identified in the macadamia genome on Chr8 and Chr6, respectively, resulting from duplications. We also identified one STK ortholog in each of the Arabidopsis thaliana, Vitis vinifera, and Carica papaya genomes, but no copy in N. nucifera (Fig. 3e).
Macadamia MiSTK1 and MiSTK2 are highly similar in coding sequences, protein sequences, and gene structure (Supplementary Figs. 10, 11). We found that all three MiSTKs showed very similar expression patterns in flowers, leaves, stems and roots, but were strongly expressed in shells (Supplementary Fig. 8).
We identified orthologs of the class III peroxidase PRX17, which regulates age-dependent lignified tissue formation, in the macadamia and seven other genomes, with two paralogs in macadamia, two paralogs in lotus, and one ortholog in each of the other six species (Fig. 3f). The two paralogs of macadamia MiPRX17 (MiPRX17A and MiPRX17B) are highly similar in coding sequences, protein sequences, and gene structure (Supplementary Fig. 10c, d). While MiPRX17A and MiPRX17B showed very similar expression patterns in flowers, leaves, stems, and roots, they were strongly expressed in shells (Supplementary Fig. 8). We also detected strong expression in macadamia shell of MiAGL15, an ortholog of AtAGL15 that directly regulates AtPRX17 by binding to the CARGCW8 cis-element23. Promoter analysis showed that MiPRX17A has one putative binding site for MiAGL15 and MiPRX17B has one cis-element related to lignin biosynthesis (Supplementary Table 16).
Accumulation of proanthocyanidins (PAs) in the innermost layer of the seed coat is necessary for functional seed coat growth and is also a characteristic feature of seed coat development24,25. Among the highly expressed genes in shells, many are related to the PA biosynthesis pathway, including PALs, 4CLs, COMTs, CHSs, CHIs, F3H, DFRs, and TT12 (Supplementary Fig. 9).
Fatty acid biosynthesis in macadamia kernel development
There were 1953, 1610, and 1417 DEGs in Kernel 1 (Stage 1), Kernel 3 (Stage 3), and Kernel 5 (Stage 5), the three stages of kernel development examined, with 579 DEGs shared by the three stages (Supplementary Fig. 12a). All these DEGs were filtered from kernel samples for weighted gene co-expression network analysis (WGCNA). Cluster analyses of the DEGs indicated a higher correlation between similar tissues/developmental stages. Stage 1 kernel transcriptomes clustered separately from all others, while those of the later three development stages clustered together and showed substantial differences from those of other tissues (Supplementary Fig. 12b). Module–trait relationship analysis shows that the blue and grey modules are highly related to fatty acid biosynthesis during kernel development (Supplementary Fig. 13a). There are 591 and 488 genes in the blue and grey modules, respectively, that are highly correlated with kernel development (Supplementary Fig. 13b). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of these highly correlated genes showed that they were significantly enriched in lipid metabolism (Supplementary Fig. 14), and included orthologs of known transcription factors controlling seed oil biosynthesis, Wrinkled1 (WRI1), Abscisic Acid Insensitive3 (ABI3) and Fusca3 (FUS3) (Fig. 4a). Other genes that are involved in fatty acid biosynthesis and oil assembly are also highly expressed in kernel tissues, including Fatty Acyl-ACP Thioesterase A (FATA), Biotin Carboxylase (CAC), Enoyl-ACP Reductase (ENR), 3-Ketoacyl-CoA Reductase (KAR), 3-Ketoacyl-CoA Synthase (KAS), Elo homolog 2 (ELO2), Membrane-Bound O-acyl Transferase (MBOAT) and Oleosin (OLE) (Fig. 4a).

a Expression of fatty acid biosynthesis-related genes in kernels and other tissues of macadamia. b Schematic of the lipid biosynthesis pathway. KAS ketoacyl synthases, SAD stearoyl-ACP desaturase, DGAT diacylglycerol acyltransferase. c Phylogeny of the KASI gene family in M. integrifolia. LPA lysophosphatidic acid, DAG diacylglycerol, TAG triacylglycerol, FFA free fatty acid. d Expression of fatty acid biosynthesis genes in tissues of M. integrifolia. Source data underlying a, d are provided as a Source Data file.
In the macadamia genome, 269 clusters (3955 genes) were significantly expanded and 16 (1704 genes) were contracted compared with other plant genomes, in an analysis undertaken to investigate the genomic basis of selected metabolite biosynthesis (Supplementary Table 18). KEGG pathway analysis of the expanded genes revealed marked enrichment in functions related to fatty acid metabolism, such as fatty acid biosynthesis, elongation and degradation, palmitate biosynthesis, stearate biosynthesis, cis-vaccenate biosynthesis and cutin, suberine, and wax biosynthesis (p-value < 0.05, Supplementary Fig. 15, Supplementary Table 19). We identified gene families related to fatty acid chain elongation (Fig. 4b), desaturation, and acyl transfer, such as ketoacyl synthases (KAS), stearoyl-ACP desaturase (SAD), fatty acid desaturases (FAD), diacylglycerol acyltransferase (DGAT), and acyl-CoA:sn-glycerol-3-phosphate acyltransferase (GPAT), in 14 species (Supplementary Table 20). Comparison of gene numbers in macadamia with other species showed distinct gene family expansions of the KASI and SAD genes. For KASI in particular, there were six paralogs in macadamia and one ortholog each in A. thaliana, B. rapa, P. dulcis, and A. duranensis (Fig. 4c). KASI is responsible for the elongation of fatty acid chains from enoyl-ACP (4:0-ACP) to palmitoyl-ACP (16:0-ACP).
Phylogenetic analysis of KASI and FAB2 proteins from macadamia and other species showed that the six copies of KASI and FAB2 in macadamia had a very close relationship (Fig. 4a and Supplementary Fig. 12c), although the genomic regions do not share synteny. MiKASI1, MiKASI3 and MiKASI5, together with MiFAB2.7 and MiFAB2.12, in particular exhibited dramatically higher expression in kernels (Fig. 4d).
Genetic diversity and domestication origin
To explore genetic diversity and the brief domestication history, 112 macadamia accessions were re-sequenced, including 59 cultivars and selected lines, 42 wild accessions, seven hybrid cultivars, and four outgroup species (Supplementary Table 20). Macadamia integrifolia is distributed over ~250 km in lowland subtropical rainforest fragments of eastern Australia. To identify the origins of domestication, wild accessions were sourced primarily from three population clusters (C1–C3) north of Brisbane in Queensland. Evidence from previous genetic studies indicates that the Hawaiian cultivars originated from the northern range of M. integrifolia. All M. integrifolia individuals formed a clade distinct from M. tetraphylla and hybrids (Supplementary Fig. 16a). This was further supported by a principal component analysis (PCA) (Supplementary Fig. 16b), population structure, and linkage disequilibrium (LD) analysis (Supplementary Fig. 16c). Structure analysis identified two population clusters (K = 2) that clearly separate M. integrifolia and M. tetraphylla accessions. Three population clusters (K = 3) clearly distinguish M. integrifolia cultivars from wild individuals (Supplementary Fig. 17a, b). Wild M. integrifolia accessions were assigned to three main regional groups (C1–C3) at K = 4 (Supplementary Fig. 17a).
PCA was performed using 25 Hawaiian cultivars and 35 wild accessions to assess genetic relationships among Hawaiian cultivars and the three wild groups. These 60 accessions were classified into four geographic groups, C1–C3 and Hawaiian cultivars (Fig. 5 and Supplementary Table 22). LD decay shows that wild group C2 has the fastest decay rate, followed by Hawaiian cultivars, the C3 group and the C1 group (Fig. 5d). The fixation index (FST) was calculated among Hawaiian cultivars and the three wild groups. Genetic differentiation between Hawaiian cultivars and the C3 group (FST = 0.111) was the largest, between Hawaiian cultivars and the C2 group (FST = 0.095) the smallest, and between cultivars and the C1 group (FST = 0.109) intermediate (Fig. 5c). The nucleotide diversity of each group was estimated. Group C2 had the highest average nucleotide diversity (π) of 4.05 × 10⁻⁴, followed by Hawaiian cultivars (3.45 × 10⁻⁴), the C3 group (2.87 × 10⁻⁴), and the C1 group (2.76 × 10⁻⁴) (Fig. 5c). Nucleotide diversity of the Hawaiian cultivars was most similar to that of the C2 group. The most northerly wild M. integrifolia populations are located in the Mt Bauple (C1) and Gympie (C2) regions of southeast Queensland, which are separated by over 70 km (Fig. 5a). The chloroplast genome phylogeny (Supplementary Fig. 17b) is concordant with previous evidence that the maternal lineage of commercial cultivars developed in Hawaii originated in the Gympie region15. The nuclear SNP phylogeny, however, provides strong and conflicting support for a Mt Bauple (C1) origin of domestication (Supplementary Fig. 17c). The nuclear C1 clade also includes the Hawaiian cultivars, suggesting that the most recent common ancestor of the cultivars was from Mt Bauple. Individuals from each of the three sampled geographic regions (C1–C3) form deeply divergent monophyletic clades in both nuclear and chloroplast phylogenies.
One exception is a C1 individual with a C2 chloroplast haplotype (W04-MB04) and 50% admixed ancestry from the two regions (Supplementary Fig. 17a), suggesting introgression following seed dispersal from Gympie to Mt Bauple. Other C1 individuals are admixed with lower levels of C2 ancestry. Translocation of seed between regions was presumably human-mediated, given that small rodents, gravity and water are the proposed mechanisms for natural seed dispersal13. Phylogenetic network analysis in Treemix indicates that the Hawaiian cultivars are derived from the C1 lineage with C2 ancestry (Fig. 5b).
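For readers less familiar with the two summary statistics compared above, the following Python sketch shows how nucleotide diversity (π) and a pairwise FST can be computed from aligned haplotypes. It is a minimal illustration, not the pipeline used in this study: the estimator shown is a simple Hudson-style ratio (one minus within-group over between-group diversity), which may differ from the estimator behind the reported values, and the toy haplotypes and function names are hypothetical.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Number of sites at which two aligned haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nucleotide_diversity(haps):
    """pi: mean pairwise differences per site within one group."""
    pairs = list(combinations(haps, 2))
    n_sites = len(haps[0])
    return sum(pairwise_diff(a, b) for a, b in pairs) / (len(pairs) * n_sites)


def between_group_diversity(haps1, haps2):
    """Mean pairwise differences per site between two groups."""
    pairs = list(product(haps1, haps2))
    n_sites = len(haps1[0])
    return sum(pairwise_diff(a, b) for a, b in pairs) / (len(pairs) * n_sites)


def hudson_fst(haps1, haps2):
    """Hudson-style FST = 1 - (mean within-group pi) / (between-group pi)."""
    within = (nucleotide_diversity(haps1) + nucleotide_diversity(haps2)) / 2
    return 1 - within / between_group_diversity(haps1, haps2)


# Hypothetical toy haplotypes, NOT data from the re-sequenced accessions.
group_a = ["AAAA", "AAAT", "AATT"]
group_b = ["TTTT", "TTTA", "TTAA"]

pi_a = nucleotide_diversity(group_a)
fst_ab = hudson_fst(group_a, group_b)
```

In this framing, a low FST between two groups means that haplotypes drawn from different groups are hardly more different from each other than haplotypes drawn from the same group.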

a Gene-flow paths visualized on the map; the map was based on OpenStreetMap (Base map © OpenStreetMap, see https://www.openstreetmap.org/copyright). b Gene-drift in wild groups and varieties. c FST and π values of each group. d LD decay for three wild groups (C1–C3) and varieties of M. integrifolia. e PCA clustering of three wild groups (C1–C3) and varieties of M. integrifolia. f Effective population size (Ne) history estimated using ANGSD with g (generation time) = 8 years and m (neutral mutation rate per generation) = 4.175 × 10⁻⁹, and plotted with Stairway Plot software with 200 bootstrap iterations. Source data underlying d, f are provided as a Source Data file.
Signatures of selection at early stage of domestication
Historical effective population size (Ne) analyses indicated that M. integrifolia population size has remained stable over the past 100,000 years but underwent two ancient Ne declines. The most recent decline occurred ~200,000–100,000 years ago and an earlier decline occurred 1,700,000–1,100,000 years ago (Fig. 5f).
To screen for signatures of selection, π, FST, Tajima’s D, and XP-CLR were calculated across the genome of M. integrifolia. Low genetic diversity was detected by π and FST values of cultivars and wild accessions in a large portion of Chr2 and more than half of Chr5 (Fig. 6 and Supplementary Figs. 18–27), which corresponded to the regions with unusually high content of Gypsy retrotransposons (Fig. 1d, blue color), resulting in a low rate of or no recombination. The signals from Tajima’s D in these two regions are artifacts, caused by the very small amount of starting plant material selected from Mt. Bauple (C1), as no signals of selective sweeps were detected by XP-CLR in these regions.

The uppermost dot plot shows nucleotide diversity (π) values; red indicates varieties and the green line indicates the wild group. The second layer is the fixation index (FST) between the wild and cultivated macadamia accessions. The third layer shows Tajima’s D values; red indicates varieties and the green line indicates the wild group. The bottom is the genome-wide distribution of selective-sweep signals identified based on the cross-population composite likelihood ratio test (XP-CLR). TT12 TRANSPARENT TESTA 12, SDR short-chain dehydrogenase reductase, ANS anthocyanidin synthase.
Signals of artificial selection were detected in 126 blocks containing 284 protein-coding genes (Supplementary Fig. 28a). These had Tajima’s D values that were negative in cultivars and positive in wild accessions. These 284 genes represent 0.75% of the 37,723 genes available for selection.
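The sign of Tajima’s D used as a criterion above reflects the balance between mean pairwise differences and the number of segregating sites. A minimal Python sketch of the standard Tajima (1989) statistic is given below for illustration only; it is not the software used in this study, the function name is ours, and the toy haplotypes are hypothetical. An excess of rare variants gives a negative D, and an excess of intermediate-frequency variants gives a positive D.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length aligned haplotype strings."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if seg_sites == 0:
        return float("nan")  # D is undefined without polymorphism

    # k_hat: mean number of pairwise differences (not per site).
    pairs = list(combinations(haplotypes, 2))
    k_hat = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k_hat - seg_sites / a1) / sqrt(variance)


# Hypothetical toy alignments, NOT data from this study.
rare_variants = ["AAAA", "TAAA", "ATAA", "AATA"]       # singletons only
balanced_variants = ["AAAA", "AAAA", "TTTT", "TTTT"]   # two common haplotypes

d_rare = tajimas_d(rare_variants)          # negative
d_balanced = tajimas_d(balanced_variants)  # positive
```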
Functional analysis revealed that the 284 genes under selection were enriched in several biological processes, with major groups including response to stimulus, metabolic process, single-organism process, and cellular process (Supplementary Fig. 29). KEGG enrichment identified several pathways related to the biosynthesis or metabolism of secondary metabolites that are involved in the response to biotic or abiotic stress, including flavonoid biosynthesis, monoterpenoid biosynthesis, diterpenoid biosynthesis, biosynthesis of secondary metabolites, and stilbenoid, diarylheptanoid, and gingerol biosynthesis (Supplementary Fig. 30).
PAs are synthesized through the phenylpropanoid biosynthesis pathway and play an important role in seed development24. In particular, ANS and TT12 play important roles in PA biosynthesis. Analysis of the overlapping regions of XP-CLR, FST, and Tajima’s D between macadamia wild populations and cultivars provides evidence for selection of ANS and TT12 (Fig. 6 and Supplementary Fig. 28b). RNA-seq analysis showed that these genes were differentially expressed among the six tested tissues, with two ANS genes and TT12 highly expressed in shells (Supplementary Fig. 9). In addition, there were signatures of selection for Long-Chain Acyl-CoA Synthetase (LCAS), a gene known to affect storage oil synthesis and plant height26,27,28, and transcriptome analysis showed differential expression in different tissues (Supplementary Fig. 28).
Of the 284 selected genes, we identified three related to heat response (HSFB4, THF, and HSF3). RNA-seq analysis showed that these three genes were differentially expressed among the six tested tissues, with high expression of one gene in flower and leaf, one in the kernel, and one in the shell (Supplementary Fig. 28c).
Terminal runs of homozygosity29 are the hallmark of mitotic selection, so analyzing them is an effective genomic method to distinguish sexual recombination from the ‘one-step operation’ in the domestication of clonally propagated crops. We analyzed terminal runs of homozygosity using SNPs from single-copy genes on each chromosome (Supplementary Fig. 31); however, no significant extensive terminal runs of homozygosity were identified in macadamia cultivars (Supplementary Figs. 32–45).