Introduction

Biodiversity loss is of global concern, and is due in part to deforestation and high consumer demand for wood and wood products (Nellemann 2012; Elias 2012; van Zonneveld et al. 2018). Forests of Central and South America (or “neotropical” forests) face the largest threat because they support the most terrestrial biodiversity, with an estimated 16,000 tree species contained within the Amazon rainforest alone (Pennington et al. 2015; Pennington and Lavin 2016; Dick and Pennington 2019). Thirty-five to 72% of wood sourced in the Amazon is thought to be acquired from illegal logging (Saunders and Reeve 2014), and illegal logging accounts for 50–90% of forestry activities across tropical forests globally (Hoare 2015; Sheikh et al. 2019). Laws are in place to protect economically valuable tree species from overexploitation and promote sustainable practices (e.g., U.S. Lacey Act [2008]; European Union timber regulation [2010]; Australian Illegal Logging Prohibition Act [2012]; Japanese Clean Wood Act [2017]), but these remain difficult to enforce because of the sheer scale of illegal logging, and the challenge of identifying protected species and their countries of origin, especially after wood is transported from the site of harvest, processed, and enters commercial markets (Dormontt et al. 2015, 2020; Wiedenhoeft et al. 2019).

Illegal logging affects many tree species, but highly valuable—often rare and endangered—species are common targets. Spanish cedar (Cedrela odorata L.; Meliaceae) and congeners are among the most valuable neotropical hardwoods, making them particularly vulnerable to illegal harvesting. In 2001, C. odorata was listed under the protections of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) Appendix III requiring validated documentation of species identity and source for both export and import documentation, protecting populations in Bolivia, Brazil, Colombia, Guatemala, and Peru (Ferriss 2014), and in 2019, C. odorata and all species in the genus Cedrela were elevated to CITES Appendix II, due to the similarity of sawn logs and processed wood across species (Gasson 2011). However, CITES protection does not entirely eliminate illegal harvesting. An investigation focused on logging of Spanish cedar (C. odorata) and mahogany (Swietenia macrophylla King) in Peru found 112 CITES export records of illegal wood listing origins that had been fabricated (Urrunaga et al. 2012). Urrunaga et al. (2012) provided evidence that illegal logging is systematic, common, and environmentally and economically damaging in Peru. In 2017, a snapshot of the scale of illegal logging in Peru was provided by the same team of investigators; a cargo ship from Peru, the Yaca Kallpa, was seized and contained nearly 10,000 m3 of illegal wood (Conniff 2017; Bargent 2017). Despite the anticipated increase in global scrutiny for Cedrela wood and wood products regardless of geographic origin, tools for predicting the origin of C. odorata wood remain valuable to process seizures predating CITES Appendix II listing, and could be valuable for verifying supply chains and identifying responsible parties.

Forensic tools that can accurately predict the geographic origin of wood have potential to assist the enforcement of trade restrictions for protected species. Genetic approaches have been used to determine the origin and trade routes of protected species, including elephant ivory (Wasser et al. 2004, 2018), sturgeon and paddlefish caviar (Doukakis et al. 2012; Ogden et al. 2013), tigers (Linacre and Tobe 2008; Gupta et al. 2011; Kitpipit et al. 2012), birds (Abe et al. 2012; Coghlan et al. 2012; White et al. 2012), fishes (Zarraonaindia et al. 2012; Clemento et al. 2014), and plants (Ogden et al. 2008; Degen et al. 2013; Blanc-Jolivet et al. 2018; Paredes-Villanueva et al. 2019). Anatomical, chemical, genetic, and isotopic methods have all been applied to address questions of taxonomy and origin of wood (Dormontt et al. 2015); however, a single method does not typically address questions of both taxonomy and source. Cedrela odorata and closely allied species will likely require multiple techniques for validation of taxonomy and origin because its wood lacks the anatomical features required for discrimination among species (Gasson 2011), and variation in wood chemistry does not vary in a manner that is geographically predictive (Paredes-Villanueva et al. 2018). While methods for DNA extraction and recovery from wood are improving (Dumolin-Lapègue et al. 1999; Asif and Cannon 2005; Rachmayanti et al. 2006; Tnah et al. 2012; Jiao et al. 2012, 2018; Yu et al. 2017; Dormontt et al. 2020), genetic markers of short length, such as single nucleotide polymorphisms (SNPs), are increasingly being used in wildlife forensics because they are suitable for low concentration, degraded DNA extracts (Ogden et al. 2009). Here, we evaluate the power of SNPs to resolve geographic origin of C. odorata across much of its range in Central America and western South America.

Materials and methods

Development of the SNP genotyping array for this study first involved the sequencing of a design panel of C. odorata specimens. From SNP variants detected in a design panel of specimens, we selected 140 candidate SNPs for the genotyping array on the basis of geographic and environmental differentiation. DNAs from a screening panel of 376 specimens were genotyped with these 140 array SNPs, and we were able to assess genotyping efficiency for DNAs derived from eight Cedrela species, three tissue types, a range of mass inputs. We evaluated discrete and continuous spatial origin prediction methods on a group of specimens from the screening panel, representing C. odorata and closely allied taxa (referred to as C. odorata sensu lato). Methods for each of these steps are described below.

Design panel sequencing

We used hybridization-based target capture and massively-parallel sequencing (Cronn et al. 2012; Heyduk et al. 2016) to identify SNPs from a panel of 46 C. odorata herbarium specimens (Appendix 1; Table S1; Fig. S1) from Peru and surrounding countries. Hybridization capture probes were designed based on gene models of a C. odorata individual originating in Mexico as described in Finch et al. (2019). Sequencing yield and depth were assessed using methods previously described (Finch 2018; Finch et al. 2019) and included in the Supporting Information for this article (Appendix 1; Table S2) (Finch 2019a, b). One Peruvian specimen (C. odorata 300; Table S1) was selected as the nuclear reference, and captured sequence reads were assembled de novo using SPAdes (v. 3.6.1; Bankevich et al. 2012). These enriched nuclear contigs were filtered to ensure that assembled sequences contained the target probe sequences, and did not contain sequences with identity to the C. odorata chloroplast genome (Finch et al. 2019) or mitochondrial genes (Kuravadi et al. 2015) (Appendix 2; Table S3).

Sequence reads from the 46 specimens were aligned to the C. odorata 300 reference using BWA-MEM (v. 0.7.17; Li and Durbin 2010; Li 2013), and sequence alignments were used as inputs for probabilistic variant calling using SAMtools (v. 1.10; Li et al. 2009) and the Genome Analysis Toolkit (GATK) (v. 3.7; McKenna et al. 2010). Best-practice guidelines for GATK variant calling were used, including indel region realignment, and high-stringency variant filtering for coverage, mapping quality, and variant position. Initial SNP filtering was performed with VCFtools (v. 0.1.17; Danecek et al. 2011) to remove insertion/deletion variants, sites with greater than 85% missing data, sites with more than two alleles, and sites with a minor allele frequency (MAF) lower than 5%. By applying these filtering parameters, we sought to provide a set of ‘high confidence’ SNPs for further analysis and eventual candidate selection for inclusion on a genotyping array.

Candidate SNP selection and SNP assay development

We identified SNPs showing the highest differentiation in allele frequencies (FST) (Weir and Cockerham 1984) for groups based on latitude (LAT; decimal degrees), mean annual temperature (MAT; °C × 10), and annual precipitation (AP; mm). In this way, FST was used to measure the partitioning of allelic variance in alternative groupings, not to make specific statements or inferences of population genetic parameters (e.g., probability of identity by descent, panmixis, or migration). Specimens were divided into a northern and southern LAT group based on a gap in the sampling distribution at 7.5° S latitude (Fig. 1a). Specimens were grouped into low, moderate, or high MAT (low < 20 °C; 20° < moderate < 25 °C; high > 25 °C; Fig. 1b) and AP (low < 2000 mm; 2000 < moderate < 3000 mm; high > 3000 mm; Fig. 1c) categories based on their tercile rank for these climate variables (Table S1) at their geographic source based on the WorldClim 2 dataset (Fick and Hijmans 2017) using R (see Table S7 for R packages and citations; R. Core Team et al. 2013; Finch 2019a, b). These measures exhibit low pairwise correlations (LAT × AP, r2 = 0.29; LAT × MAT, r2 = 0.01; AP × MAT, r2 = 0.14; Fig. S2), so genetic associations with these gradients are likely to include SNPs that respond to different neutral and selective forces.

Fig. 1
figure 1

SNP candidate selection via identification of highly differentiated SNPs (high relative FST) among C. odorata specimens (n = 46) based on: a Latitudinal (LAT) groups, b Mean Annual Temperature (MAT) groups, and c Annual Precipitation (AP) groups. Bar plots show the distribution of values and the number of counts in d LAT, e MAT, and f AP groups. g The distribution of FST for 140 SNPs selected for genotyping array development (gray), superimposed with the individual distributions of FST for SNPs selected for LAT (green), MAT (gold), and AP (blue). h The proportion of 140 SNPs selected for genotyping array development that were differentiated based on LAT (green), MAT (gold), and AP (blue). Map labels show country codes: COL (Colombia), ECU (Ecuador), PER (Peru), BOL (Bolivia), and BRA (Brazil)

LAT, MAT, and AP groups were then treated as ‘populations’ for calculating FST on a per-marker basis with VCFtools resulting in three lists of SNPs associated with these groups. To develop a SNP assay based on these loci, we sorted each SNP-FST list in descending order, concatenated the 150 top FST SNPs from each group into a single list, and filtered redundant SNPs to one SNP per contig of reference sequence. If two or more non-identical SNPs appeared on the same contig, we selected the SNP position with the highest FST. We further filtered the list to retain SNPs with a MAF ≥ 20% to avoid SNPs that were nearly-fixed across much of the geographic range. This resulted in 152 high FST and high MAF SNPs, none of which showed evidence of linkage disequilibrium (r ≤ 0.3). Contig names and SNP positions were converted into BED format, and BEDtools (v. 2.25.0; Quinlan and Hall 2010) was used to extract a multi-fasta file containing positions ± 100 bp flanking each SNP. The multi-fasta was used in primer design and multiplexing for a MassARRAY iPLEX® assay (Agena Bioscience, Inc., San Diego, CA, USA), using the MassARRAY Assay Design Suite software (Agena Bioscience, Inc.). From this list of sequences, we identified 140 SNP loci that could be evaluated in 4 multiplex groups. Resulting mass spectra were scored using Typer Viewer (v. 4.0.24.17; Agena Bioscience). All steps of SNP analysis (assay design, oligonucleotide synthesis, amplification, mass spectrometry, and SNP calling) were performed by NeoGen Genomics Inc. (Lincoln, NE, USA).

SNP assay screening

Samples screened for our selected SNPs derived from three sources (Table S4; Fig. S3): (i) 234 herbarium specimens from the Missouri Botanical Garden Herbarium (MO), (ii) 36 field collections by collaborators at the Universidad Nacional Agraria La Molina (Lima, Peru), Universidad Cayetano Heredia (Lima, Peru), and Servicio Nacional Forestal y de Fauna Silvestre (SERFOR; Lima, Peru), and (iii) 86 field collections by collaborators at the Instituto de Investigaciones de la Amazonía Peruana (IIAP; Iquitos, Peru), von Thüenen Institute of Forest Genetics (VTI; Braunschweig, Germany), and Universidad Autónoma Gabriel René Moreno (Santa Cruz, Bolivia). DNA was extracted using a modified CTAB procedure (AG-Biotech LLC, Monterey, CA, USA) from herbarium-derived dry leaf tissue from MO (150–200 mg from fragment packets). Leaf- and cambium-derived DNA from La Molina/Cayetano/SERFOR were extracted from fresh tissue using a modified CTAB protocol (Healey et al. 2014). These samples yielded < 100 ng of total DNA, so we used whole-genome amplification (Genomiphi™ V2 Amplification Kit, GE Healthcare, Chicago, IL, USA) to amplify the DNA using manufacturer’s instructions. DNA from the remaining 86 samples was derived from fresh material dried in silica gel according to Dumolin et al. (1995) at VTI or IIAP. All samples were quantified via fluorometry (Qubit, ThermoFisher Scientific, Waltham, MA, USA).

MassARRAY SNP genotyping was performed on 376 samples representing 356 unique specimens and 20 technical replicates. Included in our assay were representatives of C. odorata (n = 196), C. fissilis Vell. (n = 62), C. angustifolia DC (n = 28), C. montana Mortiz ex Turcz. (n = 22), C. nebulosa T. D. Penn. & Daza (n = 17), C. saltensis M. A. Zapater & del Castillo (n = 10), C. longipetiolulata Harms (n = 2), C. weberbaueri Harms (n = 2), and Cedrela field-collected samples that were not identified to species (n = 17; Table S4; Fig. S3). Replicated DNAs from two C. odorata individuals (C. odorata 83, C. odorata 291; Table S4) were examined across a range of DNA mass inputs (50, 75, 100, 200, and 300 ng). Our target DNA mass was 300 ng (recommended by NeoGen Genomics Inc.), although some samples had as little as 50 ng. In total, 330 DNAs derived from leaf tissue (primarily herbarium specimens) and 26 derived from cambium.

Assay assessment

To evaluate the broader use of the SNP array, SNP call rates (the proportion of successful diploid SNP genotyping calls at 140 diploid loci; Table S4) were determined for all samples, species, tissue types, and input concentrations. For population comparisons involving replicated samples (C. odorata 83, C odorata 291), we chose replicates with the highest call rates (50 ng dilutions in both cases). Loci were removed from all analyses if genotyping call rates were < 0.75 across specimens to strike a balance between missing information and sample retention. R packages adegenet (Jombart and Ahmed 2011), poppr (Kamvar et al. 2014), and hierfstat (Goudet 2005) were used to calculate observed heterozygosity, MAF, and FST for each SNP. FST was calculated on a per-marker basis treating species as populations, and by comparing northern and southern LAT groups (described above) as populations for C. odorata based on herbarium labels or field identification.

The R package adegenet was used to evaluate genetic structure among screening panel specimens with Discriminant Analysis of Principal Components (DAPC) (Jombart and Ahmed 2011), an ordination method based on genetic distances. We used DAPC clusters to define a reference database of specimens representing C. odorata and closely allied species (referred to as C. odorata sensu lato; n = 190; Appendix 3).

Discrete spatial classification with random forests

We classified C. odorata into discrete regional groups based on SNPs using random forests, a classification method that provides a consensus classification based on ‘a forest’ of many classification trees (Breiman 2001). Since our screened specimens represented dispersed samples and not necessarily populations, we designed classification tests to determine the classification accuracy obtained with our specimens and the SNP array at three geographic scales: (i) ‘range-wide’ with categories “Central America” and “South America,” (ii) ‘target countries’ with categories “Ecuador,” “Peru,” and “Bolivia,” and (iii) ‘narrow regional,’ within our target countries, with categories “NW,” “NE,” “SW,” and “SE.” Narrow regional groups were selected to represent an area approximately the size of Peruvian departments, and to maintain approximately equal sample sizes within groups.

For this analysis, we used our reference database of C. odorata s. l. and loci showing a call rate ≥ 0.75 with genotypes coded as categorical variables ‘0’ (genotype homozygous for the reference allele), ‘1’ (heterozygous), or ‘2’ (homozygous for the alternate allele). Since random forests is not tolerant of missing data, we used the R package synbreed (Wimmer et al. 2012) to impute allelic data on a per-locus basis from sampled genotypes under the assumption of Hardy–Weinberg equilibrium.

We generated a random forest of 500 classification trees for each classification question allowing each tree to have as many branches as loci (mtry = number of loci minus 1). In each case, predictor variables for classification were the loci and region of origin was the grouping variable for each specimen. To avoid biasing misclassifications, sample sizes for each regional class in the grouping variable were held constant, with the sample size determined by the class with the smallest number of specimens (Table 1) (Sun et al. 2009). Since classification error varies in random forests due to random sampling (i.e., random starting specimen and random starting locus), we calculated the mean of the median error across 5000 random forests (2,500,000 total classification trees), and evaluated the range and distribution of errors. Observed classification error was compared to random expectations by randomizing the classes by the grouping variable. This method was used to understand the baseline random classification accuracy for random forest tests with different numbers of classes (Finch et al. 2017).

Table 1 Abbreviations used to identify each random forest classification model, descriptions of regional classes examined, the number of samples per class used to train the model (n), and results for each model. Note: n is limited by the sample size for the regional class with the fewest samples

Continuous spatial assignment with SPASIBA

Spatial classification methods have been developed to provide continuous estimates of origin based on an interpolated surface of allele frequencies. The R package SPASIBA (Guillot et al. 2016) uses a training set of georeferenced genotypes to predict the highest probability origin for test genotypes. SPASIBA estimates the spatial auto-covariance of allele frequencies assuming that covariance diminishes with increased geographic distance (i.e., isolation-by-distance). Allele frequencies for individuals geographically proximate to those included in the training set are estimated under the assumption of a population in Hardy–Weinberg equilibrium, and loci are assumed to be in linkage equilibrium. Predicted origins for “unknowns” can be estimated for areas where no training genotypes exist.

We used the same data in SPASIBA analysis as was used in random forest analysis above with the exception that we limited the geographic scope of our analysis to specimens from the target countries (Ecuador, Peru, Bolivia) and additional samples from Brazil and Bolivia below − 17.5° S latitude (n = 148). We used two cross-validation methods to assess the performance of SPASIBA. The first method was a k-fold cross-validation (k-fold CV); we modeled spatial auto-covariance of allele frequencies by randomly selecting 90% of C. odorata specimens (n = 133) as a training set, and used the remaining 10% (n = 15) as validation samples to determine the error of predicted origins. K-fold CV follows recommendations to assess model performance, avoid overfitting, and increase computational efficiency (Lever et al. 2016). In practical use, however, a legal inquiry involving wood would likely employ all reference specimens to identify the origin of confiscated material. To assess the predictive accuracy of SPASIBA via this framework, we performed a leave-one-out cross-validation (LOOCV) by predicting the origin of each specimen based on composite allele frequencies from all other available specimens (n = 147). By including both cross-validation techniques, we: (i) gain an understanding of the range of prediction errors for a single individual as a function of different training and validation sets, (ii) provide a conservative estimate of prediction error that is less prone to overfitting, and (iii) obtain the optimal predictive accuracy for each specimen in our dataset. For both cross-validation methods, we defined prediction error as the distance between the known and predicted origins. Spatial predictions were made in decimal degrees; these were converted to Haversine distances (in kilometers) with the R package geosphere (Hijmans 2016).

For the k-fold CV analysis, we tested the spatial resolution of predicted origins with data sets containing missing data and imputed data separately (imputation as described above) because SPASIBA is tolerant of missing data. Since the selection of training samples influences the accuracy of continuous geographic assignment (Guillot et al. 2016), we repeated the SPASIBA analysis 100 times with each data set (missing data; missing data imputed) to evaluate the distribution of prediction error. We compared the k-fold CV results to our random forest analysis by filtering the k-fold CV results to show prediction error for the specimens used for the target country and narrow regional analyses. As with random forests, observed classification error was compared to random expectations by randomizing genotypes of the geo-referenced training data set, and repeating predictions for the validation samples.

With the LOOCV analysis, we were interested in knowing whether the optimal predicted origin and assignment errors were related to sample density. To evaluate this, we computed the mean pairwise geographic distance to the ten nearest neighbors for each sample, and assessed the relationship between the mean ‘nearest-neighbor distance’ and prediction error for each sample (Appendix 4).

All maps were drawn using the base map shapefiles from the World Borders Dataset (https://thematicmapping.org/).

Results

SNP assay development

Target capture and short read sequencing of the design panel of C. odorata specimens (n = 46; Appendix 1) resulted in 4.4 × 108 paired sequence reads (9.0 × 1010 bp total) with a mean individual sequence yield of 9.6 × 106 paired reads (range: 6.3 × 105–4.7 × 107). The sequence yield for the ‘reduced representation’ nuclear reference for C. odorata 300 was 1.4 × 107 paired reads (2.8 × 109 bp; Table S2). The C. odorata 300 de novo assembly (Appendix 2) yielded 9,139 assembled contigs with a mean length of 982.5 bp (range 156–4,053 bp; sum of length = 9.0 × 106 bp; Table S3), and was used for read mapping and variant calling. On average, 3.6 × 106 reads mapped to each target reference contig (range 1.8 × 105–1.7 × 107 reads; Table S2) for an average depth of 53.7X per target (range 1.1X–443.6X; Table S2). Estimated mean individual depth ranged from 5.9X to 277.3X (Table S2). Our initial vcf contained 1.6 × 106 sequence variants before filtering to remove insertion/deletion variants (5.9% of total variants), multi-allelic variants (7.8%), SNPs with greater than 85% missing information (46.1%), and SNPs with a MAF less than 5% (31.1%). The resulting, filtered sequence matrix of C. odorata specimens from target countries (n = 46; Table S1; Fig. 1a–c) included 144,083 SNPs, and was used as the basis for evaluating allelic associations with geographic and climatic variation, and for developing a SNP assay for spatial assignment.

Figure 1 shows the geographic distribution of samples for each grouping (LAT, MAT, and AP) used to identify spatially informative SNPs via FST (Fig. 1a–c), as well as their predicted values (Fig. 1d–f). SNPs selected for AP showed the strongest allelic differentiation relative to SNPs selected for LAT and MAT, with a mean FST of 0.42 (interquartile range: 0.41—0.45; Fig. 1g). SNPs based on MAT showed a similar median FST of 0.44 (interquartile range 0.43–0.46; Fig. 1g) with a higher mean FST (0.46) and a higher maximum FST (0.62). Surprisingly, LAT SNPs showed lower differentiation, with a median FST of 0.23 (interquartile range 0.22–0.26 Fig. 1g). Our reduced SNP assay included 61 SNPs from the AP list, 53 SNPs from the MAT list, and 26 SNPs from the LAT list (Fig. 1h). These SNPs were converted to an Agena MassARRAY assay, and together array SNPs from the LAT, MAT, and AP lists showed a mean FST of 0.41 and per-SNP FST values ranged from 0.21 to 0.62.

SNP assay results

Across all samples and SNPs screened by MassARRAY, we observed 22.3% missing data (23,708 uncalled alleles out of 106,400 potential alleles), with specimens showing call rates (CR) of 0.00 to 0.96 across 140 SNPs. Three factors influenced CR: (i) DNA concentration, as input DNA mass above 50 ng resulted in decreasing CR (Fig. S4); (ii) the use of herbarium leaves, which showed a lower CR on average than other sources (Fig. S5); and (iii) the inclusion of multiple species, as DNAs derived from C. angustifolia and C. montana showing substantially lower mean CR than other species (CR = 0.62 and 0.71, respectively; Fig. S6). We discarded loci showing CR’s < 0.75 (i.e., 25 additional loci) to mitigate the impact of missing data. This yielded a Cedrela dataset with 352 individuals (four individuals discarded as failures), 99 loci, and 1.32% missing data. This dataset showed a mean observed heterozygosity of 0.05 (range 0–0.37) and a mean MAF of 0.26 (range 0.02–1). Treating species as populations, we calculated a median FST of 0.20 (range 0.01–0.56). Filtering for only C. odorata from South America (defined by herbarium labels and field identifications) produced a dataset with 135 specimens that showed a mean call rate of 0.81 (range 0.1–0.95). This C. odorata dataset included 99 loci and 1.01% missing data, and yielded a mean observed heterozygosity of 0.11 (range 0–0.44), a mean MAF of 0.34 (range 0.04–1), and a per-SNP median FST of 0.01 (range − 0.02 to 0.25) for geographic groups similar to those shown in Fig. 1a.

DAPC identified nine clusters (Fig. S7a). Based on label and field identification, a reference database for C. odorata sensu lato was defined by genetic information with ‘exclusive’ C. odorata (DAPC clusters 1, 2, 4, 7) and closely allied taxa (5, 8; Fig. S7b; Finch 2019b; Finch et al. in preparation), and we used these specimens to test discrete and continuous spatial assignment methods (Appendix 3; Fig. S7). Clusters 9 and 6 were excluded because they were largely composed of specimens identified as C. angustifolia and C. montana or C. fissilis, respectively (Appendix 3; Fig. S7b). Cluster 3 was also excluded because it showed the lowest mean CR (0.34 compared to CR > 0.7 for all other clusters). The resulting dataset included 190 C. odorata s. l. from Central and South America used for random forest classification, and 148 C. odorata s. l. for SPASIBA analyses focused on South America. After defining the C. odorata s. l. dataset and removing individuals with high levels of missing information, we were able to retain a larger number of loci for random forest classification (116 SNPs) and SPASIBA continuous assignment (118 SNPs).

Random forest classification

Range-wide

This analysis included 190 C. odorata s. l. specimens from Central and South America and 116 SNP loci. Each individual was assigned to one of two regional classes (Table 1; Fig. 2a). The estimated mean of the median classification error of 5.8% for observed data was much lower than the estimated mean of the median classification error from 5,000 randomizations (51.1%; Table 1, Fig. 2g).

Fig. 2
figure 2

Summary of random forest classification analyses and resulting geographic classification errors. Maps show the geographic range of specimens categorized into regional classes for: a range-wide, b target country, and c narrow regional classification. Boxplots show the classification error (%) estimates by class for: d range-wide, e target country, and f narrow regional classes. Density plots show the distribution of classification error estimated for: g range-wide, h target country, and i narrow regional classification. Light gray distributions indicate error for observed genotypes, and dark gray distributions indicate error for genotypes after randomizing regional class identifiers in the grouping variable. Maps labels show country codes: NIC (Nicaragua), CR (Costa Rica), PAN (Panama), COL (Colombia), VEN (Venezuela), ECU (Ecuador), PER (Peru), BOL (Bolivia), and BRA (Brazil)

Target countries

This analysis included 141 C. odorata s. l. specimens from three countries (Fig. 2b) and 116 SNP loci. The estimated mean classification error of 34.3% for observed data was lower than the estimated mean classification error from 5,000 randomizations (68.4%; Table 1, Fig. 2h).

Narrow regional

This analysis included 129 C. odorata s. l. specimens from four narrow regions approximately 3.8 × 105 km2 in size (Fig. 2c) and 116 SNP loci. The estimated mean classification error of 34.7% was lower than the mean classification error of 76.9% estimated from 5000 randomizations (Table 1; Fig. 2i).

Classification errors for the range-wide comparison were almost equally distributed between groups, with mean errors of 4.42% for samples from Central America and 7.05% for samples from South America (Fig. 2d). Finer-scale classification tests showed classification asymmetry, where spatial assignment error was not equal across classes. For example, classification errors for the target country comparison showed that specimens from Ecuador had lower classification error (30.7%) than either Bolivia or Peru (35.1% and 37.0%, respectively; Fig. 2e). Similarly, specimens from the NE class had a substantially lower classification error (21.3%), than specimens from the NW, SW or SE classes (39.5%, 39.1%, and 39.0% respectively; Fig. 2f).

SPASIBA assignment

We investigated the accuracy of continuous assignments 148 C. odorata s. l. specimens with 118 SNPs, SPASIBA, and two cross-validation methods (sample distribution shown in Fig. S9). The median deviation between the known sampling location and predicted origin for the k-fold CV was 259.6 km (25%ile = 96.1 km; 75%ile = 820.3 km; Table 2; Fig. 3a) with a maximum error of 2540.8 km. Median k-fold CV prediction error with observed data was lower than the estimated median error from randomized genotypes (median 904.1 km; 25%ile = 494.6 km; 75%ile = 1408.7 km; maximum = 3,033.8 km; Table 2). We evaluated the estimation error in the imputed dataset used for random forest analysis to determine whether imputation of missing data influences prediction errors, and found that imputation had little influence on prediction error (imputed error = 268.4 km; unimputed error = 252.5 km; Table S5; Fig. S8).

Table 2 SPASIBA continuous prediction errors
Fig. 3
figure 3

Distributions of prediction errors (in km) for C. odorata s. l. specimens used for continuous spatial assignment with SPASIBA: a prediction error for South American specimens across 200 k-fold CV replicates (imputed and unimputed combined) with observed genotypes (light gray) and 148 LOOCV replicates with observed unimputed data (blue), b prediction error for specimens from target country groups, and c prediction error for specimens from narrow regional groups. Red dashed lines indicate the estimated mean prediction error for randomized genotypes (Table 2)

The k-fold CV results also showed that specimens from the Peru (Fig. 2b) showed a lower median prediction error (169.8 km) than Ecuador (598.0 km) and Bolivia (341.4 km; Table 2; Fig. 3b), a pattern that appears different from random forest classification results for countries (Fig. 2e). Similarly, specimens from the NE narrow region (Fig. 2c) showed a lower median prediction error (133.3 km) than the other narrow regional groups (NW = 598.1 km; SW = 135.0 km; SE = 424.0 km; Table 2; Fig. 3b); this pattern appears similar to our findings from random forest classification of these regional classes (Fig. 2f).

The median prediction error was lower under the LOOCV framework than the k-fold CV (188.7 km; 25%ile = 92.2 km; 75%ile = 754.3 km; Table 2; Fig. 3a). This was a 27.3% decrease in prediction error compared to the median k-fold CV estimate. Despite this improvement, we found no relationship between sample density and prediction error (Appendix 4).

Discussion

The neotropical tree species Cedrela odorata is a target of illegal logging, and has been heavily used for its timber at least since the description of the genus Cedrela in 1756 (Browne 1756; Pennington and Muellner 2010). Illegal logging of C. odorata typically involves fabrication of source on export documentation (Urrunaga et al. 2012), and since 2001, C. odorata has been protected in at least one country under CITES Appendix III (Ferriss 2014), increasing the importance of technologies that predict the origin of C. odorata wood. Country-of-origin declarations on import documents for wood can be difficult for wood importers to verify, and it is even more challenging for customs and border patrol agents to corroborate, making independent assessment methods a high-priority tool to aid the legal evaluation of wood products. Although, C. odorata and all Cedrela species have been elevated to CITES Appendix II, geographic localization of C. odorata wood remains relevant for seizures predating CITES Appendix II listing. Geographic localization is also essential to discovering the networks responsible for illegal logging, as has been shown for animal poaching (Wasser et al. 2018). We demonstrated that SNPs have the power to at least partially resolve the geographic origin of C. odorata across much of its range in Central America and western South America, and we present results from discrete classification of geographic origin with random forests (Breiman 2001) and continuous spatial prediction with SPASIBA (Guillot et al. 2016).

Using SNPs to predict the geographic region origin for C. odorata via discrete classification

A number of methods have been used for discrete assignment of genotypes to geographic groups (Manel et al. 2005; Ogden and Linacre 2015). Some of these methods use criteria or assumptions that are explicitly ‘genetic’ (e.g., Hardy–Weinberg and linkage equilibrium) (Rannala and Mountain 1997; Pritchard et al. 2000; Piry et al. 2004), but others are agnostic with regard to the input data type or the process(es) underlying the input data (Chen et al. 2017; Schrider and Kern 2018). ‘Model-free’ methods like random forests can use high-dimensional genetic data and non-genetic data to produce predictive functions that are robust to any type of data and distribution (e.g., non-normal distributions, zero-truncated, continuous, or categorical data), allowing genetic data to be combined with other information that provides independent evidence of geographic origin such as specifically stable isotope profiles (Kagawa and Leavitt 2010; Gori et al. 2015, 2018). This is especially important in cases involving plantation grown timber, where genotypic data may correctly identify the ancestral geographic origin, but not the growing location for a specific tree (e.g., plantation-grown C. odorata from Africa). This flexibility has led to the adoption of random forest methods for multiple applications in ecology and evolution (Boulesteix et al. 2012; Brieuc et al. 2018), genomics and genetic association analysis (Goldstein et al. 2011; Stephan et al. 2015), and population assignment based on genetic variation (Bertolini et al. 2015; Chen et al. 2017; Sylvester et al. 2018) and parasite community (Perdiguero-Alonso et al. 2008; Pérez-Del-Olmo et al. 2010). Random forests have been less frequently used for spatial classification, with current published cases based on reflection and chemical spectra rather than genotypes (Li et al. 2012; Finch et al. 2017).

In our specific analysis, we determined that random forest classification based on SNP genotypes can predict whether C. odorata s. l. specimens originated in Central or South America with 5.8% classification error (Table 1). This method offers high discrimination accuracy for broad-scale geographic source validation, and could serve as a ‘first-pass’ test for questions related to provenance on trade documentation. We found that random forest classification was less precise for identifying finer-scale questions, such ‘country-of-origin’ or ‘department-sized regions-of-origin’ within a country (34.3% error and 34.7% error, respectively). We suspect that within-class sample size was at least partly responsible for the relatively high error estimations at this scale, since both of these analyses (target country and narrow regional) indicated that some geographic signal was available for classification with our SNP assay (Table 1). It is important to note that while our method did not show high precision for identifying the country-of-origin, the four South American countries that listed C. odorata as CITES Appendix III (Bolivia, Brazil, Columbia, Peru) account for > 63% of the land area of the continent. With denser sampling across northern and eastern South America, it should be possible to test this classification method using CITES III protection status (protected versus non-protected) as the classifier, as exports from all of these countries are highly restricted.

Using SNPs to predict the geographic origin of C. odorata by continuous assignment

Methods have also been developed for estimating the origin of genotypes using continuous assignment (Wasser et al. 2004; Yang et al. 2012; Rañola et al. 2014; Guillot et al. 2016). These methods assume Hardy–Weinberg and linkage equilibrium and only use genetic data, but these limitations are balanced by the potential power of providing a precise geospatial source for a sample, rather than a categorical assignment (Degen et al. 2017; Chen et al. 2017). Continuous assignment with methods like SPASIBA are particularly relevant for questions involving wood legality because harvest locations are frequently not included in genetic reference populations, especially for geographically widespread species.

Our median prediction error for continuous assignment of C. odorata in South America was ~ 189 km via the LOOCV method, and ~ 260 km via the more conservative k-fold CV method (Table 2). These error estimates show promise for country-of-origin predictions, but may be less helpful for smaller countries and areas near international borders. Assignment errors ranged from 170 to 598 km for target countries (with Peru showing the lowest mean error; Fig. 3b), and from 133 to 598 km for department-sized geographic regions in our study area (with NE and SW regions in Peru showing the lowest errors; Fig. 3c). These errors – while large – are comparable to the continuous geographic assignment errors from other organisms based on similarly-sized datasets. For example, the mean error for the placement of humans based on 100 SNPs was ~ 430 km (Rañola et al. 2014), the 75% placement error of Arabidopsis based on 100 SNPs was 375 km (Guillot et al. 2016), and the median error for the geographic assignment of elephants based on 16 microsatellites ranged from 267 km (savannah) to 301 km (forest populations) (Wasser et al. 2015).

Despite the practical advantages of SPASIBA for providing continuous origin prediction, a potential disadvantage of the method lies in the assumption that spatial auto-covariance of allele frequencies diminishes with geographic distance. The assumptions of this simple gradient function may introduce error if allele frequency surfaces are irregular or lack a dominant cline. In our study, we also stratified samples by precipitation, temperature, and latitude (Fig. 1) to identify genes that might be responsive to different climate factors over small geographic distances, as might be the case due to heterogeneous elevation gradients imposed by the Andes Mountains. In doing so, SNPs selected for this assay appear biased towards genes showing stronger differentiation by climatic gradients than to spatial gradients of geographic distance; we observed that pairwise genetic distance was more strongly associated with pairwise MAT and AP distances than pairwise geographic distance (genetic distance ~ MAT distance Mantel r = 0.42; genetic distance ~ AP distance Mantel r = 0.26; genetic distance ~ geographic distance Mantel r = 0.08; Appendix 5) (Goslee and Urban 2007). This bias may have reduced the accuracy of our SPASIBA predictions, either by violating assumptions of simple gradients, or by selecting genes that show weaker correlations with geographic distances than they do to climatic distances. We will explore solutions to this in the future, by examining additional loci that show higher correlations to geography than climate, and testing continuous assignment methods that relax the assumption of isolation-by-distance allele frequency gradients (Rañola et al. 2014; Battey et al. 2019).

Recommendations for improving SNP-based geographic predictions for Cedrela

The accuracy of assignment can be dramatically influenced by the pattern of geographic sampling and density of genome coverage, especially for continuous assignment methods. Although we did not observe a relationship between sample density and prediction error (Appendix 4), continuous methods have been shown to provide highest accuracies when training datasets include individuals from the same genetic background as test individuals (Guillot et al. 2016). Additionally, the impact of the size of the genetic database on assignment accuracy can also be substantial. For example, two independent analyses have shown that increasing genomic density from 100 to 1,000 SNPs leads to significant reductions in prediction error (Rañola et al. 2014; Guillot et al. 2016). Our foundation dataset of 144,083 SNPs for C. odorata offers a rich resource that can be used to further refine SNP assays for geographic assignment. In this context, we note that additional SNPs (~ 350) are currently being evaluated for spatial assignment of C. odorata and C. fissilis, and this includes different nuclear and organelle markers from different source populations (Blanc-Jolivet et al., unpublished; Paredes-Villanueva et al. 2019). Joint analysis of these two marker sets using common samples should show whether simply doubling the number of genotypes and SNPs offers significant improvements in geographic assignment accuracy. Finally, different Cedrela species can show different allele frequencies across loci, and this may distort allele frequency surfaces used in spatial assignment and lead to less accurate geographic predictions for C. odorata. In this regard, we recommend exploration of genetic structure across reference specimens with DAPC (Appendix 3) or a similar method. In our reference database for C. odorata, we identified examples of specimens that were taxonomically misidentified, and this is common in natural history collections of tropical plants (Goodwin et al. 2015). Additional “C. odorata” reference specimens should be assessed for taxonomic cohesiveness with C. odorata before usage with evidence.

Conclusions

We identified greater than 100,000 SNPs that can be used to develop and refine assays for geographic localization of C. odorata wood specimens. From this database, we designed and tested a 140 SNP assay to predict the geographic origin of C. odorata, and we evaluated discrete (random forest) and continuous (SPASIBA) prediction methods. These methods make different assumptions with regard to the Hardy–Weinberg and linkage equilibrium; as such, they may show different performance depending on the degree to which SNPs track selective (environmental) gradients or deviate from genetic assumptions. Although the observed error estimates from our geographic predictions are too large for fine-scale geographic assignment, the assay shows high accuracy for determining the continent of origin and promise for country-level verification of specimens. This assay provides a tangible first step for determining the origin and legality of C. odorata wood, and these SNP resources and methods should provide the wood products industry with new (and developing) tools to improve the legality of C. odorata and closely allied species in wood trade.