Background & Summary

Species traits are essential for comparing ecological strategies among plants, both within any given vegetation and across environmental space or evolutionary lineages1,2,3,4. Broadly, a trait is any measurable property of a plant capturing aspects of its structure or function5,6,7,8. Traits thereby provide useful indicators of species’ behaviours in communities and ecosystems, regardless of their taxonomy8,9,10. Through global initiatives the volume of available trait information for plants has grown rapidly in the last two decades11,12. However, the geographic coverage of trait measurements across the globe is patchy, limiting detailed analyses of trait variation and diversity in some regions, and, more generally, development of theory accounting for the diversity of plant strategies.

One such region where trait data is sparsely documented is Australia; a continent with a flora of c. 28,900 native vascular plant taxa13 (including species, subspecies, varietas and forma). While significant investment has been made in curating and digitising herbarium collections and observation records in Australia over the last two decades (e.g. The Australian Virtual Herbarium houses ~7 million specimen occurrence records; https://avh.ala.org.au), no complementary resource yet exists for consolidating information on plant traits. Moreover, relatively few Australian species are represented in the leading global databases. For example, the international TRY database12 has measurements for only 3830 Australian species across all collated traits. This level of species coverage limits our ability to use traits to understand and ultimately manage Australian vegetation14. While initiatives such as TRY12 and the Open Traits Network15 are working towards global synthesis of trait data, a stronger representation of Australian plant taxa in these efforts is essential, especially given the high richness and endemicity of this continental flora, and the unique contribution this makes to global floral diversity16,17.

Here we introduce the AusTraits database (hereafter AusTraits), a compilation of plant traits for the Australian flora. Currently, AusTraits draws together 283 distinct sources and contains 997,808 measurements spread across 448 different traits for 28,640 taxa. To assemble AusTraits from diverse primary sources and make data available for reuse, we needed to overcome three main types of challenge (Fig. 1): (1) Accessing data from diverse original sources, including field studies, online databases, scientific articles, and published taxonomic floras; (2) Harmonising these diverse sources into a federated resource, with common taxon names, units, trait names, and data formats; and (3) Distributing versions of the data under a suitable license. To meet these challenges, we developed a workflow which draws on emerging community standards and our collective experience building trait databases.

Fig. 1

The data curation pathway used to assemble the AusTraits database. Trait measurements are accessed from original data sources, including published floras and field campaigns. Features such as variable names, units and taxonomy are harmonised to a common standard. Versioned releases are distributed to users, allowing the dataset to be used and re-used in a reproducible way.

By providing a harmonised and curated dataset on 448 plant traits, AusTraits contributes substantially to filling the gap in Australian and global biodiversity resources. Prior to the development of AusTraits, data on Australian plant traits existed largely as a series of disconnected datasets collected by individual laboratories or initiatives.

AusTraits has been developed as a standalone database, rather than as part of the existing global database TRY12, for three reasons. First, we sought to establish an engaged and localised community, actively collaborating to enhance coverage of plant trait data within Australia. We envisioned that a community would form more readily to fill gaps in national knowledge of traits with local ownership of the resource. While we will never have a counterfactual, a vibrant community excited to be part of this initiative has indeed been established, and coverage for Australian species is much higher than TRY has achieved since its inception. Local ownership also aligns well with funding opportunities and national research priorities, and enables database coordinators to progress at their own speed. Second, we wanted to apply an entirely open-source approach to the aggregation workflow. All the code and raw files used to create the compiled database are available, and the database itself is freely available via a third-party data repository (Zenodo), which is built for long-term data archiving and has an established API. Finally, we targeted primary data sources, where possible, whereas TRY accepts aggregated datasets. The hope was that this would increase data quality, by removing intermediaries and making duplicates easier to identify.

While independent, the overall structure of AusTraits is similar to that of TRY, ensuring the two databases will be interoperable. Both databases are founded on similar principles and terminology18,19. Increasingly, researchers and biodiversity portals are seeking to connect diverse datasets15, which is possible if they share a common foundation.

We envision AusTraits as an on-going collaborative initiative for easily archiving and sharing trait data about the Australian flora. Open access to a comprehensive resource like this will generate significant new knowledge about the Australian flora across multiple scales of interest, as well as reduce duplication of effort in the compilation of plant trait data, particularly for research students and government agencies seeking to access information on traits. In coming years, AusTraits will continue to be expanded, with integrations into other biodiversity platforms and expansion of coverage into historically neglected plant lineages in trait science, such as pteridophytes (lycophytes and ferns). Further, through international initiatives, such as the Open Traits Network, linkages are being forged between plant datasets and a variety of other organismal databases15.

Methods

Primary sources

AusTraits version 3.0.2 was assembled from 283 distinct sources, including published papers, field measurements, glasshouse and field experiments, botanical collections, and taxonomic treatments. Initially we identified a list of candidate traits of interest, then identified primary sources containing measurements for these traits, before contacting authors for access. As the compilation grew, we expanded the list of traits considered to include any measurable quantity that had been quantified for at least a moderate number of taxa (n > 20).

For a small subset of sources from herbaria, providing a text description of taxa, we used regular expressions in R to extract measurements of traits from the text. A variety of expressions were developed to extract height, leaf/seed dimensions and growth form. Error checking was completed on approximately 60% of mined measurements by visually inspecting the extracted values relative to the textual descriptions.
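As an illustration of this kind of text mining, the following is a minimal sketch of a single expression for plant height. The AusTraits expressions were written in R and are not reproduced here; the pattern, the function name `extract_height` and the example sentences are hypothetical, and Python is used purely for illustration.

```python
import re

# Hypothetical pattern for height statements such as "0.5-2 m high" or
# "to 30 m tall". The lower bound is optional; units are mm, cm or m.
HEIGHT_PATTERN = re.compile(
    r"(?:(?P<low>\d+(?:\.\d+)?)\s*[-–]\s*)?"   # optional lower bound and dash
    r"(?P<high>\d+(?:\.\d+)?)\s*"              # upper (or only) value
    r"(?P<unit>mm|cm|m)\s+(?:high|tall)\b",    # unit followed by a height keyword
    re.IGNORECASE,
)

# Conversion factors from the reported unit to metres.
TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0}


def extract_height(description):
    """Return (min_m, max_m) in metres, or None when no height is stated.

    min_m is None when the description gives only a single (maximum) value.
    """
    match = HEIGHT_PATTERN.search(description)
    if match is None:
        return None
    factor = TO_METRES[match.group("unit").lower()]
    high = float(match.group("high")) * factor
    low = match.group("low")
    low = float(low) * factor if low is not None else None
    return (low, high)


# Invented example sentences in the style of a flora description.
print(extract_height("Shrub 0.5-2 m high; leaves 3-5 cm long."))
print(extract_height("Tree to 30 m tall with rough bark."))
print(extract_height("Leaves 3-5 cm long, glabrous."))
```

Because the pattern requires a height keyword after the unit, leaf dimensions such as "3-5 cm long" are not mistaken for plant height. Visual checks against the source text, as described above, remain necessary because descriptions vary widely in phrasing.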

Trait definitions

A full list of traits and their sources appears in Supplementary Table 1 (refs 20–354). The list of sources in AusTraits was developed gradually as new datasets were incorporated, drawing from original source publications and a published thesaurus of plant characteristics19. We categorised traits based on the tissue where they are measured (bark, leaf, reproductive, root, stem, whole plant) and the type of measurement (allocation, life history, morphology, nutrient, physiological). Version 3.0.2 of AusTraits includes 358 numeric and 90 categorical traits.

Database structure

The schema of AusTraits broadly follows the principles of the established Observation and Measurement Ontology18 in that, where available, trait data are connected to contextual information about the collection (e.g. location coordinates, light levels, whether data were collected in the field or lab) and information about the methods used to derive measurements (e.g. number of replicates, equipment used). The database contains 11 elements, as described in Table 1. This format was developed to include information about the trait measurements, taxon, methods, sites, contextual information, people involved, and citation sources.

Table 1 Main elements of the harmonised AusTraits database. See Tables 2–8 for details on each component.

For storage efficiency, the main table of traits contains relatively little information (Table 2), but can be cross-linked against other tables (Tables 3–8) using identifiers for dataset, site, context, observation, and taxon (Table 1). The dataset_id is ordinarily the surname of the first author and year of publication associated with the source’s primary citation (e.g. Blackman_2014). Trait values were also recorded as being one of several possible value types (value_type) (Table 9), reflecting the type of measurement submitted by the contributor, as different sources provide different levels of detail. Possible values include raw_value, individual_mean, site_mean, multisite_mean, expert_mean, experiment_mean. Further details on the methods used for collecting each trait are provided in a methods table (Table 5).
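The cross-linking described above can be sketched as a join on shared identifiers. In the sketch below the column names (`site_name`, `site_property`, `value`), the taxon names and all values are assumptions made for illustration, not the exact AusTraits schema; only `dataset_id`, `value_type` and the example identifier Blackman_2014 come from the text. Python is used purely for illustration.

```python
from collections import defaultdict

# Toy rows standing in for the traits table; taxa and values are invented.
traits = [
    {"dataset_id": "Blackman_2014", "taxon_name": "Examplea alba",
     "site_name": "site_A", "trait_name": "leaf_length", "value": "95",
     "unit": "mm", "value_type": "site_mean"},
    {"dataset_id": "Blackman_2014", "taxon_name": "Examplea nigra",
     "site_name": None, "trait_name": "plant_height", "value": "12",
     "unit": "m", "value_type": "expert_mean"},
]

# Toy rows standing in for the sites table (one row per site property).
sites = [
    {"dataset_id": "Blackman_2014", "site_name": "site_A",
     "site_property": "latitude", "value": "-42.9"},
    {"dataset_id": "Blackman_2014", "site_name": "site_A",
     "site_property": "longitude", "value": "147.3"},
]


def index_sites(site_rows):
    """Group site properties under a (dataset_id, site_name) key."""
    index = defaultdict(dict)
    for row in site_rows:
        key = (row["dataset_id"], row["site_name"])
        index[key][row["site_property"]] = row["value"]
    return dict(index)


def attach_sites(trait_rows, site_index):
    """Left-join site properties onto each trait record."""
    joined = []
    for row in trait_rows:
        key = (row["dataset_id"], row["site_name"])
        joined.append({**row, "site": site_index.get(key, {})})
    return joined


joined = attach_sites(traits, index_sites(sites))
for record in joined:
    print(record["taxon_name"], record["trait_name"], record["site"])
```

Records without a site, such as values taken from a flora, keep an empty site entry rather than being dropped, which mirrors a left join on the traits table.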

Table 2 Structure of the traits table, containing measurements of plant traits.
Table 3 Structure of the sites table, containing observations of site characteristics associated with information in traits.
Table 4 Structure of the contexts table, containing observations of contextual characteristics associated with information in traits.
Table 5 Structure of the methods table, containing details on methods with which data were collected, including time frame and source.
Table 6 Structure of the taxonomic_updates table, of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing against the APC (Australian Plant Census) and APNI (Australian Plant Name Index).
Table 7 Structure of the taxa table, containing details on taxa associated with information in the traits table. This information has been sourced from the APC (Australian Plant Census) and APNI (Australian Plant Name Index) and is released under a CC-BY3 license.
Table 8 Structure of the contributors table, of people contributing to each study.
Table 9 Possible value types of trait records.

Harmonisation

To harmonise each source into the common AusTraits format we applied a reproducible and transparent workflow (Fig. 1), written in R355, using custom code, and the packages tidyverse356, yaml357, remake358, knitr359, and rmarkdown360. In this workflow, we performed a series of operations, including reformatting data into a standardised format, generating observation ids for each set of linked measurements, transforming variable names into common terms, transforming data into common units, standardising terms (trait values) for categorical variables, encoding suitable metadata, and flagging data that did not pass quality checks. Details from each primary source were saved with minimal modification into two plain text files. The first file, data.csv, contains the actual trait data in comma-separated values format. The second file, metadata.yml, contains relevant metadata for the study, as well as options for mapping trait names and units onto standard types, and any substitutions applied to the data in processing. These two files provide all the information needed to compile each study into a standardised AusTraits format. Successive versions of AusTraits iterate through the steps in Fig. 1, to incorporate new data and correct identified errors, leading to a high-quality, harmonised dataset.
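To make the role of the two files concrete, the following is a minimal sketch of how a metadata mapping could drive the conversion of a wide data table into standardised long-format records. It is not the AusTraits implementation, which is written in R: the metadata keys (`taxon_column`, `var_in`, `unit_in`), the standard units, the dataset identifier and the data values are all hypothetical, and a Python dictionary stands in for metadata.yml.

```python
import csv
import io

# Stand-in for data.csv: a wide table as a contributor might supply it.
DATA_CSV = (
    "species,LL (cm),Ht\n"
    "Examplea alba,9.5,1200\n"
    "Examplea nigra,4.0,350\n"
)

# Stand-in for metadata.yml; the keys and layout are hypothetical.
METADATA = {
    "dataset_id": "Example_2020",
    "taxon_column": "species",
    "traits": [
        {"var_in": "LL (cm)", "trait_name": "leaf_length", "unit_in": "cm"},
        {"var_in": "Ht", "trait_name": "plant_height", "unit_in": "cm"},
    ],
}

# Assumed standard unit per trait, and factors to convert into it.
STANDARD_UNITS = {"leaf_length": "mm", "plant_height": "m"}
CONVERSIONS = {("cm", "mm"): 10.0, ("cm", "m"): 0.01,
               ("mm", "mm"): 1.0, ("m", "m"): 1.0}


def harmonise(csv_text, metadata):
    """Reshape a wide table into long records with standard names and units."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        for mapping in metadata["traits"]:
            unit_out = STANDARD_UNITS[mapping["trait_name"]]
            factor = CONVERSIONS[(mapping["unit_in"], unit_out)]
            rows.append({
                "dataset_id": metadata["dataset_id"],
                "taxon_name": record[metadata["taxon_column"]],
                "trait_name": mapping["trait_name"],
                "value": float(record[mapping["var_in"]]) * factor,
                "unit": unit_out,
            })
    return rows


harmonised = harmonise(DATA_CSV, METADATA)
for row in harmonised:
    print(row)
```

Keeping the raw table untouched and recording every mapping in metadata means the compiled record can always be regenerated from the two source files, which is the property the workflow relies on.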

After importing a study, we generated a detailed report which summarised the study’s metadata and compared the study’s data values to those collected by other studies for the same traits. Data for continuous and categorical variables are presented in scatter plots and tables respectively. These reports allow first the AusTraits data curator, and then the data contributor, to rapidly scan the metadata to confirm it has been entered correctly, and to check that the trait data have been assigned the correct units and that categorical trait values are properly aligned with AusTraits trait values.

Taxonomy

We developed a custom workflow to clean and standardise taxonomic names using the latest and most comprehensive taxonomic resources for the Australian flora: the Australian Plant Census (APC)13 and the Australian Plant Name Index (APNI)361. These resources document all known taxonomic names for Australian plants, including currently accepted names and synonyms. While several automated tools exist for updating taxonomy, such as taxize362, these do not currently include up to date information for Australian taxa. Updates were completed in two steps. In the first step, we used both direct and then fuzzy matching (with up to 2 characters difference) to search for an alignment between reported names and those in three name sets: 1) All accepted taxa in the APC, 2) All known names in the APC, 3) All names in the APNI. Names were aligned without name authorities, as we found this information was rarely reported in the raw datasets provided to us. Second, we used the aligned name to update any outdated names to their current accepted name, using the information provided in the APC. If a name was recorded as being both an accepted name and an alternative (e.g. synonym) we preferred the accepted name, but also noted the alternative records. For phrase names, when a suitable match could not be found, we manually reviewed near matches via web portals such as the Atlas of Living Australia to find a suitable match. The final resource reports both the original and the updated taxon name alongside each trait record (Table 2), as well as an additional table summarising all taxonomic name changes (Table 6) and further information from the APC and APNI on all taxa included (Table 7). Any changes in taxonomy are exposed within the compiled dataset, enabling researchers to review these as needed.
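The first alignment step can be sketched as follows. This is an illustration under stated assumptions rather than the AusTraits code: it assumes that "characters difference" is measured as Levenshtein edit distance, that exact matches are tried across all three name sets before any fuzzy match, and that ties are broken alphabetically. The name sets contain invented names, and Python is used purely for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def align_name(reported, name_sets, max_distance=2):
    """Return (set_label, aligned_name, distance), or None if nothing matches.

    name_sets is a list of (label, names) pairs in priority order.
    """
    # Pass 1: exact matches, searched in priority order.
    for label, names in name_sets:
        if reported in names:
            return (label, reported, 0)
    # Pass 2: fuzzy matches within max_distance, in the same order.
    for label, names in name_sets:
        candidates = sorted((edit_distance(reported, n), n) for n in names)
        if candidates and candidates[0][0] <= max_distance:
            distance, name = candidates[0]
            return (label, name, distance)
    return None


# Invented name sets standing in for the three sets named in the text.
NAME_SETS = [
    ("APC accepted", {"Examplea alba", "Examplea nigra"}),
    ("APC known", {"Examplea candida"}),
    ("APNI", {"Examplea rubra"}),
]

print(align_name("Examplea alba", NAME_SETS))
print(align_name("Examplea albaa", NAME_SETS))
print(align_name("Examplea candidda", NAME_SETS))
print(align_name("Unrelated name", NAME_SETS))
```

Returning the label of the matching set along with the distance keeps a record of how each name was aligned, so that fuzzy matches can be reviewed before the second step maps the aligned name to its currently accepted name.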

Data Records

Access

Static versions of AusTraits, including version 3.0.2 used in this descriptor, are available via Zenodo363. Data is released under a CC-BY license enabling reuse with attribution – being a citation of this descriptor and, where possible, original sources. Deposition within Zenodo helps make the dataset consistent with FAIR principles364. As an evolving data product, successive versions of AusTraits are being released, containing updates and corrections. Versions are labeled using semantic versioning to indicate the change between versions365. As validation (see Technical Validation, below) and data entry are ongoing, users are recommended to pull data from a specific static release, to ensure results in their downstream analyses remain consistent as the database is updated.
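As a small illustration of how semantic version labels convey the size of a change, the sketch below compares two version strings and reports which component changed. The function names are hypothetical, and how each level maps onto changes in the data (for example, structural changes versus corrections) is set by the project's release policy rather than by this sketch. Python is used purely for illustration.

```python
def parse_version(text):
    """Split a 'major.minor.patch' label into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def change_type(old, new):
    """Name the most significant component that differs between two releases."""
    old_version, new_version = parse_version(old), parse_version(new)
    if new_version <= old_version:
        raise ValueError(f"{new} is not a later release than {old}")
    for level, before, after in zip(("major", "minor", "patch"),
                                    old_version, new_version):
        if before != after:
            return level


# Version 3.0.2 is the release described in the text; the others are invented.
print(change_type("3.0.1", "3.0.2"))
print(change_type("3.0.2", "3.1.0"))
print(change_type("3.0.2", "4.0.0"))
```

Recording the exact version string used in an analysis, rather than "the latest release", is what allows results to be reproduced after the database is updated.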

The R package austraits (https://github.com/traitecoevo/austraits) provides easy access to data and examples on manipulating data (e.g. joining tables, subsetting) for those using this platform.

Data coverage

The number of accepted vascular plant taxa in the APC (as of May 2020) is around 28,98113. Version 3.0.2 of AusTraits includes at least one record for 26,852 taxa (~93% of known taxa). Five traits (leaf_length, leaf_width, plant_height, life_history, plant_growth_form) have records for more than 50% of known species (Fig. 2a). Across all traits, the median number of taxa with records is 62. Supplementary Table 1 shows the number of studies, taxa, and families with data in AusTraits, as well as the number of geo-referenced records, for each trait. Looking across traits and tissue categories, coverage declined gradually, with moderate coverage (>20%) for more than 50 traits (Fig. 2). Coverage for root, stem and bark traits declined much faster than trait measurements for other plant tissues (Fig. 2b).
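The coverage figures above follow from simple ratios of taxa with records to known taxa. A minimal sketch, using the two totals quoted in the text and an invented set of records to show the per-trait calculation (function names are hypothetical; Python is used purely for illustration):

```python
def percent_coverage(taxa_with_records, known_taxa):
    """Share of known taxa with at least one record, as a percentage."""
    return 100.0 * taxa_with_records / known_taxa


def taxa_per_trait(records):
    """Count distinct taxa with at least one record for each trait."""
    taxa = {}
    for taxon_name, trait_name in records:
        taxa.setdefault(trait_name, set()).add(taxon_name)
    return {trait: len(names) for trait, names in taxa.items()}


# Totals quoted in the text: 26,852 taxa with records, about 28,981 known taxa.
overall = percent_coverage(26_852, 28_981)
print(f"Overall coverage: {overall:.1f}%")

# Invented (taxon, trait) records; repeated measurements count once per taxon.
toy_records = [
    ("Examplea alba", "leaf_length"),
    ("Examplea alba", "leaf_length"),
    ("Examplea nigra", "leaf_length"),
    ("Examplea alba", "plant_height"),
]
counts = taxa_per_trait(toy_records)
print(counts)
```

Counting distinct taxa rather than records matters here, because a few well-studied species can contribute many measurements for the same trait.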

Fig. 2

Coverage of traits by taxa. (a) Matrix showing the coverage of taxa for each trait, with yellow indicating presence of data. The figure was generated with a subset of 500 randomly selected taxa. (b) Number of taxa with data for first 100 traits for all traits and separated by tissue.

The most common trait records are non-geo-referenced records from floras; these are trait values representing a continental or regional mean (or spread) and hence are not linked to a location. Yet geo-referenced records were available for several traits for more than 10% of the flora (Fig. 3a). Coverage is notably higher for geo-referenced measurements of some tissues and trait types, such as bark, stems and roots, relative to non-geo-referenced measurements (Fig. 3).

Fig. 3

Number of taxa with trait records by plant tissue and trait category, for data that are (a) Geo-referenced, and (b) Not geo-referenced. Many records without a geo-reference come from botanical collections, such as floras.

Trait records are spread across the climate space of Australia (Fig. 4a), as well as geographic locations (Fig. 4b). As with most data in Australia, the density of records was somewhat concentrated around cities, or along roads in remote regions.

Fig. 4

Coverage of geo-referenced trait records across Australian climatic and geographic space for traits in different categories. (a) AusTraits’ sites (orange) within Australia’s precipitation-temperature space (dark-grey) superimposed upon Whittaker’s classification of major biomes by climate370. Climate data were extracted at 10" resolution from WorldClim371. (b) Locations of geo-referenced records for different plant tissues.

Overall trait coverage across an estimated phylogenetic tree of Australian plant species is relatively unbiased (Fig. 5), though there are some notable exceptions. One exception is for root traits, where taxa within Poaceae have large amounts of information available relative to other plant families. A cluster of taxa within the family Myrtaceae which are largely from Western Australia have little leaf information available.

Fig. 5

Phylogenetic distribution of trait data in AusTraits for a subset of 2000 randomly sampled taxa. The heatmap colour intensity denotes the number of traits measured within a family for each plant tissue. The most widespread family names (with more than ten taxa) are labelled on the edge of the tree.

Comparing coverage in AusTraits to the global database TRY, there were 76 traits overlapping. Of these, AusTraits tended to contain records for more taxa, but not always; multiple traits had more than 10 times the number of taxa represented in AusTraits (Fig. 6). However, there were more records in TRY for 25 traits, in particular physiological leaf traits. Many traits were not overlapping between the two databases (Fig. 6). We noted that AusTraits includes more seed and fruit nutrient data; possibly reflecting the interest in Australia in understanding how fruit and seeds are provisioned in nutrient-depauperate environments. AusTraits includes more categorical values, especially variables documenting different components of species’ fire response strategies, reflecting the importance of fire in shaping Australian communities and the research to document different strategies species have evolved to succeed in fire-prone environments.

Fig. 6

The number of taxa with trait records in AusTraits and global TRY database (accessed 28 May 2020). Each point shows a separate trait.

Technical Validation

We implemented three strategies to maintain data quality. First, we conducted a detailed review of each source based on a bespoke report, showing all data and metadata, by both an AusTraits curator (primarily Wenk) and the original contributor (where possible). Measurements for each trait were plotted against all other values for the trait in AusTraits, allowing quick identification of outliers. Corrections suggested by contributors were combined back into AusTraits and made available with the next release. Version 3.0.2 of AusTraits, described here, is the sixth release.

Second, we implemented automated tests for each dataset, to confirm that values for continuous traits fall within the accepted range for the trait, and that values for categorical traits are on a list of allowed values. Data that did not pass these tests were moved to a separate spreadsheet (“excluded_data”) that is also made available for use and review.
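A minimal sketch of such automated checks is given below. The trait definitions, bounds and allowed values are invented for illustration and are not the AusTraits definitions; the real tests are implemented in the project's R workflow. Python is used purely for illustration.

```python
# Hypothetical trait definitions: bounds and allowed values are invented.
TRAIT_DEFINITIONS = {
    "plant_height": {"type": "numeric", "min": 0.001, "max": 130.0},  # metres
    "life_history": {"type": "categorical",
                     "allowed": {"annual", "biennial", "perennial"}},
}


def passes_checks(record, definitions):
    """True if the value is within range (numeric) or on the allowed list."""
    definition = definitions.get(record["trait_name"])
    if definition is None:
        return False  # unknown trait names are treated as failures
    if definition["type"] == "numeric":
        try:
            value = float(record["value"])
        except (TypeError, ValueError):
            return False  # non-numeric text in a numeric trait
        return definition["min"] <= value <= definition["max"]
    return record["value"] in definition["allowed"]


def split_records(records, definitions):
    """Separate records into those kept and those set aside for review."""
    kept, excluded = [], []
    for record in records:
        target = kept if passes_checks(record, definitions) else excluded
        target.append(record)
    return kept, excluded


# Invented records: one valid and one failing value for each trait.
records = [
    {"trait_name": "plant_height", "value": "12.5"},
    {"trait_name": "plant_height", "value": "1250"},
    {"trait_name": "life_history", "value": "perennial"},
    {"trait_name": "life_history", "value": "perenial"},
]
kept, excluded = split_records(records, TRAIT_DEFINITIONS)
print(len(kept), "kept;", len(excluded), "excluded")
```

Setting failed records aside rather than deleting them matches the approach described above: a value of 1250 for height may be a unit error that a contributor can later correct.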

Third, we provide a pathway for user feedback. AusTraits is an open-source community resource and we encourage engagement from users on maintaining the quality and usability of the dataset. As such, we welcome reporting of possible errors, as well as additions and edits to the online documentation for AusTraits that make using the existing data, or adding new data, easier for the community. Feedback can be posted as an issue directly at the project’s GitHub page (http://traitecoevo.github.io/austraits.build).

Usage Notes

Each data release is available in multiple formats: first, as a compressed folder containing text files for each of the main components; and second, as a compressed R object, enabling easy loading into R for those using that platform.

Using the taxon names aligned with the APC, data can be queried against location data from the Atlas of Living Australia. To create the phylogenetic tree in Fig. 5, we pruned a master tree for all higher plants366 using the package V.PhyloMaker367 and visualised it via ggtree368. To create Fig. 4a, we used the package plotbiomes369 to create the baseline plot of biomes.