Background & Summary

Biological diversity, along with the associated ecosystem services and genetic resources, is essential to support the wealth of human society. In the last decades, international efforts have been made to collect, standardise and share a huge amount of distribution data, e.g. via the Global Biodiversity Information Facility (GBIF; https://www.gbif.org). Despite the efforts to collect as much information as possible and compile comprehensive databases, several limitations are still present1 and biodiversity inventories suffer from various types of bias2. Sources of bias include the unequal sampling effort due to site accessibility and attractiveness3,4, the introduction of data from non-systematic (i.e. opportunistic or occasional) surveys5, the inconsistency of sampling methodologies across space and time6 and the taxonomical attractiveness (i.e. species or groups receiving particular attention for being rare, beautiful or somehow charismatic)7. Additionally, distribution data from biodiversity inventories are often affected by errors in spatial coordinates8, taxonomic misidentification or changes in nomenclature9 or need retrospective georeferencing prior to use10,11. Even considering all these potential limitations, biodiversity inventories still remain a powerful tool for a deeper understanding of ecosystem complexity and for tracking long-term biodiversity trajectories. In this frame, historical data (e.g. properly labelled specimens from private collections, museums or herbaria) represent a crucial source of information on past species distribution and phenology12,13. For example, comparing present-day information with data spanning the period of accelerated anthropogenic habitat destruction and climate change (e.g. the nineteenth and twentieth centuries) allowed to detect local extinctions and altitudinal shifts following human disturbance or recent climate change14,15,16.

Distribution data from collections frequently exhibit high resolution at both the spatial (collection locality) and temporal (complete date or year) scales; such biodiversity inventories are intrinsically spatially-accurate multi-temporal datasets, spanning over wide areas and temporal intervals. Comparable multi-temporal fine-grained abiotic data are thus required to shed light on the patterns of biodiversity response to environmental changes, and to identify the processes underlying such responses17. During the past decades, a substantial set of climatologies has been made available to researchers dealing with drivers of species distribution, but each of the most popular climate datasets clearly has some limitations. WorldClim18,19 has very broad spatial extent (global) and high resolution (30 arcsec), but lacks of temporal extent (multi-year climatologies: 1950–2000 and 1970–2000, respectively). On the contrary, datasets such as CRUT420, CRU TS 4.0221, GISSTEMP22, NOAA23 and BEST24 are global, providing gridded anomalies across the World at monthly resolution with a high temporal extent (first year ranging from mid-18th century for BEST to 1901 for CRU TS 4.02), but with quite coarse spatial resolution (ranging from 5° to 0.5°). Finally, the recently published CHELSA25 combines the high spatial extent and resolution of WorldClim (worldwide at 30 arcsec) with the high temporal resolution of CRU TS (monthly means), but is temporally limited to 1979–2013. A major issue of all these datasets is that they are originated from the broad-scale interpolation of data from weather stations, and can be locally inaccurate. Finally, Deblauwe et al.26 showed that such global datasets have fluctuating performances across the globe, and proposed the use of remotely sensed data to obtain more accurate climatologies, particularly in areas with a limited number of ground stations. Unfortunately, both the Deblauwe dataset and remotely-sensed data are temporally limited to the last forty years.

Our aim here was to produce and share a climatically explicit database on the distribution of Italian fauna with high spatial and temporal resolution. Based on GBIF data, Italy stands out among western countries for the low completeness of its species inventory27, even if it is one of the most biodiverse countries in Europe. Italy is formally neither a voting nor an associated participant of GBIF (https://www.gbif.org/the-gbif-network); consequently, there was no massive data contribution to the GBIF database from Italian institutions to date28. To fill this gap, in this work we updated, cleaned and georeferenced with greater precision raw distribution data from the extensive database for the Italian fauna (CKmap 5.3.8; ref.29), and integrated it with a high-resolution reconstruction of monthly temperature and precipitation in the location and at the time specimens were collected. The dataset compiled here represents the most spatially and temporally comprehensive and accurate faunal distribution database available for the Italian fauna, as well as one of the first attempts to finely yet massively reconstruct climate conditions under which biodiversity data were collected over more than two centuries.

Methods

CKmap 5.3.8 (ref.29; English edition – version 5.4) reports the taxonomical data and the occurrence in Italy of over 10,000 terrestrial and freshwater animal species30. Species distribution was mapped attributing each record (i.e. the occurrence of a species in a single location) to the 10 × 10 km UTM cell (ED50 datum, MGRS system) on the basis of a gazetteer. The gazetteer stored in the database (available for download at http://www.faunaitalia.it/documents/TCI.zip) included 46,961 toponyms taken from the “Touring Club Italiano” (TCI) atlas, accurately georeferenced using topographic maps of Italy at the scale 1:25,00030.

The geographic cleaning phase included quality checks for both CKmap and the reference gazetteer. All toponyms were checked for double spacing and/or additional spacing at the beginning and at the end of the string; all the records with an empty locality field were discarded. In the gazetteer, the precision of coordinates was indicated with the letters A, F and G30. All G records were excluded, given they represent the centroid of broad spatial polygons (i.e., mountain ranges or rivers). A one-to-many join then allowed linking each of the remaining toponyms (A and F; i.e. individual locations) to the corresponding distribution records. An R code allowing partial matches (i.e. allowing up to 5% of letter deletion, insertion or substitution) was run on non-matching records and putative matches were carefully verified by hand prior to assignment. The distribution of the retained records was polygonised via a Voronoi tessellation using the deldir R package31 and the resulting Voronoi diagram was clipped using coastline and administrative boundaries. We used this approach because database compilers described each collection locality using the closer toponym30. For each Voronoi cell, spatial precision was calculated as the distance between the sample point (toponym) and the farthest vertex (i.e. the positive pole, for bounded cells). The dimension of each Voronoi cell, and hence the record precision, obviously depends on the local toponym density.

The database cleaning consisted in the deletion of inaccurate records. All the records reporting an inaccurate taxonomical classification (e.g. using open nomenclature32 instead of the Linnean binomials) were discarded. The remaining binomials were cleaned by removing subspecific classification and checking for extra spacing and typos using the above-mentioned approach. Furthermore, all records without collection date were discarded and the remaining records were checked for agreement with publication year. All the records with publication year (if any) later than collection year were retained; we also retained all the records without publication year, which represented unpublished collection specimens. We finally removed duplicate records (i.e. occurrence records of the same taxon from the same location and year), except when they were from different sources or were collected at different elevations.

For each retained record, we reconstructed average monthly minimum and maximum temperature and total monthly precipitation exploiting the huge amount of meteorological observations available for Italy since the 18th century33. We used the anomaly method for climate reconstruction34, which is based on the independent reconstruction of the climatologies (i.e. the climate normal over the standard reference period) and the deviations from them (i.e. the anomalies with respect to the same baseline period). Climatologies are characterized by strong spatial gradients and a large number of weather stations is necessary to capture them, even if available for a short period. Consequently, an interpolation technique that exploits the dependence of climate normals on orography and other geographic parameters is necessary to reconstruct such climatologies35,36,37. Anomalies, on the other hand, are linked to climate change and variability; for this reason, they are usually characterized by higher spatial coherence34. A limited number of weather stations can therefore be sufficient to capture spatial patterns through simpler interpolation methods, but it is pivotal to use long-term time series to have a satisfactory temporal coverage over the past. These series must in turn be corrected for errors deriving from the history of the stations (changes of station and instrument location, instrument replacements, changes in observation protocols, etc.) which could interfere with the actual climate signal33,38,39. For each record, we reconstructed the climate information of the Voronoi cell centre using average cell elevation following Brunetti et al.40. For each centre we first estimated the climate normals calculating a weighted linear regression of the data from nearby stations as a function of elevation. We assigned greater weights to the stations with elevation and topographic position similar to that of the location of interest, as derived from a 30 arcsec resolution digital elevation model36,37. Anomalies were interpolated at the same locations exploiting an improved version of the dataset presented in Brunetti et al.33 and then, by combining anomalies and climatologies, temporal series of temperature and precipitation in absolute values were obtained for each record location. Bioclimatic variables were finally calculated on monthly temperature and precipitation using the dismo R package41. All the above-mentioned analyses were run using custom R and Fortran codes.

Data Records

The final dataset includes 268,977 occurrence records deriving from 8,445 binomials and 16,332 localities, dating between 1680 and 2006 CE (Fig. 1). A complete description of the database structure is reported in the Online-only Table 1.

Fig. 1
figure 1

Temporal distribution of the records from ClimCKmap. Total number of records: 268,977 from 8,445 binomials and 16,332 localities. Temporal coverage: 1680–2006 CE. The numerals above each bar indicate the percentage of records falling in that time period.

The full database is made accessible at https://doi.org/10.6084/m9.figshare.7906739.v442 and http://hdl.handle.net/21.11125/a91f85cb-befd-4e14-8e83-24f17c4a0491.

Technical Validation

The starting dataset (CKmap 5.3.8; ref.29) included 548,868 occurrence records from 10,132 taxa. After taxonomic simplification and cleaning, 544,764 records from 10,064 binomials (i.e. species) were retained. The subsequent geographic cleaning and georeferencing phase allowed assigning coordinates to 470,232 records from 19,574 localities. The removal of non- or dubiously-dated records restricted the size of the resulting dataset to 277,310 records. The removal of duplicates led to a final dataset including 268,977 distribution records deriving from 8,445 binomials and 16,332 localities, dating between 1680 and 2006 CE (Fig. 1). The georeferencing methodology returned satisfactory results in term of precision (Fig. 2; median accuracy = 2480 m; for 95% of records accuracy was between 1251 and 4474 m). Climate reconstructions did not allow assigning climate data to only a small fraction of the total dataset (894 records dating between 1680 and 1921 CE; 0.3%) due to the lack of meteorological data (pre-instrumental period) or the incompleteness of the local time series. Monthly means of minimum and maximum temperature and total precipitation were thus reconstructed for the remaining part of the records (268,083 records dating between 1790 e 2006 CE).

Fig. 2
figure 2

Distance between sample point and the farthest vertex of the Voronoi cell (precision). Median = 2480 m; 95% confidence interval = 1251–4474 m. The numerals above each bar indicate the percentage of records falling in that distance class.

Usage Notes

In the last two centuries, biodiversity has been facing abrupt climate and environmental changes globally, mainly due to human activities. Huge efforts have been spent to expand our knowledge of species distribution worldwide, but the picture is still far from complete. In fact, apart from rare exceptions43, low taxonomic coverage and temporal gaps still limit our understanding of ecosystem functioning under climate change and biodiversity loss. The dataset we compiled and presented here aimed at moving a step forward to fill this gap, by providing spatially, temporally and climatically accurate distribution information for the fauna of one of the most biodiverse countries in Europe. ClimCKmap is currently the largest spatially, temporally and climatically explicit distribution database ever assembled for the Italian fauna. Future research should consider exploiting the database, for example deriving empirical relationships between climate change and biodiversity loss and using them to evaluate impacts of future climate change scenarios, as well as to identify climate-biodiversity change hotspots. Proper use of the data will also allow for developing knowledge-based biodiversity conservation measures.

This database has of course some potential issues that have to be taken into account when working with it. First, we opted for retaining the original classification and taxonomy, i.e. a revised form of the checklist of the Italian fauna44. Additionally, some authors reported biases due to spatially unequal sampling effort45,46, as well as temporal and taxon-specific biases in the completeness of the database46. Researchers interested in exploiting the database should thus evaluate case by case if a taxonomic updating for the group under analysis is needed, and be aware of the potentially (spatially and/or temporally) biased nature of the database itself.