Example 7: Cumulative Percentage Masking of Sampling Effort

Published: May 19, 2026

Note: This example is part of the Global Sampling Effort Dataset repository, which provides pre-computed, taxon-stratified rasters of spatial sampling effort derived from GBIF occurrence records. For an overview of all available examples, taxonomic groups, etc., see the main page. If you use these data or code, please cite:

El-Gabbas, A. (2026) A global, taxon-stratified, high-resolution sampling-effort dataset from GBIF for bias-aware ecological modelling. Diversity and Distributions 32, no. 5: e70205. https://doi.org/10.1111/ddi.70205


Setup: Load required packages and define plot theme

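# library() stops with an error if a package is missing, whereas require()
# only warns and lets the script fail later.
library(ecokit)
library(dplyr)
library(terra)
library(rworldmap)
library(colorRamps)
library(sf)
library(ggplot2)
library(grid)
library(tidyterra)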
plot_theme <- ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.margin = ggplot2::margin(t = -18, r = 1, b = -12, l = 1),
    panel.spacing = ggplot2::unit(0, "pt"),
    legend.position = "inside",
    legend.position.inside = c(0.725, 0.055),
    legend.direction = "horizontal",
    legend.margin = ggplot2::margin(t = -4, r = 0, b = -3, l = 0),
    legend.key.height = grid::unit(8, "pt"),
    legend.key.width = grid::unit(25, "pt"),
    legend.key.spacing.x = grid::unit(5, "pt"),
    legend.box.margin  = ggplot2::margin(t = 0, r = 0, b = -5, l = 0),
    legend.title = ggplot2::element_text(
      hjust = 0, size = 12, face = "bold", margin = ggplot2::margin(t = 2, r = 4, b = -8, l = 0)),
    plot.title = ggplot2::element_text(
      hjust = 0, size = 14, face = "bold", margin = ggplot2::margin(t = 6, r = 0, b = -1, l = 0)),
    axis.text = ggplot2::element_blank(),
    axis.ticks = ggplot2::element_blank(),
    panel.grid = ggplot2::element_blank())
regional_theme <- ggplot2::theme_void() + 
  ggplot2::theme(
    plot.margin = grid::unit(c(0, 0, 0, 0), "lines"),
    plot.title = ggplot2::element_text(hjust = 0.5),
    legend.position = "right",
    legend.box.spacing = grid::unit(10, "pt"),
    legend.margin = ggplot2::margin(),
    legend.title = ggplot2::element_text(size = 7),
    legend.text = ggplot2::element_text(size = 7))
global_map <- sf::st_as_sf(rworldmap::getMap(resolution = "high"))

Download bird observation data (1980–2025; 20 km resolution)

The following call downloads a raster of the total number of bird observations globally (1980–2025) at 20 km resolution. The file is saved to the effort_maps directory.

effort_birds <- ecokit::get_sampling_effort( 
  group = "aves", descendants = "all", metric = "n_obs", 
  years = "total", resolution = 20, out_dir = "effort_maps")

Cumulative percentage masking — global extent

The ecokit::mask_cumulative_pct() function partitions a sampling effort raster into three categories based on the cumulative distribution of cell values. Grid cells are ranked from highest to lowest observation count, and the cumulative sum is used to identify which cells collectively account for a given percentage of the total. This quantifies how spatially concentrated sampling effort is — for example, at 1 km resolution, just 0.33% of Earth’s surface accounts for 95% of all GBIF records (see the manuscript for details).

The function classifies cells into:

  • Top X%: the fewest, highest-value grid cells whose cumulative sum reaches the specified percentage of all observations (e.g., 95%)
  • Bottom (100 − X)%: all remaining sampled cells, collectively accounting for the rest of the observations
  • Unsampled: grid cells with zero recorded observations
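
The ranking and cumulative-sum logic described above can be illustrated on a plain vector of counts. This is a minimal base R sketch with hypothetical values, not the ecokit implementation; in particular, placing the cell that crosses the threshold in the top class is an assumption based on the description of the top class above.

```r
# Toy observation counts for ten grid cells (hypothetical values).
counts <- c(500, 300, 120, 40, 20, 10, 6, 4, 0, 0)
top_pct <- 95

# Rank sampled cells from highest to lowest count.
sampled <- which(counts > 0)
ranked <- sampled[order(counts[sampled], decreasing = TRUE)]

# Cumulative share (%) of all observations, in rank order.
cum_share <- 100 * cumsum(counts[ranked]) / sum(counts)

# The top class is the smallest set of highest-value cells whose cumulative
# share reaches `top_pct` (assumption: the crossing cell is included).
n_top <- which(cum_share >= top_pct)[1]

category <- rep("unsampled", length(counts))
category[ranked] <- "bottom"
category[ranked[seq_len(n_top)]] <- "top"

table(category)
```

In this toy case the four largest cells hold 96% of the 1,000 observations, so they form the top class; four cells fall in the bottom class and two are unsampled.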

Here, the masking is applied to the full global raster, so the 95/5% split reflects the worldwide distribution of bird observations. Load the raster and apply a 95% threshold:

r_map <- terra::rast(effort_birds$local_path[[1]]) %>% 
  ecokit::mask_cumulative_pct(top_pct = 95)

The returned SpatRaster contains three layers:

  1. top_95_percent_cumulative: the highest-value grid cells that together account for 95% of all observations
  2. lowest_5_percent_cumulative: the remaining sampled grid cells, which account for the other 5% of observations
  3. zero_observations: grid cells with no recorded observations

These layers can be visualised directly (see below) or used for quantitative spatial coverage analyses.
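
As one possible coverage analysis, the share of grid cells falling in each class can be computed by counting non-NA cells per layer. The sketch below uses a small toy raster rather than the real output, and it assumes that each layer is NA outside its own class (as the use of na.value = "transparent" in the plots suggests). The layer names and values are hypothetical.

```r
library(terra)

# Toy stand-in for the three-layer output (hypothetical values): each layer
# holds values only in its own cells and NA elsewhere.
top    <- terra::rast(nrows = 2, ncols = 3, vals = c(50, NA, NA, NA, NA, NA))
bottom <- terra::rast(nrows = 2, ncols = 3, vals = c(NA, 3, 2, NA, NA, NA))
zero   <- terra::rast(nrows = 2, ncols = 3, vals = c(NA, NA, NA, 0, 0, 0))
toy <- c(top, bottom, zero)
names(toy) <- c("top", "bottom", "unsampled")

# Number of non-NA cells per layer, and their share of all classified cells.
n_cells <- colSums(!is.na(terra::values(toy)))
pct_cells <- round(100 * n_cells / sum(n_cells), 1)
pct_cells
```

Cell counts equal area shares only on an equal-area grid. For a longitude/latitude raster, weighting each cell by its area (for example with terra::cellSize()) would be a more faithful measure.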

Top 95% of observations (log10 scale)

The map below shows only the grid cells that collectively account for 95% of all bird observations. Although they hold the vast majority of records, these cells occupy a small fraction of the Earth’s surface.

ggplot2::ggplot() + 
  ggplot2::geom_sf(data = global_map, fill = "grey95", color = "black", linewidth = 0.05) +
  tidyterra::geom_spatraster(data = log10(r_map[[1]]), maxcell = 1e8) +
  ggplot2::scale_fill_gradientn(
    colours = colorRamps::matlab.like2(100), na.value = "transparent",
    name = "# observations\n(log10)\n") +
  ggplot2::geom_sf(data = global_map, fill = "transparent", color = "gray60", linewidth = 0.2) +
  ggplot2::labs(title = NULL, x = NULL, y = NULL) + 
  ggplot2::coord_sf(xlim = c(-180, 180), ylim = c(-87, 85), expand = FALSE) +
  plot_theme

Bottom 5% of observations (log10 scale)

These are the remaining sampled cells — areas where observations have been recorded, but at much lower intensity. Together they account for only 5% of the total observation count.

ggplot2::ggplot() + 
  ggplot2::geom_sf(data = global_map, fill = "grey95", color = "black", linewidth = 0.05) +
  tidyterra::geom_spatraster(data = log10(r_map[[2]]), maxcell = 1e8) +
  ggplot2::scale_fill_gradientn(
    colours = colorRamps::matlab.like2(100), na.value = "transparent",
    name = "# observations\n(log10)\n") +
  ggplot2::geom_sf(data = global_map, fill = "transparent", color = "gray60", linewidth = 0.2) +
  ggplot2::labs(title = NULL, x = NULL, y = NULL) + 
  ggplot2::coord_sf(xlim = c(-180, 180), ylim = c(-87, 85), expand = FALSE) +
  plot_theme

Unsampled areas (zero observations)

Grid cells shown in dark grey have no recorded bird observations in GBIF for the entire 1980–2025 period. They mark the most complete gaps in GBIF bird data, although an absence of GBIF records does not necessarily mean that no monitoring has taken place there.

ggplot2::ggplot() + 
  ggplot2::geom_sf(data = global_map, fill = "grey95", color = "black", linewidth = 0.05) +
  tidyterra::geom_spatraster(data = r_map[[3]], maxcell = 1e8, show.legend = FALSE) +
  ggplot2::scale_fill_discrete(type = "gray40", na.value = "transparent") +
  ggplot2::geom_sf(data = global_map, fill = "transparent", color = "grey60", linewidth = 0.05) +
  ggplot2::labs(title = NULL, x = NULL, y = NULL) + 
  ggplot2::coord_sf(xlim = c(-180, 180), ylim = c(-87, 85), expand = FALSE) +
  plot_theme

Regional comparison — all taxonomic groups combined

The global maps above apply cumulative percentage masking to bird observations over the full global extent. The following panels shift to a different scope in two ways:

  1. Taxonomic scope: the raster covers all taxonomic groups combined (group = "all"), not just birds, providing a broader view of sampling effort.
  2. Geographic scope: the raster is cropped to each region before applying ecokit::mask_cumulative_pct(). This means the 95/5% split is computed relative to the observation distribution within that region, not globally. For example, a cell classified as “top 95%” in Europe belongs to the highest-value cells within Europe. The same cell may be classified differently in a global analysis, because the cutoff depends on the distribution of observation counts across the whole extent being analysed.

This region-specific masking reveals the internal structure of sampling effort within well-sampled and poorly sampled regions alike, complementing the global perspective above.
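
The dependence of the classification on the extent can be seen in a small base R example. The helper `is_top()` and all counts below are hypothetical and only mimic the ranking logic described earlier; they are not part of ecokit.

```r
# Hypothetical helper: TRUE for the highest-count cells whose cumulative
# share of observations first reaches `top_pct`.
is_top <- function(counts, top_pct = 95) {
  ord <- order(counts, decreasing = TRUE)
  cum_share <- 100 * cumsum(counts[ord]) / sum(counts)
  out <- logical(length(counts))
  out[ord[seq_len(which(cum_share >= top_pct)[1])]] <- TRUE
  out
}

# Toy counts (hypothetical): four cells in region A, four in region B.
region <- rep(c("A", "B"), each = 4)
counts <- c(1000, 800, 150, 50, 60, 30, 8, 2)

# Classification over the full extent.
top_global <- is_top(counts)

# Classification computed separately within each region.
top_regional <- logical(length(counts))
for (r in unique(region)) {
  idx <- region == r
  top_regional[idx] <- is_top(counts[idx])
}

data.frame(region, counts, top_global, top_regional)
```

In this toy case the cells with 30 and 8 observations in the sparsely sampled region B fall in the top class within B but in the bottom class once the much larger counts of region A enter the ranking.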

Each panel shows: (1) cells contributing to the top 95% of observations within that region, (2) cells contributing to the bottom 5%, and (3) unsampled cells with zero records.

Download the observation-count raster for all taxonomic groups combined:

effort_all <- ecokit::get_sampling_effort( 
  group = "all", descendants = "all", metric = "n_obs", 
  years = "total", resolution = 20, out_dir = "effort_maps")
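
The crop-then-mask step for the three regions can be sketched as follows, using the bounding boxes stated in the regional sections. The sketch runs on a toy raster and assumes the real raster uses longitude/latitude coordinates; if it is in a projected coordinate system, the extents would need to be projected first. The object names are hypothetical, and the call to ecokit::mask_cumulative_pct() is left as a comment so the example runs without ecokit.

```r
library(terra)

# Bounding boxes from the text, as xmin, xmax, ymin, ymax in degrees.
region_ext <- list(
  Europe = terra::ext(-11, 37.5, 35, 71),
  USA    = terra::ext(-125, -66.5, 24.5, 49.5),
  India  = terra::ext(68.1, 97.4, 6.7, 35.5))

# Toy stand-in for the global raster: 1-degree lon/lat grid with random
# counts (hypothetical; the real input would be the downloaded raster).
set.seed(1)
toy_global <- terra::rast(
  nrows = 180, ncols = 360, xmin = -180, xmax = 180, ymin = -90, ymax = 90,
  crs = "EPSG:4326")
terra::values(toy_global) <- rpois(terra::ncell(toy_global), lambda = 5)

# Crop first, so that any later masking sees only the regional cells.
regional <- lapply(region_ext, function(e) terra::crop(toy_global, e))

# With ecokit installed, each cropped raster could then be masked as in the
# global example, e.g.:
# regional_masked <- lapply(regional, ecokit::mask_cumulative_pct, top_pct = 95)

sapply(regional, terra::ncell)
```

Cropping before masking is what makes the 95/5% split region-specific: the cumulative sum is then taken over the cells of the cropped raster only.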

Europe

Cumulative percentage masking applied to all observations within the European extent (11°W–37.5°E, 35°N–71°N):

USA

Cumulative percentage masking applied to all observations within the contiguous United States (125°W–66.5°W, 24.5°N–49.5°N):

India

Cumulative percentage masking applied to all observations within the Indian subcontinent (68.1°E–97.4°E, 6.7°N–35.5°N):

