
Documentation


What is AREA 2?

AREA 2 (“area squared” or “area two”), short for Area Estimation & Accuracy Assessment, is a Google Earth Engine application that provides comprehensive support for sampling and estimation in a design-based inference framework. Several sampling designs and estimators are supported. The aim of AREA 2 is to give users the tools necessary to comply with recommended practices and international guidance for estimating areas of land categories and land change, and for assessing the accuracy of maps.

As explained in the Theoretical Background, areas of map categories should not be obtained by counting pixels in maps because of classification errors. Instead, a sample of reference observations needs to be collected, to which an unbiased estimator is applied. AREA 2 allows users to design a sample according to various design criteria and objectives.

A major benefit of Google Earth Engine is that the data needed to observe the reference conditions on the land surface are available without having to be downloaded. The Time Series Viewer, which is part of AREA 2, extracts time series of all available Landsat and Sentinel-2 observations for reconstructing changes on the land surface. Collect Earth Online and TimeSync are other applications that provide direct access to reference data using Google Earth Engine.

Once the sample data have been collected, an estimator that corresponds to the sampling design is applied to the sample data to provide an estimate of area or accuracy with confidence intervals. Different estimators are more or less efficient. Support for various estimators is provided, in addition to guidance for choosing an appropriate estimator.


Theoretical background

One of the most common results of remote sensing analyses is a quantification of various phenomena on the land surface, such as deforestation, farmland expansion, urbanization, land use patterns, etc. Remote sensing has the advantage of providing wall-to-wall coverage of the area of interest at little or no cost, but results are never perfect! Classification errors are inevitable when trying to translate reflected sunlight or backscattered longwave radiation into information about complex land surface processes. The presence of errors may introduce substantial bias in remote sensing-based maps [1]. For remote sensing science to impact policy and decision-making for the benefit of Earth and its inhabitants, satellite image-based analyses must produce valid scientific inference – maps that lack inference-based assessments of the parameters of interest are of little utility for scientific inference; “essentially, they may be just pretty pictures” [2]. With the emergence of free data and powerful computing platforms, map-making has never been easier, which makes the “pretty pictures” statement more true than ever before. Luckily, a typical remote sensing analysis is attractive from the perspective of design-based inference [3]. In a design-based inference framework, a sample of population units (i.e. pixels in this case) is selected such that it represents the much larger population (i.e. all the pixels in the map), and reference conditions are observed at each sample location. Application of an unbiased estimator, that accommodates the effects of classification errors, to the sample data provides area estimates that are bias-adjusted. Measures of map accuracy are also estimated from the sample data. Further, application of a variance estimator allows for quantification of the uncertainty of estimates in the form of confidence intervals.

A common bottleneck in the implementation of inference protocols is the collection of reference observations. A powerful source of reference data is the Landsat archive because it allows for examination of time series of observations at each sample location, but downloading the data and extracting time series require much storage space and time, especially for large areas. Google Earth Engine alleviates such demands because time series of Landsat in combination with Sentinel-2 data, both of which are hosted on GEE, can be extracted for any pixel in the world without having to download a byte of data! This application provides support for selecting a sample from a study area (sampling design), extracting reference data for sample locations (response design), and applying an unbiased estimator to sample data for estimation of area and map accuracy (analysis):

  • Sampling design: support for four designs: simple random, simple systematic, stratified random, and stratified systematic. For stratified designs, a map should be provided in which the map classes correspond to strata. Currently, the assessment unit is a 30 m x 30 m pixel and the output is a shapefile that contains the locations of the units in the sample. Users can set the sample size and the allocation of sample units to strata arbitrarily, or determine them according to recommendations in the literature.
  • Response design: support for visualization of time series of Landsat data and display of Landsat and Sentinel-2 for sample locations.
  • Analysis: support for use of the stratified estimator [1] and the model-assisted regression estimator [2]. Both estimators provide estimates of the area of each category in the map provided; user’s and producer’s accuracy of each category; overall map accuracy; and confidence intervals for each estimate.
[1] Olofsson P, et al. (2013). Making better use of accuracy data in land change studies: Estimating accuracy and area… Remote Sensing of Environment, 129, 122-131.
[2] McRoberts R E (2011). Satellite image-based maps: Scientific inference or pretty pictures? Remote Sensing of Environment, 115, 715-724.
[3] Olofsson P, et al. (2014). Good practices for estimating area and assessing accuracy of land change. Remote Sensing of Environment, 148, 42-57.

Terminology

Accuracy

A measure of “correctness”; in the context of this document, accuracy expresses the degree to which the map agrees with reality; an estimate of accuracy is the degree to which the map agrees with a sample of reference observations.

Accuracy assessment

The process of estimating a measure of map accuracy using a sample of reference observations.

Bias

A property of an estimator; the bias of an estimator \(\hat{\mu}\) of a population parameter \(\mu\) is the difference between the expected value of \(\hat{\mu}\) over all possible samples and \(\mu\); that is, \(\mbox{Bias}(\hat{\mu}) = \mbox{E}(\hat{\mu}) - \mu\) (Casella & Berger, 2002, p. 330). Note that “…because an estimate is a number, it has no variance and no bias” (Särndal et al., 1992, p. 41). The term biased [or unbiased] estimate is not recommended, although it appears occasionally in the literature.

Cluster

A “sampling unit [that] consists of group or cluster of smaller units” (Cochran, 1977, p. 233).

Commission error

Commission error is the proportion or percentage of the area mapped as the category of interest that is erroneously predicted based on comparison to the reference classification. Commission error is the complement of user’s accuracy (Olofsson et al., 2013).

Confidence interval

A 95% confidence interval for a population parameter, \(\mu\), expresses uncertainty in the parameter estimate, \(\hat{\mu}\), and is calculated using the sample data. Confidence intervals are often, but not necessarily, in the form \(\hat{\mu} \pm a\times \mbox{SE}(\hat{\mu})\) where \(\mbox{SE}(\hat{\mu})\) is the standard error of the estimate and \(a\) is a statistic related to the desired confidence level (see z-score). Among the aggregate set of confidence intervals constructed using all samples that could be realized using the sampling design, 95% of such intervals are expected to include the true value of the population parameter \(\mu\) , although which intervals do and which do not include \(\mu\) is generally unknown. The IPCC Good Practice Guidelines (IPCC, 2006, Section 3.1.3) recommend the use of 95% confidence intervals in greenhouse gas inventories.

Design-based inference

The process of drawing an inference for a population parameter by analysing a probability sample selected from the population. With design-based inference, the observation for a population unit is a constant value, apart from negligible measurement error. Design-based properties of estimators such as bias and variance are determined from the sampling distribution constructed from the aggregate set of population parameter estimates obtained from all possible samples that could be realized from the sampling design. See also Inference.

Estimate

The value obtained from the estimator when applied to a specific sample. An estimate of \(\mu\) is denoted \(\hat{\mu}\). Note that in the literature, the estimator is usually denoted in the same way as the estimate (e.g. Cochran, 1977, p. 11; Särndal et al., 1992, p. 40) because there will not likely be confusion regarding which use we intend: for example, an estimator of \(\mu\) is denoted \(\hat{\mu}\). Note that because variance and bias are properties of estimators, and “because an estimate is a number, it has no variance and no bias. The term biased [or unbiased] estimate is [not recommended but is] nevertheless used occasionally.” (Särndal et al., 1992, p. 41).

Estimator

“The rule by which an estimate of some population characteristic [i.e. parameter] \(\mu\) is calculated from the sample results” (Cochran, 1977, p. 11). Note that an estimator is not the same thing as an estimate. Instead, “An estimator is a function of the sample, while an estimate is the realized value of an estimator (that is, a number) that is obtained when a sample is actually taken” (Casella & Berger, 2002, p. 312).

Expected value

“The expected value, or expectation, of a random variable is merely its average value, where we speak of ‘average’ value as one that is weighted according to the probability distribution. […] The expected value or mean of a random variable \(X\) is denoted by \(\mbox{E}(X)\)” (Casella & Berger, 2002, p. 55).

Inference

In a sampling framework, an inference expresses the relationship between the population parameter, \(\mu\), and its estimate, \(\hat{\mu}\), in probabilistic terms, typically in the form of either a confidence interval or a test of hypothesis (Dawid, 1983).

Margin of error

A relative measure of the uncertainty in an estimate. Note that the definitions of margin of error are not all the same. Typically, it is calculated as the ratio of the half width of a 95% confidence interval to an estimate.

Model-assisted estimator

An estimator used in design-based inference that incorporates auxiliary information to increase precision by comparing a model’s predictions, often in the form of map unit values, to a probability sample of reference observations (Särndal et al., 1992, p. 219). Model-assisted estimators can be particularly effective when the response variable is continuous (e.g., proportion forest or biomass) rather than categorical (e.g., forest/non-forest or forest change class).

Model-based Inference

Inference based on the perspective that an observation for a population unit is a random variable and that the model used to make predictions for all units in the population has been adequately specified in the sense that there is no systematic lack of the model fit to data that represent the entire population. See also Inference.

Monte Carlo method

A technique in which a large quantity of randomly generated numbers is studied using a probabilistic model to find an approximate solution to a numerical problem that would be difficult to solve by other methods. For example, suppose a sample \(s\) is a realization of a set-valued random variable \(S\) specified by the sampling design. The expected value and variance of a statistic \(Q(S)\) are unknown, but \(Q(s)\) can be calculated once the sample data have been collected. If we select many samples (let’s say 10,000) using the same design, each of size \(n\), and compute \(Q\) for each sample, the average and variance of the computed values will approximate the expectation and variance of \(Q(S)\). “This method of approximating the value of quantities that may be hard to calculate by analytic means is known as a Monte Carlo simulation” (Särndal et al., 1992, p. 35).
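
The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population of 1,000 pixels with 20% deforestation is hypothetical, \(Q\) is taken to be the sample mean, the design is assumed to be simple random sampling without replacement, and the function name is invented for this example.

```python
import random
import statistics


def monte_carlo_mean(population, n, repetitions=10_000, seed=42):
    """Approximate the expectation and variance of the sample mean
    by drawing many simple random samples of size n."""
    rng = random.Random(seed)
    # Q(s) is computed for each simulated sample; here Q is the sample mean
    q_values = [
        statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)
    ]
    return statistics.fmean(q_values), statistics.pvariance(q_values)


# Hypothetical population: 1,000 pixels, 200 of which are deforested (1)
population = [1] * 200 + [0] * 800
expectation, variance = monte_carlo_mean(population, n=50)
print(f"approximate E(Q) = {expectation:.4f}, approximate V(Q) = {variance:.6f}")
```

With this population the approximate expectation should land close to the true proportion of 0.2, which is what unbiasedness of the sample mean under simple random sampling implies.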

Omission error

Omission error is the proportion or percentage of area with the reference classification of the category of interest that is erroneously predicted (mapped) to be in other categories. Omission error is the complement of producer’s accuracy (Olofsson et al., 2013).

Overall accuracy

The overall accuracy is the “overall proportion of area correctly classified” (Stehman, 1997, p. 79).

Parameter

See population parameter.

Population

“The aggregate [that we want to obtain information about] from which the sample is chosen” (Cochran, 1977, p. 5). (Example: all likely voters in the next U.S. presidential election; an example in the context of this document, is all Landsat pixels of a study area).

Population parameter

“Numerical characteristics, or parameters, of the population [with constant values that will be estimated from a sample]” (Rice, 1995, p. 186). (Examples: the area deforested in Brazil 2010-2015 or a fixed value in a model.)

Population unit

An individual member of the set of elements that make up the population. Population units are referred to as elements in Särndal et al. (1992, p. 9).

Post-stratified estimation

Post-stratification refers to a stratification of the study area that is independent of the selection of the sample and the stratification is applied subsequent to sampling in the estimation of parameters. For example, if a systematic sample of plots exists and we want to estimate the area of forest, the use of a forest/non-forest map to stratify the area is likely to increase precision in the area estimate, even if stratified random sampling is not used. In this case, a stratified estimator applied to the sample data is referred to as a post-stratified estimator.

Precision

In the context of estimation, Cochran (1977, p. 16) states that because “of the difficulty of ensuring that no unsuspected bias enters into estimates [sic], we will usually speak of the precision of an estimate instead of its accuracy. Accuracy refers to the size of deviations from the true mean \(\mu\), whereas precision refers to the size of deviations from the mean \(m\) obtained by repeated application of the sampling procedure.” In the context of this document, we often characterize the precision of an estimate with a 95% confidence interval: the larger the interval, the lower the precision (and the greater the uncertainty).

Probability distribution function (PDF)

From IPCC (2006, Volume 1, Chapter 3, Section 3.1.3): The PDF describes the range and relative likelihood of possible values. The PDF can be used to describe uncertainty in the estimate of a quantity that is a fixed constant whose value is not exactly known, or it can be used to describe inherent variability.

Probability sample

A sample drawn from a population using a randomization mechanism such that “the inclusion probability for each element of the sample is known, and the inclusion probabilities are non-zero for all elements of the population.” (Stehman, 1999).

Producer’s accuracy

From Stehman (1997, p. 79): “Producer’s accuracy for [category] j [is] the conditional probability that an area classified as category j by the reference data is classified as category j by the map.” When expressed in terms of area, producer’s accuracy is the proportion of area that has the reference classification of the category of interest that is correctly predicted (mapped) as that category. For a simple random sample of reference observations, each of which represents an equal area, producer’s accuracy for a category is estimated by dividing the number of correctly classified map units in the category by the number of reference units in the category (Lillesand et al., 2008, p. 586). For stratified random sampling of reference data, producer’s accuracy is estimated in two steps: (1) for each map category (or stratum), the proportion of sample units whose reference classification is the category of interest is multiplied by the stratum weight; and (2) the product for the category of interest is divided by the sum of the products over all strata. Producer’s accuracy is the complement of omission error (Olofsson et al., 2013).
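
As a minimal sketch of the stratified calculation, the Python snippet below converts sample counts into estimated area proportions using stratum weights and then derives overall, user’s and producer’s accuracy. The function name, the two-class sample counts and the weights are all hypothetical values chosen for illustration, and the map classes are assumed to be the strata.

```python
def stratified_accuracy(counts, weights):
    """counts[i][j]: number of sample units in map class (stratum) i
    whose reference class is j.
    weights[i]: proportion of the map area occupied by stratum i."""
    k = len(counts)
    # Step 1: estimated area proportions p_ij = W_i * n_ij / n_i
    p = [
        [weights[i] * counts[i][j] / sum(counts[i]) for j in range(k)]
        for i in range(k)
    ]
    overall = sum(p[i][i] for i in range(k))
    # User's accuracy: diagonal cell divided by its row (map class) total
    users = [p[i][i] / sum(p[i]) for i in range(k)]
    # Step 2: producer's accuracy, diagonal cell divided by its column
    # (reference class) total
    producers = [p[j][j] / sum(p[i][j] for i in range(k)) for j in range(k)]
    return overall, users, producers


# Hypothetical two-class example with 50 sample units per stratum
counts = [[45, 5], [10, 40]]
weights = [0.9, 0.1]
overall, users, producers = stratified_accuracy(counts, weights)
print(overall, users, producers)
```

Note how the rare class (weight 0.1) has a user’s accuracy of 0.8 from the counts alone, while its producer’s accuracy is much lower because the few omitted units in the large stratum represent a large area.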

Random variable

A variable whose value depends on possible outcomes of some process (a coin toss for example). More formally, it is defined as (Gut 2009): “A random variable is a (measurable) function from a probability space to the real numbers”. Also called stochastic variable.

Reference data

Data characterizing the most accurate available assessment of the true condition at the sample location (example: fine-resolution satellite imagery).

Reference classification

The most accurate available assessment of the true condition of a population unit (example: deforestation).

Reference observations

The reference classification applied to the collection of all sample units.

Sample

A subset of population units selected from the population.

Sampling frame

List of population units that can be selected for inclusion in a sample. The population units in the list are referred to as sampling units. In other words, a frame is a device that provides observational access to the population by associating the population units with the sampling units (Särndal et al., 1992, p. 9).

Sample unit

The sampling units drawn from the sampling frame for inclusion in a sample. I.e., a sample unit is different from sampling and population units.

Sampling unit

Entities that make up the sampling frame (Särndal et al., 1992, p. 5). In the literature, sometimes there is no distinction made between population units and sampling units (e.g. Cochran, 1977, p. 6). However, population and sampling units are different entities because the sampling frame and the population are sometimes different entities (in many situations though, the sampling frame is equivalent to the population).

Simple random sampling

“A method for selecting \(n\) units out of the \(N\) such that every one of [the sets of \(n\) specified units] has an equal chance of being drawn.” (Cochran 1977, p. 18).

Spatial assessment unit

In the context of accuracy assessment, a spatial assessment unit is a “unit for comparing the map and reference classifications” (Stehman & Wickham, 2011, p. 3044), typically a pixel, block of pixels or polygon.

Standard deviation

The standard deviation of \(X\) is the square root of the variance: \(\mbox{SD}(X) = \sqrt{\mbox{V}(X)}\) and is often denoted by \(\mbox{S}(X)\), \(\mbox{SD}(X)\), \(\mbox{D}(X)\), or \(\sigma_X\). Because of the square root, the standard deviation, unlike the variance, has the same unit as the random variable. A standard deviation calculated from sample data is sometimes referred to as the sample standard deviation to distinguish it from the population standard deviation. The standard deviation of an estimator is referred to as its standard error.

Standard Error

The standard error is the standard deviation (i.e. square root of the variance) of an estimator (Rice, 1995, p. 192). For example, consider the situation in the explanation of variance below: we want to estimate the mean \(\overline{Y}\) of a population of size \(N\) with variance \(\sigma^2\). To do this, we select a simple random sample of \(n\) units \((y_1 ... y_n)\); \(\overline{y}\) is an estimate of \(\overline{Y}\) and the estimated variance of the sample mean is \(\hat{\mbox{V}}(\overline{y}) = s^2 \div n\). The standard deviation of \(\overline{y}\) is referred to as its standard error. Because the sample variance is an unbiased estimator of the population variance, we can estimate the standard error using the sample variance: \(\mbox{SE}(\overline{y}) = \sqrt{\hat{\mbox{V}}(\overline{y})} = s \div \sqrt{n}\). Note that because a standard error is a standard deviation of an estimator, all standard errors are also standard deviations but not all standard deviations are standard errors. This sometimes causes confusion even though the definitions of standard error and standard deviation are consistent. The confusion is exacerbated by the common use of the letter \(S\) to denote both standard errors and standard deviations.

Strata

Strata are “subpopulations that are non-overlapping, and together comprise the whole population” (Cochran, 1977, p. 89).

Stratified estimator

The stratified estimator is used in design-based inference to estimate the mean of a population and is often applied to a sample selected by stratified random sampling (Cochran, 1977, p. 91). The estimator is expressed as the sum of the means of the simple random samples within strata weighted by stratum weights calculated as relative proportions of the population within strata. When applied to a simple random or simple systematic sample from the entire population (i.e., without using the strata in the sampling design), the stratified estimator is called a post-stratified estimator.
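
A minimal Python sketch of the weighted sum described above is given below. The variance expression, \(\sum W_h^2 s_h^2 \div n_h\), is the standard textbook form without the finite population correction and is included here as an assumption; it is not stated in this entry. The function name, the two strata and the sample values are hypothetical.

```python
import statistics


def stratified_mean(samples, weights):
    """samples[h]: observations from the simple random sample in stratum h.
    weights[h]: proportion of the population that falls in stratum h."""
    # Stratified estimate: stratum sample means weighted by stratum weights
    estimate = sum(w * statistics.fmean(y) for w, y in zip(weights, samples))
    # Estimated variance, ignoring the finite population correction
    variance = sum(
        w ** 2 * statistics.variance(y) / len(y)
        for w, y in zip(weights, samples)
    )
    return estimate, variance


# Hypothetical example: 1 = forest, 0 = not forest at each sample unit
samples = [[1, 1, 1, 0], [0, 0, 0, 1, 0]]
weights = [0.7, 0.3]
estimate, variance = stratified_mean(samples, weights)
print(f"estimated proportion = {estimate:.3f}, SE = {variance ** 0.5:.3f}")
```

Multiplying the estimated proportion by the total area of the study region would give the corresponding area estimate.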

Stratified random sampling

When “simple random sampling is taken in each stratum” (Cochran 1977, p. 89).

Time series

“A time series is a sequence of observations taken sequentially in time. […] An intrinsic feature of a time series is that, typically, adjacent observations are dependent.” (Box et al. 1994).

Time series analysis

“Time series analysis is concerned with techniques for the analysis of this dependency [of adjacent observations]”. (Box et al. 1994).

Unbiased estimator

“An estimator \(\hat{\mu}\) of \(\mu\) is unbiased if the mean value of \(\hat{\mu}\), taken over all possible samples obtained using the [design], is equal to \(\mu\)” (Cochran, 1977, p. 11); or in other words, the estimator is characterized as unbiased if it produces an “estimate [that] is correct ‘on the average’” (Rice, 1995, p. 192) over all possible samples. See also bias.

Uncertainty

The opposite of precision.

User’s accuracy

From Stehman (1997, p. 79): “User’s accuracy for [category] i [is] the conditional probability that an area classified as category i by the map is classified as category i by the reference data”. When expressed in terms of area, user’s accuracy is the proportion of the area that has the predicted class of the category of interest that is correctly classified as determined by comparison to the reference classification. For a simple random sample of reference observations, each of which represents an equal area, user’s accuracy is estimated by dividing the number of correctly classified map units in each category by the total number of units classified into that category (Lillesand et al., 2008, p. 586). User’s accuracy is the complement of commission error (Olofsson et al., 2013).

Variance

The formal definition of the variance of a random variable \(X\) with expected value \(\mbox{E}(X)\) is \(\mbox{V}(X) = \mbox{E}[(X - \mbox{E}(X))^2]\), which provides “a measure of the degree of spread of a distribution around its mean” (Casella & Berger, 2002, p. 59). But this definition is not very relevant in the context of this document. Instead, we are concerned with situations where a sample has been selected from a population (e.g. all pixels of a country) with the objective of estimating a certain population parameter, \(\mu\) (e.g. the area of deforestation). For example, let’s say we have a population of \(N\) units \((y_1 ... y_N)\) with mean \(\overline{Y}\) and variance \(\sigma^2\) from which a sample of \(n\) units \((y_1 ... y_n)\) has been selected by simple random sampling. We have collected reference observations for the \(n\) sample units. The sample variance is \(s^2 = \sum{(y_i - \overline{y})^2} \div (n-1)\) (Cochran, 1977, p. 26). Again, this is not very helpful on its own, as we are usually not interested in the sample variance but in the variance of an estimator (e.g. of the area of deforestation). For simple random sampling designs, the sample mean \(\overline{y}\) is an unbiased estimator of the population mean \(\overline{Y}\), and the sample variance \(s^2\) is an unbiased estimator of the population variance \(\sigma^2\), such that \(\mbox{E}(\overline{y}) = \overline{Y}\) and \(\mbox{E}(s^2) = \sigma^2\) (Cochran, 1977, p. 22, 26).
The variance of \(\overline{y}\) is \(\mbox{V}(\overline{y}) = \sigma^2 \div n\) (assuming \(n\) is small relative to \(N\)); the estimated variance of \(\overline{y}\) substitutes the sample variance \(s^2\) for the population variance \(\sigma^2\), giving \(\hat{\mbox{V}}(\overline{y}) = s^2 \div n\). This is the variance that is of primary interest to us, and that allows us to calculate a standard error and a confidence interval of the estimated population parameter of interest.
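
The chain from sample variance to standard error can be followed in a short Python sketch. The ten sample values are hypothetical (1 = deforested, 0 = not deforested), the function name is invented for this example, and the sample is assumed to be small relative to the population so that no finite population correction is applied.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Return the sample variance s^2, the estimated variance of the
    sample mean s^2 / n, and the standard error s / sqrt(n)."""
    n = len(sample)
    s2 = statistics.variance(sample)  # uses n - 1 in the denominator
    v_hat = s2 / n
    return s2, v_hat, math.sqrt(v_hat)


# Hypothetical simple random sample of ten reference observations
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
s2, v_hat, se = standard_error_of_mean(sample)
print(f"s^2 = {s2:.4f}, V-hat = {v_hat:.4f}, SE = {se:.4f}")
```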

Z-score

A z-score (also referred to as standard score), \(z_{1- \alpha \div 2}\) where \(1-\alpha\) is the confidence level between 0 and 100%, is a constant such that the area under the standard normal density function in between \(\pm z_{1- \alpha \div 2}\) is \(1-\alpha\) (Rice, 1995, p. 202). For example, a confidence interval at the 95% confidence level for an estimate \(\hat{\mu}\) would be computed as \(\hat{\mu} \pm z_{1-0.025} \times \mbox{SE}(\hat{\mu})\). A condition for using \(z\) to compute a confidence interval is that the sampling distribution for \(\hat{\mu}\) is approximately a normal distribution, which we can assume for large sample sizes (Särndal et al., 1992).
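
For readers who want to reproduce the constant, the sketch below uses the standard normal distribution from Python's standard library to obtain \(z_{1- \alpha \div 2}\) and a confidence interval. The function names and the example estimate and standard error are hypothetical, and approximate normality of the sampling distribution is assumed.

```python
from statistics import NormalDist


def z_score(confidence_level):
    """Return z such that the area under the standard normal density
    between -z and +z equals the confidence level (e.g. 0.95)."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)


def confidence_interval(estimate, standard_error, confidence_level=0.95):
    """Interval of the form estimate +/- z * SE."""
    half_width = z_score(confidence_level) * standard_error
    return estimate - half_width, estimate + half_width


# Hypothetical estimate of an area proportion and its standard error
lower, upper = confidence_interval(0.066, 0.005)
print(f"z = {z_score(0.95):.3f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```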

References

Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time Series Analysis: Forecasting and Control (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury Press.

Cochran, W. G. (1977). Sampling Techniques. New York, NY: Wiley.

Dawid, A. P. (1983). Inference, Statistical: I. Encyclopedia of Statistical Sciences Vol. 4, S. Kotz, N. L. Johnson and C. B. Read (Eds.). New York, NY: Wiley.

Gut, A. (2009). An Intermediate Course in Probability. New York, NY: Springer.

IPCC. (2006). 2006 IPCC Guidelines for National Greenhouse Gas Inventories. H. S. Eggleston, L. Buendia, K. Miwa, T. Ngara, & K. Tanabe (Eds.). Japan: IGES.

Lillesand, T.M., Kiefer, R.W., Chipman, J.W. (2008). Remote sensing and image interpretation, 6th ed. New York, NY: Wiley.

Olofsson, P., Foody, G. M., Stehman, S. V, & Woodcock, C. E. (2013). Making better use of accuracy data in land change studies: Estimating accuracy and area and quantifying uncertainty using stratified estimation. Remote Sensing of Environment, 129, 122–131.

Rice, J. A. (1995). Mathematical statistics and data analysis (2nd ed.). Belmont, CA: Duxbury Press.

Särndal, C. E., Swensson, B. H., & Wretman, J. H. (1992). Model assisted survey sampling. New York, NY: Springer.

Stehman, S. V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62, 77–89.

Stehman, S. V, & Wickham, J. D. (2011). Pixels, blocks of pixels, and polygons: Choosing a spatial unit for thematic accuracy assessment. Remote Sensing of Environment, 115(12), 3044–3055.


Getting started

  1. A general in-depth guide to the Google Earth Engine API is available at https://developers.google.com/earth-engine/. The steps below are a short guide to accessing AREA 2 on Google Earth Engine.
  2. Navigate in a web browser to https://code.earthengine.google.com/?accept_repo=projects/AREA2/public
  3. Login with a Google account if you are not already logged in. You should see a screen similar to the figure below but with a blank script editor and console.
  4. To run a script, highlight it in Script Manager (A), which displays the code in the Code Editor (B), and click the Run button (located in the Code Editor).
  5. When running the scripts in AREA 2, a Dialog Pane will appear (E). The Dialog Pane is where you specify the information required for each step of the sampling design, response design, and analysis. Note that after you interact with the Dialog Pane (loading a map, for example), Earth Engine does not indicate whether the application is running. Therefore, push each button only once and wait for the application to respond before continuing.
  6. The Console (C) displays output specified by the script. If errors occur while running a script, the error messages are displayed here.
  7. The Map (D) is where spatial data is displayed.
_images/screen_shot.png

Simple Random Sampling

Selecting a sample by simple random sampling requires you to specify the study area and a sample size.

  1. Start by running the Sampling Design script in the Scripts pane.
  2. In the first entry box Specify an image ID and sampling scheme, specify a map that defines the study area. Because you are not using strata or clusters, the contents of the map do not matter.
  3. Under Specify band if a multi-band map (if not, specify 1): if you are using a map that contains multiple bands, specify which band you want to use to define the study area. Otherwise, just specify “1”.
  4. If the map has a no-data value or if you want to exclude a certain value in the map when defining the study area, specify the value in Specify no data value.
  5. Click Load image – the map should be displayed in the image pane.
  6. Under Select a sampling scheme, select Simple Random
  7. Under Determine sample size, you can either specify an arbitrary sample size or determine the sample size by specifying a target standard error of the anticipated overall accuracy of the map that is being assessed.
If specifying arbitrary sample size
  1. If choosing to specify an arbitrary sample size, simply add a number under Specify sample size.
  2. Click Create sample.
If specifying a target precision
  1. To determine the sample size required to meet a certain target standard error of the overall accuracy, first specify the anticipated overall accuracy of the map you are assessing under Specify anticipated overall accuracy (0-1)
  2. Then Specify target standard error of overall accuracy (0-1) and click Calculate sample size – the sample size is calculated using Equation 12 in [1] which is derived from Equation 4.2 in [2].
  3. Click Create sample.
  8. To view the sample in the Display pane, click Add to map.
  9. To export the sample, click Export sample and select the desired file format.
[1] Olofsson, P., Foody, G. M., Herold, M., Stehman, S. V., Woodcock, C. E., & Wulder, M. A. (2014). Good practices for estimating area and assessing accuracy of land change. Remote Sensing of Environment, 148, 42-57.
[2] Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons.
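
As a rough illustration of the target-precision option, the Python sketch below assumes that the calculation reduces to Cochran's formula with the target half-width written as \(z\) times the target standard error \(S\), so that \(z\) cancels and \(n = O(1-O) \div S^2\), where \(O\) is the anticipated overall accuracy. This is a reading of the cited equations, not the application's actual code; the function name and the example values are hypothetical.

```python
import math


def sample_size_simple_random(overall_accuracy, target_se):
    """Sample size needed so that the standard error of the estimated
    overall accuracy is roughly target_se under simple random sampling
    (assumes the sample is small relative to the population)."""
    n = overall_accuracy * (1 - overall_accuracy) / target_se ** 2
    # Round first to absorb floating-point noise, then round up
    return math.ceil(round(n, 6))


# Hypothetical inputs: anticipated overall accuracy 0.9, target SE 0.01
n = sample_size_simple_random(0.9, 0.01)
print(n)
```

Because the required sample size grows with the inverse square of the target standard error, halving the target roughly quadruples the number of sample units.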

Stratified Random Sampling

Selecting a sample by stratified random sampling requires you to specify a map to define strata, a total sample size, and an allocation of sample units to strata.

  1. Start by running the Sampling Design script in the Scripts pane.
  2. In the first entry box Specify an image ID and sampling scheme, specify a map that defines the study area and the strata. Each class in the map is treated as a stratum.
  3. Under Specify band if a multi-band map (if not, specify 1): if you are using a map that contains multiple bands, specify which band you want to use to define the strata. Otherwise, just specify “1”.
  4. If the map has a no-data value or if you want to exclude a certain value in the map when defining the study area, specify the value in Specify no data value.
  5. Click Load image – the map should be displayed in the image pane.
  6. Under Select a sampling scheme, select Stratified Random
  7. Under Determine sample size, you can either specify an arbitrary sample size or determine the sample size by specifying a target standard error of the anticipated overall accuracy of the map that is being assessed.
If specifying arbitrary sample size
  1. If choosing to specify an arbitrary sample size, simply specify the sample size in each stratum.
  2. Click Create sample.
If specifying a target precision of the overall accuracy
  1. Determining the sample size required to meet a certain target standard error of the overall accuracy requires you to specify the anticipated user’s accuracy of each of the map classes used as strata under Specify anticipated user’s accuracies (0-1).
  2. Then specify the target under Specify target standard error of overall accuracy (0-1) and click Calculate sample size. The sample size is calculated using Equation 13 in [1], which is derived from Equation 5.25 in [2].
  3. Click Create sample.
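
The calculation in the steps above can be sketched as follows, assuming the simplified form n = (Σ W_h SD_h / SE)² with SD_h = sqrt(U_h (1 − U_h)), where W_h are the strata weights and U_h the anticipated user's accuracies. The three-stratum weights and accuracies below are hypothetical, as is the function name; the application's own implementation may differ in detail.

```python
import math


def stratified_sample_size(weights, users_accuracies, target_se):
    """Sample size for a target standard error of overall accuracy.

    Assumption: n = (sum_h W_h * SD_h / SE)^2, where
    SD_h = sqrt(U_h * (1 - U_h)) and the finite population
    correction is ignored.
    """
    if len(weights) != len(users_accuracies):
        raise ValueError("one user's accuracy is needed per stratum")
    weighted_sd = sum(
        w * math.sqrt(u * (1 - u)) for w, u in zip(weights, users_accuracies)
    )
    return math.ceil((weighted_sd / target_se) ** 2)


# Hypothetical stratification with three map classes.
weights = [0.6, 0.3, 0.1]
users_accuracies = [0.9, 0.8, 0.7]
n = stratified_sample_size(weights, users_accuracies, target_se=0.01)
print(n)
```

Strata with anticipated user's accuracies close to 0.5 have the largest standard deviations and therefore contribute most to the required sample size, in proportion to their weights.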
If specifying a target precision of an area estimate
  1. To determine the sample size required to meet a certain target standard error of the area estimate of a certain class, you first need to specify which class to target. When the stratification is loaded in Step 5, the Console prints the strata weights. The stratification used in Examples: Stratified estimation has the following strata and weights (1 is Forest, 2 is Non-forest, 3 is Water, 4 is Forest loss, 5 is Forest gain, and 6 is Forest gain/loss):
>>> Area weights of strata:
    List (6 elements)
    1: 0.41211
    2: 0.49320
    3: 0.02195
    4: 0.06674
    5: 0.00365
    6: 0.00234
  1. Assume that we want to estimate the area of stratum 4 (Forest loss): select “4” under Select target class.
  2. The second step is to specify how much Forest loss, according to the reference data, is present in each stratum. The proportion of actual Forest loss present in the Forest loss stratum equals the user’s accuracy of the Forest loss map class. Specify the anticipated user’s accuracy of the Forest loss map class.
  3. Then specify the anticipated proportion of Forest loss present in the other strata. These proportions equal the anticipated omission of Forest loss in each of the map classes.
  4. Finally, specify the target standard error of the class of interest. In my case, the area of Forest loss was mapped at 0.066 of the total map area. While the true area of Forest loss is unknown, the mapped area is the best “guesstimate”. To achieve a 95% confidence interval of ± 0.01, we would need to specify a target standard error of 0.005 of the study area.
  5. Click Calculate sample size to apply Equation 13 in [1], but with the area of a map class substituted for the overall accuracy. The equation is derived from Equation 5.25 in [2].
  6. Allocate the sample to strata and click Create sample. Note that a proportional allocation, but with a sufficient sample size in smaller classes, is preferable when estimating the area of a class.
  1. To view the sample in the Display pane, click Add to map.
  2. To export the sample, click Export sample and open the Tasks tab next to the Console tab. Two tasks called “sample” appear; after clicking Run, the first saves the sample as a GEE asset and the second as a CSV file.
[1]Olofsson, P., Foody, G. M., Herold, M., Stehman, S. V., Woodcock, C. E., & Wulder, M. A. (2014). Good practices for estimating area and assessing accuracy of land change. Remote Sensing of Environment, 148, 42-57.
[2]Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons.
_images/_nolinesoftext.png

Simple Systematic Sampling ○

Coming soon

_images/_nolinesoftext.png

Stratified Systematic Sampling ○

Coming soon

_images/_nolinesoftext.png

Two-stage Sampling ✓

Selecting a sample by two-stage sampling requires you to specify a stratification of the study area, a shapefile that defines the population of Primary Sampling Units (PSUs), and a sample size of both Primary and Secondary Sampling Units (PSUs and SSUs). The sampling is limited to simple random selection of SSUs in PSUs, and to selection of PSUs in strata based on the proportion of a single class of interest.

  1. Start by running the Sampling Design script in the Scripts pane > click Two-stage sampling.
  2. In the first entry box Specify stratification of the study area, specify a map that defines the study area.
  3. Under Specify the band values in the stratification that correspond to the class of interest, list the band values in the stratification that correspond to the class of interest. For example, if forest is the class of interest, specify all band values that correspond to the forest strata, separated by commas.
  4. If the map has a no-data value or if you want to exclude a certain value in the map when defining the study area, specify the value in Specify no data value.
  5. Under Specify vector asset that defines PSU population, specify a GEE vector asset that outlines the population of PSUs. If a shapefile exists, import it into GEE from the Assets tab > New > Table upload.
  6. Click Load image – the map should be displayed in the image pane with all PSUs draped on top.
  7. Optional: click Display Class of Interest in PSUs to calculate and display the proportion of the class of interest in each PSU as a proportion between zero and one.
  8. To create the strata, you need to specify one or more thresholds that divide the population of PSUs based on the proportion of the class of interest. For example, a threshold of 0.3 would create one stratum corresponding to the PSUs that contain the top 30% of the class of interest, and one stratum corresponding to the remaining PSUs. Specifying three thresholds of “0.25, 0.50, 0.75” would create four strata: one that contains the top 25% of PSUs, one that contains the top 25-50% of PSUs, one that contains the top 50-75% of PSUs, and one with the remaining PSUs. To understand what “top x% PSUs” means in this case, think of a list of all the PSUs ranked by how much of the class of interest they contain, from 0 to 1. Specify the thresholds under Select proportion for dividing strata (0-1) and click Divide into strata.
  9. Specify the number of PSUs to be included in the sample (for example, 30) under Specify desired number of Primary Sampling Units, and click Select PSUs.
  10. Specify the number of SSUs to be selected in each PSU under Specify desired number of SSUs in each PSU. The total sample size will be the number of selected PSUs times the number of SSUs selected in each PSU.
  11. Optional: export the PSUs by clicking Export PSUs. Note that this is not necessary for collecting sample data.
  12. To export the total sample for interpretation, click Export sample and select the desired file format.
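
The threshold rule in the steps above can be read in more than one way. The sketch below illustrates one reading, assumed here for illustration only: PSUs are ranked from the highest to the lowest proportion of the class of interest, and each threshold is a cumulative fraction of that ranked list. It does not reproduce the application's code; the function names and the PSU proportions are hypothetical.

```python
def divide_into_strata(proportions, thresholds):
    """Assign each PSU a stratum index (0 = highest-ranked group).

    Assumption: thresholds are cumulative fractions of the list of PSUs
    ranked by their proportion of the class of interest, highest first.
    """
    n = len(proportions)
    # Indices of PSUs ordered from highest to lowest proportion.
    order = sorted(range(n), key=lambda i: proportions[i], reverse=True)
    cuts = sorted(thresholds)
    strata = [0] * n
    for rank, psu in enumerate(order):
        fraction = (rank + 1) / n  # share of PSUs ranked at or above this one
        # The stratum index is the number of thresholds already passed.
        strata[psu] = sum(fraction > cut for cut in cuts)
    return strata


def total_sample_size(n_psus, n_ssus_per_psu):
    """Total number of SSUs: selected PSUs times SSUs per PSU."""
    return n_psus * n_ssus_per_psu


# Hypothetical proportions of the class of interest in eight PSUs.
proportions = [0.9, 0.1, 0.5, 0.7, 0.3, 0.0, 0.2, 0.6]
print(divide_into_strata(proportions, [0.25, 0.50, 0.75]))
print(total_sample_size(30, 10))
```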

Time Series Viewer ○

Coming soon

_images/_nolinesoftext.png

Choosing an estimator ✓

We provide support for five different estimators. Consider the following when choosing between estimators:

Stratified estimator. The stratified estimator is recommended for estimating population parameters from sample data obtained by stratified random sampling. The estimator is expressed as the sum of the means of the simple random samples within strata weighted by stratum weights calculated as relative proportions of the population within strata.

Post-stratified estimator. We recommend the use of a post-stratified estimator when sample data have been collected by simple random or simple systematic designs, and when a stratification has been applied to the study area subsequent to the sampling. Consider the following example: if a systematic sample of field plots has been collected for estimation of forest area, post-stratifying the study area by a forest/non-forest map and using a post-stratified estimator is likely to increase the precision of area estimates compared to applying a simple random estimator (i.e. the mean value). Note that the post-stratified estimator is a stratified estimator applied to sample data collected under simple random or systematic designs.

Model-assisted regression estimator. The model-assisted regression estimator is, just as the stratified estimator, recommended for estimating population parameters from sample data collected by stratified random sampling. But while the stratified estimator has been shown to provide more precise estimates when the map and the reference observations represent classes of a categorical variable (such as deforestation, forest, non-forest), model-assisted regression is likely to provide more precise estimates when the sample data contain continuous reference observations such as proportion of forest or proportion of deforestation.

Model-assisted difference estimator. The difference estimator has been used to estimate net forest change when no reference observations or maps of change exist. For example, the difference estimator can be used for estimating the area of forest loss and gain if forest inventory data have been collected for the same area at times 1 and 2, and two forest/non-forest maps representing conditions at times 1 and 2 exist. The difference estimator requires that the sample data have been collected by simple random or systematic sampling, or by stratified random sampling if the sample is proportionally allocated to strata.

Ratio estimator. If sample data have been collected by stratified random sampling for estimation of area and map accuracy but the strata are different from the map classes, we recommend using a ratio estimator. For example, a sample that was collected by stratified random sampling using version 1 of a certain land cover map as the stratification can still be used to estimate the accuracy of an updated map of the same area, if using the ratio estimator.

More information about the use of various estimators for area estimation is provided in: Stehman, S. V. (2013). Estimating area from an accuracy assessment error matrix. Remote Sensing of Environment, 132, 202-211.

_images/_nolinesoftext.png

Stratified Estimator ○

To use the stratified estimator for estimation of map accuracy and the area of map categories, make sure that you have in place a sample and a map to define strata. The sample should have been collected by stratified random sampling and a reference observation should exist for each sample unit. If the sample data were collected by simple random or simple systematic sampling, use the post-stratified estimator.

  1. Start by opening the Google Earth Engine AA Estimation Toolbox.

users/bullockebu/gee-us-2018/stratification_cambodia

  1. Sampling
  2. Interpretation
  3. Analysis
_images/_nolinesoftext.png

Post-stratified Estimator ○

Coming soon

_images/_nolinesoftext.png

Model-assisted Regression Estimator ○

Coming soon

_images/_nolinesoftext.png

Ratio Estimator for when strata ≠ map classes ○

Coming soon

_images/_nolinesoftext.png

Ratio Estimator for two-stage designs ○

Coming soon

_images/_nolinesoftext.png

Stratified estimation of area and accuracy ✓

An example is provided here to illustrate the estimation of the area of forest loss for the country of Cambodia between 2000 and 2010 using stratified estimation. The example has three parts: 1) sampling design, in which a sample is selected by stratified random sampling; 2) response design, which involves observing the reference conditions at each sample unit in various satellite data available in Google Earth Engine; and 3) analysis, which is based on the application of a stratified estimator to the sample data.

Note that the sample data are made-up and do not reflect actual conditions in Cambodia!

1. Sampling Design

  1. Stratification. Because the area of forest loss is small relative to the total study area, using non-stratified designs would require a very large sample size to ensure a sufficient number of observations of forest loss in the sample data. To use a stratified design, we will first need to define a stratification. In this case, we will extract a map of Cambodia from a global map [1] of (1) forest, (2) non-forest, (3) water, (4) forest loss, (5) forest gain, and (6) forest loss and gain. The map will serve as the stratification of the study area. Preferably, use the GeoTIFF format. (A tutorial from BEEODA illustrating the creation of a local stratification from the global map is provided here.)
  2. You can download the stratification of Cambodia here. Once downloaded to your local computer, click the Assets tab next to the Scripts tab and NEW > Image upload > SELECT; click Advanced > Masking mode > and set No-data value to “0”. Click OK to upload the stratification as an image file. You can check the progress in the Tasks tab (the upload will take several minutes).
  3. Once the stratification has been added to Assets, highlight the script “1_Sampling_Design” in the Scripts tab and click Run. (Check Overview to familiarize yourself with the GEE interface if you have not done so already.)
  4. In the Sampling Design dialog, click Stratified or Simple Random Sampling > add the path to the stratification under Specify an image to define study area. The path is likely something like “users/[your name]/stratification_cambodia_utm_small”. Set the band to “1” and mask value to “0” and click Load image.
  5. Sampling scheme. Because the objective of the exercise is to estimate the area of forest loss, which is a small part of the study area, the sample will be selected by stratified random sampling: under Select a sampling scheme, select Stratified random.
  6. Sample size. Further, we will Determine sample size by setting a Target SE of a class, in this case, class number 4, Forest loss. The application will print in the Dialog and Console the area proportion of class 4:
>>> Area proportion of class 4:
    0.0667431639819303
  1. To determine the sample size, n, we’ll use Equation 5.25 in [2]:
\[n = \left(\frac{\sum_{h} W_h \mbox{SD}_h}{\mbox{SE}(\hat{y})}\right )^2\]

where \(W_h\) are the strata weights that are automatically extracted from the stratification; \(\mbox{SD}_h\) are the stratum standard deviations, calculated as \(\mbox{SD}_h= \sqrt{p_h (1-p_h)}\) where \(p_h\) is the anticipated area of Forest loss according to the reference data in stratum \(h\). For class 4, \(p_h\) simply becomes the anticipated user’s accuracy of the Forest loss class. Let’s assume a user’s accuracy of 0.6; specify “0.6” under Anticipated users accuracy (0-1) for class 4.

  1. For the other strata, \(p_h\) becomes the anticipated area of Forest loss according to the reference data in stratum \(h\). Just like the user’s accuracy of Forest loss, these numbers are unknown and we have to make a best guess. Let’s assume an area proportion of Forest loss of 0.01 in the regions classified as Forest, Non-forest and Water but zero in the other forest change strata: under Specify anticipated proportion of class 4 in other strata, add \(p_1 = p_2 = p_3 = 0.01\) and \(p_5 = p_6 = 0\).
  2. The denominator \(\mbox{SE}(\hat{y})\) is the target standard error of the area estimate of forest loss \(\hat{y}\) that we aim to achieve. The target standard error has a substantial impact on the sample size; trying to achieve a small error will result in a larger sample. While the area of forest loss is unknown, we know that the mapped area proportion is 0.066. A target standard error expressed as an area proportion of 0.005 is equivalent to a 95% confidence interval of \(\pm 1.96 \times 0.005 = 0.01\) or a margin of error of \(0.01 \div 0.066 = 15\%\). Specify 0.005 under Set target SE of the area of class 4; the Dialog should look like this:
_images/str_dialog.png
  1. Allocation. Click Calculate sample size; a sample of 625 units is recommended. The next step is to allocate the sample to strata. To estimate area (or any estimate across strata such as overall and producer’s accuracy as opposed to user’s accuracy), a proportional allocation of the sample to strata is preferable [3]. The problem with proportional allocation is that the sample size will be very small in the smaller strata. A proportional allocation would yield 258, 308, 14, 42, 2, 1 sample units respectively. If we bump the sample size in the very small strata to 30, increase the forest loss stratum from 42 to 50, and set the sample size to 250 and 300 in the forest and non-forest strata, we get the following allocation. Specify the following under Allocate sample to strata and click Create sample:
  1. Forest: 250
  2. Non-forest: 300
  3. Water: 30
  4. Forest loss: 50
  5. Forest gain: 30
  6. Forest gain/loss: 30
  1. Export sample. Clicking Add to map will display the sample units as red dots in the Map display. The final step is to export the sample: click the Tasks tab, and then Export sample in the Dialog. Three entries named “sample” will appear in Tasks. These are identical, but one exports the sample as a CSV file, one saves it as a GEE Asset for use in Google Earth Engine, and one exports it as a KML file to allow sample locations to be viewed in Google Earth. Click the Run button next to each “sample” entry and save as GEE Asset, CSV and KML (the latter two should be saved to your Google Drive). Use the name “STR_sample_Cambodia” for the GEE Asset, the KML and the CSV file.
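
The sample size and allocation arithmetic in the steps above can be checked with a short script. This is a sketch that applies the sample size equation from this section to the strata weights and anticipated proportions quoted in the text; the function names are hypothetical, and rounding the sample size up and the allocation to the nearest integer are assumptions rather than documented behavior of the application.

```python
import math

# Strata weights printed for the example stratification
# (1 Forest, 2 Non-forest, 3 Water, 4 Forest loss, 5 Forest gain,
# 6 Forest gain/loss).
weights = [0.41211, 0.49320, 0.02195, 0.06674, 0.00365, 0.00234]

# Anticipated proportion of Forest loss (reference) in each stratum:
# 0.01 in strata 1-3, the user's accuracy 0.6 in stratum 4, zero in 5-6.
p = [0.01, 0.01, 0.01, 0.6, 0.0, 0.0]

target_se = 0.005  # target standard error as a proportion of the study area


def sample_size(weights, p, target_se):
    """n = (sum_h W_h * SD_h / SE)^2 with SD_h = sqrt(p_h * (1 - p_h))."""
    weighted_sd = sum(w * math.sqrt(ph * (1 - ph)) for w, ph in zip(weights, p))
    # Assumption: round up to the next whole sample unit.
    return math.ceil((weighted_sd / target_se) ** 2)


def proportional_allocation(n, weights):
    """Allocate n units to strata in proportion to the strata weights."""
    return [round(n * w) for w in weights]


n = sample_size(weights, p, target_se)
print(n)                                    # 625
print(proportional_allocation(n, weights))  # [258, 308, 14, 42, 2, 1]

# Adjusted allocation used in the tutorial.
adjusted = [250, 300, 30, 50, 30, 30]
print(sum(adjusted))                        # 690
```

Note that the adjusted allocation totals 690 units, slightly more than the 625 recommended, because the sample size in the small strata was increased.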

2. Response Design

We now need to provide reference observations for each unit in the sample that we designed in the previous step. Reference data are required for observing reference conditions. A powerful reference dataset is the combination of high resolution imagery and time series. Different applications have been developed that allow you to display such data at sample locations. In this tutorial we will use an Earth Engine application called Time Series Viewer. (Other applications include TimeSync and Collect Earth Online.)

  1. Run the script 2_0_Time_Series_Viewer in the Code Editor. A dialog appears that asks you to specify the sample data and the characteristics of the reference data.
  2. First, under Specify sample data as a GEE asset, specify the GEE Asset that contains the sample, which should be users/[your name]/STR_sample_Cambodia; you created this GEE Asset when exporting the sample at the end of the Sampling Design part.
  3. Second, under Specify start year of reference data and …end year…, specify the start and end of the time series plots as YYYY-01-01. In this case, we are assessing conditions from 2000 to 2010, so a start of 1998-01-01 and an end of 2012-01-01 might be sufficient. Note that longer time series will take longer to plot.
  4. Next, specify under Select variables to plot as time series which Landsat bands or indices to display as time series at each sample location. We recommend SWIR1, EVI, NBR and Wetness.
  5. In the Assets tab, click the GEE Asset Table that contains the sample (i.e. the GEE Asset you created when exporting the sample). A dialog box should pop up with the header “Table: STR_sample_Cambodia”; click Import.
  6. Click View to start interpreting the sample units.
  7. Click either Next or add “1” as Search ID and hit Enter – the first unit in the sample will appear as a red square in the Map pane and plots of time series of surface reflectance and spectral transforms based on Landsat data will appear in the Dialog (as shown in Figure 1 below).
  8. Clicking a point in the time series will display the associated Landsat image in the Map pane; use these data to collect reference observations on the land surface. In this example, the response legend corresponds to the map legend, i.e. each sample unit will be labeled as one of forest, non-forest, water, forest loss, forest gain, or forest gain/loss (the latter being pixels that experienced forest loss but recovered during the study period 2000-2010).
_images/response_design.png

Figure 1. Screen shot of using Time Series Viewer to collect reference observations for sample units.

  1. To record the reference labels, open in Google Sheets the CSV file you created when exporting the sample at the end of the Sampling Design part. Delete columns “classification” and “.geo”, and add columns “reference”, “reference_name”, “date_event”, “confidence” to the right of column “.geo”.

  2. In column “reference”, record the grid code of the reference label (1: forest, 2: non-forest, 3: water, 4: forest loss, 5: forest gain, and 6: forest loss and gain); in column “reference_name”, write the name of the reference label; in “date_event”, note the year of the loss or gain events; in column “confidence”, indicate how confident you are that the label is correct: specify 1 if the label is uncertain, 2 if probable but not certain, and 3 if certain. For example, I am certain that sample unit #7 in Figure 1 above experienced a forest loss event in 2010. I would then add:

    reference: 4
    reference_name: Forest loss
    date_event: 2010
    confidence: 3
    
  3. Because this is just a tutorial, we won’t go through all sample units. If this were a real-life situation, though, you would save the CSV after completion as “STR_sample_Cambodia_interpreted.csv” and open it in software that allows you to export the CSV as a shapefile. We recommend using QGIS, which executes the GDAL program ogr2ogr: in QGIS, click Layer > Add Layer > Add Delimited Text Layer; once the layer is displayed, right-click the CSV file in the QGIS layer pane and click Export > Save Feature As; specify ESRI Shapefile as Format and click OK.

  1. We have prepared a shapefile that contains reference labels for each sample unit. Download the following individual files of the shapefile: .prj .shp .shx and .dbf . In Google Earth Engine, click Assets tab > New > Table Upload; navigate to the downloaded files and upload the individual .prj .shp .shx and .dbf files of the shapefile. Ingesting the shapefile will take about five minutes. The shapefile that contains the sample data should appear in the Assets tab with the name “STR_sample_Cambodia_interpreted”. You are now ready to analyze the sample data.

3. Analysis

  1. As explained in Choosing an estimator, because the sample was selected under stratified random sampling using a categorical change map to define strata, the stratified estimator is efficient for estimating area.
  2. Run the script 3_0_stratified_estimation. Because we need the strata weights and because we can easily estimate the accuracy of the map using the sample data, under Specify the map used to stratify, specify “stratification_cambodia_utm_small”; specify “reference” under Specify the reference attribute name; and “STR_sample_Cambodia_interpreted” under Specify the reference feature collection – click Load data; and then “Apply stratified estimator”.
  3. Click Show error matrices to display a cross tabulation of map and reference labels at sample locations. The first matrix shows the number of sample units, \(n_{hj}\) identified as class \(j\) in the reference data in stratum \(h\). The second matrix shows the estimated area proportion of class \(j\) in stratum \(h\) as \(\hat{p}_{hj} = W_h \times n_{hj} \div n_{h+}\).
  4. The stratification we used doesn’t have a buffer stratum so we won’t click Use buffer stratum. (If the forest loss stratum had been small and the forest stratum large, it would have been a good idea to create and use a buffer around mapped forest loss.)
  5. In the Dialog, specify class number “4” (forest loss) under Select map class for which to estimate area and accuracy. This will print the area and accuracy estimates with 95% confidence intervals. Change class to view estimates for the other classes. For Forest loss (class 4), the area ± a 95% confidence interval should be 0.0517 ± 0.0120 expressed as proportion of the total study area or 944,987 ± 219,536 ha. User’s accuracy is 0.62, producer’s 0.80 and overall accuracy 0.92.
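
The estimates reported in the steps above follow from the error matrix of sample counts and the strata weights. The sketch below shows how such estimates could be computed, using the expression for the estimated area proportions given above and, as an assumption, the standard stratified variance estimator with a normal-approximation 95% confidence interval. The three-class error matrix and weights are made up for illustration and are unrelated to the Cambodia sample; the function name is hypothetical.

```python
import math


def stratified_estimates(weights, counts, target):
    """Stratified estimates for class `target` when strata equal map classes.

    weights[h]   : stratum weight W_h
    counts[h][j] : number of sample units in stratum h with reference class j
    """
    k = len(weights)
    n_h = [sum(row) for row in counts]
    # Estimated area proportions: p_hj = W_h * n_hj / n_h+
    p = [[weights[h] * counts[h][j] / n_h[h] for j in range(k)] for h in range(k)]
    area = sum(p[h][target] for h in range(k))
    # Assumed variance estimator: sum_h W_h^2 * q_h * (1 - q_h) / (n_h - 1),
    # where q_h is the sample proportion of the target class in stratum h.
    variance = 0.0
    for h in range(k):
        q = counts[h][target] / n_h[h]
        variance += weights[h] ** 2 * q * (1 - q) / (n_h[h] - 1)
    se = math.sqrt(variance)
    return {
        "area": area,
        "se": se,
        "ci95": 1.96 * se,
        "users": p[target][target] / sum(p[target]),
        "producers": p[target][target] / area,
        "overall": sum(p[h][h] for h in range(k)),
    }


# Hypothetical strata weights and error matrix (rows: strata, columns: reference).
weights = [0.5, 0.4, 0.1]
counts = [
    [90, 8, 2],
    [5, 70, 5],
    [2, 3, 15],
]
estimates = stratified_estimates(weights, counts, target=2)
for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```

Multiplying the area proportion and its confidence interval by the total study area converts the estimates to hectares.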
[1]Hansen, M. C., Potapov, P. V., Moore, R., Hancher, M., Turubanova, S. A. A., Tyukavina, A., … & Kommareddy, A. (2013). High-resolution global maps of 21st-century forest cover change. Science, 342(6160), 850-853.
[2]Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons.
[3]Olofsson, P., Foody, G. M., Herold, M., Stehman, S. V., Woodcock, C. E., & Wulder, M. A. (2014). Good practices for estimating area and assessing accuracy of land change. Remote Sensing of Environment, 148, 42-57.
_images/_nolinesoftext.png

Two-stage sampling and estimation of area ○

Coming soon

_images/_nolinesoftext.png

Ratio estimation of area and accuracy when strata ≠ map classes ○

The following tutorial replicates the workflow in [1].

[1]Arévalo, P., Olofsson, P., & Woodcock, C. E. (2019). Continuous monitoring of land change activities and post-disturbance dynamics from Landsat time series: A test methodology for REDD+ reporting. Remote Sensing of Environment. In press. 10.1016/j.rse.2019.01.013

Use Collect Earth Online with AREA2 ○

Coming soon