How should an auditor determine the precision required in establishing a statistical sampling plan?

4.1 CENSUS COSTS AND OBJECTIVES OF SAMPLING
4.2 ACCURACY AND PRECISION IN SAMPLING
4.3 ACCURACY AS A FUNCTION OF SAMPLE SIZE
4.4 A PRIORI ACCURACY INDICATORS
4.5 SAFE SAMPLE SIZE FOR LANDINGS AND EFFORT
4.6 VARIABILITY INDICATORS
4.7 STRATIFICATION AND ITS IMPACT ON SURVEY COST
4.8 THE PROBLEM OF BIASED ESTIMATES
4.9 NEED FOR REPRESENTATIVE SAMPLES
4.10 THE “BOAT” AND “GEAR” APPROACHES

The choice to undertake sample-based surveys rests primarily on the recognition that complete enumeration through census-based surveys imposes huge costs that are both unsustainable and unnecessary, provided that the nature and methods of statistical sampling are properly considered. Such considerations include an understanding of:

  • the reasons for and objectives of sampling.
  • the relationship between accuracy and precision.
  • the reliability of estimates with varying sample size.
  • the determination of safe sample sizes for surveys.
  • the variability of data.
  • the nature of stratification and its impact on survey cost.
  • the risks posed by biased estimates.
  • the differences between “boat” and “gear” statistical approaches.

Census-based techniques are generally impractical in small-scale fisheries due to the large number of fishing operations that would have to be monitored over a reference period. The following example outlines the logistics problems and costs involved in census-based surveys.

4.1 CENSUS COSTS AND OBJECTIVES OF SAMPLING

Assume a fishery of moderate size comprising 1,000 fishing canoes, each fishing 24 times during a month on a one-day-per-trip basis. This would mean that:

1] There would be about 24,000 landings during the month and all landings would have to be recorded, each with its complete set of basic fishery data [species composition, weight, etc]. [Note that there would be no need for a separate survey for fishing effort, since all trips would be recorded.]

2] Assuming that a single recording of a landing would take a minimum of ten minutes [experience shows that this is the case in many data collection systems], a minimum of 240,000 minutes [4,000 work hours] will be needed.

3] If a data collector works 8 hours per day for 25 days in a month, then collection of data would require 4,000 / [8 × 25] = 20 data collectors just to monitor this relatively small fishery. This assumes that such a level of data collection is feasible and that landings, and hence fisher availability, are spread evenly over the day.

4] In addition to the costs of data collectors there would also be the costs of a] supervision, b] data editing, checking and inputting for 24,000 landings per month, and c] computer data storage for 12 x 24,000 = 288,000 landings per year.

On the other hand a well-defined sampling scheme would most likely need only one or two recorders for data collection and only a fraction of the computer storage and processing resources, due to the much lower volume of incoming data.
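The workload arithmetic in steps 1] to 4] above can be reproduced with a short script [the figures are those assumed in the example]:

```python
boats = 1000             # canoes in the fishery
trips_per_month = 24     # one-day trips per canoe per month
minutes_per_record = 10  # minimum time to record one landing

landings = boats * trips_per_month               # landings per month
work_hours = landings * minutes_per_record / 60  # total recording time, hours
hours_per_collector = 8 * 25                     # 8 h/day, 25 days/month
collectors = work_hours / hours_per_collector    # data collectors needed

print(landings, work_hours, collectors)  # 24000 4000.0 20.0
```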

Thus there are three objectives of a sampling programme:

  • to examine representative sub-sets of the data with the purpose of producing estimates of parameters, such as CPUE, prices, etc., that are as close as possible to the “true” values that would be obtained through complete enumeration.
  • to reduce operational costs.
  • to reduce analytical and computing requirements.

4.2 ACCURACY AND PRECISION IN SAMPLING

In sampling procedures accuracy and precision are two different statistical indicators and it is perhaps worth clarifying their meaning at this point, as frequent reference will be made to these two terms in the coming sections.

4.2.1 Sampling Accuracy

  • Sampling accuracy is usually expressed as a relative index in percentage form [i.e. between 0 and 100%] and indicates the closeness of a sample-based parameter estimator to the true data population value.
  • When expressed as a relative index, sampling accuracy is independent of the variability of the data population, i.e. data population parameters of high variability can still be estimated with good accuracy.
  • When sample size increases and samples are representative, sampling accuracy also increases. Its rate of growth, very sharp in the region of small samples, becomes slower beyond a certain sample size.

4.2.2 Sampling Precision

Sampling precision is related to the variability of the samples used. It is measured, in an inverse sense, by the coefficient of variation [CV], a relative index of variability based on the sample variance and the sample mean.

The CV index also determines the confidence limits of the estimates, that is the range of values that are expected to contain the true data population values at a given probability.

Estimates can be of high precision [that is with narrow confidence limits], but of low accuracy. This occurs when samples are not representative and the resulting estimates are lower or higher than the true data population value.

When sample size increases precision also increases as a result of decreasing variability. Its growth, very sharp in the region of small samples, becomes slower and steadier beyond a certain sample size.
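As a minimal sketch of these two indicators [the landing weights below are hypothetical], the CV and approximate 95% confidence limits of a sample mean can be computed as follows:

```python
import math
import statistics

# Hypothetical sample of landed weights [kg] from one stratum and month
landings = [52, 61, 48, 55, 67, 59, 50, 63, 58, 54]

mean = statistics.mean(landings)
sd = statistics.stdev(landings)   # sample standard deviation
cv = sd / mean * 100              # coefficient of variation, in %

# Approximate 95% confidence limits for the mean [normal approximation]
half_width = 1.96 * sd / math.sqrt(len(landings))
print(f"mean={mean:.1f}  CV={cv:.1f}%  "
      f"95% limits=({mean - half_width:.1f}, {mean + half_width:.1f})")
```

A CV of around 10%, as here, falls below the 15% threshold that Section 4.6 describes as acceptable variability.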

The figure below illustrates the meaning of accuracy and precision. They are both important statistical indicators and regularly used for assessing the effectiveness of sampling operations. Their correct interpretation can greatly assist in identifying problem areas and applying appropriate corrective actions as necessary.

4.3 ACCURACY AS A FUNCTION OF SAMPLE SIZE

The following diagram illustrates the pattern of accuracy growth when sample size increases [see also table 4.5].

Note that:

  • Accuracy is 100% when the entire population has been examined [as in the case of a census].
  • The pattern of accuracy growth is not linear. The accuracy of a sample equal to half the data population size is not 50% but very near to 100%.
  • Good accuracy levels can be achieved at relatively small sample sizes, provided that the samples are representative.
  • The result of this relationship is that beyond a certain sample size the gains in accuracy are negligible, while sampling costs increase significantly.

4.4 A PRIORI ACCURACY INDICATORS

A frequent concern of fishery administrations is the limited budgetary and human resources available for data collection. Such constraints have direct impacts on the frequency and extent of field operations and demand the development of cost-effective sampling schemes. Accuracy indicators should therefore be established during survey design, so that sample sizes can be set to guarantee an acceptable level of reliability for the estimated data population parameters. This is at times difficult, since at the outset little may be known about the distribution and variability of the target data populations. Until some guiding statistical indicators become available, statistical developers will tend to require large samples, which increase the size and complexity of field operations and data management procedures.

Formulation of a priori indicators for sampling accuracy during the design phase is feasible and may be achieved by:

  • Making informed assumptions about the general shape of the distribution of the target data populations.
  • Setting up accuracy indicators that are a function of the data population size only.

4.4.1 Target data populations

In the estimation of total catch and fishing effort [Sections 2 and 3], the two target data populations in sample-based catch/effort surveys are:

  • The set of landings by all boats over a month from which an overall CPUE can be estimated.
  • The set of 0-1 values [equivalent to “boat not fishing”, “boat fishing”] describing the fishing activity status of all boats over a month.

The target data population of fishing activity is used to formulate the Boat Activity Coefficient [BAC], the probability that any one boat will be fishing on any one day. The BAC is then combined with the number of boats from a frame survey and a time raising factor to produce an estimate of fishing effort.

The above two data populations have different sampling requirements for achieving the same level of accuracy. The next paragraph provides more detail on how sample size is determined in each case and in accordance with the level of accuracy desired.
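The way the BAC feeds into the effort estimate can be sketched as follows [function and variable names are illustrative, not from the text]:

```python
def estimated_effort(bac, frame_boats, days_in_month):
    """Fishing effort [boat-days] = BAC x number of boats x time raising factor."""
    return bac * frame_boats * days_in_month

# Example: 60% of boats active on an average day, 1,000 boats [frame survey],
# 30-day reference month
print(estimated_effort(0.6, 1000, 30))  # 18000.0 boat-days
```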

4.5 SAFE SAMPLE SIZE FOR LANDINGS AND EFFORT

The desired accuracy level for a sampling and estimation process depends on the subsequent use of the statistics and the amount of error that users are willing to tolerate. In general, experience indicates that the accuracy of basic fishery estimates should be in the range 90% - 95%.

The table below illustrates safe sample sizes required for achieving a given accuracy level for two target data populations, boat activities and landings.

Accuracy [%]    Sample size for boat activities    Sample size for landing surveys
                [boats sampled]                     [landings sampled]

    90                     96                                 32
    91                    119                                 40
    92                    150                                 50
    93                    196                                 65
    94                    267                                 89
    95                    384                                128
    96                    600                                200
    97                  1,067                                356
    98                  2,401                                800
    99                  9,602                              3,201

From the table above the following conclusions can be made:
  • Sample requirements for boat activities are about three times higher than those for landings.
  • For a general sampling survey accuracy level of 90%, 32 landing records and 96 boat activity records are required.

The above sampling requirements refer only to a given estimating context, that is a geographical stratum, a reference period [i.e. a calendar month], and a specific boat/gear category. The process of determining safe sample size at a given level of accuracy must be repeated for all estimating contexts with a view to determining overall sampling requirements.
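The boat-activity column of the table in Section 4.5 closely matches the standard sample size formula for estimating a proportion, n = z²p(1−p)/e², with z = 1.96 and p = 0.5, and with the landings column equal to one third of the boat-activity column [this reconstruction is an assumption; the text does not state how the table was derived]:

```python
def safe_sample_size(accuracy_pct, z=1.96, p=0.5):
    """Sample size for estimating a proportion [e.g. boat activity: fishing /
    not fishing] with absolute error e = 1 - accuracy, at ~95% confidence."""
    e = 1 - accuracy_pct / 100
    return round(z * z * p * (1 - p) / (e * e))

for a in (90, 93, 95, 98):
    n_boats = safe_sample_size(a)       # boat activity sample size
    n_landings = round(n_boats / 3)     # landings sample size [one third]
    print(a, n_boats, n_landings)
```

For 99% accuracy the formula gives 9,604 rather than the table's 9,602; the small discrepancy is presumably rounding in the original.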

4.6 VARIABILITY INDICATORS

As already mentioned earlier, the second important statistical indicator is related to precision or, in reverse terms, to variability. The Coefficient of Variation [CV] is the most commonly used relative index of variability, usually expressed in percentage [i.e. 10%, 15%, etc]. Experience indicates that CVs below 15% are indicators of acceptable variability in data samples. When very low variability [e.g. 0.1%, 0.5%] is repeatedly reported these results may be suspicious. Although this may indicate a very homogeneous data population, it may also be an indication of biased samples.

There are standard methods for partitioning the overall variability into its components in space and time. This is useful when it is feasible to increase sampling operations with a view to decreasing the variability of estimates. In such cases the availability of separate variability indicators in space and time would direct sampling operations to collect data from more locations or on more days. Reducing variability in estimates can also be addressed through the stratification of sampling [see below and Section 5].

4.7 STRATIFICATION AND ITS IMPACT ON SURVEY COST

4.7.1 Definition

Stratification is the process of partitioning a target data population [e.g. all fishing vessels] into a number of more homogeneous sub-sets based on their characteristics [e.g. trawl, gillnet, purse seine; or large, medium, small; or commercial, artisanal, subsistence]. Stratification is normally undertaken for the following reasons:

  • For statistical purposes [e.g. to show the difference in catch by vessel type] and when there is a need to reduce the overall variability of the estimates. For example, catch rates will differ greatly between vessels of a similar type but of a different size; sampling each size class separately therefore enables the preparation of meaningful statistics. If all vessel size classes are “lumped” together - i.e. not stratified - then the average catch is not meaningful for any one size class.
  • For non-statistical purposes [e.g. different geographic regions] and when current estimates are not meaningful to users of the statistics unless estimates are shown separately.
  • At times stratification is “forced” due to administrative needs such as limits to data collection and reporting.

4.7.2 Impact on costs

The implementation of sampling stratification can be an expensive exercise and should always be applied with caution, because every new stratum must be covered by the sampling programme. Introducing a large number of strata may have serious cost implications: if data collection effort is kept at the original level, the overall accuracy of the estimates will not increase, even though the results within strata will be more homogeneous than in the original data population. In general, more strata mean greater sampling costs, although they can yield better value [= statistical accuracy] for money.

To fully benefit from a stratified population, safe sample sizes must be determined for each new stratum. In very large populations this would mean that a new sampling scheme with three strata would need three times more samples for achieving the desired accuracy, hence greater costs.
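The cost implication can be put in numbers: with a safe size of 384 samples per stratum [the 95% accuracy figure from the table in Section 4.5], three strata triple the sampling requirement. A minimal sketch, assuming a uniform cost per sample:

```python
def stratified_sampling_cost(n_strata, safe_n_per_stratum, cost_per_sample):
    """Total cost when each stratum must independently reach its safe sample size."""
    return n_strata * safe_n_per_stratum * cost_per_sample

# Unstratified vs. three strata, at an assumed unit cost of 1.0 per sample
print(stratified_sampling_cost(1, 384, 1.0))  # 384.0
print(stratified_sampling_cost(3, 384, 1.0))  # 1152.0
```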

4.8 THE PROBLEM OF BIASED ESTIMATES

4.8.1 An illustrated example

The figure above illustrates in basic terms the problem of bias. Biased estimates fall systematically above or below the true [but unknown] population value [here all estimates are shown higher than the true value]. Bias is independent of the precision [= variability] of the estimates. In this example accuracy is poor but precision is misleadingly good, as indicated by the narrow confidence limits.

4.8.2 Bias as a major risk in sampling programmes

Biased estimates are systematically lower or higher than the true population value, generally because they are derived from samples that are not representative of the data population. Bias is not easily detectable and at times not detectable at all. Consequently users may be unaware of the problem since they also do not know the true population value.
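A small simulation [all numbers hypothetical] illustrates how a non-representative sample yields an estimate that is precise yet biased, e.g. when a recorder only visits the landing site used by larger boats:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 800 small boats [mean catch 40 kg] and
# 200 large boats [mean catch 100 kg]
population = ([random.gauss(40, 8) for _ in range(800)] +
              [random.gauss(100, 8) for _ in range(200)])
true_mean = statistics.mean(population)   # roughly 52 kg

# Biased sampling: only the large boats' landing site is ever visited
biased_sample = random.sample(population[800:], 50)
estimate = statistics.mean(biased_sample)
cv = statistics.stdev(biased_sample) / estimate * 100

print(f"true mean {true_mean:.1f} kg, biased estimate {estimate:.1f} kg, CV {cv:.1f}%")
```

The CV is comfortably small [good "precision"], yet the estimate sits far above the true mean: precision alone says nothing about bias.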

Precision [or the relative variability indicator CV] cannot be used to detect bias. However, repeated cases of extremely small variability [e.g. CV values as low as 0.1% or 0.5%, as noted in Section 4.6] should raise the suspicion that the samples may be biased rather than merely homogeneous.
