ERRORS IN STATISTICAL DATA
Introduction
The accuracy of a survey estimate refers to the closeness of the estimate to the true population value. Where the survey estimate and the true population value differ, the difference between the two is referred to as the error of the survey estimate. The total error of the survey estimate results from two types of error: sampling error and non-sampling error.
Sampling Error
Sampling error reflects the difference between an estimate derived from a sample survey and the "true value" that would be obtained if the whole survey population were enumerated. It could be calculated exactly from the population values, but as these are unknown (otherwise there would be no need for a survey), it is estimated from the sample data instead. It is important to consider sampling error when publishing survey results, as it gives an indication of the accuracy of the estimate and therefore of how much weight can be placed on interpretations. If sampling principles are applied carefully within the constraints of available resources, sampling error can be accurately measured and kept to a minimum.
Factors Affecting Sampling Error
Sampling error is affected by a number of factors, including sample size, sample design, the sampling fraction and the variability within the population. In general, larger sample sizes decrease the sampling error, but the decrease is not directly proportional. As a rough rule of thumb, you need to increase the sample size fourfold to halve the sampling error. Of much lesser influence is the sampling fraction (the proportion of the population included in the sample), although the sampling error should decrease as the sample grows as a fraction of the population.
The population variability also affects the sampling error. More variable populations give rise to larger errors, as the estimates calculated from different samples are likely to vary more. The effect of the variability within the population can be reduced by increasing the sample size to make the sample more representative of the survey population.
Various sample design options also affect the size of the sampling error. For example, stratification reduces sampling error whereas cluster sampling tends to increase it (these designs are discussed in Sample Design).
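The fourfold rule of thumb can be checked with a short sketch. It assumes simple random sampling and uses the textbook formula for the standard error of a sample mean (the usual measure of sampling error), S / sqrt(n), optionally multiplied by the finite population correction sqrt(1 - n/N). The function name and the figures used (a population standard deviation of 20, a population of 100,000) are illustrative assumptions, not values from any survey.

```python
import math


def se_of_mean(s, n, population_size=None):
    """Standard error of a sample mean under simple random sampling.

    s: standard deviation of the variable in the population (assumed known here)
    n: sample size
    population_size: if given, apply the finite population correction
    """
    se = s / math.sqrt(n)
    if population_size is not None:
        # The sampling fraction n/N only matters when it is not tiny.
        se *= math.sqrt(1 - n / population_size)
    return se


# Quadrupling the sample size halves the standard error.
print(se_of_mean(20, 100))  # 2.0
print(se_of_mean(20, 400))  # 1.0

# A more variable population gives a larger error for the same sample size.
print(se_of_mean(40, 100))  # 4.0

# Sampling fraction: a sample of 400 from 100,000 barely changes the result.
print(se_of_mean(20, 400, population_size=100_000))
```

Under these assumptions the sample size and the population variability dominate, while the sampling fraction has a noticeable effect only when the sample is a sizeable share of the population.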
Standard Error
The most commonly used measure of sampling error is called the standard error (SE). The standard error is a measure of the spread of the estimates that different possible samples would produce around the "true value". In practice only one sample, and so only one estimate, is available, so the standard error cannot be calculated directly. However, if the population variance is known, the standard error can be derived mathematically. Even if the population variance is unknown, as happens in practice, the standard error can be estimated by using the variance of the sample units. Any estimate derived from a probability-based sample survey has a standard error associated with it (called the standard error of the estimate, written se(y), where y is the estimate of the variable of interest). Note that:
For more information on how to calculate estimates and their standard errors, please refer to Analysis.
Variance
The variance is another measure of sampling error, which is simply the square of the standard error:
Var(y) = se(y)^2
Relative Standard Error
Another way of measuring sampling error is the relative standard error (RSE), where the standard error is expressed as a percentage of the estimate. Because it expresses the magnitude of the standard error relative to the estimate, the RSE can be interpreted without referring to the estimate itself and is useful when comparing the variability of population estimates with different means. The relative standard error is calculated as follows (where y is the estimate of the variable of interest):
RSE(y) = 100 * se(y) / y
Confidence Interval
There is approximately a 95% chance that the interval extending two standard errors on either side of the estimate contains the "true value". This interval is called the 95% confidence interval and is the most commonly used confidence interval. It is written as follows:
95% CI(y) = [y - 2*se(y), y + 2*se(y)]
This is expressed: "We are 95% confident that the true value of the variable of interest lies within the interval [y - 2*se(y), y + 2*se(y)]".
Other confidence intervals are the 68% confidence interval (which extends one standard error on either side of the estimate and has a 68% chance of containing the "true value") and the 99% confidence interval (which extends three standard errors on either side of the estimate and has a better than 99% chance of containing the "true value"). These percentages are approximate and assume the estimate is roughly normally distributed, as is usual for large samples.
For example, suppose a survey estimate is 50 with a standard error of 10. The confidence interval 40 to 60 has a 68% chance of containing the "true value", the interval 30 to 70 has a 95% chance of containing the "true value" and the interval 20 to 80 has a 99% chance of containing the "true value".
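The measures above can be sketched in a few lines. The function names are hypothetical. The standard error estimate assumes a simple random sample (sample standard deviation divided by the square root of the sample size, with no finite population correction). The multipliers 1, 2 and 3 are the rounded values used in the text.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the sample mean (simple random sample assumed)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def relative_standard_error(estimate, se):
    """Standard error expressed as a percentage of the estimate."""
    return 100 * se / estimate


def confidence_interval(estimate, se, multiplier=2):
    """Interval extending `multiplier` standard errors either side of the estimate."""
    return (estimate - multiplier * se, estimate + multiplier * se)


# The worked example from the text: an estimate of 50 with a standard error of 10.
estimate, se = 50, 10
print(relative_standard_error(estimate, se))  # 20.0
print(confidence_interval(estimate, se, 1))   # (40, 60)
print(confidence_interval(estimate, se, 2))   # (30, 70)
print(confidence_interval(estimate, se, 3))   # (20, 80)

# Estimating the standard error from (made-up) sample values.
print(standard_error([2, 4, 4, 4, 5, 5, 7, 9]))
```

An RSE of 20% here means the standard error is one fifth of the estimate. The variance of the estimate would be se squared, that is, 100.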
NON-SAMPLING ERROR
Non-sampling error comprises all other errors in the estimate. Some examples of causes of non-sampling error are non-response, a badly designed questionnaire, respondent bias and processing errors. Non-sampling errors can occur at any stage of the process, and they can happen in both censuses and sample surveys.
Non-sampling errors can be grouped into two main types: systematic and variable. Systematic error (called bias) makes survey results unrepresentative of the target population by distorting the survey estimates in one direction. For example, if the target population is the population of Australia but the survey population is just males, then the survey results will not be representative of the target population because of systematic bias in the survey frame. Variable error can distort the results on any given occasion but tends to balance out on average. Some of the types of non-sampling error are outlined below.
Failure to Identify Target Population / Inadequate Survey Population
The target population may not be clearly defined because imprecise definitions or concepts are used. The survey population may not reflect the target population because of an inadequate sampling frame and poor coverage rules. Problems with the frame include missing units, deaths, out-of-scope units and duplicates. These are discussed in detail in Frames and Population.
Non-Response Bias
Non-respondents may differ from respondents in relation to the attributes or variables being measured. Non-response can be total (none of the questions answered) or partial (some questions unanswered owing to memory problems, inability to answer, etc.). To improve response rates, care should be taken in designing the questionnaire, training interviewers, assuring respondents of confidentiality, motivating them to co-operate, and calling back at different times when a respondent is difficult to contact.
"Call-backs" are successful in reducing non-response but can be expensive for personal interviews. Non-response is covered in more detail in Non-Response.
Questionnaire Problems
The content and wording of the questionnaire may be misleading, and the layout of the questionnaire may make it difficult to record responses accurately. Questions should not be loaded, double-barrelled, misleading or ambiguous, and should be directly relevant to the objectives of the survey.
It is essential that questionnaires are tested on a sample of respondents before they are finalised, both to identify problems with questionnaire flow and question wording and to allow sufficient time for improvements to be made. The questionnaire should then be re-tested to ensure that the changes do not introduce other problems. This is discussed in more detail in Questionnaire Design.
Respondent Bias
Refusals to answer questions, memory biases and inaccurate information given because respondents believe they are protecting their personal interests and integrity may lead to a bias in the estimates. The way the respondent interprets the questionnaire and the wording of the answer the respondent gives can also cause inaccuracies. When designing the survey, remember that uppermost in the respondent's mind will be protecting their own privacy, integrity and interests. Careful questionnaire design and effective questionnaire testing can overcome these problems to some extent. Respondent bias is covered in more detail below.
Processing Errors
There are four stages in the processing of the data where errors may occur: data grooming, data capture, editing and estimation. Data grooming involves preliminary checking before the data is entered onto the processing system in the capture stage. Inadequate checking and quality management at this stage can introduce data loss (where data is not entered into the system) and data duplication (where the same data is entered into the system more than once).
Inappropriate edit checks and inaccurate weights in the estimation procedure can also introduce errors into the data. To minimise these errors, processing staff should be given adequate training and realistic workloads.
Misinterpretation of Results
Time Period Bias
This occurs when a survey is conducted during an unrepresentative time period. For example, if a survey aims to collect details on ice-cream sales but only collects a week's worth of data during the hottest part of summer, it is unlikely to represent the average weekly sales of ice-cream for the year.
Minimising Non-Sampling Error
Non-sampling error can be difficult to measure accurately, but it can be minimised by
RESPONDENT BIAS
No matter how good the questionnaire or the interviewers are, errors can be introduced into a survey, either consciously or unconsciously, by the respondents. The main sources of error relating to respondents are outlined below.
Sensitivity
If respondents are faced with a question that they find embarrassing, they may refuse to answer, or choose a response that saves them from having to continue with the questions. For example, if asked "Are you taking any oral contraceptive pills for any reason?", and knowing that a "Yes" will lead to a request for more details, respondents who are embarrassed by the question are likely to answer "No", even if this is incorrect.
Fatigue
Fatigue can be a problem in surveys that require a high level of commitment from respondents, for example diary surveys in which respondents have to record all expenses over a two-week period. In these types of survey, the level of accuracy and detail supplied may decrease as respondents become tired of recording all expenditures.
NON-RESPONSE
Non-response occurs when data is not collected from respondents. The proportion of these non-respondents in the sample is called the non-response rate. Non-response can be either partial or total. It is important to make all reasonable efforts to maximise the response rate, as non-respondents may differ in their characteristics from respondents, which can bias the results.
Partial Non-Response
When a respondent replies to the survey but answers only some of the questions, this is called partial non-response. Partial non-response can arise from memory problems, inadequate information or an inability to answer a particular question. The respondent may also refuse to answer questions if they
Total Non-Response
Total non-response can arise if a respondent cannot be contacted (the frame contains inaccurate or out-of-date contact information, or the respondent is not at home), is unable to respond (for example because of language difficulties or illness) or refuses to answer any questions.
When conducting surveys, it is important to collect information on why a respondent has not responded. For example, when evaluating a program, one respondent may indicate that they were not happy with the program and therefore do not wish to be part of the survey. Another may indicate that they simply do not have the time to complete the interview or survey form. If a large number of those not responding indicate dissatisfaction with the program, and this is not reflected in the final report, an obvious bias will be introduced into the results.
Minimising Non-Response
Response rates can be improved through good survey design: short, simple questions, good forms design techniques and a clear explanation of the survey's purposes and uses. Assurances of confidentiality are very important, as many respondents are unwilling to respond because they fear a lack of privacy. Targeted follow-ups of non-contacts or of those initially unable to reply can increase response rates significantly. Following are some hints on how to minimise refusals in a personal or phone contact:
Find out the reasons for refusal and try to talk through them
Allowing for Non-Response
Where response rates are still low after all reasonable follow-up attempts have been made, you can reduce bias by using population benchmarks to post-stratify the sample (covered in Sample Design), by intensively following up a subsample of the non-respondents, or by imputing for item non-response (non-response to a particular question).
The main aim of imputation is to produce consistent data without going back to the respondent for the correct values, thus reducing both respondent burden and the costs associated with the survey. Broadly speaking, the imputation methods fall into three groups:
Example: Effect of Non-Response
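As a purely hypothetical illustration (the figures are invented for this sketch and are not survey results), suppose a program evaluation has a 70% response rate, 80% of respondents report being satisfied, and only 40% of non-respondents would have done so. The sketch below compares the respondent-only estimate with the figure for the full selected sample. The function name is an assumption.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias from estimating the full-sample mean using respondents only.

    The full-sample mean is a weighted average of the two groups, so the
    bias equals (1 - response_rate) * (mean_respondents - mean_nonrespondents).
    """
    full_sample_mean = (response_rate * mean_respondents
                        + (1 - response_rate) * mean_nonrespondents)
    return mean_respondents - full_sample_mean


# Hypothetical inputs: 70% respond; 80% vs 40% satisfied in the two groups.
bias = nonresponse_bias(0.70, 0.80, 0.40)
print(f"Respondent-only estimate overstates satisfaction by {bias:.2f}")
```

With these made-up inputs the respondents alone suggest 80% satisfaction, while the full sample would give 68%. The bias grows with both the non-response rate and the difference between the two groups, and vanishes when non-respondents resemble respondents.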