The procedure of collecting data from every member in a given population.

In a sense, the techniques for collecting data are the most important step in the process of statistics; the procedures set the stage for obtaining information that can be used to draw meaningful and accurate conclusions. All of the statistical calculations we’ll be learning can be used on any set of data, regardless of how the data was obtained. Useless results come from bad data. Therefore, no matter how careful and exacting you are in organizing, summarizing, and analyzing data, your conclusions will be useless if you aren’t careful in the beginning to collect data appropriately.

There are four main ways to obtain data.

A census is a list of all the individuals in a population, along with their characteristics. One example is the US Census, held every 10 years. The main advantage of using a census to obtain information is that your conclusions will have 100% certainty. The disadvantages of conducting a census are that it may be difficult or impossible to obtain all the information, and costs may be prohibitive.
An existing source is an appropriate data set that has already been collected, and can be used for your study. The advantage of finding an existing source of data is obviously the savings in both time and money. A disadvantage is that it can often be difficult to find the exact data you need.
A survey sample is a study in which only a subset of the population is considered and there is no attempt to influence the value of the variable of interest. The advantage of using a survey is the savings in both time and money of not having to get information from every individual in the population. The main disadvantage of a survey sample, and this is extremely important, is that choosing an appropriate sample can be difficult. The sample must represent the overall population, even though it is just a subset of the population.
A survey sample is an example of an observational study, where there is no attempt to influence the value of the variable. Observational studies are great for detecting associations (relationships) between variables, but they cannot establish causation. The reason is that some variables that affect the outcome, called lurking variables, may go unobserved.
A designed experiment is an experiment that applies a treatment to individuals. In an experiment, information from the treated group is often compared with a control (untreated) group. Variables from the individuals and the treatments can easily be controlled in an experiment. A major advantage of an experiment is that you can analyze individual factors. Disadvantages of experiments are that they cannot be conducted when the variables cannot be controlled or when applying the treatment would be morally or ethically unacceptable. Section 1.5 discusses methods for setting up and conducting an experiment.

When conducting a census is unrealistic (as is usually the case), sampling from the population is the next best thing. There is one main question: How do you choose your sample? For example, if you are interested in knowing the average grade point average (GPA) of graduating high school students in your city, you wouldn’t want your sample to consist of only women or of just athletes or of just honor roll students. You would want your sample to represent the entire population of interest.

We must use the process of randomness to select the individuals included in our sample. If we are allowed to do the selecting, our sample will most certainly be biased, i.e., it will include a group of individuals that does not represent the entire population, and therefore, conclusions will most certainly systematically favor certain outcomes.

The most popular sampling technique that relies on randomness is simple random sampling, a technique where every possible sample of size n out of a population of size N has an equally likely chance of occurring. For example, a simple random sample of size n = 2 from a population size of N = 4 has 6 possible samples, and each has an equally likely chance of occurring:

Population: {1, 2, 3, 4}
Possible Samples of Size n = 2: {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
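
The list of possible samples above can be reproduced with a short script. This is only an illustrative sketch in Python (the variable names are our own choices, not part of the text): it enumerates every sample of size n = 2 from the population {1, 2, 3, 4} and checks the count against the formula N!/(n!(N − n)!) = 4!/(2!·2!) = 6.

```python
from itertools import combinations
from math import comb

population = [1, 2, 3, 4]  # N = 4
n = 2                      # sample size

# Every possible sample of size n: order does not matter and
# no individual can be chosen twice.
samples = list(combinations(population, n))

for s in samples:
    print(set(s))

# The number of samples agrees with the binomial coefficient C(N, n).
print(len(samples), comb(len(population), n))
```

In a simple random sample, each of these 6 samples would be chosen with probability 1/6.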

As simple random sampling is similar to “drawing names out of a hat,” we need a method to select the individuals for our sample. We will use either a table of random digits or technology to do this. A quick search on the Internet shows many places to find tables of random digits. One great site contains freely available tables (download as a PDF or read online): http://www.rand.org/pubs/monograph_reports/MR1418.html

In such a table of random digits, each entry is equally likely to be any of the 10 digits 0 through 9, and the entries are independent of one another (knowledge of one number gives us no information about any of the other entries surrounding it). As a result, if we read the table in groups of two numbers, each pair of entries is equally likely to be any of the 100 pairs 00, 01, …, 98, 99. Reading each triple of entries gives us an equally likely chance of seeing any of the 1000 entries 000, 001, 002, …, 998, 999.

To conduct a simple random sample, begin by numbering every member in your population. Read the table in groups as wide as the number of digits in the population size: if your population has size 30, you will read numbers from the table in groups of two (pairs); if your population has size 168, you will read numbers from the table in groups of three (triples). Start anywhere you’d like in the table, and read in any direction: left, right, up, or down. It’s nice to follow a pattern, just so you don’t get lost. Select the random numbers as you move along, and match the numbers chosen to the individuals in your population. If you select a random number that does not correspond to an individual in your population, or if you encounter a repeat number (this WILL happen because the digits are random!), skip it and move on.

For example, suppose I want to select 4 students from a class of 30 to estimate the class average on an exam (this is unrealistic because it’s trivial to find the average of only 30 students, but it’s just an example). I would first assign a number to each student, starting at 01 and ending at 30. Start at the beginning of a line, say line 00263 of the table below, and read in pairs from left to right. Here are the numbers that I’d record:

32 03 13 96 08 75 99 27 34 45 01 …

Table of Random Digits
00250	59467	58309	87834	57213	37510	33689	01259	62486	56320	46265
00251	73452	17619	56421	40725	23439	41701	93223	41682	45026	47505
00252	27635	56293	91700	04391	67317	89604	73020	69853	61517	51207
00253	86040	02596	01655	09918	45161	00222	54577	74821	47335	08582
00254	52403	94255	26351	46527	68224	90183	85057	72310	34963	83462

00255	49465	46581	61499	04844	94626	02963	41482	83879	44942	63915
00256	94365	92560	12363	30246	02086	75036	88620	91088	67691	67762
00257	34261	08769	91830	23313	18256	28850	37639	92748	57791	71328
00258	37110	66538	39318	15626	44324	82827	08782	65960	58167	01305
00259	83950	45424	72453	19444	68219	64733	94088	62006	89985	36936

00260	61630	97966	76537	46467	30942	07479	67971	14558	22458	35148
00261	01929	17165	12037	74558	16250	71750	55546	29693	94984	37782
00262	41659	39098	23982	29899	71594	77979	54477	13764	17315	72893
00263	32031	39608	75992	73445	01317	50525	87313	45191	30214	19769
00264	90043	93478	58044	06949	31176	88370	50274	83987	45316	38551

Since the first number, 32, does not correspond to anyone in the population, we skip it. The first student to be selected for the sample would be Student 03. Following this would be Students 13, 08, and finally 27:

32 03 13 96 08 75 99 27 34 45 01 …

Therefore, all our statistical research would focus on the exam scores of the four students pertaining to the numbers 03, 08, 13, and 27. Numbering students and “allowing” a table to choose the students to include in our study removes the biases that would exist if we tried to choose the students ourselves.
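
The table-reading procedure in this example can also be written as a short program. The sketch below is one possible way to do it in Python; the function name select_from_digits and its parameters are hypothetical, introduced only for illustration. It reads line 00263 in pairs, skips any number outside 01 to 30 and any repeat, and stops once 4 students have been selected.

```python
def select_from_digits(digits, population_size, sample_size):
    """Read a string of random digits in fixed-width groups and
    return the selected individual numbers in the order drawn."""
    width = len(str(population_size))  # size 30 -> pairs, size 168 -> triples
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        number = int(digits[i:i + width])
        # Skip numbers that match no individual, and skip repeats.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Line 00263 of the table, with the gaps between the blocks removed.
row_263 = ("32031" "39608" "75992" "73445" "01317"
           "50525" "87313" "45191" "30214" "19769")

print(select_from_digits(row_263, population_size=30, sample_size=4))
# prints [3, 13, 8, 27]
```

If technology is used instead of a table, Python's random.sample(range(1, 31), 4) would draw four distinct student numbers directly.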

Sampling Errors

Now that we have seen how to obtain samples appropriately, here are some of the issues that can arise during a sampling process. There are two types of errors that arise: sampling errors and nonsampling errors.

Sampling errors are very difficult to control or predict. These errors result from using the sample (a subset of the population) to describe characteristics of the population. Therefore, the process of sampling may give incomplete information about the population. In other words, even if we use a random process to select a sample, our sample may not “perfectly” represent the population of interest. This occurs because data and information vary from member to member in the population.
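
To see sampling error in action, the following sketch simulates the idea in Python. The exam scores are made-up data generated only for illustration, and the seed value is arbitrary. Several simple random samples of size 4 are drawn from the same class of 30, and each sample mean is compared with the true class mean.

```python
import random
from statistics import mean

rng = random.Random(1)  # arbitrary fixed seed so the run is repeatable

# Hypothetical exam scores for a class of 30 (made-up data).
scores = [rng.randint(50, 100) for _ in range(30)]
population_mean = mean(scores)

# Draw several simple random samples of size 4. The gap between each
# sample mean and the population mean is the sampling error.
sample_means = []
for _ in range(5):
    sample = rng.sample(scores, 4)
    sample_means.append(mean(sample))

print("population mean:", round(population_mean, 2))
for m in sample_means:
    print("sample mean:", m, "error:", round(m - population_mean, 2))
```

Every sample here was chosen randomly, yet the sample means generally differ from one another and from the class mean, simply because scores vary from student to student.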

There are numerous nonsampling errors that result from the sampling process, including the nonresponse of individuals selected in the sample, inaccurate responses to poorly worded questions, bias in the selection of the sample, and so on. Nonsampling errors are often largely avoidable with a good study design, and minimizing these errors is of high priority in designing a sample survey. Some examples of nonsampling errors include:

Using an incomplete population
Nonresponse
Interviewer errors
Misrepresented answers
Mistakes in recording or entering data
Questionnaire design
Wording of questions
Order of questions, words, and responses

Example 1: Identifying Parts of a Survey

Two shortened survey reports are given. In each report, identify the following: the population, the sample, the results, and whether the results represent a sample statistic or a population parameter.

a. A headline about the rising obesity among young people led a school board to survey local high school students. Out of 231 students surveyed, 58% reported eating a “high fat” snack at least 4 times a week.

b. A nonprofit organization interviewed 618 adult shoppers at malls across Louisiana about their views on obesity in youths. The resulting report stated that an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options.

Solution

a. Population: local high school students.

Sample: the 231 students who were surveyed.

Results: 58% of students surveyed eat a “high fat” snack at least 4 times a week. The result refers to only those students who were surveyed, thus the result is a sample statistic.

b. Population: Louisiana adults

Sample: the 618 adult Louisiana mall shoppers who were surveyed.

Results: an estimated 48% of Louisiana adults are in favor of government regulation of “high fat” fast food options. The result is stated about all Louisiana adults, thus it refers to a population parameter. The reported value is an estimate of that parameter based on the sample statistics, which were not reported.

Types of Sampling

There are many ways that one can sample from a population. Ideally we would like to sample in such a way that we get a sample that reflects all the characteristics of the population and therefore a statistic that represents the parameter well. The quality of a sample statistic (i.e., accuracy, precision, representativeness) is highly affected by how the sample is chosen; that is, by the sampling method. Below we will describe different sampling methods.

A representative sample is one that has the same relevant characteristics as the population and does not favor one group of the population over another.
A random sample is one in which every member of the population has an equal chance of being selected.
A stratified sample is one in which members of the population are divided into two or more subgroups, called strata, that share similar characteristics like age, gender, or ethnicity. A random sample from each stratum is then drawn.
A cluster sample is one chosen by dividing the population into groups, called clusters, that are each similar to the entire population. The researcher then randomly selects some of the clusters. The sample consists of the data collected from every member of each cluster selected.
A systematic sample is one chosen by selecting every nth member of the population. Systematic sampling is easy to detect because, for a given starting point, it always produces the same sample for the same n. To get a different sample you will need a different n value or a different starting point.
A convenience sample is one chosen because its members are “convenient” for the researcher to reach.
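
The differences among these methods can be made concrete with a small sketch. The Python code below is illustrative only: the population of 60 students in three schools is made-up data, and the function names are our own. It shows one simple way to draw a random, stratified, cluster, and systematic sample from the same population.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # arbitrary seed for repeatability

# Hypothetical population: 60 students spread evenly over schools A, B, C.
population = [{"id": i, "school": "ABC"[i % 3]} for i in range(60)]

def by_group(members, key):
    """Split the population into groups that share the same value of key."""
    groups = defaultdict(list)
    for m in members:
        groups[m[key]].append(m)
    return groups

def simple_random_sample(members, n):
    return rng.sample(members, n)

def stratified_sample(members, key, n_per_stratum):
    # Divide into strata, then draw a random sample from EVERY stratum.
    return [m for g in by_group(members, key).values()
            for m in rng.sample(g, n_per_stratum)]

def cluster_sample(members, key, n_clusters):
    # Randomly pick whole clusters, then keep EVERY member of each one.
    groups = by_group(members, key)
    picked = rng.sample(sorted(groups), n_clusters)
    return [m for name in picked for m in groups[name]]

def systematic_sample(members, step, start=0):
    # Every step-th member, beginning at position start.
    return members[start::step]

print(len(simple_random_sample(population, 6)))
print(len(stratified_sample(population, "school", 2)))
print(len(cluster_sample(population, "school", 1)))
print([m["id"] for m in systematic_sample(population, 10)])
```

Note the contrast: the stratified sample takes a few students from every school, while the cluster sample takes all the students from only some of the schools.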

Cluster sampling and stratified sampling are often confused, but perhaps a simple thought experiment will help. Suppose you wanted to compare the fuel efficiency of different cars driven in the United States. Can you think of some ways to divide the cars into strata that might represent a broader scope of vehicles on the market? For example: size of engine, manufacturer, make, safety rating, number of doors. Notice that these are characteristics that the individuals in the sample may or may not have, and they are qualitative. To look at cluster sampling with the same example, ask yourself: do you think this method would produce a good representative sample of vehicles if we allowed our clusters to be price ranges? Why or why not? Because cluster sampling is an “all from one group” method, comparing mpg’s from cars in only certain price ranges would not produce a representative sample.

Suppose instead you decide to gather data from half the students in your class for our comparison of fuel efficiency. Do you think this example of convenience sampling would give an accurate picture of the population of cars driven in the United States? It is unlikely that students (or any single age group, for that matter) will drive a wide range of cars. Newer, more expensive cars are less likely to be driven by students and would not be well represented in the student sample.

Lastly, let’s try identifying every 5th car accessing the interstate on a particular entrance ramp during rush hour traffic. This is an example of systematic sampling for our fuel study. Can you identify any potential biases that we might need to be aware of when choosing the observation spot? The location of the entrance ramp might lend itself to having cars only on one end of the price scale depending on the businesses located in the area.

Example 2: Identifying Sampling Techniques

Identify the sampling technique used to obtain a sample in each of the following situations.

a. To conduct a survey on collegiate social life, you knock on every 5th dorm room door on campus.

b. Student ID numbers are randomly selected from a computer printout for free tickets to the championship game.

c. Fourth grade reading levels across the county were analyzed by the school board by randomly selecting 25 fourth graders from each school in the county district.

d. In order to determine what ice cream flavors would sell best, a grocery store polls shoppers that are in the frozen foods section.

e. To determine the average number of cars per household, each household in 4 of the 20 local counties was sent a survey regarding car ownership.

Solution

a. Because the sample is obtained by choosing every nth dorm room, this is systematic sampling. This is a representative sample, as long as students were randomly assigned to dorm rooms and there are no hidden biases, such as every nth room being occupied only by males.

b. Since every member has an equal chance of being selected, this is random sampling.

c. The students were divided into strata based on their schools and then a random sample from each school was chosen. This is stratified sampling.

d. Because of the ease of choosing shoppers right in their own store, this is convenience sampling. In this case, convenience sampling is a viable method for gaining a representative sample since the store would be interested in knowing the thoughts of their customers.

e. Cluster sampling was used here because the counties are the natural clusters and all of the households in some of the counties received the surveys.