What is the best way to divide the test when using the split-half method?

  • Split-half reliability

    Split-half reliability is mainly used for written/standardized tests, but it is sometimes used in physical/human performance tests (albeit ones that require a number of trials). However, it is based on the assumption that the measurement procedure can be divided (i.e., split) into two matched halves.

    Split-half reliability is assessed by splitting the measures/items from the measurement procedure in half, and then calculating the scores for each half separately. Before calculating the split-half reliability of the scores, you have to decide how to split the measures/items from the measurement procedure (e.g., a written/standardized test). How you do this will affect the values you obtain.

    • One option is simply to divide the measurement procedure in half; that is, take the scores from the measures/items in the first half of the measurement procedure and compare them to the scores from those measures/items in the second half of the measurement procedure. This can be problematic because of (a) issues of test design (e.g., easier/harder questions are in the first/second half of the measurement procedure), (b) participant fatigue/concentration/focus (i.e., scores may decrease during the second half of the measurement procedure), and (c) different items/types of content in different parts of the test.

    • Another option is to compare odd- and even-numbered items/measures from the measurement procedure. The aim of this method is to try and match the measures/items that are being compared in terms of content, test design (i.e., difficulty), participant demands, and so forth. This helps to avoid some of the potential biases that arise from simply dividing the measurement procedure in two.

    After dividing the measures/items from the measurement procedure, the scores for each of the halves are calculated separately, before the internal consistency between the two sets of scores is assessed, usually through a correlation that is then adjusted with the Spearman-Brown formula. The measurement procedure is considered to demonstrate split-half reliability if the two sets of scores are highly correlated (i.e., there is a strong relationship between the scores).
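
    To make this concrete, here is a minimal sketch in Python of an odd/even split followed by the Spearman-Brown correction described later in this article; the item matrix and all variable names are hypothetical:

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical data: rows are respondents, columns are test items
    # (1 = correct, 0 = incorrect)
    items = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 1, 0, 1],
    ])

    # Odd/even split (0-based indexing: columns 0, 2, 4, ... hold the
    # odd-numbered items)
    odd_scores = items[:, 0::2].sum(axis=1)
    even_scores = items[:, 1::2].sum(axis=1)

    # Correlate the two sets of half-test scores
    r_hh, _ = pearsonr(odd_scores, even_scores)

    # Spearman-Brown correction: estimated reliability of the full-length test
    r_tt = (2 * r_hh) / (1 + r_hh)
    print(f"half-test r = {r_hh:.3f}, split-half reliability = {r_tt:.3f}")
    ```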

  • Cronbach's alpha

    Cronbach's alpha coefficient (also known as the coefficient alpha technique or alpha coefficient of reliability) is a test of reliability as internal consistency (Cronbach, 1951). At the undergraduate and master's dissertation level, it is more likely to be used than the split-half method. It is most likely to be used in written/standardized tests (e.g., a survey).

    Cronbach's alpha is closely related to split-half reliability. However, rather than simply examining two sets of scores (that is, computing the split-half reliability of the measurement procedure only once), Cronbach's alpha does this for each measure/item within a measurement procedure (e.g., every question within a survey): it compares the scores on each measure/item with the sum of all the other relevant measures/items you are interested in. Indeed, Cronbach's alpha is equivalent to the average of all possible split-half coefficients for the measurement procedure. This provides a coefficient based on the inter-item correlations, where a strong relationship between the measures/items within the measurement procedure suggests high internal consistency (e.g., a Cronbach's alpha coefficient of .80).

    Cronbach's alpha is often used when you have multi-item scales (e.g., a measurement procedure, such as a survey, with multiple questions). It is also a versatile test of reliability as internal consistency because it can be used for attitudinal measurements, which are popular amongst undergraduate and master's level students (e.g., attitudinal measurements include Likert scales with options such as strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). However, Cronbach's alpha does not determine the unidimensionality of a measurement procedure (i.e., that a measurement procedure measures only one construct, such as depression, rather than distinguishing between multiple constructs measured within a measurement procedure; perhaps depression and employee burnout). This is because you could get a high Cronbach's alpha coefficient (e.g., .80) when testing a measurement procedure that involves two or more constructs.
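
    For illustration, Cronbach's alpha can be computed from its standard formula, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_t^2}\right)$, where $k$ is the number of items, $s_i^2$ is the variance of item $i$, and $s_t^2$ is the variance of the total scores. Below is a minimal sketch in Python; the Likert responses are hypothetical:

    ```python
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for a respondents-by-items score matrix."""
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical 5-point Likert responses: rows = respondents, columns = items
    responses = np.array([
        [4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [3, 2, 3, 3],
        [4, 4, 5, 4],
    ])
    print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
    ```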


  • Test-retest reliability on separate days

    Test-retest reliability on separate days assesses the stability of a measurement procedure (i.e., reliability as stability). We emphasize the fact that we are interested in test-retest reliability on separate days because test-retest reliability can also be assessed on the same day, where it has a different purpose (i.e., it assesses reliability as internal consistency rather than reliability as stability).

    A test (i.e., measurement procedure) is carried out on day one, and then repeated on day two or later. The scores between these two tests are compared by calculating the correlation coefficient between the two sets of scores. The same version of the measurement procedure (e.g., a survey) is used for both tests. The samples (i.e., people being tested) for each test should be the same (or very similar); that is, the characteristics of the samples should be closely matched (e.g., on age, gender, etc.). If there is a strong relationship between the two sets of scores, highlighting consistency between the two tests, the measurement procedure is considered to be reliable (i.e., stable). Where the measurement procedure is reliable in this way, we would expect to see identical (or very similar) results from a similar sample under similar conditions when this measurement procedure was used in future.
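
    As an illustration, the following minimal Python sketch computes the test-retest correlation; the day-one and retest scores are hypothetical and come from the same participants:

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical scores from the same participants on day one and on the retest
    day1 = np.array([12, 18, 15, 22, 9, 17, 14, 20])
    day2 = np.array([13, 17, 16, 21, 10, 18, 13, 21])

    # A strong, significant correlation suggests the measure is stable over time
    r, p = pearsonr(day1, day2)
    print(f"test-retest r = {r:.3f} (p = {p:.4f})")
    ```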

    Test-retest reliability on separate days is particularly appropriate for studies of physical performance, but it can also be used with written tests/survey methods. However, in such cases, there is greater potential for learning effects [see the section, Testing effects and internal validity, in the article: Internal validity] to result in spuriously high correlations (i.e., the reliability is exaggerated because it cannot mitigate for learning effects; it simply takes into account the two sets of scores).

    The interval between the test and retest (i.e., between measurement procedures) will be determined by a number of factors. In physical performance tests, for example, you may need to assess the amount of rest participants require, especially if the test is physically demanding. In written tests/survey methods, greater time between the test and retest will likely increase the threat from learning effects. Therefore, you will need to assess what the appropriate interval between the test and retest is: too short and there is the potential for memory effects from the first test; too long and there is the potential for extraneous/confounding effects [see the article: Extraneous and confounding variables]. Ultimately, you should avoid any length of interval over which maturation, learning effects, changes in ability, outside influences/situational factors, participant interest, and so on, could affect the retest [see the article, Internal validity, if you are unsure what some of these threats to research are].

  • Parallel-forms reliability

    Parallel-forms reliability (also known as the parallel-forms method, alternate-forms method or equivalence method) is used to assess the reliability of a measurement procedure when different (alternate/modified) versions of the measurement procedure are used for the test and retest. The same group of participants is used for both test and retest. The measurement procedures, whilst different, should address the same construct (e.g., intelligence, depression, motivation, etc.).

    Whereas the test-retest reliability method is more appropriate for physical performance measures, the parallel-forms reliability method is more frequently used in written/standardized tests. It is seldom appropriate for physical performance tests because designing two measurement procedures that measure the same physical attribute is more challenging than designing two sets of standardized test questions.

    The reliability of the measurement procedure is determined by the similarity/consistency of the results between the two versions of the measurement instrument (i.e., reliability as equivalence). Such reliability is tested using a t-test, the similarity of the means and standard deviations of the two sets of scores (i.e., the scores from the two versions of the measurement instrument), and a high correlation coefficient.
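
    A minimal sketch of these checks in Python, using hypothetical scores from the same group of participants on two forms (a paired t-test is used because the same group sits both forms):

    ```python
    import numpy as np
    from scipy.stats import pearsonr, ttest_rel

    # Hypothetical scores from the same participants on form A and form B
    form_a = np.array([24, 31, 28, 35, 22, 30, 27, 33])
    form_b = np.array([25, 30, 29, 34, 23, 31, 26, 34])

    # Equivalence of means: a non-significant paired t-test is desirable
    t_stat, p_value = ttest_rel(form_a, form_b)

    # Similar means and standard deviations, plus a high correlation,
    # support reliability as equivalence
    print(f"means: {form_a.mean():.2f} vs {form_b.mean():.2f}")
    print(f"SDs:   {form_a.std(ddof=1):.2f} vs {form_b.std(ddof=1):.2f}")
    print(f"paired t = {t_stat:.3f} (p = {p_value:.3f})")
    print(f"correlation r = {pearsonr(form_a, form_b)[0]:.3f}")
    ```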

  • The parallel-forms method is, in many instances, expensive, and parallel forms are frequently difficult to construct. Therefore, a less direct method of assessing the effects of different samples of items is the so-called split-half method. Only one test administration is needed to compute the reliability coefficient with this technique.

    The whole set of test items is divided into two equal halves. The test is administered once, and separate scores on the two arbitrarily selected halves can then be obtained for every individual.

    For example, an individual may be given one score on the odd-numbered items and a second score on the even-numbered items.

    Then the product-moment correlation between the two sets of scores gives the parallel-forms reliability coefficient for a test half as long as the original test.

    A notable problem in the split-half technique is how to split the test so as to obtain the most nearly comparable halves.

    In most tests, the first half and the second half would not be comparable owing to differences in nature and difficulty level of items.

    A procedure that is adequate for most purposes is to find the scores on the odd- and even-numbered items of the test. If the items were originally arranged in approximate order of difficulty, such a division yields very nearly equivalent half-scores.

    Once the two half-scores have been obtained for each individual, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half-test.

    For example, if the entire test consists of 100 items, the correlation is computed between sets of scores, each of which is based on only 50 items.

    In both test-retest and parallel-forms reliability, on the other hand, each score is based on the full number of items in the test.

    Assuming that the two halves are equivalent, the reliability of the full test ($r_{tt}$) can be estimated by means of the Spearman-Brown prophecy formula, as given below.

    $$r_{tt} = \frac{2 r_{hh}}{1 + r_{hh}} \qquad \text{(d)}$$

    where $r_{hh}$ is the correlation between the two half-tests. As an example, if the correlation of the total scores on the odd-numbered items with the total scores on the even-numbered items is 0.80, the estimated reliability of the whole test is

    $$r_{tt} = \frac{2 \times 0.80}{1 + 0.80} = \frac{1.60}{1.80} \approx 0.89$$

    The Spearman-Brown prophecy formula stated in (d) above is a particular case of a more general formula of the following type:

    $$r_{kk} = \frac{k\, r_0}{1 + (k - 1)\, r_0} \qquad \text{(e)}$$

    in which $k$ is the factor by which the test is to be lengthened or shortened with respect to the original test, whose reliability coefficient is $r_0$.

    The formula (e) is of particular significance. It gives an estimate of the effect of lengthening or shortening a test on its reliability coefficient. Suppose we have an $n_1$-item test, and we know its reliability, $r_0$.

    By the use of this formula, we can predict what its reliability would be if $n_2$ additional similar items were added to the test. Here $k = (n_1 + n_2) / n_1$. Substituting $k$ in (e), the reliability coefficient of the new test can be calculated.

    Similarly, if we have an $m_1$-item test of known reliability and we wish to reduce it to a test of $m_2$ items ($m_2 < m_1$), the Spearman-Brown formula may be employed with $k = m_2 / m_1$ to estimate the reliability of the shortened test.

    In addition, the formula is useful when we want to determine how many items will be needed to achieve a given level of reliability.

    For instance, if $r_{tt}$ were set at 0.90, we can determine how many items inter-correlating 0.50 would be needed to achieve this desired level of reliability. This can be obtained from the following formula.

    $$k = \frac{r_{tt}\,(1 - r_0)}{r_0\,(1 - r_{tt})}$$

    The formula can easily be obtained from a simple rearrangement of (e). Setting $r_{tt} = 0.90$ and $r_0 = 0.50$, we find:

    $$k = \frac{0.90 \times (1 - 0.50)}{0.50 \times (1 - 0.90)} = \frac{0.45}{0.05} = 9$$

    That is, nine such items would be needed.

    This demonstrates the main justification for including a large number of items in a test (or scale): the reliability can thereby be increased to a satisfactory level.

    The formula further shows that the number of items needed to reach a given level of reliability depends on the homogeneity of the items, that is, on the inter-correlations between them.
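
    These calculations are easy to script. The following minimal Python sketch implements the general formula (e) and its rearrangement for $k$; the function names are our own:

    ```python
    def spearman_brown(k: float, r0: float) -> float:
        """General Spearman-Brown formula (e): reliability of a test
        lengthened (k > 1) or shortened (k < 1) by a factor k, given the
        reliability r0 of the original test."""
        return (k * r0) / (1 + (k - 1) * r0)

    def k_required(r_tt: float, r0: float) -> float:
        """Rearrangement of (e): the lengthening factor needed to raise
        the reliability from r0 to the target r_tt."""
        return (r_tt * (1 - r0)) / (r0 * (1 - r_tt))

    # Half-test correlation of 0.80 -> full-test reliability (formula (d), k = 2)
    print(spearman_brown(2, 0.80))   # ~0.889

    # Items inter-correlating 0.50, target reliability 0.90 -> 9 items needed
    print(k_required(0.90, 0.50))    # 9.0
    ```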

    Example #1

    Suppose we have a 20-item test with a reliability coefficient of 0.60. Estimate what the reliability of this test would be if 80 similar items were added to make it a 100-item test.

    Solution: In this instance $k = (n_1 + n_2) / n_1 = (20 + 80) / 20 = 5$ and $r_0 = 0.60$.

    Hence, using (e):

    $$r_{kk} = \frac{5 \times 0.60}{1 + (5 - 1) \times 0.60} = \frac{3.00}{3.40} \approx 0.88$$

    Example #2

    Suppose we have a 110-item test, the length of which is reduced to 55 items. The reliability coefficient of the original test is 0.80. What would be the reliability of the shortened test?

    Solution: Here $k = m_2 / m_1 = 55 / 110 = 0.50$ and $r_0 = 0.80$.

    Hence:

    $$r_{kk} = \frac{0.50 \times 0.80}{1 + (0.50 - 1) \times 0.80} = \frac{0.40}{0.60} \approx 0.67$$

    An alternative method for finding split-half reliability was developed by Rulon (1939).

    It requires only the variance of the differences between each person's scores on the two half-tests ($s_e^2$) and the variance of total scores ($s_t^2$); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

    $$r_{tt} = 1 - \frac{s_e^2}{s_t^2}$$

    Thus, for a test with a standard deviation of 6 and a standard error of measurement of 3, Rulon's method gives a reliability coefficient of:

    $$r_{tt} = 1 - \frac{3^2}{6^2} = 1 - \frac{9}{36} = 0.75$$
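
    A minimal sketch of Rulon's method in Python, computing the variance of the half-score differences and the variance of the total scores directly; the half-test scores are hypothetical:

    ```python
    import numpy as np

    def rulon_reliability(half1: np.ndarray, half2: np.ndarray) -> float:
        """Rulon's (1939) split-half reliability: one minus the ratio of
        the variance of half-score differences to the variance of totals."""
        diff_var = (half1 - half2).var(ddof=1)   # estimates error variance
        total_var = (half1 + half2).var(ddof=1)  # variance of total scores
        return 1 - diff_var / total_var

    # Hypothetical odd- and even-half scores for six examinees
    odd = np.array([10, 14, 12, 16, 9, 13])
    even = np.array([11, 13, 12, 15, 10, 14])
    print(f"Rulon reliability = {rulon_reliability(odd, even):.3f}")
    ```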

    Example #3

    A sample of ten students, all of the same age, was selected and given a Math test to assess strengths and weaknesses in a number of math and math-related areas.

    Use the split-half method with a Spearman-Brown correction to assess the test's reliability. The scaled scores for the odd- and even-numbered items are shown in the accompanying table.

    [Table: odd- and even-item scaled scores for the ten students; not reproduced here.]

    The computed correlation coefficient between the odd- and even-item scores is:

    $$r_{hh} \approx 0.883$$

    With 8 df, r is significant at the 1% level (the critical table value is 0.765). Since r is significant, we apply the Spearman-Brown formula as follows:

    $$r_{tt} = \frac{2 \times 0.883}{1 + 0.883} = \frac{1.766}{1.883} \approx 0.938$$

    The reliability is thus estimated at 0.938, which indicates a high degree of reliability.
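
    As a closing check (the original score table is not reproduced above), the half-test correlation implied by the reported reliability can be recovered by inverting the Spearman-Brown formula:

    ```python
    # Invert the Spearman-Brown formula, r_tt = 2r / (1 + r), to recover the
    # half-test correlation implied by the reported reliability of 0.938
    r_tt = 0.938
    r_hh = r_tt / (2 - r_tt)
    print(f"implied half-test r = {r_hh:.3f}")                     # ~0.883

    # Applying the correction to that value reproduces the reported figure
    print(f"corrected reliability = {2 * r_hh / (1 + r_hh):.3f}")  # 0.938
    ```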