What makes a good screening test

11.7

Selecting the correct screening test

13.4

for a disease is affected by a number of factors: - The test must be easy to administer to people. - Cause minimal discomfort. - Be reliable - that is, consistently give the same result. - Be valid - that is, be able (by using key markers) to distinguish disease and non-disease states. - And, it must be affordable to the health service. A detailed examination of the retina, at the back of the eye, is the key procedure when screening for diabetic retinopathy. Various examination methods are available, from ophthalmoscopy to digital retinal imaging - with or without pupil dilation.

52.3

Whichever examination method is selected: Screeners must be able to find all cases of retinopathy correctly without causing unnecessary anxiety to the person being screened.

62.9

And, the health system must be able to: - Pay for, and manage the purchasing of, screening equipment; and - Train personnel to use and maintain it. We can use a 2 by 2 table to plot the presence of diabetic retinopathy against the ability of the test to detect the condition correctly. A good screening test for DR must have the following 4 key characteristics. - A high sensitivity - that is, the test must correctly identify all cases of retinopathy (known as true positives). - A high specificity - the test must minimise falsely identifying cases as having retinopathy (false positives).

102.7

The test must also have a high positive predictive value, ensuring a high probability that each person with a positive screening test truly has retinopathy. - And finally, the test must achieve high coverage so that everyone at risk of retinopathy (i.e. people with diabetes) is tested. Let's look at a hypothetical example to understand why these characteristics are important. In our population of 100 people with diabetes, 10 have retinopathy (these are known as cases). If all 100 people are screened, we have achieved high coverage. The perfect screening test will correctly detect all 10 cases (100% sensitivity) and identify all of the remaining 90 people as being free of retinopathy (100% specificity). Now let's see how an imperfect screening test performs.

161.3

This time, even when we ensure everyone is tested (high coverage) the test is only able to pick up 7 out of 10 cases correctly. Sensitivity is only 70%.

180.7

The test also incorrectly identifies 5 cases (false positives).

192.7

Only 85 out of the 90 people without retinopathy are correctly identified. Specificity is now 94%.

203.9

The predictive value of this screening test is low. Only 7 out of the 12 people (or 58%) identified by this test as having diabetic retinopathy actually have the disease. So this test is poor and not appropriate for scale up for screening a real population of people with diabetes. In practice, even with a test with high sensitivity and specificity, screening programmes need to achieve good coverage to perform well. In our example here, if half of the population is not screened, perhaps because of lack of resources, we will miss 4 cases who do not even get a chance to be detected.

243.3

In the group which is examined, the test correctly picks up 4 out of the 6 cases giving us a sensitivity of only 67% and making this a poor test. To deliver a high-quality screening programme, we have to consider both the key markers of a good test and how it will be implemented.
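The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `screening_metrics` and the variable names are assumptions introduced here, and the counts are the hypothetical ones from the example (100 people with diabetes, 10 of them with retinopathy).

```python
def screening_metrics(true_pos, false_neg, false_pos, true_neg):
    """Return (sensitivity, specificity, PPV) from the four cells of a 2 by 2 table."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    ppv = true_pos / (true_pos + false_pos)
    return sensitivity, specificity, ppv


# Imperfect test with everyone screened: 10 cases (7 found, 3 missed) and
# 90 people without retinopathy (85 correctly cleared, 5 false positives).
sens, spec, ppv = screening_metrics(true_pos=7, false_neg=3, false_pos=5, true_neg=85)
print(f"sensitivity={sens:.0%} specificity={spec:.0%} PPV={ppv:.0%}")

# Half coverage: 4 of the 10 cases are never examined,
# and 4 of the 6 examined cases are picked up.
sensitivity_in_screened = 4 / 6
cases_found_overall = 4 / 10  # share of all cases in the population that are detected
print(f"sensitivity among screened={sensitivity_in_screened:.0%}, "
      f"cases found overall={cases_found_overall:.0%}")
```

The last figure is not quoted in the example itself; it is included to show that, with half the population unscreened, fewer than half of all cases are found.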

262.7

The balance we need to find is a test that: - Is good - has a high sensitivity and high specificity. - Is acceptable to the people being tested. - Can achieve high coverage - can be delivered to the whole eligible population. - And can be conducted repeatedly at regular intervals, for example, annual screening of the whole at-risk population. To achieve these requirements, investment and protocols must be put in place.

293.7

The St Vincents Declaration of 2005 suggests that when establishing systematic screening,

299

programmes should aim to reach: a sensitivity of more than 80%; a specificity of more than 90%; and an acceptable coverage of at least 80%. For example, the English national screening programme, which began in 2003, reached 82.8% of the population by 2016. The selected test, 2-field mydriatic photography, achieved a sensitivity of about 88% and a specificity of just over 86%, with 3.7% of images ungradeable. As camera technology develops, these rates are likely to improve. To maintain good coverage, the English screening programme closely monitors acceptance of screening as part of its performance criteria.

348.2

In summary. To be selected for a screening programme,

351.2

screening tests should: - Have a high sensitivity - to correctly find all cases of diabetic retinopathy. - Have a high specificity - to reduce overloading the system with false positives and causing unnecessary anxiety amongst people incorrectly identified as having diabetic retinopathy. - And achieve high coverage - to ensure everyone at risk is screened. The test also has to be acceptable to the people with diabetes being screened and affordable for the health provider.

L. Daniel Maxim, Ron Niebo, Mark J. Utell

Everest Consulting Associates, Cranbury, NJ, USA

Received 2014 Jul 8; Revised 2014 Aug 6; Accepted 2014 Aug 13.

Screening tests are widely used in medicine to assess the likelihood that members of a defined population have a particular disease. This article presents an overview of such tests including the definitions of key technical (sensitivity and specificity) and population characteristics necessary to assess the benefits and limitations of such tests. Several examples are used to illustrate calculations, including the characteristics of low dose computed tomography as a lung cancer screen, choice of an optimal PSA cutoff and selection of the population to undergo mammography. The importance of careful consideration of the consequences of both false positives and negatives is highlighted. Receiver operating characteristic curves are explained as is the need to carefully select the population group to be tested.

Keywords: Benefits and limitations, positive and negative predicted value, prevalence, screening tests, sensitivity, specificity

A screening test (sometimes termed medical surveillance) is a medical test or procedure performed on members (subjects) of a defined1 asymptomatic population or population subgroup to assess the likelihood of their members having a particular disease.2 With few exceptions, screening tests do not diagnose the illness. Rather subjects who test positive typically require further evaluation with subsequent diagnostic tests or procedures. Examples of actual or proposed screening tests include the pap smear for cervical cancer (Arbyn et al., 2008; Mayrand et al., 2007), mammography (or tomosynthesis) for breast cancer (Friedewald et al., 2014; Rafferty et al., 2013), PSA (and/or digital rectal exam) for prostate cancer (Catalona et al., 1991), cholesterol level for heart disease, X-ray (or computed tomography) for lung cancer (discussed below), PKU test for phenylketonuria in newborns, B-natriuretic peptide test for screening patients undergoing echocardiography to determine left ventricular dysfunction (Maisel et al., 2001), and urinalysis or other screening tests for sexually transmitted diseases or illicit drug use (Gastwirth, 1987; Jafari et al., 2013; Watson et al., 2002). Screening tests may be based on the measurement of a particular chemical in the blood or urine (a quantitative measurement) or some qualitative assessment by a trained observer (e.g. interpretation of an x-ray or CT scan, or semi-quantitative analysis by a polygraph operator).

A major objective of most screening tests is to reduce morbidity or mortality in the population group being screened for the disease by early detection, when treatment may be more successful.3 An alternative objective might be to reduce morbidity or mortality in persons other than those screened, who might be affected by a communicable and preventable disease present among subjects in the population being tested (such as screening for HIV in blood donors4).

Some of the key analytical/statistical results applicable to the design and evaluation of screening tests have been around since the mid-1700s, when the Reverend Thomas Bayes first developed the theorem that bears his name, and numerous tutorials or review articles have been written more recently (Alberg et al., 2004; Altman & Bland, 1994a,b,c; Deeks & Altman, 2004; Goetzinger & Odibo, 2011; Lalkhen & McCluskey, 2008; Thompson et al., 2005; Zou et al., 2007). Nevertheless, there is still some confusion among practitioners about how to interpret and assess the utility of screening tests (Casscells et al., 1978; Grimes & Schutz, 2002; Manrai et al., 2014; Wegwarth et al., 2012), which is why this article might be of interest to readers of Inhalation Toxicology.

In its simplest form, the screening test has only two outcomes: positive (suggesting that the subject has the disease or condition) or negative (suggesting that the subject does not have the disease or condition).5 An ideal screening test would have a positive result if and only if the subject actually has the disease and a negative result if and only if the subject did not have the disease. Actual screening tests typically fall short (sometimes far short, see below) of this ideal. Instead, most screening tests exhibit what are termed false positives and false negatives to varying degrees. Logical possibilities are described in the 2 × 2 Table 1.

Table 1. Logical possibilities for true disease state and screening test outcome.

| Test result | Subject has disease | Subject disease free | Subtotal |
|---|---|---|---|
| Positive | Correct result | False positive | Total positive test results |
| Negative | False negative | Correct result | Total negative test results |
| Subtotal | Total subjects with disease | Total subjects disease free | Total subjects |

In most cases,6 screening tests need to be benchmarked against an agreed “Gold Standard” test (Greenhalgh, 1997). The gold standard test is a diagnostic test that is usually regarded as definitive (e.g. by biopsy or autopsy). The actual gold standard test may be invasive (e.g. biopsy), unpleasant, too late (e.g. autopsy) to be relevant, too expensive or otherwise impractical to be used widely as a screening test.

Table 2 provides examples of various screening tests and possible Gold Standards.

In principle, a “Gold Standard” should have 100% sensitivity and 100% specificity (see below for definitions), that is, it would never make a classification error. In practice that may not be the case and the “Gold Standard” is regarded as the best test under “reasonable conditions.” As noted by Versi (1992):

“As science increases its hold on the practice of medicine we become more aware of the limitations of the clinical method. Unfortunately, we also become more aware of the limitations of various diagnostic tests. Nevertheless, at any given time there may well be a consensus that a given test in a given situation is the best available test. It therefore serves as the gold standard against which newer tests can be compared. When enough data have accumulated to make that gold standard untenable, it can perfectly reasonably be replaced by another. This can then preside until it too is toppled.”

Troy et al. (1996) offered the following perspective on gold standards:

“… however, gold standards for comparison are not always available. Moreover, a perfect gold standard is less often available than an imperfect gold standard (‘alloyed gold standard’), an adopted standard based on observed data which is measured with error.”

For the purposes of this discussion, the gold standard test is assumed to be without error. Several authors have developed statistical approaches for dealing with “alloyed gold standards” (Dendukuri, 2011; Hawkins et al., 2001; Johnson et al., 2001; Joseph et al., 1995; Lewis & Torgerson, 2012; Rutjes et al., 2007; van Smeden et al., 2014; Walter & Irwig, 1988). As might be expected, none of these alternative procedures perform as well as if a true Gold Standard were available, but several are improvements over naively assuming the Gold Standard is “unalloyed.”

The possible outcomes shown in Table 1 are quantified by two probabilities, termed the test sensitivity and specificity. These are two key characteristics of a screening test.

  • Sensitivity is the test's ability to correctly designate a subject with the disease as positive; it is the conditional probability (Pr{T+|D+})7, denoted by the symbol S that a subject who has the disease, D+, tests positive, T+. A highly sensitive test means that there are few false negative results; few actual cases are missed. Ceteris paribus, tests with high sensitivity have potential value for screening, because they rarely miss subjects with the disease (Goetzinger & Odibo, 2011).

  • Specificity is the test's ability to correctly designate a subject without the disease as negative; it is the conditional probability (Pr{T−|D−}), denoted by the symbol Sp that a subject who does not have the disease, D−, tests negative, T−. A highly specific test means that there are few false positive results. Therefore, high specificity tests perform well for diagnosis because of low false positive errors. Tests with low specificity have the disadvantage that (among other things) many subjects without the disease will screen positive and potentially receive unnecessary (and possibly invasive, risky or expensive) follow-up diagnostic or therapeutic procedures.

Publications about screening tests typically report both the sensitivity and specificity of the test. It is clearly desirable to have a test that is both highly sensitive and highly specific. (In some cases, it may be possible to structure the test so as to trade off sensitivity against specificity, as discussed below.)

Figure 1 shows a sample of reported sensitivities and specificities of various screening tests as summarized by Alberg et al. (2004), denoted by the triangles, and from our own literature search (refer to Table A1), denoted by the circles. As can be seen, there are a substantial number of screening tests with both high sensitivity and high specificity, but also many that fall far short of this ideal. It should be noted that not all the screening tests shown in Figure 1 are actually used at present – some may have been found wanting.

As some test results (e.g. reading an X-ray) require interpretation, it is possible that there will be inter-observer variation (notwithstanding attempts at standardization) so that the reported sensitivity and specificity may vary with the observer (see Deeks, 2001 and Elmore et al., 2002 for illustrations). This creates issues when large scale screening tests are being contemplated and it is necessary to extrapolate or generalize from screening test data based on pilot studies, often conducted by highly specialized and experienced personnel. Refer to Whiting et al. (2004) and Table 3 for a useful systematic review of sources of variation and bias in studies of diagnostic or screening accuracy.

Table 3. Common sources of bias in study design.

| Type of bias | Description |
|---|---|
| Verification bias | Non-random selection for definitive assessment for disease with the gold standard reference test |
| Errors in the reference | True disease status is subject to misclassification because the gold standard is imperfect |
| Spectrum bias | Types of cases and controls included are not representative of the population |
| Test interpretation bias | Information is available that can distort the diagnostic test |
| Unsatisfactory tests | Tests that are uninterpretable or incomplete do not yield a test result |
| Extrapolation bias | The conditions or characteristics of populations in the study are different from those in which the test will be applied |
| Lead time bias | Earlier detection by screening may erroneously appear to indicate beneficial effects on the outcome of a progressive disease |
| Length bias | Slowly progressing disease is over-represented in screened subjects relative to all cases of disease that arise in the population |
| Overdiagnosis bias | Subclinical disease may regress and never become a clinical problem in the absence of screening, but is detected by screening |

Frequently, it is of interest to compare one screening test (a potential improvement) with another. It is important to make careful statistical comparisons to assess whether the “improved” test is actually superior and to calculate confidence intervals on the various proportions (e.g. using the methods given in Newcombe, 1998). Ideally, such comparisons should be made on the same population, with subjects randomly assigned to each test.
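As a concrete illustration of a confidence interval for a single proportion such as sensitivity or specificity, the sketch below implements the Wilson score interval with continuity correction, one of the interval methods compared in Newcombe (1998). The function name `wilson_cc_interval` and the counts in the usage line are hypothetical. Which interval method suits a given study is a judgment call that this sketch does not settle.

```python
from math import sqrt
from statistics import NormalDist


def wilson_cc_interval(successes, n, confidence=0.95):
    """Wilson score interval with continuity correction for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    q = 1.0 - p
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    z2 = z * z
    denom = 2.0 * (n + z2)
    # The bounds are pinned at 0 or 1 when the observed proportion is 0 or 1.
    if successes == 0:
        lower = 0.0
    else:
        root = sqrt(z2 - 2 - 1 / n + 4 * p * (n * q + 1))
        lower = max(0.0, (2 * n * p + z2 - 1 - z * root) / denom)
    if successes == n:
        upper = 1.0
    else:
        root = sqrt(z2 + 2 - 1 / n + 4 * p * (n * q - 1))
        upper = min(1.0, (2 * n * p + z2 + 1 + z * root) / denom)
    return lower, upper


# Hypothetical counts: 4500 positive tests among 5000 subjects with the disease.
low, high = wilson_cc_interval(4500, 5000)
print(f"sensitivity = {4500 / 5000:.2f}, 95% CI [{low:.4f}, {high:.4f}]")
```

Comparing two tests on the same subjects needs a paired method for the difference between proportions, which is beyond this single-proportion sketch.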

Before addressing additional important definitions applicable to screening tests it is appropriate to mention some of the consequences of false negatives and false positives in screening tests. Briefly:

  • A false negative means that a subject with the disease is misclassified as not having the disease on the basis of the screening test. The subject is given a misleading impression that he/she is free of the disease and thus does not undergo more suitable diagnostic tests. At a minimum this means that correct diagnosis is delayed (perhaps until the subject develops symptoms) and, in the case of diseases for which early treatment offers improved chances of recovery, there is increased risk of morbidity and mortality [refer to Kaufman et al. (2014) for an example with breast cancer]. False negatives from a screening test for illicit drug use or a polygraph test designed to detect deception have obvious negative consequences (Gastwirth, 1987). As another example, failure to detect someone with an STD may result in increased morbidity or mortality of future sexual partners of the subject. Systematic reviews of the consequences of false negatives are provided in Petticrew et al. (2000, 2001). False negatives may also lead to legal action being taken by affected individuals and may reduce public confidence in screening.

  • A false positive means that a subject without the disease is misclassified as having the disease on the basis of the screening test. The subject is given the misleading impression that he/she has the disease and thus endures the unnecessary psychological consequences as well as having to undergo possibly invasive diagnostic or treatment procedures. The consequences of a false positive can be material. For example:

    • Elmore et al. (1998) provides examples of the consequences of false-positive screening mammograms. Among other things, false-positive mammograms led to more outpatient visits, diagnostic imaging examinations, and biopsies than false positive clinical breast examinations. In one patient, cellulitis requiring hospitalization for surgical debridement and intravenous antibiotic therapy developed after a biopsy prompted by a false positive mammogram.

    • Wiener et al. (2011) provide an assessment of the population-based risk for complications after transthoracic needle lung biopsy of a pulmonary nodule discovered using a CT scan.

    • Croswell et al. (2009) reported on the cumulative incidence of false-positive results in repeated multimodal cancer screening. Among other things this study revealed that for a woman the cumulative risk of undergoing a false-positive-prompted invasive diagnostic procedure was about 12.3% after 4 tests increasing to 22.1% after 14 tests. For men the corresponding percentages were 17.2% after 4 tests and 28.5% after 14 tests.

    • A study of 12 669 Swedish youths (aged 16 and over) diagnosed with cancer found a 60% increased risk of suicide or attempted suicide (Lu et al., 2013). A false positive in this instance has a material adverse consequence.

    • Another study (Baade et al., 2006) of people diagnosed with cancer in Queensland, Australia indicated that this group experienced a SMR of 149.9 for non-cancer deaths.

    • False positives may also decrease the likelihood that a subject will return for subsequent follow-up procedures (Álamo-Junquera et al., 2011). And, false positives may also result in litigation and loss of public confidence in screening.

  • The consequences of both false positives and false negatives need to be carefully considered in assessing the utility of a screening test. In some cases, it may be possible to alter the decision criterion (or criteria) of a particular screening test to alter the sensitivity or specificity of the test and thus trade off one type of error for another. For such cases, the usual procedure is to calculate a receiver-operating characteristic curve (discussed below) for the test (Thompson et al., 2005; Zou et al., 2007). As well, some of the consequences of false positives can be altered by the choice of follow-up procedures among those subjects who test positive. For example, less invasive or non-invasive diagnostic tests can be selected, depending upon the specific outcomes of the initial screening test.

The prevalence of the disease is the fraction, Π, of subjects in the population under study that have the disease. It is equal to the a priori probability (Pr{D+}) that a subject selected at random from the population or subgroup has the disease.8 Prevalence, along with sensitivity and specificity, is a key determinant of the utility of the screening test (see below). For reasons discussed below, it is desirable to be able to define the population to be screened in such a way that the prevalence in the test population is high. The reported prevalence among various populations that are the subject of screening tests (Alberg et al., 2004) ranges from 0.05 to 0.9, but values are clustered toward the higher end.

There are four additional relevant characteristics of a screening test, the positive predictive value, negative predictive value, accuracy and likelihood ratio:

  • The positive predicted value (PPV) is the probability that a subject with a positive (abnormal) test actually has the disease (Pr{D+|T+}), also called the a posteriori probability. Given the above notation:

    • PPV = ΠS/(ΠS + (1 − Π)(1 − Sp)).

In words, the a posteriori probability that the subject has the disease given a positive test is the ratio of true positives (the product of the prevalence and sensitivity) divided by total positives (the sum of true positives and false positives). It is desirable that the screening test has a high PPV.

  • The negative predicted value (NPV) is the post-test probability that the subject has no disease given a negative test result (Pr{D−|T−}) also termed the a posteriori probability given a negative test. Given the above notation:

    • NPV = (1 − Π)Sp/((1 − Π)Sp + Π(1 − S)).

In words, the a posteriori probability that the subject does not have the disease given a negative test is the ratio of the true negatives (complement of prevalence times the specificity) divided by the total negatives (the sum of true negatives and false negatives). It is also desirable that the test has a high NPV.

  • The accuracy (also termed overall accuracy, diagnostic accuracy or test efficiency) of a test is the overall proportion of correct test results. This includes true positives and true negatives. Mathematically it is calculated from the equation:

    • Accuracy = ΠS + (1 − Π)Sp.

The term ΠS represents the true positives (prevalence times sensitivity) and the term (1 − Π)Sp represents the true negatives (probability the subject does not have the disease times the probability that the test is negative given the subject is without disease). As noted by Alberg et al. (2004): “Overall accuracy is the weighted average of a test's sensitivity and specificity, where sensitivity is weighted by prevalence and specificity is weighted by the complement of prevalence.” Refer to Alberg et al. (2004) for a discussion of the limitations of this measure of screening efficiency.

  • The Likelihood ratio is another term used to characterize screening tests; it is defined as the probability of a subject who has the disease testing positive divided by the probability of a subject who does not have the disease testing positive, L = S/(1 − Sp).
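The four quantities just defined can be computed directly from the prevalence Π, sensitivity S and specificity Sp. The following is a minimal sketch; the function name `predictive_values` and the dictionary keys are assumptions made for illustration, and the inputs in the usage line are the hypothetical values used for Table 4 (Π = 0.5, S = 0.9, Sp = 0.3).

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Compute PPV, NPV, accuracy and likelihood ratio from prevalence, S and Sp."""
    true_pos = prevalence * sensitivity                 # ΠS
    false_pos = (1 - prevalence) * (1 - specificity)    # (1 − Π)(1 − Sp)
    true_neg = (1 - prevalence) * specificity           # (1 − Π)Sp
    false_neg = prevalence * (1 - sensitivity)          # Π(1 − S)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "accuracy": true_pos + true_neg,
        # Undefined (division by zero) for a perfectly specific test, Sp = 1.
        "likelihood_ratio": sensitivity / (1 - specificity),
    }


example = predictive_values(prevalence=0.5, sensitivity=0.9, specificity=0.3)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Keeping the four joint probabilities as named intermediate values mirrors the cells of the 2 × 2 table, which makes each formula easy to check against its verbal definition.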

To many, the PPV or NPV are the key characteristics of a screening program. It is important to remember that the PPV or NPV are dependent on both the population under study and the technical characteristics of the screening test.9 A screening test with relatively high sensitivity and specificity may still have a low PPV if the population prevalence is sufficiently low. Thus, to assess a proposed screening test it is necessary to evaluate both the technical and population characteristics.

The probability of a positive test, Pr{T+}, is the sum of the probabilities of a subject with the disease correctly testing positive and someone without the disease incorrectly testing positive, or ΠS + (1 − Π)(1 − Sp). Some have suggested using the observed fraction of positive tests, F+ (sometimes termed the apparent prevalence), as a surrogate for or estimate of Π, but unless the sensitivity and specificity are both equal to unity, this will give a biased answer. Given the definitions, an improved estimate for Π is (F+ + Sp − 1)/(S + Sp − 1). Refer to Gart & Buck (1966), Gastwirth (1987), Levy & Kass (1970) and Rogan & Gladen (1978) for a derivation and Karaağaoğlu (1999) for additional analyses.
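The corrected prevalence estimate follows from setting F+ = ΠS + (1 − Π)(1 − Sp) and solving for Π. A short sketch is given below; the function name `corrected_prevalence` is hypothetical, and the decision to raise an error when S + Sp = 1 (where the test carries no information and the estimate is undefined) is an assumption of this sketch.

```python
def corrected_prevalence(apparent_prevalence, sensitivity, specificity):
    """Estimate true prevalence Π from the observed fraction of positive tests F+."""
    denom = sensitivity + specificity - 1
    if denom == 0:
        raise ValueError("estimate is undefined when sensitivity + specificity = 1")
    # With sampling error the raw estimate can fall outside [0, 1];
    # it is returned unclipped so that such cases remain visible.
    return (apparent_prevalence + specificity - 1) / denom


# Hypothetical inputs: 80% of tests are positive, with S = 0.9 and Sp = 0.3.
print(round(corrected_prevalence(0.8, 0.9, 0.3), 4))
```

In this illustration the apparent prevalence of 0.8 overstates the corrected estimate of about 0.5, because the low specificity generates many false positives.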

All of these screening test characteristics are determined by testing a particular population (using one or more screening tests) and recording the number of subjects that fall into the various categories shown in Table 1. To illustrate, Table 4 provides hypothetical data from a screening test evaluation of a population of 10 000 subjects, assumed to have a disease prevalence of 0.5, with a calculated sensitivity of 0.9 (95% confidence interval including continuity correction [0.8913, 0.9081]), and a specificity of 0.3 (95% confidence interval including continuity correction [0.2874, 0.313]).10

Table 4. Hypothetical data from screening experiment.

Raw data, in symbols (columns give the actual disease state):

| Test result | Yes | No | Subtotal |
|---|---|---|---|
| Positive | a | b | a + b |
| Negative | c | d | c + d |
| Subtotals | a + c | b + d | N |

Raw data, numerical illustration (columns give the actual disease state):

| Test result | Yes | No | Subtotal |
|---|---|---|---|
| Positive | 4500 | 3500 | 8000 |
| Negative | 500 | 1500 | 2000 |
| Subtotal | 5000 | 5000 | 10 000 |

Definitions:

| Term | Definition | Formula | Numerical result | Alternative formula or term |
|---|---|---|---|---|
| Prevalence, Π | Fraction of test subjects with disease | (a + c)/N | 0.5000 | Assumed a priori probability of disease, relatively high in this illustration |
| Sensitivity, S | Fraction of subjects with positive test given that test subject has disease; “true positive/disease” | a/(a + c) | 0.9000 | Hypothetical data show relatively high sensitivity |
| False negative rate | Fraction of subjects with disease, but with negative test result | c/(a + c) | 0.1000 | (1 − S) |
| Specificity, Sp | Fraction of test subjects with negative test given that the test subject does not have disease | d/(b + d) | 0.3000 | Hypothetical data show relatively low specificity |
| False positive rate | Fraction of test subjects with no disease, but positive test result | b/(b + d) | 0.7000 | (1 − Sp) |
| Probability of positive test | True positives + false positives divided by total tests | (a + b)/N | 0.8000 | P(T+) = ΠS + (1 − Π)(1 − Sp) |
| Probability of negative test | True negatives + false negatives divided by total tests | (c + d)/N | 0.2000 | P(T−) = Π(1 − S) + (1 − Π)Sp |
| Positive predictive value, PPV | Post-test probability of disease given a positive result | a/(a + b) | 0.5625 | A posteriori probability of disease given positive test result |
| Negative predictive value, NPV | Post-test probability of no disease given a negative test result | d/(c + d) | 0.7500 | A posteriori probability of no disease given negative test result |
| Accuracy | Proportion of correct test results | (a + d)/N | 0.6000 | ΠS + (1 − Π)Sp |
| Likelihood ratio | The probability of a subject who has the disease testing positive divided by the probability of a subject who does not have the disease testing positive | S/(1 − Sp) | 1.2857 | |
| Regret given positive test | Probability that disease free subject has positive test | b/(a + b) | 0.4375 | (1 − Π)(1 − Sp)/(ΠS + (1 − Π)(1 − Sp)) |

Bayes' theorem, positive test:

| True state | P(Hi): a priori probability person tested is in this state | P(T+/Hi): probability of positive test in this state | P(Hi)P(T+/Hi): joint probability | P(Hi/T+): a posteriori probability |
|---|---|---|---|---|
| Disease | 0.5000 | 0.90 | 0.4500 | 0.5625 |
| No disease | 0.5000 | 0.70 | 0.3500 | 0.4375 |
| Probability of positive test = P(T+) | | | 0.8000 | |

Bayes' theorem, negative test:

| True state | P(Hi): a priori probability person tested is in this state | P(T−/Hi): probability of negative test in this state | P(Hi)P(T−/Hi): joint probability | P(Hi/T−): a posteriori probability |
|---|---|---|---|---|
| Disease | 0.5000 | 0.10 | 0.0500 | 0.2500 |
| No disease | 0.5000 | 0.30 | 0.1500 | 0.7500 |
| Probability of negative test = P(T−) | | | 0.2000 | |

Table 4 also illustrates the equations and numerical computation of the various quantities defined above. The bottom of Table 4 provides the calculation (using Bayes' theorem) of the a posteriori probabilities corresponding to either a positive or negative test outcome. In this example, a subject who tests positive has an a posteriori probability of having the disease of 0.5625 – not materially greater than the a priori prevalence (0.5) in the population. This is because although the sensitivity is relatively high, the specificity of the test is relatively low. Conversely, a subject who tests negative has an a posteriori probability of not having the disease of 0.75 – in this case, clearly different from the a priori prevalence (0.5) in the population in this example.
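The Bayes' theorem calculation at the bottom of Table 4 can be reproduced with a short script. This is a sketch under stated assumptions: the function name `bayes_posteriors` and the dictionary layout are hypothetical choices made here, while the priors and conditional probabilities are the hypothetical values from Table 4.

```python
def bayes_posteriors(priors, likelihoods):
    """Return (joint probabilities, probability of the test result, posteriors).

    priors: mapping of true state -> P(Hi)
    likelihoods: mapping of true state -> P(test result | Hi)
    """
    joint = {state: priors[state] * likelihoods[state] for state in priors}
    evidence = sum(joint.values())  # P(T+) or P(T−), depending on the likelihoods
    posteriors = {state: joint[state] / evidence for state in joint}
    return joint, evidence, posteriors


priors = {"disease": 0.5, "no disease": 0.5}

# Positive test: P(T+|disease) = S = 0.90, P(T+|no disease) = 1 − Sp = 0.70.
joint_pos, p_pos, post_pos = bayes_posteriors(priors, {"disease": 0.90, "no disease": 0.70})

# Negative test: P(T−|disease) = 1 − S = 0.10, P(T−|no disease) = Sp = 0.30.
joint_neg, p_neg, post_neg = bayes_posteriors(priors, {"disease": 0.10, "no disease": 0.30})

print(f"P(T+) = {p_pos:.4f}, P(disease | T+) = {post_pos['disease']:.4f}")
print(f"P(T−) = {p_neg:.4f}, P(no disease | T−) = {post_neg['no disease']:.4f}")
```

The same function handles both halves of the table because only the conditional probabilities change between a positive and a negative result.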

The a posteriori probability of having the disease given a positive test result, or PPV, is one obvious measure of the evidence provided by the test. Other things being equal, tests with high specificity (few false positives) tend to have a high PPV. However, unlike sensitivity or specificity (which might be termed “pure characteristics” of the test), the PPV is also a function of the characteristics of the population under study; PPV is a function of the prevalence. In the numerical example given in Table 4, the prevalence was assumed to be 0.5 (i.e. 50% of the population or subpopulation had the disease). Figure 2 shows how the PPV, NPV and accuracy depend upon the assumed prevalence Π in the population being screened. As can be seen, both PPV and accuracy decrease (sharply in the case of PPV) as Π decreases from the base case assumption of 0.50. Conversely, the NPV increases as the prevalence decreases.

To help place the content of Figure 2 in perspective, note that if the prevalence, Π, were as low as 0.16, the Positive Predicted Value, PPV, would be only 0.2. Put another way, a subject who tested positive under these circumstances would have an 80% chance of not having the disease! And, if Π were as low as 0.08, there would be a 90% chance that a subject with a positive test would be disease free. If the consequences of a positive test (e.g. worry, invasive or expensive and unnecessary follow-up procedures) were substantial, this would not be a satisfactory screening test. Thus, the quantity 1 – PPV might aptly be termed the regret probability.
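The prevalence values quoted here can be recovered by solving the PPV formula for Π, which gives Π = PPV(1 − Sp)/(S(1 − PPV) + PPV(1 − Sp)). The sketch below applies that rearrangement; the function name `prevalence_for_ppv` is hypothetical, and the sensitivity and specificity are the Table 4 values (S = 0.9, Sp = 0.3).

```python
def prevalence_for_ppv(target_ppv, sensitivity, specificity):
    """Prevalence Π at which a test with the given S and Sp attains the target PPV."""
    false_pos_rate = 1 - specificity
    return target_ppv * false_pos_rate / (
        sensitivity * (1 - target_ppv) + target_ppv * false_pos_rate
    )


for ppv in (0.2, 0.1):
    prev = prevalence_for_ppv(ppv, sensitivity=0.9, specificity=0.3)
    print(f"PPV = {ppv:.1f} (regret = {1 - ppv:.1f}) at prevalence of about {prev:.3f}")
```

With these inputs the two prevalences come out at roughly 0.16 and 0.08, consistent with the figures in the text.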

Figure 3 shows how the regret (so defined) varies with both the prevalence and specificity, when the sensitivity is held constant at 0.90. Looking at Figure 3, you can see how the likelihood that a subject who tests positive actually is disease free changes as the prevalence changes. If the actual prevalence in the population were, say, 0.3, the regret would be approximately 0.7, and if the prevalence were as low as 0.08, the regret would be 0.9. This example illustrates the point that both technical parameters of the screening test and prevalence need to be considered.

Figure 4 shows the locus of points that have a constant regret (equal to 0.8) as a function of specificity and prevalence for values of sensitivity ranging from 0.7 through 0.9; sensitivity does not have much leverage in this example. Rather, the prevalence and specificity are the key variables.
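A constant-regret locus of this kind can be traced by setting the regret R = (1 − Π)(1 − Sp)/(ΠS + (1 − Π)(1 − Sp)) to a fixed value and solving for the specificity, which gives Sp = 1 − RΠS/((1 − R)(1 − Π)). The sketch below tabulates that relation; the function name `specificity_for_regret` and the grid of prevalences are assumptions chosen for illustration and are not taken from Figure 4.

```python
def specificity_for_regret(regret, prevalence, sensitivity):
    """Specificity at which a positive test carries the given regret (1 − PPV)."""
    false_pos_rate = regret * prevalence * sensitivity / ((1 - regret) * (1 - prevalence))
    # A result below 0 means no specificity can make the regret that high
    # at this prevalence and sensitivity.
    return 1 - false_pos_rate


for sens in (0.7, 0.8, 0.9):
    row = [round(specificity_for_regret(0.8, prev, sens), 3) for prev in (0.05, 0.10, 0.15)]
    print(f"S = {sens}: specificity at prevalence 0.05, 0.10, 0.15 -> {row}")
```

Tabulating the locus for several sensitivities side by side is one way to judge how much each parameter moves the curve.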

Although the example given in Table 4 is hypothetical, it is relevant to many actual tests. One is described below.

Low dose computed tomography (LDCT) has been proposed as a screening test for lung cancer. Several studies (Humphrey et al., 2013; Tiitola et al., 2002) have shown that this technique has a high sensitivity for detecting nodules (ranging from 80% to nearly 100%; Humphrey et al., 2013). The National Lung Screening Trial (NLST) Research Team (2011) reported an estimated reduction in mortality of approximately 20% from use of LDCT screening compared to alternative test strategies. Subjects included in the study population were between 55 and 74 years of age at the time of randomization, had a history of cigarette smoking of at least 30 pack-years, and, if former smokers, had quit within the previous 15 years. These criteria were used in an attempt to define a population with a relatively high prevalence and thus a high Positive Predictive Value:

  • Smokers (and those who quit quite recently) are included in the population under test because although lung cancer has multiple risk factors, it is estimated that 85–90% of all cases are attributed to smoking (Ruano-Ravina et al., 2013; Samet et al., 2009).

  • The age range of the cohort is relevant because older smokers presumably have experienced a greater dose of carcinogens and, as noted by the American Cancer Society:11

  “Lung cancer mainly occurs in older people. About 2 out of 3 people diagnosed with lung cancer are 65 or older; fewer than 2% of all cases are found in people younger than 45. The average age at the time of diagnosis is about 70.”

Based on the findings of this and other research teams, a nationwide screening program was proposed and has been endorsed by several organizations, e.g. the American Lung Association (ALA, 2012), the US Preventive Services Task Force (2013), the American College of Chest Physicians (ACCP) and the American Society of Clinical Oncology (ASCO).

However, this and earlier proposals for LDCT screening have also had numerous critics or skeptics (American Academy of Family Physicians (AAFP) 2014; Heffner & Silvestri, 2002; Ruano-Ravina et al., 2013; Silvestri, 2011; Vansteenkiste et al., 2012), some arguing that the estimated benefits of LDCT screening in reducing mortality are uncertain, lower than estimated or absent (Bach et al., 2007, 2012; Black, 2000; Oken et al., 2011; Pastorino et al., 2012; Saghir et al., 2012), others that the procedure is not cost-effective (Mahadevia et al., 2003), and yet others that the radiation risks might be excessive (Brenner, 2004).

One of the major concerns about the use of LDCT even among advocates (Marshall et al., 2013) is that LDCT detects a large number of benign but uncalcified pulmonary nodules – properly termed false positives – that are challenging to diagnose (MacRedmond et al., 2006; Nawa et al., 2002; Patz et al., 2004; Swensen et al., 2002, 2003, 2005) and which create other problems depending upon what is done as part of the follow-up to a positive test (Wiener et al., 2011). As noted by Diederich (2008):

Many pulmonary nodules even in smokers are due to benign lesions such as granulomas and hamartomas.12

In short, although this test is highly sensitive, it has a low specificity. Table 5 provides estimates of the false positive rate (benign nodules discovered by CT scans) as reported in several studies – even those that favor routine screening of this population subgroup. The calculated or reported false positive rates shown in Table 5 vary substantially among the studies;13 some of the differences can be explained by different criteria for defining a positive (e.g. size of the nodule that is classified as a positive) and whether or not multiple LDCTs were used (and the criteria for a positive on multiple tests) as part of the procedure. Despite this variability, it is apparent that most reported estimates of false positive probabilities are quite high. The study by van Klaveren et al. (2009) reports a false positive probability very much lower than the other results depicted in Table 5. The actual test and decision criteria developed by these investigators differed from others. Specifically, they used a mathematical model to evaluate a non-calcified nodule according to its volume or volume-doubling time. Growth was defined as an increase in volume of at least 25% between two scans. The first-round screening test used by these investigators was considered to be negative if the volume of a nodule was less than 50 mm³, if it was 50 to 500 mm³ but had not grown by the time of the 3-month follow-up CT, or if, in the case of those that had grown, the volume-doubling time was 400 days or more.

Another concern of critics of the NLST is that it might be difficult to generalize the results to community practices. Silvestri (2011), for example, wrote:

Participants in the NLST were enrolled in tertiary care hospitals with expertise in all aspects of cancer care. [LDCT] studies were interpreted by dedicated chest radiologists with expertise in characterizing nodules and providing appropriate recommendations for follow up. As a result, few patients required invasive testing and radiographic follow-up was sufficient for many patients.

However, community radiologists without expertise in evaluating lung nodules may feel compelled to advise invasive testing for a screening-detected nodule. Of the 26 309 persons randomly assigned to chest CT screening in the NLST, 7191 (27%) had an abnormal finding. Most scans (96.4%) yielded false-positive results that were followed by serial radiography. Variation in how nodules are managed could lead to a substantial increase in transthoracic needle aspiration of lung nodules, unnecessary surgery, additional morbidity and even mortality for some persons who never had cancer to begin with.

From the data given in Table 5, it is clear that a conservative estimate of the false positive probability is at least 0.7, which means that the specificity of this test is at most 0.3 – the value assumed in the hypothetical example given in Table 4 – and might be much lower. Thus, even for the potentially high risk group of elderly heavy cigarette smokers included in the screening trials, the Positive Predictive Value of the test is not likely to be high.

There is some discrepancy in reported PPVs for the NLST; according to Humphrey et al. (2013), calculated positive predictive values (PPVs) for abnormal screening results ranged from 2.2% to 36.0%, while Ruano-Ravina et al. (2013) report the PPV for the NLST as only 3.6%. Ruano-Ravina et al. (2013) have summarized the PPVs for 14 other LDCT investigations. Including their estimate for the NLST, PPVs in these tests range from 0.028 to 0.115 with a median value of 0.053 and an arithmetic mean of 0.064, meaning that the probability that someone with a single positive test does not have lung cancer ranges from 0.885 to 0.972!
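As a rough cross-check, the 3.6% figure is consistent with the NLST counts quoted from Silvestri (2011) above. The sketch below simply redoes that arithmetic. The variable names are ours, and applying the 96.4% false positive share to all 7191 abnormal findings is a simplifying assumption.

```python
# Counts quoted by Silvestri (2011) for the CT arm of the NLST.
assigned_to_ct = 26_309
abnormal_findings = 7_191
false_positive_share = 0.964  # share of abnormal findings that were false positives

abnormal_fraction = abnormal_findings / assigned_to_ct
ppv_estimate = 1.0 - false_positive_share
# Simplifying assumption: the share applies uniformly to every abnormal finding.
implied_true_positives = abnormal_findings * ppv_estimate

print(f"abnormal findings: {abnormal_fraction:.1%} of those screened")
print(f"implied PPV: {ppv_estimate:.1%}")
print(f"implied true positives: about {implied_true_positives:.0f}")
```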

Kovalchik et al. (2013) examined how the reduction in lung cancer mortality as reported by the NLST varied with the estimated risk based on a prediction model using age, body-mass index, family history of lung cancer, pack-years of smoking, years since smoking cessation and emphysema diagnosis. Based on model predictions, they divided the study population into quintiles based on a predicted 5-year risk of lung cancer. They analyzed the NLST data and found:

Screening with low-dose CT prevented the greatest number of deaths from lung cancer among participants who were at highest risk and prevented very few deaths among those at lowest risk. These findings provide empirical support for risk-based targeting of smokers for such screening.

This finding highlights the importance of identifying the target population that is likely to benefit most from the screening procedure.

Overdiagnosis is another factor to consider in assessing the merits of LDCT cancer screening. This is because although screening has a high sensitivity and potential to detect aggressive tumors, screening will also detect indolent tumors that otherwise might not cause immediate clinical symptoms. Patz et al. (2014) used data from the NLST to estimate that more than 18% of all lung cancers detected by LDCT seemed to be more indolent, and the potential of overdiagnosis should be considered when describing the risks of LDCT for lung cancer.

Depending upon what is done in terms of follow up in the event of a positive screening test result, the impact of false positives could be substantial. Wiener et al. (2011) determined population-based estimates of risks of complications following transthoracic needle biopsy of a pulmonary nodule. Using data on 15 865 adults (in California, Florida, Michigan and New York) who underwent transthoracic needle biopsy of a pulmonary nodule, this group determined the percentage of biopsies complicated by hemorrhage, any pneumothorax and pneumothorax requiring chest tube. They also computed adjusted odds ratios for these complications associated with various biopsy characteristics, using multivariable population-averaged generalized estimating equations.

These investigators reported:

Although hemorrhage was rare, complicating 1.0% (95% CI 0.9–1.2%) of biopsies, 17.8% (95% CI 11.8–23.8%) of patients with hemorrhage required a blood transfusion. In contrast, the risk of any pneumothorax was 15.0% (95% CI 14.0–16.0%), and 6.6% (95% CI 6.0–7.2%) of all biopsies resulted in a pneumothorax requiring chest tube. Compared to patients without complications, those who experienced hemorrhage or pneumothorax requiring chest tube had longer lengths of stay (p < 0.001) and were more likely to develop respiratory failure requiring mechanical ventilation (p = 0.02). Patients aged 60–69 years (as opposed to younger or older patients), smokers and those with chronic obstructive pulmonary disease had higher risk of complications.

It is apparent from these results that the consequences of false positives are potentially material.

Based largely on concerns over the high false positive rate, the Medicare Evidence Development and Coverage Advisory Committee (MEDCAC) in the United States recently recommended against covering the procedure for this patient group based on a lack of evidence to support the benefits of the screening test (http://www.aafp.org/news/health-of-the-public/20140521medcacctrec.html). MEDCAC makes recommendations, not decisions; as noted by the Centers for Medicare and Medicaid Services (http://www.cms.gov/Regulations-and-Guidance/Guidance/FACA/MEDCAC.html):

The MEDCAC reviews and evaluates medical literature, technology assessments, and examines data and information on the effectiveness and appropriateness of medical items and services that are covered under Medicare, or that may be eligible for coverage under Medicare. The MEDCAC judges the strength of the available evidence and makes recommendations to CMS based on that evidence.

The merits of this screening test are likely to be reviewed by other panels and the MEDCAC recommendation may ultimately be reversed – the Centers for Medicare & Medicaid Services (CMS) is expected to issue a proposed decision on the issue by November 2014, and a final decision in February 2015. The decision may be made on policy grounds but, from a scientific perspective, the ultimate outcome is likely to hinge on judgments about the key parameters: the prevalence in the population, the high false positive rate and, ultimately, the low PPV (Nelson, 2009; Phend, 2014; US Preventive Services Task Force, 2013).

As a related example, we were asked by ECFIA (a trade association of manufacturers of high temperature insulating wools) to comment on the suitability of routine use of LDCT scans in a medical surveillance program for workers of all ages (including both smokers and non-smokers) engaged in the manufacture of refractory ceramic fiber (RCF) and other high temperature insulating wools in France. The available results of a mortality study of these workers in two US plants do not indicate any increase over baseline cancer rates (LeMasters et al., 2003; Utell & Maxim, 2010), so the prevalence of lung cancer in this population is not likely to be high. A further reason is that most of the employed population is substantially younger than those included in the NLST (indeed, the retirement age in France is 60–62 depending upon what age the employee entered the workforce) and not all employees are smokers, let alone heavy smokers.

According to data from SEER (see http://www.cancer.org/cancer/cancerbasics/lifetime-probability-of-developing-or-dying-from-cancer) the lifetime probability of contracting lung cancer among American males (including both smokers and non-smokers) is approximately 7.6%. Taking this as an estimate applicable to the French population prevalence14 and using the sensitivity and specificity values from Table 4, the Positive Predictive Value of CT lung cancer screening is approximately 0.1. (Obviously it would be much lower for young men and non-smokers and higher among those nearing retirement and heavy smokers.) This means that the a posteriori probability (regret) that a subject who tests positive in a single CT scan does not have lung cancer is approximately 0.9. Despite the high probability that a subject with a positive test does not have lung cancer, these subjects would undergo whatever follow-up procedures might accompany such a test result. Members of this group would, at a minimum, suffer some mental distress and would face follow-up CT scans and possibly invasive procedures. This screening test would clearly be inappropriate for this group. In this context it is noteworthy that the American Lung Association's guidance document (ALA, 2012) that endorsed LDCT scans for older smokers also states:

Low-dose CT screening should NOT be recommended for everyone. [Emphasis in original.]

And, Bach et al. (2012) also endorsed LDCT screens for older smokers, but also recommended:15

For individuals who have accumulated fewer than 30 pack-years of smoking or are either younger than 55 years or older than 74 years, or individuals who quit smoking more than 15 years ago, and for individuals with severe comorbidities that would preclude potentially curative treatment, limit life expectancy or both, we suggest that CT screenings should not be performed.

Thus, regardless of whether one believes that LDCT is an appropriate screening test for the population of older smokers, it is not justified for a population with much lower prevalence or those who are not likely to benefit from a correct diagnosis, evaluation and treatment.
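The worker-cohort estimate given above (a PPV of roughly 0.1 at a prevalence of 7.6%) can be checked with natural frequencies. In the sketch below the cohort size is a hypothetical round number chosen for readability, the sensitivity of 0.90 is an assumed Table 4 value, and the specificity of 0.30 is the Table 4 value cited in the text.

```python
cohort = 10_000       # hypothetical cohort size, for illustration only
prevalence = 0.076    # lifetime probability quoted in the text
sensitivity = 0.90    # assumed Table 4 value
specificity = 0.30    # Table 4 value cited in the text

with_cancer = cohort * prevalence
without_cancer = cohort - with_cancer

true_positives = with_cancer * sensitivity
false_positives = without_cancer * (1.0 - specificity)

ppv = true_positives / (true_positives + false_positives)
regret = 1.0 - ppv

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"PPV = {ppv:.3f}, regret = {regret:.3f}")
```

Expressed this way, most positive results in such a cohort come from the much larger group of workers without cancer, which is why the regret is so high.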

Some screening tests are designed as “once-off” tests, but many are intended to be administered periodically, such as annually. For example, mammography and clinical breast examination have been proposed for screening for breast cancer. As Elmore et al. (1998) wrote:

If a woman undergoes annual screening beginning at the age of 40, she will have had 60 opportunities for a false positive result by the age of 70, with 30 mammograms and 30 clinical breast examinations. The cumulative lifetime risk from her having a result from a screening test that requires further workup, even though no breast cancer is present, is not known…It is important to determine the cumulative risk of false positive tests, because women are advised to have breast-cancer screening every 1–2 years over several decades of their lifetimes, and false positive rates can provoke anxiety, increase costs and cause morbidity.

Thus, in evaluating periodic screening, it is necessary to measure or calculate cumulative probabilities. Care must be taken because the results of multiple tests may not be independent events.
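If successive tests were independent and had a constant per-test false positive probability p, the cumulative probability of at least one false positive in n tests would be 1 − (1 − p)^n. The sketch below evaluates this expression. The per-test probability is a hypothetical value chosen only for illustration and, as noted above, repeat tests on the same subject are unlikely to be truly independent, so the result should be read as a rough guide.

```python
def cumulative_false_positive(p_per_test, n_tests):
    """Probability of at least one false positive in n independent tests."""
    return 1.0 - (1.0 - p_per_test) ** n_tests


P_HYPOTHETICAL = 0.05  # assumed per-test false positive probability (not from the text)

for n in (1, 10, 30, 60):
    risk = cumulative_false_positive(P_HYPOTHETICAL, n)
    print(f"{n:2d} tests: cumulative false positive probability = {risk:.3f}")
```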

As the PPV of a screening test depends critically on the prevalence of the disease in the population, it is important to identify criteria to define a population group or subgroup with a high disease incidence to begin with. As noted above, this is why the LDCT program was limited to older smokers. Lung cancer rates increase with age and the vast majority of lung cancers occur in smokers. This is potentially a reasonable population subgroup for screening.

To illustrate the selection of a relevant population subgroup, we use an example from a study of breast cancer screening. Kerlikowske et al. (1993) reported on a cross-sectional study of 31 814 women aged 30 years and older referred for mammography at the University of California. They segmented the population into women of various age groups with and without a family history of breast cancer. Figure 5 shows a bar chart of the estimated PPVs for these groups. These investigators found that five times as many cancers per 1000 first-screening mammographic examinations were diagnosed in women aged 50 years or older compared with women aged less than 50 years. The highest PPVs for mammography were for older women with a family history of breast cancer. This finding guided their recommendation.

Possible criteria for defining a population subgroup include various demographic factors (age, gender, race and country), known risk factors (e.g. smoking), medical history and occupation. For screening to be highly effective, the prevalence in the population should be as high as is practicable. Harper et al. (2000) provide additional comments on the importance of the study population.

It is noted above that there may be opportunities to design a screening test that has different combinations of sensitivity and specificity. If so, there are opportunities to design the test to possess characteristics that are superior in terms of the combination of possible consequences of false positives and false negatives. For example, in the LDCT test, the threshold for size (mm) of the nodules or other characteristics (e.g. solid or semisolid nodules) might be varied (Lam et al., 2013), choices that would alter the sensitivity or specificity. Thompson et al. (2005) and Zou et al. (2007) offer other relevant examples of ROC curves.

Figure 6 shows a typical receiver operating characteristic (ROC)16 curve for a prostate specific antigen (PSA) test administered to men aged 70 or more.

Each subject was tested and a specific PSA score determined. Subjects were also administered digital rectal examinations and biopsies – those with positive biopsies were used as the gold standard for assessment of disease status. A series of possible PSA cutoff scores (measured in nanograms per milliliter, ng/ml) were considered for the screening test. Each cutoff score resulted in a partitioning of subjects into those who tested positive and those who tested negative. Knowing the actual disease status of the subjects enabled calculation of the sensitivity and specificity of the test. The ROC curve plots the calculated sensitivity against the false positive error (1 – Sp). Thus, each plotted point on the curve represents a different possible screening test with its own sensitivity and specificity. By considering the consequences of false positives and false negatives, it is possible to determine a cutoff value for the PSA test that is optimal in some sense. One statistic often used to characterize the ROC is the area under the curve (AUC). A perfectly discriminatory ROC would have an AUC = 1.0. The value for the PSA tests studied by Thompson et al. (2005) was 0.678.
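To make the construction concrete, the sketch below builds an ROC curve from a small set of scores and computes the AUC by the trapezoidal rule. The PSA values and biopsy results are invented for illustration and are not data from Thompson et al. (2005); the function names are ours.

```python
def roc_points(scores, diseased):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cutoff.

    A subject tests positive when the score is at or above the cutoff.
    """
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nobody tests positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, diseased) if d and s >= cutoff)
        fp = sum(1 for s, d in zip(scores, diseased) if not d and s >= cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical PSA scores (ng/ml) and biopsy results (1 = disease present).
psa = [1.2, 2.5, 3.1, 4.0, 4.4, 6.3, 7.8, 10.5]
biopsy = [0, 0, 1, 0, 0, 1, 1, 1]

curve = roc_points(psa, biopsy)
for fpr, sens in curve:
    print(f"1 - Sp = {fpr:.2f}, Se = {sens:.2f}")
print(f"AUC = {auc(curve):.3f}")
```

Each printed pair corresponds to one possible screening test, that is, one choice of cutoff.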

Thompson et al. (2005) also considered using a so-called Gleason score17 with a cutoff of 8 or more in this population. Figure 7 shows the ROC curve (topmost curve) for this possible screening test. As can be seen, this series of tests dominates the tests based upon PSA score alone (the AUC in this case is 0.827). The dashed line in Figure 7 shows the ROC curve that would occur under chance alone.

Whether or not and for whom PSA screening is appropriate requires the same sort of analysis noted for the LDCT screening evaluation. The ROC curve is just one piece of the puzzle, but this type of analysis shows that it is possible to design a screening test with several alternative combinations of sensitivity and specificity.

A complete specification of a screening test includes the intrinsic test characteristics (sensitivity, specificity and cost) and ROC curve (if multiple tests are possible), characteristics of the subject population (including opportunities for segmenting the population to identify high risk groups), the key derived quantities (PPV and NPV) and the consequences of false positives and negatives.

Screening tests have the potential to be a cost effective means for identifying subjects with early stage (and thus potentially more treatable) disease before symptoms develop and therefore, for saving lives. The ideal screening test would discriminate perfectly between those who have or do not have the disease and be inexpensive and not invasive. In practice, screening tests exhibit false positives and false negatives – errors with consequences that need to be carefully considered when evaluating the advantages and disadvantages of the test.

The predictive value of the test depends in part on the technical parameters of the test, including the sensitivity and specificity, but also on the prevalence of the disease in the population. For this reason, it is necessary to be able to define the population to be tested so that the prevalence is high. This is why mammography is appropriate only for older women and those with a family history of breast cancer and why lung CT scans are not appropriate for screening the general population.

With some screening tests it is possible to alter the test decision criterion to change the balance between sensitivity and specificity, in which case it may be possible to develop an optimal screening test.
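One way to make “optimal” concrete is to choose the cutoff that minimizes the expected misclassification cost per person screened, Π(1 − Se)C_FN + (1 − Π)(1 − Sp)C_FP. The sketch below does this for a handful of candidate cutoffs. The cutoffs, their sensitivities and specificities, the prevalence and the two cost weights are all hypothetical values chosen for illustration.

```python
# (cutoff, sensitivity, specificity): hypothetical operating points on an ROC curve.
CANDIDATES = [
    (2.0, 0.95, 0.30),
    (4.0, 0.80, 0.60),
    (6.0, 0.60, 0.85),
    (10.0, 0.30, 0.97),
]


def expected_cost(se, sp, prev, cost_fn, cost_fp):
    """Expected misclassification cost per person screened."""
    return prev * (1.0 - se) * cost_fn + (1.0 - prev) * (1.0 - sp) * cost_fp


def best_cutoff(candidates, prev, cost_fn, cost_fp):
    """Return the cutoff whose operating point has the lowest expected cost."""
    best = min(
        candidates,
        key=lambda c: expected_cost(c[1], c[2], prev, cost_fn, cost_fp),
    )
    return best[0]


# A missed case is assumed to cost 10 times, then 50 times, a false alarm.
print(best_cutoff(CANDIDATES, prev=0.10, cost_fn=10.0, cost_fp=1.0))
print(best_cutoff(CANDIDATES, prev=0.10, cost_fn=50.0, cost_fp=1.0))
```

With these invented numbers, raising the relative cost of a missed case moves the preferred cutoff toward the more sensitive, less specific end of the curve.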

Nonetheless, screening of asymptomatic populations is not always appropriate and could do more harm than good.18 Table 6 summarizes the circumstances/conditions when screening might be either appropriate or contra-indicated.

Table 6. Circumstances/conditions when screening might be appropriate or contra-indicated.

| Circumstances favoring screening | Circumstances when screening not appropriate |
| --- | --- |
| Disease constitutes a significant public health problem, meaning that it is a relatively common condition with significant morbidity and mortality or disease is contagious and might infect others before symptoms occur and disease detected. | Disease is rare or not serious or, if serious there is no effective treatment for disease. |
| The population to be screened can be so defined that the prevalence is high and there are no significant co-morbidities. | Unknown or low population prevalence |
| Treatment before symptoms occur is more effective than if treatment is delayed | No benefit to early treatment and/or significant likelihood of overdiagnosis (pseudodisease) |
| “Gold Standard” diagnostic exists and screening test sensitivity and specificity is high and based on adequate sample size | Screening test data is based on small sample sizes or is difficult to extrapolate to larger pool of screening centers with high sensitivity and specificity (e.g. high inter-observer variability) |
| Consequences of false negative or false positives are modest | Consequences of one or more of these errors significant |
| Screening test is inexpensive, easy to administer, not harmful and reliable | Any of these circumstances not met |
| There must be some mechanism for follow-up of subjects with positive screening results to ensure subsequent diagnostic testing and ultimate treatment takes place. | |

We appreciate the constructive comments offered by two anonymous reviewers. Their comments have improved this manuscript.

Table of test specificity and sensitivity results in the literature.

| References | Test Information | Specificity | Sensitivity |
| --- | --- | --- | --- |
| Menon et al. (2009) | Multi-modal and ultrasound for ovarian cancer; primary ovarian and tubal | 0.998 | 0.894 |
| | Primary invasive epithelial ovarian and tubal | 0.998 | 0.895 |
| | USS | 0.982 | 0.75 |
| | Citing van Nagell et al. (2007) | 0.987 | 0.763 |
| Grim et al. (1979) | Renal vascular hypertension screening test | 0.92 | 0.93 |
| Weiss et al. (1985) | HTLV-III (AIDS Agent) screening test | 0.986 | 0.973 |
| Stoll et al. (1999) | PTSD screening test | 0.975 | 0.77 |
| Kulasingam et al. (2002) | HPV testing thin-layer pap | 0.824 | 0.613 |
| | PCR | 0.788 | 0.882 |
| | Signal amplification | 0.726 | 0.908 |
| Perkins et al. (2001) | Peripheral neuropathy in Diabetes clinic: Vibration (on off) | 0.99 | 0.53 |
| | Monofilament | 0.96 | 0.77 |
| | Superficial pain | 0.97 | 0.59 |
| | Vibration (timed) | 0.98 | 0.8 |
| Deeks & Altman (2004) | Obstructive airway disease and >40 pack-years smoking | 0.986 | 0.284 |
| Doobay & Anand (2005) | ABI and stroke: CHD, Newman et al. (1999) | 0.908 | 0.163 |
| | Abbott et al. (2000) | 0.944 | 0.167 |
| | Stroke, Newman et al. (1999) | 0.908 | 0.17 |
| | Abbott et al. (2000) | 0.887 | 0.22 |
| | Tsai et al. (2001) | 0.972 | 0.092 |
| Schiffman et al. (2000) | HPV DNA testing for cervical cancer | 0.942 | 0.771 |
| | | 0.934 | 0.748 |
| Mayrand et al. (2007) | HPV for cervical cancer (conservative case) | 0.941 | 0.946 |
| | Pap for cervical cancer (conservative case) | 0.968 | 0.554 |
| Sabroe et al. (1999) | Autologous serum skin tests to screen for chronic idiopathic urticaria | 0.81 | 0.65 |
| | | 0.78 | 0.71 |
| Maisel et al. (2001) | B-natriuretic peptide for left ventricular dysfunction, 75 pg/mL BNP level | 0.98 | 0.86 |
| Shumway-Cook et al. (2000) | Probability of falls by timed up and go test | 0.87 | 0.87 |
| Ferreira et al. (1992) | Endomysial antibody screening for coeliac disease, four tests | 0.99 | 1 |
| | | 0.99 | 0.91 |
| | | 0.85 | 0.91 |
| | | 0.88 | 0.76 |
| Watson et al. (2002) | Various tests for Chlamydia: PCR cervix | 1 | 0.965 |
| | LCR urine | 1 | 0.875 |
| | EIA urine | 1 | 0.188 |
| | EIA cervix | 1 | 0.52 |
| | EIA cervix | 0.99 | 0.8 |
| | DNA probe | 0.96 | 0.72 |
| | LET urine | 0.808 | 0.778 |
| | EIA urine | 0.99 | 0.75 |
| | EIA cervix | 1 | 0.844 |
| | PCR cervix | 1 | 1 |
| | LCR urine | 1 | 0.96 |
| | EIA urine | 1 | 0.37 |
| | EIA cervix | 1 | 0.783 |
| | PCR cervix | 1 | 1 |
| | PCR cervix | 1 | 0.85 |
| | PCR cervix | 1 | 0.953 |
| | PCR cervix | 0.986 | 1 |
| | PCR urine | 0.986 | 0.923 |
| | LCR cervix | 0.997 | 0.886 |
| | PCR, EIA cervix | 0.997 | 0.97 |
| | LET | 0.91 | 0.41 |
| | LCR and LET urine | 0.949 | 0.589 |
| | PCR urine | 0.997 | 0.82 |
| | PCR cervix | 0.998 | 0.82 |
| | PACE2 cervix | 1 | 0.795 |
| | PCR urine | 0.99 | 0.85 |
| | DFA cervix | 0.96 | 0.85 |
| | LCR urine | 1 | 0.882 |
| | EIA | 1 | 0.84 |
| | PCR cervix | 0.998 | 0.992 |
| | LCR, PCR | 1 | 0.93 |
| | LCR, PCR | 0.996 | 0.62 |
| | DFA cervix | 0.995 | 0.778 |
| | PCR cervix | 1 | 0.714 |
| | EIA cervix | 1 | 0.647 |
| | PCR urine | 0.993 | 0.895 |
| Arbyn et al. (2008) | Five cervical cancer screening tests (Table 3): VIA | 0.836 | 0.887 |
| | VILI | 0.832 | 0.957 |
| | VIAM | 0.855 | 0.826 |
| | Pap Smear | 0.985 | 0.651 |
| | HC2 | 0.93 | 0.721 |
| Legro et al. (1998) | Fasting glucose to insulin ratio to measure insulin sensitivity | 0.84 | 0.95 |
| Schroeder et al. (1999) | Noninvasive determination of endothelium-mediated vasodilation: Coronary artery disease | 0.81 | 0.71 |
| | Angina pectoris | 0.571 | 0.824 |
| Allison et al. (1996) | Four tests for colorectal-screening: Hemoccult II | 0.981 | 0.324 |
| | Hemoccult II Sensa | 0.875 | 0.712 |
| | Hemoselect | 0.952 | 0.672 |
| | Combined | 0.979 | 0.537 |
| Ewer et al. (2011) | Pulse oximetry screening for congenital heart defects: Critical cases | 0.9912 | 0.75 |
| | All major cases | 0.9916 | 0.4906 |
| Boppana et al. (2011) | Saliva polymerase chain reaction assay for cytomegalovirus: Liquid Saliva | 0.999 | 1 |
| | Dried Saliva | 0.999 | 0.974 |
| Whitlock et al. (2008) | Several colorectal cancer screening tests | 0.94 | 0.85 |
| | | 0.944 | 0.688 |
| | | 0.91 | 0.875 |
| | | 0.949 | 0.865 |
| | | 0.831 | 0.667 |
| | | 0.969 | 0.818 |
| | | 0.971 | 0.556 |
| | | 0.956 | 0.909 |
| Cuzick et al. (2013) | Six human papillomavirus tests: BD HPV | 0.843 | 0.975 |
| | Roche Cobas | 0.845 | 0.975 |
| | Qiagen Hybrid | 0.854 | 0.975 |
| | Abbott real time | 0.872 | 0.95 |
| | Gen-probe | 0.902 | 0.975 |
| | NorChip | 0.952 | 0.714 |
| Donovan et al. (2013) | Various tests for gestational diabetes: 50-G OGCT | 0.86 | 0.85 |
| | 50-G OGCT | 0.84 | 0.88 |
| | 50-G OGCT | 0.83 | 0.85 |
| | 50-G OGCT | 0.69 | 0.81 |
| | 50-G OGCT | 0.89 | 0.7 |
| | 50-G OGCT | 0.77 | 0.99 |
| | 50-G OGCT | 0.66 | 0.88 |
| | 50-G OGCT | 1 | 0.17 |
| | Fasting plasma glucose | 0.52 | 0.87 |
| | Fasting plasma glucose | 0.76 | 0.77 |
| | Fasting plasma glucose | 0.92 | 0.76 |
| | Fasting plasma glucose | 0.93 | 0.54 |
| | HbA 1c | 0.28 | 0.92 |
| | HbA 1c | 0.97 | 0.12 |
| | HbA 1c | 0.61 | 0.86 |
| | HbA 1c | 0.21 | 0.82 |
| Ng et al. (2013) | MRI and mammographic screening in survivors of Hodgkin Lymphoma: Mammogram | 0.93 | 0.68 |
| | MRI | 0.94 | 0.67 |
| | Both | 0.9 | 0.94 |
| Jafari et al. (2013) | Various tests for syphilis (imperfect reference) | | |
| | Determine, Serum | 0.9415 | 0.9004 |
| | Determine, Whole Blood | 0.9585 | 0.8632 |
| | SD Bioline, Serum | 0.9585 | 0.8706 |
| | SD Bioline, Whole Blood | 0.9795 | 0.845 |
| | Syphicheck, Serum | 0.9914 | 0.7448 |
| | Syphicheck, Whole Blood | 0.9958 | 0.7447 |
| | Visitect, Serum | 0.9645 | 0.8513 |
| | Visitect, Whole Blood | 0.9943 | 0.7426 |
| Salami et al. (2013) | Various tests for prostate cancer: Optimized | 0.9 | 0.8 |
| Kloten et al. (2013) | Various tests for blood-based breast cancer screening: RASSF1A UTIH5 | 0.73 | 0.54 |
| | RASSF1A DKK3 | 0.75 | 0.59 |
| | DKK3 ITIH5 | 0.94 | 0.4 |
| | RASSF1A DKK3 ITH5 | 0.72 | 0.67 |
| Firnhaber et al. (2013) | Cervical cancer screening methods in HIV positive women CIN 2+: Cytology (MD intern) | 0.681 | 0.755 |
| | HPV | 0.514 | 0.919 |
| | Cytology (RN intern) | 0.685 | 0.654 |
| Teertstra et al. (2009) | Breast tomosynthesis compared to mammography for detection of cancer: Mammography | 0.883 | 0.963 |
| | Tomosynthesis | 0.867 | 0.963 |
| Rafferty et al. (2013) | Breast tomosynthesis compared to mammography for detection of cancer: Mammography | 0.841 | 0.655 |
| | Mammography plus | 0.892 | 0.762 |
| | Tomosynthesis | 0.862 | 0.627 |
| | Mammography | 0.845 | 0.787 |
| | Mammography plus Tomosynthesis | | |
| Catalona et al. (1991) | Prostate-specific antigen in serum screening test: Rectal examination | 0.44 | 0.86 |
| | Ultrasonography | 0.27 | 0.92 |
| | Serum PSA | 0.59 | 0.79 |

1The basis for definition of the population might include age, gender, race, occupation, known medical condition or other risk factor (e.g. smoking).

2Diseases frequently begin before the onset of symptoms during a period sometimes referred to as the “detectable pre-clinical Phase” (DPCP).

3From this, it follows that the benefits of screening will be minimal if the disease has no cure (such as certain stage mesotheliomas) or if early detection does not materially improve chances for survival. In addition, depending upon the population under study, some diseases (sometimes termed pseudo diseases) are detected that do not affect mortality because the subject may die from another disease or event. This is termed overdiagnosis (see Black, 2000, for more detail).

4Screening tests for donated blood using nucleic acid amplification are now so efficient that the risk of human immunodeficiency virus and hepatitis C virus transmission through blood transfusion is estimated to be approximately 1 in 2 million (Stramer, 2007).

5See Coste & Pouchot (2003) for an extension in which the test results are permitted to fall into three zones: a positive, a negative and an intermediate “grey zone.” In principle, many test outcomes as well as sequential tests can be handled mathematically. We focus on the 2 × 2 table because it has proven useful and is easier to analyze.

6There are a few examples (e.g. certain tests for HIV) of screening tests with such high sensitivity and specificity that they are virtually a Gold Standard.

7The symbols T+ and T− denote the events that the test outcome is positive and negative, respectively. The symbols D+ and D− denote the events that the subject has or does not have the disease.

8Thus, Pr{D−} = 1−Π.

9It is beyond the scope of this article to consider optimal screening study designs, but it is appropriate to comment on one possible design, the case control design. As noted by Goetzinger & Odibo (2011): “It is important to highlight that the case control study design cannot be used to determine predictive values because these values are influenced by disease prevalence. Because cases and controls are selected for inclusion, the prevalence of the disease is, therefore, “fixed” by the study design. Reproducing a generalizable spectrum of patients also becomes difficult with this type of study design”.

10The width of these confidence intervals is small because of the assumed size of the population under test. Many studies, however, are conducted on few individuals and it is important to understand the consequences in terms of the likely precision of the estimates.

11See http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-key-statistics.

12A hamartoma is a benign, focal malformation that resembles a neoplasm in the tissue of its origin.

13This is obviously not desirable, but also not entirely unexpected. For example, Elmore et al. (2002) noted a variation in false positive rates ranging from 2.6% to 15.9% among radiologists interpreting mammograms.

14Male mortality rates from lung cancer are approximately the same in France and the United States (see http://www.oecd-ilibrary.org/docserver/download/8111101ec007.pdf?expires=1404337643&id=id&accname=guest&checksum=03F45C46CE1A31E393DD2EAFDF0157D3). Moreover, the 7.6% figure assumed for the prevalence is for an entire lifetime. The probability of contracting cancer through age 60 or 62 (when workers will retire) is certainly lower. Thus, this estimate probably overstates the actual prevalence for the worker cohort.

15 ACCP and ASCO have made essentially the same recommendation; see http://www.cancer.net/research-and-advocacy/asco-care-and-treatment-recommendations-patients/lung-cancer-screening.

16 ROC analysis emerged from the study of signal detection problems, that is, of differentiating signals from noise. It was first used by scientists in Britain during World War II, when radar receiver operators were assessed on their ability to differentiate signal (e.g. enemy aircraft) from noise (non-relevant targets). The term was later borrowed by statisticians assessing screening tests.

17 The Gleason score is a grading system for prostate cancer based on the microscopic appearance of the tumor.

18 For a discussion of ethical issues relevant to screening programs, see McQueen (2002) and WHO (2003).

This paper represents independent research and the authors are solely responsible for the content. Two of the authors (LDM and MJU) were asked by ECFIA to give an opinion on the use of CT scans for workers engaged in the production of high temperature insulating wools.

  • Abbott RD, Petrovich H, Rodriguez BL, et al. Ankle/brachial blood pressure in men >70 years of age and the risk of coronary heart disease. Am J Cardiol. 2000;86:280–4. [PubMed] [Google Scholar]
  • Achkar JM, Lawn SD, Moosa M-YS, et al. Adjunctive tests for diagnosis of tuberculosis: Serology, ELISPOT for site-specific lymphocytes, urinary lipoarabinomannan, string test, and fine needle aspiration. J Infect Dis. 2011;204:S1130–41. [PMC free article] [PubMed] [Google Scholar]
  • Álamo-Junquera D, Murat-Nascimento C, Maciá F, et al. Effect of false-positive results on reattendance at breast cancer screening programmes in Spain. Eur J Public Health. 2011;22:404–8. [PubMed] [Google Scholar]
  • Alberg AJ, Park JW, Hager BW, et al. The use of “overall accuracy” to evaluate the validity of screening or diagnostic tests. JGIM. 2004;19:460–5. [PMC free article] [PubMed] [Google Scholar]
  • Allison JE, Tekawa IS, Ranson LJ, Adrain AL. A comparison of fecal occult-blood tests for colorectal-cancer screening. N Engl J Med. 1996;334:155–9. [PubMed] [Google Scholar]
  • Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. Br Med J. 1994a;308:1552. [PMC free article] [PubMed] [Google Scholar]
  • Altman DG, Bland JM. Diagnostic tests 2: predictive values. Br Med J. 1994b;309:102. [PMC free article] [PubMed] [Google Scholar]
  • Altman DG, Bland JM. Diagnostic tests 3: receiver operating characteristic plots. Br Med J. 1994c;309:188. [PMC free article] [PubMed] [Google Scholar]
  • American Academy of Family Physicians. Evidence Lacking to Support or Oppose Low-dose CT Screening for Lung Cancer, Says AAFP. 2013. Available at http://www.aafp.org/news/health-of-the-public/20140113aafplungcarec.html [last accessed 30 June 2014]
  • American Lung Association (ALA) Providing guidance on lung cancer screening to patients and physicians. Washington, DC: American Lung Association; 2012. p. 35. April 23, 2012. Available online at: http://www.lung.org/lung-disease/lung-cancer/lung-cancer-screening-guidelines/lung-cancer-screening.pdf [last accessed 26 June 2014] [Google Scholar]
  • Arbyn M, Sankaranarayanan R, Muwonge R, et al. Pooled analysis of the accuracy of five cervical cancer screening tests assessed in eleven studies in Africa and India. Int J Cancer. 2008;123:153–60. [PubMed] [Google Scholar]
  • Baade PD, Fritschi L, Eakin EG. Non-cancer mortality among people diagnosed with cancer (Australia). Cancer Causes Control. 2006;17:287–97. [PubMed] [Google Scholar]
  • Bach PB, Jett JR, Pastorino U, et al. Computed tomography screening and lung cancer outcomes. JAMA. 2007;297:953–61. [PubMed] [Google Scholar]
  • Bach PB, Mirkin JN, Oliver TK, et al. Benefits and harms of CT screening for lung cancer: a systematic review. JAMA. 2012;307:2418–29. [PMC free article] [PubMed] [Google Scholar]
  • Bauman A. The epidemiology of clinical tests. Aust Prescr. 1990;13:62–4. [Google Scholar]
  • Black WC. Overdiagnosis: an under recognized cause of confusion and harm in cancer screening. J Nat Cancer Inst. 2000;92:1280–2. [PubMed] [Google Scholar]
  • Boppana SB, Ross SA, Shimamura M, et al. Saliva polymerase-chain-reaction assay for cytomegalovirus screening in newborns. New Engl J Med. 2011;364:2111–18. [PMC free article] [PubMed] [Google Scholar]
  • Brenner DJ. Radiation risks potentially associated with low-dose CT screening of adult smokers for lung cancer. RSNA. 2004;231:440–5. [PubMed] [Google Scholar]
  • Casscells W, Schoenberger A, Graboys TB. Interpretation by physicians of clinical laboratory results. N Engl J Med. 1978;299:999–1001. [PubMed] [Google Scholar]
  • Catalona WJ, Smith DD, Ratliff TL, et al. Measurement of prostate-specific antigen in serum as a screening test for prostate cancer. N Engl J Med. 1991;324:1156–61. [PubMed] [Google Scholar]
  • Centers for Disease Control and Prevention (CDC) Tuberculosis (TB): Testing and diagnosis. Centers for Disease Control and Prevention. 2013. Atlanta, GA, USA, 4 pp. Available online at: http://www.cdc.gov/TB/TOPIC/testing/default.htm [last accessed 5 August 2014]
  • Coste J, Pouchot J. A grey zone for quantitative diagnostic and screening tests. Int J Epidemiol. 2003;32:304–13. [PubMed] [Google Scholar]
  • Croswell JM, Baker SG, Marcus PW, et al. Cumulative incidence of false-positive test results in lung cancer screening: a randomized trial. Ann Intern Med. 2010;152:505–11. [PubMed] [Google Scholar]
  • Croswell JM, Kramer BS, Kreimer AR, et al. Cumulative incidence of false-positive results in repeated, multimodal cancer screening. Ann Fam Med. 2009;7:212–22. [PMC free article] [PubMed] [Google Scholar]
  • Cuzick J, Cadman L, Mesher D, et al. Comparing the performance of six human papillomavirus tests in a screening population. Br J Cancer. 2013;108:908–13. [PMC free article] [PubMed] [Google Scholar]
  • Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. Br Med J. 2004;329:168–9. [PMC free article] [PubMed] [Google Scholar]
  • Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. Br Med J. 2001;323:157–62. [PMC free article] [PubMed] [Google Scholar]
  • Dendukuri N. Evaluating diagnostic tests in the absence of a gold standard. 2011. Powerpoint presentation for Advanced TB Diagnostics Course, Montreal, July 2011. 54 slides. Available at: http://www.teachepi.org/documents/courses/tbdiagrx/day2/Dendukuri%20Diagnostic%20Tests%20in%20the%20Absence%20of%20a%20Gold%20Standard.pdf [last accessed 4 August 2014]
  • Diederich S, Wormanns D, Semik M, et al. Screening for early lung cancer with low-dose spiral CT: prevalence in 817 asymptomatic smokers. RSNA. 2002;222:773–81. [PubMed] [Google Scholar]
  • Diederich S. CT screening for lung cancer. Cancer Imaging. 2008;8:S24–6. [PMC free article] [PubMed] [Google Scholar]
  • Donovan L, Hartling L, Muise M, et al. Screening tests for gestational diabetes: a systematic review for the U.S. Preventive Services Task Force. Ann Intern Med. 2013;159:115–22. [PubMed] [Google Scholar]
  • Doobay A, Anand SS. Sensitivity and Specificity of the ankle-brachial index to predict future cardiovascular outcomes: a systematic review. Arterioscler Thromb Vasc Biol. 2005;25:1463–9. [PubMed] [Google Scholar]
  • Eddy DM. Chapter 18. Probabilistic reasoning in clinical medicine: problems and opportunities. In: Kahneman D, Slovic P, Tversky A, editors. Judgment under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press; 1982. pp. 249–67. [Google Scholar]
  • Elmore JG, Barton MB, Moceri VM, et al. Ten-year risk of false positive screening mammograms and clinical breast examinations. N Engl J Med. 1998;338:1089–96. [PubMed] [Google Scholar]
  • Elmore JG, Miglioretti DL, Reisch LM, et al. Screening mammograms by community radiologists: variability in false-positive rates. J Natl Cancer Inst. 2002;94:1373–80. [PMC free article] [PubMed] [Google Scholar]
  • Ewer AK, Middleton LJ, Furmston AT, et al. Pulse oximetry screening for congenital heart defects in newborn infants (PulseOx): a test accuracy study. Lancet. 2011;378:785–94. [PubMed] [Google Scholar]
  • Ferreira M, Davies SL, Butler M, et al. Endomysial antibody: is it the best screening test for celiac disease? Gut. 1992;33:1633–7. [PMC free article] [PubMed] [Google Scholar]
  • Firnhaber C, Mayisela N, Mao L, et al. Validation of cervical cancer screening methods in HIV positive women from Johannesburg, South Africa. PLoS One. 2013;8:e53494. [PMC free article] [PubMed] [Google Scholar]
  • Friedewald SM, Rafferty EA, Rose SL, et al. Breast cancer screening using tomosynthesis in combination with digital mammography. JAMA. 2014;311:2499–507. [PubMed] [Google Scholar]
  • Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am J Epidemiol. 1966;83:593–602. [PubMed] [Google Scholar]
  • Gastwirth JL. The statistical precision of medical screening procedures: application to polygraph and AIDS antibodies test data. Stat Sci. 1987;2:213–38. [Google Scholar]
  • Goetzinger KR, Odibo AO. Statistical analysis and interpretation of prenatal diagnostic imaging studies, Part 1: evaluating the efficiency of screening and diagnostic tests. J Ultrasound Med. 2011;30:1121–7. [PubMed] [Google Scholar]
  • Gohagan J, Marcus P, Fagerstrom R, et al. Baseline findings of a randomized feasibility trial of lung cancer screening with spiral CT scan vs chest radiograph; The lung screening study of the national cancer institute. Chest. 2004;126:114–21. [PubMed] [Google Scholar]
  • Gotzak-Uzan L, Jiminez W, Nofech-Mozes S, et al. Sentinel lymph node biopsy vs. pelvic lymphadenectomy in early stage cervical cancer: is it time to change the gold standard? Gynecol Oncol. 2010;116:28–32. [PubMed] [Google Scholar]
  • Greenhalgh T. Papers that report diagnostic or screening tests. BMJ. 1997;315:540–3. [PMC free article] [PubMed] [Google Scholar]
  • Grim CE, Luft FC, Weinberger MH, Grim CM. Sensitivity and specificity of screening tests for renal vascular hypertension. Ann Intern Med. 1979;91:617–22. [PubMed] [Google Scholar]
  • Grimes DA, Schulz KF. Uses and abuses of screening tests. Lancet. 2002;359:881–4. [PubMed] [Google Scholar]
  • Harper R, Henson D, Reeves BC. Appraising evaluations of screening/diagnostic tests: the importance of the study populations. Br J Ophthalmol. 2000;84:1198–202. [PMC free article] [PubMed] [Google Scholar]
  • Hawkins DM, Garrett JA, Stephenson B. Some issues in resolution of diagnostic tests using an imperfect gold standard. Technical Report 628, School of Statistics, University of Minnesota. Stat Med. 2001;20:1987–2001. [PubMed] [Google Scholar]
  • Heffner JE, Silvestri G. CT screening for lung cancer: is smaller better? Am J Respir Crit Care Med. 2002;165:433–7. [PubMed] [Google Scholar]
  • Henschke CI, Yankelevitz DF, Mirtcheva R, et al. CT screening for lung cancer: frequency and significance of part-solid and nonsolid nodules. AJR. 2002;178:1053–7. [PubMed] [Google Scholar]
  • Henschke CI, Yip R, Yankelevitz DF, Smith JP. Definition of a positive test result in computed tomography screening for lung cancer. Ann Int Med. 2013;158:246–52. [PubMed] [Google Scholar]
  • Herman C. What makes a screening exam “good”? Ethics J Am Med Assoc. 2006;8:34–7. [PubMed] [Google Scholar]
  • Humphrey LL, Deffebach M, Pappas M, et al. Screening for lung cancer with low-dose computed tomography: a systematic review to update the U.S. Preventive Services Task Force recommendation. Ann Int Med. 2013;159:411–20. [PubMed] [Google Scholar]
  • Jafari Y, Peeling RW, Shivkumar S, et al. Are Treponema pallidum specific rapid and point-of-care tests for Syphilis accurate enough for screening in resource limited settings? Evidence from a meta-analysis. PLoS One. 2013;8:e54695. [PMC free article] [PubMed] [Google Scholar]
  • Johnson WO, Gastwirth JL, Pearson LM. Screening without a “gold standard”: the Hui-Walter paradigm revisited. Am J Epidemiol. 2001;153:921–4. [PubMed] [Google Scholar]
  • Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141:263–72. [PubMed] [Google Scholar]
  • Karaağaoğlu E. Estimation of the prevalence of a disease from screening tests. Trends J Med Sci. 1999;29:429–30. [Google Scholar]
  • Kaufman PA, Bloom KJ, Burris H, et al. Assessing the discordance rate between local and central HER2 testing in women with locally determined HER2-negative breast cancer. Cancer. 2014;120:2657–64. [PMC free article] [PubMed] [Google Scholar]
  • Kerlikowske K, Grady D, Barclay J, et al. Positive predictive value of screening mammography by age and family history of breast cancer. JAMA. 1993;270:2444–50. [PubMed] [Google Scholar]
  • Kloten V, Becker B, Winner K, et al. Promoter hypermethylation of the tumor-suppressor genes ITIH5, DKK3, and RASSF1A as novel biomarkers for blood-based breast cancer screening. Breast Cancer Res. 2013;15:1–11. [PMC free article] [PubMed] [Google Scholar]
  • Kovalchik SA, Tammemagi M, Berg CD, et al. Targeting of low-dose CT screening according to the risk of lung-cancer death. N Engl J Med. 2013;369:245–54. [PMC free article] [PubMed] [Google Scholar]
  • Kulasingam SL, Hughes JP, Kiviat NB, et al. Evaluation of human papillomavirus testing in primary screening for cervical abnormalities: comparison of sensitivity, specificity, and frequency of referral. JAMA. 2002;288:1749–57. [PubMed] [Google Scholar]
  • Lalkhen AG, McCluskey A. Clinical tests: sensitivity and specificity. Cont Ed Anesth Crit Care Pain. 2008;9:221–3. [Google Scholar]
  • Lam S, McWilliams A, Mayo J, Tammemagi M. Computed tomography screening for lung cancer: what is a positive screen? Ann Int Med. 2013;158:289–90. [PubMed] [Google Scholar]
  • Legro RS, Finegood D, Dunaif A. A fasting glucose to insulin ratio is a useful measure of insulin sensitivity in women with polycystic ovary syndrome. J Clin Endocrinol Metabol. 1998;83:2694–8. [PubMed] [Google Scholar]
  • LeMasters GK, Lockey JE, Yiin JH, et al. Mortality of workers occupationally exposed to refractory ceramic fibers. J Occup Environ Med. 2003;45:440–50. [PubMed] [Google Scholar]
  • Levy PS, Kass EH. A three-population model for sequential screening for bacteriuria. Am J Epidemiol. 1970;91:148–54. [PubMed] [Google Scholar]
  • Lewis FI, Torgerson PR. A tutorial in estimating the prevalence of disease in humans and animals in the absence of a gold standard diagnostic. Emerg Themes Epidemiol. 2012;9:1–8. [PMC free article] [PubMed] [Google Scholar]
  • Li F, Sone S, Abe H, et al. Malignant versus benign nodules at CT screening for lung cancer: comparison of thin-section CT findings. Radiology. 2004;233:793–8. [PubMed] [Google Scholar]
  • Lu D, Fall K, Sparen P, et al. Suicide and suicide attempt after a cancer diagnosis among young individuals. Ann Oncol. 2013;24:3112–17. [PubMed] [Google Scholar]
  • MacRedmond R, McVey G, Lee M, et al. Screening for lung cancer using low dose CT scanning: results of 2 year follow up. Thorax. 2006;61:54–6. [PMC free article] [PubMed] [Google Scholar]
  • Mahadevia PJ, Fleisher LA, Frick KD, et al. Lung cancer screening with helical computed tomography in older adult smokers: a decision and cost-effectiveness analysis. JAMA. 2003;289:313–22. [PubMed] [Google Scholar]
  • Maisel AS, Koon J, Krishnaswamy P, et al. Utility of B-natriuretic peptide as a rapid, point-of-care test for screening patients undergoing echocardiography to determine left ventricular dysfunction. Am Heart J. 2001;141:367–74. [PubMed] [Google Scholar]
  • Manos D. CT screening for lung cancer: controversy and misconceptions. Oncology Exch. 2013;12:10–12. [Google Scholar]
  • Manrai AK, Bhatia G, Strymish J, et al. Medicine's uncomfortable relationship with math: calculating positive predictive value. JAMA Intern Med. 2014;174:991–3. [PMC free article] [PubMed] [Google Scholar]
  • Marshall HM, Bowman RC, Yang IA, et al. Screening for lung cancer with low-dose computed tomography: a review of current status. J Thor Dis. 2013;5:S524–39. [PMC free article] [PubMed] [Google Scholar]
  • Mayrand M-H, Duarte-Franco E, Rodrigues I, et al. Human Papillomavirus DNA versus Papanicolaou screening tests for cervical cancer. N Engl J Med. 2007;357:1579–88. [PubMed] [Google Scholar]
  • McQueen MJ. Some ethical and design challenges of screening programs and screening tests. Clin Chim Acta. 2002;315:41–8. [PubMed] [Google Scholar]
  • McWilliams A, Mayo J, MacDonald S, et al. Lung cancer screening: a different paradigm. Am J Respir Crit Care Med. 2003;168:1167–73. [PubMed] [Google Scholar]
  • Menon U, Gentry-Maharaj A, Hallett R, et al. Sensitivity and specificity of multimodal and ultrasound screening for ovarian cancer, and stage distribution of detected cancers: results of the prevalence screen of the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). Lancet Oncol. 2009;10:327–40. [PubMed] [Google Scholar]
  • Mertens L, Friedberg MK. The gold standard for noninvasive imaging in congenital heart disease: echocardiography. Curr Opin Cardiol. 2009;24:119–24. [PubMed] [Google Scholar]
  • National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365:395–409. [PMC free article] [PubMed] [Google Scholar]
  • Nawa T, Nakagawa T, Kusano S, et al. Lung cancer screening using low-dose spiral CT: results of baseline and 1-Year follow-up studies. Chest. 2002;122:15–20. [PubMed] [Google Scholar]
  • Nelson R. ASCO 2009: low-dose CT screening for lung cancer produces high rate of false positives. 2009. Available at: http://www.medscape.com/viewarticle/703909 [last accessed 9 June 2014]
  • Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med. 1998;17:857–72. [PubMed] [Google Scholar]
  • Newman AB, Shemanski L, Manolio TA, et al. Ankle-arm index as a predictor of cardiovascular disease and mortality in the Cardiovascular Health Study. Arterioscler Thromb Vasc Biol. 1999;19:538–45. [PubMed] [Google Scholar]
  • Ng AK, Garber JE, Diller LR, et al. Prospective study of the efficacy of breast magnetic resonance imaging and mammographic screening in survivors of Hodgkin Lymphoma. J Clin Oncol. 2013;31:2282–8. [PubMed] [Google Scholar]
  • Novello S, Fava C, Borasio P, et al. Three-year findings of an early lung cancer detection feasibility study with low-dose spiral computed tomography in heavy smokers. Ann Oncol. 2005;16:1662–6. [PubMed] [Google Scholar]
  • Oken MM, Hocking WC, Kvale PA, et al. Screening by chest radiograph and lung cancer mortality: the prostate, lung, colorectal, and ovarian (PLCO) randomized trial. JAMA. 2011;306:1865–73. [PubMed] [Google Scholar]
  • Pastorino U, Bellomi M, Landoni C, et al. Early lung-cancer detection with spiral CT and positron emission tomography in heavy smokers: 2-year results. Lancet. 2003;362:593–7. [PubMed] [Google Scholar]
  • Pastorino U, Rossi M, Rosato V, et al. Annual or biennial CT screening versus observation in heavy smokers: 5-year results of the MILD trial. Eur J Cancer Prev. 2012;21:308–15. [PubMed] [Google Scholar]
  • Patz EF, Pinsky P, Gatsonis C, et al. Overdiagnosis in low-dose computed tomography screening for lung cancer. JAMA Intern Med. 2014;174:269–74. [PMC free article] [PubMed] [Google Scholar]
  • Patz EF, Swensen SJ, Herndon JE II. Estimate of lung cancer mortality from low-dose spiral computed tomography screening trials: implications for current mass screening recommendations. J Clin Oncol. 2004;22:2202–6. [PubMed] [Google Scholar]
  • Pedersen JH, Ashraf H, Dirksen A, et al. The Danish randomized lung cancer CT screening trial – overall design and results of the prevalence round. J Thorac Oncol. 2009;4:609–14. [PubMed] [Google Scholar]
  • Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2003. [Google Scholar]
  • Perkins BA, Olaleye D, Zinman B, Bril V. Simple screening tests for peripheral neuropathy in the diabetes clinic. Diabetes Care. 2001;24:250–6. [PubMed] [Google Scholar]
  • Petticrew MP, Sowden AJ, Lister-Sharp D, Wright K. False-negative results in screening programmes: systematic review of impact and implications. Health Technol Assess. 2000;4:1–20. [PubMed] [Google Scholar]
  • Petticrew MP, Sowden AJ, Lister-Sharp D, Wright K. False-negative results in screening programmes: medical, psychological, and other implications. Int J Technol Assess Health Care. 2001;17:164–70. [PubMed] [Google Scholar]
  • Phend C. Medicare advisers say no to lung cancer screening. 2014. Available at: http://www.medpagetoday.com/Pulmonology/LungCancer/45512 [last accessed 9 June 2014]
  • Pickering TG, Hall JE, Appel LJ, et al. Recommendations for blood pressure measurements in humans and experimental animals: Part 1: blood pressure measurement in humans: a statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Hypertension. 2005;45:142–61. [PubMed] [Google Scholar]
  • Rafferty EA, Park JM, Philpotts LE, et al. Assessing radiologist performance using combined digital mammography and breast tomosynthesis compared with digital mammography alone: results of a multicenter multireader trial. Radiology. 2013;266:104–13. [PMC free article] [PubMed] [Google Scholar]
  • Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. Am J Epidemiol. 1978;107:71–6. [PubMed] [Google Scholar]
  • Ruano-Ravina A, Ríos MP, Fernández-Villar A. Cribado de cáncer de pulmón con tomografía computarizada de baja dosis después del National Lung Screening Trial. El debate continúa abierto. Arch Bronconeumol. 2013;49:158–65. [PubMed] [Google Scholar]
  • Rutjes AWS, Reitsma JB, Coomarasamy A, et al. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess. 2007;11:1–72. [PubMed] [Google Scholar]
  • Sabroe RA, Grattan CE, Francis DM, et al. The autologous serum skin test: a screening test for autoantibodies in chronic idiopathic urticaria. Br J Dermatol. 1999;140:446–52. [PubMed] [Google Scholar]
  • Saghir Z, Dirksen A, Ashraf H, et al. CT screening for lung cancer brings forward early disease. The randomized Danish Lung Cancer Screening Trial: status after five annual screening rounds with low-dose CT. Thorax. 2012;67:296–301. [PubMed] [Google Scholar]
  • Salami SS, Schmidt F, Laxman B, et al. Combining urinary detection of TMPRSS2:ERG and PCA3 with serum PSA to predict diagnosis of prostate cancer. Urol Oncol. 2013;31:566–71. [PMC free article] [PubMed] [Google Scholar]
  • Samet JM, Avila-Tang E, Boffetta P, et al. Lung cancer in never smokers: clinical epidemiology and environmental risk factors. Clin Cancer Res. 2009;15:5626–45. [PMC free article] [PubMed] [Google Scholar]
  • Schiffman M, Herrero R, Hildesheim A, et al. HPV DNA testing in cervical cancer screening: results from women in a high-risk province of Costa Rica. JAMA. 2000;283:87–93. [PubMed] [Google Scholar]
  • Schroeder S, Enderle MD, Ossen R, et al. Noninvasive determination of endothelium-mediated vasodilation as a screening test for coronary artery disease: pilot study to assess the predictive value in comparison with angina pectoris, exercise electrocardiography, and myocardial perfusion imaging. Am Heart J. 1999;138:731–9. [PubMed] [Google Scholar]
  • Shumway-Cook A, Brauer S, Woollacott M. Predicting the probability for falls in community-dwelling older adults using the timed up & go test. Phys Ther. 2000;80:896–903. [PubMed] [Google Scholar]
  • Silvestri GA. Screening for lung cancer: it works, but does it really work? Ann Intern Med. 2011;155:537–9. [PubMed] [Google Scholar]
  • Sone S, Li F, Yang Z-G, et al. Results of three-year mass screening programme for lung cancer using mobile low-dose spiral computed tomography scanner. Br J Cancer. 2001;84:25–32. [PMC free article] [PubMed] [Google Scholar]
  • Stoll C, Kapfhammer HP, Rothenhauser HB, et al. Sensitivity and specificity of a screening test to document traumatic experiences and to diagnose post-traumatic stress disorder in ARDS patients after intensive care treatment. Intensive Care Med. 1999;25:697–704. [PubMed] [Google Scholar]
  • Stramer SL. Current risks of transfusion-transmitted agents: a review. Arch Pathol Lab Med. 2007;131:702–7. [PubMed] [Google Scholar]
  • Swensen SJ, Jett JR, Hartman TE, et al. Lung cancer screening with CT: Mayo Clinic experience. RSNA. 2003;226:756–61. [PubMed] [Google Scholar]
  • Swensen SJ, Jett JR, Hartman TE, et al. CT screening for lung cancer: five-year prospective experience. RSNA. 2005;235:259–65. [PubMed] [Google Scholar]
  • Swensen SJ, Jett JR, Sloan JA, et al. Screening for lung cancer with low-dose spiral computed tomography. Am J Respir Crit Care Med. 2002;165:508–13. [PubMed] [Google Scholar]
  • Teertstra HJ, Loo CE, van den Bosch MAAJ, et al. Breast tomosynthesis in clinical practice: initial results. Eur Radiol. 2009;20:16–24. [PubMed] [Google Scholar]
  • Thejls H, Gnarpe J, Gnarpe H, et al. Expanded gold standard in the diagnosis of Chlamydia trachomatis in a low prevalence population: diagnostic efficacy of tissue culture, direct immunofluorescence, enzyme immunoassay, PCR and serology. Genitourin Med. 1994;70:300–3. [PMC free article] [PubMed] [Google Scholar]
  • Thompson IA, Ankerst DP, Chi C, et al. Operating characteristics of prostate-specific antigen in men with an initial PSA level of 3.0 ng/mL or lower. JAMA. 2005;294:66–70. [PubMed] [Google Scholar]
  • Tiitola M, Kivisaari L, Huuskonen MS, et al. Computed tomography screening for lung cancer in asbestos-exposed workers. Lung Cancer. 2002;35:17–22. [PubMed] [Google Scholar]
  • Toyoda Y, Nakayama T, Kusunoki Y, et al. Sensitivity and specificity of lung cancer screening using chest low-dose computed tomography. Br J Cancer. 2008;98:1602–7. [PMC free article] [PubMed] [Google Scholar]
  • Troy LM, Michels KB, Hunter DJ, et al. Self-reported birthweight and history of having been breastfed among younger women: an assessment of validity. Int J Epidemiol. 1996;25:122–7. [PubMed] [Google Scholar]
  • Tsai AW, Folsom AR, Rosamon W, Jones DW. Ankle-brachial index and 7-year ischemic stroke incidence. The ARIC Study. Stroke. 2001;32:1721–4. [PubMed] [Google Scholar]
  • US Preventive Services Task Force. Screening for Lung Cancer; U.S. Preventive Services Task Force Recommendation Statement. 2013. Available at http://www.uspreventiveservicestaskforce.org/uspstf13/lungcan/lungcanfinalrs.htm [last accessed 30 June 2014]
  • Utell MJ, Maxim LD. Refractory ceramic fiber (RCF) toxicity and epidemiology: a review. Inhal Toxicol. 2010;22:500–21. [PubMed] [Google Scholar]
  • van Klaveren RJ, Oudkerk M, Prokop M, et al. Management of lung nodules detected by volume CT scanning. N Engl J Med. 2009;361:2221–9. [PubMed] [Google Scholar]
  • van Nagell JR, DePriest PD, Ueland FR, et al. Ovarian cancer screening with annual transvaginal sonography: findings of 25,000 women screened. Cancer. 2007;109:1887–96. [PubMed] [Google Scholar]
  • van Smeden M, Naaktgeboren CA, Reitsma JB, et al. Latent class models in diagnostic studies when there is no reference standard – a systematic review. Am J Epidemiol. 2014;179:423–31. [PubMed] [Google Scholar]
  • Vansteenkiste J, Dooms C, Mascaux C, Nackaerts K. Screening and early detection of lung cancer. Ann Oncol. 2012;23:320–7. [PubMed] [Google Scholar]
  • Versi E. “Gold standard” is an appropriate term. BMJ. 1992;305:187. [PMC free article] [PubMed] [Google Scholar]
  • Vogelsang H, Wyatt GD, Lochs WJ, et al. Screening for celiac disease: a prospective study on the value of noninvasive tests. Am J Gastroenterol. 1995;90:394–8. [PubMed] [Google Scholar]
  • Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol. 1988;43:923–37. [PubMed] [Google Scholar]
  • Watson EJ, Templeton A, Russell I, et al. The accuracy and efficacy of screening tests for Chlamydia trachomatis: a systematic review. J Med Microbiol. 2002;51:1021–31. [PubMed] [Google Scholar]
  • Wegwarth O, Schwartz LM, Woloshin S, et al. Do physicians understand cancer screening statistics? A national survey of primary care physicians in the United States. Ann Intern Med. 2012;156:340–9. [PubMed] [Google Scholar]
  • Weiss SH, Goedert JJ, Sarngadharan MG, et al. Screening test for HTLV-III (AIDS Agent) Antibodies, specificity, sensitivity, and applications. JAMA. 1985;253:221–5. [PubMed] [Google Scholar]
  • Whiting P, Rutjes AWS, Reitsma JB, et al. Chapter 2. A systematic review of sources of variation and bias in studies of diagnostic accuracy. Ann Int Med. 2004;140:189–202. [PubMed] [Google Scholar]
  • Whitlock EP, Lin JS, Liles E, et al. Screening for colorectal cancer: a targeted, updated systematic review for the U.S. Preventive Services Task Force. Ann Int Med. 2008;149:638–58. [PubMed] [Google Scholar]
  • Wiener RS, Schwartz LM, Woloshin S, Welch HG. Population-based risk for complications after transthoracic needle lung biopsy of a pulmonary nodule: an analysis of discharge records. Ann Intern Med. 2011;155:137–44. [PMC free article] [PubMed] [Google Scholar]
  • Wilson JMG, Jungner G. Principles and practice of screening for disease. Geneva: WHO; 1968. Available at: http://whqlibdoc.who.int/php/WHO_PHP_34.pdf [last accessed 5 August 2014] [Google Scholar]
  • World Health Organization (WHO) Review of ethical issues in medical genetics. 2003. Prepared for WHO by Wertz DC, Fletcher JC, and Berg K. WHO/HGN/ETH/00.4. 48 pp. Available online at: http://www.who.int/genomics/publications/en/ethical_issuesin_medgenetics%20report.pdf [last accessed 5 August 2014]
  • Zou KH, O'Malley J, Mauri L. Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation. 2007;115:654–7. [PubMed] [Google Scholar]