Selecting the correct screening test
for a disease is affected by a number of factors:
- The test must be easy to administer to people.
- It must cause minimal discomfort.
- It must be reliable - that is, consistently give the same result.
- It must be valid - that is, able (by using key markers) to distinguish disease and non-disease states.
- And it must be affordable to the health service.
A detailed examination of the retina, at the back of the eye, is the key procedure when screening for diabetic retinopathy. Various examination methods are available, from ophthalmoscopy to digital retinal imaging - with or without pupil dilation.
Whichever examination method is selected, screeners must be able to identify all cases of retinopathy correctly without causing unnecessary anxiety to the person being screened.
And the health system must be able to:
- Pay for, and manage the purchasing of, screening equipment; and
- Train personnel to use and maintain it.
We can use a 2 × 2 table to plot the presence of diabetic retinopathy against the ability of the test to detect the condition correctly. A good screening test for DR must have the following 4 key characteristics:
- A high sensitivity - that is, the test must correctly identify all cases of retinopathy (known as true positives).
- A high specificity - the test must minimise falsely identifying people as having retinopathy (false positives).
- The test must also have a high positive predictive value, ensuring a high probability that each person with a positive screening test truly has retinopathy.
- And finally, the test must achieve high coverage, so that everyone at risk of retinopathy (i.e. people with diabetes) is tested.
Let's look at a hypothetical example to understand why these characteristics are important. In our population of 100 people with diabetes, 10 have retinopathy (these are known as cases). If all 100 people are screened, we have achieved high coverage. The perfect screening test will correctly detect all 10 cases (100% sensitivity) and identify all of the remaining 90 people as being free of retinopathy (100% specificity). Now let's see how an imperfect screening test performs.
This time, even when we ensure everyone is tested (high coverage) the test is only able to pick up 7 out of 10 cases correctly. Sensitivity is only 70%.
The test also incorrectly identifies 5 people as having retinopathy (false positives).
Only 85 out of the 90 people without retinopathy are correctly identified. Specificity is now 94%.
The predictive value of this screening test is low. Only 7 out of the 12 people (or 58%) identified by this test as having diabetic retinopathy actually have the disease. This test is therefore poor and not appropriate to scale up for screening a real population of people with diabetes. In practice, even with a test with high sensitivity and specificity, screening programmes need to achieve good coverage to perform well. In our example here, if half of the population is not screened, perhaps because of lack of resources, we will miss 4 cases who do not even get a chance to be detected.
In the group which is examined, the test correctly picks up 4 out of the 6 cases giving us a sensitivity of only 67% and making this a poor test. To deliver a high-quality screening programme, we have to consider both the key markers of a good test and how it will be implemented.
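The arithmetic behind the imperfect test described above can be checked with a few lines of code. This is a minimal sketch using only the counts from the hypothetical example (7 detected cases, 3 missed cases, 5 false positives, 85 correctly identified healthy people):

```python
# Hypothetical screening example: 100 people with diabetes, 10 true cases.
# The imperfect test finds 7 of the 10 cases and wrongly flags 5 healthy people.
true_positives = 7
false_negatives = 3    # cases the test misses
false_positives = 5    # healthy people wrongly flagged
true_negatives = 85    # the 90 healthy people minus the 5 false positives

sensitivity = true_positives / (true_positives + false_negatives)  # 7/10
specificity = true_negatives / (true_negatives + false_positives)  # 85/90
ppv = true_positives / (true_positives + false_positives)          # 7/12

print(f"Sensitivity: {sensitivity:.0%}")   # 70%
print(f"Specificity: {specificity:.1%}")   # 94.4%
print(f"PPV:         {ppv:.0%}")           # 58%
```

The same four counts fully determine all three characteristics, which is why the 2 × 2 table is the standard way to summarise a screening test.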
The balance we need to find is a test that:
- Is good - has a high sensitivity and high specificity.
- Is acceptable to the people being tested.
- Can achieve high coverage - can be delivered to the whole eligible population.
- And can be conducted repeatedly at regular intervals - for example, annual screening of the whole at-risk population.
To achieve these requirements, investment and protocols must be put in place.
The St Vincent Declaration of 2005 suggests that when establishing systematic screening,
programmes should aim to reach: a sensitivity of more than 80%; a specificity of more than 90%; and an acceptable coverage of at least 80%. For example, the English national screening programme, which began in 2003, reached 82.8% of the population by 2016. The selected test, 2-field mydriatic photography, achieved a sensitivity of about 88% and a specificity of just over 86%, with 3.7% of images ungradeable. As camera technology develops, these rates are likely to improve. To maintain good coverage, the English screening programme closely monitors acceptance of screening as part of its performance criteria.
In summary: to be selected for a screening programme,
screening tests should:
- Have a high sensitivity - to correctly find all cases of diabetic retinopathy.
- Have a high specificity - to reduce overloading the system with false positives and causing unnecessary anxiety amongst people incorrectly identified as having diabetic retinopathy.
- And achieve high coverage - to ensure everyone at risk is screened.
The test also has to be acceptable to the people with diabetes being screened and affordable for the health provider.

L. Daniel Maxim, Ron Niebo, Mark J. Utell
Everest Consulting Associates, Cranbury, NJ, USA
Received 2014 Jul 8; Revised 2014 Aug 6; Accepted 2014 Aug 13.
Screening tests are widely used in medicine to assess the likelihood that members of a defined population have a particular disease. This article presents an overview of such tests, including the definitions of key technical (sensitivity and specificity) and population characteristics necessary to assess the benefits and limitations of such tests. Several examples are used to illustrate calculations, including the characteristics of low dose computed tomography as a lung cancer screen, choice of an optimal PSA cutoff and selection of the population to undergo mammography. The importance of careful consideration of the consequences of both false positives and negatives is highlighted. Receiver operating characteristic curves are explained, as is the need to carefully select the population group to be tested.
Keywords: Benefits and limitations, positive and negative predicted value, prevalence, screening tests, sensitivity, specificity
A screening test (sometimes termed medical surveillance) is a medical test or procedure performed on members (subjects) of a defined asymptomatic population or population subgroup to assess the likelihood of their members having a particular disease. With few exceptions, screening tests do not diagnose the illness. Rather, subjects who test positive typically require further evaluation with subsequent diagnostic tests or procedures. Examples of actual or proposed screening tests include the Pap smear for cervical cancer (Arbyn et al., 2008; Mayrand et al., 2007), mammography (or tomosynthesis) for breast cancer (Friedewald et al., 2014; Rafferty et al., 2013), PSA (and/or digital rectal exam) for prostate cancer (Catalona et al., 1991), cholesterol level for heart disease, X-ray (or computed tomography) for lung cancer (discussed below), the PKU test for phenylketonuria in newborns, the B-natriuretic peptide test for screening patients undergoing echocardiography to determine left ventricular dysfunction (Maisel et al., 2001), and urinalysis or other screening tests for sexually transmitted diseases or illicit drug use (Gastwirth, 1987; Jafari et al., 2013; Watson et al., 2002). Screening tests may be based on the measurement of a particular chemical in the blood or urine (a quantitative measurement) or some qualitative assessment by a trained observer (e.g. interpretation of an X-ray or CT scan, or semi-quantitative analysis by a polygraph operator).
A major objective of most screening tests is to reduce morbidity or mortality in the population group being screened for the disease by early detection, when treatment may be more successful. An alternative objective might be to reduce morbidity or mortality in persons other than the screened population who might be impacted by a communicable and preventable disease among subjects in the population being tested (such as screening for HIV in blood donors). Although some of the key analytical/statistical results applicable to the design and evaluation of screening tests have been around since the late 1700s, when the Reverend Thomas Bayes first developed the theorem that bears his name, and numerous tutorials or review articles have been written more recently (Alberg et al., 2004; Altman & Bland, 1994a,b,c; Deeks & Altman, 2004; Goetzinger & Odibo, 2011; Lalkhen & McCluskey, 2008; Thompson et al., 2005; Zou et al., 2007), there is still some confusion among practitioners about how to interpret and assess the utility of screening tests (Casscells et al., 1978; Grimes & Schutz, 2002; Manrai et al., 2014; Wegwarth et al., 2012), which is why this article might be of interest to readers of Inhalation Toxicology. In its simplest form, the screening test has only two outcomes: positive (suggesting that the subject has the disease or condition) or negative (suggesting that the subject does not have the disease or condition). An ideal screening test would have a positive result if and only if the subject actually has the disease and a negative result if and only if the subject does not have the disease. Actual screening tests typically fall short (sometimes far short, see below) of this ideal. Instead, most screening tests exhibit what are termed false positives and false negatives to varying degrees. The logical possibilities are described in the 2 × 2 table, Table 1 (Logical possibilities for true disease state and screening test outcome).
In most cases,6 screening tests need to be benchmarked against an agreed “Gold Standard” test (Greenhalgh, 1997). The gold standard test is a diagnostic test that is usually regarded as definitive (e.g. by biopsy or autopsy). The actual gold standard test may be invasive (e.g. biopsy), unpleasant, too late (e.g. autopsy) to be relevant, too expensive or otherwise impractical to be used widely as a screening test. Table 2 provides examples of various screening tests and possible Gold Standards. In principle, a “Gold Standard” should have 100% sensitivity and 100% specificity (see below for definitions), that is, it would never make a classification error. In practice that may not be the case and the “Gold Standard” is regarded as the best test under “reasonable conditions.” As noted by Versi (1992): “As science increases its hold on the practice of medicine we become more aware of the limitations of the clinical method. Unfortunately, we also become more aware of the limitations of various diagnostic tests. Nevertheless, at any given time there may well be a consensus that a given test in a given situation is the best available test. It therefore serves as the gold standard against which newer tests can be compared. When enough data have accumulated to make that gold standard untenable, it can perfectly reasonably be replaced by another. This can then preside until it too is toppled.” Troy et al. (1996) offered the following perspective on gold standards: “… however, gold standards for comparison are not always available. Moreover, a perfect gold standard is less often available than an imperfect gold standard (‘alloyed gold standard’), an adopted standard based on observed data which is measured with error.” For the purposes of this discussion, the gold standard test is assumed to be without error. 
Several authors have developed statistical approaches for dealing with “alloyed gold standards” (Dendukuri, 2011; Hawkins et al., 2001; Johnson et al., 2001; Joseph et al., 1995; Lewis & Torgerson, 2012; Rutjes et al., 2007; van Smeden et al., 2014; Walter & Irwig, 1988). As might be expected, none of these alternative procedures perform as well as if a true Gold Standard were available, but several are improvements over naively assuming the Gold Standard is “unalloyed.” The possible outcomes shown in Table 1 are quantified by two probabilities, termed the test sensitivity and specificity. These are two key characteristics of a screening test.
Publications about screening tests typically report both the sensitivity and specificity of the test. It is clearly desirable to have a test that is both highly sensitive and highly specific. (In some cases, it may be possible to structure the test so as to trade off sensitivity and specificity, as discussed below.) Figure 1 shows a sample of reported sensitivities and specificities of various screening tests as summarized by Alberg et al. (2004), denoted by the triangles, and from our own literature search (see Table A1), denoted by the circles. As can be seen, there are a substantial number of screening tests with both high sensitivity and high specificity, but also many that fall far short of this ideal. It should be noted that not all the screening tests shown in Figure 1 are actually used at present – some may have been found wanting. As some test results (e.g. reading an X-ray) require interpretation, it is possible that there will be inter-observer variation (notwithstanding attempts at standardization), so that the reported sensitivity and specificity may vary with the observer (see Deeks, 2001 and Elmore et al., 2002 for illustrations). This creates issues when large-scale screening tests are being contemplated and it is necessary to extrapolate or generalize from screening test data based on pilot studies, often conducted by highly specialized and experienced personnel. See Whiting et al. (2004) and Table 3 (Common sources of bias in study design) for a useful systematic review of sources of variation and bias in studies of diagnostic or screening accuracy.
Frequently, it is of interest to compare one screening test (a potential improvement) with another. It is important to make careful statistical comparisons to assess whether the “improved” test is actually superior and to calculate confidence intervals on the various proportions (e.g. using the methods given in Newcombe, 1998). Ideally such comparisons should be made on the same population and randomly assigning subjects to each test. Before addressing additional important definitions applicable to screening tests it is appropriate to mention some of the consequences of false negatives and false positives in screening tests. Briefly:
The prevalence of the disease is the fraction, Π, of subjects in the population under study that have the disease. It is equal to the a priori probability (Pr{D+}) that a subject selected at random from the population or subgroup has the disease. Prevalence, along with sensitivity and specificity, is a key determinant of the utility of the screening test (see below). For reasons discussed below, it is desirable to define the population to be screened in such a way that the prevalence in the test population is high. The reported prevalences among various populations that are the subjects of screening tests (Alberg et al., 2004) range from 0.05 to 0.9, but cluster among the higher values. There are four additional relevant characteristics of a screening test: the positive predictive value, negative predictive value, accuracy and likelihood ratio:
In words, the a posteriori probability that the subject has the disease given a positive test is the ratio of true positives (the product of the prevalence and sensitivity) divided by total positives (the sum of true positives and false positives). It is desirable that the screening test has a high PPV.
In words, the a posteriori probability that the subject does not have the disease given a negative test is the ratio of the true negatives (complement of prevalence times the specificity) divided by the total negatives (the sum of true negatives and false negatives). It is also desirable that the test has a high NPV.
Accuracy is defined as ΠS + (1 − Π)Sp. The term ΠS is the true positives (prevalence times sensitivity) and the term (1 − Π)Sp is the true negatives (the probability that the subject does not have the disease times the probability that the test is negative given the subject is without disease). As noted by Alberg et al., (2004): “Overall accuracy is the weighted average of a test's sensitivity and specificity, where sensitivity is weighted by prevalence and specificity is weighted by the complement of prevalence.” Refer to Alberg et al. (2004) for a discussion of the limitations of this measure of screening efficiency.
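Since accuracy is just a prevalence-weighted average, it can be captured in a one-line function. This is a minimal sketch, not code from the article; the example values (prevalence 0.5, sensitivity 0.9, specificity 0.3) are those of the hypothetical test discussed below in Table 4:

```python
def accuracy(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Overall accuracy = PI*S + (1 - PI)*Sp: the probability of a true
    positive plus the probability of a true negative."""
    return prevalence * sensitivity + (1 - prevalence) * specificity

# Prevalence 0.5, sensitivity 0.9, specificity 0.3 (hypothetical test):
print(round(accuracy(0.5, 0.9, 0.3), 3))  # 0.6
```

Note that at prevalence 1 the accuracy collapses to the sensitivity, and at prevalence 0 to the specificity, which is exactly the weighted-average reading quoted above.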
To many, the PPV or NPV are the key characteristics of a screening program. It is important to remember that the PPV and NPV depend on both the population under study and the technical characteristics of the screening test. A screening test with relatively high sensitivity and specificity may still have a low PPV if the population prevalence is sufficiently low. Thus, to assess a proposed screening test it is necessary to evaluate both the technical and population characteristics. The probability of a positive test, Pr(T+), is the sum of the probabilities of a subject with the disease correctly testing positive and someone without the disease incorrectly testing positive, or ΠS + (1 − Π)(1 − Sp). Some have suggested using the observed fraction of positive tests, F+ (sometimes termed the apparent prevalence), as a surrogate for or estimate of Π but, unless the sensitivity and specificity are both equal to unity, this will give a biased answer. Given the definitions, an improved estimate for Π is equal to (F+ + Sp − 1)/(S + Sp − 1). Refer to Gart & Buck (1966), Gastwirth (1987), Levy & Kass (1970) and Rogan & Gladen (1978) for a derivation and Karaağaoğlu (1999) for additional analyses. All of these screening test characteristics are determined by testing a particular population (using one or more screening tests) and recording the number of subjects that fall into the various categories shown in Table 1. To illustrate, Table 4 (Hypothetical data from a screening experiment) provides hypothetical data from a screening test evaluation of a population of 10 000 subjects, assumed to have a disease prevalence of 0.5, with a calculated sensitivity of 0.9 (95% confidence interval including continuity correction [0.8913, 0.9081]) and a specificity of 0.3 (95% confidence interval including continuity correction [0.2874, 0.313]).
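The hypothetical experiment of Table 4 can be reproduced in a few lines, including the bias-corrected prevalence estimate (F+ + Sp − 1)/(S + Sp − 1) given above. A minimal sketch (the counts follow directly from the stated prevalence, sensitivity and specificity):

```python
# Reproduce the hypothetical screening experiment of Table 4:
# 10 000 subjects, prevalence 0.5, sensitivity 0.9, specificity 0.3.
n, prevalence, sens, spec = 10_000, 0.5, 0.9, 0.3

diseased = round(n * prevalence)      # 5000 subjects with the disease
healthy = n - diseased                # 5000 without
tp = round(diseased * sens)           # 4500 true positives
fn = diseased - tp                    #  500 false negatives
tn = round(healthy * spec)            # 1500 true negatives
fp = healthy - tn                     # 3500 false positives

ppv = tp / (tp + fp)                  # P(disease | positive test) = 0.5625
npv = tn / (tn + fn)                  # P(no disease | negative test) = 0.75

# Bias-corrected prevalence from the apparent prevalence F+:
f_pos = (tp + fp) / n                                # 0.80
pi_hat = (f_pos + spec - 1) / (sens + spec - 1)      # recovers ~0.5
print(ppv, npv, pi_hat)
```

Because the apparent prevalence here is 0.80 while the true prevalence is 0.50, the example also shows why F+ alone is a biased estimate of Π for an imperfect test.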
Table 4 also illustrates the equations and numerical computation of the various quantities defined above. The bottom of Table 4 provides the calculation (using Bayes' theorem) of the a posteriori probabilities corresponding to either a positive or negative test outcome. In this example, a subject who tests positive has an a posteriori probability of having the disease of 0.5625 – not materially greater than the a priori prevalence (0.5) in the population. This is because although the sensitivity is relatively high, the specificity of the test is relatively low. Conversely, a subject who tests negative has an a posteriori probability of not having the disease of 0.75 – in this case, clearly different from the a priori prevalence (0.5) in the population in this example. The a posteriori probability of having the disease given a positive test result, or PPV, is one obvious measure of the evidence provided by the test. Other things being equal, tests with high specificity (few false positives) tend to have a high PPV. However, unlike sensitivity or specificity (which might be termed “pure characteristics” of the test), the PPV is also a function of the characteristics of the population under study; PPV is a function of the prevalence. In the numerical example given in Table 4, the prevalence was assumed to be 0.5 (i.e. 50% of the population or subpopulation had the disease). Figure 2 shows how the PPV, NPV and accuracy depend upon the assumed prevalence Π in the population being screened. As can be seen, both PPV and accuracy decrease (sharply in the case of PPV) as Π decreases from the base case assumption of 0.50. Conversely, the NPV increases as the prevalence decreases. To help place the content of Figure 2 in perspective, note that if the prevalence, Π, were as low as 0.16, the Positive Predicted Value, PPV, would be only 0.2. Put another way, a subject who tested positive under these circumstances would have an 80% chance of not having the disease! 
And, if Π were as low as 0.08, there would be a 90% chance that a subject with a positive test would be disease free. If the consequences of a positive test (e.g. worry, invasive or expensive and unnecessary follow-up procedures) were substantial, this would not be a satisfactory screening test. Thus, the quantity 1 – PPV might aptly be termed the regret probability. Figure 3 shows how the regret (so defined) varies with both the prevalence and specificity, when the sensitivity is held constant at 0.90. Looking at Figure 3, you can see how the likelihood that a subject who tests positive actually is disease free changes as the prevalence changes. If the actual prevalence in the population were say 0.3, the regret would be approximately 0.7 and if the prevalence were as low as 0.08, the regret would be 0.9. This example illustrates the point that both technical parameters of the screening test and prevalence need to be considered. Figure 4 shows the locus of points that have a constant regret (equal to 0.8) as a function of specificity and prevalence for values of sensitivity ranging from 0.7 through 0.9; this test characteristic does not have much leverage in this example. Rather the prevalence and specificity are the key variables. Although the example given in Table 4 is hypothetical, it is relevant to many actual tests. One is described below. Low dose computed tomography (LDCT) has been proposed as a screening test for lung cancer. Several studies (Humphrey et al., 2013; Tiitola et al., 2002) have shown that this technique has a high sensitivity (as a percentage ranging from 80 to nearly 100%, Humphrey et al. [2013]) at detecting nodules. The National Lung Cancer Screening Trial (NLST) Research Team (2011) published a study reporting that the estimated reduction in mortality from use of LDCT screening was approximately 20% compared to alternative test strategies. 
Subjects included in the study population were between 55 and 74 years of age at the time of randomization, had a history of cigarette smoking of at least 30 pack-years, and, if former smokers, had quit within the previous 15 years. These criteria were used in an attempt to define a population with a relatively high prevalence and thus a high Positive Predictive Value:
“Lung cancer mainly occurs in older people. About 2 out of 3 people diagnosed with lung cancer are 65 or older; fewer than 2% of all cases are found in people younger than 45. The average age at the time of diagnosis is about 70.” Based on this (and other) research team's findings, a nationwide screening program was proposed and has been endorsed by several organizations [e.g. the American Lung Association, see ALA, 2012, the US Preventive Services Task Force, 2013; American College of Chest Physicians (ACCP) and the American Society of Clinical Oncology (ASCO)]. However, this and earlier proposals for LDCT screening have also had numerous critics or skeptics (American Academy of Family Physicians (AAFP) 2014; Heffner & Silvestri, 2002; Ruano-Ravina et al., 2013; Silvestri, 2011; Vansteenkiste et al., 2012), some arguing that the estimated benefits of LDCT screening in reducing mortality are uncertain, lower than estimated or absent (Bach et al., 2007, 2012; Black, 2000; Oken et al., 2011; Pastorino et al., 2012; Saghir et al., 2012), others that the procedure is not cost-effective (Mahadevia et al., 2003), and yet others that the radiation risks might be excessive (Brenner, 2004). One of the major concerns about the use of LDCT even among advocates (Marshall et al., 2013) is that LDCT detects a large number of benign but uncalcified pulmonary nodules – properly termed false positives – that are challenging to diagnose (MacRedmond et al., 2006; Nawa et al., 2002; Patz et al., 2004; Swensen et al., 2002, 2003, 2005) and which create other problems depending upon what is done as part of the follow-up to a positive test (Wiener et al., 2011). As noted by Diedrerich (2008):
In short, although this test is highly sensitive, it has a low specificity. Table 5 provides estimates of the false positive rate (benign nodules discovered by CT scans) as reported in several studies – even those that favor routine screening of this population subgroup. The calculated or reported false positive rates shown in Table 5 vary substantially among the studies;13 some of the differences can be explained by different criteria for defining a positive (e.g. size of the nodule that is classified as a positive) and whether or not multiple LDCTs were used (and the criteria for a positive on multiple tests) as part of the procedure. Despite this variability it is apparent that most reported estimates of false positive probabilities are quite high. The study by van Klavern et al. [2009] reports a false positive probability very much lower than the other results depicted in Table 5. The actual test and decision criteria developed by these investigators differed from others. Specifically, they used a mathematical model to evaluate a non-calcified nodule according to its volume or volume-doubling time. Growth was defined as an increase in volume of at least 25% between two scans. The first-round screening test used by these investigators was considered to be negative if the volume of a nodule was less than 50 mm3, if it was 50 to 500 mm3 but had not grown by the time of the 3-month follow-up CT, or if, in the case of those that had grown, the volume-doubling time was 400 days or more. Another concern of critics of the NLST is that it might be difficult to generalize the results to community practices. Silvestri (2011), for example, wrote:
From the data given in Table 5, it is clear that a conservative estimate of the false positive probability is at least 0.7, which means that the specificity of this test is at most 0.3 – the value assumed in the hypothetical example given in Table 4 – and might be much lower. Thus, even for the potentially high-risk group of elderly heavy cigarette smokers included in the screening trials, the Positive Predictive Value of the test is not likely to be high. There is some discrepancy in reported PPVs for the NLST: Humphrey et al. (2013) reported calculated positive predictive values (PPVs) for abnormal screening results ranging from 2.2% to 36.0%, while Ruano-Ravina et al. (2013) report the PPV for the NLST as only 3.6%. Ruano-Ravina et al. (2013) have summarized the PPVs for 14 other LDCT investigations. Including their estimate for the NLST, PPVs in these tests range from 0.028 to 0.115 with a median value of 0.053 and an arithmetic mean of 0.064, meaning that the probability that someone with a single positive test does not have lung cancer ranges from 0.885 to 0.972! Kovalchik et al. (2013) examined how the reduction in lung cancer mortality as reported by the NLST varied with the estimated risk based on a prediction model using age, body-mass index, family history of lung cancer, pack-years of smoking, years since smoking cessation and emphysema diagnosis. Based on model predictions, they divided the study population into quintiles based on a predicted 5-year risk of lung cancer. They analyzed the NLST data and found:
This finding highlights the importance of identifying the target population that is likely to benefit most from the screening procedure. Overdiagnosis is another factor to consider in assessing the merits of LDCT cancer screening. This is because although screening has a high sensitivity and potential to detect aggressive tumors, screening will also detect indolent tumors that otherwise might not cause immediate clinical symptoms. Patz et al. (2014) used data from the NLST to estimate that more than 18% of all lung cancers detected by LDCT seemed to be more indolent, and the potential of overdiagnosis should be considered when describing the risks of LDCT for lung cancer. Depending upon what is done in terms of follow up in the event of a positive screening test result, the impact of false positives could be substantial. Wiener et al. (2011) determined population-based estimates of risks of complications following transthoracic needle biopsy of a pulmonary nodule. This group collected data on the percentage of biopsies complicated by hemorrhage, any pneumothorax and pneumothorax requiring chest tube, and computed adjusted odds ratios for these complications associated with various biopsy characteristics, calculated using multivariable population-averaged generalized estimating equations among a population of 15 865 adults (in California, Florida, Michigan and New York) who underwent transthoracic needle biopsy of a pulmonary nodule. These investigators reported:
It is apparent from these results that the consequences of false positives are potentially material. Based largely on concerns over the high false positive rate, the Medicare Evidence Development and Coverage Advisory Committee (MEDCAC) in the United States recently recommended against covering the procedure for this patient group, citing a lack of evidence to support the benefits of the screening test (http://www.aafp.org/news/health-of-the-public/20140521medcacctrec.html). MEDCAC makes recommendations, not decisions; as noted by the Centers for Medicare and Medicaid Services (http://www.cms.gov/Regulations-and-Guidance/Guidance/FACA/MEDCAC.html):
The merits of this screening test are likely to be reviewed by other panels and the MEDCAC recommendation may ultimately be reversed – the Centers for Medicare & Medicaid Services (CMS) is expected to issue a proposed decision on the issue by November 2014, and a final decision in February 2015. The decision may be made on policy grounds but, from a scientific perspective, the ultimate outcome is likely to hinge on the judgment of the key parameters prevalence in the population, the high false positive rate, and ultimately the low PPV (Nelson, 2009; Phend, 2014; US Preventive Services Task Force, 2013). As a related example, we were asked by ECFIA (a trade association of manufacturers of high temperature insulating wools) to comment on the suitability of routine use of LDCT scans in a medical surveillance program for workers of all ages (including both smokers and non-smokers) engaged in the manufacture of refractory ceramic fiber (RCF) and other high temperature insulating wools in France. The available results of a mortality study of these workers in two US plants does not indicate any increase over baseline cancer rates (LeMasters et al., 2003; Utell & Maxim, 2010), so the likely prevalence of lung cancer in this population is not likely to be high. This is because most of the employed population is substantially younger than those included in the NLST (indeed, the retirement age in France is 60–62 depending upon what age the employee entered the workforce) and not all employees are smokers, let alone heavy smokers. According to data from SEER (see http://www.cancer.org/cancer/cancerbasics/lifetime-probability-of-developing-or-dying-from-cancer) the lifetime probability of contracting lung cancer among American males (including both smokers and non-smokers) is approximately 7.6%. 
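Plugging this lifetime probability into Bayes' theorem with the test characteristics of Table 4 shows how weak the screen becomes in such a population. A minimal sketch (the function is illustrative, not from the article; the values are those used in the text):

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value via Bayes' theorem:
    P(disease | positive) = true positives / all positives."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Lifetime lung cancer probability of ~7.6% as the prevalence estimate,
# with the sensitivity (0.9) and specificity (0.3) from Table 4.
value = ppv(0.076, 0.9, 0.3)
print(f"PPV    ~ {value:.2f}")      # ~0.10
print(f"regret ~ {1 - value:.2f}")  # ~0.90
```

The same function reproduces the earlier hypothetical results as well: at prevalence 0.5 it returns 0.5625, and at prevalence 0.16 it falls to about 0.20.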
Taking this as an estimate applicable to the French population prevalence and using the sensitivity and specificity values from Table 4, the Positive Predictive Value of CT lung cancer screening is approximately 0.1. (Obviously it would be much lower for young men and non-smokers and higher among those nearing retirement and heavy smokers.) This means that the a posteriori probability (regret) that a subject who tests positive in a single CT scan does not have lung cancer is approximately 0.9. Despite the high probability that a subject with a positive test does not have lung cancer, these subjects would undergo whatever follow-up procedures might accompany such a test result. Members of this group would, at a minimum, suffer some mental distress and would be subject to follow-up CT scans and possibly invasive procedures. This screening test would clearly be inappropriate for this group. In this context it is noteworthy that the American Lung Association's guidance document (ALA, 2012) that endorsed LDCT scans for older smokers also states:
Bach et al. (2012) likewise endorsed LDCT screens for older smokers, but also recommended:15
Thus, regardless of whether one believes that LDCT is an appropriate screening test for the population of older smokers, it is not justified for a population with a much lower prevalence or for those who are not likely to benefit from a correct diagnosis, evaluation and treatment. Some screening tests are designed as "one-off" tests, but many are intended to be administered periodically, such as annually. For example, mammography and clinical breast examination have been proposed for screening for breast cancer. As Elmore et al. (1998) wrote:
Thus, in evaluating periodic screening, it is necessary to measure or calculate cumulative probabilities. Care must be taken because the results of multiple tests may not be independent events. As the PPV of a screening test depends critically on the prevalence of the disease in the population, it is important to identify criteria that define a population group or subgroup with a high disease incidence to begin with. As noted above, this is why the LDCT program was limited to older smokers. Lung cancer rates increase with age, and the vast majority of lung cancers occur in smokers. This is potentially a reasonable population subgroup for screening. To illustrate the selection of a relevant population subgroup, we use an example from a study of breast cancer screening. Kerlikowske et al. (1993) reported on a cross-sectional study of 31 814 women aged 30 years and older referred for mammography at the University of California. They segmented the population into women of various age groups with and without a family history of breast cancer. Figure 5 shows a bar chart of the estimated PPVs for these groups. These investigators found that five times as many cancers per 1000 first-screening mammographic examinations were diagnosed in women aged 50 years or older compared with women aged less than 50 years. The highest PPVs for mammography were for older women with a family history of breast cancer. This finding guided their recommendation. Possible criteria for defining a population subgroup include various demographic factors (age, gender, race and country), known risk factors (e.g. smoking), medical history and occupation. For screening to be highly effective, the prevalence in the population should be as high as is practicable. Harper et al. (2000) provide additional comments on the importance of the study population. It is noted above that there may be opportunities to design a screening test that has different combinations of sensitivity and specificity.
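The cumulative-probability point for periodic screening can be made concrete. Under the simplifying assumption that successive screens are independent (which, as cautioned above, may not hold in practice), the probability of at least one false positive over n screens is 1 − (1 − p)^n for a per-screen false positive rate p. A short sketch, with a hypothetical 10% rate:

```python
def cumulative_false_positive(per_test_fp_rate, n_tests):
    """P(at least one false positive over n screens), assuming the
    screens are independent; real repeated tests may be correlated."""
    return 1.0 - (1.0 - per_test_fp_rate) ** n_tests

# A hypothetical 10% per-screen false positive rate over ten annual screens:
print(cumulative_false_positive(0.10, 10))  # ≈ 0.651
```

Even a modest per-screen error rate thus accumulates: under these assumptions, most subjects would experience at least one false positive over a decade of annual screening.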
If so, there are opportunities to design the test to possess characteristics that are superior in terms of the combination of possible consequences of false positives and false negatives. For example, in the LDCT test, the threshold for the size (mm) of nodules or other characteristics (e.g. solid or semisolid nodules) might be varied (Lam et al., 2013), choices that would alter the sensitivity or specificity. Thompson et al. (2005) and Zou et al. (2007) offer relevant examples of ROC curves. Figure 6 shows a typical receiver operating characteristic (ROC)16 curve for a prostate specific antigen (PSA) test administered to men aged 70 or more. Each subject is tested and a specific PSA score determined. Subjects were also administered digital rectal examinations and biopsies – those with positive biopsies were used as the gold standard for assessment of disease status. A series of possible PSA cutoff scores (measured in nanograms per milliliter, ng/ml) was considered for the screening test. Each cutoff score resulted in a partitioning of subjects into those who tested positive and those who tested negative. Knowing the actual disease status of the subjects enabled calculation of the sensitivity and specificity of the test. The ROC curve plots the calculated sensitivity against the false positive error (1 – Sp). Thus, each plotted point on the curve represents a different possible screening test with its own sensitivity and specificity. By considering the consequences of false positives and false negatives, it is possible to determine a cutoff value for the PSA test that is optimal in some sense. One statistic often used to characterize the ROC is the area under the curve (AUC). A perfectly discriminatory ROC would have an AUC = 1.0; the value for the PSA tests studied by Thompson et al. (2005) was 0.678. Thompson et al. (2005) also considered using a so-called Gleason score17 with a cutoff of 8 or more in this population.
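The construction just described (choose a cutoff, partition the subjects, compute the sensitivity and false positive rate, then integrate to obtain the AUC) can be sketched in a few lines of Python. The scores below are synthetic illustrations, not the Thompson et al. (2005) data:

```python
def roc_points(scores_pos, scores_neg, cutoffs):
    """For each cutoff c, call a subject test-positive if score >= c.
    Returns sorted (fpr, tpr) pairs: tpr = sensitivity, fpr = 1 - Sp."""
    points = []
    for c in cutoffs:
        tpr = sum(s >= c for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= c for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the ROC curve, anchored at (0,0) and (1,1)."""
    pts = sorted([(0.0, 0.0)] + points + [(1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Synthetic scores for diseased (gold-standard positive) and healthy subjects:
diseased, healthy = [4, 6, 8, 9], [1, 2, 3, 5, 7]
curve = roc_points(diseased, healthy, cutoffs=[2, 4, 6, 8])
print(curve, auc(curve))
```

Each cutoff yields one (1 − Sp, Se) point; raising the cutoff trades sensitivity for specificity, which is exactly the design freedom discussed above.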
Figure 7 shows the ROC curve (topmost curve) for this possible screening test. As can be seen, this series of tests dominates the tests based upon PSA score alone (the AUC in this case is 0.827). The dashed line in Figure 7 shows the ROC curve that would occur under chance alone. Whether, and for whom, PSA screening is appropriate requires the same sort of analysis noted for the LDCT screening evaluation. The ROC curve is just one piece of the puzzle, but this type of analysis shows that it is possible to design a screening test with several alternative combinations of sensitivity and specificity. A complete specification of a screening test includes the intrinsic test characteristics (sensitivity, specificity and cost) and the ROC curve (if multiple tests are possible), the characteristics of the subject population (including opportunities for segmenting the population to identify high-risk groups), the key derived quantities (PPV and NPV) and the consequences of false positives and negatives. Screening tests have the potential to be a cost-effective means for identifying subjects with early-stage (and thus potentially more treatable) disease before symptoms develop and, therefore, for saving lives. The ideal screening test would discriminate perfectly between those who do and do not have the disease, and would be inexpensive and non-invasive. In practice, screening tests exhibit false positives and false negatives – errors with consequences that need to be carefully considered when evaluating the advantages and disadvantages of the test. The predictive value of the test depends in part on the technical parameters of the test, including the sensitivity and specificity, but also on the prevalence of the disease in the population. For this reason, it is necessary to be able to define the population to be tested so that the prevalence is high.
This is why mammography is appropriate only for older women and those with a family history of breast cancer, and why lung CT scans are not appropriate for screening the general population. With some screening tests it is possible to alter the test decision criterion to adjust the balance between sensitivity and specificity, in which case it may be possible to develop an optimal screening test. Nonetheless, screening of asymptomatic populations is not always appropriate and could do more harm than good.18 Table 6 summarizes the circumstances/conditions when screening might be either appropriate or contra-indicated.
We appreciate the constructive comments offered by two anonymous reviewers. Their comments have improved this manuscript.

Table of test specificity and sensitivity results in the literature.
1. The basis for definition of the population might include age, gender, race, occupation, known medical condition or other risk factor (e.g. smoking).
2. Diseases frequently begin before the onset of symptoms, during a period sometimes referred to as the "detectable pre-clinical phase" (DPCP).
3. From this, it follows that the benefits of screening will be minimal if the disease has no cure (such as certain stage mesotheliomas) or if early detection does not materially improve chances for survival. In addition, depending upon the population under study, some diseases (sometimes termed pseudo-diseases) are detected that do not affect mortality because the subject may die from another disease or event. This is termed overdiagnosis (refer to Black, 2000 for more detail).
4. Screening tests for donated blood using nucleic acid amplification are now so efficient that the risks of human immunodeficiency virus and hepatitis C virus transmission through blood transfusion are estimated to be approximately 1 in 2 million (Stramer, 2007).
5. See Coste & Pouchot (2003) for an extension in which the test results are permitted to fall into three zones: positive, negative and an intermediate "grey zone." In principle, many test outcomes as well as sequential tests can be handled mathematically. We focus on the 2 × 2 case because it has proven useful and is easier to analyze.
6. There are a few examples (e.g. certain tests for HIV) of screening tests with such high sensitivity and specificity that they are virtually a gold standard.
7. The symbols T+ and T− denote the events that the test outcome is positive and negative, respectively. The symbols D+ and D− denote the events that the subject has or does not have the disease.
8. Thus, Pr{D−} = 1 − Π.
9. It is beyond the scope of this article to consider optimal screening study designs, but it is appropriate to comment on one possible design, the case-control design. As noted by Goetzinger & Odibo (2011): "It is important to highlight that the case control study design cannot be used to determine predictive values because these values are influenced by disease prevalence. Because cases and controls are selected for inclusion, the prevalence of the disease is, therefore, 'fixed' by the study design. Reproducing a generalizable spectrum of patients also becomes difficult with this type of study design".
10. The width of these confidence intervals is small because of the assumed size of the population under test. Many studies, however, are conducted on few individuals, and it is important to understand the consequences in terms of the likely precision of the estimates.
11. See http://www.cancer.org/cancer/lungcancer-non-smallcell/detailedguide/non-small-cell-lung-cancer-key-statistics.
12. A hamartoma is a benign, focal malformation that resembles a neoplasm in the tissue of its origin.
13. This is obviously not desirable, but also not entirely unexpected. For example, Elmore et al. (2002) noted a variation in false positive rates ranging from 2.6% to 15.9% among radiologists interpreting mammograms.
14. Male mortality rates from lung cancer are approximately the same in France and the United States (see http://www.oecd-ilibrary.org/docserver/download/8111101ec007.pdf?expires=1404337643&id=id&accname=guest&checksum=03F45C46CE1A31E393DD2EAFDF0157D3). Moreover, the 7.6% figure assumed for the prevalence is for an entire lifetime. The probability of contracting cancer through age 60 or 62 (when workers will retire) is certainly lower. Thus, this estimate probably overstates the actual prevalence for the worker cohort.
15. ACCP and ASCO have made essentially the same recommendation; see http://www.cancer.net/research-and-advocacy/asco-care-and-treatment-recommendations-patients/lung-cancer-screening.
16. ROC analysis emerged from the study of signal detection problems: differentiating signals from noise. ROC curves were first used by scientists in Britain during World War II, when radar receiver operators were assessed on their ability to differentiate signal (e.g. enemy aircraft) from noise (non-relevant targets). The term was later borrowed by statisticians assessing screening tests.
17. The Gleason score is a grading system for prostate cancer based on the microscopic appearance of the tumor.
18. For a discussion of ethical issues relevant to screening programs, see McQueen (2002) and WHO (2003).

This paper represents independent research and the authors are solely responsible for the content. Two of the authors (LDM and MJU) were asked by ECFIA to give an opinion on the use of CT scans for workers engaged in the production of high temperature insulating wools.