Notes on “Meehl, 1954 Clinical versus Statistical Prediction” [@meehl-1954]
Meehl 1954
Summary
p.iii (1996 preface)
As a result of this book and articles I published shortly thereafter, numerous studies compared the efficacy of subjective clinical judgment with prediction via mechanical or actuarial methods. Accumulating over the years, they have tended overwhelmingly to come out in the same way as the small number of studies available in 1954. There is now a meta-analysis of studies of the comparative efficacy of clinical judgment and actuarial prediction methods (Grove et al., 2000; a summary is given in Grove and Meehl, 1996). Of 136 research studies, from a wide variety of predictive domains, not more than 5 percent show the clinician’s informal predictive procedure to be more accurate than a statistical one.
Cited studies are:
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical vs. mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19-30.
Conclusion p.119
In spite of the defects and ambiguities present, let me emphasize the brute fact that we have here, depending upon one’s standards for admission as relevant, from 16 to 20 studies involving a comparison of clinical and actuarial methods, in all but one of which the predictions made actuarially were either approximately equal or superior to those made by a clinician. Further investigation is in order to eliminate the defects mentioned, and to establish the classes of situations in which each method is more efficient. I do not feel that such a strong generalization as that made by Sarbin is warranted as yet. Note that in terms of the kind of thing being predicted, there is not much heterogeneity. Essentially three sorts of things are being predicted in all but one of these studies, namely: (1) success in some kind of training or schooling; (2) recidivism; (3) recovery from a major psychosis … [emphasis added]
Notes on Papers from Chapter 8
The main meat of the book is “Chapter 8 - Empirical Comparisons of Clinical and Actuarial Prediction”. Lots of studies; no point excerpting too many here, since you can just read the chapter, but here are a few.
Sarbin p.90 ff.
Sarbin chose as his criterion variable academic success as measured by honor-point ratio. The sample consisted of 162 freshmen (73 men and 89 women) who matriculated in the fall of 1939 in the arts college at the University of Minnesota. Honor-point ratios were calculated at the end of the first quarter of the students’ freshman year. The statistical prediction was made by a clerk who simply inserted the values of the predictor variables into a two-variable regression equation. The predictor variables were high school percentile rank and score on the college aptitude test. (Note: One psychometric and one nonpsychometric variable.) The sample used was cross-validating, since the regression equation had been based upon a previous sample.
The clinical predictions were made on an eight-point scale by one of five clinical counselors in the university’s Student Counseling Bureau. Four of the five counselors possessed the doctorate and all had “considerable experience” in clinical counseling of university students. The data available to the counselors were considerably in excess of those utilized by the statistician, namely: a preliminary interviewer’s notes, scores on the Strong Vocational Interest Blank, scores on a four-variable structured personality inventory, an eight-page individual record form filled out by the student, scores on several additional aptitude and achievement tests, as well as the two scores utilized by the statistician. In addition, the predicting clinician had one interview with the student prior to the beginning of fall quarter classes. At the end of the fall quarter the correlations shown in the tabulation were obtained between the two sets of predictions and the facts. There is no significant difference between the efficiency of the two methods.
| Method      | Men | Women |
|-------------|-----|-------|
| Clinical    | .35 | .69   |
| Statistical | .45 | .70   |
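To make the actuarial side concrete, here is a minimal sketch of the Sarbin-style procedure: fit a two-variable regression on a previous (“derivation”) sample, freeze the weights, and have a “clerk” plug a new sample’s values into the equation. All data, coefficients, and variable names below are invented for illustration, not Sarbin’s.

```python
# Sketch of two-variable actuarial prediction with cross-validation.
# Everything here is synthetic; only the procedure mirrors the study.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    hs_rank = rng.uniform(0, 100, n)      # high school percentile rank
    aptitude = rng.normal(50, 10, n)      # college aptitude test score
    # criterion: honor-point ratio, loosely related to both predictors
    hpr = 0.01 * hs_rank + 0.02 * aptitude + rng.normal(0, 0.5, n)
    return hs_rank, aptitude, hpr

# Derivation sample: estimate the regression weights once.
hs, apt, hpr = simulate(300)
X = np.column_stack([np.ones_like(hs), hs, apt])
beta, *_ = np.linalg.lstsq(X, hpr, rcond=None)

# New sample: the "clerk" just inserts values into the frozen equation.
hs2, apt2, hpr2 = simulate(162)
pred = np.column_stack([np.ones_like(hs2), hs2, apt2]) @ beta

# Validity coefficient: correlation of predictions with the criterion.
print(f"cross-validated r = {np.corrcoef(pred, hpr2)[0, 1]:.2f}")
```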
Burgess - Prisons. p.95
Burgess (17) studied the outcome of 1000 cases of parole from three Illinois state prisons. Using 21 objective factors such as nature of the crime, length of sentence, nationality of father, county of indictment, size of community, type of residence, and chronological age, and combining them in unweighted fashion by simply counting the number of factors operating for or against a successful outcome of the parole, he achieved certain percentages of success in postdiction which can be compared with the percentages of two different prison psychiatrists. Again we find that both of these clinicians employed a “doubtful” category, but Burgess’ presentation makes it impossible to say how many cases were so classed. When he predicts success, each of the psychiatrists is slightly better than the statistical method (85 per cent and 80 per cent versus 76 per cent hits). When predicting failure, each psychiatrist is quite clearly inferior to the statistician (30 per cent and 51 per cent versus 69 per cent). Since these percentages are based upon a reference class of all cases for the statistician but upon a smaller reference class which excludes some (unknown) fraction of the “doubtfuls” for the two psychiatrists, it seems quite safe to favor the statistical method.
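Burgess’s unweighted combination is about the simplest actuarial method possible, so a sketch is easy. The factor names and cutoff below are hypothetical; his actual coding of the 21 factors was more detailed.

```python
# Sketch of Burgess-style unit weighting: each objective factor scores
# +1 if it points toward parole success, and the "prediction" is simply
# the count compared against a cutoff.
def burgess_score(case: dict[str, bool]) -> int:
    """Count the factors operating for a successful outcome."""
    return sum(case.values())

def predict_success(case: dict[str, bool], cutoff: int) -> bool:
    """Predict success if at least `cutoff` factors are favorable."""
    return burgess_score(case) >= cutoff

# Illustrative case with a handful of invented factors (Burgess had 21):
case = {
    "first_offense": True,
    "short_sentence": True,
    "stable_residence": False,
    "favorable_age": True,
    "no_prior_parole_violation": True,
}
print(burgess_score(case), predict_success(case, cutoff=3))
```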
Melton - College scores p.105
Melton (73) studied the efficiency of fourteen counselors in forecasting the honor-point ratios earned by 543 entering arts college freshmen in their first year’s work. The actuarial prediction was based upon a two-variable regression equation (ACE and high school rank) with betas derived from a previous sample. The counselors made their predictions immediately after an interview of 45 minutes to one hour duration, and had available the two regression variables plus scores on the Cooperative English Test, the Mooney Problem Check List, and a four-page personal inventory form. The counselors were graduate students in psychology or educational psychology in their second to final year of graduate study. He found that the mean absolute error of the actuarial prediction was significantly less than that of the counselors; the counselors overestimated honor-point ratio; there were significant differences among the counselors in their average error; eleven counselors were less accurate than the regression equation, while three were more accurate, but not significantly; when a counselor predicts knowing the actuarial prediction, his result tends to be less accurate than the actuarial prediction itself, i.e., the addition of clinical judgment reduces predictive power (borderline significance); and, finally, if counselors who are poor predictors are allowed to use the actuarial table in making predictions, they then predict as well as the good predictors.
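Melton’s accuracy criterion is mean absolute error rather than a correlation, which is worth seeing once. The sketch below uses synthetic placeholder numbers, including a counselor who systematically overestimates, as Melton found; nothing here reproduces his data.

```python
# Sketch of comparing actuarial vs. clinical predictions by mean
# absolute error (MAE), Melton's criterion. All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(1)
actual = rng.normal(1.0, 0.5, 100)             # honor-point ratios
actuarial = actual + rng.normal(0, 0.30, 100)  # regression predictions
# A counselor with a positive bias (overestimation) and more noise:
counselor = actual + rng.normal(0.15, 0.45, 100)

def mae(pred):
    """Mean absolute error of predictions against the criterion."""
    return np.mean(np.abs(pred - actual))

print(f"actuarial MAE = {mae(actuarial):.2f}, counselor MAE = {mae(counselor):.2f}")
```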
Chauncey 1936 Harvard p.112
A final relevant study I have from personal correspondence with Henry Chauncey, president of the Educational Testing Service. In 1936 he undertook a comparison of predictive methods, the criterion being college grades (end of freshman year) and the subjects being a random sample of 100 Harvard entering freshmen. Statistical predictions were made on the basis of high school rank and College Board Examination scores. These were genuine predictions, i.e., made at the start of the freshman year. The clinical predictions were made by three members of the freshman dean’s office, working independently on each case. These clinical predictions were based upon the same two quantitative items as the regression equation, plus letters of recommendation, information on extracurricular activities, and a statement by the student as to his reasons for coming to Harvard. All four of the resulting validity correlations were in the .60’s, the statistical validity ranking second. The difference between the statistical coefficient and that of the “best” clinician would not be significant with an N of 100.
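Meehl’s remark that the difference “would not be significant with an N of 100” can be sanity-checked with Fisher’s r-to-z transformation. The specific correlations below are invented stand-ins for “in the .60’s”, and since the clinical and statistical predictions share the same sample, this independent-samples check is only approximate; a dependent-correlations test (e.g., Steiger’s) would be the proper tool.

```python
# Rough check: z statistic for the difference between two correlations,
# treating them as if they came from independent samples. The .69/.62
# pair is hypothetical, chosen to be "in the .60's".
import math

def fisher_z_diff(r1: float, r2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

z = fisher_z_diff(0.69, 0.62, 100, 100)
print(f"z = {z:.2f}")  # about 0.86, well below 1.96, so not significant at .05
```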
Combining factors can be problematic for humans
p.108 - an important point: even when we can pull out factors accurately, we are often poor at combining them.
However, this very crude prediction index showed a point-biserial of .62 with (dichotomized) movement ratings on the derivation sample, which shrank to .52 on the cross-validation sample (N = 47). The same skilled judges were also asked to make a dichotomous clinical prediction of movement on the basis of their reading of the same initial interview data; these predictions, as made by each of the three judges, had no validity, and the judges did not agree with one another. Apparently these skilled case readers can rate relatively more specific but still fairly complex factors reliably enough so that an inefficient mathematical formula combining them can predict the criterion; whereas the same judges cannot combine the same data “impressionistically” to yield results above chance.
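Since the point-biserial validity and its shrinkage on cross-validation carry the argument here, a small sketch may help: combining weights are fit on one half of synthetic data and the validity recomputed on the other half, where it typically shrinks. Nothing here reproduces the study’s .62/.52 figures.

```python
# Sketch of a "crude prediction index" combining several rated factors,
# validated by point-biserial correlation on derivation vs. holdout
# halves. All data are synthetic.
import numpy as np

rng = np.random.default_rng(2)

n, k = 94, 5
ratings = rng.normal(0, 1, (n, k))                        # k rated factors
moved = ratings.sum(axis=1) + rng.normal(0, 2, n) > 0     # dichotomized criterion

deriv, cross = slice(0, 47), slice(47, 94)

# Fit the combining formula on the derivation half only.
X = np.column_stack([np.ones(47), ratings[deriv]])
w, *_ = np.linalg.lstsq(X, moved[deriv].astype(float), rcond=None)

def point_biserial(index, y):
    """Pearson r between a continuous index and a 0/1 criterion."""
    return np.corrcoef(index, y.astype(float))[0, 1]

index_d = X @ w
index_c = np.column_stack([np.ones(47), ratings[cross]]) @ w
print(f"derivation r       = {point_biserial(index_d, moved[deriv]):.2f}")
print(f"cross-validation r = {point_biserial(index_c, moved[cross]):.2f}")
```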
Qualifications to the conclusion
p.114 ff. - He tries his best to be very fair and anticipate most objections …
In the interpretation of these studies, there are several complicating factors which must be kept in mind. In the first place, we know too little about the skill and qualifications of the clinicians who were making the predictions. For instance, there is no reason to assume that the guesses of an otherwise undescribed Bavarian physician will be based upon sufficient psychiatric insight so that they ought to be taken as fair samples of the outcome of clinical judgment.
Secondly, some of the studies have involved the comparison of clinical predictions with the predictions of regression equations in which the statistical weights were determined by the data of the group to be predicted, not cross-validated. Partly counterbalancing this is the fact that only seven of the multivariable studies used empirical weights assigned by efficient methods; the remaining eight assigned weights judgmentally or by other non-optimal methods. Only five of these studies evaluate predictive efficiencies for the several clinicians separately. The clinician is a shadowy figure, and while it is important to know what the average clinician can do in competition with the actuary, it is also important, and of even greater theoretical interest, to know whether there are some clinicians who can (consistently) do better than the regression equation. However, it is difficult to evaluate the argument sometimes offered that the best clinician is the appropriate representative for comparison with the statistical technique. Actually, if we judge from the studies reviewed, even this standard of evaluation would probably not do much to change the box score as between the two methods. … Presumably some kind of longitudinal study is needed to find out whether and to what degree the “good” clinician is stably such, rather than being merely the momentarily luckiest fellow among a crew of equal or near-equal mediocre guessers. Even a clear proof of stable differences among clinicians would still leave us with a serious practical problem.
Meehl 1986
This is a short follow up paper (just a few pages).
Why people still don’t accept his conclusions:
- Sheer ignorance: It amazes me how many psychologists, sociologists, and social workers do not know the data, do not know the mathematics and statistics that are relevant, do not know the philosophy of science, and are not even aware that a controversy exists in the scholarly literature. But what can you expect, when I find that the majority of clinical psychology trainees getting a PhD at the University of Minnesota do not know what Bayes’ Theorem is, or why it bears upon clinical decision making, and never heard of the Spearman–Brown Prophecy Formula! [A worked numerical example of both formulas appears after this list.]
- The threat of technological unemployment: If PhD psychologists spend half their time giving Rorschachs and talking about them in team meetings, they do not like to think that a person with an MA in biometry could do a better job at many of the predictive tasks.
- Self-concept: “This is what I do; this is the kind of professional I am.” Denting this self-image is something that would trouble any of us, quite apart from the pocketbook nerve.
- Theoretical identifications: “I’m a Freudian, although I have to admit Freudian theory doesn’t enable me to predict anything of practical importance about the patients.” Although not self-contradictory, such a cognitive position would make most of us uncomfortable.
- Dehumanizing flavor: Somehow, using an equation to forecast a person’s actions is treating the individual like a white rat or an inanimate object, as an it rather than as a thou; hence, it is spiritually disreputable.
- Mistaken conceptions of ethics: I agree with Aquinas that caritas is not an affair of the feelings but a matter of the rationally informed will. If I try to forecast something important about a college student, or a criminal, or a depressed patient by inefficient rather than efficient means, meanwhile charging this person or the taxpayer 10 times as much money as I would need to achieve greater predictive accuracy, that is not a sound ethical practice. That it feels better, warmer, and cuddlier to me as predictor is a shabby excuse indeed.
- Computer phobia: There is a kind of general resentment, found in some social scientists but especially people in the humanities, about the very idea that a computer can do things better than the human mind. I can detect this in myself as regards psychoanalytic inference and theory construction, but I view it as an irrational thought, which I should attempt to conquer.
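Since the first item above leans on Bayes’ Theorem and the Spearman–Brown formula, here is a quick worked example of both, with invented numbers. The Bayes example illustrates the base-rate point: for a rare condition, even a reasonably good diagnostic sign yields mostly false positives.

```python
# Worked examples of the two formulas Meehl says trainees do not know.
# All numbers are invented for illustration.

def bayes_posterior(base_rate: float, sensitivity: float, specificity: float) -> float:
    """P(condition | positive sign), via Bayes' Theorem."""
    p_pos = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
    return sensitivity * base_rate / p_pos

def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability of a test lengthened by a factor of k."""
    return k * reliability / (1 + (k - 1) * reliability)

# A sign with 80% sensitivity and 85% specificity, for a condition with
# a 10% base rate, is right only about a third of the time it fires:
print(f"P(condition | positive) = {bayes_posterior(0.10, 0.80, 0.85):.2f}")  # ~0.37

# Doubling a test with reliability .70 pushes it up to about .82:
print(f"stepped-up reliability  = {spearman_brown(0.70, 2):.2f}")
```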