Publish and be (quite rightly) damned

This week, the popular science magazine Psychology Today found itself at the centre of controversy following its publication of a blog post by evolutionary psychologist Satoshi Kanazawa. His post was subsequently removed from the Psychology Today website, but you can consult it here (or here for a text-based version). Interestingly, the way I located this cached post-deletion copy was via a web search that took me to a particular discussion board described by Wikipedia as “a white nationalist and supremacist neo-Nazi Internet forum” and “the Internet’s first major hate site” (for legal reasons, I won’t name or link the forum here). The fact that contributors at the forum were wholly approving of Kanazawa’s article (“very well written”, “well researched”, and “truthful” was their verdict) probably tells you much of what you need to know about the nature of its content. So does the article’s title: “Why are black women less physically attractive than other women?”

Kanazawa described some data he had accumulated from a large-scale longitudinal study of adolescents. Then, following a methodology that I explain a bit more below, he drew the conclusion that black women were statistically “far less attractive than white, Asian, and Native American women”. He went on to slice and dice the figures a bit, but it all came back to the same point: “black women are objectively less physically attractive than other women” (my emphasis). (He is not cautious about elaborating upon the implications either, unhesitatingly raising what he calls “the positive association between intelligence and physical attractiveness.”) To clarify his reasoning, he presented his statistical findings using several graphics such as the following:

Now you just know this must be scientific because: (a) the variable name is nicely jargon-laden (“Mean latent physical attractiveness”); (b) the scores are expressed to a whopping five decimal places; and (c) error bars are shown, which presumably somehow reflect the sophistication of the analysis that was performed. In addition, the fact that there were several such graphics appears to suggest that Kanazawa was looking at a very large dataset. However, contrary to these visual cues, the study described here (if it can even be called a ‘study’) is really quite weak. In fact, very weak. In fact, it is so terribly weak as to lie almost beyond the realm of ordinary scientific criticism. And this is precisely why I believe that the post should not have been taken down.

Kanazawa’s post attracted a wave of protest from around the world. Most of this appeared to be stimulated by a degree of horror at the findings he described (although it is also clear that many critics were quick to attack his methods). To a large extent, however, the incendiary nature of the findings serves to distract us from a more fundamental problem: there is something truly disturbing about the quality of science on display. The problem with Kanazawa’s study was its shockingly deficient methodology. For, although he generated lovely graphics depicting means and standard errors (to five decimal places, remember), and although he reported using factor analysis (a complex statistical procedure at the best of times), his measures were hampered by a pretty small sample size. And when I say small, I mean small. “So what was it?” I hear you ask. Brace yourselves. It was…three. [Although, please see the Post-Script below on how this point is actually ambiguous; it may be more accurate to state that the attractiveness evaluation for each participant was a single-item rating provided by three source persons.]

Yup. Three. For each adolescent recruited onto the study, three interviewers working on the research personally rated her attractiveness. So while the attractiveness ratings related to hundreds (if not thousands) of target persons, the ratings of each individual target were generated by three source persons. And get this — Kanazawa doesn’t tell us whether the judges were men or women themselves, or what race they were. So all we know is that each black woman was rated by three judges (of variable but unspecified race and gender) who returned lower average attractiveness ratings than were returned for other women. This isn’t scientific data. This isn’t even science. It is pseudoscience.

The reason this work is pseudoscience is that it portrays itself as scientific in ways that it clearly is not. Take those graphs, for example. They present summaries of data spread across multiple categories, to lots of decimal places. By all appearances, they refer to a large sample. However, the graphs are misleading in this respect, because they are not grounded in the sample size of the study itself (i.e., in its number of participants). Instead of reporting three sets of ratings, Kanazawa treats all of the ratings provided by the three judges as separate units of analysis, when they are not.

This is akin to me providing a rating for a hundred different TV shows and then analysing my own data as if I had a sample of n = 100 (instead of a sample of n = 1). And then posting on my blog some verbiage to the effect that I had statistically established the world’s most popular TV show.

Moreover, Kanazawa tells us that the findings are statistically significant. Actually, this is impossible. Or at least, it is impossible to truly test such data for significance based on comparisons of means, because they cannot be said to meet the assumptions of the relevant parametric statistical tests. Comparisons of means are valid only if each individual bit of data is independent, which is to say (in this context) if each rating analysed was generated by a separate participant. Which they weren’t.
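
To make the independence problem concrete, here is a minimal simulation (a hypothetical sketch with made-up numbers, not Kanazawa’s actual data or analysis): if the same three judges rate every target, and those judges happen to share an idiosyncratic bias against one group, then treating each rating as an independent data point shrinks the standard errors and manufactures an apparently huge group difference out of nothing more than three people’s tastes.

```python
import random
import statistics

random.seed(1)

# Hypothetical sketch: two groups of 500 targets whose TRUE attractiveness
# is identical (true score = 0 for everyone). Each target is rated by the
# SAME three judges, each of whom has a fixed idiosyncratic bias against
# group B.
judge_bias_b = [-0.4, -0.3, -0.5]  # the three judges' shared bias vs group B

def ratings(n_targets, biases):
    # each rating = true score (0) + judge bias + random noise
    return [b + random.gauss(0, 1) for _ in range(n_targets) for b in biases]

group_a = ratings(500, [0.0, 0.0, 0.0])  # 1,500 ratings
group_b = ratings(500, judge_bias_b)     # 1,500 ratings

# Naive analysis: treat all 1,500 ratings per group as independent.
def mean_se(xs):
    return statistics.mean(xs), statistics.stdev(xs) / len(xs) ** 0.5

mean_a, se_a = mean_se(group_a)
mean_b, se_b = mean_se(group_b)
print(f"group A: {mean_a:+.2f} ± {se_a:.3f}")
print(f"group B: {mean_b:+.2f} ± {se_b:.3f}")
# The tiny standard errors make the judges' shared bias look like a highly
# 'significant' group difference, even though the groups' true
# attractiveness is identical: the 1,500 ratings carry only three judges'
# worth of independent opinion about group B.
```

With only three independent judges behind those 1,500 ‘data points’, the apparent precision is illusory; the meaningful unit of analysis is the judge, not the rating.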

>>[POST-SCRIPT: Actually, there is an ambiguity in Kanazawa’s description of this procedure (see the Comments section below). While Kanazawa states that ratings were provided by “three different interviewers”, this could mean that the same three interviewers rated all persons, or that a different three interviewers rated each person. If the latter, then the ‘sample’ in this study would have been far larger than n = 3, which would negate one of the above criticisms. However, it would not negate all the problems, and in some ways it makes things worse. Firstly, it remains the case that each adolescent was rated by just three judges — and that we still don’t know what gender or race these judges were. If multiple judges were used, this would introduce a whole new level of inconsistency to the ratings: it could mean that some black women were rated by black men, others by black women, others by white women, others by Asian women, etc. Unless all adolescents were rated by all the possible types of judge (by my reckoning there are eight categories), this likelihood of inconsistency would only further undermine the purported validity of the attractiveness ratings. Secondly, it would not change the fact that each score is based on a single subjective judgement (on a 5-point scale) rather than on an objective assessment of attractiveness; and such single-item scales do not cater for instance-specific measurement error, nor do they possess construct validity. Thirdly, the problem of data independence would still remain (unless the study organisers had ensured that each of their research interviewers collected data from just one person, which seems highly unlikely). And fourthly, the graphics used (with their error bars and five decimal places) would still amount to overkill, given the simplistic and vague nature of the data being analysed.]<<
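
The point about single-item scales and instance-specific measurement error can also be illustrated with a short simulation (again a hypothetical sketch, with made-up noise levels): every individual rating carries random error, and while averaging several items cancels much of that noise, a single-item rating keeps all of it.

```python
import random
import statistics

random.seed(2)

# Hypothetical sketch: each person has a stable 'true score'; every
# observed item rating adds independent measurement noise. A single-item
# scale keeps all of that noise; averaging k items shrinks its variance
# by roughly a factor of k.
def observed(true_score, n_items, noise_sd=1.0):
    items = [true_score + random.gauss(0, noise_sd) for _ in range(n_items)]
    return statistics.mean(items)

true_scores = [random.gauss(0, 1) for _ in range(2000)]

def error_variance(n_items):
    # variance of the gap between what we observe and the true score
    errors = [observed(t, n_items) - t for t in true_scores]
    return statistics.pvariance(errors)

single = error_variance(1)   # one item: error variance near 1.0
multi = error_variance(10)   # ten items: error variance near 0.1
print(f"single-item error variance: {single:.2f}")
print(f"ten-item error variance:   {multi:.2f}")
```

A single 5-point judgement of ‘attractiveness’ sits at the noisy end of this spectrum, which is exactly why the poor test-retest reliability discussed in the comments below is unsurprising.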

And then we have this picture:

This is a facial proportions diagram, of the kind used in studies of the aesthetics of physical beauty. Scientists sometimes use the superimposed lines to investigate the so-called golden proportions likely to be seen as attractive by potential sexual partners. Some argue that comparing such data against previously established benchmarks represents an objective way of assessing beauty. Hence its use as an illustration for Kanazawa’s article. The problem is that Kanazawa didn’t use this method; the picture here is just for show. Its function, it seems, was to convey an impression of scientific objectivity that is not even nearly warranted by the article it was used to illustrate.

All told, I can understand why Psychology Today chose to remove this pseudoscientific post. In scholarly terms, the post is quite a shocker; while in sociopolitical terms, many readers will have found the content offensive. These two problems together would make any post extremely problematic. However, the censoring of scientific claims is a risky business. The big difficulty is that once you deprive audiences of the opportunity to scrutinize the basis for a claim, all that remains is the memory of the claim itself, which lives on in the rumour mill for anyone to promulgate for their own reasons. So now we have neo-Nazis declaring that Kanazawa’s work is “well researched” and “truthful”; meanwhile, the degree to which his methods are misleading is hidden from view. In short, the whole episode ends up accruing a greater benefit for xenophobes and racists than for scientists and citizens. This is a thoroughly bad outcome.

I argue: put the post back up and open a discussion forum. Let everyone see how awful Kanazawa’s work on this has been. Link the article to other blogs on the Psychology Today website that have recorded criticisms of the methods. Post further links to other psychology research that shows how researchers’ own biases and value-systems can influence the research process. Explain the need for blinding, for sampling, for statistical power, for proper validation of measures, for replication, for construct validity, and for peer-review.

And perhaps also post links to research showing how censorship of disapproved viewpoints can often backfire. Don’t turn this researcher into another free-speech martyr to be valorised by extremists. Let his post be seen by everyone, and revealed for what it is: a study of such monumental weakness as to stand as an object lesson in science at its most ugly.


  1. I read one of his books years ago: “Why Beautiful People Have More Daughters”. Very disappointed with this nonsense.

  2. Very cool summary of the whole issue. How certain are you of the n=3 statement though? My understanding is that Add Health is actually a really neat data set that employed a host of raters. Each subject is indeed rated by only 3 different raters at different points in time, but across subjects there are many more. So you have 3 within-subject data points, which are technically independent across subjects (to the extent that each subject is rated by a different set of 3 raters).
    Of course I am still with you about Kanazawa’s study being pseudoscience, as I pointed out here: (where I take the liberty this time of obnoxiously citing my own post 😛 ). I really like your suggestion about how we should deal with the issue though, and several bloggers at the site have made similar suggestions.

  3. Very nice post! I do agree with Daniel Hawes though. I don’t think the same 3 raters rated EVERYONE. I think there were a very large number of different raters at different time points, but I could be wrong. Would be interested to find that out for sure. At any rate, you may be interested in my reanalysis of the Add Health dataset:


    • Brian Hughes

      Thanks, Scott. Am definitely interested in your re-analysis of the dataset, and recommend that readers consult it:

      For one thing, your analyses highlight the key problem with single-item scales (as alluded to above); namely, that they increase instance-specific measurement error. Typically, such error is ‘averaged out’ a bit with multi-item scales, but it generates the maximum amount of noise with single-item scales. Your analyses, showing horrendously poor test-retest reliability (2% shared variance in some cases), capture this point superbly.


      • Thanks Brian. As you know, this is how science should be. I’m happy to engage with others like you who truly are interested in the truth. In this instance, it looks like Kanazawa’s portrayal of the truth was incredibly misguided.

  4. Pingback: Friday’s Reading List » PROSPECT Blog

  5. daniel is correct — there were not just three evaluators in the add health surveys. there were three waves of surveys, i.e. three different interviews over time of each subject.

    so, each subject presumably had three different evaluators. but we’re talking about a couple tens of thousands of subjects, so there must’ve been hundreds of evaluators at least.

    i did question the race of the evaluators, though, after looking at the websites of the companies to which the fieldwork was outsourced.

    • Brian Hughes

      Yes, good points. See above re number of evaluators. On the question of race of evaluators, this of course is also crucial. As I said on a different board, all we have to do is to consider the options.

      If we were to discover that all the evaluators were themselves Black women, we would be likely to draw a particular conclusion about the ratings based on that knowledge. However, if we were to discover that all the evaluators were actually White men, then we would be likely to draw a very different conclusion. Surely this is completely obvious? Therefore, the fact that Kanazawa didn’t describe (or even mention) the races (or ages or social classes) of the evaluators is a huge omission, which renders his interpretations at the very least extremely vague, if not utterly risible.

      We already know that his underlying statistical analysis is very dodgy, as per Scott’s post (see above).

      I also encourage readers to check out your blog post on this story:
