Ipsative Tests: Psychometric Properties

In this final post I want to look at the psychometric properties of ipsative measures and at the evidence offered in support of ipsative tests.

Psychometric properties
As most of our readers are HR practitioners, not statisticians, I will try to keep the psychometric critique relatively brief. However, the psychometric weaknesses of ipsative testing are well reviewed, and for those interested I strongly suggest a thorough read of Meade (2004). In essence, the critiques concern both the factor structure and, as a corollary, the reliability of measurement.

Factor analysing data from an ipsative tool is more complex. The way that it was done in Saville and Willson’s article (1991) was, in my opinion, artificial, and to quote Barrett: “This (their) finding completely invalidates Saville and Willson’s (1991) and, by extension, Cronbach’s contention that a factor analysis can be reasonably implemented on ipsative data by simply dropping one score. The interpretation of factor analysis depends entirely on the weights of the variables after regression onto a number of underlying traits. Thus, unless the focus of a factor analysis was simply to determine the amount of variance accounted for by each factor, this procedure is quite insupportable. The choice of which scale to drop will dramatically affect the interpretation of the factor solution”.

In short, ipsative data does not lend itself well to factor analysis. Factor analysis, in turn, is the basis on which we determine construct validity (i.e. the basis for understanding the psychological phenomena we are hoping to measure). As a result, it is not surprising that the reliability of ipsative scales has consistently been shown to be lower than that of normative scales.
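For readers who like to see the mechanics, here is a minimal sketch of why factor analysis struggles with ipsative data. It assumes that ipsatization can be approximated by subtracting each person's own mean from their scale scores (a simple stand-in for a forced-choice format), and the scale count, sample size, and function names are hypothetical. Because every ipsatized profile sums to zero, every row of the covariance matrix also sums to zero, so the matrix is singular and one scale is always fully determined by the others.

```python
import random

def ipsatize(profile):
    """Subtract the person's own mean from each of their scale scores."""
    mean = sum(profile) / len(profile)
    return [score - mean for score in profile]

def covariance_matrix(rows):
    """Sample covariance matrix for a list of equal-length score profiles."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(k)]
        for a in range(k)
    ]

random.seed(0)
K_SCALES, N_PEOPLE = 5, 200  # hypothetical sizes, chosen only for illustration

# Independent normative scores, then the ipsatized version of the same data.
normative = [[random.gauss(0, 1) for _ in range(K_SCALES)] for _ in range(N_PEOPLE)]
ipsative = [ipsatize(person) for person in normative]

# Each row of the ipsative covariance matrix sums to (numerically) zero,
# which is the linear dependency that makes the matrix singular.
row_sums = [sum(row) for row in covariance_matrix(ipsative)]
print([round(s, 10) for s in row_sums])
```

This dependency is the reason the "drop one scale" workaround exists at all, and it is also why the choice of which scale to drop matters so much.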

In a famous article entitled “Spuriouser and Spuriouser: The Use of Ipsative Personality Tests”, Johnson, Wood, and Blinkhorn (1988) restated the psychometric arguments for abandoning ipsative questionnaires and provided some empirical examples of the error-prone consequences of their use. This article was perhaps the strongest indictment of ipsative measurement until the more recent paper by Meade (2004).

Moreover, Hough and Ones (2001) make the issues very clear. The key issue is not reliability and factor analysis, or even what an ipsative test correlates with. You may be able to reliably produce results from ipsative questionnaires, but they are WITHIN PERSON RANKS; thus, as soon as you compare two people’s results you are treading on dangerous ground. Between-people comparisons are necessary for selection whenever you have more than one candidate.
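A toy example makes the within-person point concrete. This is only a sketch: the two candidates, the scale names, and the scores are invented, and ipsatization is again approximated by subtracting each person's own mean. Candidate A scores higher than candidate B on every scale in absolute terms, yet their ipsative profiles come out identical.

```python
def ipsatize(profile):
    """Express each scale score relative to the person's own mean."""
    mean = sum(profile.values()) / len(profile)
    return {scale: score - mean for scale, score in profile.items()}

# Hypothetical normative scores on a 1-10 scale (invented for illustration).
candidate_a = {"persuasive": 9, "organised": 8, "resilient": 7}
candidate_b = {"persuasive": 4, "organised": 3, "resilient": 2}

profile_a = ipsatize(candidate_a)
profile_b = ipsatize(candidate_b)

# Both profiles show the same within-person ordering and spacing,
# so the ipsative scores cannot tell these two candidates apart.
print(profile_a)
print(profile_b)
print(profile_a == profile_b)
```

The absolute difference between the two candidates, which is exactly what a selection decision needs, has been removed by design.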

The Rebuttal
All of the rebuttals (to my knowledge) on ipsative testing for use in selection come from one company, SHL. This is not surprising given that SHL have developed ipsative tests which they hope to sell for selection. Their line of reasoning, as is often the case, is based on a good story: that the tests are equally valid and difficult to fake.

Despite the lack of independent support, the direct criticism, and a recent top-class paper using SHL data (Meade, 2004), it would be remiss not to tackle the points raised by Dave Bartram (SHL Director of Research) directly. In essence, they rest on the premise that the key difference is the number of scales. This has been critiqued thoroughly by Paul Barrett, and much of what is cited below comes from direct postings and conversations between Paul and me. The first key defense of ipsative testing was published by Dave Bartram in 1996, while he was Professor at the University of Hull and before he became SHL Director of Research (unfortunately after Sean Hammond’s and my conference paper was given in January 1996). The paper reference and abstract are: Bartram, D. (1996) The relationship between ipsatized and normative measures of personality. Journal of Occupational & Organizational Psychology. Vol 69(1), Mar 1996, pp. 25-39.

Abstract: Presents a general expression for computing the relationships between normative scales and ipsative ones derived from them, based on the number of scales and the intercorrelations between the normative scales. The results obtained from various empirical and computer generated data sets were compared with those expected on the basis of the equations and a close correspondence was found. Expressions for computing the reliability of ipsatized scales and the reliability of ipsatized scale differences were also produced and the implications of these for profile analysis are discussed. It is noted that ipsatized measures are unreliable when the number of scales is less than about 10 or when the correlations between normative scales are greater than .30. This unreliability is increased by full ipsatization and by inequality of the variances of the normative scales from which the ipsatized scales are derived.

Now this was a very well thought out study, using computer-generated data (N=2000) which allowed normative data to be reconstructed as ipsative, thus permitting a direct “head-to-head” comparison without worrying about confounding by social desirability. This paper really did put to rest the psychometrics part of the debate on ipsative vs. normative measures. The reason every SHL employee does not have this paper indelibly stamped in their minds is that it contains several cautionary passages which do not mesh well with their sales message, one of which I quote below:

“These results show that ipsative and normative scales have a high degree of equivalence only when the normative scales are independent of one another [0.0 correlation between scales]. When there are correlations between the normative scales, the correlations between them and ipsative scales rapidly decrease. When the number of scales is large, reasonable levels of equivalence are only maintained for low levels of normative scale intercorrelation” (Pg. 30, Bartram 1996).
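The direction of this result can be illustrated with a simple closed form. This is a sketch under strong simplifying assumptions and is not Bartram's general expression: it assumes k normative scales with equal variances and one common intercorrelation rho, and it treats ipsatization as subtracting each person's own mean. Under those assumptions the correlation between a normative scale and its ipsatized counterpart works out to sqrt((k - 1)(1 - rho) / k). The function name and the choice of 30 scales are hypothetical.

```python
import math

def normative_ipsative_correlation(k, rho):
    """Correlation between a normative scale and its person-mean-centred version.

    Assumes k scales with equal variances and a common intercorrelation rho.
    This is an illustrative simplification, not Bartram's general expression.
    """
    return math.sqrt((k - 1) * (1 - rho) / k)

K_SCALES = 30  # hypothetical scale count, chosen only for illustration

# Equivalence is highest when the normative scales are independent (rho = 0)
# and falls steadily as the scales become more intercorrelated.
for rho in (0.0, 0.1, 0.3, 0.5):
    r = normative_ipsative_correlation(K_SCALES, rho)
    print(f"rho={rho:.1f}  r(normative, ipsatized)={r:.3f}")
```

With 30 scales the value is roughly .98 when the scales are independent and roughly .82 when they intercorrelate at .30, which is the same pattern the quoted passage describes.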

Quite by chance (or maybe not!), the Barrett et al. (1996) paper analysing the OPQ Normative Concept Model was published in the same year, containing, on page 15, a histogram of the inter-scale correlations of the OPQ within a dataset of 2301 applicants. Of these inter-scale correlations, 64 out of 465 (about 14%) were greater than r=0.3 and 149 (about 32%) were greater than r=0.2. Obviously, the level of correlation between scales is low, but it is not 0.0.

The interesting feature of Bartram’s paper is that he shows that you can compute comparative ipsative scale reliabilities (albeit from a derived formula that uses the normative values to estimate ipsative values). It was left to Helen Baron (1996) (formerly of SHL) to conclude: “However, for larger sets of scales (N~30) with low average intercorrelations, ipsative data seems to provide robust statistical results in reliability analysis, but not under factor analysis”. Thus, by her own admission, the factor structure of ipsative data is poor. This leaves the practitioner with little knowledge of what construct was actually measured. This is compounded, of course, by the fact that the items responded to are different in every case!

Saville and Willson (1991) responded to criticisms by attempting to demonstrate that ipsative tests manifest equal, if not superior, validity to normative tests. Using a novel, if somewhat ill-specified, computer-generated dataset, they showed that under certain conditions ipsative and normative tests will yield equivalent psychometric parameters. In addition, they went on to show that, with certain real datasets, the statistical results expected from Johnson et al. (1988) were not observed. However, these conclusions have been challenged by Cornwell and Dunlap (1994), who carried out a re-analysis of the Saville and Willson data and found little support for their claims. The reality is that gains in validity have not been shown, and indeed the scores on ipsative and normative measures are often cited as comparable (Bartram, 2006). So, not only does the practitioner end up with a faulty measure, they do so for no comparative gain!

Practical and robust are not mutually exclusive. It is a classic red herring to imply that those who take measurement seriously are just pie in the sky. The complete opposite is true. Those interested in psychometrics are the people who want to see things done right so the discipline goes forward.

The Issues in Summary
The key issue is that you cannot practice unless you understand what you are using. To again quote Paul Barrett: “Yes, it is important to have a good bedside manner but this is secondary to knowing what medication to prescribe.”

Ipsative tools:
1. Are a within-person measure, to be used for individual counseling, not for comparisons across people.
2. Have questionable psychometric properties.
3. Are not resistant to faking.
4. Have no demonstrable validity gains.
5. Are in the main supported by only one company, which has a vested interest in their usefulness. Their application is therefore more market driven than science driven.

We have a lot of psychological interventions prescribed by people who know little about what they are prescribing. At least with ipsative testing we know what the medication can be prescribed for. Applying ipsative testing, a within-person measure, to selection is ill-advised, and it is time that this practice was eradicated once and for all, on the grounds that I/O psychology is truly a discipline guided by science and not marketing whims.

So that this is not just ‘MY’ view, here are the references for those who want them:

Baron, H. (1996) Strengths and limitations of ipsative measurement. Journal of Occupational and Organisational Psychology, 69, 49-56.

Cattell, R.B. (1944) Psychological Measurement: ipsative, normative, and interactive. Psychological Review, 51, 292-303.

Clemans, W. V. (1966) An analytic and empirical investigation of some properties of ipsative measures. Psychometric Monographs, vol.14

Closs, S.J. (1976) Ipsative vs normative interpretation of test scores or “What do you mean by like?”. Bulletin of the British Psychological Society, 29, 228-299

Cornwell, J. M. and Dunlap, W. P. (1994) On the questionable soundness of factoring ipsative data: a response to Saville and Willson. Journal of Occupational and Organisational Psychology, 67, 89-100.

Hicks, L.E. (1970) Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74, 167-184.

Hough, L. and Furnham, A. (2003) Use of Personality Variables in Work Settings. In W. Borman, Ilgen, D.R., and Klimoski, R.J. (eds) Handbook of Psychology, Volume 12: Industrial and Organizational Psychology. New York, Wiley. (Chapter 5, pp 77-106)

Hough, L. and Ones, D. (2001) The Structure, Measurement, Validity, and Use of Personality Variables in Industrial, Work, and Organizational Psychology. Chapter 12 (pp 233-267) in N. Anderson, D. Ones, Sinangil, H., and Viswesvaran, C. (eds.) Handbook of Industrial, Work, and Organizational Psychology, Volume 1: Personnel Psychology. New York: Wiley.

Johnson, C. E., Wood, R., and Blinkhorn, S. F. (1988) Spuriouser and spuriouser: the use of ipsative personality tests. Journal of Occupational Psychology, 61, 153-162.

Martin, B.A., Bowen, C., and Hunt, S. (2002) How effective are people at faking on personality questionnaires? Personality & Individual Differences. Vol 32, 2, 247-256.

Saville, P. & Willson, E. (1991). The reliability and validity of normative and ipsative approaches in the measurement of personality. Journal of Occupational Psychology, 64, 219-238.

Schmit, M.J., and Ryan, A.M. (1993) The big five in personnel selection: factor structure in applicant and non-applicant populations. Journal of Applied Psychology, 78, 6, 966-974.


10 thoughts on “Ipsative Tests: Psychometric Properties”

  1. Gary Boettcher

    Thoughts on the MBTI and IRT as a means of measuring the instrument’s reliability? Have you reviewed the MBTI resource manual?


    PhD student

    1. drpaulatopra Post author

      Hi Gary

      Firstly, thank you for being a reader of the blog. I have not reviewed the MBTI resource manual in depth. Is there something in particular that you are referring to?

  2. Derek

    I would appreciate it if you could point me in the right direction. I am conducting longitudinal research with a time lag of 12 months. My question relates to two of my latent variables, resilience and work-life balance.

    Time 1 resilience is correlated with itself at Time 2, at r = .50
    Time 1 work-life balance is correlated with itself at Time 2 at r = .45

    I want to comment on whether the strength of the correlation appears more state-like or trait-like. In my search I noticed that longitudinal Big Five dimensions correlate at about .75 to .80. Are there any guidelines in journals for this distinction regarding strength of correlations?

    The reason I wanted to make a comment is that at Time 1 I have many mediations, whereas at Time 2 I have very few.

    I would appreciate any help you can give
    Thank you

    1. Dr Paul Post author

      Hi Derek

      My view is that the question itself needs to be revisited. The trait-state distinction is highly subjective and, in my opinion, an artificial dichotomy by which to describe human behavior, especially as it relates to stress tolerance. The situational effect of specific stressors (as they relate to specific individuals) is, I think, the guts of the issue, not whether the underlying finding is more trait-like or state-like. The starting point in this regard is quality theory building, not the stats, and I don’t think that a cut-off based on the size of r is going to fundamentally answer the research question you are asking. Best of luck in your studies.

  3. julia

    Hi, it is a great pleasure to find your blog and comments on ipsative tests in selection. I have always firmly believed this to be a big no. However, other professionals have been trying to convince me I am wrong regarding the OPQ32r, as they reckon it combines normative and ipsative so is OK. I cannot see how it does this, and even if it did, why then have the ipsative element, and how can it be used to compare across candidates? I would be extremely grateful for your thoughts here.


    1. Dr Paul Post author

      Hi Julia

      Thank you for the positive feedback. The work on the OPQ32r is statistically impressive. I would suggest that SHL have done more work than most, and therefore you can use that as a benchmark to discount 99.99% of ipsative tool claims! That said, they simply cannot get around the theoretical divide that an ipsative measure is, by design, a within-person measurement. You can’t address this by statistics; it is inherent to the measurement.

      Test producers continually come out with variations on the ipsative theme, and each time independent studies, like Meade (2004) (done, by the way, with OPQ data), fail to support the claim. The result? A new variation.

      Statistics are not the only way to validate an argument, and once this is understood the argument presented by salespeople will be less convincing. I very, very much doubt that those selling the tests can understand the depth of the statistical analysis used in the OPQ32r, and therefore it is up to them to show you the practical difference between the scores on the normative and ipsative versions. This is the principle of Occam’s razor. If you know that normative tools don’t suffer from these issues for comparative measurement, don’t suffer from the issues of reliability and poor factor structure, and are undisputed by all major psych associations for use in selection, then no matter how fancy the stats, the test has to perform exceptionally better than a normative equivalent to even get out of the blocks. The fact that they demonstrate its construct validity by comparing it with normative versions should highlight that the whole exercise is somewhat futile, over and above having a unique selling proposition (USP) to be seemingly ‘different’. The creation of a USP has far more to do with how psych tools are positioned than with their demonstrable increased usefulness. Your line of questioning indicates you are clearly very sensible, and I would suggest that you stick to your view (which is right), go with what simply makes the most logical sense to you, and don’t allow yourself to be distracted by spurious stats.

  4. Peki

    Hi, wonderful discussion and posts (and website!). I learned so much! I am just starting to wade into workplace test development and ipsative tests (I am much more familiar with normative tests, as my background is in personality and social psychology and not IO or OD) and am struggling to discern research-driven from marketing-driven claims as to their effectiveness. Do you have any new thoughts on this in light of the recent-ish IRT paper by Brown and Olivares (2012)?

    1. Dr Paul Englert Post author

      Hi Peki

      Thank you for the kind words and I’m pleased to see you are enjoying the blog.

      Anna Brown’s work in this space has definite merit. I’m yet to see any evidence of a change in criterion-related validity, but the methodology is sound and does present a sea change for ipsative testing. I therefore think it wise to think of ipsative tests pre and post this work. The field develops, and I’m the first to both encourage this development and recognise it when it happens.

  5. Alice

    Great blog, very informative, so thank you for posting. I am currently trying to create a questionnaire, and at the moment I have designed it on a normative scale. My employer wants the end product to be ipsative. Can I by any chance perform a factor analysis on the normative questionnaire (Likert scale) and, once I am happy with the validity of the results, use the questionnaire items to create a forced-choice questionnaire? Would that create a valid ipsative test? Or would changing the measurement scale violate a lot of statistical/validity laws?!

    1. Dr Paul Englert Post author

      Hi Alice

      Great to hear from you, and glad that you liked the blog. As I think you suspect, changing the measurement scale means that simply transposing a factor-derived construct into an ipsative test will result in the supporting stats not being upheld.

      There are two suggestions that I have for you. The first is to figure out why an ipsative measure is being requested. Is the construct that is being measured best assessed between people or between constructs? Unless it is the latter, I see little rationale to develop such a measure.

      Secondly, if you are interested in what is involved, I suggest you look at some of the work by Guenole or Brown. These two researchers are conducting excellent work on ipsative measurement and are demonstrating that there are means of developing robust tools using ipsative methodology. However, both of these researchers are in the upper echelons of the psychometric community, and only you can answer whether you have the skill set and the data set to conduct this level of work in your setting:

      Guenole, N., Brown, A. A., & Cooper, A. J. (2016). Forced-Choice Assessment of Work-Related Maladaptive Personality Traits: Preliminary Evidence From an Application of Thurstonian Item Response Modeling. Assessment, 1073191116641181.

      Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1), 36.

      Best of luck!

