Motivation for reading / Who should read this:
I read Validity and Validation – Understanding Statistics, by Catherine S. Taylor in part as a follow up to the meta-analysis book from two weeks ago, and in part because I wanted to know more about the process of validating a scale made from a series of questions (e.g. The Beck Depression Inventory, or a questionnaire about learning preferences). For scale validation, there’s a field called Item Response Theory, which I understand better because of this book, but not at any depth.
This book, a quick read at 200 small pages, compliments the handbook of educational research design that Greg Hum and I made. I plan to recommend it anyone new to conducting social science research because it provides a solid first look at the sort of issues that can prevent justifiable inferences (called “threats to internal validity”), and those that can limit the scope of the results (called “threats to external validity”).
A good pairing with “Validity and Validation” is “How to Ask Survey Questions” by Ardene Fink. My findings from that book are in this post from last year. If I were to give reading homework to consulting clients, I would frequently assign both of these books.
What I learned:
Some new vocabulary and keywords.
For investigating causality, there is a qualitative analogue to directed acyclic graphs (DAGs), called ‘nomological networks’. Nomological networks are graphs describing factors, directly or not, that contribute to a construct. A construct is like a qualitative analogue of a response variable, but has a more inclusive definition.
To paraphrase of Chapter 1 of , beyond statistical checks that scores from a questionnaire or test accurately measure a construct, it’s still necessary to ensure the relevance and utility of that construct.
Hierarchical linear models (HLMs) resemble random effect models, or a model that uses Bayesian hyperpriors. An HLM is a linear model where the regression coefficients are themselves response values to their own linear models, possibly with random effects. More than two layers are possible, in which the coefficients in each of those models could also be responses to their own models, hence the term ‘hierarchical’.
What is Item Response Theory?
Item response theory (IRT) is set of methods that puts both questions/items and respondents/examinees in the same or related parameter spaces.
The simplest model is a 1-Parameter IRT model, also called a Rasch model. A Rasch model assigns a ‘difficulty’ or ‘location’ for an item based on how many respondents give a correct/high answer or an incorrect/low answer. At the same time, respondents also have a location value based on the items they give a correct/high response. An item that few people get correct will have a high location value, and a person that gets many items correct will have a high location value.
A 2-parameter model includes a dimension for ‘discrimination’. Items with higher discrimination will elicit a greater difference in responses between respondents with a lower and those with a higher location than the item. Models with more parameters and ones for non-binary questions also exist.
The WINSTEPS software package for item response theory (IRT):
WINSTEPS is a program that, when used on a data set of n cases giving numerical responses each of to p items, gives an assessment of how well each item fits in with the rest of the questionnaire. It gives two statistics: INFIT and OUTFIT. OUTFIT is like a goodness-of-fit measure for extreme respondents at the tail-ends of dataset. INFIT is a goodness-of-fit measure for typical respondents.
In the language of IRT, this means INFIT is sensitive to odd patterns from respondents whose locations are near that of the item, and OUTFIT is sensitive to odd patterns from respondents with locations far from the item. Here is a page with the computations behind each statistic.
On CRAN there is a package called Rwinsteps, which allows you to call functions in the WINSTEPS program inside R. There are many item response theory packages in R, but the more general ones appear to be “cacIRT”, “classify”, “emIRT”, “irtoys”, “irtProb”, “lordif”, “mcIRT”, “mirt”, “MLCIRTwithin”, “pcIRT”, and “sirt”.
For future reference.
Page 11 has a list of common threats to internal validity.
Pages 91-95 have a table of possible validity claims (e.g. “Scores can be used to make inferences”), which are broken down into specific arguments (e.g. “Scoring rules are applied consistently”), which in turn are broken down into tests of that argument (e.g. “Check conversion of ratings to score”).
Pages 158-160 have a table of guidelines for making surveys translatable between cultures. These are taken from a document of guidelines of translating and adapting tests between languages and cultures from the International Test Commission. https://www.intestcom.org/files/guideline_test_adaptation.pdf
The last chapter is entirely suggestions for future reading. The following references stood out:
 (Book) Educational Measurement, 4th edition, by Brennan 2006. Especially the first chapter, by Messick
 (Book) Hierarchical Linear Models: Applications and Data Analysis by Ravdenbush and Bryk 2002.
 (Book) Structural Equation Modelling with EQS by Byrne 2006. (EQS is a software package)
 (Book) Fundamentals of Item Response Theory, by Hambleton, Swaminthan, and Rogers 1991.
 (Book) Experimental and Quasi-Experimental Designs for General Causal Interance by Shadish, Cook, and Campbell 2002. (this is probably different from the ANOVA/Factorial heavy Design/Analysis of Experiments taught in undergrad stats)
 (Journal) New Directions in Evaluation