Wednesday, 28 October 2015

Take-home lessons and code from a factor-cluster analysis with imputation

Recently I was tapped to examine data from a survey of ~200 children to find out whether their learning preferences fell into well-defined profiles (i.e. clusters). The relevant part of the survey had more than 50 Likert scale questions. The client and I decided that a factor analysis followed by a cluster analysis would be most appropriate.

I learned some things in doing this analysis, and wanted to share them along with some relevant code.

1. Dimension reduction pairs well with clustering.

Had we simply run k-means clustering on the raw responses, we would have gotten poor results. There are meaningful correlations between questions, but k-means clustering treats each question as an independent variable. A difference on a set of correlated questions would be counted once per question, overstating the differences between responses as a whole.

Also, factor analysis, a standard dimension reduction method, provides combinations of variables which may make clustering results easier to explain.
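To make the point about correlated questions concrete, here is a minimal sketch with made-up numbers (not the survey data). Two hypothetical children differ by one point on a single underlying preference. If one question measures that preference, they are one unit apart. If five redundant questions measure it, the Euclidean distance that k-means relies on grows, even though no new information has been added.

```r
# Hypothetical example: two respondents who differ by 1 point on one
# underlying trait, measured by 1 question versus 5 redundant questions.
child_a_one  = c(3)
child_b_one  = c(4)
child_a_five = rep(3, 5)
child_b_five = rep(4, 5)

# Euclidean distance between the two respondents in each case
dist_one  = as.numeric(dist(rbind(child_a_one,  child_b_one)))
dist_five = as.numeric(dist(rbind(child_a_five, child_b_five)))

dist_one    # 1
dist_five   # sqrt(5), about 2.24
```

Collapsing the five questions into one factor score before clustering is one way to stop the same difference from being counted five times.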

2. Carry multiple imputations through the entire analysis. Don’t combine results prematurely.

The purpose of doing multiple imputations, rather than a single imputation, is to have the final results reflect the additional uncertainty that comes from filling in unknown values. If you combine the results in the middle of an analysis, such as after dimension reduction but before k-means clustering, you mask that uncertainty. For example, if the imputed results are aggregated before clustering, the centre of each cluster will be interpreted as a single point, perhaps with some variance estimated by the delta method. By doing a k-means cluster analysis for each imputed dataset, we produced m points for each cluster centre, for m=7 imputations.
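As a rough sketch of what "carrying the imputations through" can look like, the code below clusters each imputed dataset separately and stores all m sets of centres. Everything in it is an assumption made for illustration: the data are simulated, and impute_once is a deliberately crude placeholder (random draws from the observed values in each column), not the imputation method used in this analysis.

```r
set.seed(1)

# Simulated stand-in data: 60 cases, 2 variables, 3 well-separated groups
true_centres = rbind(c(0, 0), c(6, 0), c(0, 6))
complete = true_centres[rep(1:3, each = 20), ] + matrix(rnorm(120, sd = 0.5), ncol = 2)

# Knock out 10 values at random to mimic item non-response
with_missing = complete
with_missing[sample(length(with_missing), 10)] = NA

# Placeholder imputation (hypothetical helper): fill each missing value
# with a random draw from the observed values in the same column
impute_once = function(x) {
  for (j in seq_len(ncol(x))) {
    gaps = is.na(x[, j])
    x[gaps, j] = sample(x[!gaps, j], sum(gaps), replace = TRUE)
  }
  x
}

m = 7   # number of imputations
k = 3   # number of clusters
imputed_sets = lapply(1:m, function(i) impute_once(with_missing))

# Cluster every imputed dataset separately; do not average the datasets first
fits = lapply(imputed_sets, function(x) kmeans(x, centers = k, nstart = 10))

# Collect the centres in an m x k x (number of variables) array
centres = array(NA, dim = c(m, k, ncol(with_missing)))
for (i in 1:m) centres[i, , ] = fits[[i]]$centers

dim(centres)
```

The spread of the m centres for a given cluster then gives a direct picture of how much the missing values matter for that cluster.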

Here is the function I made to take one of the m imputed datasets and perform the factor-cluster analysis. It made working with multiple datasets a lot less tedious.

factor_cluster = function(dat, Nfactors=2, Nprofiles=4)
{
    # Do a factor analysis and get the matrix of factor loadings
    loading_values = factanal(dat, factors=Nfactors)$loadings
    # Indexing strips the "loadings" class and leaves a plain matrix
    loading_values = loading_values[1:ncol(dat), 1:Nfactors]

    # factor_values is the matrix of values: each case's responses weighted by the loadings
    factor_values = as.matrix(dat) %*% loading_values

    ###### K-means clustering
    cluster_output = kmeans(factor_values, Nprofiles)

    ### Output the loadings, factor scores, and clustering results
    return(list(loadings = loading_values, factor=factor_values, cluster=cluster_output))
}

3. Results from multiple imputations are found independently, so they sometimes need alignment.

Figure 1 below shows where the k=3 cluster means were computed along two of the factor scores for each of the m=7 imputed datasets.

There is strong agreement between analyses on the location of the means, but not their labels. The numbers represent the number of the cluster given in that analysis. What one cluster analysis refers to as “centre number 2” is essentially the same as what another analysis calls “centre number 3”.
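The label problem is easy to reproduce on toy data. The sketch below uses simulated points (an assumption, not the survey data) and runs k-means twice from the same starting centres listed in a different order. The two runs find the same groups, but number them differently.

```r
set.seed(2)

# Simulated data: three well-separated groups of 50 points each
group_means = rbind(c(0, 0), c(10, 0), c(0, 10))
x = group_means[rep(1:3, each = 50), ] + matrix(rnorm(300), ncol = 2)

# Same starting centres, supplied in a different order
fit_a = kmeans(x, centers = group_means)
fit_b = kmeans(x, centers = group_means[c(2, 3, 1), ])

# Comparing the raw labels suggests the two runs disagree badly
agreement = mean(fit_a$cluster == fit_b$cluster)
agreement

# The cross-tabulation shows the disagreement is only a relabelling:
# each cluster in one run lines up with exactly one cluster in the other
label_table = table(fit_a$cluster, fit_b$cluster)
label_table
```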

This is a big problem because the ~200 cases are assigned these labels, which gives the appearance of much more disagreement in cluster assignment between imputations than there really is. However, we can realign the labels so that nearby centres from different imputed datasets share the same label. The following code does that alignment and relabeling:

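library(plyr)   # provides mapvalues()

# input_centers: Nimpute x Nclust x Nfactor array of cluster centers
# input_cluster: matrix of cluster labels, one column per imputed dataset
align_clusters = function(input_centers, input_cluster)
{
    output_centers = input_centers
    output_cluster = input_cluster

    Nimpute = dim(input_centers)[1]
    Nclust = dim(input_centers)[2]
    #Nfactor = dim(input_centers)[3]
    trans = array(NA, dim=dim(input_centers)[1:2])

    # trans[j,i] is the cluster in imputation j nearest to center i of imputation 1
    for(i in 1:Nclust)
    {
        dist_vec = rep(NA,Nclust)
        for(j in 1:Nimpute)
        {
            for(k in 1:Nclust)
            {
                dist_vec[k] = dist( rbind(input_centers[1,i,],input_centers[j,k,]))
            }
            trans[j,i] = which(dist_vec == min(dist_vec))[1]
        }
    }

    for(i in 1:Nimpute)
    {
        output_centers[i,,] = input_centers[i,trans[i,],]
        # Old label trans[i,c] becomes new label c, matching the reordered centers
        output_cluster[,i] = mapvalues(input_cluster[,i], from=trans[i,], to=1:Nclust)
    }

    return(list(centers = output_centers, cluster = output_cluster))
}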

There is a potential uniqueness issue in the relabeling, but it wasn’t a problem for me, so I opted for fast over perfect. Figure 2 shows the means with the realigned labels.
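For anyone who does run into the uniqueness issue, one possible workaround is to force a one-to-one match by trying every permutation of the labels and keeping the one with the smallest total distance to the reference centres. This is only a sketch under stated assumptions: all_perms and match_centres are hypothetical helper names, the centres are made up, and the exhaustive search is only practical for a handful of clusters because the number of permutations grows as k!.

```r
# All permutations of the vector v, one per row
all_perms = function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out = NULL
  for (i in seq_along(v)) {
    out = rbind(out, cbind(v[i], all_perms(v[-i])))
  }
  out
}

# centres_ref and centres_new are k x p matrices of cluster centres.
# Returns perm such that row perm[i] of centres_new is matched to row i
# of centres_ref, with every row used exactly once.
match_centres = function(centres_ref, centres_new) {
  k = nrow(centres_ref)
  perms = all_perms(1:k)
  total_dist = apply(perms, 1, function(p) {
    sum(sqrt(rowSums((centres_ref - centres_new[p, , drop = FALSE])^2)))
  })
  perms[which.min(total_dist), ]
}

# Made-up centres: 'new' holds the same three centres, shuffled and jittered
ref = rbind(c(0, 0), c(5, 0), c(0, 5))
new = rbind(c(0.2, 4.9), c(0.1, -0.1), c(5.1, 0.3))
perm = match_centres(ref, new)
perm   # 2 3 1

# Relabel cases: old label perm[i] becomes new label i
old_labels = c(1, 1, 2, 3, 2)
new_labels = match(old_labels, perm)
new_labels   # 3 3 1 2 1
```

Because each candidate is a full permutation, two clusters from one imputation can never be mapped onto the same reference cluster.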

This realignment was only necessary for the cluster analysis. Thankfully, R and other software arrange factors by the amount of variance they explain. So, what the factor analysis of one imputed dataset labels ‘Factor 1’ is essentially the same as what all of the analyses will label ‘Factor 1’. If there hadn’t been a clear ordering of factors, we could have repurposed the realignment code to fix that.

Main code file for this analysis
Functions code file for this analysis

This post written with the permission of the client.
