How to perform CompCor on HCP fMRI data in R

This is a guest post written by Damon Pham. Damon is a recent graduate from Indiana University, where he was a Wells Scholar (the highest honor for incoming IU students) and all around extraordinaire. He has been part of my research group for the last several years, where one of his main focuses has been developing software to advance and facilitate research using CIFTI-format data (more on that in a future post!). He has also been working on methods for outlier detection in fMRI data. In that work we decided to use aCompCor as a preprocessing step before outlier detection. Since we are using CIFTI-format HCP fMRI data, figuring out how to actually do aCompCor was an undertaking. Below, Damon describes why and how he did this. We hope this is useful for other researchers wanting to use CompCor on CIFTI-format data in the HCP and beyond.

Word to the wise: Damon is currently applying to graduate schools, so keep an eye out for him! He will no doubt go on to do amazing work.

Take it away, Damon!

What is aCompCor?

There are many sources of noise in fMRI, and perhaps many more ways to clean it up. One such technique is anatomical CompCor (aCompCor), first presented by Behzadi et. al. (2007). Numerous studies including Muschelli et. al. (2014) and Ciric et. al. (2017) have demonstrated its effectiveness for attenuating the effects of motion and improving functional connectivity estimates, among other benchmarks, compared to alternatives. aCompCor is also quite simple in theory. It's based on the observation that the BOLD signal in white matter (WM) and cerebrospinal fluid (CSF) is not neuronal in origin. Thus, variation common to gray matter and WM or CSF should be artifactual (e.g. due to motion, heartbeat, respiration, scanner drift). aCompCor regresses the top few principal components (PCs) within WM and CSF from the grey matter, thereby removing shared variation thought to represent noise.

We've observed that aCompCor is really good at attenuating trends and sudden shifts in the data. Both patterns are not of neuronal origin, and later analyses such as functional connectivity assume that they have been eliminated. Let's see an example of aCompCor in action! Here's the timecourse of an arbitrary grey matter voxel before and after aCompCor-5 (five WM PCs and five CSF PCs):

The first cortical WM PC (red) mimics the low-frequency trend (blue) while also following sudden changes. After all ten WM or CSF PCs are regressed from the data (bottom plot), minimal low-frequency variation remains even though no explicit high-pass filtering was performed. (In practice, though, we've found that using aCompCor and discrete cosine transform (DCT) bases in a combined regression most effectively detrends fMRI data.) Other popular detrending methods, such as the DCT by itself or smoothing spline regression, are susceptible to missing or exacerbating any high-frequency artifacts—see Figure 2 in Cheveigné and Arzounian (2018).

How to perform aCompCor on HCP data

In a recent analysis we wanted to use aCompCor on CIFTI-format data from the Human Connectome Project (HCP). Although “simple in theory,” performing aCompCor was unexpectedly challenging to implement in this context. Multiple files were required: the WM and CSF regions of interest (ROIs) are identified by numeric codes in a label file, and their timecourses are identified using a volumetric fMRI file, both of which are separate from the CIFTI file itself. Piecing everything together was time consuming and required help from a variety of sources. Once we got it to work, we wanted to save future researchers the hassle! We incorporated our scripts into clever, an R package that implements several data cleaning methods for fMRI. In this tutorial, I'll explain the process for performing aCompCor on CIFTI data and demonstrate how to use clever to do so conveniently. Although I focus on the HCP file structure, these tips should be applicable to similar datasets too.

1. Locate the files to use

The HCP resting-state fMRI data exist in two basic forms: a CIFTI file, which contains surface-based cortical data and volumetric subcortical data, and a NIFTI file, which contains volumetric whole-brain data. There are numerous advantages to analyzing cortical data on the surface, as explained very clearly by Fischl, Dale and Sereno (1999a and 1999b) and Glasser et. al. (2013) in the context of the HCP. Therefore we will use the CIFTI to get the grey matter data, and the NIFTI to get the WM and CSF PCs for nuisance regression. We will also need the NIFTI labels file, which indicates the brainstructure of each voxel. Thus, for each scan, the following three files are needed:

  • The CIFTI fMRI data, e.g. ../Results/rfMRI_REST1_LR/rfMRI_REST1_LR_Atlas_MSMAll.dtseries.nii (91282 greyordinates x 1200 frames)
  • The NIFTI fMRI data, e.g. ../Results/rfMRI_REST1_LR/rfMRI_REST1_LR.nii.gz (109 x 91 x 109 voxels; x 1200 frames)
  • The NIFTI labels, ../ROIs/Atlas_wmparc.2.nii.gz (109 x 91 x 109 voxels)

Note that different visits and acquisitions for the same subject will have different fMRI files (i.e. “REST2” instead of “REST1”, or “RL” instead of “LR”), but the NIFTI labels file remains the same. For this tutorial, I arbitrarily went with “REST1” and “LR”. Also note that alternative files exist. Firstly, there are CIFTI files without “_MSMAll” in the file name. These have cortical data aligned by MSMSulc instead of MSMAll; either is fine to use. (See this blog post by Dr. Emma Robinson for more information on MSM-based registration techniques.) Secondly, there are CIFTI and NIFTI files with the “_hp2000_clean” suffix in their names: these have been cleaned with ICA-FIX and a high-pass filter in addition to minimal pre-processing. In theory one could perform aCompCor on top of ICA-FIX. However, we chose to use the minimally pre-processed data instead aCompCor and ICA-FIX are a bit redundant.

What's important to ensure is that the different files have undergone the same spatiotemporal transformations. This is why we forego the native-space cortical data (e.g. ../Results/rfMRI_REST1_LR/rfMRI_REST1_LR.L.native.func.gii) and labels file (../ROIs/wmparc.2.nii.gz), instead using their MNI-space counterparts (“Atlas” in the file name). We also do not use higher-resolution (../aparc+aseg.nii.gz) or incomplete (ROIs.2.nii.gz) labels files.

In the following code, we read in the three necessary files:

library(ciftiTools) #read_cifti
# Point to the Connectome Workbench on your computer.
ciftiTools.setOption("wb_path", "../workbench")
library(oro.nifti) #readNIfTI


# Arbitrarily choose subject 112112.
# Point to the subject's folder on your computer.
subj_dir <- "/N/dcwan/projects/hcp/112112/MNINonLinear"

# CIFTI for the cortical data
cii_fname <- file.path(subj_dir, "Results/rfMRI_REST1_LR", "rfMRI_REST1_LR_Atlas_MSMAll.dtseries.nii")
cii <- read_cifti(cii_fname)
cii <- do.call(rbind, cii$data) #Get the data matrix; we don't need any metadata
# NIFTI for the noise ROI data
nii_fname <- file.path(subj_dir, "Results/rfMRI_REST1_LR", "rfMRI_REST1_LR.nii.gz")
nii <- readNIfTI(nii_fname)
# NIFTI labels so we can identify the ROIs
niiLabs_fname <- file.path(subj_dir, "ROIs/Atlas_wmparc.2.nii.gz")
niiLabs <- readNIfTI(niiLabs_fname)

2. Identify the ROIs

The NIFTI labels file has a numerical code at each voxel indicating the brainstructure at that location. A table listing the brainstructure label for each numerical code can be found here. We referenced the label names and these forum posts by Dr. Eleftherios Garyfallidis and Dr. Matthew Glasser to come up with the set of codes belonging to each aCompCor ROI:

  • Cortical WM: 3000-4035, 5001, and 5002
  • Cerebellar WM: 7 and 46
  • CSF: 4, 5, 14, 15, 24, 31, 43, 44, 63, 250, 251, 252, 253, 254, 255

(Although I included the cerebellar WM codes for completeness, I'll continue to focus on cortical WM and CSF for the remainder of the tutorial.)

Below is the code for obtaining the ROI masks for cortical WM and CSF. (I'll show what they look like in the following section.)

ROI_labels <- list(
  WM_cort = c(seq(3000, 4035), 5001, 5002),
  CSF = c(4, 5, 14, 15, 24, 31, 43, 44, 63, seq(250, 255))
)

niiLabs <- array(niiLabs, dim=dim(niiLabs)) # for plotting
ROIs <- rep(list(niiLabs), 2); names(ROIs) <- names(ROI_labels)
ROIs$WM_cort[] <- ROIs$WM_cort %in% ROI_labels$WM_cort
ROIs$CSF[] <- ROIs$CSF %in% ROI_labels$CSF

3. Erode the ROIs

Erosion is the removal of boundary voxels from a ROI. Due to spatial smoothing, motion and partial volume effects, some BOLD signal from grey matter may leak into voxels labeled as WM or CSF. We remove boundary voxels to be more confident that the region contains only noise information. Often, one to two voxel layers are removed. (Eroding by one layer means that every boundary voxel in the ROI is removed. Eroding by two layers means that the ROI is eroded by one layer, and then eroded by one layer again. We define boundary voxels as those which share a face with a voxel outside of the ROI. “Diagonal” neighbors that share a vertex but not a face are not considered adjacent.)

Here's the code for performing erosion. Note that we choose to erode 2 layers from WM and 1 layer from CSF, since the CSF has fewer voxels.

# Helper Functions
pad_vol <- function(x, padding, fill=NA){
  new_dim <- vector("numeric", 3)
  for (ii in seq_len(3)) { new_dim[ii] <- dim(x)[ii] + padding[ii,1] + padding[ii,2] }
  y <- array(fill, dim=new_dim)
  y[
    seq(padding[1,1]+1, padding[1,1]+dim(x)[1]),
    seq(padding[2,1]+1, padding[2,1]+dim(x)[2]),
    seq(padding[3,1]+1, padding[3,1]+dim(x)[3])
  ] <- x
  y
}
neighbor_counts <- function(arr, pad=FALSE){
  arrPad <- pad_vol(arr, matrix(1, 3, 2), fill=pad)
  dPad <- dim(arrPad)
  # Look up, down, left, right, forward, behind (not diagonally)
  arrPad[1:(dPad[1]-2),2:(dPad[2]-1),2:(dPad[3]-1)] +
    arrPad[3:(dPad[1]),2:(dPad[2]-1),2:(dPad[3]-1)] +
    arrPad[2:(dPad[1]-1),1:(dPad[2]-2),2:(dPad[3]-1)] +
    arrPad[2:(dPad[1]-1),3:(dPad[2]),2:(dPad[3]-1)] +
    arrPad[2:(dPad[1]-1),2:(dPad[2]-1),1:(dPad[3]-2)] +
    arrPad[2:(dPad[1]-1),2:(dPad[2]-1),3:(dPad[3])]
}

# Function to erode
erode_vol <- function(vol, n_erosion=1, out_of_mask_val=NA){
  stopifnot(is.numeric(n_erosion))
  if (n_erosion==0) {return(vol)}
  mask <- !array(vol %in% out_of_mask_val, dim=dim(vol))
  for(ii in 1:n_erosion){
    to_erode <- mask & neighbor_counts(mask, pad=TRUE) < 6
    mask[to_erode] <- FALSE
    vol[to_erode] <- out_of_mask_val[1]
  }
  vol
}

# Erode WM by 2 layers and CSF by 1
ROIs$WM_cort <- erode_vol(ROIs$WM_cort, n_erosion=2, out_of_mask_val=c(.5, 0, NA))
ROIs$CSF <- erode_vol(ROIs$CSF, n_erosion=1, out_of_mask_val=c(.5, 0, NA))

Below are orthogonal views of the cortical WM (orange) and CSF (green). The colored voxels are eroded from the ROI; the white voxels are ultimately used for aCompCor. Before erosion, there are 66,380 cortical WM voxels and 3423 CSF voxels. After erosion, there are 16,847 cortical WM voxels and 936 CSF voxels.

# Code to view the ROI interactively.
# Screenshots were taken to produce the static images in the blog post.
# (For the CSF, orange was changed to green.)
mni <- oro.nifti::as.nifti(as.array(
  oro.nifti::readNIfTI(system.file("extdata/MNI152_T1_2mm.nii.gz", package="ciftiTools"))
)*1)
to_view <- oro.nifti::as.nifti(ROIs$WM_cort) # or $CSF
papayar::papaya(list(mni, to_view))

Now that we have the eroded ROIs, we can use them to mask the NIFTI data:

# Get the aCompCor ROI data.
# Recall eroded voxels = 0.5, and in-ROI voxels = 1.
nii <- list(
  WM_cort = matrix(nii[ROIs$WM_cort > .8], ncol=1200),
  CSF = matrix(nii[ROIs$CSF > .8], ncol=1200)
)

4. Drop the first few frames

In a few scans from the HCP, we noticed that some of the CSF PC scores had a large spike at the start:

This artifact is likely due to scanner warm-up. Similar trends do not seem to occur in the CIFTI data, so we removed the first 15 frames (blue line in the figure) prior to the aCompCor regression, so that there is greater correspondence between the grey matter and the aCompCor ROIs. At a repetition time (TR) of 0.72 seconds, 15 frames is about 11 seconds. The spike resolves itself within that timeframe in the scans we observed. Dropping a few frames at the beginning of a run is a common pre-processing technique and, to our knowledge, is not part of the HCP minimal pre-processing pipelines.

cii <- cii[,seq(16, 1200)]
nii <- lapply(nii, function(x){x[,seq(16, 1200)]})

5. Estimate the PCs and perform nuisance regression

Finally, the top few PC scores from each eroded ROI are collected. The full set of noise regressors, i.e. the “design matrix,” will also contain an intercept term and optionally DCT bases or other additional detrending terms. This matrix is used in a nuisance regression to yield the data cleaned by aCompCor:

Removing these different sources of noise within a single nuisance regression not only is efficient, but also ensures that each regressor in the design matrix is properly removed from the data. If successive nuisance regressions were performed, nuisance signals from an earlier regression can be re-introduced by later regressions. For example, if the data were high-pass filtered by a nuisance regression with DCT bases and then cleaned with aCompCor, aCompCor can re-introduce the low-frequency components removed by the DCT (Lindquist et. al., 2019).

library(matrixStats) # rowMedians
library(clever)

# Center and scale the data & noise ROIs.
center_and_scale <- function(x){
  x <- t(x)
  x <- x - c(rowMedians(x, na.rm=TRUE))
  mad <- 1.4826 * rowMedians(abs(x), na.rm=TRUE)
  mad_inv <- ifelse(mad < 1e-8, 0, 1/mad)
  x <- x * mad_inv
  t(x)
}
cii <- center_and_scale(t(cii))
nii <- lapply(nii, t)
nii <- lapply(nii, center_and_scale)
# Perform PCA on the noise ROIs. Get the top 3 PC scores.
design <- lapply(nii, function(x){svd(tcrossprod(x))$u[,seq(3)]})
# Concatenate PCs into one matrix.
design <- do.call(cbind, design)
# Add DCT bases.
design <- cbind(design, clever:::dct_bases(1200-15, 4))
# Center design matrix.
design <- scale(design)

# Code to perform nuisance regression.
nuisance_regression <- function(X, design){
  # https://stackoverflow.com/questions/19100600/extract-maximal-set-of-independent-columns-from-a-matrix
  # https://stackoverflow.com/questions/39167204/in-r-how-does-one-extract-the-hat-projection-influence-matrix-or-values-from-an
  qrd <- qr(design)
  design <- design[, qrd$pivot[seq_len(qrd$rank)]]
  qrd <- qr(design)
  Qd <- qr.Q(qrd)
  I_m_H <- diag(nrow(design)) - (Qd %*% t(Qd))
  if (nrow(X)==nrow(design)) {
    return(I_m_H %*% X)
  } else if (ncol(X)==nrow(design)) {
    return(X %*% I_m_H)
  } else {
    stop("X and design are not of compatible dimensions.")
  }
}

# Get result.
result <- nuisance_regression(cii, design)

Performing aCompCor with clever

Now let's see how we can use the clever function CompCor_HCP to conveniently perform aCompCor! By default, five PCs each from the cortical WM and CSF are regressed from the cortical grey matter data. The default implements a minimal pre-processing approach that centers and scales the data but does not perform other steps such as erosion, removal of the first few frames, or detrending with the DCT.

# Minimal options:
cc <- CompCor_HCP(
  nii = nii_fname, 
  nii_labels = niiLabs_fname, 
  cii = cii_fname
)

Here is an example that demonstrates the additional pre-processing steps, and a different number of PCs:

  • aCompCor-3: 3 PCs from cortical WM and CSF (default: 5 each)
  • Erosion of 2 voxels for cortical WM and 1 voxel from CSF (default: no erosion)
  • Removal of the first 15 frames (default: no frames dropped)
  • Inclusion of 4 DCT bases in the nuisance regression (default: no additional nuisance regressors)
# Additional options:
cc <- CompCor_HCP(
  nii = nii_fname, 
  nii_labels = niiLabs_fname,
  cii = cii_fname,
  noise_nPC = 3, 
  noise_erosion = c(2,1),
  frames = seq(16, 1200),
  DCT = 4
)

The PCs from this example are shown below:

Here is a last example where we include the third ROI, cerebellar WM, and specify the number of PCs explicitly for all three ROIs.

# Even more options:
cc <- CompCor_HCP(
  nii = nii_fname, 
  nii_labels = niiLabs_fname,
  cii = cii_fname,
  ROI_noise = c("wm_cort", "csf", "wm_cblm"),
  noise_nPC = list(wm_cort=5, csf=5, wm_cblm=3)
)

Further options in CompCor_HCP include:

  • Custom ROIs can be specified by listing the NIFTI brainstructures to include via their numerical codes. For example, ROI_noise = list(left_wm_cort = c(seq(3000:3035), 5001)) will use the WM from only the left cortex.
  • Instead of specifying the number of PCs, the user can set the proportion of variance explained. If the argument noise_nPCs is a number between 0 and 1, it will be interpreted as percent variance instead of number of PCs.
  • By default only the cortical data is extracted from the CIFTI. The user can set brainstructures="all" to include the subcortical data, brainstructures="subcortical" to extract only the subcortical data, or brainstructures="left" or brainstructures="right"to extract only the left or right cortical data, respectively.
  • To only compute the aCompCor nuisance signals and not perform nuisance regression, simply omit the cii argument.
  • To include additional custom nuisance signals in the nuisance regression step, such as motion timeseries or spike regressors, pass these to the nuisance_X argument.

We hope clever helps other researchers use aCompCor on HCP data! And of course, if we've missed anything, please let us know in the comments below.

Appendix: Correspondence between the CIFTI and NIFTI subcortical data in the HCP

Something that confused us is that the CIFTI and NIFTI data are not identical, even though both files are aligned to MNI space. Some in-mask voxels are zero-valued in the NIFTI but not the CIFTI. Moreover, voxels that are not zero-valued have slightly different values. Here are the timeseries of an arbitrary voxel (the 100th left cerebellar voxel):

We suspect that the CIFTI data are smoothed relative to the NIFTI data. Indeed, if we compare individual voxel values in the first ten frames within a given brainstructure (e.g. left cerebellum), the CIFTI values have a smaller range:

The first PCs for all other brainstructures also have a slope less than one, indicating greater variance in the NIFTI data.

Lastly, we can visually confirm that the volumetric slices for a given frame look similar but not identical. For example, this is slice 50 in the CIFTI and the NIFTI:

Overall, this doesn't appear to be a problem for aCompCor, since the principal components remain quite similar:

Let us know in the comments below if you have any insight into this mystery!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: