`neurocaps.extraction`.TimeseriesExtractor

class TimeseriesExtractor(space='MNI152NLin2009cAsym', parcel_approach={'Schaefer': {'n_rois': 400, 'resolution_mm': 1, 'yeo_networks': 7}}, standardize='zscore_sample', detrend=True, low_pass=None, high_pass=None, fwhm=None, use_confounds=True, confound_names=None, fd_threshold=None, n_acompcor_separate=None, dummy_scans=None, dtype=None)

Timeseries Extractor Class.

Initializes the Timeseries Extractor class.

Parameters:

space (str, default="MNI152NLin2009cAsym") -- The standard template space that the preprocessed BOLD data is registered to. Used for querying with pybids to locate preprocessed BOLD-related files.
parcel_approach (dict[str, dict[str, str | int]] or os.PathLike, default={"Schaefer": {"n_rois": 400, "yeo_networks": 7, "resolution_mm": 1}}) --
The approach used to parcellate NIfTI images. This must be a nested dictionary with the first key being the parcellation name. Currently, only "Schaefer", "AAL", and "Custom" are supported. Recognized second-level keys (sub-keys) are listed below:
- For "Schaefer":
  - "n_rois": The number of ROIs (100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000). Defaults to 400.
  - "yeo_networks": The number of Yeo networks (7 or 17). Defaults to 7.
  - "resolution_mm": The spatial resolution of the parcellation in millimeters (1 or 2). Defaults to 1.
- For "AAL":
  - "version": The version of the AAL atlas used ("SPM5", "SPM8", "SPM12", or "3v2"). Defaults to "SPM12" if {"AAL": {}} is supplied.
- For "Custom":
  - "maps": Directory path to the location of the parcellation file.
  - "nodes": A list of node names in the order of the label IDs in the parcellation.
  - "regions": The regions or networks in the parcellation.
Refer to documentation from nilearn's datasets.fetch_atlas_schaefer_2018 and datasets.fetch_atlas_aal functions for more information about the "Schaefer" and "AAL" sub-keys. Also, refer to the "Note" section below for an explanation of the "Custom" sub-keys.
standardize ({"zscore_sample", "zscore", "psc", True, False}, default="zscore_sample") -- Standardizes the timeseries. Refer to nilearn.maskers.NiftiLabelsMasker for an explanation of each available option.
detrend (bool, default=True) -- Detrends the timeseries during extraction.
low_pass (float, int, or None, default=None) -- Filters out signals above the specified cutoff frequency.
high_pass (float, int, or None, default=None) -- Filters out signals below the specified cutoff frequency.
fwhm (float, int, or None, default=None) -- Applies spatial smoothing to data (in millimeters). Note that using parcellations already averages voxels within parcel boundaries, which can improve signal-to-noise ratio (SNR) assuming Gaussian noise distribution. However, smoothing may also blur parcel boundaries.
use_confounds (bool, default=True) -- Perform nuisance regression using the default or user-specified confounds in confound_names when extracting timeseries. Note, the confound tsv files must be located in the same directory as the preprocessed BOLD images.
confound_names (list[str] or None, default=None) -- The names of the confounds to extract from the confound tsv files. If None, default confounds are used, which consist of all cosine-basis parameters, the six head-motion parameters and their first-order derivatives, and the first six combined acompcor components. The names of these confounds follow the naming scheme of confounds in fMRIPrep versions >= 1.2.0. Note, an asterisk ("*") can be used to find confound names that start with the term preceding the asterisk. For instance, "cosine*" will find all confound names in the confound files starting with "cosine".
fd_threshold (float, dict[str, float], or None, default=None) --
Sets a threshold for removing volumes with excessive framewise displacement. This requires a column named framewise_displacement in the confounds file and use_confounds set to True. Additionally, framewise_displacement should be specified in confound_names if using this parameter. By default, censoring is done after nuisance regression; however, this behavior can be modified with the "use_sample_mask" key to censor prior to nuisance regression. If fd_threshold is a dictionary, the following keys can be specified:
- "threshold": A float value. Volumes with a framewise_displacement value exceeding this threshold are removed.
- "outlier_percentage": A float value between 0 and 1 representing a percentage. Runs where the proportion of volumes exceeding the "threshold" is higher than this percentage are removed. If condition is specified in self.get_bold, only the runs where the proportion of volumes exceeds this value for the specific condition of interest are removed. Note, this proportion is calculated after dummy scans have been removed. A warning is issued whenever a run is flagged.
- "n_before": An integer value indicating the number of volumes to scrub before the flagged volume. Hence, if frame 5 is flagged and "n_before" is 2, then volumes 3, 4, and 5 are scrubbed.
- "n_after": An integer value indicating the number of volumes to scrub after the flagged volume. Hence, if frame 5 is flagged and "n_after" is 2, then volumes 5, 6, and 7 are scrubbed.
- "use_sample_mask": A boolean value. If True, a sample mask is generated and passed to the sample_mask parameter in nilearn's NiftiLabelsMasker to censor prior to nuisance regression. Internally, clean__extrapolate is set to False and passed to NiftiLabelsMasker, which prevents censored volumes at the end from being interpolated prior to applying the Butterworth filter. See the documentation for nilearn.signal.clean and nilearn.maskers.NiftiLabelsMasker for how nilearn handles censored volumes when sample_mask is used. If this key is set to False, data is only censored after nuisance regression, which is the default behavior.
Added in version 0.18.8: "use_sample_mask"
n_acompcor_separate (int or None, default=None) -- Specifies the number of separate acompcor components derived from white-matter (WM) and cerebrospinal fluid (CSF) masks to use. For example, if set to 5, the first five components from the WM mask and the first five from the CSF mask will be used, totaling ten acompcor components. If this parameter is not None, any acompcor components listed in confound_names will be disregarded. To use acompcor components derived from combined masks (WM & CSF), leave this parameter as None and list the specific acompcors of interest in confound_names.
dummy_scans (int, dict[str, bool | int], or None, default=None) --
Removes the first n volumes before extracting the timeseries. If dummy_scans is a dictionary, the following keys can be used:
- "auto": A boolean value. If True, the number of dummy scans removed depends on the number of "non_steady_state_outlier_XX" columns in the participant's fMRIPrep confounds tsv file. For instance, if two "non_steady_state_outlier_XX" columns are detected, then dummy_scans is set to two, since fMRIPrep writes one "non_steady_state_outlier_XX" column per outlier volume. This is assessed for each run of all participants, so dummy_scans depends on the number of "non_steady_state_outlier_XX" columns in the confound file associated with the specific participant, task, and run number.
- "min": An integer value indicating the minimum dummy scans to discard. The "auto" sub-key must be True for this to work. If, for instance, only two "non_steady_state_outlier_XX" columns are detected but the "min" is set to three, then three dummy volumes will be discarded.
- "max": An integer value indicating the maximum dummy scans to discard. The "auto" sub-key must be True for this to work. If, for instance, six "non_steady_state_outlier_XX" columns are detected but the "max" is set to five, then five dummy volumes will be discarded.
dtype (str or "auto", default=None) -- The numpy dtype the NIfTI images are converted to when passed to nilearn's load_img function.
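
Both the "*" wildcard in confound_names and the "auto" option of dummy_scans are resolved against the column names of a run's confounds tsv file. The following is a minimal sketch of that logic as the parameter descriptions state it, not the library's actual implementation. The helper names expand_confound_names and resolve_dummy_scans, the example column list, and the behavior when "auto" is not True are assumptions made for illustration.

```python
def expand_confound_names(requested, available_columns):
    """Expand names ending in "*" to every available column with that prefix."""
    selected = []
    for name in requested:
        if name.endswith("*"):
            prefix = name[:-1]
            matches = [col for col in available_columns if col.startswith(prefix)]
        else:
            # Exact names are kept only if the column exists in the file.
            matches = [name] if name in available_columns else []
        for col in matches:
            if col not in selected:
                selected.append(col)
    return selected


def resolve_dummy_scans(dummy_scans, available_columns):
    """Return the number of initial volumes to drop for a single run."""
    if dummy_scans is None:
        return 0
    if isinstance(dummy_scans, int):
        return dummy_scans
    if not dummy_scans.get("auto"):
        # Assumption: "min" and "max" have no effect unless "auto" is True.
        return 0
    n = sum(col.startswith("non_steady_state_outlier") for col in available_columns)
    if "min" in dummy_scans:
        n = max(n, dummy_scans["min"])
    if "max" in dummy_scans:
        n = min(n, dummy_scans["max"])
    return n


# Hypothetical confound columns for one run.
columns = [
    "cosine00", "cosine01", "trans_x", "trans_x_derivative1",
    "non_steady_state_outlier00", "non_steady_state_outlier01",
    "framewise_displacement",
]

confounds = expand_confound_names(["cosine*", "trans_x", "rot_x"], columns)
n_auto = resolve_dummy_scans({"auto": True}, columns)
n_min = resolve_dummy_scans({"auto": True, "min": 3}, columns)
```

In this sketch "rot_x" is dropped because the column is absent, and "trans_x" matches only the exact column since it carries no asterisk.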

Properties

space: str

The standard template space that the preprocessed BOLD data is registered to. The space can also be set after class initialization using self.space = "New Space" if the template space needs to be changed.

parcel_approach: dict[str, dict[str, os.PathLike | list[str]]]

A dictionary containing information about the parcellation. Can also be used as a setter, which accepts a dictionary or a dictionary saved as a pickle file. If "Schaefer" or "AAL" was specified during initialization of the TimeseriesExtractor class, then nilearn.datasets.fetch_atlas_schaefer_2018 or nilearn.datasets.fetch_atlas_aal, respectively, is used to obtain the "maps" and the "nodes". String splitting is then applied to the "nodes" to obtain the "regions":

# Structure of Schaefer
{
    "Schaefer":
    {
        "maps": "path/to/parcellation.nii.gz",
        "nodes": ["LH_Vis1", "LH_SomMot1", "RH_Vis1", "RH_SomMot1"],
        "regions": ["Vis", "SomMot"]
    }
}

# Structure of AAL
{
    "AAL":
    {
        "maps": "path/to/parcellation.nii.gz",
        "nodes": ["Precentral_L", "Precentral_R", "Frontal_Sup_L", "Frontal_Sup_R"],
        "regions": ["Precentral", "Frontal"]
    }
}

Refer to the example for "Custom" in the Note section below for the expected structure.
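
The string splitting mentioned above can be pictured with the sketch below. It is only an illustration under assumed naming patterns (hemisphere prefix plus trailing index for Schaefer-style names, text before the first underscore for AAL-style names). The helper names and the exact rules are hypothetical and are not taken from the library's source.

```python
import re


def schaefer_region(node):
    """'LH_Vis1' -> 'Vis': drop the hemisphere prefix and the trailing index."""
    without_hemisphere = node.split("_", 1)[1]
    return re.sub(r"\d+$", "", without_hemisphere)


def aal_region(node):
    """'Frontal_Sup_L' -> 'Frontal': keep the text before the first underscore."""
    return node.split("_")[0]


def unique_in_order(items):
    """Remove duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(items))


# Hypothetical node names that follow the patterns shown in the structures above.
schaefer_nodes = ["LH_Vis1", "LH_Default1", "RH_Vis1", "RH_Default1"]
aal_nodes = ["Precentral_L", "Precentral_R", "Frontal_Sup_L", "Frontal_Sup_R"]

schaefer_regions = unique_in_order(schaefer_region(n) for n in schaefer_nodes)
aal_regions = unique_in_order(aal_region(n) for n in aal_nodes)
```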

signal_clean_info: dict[str]

Dictionary containing parameters for signal cleaning specified during initialization of the TimeseriesExtractor class. This information includes standardize, detrend, low_pass, high_pass, fwhm, dummy_scans, use_confounds, n_acompcor_separate, and fd_threshold.

task_info: dict[str]

If self.get_bold() has been run, a dictionary containing all task-related information such as task, condition, session, runs, and tr (if specified); otherwise None.

subject_ids: list[str]

A list containing all subject IDs that have been retrieved from pybids and subjected to timeseries extraction.

n_cores: int

Number of cores used for multiprocessing with joblib.

subject_timeseries: dict[str, dict[str, np.ndarray]]

A dictionary mapping subject IDs to their run IDs and their associated timeseries (TRs x ROIs) as a numpy array. Can also be a path to a pickle file containing this same structure. If this property needs to be deleted due to memory issues, delattr(self, "_subject_timeseries") (version < 0.18.10) or del self.subject_timeseries (version >= 0.18.10) can be used to delete this property and only have it return None. The structure is as follows:

subject_timeseries = {
        "101": {
            "run-0": np.array([...]), # Shape: TRs x ROIs
            "run-1": np.array([...]), # Shape: TRs x ROIs
            "run-2": np.array([...]), # Shape: TRs x ROIs
        },
        "102": {
            "run-0": np.array([...]), # Shape: TRs x ROIs
            "run-1": np.array([...]), # Shape: TRs x ROIs
        }
    }

Note

Passed Parameters: standardize, detrend, low_pass, high_pass, fwhm, and nuisance regression (confound_names) use nilearn.maskers.NiftiLabelsMasker. The dtype parameter is used by nilearn.image.load_img. For framewise displacement, if the "use_sample_mask" key is set to True in the fd_threshold dictionary, then a boolean sample mask is generated (setting indices corresponding to high-motion volumes to False) and is passed to the sample_mask parameter in nilearn.maskers.NiftiLabelsMasker.

Custom Parcellations: If using a "Custom" parcellation approach, ensure that the parcellation is lateralized (where each region/network has nodes in the left and right hemisphere). This is due to certain visualization functions assuming that each region consists of left and right hemisphere nodes. Additionally, certain visualization functions in this class also assume that the background label is 0. Therefore, do not add a background label in the "nodes" or "regions" keys.

The recognized sub-keys for the "Custom" parcellation approach include:

"maps": Directory path containing the parcellation file in a supported format (e.g., .nii or .nii.gz for NIfTI).

"nodes": A list of all node labels. The node labels should be arranged in ascending order based on their numerical IDs from the parcellation files. The node with the lowest numerical label in the parcellation file should occupy the 0th index in the list, regardless of its actual numerical value. For instance, if the numerical IDs are sequential, and the lowest, non-background numerical ID in the parcellation is "1" which corresponds to "left hemisphere visual cortex area" ("LH_Vis1"), then "LH_Vis1" should occupy the 0th element in this list. Even if the numerical IDs are non-sequential and the earliest non-background, numerical ID is "2000" (assuming "0" is the background), then the node label corresponding to "2000" should occupy the 0th element of this list.

# Example of numerical label IDs and their organization in the "nodes" key
"nodes": [
    "LH_Vis1",          # Corresponds to parcellation label 2000; lowest non-background numerical ID
    "LH_Vis2",          # Corresponds to parcellation label 2100; second lowest non-background numerical ID
    "LH_Hippocampus",   # Corresponds to parcellation label 2150; third lowest non-background numerical ID
    "RH_Vis1",          # Corresponds to parcellation label 2200; fourth lowest non-background numerical ID
    "RH_Vis2",          # Corresponds to parcellation label 2220; fifth lowest non-background numerical ID
    "RH_Hippocampus"    # Corresponds to parcellation label 2300; sixth lowest non-background numerical ID
]

"regions": A dictionary defining major brain regions or networks. Each region should list node indices under "lh" (left hemisphere) and "rh" (right hemisphere) to specify the respective nodes. Both the "lh" and "rh" sub-keys should contain the indices of the nodes belonging to each region/hemisphere pair, as determined by the order/index in the "nodes" list. The keys naming the major brain regions or networks have no naming requirements; each one simply groups the nodes listed under it.
```
# Example of the "regions" sub-keys
"regions": {
    "Visual": {
        "lh": [0, 1], # Corresponds to "LH_Vis1" and "LH_Vis2"
        "rh": [3, 4]  # Corresponds to "RH_Vis1" and "RH_Vis2"
    },
    "Hippocampus": {
        "lh": [2], # Corresponds to "LH_Hippocampus"
        "rh": [5]  # Corresponds to "RH_Hippocampus"
    }
}
```

The provided example demonstrates setting up a custom parcellation containing nodes for the visual network (Vis) and hippocampus regions in full:

parcel_approach = {
    "Custom": {
        "maps": "/location/to/parcellation.nii.gz",
        "nodes": [
            "LH_Vis1",
            "LH_Vis2",
            "LH_Hippocampus",
            "RH_Vis1",
            "RH_Vis2",
            "RH_Hippocampus"
        ],
        "regions": {
            "Visual": {
                "lh": [0, 1],
                "rh": [3, 4]
            },
            "Hippocampus": {
                "lh": [2],
                "rh": [5]
            }
        }
    }
}
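
As a rough aid for assembling such a dictionary, the sketch below orders node names by ascending label ID and checks that every "regions" entry has "lh" and "rh" lists pointing at valid "nodes" indices. The helper names build_nodes and check_regions and the label-to-name mapping are hypothetical. This is not a validation routine provided by the library.

```python
def build_nodes(label_to_name):
    """Order node names by ascending parcellation label, skipping background 0."""
    return [name for label, name in sorted(label_to_name.items()) if label != 0]


def check_regions(nodes, regions):
    """Return a list of problems found in a "regions" dictionary."""
    problems = []
    for region, hemispheres in regions.items():
        for hemi in ("lh", "rh"):
            if hemi not in hemispheres:
                problems.append(f"{region}: missing '{hemi}'")
                continue
            for index in hemispheres[hemi]:
                if not 0 <= index < len(nodes):
                    problems.append(f"{region}/{hemi}: index {index} out of range")
    return problems


# Hypothetical mapping from the numerical IDs in the parcellation file to names.
label_to_name = {
    0: "Background",
    2000: "LH_Vis1",
    2100: "LH_Vis2",
    2150: "LH_Hippocampus",
    2200: "RH_Vis1",
    2220: "RH_Vis2",
    2300: "RH_Hippocampus",
}

nodes = build_nodes(label_to_name)
regions = {
    "Visual": {"lh": [0, 1], "rh": [3, 4]},
    "Hippocampus": {"lh": [2], "rh": [5]},
}
problems = check_regions(nodes, regions)

parcel_approach = {
    "Custom": {
        "maps": "/location/to/parcellation.nii.gz",
        "nodes": nodes,
        "regions": regions,
    }
}
```

Sorting by label ID keeps the 0th element tied to the lowest non-background label even when the IDs are non-sequential.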

Note: Different sub-keys are required depending on the function used. Refer to the Note section under each function for information regarding the sub-keys required for that specific function.

Methods

`get_bold`(bids_dir, task[, session, runs, ...])	Retrieve Preprocessed BOLD Data from BIDS Datasets.
`timeseries_to_pickle`(output_dir[, filename])	Save the Extracted Subject Timeseries.
`visualize_bold`(subj_id, run[, roi_indx, ...])	Plot the Extracted Subject Timeseries.

get_bold(bids_dir, task, session=None, runs=None, condition=None, tr=None, run_subjects=None, exclude_subjects=None, exclude_niftis=None, pipeline_name=None, n_cores=None, parallel_log_config=None, verbose=True, flush=False)

Retrieve Preprocessed BOLD Data from BIDS Datasets.

This function uses pybids for querying and requires the BOLD data directory (specified in bids_dir) to be BIDS-compliant, including a "dataset_description.json" file. It assumes the dataset contains a derivatives folder with BOLD data preprocessed using a standard pipeline, specifically fMRIPrep. The pipeline directory must also include a "dataset_description.json" file for proper querying.

The timeseries data of all subjects are appended to a single dictionary self.subject_timeseries. Additional information regarding the structure of this dictionary can be found in the "Note" section.

Basic BIDS directory:

bids_root/
├── dataset_description.json
├── sub-<subject_label>/
│   └── func/
│       └── *task-*_events.tsv
├── derivatives/
│   └── fmriprep-<version_label>/
│       ├── dataset_description.json
│       └── sub-<subject_label>/
│           └── func/
│               ├── *confounds_timeseries.tsv
│               ├── *brain_mask.nii.gz
│               └── *preproc_bold.nii.gz

BIDS directory with session-level organization:

bids_root/
├── dataset_description.json
├── sub-<subject_label>/
│   └── ses-<session_label>/
│       └── func/
│           └── *task-*_events.tsv
├── derivatives/
│   └── fmriprep-<version_label>/
│       ├── dataset_description.json
│       └── sub-<subject_label>/
│           └── ses-<session_label>/
│               └── func/
│                   ├── *confounds_timeseries.tsv
│                   ├── *brain_mask.nii.gz
│                   └── *preproc_bold.nii.gz

Note: Only the preprocessed BOLD file is required. Additional files such as the confounds tsv (needed for denoising), mask, and task timing tsv file (needed for filtering a specific task condition) depend on the specific analyses. As mentioned previously, the "dataset_description.json" is required in both the bids root and pipeline directories for querying with pybids.
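
Because a missing "dataset_description.json" stops querying, it can help to check for both files before calling get_bold. The sketch below uses only pathlib. The function name missing_descriptions is hypothetical, and treating every direct subfolder of derivatives as a pipeline directory when pipeline_name is None is an assumption of this example, not a statement about how pybids resolves derivatives.

```python
import tempfile
from pathlib import Path


def missing_descriptions(bids_dir, pipeline_name=None):
    """List the required dataset_description.json files that are absent."""
    bids_dir = Path(bids_dir)
    derivatives = bids_dir / "derivatives"
    if pipeline_name is not None:
        pipeline_dirs = [derivatives / pipeline_name]
    elif derivatives.is_dir():
        # Assumption: each direct subfolder of derivatives is a pipeline folder.
        pipeline_dirs = sorted(p for p in derivatives.iterdir() if p.is_dir())
    else:
        pipeline_dirs = []
    required = [bids_dir / "dataset_description.json"]
    required += [p / "dataset_description.json" for p in pipeline_dirs]
    return [p.relative_to(bids_dir).as_posix() for p in required if not p.is_file()]


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pipeline = root / "derivatives" / "fmriprep-20.0.0"
    pipeline.mkdir(parents=True)
    before = missing_descriptions(root)

    (root / "dataset_description.json").write_text("{}")
    (pipeline / "dataset_description.json").write_text("{}")
    after = missing_descriptions(root, pipeline_name="fmriprep-20.0.0")
```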

This pipeline is optimized for BOLD data preprocessed by fMRIPrep.

Parameters:

bids_dir (os.PathLike) -- Path to a BIDS compliant directory. A "dataset_description.json" file must be located in this directory or an error will be raised.
task (str) -- Name of the task to extract timeseries data from (e.g. "rest", "n-back").
session (int, str, or None, default=None) -- Session ID to extract timeseries data from. Only a single session can be extracted at a time. While files having session IDs are not mandatory, this parameter must be specified if the dataset has multiple sessions. If session is None and multiple sessions are detected when the preprocessed NIfTI files are queried, an error will be raised. The value can be an integer (e.g. session=2) or a string (e.g. session="001").
runs (int, str, list[int], list[str], or None, default=None) -- List of run numbers to extract timeseries data from. Extracts all runs if unspecified. For instance, to extract only "run-0" and "run-1", use runs=[0, 1]. For non-integer run IDs, use strings: runs=["000", "001"].
condition (str or None, default=None) -- Isolates the timeseries data corresponding to a specific condition, only after the timeseries has been extracted and subjected to nuisance regression. Only a single condition can be extracted at a time.
tr (int, float, or None, default=None) -- Repetition time (TR) for the specified task. If not provided, the TR will be automatically extracted from the first BOLD metadata file found for the task, searching first in the pipeline directory, then in the bids_dir if not found.
run_subjects (list[str] or None, default=None) -- List of subject IDs to process (e.g. run_subjects=["01", "02"]). Processes all subjects if None.
exclude_subjects (list[str] or None, default=None) -- List of subject IDs to exclude (e.g. exclude_subjects=["01", "02"]).
exclude_niftis (list[str] or None, default=None) --
List of the specific preprocessed NIfTI files to exclude, preventing their timeseries data from being extracted. Used if there are specific runs across different participants that need to be excluded.

Changed in version 0.18.0: moved from being the second to last parameter, to being underneath exclude_subjects
pipeline_name (str or None, default=None) -- The name of the pipeline folder in the derivatives folder containing the preprocessed data. If None, BIDSLayout will default to using the bids_dir with derivatives=True. This parameter should be used if multiple pipelines exist or when the pipeline folder containing the "dataset_description.json" file is nested within another folder. The specified folder must contain the "dataset_description.json" file in its root level. For instance, if the json file is in "path/to/bids/derivatives/fmriprep/fmriprep-20.0.0", then pipeline_name = "fmriprep/fmriprep-20.0.0".
n_cores (int or None, default=None) -- The number of cores to use for multiprocessing with joblib. The default backend for joblib is used.

parallel_log_config (dict[str, Union[multiprocessing.Manager.Queue, int]] or None, default=None) --

Passes a user-defined managed queue and logging level to the internal timeseries extraction function when parallel processing (n_cores) is used. Note, if parallel processing is used, global logging configurations won't be passed to the child processes. Thus, to prevent the child processes from using the default logging behavior, this parameter must be used. Additionally, this parameter must be a dictionary and the available keys are:

"queue": The instance of multiprocessing.Manager.Queue to pass to QueueHandler. If not specified, all logs will output to sys.stdout.
"level": The logging level (e.g. logging.INFO, logging.WARNING). If not specified, the default level is logging.INFO.

import logging
from logging.handlers import QueueListener
from multiprocessing import Manager

# Configure root with FileHandler
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
file_handler = logging.FileHandler('neurocaps.log')
file_handler.setFormatter(logging.Formatter('%(asctime)s %(name)s [%(levelname)s] %(message)s'))
root_logger.addHandler(file_handler)

if __name__ == "__main__":
    # Import the TimeseriesExtractor
    from neurocaps.extraction import TimeseriesExtractor

    # Setup managed queue
    manager = Manager()
    queue = manager.Queue()

    # Set up the queue listener
    listener = QueueListener(queue, *root_logger.handlers)

    # Start listener
    listener.start()

    extractor = TimeseriesExtractor()

    # Use the `parallel_log_config` parameter to pass queue and the logging level
    extractor.get_bold(bids_dir="path/to/bids/dir",
                       task="rest",
                       tr=2,
                       n_cores=5,
                       parallel_log_config = {"queue": queue, "level": logging.WARNING})

    # Stop listener
    listener.stop()

Changed in version 0.18.0: moved from being the last parameter, to being underneath n_cores

verbose (bool, default=True) -- If True, logs detailed subject-specific information including: subjects skipped due to missing required files, current subject being processed for timeseries extraction, confounds identified for nuisance regression in addition to requested confounds that are missing for a subject, and additional warnings encountered during the timeseries extraction process.
flush (bool, default=False) -- If True, flushes the logged subject-specific information produced during the timeseries extraction process.

Note

Subject Timeseries Dictionary: This method stores the extracted timeseries of all subjects in self.subject_timeseries. The structure is a dictionary mapping subject IDs to their run IDs and their associated timeseries (TRs x ROIs) as a numpy array:

subject_timeseries = {
        "101": {
            "run-0": np.array([timeseries]), # Shape: TRs x ROIs
            "run-1": np.array([timeseries]), # Shape: TRs x ROIs
            "run-2": np.array([timeseries]), # Shape: TRs x ROIs
        },
        "102": {
            "run-0": np.array([timeseries]), # Shape: TRs x ROIs
            "run-1": np.array([timeseries]), # Shape: TRs x ROIs
        }
    }

By default, "run-0" will be used if run IDs are not specified in the NIfTI file name.

Parcellation & Nuisance Regression: For timeseries extraction, nuisance regression, and spatial dimensionality reduction using a parcellation, nilearn's NiftiLabelsMasker is used. If requested, dummy scans are removed from the NIfTI images and confound dataset prior to timeseries extraction. For volumes exceeding a specified framewise displacement (FD) threshold, if the "use_sample_mask" key in the fd_threshold dictionary is set to True, then a boolean sample mask is generated (where False indicates the high-motion volumes) and passed to the sample_mask parameter in nilearn's NiftiLabelsMasker. If the "use_sample_mask" key is False or not specified in the fd_threshold dictionary, then censoring is done after nuisance regression, which is the default behavior.
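
The censoring rules described for fd_threshold ("threshold", "n_before", "n_after", "outlier_percentage") and the boolean sample mask can be sketched in plain Python as follows. This is an illustration of the documented behavior under stated assumptions, not the library's code. The function names and the FD values are made up for the example.

```python
def censored_volumes(fd_values, threshold, n_before=0, n_after=0):
    """Indices to scrub: every flagged volume plus its requested neighbours."""
    censored = set()
    for i, fd in enumerate(fd_values):
        if fd > threshold:
            start = max(0, i - n_before)
            stop = min(len(fd_values), i + n_after + 1)
            censored.update(range(start, stop))
    return sorted(censored)


def build_sample_mask(n_volumes, censored):
    """Boolean mask in which False marks a high-motion (censored) volume."""
    drop = set(censored)
    return [i not in drop for i in range(n_volumes)]


def exceeds_outlier_percentage(fd_values, threshold, outlier_percentage):
    """True when the share of volumes above the threshold is too high."""
    flagged = sum(fd > threshold for fd in fd_values)
    return flagged / len(fd_values) > outlier_percentage


# Hypothetical FD trace in which only volume 5 exceeds the threshold.
fd = [0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]

censored = censored_volumes(fd, threshold=0.5, n_before=2, n_after=2)
mask = build_sample_mask(len(fd), censored)
run_is_flagged = exceeds_outlier_percentage(fd, threshold=0.5, outlier_percentage=0.2)
```

Here the outlier proportion counts only volumes that exceed the threshold themselves (1 of 10), not the neighbours added by "n_before" and "n_after", which follows the wording of the "outlier_percentage" description.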

Extraction of Task Conditions: When extracting specific conditions, int is used to round down for the beginning scan index, start_scan = int(onset/tr), and math.ceil is used to round up for the ending scan index, end_scan = math.ceil((onset + duration)/tr). Filtering a specific condition from the timeseries is done after nuisance regression. Additionally, if the "use_sample_mask" key in the fd_threshold dictionary is set to True, then the truncated 2D timeseries is temporarily padded to ensure the correct rows corresponding to the condition are obtained.
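
The two rounding formulas can be tried out with the short sketch below. The events are hypothetical rows using the standard BIDS events columns (onset, duration, trial_type), and the function name condition_scan_bounds is an assumption. Whether the ending index itself is included when slicing is not stated here, so the sketch only returns the (start_scan, end_scan) pairs.

```python
import math


def condition_scan_bounds(events, condition, tr):
    """Return (start_scan, end_scan) for every event of the given condition."""
    bounds = []
    for event in events:
        if event["trial_type"] != condition:
            continue
        start_scan = int(event["onset"] / tr)  # round down
        end_scan = math.ceil((event["onset"] + event["duration"]) / tr)  # round up
        bounds.append((start_scan, end_scan))
    return bounds


# Hypothetical events (in seconds) for a task acquired with a TR of 2 seconds.
events = [
    {"onset": 0.0, "duration": 10.0, "trial_type": "rest"},
    {"onset": 11.0, "duration": 5.0, "trial_type": "active"},
    {"onset": 30.0, "duration": 7.0, "trial_type": "active"},
]

bounds = condition_scan_bounds(events, condition="active", tr=2)
```

For the first "active" event, 11 / 2 = 5.5 rounds down to 5 and (11 + 5) / 2 = 8 stays at 8. For the second, 30 / 2 = 15 and (30 + 7) / 2 = 18.5 rounds up to 19.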

timeseries_to_pickle(output_dir, filename=None)

Save the Extracted Subject Timeseries.

Saves the extracted timeseries stored in the self.subject_timeseries dictionary (obtained from running self.get_bold) as a pickle file. This allows for data persistence and easy conversion back into dictionary form for later use.

Parameters:

output_dir (os.PathLike) -- Directory to save self.subject_timeseries dictionary as a pickle file. The directory will be created if it does not exist.
filename (str or None, default=None) --
Name of the file with or without the "pkl" extension.

Changed in version 0.19.0: file_name to filename
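
To illustrate the round trip this method enables, the sketch below saves a dictionary with the subject_timeseries structure to a pickle file and loads it back using only the standard library. It is a stand-in, not the method's implementation. The function save_timeseries, the fallback file name "subject_timeseries", and the use of nested lists in place of numpy arrays are assumptions made to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path


def save_timeseries(data, output_dir, filename=None):
    """Pickle `data` into `output_dir`, creating the directory if needed."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    name = filename or "subject_timeseries"  # assumed fallback name
    if not name.endswith(".pkl"):
        name += ".pkl"
    path = output_dir / name
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return path


# Nested lists stand in for the (TRs x ROIs) numpy arrays.
subject_timeseries = {
    "101": {"run-0": [[0.1, 0.2], [0.3, 0.4]], "run-1": [[0.5, 0.6], [0.7, 0.8]]},
    "102": {"run-0": [[0.9, 1.0], [1.1, 1.2]]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = save_timeseries(subject_timeseries, Path(tmp) / "outputs", "rest_timeseries")
    saved_name = path.name
    with open(path, "rb") as f:
        restored = pickle.load(f)
```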

visualize_bold(subj_id, run, roi_indx=None, region=None, show_figs=True, output_dir=None, filename=None, **kwargs)

Plot the Extracted Subject Timeseries.

Uses self.subject_timeseries to visualize the extracted BOLD timeseries data of Regions of Interest (ROIs) or regions for a specific subject and run.

Parameters:

subj_id (str or int) -- The ID of the subject.
run (int or str) -- The run ID of the subject to plot.
roi_indx (int, str, list[int], list[str], or None, default=None) -- The indices of the parcellation nodes to plot. See "nodes" in self.parcel_approach for valid nodes.
region (str or None, default=None) -- The region of the parcellation to plot. If not None, all nodes in the specified region will be averaged then plotted. See "regions" in self.parcel_approach for valid regions.
show_figs (bool, default=True) -- Display figures.
output_dir (os.PathLike or None, default=None) -- Directory to save plot as png image. The directory will be created if it does not exist. If None, plot will not be saved.
filename (str or None, default=None) --
Name of the file without the extension.

Changed in version 0.19.0: file_name to filename
kwargs (dict) --
Keyword arguments used when saving figures. Valid keywords include:
- dpi: int, default=300
  Dots per inch for the figure. Default is 300 if output_dir is provided and dpi is not specified.
- figsize: tuple, default=(11, 5)
  Size of the figure in inches. Default is (11, 5) if figsize is not specified.
- bbox_inches: str or None, default="tight"
  Alters size of the whitespace in the saved image.

Returns:

matplotlib.figure.Figure -- An instance of matplotlib.figure.Figure.

Note

Parcellation Approach: the "nodes" and "regions" sub-keys are required in parcel_approach.
