CAP.get_caps

CAP.get_caps(subject_timeseries, runs=None, n_clusters=5, cluster_selection_method=None, random_state=None, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, algorithm='lloyd', standardize=True, n_cores=None, show_figs=False, output_dir=None, progress_bar=False, as_pickle=False, **kwargs)

Perform K-Means Clustering to Identify CAPs.

Concatenates the timeseries of each subject into a single NumPy array with dimensions (participants x TRs) x ROIs and fits sklearn.cluster.KMeans on the concatenated data. A separate KMeans model is fit for each group.
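The concatenate-then-cluster step described above can be sketched with NumPy and scikit-learn alone. This is a minimal illustration, not the library's implementation: the toy data, the run labels, and the `groups` mapping are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 subjects, 2 runs each, 40 TRs x 6 ROIs per run.
subject_timeseries = {
    sub: {run: rng.standard_normal((40, 6)) for run in ("run-1", "run-2")}
    for sub in ("01", "02")
}

# Hypothetical group mapping (group name -> subject IDs).
groups = {"All Subjects": ["01", "02"]}

models = {}
for group, subject_ids in groups.items():
    # Stack every run of every subject in the group along the TR axis,
    # giving an array of shape (participants x TRs) x ROIs.
    concatenated = np.vstack(
        [ts for sub in subject_ids for ts in subject_timeseries[sub].values()]
    )
    # One KMeans model per group.
    models[group] = KMeans(n_clusters=5, random_state=0).fit(concatenated)

# Each cluster centroid has one value per ROI: (n_clusters, n_ROIs).
cap_shape = models["All Subjects"].cluster_centers_.shape
```

Here each row of `cluster_centers_` has one value per ROI, which is the sense in which a centroid corresponds to a CAP.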

Parameters:
  • subject_timeseries (SubjectTimeseries or str) – A dictionary mapping subject IDs to their run IDs and their associated timeseries (TRs x ROIs) as a NumPy array. Can also be a path to a pickle file containing this same structure. Refer to documentation for SubjectTimeseries in the “See Also” section for an example structure.

  • runs (int, str, list[int], list[str], or None, default=None) – Specific run IDs to perform the CAPs analysis with (e.g. runs=[0, 1] or runs=["01", "02"]). If None, all runs will be used.

  • n_clusters (int or list[int], default=5) – Number of clusters to use. Can be a single integer or a list of integers (if cluster_selection_method is not None).

  • cluster_selection_method ({“elbow”, “davies_bouldin”, “silhouette”, “variance_ratio”} or None, default=None) – Method to find the optimal number of clusters. Options are “elbow”, “davies_bouldin”, “silhouette”, and “variance_ratio”.

  • random_state (int or None, default=None) – Random state (seed) value to use.

  • init ({“k-means++”, “random”}, Callable, or ArrayLike, default=“k-means++”) – Method for choosing the initial cluster centroids. Options are “k-means++”, “random”, a callable, or an array-like of shape (n_clusters, n_features).

  • n_init ({“auto”} or int, default=“auto”) – Number of times k-means is run with different initial centroids. The model with the lowest inertia from these runs is selected.

  • max_iter (int, default=300) – Maximum number of iterations for a single run of k-means.

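  • tol (float, default=1e-4) – Convergence tolerance. Per scikit-learn, a run is declared converged, and stops before max_iter is reached, once the change in cluster centers between two consecutive iterations (Frobenius norm, relative to this tolerance) is small enough.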

  • algorithm ({"lloyd", "elkan"}, default="lloyd") – The algorithm to use. Options are “lloyd” and “elkan”.

  • standardize (bool, default=True) –

    Standardizes the columns (ROIs) of the concatenated timeseries data. Uses sample standard deviation (n-1).

    Note

    Standard deviations below np.finfo(std.dtype).eps are replaced with 1 for numerical stability.

  • n_cores (int or None, default=None) – Number of cores to use for multiprocessing, with Joblib, to run multiple k-means models if cluster_selection_method is not None. The “loky” backend is used.

  • show_figs (bool, default=False) – Displays the plots for the specified cluster_selection_method for all groups.

  • output_dir (str or None, default=None) – Directory to save plots as png files if cluster_selection_method is not None. The directory will be created if it does not exist. If None, plots will not be saved.

  • progress_bar (bool, default=False) – If True and cluster_selection_method is not None, displays a progress bar.

  • as_pickle (bool, default=False) –

    When output_dir and cluster_selection_method are both specified, plots are saved as pickle files, which can be modified further, instead of png images.

    Added in version 0.26.5.

  • **kwargs

    Additional keyword arguments when cluster_selection_method is specified:

    • S: int or float, default=1.0 – Adjusts the sensitivity of finding the elbow. Larger values are more conservative and less sensitive to small fluctuations. Passed to KneeLocator from the kneed package.

    • dpi: int, default=300 – Dots per inch for the figure.

    • figsize: tuple, default=(8, 6) – Adjusts the size of the plots.

    • bbox_inches: str or None, default=“tight” – Alters the size of the whitespace in the saved image.

    • step: int, default=None – An integer value that controls the progression of the x-axis in plots.

    • max_nbytes: int, str, or None, default=“1M” – If n_cores is not None, serves as the threshold to trigger Joblib’s automated memory mapping for large arrays.

      Added in version 0.28.5.
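The behavior documented for the standardize parameter (column-wise scaling with the sample standard deviation, with near-zero standard deviations replaced by 1) can be sketched as follows. The helper name `standardize_columns` is hypothetical and the code is an illustration of the described rule, not the library's own routine.

```python
import numpy as np


def standardize_columns(data):
    """Z-score each column (ROI) using the sample standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # ddof=1 divides by n-1
    # Replace standard deviations below machine epsilon with 1 so that
    # constant columns do not cause a division by (near) zero.
    std = np.where(std < np.finfo(std.dtype).eps, 1.0, std)
    return (data - mean) / std


# Column 0 varies; column 1 is constant and would otherwise divide by zero.
data = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
scaled = standardize_columns(data)
```

With this guard, the constant column is mapped to zeros rather than NaN.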

See also

neurocaps.typing.SubjectTimeseries

Type definition for the subject timeseries dictionary structure.

Returns:

self

Raises:

NoElbowDetectionError – Raised when cluster_selection_method is set to “elbow” but kneed’s KneeLocator does not detect an elbow in the convex curve.

Notes

KMeans Algorithm: Refer to scikit-learn’s Documentation for additional information about the KMeans algorithm used in this method.

The n_clusters, random_state, init, n_init, max_iter, tol, and algorithm parameters are passed to sklearn.cluster.KMeans. Only n_clusters differs from scikit-learn’s default value, changing from 8 to 5.

Default Group Naming: When group is None during initialization of the CAP class, “All Subjects” is the default group name. On the first call of this function, the subject IDs in subject_timeseries are automatically detected and stored in self.group. This mapping persists until the CAP class is re-initialized or self.clear_groups() is called.

Concatenated Timeseries: The concatenated timeseries is stored in self.concatenated_timeseries for user convenience and can be deleted using del self.concatenated_timeseries without disrupting any other function. Additionally, for versions >= 0.25.0, subjects are concatenated in lexicographic order of their subject IDs.
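Lexicographic ordering compares IDs as strings, so it can differ from numeric ordering when IDs are not zero-padded. The short example below illustrates the distinction using Python's built-in `sorted`. The IDs are hypothetical, and whether the library uses `sorted` internally is not confirmed by this page.

```python
# Unpadded IDs: string comparison places "10" before "2".
subject_ids = ["2", "10", "1"]
lexicographic = sorted(subject_ids)       # ['1', '10', '2']
numeric = sorted(subject_ids, key=int)    # ['1', '2', '10']

# Zero-padded IDs: lexicographic and numeric order coincide.
padded_ids = ["02", "10", "01"]
padded_sorted = sorted(padded_ids)        # ['01', '02', '10']
```

Zero-padding subject IDs keeps the row order of the concatenated array aligned with the numeric order of the subjects.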