Slide Collection
Overview
The SlideCollection is the central controller of CuBATS. It organizes and manages a collection of whole-slide images (WSIs), each of which is represented by an instance of the Slide class. The SlideCollection class also loads and stores processing results and the status of the pipeline.
It wraps all functional processing stages of CuBATS:
Image Registration
Tumor Segmentation
Quantification of Staining Intensities
Combinatorial Analysis of Antigen Co-Expression
Each stage is implemented as a separate module and called internally by the SlideCollection class.
Note
You can still run individual modules separately, particularly the registration and segmentation modules. However, the quantification and combinatorial analysis modules are tightly integrated into the SlideCollection class and cannot be run independently.
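The four stages listed above can be chained on a single collection. The following is a minimal sketch that uses only methods documented on this page. The helper run_pipeline is hypothetical and not part of CuBATS, and the call order, in particular the explicit call to extract_mask_tile_coordinates between segmentation and quantification, is an assumption.

```python
def run_pipeline(collection, model_path, masking_mode="tile-level"):
    """Run the documented processing stages in order (hypothetical helper).

    `collection` is expected to behave like a SlideCollection instance.
    """
    collection.register_slides()                     # 1. image registration
    collection.tumor_segmentation(model_path)        # 2. tumor segmentation
    collection.extract_mask_tile_coordinates()       # assumed step: collect mask tiles
    collection.quantify_all_slides(masking_mode=masking_mode)  # 3. quantification
    # 4. combinatorial analysis of antigen co-expression
    collection.generate_antigen_pair_combinations(masking_mode=masking_mode)
    collection.generate_antigen_triplet_combinations(masking_mode=masking_mode)
    return collection


# Usage with a real collection (all paths are placeholders):
# from cubats.slide_collection.slide_collection import SlideCollection
# collection = SlideCollection("patient_01", "/path/to/wsis", "/path/to/results")
# run_pipeline(collection, "/path/to/segmentation_model")
```

Passing the collection in, rather than constructing it inside the helper, keeps the sketch independent of where the WSIs and results are stored.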
Classes
Slide Collection
- class cubats.slide_collection.slide_collection.SlideCollection(collection_name, src_dir, dest_dir, ref_slide=None, path_antigen_profiles=None)
Initializes a slide collection, stores slide info and performs slide processing.
SlideCollection is a class that initializes a collection of slides, stores relevant processing information, and allows the execution of separate processing steps. Previously processed data can be reloaded.
- name
Name of parent directory (i.e. name of tumorset).
- Type:
str
- src_dir
Path to source directory containing the WSIs.
- Type:
str
- dest_dir
Path to destination directory for results.
- Type:
str
- data_dir
Path to data directory, a subdirectory of dest_dir. The directory is created inside dest_dir when the class is instantiated. Summaries of the quantification results, dual antigen expression results, and triplet antigen expression results are stored in the data directory as .CSV files. The data_dir also contains the pickle_dir.
- Type:
str
- pickle_dir
Path to pickle directory, a subdirectory of data_dir. Pickled copies of the slide_collection, quantification results, and antigen analysis are stored in the pickle directory. They are reloaded automatically if a slide collection is (re-)initialized with the same dest_dir.
- Type:
str
- tiles_dir
Path to tiles directory, a subdirectory of dest_dir. Inside the tiles directory the tile directories for each slide are stored.
- Type:
str
- colocalization_dir
Path to colocalization directory, a subdirectory of dest_dir. Inside the colocalization directory results of dual and triplet overlap analyses are stored.
- Type:
str
- reconstruct_dir
Path to reconstruct directory, a subdirectory of dest_dir. Inside the reconstruct directory reconstructed slides are stored.
- Type:
str
- collection_info_df
Dataframe containing relevant information on all the slides. The columns are:
Name (str): Name of slide.
Reference (bool): True if slide is reference slide.
Mask (bool): True if slide is mask slide.
OpenSlide Object (OpenSlide): OpenSlide object of the slide.
Tiles (DeepZoomGenerator): DeepZoom tiles of the slide.
Level Count (int): Number of Deep Zoom levels in the image.
Level Dimensions (list): List of tuples (pixels_x, pixels_y) for each Deep Zoom level.
Tile Count (int): Number of total tiles in the image.
- Type:
Dataframe
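The Level Count, Level Dimensions, and Tile Count columns follow from the Deep Zoom pyramid layout. The sketch below reproduces that layout in plain Python under the usual Deep Zoom convention: each level halves the previous one (rounding up) until a 1x1 level is reached, and every level is cut into square tiles. The function name deep_zoom_layout is hypothetical, and the tile size actually used by CuBATS is not stated here, so 256 is only an example value.

```python
import math


def deep_zoom_layout(width, height, tile_size):
    """Return (level_dimensions, tiles_per_level, tile_count) for a Deep Zoom pyramid."""
    dims = [(width, height)]
    # Halve the image (rounding up) until the smallest level is 1x1.
    while dims[-1][0] > 1 or dims[-1][1] > 1:
        w, h = dims[-1]
        dims.append((max(1, math.ceil(w / 2)), max(1, math.ceil(h / 2))))
    dims.reverse()  # level 0 is the smallest level, the last level is full resolution
    # Number of tile columns and rows needed to cover each level.
    tiles_per_level = [
        (math.ceil(w / tile_size), math.ceil(h / tile_size)) for w, h in dims
    ]
    tile_count = sum(cols * rows for cols, rows in tiles_per_level)
    return dims, tiles_per_level, tile_count


dims, tiles_per_level, tile_count = deep_zoom_layout(1024, 512, tile_size=256)
print(len(dims))            # level count
print(tiles_per_level[-1])  # (columns, rows) at full resolution
print(tile_count)           # total tiles over all levels
```

Tile coordinates such as those in mask_coordinates are (column, row) indices into the grid of one such level.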
- mask_coordinates
List containing the coordinates of the tiles that are covered by the mask. Coordinates are tuples (column, row).
- Type:
list
- quantification_results
Dataframe containing the quantification results for all processed slides. The columns are:
Name (str): Name of the slide.
Coverage (%) (float): Percentage of positively stained pixels in the slide, characterizing the tumor coverage.
High Positive (%) (float): Percentage of highly positive stained pixels in the slide.
Medium Positive (%) (float): Percentage of medium positive stained pixels in the slide.
Low Positive (%) (float): Percentage of low positive stained pixels in the slide.
Negative (%) (float): Percentage of negatively stained pixels in the slide.
Total Tissue (%) (float): Total amount of tissue in the slide.
Background / No Tissue (%) (float): Percentage of background and non-tissue regions in the slide.
Mask Area (%) (float): Area of the slide covered by the tumor mask.
Non-mask Area (%) (float): Area of the slide not covered by the tumor mask.
H-Score (float): Established pathological score calculated based on the distribution of staining intensities in the slide.
Score (str): Additional overall score of the slide calculated from the average of scores for all tiles. However, this score may be misleading, as it is an average over the entire slide.
Total Processed Tiles (%) (float): Percentage of processed tiles.
Error (%) (float): Percentage of tiles that were not processed due to insufficient tissue coverage. Only the case for tile-level masking.
Thresholds (list): List containing the thresholds applied during quantification.
- Type:
Dataframe
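The H-Score column refers to an established pathology score. As a minimal sketch, the conventional H-score weights the percentages of strongly, moderately, and weakly stained pixels by 3, 2, and 1, giving a value between 0 and 300. Whether CuBATS uses exactly this weighting, and which pixel population the percentages refer to, is an assumption here. The function name h_score is hypothetical.

```python
def h_score(high_pct, medium_pct, low_pct):
    """Conventional H-score from staining percentages (each in the range 0-100)."""
    for value in (high_pct, medium_pct, low_pct):
        if not 0 <= value <= 100:
            raise ValueError("percentages must lie between 0 and 100")
    if high_pct + medium_pct + low_pct > 100:
        raise ValueError("percentages must not sum to more than 100")
    # Stronger staining contributes with a higher weight.
    return 3 * high_pct + 2 * medium_pct + 1 * low_pct


print(h_score(10, 20, 30))  # 3*10 + 2*20 + 1*30 = 100
```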
- dual_antigen_expressions
Dataframe containing a summary of the dual antigen expression results for all processed analyses:
Slide 1 (str): Name of the first slide.
Slide 2 (str): Name of the second slide.
Total Coverage (%) (float): Percentage of combined coverage in two slides.
Total Overlap (%) (float): Percentage of overlapping antigen expressions in two slides.
Total Complement (%) (float): Percentage of complementary antigen expressions in two slides.
High Positive Overlap (%) (float): Percentage of highly positive overlapping antigen expressions in two slides.
High Positive Complement (%) (float): Percentage of highly positive complementary antigen expressions in two slides.
Medium Positive Overlap (%) (float): Percentage of medium positive overlapping antigen expressions in two slides.
Medium Positive Complement (%) (float): Percentage of medium positive complementary antigen expressions in two slides.
Low Positive Overlap (%) (float): Percentage of low positive overlapping antigen expressions in two slides.
Low Positive Complement (%) (float): Percentage of low positive complementary antigen expressions in two slides.
Negative Tissue (%) (float): Percentage of negative tissue in the two slides.
Total Tissue (%) (float): Percentage of total tissue in the two slides.
Background / No Tissue (%) (float): Percentage of background and non-tissue regions in the two slides.
Total Processed Tiles (%) (float): Percentage of processed tiles.
Total Error (%) (float): Percentage of tiles that were not processed due to insufficient tissue coverage.
Error1 (%) (float): Percentage of tiles where neither tile contained tissue.
Error2 (%) (float): Percentage of tiles where tiles could not be analyzed due to incorrect tile shapes.
Thresholds1 (str): Antigen thresholds for slide 1.
Thresholds2 (str): Antigen thresholds for slide 2.
- Type:
Dataframe
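To make the overlap and complement columns concrete, the sketch below works on two aligned lists of per-pixel flags (True where a slide is positively stained). It assumes that overlap means both slides are positive at a pixel, that complement means exactly one of them is positive, and that total coverage is the sum of both. These definitions and the function name dual_expression_summary are assumptions for illustration, not the CuBATS implementation.

```python
def dual_expression_summary(positive1, positive2):
    """Summarize co-expression for two aligned sequences of per-pixel booleans."""
    if len(positive1) != len(positive2) or not positive1:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(positive1)
    pairs = [(bool(a), bool(b)) for a, b in zip(positive1, positive2)]
    overlap = sum(1 for a, b in pairs if a and b)      # both slides positive
    complement = sum(1 for a, b in pairs if a != b)    # exactly one slide positive
    return {
        "Total Coverage (%)": 100 * (overlap + complement) / total,
        "Total Overlap (%)": 100 * overlap / total,
        "Total Complement (%)": 100 * complement / total,
    }


summary = dual_expression_summary(
    [True, True, False, False],
    [True, False, True, False],
)
print(summary)
```

In this example one of four pixels is positive in both slides and two are positive in exactly one slide, so the overlap is 25%, the complement 50%, and the combined coverage 75%.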
- triplet_antigen_results
Dataframe containing a summary of the triplet antigen expression results for all processed analyses:
Slide 1 (str): Name of the first slide.
Slide 2 (str): Name of the second slide.
Slide 3 (str): Name of the third slide.
Total Coverage (%) (float): Percentage of combined coverage in three slides.
Total Overlap (%) (float): Percentage of overlapping antigen expressions in three slides.
Total Complement (%) (float): Percentage of complementary antigen expressions in three slides.
High Positive Overlap (%) (float): Percentage of highly positive overlapping antigen expressions in three slides.
High Positive Complement (%) (float): Percentage of highly positive complementary antigen expressions in three slides.
Medium Positive Overlap (%) (float): Percentage of medium positive overlapping antigen expressions in three slides.
Medium Positive Complement (%) (float): Percentage of medium positive complementary antigen expressions in three slides.
Low Positive Overlap (%) (float): Percentage of low positive overlapping antigen expressions in three slides.
Low Positive Complement (%) (float): Percentage of low positive complementary antigen expressions in three slides.
Negative Tissue (%) (float): Percentage of negative tissue in the three slides.
Total Tissue (%) (float): Percentage of total tissue in the three slides.
Background / No Tissue (%) (float): Percentage of background and non-tissue regions in the three slides.
Total Processed Tiles (%) (float): Percentage of processed tiles.
Total Error (%) (float): Percentage of tiles that were not processed due to insufficient tissue coverage.
Error1 (%) (float): Percentage of tiles where none of the three tiles contained tissue.
Error2 (%) (float): Percentage of tiles where tiles could not be analyzed due to incorrect tile shapes.
Thresholds1 (str): Antigen thresholds for slide 1.
Thresholds2 (str): Antigen thresholds for slide 2.
Thresholds3 (str): Antigen thresholds for slide 3.
- Type:
Dataframe
- __init__(collection_name, src_dir, dest_dir, ref_slide=None, path_antigen_profiles=None)
Initializes the class. The class contains information on the slide collection.
- Parameters:
collection_name (str) – Name of the collection (i.e. Name of tumor set or patient ID)
src_dir (str) – Path to src directory containing the WSIs.
dest_dir (str) – Path to destination directory for results.
ref_slide (str, optional) – Path to reference slide. If ‘ref_slide’ is None it will be automatically set to the HE slide based on the filename of input files. Defaults to None.
path_antigen_profiles (str, optional) – Path to antigen profiles. Definitions as .json or .csv are accepted. If no path is provided, default thresholds will be applied during processing. Defaults to None.
- evaluate_antigen_pair(slide1, slide2, save_img=False, masking_mode='tile-level')
Analyzes the antigen co-expression for a pair of two slides.
Analyzes antigen co-expression for each tile of the given pair of slides, using multiprocessing and the antigen-specific thresholds of each slide. Results from all tiles are summarized, stored in dual_antigen_expressions, and saved as CSV in data_dir as well as PICKLE in pickle_dir. For a more detailed explanation of the co-expression results, see SlideCollection.dual_antigen_expressions.
- Parameters:
slide1 (Slide) – Slide Object for slide 1.
slide2 (Slide) – Slide Object for slide 2.
save_img (bool) – Boolean determining if tiles shall be saved during processing. Necessary if slide shall be reconstructed later on. However, storing images will require additional storage. Defaults to False.
masking_mode (str) –
Determines the mode for mask application that was used for quantification.
- tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included.
Recommended when registration quality is lower (e.g. high median rTRE).
- pixel-level: Applies the mask precisely at pixel level - only masked pixels are included.
Offers finer co-expression evaluation, but is more sensitive to registration errors.
- evaluate_antigen_triplet(slide1, slide2, slide3, save_img=False, masking_mode='tile-level')
Analyzes the antigen co-expression for a triplet of three slides.
Analyzes antigen co-expression for each tile of the given triplet of slides, using multiprocessing and the antigen-specific thresholds of each slide. Results from all tiles are summarized, stored in triplet_antigen_expressions, and saved as CSV in data_dir as well as PICKLE in pickle_dir. For a more detailed explanation of the co-expression results, see SlideCollection.triplet_antigen_expressions.
- Parameters:
slide1 (Slide) – Slide Object for slide 1.
slide2 (Slide) – Slide Object for slide 2.
slide3 (Slide) – Slide Object for slide 3.
save_img (bool) – Boolean determining if tiles shall be saved during processing. Necessary if slide shall be reconstructed later on. However, storing images will require additional storage. Defaults to False.
masking_mode (str) –
Determines the mode for mask application that was used for quantification.
- tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included.
Recommended when registration quality is lower (e.g. high median rTRE).
- pixel-level: Applies the mask precisely at pixel level - only masked pixels are included.
Offers finer co-expression evaluation, but is more sensitive to registration errors.
- extract_mask_tile_coordinates(save_img=False)
Extracts mask coordinates from the mask slide.
Generates a list of the coordinates of all tiles that are part of the mask. This makes it possible to process only tiles that are part of the mask and thus contain tumor tissue. Tiles with less than 10% tumor tissue are dropped to save runtime. Previous mask coordinates are overwritten, and the results are stored as a pickle in pickle_dir.
- Parameters:
save_img (bool) – Boolean to determine if mask tiles shall be saved as images. Necessary if the mask shall be reconstructed later on. Note: Storing tiles will require additional storage. Defaults to False.
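The 10% rule described for extract_mask_tile_coordinates can be sketched as a simple filter. The example assumes that the tumor fraction of each mask tile has already been computed and is available as a dictionary keyed by (column, row). The function name select_mask_tiles and the input layout are hypothetical.

```python
def select_mask_tiles(tumor_fraction_by_tile, min_fraction=0.10):
    """Keep the (column, row) coordinates of tiles with enough tumor tissue."""
    return sorted(
        coord
        for coord, fraction in tumor_fraction_by_tile.items()
        if fraction >= min_fraction  # tiles below 10% tumor tissue are dropped
    )


fractions = {(0, 0): 0.0, (1, 0): 0.05, (1, 1): 0.10, (2, 1): 0.85}
print(select_mask_tiles(fractions))  # [(1, 1), (2, 1)]
```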
- generate_antigen_pair_combinations(masking_mode='tile-level')
Creates all possible antigen pairs and analyzes antigen co-expression for all pairs. Results are stored in dual_antigen_expressions.
- Parameters:
masking_mode (str) –
Determines the mode for mask application that was used for quantification.
- tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included.
Recommended when registration quality is lower (e.g. high median rTRE).
- pixel-level: Applies the mask precisely at pixel level - only masked pixels are included.
Offers finer co-expression evaluation, but is more sensitive to registration errors.
- generate_antigen_triplet_combinations(masking_mode='tile-level')
Creates all possible antigen triplets and analyzes antigen co-expression for all triplets. Results are stored in triplet_antigen_expressions.
- Parameters:
masking_mode (str) –
Determines the mode for mask application that was used for quantification.
- tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included.
Recommended when registration quality is lower (e.g. high median rTRE).
- pixel-level: Applies the mask precisely at pixel level - only masked pixels are included.
Offers finer co-expression evaluation, but is more sensitive to registration errors.
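The number of analyses started by the two combination methods grows quickly with the number of antigen slides. The sketch below enumerates pairs and triplets with itertools.combinations, assuming that combinations are unordered and that the reference and mask slides have already been excluded. The slide names and the function name antigen_combinations are placeholders.

```python
from itertools import combinations


def antigen_combinations(slide_names, size):
    """Return all unordered combinations of `size` slides (2 = pairs, 3 = triplets)."""
    return list(combinations(sorted(slide_names), size))


names = ["antigen_A", "antigen_B", "antigen_C", "antigen_D"]  # placeholder slide names
pairs = antigen_combinations(names, 2)
triplets = antigen_combinations(names, 3)
print(len(pairs), len(triplets))  # 6 pairs and 4 triplets for four slides
```

For n antigen slides this yields n(n-1)/2 pairs and n(n-1)(n-2)/6 triplets, which is worth keeping in mind when estimating runtime and storage.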
- quantify_all_slides(save_imgs=False, masking_mode='tile-level')
Quantifies all registered slides sequentially and stores results.
Sequentially quantifies all instantiated slides, with the exception of the reference_slide and the mask_slide. Results are stored as .CSV in the data_dir. All previous quantification results in quant_res_df are reset and the previous .CSV file is overwritten.
- Parameters:
save_imgs (bool) – Boolean determining if tiles shall be saved as image during processing. This is necessary if slides shall be reconstructed after processing. Note: storing tiles may require substantial additional storage. Defaults to False.
masking_mode (str) –
Defines how the tumor mask is applied to tiles during quantification.
tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included. Recommended when registration quality is lower (e.g. high median rTRE).
pixel-level: Applies the mask precisely at pixel level - only masked pixels are included. Offers finer quantification, but is more sensitive to registration errors.
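The difference between the two masking modes can be shown on a single tile. This is a minimal sketch in plain Python that represents a tile and its mask as equally shaped nested lists. The function name included_pixels is hypothetical, and the real implementation operates on image tiles rather than lists.

```python
def included_pixels(tile, mask, masking_mode="tile-level"):
    """Return the pixel values of `tile` that enter the analysis for a masking mode."""
    if masking_mode == "tile-level":
        # Any overlap with the mask includes the whole tile.
        overlaps = any(any(row) for row in mask)
        return [value for row in tile for value in row] if overlaps else []
    if masking_mode == "pixel-level":
        # Only pixels that lie under the mask are included.
        return [
            value
            for tile_row, mask_row in zip(tile, mask)
            for value, masked in zip(tile_row, mask_row)
            if masked
        ]
    raise ValueError(f"unknown masking_mode: {masking_mode!r}")


tile = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 0]]
print(included_pixels(tile, mask, "tile-level"))   # [10, 20, 30, 40]
print(included_pixels(tile, mask, "pixel-level"))  # [10]
```

With a slightly misregistered mask, the pixel-level result changes as soon as the mask shifts by one pixel, while the tile-level result stays the same as long as the tile still overlaps the mask.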
- quantify_single_slide(slide_name, save_img=False, masking_mode='tile-level')
Quantifies a single slide and appends the results to quant_res_df.
This function quantifies staining intensities for all tiles of the given slide using multiprocessing. The slide matching the passed slide_name is retrieved from the collection_list and quantified using the quantify_slide function of the Slide class. Results are appended to quant_res_df, which is then stored as .CSV in self.data_dir and as .PICKLE in pickle_dir. Existing .CSV/.PICKLE files are overwritten. For more information on quantification, see the Slide.quantify_slide() function in slide.py.
- Parameters:
slide_name (str) – Name of the slide to be processed.
save_img (bool) – Boolean determining if tiles shall be saved during processing. Necessary if the slide shall be reconstructed later on. However, storing images will require additional storage. Defaults to False.
masking_mode (str) –
Defines how the tumor mask is applied to tiles during quantification.
- tile-level (default): Applies the mask coarsely - tiles overlapping the mask are fully included.
Recommended when registration quality is lower (e.g. high median rTRE).
- pixel-level: Applies the mask precisely at pixel level - only masked pixels are included.
Offers finer quantification, but is more sensitive to registration errors.
- register_slides(reference_slide=None, microregistration=True, max_non_rigid_registration_dim_px=2000, crop=None, high_res_alignement=False, high_res_fraction=None)
Registers all WSIs in the collection using VALIS.
Registers all WSIs in the collection and aligns them. Registration can be performed towards a selected reference WSI or automatically towards a WSI chosen by VALIS. Registration includes rigid and non-rigid registration steps. Optional microregistration can be applied to further improve registration quality. The registered slides are saved in the registration directory. Intermediate results are stored in the intermediate_registration_results directory. Further configuration options such as max_non_rigid_registration_dim_px and cropping methods are available. Lastly, an additional, customizable high-resolution alignment can be performed if specified, which allows tailoring the alignment resolution via the high_res_fraction parameter.
- Parameters:
reference_slide (str) – Path to reference slide. If None, the first slide in the collection will be used. Defaults to None.
microregistration (bool) – Boolean determining if microregistration shall be used. Defaults to True.
max_non_rigid_registration_dim_px (int) – Maximum size of non-rigid registration dimension in pixels. Defaults to 2000.
crop (str) – Crop method to be used. Defaults to None, which results in cropping to the reference slide if one is provided. If no reference slide is provided, the default crop method is ‘overlap’.
high_res_alignment (bool) – Boolean determining if customizable high-resolution alignment shall be performed. Defaults to False.
high_res_fraction (float) – Fraction of the image to be used for high resolution alignment. Defaults to None.
- tumor_segmentation(model_path, reference_slide=None, tile_size=[1024, 1024], output_path=None, normalization=False, inversion=False, plot_results=False)
Performs tumor segmentation on the HE WSI in the SlideCollection.
Performs tumor segmentation on either a specified reference WSI or the reference WSI selected by the SlideCollection. The segmentation model needs to be passed. By default, segmentation results are saved into the registration_dir of the SlideCollection; an optional output path can be provided. Model-specific parameters such as normalization or inversion can also be passed. Lastly, segmentation results can be plotted onto the tissue if plot_results is set to True.
- Parameters:
model_path (str) – Path to segmentation model.
reference_slide (str, optional) – Path to reference slide. If None, the reference slide of the SlideCollection is used. Defaults to None.
tile_size (list, optional) – Tile size to be used for segmentation. Defaults to [1024, 1024].
output_path (str, optional) – Output path for segmentation results. If None, the registration_dir of the SlideCollection is used. Defaults to None.
normalization (bool, optional) – Boolean determining if stain normalization shall be applied. Defaults to False.
inversion (bool, optional) – Boolean determining if color inversion shall be applied. Defaults to False.
plot_results (bool, optional) – Boolean determining if segmentation results shall be plotted onto the tissue. Defaults to False.
- Returns:
None
Slide
- class cubats.slide_collection.slide.Slide(name, path, is_mask=False, is_reference=False)
Slide Class.
The Slide class instantiates a slide object containing all relevant information and results for a single slide. All slide-specific operations that can be performed on a single slide rather than on a collection of slides are implemented in this class. This includes quantification of staining intensities and reconstruction of a slide. The class is initialized with the name of the slide, the path to the slide file, and information on whether the slide is a mask or reference slide. The class contains a dictionary of detailed quantification results for each tile, as well as a dictionary of summarized quantification results for the entire slide. The slide object also contains the OpenSlide object, the tiles, the level count, the level dimensions, and the tile count. In addition, it holds a directory in which the tiles are saved after color deconvolution, which is necessary for reconstructing the slide later on.
- name
The name of the slide.
- Type:
str
- openslide_object
The OpenSlide Object of the slide.
- Type:
openslide.OpenSlide
- tiles
DeepZoom Generator containing the tiles of the slide.
- Type:
openslide.deepzoom.DeepZoomGenerator
- level_count
The number of DeepZoom levels of the slide.
- Type:
int
- level_dimensions
List of tuples (pixels_x, pixels_y) for each Deep Zoom level.
- Type:
list
- tile_count
The number of tiles in the slide.
- Type:
int
- masked_tiles
List containing the tiles after intersection of the slide and mask.
- Type:
list
- dab_tile_dir
Directory to save the tiles after color deconvolution is applied. Necessary for reconstruction of the slide. If save_img is False no tiles are saved and this attribute is None.
- Type:
str
- is_mask
Whether the slide is the slide collection's mask slide.
- Type:
bool
- is_reference
Whether the slide is the slide collection's reference slide.
- Type:
bool
- antigen_profile
Dictionary containing antigen-specific quantification thresholds (high-positive, medium-positive, low-positive). If no profile is provided, default thresholds are used.
- Type:
dict, optional
- detailed_quantification_results
Dictionary containing detailed quantification results for each tile of the slide. The dictionary is structured as follows:
key (int): Index of the tile.
value (dict): Dictionary containing the following:
Flag (int): Flag indicating whether the tile was processed (1) or not (0).
Histogram (array): Array containing the histogram of the tile.
Hist_centers (array): Array containing the centers of the histogram bins.
Zones (array): Array containing the number of pixels in each zone, sorted by index in the following order: High Positive, Positive, Low Positive, Negative, Background.
Score (str): Score of the tile based on the zones.
- Type:
dict
- quantification_summary
Dictionary containing the summarized quantification results for the slide:
Name (str): Slide name
Coverage (%) (float): Percentage of all positively quantified pixels in the Slide. This includes all high-positive, medium-positive, and low-positive stained pixels.
High Positive (%) (float): Percentage of highly positive stained pixels in the Slide.
Medium Positive (%) (float): Percentage of medium positive stained pixels in the Slide.
Low Positive (%) (float): Percentage of low positive stained pixels in the Slide.
Negative (%) (float): Percentage of negatively stained pixels in the Slide.
Total Tissue (%) (float): Percentage of tissue in the Slide.
Background / No Tissue (%) (float): Percentage of Background pixels in the Slide.
H-Score (int): H-Score of the slide.
Score (str): Pathological score of the slide: High Positive, Medium Positive, Low Positive, Negative, Background.
Total Processed Tiles (%): Percentage of processed tiles of the slide.
Error (%) (float): Percentage of skipped tiles because they did not contain tissue.
Thresholds: Antigen-specific thresholds used during quantification.
- Type:
dict
- properties
Dictionary containing relevant slide properties including: name, is_reference, is_mask, openslide_object, tiles, level_count, level_dimensions, and tile_count.
- Type:
dict
- __init__(name, path, is_mask=False, is_reference=False)
Initialize a Slide object.
- Parameters:
name (str) – The name of the slide.
path (str) – The path to the slide file.
is_mask (bool, optional) – Whether the slide is a mask. Defaults to False.
is_reference (bool, optional) – Whether the slide is a reference slide. Defaults to False.
- quantify_slide(mask_coordinates, save_dir, save_img=False, img_dir=None, mask=None)
Quantifies staining intensities for masked tiles of this slide.
This function uses multiprocessing to quantify staining intensities of masked tiles for the slide. Each tile undergoes color deconvolution followed by staining intensity quantification based on the IHC Profiler’s algorithm. If save_img is True, tiles are saved in img_dir after deconvolution for later reconstruction. After color deconvolution, each tile is processed as a grayscale image, and each pixel’s staining intensity (0-255) is quantified based on the thresholds defined in self.antigen_profile. If no specific antigen_profile was provided the default profile will be used. Results are stored in self.detailed_quantification_results and summarized in self.quantification_summary. Both are saved as PICKLE files in save_dir.
- Parameters:
mask_coordinates (list) – List of xy-coordinates from the mask slide where the mask is positive.
save_dir (str) – Directory to save the results. Usually the slide's pickle directory.
save_img (bool, optional) – Whether to save the tiles after color deconvolution. Defaults to False. Necessary for reconstruction of the slide.
img_dir (str, optional) – Directory to save the tiles. Must be provided if tiles shall be saved. Defaults to None.
mask (openslide.deepzoom.DeepZoomGenerator, optional) – DeepZoomGenerator containing the detailed mask. Defaults to None. Provides a more detailed mask for the quantification of the slide, however, might result in larger inaccuracies for WSI with low congruence.
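The threshold-based step of quantify_slide can be illustrated as follows. The sketch assigns each grayscale intensity (0-255) of a deconvolved tile to one of the five zones, assuming that lower values correspond to stronger staining. The threshold values are placeholders chosen for illustration; the actual values come from self.antigen_profile or the CuBATS defaults. The function name classify_pixels is hypothetical.

```python
def classify_pixels(intensities, high=60, medium=120, low=180, negative=235):
    """Count pixels per zone for a flat sequence of grayscale intensities (0-255).

    Zone order: High Positive, Positive, Low Positive, Negative, Background.
    The threshold defaults are placeholders, not the CuBATS defaults.
    """
    zones = [0, 0, 0, 0, 0]
    for value in intensities:
        if value <= high:
            zones[0] += 1  # darkest pixels: strongest staining
        elif value <= medium:
            zones[1] += 1
        elif value <= low:
            zones[2] += 1
        elif value <= negative:
            zones[3] += 1
        else:
            zones[4] += 1  # brightest pixels: background
    return zones


print(classify_pixels([10, 60, 100, 150, 200, 250]))  # [2, 1, 1, 1, 1]
```

An antigen-specific profile would replace the three positive thresholds, which shifts pixels between the High Positive, Positive, and Low Positive zones without changing the overall procedure.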
- reconstruct_slide(in_path, out_path)
Reconstructs a slide into a Whole Slide Image (WSI) based on saved tiles. This is only possible if tiles have been saved during processing. The WSI is then saved as .tif in the specified out_path.
- Parameters:
in_path (str) – Path to saved tiles
out_path (str) – Path where to save the reconstructed slide.
- summarize_quantification_results()
Summarizes quantification results.
Summarizes quantification results for a given slide and appends them to self.quantification_summary. This includes the total number of pixels in each zone, the percentage of pixels in each zone, and a score for each zone.
- The summary contains the following keys:
Slide (str): Name of the slide.
Coverage (float): Tumor coverage of the antigen in the slide.
High Positive (float): Percentage of pixels in the high positive zone.
Positive (float): Percentage of pixels in the positive zone.
Low Positive (float): Percentage of pixels in the low positive zone.
Negative (float): Percentage of pixels in the negative zone.
Total Tissue (float): Total amount of tissue in the slide.
Background (float): Percentage of pixels in the white space background or fatty tissues.
H-Score (float): H-score calculation based on positive pixels.
Score (str): Overall score of the slide based on the zones. However, the score for the entire slide may be misleading since much negative tissue may lead to a negative score even though the slide may contain a lot of positive tissue as well. Therefore, the score for the entire slide should be interpreted with caution.
Total Processed Tiles (float): Percentage of total processed tiles.
Error (float): Percentage of tiles that were not processed as they did not contain sufficient tissue.
- Raises:
ValueError – If the slide is a mask slide or a reference slide.
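The aggregation from per-tile results to a slide-level summary can be sketched as follows. The input mirrors the documented structure of detailed_quantification_results (a Flag and a five-element Zones array per tile). The assumption that percentages are taken over all pixels of the processed tiles, the reduced set of output keys, and the function name summarize_tiles are illustrative choices, not the CuBATS implementation.

```python
ZONE_NAMES = ("High Positive", "Positive", "Low Positive", "Negative", "Background")


def summarize_tiles(detailed_results):
    """Aggregate per-tile zone counts into slide-level percentages."""
    totals = [0, 0, 0, 0, 0]
    processed = 0
    for tile in detailed_results.values():
        if tile["Flag"] != 1:  # skip tiles that were not processed
            continue
        processed += 1
        for index, pixel_count in enumerate(tile["Zones"]):
            totals[index] += pixel_count
    pixel_total = sum(totals)
    summary = {
        name: (100 * count / pixel_total if pixel_total else 0.0)
        for name, count in zip(ZONE_NAMES, totals)
    }
    # Coverage combines all positively stained zones.
    summary["Coverage"] = (
        summary["High Positive"] + summary["Positive"] + summary["Low Positive"]
    )
    summary["Total Processed Tiles"] = (
        100 * processed / len(detailed_results) if detailed_results else 0.0
    )
    return summary


tiles = {
    0: {"Flag": 1, "Zones": [10, 20, 30, 40, 100]},
    1: {"Flag": 0, "Zones": [0, 0, 0, 0, 0]},
    2: {"Flag": 1, "Zones": [0, 0, 0, 60, 140]},
}
print(summarize_tiles(tiles))
```

In this example 400 pixels are counted in total, of which 60 are positively stained, so the coverage is 15% while 60% of the pixels are background.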