recovar.data_io
Dataset loading, metadata extraction, and image access for cryo-EM and cryo-ET data.
Flow
flowchart TD
A[CLI / pipeline args] --> B[halfsets.py<br/>split policy + halfset loading]
A --> C[cryoem_dataset.py<br/>load_dataset(...)]
C --> D[image_sources.py<br/>image source assembly]
C --> E[metadata_readers.py<br/>STAR / CS metadata parsing]
D --> F[image_backends.py<br/>file-backed SPA / cryo-ET loaders]
F --> G[image_loader.py<br/>MRC / MRCS / HDF5 I/O]
E --> H[image_metadata.py<br/>ImageMetadata]
D --> I[_index_utils.py<br/>image/group remapping]
B --> I
D --> J[cryoem_dataset.py<br/>CryoEMDataset]
H --> J
I --> J
B --> J
J --> K[iter_batches(...)]
K --> L[pipeline / compute_state / analyze]
SPA / cryo-ET load path
-> cryoem_dataset.load_dataset(...)
-> image_sources.create_image_source(...)
-> image_backends.py
-> image_loader.py
-> metadata_readers.auto_parse_poses / auto_parse_ctf
-> image_metadata.ImageMetadata
-> CryoEMDataset(...)
Halfset / subset path
-> halfsets.get_split_indices / get_split_tilt_indices
-> halfsets.load_halfset_dataset / load_halfset_dataset_from_args
-> CryoEMDataset with halfset_indices
Downstream runtime path
-> CryoEMDataset.iter_batches(...)
-> explicit tuples:
(images, rotation_matrices, translations, ctf_params, noise_variance, particle_indices, image_indices)
Cross-cutting indexing
_index_utils.py
- DatasetIndexLayout: local <-> original image/group ids
- TiltSeriesOriginalIndexMap: particle <-> image ids in the original file
Keep these responsibilities separate:
- image_sources.py owns raw image access, lazy/eager loading, and subset views.
- image_metadata.py owns rotations, translations, and CTF rows.
- cryoem_dataset.py is the only high-level coordinator and batch iterator surface.
- halfsets.py owns split policy and halfset bookkeeping.
- _index_utils.py owns image/group/particle remapping logic.
- image_backends.py owns only the low-level stack and tilt-series loaders used underneath image sources.
Public surface used by the main runtime:
- cryoem_dataset.load_dataset
- halfsets.load_halfset_dataset
- halfsets.load_halfset_dataset_from_args
- CryoEMDataset.iter_batches
- CryoEMDataset.subset
cryoem_dataset
Core dataset classes and loading functions.
recovar.data_io.cryoem_dataset
Top-level cryo-EM / cryo-ET dataset assembly and batch iteration.
Architecture:
- image_sources.py owns raw image loading, lazy/eager access, and subset views
- image_metadata.py owns poses and CTF metadata only
- CryoEMDataset coordinates both layers and exposes the single explicit
batch iterator used by downstream code
CryoEMDataset(image_source, voxel_size, metadata, ctf_evaluator=None, dtype=np.complex64, dataset_indices=None, grid_size=None, tilt_series_flag=False, premultiplied_ctf=False)
Core dataset class for cryo-EM heterogeneity analysis.
Wraps particle images with per-image metadata (poses, CTF parameters) and provides geometry helpers for 3-D reconstruction and embedding.
For half-set reconstructions, two CryoEMDataset instances are typically
managed via halfset_indices on the dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| grid_size | | Side length of the image (and default 3-D reconstruction grid). |
| voxel_size | | Pixel / voxel size in Angstroms. |
| n_images | | Number of particle images in this dataset. |
| image_source | | Underlying image-loading layer ( |
| tilt_series_flag | | |
metadata
property
The per-image metadata store.
image_source
property
Image-loading layer for this dataset.
index_layout
property
Explicit local/original image/group mapping for this dataset.
original_image_indices
property
Original source-file image index for each local image.
original_group_indices
property
Original source-file group index for each local group.
rotation_matrices
property
writable
Per-image rotation matrices (read-only view).
translations
property
writable
Per-image translations (read-only view).
CTF_params
property
writable
Per-image CTF parameters (read-only view).
image_mask
property
Circular window mask from the image stack.
data_multiplier
property
writable
Sign multiplier for data inversion (±1).
dataset_tilt_indices
property
Per-particle tilt index lists (tilt-series only).
tilt_particles
property
List of per-particle tilt index arrays (tilt-series only).
ctf_evaluator
property
The recovar.core.ctf.CTFEvaluator for this dataset.
get_ctf_column(col)
Read a single CTF parameter column for all images.
get_ctf_params_copy()
Return a mutable copy of the full CTF parameter array.
update_poses(rots, trans)
Replace all poses.
update_ctf(ctf_params)
Replace all CTF parameters.
process_images(images, apply_image_mask=False)
Apply windowing + full DFT preprocessing to raw images.
process_images_half(images, apply_image_mask=False)
Apply windowing + rfft2 preprocessing → half-spectrum output.
subset(indices)
Return a new CryoEMDataset containing only the images at indices.
The returned dataset uses an ImageSource subset view, so the
subset/remap logic stays inside the image-loading layer rather than
being duplicated in the dataset class.
can_reload_from_original_images()
Whether this dataset can rebuild a file-backed view by original ids.
reload_from_original_images(original_image_indices, *, lazy=None)
Reload a dataset view from original file image indices.
This is used only when an independent file-backed dataset is required. The input indices are always in original file ordering, never this dataset's local ordering.
get_halfset_dataset(halfset_id, *, independent=False, lazy=None)
Return one halfset as either a lightweight view or independent reload.
prefers_independent_halfset_datasets()
Return whether hot halfset iteration should reload independent datasets.
Lazy file-backed datasets pay extra per-batch remapping cost when they iterate through subset views. Reloading each halfset directly from the original files preserves the same batch contents while avoiding that remap layer in heavy downstream loops.
materialize_halfset_datasets(*, independent=None, lazy=None)
Build the two halfset datasets used by downstream kernels.
Parameters
independent : bool, optional
Whether to reload each halfset from the original files instead of
constructing subset views. Defaults to
prefers_independent_halfset_datasets().
lazy : bool, optional
Laziness flag used only for independent reloads. Defaults to the
parent dataset's image-source laziness.
get_predicted_image(indices, volume, skip_ctf=False, spatial=True)
Get predicted images for given indices using forward model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| indices | | Array of indices to predict images for | required |
| volume | | Volume to use for prediction | required |
| skip_ctf | | Whether to skip CTF application | False |
| spatial | | Whether to return images in real space (True) or Fourier space (False) | True |

Returns:
| Type | Description |
|---|---|
| | Predicted images in real space if spatial=True, otherwise in Fourier space |
n_halfset_images(halfset_id)
Number of images in a given halfset.
get_particle_halfset_indices()
Per-half canonical particle indices for tilt-series datasets.
For SPA datasets, this simply returns halfset_indices (images
and particles are 1-to-1). For tilt-series, it maps each half's
image indices through the image→particle mapping and returns the
unique canonical (dataset_tilt_indices) particle ids per half.
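The tilt-series mapping described above can be sketched with plain NumPy. This is only an illustration of the image-to-particle reduction, not recovar's implementation; `image_to_particle` and `halfset_image_indices` are hypothetical stand-in arrays.

```python
import numpy as np

# Hypothetical image -> particle map: 7 tilt images belonging to 3 particles.
image_to_particle = np.array([0, 0, 0, 1, 1, 2, 2])

# Hypothetical per-half image indices (all tilts of a particle share a half).
halfset_image_indices = [np.array([0, 1, 2, 5, 6]), np.array([3, 4])]

# Map each half's images to particle ids and keep each id once.
particle_halfset_indices = [
    np.unique(image_to_particle[idx]) for idx in halfset_image_indices
]
```

For an SPA dataset the map would be the identity, so the per-half particle ids and image ids coincide.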
split_halfset_array(arr, per_particle=False)
Split a dataset-local-ordered array by halfset membership.
Parameters
per_particle : bool
If True and this is a tilt-series dataset, split by particle/group indices instead of image indices.
iter_batches(batch_size, *, halfset_id=None, indices=None, noise_model=None, noise_half=True, noise_by_particle=False, by_image=True, prefetch=True, pack_groups=False)
Iterate over dataset batches, yielding explicit batch fields.
Parameters
batch_size : int
halfset_id : int, optional
Halfset index (0 or 1). Mutually exclusive with indices.
indices : array-like, optional
Iterate over this subset of image indices.
noise_model : optional
Noise model used to populate the yielded noise_variance field.
noise_half : bool
Use half-spectrum noise (default True for mean reconstruction).
noise_by_particle : bool
Index noise by particle group (for covariance path).
by_image : bool
True = flat per-image iteration; False = particle-grouped (tilt).
prefetch : bool
Enable 1-lookahead prefetch buffer (default True).
pack_groups : bool
Pack multiple tilt-series particles into each batch up to
batch_size images. Only applies when by_image=False.
Padded entries get sentinel particle_id=-1.
Yields
tuple
(images, rotation_matrices, translations, ctf_params,
noise_variance, particle_indices, image_indices)
get_halfset(halfset_id)
Return a halfset dataset, lazily materializing and caching.
The cache is invalidated automatically when mutable state (contrasts, noise, poses, CTF) changes on this dataset.
set_contrasts(contrasts)
Multiply per-image CTF contrast column by contrasts.
contrasts must be in this dataset's ordering (original ordering for a full dataset, or local ordering for a subset). For tilt-series with per-particle contrasts (len < n_images), each particle's tilt images share a single contrast value.
set_noise(noise_variance)
Set the radial noise model for this dataset.
If the dataset already has a VariableRadialNoiseModel, updates
it; otherwise sets a RadialNoiseModel.
load_dataset(particles_file, poses_file=None, ctf_file=None, datadir=None, n_images=None, ind=None, lazy=True, padding=0, uninvert_data=False, tilt_series=False, tilt_series_ctf=None, dose_per_tilt=2.9, angle_per_tilt=3, premultiplied_ctf=False, strip_prefix=None, sort_with_Bfac=False, downsample_D=None)
Load a cryo-EM / cryo-ET dataset.
Poses and CTF can come from:
- Pickle files (legacy cryoDRGN format) via poses_file / ctf_file
- Auto-extracted from the particles STAR or CS file when those are None
reorder_to_original_indexing(arr, ds, use_tilt_indices=False)
Reorder a halfset-concatenated array back to original file ordering.
For SPA (use_tilt_indices=False), uses ds.halfset_indices
(image-level). For tilt-series (use_tilt_indices=True), uses
the canonical particle indices derived from each half's images so
that per-particle data is scattered to its original particle position.
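The scatter from halfset-concatenated order back to original order can be sketched as below. This is a minimal NumPy illustration under the assumption that the array stores half 0 followed by half 1; the arrays are made up and the code is not recovar's implementation.

```python
import numpy as np

# Hypothetical halfset membership: original indices held by each half.
halfset_indices = [np.array([4, 0, 2]), np.array([1, 3])]

# Values stored half 0 first, then half 1 (halfset-concatenated order).
concatenated = np.array([40.0, 0.0, 20.0, 10.0, 30.0])

# Scatter each value back to the original position it came from.
order = np.concatenate(halfset_indices)
original_order = np.empty_like(concatenated)
original_order[order] = concatenated
```

For tilt-series data the same scatter would use per-half particle ids in place of image ids.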
reorder_to_dataset_indexing(arr, ds, use_tilt_indices=False)
Reorder a halfset-concatenated array back to this dataset's local ordering.
subsample_cryoem_dataset(cryo, good_indices)
Return a new CryoEMDataset containing only the images at good_indices.
image_backends
Low-level Grain-backed image backends.
recovar.data_io.image_backends
File-backed image backends used underneath recovar.data_io.image_sources.
This module owns the low-level Grain-backed loaders for:
- single-particle image stacks
- cryo-ET tilt-series grouped by particle
It does not own metadata, halfset policy, or the top-level dataset view.
Those live in image_metadata.py, halfsets.py, and
cryoem_dataset.py respectively.
ParticleImageDataset(image_file, lazy=True, ind=None, invert_data=False, datadir='', padding=0, max_threads=16, strip_prefix=None, downsample_D=None, device=None, **kwargs)
Dataset for cryo-EM particle images.
Implements __getitem__ / __len__ which is the protocol expected
by both grain.RandomAccessDataSource and the downstream loaders.
process_images_half(images, apply_image_mask=False)
Return half-spectrum images using the legacy full-FFT path.
The old pipeline applied process_images first and then converted
the full FFT layout to half-spectrum storage. Direct rfft is
mathematically close, but it is not numerically identical and that
drift is enough to change downstream PCA / outlier regressions.
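The difference between the two routes can be illustrated with plain NumPy. This sketch assumes unshifted, unnormalized FFT layouts and random test data; recovar's own windowing, shifting, and storage conventions are not represented here.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.standard_normal((4, 8, 8))  # hypothetical stack of real images

# Legacy-style route: full 2-D FFT, then keep the non-redundant columns.
n = images.shape[-1]
half_from_full = np.fft.fft2(images)[..., : n // 2 + 1]

# Direct route: real-input FFT that computes only the half spectrum.
half_direct = np.fft.rfft2(images)

# The two agree up to floating-point rounding, not necessarily bit for bit.
max_abs_diff = np.abs(half_from_full - half_direct).max()
```

Any nonzero `max_abs_diff` is the kind of rounding-level drift the note above refers to.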
TiltSeriesDataset(starfile_path, lazy=True, num_tilts=None, random_tilts=False, ind=None, voltage=None, dose_per_tilt=None, angle_per_tilt=None, expected_res=None, tilt_file_option='relion5', **kwargs)
image_sources
Raw image loading abstraction and subset/image-group remapping.
recovar.data_io.image_sources
Image-source layer for cryo-EM / cryo-ET datasets.
This module cleanly separates image loading from metadata storage and from the top-level dataset/view object. It provides:
- backend sources that load images from files, lazily or eagerly
- subset views that remap image/group indices without leaking that logic into the dataset class
image_metadata
Typed metadata container for poses and CTF rows.
recovar.data_io.image_metadata
Per-image metadata storage for poses and CTF parameters.
ImageMetadata(rotation_matrices, translations, ctf_params, *, rotation_dtype=np.float32, ctf_dtype=np.float32, real_dtype=np.float32)
Per-image metadata store.
This class owns only metadata arrays. It has no loading, iteration, subset-view, or halfset logic.
halfsets
Halfset and split logic for SPA and cryo-ET.
recovar.data_io.halfsets
Half-set splitting logic for cryo-EM reconstruction.
Provides functions for splitting a dataset into two independent half-sets used for FSC-based resolution estimation. Supports random splits, RELION _rlnRandomSubset, explicit halfset files, and tilt-series-aware particle-level splitting.
HalfsetDatasetSpec(particles_file, poses_file=None, ctf_file=None, datadir=None, uninvert_data=False, padding=0, n_images=None, tilt_series=False, tilt_series_ctf=None, angle_per_tilt=None, dose_per_tilt=None, premultiplied_ctf=False, strip_prefix=None, downsample_D=None)
dataclass
Normalized file/loader settings for constructing a halfset dataset.
split_index_list(all_valid_image_indices, split_random_seed=0)
Split a list of indices into two balanced halves with reproducible randomization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_valid_image_indices | | Array of indices to split | required |
| split_random_seed | | Random seed for reproducible splits | 0 |

Returns:
| Type | Description |
|---|---|
| | List of two numpy arrays containing the split indices |
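A seeded, balanced split of this kind can be sketched as follows. The helper name `split_in_two`, the use of `numpy.random.default_rng`, and the per-half sorting are assumptions made for illustration; the actual function may shuffle and order its output differently.

```python
import numpy as np

def split_in_two(indices, seed=0):
    """Shuffle indices with a seeded generator and cut them into two halves."""
    indices = np.asarray(indices)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    middle = len(shuffled) // 2
    # Sorting each half is a choice made here to keep file reads in increasing order.
    return [np.sort(shuffled[:middle]), np.sort(shuffled[middle:])]

halves = split_in_two(np.arange(11), seed=0)
halves_again = split_in_two(np.arange(11), seed=0)
```

Calling the helper twice with the same seed gives the same two halves, which is what makes a split reproducible across runs.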
get_split_indices(particles_file, datadir=None, strip_prefix=None, ind_file=None, split_random_seed=0, validate_split=True, n_images=None)
Get indices for splitting dataset into halfsets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| particles_file | | Path to particles STAR file | required |
| datadir | | Data directory (optional) | None |
| strip_prefix | | Prefix to strip from file paths (optional) | None |
| ind_file | | File containing specific indices to use (optional) | None |
| split_random_seed | | Random seed for reproducible splits | 0 |
| validate_split | | Whether to validate the split is balanced | True |
| n_images | | Pre-computed image count (avoids re-reading the file) | None |

Returns:
| Type | Description |
|---|---|
| | List of two numpy arrays containing indices for each halfset |
get_split_tilt_indices(particles_file, ind_file=None, tilt_ind_file=None, ntilts=None, datadir=None, particle_halfset_indices_file=None)
Split a tilt-series dataset into two halfsets (image indices).
Supports optional filtering by image/particle indices and precomputed splits.
load_halfset_dataset(spec, *, ind_split, lazy=False)
Load one dataset view and attach halfset-local indices for iteration.
resolve_halfset_indices(args)
Determine which images belong to each reconstruction half-set.
Priority order:
- Explicit halfsets file (--halfsets).
- _rlnRandomSubset column in the STAR file (RELION convention).
- Random 50/50 split of all valid images.
load_halfset_dataset_from_args(args, lazy=False, ind_split=None)
Resolve halfsets from args and load the shared dataset view.
_index_utils
Canonical local/original image, group, and particle index mapping helpers.
recovar.data_io._index_utils
Explicit index-domain helpers for cryo-EM / cryo-ET datasets.
This module centralizes the translation between:
- local image indices inside a loaded dataset view
- original image indices in the source file
- local group indices inside a loaded dataset view
- original group indices in the source file
For SPA datasets, image and group domains are identical. For grouped datasets such as cryo-ET tilt series, a group corresponds to one particle / tilt series and expands to one or more local images.
DatasetIndexLayout(original_image_indices, grouped=False, original_group_indices=None, group_local_image_indices=None)
dataclass
Index mapping for one dataset view.
Parameters
original_image_indices
For each local image index, the source-file image index.
grouped
False for SPA, where image and group domains are the same.
True for grouped datasets such as tilt-series data.
original_group_indices
For each local group index, the source-file group index.
group_local_image_indices
Only used when grouped=True. Each entry lists the local images
belonging to one local group.
Notes
Original image/group ids may repeat in SPA subsets created from duplicate selections. Reverse lookup therefore uses explicit "last-write-wins" semantics, matching the previous subset remap behavior.
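The "last-write-wins" reverse lookup can be pictured with a small dictionary-based sketch. The array below is a made-up subset with a duplicated selection, and the dict is only a stand-in for whatever structure DatasetIndexLayout uses internally.

```python
import numpy as np

# Hypothetical subset that selects original image 7 twice.
original_image_indices = np.array([7, 3, 7, 5])

# Build the original -> local lookup; later entries overwrite earlier ones.
original_to_local = {}
for local_idx, original_idx in enumerate(original_image_indices):
    original_to_local[int(original_idx)] = local_idx
```

Here original id 7 resolves to local index 2, the last local position that refers to it.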
TiltSeriesOriginalIndexMap(particle_to_images, image_to_particle, tilt_numbers=None)
Original-file particle/image mapping used by cryo-ET selection logic.
normalize_indices(values, n_total, *, name='indices', allow_none=False)
Normalize int/bool indices to an int32 array with bounds checking.
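A rough sketch of this kind of normalization is shown below. The helper `normalize_indices_sketch` is hypothetical, and its exact checks and error types are assumptions; it only illustrates accepting either a boolean mask or integer indices and returning bounds-checked int32 values.

```python
import numpy as np

def normalize_indices_sketch(values, n_total):
    """Turn a boolean mask or integer selection into bounds-checked int32 indices."""
    values = np.atleast_1d(np.asarray(values))
    if values.ndim != 1:
        raise ValueError("indices must be one-dimensional")
    if values.dtype == bool:
        # A mask must cover the whole dataset; True entries become indices.
        if values.shape[0] != n_total:
            raise ValueError("boolean mask must have length n_total")
        return np.flatnonzero(values).astype(np.int32)
    if not np.issubdtype(values.dtype, np.integer):
        raise TypeError("indices must be integer or boolean")
    if values.size and (values.min() < 0 or values.max() >= n_total):
        raise IndexError("index out of range")
    return values.astype(np.int32)

from_mask = normalize_indices_sketch([True, False, True], n_total=3)
from_ints = normalize_indices_sketch([2, 0], n_total=3)
```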
load_index_like(value)
Return an in-memory index selection from an array-like or pickle path.
normalize_image_indices(values, *, n_total=None, name='indices')
Normalize image indices, optionally without a known dataset size.
When n_total is known, this is strict bounds-checked normalization.
When it is unknown, the function still validates rank, dtype, and
non-negativity, but cannot reject out-of-range values.
deduplicate_preserve_order(values, *, name='indices')
Drop duplicate values while keeping the first occurrence order.
filter_preserve_order(values, allowed)
Return the subset of values that appears in allowed, keeping order.
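The two order-preserving helpers above can be approximated with NumPy as follows. The function names `dedupe_keep_first` and `keep_allowed` are hypothetical, and this is a sketch of the behaviour described, not the module's code.

```python
import numpy as np

def dedupe_keep_first(values):
    """Drop repeats while keeping values in first-occurrence order."""
    values = np.asarray(values)
    # np.unique reports the first position of each distinct value.
    _, first_pos = np.unique(values, return_index=True)
    return values[np.sort(first_pos)]

def keep_allowed(values, allowed):
    """Keep only the entries of values that appear in allowed, in order."""
    values = np.asarray(values)
    return values[np.isin(values, allowed)]

deduped = dedupe_keep_first([5, 3, 5, 1, 3])
filtered = keep_allowed([5, 3, 1, 4], allowed=[1, 5])
```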
metadata_readers
Extract poses and CTF parameters from RELION .star and cryoSPARC .cs files.
recovar.data_io.metadata_readers
Extract poses and CTF parameters directly from RELION .star and cryoSPARC .cs files.
This module eliminates the need for cryoDRGN preprocessing. The output formats
match exactly what load_utils.load_ctf_params and load_utils.load_poses
return, so downstream code (load_cryodrgn_dataset) can consume either source
interchangeably.
parse_poses_from_star(star_path, D)
Extract rotation matrices and translations from a RELION .star file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| star_path | str | Path to .star file. | required |
| D | int | Target image dimension in pixels (used for translation normalisation). | required |

Returns:
| Name | Type | Description |
|---|---|---|
| rotations | ndarray | |
| translations | ndarray | |
parse_ctf_from_star(star_path, D)
Extract CTF parameters from a RELION .star file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| star_path | str | Path to .star file. | required |
| D | int | Target image dimension in pixels. Pixel size is adjusted for the ratio | required |

Returns:
| Type | Description |
|---|---|
| ndarray | |
| ndarray | |
| ndarray | This matches the output format of |
parse_poses_from_cs(cs_path, D)
Extract rotation matrices and translations from a cryoSPARC .cs file.
CryoSPARC stores rotations as 3-vector exponential maps (axis-angle / Rodrigues). Translations are in pixels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cs_path | str | Path to .cs file. | required |
| D | int | Target image dimension in pixels. | required |

Returns:
| Name | Type | Description |
|---|---|---|
| rotations | ndarray | |
| translations | ndarray | |
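The axis-angle (exponential map) representation mentioned above converts to a rotation matrix through the Rodrigues formula. The sketch below is the generic formula only; the helper name is hypothetical, and whether the module transposes or otherwise adapts the result for its own conventions is not stated here.

```python
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Rodrigues formula: 3-vector (axis * angle) -> 3x3 rotation matrix."""
    rotvec = np.asarray(rotvec, dtype=np.float64)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)  # zero vector means no rotation
    kx, ky, kz = rotvec / angle
    # Cross-product (skew-symmetric) matrix of the unit rotation axis.
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# A quarter turn about the z axis sends the x axis onto the y axis.
R = axis_angle_to_matrix([0.0, 0.0, np.pi / 2])
rotated_x = R @ np.array([1.0, 0.0, 0.0])
```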
parse_ctf_from_cs(cs_path, D)
Extract CTF parameters from a cryoSPARC .cs file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cs_path | str | Path to .cs file. | required |
| D | int | Target image dimension in pixels. | required |

Returns:
| Type | Description |
|---|---|
| ndarray | |
can_extract_poses(filepath)
Return True if poses can be auto-extracted from this file type.
auto_parse_poses(filepath, D)
Auto-extract poses from STAR or CS file based on extension.
auto_parse_ctf(filepath, D)
Auto-extract CTF parameters from STAR or CS file based on extension.
starfile
RELION .star file reading and writing.
recovar.data_io.starfile
Utilities for reading and writing RELION .star files.
Supports both RELION 3.0 (single data table) and RELION 3.1 (with optics table).
Equivalent to cryodrgn/starfile
StarFile(starfile=None, *, data=None, data_optics=None)
Container for RELION .star file data with convenient access methods.
Attributes:
| Name | Type | Description |
|---|---|---|
| df | | Main data table |
| data_optics | | Optics table (None for RELION 3.0) |

Initialize from file or data tables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| starfile | Optional[str] | Path to .star file (mutually exclusive with data) | None |
| data | Optional[DataFrame] | Main data table (keyword only) | None |
| data_optics | Optional[DataFrame] | Optics table (keyword only) | None |
has_optics
property
Whether this is RELION 3.1 format with optics table.
relion31
property
Alias for has_optics (compatibility).
apix
property
Pixel size (Angstroms/pixel) for each particle.
Tries _rlnImagePixelSize first (RELION 3.1+), then falls back to _rlnDetectorPixelSize * 1e4 / _rlnMagnification (older RELION).
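The fallback arithmetic can be worked through with a small sketch. RELION stores the detector pixel size in micrometres, and 1 µm is 1e4 Å, which is where the factor comes from. The helper below takes a plain dict as a stand-in for one STAR row; it is hypothetical and not the StarFile API.

```python
def pixel_size_angstrom(row):
    """Return Angstroms/pixel from one row of STAR-style fields (hypothetical helper)."""
    if "_rlnImagePixelSize" in row:
        return float(row["_rlnImagePixelSize"])
    # Older files: detector pixel size (micrometres) scaled to Angstroms,
    # divided by the magnification.
    return float(row["_rlnDetectorPixelSize"]) * 1e4 / float(row["_rlnMagnification"])

new_style = pixel_size_angstrom({"_rlnImagePixelSize": 1.35})
old_style = pixel_size_angstrom(
    {"_rlnDetectorPixelSize": 5.0, "_rlnMagnification": 50000.0}
)
```

With a 5 µm detector pixel at 50,000x magnification, the fallback gives 5 * 1e4 / 50000 = 1.0 Å/pixel.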
resolution
property
Image size (pixels) for each particle.
Tries _rlnImageSize first (RELION 3.1 optics table).
Falls back to reading the MRC header of the first particle stack
referenced in _rlnImageName (RELION 3.0 files).
load(filepath)
classmethod
Load from .star file (convenience method).
save(filepath)
Save to .star file.
write(filepath)
Alias for save().
__len__()
Number of particles in main data table.
__eq__(other)
Check equality with another StarFile.
get_optics_values(field, dtype=None)
Get per-particle values for a field, consulting optics table if available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field | str | Field name to retrieve | required |
| dtype | Optional[dtype] | Optional dtype to cast values to | None |

Returns:
| Type | Description |
|---|---|
| Optional[ndarray] | Array of values (one per particle) or None if field not found |
set_optics_values(field, values)
Set per-particle values for a field in appropriate table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field | str | Field name to set | required |
| values | Union[float, List, ndarray] | Single value or array of values | required |
flatten_to_relion30()
Convert to RELION 3.0 format by flattening optics into main table.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with all optics fields merged into main table |
to_relion30()
Alias for flatten_to_relion30 (compatibility).
read_star(filepath)
Parse a RELION .star file into main data and optional optics tables.
Results are cached by normalised absolute path + file mtime so that repeated calls for the same unchanged file (e.g. once for halfsets, once per halfset for CTF/poses/image loading) incur only one disk read, while any write to the file automatically triggers a re-parse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to .star file | required |

Returns:
| Type | Description |
|---|---|
| Tuple[DataFrame, Optional[DataFrame]] | Tuple of (main_data, optics_data) where optics_data is None for RELION 3.0 |
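The caching scheme described for read_star (key on normalised absolute path plus modification time) can be sketched generically. Everything below is a stand-in: `parse_file`, `cached_read`, and the module-level dict are hypothetical, and the real cache may differ in key format and eviction.

```python
import os
import tempfile

_cache = {}
stats = {"parses": 0}

def parse_file(path):
    """Stand-in for an expensive parser; counts how often it runs."""
    stats["parses"] += 1
    with open(path) as handle:
        return handle.read()

def cached_read(path):
    # Key on the normalised absolute path plus the file's modification time.
    key = (os.path.abspath(os.path.normpath(path)), os.stat(path).st_mtime_ns)
    if key not in _cache:
        _cache[key] = parse_file(path)
    return _cache[key]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "particles.star")
    with open(path, "w") as handle:
        handle.write("data_\n")
    first = cached_read(path)
    second = cached_read(path)  # same path and mtime: served from the cache
    parses_before_write = stats["parses"]
    with open(path, "w") as handle:
        handle.write("data_particles\n")
    # Force a distinct mtime so the example does not depend on clock resolution.
    os.utime(path, ns=(10**18, 10**18))
    third = cached_read(path)  # mtime changed, so the file is parsed again
```

This sketch never evicts stale entries; a long-running process would want to drop the old key when the mtime changes.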
write_star(filepath, data, data_optics=None)
Write data to a RELION .star file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Output file path | required |
| data | DataFrame | Main data table | required |
| data_optics | Optional[DataFrame] | Optional optics table (for RELION 3.1 format) | None |
load_utils
CTF and pose loading utilities (legacy pickle format).
recovar.data_io.load_utils
Utilities for loading CTF parameters and pose information from pickle files. Equivalent to cryodrgn/load
load_ctf_params(D, ctf_params_pkl)
Load and adjust CTF parameters for a given image size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| D | int | Target image dimension (must be even) | required |
| ctf_params_pkl | str | Path to pickle file containing CTF parameters | required |

Returns:
| Type | Description |
|---|---|
| ndarray | CTF parameters array with shape (N, 8), excluding image size column |
load_poses(infile, Nimg, D, ind=None)
Load pose information (rotations and translations) from pickle files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| infile | Union[str, List[str]] | Path to pickle file(s). Can be: a single file containing a (rotations, translations) tuple; a single file containing rotations only; or a list of two files, [rotations_file, translations_file]. | required |
| Nimg | int | Expected number of images | required |
| D | int | Image dimension in pixels | required |
| ind | Optional[ndarray] | Optional index array to filter poses | None |

Returns:
| Type | Description |
|---|---|
| Tuple[ndarray, Optional[ndarray], int] | Tuple of (rotations, translations, D) where: rotations is a (Nimg, 3, 3) array of rotation matrices; translations is a (Nimg, 2) array of translations in pixels, or None; D is the image dimension (passthrough). |
image_loader
Image loading from MRC/MRCS stacks and HDF5 files.
recovar.data_io.image_loader
Utilities for loading cryo-EM particle images from various file formats.
Supported formats:
- MRC/MRCS: Single or multi-image MRC stacks
- STAR: RELION star files referencing MRC stacks
- CS: cryoSPARC particle files
- TXT: Text file listing MRC paths
All loaders share the ImageLoader base class which provides a uniform interface for indexing, batching, and caching.
ImageLoader(num_images, image_size, dtype=np.float32)
Base class for loading particle images.
Provides a uniform interface for indexing (int, slice, array, bool mask), lazy/eager loading, caching, and batched iteration.
n
property
Compatibility alias for num_images.
D
property
Compatibility alias for image_size.
selection_indices
property
Original source-row indices represented by this loader.
from_file(filepath, lazy=True, indices=None, datadir='', max_threads=1, strip_prefix=None)
staticmethod
Compatibility alias for load_images().
image_count(filepath, datadir=None, strip_prefix=None)
classmethod
Get image count without constructing a full loader.
For MRC/MRCS files, reads only the header (1 kB). For other formats, falls back to full lazy construction.
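Reading the count from the header alone works because an MRC file starts with a fixed 1024-byte header whose first three 32-bit integers are NX, NY, and NZ. The sketch below writes a fake header to a temporary file and reads it back; it assumes a little-endian file with no extended header and a stack where NZ is the number of images, and the helper name is hypothetical.

```python
import os
import struct
import tempfile

MRC_HEADER_BYTES = 1024

def mrc_image_count(path):
    """Read NX, NY, NZ from the MRC header and return NZ (images in a stack)."""
    with open(path, "rb") as handle:
        header = handle.read(MRC_HEADER_BYTES)
    nx, ny, nz = struct.unpack("<3i", header[:12])
    return nz

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stack.mrcs")
    # Fake header only: 64 x 64 pixels, 10 images, the rest zero-padded.
    header = struct.pack("<3i", 64, 64, 10).ljust(MRC_HEADER_BYTES, b"\0")
    with open(path, "wb") as handle:
        handle.write(header)
    count = mrc_image_count(path)
```

A production reader would also honour the machine stamp for byte order, which this sketch skips.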
__getitem__(key)
Get images using indexing syntax.
get(indices=None)
Get images at specified indices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| indices | | Indices to retrieve (int, slice, array, or None for all) | None |

Returns:
| Type | Description |
|---|---|
| ndarray | Array of shape (N, image_size, image_size) |
images(indices=None, require_contiguous=False)
Compatibility alias for get().
iter_batches(batch_size=1000)
Iterate over images in batches.
Yields:
| Type | Description |
|---|---|
| Tuple[ndarray, ndarray] | (indices, images) tuples |
chunks(chunksize=1000)
Compatibility alias for iter_batches().
load_all()
Load and cache all images in memory.
MRCLoader(filepath, indices=None, lazy=True, skip_staging=False)
Bases: ImageLoader
Load images from a single MRC/MRCS file.
Uses contiguous seek+fromfile for sequential reads and individual
seek+fromfile for scattered random access. A lazy np.memmap view
is available via _get_memmap() for bulk access patterns.
Local staging: if RECOVAR_CACHE_DIR (or $TMPDIR) is set,
the MRC file is transparently copied to that directory on first access
and all subsequent reads go to the fast local copy. See
recovar.data_io.staging for details and performance numbers.
close()
Release memory-mapped file resources.
MultiMRCLoader(file_map, indices=None, lazy=True, max_threads=1, raw_paths=None, skip_staging=False)
Bases: ImageLoader
Load images distributed across multiple MRC files.
StarLoader(filepath, indices=None, datadir='', lazy=True, max_threads=1, strip_prefix=None, skip_staging=False)
CryoSparcLoader(filepath, indices=None, datadir='', lazy=True, max_threads=1, strip_prefix=None, skip_staging=False)
DownsamplingImageLoader(base_loader, target_D)
load_images(filepath, indices=None, datadir='', lazy=True, max_threads=1, strip_prefix=None, skip_staging=False)
Load cryo-EM images from file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | Path to data file (.mrcs, .star, .txt, .cs) | required |
| indices | Optional[ndarray] | Optional subset of image indices to load | None |
| datadir | str | Base directory for resolving relative paths | '' |
| lazy | bool | If True, defer loading until access | True |
| max_threads | int | Number of threads for parallel I/O | 1 |
| strip_prefix | Optional[str] | Prefix to strip from paths in metadata | None |
| skip_staging | bool | If True, skip local staging (useful for one-shot reads like downsampling where staging the full-res data is wasteful) | False |

Returns:
| Type | Description |
|---|---|
| | ImageLoader instance for the specified file |