🤔 Frequently Asked Questions (FAQ)

Performance and Memory

Why does to_multiscales perform computation immediately with large datasets?

For both small and large datasets, to_multiscales returns a simple Python dataclass composed of basic Python datatypes and lazy dask arrays. Like all dask arrays, these are task graphs that describe how to generate the data; the data itself is not held in memory until it is computed.

For very large datasets, though, data conditioning and task graph engineering are performed during construction to improve performance and avoid running out of system memory. This preprocessing step ensures that when you eventually compute the arrays, the operations are optimized and memory-efficient.

To skip this preprocessing entirely, pass cache=False to to_multiscales:

multiscales = to_multiscales(image, cache=False)

Warning: Disabling caching may cause you to run out of memory when working with very large datasets!

Lazy evaluation lets ngff-zarr handle datasets far larger than available memory, while the task graph preprocessing described above keeps the eventual computation efficient.

Network Storage and Authentication

How do I read from network stores like S3, GCS, or other remote storage?

ngff-zarr can read from any network store that provides a Zarr Python compatible interface. This includes stores from fsspec, which supports many protocols including S3, Google Cloud Storage, Azure Blob Storage, and more.

You can construct network stores with authentication options and pass them directly to ngff-zarr functions.

The following examples require fsspec backends, which are installed with the remote extra:

pip install "ngff-zarr[remote]"

This provides backends for http(s), S3 (s3fs), Google Cloud Storage (gcsfs), and Azure (adlfs). Alternatively, install only the backend you need, for example pip install fsspec s3fs for S3.

import zarr
from ngff_zarr import from_ngff_zarr

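# S3 example with authentication using FsspecStore
# (replace the placeholder credentials with your own)
s3_store = zarr.storage.FsspecStore.from_url(
    "s3://my-bucket/my-dataset.zarr",
    storage_options={
        "key": "your-access-key",
        "secret": "your-secret-key",
        # s3fs forwards the region to the underlying client via client_kwargs
        "client_kwargs": {"region_name": "us-west-2"},
    }
)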

# Read from the S3 store
multiscales = from_ngff_zarr(s3_store)

For public datasets, you can omit authentication:

# Example using OME-Zarr Open Science Vis Datasets
s3_store = zarr.storage.FsspecStore.from_url(
    "s3://ome-zarr-scivis/v0.5/96x2/carp.ome.zarr",
    storage_options={"anon": True}  # Anonymous access for public data
)

multiscales = from_ngff_zarr(s3_store)

You can also pass S3 URLs directly to ngff-zarr functions, which will create the appropriate store automatically:

# Direct URL access for public datasets
multiscales = from_ngff_zarr(
    "s3://ome-zarr-scivis/v0.5/96x2/carp.ome.zarr",
    storage_options={"anon": True}
)

For more control over the underlying filesystem, you can use S3FileSystem directly:

import zarr
from ngff_zarr import from_ngff_zarr
from s3fs import S3FileSystem

# Using S3FileSystem with Zarr
fs = S3FileSystem(
    key="your-access-key",
    secret="your-secret-key",
    # The region is passed to the underlying client via client_kwargs
    client_kwargs={"region_name": "us-west-2"},
)
store = zarr.storage.FsspecStore(fs=fs, path="my-bucket/my-dataset.zarr")

multiscales = from_ngff_zarr(store)

Authentication Options:

Credentials can be supplied in several ways:

  • Environment variables: Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.

  • IAM roles: Use EC2 instance profiles or assume roles

  • Configuration files: Use ~/.aws/credentials or similar

  • Direct parameters: Pass credentials directly to the store constructor

The same patterns work for other cloud providers (GCS, Azure) by using their respective fsspec implementations (e.g., gcsfs, adlfs).
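As a rough illustration of the environment-variable option, the sketch below builds a storage_options dictionary from the standard AWS variables. The helper name s3_storage_options is hypothetical and not part of ngff-zarr or s3fs; the key, secret, token, and anon entries are the s3fs option names used in the examples above. It assumes you want anonymous access when no credentials are set.

```python
import os
from typing import Mapping, Optional


def s3_storage_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build an s3fs-style storage_options dict from environment variables.

    Hypothetical helper (an assumption for illustration): it falls back to
    anonymous access when no access key pair is present.
    """
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        # No explicit credentials found: request anonymous (public) access.
        return {"anon": True}
    options = {"key": key, "secret": secret}
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        # Temporary credentials, e.g. from an assumed role, carry a session token.
        options["token"] = token
    return options
```

The result can be passed as storage_options to from_ngff_zarr or FsspecStore.from_url. If you rely on IAM roles or ~/.aws/credentials instead, the anonymous fallback is probably not what you want; passing no credentials at all leaves the lookup to the default AWS credential chain.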

Troubleshooting

I’m getting “Invalid argument”, “Invalid page offset”, or OSError errors when converting TIFF/SVS files

These errors usually mean the TIFF or SVS file is corrupted, truncated, or has an invalid internal structure. With very large whole-slide images the failure can surface only late in the conversion, once the affected data is actually read. Common causes include:

  1. Incomplete file transfer: the file was partially downloaded or copied

  2. Disk errors: storage faults while the file was being written

  3. Software bugs: issues in the software that created the file

  4. File format violations: non-standard structures that violate the spec

How to diagnose:

# Try opening the file with tifffile directly to check for errors
python3 -c "import tifffile; tif = tifffile.TiffFile('your_file.svs'); print(f'Series: {len(tif.series)}'); tif.close()"

If this command fails or reports errors, the file is likely corrupted.
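For a quicker first check that needs only the standard library, the sketch below reads the TIFF header and tests whether the first image directory (IFD) offset falls inside the file, which is one way a truncated file shows up. The function name check_tiff_header is hypothetical. It inspects the header only, so an "ok" result does not prove the rest of the file is intact; the tifffile check above is more thorough.

```python
import os
import struct


def check_tiff_header(path: str) -> str:
    """Return "ok" or a short description of the first header problem found."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(16)
    if len(header) < 8:
        return "file too short to contain a TIFF header"
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    if header[:2] == b"II":
        order = "<"
    elif header[:2] == b"MM":
        order = ">"
    else:
        return "missing TIFF byte-order mark"
    (magic,) = struct.unpack(order + "H", header[2:4])
    if magic == 42:
        # Classic TIFF: 32-bit offset of the first IFD at bytes 4-7.
        (ifd_offset,) = struct.unpack(order + "I", header[4:8])
    elif magic == 43:
        # BigTIFF: 64-bit offset of the first IFD at bytes 8-15.
        if len(header) < 16:
            return "file too short to contain a BigTIFF header"
        (ifd_offset,) = struct.unpack(order + "Q", header[8:16])
    else:
        return f"unexpected TIFF magic number {magic}"
    if ifd_offset >= size:
        return f"first IFD offset {ifd_offset} lies beyond file size {size}"
    return "ok"
```

Calling check_tiff_header("your_file.svs") on a partially transferred file will often report an out-of-range offset, though damage deeper in the file can still go unnoticed.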

Possible solutions:

  1. Re-download or re-transfer the file from the original source

  2. Verify file integrity using checksums if available

  3. Contact the data provider if the file came from an external source

  4. Re-export the image from the original acquisition software if possible
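If the data provider publishes a checksum, the integrity check in step 2 can be sketched as follows. This assumes a SHA-256 checksum; the function name sha256_of_file is hypothetical. The file is read in chunks so that a multi-gigabyte slide does not need to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the returned string with the published value; any difference means the local copy does not match the original.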

My TIFF/SVS conversion is taking extremely long (hours/days)

Large whole-slide imaging (WSI) files can take significant time to convert. To optimize the process:

  1. Increase the memory target to use more of your available RAM:

    # For a system with 64 GB RAM, target ~50 GB
    ngff-zarr --memory-target 50G -i input.svs -o output.ozx
    
  2. Validate the file first to avoid a long run that ends in an error:

    python3 -c "import tifffile; tif = tifffile.TiffFile('input.svs'); print(f'{len(tif.series)} series found'); tif.close()"
    
  3. Use SSD storage for both input and output, as I/O is often the bottleneck
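To pick a value for --memory-target, one option is to derive it from the machine's total RAM. The sketch below is only an illustration: both function names are hypothetical, the 0.8 fraction is an assumption chosen to land near the 64 GB to ~50 GB example above, and sizes are treated as decimal gigabytes.

```python
import os


def suggest_memory_target(total_bytes: int, fraction: float = 0.8) -> str:
    """Format a fraction of total RAM as whole gigabytes, e.g. "51G"."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round down to whole decimal gigabytes, but never suggest less than 1G.
    gigabytes = int(total_bytes * fraction / 10**9)
    return f"{max(gigabytes, 1)}G"


def total_ram_bytes() -> int:
    """Total physical memory via os.sysconf (Linux and some other POSIX systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For a 64 GB machine, suggest_memory_target(64 * 10**9) gives "51G", which can be passed to --memory-target. Leave more headroom if other memory-hungry processes share the machine.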

If a conversion runs for a long time and then fails with an error, the source file is likely corrupted; see the section above for diagnosis and solutions.