From YouTube: 2021-11-30 - Dan Allan - Tiled: A Service for Structured Data Access

Description

NERSC Data Seminars Series: https://github.com/NERSC/data-seminars

Title:
Tiled: A Service for Structured Data Access

Speaker:
Dan Allan, NSLS-II - Brookhaven National Laboratory

Abstract:
In the Data Science and Systems Integration Program at NSLS-II, we have explored various ways to separate I/O code from user science code. After seven years of developing in-house solutions and contributing to external ones (including Intake), we propose an abstraction that we think is a broadly useful building block, named Tiled. Tiled is a data access service for data-aware portals and data science tools. It has a Python client that feels much like h5py to use and integrates naturally with dask, but nothing about the service is Python-specific; it also works from curl. Tiled’s service sits atop databases, filesystems, and/or remote services to enable search and structured, chunk-wise access to data in an extensible variety of appropriate formats, providing data in a consistent structure regardless of the format the data happens to be stored in at rest. The natively supported formats span slow but widespread interchange formats (e.g. CSV, JSON) and fast, efficient ones (e.g. C buffers, Apache Arrow Tables). Tiled enables slicing and sub-selection to read and transfer only the data of interest, and it enables parallelized download of many chunks at once. Users can access data with very light software dependencies and fast partial downloads. Tiled puts an emphasis on structures rather than formats, including N-dimensional strided arrays (i.e. numpy-like arrays), tabular data (i.e. pandas-like “dataframes”), and hierarchical structures thereof (e.g. xarrays, HDF5-compatible structures like NeXus). Tiled implements extensible access control enforcement based on web security standards, similar to JupyterHub Authenticators. Like Jupyter, Tiled can be used by a single user or deployed as a shared resource. Tiled facilitates local client-side caching in a standard web browser or in Tiled’s Python client, making efficient use of bandwidth and enabling an offline “airplane mode.” Service-side caching of “hot” datasets and resources is also possible.
Tiled is conceptually “complete” but still new enough that there is room for disruptive suggestions and feedback. We are interested in particular in exploring how Tiled could be made broadly available to NERSC users alongside traditional file-based access, and how that work might prompt us to rethink aspects of Tiled’s design.
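
Illustration (not part of the original abstract): the idea of chunk-wise sub-selection mentioned above can be sketched in a few lines of plain Python. This is not Tiled's API or implementation; the function name `chunks_for_selection` and the array and chunk shapes are hypothetical, chosen only to show how a slice request maps to the subset of chunks that would need to be read or transferred.

```python
from itertools import product


def chunks_for_selection(shape, chunk_shape, selection):
    """Return grid indices of every chunk that overlaps `selection`.

    shape:       full array shape, e.g. (1000, 1000)
    chunk_shape: shape of a single chunk, e.g. (250, 250)
    selection:   one slice per dimension (step-1 slices only in this sketch)
    """
    per_axis = []
    for size, chunk, sl in zip(shape, chunk_shape, selection):
        # Normalize None and negative bounds against the axis length.
        start, stop, step = sl.indices(size)
        if step != 1:
            raise ValueError("this sketch only handles step-1 slices")
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // chunk        # chunk holding the first selected element
        last = (stop - 1) // chunk    # chunk holding the last selected element
        per_axis.append(range(first, last + 1))
    # Every combination of per-axis chunk indices is one chunk to fetch.
    return list(product(*per_axis))


# Hypothetical 1000 x 1000 array stored as a 4 x 4 grid of 250 x 250 chunks.
needed = chunks_for_selection(
    (1000, 1000), (250, 250), (slice(100, 300), slice(0, 250))
)
print(f"{len(needed)} of 16 chunks needed: {needed}")
```

Because each chunk in the returned list is independent of the others, a client could in principle request them concurrently, which is the kind of parallel partial download the abstract describes.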

Bio:
Dan Allan is a scientific software developer and group lead in the Data Science and Systems Integration Program at NSLS-II. He joined Brookhaven National Lab as a post-doc in 2015 after studying soft condensed-matter experimental physics and getting involved in the open source scientific Python community. He works on data acquisition, management, and analysis within and around the "Bluesky" software ecosystem.

Host of Seminar:
Bjoern Enders
Data Science Engagement Group
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory