Load & Visualize HDF5 in Python
I first came across HDF5 at work, where it was used to store encrypted network security data, and recently learned that it is also common in physics, bioinformatics, and other scientific domains. HDF5 (Hierarchical Data Format version 5) is designed for large datasets that need high-performance storage, including long-term storage. Let’s look at what HDF5 is and how to work with HDF5 files.
What is HDF5?
The HDF5 library is written in C. The format has been around since the late 1990s and gained popularity in the engineering community because it can store and organize large, hierarchical numerical datasets with chunked storage. It is not widely used in everyday data science and is quite different from more familiar options such as CSV files, JSON files, or SQL databases.
The HDF5 data model has three primary elements:
- Datasets: Data arrays that store numerical data. Each dataset has a name, a data type, and a shape, similar to a NumPy array.
- Groups: Hierarchical containers that organize datasets and other groups within an HDF5 file.
- Attributes: Metadata, such as units or labels, attached to datasets, groups, or the HDF5 file itself.
These three elements provide an efficient data model to create and manage complex scientific data in a structured, hierarchical way.
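To make the three elements concrete, here is a minimal h5py sketch that writes one group, one dataset, and two attributes, then reads them back. The file, group, dataset, and attribute names are made up for illustration and do not come from any real dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, dataset, and attribute names, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "model_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("experiment")              # group: a container
    dset = grp.create_dataset(                      # dataset: a typed array
        "data", data=np.arange(12.0).reshape(3, 4)
    )
    dset.attrs["units"] = "counts"                  # attribute on a dataset
    f.attrs["description"] = "toy example"          # attribute on the file

with h5py.File(path, "r") as f:
    dset = f["experiment/data"]
    shape, dtype, units = dset.shape, dset.dtype, dset.attrs["units"]

print(shape, dtype, units)
```

The path-like name "experiment/data" is what makes the layout feel like a small file system inside a single file.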
View HDF5
We often use Microsoft Excel to view and inspect tabular CSV or XLSX files. To view HDF5 files, I have been using HDFView to display the datasets, groups, attributes, and other contents in an HDF5 file.
Here’s an example of an HDF5 file related to X-ray measurements:
h5py & HDF5
h5py is one of the Python interfaces to HDF5. It provides high-level wrappers that simplify creating, reading, and modifying the data elements (datasets, groups, attributes) in an HDF5 file.
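When a GUI viewer is not at hand, a similar overview can be printed with h5py. The sketch below builds a tiny file with made-up names and then walks it with `visititems`, which calls a function once for every group and dataset in the file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "tree_demo.h5")

with h5py.File(path, "w") as f:
    # Intermediate groups ("sample") are created automatically.
    f.create_dataset("sample/q", data=np.linspace(0.0, 1.0, 5))
    f.create_dataset("sample/data", data=np.zeros((2, 5)))

lines = []

def describe(name, obj):
    """Record one line per object found in the file."""
    if isinstance(obj, h5py.Dataset):
        lines.append(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
    else:
        lines.append(f"{name}: group")

with h5py.File(path, "r") as f:
    f.visititems(describe)

print("\n".join(lines))
```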
I find h5py very useful for working with experimental datasets. Dictionary-style indexing and NumPy-style slicing let me organize, access, and manipulate the datasets, and the attributes give me the metadata attached to each group or dataset.
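As a minimal sketch of that access pattern, assuming a small file with made-up group, dataset, and attribute names:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical names ("run_1", "instrument"), for illustration only.
path = os.path.join(tempfile.mkdtemp(), "access_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("run_1")
    grp.attrs["instrument"] = "demo"
    grp.create_dataset("data", data=np.arange(20).reshape(4, 5))

with h5py.File(path, "r") as f:
    names = list(f["run_1"].keys())           # groups behave like dictionaries
    instrument = f["run_1"].attrs["instrument"]
    first_row = f["run_1/data"][0]            # only the first row is read
    block = f["run_1"]["data"][1:3, :2]       # rows 1 and 2, first two columns

print(names, instrument, first_row, block, sep="\n")
```

Slicing a dataset returns a regular NumPy array, so everything after the read is plain NumPy.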
Pandas & HDF5
Pandas offers the functions read_hdf and to_hdf to read and write DataFrames in HDF5 format. Under the hood, pandas relies on PyTables to read DataFrames from HDF5 files and to write DataFrames to them.
When loading the example HDF5 file via read_hdf, I ran into TypeError: cannot create a storer if the object is not existing nor a value are passed.
It seems that pandas expects a specific HDF5 layout, such as the one to_hdf writes. In the end, I fell back to reading the file with h5py and converting the datasets to pandas DataFrames.
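A minimal sketch of that fallback is below. It uses a small synthetic file whose dataset paths and shapes are assumptions for illustration (one 1-D axis `q` and one 2-D array `data` with a row per measurement), not the layout of any real file.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Hypothetical file and dataset paths, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "to_pandas_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("experiment/q", data=np.linspace(0.01, 0.05, 5))
    f.create_dataset("experiment/data", data=np.random.default_rng(0).random((3, 5)))

# Read both datasets fully into memory as NumPy arrays.
with h5py.File(path, "r") as f:
    q = f["experiment/q"][...]
    data = f["experiment/data"][...]

# One row per measurement, one column per q value.
df = pd.DataFrame(data, columns=q)
df.index.name = "measurement"

print(df.head())
```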
The script above loads a downloaded HDF5 file from local disk. If you have many HDF5 files to work with and would like to share them easily, the files can be stored in HDFS (Hadoop Distributed File System), AWS S3, or Azure Blob Storage. We can then load them from those on-premises or cloud storage services by combining each service's SDK (e.g., boto3 for S3) with h5py.
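One way this can work, sketched under assumptions: h5py can open a Python file-like object, so bytes fetched by a storage SDK can be wrapped in `io.BytesIO` and opened directly. In the sketch below the "download" is simulated by building a tiny HDF5 file in memory; with boto3, `raw_bytes` might instead come from something like `s3.get_object(Bucket=..., Key=...)["Body"].read()`, with bucket and key being your own.

```python
import io

import h5py
import numpy as np

# Stand-in for bytes downloaded from remote storage: a small HDF5 file
# built in memory with a made-up dataset name.
buffer = io.BytesIO()
with h5py.File(buffer, "w") as f:
    f.create_dataset("data", data=np.arange(6).reshape(2, 3))
raw_bytes = buffer.getvalue()

# Wrap the bytes in a file-like object and open it with h5py.
with h5py.File(io.BytesIO(raw_bytes), "r") as f:
    values = f["data"][...]

print(values)
```

This approach holds the whole file in memory, so it suits files of moderate size rather than massive ones.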
When it comes to massive HDF5 files, instead of loading an entire dataset we can read or write it in chunks with predefined chunk sizes. For HDF5 datasets that were not stored with chunking, we can pick a block size ourselves and use slicing to access the data in smaller blocks.
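Here is a minimal sketch of block-wise reading. The dataset name and the chunk shape `(100, 50)` are arbitrary choices for illustration; the loop uses the stored chunk size when there is one and falls back to a block size of 100 rows otherwise.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names; the chunk shape is an arbitrary example.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset(
        "big",
        data=np.arange(1000 * 50, dtype="f8").reshape(1000, 50),
        chunks=(100, 50),
    )

total = 0.0
with h5py.File(path, "r") as f:
    dset = f["big"]
    chunk_shape = dset.chunks                      # None if stored contiguously
    step = chunk_shape[0] if chunk_shape else 100  # rows per block
    for start in range(0, dset.shape[0], step):
        block = dset[start:start + step]           # reads only these rows
        total += block.sum()

print(chunk_shape, total)
```

Aligning the block size with the stored chunk size avoids reading the same chunk from disk more than once.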
Plot HDF5
You can choose any plotting library you like, and here’s an example of plotting one trace from the experiment data.
Instead of manually entering the row index, we can use a slider to step through the frames of the data in sequence:
import h5py
import plotly.graph_objects as go

# Read the q axis and all reflectivity curves into memory as NumPy arrays.
with h5py.File('xrr_dataset.h5', 'r') as f:
    x = f['DIP_1/experiment/q'][...]
    y = f['DIP_1/experiment/data'][...]

# Start with the first row as the initial trace.
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y[0], mode='lines+markers', name='Row 1'))

# Build one animation frame per row of the data.
frames = []
for i, y_val in enumerate(y):
    frames.append(go.Frame(
        data=[go.Scatter(x=x, y=y_val, mode='lines+markers', name=f'Row {i+1}')],
        name=f'Frame {i}',
        layout=go.Layout(title_text=f'DIP_1 - Row {i+1}')
    ))
fig.update(frames=frames)

# Each slider step jumps straight to the frame with the matching name.
sliders = [dict(
    steps=[dict(
        method='animate',
        args=[[f'Frame {i}'], dict(mode='immediate', frame=dict(duration=500, redraw=True), transition=dict(duration=0))],
        label=f'Row {i+1}'
    ) for i in range(len(y))],
    transition=dict(duration=0),
    xanchor='left',
    yanchor='top'
)]

fig.update_layout(
    sliders=sliders,
    title="DIP_1 - Row 1",
    xaxis_title="Momentum Transfer q",
    yaxis_title="X-ray reflectivity",
)
fig.show()
Here’s a more advanced way to plot raw X-ray reflectivity measurements with fitting curves done by the authors.
While learning about HDF5, I found the book Python and HDF5 by Andrew Collette incredibly helpful for understanding how HDF5 can be used to process and analyze large datasets in Python. Overall, HDF5 is a widely used file format for storing and managing huge datasets in scientific computing and data analysis projects.