Missing Data

Handling missing data is a crucial aspect of data processing, especially in streaming environments. This document outlines what is considered missing data within the StreamingDataFrame class and how it is managed.

Types of Missing Data in Streaming

  • Missing Column: This occurs when a field is not present in the message at all. In streaming data, schemas can be dynamic, meaning that not all fields are required to be present in every message. This type of missing data is handled by the system's ability to adapt to changes in the schema over time.

  • Missing Value: This occurs when a field is present in the message, but its value is None. This indicates that the data for that field is missing, even though the field itself is part of the message schema.
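The two cases above can be told apart on a plain Python dictionary. The following is a minimal sketch only: `classify_field` is a hypothetical helper written for illustration and is not part of Quix Streams.

```python
def classify_field(message: dict, field: str) -> str:
    """Classify a field of a message (hypothetical helper, not a library API)."""
    if field not in message:
        # The key is absent from the message: a missing column.
        return "missing column"
    if message[field] is None:
        # The key exists but carries None: a missing value.
        return "missing value"
    return "present"


print(classify_field({"x": 1}, "y"))             # missing column
print(classify_field({"x": 1, "y": None}, "y"))  # missing value
print(classify_field({"x": 1, "y": 2}, "y"))     # present
```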

Handling Missing Data in Aggregations

  • Rows with None Values: These rows are ignored during aggregation operations. If the value being aggregated in a row is None, that row does not contribute to the aggregation result. This applies to the following aggregations: Count, Sum, Mean, Min, and Max.

  • NaN Values: Unlike None, NaN values (Python float('nan') or math.nan) are treated as numeric values and are included in aggregations. They propagate through operations like Sum, Mean, Min, and Max, meaning a single NaN input will make the aggregation result NaN. Use sdf.fill() or filter them out beforehand if this is not the desired behavior.
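The difference between None and NaN can be illustrated with plain Python arithmetic. This is a sketch of the behavior described above, not the library's aggregation code: `sum_skipping_none`, `mean_skipping_none`, and `drop_nan` are hypothetical helpers, and returning None for an empty mean is an assumption made here for the example.

```python
import math


def sum_skipping_none(values):
    # None entries are skipped; NaN entries are kept because they are floats.
    return sum(v for v in values if v is not None)


def mean_skipping_none(values):
    present = [v for v in values if v is not None]
    # Assumption for this sketch: an empty input yields None.
    return sum(present) / len(present) if present else None


def drop_nan(values):
    # Remove NaN entries up front so they cannot propagate; None is left alone.
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


with_none = [1.0, None, 3.0]
with_nan = [1.0, math.nan, 3.0]

print(sum_skipping_none(with_none))             # 4.0
print(mean_skipping_none(with_none))            # 2.0
print(math.isnan(sum_skipping_none(with_nan)))  # True
print(sum_skipping_none(drop_nan(with_nan)))    # 4.0
```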

StreamingDataFrame.fill Method

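The fill method in the StreamingDataFrame class fills missing columns and missing values in the message. Columns passed as positional arguments are filled with None, and columns passed as keyword arguments are filled with the given constant value.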

Example Usage

from quixstreams import Application

# Initialize the Application
app = Application(...)
sdf = app.dataframe(...)

Fill missing data for a single column with None:

# This would transform {"x": 1} to {"x": 1, "y": None}
sdf.fill("y")

Fill missing data for multiple columns with None:

# This would transform {"x": 1} to {"x": 1, "y": None, "z": None}
sdf.fill("y", "z")

Fill missing data with constant values using keyword arguments:

# This would transform {"x": None} to {"x": 1, "y": 2}
sdf.fill(x=1, y=2)

Use a combination of positional and keyword arguments:

# This would transform {"y": None} to {"x": None, "y": 2}
sdf.fill("x", y=2)
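To check expectations without running an application, the transformations in the comments above can be emulated on plain dictionaries. This is a minimal sketch under stated assumptions: `fill_message` is a hypothetical helper, not the Quix Streams implementation, and it assumes a field is filled when it is absent or None, as the example comments suggest.

```python
def fill_message(message: dict, *columns: str, **mapping) -> dict:
    """Emulate the documented fill behavior on one message (hypothetical helper)."""
    # Positional columns default to None; keyword columns carry a constant.
    defaults = {column: None for column in columns}
    defaults.update(mapping)

    filled = dict(message)  # copy so the input message is left unchanged
    for key, default in defaults.items():
        # .get() returns None both for an absent key and for a None value.
        if filled.get(key) is None:
            filled[key] = default
    return filled


print(fill_message({"x": 1}, "y"))            # {'x': 1, 'y': None}
print(fill_message({"x": None}, x=1, y=2))    # {'x': 1, 'y': 2}
print(fill_message({"x": 5}, x=1))            # {'x': 5}
```

Existing non-None values are left as they are, which is why the last call returns the message unchanged.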