Missing Data
Handling missing data is a crucial aspect of data processing, especially in streaming environments. This document outlines what is considered missing data within the StreamingDataFrame class and how it is managed.
Types of Missing Data in Streaming
- Missing Column: This occurs when a field is not present in the message at all. In streaming data, schemas can be dynamic, meaning that not all fields are required to be present in every message. This type of missing data is handled by the system's ability to adapt to changes in the schema over time.
- Missing Value: This occurs when a field is present in the message, but its value is None. This indicates that the data for that field is missing, even though the field itself is part of the message schema.
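The two cases above can be told apart with an ordinary dictionary lookup. The following is a minimal sketch that assumes each message value is a plain Python dict; the helper classify_field and the sample field names are hypothetical and are not part of the library.

```python
def classify_field(message: dict, field: str) -> str:
    """Report whether a field is absent, present but None, or populated.

    Hypothetical helper for illustration only; not a library function.
    """
    if field not in message:
        # The key does not exist in this message at all.
        return "missing column"
    if message[field] is None:
        # The key exists, but it carries no data.
        return "missing value"
    return "present"


# Sample message: "pressure" is absent, "humidity" is present but None.
message = {"temperature": 21.5, "humidity": None}

statuses = {
    field: classify_field(message, field)
    for field in ("temperature", "humidity", "pressure")
}
print(statuses)
```

Note that message.get(field) alone cannot separate the two cases, because it returns None for both an absent key and a key whose value is None.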
Handling Missing Data in Aggregations
- Rows with None Values: These rows are ignored during aggregation operations. This means that if a row contains a None value, it will not contribute to the aggregation result. This applies to the following aggregations: Count, Sum, Mean, Min, and Max.
- NaN Values: Unlike None, NaN values (Python float('nan') or math.nan) are treated as numeric values and are included in aggregations. They propagate through operations like Sum, Mean, Min, and Max, meaning a single NaN input will make the aggregation result NaN. Use sdf.fill() or filter them out beforehand if this is not the desired behavior.
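The difference between None and NaN in the rules above can be reproduced in plain Python. This is only a sketch of the described semantics for Sum and Mean, not the library's aggregation code; the helper names are hypothetical. Min and Max are left out because Python's built-in min and max do not propagate NaN consistently, so they would not illustrate the behavior described here.

```python
import math


def sum_skipping_none(values):
    """Sum the values, ignoring None entries (hypothetical helper)."""
    return sum(v for v in values if v is not None)


def mean_skipping_none(values):
    """Average the values, ignoring None entries (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def drop_nan(values):
    """Remove float NaN entries so they cannot poison a later aggregation."""
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


with_none = [1.0, None, 3.0]
with_nan = [1.0, float("nan"), 3.0]

# None is skipped, so only 1.0 and 3.0 contribute.
total_none = sum_skipping_none(with_none)
mean_none = mean_skipping_none(with_none)

# NaN is a float, so it takes part in the arithmetic and the result is NaN.
total_nan = sum_skipping_none(with_nan)

# Filtering NaN out beforehand restores a finite result.
total_cleaned = sum_skipping_none(drop_nan(with_nan))
```

Because NaN never compares equal to itself, the check uses math.isnan rather than an equality test.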
StreamingDataFrame.fill Method
The fill method in the StreamingDataFrame class is used to fill missing columns and missing values in the message with a constant value.
Example Usage
from quixstreams import Application
# Initialize the Application
app = Application(...)
sdf = app.dataframe(...)
Fill missing data for a single column with None:
Fill missing data for multiple columns with None:
Fill missing data with a constant value using a dictionary:
Use a combination of positional and keyword arguments:
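The four cases listed above can be sketched on plain dictionaries. This sketch rests on an assumption: that fill accepts column names as positional arguments (filled with None) and column=value keyword arguments (filled with that constant). Under that assumption the corresponding calls would look like sdf.fill("y"), sdf.fill("y", "z"), sdf.fill(x=1, y=2), and sdf.fill("x", y=2). The helper fill_record below is a hypothetical stand-in that mimics the described effect on a single message; it is not the library's implementation.

```python
from typing import Any


def fill_record(record: dict, /, *columns: str, **mapping: Any) -> dict:
    """Return a copy of record with missing columns and None values filled.

    Hypothetical stand-in for the effect described for fill; the argument
    layout (positional names, keyword constants) is an assumption.
    """
    filled = dict(record)
    # Positional names: make sure the key exists, defaulting to None.
    for column in columns:
        filled.setdefault(column, None)
    # Keyword constants: replace an absent key or a None value.
    for column, constant in mapping.items():
        if filled.get(column) is None:
            filled[column] = constant
    return filled


# Single column filled with None.
single = fill_record({"x": 1}, "y")

# Multiple columns filled with None.
multiple = fill_record({"x": 1}, "y", "z")

# Constant values given as a column-to-value mapping.
constants = fill_record({"x": None}, x=1, y=2)

# Positional and keyword arguments combined.
combined = fill_record({"y": None}, "x", y=2)

print(single, multiple, constants, combined)
```

Values that are already populated are left untouched in this sketch, so only absent keys and None values change.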