Execution

A Timestream may be executed and written to a destination in a variety of ways. The things to consider when deciding how to execute the Timestream are:

  1. Whether you want the history of points or the snapshot of values for each entity at a given time.
  2. Whether you want to run a query once or start a live process continually materializing.
  3. Whether you want to limit the output to points in a specific time range (for history) or entities that have changed since a specific time (for snapshots).
  4. Whether you want to stop at a given point in time.

State can be used to provide fault-tolerance and allow incremental processing of only new events.

Preview during development

While developing queries, it is often useful to view a few rows from the result. Using kaskada.Timestream.preview you can retrieve a small set of rows from the result set as a Pandas DataFrame.

History vs. Snapshot

Executing a Timestream for the history outputs every point in the Timestream. This means that each entity likely appears in the output multiple times. This is particularly useful for creating training examples from past data points, or visualizing how the Timestream has changed over time.

Executing a Timestream for a snapshot produces a value for each entity at a specific point in time. This means that each entity appears at-most once in the results. This is useful for maintaining a feature store based on the latest values.

Todo

Expose the configuration for snapshots. See https://github.com/kaskada-ai/kaskada/issues/719

Query vs. Materialize

Every Timestream may be executed as a single query or used to start a materialization. Single queries are useful when you want the results for some later batch process, such as fine-tuning a model or populating an in-memory feature store. Materialization is useful when you want to stream new results out as quickly as possible, such as maintaining an in-memory feature store or reacting to specific conditions.

Changed Since

Configuring the changed since time lets you control the points or entities included in the output.

For a historic query, only points occurring after the changed since time are included in the output. This allows incrementally outputting the entire history to some external store, by repeatedly performing a “changed since” query.

For a snapshot query, only entities that have changed after this time are included in the output. This reduces the amount of data written when the past snapshot is already present in the destination.

Todo

Expose the configuration for changed since. See https://github.com/kaskada-ai/kaskada/issues/719

Up To

Configuring the up to time lets you control the maximum points output (and in the case of snapshots, the time represented in the snapshot).

For a historic query, only points occurring before or at the up to time are included in the output. For a snapshot query, this corresponds to the time at which the snapshot will be taken.

Todo

Currently when not specified, the up to time is determined from the maximum event present in the data. We have plans to change this to a parameter to run defaulting to the current time.

Todo

Expose the configuration for up-to. See https://github.com/kaskada-ai/kaskada/issues/719

State

Kaskada checkpoints state during and after execution. This provides fault-tolerance, incremental querying and automatic handling of late-data.

When a query is executed, Kaskada determines whether it can use any of the available states to reduce the amount of processing needed. For instance, when producing a snapshot Kaskada can use any persisted state before the earliest new event and before the time to snapshot. Similarly, when producing a history, Kaskada can use any persisted state before the earliest new event and before the “changed since” time.

Destinations

The methods kaskada.Timestream.preview and kaskada.Timestream.to_pandas provide the results of a query in a Pandas DataFrame for easy visualization and consumption within the Python process.

The kaskada.Timestream.run_iter methods provides synchronous and asynchronous iterators over the results in a variety of formats including Pandas DataFrames, PyArrow RecordBatches, and rows as Python dictionaries. This allows you to run the entire retrieve-evaluate-respond loop within a single Python process.

The kaskada.Timestream.write function allows you to specify a destination from kaskada.destinations for results. This supports both once and live queries.

See the reference on destinations for more on the supported destinations.