Quickstart

How to export query results for model training and serving.

Example Modeling Problem

We'll begin by laying out a simple modeling problem. The goal is to build a model that predicts whether a given purchase will result in a fraud report within the next 30 days, given two event tables: a Purchase table and a FraudReport table.

To train a model, a data scientist produces separate training and validation datasets. These datasets will be created by partitioning the full set of training examples into two time ranges: one spanning the years 2000-2015 and another spanning the years 2015-2020.

Two query variables, $start_time and $end_time, are used to dynamically constrain the examples produced by each query. We'll execute the query twice: once to produce the training dataset and once to produce the validation dataset. The training query will provide start and end times spanning 2000-2015, while the validation query will provide times spanning 2015-2020.

Usage Note - The fenlmagic extension is designed to make it easy to interactively explore your dataset. When you're ready to train a model, we recommend using the Python client. The Python client exposes the full functionality of the Kaskada API and is better suited to tasks such as managing Views and Tables and making multiple queries with different query variables.
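
If you're working in a notebook, the fenlmagic extension must be loaded before any %%fenl cells will run. A minimal sketch, assuming the extension is installed alongside the Kaskada Python client and that a client object named client has already been configured earlier in this guide:

# Register the %%fenl cell magic in this notebook session
%load_ext fenlmagic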

The queries below use the fenlmagic flag --var, which assigns each query string to a local Python variable: the first query is assigned to a variable named predictors and the second to a variable named examples. We'll use the examples variable as the query string when we make the training and validation queries, and we'll use the predictors variable as the query string when we get to production.

%%fenl --var predictors

let AveragePurchaseByCustomer = Purchase 
  | with_entity_key($input.customer_id)
  | mean()

in record{
  purchase_total: Purchase.total | last(),
  mean_purchase: AveragePurchaseByCustomer | lookup(Purchase.customer_id)
}

%%fenl --var examples

let examples = predictors
  | when(!is_null(Purchase))
  | shift_by(days = 30)
  | with { target: count(FraudReport) > 0 }

in examples
  | when($start_time <= time_of($input) and time_of($input) < $end_time)

Training a Model Locally

Depending on the size of your training dataset and how you intend to train a model, you may want to copy the training features locally or transfer them to a remote data store. We'll begin with the simple case of training locally.

Begin by making the training dataset query and reading the resulting file into a dataframe.

from kaskada import compute
import pandas

training_resp = compute.query(
  expression = examples,
  variables = {
    "$start_time": "datetime(2000, 1, 1)",
    "$end_time": "datetime(2015, 1, 1)",
  },
  client = client,
)
training = pandas.read_parquet(training_resp.parquet.path, engine = "pyarrow")

Now we make the same query, but provide different query variables to control which examples are included.

validation_resp = compute.query(
  expression = examples,
  variables = {
    "$start_time": "datetime(2015, 1, 1)",
    "$end_end": "datetime(2020, 1, 1)",
  },
  client = client,
)
validation = pandas.read_parquet(validation_resp.parquet.path, engine = "pyarrow")
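
Before training, it's worth sanity-checking the loaded dataframes, for example confirming the label distribution. A small sketch (it assumes the examples query above produced a target column alongside the predictor columns):

# Peek at the training examples and check how imbalanced the fraud label is
print(training.head())
print(training['target'].value_counts())
print(f"training rows: {len(training)}, validation rows: {len(validation)}")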

We're finally ready to train a model. The following shows a simple logistic regression model.

from sklearn.linear_model import LogisticRegression

X_train = training.drop('target', axis = 1)
y_train = training['target']
X_validation = validation.drop('target', axis = 1)
y_validation = validation['target']

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

model.score(X_validation, y_validation)
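
Beyond a single accuracy score, you can also inspect per-example fraud probabilities from the trained model. A brief sketch using scikit-learn's predict_proba:

# Predicted probability of the positive (fraud) class for each validation example
fraud_probabilities = model.predict_proba(X_validation)[:, 1]
print(fraud_probabilities[:10])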

Getting to production

Kaskada's feature engine connects directly to event-based data, allowing you to compute historical features, iteratively specify your model context, join values between entities without leakage, and share feature definitions in production.

Here we compute all historical feature values for all known entities and cache them in a feature store, in this case Redis. We simply reuse the feature definitions that were used for training and compute them with a time of now.

Fenl expressions can be shared and re-used by creating a view. A view is a named expression. In subsequent Fenl expressions the view's name is synonymous with the view's expression. Views are persisted in the Kaskada platform and are accessible to any user within an organization.

To create a view from the predictors variable, we can use the Python client without re-typing the expression. The examples query computed the feature values at historical times; the view below makes the same feature definitions available for use at any point in the future.

from kaskada import view

view.create_view(
  name = "ChurnFeatures", 
  expression = predictors,
  client = client,
)

We've now created a view named ChurnFeatures. We can verify it was created successfully by searching for views matching the name:

view.list_views(search = "Churn")

Now that we've created a view, let's look at how the view can be used. We can use a view's name anywhere we could use the view's expression - the only restriction placed on views is that they must be valid Fenl expressions. Here's an example of using a view to populate the feature values into Redis directly from a Jupyter Notebook.

%%fenl --to-redis redisai
ChurnFeatures

From here, a data engineer or software engineer simply needs to query Redis to get the results.

redis-cli -h redisai AI.TENSORGET
  YyO+tlZtAXYXoZhNr3Vg3+dfVQvrBVGO8j1mfqe4ZHc= values
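
The view's name can also stand in for its expression when querying through the Python client, which is useful if you want current feature values in a dataframe rather than in Redis. A minimal sketch reusing the compute.query call from the training step (the exact response fields may differ in your client version):

production_resp = compute.query(
  expression = "ChurnFeatures",
  client = client,
)
current_features = pandas.read_parquet(production_resp.parquet.path, engine = "pyarrow")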

Congratulations, you've completed this quickstart and now know how to take features built with Kaskada, train a model, and get to production. Your next step is to read a case study showing how to build better models: start with the Survivorship and Churn case study.