Case Study | 3–5 mins
In this case study, we'll dive deep into a survival analysis technique to better understand the patterns around customer attrition. The dataset is available on Kaggle and was used in the KKBox Churn Prediction Challenge.
In addition, Kaskada can be used with any of your typical data visualization, statistical modeling, and model training libraries. Here we use:
Kaskada's Fenl magic extension
numpy for some statistical analysis
matplotlib for visualization
lifelines for the CoxPHFitter
All external libraries are used exclusively for visualization, exploration, model training, and feature caching. This is because Kaskada is the first feature engine with time travel: while you could rebuild the data pipelines in other technologies, you would see model degradation due to leakage. Instead, deploy Kaskada to your production environment, share the feature definitions, and compute the production feature values directly.
Digging into the data
Kaskada will need access to the transactions log and members log. To do this you'll need to define the tables with their time and entity columns and upload the data with our table APIs.
Looking at the transaction events and membership information associated with a single member helps to understand the columns available for exploration:
Visualizations help you keep sight of the forest (the distribution) without getting lost in the trees (individual data points). Initial exploration and visualization reveal a number of the typical messy-data scenarios found when connecting directly to event-based data.
In a traditional workflow, after examining the transactions and member data to identify relevant information, you'll need to weed out entries that are not relevant and begin stitching together several custom SQL queries to figure out when a customer joined, how the subscription changed over time, and when the customer churned. This process can be quite involved. Some solutions even require manually modifying events or creating dummy events to get an accurate picture of what transpired.
With Kaskada, Data Scientists can define features directly from the event-based data even when:
The transaction log is quite busy, often with multiple entries recorded on a given transaction date (transaction_date) as can be observed for this customer on several dates including 2016-09-25.
Some records have a value of zero or less for payment plan days (payment_plan_days).
Many transaction entries are changing the subscription's expiration date (membership_expire_date).
There are backdated records due to some kind of subscription management activity such as a change in auto-renewal status or the like.
We can simply add logic in Fenl specifying which events to include by writing conditional statements and predicates. The events that meet our conditions become the selected examples that remain available as prediction times. The code snippet below handles all of the above scenarios:
Kaskada handles multiple events natively
Eliminate 0 or fewer plan days with a data-dependent conditional
Save the maximum known expiration date
Handle backdated records with a second data-dependent conditional
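As a rough pandas illustration of the same filtering logic (not the Fenl snippet itself — the helper name is hypothetical, while the column names `msno`, `transaction_date`, `payment_plan_days`, and `membership_expire_date` follow the KKBox schema):

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas stand-in for the event-selection logic above."""
    df = df.sort_values(["msno", "transaction_date"]).copy()
    # Eliminate records with zero or fewer plan days
    df = df[df["payment_plan_days"] > 0]
    # Track the maximum known expiration date per customer
    df["max_expire_date"] = df.groupby("msno")["membership_expire_date"].cummax()
    # Drop backdated records whose expiration precedes the running maximum
    df = df[df["membership_expire_date"] >= df["max_expire_date"]]
    return df.drop(columns="max_expire_date")
```

Note that in Kaskada these conditions are evaluated incrementally as events arrive, rather than over a fully materialized dataframe.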
Building a target feature and label time
Once all of the relevant events have been identified, we can define a churn target feature and then shift the prediction events forward in time, computing the label values for each event when the data-dependent condition is true. In this case, the business doesn't consider a subscription churned until no transactions have been completed for 30 days. As Kaskada's compute engine walks through time, values are emitted without the risk of leakage.
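A minimal sketch of that 30-day labeling rule, written against a batch dataframe for illustration only (the function name is hypothetical, and Kaskada computes this incrementally rather than over a full table):

```python
import pandas as pd

def label_churn(df: pd.DataFrame, grace_days: int = 30) -> pd.DataFrame:
    """Mark a churn event when no follow-up transaction arrives
    within `grace_days` of the membership expiration."""
    df = df.sort_values(["msno", "transaction_date"]).copy()
    # The customer's next transaction, if any
    next_txn = df.groupby("msno")["transaction_date"].shift(-1)
    deadline = df["membership_expire_date"] + pd.Timedelta(days=grace_days)
    # Churned if no next transaction exists, or it lands after the grace window
    df["churned"] = next_txn.isna() | (next_txn > deadline)
    return df
```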
A single customer, however, can have multiple subscriptions. The above defines all of the positive examples of churn; negative examples can be built up in similar ways.
This dataset, as written, will have one entry per customer for each subscription. A subscription begins on the earliest transaction and ends on the earliest churn event that is on or after the transaction date. In the source data, a single customer may have multiple subscriptions, though only one would be active at a time.
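One way to sketch that collapse from transactions to one-row-per-subscription, assuming each transaction row already carries a boolean `churned` flag (the helper and column names are illustrative, not Kaskada's API):

```python
import pandas as pd

def build_subscriptions(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse labeled transactions into one row per subscription."""
    df = df.sort_values(["msno", "transaction_date"]).copy()
    # A churn event closes the current subscription, so the running count
    # of *prior* churn events within each customer numbers the subscriptions.
    df["sub_id"] = df.groupby("msno")["churned"].transform(
        lambda s: s.shift(fill_value=False).cumsum()
    )
    subs = (
        df.groupby(["msno", "sub_id"])
        .agg(start=("transaction_date", "min"),
             end=("transaction_date", "max"),
             churned=("churned", "max"))
        .reset_index()
    )
    subs["duration_days"] = (subs["end"] - subs["start"]).dt.days
    return subs
```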
Enrich transactions with Membership info
The subscription information alone isn't enough to complete the survival analysis. We'll need to join in the membership information to understand whether the acquisition channel impacts retention. Note that the Fenl join, performed with a lookup in the snippet below, prevents leakage.
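The Fenl lookup is roughly analogous to a left join on the customer id. A pandas sketch (the function name is hypothetical; `registered_via`, `city`, and `bd` are member attributes from the KKBox dataset, with `registered_via` serving as the acquisition channel):

```python
import pandas as pd

def enrich_with_members(subs: pd.DataFrame, members: pd.DataFrame) -> pd.DataFrame:
    """Attach static member attributes to each subscription row by msno."""
    cols = ["msno", "registered_via", "city", "bd"]
    return subs.merge(members[cols], on="msno", how="left")
```

Unlike a batch join, Kaskada's lookup resolves values as of each event's time, which is what prevents leakage.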
Now that we've prepared the full subscription examples, we can use any of the available survival analysis techniques. Since we're looking at more than just the duration of the subscription, we'll use survival regression to regress covariates against the duration. The lifelines CoxPHFitter supports the type of dataframe Kaskada returns: each row represents an observation, with a column denoting the duration of the observation. The result shows that the median survival time is around 6 months. The graph below shows the survival function at 30 days, 1 year, and 2 years.
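To keep this sketch dependency-free, here is a minimal Kaplan-Meier survival-function estimator in plain numpy rather than the lifelines CoxPHFitter used in the case study (Kaplan-Meier estimates the survival curve without covariates; CoxPHFitter additionally regresses covariates against duration):

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities) for right-censored data.
    `events` is True where the observation ended in churn, False if censored."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)          # still subscribed at time t
        deaths = np.sum((durations == t) & events)  # churn events at time t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return times, np.array(surv)
```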
From here, we can use various prediction times and subscription attributes to understand which attributes make a statistical difference at various points in time, and build a model to predict the retention of future customers.
In addition, we can use Kaskada to build features directly from the event-based data to build better models than we're able to with pre-aggregated data.
Looking back at the initial columns, consider building features such as:
amount_discount: Transactions.plan_list_price - Transactions.actual_amount_paid
percent_discount: amount_discount / Transactions.plan_list_price
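The two candidate features above translate directly into column arithmetic; as a pandas illustration (the helper name is hypothetical, the columns are from the transactions table):

```python
import pandas as pd

def add_discount_features(txns: pd.DataFrame) -> pd.DataFrame:
    """Compute the discount features sketched above from transaction columns."""
    txns = txns.copy()
    # Absolute discount: list price minus what the customer actually paid
    txns["amount_discount"] = txns["plan_list_price"] - txns["actual_amount_paid"]
    # Relative discount as a fraction of the list price
    txns["percent_discount"] = txns["amount_discount"] / txns["plan_list_price"]
    return txns
```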
See additional case studies and examples below for more ways to use Kaskada to build better models from event-based data in the lab and production.