Solution | 7 — 9 mins
Companies collect massive amounts of data using customer data platforms like Splunk, Heap, Segment, or explicit tracking with manually defined event logs describing user behavior. With Kaskada you can compute features directly from these event-based data platforms and train models that would have been impractical to build in the past.
After reading this example you'll have a deeper understanding of how to use Kaskada’s time travel functionality to explore your users' event-based data to predict user retention over time.
The problem statement
Product and revenue teams have been collecting data describing renewal decisions and behavior for years. Often one of a data science team's first tasks is to develop an understanding of how customer lifetime value should progress, then build ML models for churn prediction, lifetime value estimation, and segment assignment that reflect the revenue team's instinctive heuristics and make projections more stable. And, of course, if you discover anything else useful along the way, be ready to make a model for that too.
The data that you have access to is generated in many different systems: Salesforce, Hubspot, Google Analytics, and Heap. Building the data pipelines to filter and join this data would traditionally require a team of data engineers. So before the business invests in those pipelines, the data science team is going to manually munge the data and see if the investment is worth it. Using Kaskada to build features de-risks experimentation because the features built will be production-ready.
In general, event tracking data has a relatively standard hierarchy: User, Session, and Event data. Each user conducts many sessions, and every session has many events. At every level there can be associated metadata: for example, users may have associated emails, sessions may have UTM sources, and events may have a URL path.
On top of this, your business may layer on logic such as fiscal calendar, product launch events, the definition of churn, renewal and growth, and more. The revenue data and marketing data however are grouped into different entities such as Accounts, Sales Reps, Subscriptions, etc.
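To make the hierarchy concrete, here is a minimal plain-Python sketch of the User, Session, and Event levels with one piece of metadata at each level. The class and field names are illustrative assumptions, not the schema of any particular tracking platform, and this is not Fenl.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    timestamp: datetime
    url_path: str  # event-level metadata (assumed field name)


@dataclass
class Session:
    session_id: str
    utm_source: str  # session-level metadata (assumed field name)
    events: list[Event] = field(default_factory=list)


@dataclass
class User:
    user_id: str
    email: str  # user-level metadata (assumed field name)
    sessions: list[Session] = field(default_factory=list)


def count_events(user: User) -> int:
    """Total number of events across all of a user's sessions."""
    return sum(len(session.events) for session in user.sessions)


# A made-up user with two sessions and three events in total.
user = User(
    user_id="u1",
    email="ada@example.com",
    sessions=[
        Session("s1", "newsletter", [
            Event(datetime(2023, 1, 2, 9, 0), "/pricing"),
            Event(datetime(2023, 1, 2, 9, 5), "/signup"),
        ]),
        Session("s2", "search", [
            Event(datetime(2023, 1, 9, 14, 30), "/dashboard"),
        ]),
    ],
)
print(count_events(user))
```

Revenue and marketing entities such as Accounts or Subscriptions would sit outside this tree, which is why joining them to user activity needs a time-aware lookup rather than simple nesting.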
To get the most out of your data you need a system that allows you to:
Compute historical features directly from this event-based data to try out features
Specify your model context iteratively with expressive time selection
Join values between different entities, at precise times—without leakage
Share feature definitions to power live models without degradation
Conduct stakeholder interviews
The first and most important part of exploring user retention is to interview the customer-facing people at your organization who define churn and related terms. For example, the reps on your customer success team can describe how they manually interpret user data and which signals tell them an account is at risk or that it’s time to pitch an expansion. When the signals are used is just as important as what they are. Be sure to note the cadence at which data is pulled, such as monthly or quarterly check-ins. Also note which events matter, like the number of active users dropping below a certain threshold.
In this example, we’ll illustrate how to approach the problem in a truly iterative way: defining features that reflect the signals that people are looking for already, investigating what it looks like to compute these features at the key points in time, looking for patterns in the data itself that may indicate missed patterns, then splitting training examples by various segments to see if features are more predictive, rinse and repeat.
Inspect the data
Inspecting the data shows that the only constant is change. Users are upgrading and downgrading, changing their addresses and payment methods, interacting with your product or not. The transaction and streaming logs are verbose, with multiple records recorded per day for some users and zero for others.
With Kaskada, you can connect directly to your event data, group and regroup it by different entities, and look up the entities your model needs as you iterate and your understanding of the data evolves. In the past, you’d need to undertake a large (and manual) data preparation task that depends on the exact business logic you’ll eventually need. Instead, you’ll define the model context and feature definitions independently and iterate.
The sooner a business knows a User is at risk, the higher the likelihood that something can be done to retain the user and the revenue. Designing predictive models on stale data leads to lower performance. With Kaskada you can write feature definitions and compute their values at arbitrary data-dependent points in time to build event-triggered predictive models.
Start by defining predictors that are the signals that your reps look for at various points in time, the leading indicators of success or failure. In many cases, you’ll need to define these mathematically for the first time. Today, your reps look at many different data points across several entities to form an opinion on what is good or bad.
A small number of users may be good in the first month of an enterprise customer's subscription, while a large number of users may indicate a customer’s intention to quit soon. Building the feature requires defining Active Users from Heap data, finding the contract start date and the customer's segment in the revenue system, and joining this information over time, since each customer may have had multiple contracts in different segments. With Kaskada this is easily achieved with a lookup.
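The time-aware join described above can be sketched in plain Python as an "as-of" lookup: for each usage observation, find the contract that was in effect on that date. This is only an illustration of the idea, not Kaskada's lookup or Fenl syntax; the data, field names, and the `contract_as_of` and `build_rows` helpers are all hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical contract history for one account, sorted by start date:
# (contract_start, segment)
contracts = [
    (date(2021, 1, 1), "smb"),
    (date(2022, 1, 1), "enterprise"),
]

# Hypothetical active-user counts derived from product events:
# (observation_date, active_users)
active_users = [
    (date(2021, 6, 15), 4),
    (date(2022, 1, 20), 6),
    (date(2022, 9, 1), 40),
]


def contract_as_of(day):
    """Return the latest contract that started on or before `day`, or None."""
    starts = [start for start, _ in contracts]
    index = bisect_right(starts, day) - 1
    return contracts[index] if index >= 0 else None


def build_rows():
    """Attach the contract in effect at each observation to that observation."""
    rows = []
    for day, users in active_users:
        contract = contract_as_of(day)
        if contract is None:
            continue  # no contract existed yet, so there is nothing to join
        start, segment = contract
        rows.append({
            "date": day,
            "active_users": users,
            "segment": segment,
            "days_into_contract": (day - start).days,
        })
    return rows


rows = build_rows()
for row in rows:
    print(row)
```

Because each observation only sees contracts that started on or before its own date, a later contract never leaks into an earlier feature value.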
Next, you’ll define a target feature that labels each example as renewed or churned based on business logic. Instead of hand labeling your dataset, you’ll want to do this with a feature. By defining a feature you can compute this label at any time for each entity to generate as many training examples as needed.
In most contract-based enterprise SaaS businesses, a customer doesn’t count as “churned” until the books are closed for the month or a grace period has passed. So while a subscription record may have an end date on the 2nd, the revenue team may have until a fixed day, like the 15th of each month, or until the end of a 30-day period to get a contract renewed before it counts as “closed lost”. This brings us to defining model context. Kaskada allows you to define both how to compute the label and when to compute the label based on the data.
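As a rough illustration of turning that business rule into a label feature, the plain-Python sketch below labels a subscription as of a given date using a 30-day grace period. The function name, its arguments, and the choice of a 30-day rule are assumptions for the example; your revenue team's definition may differ, and this is not Fenl.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)  # assumed rule; substitute your own


def label_subscription(end_date, renewal_date, as_of):
    """Label a subscription using only what is known at `as_of`.

    Returns "renewed", "churned", or None while the outcome is still open.
    """
    deadline = end_date + GRACE_PERIOD
    # A renewal counts only if it happened before the deadline and is
    # already visible at the time the label is computed.
    if renewal_date is not None and renewal_date <= min(as_of, deadline):
        return "renewed"
    if as_of > deadline:
        return "churned"
    return None  # still inside the grace period


end = date(2023, 3, 2)
print(label_subscription(end, date(2023, 3, 20), as_of=date(2023, 4, 15)))
print(label_subscription(end, None, as_of=date(2023, 4, 15)))
print(label_subscription(end, None, as_of=date(2023, 3, 10)))
```

Passing `as_of` explicitly is the design choice that matters here: the same definition can be evaluated at any point in time, so it can generate as many labeled examples as the data supports.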
Retention rates are typically measured at monthly, quarterly, and yearly time periods depending on the business model. However, customers sign up, upgrade, and downgrade or cancel their subscriptions without respect to month, quarter, and yearly boundaries, and the events that build up to churn happen any time.
Kaskada allows you to define prediction times and label times at arbitrary data-dependent points in time, either relative to each other or independently. Now you can iterate on your time selection, figuring out when to produce your training examples to build a model that makes predictions available to:
Customer success teams with enough time left to attempt to save accounts
Sales reps to predict if a new customer might be successful
Revenue leaders to predict quarterly and annual revenue targets
Iteration enables exploration and discovery. True time travel allows for specifying feature definitions and time selection independently during the feature engineering and selection process. Before we dive deeper into specific data, it's helpful first to understand what is possible with Kaskada's time travel capabilities that hasn't been possible in your previous workflow.
The above shows a basic example with 4 User entities and their subscriptions over time. With Kaskada, instead of pre-aggregating your data or doing one-off analysis inside a notebook by dropping columns, you can specify the prediction and label times at which you'd like your feature values to be computed.
The example starts with a data-dependent prediction time, the subscription start date, and a relative label time, 30 days after the prediction time. In other words: can we predict at the start of a customer's subscription whether they will still be a customer next month?
But, after some exploration, you might find that you have the ability to compute the number of sessions and this is likely correlated to success. While your feature definitions stay the same you can change your prediction time to occur some number of days after the start date, in this case, 30.
After additional exploration, you may find that you also need a data-dependent label time, the planned subscription end date that was provided on the subscription start event. In summary, with Kaskada you can choose to compute your feature and label values at arbitrary data-dependent points in time without building complex data pipelines first.
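The three iterations above can be sketched in plain Python by treating prediction time and label time as small interchangeable functions of the subscription record. This only illustrates keeping time selection separate from feature definitions; it is not Fenl, and the record fields and helper names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical subscription start events, one per user.
subscriptions = [
    {"user": "a", "start": date(2023, 1, 10), "planned_end": date(2023, 7, 10)},
    {"user": "b", "start": date(2023, 2, 1), "planned_end": date(2024, 2, 1)},
]


def at_start(sub):
    """Data-dependent time: the subscription start date."""
    return sub["start"]


def at_planned_end(sub):
    """Data-dependent time: the planned end date from the start event."""
    return sub["planned_end"]


def days_after(selector, days):
    """Relative time: a fixed number of days after another selected time."""
    return lambda sub: selector(sub) + timedelta(days=days)


def example_times(subs, prediction_time, label_time):
    """Compute when each training example's features and label are taken."""
    return [
        {
            "user": sub["user"],
            "prediction_time": prediction_time(sub),
            "label_time": label_time(sub),
        }
        for sub in subs
    ]


thirty_days_in = days_after(at_start, 30)

# Iteration 1: predict at start, label 30 days later.
v1 = example_times(subscriptions, at_start, days_after(at_start, 30))
# Iteration 2: predict 30 days in, label 30 days after the prediction.
v2 = example_times(subscriptions, thirty_days_in, days_after(thirty_days_in, 30))
# Iteration 3: predict 30 days in, label at the planned end date.
v3 = example_times(subscriptions, thirty_days_in, at_planned_end)

for version in (v1, v2, v3):
    print(version[0])
```

Only the two selector arguments change between iterations; whatever computes the feature values at those times would stay untouched.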
Build and deploy multiple models
After iterating, you’ll find that the business needs multiple models that compute predictions for some segments independently and at different points in time, depending on the use case. To do this, you’ll want to break your exploration into multiple views that can be computed independently and made available in production.
Consider building at least 6 different models to start:
Small accounts, quarterly check-in - Often reps with smaller accounts have many more of them with fewer necessary touchpoints. This might be good enough for your small accounts.
Large accounts, first-year renewal only, at the 6 month check-in - In the first year, large accounts behave very differently because they are in the adoption phase, and continuous growth with your product is very important.
Large accounts, excluding first-year renewals, at the 6 month check-in - Excluding the first year of renewal decisions may dramatically reduce your dataset but if there are enough examples left your model often becomes more predictive.
Large accounts, when usage drops below a threshold - Defining a relative form of usage is helpful here so that you can alert reps that a customer is at risk at any point in time instead of waiting for a check-in. This can often catch a change of decision-maker or a reorg before it’s announced. Transitions are critical points in time to build relationships with your new point of contact before they make cuts.
Large accounts, after the first year, at the 9 month mark - Reps will use this to propose upsell options if usage is good or recovery options if usage has stagnated.
A composite revenue projection model computed at the start of each quarter - Your leadership team will want to look across the population of all possible renewals to predict revenue at quarter close and adjust resourcing accordingly.
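For the usage-triggered model in the list above, one possible way to define relative usage is current active users divided by the account's peak so far, with an alert raised at the moment that ratio crosses below a threshold. The plain-Python sketch below assumes weekly counts and a 50% threshold; both are illustrative choices, and the function name is hypothetical.

```python
def usage_drop_alerts(weekly_active_users, threshold=0.5):
    """Return the week indexes where usage first falls below threshold * peak.

    An alert fires only on the crossing, not on every week spent below it,
    so each alert marks a distinct, data-dependent point in time.
    """
    alerts = []
    peak = 0
    was_below = False
    for week, users in enumerate(weekly_active_users):
        peak = max(peak, users)
        is_below = peak > 0 and users / peak < threshold
        if is_below and not was_below:
            alerts.append(week)
        was_below = is_below
    return alerts


# Made-up weekly active-user counts for one large account.
history = [10, 20, 40, 38, 18, 15, 30, 12]
alert_weeks = usage_drop_alerts(history)
print(alert_weeks)
```

Each alert week is a candidate prediction time for this model, which lets reps hear about risk when it appears rather than at the next scheduled check-in.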
Not every feature will be statistically significant at every point in time for every model. With Kaskada, you can build hundreds of features and test for significance at relevant moments to reduce the set to those that provide your model the most information.
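One simple way to check whether a single feature separates churned from renewed examples at a given point in time is an exact permutation test on the difference in group means. The plain-Python sketch below is only one of many possible significance tests and is not a Kaskada feature; the data and function name are made up, and exact enumeration is only practical for small samples.

```python
from itertools import combinations


def permutation_p_value(churned, renewed):
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way to split the pooled values into groups of the
    original sizes and returns the share of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    values = list(churned) + list(renewed)
    n_churned = len(churned)
    n_renewed = len(renewed)
    total = sum(values)
    observed = abs(sum(churned) / n_churned - sum(renewed) / n_renewed)

    extreme = 0
    splits = 0
    for indexes in combinations(range(len(values)), n_churned):
        group_sum = sum(values[i] for i in indexes)
        diff = abs(group_sum / n_churned - (total - group_sum) / n_renewed)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits


# Made-up feature values at the prediction time, split by eventual label.
sessions_churned = [1, 2, 1, 2, 1, 2]
sessions_renewed = [9, 10, 9, 10, 9, 10]
tickets_churned = [3, 5, 4, 3, 5, 4]
tickets_renewed = [4, 3, 5, 5, 4, 3]

p_sessions = permutation_p_value(sessions_churned, sessions_renewed)
p_tickets = permutation_p_value(tickets_churned, tickets_renewed)
print(p_sessions, p_tickets)
```

Running the same test on feature values computed at different prediction times, or within different segments, shows where a feature carries information and where it can be dropped.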
Note: The code examples in this article use our Fenl magic extension for IPython. All examples can be converted into native Python; see our docs for your preferred workflow.
Check out our case studies for specific analyses and our quickstart guides for how-to walkthroughs:
- event logs