DAU forecasting using cohort matrices


One problem in forecasting DAU for a product is that different groups of users can exhibit retention rates that vary meaningfully. An obvious example of this is users acquired from different channels, but it can also be true for different geographies, different platforms (e.g., iOS vs. Android), and over time, with retention often degrading with each subsequent cohort.

To accommodate this effect, retention rates should be applied to DAU projections for these groups separately, with the projections then aggregated into a global forecast. That is the purpose of Theseus, my open source Python library for marketing cohort analysis. In this post, I'll unpack the analytical logic behind how Theseus works and provide an example of how to implement it in Python.

The atomic units of a DAU forecast are: (1) some group's cohort sizes (e.g., the number of people from some group that onboarded to the product during some period of time) and (2) the historical retention curve for that group. Each of these atomic units is represented as a vector over some timeline. The cohort vector captures the daily number of users from the group onboarding onto the product; the retention curve vector captures the historical daily retention rates for that group following onboarding. Each of these timelines (the cohort timeline and the retention curve timeline) can be arbitrarily long, and they are independent of each other: the cohort timeline doesn't need to match the retention curve timeline. The notation used here for these atomic units is:
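In standard vector form, with the cohort timeline of length \(D_c\) and the retention timeline of length \(D_r\), these two atomic units can be written as:

\[
\mathbf{c} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{D_c} \end{bmatrix},
\qquad
\mathbf{r} = \begin{bmatrix} r_1 & r_2 & \cdots & r_{D_r} \end{bmatrix}
\]

where \(c_i\) is the number of users onboarded on day \(i\) and \(r_j\) is the share of a cohort retained on day \(j\) after onboarding.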

Note here that the retention rate vector would likely be generated by fitting a retention model to historical retention data for the group. More on that idea in this post.

With these components, it's possible to construct a DAU matrix for the retention timeline \(D_r\) that captures the cohort decay over that period. A useful starting point is an upper-triangular Toeplitz matrix, \(\mathbf{Z}\), of size \(D_r \times D_r\) with the retention rate vector running along the diagonal:
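With the retention vector \(\mathbf{r}\), this Toeplitz matrix takes the form:

\[
\mathbf{Z} =
\begin{bmatrix}
r_1 & r_2 & r_3 & \cdots & r_{D_r} \\
0   & r_1 & r_2 & \cdots & r_{D_r-1} \\
0   & 0   & r_1 & \cdots & r_{D_r-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & r_1
\end{bmatrix}
\]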

\(\mathbf{Z}\) here just populates a matrix with the retention curves padded with 0s on the left so that Day One retention (the first value of the retention curve) runs along the diagonal. In practical terms, Day One retention is 1, or 100%, since, tautologically, 100% of the cohort is present on the day of the cohort's onboarding. To get to DAU, the retention rates must be broadcast to a matrix composed of cohort sizes. This can be done by constructing a diagonal matrix, \(\mathbf{diag}(\mathbf{c})\), from \(\mathbf{c}\):
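The diagonal matrix simply places the cohort sizes along the main diagonal, with zeroes everywhere else:

\[
\mathbf{diag}(\mathbf{c}) =
\begin{bmatrix}
c_1 & 0 & \cdots & 0 \\
0 & c_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & c_{D_r}
\end{bmatrix}
\]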

It's important to note here that, in order to broadcast the cohort sizes against the retention rates, \(\mathbf{diag}(\mathbf{c})\) must be of size \(D_r \times D_r\). So if the cohort size vector is longer than the retention rate vector, it needs to be truncated; conversely, if it's shorter, it needs to be padded with zeroes. The toy example above assumes that \(D_c\) is equal to \(D_r\), but note that, as previously stated, this is not a constraint.
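A minimal sketch of that alignment step in numpy (the helper name and values are illustrative, not from Theseus):

```python
import numpy as np

def align_cohorts(c, D_r):
    """Truncate or zero-pad the cohort size vector c to length D_r."""
    c = np.asarray(c, dtype=float)
    if len(c) >= D_r:
        return c[:D_r]                    # truncate a longer cohort vector
    return np.pad(c, (0, D_r - len(c)))   # zero-pad a shorter one

print(align_cohorts([500, 600, 1000], 5))            # padded to length 5
print(align_cohorts([500, 600, 1000, 400, 350], 3))  # truncated to length 3
```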

Now, a third matrix of DAU values, \(\mathbf{DAU_{D_r}}\), can be created by multiplying \(\mathbf{Z}\) and \(\mathbf{diag}(\mathbf{c})\):

This produces a square matrix of size \(D_r \times D_r\) (again, assuming \(D_c = D_r\)) that adjusts each cohort size by its corresponding daily retention curve value, with Day 1 retention being 100%. Here, each column in the matrix represents a calendar day and each row captures the DAU values of a cohort, padded according to the date of its onboarding. Summing each column would provide the total DAU on that calendar day, across all cohorts.
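A sketch of this step in numpy, using toy values with \(D_c = D_r = 3\). Note that for rows to correspond to cohorts, the cohort sizes must scale the rows of \(\mathbf{Z}\), i.e., the product is \(\mathbf{diag}(\mathbf{c})\,\mathbf{Z}\):

```python
import numpy as np

r = np.array([1.0, 0.6, 0.4])        # toy retention curve
c = np.array([100.0, 200.0, 150.0])  # toy cohort sizes, D_c = D_r

D_r = len(r)
# upper-triangular Toeplitz matrix with the retention curve along the diagonals
Z = np.zeros((D_r, D_r))
for i in range(D_r):
    Z[i, i:] = r[:D_r - i]

# scale each row (cohort) by its size; columns are calendar days
DAU = np.diag(c) @ Z

total_dau = DAU.sum(axis=0)  # total DAU per calendar day
print(total_dau)             # [100. 260. 310.]
```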

While this is useful data, and it is a projection, it only captures DAU over the length of the retention timeline, \(D_r\), starting from when the first cohort was onboarded. What would be more useful is a forecast across the retention timeline \(D_r\) for each cohort; in other words, each cohort's DAU projected for the same number of days, regardless of when that cohort was onboarded. This is a banded cohort matrix, which provides a calendar view of per-cohort DAU.

This matrix has a shape of \(D_c \times (D_r + D_c - 1)\), where each row is that cohort's full \(D_r\)-length DAU projection, padded with a zero for each cohort that preceded it. To arrive at this, the banded retention rate matrix, \(\mathbf{Z}_\text{banded}\), stacks the retention curve \(D_c\) times but pads each row \(i\) with \(i-1\) zeroes on the left and \(D_c - i\) zeroes on the right such that each row is of length \(D_r + D_c - 1\). To do this, we can define a shift-and-pad operator \(S^{(i)}\):
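One plausible definition of this operator, consistent with the padding just described, is:

\[
S^{(i)}(\mathbf{r}) =
\begin{bmatrix}
\underbrace{0 \;\cdots\; 0}_{i-1} & r_1 & r_2 & \cdots & r_{D_r} & \underbrace{0 \;\cdots\; 0}_{D_c - i}
\end{bmatrix},
\qquad
\mathbf{Z}_\text{banded} =
\begin{bmatrix}
S^{(1)}(\mathbf{r}) \\
S^{(2)}(\mathbf{r}) \\
\vdots \\
S^{(D_c)}(\mathbf{r})
\end{bmatrix}
\]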

Again, this results in a matrix, \(\mathbf{Z}_\text{banded}\), of shape \(D_c \times (D_r + D_c - 1)\) where each row \(i\) has \(i - 1\) zeroes padded to the left and \(D_c - i\) zeroes padded to the right so that every cohort's full \(D_r\)-length retention curve is represented.

To derive the banded DAU matrix, \(\mathbf{DAU}_\text{banded}\), the banded retention matrix, \(\mathbf{Z}_\text{banded}\), is multiplied by \(\mathbf{c}^{\mathsf{T}}\), the transposed cohort size vector. This works because \(\mathbf{Z}_\text{banded}\) has \(D_c\) rows:
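Concretely, each row \(i\) of \(\mathbf{Z}_\text{banded}\) is scaled by the corresponding cohort size \(c_i\), which can equivalently be written as:

\[
\mathbf{DAU}_\text{banded} = \mathbf{diag}(\mathbf{c})\,\mathbf{Z}_\text{banded},
\qquad
\left(\mathbf{DAU}_\text{banded}\right)_{ij} = c_i \left(\mathbf{Z}_\text{banded}\right)_{ij}
\]

Summing down column \(j\) then gives the total DAU on calendar day \(j\).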

Implementing this in Python is straightforward. The crux of the implementation is below (full code can be found here).

import numpy as np

## create the retention curve and cohort size vectors
r = np.array( [ 1, 0.75, 0.5, 0.3, 0.2, 0.15, 0.12 ] )  ## retention rates
c = np.array( [ 500, 600, 1000, 400, 350 ] )  ## cohort sizes

D_r = len( r )
D_c = len( c )
calendar_days = D_c + D_r - 1

## create the banded retention matrix, Z_banded
Z_banded = np.zeros( ( D_c, calendar_days ) )  ## shape D_c x ( D_c + D_r - 1 )
for i in range( D_c ):
    start_idx = i
    end_idx = min( i + D_r, calendar_days )
    Z_banded[ i, start_idx:end_idx ] = r[ :end_idx - start_idx ]

## create the DAU_banded matrix and get the total DAU per calendar day
DAU_banded = ( c[ :, np.newaxis ] ) * Z_banded
total_DAU = DAU_banded.sum( axis=0 )

The retention and cohort size values used are arbitrary. Graphing the stacked cohorts produces the following chart:

It's easy to imagine how this method could be used to combine DAU schedules for different groups: a matrix for each group (e.g., per-geography cohorts, with their separate cohort sizes and retention rates) can be constructed, with all the matrices stacked vertically to provide a picture of total, global DAU.
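A sketch of that combination, using a hypothetical helper that builds one banded DAU matrix per group (the group names and values are illustrative):

```python
import numpy as np

def banded_dau(c, r):
    """Build a banded DAU matrix: one row per cohort, one column per calendar day."""
    D_c, D_r = len(c), len(r)
    calendar_days = D_c + D_r - 1
    Z = np.zeros((D_c, calendar_days))
    for i in range(D_c):
        Z[i, i:i + D_r] = r          # shift each cohort's retention curve right
    return c[:, np.newaxis] * Z      # scale each row by its cohort size

# two hypothetical groups with different cohort sizes and retention curves
ios = banded_dau(np.array([500.0, 600.0]), np.array([1.0, 0.5, 0.25]))
android = banded_dau(np.array([300.0, 400.0]), np.array([1.0, 0.4, 0.1]))

# stack the per-group matrices vertically, then sum columns for global DAU
global_dau = np.vstack([ios, android]).sum(axis=0)
print(global_dau)  # [ 800. 1370.  615.  190.]
```

Because every group's matrix shares the same calendar axis, the vertical stack keeps one row per cohort per group, and the column sums give total DAU across all groups.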
