3 posts tagged

sql

Diagram of BCG (Boston Consulting Group) Matrix

I will water down the blog with an interesting report, that was developed for Yota company on November, 2011. BCG Matrix has inspired us to develop this report.

We had: one Excel package, 75 VBA macro, ODBC connection to Oracle, SQL queries to databases of all sorts and colours. We will review report construction within this stack, but first, let’s speak about the very idea of the report.

BCG Matrix – is 2x2 matrix, whereon the clients’ segments are displayed by circumferences with their centres in the intersection of coordinates, formed by the relevant paces of two indicators selected.

To make it simple, we had to divide all the clients of the company into 4 segments: ARPU above average/below average, traffic consumption (main service) above average/below average. Thus, it turned out that 4 quadrants appear, and you need to place a bubble chart into each one of them, whereas the size of a bubble means the total amount of users within a segment. In addition to that, one more bubble was added to each quadrant (smaller one), that showcased the churn in each segment (author’s improvement).

What did we want to get at the output?
A chart of the following type:

Representation of the BCG matrix on the data of Yota company

The task statement is more or less clear, let’s move to the realization.
Let’s assume, that we’ve already collected all the required data (meaning that, we’ve learned to identify the average ARPU and average traffic consumption, in this post we won’t examine SQL-query), then the paramount task lies in understanding how to display the bubbles in the required places by means of Excel tools.

For this aim, a bubble chart comes to help:

Insert – Chart – Bubble

Going to the menu Selection of data source and evaluating, what is required in order to build a chart in the type that we need: coordinates X, coordinates Y, values of bubbles’ sizes.

Great, so it turns out that if we assume that our chart will be located in coordinates on the X axis from -1 to 1, and on the Y axis from -1 to 1, then the centre of the right upper bubble will be the spot (0.5; 0.5) on the chart. Likewise, we’ll place all the other bubbles.

We should separately consider the bubbles of Churn type (for displaying of the churn), they are located more to the right then the main bubble and might intersect with it, therefore we will place the right upper bubble to empirically obtained coordinates (0.65; 0.35).

Thus, for four main and four additional bubbles, we can organize the data as follows:

Let’s review more thoroughly how we’ll use them:

So, we set on X-axis – horizontal coordinates of the centres of our bubbles, that lie in the cells A9:A12, on Y-axis – vertical coordinates of the centres of our bubbles, that lie in the cells B9:B12, and the sizes of the bubbles are stored in the cells E9:E12.
Furthermore, we add another data set for the Churn, once more indicating all the required parameters.

We’ll get the following chart:

Then, we’re making it pretty: changing colours, deleting axis and getting a beautiful result.

By adding the required data labels, we receive what we initially needed in the task.

Share your experience in comments – did you build such charts and how you solved the task?

 No comments    30   3 mon   analysis   excel   marketing   sql   strategy   visualisation

How to calculate Retention?

In this post we will discover, how to properly construct a report on Retention with application of Redash and SQL language.
For starters, let’s explain in a nutshell what the metric Retention rate is, why it is important,

Retention rate

Retention rate metric is widespread and is particularly popular within the mobile industry, since it allows to understand how well a product engages the users into daily use. Let’s recall (or discover), how Retention is calculated:

Retention of day X – is N% of users that will return to the product on day X. In other words, if on some specific day (day 0) 100 new users came, and 15 returned on the first day, then Retention of the 1st day will be equal to 15/100=15%.
Most commonly, Retention of days 1, 3, 7 and 30 are singled out as the most descriptive metrics of a product, however it’s useful to address Retention curve as a whole and make conclusions, proceeding from it.

Retention curve

In the end, we are interested in construction of such curve, that shows the retention of users from day 0 to day 30.

Retention rate curve from day 0 do day 30

Rolling Retention (RR)

Besides classic Retention rate, Rolling Retention (hereinafter, RR) is allocated. At calculation of RR, apart from day X, all the subsequent days are also considered. Thus, RR of the 1st day – the amount of users who returned on the 1st and subsequent days.

Let’s compare Retention and Rolling Retention of the 10th day:
Retention10 — the amount of users, who returned on the 10th day / the amount of users, who installed the app 10 days ago * 100%.
Rolling Retention10 — the amount of users, who returned on the 10th day or later / the amount of users, who installed the app 10 days ago * 100%.

Granularity (retention of time periods)

In some industries and respective tasks, it is useful to understand the Retention of a specific day (most often, in the mobile industry), in other cases it is useful to understand the retention of users on various time intervals: for example, weekly or monthly periods (oftentimes, it’s handy in e-commerce, retail).

An example of cohorts by months and monthly Retention respective thereto

How to build a Retention report on SQL language?

We have sorted out above how to calculate Retention in formulas. Now let’s apply it with SQL language.
Let’s assume, that we have two tables: user — storing data about users’ identifiers and meta-information, client_session — information on visits of the mobile app by users.
Only these two tables will be present in the query, so you can easily adapt the query to yourself.
note: within this code, I am using Impala as DBMS.

Collecting the size of cohorts

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1

Let’s sort out this pretty simple query: for every day we calculate the number of unique users for the period [60 days ago; 31 days ago].
In order not to mess with documentation: command ndv() in Impala is analogue of a command count(distinct).

Calculating the number of returned users on each cohort

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2

In this query, the key part is contained in the command datediff: now we are calculating for each cohort and for each datediff the number of unique users with the very same command ndv() (practically, the number of users, who returned within the days from 0 to 30).

Great, now we have the size of cohorts and the number of returned users.

Combining all together

SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
    ORDER BY 1,3

We have received the query, that calculates Retention for each cohort, and, eventually, the result can be displayed as follows:

Retention rate, calculated for each cohort of users

Construction of the sole Retention curve

Let’s modify our query a bit and obtain the data for construction of one Retention curve:

SELECT 
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
    GROUP BY 1        
    ORDER BY 1

Now, we have average by all the cohorts Retention rate, calculated for each day.

More on the subject

 No comments    23   3 mon   analysis   redash   sql   visualisation

Redash – full-fledged on-demand analytics

Today we will examine and try to get to the bottom of one tool that is quite famous nowadays – Redash. The tool is extremely convenient due to the fact that it can work with Clickhouse. All the other developers of BI-tools don’t support Clickhouse. Just recently, an ODBC driver has appeared for Tableau.

I will make a short review of Redash’s useful features, and in the following number of posts I will figure out which useful requests and reports can be constructed using Redash.

Redash Homepage

So, what is Redash? It is the tool for on-demand analytics, moreover it can be used on completely different databases. For instance, Redash can be connected to a database under MySQL or HP Vertica.
The main thing, proposed by Redash is a handy console for writing SQL-queries to a database.

Console for writing SQL-queries

Consequently, the first major observation: in order to use Redash you should know SQL or have an employee in team who knows SQL.
Apart from the basic console, Redash provides visualization tools (inter alia, construction of Funnel-charts, cohort analysis and pivot tables (however, the functionality of the latter is very limited)), as well as tools for construction of dashboards and alert systems (via mail or in slack).
Let’s examine the basic console a bit more thoroughly. A user has an access to some useful features: application of filters, multifilters and query’s parameters:

Simultaneous application of parameters, filters and multifilters in queries

Further, opportunities of visualizations should be mentioned separately. They are, indeed, convenient and extensive, especially considering that one can set automatic updating of query results in accordance with the required timetable.
In the following note we will discuss visualization in Redash more closely, and later – construction of dashboards and alert systems.

 No comments    408   2018   redash   sql