6 posts tagged

sql

SQL Window Functions Cheat Sheet with examples

Estimated read time – 1 min

Window functions are calculation functions that can increase the efficiency and reduce the complexity of SQL queries, making things much easier:
View as  PDF

Special Thanks to LearnSQL, best place to master your SQL skills !

 No comments    69   3 mon   Data engineering   sql

Cohort analysis in Redash

Estimated read time – 5 min

In one of the previous articles we have reviewed building of Retention-report and have partially addressed the concept of cohorts therein.
Cohort usually implies group of users of a product or a company. Most often, groups are allocated on the basis of time of app installation / appearance of a user in a system.
It turns out, that, using cohort analysis, one can track down how the changes in a product affected the behaviour of users (for example, of old and new users).

Along with that, cohorts can be defined also proceeding from other parameters: geography of a user, traffic source, device platform and other important parameters of your product.

We will figure out, how to compare Retention of users of weekly cohorts in Redash, since Redash has special type of visualization for building such type of report.
Firstly, let’s sort out SQL-query. We still have two tables – user (id of a user and time of app installation) and client_session – timestamps (created_at) of activity of each user (user_id). Let’s consider the Retention of the first seven days for last 60 days.
The query is written in Cloudera Impala, let’s review it.

For starters, let’s build the total size of cohorts:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
	ndv(distinct user.id) as cohort_size //counting the number of users in the cohort
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now() //taking registered users for last 60 days
group by trunc(from_unixtime(user.installed_at), "WW")

The second part of the query can calculate the quantity of active users for every day during the first thirty days:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
	ndv(distinct user.id) as value  //counting the number of active users for every day
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60) //taking sessions for last 60 days
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30 //cutting off only the first 30 days of activity
group by 1,2

Bottom line, all the query entirely:

select size.cohort_week, size.cohort_size, ret.days, ret.value from
(select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
		ndv(distinct user.id) as cohort_size 
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
group by trunc(from_unixtime(user.installed_at), "WW")) size
left join (select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
		ndv(distinct user.id) as value 
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60)
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30
group by 1,2) ret on size.cohort_week=ret.cohort_week

Great, now correctly calculated data is available to us.

Data of cohorts in tabular form

Let’s create new visualization in Redash and indicate the parameters correctly:

It’s important to indicate the parameters correctly – the columns of the resulting query are compliant therewith.

Let’s make sure to indicate that we have weekly cohorts:

Voila, our visualization of cohorts is ready:

You can add filters and parameters to it and use in a dashboard

Materials on the topic

 No comments    751   6 mon   BI-tools   redash   sql   visualisation

Looker Overview

Estimated read time – 10 min

Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.

Here is the short content of the article for convenient and fast navigation:

  1. What is Looker?
  2. Which DBMS you can connect to via Looker and how?
  3. Building of Looker ML data model
  4. Explore Mode (data research on the model built)
  5. Building of reports and their saving in Look
  6. Examples of dashboards in Looker

What is Looker?

Creators of Looker position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.
Looker — is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).
An engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.
Look ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).

Just recently, in June 2019, Google announced acquisition of Looker platform for $2.6 billion.

Which DBMS you can connect to via Looker and how?

The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:

Available DBMS for connection

You can easily set a connection to the database via web-interface:

Web-interface of connection to DBMS

With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.
updated: Actually, Looker supports Clickhouse from the December 2019
The second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).
It means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).

Building of Looker ML data model

In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly described in the documentation. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:

Console of Look ML model creation

In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF / CASE expressions.

Explore mode

Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts / data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.

In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report / visualization.
For example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button Run the ready report will be generated.

Building report in Looker

Then, using the data received in the table form, you can set up the visualization of any type.
For example, Pie chart:

Applying visualization to report

Building of reports and their saving in Look

Sometimes you can have a desire to save the set of data / visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters / measurements / facts.

Example of the saved Look

Examples of dashboards in Looker

Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition / overview of key metrics, that could be displayed on one list.
For these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.

Interesting features in one line

  • In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.
    For example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.
  • On every Looker page there is a chat, where support service answers very promptly
  • Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).
  • Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it symmetric aggregates.
    SQL, indeed, looks very non-trivial:
SELECT 
 order_items.order_id AS "order_items.order_id",
 order_items.sale_price AS "order_items.sale_price",
 (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 AS DOUBLE PRECISION) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 / NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id 
 ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500
  • At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.
 No comments    245   8 mon   analysis   BI-tools   Data engineering   looker   sql

Diagram of BCG (Boston Consulting Group) Matrix

Estimated read time – 5 min

I will water down the blog with an interesting report, that was developed for Yota company on November, 2011. BCG Matrix has inspired us to develop this report.

We had: one Excel package, 75 VBA macro, ODBC connection to Oracle, SQL queries to databases of all sorts and colours. We will review report construction within this stack, but first, let’s speak about the very idea of the report.

BCG Matrix – is 2x2 matrix, whereon the clients’ segments are displayed by circumferences with their centres in the intersection of coordinates, formed by the relevant paces of two indicators selected.

To make it simple, we had to divide all the clients of the company into 4 segments: ARPU above average/below average, traffic consumption (main service) above average/below average. Thus, it turned out that 4 quadrants appear, and you need to place a bubble chart into each one of them, whereas the size of a bubble means the total amount of users within a segment. In addition to that, one more bubble was added to each quadrant (smaller one), that showcased the churn in each segment (author’s improvement).

What did we want to get at the output?
A chart of the following type:

Representation of the BCG matrix on the data of Yota company

The task statement is more or less clear, let’s move to the realization.
Let’s assume, that we’ve already collected all the required data (meaning that, we’ve learned to identify the average ARPU and average traffic consumption, in this post we won’t examine SQL-query), then the paramount task lies in understanding how to display the bubbles in the required places by means of Excel tools.

For this aim, a bubble chart comes to help:

Insert – Chart – Bubble

Going to the menu Selection of data source and evaluating, what is required in order to build a chart in the type that we need: coordinates X, coordinates Y, values of bubbles’ sizes.

Great, so it turns out that if we assume that our chart will be located in coordinates on the X axis from -1 to 1, and on the Y axis from -1 to 1, then the centre of the right upper bubble will be the spot (0.5; 0.5) on the chart. Likewise, we’ll place all the other bubbles.

We should separately consider the bubbles of Churn type (for displaying of the churn), they are located more to the right then the main bubble and might intersect with it, therefore we will place the right upper bubble to empirically obtained coordinates (0.65; 0.35).

Thus, for four main and four additional bubbles, we can organize the data as follows:

Let’s review more thoroughly how we’ll use them:

So, we set on X-axis – horizontal coordinates of the centres of our bubbles, that lie in the cells A9:A12, on Y-axis – vertical coordinates of the centres of our bubbles, that lie in the cells B9:B12, and the sizes of the bubbles are stored in the cells E9:E12.
Furthermore, we add another data set for the Churn, once more indicating all the required parameters.

We’ll get the following chart:

Then, we’re making it pretty: changing colours, deleting axis and getting a beautiful result.

By adding the required data labels, we receive what we initially needed in the task.

Share your experience in comments – did you build such charts and how you solved the task?

 No comments    85   10 mon   analysis   data analytics   excel   marketing   sql   strategy   visualisation

How to calculate Retention?

Estimated read time – 6 min

In this post we will discover, how to properly construct a report on Retention with application of Redash and SQL language.
For starters, let’s explain in a nutshell what the metric Retention rate is, why it is important,

Retention rate

Retention rate metric is widespread and is particularly popular within the mobile industry, since it allows to understand how well a product engages the users into daily use. Let’s recall (or discover), how Retention is calculated:

Retention of day X – is N% of users that will return to the product on day X. In other words, if on some specific day (day 0) 100 new users came, and 15 returned on the first day, then Retention of the 1st day will be equal to 15/100=15%.
Most commonly, Retention of days 1, 3, 7 and 30 are singled out as the most descriptive metrics of a product, however it’s useful to address Retention curve as a whole and make conclusions, proceeding from it.

Retention curve

In the end, we are interested in construction of such curve, that shows the retention of users from day 0 to day 30.

Retention rate curve from day 0 do day 30

Rolling Retention (RR)

Besides classic Retention rate, Rolling Retention (hereinafter, RR) is allocated. At calculation of RR, apart from day X, all the subsequent days are also considered. Thus, RR of the 1st day – the amount of users who returned on the 1st and subsequent days.

Let’s compare Retention and Rolling Retention of the 10th day:
Retention10 — the amount of users, who returned on the 10th day / the amount of users, who installed the app 10 days ago * 100%.
Rolling Retention10 — the amount of users, who returned on the 10th day or later / the amount of users, who installed the app 10 days ago * 100%.

Granularity (retention of time periods)

In some industries and respective tasks, it is useful to understand the Retention of a specific day (most often, in the mobile industry), in other cases it is useful to understand the retention of users on various time intervals: for example, weekly or monthly periods (oftentimes, it’s handy in e-commerce, retail).

An example of cohorts by months and monthly Retention respective thereto

How to build a Retention report on SQL language?

We have sorted out above how to calculate Retention in formulas. Now let’s apply it with SQL language.
Let’s assume, that we have two tables: user — storing data about users’ identifiers and meta-information, client_session — information on visits of the mobile app by users.
Only these two tables will be present in the query, so you can easily adapt the query to yourself.
note: within this code, I am using Impala as DBMS.

Collecting the size of cohorts

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1

Let’s sort out this pretty simple query: for every day we calculate the number of unique users for the period [60 days ago; 31 days ago].
In order not to mess with documentation: command ndv() in Impala is analogue of a command count(distinct).

Calculating the number of returned users on each cohort

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2

In this query, the key part is contained in the command datediff: now we are calculating for each cohort and for each datediff the number of unique users with the very same command ndv() (practically, the number of users, who returned within the days from 0 to 30).

Great, now we have the size of cohorts and the number of returned users.

Combining all together

SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
    ORDER BY 1,3

We have received the query, that calculates Retention for each cohort, and, eventually, the result can be displayed as follows:

Retention rate, calculated for each cohort of users

Construction of the sole Retention curve

Let’s modify our query a bit and obtain the data for construction of one Retention curve:

SELECT 
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
    GROUP BY 1        
    ORDER BY 1

Now, we have average by all the cohorts Retention rate, calculated for each day.

More on the subject

 No comments    83   10 mon   analysis   BI-tools   redash   sql   visualisation
Earlier Ctrl + ↓