LEFT JOIN: blog on analytics, visualisation & data science

Cohort analysis in Redash

In one of the previous articles we reviewed how to build a Retention report and touched on the concept of cohorts along the way.
A cohort is usually a group of users of a product or a company. Most often, groups are formed based on the time of app installation / the moment a user first appears in the system.
Using cohort analysis, you can track how changes in a product affect user behaviour (for example, the behaviour of old versus new users).

Cohorts can also be defined by other parameters: user geography, traffic source, device platform and other attributes that matter for your product.

Let's figure out how to compare the Retention of weekly cohorts in Redash, since Redash has a special visualization type for exactly this kind of report.
First, let's work out the SQL query. As before, we have two tables: user (user id and app installation time) and client_session with timestamps (created_at) of each user's (user_id) activity. We will look at Retention over the first 30 days for users registered in the last 60 days.
The query is written for Cloudera Impala; let's walk through it.
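
For reference, a minimal sketch of the assumed table structures (the exact types are hypothetical; judging by the from_unixtime calls, installed_at and created_at are Unix timestamps in seconds):

-- hypothetical DDL, only to make the queries below easier to follow
create table user (
    id bigint,            -- user id
    installed_at bigint   -- Unix timestamp of app installation
);

create table client_session (
    user_id bigint,       -- references user.id
    created_at bigint     -- Unix timestamp of user activity
);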

To start, let's calculate the total size of each cohort:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week,
    ndv(distinct user.id) as cohort_size -- counting the number of users in the cohort
from user
where from_unixtime(user.installed_at) between date_add(now(), -60) and now() -- taking users registered over the last 60 days
group by trunc(from_unixtime(user.installed_at), "WW")

The second part of the query calculates the number of active users for each day during the first 30 days:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week,
    datediff(cast(cs.created_at as timestamp), cast(user.installed_at as timestamp)) as days,
    ndv(distinct user.id) as value -- counting the number of active users on each day
from user
left join client_session cs on user.id = cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
    and from_unixtime(cs.created_at) >= date_add(now(), -60) -- taking sessions from the last 60 days
    and datediff(cast(cs.created_at as timestamp), cast(user.installed_at as timestamp)) between 0 and 30 -- keeping only the first 30 days of activity
group by 1, 2

Putting it together, here is the whole query:

select size.cohort_week, size.cohort_size, ret.days, ret.value
from (
    select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week,
        ndv(distinct user.id) as cohort_size
    from user
    where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
    group by trunc(from_unixtime(user.installed_at), "WW")
) size
left join (
    select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week,
        datediff(cast(cs.created_at as timestamp), cast(user.installed_at as timestamp)) as days,
        ndv(distinct user.id) as value
    from user
    left join client_session cs on user.id = cs.user_id
    where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
        and from_unixtime(cs.created_at) >= date_add(now(), -60)
        and datediff(cast(cs.created_at as timestamp), cast(user.installed_at as timestamp)) between 0 and 30
    group by 1, 2
) ret on size.cohort_week = ret.cohort_week

Great, now we have correctly calculated data.

Cohort data in tabular form

Let's create a new visualization in Redash and set its parameters correctly:

It's important to set the parameters correctly – the columns of the resulting query (cohort_week, cohort_size, days, value) map onto them.

Make sure to specify that the cohorts are weekly:

Voila, our visualization of cohorts is ready:

You can add filters and parameters to it and use it in a dashboard.


Clickhouse as a consumer for Amazon MSK

Disclaimer: this note is technical in nature, so it may be less interesting to readers with a business background.

This blog hasn't covered Clickhouse yet, even though it is one of the fastest databases, developed at Yandex. In short, without going into detail: Clickhouse is a very efficiently implemented columnar DBMS; it is described quite thoroughly in the documentation and in numerous videos on Youtube (one, two, three).

Over the last four years I've been using Clickhouse in my work as an analyst and as an expert in building analytical reporting. For report visualization / parameterized reports / dashboards I've mostly used Redash, as it is the most convenient interface for accessing Clickhouse data.
However, just recently Looker, which I wrote about earlier, added the ability to connect Clickhouse as a data source. It's worth noting that Tableau has supported connections to Clickhouse for quite a while.

The architecture of an analytical service built on Clickhouse is predominantly a cloud one, and that was the case in the task at hand. Let's assume you have a dedicated EC2 instance in Amazon (on which you've installed Clickhouse) and a separate Kafka cluster (an Amazon MSK solution).

The task: connect Clickhouse as a consumer so that it receives data from the brokers of your Kafka cluster. How to connect to a Kafka cluster is described in detail in the documentation on the Amazon MSK site, so I won't repeat that information. In my case the guide helped: topics were created by a producer from the machine with Clickhouse installed and read from it by a consumer.
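
For context: in Clickhouse, consuming from Kafka is typically done through a table with the Kafka engine, which is then usually drained into a MergeTree table by a materialized view. A minimal sketch of such a consumer table (the broker address, topic, consumer group and column schema below are placeholders, not the real ones from this task):

-- sketch of a Kafka consumer table in Clickhouse (all names are hypothetical)
CREATE TABLE events_queue (
    event_time DateTime,
    user_id UInt64,
    event_type String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'b-1.mycluster.kafka.us-east-1.amazonaws.com:9094', -- placeholder MSK broker
         kafka_topic_list = 'events',              -- placeholder topic
         kafka_group_name = 'clickhouse_consumer', -- placeholder consumer group
         kafka_format = 'JSONEachRow';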

However, a problem arose: when connecting Clickhouse to Kafka as a consumer, the following error occurred:

2020.02.02 18:01:56.209132 [ 46 ] {e7124cd5-2144-4b1d-bd49-8a410cdbd607} <Error> executeQuery: std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 20.1.2.4 (official build) (from 127.0.0.1:46586) (in query: SELECT * FROM events), Stack trace (when copying this message, always include the lines below):

For a long time I searched the Clickhouse documentation for a possible cause of this error, but in vain. My next idea was to check whether a local Kafka broker works from the same machine. I installed a Kafka client locally, connected Clickhouse to it, sent data to a topic and read it from Clickhouse without any trouble: the Clickhouse consumer works with a local broker, so it works in general.

Having talked with all my acquaintances who are experts in infrastructure and Clickhouse, we couldn't pin down the cause of the problem right away. We checked the firewall and the network settings: everything was open. This was confirmed by the fact that messages could be sent from the local machine to a topic of the remote broker with bin/kafka-console-producer.sh and read back with bin/kafka-console-consumer.sh.

Then I got the idea to appeal to the main guru and developer of Clickhouse – Alexey Milovidov. Alexey readily tried to answer my questions and proposed a number of hypotheses that we checked (for instance, tracing network connections, etc.), but even after a more low-level audit we couldn't localize the problem. Alexey then recommended turning to Michail Philimonov from the company Altinity. Michail turned out to be an extremely responsive expert and proposed one hypothesis after another for me to test (in parallel giving tips on how to test better).

As a result of our joint efforts we discovered that the problem lies in the librdkafka library, since another package that uses the same library, kafkacat, drops the connection to the broker with the very same error (Local: timed out).

After examining the connection via bin/kafka-console-consumer.sh and its connection parameters, Michail advised adding the following line to /etc/clickhouse-server/config.xml:

<kafka><security_protocol>ssl</security_protocol></kafka>

And, lo and behold, it worked: Clickhouse connected to the cluster and pulled the required data from the broker.

I hope this recipe and my experience will save you time and effort when investigating a similar problem :)

P.S. Clickhouse actually has a very friendly community and a Telegram chat where you can ask for advice and are quite likely to get help.


Using parameters in Redash

The most convenient and useful thing in Redash is parameters. Parameters can be used both in a report and in dashboards.
A parameter is an interface element controlled by the user who runs the report.

A parameter in a report can be of one of the following types:

Each type, explained separately:

  • Text – a free-text input field; it can be used in LIKE-type constructions; this parameter type can't be used in public dashboards
  • Number – a number entered by the user
  • Dropdown list – a list of values from which the user can select one or several values (the option to select multiple values in a dropdown appeared not long ago)
  • Query dropdown list – same as the previous one, except that the values are taken from the results of an existing query
  • Date / Date and Time / Date and Time (with seconds) – date entry fields
  • Date Range / Date and Time Range / Date and Time Range (with seconds) – date range entry fields.
    They are convenient to use in constructions like the following (see the sketch right after this list)
between '{{parameter.start}}' and '{{parameter.end}}'
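
For instance, a minimal sketch of a query with a Date Range parameter named date_range (the table and column names are made up for illustration):

-- sketch: filtering by a Redash Date Range parameter (hypothetical table/column)
select count(*) as sessions
from sessions
where session_date between '{{date_range.start}}' and '{{date_range.end}}'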

In a dashboard, when a parameter has the same name in several queries, it becomes common to the whole dashboard, which is handy.

One of the less obvious but rather useful tricks: how do you make a parameter optional (disabled)?
Let's assume that a parameter of the “dropdown list” type is named parameter and we want to apply it to the geo column of a table; the query code will then look roughly like this:

WHERE
    ('{{parameter}}' = 'Disabled' or geo = '{{parameter}}')

Naturally, ‘Disabled’ should also be added to the list of values of the dropdown.
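
Putting it together, a minimal sketch of a query with such an optional dropdown (the table and column names are made up); if multi-select is enabled for the dropdown, an IN construction should also work, provided the quotation option is set in the parameter settings:

-- sketch: optional dropdown parameter (hypothetical table/columns)
select geo, count(*) as sessions
from sessions
where ('{{parameter}}' = 'Disabled' or geo = '{{parameter}}')
group by geo

-- multi-select variant (assumes quotation is configured for the parameter):
-- where geo in ({{parameter}})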


Kazakhstan Marketing Conference 2020

Yesterday I had the chance to speak at the largest marketing conference in Kazakhstan: Kazakhstan Marketing Conference 2020.

Almaty made a positive impression on me as a city, and the conference itself turned out to be a highly professional event, full of smart, versatile and kind people.

A pleasant bonus for conference participants: the presentation from my talk is available on slideshare (careful, VPN!), so you can recall what it was about.

Apart from the talk, I ran a masterclass in the forum's main section on “How to write a clear technical specification for analytics?”.
While working with the audience, we managed to formulate the points for a technical specification template.

I'm sharing the template; it will be helpful for those who struggle with translating a task from business language into technical language.


Looker Overview

Today we are going to talk about the BI platform Looker, which I had the chance to work with during 2019.

Here is a short outline of the article for quick and convenient navigation:

  1. What is Looker?
  2. Which DBMSs can you connect to via Looker, and how?
  3. Building a Looker ML data model
  4. Explore mode (exploring data with the model you've built)
  5. Building reports and saving them as a Look
  6. Examples of dashboards in Looker

What is Looker?

Looker's creators position it as business intelligence software and a big data analytics platform that helps you explore, analyze and share business analytics in real time.
Looker is a really convenient tool and one of the few BI products that lets you work with pre-configured data cubes in near real time (in fact, relational tables that are described in the Look ML model).
An engineer working with Looker needs to describe the data model in the Look ML language (it's something between CSS and SQL), publish this data model, and then set up reports and dashboards.
Look ML itself is fairly simple; the relations between data objects are defined by a data engineer, which then lets users work with the data without knowing SQL (to be precise: the Looker engine generates the SQL itself on the user's behalf).

Quite recently, in June 2019, Google announced the acquisition of the Looker platform for $2.6 billion.

Which DBMSs can you connect to via Looker, and how?

The range of DBMSs that Looker works with is quite wide. You can see the available connections in the screenshot below (as of October 2019):

Available DBMS for connection

You can easily set up a connection to a database via the web interface:

Web interface for connecting to a DBMS

Regarding database connections, I'd like to highlight two facts: first of all, unfortunately, Clickhouse (from Yandex) is currently not supported, nor is support planned for the foreseeable future. Most likely it won't appear at all, considering that Looker was acquired by a competitor, Google.
updated: in fact, Looker has supported Clickhouse since December 2019
The second nuisance is that you can't build one data model that spans different DBMSs. Looker has no built-in storage that could combine query results (unlike, say, Redash).
This means that the analytical architecture has to be built within a single DBMS (preferably a fast one, or on aggregated data).

Building a Looker ML data model

To build a report or a dashboard in Looker, you first need to set up a data model. The syntax of the Look ML language is described in detail in the documentation. Personally, I'd add that describing a model doesn't require lengthy immersion for a specialist who knows SQL; rather, one needs to rethink the approach to preparing the data model. The Look ML language looks a lot like CSS:

Console of Look ML model creation

The data model defines the relations between tables, keys, granularity, and which fields are facts and which are dimensions. For facts, the aggregation is specified. Naturally, you can use various IF / CASE expressions when creating the model.

Explore mode

This is probably Looker's main killer feature, since it lets any business department get data without involving analysts / data engineers. And that is probably why accounts with Explore mode access are billed separately.

In fact, Explore mode is an interface that lets you use the configured Look ML data model, select the required metrics and dimensions, and build a custom report / visualization.
For example, we want to understand how many actions of each kind were performed in Looker's interface last week. To do this, in Explore mode, we select the Date field and apply a filter to it: last week (Looker is quite smart here, and typing ‘Last week’ into the filter is enough), then we choose “Category” from the dimensions and Quantity as the metric. After pressing the Run button, the ready report is generated.
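
For intuition, the SQL that Looker generates behind such an Explore has roughly this shape (the table, columns and date arithmetic here are made up for illustration and depend on the connected DBMS):

-- rough sketch of Looker-generated SQL for the Explore above (hypothetical names)
SELECT
    events.category AS "events.category",
    COUNT(*)        AS "events.quantity"
FROM events AS events
WHERE events.created_at >= DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '7 day'
  AND events.created_at <  DATE_TRUNC('week', CURRENT_DATE)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500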

Building a report in Looker

Then, using the data received in tabular form, you can set up a visualization of any type.
For example, a pie chart:

Applying a visualization to the report

Building reports and saving them as a Look

Sometimes you may want to save the data set / visualization obtained in Explore and share it with colleagues; for this purpose Looker has a separate entity – a Look. A Look is a ready-made report with selected filters / dimensions / facts.

Example of a saved Look

Examples of dashboards in Looker

As your collection of Looks grows, you will often want a ready overview of key metrics that can be viewed on a single page.
Dashboards fit this purpose perfectly. A dashboard is created either on the fly or from previously created Looks. One of the dashboard's “tricks” is parameters: they are changed for the whole dashboard and can be applied to all of its Looks at once.

Interesting features in brief

  • In Looker you can refer to other reports, and using this feature you can create a dynamic parameter that is passed via a link.
    For example, you've created a report that splits revenue by country, and in it you link to a dashboard for an individual country. Following the link, the user sees the dashboard for the specific country they clicked on.
  • Every Looker page has a chat where the support service answers very promptly
  • Looker can't merge data across different DBMSs; however, it can combine data at the level of ready Looks (in our case this feature worked rather oddly).
  • While working with various models, I came across an extremely non-trivial use of SQL for counting unique values in a denormalized data table; Looker calls this symmetric aggregates.
    The SQL does indeed look very non-trivial:
SELECT 
 order_items.order_id AS "order_items.order_id",
 order_items.sale_price AS "order_items.sale_price",
 (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 AS DOUBLE PRECISION) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 / NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id 
 ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500
  • When purchasing Looker, the JumpStart Kit is mandatory; it costs at least $6k. With this kit you get support and consulting from Looker during the implementation of the tool.