
# Diagram of BCG (Boston Consulting Group) Matrix

Estimated read time – 5 min

Let me dilute the blog with an interesting report that was developed for Yota in November 2011. The BCG Matrix inspired us to build this report.

What we had: one Excel workbook, 75 VBA macros, an ODBC connection to Oracle, and SQL queries to databases of all sorts and colours. We will walk through building the report with this stack, but first let's talk about the idea behind the report itself.

The BCG Matrix is a 2x2 matrix in which client segments are displayed as circles whose centres lie at the intersection of the coordinates formed by the relative values of the two selected indicators.

To put it simply, we had to divide all of the company's clients into 4 segments: ARPU above/below average, and traffic consumption (the main service) above/below average. This produces 4 quadrants, and a bubble chart is placed in each one, where the size of a bubble represents the total number of users in the segment. In addition, a second, smaller bubble was added to each quadrant to show the churn in that segment (the author's improvement).

What did we want to get at the output?
A chart of the following type:

Representation of the BCG matrix on the data of Yota company

The task statement is more or less clear, so let's move on to the implementation.
Let's assume that we have already collected all the required data (i.e., we've learned to determine the average ARPU and the average traffic consumption; we won't examine the SQL queries in this post). Then the main task is to understand how to display the bubbles in the required places using Excel's tools.

For this purpose, a bubble chart comes to the rescue:

Insert – Chart – Bubble

Go to the Select Data Source menu and assess what is required to build the kind of chart we need: X coordinates, Y coordinates, and bubble size values.

Great. If we assume that our chart spans from -1 to 1 on the X axis and from -1 to 1 on the Y axis, then the centre of the upper-right bubble will be the point (0.5; 0.5). We place all the other bubbles likewise.

The Churn bubbles (which display the churn) deserve separate consideration: they sit to the right of the main bubble and might intersect with it, so we place the upper-right churn bubble at the empirically chosen coordinates (0.65; 0.35).

Thus, for four main and four additional bubbles, we can organize the data as follows:

Let’s review more thoroughly how we’ll use them:

So, the X values are the horizontal coordinates of our bubble centres, stored in cells A9:A12; the Y values are the vertical coordinates, stored in cells B9:B12; and the bubble sizes are stored in cells E9:E12.
Then we add another data series for Churn, once again specifying all the required parameters.

We’ll get the following chart:

Then we make it pretty: change the colours, delete the axes and get a beautiful result.
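If you'd rather cross-check the layout outside Excel, the same picture can be sketched in Python with matplotlib. This is a purely illustrative sketch with made-up segment sizes and churn counts, not the original Yota data:

```python
import matplotlib.pyplot as plt

# Quadrant centres as described above; sizes and churn are made-up numbers
centers = {'High ARPU / high traffic': (0.5, 0.5),
           'Low ARPU / high traffic': (-0.5, 0.5),
           'High ARPU / low traffic': (0.5, -0.5),
           'Low ARPU / low traffic': (-0.5, -0.5)}
sizes = {'High ARPU / high traffic': 40000,
         'Low ARPU / high traffic': 15000,
         'High ARPU / low traffic': 60000,
         'Low ARPU / low traffic': 25000}
churn = {'High ARPU / high traffic': 2000,
         'Low ARPU / high traffic': 3000,
         'High ARPU / low traffic': 5000,
         'Low ARPU / low traffic': 8000}

fig, ax = plt.subplots(figsize=(6, 6))
for name, (x, y) in centers.items():
    ax.scatter(x, y, s=sizes[name] / 50, alpha=0.5)
    # the churn bubble is shifted right and down, as in the Excel version
    ax.scatter(x + 0.15, y - 0.15, s=churn[name] / 50, alpha=0.8)
ax.axhline(0, color='grey')
ax.axvline(0, color='grey')
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
plt.show()
```

The shift of (+0.15, -0.15) reproduces the empirically chosen churn-bubble position from the Excel version, e.g. (0.65; 0.35) for the upper-right quadrant.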

Share your experience in the comments: have you built charts like this, and how did you solve the task?

# How to calculate Retention?

Estimated read time – 6 min

In this post we will figure out how to properly build a Retention report using Redash and the SQL language.
For starters, let's briefly explain what the Retention rate metric is and why it is important.

## Retention rate

The Retention rate metric is widespread and especially popular in the mobile industry, since it shows how well a product engages users in daily use. Let's recall (or discover) how Retention is calculated:

Day X Retention is the percentage of users who return to the product on day X. In other words, if 100 new users arrived on some specific day (day 0) and 15 of them returned on day 1, then day 1 Retention equals 15/100 = 15%.
Most commonly, Retention of days 1, 3, 7 and 30 is singled out as the most descriptive set of product metrics; however, it is useful to look at the Retention curve as a whole and draw conclusions from it.
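To make the definition concrete, here is a toy calculation in Python with made-up numbers (not data from any real product):

```python
cohort_size = 100                        # new users who arrived on day 0
returned = {1: 15, 3: 10, 7: 8, 30: 5}   # how many of them came back on day X

# day X Retention = users returned on day X / cohort size * 100%
retention = {day: n * 100 / cohort_size for day, n in returned.items()}
print(retention)  # {1: 15.0, 3: 10.0, 7: 8.0, 30: 5.0}
```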

## Retention curve

Ultimately, we are interested in building a curve that shows user retention from day 0 to day 30.

Retention rate curve from day 0 to day 30

## Rolling Retention (RR)

Besides the classic Retention rate, Rolling Retention (hereinafter RR) is also distinguished. When calculating RR, not only day X but all subsequent days are taken into account. Thus, day 1 RR is the share of users who returned on day 1 or any later day.

Let's compare Retention and Rolling Retention of the 10th day:
Retention10 — the number of users who returned on day 10 / the number of users who installed the app 10 days ago * 100%.
Rolling Retention10 — the number of users who returned on day 10 or later / the number of users who installed the app 10 days ago * 100%.
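The difference is easy to see on a toy example (made-up data): below, each user is represented by the set of days on which they returned to the product.

```python
installs = 5  # users who installed the app 10 days ago (made-up)
# for each of them, the set of days on which they returned
returns = [{1, 2, 10}, {10, 12}, {25}, set(), {3}]

day = 10
# classic Retention: returned exactly on day 10
retention_10 = 100 * sum(1 for r in returns if day in r) / installs
# Rolling Retention: returned on day 10 or any later day
rolling_10 = 100 * sum(1 for r in returns if any(d >= day for d in r)) / installs
print(retention_10, rolling_10)  # 40.0 60.0
```

RR is always greater than or equal to the classic Retention of the same day, since it counts a superset of users.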

## Granularity (retention of time periods)

In some industries and the corresponding tasks it is useful to understand the Retention of a specific day (most often in the mobile industry), while in other cases it is useful to understand user retention over various time intervals, such as weekly or monthly periods (this is often handy in e-commerce and retail).
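Coarser granularity is computed the same way; only the return period changes. A small pandas sketch with a made-up visit log (the column names are illustrative, not from any real schema):

```python
import pandas as pd

# Made-up visit log: user id and visit date
visits = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'date': pd.to_datetime(['2019-01-01', '2019-01-09', '2019-01-02',
                            '2019-01-03', '2019-01-05']),
})

# cohort anchor: each user's first visit; week_diff is the weekly analogue of a day diff
first_visit = visits.groupby('user_id')['date'].transform('min')
visits['week_diff'] = (visits['date'] - first_visit).dt.days // 7

# unique users who were active in each week since their first visit
weekly_retention_base = visits.groupby('week_diff')['user_id'].nunique()
print(weekly_retention_base.to_dict())  # {0: 3, 1: 1}
```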

An example of cohorts by months and monthly Retention respective thereto

## How to build a Retention report in SQL?

We sorted out above how to calculate Retention with formulas; now let's implement it in SQL.
Let's assume we have two tables: user, which stores user identifiers and meta-information, and client_session, with information about users' visits to the mobile app.
Only these two tables appear in the query, so you can easily adapt it to your own setup.
Note: in this code I am using Impala as the DBMS.

### Collecting the size of cohorts

``````SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
       ndv(user.id) AS users
FROM user
GROUP BY 1``````

Let's break down this fairly simple query: for every day, we count the number of unique users who installed the app that day. To restrict the cohorts to a specific window, such as [60 days ago; 31 days ago], add the corresponding WHERE filter on installed_at.
So that you don't have to dig through the documentation: the ndv() function in Impala is an analogue of count(distinct).

### Calculating the number of returned users in each cohort

``````SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
       datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
       ndv(user.id) AS ret_base
FROM user
LEFT JOIN client_session cs ON cs.user_id = user.id
WHERE 1=1
  AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) BETWEEN 0 AND 30
GROUP BY 1, 2``````

The key part of this query is the datediff function: for each cohort and each datediff value, we count unique users with the same ndv() function (in effect, the number of users who returned on each of the days from 0 to 30).

Great, now we have the size of cohorts and the number of returned users.

### Combining all together

``````SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base / reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM user
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM user
   LEFT JOIN client_session cs ON cs.user_id = user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) BETWEEN 0 AND 30
   GROUP BY 1, 2) cohort ON reg.reg_date = cohort.reg_date
ORDER BY 1, 3``````

We now have a query that calculates Retention for each cohort, and the result can be displayed as follows:

Retention rate, calculated for each cohort of users

### Building a single Retention curve

Let's modify our query a bit to obtain the data for building a single Retention curve:

``````SELECT cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base) / avg(reg.users) * 100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM user
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM user
   LEFT JOIN client_session cs ON cs.user_id = user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) BETWEEN 0 AND 30
   GROUP BY 1, 2) cohort ON reg.reg_date = cohort.reg_date
GROUP BY 1
ORDER BY 1``````

Now we have the Retention rate averaged over all cohorts, calculated for each day.
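If you export the query result (for example, from Redash as CSV), plotting the curve takes a couple of lines of pandas. The numbers below are made up purely to illustrate the shape of the exported data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A stand-in for the exported query result: day_difference and retention_rate
curve = pd.DataFrame({
    'day_difference': range(31),
    'retention_rate': [100.0] + [round(40 * 0.9 ** d, 1) for d in range(1, 31)],
})
curve.plot(x='day_difference', y='retention_rate')
plt.show()
```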


# Parsing the data of a site's catalogue using Beautiful Soup and Selenium (part 2)

Estimated read time – 6 min

A follow-up to the previous article on data collection from a famous online catalogue of goods.
If you analyze the behaviour of the goods page thoroughly, you'll notice that the goods are loaded dynamically: as you scroll down the page, you receive a new set of goods. Thus, the code from the previous article is useless for this task.

For such cases Python also has a solution: the Selenium library, which launches a browser engine and emulates human behaviour.

In the first part of the script we will assemble the category tree, similarly to the previous article, but this time using Selenium.

``````import time
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
cookies_1 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_igooods_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
cookies_2 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
cookies_3 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}
# add the session and city cookies, then reload the page with them applied
for cookie in (cookies_1, cookies_2, cookies_3):
    browser.add_cookie(cookie)
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
source_data = browser.page_source
soup = bs(source_data, 'html.parser')
categories = soup.find_all('div', {'class': ['with-children']})
tree = {}
for x in categories:
    tree[x.findNext('span').text] = x.findNext('a').get('href')``````

In this snippet, just as before, we request the desired browser page via a GET request with parameters and download the data; then we create a bs object and perform similar operations on it. As a result, we get a dictionary tree in which a page URL is stored for each category. We will need this dictionary later to iterate over the categories in a loop.

Let's start collecting the goods data. To do so, we import the pandas library and create a new dataframe with four columns.

``````import pandas as pd
df = pd.DataFrame(columns=['SKU', 'Weight', 'Price','Category'])``````

After that, we iterate over our tree dictionary and obtain the page data for each category; the code is shown below. We still want to set the cookies that a real user would have, and also run a few tricky commands that make the browser engine emulate scrolling down the page.

``````for cat, link in tree.items():
    browser.maximize_window()
    cookies_1 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_i****_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
    cookies_2 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
    cookies_3 = {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}

    # open the category page, add the cookies, and reload so that they take effect
    browser.get(link)
    for cookie in (cookies_1, cookies_2, cookies_3):
        browser.add_cookie(cookie)
    browser.get(link)

    # Scroll to the end of the page every 3 seconds until no new data arrives.
    lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match = False
    while match == False:
        lastCount = lenOfPage
        time.sleep(3)
        lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount == lenOfPage:
            match = True``````

Now we have reached the end of the page and can collect the data for BeautifulSoup to work with.

``````# Collecting data from the page (still inside the loop over categories)
source_data = browser.page_source
soup = bs(source_data, 'html.parser')
skus = soup.find_all('div', {'class': ['b-product-small-card']})
last_value = len(df)  # continue the dataframe index from its current length
for i, x in enumerate(skus):
    df.loc[last_value + i] = [x.findNext('a').contents[0],
                              x.findNext('div', {'class': 'product-weight'}).contents[0],
                              x.findNext('div', {'class': 'g-cart-action small'})['data-price'],
                              cat]
browser.close()``````

In the code fragment above, we look for all <div> elements with the class b-product-small-card, and then, for every good found, we collect the values of the weight and price fields.

Source code of the product card site.

We launch the script and go to have a cup of coffee. Voilà, we now have a pandas dataframe with the data on all the goods:

DataFrame with goods, collected from the site.

Now we have great data for training an NLP model: the names of the goods and their belonging to various categories.


# Overview of Yandex DataLens

Estimated read time – 5 min

Let's take our minds off the receipt-data project for a while; we will talk about its next steps a bit later.

Today we'll discuss a new service from Yandex: DataLens (access to the demo was kindly provided by my great friend Vasiliy Ozerov and the Fevlake / Rebrain team). The service is currently in Preview mode and is, in essence, a cloud BI tool. Its main selling point is that it works with ClickHouse (Yandex ClickHouse) easily and conveniently.

## Connection of data sources

Let's review the main things: connecting a data source and configuring a dataset.
The selection of DBMSs is not vast, but the main ones are present. For our testing, let's take MySQL.

Selection of data sources DataLens

On the basis of the connection created, it is suggested to create a dataset:

Interface of the dataset settings: defining dimensions and measures

At this stage you define which table attributes become dimensions and which become measures, and you can choose the aggregation type for the measures.
Unfortunately, I didn't manage to discover how to specify several interconnected tables (for example, to attach a reference table for the dimensions) instead of a single table. Perhaps at this stage the developers suggest solving this by creating the required view.

## Data visualization

As for the interface itself, everything is quite easy and handy; it resembles a cloud version of Tableau. Compared to Redash, which is most frequently used together with ClickHouse, the visualization capabilities are simply staggering.
Even the pivot tables, in which Measure Names can be used as column names, are worth a lot:

Setting of pivot tables in DataLens

Obviously, DataLens from Yandex also offers basic charts:

Construction of a linear chart in DataLens

There are also area charts:

Construction of an area chart in DataLens

However, I didn't manage to find out how dates are grouped by months / quarters / weeks. Judging by the sample data available in the demo version, the developers are for now solving this issue by creating additional attributes (DayMonth, DayWeek, etc.).

## Dashboards

For now, the interface for building dashboard blocks looks bulky, and the interface windows are not always self-explanatory. Here, for instance, is the window for defining a parameter:

A not very obvious settings window for dashboard parameters

However, the gallery of examples shows highly functional and convenient dashboards with selectors, tabs and parameters:

An example of a working dashboard with parameters and tabs in DataLens

I'm looking forward to the interface shortcomings being fixed and DataLens improving, and I'm preparing to use it together with ClickHouse!

# Collecting data from hypermarket receipts in Python

Estimated read time – 6 min

Recently, while once again buying products in a hypermarket, I recalled that, according to Russian Federal Law FZ-54, any trade operator that issues a receipt is obliged to send its data to the Tax Service.

Receipt from “Lenta” hypermarket. The QR-code of our interest is circled.

So what does this mean for us, data analysts? It means that we can know ourselves and our needs better, and also acquire interesting data on our own purchases.

Within this series of blog posts, let's try to assemble a small prototype of an app that will show the dynamics of our purchases. So, we start from the fact that each receipt carries a QR code, and if you decode it, you receive the following line:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

This line comprises:

t – timestamp, the time when you made the purchase
s – the receipt total
fn – the fiscal drive number, needed later in the API request
i – the receipt number, needed later in the API request
fp – the fiscalSign parameter, needed later in the API request

As the first step, we will parse the receipt data and collect it into a pandas dataframe using Python modules.

We will use the API that provides receipt data from the Tax Service website.

First, we obtain authentication data:

``````import requests
your_phone = '+7XXXYYYZZZZ'  # state your phone number; an SMS with a password will be sent to it
r = requests.post('https://proverkacheka.nalog.ru:9999/v1/mobile/users/signup', json={"email": "email@email.com", "name": "USERNAME", "phone": your_phone})``````

As a result of this POST request, we receive a password via SMS to the indicated phone number. From here on, we will use it in the variable pwd.

Now let's parse the line of values from the QR code:

``````import re
qr_string='t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1'
t=re.findall(r't=(\w+)', qr_string)[0]
s=re.findall(r's=(\w+)', qr_string)[0]
fn=re.findall(r'fn=(\w+)', qr_string)[0]
i=re.findall(r'i=(\w+)', qr_string)[0]
fp=re.findall(r'fp=(\w+)', qr_string)[0]``````
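As an aside, the QR line is an ordinary URL query string, so it can also be split with Python's standard library instead of regular expressions:

```python
from urllib.parse import parse_qs

qr_string = 't=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1'
# parse_qs returns lists of values; each key occurs once here, so take the first element
params = {key: values[0] for key, values in parse_qs(qr_string).items()}
print(params['t'], params['s'], params['fn'])  # 20190320T2303 5803.00 9251440300007971
```

Note that parse_qs preserves the decimal part of the sum ('5803.00'), whereas the \w+ pattern in the regex above stops at the dot.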

We'll use the obtained variables to extract the data.
One post on Habr examines the error statuses that arise when forming the API request quite thoroughly, so I won't repeat that information here.

First, we need to verify that data on this receipt exists, so we form a GET request.

``````headers = {'Device-Id': '', 'Device-OS': ''}
payload = {'fiscalSign': fp, 'date': t, 'sum': s}
# the verification GET request itself (endpoint as described in the Habr post referenced above)
check_request = requests.get('https://proverkacheka.nalog.ru:9999/v1/ofds/*/inns/*/fss/' + fn + '/operations/1/tickets/' + i, params=payload, headers=headers)
print(check_request.status_code)``````

The request must include headers, even empty ones. In my case, the GET request returns error 406, from which I conclude that such a receipt is found (why the GET request returns 406 remains a mystery to me, so I'd be glad to get hints in the comments). If the sum or date is not indicated, the GET request returns error 400 – bad request.

Let's move on to the most interesting part: receiving the receipt data:

``````request_info=requests.get('https://proverkacheka.nalog.ru:9999/v1/inns/*/kkts/*/fss/'+fn+'/tickets/'+i+'?fiscalSign='+fp+'&sendToEmail=no',headers=headers,auth=(your_phone, pwd))
print(request_info.status_code)
products=request_info.json()``````

We should receive code 200 (successful execution of the request), and the variable products will contain everything related to our receipt.

To work further with this data, let's use pandas and transform everything into a dataframe.

``````import pandas as pd
from datetime import datetime
my_products=pd.DataFrame(products['document']['receipt']['items'])
my_products['price']=my_products['price']/100
my_products['sum']=my_products['sum']/100
datetime_check = datetime.strptime(t, '%Y%m%dT%H%M')  # parse the date; see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
my_products['date']=datetime_check
my_products.set_index('date',inplace=True)``````

Now we have a working pandas dataframe with the receipt items; visually it looks as follows:

You can construct a histogram of purchases or view everything as a box plot:

``````import matplotlib.pyplot as plt
%matplotlib inline
my_products['sum'].plot(kind='hist', bins=20)
plt.show()
my_products['sum'].plot(kind='box')
plt.show()``````
Box plot of the receipt sums

In conclusion, let's simply get descriptive statistics as text using the .describe() command:

``my_products.describe()``

It's convenient to save the data to a .csv file, so that next time you can update your statistics:

``````with open('hyper_receipts.csv', 'a') as f: