Later Ctrl + ↑

Handling website buttons in Selenium

Estimated read time – 8 min

In our previous article, Parsing the data of site’s catalogue, using Beautiful Soup and Selenium we have addressed the problem of working with dynamic pages, but sometimes this method doesn’t work, as with “Show more” buttons. Today we will show how you can imitate button click with Selenium to load a whole page, collect beer IDs, ratings, and send the data to Clickhouse.

Webpage structure

Let’s take a random brewery that has 105 check-ins, or customer feedbacks. One page with check-ins displays up to 25 records and looks like this:

If we try to scroll down to the bottom, we will encounter the same button that prevents us from getting all 105 records at once:

First off, to address this task, let’s find out the button class and just click it until it works. Since Selenium launches the browser and the next “Show more” button may not be loaded in time, that’s why we set 2-second intervals between the clicks. As soon as the page is loaded we will take its content and parse the relevant data.
Let’s view the source code and  find the button, it’s assigned to the more_checkins class.

The button has style attributes, such as display. When the button is displayed this attribute takes the block value. But when we scroll the page to the buttom and there is nothing left to display, the attribute takes the none value and we can stop clicking.

Writing our code

Let’s import the necessary libraries

import time
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
from datetime import datetime
from clickhouse_driver import Client

Chromedriver is used to run Selenium tests on Chrome and can be downloaded from the official website

Connect to the database and create cookies:

client = Client(host='ec1-23-456-789-10.us-east-2.compute.amazonaws.com', user='', password='', port='9000', database='')
count = 0
cookies = {
    'domain':'untappd.com',
    'expiry':1594072726,
    'httpOnly':True,
    'name':'untappd_user_v3_e',
    'path':'/',
    'secure':False,
    'value':'your_value'
}

You can find out more about working with cookies in Selenium from Parsing the data of site’s catalogue, using Beautiful Soup and Selenium. We will need the untappd_user_v3_e parameter.

As we are going to work with pages that have more than hundreds of thousands of records, it’s pretty heavy and our instance may be overloaded. To prevent this, we will shut down unnecessary parts and then enable authentication cookie:

options = webdriver.ChromeOptions()
prefs = {'profile.default_content_setting_values': {'images': 2, 
                            'plugins': 2, 'fullscreen': 2}}
options.add_experimental_option('prefs', prefs)
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options)
driver.get('https://untappd.com/TooSunnyBrewery')
driver.add_cookie(cookies)

We will need a function that would take a link, open it in the browser, load a whole page and return a soup object to be parsed. Get the  display attribute, assign it to the more_checkins: variable and click the button until the attribute is none. Let’s set 2-second intervals between the clicks, to wait for the page to load. As soon as we received the page, converth it into a soup object using the bs4 library.

def get_html_page(url):
    driver.get(url)
    driver.maximize_window()
    more_checkins = driver.execute_script("var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;")
    print(more_checkins)
    while more_checkins != "none":
        driver.execute_script("document.getElementsByClassName('more_checkins_logged')[0].click()")
        time.sleep(2)
        more_checkins = driver.execute_script("var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;")
        print(more_checkins)
    source_data = driver.page_source
    soup = bs(source_data, 'lxml')
    return soup

Write the following function that will take a page url, pass it in the get_html_page and receive a soup object to parse. The function returns zipped lists with beer IDs and ratings.

See how you can use Beautiful Soup to retrieve data from a website catalogue

def parse_html_page(url):
    soup = get_html_page(url)
    brewery_id = soup.find_all('a', {'class':'label',
                                     'href':re.compile('https://untappd.com/brewery/*')})[0]['href'][28:]
    items = soup.find_all('div', {'class':'item',
                                  'id':re.compile('checkin_*')})
    checkin_rating_list = []
    beer_id_list = []
    count = 0
    print('Filling the lists')
    for checkin in items:
        print(count, '/', len(items))
        try:
            checkin_rating_list.append(float(checkin.find('div', {'class':'caps'})['data-rating']))
        except Exception:
            checkin_rating_list.append('cast(Null as Nullable(Float32))')
        try:
            beer_id_list.append(int(checkin.find('a', {'class':'label'})['href'][-7:]))
        except Exception:
            beer_id_list.append('cast(Null as Nullable(UInt64))')
        count += 1 
    return zip(checkin_rating_list, beer_id_list)

Finally, write a function call for the breweries. We’ve covered how to receive a list of Russian brewery IDs in this article: Example of using dictionaries in Clickhouse with Untappd.
Let’s fetch it from the Clickhouse table.

brewery_list = client.execute('SELECT brewery_id FROM brewery_info')

If we print out the brewery_list, we will find out that the data is stored in a list of tuples.

Let’s make it a bit prettier with the lambda expression:

flatten = lambda lst: [item for sublist in lst for item in sublist]
brewery_list = flatten(brewery_list)

That’s much better:

Create a url for each brewery in the list, it includes a standard link and a brewery ID in the end. Pass it to the parse_html_page function that fetches the get_html_page and return lists with beer_id and rating_score. Since the lists are zipped, we can iterate throught them, create a tuple and send it to Clickhouse.

for brewery_id in brewery_list:
    print('Fetching the brewery with id', brewery_id, count, '/', len(brewery_list))
    url = 'https://untappd.com/brewery/' + str(brewery_id)
    returned_checkins = parse_html_page(url)
    for rating, beer_id in returned_checkins:
        tuple_to_insert = (rating, beer_id)
        try:
            client.execute(f'INSERT INTO beer_reviews VALUES {tuple_to_insert}')
        except errors.ServerException as E:
            print(E)
    count += 1

That’s it about the way we can handle “Show more” buttons. Over time we will form a large dataset for further analysis, to work with in our next series.

 No comments    176   2020   Amazon Web Services   Analytics engineering   AWS   clickhouse   python

Example of using dictionaries in Clickhouse with Untappd

Estimated read time – 12 min

In Clickhouse we can use internal dictionaries as well as external dictionaries, they can be an alternative to JSON that doesn’t always work fine. DIctionaries store information in memory and can be invoked with the dictGet method. Let’s review how we can create one in Clickhouse and use it for our queries.

We will illustrate an example of data using the Untappd API. Untappd is a social network for everyone who loves craft beer. We are going to use сheck-ins of Russian-based craft breweries and start collecting information about them to analyze this data later on and to draw some conclusions. in today’s article, we will analyze how to receive metadata on Russian breweries with Untappd and store it in a Clickhouse dictionary.

Collecting data with Untappd

First off, we need to create a new app to receive client_id and  client_secret_key to make API calls. Follow  this link and fill in the fields:

Usually, it takes about 1 to 3 weeks to wait for approval.

import requests
import pandas as pd
import time

We’ll be using the requests library to make API calls, view results in a Pandas DataFrame, and save them in a CSV file before sending it to a Clickhouse dictionary. Untappd has strict limits on the number of requests, prohibiting us to make more than 100 calls per hour. Therefore, we need to make our script wait for 38 seconds using the Python time module.

client_id = 'your_client_id'
client_secret = 'your_client_secret'
all_brewery_of_russia = []

We want to get data for one thousand Russian breweries. One request to the Brewery Search method enables us to view up to 50 breweries. The website gave us 3369 breweries when searching the word “Russia” manually.

Let’s check this, scroll down to the bottom, and open the page code.

Each brewery received is stored in the beer-item class. This means we can the number of references to beer-item:

And as it turned out, we have exactly 1000 breweries, not 3369. When searching the word “Russia” manually, the results also contain some American breweries. So, we need to make 20 calls, getting 50 breweries at a time:

for offset in range(0, 1000, 50):
    try:
        print('offset = ', offset)
        print('remained:', 1000 - offset, '\n')
        response = requests.get(f'https://api.untappd.com/v4/search/brewery?client_id={client_id}&client_secret={client_secret}',
                               params={
                                   'q':'Russia',
                                   'offset':offset,
                                   'limit':50
                               })
        item = response.json()
        print(item, '\n')
        all_brewery_of_russia.append(item)
        time.sleep(37)
    except Exception:
        print(Exception)
        continue

The Brewery Search method includes several parameters, q – a string with a country name (specify specify “Russia” to get all the breweries based in Russia), offset – allows us to shift by 50 lines in the search to get the next list of breweries, limit – restricts the number of breweries received and can not be more than 50. Convert the answer to JSON and append data sotred in the item object to the  all_brewery_of_russia list.

Our data may also include breweries from other countries. That’s why we need to filter the data. Iterate through the all_brewery_of_russia list and keep only those breweires, which country_name is Russia.

brew_list = []
for element in all_brewery_of_russia:
    brew = element['response']['brewery']
    for i in range(brew['count']):
        if brew['items'][i]['brewery']['country_name'] == 'Russia':
            brew_list.append(brew['items'][i])

Print out the first element in our brew_list:

print(brew_list[0])

Create a DataFrame with the following columns: brewery_id, beer_count, brewery_name, brewery_slug, brewery_page_url, brewery_city, lat и  lng. And several lists to sort out the data stored in the brewery_list:

df = pd.DataFrame()
brewery_id_list = []
beer_count_list = []
brewery_name_list = []
brewery_slug_list = []
brewery_page_url_list = []
brewery_location_city = []
brewery_location_lat = []
brewery_location_lng = []
for brewery in brew_list:
    brewery_id_list.append(brewery['brewery']['brewery_id'])
    beer_count_list.append(brewery['brewery']['beer_count'])
    brewery_name_list.append(brewery['brewery']['brewery_name'])
    brewery_slug_list.append(brewery['brewery']['brewery_slug'])
    brewery_page_url_list.append(brewery['brewery']['brewery_page_url'])
 brewery_location_city.append(brewery['brewery']['location']['brewery_city'])
    brewery_location_lat.append(brewery['brewery']['location']['lat'])
    brewery_location_lng.append(brewery['brewery']['location']['lng'])

Assign them as column values:

df['brewery_id'] = brewery_id_list
df['beer_count'] = beer_count_list
df['brewery_name'] = brewery_name_list
df['brewery_slug'] = brewery_slug_list
df['brewery_page_url'] = brewery_page_url_list
df['brewery_city'] = brewery_location_city
df['brewery_lat'] = brewery_location_lat
df['brewery_lng'] = brewery_location_lng

And view our DataFrame:

df.head()

Let’s sort the values by brewery_id and store our DataFrame as a CSV file without index column and headings:

df = df.sort_values(by='brewery_id')
df.to_csv('brewery_data.csv', index=False, header=False)

Creating a Clickhouse dictionary

You can create Clickouse dictionaries in many different ways. We will try to structure it in an XML file, configure the server files, and access it through our client. The XML file structure will be the following:

Learn more about other ways you can create Clickhouse dictionaries in the documentation

<yandex>
<dictionary>
        <name>breweries</name>
        <source>
                <file>
                        <path>/home/ubuntu/brewery_data.csv</path>
                        <format>CSV</format>
                </file>
        </source>
        <layout>
                <flat />
        </layout>
        <structure>
                <id>
                        <name>brewery_id</name>
                </id>
                <attribute>
                        <name>beer_count</name>
                        <type>UInt64</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_name</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_slug</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_page_url</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_city</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>lat</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>lng</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
        </structure>
        <lifetime>300</lifetime>
</dictionary>
</yandex>

name is a dictionary name, attribute holds the properties of the columns, id is a key field, file stores file path and format. We are going to store our file in this directory: /home/ubuntu.

Let’s upload our CSV and XML files to the server, it can be done using an FTP like FileZilla. We explained how to deploy Clickhouse on an Amazon instance in our previous article, this time need to do the same. Open your FileZilla client and go to SFTP settings to add a private key:

Connect to your server address, it can be found in the EC2 management console. Specify SFTP as a protocol, your Host, and Ubuntu as a username.

Your Public DNS may change in case of overload

After connecting we will wind up in this location /home/ubuntu. Let’s put the files in that folder and connect via SSH using Termius. Then we need to move the files to /etc/clickhouse-server to view them in Clickhouse:

Learn how you can connect to an AWS server using SSH client from our previous material Installing Clickhouse on AWS

sudo mv breweries_dictionary.xml /etc/clickhouse server/

Go to the config file:

cd /etc/clickhouse-server
sudo nano config.xml

We need the  tag, it’s the path to a file that describes the dictionaries structure. Specify the path to our XML file:

<dictionaries_config>/etc/clickhouse-server/breweries_dictionary.xml</dictionaries_config>

Save our file and run the Clickhouse client:

clickhouse client

Let’s check that the dictionary really loaded:

SELECT * FROM system.dictionaries\G

In case of success you will get the following:

Now, let’s write a query with the  dictGet function to get the name of the brewery with ID 999. Pass in the dictionary name, as the first argument, then the filed name and ID.

SELECT dictGet('breweries', 'brewery_name', toUInt64(999))

And our query returns this:

Similarly, we could use this function to get a beer name, when the table contains only IDs.

SQL Window Functions Cheat Sheet with examples

Estimated read time – 1 min

Window functions are calculation functions that can increase the efficiency and reduce the complexity of SQL queries, making things much easier:
View as  PDF

Special Thanks to LearnSQL, best place to master your SQL skills !

 No comments    267   2020   Analytics engineering   sql

Collecting Data on Ad Campaigns from VK.com

Estimated read time – 11 min

We have a lot to share in today’s longread: we’ll retrieve data on ad campaigns from Vkontakte (widely popular social network in Russia and CIS countries) and compare them to Google Analytics data in Redash. This time we don’t need to create a server, as our data will be transferred to Google Docs via Google Sheets API.

Getting an Access Token
We need to create an app to receive our access token. Follow this link https://vk.com/apps?act=manage and click “Create app” on the developer’s page. Choose a name for your app and check it as a “Standalone app”. Then, click Settings in the left menu and save your app ID.

More details on access tokens can be found here: Getting an access token

Copy this link:

https://oauth.vk.com/authorize?client_id=YourClientID&scope=ads&response_type=token

And change YourClientID to your app ID, this will allow you to get information about your advertising account. Open this link in your browser and you will be redirected to another page, which URL address holds your generated access token.

Access token expires in 86400 seconds or 24 hours. If you want to generate a token with an unlimited lifetime period, just pass scope to the offline parameter. In case if you need to generate a new token – change your password account or terminate all active sessions in security settings.

You will also need your advertising account ID to make API requests. It can be found via this link, just copy it:  https://vk.com/ads?act=settings

Using APIs to collect data
Let’s write a script that would allow us to retrieve information on all user’s ad campaigns: number of impressions, сlicks and costs. The script will pass this data to a DataFrame and send it to Google Docs.

from oauth2client.service_account import ServiceAccountCredentials
from pandas import DataFrame
import requests
import gspread
import time

We have several constant variables: access token, advertising account ID and Vkontakte API Version. Here we are using the most recent API version, which is 5.103.

token = 'fa258683fd418fafcab1fb1d41da4ec6cc62f60e152a63140c130a730829b1e0bc'
version = 5.103
id_rk = 123456789

To get advertising stats you need to use the  ads.getStatistics method and pass your ad campaign ID to it. Since we don’t run any advertisements yet, we’ll use the  ads.getAds method that returns IDs of ads and campaigns.

Learn more about the API methods available for Vkontakte here

Use the requests library to send a request and convert the response to JSON.


campaign_ids = []
ads_ids = []
r = requests.get('https://api.vk.com/method/ads.getAds', params={
    'access_token': token,
    'v': version,
    'account_id': id_rk
})
data = r.json()['response']

We have a familiar list of dictionaries returned, similar to the one we have reviewed in the previous article, “Analysing data on Facebook Ad Campaigns with Redash”.

Fill in the ad_campaign_dict dictionary as follows: specify ad ID as a key, and campaign ID as a value, where this ad belongs to.

ad_campaign_dict = {}
for i in range(len(data)):
    ad_campaign_dict[data[i]['id']] = data[i]['campaign_id']

Having ID for every ad needed we can invoke the  ads.getStatistics method to collect data on the number of impressions, clicks, costs, and dates for a particular ad, so create several empty lists in advance.

ads_campaign_list = []
ads_id_list = []
ads_impressions_list = []
ads_clicks_list = []
ads_spent_list = []
ads_day_start_list = []
ads_day_end_list = []

We need to invoke the getStatistics method for each ad separately, let’s refer to the ad_campaign_dict and iterate our requests. Retrieve all-time data by calling the ‘period’ method with the  ‘overall’ value. Some ads may not have impression or clicks if they haven’t been launched yet, this may cause a  KeyError. Let’s recall to the try — except approach to handle this error.

for ad_id in ad_campaign_dict:
        r = requests.get('https://api.vk.com/method/ads.getStatistics', params={
            'access_token': token,
            'v': version,
            'account_id': id_rk,
            'ids_type': 'ad',
            'ids': ad_id,
            'period': 'overall',
            'date_from': '0',
            'date_to': '0'
        })
        try:
            data_stats = r.json()['response']
            for i in range(len(data_stats)):
                for j in range(len(data_stats[i]['stats'])):
                    ads_impressions_list.append(data_stats[i]['stats'][j]['impressions'])
                    ads_clicks_list.append(data_stats[i]['stats'][j]['clicks'])
                    ads_spent_list.append(data_stats[i]['stats'][j]['spent'])
                    ads_day_start_list.append(data_stats[i]['stats'][j]['day_from'])
                    ads_day_end_list.append(data_stats[i]['stats'][j]['day_to'])
                    ads_id_list.append(data_stats[i]['id'])
                    ads_campaign_list.append(ad_campaign_dict[ad_id])
        except KeyError:
            continue

Now, create a DataFrame and print out the first 5 data points

df = DataFrame()
df['campaign_id'] = ads_campaign_list
df['ad_id'] = ads_id_list
df['impressions'] = ads_impressions_list
df['clicks'] = ads_clicks_list
df['spent'] = ads_spent_list
df['day_start'] = ads_day_start_list
df['day_end'] = ads_day_end_list
print(df.head())

Exporting Data to Google Docs
We’ll need a Google API access token, navigate to https://console.developers.google.com and create one. Choose any name you like, then go to your Dashboard and click “Enable APIs and Services”. Choose Google Drive API from the list, enable it and do exactly the same for Google Sheets API.

After activation you will be redirected to the API control panel. Click Credentials – Create Credentials, click choose data type and create an account. Choosing a role is optional, just proceed and specify JSON as a key type.

After these steps you can download a JSON file with your credentials, we’ll rename it to «credentials.json». On the main page you’ll find the email field – copy your email address.

Go to https://docs.google.com/spreadsheets and create a new file named data, we’ll pass data from our DataFrame to it. Put the  credentials.json file in one directory with the script and continue coding. Add these links to the scope list:

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']

We will use the  ServiceAccountCredentials.from_json_keyfile_name and  gspread.authorize methods available in the  oauth2client and  gspread libraries for authenticaion process. Specify your file name and the scope variable in the ServiceAccountCredentials.from_json_keyfile_name method. The  sheet variable will allow us to send requests to our file in Google Docs.

creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
client = gspread.authorize(creds)
sheet = client.open('data').sheet1

Apply the update_cell method to enter new value in a table cell. It’s worth mentioning that the indexing starts at 0, not 1. With the first loop we’ll move the column names of our DataFrame. And with the following loops we’ll move the rest of our data points. The default limits allow us to make 100 loops for 100 seconds. These restrictions may cause errors and stop our script, that’s why we need to use time.sleep and make the script sleep for 1 second after each loop.

count_of_rows = len(df)
count_of_columns = len(df.columns)
for i in range(count_of_columns):
    sheet.update_cell(1, i + 1, list(df.columns)[i])
for i in range(1, count_of_rows + 1):
    for j in range(count_of_columns):
        sheet.update_cell(i + 1, j + 1, str(df.iloc[i, j]))
        time.sleep(1)

In case of success, you’ll get the same table:

Exporting data to Redash

See how you can connect Google Analytics to Redash in this article «How to connect Google Analytics to Redash?».

Having a table with Google Analytics and ad campaigns from Vkontakte exported we can compare them by writing the following query:

SELECT
    query_50.day_start,
    CASE WHEN ga_source LIKE '%vk%' THEN 'vk.com' END AS source,
    query_50.spent,
    query_50.impressions,
    query_50.clicks,
    SUM(query_49.ga_sessions) AS sessions,
    SUM(query_49.ga_newUsers) AS users
FROM query_49
JOIN query_50
ON query_49.ga_date = query_50.day_start
WHERE query_49.ga_source LIKE '%vk%' AND DATE(query_49.ga_date) BETWEEN '2020-05-16' AND '2020-05-20'
GROUP BY query_49.ga_date, source

ga_source — the traffic source, from which a user was redirected. Use the  CASE method to combine everything that contains “vk” in one column called «vk.com». With the help of JOIN operator we can add the table with the data on ad campaigns, merging by date. Let’s take the day of the last ad campaign and a couple of days after, this will result in the following output:

Takeaways
Now we have a table that reflects how much were spent in ad costs on a certain day, the number of users who viewed this ad, were engaged and redirected to our website, and then completed the sign-up process.

 No comments    276   2020   Analytics engineering   BI-tools   data analytics   longread

Working with Materialized Views in Clickhouse

Estimated read time – 7 min

This time we’ll illustrate how you can pass data on Facebook ad campaigns to Clickhouse tables with Python and implement Materialized Views. What is materialized views, you may ask. Oftentimes Clickhouse is used to handle large amounts of data and the time spent waiting for a response from a table with raw data is constantly increasing. Usually, we would use ETL-process to address this task efficiently or create aggregate tables, which are not that useful because we have to regularly update them. Clickhouse system offers a new way to meet the challenge using materialized views.
Materialized Views allow us to store and update data on a hard drive in line with the SELECT query that was used to get a view. When we need to insert data into a table, the SELECT method transforms our data and populates a materialized view.

Setting Up Amazon EC2 instance
We need to connect our Python script that we created in this article to Cickhouse. The script will make queries, so let’s open several ports. In your AWS Dashboard go to Network & Security — Security Groups. Our instance belongs to the launch-wizard-1 group. Сlick it and pay attention to the Inbound rules, you need to set them as shown in this screenshot:

Setting up Clickhouse
It’s time to set up Clickhouse. Let’s edit the config.xml file using nano text editor:

cd /etc/clickhouse-server
sudo nano config.xml

Learn more about the shortcuts here if you didn’t get how to exit nano too :)

Uncomment this line:

<listen_host>0.0.0.0</listen_host>

to access your database from any IP-address:

Create a table and its materialized view
Open a terminal window to create our database with tables:

CREATE DATABASE db1
USE db1

We’ll refer to the same example of data collection from Facebook. The data on Ad Campaigns may often change and be updated, with this in mind we want to create a materialized view that would automatically update aggregate tables containing the costs data. Our Clickhouse table will look almost the same as the DataFrame used in the previous post. We picked ReplacingMergeTree as an engine for our table, it will remove duplicates by sorting key:

CREATE TABLE facebook_insights(
	campaign_id UInt64,
	clicks UInt32,
	spend Float32,
	impressions UInt32,
	date_start Date,
	date_stop	 Date,
	sign Int8
) ENGINE = ReplacingMergeTree
ORDER BY (date_start, date_stop)

And then, create a materialized view:

CREATE MATERIALIZED VIEW fb_aggregated
ENGINE = SummingMergeTree()
ORDER BY date_start
	AS
	SELECT campaign_id,
		      date_start,
		      sum(spend * sign) as spent,
		      sum(impressions * sign) as impressions,
		      sum(clicks * sign) as clicks
	FROM facebook_insights
	GROUP BY date_start, campaign_id

More details are available in the Clickhouse blog.

Unfortunately for us, Clikhouse system doesn’t include a familiar UPDATE method. So we need to find a workaround. Thanks to the Yandex team, these guys offered to insert rows with a negative sign first, and then use sign for reversing. According to this principle, the old data will be ignored when summing.

Script
Let’s start writing the script and import a new library, which is called clickhouse_driver. It allows to make queries to Clickhouse in Python:

We are using the updated version of the script from “Collecting Data on Facebook Ad Campaigns”. But it will work fine if you just combine this code with the previous one.

from datetime import datetime, timedelta
from clickhouse_driver import Client
from clickhouse_driver import errors

An object of the Clientclass enables us to make queries with the execute() method. Type in your public DNS in the host field, port – 9000, specify default as a user, and a databasefor the connection.

client = Client(host='ec1-2-34-56-78.us-east-2.compute.amazonaws.com', user='default', password=' ', port='9000', database='db1')

To ensure that everything works as expected, we need to write the following query that will print out names of all databases stored on the server:

client.execute('SHOW DATABASES')

In case of success the query will return this list:

[('_temporary_and_external_tables',), ('db1',), ('default',), ('system',)]

For example, we want to get data for the past three days. Create several datetime objects with the datetime library and convert them to strings using thestrftime()
method:

date_start = datetime.now() - timedelta(days=3)
date_end = datetime.now() - timedelta(days=1)
date_start_str = date_start.strftime("%Y-%m-%d")
date_end_str = date_end.strftime("%Y-%m-%d")

This query returns all table columns for a certain period:

SQL_select = f"select campaign_id, clicks, spend, impressions, date_start, date_stop, sign from facebook_insights where date_start > '{date_start_str}' AND date_start < '{date_end_str}'"

Make a query and pass the data to the old_data_list. And then, replace their signfor -1 and append elements to the new_data_list:

new_data_list = []
old_data_list = []
old_data_list = client.execute(SQL_select)

for elem in old_data_list:
    elem = list(elem)
    elem[len(elem) - 1] = -1
    new_data_list.append(elem)

Finally, write our algorithm: insert the data with the sign =-1, optimize it with ReplacingMergeTree, remove duplicates, and INSERT new data with the sign = 1.

SQL_query = 'INSERT INTO facebook_insights VALUES'
client.execute(SQL_query, new_data_list)
SQL_optimize = "OPTIMIZE TABLE facebook_insights"
client.execute(SQL_optimize)
for i in range(len(insight_campaign_id_list)):
    client.execute(SQL_query, [[insight_campaign_id_list[i],
                                insight_clicks_list[i],
                                insight_spend_list[i],
                                insight_impressions_list[i],
                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),
                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),
                                1]])
    client.execute(SQL_optimize)

Get back to Clickhouse and make the next query to view the first 20 rows:
SELECT * FROM facebook_insights LIMIT 20

And  SELECT * FROM fb_aggregated LIMIT 20 to compare our materialized view:

Nice work! Now we have a materialized view that will be updated each time when the data in the facebook_insights table changes. The trick with the sign operator allows to differ already processed data and prevent its summation, while ReplacingMergeTree engine helps us to remove duplicates.

 No comments    581   2020   Amazon Web Services   Analytics engineering   AWS   clickhouse   data analytics   python
Earlier Ctrl + ↓