19 posts tagged

data analytics

Collecting Social Media Data for Top ML, AI & Data Science related accounts on Instagram

Estimated read time – 9 min

Instagram is among the top five most visited websites in the world, though perhaps not in our industry. Nevertheless, we are going to test this hypothesis using Python and our data analytics skills. In this post, we will share how to collect social media data using the Instagram API.

Data collection method
The official Instagram API won’t let us collect data about other platform users without a good reason, but there is always a way. Try sending the following request:

https://instagram.com/leftjoin/?__a=1

The request returns a JSON object with detailed user information: we can easily get the account name, the number of posts, followers, and followees, as well as the first ten posts with their like counts, comments, and so on. The pyInstagram library allows sending such requests.
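For a quick sanity check without any libraries, something like the sketch below should work. Note that Instagram has been tightening access to this endpoint, so it may require an authenticated session or return an HTML page instead of JSON; the field names reflect the response structure at the time of writing.

import requests

# Quick manual check of the ?__a=1 endpoint (illustration only: the endpoint
# may require a logged-in session and its JSON structure may change)
response = requests.get('https://instagram.com/leftjoin/', params={'__a': 1})
profile = response.json()['graphql']['user']

print(profile['username'])
print('followers:', profile['edge_followed_by']['count'])
print('posts:', profile['edge_owner_to_timeline_media']['count'])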

SQL schema
Data will be collected into three ClickHouse tables: users, posts, comments. The users table will contain user data such as user id, username, the user’s first and last name, account description, number of followers, followees, posts, comments, and likes, whether the account is verified, and so on.

CREATE TABLE instagram.users
(
    `added_at` DateTime,
    `user_id` UInt64,
    `user_name` String,
    `full_name` String,
    `base_url` String,
    `biography` String,
    `followers_count` UInt64,
    `follows_count` UInt64,
    `media_count` UInt64,
    `total_comments` UInt64,
    `total_likes` UInt64,
    `is_verified` UInt8,
    `country_block` UInt8,
    `profile_pic_url` Nullable(String),
    `profile_pic_url_hd` Nullable(String),
    `fb_page` Nullable(String)
)
ENGINE = ReplacingMergeTree
ORDER BY added_at

The posts table will be populated with the post owner’s name, post id, caption, comments count, and so on. To check whether a post is an advertisement, an Instagram carousel, or a video, we can use the is_ad, is_album and is_video fields (see the short example after the table).

CREATE TABLE instagram.posts
(
    `added_at` DateTime,
    `owner` String,
    `post_id` UInt64,
    `caption` Nullable(String),
    `code` String,
    `comments_count` UInt64,
    `comments_disabled` UInt8,
    `created_at` DateTime,
    `display_url` String,
    `is_ad` UInt8,
    `is_album` UInt8,
    `is_video` UInt8,
    `likes_count` UInt64,
    `location` Nullable(String),
    `resources` Array(String),
    `video_url` Nullable(String)
)
ENGINE = ReplacingMergeTree
ORDER BY added_at
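For instance, once the posts have been collected, these flags make it easy to see how the content splits by type. A minimal sketch, assuming the same clickhouse_driver connection that the script below uses:

from clickhouse_driver import Client

# Illustration only: count collected posts by type using the flag columns
client = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')
ads, albums, videos = client.execute(
    'SELECT sum(is_ad), sum(is_album), sum(is_video) FROM posts'
)[0]
print(f'ads: {ads}, albums: {albums}, videos: {videos}')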

In the comments table, we store each comment separately with the comment owner and text.

CREATE TABLE instagram.comments
(
    `added_at` DateTime,
    `comment_id` UInt64,
    `post_id` UInt64,
    `comment_owner` String,
    `comment_text` String
)
ENGINE = ReplacingMergeTree
ORDER BY added_at

Writing the script
Import the following classes from the library: Account, Media, WebAgent and Comment.

from instagram import Account, Media, WebAgent, Comment
from datetime import datetime
from clickhouse_driver import Client
import requests
import pandas as pd

Next, create an instance of the WebAgent class, which is required for some library methods and for updating data. To collect anything meaningful we need at least a list of account names. Since we don’t have one yet, send the following request to search for profiles by the keywords specified in queries_list. The search results will consist of Instagram pages that match any keyword in the list.

agent = WebAgent()
queries_list = ['machine learning', 'data science', 'data analytics', 'analytics', 'business intelligence',
                'data engineering', 'computer science', 'big data', 'artificial intelligence',
                'deep learning', 'data scientist','machine learning engineer', 'data engineer']
client = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')
url = 'https://www.instagram.com/web/search/topsearch/?context=user&count=0'

Let’s iterate over the keywords, collecting all matching accounts, and then remove duplicates from the resulting list by converting it to a set and back.

response_list = []
for query in queries_list:
    response = requests.get(url, params={
        'query': query
    }).json()
    response_list.extend(response['users'])
instagram_pages_list = []
for item in response_list:
    instagram_pages_list.append(item['user']['username'])
instagram_pages_list = list(set(instagram_pages_list))

Now we need to loop through the list of pages and request detailed information about each account that is not in the table yet. Create an instance of the Account class, passing the username as a parameter.
Then update the account information using the agent.update()
method. We will collect only the first 100 posts to keep the runtime reasonable. Next, create a list named media_list to store the post ids returned by the agent.get_media() method.


Collecting user media data

all_posts_list = []
username_count = 0
for username in instagram_pages_list:
    if client.execute(f"SELECT count(1) FROM users WHERE user_name='{username}'")[0][0] == 0:
        print('username:', username_count, '/', len(instagram_pages_list))
        username_count += 1
        account_total_likes = 0
        account_total_comments = 0
        try:
            account = Account(username)
        except Exception as E:
            print(E)
            continue
        try:
            agent.update(account)
        except Exception as E:
            print(E)
            continue
        if account.media_count < 100:
            post_count = account.media_count
        else:
            post_count = 100
        print(account, post_count)
        media_list, _ = agent.get_media(account, count=post_count, delay=1)
        count = 0

Because we need the total number of likes and comments before adding a new user to the database, we’ll collect the posts first. Almost all of the required fields belong to the Media class:


Collecting user posts

for media_code in media_list:
            if client.execute(f"SELECT count(1) FROM posts WHERE code='{media_code}'")[0][0] == 0:
                print('posts:', count, '/', len(media_list))
                count += 1

                post_insert_list = []
                post = Media(media_code)
                agent.update(post)
                post_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
                post_insert_list.append(str(post.owner))
                post_insert_list.append(post.id)
                if post.caption is not None:
                    post_insert_list.append(post.caption.replace("'","").replace('"', ''))
                else:
                    post_insert_list.append("")
                post_insert_list.append(post.code)
                post_insert_list.append(post.comments_count)
                post_insert_list.append(int(post.comments_disabled))
                post_insert_list.append(datetime.fromtimestamp(post.date).strftime('%Y-%m-%d %H:%M:%S'))
                post_insert_list.append(post.display_url)
                try:
                    post_insert_list.append(int(post.is_ad))
                except TypeError:
                    post_insert_list.append('cast(Null as Nullable(UInt8))')
                post_insert_list.append(int(post.is_album))
                post_insert_list.append(int(post.is_video))
                post_insert_list.append(post.likes_count)
                if post.location is not None:
                    post_insert_list.append(post.location)
                else:
                    post_insert_list.append('')
                post_insert_list.append(post.resources)
                if post.video_url is not None:
                    post_insert_list.append(post.video_url)
                else:
                    post_insert_list.append('')
                account_total_likes += post.likes_count
                account_total_comments += post.comments_count
                try:
                    client.execute(f'''
                        INSERT INTO posts VALUES {tuple(post_insert_list)}
                    ''')
                except Exception as E:
                    print('posts:')
                    print(E)
                    print(post_insert_list)

Store comments in the variable with the same name after calling the get_comments() method:


Collecting post comments

comments = agent.get_comments(media=post)
                for comment_id in comments[0]:
                    comment_insert_list = []
                    comment = Comment(comment_id)
                    comment_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
                    comment_insert_list.append(comment.id)
                    comment_insert_list.append(post.id)
                    comment_insert_list.append(str(comment.owner))
                    comment_insert_list.append(comment.text.replace("'","").replace('"', ''))
                    try:
                        client.execute(f'''
                            INSERT INTO comments VALUES {tuple(comment_insert_list)}
                        ''')
                    except Exception as E:
                        print('comments:')
                        print(E)
                        print(comment_insert_list)

Now that we have collected the user’s posts and comments, the account information can be added to the users table.


Collecting user data

user_insert_list = []
        user_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        user_insert_list.append(account.id)
        user_insert_list.append(account.username)
        user_insert_list.append(account.full_name)
        user_insert_list.append(account.base_url)
        user_insert_list.append(account.biography)
        user_insert_list.append(account.followers_count)
        user_insert_list.append(account.follows_count)
        user_insert_list.append(account.media_count)
        user_insert_list.append(account_total_comments)
        user_insert_list.append(account_total_likes)
        user_insert_list.append(int(account.is_verified))
        user_insert_list.append(int(account.country_block))
        user_insert_list.append(account.profile_pic_url)
        user_insert_list.append(account.profile_pic_url_hd)
        if account.fb_page is not None:
            user_insert_list.append(account.fb_page)
        else:
            user_insert_list.append('')
        try:
            client.execute(f'''
                INSERT INTO users VALUES {tuple(user_insert_list)}
            ''')
        except Exception as E:
            print('users:')
            print(E)
            print(user_insert_list)

Conclusion
To sum up, we have collected data on 500 users, with nearly 20K posts and 40K comments. Since the database will keep updating, we can write a simple query to get today’s top 10 most followed ML, AI & Data Science related accounts.

SELECT *
FROM users
ORDER BY followers_count DESC
LIMIT 10
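
If you prefer to explore the result in pandas, a minimal sketch using the same clickhouse_driver connection as in the collection script might look like this:

import pandas as pd
from clickhouse_driver import Client

# Sketch: pull the top 10 most followed accounts into a DataFrame
client = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')
rows = client.execute('''
    SELECT user_name, followers_count
    FROM users
    ORDER BY followers_count DESC
    LIMIT 10
''')
df_top = pd.DataFrame(rows, columns=['user_name', 'followers_count'])
print(df_top)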

And as a bonus, here is a list of the most interesting Instagram accounts on this topic:

  1. @ai_machine_learning
  2. @neuralnine
  3. @datascienceinfo
  4. @compscistuff
  5. @computersciencelife
  6. @welcome.ai
  7. @papa_programmer
  8. @data_science_learn
  9. @neuralnet.ai
  10. @techno_thinkers

View the code on GitHub


Future Data Conference Review

Estimated read time – 10 min

The Future Data Conference, which I happened to take part in, took place on September 8-9. In today’s post, I’d like to share my observations and thoughts about the ideas presented. Before we get started, I apologize for the poor quality of some images; I took the most meaningful screenshots straight from the video.

Featured Keynote: Automating Analysis
Speaker: Pat Hanrahan
The talk was given by the Stanford professor and Tableau co-founder and mostly touched on the use of AI in analytics. Pat discussed where we are now and today’s AI use cases. Although the talk itself was somewhat repetitive, the Q&A part turned out to be interesting.

The Modern Data Stack: Past, Present, and Future
Speaker: Tristan Handy
The creator of dbt and author of the well-known post that serves as a guide to data analytics for startup founders spoke about how the modern data stack has changed from 2012 to 2020. Personally, I think it was one of the best talks of the conference, since Tristan made predictions about growing trends and the future of the data stack.


Making Enterprise Data Timelier and More Reliable with Lakehouse Technology
Speaker: Matei Zaharia
This talk was given by the CTO of Databricks. Unfortunately, the audio had sound issues, but Matei covered the problems of the modern data lake while promoting Databricks’ new technology, Delta Lake. The talk was rather promotional but still interesting to listen to.


How to Close the Analytic Divide
Speaker: Alan Jacobson
The Chief Data Officer of Alteryx talked about the data scientist job and wage statistics, noting that the average salary of a data scientist is significantly higher than others in the field. By the way, our recent research with Roman Bunin also confirms this. Alan discussed the revenue of companies at different stages of analytical growth: companies with more advanced analytical approaches grow faster (a surprising fact). A separate part was focused on changes in modern approaches to working with data. Overall, it’s a great talk that was easy to listen to.


Hot Analytics — Handle with Care
Speaker: Gian Merlino
The co-founder and CTO of Imply compared hot and cold data (a nod to Snowflake?). He then demonstrated a BI tool with a simple drag-and-drop interface, talked about possible analytics architectures, and gave an overview of some Apache Druid features.



Analyzing Business Intelligence (BI) and Analytics job market in Tableau

Estimated read time – 13 min


According to the SimilarWeb rating, hh.ru is the third most popular job search website in the world. In one of my conversations with Roman Bunin, we came up with the idea of a joint project: collect data via the HeadHunter API for later analysis and visualization in Tableau Public. Our goal was to understand the relationship between salary and the skills specified in a job posting, and to compare how things are in Moscow, Saint Petersburg, and other regions.

Data Collection Process

Our scheme is based on fetching brief job descriptions returned by the GET /vacancies method. According to its structure, we need the following columns: vacancy type, id, vacancy rate (‘premium’), pre-employment testing (‘has_test’), company address, salary, work schedule, and so forth. We created the table with the following CREATE query:


Query for creating the vacancies_short table in ClickHouse

CREATE TABLE headhunter.vacancies_short
(
    `added_at` DateTime,
    `query_string` String,
    `type` String,
    `level` String,
    `direction` String,
    `vacancy_id` UInt64,
    `premium` UInt8,
    `has_test` UInt8,
    `response_url` String,
    `address_city` String,
    `address_street` String,
    `address_building` String,
    `address_description` String,
    `address_lat` String,
    `address_lng` String,
    `address_raw` String,
    `address_metro_stations` String,
    `alternate_url` String,
    `apply_alternate_url` String,
    `department_id` String,
    `department_name` String,
    `salary_from` Nullable(Float64),
    `salary_to` Nullable(Float64),
    `salary_currency` String,
    `salary_gross` Nullable(UInt8),
    `name` String,
    `insider_interview_id` Nullable(UInt64),
    `insider_interview_url` String,
    `area_url` String,
    `area_id` UInt64,
    `area_name` String,
    `url` String,
    `published_at` DateTime,
    `employer_url` String,
    `employer_alternate_url` String,
    `employer_logo_urls_90` String,
    `employer_logo_urls_240` String,
    `employer_logo_urls_original` String,
    `employer_name` String,
    `employer_id` UInt64,
    `response_letter_required` UInt8,
    `type_id` String,
    `type_name` String,
    `archived` UInt8,
    `schedule_id` Nullable(String)
)
ENGINE = ReplacingMergeTree
ORDER BY vacancy_id

The first script collects data from the HeadHunter website through the API and inserts it into our database, using the following libraries:

import requests
from clickhouse_driver import Client
from datetime import datetime
import pandas as pd
import re

Next, we load the search queries into a DataFrame and connect to the ClickHouse database:

queries = pd.read_csv('hh_data.csv')
client = Client(host='1.234.567.890', user='default', password='', port='9000', database='headhunter')

The queries DataFrame stores the list of our search queries and has the following columns: query type, level, career field, and search phrase. The last column may contain logical operators; for instance, we can require several keywords at once by putting logical ANDs between “Python”, “data”, and “analysis”.

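To make this concrete, here is a hypothetical example of what hh_data.csv could contain (the actual file is in the project repository; the search phrases use logical ANDs as described above):

import pandas as pd

# Hypothetical contents of hh_data.csv, for illustration only.
# Column names match the ones used in the script below.
queries = pd.DataFrame(
    [
        ['job', 'junior', 'data analyst', 'python AND data AND analysis'],
        ['job', 'middle', 'bi analyst', 'BI AND analyst'],
        ['job', 'senior', 'data engineer', 'python AND data AND engineer'],
    ],
    columns=['Query Type', 'Level', 'Career Field', 'Seach Phrase'],
)
queries.to_csv('hh_data.csv', index=False)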

The search results may not always match expectations: chefs, marketers, and administrators can accidentally end up in our database. To prevent this, we will write a check_name(name) function that accepts a vacancy title and returns True when it matches one of the unwanted patterns, so that such postings can be skipped.

def check_name(name):
    # Vacancy titles matching any of these patterns are irrelevant to our search
    bad_names = [r'курьер', r'грузчик', r'врач', r'менеджер по закупу',
                 r'менеджер по продажам', r'оператор', r'повар', r'продавец',
                 r'директор магазина', r'директор по продажам', r'директор по маркетингу',
                 r'кабельщик', r'начальник отдела продаж', r'заместитель', r'администратор магазина',
                 r'категорийный', r'аудитор', r'юрист', r'контент', r'супервайзер', r'стажер-ученик',
                 r'су-шеф', r'маркетолог$', r'региональный', r'ревизор', r'экономист', r'ветеринар',
                 r'торговый', r'клиентский', r'начальник цеха', r'территориальный', r'переводчик',
                 r'маркетолог /', r'маркетолог по']
    for item in bad_names:
        if re.match(item, name):
            return True
    return False

Moving further, we need a while loop to collect data non-stop. Iterate over the queries DataFrame, selecting the query type, level, career field, and search phrase columns. For each search phrase, send a GET request to get the number of result pages, then loop through the pages, sending the same request and appending the returned job descriptions to vacancies_from_response. We ask for 10 results per page in the per_page parameter. Since we don’t pass any value to the area field, the results are collected worldwide.

while True:
   for query_type, level, direction, query_string in zip(queries['Query Type'], queries['Level'], queries['Career Field'], queries['Seach Phrase']):
           print(f'search phrase: {query_string}')
           url = 'https://api.hh.ru/vacancies'
           par = {'text': query_string, 'per_page':'10', 'page':0}
           r = requests.get(url, params=par).json()
           added_at = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
           pages = r['pages']
           found = r['found']
           vacancies_from_response = []

           for i in range(0, pages + 1):
               par = {'text': query_string, 'per_page':'10', 'page':i}
               r = requests.get(url, params=par).json()
               try:
                   vacancies_from_response.append(r['items'])
               except Exception as E:
                   continue

Create a for loop to avoid duplicate rows in our table. First, send a query to the database to check whether a vacancy with the same id and search phrase already exists. If it doesn’t, we clean up the job title, pass it to check_name(), and skip the posting if the title matches one of the unwanted patterns.

for item in vacancies_from_response:
               for vacancy in item:
                   if client.execute(f"SELECT count(1) FROM vacancies_short WHERE vacancy_id={vacancy['id']} AND query_string='{query_string}'")[0][0] == 0:
                       name = vacancy['name'].replace("'","").replace('"','')
                       if check_name(name):
                           continue

Now we need to extract all the necessary data from a job description. The table will contain empty cells, since some data may be missing.


View the code for extracting job description data

vacancy_id = vacancy['id']
                       is_premium = int(vacancy['premium'])
                       has_test = int(vacancy['has_test'])
                       response_url = vacancy['response_url']
                       try:
                           address_city = vacancy['address']['city']
                           address_street = vacancy['address']['street']
                           address_building = vacancy['address']['building']
                           address_description = vacancy['address']['description']
                           address_lat = vacancy['address']['lat']
                           address_lng = vacancy['address']['lng']
                           address_raw = vacancy['address']['raw']
                           address_metro_stations = str(vacancy['address']['metro_stations']).replace("'",'"')
                       except TypeError:
                           address_city = ""
                           address_street = ""
                           address_building = ""
                           address_description = ""
                           address_lat = ""
                           address_lng = ""
                           address_raw = ""
                           address_metro_stations = ""
                       alternate_url = vacancy['alternate_url']
                       apply_alternate_url = vacancy['apply_alternate_url']
                       try:
                           department_id = vacancy['department']['id']
                       except TypeError as E:
                           department_id = ""
                       try:
                           department_name = vacancy['department']['name']
                       except TypeError as E:
                           department_name = ""
                       try:
                           salary_from = vacancy['salary']['from']
                       except TypeError as E:
                           salary_from = "cast(Null as Nullable(UInt64))"
                       try:
                           salary_to = vacancy['salary']['to']
                       except TypeError as E:
                           salary_to = "cast(Null as Nullable(UInt64))"
                       try:
                           salary_currency = vacancy['salary']['currency']
                       except TypeError as E:
                           salary_currency = ""
                       try:
                           salary_gross = int(vacancy['salary']['gross'])
                       except TypeError as E:
                           salary_gross = "cast(Null as Nullable(UInt8))"
                       try:
                           insider_interview_id = vacancy['insider_interview']['id']
                       except TypeError:
                           insider_interview_id = "cast(Null as Nullable(UInt64))"
                       try:
                           insider_interview_url = vacancy['insider_interview']['url']
                       except TypeError:
                           insider_interview_url = ""
                       area_url = vacancy['area']['url']
                       area_id = vacancy['area']['id']
                       area_name = vacancy['area']['name']
                       url = vacancy['url']
                       published_at = vacancy['published_at']
                       published_at = datetime.strptime(published_at,'%Y-%m-%dT%H:%M:%S%z').strftime('%Y-%m-%d %H:%M:%S')
                       try:
                           employer_url = vacancy['employer']['url']
                       except Exception as E:
                           print(E)
                           employer_url = ""
                       try:
                           employer_alternate_url = vacancy['employer']['alternate_url']
                       except Exception as E:
                           print(E)
                           employer_alternate_url = ""
                       try:
                           employer_logo_urls_90 = vacancy['employer']['logo_urls']['90']
                           employer_logo_urls_240 = vacancy['employer']['logo_urls']['240']
                           employer_logo_urls_original = vacancy['employer']['logo_urls']['original']
                       except Exception as E:
                           print(E)
                           employer_logo_urls_90 = ""
                           employer_logo_urls_240 = ""
                           employer_logo_urls_original = ""
                       employer_name = vacancy['employer']['name'].replace("'","").replace('"','')
                       try:
                           employer_id = vacancy['employer']['id']
                       except Exception as E:
                           print(E)
                       response_letter_required = int(vacancy['response_letter_required'])
                       type_id = vacancy['type']['id']
                       type_name = vacancy['type']['name']
                       is_archived = int(vacancy['archived'])

The last field is the work schedule. If a posting mentions a fly-in-fly-out schedule, it will be skipped.

try:
    schedule = vacancy['schedule']['id']
except Exception as E:
    print(E)
    schedule = ''
if schedule == 'flyInFlyOut':
    continue

Next, we build a list of the obtained variables, replacing None values with empty strings to avoid errors with ClickHouse, and insert it into the table.

# The order of values must match the column order of the vacancies_short table
vacancies_short_list = [added_at, query_string, query_type, level, direction, vacancy_id, is_premium, has_test, response_url,
                        address_city, address_street, address_building, address_description, address_lat, address_lng,
                        address_raw, address_metro_stations, alternate_url, apply_alternate_url, department_id, department_name,
                        salary_from, salary_to, salary_currency, salary_gross, name, insider_interview_id, insider_interview_url,
                        area_url, area_id, area_name, url, published_at, employer_url, employer_alternate_url,
                        employer_logo_urls_90, employer_logo_urls_240, employer_logo_urls_original, employer_name, employer_id,
                        response_letter_required, type_id, type_name, is_archived, schedule]
for index, item in enumerate(vacancies_short_list):
    if item is None:
        vacancies_short_list[index] = ""
tuple_to_insert = tuple(vacancies_short_list)
print(tuple_to_insert)
client.execute(f'INSERT INTO vacancies_short VALUES {tuple_to_insert}')

Connecting Tableau to the data source

Unfortunately, Tableau Public can’t work with databases directly, so we decided to sync our ClickHouse database to Google Sheets. With this in mind, we picked the following libraries: gspread and oauth2client for accessing the Google Sheets API, and schedule for task scheduling.

Refer to our previous article, where we used the Google Sheets API: Collecting Data on Ad Campaigns from VK.com.

import schedule
import time
from clickhouse_driver import Client
import gspread
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials
from datetime import datetime

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
client = Client(host='54.227.137.142', user='default', password='', port='9000', database='headhunter')
creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
gc = gspread.authorize(creds)

The update_sheet() function will transfer all data from Clickhouse to a Google Sheets table:

def update_sheet():
   print('Updating cell at', datetime.now())
   columns = []
   for item in client.execute('describe table headhunter.vacancies_short'):
       columns.append(item[0])
   vacancies = client.execute('SELECT * FROM headhunter.vacancies_short')
   df_vacancies = pd.DataFrame(vacancies, columns=columns)
   df_vacancies.to_csv('vacancies_short.csv', index=False)
   content = open('vacancies_short.csv', 'r').read()
   gc.import_csv('1ZWS2kqraPa4i72hzp0noU02SrYVo0teD7KZ0c3hl-UI', content.encode('utf-8'))

Using schedule to run our function every day at 1:00 PM (UTC):

schedule.every().day.at("13:00").do(update_sheet)
while True:
   schedule.run_pending()
   time.sleep(1)  # avoid busy-waiting between schedule checks

What’s the final point?

Roman created an informative dashboard based on this data: https://revealthedata.com/examples/hh/

He also made a YouTube video with a detailed explanation of the dashboard features.

Key Insights

  1. Data analysts specializing in BI are the most in demand on the job market, since this query returned the highest number of search results. However, the average salary is higher in Product Analyst and BI Analyst openings.
  2. Most of the postings were found in Moscow, where the average salary is 10-30K RUB higher than in Saint Petersburg and 30-40K RUB higher than in other regions.
  3. Top highly paid positions: Head of Analytics (110K RUB per month on avg.), Database Engineer (138K RUB per month), and Head of Machine Learning (250K RUB per month).
  4. The most valuable skills are a solid knowledge of Python with Pandas and NumPy, Tableau, Power BI, ETL, and Spark. Most of the postings containing these requirements were paid higher than the others. For Python programmers, it’s more valuable to have expertise with Matplotlib than with Plotly.

View the code on GitHub


How to build a dashboard with Bootstrap 4 from scratch (Part 1)

Estimated read time – 12 min

In previous articles, we reviewed Plotly’s Dash framework and learned how to build scatter plots and create a map visualization. This time we will put all the pieces together and design a dashboard layout using the Bootstrap 4 grid system.
To facilitate the development, we’ll use the dash-bootstrap-components library. This is a great tool that integrates Bootstrap into Dash, allowing us to write web pages in pure Python and add any Bootstrap components and styling.

Draft Layout

Before we begin coding, it’s crucial to have a plan of our app: a rough layout that helps us see the big picture and quickly modify the structure. We used draw.io to make the dashboard draft; this application lets you create diagrams, graphs, flowcharts, and forms at the click of a button. The dashboard will be built according to this template:


Like the dashboard itself, the top header will be colored gold and white, the main colors of Untappd. Just below the header there is a section with breweries, which includes a scatter plot and a control panel. At the bottom of the page there will be a map showing beverage ratings across the regions of Russia.

All right, let’s get started. First, create a new Python file named application.py, which will store all the front-end components of the dashboard, and a new directory named assets. The directory structure should look like this:

- application.py
- assets/
    |-- typography.css
    |-- header.css
    |-- custom-script.js
    |-- image.png

Then we import the libraries and initialize our application:

import dash
import dash_bootstrap_components as dbc
import dash_html_components as html
import dash_core_components as dcc
import pandas as pd
from get_ratio_scatter_plot import get_plot
from get_russian_map import get_map
from clickhouse_driver import Client
from dash.dependencies import Input, Output

standard_BS = dbc.themes.BOOTSTRAP
app = dash.Dash(__name__, external_stylesheets=[standard_BS])

Main parameters of the app:
__name__ — enables access to static files stored in the assets folder (such as images, CSS, and JS files);
external_stylesheets — external CSS styling; here we are using a standard Bootstrap theme, but you can create your own theme or use any of the available ones.

Hook up a few more things to work with local files and connect to the Clickhouse Database:

app.scripts.config.serve_locally = True
app.css.config.serve_locally = True

client = Client(host='ec2-3-16-148-63.us-east-2.compute.amazonaws.com',
                user='default',
                password='',
                port='9000',
                database='default')

Add a palette of colors:

colors = ['#ffcc00', 
          '#f5f2e8', 
          '#f8f3e3',
          '#ffffff', 
          ]

Creating a layout

All the dashboard elements will be placed within a Bootstrap container, which sits inside the <div> block:

- app 
    |-- div
     |-- container
      |-- logo&header
     |-- container
      |-- div
       |-- controls&scatter
       |-- map
app.layout = html.Div(
                    [
                        dbc.Container(

                                         < header>
                         
                        dbc.Container(       
                            html.Div(
                                [
                        
                                    < body >
                        
                                ],
                            ),
                            fluid=False, style={'max-width': '1300px'},
                        ),
                    ],
                    style={'background-color': colors[1], 'font-family': 'Proxima Nova Bold'},
                )

Here we set a fixed container width, the background color, and the font of the page, which is defined in typography.css in the assets folder. Let’s take a closer look at the first element in the div block: the top header with the Untappd logo:

logo = html.Img(src=app.get_asset_url('logo.png'),
                        style={'width': "128px", 'height': "128px",
                        }, className='inline-image')

and the header:

header = html.H3("Russian breweries stats from Untappd", style={'text-transform': "uppercase"})

We used Bootstrap Forms to position these two elements on the same level.

logo_and_header = dbc.FormGroup(
        [
            logo,
            html.Div(
                [
                    header
                ],
                className="p-5"
            )
        ],
        className='form-row',
)

The ‘p-5’ class name increases the padding and vertically aligns the title, while specifying ‘form-row’ as the form’s class name puts the logo and header in one row. At this point, the top header should look like this:


Now we need to center the elements and add some color. Create a separate container that takes up a single row and specify ‘d-flex justify-content-center’ in its className to center the content.

dbc.Container(
                    dbc.Row(
                        [
                            dbc.Col(
                                html.Div(
                                    logo_and_header,
                                ),
                            ),
                        ],
                        style={'max-height': '128px',
                               'color': 'white',
                       }

                    ),
                    className='d-flex justify-content-center',
                    style={'max-width': '100%',
                           'background-color': colors[0]},
                ),

And now the top header is done:


We’re approaching the main part: create the next Bootstrap container and add a subheading:

dbc.Container(
                    html.Div(
                        [
                            html.Br(),
                            html.H5("Breweries", style={'text-align':'center', 'text-transform': 'uppercase'}),
                            html.Hr(), # horizontal  break

The main body will consist of Bootstrap Cards; they provide a structured layout for all parts, giving each element a clear border and preserving white space. Create the next element, a control panel with sliders:

slider_day_values = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
slider_top_breweries_values = [5, 25, 50, 75, 100, 125, 150, 175, 200]

controls = dbc.Card(
    [
       dbc.CardBody(
           [
               dbc.FormGroup(
                    [
                        dbc.Label("Time Period", style={'text-align': 'center', 'font-size': '100%', 'text-transform': 'uppercase'}),
                        dcc.Slider(
                            id='slider-day',
                            min=1,
                            max=100,
                            step=10,
                            value=100,
                            marks={i: i for i in slider_day_values}
                        ),
                    ], style={'text-align': 'center'}
               ),
               dbc.FormGroup(
                    [
                        dbc.Label("Number of breweries", style={'text-align': 'center', 'font-size': '100%', 'text-transform': 'uppercase'}),
                        dcc.Slider(
                            id='slider-top-breweries',
                            min=5,
                            max=200,
                            step=5,
                            value=200,
                            marks={i: i for i in slider_top_breweries_values}
                        ),
                    ], style={'text-align': 'center'}
               ),
           ],
       )
    ],
    style={'height': '32.7rem', 'background-color': colors[3]}
)


The control panel consists of two sliders that change the view of the scatter plot; they are positioned one below the other in a Bootstrap Form. The sliders were put inside the dbc.CardBody block, and other elements will be added in the same way: this eliminates alignment problems and gives clear borders. By default, the sliders are blue, but we can easily customize them by changing the class properties in sliders.css. Add the control panel with the scatter plot as follows:

dbc.Row(
                [
                    dbc.Col(controls, width={"size": 4,
                                     "order": 'first',
                                             "offset": 0},
                     ),
                     dbc.Col(dbc.Card(
                                [
                                    dbc.CardBody(
                                        [
                                            html.H6("The ratio between the number of reviews and the average brewery rating",
                                                    className="card-title",
                                                    style={'text-transform': 'uppercase'}), 
                                            dcc.Graph(id='ratio-scatter-plot'),
                                        ],
                                    ),
                                ],
                                style={'background-color': colors[2], 'text-align':'center'}
                             ),
                     md=8),
                ],
                align="start",
                justify='center',
            ),
html.Br(),

And at the bottom of the page we will position the map:

html.H5("Venues and Regions", style={'text-align':'center', 'text-transform': 'uppercase',}),
                            html.Hr(), # horizontal  break
                            dbc.Row(
                                [
                                    dbc.Col(
                                        dbc.Card(
                                            [
                                                dbc.CardBody(
                                                    [
                                                        html.H6("Average beer rating across regions",
                                                                className="card-title",
                                                                style={'text-transform': 'uppercase'},
                                                        ),  
                                                        dcc.Graph(figure=get_map())
                                                    ],
                                                ),
                                            ],
                                        style={'background-color': colors[2], 'text-align': 'center'}
                                        ),
                                md=12),
                                ]
                            ),
                            html.Br(),

Callbacks in Dash

Callback functions make dashboard elements interactive through the Input and Output properties of a particular component.

@app.callback(
    Output('ratio-scatter-plot', 'figure'),
    [Input('slider-day', 'value'),
     Input('slider-top-breweries', 'value'),
     ]
)
def get_scatter_plots(n_days=100, top_n=200):
    if n_days == 100 and top_n == 200:
        df = pd.read_csv('data/ratio_scatter_plot.csv')
        return get_plot(n_days, top_n, df)
    else:
        return get_plot(n_days, top_n)

In this example, our inputs are the “value” properties of the components with the ids “slider-day” and “slider-top-breweries”. Our output is the “figure” property of the component with the id “ratio-scatter-plot”. When an input value changes, the decorated function is called automatically and the scatter plot is updated. Learn more about callbacks from the examples in the docs.
It’s worth noting that the scatter plot may not be displayed correctly when the page first loads. To avoid this, we specify its initial state and build the plot from the saved CSV file stored in the data folder. After that, whenever the slider values change, the data is taken directly from the ClickHouse tables.

Add a few more lines responsible for deployment and our app is ready to run:

application = app.server

if __name__ == '__main__':
    application.run(debug=True, port=8000)

Next, we need to deploy our app to AWS Elastic Beanstalk, and the first part of our Bootstrap dashboard is completed.

Thanks for reading the first part of our series about Bootstrap dashboards. In the next one, we are going to add new components and improved callbacks, and talk about tables in Bootstrap.

View the code on Github


Visualizing COVID-19 in Russia with Plotly

Estimated read time – 9 min

Maps are widely used in data visualization; they are a great tool for displaying statistics for areas, regions, and cities. Before displaying the map, we need to encode each region or other administrative unit: a choropleth map divides the territory into polygons and multipolygons with latitude and longitude coordinates. Plotly has a built-in solution for plotting choropleth maps of America and Europe, but Russia is not included yet. So we decided to use an existing GeoJSON file that maps the administrative regions of Russia and display the latest COVID-19 stats with Plotly.

from urllib.request import urlopen
import json
import requests
import numpy as np
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import plotly.graph_objects as go

Modifying GeoJSON

First, we need to download a public GeoJSON file with the boundaries of the federal subjects of Russia. The file already contains some information, such as region names, but it still doesn’t fit the required format and is missing region identifiers.

with urlopen('https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/russia.geojson') as response:
    counties = json.load(response)

Besides that, there are slight differences in naming. For example, on стопкоронавирус.рф, the site we are going to scrape data from, Bashkortostan is listed as “The Republic of Bashkortostan”, while in our GeoJSON file it’s simply named “Bashkortostan”. These differences should be eliminated to avoid confusion when joining the data. Also, the names should start with a capital letter.

regions_republic_1 = ['Бурятия', 'Тыва', 'Адыгея', 'Татарстан', 'Марий Эл',
                      'Чувашия', 'Северная Осетия – Алания', 'Алтай',
                      'Дагестан', 'Ингушетия', 'Башкортостан']
regions_republic_2 = ['Удмуртская республика', 'Кабардино-Балкарская республика',
                      'Карачаево-Черкесская республика', 'Чеченская республика']
for k in range(len(counties['features'])):
    counties['features'][k]['id'] = k
    if counties['features'][k]['properties']['name'] in regions_republic_1:
        counties['features'][k]['properties']['name'] = 'Республика ' + counties['features'][k]['properties']['name']
    elif counties['features'][k]['properties']['name'] == 'Ханты-Мансийский автономный округ - Югра':
        counties['features'][k]['properties']['name'] = 'Ханты-Мансийский АО'
    elif counties['features'][k]['properties']['name'] in regions_republic_2:
        counties['features'][k]['properties']['name'] = counties['features'][k]['properties']['name'].title()

Now let’s create a DataFrame from the resulting GeoJSON file with the regions of Russia, taking the identifiers and the names.

region_id_list = []
regions_list = []
for k in range(len(counties['features'])):
    region_id_list.append(counties['features'][k]['id'])
    regions_list.append(counties['features'][k]['properties']['name'])
df_regions = pd.DataFrame()
df_regions['region_id'] = region_id_list
df_regions['region_name'] = regions_list

As a result, we get a DataFrame with region_id and region_name columns.

Data Scraping

We need to scrape the data from the statistics table published on стопкоронавирус.рф.

Let’s use the Selenium library for this task: we need to navigate to the webpage and convert its source into a BeautifulSoup object.

driver = webdriver.Chrome()
driver.get('https://стопкоронавирус.рф/information/')
source_data = driver.page_source
soup = bs(source_data, 'lxml')

The region names are wrapped in <th> tags, while the latest data is stored in table cells, each defined with a <td> tag.

divs_data = soup.find_all('td')

The divs_data list now holds a flat sequence of all table cell values.

The data comes in one flat sequence that includes both new and active cases. Each region corresponds to five values: for Moscow these are the first five, for Moscow Region the next five, and so on. We can use this pattern to create five lists and populate them by position: the first value goes to the list of active cases, the second to the list of new cases, and so on. After every five values, the counter is reset.

# The five per-region metrics, in the order they appear in the table
sick_list, new_list, cases_list, healed_list, died_list = [], [], [], [], []

count = 1
for td in divs_data:
    if count == 1:
        sick_list.append(int(td.text))
    elif count == 2:
        new_list.append(int(td.text))
    elif count == 3:
        cases_list.append(int(td.text))
    elif count == 4:
        healed_list.append(int(td.text))
    elif count == 5:
        died_list.append(int(td.text))
        count = 0
    count += 1

The next step is to extract the region names from the table; they are stored within the col-region class. We also need to clean up the data by removing extra white spaces and line breaks.

divs_region_names = soup.find_all('th', {'class':'col-region'})
region_names_list = []
for i in range(1, len(divs_region_names)):
    region_name = divs_region_names[i].text
    region_name = region_name.replace('\n', '').replace('  ', '')
    region_names_list.append(region_name)

Create a DataFrame:

df = pd.DataFrame()
df['region_name'] = region_names_list
df['sick'] = sick_list
df['new'] = new_list
df['cases'] = cases_list
df['healed'] = healed_list
df['died'] = died_list

After reviewing our data once again, we detected a trailing white space in the region name at index 10. This should be fixed immediately, otherwise we may run into problems when merging.

df.loc[10, 'region_name'] = df[df.region_name == 'Челябинская область '].region_name.item().strip(' ')

Finally, we can merge our DataFrames on the region_name column, so that the resulting table includes a region id column, which is required to draw a choropleth map.

df = df.merge(df_regions, on='region_name')

Creating a choropleth map with Plotly

Let’s create a new figure and pass a Choroplethmapbox object to it. The geojson parameter accepts the counties variable with the GeoJSON data, and the region ids are assigned to locations. The z parameter represents the data to be color-coded; in this example we’re passing the number of new cases for each region. The region names are assigned to text. The colorscale parameter accepts a list of pairs, each with a value between 0 and 1 and an RGB color code; here the palette changes from green through yellow to red as the number of new cases grows. By passing the values stored in customdata we can customize the hovertemplate.

fig = go.Figure(go.Choroplethmapbox(geojson=counties,
                           locations=df['region_id'],
                           z=df['new'],
                           text=df['region_name'],
                           colorscale=[[0, 'rgb(34, 150, 79)'],
                                       [0.2, 'rgb(249, 247, 174)'],
                                       [0.8, 'rgb(253, 172, 99)'],
                                       [1, 'rgb(212, 50, 44)']],
                           colorbar_thickness=20,
                           customdata=np.stack([df['cases'], df['died'], df['sick'], df['healed']], axis=-1),
                           hovertemplate='<b>%{text}</b>'+ '<br>' +
                                         'New cases: %{z}' + '<br>' +
                                         'Active cases: %{customdata[0]}' + '<br>' +
                                         'Deaths: %{customdata[1]}' + '<br>' +
                                         'Total cases: %{customdata[2]}' + '<br>' +
                                         'Recovered: %{customdata[3]}' +
                                         '<extra></extra>',
                           hoverinfo='text, z'))

Let’s customize the map using a ready-made neutral template called carto-positron. Set the parameters and display the map:
mapbox_zoom: responsible for the zoom level;
mapbox_center: centers the map;
marker_line_width: border width (we removed the borders by setting this parameter to 0);
margin: usually set to 0 on all sides to make the map wider.

fig.update_layout(mapbox_style="carto-positron",
                  mapbox_zoom=1, mapbox_center = {"lat": 66, "lon": 94})
fig.update_traces(marker_line_width=0)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

And here is our map. According to the plot, the highest number of new cases per day is in Moscow: 608. That’s very high compared to the other regions, and especially to Nenets Autonomous Okrug, where the number is surprisingly low.

View the code on GitHub
