Handling Time Series Window Functions in Data Science Interviews

Data scientists manage time series data on a daily basis, and being able to manipulate and analyze this data is a required part of the job. SQL window functions allow you to do just this and are a common data science interview topic. So let’s talk about what time series data is, when to use window functions, and how to implement them to help manage time series data.
What Is Time Series Data?
Time series data are variables within your data that have a time component. This means that each value in this attribute has either a date or a time value, and sometimes both. Here are some examples of time series data:

• The daily stock price for companies because each stock price is associated with a specific day
• The daily average stock index value over the last few years because each value is mapped to a specific day
• Unique visits to a website over a month
• Platform registrations each day
• Monthly sales and revenue
• Daily logins for an app
LAG and LEAD Window Functions
When handling time series data, a common task is to calculate growth or averages over time. This means that you’ll need to grab either a future or a past date and its associated values.

Two window functions that allow you to accomplish this are LAG and LEAD, which are extremely useful for dealing with time-related data. The main difference between them is that LAG gets data from previous rows, while LEAD does the opposite and fetches data from following rows.
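
To make the difference concrete, here is a minimal sketch that runs LAG and LEAD side by side. It assumes Python’s built-in sqlite3 module with a SQLite build that supports window functions (3.25 or later); the monthly_revenue table and its figures are invented for illustration and are not part of the Airbnb question below.

```python
import sqlite3

# Hypothetical monthly revenue figures, invented for illustration only.
rows = [("2023-01", 100), ("2023-02", 120), ("2023-03", 90)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)", rows)

query = """
SELECT month,
       revenue,
       LAG(revenue, 1)  OVER (ORDER BY month) AS prev_revenue,  -- value from the previous row
       LEAD(revenue, 1) OVER (ORDER BY month) AS next_revenue   -- value from the following row
FROM monthly_revenue
ORDER BY month
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The first month has no previous row and the last month has no following row, so those cells come back as NULL (None in Python).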

We can use either one of the two functions to compare month-over-month growth, for example. As a data analytics professional, you are very likely to work on time-related data, and being able to use LAG or LEAD efficiently will make you a more productive data scientist.
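
As a rough illustration of why either function works for this comparison, the sketch below computes the month-over-month change twice: once by looking back in ascending order, and once by looking forward in descending order. It assumes Python’s sqlite3 module with window function support, and the monthly_revenue table is a made-up example.

```python
import sqlite3

# Hypothetical monthly revenue figures, invented for illustration only.
rows = [("2023-01", 100), ("2023-02", 120), ("2023-03", 90)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)", rows)

query = """
SELECT month,
       -- look back one row when months are sorted ascending
       revenue - LAG(revenue)  OVER (ORDER BY month)      AS change_via_lag,
       -- look ahead one row when months are sorted descending
       revenue - LEAD(revenue) OVER (ORDER BY month DESC) AS change_via_lead
FROM monthly_revenue
ORDER BY month
"""
changes = conn.execute(query).fetchall()
for row in changes:
    print(row)
```

Both columns hold the same values, because the row that follows in descending order is the row that precedes in ascending order.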

A Data Science Interview Question That Requires A Window Function
Let’s go through an advanced data science SQL interview question dealing with this window function. Window functions commonly appear in interview questions, but you’ll also see them a lot in your daily work, so it’s important to know how to use them.

Let’s go through one question from Airbnb called growth of Airbnb. If you want to follow along interactively, you can do so here.

The question is to calculate the growth of Airbnb each year using the number of hosts registered as the growth metric. The rate of growth is calculated by taking ((number of hosts registered in the current year - number of hosts registered in the previous year) / the number of hosts registered in the previous year) * 100.

Output the year, number of hosts in the current year, number of hosts in the previous year, and the rate of growth. Round the rate of growth to the nearest percent and order the result in ascending order by year.
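
Before writing any SQL, the growth formula from the question can be checked by hand. The helper below is a hypothetical Python sketch (the name growth_rate and the sample counts are assumptions, not part of the question), and it returns None when there is no previous year to compare against.

```python
def growth_rate(current_hosts, previous_hosts):
    """Percentage growth versus the previous year, rounded to a whole percent.

    Returns None when there is no usable previous-year count.
    """
    if previous_hosts is None or previous_hosts == 0:
        return None
    return round((current_hosts - previous_hosts) / previous_hosts * 100)


# Made-up host counts, used only to exercise the formula.
print(growth_rate(150, 100))
print(growth_rate(90, 120))
print(growth_rate(5, None))
```

Note that Python’s round sends exact halves to the nearest even integer, so borderline values may be rounded differently by a database.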
Approach Step 1: Count the hosts for the current year
The first step is to count hosts by year, so we’ll need to extract the year from the date values.

SELECT extract(year FROM host_since::date) AS year,
       count(id) AS current_year_host
FROM airbnb_search_details
WHERE host_since IS NOT NULL
GROUP BY extract(year FROM host_since::date)
ORDER BY year
Approach Step 2: Count the hosts for the previous year
This is where you’ll use the LAG window function. Here you’ll create a view with the year, the number of hosts in that year, and the number of hosts from the previous year. LAG takes the previous year’s count and puts it in the same row as the current year’s count, so the view has three columns: year, current year host count, and previous year host count. Having all the values on one row makes it easy to implement a metric like a growth rate. Here’s the code for it:

SELECT year,
       current_year_host,
       LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
FROM
  (SELECT extract(year FROM host_since::date) AS year,
          count(id) AS current_year_host
   FROM airbnb_search_details
   WHERE host_since IS NOT NULL
   GROUP BY extract(year FROM host_since::date)) t1
ORDER BY year
Approach Step 3: Implement the growth metric
As mentioned earlier, it’s much easier to implement a metric like the one below when all the values are on one row. This is why you perform the LAG function. Implement the growth rate calculation as round(((current_year_host - prev_year_host)/(cast(prev_year_host AS numeric)))*100) AS estimated_growth. The first year has no previous year, so LAG returns NULL there and the growth rate for that row is NULL as well.

SELECT year,
       current_year_host,
       prev_year_host,
       round(((current_year_host - prev_year_host)/(cast(prev_year_host AS numeric)))*100) AS estimated_growth
FROM
  (SELECT year,
          current_year_host,
          LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
   FROM
     (SELECT extract(year FROM host_since::date) AS year,
             count(id) AS current_year_host
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY extract(year FROM host_since::date)) t1) t2
ORDER BY year
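
To try the three steps end to end without a database server, the sketch below runs an equivalent query through Python’s sqlite3 module. Several things here are assumptions rather than part of the original solution: the sample rows are invented, SQLite’s strftime and CAST stand in for the extract(year FROM ...) and ::date syntax used above, and the nested subqueries are written as CTEs purely for readability.

```python
import sqlite3

# Hypothetical rows standing in for airbnb_search_details (id, host_since).
hosts = [
    (1, "2010-03-15"), (2, "2010-11-02"),
    (3, "2011-01-20"), (4, "2011-06-30"), (5, "2011-09-09"),
    (6, "2012-02-14"), (7, "2012-04-01"), (8, "2012-05-23"),
    (9, "2012-08-08"), (10, "2012-10-31"), (11, "2012-12-25"),
    (12, None),  # missing host_since, removed by the WHERE clause
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airbnb_search_details (id INTEGER, host_since TEXT)")
conn.executemany("INSERT INTO airbnb_search_details VALUES (?, ?)", hosts)

query = """
WITH yearly AS (                      -- Step 1: hosts registered per year
    SELECT CAST(strftime('%Y', host_since) AS INTEGER) AS year,
           COUNT(id) AS current_year_host
    FROM airbnb_search_details
    WHERE host_since IS NOT NULL
    GROUP BY strftime('%Y', host_since)
),
with_prev AS (                        -- Step 2: pull last year's count onto the row
    SELECT year,
           current_year_host,
           LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
    FROM yearly
)
SELECT year,                          -- Step 3: growth rate as a whole percent
       current_year_host,
       prev_year_host,
       CAST(ROUND(((current_year_host - prev_year_host)
                   / CAST(prev_year_host AS REAL)) * 100) AS INTEGER) AS estimated_growth
FROM with_prev
ORDER BY year
"""
growth = conn.execute(query).fetchall()
for row in growth:
    print(row)
```

With these made-up rows, the first year has no previous count, so both prev_year_host and estimated_growth come back as NULL for that row.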



