RevoU SQL Assignment
This project is the application of skills that have been learned in a course. We were given datasets and instructions to follow. And here are the results:
Intermediate projects
Dataset overview:
For Task 1 - 6: San Francisco Ford GoBike Share
San Francisco Ford GoBike, managed by Motivate, provides the Bay Area’s bike share system. Bike share is a convenient, healthy, affordable, and fun form of transportation. It involves a fleet of specially designed bikes that are locked into a network of docking stations. Bikes can be unlocked from one station and returned to any other station in the system. People use bike share to commute to work or school, run errands, get to appointments, and more. The dataset contains trip data from 2013-2018, including start time, end time, start station, end station, and latitude/longitude for each station.
SQL code:
For Task 1 - 5: see the code here.
Task:
-
Create a query to get the average amount of duration (in minutes) per month. Please use the start date from 2014 - 2017.
Expected output: month, average (in minute)
Code:SELECT DATE_TRUNC(date(start_date), MONTH) AS months, ROUND(AVG(duration_sec/60), 2) AS avg_duration_in_minutes FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` WHERE EXTRACT(YEAR FROM start_date) BETWEEN 2014 AND 2017 GROUP BY months ORDER BY months ASC;
Result: (Top 5 data)
-
Create a query to get total trips and total number of unique bikes grouped by region name. Please used the start date from 2014 - 2017.
Expected output: region name, total trips, total bikes
Code:SELECT br.name AS region_name, COUNT(bt.trip_id) AS total_trips, COUNT(DISTINCT bt.bike_number) AS total_bikes FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` bt LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` si ON bt.start_station_id = si.station_id LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` br ON si.region_id = br.region_id WHERE EXTRACT(YEAR FROM bt.start_date) BETWEEN 2014 AND 2017 GROUP BY region_name ORDER BY region_name ASC;
Result:
-
Find the youngest and oldest age of the members, for each gender. Assume this year is 2022.
Expected output: gender, youngest age, oldest age
Code:WITH member_gender_age AS ( SELECT member_gender, 2022-member_birth_year AS age FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` WHERE member_birth_year IS NOT NULL) SELECT member_gender, MIN(age) AS youngest_age, MAX(age) AS oldest_age FROM member_gender_age GROUP BY member_gender;
Result:
-
Get the latest departure trip in each region with detail below
Expected output: trip_id, duration_sec, start_date, start_station_name, member_gender, region_name
Code:WITH bikeshare_data AS( SELECT bt.trip_id AS trip_id, bt.start_date AS start_date, bt.start_station_name AS start_station_name, bt.duration_sec AS duration_sec, bt.member_gender AS member_gender, si.station_id AS station_id, br.name AS region_name, DENSE_RANK() OVER(PARTITION BY br.name ORDER BY start_date DESC) AS `rank` FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` bt LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` si ON bt.start_station_id = si.station_id LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` br ON si.region_id = br.region_id WHERE EXTRACT(YEAR FROM bt.start_date) BETWEEN 2014 AND 2017 ORDER BY rank ASC) SELECT region_name, trip_id, start_date, start_station_name, duration_sec, member_gender FROM bikeshare_data WHERE region_name IS NOT NULL AND `rank` = 1;
Result:
-
Create a query to get month to date of total trips in each region, breakdown by date. Please use timeframe from November 2017 until December 2017
Expected output: start date (in date format), region name, total trips (in cumulative)
Code:WITH obs AS( SELECT DATE(bt.start_date) AS start_date, EXTRACT(DAY FROM start_date) AS day, EXTRACT(MONTH FROM start_date) AS month, EXTRACT(YEAR FROM start_date) AS year, AS region_name, COUNT(bt.trip_id) AS total_trips, FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` bt LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` si ON bt.start_station_id = si.station_id LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` br ON si.region_id = br.region_id WHERE br.name IS NOT NULL AND EXTRACT(MONTH FROM start_date) IN (11,12) AND EXTRACT(YEAR FROM start_date) = 2017 GROUP BY region_name, year, month, day, start_date ORDER BY region_name, year, month, day, start_date) SELECT start_date, region_name, SUM(total_trips) OVER(PARTITION BY region_name, month ORDER BY day ASC) `total_trips_cum` FROM obs ORDER BY region_name, year, month, day, start_date;
Result: (Top 5 data)
Advance projects
Dataset overview:
For Task 7: Hacker News.
This dataset contains all stories and comments from Hacker News from its launch in 2006 to present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.
SQL code:
For Task 6 & 7: see the code here.
Task:
-
Find monthly growth of trips in percentage, ordered by time descendingly. Only for trips from the region that has the highest total number of trips.
Please use the start date from 2014 - 2017.
Expected output: month, region, growth (%)
Code:WITH all_data AS( SELECT br.region_id AS region_id, br.name AS region_name, bt.trip_id AS trip_id, bt.start_date AS star_date, EXTRACT(YEAR FROM start_date) AS year, EXTRACT(MONTH FROM start_date) AS month, FORMAT_DATE("%Y-%m", DATE(start_date)) AS month_formatted FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips` bt LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` si ON bt.start_station_id = si.station_id LEFT JOIN `bigquery-public-data.san_francisco_bikeshare.bikeshare_regions` br ON si.region_id = br.region_id WHERE EXTRACT(YEAR FROM start_date) BETWEEN 2014 AND 2017 ), number_of_trip_per_region AS( SELECT region_id, region_name, COUNT(trip_id) AS total_trips FROM all_data GROUP BY region_id, region_name ), region_rank AS( SELECT region_id, region_name, total_trips,DENSE_RANK() OVER(ORDER BY total_trips DESC) AS `rank` FROM number_of_trip_per_region ), highest_trip_region AS( SELECT region_id, region_name FROM region_rank WHERE `rank` = 1 ), total_trip_by_year_month_region AS( SELECT year, month, month_formatted, all_data.region_name, COUNT(trip_id) AS current_trips FROM all_data INNER JOIN highest_trip_region ON all_data.region_id = highest_trip_region.region_id WHERE all_data.region_name IS NOT NULL GROUP BY all_data.region_name, year, month, month_formatted ORDER BY all_data.region_name, year, month, month_formatted ), current_prev_trips_2014_2015_2016 AS( SELECT *, LAG(current_trips) OVER(ORDER BY month_formatted ASC) AS prev_trips, FROM total_trip_by_year_month_region WHERE region_name IS NOT NULL AND year IN(2014,2015,2016) GROUP BY region_name, year, month, month_formatted, current_trips ORDER BY region_name, year DESC, month DESC ), current_prev_trips_2017 AS( SELECT *, LAG(current_trips) OVER(PARTITION BY year ORDER BY month ASC) AS prev_trips, FROM total_trip_by_year_month_region WHERE region_name IS NOT NULL AND year = 2017 GROUP BY region_name, year, month, month_formatted, current_trips ORDER BY region_name, year, month DESC ), growth_2014_2015_2016 AS( SELECT month_formatted, region_name, current_trips, prev_trips, ROUND(((current_trips-prev_trips)/prev_trips)*100,2) AS growth FROM current_prev_trips_2014_2015_2016 GROUP BY region_name, year, month, month_formatted, current_trips, prev_trips ORDER BY region_name, year DESC, month DESC ), growth_2017 AS( SELECT month_formatted, region_name, current_trips, prev_trips, ROUND(((current_trips-prev_trips)/prev_trips)*100,2) AS growth FROM current_prev_trips_2017 GROUP BY region_name, year, month, month_formatted, current_trips, prev_trips ORDER BY region_name, year DESC, month DESC ) SELECT month_formatted, region_name, growth FROM growth_2014_2015_2016 UNION ALL SELECT month_formatted, region_name, growth FROM growth_2017 ORDER BY month_formatted DESC;
Result: (Top 5 data)
-
Please use table "stories" to create monthly retention cohorts (the groups, or cohort, can be define based upon the data that a user/author started a story)
and then how many of them (%) coming back for the following months in 2014. After analysing the retention cohort, is there any interesting insight that we can get?
(Notes: initial start date can be defined using first story start date from each author in table stories)
Expected output: first post month, active post month, number of users
Code:WITH cohort_items AS( SELECT `by` AS user_id, MIN(DATE(DATE_TRUNC(time_ts,MONTH))) AS cohort_month, FROM `bigquery-public-data.hacker_news.stories` GROUP BY 1 ), user_activities AS( SELECT act.`by` AS user_id, DATE_DIFF( DATE(DATE_TRUNC(time_ts,MONTH)), cohort.cohort_month, MONTH ) AS month_number, FROM `bigquery-public-data.hacker_news.stories` act LEFT JOIN cohort_items cohort ON act.`by` = cohort.user_id WHERE EXTRACT(YEAR FROM cohort.cohort_month) = 2014 AND EXTRACT (YEAR FROM act.time_ts) = 2014 GROUP BY 1,2 ), cohort_size AS( SELECT cohort_month, COUNT(*) AS num_users FROM cohort_items GROUP BY 1 ORDER BY 1 ), retention_table AS( SELECT c.cohort_month, a.month_number, COUNT(*) AS num_users FROM user_activities a LEFT JOIN cohort_items c ON a.user_id = c.user_id GROUP BY 1,2 ) SELECT FORMAT_DATE("%Y-%m", DATE(b.cohort_month)) as cohort_month, s.num_users as total_users, b.month_number, b.num_users as active_users_in_a_month, CAST(b.num_users AS decimal)/s.num_users AS percentage FROM retention_table b LEFT JOIN cohort_size s ON b.cohort_month = s.cohort_month WHERE b.cohort_month IS NOT NULL ORDER BY 1,3;
Result: (Top 5 data)
Result: (Cohort)
In the cohort table, we can see that around 85.5% of users have not posted again in the second month. And from 14.5% of users who are still posting in the second month, only 10.4% are still posting in the third month or there is a 4.1% decrease in users posting in the third month. The decline in the first two months was the largest compared to the decrease in users who were still actively posting in the following months. To find out the causes and how to increase the percentage of user retention, we need to do a deeper analysis.