-
[캐글] Nested and Repeated Data 정답<Kaggle-Course> 2023. 3. 13. 18:49728x90
Introduction
Now that you know how to query nested and repeated data, you're ready to draw interesting insights from the GitHub Repos dataset.
Before you get started, run the following cell to set everything up.
# Set up feedback system from learntools.core import binder binder.bind(globals()) from learntools.sql_advanced.ex3 import * print("Setup Complete")
Exercises
1) Who had the most commits in 2016?
GitHub is the most popular place to collaborate on software projects. A GitHub repository (or repo) is a collection of files associated with a specific project, and a GitHub commit is a change that a user has made to a repository. We refer to the user as a committer.
The sample_commits table contains a small sample of GitHub commits, where each row corresponds to different commit. The code cell below fetches the table and shows the first five rows of this table.
from google.cloud import bigquery # Create a "Client" object client = bigquery.Client() # Construct a reference to the "github_repos" dataset dataset_ref = client.dataset("github_repos", project="bigquery-public-data") # API request - fetch the dataset dataset = client.get_dataset(dataset_ref) # Construct a reference to the "sample_commits" table table_ref = dataset_ref.table("sample_commits") # API request - fetch the table sample_commits_table = client.get_table(table_ref) # Preview the first five lines of the table client.list_rows(sample_commits_table, max_results=5).to_dataframe()
# Print information on all the columns in the table sample_commits_table.schema
Write a query to find the individuals with the most commits in this table in 2016. Your query should return a table with two columns:
- committer_name - contains the name of each individual with a commit (from 2016) in the table
- num_commits - shows the number of commits the individual has in the table (from 2016)
Sort the table, so that people with more commits appear first.
NOTE: You can find the name of each committer and the date of the commit under the "committer" column, in the "name" and "date" child fields, respectively.
# Write a query to find the answer max_commits_query = """ SELECT committer.name AS committer_name, COUNT(1) AS num_commits FROM `bigquery-public-data.github_repos.sample_commits` WHERE EXTRACT(YEAR FROM committer.date) = 2016 GROUP BY committer_name ORDER BY num_commits DESC """ # Check your answer q_1.check()
2) Look at languages!
Now you will work with the languages table. Run the code cell below to print the first few rows.
# Construct a reference to the "languages" table table_ref = dataset_ref.table("languages") # API request - fetch the table languages_table = client.get_table(table_ref) # Preview the first five lines of the table client.list_rows(languages_table, max_results=5).to_dataframe()
Each row of the languages table corresponds to a different repository.
- The "repo_name" column contains the name of the repository,
- the "name" field in the "language" column contains the programming languages that can be found in the repo, and
- the "bytes" field in the "language" column has the size of the files (in bytes, for the corresponding language).
Run the following code cell to print the table schema.
# Print information on all the columns in the table languages_table.schema
Assume for the moment that you have access to a table called sample_languages that contains only a very small subset of the rows from the languages table: in fact, it contains only three rows! This table is depicted in the image below.
# Fill in the blank num_rows = 6 # Check your answer q_2.check()
3) What's the most popular programming language?
Write a query to leverage the information in the languages table to determine which programming languages appear in the most repositories. The table returned by your query should have two columns:
- language_name - the name of the programming language
- num_repos - the number of repositories in the languages table that use the programming language
Sort the table so that languages that appear in more repos are shown first.
728x90'<Kaggle-Course>' 카테고리의 다른 글
캐글 intro_to_sql/02-select-from-where (0) 2023.03.15 캐글 intro_to_sql/03-group-by-having-count (0) 2023.03.15 캐글 intro_to_sql/05-as-with (0) 2023.03.15 캐글 intro_to_sql/04-order-by (0) 2023.03.15 캐글 intro_to_sql/06-joining-data (0) 2023.03.15