TDM 20100: Project 11 — 2023
Motivation: Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so, we perform "joins"! In this project we will explore, learn about, and practice using joins on our database. The database has many tables, so the benefit of using joins will become obvious.
Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
Scope: SQL, sqlite, joins
Dataset(s)
For this project, we will be using the lahman
sqlite database. This database contains the data in the directory
-
/anvil/projects/tdm/data/lahman
You may get some more lahman
database information from this youtube video youtube.com/watch?v=tS_-oTbsDzs
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
Questions
Question 1 (2 pts)
In the previous project, you already learned how to get data from a single table.
Now that we are learning about joins, so that we will have the ability to make much more interesting queries!
You may get more information on joins here: the-examples-book.com/programming-languages/SQL/joins |
Table AVG = H/AB A batting average is an indicator that shows a batter’s ability to produce offensively. You may get more batting average information from Wikipedia: en.wikipedia.org/wiki/Batting_average |
-
Please find the 10 players with the lowest batting average for the year 2022. Use the batting table and INNER JOIN with the people table to get players' first name and last name. The output will contain following fields: playerID, player’s first name, player’s last name, and their battingAverage.
Question 2 (2 pts)
When considering the batting average, pitchers often have a significantly lower bating average, because they are not trained hitters. To focus on regular batters, pitchers need to be excluded.
-
Use the
appearances
table, to find out the players who are pitchers for the year 2022. -
Return to the query from Question 1, but this time, use a subquery to exclude the pitchers from Question 2a.
Pitchers have field G_p>0 in appearances table |
Question 3 (2 points)
In question 2, instead of using a sub query, we can use a left join to accomplish the same task.
-
Modify your query from question 2, to use a left join (instead of a sub query). The goal is the same as question 2b, namely: to get the 10 players (who are not pitchers!) with lowest batting average.
Question 4 (2 pts)
-
Write another query, to find out what is the average batting average for all players (exclude pitchers) in year 2022.
Project 11 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project11.ipynb
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |