Python DB-API
The Python Database API Specification provides a standard way for interacting with databases from Python (see PEP 249). The API allows programmers to manage connections, execute SQL statements, and process results in a consistent manner across different database systems. In today's lab, you will use Python to create a database from CSV files, execute SQL queries to analyze the data, and make changes to the data.
Heads up!
Be careful not to confuse the acronym "SOL" (Standards of Learning) with "SQL" (Structured Query Language).
Thank Virginia for coming up with such a ridiculous acronym.
Step 1: VDOE Data for Research¶
The Virginia Department of Education (VDOE) collects extensive data about K-12 public schools in Virginia. The data is anonymized, aggregated, and released to the public for a variety of purposes. For example, researchers can use data to find new ways to improve the K-12 education system. This lab will give you a taste of what that kind of work is like.
-
Download the following CSV data files:
- School-Participation_rates_23_24 – number of students tested in 2023–24 (by school, subject, and subgroup)
- School_Test_by_level_2023_2024 – pass rates for SOL tests taken in 2021–24 (by school, grade level, and test)
-
Open these files in a spreadsheet app like Excel, Numbers, or LibreOffice Calc. Discuss the data with the students at your table and become familiar with the data. Ask questions about any columns or values you don't understand.
Acknowledgment
The original version of this data came from SOL Test Pass Rates & Other Results. We exported the Excel files as CSV so that you would not need to install additional Python libraries to read the data files.
Step 2: Create a Python Module¶
-
Create a file named
vdoe.pycontaining the following code:"""Analyze SOL test pass rates and other results provided by VDOE. Author: TYPE YOUR NAME HERE Version: TYPE DUE DATE HERE """ import csv import sqlite3 # DO NOT EDIT THESE VARIABLES; use them in your functions. # Connect to the SQLite database file and create a cursor. con = sqlite3.connect("vdoe.db") cur = con.cursor() # TODO add your functions here (see the lab instructions) if __name__ == "__main__": pass -
Type your name and today's date in the docstring at the top.
-
Make sure you can run the module without getting any errors.
- When you run the module, the file
vdoe.dbwill be created.
- When you run the module, the file
Step 3: Drop and Create Tables¶
-
Define a function named
create_tables(), and call the function from the__main__block.- Note:
passis a placeholder so that incomplete code will compile and run. In your existing__main__block, replacepasswithcreate_tables().
def create_tables(): pass if __name__ == "__main__": create_tables() - Note:
-
In the
create_tables()function, usecur.execute()to execute the followingCREATE TABLEstatements. After eachCREATE TABLEstatement, callcon.commit()to save the changes to disk.CREATE TABLE participation ( div_num integer, div_name text, sch_num integer, sch_name text, sch_type text, low_grade text, high_grade text, subject text, subgroup text, number_tested integer, number_students integer, part_rate_2324 real )CREATE TABLE test_results ( div_num integer, div_name text, sch_num integer, sch_name text, sch_type text, low_grade text, high_grade text, subject text, grade_level text, test_name text, pass_rate_2122 integer, pass_rate_2223 integer, pass_rate_2324 integer, adv_rate_2122 integer, adv_rate_2223 integer, adv_rate_2324 integer ) -
Use
cur.execute()to run the followingDROP TABLEstatements before theCREATE TABLEstatements. That way, the database will be rebuilt every time thevdoe.pymodule is run.DROP TABLE IF EXISTS participationDROP TABLE IF EXISTS test_results -
Run
vdoe.pymultiple times to make sure the tables are dropped and created without error.
Step 4: Import Data from CSV¶
-
Define a function named
import_data(), and call the function from the__main__block.def import_data(): pass if __name__ == "__main__": create_tables() import_data() -
Add the following code snippet to your
import_data()function. Discuss the meaning/purpose of each line with the students at your table. Ask questions about any lines you don't understand.with open("School-Participation_rates_23_24.csv", newline="") as file: reader = csv.reader(file) next(reader) # skip header row for row in reader: cur.execute("INSERT INTO participation " "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", row) con.commit() -
Copy and modify the code snippet to read from the other CSV file and insert into the other table. You will need to change the filename, the table name, and the number of question marks.
-
Run
vdoe.pymultiple times to make sure the CSV files are read and the data is inserted without error.
Step 5: Exploratory Queries¶
-
Open
vdoe.dbusing DB Browser. Click the "Browse Data" tab and verify that data exists in both tables:participationshould have 56,063 rowstest_resultsshould have 18,440 rows
-
Click the "Execute SQL" tab and run the following query:
SELECT * FROM participation WHERE subject = 'Mathematics' AND subgroup = 'All Students' ORDER BY div_num, sch_num -
Compare the results with the original data in the
participationtable. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means. -
Write the following queries in SQL. Have the instructor check your code.
-
For each school division, show the total number of students tested and the total number of students for Mathematics. The first line of the query should look like this:
SELECT div_num, div_name, sum(number_tested), sum(number_students) -
Show the total number of students tested and total number of students state-wide for Mathematics. The query results should have one row, with 636,589 students tested and 648,262 total students.
-
-
Open a new SQL tab (click the toolbar button or press Ctrl+T). Then run this query:
SELECT * FROM test_results WHERE sch_type = 'High' AND test_name = 'Algebra II' ORDER BY div_num, sch_num -
Compare the results with the original data in the
test_resultstable. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means. -
Write the following queries in SQL. Have the instructor check your code.
-
For each school division, show the minimum and maximum pass rates for Algebra II at high schools in 2023–24. Also show the number of high schools. The first line of the query should look like this:
SELECT div_num, div_name, min(pass_rate_2324), max(pass_rate_2324), count(sch_num) -
Modify the previous query to show the same results state-wide. Add the following condition to the
WHEREclause to filter out'<'values and blank results:AND typeof(pass_rate_2324) = 'integer'
-
Step 6: Query About a School¶
-
Go back to
vdoe.py, and add the following new functions and test code:def query_school(div_num, sch_num): pass def update_school(div_num, sch_num): pass if __name__ == "__main__": create_tables() import_data() div_num = input("Enter division number: ") sch_num = input("Enter school number: ") query_school(div_num, sch_num) update_school(div_num, sch_num) -
Run the program to make sure the code works. Use division and school numbers of your choice. For example, Harrisonburg High School is division number 113 and school number 12.
-
Add the following code to the
query_school()function. Run the program to see the results.# Query the participation rates res = cur.execute(""" SELECT div_name, sch_name, subject, number_tested, number_students FROM participation WHERE div_num = ? AND sch_num = ? AND subgroup = 'All Students' """, (div_num, sch_num)) data = res.fetchall() print() print(f"Results for {data[0][1]}, {data[0][0]}") print() for row in data: print(f"Subject: {row[2]}, Tested: {row[3]}/{row[4]}") -
Write a similar block of code (in Python) to query the SOL test pass rates for the school. The first line of the query should be:
SELECT subject, grade_level, test_name, pass_rate_2324The results can be printed this way:
print() for row in data: if row[3] is not None and row[3] != "<": print(f"{row[0]}, {row[1]}, {row[2]}, 2023-24 Pass Rate: {row[3]}") print()
Step 7: Change a School's Data¶
Okay this last step is somewhat a joke, but it's good practice writing an UPDATE statement.
-
Add the following lines to
update_school():print("Hacking database...", end="") # TODO add your code here print("success!\nHave a nice day.") -
Replace the
TODOcomment with anUPDATEstatement that changespass_rate_2324andadv_rate_2324to 100 for the given school. Don't forget to callcommit()at the end to save the changes to disk. Verify the changes were made using DB Browser toSELECTthe test results for the school.
Submission
Upload your vdoe.py file to Gradescope by the end of the day.