Python DB-API
The Python Database API Specification provides a standard way for interacting with databases from Python (see PEP 249). The API allows programmers to manage connections, execute SQL statements, and process results in a consistent manner across different database systems. In today's lab, you will use Python to create a database from CSV files, execute SQL queries to analyze the data, and make changes to the data.
Heads up!
Be careful not to confuse the acronym "SOL" (Standards of Learning) with "SQL" (Structured Query Language). Thank Virginia for coming up with such a ridiculous acronym.
Step 1: VDOE Data for Research¶
The Virginia Department of Education (VDOE) collects extensive data about K-12 public schools in Virginia. The data is anonymized, aggregated, and released to the public for a variety of purposes. For example, researchers can use data to find new ways to improve the K-12 education system. This lab will give you a taste of what that kind of work is like.
-
Download the following CSV data files:
- School-Participation_rates_23_24 – number of students tested in 2023–24 (by school, subject, and subgroup)
- School_Test_by_level_2023_2024 – pass rates for SOL tests taken in 2021–24 (by school, grade level, and test)
-
Open these files in a spreadsheet app like Excel, Numbers, or LibreOffice Calc. Discuss the data with the students at your table and become familiar with the data. Ask questions about any columns or values you don't understand.
Acknowledgment
The original version of this data came from SOL Test Pass Rates & Other Results. We exported the Excel files as CSV so that you would not need to install additional Python libraries to read the data files.
Step 2: Create a Python Module¶
-
Create a file named
vode.py
containing the following code:"""Analyze SOL test pass rates and other results provided by VDOE. Author: TYPE YOUR NAME HERE Version: TYPE DUE DATE HERE """ import csv import sqlite3 # DO NOT EDIT THESE VARIABLES; use them in your functions. # Connect to the SQLite database file and create a cursor. con = sqlite3.connect("vdoe.db") cur = con.cursor() # TODO add your functions here (see the lab instructions) if __name__ == "__main__": pass
-
Type your name and today's date in the docstring at the top.
-
Make sure you can run the module without getting any errors.
- When you run the module, the file
vdoe.db
will be created.
- When you run the module, the file
Step 3: Drop and Create Tables¶
-
Define a function named
create_tables()
, and call the function from the__main__
block.- Note:
pass
is a placeholder so that incomplete code will compile and run. In your existing__main__
block, replacepass
withcreate_tables()
.
def create_tables(): pass if __name__ == "__main__": create_tables()
- Note:
-
In the
create_tables()
function, usecur.execute()
to execute the followingCREATE TABLE
statements. After eachCREATE TABLE
statement, callcon.commit()
to save the changes to disk.CREATE TABLE participation ( div_num integer, div_name text, sch_num integer, sch_name text, sch_type text, low_grade text, high_grade text, subject text, subgroup text, number_tested integer, number_students integer, part_rate_2324 real )
CREATE TABLE test_results ( div_num integer, div_name text, sch_num integer, sch_name text, sch_type text, low_grade text, high_grade text, subject text, grade_level text, test_name text, pass_rate_2122 integer, pass_rate_2223 integer, pass_rate_2324 integer, adv_rate_2122 integer, adv_rate_2223 integer, adv_rate_2324 integer )
-
Use
cur.execute()
to run the followingDROP TABLE
statements before theCREATE TABLE
statements. That way, the database will be rebuilt every time thevdoe.py
module is run.DROP TABLE IF EXISTS participation
DROP TABLE IF EXISTS test_results
-
Run
vdoe.py
multiple times to make sure the tables are dropped and created without error.
Step 4: Import Data from CSV¶
-
Define a function named
import_data()
, and call the function from the__main__
block.def import_data(): pass if __name__ == "__main__": create_tables() import_data()
-
Add the following code snippet to your
import_data()
function. Discuss the meaning/purpose of each line with the students at your table. Ask questions about any lines you don't understand.with open("School-Participation_rates_23_24.csv", newline="") as file: reader = csv.reader(file) next(reader) # skip header row for row in reader: cur.execute("INSERT INTO participation " "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", row) con.commit()
-
Copy and modify the code snippet to read from the other CSV file and insert into the other table. You will need to change the filename, the table name, and the number of question marks.
-
Run
vdoe.py
multiple times to make sure the CSV files are read and the data is inserted without error.
Step 5: Exploratory Queries¶
-
Open
vdoe.db
using DB Browser. Click the "Browse Data" tab and verify that data exists in both tables:participation
should have 56,063 rowstest_results
should have 18,440 rows
-
Click the "Execute SQL" tab and run the following query:
SELECT * FROM participation WHERE subject = 'Mathematics' AND subgroup = 'All Students' ORDER BY div_num, sch_num
-
Compare the results with the original data in the
participation
table. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means. -
Write the following queries in SQL. Have the instructor check your code.
-
For each school division, show the total number of students tested and the total number of students for Mathematics. The first line of the query should look like this:
SELECT div_num, div_name, sum(number_tested), sum(number_students)
-
Show the total number of students tested and total number of students state-wide for Mathematics. The query results should have one row, with 636,589 students tested and 648,262 total students.
-
-
Open a new SQL tab (click the toolbar button or press Ctrl+T). Then run this query:
SELECT * FROM test_results WHERE sch_type = 'High' AND test_name = 'Algebra II' ORDER BY div_num, sch_num
-
Compare the results with the original data in the
test_results
table. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means. -
Write the following queries in SQL. Have the instructor check your code.
-
For each school division, show the minimum and maximum pass rates for Algebra II at high schools in 2023–24. Also show the number of high schools. The first line of the query should look like this:
SELECT div_num, div_name, min(pass_rate_2324), max(pass_rate_2324), count(sch_num)
-
Modify the previous query to show the same results state-wide. Add the following condition to the
WHERE
clause to filter out'<'
values and blank results:AND typeof(pass_rate_2324) = 'integer'
-
Step 6: Query About a School¶
-
Go back to
vdoe.py
, and add the following new functions and test code:def query_school(div_num, sch_num): pass def update_school(div_num, sch_num): pass if __name__ == "__main__": create_tables() import_data() div_num = input("Enter division number: ") sch_num = input("Enter school number: ") query_school(div_num, sch_num) update_school(div_num, sch_num)
-
Run the program to make sure the code works. Use division and school numbers of your choice. For example, Harrisonburg High School is division number 113 and school number 12.
-
Add the following code to the
query_school()
function. Run the program to see the results.# Query the participation rates res = cur.execute(""" SELECT div_name, sch_name, subject, number_tested, number_students FROM participation WHERE div_num = ? AND sch_num = ? AND subgroup = 'All Students' """, (div_num, sch_num)) data = res.fetchall() print() print(f"Results for {data[0][1]}, {data[0][0]}") print() for row in data: print(f"Subject: {row[2]}, Tested: {row[3]}/{row[4]}")
-
Write a similar block of code (in Python) to query the SOL test pass rates for the school. The first line of the query should be:
SELECT subject, grade_level, test_name, pass_rate_2324
The results can be printed this way:
print() for row in data: if row[3] is not None and row[3] != "<": print(f"{row[0]}, {row[1]}, {row[2]}, 2023-24 Pass Rate: {row[3]}") print()
Step 7: Change a School's Data¶
Okay this last step is somewhat a joke, but it's good practice writing an UPDATE
statement.
-
Add the following lines to
update_school()
:print("Hacking database...", end="") # TODO add your code here print("success!\nHave a nice day.")
-
Replace the
TODO
comment with anUPDATE
statement that changespass_rate_2324
andadv_rate_2324
to 100 for the given school. Don't forget to callcommit()
at the end to save the changes to disk. Verify the changes were made using DB Browser toSELECT
the test results for the school.