Skip to content

Python DB-API

The Python Database API Specification provides a standard way for interacting with databases from Python (see PEP 249). The API allows programmers to manage connections, execute SQL statements, and process results in a consistent manner across different database systems. In today's lab, you will use Python to create a database from CSV files, execute SQL queries to analyze the data, and make changes to the data.

Heads up!

Be careful not to confuse the acronym "SOL" (Standards of Learning) with "SQL" (Structured Query Language). Thank Virginia for coming up with such a ridiculous acronym. 😄

Step 1: VDOE Data for Research

The Virginia Department of Education (VDOE) collects extensive data about K-12 public schools in Virginia. The data is anonymized, aggregated, and released to the public for a variety of purposes. For example, researchers can use data to find new ways to improve the K-12 education system. This lab will give you a taste of what that kind of work is like.

  1. Download the following CSV data files:

  2. Open these files in a spreadsheet app like Excel, Numbers, or LibreOffice Calc. Discuss the data with the students at your table and become familiar with the data. Ask questions about any columns or values you don't understand.

Acknowledgment

The original version of this data came from SOL Test Pass Rates & Other Results. We exported the Excel files as CSV so that you would not need to install additional Python libraries to read the data files.

Step 2: Create a Python Module

  1. Create a file named vode.py containing the following code:

    """Analyze SOL test pass rates and other results provided by VDOE.
    
    Author: TYPE YOUR NAME HERE
    Version: TYPE DUE DATE HERE
    """
    
    import csv
    import sqlite3
    
    # DO NOT EDIT THESE VARIABLES; use them in your functions.
    # Connect to the SQLite database file and create a cursor.
    con = sqlite3.connect("vdoe.db")
    cur = con.cursor()
    
    
    # TODO add your functions here (see the lab instructions)
    
    
    if __name__ == "__main__":
        pass
    
  2. Type your name and today's date in the docstring at the top.

  3. Make sure you can run the module without getting any errors.

    • When you run the module, the file vdoe.db will be created.

Step 3: Drop and Create Tables

  1. Define a function named create_tables(), and call the function from the __main__ block.

    • Note: pass is a placeholder so that incomplete code will compile and run. In your existing __main__ block, replace pass with create_tables().
    def create_tables():
        pass
    
    
    if __name__ == "__main__":
        create_tables()
    
  2. In the create_tables() function, use cur.execute() to execute the following CREATE TABLE statements. After each CREATE TABLE statement, call con.commit() to save the changes to disk.

    CREATE TABLE participation (
        div_num integer,
        div_name text,
        sch_num integer,
        sch_name text,
        sch_type text,
        low_grade text,
        high_grade text,
        subject text,
        subgroup text,
        number_tested integer,
        number_students integer,
        part_rate_2324 real
    )
    
    CREATE TABLE test_results (
        div_num integer,
        div_name text,
        sch_num integer,
        sch_name text,
        sch_type text,
        low_grade text,
        high_grade text,
        subject text,
        grade_level text,
        test_name text,
        pass_rate_2122 integer,
        pass_rate_2223 integer,
        pass_rate_2324 integer,
        adv_rate_2122 integer,
        adv_rate_2223 integer,
        adv_rate_2324 integer
    )
    
  3. Use cur.execute() to run the following DROP TABLE statements before the CREATE TABLE statements. That way, the database will be rebuilt every time the vdoe.py module is run.

    DROP TABLE IF EXISTS participation
    
    DROP TABLE IF EXISTS test_results
    
  4. Run vdoe.py multiple times to make sure the tables are dropped and created without error.

Step 4: Import Data from CSV

  1. Define a function named import_data(), and call the function from the __main__ block.

    def import_data():
        pass
    
    
    if __name__ == "__main__":
        create_tables()
        import_data()
    
  2. Add the following code snippet to your import_data() function. Discuss the meaning/purpose of each line with the students at your table. Ask questions about any lines you don't understand.

    with open("School-Participation_rates_23_24.csv", newline="") as file:
        reader = csv.reader(file)
        next(reader)  # skip header row
        for row in reader:
            cur.execute("INSERT INTO participation "
                        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", row)
        con.commit()
    
  3. Copy and modify the code snippet to read from the other CSV file and insert into the other table. You will need to change the filename, the table name, and the number of question marks.

  4. Run vdoe.py multiple times to make sure the CSV files are read and the data is inserted without error.

Step 5: Exploratory Queries

  1. Open vdoe.db using DB Browser. Click the "Browse Data" tab and verify that data exists in both tables:

    • participation should have 56,063 rows
    • test_results should have 18,440 rows
  2. Click the "Execute SQL" tab and run the following query:

    SELECT *
    FROM participation
    WHERE subject = 'Mathematics'
      AND subgroup = 'All Students'
    ORDER BY div_num, sch_num
    
  3. Compare the results with the original data in the participation table. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means.

  4. Write the following queries in SQL. Have the instructor check your code.

    1. For each school division, show the total number of students tested and the total number of students for Mathematics. The first line of the query should look like this:

      SELECT div_num, div_name, sum(number_tested), sum(number_students)
      

    2. Show the total number of students tested and total number of students state-wide for Mathematics. The query results should have one row, with 636,589 students tested and 648,262 total students.

  5. Open a new SQL tab (click the toolbar button or press Ctrl+T). Then run this query:

    SELECT *
    FROM test_results
    WHERE sch_type = 'High'
      AND test_name = 'Algebra II'
    ORDER BY div_num, sch_num
    
  6. Compare the results with the original data in the test_results table. Discuss with the students at your table what the results show. Be able to explain, in plain English, what the query means.

  7. Write the following queries in SQL. Have the instructor check your code.

    1. For each school division, show the minimum and maximum pass rates for Algebra II at high schools in 2023–24. Also show the number of high schools. The first line of the query should look like this:

      SELECT div_num, div_name, min(pass_rate_2324), max(pass_rate_2324), count(sch_num)
      

    2. Modify the previous query to show the same results state-wide. Add the following condition to the WHERE clause to filter out '<' values and blank results:

      AND typeof(pass_rate_2324) = 'integer'
      

Step 6: Query About a School

  1. Go back to vdoe.py, and add the following new functions and test code:

    def query_school(div_num, sch_num):
        pass
    
    
    def update_school(div_num, sch_num):
        pass
    
    
    if __name__ == "__main__":
        create_tables()
        import_data()
        div_num = input("Enter division number: ")
        sch_num = input("Enter school number: ")
        query_school(div_num, sch_num)
        update_school(div_num, sch_num)
    
  2. Run the program to make sure the code works. Use division and school numbers of your choice. For example, Harrisonburg High School is division number 113 and school number 12.

  3. Add the following code to the query_school() function. Run the program to see the results.

    # Query the participation rates
    res = cur.execute("""
        SELECT div_name, sch_name, subject, number_tested, number_students
        FROM participation
        WHERE div_num = ?
            AND sch_num = ?
            AND subgroup = 'All Students'
        """, (div_num, sch_num))
    data = res.fetchall()
    print()
    print(f"Results for {data[0][1]}, {data[0][0]}")
    print()
    for row in data:
        print(f"Subject: {row[2]}, Tested: {row[3]}/{row[4]}")
    
  4. Write a similar block of code (in Python) to query the SOL test pass rates for the school. The first line of the query should be:

    SELECT subject, grade_level, test_name, pass_rate_2324
    

    The results can be printed this way:

    print()
    for row in data:
        if row[3] is not None and row[3] != "<":
            print(f"{row[0]}, {row[1]}, {row[2]}, 2023-24 Pass Rate: {row[3]}")
    print()
    

Step 7: Change a School's Data

Okay this last step is somewhat a joke, but it's good practice writing an UPDATE statement.

  1. Add the following lines to update_school():

        print("Hacking database...", end="")
        # TODO add your code here
        print("success!\nHave a nice day.")
    
  2. Replace the TODO comment with an UPDATE statement that changes pass_rate_2324 and adv_rate_2324 to 100 for the given school. Don't forget to call commit() at the end to save the changes to disk. Verify the changes were made using DB Browser to SELECT the test results for the school.