After the print(" Ready to load") statement in the main() function of load_data.py, we want to send the data stored in the newly created clean.csv file to the database. Then we want to update or add the appropriate row of the Postgres 'manifest' table to reflect that file's status in the database.
Load clean.csv into table
This assumes all necessary tests have passed (e.g. do_fields_match), but double-checking deal-breaker items doesn't hurt.
Things the object will need passed to it:
'manifest_row', 'meta' (memory version of meta.json), and 'csv_filename' (the path to the clean.csv file created by Cleaner), 'overwrite' (a boolean indicating whether preexisting entries of the manifest_row['unique_data_id'] should be deleted if they exist).
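For illustration, the four inputs above might look like the following. All values here are hypothetical examples, not real project data, and the exact shape of meta.json is an assumption:

```python
# Hypothetical example values for the four inputs the object needs.
manifest_row = {
    'unique_data_id': 'example_data_2017',   # hypothetical id, one per source file
    'destination_table': 'example_table',    # which SQL table the data goes into
}
meta = {'example_table': {'fields': []}}     # in-memory copy of meta.json (shape assumed)
csv_filename = 'example_clean.csv'           # path to the clean.csv produced by Cleaner
overwrite = False                            # delete preexisting rows for this id?
```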
- Use the passed 'manifest_row' dictionary's 'destination_table' key to decide which SQL table to put the data in.
- Check if the table already exists. If not, create it using the meta.json expected field names and data types.
- If the table does exist, check whether there are any records corresponding to manifest_row['unique_data_id']. If not, do nothing. If so and overwrite is True, drop those rows (not all rows). If so and overwrite is False, raise an error for load_data to handle.
- Use a passed string of the file path to decide which .csv file should be loaded (this will be provided by load_data.py)
- The Cleaner should have already added a new column to the clean.csv file called 'unique_data_id' and every row should match the value of manifest_row['unique_data_id']. Verify these match, raise an error if not.
- Use the 'COPY FROM' command in postgres, using a sqlAlchemy connection object, to copy the whole file in one command into the corresponding database table. This should append to the existing table, leaving any already existing entries in place.
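The loading steps above could be sketched roughly as below. This is a hedged sketch, not a settled implementation: function names are placeholders, the raw-SQL style assumes SQLAlchemy 1.x with a psycopg2 driver, and error handling is simplified.

```python
import csv

def verify_unique_data_id(csv_filename, expected_id):
    """Check that every row of clean.csv carries the manifest's unique_data_id."""
    with open(csv_filename, newline='') as f:
        for row in csv.DictReader(f):
            if row['unique_data_id'] != expected_id:
                raise ValueError("unique_data_id mismatch: {} != {}".format(
                    row['unique_data_id'], expected_id))

def load_csv(engine, manifest_row, csv_filename, overwrite=False):
    """Sketch: append clean.csv into the destination table via COPY FROM."""
    table = manifest_row['destination_table']
    uid = manifest_row['unique_data_id']
    verify_unique_data_id(csv_filename, uid)
    with engine.begin() as conn:  # SQLAlchemy 1.x-style connection
        existing = conn.execute(
            "SELECT 1 FROM {} WHERE unique_data_id = %s LIMIT 1".format(table),
            (uid,)).fetchone()
        if existing and not overwrite:
            raise RuntimeError("{} already loaded; pass overwrite=True".format(uid))
        if existing:
            # Drop only this unique_data_id's rows, never the whole table
            conn.execute("DELETE FROM {} WHERE unique_data_id = %s".format(table),
                         (uid,))
        # COPY appends, leaving rows for other unique_data_ids in place
        raw = conn.connection  # underlying psycopg2 connection
        with open(csv_filename) as f, raw.cursor() as cur:
            cur.copy_expert("COPY {} FROM STDIN WITH CSV HEADER".format(table), f)
```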
After the load of the new file is complete, we want to update the SQL manifest table so that it accurately reflects which files are currently in the database.
- If the 'manifest' table does not exist in SQL, create it.
- Use the passed 'manifest_row' (a dictionary) to provide the data for a single row of the manifest corresponding to this table.
- If there is already a row corresponding to this unique_data_id, delete the row first (or update it to match the new manifest_row values).
- In SQL, the manifest table has exactly the same data as manifest.csv, plus one additional column called 'status', which reflects whether that file has been loaded into the database. Use the passed 'status' value to populate it.
This function is called immediately after trying to load the clean.csv file. If the loading process for clean.csv raises an error, the status provided to this function will be 'error'; if it succeeds, it will be 'loaded'.
There are a few options for how to structure this functionality:
- Object oriented, with a root 'HISql' class (i.e. HousingInsightsSql) that is extended twice into 'ManifestSql' and 'DataSql' classes. This approach is consistent with how we currently do DataReader.
- Object oriented, with one big object that handles all of this relatively automatically. Might be easier for an end user if we abstract away the updating of the manifest; conversely, most users of this code should learn how we handle the manifest, so it might be better to make those methods transparent.
- Functional. An OK approach if whoever handles this is not object-oriented savvy, but we are probably better off doing as much as we can object oriented.
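The first option could be skeletoned as below. Method names and the constructor signature are placeholders, not a decided API; the point is just the HISql root class extended into DataSql and ManifestSql, mirroring the DataReader pattern:

```python
class HISql:
    """Root class: shared Postgres helpers (connections, table-existence checks)."""
    def __init__(self, meta, manifest_row, engine):
        self.meta = meta                  # in-memory meta.json
        self.manifest_row = manifest_row  # one row of manifest.csv, as a dict
        self.engine = engine              # SQLAlchemy engine

    def table_exists(self, table_name):
        # e.g. query information_schema.tables; left unimplemented in this sketch
        raise NotImplementedError

class DataSql(HISql):
    """Loads clean.csv into manifest_row['destination_table'] via COPY FROM."""
    def load_csv(self, csv_filename, overwrite=False):
        raise NotImplementedError

class ManifestSql(HISql):
    """Keeps the SQL manifest table in sync, including the added status column."""
    def update_row(self, status):
        raise NotImplementedError
```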
Note: there is currently a get_sql_manifest_row function in load_data.py. If we use an object-oriented approach, this function should be folded into the appropriate class, and load_data.main() should call that version instead (the call is currently located at logging.info("Preparing to load row {} from the manifest")).