Adding features to the residential/condo model is now a somewhat lengthy process, with steps across many repositories and domains. We should document this process on the wiki so it's easier to follow for future features. Write a how-to article that outlines each step of the process, including any details and caveats.
I am trying to create a sample of sales from the Cook County Assessor's Open Data portal for sales ratio studies. In the SOP on sales ratio studies, you have:
Properties with known characteristic changes. Properties known to have undergone physical and/or legal characteristic changes between the time of sale and assessment are excluded.
Special properties. Some residential properties classified as 'Single-Family' are valued by the 'Special Properties' division of the Valuations Department. These are excluded from the sales ratio study.
It is unclear to me how to identify these properties in the sales data, or which fields from another dataset I could join in to identify these sales.
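Once a source for these flags is identified, the exclusion itself is a straightforward anti-join. A minimal sketch, assuming we can build a set of PINs flagged as special properties or as having characteristic changes (the field names and PINs below are made up, and the real source field is exactly what this issue is asking about):

```python
# Hypothetical sketch: exclude flagged sales from the study sample.
# All field names and PINs below are placeholders.

sales = [
    {"pin": "01-23-456-789-0000", "sale_price": 250_000},
    {"pin": "11-22-333-444-0000", "sale_price": 410_000},
]

# PINs flagged for exclusion, e.g. joined in from a characteristics
# or Valuations dataset once we know which one to use
excluded_pins = {"11-22-333-444-0000"}

# Anti-join: keep only sales whose PIN is not flagged
study_sample = [sale for sale in sales if sale["pin"] not in excluded_pins]
```

The same pattern works as an anti-join in SQL or `dplyr::anti_join()` once the flag's source table is known.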
Create an architecture diagram that shows the general structure of the department's data architecture. It should give newcomers an idea of how data flows for the processes the Data Department is responsible for.
See old GitLab issue. This issue needs to be updated to reflect current data cataloguing plans. (Summer 2023)
We should consolidate all of our disparate data catalogues, inventories, and trackers into a single Excel sheet. I've created a template of what should be included:
The Data Department's data is now spread over multiple locations/servers. We need to create a short directory that shows what is stored where. Include (at least) the following locations:
The Mission, Vision, and Values section of the handbook hasn't been touched in a while. We may want to revisit this section to make sure it aligns with where the Department is headed.
Some specific edits that should be made:
Trim down the number of values. It's hard to embody values when you can't even remember them all. We should pare back to the ones that really matter and collapse similar ones. Something like the social rules of the Recurse Center might be more useful.
Create a (super) high-level architecture diagram for the CCAO as a whole. Include only major components and data flows, with a focus on how the Data Department interacts with the rest of the organization.
We should write a Standard Operating Procedure (SOP) codifying how and why we select final valuation model runs. This is mostly about formalizing and documenting the best practices and making sure that we're implementing them internally.
Collect examples of similar SOPs from other departments/companies
Collect best practices re: model selection (see the Tidymodels docs, Max Kuhn's writing, and other predictive modeling resources)
Seek feedback from Valuations on any proposed changes
Publish the SOP to the wiki, with a link from the README
Currently, only @wrridgeway heavily uses and maintains the Open Data portal. We should add a wiki article that outlines how to update data assets, add new columns, rewrite data notes, etc.
DVC can be a little confusing when starting out. While the DVC documentation is robust, it can be opaque for those just dipping their toes in. It would be nice for us to have a small guide to help folks understand the basics.
We should create a list of any Data Department-specific accounts, including their login, who maintains credentials, who primarily uses them, and the account purpose. So far, I can think of a few accounts:
PyPI
Cook County Data Portal
Reetro
draw.io
This excludes personal accounts tied to an organization, e.g. GitHub.
As documented in DyfanJones/noctua#96, the noctua R package tries to delete results from the results S3 bucket after retrieving them. However, our read-only AWS accounts aren't permissioned to delete things from S3, resulting in a 403 error after every query.
This behavior can be disabled by enabling noctua's result caching: `noctua_options(cache_size = 10)`
We should document this flag in the How-To/Connect-to-AWS-Resources.md doc.
Now that the Data Department is growing, we should create a short document outlining coding best practices in the office. This should include things like:
Styling and linting practices for different languages
Pre-commit standards and practices
Code review / PR practices
Standards for setting user permissions
Steps:
Move coding standards from handbook into separate SOP
Update the onboarding issue template in ccao-data/people to include link to this standard
We already have R-based examples of how to query Athena using noctua. We should add a short section describing the setup and packages needed to query Athena from Python (probably using pyathena).
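As a starting point for that section, here is a hedged sketch of what the Python setup might look like. It assumes pyathena and pandas are installed (`pip install pyathena pandas`); the staging bucket, region, and table name are placeholders, not our actual resource names:

```python
def query_athena(sql: str):
    """Run a query against Athena and return the results as a pandas DataFrame."""
    # Imports live inside the function only to keep this sketch self-contained;
    # in real code they belong at module level.
    from pyathena import connect
    from pyathena.pandas.util import as_pandas

    conn = connect(
        s3_staging_dir="s3://your-athena-results-bucket/",  # placeholder bucket
        region_name="us-east-1",  # placeholder region
    )
    cursor = conn.cursor()
    cursor.execute(sql)
    return as_pandas(cursor)


# Example usage (requires valid AWS credentials):
# df = query_athena("SELECT * FROM some_db.some_table LIMIT 10")
```

pyathena picks up AWS credentials the same way as the rest of boto3 (environment variables, `~/.aws/credentials`, etc.), so the existing credential setup from the R instructions should carry over.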