Adding features to the residential/condo model is now a somewhat lengthy process, with steps across many repositories and domains. We should document this process on the wiki so it's easier to follow for future features. Write a how-to article that outlines each step of the process, including any details and caveats.
I am trying to create a sample of sales from the Cook County Assessor's Open Data portal for sales ratio studies. In the SOP on sales ratio studies, you have:
Properties with known characteristic changes. Properties known to have undergone physical and/or legal characteristic changes between the time of sale and assessment are excluded.
Special properties. Some residential properties classified as 'Single-Family' are valued by the 'Special Properties' division of the Valuations Department. These are excluded from the sales ratio study.
It is unclear to me how to identify these properties in the sales data, or which fields from another dataset I could join in to identify these sales.
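Once a source for these flags is identified, the exclusion itself is a straightforward anti-join. A minimal sketch, assuming we can build a set of PINs flagged as special properties or as having characteristic changes (the field names and PINs below are made up, and the real source field is exactly what this issue is asking about):

```python
# Hypothetical sketch: exclude flagged sales from the study sample.
# All field names and PINs below are placeholders.

sales = [
    {"pin": "01-23-456-789-0000", "sale_price": 250_000},
    {"pin": "11-22-333-444-0000", "sale_price": 410_000},
]

# PINs flagged for exclusion, e.g. joined in from a characteristics
# or Valuations dataset once we know which one to use
excluded_pins = {"11-22-333-444-0000"}

# Anti-join: keep only sales whose PIN is not flagged
study_sample = [sale for sale in sales if sale["pin"] not in excluded_pins]
```

The same pattern works as an anti-join in SQL or `dplyr::anti_join()` once the flag's source table is known.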
Create an architecture diagram that shows the general structure of the department's data architecture. It should give newcomers an idea of how data flows for the processes the Data Department is responsible for.
See old GitLab issue. This issue needs to be updated to reflect current data cataloguing plans. (Summer 2023)
We should consolidate all of our disparate data catalogues, inventories, and trackers into a single Excel sheet. I've created a template of what should be included:
The Data Department's data is now spread over multiple locations/servers. We need to create a short directory that shows what is stored where. Include (at least) the following locations:
The Mission, Vision, and Values section of the handbook hasn't been touched in a while. We may want to revisit this section to make sure it aligns with where the Department is headed.
Some specific edits that should be made:
Trim down the number of values. It's hard to embody values when you can't even remember them all. We should pare back to the ones that really matter and collapse similar ones. Something like the social rules of the Recurse Center might be more useful.
Create a (super) high-level architecture diagram for the CCAO as a whole. Include only major components and data flows, with a focus on how the Data Department interacts with the rest of the organization.
We should write a Standard Operating Procedure (SOP) codifying how and why we select final valuation model runs. This is mostly about formalizing and documenting the best practices and making sure that we're implementing them internally.
Collect examples of similar SOPs from other departments/companies
Collect best practices re: model selection (see the Tidymodels docs, Max Kuhn's writing, and other predictive modeling resources)
Seek feedback from Valuations on any proposed changes
Publish the SOP to the wiki, with a link from the README
Currently, only @wrridgeway heavily uses and maintains the Open Data portal. We should add a wiki article that outlines how to update data assets, add new columns, rewrite data notes, etc.
DVC can be a little confusing when starting out. While the DVC documentation is robust, it can be opaque for those just dipping their toes in. It would be nice for us to have a small guide to help folks understand the basics.
We should create a list of any Data Department-specific accounts, including their login, who maintains credentials, who primarily uses them, and the account purpose. So far, I can think of a few accounts:
PyPI
Cook County Data Portal
Reetro
draw.io
This excludes personal accounts tied to an organization, e.g. GitHub.
As documented in DyfanJones/noctua#96, the noctua R package tries to delete results from the results S3 bucket after retrieving them. However, our read-only AWS accounts aren't permissioned to delete things from S3, resulting in a 403 error after every query.
This behavior can be disabled by enabling noctua's result caching: `noctua_options(cache_size = 10)`
We should document this flag in the How-To/Connect-to-AWS-Resources.md doc.
Now that the Data Department is growing, we should create a short document outlining coding best practices in the office. This should include things like:
Styling and linting practices for different languages
Pre-commit standards and practices
Code review / PR practices
Standards for setting user permissions
Steps:
Move coding standards from handbook into separate SOP
Update the onboarding issue template in ccao-data/people to include link to this standard
We already have R-based examples of how to query Athena using noctua. We should add a short section describing the setup and packages needed to query Athena from Python (probably using pyathena).
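As a starting point for that section, here is a hedged sketch of what the Python setup might look like. It assumes pyathena and pandas are installed (`pip install pyathena pandas`); the staging bucket, region, and table name are placeholders, not our actual resource names:

```python
def query_athena(sql: str):
    """Run a query against Athena and return the results as a pandas DataFrame."""
    # Imports live inside the function only to keep this sketch self-contained;
    # in real code they belong at module level.
    from pyathena import connect
    from pyathena.pandas.util import as_pandas

    conn = connect(
        s3_staging_dir="s3://your-athena-results-bucket/",  # placeholder bucket
        region_name="us-east-1",  # placeholder region
    )
    cursor = conn.cursor()
    cursor.execute(sql)
    return as_pandas(cursor)


# Example usage (requires valid AWS credentials):
# df = query_athena("SELECT * FROM some_db.some_table LIMIT 10")
```

pyathena picks up AWS credentials the same way as the rest of boto3 (environment variables, `~/.aws/credentials`, etc.), so the existing credential setup from the R instructions should carry over.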