matttriano / analytics_data_where_house Goto Github PK

An analytics engineering sandbox focusing on real estates prices in Cook County, IL

Home Page: https://docs.analytics-data-where-house.dev/

License: GNU Affero General Public License v3.0

Dockerfile 0.49% Python 97.64% Makefile 1.19% Shell 0.68%

airflow data-engineering data-pipelines data-warehousing dbt elt python docker mkdocs-material open-source data-catalog data-discovery data-platform superset

analytics_data_where_house's People

Contributors

Stargazers

Watchers

analytics_data_where_house's Issues

Implement a first_startup makefile target to handle all manual one-time initialization steps

PR #18 significantly reduces the burden on a potential user, but there are still some manual tasks that they might miss if they don't read all the way through the Usage section.

The manual tasks so far are

execute make build ,
execute make init_airflow,
set up connection to data warehouse database (this actually has to happen before the following DAG runs, which is not reflected in the current README.md instructions),
run DAG ensure_metadata_table_exists,
run DAG ensure_data_raw_schema_exists,
execute make update_dbt_packages to initially install dbt packages (~~mainly~~ exclusively dbt_utils)

Regarding the connection bit, I think I can even integrate that into the make_credentials makefile target.

Implement a strategy for dropping "temp_" tables

The "temp_" tables are useful when developing expectations, but outside of that, they will just be recreated every time fresh data is ingested.

I see two possible courses of action and I'm not sure which I prefer yet.

Implement a task in update DAG that drops the "temp_" table ff a suite of expectations already exists for the data set, otherwise leave it.
OR
Implement a manually or periodically run maintenance DAG that drops all "temp_" tables.

The former is the ideal long-run solution (i.e., after expectations for the raw data ingestion are developed and mature), but in the short term it would complicate the process of editing a new suite of expectations.

Explore open source self-service analytics/visualization/BI tools

Fuzzy goal: I want low/non-technical users to be able to explore data in the data warehouse. The objectives are to:

Democratize data access (enable more people to answer questions with data).
Crowdsource questions from a pool of people with a broader set of experiences.
Serve curated dashboards.
and if the system is deployed for use by more than one user,
Relieve engineers and analysts from having to answering some proportion of data questions.

Desired functionality: low-code or no-code Dashboarding or Data Visualization or Business Intelligence so that less/non-technical users can explore data sets in a data warehouse without technical assistance.

In subsequent comments, I'll explore different options, learn what features are available, and try to evaluate maintainability and ease of integration with the system.

Remove the clean_schedule attr from the SocrataTable class

The general flow for a data set's pipeline no longer involves separate EL and T DAGs, so the clean_schedule time is unnecessary now. Simplify documentation and the code by removing it from

Decide on the number of data cleaning stages and conditions to meet at each stage

At present, the flow for a given data set starts through these extraction and loading steps:

Source: source data is download from a source to a file (accessible by airflow containers and the host system),
data_raw.temp_<table>: that file is ingested to a "temp" table in the data_raw schema,
data_raw.<table>: temp_<table> data is filtered to new and updated records, which are appended to the persistent table in data_raw,

I've implemented an initial pattern for the transformation stage that breaks the process into
4. Standardize: applies basic preprocessing and typesetting, produces a key that can uniquely identify one unit of the grain of the table (eg one property sale for the parcel_sales table, or one CTA train station in the CTA station table; even if the record changes and there are multiple records for a given train station, all records for that station will have the same key), and sets the desired order of columns, and
5. Clean: filters the standardized table down to the most recently updated version of each table-grain unit, and engineers a feature indicating the earliest data update where the table-grain unit was added to the table.

I'm not sure "clean" is the right name for this stage, and I'm also not sure if the mentioned feature engineering step (that engineers the 'first appearance of a unit in the table' feature) belongs with this table or should just be something to be joined on if relevant.

Evaluate open source data catalog options for integration into this platform

A data catalog should have:

Document core metadata (table name, table description, table grain, source, etc),
Document table schema (column names, descriptions, data types, etc),
Lineage information,
Usage information,
Access control information,
Search functionality, and
the ability for users to enrich data with tags and further information.

dbt's built-in doc server does include most of that functionality (even access control, apparently https://www.getdbt.com/blog/teaching-dbt-about-grants/), but it doesn't allow users to edit things through the portal, and I think it's intended more as a dev tool than a production option.

There are two options I want to evaluate:

DataHub
- ~7k stars, project started in 2016, main open source option.
- features
OpenMetadata
- ~1.8k starts, project started in Aug 2021, growing faster than DataHub and has a slightly more active community.
- features

I've looked at Amundsen, but its community is about 5% as active as OpenMetadata's community, and I don't think it will keep up.

Integrate dbt into the workflow to add T to the existing EL framework

Explore the possibility of keeping the /dbt project directory outside of the /airflow directory. (Most tutorials and code samples have everything in Airflow).
Weigh the costs of having a less common project structure and of engineering something rather than adopting VS the benefit of having a top level project directory that shows where things are.

Socrata metadata check fails on data sets that were published and haven't been updated (yet)

This issue emerges when a SocrataTableMetadata instance calls its .check_warehouse_data_freshness(engine) method, specifically when calling check_warehouse_data_freshness with an input value that doesn't match the expected datetime format ("%Y-%m-%dT%H:%M:%S %z"). That function is called with the instance's .latest_data_update_datetime and .latest_metadata_update_datetime attributes, and it seems (per examination of the metadata for the example data set below) that if the data set's data hasn't ever been updated, that field is None in the metadata response.

Editing the logic in check_warehouse_data_freshness() can handle this new understanding. I should also probably open an issue to refactor pytesting into an airflow container rather than in a separate (rather heavyweight) python container.

Example data set: Chicago city boundary

Add a mainentance DAG to delete aging raw data files

Develop a dashboarding strategy using Google's Data Studio

As a means to publish simple dashboards, Google Data Studio looks like a pretty good free-ish option. I'll have to add in some tasks to throw some data up onto BigQuery (and sort out the requisite permissions up there and secret-keeping-strategies down here), but that should be easy enough. And if I only throw up highly aggregated feature or dwh tables, I'll never get close to any BQ usage anywhere near the payment threshold.

Add a DAG that runs dbt _standardized and _clean model generation and execution outside of a data update

Add makefile + a recipes for creating the .env files, starting the docker-compose app, cleaning up, etc

The README has fallen somewhat behind and it would be much more maintainable if the setup instructions were reduced to instructions for installing docker (if not already installed) and simple make init, make build, make compose_up, etc.

Engineer features related to parcel owners

At present, I think knowing

the name of the current owner/owning org,
whether the owner occupies the parcel

are potentially useful features, and they could be engineered from the parcel locations and sales tables.

FeatEng: Aggregations of Parcel Sales through Time and Space (maybe quarterly and by nbhd/block/tract?)

Specifically:

Metrics (per geometry/geography)

Count of parcel sales per month|quarter|year
mean|median|max|min sale price per month|quarter|year

Geographic Groupings

census tract,
census block,
neighborhood (town_nbhd from cook_county_neighborhood_boundaries),
township,
school districts (e.g., school_elem_district and school_hs_district in cook_county_parcel_locations),
Chicago community area,
Chicago police beats
Chicago police districts
In general I think that smaller areas will produce more interesting information (as I assume forces causally driving changes in prices or sale rates could be very localized, but that's an assumption to check against larger areas).

Implement functionality to support idempotent ingestion

Many of the tables I'm interested in are updated monthly, weekly, even daily, and they're often snapshots exported from database tables that support operational systems (eg records correspond to some workload and are updated as the workload is worked). By retrieving a data table and comparing it against prior pulls of that table, it's possible to identify which records are new as well as which existing records were changed/updated. To avoid missing an update, it's necessary to do this retrieval+comparison for every distinct export of the table from its source system, but most public data systems don't indicate when the next update will happen (although this cadence can often be reliably deduced by checking at periodic intervals), so it's often necessary to check more often than is necessary.

Data tables can be pretty large (often in excess of 1GB), so it's both rude and expensive to download and ingest the table more frequently than is needed. Fortunately, the main data tables this project uses are served via Socrata's data platform, which provides an API for checking table metadata which includes the time that the data was last updated, which can quickly avert the need to execute an unnecessary data pull. And when it is necessary to pull data it would be ideal to only ingest new or updated records.

Refactor SocrataTable class into centralized location and import/load them

At present, SocrataTable dataclass instances for a given data-source-table are created in multiple places. It would be better to have a centralized module of these dataclass instances and just import instances into the DAG files that use them. Also, I'll probably want the ability to categorize tables as the dbt models directories will quickly become very crowded otherwise, so another attribute should be added to the SocrataTable class to categorize this table and communicate the path to proper dbt script.

Down the road (i.e. after I develop and implement ingestion patterns for other data sources) I'll have a better idea about what can be comfortably generalized from the SocrataTable dataclass into just a TableSource class (maybe a protocol or abstract class).

Develop a suite of expectations for Chicago Homicide and Shooting Victimization data

Use the manual expectation setting process and document the process.

Also include instructions on how to take expectations directly from the expectations gallery and fill them in without going through the interactive process.

Or even better, instructions on how to use instances of ExpectColumn..... (i.e., ColumnExpectation subclass instances) from the repo to define suites of expectations. If I can cut out the manual jupyter notebook expectation setting process, I can really democratize this process (although it would count on the initial suite of expectations to be valid and meaningful, which will be the hard part as open source automated data profilers aren't quite there yet).

Implement utils to handle conditionally ingesting data

Where the relevant condition is whether the data freshness check indicates there's fresher data (fresher than what's already in the data database) to be retrieved from the data source.

If data is successfully pulled:

new or preexisting-but-updated records should be ingested to the correct table,
the record for that freshness check should be updated to indicate data was pulled.

If our data warehouse already has fresh data, do nothing.

Updates are available for Airflow, pgAdmin, and dbt

Airflow 2.5.0 -> 2.5.1
pgAdmin4 6.18 -> 6.19
dbt-postgres 1.3.0 -> 1.3.1

Explore the feasibility of a task that automatically generates the dbt model that identifies new or updated records in the data_raw stage

I've already implemented the logic for producing the contents of a table's data_raw dbt model. I've been using this manually after the update-DAG for a new data errors out at the dbt task (which is just past the "temp_" table creation task) and then manually copying the output into a file, but I'm pretty sure it would be feasible to make a task that just creates the script if it didn't already exist.

This change, coupled with the refactoring of the update_socrata_table task_group (which cut update-DAGs from this down to this, which reduces the number of lines a user has to add an update-DAG from 66 to 27 + 3 (defining the SocrataTable in /airflow/dags/sources/tables.py). Automating the initial dbt-model generation would make adding a data set to the warehouse basically a 3 minute operation.

Engineer more efficient data loading functions for larger files

[Geo]Pandas [Geo]DataFrame.to_sql() methods aren't too efficient, but they are convenient in that they automatically include functionality to create a database table and set sensible column data types for free. Switching to a proper "COPY ... FROM " implementation would be much quicker (and wouldn't involve loading the entire file into memory) but I'll have to make up for that by engineering utils to create the initial data_raw table.

Fortunately, a bit of tinkering with pandas.io.sql's functionality lead me to hack together something quick and dirty to explore further.

from pandas.io.sql import SQLTable
df = pd.read_csv(file_path, nrows=2000000) # ie a large enough n that we can reliably infer dtypes
a_table = SQLTable(
    frame=df,
    name=socrata_metadata_obj.table_name,
    schema="data_raw",
    pandas_sql_engine=engine
)
table_create_obj = a_table._create_table_setup()
table_create_obj.create(bind=engine)

and it works just as nicely with a GeoDataFrame if you have GeoAlchemy2 installed.

I know it's not wise to build anything on top of an "internal" method (._create_table_setup() here) as the leading underscore flags that the maintainers can guiltless change the function name or implementation or behavior without warning. But I expect I'll be much more comfortable with SQLAlchemy v2.0 by then, so I'm fine with a temp filler here.

Add a proper documentation site and significantly tighten the focus of the README

In the past, I've put together documentation sites using the Hugo theme Google developed named Docsy and Sphinx.

Docsy was pleasant enough to work with (file headers were intuitive and the content could all be markdown), but the git-submodule part made updating a real headache (although it looks like they may have re-implemented the theme so that the git-submodule installation is unnecessary).

Sphinx was less pleasant to work with. I modeled my doc site after the beautiful Geopandas docs site, and while I was very pleased with the aesthetics of my doc site, I just know that reStructuredText is a bad choice for me as it adds too much friction to the documentation writing process (as compared with regular markdown, which is essentially frictionless).

I'd also like to take a look at:

mkdocs,
Docasaurus, and
regular Hugo

Assuming all of those options support use of images/gifs-of-demos/standard markdown capabilities, the main focus should be ease of use. Friction in the documentation process makes updating documentation a chore, and inaccurate documentation will quickly teach potential users to ignore documentation (which is a fine way to lose users).

Update makefile to use v2.x.x style `docker compose` commands and test behavior

It looks like docker-compose was completely reimplemented (presumably in 2020-early 2021) in go (from python), likely as part of the push to get docker working on Apple silicon.

In any case, I've updated my docker-compose install. I still have to remove my prior install (the binary in /usr/local/bin/docker-compose), but I've started up the system using the new implementation (it re-downloaded images) and the few things I checked were functional, but I should check more extensively.

Assuming the new compose is a proper drop-in replacement, the makefile recipes should be edited (from docker-compose ... to docker compose) and any references in documentation should also be updated.

Enforce some column name rules at the initial ingestion stage

Upon attempting to add this small and frequently updating table on vehicles towed in Chicago, dbt objected to the column "Tow Date" (syntax error at or near "Date"). This is the first table I've tried to add that has a column with 1+ space in its name, and while I'm confident dbt offers some mechanism to quote these column names, I'd rather just clean this up at the start and not have to deal with the added downstream complexity of having to quote column names every step along the way.

I think the cleanest way to do this will be to slip a line in right before this one that cleans up column names, but I might also have to insert a comparable step for some file-type-ingestion-tasks (e.g. here for geojson ingestion) or ignore the column headers (e.g. here for csv ingestion).

In the column-name-cleaning, the ideal and obvious strategy is to replace spaces with underscores, but I'll have to check that this doesn't produce duplicate column names (e.g. if a table includes columns, "tow_date" and "tow date", for some reason), and if it does, I'll have to A) iterate on space-filling strategies until finding one that doesn't produce duplicate column names, and B) record this metadata so that users can correctly trace the column back to its source. (Note: I don't expect I'll run into this soon if ever, so I'll probably leave at least the metadata tracking part unhandled).

Should the system leave out the great_expectations anonymous_usage_statistics identifiers?

Per this GE docs page, the great_expectations team added a bit of code to enable them to track usage of their code, which can be disabled in the great_expectations.yml file. That page advises there's more information on a blog post from 2020, but the given link is dead. Still, per the wayback machine that post, the GE team states

"We do not track credentials, validation results, or arguments passed to Expectations. We consider these private, and frankly none of our business. User-created names are always hashed, to create a longitudinal record without leaking any private information. We track types of Expectations, to understand which are most useful to the community."

This is very reasonable and I'm keen to provide the GE team with information that helps them figure out what features are worth working on. However, as my project is intended to be both a specific project but also a platform that other people can fork and make their own pipelines for (but from the traffic page, I see people are mainly cloning the repo without forking), I don't know if I should strip out the UUID as it would produce a polluted longitudinal record.

So I should experiment with stripping out this UUID (both in /great_expectations/expectations/.ge_store_backend_id and .../great_expectations.yml files; per grep, all other appearances of the UUID are in the /.uncommitted/ dir) and see if anything complains when I run checkpoints.

Add (mermaid.js) diagrams to the docs that show system architecture

From the material-for-mkdocs diagrams doc page, it looks like it will be a very smooth integration.

And from the mermaid.js docs, it looks like mermaid.js makes it unbelievably easy to create extremely expressive diagrams purely in simple syntax in markdown files.

Examples of that expressiveness:

This looks like a really cool tool.

Explore workflows for integrating more rigorous data validation and monitoring steps

Some testing is done with dbt, but this isn't adequate to detect most of the interesting or important changes that might appear in the data, so I've largely settled on using great_expectations to perform validation and monitoring.

great_expectations provides a moderately user-friendly interface for interactively creating suites of expectations in generated jupyter notebooks, but it will still involve manual intervention to correct the generated expectations. So I'll have to think carefully about where this should get integrated into the flow, or it could become a major source of friction.

Also, I don't think I want this to be run out of the airflow-scheduler container; perhaps this is better suited for the python-utils container.

Further simplify setup process and update documentation

Since the last revision of the setup process, I've added (at least) the dwh and feature schemas, a significant volume of docs to be served by dbt, the great_expectations workflow, and a lot of misc functionality in the pyutils service container. Some of this setup should probably go in the airflow-init service, as at present, it's handled semi-manually in the create_warehouse_infra makefile recipe.

But the main thing I want to simplify is the dot-env file creation. For users who are fluent with a password manager, I made that process significantly easier (i.e., less error-prone) by packing it up into the make_credentials makefile recipe. That recipe runs a python script that presents the user with prompts for the distinct values needed then assembles those into 18 of the values in the .env and .dwh.env files. Still, I suspect that asking a user to enter ~10 values (including 5 password-like values) could be overwhelming for users that don't use a password manager. Further, the prior implementation depended on the user's host system having python, but now that I've developed the py-utils service container (including functionality to generate a Fernet key), it might make more sense to run the script from the py-utils container, and also relieve the user of having to generate the Fernet key (although I still want users to have the choice to change credentials before they initialize the system, as I don't want users to worry that the system is giving them insecure credentials or anything).

Create a database table to track table metadata

As mentioned in issue #1, Socrata has a metadata API that can be used to check when a table has been updated, and it also returns a lot of information about the table, including the description of the data set, names/data types/descriptions for all columns, links to the hosting domain, and more.

I want to implement ELT DAGs (extracting from Socrata sources) to check the metadata API, extract the dataUpdatedAt timestamp, compare that against prior pulls, and then pull only if dataUpdatedAt is greater than the dataUpdatedAt value for the most recent table pull where data was successfully pulled and new/updated records were successfully ingested.

I'm not sure if I want to put all metadata into one table to create several metadata tables, but in any case, this will need at least columns for

an identifier for the data table (ie its Socrata table_id),
the name of that table in the local database,
the last time the data was updated, and
the last time data was successfully pulled and new/updated records were successfully ingested.

Refactor project organization to produce smaller contexts for each docker image

At present, the entire project (less things indicated in .dockerignore) gets baked into each container, which is unnecessary and will make security harder if/when the system goes to prod.

Sort out issues with permissions errors involving directories and files created by Airflow

This kind of error can be reproduced by, for example, running this version of the clean_dbt makefile recipe that will delete the /airflow/dbt/target/ directory (which contains compiled dbt models/resources). Then run a DAG that includes a dbt task, which will recreate the /airflow/dbt/target/ and then try to write compiled models to that directory, but the permissions of the newly created directory won't allow that write.

It's simple enough to just tack on mkdir airflow/dbt/target to the clean_dbt recipe, but I want to use this issue to track appearances of this issue, help reliably understand the cause, and account for it.

Add a maintenance DAG that drops all "temp_" tables from data_raw

The "temp_" tables in data raw are replaced every time the data is updated and aren't used for much if anything outside of the update-DAG context. I've used those tables while developing expectations or _standardized stage models (so I don't necessarily want to automate deletion yet), but before exporting or backing up data, it's a good idea to shed that weight.

Implement tasks to roll out dbt models for the _standardized and _clean stages

The _standardized stage dbt model (example) is where users should:

change column dtypes if desired,
standardize column values (e.g., zero-padding fixed width strings, upper-casing strings),
change column names if needed,
define the set of columns that uniquely identify a record (i.e., a primary or composite key) across stages,
generate a surrogate key from that uniquely identifying column set, and
give that surrogate key column an appropriate column name.

The _clean stage dbt model (example) will select the most recently modified version of each record (as distinguished by the surrogate key values). The dbt model for the _clean stage shouldn't need any information that wasn't already entered into the _standardized stage model, but copying that information over might be a bit hacky.

Possible implementations

Manual Intervention

The system could generate a partially complete _standardized model (and maybe also the _clean model) and then have the generating DAG fail with a message that instructs a user to go clean up the generated _standardized model stub (and _clean stub, if it exists).

Pros:

Would be the easiest to implement, and
would avoid imposing choices on the data representation that might require manual cleanup

Cons:

Would be somewhat difficult to document,
adds friction to the curation process.

Data Profiler implementation

It feels like it should be feasible to automate a lot of the standardization logic. For example, identifying a minimal spanning set of columns (for making a composite key) should be an algorithmic operation, but the only implementation I can think of right now would involve running many expensive queries (although this would be a one time cost per table, to generate the _standardized model file).

Pros:

Would make it largely frictionless to add a new data pipeline, at least up to the feature engineering stage, and
would be interesting to implement.

Cons:

Would take a while to implement,
would add a fair amount of complexity to the code, and
wouldn't ever be perfect, so the user would still always have to review the model.

I guess that kind of settles the ultimate question (ie the user can't be completely freed from having to review the standardization model), but the result can be somewhere in between full automation and simple templating.

Explore WHERE NOT EXIST behavior when the subquery compares columns with null values

Testing the idempotent update query by checking that a table didn't re-ingest records failed (records were re-ingested).

The README's images are out of date

Also, the documentation is correspondingly slightly out of date.

Rename DAGs to reflect the "new" integreted pipeline construction

Mainly replace _raw_ with _ in DAG names and DAG filenames, as these DAGs run all stages of the update pipeline, not just the data_raw stage.

Add _standardized and _clean models for tables that don't have them yet

With the addition of a DAG for easily generating _standardized stage dbt models for data sets, it would be ideal to use that to cover _standardized and _clean for all tables (or at least as many as is reasonable).

Integrate OpenMetadata with the platform

Following an exploration of open source metadata platform options, in #35 , I'm going to go with OpenMetadata.

I'll want to modify the OpenMetadata ingestion and server frameworks to use my existing Airflow setup rather than have a second Airflow setup just for running OpenMetadata, and they kindly provide instructions.

Set up Airflow lineage backend
Add Connectors for any sources I want to ingest metadata from (and add their DAGs to the DagBag)
Add Airflow API endpoints so that OpenMetadata's UI can communicate ingestion requests to Airflow services.
Extract the MOUNTAIN OF ENV VARS in their docker-compose into dot-env files.

The `load_csv_data` task_group doesn't add "source_data_updated" or "ingestion_check_time" columns

The ingestion implementation I ended up using for ingesting flat/csv data simply COPYs data from file (via a streamreader), while the geospatial ingestion scheme ingests the data after reading it into GeoDataFrames (which made it very easy to add in the columns in the title, [source_data_updated and ingestion_check_time].

I'd rather update the existing data as I assume new data was published today, so I'll probably manually update my existing data warehouse but implement these fixes to use provide this behavior in stride.

Add functionality and data needed to calculate navigable routes between points.

PostGIS makes it very easy to calculate the distance between two points, but that's not very useful if I'm trying to evaluate the shortest safe walking distance between say, a house and a CTA train station.

The postgres extension pgRouting is a rich and mature tool for applying routing and network analysis algorithms to graph data, and it would be quite easy to add that extension. The hard part in implementing this functionality would be in preparing graph data covering all public roadways in Chicago or Cook County that (ideally) also indicates where there are sidewalks. There are two data sources I know of (both below) from Cook County and Chicago data portals respectively, and I could probably also get this data from OpenStreetMaps.

Street Shapefile Sources

Make suite of expectations for data_raw cook_county_parcel_value_assessments data

Implement DAG to clean up downloaded files and XComs in Airflow metadata database

If the ingestion is reliable enough, it might even be feasible to include cleanup of the downloaded file after ingestion to the DAG.

geopandas' `to_postgis()` throws an error when attempting to load missing geometry values

Refactor the data model in the dwh schema

I'll flesh out the data model more, but from preliminary work, for my homebuying business logic, the central entity is the parcel, and for my understanding-Cook-County-areas goals, area-based features and aggregations are key so areas will be the central concept.

To produce these areal features aggregations, concepts of interest include:

amenities (transit stops, grocery stores, restaurants/cafes/coffeeshops, parks, culture),
access to economic opportunity (high speed internet, professionally relevant neighbors, adjacency to economic opportunities, etc),
hazards and risks (crime, air pollution, noise pollution, floodplains, etc)

Add task to run dbt models (_standardized and downstream) when updating a data set

Now that the update_socrata_table taskgroup includes functionality to determine whether the _standardized model is ready (and thus the autogenerated _clean model), it makes sense to run the _standardized model and all downstream models to materialize those views and tables with updated data.

This might become a computational headache if/when end tables cause "downstream" to involve a significant number of updates, but when we get there, we can reexamine update materialization logic.

Reimplement socrata dags using the much DRYer implementation

Implement a collector for FCC broadband data to explore internet availability by location

A high-speed internet connection is a prerequisite for remote work/school, but many blocks in the US have no high-speed internet options as no telecom company has built the infrastructure needed to connect that block to a high speed network (yet?).

Telecom companies have to register infrastructure deployments with the FCC (through FCC Form 477) and some (or maybe all?) of that entered data can be downloaded through this data access portal. It looks like files might follow consistently applied naming conventions (and if that's accurate, I'll be able to engineer URIs and automate collection).

It looks like the FCC just changed their broadband deployment data collection process. It's not clear if this will change the method for accessing data, but from this Form 477 resources page, the replacement system won't require telecoms to provide block-level deployment data.

TLDR: I shouldn't prioritize building this until it's clear that the data will still have sufficient resolution to be useful.

Other Resources:

The DAG graph fails to render sometimes in the Airflow web UI's graph view

It looks like this issue is new in Airflow 2.5.1 and has already been solved and that fix will roll out in Airflow 2.5.2.

So update to Airflow 2.5.2 promptly when that release is published.

Configure the system to use the jupyter lab IDE instead of the notebook interface for expectation dev

Investigate dbt's implicit logic for creating a table when the target table doesn't exist

From preliminary experiments, I've observed that dbt will create the target table and (predictably) defines columns in the target table such that they have the column name from the result_set-generating-query and dtype from the source column or transformation.

Assuming there are no other subtle differences from the table dbt creates and the one I make explicitly in a DAG @task, and UNION operations are tolerant of the situation where one table doesn't exist yet. This is relevant when a new data set is added to the warehouse (assuming my implementation runs all deliveries through the WHERE NOT EXISTS [in the running data_raw.<table>] preemptive duplication filtering; if I absolutely need to, I could branch and handle first ingestion differently from subsequent updates).

matttriano / analytics_data_where_house Goto Github PK

analytics_data_where_house's People

Contributors

Stargazers

Watchers

analytics_data_where_house's Issues

Metrics (per geometry/geography)

Geographic Groupings

Possible implementations

Manual Intervention

Pros:

Cons:

Data Profiler implementation

Pros:

Cons:

Street Shapefile Sources

Recommend Projects

Recommend Topics

Recommend Org

Jobs