data-for-change / anyway-etl
License: MIT License
Create an Airflow process that allows a CBS data backfill from S3 (without importing from email), with a load_start_year parameter that the Airflow user can change. The relevant command: python main.py process cbs --source s3 --load_start_year 2020
We had a Jenkins process that enabled such backfill.
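As a rough illustration of the parameterized backfill, the sketch below builds the command quoted above from a user-supplied start year. The function name and the year sanity bounds are hypothetical and not part of anyway-etl; only the command line itself comes from the issue. How the resulting string is wired into an Airflow task is left open here.

```python
import shlex


def build_cbs_backfill_command(load_start_year, source="s3"):
    """Build the CBS processing command from the issue for a given start year.

    The function name and the sanity bounds below are hypothetical.
    """
    year = int(load_start_year)  # accepts 2020 or "2020"
    if not 1900 <= year <= 2100:  # assumed bounds, only to catch obvious typos
        raise ValueError(f"unexpected load_start_year: {load_start_year!r}")
    args = [
        "python", "main.py", "process", "cbs",
        "--source", source,
        "--load_start_year", str(year),
    ]
    # Quote every argument so the string is safe to hand to a shell.
    return " ".join(shlex.quote(arg) for arg in args)


print(build_cbs_backfill_command(2020))
```

With the default source this prints the same command as in the issue, so a user-facing parameter only has to supply the year.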
feature/waze-dag
docker-compose -f docker-compose.yml -f {RELATIVE_PATH}/docker-compose-override.yaml run anyway-etl waze get-data
{RELATIVE_PATH} should be replaced with the relative path to my ETL fork you have cloned in the first step.
The anyway-etl waze get-data command should run.
ModuleNotFoundError: No module named 'anyway'
Remove the use of local storage in the CBS ETL for processed files, and use only S3 for the most up-to-date CBS data.
Hence the flow can be as follows:
Import emails can run every day to check for new emails and update the CBS files in S3. We can add a DB table that tracks the latest email id / timestamp loaded for each provider code and year.
If a new email is identified, process-files is triggered to save the data in S3, and then the parsing and subsequent data loading processes are triggered using the minimum year from process-files, loading data from S3 first (https://github.com/hasadna/anyway/blob/dev/anyway/parsers/cbs/executor.py in the anyway repo).
Parsing and the subsequent data loading processes can also be triggered by a separate ETL, unrelated to import emails (since the data is loaded from S3), using load_start_year (which can be any year).
@OriHoch if you have another suggestion for the flow, let me know.
The important thing is that the consistent storage location for files after processing is S3 and not local storage (see here), and that this S3 repo can be trusted as the most up-to-date source for data loading.
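The tracking idea in the flow above can be sketched in a few lines. This is only an illustration under stated assumptions: the CbsEmail class, the find_min_year_to_load function and the in-memory dict standing in for the proposed DB table are all hypothetical, and email ids are assumed to increase over time.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class CbsEmail:
    """Hypothetical view of one CBS email; ids are assumed to grow over time."""
    email_id: int
    provider_code: int
    year: int


def find_min_year_to_load(
    emails: Iterable[CbsEmail],
    last_loaded: Dict[Tuple[int, int], int],
) -> Optional[int]:
    """Return the smallest year that received a new email, or None if nothing is new.

    last_loaded stands in for the proposed tracking table: it maps
    (provider_code, year) to the latest email id already loaded, and is
    updated in place as new emails are seen.
    """
    min_year = None
    for email in emails:
        key = (email.provider_code, email.year)
        if email.email_id > last_loaded.get(key, -1):
            last_loaded[key] = email.email_id
            min_year = email.year if min_year is None else min(min_year, email.year)
    return min_year


tracker = {(1, 2020): 10}
inbox = [CbsEmail(10, 1, 2020), CbsEmail(12, 3, 2021), CbsEmail(11, 1, 2019)]
print(find_min_year_to_load(inbox, tracker))  # 2019: oldest year with a new email
print(find_min_year_to_load(inbox, tracker))  # None: everything is already tracked
```

The returned year would play the role of the minimum year passed on to parsing and loading; a None result means the daily run has nothing to trigger.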
Most of the time it works, but occasionally we get an email alert for this error.
Since Nov 17, 17:50 it happened twice, on Nov 17 05:25 and 05:55 (UTC).
/srv/pip_install_deps.sh && /usr/local/lib/anyway-etl/bin/anyway-etl waze get-data
[2021-11-18 06:00:01,281] {bash.py:169} INFO - Output:
[2021-11-18 06:00:01,283] {bash.py:173} INFO - {}
[2021-11-18 06:00:01,770] {bash.py:173} INFO - 1
[2021-11-18 06:00:01,773] {bash.py:173} INFO - 1
[2021-11-18 06:00:03,515] {bash.py:173} INFO - 2021-11-18 06:00:03 DEBUG Starting new HTTPS connection (1): il-georss.waze.com:443
[2021-11-18 06:00:04,999] {bash.py:173} INFO - /usr/local/lib/anyway-etl/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'il-georss.waze.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
[2021-11-18 06:00:05,000] {bash.py:173} INFO - warnings.warn(
[2021-11-18 06:00:05,000] {bash.py:173} INFO - 2021-11-18 06:00:04 DEBUG https://il-georss.waze.com:443 "GET /rtserver/web/TGeoRSS?format=JSON&tk=ccp_partner&ccp_partner_name=The+Public+Knowledge+Workshop&types=traffic%2Calerts%2Cirregularities&polygon=34.123%2C31.4%3B34.722%2C33.004%3B35.793%2C33.37%3B35.914%2C32.953%3B35.765%2C32.733%3B35.6%2C32.628%3B35.473%2C31.073%3B35.23%2C30.29%3B34.985%2C29.513%3B34.898%2C29.483%3B34.123%2C31.4 HTTP/1.1" 200 None
[2021-11-18 06:00:06,279] {bash.py:173} INFO - Traceback (most recent call last):
[2021-11-18 06:00:06,279] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
[2021-11-18 06:00:06,279] {bash.py:173} INFO - return self._engine.get_loc(casted_key)
[2021-11-18 06:00:06,279] {bash.py:173} INFO - File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021-11-18 06:00:06,280] {bash.py:173} INFO - KeyError: 'pubMillis'
[2021-11-18 06:00:06,280] {bash.py:173} INFO -
[2021-11-18 06:00:06,280] {bash.py:173} INFO - The above exception was the direct cause of the following exception:
[2021-11-18 06:00:06,280] {bash.py:173} INFO -
[2021-11-18 06:00:06,280] {bash.py:173} INFO - Traceback (most recent call last):
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/bin/anyway-etl", line 33, in <module>
[2021-11-18 06:00:06,280] {bash.py:173} INFO - sys.exit(load_entry_point('anyway-etl', 'console_scripts', 'anyway-etl')())
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
[2021-11-18 06:00:06,280] {bash.py:173} INFO - return self.main(*args, **kwargs)
[2021-11-18 06:00:06,280] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1062, in main
[2021-11-18 06:00:06,280] {bash.py:173} INFO - rv = self.invoke(ctx)
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return _process_result(sub_ctx.command.invoke(sub_ctx))
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return _process_result(sub_ctx.command.invoke(sub_ctx))
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return ctx.invoke(self.callback, **ctx.params)
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 763, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return __callback(*args, **kwargs)
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/cli.py", line 14, in get_data
[2021-11-18 06:00:06,281] {bash.py:173} INFO - get_waze_data()
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/get_data.py", line 15, in get_waze_data
[2021-11-18 06:00:06,281] {bash.py:173} INFO - dataflows = dataflows_handler.get_dataflows(waze_data)
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflows_handler.py", line 12, in get_dataflows
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return [build_dataflow(waze_data, field) for field in FIELDS]
[2021-11-18 06:00:06,281] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflows_handler.py", line 12, in <listcomp>
[2021-11-18 06:00:06,281] {bash.py:173} INFO - return [build_dataflow(waze_data, field) for field in FIELDS]
[2021-11-18 06:00:06,282] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflow_builder.py", line 22, in build_dataflow
[2021-11-18 06:00:06,282] {bash.py:173} INFO - items = self.get_items(waze_data, field)
[2021-11-18 06:00:06,282] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflow_builder.py", line 17, in get_items
[2021-11-18 06:00:06,282] {bash.py:173} INFO - parsed_data = parser(raw_data)
[2021-11-18 06:00:06,282] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/parser_retriever.py", line 67, in _parse_jams
[2021-11-18 06:00:06,282] {bash.py:173} INFO - jams_df["created_at"] = pd.to_datetime(jams_df["pubMillis"], unit="ms")
[2021-11-18 06:00:06,282] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/frame.py", line 3455, in __getitem__
[2021-11-18 06:00:06,282] {bash.py:173} INFO - indexer = self.columns.get_loc(key)
[2021-11-18 06:00:06,282] {bash.py:173} INFO - File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
[2021-11-18 06:00:06,282] {bash.py:173} INFO - raise KeyError(key) from err
[2021-11-18 06:00:06,282] {bash.py:173} INFO - KeyError: 'pubMillis'
[2021-11-18 06:00:06,511] {bash.py:177} INFO - Command exited with return code 1
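The traceback shows only that the "pubMillis" column was missing from jams_df; it does not show why. One possible explanation, which is an assumption and not confirmed by the log, is that the Waze response occasionally contains no jams, or jam entries without that field. A minimal sketch of a defensive parse in plain Python follows; the parse_jams function here is hypothetical and is not the project's _parse_jams.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def parse_jams(raw_jams: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Add a created_at timestamp to each jam, skipping jams without pubMillis.

    Hypothetical helper: an empty or incomplete response yields an empty
    or shorter list instead of raising KeyError.
    """
    parsed = []
    for jam in raw_jams:
        if "pubMillis" not in jam:
            continue  # assumed policy: skip incomplete records rather than fail
        row = dict(jam)  # copy so the input is not modified
        # pubMillis is a Unix timestamp in milliseconds.
        row["created_at"] = datetime.fromtimestamp(
            jam["pubMillis"] / 1000, tz=timezone.utc
        )
        parsed.append(row)
    return parsed


print(parse_jams([]))            # an empty response no longer raises
print(parse_jams([{"id": 1}]))   # a record without pubMillis is skipped
```

Unlike pd.to_datetime(..., unit="ms"), this sketch returns timezone-aware UTC datetimes. Whether skipping incomplete records is acceptable, or the run should fail loudly instead, is a design decision for the maintainers.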
Maintaining releases is inconvenient for us today, since some Airflow DAGs use anyway code and we need to create a release every time the anyway code updates.
For now, we prefer production to update every time anyway master updates.
@carmelp16 please talk to Gal to get the contact, and add a data pipeline to extract the data.
Suspiciously, this task ended after exactly 30 minutes, but it certainly did not complete, since not all of the data was loaded (we need to see 4 different years in this table, see the following query).
start_date=20220123T190122, end_date=20220123T193122
It happened in previous tasks as well, which is a bit suspicious.
Is there a timeout I'm not aware of?
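The 30-minute figure can be checked directly from the reported start_date and end_date values. The helper name below is hypothetical; the timestamp format is inferred from the two values in the issue.

```python
from datetime import datetime


def task_duration_seconds(start_date: str, end_date: str) -> float:
    """Return the elapsed seconds between two compact timestamps such as 20220123T190122."""
    fmt = "%Y%m%dT%H%M%S"
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    return (end - start).total_seconds()


print(task_duration_seconds("20220123T190122", "20220123T193122"))  # 1800.0
```

A duration of exactly 1800 seconds is consistent with a fixed 30-minute limit somewhere in the stack, but the issue does not show where such a limit would be configured. Airflow's execution_timeout task argument is one place that might be worth checking.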