anyway-etl's People

Contributors

atalyaalon, carmelp16, orihoch, tkalir

anyway-etl's Issues

airflow dags which use anyway-kubectl-exec should print output while they are running

reproduction steps

expected

  • should show the output immediately as it is printed

actual

  • output appears only after execution ends
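The expected behaviour can be illustrated with a minimal sketch that reads a child process's output line by line instead of waiting for it to exit. This is not the actual anyway-kubectl-exec code; the function name run_and_stream is hypothetical, and the sketch assumes the child process flushes its own output (for a Python child, for example, by running it with -u).

```python
import subprocess
import sys


def run_and_stream(command):
    """Run a command and echo each output line as soon as it arrives.

    Returns the exit code and the list of lines that were printed.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on the reading side of the pipe
    )
    lines = []
    for line in process.stdout:
        # Print immediately rather than collecting everything first.
        print(line, end="", flush=True)
        lines.append(line.rstrip("\n"))
    return process.wait(), lines
```

Calling run_and_stream([sys.executable, "-u", "-c", "print('step 1'); print('step 2')"]) prints each line as the child emits it. If the child buffers its output internally, the lines will still arrive late regardless of how the parent reads them.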

Airflow process - CBS data backfill

Create an Airflow process that allows a CBS data backfill from s3 (without importing from email), with a load_start_year parameter that the Airflow user can change. The relevant command: python main.py process cbs --source s3 --load_start_year 2020
We previously had a Jenkins process that enabled such a backfill.
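As a rough sketch of how the user-supplied parameter could be turned into the command quoted above, the helper below validates the year and builds the command string. The function name build_cbs_backfill_command is hypothetical; only the command line itself comes from this issue, and how the value reaches the function (for example through DAG run configuration) is left open.

```python
import shlex


def build_cbs_backfill_command(load_start_year):
    """Build the CBS backfill command for a given start year."""
    # int() raises ValueError for non-numeric input, so a bad parameter
    # fails before any command is run.
    year = int(load_start_year)
    args = [
        "python", "main.py", "process", "cbs",
        "--source", "s3",
        "--load_start_year", str(year),
    ]
    # shlex.join quotes each argument safely for a shell.
    return shlex.join(args)


print(build_cbs_backfill_command(2020))
# python main.py process cbs --source s3 --load_start_year 2020
```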

Can't run Waze DAG locally using docker-compose

Steps to reproduce:

  • Clone my fork and check out the feature/waze-dag branch.
  • Run the following command from the main directory of your anyway repository:
    docker-compose -f docker-compose.yml -f {RELATIVE_PATH}/docker-compose-override.yaml run anyway-etl waze get-data
    Where {RELATIVE_PATH} should be replaced with the relative path to my ETL fork you have cloned in the first step.

Expected result:

anyway-etl waze get-data command should run

Actual result:

ModuleNotFoundError: No module named 'anyway'
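One way to narrow this down is to check, from inside the container, whether the anyway package is visible on the Python path at all. The snippet below is only a diagnostic sketch; the helper name is_importable is hypothetical, and it does not say why the package is missing.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be located on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if not is_importable("anyway"):
    # Listing sys.path shows which directories Python actually searched.
    print("anyway is not importable; sys.path is:")
    for entry in sys.path:
        print("  ", entry)
```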

Update CBS processes - to store and load using S3

Remove the use of local storage for processed files in the CBS ETL; use only S3, so that it holds the most up-to-date CBS data:

  • process-files - Files from this process should be loaded to s3 using the load_start_year parameter. The default will be current year - 1, so in 2021 we'll load 2020 and 2021, if data exists. (For example, in the first month of 2022 we won't have data for 2022, but we don't want the process to fail.)

Hence the flow can be as follows:
import emails can run every day to check for new emails and update the CBS files in s3. We can add a DB table that tracks the latest email id / timestamp loaded for each provider code and year.

If new emails are identified, process-files is triggered to save data in s3, and then the parsing and subsequent data-loading processes are triggered using the minimum year from process-files, loading data from s3 first (https://github.com/hasadna/anyway/blob/dev/anyway/parsers/cbs/executor.py in the anyway repo).

The parsing and subsequent data-loading processes can also be triggered by a separate ETL that is not related to import emails (since data is loaded from s3), using load_start_year, which can be any year.

@OriHoch if you have another suggestion for the flow, let me know.
The important thing is that the consistent storage location for files after processing is s3 and not local storage (see here), and that this s3 repo can be trusted as the most up-to-date source for data loading.
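The default-year rule for load_start_year can be sketched as follows. The function names and the available_years argument (the set of years for which processed files exist in s3) are hypothetical; the sketch only shows the arithmetic of "current year - 1" and of skipping years that have no data yet rather than failing.

```python
from datetime import date


def default_load_start_year(today=None):
    """Default start year: the year before the current one."""
    today = today or date.today()
    return today.year - 1


def years_to_load(available_years, load_start_year=None, today=None):
    """Return the years from load_start_year to the current year that
    actually have data, silently skipping years with no files yet."""
    today = today or date.today()
    if load_start_year is None:
        load_start_year = default_load_start_year(today)
    wanted = range(load_start_year, today.year + 1)
    return [year for year in wanted if year in available_years]


# In 2021 with data for both years, 2020 and 2021 are loaded.
print(years_to_load({2019, 2020, 2021}, today=date(2021, 6, 1)))  # [2020, 2021]
# In January 2022 there is no 2022 data yet, and nothing fails.
print(years_to_load({2020, 2021}, today=date(2022, 1, 15)))  # [2021]
```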

research and mitigate error in waze get-data: KeyError: 'pubMillis'

most of the time it works but occasionally we get an email alert for this error

since Nov 17, 17:50 it happened twice on Nov 17 05:25 and 05:55 (UTC)

/srv/pip_install_deps.sh && /usr/local/lib/anyway-etl/bin/anyway-etl waze get-data
[2021-11-18 06:00:01,281] {bash.py:169} INFO - Output:
[2021-11-18 06:00:01,283] {bash.py:173} INFO - {}
[2021-11-18 06:00:01,770] {bash.py:173} INFO - 1
[2021-11-18 06:00:01,773] {bash.py:173} INFO - 1
[2021-11-18 06:00:03,515] {bash.py:173} INFO - 2021-11-18 06:00:03 DEBUG    Starting new HTTPS connection (1): il-georss.waze.com:443
[2021-11-18 06:00:04,999] {bash.py:173} INFO - /usr/local/lib/anyway-etl/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'il-georss.waze.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
[2021-11-18 06:00:05,000] {bash.py:173} INFO -   warnings.warn(
[2021-11-18 06:00:05,000] {bash.py:173} INFO - 2021-11-18 06:00:04 DEBUG    https://il-georss.waze.com:443 "GET /rtserver/web/TGeoRSS?format=JSON&tk=ccp_partner&ccp_partner_name=The+Public+Knowledge+Workshop&types=traffic%2Calerts%2Cirregularities&polygon=34.123%2C31.4%3B34.722%2C33.004%3B35.793%2C33.37%3B35.914%2C32.953%3B35.765%2C32.733%3B35.6%2C32.628%3B35.473%2C31.073%3B35.23%2C30.29%3B34.985%2C29.513%3B34.898%2C29.483%3B34.123%2C31.4 HTTP/1.1" 200 None
[2021-11-18 06:00:06,279] {bash.py:173} INFO - Traceback (most recent call last):
[2021-11-18 06:00:06,279] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
[2021-11-18 06:00:06,279] {bash.py:173} INFO -     return self._engine.get_loc(casted_key)
[2021-11-18 06:00:06,279] {bash.py:173} INFO -   File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021-11-18 06:00:06,280] {bash.py:173} INFO - KeyError: 'pubMillis'
[2021-11-18 06:00:06,280] {bash.py:173} INFO - 
[2021-11-18 06:00:06,280] {bash.py:173} INFO - The above exception was the direct cause of the following exception:
[2021-11-18 06:00:06,280] {bash.py:173} INFO - 
[2021-11-18 06:00:06,280] {bash.py:173} INFO - Traceback (most recent call last):
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/bin/anyway-etl", line 33, in <module>
[2021-11-18 06:00:06,280] {bash.py:173} INFO -     sys.exit(load_entry_point('anyway-etl', 'console_scripts', 'anyway-etl')())
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
[2021-11-18 06:00:06,280] {bash.py:173} INFO -     return self.main(*args, **kwargs)
[2021-11-18 06:00:06,280] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1062, in main
[2021-11-18 06:00:06,280] {bash.py:173} INFO -     rv = self.invoke(ctx)
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/click/core.py", line 763, in invoke
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return __callback(*args, **kwargs)
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/cli.py", line 14, in get_data
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     get_waze_data()
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/get_data.py", line 15, in get_waze_data
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     dataflows = dataflows_handler.get_dataflows(waze_data)
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflows_handler.py", line 12, in get_dataflows
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return [build_dataflow(waze_data, field) for field in FIELDS]
[2021-11-18 06:00:06,281] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflows_handler.py", line 12, in <listcomp>
[2021-11-18 06:00:06,281] {bash.py:173} INFO -     return [build_dataflow(waze_data, field) for field in FIELDS]
[2021-11-18 06:00:06,282] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflow_builder.py", line 22, in build_dataflow
[2021-11-18 06:00:06,282] {bash.py:173} INFO -     items = self.get_items(waze_data, field)
[2021-11-18 06:00:06,282] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/dataflow_builder.py", line 17, in get_items
[2021-11-18 06:00:06,282] {bash.py:173} INFO -     parsed_data = parser(raw_data)
[2021-11-18 06:00:06,282] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/src/anyway-etl/anyway_etl/waze/utils/parser_retriever.py", line 67, in _parse_jams
[2021-11-18 06:00:06,282] {bash.py:173} INFO -     jams_df["created_at"] = pd.to_datetime(jams_df["pubMillis"], unit="ms")
[2021-11-18 06:00:06,282] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/frame.py", line 3455, in __getitem__
[2021-11-18 06:00:06,282] {bash.py:173} INFO -     indexer = self.columns.get_loc(key)
[2021-11-18 06:00:06,282] {bash.py:173} INFO -   File "/usr/local/lib/anyway-etl/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
[2021-11-18 06:00:06,282] {bash.py:173} INFO -     raise KeyError(key) from err
[2021-11-18 06:00:06,282] {bash.py:173} INFO - KeyError: 'pubMillis'
[2021-11-18 06:00:06,511] {bash.py:177} INFO - Command exited with return code 1
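The traceback shows that the jams data had no pubMillis column when _parse_jams ran. One plausible cause, which is an assumption and not confirmed by the log, is a response in which no jam record carries the field, for instance an empty jams list. A possible mitigation is to check the raw records before building the DataFrame. The sketch below uses only plain Python; the function names are hypothetical and it is not the project's parser. Unlike pd.to_datetime(..., unit="ms"), it produces timezone-aware UTC datetimes.

```python
from datetime import datetime, timezone


def split_by_required_key(records, key="pubMillis"):
    """Separate records that carry the key from those that do not."""
    valid, invalid = [], []
    for record in records or []:
        (valid if key in record else invalid).append(record)
    return valid, invalid


def add_created_at(records, key="pubMillis"):
    """Add a created_at datetime to each record that has the key.

    Returns the enriched records and the number of skipped records,
    so the caller can log a warning instead of raising KeyError.
    """
    valid, invalid = split_by_required_key(records, key)
    enriched = []
    for record in valid:
        row = dict(record)  # copy, so the input is left untouched
        row["created_at"] = datetime.fromtimestamp(
            record[key] / 1000, tz=timezone.utc
        )
        enriched.append(row)
    return enriched, len(invalid)
```

With this guard, an empty or incomplete response yields an empty result and a skip count, and the caller can decide whether to log it, retry, or fail the task.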

CI / CD - production auto update on anyway master commit

It's inconvenient for us today to maintain releases - since some airflow dags use anyway code - and we need to create a release every time anyway code updates.
For now, we prefer production to update every time anyway master updates.

Is there a timeout for runs on main pod?

Suspiciously, this task ended after exactly 30 minutes, but it certainly did not finish, since not all of the data was loaded (we need to see 4 different years in this table, see following query)
start_date=20220123T190122, end_date=20220123T193122
It happened in previous tasks as well, which is a bit suspicious.
Is there a timeout I'm not aware of?
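The duration can be checked directly from the two timestamps quoted above. The helper name run_duration_seconds is hypothetical. A result of exactly 1800 seconds is consistent with a fixed limit somewhere in the stack, but on its own it does not show where such a limit would be configured or that one exists.

```python
from datetime import datetime

# Format of the start_date / end_date values quoted in this issue.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%S"


def run_duration_seconds(start_date, end_date):
    """Return the elapsed time between two timestamps, in seconds."""
    start = datetime.strptime(start_date, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_date, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


print(run_duration_seconds("20220123T190122", "20220123T193122"))  # 1800.0
```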
