leibniz-hbi / dabapush

Home Page: https://pypi.org/project/dabapush/

License: MIT License

Python 100.00%

dabapush's Introduction

dabapush

Database pusher for social media data (Twitter to begin with) – pre-alpha version

Using dabapush

dabapush is a tool to read longer-running data collections and write them to another file format or persist them into a database. It is designed to run periodically, e.g. controlled by cron; for convenience it uses project-based configurations which contain all required information on what to read, where to read it from, and what to do with it. A project may have one or more jobs; each job consists of a reader and a writer configuration, e.g. read the JSON files from the Twitter API that we stored in the folder /home/user/fancy-project/twitter/ and write the flattened and compiled data set into /some/where/else as CSV files.
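
Since dabapush is designed for periodic execution, a typical setup is a cron entry. A minimal sketch, assuming the project lives in /home/user/fancy-project and the pipeline is named default (both taken from the examples in this README):

0 3 * * * cd /home/user/fancy-project && dabapush run default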

First steps

In order to run a first dabapush job, we'll need to create a project configuration. This is done by calling:

dabapush create

By default this walks you through the configuration process in a step-by-step manner. Alternatively, you could call:

dabapush create --non-interactive

This will create an empty configuration; you'll have to fill in the required information yourself, e.g. by calling:

dabapush reader add NDJSON default
dabapush writer add CSV default

Here reader add/writer add is the verb, NDJSON or CSV is the plugin to add, and default is the pipeline name.
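
To point the reader at actual data you can pass the options documented below under reader add; for example (the glob pattern is an assumption, adjust it to your file names):

dabapush reader add NDJSON default --input-directory /home/user/fancy-project/twitter/ --pattern "*.ndjson"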

Of course you can edit the configuration after creation in your favorite editor, but BEWARE NOT TO TAMPER WITH THE YAML TAGS!

To run the newly configured job, please call:

dabapush run default

Command Reference

Invocation Pattern

dabapush <command> <subcommand?> <options>

Commands

create -- creates a dabapush project (invokes interactive prompt)

Options:

--non-interactive, create an empty configuration and exit

--interactive, this is the default behavior: prompts for user input on

  • project name,
  • project author's name,
  • project author email address(es) for notifications,
  • whether to manually configure targets or run discover

run all -- collect all known items and execute targets/destinations

run <target> -- run a single writer and/or named target

Options:

--force-rerun, -r: forces all data to be read, ignores already logged data
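
For example, to re-process everything in the default pipeline regardless of the log:

dabapush run default --force-rerun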


reader -- interact with readers

reader configure <name> -- configure the reader for one or more subproject(s); Reader configuration is inherited from global to local level; throws if configuration is incomplete and defaults are missing

reader list: returns a table of all configured readers, with <path> <target> <class> <id>

reader list_all: returns a table of all registered reader plugins

reader add <type> <name>: add a reader to the project configuration

Options:

--input-directory <path>: directory to be read

--pattern <pattern>: pattern for matching file names against.

reader remove <name>: remove a reader from the project configuration.

reader register <path>: not implemented yet


discover -- discover (possible) targets in project directory and configure them automagically -- yeah, you dream of that, don't you?


writer -- interact with writers

writer add <type> <name>: add a writer to the project configuration

writer remove <name>: removes the writer for the given name

writer list -- returns table of all writers, with <path> <subproject-name> <class> <id>

writer list_all: returns a table of all registered writer plugins

writer configure <name> or writer configure all

Options:

--output-dir, -o <path>: default for all targets: <project-dir>/output/<target-name>

--output-pattern, -p <pattern>: pattern used for file name creation, e.g. 'YYYY-MM-dd'; the file extension is added by the writer and cannot be overwritten

--roll-over, -r <file-size | lines | None>: should the output be chunked? Give either a file size or a number of lines for roll-over, or None to disable chunking
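
For example (the values are purely illustrative, not defaults; the roll-over value is meant here as a number of lines):

dabapush writer configure default --output-dir output/default --output-pattern 'YYYY-MM-dd' --roll-over 10000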

Extending dabapush and developers guide

Dabapush's reader and writer plug-ins are registered via entry points: dabapush_readers for readers and dabapush_writers for writers. Both expect a Configuration subclass.
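
A third-party package would then declare its plug-ins in its packaging metadata. A minimal sketch with setuptools (the package and class names are made up; only the entry-point group names dabapush_readers and dabapush_writers come from the paragraph above):

from setuptools import setup

setup(
    name="my-dabapush-plugin",  # hypothetical third-party package
    packages=["my_dabapush_plugin"],
    entry_points={
        # groups dabapush looks up; each entry points to a Configuration subclass
        "dabapush_readers": ["myreader = my_dabapush_plugin:MyReaderConfiguration"],
        "dabapush_writers": ["mywriter = my_dabapush_plugin:MyWriterConfiguration"],
    },
)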

Developer Installation

  1. Install poetry
  2. Clone repository
  3. In the cloned repository's root directory run poetry install
  4. Run poetry shell to start development virtualenv
  5. Run dabapush create to create your first project.
  6. Run pytest to run all tests

dabapush's People

Contributors

pekasen, adyk007, flxvctr

Watchers

Kostas Georgiou

dabapush's Issues

Make sure full text gets written in RTs

Twitter limits the text in the API payload to 280 characters. Therefore, retweets are often truncated by the length of the RT @{username}: prefix. However, the full text should be available via the referenced_tweets expansion in the API response.
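
A sketch of the fix idea (not existing dabapush code; the field names follow the Twitter API v2 response format, the helper name full_text is made up): look the retweet up in the includes of the response and fall back to the truncated text if the expansion is missing.

def full_text(tweet: dict, includes: dict) -> str:
    """Return the untruncated text of a retweet via the referenced_tweets expansion."""
    refs = tweet.get("referenced_tweets") or []
    retweeted = next((ref for ref in refs if ref.get("type") == "retweeted"), None)
    if retweeted is None:
        return tweet["text"]  # not a retweet, text is already complete
    for full in includes.get("tweets", []):  # expanded tweet objects
        if full["id"] == retweeted["id"]:
            return full["text"]
    return tweet["text"]  # expansion missing, fall back to the truncated text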

bug: attempting to read closed file

Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/dabapush", line 8, in <module>
    sys.exit(cli())
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/run_subcommand.py", line 36, in run
    db.jb_run(targets)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/Dabapush.py", line 194, in jb_run
    self.__dispatch_job__(target)
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/Dabapush.py", line 208, in __dispatch_job__
    writer.write(reader.read())
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/Writer/Writer.py", line 31, in write
    for item in queue:
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/Reader/TwacapicReader.py", line 127, in read
    for res in _res:
  File "/home/ec2-user/.local/pipx/venvs/dabapush/lib64/python3.8/site-packages/dabapush/Reader/TwacapicReader.py", line 124, in <genexpr>
    _res = (loads(line) for line in file)
ValueError: I/O operation on closed file.
Session and Connection Terminated
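
Judging from the last two frames, the likely cause is that the reader builds a lazy generator over the file object, but the with block that opened the file exits before the writer consumes the generator. A minimal sketch of the problem and one possible fix (file handling details are illustrative, not the actual TwacapicReader code):

from json import loads

# Buggy shape: nothing is read until the caller iterates, but by then the
# with block has already closed the file -> "I/O operation on closed file".
def read_lazy(path):
    with open(path, "r", encoding="utf-8") as file:
        return (loads(line) for line in file)

# Fixed shape: yield from inside the with block so the file stays open
# while the records are being consumed.
def read_streaming(path):
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            yield loads(line)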

Refactor plug-in system for setuptools entry points

Status Quo

Right now dabapush relies on a home-baked solution for plug-in management which involves a YAML file in the installation directory. Thus, updating/installing/removing plug-ins is cumbersome and error-prone. This applies especially to the smo-database plug-ins, which are now linked into the repository as a sub-repo.

Solution

Setuptools' entry-point system allows for dynamic registration of plug-ins, and we can discover them at run-time. Thus, we can remove the hard dependency on smo-database and also enable other parties to use their own plug-ins.
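
On dabapush's side, run-time discovery could then look roughly like this (a sketch; the helper name is made up, and the fallback covers Python 3.8/3.9, where entry_points() returns a plain dict):

from importlib.metadata import entry_points

def discover_plugins(group: str) -> dict:
    """Map plug-in names to their Configuration classes for one entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):       # Python 3.10+
        selected = eps.select(group=group)
    else:                            # Python 3.8/3.9
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}

readers = discover_plugins("dabapush_readers")
writers = discover_plugins("dabapush_writers")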

Conform to `smormlpy`'s style guide and conventions

Status Quo

Right now dabapush is a free-floating project with no real work force attached and no structures implemented.

Solution

smormlpy gives guidelines and structure for this kind of project.

Remove `smo-database` sub-module and move plug-in code for writers

The smo-database sub-module hinders testing in GitHub Actions (see error log in #32) and makes installation a nightmare (and, to be frank, impossible for people outside of our org).

Thus, as dabapush itself does not depend on code from smo-database, we should remove the dependency, move the database plug-ins to smo-database for good, let them register themselves against dabapush through the entry point, and ship smo-database as another package via PyPI.

Archival Pipelines

As of yet dabapush initializes pipelines solely by the reader's and writer's name; thus, a call like dabapush run default would look for a reader named default and a writer named default. The reader extracts all records, according to its programming, from the specified path and glob pattern and passes these records to the writer.

This hinders archival pipelines: in an archival pipeline we want to have a dependency on the outcome of another pipeline, e.g. we want to archive all the files that have been successfully read by dabapush. Therefore, the input to this pipeline would not be a path/glob-pattern pair but rather the logged files of the already finished pipeline.

Giving the reader that functionality seems a bit spaghetti-like: it overloads the class with functionality that is not related to reading and processing files into records in a form that the writer objects can process further.

The cleanest solution would be to enhance the pipelines further: a third object type, e.g. named Attacher, could take over the responsibility of discovering and opening files for the reader. Through inheritance we can design multiple different Attachers, e.g. for reading files from disk by means of a path and glob pattern, for reading the log and filtering for files from specific, already finished pipelines, or even for reading remote files from S3 or SFTP.

Thus, a pipeline would include at least three objects: an Attacher, which decides which files to open, a reader that extracts meaningful records from these files, and a writer that persists/writes these records. Initializing these three-piece pipelines can still be achieved by name only; thus, no changes in the structure of the configuration file format are necessary, although some fields must be moved from the reader configuration to an attacher configuration.
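
A rough sketch of what such an Attacher hierarchy could look like (all class names and signatures are hypothetical; this is a proposal rather than existing dabapush code):

from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO, Iterator

class Attacher(ABC):
    """Decides which files to open and hands open file objects to a reader."""

    @abstractmethod
    def attach(self) -> Iterator[IO]:
        ...

class GlobAttacher(Attacher):
    """Attaches local files by path and glob pattern (today's reader behavior)."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path = Path(path)
        self.pattern = pattern

    def attach(self) -> Iterator[IO]:
        for file_path in sorted(self.path.glob(self.pattern)):
            with file_path.open("r", encoding="utf-8") as file:
                yield file  # stays open while the reader consumes it

A LogAttacher or S3Attacher would implement the same attach() interface, which is what keeps the reader and writer unchanged.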

In summary, the new pipeline features are:

  • pipelines should be able to read logged files from another pipeline, e.g. to move already read files from local storage to S3.
  • another class, the Attacher, is responsible for file discovery and opening; the reader extracts meaningful records from the opened file.
  • files should only be logged if processing is complete and did not fail.
  • dabapush is responsible for ensuring safe processing of files and records and keeps the log – which relieves the Writer classes of this responsibility.
  • failed items should not crash the pipeline but rather be persisted into a special location, e.g. a file like ${date}-${pipeline}-malformed-objects.jsonl.
  • the failed-items log should be in a format that an Attacher is able to handle, so the entries can be processed accordingly.
  • therefore the log items should be enhanced with a tag recording which pipeline processed which file.

Split file output by variable.

So, the task at hand is to reduce all of the .ndjson files on the smo-dev-server to just one file per Facebook/Instagram account and collection. Therefore we must implement a change in dabapush: at the writing stage, e.g. in an NDJSON writer, we must know some metadata about the record we are writing. Right now there is no possibility to transmit this metadata.

Thus, step one should be the implementation of a class that holds:

  • the data we want to transmit in our pipeline,
  • a range of metadata, e.g. the path of the original file where the record originated and the time it was read.

Step two would be the modification of the file-based writers to use e.g. a variable in the above-mentioned metadata to distribute the output of the pipeline to different files.

E.g. all of the tweets of account1 go into a file account1.ndjson, all of account2 into account2.ndjson, and so forth.

Step three, again at the reading stage: we cannot emit simple dicts anymore with just the record inside; we'll actually need to write the metadata we need into the above-mentioned class, e.g. collection information parsed from the file's path. factli stores its files as results/${list_id}/${user_id}.ndjson; thus, we can get valid information about the list and the user from the path.
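
A sketch of steps one and two (class, field, and function names are assumptions, not an existing dabapush API):

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict

@dataclass
class Record:
    payload: Dict[str, Any]            # the data we transmit through the pipeline
    source_path: str                   # path of the original file the record came from
    read_at: datetime = field(default_factory=datetime.utcnow)
    meta: Dict[str, Any] = field(default_factory=dict)   # e.g. parsed from source_path

def output_file_for(record: Record, split_by: str) -> str:
    """Step two: route the record into one file per value of the chosen metadata key."""
    value = record.meta.get(split_by, "unknown")
    return f"{value}.ndjson"

For the factli example above, a reader would parse list_id and user_id from the path into record.meta, and the writer would call output_file_for(record, "user_id") to split its output per account.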

Refactor DBWriters

They should adhere to a common pattern and not implement namespaced methods and properties.
