
Comments (8)

montanalow commented on May 16, 2024

Your approach is one we've planned for, and should work, although it has not been tested in a production environment that I know of. I'd be happy to support your cause and help with any issues you come across.

Rather than replicating the entire DB, another option is to use logical replication on just the subset of tables required, or you could create foreign data wrappers on your training server pointing back to your host server to pull the data on demand without replicating at all. pgml.train creates a pgml.snapshot of the table when it is called with a relation name, so you could even test multiple algorithms or hyperparameters with a single pull over FDW from the master dataset.
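As a rough sketch (host names, credentials, relation and column names below are placeholders for your setup), pulling the training data over postgres_fdw and letting pgml.train snapshot it could look something like:

    -- On the training server: pull data on demand from the primary over postgres_fdw
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER primary_db
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'primary.example.com', port '5432', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER primary_db
        OPTIONS (user 'readonly_user', password 'secret');

    -- Import only the relations needed for training
    IMPORT FOREIGN SCHEMA public LIMIT TO (my_training_table)
        FROM SERVER primary_db INTO public;

    -- pgml.train snapshots the relation locally, so a single pull over FDW can back
    -- several algorithm/hyperparameter experiments
    SELECT * FROM pgml.train('My project', 'regression', 'my_training_table', 'my_target_column');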

The algorithm will be a significant factor in the performance benefits of offloading to a second server from the master. For algorithms like linear regression, my experience is that scanning the data from disk creates the majority of the load on the server; the actual computation for those algorithms is small relative to the data movement from storage to RAM. For a large XGBoost model, I agree it's worth at least testing the performance gains on a second worker. We need to add more metrics and visibility into where time is spent during training, so I'll prioritize that on our roadmap if it will help you.

The serialized models are stored as pickled BYTEA objects in the pgml.models table, which should replicate the same as any other table. You should also be able to pg_dump the tables from one server to another if you want to avoid replication, or to create backups. You'll also want to replicate pgml.projects and pgml.deployments. You may wish to exclude pgml.snapshots from replication to the inference servers, since snapshots are only used during training and later for analysis. We're working on our documentation to clarify this further, but the basics are outlined here: https://postgresml.org/references/models/
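As a minimal sketch of that (server names and connection strings are placeholders, and the pgml tables must already exist on the subscriber), built-in logical replication can publish just those tables:

    -- On the training server: publish only the artifacts the inference servers need;
    -- pgml.snapshots is deliberately left out
    CREATE PUBLICATION pgml_artifacts
        FOR TABLE pgml.projects, pgml.models, pgml.deployments;

    -- On each inference server: subscribe to the publication
    CREATE SUBSCRIPTION pgml_artifacts_sub
        CONNECTION 'host=training.example.com dbname=app user=replicator password=secret'
        PUBLICATION pgml_artifacts;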

Does pgml's backend implementation in Python support GPU devices?
AFAIK none of the scikit-learn algorithms support GPU devices. XGBoost does: https://xgboost.readthedocs.io/en/stable/gpu/index.html

You'll need to ensure that the XGBoost libraries installed on the PostgreSQL server are set up correctly, and pass the tree_method: gpu_hist hyperparameter.
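Assuming the hyperparams are passed to pgml.train as a JSON argument, as in the documented examples (the project, relation and column names here are placeholders), that would look roughly like:

    SELECT * FROM pgml.train(
        'My project',            -- project name (placeholder)
        'regression',
        'my_training_table',     -- relation (placeholder)
        'my_target_column',      -- target column (placeholder)
        'xgboost',
        '{"tree_method": "gpu_hist"}'
    );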


tucnak commented on May 16, 2024

Thank you for the quick response; I'm very glad to know that we can rely on your support!

PostgresML is perhaps the most interesting piece of Postgres-related software I've come across in months. We would like to add it to our database container build, which is based on timescaledb-ha; perhaps having the built-in dashboard would prove beneficial, too. Would you happen to know how we should approach this? I took a look at pgml-extension's Dockerfile, and perhaps we could make it part of our multi-stage build and just copy the necessary files? I imagine these are the pip3-provided packages and the extension build itself. We use Postgres 14, would pgml support that as well? The existing container seems to rely on 13, so perhaps our multi-stage build would have to be more involved. I'm not a Docker guru, and in this case it's not immediately clear to me what the final composition should look like.

You should also be able to pg_dump the tables from one server to another if you want to avoid replication, or create backups

This is perfect! At any rate, it's unlikely that we're going to be running these "large" xgboost models; the reason I asked about GPUs was basically to learn whether there are any inherent limitations in pgml itself that would prevent the underlying backend (xgboost) from using the available hardware. However, for our case it's almost irrelevant, as we're likely to rely only on lightweight regressions, and the training sessions can be scheduled with pg_cron on our regular master nodes.
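For reference, scheduling such a retrain with pg_cron would probably look something like this (placeholder project and relation names; assuming a pg_cron version that supports the named-job cron.schedule variant):

    -- Retrain the model every night at 03:00
    SELECT cron.schedule(
        'nightly-pgml-train',
        '0 3 * * *',
        $$SELECT * FROM pgml.train('Time to fund', 'regression', 'plea_training', 'time_to_fund')$$
    );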

Basically, there are three types of regressions we would want to perform:

(1) Predict the time-to-fund (the time for a plea request to reach its target funding) based on how long it has already been published, the number of views, and the amount of donations it has accumulated during that time.

(2) Predict which plea requests need extra exposure; currently we do this using the following aggregate formula:

    SELECT
        -- score: funding progress * log(total view time) / view count * log(age in seconds)
        (1+funded/100) * log(t) / coalesce(nullif(n, 0), 1) *
        log(extract(epoch FROM now()-published_at)) as score,
        id
    FROM plea_view pv
    LEFT JOIN LATERAL (
        SELECT
            count(*) as n,                                  -- n: panorama rows (views) for this plea
            coalesce(sum(extract(epoch FROM dt)), 1) as t   -- t: total view time in seconds, 1 if none
        FROM panorama pr
        WHERE pr.plea_id = pv.id
          AND dt is not null
        ) pr ON true

This formula is meant to promote the requests that have accrued relatively more donations in spite of a relatively small number of views (thus proving that they are potentially worthwhile). Then we pick the requests for the frontpage "carousel" in designated proportions out of the score distribution (half of them come from p95, another quarter from p75, and another from p50); this way the selection has somewhat of a "random" feel and won't favour only the super-performers.

We're looking to perhaps add sophistication to how this score is calculated, using a regression model.

(3) The requests are distributed among the volunteers that have them in their "working area", i.e. a volunteer will see a request in their pool if the applicant is within the geographical vicinity where they can operate. But with the number of requests growing, we're afraid this will eventually overwhelm the volunteers, so perhaps we could do better than that by promoting certain pleas in the pool for certain volunteers, if we deem them a good fit. Potentially we would want to extract some features out of the successfully completed requests and correlate these request features with the volunteers who completed them. This way we could "recommend" more requests like the ones a volunteer has completed in the past, by category, more precise geographic location, or some tf-idf keywords.

Do you believe pgml would be a good fit for this?


montanalow commented on May 16, 2024

Installation

To install PostgresML on Linux, the standard installation instructions should work:

https://postgresml.org/guides/installation/#install-postgresql-with-plpython

You'll need to add those lines to your container setup, e.g. for Debian/Ubuntu:

# Install Postgres with PL/Python
$ sudo apt-get install -y postgresql-plpython3-12 python3 python3-pip postgresql-12

# Install PostgresML in the database (after it is running)
$ sudo pip3 install --install-option="--database-url=postgres://user_name:password@localhost:5432/database_name" pgml-extension

Modelling

These sound like great ideas. I generally start with a linear model to see where the predictive baseline is, since building that model usually takes about the same order of magnitude of time as a SELECT count(*) FROM ... on normal PostgreSQL. I'm not sure how much the performance characteristics will change on TimescaleDB, but I'd love to see your timings with \timing on.

Roughly how many rows would your training sets have? For sub-100k training samples, I think a linear model should train on the order of seconds, but it'll depend on your disk IO speed and whether everything is already resident in memory. I also find it helpful to work with a subsample during algorithm selection if there are more than 1-10M rows in the full set, and to survey many different algorithms/hyperparameters. At the scale of 10M+ rows there may be enough data for deep learning to start to overtake more efficient algorithms in performance. Another thing on our backlog is https://neuralprophet.com/html/index.html for time series predictions, which I thought might help since you're using timescaledb, but it's not immediately clear to me that it would be a good algorithm for any of these use cases.

I'm not sure if you've seen the regression example code.
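In case you haven't, here is a rough sketch of what the time-to-fund model from (1) might look like with that API, assuming a hypothetical plea_training view exposing the features and the target (every table and column name below is a placeholder):

    -- Hypothetical view: features and label for pleas that already reached their target
    CREATE VIEW plea_training AS
    SELECT
        views,
        donations_total,
        extract(epoch FROM funded_at - published_at) AS time_to_fund   -- target, in seconds
    FROM plea_view
    WHERE funded_at IS NOT NULL;

    -- Baseline linear model first, then compare another algorithm
    SELECT * FROM pgml.train('Time to fund', 'regression', 'plea_training', 'time_to_fund');
    SELECT * FROM pgml.train('Time to fund', 'regression', 'plea_training', 'time_to_fund', 'xgboost');

    -- Predict for pleas that are still open, passing features in the same order as the view
    SELECT id, pgml.predict('Time to fund', ARRAY[views, donations_total]) AS predicted_seconds
    FROM plea_view
    WHERE funded_at IS NULL;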


montanalow commented on May 16, 2024

Also, I'm not sure how familiar you are with the built-in PostgreSQL full text search functionality, but it has been great for my projects in the past, and since you mention tf-idf it may help you generate some more features as well.
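For example, a precomputed tsvector column plus a GIN index makes it cheap to turn free-form request descriptions into a relevance signal you could feed back into scoring (the table and column names here are placeholders):

    -- Postgres 12+ generated column keeps the tsvector up to date automatically
    ALTER TABLE plea ADD COLUMN description_tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', coalesce(description, ''))) STORED;
    CREATE INDEX plea_description_tsv_idx ON plea USING GIN (description_tsv);

    -- Rank open pleas against keywords drawn from a volunteer's completed requests
    SELECT id, ts_rank(description_tsv, plainto_tsquery('english', 'medical supplies')) AS rank
    FROM plea
    WHERE description_tsv @@ plainto_tsquery('english', 'medical supplies')
    ORDER BY rank DESC
    LIMIT 20;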


levkk commented on May 16, 2024

We've just added some docs on how to use production data with PostgresML. Please let us know how it works out for you.


montanalow commented on May 16, 2024

I think we've added answers to your questions in the documentation. Please don't hesitate to re-open or create a new issue if you have additional questions.


tucnak commented on May 16, 2024

Thank you! I just wanted to come by and say that PostgresML is the best thing that has happened to us. I've always felt that Pandas and the workflows built around it suck profoundly, as opposed to what you normally do in BigData® warehouses such as Snowflake, BigQuery, et al.

Now that we've had limited success with PostgresML in our charity, I'm itching to apply it everywhere I set my foot!

What would you say is currently your biggest weakness, or perhaps the most underdeveloped area, that you think is preventing wider adoption in the machine learning community? You have a beautiful project here, and a gateway drug, I think, for the kind of shops that already have a BI team but don't yet have a dedicated machine learning team, due to lack of initiative or whatnot. (Like my company!) The barrier to entry is just so low, it seems anybody can simply pick it up and suddenly produce insights out of thin air. I'm not an ML guy at all, but I can appreciate a good regression, and to me tools like this are positively liberating.

BTW, would you welcome a PR introducing one or more survival analysis models, such as IPCRidge or Cox's PH? These can be used to predict whether, and when, a particular event in the data will happen. I'm by no means an ML specialist, but this is something one of my peers has played with in the past, and I couldn't help but wonder if we could apply it to our problem set in the charity, to guess whether a particular supply link in our vendor networks is more or less likely to fail, given the data that has been accumulated by our partners and us.

Please let me know if I'm missing something, or if postgresml is already capable of this.


montanalow commented on May 16, 2024

I'm glad you're finding success on your mission! I think the biggest weakness is simply awareness of the capability vs "the way things are done". PostgresML is still a pretty young project. It'd go a long way to have case studies we can discuss publicly. We'd love to do a writeup of how you're using it, and link to your charity to increase visibility there as well. Do you want to connect w/ me over email to coordinate? I'm [email protected].

I think the two algorithms you've linked look like good candidates for the regression API, and we'd welcome PRs to introduce new libraries/algos/capabilities.

