Comments (8)

eggie5 commented on July 20, 2024

I think that, technically, libsvm features must be floats, and the provided libsvm parser expects floats. I've quickly modified the parser to accept other types. However, the disadvantage of the libsvm format is that it's not ideal for query data. Query data is a nested structure (each query has context features and many documents, and each document has example features), whereas libsvm is a tabular structure. This means the libsvm parser has to build this data structure from the flat file, and in my experience that is slow enough to starve the GPU because it works sequentially in terms of IO. You might have to do some type of interleaving to get some IO concurrency.
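To illustrate the mismatch, here is a minimal Python sketch (not the TF-Ranking parser itself) that rebuilds the nested per-query structure from flat rows. The helper name `group_by_query` and the `(label, qid, features)` tuple layout are assumptions made for illustration only.

```python
from collections import OrderedDict


def group_by_query(rows):
    """Group flat (label, qid, features) rows into one entry per query.

    Hypothetical helper: each input row is one <query, doc> pair. The result
    maps qid -> list of (label, features) documents, in first-seen order.
    """
    queries = OrderedDict()
    for label, qid, features in rows:
        queries.setdefault(qid, []).append((label, features))
    return queries


# Three flat rows: two documents for query "1", one for query "17".
rows = [
    (0.0, "1", {1: 3.0, 2: 0.0}),
    (1.0, "1", {1: 3.0, 2: 3.0}),
    (1.0, "17", {1: 3.0}),
]
nested = group_by_query(rows)
```

The loop consumes the rows one at a time, which is the sequential step that can become the bottleneck on a large file.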

The authors of this package recommend the ProtoBuf data format, which doesn't have the parsing overhead and has a more sophisticated dataset pipeline with concurrency features.

from ranking.

Dinesh-Mali commented on July 20, 2024

Thank you Alex for your reply.

So TF-Ranking does not handle feature creation: we first have to compute the features for every query-document pair in the dataset and then convert them to libsvm. This libsvm data is then parsed by the libsvm parser and used to train the model.

Please correct me if I am wrong.

eggie5 commented on July 20, 2024

TF-Ranking is a wrapper around the Estimators API, so you can really use any feature that you want. And just like with the Estimators API, you have to do feature creation yourself. If you choose to use the libsvm Dataset provided as an example by tf-ranking, without modification, then you need to put everything in libsvm format, which is all floats. This is pretty common: if you look at the MSLR WEB30k LTR dataset, it's formatted this way, and the provided demo script (not the notebook) can run it and get near-SOTA results (around .43 NDCG@3).

So maybe look at MSLR 30k and how they build that dataset.

bendersky commented on July 20, 2024

Thanks for your explanation, Alex. Indeed, libsvm is probably not the best choice for large datasets, where we prefer users to choose the TF.Example format.

Dinesh, if your dataset is reasonably sized, you can use libsvm. We do not provide text parsing or feature extraction as part of TF-Ranking, as these will vary widely by application. For input, the libsvm file should look roughly like this:

0 qid:1 1:3 2:0 3:2 4:2 …
1 qid:1 1:3 2:3 7:4 ...
...
1 qid:17 1:3 2:3 ...

where each row represents a <query,doc> pair and
first column -- relevance label
second column -- query id
the rest of the columns -- feature_id:value (note that libsvm is a sparse format, so if the feature doesn't exist in the document it can be skipped).
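As a rough illustration of that layout, a single row could be parsed along these lines in Python. This is only a sketch: `parse_libsvm_line` is a hypothetical helper, not the parser shipped with TF-Ranking, and treating everything after `#` as a comment is an assumption.

```python
def parse_libsvm_line(line):
    """Parse one 'label qid:<id> <feature_id>:<value> ...' row.

    Hypothetical helper, not part of TF-Ranking. Anything after '#' is
    treated as a comment and ignored (an assumption).
    """
    tokens = line.split("#", 1)[0].split()
    label = float(tokens[0])                # first column: relevance label
    qid = tokens[1].split(":", 1)[1]        # second column: query id
    features = {}
    for token in tokens[2:]:                # remaining columns: id:value
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return label, qid, features


label, qid, features = parse_libsvm_line("1 qid:1 1:3 2:3 7:4")
```

Features that are skipped in the row (here 3 through 6) are simply absent from the dictionary, which mirrors the sparse nature of the format.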

Hope this helps!

eggie5 commented on July 20, 2024

@bendersky SequenceExample right?

bendersky commented on July 20, 2024

SequenceExample, you're correct.

Dinesh-Mali commented on July 20, 2024

Yeah, Thanks bendersky!

meg261995 commented on July 20, 2024

Hi,

So I have a training set that was used for LTR with LambdaMART, in the format below:

4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo
3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo
3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo
3 qid:1 1:10.7808075 2:0.0 # 1368 rambo

col1 : relevance rank
col2 : query id
from col3: feature vector.

I would now like to use xgboost for ranking, but I do not understand how to convert this data into the xgboost training format. Any help would be appreciated.
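One possible starting point, as a minimal sketch rather than a definitive recipe: ranking with xgboost needs the labels, the feature rows, and the number of rows belonging to each query. The helper `to_xgboost_inputs` below is hypothetical, and it assumes that rows with the same qid are contiguous, that features are numbered from 1, and that missing features can be filled with 0.0.

```python
def to_xgboost_inputs(lines):
    """Split LTR rows into labels, dense feature rows and per-query group sizes.

    Hypothetical helper. Assumes rows of the same qid are contiguous and that
    feature ids start at 1; missing features are filled with 0.0.
    """
    labels, sparse_rows, group_sizes = [], [], []
    previous_qid = None
    for line in lines:
        tokens = line.split("#", 1)[0].split()  # drop the trailing comment
        if not tokens:
            continue
        qid = tokens[1].split(":", 1)[1]
        if qid != previous_qid:                 # a new query starts here
            group_sizes.append(0)
            previous_qid = qid
        group_sizes[-1] += 1
        labels.append(float(tokens[0]))
        pairs = (token.split(":", 1) for token in tokens[2:])
        sparse_rows.append({int(k): float(v) for k, v in pairs})
    width = max((max(row) for row in sparse_rows if row), default=0)
    features = [[row.get(i, 0.0) for i in range(1, width + 1)]
                for row in sparse_rows]
    return labels, features, group_sizes


sample = [
    "4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo",
    "3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo",
    "3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo",
    "3 qid:1 1:10.7808075 2:0.0 # 1368 rambo",
]
labels, features, group_sizes = to_xgboost_inputs(sample)
```

With the xgboost Python package, the features (converted to a NumPy array) and labels could then go into `xgboost.DMatrix`, with the group sizes supplied through `set_group` and a ranking objective such as `rank:pairwise`. Check the XGBoost documentation for the version you use.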
