I have documents in text format. How do I convert them to libsvm format? If I have


How should I convert my custom documents into libsvm format? (about ranking, CLOSED)

tensorflow commented on July 20, 2024

How should I convert my custom documents into libsvm format?

from ranking.

Comments (8)

eggie5 commented on July 20, 2024

I think that, technically, libsvm features must be floats, and the provided libsvm parser expects floats. I've quickly modified the parser to take different types. However, the disadvantage of the libsvm format is that it's not ideal for query data. Query data is a nested structure (each query has context features and many documents, and each document has example features), whereas libsvm is a tabular structure. This means that the libsvm parser has to build this nested structure from the flat file, and in my experience it is slow enough to starve the GPU because it only works sequentially in terms of I/O. You might have to do some type of interleaving to get some I/O concurrency.
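To illustrate the nesting step described above, here is a minimal sketch in plain Python (not the TF-Ranking parser itself). The function name `group_by_query` and the sample rows are assumptions made for illustration; it only shows how flat libsvm rows can be regrouped per query.

```python
from collections import OrderedDict


def group_by_query(lines):
    """Group flat libsvm rows into {qid: [(label, {feature_id: value}), ...]}.

    Hypothetical helper for illustration; not part of TF-Ranking.
    """
    queries = OrderedDict()
    for line in lines:
        # Drop an optional trailing "# comment" and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        tokens = line.split()
        label = float(tokens[0])
        qid = tokens[1].split(":", 1)[1]
        features = {}
        for token in tokens[2:]:
            feature_id, value = token.split(":", 1)
            features[int(feature_id)] = float(value)
        # Each query collects its own list of documents.
        queries.setdefault(qid, []).append((label, features))
    return queries


rows = [
    "0 qid:1 1:3 2:0 3:2 4:2",
    "1 qid:1 1:3 2:3 7:4",
    "1 qid:17 1:3 2:3",
]
grouped = group_by_query(rows)
for qid, docs in grouped.items():
    print(qid, len(docs))
```

Because the file is read row by row, this kind of regrouping is inherently sequential, which is the overhead mentioned above.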

The authors of this package recommend the ProtoBuf data format, which doesn't have the parsing overhead and has a more sophisticated dataset pipeline with concurrency features.


Dinesh-Mali commented on July 20, 2024

Thank you Alex for your reply.

So TF-Ranking does not handle feature creation: we first have to formulate the features for each query-document pair across the whole dataset and then convert them to libsvm. This libsvm data is then parsed by the libsvm parser and used for training the model.

Please correct me if I am wrong.


eggie5 commented on July 20, 2024

TF Ranking is a wrapper around the Estimators API, so you can really use any feature that you want. And just like with the Estimators API, you have to do feature creation yourself. If you choose to use the libsvm Dataset provided as an example by tf-ranking, without modification, then you need to put everything in libsvm format, which is all floats. This is pretty common: if you look at the MSLR WEB30k LTR dataset, it's formatted this way, and the provided demo script (not the notebook) can run it and get near-SOTA results (around .43 NDCG@3).

So maybe look at MSLR 30k and how they build that dataset.


bendersky commented on July 20, 2024

Thanks for your explanation, Alex. Indeed, libsvm is probably not the best choice for large datasets, where we prefer users to choose the TF.Example format.

Dinesh, if your dataset is reasonably sized, you can use libsvm. We do not provide text parsing / feature extraction as part of TF-Ranking, as these will vary widely by application. For input, the libsvm file should look roughly like this:

0 qid:1 1:3 2:0 3:2 4:2 …
1 qid:1 1:3 2:3 7:4 ...
...
1 qid:17 1:3 2:3 ...

where each row represents a <query,doc> pair and
first column -- relevance label
second column -- query id
the rest of the columns -- feature_id:value (note that libsvm is a sparse format, so if the feature doesn't exist in the document it can be skipped).
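
As a rough sketch of how rows in this layout could be produced from features you have already computed, the snippet below formats one <query,doc> pair per line. The helper name `to_libsvm_line` and the convention of using `None` for a feature that doesn't exist in the document are assumptions for illustration, not part of TF-Ranking.

```python
def to_libsvm_line(label, qid, features):
    """Format one <query, doc> pair as a libsvm row.

    `features` maps an integer feature id to a numeric value; a value of
    None marks a feature that is missing for this document (an assumed
    convention) and is skipped, since libsvm is a sparse format.
    """
    parts = [str(label), f"qid:{qid}"]
    for feature_id in sorted(features):
        value = features[feature_id]
        if value is None:
            continue  # sparse format: absent features are simply omitted
        parts.append(f"{feature_id}:{value:g}")
    return " ".join(parts)


example_line = to_libsvm_line(1, 1, {1: 3, 2: 3, 5: None, 7: 4})
print(example_line)
```

Writing the rows of one query contiguously keeps the file easy to group by `qid` later.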

Hope this helps!


eggie5 commented on July 20, 2024

@bendersky SequenceExample right?


bendersky commented on July 20, 2024

SequenceExample, you're correct.


Dinesh-Mali commented on July 20, 2024

Yeah, Thanks bendersky!


meg261995 commented on July 20, 2024

Hi,

So I have a training set that was used with LTR LambdaMART, in the format below:

4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo
3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo
3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo
3 qid:1 1:10.7808075 2:0.0 # 1368 rambo

col1 : relevance rank
col2 : query id
from col3: feature vector.

I would now like to use xgboost for ranking, but I do not understand how to convert this data into the xgboost training format. Any help would be appreciated.
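
One possible starting point, sketched in plain Python under the assumption that rows of the same query are stored contiguously: strip the trailing `#` comments and collect labels, feature dicts, and per-query group sizes. The function name `parse_ranking_rows` is hypothetical, and nothing here calls xgboost itself.

```python
from itertools import groupby


def parse_ranking_rows(lines):
    """Return (labels, features, group_sizes) from libsvm-style ranking rows.

    Hypothetical helper; assumes rows of the same qid are contiguous.
    """
    labels, qids, features = [], [], []
    for line in lines:
        # Everything after '#' is a free-text comment and is ignored.
        tokens = line.split("#", 1)[0].split()
        if not tokens:
            continue
        labels.append(float(tokens[0]))
        qids.append(tokens[1].split(":", 1)[1])
        pairs = (token.split(":", 1) for token in tokens[2:])
        features.append({int(fid): float(value) for fid, value in pairs})
    # One entry per run of consecutive rows that share a qid.
    group_sizes = [len(list(run)) for _, run in groupby(qids)]
    return labels, features, group_sizes


sample = [
    "4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo",
    "3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo",
    "3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo",
    "3 qid:1 1:10.7808075 2:0.0 # 1368 rambo",
]
labels, features, group_sizes = parse_ranking_rows(sample)
print(labels, group_sizes)
```

With xgboost installed, the labels and a feature matrix built from these dicts could then be wrapped in an `xgboost.DMatrix`, with the group sizes supplied through its `set_group` method; check the xgboost documentation for the details of your version.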
