Comments (8)
I think technically, libsvm features must be floats, and the provided libsvm parser expects floats. I've quickly modified the parser to take different types. However, the disadvantage of the libsvm format is that it's not ideal for query data. Query data is a nested structure (each query has context features and many documents, where each document has example features), while libsvm is a tabular structure. This means the libsvm parser has to build this nested structure from the flat file, and in my experience it's slow enough to starve the GPU because it only works sequentially in terms of IO. You might have to do some type of interleaving to get some IO concurrency.
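To make the interleaving idea concrete, here is a minimal, pure-Python sketch of reading several libsvm shard files concurrently so IO overlaps instead of running as one sequential scan. The shard paths and helper names are illustrative, not TF-Ranking API; in a real pipeline you would do something equivalent with `tf.data` (e.g. `Dataset.interleave`).

```python
from concurrent.futures import ThreadPoolExecutor

def read_shard(path):
    """Read one libsvm shard file into a list of lines."""
    with open(path) as f:
        return f.read().splitlines()

def interleaved_lines(shard_paths, workers=4):
    """Fetch shards concurrently; the consumer still sees lines in
    shard order, but file IO for later shards overlaps with the
    processing of earlier ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for shard in pool.map(read_shard, shard_paths):
            yield from shard
```

The same effect in TensorFlow would come from sharding the training file and letting the input pipeline read shards in parallel.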
The authors of this package recommend the protobuf-based data format, which doesn't have the parsing overhead and has a more sophisticated dataset pipeline with concurrency features.
from ranking.
Thank you Alex for your reply.
So TF-Ranking does not handle feature creation: we have to first compute our features for each query-document pair across the whole dataset and then convert them to libsvm. This libsvm data will then be parsed by the libsvm parser and used for training the model.
Please correct me if I am wrong.
TF-Ranking is a wrapper around the Estimators API, so you can really use any feature that you want. And just like with the Estimators API, you have to do feature creation yourself. If you choose to use the libsvm Dataset provided as an example by tf-ranking, without modification, then you need to put everything in libsvm format, which is all floats. This is pretty common: if you look at the MSLR-WEB30K LTR dataset, it's formatted this way, and the provided demo script (not the notebook) can run it and get near-SOTA results (around 0.43 NDCG@3).
So maybe look at MSLR 30k and how they build that dataset.
Thanks for your explanation, Alex. Indeed, libsvm is probably not the best choice for large datasets, where we prefer users to choose the TF.Example format.
Dinesh, if your dataset is reasonably sized, you can use libsvm. We do not provide text parsing / feature extraction as part of TF-Ranking, as these will vary widely by application. For input, the libsvm file should look roughly like this:
0 qid:1 1:3 2:0 3:2 4:2 …
1 qid:1 1:3 2:3 7:4 ...
...
1 qid:17 1:3 2:3 ...
where each row represents a <query, doc> pair and:
first column -- relevance label
second column -- query id
the rest of the columns -- feature_id:value pairs (note that libsvm is a sparse format, so if a feature doesn't exist in the document it can be skipped).
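A row in that format can be parsed with a few lines of plain Python. This is a minimal sketch, not the parser shipped with TF-Ranking, and the function name is illustrative:

```python
def parse_libsvm_line(line):
    """Parse one libsvm ranking row into (label, qid, features)."""
    # Drop an optional trailing comment ("# docid query").
    line = line.split("#", 1)[0]
    tokens = line.split()
    label = float(tokens[0])                  # first column: relevance label
    qid = tokens[1].split(":", 1)[1]          # "qid:1" -> "1"
    features = {}
    for tok in tokens[2:]:
        fid, value = tok.split(":", 1)
        features[int(fid)] = float(value)     # sparse: missing ids stay absent
    return label, qid, features
```

For example, `parse_libsvm_line("0 qid:1 1:3 2:0 3:2")` returns `(0.0, "1", {1: 3.0, 2: 0.0, 3: 2.0})`.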
Hope this helps!
@bendersky SequenceExample, right?
SequenceExample, you're correct.
Yeah, thanks bendersky!
Hi,
So I have a training set which was used for LTR with LambdaMART, in the format below:
4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo
3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo
3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo
3 qid:1 1:10.7808075 2:0.0 # 1368 rambo
col 1: relevance label
col 2: query id
col 3 onward: feature vector.
I would now like to use xgboost for ranking, but I don't understand how to convert this into xgboost's training format. Any help would be appreciated.
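For what it's worth, xgboost's learning-to-rank accepts the same libsvm feature columns; the main difference is that instead of `qid:` columns it expects the number of documents per query to be supplied separately as "group" sizes. A minimal sketch of computing those sizes from rows like the ones above (assuming rows for a query are contiguous; the function name is illustrative):

```python
def query_group_sizes(lines):
    """Return the number of consecutive rows per qid, in file order."""
    sizes = []
    prev_qid = None
    for line in lines:
        qid = line.split()[1]            # second column, e.g. "qid:1"
        if qid != prev_qid:
            sizes.append(0)              # a new query group starts here
            prev_qid = qid
        sizes[-1] += 1
    return sizes
```

The sizes would then be attached to the training matrix, roughly `dtrain = xgb.DMatrix("train.txt")` followed by `dtrain.set_group(query_group_sizes(lines))`, with a ranking objective such as `rank:pairwise`; check the xgboost docs for the exact API in your version.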