etlundquist / rankfm
Factorization Machines for Recommendation and Ranking Problems with Implicit Feedback Data
License: GNU General Public License v3.0
I would like to store the time-dependent loss function in an array. It would be nice if there was a hook function that would allow me to do this, or have the call to the trainer return this list. Can anybody help with that? The compilation to C complicates matters. Thanks.
Hi,
In Rendle's original paper on FM (https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), he used a feature that captures the time of the user-item interaction. This should improve the model.
In your model, from what I see, we can input user features, item features and the interactions themselves.
How can I add the time of the interaction into the mix?
When running fit() with user features, I get the error:
KeyError: 'the users in [user_features] do not match the users in [interactions]'
which has been reported previously. In my case, I did some debugging in the source code and found the following. In the function _init_interactions, one finds the statement:
if np.array_equal(sorted(x_uf.index.values), self.user_idx):
    self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
else:
    raise KeyError('the users in [user_features] do not match the users in [interactions]')
which is the error in question. Looking at the definition of self.user_idx, one finds, in the same file rankfm.py:
# store unique values of user/item indexes and observed interactions for each user
self.user_idx = np.arange(len(self.user_id), dtype=np.int32)
self.item_idx = np.arange(len(self.item_id), dtype=np.int32)
near line 128. Clearly, self.user_idx holds consecutive indexes 0, 1, 2, ... up to the number of user ids.
However, sorted(x_uf.index.values) is the sorted list of user ids. Thus, the two lists cannot be equal. The code that leads me to this conclusion is:
if user_features is not None:
    x_uf = pd.DataFrame(user_features.copy())
    x_uf = x_uf.set_index(x_uf.columns[0])
    x_uf.index = x_uf.index.map(self.user_to_index)
    if np.array_equal(sorted(x_uf.index.values), self.user_idx):
As far as I understand, the first column of user_features, which is an argument to the function, should be the actual user_id, which can be anything, as long as it does not appear twice in the dataframe. In this case, the conditional (last line) cannot be satisfied.
Therefore, I must not understand the data format of user_features. Where is this explained? The documentation states the following:
user_features – dataframe of user metadata features: [user_id, uf_1, … , uf_n]
with no additional information regarding the values of user_id. Any clarification would be most welcome!
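A minimal diagnostic sketch for this error, using only pandas. Because the quoted code maps the feature index through self.user_to_index before the comparison, the check appears to pass only when the feature frame holds exactly one row per user seen in the interactions: no missing users, no extra users, no duplicates. The helper name diff_ids and the sample frames are hypothetical, and the sketch assumes the id sits in the first column of both frames.

```python
import pandas as pd

def diff_ids(interactions: pd.DataFrame, features: pd.DataFrame) -> dict:
    """Compare the ids found in the first column of each frame."""
    inter_ids = set(interactions.iloc[:, 0])
    feat_col = features.iloc[:, 0]
    feat_ids = set(feat_col)
    return {
        # ids that interact but have no feature row
        "missing_from_features": sorted(inter_ids - feat_ids),
        # ids with a feature row but no interactions
        "extra_in_features": sorted(feat_ids - inter_ids),
        # ids that appear on more than one feature row
        "duplicated_in_features": sorted(feat_col[feat_col.duplicated()].unique().tolist()),
    }

# hypothetical data: user 30 has no features, user 40 has no interactions, user 20 is duplicated
interactions = pd.DataFrame({"user_id": [10, 10, 20, 30], "item_id": [1, 2, 1, 3]})
user_features = pd.DataFrame({"user_id": [10, 20, 20, 40], "uf_1": [0.1, 0.2, 0.2, 0.9]})

report = diff_ids(interactions, user_features)
print(report)
```

If all three lists come back empty and the error persists, a dtype mismatch between the two id columns (for example integers in one frame and strings in the other) would be the next thing to rule out.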
Hi,
Thank you so much for this library. AFAIK it is the only FM lib with WARP loss.
I was thinking of using it, and I was wondering whether you have a source (paper) for the actual implementation you followed, or a preferred citation for it.
Thank you and keep up the great work!
This looks like a very promising library - congrats!
I am not familiar with the theory yet, but is it possible to include user/interaction features? For example, a typical use case is the amount of time elapsed since a product was last purchased.
KeyError Traceback (most recent call last)
in
----> 1 model.fit(interactions, user_features, item_features, sample_weight, epochs=50, verbose=True)
~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
263
264 self._reset_state()
--> 265 self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
266
267
~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit_partial(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
287 self._init_features(user_features, item_features)
288 else:
--> 289 self._init_all(interactions, user_features, item_features, sample_weight)
290
291 # determine the number of negative samples to draw depending on the loss function
~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_all(self, interactions, user_features, item_features, sample_weight)
133
134 # map the user/item features to internal index positions
--> 135 self._init_features(user_features, item_features)
136
137 # initialize the model weights after the user/item/feature dimensions have been established
~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)
KeyError: 'the users in [user_features] do not match the users in [interactions]'
Hi,
I would like to capture the loss as a function of epoch in an array. Currently, it is only possible to print it to stdout via the verbose=True argument of fit. Could the code be enhanced to allow the user to specify callback functions? Alternatively, could the loss values be returned from the C code? Thanks.
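Until such a hook exists, one possible workaround is to train one epoch at a time and record a metric of your own after each epoch. The sketch below relies only on the fit_partial(interactions, ..., epochs, verbose) call shape visible in the tracebacks on this page; the helper fit_with_history, the metric_fn callback, and the _ToyModel stand-in are hypothetical. It records whatever metric_fn computes, not the library's internal loss.

```python
from typing import Callable, List

def fit_with_history(model, interactions, n_epochs: int,
                     metric_fn: Callable[[object], float], **fit_kwargs) -> List[float]:
    """Train one epoch at a time and record metric_fn(model) after each epoch."""
    history = []
    for _ in range(n_epochs):
        # extra keyword arguments (e.g. user_features, sample_weight) are passed through
        model.fit_partial(interactions, epochs=1, verbose=False, **fit_kwargs)
        history.append(float(metric_fn(model)))
    return history

class _ToyModel:
    """Stand-in with the same fit_partial call shape; this is NOT RankFM."""
    def __init__(self):
        self.epochs_seen = 0

    def fit_partial(self, interactions, epochs=1, verbose=False, **kwargs):
        self.epochs_seen += epochs

toy = _ToyModel()
history = fit_with_history(toy, interactions=None, n_epochs=3,
                           metric_fn=lambda m: 1.0 / m.epochs_seen)
print(history)
```

With a real model, metric_fn would typically compute a validation metric on held-out interactions, which costs extra time per epoch.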
item_features_train = pd.get_dummies(train_interactions[['Items', 'moment']], columns=['moment'])
I am classifying my items into fast-, medium-, and slow-moving items.
For this I am using the parameter "item_features".
It gives me this error:
model.fit(train_user_item, user_features=None, item_features=item_features_train, sample_weight=sample_weight_train, epochs=epochs, verbose=verbose)
File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 265, in fit
self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 289, in fit_partial
self._init_all(interactions, user_features, item_features, sample_weight)
File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 135, in _init_all
self._init_features(user_features, item_features)
File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 214, in _init_features
raise KeyError('the items in [item_features] do not match the items in [interactions]')
KeyError: 'the items in [item_features] do not match the items in [interactions]'
Can someone help me with this?
I have also gone through the example notebooks. I noticed that you constructed the item features but did not use them in the Instacart example. It would be great if those notebooks were updated.
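One possible cause, judging from the np.array_equal check quoted in an earlier report on this page: building the dummies directly from the interactions table yields one row per interaction, so items that appear more than once are duplicated in item_features. A minimal sketch of deduplicating first follows; the sample data is hypothetical, and it assumes each item carries a single "moment" value.

```python
import pandas as pd

# hypothetical interactions with an item-level category repeated on every row
train_interactions = pd.DataFrame({
    "Users": [1, 1, 2, 3],
    "Items": ["a", "b", "a", "c"],
    "moment": ["fast", "slow", "fast", "medium"],
})

# keep a single row per item before one-hot encoding the category
item_table = train_interactions[["Items", "moment"]].drop_duplicates(subset="Items")
item_features_train = pd.get_dummies(item_table, columns=["moment"], dtype=float)

# each item should now appear exactly once
assert item_features_train["Items"].is_unique
print(item_features_train)
```

The item column stays first, followed by one numeric column per category, which matches the [item_id, if_1, ..., if_n] layout described for the feature arguments.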
Great code, thanks!
Please help me understand:
1. Will it work for third-order interactions between categorical features?
2. Will it run on a Windows computer?
3. Will it work for sparse data?
1. Can you please explain how to save and load the best model?
2. Also, is there any way to parallelise (use multiprocessing in) the training/prediction part? I have observed that only one core of my machine is being used.
Could you clarify whether you tried or tested your code with auxiliary features?
In your blog post
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.
But support for auxiliary features is the essential part of what you developed.
Have you since found data with auxiliary features?
Could you also clarify how your code works with its key advertised feature?
As written in
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
To overcome these limitations we need a more general model framework that can extend the latent factor approach to incorporate arbitrary auxiliary features, and specialized loss functions that directly optimize item rank-order using implicit feedback data. Enter Factorization Machines and Learning-to-Rank.
But you are testing your code on data without auxiliary features,
as you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.
What is the point of demoing your code on data without auxiliary features when the code is advertised as supporting them?
I tried to use item_feature in the fit() method but I got:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
212 self.x_if = np.ascontiguousarray(x_if.sort_index(), dtype=np.float32)
213 else:
--> 214 raise KeyError('the items in [item_features] do not match the items in [interactions]')
215 else:
216 self.x_if = np.zeros([len(self.item_idx), 1], dtype=np.float32)
KeyError: 'the items in [item_features] do not match the items in [interactions]'
and when adding user_feature, I got a similar error:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)
KeyError: 'the users in [user_features] do not match the users in [interactions]'
I double-checked my data and there are matching catalog_ids and user_ids in both the training data and the feature data.
What could be the issue?
Installation on Windows 11 fails:
sep13 N1\Fastapi multi replica>pip install rankfm
Collecting rankfm
Using cached rankfm-0.2.5.tar.gz (145 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy>=1.15 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.24.3)
Requirement already satisfied: pandas>=0.24 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.5.3)
Requirement already satisfied: pytz>=2020.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from python-dateutil>=2.8.1->pandas>=0.24->rankfm) (1.16.0)
Building wheels for collected packages: rankfm
Building wheel for rankfm (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
building extensions with pre-generated C source...
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rankfm
Running setup.py clean for rankfm
Failed to build rankfm
Installing collected packages: rankfm
Running setup.py install for rankfm ... error
error: subprocess-exited-with-error
× Running setup.py install for rankfm did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
building extensions with pre-generated C source...
running install
C:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> rankfm
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip
I've been trying to install RankFM on Windows but haven't been able to do so.
First, I tried installing Cygwin to get the GNU Compiler Collection on Windows (now I can use gcc on cmd), but when installing, I received: "Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"". Then I installed the latest version of Microsoft Visual C++, and now it is throwing the following:
Building wheel for rankfm (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
building extensions with pre-generated C source...
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-3.9\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-3.9\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-3.9\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-3.9\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\rankfm
creating build\temp.win-amd64-3.9\Release\rankfm\mt19937ar
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -Ic:\users\makue\documents\rankfm\venv\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\cppwinrt" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-3.9\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.31.31103\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
[end of output]
Can you please tell me what I can do? Thank you.
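The logs above show GCC-style flags (-O2 -ffast-math -Wno-unused-function -Wno-uninitialized) being passed to MSVC's cl.exe, which rejects them with error D8021. A minimal sketch of how a setup script could choose flags per platform follows. The helper compile_args_for is hypothetical, the MSVC flags /O2 and /fp:fast are only rough counterparts of the GCC ones, and how the package's own setup.py defines its flags is an assumption here.

```python
import sys

# flags taken from the failing command line in the logs
GCC_STYLE_ARGS = ["-O2", "-ffast-math", "-Wno-unused-function", "-Wno-uninitialized"]
# rough MSVC counterparts: optimisation level and fast floating-point model
MSVC_STYLE_ARGS = ["/O2", "/fp:fast"]

def compile_args_for(platform: str = sys.platform) -> list:
    """Return compiler flags suited to the default compiler on the given platform."""
    if platform.startswith("win"):
        return list(MSVC_STYLE_ARGS)
    return list(GCC_STYLE_ARGS)

print(compile_args_for("win32"))
print(compile_args_for("linux"))
```

The returned list could be handed to the extra_compile_args argument of setuptools.Extension when building from a local clone. Note that sys.platform reports "win32" even when a MinGW GCC toolchain is in use, so this check is only a heuristic for the default MSVC setup.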
Hello,
I am having a great experience with your library.
For research purposes I need to compute the NDCG, but you only provide the DCG.
Could you please add it or give us some hints on how to compute it?
Thanks
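A minimal sketch of NDCG@k for a single user with binary relevance, written with NumPy only. It uses the common 1 / log2(rank + 1) discount; the function names and sample data are hypothetical, and the library's own DCG may use a different discount or averaging, so the numbers are not guaranteed to line up with it.

```python
import numpy as np

def dcg_at_k(recommended, relevant, k):
    """DCG with binary gains: 1 if the recommended item is relevant, else 0."""
    rel = np.array([1.0 if item in relevant else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # ranks 1..n -> log2(rank + 1)
    return float(np.sum(rel * discounts))

def ndcg_at_k(recommended, relevant, k):
    """DCG divided by the DCG of an ideal ranking that puts all relevant items first."""
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg_at_k(recommended, relevant, k) / idcg

# hypothetical example: relevant items land at ranks 1 and 3
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c"}
score = ndcg_at_k(recommended, relevant, k=4)
print(round(score, 4))
```

To get a dataset-level figure, the per-user scores would be averaged over the users in the validation set.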
The link to the data (2017 Instacart Orders Data) from your blog
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
is broken.
Can we use this data instead?
https://www.kaggle.com/competitions/instacart-market-basket-analysis/data
Can you summarize the comparison results?
https://github.com/etlundquist/rankfm/blob/master/examples/instacart.ipynb
I'm running the lib on a virtual server with 64 GB of RAM.
My data consists of:
200k distinct interactions between users and items
52k x 11 user_feature matrix
2770 x 49 item_feature matrix
All NA values are replaced by 0.
When I try to run it, it gives me this error:
AssertionError: user factors [v_u] are not finite - try decreasing feature/sample_weight magnitudes
Sometimes it gives me an item factors error as well.
However, if I run on 170k user interactions without user_features and item_features, it runs smoothly.
What is the meaning of the error?
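The error message itself suggests decreasing feature magnitudes. One common way to do that is to rescale every feature column into [0, 1] before fitting. A minimal pandas sketch follows; the helper min_max_scale, the column names, and the sample values are hypothetical, and whether this alone resolves the error for a given data set is not guaranteed.

```python
import pandas as pd

def min_max_scale(features: pd.DataFrame, id_col: str) -> pd.DataFrame:
    """Rescale every non-id column into [0, 1]; constant columns become 0."""
    scaled = features.copy()
    value_cols = [c for c in scaled.columns if c != id_col]
    values = scaled[value_cols].astype(float)
    lo = values.min()
    # avoid division by zero for columns with a single distinct value
    span = (values.max() - lo).replace(0.0, 1.0)
    scaled[value_cols] = (values - lo) / span
    return scaled

# hypothetical user features with very different magnitudes
user_features = pd.DataFrame({
    "user_id": [1, 2, 3],
    "income": [20000, 50000, 80000],
    "age": [20, 30, 40],
    "flag": [5, 5, 5],
})
scaled = min_max_scale(user_features, id_col="user_id")
print(scaled)
```

The id column is left untouched so the frame can still be matched against the interactions.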
First of all, thank you for providing good code. But I would like to suggest changing the multiplier in the WARP part
(line 269 in _rankfm.pyx)
from
multiplier = log((I - 1) / sampled) / log(I)
to
multiplier = log((I - items_user[u]) / sampled) / log(I)
From a mathematical viewpoint, the numerator should be the size of the population of j (in this case, the negative items for u). Since the population of j is the complement of user_items[u], I think it's better to change the numerator to I - items_user[u].
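To see how much the two variants differ in practice, here is a small numeric sketch. It assumes that items_user[u] in the proposal stands for the number of items user u has interacted with; the function warp_multiplier and the example numbers are hypothetical and do not reproduce the library's code.

```python
import math

def warp_multiplier(n_items: int, n_sampled: int, n_excluded: int = 1) -> float:
    """log((I - n_excluded) / sampled) / log(I), for I items and a given sample count."""
    return math.log((n_items - n_excluded) / n_sampled) / math.log(n_items)

# hypothetical catalogue of 1000 items, violation found after 10 samples
current = warp_multiplier(n_items=1000, n_sampled=10)                   # numerator I - 1
proposed = warp_multiplier(n_items=1000, n_sampled=10, n_excluded=200)  # user has 200 items
print(current, proposed)
```

For users with few interactions relative to the catalogue the two values are nearly identical; the gap widens as a user's share of the catalogue grows.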
RankFM has user and item features, and that is great. However, I have a use case with features that cannot be put into this nice form. Specifically, I have some user-item pairs that occur multiple times in my data set. For example, a user might purchase the same product on different days, or purchase multiple products. It does not appear that this use case can be handled by the current API. On the other hand, the algorithm found in the function compute_ui_utility should be able to handle this case with some modification. Until this morning, I was under the impression that each row of the user-product array had a series of features.
So my question is: can I modify the library to handle the use case where I have N features per row without requiring features at the user and item levels?
I hope I was clear. Thank you for any advice.
I read in the documentation:
"If you run into overflow errors then decrease the feature and/or sample-weight magnitudes and try upping beta, especially if you have a small number of dense user-features and/or item-features."
I did not understand the meaning of decreasing the weight magnitudes. Currently, all weights are set to 1. Are you suggesting setting them to, say, 0.5, the same for all rows? But in that case, there will be no difference.
The default value of beta is set to 0.1. How high do you recommend raising it?
Thanks.
Just a few things I'm not clear on going through the medium article (https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db) and reading the notebook here on github.
1.
The Medium article talks about using item / user features as part of a recommender system. The accompanying notebook creates item features based on the aisle number:
item_features
item_features_train
item_features_valid
and never uses them. Unless I missed something? I checked the docs as well. Should the notebook be using them somewhere?
Should the product_id be in the index of item_features_train? Because in the notebook it's just a regular column:
Anyway, it looks like item_features_train goes into model.fit():
model.fit(interactions_train, sample_weight=sample_weight_train, epochs=30, verbose=True, item_features=item_features_train)
I've compared the evaluation using item_features_train to the original model in the notebook, which doesn't use item_features_train. The scores are slightly lower when I use item_features, but not by much. This makes me think I'm either doing something wrong (should product_id be in the index of item_features_train?), or these features are just not great and don't add anything other than noise.
2.
The dataset also has department id as a potential feature. How would we use both aisle and department id in the same model? The item_features argument in model.fit() takes a single dummified dataframe. Do we use fit_partial to update the model with extra dataframes of user features or item features? Let's say I have the age, city, gender, and monthly income of my users. This would be four dataframes. How would I use them all?
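Since the documented layout is a single frame of the form [item_id, if_1, ..., if_n], several attributes can presumably live side by side as columns of one dataframe rather than in separate dataframes. A minimal pandas sketch for aisle and department together follows; the products table and its values are hypothetical.

```python
import pandas as pd

# hypothetical product table: one row per product, two categorical attributes
products = pd.DataFrame({
    "product_id": [101, 102, 103],
    "aisle_id": [1, 1, 2],
    "department_id": [7, 8, 8],
})

# one-hot encode both attributes at once; product_id stays as the first column
item_features = pd.get_dummies(products, columns=["aisle_id", "department_id"], dtype=float)
print(item_features)
```

The same pattern would apply to user attributes such as age, city, gender, and income: one row per user, with the categorical columns one-hot encoded and the numeric ones scaled to a modest range.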
3.
What is item_features_valid used for? This is created in the notebook but never used.
4.
There is another section where scores are generated:
scores = model.predict(interactions_valid, cold_start='nan')
print(scores)
array([-1.1237175 , 0.00314923, 1.5434608 , ..., 2.029984 ,
1.8916801 , 2.7111115 ], dtype=float32)
What are these used for? They are not used in the notebook after generating them.
Is there a way to get the latent factor space representations of users and items?
Thanks for your great library. One thing I noticed is that model training doesn't utilize all available CPUs in the machine, so the training process is very slow for larger datasets. Is there any parameter to pass to enable multi-CPU training?