etlundquist / rankfm Goto Github PK

Factorization Machines for Recommendation and Ranking Problems with Implicit Feedback Data

License: GNU General Public License v3.0

Python 86.95% Makefile 0.13% C 12.92%

machine-learning recommendation factorization-machines recommender-system learning-to-rank implicit-feedback collaborative-filtering

rankfm's Issues

Tracking the loss function

I would like to store the time-dependent loss function in an array. It would be nice if there was a hook function that would allow me to do this, or have the call to the trainer return this list. Can anybody help with that? The compilation to C complicates matters. Thanks.

Adding time of interaction feature

Hi,
In Rendle's original paper on FM (https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), he used a feature that captures the time of the user-item interaction. This should improve the model.

In your model, from what I see, we can input user features, item features and the interactions themselves.
How can I add the time of the interaction into the mix?

Bug when using user features?

When running fit() with user features, I get the error:

KeyError: 'the users in [user_features] do not match the users in [interactions]'

which has been reported previously. In my case, I did some debugging in the source code, and found the following. In the function _init_interactions, one finds the statement:

            if np.array_equal(sorted(x_uf.index.values), self.user_idx):
                self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
            else:
                raise KeyError('the users in [user_features] do not match the users in [interactions]')

which is the error in question. Looking at the definition of self.user_idx, one finds, in the same file rankfm.py:

        # store unique values of user/item indexes and observed interactions for each user
        self.user_idx = np.arange(len(self.user_id), dtype=np.int32)
        self.item_idx = np.arange(len(self.item_id), dtype=np.int32)

near line 128. Clearly, self.user_idx are consecutive indexes 0,1,2, ... up to the number of user ids.
However, sorted(x_uf.index.values) is the sorted list of user ids. Thus, the two lists cannot be equal. The code that leads me to this conclusion is:

        if user_features is not None:
            x_uf = pd.DataFrame(user_features.copy())
            x_uf = x_uf.set_index(x_uf.columns[0])
            x_uf.index = x_uf.index.map(self.user_to_index)
            if np.array_equal(sorted(x_uf.index.values), self.user_idx):

As far as I understand, the first column of user_features, which is an argument to the function, should be the actual user_id, which can be anything as long as it does not appear twice in the dataframe. In that case, the conditional (last line) cannot be satisfied.
Therefore, I must not understand the data format of user_features. Where is this explained? The documentation states the following:

user_features – dataframe of user metadata features: [user_id, uf_1, … , uf_n]

with no additional information regarding the values of user_id. Any clarification would be most welcome!
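One reading of the quoted snippet: the feature ids are first passed through self.user_to_index, so the comparison is between mapped index positions and self.user_idx, not between raw ids and positions. If that reading is right, the check can only pass when the id column of user_features contains exactly the users found in the interactions, each one once. The sketch below is a hypothetical helper (diagnose_feature_ids is not part of RankFM) that reports the usual ways the two id sets can disagree. It takes plain iterables of ids, and treating float NaN ids as a separate case is an assumption about how such ids would arrive from pandas.

```python
import math


def diagnose_feature_ids(interaction_ids, feature_ids):
    """Compare the ids seen in [interactions] with the id column of a feature table.

    Hypothetical helper, not part of RankFM. Both arguments are plain iterables,
    e.g. interactions.iloc[:, 0] and user_features.iloc[:, 0].
    """
    def is_nan(value):
        return isinstance(value, float) and math.isnan(value)

    feature_ids = list(feature_ids)
    nan_count = sum(1 for value in feature_ids if is_nan(value))
    expected = {value for value in interaction_ids if not is_nan(value)}

    seen = set()
    duplicates = set()
    for value in feature_ids:
        if is_nan(value):
            continue
        if value in seen:
            duplicates.add(value)
        seen.add(value)

    return {
        "missing_from_features": sorted(expected - seen),
        "not_in_interactions": sorted(seen - expected),
        "duplicated_in_features": sorted(duplicates),
        "nan_ids": nan_count,
    }


report = diagnose_feature_ids(
    interaction_ids=["u1", "u2", "u2", "u3"],
    feature_ids=["u1", "u2", "u2", "u9"],
)
print(report)
```

If every list in the report is empty and nan_ids is 0, the id sets agree and the cause lies elsewhere.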

Citation?

Hi,

Thank you so much for this library. AFAIK it is the only FM lib with WARP loss.

I was thinking of using it, and I was wondering whether you have a source (paper) for the actual implementation you followed, or any special citing for it.

Thank you and keep up the great work!

Question: User/Item Interaction Features

This looks like a very promising library - congrats!

I am not familiar with the theory yet, but is it possible to include user/interaction features? For example, a typical use case is the amount of time elapsed since a product was last purchased.

NaNs leading to KeyError while comparing arrays during user_item_index vectors generation

KeyError Traceback (most recent call last)
in
----> 1 model.fit(interactions, user_features, item_features, sample_weight, epochs=50, verbose=True)

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
263
264 self._reset_state()
--> 265 self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
266
267

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in fit_partial(self, interactions, user_features, item_features, sample_weight, epochs, verbose)
287 self._init_features(user_features, item_features)
288 else:
--> 289 self._init_all(interactions, user_features, item_features, sample_weight)
290
291 # determine the number of negative samples to draw depending on the loss function

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_all(self, interactions, user_features, item_features, sample_weight)
133
134 # map the user/item features to internal index positions
--> 135 self._init_features(user_features, item_features)
136
137 # initialize the model weights after the user/item/feature dimensions have been established

~/.virtualenv/turicreate/lib/python3.8/site-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)

KeyError: 'the users in [user_features] do not match the users in [interactions]'

Capturing the loss function

Hi,

I would like to capture the loss as a function of epoch into an array. Currently, it is only possible to print it to stdout via the verbose=True argument of fit. Could the code be enhanced to allow the user to specify callback functions? Alternatively, could the C code return the loss values? Thanks.
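Until a hook exists, one possible workaround is to capture what verbose=True prints and parse it. The sketch below is an assumption-heavy illustration: the log line format ("log likelihood: <number>") is guessed, fake_fit is a stand-in for the real trainer, and contextlib.redirect_stdout only sees text written through Python's sys.stdout, so anything emitted directly from C (for example via printf) would not be captured.

```python
import contextlib
import io
import re

# Assumed log format: one line per epoch containing "log likelihood: <number>".
LOSS_PATTERN = re.compile(r"log likelihood:\s*(-?\d+(?:\.\d+)?)")


def parse_losses(text):
    """Extract the per-epoch values from captured training output."""
    return [float(match) for match in LOSS_PATTERN.findall(text)]


def capture_losses(train, *args, **kwargs):
    """Run train(*args, **kwargs) and return (result, values parsed from stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = train(*args, **kwargs)
    return result, parse_losses(buffer.getvalue())


def fake_fit(epochs):
    """Stand-in trainer that prints the way a verbose fit might (assumption)."""
    for epoch in range(epochs):
        print("training epoch:", epoch)
        print("log likelihood:", round(-1000.0 / (epoch + 1), 2))


_, losses = capture_losses(fake_fit, epochs=3)
print(losses)
```

With a real model the call would look like capture_losses(model.fit, interactions, epochs=50, verbose=True), adjusting LOSS_PATTERN to whatever the library actually prints.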

KeyError: 'the items in [item_features] do not match the items in [interactions]'

item_features_train = pd.get_dummies(train_interactions[['Items', 'moment']], columns=['moment'])

I am classifying my items into fast-, medium-, and slow-moving items.
For this I am using the "item_features" parameter.

It gives me this error:

model.fit(train_user_item, user_features=None, item_features=item_features_train, sample_weight=sample_weight_train, epochs=epochs, verbose=verbose)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 265, in fit
    self.fit_partial(interactions, user_features, item_features, sample_weight, epochs, verbose)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 289, in fit_partial
    self._init_all(interactions, user_features, item_features, sample_weight)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 135, in _init_all
    self._init_features(user_features, item_features)
  File "/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py", line 214, in _init_features
    raise KeyError('the items in [item_features] do not match the items in [interactions]')
KeyError: 'the items in [item_features] do not match the items in [interactions]'

Can someone help me with this?
I have also gone through the example notebooks and noticed that the Instacart example constructs the item features but never uses them. It would be helpful if those notebooks were updated.

will it work for third order categorical features interaction

Great code, thanks !

Please help me understand:

1. Will it work for third-order categorical feature interactions?
2. Will it run on a Windows computer?
3. Will it work for sparse data?

doubt on how to save and load model

1. Can you please explain how to save and load the best model?
2. Also, is there any way to parallelize (use multiprocessing for) the training/prediction part? I have observed that only one core of my machine is being used.
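For the first question, a common approach for Python objects is pickle. The sketch below shows the save/load round trip with a plain dict standing in for the fitted model. Whether a fitted RankFM object pickles cleanly is an assumption that has not been verified here, and save_model / load_model are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize any picklable object to disk."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A dict stands in for the fitted model in this sketch.
stand_in_model = {"factors": [0.1, 0.2, 0.3]}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(stand_in_model, path)
restored = load_model(path)
print(restored)
```

To keep the best model, one option is to call save_model only when the validation metric improves. Only unpickle files from sources you trust.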

Could you clarify whether you tested your code with auxiliary features?

In your blog post
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.

But auxiliary features are essential to what you have developed.

Have you since found data with auxiliary features?

What is the point of demonstrating your code on data without auxiliary features when you advertise auxiliary-feature support?

Could you clarify how your code works with its key advertised feature?

As written in
https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db

To overcome these limitations we need a more general model framework that can extend the latent factor approach to incorporate arbitrary auxiliary features, and specialized loss functions that directly optimize item rank-order using implicit feedback data. Enter Factorization Machines and Learning-to-Rank.

But you are testing your code on data without auxiliary features, as you wrote:
Unfortunately, there are no user auxiliary features to take advantage of with this data set.

What is the point of demonstrating your code on data without auxiliary features when you advertise auxiliary-feature support?

Cannot incorporate item_feature or user_feature in fit()

I tried to use item_feature in the fit() method but I got:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
212 self.x_if = np.ascontiguousarray(x_if.sort_index(), dtype=np.float32)
213 else:
--> 214 raise KeyError('the items in [item_features] do not match the items in [interactions]')
215 else:
216 self.x_if = np.zeros([len(self.item_idx), 1], dtype=np.float32)

KeyError: 'the items in [item_features] do not match the items in [interactions]'

and for adding user_feature, I got similar error:
/usr/local/lib/python3.7/dist-packages/rankfm/rankfm.py in _init_features(self, user_features, item_features)
200 self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
201 else:
--> 202 raise KeyError('the users in [user_features] do not match the users in [interactions]')
203 else:
204 self.x_uf = np.zeros([len(self.user_idx), 1], dtype=np.float32)

KeyError: 'the users in [user_features] do not match the users in [interactions]'

I double-checked my data and there are matching catalog_ids and user_ids in both training data and the feature data.

What could be the issue?

installation on windows 11 fails

sep13 N1\Fastapi multi replica>pip install rankfm
Collecting rankfm
Using cached rankfm-0.2.5.tar.gz (145 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy>=1.15 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.24.3)
Requirement already satisfied: pandas>=0.24 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from rankfm) (1.5.3)
Requirement already satisfied: pytz>=2020.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from pandas>=0.24->rankfm) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages (from python-dateutil>=2.8.1->pandas>=0.24->rankfm) (1.16.0)
Building wheels for collected packages: rankfm
Building wheel for rankfm (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
building extensions with pre-generated C source...
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rankfm
Running setup.py clean for rankfm
Failed to build rankfm
Installing collected packages: rankfm
Running setup.py install for rankfm ... error
error: subprocess-exited-with-error

× Running setup.py install for rankfm did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
building extensions with pre-generated C source...
running install
C:\my_py_environments\py310_env_flaml_aug1_2023\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\rankfm
copying rankfm\evaluation.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\rankfm.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\utils.py -> build\lib.win-amd64-cpython-310\rankfm
copying rankfm\__init__.py -> build\lib.win-amd64-cpython-310\rankfm
running build_ext
building 'rankfm._rankfm' extension
creating build\temp.win-amd64-cpython-310
creating build\temp.win-amd64-cpython-310\Release
creating build\temp.win-amd64-cpython-310\Release\rankfm
creating build\temp.win-amd64-cpython-310\Release\rankfm\mt19937ar
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\my_py_environments\py310_env_flaml_aug1_2023\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\include -IC:\Users\cde3\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-cpython-310\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> rankfm

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip

Unable to install on Windows

I've been trying to install RankFM on Windows but haven't been able to do so.
First, I tried installing Cygwin to get the GNU Compiler Collection on Windows (now I can use gcc from cmd), but during installation I received the message: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools". Then I installed the latest version of Microsoft Visual C++, and now it throws the following:

  Building wheel for rankfm (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      building extensions with pre-generated C source...
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.9
      creating build\lib.win-amd64-3.9\rankfm
      copying rankfm\evaluation.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\rankfm.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\utils.py -> build\lib.win-amd64-3.9\rankfm
      copying rankfm\__init__.py -> build\lib.win-amd64-3.9\rankfm
      running build_ext
      building 'rankfm._rankfm' extension
      creating build\temp.win-amd64-3.9
      creating build\temp.win-amd64-3.9\Release
      creating build\temp.win-amd64-3.9\Release\rankfm
      creating build\temp.win-amd64-3.9\Release\rankfm\mt19937ar
      "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -Ic:\users\makue\documents\rankfm\venv\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\include -IC:\Users\makue\AppData\Local\Programs\Python\Python39\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.31.31103\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.19041.0\\cppwinrt" /Tcrankfm/_rankfm.c /Fobuild\temp.win-amd64-3.9\Release\rankfm/_rankfm.obj -O2 -ffast-math -Wno-unused-function -Wno-uninitialized
      cl : Command line error D8021 : invalid numeric argument '/Wno-unused-function'
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.31.31103\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

Can you please tell me what I can do? Thank you
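The log shows GCC-style flags (-O2 -ffast-math -Wno-unused-function -Wno-uninitialized) being handed to MSVC's cl.exe, which rejects -Wno-unused-function because it parses it as a /W warning-level option and expects a number. A build script could pick its flags by platform instead. The sketch below is not the project's actual setup.py; extra_compile_args is a hypothetical helper, the MSVC flags are illustrative counterparts, and it assumes that a Windows build uses MSVC rather than MinGW.

```python
import sys

# Flags taken from the failing command line in the build log above.
GCC_STYLE_FLAGS = ["-O2", "-ffast-math", "-Wno-unused-function", "-Wno-uninitialized"]

# Illustrative MSVC counterparts: /O2 optimizes for speed, /fp:fast relaxes floating point.
MSVC_STYLE_FLAGS = ["/O2", "/fp:fast"]


def extra_compile_args(platform=None):
    """Return compiler flags suited to the platform's default compiler.

    Assumes MSVC on Windows (sys.platform == "win32") and a GCC-compatible
    compiler everywhere else.
    """
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return list(MSVC_STYLE_FLAGS)
    return list(GCC_STYLE_FLAGS)


print(extra_compile_args("win32"))
print(extra_compile_args("linux"))
```

The returned list would be passed as the extra_compile_args argument of the setuptools Extension that builds the C source, after cloning the repository and editing its build script locally.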

Cannot generate the normalized DCG

Hello,
I am having a great experience with your library.
For research purposes I need to compute the NDCG, but the library only provides the DCG.
Could you please add it, or give us a hint on how to compute it?
Thanks
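NDCG is the DCG of a ranking divided by the DCG of the ideal ranking for the same user, so it can be computed outside the library. The sketch below assumes binary relevance (an item is either in the user's held-out set or not) and the common 1 / log2(rank + 1) discount. The function names are hypothetical, and the library's own DCG function may use a different discount, so the two should not be mixed.

```python
import math


def dcg_at_k(recommended, relevant, k):
    """DCG with binary relevance: a hit at 1-based rank r adds 1 / log2(r + 1)."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )


def ndcg_at_k(recommended, relevant, k):
    """Divide DCG by the DCG of an ideal ranking that lists relevant items first."""
    ideal_hits = min(k, len(relevant))
    if ideal_hits == 0:
        return 0.0
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_at_k(recommended, relevant, k) / ideal


score = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
print(round(score, 4))
```

To get a model-level number, compute ndcg_at_k for each validation user from that user's top-k recommendations and held-out items, then average.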

link to instacart data from your blog is broken

The link to the data,
2017 Instacart Orders Data,
from your blog post

https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db
is broken.
Can we use this data instead?

https://www.kaggle.com/competitions/instacart-market-basket-analysis/data

can you summarize comparison results

https://github.com/etlundquist/rankfm/blob/master/examples/instacart.ipynb

Error while fit with 200k user_interaction matrix, item features and user features

I'm running the library on a virtual server with 64 GB of RAM.
My data consists of:
200k distinct interactions between users and items
a 52k x 11 user_feature matrix
a 2770 x 49 item_feature matrix
All NA values are replaced by 0.

When I try to run it, it gives me this error:
AssertionError: user factors [v_u] are not finite - try decreasing feature/sample_weight magnitudes
Sometimes it gives me the equivalent item factors error as well.

However, if I run on 170k user interactions without user_features and item_features, it runs smoothly.

What is the meaning of the error?
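The message says the learned factors overflowed to inf or NaN and suggests smaller feature magnitudes. One way to act on that suggestion is to rescale every feature column to a bounded range before fitting. The sketch below is a generic min-max scaler in plain Python. min_max_scale is a hypothetical helper, and whether scaling alone resolves this particular case is not something the sketch can confirm.

```python
import math


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns become 0.0.

    rows is a list of equal-length numeric lists (features only, no id column).
    """
    if not rows:
        return []
    for row in rows:
        if not all(math.isfinite(value) for value in row):
            raise ValueError("non-finite feature value found: %r" % (row,))
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - low for column, low in zip(columns, lows)]
    return [
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


features = [[18.0, 25000.0], [40.0, 90000.0], [62.0, 155000.0]]
scaled = min_max_scale(features)
print(scaled)
```

The finiteness check doubles as a quick test that no NaN or inf values slipped into the feature table.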

Suggestion for changing multiplier in _rankfm.pyx

First of all, thank you for providing good code. But I would like to suggest changing the multiplier in the WARP part
(line 269 in _rankfm.pyx)

from
multiplier = log((I - 1) / sampled) / log(I)

to
multiplier = log((I - items_user[u]) / sampled) / log(I)

From a mathematical viewpoint, the numerator should be the size of the population of j (in this case, the negative items for u). Since the population of j is the complement of user_items[u], I think it's better to change the numerator to I - items_user[u].
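To see how much the two formulas differ, they can be evaluated side by side. The sketch below reads items_user[u] as the number of items user u has interacted with, which is an assumption about the intent of the proposal. n_user_items and the function names are hypothetical.

```python
import math


def warp_multiplier_current(n_items, sampled):
    """Multiplier as quoted in the issue: log((I - 1) / sampled) / log(I)."""
    return math.log((n_items - 1) / sampled) / math.log(n_items)


def warp_multiplier_proposed(n_items, n_user_items, sampled):
    """Proposed variant: the numerator counts only the user's unobserved items."""
    return math.log((n_items - n_user_items) / sampled) / math.log(n_items)


n_items, n_user_items = 1000, 200
for sampled in (1, 10, 100):
    current = warp_multiplier_current(n_items, sampled)
    proposed = warp_multiplier_proposed(n_items, n_user_items, sampled)
    print(sampled, round(current, 4), round(proposed, 4))
```

The two agree when a user has a single observed item and diverge as the user's history grows relative to the catalog size.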

User and item features

RankFM has user and item features, and that is great. However, I have a use case with features that cannot be put into this nice form. Specifically, I have some user-item pairs that occur multiple times in my data set. For example, a user might purchase the same product on different days, or purchase multiple products. It does not appear that this use case can be handled by the current API. On the other hand, the algorithm found in the function compute_ui_utility should be able to handle this case with some modification. Until this morning, I was under the impression that each row of the user-product array had a series of features.

So my question is: can I modify the library to handle the use case where I have N features per row without requiring features at the user and item levels?

I hope I was clear. Thank you for any advice.

Convergence issues

I read in the documentation:

"If you run into overflow errors then decrease the feature and/or sample-weight magnitudes and try upping beta, especially if you have a small number of dense user-features and/or item-features."

I do not understand the meaning of decreasing the weight magnitudes. Currently, all weights are set to 1. Are you suggesting setting them to, say, 0.5, the same for all rows? But in that case, there will be no difference.

Default value of beta is set to 0.1. How high do you recommend raising it?

Thanks.

item_features are not actually used in the tutorial, and other questions.

Just a few things I'm not clear on after going through the Medium article (https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db) and reading the notebook here on GitHub.

The Medium article talks about using item / user features as part of a recommender system. The accompanying notebook creates item features based on the aisle number:

item_features
item_features_train
item_features_valid

and never uses them. Unless I missed something? I checked the docs as well. Should the notebook be using them somewhere?

Should the product_id be in the index of item_features_train? Because in the notebook it's just a regular column:

Anyway, it looks like item_features_train goes into model.fit()

model.fit(interactions_train, sample_weight=sample_weight_train, epochs=30, verbose=True, item_features=item_features_train)

I've compared the evaluation using item_features_train to the original model in the notebook which doesn't use item_features_train. The scores are slightly lower when I use item_features but not by much. Which makes me think I'm either doing something wrong (should product_id be in the index of item_features_train?), or these features are just not great and don't add anything other than noise.

The dataset also has department id as a potential feature. How would we use both aisle and department id in the same model? The item_features argument in model.fit() takes a single dummified dataframe. Do we use fit_partial to update the model with extra dataframes of user features or item features? Let's say I have the age, city, gender, and monthly income of my users. This would be four dataframes. How would I use them all?

What is item_features_valid used for? This is created in the notebook but never used.

There is another section where scores are generated:

scores = model.predict(interactions_valid, cold_start='nan')

print(scores)

array([-1.1237175 ,  0.00314923,  1.5434608 , ...,  2.029984  ,
        1.8916801 ,  2.7111115 ], dtype=float32)

What are these used for? They are not used in the notebook after generating them.
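On combining several feature sources: the documented format is a single table per side ([user_id, uf_1, … , uf_n] for users), and the source quoted in another issue on this page sets the first column as the index, which suggests the id belongs in a regular first column and all features sit side by side in the remaining columns. The sketch below builds such a wide table from several categorical and numeric fields in plain Python. one_hot_features is a hypothetical helper and the product records are made-up examples. In pandas, pd.get_dummies with several entries in columns produces the same kind of layout.

```python
def one_hot_features(records, id_key, categorical_keys, numeric_keys=()):
    """Build one wide table: a header row, then one row per id.

    records is a list of dicts with one dict per item (or user).
    """
    levels = {key: sorted({rec[key] for rec in records}) for key in categorical_keys}
    header = [id_key]
    header += ["%s_%s" % (key, level) for key in categorical_keys for level in levels[key]]
    header += list(numeric_keys)
    rows = []
    for rec in records:
        row = [rec[id_key]]
        for key in categorical_keys:
            row += [1.0 if rec[key] == level else 0.0 for level in levels[key]]
        row += [float(rec[key]) for key in numeric_keys]
        rows.append(row)
    return header, rows


# Made-up example records: aisle and department for three products.
products = [
    {"product_id": 1, "aisle_id": 24, "department_id": 4},
    {"product_id": 2, "aisle_id": 83, "department_id": 4},
    {"product_id": 3, "aisle_id": 24, "department_id": 16},
]
header, rows = one_hot_features(products, "product_id", ["aisle_id", "department_id"])
print(header)
for row in rows:
    print(row)
```

The same pattern would cover age, city, gender, and income for users: one row per user, with categorical fields one-hot encoded and numeric fields passed through (ideally rescaled), rather than four separate dataframes.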

Latent factor space representations of users and items

Is there a way to get the latent factor space representations of users and items?
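A possible route, based only on names that appear elsewhere on this page: the source quoted in another issue uses self.user_to_index to map raw ids to positions, and an error message mentions "user factors [v_u]". If the fitted model exposes both as attributes (an assumption, not a confirmed public API), a lookup could be sketched as below. user_factors is a hypothetical helper, and the stand-in object exists only so the example runs without the library.

```python
from types import SimpleNamespace


def user_factors(model, user_id):
    """Look up the latent vector for one raw user id.

    Assumes model.user_to_index maps raw ids to row positions and model.v_u
    holds one row of latent factors per user, in the same order.
    """
    return model.v_u[model.user_to_index[user_id]]


# Stand-in object with the assumed attributes, used only to exercise the helper.
stand_in = SimpleNamespace(
    user_to_index={"alice": 0, "bob": 1},
    v_u=[[0.1, -0.2], [0.3, 0.4]],
)
print(user_factors(stand_in, "bob"))
```

An analogous lookup for items would need the corresponding item-side attributes, whose names are not shown on this page.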

Is there a way to utilize all CPUs during training?

Thanks for your great library. One thing I noticed is that model training doesn't utilize all available CPUs on the machine, so the training process is very slow for larger datasets. Is there any parameter to pass to enable multi-CPU training?
