
Comments (21)

victle commented on June 15, 2024

I'm personally trying to learn more about neural networks, so I'd love to work on and contribute to this in pieces. I took a quick skim through the linked resources, and they also use t-SNE to visualize the books. I know you're looking to put together t-SNE for wikirec as well (#35), so in the future that could definitely be something I work on too. Let me know what you think is a good first step!

andrewtavis commented on June 15, 2024

For the neural network model, the question would be whether we implement the method from the blogpost directly, where it looks at the links to other Wikipedia pages; that would require the cleaning process to have an option to prepare the data in this way. Website URLs are being removed as of now, which maybe they shouldn't be? We could of course devise another method though :)

We already have pretrained NNs covered with BERT, so an NN approach that tries to create embeddings from scratch might be wasted effort, as that's a lot of computing to try to beat something that's explicitly trained on Wikipedia in the first place.

Another model that's popped up in the last few years is XLNet, which I guess would be the other natural implementation to look into.

Let me know what you think on this :)

victle commented on June 15, 2024

I think dealing with the links themselves might be a good first approach instead of training a whole new NN. I'll look into that blogpost and see if I can find how links are explicitly addressed, unless you had any thoughts on that?

I'll have to look into XLNet as it does look interesting! Though I am definitely lacking in terms of raw computing power.

andrewtavis commented on June 15, 2024

The original blogpost author finds the wikilinks in the data preparation steps, which are shown in this notebook. He's using wiki.filter_wikilinks() from mwparserfromhell, which we also have as a dependency. Basically, in his cleaning steps he's getting titles, the infobox information, the internal wikilinks, the external links, and a few other diagnostic features, whereas we're just getting titles and the texts themselves.

Implementing it his way would honestly be a huge change to the way the cleaning works, so maybe the best way to go about this is to give an option in wikirec.data_utils.clean where the websites would not be removed (might be best anyway), and we could use string methods or regular expressions to extract the links from the texts themselves. For this it'd basically be finding instances of https://en.wikipedia.org/wiki/Article_Name by matching the first part, and then we'd just need to extract Article Name and make sure there's a way to not have repeat entries.

Once we have those it's basically following the original blogpost :)

andrewtavis commented on June 15, 2024

For XLNet it looks like we'd be able to use sentence-transformers like the BERT implementation, so the big question then becomes what a suitable model would be. It'd be loading in a huggingface model, as sentence-transformers allows those models to be loaded along with its BERT models. I'm not sure if it's similar to BERT, where there are multilingual models and ones that are better for certain use cases than others.
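
For reference, a minimal sketch of how an XLNet checkpoint could be loaded through sentence-transformers' Transformer/Pooling modules; the xlnet-base-cased checkpoint and the sequence length are just assumptions here, any similar huggingface model would load the same way:

from sentence_transformers import SentenceTransformer, models

# Assumption: the base cased XLNet checkpoint from huggingface;
# a different checkpoint could be swapped in the same way.
word_embedding_model = models.Transformer("xlnet-base-cased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(["text of article one", "text of article two"])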

References for this are the XLNet documentation from huggingface and this issue so far.

victle commented on June 15, 2024

@andrewtavis Hey! Just wanted to give an update: I've been pretty busy, but I'm still wanting to work on this issue. I wanted some clarification on what we can do to incorporate these different cleaning methods. As it is now, we're just grabbing the text of the article, which includes the display text of the internal wikilinks (I don't think those are getting cleaned out, are they?). How would grabbing the wikilinks themselves substantially improve the "performance" of the recommendation models, given that the text from the links' names is already part of the inputs to the models?

andrewtavis commented on June 15, 2024

@victle, hey :) No worries on a bit of silence :)

The thing is that the URLs are being removed as of now. As seen at this line, the websites are being removed, but not the texts that they're the links for. I'm thinking now that this is a step that doesn't actually need to be included in the cleaning process. We could simply remove it, and then you could extract the internal Wikipedia links from the parsed text corpora.

Grabbing the links themselves would basically just make a new modeling approach. The assumption would shift from "I believe recommendations can be made based on which articles have similar texts" to "... which articles are linked to similar things." The second assumption is the one from the blog post, and he also got strong results, so we could implement that approach here as well :)

It kind of adds another layer of depth to a combined model as well. Right now we can combine BERT and TFIDF and get something that accounts for semantic similarity (BERT) and explicit word inclusions (TFIDF), both of which are working well as of now and even better when combined. This could give us a third strong-performing model that adds a degree of direct relatedness to other articles. To combine it with the others, the embeddings would need to be changed a bit, as his approach embeds all target articles plus everything they're linked to. It could be a situation where we simply remove rows and columns of the embeddings matrix based on indexing, though.

I checked our results against the ones he has in the blogpost. The most direct way this could help is with books that are historical in context. So far we've been picking fantasy novels for examples, which ultimately seem to perform well, as they have unique words and places that lead to books by the same author. As an example, here are his results for War and Peace:

Books closest to War and Peace:

Book: Anna Karenina               Similarity: 0.92
Book: The Master and Margarita    Similarity: 0.92
Book: Demons (Dostoevsky novel)   Similarity: 0.91
Book: The Idiot                   Similarity: 0.9
Book: Crime and Punishment        Similarity: 0.9

Our results for a combined TFIDF-BERT approach are:

[['Prince Serebrenni', 0.6551468483971056],
 ['Resurrection', 0.6545365970449271],
 ['History of the Russian State from Gostomysl to Timashev',
  0.6189035165863549],
 ['A Confession', 0.6068009160763292],
 ['August 1914', 0.5945890587900238],
 ['The Tale of Savva Grudtsyn', 0.5914113595267484],
 ['The Don Flows Home to the Sea', 0.5905929292423463],
 ['Anna Karenina', 0.5887544831477559],
 ['Special Assignments', 0.5798599047827274],
 ['Russia, Bolshevism, and the Versailles Peace', 0.5791726426273041]]

They're all classic Russian books, but his results are "better" in my opinion. We get similar results for The Fellowship of the Ring, as expected :)

Sorry for the wall of text 😱 Let me know what your thoughts are on the above!

victle commented on June 15, 2024

I want to try and summarize my understanding below:

After reading that first paragraph, this is my understanding. I'm using the beginning text of Prince Serebrenni as an example 😄 Before cleaning the raw text, you'll get something like "Prince Serebrenni (Russian: Князь Серебряный) is a historical novel by [[https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy]](Aleksey Konstantinovich Tolstoy)...". And through that line you referenced in wikirec.data_utils.clean, it's simply removing the link entirely, such that only "Aleksey Konstantinovich Tolstoy" is left?

So then, if we implement this third approach/model that looks at the internal links, we could combine that with TFIDF and BERT to make an overall stronger model (hopefully!). Although, I'm not knowledgeable enough to know how you would combine embeddings, so that might need more explanation 😅

Either way, it does sound interesting and challenging! I think I'll begin with removing that cleaning step, and replicating the approach from the blog post to extract the links. Does that sound like a reasonable starting step?

andrewtavis commented on June 15, 2024

Hey there :) Yes, your understanding is correct. From [[https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy]](Aleksey Konstantinovich Tolstoy) only Aleksey Konstantinovich Tolstoy will remain, and then that string will be tokenized, n-grams will be created such that we would likely get Aleksey_Konstantinovich_Tolstoy (while maintaining Aleksey Konstantinovich Tolstoy as well), then the tokens will be lowercased, and then common names are removed (these steps then lead to lemmatization or stemming). Common names are removed to make sure that we're not getting recommendations for things just because a character has the same name, but they're removed after the n-grams are created so that we still have, for example, harry_potter.
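
To make that ordering concrete, here's a toy illustration (not wikirec's actual cleaning code) of why the n-grams are built before the common-name removal, assuming a small stoplist of first names:

# Toy illustration only - not wikirec's actual cleaning code.
tokens = ["Harry", "Potter", "travels", "to", "Hogwarts"]
common_names = {"harry"}

# n-grams are built before any name removal so multi-word entities survive.
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
tokens = [t.lower() for t in tokens + bigrams]

# Standalone common names are dropped, but "harry_potter" is kept.
tokens = [t for t in tokens if t not in common_names]
# -> ['potter', 'travels', 'to', 'hogwarts', 'harry_potter', 'potter_travels', ...]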

I think there's a better way to do this that can still maintain removing the URLs (I think they're ultimately a lot of filler, and further will be nonsense once the punctuation's removed, so we'd end up with a lot of httpsenwikipediaorg). The way he gets the URLs is in the parsing phase, which is again this notebook. If you look at cell 40 as run, he seems to be just getting the article names that are then used as inputs for the model, so to get the links you honestly could do a regular expression search over the texts and grab everything between https://en.wikipedia.org/ and the next space, then convert the underscores to spaces and make sure they don't go through later cleaning steps. Those could then be saved as an optional output of the cleaning process for when someone wants the internal links, say by adding a get_wikilinks argument?

This is the simplest way I can think of to go about this, as there are all kinds of cleaning steps like removing punctuation and such that follow, which would also need to be accounted for. If you just put an if get_wikilinks: conditional right before the URLs are removed, we could then get them, return a third value, and maintain everything else as is :)
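
A rough sketch of what that conditional could wrap; the helper name, pattern, and list it appends to are illustrative assumptions rather than wikirec's actual API:

import re

# Illustrative helper only - the name, pattern, and get_wikilinks hook are
# assumptions, not wikirec's actual API.
def _extract_wikilinks(text):
    # Grab everything between the Wikipedia prefix and the next whitespace,
    # convert underscores to spaces, and deduplicate while keeping order.
    matches = re.findall(r"https://en\.wikipedia\.org/wiki/(\S+)", text)
    return list(dict.fromkeys(m.replace("_", " ") for m in matches))

# Inside the cleaning loop, right before the URLs are removed:
# if get_wikilinks:
#     article_wikilinks.append(_extract_wikilinks(text))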

Lemme know what you think!

victle commented on June 15, 2024

I was messing around with the wikirec.data_utils.clean() function, and I wanted to see exactly what websites were getting removed by the code lines you linked to. I put an image of the websites that get removed in the first 500 texts or so. It seems like in the cleaning process there really aren't any (perhaps none at all) internal wikilinks to other articles; I really only see links to images and such. Was this intended? Are the internal links to other articles getting removed in an earlier step? FYI, the texts I'm parsing only come from the enwiki_books.ndjson that's provided in the repository, so I didn't go through the data_utils.parse_to_ndjson function or anything.

[screenshot: URLs removed from the first 500 texts, mostly image and reference links]

Other than that, I'm a fan of the get_wikilinks = True argument. For now, I'll go ahead and work with the clean() function such that a separate list containing the internal wikilinks is returned 👍

andrewtavis commented on June 15, 2024

Very, very interesting, and sorry for putting you on the wrong track. Honestly, I last really referenced the parsing code years ago when I originally wrote the LDA version of this (it was a project for my master's), and hadn't considered that Wikipedia doesn't actually use URLs for internal links.

Referencing the source of Prince Serebrenni, "Aleksey Konstantinovich Tolstoy" is [[Aleksey Konstantinovich Tolstoy]], and the "1862" link is [[1862 in literature|1862]], i.e. double brackets indicate internal links, and there's a bar separator for when the displayed text isn't the name of the article (weird that it's different from what you're seeing). I had forgotten how advanced mwparserfromhell, which we're using for the parsing, is: wikicode.strip_code().strip() is just getting the texts without the internal links, which you then explicitly need to request via wikicode.filter_wikilinks(). The URLs that you're seeing look to be for references, which appear at the end of Wikipedia articles, but in the source these are actually found in the texts themselves and are moved to the bottom when displayed.
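
As a quick illustration of what that looks like with mwparserfromhell (the snippet of wikitext is just the example from above):

import mwparserfromhell

raw = "Prince Serebrenni is a historical novel by [[Aleksey Konstantinovich Tolstoy]], published in [[1862 in literature|1862]]."
wikicode = mwparserfromhell.parse(raw)

# The stripped text keeps only the display text of the internal links.
print(wikicode.strip_code().strip())
# -> Prince Serebrenni is a historical novel by Aleksey Konstantinovich Tolstoy, published in 1862.

# The link targets have to be requested explicitly.
print([str(link.title) for link in wikicode.filter_wikilinks()])
# -> ['Aleksey Konstantinovich Tolstoy', '1862 in literature']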

Again, sorry for the false lead. We'll need to get the links in the parsing phase, which in the long run makes this easier :) The main difference is that a third element will be added to the json files, so that we'll be able to do the following:

with open("./enwiki_books.ndjson", "r") as f:
    books = [json.loads(l) for l in f]

titles = [b[0] for b in books]
texts = [b[1] for b in books]
wikilinks = [b[2] for b in books]  # <- access it if you need it

So basically we just get an optional element that isn't even used if we're applying the current techniques. More to the point, we don't need to screw with the cleaning process as of now. It's something that should be looked at again in the future, as I think BERT could potentially benefit from even raw texts with very little processing, but let's check this later :) I can run this by some friends as well.

I will do an update tomorrow with the changes to wikirec.data_utils, which will include edits to _process_article (adding wikicode.filter_wikilinks() as a third returned value) and then propagating this change to iterate_and_parse_file and parse_to_ndjson. I'll then download the April dump, do a parse of it, and update the zip of enwiki_books.ndjson. From there we'll be good to get started on the NN model 😊

andrewtavis commented on June 15, 2024

I'll also do a parse and get us a copy of enwiki_books_stories_plays.ndjson so long as it's within the upload size limitations :)

victle commented on June 15, 2024

Cool! I'm glad we cleared that up, and that it's an easy fix. Let me know if there's something I can look into as well. I can keep reviewing the blogpost, as I imagine a lot of the methods and insights for the NN model will derive from that.

andrewtavis commented on June 15, 2024

Hey there :) Thanks for your offer to help on the parsing! It was literally just the line for wikicode.filter_wikilinks() being added and some other slight changes. That all went through with #43, and I've just updated the zip of enwiki_books.ndjson - it now has wikilinks at index 2. I decided to hold off on updating the Wikipedia dump for now (the parsing itself takes 4-5 hours), so we're still using the February one. We can update that when the examples get updated. For now I also decided against enwiki_books_stories_plays.ndjson, as the books alone are now almost 78MB, so even zipped it'll likely be more than the max file size of 100MB.

To keep track of this, the recent steps and those left are (with my estimates of time/difficulty):

  • Add wikilink parsing (2)
  • Update/upload enwiki_books.ndjson with wikilinks (3)
  • Implement updated wikilinks NN method in model.recommend (8)
  • Add model description to the readme (currently WIP) and docstrings (checking documentation) (2)
  • Add wikilinks NN method to examples (includes combination approaches) (5)
  • Add wikilinks NN results to the readme (1)
  • Add testing for wikilinks NN method (4)
  • Close this issue 🎉

Let me know what all from the above you'd have interest in, and I'll take the rest. Also of course let me know if you need support on anything. Would be happy to help 😊

victle commented on June 15, 2024

I'd love to talk more about breaking down the third task. And again, correct me if I'm not understanding 😅 In following the blogpost, we'd have to train a neural network (treating it as a supervised task) to generate an embedding between books and the internal wikilinks. Then, we can generate a similarity matrix based on this embedding for each book. However, how do we combine the recommendations based off this NN with those of TFIDF and BERT?

andrewtavis commented on June 15, 2024

Let's definitely break it down a bit more :) I just wanted to put out everything so there's a general roadmap, and I'd of course do the data uploads and testing (not sure if you have experience with or interest in unit tests).

Answering your question (as well as I can right now 😄), you're right in that we'll be combining similarities into a sim_matrix as with the other models, and we'd make sure that it's ordered by titles in the same way. The remaining thing from the blog post's approach is that the titles are modeled together with everything that they're linked to. We'd just need to make sure that anything that's not in selected_titles isn't included in the sim_matrix, and then as before we have matrices of equal dimension that can be weighted. We'd need to experiment with the weights from there, but that's the fun part, where we can see if there's any value in combining this kind of similarity with the others we already have to ultimately get something more representative than any of the models by themselves :)
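
In concrete terms, that trimming and weighting could look roughly like the sketch below; the variable names and weights are just assumptions for illustration, not existing wikirec code:

import numpy as np

def trim_and_combine(sim_links_full, all_titles, selected_titles,
                     sim_tfidf, sim_bert, weights=(0.4, 0.4, 0.2)):
    # Keep only the rows/columns of the wikilinks similarity matrix that
    # correspond to selected_titles, ordered the same way as the other models.
    idx = [all_titles.index(t) for t in selected_titles]
    sim_links = sim_links_full[np.ix_(idx, idx)]
    # With equal dimensions and ordering, a weighted combination is possible;
    # the weights are placeholders to experiment with.
    w_tfidf, w_bert, w_links = weights
    return w_tfidf * sim_tfidf + w_bert * sim_bert + w_links * sim_links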

Let me know what your thoughts are on the above! As far as breaking the task down, if you want to add something similar to the blogpost into model.gen_embeddings, then I could potentially work on getting model.gen_sim_matrix set up from the vector embeddings. Let me know what you'd be interested in doing and have the time for 😊

victle commented on June 15, 2024

I'm familiar with unit tests, but I wouldn't say I'm well-versed! Either way, what you've outlined makes sense. In terms of what I can do, I can start by building the architecture for the NN that will eventually generate the embeddings between titles and links. I'm interested in training the model myself, but we'll see if I have the computing power to do so in a reasonable manner 😅 To keep model.gen_embeddings clean, I might write a private function or something that will set everything up and train. We'll see!
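
For reference, a minimal Keras sketch of the kind of books-to-wikilinks embedding model the blogpost describes; the sizes, names, and training setup below are assumptions, and the (book index, link index) pairs with 0/1 labels (true links vs. negative samples) would be prepared separately:

from tensorflow.keras.layers import Dense, Dot, Embedding, Input, Reshape
from tensorflow.keras.models import Model

n_books, n_links, embedding_size = 10000, 40000, 50  # placeholder sizes

# Each (book, link) pair is classified as a true or negative-sampled pair;
# the learned book embeddings are what we keep for computing similarities.
book_in = Input(shape=(1,), name="book")
link_in = Input(shape=(1,), name="link")
book_emb = Embedding(n_books, embedding_size, name="book_embedding")(book_in)
link_emb = Embedding(n_links, embedding_size, name="link_embedding")(link_in)
dotted = Reshape((1,))(Dot(axes=2)([book_emb, link_emb]))
out = Dense(1, activation="sigmoid")(dotted)

model = Model(inputs=[book_in, link_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])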

andrewtavis commented on June 15, 2024

A private function on the side would be totally fine for this! All sounds great, and looking forward to it :)

In terms of computing power, have you ever used Google Colab? That might be a solution for this, as I don't remember the training being mentioned as too long in the blog post. Plus it's from 2016, when GPUs weren't as readily available as they are today (ML growth is nuts 😮). The big thing is that you do need to activate the GPU in the notebook, as it's not on by default. As stated in the examples for this, you'd go to Edit > Notebook settings > Hardware accelerator and select GPU.

I used Colab for some university projects, and it's built with Keras in mind. You'd have 24 hours or so of GPU time before the kernel restarts, which hopefully would be enough. If it's not, just lower the parameters and send along something that works, and I'm happy to make my computer wheeze a bit for the full run 😄
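
If it helps, a quick way to confirm the GPU is actually active from within the notebook (assuming TensorFlow, which Keras sits on):

import tensorflow as tf

# Should list at least one GPU device once the hardware accelerator is enabled.
print(tf.config.list_physical_devices("GPU"))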

andrewtavis commented on June 15, 2024

@victle, what do you think about combining model.gen_embeddings and model.gen_sim_matrix? The idea would be to make gen_embeddings a private function, that way the recommendation process would just be gen_sim_matrix followed by recommend.

victle commented on June 15, 2024

@andrewtavis I actually like the modularity of two separate functions for computing the embeddings and then the similarity matrix. Someone might just be interested in the embeddings, or might want more customization in how the similarity matrices are computed. That might be a rare case, though! But I do see the benefit of making the recommendation process simpler. Plus, generating the similarity matrix is pretty simple after computing the embeddings, so it's like... why not? 😆

andrewtavis commented on June 15, 2024

@victle, if you like the modularity we can keep it as is :) I was kind of on the fence and wanted to check, but it makes sense that someone might just want the embeddings. Plus, if we keep it as is, it's less work 😄 Thanks for your input!
