andrewtavis / wikirec

Recommendation engine framework based on Wikipedia data

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
wikipedia recommender-system recommendation-engine python books lda bert bert-embeddings unsupervised-learning multilingual

wikirec's Introduction


    Recommendation engine framework based on Wikipedia data

    wikirec is a framework that allows users to parse Wikipedia in any language for entries of a given type and then seamlessly generate recommendations for the given content. Recommendations are based on unsupervised natural language processing over article texts, with ratings being leveraged to weigh inputs and indicate preferences. The goal is for wikirec to both refine and deploy models that provide accurate content recommendations with only open-source data.

    See the documentation for a full outline of the package including models and data preparation.


    Installation

    wikirec can be downloaded from PyPI via pip or sourced directly from this repository:

    pip install wikirec

    git clone https://github.com/andrewtavis/wikirec.git
    cd wikirec
    python setup.py install

    import wikirec

    Data

    wikirec.data_utils allows a user to download Wikipedia articles of a given topic including movies, TV shows, books, music, and countless other classes of information. These texts then serve as the basis to recommend similar content given an input of what the user is interested in.

    Article topics are derived from the infobox types found on Wikipedia articles. The article on infoboxes (and its translations) contains all the allowed arguments by which to subset the data. Simply passing "Infobox chosen_type" to the topics argument of data_utils.parse_to_ndjson() in the following example will subset all Wikipedia articles of the given type. Lists can also be passed if more than one topic is desired. For the English Wikipedia, wikirec also provides concise arguments for data that commonly serve as recommendation inputs, including books, songs, albums, movies, tv_series, and video_games, as well as various categories of people such as athletes, musicians, and authors (see data_utils.input_conversion_dict()).
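    These concise arguments can be checked directly by inspecting the conversion dictionary. The following is a minimal sketch; the no-argument call matches the reference above, but the returned structure is an assumption to verify against the documentation:

    from wikirec import data_utils

    # Sketch: list the shorthand topic arguments available for parsing
    conversion_dict = data_utils.input_conversion_dict()
    print(list(conversion_dict.keys()))  # e.g. "books", "movies", "tv_series", ...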

    Data processing in wikirec involves the following steps:

    Downloading and Parsing Articles

    Downloading and parsing Wikipedia articles is as simple as:

    from wikirec import data_utils
    
    # Downloads the most recent stable bz2 compressed English Wikipedia dump
    files = data_utils.download_wiki(language="en", target_dir="./enwiki_dump")
    
    # Produces an ndjson of all book articles on Wikipedia
    data_utils.parse_to_ndjson(
        topics="books",  # ["books", "short_stories", "plays"]
        output_path="./enwiki_books.ndjson",
        input_dir="./enwiki_dump",
        limit=None,  # articles per file to find
        multicore=True,
        verbose=True,
    )

    The examples directory has a compressed copy of enwiki_books.ndjson for testing purposes.

    Cleaning Parsed Articles

    wikirec.data_utils also provides a standardized multilingual cleaning process for the parsed articles. See wikirec.languages for a full breakdown of what is available for each language. Generating a clean text corpus is achieved through the following:

    import json
    
    with open("./enwiki_books.ndjson", "r") as f:
        books = [json.loads(l) for l in f]
    
    titles = [b[0] for b in books]
    texts = [b[1] for b in books]
    wikilinks = [b[2] for b in books]  # internal wikipedia links for WikilinkNN method
    
    text_corpus, selected_idxs = data_utils.clean(
        texts=texts,
        language="en",
        min_token_freq=5,  # 0 for Bert
        min_token_len=3,  # 0 for Bert
        min_tokens=50,
        max_token_index=-1,
        remove_stopwords=True,  # False for Bert
        verbose=True,
    )
    
    selected_titles = [titles[i] for i in selected_idxs]

    From here text_corpus would be used to derive article similarities that are then used to make recommendations for any title found in selected_titles.

    Methods

    Recommendations in wikirec are generated from similarity matrices derived from trained model embeddings. Implemented NLP modeling methods within wikirec.model include:

    BERT

    Bidirectional Encoder Representations from Transformers derives representations of words based on NLP models run over open-source Wikipedia data. These representations are leveraged to derive article similarities that are then used to deliver recommendations.

    wikirec uses sentence-transformers pretrained models. See their GitHub and documentation for the available models.

    from wikirec import model
    
    # Remove n-grams for BERT training
    corpus_no_ngrams = [
        " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
    ]
    
    # We can pass kwargs for sentence_transformers.SentenceTransformer.encode
    bert_embeddings = model.gen_embeddings(
            method="bert",
            corpus=corpus_no_ngrams,
            bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
            show_progress_bar=True,
            batch_size=32,
    )

    Doc2vec

    A generalization of Word2vec, Doc2vec is an NLP algorithm for deriving vector representations of documents from contextual word interrelations. These representations are then used as a baseline for recommendations.

    from wikirec import model
    
    # We can pass kwargs for gensim.models.doc2vec.Doc2Vec
    d2v_embeddings = model.gen_embeddings(
            method="doc2vec",
            corpus=text_corpus,
            vector_size=100,
            epochs=10,
            alpha=0.025,
    )

    LDA

    Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of wikirec, Wikipedia articles are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics. These topic-word relations are then used to determine article similarities and then make recommendations.

    from wikirec import model
    
    # We can pass kwargs for gensim.models.ldamulticore.LdaMulticore
    lda_embeddings = model.gen_embeddings(
            method="lda",
            corpus=text_corpus,  # automatically tokenized for LDA
            num_topics=50,
            passes=10,
            decay=0.5,
    )

    TFIDF

    Term Frequency Inverse Document Frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. In the case of wikirec, word importances are combined and compared to derive article similarities and thus provide recommendations.

    from wikirec import model
    
    # We can pass kwargs for sklearn.feature_extraction.text.TfidfVectorizer
    tfidf_embeddings = model.gen_embeddings(
            method="tfidf",
            corpus=text_corpus,
            max_features=None,
            norm="l2",
    )

    WikilinkNN

    Based on this Towards Data Science article, the wikilink neural network method makes the assumption that articles will be similar if they link to the same Wikipedia pages. A corpus of internal wikilinks per article is passed, and embeddings based on these internal references are then derived. Note that model hyperparameters are dramatically more important in this approach than in the others.

    from wikirec import model
    
    wikilink_embeddings = model.gen_embeddings(
            method="WikilinkNN",
            path_to_json="./enwiki_books.ndjson",  # json used instead of a corpus
            path_to_embedding_model="books_embedding_model.h5",
            embedding_size=75,
            epochs=20,
            verbose=True,
    )

    Recommendations

    After embeddings have been generated we can then create matrices that represent article-article cosine or euclidean similarities. These can then be sorted and selected from, with the recommendation process being as simple as the following:

    from wikirec import model
    
    sim_matrix = model.gen_sim_matrix(
        method="chosen_method",
        metric="cosine",  # euclidean
        embeddings=method_embeddings,
    )
    
    recs = model.recommend(
        inputs="title_or_list_of_titles",
        ratings=None,  # list of ints/floats between 0 and 10
        titles=selected_titles,
        sim_matrix=sim_matrix,
        metric="cosine",  # euclidean
        n=10,
    )

    Comparative Results

    • TFIDF generally outperformed all other NLP methods in terms of providing what the user would expect, with the results being all the more striking considering its runtime is by far the shortest.
    • The other strong-performing NLP model is BERT, as it does the best job of providing novel but sensible recommendations.
    • WikilinkNN also provides very sensible results, giving wikirec effective modeling options using different kinds of inputs.
    • LDA, with the second shortest runtime, provides novel recommendations along with what is expected, but recommends things that seem out of place more often than BERT.
    • Doc2vec performs very poorly in that most results are nonsense, and it further takes the longest to train.

    See examples/rec_books and examples/rec_movies for detailed demonstrations with model comparisons, as well as examples/rec_ratings for how to leverage user ratings. These notebooks can also be opened in Google Colab for direct experimentation.

    Sample recommendations for single and multiple inputs are found in the following sections:

    TFIDF

    Harry Potter and the Philosopher's Stone recommendations:
    [['Harry Potter and the Deathly Hallows', 0.6046299758645369],
     ['Harry Potter and the Chamber of Secrets', 0.6006421605504958],
     ['Harry Potter and the Order of the Phoenix', 0.5965340424789338],
     ['Harry Potter and the Goblet of Fire', 0.5569541701616842],
     ['Harry Potter and the Half-Blood Prince', 0.5525197546210491],
     ['The Magical Worlds of Harry Potter', 0.5328091662536486],
     ['Harry Potter and the Prisoner of Azkaban', 0.491142269221778],
     ['Harry, A History', 0.461521032636577],
     ['Fantastic Beasts and Where to Find Them', 0.458905951118587],
     ['Harry Potter and the Methods of Rationality', 0.45024337149870786]]
    
    The Hobbit recommendations:
    [['The History of The Hobbit', 0.7654956800395748],
     ['The Annotated Hobbit', 0.6429102504821168],
     ['The Lord of the Rings', 0.5373413608301959],
     ['The Road to Middle-Earth', 0.5306535049915708],
     ['The Letters of J. R. R. Tolkien', 0.48933976150601666],
     ['The Marvellous Land of Snergs', 0.48317913980292654],
     ['Mr. Bliss', 0.4803612654025307],
     ['J. R. R. Tolkien: A Biography', 0.4801418285780905],
     ['A Companion to J. R. R. Tolkien', 0.4668405235491576],
     ['Tolkien: A Look Behind "The Lord of the Rings"', 0.45164156724562365]]
    
    Harry Potter and the Philosopher's Stone and The Hobbit recommendations:
    [['The History of The Hobbit', 0.39710714157986077],
     ['The Annotated Hobbit', 0.3339037084669694],
     ['Harry Potter and the Chamber of Secrets', 0.32972850299980644],
     ['Harry Potter and the Deathly Hallows', 0.32760681591732854],
     ['Harry Potter and the Order of the Phoenix', 0.319444468511931],
     ['The Lord of the Rings', 0.3069697109614444],
     ['Harry Potter and the Half-Blood Prince', 0.3022894152745786],
     ['Harry Potter and the Goblet of Fire', 0.3019957448304001],
     ['The Magical Worlds of Harry Potter', 0.2996981871702149],
     ['The Road to Middle-Earth', 0.28697680264545045]]
    

    BERT

    Harry Potter and the Philosopher's Stone recommendations:
    [['The Magical Worlds of Harry Potter', 0.88391376],
     ['Harry Potter and the Chamber of Secrets', 0.8779844],
     ['Harry Potter and the Order of the Phoenix', 0.8671646],
     ['Harry Potter and the Prisoner of Azkaban', 0.85335326],
     ['Harry Potter and the Half-Blood Prince', 0.84942037],
     ['Harry Potter and the Goblet of Fire', 0.8481754],
     ['Year of the Griffin', 0.8280591],
     ['Magyk', 0.8277706],
     ['Harry Potter and the Deathly Hallows', 0.8257748],
     ['The Weirdstone of Brisingamen', 0.81287163]]
    
    The Hobbit recommendations:
    [['The Lord of the Rings', 0.8506559],
     ["The Shepherd's Crown", 0.84309],
     ['The War That Saved My Life', 0.8352962],
     ['The Foundling and Other Tales of Prydain', 0.8336451],
     ["The Inquisitor's Tale", 0.83097416],
     ['Ruby Holler', 0.8303863],
     ['Sam and Dave Dig a Hole', 0.82980216],
     ['Fattypuffs and Thinifers', 0.82704884],
     ['El Deafo', 0.8226619],
     ['Beast (Kennen novel)', 0.8221826]]
    
    Harry Potter and the Philosopher's Stone and The Hobbit recommendations:
    [['The Weirdstone of Brisingamen', 0.8108008205890656],
     ['The Magical Worlds of Harry Potter', 0.7868899703025818],
     ["The Golem's Eye", 0.7817798852920532],
     ['Harry Potter and the Prisoner of Azkaban', 0.7784444689750671],
     ['The Last Battle', 0.7773005664348602],
     ['Child Christopher and Goldilind the Fair', 0.776639997959137],
     ["The Inquisitor's Tale", 0.7743396461009979],
     ['Charmed Life (novel)', 0.7735742926597595],
     ['A Wizard of Earthsea', 0.7710956037044525],
     ["Conrad's Fate", 0.770046204328537]]
    

    WikilinkNN

    Harry Potter and the Philosopher's Stone recommendations:
    [['Harry Potter and the Chamber of Secrets', 0.9697026],
     ['Harry Potter and the Goblet of Fire', 0.969065],
     ['Harry Potter and the Deathly Hallows', 0.9685888],
     ['Harry Potter and the Half-Blood Prince', 0.9635748],
     ['Harry Potter and the Prisoner of Azkaban', 0.9569129],
     ['Harry Potter and the Order of the Phoenix', 0.94091964],
     ['Harry Potter and the Cursed Child', 0.9358928],
     ['My Immortal (fan fiction)', 0.91195196],
     ['Eragon', 0.89236057],
     ['Quidditch Through the Ages', 0.8891448]]
    
    The Hobbit recommendations:
    [['The Lord of the Rings', 0.94245297],
     ['The Silmarillion', 0.9160445],
     ['Beren and Lúthien', 0.90604335],
     ['The Fall of Gondolin', 0.9044683],
     ['The Children of Húrin', 0.895282],
     ['The Book of Lost Tales', 0.89020956],
     ['The Road to Middle-Earth', 0.88268256],
     ["The Magician's Nephew", 0.8816683],
     ['The History of The Hobbit', 0.87789804],
     ['Farmer Giles of Ham', 0.87786204]]
    
    Harry Potter and the Philosopher's Stone and The Hobbit recommendations:
    [['The Lord of the Rings', 0.8367433249950409],
     ['Harry Potter and the Deathly Hallows', 0.8294640183448792],
     ['The Children of Húrin', 0.8240831792354584],
     ['Harry Potter and the Prisoner of Azkaban', 0.8158660233020782],
     ['Harry Potter and the Goblet of Fire', 0.8150344789028168],
     ['Eragon', 0.8118217587471008],
     ['Harry Potter and the Chamber of Secrets', 0.8101150393486023],
     ['Fantastic Beasts and Where to Find Them', 0.8092647194862366],
     ['Watership Down', 0.8012698292732239],
     ['Harry Potter and the Half-Blood Prince', 0.7979166805744171]]
    
    

    Weighted Model Approach

    Better results can be achieved by combining TFIDF and BERT:

    tfidf_weight = 0.35
    bert_weight = 1.0 - tfidf_weight
    bert_tfidf_sim_matrix = tfidf_weight * tfidf_sim_matrix + bert_weight * bert_sim_matrix
    -- Weighted BERT and TFIDF --
    
    Harry Potter and the Philosopher's Stone recommendations:
    [['Harry Potter and the Chamber of Secrets', 0.7809146131987466],
     ['Harry Potter and the Order of the Phoenix', 0.7724439006273619],
     ['The Magical Worlds of Harry Potter', 0.7610271015260268],
     ['Harry Potter and the Deathly Hallows', 0.7483740864279236],
     ['Harry Potter and the Goblet of Fire', 0.746247955871592],
     ['Harry Potter and the Half-Blood Prince', 0.7455051626944851],
     ['Harry Potter and the Prisoner of Azkaban', 0.7265793668098672],
     ['Harry Potter and the Cursed Child', 0.6773072534713512],
     ['Harry, A History', 0.6772576164353141],
     ['Fantastic Beasts and Where to Find Them', 0.626084297475856]]
    
    The Hobbit recommendations:
    [['The Lord of the Rings', 0.7409957782467453],
     ['The History of The Hobbit', 0.7352996903587457],
     ['The Annotated Hobbit', 0.7135948210557342],
     ['The Marvellous Land of Snergs', 0.6838799880927064],
     ['The Road to Middle-Earth', 0.6447863856578011],
     ['The Silmarillion', 0.6445419659298917],
     ['A Companion to J. R. R. Tolkien', 0.6416663828729424],
     ['J. R. R. Tolkien: A Biography', 0.6347377961302614],
     ['The Children of Húrin', 0.6261937795502842],
     ['Mr. Bliss', 0.6217533139998945]]
    
    Harry Potter and the Philosopher's Stone and The Hobbit recommendations:
    [['The Magical Worlds of Harry Potter', 0.6163728193841632],
     ['Harry Potter and the Order of the Phoenix', 0.6098655072975429],
     ['Harry Potter and the Prisoner of Azkaban', 0.6026408288502743],
     ['Harry Potter and the Chamber of Secrets', 0.5966943180957163],
     ['Harry Potter and the Deathly Hallows', 0.5932562267661715],
     ['The Lord of the Rings', 0.5931736380571248],
     ['Harry Potter and the Half-Blood Prince', 0.5905134043157909],
     ['The Weirdstone of Brisingamen', 0.5620134317676433],
     ['Fantastic Beasts and Where to Find Them', 0.5594706076813922],
     ['Harry Potter and the Goblet of Fire', 0.556541219039868]]
    

    The WikilinkNN model can be combined with other models by subsetting the similarity matrix for titles derived in the cleaning process:

    import numpy as np

    wikilink_sims_copy = wikilink_sims.copy()
    not_selected_idxs = [i for i in range(len(titles)) if i not in selected_idxs]

    wikilink_sims_copy = np.delete(wikilink_sims_copy, not_selected_idxs, axis=0)
    wikilink_sims_copy = np.delete(wikilink_sims_copy, not_selected_idxs, axis=1)
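    The subset matrix can then be blended with another model's similarities just as TFIDF and BERT were combined above. The weight below is a hypothetical choice for illustration:

    wikilink_weight = 0.25  # hypothetical weight
    combined_sim_matrix = (
        wikilink_weight * wikilink_sims_copy
        + (1.0 - wikilink_weight) * bert_tfidf_sim_matrix
    )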

    Adding User Ratings

    The ratings argument of wikirec.model.recommend allows users to weight recommendations according to their interests:

    model.recommend(
        inputs=[
            "Harry Potter and the Philosopher's Stone",
            "The Hobbit",
            "The Hunger Games",
        ],
        ratings=None,  # averaging their similarity scores
        titles=selected_titles,
        sim_matrix=bert_tfidf_sim_matrix,  # weighted BERT and TFIDF embeddings
        n=20,
        metric="cosine",
    )
    -- Weighted BERT and TFIDF No Ratings --
    
    [['The Lord of the Rings', 0.8129448240195865],
     ['Harry Potter and the Order of the Phoenix', 0.8058152446026797],
     ['Harry Potter and the Half-Blood Prince', 0.7899741862008424],
     ['Harry Potter and the Prisoner of Azkaban', 0.7795265344828326],
     ['Harry Potter and the Deathly Hallows', 0.774902922811441],
     ['The Weirdstone of Brisingamen', 0.7718548190275306],
     ['The Magical Worlds of Harry Potter', 0.7691708768967348],
     ['Harry Potter and the Chamber of Secrets', 0.7669100258159494],
     ['Gregor and the Curse of the Warmbloods', 0.762141807244329],
     ['The Marvellous Land of Snergs', 0.7591230587892558],
     ['Mockingjay', 0.7585438304114571],
     ['Fantastic Beasts and Where to Find Them', 0.757280478510476],
     ['The Children of Húrin', 0.7570409672927969],
     ['The Book of Three', 0.7497114647690369],
     ['Harry Potter and the Goblet of Fire', 0.7414905999564945],
     ['The Bone Season', 0.7401901103966402],
     ['A Wrinkle in Time', 0.7392014390129485],
     ['A Wizard of Earthsea', 0.7337085913181924],
     ['The Casual Vacancy', 0.7306041913659236],
     ['Krindlekrax', 0.7301903731240345]]
    
    model.recommend(
        inputs=[
            "Harry Potter and the Philosopher's Stone",
            "The Hobbit",
            "The Hunger Games",
        ],
        ratings=[7, 6, 9],  # similarity scores weighted by ratings
        titles=selected_titles,
        sim_matrix=bert_tfidf_sim_matrix,  # weighted BERT and TFIDF embeddings
        n=20,
        metric="cosine",
    )
    -- Weighted BERT and TFIDF With Ratings --
    
    [['Mockingjay', 0.5847107027999907],
     ['Harry Potter and the Order of the Phoenix', 0.5846454899012506],
     ['The Lord of the Rings', 0.5758166462534925],
     ['Harry Potter and the Half-Blood Prince', 0.5677581220645922],
     ['Harry Potter and the Deathly Hallows', 0.5591667887082767],
     ['Harry Potter and the Prisoner of Azkaban', 0.5584267832698454],
     ['Catching Fire', 0.5582404750962344],
     ['Gregor and the Curse of the Warmbloods', 0.5527128074677247],
     ['Harry Potter and the Chamber of Secrets', 0.5524299731616052],
     ['The Weirdstone of Brisingamen', 0.5520358627555212],
     ['The Magical Worlds of Harry Potter', 0.5506942177737976],
     ['The Bone Season', 0.547984210564344],
     ['The Book of Three', 0.5459088891490478],
     ['Fantastic Beasts and Where to Find Them', 0.5443195045210549],
     ['The Marvellous Land of Snergs', 0.5398665287849369],
     ['A Wrinkle in Time', 0.5373739646822866],
     ['The Casual Vacancy', 0.5358385211606874],
     ['Harry Potter and the Goblet of Fire', 0.5346379229854734],
     ['The Children of Húrin', 0.5340832788476909],
     ['A Wizard of Earthsea', 0.5297755576425843]]
    

    To-Do

    Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or that could be implemented is outlined in the issues below.


    Powered By


    Wikipedia

    wikirec's People

    Contributors

    andrewtavis, bizzyvinci, dependabot[bot], imgbotapp, victle


    wikirec's Issues

    Add WikilinkNN Unwanted Links

    The WikilinkNN method currently best supports book recommendations in wikirec, as there are preset links that are removed via the following in wikirec.model._wikilink_nn:

    to_remove = [
        "hardcover",
        "paperback",
        "hardback",
        "e-book",
        "wikipedia:wikiproject books",
        "wikipedia:wikiproject novels",
    ]
    wikilinks = [item for item in wikilinks if item not in to_remove]

    It would be best if this could be adapted for other kinds of recommendation inputs. The style of input could potentially be passed to wikirec.model.gen_embeddings, but a discussion could also be had about other ways to derive which links should be removed.
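    A hedged sketch of one such adaptation, where the removal list becomes a caller-supplied argument (the function name and signature here are hypothetical, not part of wikirec):

    def filter_wikilinks(wikilinks, to_remove=None):
        """Drop format/meta links that carry no topical signal."""
        if to_remove is None:  # book-specific defaults, as currently hardcoded
            to_remove = ["hardcover", "paperback", "hardback", "e-book"]
        return [link for link in wikilinks if link not in to_remove]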

    Create concise requirement and env files

    This issue is for creating concise versions of requirements.txt and environment.yml for wikirec. It would be great if these files were created by hand with specific version numbers or generated in a way so that sub-dependencies don't always need to be updated.

    As of now both files are being created with the following commands in the package's conda virtual environment:

    pip list --format=freeze > requirements.txt  
    conda env export --no-builds | grep -v "^prefix: " > environment.yml

    wikirec, en-core-web-sm (spacy package that breaks tests), and other obviously unneeded packages are then removed from these files before being uploaded.
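    One possible approach, offered here as a suggestion rather than the project's current tooling, would be to hand-maintain only the top-level dependencies (e.g. in a requirements.in file) and let pip-tools pin everything else:

    pip install pip-tools
    pip-compile requirements.in --output-file requirements.txt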

    Any insights or help would be much appreciated!

    Devising ways to best combine recommendations

    This issue is to discuss ways to best combine vector embeddings so that a wikirec user can optimally pass more than one argument to wikirec.model.recommend.

    The current way of combining recommendations for more than one input is to simply take the arithmetic means of the similarity matrix rows for each passed title, which is depicted in the following snippet from wikirec.model.recommend:

    for i, t in enumerate(titles):
        if t == inpt:
            if first_input:
                sims = sim_matrix[i]
    
                first_input = False
    
            else:
                sims = [np.mean([s, sim_matrix[i][j]]) for j, s in enumerate(sims)]

    A discussion of whether this is the best way to do this would be much appreciated! Furthermore, how could the above be changed to allow a user to express disinterest (as discussed in #33)?
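    Note that for more than two inputs this running pairwise mean weights later titles more heavily than a true arithmetic mean would (for three inputs the effective weights are 1/4, 1/4, and 1/2). A vectorized true mean, sketched here under the assumption that sim_matrix is a NumPy array, could look like:

    import numpy as np

    # Average the similarity rows of all input titles in one step
    input_idxs = [titles.index(t) for t in inputs]
    sims = np.mean(sim_matrix[input_idxs], axis=0)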

    Add t-SNE to wikirec

    It would be helpful to be able to visualize the embeddings created by wikirec models, and one way to achieve this is t-SNE. This would allow model results to be visually compared to see how relationships are being derived.

    The Python package kwx has an implementation of t-SNE that could be adapted for this package, with another reference being the blog post that this package was originally based on (the Towards Data Science article linked above). Ideally this would be put into a visuals.py module, which would further be added to the documentation and tested using pytest's monkeypatch feature (see the tests for kwx for an example). Partial implementations are more than welcome though!
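    A minimal sketch of such a projection with scikit-learn, using the BERT embeddings from the examples above (the plotting choices are illustrative assumptions):

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Project article embeddings down to 2D for visual comparison of models
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    embeddings_2d = tsne.fit_transform(bert_embeddings)

    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=2)
    plt.show()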

    Please first indicate your interest in working on this, as it is a feature implementation :)

    Thanks for your interest in contributing!

    Add ability to change model results based on topic models

    A potential addition to wikirec would be allowing a user to change the recommendations based on topics. As of now this is only a sketch, but the general idea would be that topic coherences could be returned to the user with the words that define each topic, and the user could then indicate that they want results more in line with a topic by passing a word or n-gram along with a general score. A score of 0.5 could mean that topics including the passed word are not re-weighted, with numbers below or above implying that topic importances should be shifted based on the word's importance in them.

    This would allow a user to express interest in genres, or simply say that results should be more similar to those that are focused on a similar topic keyword. kwx could be looked to for topic keyword derivation in this case.

    Allowing users to express disinterest in model.recommend

    This issue is for discussing and potentially implementing a way for users to express disinterest in a title when calling wikirec.model.recommend. The general idea for now would be to allow users to pass a title with a negation indicator of some kind (e.g. "!title"), in which case the selections derived from the similarity matrix for the given item would be reversed.
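    A hypothetical sketch of such input parsing (the names and behavior here are illustrative assumptions, not an implementation in wikirec):

    def split_inputs(inputs):
        """Separate liked titles from negated ("!title") ones."""
        liked = [t for t in inputs if not t.startswith("!")]
        disliked = [t[1:] for t in inputs if t.startswith("!")]
        return liked, disliked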

    It would be great to know whether the above would be an intuitive UI, and an implementation would be welcome!

    New recommendation models

    Please use this issue to make suggestions for new models that could be added to wikirec. Suggestions would ideally include some of the following:

    • A blogpost or other source where the method is applied for related NLP tasks
    • A research paper that details the method and its potential applications
    • Source code for the method in Python or another language

    Estimates of the model's efficacy would also be appreciated so that a new good first issue can be made and prioritized.

    Thanks for any suggestions!

    Update gensim LDA to 4.x

    This issue is for discussing and eventually implementing an update of the gensim LDA implementations in wikirec. The package was originally written with 3.x versions of gensim, and 4.x versions apparently have some dramatic improvements in modeling options, efficiency, and n-gram creation (for wikirec.data_utils.clean). Changes would need to be made in wikirec.data_utils and wikirec.model.
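    As a hedged illustration of the scope (these renames come from the gensim 4.x migration notes, not from wikirec itself), 3.x-style attribute access would need updates along these lines:

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(vector_size=100, epochs=10)
    # gensim 4.x: model.dv replaces the 3.x model.docvecs attribute,
    # e.g. model.dv.most_similar([model.infer_vector(tokens)])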

    Documenting what would need to happen for the switch and then working towards implementing it would be very much appreciated :)

    Thanks for your interest in contributing!

    Implementing simple parsing arguments

    This issue is to discuss and implement keys for wikirec.data_utils.input_conversion_dict to make it easier for people to find valid arguments to parse Wikipedia articles using wikirec.data_utils.parse_to_ndjson. Rather than needing to search for the given Infobox topic, a user could instead simply query the keys of input_conversion_dict for the desired language and see what would be valid values to pass to the topics argument. Suggestions and pull requests are welcome for any language :)

    Thanks for your interest in contributing!

    Add Wikidata metadata

    One way to provide more data for wikirec would be to add metadata for a given article via its Wikidata page. This would change the manner in which the data is extracted, but article texts could also be derived via the Wikipedia pages that are linked to the Wikidata entity. Whether or not the project should shift to focus on Wikidata as a main data source could also be discussed, with tools like WikidataIntegrator being used to derive article categories and query the needed information.

    Add neural network model

    This issue is for adding an embeddings neural network implementation to wikirec. This package was originally based on the linked blog post, but the original model implementation has not yet been included. That original work and the provided code could serve as the basis for adding such a model to wikirec, which ideally would also be included in the documentation and tested. That model was based on analyzing the links between pages, which could serve as a basis for the wikirec version with modifications to wikirec.data_utils, or the model could focus on the article texts. Partial implementations are more than welcome though :)

    Please first indicate your interest in working on this, as it is a feature implementation :)

    Thanks for your interest in contributing!
