Prototype for Biffle, a recommendation engine for developer news
Components:
master-shell-script: Controller that runs the scripts below
profile-parse: Parse LinkedIn profiles and insert them into MongoDB.
SO-tag-download: Download Stack Overflow tags for users and add them to the user object
add-wordclouds: Build a word cloud from all of a user's tags and save it to that user's object
search-terms-mongo: Import a data file into MongoDB that contains all of the terms that Biffle 'understands' (currently 100 Big Data database names)
search-gen-for-articles: Generate PHP files based on terms that Biffle understands
parse-and-download: Download and parse news articles and websites.
make-recommendations: Make article recommendations. Currently ranks articles by Elasticsearch relevance score against all words in the user's word cloud (not just the 100 database names)
send-recommendations: Send recommendations to users via email
utils/3gram-keyword-dump: Dump all words in a user's word cloud
utils/add-tweets: Add tweets to a user's object
utils/SO-all-user-download: Download the entire Stack Overflow database of users and their email hashes
utils/technorati-scraper: Download URLs for 40,000+ tech blogs from Technorati
bifflescraper/*: Scrapy implementation of Biffle scraper tool
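The add-wordclouds step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a word cloud is stored as a mapping from normalized tag to occurrence count, and the field name "wc" and both function names are hypothetical.

```python
from collections import Counter


def build_wordcloud(tags):
    """Count how often each tag occurs, ignoring case and surrounding spaces."""
    normalized = (tag.strip().lower() for tag in tags if tag and tag.strip())
    return dict(Counter(normalized))


def attach_wordcloud(user, tags, field="wc"):
    """Return a copy of the user document with the word cloud attached.

    "wc" is a hypothetical field name; the README does not say where
    the word cloud is stored on the user object.
    """
    updated = dict(user)
    updated[field] = build_wordcloud(tags)
    return updated


if __name__ == "__main__":
    user = {"n": "example user", "k": ["Greenplum", "InfiniDB"]}
    print(attach_wordcloud(user, ["MongoDB", "mongodb ", "hbase"]))
```

Returning a copy rather than mutating the input keeps the sketch easy to test; a real script would more likely issue an update against the users collection.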
MongoDB collections:
articles
{
  "_id": MongoDB ID,
  "q": "big data mongodb health care",
  "sc": "score",
  "c": "code",
  "sd": "search date",
  "pubd": "publish date" (guessed date),
  "procd": "processed date",
  "url": "article url",
  "t": "article title",
  "abs": "summary text",
  "sr": "article source",
  "k": keyword list,
  "f": filename of downloaded full article,
  "m": metadata (retweets, etc.)
}
webpages
{
  "_id": MongoDB ID,
  "q": "query",
  "nr": "number of total results returned from search query",
  "url": "webpage url",
  "t": "webpage title",
  "md": "meta description content tag",
  "mk": "meta keywords",
  "abs": "webpage summary",
  "s": "webpage score",
  "v": "version??",
  "k": "keywords in webpage",
  "f": "file path on disk"
}
topics - Not Implemented (list of topics)
{
  "big data": ["mongodb", "hbase", "infiniDB", …],
  "cloud computing": ["sss", "sdfds"]
}
industries
{ "in": ["healthcare", "transportation", …] }
operations - Not Implemented
{ "op": ["deployment", "security", …] }
recommended_articles
{
  "_id": MongoDB ID,
  "uid": user id,
  "aid": article id,
  "rt": recommended_datetime,
  "uk": user_keywords_list,
  "pk": presented_keywords
}
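As a rough illustration of how make-recommendations might feed this collection, the sketch below builds an Elasticsearch match query from a user's word cloud terms and turns one search hit into a recommended_articles document. Several things here are assumptions rather than facts from the project: that articles are indexed with the same short field names ("abs", "k"), that "pk" holds the article keywords that overlap the user's keywords, and the function names themselves.

```python
from datetime import datetime, timezone


def build_query(wordcloud_terms, size=10):
    """Build an Elasticsearch match query over the article summary field.

    Assumes the index reuses the "abs" field name from the articles schema.
    """
    return {
        "size": size,
        "query": {"match": {"abs": " ".join(wordcloud_terms)}},
    }


def to_recommendation(uid, hit, user_keywords, now=None):
    """Convert one Elasticsearch hit into a recommended_articles document.

    "pk" is interpreted (an assumption) as the article keywords that
    also appear in the user's keyword list, compared case-insensitively.
    """
    article_keywords = hit.get("_source", {}).get("k", [])
    wanted = {keyword.lower() for keyword in user_keywords}
    presented = [k for k in article_keywords if k.lower() in wanted]
    return {
        "uid": uid,
        "aid": hit["_id"],
        "rt": now or datetime.now(timezone.utc),
        "uk": list(user_keywords),
        "pk": presented,
    }


if __name__ == "__main__":
    hit = {"_id": "a1", "_score": 1.2, "_source": {"k": ["MongoDB", "Spark"]}}
    print(build_query(["mongodb", "hbase"]))
    print(to_recommendation(123, hit, ["mongodb", "InfiniDB"]))
```

The query body would be passed to whatever Elasticsearch client the scripts use; no client call is shown because the README does not name one.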
recommended_webpages
{
  "_id": MongoDB ID,
  "uid": user id,
  "wid": webpage id,
  "rt": recommended_datetime,
  "uk": user_keywords_list,
  "pk": presented_keywords
}
user_clicks
{
  "_id": MongoDB ID,
  "uid": 123,
  "aid": article id (if article was clicked),
  "wid": webpage id (if webpage was clicked),
  "ad_url": url of ad (if ad was clicked),
  "ct": date/time of click
}
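The conditional fields above suggest that each click records one target. A minimal sketch of building such a document is shown here; the rule that exactly one of "aid", "wid" and "ad_url" is set is an assumption read from the "(if ... was clicked)" notes, and make_click is a hypothetical helper name.

```python
from datetime import datetime, timezone


def make_click(uid, aid=None, wid=None, ad_url=None, clicked_at=None):
    """Build a user_clicks document for a single click.

    Assumes exactly one of aid, wid, ad_url identifies what was clicked.
    """
    targets = {"aid": aid, "wid": wid, "ad_url": ad_url}
    present = {key: value for key, value in targets.items() if value is not None}
    if len(present) != 1:
        raise ValueError("exactly one of aid, wid, ad_url must be set")
    doc = {"uid": uid, "ct": clicked_at or datetime.now(timezone.utc)}
    doc.update(present)
    return doc


if __name__ == "__main__":
    print(make_click(123, aid="a1"))
```

Omitting the unused target fields, rather than storing them as null, keeps the documents small, though either choice fits the schema as written.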
users
{
  "_id": MongoDB ID,
  "lid": linkedin unique ID,
  "e": [email protected],
  "n": Aki Balogh,
  "ln": linkedin interests (pulled from profile summary, job summary and skills),
  "in": "computer software",
  "k": ["Greenplum", "InfiniDB"]
}
so_users
{
  "_id": MongoDB ID,
  "sid": StackOverflow ID,
  "dn": "akibalogh",
  "eh": "2dd0d3404eed2283b5307d16cec68896",
  "l": "Cambridge, MA",
  "w": "linkedin.com/in/akibalogh"
}
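One plausible way to link a users document to its so_users counterpart is through the email hash. The sketch below assumes "eh" is a Gravatar-style hash, that is, the MD5 hex digest of the trimmed, lowercased email address; the 32-character hex value above is consistent with that, but the README does not confirm it. Both function names are hypothetical.

```python
import hashlib


def email_hash(email):
    """Return the MD5 hex digest of a trimmed, lowercased email address.

    Assumes "eh" in so_users follows this Gravatar-style convention.
    """
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def find_so_user(user, so_users):
    """Return the first so_users document whose "eh" matches the user's email."""
    target = email_hash(user["e"])
    for so_user in so_users:
        if so_user.get("eh") == target:
            return so_user
    return None


if __name__ == "__main__":
    candidates = [{"sid": 1, "eh": email_hash("dev@example.com")}]
    print(find_so_user({"e": " Dev@Example.com "}, candidates))
```

In MongoDB the loop would normally be replaced by a lookup on an indexed "eh" field; the list scan is only there to keep the example self-contained.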
tech_blogs
{
  "_id": page number of blog on Technorati (e.g. '1' for http://technorati.com/blogs/directory/technology/page-1),
  "u": list of blog URLs on page
}
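The mapping between a directory page and its tech_blogs document can be sketched as below. The URL pattern comes from the example above; storing "_id" as a string, dropping duplicate URLs, and the helper names are assumptions made for illustration.

```python
# URL pattern taken from the README example for page 1.
DIRECTORY_URL = "http://technorati.com/blogs/directory/technology/page-{}"


def page_url(page):
    """Return the Technorati directory URL for a given page number."""
    return DIRECTORY_URL.format(page)


def make_page_doc(page, blog_urls):
    """Build a tech_blogs document keyed by page number.

    Duplicate URLs are dropped while preserving their original order.
    """
    unique_urls = list(dict.fromkeys(blog_urls))
    return {"_id": str(page), "u": unique_urls}


if __name__ == "__main__":
    print(page_url(1))
    print(make_page_doc(1, ["http://a.example", "http://a.example"]))
```

Keying documents by page number makes a re-run of the scraper overwrite a page's entry instead of duplicating it.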