crowdrec / idomaar Goto Github PK

CrowdRec reference framework

License: Apache License 2.0

Ruby 75.51% Shell 1.07% Java 2.79% Puppet 12.28% HTML 1.22% Pascal 0.22% Vim Script 1.20% JavaScript 0.25% Python 3.59% Makefile 0.01% CSS 0.72% Scala 0.12% Batchfile 0.04% Jupyter Notebook 0.97%

idomaar's Introduction

Idomaar

Idomaar (/i:dɒmæ(r)/) is the CrowdRec recommendation and evaluation reference framework.

At the highest abstraction level, Idomaar can be split into the following blocks:

the algorithms to test, both state-of-the-art algorithms and new solutions implemented within the CrowdRec project, e.g,. the algorithms developed in WP4. The algorithms are implemented within the computing environments
the evaluation logic, experimenting with the available algorithms in order to compute both quality (e.g., RMSE, recall) and system (e.g., execution and response time) metrics. Idomaar will include some evaluation policies, free to be extended. The evaluation logic is implemented by an orchestrator and an evaluator.
the data, i.e., the datasets made available to the practitioners (e.g., the MovieTweetings). Algorithms, evaluation logic, and data will be as decoupled as possible to allow to experiment with most existing solutions - no matter the technology they are implemented on - granting reproducible and consistent comparisons. The data are contained in the data container

See usage.md for installation instructions and usage. To use the HTTP interface, see usage-http.md for details.

idomaar's People

Contributors

Stargazers

Watchers

Forkers

egbertbouman elecmeow mindis malagoli chris3456 jart crpaun onecrazyrussian flyingcow1 mkilgus atlefren ajal88

idomaar's Issues

Documenting new Idomaar workflow with related components

To be documented in https://github.com/crowdrec/idomaar/wiki

First integration of Rival: PoC

Integrating Rival evaluator in the orchestrator

Recommendation result to fs

integrate into orchestrator a way to specify where recommendation results (the files that has to be elaborated by evaluator) has to be put.
local filesystem/hdfs/s3 has to be supported

moreover current flume agent (recommendation manager) config doesn't work for filesystem config

Data model

The data container contains all datasets available in the reference framework. A dataset will be represented by a set of entities and relations

An entity will have:

a type (e.g., movie, book, person),
an ID
a set of properties (e.g., the genre of the movie: "comedy")
a set of linked entities (e.g., the actor of the movie: the 'person' entity "Tom Cruise"), that can be seen as "particular" properties.

example
user 1001 [gender#{(male)},city#{(Barcelona)},startts#{(1394615113000)}] []
user 1002 [] []
movie 2001 [genres#{(drama),(action)},title#{(The aviator)}] [actor#{(person,3002),(person,3001)}]
movie 2002 [title#{(Titanic)},genres#{(comedy)},format#{(HD)}] [actor#{(person,3001),(person,3004)},director#{(person,3003)},alias#{(movie,2005)}]
movie 2005 [title#{(Titanic)},genres#{(comedy)},format#{(SD)}] [actor#{(person,3001),(person,3004)},director#{(person,3003)},alias#{(movie,2002)}]
movie 2007 [title#{(Talented Mr Ripley)}] [basedon#{(book,2101)}]
book 2101 [title#{(Talented Mr Ripley)},year#{(1955)}] [author#{(person,3005)}]
webpage 2201 [title#{(The Newyork Times)},url#{(http://www.nytimes.com)},genres#{(news)}] []
webpage 2202 [title#{(BBC)},url#{(http://www.bbc.co.uk)},genres#{(news)}] []
person 3001 [gender#{(male)},name#{(Leonardo Di Caprio)}] []
person 3002 [gender#{(female)},name#{(Cate Blanchett)}] []
person 3003 [gender#{(male)},name#{(James Cameron)}] []
person 3004 [gender#{(female)},name#{(Kate Winslet)}] []
person 3005 [gender#{(female)},name#{(Patricia Highsmith)}] []

A relation will have:

a type (e.g., view, ratings,...),
an ID
a timestamp
a set of properties (e.g., the location where the rating was given: "home")
a set of linked entities (e.g., the user who expressed the rating and the related item), that can be seen as "particular" properties.

example
WEBPAGE_VIEW 100001 1394616069000 [view_length_seconds#{(300)},subject_cookie#{(1001_1)},location#{(Barcelona),(home)}] [subject#{(user,1001)},object#{(item,2201)}]
WEBPAGE_VIEW 100002 1394616069000 [view_length_seconds#{(180)},subject_cookie#{(1002_1)}] [subject#{(user,1002)},object#{(item,2202)}]
RATING 100003 1394616069000 [rating#{(5)}] [subject#{(user,1001)},object#{(item,2001)}]
RATING 100004 1394616069000 [rating#{(3)}] [subject#{(user,1001)},object#{(item,2002)}]
RATING 100005 1394616069000 [rating#{(4.5)}] [subject#{(user,1002)},object#{(item,2002)}]

Orchestrator: recommendation manager sync

Now orchestrator stop the recommendation manager as soon as it receive the ok message from the computing env. the latter send the ok message as soon as it receive the EOF from the data topic.

The problem is that in this state the recommendation manager could not have finished to request recommendation.

We need to find a sync method that exit only if all the following:

recommendation manager flume agent has no more events to process
orchestrator has received the eof message

otherwise we should avoid to use the recommendaion manager in an async way but to request recommendation directly in the flume "test" agent

Orchestrator script mockup

depends on #2 and depends on #5

Script that create full project structure

Integrate the Spark based evaluator into the Idomaar workflow

Test.fm integration

Evaluate how to integrate test.fm into reference framework, the following task has to be integrated into the fmw:

the evaluator splits the dataset into test and training sets, generating the related files.
the evaluator sends the training set to train the algorithm: the evaluator writes on a configured location the files to be used as training set. The algorithm regularly checks this location for new files
the algorithm notifies the evaluator that the training is completed: the algorithm writes a new file in a configured location. The evaluator regularly checks this location
the evaluator sends the recommendation requests (e.g., a set of user to compute the TOP-10 recommendation for). the evaluator writes on a configured location the files containing the userids and the command to execute (e.g., GETTOPRANK).
The algorithm checks such folder and will compute the recommendations, saving the results in a configured path. The evaluator will reads such results and will be able to compute the quality metric.

Arise any doubt on interfaces and orchestration process (https://github.com/crowdrec/reference-framework/wiki/ORCHESTRATOR)

Simple implementation with MyMediaLite using MovieTweeeting dataset

Requirements:

Orchestrator: create configuration to avoid control message for simple test

In some situation where training is not needed (e.g. newsreel) we could avoid to implement the control messages and simply use the streaming of test/recommendation without any other control message.

This should be a run configuration for the orchestrator

Adapting MovieTweeting dataset and the converting script for the data format v4

Forward 8888 port error

When I start the orchestrator vm, I get the following error message:

There are errors in the configuration of this machine. Please fix
the following errors and try again:

vm:

Forwarded port '8888' (host port) is declared multiple times
with the protocol 'tcp'.

Suggestion: name change of reference framework

Not a coding-related issue..

Reference-framework is a really bad name for a project. No matter how much we discuss/publish about it it's going to remain "ungoogleable".

As I already wrote per email, I suggest the name to be changed. We can circulate a doc with potential names or similar.

Idomaar development setup

While using Kafka/Flume to stream data makes Idomaar capable of handling high-rate data flows, having to use all these components during development can be troublesome (slow startup, even components not affected by current changes are involved). It'd be nice to be able to develop Idomaar in a simplified setup.

VM does not terminate

After a successful run, the VM does not terminate.
(on OSX)

CrowdRec organization profile

This is not specifically related to the reference framework, rather to the CrowdRec organization.
Please create a proper profile including a link to the crowdrec website, e.g. see https://github.com/recsyschallenge/

Orchestrator: implement http protocol

all zeromq functions has to be present also for http

Template computing engine provisioning

Evaluator: create splitting process

Define configuration file for ochestrator execution

a json file should define:

concurrency for recommendation requests
stream speed (e.g. minimum msg/sec)
zeromq or http for messaging
deploy on local vagrant or aws

Investigating how to send input file "directly" through a serialization framework, and not using a shared folder

Currently input files are shared from the reference-framework to the computing environment by means of a shared folder.
This task consists of investigating alternative approaches (e.g., FileMQ?), especially to support streaming of data.

orchestrator.py crashes while trying to receive a message

While running the script "orchestrator.py" the script throws the following error and stops:

reading message
read message: READY
INFO: machine started
DO: read input
reading message
read message: OK
INFO: input correctly read
DO: train
reading message
read message: OK
INFO: recommender correctly trained
DO: recommend
reading message
read message: OK
INFO: recommendations correctly generated
Traceback (most recent call last):
File "orchestrator.py", line 167, in
orchestrator.run()
File "orchestrator.py", line 111, in run
recoms = self._socket.recv_multipart()
File "C:\Programs\Python27\lib\site-packages\zmq\sugar\socket.py", line 287, i
n recv_multipart
parts = [self.recv(flags, copy=copy, track=track)]
File "socket.pyx", line 628, in zmq.backend.cython.socket.Socket.recv (zmq\bac
kend\cython\socket.c:5616)
File "socket.pyx", line 662, in zmq.backend.cython.socket.Socket.recv (zmq\bac
kend\cython\socket.c:5436)
File "socket.pyx", line 139, in zmq.backend.cython.socket._recv_copy (zmq\back
end\cython\socket.c:1771)
File "checkrc.pxd", line 21, in zmq.backend.cython.checkrc._check_rc (zmq\back
end\cython\socket.c:6032)

zmq.error.ZMQError: Operation cannot be accomplished in current state

I think this is a bug according to several aspects:

The orchestrator script should not break when receiving a message.
In order to inform the orchestrator about the state of the application, undocumented multi-part messages are not the right approach. Simple message usually used for communicating between server and client are the better way.
In the stream-based evaluation there is nothing that must be sent in addition to the state. The use of multi-part messages does not make sense for me.

orchestrator: recommendation endpoint

Recommendation endpoint (both zeromq and http) has to be present in response to "TRAIN" message (the computing env has to provide it)

Evaluator: create examples of data splitting

Add description of the data split into training, test, and groundtruth (https://github.com/crowdrec/idomaar/wiki/Recommender-evaluation).
Add the groundtruth data format in the data format (https://github.com/crowdrec/idomaar/wiki/DATA-FORMAT), specifying that it is the same as the Idomaar format with the mandatory property "recid"

Re-running does not work

When re-running the orchestrator after a successful run, the orchestrator will not run again. This is the output (till ctrl^c):

$> sh orchestrator.sh 01.java/01.mahout/01.example/ 01.linux/01.centos/01.mahout/ 01.MovieTweetings/datasets/snapshots_10K/
reference framework base path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/..
algorithm base path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../algorithms
dataset path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../datasets
computing environment path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../computingenvironments
orchestrator path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../orchestrator
cleaning
DO: Update git REPOs
Already up-to-date.
Already up-to-date.
Already up-to-date.
DO: starting machine
Bringing machine 'mahout' up with 'virtualbox' provider...
==> mahout: Clearing any previously set forwarded ports...
==> mahout: Clearing any previously set network interfaces...
==> mahout: Preparing network interfaces based on configuration...
    mahout: Adapter 1: nat
==> mahout: Forwarding ports...
    mahout: 22 => 2222 (adapter 1)
==> mahout: Running 'pre-boot' VM customizations...
==> mahout: Booting VM...
==> mahout: Waiting for machine to boot. This may take a few minutes...
    mahout: SSH address: 127.0.0.1:2222
    mahout: SSH username: vagrant
    mahout: SSH auth method: private key
    mahout: Warning: Connection timeout. Retrying...
    mahout: Warning: Remote connection disconnect. Retrying...
==> mahout: Machine booted and ready!
==> mahout: Checking for guest additions in VM...
    mahout: The guest additions on this VM do not match the installed version of
    mahout: VirtualBox! In most cases this is fine, but in rare cases it can
    mahout: prevent things such as shared folders from working properly. If you see
    mahout: shared folder errors, please make sure the guest additions within the
    mahout: virtual machine match the version of VirtualBox you have installed on
    mahout: your host and reload your VM.
    mahout:
    mahout: Guest Additions Version: 4.3.2
    mahout: VirtualBox Version: 4.2
==> mahout: Setting hostname...
==> mahout: Mounting shared folders...
    mahout: /vagrant => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout
    mahout: /mnt/algo => /Users/alan/Documents/workspace/crowdrec/algorithms/01.java/01.mahout/01.example
    mahout: /mnt/data => /Users/alan/Documents/workspace/crowdrec/datasets/01.MovieTweetings/datasets/snapshots_10K
    mahout: /mnt/messaging => /private/tmp/messaging
    mahout: /tmp/vagrant-puppet-3/manifests => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout/puppet/manifests
    mahout: /tmp/vagrant-puppet-3/modules-0 => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout/puppet/modules
==> mahout: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> mahout: to force provisioning. Provisioners marked to run always will still run.
STATUS: waiting for machine to be ready
^C

To re-run, the virtual machine has to be stopped and removed manually.

Reference framework dissemination: template

Dissemination of the reference framework, starting from the example of Rival (Alan to share the template)

Reference framework dissemination: filling content

Depends on #15
Filling pages with content, see IBC paper as a reference/starting point

Integrate S3 source into idomaar flume source

Question: computing environment as part of algorithm

When instantiating an algorithm, e.g. MyMediaLite, it should be regulated somewhere that the computing environment should be linux/mono, and not linux/java. Meaning that the computing environment could be configured in the algorithm directly instead.

Comments?

Orchestrator: flume log

Orchestrator must catch any error from flume agents and stop the process

Verifying ZMQ-poc

Trying ZeroMQ PoC released by Viktor

Update documentation on github

Update documentation about:

input data model (depends on #1): see data model v.4 at https://drive.google.com/a/moviri.com/folderview?id=0B2qIQUxlCIn9Zk1XMjJrVTlCdE0&usp=sharing
output format: see output format v1 at https://drive.google.com/a/moviri.com/folderview?id=0B2qIQUxlCIn9eklXM19oY2s0ZDQ&usp=sharing
communication protocol (depends on #4): see protocol v1 at https://drive.google.com/a/moviri.com/folderview?id=0B2qIQUxlCIn9djZyN0R5ODVSNkE&usp=sharing

Recommendation manager: multiple concurrent executions

We should configure at orchestrator level how many concurrent agent can request recommendation simultaneosly.

Also the data has to be partitioned at kafka

RiVal addition to orchestrator

Incorporate from 9c6ab6d

test and prediction file formats are csv
user,item,value[,timestamp]

talk to @robertoturrin for details

Evaluator: create structure to support running python script

Orchestration process

The orchestrator is in charge of initiating the virtual machine, providing the training data (both model and recommendation) at the right time, requesting the recommendations, and eventually collecting the results to compute the evaluation metrics.

An example of workflow is:

the CE is booting (e.g., the OR started the virtual machine)
once booting is completed the CE notifies the OR and goes to ready status
the OR notifies the CE that a model training set is available and the CE starts retrieving the data
when data have been read, the CE notifies the OR and starts training the recommendation algorithm
when the CE has finished the training, it notifies the OR and goes again to ready status
the OR requests the CE to compute a recommendation and provide the related recommendation training set. The CE starts reading such data.
Once the recommendation training set has been retrieved, the CE notifies the OR and starts serving the OR requests (e.g., it generates recommendations for the specified user or users)
When all OR requests have been served, the CE notifies the OR and returns to ready status.

According to such protocol, the communication between the CE and the OR consists of a set of notifications (e.g., task completed or new data are ready) and data to transfer (e.g., the training set).
Data to transfer will be directly managed through files, i.e., a new file with the data to transfer will be made available in a shared path stored in the CE file system (so that the OR has permissions to access it).

As for the notification mechanism, the possible implementations are:

File. A notification simply consists of a file that it is added to a path (previously agreed between the CE and the OR). The components that is notifying has to write the file, the component that is notified has to regularly check the path for the new file. As an example, when the OR notifies the CE that the model training data is available, it writes a file F in a previously agreed directory D; the file F contains the path P where the training data have been stored. The CE periodically monitors the directory D and when it finds the new file F it starts retrieving the training data from P.
The implementation through files is straightforward and just requires simple file system’s commands.
Publish/subscribe messaging paradigm. The communication according to the publish/subscribe pattern consists of a publisher that sends messages (not to specific receivers) and a subscriber that expresses its interest to certain categories/classes of messages, that we refer to as channels. Thus, the CE will be subscribed to a certain “channel C”, and the OR will be subscribed to a certain “channel O”. When the CE wants to notify the OR, it publishes a message into channel O; the OR will be listening to such channel and will receive the message. Similarly, when the OR wants to notify the CE, it publishes a message into channel C, that is listened by the CE.
The implementation through the publish/subscribe paradigm is slightly less simple than the one through files. Many tools are available - such as Redis Pub/Sub and Apache ActiveMQ - with clients already implemented for most platforms.

Flume agent using memory channel cannot be shut down gracefully

https://issues.apache.org/jira/browse/FLUME-1318

Idomaar interceptor: recommendation timeout

Currently no timeout is present on recommendation request, this cause the freeze of flume recommendation agent if no recommendation agent is present or if a recommendation is not returned in time.

add timeout configuration option in interceptor and pass it from orchestrator

The item-based recommender cannot parse the dataset.

Trying the run the orchestrator example, it seems to me that the reference-framework example does not work because the item-based recommender (ItembasedRec_batch.java) does not support the current data format.

I checked out the reference framework from the repository and started an example run, using the command:
sh orchestrator.sh 01.java/01.mahout/01.example/ 01.linux/01.centos/01.mahout/ 01.MovieTweetings/datasets/snapshots_10K/

When debugging the program code, the recommender algorithms seems to cause parse errors while reading the dataset.

Project structure

I think the current project structure is a little unclear.
What I see is one project: Idomaar with three submodules (or sub projects), i.e. algorithms, evaluators, computing environments.
None of the latter three are standalone project, they won't run on their own (or will they?). Perhaps this should be reflected in how the project is structured, e.g. one 'main' module (idomaar), with several submodules (evluators, algs, ce's, etc.)

Test the Idomaar - Newsreel integration on a sample Newsreel recommendation engine

Orchestrator: integrate it into vm

Orchestrator must run in datastreamanager vm (we need to also change it name in orchestrator), so to avoid any external dependency

Quickstart

Update the site with documentation for:

download the release
install prerequisistes
explain the process
run a basic example
describe how to build a custom computing env
describe how to execute a custom evaluation logic
describe how to manage/build dataset

Orchestrator - computing environment data streaming

This issue is a continuation of an email thread about orchestrator - CE communication:

Hi Andras,
I agree with you, all your comment are correct, I would like to explain to you why we choose to use kafka.
The main idea is to use component that are "production ready" and that could be easily be scaled to manage "big" number of events.
Of course this introduce a lot of complexity in the development of the computing env (and also in the orchestrator stuff) but also give us the possibility to use very common objects (like kafka and zookeeper).

I also agree with you that zmq is already present and could be used instead of kafka, one idea is to have a restful API that receive events and data (this cover the newsreel integration too) so to have the developer to have the possibility of choosing web api or kafka without changing the rest of the orchestrator.

What do you think? can we continue the talk on a new github issue so to track our conclusion?

thanks,
Davide

On Tue, Mar 10, 2015 at 9:43 AM, András Serény [email protected] wrote:

Hi All,

On another note, although I don't see the full Idomaar picture yet, I find the Kafka/Flume setup complicated and unnecessary given the current state.

    Kafka is a persistent, highly available, distributed messaging service; the features come at the expense of a certain complexity. I don't think we really need either persistence or high availability for sending/receiving train and test data.
    We now a have a handful of services: Kafka, Zookeeper, Flume, recommendation manager. This is just too many moving parts (especially for a small application). A lot of things can go wrong (and they do) in these services that we have to deal with; this slows down development. It's difficult or impossible to install the components locally. It takes long to build a VM with all the components.
    On computing environment implementations, it places the burden of being able to connect to Zookeeper and read from Kafka.

In short, it's complicated and the benefits are unclear. Computing environments already have to have the ability to talk via ZMQ, so why not just send train/test data and recommendation requests via ZMQ ? This would make the whole setup much simpler.

Cheers,
András