
crowdrec / idomaar


CrowdRec reference framework

License: Apache License 2.0

Ruby 75.51% Shell 1.07% Java 2.79% Puppet 12.28% HTML 1.22% Pascal 0.22% Vim Script 1.20% JavaScript 0.25% Python 3.59% Makefile 0.01% CSS 0.72% Scala 0.12% Batchfile 0.04% Jupyter Notebook 0.97%

idomaar's Introduction

Idomaar

Idomaar (/i:dɒmæ(r)/) is the CrowdRec recommendation and evaluation reference framework.

At the highest abstraction level, Idomaar can be split into the following blocks:

  • the algorithms to test, both state-of-the-art algorithms and new solutions implemented within the CrowdRec project (e.g., the algorithms developed in WP4). The algorithms are implemented within the computing environments.
  • the evaluation logic, which experiments with the available algorithms in order to compute both quality (e.g., RMSE, recall) and system (e.g., execution and response time) metrics. Idomaar includes a set of evaluation policies that are free to be extended. The evaluation logic is implemented by an orchestrator and an evaluator.
  • the data, i.e., the datasets made available to practitioners (e.g., MovieTweetings). Algorithms, evaluation logic, and data are kept as decoupled as possible to allow experimentation with most existing solutions, no matter the technology they are implemented in, granting reproducible and consistent comparisons. The data are contained in the data container.

See usage.md for installation instructions and usage. To use the HTTP interface, see usage-http.md for details.

idomaar's People

Contributors

alansaid, andras-sereny, andreacondorelli, babakx, egbertbouman, malagoli, morellodev, vigsterkr


idomaar's Issues

Recommendation result to fs

Integrate into the orchestrator a way to specify where recommendation results (the files to be processed by the evaluator) have to be put.
Local filesystem/HDFS/S3 have to be supported.

Moreover, the current Flume agent (recommendation manager) config doesn't work for the filesystem configuration.

Data model

The data container contains all datasets available in the reference framework. A dataset will be represented by a set of entities and relations.

An entity will have:

  • a type (e.g., movie, book, person),
  • an ID
  • a set of properties (e.g., the genre of the movie: "comedy")
  • a set of linked entities (e.g., the actor of the movie: the 'person' entity "Tom Cruise"), that can be seen as "particular" properties.

example
user 1001 [gender#{(male)},city#{(Barcelona)},startts#{(1394615113000)}] []
user 1002 [] []
movie 2001 [genres#{(drama),(action)},title#{(The aviator)}] [actor#{(person,3002),(person,3001)}]
movie 2002 [title#{(Titanic)},genres#{(comedy)},format#{(HD)}] [actor#{(person,3001),(person,3004)},director#{(person,3003)},alias#{(movie,2005)}]
movie 2005 [title#{(Titanic)},genres#{(comedy)},format#{(SD)}] [actor#{(person,3001),(person,3004)},director#{(person,3003)},alias#{(movie,2002)}]
movie 2007 [title#{(Talented Mr Ripley)}] [basedon#{(book,2101)}]
book 2101 [title#{(Talented Mr Ripley)},year#{(1955)}] [author#{(person,3005)}]
webpage 2201 [title#{(The New York Times)},url#{(http://www.nytimes.com)},genres#{(news)}] []
webpage 2202 [title#{(BBC)},url#{(http://www.bbc.co.uk)},genres#{(news)}] []
person 3001 [gender#{(male)},name#{(Leonardo Di Caprio)}] []
person 3002 [gender#{(female)},name#{(Cate Blanchett)}] []
person 3003 [gender#{(male)},name#{(James Cameron)}] []
person 3004 [gender#{(female)},name#{(Kate Winslet)}] []
person 3005 [gender#{(female)},name#{(Patricia Highsmith)}] []
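For illustration, the entity lines above can be parsed with a short script. This is only a sketch inferred from the examples shown (the format has no formal grammar here, and the function names are ours, not part of Idomaar):

```python
import re

def parse_fields(block):
    """Parse 'key#{(v1),(v2)},key2#{(v)}' into a dict of value lists ({} if empty)."""
    fields = {}
    for key, values in re.findall(r'(\w+)#\{([^}]*)\}', block):
        fields[key] = re.findall(r'\(([^)]*)\)', values)
    return fields

def parse_entity(line):
    """Parse one entity line: '<type> <id> [properties] [linked entities]'."""
    m = re.match(r'(\S+)\s+(\S+)\s+\[(.*)\]\s+\[(.*)\]$', line)
    etype, eid, props, links = m.groups()
    return {
        'type': etype,
        'id': eid,
        'properties': parse_fields(props),
        # linked entities are (type, id) pairs, e.g. (person, 3001)
        'links': {k: [tuple(v.split(',')) for v in vs]
                  for k, vs in parse_fields(links).items()},
    }

entity = parse_entity('movie 2001 [genres#{(drama),(action)},title#{(The aviator)}] '
                      '[actor#{(person,3002),(person,3001)}]')
print(entity['properties']['genres'])   # ['drama', 'action']
print(entity['links']['actor'])         # [('person', '3002'), ('person', '3001')]
```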

A relation will have:

  • a type (e.g., view, ratings,...),
  • an ID
  • a timestamp
  • a set of properties (e.g., the location where the rating was given: "home")
  • a set of linked entities (e.g., the user who expressed the rating and the related item), that can be seen as "particular" properties.

example
WEBPAGE_VIEW 100001 1394616069000 [view_length_seconds#{(300)},subject_cookie#{(1001_1)},location#{(Barcelona),(home)}] [subject#{(user,1001)},object#{(item,2201)}]
WEBPAGE_VIEW 100002 1394616069000 [view_length_seconds#{(180)},subject_cookie#{(1002_1)}] [subject#{(user,1002)},object#{(item,2202)}]
RATING 100003 1394616069000 [rating#{(5)}] [subject#{(user,1001)},object#{(item,2001)}]
RATING 100004 1394616069000 [rating#{(3)}] [subject#{(user,1001)},object#{(item,2002)}]
RATING 100005 1394616069000 [rating#{(4.5)}] [subject#{(user,1002)},object#{(item,2002)}]
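A relation line can be handled the same way, with the extra timestamp field. A minimal sketch, assuming only the bracket syntax visible in the examples above (helper names are ours):

```python
import re

def parse_relation(line):
    """Parse one relation line: '<type> <id> <timestamp> [properties] [linked entities]'."""
    m = re.match(r'(\S+)\s+(\S+)\s+(\d+)\s+\[(.*)\]\s+\[(.*)\]$', line)
    rtype, rid, ts, props, links = m.groups()
    # 'key#{(v1),(v2)}' blocks -> dict of value lists
    fields = lambda b: {k: re.findall(r'\(([^)]*)\)', v)
                        for k, v in re.findall(r'(\w+)#\{([^}]*)\}', b)}
    return {'type': rtype, 'id': rid, 'timestamp': int(ts),
            'properties': fields(props),
            'links': {k: [tuple(v.split(',')) for v in vs]
                      for k, vs in fields(links).items()}}

rel = parse_relation('RATING 100003 1394616069000 [rating#{(5)}] '
                     '[subject#{(user,1001)},object#{(item,2001)}]')
print(rel['timestamp'])               # 1394616069000
print(rel['links']['subject'])        # [('user', '1001')]
```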

Orchestrator: recommendation manager sync

Currently the orchestrator stops the recommendation manager as soon as it receives the OK message from the computing environment; the latter sends the OK message as soon as it receives the EOF from the data topic.

The problem is that at this point the recommendation manager may not have finished requesting recommendations.

We need to find a sync method that exits only when all of the following hold:

  • the recommendation manager Flume agent has no more events to process
  • the orchestrator has received the EOF message

Otherwise, we should avoid using the recommendation manager asynchronously and instead request recommendations directly in the Flume "test" agent.
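The two exit conditions above can be sketched as a pair of events that must both be set before shutdown. The names are illustrative, not actual orchestrator code:

```python
import threading

# Hypothetical shutdown gate: the orchestrator may stop the recommendation
# manager only when BOTH conditions hold (names are ours, for illustration).
eof_received = threading.Event()     # set when the EOF message arrives
agent_drained = threading.Event()    # set when the Flume agent queue is empty

def safe_to_stop(timeout=None):
    """Block until both conditions hold; return False if either times out."""
    if not eof_received.wait(timeout):
        return False
    return agent_drained.wait(timeout)

# Simulated run: the two signals can arrive in any order.
eof_received.set()
agent_drained.set()
print(safe_to_stop(timeout=1))   # True
```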

Test.fm integration

Evaluate how to integrate test.fm into the reference framework; the following tasks have to be integrated into the framework:

  • the evaluator splits the dataset into test and training sets, generating the related files.
  • the evaluator sends the training set to train the algorithm: the evaluator writes to a configured location the files to be used as the training set. The algorithm regularly checks this location for new files.
  • the algorithm notifies the evaluator that training is completed: the algorithm writes a new file in a configured location. The evaluator regularly checks this location.
  • the evaluator sends the recommendation requests (e.g., a set of users to compute the TOP-10 recommendations for): the evaluator writes to a configured location the files containing the user IDs and the command to execute (e.g., GETTOPRANK).
  • the algorithm checks that folder, computes the recommendations, and saves the results to a configured path. The evaluator then reads the results and computes the quality metric.

Raise any doubts about interfaces and the orchestration process here (https://github.com/crowdrec/reference-framework/wiki/ORCHESTRATOR).
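The file-based handshake in these steps might be sketched as follows; the flag-file name, payload, and helper names are hypothetical, not the actual test.fm integration:

```python
import os
import tempfile
import time

def notify(directory, name, payload=''):
    """Notifying side: drop a flag file (optionally carrying a path) in the agreed dir."""
    with open(os.path.join(directory, name), 'w') as f:
        f.write(payload)

def wait_for(directory, name, poll=0.05, timeout=5.0):
    """Notified side: poll until the flag file appears; return its contents, or None."""
    deadline = time.time() + timeout
    path = os.path.join(directory, name)
    while time.time() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
        time.sleep(poll)
    return None

shared = tempfile.mkdtemp()                          # stands in for the agreed path
notify(shared, 'TRAINING_DONE', '/mnt/data/model')   # algorithm side
print(wait_for(shared, 'TRAINING_DONE'))             # evaluator side
```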

Forward 8888 port error

When I start the orchestrator vm, I get the following error message:

There are errors in the configuration of this machine. Please fix
the following errors and try again:

vm:

  • Forwarded port '8888' (host port) is declared multiple times
    with the protocol 'tcp'.

Suggestion: name change of reference framework

Not a coding-related issue.

Reference-framework is a really bad name for a project. No matter how much we discuss/publish about it, it's going to remain "ungoogleable".

As I already wrote via email, I suggest the name be changed. We can circulate a doc with potential names or similar.

Idomaar development setup

While using Kafka/Flume to stream data makes Idomaar capable of handling high-rate data flows, having to use all these components during development can be troublesome (slow startup, even components not affected by current changes are involved). It'd be nice to be able to develop Idomaar in a simplified setup.

orchestrator.py crashes while trying to receive a message

While running the script "orchestrator.py" the script throws the following error and stops:

reading message
read message: READY
INFO: machine started
DO: read input
reading message
read message: OK
INFO: input correctly read
DO: train
reading message
read message: OK
INFO: recommender correctly trained
DO: recommend
reading message
read message: OK
INFO: recommendations correctly generated
Traceback (most recent call last):
  File "orchestrator.py", line 167, in <module>
    orchestrator.run()
  File "orchestrator.py", line 111, in run
    recoms = self._socket.recv_multipart()
  File "C:\Programs\Python27\lib\site-packages\zmq\sugar\socket.py", line 287, in recv_multipart
    parts = [self.recv(flags, copy=copy, track=track)]
  File "socket.pyx", line 628, in zmq.backend.cython.socket.Socket.recv (zmq\backend\cython\socket.c:5616)
  File "socket.pyx", line 662, in zmq.backend.cython.socket.Socket.recv (zmq\backend\cython\socket.c:5436)
  File "socket.pyx", line 139, in zmq.backend.cython.socket._recv_copy (zmq\backend\cython\socket.c:1771)
  File "checkrc.pxd", line 21, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:6032)
zmq.error.ZMQError: Operation cannot be accomplished in current state

I think this is a bug, for several reasons:

  • The orchestrator script should not break when receiving a message.
  • Undocumented multi-part messages are not the right approach to inform the orchestrator about the state of the application. Simple messages, as commonly used for client-server communication, are the better way.
  • In the stream-based evaluation there is nothing that must be sent in addition to the state, so the use of multi-part messages does not make sense to me.
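For context, this ZMQError is characteristic of ZeroMQ's REQ/REP lockstep rule: a REQ socket must strictly alternate send and recv, and calling recv twice in a row raises exactly this error. A pure-Python toy model of that rule (not real pyzmq):

```python
# Toy model of a zmq REQ socket's state machine; "ZMQError" and the message
# mirror what pyzmq raises, but this class is ours, for illustration only.
class ZMQError(Exception):
    pass

class ReqSocketModel:
    def __init__(self):
        self._awaiting_reply = False

    def send(self, msg):
        if self._awaiting_reply:
            raise ZMQError('Operation cannot be accomplished in current state')
        self._awaiting_reply = True

    def recv(self):
        if not self._awaiting_reply:
            raise ZMQError('Operation cannot be accomplished in current state')
        self._awaiting_reply = False
        return b'OK'

sock = ReqSocketModel()
sock.send(b'RECOMMEND')
print(sock.recv())          # b'OK'
try:
    sock.recv()             # second recv with no send in between
except ZMQError as e:
    print(e)                # Operation cannot be accomplished in current state
```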

Re-running does not work

When re-running the orchestrator after a successful run, the orchestrator will not run again. This is the output (till ctrl^c):

$> sh orchestrator.sh 01.java/01.mahout/01.example/ 01.linux/01.centos/01.mahout/ 01.MovieTweetings/datasets/snapshots_10K/
reference framework base path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/..
algorithm base path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../algorithms
dataset path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../datasets
computing environment path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../../computingenvironments
orchestrator path: /Users/alan/Documents/workspace/crowdrec/reference-framework/orchestrator/../orchestrator
cleaning
DO: Update git REPOs
Already up-to-date.
Already up-to-date.
Already up-to-date.
DO: starting machine
Bringing machine 'mahout' up with 'virtualbox' provider...
==> mahout: Clearing any previously set forwarded ports...
==> mahout: Clearing any previously set network interfaces...
==> mahout: Preparing network interfaces based on configuration...
    mahout: Adapter 1: nat
==> mahout: Forwarding ports...
    mahout: 22 => 2222 (adapter 1)
==> mahout: Running 'pre-boot' VM customizations...
==> mahout: Booting VM...
==> mahout: Waiting for machine to boot. This may take a few minutes...
    mahout: SSH address: 127.0.0.1:2222
    mahout: SSH username: vagrant
    mahout: SSH auth method: private key
    mahout: Warning: Connection timeout. Retrying...
    mahout: Warning: Remote connection disconnect. Retrying...
==> mahout: Machine booted and ready!
==> mahout: Checking for guest additions in VM...
    mahout: The guest additions on this VM do not match the installed version of
    mahout: VirtualBox! In most cases this is fine, but in rare cases it can
    mahout: prevent things such as shared folders from working properly. If you see
    mahout: shared folder errors, please make sure the guest additions within the
    mahout: virtual machine match the version of VirtualBox you have installed on
    mahout: your host and reload your VM.
    mahout:
    mahout: Guest Additions Version: 4.3.2
    mahout: VirtualBox Version: 4.2
==> mahout: Setting hostname...
==> mahout: Mounting shared folders...
    mahout: /vagrant => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout
    mahout: /mnt/algo => /Users/alan/Documents/workspace/crowdrec/algorithms/01.java/01.mahout/01.example
    mahout: /mnt/data => /Users/alan/Documents/workspace/crowdrec/datasets/01.MovieTweetings/datasets/snapshots_10K
    mahout: /mnt/messaging => /private/tmp/messaging
    mahout: /tmp/vagrant-puppet-3/manifests => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout/puppet/manifests
    mahout: /tmp/vagrant-puppet-3/modules-0 => /Users/alan/Documents/workspace/crowdrec/computingenvironments/01.linux/01.centos/01.mahout/puppet/modules
==> mahout: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> mahout: flag to force provisioning. Provisioners marked to run always will still run.
STATUS: waiting for machine to be ready
^C

To re-run, the virtual machine has to be stopped and removed manually.

Question: computing environment as part of algorithm

When instantiating an algorithm, e.g. MyMediaLite, it should be specified somewhere that the computing environment has to be linux/mono, not linux/java. This means the computing environment could be configured directly in the algorithm instead.

Comments?

Orchestration process

The orchestrator is in charge of initiating the virtual machine, providing the training data (both model and recommendation) at the right time, requesting the recommendations, and eventually collecting the results to compute the evaluation metrics.

An example of workflow is:

  • the CE is booting (e.g., the OR started the virtual machine)
  • once booting is completed the CE notifies the OR and goes to ready status
  • the OR notifies the CE that a model training set is available and the CE starts retrieving the data
  • when data have been read, the CE notifies the OR and starts training the recommendation algorithm
  • when the CE has finished the training, it notifies the OR and goes again to ready status
  • the OR requests the CE to compute a recommendation and provides the related recommendation training set. The CE starts reading such data.
  • Once the recommendation training set has been retrieved, the CE notifies the OR and starts serving the OR requests (e.g., it generates recommendations for the specified user or users)
  • When all OR requests have been served, the CE notifies the OR and returns to ready status.

According to such protocol, the communication between the CE and the OR consists of a set of notifications (e.g., task completed or new data are ready) and data to transfer (e.g., the training set).
Data to transfer will be directly managed through files, i.e., a new file with the data to transfer will be made available in a shared path stored in the CE file system (so that the OR has permissions to access it).
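The workflow above can be read as a small state machine. A sketch with illustrative state and event names (not a normative protocol definition):

```python
# CE lifecycle from the workflow above; state/event names are ours.
TRANSITIONS = {
    ('BOOTING', 'boot_complete'): 'READY',
    ('READY', 'train_data_available'): 'READING_TRAIN',
    ('READING_TRAIN', 'data_read'): 'TRAINING',
    ('TRAINING', 'training_done'): 'READY',
    ('READY', 'recommend_requested'): 'READING_RECO',
    ('READING_RECO', 'data_read'): 'SERVING',
    ('SERVING', 'requests_served'): 'READY',
}

def step(state, event):
    """Advance the CE state; reject notifications that make no sense now."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f'event {event!r} not allowed in state {state!r}')

state = 'BOOTING'
for event in ('boot_complete', 'train_data_available', 'data_read',
              'training_done', 'recommend_requested', 'data_read',
              'requests_served'):
    state = step(state, event)
print(state)   # READY
```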

As for the notification mechanism, the possible implementations are:

  • File. A notification simply consists of a file that is added to a path (previously agreed between the CE and the OR). The component that notifies has to write the file; the component that is notified has to regularly check the path for the new file. As an example, when the OR notifies the CE that the model training data is available, it writes a file F in a previously agreed directory D; the file F contains the path P where the training data have been stored. The CE periodically monitors the directory D and, when it finds the new file F, it starts retrieving the training data from P.
    The implementation through files is straightforward and just requires simple file system commands.
  • Publish/subscribe messaging paradigm. Communication according to the publish/subscribe pattern consists of a publisher that sends messages (not to specific receivers) and a subscriber that expresses its interest in certain categories/classes of messages, which we refer to as channels. Thus, the CE will be subscribed to a certain "channel C", and the OR will be subscribed to a certain "channel O". When the CE wants to notify the OR, it publishes a message into channel O; the OR will be listening to that channel and will receive the message. Similarly, when the OR wants to notify the CE, it publishes a message into channel C, which is listened to by the CE.
    The implementation through the publish/subscribe paradigm is slightly less simple than the one through files. Many tools are available - such as Redis Pub/Sub and Apache ActiveMQ - with clients already implemented for most platforms.
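A toy in-memory model of the publish/subscribe option, using the channel names from the text (a real deployment would use e.g. Redis Pub/Sub or ActiveMQ, not this class):

```python
from collections import defaultdict

# Minimal in-process broker, for illustration only: the OR listens on
# channel "O", the CE on channel "C", as described above.
class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        # deliver to every subscriber of the channel, not to a specific receiver
        for callback in self._subscribers[channel]:
            callback(message)

broker = Broker()
received = []
broker.subscribe('O', received.append)          # the OR listens on channel O
broker.publish('O', 'TRAINING_COMPLETED')       # the CE notifies the OR
print(received)   # ['TRAINING_COMPLETED']
```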

Idomaar interceptor: recommendation timeout

Currently no timeout is set on recommendation requests; this causes the Flume recommendation agent to freeze if no recommendation agent is present or if a recommendation is not returned in time.

Add a timeout configuration option in the interceptor and pass it in from the orchestrator.
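One way the requested behaviour could look, as a hedged sketch (the function names are illustrative, and the timeout value would come from the orchestrator configuration, not be hard-coded):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def recommend_with_timeout(request_fn, timeout_seconds):
    """Run a recommendation request under a deadline instead of blocking forever."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(request_fn)
        try:
            return future.result(timeout=timeout_seconds)
        except TimeoutError:
            return None   # treat as "no recommendation"; don't freeze the agent

fast = lambda: ['item-42']                    # responds immediately
slow = lambda: time.sleep(0.5) or ['late']    # simulates a stuck recommender

print(recommend_with_timeout(fast, timeout_seconds=1))    # ['item-42']
print(recommend_with_timeout(slow, timeout_seconds=0.1))  # None
```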

The item-based recommender cannot parse the dataset.

Trying to run the orchestrator example, it seems to me that the reference-framework example does not work because the item-based recommender (ItembasedRec_batch.java) does not support the current data format.

I checked out the reference framework from the repository and started an example run, using the command:
sh orchestrator.sh 01.java/01.mahout/01.example/ 01.linux/01.centos/01.mahout/ 01.MovieTweetings/datasets/snapshots_10K/

When debugging the program code, the recommender algorithm seems to cause parse errors while reading the dataset.

Project structure

I think the current project structure is a little unclear.
What I see is one project, Idomaar, with three submodules (or subprojects): algorithms, evaluators, and computing environments.
None of the latter three are standalone projects; they won't run on their own (or will they?). Perhaps this should be reflected in how the project is structured, e.g. one 'main' module (idomaar) with several submodules (evaluators, algorithms, computing environments, etc.).

Quickstart

Update the site with documentation for:

  • download the release
  • install prerequisites
  • explain the process
  • run a basic example
  • describe how to build a custom computing env
  • describe how to execute a custom evaluation logic
  • describe how to manage/build datasets

Orchestrator - computing environment data streaming

This issue is a continuation of an email thread about orchestrator - CE communication:

Hi Andras,
I agree with you, all your comments are correct; I would like to explain why we chose to use Kafka.
The main idea is to use components that are "production ready" and that can easily be scaled to manage a "big" number of events.
Of course this introduces a lot of complexity in the development of the computing env (and also in the orchestrator stuff), but it also gives us the possibility to use very common components (like Kafka and Zookeeper).

I also agree with you that ZMQ is already present and could be used instead of Kafka. One idea is to have a RESTful API that receives events and data (this covers the NewsREEL integration too), so that developers have the possibility of choosing the web API or Kafka without changing the rest of the orchestrator.

What do you think? Can we continue the talk in a new GitHub issue so as to track our conclusions?

thanks,
Davide

On Tue, Mar 10, 2015 at 9:43 AM, András Serény [email protected] wrote:

Hi All,

On another note, although I don't see the full Idomaar picture yet, I find the Kafka/Flume setup complicated and unnecessary given the current state.

    Kafka is a persistent, highly available, distributed messaging service; the features come at the expense of a certain complexity. I don't think we really need either persistence or high availability for sending/receiving train and test data.
    We now have a handful of services: Kafka, Zookeeper, Flume, recommendation manager. This is just too many moving parts (especially for a small application). A lot of things can go wrong (and they do) in these services that we have to deal with; this slows down development. It's difficult or impossible to install the components locally. It takes a long time to build a VM with all the components.
    On computing environment implementations, it places the burden of being able to connect to Zookeeper and read from Kafka.

In short, it's complicated and the benefits are unclear. Computing environments already have to be able to talk via ZMQ, so why not just send train/test data and recommendation requests via ZMQ? This would make the whole setup much simpler.

Cheers,
András
