mesosphere / rendler

A rendering web crawler for Apache Mesos.

Shell 0.78% Makefile 0.50% C++ 17.73% JavaScript 0.38% Go 10.65% Haskell 10.98% Python 26.48% Scala 15.15% Java 17.36%
dcos-orchestration-guild dcos

rendler's Introduction

RENDLER ⁉️

A rendering web-crawler framework for Apache Mesos.

YES RENDLER

See the accompanying slides for more context.

RENDLER consists of three main components:

  • CrawlExecutor extends mesos.Executor
  • RenderExecutor extends mesos.Executor
  • RenderingCrawler extends mesos.Scheduler and launches tasks with the executors

Quick Start with Vagrant

Requirements

Start the mesos-demo VM

$ wget http://downloads.mesosphere.io/demo/mesos.box -O /tmp/mesos.box
$ vagrant box add --name mesos-demo /tmp/mesos.box
$ git clone https://github.com/mesosphere/RENDLER.git
$ cd RENDLER
$ vagrant up

Now that the VM is running, you can view the Mesos Web UI here: http://10.141.141.10:5050

You can see that 1 slave is registered and you've got some idle CPUs and Memory. So let's start the Rendler!

Run RENDLER in the mesos-demo VM

Check implementations of the RENDLER scheduler in the python, go, scala, and cpp directories. Run instructions are here:

Feel free to contribute your own!

Generating a pdf of your render graph output

With GraphViz (which dot) installed:

vagrant@mesos:hostfiles $ bin/make-pdf
Generating '/home/vagrant/hostfiles/result.pdf'

Open result.pdf in your favorite viewer to see the rendered result!
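
The contents of bin/make-pdf are not shown here, so the following is only a rough sketch of how a crawl graph could be turned into GraphViz input. It assumes the crawl results are available as (source, target) URL pairs; the function name `to_dot` and the graph name `crawl` are hypothetical.

```python
def to_dot(edges):
    """Render (source, target) URL pairs as a GraphViz digraph in DOT syntax."""
    def quote(text):
        # Escape backslashes and double quotes so the URL is a valid DOT string.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph crawl {"]
    for source, target in edges:
        lines.append("  {} -> {};".format(quote(source), quote(target)))
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned text to a file such as result.dot and running `dot -Tpdf result.dot -o result.pdf` is one way to produce a PDF from it.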

Sample Output

[Image: Sample Crawl]

Shutting down the mesos-demo VM

# Exit out of the VM
vagrant@mesos:hostfiles $ exit
# Stop the VM
$ vagrant halt
# To delete all traces of the vagrant machine
$ vagrant destroy

Rendler Architecture

Crawl Executor

  • Interprets incoming tasks' task.data field as a URL
  • Fetches the resource, extracts links from the document
  • Sends a framework message to the scheduler containing the crawl result.
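
The link-extraction step can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than whatever parser the repository's executors actually use; `LinkExtractor` and `extract_links` are hypothetical names, and fetching the page and sending the framework message are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/a" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs linked from the given HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned list corresponds to the `links` field of a crawl result.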

Render Executor

  • Interprets incoming tasks' task.data field as a URL
  • Fetches the resource, saves a png image to a location accessible to the scheduler.
  • Sends a framework message to the scheduler containing the render result.
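
As a rough sketch of the render step, the helper below builds a `phantomjs render.js <url> <destination>` command, the same shape as the call that appears in the render_executor.py traceback further down this page. Naming the PNG after a hash of the URL is an assumption made for illustration, as are the names `render_command` and `output_dir`.

```python
import hashlib


def render_command(url, output_dir):
    """Build the phantomjs command line and the PNG path for one URL."""
    # Assumption: a hash of the URL gives a filesystem-safe, unique file name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    destination = "{}/{}.png".format(output_dir.rstrip("/"), digest)
    command = ["phantomjs", "render.js", url, destination]
    return command, destination
```

The command list can be passed to `subprocess.call`, and the destination is what a render result would report as its image location.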

Intermediate Data Structures

We define some common data types to facilitate communication between the scheduler and the executors. Their default representation is JSON.

results.CrawlResult(
    "1234",                                 # taskId
    "http://foo.co",                        # url
    ["http://foo.co/a", "http://foo.co/b"]  # links
)
results.RenderResult(
    "1234",                                 # taskId
    "http://foo.co",                        # url
    "http://dl.mega.corp/foo.png"           # imageUrl
)
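
One possible shape for these types and their JSON form is sketched below. The field names are taken from the comments in the example above; the use of dataclasses and the exact JSON layout are assumptions, not necessarily what the repository's results module does.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CrawlResult:
    taskId: str
    url: str
    links: List[str]

    def to_json(self):
        # Serialize all fields as a flat JSON object.
        return json.dumps(asdict(self))


@dataclass
class RenderResult:
    taskId: str
    url: str
    imageUrl: str

    def to_json(self):
        return json.dumps(asdict(self))
```

An executor would send the `to_json()` string as the framework message payload, and the scheduler would decode it with `json.loads`.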

Rendler Scheduler

Data Structures

  • crawlQueue: list of urls
  • renderQueue: list of urls
  • processedURLs: set of urls
  • crawlResults: list of url tuples
  • renderResults: map of urls to imageUrls

Scheduler Behavior

The scheduler accepts one URL as a command-line parameter to seed the render and crawl queues.

  1. For each URL, create a task in both the render queue and the crawl queue.

  2. Upon receipt of a crawl result, add an element to the crawl results adjacency list. Append to the render and crawl queues each URL that is not present in the set of processed URLs. Add these enqueued urls to the set of processed URLs.

  3. Upon receipt of a render result, add an element to the render results map.

  4. The crawl and render queues are drained in FCFS order at a rate determined by the resource offer stream. When the queues are empty, the scheduler declines resource offers to make them available to other frameworks running on the cluster.
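
The bookkeeping described in these steps can be sketched without any Mesos dependency. The class below is only an illustration: `RendlerState` and its method names are hypothetical, marking the seed URL as processed up front is an assumption, and the real schedulers also have to translate resource offers into tasks.

```python
from collections import deque


class RendlerState:
    """Queues and result stores for the scheduler, minus the Mesos plumbing."""

    def __init__(self, seed_url):
        self.crawl_queue = deque([seed_url])
        self.render_queue = deque([seed_url])
        # Assumption: the seed counts as processed so it is never re-enqueued.
        self.processed_urls = {seed_url}
        self.crawl_results = []   # adjacency list of (url, link) tuples
        self.render_results = {}  # url -> imageUrl

    def on_crawl_result(self, url, links):
        for link in links:
            self.crawl_results.append((url, link))
            if link not in self.processed_urls:
                self.crawl_queue.append(link)
                self.render_queue.append(link)
                self.processed_urls.add(link)

    def on_render_result(self, url, image_url):
        self.render_results[url] = image_url

    def take_crawl(self, n):
        """Pop up to n URLs from the crawl queue in FCFS order."""
        return [self.crawl_queue.popleft() for _ in range(min(n, len(self.crawl_queue)))]

    def take_render(self, n):
        """Pop up to n URLs from the render queue in FCFS order."""
        return [self.render_queue.popleft() for _ in range(min(n, len(self.render_queue)))]

    def has_work(self):
        # When this is False the scheduler would decline offers.
        return bool(self.crawl_queue or self.render_queue)
```

Here `n` stands in for however many tasks the current resource offer can hold.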

rendler's People

Contributors

adam-mesos, air, aslanbakirov, connordoyle, cruhland, derekchiang, drexin, elingg, karya0, neilconway, nqn, rukletsov, sambatyon, sdeneefe, ssk2, swartzrock


rendler's Issues

Issue with mesos.native not being found.

I'm not sure how this happened. I followed the advanced cluster course on the mesosphere website and built out a four node cluster. I then decided to try installing RENDLER on it to get a feel for how a custom framework's internals work.

After cloning the repo down to my master node I tried executing the python script and was greeted with the following import error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named native

So I went looking to see what modules python is aware of. In the packages folder /usr/lib/python2.7/site-packages/mesos I see just two module folders:

  • cli
  • interface

I don't see a native folder. This module must not have been installed when I followed the advanced course to build out the cluster. I spent a bit of time trying to figure out the issue, then settled on installing the .egg by doing the following:

# visit https://open.mesosphere.com/downloads/mesos/
# find the latest Python egg for my OS
wget http://downloads.mesosphere.io/master/centos/7/mesos-0.26.0-py2.7-linux-x86_64.egg
sudo easy_install mesos-0.26.0-py2.7-linux-x86_64.egg
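
A quick way to confirm which modules are still missing after an install like this is to try importing each one. The helper below is a small sketch; the name `missing_modules` is hypothetical, and the module names in the usage note are the ones mentioned in this issue.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Raised when the module or one of its parent packages is absent.
            missing.append(name)
    return missing
```

For example, `missing_modules(["mesos.interface", "mesos.native"])` would return an empty list once both packages are importable.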

Python Documentation

It might be worth noting that you need a few things on your system to get this working for the Python example.

Python Modules:

You will receive this error if you try and run without installing a few modules.

  File "crawl_executor.py", line 25, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

Install the following:

sudo pip install wget
sudo pip install beautifulsoup4
sudo pip install html5lib
sudo yum install -y libxml2-devel
sudo yum install -y libxslt-devel
sudo yum install -y python-devel
sudo pip install lxml

PhantomJS:

You will get errors about PhantomJs like the following:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "render_executor.py", line 62, in run_task
    if call(["phantomjs", "render.js", url, destination]) != 0:
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

To resolve that, you need PhantomJS installed. If you can find a binary for your Linux distro, go with that; I used a binary I found for CentOS 7 here. Note that there are some issues with bundling binaries for PhantomJS; see the thread here. If you must build from source, follow the steps below; it can take an hour or so.

# needed to build phantomjs from source
sudo yum -y install gcc gcc-c++ make flex bison gperf ruby \
  openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel \
  libpng-devel libjpeg-devel

git clone --recurse-submodules https://github.com/ariya/phantomjs.git
cd phantomjs
./build.py
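
The OSError in the traceback above is what subprocess raises when the executable cannot be found. As a hedged sketch, a check like the one below could fail earlier with a clearer message. It assumes Python 3.3 or later for `shutil.which` (the traceback above is from Python 2.7), and `require_executable` is a hypothetical helper name.

```python
import shutil


def require_executable(name):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            "{} was not found on PATH; install it before starting the executor".format(name)
        )
    return path
```

Calling `require_executable("phantomjs")` once at startup would surface a missing install before any render task runs.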

Parser Warning on BS4:

Also, the Executor throws a nice warning about not explicitly specifying the parser for BS4, which appears to halt the script.

Executor registered on slave 586d51bc-408a-4191-bce7-8527a6c0f2f4-S0
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this (See PR #41):

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")
