Suite of tools for detecting changes in web pages and their rendering

Home Page: http://openplanets.github.io/pagelyzer

License: Apache License 2.0

Java 88.38% CSS 2.11% JavaScript 9.52%

Pagelyzer

Overview of projects

JKernelMachines: a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. This library is developed by David Picard; new versions can be found here: https://github.com/davidpicard/jkernelmachines
JDescriptors: a Java library for different visual descriptors such as SIFT and HSV. This library is developed at LIP6.
MarcAlizer: a supervised framework. It extracts features to build vectors for training and comparison, and computes a score from the similarities between vectors. It uses JKernelMachines and JDescriptors.
Pagelyzer: the main project. It takes the screenshots, performs the web page segmentation, and uses MarcAlizer to return the result.

Installing Dependencies

$ sudo apt-get install openjdk-7-jdk
$ sudo apt-get install xvfb   # optional

Pagelyzer opens a browser window to render the web pages (to take the screenshot and to do the segmentation). You can use Xvfb to run it headless. Xvfb is useful when no graphical user interface (GUI) is available on your system, or when you simply do not want the window to pop up. First start a virtual display, then run the jars on that display:

$ Xvfb :1 -screen 0 1024x768x24 &
$ DISPLAY=:1 java -jar ....
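
When the jar is launched from another Java program rather than from a shell, the same display binding can be set on the child process. The sketch below is only an illustration: the class name HeadlessLauncher and its method are hypothetical, and the jar name and arguments are the ones shown in the Comparison section. It builds the process without starting it, because Xvfb must already be running on :1 before start() is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pagelyzer.
class HeadlessLauncher {

    /** Builds (but does not start) a "java -jar" process bound to the given X display. */
    static ProcessBuilder onDisplay(String display, String jarPath, String... args) {
        List<String> command = new ArrayList<>(Arrays.asList("java", "-jar", jarPath));
        command.addAll(Arrays.asList(args));
        ProcessBuilder pb = new ProcessBuilder(command);
        // Same effect as prefixing the shell command with DISPLAY=:1
        pb.environment().put("DISPLAY", display);
        // Forward the child's stdout/stderr to this process
        pb.inheritIO();
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = onDisplay(":1",
                "Pagelyzer-0.0.1-SNAPSHOT-jar-with-dependencies.jar",
                "-url1", "http://www.lip6.fr",
                "-url2", "http://www.lip6.fr",
                "-config", "config.xml");
        System.out.println(pb.command());
        // To actually run it (Xvfb must be listening on :1):
        // int exit = pb.start().waitFor();
    }
}
```

Setting DISPLAY only on the child keeps the parent JVM's environment untouched, which matters if the parent itself needs a different display.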

This README explains how to use the different jar files built from this source code: https://github.com/openplanets/pagelyzer
Executable jars generated from the source code can be downloaded from http://scape.lip6.fr/Pagelyzer_all-jars.zip

Command-line Parameters

The Example_configFiles folder contains several configuration examples, and its README explains each tag. The SettingsFiles folder is required. After downloading it you may rename it, but do not rename its subfolders. The js and ext folders must always be in the same folder. Then update the subdir tag in the config file accordingly.

Training

If you want to define "what is similar" and "what is dissimilar" according to your needs, you can first train the system:

| parameter | description |
| --------------- | ----------- |
| config.xml | Path to the configuration file. Examples can be found here: https://github.com/openplanets/pagelyzer/tree/master/Example_configFiles |
| annotations.txt | Path to a file containing the annotated dataset used to train the system. This is where you describe which pairs of URLs are similar or dissimilar. Each line has the structure: URL1 \t URL2 \t ANNOTATION (0 = dissimilar, 1 = similar). |

$ java -cp Pagelyzer-libs.jar:PagelyzerTrain.jar pagelyzer.Train config.xml annotations.txt
$ java -jar ./target/PagelyzerTrain-0.0.1-SNAPSHOT-jar-with-dependencies.jar config.xml annotations.txt

This generates an output file and saves it at the location given by the "subdir" tag of the configuration file (config.xml). The file contains the information about the decision boundary and the SVM, and is used for comparison.
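
Because training depends on a well-formed annotations file, it can help to validate the file before starting a long run. The following is a minimal sketch, assuming one tab-separated record per line as described in this section. The class AnnotationFile and its Pair type are hypothetical names used for illustration and are not part of Pagelyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for the "URL1 \t URL2 \t ANNOTATION" training file.
class AnnotationFile {

    /** One annotated pair of URLs. */
    static final class Pair {
        final String url1;
        final String url2;
        final boolean similar;

        Pair(String url1, String url2, boolean similar) {
            this.url1 = url1;
            this.url2 = url2;
            this.similar = similar;
        }
    }

    /** Parses the lines of an annotations file; blank lines are skipped. */
    static List<Pair> parse(List<String> lines) {
        List<Pair> pairs = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
            }
            String label = fields[2].trim();
            if (!label.equals("0") && !label.equals("1")) {
                throw new IllegalArgumentException("Annotation must be 0 or 1: " + line);
            }
            pairs.add(new Pair(fields[0].trim(), fields[1].trim(), label.equals("1")));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Pair> pairs = parse(Arrays.asList(
                "http://www.lip6.fr\thttp://www.lip6.fr\t1"));
        System.out.println(pairs.size() + " annotated pair(s) read");
    }
}
```

To check a real file, the lines can be read with java.nio.file.Files.readAllLines and passed to parse.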

Comparison

We can compare the web pages as follows:

parameter	arguments	description
url1	URI	First URL
url2	URI	Second URL
config	path	path to the configuration file (config.xml)

$ java -jar Pagelyzer-0.0.1-SNAPSHOT-jar-with-dependencies.jar -url1 http://www.lip6.fr -url2 http://www.lip6.fr  -config  config.xml
$ java -cp Pagelyzer-libs.jar:Pagelyzer.jar  pagelyzer.JPagelyzer -url1 http://www.lip6.fr -url2  http://www.lip6.fr  -config config.xml

This returns a score between -1 and 1. A negative score means that the pages are dissimilar, and a positive score means that they are similar. The values are also ranked: two pages with a score of 0.9 are more similar than two pages with a score of 0.2. A score of 0 means that the system cannot decide whether the pages are similar or not. It means that the training dataset is too small or does not contain diverse enough examples. In that case, train the system again with a bigger dataset.
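
The sign rule above can be expressed in a few lines. This is a sketch only: ScoreInterpreter and Verdict are hypothetical names, and the code looks at the sign alone rather than enforcing the [-1, 1] range.

```java
// Hypothetical helper that maps a comparison score to a verdict.
class ScoreInterpreter {

    enum Verdict { SIMILAR, DISSIMILAR, UNDECIDED }

    /** Positive scores mean similar, negative mean dissimilar, zero means undecided. */
    static Verdict interpret(double score) {
        if (Double.isNaN(score)) {
            throw new IllegalArgumentException("Score is not a number");
        }
        if (score > 0) {
            return Verdict.SIMILAR;
        }
        if (score < 0) {
            return Verdict.DISSIMILAR;
        }
        return Verdict.UNDECIDED;
    }

    public static void main(String[] args) {
        System.out.println(interpret(0.9));
        System.out.println(interpret(-0.2));
        System.out.println(interpret(0.0));
    }
}
```

An UNDECIDED verdict is the signal to retrain with a larger, more diverse annotated dataset.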

Test

This section shows how to test a batch of URL pairs in a single run.

| parameter | description |
| ---------- | ----------- |
| test.txt | Path to a file containing the list of URL pairs you would like to test, one pair per line: URL1 \t URL2 |
| config.xml | Path to the configuration file (config.xml) |
| result.txt | Path to the file where the results will be saved, one line per pair: URL1 \t URL2 \t Score |

$ java -cp Pagelyzer-libs.jar:Pagelyzer-0.0.1-SNAPSHOT-tests.jar Test test.txt config.xml results.txt

Configuration file Details

Tag	Description	Possible Values
config:pagelyzer:run:default:parameter:get	String value that tells the system what to do	score: to return the score; screenshot: just to get screenshots; segmentation: just to do the segmentation; source: to save the HTML code
config:pagelyzer:run:default:parameter:browser	default browser	firefox; chrome; opera;
config:pagelyzer:run:default:parameter:browser1	Browser for the first URL	firefox; chrome; opera;
config:pagelyzer:run:default:parameter:browser2	Browser for the second URL	firefox; chrome; opera;
config:pagelyzer:run:default:parameter:outputfile	Path to save output for screenshot/source/segmentation	--
config:pagelyzer:run:default:comparison:mode	Describes the comparison mode	content; hybrid; image;
config:pagelyzer:run:default:comparison:subdir	Path to ext folder in the SettingsFiles	--
config:pagelyzer:run:internal:server:remote:url	URL of the remote web server hosting the JavaScript files injected by Selenium	--
config:pagelyzer:run:internal:server:local:ip	IP address used by Pagelyzer for its internal local web server	--
config:pagelyzer:run:internal:server:local:port	Port number used by Pagelyzer for its internal local web server	Default: 8016
config:pagelyzer:run:internal:server:local:wwwroot	Path used by Pagelyzer for its internal local web server	Default: Current dir
config:pagelyzer:run:debug:screenshots:active	Activate debugging mode	Boolean
config:pagelyzer:run:debug:screenshots:path	Path to store debugging files	--
config:pagelyzer:run:debug:screenshots:filepattern	How file should be named	page#{n}.png becomes page1.png for url1, page2.png for url2
config:selenium:run:mode	Use Webdriver class (local) or use remote instance (remote)	local, remote
config:selenium:server:url	URL for the remote instance	--
config:bom:granularity	Size of blocks for the web page segmentation	0-10
config:bom:separation	Threshold for separation of blocks for the web page segmentation	length in pixels
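
The filepattern row in the table above describes a simple placeholder substitution. As a hedged illustration (the class and method names are hypothetical and this is not Pagelyzer's own code), the expansion can be sketched as:

```java
// Hypothetical illustration of how a pattern such as "page#{n}.png" expands.
class ScreenshotNames {

    /** Replaces every "#{n}" placeholder in the pattern with the URL index. */
    static String expand(String pattern, int n) {
        return pattern.replace("#{n}", Integer.toString(n));
    }

    public static void main(String[] args) {
        System.out.println(expand("page#{n}.png", 1)); // file name for url1
        System.out.println(expand("page#{n}.png", 2)); // file name for url2
    }
}
```

A pattern without the placeholder is returned unchanged, so in this sketch both URLs would map to the same file name.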

pagelyzer's Issues

A timeout error crashed the execution

Hi,
Thanks for resolving the problem noted in issue 8.

We launched the Pagelyzer with this command :
$ ruby1.9.1 pagelyzer changedetection --urls=../fin.txt --output-file=../fout.txt --headless --output-folder=./first-1500/ --type=hybrid --url-archive

where "fin.txt" is a text file containing 1500 pairs of links, one pair per line: "url1 url2"

After processing 216 pairs, pagelyzer crashed and printed the following to the screen:

Capturing http://im1c6.internetmemory.org/tna/20110904155927/http://blogs.fco.gov.uk/roller/warren/ with local firefox
ERROR: Page load timeout execution expired
there were a problem in the capture of pages
ERROR: can't process these urls:
http://im1c6.internetmemory.org/tna/20110904155927/http://blogs.fco.gov.uk/roller/warren/
http://webarchive.nationalarchives.gov.uk/20110904155927/http://blogs.fco.gov.uk/roller/warren/
Closing browsers
Browser firefox rest open
Timeout: 30secs
Capturing http://im1c6.internetmemory.org/tna/20100512151544/http://www.huntinginquiry.gov.uk/mainsections/huntingframe.htm with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Timeout: 30secs
Capturing http://webarchive.nationalarchives.gov.uk/20100512151544/http://www.huntinginquiry.gov.uk/mainsections/huntingframe.htm with local firefox
Getting screenshot
Waiting page to finish loading...
done.
/1/crawl/hatem-tmp/pagelyzer/EUP-176/pagelyzer-ruby-0.9.1-standalone/lib/pagelyzer_analyzer.rb:282:in `start': undefined method `process_path' for [{id:0 pid: cand:["frame"]} chl:[]]:Array (NoMethodError)
from /1/crawl/hatem-tmp/pagelyzer/EUP-176/pagelyzer-ruby-0.9.1-standalone/bin/pagelyzer_changedetection:321:in `block in <main>'
from /1/crawl/hatem-tmp/pagelyzer/EUP-176/pagelyzer-ruby-0.9.1-standalone/bin/pagelyzer_changedetection:244:in `each'
from /1/crawl/hatem-tmp/pagelyzer/EUP-176/pagelyzer-ruby-0.9.1-standalone/bin/pagelyzer_changedetection:244:in `

And in the output file "fout.txt":
< test >
< url href="http://im1c6.internetmemory.org/tna/20110904155927/http://blogs.fco.gov.uk/roller/warren/" browser="firefox"/ >
< url href="http://webarchive.nationalarchives.gov.uk/20110904155927/http://blogs.fco.gov.uk/roller/warren/" browser="firefox"/ >
< score >ERROR Time out loading http://im1c6.internetmemory.org/tna/20110904155927/http://blogs.fco.gov.uk/roller/warren/ < /score >
< time >0< /time >
< /test >

Can you please tell us whether the problem comes from Selenium or from Pagelyzer, and how to avoid it in the future?

Thanks

Building & running jpagelyzer from source

Due to the error I mentioned in #10, I tried running the Maven version from source. After modifying the paths to the resources, I tried running pagelyzer/Maven/Pagelyzer/src/test/java/Test.java and got the following error.

Selenium: local WebDriver
Setting up browser: firefox
Attempt = 1
Setting up browser: firefox
Attempt = 1
getting data using driver: firefox
title: Accueil LIP6
getting data using driver: firefox
ERROR: Could not load -url2
Trying to reinitialize browser
org.openqa.selenium.WebDriverException: f.QueryInterface is not a function
Command duration or timeout: 45.03 seconds
Build info: version: '2.40.0', revision: '4c5c0568b004f67810ee41c459549aa4b09c651e', time: '2014-02-19 11:13:01'
System info: host: 'xxxxxxxxxx', ip: 'xxxxxx', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.9.2', java.version: '1.7.0_45'
Session ID: 3274e525-257a-0146-b9a4-ba50375461ce
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Capabilities [{platform=MAC, acceptSslCerts=true, javascriptEnabled=true, cssSelectorsEnabled=true, databaseEnabled=true, browserName=firefox, handlesAlerts=true, browserConnectionEnabled=true, webStorageEnabled=true, nativeEvents=false, rotatable=false, locationContextEnabled=true, applicationCacheEnabled=true, takesScreenshot=true, version=27.0.1}]
Attempt = 1
Exception in thread "main" java.lang.NullPointerException
    at pagelyzer.JPagelyzer.changeDetection(JPagelyzer.java:215)
    at Test.main(Test.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

I can see one empty browser window and one with http://www.lip6.fr/ have been triggered. Am I missing something?

Thanks

bomversion not defined

We get the following exception when comparing the same example URLs you have given:
org.openqa.selenium.WebDriverException: javascript error: bomversion is not defined

What is bomversion?

pagelyzer_capture : `require': no such file to load -- selenium-webdriver (LoadError)

Hi,
We are installing pagelyzer on our Debian Squeeze server:
$ uname -a
Linux machine-name 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux

$ java -version
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

$ ruby -v
ruby 1.9.2p320 (2012-04-20 revision 35421) [x86_64-linux]

$ gem -v
1.3.7.1

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:/usr/lib/jvm/java-7-oracle/bin

$ bundle
Fetching gem metadata from http://rubygems.org/.........
Fetching gem metadata from http://rubygems.org/..
Resolving dependencies...
Using ffi (1.9.0)
Using childprocess (0.3.9)
Using headless (1.0.1)
Using mini_portile (0.5.1)
Using multi_json (1.8.1)
Using nokogiri (1.6.0)
Installing rjb (1.4.8)
Installing rubyzip (0.9.9)
Installing sanitize (2.0.6)
Installing websocket (1.0.7)
Installing selenium-webdriver (2.35.1)
Using bundler (1.3.5)
Your bundle is complete!
Use bundle show [gemname] to see where a bundled gem is installed.

We think the installation was done correctly, but when we run pagelyzer we receive this error:
(I added the line "puts RUBY_VERSION" to make sure we are using the right version of Ruby, which is why it displays "1.9.2")

$ ./pagelyzer capture --url=http://www.google.fr
1.9.2
<internal:lib/rubygems/custom_require>:29:in `require': no such file to load -- selenium-webdriver (LoadError)
from <internal:lib/rubygems/custom_require>:29:in `require'
from /1/crawl/hatem-tmp/pagelyzer/pagelyzer-ruby-0.9.1-standalone/lib/pagelyzer_capture.rb:37:in `<top (required)>'
from /1/crawl/hatem-tmp/pagelyzer/pagelyzer-ruby-0.9.1-standalone/bin/pagelyzer_capture:4:in `require_relative'
from /1/crawl/hatem-tmp/pagelyzer/pagelyzer-ruby-0.9.1-standalone/bin/pagelyzer_capture:4:in `

To check that Ruby 1.9.2 and Selenium WebDriver work together, we wrote a script that opens a web page using the WebDriver in a virtual display, and it works:

puts RUBY_VERSION
require "selenium-webdriver"
driver = Selenium::WebDriver.for :firefox
driver.navigate.to "http://google.com"
puts driver.title
driver.quit

Xvfb :1 -screen 0 1024x768x24 &
DISPLAY=:1 ruby test-sel.rb
1.9.2
Google

We noted that this error has been encountered before in #1.

Can you help us on this problem?

Error : in `require': no such file to load -- selenium-webdriver

Hi,

Following the steps noted by you in the description :

$ ruby -v
ruby 1.9.2p290 (2011-07-09 revision 32553) [i686-linux]
$ gem -v
1.3.7

pagelyzer-ruby-0.9-standalone.zip downloaded and unzipped.

Dependencies installed :
$ sudo apt-get install libxslt-dev libxml2-dev
$ sudo apt-get install openjdk-7-jdk
$ sudo apt-get install imagemagick

In the project folder :
$ bundle
Fetching gem metadata from http://rubygems.org/.........
Fetching gem metadata from http://rubygems.org/..
Resolving dependencies...
Enter your password to install the bundled RubyGems to your system:
Installing ffi (1.4.0)
Installing childprocess (0.3.9)
Installing hpricot (0.8.6)
Installing multi_json (1.6.1)
Installing nokogiri (1.5.5)
Installing rubyzip (0.9.9)
Installing sanitize (2.0.3)
Installing websocket (1.0.7)
Installing selenium-webdriver (2.29.0)
Using bundler (1.3.2)
Your bundle is complete! Use bundle show [gemname] to see where a bundled gem is installed.

And a file "Gemfile.lock" was created.

Running pagelyzer :
After the dependencies were installed :
$ ./pagelyzer
USAGE: pagelyzer [--help|--version] [ <command_options>]

++++++++++++++++++++++++++++++++++++++++++++++

$ ./pagelyzer capture --url=http://google.fr
<internal:lib/rubygems/custom_require>:29:in `require': no such file to load -- selenium-webdriver (LoadError)
from <internal:lib/rubygems/custom_require>:29:in `require'
from /home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/bin/pagelyzer_capture:36:in `

++++++++++++++++++++++++++++++++++++++++++++++

$ ./pagelyzer changedetection --url1=http://google.fr --url2=http://google.com
Using marcalizer.jar found in /home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/ext/marcalizer
Notice: using 'firefox' as default browser
Pagelyzer: capturing URLhttp://google.com
<internal:lib/rubygems/custom_require>:29:in `require': no such file to load -- selenium-webdriver (LoadError)
from <internal:lib/rubygems/custom_require>:29:in `require'
from /home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/bin/pagelyzer_capture:36:in `<main>'
<internal:lib/rubygems/custom_require>:29:in `require': no such file to load -- selenium-webdriver (LoadError)
from <internal:lib/rubygems/custom_require>:29:in `require'
from /home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/bin/pagelyzer_capture:36:in `<main>'
cp: cannot stat «/home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/out/firefox_google_fr.png»: No such file or directory
cp: cannot stat «/home/hatem/Bureau/travail/AD-658/pagelyzer-ruby-0.9-standalone/out/firefox_URLgoogle_com.png»: No such file or directory
Exception in thread "main" java.lang.UnsupportedClassVersionError: Taverna/ScapeTest : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: Taverna.ScapeTest. Program will exit.

The error depends on the webdriver :
$ bundle show selenium-webdriver
/usr/local/lib/ruby/gems/1.9.1/gems/selenium-webdriver-2.29.0

Am I missing something in the installation?

Update repo url (and hello!)

Hello! I'm working with EDGI on a related project in this space, and while doing a survey of other tools, we came across Pagelyzer :)

We're totally open to further interaction, but for now, I just wanted to say "hi" -- Hi! -- and point out that the url of the repo is a dead link, and presumably should be http://pagelyzer.openpreservation.org/

Anyhow, all the best! (If anyone is interested, help or feedback on that list/survey would be totally appreciated! Just let me know, and I'm happy to add you as collaborators on the repo.)

Can pagelyzer run in a system without GUI?

We installed the debian package of pagelyzer in a system without a GUI.
3 commands are available now :
1- pagelyzer_analyzer
2- pagelyzer_changedetection
3- pagelyzer_capture

We have 2 browsers on this machine : firefox and opera, in order to run them we use the X server Xvfb.

Trying to run pagelyzer :

+++++++++++++++++++++++++++++
$ pagelyzer_capture --url=http://twitter.com
Capturing http://twitter.com with local opera
Connection not possible with opera
WARNING: Is opera installed in your system?
Try with the --browser=your_installed_browser
USAGE: pagelyzer_capture --url=URL [--js_files_url=BASE_URL] [--output-folder=FOLDER] [--browser=BROWSER_CODE] [--thumbnail] [--help]
This tool aims to have a HTML document with the visual cues integrated, called Decorated HTML. This allows to save the state of a browser at the moment of capture
Browsers code are the same as defined in selenium. For instance:

firefox
chrome
iexploreproxy
safariproxy
opera

Note: the browser should be installed in your machine to work with selenium webdriver

The output is sent to 'out' folder. If it doesn't exists it will be created
+++++++++++++++++++++++++++++

Can your tool be run on systems without a GUI?

Trying to run the browsers without a virtual display :
$ firefox
Error: no display specified

$ opera
opera: cannot connect to X server . Error: No such device

Same score obtained for 1800 pairs : 0.37208013647619376

Hi,

We successfully installed the Debian package of pagelyzer and, following the solution described in
#3 for systems without a GUI, we tried to run pagelyzer on 1800 pairs of links using a script. However, we obtained the same result at the end of each comparison: "0.37208013647619376". We checked some pairs (of snapshots) and noted differences between the images, so the score should be different each time.

1- What is the meaning of that result (the number "0.37208013647619376")? Is it the result of wrong input to marcalizer (nonexistent pictures)?

We noted also :
"Waiting page to finish loading... (Timeout in 10sec)"
2- How can we check that the page has finished loading before taking the snapshots?

Regards.

Dependency error in Debian package

Using the .deb package from:
http://deb.openplanetsfoundation.org/pool/main/p/pagelyzer-ruby/

When trying to install the Debian package on the IM cluster, we got the following dependency errors:

Selecting previously deselected package pagelyzer-ruby1.9.1.
(Reading database ... 207379 files and directories currently installed.)
Unpacking pagelyzer-ruby1.9.1 (from pagelyzer-ruby1.9.1_0.9-12-gbbcc12f_amd64.deb) ...
dpkg: dependency problems prevent configuration of pagelyzer-ruby1.9.1:
pagelyzer-ruby1.9.1 depends on cdbs; however:
Package cdbs is not installed.
pagelyzer-ruby1.9.1 depends on ruby-pkg-tools; however:
Package ruby-pkg-tools is not installed.
pagelyzer-ruby1.9.1 depends on openjdk-6-jdk; however:
Package openjdk-6-jdk is not installed.
dpkg: error processing pagelyzer-ruby1.9.1 (--install):
dependency problems - leaving unconfigured
Processing triggers for man-db ...
Errors were encountered while processing:
pagelyzer-ruby1.9.1

Note: The deb package is only available for 64bit systems. We therefore could not try it on a local machine (x86 system).

Connection not possible with firefox

Hi,

I have installed pagelyzer and all its dependencies but the software does not use/find my installed Firefox instance.
When I run:

./pagelyzer capture --url=http://www.cnn.com

nothing happens for ca. 60 seconds and then it seems to time out with:

Connection not possible with firefox
WARNING: Is firefox installed in your system?
Try with the --browser=your_installed_browser
unable to obtain stable firefox connection in 60 seconds (127.0.0.1:7055)

followed by the regular output about usage etc.
Any help is much appreciated.

The parameters of my system:

$ uname -a
Linux abacus 2.6.38-11-generic #50-Ubuntu SMP Mon Sep 12 21:17:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

$ ruby -v
ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-linux]

$ gem -v
1.3.7

$ gem env
RubyGems Environment:

RUBYGEMS VERSION: 1.3.7
RUBY VERSION: 1.9.2 (2011-07-09 patchlevel 290) [x86_64-linux]
INSTALLATION DIRECTORY: /usr/local/bin/ruby-1.9.2-p290/lib/ruby/gems/1.9.1
RUBY EXECUTABLE: /usr/local/bin/ruby-1.9.2-p290/bin/ruby
EXECUTABLE DIRECTORY: /usr/local/bin/ruby-1.9.2-p290/bin
RUBYGEMS PLATFORMS:
- ruby
- x86_64-linux
GEM PATHS:
- /usr/local/bin/ruby-1.9.2-p290/lib/ruby/gems/1.9.1
- /home/mklein/.gem/ruby/1.9.1
GEM CONFIGURATION:
- :update_sources => true
- :verbose => true
- :benchmark => false
- :backtrace => false
- :bulk_threshold => 1000
REMOTE SOURCES:
- http://rubygems.org/

$ gem list

*** LOCAL GEMS ***

bundler (1.3.5)
childprocess (0.3.9)
ffi (1.9.0)
headless (1.0.1)
mini_magick (3.6.0)
mini_portile (0.5.1)
minitest (1.6.0)
multi_json (1.8.1)
nokogiri (1.6.0)
rake (0.8.7)
rdoc (2.5.8)
rjb (1.4.8)
rubyzip (0.9.9)
sanitize (2.0.6)
selenium-webdriver (2.35.1)
subexec (0.2.3)
websocket (1.0.7)

$ bundle show
Gems included by the bundle:

bundler (1.3.5)
childprocess (0.3.9)
ffi (1.9.0)
headless (1.0.1)
mini_magick (3.6.0)
mini_portile (0.5.1)
multi_json (1.8.1)
nokogiri (1.6.0)
rjb (1.4.8)
rubyzip (0.9.9)
sanitize (2.0.6)
selenium-webdriver (2.35.1)
subexec (0.2.3)
websocket (1.0.7)

$ which firefox
/usr/bin/firefox

$ firefox -v
Mozilla Firefox 24.0

Problems with the new version: order of the options and looping

Hi,
To run the tool correctly we should first define the JAVA_HOME :
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/

if not, we got this exception
/usr/bin/pagelyzer_changedetection:90:in `load': can't create Java VM (RuntimeError)
from /usr/bin/pagelyzer_changedetection:90:in `get_java_instance'
from /usr/bin/pagelyzer_changedetection:112:in `

Then we can run it. In our case we use the pagelyzer_changedetection command to compare two web pages.
1- First, we should note that the behaviour of the tool depends on the order of its options. For example:

A-
$ pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/ --url2=http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/ --headless --output-folder=./file-0-line/ --url-archive
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file

Connection not possible with firefox
WARNING: Is firefox installed in your system?
.....

B-
$ pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/ --url2=http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/ --headless --url-archive --output-folder=./file-0-line-0
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file

Timeout: 30secs
Capturing http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/ with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Timeout: 30secs
Capturing http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/ with local firefox
Getting screenshot
Waiting page to finish loading...
done.
0.00687000358334055
0.0166589550943836
0.0031247369478646305
0.00896583116319136
Distance between the two web-pages:: -0.5490586292655766
Processed in 71.005285768secs
Browser firefox closed

2- The second note is about the --output-folder option. If we do not create the output folder beforehand (in the last version, a missing folder was created automatically), we get:
$ pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/ --url2=http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/ --headless --url-archive --output-folder=./file-0-line-0
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file

Connection not possible with firefox
WARNING: Is firefox installed in your system?
....

On the other hand, when we use the option with an existing folder, we do not find the files used in the comparison (.html, .png and the marcalizer folder).

These were some example runs from testing the tool, but our use case is to run pagelyzer_changedetection in a loop, comparing many web pages. Using the command with the needed options (in the right order), we observe different results for the same command. Please find the commands run and their results, in order:

pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/ --url2=http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/ --headless --url-archive --output-folder=./file-0-line-0
pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20120406035308/http://www.coi.gov.uk/ --url2=http://im1c6.internetmemory.org/tna/20120406035308/http://www.coi.gov.uk/ --headless --url-archive --output-folder=./file-0-line-1
pagelyzer_changedetection --url1=http://webarchive.nationalarchives.gov.uk/20110809101133/http://nsonline.org.uk/site_intelligence/selectors/page?H --url2=http://im1c6.internetmemory.org/tna/20110809101133/http://nsonline.org.uk/site_intelligence/selectors/page?H --headless --url-archive --output-folder=./file-0-line-2

$ cat log.txt
Tue Apr 23 11:12:27 CEST 2013
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file
Connection not possible with firefox
WARNING: Is firefox installed in your system?
Try with the --browser=your_installed_browser
USAGE: pagelyzer_changedetection [--url1=URL --url2=URL | --urls=FILE --output-file=FILE] [conf=CONF_FILE] [--doc=(1..10)] [--output-folder=FOLDER] [--browser=BROWSER_CODE | --browser1=BROWSER_CODE --browser2=BROWSER_CODE] [--verbose] --type=[images|structure|hybrid] [--url-archive]
This tool aims integrates all the change detection and segmentation tools
Help:
type = hybrid | webshot
Browsers code are the same as defined in selenium. For instance:

firefox
chrome
iexploreproxy
safariproxy
opera

For the input URL file it expects the following syntax of each line:
-URL1 URL2
Timeout: 30secs
there were a problem in the capture of pages
ERROR: can't process these urls:
http://webarchive.nationalarchives.gov.uk/20100824180635/http://yourfreedom.hmg.gov.uk/
http://im1c6.internetmemory.org/tna/20100824180635/http://yourfreedom.hmg.gov.uk/
Tue Apr 23 11:14:39 CEST 2013

Tue Apr 23 11:14:39 CEST 2013
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file
0.012651629581915857
0.021330379109104593
0.022870095264484743
0.02865328747913551
Distance between the two web-pages:: 0.6305194657627191
Timeout: 30secs
Capturing http://webarchive.nationalarchives.gov.uk/20120406035308/http://www.coi.gov.uk/ with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Timeout: 30secs
Capturing http://im1c6.internetmemory.org/tna/20120406035308/http://www.coi.gov.uk/ with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Processed in 54.306332047secs
Browser firefox closed
Tue Apr 23 11:16:07 CEST 2013

Tue Apr 23 11:16:07 CEST 2013
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file
0.0
0.0
0.0
0.0
Distance between the two web-pages:: 1.0006352506978375
Timeout: 30secs
Capturing http://webarchive.nationalarchives.gov.uk/20110809101133/http://nsonline.org.uk/site_intelligence/selectors/page?H with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Timeout: 30secs
Capturing http://im1c6.internetmemory.org/tna/20110809101133/http://nsonline.org.uk/site_intelligence/selectors/page?H with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Processed in 12.933806415secs
Browser firefox closed
Tue Apr 23 11:17:16 CEST 2013

Tue Apr 23 11:17:16 CEST 2013
Headless mode
Notice: using 'firefox' as default browser
Type: images. Using: /etc/pagelyzer_ruby/ex_images.xml file
Timeout: 30secs
Capturing http://webarchive.nationalarchives.gov.uk/20100307172104/http://ofsted.gov.uk/ofsted-home/about-us/faqs with local firefox
Getting screenshot
Waiting page to finish loading...
done.
Timeout: 30secs
Capturing http://im1c6.internetmemory.org/tna/20100307172104/http://ofsted.gov.uk/ofsted-home/about-us/faqs with local firefox
ERROR: Page load timeout execution expired
there were a problem in the capture of pages
ERROR: can't process these urls:
http://webarchive.nationalarchives.gov.uk/20100307172104/http://ofsted.gov.uk/ofsted-home/about-us/faqs
http://im1c6.internetmemory.org/tna/20100307172104/http://ofsted.gov.uk/ofsted-home/about-us/faqs
Browser firefox closed
Tue Apr 23 11:18:22 CEST 2013

As mentioned above, we found 4 different results:
Failure: Connection not possible with firefox
Success: Distance between the two web-pages:: 0.6305194657627191
Success (I suppose): Distance between the two web-pages:: 1.0006352506978375
Failure: ERROR: Page load timeout execution expired (I think it can be fixed with the --timeout option)
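
When the tool runs in a loop, these outcomes can be told apart automatically by scanning each run's captured output. The sketch below is an assumption-laden illustration, not part of pagelyzer: the class RunOutcome and its categories are hypothetical, and the matched strings are copied from the log excerpts in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical classifier for the captured output of one pagelyzer_changedetection run.
class RunOutcome {

    enum Kind { SCORED, CONNECTION_FAILED, PAGE_TIMEOUT, UNKNOWN }

    // Matches the score line as it appears in the logs quoted in this report.
    private static final Pattern DISTANCE = Pattern.compile(
            "Distance between the two web-pages:: (-?[0-9]+(?:\\.[0-9]+)?)");

    /** Classifies a run from its log text; a score line takes precedence over error lines. */
    static Kind classify(String log) {
        if (DISTANCE.matcher(log).find()) {
            return Kind.SCORED;
        }
        if (log.contains("Connection not possible with")) {
            return Kind.CONNECTION_FAILED;
        }
        if (log.contains("Page load timeout")) {
            return Kind.PAGE_TIMEOUT;
        }
        return Kind.UNKNOWN;
    }

    /** Returns the reported distance, or null when the log has no score line. */
    static Double distance(String log) {
        Matcher m = DISTANCE.matcher(log);
        return m.find() ? Double.valueOf(m.group(1)) : null;
    }

    public static void main(String[] args) {
        String log = "Headless mode\nDistance between the two web-pages:: -0.5490586292655766\n";
        System.out.println(classify(log) + " " + distance(log));
    }
}
```

A wrapper script could retry runs classified as CONNECTION_FAILED or PAGE_TIMEOUT and keep only SCORED results.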

To summarize, the Debian package is easy to install (with the dpkg command) and the --headless option is well integrated for systems without a GUI. On the other hand, the new version does not keep the snapshots it takes, the order of the options appears to matter, and it seems unstable when run in a loop.

Thanks.

org.openqa.selenium.WebDriverException: bomversion is not defined

In screenshot mode everything works fine, but when I want to do a segmentation it is broken. Any ideas?
java -jar Pagelyzer-0.0.1-SNAPSHOT-jar-with-dependencies.jar -get segmentation -url http://www.abv.bg -config /home/quelibrio/Work/pagelyzer/Example_configFiles/config_content.xml
Selenium: local WebDriver
Setting up browser: firefox
Attempt = 1
getting data using driver: firefox
title: АБВ Поща
Starting server on port 8016
ERROR: Could not load http://www.abv.bg
Trying to reinitialize browser
Shutting down server on port 8016
org.openqa.selenium.WebDriverException: bomversion is not defined
Command duration or timeout: 33 milliseconds
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'quelibrio-HP-ProBook-430-G4', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-46-generic', java.version: '10.0.2'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Capabilities [{applicationCacheEnabled=true, rotatable=false, handlesAlerts=true, databaseEnabled=true, version=31.0, platform=LINUX, nativeEvents=false, acceptSslCerts=true, webStorageEnabled=true, locationContextEnabled=true, browserName=firefox, takesScreenshot=true, javascriptEnabled=true, cssSelectorsEnabled=true}]
Session ID: e4817de3-c9d9-49d4-b321-77da95695719
Attempt = 1

Java version : generate the jar file

Hi,

I'm trying to use the Java version of the Pagelyzer tool and some information is missing:

First, which Java version is needed to build and run the jar?
How can we generate the jar file from these classes?
https://github.com/openplanets/pagelyzer/tree/master/java/fr/lip6/jpagelyzer

Thanks.

Problem with the types : hybrid and structure

Hi,

We tested the tool on 449 pairs and encountered some problems with the types hybrid and structure.
Note that type=images completed the comparison successfully on the 449 pairs.

With the hybrid and structure types, the pagelyzer_changedetection command calculates the score of the first 4 pairs, then generates timeout errors (10 pairs of links time out), and finally crashes with this error:

/usr/lib/ruby/1.9.1/pagelyzer_util.rb:362:in `[]=': can't convert Symbol into String (TypeError)
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:362:in `change_relative_url'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:372:in `block in change_relative_url'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:207:in `block in each'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `upto'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `each'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:371:in `change_relative_url'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:372:in `block in change_relative_url'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:207:in `block in each'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `upto'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `each'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:371:in `change_relative_url'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:372:in `block in change_relative_url'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:207:in `block in each'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `upto'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `each'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:371:in `change_relative_url'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:372:in `block in change_relative_url'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:207:in `block in each'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `upto'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `each'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:371:in `change_relative_url'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:389:in `block in normalize_DOM'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:207:in `block in each'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `upto'
from /usr/lib/ruby/1.9.1/nokogiri/xml/node_set.rb:206:in `each'
from /usr/lib/ruby/1.9.1/pagelyzer_util.rb:379:in `normalize_DOM'
from /usr/lib/ruby/1.9.1/pagelyzer_analyzer.rb:258:in `start'
from /usr/bin/pagelyzer_changedetection:298:in `block in <main>'
from /usr/bin/pagelyzer_changedetection:250:in `each'
from /usr/bin/pagelyzer_changedetection:250:in `

The same thing happens with the option --timeout=120.

Regards

URL Monitoring Issue

Hi,
I was able to set up the project on my local system and want to use this code to monitor some URLs. I get the message "you can successful use the pagelyzer" on the Eclipse console.

Can you please explain how to get monitoring results? I am not sure whether this output means it is working or not.

Can one of you guide me through this?

Thanks in advance.