
tosdr / tosback2

This project forked from pde/tosback2


Reimplementing TOSBack with Ruby and using git to see TOS changes!

Home Page: http://tosback.org

License: GNU General Public License v2.0

Ruby 0.26% Python 7.69% Shell 1.95% C 52.92% Makefile 10.98% Perl 3.01% C++ 2.57% TeX 5.27% JavaScript 0.28% HTML 4.68% PHP 0.02% CSS 0.08% Lex 0.08% M4 10.20% sed 0.01%

tosback2's Introduction

ToSBack!

This is a Ruby implementation of TOSBack, designed to scrape the privacy policies and terms of service agreements from the sites defined in the rules folder.

Rules

The log files in "logs" record when the script was last run and whether any rule's URL needs to be updated. Typically, tosback.rb grabs the body of a URL and tries to strip away the HTML before storing the policy. If a site comes back as modified every time the script runs (thanks to changing ads or related links), you can add an xpath attribute to the url element in the rule's XML to pinpoint the TOS content on the page:

Here's an example:

<docname name="Privacy Policy">
  <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
   <norecurse name="arbitrary"/>
  </url>
</docname>

Now, tosback.rb should only grab the content we want from that URL! Hooray!
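
To make the effect of the xpath attribute concrete, here is a minimal sketch. It is not the actual tosback.rb code: it assumes a well-formed page so that Ruby's bundled REXML library can parse it (a real crawler needs a forgiving HTML parser), and the sample page and the text_of and extract_policy helpers are hypothetical.

```ruby
require 'rexml/document'

# Rule fragment taken from the example above.
RULE_XML = <<~XML
  <docname name="Privacy Policy">
    <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
      <norecurse name="arbitrary"/>
    </url>
  </docname>
XML

# Hypothetical page body standing in for a real HTTP response.
SAMPLE_PAGE = <<~HTML
  <html><body>
    <div id="ads">Buy things!</div>
    <div id="terms"><p>We respect your privacy.</p></div>
  </body></html>
HTML

# Collect all text below a node, ignoring markup.
def text_of(node)
  return node.value if node.is_a?(REXML::Text)
  return '' unless node.respond_to?(:children)

  node.children.map { |child| text_of(child) }.join(' ')
end

# Apply the rule's xpath (if any) to the page and return the cleaned text.
def extract_policy(rule_xml, page)
  url = REXML::XPath.first(REXML::Document.new(rule_xml), '//url')
  xpath = url.attributes['xpath']
  doc = REXML::Document.new(page)
  nodes = xpath ? REXML::XPath.match(doc, xpath) : [doc.root]
  nodes.map { |node| text_of(node) }.join(' ').gsub(/\s+/, ' ').strip
end

puts extract_policy(RULE_XML, SAMPLE_PAGE)
```

Without the xpath attribute, the same helper would also return the text of the ads block, which is the kind of noise that makes a page look modified on every run.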

Developing

This project requires Ruby 2.3.1 and PhantomJS.

After cloning the project, use the --without production option to install the required gems:

$ bundle install --without production

When the app runs without any options, it saves information to our database and automatically makes some new git commits, but this is probably only desirable in production. On your dev machine, run it like this to skip the db and auto-committing:

rubycode$ ruby main.rb -dev

You can also pass a rule file as an argument to the script to get a preview of the results! For example:

rubycode$ ruby main.rb ../rules/abercrombie.com.xml

This will only scrape and write the rule you pass, so you can add xpath data to a rule and quickly test to make sure it's correct.

Running with the "-empty" argument will scan the crawl directory and update the empty.log! Example:

rubycode$ ruby main.rb -empty
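
The invocations above amount to a small dispatch on the first argument. The sketch below is hypothetical and is not taken from main.rb; the mode names and the run_mode helper are assumptions used only to summarise the behaviour described in this section.

```ruby
# Map the first command-line argument to the behaviour described above.
# Returns a symbol naming the mode plus the rule file, if one was given.
def run_mode(argv)
  case argv.first
  when nil      then [:production, nil]   # database writes and auto-commits
  when '-dev'   then [:dev, nil]          # skip the db and auto-committing
  when '-empty' then [:empty_scan, nil]   # rescan crawl directory, update empty.log
  else               [:single_rule, argv.first] # scrape only the given rule file
  end
end

p run_mode(['-dev'])
p run_mode(['../rules/abercrombie.com.xml'])
```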

tosback2's People

Contributors

colindean, dfandrich, dodger487, dtauerbach, gmacon, hugoroy, igalic, jaller94, jesseweinstein, jimmstout, kha-faz, lingjief, mattietk, michielbdejong, mrshu, pde, pierreozoux, roelofvkr, ryanwarsaw, secretrobotron, sethherr, subsystem7, tosbackcrawler, vinnl


tosback2's Issues

Git quirk with Amazon Silk T&C

Not sure if this is just git on my machine, or if there's something interesting happening in the repo somehow, but no matter how I try to refresh my repo, I get this:

On branch master
Your branch is up-to-date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   crawl_reviewed/amazon.com/AMAZON SILK TERMS & CONDITIONS.txt

no changes added to commit (use "git add" and/or "git commit -a")

If I use git reset --hard, only the capitalization changes:

        modified:   crawl_reviewed/amazon.com/Amazon Silk Terms & Conditions.txt

The actual diff, which is quite large, remains the same.

Any insights?
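
One possible explanation, offered as an assumption rather than a diagnosis: the repository may track two files whose names differ only in capitalisation, which a case-insensitive filesystem (the default on macOS and Windows) cannot hold side by side. A minimal sketch for spotting such pairs in a list of tracked paths, for example the output of git ls-files. The case_collisions helper is hypothetical, and the third path is invented for contrast.

```ruby
# Group paths by their lowercased form and keep groups with more than one
# spelling; those collide on a case-insensitive filesystem.
def case_collisions(paths)
  paths.group_by(&:downcase).values.select { |group| group.uniq.size > 1 }
end

tracked = [
  'crawl_reviewed/amazon.com/AMAZON SILK TERMS & CONDITIONS.txt',
  'crawl_reviewed/amazon.com/Amazon Silk Terms & Conditions.txt',
  'crawl_reviewed/amazon.com/Privacy Notice.txt' # hypothetical third file
]

case_collisions(tracked).each { |group| puts group.join(' <-> ') }
```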

Idea: analyse similarities between different ToS

It is very often the case that a service will simply copy parts of another service's ToS.

For this and other cases, it would be nice to analyse the similarities between the documents hosted here, so we can see, for instance, when two documents are 95% the same. This would also make it easy for ToS;DR to avoid duplicating reviews :-)
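
One simple way to quantify such overlap is the Jaccard similarity of word shingles. This is a sketch of one possible technique, not something the project implements; the shingle size of 3 and the sample sentences are arbitrary assumptions.

```ruby
require 'set'

# Split text into overlapping runs of `size` lowercase words ("shingles").
def shingles(text, size = 3)
  words = text.downcase.scan(/[a-z0-9']+/)
  words.each_cons(size).map { |run| run.join(' ') }.to_set
end

# Jaccard similarity: shared shingles divided by all distinct shingles.
def similarity(text_a, text_b)
  a = shingles(text_a)
  b = shingles(text_b)
  union = a | b
  return 1.0 if union.empty?

  (a & b).size.to_f / union.size
end

tos_a = 'You may not use the service for any illegal purpose.'
tos_b = 'You may not use the service for any unlawful purpose.'
puts format('%.0f%% similar', similarity(tos_a, tos_b) * 100)
```

Because a single changed word affects several shingles, short texts score lower than intuition suggests; on full-length policies the measure is much less sensitive to isolated edits.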

Multiple language versions may not be in sync :-(

I know that everyone hates such reports: you would think that localised ToS should change with the English "master". Well, nope.

Today I noticed that Google Play have changed their ToS. Or, more precisely:
https://play.google.com/intl/en-us_us/about/play-terms.html - English - December 18, 2017
https://play.google.com/intl/hu_hu/about/play-terms.html - Hungarian - February 5, 2018
possibly similar to
https://play.google.com/intl/en_uk/about/play-terms.html - UK English - February 5, 2018

So, first, it seems changes should be tracked for localised versions too, which may mean roughly a hundredfold increase in data.

Then there is the fun part: where multiple English versions exist, they could be diffed against one another to show regional differences between the terms.
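
As a rough illustration of comparing two regional versions, the sketch below lists paragraphs that appear in only one of them. It is a set comparison rather than a true diff, and the sample texts and the regional_differences helper are invented for the example.

```ruby
# Return the paragraphs unique to each version.
def regional_differences(text_a, text_b)
  paras_a = text_a.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  paras_b = text_b.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  { only_in_a: paras_a - paras_b, only_in_b: paras_b - paras_a }
end

# Invented sample texts; real input would be two crawled policy files.
en_us = "Terms apply.\n\nLast modified: December 18, 2017"
en_uk = "Terms apply.\n\nLast modified: February 5, 2018"
p regional_differences(en_us, en_uk)
```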

(As a sidenote: all the websites are in horrible shape: tosdr.org and especially tosback.org. They are desperately in need of lots of love.)

Handling multiple jurisdictions / countries

Stripe has Terms split over 4 countries: CA, IE, GB, US. How should this be reflected in the rules?

One idea is to add a new attribute country on the url element and expand the definition to allow multiple url elements under a single docname, like so:

<sitename name="stripe.com">
 <docname name="Terms of Service">
  <url name="https://stripe.com/ca/terms"
       xpath="//article[1]" lang="en" country="ca">
   <norecurse name="arbitrary"/>
  </url>
  <url name="https://stripe.com/ie/terms"
       xpath="//article[1]" lang="en" country="ie">
   <norecurse name="arbitrary"/>
  </url>
 </docname>
</sitename>

This could even be useful for multi-lingual cases. Such as this fictitious site:

<sitename name="multilingual.tld">
 <docname name="Terms of Service">
  <url name="http://multilingual.tld/be/nl"
       xpath="//main" lang="nl" country="be">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/be/fr"
       xpath="//main" lang="fr" country="be">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/ca/fr"
       xpath="//main" lang="fr" country="ca">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/ca/en"
       xpath="//main" lang="en" country="ca">
   <norecurse name="arbitrary"/>
  </url>
 </docname>
</sitename>
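
If the proposal were adopted, a crawler could read the variants roughly like this. The sketch uses Ruby's bundled REXML library; the country and lang attributes are the ones proposed in this issue and are not part of the current rule format, and the variants helper is hypothetical.

```ruby
require 'rexml/document'

RULE = <<~XML
  <sitename name="stripe.com">
   <docname name="Terms of Service">
    <url name="https://stripe.com/ca/terms"
         xpath="//article[1]" lang="en" country="ca">
     <norecurse name="arbitrary"/>
    </url>
    <url name="https://stripe.com/ie/terms"
         xpath="//article[1]" lang="en" country="ie">
     <norecurse name="arbitrary"/>
    </url>
   </docname>
  </sitename>
XML

# List every url variant of every document as a hash of its attributes.
def variants(rule_xml)
  doc = REXML::Document.new(rule_xml)
  REXML::XPath.match(doc, '//docname/url').map do |url|
    {
      document: url.parent.attributes['name'],
      url: url.attributes['name'],
      country: url.attributes['country'], # proposed attribute
      lang: url.attributes['lang']        # proposed attribute
    }
  end
end

variants(RULE).each do |v|
  puts "#{v[:document]} [#{v[:country]}/#{v[:lang]}]: #{v[:url]}"
end
```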

(I wanted to discuss this in #tosback but couldn’t find the channel on either freenode or OFTC.)

Conflicting requirements for mime-types gem

I installed master/45995736 on OS X 10.8.5 today. I'm using ruby 1.9.3-p429 via rbenv, and when I try to run the crawler from the "rubycode" directory, I get a gem specification failure:

$ ruby main.rb ../rules/dropbox.com.xml 
/Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/specification.rb:1637:in `raise_if_conflicts': Unable to activate mail-2.5.4, because mime-types-2.1 conflicts with mime-types (~> 1.16) (Gem::LoadError)
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/specification.rb:746:in `activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems.rb:212:in `rescue in try_activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems.rb:209:in `try_activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/custom_require.rb:59:in `rescue in require'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/custom_require.rb:35:in `require'

The same error persists in a clean install of 1.9.3-p429 plus the mechanize, nokogiri, sanitize, and mail gems. gem search mime-types indicates I have versions 2.1 and 1.25.1 installed, which should satisfy the requirement for ~> 1.16, but apparently does not. I can't uninstall mime-types 2.1 because mechanize requires mime-types ~> 2.0.

I'm curious to know how other people are bypassing this problem.
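
A likely reading of the error, stated as an assumption: RubyGems activates only one version of a gem per process, so once mime-types 2.1 has been activated (for example on behalf of mechanize), the mail gem's ~> 1.16 requirement can no longer be met, even though 1.25.1 is installed. The version arithmetic can be checked with the classes RubyGems itself uses; the acceptable_to_all helper is hypothetical.

```ruby
require 'rubygems'

# Versions from `installed` that satisfy every requirement string.
def acceptable_to_all(installed, requirements)
  reqs = requirements.map { |r| Gem::Requirement.new(r) }
  installed.select do |version|
    reqs.all? { |req| req.satisfied_by?(Gem::Version.new(version)) }
  end
end

installed = %w[1.25.1 2.1] # versions reported in this issue

puts "ok for mail (~> 1.16):     #{acceptable_to_all(installed, ['~> 1.16']).inspect}"
puts "ok for mechanize (~> 2.0): #{acceptable_to_all(installed, ['~> 2.0']).inspect}"
puts "ok for both at once:       #{acceptable_to_all(installed, ['~> 1.16', '~> 2.0']).inspect}"
```

No installed version satisfies both requirements, so the two gems cannot be loaded together as reported; resolving the whole dependency set up front, for instance with Bundler and a lock file, is the usual way to surface this before runtime.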

Add current policies for Facebook

The following policies are linked from the main facebook.com page:

Of these, we're only crawling the third one. We're also crawling the Full Data Use Policy, which still exists as a URL, but is not linked from the main signup form.

Let's keep crawling the existing ones, but add the Cookie and Data policies.

For instance, I remember we got a question from a journalist when they started to mention that they might do research on users, and then we found out that it was on a page which people had to agree to, but which we were not crawling.

Add cloudsight.ai

Trying:

<sitename name="cloudsight.ai">
 <docname name="Privacy Policy">
   <url name="https://cloudsight.ai/privacy-policy" xpath="//*[@class='container']" reviewed="true">
     <norecurse name="arbitrary" />
   </url>
 </docname>
 <docname name="Terms of Service">
   <url name="https://cloudsight.ai/terms" xpath="//div[@class='container']" reviewed="true">
     <norecurse name="arbitrary" />
   </url>
 </docname>
</sitename>

but I'm seeing a reCAPTCHA error :(

$ ruby main.rb ../rules/cloudsight.ai.xml 
https://cloudsight.ai/privacy-policy
ReferenceError: Can't find variable: Set
reCAPTCHA couldn't find user-provided function: vueRecaptchaApiLoaded
https://cloudsight.ai/terms
ReferenceError: Can't find variable: Set
reCAPTCHA couldn't find user-provided function: vueRecaptchaApiLoaded

Crawler stopped (lack of inodes)

https://tosback.org/ shows a directory listing, suggesting it's running Apache without Passenger, probably? The root cause seems to be disk quota:

tosback3@dragon:/var/www/ToSBack3/public$ touch test.txt 
touch: cannot touch 'test.txt': Disk quota exceeded

ToS;DR icon not linked

The ToS;DR icon in the header of http://tosback.org/ links to eff.org, just like the EFF logo next to it. It would be good to link the ToS;DR icon to tosdr.org instead.

Also, since EFF appears in the footer as well, we should add ToS;DR there too, or alternatively just remove the footer.

cc @JimmStout @hugoroy

Only one commit per day

All the "View the changes on Github" links on https://tosback.org/ for changes made on the same day point to the same commit, which includes changes to multiple websites. This is a bit confusing. Perhaps it would be a good idea to have a separate commit for each website.
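
One way to get a commit per website would be to group the changed files by site directory before committing. A minimal sketch, assuming paths follow the crawl_reviewed/<site>/<document> layout seen elsewhere in this repository; the group_by_site helper and the sample paths are hypothetical, and the git commands are only printed, not run.

```ruby
# Group changed paths by the site directory (second path segment).
def group_by_site(paths)
  paths.group_by { |path| path.split('/')[1] }
end

# Hypothetical list of files changed by one crawl.
changed = [
  'crawl_reviewed/amazon.com/Conditions of Use.txt',
  'crawl_reviewed/amazon.com/Privacy Notice.txt',
  'crawl_reviewed/twitter.com/Terms of Service.txt'
]

group_by_site(changed).each do |site, files|
  quoted = files.map { |file| "'#{file}'" }.join(' ')
  puts "git add #{quoted} && git commit -m 'Update #{site}'"
end
```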

merge crawl folder into crawl_reviewed folder?

Right now, phoenix only looks at documents in the crawl_reviewed folder. It would be good if the two folders could be merged so there is only one place we have to look? Then services like ResearchGate (whose terms are in crawl and not in crawl_reviewed) can also be reviewed using Phoenix's annotate view.

Upgrade Ruby version?

I am new to Ruby, so please tell me if there are any gotchas around running old versions.

When I run bundle install --without production, I get the error message Your Ruby version is 2.5.1, but your Gemfile specified 2.3.1.

Should I try to downgrade my Ruby version or is it reasonable to keep the project’s Ruby version up-to-date? (latest version is currently: 2.6.1)

windowsphone.com is missing, please add a rule

Their documents have just changed and I was asked to accept those changes while trying to install an app. But neither tosback nor tosdr had an entry for windows phone.

ToS

The URL pattern for this document is
http://www.windowsphone.com/LOCALE/store/terms-of-service
where LOCALE is a fully specified locale such as in the example of
http://www.windowsphone.com/en-us/store/terms-of-service

Privacy Statement

http://www.windowsphone.com/LOCALE/legal/wp8/windows-phone-privacy-statement
where LOCALE is a fully specified locale such as in the example of
http://www.windowsphone.com/en-us/legal/wp8/windows-phone-privacy-statement
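
The two URL patterns can be expanded per locale as sketched below. The patterns come from this issue; the urls_for helper is hypothetical, and any locale other than en-us is an unverified guess.

```ruby
# URL patterns from this issue, with the locale as a named placeholder.
TEMPLATES = {
  'Terms of Service' =>
    'http://www.windowsphone.com/%<locale>s/store/terms-of-service',
  'Privacy Statement' =>
    'http://www.windowsphone.com/%<locale>s/legal/wp8/windows-phone-privacy-statement'
}.freeze

# Build the concrete URL of every document for one locale.
def urls_for(locale)
  TEMPLATES.transform_values { |template| format(template, locale: locale) }
end

%w[en-us de-de].each do |locale| # de-de is only an illustrative guess
  urls_for(locale).each { |name, url| puts "#{name} (#{locale}): #{url}" }
end
```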

Feature: add OpenGraph-metatags to site header

One may like or dislike the social media site that promotes the Open Graph API, but I would nevertheless recommend implementing those meta tags on tosback.org to make the site more appealing to share.

Above all, I would suggest adding an og:image tag that links to the logo of the service (or possibly a more appealing "marketing image"). As of now, logos of the services that are tested are displayed in the preview, which is rather misleading.
https://developers.facebook.com/docs/sharing/opengraph/object-properties
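
For illustration, the tags could be rendered with a small helper like the one below. og:title, og:url and og:image are standard Open Graph properties; the og_tags helper and the image URL are hypothetical.

```ruby
require 'cgi'

# Render Open Graph <meta> tags from a hash of property => content.
def og_tags(properties)
  properties.map do |property, content|
    %(<meta property="og:#{property}" content="#{CGI.escapeHTML(content)}" />)
  end.join("\n")
end

puts og_tags(
  'title' => 'TOSBack',
  'url'   => 'https://tosback.org/',
  'image' => 'https://tosback.org/logo.png' # hypothetical image location
)
```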

What is norecurse?

I often see <norecurse name="arbitrary" />, but I'm not sure what it does. Could we put a section in the README that explains it?

Issue with mysql connection on tosback.org server?

The tosback.org server ran out of disk space. I removed the git history of this repo, which was taking up 15 GB of the server's 20 GB disk, by switching to a shallow git clone. The disk issue seems solved now, but I found that the script had stopped working.

When running it manually, I found it helps if I comment out https://github.com/tosdr/tosback2/blob/master/rubycode/main.rb#L12
I guess that will mean the git repo will still be updated, but the information on the web page will no longer get updated - I guess it uses mysql to communicate with https://github.com/tosdr/ToSBack3?

In any case, it's running now, I'll see if the website gets updated once the script is finished.

Commit link error?

Am I the only one who gets a 404 error using the commit link in the daily email?

My July 15 link:
These were changed in last night's crawl. Please have a look at the commit at dcaa695?diff=split to see the differences!

tosback.org links

I noticed links to commits from tosback.org after August 30 are missing from github.

TOSBack didn't grab new twitter.com update

The Twitter TOS and Privacy Policy updated this week:

We have made revisions to these Terms of Service, effective May 25th, 2018. You can see the new Terms of Service here [https://twitter.com/tos#update]. By continuing to use our services after May 25th, you agree to the new Terms of Service. A summary of changes can be found on our Help Center [https://help.twitter.com/rules-and-policies/update-privacy-policy].

However the version in this repository hasn't been updated in 7 months: crawl_reviewed/twitter.com

The URL is the same: https://twitter.com/en/tos
But I think the xpath rule is no longer valid. (At least I couldn't find main-content from rules/twitter.com.xml when looking at the page source.)
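
A stale xpath like this could be caught automatically by flagging rules whose expression matches nothing on the fetched page. A minimal sketch using Ruby's bundled REXML library on a well-formed, invented page; the xpath_matches? helper is hypothetical, and the expressions are only modelled on the main-content id mentioned above, not copied from rules/twitter.com.xml.

```ruby
require 'rexml/document'

# True when the xpath selects at least one node in the page.
def xpath_matches?(page, xpath)
  !REXML::XPath.match(REXML::Document.new(page), xpath).empty?
end

# Invented page in which the old container id no longer exists.
page = '<html><body><div id="new-layout"><p>Terms...</p></div></body></html>'

["//div[@id='main-content']", "//div[@id='new-layout']"].each do |xpath|
  status = xpath_matches?(page, xpath) ? 'ok' : 'matches nothing, rule needs updating'
  puts "#{xpath}: #{status}"
end
```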

[FEEDBACK] Expanding the List of Sites : TOS.org

The current list is
500px
Amazon
Apple
AT&T
Comcast
Couchsurfing
Delicious
Dropbox
DuckDuckGo
Facebook
Flickr
Github
Google
Microsoft
MySpace
Netflix
Rapidshare
Skype
SoundCloud
Spotify
Steam
Tumblr
Twitpic
Twitter
Wikipedia
Wordpress.com
Yahoo!
YouTube

Would you expand the TOS list to include Instagram, Reddit, Xvideos, Pornhub and Baidu?

Just to ensure the Terms of Services of the top 10 websites in the world are covered.

Reference
TOSBACK Website
Top 100 Websites in the World
