tosdr / tosback2

This project forked from pde/tosback2

Reimplementing TOSBack with Ruby and using git to see TOS changes!

License: GNU General Public License v2.0

Ruby 0.26% Python 7.69% Shell 1.95% C 52.92% Makefile 10.98% Perl 3.01% C++ 2.57% TeX 5.27% JavaScript 0.28% HTML 4.68% PHP 0.02% CSS 0.08% Lex 0.08% M4 10.20% sed 0.01%

tosback2's Introduction

ToSBack!

This is a Ruby implementation of TOSBack, designed to scrape the Privacy Policies and Terms of Service agreements from sites defined in the rules folder.

Rules

The log files in "logs" show when the script last ran and whether any rule's URL needs to be updated. Typically, tosback.rb grabs the body of a URL and tries to strip away the HTML before storing the policy. If a site comes back as modified every time the script runs (because ads or related links change), you can add an xpath attribute to the url element in the rule's XML to pinpoint the TOS data on the page:

Here's an example:

<docname name="Privacy Policy">
  <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
   <norecurse name="arbitrary"/>
  </url>
</docname>

Now, tosback.rb should only grab the content we want from that URL! Hooray!
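
To see why the xpath attribute helps, here is a minimal sketch using REXML from Ruby's standard library. It is not how tosback.rb itself fetches or parses pages; the sample markup and the helper names all_text and policy_text are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical page: the policy sits in div#terms, next to a block whose
# content changes on every visit.
SAMPLE_PAGE = <<~HTML
  <html><body>
    <div id="ads">Offer of the day: 20 percent off</div>
    <div id="terms">We collect your email address.</div>
  </body></html>
HTML

# Recursively collect the text below an element.
def all_text(element)
  element.children.map do |child|
    case child
    when REXML::Element then all_text(child)
    when REXML::Text then child.value
    else ""
    end
  end.join(" ")
end

# Return the text of the first node matching xpath, or of the whole page
# when no xpath is given. Returns nil when the xpath matches nothing.
def policy_text(markup, xpath = nil)
  doc = REXML::Document.new(markup)
  node = xpath ? REXML::XPath.first(doc, xpath) : doc.root
  return nil if node.nil?
  all_text(node).split.join(" ") # normalise whitespace
end

puts policy_text(SAMPLE_PAGE)                       # includes the ad text
puts policy_text(SAMPLE_PAGE, "//div[@id='terms']") # policy text only
```

Without the xpath, the changing ad text ends up in the stored document and every crawl looks like a modification; with it, only the policy text is kept.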

Developing

This project requires Ruby 2.3.1 and PhantomJS.

After cloning the project, use the --without production option to install the required gems:

$ bundle install --without production

When the app runs without any options, it saves information to our database and automatically makes some new git commits, but this is probably only desirable in production. On your dev machine, run it like this to skip the db and auto-committing:

rubycode$ ruby main.rb -dev

You can also pass a rule file as an argument to the script to get a preview of the results! For example:

rubycode$ ruby main.rb ../rules/abercrombie.com.xml

This will only scrape and write the rule you pass, so you can add xpath data to a rule and quickly test to make sure it's correct.

Running with the "-empty" argument will scan the crawl directory and update the empty.log! Example:

rubycode$ ruby main.rb -empty

tosback2's Issues

Git quirk with Amazon Silk T&C

Not sure if this is just git on my machine, or if there's something interesting happening in the repo somehow, but no matter how I try to refresh my repo, I get this:

On branch master
Your branch is up-to-date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   crawl_reviewed/amazon.com/AMAZON SILK TERMS & CONDITIONS.txt

no changes added to commit (use "git add" and/or "git commit -a")

If I use git reset --hard, then, only the capitalization changes:

        modified:   crawl_reviewed/amazon.com/Amazon Silk Terms & Conditions.txt

The actual diff—which is quite large—remains the same.

Any insights?
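
One possible explanation, offered only as an assumption: the repository tracks two files whose names differ only by letter case, and the working copy lives on a case-insensitive filesystem, so both names map to a single file on disk. The sketch below shows how such collisions could be found in a list of paths (for example the output of git ls-files). The helper name case_collisions and the third path are hypothetical.

```ruby
# Group paths that differ only by letter case. On a case-insensitive
# filesystem, every path in such a group maps to the same file on disk.
def case_collisions(paths)
  paths.group_by(&:downcase).values.select { |group| group.size > 1 }
end

# Example input modelled on the two names in the report above.
TRACKED_PATHS = [
  "crawl_reviewed/amazon.com/AMAZON SILK TERMS & CONDITIONS.txt",
  "crawl_reviewed/amazon.com/Amazon Silk Terms & Conditions.txt",
  "crawl_reviewed/amazon.com/Privacy Notice.txt", # hypothetical third file
]

case_collisions(TRACKED_PATHS).each do |group|
  puts "Collision: #{group.join('  <->  ')}"
end
```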

Idea: analyse similarities between different ToS

It is very often the case that a service will simply copy parts of another service's ToS.

For this case and other cases, it would be nice to analyse the similarities between different documents hosted there, so we can see for instance when documents are 95% the same. This will also make it easy for ToS;DR to avoid duplicating reviews :-)
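
As a rough illustration of how such a comparison might work, the sketch below scores two texts by the Jaccard similarity of their three-word runs ("shingles"). This is one common technique, not something the project implements; the function names and sample sentences are invented for the example.

```ruby
require "set"

# Break a text into overlapping runs of `size` lowercase words.
def shingles(text, size = 3)
  text.downcase.scan(/[[:alnum:]]+/).each_cons(size).to_set
end

# Jaccard similarity: shared shingles divided by all distinct shingles.
# Returns a value between 0.0 (nothing shared) and 1.0 (identical sets).
def similarity(text_a, text_b, size = 3)
  a = shingles(text_a, size)
  b = shingles(text_b, size)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Made-up sample sentences that differ in a single word.
TOS_A = "You agree to be bound by these terms of service at all times"
TOS_B = "You agree to be bound by these terms of use at all times"

puts format("similarity: %.2f", similarity(TOS_A, TOS_B))
```

Documents scoring above some threshold (the 95% mentioned above, say) could then be flagged as near-duplicates for review.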

Multiple language versions may not be in sync :-(

I know that everyone hates such reports: you would think that localised ToS should change with the English "master". Well, nope.

Today I noticed that Google Play have changed their ToS. Or, more precisely:
https://play.google.com/intl/en-us_us/about/play-terms.html - English - December 18, 2017
https://play.google.com/intl/hu_hu/about/play-terms.html - Hungarian - February 5, 2018
possibly similar to
https://play.google.com/intl/en_uk/about/play-terms.html - UK English, February 5, 2018

So, first, it seems changes should be tracked for the localised versions as well, which may mean something like a hundredfold increase in the amount of data.

Then there is the funny part, when multiple different English versions exist, where they could be happily diffed against one another, to show regional differences between the terms.

(As a sidenote: all the websites are in horrible shape: tosdr.org and especially tosback.org. They are desperately in need of lots of love.)

Handling multiple jurisdictions / countries

Stripe has Terms split over 4 countries: CA, IE, GB, US. How should this be reflected in the rules?

One idea is to add a new attribute country on the url element and expand the definition to allow multiple url elements under a single docname, like so:

<sitename name="stripe.com">
 <docname name="Terms of Service">
  <url name="https://stripe.com/ca/terms"
       xpath="//article[1]" lang="en" country="ca">
   <norecurse name="arbitrary"/>
  </url>
  <url name="https://stripe.com/ie/terms"
       xpath="//article[1]" lang="en" country="ie">
   <norecurse name="arbitrary"/>
  </url>
 </docname>
</sitename>

This could even be useful for multi-lingual cases. Such as this fictitious site:

<sitename name="multilingual.tld">
 <docname name="Terms of Service">
  <url name="http://multilingual.tld/be/nl"
       xpath="//main" lang="nl" country="be">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/be/fr"
       xpath="//main" lang="fr" country="be">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/ca/fr"
       xpath="//main" lang="fr" country="ca">
   <norecurse name="arbitrary"/>
  </url>
  <url name="http://multilingual.tld/ca/en"
       xpath="//main" lang="en" country="ca">
   <norecurse name="arbitrary"/>
  </url>
 </docname>
</sitename>
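
To make the proposal concrete, here is a sketch of how a crawler might read such a rule with REXML from the standard library. The lang and country attributes are the proposal above, not an existing rule format, and crawl_targets, storage_path and the resulting file layout are hypothetical.

```ruby
require "rexml/document"

# A trimmed copy of the proposed stripe.com rule from above.
PROPOSED_RULE = <<~XML
  <sitename name="stripe.com">
   <docname name="Terms of Service">
    <url name="https://stripe.com/ca/terms" xpath="//article[1]" lang="en" country="ca">
     <norecurse name="arbitrary"/>
    </url>
    <url name="https://stripe.com/ie/terms" xpath="//article[1]" lang="en" country="ie">
     <norecurse name="arbitrary"/>
    </url>
   </docname>
  </sitename>
XML

# Return one hash per url element, carrying the proposed attributes.
def crawl_targets(rule_xml)
  root = REXML::Document.new(rule_xml).root
  targets = []
  root.each_element("docname") do |docname|
    docname.each_element("url") do |url|
      targets << {
        site: root.attributes["name"],
        document: docname.attributes["name"],
        url: url.attributes["name"],
        xpath: url.attributes["xpath"],
        lang: url.attributes["lang"],
        country: url.attributes["country"],
      }
    end
  end
  targets
end

# Hypothetical storage layout: one file per document, language and country.
def storage_path(target)
  suffix = [target[:lang], target[:country]].compact.join("-")
  name = suffix.empty? ? target[:document] : "#{target[:document]} (#{suffix})"
  File.join("crawl", target[:site], "#{name}.txt")
end

crawl_targets(PROPOSED_RULE).each do |target|
  puts "#{target[:url]} -> #{storage_path(target)}"
end
```

Rules without lang or country would keep their current file name under this layout, so existing rules would not need to change.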

(I wanted to discuss this in #tosback but couldn’t find the channel on either freenode or OFTC.)

Conflicting requirements for mime-types gem

I installed master/45995736 on OS X 10.8.5 today. I'm using ruby 1.9.3-p429 via rbenv, and when I try to run the crawler from the "rubycode" directory, I get a gem specification failure:

$ ruby main.rb ../rules/dropbox.com.xml 
/Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/specification.rb:1637:in `raise_if_conflicts': Unable to activate mail-2.5.4, because mime-types-2.1 conflicts with mime-types (~> 1.16) (Gem::LoadError)
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/specification.rb:746:in `activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems.rb:212:in `rescue in try_activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems.rb:209:in `try_activate'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/custom_require.rb:59:in `rescue in require'
    from /Users/myuser/.rbenv/versions/1.9.3-p429/lib/ruby/1.9.1/rubygems/custom_require.rb:35:in `require'

The same error persists in a clean install of 1.9.3-p429, plus the mechanize, nokogiri, sanitize, and mail gems. gem search mime-types indicates I have versions 2.1 and 1.25.1 installed, which should satisfy the requirement for ~> 1.16, but appears not to. I can't uninstall mime-types 2.1 because mechanize requires mime-types ~> 2.0.
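
The two constraints can be checked directly with RubyGems' own Gem::Requirement and Gem::Version classes. The sketch below only evaluates the version constraints quoted in this report; the helper name satisfies? is made up, and the reading of the error (that mime-types 2.1 was already activated before mail was loaded) is an assumption.

```ruby
require "rubygems"

# True when `version` satisfies the RubyGems constraint string.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# The two installed mime-types versions, checked against the constraint
# quoted for mail (~> 1.16) and the one quoted for mechanize (~> 2.0).
%w[1.25.1 2.1].each do |version|
  puts "mime-types #{version}: " \
       "mail ok? #{satisfies?('~> 1.16', version)}, " \
       "mechanize ok? #{satisfies?('~> 2.0', version)}"
end
```

No single mime-types version satisfies both constraints, so whichever gem is activated first decides the version and the other then fails to load.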

I'm curious to know how other people are bypassing this problem.

tosback.org has invalid (expired) SSL cert

Not sure if this is the proper place to report this, but it should be fixed ASAP.

Add discordapp.com

It's very popular
Privacy policy raises concerns

Setting url/@reviewed='false' does not have desired effect

I'm in the process of creating a rules set and found that setting reviewed="false" for a URL has the same effect as setting it to true.
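
I have not checked how the crawler reads this attribute, so the following is only a guess at one common cause: XML attribute values arrive as strings, and in Ruby the string "false" is truthy. The helper names below are hypothetical.

```ruby
# Only nil and false are falsy in Ruby, so testing the raw attribute
# string treats "false" the same as "true".
def reviewed_naive?(attribute_value)
  attribute_value ? true : false
end

# Hypothetical stricter helper: compare against the literal "true".
def reviewed?(attribute_value)
  attribute_value.to_s.strip.downcase == "true"
end

["true", "false", nil].each do |value|
  puts "#{value.inspect}: naive=#{reviewed_naive?(value)} strict=#{reviewed?(value)}"
end
```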

Add current policies for Facebook

The following policies are linked from the main facebook.com page:

Cookie Policy: "https://www.facebook.com/help/cookies"
Data Policy: "https://www.facebook.com/about/privacy"
Terms: "http://www.facebook.com/legal/terms"

Of these, we're only crawling the third one. We're also crawling the Full Data Use Policy, which still exists as a URL, but is not linked from the main signup form.

Let's keep crawling the existing ones, but add the Cookie and Data policies.

For instance, I remember we got a question from a journalist when they started to mention that they might do research on users, and then we found out that it was on a page which people had to agree to, but which we were not crawling.

Add cloudsight.ai

Trying:

<sitename name="cloudsight.ai">
 <docname name="Privacy Policy">
   <url name="https://cloudsight.ai/privacy-policy" xpath="//*[@class='container']" reviewed="true">
     <norecurse name="arbitrary" />
   </url>
 </docname>
 <docname name="Terms of Service">
   <url name="https://cloudsight.ai/terms" xpath="//div[@class='container']" reviewed="true">
     <norecurse name="arbitrary" />
   </url>
 </docname>
</sitename>

but i'm seeing a reCaptcha error :(

$ ruby main.rb ../rules/cloudsight.ai.xml 
https://cloudsight.ai/privacy-policy
ReferenceError: Can't find variable: Set
reCAPTCHA couldn't find user-provided function: vueRecaptchaApiLoaded
https://cloudsight.ai/terms
ReferenceError: Can't find variable: Set
reCAPTCHA couldn't find user-provided function: vueRecaptchaApiLoaded

Twitter docs seem to switch between GDPR and non-GDPR

As noted by @hugoroy and discussed via email, in 27f3076?diff=split#diff-ca787fbf273ad31090b8a57a734c51c8L437 it looks like the previous crawl got the GDPR version of the Privacy Policy, and the current one got the non-GDPR one. Maybe we can change the URL to force one or the other (e.g. I've seen some services have something like ?eu=true at the end of the URL).

Crawler stopped (lack of inodes)

https://tosback.org/ shows a directory listing, suggesting it's running Apache without Passenger, probably? The root cause seems to be disk quota:

tosback3@dragon:/var/www/ToSBack3/public$ touch test.txt 
touch: cannot touch 'test.txt': Disk quota exceeded

ToS;DR icon not linked

The ToS;DR icon in the header of http://tosback.org/ only links to eff.org like the EFF logo next to it. Would be good to actually link the ToS;DR icon to tosdr.org

Also, since EFF is in the footer as well, we should add ToS;DR there also. Or alternatively just remove the footer.

cc @JimmStout @hugoroy

Only one commit per day

All the "View the changes on Github" links on https://tosback.org/ that indicate change on the same day, point to the same commit which includes changes on multiple websites. This is a bit confusing. Perhaps it would be a good idea to have separate commit for each website.
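
A sketch of how a per-website split could look, assuming changed files follow the crawl_reviewed/SITE/DOCUMENT.txt layout seen elsewhere in this repository. The helper name, the sample paths and the commit message format are made up, and the git commands are only printed, not run.

```ruby
# Group changed crawl files by their site directory, i.e. the second
# component of a path such as crawl_reviewed/amazon.com/Some Doc.txt.
def changes_by_site(paths)
  paths.group_by { |path| path.split("/")[1] }
end

# Hypothetical list of files changed by one crawl.
CHANGED_FILES = [
  "crawl_reviewed/amazon.com/Privacy Notice.txt",
  "crawl_reviewed/amazon.com/Conditions of Use.txt",
  "crawl_reviewed/twitter.com/Terms of Service.txt",
]

changes_by_site(CHANGED_FILES).each do |site, files|
  message = "Update #{site}"
  # Printed rather than executed; a real run would hand these to git.
  puts "git add -- #{files.map(&:inspect).join(' ')}"
  puts "git commit -m #{message.inspect}"
end
```

Each website's link on tosback.org could then point to a commit containing only that site's changes.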

Import rules from tosdr.org

Please import rules that we have in tosdr.org/services/

merge crawl folder into crawl_reviewed folder?

Right now, phoenix only looks at documents in the crawl_reviewed folder. It would be good if the two folders could be merged so there is only one place we have to look? Then services like ResearchGate (whose terms are in crawl and not in crawl_reviewed) can also be reviewed using Phoenix's annotate view.

Amazon commented out?

See https://github.com/tosdr/tosback2/blame/master/rules/amazon.com.xml - most docs seem to be commented out? I'll try to run tosback2 locally, so I can learn how to contribute xPath entries.

Upgrade Ruby version?

I am new to Ruby, so please tell me if there are any gotchas around running old versions.

When I run bundle install --without production, I get the error message Your Ruby version is 2.5.1, but your Gemfile specified 2.3.1.

Should I try to downgrade my Ruby version or is it reasonable to keep the project’s Ruby version up-to-date? (latest version is currently: 2.6.1)

windowsphone.com is missing, please add a rule

Their documents have just changed and I was asked to accept those changes while trying to install an app. But neither tosback nor tosdr had an entry for windows phone.

ToS

The URL for those documents is
http://www.windowsphone.com/LOCALE/store/terms-of-service
where LOCALE is a fully specified locale such as in the example of
http://www.windowsphone.com/en-us/store/terms-of-service

Privacy Statement

http://www.windowsphone.com/LOCALE/legal/wp8/windows-phone-privacy-statement
where LOCALE is a fully specified locale such as in the example of
http://www.windowsphone.com/en-us/legal/wp8/windows-phone-privacy-statement
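
As a small illustration, the two URL patterns above can be expanded per locale like this. The constant and helper names are made up, and only en-us is a locale confirmed in this report.

```ruby
# URL patterns from this report, with the locale as a named placeholder.
WINDOWSPHONE_TEMPLATES = {
  "Terms of Service" =>
    "http://www.windowsphone.com/%{locale}/store/terms-of-service",
  "Privacy Statement" =>
    "http://www.windowsphone.com/%{locale}/legal/wp8/windows-phone-privacy-statement",
}

# Build a document-name => URL hash for one fully specified locale.
def windowsphone_urls(locale)
  WINDOWSPHONE_TEMPLATES.map do |name, template|
    [name, format(template, locale: locale)]
  end.to_h
end

windowsphone_urls("en-us").each { |name, url| puts "#{name}: #{url}" }
```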

Feature: add OpenGraph-metatags to site header

One may like or dislike the social media site that is promoting the OpenGraph API, but I would nevertheless recommend implementing those meta tags on tosback.org to make the site more appealing to share.

Above all, I would suggest implementing an og:image tag which links to the logo of the service (or possibly more appealing "marketing images"). As of now, logos of the services that are tested are displayed in the preview, which is rather misleading.
https://developers.facebook.com/docs/sharing/opengraph/object-properties

What is norecurse?

I often see <norecurse name="arbitrary" />, but I'm not sure what it does. Could we put a section in the README that explains it?

ambiguous path i don't understand

https://trello.com/security: Ambiguous match, found 2 elements matching visible xpath "//div[@class='layout-centered-content']"

but i can't see the problem when i execute

a=document.evaluate("//div[@class='layout-centered-content']", document, null, XPathResult.ANY_TYPE, null)
a.iterateNext()
a.iterateNext()

in the browser console on https://trello.com/legal/security
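
The wording of the message suggests that the crawler expected exactly one match and found two. Note also that the error names https://trello.com/security, while the console check was run on https://trello.com/legal/security, so the two pages may differ. The sketch below, using REXML on made-up markup, shows how an expression can match twice and how anchoring it to a parent narrows it to one; the ids header and main are hypothetical.

```ruby
require "rexml/document"

# Hypothetical markup in which two elements share the same class.
AMBIGUOUS_PAGE = <<~HTML
  <html><body>
    <div id="header"><div class="layout-centered-content">Navigation</div></div>
    <div id="main"><div class="layout-centered-content">Security policy text</div></div>
  </body></html>
HTML

# Count how many nodes an XPath expression selects in the markup.
def match_count(markup, xpath)
  REXML::XPath.match(REXML::Document.new(markup), xpath).size
end

loose = "//div[@class='layout-centered-content']"
anchored = "//div[@id='main']/div[@class='layout-centered-content']"

puts "#{loose}: #{match_count(AMBIGUOUS_PAGE, loose)} match(es)"
puts "#{anchored}: #{match_count(AMBIGUOUS_PAGE, anchored)} match(es)"
```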

Issue with mysql connection on tosback.org server?

The tosback.org server ran out of disk space. The git history of this repo was taking up 15 GB of the server's 20 GB disk, so I removed it by switching to a shallow git clone. The disk issue seems solved now, but I found that the script had stopped working.

When running it manually, I found it helps if I comment out https://github.com/tosdr/tosback2/blob/master/rubycode/main.rb#L12
I guess that will mean the git repo will still be updated, but the information on the web page will no longer get updated - I guess it uses mysql to communicate with https://github.com/tosdr/ToSBack3?

In any case, it's running now, I'll see if the website gets updated once the script is finished.

deal with different web properties that share the same terms

For instance when visiting cnet.com you agree to the cbsinteractive.com privacy policy. Should we crawl that as rules/cnet.com.xml or as rules/cbsinteractive.com.xml? I opted for the first now, but this issue deserves more attention.

Commit link error?

Am I the only one who gets a 404 error using the commit link in the daily email?

My July 15 link:
These were changed in last night's crawl. Please have a look at the commit at dcaa695?diff=split to see the differences!

tosback.org is down

cc @michielbdejong @JimmStout

Wrong PP for nike.com

The nike.com rule is incorrect now. It should point at https://agreementservice.svs.nike.com/us/en_us/rest/agreement?agreementType=privacyPolicy&uxId=com.nike.commerce.nikedotcom.web&country=US&language=en&requestType=redirect

tosback.org links

I noticed links to commits from tosback.org after August 30 are missing from github.

TOSBack didn't grab new twitter.com update

The Twitter TOS and Privacy Policy updated this week:

We have made revisions to these Terms of Service, effective May 25th, 2018. You can see the new Terms of Service here [https://twitter.com/tos#update]. By continuing to use our services after May 25th, you agree to the new Terms of Service. A summary of changes can be found on our Help Center [https://help.twitter.com/rules-and-policies/update-privacy-policy].

However the version in this repository hasn't been updated in 7 months: crawl_reviewed/twitter.com

The URL is the same: https://twitter.com/en/tos
But I think the xpath rule no longer is valid. (At least I couldn't find main-content from rules/twitter.com.xml when looking at the source.)

Would you expand the TOS list to include Instagram, Reddit, Xvideos, Pornhub and Baidu?

Just to ensure the Terms of Services of the top 10 websites in the world are covered.

Reference
TOSBACK Website
Top 100 Websites in the World
