
The Lumen Database collects and analyzes legal complaints and requests for removal of online materials.

Home Page: https://lumendatabase.org

License: GNU General Public License v2.0



Lumen Database

The Lumen Database collects and analyzes legal complaints and requests for removal of online materials, helping Internet users to know their rights and understand the law. These data enable us to study the prevalence of legal threats and let Internet users see the source of content removals.

Automated Submissions and Search Using the API

The main Lumen Database instance has an API that allows individuals and organizations that receive large numbers of notices to submit them without using the web interface. The API also provides an easy way for researchers to search the database. Members of the public can experiment with the API, but will need to request a key from the Lumen team to receive a token that grants full access. To learn about the capabilities of the API, consult the API documentation.
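For researchers scripting against the API, a search request looks roughly like the following sketch (Python standard library only). The term and authentication_token parameter names follow the public API documentation; the User-Agent string is a placeholder for your own project name.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://lumendatabase.org"

def build_search_url(term, token=None, page=1, per_page=10):
    """Assemble the notices search URL. `token` is the key issued by the
    Lumen team; without it, access is limited."""
    params = {"term": term, "page": page, "per_page": per_page}
    if token:
        params["authentication_token"] = token
    return f"{API_BASE}/notices/search.json?" + urllib.parse.urlencode(params)

def search_notices(term, token=None, **kwargs):
    """Fetch one page of search results as decoded JSON."""
    req = urllib.request.Request(
        build_search_url(term, token, **kwargs),
        headers={"User-Agent": "my-research-project"},  # identify your client
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The response is a JSON object with a notices array and a meta object describing pagination.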

Development

Stack

  • ruby 3.2.2
  • PostgreSQL 13.6
  • Elasticsearch 7.17.x
  • Java Runtime Environment (OpenJDK works fine)
  • Piwik Tracking (only used in prod)
  • Mail server (SMTP, Sendmail)
  • ChromeDriver (used only by test runner)

Using Docker

The easiest way to start is to use Docker. Make sure you have the Docker Engine and docker-compose installed.

Clone the repository.

cp config/database.yml.docker config/database.yml
cp .env.docker .env
docker-compose up
docker-compose exec website bash
rake db:drop db:create db:migrate
rake comfy:cms_seeds:import[lumen_cms,lumen_cms]
rake db:seed
bundle exec sidekiq &
rails s -b 0.0.0.0

Lumen will be available at http://localhost:8282.

Manual setup

By default, the app will try to connect to Elasticsearch on http://localhost:9200. If you want to use a different host, set the ELASTICSEARCH_URL environment variable.

bundle install
cp config/database.yml.example config/database.yml

(edit database.yml as you wish)
(ensure PostgreSQL and Elasticsearch are running)

rails db:setup
rails lumen:set_up_cms

Running the app

rails s

Viewing the app

$BROWSER 'http://localhost:3000'

You can customize behavior during seeding (db:setup) with a couple of environment variables:

  • NOTICE_COUNT=10 will generate 10 (or any number you pass it) notices instead of the default 500
  • SKIP_FAKE_DATA=1 will skip generating fake seed data entirely.

Sample user logins

The seed data creates logins of the following form:

Username: {username}@lumendatabase.org
Password: password

username is one of {user, submitter, redactor, publisher, admin, super_admin}, with corresponding privileges.

If you seeded your database with an older version of seeds.rb, your username may use chillingeffects.org rather than lumendatabase.org.

Running Tests

Many of the tests require all of the services that make up the Lumen stack to be running. For that reason, the easiest way to run tests is in a docker-compose environment:

$ docker-compose -f docker-compose.test.yml --env-file .env.test up
$ docker-compose exec -e RAILS_ENV=test website bash -c "bundle install && rake db:drop db:create db:migrate && rspec"

The integration tests are quite slow; for some development purposes you may find it more convenient to exclude them: rspec spec/ --exclude-pattern="spec/integration/*".

Parallelizing Tests

You can speed up tests by running them in parallel: $ rake parallel:spec

You will need to do some setup before the first time you run this:

  • alter config/database.yml so that the test database is yourproject_test<%= ENV['TEST_ENV_NUMBER'] %>
  • run rake parallel:setup

It will default to using the number of processors parallel_tests believes to be available, but you can change this by setting ENV['PARALLEL_TEST_PROCESSORS'] to the desired number.
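For example, assuming a Postgres adapter and a test database named lumendatabase_test (adjust both to your setup), the test stanza of config/database.yml would look like:

```yaml
# config/database.yml — test stanza for parallel_tests
# (adapter and database name are illustrative)
test:
  adapter: postgresql
  database: lumendatabase_test<%= ENV['TEST_ENV_NUMBER'] %>
```

parallel_tests leaves TEST_ENV_NUMBER blank for the first process and sets it to 2, 3, … for the rest, so each process gets its own database.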

Search Indexing

While the Elasticsearch integration with Rails makes indexing objects into the Elasticsearch index easy, it is untenably slow with millions of objects. We avoid this by bypassing Rails and indexing from the database straight into Elasticsearch using Logstash.

To run this indexing process, you'll need Logstash and the PostgreSQL JDBC driver, plus a Logstash configuration that reads from Postgres and writes to Elasticsearch. There is an example setup in script/search_indexing/ that includes two pipelines, one that indexes notices and one that indexes entities. Those examples are set up to run in Docker through docker-compose.

Once set up, run the indexing by pointing the logstash binary at your configuration file, e.g. bin/logstash -f logstash.conf.
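As a rough sketch of what such a configuration contains (the paths, credentials, index name, and SQL statement below are placeholders; the project's real pipelines live in script/search_indexing/):

```
# logstash.conf — Postgres → Elasticsearch pipeline (illustrative values)
input {
  jdbc {
    jdbc_driver_library => "/path/to/postgresql-jdbc.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/lumen"
    jdbc_user => "lumen"
    statement => "SELECT * FROM notices"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "notices"
    document_id => "%{id}"   # reuse the database primary key
  }
}
```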

Linting

Use rubocop and leave the code at least as clean as you found it. If you make linting-only changes, it's considerate to your code reviewer to keep them in their own commit.

Profiling

  • mini-profiler
    • available in dev by default
    • in use on prod, visible only to super_admins
    • in-depth memory profiling, stacktracing, and SQL queries; good for granular analysis
  • oink
    • memory usage, allocations
    • runs in dev by default; can run anywhere by setting ENV['USE_OINK'] (ok to run in production)
    • logs to log/oink.log

Environment variables

Here are all the environment variables Lumen recognizes; see the code for full documentation.

Environment variables should be set in .env and are managed by the dotenv gem. .env is not version-controlled so you can safely write secrets to it (but will also need to set these on all servers).

Unless you are setting a variable on the command line for a single command-line process, environment variables should ONLY be set in .env.

Most of these are optional and have sensible defaults (which may vary by environment).

Variable name Description
BATCH_SIZE Batch size of model items indexed during each run of Elasticsearch re-indexing
BUNDLE_GEMFILE Custom Gemfile location
BROWSER_VALIDATIONS Enable user HTML5 browser form validations
DEFAULT_SENDER Default mailer sender
ELASTICSEARCH_URL Elasticsearch host, e.g. https://127.0.0.1:9200
EMAIL_DOMAIN Default email domain in Action Mailer
ES_INDEX_SUFFIX Can be used to specify a suffix for the name of Elasticsearch indexes
FILE_NAME Name of CSV file to import as blog entries
GOOGLE_CUSTOM_BLOG_SEARCH_ID Custom Google search ID used in the CMS
LOG_ELASTICSEARCH Enables logging of Elasticsearch calls; only used in tests
LOG_TO_LOGSTASH_FORMAT Set to true if you want to log in the Logstash format
USE_OINK Enable the oink gem in the production environment
MAILER_DELIVERY_METHOD Sets the delivery method for emails sent by the application
NOTICE_COUNT How many fake notices to create when seeding the db
PROXY_CACHE_CLEAR_HEADER Name of a request header used to clear the cache on a proxy cache server like Varnish
PROXY_CACHE_CLEAR_SITE_HOST Needed just in development to reach the application from a Docker container
RACK_ENV Don't use this; it's overridden by RAILS_ENV
RAILS_ENV Rails environment
RAILS_LOG_LEVEL Log level for all the application loggers
RAILS_SERVE_STATIC_FILES If present (with any value) will enable Rails to serve static files
RECAPTCHA_SITE_KEY reCAPTCHA public key
RECAPTCHA_SECRET_KEY reCAPTCHA private key
RETURN_PATH Default mailer return path
SEARCH_SLEEP Used in specs only; timeout for Elasticsearch searches
SECRET_KEY_BASE The Rails secret token; required in prod
SERVER_TIME_ZONE Name of the server's timezone, e.g. Eastern Time (US & Canada)
SIDEKIQ_REDIS_URL Redis location used by Sidekiq
SITE_HOST Site host, used in mailer templates
SKIP_FAKE_DATA Don't generate fake data when seeding the database
SMTP_ADDRESS SMTP server address
SMTP_DOMAIN SMTP server domain
SMTP_USERNAME SMTP server username
SMTP_PASSWORD SMTP server password
SMTP_PORT SMTP server port
SMTP_VERIFY_MODE Value of the openssl_verify_mode option of the SMTP client
USER_CRON_EMAIL For use in sending reports of court order files; can be a string or a list (in a JSON.parse-able format)
USER_CRON_MAGIC_DIR Directory used in the court order reporter cron job
WEB_CONCURRENCY Number of Unicorn workers
WEB_TIMEOUT Unicorn timeout

Email setup

The application requires a mail server. In development it's best to use a local SMTP server that catches all outgoing email; Mailcatcher is a good option.

Blog custom search

The /blog_entries page can contain a Google custom search engine that searches the Lumen blog. To enable it, create a custom search engine restricted to the path the blog lives at, for instance https://www.lumendatabase.org/blog_entries/*. Extract the "cx" id from the JavaScript embed code and put it in the GOOGLE_CUSTOM_BLOG_SEARCH_ID environment variable. The blog search will appear once this variable has been configured.

Lumen API

You can search the database and, if you have a contributor token, add to the database using our API.

The Lumen API is documented in our GitHub Wiki: https://github.com/berkmancenter/lumendatabase/wiki/Lumen-API-Documentation

License

Lumen Database is licensed under GPLv2. See LICENSE.txt for more information.

Copyright

Copyright (c) 2016 President and Fellows of Harvard College

Contributors

apatel, bhaprayan, dependabot[bot], djcp, domenoth, frederichoule, hasegeli, hollandof, jdcc, jsdiaz, mzagaja, pbrisbin, peter-hank, ryanttb, sangsomjr, shubhscoder, siaw23-retired, southpolesteve, tanderson11, thatandromeda


lumendatabase's Issues

Meta dictionary not updating with requests

The Lumen API is working well except that the meta field returned always statically sets next_page=null and current_page=1. total_pages seems to be accurate, though. If I manually increment the page in the request string, I get the correct page (as far as I can tell, because the last one has fewer than per_page results, so I seem to be reaching the end), but the meta fields do not change. I believe I can still paginate correctly using the total_pages workaround, but I thought I'd let you know that I'm seeing the behaviour in the following output:

In [2]: fetch_lumen_notices(num_days=1) 
Lumen search URL is: https://lumendatabase.org/notices/search?topics=Copyright&per_page=50&page=1&sort_by=date_received+desc&recipient_name=Twitter&date_received_facet=1533080436733..1533253236733
next_page of pagination has value 1
50 notices returned from Lumen Call
End of meta dict was: { 'current_page': 1, 'next_page': None, 'offset': None, 'per_page': None, 'previous_page': None, 'total_entries': None, 'total_pages': 4}
---------
Lumen search URL is: https://lumendatabase.org/notices/search?topics=Copyright&per_page=50&page=2&sort_by=date_received+desc&recipient_name=Twitter&date_received_facet=1533080436733..1533253236733
next_page of pagination has value 2
50 notices returned from Lumen Call
End of meta dict was: { 'current_page': 1, 'next_page': None, 'offset': None, 'per_page': None, 'previous_page': None, 'total_entries': None, 'total_pages': 4}
---------
Lumen search URL is: https://lumendatabase.org/notices/search?topics=Copyright&per_page=50&page=3&sort_by=date_received+desc&recipient_name=Twitter&date_received_facet=1533080436733..1533253236733
next_page of pagination has value 3
50 notices returned from Lumen Call
End of meta dict was: { 'current_page': 1, 'next_page': None, 'offset': None, 'per_page': None, 'previous_page': None, 'total_entries': None, 'total_pages': 4}
Saved 0 lumen notices.
---------
Lumen search URL is: https://lumendatabase.org/notices/search?topics=Copyright&per_page=50&page=4&sort_by=date_received+desc&recipient_name=Twitter&date_received_facet=1533080436733..1533253236733
next_page of pagination has value 4
40 notices returned from Lumen Call
End of meta dict was: { 'current_page': 1, 'next_page': None, 'offset': None, 'per_page': None, 'previous_page': None, 'total_entries': None, 'total_pages': 4}
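Until the meta fields are fixed, a client can sidestep the broken next_page by driving pagination from total_pages, which the report above says is accurate. A minimal sketch, where fetch_page is a placeholder for whatever HTTP call you use:

```python
def pages_to_fetch(meta):
    """Derive the full page sequence from `total_pages`, ignoring the
    broken `next_page`/`current_page` fields."""
    return range(1, meta["total_pages"] + 1)

def fetch_all(fetch_page):
    """Collect notices from every page. `fetch_page(n)` must return the
    decoded JSON response for page n (placeholder for your HTTP client)."""
    first = fetch_page(1)
    notices = list(first["notices"])
    for page in pages_to_fetch(first["meta"]):
        if page == 1:
            continue  # already fetched
        notices.extend(fetch_page(page)["notices"])
    return notices
```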

"Report a Demand" Page

A few comments:

  • There's a discrepancy between language of "notices" vs. "demands" on the flutie site
  • Are we including any checks (e.g. CAPTCHA) against bots that could submit spam notices on the new CE site?
  • We list "Data Protection" as a type of notice, but I would go ahead and refer to "EU Right to be Forgotten" in that title

(screenshot: 2014-07-16, 3:27 PM)

Dev Environment Setup

I went through the process of setting up the app and ran into a few speed bumps. Putting them here for documentation:

  1. MySQL is in the Gemfile, but Postgres seems to be the primary database. Looks like this may be due to rake commands that import old notices. Not sure if this is still in use, but having both DB gems was initially causing some weird errors. I did have an old/slightly wonky MySQL setup, so this may be limited to my machine; reinstalling MySQL with Homebrew (OS X) fixed the problem.
  2. Setup script fails without a tmp directory. Fixed by #296.
  3. Setup fails when attempting to run database migrations. This is due to this line: https://github.com/berkmancenter/chillingeffects/blob/master/lib/validates_automatically.rb#L5. Calling .columns on ActiveRecord::Base results in a SQL call for column info, which is normally fine, except in the setup case where the column doesn't exist. A simple fix is to comment out https://github.com/berkmancenter/chillingeffects/blob/master/app/models/notice.rb#L8 before running migrations and uncomment it when starting the app. Long term, I would recommend refactoring to avoid the SQL query, or using a gem that works against the schema, such as https://github.com/SchemaPlus/schema_validations

New command please

I don't know if this is the right platform for requesting a new feature, but I'll dare to ask the developers. Some searches end with a huge number of senders, which frustrates me. I would like to suggest a query parameter that narrows results: "exclude_sender=sendername1;sendername2;sendername3;..." and so on. Is it possible?

TEMPORARY SERVER ERROR

Is your feature request related to a problem? Please describe.
I'm always frustrated when the server is super slow, and I often get this error:

(error screenshot attached)

Describe the solution you'd like
Please improve its performance.

Which branch to PR to?

Looking at the active branches, I see the rails4 branch is closed. Which branch can I send pull requests to?


data json has been changed

Hello,

Today I noticed the JSON data has been changed

from

    "infringing_urls": [
        {
            "url": "xxxxxxxx.com/xxxxxxxxx"
        }
    ]

to

    "infringing_urls": [
        {
            "domain": "xxxxxxxx.com",
            "count": 5
        }
    ]
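A consumer can stay compatible with both shapes by branching on the keys shown in the two samples above. A sketch (the field names are taken only from those samples):

```python
from urllib.parse import urlparse

def summarize_infringing_urls(entries):
    """Return {domain: count} for either the old per-URL shape
    ({"url": ...}) or the new aggregated shape ({"domain": ..., "count": ...})."""
    counts = {}
    for entry in entries:
        if "domain" in entry:  # new aggregated format
            counts[entry["domain"]] = counts.get(entry["domain"], 0) + entry["count"]
        else:  # old format: one record per URL
            url = entry["url"]
            if "://" not in url:
                url = "http://" + url  # urlparse needs a scheme to find the host
            domain = urlparse(url).netloc
            counts[domain] = counts.get(domain, 0) + 1
    return counts
```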

recaptcha request access not working many times

Hello Lumen Database team,

I saw some updates on the main site last night; that's great, and claimed links are more protected now.

But when I submit my email and complete the reCAPTCHA, I get a red alert: "Captcha verification failed, please try again."

get paged notices by entity_id from the API

Since the id field is exposed in the /entities/search JSON response, it would be nice if it were part of the searchable fields in the Notice model, so that a list of notices sent by a given entity can be obtained without performing a full-text search on sender_name (which is not very accurate in some cases).

Describe the solution you'd like

Add the entity_id field here

Describe alternatives you've considered

I tried the notices search by sender_name, but when the name is composed of several words the full-text search is not very accurate and also returns results for other entities.

Additional context

If the dev team is fine with it, I can just fork and send a PR for this.

Lumen RefineryCMS integration

Hi @mzagaja @ryanttb,
Is there a good proposal for Lumen this summer? If not, I'd like to take it up for my GSoC. There's also some upgrade work I would like to finish up (mostly removing redundant and unnecessary code). Please let me know :)

Search API Page Limit 20,000 or 10,000?

Thanks for an awesome and critical service. The documentation for the full text search API call indicates that there is a limit of 20,000:

Lumen is unable to return past the 20,000th result due to limitations in Elasticsearch. This means that querying deeper than page=2000 (with the default per_page of 10) will fail.

However a little bit of experimentation seems to indicate that the maximum number of results is 10,000?

>>> import requests
>>> headers = {'User-Agent': 'umd-lumen-testing'}
>>> params = {'authentication_token': 'mysupersecretkey', 'token': 'google'}
>>> response = requests.get('https://lumendatabase.org/notices/search.json', headers=headers, params=params)
>>> results = response.json()
>>> results['meta']['total_pages']
1000
>>> len(results['notices'])
10

Trying to fetch page 1001 seems to throw an Internal Server Error, so it seems like the documentation should be updated to indicate the limit is 10,000?
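Whichever cap is correct, a client can avoid the error by computing the deepest reachable page for its per_page setting. A small sketch assuming the observed 10,000-result window:

```python
RESULT_WINDOW = 10_000  # observed cap; the docs currently say 20,000

def max_page(per_page=10):
    """Deepest page that stays within the result window."""
    return RESULT_WINDOW // per_page

def clamp_page(page, per_page=10):
    """Clamp a requested page to the reachable range."""
    return min(page, max_page(per_page))
```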

Requests fails with 504 Gateway-Timeout when per_page > 15

Hi, I am using the Lumen API to search for copyright notices for music tracks, using a string consisting of the artist and song name. However, requests fail with a 504 status code most of the time, after 90 seconds, when the per_page request parameter is larger than roughly 15.
My request payload is the following (I am using a recent API researcher key):

payload = {'works': "It's Not My Time 3 Doors Down", 
           'works-require-all': True,
           'page': 1,
           'per_page': 16,
           'date_received_facet':'1325372400000..1401573600000',
           'sort_by': 'relevancy desc',
           'authentication_token':<api_key>}

I was wondering if this is expected behaviour as the search is pretty broad, or if there is something wrong. Thank you!
