webrecorder / browsertrix


Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

Home Page: https://browsertrix.com

License: GNU Affero General Public License v3.0

Dockerfile 0.06% Python 42.75% Shell 0.48% JavaScript 2.34% TypeScript 53.47% CSS 0.71% EJS 0.09% Jinja 0.10%
archiving cloud warc web-archive web-archiving webrecorder wacz kubernetes

browsertrix's Introduction

Conifer

Collect and revisit web pages.

Conifer provides an integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, for accessing archived content, and for sharing collections.

This repository represents the hosted service running at https://conifer.rhizome.org/, which can also be deployed locally using Docker.

This README refers to the 5.x version of Conifer, released in June 2020. This release includes a new UI and the renaming of Webrecorder.io to Conifer. Other parts of the open source effort remain at the Webrecorder Project. For more info about this momentous change, read our announcement blog post.

The previous UI is available on the legacy branch.

Frequently asked questions

  • If you have any questions about how to use Conifer, please see our User Guide.

  • If you have a question about your account on the hosted service (conifer.rhizome.org), please contact us via email at [email protected]

  • If you have a previous Conifer installation (version 3.x), see Migration Info for instructions on how to migrate to the latest version.

Using the Conifer Platform

Conifer and related tools are designed to make web archiving more portable and decentralized, as well as to serve users and developers with a broad range of skill levels and requirements. Here are a few ways that Conifer can be used (starting with what probably requires the least technical expertise).

1. Hosted Service

Using our hosted version of Conifer at https://conifer.rhizome.org/, users can sign up for a free account and create their own personal collections of web archives. Captured web content will be available online, either publicly or privately, under each user account, and can be downloaded by the account owner at any time. Downloaded web archives are available as WARC files. (WARC is the ISO standard file format for web archives.) The hosted service can also be used anonymously, and the captured content can be downloaded at the end of a temporary session.

2. Offline Capture and Browsing

The Webrecorder Project is a closely aligned effort that offers OSX/Windows/Linux Electron applications:

  • Webrecorder Player: browse WARCs created by Webrecorder (and other web archiving tools) locally on the desktop.
  • Webrecorder Desktop: a desktop version of the hosted Webrecorder service providing both capture and replay features.

3. Preconfigured Deployment

To deploy the full version of Conifer with Ansible on a Linux machine, the Conifer Deploy playbook can be used to install this repository and to configure nginx and other dependencies, such as SSL (via Let's Encrypt). The playbook is used for the https://conifer.rhizome.org deployment.

4. Full Conifer Local Deployment

The Conifer system in this repository can be deployed directly by following the instructions below. Conifer runs entirely in Docker and also requires Docker Compose.

5. Standalone Python Wayback (pywb) Deployment

Finally, for users interested in the core "replay system" and very basic recording capabilities, deploying pywb could also make sense. Conifer is built on top of pywb (Python Wayback/Python Web Archive Toolkit), and the core recording and replay functionality is provided by pywb as a standalone Python library. pywb comes with a Docker image as well.

pywb can be used to deploy your own web archive access service. See the full pywb reference manual for further information on using and deploying pywb.

Running Locally

Conifer can be run on any system that has Docker and Docker Compose installed. To install manually:

  1. git clone https://github.com/rhizome-conifer/conifer

  2. cd conifer; bash init-default.sh.

  3. docker-compose build

  4. docker-compose up -d

(The init-default.sh script is a convenience script that copies wr_sample.env to wr.env and creates keys for session encryption.)

Point your browser to http://localhost:8089/ to access the locally running Conifer instance.

(Note: you may see a maintenance message briefly while Conifer is starting up. Refresh the page after a few seconds to see the Conifer home page).

Installing Remote Browsers

Remote Browsers are standard browsers like Google Chrome and Mozilla Firefox, encapsulated in Docker containers. This feature allows Conifer to directly use fixed versions of browsers for capturing and accessing web archives, with a more direct connection to the live web and web archives. In many cases, remote browsers can improve the quality of web archives during capture and access. They can be "remote controlled" by users, are launched as needed, and use about the same amount of computing and memory resources as they would when running as regular desktop apps.

Remote Browsers are optional, and can be installed as needed.

Remote Browsers are just Docker images whose names start with oldweb-today/ and which are part of the oldweb-today organization on GitHub. Installing the browsers can be as simple as running docker pull on each browser image, as well as on the additional Docker images for the Remote Desktop system.

To install the Remote Desktop System and all of the officially supported Remote Browsers, run install-browsers.sh.
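For example, pulling a single browser image by hand might look like the following; the image name is illustrative only (following the oldweb-today/ prefix mentioned above), and install-browsers.sh remains the supported route:

docker pull oldweb-today/chrome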

Configuration

Conifer reads its configuration from two files: wr.env, and wr.yaml for less commonly changed system settings.

The wr.env file contains numerous deployment-specific customization options. In particular, the following options may be useful:

Host Names

By default, Conifer assumes it's running on localhost or a single domain, but on different ports for the application (the Conifer user interface) and content (material rendered from web archives). This is a security feature that prevents archived web sites from accessing, and possibly changing, Conifer's user interface, among other unwanted interactions.

To run Conifer on different domains, the APP_HOST and CONTENT_HOST environment variables should be set.

For best results, the two domains should be two subdomains, both with https enabled.

The SCHEME env var should also be set to SCHEME=https when deploying via https.
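For example, a two-subdomain https deployment might set the following in wr.env (the domain names here are placeholders):

APP_HOST=app.example.org
CONTENT_HOST=content.example.org
SCHEME=https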

Anonymous Mode

By default, Conifer disallows anonymous recording. To enable this feature, set ANON_DISABLED=false in the wr.env file and restart.

Note: Previously the default setting was anonymous recording enabled (ANON_DISABLED=false)

Storage

Conifer uses the ./data/ directory for local storage, or an external backend, currently supporting S3.

The DEFAULT_STORAGE option in wr.env configures storage, and can be set to DEFAULT_STORAGE=local or DEFAULT_STORAGE=s3.

Conifer uses a temporary storage directory for data while it is actively being captured, and for temporary collections. Data is moved into 'permanent' storage when the capture is completed or a temporary collection is imported into a user account.

The temporary storage directory is: WARCS_DIR=./data/warcs.

When using local storage, the permanent storage directory is STORAGE_DIR=./data/storage.

When using S3, the value of STORAGE_DIR is ignored and data is placed under S3_ROOT, which is an s3:// bucket URL.

Additional S3 authentication environment settings must also be set in wr.env or externally.
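A sketch of an S3-backed wr.env, assuming a hypothetical bucket and the standard AWS credential variable names (the credential variable names are an assumption, not taken from Conifer's documentation):

DEFAULT_STORAGE=s3
S3_ROOT=s3://my-conifer-bucket/conifer-storage/
AWS_ACCESS_KEY_ID=<access key>
AWS_SECRET_ACCESS_KEY=<secret key>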

All data related to Conifer that is not web archive data (WARC and CDXJ) is stored in the Redis instance, which persists data to ./data/dump.rdb. (See Conifer Architecture below.)

Email

Conifer can send confirmation and password-recovery emails. By default, a local SMTP server is run in Docker, but Conifer can be configured to use a remote server by changing the environment variables EMAIL_SMTP_URL and EMAIL_SMTP_SENDER.
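For example, to use an external SMTP server, the two variables could be set along these lines in wr.env; the values and exact URL format here are placeholders, not verified Conifer defaults:

EMAIL_SMTP_URL=smtp://user:password@smtp.example.org:587
EMAIL_SMTP_SENDER=conifer@example.org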

Frontend Options

The React frontend includes a number of additional options useful for debugging. Setting NODE_ENV=development will switch React to development mode with hot reloading on port 8096.

Additional frontend configuration can be found in frontend/src/config.js.

Administration tool

The script admin.py provides easy low-level management of users. Adding, modifying, or removing users can be done via the command line.

To interactively create a user:

docker exec -it app python -m webrecorder.admin -c

or programmatically add users by supplying the appropriate positional values:

docker exec -it app  python -m webrecorder.admin \
                -c <email> <username> <passwd> <role> '<full name>'

Other arguments:

  • -m modify a user
  • -d delete a user
  • -i create and send a new invite
  • -l list invited users
  • -b send backlogged invites

See docker exec -it app python -m webrecorder.admin --help for full details.

Restarting Conifer

When making changes to the Conifer backend app, running

docker-compose kill app; docker-compose up -d app

will stop and restart the container.

To integrate changes to the frontend app, either set NODE_ENV=development and use hot reloading, or, if running in production mode (NODE_ENV=production), run

docker-compose kill frontend; docker-compose up -d frontend

To fully recreate Conifer, deleting old containers (but not the data!), use the ./recreate.sh script.

Conifer Architecture

This repository contains the Docker Compose setup for Conifer, and is the exact system deployed on https://conifer.rhizome.org. The full setup consists of the following components:

  • /app - The Conifer backend system includes the API, recording and WARC access layers, split into 3 containers:
    • app -- The API, data model, and rewriting system are found in this container.
    • recorder -- The WARC writer is found in this container.
    • warcserver -- The WARC loading and lookup is found in this container.

The backend containers run different tools from pywb, the core web archive replay toolkit library.

  • /frontend - A React-based frontend application running in Node.js. The frontend is a modern interface for Conifer and uses the backend API. All user access goes through the frontend (after nginx).

  • /nginx - A custom nginx deployment to provide routing and caching.

  • redis - A Redis instance that stores all of the Conifer state (other than WARC and CDXJ).

  • dat-share - An experimental component for sharing collections via the Dat protocol

  • shepherd - An instance of OldWebToday Browser Shepherd for managing remote browsers.

  • mailserver - A simple SMTP mail server for sending user account management mail

  • behaviors - Custom automation behaviors

  • browsertrix - Automated crawling system

Dependencies

Conifer is built using both Python (for the backend) and Node.js (for the frontend), with a variety of open source Python and Node libraries.

Conifer relies on a few separate repositories in this organization. The remote browser system uses repositories from https://github.com/oldweb-today/.

Contact

Conifer is a project of Rhizome, made possible with generous past support from the Andrew W. Mellon Foundation.

For more info on using Conifer, you can consult our user guide at: https://guide.conifer.rhizome.org

For any general questions or concerns regarding the project or https://conifer.rhizome.org, please contact us.

License

Conifer is licensed under the Apache 2.0 license. See NOTICE and LICENSE for details.

browsertrix's People

Contributors

anjackson, atomotic, bgrins, chickensoupwithrice, dependabot[bot], edsu, emma-sg, fservida, ikreymer, kayiwa, lasztoth, leepro, schmoaaaaah, shrinks99, stavares843, suayoo, tw4l, vnznznz, white-gecko, wvengen


browsertrix's Issues

Create Terms of Service

There should be a ToS to present to users who are signing up, whether via self-registration or an invite. The actual details of the ToS, and whether it needs to be explicitly or implicitly accepted, are TBD.

Crawl config detail view enhancements

  • Show created by user name instead of ID
    • API returns a user object with id and name (@ikreymer)
    • Update frontend (@SuaYoo )
  • Show created at date (@ikreymer and @SuaYoo)
  • Make entire template card clickable (@SuaYoo)
  • Rename template when duplicating to `[name] Copy` or `[name] 2` (@SuaYoo)
  • Change runNow UI from switch to checkbox (@SuaYoo)
  • (Needs investigation) Show schedule in browser local time (@SuaYoo)
  • (Needs discussion) Limit JSON shown only to user specified fields

Crawl List + Detail View UI

Crawl List UI should include:

  • Crawl Start
  • Crawl End Time, if any
  • Crawl State
  • Link to Crawl Config
  • started manually or not (#105)

For finished crawls:

  • Number of files crawled (but not the file names, details only)
  • Duration of crawl and total size of all the files, displayed e.g. as 10 MB (2 files)

For running crawls:

  • Option to watch crawl, or only on detail page?

  • Crawl stats?

  • Option to cancel (stop and discard) or stop (stop and keep what was crawled), or only on detail?

  • Sorting by: Start, End, State, Config ID

  • Filter by: config id?

Detail View - Finished Crawl:

  • Includes list of files and links to download each file (needs backend support)
  • Link to replay - shown inline in page? (needs backend support)

Detail View - Running Crawl

  • Includes option to watch - will show an inline watch iframe? (need backend support)

Add 'Include External Links' checkbox

An extra checkbox in the crawl configuration simple view, below the scope type, which, if checked, would set "extraHops": 1 in the config, and otherwise "extraHops": 0.
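For example, with the checkbox checked, the generated config fragment might look like the following sketch; the surrounding scopeType key is an assumption based on the scope-type field mentioned above, only extraHops is specified by this issue:

{
  "config": {
    "scopeType": "prefix",
    "extraHops": 1
  }
}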

Maybe it should be called: Include all links to external pages?

Also, perhaps the scope type field should be renamed to Crawl Scope.

Additional API fields for crawls and crawlconfigs

  • Return crawl template name in /crawls and /crawls/:id response data to show in UI
  • Return user name in /crawls, /crawls/:id and /crawlconfigs/:id
  • Return created at date in /crawlconfigs and /crawlconfigs/:id
  • For crawl list, return fileSize and fileCount for each crawl instead of files array (drop files field)
  • Flatten crawls list to single crawls list instead of running and finished

e.g.

type Crawl = {
  id: string;
  user: string;
  // user: { id: string; name: string; }
  // or
  // userName: string;
  aid: string;
  cid: string;
  // crawlconfig: { id: string; name: string; }
  // or
  // crawlconfigName: string;
};
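A flattened crawl list response could then look roughly like the sketch below; only the fields requested in the list above are grounded in this issue, and the remaining names are placeholders.

type CrawlListItem = {
  id: string;
  userName: string;    // resolved user name instead of a bare user ID
  configName: string;  // crawl template name
  state: string;
  fileSize: number;    // total size of all files, replacing the files array
  fileCount: number;
};

type CrawlListResponse = {
  crawls: CrawlListItem[];  // single flattened list, no running/finished split
};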

Allow archive admins to invite users

Add ability for a super-admin to invite select users to register.

  • Show archive members
  • Show invite form with email in archive
  • Sign up with accept invitation

Crawls API

Running Crawls

  • List Running Crawls
  • Stop Running Crawl w/ Save
  • Stop Running Crawl and Discard

Crawl Configs

  • Update Existing Config
  • Run Now from Existing Config

Pausing / Scaling Crawls

Support increasing/decreasing number of pods running on a crawl.
Requires:

  • Generate Crawl ID separately, not based on job/docker container id
  • Use Shared Redis for Crawl, instead of local one in Browsertrix Crawler (for how long??)
  • Crawl doc supports multiple file entries instead of just one
  • Decide which approach to take, 1 or 2:

1. Scale via Pause and Restart

To scale:

  1. job stopped gracefully, WACZ written
  2. crawl doc set to 'partial_complete', files added for each completed pod.
  3. restart with same crawl id, shared redis state.
  4. final pod adds final WACZ, sets crawl state to 'complete'

Pros:

  • Maintain one job per crawl at a time.
  • K8s takes care of parallelism, works with cron job.

Cons:

  • Scaling up or down requires stopping job, restarting with more pods.
  • harder to support via Docker only

2. Add more jobs / remove jobs to scale

To scale up:

  • New job added with the crawl id of the existing Redis state.

To scale down:

  • One or more existing jobs stopped (graceful stop)
  • crawl doc updated with new WACZ and 'partial_complete'

Pros:

  • Scaling up and down without any interruption
  • Can be implemented in similar way w/o K8S

Cons:

  • multiple jobs per crawl
  • unclear how to handle cronjobs

Storage Refactor

  • Add default vs custom (s3) storage
  • K8S: All storages correspond to secrets
  • K8S: Default storages initialized via Helm
  • K8S: Custom storage results in custom secret (per archive)
  • K8S: Don't add secret per crawl config
  • API for changing storage per archive
  • Docker: default storage just hard-coded from env vars (only one for now)
  • Validate custom storage via aiobotocore before confirming
  • Data Model: remove usage from users
  • Data Model: support adding multiple files per crawl for parallel crawls
  • Data Model: track completions for parallel crawls
  • Data Model: initial support for tags per crawl, add collection as 'coll' tag

Create crawl config enhancements

  • Update section order to be Basic (name) -> Pages (seeds) -> Scheduling
  • Default time to current hour instead of midnight
  • Show toast linking to new crawl

Crawl Config Views

The list view crawl config should show:

  • Current Schedule
  • Option to Edit Schedule
  • Display time of last finished crawl with link to the crawl, if any
  • Run Now option to start a new crawl, if allowed (see below)
  • Option to Duplicate a Crawl Config, create a new config with seed list of previous one (for later).

The crawl config detail view can show:

  • Schedule editing
  • Seed List (read only) and/or raw JSON (for both standard and advanced configs?)

Backend:

  • Need a check to see if the last crawl is currently running, to prevent starting a new crawl
  • API should include a bool indicating whether a new crawl can be started.

Some questions:

  • Do we want to show the raw JSON, even for standard-created configs, and/or only for advanced ones?
  • Do we need to track whether a config was a 'standard' config created via the UI or a custom config created through raw JSON, e.g. for duplication? Or can we detect this heuristically (e.g. if the JSON has more properties than the standard config supports, use the JSON view)?

Create Crawl Configuration/Template/Definition UI

This screen will produce a JSON that is then passed to the crawl config creation API endpoint.

The format includes a top-level dictionary with Browsertrix Cloud-specific options, and a config dictionary, which corresponds to the Browsertrix Crawler config.

The format is:

{
  "schedule": "",
  "runNow": false,
  "colls": [],
  "crawlTimeout": 0,
  "parallel": 1,
  "config": {...}
}

The key properties to include are:

  • run now, a checkbox to start a crawl instantly.
  • schedule, a way to specify a schedule in cron-style format (though the UI can be simpler, e.g. a day and time plus an option like daily, weekly, monthly, etc.)
  • Time Limit in seconds (mostly helpful for testing, though not strictly required)

The actual crawl configuration, the config property, is what is passed to browsertrix-crawler. It can be edited in either:

  • an Advanced view, where JSON can be pasted, or
  • a Simplified view that includes a subset of properties, maybe starting with:
    • a seed list, containing:
      • URL
      • Scope Type (page, page-spa, prefix, host, any)
    • other properties:
      • limit, the total number of pages to crawl

For the seed list, the input might be:

  • Text area with one URL per line plus a scope type, which are then added to the seed list. This would support pasting in a bunch of URLs with a specified scope.

The supported properties in the 'simplified view' will likely continue to evolve, but the advanced view is also available for pasting a custom config.
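Putting these pieces together, a filled-in configuration might look roughly like the following sketch. The seeds, url, scopeType, and limit field names are assumptions based on the property descriptions above rather than a verified Browsertrix Crawler schema; the schedule uses standard cron syntax (every Monday at 08:00) and crawlTimeout is in seconds.

{
  "schedule": "0 8 * * 1",
  "runNow": false,
  "colls": [],
  "crawlTimeout": 3600,
  "parallel": 1,
  "config": {
    "seeds": [
      { "url": "https://example.com/", "scopeType": "prefix" }
    ],
    "limit": 100
  }
}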

(Low priority) Invites generate additional verification email

Not a high priority, but would be better UX to not generate a verification email for users that sign up from an invite.

To reproduce (on main):

  1. Run frontend app with yarn start-dev, log in
  2. Choose an archive and click "Members"
  3. Click "Add Member" and finish inviting an email address you can access
  4. Log out
  5. Find invite in your inbox, replace remote URL in link with http://localhost:9870 and visit
  6. Complete sign up. After a few minutes you'll get a verification email

Crawl Configurations UI

The crawl configuration UI will include a way to create new crawl configurations, list existing crawl configurations, and delete crawl configurations.

  • Create Crawl Configurations
  • List Existing Crawl Configurations
  • Delete Crawl Configurations

Registering a user email that is already registered shows incorrect information.

When registering with an e-mail that is already registered, the backend returns 400 with {detail: "REGISTER_USER_ALREADY_EXISTS"}.

The frontend should display this error message and perhaps offer a link to login page and/or forgot password page?

(Currently, it attempts to log in with the new credentials, which also errors out, but then still displays a 'Successfully signed up' message.)
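A minimal sketch of the desired frontend behavior, in TypeScript; the endpoint path and message wording are assumptions, not the actual Browsertrix code:

async function handleRegister(email: string, password: string): Promise<string> {
  // Assumed registration endpoint; adjust to the real API route.
  const resp = await fetch("/api/auth/register", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email, password }),
  });
  if (resp.status === 400) {
    const { detail } = await resp.json();
    if (detail === "REGISTER_USER_ALREADY_EXISTS") {
      // Surface the error instead of attempting to log in,
      // and offer links to the login / forgot-password pages.
      return "An account with this email already exists.";
    }
  }
  return "Successfully signed up.";
}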

`btrix-log-in` update warning

Fix source of warning on login page: Element btrix-log-in scheduled an update (generally because a property was set) after an update completed, causing a new update to be scheduled. This is inefficient and should be avoided unless the next update can only be scheduled as a side effect of the previous update. See https://lit.dev/msg/change-in-update for more information.

GET crawl config by ID returns archive instead of crawl config

The /archives/{aid}/crawlconfigs/{cid} endpoint seems to return the parent archive instead of the crawl config. Tested in the docs UI.

Example curl request:

curl -X 'GET' \
  'https://btrix-dev.webrecorder.net/api/archives/0146a76e-b4fe-498d-a6df-0e8be8858dd1/crawlconfigs/a9ba1884-b9a3-4ec5-9210-f8dd8501cde9' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiZTY3NmY2MWMtMjUxZi00ZDU2LWEwNzYtNTNmZTM1YzM4YmE0IiwiYXVkIjpbImZhc3RhcGktdXNlcnM6YXV0aCJdLCJleHAiOjE2NDI5MDIxNzB9.K4EOY1NftfDO9fH9Ti8sudR3FRrBCXS5_c72YikEz4Y'

Returns:

{
  "id": "0146a76e-b4fe-498d-a6df-0e8be8858dd1",
  "name": "sua dev's Archive",
  "users": {
    "e676f61c-251f-4d56-a076-53fe35c38ba4": 40
  },
  "storage": {
    "name": "default",
    "path": "0146a76e-b4fe-498d-a6df-0e8be8858dd1/"
  },
  "usage": {
    "2022-01": 3670
  }
}

User verification stuck on spinner

Now that verification emails are being sent, it seems the verification is getting stuck somewhere.

Repro Steps

  1. Register new account, receive registration e-mail with verification token
  2. Load http://localhost:9870/verify?token=<token>
  3. Observe the POST request is successful 200, but page stuck on spinner.
  4. Refreshing the page returns 'Something is wrong' when the backend returns a 400 with {detail: "VERIFY_USER_ALREADY_VERIFIED"} - this should also be a recognized error message.

Expected:

  • The page displays a message that the address has been verified.
  • If already verified, display appropriate error message

Question: what should happen if the user is not logged in? I guess this should not log in the user, only verify the user, right?
It looks like the backend does not require the auth token to be passed for verification, so the user would get verified no matter what, even if not logged in.

[Product Design] Decide how/if users should be able to create multiple archives

Currently, users either start with one archive/organization when they join, or are invited to an existing archive/organization.

The API currently allows multiple archives to be created per user, but the UI does not yet.

Should decide if:

  • Can users create an unlimited number of new archives, or only certain users?
  • Do users get one archive that they 'own', but can be members of as many other archives as they're invited to?
  • Different options for different user roles (may be too complicated)

Accept invite endpoint returns 400 on success

To reproduce (on main):

  1. Run frontend app with yarn start-dev, log in
  2. Choose an archive and click "Members"
  3. Click "Add Member" and finish inviting an email address you can access
  4. Log out
  5. Find invite in your inbox, replace remote URL in link with http://localhost:9870 and visit
  6. Complete sign up
  7. Click "Accept Invite". You should see an "Invalid invite" error. However, if you go to http://localhost:9870/archives, you'll see the invite that you were added to.

Show sign up confirmation message

Per Discord convo, add a notification on sign-up success as a first pass at the onboarding flow. Show a message like "Welcome to Browsertrix Cloud. A confirmation email has been sent to the e-mail address you specified."

Crawl config validation

  • Server-side validation for POST/PATCH
  • Client-side validation for both form and JSON editor
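A minimal client-side sketch for the JSON editor path, in TypeScript; the checked fields follow the crawl config format shown earlier, and this is not the actual Browsertrix validation code:

function validateCrawlConfigJson(raw: string): string[] {
  const errors: string[] = [];
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["Config is not valid JSON"];
  }
  if (typeof parsed.config !== "object" || parsed.config === null) {
    errors.push("Missing required 'config' object");
  }
  if (parsed.runNow !== undefined && typeof parsed.runNow !== "boolean") {
    errors.push("'runNow' must be true or false");
  }
  if (parsed.crawlTimeout !== undefined && typeof parsed.crawlTimeout !== "number") {
    errors.push("'crawlTimeout' must be a number of seconds");
  }
  return errors; // an empty array means the config passed these basic checks
}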
