lesleyodu / webarchives-changetext-search

A change text search interface for web archives

License: GNU General Public License v3.0

Java 16.09% Python 56.18% PHP 24.66% CSS 1.88% JavaScript 0.99% Batchfile 0.19%

Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives

Webpages change over time, and web archives hold copies of historical versions of webpages. Users of web archives, such as journalists, want to find and view changes on webpages over time. However, the current search interfaces for web archives do not support this task. For the web archives that include a full-text search feature, multiple versions of the same webpage that match the search query are shown individually without enumerating changes, or are grouped together in a way that hides changes. We present a change text search engine that allows users to find changes in webpages. We describe the implementation of the search engine backend and frontend, including a tool that allows users to view the changes between two webpage versions in context as an animation. We evaluate the search engine with U.S. federal environmental webpages that changed between 2016 and 2020. The change text search results page can clearly show when terms and phrases were added or removed from webpages. The inverted index can also be queried to identify salient and frequently deleted terms in a corpus.
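To make the idea of "terms added or removed" concrete, here is a minimal sketch of a term-level comparison between two versions of a page. It is an illustration only and not the repository's Lucene implementation: the function names `tokenize` and `term_changes` are hypothetical, and the tokenizer is a simple lowercase word splitter rather than a Lucene analyzer.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_changes(old_text, new_text):
    """Return (added, removed) term sets between two page versions.

    Hypothetical helper: a term counts as added if it appears only in the
    newer version, and as removed if it appears only in the older version.
    """
    old_terms = set(tokenize(old_text))
    new_terms = set(tokenize(new_text))
    return new_terms - old_terms, old_terms - new_terms


# Two made-up versions of the same sentence, for illustration only.
older = "Climate change affects coastal communities."
newer = "Weather variability affects coastal communities."

added, removed = term_changes(older, newer)
print(sorted(added))    # ['variability', 'weather']
print(sorted(removed))  # ['change', 'climate']
```

A set difference ignores term positions and repeat counts, so it can only say that a term appeared or disappeared somewhere on the page; phrase-level changes would need position information.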

Overview

This system runs on top of a Solr/SolrWayback instance.

The system is organized into levels, each kept in its own folder of this repository:

  • Level 2 - Lucene change text calculations - in the lucene-validity-range folder
  • Level 2 - Solr indexing configuration - in the solr folder
  • Level 3 - Search interface - in the solarium-ui folder
  • Level 3 - Animated difference - in the web-diff-animation folder
  • Startup scripts with port numbers - in the startup-bat folder
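Because the system runs on top of Solr, a change-text query can in principle be sent to Solr's standard select handler. The sketch below only builds the request URL. The core name `changetext` and the field name `deleted_text` are assumptions made for illustration; the actual names are defined by the configuration in the solr folder, and the port depends on the startup scripts.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, field, phrase, rows=10):
    """Build a URL for Solr's /select handler that searches one field
    for an exact phrase and asks for a JSON response."""
    params = {
        "q": f'{field}:"{phrase}"',  # quoted so Solr treats it as a phrase
        "wt": "json",
        "rows": rows,
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical core and field names; 8983 is Solr's default port.
url = build_select_url(
    "http://localhost:8983/solr", "changetext", "deleted_text", "climate change"
)
print(url)
```

The URL can then be fetched with any HTTP client; the sketch stops short of sending the request so that it does not depend on a running Solr instance.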

Data

  • Level 2 - the change text calculations for the 1,000 paired memento EDGI subset (text and json) and the top 100 terms, along with a csv export of all URLs indexed in Solr
  • Status code calculations for the 30,000 unpaired mementos in the original EDGI dataset
  • Additional data may be located at cs895-f22
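As a rough illustration of how a list of top terms could be derived from per-pair change text, the sketch below counts in how many memento pairs each deleted term occurs. The input layout (one list of deleted terms per pair) and the function name `top_deleted_terms` are assumptions; the data files in this repository may be structured differently, and the paper obtains such terms by querying the inverted index.

```python
from collections import Counter


def top_deleted_terms(deleted_by_pair, n=100):
    """Rank terms by the number of memento pairs in which they were deleted.

    deleted_by_pair: iterable of term lists, one list per memento pair
    (a hypothetical layout chosen for this example).
    """
    counts = Counter()
    for terms in deleted_by_pair:
        counts.update(set(terms))  # count each term at most once per pair
    return counts.most_common(n)


# Made-up sample data for three memento pairs.
sample = [
    ["climate", "change", "climate"],
    ["climate", "carbon"],
    ["change", "climate"],
]
top = top_deleted_terms(sample, n=2)
print(top)  # [('climate', 3), ('change', 2)]
```

Counting pairs rather than raw occurrences keeps a single long page from dominating the ranking.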

References

Lesley Frew, Michael L. Nelson, and Michele C. Weigle, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 2023. Best Student Paper Award. (arXiv pre-print, slides)

Animation demo blog entries:
