epadd / epadd

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.

Home Page: https://www.epaddproject.org

Java 75.25% Python 0.08% Shell 0.06% HTML 0.66% CSS 1.82% JavaScript 11.99% Batchfile 0.01% SCSS 6.07% Less 4.06%

epadd's Introduction

ePADD (Email: Process, Appraise, Discover, Deliver)

https://library.stanford.edu/projects/epadd

Introduction

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.

The software is comprised of four modules:

Appraisal: Allows donors, dealers, and curators to easily gather and review email archives prior to transferring those files to an archival repository.

Processing: Provides archivists with the means to arrange and describe email archives.

Discovery: Provides the tools for repositories to remotely share a redacted view of their email archives with users through a web server discovery environment. (Note that this module is downloaded separately.)

Delivery: Enables archival repositories to provide moderated full-text access to unrestricted email archives within a reading room environment.

System Requirements

  • OS: Windows 7 SP1 / 10, Mac OS X 10.12 / 10.13, Ubuntu 16.04
  • Memory: 8 GB RAM (4 GB RAM allocated to the application by default)
  • Browser: Chrome 64 or later, Firefox 58 or later
  • Windows installations: Java Runtime Environment 11 or later required

Installation

(Note that a full installation and user guide is also available; see the documentation section below.)

ePADD has been tested on and optimized for Windows 7 SP1 / 10, Mac OS X 10.13 / 10.14, and Ubuntu 16.04. Please follow the instructions below for your operating system.

Installing ePADD on Windows

Please download the latest ePADD distribution files (.exe) from https://github.com/ePADD/epadd/releases/. You will need to have Java 11 or later installed on your machine for ePADD to work properly.

The first time you run ePADD, it creates a directory for the Appraisal Module to store working files. On each subsequent start, ePADD checks this directory and relies upon it to resume earlier work; if the directory cannot be found, ePADD creates it again. The ePADD Appraisal Module directory is located at C:\Users\<username>\epadd-appraisal.

Depending upon your network permissions, you may be asked to allow ePADD access to your internet connection. Certain functionality (for instance, downloading email from an email account using the IMAP protocol, or communicating with DBPedia to generate relevance rankings for authority reconciliation) requires an internet connection. Upon running ePADD, the application icon will appear in the Windows Taskbar. Right-click on this icon at any point to open an ePADD window or to quit ePADD.

Note: ePADD allocates 4096 MB RAM to the application by default. If the standard RAM allocated does not suffice, you may wish to run the Java application directly from the command line (epadd-standalone.jar). From the Command Prompt, you can run the application using this command: java -Xmx#g -jar epadd-standalone.jar, where # identifies the amount of RAM (in GB) you wish to allocate.
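When experimenting with the -Xmx option described above, it can help to confirm how much heap the JVM actually received. The following is a minimal sketch, not part of ePADD; the class name HeapCheck is hypothetical, and it only uses the standard Runtime.maxMemory() call.

```java
// Hypothetical helper, not part of ePADD: reports the JVM's maximum heap size.
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in whole megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

With Java 11 or later, the file can be run directly with the same flag, for example java -Xmx8g HeapCheck.java. The reported figure may be slightly lower than the value requested, since the JVM reserves part of the heap for its own use.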

Note: The Discovery Module is run through a separate distribution file accessible via https://github.com/ePADD/epadd/releases/. Please see the installation and user guide for more information about the Discovery Module.

Installing ePADD on OSX

Please download the latest ePADD distribution files (.dmg) from https://github.com/ePADD/epadd/releases/. The first time you run ePADD, it creates a directory for the Appraisal Module to store working files. On each subsequent start, ePADD checks this directory and relies upon it to resume earlier work; if the directory cannot be found, ePADD creates it again. The ePADD Appraisal Module directory is located at Macintosh HD/Users/<username>/epadd-appraisal.

Depending upon your network permissions, you may be asked to allow ePADD access to your internet connection. Certain functionality (for instance, downloading email from an email account using the IMAP protocol, or communicating with DBPedia to generate relevance rankings for authority reconciliation) requires an internet connection. In Mac OSX, the application icon will appear in the OSX Finder Toolbar. Right-click on this icon at any point to open an ePADD window or to quit ePADD.

Note: ePADD allocates 4096 MB RAM to the application by default. If the standard RAM allocated does not suffice, you may wish to run the Java application directly from the command line (epadd-standalone.jar). From the Terminal, you can run the application using this command: java -Xmx#g -jar epadd-standalone.jar, where # identifies the amount of RAM (in GB) you wish to allocate.

Note: The Discovery Module is run through a separate distribution file accessible via https://github.com/ePADD/epadd/releases/. Please see the installation and user guide for more information about the Discovery Module.

ePADD Documentation, Help, and other Information

More information about the software and related developments, including links to the full installation and user guide, as well as a link to the community forums, can be found via our website (https://library.stanford.edu/projects/epadd/).

License(s)

The ePADD logo, project documentation (including the installation and user guide), and other non-software products of the ePADD team are subject to the Creative Commons Attribution 2.0 Generic license (CC BY 2.0).

Unless otherwise indicated, software items in this repository are distributed under the terms of the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Funding:

  • Email Archives: Building Capacity and Community, Andrew W. Mellon Foundation (ePADD Phase 4);
  • Andrew W. Mellon Foundation (ePADD Phase 3);
  • David C. Weber Librarian's Research Fund from the Stanford University Libraries (ePADD Phase 3);
  • Institute for Museum and Library Services (ePADD Phase 2);
  • National Historical Publications & Records Commission (ePADD Phase 1);
  • Payton J. Treat Fund from the Stanford University Libraries (ePADD Phase 1);
  • U.S. National Science Foundation (Muse);
  • Mobisocial Laboratory at Stanford University (Muse)

Software:

Under Apache License 2.0:

  • Apache Commons (fileupload, lang3, io, httpclient, cli, codec), tika, opennlp, tomcat, maven, ant © Apache Software Foundation
  • Muse © Mobisocial Laboratory, Stanford University
  • Google Gson © Google
  • OpenCSV © Glen Smith and others
  • Java SE Platform Products © Oracle - Oracle BCL
  • Launch4j © Grzegorz Kowal BSD 3-clause license and MIT license
  • Font Awesome © Dave Gandy MIT license, SIL OFL 1.1
  • Jquery-autocomplete © Tomas Kirda under MIT-style license.
  • Jsoup

LIBSVM Copyright (c) 2000-2014 Chih-Chung Chang and Chih-Jen Lin All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mstor Copyright (c) 2007, Ben Fortuna All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of Ben Fortuna nor the names of any other contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Credits

Research and Development:

  • Sudheendra Hangal, Ashoka University and Amuse Labs;
  • Chinmay Narayan, Amuse Labs;
  • Vihari Piratla, Amuse Labs;
  • Sit Manovit, iXora Inc.;
  • Peter Chan, Stanford University Libraries;
  • Sally DeBauche, Stanford University Libraries;
  • Glynn Edwards, Stanford University Libraries;
  • Josh Schneider, Stanford University Libraries

Design:

  • Saumya Sarangi, Lollypop Design;
  • Mandeep RJ, Lollypop Design

Advisory Board and Collaboration (Phase 3):

  • Stephen Abrams, Harvard University;
  • Patricia Patterson, Harvard University;
  • Ian Gifford, University of Manchester;
  • Jessica Smith, University of Manchester;
  • Jochen Farwer, University of Manchester

Testing and Collaboration (Phase 2):

  • Elvia Arroyo-Ramirez, University of California, Irvine;
  • Laura Uglean Jackson, University of California, Irvine;
  • Skip Kendall, Harvard University;
  • Margo Padilla, Metropolitan New York Library Council (METRO);
  • Christopher Prom, University of Illinois at Urbana-Champaign;
  • Audra Eagle Yun, University of California, Irvine

Advisory Board (Phase 2):

  • Sherri Berger, California Digital Library;
  • Andrew Byers, Duke University;
  • Jackie Dooley, OCLC Research;
  • Mike Giarlo, Stanford University;
  • Marie Hicks, Illinois Institute of Technology;
  • Peter Hirtle, Cornell University;
  • Lise Jaillant, Loughborough University;
  • Jeremy Leighton John, British Library;
  • Cal Lee, University of North Carolina, Chapel Hill;
  • Evelyn McLellan, Artefactual;
  • Maria Matienzo, Digital Public Library of America;
  • T. Christian Miller, ProPublica;
  • Jessica Moran, National Library of New Zealand;
  • David Rosenthal, Stanford University;
  • Marc A. Smith, Social Media Research Foundation;
  • Terry Winograd, Stanford University;
  • Kam Woods, University of North Carolina, Chapel Hill

Testing and Collaboration (Phase 1):

  • Donald Mennerich, New York University Libraries;
  • Susan Thomas, Bodleian Library, Oxford University;
  • Riccardo Ferrante and Lynda Schmitz Fuhrig, Smithsonian Institution Archives;
  • Terry Catapano, Stephen Davis, and Dina Sokolova, Columbia University

Advisory Board (Phase 1):

  • Jeremy Leighton John, British Library;
  • Monica S. Lam, Stanford University;
  • Phillip R. Malone, Stanford University;
  • Pam Maples, Stanford University;
  • Meg McAleer, Library of Congress;
  • Chris Prom, University of Illinois;
  • Ben Shneiderman, University of Maryland;
  • Jeff Ubois, MacArthur Foundation

Initial specifications (requirements and wireframes):

  • Daniel Hartwig, Stanford University Libraries;
  • Daniel Jarvis, Hoover Institution, Stanford;
  • Lisa Miller, Hoover Institution, Stanford;
  • Aimee Morgan, formerly Stanford University Libraries;
  • Laura O'Hara, formerly SLAC National Accelerator Laboratory

epadd's People

Contributors

ajlouie, awoods, chinuhub, dependabot[bot], hangal, jfarwer, jlohani, joshschneider, peterchanws, sallydebauche, shatu, vihari

epadd's Issues

Re-organize Browse Page

The Location entity will include all fine-grained entity types related to location.
The Organization entity will be renamed to Other Entities.
The order of the boxes has been changed as well.

epadd - browse order update - 2016 jan 29

Clean up addressbook

The addressbook for TW is unacceptable. Look at what it takes to fix it.

Avoid merging based on:

  • single names ("Scott")
  • any name with a special character (including @, which usually indicates an email address)
  • any name containing the words: subscription, support, customer, Facebook, "by way of ....", "via an", "autoresponder", "vacation", "admin", "team"

Avoid merging based on any email address with no-reply@, info@ (happens for evite)

Names with only punctuation chars should not even be considered names and should not be put in the address book.
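The rules above could be sketched roughly as follows. This is an illustration only, not ePADD's address book code: the class and method names are hypothetical, and the definition of a "special character" (anything other than letters, digits, spaces, periods, apostrophes, and hyphens) is an assumption made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the merge rules from this issue; not ePADD's actual code.
class AddressBookRules {
    // Words and phrases from the issue, matched case-insensitively on word boundaries.
    private static final Pattern BLOCKED_WORDS = Pattern.compile(
            "\\b(?:subscription|support|customer|facebook|by way of|via an|"
            + "autoresponder|vacation|admin|team)\\b",
            Pattern.CASE_INSENSITIVE);

    // Assumption: anything outside these characters counts as a "special char".
    private static final Pattern PLAIN_NAME = Pattern.compile("[\\p{L}\\p{N} .'-]+");

    // Address prefixes that should never drive a merge.
    private static final List<String> BLOCKED_EMAIL_PREFIXES = List.of("no-reply@", "info@");

    /** A string made up only of punctuation or whitespace is not a name at all. */
    static boolean isPlausibleName(String name) {
        return name != null && name.chars().anyMatch(Character::isLetterOrDigit);
    }

    /** True if two contacts may be merged because they share this name. */
    static boolean canMergeOnName(String name) {
        if (!isPlausibleName(name)) return false;
        String trimmed = name.trim();
        if (trimmed.split("\\s+").length < 2) return false;        // single names such as "Scott"
        if (!PLAIN_NAME.matcher(trimmed).matches()) return false;  // special characters such as '@'
        return !BLOCKED_WORDS.matcher(trimmed).find();             // "support", "team", ...
    }

    /** True if two contacts may be merged because they share this email address. */
    static boolean canMergeOnEmail(String email) {
        if (email == null) return false;
        String lower = email.trim().toLowerCase(Locale.ROOT);
        return BLOCKED_EMAIL_PREFIXES.stream().noneMatch(lower::startsWith);
    }
}
```

Matching the blocked words on word boundaries, rather than as plain substrings, avoids rejecting names that merely contain one of them (for example a surname containing "team").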

Word Frequency

Users would like a word frequency table for all email messages to help them create a lexicon (excluding a, an, the, etc.).
epadd - word frequency 2016 jan 29
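A word frequency table of this kind could be computed along the following lines. This is a minimal sketch under stated assumptions: the class name WordFrequency is hypothetical, message bodies are assumed to be available as plain strings, and the stop-word list is deliberately tiny.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of a word frequency table; not ePADD's actual code.
class WordFrequency {
    // Assumed, deliberately short stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "of", "to", "in");

    /** Counts lower-cased words across all message bodies, skipping stop words. */
    static Map<String, Integer> count(List<String> messageBodies) {
        Map<String, Integer> counts = new TreeMap<>();   // sorted alphabetically by word
        for (String body : messageBodies) {
            // Split on anything that is not a letter, digit, or apostrophe.
            for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For a lexicon-building view, the resulting map would typically be sorted by count before display; the alphabetical TreeMap here just keeps the sketch short.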

Dump user understandable version of learnt model

Vihari, can you dump out the NE model for the various types, in a format that can be easily reviewed?
We can use this to understand better how it's working, see if there's a need to tune it, decide on cutoffs, etc. Just switch the type numbers etc. to the type names, and make it more explanatory.

don't show internal authority expansion when 0 or 1 hits

We need to suppress showing internal authority expansion file when there are 0 or 1 hits.
Ref: email from Sudheendra to Vihari Jan 13 920am (1. Albuquerque example)
and from Josh Jan 13 4.09pm (showing Barbara Bono with an empty expansion box).

Q: What happens when there is 1 hit and it's in the address book?
Then it may still be ok to show the popup box on hover, assuming that one can click on it to get address book hits (as opposed to full text hits).

Searched for entity not underlined

If the user searches for an entity (by clicking on it on the entities page),
it is highlighted but not underlined in the browsed messages.

Can we retain underlining for consistency? (and keep highlighting on, in addition).

Delivery module does not allow to navigate to other modules

Whereas the settings page in the other modules allows one to "change the module" (this feels very unintuitive, but that's a different issue), when I switch to the Delivery module, I cannot switch back because the settings page no longer shows this option.

Around 1800 messages taken from Gmail have no date in ePADD

Version epadd-standalone-jan-14
The Harrison archives reside in Gmail.
Some messages were transferred from AOL mail
I connected to the Gmail a/c to get the messages ~40,000
I exported the messages in Appraisal module.
I imported the messages in mbox format using the latest ePADD in order to perform fine-grained entity extraction.
1882 data error(s) in message content and attachments, mostly missing date.

Better error messaging when out of disk space

ePADD version 1.1RC3

User reported ePADD error.

java.io.IOException: There is not enough space on the disk
Session active for 7 hours, 43 minutes, 9 seconds

It is desirable to let users know that the problem is insufficient disk space.
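One way to turn such an exception into a clearer message is sketched below. It is an assumption-laden illustration, not ePADD's error handling: the class name, the 50 MB threshold, and the message wording are all hypothetical, and matching on the exception text is only a heuristic because that text varies by operating system.

```java
import java.io.File;
import java.io.IOException;
import java.util.Locale;

// Hypothetical sketch of friendlier out-of-disk-space reporting; not ePADD's actual code.
class DiskSpaceMessages {
    // Assumed threshold below which the disk is treated as effectively full.
    static final long LOW_SPACE_BYTES = 50L * 1024 * 1024;

    /** Builds a user-facing message from an I/O failure and the free space that remains. */
    static String describe(IOException e, long usableBytes) {
        String detail = e.getMessage() == null ? "" : e.getMessage().toLowerCase(Locale.ROOT);
        // Heuristic: typical OS wording for a full disk.
        boolean looksFull = detail.contains("not enough space") || detail.contains("no space left");
        if (looksFull || usableBytes < LOW_SPACE_BYTES) {
            return "ePADD could not write its files because the disk is full or nearly full ("
                    + (usableBytes / (1024 * 1024)) + " MB free). Free up space and try again.";
        }
        return "ePADD hit an I/O error: " + e.getMessage();
    }

    /** Convenience overload that measures free space on the volume holding workingDir. */
    static String describe(IOException e, File workingDir) {
        return describe(e, workingDir.getUsableSpace());
    }
}
```

Checking File.getUsableSpace() in addition to the exception text gives a second signal when the operating system's wording is not recognized.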

Expand existing sensitive messages tab

Add a description of each regular expression in the presetqueries file.
System to pick up the descriptions of regular expressions in the file and show them as the description of the links to those messages.
Add disease/health entities
Add link to sensitive lexicon
Add browse to CSV file for global actions
Add view email addresses tab to go to all email addresses and the associated number of messages for actions. Cannot click to see messages.?? (not in wireframe yet)

epadd - sensitive messages jan 2016 second draft

Incorrect status messages during indexing

Another thing I noticed when indexing TW's "Important" folder.
It says Reading 78,391 messages from Important and the blue bar makes steady rightward progress.

But the line below it keeps going from "couple of minutes" to "about a minute" etc.
then it comes to zero, says "Parsing Important" and then does the same thing all over again.
Is this some side effect of batching of 10,000 messages?

Noisy location entities

Logging some examples from TW's archive:

(lower priority than the orgs):

All caps:
END PGP SIGNATURE
DAYS OVERDUE
STRICTLY PROHIBITED
PAST PROJECTS
MAJOR PROJECTS
AM GMT (or PM PDT, etc)
MATLAB

Phrases:
App Engine
Teaching Assistants
Principal Investigator
Fourth Amendment
Autumn Quarter
Office Hours
Confidentiality Note
Summer Fellows
Groups Links
Distinguished Visitor
Invoice Number
Needs Major Revision

Words:
Im
Sales
Former
Reading
Arts
Si

Better error messaging when out of memory

We should have better error messages when we run out of memory.
esp. during mbox parsing (but also in other possible places).
Currently we fail silently with no sign of trouble, except that a small number of messages is indexed.
The error message should be conveyed to the user also, apart from the log file.

ePADD build improvements

currently the epadd debug page is saying:

epadd version July 9, 2015
Build Info: built by vihari, Mon 02 Nov 2015 3:04PM

need the correct version for better diagnostics.

Organizations NER improvement

Need a post-processing step where strings that are single-word English-language dictionary entries (or their plurals) are suppressed from being recognized.
We see many orgs like:
People, Computer, Folks, User, shipping, operations, etc.
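Such a post-processing step might look like the sketch below. It is illustrative only: the class name is hypothetical, the six-word set stands in for a real English dictionary, and the plural handling is deliberately naive.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the post-processing step; not ePADD's actual NER code.
class SingleWordOrgFilter {
    // Stand-in for a real English dictionary; these entries are illustrative only.
    private static final Set<String> DICTIONARY = Set.of(
            "people", "computer", "folk", "user", "shipping", "operation");

    /** True if the word, or its naive singular form, is in the dictionary. */
    static boolean isDictionaryWord(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (DICTIONARY.contains(w)) return true;
        // Naive plural handling: strip a trailing "es" or "s" and look again.
        if (w.endsWith("es") && DICTIONARY.contains(w.substring(0, w.length() - 2))) return true;
        return w.endsWith("s") && DICTIONARY.contains(w.substring(0, w.length() - 1));
    }

    /** True if the candidate should be suppressed as an organization. */
    static boolean shouldSuppress(String candidate) {
        String trimmed = candidate.trim();
        boolean singleWord = !trimmed.isEmpty() && trimmed.split("\\s+").length == 1;
        return singleWord && isDictionaryWord(trimmed);
    }
}
```

A filter like this would also suppress genuine organizations whose names are ordinary words, so an allow-list of known exceptions may be needed alongside it.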

Filtering out noisy NE candidates

Some suggestions to filter out phrases that are currently being picked out as candidates.
Should be fixed in CIC tokenizer (and also apply to any other tokenizer):

  1. Need a list of words/phrases that are to be ignored by CIC tokenizer at the beginning of a candidate phrase.

We can start with this list (can be case ignored)
Hello
Hi
My
Our
Regarding
Quoting
On Behalf of
Behalf of

(I assume spaces are normalized in the CIC tokenizer)

Only the matching token from the list above should be dropped, e.g. "Hello Prof. ABC" becomes "Prof. ABC".

  2. Totally drop exact matches with some phrases
    Mon, Tue, etc.
    Jan, Feb, etc.
    OK (just this word)

These candidates can just be ignored.

  3. Drop any candidate that starts with
    BEGIN PGP*
  4. Drop candidates that end in the word "I"
    We see person entities like:
    Currently I...
    Unfortunately I...
    Personally I
    Recently I
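The four filtering rules above could be combined into a single clean-up pass, sketched below. This is not the CIC tokenizer: the class and method names are hypothetical, the day and month abbreviations are expanded from the "etc." in the list as an assumption, and exact matches are compared case-insensitively, which is also an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the candidate filters; not the actual CIC tokenizer.
class CandidateFilter {
    // Longer phrases come first so "On Behalf of" is tried before "Behalf of".
    private static final List<String> LEADING_NOISE = List.of(
            "on behalf of", "behalf of", "regarding", "quoting", "hello", "hi", "my", "our");

    // Assumed expansion of "Mon, Tue, etc." and "Jan, Feb, etc." plus "OK".
    private static final Set<String> EXACT_DROPS = Set.of(
            "mon", "tue", "wed", "thu", "fri", "sat", "sun",
            "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec",
            "ok");

    /** Returns the cleaned candidate, or empty if the candidate should be dropped. */
    static Optional<String> clean(String candidate) {
        String text = candidate.trim().replaceAll("\\s+", " ");   // normalize spaces
        String lower = text.toLowerCase(Locale.ROOT);
        for (String noise : LEADING_NOISE) {                      // rule 1: strip one leading token
            if (lower.startsWith(noise + " ")) {
                text = text.substring(noise.length() + 1);
                break;
            }
        }
        if (text.isEmpty()) return Optional.empty();
        if (EXACT_DROPS.contains(text.toLowerCase(Locale.ROOT))) return Optional.empty(); // rule 2
        if (text.startsWith("BEGIN PGP")) return Optional.empty();                        // rule 3
        if (text.equals("I") || text.endsWith(" I")) return Optional.empty();             // rule 4
        return Optional.of(text);
    }
}
```

Note that dropping exact matches such as "May" or "Sun" will also discard the rare candidate where those strings are genuine names, which is the trade-off implied by the list above.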

Merge / correct extracted entities

Users type the preferred name of entities and the entities to be included in a text file for relative entity types.
Users update kill list of relative entity types for entities recognized by system.
