philterd / phileas

The PII and PHI redaction engine

Home Page: https://www.philterd.ai

License: Apache License 2.0

Java 99.95% ANTLR 0.05%
anonymize deidentification deidentify phi pii redact redaction personally-identifiable-information protected-health-information java

phileas's Introduction

Phileas

Phileas is a Java library that deidentifies text by redacting PII, PHI, and other sensitive information. Given text or PDF documents, Phileas searches the content for sensitive information such as persons' names, ages, addresses, and many other types of information. Phileas is highly configurable through its settings and policies.

When sensitive information is identified, Phileas can manipulate it in a variety of ways. The information can be replaced, encrypted, anonymized, and more. The user chooses how to manipulate each type of sensitive information. We refer to all of these methods collectively as "redaction."

Information can be redacted based on its content and other attributes. For example, Phileas can redact only certain persons' names, only zip codes meeting some qualification, or only IP addresses that match a given pattern.
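As a rough illustration of content-based conditions, the sketch below redacts a value only when a condition on that value holds. The ConditionSketch class, its two example conditions, and the replacement strings are hypothetical names made up for this example. They are not part of the Phileas API, where conditions are configured through policies.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Illustrative sketch only: these names are hypothetical and not part of the Phileas API.
final class ConditionSketch {

    // Assumed example condition: the value looks like an IPv4 address starting with "10.".
    static final Predicate<String> IP_IN_10_NETWORK =
            Pattern.compile("^10(\\.\\d{1,3}){3}$").asPredicate();

    // Assumed example condition: the value is an age greater than 30.
    static final Predicate<String> AGE_OVER_30 = value -> {
        try {
            return Integer.parseInt(value.trim()) > 30;
        } catch (NumberFormatException e) {
            return false; // Not a plain integer, so the condition does not hold.
        }
    };

    // Returns the replacement when the condition holds; otherwise returns the value unchanged.
    static String redactIf(String value, Predicate<String> condition, String replacement) {
        return condition.test(value) ? replacement : value;
    }
}
```

For example, ConditionSketch.redactIf("45", ConditionSketch.AGE_OVER_30, "[AGE]") replaces the value, while the same call with "25" leaves it untouched.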

Powered by Phileas

Phileas is the core of Philter, a turnkey text redaction engine that provides an API for redacting text. Philter runs entirely within your cloud and never transmits data outside of your cloud. Custom AI models are available for domains like healthcare, legal, and news. Philter is also open source.

Phileas also powers Airlock, an AI policy layer to prevent the disclosure of sensitive information, such as PII and PHI, in your AI applications.

What Phileas Can Do

  • Phileas can identify and redact over 30 types of sensitive information (see list below).
  • Phileas can evaluate conditions when redacting (only zip codes with population less than some value, only ages > 30, only when sentiment is a certain value, etc.).
  • Phileas can perform sentiment and offensiveness classification.
  • Phileas can redact, encrypt, and anonymize sensitive information.
  • Phileas can replace persons' names with random names, dates with similar but random dates, etc.
  • Phileas can disambiguate types of sensitive information (e.g. SSN vs. phone number).
  • Phileas can deidentify text consistently ("John Smith" is replaced consistently in certain documents).
  • Phileas can shift dates or replace dates with approximate representations (e.g. "3 months ago").
  • Phileas uses policies to define what sensitive information to find and how to redact it.
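
To make the idea of consistent deidentification concrete, here is a minimal sketch in which the same token within the same context always receives the same replacement. ConsistentReplacer and the "PERSON-n" replacement scheme are hypothetical and chosen only for illustration. This is not how Phileas implements the feature.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a hypothetical class, not the Phileas implementation.
final class ConsistentReplacer {

    private final Map<String, String> replacements = new HashMap<>();
    private int counter = 0;

    // Returns the same replacement every time the same token is seen in the same context.
    String replace(String context, String token) {
        // The NUL separator keeps distinct (context, token) pairs from colliding.
        String key = context + "\u0000" + token;
        return replacements.computeIfAbsent(key, k -> "PERSON-" + (++counter));
    }
}
```

Scoping the key by context means "John Smith" maps to one replacement within a context but may map to a different one in another context.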

Supported PII, PHI, and Other Sensitive Information

This list might be outdated. Please check the individual filter classes for details.

Persons

  • Persons' Names - Multiple methods, e.g. NER, dictionary, census data
  • Physician Names
  • First Names
  • Surnames

Common

  • Ages
  • Bank Account Numbers
  • Bitcoin Addresses
  • Credit Cards
  • Currency (USD)
  • Dates (in addition to birthdates and deathdates)
  • (US) Driver's License Numbers
  • Email Addresses
  • IBAN Codes
  • IP Addresses (IPv4 and IPv6)
  • MAC Addresses
  • (US) Passport Numbers
  • Phone Numbers
  • Phone Number Extensions
  • Sections (of a document)
  • SSNs and TINs
  • Tracking Numbers (UPS / FedEx / USPS)
  • URLs
  • VINs
  • Zip Codes

(US) Locations

  • Cities
  • Counties
  • Hospitals
  • Hospital Abbreviations
  • States
  • State Abbreviations

Custom Filters

  • Dictionary
  • Identifier

Building Phileas

After cloning, run git lfs pull to download models needed for unit tests. Phileas can then be built with mvn clean install.

Using Phileas

Phileas snapshots and releases are available in our Maven repositories. Add the following repositories to your Maven configuration:

<repository>
    <id>philterd-repository-releases</id>
    <url>https://artifacts.philterd.ai/releases</url>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>
<repository>
    <id>philterd-repository-snapshots</id>
    <url>https://artifacts.philterd.ai/snapshots</url>
    <snapshots>
        <enabled>true</enabled>
    </snapshots>
</repository>

Next, add the Phileas dependency to your project:

<dependency>
  <groupId>ai.philterd</groupId>
  <artifactId>phileas-core</artifactId>
  <version>2.7.0-SNAPSHOT</version>
</dependency>

Finding and Manipulating Sensitive Information in Text

Create a FilterService, using a PhileasConfiguration, and call filter() on the service:

PhileasConfiguration phileasConfiguration = ConfigFactory.create(PhileasConfiguration.class);

FilterService filterService = new PhileasFilterService(phileasConfiguration);

FilterResponse response = filterService.filter(policies, context, documentId, body, MimeType.TEXT_PLAIN);

The policies argument is a list of Policy objects. (See below for more about Policies.) The context and documentId are arbitrary values you can use to uniquely identify the text being filtered. The body is the text you are filtering. Lastly, we specify that the data is plain text.

The response contains information about the identified sensitive information along with the filtered text.

Usage Examples

The PhileasFilterServiceTest and EndToEndTests test classes have examples of how to configure Phileas and filter text.

Finding and Redacting Sensitive Information in a PDF Document

Create a FilterService, using a PhileasConfiguration, and call filter() on the service:

PhileasConfiguration phileasConfiguration = ConfigFactory.create(PhileasConfiguration.class);

FilterService filterService = new PhileasFilterService(phileasConfiguration);

BinaryDocumentFilterResponse response = filterService.filter(policies, context, documentId, body, MimeType.APPLICATION_PDF, MimeType.IMAGE_JPEG);

The policies argument is a list of Policy objects, which are created by deserializing a policy from JSON. (See below for more about Policies.) The context and documentId are arbitrary values you can use to uniquely identify the document being filtered. The body is the content of the PDF document. Lastly, we specify that the input is a PDF (MimeType.APPLICATION_PDF) and that the redacted output is produced as JPEG images (MimeType.IMAGE_JPEG).

The response contains a zip file of the images generated by redacting the PDF document.

Policies

A policy is an instance of a Policy class that tells Phileas the types of sensitive information to identify and what to do with the sensitive information when found. A policy describes the entire filtering process, from which filters to apply to which terms to ignore, and everything in between. Phileas can apply one or more policies when filter() is called. Policies are applied in the order in which they were added to the list.

For examples of creating a policy, look at EndToEndTestsHelper. The PhileasFilterServiceTest and EndToEndTests test classes have examples of how to configure Phileas and filter text.

Policies can be de/serialized to JSON. Here is a basic (but valid) policy that identifies and redacts ages:

{
  "name": "default",
  "ignored": [],
  "identifiers": {
    "age": {
      "ageFilterStrategies": [{
        "strategy": "REDACT",
        "redactionFormat": "{{{REDACTED-%t}}}"
      }]
    }
  }
}

There is a long list of identifiers that can be applied, and each identifier has several possible strategy values. In this case, when an age is found, it is redacted by being replaced with the text {{{REDACTED-age}}}. The %t is a placeholder for the type of filter. In this case, it is the literal text age.
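
The placeholder substitution described above can be pictured with a short sketch. RedactionFormatSketch and its methods are hypothetical names used for illustration only. They show the idea of expanding %t, not the code Phileas runs internally.

```java
// Illustrative sketch only: RedactionFormatSketch is a hypothetical name, not a Phileas class.
final class RedactionFormatSketch {

    // Substitutes the filter type for every %t placeholder in the redaction format.
    static String apply(String redactionFormat, String filterType) {
        return redactionFormat.replace("%t", filterType);
    }

    // Replaces the characters in [start, end) of the text with the expanded redaction format.
    static String redact(String text, int start, int end, String redactionFormat, String filterType) {
        return text.substring(0, start) + apply(redactionFormat, filterType) + text.substring(end);
    }
}
```

Calling RedactionFormatSketch.apply("{{{REDACTED-%t}}}", "age") yields {{{REDACTED-age}}}, matching the policy above.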

License

As of Phileas 2.2.1, Phileas is licensed under the Apache License, version 2.0. Previous versions were under a proprietary license.

Copyright 2024 Philterd, LLC. Copyright 2018-2023 Mountain Fog, Inc.

phileas's People

Contributors

dependabot[bot], jzonthemtn

phileas's Issues

Failing tests on OSX M2

The failures are due to ONNX Runtime on M2.

[ERROR] Errors:
[ERROR]   PersonsV2FilterTest.filter1:64 » UnsatisfiedLink no onnxruntime in java.librar...
[ERROR]   PersonsV2FilterTest.filter2:96 » NoClassDefFound Could not initialize class ai...
[ERROR]   PersonsV2FilterTest.filter3:135 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter4:172 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter5:205 » NoClassDefFound Could not initialize class a...
[ERROR]   PersonsV2FilterTest.filter6:240 » NoClassDefFound Could not initialize class a...

POS post filter does not handle multi-word tokens

Given the input:

"George Washington was president and his ssn was 123-45-6789 and he lived at 90210."

The POS filter fails because the tokens are "George" and "Washington" individually and not "George Washington." The filter needs to be changed to allow for multi-word tokens.

Use stop words to shorten physician names

Use stop words to shorten physician names. Instead of taking the entire n-gram, see if we can use stop words to shorten the span by cutting it based on the location of the stop words.

Look at each token in the physician name span, working inward from both ends, to see if it is a stop word. If it is, condense the span.
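
A minimal sketch of the trimming idea, assuming the span is already tokenized. StopWordTrimmer and its small stop word set are hypothetical and exist only for this example. A real implementation would work on character offsets and use Phileas's own stop word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: a hypothetical helper, not existing Phileas code.
final class StopWordTrimmer {

    // Assumed stop word list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "by", "of", "the", "was", "with"));

    // Drops stop words from both ends of the span; stop words in the middle are kept.
    static List<String> trim(List<String> tokens) {
        int start = 0;
        int end = tokens.size();
        while (start < end && isStopWord(tokens.get(start))) {
            start++;
        }
        while (end > start && isStopWord(tokens.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(tokens.subList(start, end));
    }

    private static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token.toLowerCase(Locale.ROOT));
    }
}
```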

How to launch Phileas?

Hi,

I am not very knowledgeable about Java, but much to my surprise I did manage to write a simple client using your instructions and get it to compile and run using Maven. However, I have not been able to figure out how to launch the Phileas service it expects at https://127.0.0.1:8080. I was wondering how to do that?

Cheers,
Andrew

Disable dependency logging

Disable this logging:

Jan 16 15:31:14 ip-10-0-2-32.ec2.internal bash[3348]: 2021-01-16 15:31:14.544 ERROR 3363 --- [nio-8080-exec-6] c.m.p.s.validators.DateSpanValidator : Text '3/2018' could not be parsed: Unable to obtain LocalDate from TemporalAccessor: {MonthOfYear=3, Year=2018},ISO of type java.time.format.Parsed

Allow individual filter regex to be enabled/disabled

Allow individual filter regexes to be enabled or disabled, so that only a chosen set of regexes is active.

There could be magic environment variables that can be set/unset to enable/disable the regex patterns. (Or some other method.)

Add a priority to each filter

Consider adding a priority to each filter. When two spans are completely identical, the priority would be used to determine which span is selected.

This needs to be tested well. We will need to test:

  • getFiltersForFilterProfile - to ensure the filters are in the order given by the priorities (high to low).
  • Identical spans found by different filters only return the span having the highest priority.

Was coded in 1.10.0 but not tested or added to documentation.
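
The selection rule could look roughly like the sketch below. SpanPrioritySketch and its nested Span class are hypothetical stand-ins, not the Phileas Span class. The choice that ties keep the first span seen is also an assumption of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical classes, not the Phileas Span or filter classes.
final class SpanPrioritySketch {

    static final class Span {
        final int start;
        final int end;
        final String filterType;
        final int priority;

        Span(int start, int end, String filterType, int priority) {
            this.start = start;
            this.end = end;
            this.filterType = filterType;
            this.priority = priority;
        }
    }

    // For spans covering exactly the same characters, keeps only the highest-priority one.
    // On a tie, the span that was seen first is kept (an assumption of this sketch).
    static List<Span> keepHighestPriority(List<Span> spans) {
        Map<String, Span> best = new LinkedHashMap<>();
        for (Span span : spans) {
            String key = span.start + ":" + span.end;
            best.merge(key, span, (current, candidate) ->
                    candidate.priority > current.priority ? candidate : current);
        }
        return new ArrayList<>(best.values());
    }
}
```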

Not finding name with apostrophe

In PhileasFilterServiceTest.endToEnd15(), the name “David O’Brien” is not being identified. “David O '” is being found, but the space between the O and the apostrophe is causing findByRegex to return -1.

Add options to make first names and surnames be adjacent

Add an optional parameter to the FirstName filter that requires a Surname immediately after.

Likewise, add an optional parameter to the Surname filter that requires a FirstName immediately preceding it.

Both options can be set independently, and both should default to false.

When either option is set to true, that filter should only report a span when it is preceded/followed by a span from the other filter.
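
A sketch of the first-name side of this check is shown below. AdjacencySketch and its Span class are hypothetical, and "immediately after" is assumed here to mean that the surname starts exactly one character (for example a space) after the first name ends.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical classes, not existing Phileas filters.
final class AdjacencySketch {

    // A character range [start, end) in the text.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Keeps only the first-name spans that are followed, one separator character later,
    // by a surname span.
    static List<Span> firstNamesFollowedBySurname(List<Span> firstNames, List<Span> surnames) {
        List<Span> kept = new ArrayList<>();
        for (Span firstName : firstNames) {
            for (Span surname : surnames) {
                if (surname.start == firstName.end + 1) {
                    kept.add(firstName);
                    break;
                }
            }
        }
        return kept;
    }
}
```

The surname option would be the mirror image, keeping a surname only when a first-name span ends one character before it starts.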

Allow filter profiles to be written in YAML

Allow filter profiles to be written in YAML.

  • What format will the API return when retrieving filter profiles?
  • When saving filter profiles through the API, how to set the format? Content-type header?
  • The .json extension is used extensively through the filter profile services to find filter profiles on disk.

Incorporate zip code database

The goal is to reduce zip code false positives by including a look up when text matches a potential zip code. Because zip codes change, the lookup should not be definitive but should be an additional factor when determining if it is a true positive.
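
One way to treat the lookup as an additional factor rather than a definitive answer is sketched below. ZipCodeConfidenceSketch, the five-digit pattern, and the confidence values 0.9 and 0.5 are all assumptions picked for illustration, not values used by Phileas.

```java
import java.util.Set;

// Illustrative sketch only: a hypothetical helper with made-up confidence values.
final class ZipCodeConfidenceSketch {

    // A known zip code raises confidence; an unknown one is still reported, with lower
    // confidence, because the lookup data may be out of date.
    static double confidence(String candidate, Set<String> knownZipCodes) {
        if (!candidate.matches("\\d{5}")) {
            return 0.0; // Does not look like a five-digit zip code at all.
        }
        return knownZipCodes.contains(candidate) ? 0.9 : 0.5;
    }
}
```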

Condition should be a list of strings instead of just a string

Condition should be a list of strings instead of just a string. As written now, there is a one-to-one relationship between a condition and a filter strategy.

This allows for multiple conditions for a given filter strategy. This is how it was done in Philter Studio before it was discovered that "condition" is just a single string in the filter condition.

Add OR boolean operator to grammar

Add OR boolean operator to grammar.

Currently, OR can be accomplished to some degree by using multiple filter strategies.

It would be ideal to allow expressions like:

context == 'test' and confidence > 1.0 or token == 'asdf'
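
To illustrate how such an expression might be evaluated, the sketch below assumes the conventional precedence in which and binds more tightly than or. BooleanExpressionSketch is hypothetical and splits the text naively on the keywords, so it would break on string literals containing " or " or " and ". The actual grammar would be extended in ANTLR instead.

```java
import java.util.function.Predicate;

// Illustrative sketch only: a naive evaluator, not the Phileas ANTLR grammar.
final class BooleanExpressionSketch {

    // Evaluates "a and b or c" as "(a and b) or c". Each atom (for example
    // "confidence > 1.0") is delegated to the caller-supplied evaluator.
    static boolean evaluate(String expression, Predicate<String> atomEvaluator) {
        for (String disjunct : expression.split("\\s+or\\s+")) {
            boolean allTrue = true;
            for (String atom : disjunct.split("\\s+and\\s+")) {
                if (!atomEvaluator.test(atom.trim())) {
                    allTrue = false;
                    break;
                }
            }
            if (allTrue) {
                return true; // One satisfied disjunct is enough.
            }
        }
        return false;
    }
}
```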

Ignore cities in court names

Ignore cities when they appear as part of a court name, e.g. District Court of Baltimore City.

This requires consideration about where to implement the feature. If we are looking for city names then it seems to be a function of the CITY filter. So that would require a flag in the CITY filter strategy to ignore the city if it is given as part of a court name.

Court names seem to be either:

… Court of … - Supreme Court of West Virginia
… Court of the … - Supreme Court of the United States
… Court - Wisconsin Supreme Court
… Court for the … - United States District Court for the Eastern District of Wisconsin

The Restriction class could probably be used as a means of doing a lookup.
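
As a starting point for discussion, the four shapes above can be approximated with a single regular expression. CourtNameSketch and its pattern are hypothetical and deliberately simple: they only handle capitalized single words and are not based on the Restriction class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: a hypothetical helper with a simplified pattern.
final class CourtNameSketch {

    // Capitalized words, then "Court", then optional "of/for [the] Capitalized Words" groups.
    private static final Pattern COURT_NAME = Pattern.compile(
            "(?:[A-Z][a-z]+ )*Court(?: (?:of|for)(?: the)?(?: [A-Z][a-z]+)+)*");

    // Returns true when the span [spanStart, spanEnd) lies entirely inside a court name.
    static boolean insideCourtName(String text, int spanStart, int spanEnd) {
        Matcher matcher = COURT_NAME.matcher(text);
        while (matcher.find()) {
            if (spanStart >= matcher.start() && spanEnd <= matcher.end()) {
                return true;
            }
        }
        return false;
    }
}
```

A CITY span for "Baltimore" inside "District Court of Baltimore City" would be reported as inside a court name and could then be ignored, while the same city in ordinary prose would not.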

Support non-USD currencies

Support non-USD currencies. Need to add options to the filter strategy to designate the type of currency (or none for all types).
