obscenity's Introduction

Obscenity

Robust, extensible profanity filter for NodeJS.


Why Obscenity?

  • Accurate: Though Obscenity is far from perfect (as with all profanity filters), it makes reducing false positives as simple as possible: adding whitelisted phrases is as easy as adding a new string to an array, and using word boundaries is equally simple.
  • Robust: Obscenity's transformer-based design allows it to match on variants of phrases that other libraries are typically unable to detect, e.g. fuuuuuuuckkk, ʃṳ𝒸𝗄, wordsbeforefuckandafter, and so on. There's no need to manually write out all the variants either: adding the pattern fuck will match all of the cases above by default.
  • Extensible: With Obscenity, you aren't locked into anything - removing phrases that you don't agree with from the default set of words is trivial, as is disabling any transformations you don't like (perhaps you feel that leet-speak decoding is too error-prone for you).

Installation

$ npm install obscenity
$ yarn add obscenity
$ pnpm add obscenity

Example usage

First, import Obscenity:

const {
	RegExpMatcher,
	TextCensor,
	englishDataset,
	englishRecommendedTransformers,
} = require('obscenity');

Or, in TypeScript/ESM:

import {
	RegExpMatcher,
	TextCensor,
	englishDataset,
	englishRecommendedTransformers,
} from 'obscenity';

Now, we can create a new matcher using the English preset.

const matcher = new RegExpMatcher({
	...englishDataset.build(),
	...englishRecommendedTransformers,
});

Now we can use the matcher to search text for profanity. Here are two examples of what you can do:

Check if there are any matches in some text:

if (matcher.hasMatch('fuck you')) {
	console.log('The input text contains profanities.');
}
// The input text contains profanities.

Output the positions of all matches along with the original word used:

// Pass "true" as the "sorted" parameter so the matches are sorted by their position.
const matches = matcher.getAllMatches('ΚƒπŸΚƒα½—Ζˆο½‹ α»ΉΠΎα»© π”Ÿβ±αΊ—π™˜Ι¦', true);
for (const match of matches) {
	const { phraseMetadata, startIndex, endIndex } =
		englishDataset.getPayloadWithPhraseMetadata(match);
	console.log(
		`Match for word ${phraseMetadata.originalWord} found between ${startIndex} and ${endIndex}.`,
	);
}
// Match for word fuck found between 0 and 6.
// Match for word bitch found between 12 and 18.

Censoring matched text:

To censor text, we'll need to import another class: the TextCensor. Some other imports and creation of the matcher have been elided for simplicity.

const { TextCensor, ... } = require('obscenity');
// ...
const censor = new TextCensor();
const input = 'fuck you little bitch';
const matches = matcher.getAllMatches(input);
console.log(censor.applyTo(input, matches));
// %@$% you little **%@%
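The censoring strategy is configurable: TextCensor exposes a setStrategy method, and a strategy is just a function from a censor context to replacement text. Below is a minimal sketch of a hand-rolled strategy; the function name is illustrative, and it assumes the context exposes matchLength (the number of characters matched), which the built-in strategies rely on:

```javascript
// Sketch of a custom censoring strategy: replace every matched character
// with a fixed '#' symbol. The name is illustrative; the shape assumes the
// censor context exposes matchLength, the number of characters matched.
const hashStrategy = (ctx) => '#'.repeat(ctx.matchLength);

// Wiring it up would mirror the built-in strategies:
//   const censor = new TextCensor().setStrategy(hashStrategy);

// The strategy can be exercised on its own with a stub context:
console.log(hashStrategy({ matchLength: 4 })); // '####'
```

Built-in strategies such as asteriskCensorStrategy and keepStartCensorStrategy follow the same contract and can be composed.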

This is just a small slice of what Obscenity can do: for more, check out the documentation.

Accuracy

Note: As with all swear filters, Obscenity is not perfect (nor will it ever be). Use its output as a heuristic, and not as the sole judge of whether some content is appropriate or not.

With the English preset, Obscenity (correctly) finds matches in all of the following texts:

  • you are a little fucker
  • fk you
  • ffuk you
  • i like a$$es
  • ΚƒπŸΚƒα½—Ζˆο½‹ α»ΉΠΎα»©

...and it does not match on the following:

  • the pen is mightier than the sword
  • i love bananas so yeah
  • this song seems really banal
  • grapes are really yummy
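For contrast, a naive substring filter with no word-boundary or whitelist handling would flag every one of those clean sentences: "pen is" runs into "penis", "bananas so" hides "ass", "banal" contains "anal", and "grapes" contains "rape". The sketch below is illustrative only and is not how Obscenity works internally:

```javascript
// Naive approach: strip whitespace and look for banned substrings.
// This is NOT Obscenity's algorithm; it shows why word boundaries and
// whitelisted phrases are needed to avoid false positives.
const naiveHasProfanity = (text, banned) => {
  const squashed = text.toLowerCase().replace(/\s+/g, '');
  return banned.some((word) => squashed.includes(word));
};

const banned = ['penis', 'ass', 'anal', 'rape'];
const cleanTexts = [
  'the pen is mightier than the sword', // "pen is" -> "penis"
  'i love bananas so yeah',             // "bananas so" hides "ass"
  'this song seems really banal',       // "banal" contains "anal"
  'grapes are really yummy',            // "grapes" contains "rape"
];
// Every clean sentence is (wrongly) flagged by the naive filter:
console.log(cleanTexts.every((t) => naiveHasProfanity(t, banned))); // true
```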

Documentation

For a step-by-step guide on how to use Obscenity, check out the guide.

Otherwise, refer to the auto-generated API documentation.

Contributing

Issues can be reported using the issue tracker. If you'd like to submit a pull request, please read the contribution guide first.

Author

Obscenity © Joe L. under the MIT license. Authored and maintained by Joe L.

GitHub @jo3-l


obscenity's Issues

bug: Certain words not being censored

Expected behavior

Inputting all the EnglishProfaneWord, I expected all of them to be censored.

Actual behavior

As you can see on this CodePen, not all of the words get censored.

Minimal reproducible example

import {
  RegExpMatcher,
  TextCensor,
  englishDataset,
  englishRecommendedTransformers,
  keepStartCensorStrategy,
  keepEndCensorStrategy,
  asteriskCensorStrategy
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers
})

const strategy = keepStartCensorStrategy(keepEndCensorStrategy(asteriskCensorStrategy()))
const censor = new TextCensor().setStrategy(strategy)

const words = 'abbo abeed africoon anal anus arabush arse ass bastard bestiality bitch blowjob boob boonga buttplug chingchong chink cock cuck cum cunt deepthroat dick dildo doggystyle double penetration ejaculate fag felch fellatio finger bang fisting fuck gangbang handjob hentai hooker incest jerk off jizz lubejob masturbate nigger orgasm orgy penis porn pussy rape retard scat semen sex slut tit tranny vagina whore'

const matches = matcher.getAllMatches(words)
console.log(censor.applyTo(words, matches))

Steps to reproduce

  1. View console
  2. Observe that not all words are censored

Additional context

Here is a less minimal CodePen with an input and output textarea: https://codepen.io/HatScripts/pen/NWJxEKW

Node.js version

N/A

Obscenity version

v0.1.4

Priority

  • Low
  • Medium
  • High

Terms

  • I agree to follow the project's Code of Conduct.
  • I have searched existing issues for similar reports.

request: French language support

Description

I'm working on a project that requires some french support. I saw https://github.com/darwiin/french-badwords-list/tree/master being adapted for https://github.com/jojoee/leo-profanity and was thinking of doing the same thing. I like how extensible this library is.

Solution

Similar to english.ts, the idea is to import and extract the array from https://github.com/darwiin/french-badwords-list/tree/master and build a dataset. I can work on a PR for it but can someone point me in the right direction for writing a test for this?

Code of Conduct

  • I agree to follow this project's Code of Conduct.

bug: Memory leak when using an empty string

Expected behavior

Proper error message

Actual behavior

JavaScript heap out of memory

❯ node index.js 

<--- Last few GCs --->

[79252:0x4d7f380]    21093 ms: Mark-sweep (reduce) 4080.9 (4142.9) -> 4080.7 (4141.9) MB, 1573.5 / 0.0 ms  (+ 1.8 ms in 2 steps since start of marking, biggest step 1.8 ms, walltime since start of marking 1584 ms) (average mu = 0.146, current mu = 0.098) [79252:0x4d7f380]    21095 ms: Scavenge 4082.3 (4142.4) -> 4081.3 (4143.4) MB, 1.2 / 0.0 ms  (average mu = 0.146, current mu = 0.098) allocation failure 


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xafedf0 node::Abort() [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 2: 0xa1814d node::FatalError(char const*, char const*) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 3: 0xce795e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 4: 0xce7cd7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 5: 0xeb16b5  [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 6: 0xeb21a4  [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 7: 0xec0617 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 8: 0xec39cc v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
 9: 0xe862ec v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
10: 0x11f3156 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
11: 0x15c9ed9  [/home/jeremy/.asdf/installs/nodejs/16.6.1/bin/node]
Aborted (core dumped)

Minimal reproducible example

import {
  DataSet,
  RegExpMatcher,
  englishRecommendedTransformers,
  pattern,
  parseRawPattern,
} from "obscenity";

const customDataset = new DataSet();
const bannedChatWords = [""];
bannedChatWords.forEach((item, _idx) => {
  const word = item.toLowerCase();
  customDataset.addPhrase((phrase) => {
    return phrase
      .setMetadata({ originalWord: word })
      .addPattern(parseRawPattern(word))
      .addPattern(pattern`|${word}`)
      .addPattern(pattern`${word}|`)
  });
});

const customMatcher = new RegExpMatcher({
  ...customDataset.build(),
  ...englishRecommendedTransformers
});

function messageViolation(message) {
  return customMatcher.getAllMatches(message).length > 0;
}

console.log("test", messageViolation("test"))

Steps to reproduce

  1. Save that code to index.js
  2. run node index.js
  3. ...
  4. Profit?

Additional context

The words come from user generated content. My app was improperly storing an empty string. When the code would try to dynamically generate the banned word list with an empty string in the mix, it would tank the site.
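Until the library validates its input, a workaround on the caller's side is to sanitize user-generated word lists before building patterns from them. A minimal sketch (the helper name is hypothetical):

```javascript
// Hypothetical helper: drop empty and whitespace-only entries from a
// user-generated banned-word list, normalize the rest, and de-duplicate,
// so no empty string ever reaches addPattern.
const sanitizeWordList = (words) => [
  ...new Set(words.map((w) => w.trim().toLowerCase()).filter((w) => w.length > 0)),
];

console.log(sanitizeWordList(['Test', '', '   ', 'test', 'other']));
// [ 'test', 'other' ]
```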

Node.js version

v16.6.1

Obscenity version

obscenity@^0.1.1:
version "0.1.1"


bug: Using .addPhrase with Angular script optimization causes error that prevents Angular from bootstrapping

Expected behavior

I expected .addPhrase to include my added obscene term and build properly.

Actual behavior

Using .addPhrase in an Angular component and optimization: scripts = true in angular.json's build config causes the following error:
Cannot read properties of undefined (reading 'Literal');

This error prevented Angular from bootstrapping

Minimal reproducible example

Add this block to angular.json ("scripts": true is the setting that triggers the issue; it seems related to minification):

"optimization": {
  "styles": { "minify": true, "inlineCritical": true },
  "scripts": true,
  "fonts": true
}

Import the package to a component.
import { DataSet, RegExpMatcher, englishDataset, englishRecommendedTransformers, pattern } from 'obscenity';

In ngOnInit, add custom phrases to the DataSet

    const customDataSet = new DataSet()
      .addAll(englishDataset)
      .addPhrase((phrase) => phrase.addPattern(pattern`|damn|`))
      .addPhrase((phrase) => phrase.addPattern(pattern`|hell|`).addWhitelistedTerm('hello'));

    this.matcher = new RegExpMatcher({
      ...customDataSet.build(),
      ...englishRecommendedTransformers,
    });

...

Steps to reproduce

NOTE: This works fine in v0.1.4

  • Configure your Angular project to optimize scripts (minify) in the angular.json file
  • Using an Angular component...
  • Import englishDataset and DataSet
  • In ngOnInit, create a custom dataset that starts with the englishDataset
  • Add custom phrases to that DataSet using addPhrase => addPattern
  • Notice error: Cannot read properties of undefined (reading 'Literal');

Additional context

No response

Node.js version

v18.12.1

Obscenity version

v0.2.0


Question around performance

I'm considering using this library to process a lot of text, so I'm wondering whether performance has been considered in the library's code and testing. It would be interesting to add some information about performance to the readme.

bug: Censoring of the n-word results in more asterisks than expected

Expected behavior

Actual behavior

matcher.getAllMatches('nigger') results in an array of length 2, when it should only be 1. This causes the resulting censored string to be n*********r, when it should be n****r.


Minimal reproducible example

import {
  RegExpMatcher,
  TextCensor,
  englishDataset,
  englishRecommendedTransformers,
  keepStartCensorStrategy,
  keepEndCensorStrategy,
  asteriskCensorStrategy
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers
})

const strategy = keepStartCensorStrategy(keepEndCensorStrategy(asteriskCensorStrategy()))
const censor = new TextCensor().setStrategy(strategy)

const input = 'nigger'

const matches = matcher.getAllMatches(input)
console.log(matches)
console.log(censor.applyTo(input, matches))

Steps to reproduce

  1. Run above code
  2. View console

Additional context

No response

Node.js version

N/A

Obscenity version

v0.1.4


request: Censor the word "shit"

Description

I was surprised to realize that this library doesn't censor the word "shit" by default, given that it's one of the most common English swear words.

Solution

I'm not fully versed with the pattern syntax used by this project, but here's my attempt at implementing it:

.addPhrase((phrase) => 
  phrase
    .setMetadata({ originalWord: 'shit' })
    .addPattern(pattern`shit`)
    .addWhitelistedTerm('s hit')
    .addWhitelistedTerm('sh it')
    .addWhitelistedTerm('shi t')
    .addWhitelistedTerm('shitake')
)

This should cover words where "shit-" is the prefix ("shitty", "shite", "shithead", etc.), as well as words where "-shit" is the suffix ("bullshit", "dipshit", "batshit", etc.)


bug: Strange input results in false positive

Expected behavior

When I input the following string:

    "" ""
    "" ""
    "" ""
Assamese -> Assam

I expect that there should be no censoring.

Actual behavior

However, Assam becomes A*sam.

Strangely, modifying parts of the string, such as the quotes ("), results in no censoring.

Minimal reproducible example

import {
  RegExpMatcher,
  TextCensor,
  englishDataset,
  englishRecommendedTransformers,
  keepStartCensorStrategy,
  keepEndCensorStrategy,
  asteriskCensorStrategy
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers
})

const strategy = keepStartCensorStrategy(keepEndCensorStrategy(asteriskCensorStrategy()))
const censor = new TextCensor().setStrategy(strategy)

const input = `    "" ""
    "" ""
    "" ""
Assamese -> Assam`

const matches = matcher.getAllMatches(input)
console.log(censor.applyTo(input, matches))

Steps to reproduce

  1. Run the above code
  2. View console

Additional context

No response

Node.js version

N/A

Obscenity version

0.2.0


bug: Unable to ban numbers

Expected behavior

Using this pattern with numbers

pattern`|666|`

I expect that to be matched

Actual behavior

it's not matched

Minimal reproducible example

import {
  DataSet,
  RegExpMatcher,
  englishRecommendedTransformers,
  pattern,
} from "obscenity";

const customDataset = new DataSet();
customDataset.addPhrase((phrase) => phrase.setMetadata({ originalWord: "666" }).addPattern(pattern`|666|`));
const matcher = new RegExpMatcher({
  ...customDataset.build(),
  ...englishRecommendedTransformers,
});

matcher.getAllMatches("666").length //=> 0

Steps to reproduce

  1. Use that code
  2. See 0, but expected 1

Additional context

No response

Node.js version

v16.6.1

Obscenity version

0.1.1


Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Warning

These dependencies are deprecated:

  • npm: standard-version (replacement PR available)

Other Branches

These updates are pending. To force PRs open, click the checkbox below.

  • chore(deps): replace dependency standard-version with commit-and-tag-version ^9.5.0

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Detected dependencies

github-actions
.github/workflows/codeql-analysis.yml
  • actions/checkout v4
  • github/codeql-action v3
  • github/codeql-action v3
  • github/codeql-action v3
.github/workflows/continuous-integration.yml
  • actions/checkout v4
  • pnpm/action-setup v2.4.0
  • actions/setup-node v4
  • actions/checkout v4
  • pnpm/action-setup v2.4.0
  • actions/setup-node v4
  • codecov/codecov-action v4
  • actions/checkout v4
  • pnpm/action-setup v2.4.0
  • actions/setup-node v4
npm
package.json
  • @commitlint/cli ^18.0.0
  • @commitlint/config-angular ^18.0.0
  • @jest/types ^29.5.0
  • @types/jest ^29.5.2
  • @typescript-eslint/eslint-plugin ^6.0.0
  • @typescript-eslint/parser ^6.0.0
  • conventional-github-releaser ^3.1.5
  • eslint ^8.42.0
  • eslint-config-neon ^0.1.47
  • eslint-config-prettier ^9.0.0
  • eslint-plugin-jest ^27.2.1
  • eslint-plugin-prettier ^4.2.1
  • fast-check ^2.25.0
  • gen-esm-wrapper ^1.1.3
  • is-ci ^3.0.1
  • jest ^29.7.0
  • jest-circus ^29.5.0
  • prettier ^2.8.8
  • rimraf ^5.0.0
  • standard-version ^9.5.0
  • ts-jest ^29.1.1
  • ts-node ^10.9.1
  • typedoc ^0.25.0
  • typedoc-plugin-markdown ^3.15.3
  • typescript ^5.2.2
  • node >=14.0.0

  • Check this box to trigger a request for Renovate to run again on this repository

Fix Typescript Types when using NodeNext module resolution

Hi! 👋

Firstly, thanks for your work on this project! 🙂

Today I used patch-package to patch [email protected] for the project I'm working on.

Using NodeNext as the TypeScript moduleResolution causes the types to be unresolved.

Here is the diff that solved my problem:

diff --git a/node_modules/obscenity/package.json b/node_modules/obscenity/package.json
index 899188c..580449a 100644
--- a/node_modules/obscenity/package.json
+++ b/node_modules/obscenity/package.json
@@ -6,8 +6,14 @@
   "module": "./dist/index.mjs",
   "types": "./dist/index.d.ts",
   "exports": {
-    "import": "./dist/index.mjs",
-    "require": "./dist/index.js"
+    "import": {
+      "types": "./dist/index.d.ts",
+      "default":"./dist/index.mjs"
+    },
+    "require": {
+      "types": "./dist/index.d.ts",
+      "default": "./dist/index.js"
+    }
   },
   "repository": {
     "type": "git",

This issue body was partially generated by patch-package.
