Self-censoring & accents do not work with custom non-English words #8
Thanks for the issue!
The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains
Using a Unicode inspector reveals that the …
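Purely as an illustrative sketch (the placeholder strings below are hypothetical, not the actual wordlist entries elided above), the approach of listing likely shortenings alongside a base word might look like this:

```rust
// Hypothetical placeholder entries, not the crate's real wordlist.
// The idea: common self-censored or shortened spellings are stored as
// explicit variants next to the base word, so they match directly.
const VARIANTS: &[(&str, &[&str])] = &[
    ("badword", &["b*dword", "b-word", "bdwrd"]),
    ("anotherbadword", &["anotherbw", "another bad word"]),
];
```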
Yeah, I see how self-censoring is implemented. As for accents, I wanted to say that there should be some way to extend the replacements, for example. And yes, I do understand that the same … Cuz, AFAIK, the current implementation of … P.S. These are just my thoughts and suggestions on how …
I am open to expanding the replacements: adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.
I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).
The umlaut (along with all other accents) would be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules). While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о', but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.
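As a sketch of the normalization step described above (not rustrict's actual internals), the unicode-normalization crate can NFD-decompose an accented letter into its base letter plus a combining mark and then drop the mark, which is why 'ö' ends up as plain 'o':

```rust
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

/// Strip accents by NFD-decomposing and dropping combining marks:
/// 'ö' becomes 'o' + U+0308 (combining diaeresis), and the mark is removed.
fn strip_accents(input: &str) -> String {
    input.nfd().filter(|c| !is_combining_mark(*c)).collect()
}

fn main() {
    assert_eq!(strip_accents("höle"), "hole");
    // Cyrillic 'о' is a distinct letter, not an accented Latin 'o',
    // so normalization leaves it alone; it has to be handled by
    // lookalike replacement instead.
    assert_eq!(strip_accents("hоle"), "hоle");
}
```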
It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute out the wordlist (or compose multiple wordlists). The main obstacle is finding false positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).
Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?
Can you explain the benefit of this over a single wordlist with profanity in multiple languages (the current approach)? Are you trying to remove languages you don't care about to make the filter more efficient?
My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in the word lists. This allows for handling Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.
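A rough Rust sketch of what a per-language wordlist with its own lookalike table might look like (field names are hypothetical, loosely modeled on bad-words-next's dictionaries, not an existing rustrict API):

```rust
use std::collections::HashMap;

// Hypothetical structure: each language bundles its own words and its own
// lookalike/replacement table, so dictionaries can be loaded individually
// (censor in a single language) or composed (censor across languages).
struct LanguageWordlist {
    id: &'static str,                // e.g. "en", "ru"
    words: Vec<&'static str>,        // profanity for this language only
    lookalikes: HashMap<char, char>, // per-language character replacements
}

struct ComposedFilter {
    languages: Vec<LanguageWordlist>,
}
```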
One of the barriers is memory: if every character that looks like 'A' had to reference every other character that looks like 'A' (and the same for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this, but none of them are particularly appealing.
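To illustrate the alternative the filter already uses (a sketch with a handful of made-up entries, not the crate's actual table): mapping every lookalike to one canonical ASCII letter keeps the table linear in the number of lookalikes, instead of quadratic as in the every-lookalike-to-every-other-lookalike scheme.

```rust
use std::collections::HashMap;

// Illustrative lookalikes only, not the real replacement data.
fn canonical_map() -> HashMap<char, char> {
    HashMap::from([
        ('о', 'o'), // Cyrillic small o (U+043E)
        ('ο', 'o'), // Greek small omicron (U+03BF)
        ('0', 'o'),
        ('а', 'a'), // Cyrillic small a (U+0430)
        ('4', 'a'),
    ])
}

/// Replace every known lookalike with its canonical ASCII letter, so the
/// profanity list only ever needs the ASCII spelling.
fn canonicalize(input: &str) -> String {
    let map = canonical_map();
    input.chars().map(|c| *map.get(&c).unwrap_or(&c)).collect()
}

fn main() {
    assert_eq!(canonicalize("wоrd"), "word"); // Cyrillic 'о' normalized
}
```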
Indeed 👌
So this is a memory problem. Maybe only convert the most common variations of the letter 'A'? Or only the variations that are possible to type on a keyboard?
When adding a custom non-English word, everything works fine except self-censoring and accents.
Also, is there a way to add custom confusable characters?
Or should we generate custom variants for each added word?
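Roughly what adding a custom word looks like (a sketch assuming the crate's "customize" feature; the `add_word` signature, `Type` flags, and the placeholder word itself are approximations and should be checked against the docs for this version):

```rust
// Sketch only: assumes the "customize" crate feature is enabled, and that
// `add_word` / `Type` behave as documented; verify against the version in use.
use rustrict::{add_word, CensorStr, Type};

fn main() {
    // "plokhoeslovo" is a made-up placeholder for a custom non-English word.
    unsafe {
        // `add_word` mutates global filter state, so it must not run
        // concurrently with other uses of the filter.
        add_word("plokhoeslovo", Type::PROFANE & Type::SEVERE);
    }

    // The exact spelling is caught...
    assert!("plokhoeslovo".is_inappropriate());

    // ...but self-censored or accented variants of the custom word
    // (e.g. "plokhoe-slovo" or "plökhoeslovo") are the ones this issue
    // says are not being caught.
    println!("{}", "plökhoeslovo".censor());
}
```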
Context
I am using rustrict version 0.5.11 (latest version).