XXXX redacted
You've been given a contract from a three letter abbreviation government agency. They want a program to automatically redact sensitive information from documents to be released to the public. They want fine control over what gets redacted though.
Given a piece of free-form, possibly Unicode text (assume text only, no markup or formatting codes), they want to be able to redact whole words (case sensitive or insensitive) or partial words (case sensitive or insensitive). Further, they want the option to "overkill" redact a partial word. Overkill redact means that if a word contains the redaction target, even if it is only part of the word, the entire word is redacted.
For our purposes, a "word" means a character group separated by white-space and possibly punctuation; it is not necessarily made up of strictly alphabetic characters. To "redact" a word or partial word, replace each character in the redaction target with a capital letter 'X'. There should be the same number of graphemes in the final redacted word as there were in the non-redacted word.
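The character-for-character replacement described above can be sketched in Go roughly as follows. This is a minimal illustration, not one of the solutions on this page: the name `redactPartial` is hypothetical, the target is treated as literal text, and the count is per code point rather than per grapheme, so it is only exact for text without combining marks or joined emoji.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// redactPartial (hypothetical helper) replaces every occurrence of target in
// text with one 'X' per code point of the matched text.
// Assumption: counting runes is good enough; true grapheme counting is not done here.
func redactPartial(text, target string, insensitive bool) string {
	expr := regexp.QuoteMeta(target) // treat the target as literal text, not as a pattern
	if insensitive {
		expr = "(?i)" + expr
	}
	re := regexp.MustCompile(expr)
	return re.ReplaceAllStringFunc(text, func(m string) string {
		return strings.Repeat("X", utf8.RuneCountInString(m))
	})
}

func main() {
	fmt.Println(redactPartial("Toms bottom tomato", "tom", false)) // Toms botXXX XXXato
	fmt.Println(redactPartial("Toms bottom tomato", "tom", true))  // XXXs botXXX XXXato
}
```

Quoting the target with `regexp.QuoteMeta` is a design choice that lets the target contain non-letter characters such as `?` or `.` without them being read as pattern syntax.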
- Task
Write a procedure to "redact" a given piece of text. Your procedure should take the text (or a link to it), the redaction target (or a link to it) and the redaction options. It need not be a single routine, as long as there is some way to programmatically select which operation will be performed. It may be invoked from a command line or as an internal routine, but it should be separately invokable, not just a hard coded block.
The given strings are enclosed in square brackets to denote them. The brackets should not be counted as part of the strings.
Using the test string: [Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.]
Show the redacted sentence for each of the redaction target strings [Tom] & [tom] using the following options:
- Whole word
- Whole word, Case insensitive
- Partial word
- Partial word, Case insensitive
- Partial word, Overkill
- Partial word, Case insensitive, Overkill
Note that some combinations don't, or at least shouldn't, really differ from a less specific combination. E.g. "Whole word, Overkill" should theoretically be exactly the same as "Whole word".
Extra kudos for not including adjoining punctuation during "Overkill" redaction.
Extra kudos if the redaction target can contain non-letter characters.
The demo strings use the abbreviations w/p for whole/partial word, i/s for case insensitive/sensitive, n/o for normal/overkill. You are not required to use those, or any abbreviation. They are just for display, though may be useful to show what operation you are intending to perform.
Ideal expected output (adjoining punctuation untouched):
Redact 'Tom':
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom.
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom.
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
Redact 'tom':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX.
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
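One way to select between the whole-word, partial and overkill operations is sketched below in Go. It is an illustration under stated assumptions rather than a reference implementation: the `Options` type, `parseOptions`, `xs` and `redact` are hypothetical names, a "word" is assumed to be a run of letters, digits, hyphens and apostrophes, and 'X's are counted per code point rather than per grapheme.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// Options mirrors the w/p, s/i and n/o switches of the demo strings.
type Options struct {
	Partial     bool // p: match inside words (false means whole word)
	Insensitive bool // i: ignore case
	Overkill    bool // o: redact the entire word on any partial match
}

// parseOptions reads a display string such as "[p|i|o]".
func parseOptions(s string) Options {
	return Options{
		Partial:     strings.Contains(s, "p"),
		Insensitive: strings.Contains(s, "i"),
		Overkill:    strings.Contains(s, "o"),
	}
}

// wordRe is an assumed definition of a "word"; adjoining punctuation such as
// '?', '.' and '"' is left outside the match and therefore untouched.
var wordRe = regexp.MustCompile(`[\p{L}\p{N}'-]+`)

// xs returns one 'X' per code point of s.
func xs(s string) string { return strings.Repeat("X", utf8.RuneCountInString(s)) }

func redact(text, target string, o Options) string {
	flags := ""
	if o.Insensitive {
		flags = "(?i)"
	}
	quoted := regexp.QuoteMeta(target)
	partialRe := regexp.MustCompile(flags + quoted)
	wholeRe := regexp.MustCompile(flags + "^(?:" + quoted + ")$")
	return wordRe.ReplaceAllStringFunc(text, func(w string) string {
		switch {
		case !o.Partial: // whole word; overkill makes no difference here
			if wholeRe.MatchString(w) {
				return xs(w)
			}
			return w
		case o.Overkill: // any hit redacts the entire word
			if partialRe.MatchString(w) {
				return xs(w)
			}
			return w
		default: // partial: redact only the matched part
			return partialRe.ReplaceAllStringFunc(w, xs)
		}
	})
}

func main() {
	text := `Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.`
	for _, opts := range []string{"[w|s|n]", "[w|i|n]", "[p|s|n]", "[p|i|n]", "[p|s|o]", "[p|i|o]"} {
		fmt.Println(opts, redact(text, "Tom", parseOptions(opts)))
	}
}
```

Checking the whole-word case first means that "Whole word, Overkill" behaves exactly like "Whole word", as the note above suggests it should.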
- Stretch
Complex Unicode: Using the test string: [🧑 👨 🧔 👨👩👦] and the redaction strings: [👨] and [👨👩👦]
Show the redacted strings when using the option "Whole word" (Case sensitivity shouldn't matter.) A single grapheme should be replaced by a single 'X'.
🧑 👨 🧔 👨👩👦
Redact '👨' [w] 🧑 X 🧔 👨👩👦
Redact '👨👩👦' [w] 🧑 👨 🧔 X
Go
Go has a problem with zero width joiner (ZWJ) emojis such as the final one in the test string which is not recognized as a single 'character' by the language as it consists of five Unicode code-points (or 'runes') instead of one. This problem is aggravated (as here) when one of the constituents of the ZWJ emoji happens to be a 'normal' emoji contained within the same test string!
Care is therefore needed to ensure that when a normal emoji is being redacted it doesn't also redact one of the constituents of a ZWJ emoji.
To get the number of 'X's right where a ZWJ emoji or other character combination is being replaced, a third party library function is used which counts the number of graphemes in a string, as required by the task. <lang go>package main
import (
    "fmt"
    "github.com/rivo/uniseg"
    "log"
    "regexp"
    "strings"
)
func join(words, seps []string) string {
    lw := len(words)
    ls := len(seps)
    if lw != ls+1 {
        log.Fatal("mismatch between number of words and separators")
    }
    var sb strings.Builder
    for i := 0; i < ls; i++ {
        sb.WriteString(words[i])
        sb.WriteString(seps[i])
    }
    sb.WriteString(words[lw-1])
    return sb.String()
}
func redact(text, word, opts string) {
    var partial, overkill bool
    exp := word
    if strings.IndexByte(opts, 'p') >= 0 {
        partial = true
    }
    if strings.IndexByte(opts, 'o') >= 0 {
        overkill = true
    }
    if strings.IndexByte(opts, 'i') >= 0 {
        exp = `(?i)` + exp
    }
    rgx := regexp.MustCompile(`[\s!-&(-,./:-@[-^{-~]+`) // all punctuation except -'_
    seps := rgx.FindAllString(text, -1)
    words := rgx.Split(text, -1)
    rgx2 := regexp.MustCompile(exp)
    for i, w := range words {
        match := rgx2.FindString(w)
        // check there's a match and it's not part of a ZWJ emoji
        if match == "" || strings.Index(w, match+"\u200d") >= 0 ||
            strings.Index(w, "\u200d"+match) >= 0 {
            continue
        }
        switch {
        case overkill:
            words[i] = strings.Repeat("X", uniseg.GraphemeClusterCount(w))
        case !partial:
            if words[i] == match {
                words[i] = strings.Repeat("X", uniseg.GraphemeClusterCount(w))
            }
        case partial:
            repl := strings.Repeat("X", uniseg.GraphemeClusterCount(word))
            words[i] = rgx2.ReplaceAllLiteralString(w, repl)
        }
    }
    fmt.Printf("%s %s\n\n", opts, join(words, seps))
}
func printResults(text string, allOpts, allWords []string) {
    fmt.Printf("Text: %s\n\n", text)
    for _, word := range allWords {
        fmt.Printf("Redact '%s':\n", word)
        for _, opts := range allOpts {
            redact(text, word, opts)
        }
    }
    fmt.Println()
}
func main() {
text := `Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
'Tis very tomish, don't you think?`
    allOpts := []string{"[w|s|n]", "[w|i|n]", "[p|s|n]", "[p|i|n]", "[p|s|o]", "[p|i|o]"}
    allWords := []string{"Tom", "tom", "t"}
    printResults(text, allOpts, allWords)

    text = "🧑 👨 🧔 👨👩👦"
    allOpts = []string{"[w]"}
    allWords = []string{"👨", "👨👩👦"}
    printResults(text, allOpts, allWords)

    text = "Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱"
    allOpts = []string{"[p]", "[p|o]"}
    printResults(text, allOpts, allWords)
}</lang>
- Output:
Text: Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
Redact 'Tom':
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 'tom':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 't':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms boXXom XomaXo is in his sXomach while playing Xhe "Tom-Xom" brand Xom-Xoms. ThaX's so Xom. 'Tis very Xomish, don'X you Xhink?
[p|i|n] Xom? Xoms boXXom XomaXo is in his sXomach while playing Xhe "Xom-Xom" brand Xom-Xoms. XhaX's so Xom. 'Xis very Xomish, don'X you Xhink?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. 'Tis very XXXXXX, XXXXX you XXXXX?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. XXXX very XXXXXX, XXXXX you XXXXX?
Text: 🧑 👨 🧔 👨👩👦
Redact '👨':
[w] 🧑 X 🧔 👨👩👦
Redact '👨👩👦':
[w] 🧑 👨 🧔 X
Text: Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨':
[p] Argentina🧑🇦🇹 FranceX🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
[p|o] Argentina🧑🇦🇹 XXXXXXXX Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨👩👦':
[p] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 NetherlandsX🇳🇱
[p|o] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 XXXXXXXXXXXXX
Julia
The solution must kludge a check with the variable "multichar" to properly substitute "X" instead of "XXXX" with the last example. Otherwise Julia (v 1.4) interprets one 184-bit Unicode extended emoji character as four Unicode characters. <lang julia>function doif_equals(word, pattern, insens=false)
    regex = insens ? Regex("^$pattern\$", "i") : Regex("^$pattern\$")
    return replace(word, regex => pattern in multichars ? "X" : "X"^length(pattern))
end
doif_ci_equals(word, pattern) = doif_equals(word, pattern, true)

function doif_includes(word, pattern, insens=false)
    regex = insens ? Regex(pattern, "i") : Regex(pattern)
    return replace(word, regex => "X"^length(pattern))
end
doif_ci_includes(word, pattern) = doif_includes(word, pattern, true)

function overkill(word, pattern, insens=false)
    regex = insens ? Regex(pattern, "i") : Regex(pattern)
    return occursin(regex, word) ? "X"^length(word) : word
end
ci_overkill(word, pattern) = overkill(word, pattern, true)

const method = Dict(
    "[w|s|n]" => doif_equals,
    "[w|i|n]" => doif_ci_equals,
    "[p|s|n]" => doif_includes,
    "[p|i|n]" => doif_ci_includes,
    "[p|s|o]" => overkill,
    "[p|i|o]" => ci_overkill
)
const multichars = Set(["👨👩👦", ])

function redact(teststring, pattern)
    ws = split(teststring, r"[^ \?\"\.]+")
    words = filter(!=(""), split(teststring, r"[\s\?\"\.]+"))
    fs = popfirst!(words)
    f = method[fs]
    return fs * ws[2] * mapreduce(i -> f(words[i], pattern) * ws[i + 2], *, 1:length(words))
end

const testtext = """
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[w|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[p|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[p|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[p|s|o] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[p|i|o] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
"""
const stretchtext = "[w|s|n] 🧑 👨 🧔 👨👩👦"

for test in [(testtext, ["Tom", "tom"]), (stretchtext, ["👨", "👨👩👦"])]
    for pat in test[2]
        println("\nRedact pattern \"$pat\":")
        for teststring in string.(split(strip(test[1]), r"\n"))
            println(redact(teststring, pat))
        end
    end
    println()
end
</lang>
- Output:
Redact pattern "Tom":
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom.
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom.
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
Redact pattern "tom":
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX.
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
Redact pattern "👨":
[w|s|n] 🧑 X 🧔 👨👩👦
Redact pattern "👨👩👦":
[w|s|n] 🧑 👨 🧔 X
Perl
<lang perl>use strict; use warnings;
my $test = <<END;
Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
'Tis very tomish, don't you think?
END

sub redact {
    my($str, $redact, %opt) = @_;
    my $insensitive = $opt{'i'} or 0;
    my $partial     = $opt{'p'} or 0;
    my $overkill    = $opt{'o'} or 0;

    my $rx =
        $insensitive ?
            $partial ?
                $overkill ? qr/ \b{wb} ((?i)[-\w_]* [\w*']* $redact [-'\w]* \S*?) \b{wb} /x
                          : qr/ ((?i)$redact) /x
                     : qr/ \b{wb}(?<!-) ((?i)$redact) (?!-)\b{wb} /x
            :
            $partial ?
                $overkill ? qr/ \b{wb} ([-\w]* [\w*']* $redact [-'\w]* \S*?) \b{wb} /x
                          : qr/ ($redact) /x
                     : qr/ \b{wb}(?<!-) ($redact) (?!-)\b{wb} /x
    ;
    $str =~ s/($rx)/'X' x length $1/gre;
}

for my $redact (<Tom tom t>) {
    print "\nRedact '$redact':\n";
    for (['[w|s|n]', {}],
         ['[w|i|n]', {i=>1}],
         ['[p|s|n]', {p=>1}],
         ['[p|i|n]', {p=>1, i=>1}],
         ['[p|s|o]', {p=>1, o=>1}],
         ['[p|i|o]', {p=>1, i=>1, o=>1}]
        ) {
        my($option, $opts) = @$_;
        no strict 'refs';
        printf "%s %s\n", $option, redact($test, $redact, %$opts)
    }
}</lang>
- Output:
Redact 'Tom':
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 'tom':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 't':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms boXXom XomaXo is in his sXomach while playing Xhe "Tom-Xom" brand Xom-Xoms. ThaX's so Xom. 'Tis very Xomish, don'X you Xhink?
[p|i|n] Xom? Xoms boXXom XomaXo is in his sXomach while playing Xhe "Xom-Xom" brand Xom-Xoms. XhaX's so Xom. 'Xis very Xomish, don'X you Xhink?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. 'Tis very XXXXXX, XXXXX you XXXXX?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. XXXX very XXXXXX, XXXXX you XXXXX?
Phix
Written on the assumption that overkill implies partial (see talk page).
utf32_length() fashioned after Reverse_a_string#Phix with added ZWJ - I do not expect it to be entirely complete.
<lang Phix>enum WHOLE,PARTIAL,OVERKILL,INSENSITIVE
constant spunc = " \r\n\t.?\"" -- spaces and punctuation
function utf32_length(sequence utf32)
    integer l = length(utf32)
    for i=1 to l do
        integer ch = utf32[i]
        if (ch>=0x300 and ch<=0x36f)
        or (ch>=0x1dc0 and ch<=0x1dff)
        or (ch>=0x20d0 and ch<=0x20ff)
        or (ch>=0xfe20 and ch<=0xfe2f) then
            l -= 1
        elsif ch=0x200D then -- ZERO WIDTH JOINER
            l -= 2
        end if
    end for
    return l
end function
function redact(string text, word, integer options)
    sequence t_utf32 = utf8_to_utf32(text),
             l_utf32 = t_utf32,
             w_utf32 = utf8_to_utf32(word)
    string opt = "[?|s]"
    if options>INSENSITIVE then
        options -= INSENSITIVE
        opt[4] = 'i'
        l_utf32 = lower(t_utf32)
        w_utf32 = lower(w_utf32)
    end if
    opt[2] = "wpo"[options]
    integer idx = 1
    while true do
        idx = match(w_utf32,l_utf32,idx)
        if idx=0 then exit end if
        integer edx = idx+length(w_utf32)-1
        if options=WHOLE then
            if (idx=1 or find(l_utf32[idx-1],spunc))
            and (edx=length(l_utf32) or find(l_utf32[edx+1],spunc)) then
                t_utf32[idx..edx] = repeat('X',utf32_length(t_utf32[idx..edx]))
            end if
        elsif options=PARTIAL or options=OVERKILL then
            if options=OVERKILL then
                while idx>1 and not find(l_utf32[idx-1],spunc) do
                    idx -= 1
                end while
                while edx<length(l_utf32) and not find(l_utf32[edx+1],spunc) do
                    edx += 1
                end while
            end if
            t_utf32[idx..edx] = repeat('X',utf32_length(t_utf32[idx..edx]))
        end if
        idx = edx+1
    end while
    text = utf32_to_utf8(t_utf32)
    return {opt,text}
end function
constant test = `
Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.`,
         tests = {"Tom","tom","t"}
for t=1 to length(tests) do
    printf(1,"Redact %s:\n",{tests[t]})
    for o=WHOLE to OVERKILL do
        printf(1,"%s:%s\n",redact(test,tests[t],o))
        printf(1,"%s:%s\n",redact(test,tests[t],o+INSENSITIVE))
    end for
end for
constant ut = "🧑 👨 🧔 👨👩👦",
         fmt = """
%s
Redact 👨 %s %s
Redact 👨👩👦 %s %s
"""
printf(1,fmt,{ut}&redact(ut,"👨",WHOLE)&redact(ut,"👨👩👦",WHOLE))</lang>
- Output:
The Windows console makes a complete mockery of those Unicode characters, though it should look better on Linux...
Redact Tom:
[w|s]:XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[w|i]:XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s]:XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom.
[p|i]:XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[o|s]:XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom.
[o|i]:XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
Redact tom:
[w|s]:Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[w|i]:XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX.
[p|s]:Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX.
[p|i]:XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX.
[o|s]:Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
[o|i]:XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX.
Redact t:
[w|s]:Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[w|i]:Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
[p|s]:Tom? Toms boXXom XomaXo is in his sXomach while playing Xhe "Tom-Xom" brand Xom-Xoms. ThaX's so Xom.
[p|i]:Xom? Xoms boXXom XomaXo is in his sXomach while playing Xhe "Xom-Xom" brand Xom-Xoms. XhaX's so Xom.
[o|s]:Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX.
[o|i]:XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX.
ƒºæ ƒæ¿ ƒºö ƒæ¿ÔÇìƒæ®ÔÇìƒæª Redact ƒæ¿ [w|s] ƒºæ X ƒºö ƒæ¿ÔÇìƒæ®ÔÇìƒæª Redact ƒæ¿ÔÇìƒæ®ÔÇìƒæª [w|s] ƒºæ ƒæ¿ ƒºö X
Raku
<lang perl6>sub redact ( Str $str, Str $redact, :i(:$insensitive) = False, :p(:$partial) = False, :o(:$overkill) = False ) {
    my $rx =
        $insensitive ??
            $partial ??
                $overkill ?? rx/:i <?after ^ | <:Po> | \s > (<[\w<:!Po>-]>*? [\w*\']? $redact [\w*\'\w+]? \S*?) <?before $ | <:Po> | \s > / #'
                          !! rx/:i ($redact) /
                     !! rx/:i <?after ^ | [\s<:Po>] | \s > ($redact) <?before $ | <:Po> | \s > /
            !!
            $partial ??
                $overkill ?? rx/ <?after ^ | <:Po> | \s > (<[\w<:!Po>-]>*? [\w*\']? $redact [\w*\'\w+]? \S*?) <?before $ | <:Po> | \s > / #'
                          !! rx/ ($redact) /
                     !! rx/ <?after ^ | [\s<:Po>] | \s > ($redact) <?before $ | <:Po> | \s > /
    ;
    $str.subst( $rx, {'X' x $0.chars}, :g )
}
my %*SUB-MAIN-OPTS = :named-anywhere;
# Operate on a given file with the given parameters
multi MAIN (
    Str $file,    #= File name with path
    Str $target,  #= Redact target text string
    :$i = False,  #= Case insensitive flag
    :$p = False,  #= Partial words flag
    :$o = False   #= Overkill flag
  ) {
    put $file.IO.slurp.&redact( $target, :i($i), :p($p), :o($o) )
}

# Operate on the internal strings / parameters
multi MAIN () {

    # TESTING
    my $test = q:to/END/;
        Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom.
        'Tis very tomish, don't you think?
        END
    #'

    for 'Tom', 'tom', 't' -> $redact {
        say "\nRedact '$redact':";
        for '[w|s|n]', $redact, {},
            '[w|i|n]', $redact, {:i},
            '[p|s|n]', $redact, {:p},
            '[p|i|n]', $redact, {:p, :i},
            '[p|s|o]', $redact, {:p, :o},
            '[p|i|o]', $redact, {:p, :i, :o}
          -> $option, $str, %opts {
            printf "%s %s\n", $option, $test.&redact($str, |%opts)
        }
    }

    my $emoji = '🧑 👨 🧔 👨👩👦';
    printf "%20s %s\n", '', $emoji;
    printf "%20s %s\n", "Redact '👨' [w]", $emoji.&redact('👨');
    printf "%20s %s\n", "Redact '👨👩👦' [w]", $emoji.&redact('👨👩👦');

    # Even more complicated Unicode
    $emoji = 'Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱';
    printf "\n%20s %s\n", '', $emoji;
    printf "%20s %s\n", "Redact '👨' [p]", $emoji.&redact('👨', :p);
    printf "%20s %s\n", "Redact '👨👩👦' [p]", $emoji.&redact('👨👩👦', :p);
    printf "%20s %s\n", "Redact '👨' [p|o]", $emoji.&redact('👨', :p, :o);
    printf "%20s %s\n", "Redact '👨👩👦' [p|o]", $emoji.&redact('👨👩👦', :p, :o);
}</lang>
- Output:
Redact 'Tom':
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 'tom':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 't':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms boXXom XomaXo is in his sXomach while playing Xhe "Tom-Xom" brand Xom-Xoms. ThaX's so Xom. 'Tis very Xomish, don'X you Xhink?
[p|i|n] Xom? Xoms boXXom XomaXo is in his sXomach while playing Xhe "Xom-Xom" brand Xom-Xoms. XhaX's so Xom. 'Xis very Xomish, don'X you Xhink?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. 'Tis very XXXXXX, XXXXX you XXXXX?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. XXXX very XXXXXX, XXXXX you XXXXX?
🧑 👨 🧔 👨👩👦
Redact '👨' [w] 🧑 X 🧔 👨👩👦
Redact '👨👩👦' [w] 🧑 👨 🧔 X
Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨' [p] Argentina🧑🇦🇹 FranceX🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨👩👦' [p] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 NetherlandsX🇳🇱
Redact '👨' [p|o] Argentina🧑🇦🇹 XXXXXXXX Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨👩👦' [p|o] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 XXXXXXXXXXXXX
Wren
<lang ecmascript>import "/pattern" for Pattern
import "/str" for Str
import "/upc" for Graphemes
var join = Fn.new { |words, seps|
    var lw = words.count
    var ls = seps.count
    if (lw != ls + 1) {
        Fiber.abort("Mismatch between number of words and separators.")
    }
    var sb = ""
    for (i in 0...ls) {
        sb = sb + words[i]
        sb = sb + seps[i]
    }
    sb = sb + words[lw-1]
    return sb
}
var redact = Fn.new { |text, word, opts|
    var partial = opts.contains("p")
    var overkill = opts.contains("o")
    var insensitive = opts.contains("i")
    var i = " \t\n\r!\"#$\%&()*+,./:;<=>?@[\\]^`{|}~" // all punctuation except -'_
    var p = Pattern.new("+1/i", 0, i)
    var matches = p.findAll(text)
    var seps = Pattern.matchesText(matches)
    var words = p.splitAll(text)
    var expr = insensitive ? Str.lower(word) : word
    var p2 = Pattern.new(expr)
    for (i in 0...words.count) {
        var w = words[i]
        var wl = insensitive ? Str.lower(w) : w
        var m = p2.find(wl)
        if (m && wl.indexOf(m.text + "\u200d") == -1 && wl.indexOf("\u200d" + m.text) == -1) {
            if (overkill) {
                words[i] = "X" * Graphemes.clusterCount(w)
            } else if (!partial) {
                if (wl == m.text) words[i] = "X" * Graphemes.clusterCount(w)
            } else if (partial) {
                var repl = "X" * Graphemes.clusterCount(word)
                words[i] = p2.replaceAll(wl, repl)
            }
        }
    }
    System.print("%(opts) %(join.call(words, seps))\n")
}
var printResults = Fn.new { |text, allOpts, allWords|
    System.print("Text: %(text)\n")
    for (word in allWords) {
        System.print("Redact '%(word)':")
        for (opts in allOpts) redact.call(text, word, opts)
    }
    System.print()
}
var text = "Tom? Toms bottom tomato is in his stomach while playing the \"Tom-tom\" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?"
var allOpts = ["[w|s|n]", "[w|i|n]", "[p|s|n]", "[p|i|n]", "[p|s|o]", "[p|i|o]"]
var allWords = ["Tom", "tom", "t"]
printResults.call(text, allOpts, allWords)

text = "🧑 👨 🧔 👨👩👦"
allOpts = ["[w]"]
allWords = ["👨", "👨👩👦"]
printResults.call(text, allOpts, allWords)

text = "Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱"
allOpts = ["[p]", "[p|o]"]
printResults.call(text, allOpts, allWords)</lang>
- Output:
Text: Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
Redact 'Tom':
[w|s|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] XXX? XXXs bottom tomato is in his stomach while playing the "XXX-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] XXX? XXXX bottom tomato is in his stomach while playing the "XXXXXXX" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 'tom':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[w|i|n] XXX? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so XXX. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms botXXX XXXato is in his sXXXach while playing the "Tom-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|i|n] XXX? XXXs botXXX XXXato is in his sXXXach while playing the "XXX-XXX" brand XXX-XXXs. That's so XXX. 'Tis very XXXish, don't you think?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing the "XXXXXXX" brand XXXXXXXX. That's so XXX. 'Tis very XXXXXX, don't you think?
Redact 't':
[w|s|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[w|i|n] Tom? Toms bottom tomato is in his stomach while playing the "Tom-tom" brand tom-toms. That's so tom. 'Tis very tomish, don't you think?
[p|s|n] Tom? Toms boXXom XomaXo is in his sXomach while playing Xhe "Tom-Xom" brand Xom-Xoms. ThaX's so Xom. 'Tis very Xomish, don'X you Xhink?
[p|i|n] Xom? Xoms boXXom XomaXo is in his sXomach while playing Xhe "Xom-Xom" brand Xom-Xoms. XhaX's so Xom. 'Xis very Xomish, don'X you Xhink?
[p|s|o] Tom? Toms XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. 'Tis very XXXXXX, XXXXX you XXXXX?
[p|i|o] XXX? XXXX XXXXXX XXXXXX is in his XXXXXXX while playing XXX "XXXXXXX" brand XXXXXXXX. XXXXXX so XXX. XXXX very XXXXXX, XXXXX you XXXXX?
Text: 🧑 👨 🧔 👨👩👦
Redact '👨':
[w] 🧑 X 🧔 👨👩👦
Redact '👨👩👦':
[w] 🧑 👨 🧔 X
Text: Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨':
[p] Argentina🧑🇦🇹 FranceX🇫🇷 Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
[p|o] Argentina🧑🇦🇹 XXXXXXXX Germany🧔🇩🇪 Netherlands👨👩👦🇳🇱
Redact '👨👩👦':
[p] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 NetherlandsX🇳🇱
[p|o] Argentina🧑🇦🇹 France👨🇫🇷 Germany🧔🇩🇪 XXXXXXXXXXXXX