Sanitize user input

From Rosetta Code
Sanitize user input is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

"Never trust user input." If the Super Mario Bros. 3 Wrong Warp or Bobby Tables have taught programmers anything, it's that user input can be dangerous in unexpected ways.

In general, preventing errors such as the above is best left to the built-in security features of the language rather than to a filter of your own creation. This exercise tests your ability to think about all the possible ways user input could break your program.

Task

Create a function that takes a list of 20 first and last names, and copies them to a record or struct. Ten of them must be typical input (i.e. consist only of letters of the alphabet and punctuation), but the other ten must be deliberately chosen to cause problems for a program that expects only letters and punctuation. A few examples:

  • ASCII control codes such as NUL, CR, LF
  • Code for the language you are using that can result in damage (e.g. rm -rf, delete System32, DROP TABLE, etc.)
  • Numbers, symbols, foreign languages, emojis, etc.


(There were already solutions provided before the requirement that ten names are "normal" and ten are potentially harmful was added. Those answers satisfied the task requirements at the time they were submitted.)
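As a rough illustration of the task (a minimal sketch in Python, which is not one of the languages featured on this page), the names can be copied verbatim into a record type. The `Person` class, the `copy_names` function and the sample names are hypothetical choices made for this sketch, and only three of the twenty required names are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """A plain record; the fields are stored as data and never evaluated."""
    first: str
    last: str


def copy_names(pairs):
    """Copy (first, last) pairs into Person records without interpreting them."""
    return [Person(first, last) for first, last in pairs]


# Three deliberately awkward inputs: SQL, a shell command, control codes.
hostile = [
    ("Robert'); DROP TABLE students;--", "Tables"),
    ("rm -rf /", "Smith"),
    ("Nul\x00Byte", "Line\r\nBreak"),
]
people = copy_names(hostile)
```

Nothing here evaluates the strings, so the hostile entries are stored unchanged. Any danger arises later, when a record is handed to a shell, a database or a display layer.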


jq

Adapted from #Wren

The jq program presented here offers an interactive approach to the problem along the lines of the Wren solution. It will accept both "stop" and the end-of-stream as a signal to finish gathering names.

The main program, `interact`, is somewhat convoluted because jq does not currently offer much support for the type of interaction envisioned in the Wren answer. It would be easy to simplify things by using `stderr` for the prompt, but currently `stderr` cannot be used to print "raw" strings.

def Person::new(firstName; lastName):
  {firstName: firstName,
   lastName: lastName };
	 
def Person::tostring: .firstName + " " + .lastName;

def blacklist: [
    "drop", "delete", "erase", "kill", "wipe", "remove",
    "file", "files", "directory", "directories",
    "table", "tables", "record", "records", "database", "databases",
    "system", "system32", "system64", "rm", "rf", "rmdir", "format", "reformat"
];

def punct: "'-";            # allowable punctuation

def permissible:
  def ok: "[A-Za-z\(punct)]+";
  test("^" + ok +"$");

# Emit null or else the text of an error message
def checkInput:
  . as $name
  | (permissible and (((punct|contains($name[0:1])) or (punct|contains($name[-1:]))) | not)) as $ok
  | if $ok
    then 
      if blacklist|index($name|ascii_downcase) then "Sorry, that name is unacceptable."
      else null
      end
    else "Sorry, that name contains unacceptable characters."
    end ;

# Attempt to obtain a valid response until "stop" or EOS.
# Set .invalid and .answer of the incoming object.
def ask:
  .invalid = false
  # Use `first(inputs)` to avoid error on EOS.
  | (first(inputs) // null) as $x
  | if $x | IN(null, "stop") then .answer = true # i.e. stop
    else .invalid = ($x|checkInput)
    | .answer = (if .invalid then null else $x end)
    end ;

# $max is the maximum number of full names to request (-1 for arbitrarily many)
def interact($max):

  # An array of Person
  def summary:
    "The following \(length) person(s) have been added to the database:",
    (.[] | Person::tostring);

  ["first", "last"] as $prompts
  | label $out
  # .question is the question number we are currently focused on.
  # .emit is the string to emit if it has been set.
  | foreach range(0; infinite) as $i (
      {question: 0, emit: null, array: []};

      if .array | length == $max
      then .finished = .array
      elif .emit then ask
      | if .answer == true then .finished = .array
        elif .invalid then .emit = .invalid + " Please re-enter:"
        else .emit = null
      | if   .question == 0
          then .first = .answer | .question = 1
          else .last  = .answer | .question = 0
          | .array = .array + [Person::new(.first; .last)]
          end
        end
      else .
      end
      # update .emit
      | if .finished then .emit = null
        elif .emit then .
        else .emit = "Enter your \($prompts[.question]) name : "
        end;

      if .finished then ., break $out
      elif .emit then .
      else empty
      end)
    | (select(.emit) | .emit),
      (select(.finished) | .array | summary) ;

interact(20)

Example

jq -nrR -f sanitize-user-input.jq
Enter your first name : 
John!
Sorry, that name contains unacceptable characters. Please re-enter:
John
Enter your last name : 
Doe
Enter your first name : 
stop
The following 1 person(s) have been added to the database:
John Doe

Julia

With the notorious exception of some older SQL languages, most languages never evaluate external input as code. Because of this, sanitizing of user input in languages such as Julia is not needed unless the program is designed specifically to run user input as a system command. The task given does not require such system level evaluation.

import Base: string

const BLACKLIST = [
    "drop", "delete", "erase", "kill", "wipe", "remove",
    "file", "files", "directory", "directories",
    "table", "tables", "record", "records", "database", "databases",
    "system", "system32", "system64", "rm", "rf", "rmdir", "format", "reformat"
]
const PUNC = [''', '-']
const LETT = ['a':'z'; 'A':'Z']

"""
    function validator(s)

    Validation of `s` requires:
        `s` is valid utf-8
        `s` only has chars that are in okc
        `s` is not in the `blist`, and if `csense` is false (the default),
        the lowercase version of `s` is not in the lowercase version of `blist`.

    Returns (true, s) if valid and (false, error message) if invalid.
"""
function validator(stri, okc = vcat(LETT, PUNC), blist = copy(BLACKLIST), csense = false)
    s = ""
    if !csense
        blist = lowercase.(blist)
    end
    try # some binary sequences are invalid utf8 and may throw error
       s = string(stri)
       lcs = csense ? s : lowercase(s)
       lcs ∈ blist && return false, "Sorry, name $s is forbidden."
       any(x -> x ∉ okc, s) && return false, "Sorry, name $s contains bad characters."
    catch y
        return false, y
    end
    return true, s
end

""" class for Person with firstname and lastname identity strings """
struct Person
    firstname::String
    lastname::String
end
""" convert a Person to its string representation """
Base.string(p::Person) = "$(p.firstname) $(p.lastname)"


""" Add Persons to plist with validation by validator """
function addsanitized!(plist, validator = validator)
    println("""\n    INSTRUCTIONS
       Enter new names as first name then last name. 
       Allowable characters are a through z (A-Z), along with ' and - in names.
       Some words are reserved for use by the system and are thus excluded.
       Enter two blank lines to exit (Hit Enter for a blank entry). 
    """)
    while true
        print("Enter first name: ")
        fn = strip(readline())
        ok, firstname = validator(fn)
        if !ok 
            println(firstname)
            continue
        end
        print("Enter last name: ")
        ln = strip(readline())
        ok, lastname = validator(ln)
        if !ok 
            println(lastname)
            continue
        end
        firstname == "" && lastname == "" && break       
        push!(plist, Person(firstname, lastname))
    end
    return plist
end

const persons = addsanitized!(Person[])
println("\nAdded:\n", join(persons, "\n"))

Phix

As noted, there is no magic "one size fits all" solution, and in the specific case of SQL the use of sqlite3_prepare() and sqlite3_bind_text() is strongly recommended in preference to sqlite3_exec() or sqlite3_get_table(), at least for any questionable input. Using sqlite3_bind_text() there is no problem whatsoever with having a student named (say) "Robert'); DROP TABLE students;--".
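The same idea can be sketched in Python, whose standard `sqlite3` module accepts `?` placeholders that take the place of explicit sqlite3_bind_text() calls. The table name `students` and the in-memory database are assumptions made for this example.

```python
import sqlite3

# An in-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

name = "Robert'); DROP TABLE students;--"

# The ? placeholder binds the value as data; it is not parsed as SQL.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# The table still exists and holds the name exactly as entered.
rows = conn.execute("SELECT name FROM students").fetchall()
```

Building the statement by string concatenation instead would let the quote in the name end the string literal early, which is exactly the Bobby Tables problem.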

Given some suspect Phix source code to be run, it is simply not practical to cover cases such as system(rot13(reverse("se- ze"))) or any of the other myriad ways in which harmful content could be disguised. In case you have not guessed, that would execute "rm -rf", assuming the code also contains a working rot13() implementation.
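A small Python sketch of why this is impractical: a naive blacklist check (the hypothetical `looks_harmful` below, with an assumed word list) sees nothing wrong with the disguised string, yet decoding it the way the Phix expression would yields the harmful command.

```python
import codecs

# An assumed, deliberately short blacklist for this sketch.
BLACKLIST = {"rm", "rf", "drop", "delete"}


def looks_harmful(text):
    """Naive check: is any blacklisted word a token of the text?"""
    words = text.lower().replace("-", " ").split()
    return any(word in BLACKLIST for word in words)


disguised = "se- ze"

# Reverse the string, then apply rot13, mirroring rot13(reverse(...)).
decoded = codecs.encode(disguised[::-1], "rot_13")

print(looks_harmful(disguised))  # the filter inspects only the disguised form
print(decoded)
```

The filter only ever sees the literal text it is given, so any transformation applied after the check defeats it.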

Of course you could block all use, even legitimate, of things like system(), as covered by Safe_mode and Untrusted_environment, or whitelist as per the Raku entry below.

The inverse problem recently arose in p2js, whereby otherwise perfectly valid code on desktop/Phix could and would generate invalid HTML/Javascript if and when we tried to self-host (an effort which is still very much in progress, albeit not apace, btw):

with javascript_semantics
string header = """
<!DOCTYPE html>
<html lang="en" >
 <head>
  <title>%%s</title>%s
 </head>
 <body>
  <scr!ipt src="p2js.js"></scr!ipt>%%s%s
"""
-- ...
header = substitute(header,"scr!ipt","script")

puts(1,header)  -- (make the example runnable)

In other words I had to "sanitize" a constant in the source code, in this particular case, and I could have gone further and done something similar with all the other tags, but in practice there was no need to because the generated JavaScript was already always inside a script tag.

Raku

It would be helpful if the task author would be a little more specific about what he is after. How user inputs must be "sanitized" entirely depends on how the data is going to be used.

For internal usage, in Raku, where you are simply storing data into an internal data structure, it is pretty much a non issue. Variables in Raku aren't executed without specific instructions to do so. Full stop.

Your name is a string of 2.6 million null bytes? Ok. Good luck typing that in.

You're called 'rm -rf /'? Wow. Sucks to be you.

Now, it may be a good idea to check for a maximum permitted length (2.6e6 null bytes), but Raku would handle it with no problem.

The problem mostly comes in when you need to interchange data with some 3rd party library / system; but every different system is going to have its own quirks and caveats. There is no "one size fits all" solution.

In general, when it comes to sanitizing user input, the best way to go about it is: don't. It's a losing game.

Instead, either validate input to make sure it follows a certain format, whitelist input so only a known few commands are permitted, or if those aren't possible, use 3rd party tools the 3rd party system provides to make arbitrary input "safe" to run. Which one of these is used depends on what system you need to interact with.

For the case given, (Bobby Tables), where you are presumably putting names into some 3rd party data storage (nominally a database of some kind), you would use bound parameters to automatically "make safe" any user input. See the Raku entry under the Parametrized SQL statement task.

Validating is making sure the input matches some predetermined format, usually with some sort of regular expression. For names, you probably want to allow a fixed maximum (and minimum!) number of: any word or digit character, space and period characters and possibly some small selection of non-word characters. It is a careful balance between too restrictive and too permissive. You need to avoid falling into pre-conceived assumptions about: names, time, gender, addresses, phone numbers... the list goes on.
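A minimal sketch of such a format check, written in Python rather than Raku. The 1 to 40 character limits and the extra punctuation (apostrophe and hyphen) are arbitrary assumptions, and `is_valid_name` is a hypothetical helper.

```python
import re

# Word characters (Unicode letters, digits, underscore), space, period,
# apostrophe and hyphen; between 1 and 40 characters in total.
# Both the limits and the punctuation set are assumptions of this sketch.
NAME_RE = re.compile(r"[\w .'\-]{1,40}")


def is_valid_name(name):
    """Return True if the whole string matches the permitted format."""
    return NAME_RE.fullmatch(name) is not None


for candidate in ["Beyonc\u00e9", "O'Riley", "", "Robert'); DROP TABLE students;--"]:
    print(repr(candidate), is_valid_name(candidate))
```

Using `fullmatch` rather than `^...$` avoids accepting a trailing newline. Note that `\w` also admits the underscore, and that an accent typed as a separate combining character may be rejected unless the input is normalized first, which is one more example of the balance described above.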

When passing a user command to the operating system, you probably want to use whitelisting. Only a very few commands from a predetermined list are allowed to be used.

   if $command ∈ <ls time cd df> then { execute $command }

or some such. What the whitelist contains and how to determine if the input matches is a whole article in itself.
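One possible shape for such a check, sketched in Python. The allowed set and the `check_command` helper are hypothetical, and the decision to permit arguments after a whitelisted command name is an assumption that a real system would need to examine carefully.

```python
import shlex

# Assumed whitelist for this sketch; a real list depends on the system.
ALLOWED = {"ls", "df"}


def check_command(command_line):
    """Split the input shell-style and accept it only if the command
    name is whitelisted. Returns the argument list, or raises."""
    args = shlex.split(command_line)
    if not args or args[0] not in ALLOWED:
        raise PermissionError(f"command not permitted: {command_line!r}")
    return args


# The returned list could then be passed to subprocess.run(args) without
# shell=True, so that characters like ';' are never interpreted by a shell.
print(check_command("ls -l"))
```

Matching on a parsed command name rather than searching the raw string means an input such as "ls; rm -rf /" is rejected, because its first token is "ls;" and not "ls".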

Unfortunately, this is very vague and hand-wavey due to the vagueness of the task description. Really, any language could copy/paste 95% or better of the above, change the language name, and be done with it. But until the task description is made a little more focused, it will have to do.

Wren

Library: Wren-ioutil
Library: Wren-pattern
Library: Wren-str
Library: Wren-iterate


The following assumes that names are only valid if they contain ASCII letters, hyphens or apostrophes. However, the first or last character of a name can't be a punctuation character and a name must be between 1 and 20 characters long. A single character name is allowed to cater for an initial where the full name is not known. People are given a chance to abbreviate their names if they are too long.

No other characters are allowed including control characters, spaces, symbols, emojis and non-English letters. Names which include them are simply rejected.

Furthermore, there is a blacklist of unacceptable names, though in practice this would probably be longer or more sophisticated than the one I've used here, depending on what will be done with the records later.

import "./ioutil" for Input
import "./pattern" for Pattern
import "./str" for Str
import "./iterate" for Indexed

class Person {
    construct new(firstName, lastName) {
        _firstName = firstName
        _lastName  = lastName
    }

    firstName { _firstName }
    lastName  { _lastName }

    toString { _firstName + " " + _lastName }
}

var persons = []
var blacklist = [
    "drop", "delete", "erase", "kill", "wipe", "remove",
    "file", "files", "directory", "directories",
    "table", "tables", "record", "records", "database", "databases",
    "system", "system32", "system64", "rm", "rf", "rmdir", "format", "reformat"
]

var punct = "'-" // allowable punctuation
var i = Pattern.letter + punct
var p = Pattern.new("+1&i", Pattern.whole, i)

var sanitizeInput = Fn.new { |name|
    var ok = p.isMatch(name) && !(punct.contains(name[0]) || punct.contains(name[-1]))
    if (!ok) return "Sorry, your name contains unacceptable characters."
    name = Str.lower(name)
    if (blacklist.contains(name)) return "Sorry, your name is unacceptable."
    return ""
}

for (i in 1..20) {
    var names = List.filled(2, null)
    var outer = false
    for (se in Indexed.new(["first", "last "])) {
        var name = Input.text("Enter your %(se.value) name : ", 1, 20)
        var msg = sanitizeInput.call(name)
        if (msg != "") {
            System.print(msg + "\n")
            outer = true
            break
        }
        names[se.index] = name
    }
    if (outer) continue
    persons.add(Person.new(names[0], names[1]))
    System.print()
}
var count = persons.count
System.print("The following %(count) person(s) have been added to the database:")
for (person in persons) System.print(person)
Output:

Sample (abridged) input/output. The ninth person's name contains a tab character.

Enter your first name : Mickey_mouse
Sorry, your name contains unacceptable characters.

Enter your first name : Bobby
Enter your last  name : Tables
Sorry, your name is unacceptable.

Enter your first name : Fred
Enter your last  name : rm -rf/
Sorry, your name contains unacceptable characters.

Enter your first name : David
Enter your last  name : Wipe
Sorry, your name is unacceptable.

Enter your first name : Beyoncé
Sorry, your name contains unacceptable characters.

Enter your first name : A-12
Sorry, your name contains unacceptable characters.

Enter your first name : 'Andrew-
Sorry, your name contains unacceptable characters.

Enter your first name : 👨👨‍👩‍👦
Sorry, your name contains unacceptable characters.

Enter your first name : Don     ald
Sorry, your name contains unacceptable characters.

Enter your first name : Eric
Enter your last  name : Schäfer        
Sorry, your name contains unacceptable characters.

Enter your first name : Blaine
Enter your last  name : Wolfeschlegelsteinhausenbergerdorff
Must have a length between 1 and 20 characters, try again.
Enter your last  name : Wolfeschlegelstein'f 

Enter your first name : Marilyn
Enter your last  name : Monroe

Enter your first name : Bridget
Enter your last  name : O'Riley

... (plus another 7 acceptable people)

The following 10 person(s) have been added to the database:
Blaine Wolfeschlegelstein'f 
Marilyn Monroe
Bridget O'Riley
... (plus 7 more)