Regular expressions

Revision as of 20:35, 21 November 2009

The goal of this task is

to match a string against a regular expression
to substitute part of a string using a regular expression

ALGOL 68

The routines grep in string and sub in string are not part of ALGOL 68's standard prelude.

Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

<lang algol68>INT match=0, no match=1, out of memory error=2, other error=3;

STRING str := "i am a string";

# Match: #

STRING m := "string$";
INT start, end;
IF grep in string(m, str, start, end) = match THEN printf(($"Ends with """g""""l$, str[start:end])) FI;

# Replace: #

IF sub in string(" a ", " another ",str) = match THEN printf(($gl$, str)) FI;</lang>
Output:

Ends with "string"
i am another string

Standard ALGOL 68 does have a primordial form of pattern matching called a format. Formats are designed to extract values from input data, but they can also be used for outputting (and transputting) the original data.

Works with: ALGOL 68 version Standard - But declaring book as flex[]flex[]string

Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

For example:
<lang algol68>FORMAT pattern = $ddd" "c("cats","dogs")$;
FILE file;
STRING book;
associate(file, book);
on value error(file, (REF FILE f)BOOL: stop);
on format error(file, (REF FILE f)BOOL: stop);

book := "100 dogs";
STRUCT(INT count, type) dalmatians;

getf(file, (pattern, dalmatians));
print(("Dalmatians: ", dalmatians, new line));
count OF dalmatians +:=1;
printf(($"Gives: "$, pattern, dalmatians, $l$))</lang>
Output:

Dalmatians:        +100         +2
Gives: 101 dogs
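
For comparison, the same read, modify and write round trip can be sketched with a conventional regular expression. This is a Python sketch, not ALGOL 68; the names ANIMALS, PATTERN and bump are made up for illustration, and the numeric index that the ALGOL 68 choice pattern yields is emulated here with a tuple lookup.

```python
import re

# Three digits, a space, then one of the listed choices; this mirrors
# the ALGOL 68 format $ddd" "c("cats","dogs")$.
ANIMALS = ("cats", "dogs")
PATTERN = re.compile(r"(\d{3}) (" + "|".join(ANIMALS) + r")")

def bump(book):
    """Parse e.g. '100 dogs', add one to the count, and format it back."""
    m = PATTERN.fullmatch(book)
    if m is None:
        raise ValueError("input does not fit the pattern: %r" % book)
    count = int(m.group(1))
    kind = ANIMALS.index(m.group(2)) + 1   # 1-based, like the ALGOL 68 choice
    return "%03d %s" % (count + 1, ANIMALS[kind - 1])

print(bump("100 dogs"))
```

Unlike a format, the regular expression only describes the input; the output layout has to be spelled out separately in the return statement.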

AutoHotkey

<lang AutoHotkey>MsgBox % foundpos := RegExMatch("Hello World", "World$")
MsgBox % replaced := RegExReplace("Hello World", "World$", "yourself")</lang>

AWK

AWK supports regular expressions, which are typically delimited by slashes, and the "~" match operator:
<lang awk>$ awk '{if($0~/[A-Z]/)print "uppercase detected"}'
abc
ABC
uppercase detected</lang>
As shorthand, a regular expression in the condition part fires if it matches an input line:
<lang awk>$ awk '/[A-Z]/{print "uppercase detected"}'
def
DeF
uppercase detected</lang>
For substitution, the first argument can be a regular expression, while the replacement string is constant (except that '&' in it is replaced by the matched text):
<lang awk>$ awk '{gsub(/[A-Z]/,"*");print}'
abCDefG
ab**ef*
$ awk '{gsub(/[A-Z]/,"(&)");print}'
abCDefGH
ab(C)(D)ef(G)(H)</lang>
This variant matches one or more uppercase letters in one round:
<lang awk>$ awk '{gsub(/[A-Z]+/,"(&)");print}'
abCDefGH
ab(CD)ef(GH)</lang>
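
The role of '&' in the replacement string may be easier to see next to an equivalent in another language. The following Python sketch is only an illustration (wrap_uppercase is a made-up helper name); it uses the group reference \g<0>, which stands for the whole match in the same way that '&' does in AWK.

```python
import re

def wrap_uppercase(line, one_at_a_time=True):
    """Wrap uppercase letters in parentheses, like gsub(/[A-Z]/,"(&)")."""
    pattern = r"[A-Z]" if one_at_a_time else r"[A-Z]+"
    # \g<0> plays the role of AWK's '&': the text of the whole match.
    return re.sub(pattern, r"(\g<0>)", line)

print(wrap_uppercase("abCDefGH"))                       # ab(C)(D)ef(G)(H)
print(wrap_uppercase("abCDefGH", one_at_a_time=False))  # ab(CD)ef(GH)
```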

C

Works with: POSIX

POSIX defines functions for regex matching (regcomp, regexec), but nothing for substitution, so the replacement string must be assembled by hand. The substitution code below, which looks complex, could be turned into a function.

<lang c>#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
#include <string.h>

int main() {

  regex_t preg;
  regmatch_t substmatch[1];
  const char *tp = "string$";
  const char *t1 = "this is a matching string";
  const char *t2 = "this is not a matching string!";
  const char *ss = "istyfied";
  
  regcomp(&preg, "string$", REG_EXTENDED);
  printf("'%s' %smatched with '%s'\n", t1,
                                       (regexec(&preg, t1, 0, NULL, 0)==0) ? "" : "did not ", tp);
  printf("'%s' %smatched with '%s'\n", t2,
                                       (regexec(&preg, t2, 0, NULL, 0)==0) ? "" : "did not ", tp);
  regfree(&preg);
  /* replace the first match of "a[a-z]+" with "istyfied" */
  regcomp(&preg, "a[a-z]+", REG_EXTENDED);
  if ( regexec(&preg, t1, 1, substmatch, 0) == 0 )
  {
     //fprintf(stderr, "%d, %d\n", substmatch[0].rm_so, substmatch[0].rm_eo);
     char *ns = malloc(substmatch[0].rm_so + 1 + strlen(ss) +
                       (strlen(t1) - substmatch[0].rm_eo) + 2);
     memcpy(ns, t1, substmatch[0].rm_so+1);
     memcpy(&ns[substmatch[0].rm_so], ss, strlen(ss));
     memcpy(&ns[substmatch[0].rm_so+strlen(ss)], &t1[substmatch[0].rm_eo],
               strlen(&t1[substmatch[0].rm_eo]));
     ns[ substmatch[0].rm_so + strlen(ss) +
         strlen(&t1[substmatch[0].rm_eo]) ] = 0;
     printf("mod string: '%s'\n", ns);
     free(ns); 
  } else {
     printf("the string '%s' is the same: no matching!\n", t1);
  }
  regfree(&preg);
  
  return 0;

}</lang>
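
The by-hand substitution above boils down to three pieces: the text before the match, the replacement, and the text after the match. A minimal sketch of that idea, written in Python rather than C to keep it short, could look as follows; replace_first is a hypothetical helper name, and the match offsets play the part of rm_so and rm_eo.

```python
import re

def replace_first(pattern, replacement, text):
    """Replace the first match by hand, using only the match offsets."""
    m = re.search(pattern, text)
    if m is None:
        return text                      # no match: string is unchanged
    start, end = m.span()                # like rm_so and rm_eo in the C code
    return text[:start] + replacement + text[end:]

print(replace_first(r"a[a-z]+", "istyfied", "this is a matching string"))
# this is a mistyfied string
```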

C++

Works with: g++ version 4.0.2

Library: Boost

<lang cpp>#include <iostream>
#include <string>
#include <iterator>
#include <boost/regex.hpp>

int main() {

 boost::regex re(".* string$");
 std::string s = "Hi, I am a string";

 // match the complete string
 if (boost::regex_match(s, re))
   std::cout << "The string matches.\n";
 else
   std::cout << "Oops - not found?\n";

 // match a substring
 boost::regex re2(" a.*a");
 boost::smatch match;
 if (boost::regex_search(s, match, re2))
 {
   std::cout << "Matched " << match.length()
             << " characters starting at " << match.position() << ".\n";
   std::cout << "Matched character sequence: \""
             << match.str() << "\"\n";
 }
 else
 {
   std::cout << "Oops - not found?\n";
 }

 // replace a substring
 std::string dest_string;
 boost::regex_replace(std::back_inserter(dest_string),
                      s.begin(), s.end(),
                      re2,
                      "'m now a changed");
 std::cout << dest_string << std::endl;

}</lang>

C#

<lang csharp>using System;
using System.Text.RegularExpressions;

class Program {

   static void Main(string[] args) {
       string str = "I am a string";

       if (new Regex("string$").IsMatch(str)) {
           Console.WriteLine("Ends with string.");
       }

       str = new Regex(" a ").Replace(str, " another ");
       Console.WriteLine(str);
   }

}</lang>

Common Lisp

Translation of: Perl

Uses CL-PPCRE - Portable Perl-compatible regular expressions for Common Lisp.

<lang lisp>(let ((string "I am a string"))

 (when (cl-ppcre:scan "string$" string)
   (write-line "Ends with string"))
 (unless (cl-ppcre:scan "^You" string )
   (write-line "Does not start with 'You'")))</lang>

Substitute

<lang lisp>(let* ((string "I am a string")

      (string (cl-ppcre:regex-replace " a " string " another ")))
 (write-line string))</lang>

Test and Substitute

<lang lisp>(let ((string "I am a string"))

 (multiple-value-bind (string matchp)
     (cl-ppcre:regex-replace "\\bam\\b" string "was")
   (when matchp
     (write-line "I was able to find and replace 'am' with 'was'."))))</lang>

D

<lang d>import std.stdio, std.regexp;

void main() {

   string s = "I am a string";

   // Test:
   if (search(s, r"string$"))
       writefln("Ends with 'string'");

   // Test, storing the regular expression:
   auto re1 = RegExp(r"string$");
   if (re1.search(s).test)
       writefln("Ends with 'string'");

   // Substitute:
   writefln(sub(s, " a ", " another "));

   // Substitute, storing the regular expression:
   auto re2 = RegExp(" a ");
   writefln(re2.replace(s, " another "));

}</lang>

Note that std.string provides plain string functions that perform these particular operations faster than a regular expression would.
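
The same trade-off exists in most languages. As a rough illustration in Python (not D, and not a statement about the std.string API), a fixed suffix test and a literal replacement need no regular expression at all:

```python
s = "I am a string"

# Suffix test without a regular expression (compare: /string$/).
ends_with_string = s.endswith("string")

# Literal substitution without a regular expression (compare: s/ a / another /).
replaced = s.replace(" a ", " another ", 1)

print(ends_with_string, replaced)
```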

Erlang

<lang erlang>match() ->
    String = "This is a string",
    case re:run(String, "string$") of
        {match,_} -> io:format("Ends with 'string'~n");
        _ -> ok
    end.

substitute() ->
    String = "This is a string",
    NewString = re:replace(String, " a ", " another ", [{return, list}]),
    io:format("~s~n",[NewString]).</lang>

Forth

Library: Forth Foundation Library

Test/Match <lang forth>include ffl/rgx.fs

\ Create a regular expression variable 'exp' in the dictionary

rgx-create exp

\ Compile an expression

s" Hello (World)" exp rgx-compile [IF]

 .( Regular expression successfully compiled.) cr

[THEN]

\ (Case sensitive) match a string with the expression

s" Hello World" exp rgx-cmatch? [IF]

 .( String matches with the expression.) cr

[ELSE]

 .( No match.) cr

[THEN]</lang>

Haskell

Test <lang haskell>import Text.Regex

str = "I am a string"

main = case matchRegex (mkRegex ".*string$") str of
  Just _  -> putStrLn "ends with 'string'"
  Nothing -> return ()</lang>

Substitute <lang haskell>import Text.Regex

orig = "I am the original string"
result = subRegex (mkRegex "original") orig "modified"

main = putStrLn result</lang>

J

J's regex support is built on top of PCRE.

<lang j>load'regex'               NB. Load regex library
str =: 'I am a string'   NB. String used in examples.</lang>

Matching:

<lang j>   '.*string$' rxeq str     NB. 1 is true, 0 is false
1</lang>

Substitution:

<lang j>   ('am';'am still') rxrplc str
I am still a string</lang>

Java

Works with: Java version 1.5+

Test

<lang java>String str = "I am a string";
if (str.matches(".*string$")) {

 System.out.println("ends with 'string'");

}</lang>

Substitute

<lang java>String orig = "I am the original string";
String result = orig.replaceAll("original", "modified");
// result is now "I am the modified string"</lang>

JavaScript

Test/Match <lang javascript>var subject = "Hello world!";

// Two different ways to create the RegExp object
// Both examples use the exact same pattern... matching "hello"
var re_PatternToMatch = /Hello (World)/i; // creates a RegExp literal with case-insensitivity
var re_PatternToMatch2 = new RegExp("Hello (World)", "i");

// Test for a match - return a bool
var isMatch = re_PatternToMatch.test(subject);

// Get the match details
// Returns an array with the match's details
// matches[0] == "Hello world"
// matches[1] == "world"
var matches = re_PatternToMatch2.exec(subject);</lang>

Substitute <lang javascript>var subject = "Hello world!";

// Perform a string replacement
// newSubject == "Replaced!"
var newSubject = subject.replace(re_PatternToMatch, "Replaced");</lang>

M4

<lang M4>regexp(`GNUs not Unix', `\<[a-z]\w+')
regexp(`GNUs not Unix', `\<[a-z]\(\w+\)', `a \& b \1 c')</lang>

Output:

5
a not b ot c

Objective-C

Test

Works with: Mac OS X version 10.4+

<lang objc>NSString *str = @"I am a string";
NSString *regex = @".*string$";

NSPredicate *pred = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regex];

if ([pred evaluateWithObject:str]) {

   NSLog(@"ends with 'string'");

}</lang>
Unfortunately this method cannot find the location of the match or do substitution.

OCaml

With the standard library

Test <lang ocaml>#load "str.cma";;
let str = "I am a string";;
try
  ignore(Str.search_forward (Str.regexp ".*string$") str 0);
  print_endline "ends with 'string'"
with Not_found -> ()
;;
</lang>

Substitute <lang ocaml>#load "str.cma";;
let orig = "I am the original string";;
let result = Str.global_replace (Str.regexp "original") "modified" orig;;
(* result is now "I am the modified string" *)</lang>

Using Pcre

Library: ocaml-pcre

<lang ocaml>let matched pat str =

 try ignore(Pcre.exec ~pat str); (true)
 with Not_found -> (false)

let () =

 Printf.printf "matched = %b\n" (matched "string$" "I am a string");
 Printf.printf "Substitute: %s\n"
   (Pcre.replace ~pat:"original" ~templ:"modified" "I am the original string")

</lang>

Perl

Works with: Perl version 5.8.8

Test <lang perl>$string = "I am a string";
if ($string =~ /string$/) {

  print "Ends with 'string'\n";

}

if ($string !~ /^You/) {

  print "Does not start with 'You'\n";

}</lang>

Substitute <lang perl>$string = "I am a string";
$string =~ s/ a / another /; # makes "I am a string" into "I am another string"
print $string;</lang>

Test and Substitute <lang perl>$string = "I am a string";
if ($string =~ s/\bam\b/was/) { # \b is a word boundary

  print "I was able to find and replace 'am' with 'was'\n";

}</lang>

Options <lang perl># add the following just after the last / for additional control
# g = globally (match as many as possible)
# i = case-insensitive
# s = treat all of $string as a single line ("." also matches line breaks in the content)
# m = multi-line (^ and $ also match at the start and end of each line)

$string =~ s/i/u/ig; # would change "I am a string" into "u am a strung"</lang>
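
To make the effect of these options concrete, here is a hedged sketch of rough equivalents in Python's re module (Python rather than Perl, purely for illustration). The variable names are made up; note that re.sub replaces all matches unless count is given, so /g corresponds to the default behaviour there.

```python
import re

string = "I am a string"

# /i and /g together: re.sub replaces every match by default (like /g),
# and re.IGNORECASE corresponds to /i.
all_replaced = re.sub(r"i", "u", string, flags=re.IGNORECASE)

# Without /g: count=1 stops after the first replacement.
first_replaced = re.sub(r"i", "u", string, count=1, flags=re.IGNORECASE)

text = "first line\nsecond line"

# /m: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# /s: "." also matches a newline, so the match can span lines.
spans_lines = re.search(r"first.*second", text, flags=re.DOTALL) is not None

print(all_replaced)    # u am a strung
print(first_replaced)  # u am a string
print(line_starts)     # ['first', 'second']
print(spans_lines)     # True
```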

PHP

Works with: PHP version 5.2.0

<lang php> $string = 'I am a string';</lang>

Test

<lang php>if (preg_match('/string$/', $string)) {

   echo "Ends with 'string'\n";

}</lang>

Replace

<lang php>$string = preg_replace('/\ba\b/', 'another', $string);
echo "Found 'a' and replaced it with 'another', resulting in this string: $string\n";</lang>

PowerShell

<lang powershell>"I am a string" -match '\bstr'       # true
"I am a string" -replace 'a\b','no'  # I am no string</lang>
By default both the -match and -replace operators are case-insensitive. They can be made case-sensitive by using the -cmatch and -creplace operators.

Python

<lang python>import re

string = "This is a string"

if re.search('string$', string):
    print("Ends with string.")

string = re.sub(" a ", " another ", string)
print(string)</lang>

R

First, define some strings.
<lang R>pattern <- "string"
text1 <- "this is a matching string"
text2 <- "this does not match"</lang>
Matching with grep. The indices of the texts containing matches are returned.
<lang R>grep(pattern, c(text1, text2))  # 1</lang>
Matching with regexpr. The positions of the starts of the matches are returned, along with the lengths of the matches.
<lang R>regexpr(pattern, c(text1, text2))</lang>

[1] 20 -1
attr(,"match.length")
[1]  6 -1
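
The meaning of this result (1-based start positions, match lengths, and -1 where nothing matched) can be mimicked in a few lines. The following is a Python sketch for illustration only; regexpr_like is a made-up name and it does not reproduce R's attribute mechanism.

```python
import re

def regexpr_like(pattern, texts):
    """Return 1-based match starts and match lengths, -1 where nothing matches."""
    starts, lengths = [], []
    for t in texts:
        m = re.search(pattern, t)
        starts.append(m.start() + 1 if m else -1)
        lengths.append(len(m.group(0)) if m else -1)
    return starts, lengths

print(regexpr_like("string", ["this is a matching string", "this does not match"]))
# ([20, -1], [6, -1])
```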

Replacement <lang R>gsub(pattern, "pair of socks", c(text1, text2))</lang>

[1] "this is a matching pair of socks" "this does not match"

Raven

<lang raven>'i am a string' as str</lang>

Match:

<lang raven>str m/string$/ if "Ends with 'string'\n" print</lang>

Replace:

<lang raven>str r/ a / another / print</lang>

Ruby

Test <lang ruby>string="I am a string"
puts "Ends with 'string'" if string[/string$/]
puts "Does not start with 'You'" if !string[/^You/]</lang>

Substitute <lang ruby>puts string.gsub(/ a /,' another ')
# or
string[/ a /]=' another '
puts string</lang>

Substitute using block <lang ruby>puts(string.gsub(/\bam\b/) do |match|

      puts "I found #{match}"
      #place "was" instead of the match
      "was"
    end)</lang>

Scala

Define <lang scala>val Bottles1 = "(\\d+) bottles of beer".r // syntactic sugar
val Bottles2 = """(\d+) bottles of beer""".r // using triple-quotes to preserve backslashes
val Bottles3 = new scala.util.matching.Regex("(\\d+) bottles of beer") // standard
val Bottles4 = new scala.util.matching.Regex("""(\d+) bottles of beer""", "bottles") // with named groups</lang>

Search and replace with string methods: <lang scala>"99 bottles of beer" matches "(\\d+) bottles of beer" // the full string must match
"99 bottles of beer" replace ("99", "98") // literal replacement of every occurrence
"99 bottles of beer" replaceAll ("b", "B") // regex replacement of every match</lang>

Search with regex methods: <lang scala>"\\d+".r findFirstIn "99 bottles of beer" // returns first partial match, or None
"\\w+".r findAllIn "99 bottles of beer" // returns all partial matches as an iterator
"\\s+".r findPrefixOf "99 bottles of beer" // returns a matching prefix, or None
Bottles4 findFirstMatchIn "99 bottles of beer" // returns a "Match" object, or None
Bottles4 findPrefixMatchOf "99 bottles of beer" // same thing, for prefixes
val bottles = (Bottles4 findFirstMatchIn "99 bottles of beer").get.group("bottles") // Getting a group by name</lang>

Using pattern matching with regex: <lang scala>val Some(bottles) = Bottles4 findPrefixOf "99 bottles of beer" // throws an exception if the matching fails; full string must match
for {
  line <- """|99 bottles of beer on the wall
             |99 bottles of beer
             |Take one down, pass it around
             |98 bottles of beer on the wall""".stripMargin.lines
} line match {
  case Bottles1(bottles) => println("There are still "+bottles+" bottles.") // full string must match, so this will match only once
  case _ =>
}
for {
  matched <- "(\\w+)".r findAllIn "99 bottles of beer" matchData // matchData converts to an Iterator of Match
} println("Matched from "+matched.start+" to "+matched.end)</lang>

Replacing with regex: <lang scala>Bottles2 replaceFirstIn ("99 bottles of beer", "98 bottles of beer")
Bottles3 replaceAllIn ("99 bottles of beer", "98 bottles of beer")</lang>

Slate

This library is still in its early stages. There isn't currently a feature to replace a substring.

<lang slate>(Regex Matcher newOn: '^(([^:/?#]+)\\:)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?')

   `>> [match: 'http://slatelanguage.org/test/page?query'. subexpressionMatches]

"==> {"Dictionary traitsWindow" 0 -> 'http:'. 1 -> 'http'. 2 -> '//slatelanguage.org'.

      3 -> 'slatelanguage.org'. 4 -> '/test/page'. 5 -> '?query'. 6 -> 'query'. 7 -> Nil}"</lang>

Smalltalk

<lang smalltalk>|re s s1|
re := Regex fromString: '[a-z]+ing'.
s := 'this is a matching string'.
s1 := 'this does not match'.

(s =~ re)
ifMatched: [ :b |
   b match displayNl
].
(s1 =~ re)
ifMatched: [ :b |
   'Strangely matched!' displayNl
]
ifNotMatched: [
   'no match!' displayNl
].

(s replacingRegex: re with: 'modified') displayNl.</lang>

Tcl

Test using regexp: <lang tcl>set theString "I am a string"
if {[regexp -- {string$} $theString]} {

   puts "Ends with 'string'"

}

if {![regexp -- {^You} $theString]} {

   puts "Does not start with 'You'"

}</lang>

Extract substring using regexp <lang tcl>set theString "This string has >123< a number in it"
if {[regexp -- {>(\d+)<} $theString -> number]} {

   puts "Contains the number $number"

}</lang>

Substitute using regsub <lang tcl>set theString "I am a string"
puts [regsub -- { +a +} $theString { another }]</lang>

Toka

Toka's regular expression library allows for matching, but does not yet provide for replacing elements within strings.

<lang toka>#! Include the regex library
needs regex

#! The two test strings
" This is a string" is-data test.1
" Another string" is-data test.2

#! Create a new regex named 'expression' which tries
#! to match strings beginning with 'This'.
" ^This" regex: expression

#! An array to store the results of the match
#! (Element 0 = starting offset, Element 1 = ending offset of match)
2 cells is-array match

#! Try both test strings against the expression.
#! try-regex will return a flag. -1 is TRUE, 0 is FALSE
expression test.1 2 match try-regex .
expression test.2 2 match try-regex .</lang>

Vedit macro language

Vedit can perform searches and matching with either regular expressions, pattern matching codes or plain text. These examples use regular expressions.

Match text at cursor location: <lang vedit>if (Match(".* string$", REGEXP)==0) {

   Statline_Message("This line ends with 'string'")

}</lang>

Search for a pattern: <lang vedit>if (Search("string$", REGEXP+NOERR)) {

   Statline_Message("'string' at end of line found")

}</lang>

Replace: <lang vedit>Replace(" a ", " another ", REGEXP+NOERR)</lang>