Tuesday, May 11, 2010

A RegEx To Validate URIs

Goal:


To test-drive the creation of a regular expression to validate URIs based on the BNF in Appendix A of RFC 2396.

Motivation:


Other publicly available regular expressions were insufficient for my needs. After meticulously starting to implement my own regex by proceeding to translate every line of the BNF grammar, I realized this mode of working would be error-prone, and I needed an automated way to verify the correctness of the regular expression.

The Test:


The test cases can follow a systematic template:
  • we have a regular expression that we want to validate;
  • we need to verify that the regular expression accepts values that conform to the BNF grammar;
  • we also need to demonstrate that the regular expression does not accept values that do not conform to the BNF grammar.

For the purposes of this tutorial, we can plan on validating the URI parts that correspond to the components described in Appendix B of RFC 2396, i.e., scheme, authority, path, query and fragment, as well as the entirety of the URI-reference.
In order to demonstrate that the regex accepts values that conform to the BNF grammar, we can use a package like Xeger to generate random values based on our regular expression, and we will verify with a Java translation of the BNF that these generated values are indeed valid.
To verify that the regular expression does not accept values that do not conform to the BNF, we can again use Xeger, but we will negate the regular expression and verify that the generated values are not valid using our Java translation of the BNF.

Now that we have a plan, we can set up a new project by downloading Xeger and the engine that drives it, automaton.
Automaton can be installed directly into our local maven repository

  mvn install:install-file -DgroupId=dk.brics -DartifactId=automaton -Dversion=1.11 -Dpackaging=jar -Dfile=/path/to/automaton.jar

Xeger is already a maven project so we can either install the archive directly, or install the module from source.

An example pom.xml for our project might look something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>timezra</groupId>
  <artifactId>uri_regex</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>uri_regex</name>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>nl.flotsam</groupId>
      <artifactId>xeger</artifactId>
      <version>1.0-SNAPSHOT</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.7</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.google.collections</groupId>
      <artifactId>google-collections</artifactId>
      <version>1.0</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>



We are now ready to create the initial implementation for the test class, here called timezra.blog.uri_regex.URIRegExTest.java.
When we build our regular expressions from the bottom of the BNF grammar upwards, the first entire URI component that we encounter is the regex for fragments, so our first test case should validate that fragments work correctly.

package timezra.blog.uri_regex;

import static java.util.Arrays.asList;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import nl.flotsam.xeger.Xeger;
import org.junit.Test;
import com.google.common.base.Function;

public class URIRegExTest {

  private static final String FRAGMENT_REGEX = "([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*";

  private static final int NUMBER_OF_TESTS = 100000;

  @Test
  public void theFragmentRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(FRAGMENT_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return fragment(s);
      }
    });
  }

  private void verifyTheRegEx(final String pattern, final Function<String, Boolean> f) {
    final Xeger generator = new Xeger(pattern);
    for (int i = 0; i < NUMBER_OF_TESTS; i++) {
      final String generated = generator.generate();
      assertTrue(String.format("Verifying the value '%s'.", generated), f.apply(generated));
    }
    final Xeger negativeGenerator = new Xeger(negate(pattern));
    for (int i = 0; i < NUMBER_OF_TESTS; i++) {
      final String generated = negativeGenerator.generate();
      assertFalse(String.format("Verifying the negative '%s'.", generated), f.apply(generated));
    }
  }

  private static final Collection<Character> mark;
  private static final Collection<Character> reserved;
  static {
    mark = new HashSet<Character>(asList('-', '_', '.', '!', '~', '*', '\'', '(', ')'));
    reserved = new HashSet<Character>(asList(';', '/', '?', ':', '@', '&', '=', '+', '$', ','));
  }

  private boolean fragment(final String s) {
    return fragment(codePoints(s));
  }

  private boolean fragment(final Integer... codePoints) {
    return uric_star(codePoints);
  }

  private boolean uric_star(final Integer... codePoints) {
    return anyNumberOfUnreservedOrEscapedOrSpecial(reserved, codePoints);
  }

  private boolean unreserved(final int codePoint) {
    return alphanum(codePoint) || mark(codePoint);
  }

  private boolean mark(final int codePoint) {
    return contains(mark, codePoint);
  }

  private boolean escaped(final Integer... codePoints) {
    return codePoints.length == 3 && isChar(codePoints[0]) && codePoints[0] == '%' && hex(codePoints[1])
        && hex(codePoints[2]);
  }

  private boolean hex(final int codePoint) {
    if (digit(codePoint)) {
      return true;
    }
    if (alpha(codePoint)) {
      final int letter = Character.toLowerCase(codePoint);
      return 'a' <= letter && letter <= 'f';
    }
    return false;
  }

  private boolean alphanum(final int codePoint) {
    return digit(codePoint) || alpha(codePoint);
  }

  private boolean digit(final int codePoint) {
    if (isChar(codePoint)) {
      final char c = (char) codePoint;
      return '0' <= c && c <= '9';
    }
    return false;
  }

  private boolean alpha(final int codePoint) {
    if (isChar(codePoint)) {
      final char c = (char) codePoint;
      return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z';
    }
    return false;
  }

  private boolean anyNumberOfUnreservedOrEscapedOrSpecial(final Collection<Character> special,
      final Integer... codePoints) {
    return eachSymbol(new Function<Integer[], Boolean>() {
      public Boolean apply(final Integer[] ps) {
        return isUnreservedOrEscapedOrSpecial(special, ps);
      }
    }, codePoints);
  }

  private boolean isUnreservedOrEscapedOrSpecial(final Collection<Character> special, final Integer... codePoints) {
    if (codePoints.length == 1) {
      return unreserved(codePoints[0]) || contains(special, codePoints[0]);
    }
    return escaped(codePoints);
  }

  private boolean contains(final Collection<Character> cs, final int codePoint) {
    return isChar(codePoint) && cs.contains(Character.valueOf((char) codePoint));
  }

  private boolean eachSymbol(final Function<Integer[], Boolean> f, final Integer... codePoints) {
    for (int i = 0; i < codePoints.length;) {
      final Integer[] next = nextSymbol(i, codePoints);
      if (!f.apply(next)) {
        return false;
      }
      i += next.length;
    }
    return true;
  }

  private Integer[] nextSymbol(final int i, final Integer... codePoints) {
    if (codePoints[i] == '%') {
      if (i + 2 < codePoints.length) {
        return new Integer[] { codePoints[i], codePoints[i + 1], codePoints[i + 2] };
      }
    }
    return new Integer[] { codePoints[i] };
  }

  private boolean isChar(final int codePoint) {
    return Character.charCount(codePoint) == 1;
  }

  private String negate(final String regEx) {
    return String.format("~(%s)", regEx);
  }

  private Integer[] codePoints(final String s) {
    final Collection<Integer> codePoints = new ArrayList<Integer>();
    final int length = s == null ? 0 : s.length();
    for (int i = 0; i < length;) {
      final int codepoint = s.codePointAt(i);
      codePoints.add(codepoint);
      i += Character.charCount(codepoint);
    }
    return codePoints.toArray(new Integer[codePoints.size()]);
  }
}


As astute as you are, you might be asking why I have chosen to translate the BNF into Java to verify the correctness of the URI components, especially when there are already at least 3 perfectly good, validating URI implementations in Java: java.net.URI, org.eclipse.emf.common.util.URI and org.apache.commons.httpclient.URI, any of which could easily be incorporated into the project with just a couple of lines of Maven configuration, or no extra configuration at all. Originally, I had gone down this route, and to my surprise, none of these implementations suited my needs:
  • java.net.URI does not allow certain RFC 2396 URIs, such as "//"
  • org.eclipse.emf.common.util.URI is not strict enough (just look at the commented-out implementations of the validation methods)
  • org.apache.commons.httpclient.URI is character-based, not code-point-based, so I ran into a number of false positive results, particularly with Asian characters.

I strongly encourage you to setup a simple test yourself to double-check my work or to delve deeper into the implementation details to see how URIs are handled by the 3 different libraries.

Now that the test case is ready, we can incrementally proceed with our translation and verify our results at a few select checkpoints along the way.

Translating the BNF:



In most cases, components at the top of the BNF grammar are built from combinations of components that appeared earlier. For clarity, the BNF grammar has been reversed, so that the incremental translation can be read from the top of the page to the bottom. The resulting URI-reference regex at the bottom could surely be optimized or shortened since there is significant repetition, but for now, I am primarily concerned with its correctness, not with its size or speed.

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
digit = [0-9]

upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
upalpha = [A-Z]

lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
lowalpha = [a-z]

alpha = lowalpha | upalpha
alpha = [a-zA-Z]

alphanum = alpha | digit
alphanum = [a-zA-Z0-9]

hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
hex = [a-fA-F0-9]

escaped = "%" hex hex
escaped = %[a-fA-F0-9]{2}

mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
mark = [\-_\.!\~\*'\(\)]

unreserved = alphanum | mark
unreserved = [a-zA-Z0-9\-_\.!\~\*'\(\)]

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
reserved = [;/\?:\@\&=\+$,]

uric = reserved | unreserved | escaped
uric = [a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2})

fragment = *uric
fragment = ([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*

query = *uric
query = ([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*

pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
pchar = [a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2})

param = *pchar
param = ([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*

segment = *pchar *( ";" param )
segment = ([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*

path_segments = segment *( "/" segment )
path_segments = (([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*

path = [ abs_path | opaque_part ]
path = ((/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)|(([a-zA-Z0-9\-_\.!\~\*'\(\);\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*))?

port = *digit
port = [0-9]*

IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
IPv4address = [0-9]+((\.[0-9]+){3})

toplabel = alpha | alpha *( alphanum | "-" ) alphanum
toplabel = [a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?

domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
domainlabel = [a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?

hostname = *( domainlabel "." ) toplabel [ "." ]
hostname = (([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?

host = hostname | IPv4address
host = ((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3}))

hostport = host [ ":" port ]
hostport = (((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?

userinfo = *( unreserved | escaped | ";" | ":" | "&" | "=" | "+" | "$" | "," )
userinfo = ([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*

server = [ [ userinfo "@" ] hostport ]
server = (((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?

reg_name = 1*( unreserved | escaped | "$" | "," | ";" | ":" | "@" | "&" | "=" | "+" )
reg_name = ([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+

authority = server | reg_name
authority = (((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+

scheme = alpha *( alpha | digit | "+" | "-" | "." )
scheme = [a-zA-Z][a-zA-Z0-9\+\-\.]*

rel_segment = 1*( unreserved | escaped | ";" | "@" | "&" | "=" | "+" | "$" | "," )
rel_segment = ([a-zA-Z0-9\-_\.!\~\*'\(\);\@\&=\+$,]|(%[a-fA-F0-9]{2}))+

rel_path = rel_segment [ abs_path ]
rel_path = ([a-zA-Z0-9\-_\.!\~\*'\(\);\@\&=\+$,]|(%[a-fA-F0-9]{2}))+(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?

abs_path = "/" path_segments
abs_path = /(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*

net_path = "//" authority [ abs_path ]
net_path = //((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?

uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
uric_no_slash = [a-zA-Z0-9\-_\.!\~\*'\(\);\?:\@\&=\+$,]|(%[a-fA-F0-9]{2})

opaque_part = uric_no_slash *uric
opaque_part = ([a-zA-Z0-9\-_\.!\~\*'\(\);\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*

hier_part = ( net_path | abs_path ) [ "?" query ]
hier_part = ((//((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*))(\?([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?

relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
relativeURI = ((//((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)|(([a-zA-Z0-9\-_\.!\~\*'\(\);\@\&=\+$,]|(%[a-fA-F0-9]{2}))+(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?))(\?([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?

absoluteURI = scheme ":" ( hier_part | opaque_part )
absoluteURI = [a-zA-Z][a-zA-Z0-9\+\-\.]*:((((//((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*))(\?([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?)|(([a-zA-Z0-9\-_\.!\~\*'\(\);\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*))

URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
URI-reference = (([a-zA-Z][a-zA-Z0-9\+\-\.]*:((((//((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*))(\?([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?)|(([a-zA-Z0-9\-_\.!\~\*'\(\);\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)))|(((//((((([a-zA-Z0-9\-_\.!\~\*'\(\);:\&=\+$,]|(%[a-fA-F0-9]{2}))*)\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)\.)*([a-zA-Z](([a-zA-Z0-9\-])*[a-zA-Z0-9])?)(\.)?)|([0-9]+((\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\-_\.!\~\*'\(\)$,;:\@\&=\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)|(([a-zA-Z0-9\-_\.!\~\*'\(\);\@\&=\+$,]|(%[a-fA-F0-9]{2}))+(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\-_\.!\~\*'\(\):\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?))(\?([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?))?(\#([a-zA-Z0-9\-_\.!\~\*'\(\);/\?:\@\&=\+$,]|(%[a-fA-F0-9]{2}))*)?



Your final test case will perhaps look quite different from mine, but you may notice some similar patterns. Ultimately, this URI validation code should be moved to a separate component. Perhaps it could eventually be used as the basis for yet another Java URI implementation.

package timezra.blog.uri_regex;

import static java.util.Arrays.asList;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import nl.flotsam.xeger.Xeger;
import org.junit.Test;
import com.google.common.base.Function;

public class URIRegExTest {
  private static final String SCHEME_REGEX = "[a-zA-Z][a-zA-Z0-9\\+\\-\\.]*";
  private static final String AUTHORITY_REGEX = "(((([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);:\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)\\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)\\.)*([a-zA-Z](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)(\\.)?)|([0-9]+((\\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\)$,;:\\@\\&=\\+]|(%[a-fA-F0-9]{2}))+";
  private static final String PATH_REGEX = "((/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*)|(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*))?";
  private static final String QUERY_REGEX = "([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*";
  private static final String FRAGMENT_REGEX = "([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*";
  private static final String URI_REFERENCE_REGEX = "(([a-zA-Z][a-zA-Z0-9\\+\\-\\.]*:((((//((((([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);:\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)\\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)\\.)*([a-zA-Z](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)(\\.)?)|([0-9]+((\\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\)$,;:\\@\\&=\\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*))(\\?([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)?)|(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)))|(((//((((([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);:\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)\\@)?((((([a-zA-Z0-9](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)\\.)*([a-zA-Z](([a-zA-Z0-9\\-])*[a-zA-Z0-9])?)(\\.)?)|([0-9]+((\\.[0-9]+){3})))(:[0-9]*)?))?|([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\)$,;:\\@\\&=\\+]|(%[a-fA-F0-9]{2}))+)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?)|(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*)|(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))+(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*)(/(([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*(;([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\):\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)*))*)?))(\\?([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)?))?(\\#([a-zA-Z0-9\\-_\\.!\\~\\*'\\(\\);/\\?:\\@\\&=\\+$,]|(%[a-fA-F0-9]{2}))*)?";


  private static final int NUMBER_OF_TESTS = 100000;

  @Test
  public void theSchemeRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(SCHEME_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return scheme(s);
      }
    });
  }

  @Test
  public void theAuthorityRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(AUTHORITY_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return authority(s);
      }
    });
  }

  @Test
  public void thePathRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(PATH_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return path(s);
      }
    });
  }

  @Test
  public void theQueryRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(QUERY_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return query(s);
      }
    });
  }

  @Test
  public void theFragmentRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(FRAGMENT_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return fragment(s);
      }
    });
  }

  @Test
  public void theURIReferenceRegExIsNecessaryAndSufficient() throws Exception {
    verifyTheRegEx(URI_REFERENCE_REGEX, new Function<String, Boolean>() {
      public Boolean apply(final String s) {
        return uri_reference(s);
      }
    });
  }

  private void verifyTheRegEx(final String pattern, final Function<String, Boolean> f) {
    final Xeger generator = new Xeger(pattern);
    for (int i = 0; i < NUMBER_OF_TESTS; i++) {
      final String generated = generator.generate();
      assertTrue(String.format("Verifying the value '%s'.", generated), f.apply(generated));
    }
    final Xeger negativeGenerator = new Xeger(negate(pattern));
    for (int i = 0; i < NUMBER_OF_TESTS; i++) {
      final String generated = negativeGenerator.generate();
      assertFalse(String.format("Verifying the negative '%s'.", generated), f.apply(generated));
    }
  }

  private static final Collection<Character> mark;
  private static final Collection<Character> reserved;
  private static final Collection<Character> pchar;
  private static final Collection<Character> userinfo;
  private static final Collection<Character> reg_name;
  private static final Collection<Character> scheme;
  private static final Collection<Character> rel_segment;
  private static final Collection<Character> uric_no_slash;
  static {
    mark = new HashSet<Character>(asList('-', '_', '.', '!', '~', '*', '\'', '(', ')'));
    reserved = new HashSet<Character>(asList(';', '/', '?', ':', '@', '&', '=', '+', '$', ','));
    pchar = new HashSet<Character>(asList(':', '@', '&', '=', '+', '$', ','));
    userinfo = new HashSet<Character>(asList(';', ':', '&', '=', '+', '$', ','));
    reg_name = new HashSet<Character>(asList('$', ',', ';', ':', '@', '&', '=', '+'));
    scheme = new HashSet<Character>(asList('+', '-', '.'));
    rel_segment = new HashSet<Character>(asList(';', '@', '&', '=', '+', '$', ','));
    uric_no_slash = new HashSet<Character>(asList(';', '?', ':', '@', '&', '=', '+', '$', ','));
  }

  private boolean uri_reference(final String s) {
    return uri_reference(codePoints(s));
  }

  private boolean uri_reference(final Integer... codePoints) {
    final Integer[][] partition = partition(codePoints, '#');
    final Integer[] uri = partition[0];
    final Integer[] fragment = partition[1];
    return (uri.length == 0 || absoluteURI(uri) || relativeURI(uri))
        && (fragment.length == 0 || fragment(subarray(fragment, 1, fragment.length - 1)));
  }

  private boolean absoluteURI(final Integer... codePoints) {
    final Integer[][] partition = partition(codePoints, ':');
    final Integer[] scheme = partition[0];
    final Integer[] part = partition[1].length == 0 ? partition[1] : subarray(partition[1], 1,
        partition[1].length - 1);
    return scheme(scheme) && (heir_part(part) || opaque_part(part));
  }

  private boolean relativeURI(final Integer... codePoints) {
    final Integer[][] partition = partition(codePoints, '?');
    final Integer[] path = partition[0];
    final Integer[] query = partition[1];
    return (net_path(path) || abs_path(path) || rel_path(path))
        && (query.length == 0 || query(subarray(query, 1, query.length - 1)));
  }

  private boolean heir_part(final Integer... codePoints) {
    final Integer[][] partition = partition(codePoints, '?');
    final Integer[] path = partition[0];
    final Integer[] query = partition[1];
    return (net_path(path) || abs_path(path)) && (query.length == 0 || query(subarray(query, 1, query.length - 1)));
  }

  private boolean path(final String s) {
    return path(codePoints(s));
  }

  private boolean path(final Integer... codePoints) {
    return codePoints.length == 0 || abs_path(codePoints) || opaque_part(codePoints);
  }

  private boolean opaque_part(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    final Integer[] theFirst = nextSymbol(0, codePoints);
    final Integer[] theRest = subarray(codePoints, theFirst.length, codePoints.length - theFirst.length);
    return uric_no_slash(theFirst) && uric_star(theRest);
  }

  private boolean uric_no_slash(final Integer... codePoints) {
    return isUnreservedOrEscapedOrSpecial(uric_no_slash, codePoints);
  }

  private boolean net_path(final Integer... codePoints) {
    if (codePoints.length < 2) {
      return false;
    }
    final int firstChar = codePoints[0];
    final int secondChar = codePoints[1];
    final Integer[] authority_and_abs_path = subarray(codePoints, 2, codePoints.length - 2);
    final Integer[][] partition = partition(authority_and_abs_path, '/');
    final Integer[] authority = partition[0];
    final Integer[] abs_path = partition[1];
    return firstChar == '/' && secondChar == '/' && authority(authority)
        && (abs_path.length == 0 || abs_path(abs_path));
  }

  private boolean rel_path(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    final Integer[][] partition = partition(codePoints, '/');
    final Integer[] rel_segment = partition[0];
    final Integer[] abs_path = partition[1];
    return rel_segment(rel_segment) && (abs_path.length == 0 || abs_path(abs_path));
  }

  private boolean abs_path(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    final int slash = codePoints[0];
    final Integer[] path_segments = subarray(codePoints, 1, codePoints.length - 1);
    return slash == '/' && path_segments(path_segments);
  }

  private boolean rel_segment(final Integer... codePoints) {
    return atLeastOneUnreservedOrEscapedOrSpecial(rel_segment, codePoints);
  }

  private boolean scheme(final String s) {
    return scheme(codePoints(s));
  }

  private boolean scheme(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    if (!alpha(codePoints[0])) {
      return false;
    }
    for (int i = 1; i < codePoints.length; i++) {
      final Integer c = codePoints[i];
      if (!(alphanum(c) || contains(scheme, c))) {
        return false;
      }
    }
    return true;
  }

  private boolean authority(final String s) {
    return authority(codePoints(s));
  }

  private boolean authority(final Integer... codePoints) {
    return server(codePoints) || reg_name(codePoints);
  }

  private boolean reg_name(final Integer... codePoints) {
    return atLeastOneUnreservedOrEscapedOrSpecial(reg_name, codePoints);
  }

  private boolean server(final Integer... codePoints) {
    if (codePoints.length == 0) {
      return true;
    }
    final Integer[][] userinfoAndHostport = split('@', codePoints);
    if (userinfoAndHostport.length == 1) {
      return hostport(userinfoAndHostport[0]);
    }
    if (userinfoAndHostport.length == 2) {
      return userinfo(userinfoAndHostport[0]) && hostport(userinfoAndHostport[1]);
    }
    return false;
  }

  private boolean userinfo(final Integer... codePoints) {
    return anyNumberOfUnreservedOrEscapedOrSpecial(userinfo, codePoints);
  }

  private boolean hostport(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    final Integer[][] hostAndPort = split(':', codePoints);
    if (!host(hostAndPort[0])) {
      return false;
    }
    if (hostAndPort.length == 1 || hostAndPort.length == 2 && port(hostAndPort[1])) {
      return true;
    }
    return false;
  }

  private boolean host(final Integer... codePoints) {
    return hostname(codePoints) || ipv4address(codePoints);
  }

  private boolean hostname(final Integer... codePoints) {
    if (codePoints.length < 1) {
      return false;
    }
    final Integer[][] labels = split('.', codePoints);
    for (int i = 0; i < labels.length - 1; i++) {
      if (!domainlabel(labels[i])) {
        return false;
      }
    }
    return toplabel(labels[labels.length - 1]);
  }

  private boolean domainlabel(final Integer... codePoints) {
    return label(new Function<Integer, Boolean>() {
      public Boolean apply(final Integer i) {
        return alphanum(i);
      }
    }, codePoints);
  }

  private boolean toplabel(final Integer... codePoints) {
    return label(new Function<Integer, Boolean>() {
      public Boolean apply(final Integer i) {
        return alpha(i);
      }
    }, codePoints);
  }

  private boolean ipv4address(final Integer... codePoints) {
    final Integer[][] parts = split('.', codePoints);
    if (parts.length != 4) {
      return false;
    }
    for (final Integer[] part : parts) {
      if (part.length == 0) {
        return false;
      }
      for (final Integer d : part) {
        if (!digit(d)) {
          return false;
        }
      }
    }
    return true;
  }

  private boolean port(final Integer... codePoints) {
    for (final Integer i : codePoints) {
      if (!digit(i)) {
        return false;
      }
    }
    return true;
  }

  private boolean path_segments(final Integer... codePoints) {
    final Integer[][] path_segments = split('/', codePoints);
    for (final Integer[] segment : path_segments) {
      if (!eachSymbol(new Function<Integer[], Boolean>() {
        public Boolean apply(final Integer[] ps) {
          return segment(ps);
        }
      }, segment)) {
        return false;
      }
    }
    return true;
  }

  private boolean segment(final Integer... codePoints) {
    final Integer[][] segments = split(';', codePoints);
    for (final Integer[] pchar : segments) {
      if (!param(pchar)) {
        return false;
      }
    }
    return true;
  }

  private boolean param(final Integer... codePoints) {
    return anyNumberOfUnreservedOrEscapedOrSpecial(pchar, codePoints);
  }

  private boolean query(final String s) {
    return query(codePoints(s));
  }

  private boolean query(final Integer... codePoints) {
    return uric_star(codePoints);
  }

  private boolean fragment(final String s) {
    return fragment(codePoints(s));
  }

  private boolean fragment(final Integer... codePoints) {
    return uric_star(codePoints);
  }

  private boolean uric_star(final Integer... codePoints) {
    return anyNumberOfUnreservedOrEscapedOrSpecial(reserved, codePoints);
  }

  private boolean unreserved(final int codePoint) {
    return alphanum(codePoint) || mark(codePoint);
  }

  private boolean mark(final int codePoint) {
    return contains(mark, codePoint);
  }

  private boolean escaped(final Integer... codePoints) {
    return codePoints.length == 3 && isChar(codePoints[0]) && codePoints[0] == '%' && hex(codePoints[1])
        && hex(codePoints[2]);
  }

  private boolean hex(final int codePoint) {
    if (digit(codePoint)) {
      return true;
    }
    if (alpha(codePoint)) {
      final int letter = Character.toLowerCase(codePoint);
      return 'a' <= letter && letter <= 'f';
    }
    return false;
  }

  private boolean alphanum(final int codePoint) {
    return digit(codePoint) || alpha(codePoint);
  }

  private boolean digit(final int codePoint) {
    if (isChar(codePoint)) {
      final char c = (char) codePoint;
      return '0' <= c && c <= '9';
    }
    return false;
  }

  private boolean alpha(final int codePoint) {
    if (isChar(codePoint)) {
      final char c = (char) codePoint;
      return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z';
    }
    return false;
  }

  private Integer[][] partition(final Integer[] is, final char c) {
    final int thePivot = indexOf(c, is);
    final Integer[] theLeft;
    final Integer[] theRight;
    if (thePivot < 0) {
      theLeft = is;
      theRight = new Integer[] {};
    } else {
      theLeft = subarray(is, 0, thePivot);
      theRight = subarray(is, thePivot, is.length - thePivot);
    }
    return new Integer[][] { theLeft, theRight };
  }

  private int indexOf(final char c, final Integer... codePoints) {
    for (int i = 0; i < codePoints.length; i++) {
      final int next = codePoints[i];
      if (next == c) {
        return i;
      }
    }
    return -1;
  }

  private boolean atLeastOneUnreservedOrEscapedOrSpecial(final Collection<Character> special,
      final Integer... codePoints) {
    return codePoints.length > 0 && anyNumberOfUnreservedOrEscapedOrSpecial(special, codePoints);
  }

  private boolean anyNumberOfUnreservedOrEscapedOrSpecial(final Collection<Character> special,
      final Integer... codePoints) {
    return eachSymbol(new Function<Integer[], Boolean>() {
      public Boolean apply(final Integer[] ps) {
        return isUnreservedOrEscapedOrSpecial(special, ps);
      }
    }, codePoints);
  }

  private boolean isUnreservedOrEscapedOrSpecial(final Collection<Character> special, final Integer... codePoints) {
    if (codePoints.length == 1) {
      return unreserved(codePoints[0]) || contains(special, codePoints[0]);
    }
    return escaped(codePoints);
  }

  private boolean contains(final Collection<Character> cs, final int codePoint) {
    return isChar(codePoint) && cs.contains(Character.valueOf((char) codePoint));
  }

  private boolean label(final Function<Integer, Boolean> firstChar, final Integer... codePoints) {
    if (codePoints.length == 0) {
      return false;
    }
    if (codePoints.length == 1) {
      return firstChar.apply(codePoints[0]);
    }
    final Integer first = codePoints[0];
    final Integer alphanum = codePoints[codePoints.length - 1];
    final Integer[] alphanums = subarray(codePoints, 1, codePoints.length - 2);
    if (!firstChar.apply(first) || !alphanum(alphanum)) {
      return false;
    }
    for (final Integer i : alphanums) {
      if (i != '-' || !alphanum(i)) {
        return false;
      }
    }
    return true;
  }

  private Integer[][] split(final char delim, final Integer... codePoints) {
    final Collection<Integer[]> parts = new ArrayList<Integer[]>();
    for (int i = 0; i < codePoints.length;) {
      final Integer[] part = nextPart(delim, i, codePoints);
      parts.add(part);
      i += part.length + 1;
    }
    return parts.toArray(new Integer[parts.size()][]);
  }

  private Integer[] nextPart(final char delim, final int start, final Integer... codePoints) {
    int i = start;
    for (; i < codePoints.length; i++) {
      if (codePoints[i] == delim) {
        break;
      }
    }
    final Integer[] pchar = subarray(codePoints, start, i - start);
    return pchar;
  }

  private Integer[] subarray(final Integer[] is, final int start, final int length) {
    return subarray(Integer.class, is, start, length);
  }

  @SuppressWarnings("unchecked")
  private <T> T[] subarray(final Class<?> type, final T[] ts, final int start, final int length) {
    final T[] sub = (T[]) Array.newInstance(type, length);
    System.arraycopy(ts, start, sub, 0, length);
    return sub;
  }

  private boolean eachSymbol(final Function<Integer[], Boolean> f, final Integer... codePoints) {
    for (int i = 0; i < codePoints.length;) {
      final Integer[] next = nextSymbol(i, codePoints);
      if (!f.apply(next)) {
        return false;
      }
      i += next.length;
    }
    return true;
  }

  private Integer[] nextSymbol(final int i, final Integer... codePoints) {
    if (codePoints[i] == '%') {
      if (i + 2 < codePoints.length) {
        return new Integer[] { codePoints[i], codePoints[i + 1], codePoints[i + 2] };
      }
    }
    return new Integer[] { codePoints[i] };
  }

  private boolean isChar(final int codePoint) {
    return Character.charCount(codePoint) == 1;
  }

  private String negate(final String regEx) {
    return String.format("~(%s)", regEx);
  }

  private Integer[] codePoints(final String s) {
    final Collection<Integer> codePoints = new ArrayList<Integer>();
    final int length = s == null ? 0 : s.length();
    for (int i = 0; i < length;) {
      final int codepoint = s.codePointAt(i);
      codePoints.add(codepoint);
      i += Character.charCount(codepoint);
    }
    return codePoints.toArray(new Integer[codePoints.size()]);
  }
}



How to Use:


Now that we have a regex to validate URIs, we can use it for such purposes as validating UML stereotype attributes. Always keep in mind that regex parsers can vary quite a bit in their supported operations. I have tried to limit this regex to a standard set of characters, but a few still require tweaking for the EMF RegEx parser to process them, i.e.,  &  ~  @  all must be unescaped (the backslashes must be removed, whereas automaton requires them).

Conclusion:


The primary purpose of this tutorial has been to provide a complete regular expression that validates the syntax of a candidate URI. Along the way, we have also discovered a technique for test-driving the incremental creation of complex regular expressions.

No comments: