The goal of this tutorial is to test-drive the creation of a regular expression that validates URIs, based on the BNF in Appendix A of RFC 2396.
Other publicly available regular expressions were insufficient for my needs. After I started implementing my own regex by meticulously translating every line of the BNF grammar, I realized that this way of working would be error-prone and that I needed an automated way to verify the correctness of the regular expression.
The test cases can follow a systematic template:
- we have a regular expression that we want to validate;
- we need to verify that the regular expression accepts values that conform to the BNF grammar;
- we also need to demonstrate that the regular expression does not accept values that do not conform to the BNF grammar.
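The template above can be sketched on a single small rule from the grammar, escaped = "%" hex hex. This is only an illustration: the class name, the helper names and the hand-picked samples are my own assumptions, and the hand-written check stands in for a full Java translation of the BNF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexTemplateSketch {

    // Regex under test for the rule: escaped = "%" hex hex
    static final Pattern ESCAPED = Pattern.compile("%[0-9A-Fa-f]{2}");

    // hex = digit | "A".."F" | "a".."f" (ASCII only, written out by hand)
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f');
    }

    // Hand-written reference check for the same rule.
    static boolean isEscaped(String s) {
        return s.length() == 3 && s.charAt(0) == '%' && isHex(s.charAt(1)) && isHex(s.charAt(2));
    }

    // Returns every sample on which the regex and the reference check disagree.
    static List<String> disagreements(Pattern regex, Predicate<String> reference, List<String> samples) {
        return samples.stream()
                .filter(s -> regex.matcher(s).matches() != reference.test(s))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Conforming and non-conforming samples, chosen by hand for this sketch.
        List<String> samples = Arrays.asList("%20", "%aF", "%2", "%GG", "20", "%200", "");
        System.out.println("Disagreements: "
                + disagreements(ESCAPED, RegexTemplateSketch::isEscaped, samples));
    }
}
```

An empty list of disagreements means that the regex accepted every conforming sample and rejected every non-conforming one.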
For the purposes of this tutorial, we can plan on validating the URI parts that correspond to the components described in Appendix B of RFC 2396, i.e., scheme, authority, path, query and fragment, as well as the entirety of the URI-reference.
In order to demonstrate that the regex accepts values that conform to the BNF grammar, we can use a package like Xeger to generate random values based on our regular expression, and we will verify with a Java translation of the BNF that these generated values are indeed valid.
To verify that the regular expression does not accept values that do not conform to the BNF, we can again use Xeger, but we will negate the regular expression and verify that the generated values are not valid using our Java translation of the BNF.
Now that we have a plan, we can set up a new project by downloading Xeger and the engine that drives it, automaton.
Automaton can be installed directly into our local Maven repository.
Xeger is already a Maven project, so we can either install the archive directly or install the module from source.
An example pom.xml for our project might look something like this:
We are now ready to create the initial implementation for the test class, here called timezra.blog.uri_regex.URIRegExTest.java.
When we build our regular expressions from the bottom of the BNF grammar upwards, the first entire URI component that we encounter is the regex for fragments, so our first test case should validate that fragments work correctly.
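As a rough sketch of what the fragment regex might look like, here is a bottom-up composition written for java.util.regex rather than for automaton. The constant names are my own, and automaton's syntax differs in places, so treat this as an approximation of the idea rather than the final expression. The rules come from Appendix A: fragment = *uric, where uric = reserved | unreserved | escaped.

```java
import java.util.regex.Pattern;

final class FragmentRegexSketch {

    // alphanum = alpha | digit
    static final String ALPHANUM = "[a-zA-Z0-9]";
    // mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
    static final String MARK = "[\\-_.!~*'()]";
    // unreserved = alphanum | mark
    static final String UNRESERVED = "(?:" + ALPHANUM + "|" + MARK + ")";
    // reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
    static final String RESERVED = "[;/?:@&=+$,]";
    // escaped = "%" hex hex
    static final String ESCAPED = "%[0-9A-Fa-f]{2}";
    // uric = reserved | unreserved | escaped
    static final String URIC = "(?:" + RESERVED + "|" + UNRESERVED + "|" + ESCAPED + ")";
    // fragment = *uric
    static final String FRAGMENT = URIC + "*";

    static boolean isFragment(String candidate) {
        return Pattern.matches(FRAGMENT, candidate);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"section-1", "", "a%20b", "a%2", "a#b", "a b"}) {
            System.out.println("\"" + s + "\" -> " + isFragment(s));
        }
    }
}
```

Because fragment is defined as zero or more uric characters, the empty string is a valid fragment, which is an easy case to overlook in the tests.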
As astute as you are, you might be asking why I have chosen to translate the BNF into Java to verify the correctness of the URI components, especially when there are already at least 3 perfectly good, validating URI implementations in Java: java.net.URI, org.eclipse.emf.common.util.URI and org.apache.commons.httpclient.URI, any of which could easily be incorporated into the project with just a couple of lines of Maven configuration, or no extra configuration at all. Originally, I had gone down this route, and to my surprise, none of these implementations suited my needs:
- java.net.URI does not allow certain RFC 2396 URIs, such as "//"
- org.eclipse.emf.common.util.URI is not strict enough (just look at the commented-out implementations of the validation methods)
- org.apache.commons.httpclient.URI is character-based, not code-point-based, so I ran into a number of false positive results, particularly with Asian characters.
I strongly encourage you to set up a simple test yourself to double-check my work, or to delve deeper into the implementation details to see how URIs are handled by the 3 different libraries.
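A starting point for such a test might look like the following. It only exercises java.net.URI, since that class ships with the JDK, and it also shows why the distinction between characters and code points matters: a character outside the Basic Multilingual Plane occupies two Java chars but is a single code point. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriLibraryCheckSketch {

    // Reports whether java.net.URI parses the candidate without complaint.
    static boolean parsesWithJavaNetUri(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Number of UTF-16 code units, which is what a char-based validator iterates over.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is what a code-point-based validator iterates over.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Try the candidate mentioned above and see what this JDK reports.
        System.out.println("java.net.URI parses \"//\": " + parsesWithJavaNetUri("//"));

        // U+2000B is a CJK ideograph outside the Basic Multilingual Plane.
        String supplementary = new String(Character.toChars(0x2000B));
        System.out.println("chars: " + charCount(supplementary)
                + ", code points: " + codePointCount(supplementary));
    }
}
```

A validator that walks the string one char at a time sees two surrogate values for that ideograph instead of one character, which is one way a char-based and a code-point-based implementation can disagree.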
Now that the test case is ready, we can incrementally proceed with our translation and verify our results at a few select checkpoints along the way.
Translating the BNF:
In most cases, components near the top of the BNF grammar are built from combinations of components defined further down. For clarity, the BNF grammar has been reversed, so that the incremental translation can be read from the top of the page to the bottom. The resulting URI-reference regex at the bottom could surely be optimized or shortened, since there is significant repetition, but for now I am primarily concerned with its correctness, not with its size or speed.
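To give a feel for the incremental translation, here is a sketch of a few path-related rules from Appendix A, each built by string concatenation from the rules translated before it. The constant names are mine, and the syntax targets java.util.regex, so the automaton version may need different escaping.

```java
import java.util.regex.Pattern;

final class PathRegexSketch {

    // unreserved = alphanum | mark
    static final String UNRESERVED = "[a-zA-Z0-9\\-_.!~*'()]";
    // escaped = "%" hex hex
    static final String ESCAPED = "%[0-9A-Fa-f]{2}";
    // pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
    static final String PCHAR = "(?:" + UNRESERVED + "|" + ESCAPED + "|[:@&=+$,])";
    // param = *pchar
    static final String PARAM = PCHAR + "*";
    // segment = *pchar *( ";" param )
    static final String SEGMENT = PCHAR + "*(?:;" + PARAM + ")*";
    // path_segments = segment *( "/" segment )
    static final String PATH_SEGMENTS = SEGMENT + "(?:/" + SEGMENT + ")*";
    // abs_path = "/" path_segments
    static final String ABS_PATH = "/" + PATH_SEGMENTS;

    static boolean isAbsPath(String candidate) {
        return Pattern.matches(ABS_PATH, candidate);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"/", "/a/b;v=1/c", "a/b", "/a b", "/a?b", "/%zz"}) {
            System.out.println("\"" + s + "\" -> " + isAbsPath(s));
        }
        // The expanded expression is long because PCHAR is repeated in every segment.
        System.out.println("abs_path regex length: " + ABS_PATH.length());
    }
}
```

Printing the length of the final string makes the repetition visible: every higher-level rule embeds a full copy of the lower-level ones.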
Your final test case will perhaps look quite different from mine, but you may notice some similar patterns. Ultimately, this URI validation code should be moved to a separate component. Perhaps it could eventually be used as the basis for yet another Java URI implementation.
How to Use:
Now that we have a regex to validate URIs, we can use it for such purposes as validating UML stereotype attributes. Always keep in mind that regex parsers can vary quite a bit in their supported operations. I have tried to limit this regex to a standard set of characters, but a few still require tweaking before the EMF RegEx parser will process them: & ~ @ must all be unescaped (that is, the backslashes in front of them must be removed, whereas automaton requires those backslashes).
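If you would rather not edit the expression by hand, the tweak can be approximated with a small helper like the one below. The class and method names are hypothetical, and the helper only removes a backslash that directly precedes &, ~ or @. It does not attempt to handle an escaped backslash followed by one of those characters, and I am assuming that no other adjustments are needed.

```java
final class EscapeAdapterSketch {

    // Removes the backslash in front of '&', '~' and '@', leaving all other escapes alone.
    static String unescapeAmpersandTildeAt(String automatonRegex) {
        return automatonRegex.replaceAll("\\\\([&~@])", "$1");
    }

    public static void main(String[] args) {
        // In source form this is the regex [\&\~\@]|a\.b
        String automatonStyle = "[\\&\\~\\@]|a\\.b";
        System.out.println(automatonStyle + "  ->  " + unescapeAmpersandTildeAt(automatonStyle));
    }
}
```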
The primary purpose of this tutorial has been to provide a complete regular expression that validates the syntax of a candidate URI. Along the way, we have also discovered a technique for test-driving the incremental creation of complex regular expressions.