Tuesday, June 24, 2008

Part 1. Power of Regular Expressions

Rather than just posting the slides, I decided to do a series of blog posts on the subject - consider it an intro or tutorial to Regular Expressions.

What are Regular Expressions?

Regular Expressions (or RegEx for short) are a technique for shortening code by using a bunch of letters and symbols (metacharacters) to describe patterns in text. Most modern languages support regular expressions, so you can use them directly in code. But where they are really useful is data cleanup, extraction, converting legacy data, grabbing data from the internet, etc. Regular Expressions can be considered the SQL for freeform text.
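
As a rough illustration of the extraction and cleanup uses mentioned above, here is a minimal JavaScript sketch. The sample text, the mm/dd/yyyy date format, and the variable names are all made up for this example.

```javascript
// Hypothetical freeform text, invented purely for illustration.
var notes = "Order 1041 shipped 06/24/2008.   Order 1042 shipped 07/02/2008.";

// Extraction: find everything that looks like an mm/dd/yyyy date.
// \d matches one digit and {2} means "exactly two of them".
// The "g" flag makes match() return every match, not just the first.
var dates = notes.match(/\d{2}\/\d{2}\/\d{4}/g);

// Cleanup: collapse each run of whitespace into a single space.
var cleaned = notes.replace(/\s+/g, " ");

console.log(dates);
console.log(cleaned);
```

Doing the same with loops and character comparisons would take considerably more code.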

Most people avoid RegExes (even experienced programmers shy away from them), because they either don't understand them, haven't taken the time to learn them, or think that they are sediment left over from the Unix era.

By the end of this tutorial you will see that regular expressions can actually be easier and quicker to code. Once you understand how they work, along with some of the tools and techniques, you will see how dramatically they can shorten code and save you time with coding and debugging. Also, maintenance can actually be easier!

Here is a terrific example. Today, I was browsing through the prep book for MCTS Exam 70-528, and came across this example for an ASP.NET custom validator control that validates passwords. The password rules are:

  • 6-14 characters
  • at least one lowercase letter
  • at least one uppercase letter
  • at least one number

Here is the conventional method. (Code copied directly from the book, pg 469, misspelled "argument" and all.)

<script language="javascript" type="text/javascript">
function ValidatePassword(source, arguements)
{
    var data = arguements.Value.split('');
    //start by setting false
    arguements.IsValid=false;
    //check length
    if(data.length < 6 || data.length > 14) return;
    //check for uppercase
    var uc = false;
    for(var c in data)
    {
        if(data[c] >= 'A' && data[c] <= 'Z')
        {
            uc=true; break;
        }
    }
    if(!uc) return;
    //check for lowercase
    var lc = false;
    for(var c in data)
    {
        if(data[c] >= 'a' && data[c] <= 'z')
        {
            lc=true; break;
        }
    }
    if(!lc) return;
    //check for numeric
    var num = false;
    for(var c in data)
    {
        if(data[c] >= '0' && data[c] <= '9')
        {
            num=true; break;
        }
    }
    if(!num) return;
    //must be valid
    arguements.IsValid=true;
}
</script>

Now, with the use of regular expressions, we can shorten this into just ONE statement:
<script language="javascript" type="text/javascript">
function ValidatePassword(src, args)
{
    args.IsValid =
        args.Value.length>=6 && args.Value.length<=14 //check length
        && /[a-z]/.test(args.Value) //find a lowercase letter
        && /[A-Z]/.test(args.Value) //find an uppercase letter
        && /\d/.test(args.Value); //find a digit
}
</script>

You've got to love the elegance and compactness of this code. I believe it is actually easier to understand and debug - no messy loops and "if" constructs. And it reads almost exactly like the password rules listed above. Such is the amazing power of regular expressions. Note: "\d" is the shorthand character class that matches a single digit. In JavaScript we could just as well have written "[0-9]".
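
To try the regex version outside an ASP.NET page, a small harness like the following can help. It is only a sketch: the checkPassword helper and the sample passwords are hypothetical, and the plain object passed in merely stands in for the arguments object that the validator control would supply. The last function shows one possible way to fold all four rules into a single pattern using lookaheads; that is an alternative technique, not the code from the book.

```javascript
// Same rules as above: 6-14 characters, one lowercase, one uppercase, one digit.
function ValidatePassword(src, args) {
    args.IsValid =
        args.Value.length >= 6 && args.Value.length <= 14
        && /[a-z]/.test(args.Value)
        && /[A-Z]/.test(args.Value)
        && /\d/.test(args.Value);
}

// Hypothetical helper: wraps a string in a stand-in arguments object
// so the validator can be called without the ASP.NET control.
function checkPassword(password) {
    var args = { Value: password, IsValid: false };
    ValidatePassword(null, args);
    return args.IsValid;
}

// Alternative sketch: each (?=...) lookahead checks one rule without
// consuming any characters, and .{6,14} between ^ and $ checks the length.
// Assumes the password contains no line breaks, since "." does not match them.
function checkPasswordSinglePattern(password) {
    return /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,14}$/.test(password);
}

// Made-up sample passwords covering each rule.
var samples = ["Passw0rd", "password1", "ABCDEF1", "Abcdefg", "Ab1", "Abcdefghijklmn1"];
for (var i = 0; i < samples.length; i++) {
    console.log(samples[i], checkPassword(samples[i]), checkPasswordSinglePattern(samples[i]));
}
```

Only "Passw0rd" satisfies all four rules; each of the other samples breaks exactly one of them. The single-pattern form is more compact, but the version with separate tests arguably stays closer to the wording of the rules.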

In my next post we will look at a few more easy examples and examine the metacharacters.

1 comment:

Vijay Jagdale said...

The C# version (for server-side validation) is:
args.IsValid = args.Value.Length >= 6 && args.Value.Length <= 14
&& new Regex("[a-z]").IsMatch(args.Value)
&& new Regex("[A-Z]").IsMatch(args.Value)
&& new Regex(@"\d").IsMatch(args.Value);