[ALL] Regular Expressions: Parsing Strings of Text

Discussion in 'Mixed Languages' started by Calistoga, Oct 27, 2010.

  1. Calistoga

    Calistoga MDL Senior Member

    Jul 25, 2009
    421
    199
    10
    #1 Calistoga, Oct 27, 2010
    Last edited by a moderator: Apr 20, 2017
    Regular Expressions Parsing Strings of Text
    For all supported programming languages.

    Note that all C# examples require that you place "using System.Text.RegularExpressions;" at the top of your code file.

    Not much going on here yet, so I give to you: Regular Expressions!

    There are times when you may need to process a long or short text string. This string may contain only a single piece of information that you need in order to get your application up and running. You could of course use the classic approach and trim the string from the left and right until you get what you want. This is, however, not a very clean solution to the problem.

    Regular expressions might seem overly complex and difficult for beginners, and yes, they may take some time to learn. Their usefulness, however, makes it all worth it.

    Example
    Take the following text string as an example:
    As you can see, there's a whole lot of info we do not need. For the Windows 7 OEM SLP Key Collection we would probably want to get the manufacturer, the PID and the key itself, but for now, let's get only the key to keep it simple.

    Consider the following regex:
    Code:
    (?i)(?<ProductKey>\w{5}-\w{5}-\w{5}-\w{5}-\w{5})
    • (?i) means that the text will be matched case-insensitively, as opposed to case-sensitively.
    • (?<ProductKey>[...]): disregard the [...]. This is the capturing group, or more precisely, the named capturing group. This is where our product key will be stored after it has been matched.
    • \w{5} means that a word character (letter, digit or underscore) will be matched exactly 5 times.

    So how would we go about getting the result of the regex into a separate string?

    [C#]:
    Code:
    string productKey = Regex.Match(theExampleStringHere, @"(?i)(?<ProductKey>\w{5}-\w{5}-\w{5}-\w{5}-\w{5})").Groups["ProductKey"].Value;
    
    [AutoIt v3]:
    Code:
    $productKey = StringRegExp($theExampleStringHere, "(?i)(?<ProductKey>\w{5}-\w{5}-\w{5}-\w{5}-\w{5})", 3, 1)
    $productKey = $productKey[0] ; This is an array where 0 represents the first element, which is our match.
    Note: If you need a regex to validate Microsoft product keys, I suggest you use the one posted by MasterDisaster as his version also takes illegal product key characters into account.
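
    [Python]: a minimal sketch of the same extraction, for comparison. Assumptions: the sample string and its key are made up for illustration (the original example text is not reproduced here), and Python's re module spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Hypothetical sample text; the key below is a made-up placeholder, not a real one.
sample = "Manufacturer: Contoso | Edition: Home Premium | Key: abcde-12345-FGHIJ-67890-klmno"

# Python spells named groups (?P<name>...); (?i) makes the match case-insensitive.
pattern = r"(?i)(?P<ProductKey>\w{5}-\w{5}-\w{5}-\w{5}-\w{5})"

match = re.search(pattern, sample)
# Fall back to an empty string when nothing matches, like Groups[...].Value in C#.
product_key = match.group("ProductKey") if match else ""
print(product_key)  # abcde-12345-FGHIJ-67890-klmno
```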

    All right, let's get the Windows edition as well:
    Code:
    (?i)(?<Edition>Starter|Home\s?(Basic|Premium)|Professional|Ultimate)
    \s? means that there might be a space here, so both HomeBasic and Home Basic will be matched.

    The result?

    [C#]:
    Code:
    string edition = Regex.Match(theExampleStringHere, @"(?i)(?<Edition>Starter|Home\s?(Basic|Premium)|Professional|Ultimate)").Groups["Edition"].Value;
    
    [AutoIt v3]:
    Code:
    $edition = StringRegExp($theExampleStringHere, "(?i)(?<Edition>Starter|Home\s?(Basic|Premium)|Professional|Ultimate)", 3, 1)
    $edition = $edition[0] ; This is an array where 0 represents the first element, which is our match.
    You can probably figure out how to obtain the manufacturer as well, even though that's a little harder since the list of manufacturers is so long.
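
    [Python]: going back to the edition regex, a small sketch of how the optional \s? and the alternation behave. The input strings are invented for illustration, and find_edition is a hypothetical helper name, not something from the posts above.

```python
import re

# (?i) ignores case; \s? allows an optional whitespace character after "Home".
pattern = r"(?i)(?P<Edition>Starter|Home\s?(Basic|Premium)|Professional|Ultimate)"

def find_edition(text):
    """Return the first edition name found in text, or an empty string."""
    match = re.search(pattern, text)
    return match.group("Edition") if match else ""

# Hypothetical inputs, made up for illustration.
print(find_edition("Windows 7 Home Premium OEM"))  # Home Premium
print(find_edition("windows 7 homebasic"))         # homebasic
print(find_edition("Windows 7 Enterprise"))        # prints an empty line: not in the list
```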

    Questions about regular expressions or maybe contributions of examples are most welcome here. Note that I am not a regex ninja, so I might not always be able to answer.
     
  2. MasterDisaster

    MasterDisaster MDL Expert

    Aug 29, 2009
    1,256
    674
    60
    #2 MasterDisaster, Oct 28, 2010
    Last edited by a moderator: Apr 20, 2017
    I have a regex that validates if a key contains only valid characters
    Code:
    [BCDFGHJKMPQRTVWXY2346789]{5}-[BCDFGHJKMPQRTVWXY2346789]{5}-[BCDFGHJKMPQRTVWXY2346789]{5}-[BCDFGHJKMPQRTVWXY2346789]{5}-[BCDFGHJKMPQRTVWXY2346789]{5}
    
    Can this regex be simplified?
     
  3. Calistoga

    #3 Calistoga, Oct 28, 2010
    Last edited by a moderator: Apr 20, 2017
    (OP)
    You could write it this way:
    Code:
    (?i)(?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}
    You can remove that (?i) if you want it to only match upper case characters.
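
    A quick way to convince yourself that the short form accepts the same strings as the long one is to run both against a few inputs. The Python sketch below does that with made-up strings (not real keys). Note that the digits are written 2346-9, since a plain 2-9 range would also admit 5, which is not in the original character list.

```python
import re

VALID = "BCDFGHJKMPQRTVWXY2346789"

# Long form: the full character class spelled out five times, joined by hyphens.
long_form = "-".join(["[" + VALID + "]{5}"] * 5)

# Compact form: a non-capturing group repeated four times, then one last block.
compact = r"(?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}"

# Made-up test strings (not real keys).
samples = [
    "BCDFG-HJKMP-QRTVW-XY234-6789B",  # only valid characters
    "BCDFG-HJKMP-QRTVW-XY234-6785B",  # contains 5, which is not in the list
    "BCDFG-HJKMP-QRTVW-XY234",        # too short
]

results = {
    s: (bool(re.fullmatch(long_form, s)), bool(re.fullmatch(compact, s)))
    for s in samples
}
for s, (long_ok, compact_ok) in results.items():
    print(s, long_ok, compact_ok)
```

    Both patterns should agree on every line: True/True for the first string, False/False for the other two.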
     
  4. MasterDisaster

    #4 MasterDisaster, Oct 28, 2010
    Last edited by a moderator: Apr 20, 2017
    Thanks, it worked
    Code:
    (?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}
    
    So, the (?:...){4} part says that ?????- should be repeated 4 times, followed by ?????.
     
  5. Calistoga

    Yep that's correct.

    Btw, that ?: at the beginning tells the regex engine that the contents inside the parentheses should not be treated as a capturing group.
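
    To see the difference between ( ) and (?: ), it helps to look at what the match object stores. A minimal Python sketch, using \w{5} blocks and a made-up string for brevity:

```python
import re

text = "BCDFG-HJKMP-QRTVW-XY234-6789B"  # made-up string, not a real key

capturing = re.fullmatch(r"(\w{5}-){4}\w{5}", text)
non_capturing = re.fullmatch(r"(?:\w{5}-){4}\w{5}", text)

# A repeated capturing group keeps only its last repetition.
print(capturing.groups())      # ('XY234-',)
# With (?:...) nothing is stored besides the overall match.
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # BCDFG-HJKMP-QRTVW-XY234-6789B
```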
     
  6. FreeStyler

    FreeStyler MDL Guru

    Jun 23, 2007
    3,557
    3,832
    120
    I am trying to use this RegExp in an ASP.NET RegularExpressionValidator; my rule is currently like this: ValidationExpression="^(?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}$"

    Where ^ = beginning of string
    Where $ = end of string

    Downside of this RegExp is the fact it only accepts UPPERCASE characters, if i alter the rule to accept lowercase as suggested here, eg: (?i) addition in front, the rule is broken, how to fix this?
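
    As a side note on what ^ and $ add, and on why lowercase input is rejected, here is a minimal Python sketch. It only illustrates the anchors and a case-insensitive flag in Python's re module; it says nothing about how the RegularExpressionValidator itself behaves. The inputs are made up.

```python
import re

key_pattern = r"(?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}"
anchored = "^" + key_pattern + "$"

padded = "xx BCDFG-HJKMP-QRTVW-XY234-6789B xx"  # made-up input with extra text
lower = "bcdfg-hjkmp-qrtvw-xy234-6789b"         # made-up lowercase input

# Without anchors the pattern is found somewhere inside the string.
print(bool(re.search(key_pattern, padded)))          # True
# With ^ and $ the whole string has to be a key, so the padded input fails.
print(bool(re.search(anchored, padded)))             # False
# Lowercase input fails as well, unless case is ignored (here via re.IGNORECASE).
print(bool(re.search(anchored, lower)))              # False
print(bool(re.search(anchored, lower, re.IGNORECASE)))  # True
```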
     
  7. FreeStyler

    #7 FreeStyler, Nov 19, 2010
    Last edited: Nov 19, 2010
    Ok, found something more on this: it seems the ignore-case directive is not supported by the JavaScript used by the RegularExpressionValidator. How to get around this? Found a class, but can't make it work correctly.
     
  8. Calistoga

    #8 Calistoga, Nov 19, 2010
    Last edited by a moderator: Apr 20, 2017
    (OP)
    Unbelievable, a regular expressions validator that doesn't support case insensitive matching :eek:

    Does it work if you set EnableClientScript=false like this guy?
    Alternatively, it's dirty, but this should do the job:
    Code:
    ValidationExpression="^(?:[bcdfghjkmpqrtvBCDFGHJKMPQRTV-Y2346-9]{5}-){4}[bcdfghjkmpqrtvBCDFGHJKMPQRTV-Y2346-9]{5}$"
     
  9. FreeStyler

    #9 FreeStyler, Nov 19, 2010
    Last edited by a moderator: Apr 20, 2017
    You left out yY; I guess these need to be included as well, right?

    Code:
    ValidationExpression="^(?:[bcdfghjkmpqrtv-yBCDFGHJKMPQRTV-Y2346-9]{5}-){4}[bcdfghjkmpqrtv-yBCDFGHJKMPQRTV-Y2346-9]{5}$"
    
    Or do the characters between v and y need to be listed separately? or doesn't this matter at all?

    Code:
    ValidationExpression="^(?:[bcdfghjkmpqrtvwxyBCDFGHJKMPQRTVWXY2346-9]{5}-){4}[bcdfghjkmpqrtvwxyBCDFGHJKMPQRTVWXY2346-9]{5}$"
    
     
  10. Calistoga

    #10 Calistoga, Nov 19, 2010
    Last edited: Nov 19, 2010
    (OP)
    Oops, yeah, I forgot those. Whether you write v-y or vwxy doesn't matter; I would choose the first one since it is four characters shorter :D
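
    That v-y and vwxy are interchangeable can be checked mechanically. A small Python sketch that tries every ASCII letter and digit against both character classes:

```python
import re
import string

with_range = re.compile(r"[bcdfghjkmpqrtv-yBCDFGHJKMPQRTV-Y2346-9]")
spelled_out = re.compile(r"[bcdfghjkmpqrtvwxyBCDFGHJKMPQRTVWXY2346-9]")

# Check every ASCII letter and digit against both character classes.
candidates = string.ascii_letters + string.digits
accepted_range = {c for c in candidates if with_range.fullmatch(c)}
accepted_spelled = {c for c in candidates if spelled_out.fullmatch(c)}

print(accepted_range == accepted_spelled)  # True
print("".join(sorted(accepted_range)))     # the accepted characters, sorted
```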

    But if someone with a cleaner solution to the RegularExpressionValidator problem happens to walk by, please stop here and explain.
     
  11. FreeStyler

    Thanks, already noticed it didn't really matter which one of the two versions to use

    BTW, the (converted to uppercase) key is also checked on the server side using the RegExp shown here, the client side function is just to do some early validation before submitting
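
    The convert-to-uppercase-then-check idea can be sketched in a few lines. This is a Python illustration only, not the actual server-side code; is_valid_key is a hypothetical helper name and the inputs are made up.

```python
import re

KEY_RE = re.compile(r"(?:[BCDFGHJKMPQRTV-Y2346-9]{5}-){4}[BCDFGHJKMPQRTV-Y2346-9]{5}")

def is_valid_key(raw):
    """Upper-case the input, then require the whole string to match."""
    return KEY_RE.fullmatch(raw.upper()) is not None

# Made-up inputs, not real keys.
print(is_valid_key("bcdfg-hjkmp-qrtvw-xy234-6789b"))  # True
print(is_valid_key("bcdfg-hjkmp-qrtvw-xy234-6785b"))  # False, 5 is not allowed
```

    Normalising the case first means the pattern itself only needs the uppercase character class.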
     
  12. ar_seven_am

    ar_seven_am MDL Senior Member

    Mar 7, 2010
    398
    129
    10
    Hmmm, very impressive from the first post to the end. Calistoga, is there any regex function you didn't mention yet? Or may I know where I can find a complete regex tutorial? I want to learn about this in depth. I have found this link, http://www.spaweditor.com/scripts/regex/index.php, for testing the regexes we make, but so far I can't find any understandable tutorial for a noob (like me) :confused:
     
  13. Calistoga

    I think Mastering Regular Expressions (MRE) (the unofficial regex bible :D) is probably one of the best books on the subject. As a quick reference book, I use Regular Expression Pocket Reference (heavily based on MRE) a lot myself.

    This article is pretty good, but it targets .NET developers. What programming/scripting language will you be using?
     
  14. ar_seven_am

    I want to learn web scripting. Currently I'm learning from the basic web tutorials here: http://w3schools.com/default.asp. Is it any different from the tutorial you linked above? Sorry for asking, I only started learning today, so I really don't know much about this. Thanks in advance for your help!
     
  15. Calistoga

    #15 Calistoga, Jan 26, 2011
    Last edited by a moderator: Apr 20, 2017
    (OP)
    I think the easiest way to get a good feeling for regex is to have a problem and then apply a regex-based solution to that problem. I'll try to show some of the logic behind regex matching:

    Code:
    \d{5}
    \d = digit; alone this will match a single digit (0-9)
    {5} = match the previous expression exactly 5 times; in other words, 5 digits will be matched.

    \d is the same as \d{1}

    Apply this to a string:
    Code:
    01546531658454
    Our expression will return the first 5 digits; 01546
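
    The same check as a minimal Python sketch (re.search returns the first match, re.findall returns all non-overlapping ones):

```python
import re

text = "01546531658454"

first = re.search(r"\d{5}", text)
print(first.group(0))  # 01546

# \d on its own is the same as \d{1}: one digit per match.
one_digit = re.search(r"\d", text).group(0)
one_digit_explicit = re.search(r"\d{1}", text).group(0)
print(one_digit == one_digit_explicit)  # True

# findall walks on through the string, taking non-overlapping runs of five digits.
all_runs = re.findall(r"\d{5}", text)
print(all_runs)  # ['01546', '53165']
```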

    The following string is slightly modified:
    Code:
    01A546531658454
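    Notice that 'A'. Our expression can no longer match at the start of the string, because A is not a digit; a search that scans the whole string skips ahead and returns 54653 instead. To match the first five characters (digits and letters) we modify our expression:
    Code:
    \w{5}
    \w = word characters (letters, digits and underscore)

    Now any word character will be matched exactly 5 times. This expression will return 01A54. Note that (\w\d){5} would mean something else: a word character followed by a digit, repeated 5 times, which matches 10 characters (01A5465316).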
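
    How these patterns behave on the modified string can be checked with a short Python sketch. It uses re.search, which scans forward through the string, and re.match, which only tries at the start:

```python
import re

text = "01A546531658454"

# search() scans forward, so \d{5} skips past the A and finds a later run of digits.
digits_anywhere = re.search(r"\d{5}", text).group(0)
print(digits_anywhere)   # 54653

# match() only tries at the start, where five digits in a row are not available.
digits_at_start = re.match(r"\d{5}", text)
print(digits_at_start)   # None

# \w also accepts letters (and underscore), so the first five characters match.
first_five = re.search(r"\w{5}", text).group(0)
print(first_five)        # 01A54

# (\w\d){5} means "a word character followed by a digit", five times: ten characters.
pairs = re.search(r"(\w\d){5}", text).group(0)
print(pairs)             # 01A5465316
```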


    Note that I'm more of a Windows applications type of developer, but you might be able to find some useful resources here. To help visualize how regular expressions match, you might find this application interesting. It's well worth the money.

    Please ask if you want to know how to match something, and I will try to explain :)