How to validate a link with a Regular Expression?

In this article we are going to see a regular expression for link validation. It is one of the most commonly used regular expressions. Whenever we work with links in Selenium or extract a link from JMeter response data, we can use this regular expression. For the basic idea of regular expressions, see this post.

First, we have to identify the characteristics of a web link. The generic form of a link is scheme://host.domain:port/path/filename

And we may think of it as follows.
1. A link starts with a scheme such as http, https or ftp. So, for this part, the expression should be
(https?|ftp)
I handle http and https together with s?, which makes the s optional.

2. Then comes "://". The expression for this part can be \:\/\/ , or we can simply write ://
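Steps 1 and 2 can be tried together before going further. The following is only a small sketch in Python (the language and the helper name scheme_of are my own choices for illustration, they are not part of the original expression):

```python
import re

# Steps 1 and 2 combined: the scheme followed by the literal "://".
SCHEME = re.compile(r"(https?|ftp)://")

def scheme_of(url):
    """Return the scheme if the text starts with http, https or ftp, else None."""
    m = SCHEME.match(url)
    return m.group(1) if m else None

print(scheme_of("https://example.com"))         # https
print(scheme_of("ftp://files.example.com"))     # ftp
print(scheme_of("mailto:someone@example.com"))  # None
```

In Python neither the colon nor the slash needs escaping, so the simple :// form is enough here.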

3. Then www may come. Sometimes it is www1 or www2. Sometimes it is a sub domain name (such as mantis.shantonu.com). Sometimes there is nothing, just the domain name directly. So for this part the expression will be
- For www, www1 or www2 : www\d?
- For a sub domain : [a-zA-Z0-9]+
- Combining these two with an OR operation : (www\d?|[a-zA-Z0-9]+)
- And the full statement is optional (as there may be a direct domain name) : (www\d?|[a-zA-Z0-9]+)?
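To see what the optional group from step 3 accepts on its own, here is a minimal Python sketch. The helper name is_valid_prefix and the sample strings are only assumptions for illustration:

```python
import re

# Optional prefix from step 3: "www", "www1", "www2", any alphanumeric
# sub domain, or nothing at all.
PREFIX = re.compile(r"(www\d?|[a-zA-Z0-9]+)?")

def is_valid_prefix(text):
    """True when the whole text is an acceptable prefix."""
    return PREFIX.fullmatch(text) is not None

for candidate in ["www", "www2", "mantis", "", "my_site"]:
    print(repr(candidate), is_valid_prefix(candidate))
```

The empty string passes because of the trailing ?, and my_site fails because the underscore is not in the character class.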

4. Then the domain name comes, so the expression will be [a-zA-Z0-9-]+ , which allows letters, digits and the hyphen (-).


5. Then we get the domain type, such as com or org. This part is not fixed nowadays, as it differs from area to area (i.e., in India some sites have co.in, and in the UK co.uk). So, the expression will be [a-zA-Z0-9.]+
And, in this section, the port number can also be present, which is also optional. So the expression will be (\:\d+)?. The colon is handled separately as a separator in the final expression, so if we add both together we get ([a-zA-Z0-9.]+|(\d+)?)
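The combined group from step 5 can be checked in isolation. This is a rough Python sketch; the helper name accepts and the sample values are assumptions made for the example:

```python
import re

# Step 5 group as written above: a domain type such as "com" or "co.uk",
# or an optional run of digits for the port.
TYPE_OR_PORT = re.compile(r"([a-zA-Z0-9.]+|(\d+)?)")

def accepts(text):
    """True when the whole text matches the step 5 group."""
    return TYPE_OR_PORT.fullmatch(text) is not None

print(accepts("com"))    # True
print(accepts("co.uk"))  # True
print(accepts("8080"))   # True
print(accepts(""))       # True, through the optional (\d+)? branch
print(accepts("co_uk"))  # False, "_" is not in the character class
```

One thing worth noticing: since [a-zA-Z0-9.]+ already contains the digits, a port such as 8080 is matched by the first branch, and the (\d+)? branch in practice only contributes the empty match.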

6. And then comes the path after the port/domain type. This may contain any type of character (nowadays there can be Unicode also), separated by the path identifier /, and it is also optional. So, for this part the expression will be : ([/?].*)?
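The trailing part from step 6 can be tried the same way. Again this is just a sketch in Python, and is_valid_tail and the sample paths are made-up names for illustration:

```python
import re

# Step 6: an optional tail that starts with "/" or "?" and is followed
# by any characters (except a line break, which "." does not match).
TAIL = re.compile(r"([/?].*)?")

def is_valid_tail(text):
    """True when the whole text is an acceptable path or query part."""
    return TAIL.fullmatch(text) is not None

print(is_valid_tail("/docs/index.html"))  # True
print(is_valid_tail("?q=regex"))          # True
print(is_valid_tail("/路径/文件"))          # True, Unicode is accepted
print(is_valid_tail(""))                  # True, the part is optional
print(is_valid_tail("docs/index.html"))   # False, must start with / or ?
```

Because .* takes everything up to the end of the line, this part does not validate the path itself; it only checks how the path begins.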

Now we add all together. Between the parts from 1 to 5 there are separator symbols, which I add part by part. So finally the regular expression will be
(https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)([a-zA-Z0-9.]+|(\d+)?)([/?].*)?

OR
(https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)([a-zA-Z0-9.]+|(\d+)?)([/?:].*)?

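Note : I have added (\:|\.) because there can be a colon (:) before the port or a dot (.) before the domain type name. The second expression also accepts a colon at the start of the last part, so a port can follow the domain type.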
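To try the two expressions on real strings, a minimal Python sketch could look like the one below. URL_V1 and URL_V2 are copied verbatim from above. URL_V3, the helper is_valid_link and the sample URLs are assumptions added only for illustration; URL_V3 is a hypothetical adjustment that moves the first dot inside the optional group, it is not part of the original expressions.

```python
import re

# Both expressions copied verbatim from the article.
URL_V1 = re.compile(
    r"(https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)"
    r"([a-zA-Z0-9.]+|(\d+)?)([/?].*)?"
)
URL_V2 = re.compile(
    r"(https?|ftp)://(www\d?|[a-zA-Z0-9]+)?\.[a-zA-Z0-9-]+(\:|\.)"
    r"([a-zA-Z0-9.]+|(\d+)?)([/?:].*)?"
)
# Hypothetical variant (NOT from the article): the dot after the prefix
# is moved inside the optional group, so the prefix and its dot are
# skipped together when the link has no sub domain.
URL_V3 = re.compile(
    r"(https?|ftp)://((www\d?|[a-zA-Z0-9]+)\.)?[a-zA-Z0-9-]+(\:|\.)"
    r"([a-zA-Z0-9.]+|(\d+)?)([/?:].*)?"
)

def is_valid_link(url, pattern=URL_V2):
    """True when the whole string matches the given expression."""
    return pattern.fullmatch(url) is not None

samples = [
    "http://www.example.com",
    "https://mantis.shantonu.com/login",
    "http://www.example.co.uk",
    "http://shantonu.com:8080",
    "http://www.example.com:8080/path",
    "http://example.com",
    "example.com",
]
for url in samples:
    # Prints the URL followed by the result for each expression.
    print(url, [is_valid_link(url, p) for p in (URL_V1, URL_V2, URL_V3)])
```

With whole-string matching, the first expression rejects http://www.example.com:8080/path while the second accepts it, which is what the extra colon in the last group is for. Both expressions as written reject a bare http://example.com, because the dot after the optional prefix is always required. If that case matters, something like the hypothetical URL_V3 is one possible way to handle it.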

If you have any other variety of URL, you can add it in a comment. I will modify the expression to support more types of URLs.
Thanks..:)
