How to validate form input using HTML5 input validation
I have tried to find a full list of patterns for validating input via HTML5 form validation for various types, specifically url, email, tel and such, but I couldn't find any. Currently, the built-in versions of these input validations are far from perfect (and tel doesn't even check whether what you're entering is a phone number). So I was wondering: which patterns could I use to check that the user is entering the right format in these inputs?
Here are a few examples of cases where the default validation accepts input that should be rejected:
type="email"
This field accepts emails that have incorrect domains after the @, and it lets addresses start or end with a dash or period, which isn't allowed either. So, .example-@x is accepted.
type="url"
In Chrome, this input basically accepts anything that starts with http:// and is followed by anything other than a few special characters, such as those that have a function in URLs (@, #, ~, etc.). In Firefox, all that's checked is whether the value starts with http: followed by anything other than a colon (even just "http:" is accepted). IE does the same as Firefox, except that it also accepts "http::".
For example, "http://." is accepted in all three, and so is "http://,".
type="tel"
There is currently no built-in validation for phone numbers in any of the major browsers: the field behaves exactly like type="text", other than telling mobile browsers which kind of keyboard to display.
So, since the browsers don't behave consistently in any of these cases, and since the behaviour they do show is very basic with many false positives, what can I do to validate my HTML forms (still using HTML5 input validation)?
PS: I'm posting this because I would find a complete list of form validation patterns useful myself, so I figured it might be useful for others too (and of course others can post their solutions as well).
These patterns aren't necessarily simple, but here's what I think works best in every situation. Keep in mind that (quite recently) Internationalized Domain Names (IDNs) have become available too. With those, an untestable number of characters is allowed in URLs (there are still lots of characters that aren't allowed in domain names, but the list of allowed characters is so big, and changes so often between top-level domains, that it's not practical to keep up with it). If you want to support internationalized domain names, use the second URL pattern; otherwise, use the first.
TL;DR:
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?
https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@\[-`{-~]{1,63}\.)+([^ !-/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
URLs, without IDN support
https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?
Explanation:
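Each label of the domain name consists of 1 to 63 letters, digits or dashes; the host may not start with a dash, no label may end with one, and the host as a whole may not exceed 253 characters.
The top-level domain consists of letters only, with an arbitrary limit of 15 characters (enough for long TLDs such as ".international"), which most likely won't change any time soon.
IPv4 addresses are accepted as well. Reserved addresses such as 0.0.0.0, 127.0.0.1, etc. are not checked for, but octets with leading zeros are rejected (01.1.1.1) [4].
Note that the default http:.* pattern built into modern browsers will always be enforced, so even if you remove the https?:// at the start of this pattern, it will still be enforced. Use type="text" to avoid it.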
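The rules above can be checked outside a form with a small JavaScript sketch. It assumes the browser tests the pattern against the whole value, which the hypothetical helper matchesPattern imitates by wrapping the expression in ^(?: ... )$. Browsers additionally compile the pattern with a Unicode flag (u or v, depending on the version), which this sketch leaves out, so results in a particular browser may differ. The pattern is written as a JavaScript string with literal dots escaped, so every backslash is doubled.

```javascript
// URL pattern without IDN support, as a JavaScript string (backslashes doubled).
const URL_PATTERN =
  "https?://(?![^/]{253}[^/])((?!-.*|.*-\\.)([a-zA-Z0-9-]{1,63}\\.)+[a-zA-Z]{2,15}" +
  "|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\\.){3}" +
  "(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?";

// Hypothetical helper (not a browser API): anchor the pattern so that it
// has to match the entire value, then test it.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

console.log(matchesPattern(URL_PATTERN, "http://example.com"));  // true
console.log(matchesPattern(URL_PATTERN, "http://192.168.0.1"));  // true
console.log(matchesPattern(URL_PATTERN, "http://01.1.1.1"));     // false (leading zero)
console.log(matchesPattern(URL_PATTERN, "http://."));            // false
```

The last two inputs are the kind of value the built-in url check lets through, which is what the stricter pattern is meant to catch.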
URLs, with IDN support
https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@\[-`{-~]{1,63}\.)+([^ !-/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?
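As a rough illustration of what this pattern accepts, here is a JavaScript sketch. The helper matchesPattern is hypothetical and only imitates the whole-value matching that browsers apply to the pattern attribute; it uses no Unicode flag, so a real browser may behave differently. The host names are made-up examples, and the pattern is written as a JavaScript string with literal dots and the opening bracket escaped (backslashes doubled).

```javascript
// URL pattern with IDN support, as a JavaScript string (backslashes doubled).
const IDN_URL_PATTERN =
  "https?://(?!.{253}.+$)((?!-.*|.*-\\.)([^ !-,./:-@\\[-`{-~]{1,63}\\.)+" +
  "([^ !-/:-@\\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})" +
  "|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\\.){3}" +
  "([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?";

// Hypothetical helper (not a browser API): the pattern must match the whole value.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

console.log(matchesPattern(IDN_URL_PATTERN, "http://bücher.example"));    // true
console.log(matchesPattern(IDN_URL_PATTERN, "http://例え.xn--zckzah"));    // true
console.log(matchesPattern(IDN_URL_PATTERN, "http://exa_mple.com"));      // false (underscore excluded)
```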
Explanation:
Since a huge number of characters is allowed in IDNs, it's not practical to list every possible combination in an HTML attribute: you'd get a huge pattern, and in that case it's much better to test the value by some method other than regex [5].
Instead, this pattern excludes the ASCII punctuation characters that cannot appear in a domain name:
!"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~
with the exception of a period as domain separator. These are covered by the ranges [!-,], [./], [:-@], [\[-`] and [{-~].
The TLD may also take the form xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
Email addresses
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
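As a quick check of this pattern, the following JavaScript sketch runs it against a few addresses, including the .example-@x case that the built-in email type accepts. The helper matchesPattern is hypothetical and only approximates the whole-value matching done by the pattern attribute (no Unicode flag is applied here). The pattern is written as a JavaScript string with literal dots escaped, so backslashes are doubled.

```javascript
// Email pattern as a JavaScript string (backslashes doubled).
const EMAIL_PATTERN =
  "(?!(^[.-].*|[^@]*[.-]@|.*\\.{2,}.*)|^.{254}.)" +
  "([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)" +
  "(?!-.*|.*-\\.)([a-zA-Z0-9-]{1,63}\\.)+[a-zA-Z]{2,15}";

// Hypothetical helper (not a browser API): the pattern must match the whole value.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

console.log(matchesPattern(EMAIL_PATTERN, "john.doe@example.com")); // true
console.log(matchesPattern(EMAIL_PATTERN, ".example-@x"));          // false
console.log(matchesPattern(EMAIL_PATTERN, "a..b@example.com"));     // false (consecutive dots)
```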
Explanation:
Validating email addresses with 100% accuracy requires a whole lot more than this pattern, but it covers very nearly all of them. A 100% complete pattern does exist, but it contains PCRE (PHP)-only regex lookaheads, so it won't work in HTML forms.
The part before the @ may only contain letters, digits and the characters
!#$%&'*+/=?^_`{|}~.-
[6].
Each label after the @ can only be 63 characters long, and the total address can only be 254 characters long [8].
The part before the @ may not start or end with a dash or period, and no two dots may appear consecutively [8].
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
A number must start with one of the following prefixes, where [CTRY] stands for the country code and X stands for the first non-zero digit (such as 6 in mobile numbers):
00[CTRY]X
+[CTRY]X
0X
[CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
The first pattern allows an optional space before each of the remaining eight digits; the second allows digits only.
This regex is just for 10-digit phone numbers. Since phone number lengths vary between countries, it's best to use a less strict version of this pattern, or to modify it for the countries you need. Treat it as a template rather than a finished pattern.
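The four accepted prefixes can be tried out with a short JavaScript sketch. The helper matchesPattern is hypothetical and only approximates the whole-value matching of the pattern attribute; the sample numbers are made up, using a two-digit country code as the pattern assumes. The patterns are written as JavaScript strings with the plus sign escaped, so the backslash is doubled.

```javascript
// Both phone patterns as JavaScript strings (backslashes doubled).
const PHONE_SPACED = "((\\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}";
const PHONE_COMPACT = "((\\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}";

// Hypothetical helper (not a browser API): the pattern must match the whole value.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

// One sample per prefix form: 0X, +[CTRY]X, 00[CTRY]X and [CTRY]X.
for (const number of ["0612345678", "+31612345678", "0031612345678", "31612345678"]) {
  console.log(number, matchesPattern(PHONE_COMPACT, number)); // true for all four
}

// Only the first pattern tolerates spaces between the remaining digits.
console.log(matchesPattern(PHONE_SPACED, "06 12 34 56 78"));  // true
console.log(matchesPattern(PHONE_COMPACT, "06 12 34 56 78")); // false
```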
Extra: Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, this is very western-centric, but it may be useful all the same, since a pattern like this can be difficult to put together, and if you're making a site for a western audience it will always work (Asian names have a representation in exactly this format too).
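To see the pattern in action, here is a small JavaScript sketch with a few made-up names. The helper matchesPattern is hypothetical and only approximates the whole-value matching of the pattern attribute. The sample strings are assumed to use precomposed accented characters, because the ranges in the pattern do not cover combining marks.

```javascript
// Western-style name pattern, copied as a JavaScript string.
const NAME_PATTERN = "([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}";

// Hypothetical helper (not a browser API): the pattern must match the whole value.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

console.log(matchesPattern(NAME_PATTERN, "John Smith"));  // true
console.log(matchesPattern(NAME_PATTERN, "Éric Müller")); // true
console.log(matchesPattern(NAME_PATTERN, "Σωκράτης"));    // true
console.log(matchesPattern(NAME_PATTERN, "john smith"));  // false (no leading capital)
```

Names containing an apostrophe or a hyphen are not matched by these ranges, so they would need to be added if such names should be accepted.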
Each name starts with an uppercase letter followed by 1 to 19 further letters, and up to 10 names separated by single spaces are allowed. The character ranges used are the following (they also cover the letters ÐÞ ðþ):
A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ά-Ϋ matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊΌΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ ΪΫ
À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
References