Email validation

Email validation? Why? There are thousands of validators on the net, so why another one? The answer is simple: most of them are wrong. They often rely solely on regexes, and those regexes are often wrong. Some years ago I had the opportunity to write a validation that is as close to the truth as it can be (although that was in 2009, so things may have changed since then). Keep in mind that email validation is not an exact science. There are contradictions in the RFCs governing it. There is even a "commentary" RFC attempting to clear up the confusion.

First, let me give you the tools you need to succeed: a list of valid and a list of invalid email addresses. The valid addresses come first; the invalid ones start at the "missing parts" comment. Most lines carry a single-line Java-style comment explaining what is tested. The lists are copy/pasted from a Java array, so there are quotes, backslashes to accommodate the Java String syntax, and a comma at the end of each line.

// basic emails
"a@a.ad", // the shortest possible email with letters
"abcdefghijklmnopqrstuvwxyz@abv.bg", // all letters email
"ABCDEFGHIJKLMNOPQRSTUVWXYZ@abv.bg", // all capital letters email
"1@1.ad", // the shortest possible email with numbers
"1234567890@123.com", // all numbers
/* special symbol emails */
"!@special.com", // ! ASCII 0x21
"#@special.com", // # ASCII 0x23
"$@special.com", // $ ASCII 0x24
"%@special.com", // % ASCII 0x25
"&@special.com", // & ASCII 0x26
"'@special.com", // ' ASCII 0x27
"*@special.com", // * ASCII 0x2A
"+@special.com", // + ASCII 0x2B
"-@special.com", // - ASCII 0x2D
"/@special.com", // / ASCII 0x2F
"=@special.com", // = ASCII 0x3D
"?@special.com", // ? ASCII 0x3F
"^@special.com", // ^ ASCII 0x5E
"_@special.com", // _ ASCII 0x5F
"`@special.com", // ` ASCII 0x60
"{@special.com", // { ASCII 0x7B
"|@special.com", // | ASCII 0x7C
"}@special.com", // } ASCII 0x7D
"~@special.com", // ~ ASCII 0x7E
/* complex emails */
"abc.def@dot.com", // email with dot
"abv.123.!#$@dot.atom.com", // email with dot containing all groups of characters
"local@dot.com.", // domain in absolute form
/* maximum length */
"1234567890123456789012345678901234567890123456789012345678901234@domain.com", // local part 64 characters long
"1234567890@1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.1234567890.com", // max email length - 256
/* quoted strings */
"\"quotedstring\"@quoted.com", // quoted string as local part
"\"quoted\\qstring\"@quoted.com", // with quoted pair - \q
"\"quoted@string\"@quoted.com", // with second @
"\"quoted\\\"string\"@quoted.com", // with escaped double quote - \"
"\"quoteds\\ tring\"@quoted.com", // with escaped space - \SP - ASCII 0x20
"\"quoteds\\\ttring\"@quoted.com", // with escaped tab character - \HTAB - ASCII 0x09
"\"\"@quoted.com", // empty quoted string
/* quoted pairs */
"\\\"@special.com", // " ASCII 0x22 - \"
"\\(@quoted.com", // ( ASCII 0x28 - \(
"\\)@quoted.com", // ) ASCII 0x29 - \)
"\\<@quoted.com", // < ASCII 0x3C - \<
"\\>@quoted.com", // > ASCII 0x3E - \>
"\\[@quoted.com", // [ ASCII 0x5B - \[
"\\]@quoted.com", // ] ASCII 0x5D - \]
"\\:@quoted.com", // : ASCII 0x3A - \:
"\\;@quoted.com", // ; ASCII 0x3B - \;
"\\@@quoted.com", // @ ASCII 0x40 - \@
"\\\\@quoted.com", // \ ASCII 0x5C - \\
"\\,@quoted.com", // , ASCII 0x2C - \,
"\\.@quoted.com", // . ASCII 0x2E - \.
/* RFC 3696 section 3 */
"Abc\\@def@example.com",
"Fred\\ Bloggs@example.com",
"Joe.\\\\Blow@example.com",
"\"Abc@def\"@example.com",
"\"Fred Bloggs\"@example.com",
"user+mailbox@example.com",
"customer/department=shipping@example.com",
"$A12345@example.com",
"!def!xyz%abc@example.com",
"_somename@example.com",
// missing parts
"@nouser.co", // without local part
"nodomain@", // without domain part
"noatdomain.com", // with no @ sign
/* extra dots */
".@domain.com", // only a dot as the local part
".a@domain.com", // a dot at the beginning of the local part
"a.@domain.com", // a dot at the end of the local part
/* invalid characters */
"\"@special.com", // " ASCII 0x22
"(@special.com", // ( ASCII 0x28
")@special.com", // ) ASCII 0x29
"<@special.com", // < ASCII 0x3C
">@special.com", // > ASCII 0x3E
"[@special.com", // [ ASCII 0x5B
"]@special.com", // ] ASCII 0x5D
":@special.com", // : ASCII 0x3A
";@special.com", // ; ASCII 0x3B
"@@special.com", // @ ASCII 0x40
"\\@special.com", // \ ASCII 0x5C
",@special.com", // , ASCII 0x2C
".@special.com", // . ASCII 0x2E
/* not ASCII characters and ASCII control characters */
"кирилица@domain.com", // some Cyrillic characters
"\uFFFF@domain.com", // Unicode code point U+FFFF (guaranteed not to be a character)
"\u007F@domain.com", // ASCII 0x7F (del) character
"\u0000@domain.com", // ASCII 0x00 (nul) character
/* invalid domain part */
"local@dot..com", // two adjacent dots
"local@dot.dot", // non-existing TLD - dot
"local@.dot.com", // domain starts with a dot
/* longer than maximum length */
"12345678901234567890123456789012345678901234567890123456789012345@domain.com", // local part 65 characters long
"1234567890@1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.1234567890.1234567890.1234567890.1234567890.1234567890.1234567890."
    + "1234567890.12345678901.com", // max email length - 257
/* quoted local part */
"\"quotedstring\\\"@quoted.com", // ends with escaped double quote
/* quoted pairs */
"\\\u007F@quoted.com", // invalid quoted pair - ASCII 0x7F (del) character
"\\\u0000@quoted.com", // invalid quoted pair - ASCII 0x00 (nul) character
"\\ъ@quoted.com", // invalid quoted pair - non ASCII character - ъ
/* RFC 3696 section 3 */
"Fred Bloggs@example.com",

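One way to put the two lists to work is to paste them into two String arrays and run both through a validator. What follows is only a minimal sketch, not the original test code: the class name EmailListCheck, the method failures and the use of java.util.function.Predicate are assumptions made for illustration, only a handful of addresses from the lists are shown, and the lambda in main is a naive stand-in for a real validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class EmailListCheck {

    /** Returns one message per address where the validator disagrees with the expected outcome. */
    static List<String> failures(String[] valid, String[] invalid, Predicate<String> validator) {
        List<String> failed = new ArrayList<>();
        for (String email : valid) {
            if (!validator.test(email)) {
                failed.add("should be valid: " + email);
            }
        }
        for (String email : invalid) {
            if (validator.test(email)) {
                failed.add("should be invalid: " + email);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // A few entries from each list; paste the full lists here.
        String[] valid = { "a@a.ad", "\"quoted@string\"@quoted.com" };
        String[] invalid = { "@nouser.co", "nodomain@", "noatdomain.com" };

        // Hypothetical stand-in: only checks that something surrounds an @ sign.
        Predicate<String> naive = e -> e.indexOf('@') > 0 && e.lastIndexOf('@') < e.length() - 1;

        failures(valid, invalid, naive).forEach(System.out::println);
    }
}
```

An empty result means the validator agrees with both lists. Swapping the stand-in for a real validator is a matter of passing a different predicate.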
The next tool you need is the validation regexps. First, there is more than one. Second, there are some length validations. This should be pretty self-explanatory. The regexps below are constructed from the formal definition of an email address. They use JavaScript-compatible syntax (for example \x00-\x1F\x7F instead of \p{Cntrl}), as they were written to work in GWT.

MAX_LOCAL_PART_LENGTH = 64;
MAX_EMAIL_LENGTH = 256;

// Using \x00-\x1F\x7F instead of \p{Cntrl}
SPECIAL_CHARS = "\\x00-\\x1F\\x7F\\(\\)<>@,;:'\\\\\\\"\\.\\[\\]";

// Using \x00-\x7F instead of \p{ASCII}
LEGAL_ASCII_REGEX = "^[\\x00-\\x7F]+$";

Q_PAIR = "\\\\([\\x21-\\x7E]|\\x20|\\x09)";
// Bug fix including \x20 (space) due to RFC 3696 section 3
Q_TEXT = "[\\x20\\x21\\x23-\\x5B\\x5D-\\x7E]";
Q_CONTENT = "(" + Q_TEXT + ")|(" + Q_PAIR + ")";
QUOTED_USER = "\"(" + Q_CONTENT +  ")*\"";

VALID_CHARS = "([^\\s" + SPECIAL_CHARS + "])|(" + Q_PAIR + ")";
WORD = "((" + VALID_CHARS + "|')+|(" + QUOTED_USER + "))";

EMAIL_REGEX = "^\\s*?(.+)@(.+?)\\s*$";
USER_REGEX = "^\\s*" + WORD + "(\\." + WORD + ")*$";
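To see how the pieces fit together, the definitions above can be given Java types and tried on a few local parts. This is only a sketch: the class name LocalPartDemo and the method matchesLocalPart are made up for illustration, and the constants are copied unchanged from the listing above.

```java
import java.util.regex.Pattern;

final class LocalPartDemo {

    // Constants copied unchanged from the listing above.
    static final String SPECIAL_CHARS = "\\x00-\\x1F\\x7F\\(\\)<>@,;:'\\\\\\\"\\.\\[\\]";
    static final String Q_PAIR = "\\\\([\\x21-\\x7E]|\\x20|\\x09)";
    static final String Q_TEXT = "[\\x20\\x21\\x23-\\x5B\\x5D-\\x7E]";
    static final String Q_CONTENT = "(" + Q_TEXT + ")|(" + Q_PAIR + ")";
    static final String QUOTED_USER = "\"(" + Q_CONTENT + ")*\"";
    static final String VALID_CHARS = "([^\\s" + SPECIAL_CHARS + "])|(" + Q_PAIR + ")";
    static final String WORD = "((" + VALID_CHARS + "|')+|(" + QUOTED_USER + "))";
    static final String USER_REGEX = "^\\s*" + WORD + "(\\." + WORD + ")*$";

    // Compiled once; the helper below is hypothetical, not part of the original code.
    private static final Pattern USER_PATTERN = Pattern.compile(USER_REGEX);

    static boolean matchesLocalPart(String localPart) {
        return USER_PATTERN.matcher(localPart).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc.def",            // two words joined by a dot
            "a..b",               // empty word between the dots
            "\"quoted@string\"",  // quoted string containing @
            "Fred\\ Bloggs",      // space escaped as a quoted pair
            "Fred Bloggs"         // unescaped space
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matchesLocalPart(sample));
        }
    }
}
```

With these definitions the first, third and fourth samples match, and the other two do not: a WORD needs at least one character, and a bare space is only accepted inside quotes or after a backslash.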

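Finally, here is the Java code which does it all. Its domain validation is missing (isValidDomain always returns true, so the addresses in the "invalid domain part" group are accepted until you add it), but that is easy.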

/**
 * Licensed under the Apache License, Version 2.0 - http://www.apache.org/licenses/LICENSE-2.0 - the rest of the header is skipped for brevity.
 *
 * Performs email validations.
 *
 * Based on org.apache.commons.validator.routines.EmailValidator.
 *
 * @author Apache Software Foundation
 * @author ShaMan-H_Fel
 */
public class EmailValidator {

    private static final int MAX_LOCAL_PART_LENGTH = 64;
    private static final int MAX_EMAIL_LENGTH = 256;

    // Using \x00-\x1F\x7F instead of \p{Cntrl}
    private static final String SPECIAL_CHARS = "\\x00-\\x1F\\x7F\\(\\)<>@,;:'\\\\\\\"\\.\\[\\]";

    // Using \x00-\x7F instead of \p{ASCII}
    private static final String LEGAL_ASCII_REGEX = "^[\\x00-\\x7F]+$";

    private static final String Q_PAIR = "\\\\([\\x21-\\x7E]|\\x20|\\x09)";
    // Bug fix including \x20 (space) due to RFC 3696 section 3
    private static final String Q_TEXT = "[\\x20\\x21\\x23-\\x5B\\x5D-\\x7E]";
    private static final String Q_CONTENT = "(" + Q_TEXT + ")|(" + Q_PAIR + ")";
    private static final String QUOTED_USER = "\"(" + Q_CONTENT +  ")*\"";

    private static final String VALID_CHARS = "([^\\s" + SPECIAL_CHARS + "])|(" + Q_PAIR + ")";
    private static final String WORD = "((" + VALID_CHARS + "|')+|(" + QUOTED_USER + "))";

    private static final String EMAIL_REGEX = "^\\s*?(.+)@(.+?)\\s*$";
    private static final String USER_REGEX = "^\\s*" + WORD + "(\\." + WORD + ")*$";

    /**
     * Singleton instance of this class.
     */
    private static final EmailValidator EMAIL_VALIDATOR = new EmailValidator();

    /**
     * Returns the Singleton instance of this validator.
     *
     * @return singleton instance of this validator.
     */
    public static EmailValidator getInstance() {
        return EMAIL_VALIDATOR;
    }

    /**
     * Protected constructor for subclasses to use.
     */
    protected EmailValidator() {
        super();
    }

    /**
     * <p>
     * Checks if a field has a valid e-mail address.
     * </p>
     *
     * @param email
     *            The value validation is being performed on. A <code>null</code> value is
     *            considered invalid.
     * @return true if the email address is valid.
     */
    public boolean isValid(String email) {
        if (email == null) {
            return false;
        }

        if (email.length() > MAX_EMAIL_LENGTH) {
            return false;
        }

        boolean match = email.matches(LEGAL_ASCII_REGEX);
        if (!match) {
            return false;
        }

        // Check the whole email address structure
        match = email.matches(EMAIL_REGEX);
        if (!match) {
            return false;
        }

        /*
         * There are 2 cases in which an email can contain more than one @ sign:
         *
         * 1. If the local part is "quoted", example "test@test"@domain.com
         *
         * 2. If the @ sign is in quoted pair (escaped), example: test\@test@domain.com
         */
        String[] groups = null;
        int lastAt = email.lastIndexOf("@");

        groups = new String[2];
        groups[0] = email.substring(0, lastAt);
        groups[1] = email.substring(lastAt + 1);

        if (!isValidUser(groups[0])) {
            return false;
        }

        if (!isValidDomain(groups[1])) {
            return false;
        }

        return true;
    }

    /**
     * Returns true if the domain component of an email address is valid.
     *
     * @param domain
     *            being validated.
     * @return true if the email address's domain is valid.
     */
    protected boolean isValidDomain(String domain) {
        // Hook your domain validation here.
        return true;
    }

    /**
     * Returns true if the user component (local part) of an email address is valid.
     *
     * @param user
     *            being validated
     * @return true if the user name is valid.
     */
    protected boolean isValidUser(String user) {
        if (user.length() > MAX_LOCAL_PART_LENGTH) {
            return false;
        }

        boolean result = user.matches(USER_REGEX);
        result = result && user.matches(LEGAL_ASCII_REGEX);

        return result;
    }

}
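
The domain hook is left empty above. As a rough illustration of what could go there, here is a minimal sketch, not the original implementation. The class name DomainCheck, the knownTlds parameter and the rules themselves are assumptions: labels of 1 to 63 letters, digits and hyphens with no hyphen at either end, at least two labels, one optional trailing dot for the absolute form, and a last label that appears in a caller-supplied set of TLDs. The set used in main is a tiny placeholder, not a real TLD list.

```java
import java.util.Locale;
import java.util.Set;

final class DomainCheck {

    private static final int MAX_LABEL_LENGTH = 63;

    // Letters, digits and hyphens, with no hyphen at either end of the label.
    private static final String LABEL_REGEX = "^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$";

    /** Hypothetical helper: knownTlds is expected to hold lower-case TLDs. */
    static boolean isValidDomain(String domain, Set<String> knownTlds) {
        if (domain == null || domain.isEmpty()) {
            return false;
        }
        // Accept the absolute form by dropping a single trailing dot.
        if (domain.endsWith(".")) {
            domain = domain.substring(0, domain.length() - 1);
        }
        // The limit of -1 keeps empty strings, so adjacent or leading dots show up as empty labels.
        String[] labels = domain.split("\\.", -1);
        if (labels.length < 2) {
            return false;
        }
        for (String label : labels) {
            if (label.length() > MAX_LABEL_LENGTH || !label.matches(LABEL_REGEX)) {
                return false;
            }
        }
        String tld = labels[labels.length - 1].toLowerCase(Locale.ROOT);
        return knownTlds.contains(tld);
    }

    public static void main(String[] args) {
        Set<String> tlds = Set.of("com", "bg", "ad"); // placeholder list for the demo
        String[] samples = { "dot.com", "dot.com.", "dot..com", "dot.dot", ".dot.com" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValidDomain(sample, tlds));
        }
    }
}
```

Under these assumptions the first two samples are accepted and the last three are rejected, which lines up with the domain cases in the lists above. To use it from the validator, isValidDomain in EmailValidator could delegate to a check like this one.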