Working with email addresses in SQL Server
Last updated: July 21st '02 | Best viewed with: All popular browsers | Best viewed at: 1024x768 | Links to external sites will open in a new window

About myself
My technical skills
My favorites
My picture album

Shortcut keys
My code library

VB resources
SQLServer resources
SQLServer books
Replication FAQ
Scripting resources
ASP resources

Search my site
Sign my guestbook
Contact information

SQL Server Articles New

NEW!!! Subscribe to my newsletter:
Want to keep in touch with the latest in SQL Server world? Email with 'subscribe' in the subject line
See also:  
Database coding conventions SQL Server interview questions
Evaluation of federated database servers SQL Server FAQ on programming, administration, replication and tools
SQL Server security best practices SQL Server administration best practices
Related books:
The guru's guide to Transact SQL Advanced Transact-SQL for SQL Server 2000 SQL Server 2000 Programming by example

Working with email addresses in SQL Server

Storing email addresses in SQL Server databases is quite common, especially in those tables, that store information about customers, clients, subscribers etc.

This leads to the obvious question, which datatype one should use for storing email addresses? undoubtedly, it is varchar data type, as email addresses vary a great deal in length. Use nvarchar if you need to store email addresses that contain extended characters. Note that nvarchar datatype requires double the amount of storage space, compared to varchar! So use it only if you need it.

Now, the next question. What should be the length of this email column? I would say, use varchar(70). In my experience, an average email address is about 23 characters in length, the shortest email addresses being about 10 characters long and the longest ones being 65 to 70 characters long.

Now that we decided on the datatype and length, how about validating email addresses? In my humble opinion, email address validation should take place in the client applications and websites. It is the responsibility of your applications, not to send crap email addresses to your database. But then there are always exceptions like, multiple applications with varying degrees of validations, in which case, it is better to centralize the validation at the database level (or validate at the database level anyway, to be doubly sure).

Email validation has always been a tricky bit for programmers. T-SQL lacks sophisticated regular expressions, that are available in other languages like javascript. Keeping that in mind, here's the code for a CHECK constraint that validates the inserted and updated email addresses.

ALTER TABLE [dbo].[Subscribers] 
CONSTRAINT [EmailValidator] 
	CHARINDEX(' ',LTRIM(RTRIM([Email]))) = 0 -- No embedded spaces
AND 	LEFT(LTRIM([Email]),1) <> '@'  -- '@' can't be the first character of an email address
AND 	RIGHT(RTRIM([Email]),1) <> '.' -- '.' can't be the last character of an email address
AND 	CHARINDEX('.',[Email],CHARINDEX('@',[Email])) - CHARINDEX('@',[Email]) > 1 -- There must be a '.' after '@'
AND 	LEN(LTRIM(RTRIM([Email]))) - LEN(REPLACE(LTRIM(RTRIM([Email])),'@','')) = 1 -- Only one '@' sign is allowed
AND 	CHARINDEX('.',REVERSE(LTRIM(RTRIM([Email])))) >= 3 -- Domain name should end with at least 2 character extension
AND 	(CHARINDEX('.@',[Email]) = 0 AND CHARINDEX('..',[Email]) = 0) -- can't have patterns like '.@' and '..'

The above CHECK constraint is added on the column "Email", of the table "Subscribers". Notice the comments at the end of each line. Also notice that the constraint is being added with "WITH CHECK" option. So, the constraint will not get created if your table already has invalid email addresses, in which case, you will see the following error message:

Server: Msg 547, Level 16, State 1, Line 1
ALTER TABLE statement conflicted with TABLE CHECK constraint 'EmailValidator'. The conflict occurred in database 'pubs', table 'subscribers'.

This CHECK constraint will prevent most of the invalid email addresses from being added to the database, but it is not 100% fool proof. We still need to check for characters that are not allowed inside an email address. I do not have a definitive list of characters that are allowed/disallowed in an email address at the moment, but the above CHECK constraint can be amended with this check, if you wish. There is no guarantee that the domain name is going to be really valid, unless you look up the DNS records, which is too much to ask for, in an OLTP database. But this kind of checks can be performed in a batch process that runs during off-peak hours.

In SQL Server 2000, this CHECK constraint can be rewritten as a User Defined Function (UDF). Now, let's change gears and move on from storing and validating email addresses, to querying email addresses.

Btw, if the above CHECK constraint fails, because you already have invalid email addresses, run the following query to see those invalid/offending email address:

FROM Subscribers
	CHARINDEX(' ',LTRIM(RTRIM([Email]))) = 0 
AND  	LEFT(LTRIM([Email]),1) <> '@' 
AND  	RIGHT(RTRIM([Email]),1) <> '.' 
AND  	CHARINDEX('.',[Email],CHARINDEX('@',[Email])) - CHARINDEX('@',[Email]) > 1 
AND  	LEN(LTRIM(RTRIM([Email]))) - LEN(REPLACE(LTRIM(RTRIM([Email])),'@','')) = 1 
AND  	(CHARINDEX('.@',[Email]) = 0 AND CHARINDEX('..',[Email]) = 0) 

Consider that the table "Subscribers", stores the email addresses (along with other details) of people subscribed to your online newsletter. Now you want to find out how many subscribers you have from each domain (for example, Hotmail, Yahoo etc.). Here's the query that does that:

SELECT	RIGHT(Email, CHARINDEX('@', REVERSE(Email))-1) AS [Domain Name], COUNT(Email) AS [Number of Subscribers]
FROM 	dbo.Subscribers
ORDER BY [Number of Subscribers] DESC, [Domain Name]

The above query extracts the domain names from the email addresses and groups by the domain name, to get the count of subscribers from each domain. Output is sorted in the descending order of number of subscribers, followed by the domain name in ascending order, so that the most popular domain name appears at the top of the list. Here's the sample output:

Domain Name		Number of Subscribers 
-----------------		---------------------		197		167			160			151			149		147		143		139			135		130

(10 row(s) affected)

You will notice that we used an expression in the SELECT statement, to extract the domain name out of the email address. We had to repeat the same expression in the GROUP BY clause. This is no good. If you change the expression in the column list, you must update the GROUP BY clause accordingly. This makes the query a good candidate for "Views". You could create a view on "Subscribers" table, embedding the expression in the view definition and updating your queries to select from the view instead of the table. If for whatever reason, you need to change the expression that extracts the domain name, you will only change the view, but your queries will remain the same. Cool, isn't it? Here's how the view and your queries will look like:

CREATE VIEW SubscribersView
SELECT	Name, DOJ, Email, RIGHT(Email, CHARINDEX('@', REVERSE(Email))-1) AS DomainName, Country
FROM 	dbo.Subscribers

SELECT	DomainName, COUNT(Email) AS [Number of Subscribers]
FROM 	dbo.SubscribersView
GROUP BY DomainName
ORDER BY [Number of Subscribers] DESC, DomainName

As the expression is hidden inside the view definition, the query looks much better and easy to read and is more maintainable. Now, you are asked to generate a report containing name and email address of all those subscribers that are using Hotmail. The following query makes use of the above view and generates the report:

SELECT	Name, Email
FROM	dbo.SubscribersView
WHERE	DomainName = ''

Things are looking great. But have you checked the execution plan of the above query? (You can enable execution plan display by pressing Ctrl + K in Query Analyzer). Though there is an index defined on the Email column, you will see that the query is not using the index efficiently. It is doing a slower 'index scan', rather than a faster 'index seek'. This is because the query optimizer is unable to predict the result of this expression in advance and hence unable to use the index. So, how can we get this query to do an 'index seek' then?

Here is another approach. If you can afford to add an additional column to your table, you can do away with the view. Idea is to add a computed column, that holds the domain name, and index this computed column (Computed or calculated columns are indexable only in SQL Server 2000). The following code will add a computed column called DomainName, to the Subscribers table, create a non-unique, non-clustered index on the newly added column:

ALTER TABLE dbo.Subscribers ADD DomainName AS RIGHT(Email, CHARINDEX('@', REVERSE(Email))-1)

CREATE NONCLUSTERED INDEX NU_NC_IDX_Subscribers_DomainName ON dbo.Subscribers(DomainName)

Now, if you run the following query, the execution plan will reveal an index seek on the computed column's index and the query performs much faster, compared to the previous query against the view SubscribersView:

SELECT	Name, Email
FROM	dbo.Subscribers
WHERE	DomainName = ''

Do investigate the "Indexed Views" feature of SQL Server 2000, that allows the creation of indexes on views. Though it is useful, it has a lot of restrictions. See SQL Server Books Online for more information on Indexed Views.

Okay, now how about finding out how many subscribers we have from each country? Though you don't record your subscriber's country, you can always use the email address to find out his/her country. For example, an email address ending with .uk is from Uniked Kingdom and the one ending with .in is from India. Here's how you can find out the counties of your subscribers:

1. Create a new table called "Countries", using the following CREATE TABLE script:

	CountryCode  char(4),
	Country         varchar(40)
2. Populate this Countries table with the country extension and country names, using this script! This script has 252 different country codes and names. Click here to download the script!

Now, you can write queries, by joining the subscribers table with the countries table, as shown below:

SELECT	c.Country, COUNT(*) AS [Number of Subscribers]
FROM	dbo.Subscribers s 
	dbo.Countries c
	RIGHT(s.Email, CHARINDEX('.', REVERSE(s.Email))-1) = c.CountryCode
GROUP BY c.Country
ORDER BY COUNT(*) Desc, c.Country ASC
You can simplify the above query by creating a view and querying that view instead, as shown below:

CREATE VIEW dbo.CountryView
SELECT	c.Country
FROM	dbo.Subscribers s 
	dbo.Countries c
	RIGHT(s.Email, CHARINDEX('.', REVERSE(s.Email))-1) = c.CountryCode

SELECT	Country, COUNT(*) AS [Number of Subscribers]
FROM 	dbo.CountryView
GROUP BY Country

We covered the storage and query aspects of email addresses. Last but not the least, what about security? Email address is a very private piece of information and should be dealt with carefully. You need to make sure, only the right applications and right people are able to see this column. You don't want your junior DBA to generate a list of email addresses and sell it to your competitor, do you? If you don't trust your junior DBA or anyone else that has got access to the data, create a database role, place these users in that role and deny SELECT permissions for this role on the Email column. For example, the following scripts will add a new login called JuniorDBA, grant him access to Newsletter database, deny permission on Email column, by setting the column level permissions:

USE Newsletter

EXEC sp_addlogin 'JuniorDBA', '3rfsdc#'

EXEC sp_grantdbaccess 'JuniorDBA'

EXEC sp_addrolemember 'db_datareader', 'JuniorDBA'

EXEC sp_addrole 'Restricted'

DENY SELECT ON Subscribers(Email) TO Restricted

EXEC sp_addrolemember 'Restricted', 'JuniorDBA'

That's about it for now. Will update this page with more info, as and when I learn something new, so be sure to check back again.