Database coding conventions,
best practices, programming guidelines
Last updated: January 28th '02
Databases are the heart and soul of many recent enterprise applications, so it is essential to pay special attention to database programming. On many occasions I've seen database programming overlooked on the assumption that it's easy and can be done by anyone. This is wrong. For a well-performing database you need a real DBA and a specialist database programmer, be it for Microsoft SQL Server, Oracle, Sybase, DB2 or whatever! If you don't use database specialists during your development cycle, the database often ends up becoming the performance bottleneck. I decided to write this article to put together some of the database programming best practices, so that my fellow DBAs and database developers can benefit!
Here are some programming guidelines and best practices, written with quality, performance and maintainability in mind. This list may not be complete at this moment, and will be constantly updated. Btw, special thanks to Tibor Karaszi (SQL Server MVP) and Linda (lindawie) for taking the time to read this article and providing suggestions.
Do not depend on undocumented functionality. The reasons being:
- You will not get support from Microsoft, when something goes wrong with your undocumented code
- Undocumented functionality is not guaranteed to exist (or behave the same) in a future release or service pack, thereby breaking your code
Try not to use system tables directly. System table structures may change in a future release. Wherever possible, use the sp_help* stored procedures or INFORMATION_SCHEMA views. There will be situations where you cannot avoid accessing system tables, though!
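As a minimal sketch of the INFORMATION_SCHEMA approach, the query below lists the columns of a table without touching the syscolumns system table. The table name titles (from the pubs sample database) is only an assumption for illustration.

```sql
-- List the columns of a table through a documented view
-- instead of querying the syscolumns system table directly.
-- 'titles' (pubs sample database) is an illustrative assumption.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'titles'
ORDER BY ORDINAL_POSITION
```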
Make sure you normalize your data at least to third normal form. At the same time, do not compromise on query performance. A little bit of denormalization helps queries perform faster.
Write comments in
your stored procedures, triggers and SQL batches
generously, whenever something is not very
obvious. This helps other programmers understand
your code clearly. Don't worry about the length
of the comments, as it won't impact the
performance, unlike interpreted languages like
ASP 2.0.
Do not use SELECT * in your queries. Always
write the required column names after the SELECT statement, like SELECT
CustomerID, CustomerFirstName, City. This technique results
in less disk IO and less network traffic and hence better performance.
Try to avoid server side cursors as much as possible. Always stick to a 'set based approach' instead of a 'procedural approach' for accessing/manipulating data. Cursors can easily be avoided by using SELECT statements in many cases. If row-by-row processing is unavoidable, use a simple WHILE loop instead of a cursor to loop through the table. I personally tested and concluded that a WHILE loop is faster than a cursor most of the time. But for a WHILE loop to replace a cursor you need a column (primary key or unique key) to identify each row uniquely, and I personally believe every table must have a primary or unique key.
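One possible shape for such a WHILE loop is sketched below. It assumes a hypothetical dbo.Orders table with a unique integer key OrderID; the PRINT statement merely stands in for whatever per-row work is needed.

```sql
-- Walk through a table one row at a time without a cursor,
-- using the unique key to find the next row.
-- dbo.Orders and its integer key OrderID are hypothetical.
DECLARE @OrderID int

SELECT @OrderID = MIN(OrderID)
FROM dbo.Orders

WHILE @OrderID IS NOT NULL
BEGIN
    -- Per-row work goes here; PRINT stands in for real processing.
    PRINT 'Processing order ' + CAST(@OrderID AS varchar(12))

    -- Move to the next key; MIN returns NULL when no rows remain,
    -- which ends the loop.
    SELECT @OrderID = MIN(OrderID)
    FROM dbo.Orders
    WHERE OrderID > @OrderID
END
```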
Avoid creating temporary tables while processing data as much as possible, as creating a temporary table means more disk IO. Consider advanced SQL, views, SQL Server 2000 table variables or derived tables instead of temporary tables. Keep in mind that, in some cases, using a temporary table performs better than a highly complicated query.
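As a rough illustration, a SQL Server 2000 table variable can hold an intermediate result where a #temp table might otherwise be used. The sketch assumes the titles table of the pubs sample database; the variable name @ExpensiveTitles and the price threshold are made up.

```sql
-- Table variable (SQL Server 2000 and later) in place of a #temp table.
-- @ExpensiveTitles and the threshold of 15 are illustrative assumptions.
DECLARE @ExpensiveTitles TABLE
(
    title_id varchar(6) NOT NULL PRIMARY KEY,
    price money NULL
)

INSERT INTO @ExpensiveTitles (title_id, price)
SELECT title_id, price
FROM dbo.titles
WHERE price > 15

SELECT title_id, price
FROM @ExpensiveTitles
ORDER BY price DESC
```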
Try to avoid wildcard characters at the beginning of a word while searching using the LIKE keyword, as that results in an index scan, which defeats the purpose of having an index. The first of the following statements results in an index scan, while the second statement results in an index seek:
1. SELECT LocationID FROM Locations WHERE Specialities LIKE '%pples'
2. SELECT LocationID FROM Locations WHERE Specialities LIKE 'A%s'
Also avoid searching with not-equals operators (<> and NOT), as they result in table and index scans. If you must do heavy text-based searches, consider using the Full-Text search feature of SQL Server for better performance.
Use 'Derived
tables' wherever possible, as they perform
better. Consider the following query to find the
second highest salary from Employees table:
SELECT MIN(Salary)
FROM Employees
WHERE EmpID IN
(
SELECT TOP 2 EmpID
FROM Employees
ORDER BY Salary Desc
)
The same query can be re-written using a derived
table as shown below, and it performs twice as
fast as the above query:
SELECT MIN(Salary)
FROM
(
SELECT TOP 2 Salary
FROM Employees
ORDER BY Salary Desc
) AS A
This is just an example; the results might differ in different scenarios depending upon the database design, indexes, volume of data etc. So, test all the possible ways a query could be written and go with the most efficient one. With some practice and an understanding of how the SQL Server optimizer works, you will be able to come up with the best possible queries without this trial and error method.
While designing your database, keep performance in mind. You can't really tune performance later, when your database is in production, as that involves rebuilding tables/indexes and re-writing queries. Use the graphical execution plan in Query Analyzer or the SHOWPLAN_TEXT or SHOWPLAN_ALL commands to analyze your queries. Make sure your queries do 'Index seeks' instead of 'Index scans' or 'Table scans'. A table scan or an index scan is a very bad thing and should be avoided where possible (though sometimes, when the table is very small or the whole table needs to be processed, the optimizer will choose a table or index scan).
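A minimal sketch of the SHOWPLAN_TEXT command in use follows. The query against the pubs titles table is only an assumed example; any query can be placed between the two SET statements.

```sql
-- Show the estimated execution plan as text instead of running the query.
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GOs.
SET SHOWPLAN_TEXT ON
GO
SELECT title_id, title
FROM dbo.titles
WHERE title_id = 'BU1032'
GO
SET SHOWPLAN_TEXT OFF
GO
```

In the plan text, look for operators named Index Seek or Clustered Index Seek rather than Index Scan or Table Scan.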
Prefix table names with owner names, as this improves readability and avoids unnecessary confusion. Microsoft SQL Server Books Online even states that qualifying table names with owner names helps in execution plan reuse.
Use SET NOCOUNT ON at the beginning of your SQL batches, stored procedures and triggers in production environments, as this suppresses messages like '(1 row(s) affected)' after executing INSERT, UPDATE, DELETE and SELECT statements. This in turn improves the performance of the stored procedures by reducing the network traffic.
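For instance, a procedure might begin as in the sketch below. The procedure name usp_RaiseTitlePrice and its parameters are hypothetical; the titles table comes from the pubs sample database.

```sql
CREATE PROC dbo.usp_RaiseTitlePrice   -- hypothetical procedure name
    @title_id varchar(6),
    @increase money
AS
BEGIN
    -- Suppress the '(n row(s) affected)' messages for this procedure.
    SET NOCOUNT ON

    UPDATE dbo.titles
    SET price = price + @increase
    WHERE title_id = @title_id
END
GO
```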
Use the more readable ANSI-standard join clauses instead of the old style joins. With ANSI joins, the WHERE clause is used only for filtering data, whereas with older style joins the WHERE clause handles both the join condition and filtering data. The first of the following two queries shows an old style join, while the second one shows the new ANSI join syntax:
SELECT a.au_id, t.title
FROM titles t, authors a, titleauthor ta
WHERE
    a.au_id = ta.au_id AND
    ta.title_id = t.title_id AND
    t.title LIKE '%Computer%'

SELECT a.au_id, t.title
FROM authors a
    INNER JOIN titleauthor ta
        ON a.au_id = ta.au_id
    INNER JOIN titles t
        ON ta.title_id = t.title_id
WHERE t.title LIKE '%Computer%'
Be aware that the old style *= and =* left and right outer join syntax may not be supported in a future release of SQL Server, so you are better off adopting the ANSI standard outer join syntax.
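As an illustration of the outer join case, here is an old style left outer join (shown only as a comment) next to its ANSI equivalent. The query uses the authors and titleauthor tables of the pubs sample database and is an assumed example.

```sql
-- Old style left outer join, for comparison only:
--   SELECT a.au_id, ta.title_id
--   FROM authors a, titleauthor ta
--   WHERE a.au_id *= ta.au_id

-- ANSI equivalent: authors without any titles still appear,
-- with NULL in the title_id column.
SELECT a.au_id, ta.title_id
FROM dbo.authors a
    LEFT OUTER JOIN dbo.titleauthor ta
        ON a.au_id = ta.au_id
```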
Do not prefix your stored procedure names with 'sp_'. The prefix sp_ is reserved for the system stored procedures that ship with SQL Server. Whenever SQL Server encounters a procedure name starting with sp_, it first tries to locate the procedure in the master database, then looks for any qualifiers (database, owner) provided, then tries dbo as the owner. So, you can really save time in locating the stored procedure by avoiding the sp_ prefix. But there is an exception! While creating general purpose stored procedures that are called from all your databases, go ahead and prefix those stored procedure names with sp_ and create them in the master database.
Views are generally used to show specific data to specific users based on their interest. Views are also used to restrict access to the base tables by granting permission only on the views. Yet another significant use of views is that they simplify your queries. Incorporate your frequently required complicated joins and calculations into a view, so that you don't have to repeat those joins/calculations in all your queries; instead, just select from the view.
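A small sketch of this idea, wrapping the three-table join from the pubs sample database in a view. The view name vw_AuthorTitles is hypothetical.

```sql
CREATE VIEW dbo.vw_AuthorTitles   -- hypothetical view name
AS
SELECT a.au_id, a.au_lname, a.au_fname, t.title_id, t.title
FROM dbo.authors a
    INNER JOIN dbo.titleauthor ta
        ON a.au_id = ta.au_id
    INNER JOIN dbo.titles t
        ON ta.title_id = t.title_id
GO

-- Callers select from the view and no longer repeat the joins.
SELECT au_lname, title
FROM dbo.vw_AuthorTitles
WHERE title LIKE 'Computer%'
GO
```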
Use 'User Defined
Datatypes', if a particular column repeats in a
lot of your tables, so that the datatype of that
column is consistent across all your tables.
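In SQL Server 2000 a user defined datatype can be created with the sp_addtype system procedure. The sketch below assumes a made-up type named PhoneNumber and a made-up Suppliers table.

```sql
-- Define the type once (PhoneNumber is a hypothetical type name)...
EXEC sp_addtype 'PhoneNumber', 'varchar(24)', 'NULL'
GO

-- ...and reuse it, so every phone column has the same definition.
CREATE TABLE dbo.Suppliers      -- hypothetical table
(
    SupplierID int NOT NULL PRIMARY KEY,
    Phone PhoneNumber,
    Fax PhoneNumber
)
GO
```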
Do not let your
front-end applications query/manipulate the data
directly using SELECT or
INSERT/UPDATE/DELETE statements. Instead, create
stored procedures, and let your applications
access these stored procedures. This keeps the
data access clean and consistent across all the
modules of your application, at the same time
centralizing the business logic within the
database.
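A minimal sketch of this pattern: the application calls a procedure and holds EXECUTE permission only. The procedure name usp_GetTitlesByType and the user AppUser are hypothetical; the titles table is from the pubs sample database.

```sql
CREATE PROC dbo.usp_GetTitlesByType   -- hypothetical procedure name
    @type char(12)
AS
BEGIN
    SET NOCOUNT ON

    SELECT title_id, title, price
    FROM dbo.titles
    WHERE type = @type
END
GO

-- The application account gets EXECUTE on the procedure only,
-- not SELECT on the underlying table.
-- AppUser is a hypothetical database user that must already exist.
GRANT EXEC ON dbo.usp_GetTitlesByType TO AppUser
GO
```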
Try not to use the text and ntext datatypes for storing large textual data. The text datatype has some inherent problems associated with it. You cannot directly write or update text data using INSERT and UPDATE statements (you have to use special statements like READTEXT, WRITETEXT and UPDATETEXT). There are a lot of bugs associated with replicating tables containing text columns. So, if you don't have to store more than 8 KB of text, use the char(8000) or varchar(8000) datatypes.
If you have a choice, do not store binary files, image files (binary large objects or BLOBs) etc. inside the database. Instead, store the path to the binary/image file in the database and use that as a pointer to the actual binary file. Retrieving and manipulating these large binary files is better performed outside the database and, after all, a database is not meant for storing files.
Use the char data type for a column only when the column is non-nullable. If a char column is nullable, it is treated as a fixed length column in SQL Server 7.0+. So, a char(100), when NULL, will eat up 100 bytes, resulting in wasted space. So, use varchar(100) in this situation. Of course, variable length columns do have a little processing overhead over fixed length columns. Carefully choose between char and varchar depending upon the length of the data you are going to store.
Avoid dynamic SQL statements as much as possible. Dynamic SQL tends to be slower than static SQL, as SQL Server must generate an execution plan every time at runtime. IF statements and CASE expressions come in handy to avoid dynamic SQL. Another major disadvantage of using dynamic SQL is that it requires the users to have direct access permissions on all accessed objects like tables and views. Generally, users are given access to the stored procedures which reference the tables, but not directly to the tables. In this case, dynamic SQL will not work. Consider the following scenario, where a user named 'dSQLuser' is added to the pubs database and is granted access to a procedure named 'dSQLproc', but not to any tables in the pubs database. The first statement in the procedure dSQLproc executes a direct SELECT on the titles table, and that works. The second statement runs the same SELECT on the titles table using dynamic SQL, and it fails with the following error:
Server: Msg 229, Level 14, State 5, Line 1
SELECT permission denied on object 'titles', database 'pubs', owner 'dbo'.
To reproduce the above problem, use the following commands:
-- Create a login and make pubs its default database
EXEC sp_addlogin 'dSQLuser'
GO
EXEC sp_defaultdb 'dSQLuser', 'pubs'
GO
USE pubs
GO
-- Add the login as a user of the pubs database
EXEC sp_adduser 'dSQLuser', 'dSQLuser'
GO
CREATE PROC dSQLProc
AS
BEGIN
    SELECT * FROM titles WHERE title_id = 'BU1032' --This works
    DECLARE @str CHAR(100)
    SET @str = 'SELECT * FROM titles WHERE title_id = ''BU1032'''
    EXEC (@str) --This fails
END
GO
-- Grant access to the procedure only, not to the underlying tables
GRANT EXEC ON dSQLProc TO dSQLuser
GO
Now log in to the pubs database using the login dSQLuser and execute the procedure dSQLproc to see the problem.
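As one example of a CASE expression standing in for dynamic SQL, the sketch below lets the caller choose the sort column without building a query string. The procedure name usp_ListTitles and the @SortBy parameter are hypothetical; the titles table is from the pubs sample database.

```sql
CREATE PROC dbo.usp_ListTitles        -- hypothetical procedure name
    @SortBy varchar(10) = 'title'     -- expected values: 'title' or 'price'
AS
BEGIN
    SET NOCOUNT ON

    SELECT title_id, title, price
    FROM dbo.titles
    ORDER BY
        -- One CASE per column, so each expression keeps a single data type.
        -- The CASE that does not match yields NULL for every row and
        -- therefore does not affect the order.
        CASE WHEN @SortBy = 'title' THEN title END,
        CASE WHEN @SortBy = 'price' THEN price END
END
GO
```

Because the SELECT is static, a user who has only EXECUTE permission on the procedure can still run it.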
Consider the following drawbacks before using the IDENTITY property for generating primary keys. IDENTITY is very much SQL Server specific, and you will have problems if you want to support different database backends for your application. IDENTITY columns have other inherent problems: they run out of numbers one day or the other, numbers can't be reused automatically after deleting rows, and replication and IDENTITY columns don't always get along well. So, consider coming up with an algorithm to generate a primary key, in the front-end or from within the inserting stored procedure. There could be issues with generating your own primary keys too, like concurrency while generating the key and running out of values. So, consider both options and go with the one that suits you well.
Minimize the usage of NULLs, as they often confuse front-end applications, unless the applications are coded intelligently to eliminate NULLs or convert them into some other form. Most expressions that involve a NULL produce a NULL result. The ISNULL and COALESCE functions are helpful in dealing with NULL values. For example, consider a Customers table that stores the names of customers, where the middle name can be NULL: concatenating the first, middle and last names yields NULL for every customer without a middle name, unless the middle name is wrapped in ISNULL or COALESCE.
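The problem can be sketched as follows. The Customers table definition, its column sizes and the sample row are all assumptions made up for illustration.

```sql
-- Hypothetical Customers table; MiddleName allows NULL.
CREATE TABLE dbo.Customers
(
    FirstName varchar(20) NOT NULL,
    MiddleName varchar(20) NULL,
    LastName varchar(20) NOT NULL
)
GO

INSERT INTO dbo.Customers (FirstName, MiddleName, LastName)
VALUES ('Ann', NULL, 'Lee')    -- made-up sample row

-- With CONCAT_NULL_YIELDS_NULL ON, the NULL middle name
-- turns the whole expression into NULL:
SELECT FirstName + ' ' + MiddleName + ' ' + LastName AS FullName
FROM dbo.Customers

-- Wrapping the nullable part in ISNULL keeps the rest of the name:
SELECT FirstName + ' ' + ISNULL(MiddleName + ' ', '') + LastName AS FullName
FROM dbo.Customers
```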
Use Unicode
datatypes like nchar, nvarchar,
ntext, if
your database is going to store not just plain
English characters, but a variety of characters
used all over the world. Use these datatypes,
only when they are absolutely needed as they need
twice as much space as non-unicode datatypes.
Always use a column list in your INSERT statements. This helps in avoiding problems when the table structure changes (like adding a column). Here's an example which shows the problem. The following INSERT statement without a column list works perfectly:
INSERT INTO EuropeanCountries
VALUES (1, 'Ireland')
Now, let's add a new column to this table:
ALTER TABLE EuropeanCountries
ADD EuroSupport bit
Now run the above INSERT statement. You get the
following error from SQL Server:
Server: Msg 213, Level 16, State 4, Line 1
Insert Error: Column name or number of supplied values does not match table definition.
This problem can be avoided by writing an INSERT statement with a column
list as shown below:
INSERT INTO EuropeanCountries
(CountryID, CountryName)
VALUES (1, 'England')
Perform all your
referential integrity checks, data validations
using constraints (foreign key and check
constraints). These constraints are faster than
triggers. So, use triggers only for auditing,
custom tasks and validations that can not be
performed using these constraints. These
constraints save you time as well, as you don't
have to write code for these validations and the
RDBMS will do all the work for you.
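A minimal sketch of declarative integrity with a foreign key and a check constraint. The Departments and Staff tables and the constraint names are hypothetical.

```sql
-- Hypothetical tables showing declarative integrity.
CREATE TABLE dbo.Departments
(
    DeptID int NOT NULL PRIMARY KEY,
    DeptName varchar(50) NOT NULL
)
GO

CREATE TABLE dbo.Staff
(
    StaffID int NOT NULL PRIMARY KEY,
    -- Referential integrity: every DeptID must exist in Departments.
    DeptID int NOT NULL
        CONSTRAINT FK_Staff_Departments
        FOREIGN KEY REFERENCES dbo.Departments (DeptID),
    -- Data validation: negative salaries are rejected.
    Salary money NOT NULL
        CONSTRAINT CK_Staff_Salary CHECK (Salary >= 0)
)
GO
```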
Always access tables in the same order in all your stored procedures and triggers. This helps in avoiding deadlocks. Other things to keep in mind to avoid deadlocks are:
- Keep your transactions as short as possible.
- Touch as little data as possible during a transaction.
- Never, ever wait for user input in the middle of a transaction.
- Do not use higher level locking hints or restrictive isolation levels unless they are absolutely needed.
- Make your front-end applications deadlock-intelligent, that is, these applications should be able to resubmit the transaction in case the previous transaction fails with error 1205.
- In your applications, process all the results returned by SQL Server immediately, so that the locks on the processed rows are released and no blocking occurs.
Offload tasks like
string manipulations, concatenations, row
numbering, case conversions, type conversions
etc. to the front-end applications, if these operations are going to consume more CPU cycles on the database server (It's okay to do simple string manipulations on the database end though). Also try to do basic validations in
the front-end itself during data entry. This
saves unnecessary network roundtrips.
If back-end
portability is your concern, stay away from bit
manipulations with T-SQL, as this is very much
RDBMS specific. Further, using bitmaps to
represent different states of a particular entity
conflicts with the normalization rules.
Consider adding a
@Debug parameter to your stored procedures. This
can be of bit data type. When a 1 is passed for
this parameter, print all the intermediate
results, variable contents using SELECT or PRINT
statements and when 0 is passed do not print debug
information. This helps in quick debugging of stored
procedures, as you don't have to add and remove
these PRINT/SELECT statements before and after
troubleshooting problems.
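A procedure with such a parameter might look like the sketch below. The procedure name usp_CountTitlesByType is hypothetical; the titles table is from the pubs sample database.

```sql
CREATE PROC dbo.usp_CountTitlesByType   -- hypothetical procedure name
    @type char(12),
    @Debug bit = 0                      -- 1 = print debug information
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @cnt int

    SELECT @cnt = COUNT(*)
    FROM dbo.titles
    WHERE type = @type

    -- Debug output appears only when the caller asks for it.
    IF @Debug = 1
        PRINT 'Rows matching type: ' + CAST(@cnt AS varchar(12))

    SELECT @cnt AS TitleCount
END
GO

-- Normal call, then a call with debugging turned on.
EXEC dbo.usp_CountTitlesByType @type = 'business'
EXEC dbo.usp_CountTitlesByType @type = 'business', @Debug = 1
```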
Do not call
functions repeatedly within your stored
procedures, triggers, functions and batches. For
example, you might need the length of a string
variable in many places of your procedure, but
don't call the LEN function whenever it's needed,
instead, call the LEN function once, and store
the result in a variable, for later use.
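In its simplest form this looks like the following sketch; the variable names and the two checks are made up for illustration.

```sql
DECLARE @str varchar(100), @len int
SET @str = 'Database programming'

-- Call LEN once and keep the result.
SET @len = LEN(@str)

-- Reuse the stored value wherever the length is needed.
IF @len > 10
    PRINT 'Long string of ' + CAST(@len AS varchar(12)) + ' characters'

IF @len % 2 = 0
    PRINT 'Even number of characters'
```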
Make sure your
stored procedures always return a value
indicating the status. Standardize on the return
values of stored procedures for success and
failures. The RETURN statement is meant for
returning the execution status only, but not
data. If you need to return data, use OUTPUT parameters.
If your stored
procedure always returns a single row resultset,
consider returning the resultset using OUTPUT parameters instead of a
SELECT statement, as ADO handles output
parameters faster than resultsets returned by SELECT statements.
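The two points about RETURN and OUTPUT parameters can be combined as in the sketch below. The procedure name usp_GetTitlePrice and the status codes 0 and 1 are assumed conventions, not something SQL Server prescribes.

```sql
CREATE PROC dbo.usp_GetTitlePrice    -- hypothetical procedure name
    @title_id varchar(6),
    @title varchar(80) OUTPUT,
    @price money OUTPUT
AS
BEGIN
    SET NOCOUNT ON

    -- Single-row result handed back through OUTPUT parameters.
    SELECT @title = title, @price = price
    FROM dbo.titles
    WHERE title_id = @title_id

    IF @@ROWCOUNT = 0
        RETURN 1    -- assumed convention: 1 = title not found

    RETURN 0        -- assumed convention: 0 = success
END
GO

-- Caller: RETURN carries the status, OUTPUT parameters carry the data.
DECLARE @rc int, @t varchar(80), @p money
EXEC @rc = dbo.usp_GetTitlePrice 'BU1032', @t OUTPUT, @p OUTPUT
SELECT @rc AS ReturnStatus, @t AS Title, @p AS Price
```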
Always check the global variable @@ERROR immediately after executing a data manipulation statement (like INSERT/UPDATE/DELETE), so that you can roll back the transaction in case of an error (@@ERROR will be greater than 0 in case of an error). This is important because, by default, SQL Server will not roll back all the previous changes within a transaction if a particular statement fails. This behavior can be changed by executing SET XACT_ABORT ON. The @@ROWCOUNT variable also plays an important role in determining how many rows were affected by a previous data manipulation (or retrieval) statement, and based on that you could choose to commit or roll back a particular transaction.
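A minimal sketch of this pattern, assuming two made-up updates against the titles table of the pubs sample database:

```sql
DECLARE @err int

BEGIN TRAN

UPDATE dbo.titles
SET price = price * 1.1
WHERE type = 'business'

-- Capture @@ERROR right away; the next statement resets it.
SET @err = @@ERROR

-- Run the second statement only if the first one succeeded.
IF @err = 0
BEGIN
    UPDATE dbo.titles
    SET ytd_sales = 0
    WHERE type = 'business'

    SET @err = @@ERROR
END

-- Commit only when every statement succeeded.
IF @err = 0
    COMMIT TRAN
ELSE
    ROLLBACK TRAN
```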
To make SQL
Statements more readable, start each clause on a
new line and indent when needed. Following is an
example:
SELECT title_id, title
FROM titles
WHERE title LIKE 'Computing%' OR
      title LIKE 'Gardening%'
Though we survived Y2K, always store 4 digit years in dates (especially when using char or int datatype columns) instead of 2 digit years, to avoid any confusion and problems. This is not a problem with datetime columns, as the century is stored even if you specify a 2 digit year. But it's always a good practice to specify 4 digit years even with datetime datatype columns.
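In your queries and other SQL statements, always represent dates in the unseparated yyyymmdd format (for example, '20011025'). SQL Server interprets this format the same way no matter what the default language and date format settings on the server are, whereas separated forms such as yyyy/mm/dd can be misread under some settings. This also prevents the following error, while working with dates: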
Server: Msg 242, Level 16, State 3, Line 2
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.
As is true with
any other programming language, do not use GOTO or use it sparingly.
Excessive usage of GOTO can lead to
hard-to-read-and-understand code.
Do not forget to enforce unique constraints on your alternate keys.
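For example, a unique constraint on an alternate key could be declared as in the sketch below. The SocialSecurityNo column and the constraint name are hypothetical additions to an Employees table whose primary key is assumed to be EmpID.

```sql
-- EmpID is assumed to be the primary key; SocialSecurityNo is a
-- hypothetical alternate key that must also be unique.
ALTER TABLE dbo.Employees
ADD CONSTRAINT UQ_Employees_SocialSecurityNo UNIQUE (SocialSecurityNo)
```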
Always be consistent with the usage of case in your code. On a case insensitive server, your code might work fine, but it will fail on a case sensitive SQL Server if your code is not consistent in case. For example, if you create a table on a SQL Server or in a database that has a case-sensitive or binary sort order, all references to the table must use the same case that was specified in the CREATE TABLE statement. If you name the table 'MyTable' in the CREATE TABLE statement and use 'mytable' in the SELECT statement, you get an 'object not found' or 'invalid object name' error.
Though T-SQL has no concept of constants (like the ones in C language), variables will serve the same purpose. Using variables instead of constant values within your
SQL statements, improves readability and maintainability of your code. Consider the following example:
UPDATE dbo.Orders
SET OrderStatus = 5
WHERE OrdDate < '2001/10/25'
The same update statement can be re-written in a more readable form as shown below:
DECLARE @ORDER_PENDING int
SET @ORDER_PENDING = 5
UPDATE dbo.Orders
SET OrderStatus = @ORDER_PENDING
WHERE OrdDate < '2001/10/25'
Do not use column numbers in the ORDER BY clause, as this impairs the readability of the SQL statement. Further, changing the order of columns in the SELECT list has no impact on the ORDER BY when the columns are referred to by name instead of by number. Consider the following example, in which the second query is more readable than the first one:
SELECT OrderID, OrderDate
FROM Orders
ORDER BY 2
SELECT OrderID, OrderDate
FROM Orders
ORDER BY OrderDate
Well, this is all for
now folks. I'll keep updating this page as and when I
have something new to add. I welcome your feedback on
this, so feel free to email me. Happy database programming!