Wednesday, January 16, 2013

Database Guidelines (RDBMS/SQL)

This is part of the blog series about SOA software guidelines. For the complete list of guidelines (about design, security, performance, operations, database, coding, versioning, i.a.) please refer to: http://soa-java.blogspot.nl/2012/09/soa-software-development-guidelines.html


Design

Schema

  • Normalization is useful to avoid redundancy, consistency problems (update anomalies), and deletion & insertion anomalies, so first try to normalize at least to the 3rd normal form. You might need to trade off normalization (fewer duplicates, thus faster updates, but more joins) against denormalization (fewer tables, thus fewer joins and faster reads with indices; but with denormalization you need to guard the consistency between the duplicates, e.g. using a trigger, and this extra process costs performance on inserts/updates). In principle always start with normalization (up to the 3rd or 5th normal form), then denormalize if necessary.
  • Be aware of costly relationships in your data model: simplify the relationships between entities. The relationships affect how easy it is to split/shard the database.
  • Reduce the size of tables: you might use vertical partitioning (move infrequently used columns to another table, at the cost of a join) or horizontal partitioning (e.g. move old/infrequently used rows to another table, or split customers beginning with A-M and N-Z into 2 tables).
  • Make sure data/parameter types are correct (e.g. localization format, null usage).
  • Use standard types; avoid custom types if possible.
  • Date data type conventions differ considerably across vendors and database versions, so you might consider saving dates as strings instead of a date data type to avoid conversion headaches during integration between different databases/applications.
  • Avoid overly abstract objects since they are difficult to understand and can result in too many self-joins.
  • Use referential integrity checks to prevent operations with null/dangling foreign keys.
  • Use a unique constraint to guarantee the uniqueness of the primary key.
  • Natural key vs artificial key (e.g. surrogate key/GUID): prefer the natural key since it's more meaningful and avoids duplication (it reuses an existing column). But there are cases when you need an artificial key: when the natural key is not unique (e.g. names) or if you need to change the value.
  • Use a defined sequence object if you need to generate sequences (e.g. to generate primary keys during insert).
  • Minimize the use of null, for example by enforcing NOT NULL constraints so the application can only send non-null values. You can also enforce default values (e.g. with DDL). A DDL sketch combining these constraints follows this list.
  • Each table must have a primary key.
  • Consider whether to use Unicode data types (e.g. NCHAR, NVARCHAR, NTEXT) if you need to support internationalization, but Unicode will take twice as much space.
  • Provide documentation: describe all tables & relationships, and the DDL. The application programmers should be able to find documentation about any triggers, constraints & stored procedures in your database.
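
A minimal DDL sketch combining several of the guidelines above (primary key, NOT NULL, default values, unique and foreign-key constraints, and a sequence for key generation). The table and constraint names are hypothetical, Oracle syntax is used, and an HRM_Department table is assumed to exist:

    CREATE SEQUENCE HRM_Employee_seq START WITH 1 INCREMENT BY 1;

    CREATE TABLE HRM_Employee (
      id          NUMBER        NOT NULL,            -- every table gets a primary key
      name        VARCHAR2(100) NOT NULL,            -- minimize nulls
      email       VARCHAR2(100) NOT NULL,
      dept_id     NUMBER        NOT NULL,
      created_on  DATE          DEFAULT SYSDATE,     -- enforce a default value in DDL
      CONSTRAINT HRM_Employee_pk PRIMARY KEY (id),
      CONSTRAINT HRM_Employee_Email_unc UNIQUE (email),
      CONSTRAINT HRM_Employee_Department_fk FOREIGN KEY (dept_id)
        REFERENCES HRM_Department (id)               -- referential integrity check
    );

    -- use the sequence to generate the primary key during insert
    INSERT INTO HRM_Employee (id, name, email, dept_id)
    VALUES (HRM_Employee_seq.NEXTVAL, 'Sandeep', 'sandeep@example.com', 10);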

Index

  • How to choose which columns to index:
    • The primary and foreign key columns are candidates for an index.
    • Choose columns which are not frequently updated, since an index will degrade the performance of DML (update, insert, delete) operations (due to index maintenance).
    • Choose columns which are used in order by / group by clauses, since indexes order the data to make these statements faster.
    • If you use a range in the where clause (e.g. between), use these range columns when making a clustered index.
    • Use highly selective / high-cardinality columns.
    • Order matters when creating composite indexes: create highly selective indexes by using the most restrictive columns first (see the sketch after this list).
  • Load frequently used indexes into memory (e.g. the Oracle keep pool).
  • Avoid over-indexing:
    • remove unused indexes
    • for small tables you don't need an index
    • if more than 10 percent of the table must be examined, a full table scan may be better than an index
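
A sketch of a composite index following the ordering guideline above, assuming (hypothetically) that employee_id is more selective than work_date:

    -- the most restrictive column goes first
    CREATE INDEX HRM_Timesheet_EmpDate_idx
      ON HRM_Timesheet (employee_id, work_date);

    -- this index serves queries filtering on employee_id alone,
    -- or on employee_id and work_date together
    SELECT hours FROM HRM_Timesheet
    WHERE  employee_id = 42 AND work_date BETWEEN DATE '2013-01-01' AND DATE '2013-01-31';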

View

Consider using views in these cases:
  • To limit the columns/tables that a user can see due to security/privacy concerns. You might restrict the view query to read only and limit the rows returned (see the sketch after this list).
  • To hide a complex join, to represent a combination of tables.
  • To give meaningful names.
  • To absorb changes: when the data model in the database changes, the application can still use the same query if we modify the view definition.
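
A sketch of a view that hides a join, restricts the visible columns for privacy, and gives a meaningful name (table and column names are hypothetical):

    CREATE VIEW HRM_EmployeeDirectory_vw AS
      SELECT e.name, e.email, d.dept_name      -- salary and DOB stay hidden
      FROM   HRM_Employee e
      JOIN   HRM_Department d ON d.id = e.dept_id;

    -- the application queries the view, not the underlying tables;
    -- if the data model changes, only the view definition is modified
    GRANT SELECT ON HRM_EmployeeDirectory_vw TO app_user;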

Naming convention:

  • Use meaningful names for database objects (tables, constraints, indexes, stored procedures, triggers). If generated objects (e.g. constraints, indexes) have non-meaningful names, try to rename them (see the sketch after this list).
  • Avoid special characters (including spaces) in names. Don't use double quotes around names (double quotes are needed for names with special characters/spaces/case sensitivity); double-quoted and spaced names are unhandy and less readable.
  • Do not use SQL keywords as names of database objects.
  • Limit the name length (e.g. 18 chars).
  • A column name should be unique within its table.
  • If a column refers to another table's column, use the suffix fk (foreign key).
  • Use the application name/domain/department/datasource as a prefix (e.g. MyCompanyCRM_tablename); use the prefix "General" if the object is not application specific.
  • Use common words (e.g. Products, Employee). Use standard name parts: ID, DOB (date of birth), etc., e.g. the trigger MyCompanyCRM_EmployeeTable_CheckDOB_tr.
  • Use suffixes for different database objects:
    • primary key constraint: [domain]_[tablename]_pk, e.g. StudentRegApp_Student_pk
    • foreign key constraint: [domain]_[table1]_[table2]_fk, e.g. HRM_Employee_Address_fk
    • trigger: [domain]_[tablename]_[action]_tr, e.g. HrmLdap_Employee_generateNewEmployeeAQEvent_tr
    • index: [domain]_[tablename]_[columns]_idx, e.g. Library_Books_ISBNTitleAuthor_idx
    • unique constraint: [domain]_[tablename]_[column]_unc, e.g. Library_Books_ISBN_unc
    • check constraint: [domain]_[tablename]_[action]_chk, e.g. Finance_Employee_checkSalaryLevel_chk
    • view: [domain]_[viewname]_vw
    • association table: [domain]_[table1]_[table2]_lnk, e.g. HRM_Employees_Projects_lnk
    • lookup table: [domain]_[table]_lku, e.g. General_CountryCodes_lku
  • Suffixes for scripts/stored procedures:
    • stored procedures: [domain]_[action]_sp.sql, e.g. Finance_computeBonus_sp.sql
    • DDL/schema scripts: [domain]_[action]_ddl.sql, e.g. HRM_createEmployeeTable_ddl.sql
    • backup scripts: [domain]_[action]_bac.sql
    • restore/recovery scripts: [domain]_[action]_rcv.sql
    • rollback scripts: [domain]_[action]_rbk.sql
    • conversion scripts: [domain]_[object1]_[object2]_conv.sql, e.g. Finance_Salary_Bonus_conv.sql
    • migration scripts: [domain]_[object]_[env1]_[env2]_mig.sql, e.g. HRM_EmployeeTable_Oracle10_Oracle11_mig.sql
    • test scripts: [domain]_[action]_tst.sql
    • master scripts (that integrate/call different scripts): [domain]_[action]_mst.sql
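
A sketch of renaming system-generated objects to the convention above (Oracle syntax; the generated names SYS_C0042/SYS_C0043 are hypothetical):

    -- rename a generated check constraint to a meaningful name
    ALTER TABLE Finance_Employee
      RENAME CONSTRAINT SYS_C0042 TO Finance_Employee_checkSalaryLevel_chk;

    -- rename a generated index likewise
    ALTER INDEX SYS_C0043 RENAME TO Finance_Employee_Salary_idx;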

Query

  • Simplify the SQL statements.
  • Minimize/limit the amount of data: don't use SELECT *, use TOP/LIMIT.
  • Use prepared statements for protection against SQL injection and for better performance.
  • For robustness in case the table structure changes, don't use "insert everything"; name the columns, e.g. INSERT INTO employee (name, telephone) VALUES ('Sandeep', '077234567') instead of INSERT INTO employee VALUES ('Sandeep', '077234567').
  • Use check constraints to prevent unnecessary reads.
  • Avoid operators like LIKE, EXISTS, or wildcards since they cause expensive table scans.
  • Avoid functions in the where clause: they make the query complicated and can confuse the optimizer.
  • Don't use the schema name in the query since it will hinder portability between different users/databases.
  • Use synonyms to simplify queries (especially if you use database links).
  • When the resultset is not needed, use syntax that doesn't return a resultset, for better performance, e.g. use select 1 or count(1) instead of count(*).
  • Prefer ANSI join syntax since it's more readable (see the sketch after this list).
  • Avoid transformations in the query, e.g. use WHERE Baan.name = Peoplesoft.name instead of WHERE AFUNC(Baan.name, bonus)+1 = AFUNC(Peoplesoft.name, bonus)+1.
  • Use upper case for SQL reserved words, for readability.
  • Use UNION ALL instead of UNION if redundancy is permitted, since the duplicate removal in UNION costs performance.
  • Don't use RETURNING clauses (e.g. with INSERT) if they're not necessary, to reduce communication overhead.
  • Reduce the number of subqueries by combining the WHERE clauses.
  • Use an SQL coding standard, e.g. http://www.nyx.net/~bwunder/dbChangeControl/standard.htm
  • Include the primary key when you use UPDATE/DELETE statements.
  • If more than one table is involved, use aliases, especially if the fully qualified names are long.
  • Join:
    • Avoid Cartesian products. If you need to use distinct, investigate the possibility of a Cartesian product.
    • Think carefully about the type of join to apply (e.g. outer joins are more expensive & return more rows than inner joins, so consider avoiding outer joins by redesigning your schema).
    • The order of joins can influence performance (e.g. the optimizer can fail to optimize the join order in a complex/long query).
    • Prefer joins on columns of similar data types.
    • Prefer joins on numeric columns instead of chars.
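
A sketch contrasting the older comma-style join with the more readable ANSI join syntax (table and column names are hypothetical):

    -- old style: the join condition is buried in the WHERE clause
    SELECT e.name, d.dept_name
    FROM   HRM_Employee e, HRM_Department d
    WHERE  e.dept_id = d.id AND d.region = 'EU';

    -- ANSI style: the join condition is explicit, filters stay in WHERE
    SELECT e.name, d.dept_name
    FROM   HRM_Employee e
    JOIN   HRM_Department d ON e.dept_id = d.id
    WHERE  d.region = 'EU';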





Performance & Scalability

  • Use fewer resources (e.g. don't use select *, reduce the number of tables involved in join operations).
  • Use indexes. Make sure your queries do index seeks instead of table scans or index scans. A clustered index is faster than a non-clustered one.
  • Minimize tables, columns, rows. Only save data which is necessary. Use views to reduce the tables & columns exposed from external applications' tables. Use summary tables (e.g. materialized views) to avoid accessing big tables. Use a materialized view for a view that does not change frequently.
  • Use derived columns/tables to precompute the (aggregate) functions; use a trigger to update the derived table.
  • Avoid too much normalization. Denormalize to reduce joins.
  • Update incrementally / page big results (see the paging sketch after this list).
  • Partition for parallel access, e.g. split customers beginning with A-M and N-Z into 2 tables, or put tables in different storages. Distribute datafiles/tablespaces over different physical disks. Put the index table on a different disk than the data table.
  • Reuse statements in a broader scope (e.g. at session level instead of per method) instead of releasing and recreating them frequently. This reduces parser and optimizer work.
  • Vertical partitioning: create a smaller table which can be retrieved faster, e.g. split 10 columns into a 4-column table for the frequently used columns, and the rest into a 6-column table which needs to be joined for the less frequent operations.
  • Use throttling.
  • Prevent the transaction log from becoming nearly full (by purging/truncating the changes), for performance and database availability.
  • Keep transaction logs & backups on a separate disk from the datafiles, for better performance (concurrent access) & reliability.
  • Don't save everything in the database. A database is suitable for data that you need to search and that has simple values (e.g. numbers, limited-length strings). Avoid storing big BLOBs (e.g. films, mp3s) in the database; instead you can save only the titles/descriptions (thus searchable via query) and links to the URLs/files in a storage, for example.
  • Avoid ad hoc queries (queries provided by users) since they tend to have poor performance (not well developed & tested).
  • Avoid dynamic queries (queries which are not compiled).
  • Limit hard-coded hints; in most cases the optimizer does a good job.
  • If you have a high-think usage pattern (small transactions with thinking pauses, e.g. form submissions with small data) you might consider using a shared server to share resources across multiple connections, thus improving scalability. But for a high-traffic, big-data usage pattern it's better to use dedicated processes.
  • Use fully qualified names to minimize the overhead of name resolution.
  • Use performance tests & profiling to find the bottleneck and to see which query statements cost much / are used frequently. You don't have time to tune everything, so you need to know which queries should get priority for tuning.
  • Measure query performance (e.g. load test, profiling): what is the impact on the performance of the whole system, does the performance conform to the SLA, and what is the impact on other applications if you share the database/virtual machine with them (e.g. your application might exhaust the connection pools so that other applications hang)?
  • About performance metrics: the more appropriate performance metrics are in terms of computation counts (e.g. CPU cycles, number of read operations) rather than execution time (which depends on the hardware and on contending CPU/disk operations during the test). Nevertheless, the performance metrics commonly used in SLAs are in terms of time (e.g. web application response time); therefore ideally we use both operation counts & time as performance metrics (e.g. as reported by the tkprof tool in an Oracle database).
  • The test environment where you do the performance test should reflect the hardware/load conditions under which the system will be deployed in production. It's useless to do a performance test on your racing machine if your client will deploy your product on a machine which is much slower than your test machines.
  • Test the time to complete offline operations (e.g. OLAP/ETL bulk jobs scheduled every night). Is the processing time scalable? What to do if the bulk operation hasn't finished by 8.00/working hours?
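
A sketch of paging a big result set instead of returning everything at once, using the ANSI OFFSET/FETCH syntax (supported by e.g. Oracle 12c+ and SQL Server 2012+; older versions need ROWNUM/TOP constructs):

    -- fetch page 3 with a page size of 50, with a stable ORDER BY
    SELECT id, name
    FROM   HRM_Employee
    ORDER  BY id
    OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY;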
Sources of poorly performing queries:

  • The query is not efficient (remedy: rewrite the query, e.g. avoid select *, avoid ad hoc queries).
  • Poor logical & physical database design (remedy: (de)normalization, indexing, data partitioning over multiple physical disks).
  • Inadequate maintenance (remedy: clean up/shrink the database periodically, maintain the database size by moving old data to another lower-grade database).
  • Ineffective workload management (remedy: schedule the OLAP so it doesn't interfere with OLTP during working hours; capacity planning/anticipating peak loads during certain events, e.g. Christmas sales).
  • The optimizer failed to generate an efficient execution plan (remedy: simplify the query so that the optimizer can understand it better, give hints, avoid cursors, regenerate statistics).
  • Hardware limitations (remedy: better hardware, more memory).

Concurrency & Transaction

  • Minimize lock time: acquire locks late, release them early. Avoid long transactions / triggers with long actions. Do not wait for input in the middle of a transaction. Precompute data before the transaction to avoid long operations during the transaction.
  • Trade off the granularity of locks (e.g. whole table vs limited rows). Smaller granularity reduces locking problems but can increase the risk of data inconsistency.
  • Trade off the isolation levels: read consistency vs performance. If your application only needs a rough estimate, you can use lower isolation levels for better performance.
  • Use profiling / tracing to make sure that there is no efficiency loss due to locking problems (caused by non-optimal explicit locks in code or implicit locks decided by the optimizer) and adjust explicitly (e.g. the granularity of locks) as necessary to increase throughput.
  • Reuse statements in a broader transaction scope (thus reducing the number of transactions) and combine statements from different transactions to reduce the chance of waiting on locks between different transactions and sessions.
  • To avoid deadlock: access tables in the same order in all stored procedures & triggers (see the sketch after this list).
  • Choose a recovery strategy, e.g. set a deadlock priority / deadlock victim, use a timeout to cancel deadlocked transactions.
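
A sketch of row-level locking with a timeout, illustrating "acquire late, release early" and consistent table ordering (Oracle syntax; table names are hypothetical):

    -- every transaction locks Finance_Accounts before Finance_Orders (same order
    -- everywhere), and WAIT 5 bounds how long we block instead of waiting forever
    SELECT balance FROM Finance_Accounts
    WHERE  id = 42 FOR UPDATE WAIT 5;

    UPDATE Finance_Accounts SET balance = balance - 100 WHERE id = 42;
    UPDATE Finance_Orders   SET status  = 'PAID'        WHERE account_id = 42;
    COMMIT;  -- release the locks as early as possible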

Fault prevention

  • Apply business rule validation (e.g. compare the request with the average size to detect request anomalies, verify time/sequence constraints). You can apply these checks in the application or using triggers or check constraints.
  • Enforce constraints (e.g. null constraints, unique constraints, foreign-key constraints, check constraints).
  • Use an "allowed values" lookup table to prevent invalid data (see the sketch after this list).

Availability

  • Schedule backups regularly.
  • Save a backup copy (e.g. the redo log) at a different physical site (e.g. in case a fire destroys the building), but not too far from the production site (to reduce recovery time), or in cloud storage.

Operational & configurations

  • Read the vendor's best practices / recommendation documents.



  • Monitor the database regularly (e.g. run some data dictionary scripts for a summary) to proactively prevent problems. Watch the problem trend; if a pattern of problems emerges often, investigate the causes (e.g. an application that didn't implement graceful exception handling, or a network/firewall configuration problem).
  • Have the application use a single identity to connect to the database, so that connection pooling is possible, to increase scalability.
  • OLTP operations tend to have small transactions and demand quick responses, and OLTP disk access tends to be random (it may thus benefit from storage grids). OLAP operations, on the other hand, tend to be long and resource intensive, with more contention risk, and OLAP disk access tends to be sequential. To avoid OLAP blocking OLTP it's advisable to separate OLTP & OLAP processing, i.e. use a dedicated server only for OLAP operations or defer OLAP operations to nightly/less intensive periods.

Stored procedures

The advantages of stored procedures:
  • reduce chatty communication with the database, thus increasing performance
  • the execution plans of precompiled stored procedures are often reused, thus reducing overhead
  • can be more secure (the database admin can fine-tune the ACL control to database objects)
  • you can implement data integrity rules/constraints at the database level (thus more consistent & reusable across all applications)
  • some people implement business logic in the database instead of the application in order to hide the formula from the application programmer (due to secrecy or to hide complexity)

Best practices:
  • Minimize the use of cursors, for these reasons:
    • Cursors tend to be slow. To improve performance, try to replace nested cursor loops with joins, or at least reduce the nested loops to a single flat loop (see the sketch after this list). The optimizer can deal better with descriptive SQL set operations than with a procedural cursor.
    • A cursor is more procedural (focused on how) than the set-based approach (focused on what). The procedural approach (processing row by row) is considered an old approach; a preferred alternative is using an ORM (focus on the objects instead of processing the rows).
  • Avoid business logic in stored procedures; prefer to put business logic in the application layer rather than in the database, to honor the principle of separating application logic and data source.
  • Defensive programming: test before executing (e.g. CURSOR%NOTFOUND in Oracle PL/SQL), handle exceptions gracefully (e.g. catch the NO_DATA_FOUND exception in Oracle PL/SQL).
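
A sketch of replacing a row-by-row cursor loop with a single set-based statement (Oracle PL/SQL; table names are hypothetical):

    -- procedural: a cursor loop updating one row at a time
    BEGIN
      FOR r IN (SELECT id FROM Finance_Employee WHERE dept_id = 10) LOOP
        UPDATE Finance_Salary SET bonus = 100 WHERE employee_id = r.id;
      END LOOP;
    END;
    /

    -- set-based: one statement the optimizer can handle as a whole
    UPDATE Finance_Salary
    SET    bonus = 100
    WHERE  employee_id IN (SELECT id FROM Finance_Employee WHERE dept_id = 10);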

SQL Scripting

  • Strive for generic SQL scripts (that are portable across vendors): use standard SQL if possible, avoid schema names in the queries.
  • When a script is generated, remove the irrelevant create parameters (e.g. storage) to simplify the script.
  • Log (the shell output, warnings, error messages) to a file when you execute the script (e.g. use spool in Oracle; see the sketch after this list).
  • End every statement with ;
  • Commit at the end of the script.
  • Scripts should have meaningful names (e.g. applicationName_version).
  • For sensitive operations you might encrypt your scripts (e.g. using the Oracle wrap utility).
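
A minimal SQL*Plus script skeleton following the points above (spool logging, statements ending with ;, commit at the end); the file names are hypothetical:

    -- HRM_createEmployeeTable_ddl.sql
    SPOOL HRM_createEmployeeTable_ddl.log

    CREATE TABLE HRM_Employee (
      id   NUMBER PRIMARY KEY,
      name VARCHAR2(100) NOT NULL
    );

    INSERT INTO General_CountryCodes_lku (code) VALUES ('NL');

    COMMIT;
    SPOOL OFF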

Trigger

  • Triggers can be useful for business rule enforcement, security checking, data integrity checking, derived table updates, and audit/logging.
  • Use triggers sparingly, to avoid complex dependencies that hinder maintainability.
  • For maintainability & performance, keep triggers simple: avoid recursive triggers and cascading triggers. Avoid big trigger bodies; you can write & test the actions in a separate stored procedure and then call that procedure from the trigger (see the sketch after this list). This also improves performance, since the procedure is compiled.
  • Don't use a trigger if you can use standard database features for the same goal (e.g. integrity constraints).
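
A sketch of a thin trigger that delegates its work to a separately tested stored procedure (Oracle PL/SQL; names are hypothetical):

    CREATE OR REPLACE PROCEDURE HRM_logSalaryChange_sp (
      p_employee_id IN NUMBER, p_old IN NUMBER, p_new IN NUMBER
    ) AS
    BEGIN
      INSERT INTO HRM_SalaryAudit (employee_id, old_salary, new_salary, changed_on)
      VALUES (p_employee_id, p_old, p_new, SYSDATE);
    END;
    /

    CREATE OR REPLACE TRIGGER HRM_Employee_logSalaryChange_tr
    AFTER UPDATE OF salary ON HRM_Employee
    FOR EACH ROW
    BEGIN
      -- keep the trigger body trivial: just call the compiled procedure
      HRM_logSalaryChange_sp(:NEW.id, :OLD.salary, :NEW.salary);
    END;
    /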




Versioning the database

Different ways of versioning a database:

  • Generate the DDL scripts and store them in version control (e.g. SVN). Other scripts that are also worth storing in version control: scripts to populate initial data (e.g. registry entries, lookup tables such as country codes), export/import, migration, and conversion between different applications & versions. If you have multiple script files, make a master script that calls those scripts in the right dependency order (see the sketch below). This self-maintained approach costs effort and is error prone. Often the only reliable way to migrate is to drop the database and create the new one using the scripts from version control, and this process can cost downtime.
  • Use a database tool (e.g. Oracle Change Manager) that tracks the changes and generates migration scripts from a diff operation.
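
A sketch of a master script that calls the other scripts in dependency order (SQL*Plus @ syntax; the file names are hypothetical):

    -- HRM_install_mst.sql
    SPOOL HRM_install_mst.log
    @HRM_createEmployeeTable_ddl.sql       -- schema first
    @General_CountryCodes_lku_data.sql     -- then lookup/initial data
    @Finance_computeBonus_sp.sql           -- then stored procedures
    @HRM_Employee_logSalaryChange_tr.sql   -- triggers last
    SPOOL OFF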



Oracle specific:

  • For recoverability, keep multiple copies of the control files & redo logs (using the Oracle multiplexing feature or other file/disk replication utilities).
  • Use the flash recovery area to centralize recovery files, for better manageability.
  • Use explain plan to measure the query performance (after the optimizer) & to detect full table scans.
  • Use trace sparingly, only if you fail to debug the problem with logs. Turn off tracing in production, since tracing costs processing overhead and file size.
  • Use archivelog mode to enable recovery from archived redo logs.
  • Use a temporary tablespace to speed up sort operations (e.g. group by, order by, index creation).
  • Use locally managed tablespaces so Oracle can check the local file headers instead of querying the data dictionary.
  • If you prioritize performance you can use the "nologging" parameter to avoid redo log overhead, but without the redo log you risk recoverability.
  • You can use V$SYSSTAT to monitor workload and V$SQLAREA to monitor performance.
  • Monitor security by enabling audit, e.g. AUDIT CREATE SESSION; then you can investigate your (past & current) database sessions using the dba_audit_session view. You can monitor the current sessions using the v$session view.
  • Monitor changes by enabling audit, e.g. with AUDIT TABLE; then you can use the dba_audit_object view to monitor the changes to your tables.
  • Parallel execution can speed up your PL/SQL (using PIPELINED PARALLEL_ENABLE).
  • Caching can speed up your PL/SQL (using the RESULT_CACHE hint).
  • About regenerating statistics for the optimizer: for volatile data the statistics can be inconsistent, so performance can be poorer. For stable data it's not very useful to regenerate statistics, since the statistics will not change much anyway.
  • In Oracle Enterprise Edition you can use the SQL Tuning Advisor to generate tuning recommendations (use dbms_sqltune.execute_tuning_task).
  • Oracle offers hash clusters as a (faster) alternative to indexes (CREATE CLUSTER ... HASHKEYS ...). Oracle computes a hash signature for each record, then clusters the related rows together to reduce the number of cluster blocks to visit (see http://www.dba-oracle.com/t_sorted_hash_clusters.htm).
  • To implement user-dependent database access levels from your application without giving the user the ability to access the database directly (e.g. to query/edit credit card data), you can use a password-protected role (SET ROLE rolename IDENTIFIED BY password) and then assign a user this role. When a user logs in to your application, the application can pass the username to the database along with the role password (which is encoded in the application, hidden from the user). The application can then query/update the data without revealing all query results (e.g. credit card data) to the user, and the user cannot access the database directly (e.g. using sqlplus) without the application.
  • Use tablename.columnname%TYPE in your PL/SQL, so when the table column type changes you don't have to change the types in the PL/SQL (see the sketch after this list).
  • Before opening a cursor, check whether the cursor is already open. Always close cursors (also in the exception blocks).
  • When using autotrace with big results, use traceonly to avoid displaying the large result set, since we're only interested in the trace information.
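
A sketch of %TYPE anchoring and defensive cursor handling, covering the last few points above (Oracle PL/SQL; names are hypothetical):

    DECLARE
      -- v_salary follows the column type; a column change needs no PL/SQL edit
      v_salary HRM_Employee.salary%TYPE;
      CURSOR c_emp IS SELECT salary FROM HRM_Employee WHERE dept_id = 10;
    BEGIN
      OPEN c_emp;
      FETCH c_emp INTO v_salary;
      IF c_emp%NOTFOUND THEN           -- test before using the result
        DBMS_OUTPUT.PUT_LINE('no rows found');
      END IF;
      CLOSE c_emp;
    EXCEPTION
      WHEN OTHERS THEN
        IF c_emp%ISOPEN THEN CLOSE c_emp; END IF;  -- also close in exception blocks
        RAISE;
    END;
    /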




Further reading

For more tips about performance, scalability, transactions, operations, monitoring & availability please refer to: http://soa-java.blogspot.nl/2012/09/soa-software-development-guidelines.html

Source: Steve's blogs http://soa-java.blogspot.com/


Any comments are welcome :)


References:

  • Len Silverston wrote a series of good books about data model patterns; volume 3 generalizes application-specific patterns to universal patterns.
  • http://www.simple-talk.com/sql/database-administration/ten-common-database-design-mistakes/
  • http://www.sql-performance-tuning.com/oracle-sql-performance-tuning/
  • Database Specifications Checklist - HUD
  • http://www.nyx.net/~bwunder/dbChangeControl/standard.htm
  • http://www.ovaistariq.net/199/databases-normalization-or-denormalization-which-is-the-better-technique/

