Change Data Capture
As we are looking through the new features in SQL Server 2008 we found a potentially interesting one called Change Data Capture. Can you give us a detailed explanation of how we go about using this one?
Solution
Change Data Capture is a new feature in SQL Server 2008 that records insert, update and delete activity in SQL Server tables. A good example of how this feature can be used is in performing periodic updates to a data warehouse. The requirement for the extract, transform, and load (ETL) process is to update the data warehouse with any data that has changed in the source systems since the last time the ETL process was run. Before CDC we might simply query a last updated DATETIME column in our source system tables to determine what rows have changed. While this is simple and pretty effective, it is of no use in determining any rows that were physically deleted. In addition we can't determine what was changed when; we can only access the current state of a row that has changed. CDC provides a configurable solution that addresses these requirements and more.
In this tip we are going to gain an understanding of CDC by walking through a simple code sample to demonstrate how to:
- Setup and configure CDC
- Use CDC to extract rows that have been inserted, updated, or deleted via T-SQL queries
Before we start reviewing the sample T-SQL code, let's discuss how CDC works at a high level. After performing some setup and configuration steps (which we will cover below), CDC will begin scanning the database transaction log for changes to certain tables that you specify, and will insert these changes into change tables. These change tables are created during the setup and configuration process. The setup and configuration process will also create table-valued functions which can be used to query for the changes. You use the table-valued functions in lieu of querying the underlying change tables directly. Based on this high level description, let's proceed to the demo.
The demo code below was only tested on the February, 2008 Community Technology Preview (CTP) of SQL Server 2008. Some of the function names and stored procedure names have changed from the earlier CTPs.
Setup and Configuration
CDC is a feature that must be enabled at the database level; it is disabled by default. To enable CDC you must be a member of the sysadmin fixed server role. You can enable CDC on any user database; you cannot enable it on system databases. Execute the following T-SQL script in the database of your choice to enable CDC:
declare @rc int exec @rc = sys.sp_cdc_enable_db select @rc -- new column added to sys.databases: is_cdc_enabled select name, is_cdc_enabled from sys.databases |
The sys.sp_cdc_enable_db stored procedure will return 0 if successful and 1 if it fails. You can query whether CDC is enabled for any database by checking the new column is_cdc_enabled in the sys.databases table. You will see a value of 1 if CDC is enabled, a 0 otherwise.
The next step is to specify a table that you want to enable for CDC. Let's create a simple table called customer:
create table dbo.customer |
Execute the following system stored procedure to enable CDC for the customer table:
exec sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'customer' , @role_name = 'CDCRole', @supports_net_changes = 1 select name, type, type_desc, is_tracked_by_cdc from sys.tables |
You must be a member of the db_owner fixed database role in order to execute the above system stored procedure and SQL Agent must be running. The sys.sp_cdc_enable_table system stored procedure has quite a few parameters; let's describe each one (only the first three parameters are required; the rest are optional and only the ones used are shown above):
- @source_schema is the schema name of the table that you want to enable for CDC
- @source_name is the table name that you want to enable for CDC
- @role_name is a database role which will be used to determine whether a user can access the CDC data; the role will be created if it doesn't exist. You can add users to this role as required; you only need to add users that aren't already members of the db_owner fixed database role.
- @supports_net_changes determines whether you can summarize multiple changes into a single change record; set to 1 to allow, 0 otherwise.
- @capture_instance is a name that you assign to this particular CDC instance; you can have up two instances for a given table.
- @index_name is the name of a unique index to use to identify rows in the source table; you can specify NULL if the source table has a primary key.
- @captured_column_list is a comma-separated list of column names that you want to enable for CDC; you can specify NULL to enable all columns.
- @filegroup_name allows you to specify the FILEGROUP to be used to store the CDC change tables.
- @partition_switch allows you to specify whether the ALTER TABLE SWITCH PARTITION command is allowed; i.e. allowing you to enable partitioning (TRUE or FALSE).
There is a new column named is_tracked_by_cdc in sys.tables; you can query it to determine whether CDC is enabled for a table.
Enabling CDC at the database and table levels will create certain tables, jobs, stored procedures and functions in the CDC-enabled database. These objects will be created in a schema named cdc and a cdc user is also created. You will see a message that two SQL Agent jobs were created; e.g. cdc.sql2008demo_capture which scans the database transaction log to pickup changes to the tables that have CDC enabled, and cdc.sql2008demo_cleanup which purges the change tables periodically. The naming convention is cdc.[databasename]_task. We will discuss some of the schema objects created as we use them in the demo that follows. You can examine the schema objects created by running the following T-SQL script:
select o.name, o.type, o.type_desc from sys.objects o join sys.schemas s on s.schema_id = o.schema_id where s.name = 'cdc' |
You can disable CDC on a particular table by executing the following T-SQL script:
exec sys.sp_cdc_disable_table @source_schema = 'dbo', @source_name = 'customer', @capture_instance = 'dbo_customer' -- or 'all' |
You can disable CDC at the database level by executing the following T-SQL script:
declare @rc int exec @rc = sys.sp_cdc_disable_db select @rc -- show databases and their CDC setting select name, is_cdc_enabled from sys.databases |
Disabling CDC at the table and/or database level will drop the respective tables, jobs, stored procedures and functions that were created in the database when CDC was enabled.
Now that we have enabled CDC for a database and a table, let's proceed to a demo.
CDC Demo
We will perform the following steps to demonstrate the CDC functionality:
- Perform a couple of inserts, update, and deletes to the customer table
- Show T-SQL code samples to query the changes
Execute the following T-SQL script to perform some inserts, an update, and a delete on the customer table:
insert customer values ('abc company', 'md') insert customer values ('xyz company', 'de') insert customer values ('xox company', 'va') update customer set state = 'pa' where id = 1 delete from customer where id = 3 |
Now let's take a look at a query that will show us a record of the above changes:
declare @begin_lsn binary(10), @end_lsn binary(10) -- get the first LSN for customer changes select @begin_lsn = sys.fn_cdc_get_min_lsn('dbo_customer') -- get the last LSN for customer changes select @end_lsn = sys.fn_cdc_get_max_lsn() -- get net changes; group changes in the range by the pk select * from cdc.fn_cdc_get_net_changes_dbo_customer( @begin_lsn, @end_lsn, 'all'); -- get individual changes in the range select * from cdc.fn_cdc_get_all_changes_dbo_customer( @begin_lsn, @end_lsn, 'all'); |
To extract the changes for a table that has CDC enabled, you have to supply the relevant LSNs. An LSN is a log sequence number that uniquely identifies entries in the database transaction log. If this is the first time you are querying to extract changes, you can get the minimum LSN and the maximum LSN using the functions sys.fn_cdc_get_min_lsn() and sys.fn_cdc_get_max_lsn(). If you set @supports_net_changes = 1 when enabling CDC on the table, you can query for the net changes using cdc.fn_cdc_get_net_changes_dbo_customer(). This will group multiple changes to a row based on the primary key or unique index you specified when enabling CDC. You can always invoke cdc.fn_cdc_get_all_changes_dbo_customer() to retrieve every change to the table within the LSN range. The dbo_customer portion of the function name is the capture instance; this is the default - schema_tablename. Here's the output from the above query:
The first result set shows the net changes; the second result set shows the individual changes. Since there was an insert and a delete, the first result set doesn't show that row since it was added and deleted in the LSN range; i.e. it no longer exists when you group the changes. The __$operation column values are: 1 = delete, 2 = insert, 3 = update (values before update), 4 = update (values after update). To see the values before update you must pass 'all update old' to the cdc.fn_cdc_get_all_changes_dbo_customer() function. The __$update_mask column is a bit mask column that identifies the columns that changed. For __$operation = 1 or 2, all columns are indicated as changed. For __$operation = 3 or 4, the actual columns that changed are indicated. The columns are mapped to bits based on the column_ordinal; execute the stored procedure sys.sp_cdc_get_captured_columns passing the capture instance as a parameter to see the column_ordinal values; for example:
Let's extend this example to handle periodically extracting changed rows. We will add a new table to log the ending LSN and a new function to retrieve the ending LSN from the table. This will allow us to pick up just what changed since the last time we ran our ETL process.
create table dbo.customer_lsn ( last_lsn binary(10) ) create function dbo.get_last_customer_lsn() returns binary(10) as begin declare @last_lsn binary(10) select @last_lsn = last_lsn from dbo.customer_lsn select @last_lsn = isnull(@last_lsn, sys.fn_cdc_get_min_lsn('dbo_customer')) return @last_lsn end |
We'll modify the code above adding a call to the get_last_customer_lsn() function and an insert or update to save the ending LSN:
declare @begin_lsn binary(10), @end_lsn binary(10) -- get the next LSN for customer changes select @begin_lsn = dbo.get_last_customer_lsn() -- get the last LSN for customer changes select @end_lsn = sys.fn_cdc_get_max_lsn() -- get the net changes; group all changes in the range by the pk select * from cdc.fn_cdc_get_net_changes_dbo_customer( @begin_lsn, @end_lsn, 'all'); -- get all individual changes in the range select * from cdc.fn_cdc_get_all_changes_dbo_customer( @begin_lsn, @end_lsn, 'all'); -- save the end_lsn in the customer_lsn table update dbo.customer_lsn set last_lsn = @end_lsn if @@ROWCOUNT = 0 insert into dbo.customer_lsn values(@end_lsn) |
After running the above query once, insert a row, then run the query again. Your output will look like this, showing only a single row that has changed:
0 comments:
Post a Comment