Data Quality Services (DQS)

What is Data Quality Services (DQS)?


DQS is a new feature in SQL Server "Denali" that provides us with a knowledge-driven data cleansing solution.

We deal with data, and there is a high possibility that the data will be incorrect; maybe the user has entered wrong data, or the data was corrupted in transmission. Integrating such inconsistent data from various sources is also a problem, and it is time consuming, which affects the business. The worst victims in such situations are the customers, which in turn hurts the company through loss of credibility, lost revenue, customer dissatisfaction, and so on. Manual processing is not a good choice either, as it is prone to error no matter how meticulously it is carried out, and automation will not give a 100% result. Ultimately the bad data ends up in the historical record and, as a consequence, affects data analysis and reporting.

In order to rescue the business from such situations, DQS is the need of the day. It ensures high-quality data, improves accuracy and data consistency, and resolves problems caused by bad data entry in BI, data warehouse, or OLTP systems.

It helps a business user or a non-database professional to create, maintain, and execute their organization's data quality operations with minimal setup or preparation time and with excellent quality.

It improves data quality by creating a Knowledge Base (KB) about the data and then cleansing the data based on the knowledge in the knowledge base.

A Knowledge Base (KB) is a repository of three types of knowledge:

1. Out of the box knowledge

2. Knowledge generated by the DQS server

3. Knowledge generated by the user

Once built, this knowledge base serves as a reusable asset that can be continuously improved and then applied in multiple data-quality improvement processes.

The KB identifies incorrect data, proposes changes to the data, finds data matches and mismatches, and enables data de-duplication. It can also compare source data with cloud-based reference data provided by data quality providers.

The data steward or IT professional verifies both the knowledge in the knowledge base and the changes to be made to the data, and executes the cleansing, deduplication, and reference data services.

The knowledge base stores all the knowledge applicable to a specific data source. That knowledge is organized into data domains, each of which is a semantic representation of a type of data in a data field (a composite domain can even span several fields).

A domain contains all kinds of values: valid values, invalid values, and erroneous data.

Domain Knowledge (DK) consists of synonym associations, term relationships, validation and business rules, and matching policies. Equipped with this KB, the IT professional, data steward, or non-database user can make a concrete decision about the data analysis and about whether or not to correct the data.

The DQS knowledge-driven solution uses two fundamental steps to cleanse data:

1. Builds a Knowledge base through the knowledge management process

2. Corrects the source data (where needed to satisfy the knowledge base semantics) based on the knowledge in the KB. This is done through a data quality project.

What are the DQS Components?

DQS consists of:

1. DQS Server.

2. DQS Client.

In addition to these, there is a transformation component called the DQS Cleansing component in the SSIS transformation toolbox that helps us cleanse data based on the knowledge base we have created.

The client and the server components are installed as part of the SQL Server setup program (Denali CTP3).

DQS Server

The DQS server is implemented as two SQL Server catalogs that we can manage and monitor in SQL Server Management Studio. They are:

1. DQS_MAIN

2. DQS_PROJECT

DQS_MAIN includes DQS stored procedures, the DQS engine, and published knowledge bases.

DQS_PROJECT includes data that is required for knowledge base management and DQS project activities.
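As a minimal sanity check (a sketch, assuming a default Denali CTP3 installation of the DQS server component on the current instance), you can confirm that both catalogs exist from SSMS:

-- Both catalogs should be listed once the DQS server component is installed.
SELECT name, create_date
FROM sys.databases
WHERE name IN (N'DQS_MAIN', N'DQS_PROJECT');
GO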

The source database that contains the data to be analyzed must also be in the same SQL Server instance as the DQS server.

DQS Client

It is a standalone application, designed for data stewards and DQS administrators, that helps them perform knowledge management, domain management, matching policy creation, data cleansing, matching, monitoring, data quality projects, and administration in one user interface.

The client application can be installed and run on the same computer as the DQS server or remotely on a separate computer.




Top Ten Features of Denali 'SQL Server 2012'

• SQL Server Developer Tools—One of the most obvious improvements in SQL Server Denali is the new development environment, SQL Server Developer Tools, code-named Juneau. Juneau uses the Windows Presentation Foundation (WPF)–based Visual Studio 2010 shell, and it unifies development for Business Intelligence Development Studio (BIDS) and Visual Studio. One goal for Juneau is to make the development environment consistent for both SQL Azure and the on-premises version of SQL Server.



• Contained databases—Contained databases make it easy to move databases between different instances of SQL Server. With Denali, login credentials are included with contained databases. Users don't need logins for the SQL Server instance because all authentications are handled by the contained database. Contained databases have no configuration dependencies on the instance of SQL Server that they're hosted on and can be moved between on-premises SQL Server instances and SQL Azure.
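A rough sketch of what this looks like in T-SQL (the database, user, and password below are hypothetical, and the sketch assumes contained database authentication is still disabled at the instance level):

-- Allow contained database authentication on the instance.
EXEC sp_configure 'contained database authentication', 1;
RECONFIGURE;
GO
-- Create a partially contained database and a user that authenticates at the database,
-- with no corresponding server login (names are made up for illustration).
CREATE DATABASE SalesApp CONTAINMENT = PARTIAL;
GO
USE SalesApp;
GO
CREATE USER AppUser WITH PASSWORD = 'Str0ng!Passw0rd';
GO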



• Project "Crescent"—The new data visualization tool, code-named Project "Crescent," is Closely integrated with SharePoint 2010 and Silverlight. Microsoft has called the Crescent feature "PowerPoint for your data." Crescent makes it easy for users to create great-looking data pages and dashboards by using data models that are built using PowerPivot or from tabular data from SQL Server Analysis Services.





• Data Quality Services—Valid data is critical for making effective business intelligence (BI) decisions. Data Quality Services lets you set up a knowledge base that defines your metadata rules. You can then run Data Quality Services projects to apply those rules to data stored in a SQL Server data source. The Data Quality Services projects cleanse the data and allow viewing of good, invalid, and corrected rows.





• User-defined server roles—An important security-related feature in Denali is the addition of user-defined server roles. Earlier releases had fixed server roles that were predefined by Microsoft. These roles covered most situations, but they weren't as flexible or granular as some organizations wanted. The new user-defined server roles give organizations more control and customization ability over SQL Server's server roles.
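A minimal sketch of the new syntax (the role and login names here are invented for illustration):

-- Create a user-defined server role, grant it a server-level permission,
-- and add a login as a member.
CREATE SERVER ROLE junior_dba;
GO
GRANT VIEW SERVER STATE TO junior_dba;
GO
ALTER SERVER ROLE junior_dba ADD MEMBER [CONTOSO\SomeLogin];
GO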





• Change data capture (CDC) for Oracle—CDC lets you keep large tables in sync by initially moving a snapshot to a target server, then moving just the captured changes between the databases. With the SQL Server 2008 release, CDC was limited to SQL Server, but many organizations also have other database platforms they want to use CDC with. A big improvement in the Denali release is the addition of CDC for Oracle.
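On the SQL Server side, CDC is still enabled per database and per table with the familiar system procedures; the Oracle capture itself is configured separately through the Oracle CDC service tooling that ships with Denali. A minimal sketch (the database and table names are hypothetical):

-- Enable CDC for the database, then for one table.
USE SalesDW;
GO
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;
GO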





• T-SQL enhancements—Two of the most important T-SQL enhancements in Denali are the addition of the Sequence object and the window functions. Unlike the similar Identity column, Sequence lets you tie unique row identifiers across multiple tables. The new window functions apply to sets of rows using the new OVER clause. You can read more about window functions in "Window Functions (OVER Clause)—Help Make a Difference."
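A quick sketch of both features (the sequence, table, and column names are hypothetical):

-- A sequence that can be shared across tables, plus a running total
-- using the new window-frame support in the OVER clause.
CREATE SEQUENCE dbo.OrderNumbers AS INT START WITH 1 INCREMENT BY 1;
GO
SELECT NEXT VALUE FOR dbo.OrderNumbers AS OrderNo;
GO
SELECT customerID, salesDate, salesAmount,
       SUM(salesAmount) OVER (PARTITION BY customerID
                              ORDER BY salesDate
                              ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM dbo.sales;
GO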





• Columnar store index—The columnar store index or, as it is sometimes called, the column-based query accelerator, uses the same high performance/high compression technology that Microsoft uses in PowerPivot, and it brings that technology into the database engine. Indexed data is stored according to the data of each column rather than by the rows, and only necessary columns are returned as query results for columnar indexes. Microsoft states this technology can provide up to 100 times improvement in query performance in some cases.
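A minimal example (table and column names are hypothetical; note that in Denali a table with a columnstore index becomes read-only until the index is dropped or disabled):

-- Nonclustered columnstore index over the columns typically scanned by reporting queries.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_sales_columnstore
ON dbo.sales (customerID, salesDate, salesAmount);
GO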





• Support for Windows Server Core—The ability to run SQL Server on Windows Server Core has been missing from previous releases of SQL Server. Server Core is designed for infrastructure applications such as SQL Server that provide back-end services but don't really need a GUI on the same server. Denali's support for Server Core enables leaner and more efficient SQL Server installations and at the same time reduces potential attack vectors and the need for patching.





• AlwaysOn—Without a doubt, the most important new feature in SQL Server Denali is the new SQL Server AlwaysOn feature. AlwaysOn is essentially the next evolution of database mirroring. AlwaysOn supports up to four replicas, the data in the replicas can be queried, and backups can be performed from the replicas. Although it's still early, AlwaysOn seems more complicated to set up than database mirroring because it requires Windows Failover Clustering, but the advantages appear to make it well worth the extra effort.
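Just to give a flavor of the setup, here is a rough sketch of creating an availability group. All of the server, endpoint, database, and group names are invented, and the sketch assumes Windows Failover Clustering and the AlwaysOn feature are already enabled on both instances, with mirroring endpoints in place:

-- Rough sketch only; not a complete AlwaysOn deployment script.
CREATE AVAILABILITY GROUP SalesAG
FOR DATABASE SalesApp
REPLICA ON
    N'SQLNODE1' WITH (ENDPOINT_URL = N'TCP://sqlnode1.contoso.com:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC),
    N'SQLNODE2' WITH (ENDPOINT_URL = N'TCP://sqlnode2.contoso.com:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC,
                      SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));
GO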


Page Restore and Tail Log

Here's an interesting scenario that cropped up today. You have a database on a RAID array that failed and has zero'd out a page. How can you get the data back?


There are two ways to do it, depending on the database recovery model and version of SQL Server - single-page restore or manual insert/select - both of which rely on you having a backup of the database. You can use single-page restore if you're on SQL Server 2008 and the database is in Full or Bulk-Logged recovery mode, otherwise you need to use the manual method, and that will only work as long as you know the data being salvaged hasn't changed since the last backup.
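If you're not sure which situation you're in, a quick way to check a database's recovery model (shown here for the test database created below) is:

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'dbccpagetest';
GO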

Let's try them both. Here's a script to create a test database and make a backup of it:

-- Create the database.
USE master;
GO
CREATE DATABASE dbccpagetest;
GO
ALTER DATABASE dbccpagetest SET RECOVERY FULL;
GO

-- Create a table to play with.
USE dbccpagetest;
GO
CREATE TABLE sales (
    salesID INT IDENTITY,
    customerID INT DEFAULT CONVERT (INT, 100000 * RAND ()),
    salesDate DATETIME DEFAULT GETDATE (),
    salesAmount MONEY);
CREATE CLUSTERED INDEX salesCI ON sales (salesID);
GO

-- Populate the table
SET NOCOUNT ON;
GO
DECLARE @count INT
SELECT @count = 0
WHILE (@count < 5000)
BEGIN
    INSERT INTO sales (salesAmount) VALUES (100 * RAND ());
    SELECT @count = @count + 1
END;
GO

-- Take a full backup.
BACKUP DATABASE dbccpagetest TO DISK = 'C:\dbccpagetest.bak' WITH INIT;
GO









I'm going to simulate our scenario by shutting down the database and using a hex editor to zero out page 158 of the database. (This translates to byte offset 1294336 of the file being zero'd for 8192 bytes).
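That offset is just the page ID multiplied by the 8192-byte page size, which you can sanity-check (or use to find the offset of any other page) with:

-- Page ID * 8 KB page size = byte offset into the data file.
SELECT 158 * 8192 AS [byte offset];   -- 1294336
GO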

Now if I run checkdb, I get the following:

Msg 8909, Level 16, State 1, Line 1

Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:158) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).

CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.

Msg 8928, Level 16, State 1, Line 1

Object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data): Page (1:158) could not be processed. See other errors for details.

CHECKDB found 0 allocation errors and 1 consistency errors in table 'sales' (object ID 2073058421).

CHECKDB found 0 allocation errors and 2 consistency errors in database 'dbccpagetest'.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (dbccpagetest).

What does the page look like?

DBCC TRACEON (3604);
GO
DBCC PAGE (dbccpagetest, 1, 158, 3);
GO





DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (0:0)

BUFFER:

BUF @0x02C0632C

bpage = 0x04C12000 bhash = 0x00000000 bpageno = (1:158)

bdbid = 9 breferences = 0 bUse1 = 37241

bstat = 0xc00009 blog = 0x89898989 bnext = 0x00000000

PAGE HEADER:

Page @0x04C12000

m_pageId = (0:0) m_headerVersion = 0 m_type = 0

m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x200

m_objId (AllocUnitId.idObj) = 0 m_indexId (AllocUnitId.idInd) = 0 Metadata: AllocUnitId = 0

Metadata: PartitionId = 0 Metadata: IndexId = -1 Metadata: ObjectId = 0

m_prevPage = (0:0) m_nextPage = (0:0) pminlen = 0

m_slotCnt = 0 m_freeCnt = 0 m_freeData = 0

m_reservedCnt = 0 m_lsn = (0:0:0) m_xactReserved = 0

m_xdesId = (0:0) m_ghostRecCnt = 0 m_tornBits = 16777216

Allocation Status

GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED

PFS (1:1) = 0x60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:6) = NOT CHANGED

ML (1:7) = NOT MIN_LOGGED

Msg 2514, Level 16, State 5, Line 2

DBCC PAGE error: Invalid page type - dump style 3 not possible.

Note the error at the end of the output - DBCC PAGE can't do an in-depth dump because it doesn't know what page type the page is. Let's try a full page hex dump using dump style 2 instead:

DBCC PAGE (dbccpagetest, 1, 158, 2);

GO

PAGE: (0:0)



DATA:

Memory Dump @0x44F3C000

44F3C000: 00000000 00020000 00000000 00000000 †................

44F3C010: 00000000 00000000 00000000 00000000 †................



44F3DFE0: 00000000 00000000 00000000 00000000 †................

44F3DFF0: 00000000 00000000 00000000 00000000 †................

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

It really is all zero. First we'll fix it using page restore.

USE master;
GO
RESTORE DATABASE dbccpagetest PAGE = '1:158' FROM DISK = 'C:\dbccpagetest.bak';
GO

Processed 1 pages for database 'dbccpagetest', file 'dbccpagetest' on file 1.

The roll forward start point is now at log sequence number (LSN) 32000000047000001. Additional roll forward past LSN 33000000001700001 is required to complete the restore sequence.

RESTORE DATABASE ... FILE= successfully processed 1 pages in 0.176 seconds (0.046 MB/sec).



Isn't that cool? You can restore up to 1000 single pages from a backup at a time. For VLDBs, this cuts the recovery time WAY down. Now we need to roll forward the log. We don't have any more log backups so we can finish the roll forward by backing up and restoring the tail of the log.







-- Need to complete roll forward. Backup the log tail...
BACKUP LOG dbccpagetest TO DISK = 'C:\dbccpagetest_log.bak' WITH INIT;
GO
-- ... and restore it again.
RESTORE LOG dbccpagetest FROM DISK = 'C:\dbccpagetest_log.bak';
GO



Processed 5 pages for database 'dbccpagetest', file 'dbccpagetest_log' on file 1.

BACKUP LOG successfully processed 5 pages in 0.146 seconds (0.248 MB/sec).

Processed 0 pages for database 'dbccpagetest', file 'dbccpagetest' on file 1.

RESTORE LOG successfully processed 0 pages in 0.004 seconds (0.000 MB/sec).

And now we should have a clean database:



DBCC CHECKDB (dbccpagetest) WITH NO_INFOMSGS;

GO

Command(s) completed successfully.

Easy. But what if we can't do a page restore? Assuming I've corrupted the database in exactly the same way again, the first thing to do is make sure we can restore the backup, and then see what data range is on that page:

RESTORE DATABASE dbccpagetest_copy FROM DISK = 'C:\dbccpagetest.bak' WITH
    MOVE N'dbccpagetest' TO N'C:\dbccpagetest_copy.mdf',
    MOVE N'dbccpagetest_log' TO N'C:\dbccpagetest_log.ldf',
    REPLACE;
GO
DBCC PAGE (dbccpagetest_copy, 1, 158, 3);
GO

Processed 184 pages for database 'dbccpagetest_copy', file 'dbccpagetest' on file 1.

Processed 2 pages for database 'dbccpagetest_copy', file 'dbccpagetest_log' on file 1.

RESTORE DATABASE successfully processed 186 pages in 0.361 seconds (4.205 MB/sec).

PAGE: (1:158)

BUFFER:

BUF @0x02BE8D38

bpage = 0x03FB4000 bhash = 0x00000000 bpageno = (1:158)

bdbid = 10 breferences = 1 bUse1 = 38283

bstat = 0xc00009 blog = 0x159a2159 bnext = 0x00000000

PAGE HEADER:

Page @0x03FB4000

m_pageId = (1:158) m_headerVersion = 1 m_type = 1

m_typeFlagBits = 0x4 m_level = 0 m_flagBits = 0x8200

m_objId (AllocUnitId.idObj) = 68 m_indexId (AllocUnitId.idInd) = 256

Metadata: AllocUnitId = 72057594042384384

Metadata: PartitionId = 72057594038386688 Metadata: IndexId = 1

Metadata: ObjectId = 2073058421 m_prevPage = (1:157) m_nextPage = (1:159)

pminlen = 28 m_slotCnt = 245 m_freeCnt = 11

m_freeData = 7691 m_reservedCnt = 0 m_lsn = (24:453:8)

m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0

m_tornBits = -1020457745

Allocation Status

GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED

PFS (1:1) = 0x60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:6) = NOT CHANGED

ML (1:7) = NOT MIN_LOGGED

Slot 0 Offset 0x60 Length 31

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP

Memory Dump @0x4542C060

00000000: 10001c00 d5030000 5bd30000 3f771101 †........[...?w..

00000010: b9980000 baa10a00 00000000 0500e0††††...............

UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

salesID = 981

Slot 0 Column 2 Offset 0x8 Length 4

customerID = 54107

Slot 0 Column 3 Offset 0xc Length 8

salesDate = Jan 17 2007 4:35PM

Slot 0 Column 4 Offset 0x14 Length 8

salesAmount = 69.68



Slot 244 Offset 0x1dec Length 31

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP

Memory Dump @0x4542DDEC

00000000: 10001c00 c9040000 bfa10000 57771101 †............Ww..

00000010: b9980000 c6b80500 00000000 0500e0††††...............

UNIQUIFIER = [NULL]

Slot 244 Column 1 Offset 0x4 Length 4

salesID = 1225

Slot 244 Column 2 Offset 0x8 Length 4

customerID = 41407

Slot 244 Column 3 Offset 0xc Length 8

salesDate = Jan 17 2007 4:35PM

Slot 244 Column 4 Offset 0x14 Length 8

salesAmount = 37.50

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

So we're looking at salesID range 981 to 1225 inclusive. Before we can copy the rows back to the damaged database, we need to get rid of the corrupt page. Repair should delete the page for us. First I'm going to take another backup though - just in case something goes wrong!

BACKUP DATABASE dbccpagetest TO DISK = 'C:\dbccpagetest_corrupt.bak' WITH INIT;
GO
ALTER DATABASE dbccpagetest SET SINGLE_USER;
GO
DBCC CHECKDB (dbccpagetest, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO
ALTER DATABASE dbccpagetest SET MULTI_USER;
GO

Processed 184 pages for database 'dbccpagetest', file 'dbccpagetest' on file 1.

Processed 4 pages for database 'dbccpagetest', file 'dbccpagetest_log' on file 1.

BACKUP DATABASE successfully processed 188 pages in 0.380 seconds (4.052 MB/sec).

Msg 8909, Level 16, State 1, Line 1

Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:158) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).

The error has been repaired.

CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.

CHECKDB fixed 0 allocation errors and 1 consistency errors not associated with any single object.

Repair: The Clustered index successfully rebuilt for the object "dbo.sales" in database "dbccpagetest".

Repair: The page (1:158) has been deallocated from object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data).

Msg 8945, Level 16, State 1, Line 1

Table error: Object ID 2073058421, index ID 1 will be rebuilt.

The error has been repaired.

Msg 8928, Level 16, State 1, Line 1

Object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data): Page (1:158) could not be processed. See other errors for details.

The error has been repaired.

Msg 8976, Level 16, State 1, Line 1

Table error: Object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data). Page (1:158) was not seen in the scan although its parent (1:154) and previous (1:157) refer to it. Check any previous errors.

The error has been repaired.

Msg 8978, Level 16, State 1, Line 1

Table error: Object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data). Page (1:159) is missing a reference from previous page (1:158). Possible chain linkage problem.

The error has been repaired.

CHECKDB found 0 allocation errors and 3 consistency errors in table 'sales' (object ID 2073058421).

CHECKDB fixed 0 allocation errors and 3 consistency errors in table 'sales' (object ID 2073058421).

CHECKDB found 0 allocation errors and 4 consistency errors in database 'dbccpagetest'.

CHECKDB fixed 0 allocation errors and 4 consistency errors in database 'dbccpagetest'.

We should check the row count to see that the count has dropped from the initial 5000 rows we inserted:

USE dbccpagetest;
GO
SELECT COUNT (*) FROM sales;
GO
SELECT COUNT (*) FROM sales WHERE salesID > 980 AND salesID < 1226;
GO

And we're down to 4755 rows, as expected, with zero rows in that range. All we need to do now is copy the missing rows over from the restored copy. Remember, this will only work if you know that the data being salvaged hasn't changed since the backup was taken - otherwise you'll have old and new data mixed in the table, which will play havoc with your business. Before we copy the rows, remember we've got an identity column whose values we'd like to preserve, so we set IDENTITY_INSERT ON, which lets us supply explicit values for the identity column instead of having the server generate new ones.

-- Make sure identity values survive.
SET IDENTITY_INSERT sales ON;
GO
-- Insert the missing rows.
SET NOCOUNT OFF;
GO
INSERT INTO sales (salesID, customerID, salesDate, salesAmount)
    SELECT * FROM dbccpagetest_copy.dbo.sales AS R
    WHERE R.salesID > 980 AND R.salesID < 1226;
GO
-- Restore identity behavior.
SET IDENTITY_INSERT sales OFF;
GO

(245 row(s) affected)

We copy over 245 rows, and checking the row count again shows we're back to 5000 rows.
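As a final check (just repeating the commands used earlier), confirm the count and that the database is clean again:

SELECT COUNT (*) FROM sales;
GO
DBCC CHECKDB (dbccpagetest) WITH NO_INFOMSGS;
GO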


