Home > Common Table Expressions, SQLServer > SQL Server – Identifying unique and duplicate rows in a table

SQL Server – Identifying unique and duplicate rows in a table

Finding duplicate rows in a table can be done easily by using ROW_NUMBER() function. You can number each similar rows by using PARTITION BY clause.

For example, consider the below table:

EmployeeID  FirstName      MiddleName     LastName

———– ———-     ———–    ———

1           Ken            J              Sánchez

2           Terri          Lee            Duffy

3           Gail           A              Erickson

4           Ken            J              Sánchez

5           Terri          Lee            Duffy

6           Gail           A              Erickson

 

(6 row(s) affected)

Selecting using DISTINCT does not help here, as we have an identity column [EmployeeID] which id unique for all rows. To find duplicates here we need to compare only common columns [FirstName], [MiddleName] and [LastName]. We can partition data based on these columns for row numbering as below:

SELECT      EmployeeID, FirstName, MiddleName, LastName,

            ROW_NUMBER() OVER

            (PARTITION BY FirstName, MiddleName, LastName

             ORDER BY EmployeeID)

            AS 'Row Number'

FROM        dbo.Table_Employees

Result Set:

EmployeeID  FirstName      MiddleName     LastName     Row Number

———– ———-     ———–    ———    ———–

3           Gail           A              Erickson     1

6           Gail           A              Erickson     2

1           Ken            J              Sánchez      1

4           Ken            J              Sánchez      2

2           Terri          Lee            Duffy        1

5           Terri          Lee            Duffy        2

 

(6 row(s) affected)

Now we have both unique and duplicate rows from dbo.Table_Employees. All rows with [Row Number] = 1 are unique, and all other rows are duplicate rows. Still we cannot identify that if a row is duplicate, which is a corresponding unique row for the same (except by looking at it 🙂 , which may not be feasible if there are a large number of rows).

 

To easily identify a unique row corresponding to a duplicate row, we would like to add another column which gives us the [EmployeeID] of the unique row. In the above example, row with [EmployeeID] = 6 is duplicate row, and we would like to add a column which contains value 3 ([EmployeeID] of corresponding unique row). So the row should look like below:

EmployeeID  FirstName      MiddleName     LastName     Row Number     Duplicate Of

———– ———-     ———–    ———    ———–    ————-

6           Gail           A              Erickson     2              3

 

This can be achieved by:

 

1. Move all unique rows to TableA (all rows where [RowNumber] = 1),

2. Move all duplicate rows to TableB (all rows where [RowNumber] <> 1),

3. JOIN TableA with TableB to get value for [Duplicate Of].

 

We don’t need to create actual separate tables here, we can use CTE instead. Final query will be as follows:

 

WITH  CTE_Employees

      (EmployeeID, FirstName, MiddleName, LastName, RowNumber)

AS

(

      SELECT      EmployeeID, FirstName, MiddleName, LastName,

                  ROW_NUMBER() OVER

                  (PARTITION BY FirstName, MiddleName, LastName

                  ORDER BY EmployeeID)

      FROM        Table_Employees

)

SELECT      TableB.EmployeeID, TableB.FirstName, TableB.MiddleName,

            TableB.LastName, TableA.EmployeeID AS 'Duplicate Of'

FROM        CTE_Employees TableB

INNER JOIN CTE_Employees TableA

            ON    TableB.FirstName  = TableA.FirstName

            AND   TableB.MiddleName = TableA.MiddleName

            AND   TableB.LastName   = TableA.LastName

WHERE       TableB.RowNumber > 1 — Duplicate rows

AND         TableA.RowNumber = 1 — Unique rows

Result Set:

EmployeeID  FirstName      MiddleName     LastName      Duplicate Of

———– ———-     ———–    ———     ————-

6           Gail           A              Erickson      3

4           Ken            J              Sánchez       1

5           Terri          Lee            Duffy         2

 

(3 row(s) affected)

Hope This Helps!

Vishal

If you like this post, do like my Facebook Page -> SqlAndMe
EMail me your questions -> Vishal@SqlAndMe.com
Follow me on Twitter -> @SqlAndMe

Advertisements
  1. August 12, 2011 at 9:13 pm

    You could also consider:

    SELECT
    e.EmployeeID,
    e.FirstName,
    e.MiddleName,
    e.LastName,
    [Duplicate Of] = e.first_id
    FROM
    (
    SELECT
    e2.*,
    rn = ROW_NUMBER() OVER (
    PARTITION BY e2.FirstName, e2.MiddleName, e2.LastName
    ORDER BY e2.EmployeeID),
    first_id = MIN (e2.EmployeeID) OVER (
    PARTITION BY e2.FirstName, e2.MiddleName, e2.LastName)
    FROM Table_Employees AS e2
    ) AS e
    WHERE
    e.rn > 1;

  2. Steve
    June 28, 2012 at 8:25 pm

    Neat explanation. Nice work!

  3. Samiigom
    January 15, 2014 at 10:29 am

    Marvelous!

  4. November 11, 2014 at 6:50 am

    Hi Vishal,

    Excellent Post.

    I have an requirement to find the duplicates records based on multiple columns. Above SQL query fetches both duplicate and non-duplicate records.

    Any idea how to avoid the non-duplicates from the table ?

    Regards,
    Prabhu

  5. Anuja
    June 12, 2015 at 11:53 am

    thank you so much ! it saved my day 🙂

  1. August 12, 2011 at 8:51 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: