Home > Common Table Expressions, SQLServer > SQL Server – Identifying unique and duplicate rows in a table

SQL Server – Identifying unique and duplicate rows in a table


Finding duplicate rows in a table can be done easily by using ROW_NUMBER() function. You can number each similar rows by using PARTITION BY clause.

For example, consider the below table:

EmployeeID  FirstName      MiddleName     LastName

———– ———-     ———–    ———

1           Ken            J              Sánchez

2           Terri          Lee            Duffy

3           Gail           A              Erickson

4           Ken            J              Sánchez

5           Terri          Lee            Duffy

6           Gail           A              Erickson

 

(6 row(s) affected)

Selecting using DISTINCT does not help here, as we have an identity column [EmployeeID] which id unique for all rows. To find duplicates here we need to compare only common columns [FirstName], [MiddleName] and [LastName]. We can partition data based on these columns for row numbering as below:

SELECT      EmployeeID, FirstName, MiddleName, LastName,

            ROW_NUMBER() OVER

            (PARTITION BY FirstName, MiddleName, LastName

             ORDER BY EmployeeID)

            AS 'Row Number'

FROM        dbo.Table_Employees

Result Set:

EmployeeID  FirstName      MiddleName     LastName     Row Number

———– ———-     ———–    ———    ———–

3           Gail           A              Erickson     1

6           Gail           A              Erickson     2

1           Ken            J              Sánchez      1

4           Ken            J              Sánchez      2

2           Terri          Lee            Duffy        1

5           Terri          Lee            Duffy        2

 

(6 row(s) affected)

Now we have both unique and duplicate rows from dbo.Table_Employees. All rows with [Row Number] = 1 are unique, and all other rows are duplicate rows. Still we cannot identify that if a row is duplicate, which is a corresponding unique row for the same (except by looking at it ? , which may not be feasible if there are a large number of rows).

 

To easily identify a unique row corresponding to a duplicate row, we would like to add another column which gives us the [EmployeeID] of the unique row. In the above example, row with [EmployeeID] = 6 is duplicate row, and we would like to add a column which contains value 3 ([EmployeeID] of corresponding unique row). So the row should look like below:

EmployeeID  FirstName      MiddleName     LastName     Row Number     Duplicate Of

———– ———-     ———–    ———    ———–    ————-

6           Gail           A              Erickson     2              3

 

This can be achieved by:

 

1. Move all unique rows to TableA (all rows where [RowNumber] = 1),

2. Move all duplicate rows to TableB (all rows where [RowNumber] <> 1),

3. JOIN TableA with TableB to get value for [Duplicate Of].

 

We don’t need to create actual separate tables here, we can use CTE instead. Final query will be as follows:

 

WITH  CTE_Employees

      (EmployeeID, FirstName, MiddleName, LastName, RowNumber)

AS

(

      SELECT      EmployeeID, FirstName, MiddleName, LastName,

                  ROW_NUMBER() OVER

                  (PARTITION BY FirstName, MiddleName, LastName

                  ORDER BY EmployeeID)

      FROM        Table_Employees

)

SELECT      TableB.EmployeeID, TableB.FirstName, TableB.MiddleName,

            TableB.LastName, TableA.EmployeeID AS 'Duplicate Of'

FROM        CTE_Employees TableB

INNER JOIN CTE_Employees TableA

            ON    TableB.FirstName  = TableA.FirstName

            AND   TableB.MiddleName = TableA.MiddleName

            AND   TableB.LastName   = TableA.LastName

WHERE       TableB.RowNumber > 1 — Duplicate rows

AND         TableA.RowNumber = 1 — Unique rows

Result Set:

EmployeeID  FirstName      MiddleName     LastName      Duplicate Of

———– ———-     ———–    ———     ————-

6           Gail           A              Erickson      3

4           Ken            J              Sánchez       1

5           Terri          Lee            Duffy         2

 

(3 row(s) affected)

Hope This Helps!

Vishal

  1. Paul White
    12.08.2011 09:13 | #1

    You could also consider:

    SELECT
    e.EmployeeID,
    e.FirstName,
    e.MiddleName,
    e.LastName,
    [Duplicate Of] = e.first_id
    FROM
    (
    SELECT
    e2.*,
    rn = ROW_NUMBER() OVER (
    PARTITION BY e2.FirstName, e2.MiddleName, e2.LastName
    ORDER BY e2.EmployeeID),
    first_id = MIN (e2.EmployeeID) OVER (
    PARTITION BY e2.FirstName, e2.MiddleName, e2.LastName)
    FROM Table_Employees AS e2
    ) AS e
    WHERE
    e.rn > 1;

  2. Steve
    28.06.2012 08:25 | #2

    Neat explanation. Nice work!

  3. Samiigom
    15.01.2014 10:29 | #3

    Marvelous!

  4. Prabhu Govindan
    11.11.2014 06:50 | #4

    Hi Vishal,

    Excellent Post.

    I have an requirement to find the duplicates records based on multiple columns. Above SQL query fetches both duplicate and non-duplicate records.

    Any idea how to avoid the non-duplicates from the table ?

    Regards,
    Prabhu

  5. Anuja
    12.06.2015 11:53 | #5

    thank you so much ! it saved my day ?

  1. No trackbacks yet.