How to Remove Duplicate Rows in SQL?
Deduplicate Rows in SQL is a query pattern that assigns a row number to each row within a group of duplicates and keeps only the first row of each group. Formula Genius generates and validates this formula automatically from a plain-English prompt.
Duplicate rows can clutter your data and lead to inaccurate analysis. This guide shows you how to effectively remove them using SQL.
The Formula
"Remove or identify duplicate rows in a table using ROW_NUMBER() with PARTITION BY to keep only the first occurrence"
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY column3) as rn FROM your_table) as temp WHERE rn = 1;
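This query returns one row per group of duplicates: it numbers the rows within each partition and keeps only the row numbered 1. It does not delete anything from the table, and the result includes the helper column rn.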
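The SELECT above only returns deduplicated rows; the table itself keeps its duplicates. As a minimal sketch of physically deleting the extra rows, the following assumes SQLite 3.25 or newer (for window functions), run through Python's built-in sqlite3 module. The sample data is hypothetical, and the use of rowid is SQLite-specific; other engines need a different row identifier, such as a primary key.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (column1 TEXT, column2 TEXT, column3 INTEGER)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        ("a", "x", 1),
        ("a", "x", 2),  # duplicate of the row above on (column1, column2)
        ("b", "y", 5),
        ("b", "y", 3),  # duplicate; lower column3, so this row is the one kept
        ("c", "z", 9),
    ],
)

# SQLite-specific assumption: rowid identifies each physical row, so every
# row whose row number is greater than 1 can be deleted by its rowid.
conn.execute(
    """
    DELETE FROM your_table
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY column1, column2 ORDER BY column3
                   ) AS rn
            FROM your_table
        ) AS numbered
        WHERE rn > 1
    )
    """
)

remaining = conn.execute(
    "SELECT column1, column2, column3 FROM your_table ORDER BY column1"
).fetchall()
print(remaining)
```

Running the DELETE inside a transaction, or against a copy of the table first, makes it easier to check which rows would be removed.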
Step-by-Step Breakdown
- Use ROW_NUMBER() to assign a sequential integer to rows within a partition of duplicates.
- Specify the columns to check for duplicates in the PARTITION BY clause.
- Order the duplicates by a specific column to determine which row to keep.
- Filter the results to only include rows where the row number is 1.
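The steps above can be sketched on a small in-memory table. This is an illustration under assumptions, not a definitive implementation: it uses Python's built-in sqlite3 module with SQLite 3.25 or newer, and the sample rows are made up. The first query shows the row numbers before filtering; the second applies the rn = 1 filter.

```python
import sqlite3

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (column1 TEXT, column2 TEXT, column3 INTEGER)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [("a", "x", 2), ("a", "x", 1), ("a", "x", 3), ("b", "y", 7)],
)

# Steps 1-3: number the rows inside each (column1, column2) partition,
# ordered by column3.
numbered_sql = """
    SELECT column1, column2, column3,
           ROW_NUMBER() OVER (
               PARTITION BY column1, column2 ORDER BY column3
           ) AS rn
    FROM your_table
"""
numbered = conn.execute(
    numbered_sql + " ORDER BY column1, column2, rn"
).fetchall()

# Step 4: keep only the first row of each partition.
kept = conn.execute(
    f"SELECT column1, column2, column3 FROM ({numbered_sql}) AS temp "
    "WHERE rn = 1 ORDER BY column1"
).fetchall()

print(numbered)
print(kept)
```

The numbering restarts at 1 for the ("b", "y") partition, which is why a single filter on rn = 1 keeps exactly one row per group.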
Edge Cases & Warnings
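- All rows are unique: every row gets row number 1, so the query returns the table unchanged.
- Multiple rows share the same values in the partitioned columns but differ in the order column: the row that sorts first is kept. If the order values also tie, which row is kept is not deterministic.
- The table is empty: the query returns no rows.
- The PARTITION BY columns contain NULL values: rows with NULL in the same partition column are grouped together, so they are treated as duplicates of each other.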
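The NULL case is easy to check on a small table. The sketch below assumes SQLite 3.25 or newer via Python's sqlite3 module; the contacts table and its columns are hypothetical. It shows that rows with a NULL partition value collapse to one row, and one possible way to exempt them from the filter if that is not wanted.

```python
import sqlite3

# Hypothetical table: two rows share an email, two rows have no email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [(1, "ann@example.com"), (2, "ann@example.com"), (3, None), (4, None)],
)

numbered_sql = """
    SELECT id, email,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM contacts
"""

# NULL emails land in a single partition, so only one of them has rn = 1.
collapsed = conn.execute(
    f"SELECT id, email FROM ({numbered_sql}) AS temp WHERE rn = 1 ORDER BY id"
).fetchall()

# One option for keeping every NULL row: exempt NULLs from the filter.
nulls_kept = conn.execute(
    f"SELECT id, email FROM ({numbered_sql}) AS temp "
    "WHERE rn = 1 OR email IS NULL ORDER BY id"
).fetchall()

print(collapsed)
print(nulls_kept)
```

Whether NULL rows should count as duplicates depends on the data; the second query is only one way to treat them as distinct.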
Examples
"SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY date) as rn FROM employees) as temp WHERE rn = 1;"
Returns one row per employee name, keeping the row with the earliest date.
"SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_date) as rn FROM sales) as temp WHERE rn = 1;"
Returns the earliest sale record (by sale_date) for each product.
Frequently Asked Questions
What does PARTITION BY do?
PARTITION BY divides the result set into groups of rows that share the same values in the listed columns. ROW_NUMBER() is applied to each group separately and restarts at 1 in each one.
Can I use this method for large datasets?
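Yes, but the database has to group and sort the rows within each partition, so performance depends on the size of the dataset and the database engine. In some engines, an index on the PARTITION BY and ORDER BY columns may reduce the sorting cost.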
What happens if I don't include ORDER BY?
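Without ORDER BY, the database may number the rows in each partition in any order, so which duplicate is kept is unpredictable and can change between runs. Some engines, such as SQL Server, reject ROW_NUMBER() without an ORDER BY in the OVER clause.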
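Even with ORDER BY, rows that tie on the ordering column can be numbered in either order. A common way to make the result repeatable is to add a unique column as a tie-breaker. The sketch below assumes SQLite 3.25 or newer via Python's sqlite3 module, and a hypothetical id column on the employees table that is not part of the examples above.

```python
import sqlite3

# Hypothetical data: two "Ann" rows tie on date, so date alone cannot
# decide which one is kept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (3, "Ann", "2024-01-05"),
        (1, "Ann", "2024-01-05"),  # same name and date as id 3
        (2, "Bob", "2024-02-01"),
    ],
)

# Ordering by date, then id, gives every row in a partition a distinct
# position, so the row with rn = 1 is always the same one.
kept = conn.execute(
    """
    SELECT id, name, date FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY name ORDER BY date, id
        ) AS rn
        FROM employees
    ) AS temp
    WHERE rn = 1
    ORDER BY name
    """
).fetchall()

print(kept)
```

Here the lower id wins the tie; any column that is unique within the partition would work as the tie-breaker.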