bodsexpert
bodsexpert

How to find and remove duplicates in BODS using gen_row_num_by_group function

By Imran, Mohammad January 24, 2022 under BODS

If you are worried that your source data has some duplicates and you are not sure on which row it occurs, do not worry BODS has a special feature to identify the duplicates records easly using Ranking method. In this post, I will guide how you can find and remove duplicates step by step with an example. I will be using in-built function called gen_row_num_by_group() function.


Source Data

In the example below the last 3 rows are shown as duplicate records, you want to indentify them in order to remove from the data set.


Query Design

Lets build a query in BODS. In the design below you see that Order by query is used which has been implented to short the records so that all the records are in ascending or descending order, so when gen_row_num_by_group() function is executed if the next row is found same as previous row then control will treat that row as duplicate and insert 2. if you dont apply Order by then you will not have rocords in order, so when function is executed the next row can be any random row, control will not treat that record as duplicate and insert 1.

Query

ORDER_BY Transform

In the example below, I have included all the fields in ORDER BY tab, it will be the combination of fields which make duplicate.

Order By

RANK Transform

In the example below, the gen_row_num_by_group() function is applied in order to give the ranking to duplicate records. Non-duplicate records will be given 1 and duplicate records will given next number based on the occurance. So you should pass all the fields which make records duplicate, in the example below we have 5 fields which make records duplicate so pass those 5 fields in gen_row_num_by_group() function. If you have one field which makes records duplicate then pass only one. But ensure the field you would pass to the function is already sorted out in previous query.

Duplicate Ranking

Output

In the example below, the last 3 rows are shown as duplicate and given rank based on the occurance. In order to remove duplicate you can task RANK 1 forward and others can be skipped.

duplicate rank

Removing Duplicate

In the example below, I have applied a where claus in next Query Transform. By this, I will take only RANK 1 records further.

Where clause

Duplicate Removed


In the example below, if you notice last two rows which were shown as duplicate have been removed.

Remove Duplicate

Conclusion

Although I have tried to cover all the aspects of removing duplicate using ranking method but if you need more help on this or any other topic related to data migration, do connect with me.