How to find and remove duplicates in BODS using gen_row_num_by_group function
By Imran, Mohammad January 24, 2022 under BODS
If you are worried that your source data has some duplicates and you are not sure on which row it occurs, do not worry BODS has a special feature to identify the duplicates records easly using Ranking method. In this post, I will guide how you can find and remove duplicates step by step with an example. I will be using in-built function called gen_row_num_by_group() function.
Source Data
In the example below the last 3 rows are shown as duplicate records, you want to indentify them in order to remove from the data set.
Query Design
Lets build a query in BODS. In the design below you see that Order by query is used which has been implented to short the records so that all the records are in ascending or descending order, so when gen_row_num_by_group() function is executed if the next row is found same as previous row then control will treat that row as duplicate and insert 2. if you dont apply Order by then you will not have rocords in order, so when function is executed the next row can be any random row, control will not treat that record as duplicate and insert 1.
ORDER_BY Transform
In the example below, I have included all the fields in ORDER BY tab, it will be the combination of fields which make duplicate.
RANK Transform
In the example below, the gen_row_num_by_group() function is applied in order to give the ranking to duplicate records. Non-duplicate records will be given 1 and duplicate records will given next number based on the occurance. So you should pass all the fields which make records duplicate, in the example below we have 5 fields which make records duplicate so pass those 5 fields in gen_row_num_by_group() function. If you have one field which makes records duplicate then pass only one. But ensure the field you would pass to the function is already sorted out in previous query.
Output
In the example below, the last 3 rows are shown as duplicate and given rank based on the occurance. In order to remove duplicate you can task RANK 1 forward and others can be skipped.
Removing Duplicate
In the example below, I have applied a where claus in next Query Transform. By this, I will take only RANK 1 records further.
Duplicate Removed
In the example below, if you notice last two rows which were shown as duplicate have been removed.
Conclusion
Although I have tried to cover all the aspects of removing duplicate using ranking method but if you need more help on this or any other topic related to data migration, do connect with me.