ADF Data Flows: Distinct Rows

Below is a method for ADF’s Mapping Data Flows that reduces a data stream to only its distinct rows. This sample is available as a pipeline template here.

  1. Choose your source. This will work with any type of source.
  2. Add an Aggregate transformation. Here I call it “DistinctRows”.
  3. In the group by setting, choose which column or combination of columns will make up the key(s) that ADF uses to determine distinct rows. In this simple demo, I’ll just pick “movie” as my key column.
  4. By its nature, the aggregate transformation drops every column that is not used in the group by or in an aggregate expression. But here, we are using the aggregate to filter out non-distinct rows, so we need every column from the original dataset.
  5. To do this, go to the aggregate settings and choose Column Pattern.
  6. Here, you will need to make a choice between including the first set of values from the duplicate rows, or the last (for example, by applying the first() or last() aggregate function to the matched columns). Essentially, choose which row you want to be the source of truth.
  7. That’s really it. That’s all you need to do to find distinct rows in your data.
  8. The rest of this data flow sample compares the distinct rows with the original set of data. You can keep the other transformation streams in this sample data flow so that you can compare the original data with the distinct rows and make sure the flow is behaving the way you expect.
  9. I created a copy of the original source data by using New Source and then renamed that stream as “OriginalData” by aliasing it with the Select transformation on the bottom row.
  10. The row counts are just Aggregate transformations.
  11. To create row counts go to Aggregate settings and leave the group by empty.
  12. In the aggregate function, use the function count(1). With an empty group by, this returns a single value: the total number of rows in the stream.
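
To make the logic of steps 3 to 6 and the row counts in steps 10 to 12 concrete, here is a rough sketch in plain Python. This is not ADF code and not how ADF works internally; it only mimics the idea of grouping on a key and keeping the first or last full row. The function name distinct_rows, the keep parameter, and the sample columns year and rating are all made up for this illustration.

```python
def distinct_rows(rows, keys, keep="first"):
    """Group rows by the key columns and keep one whole row per group.

    rows: list of dicts (one dict per row)
    keys: list of column names that define a duplicate (the "group by")
    keep: "first" or "last", i.e. which duplicate is the source of truth
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    chosen = {}  # group key -> the row kept for that group
    for row in rows:
        key = tuple(row[k] for k in keys)
        # "first": only store a row the first time its key is seen.
        # "last": overwrite, so the final duplicate wins.
        if keep == "last" or key not in chosen:
            chosen[key] = row
    return list(chosen.values())


# Hypothetical sample data with "movie" as the key column.
original = [
    {"movie": "Alien", "year": 1979, "rating": 8},
    {"movie": "Heat", "year": 1995, "rating": 7},
    {"movie": "Alien", "year": 1979, "rating": 9},
]

first = distinct_rows(original, ["movie"], keep="first")
last = distinct_rows(original, ["movie"], keep="last")

# Row counts, comparable to an aggregate with an empty group by.
print(len(original), len(first), len(last))
```

Every non-key column travels with the row that is kept, which is what the column pattern achieves in the Aggregate transformation. Comparing len(original) with len(first) plays the same role as the two row-count streams in the sample data flow.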
