To calculate a median (the middle value of a sorted list), you need to put a couple of transformations together. Below are the steps needed to calculate a median in ADF using data flows.
- Sort your data by the field whose median value you want to find
- Collect the values into an array
- Count the number of values
- Find the midpoint
In my demo, I’m using the movies database CSV source and I would like to find the median rating value of movies grouped by year. My final result will be a single median value for each year that represents the median rating of movies for that year.
The Sort transformation sorts Ratings so that I know they are in ascending order for my median calculation. Next is the Aggregate transformation, which I use to group the data by year. Inside the aggregate, I use collect() to gather the ratings into an array, so that each value has an index I can use to find the middle, and count() to get the total number of values.
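To make the sort and aggregate steps concrete, here is a minimal Python sketch of the same idea outside ADF. The sample rows are made up for illustration and are not from the movies CSV. The names ratings_collection and ratings_count simply mirror the ratingsCollection and ratingsCount fields used in this post.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the movies CSV: (year, rating).
rows = [(1999, 7), (2000, 5), (1999, 3), (2000, 8), (1999, 9), (2000, 6), (2000, 2)]

# Sort step: order all rows by rating, ascending.
sorted_rows = sorted(rows, key=lambda row: row[1])

# Aggregate step, grouped by year: collect the ratings in their sorted order.
ratings_collection = defaultdict(list)
for year, rating in sorted_rows:
    ratings_collection[year].append(rating)

# Count step: the number of ratings collected for each year.
ratings_count = {year: len(values) for year, values in ratings_collection.items()}

for year in sorted(ratings_collection):
    print(year, ratings_collection[year], ratings_count[year])
```

Each year ends up with one sorted array plus its length, which is all the midpoint calculation needs.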
The last thing I need to do is find the middle. I do that as a calculation inside a Derived Column transformation. I call the new field “median” and apply this formula:
ratingsCollection[toInteger(round(ratingsCount/2)+1)]
The field ratingsCount was created in the aggregation, so I divide it by 2, round the result to an integer, and then add 1. Adding 1 means I won’t ever end up with a 0 index value, and when there are two middle values I simply pick the higher middle index.
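The index arithmetic can be checked with a small Python sketch. It is an illustration outside ADF and rests on two assumptions: array positions start at 1, and dividing the count by 2 yields a whole number (truncating) before 1 is added. The helper names are hypothetical. The second helper shows what the same formula would give if the division instead produced a fraction that was rounded half up, which is worth testing against an odd-sized group in your own data flow.

```python
import math


def upper_middle_index(count: int) -> int:
    """1-based index of the middle (odd count) or upper-middle (even count) value.

    Assumes count / 2 truncates to a whole number before 1 is added.
    """
    return count // 2 + 1


def half_up_index(count: int) -> int:
    """The same formula if count / 2 were fractional and rounded half up (assumption)."""
    return math.floor(count / 2 + 0.5) + 1


def pick(values: list, index_1_based: int):
    """Read a 1-based position from a Python list, which is 0-based."""
    return values[index_1_based - 1]


odd = [3, 7, 9]       # sorted ratings, count 3
even = [2, 5, 6, 8]   # sorted ratings, count 4

print(pick(odd, upper_middle_index(len(odd))))    # 7, the true middle
print(pick(even, upper_middle_index(len(even))))  # 6, the upper of the two middle values

# For a count of 5 the two interpretations disagree: position 3 versus position 4.
print(upper_middle_index(5), half_up_index(5))
```

With truncating division the formula lands on the true middle for odd counts and on the upper middle for even counts. Under the half-up interpretation, even counts are unchanged but odd counts land one position too high.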