ADF Data Flows Custom Logging and Auditing Video
ADF has a number of built-in capabilities for logging, monitoring, alerting, and auditing your pipelines. There are UI monitoring tools, telemetry logs, and integration with Azure Monitor to provide a rich set of tools for the administration of your ETL and data integration processes.
However, if you’d like to apply additional custom logging and auditing to your ETL data flows, you can use the techniques below, all of which are based on existing functionality found natively within ADF:
Data Flow pipeline activity output
With this technique, you query the output metrics from your data flow activities in the pipeline and pass the values you are interested in to another data flow. The first data flow activity (ExecDataFlow) is the data transformation worker and the second (Log Data) is the logger.
If you look at the output from your data flow activity execution, you will see the JSON payload returned by the activity.
You can pick out different metrics to log, such as the time for each transformation stage, source rows read, sink rows written, bytes read/written, and so on. For this example, I’m going to log the processing time for the sink (the total time it took to write the rows to the sink) and the number of rows written:
@activity('ExecDataFlow').output.runStatus.metrics.sink1.rowsWritten
@activity('ExecDataFlow').output.runStatus.metrics.sink1.sinkProcessingTime
I assign those values to the logger data flow, which takes in several integer parameters and simply writes those parameters out to a delimited text file, with no header, in my output folder in ADLS. This makes the data flow very generic and reusable for logging.
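To make the wiring concrete, here is a rough sketch of what the logger activity could look like in the pipeline JSON. The data flow name LoggerDataFlow and the parameter names rowsWritten and sinkProcessingTime are illustrative, and the exact JSON that ADF generates for data flow parameters may differ slightly from this shape:

{
    "name": "Log Data",
    "type": "ExecuteDataFlow",
    "dependsOn": [
        { "activity": "ExecDataFlow", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "dataflow": {
            "referenceName": "LoggerDataFlow",
            "type": "DataFlowReference",
            "parameters": {
                "rowsWritten": {
                    "value": "@activity('ExecDataFlow').output.runStatus.metrics.sink1.rowsWritten",
                    "type": "Expression"
                },
                "sinkProcessingTime": {
                    "value": "@activity('ExecDataFlow').output.runStatus.metrics.sink1.sinkProcessingTime",
                    "type": "Expression"
                }
            }
        }
    }
}

The dependsOn entry makes the logger run only after the worker data flow has finished, so its output metrics are available to the expressions.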
The logger data flow uses an ADF data flow technique of pointing to a source CSV file in Blob Storage that contains just a single row and a single column, with no header.
The file content is simply this:
1
I call this file “dummyfile.txt” and I recommend keeping one around in your blob storage with an ADF dataset pointing to it. It allows you to build data flows that don’t really use the source data. Instead, data flows like this logger generate their values from parameters via Derived Column transformations.
It’s an important technique to learn and repeat in ADF data flows. This way, I can set my source to this dummy file, set the incoming parameter values in a Derived Column, and then write each logger parameter out to a delimited text file, as sketched below.
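As a rough sketch, the data flow script behind a three-parameter version of this logger could look something like the following. The parameter, stream, and column names are my own, the sink would be bound to a headerless delimited text dataset in ADLS, and the third parameter anticipates the row count verification technique covered further down:

parameters{
	rowsWritten as integer (0),
	sinkProcessingTime as integer (0),
	tableRowCount as integer (0)
}
source(output(
		dummy as string
	),
	allowSchemaDrift: true,
	validateSchema: false) ~> DummySource
DummySource derive(rowsWritten = $rowsWritten,
		sinkProcessingTime = $sinkProcessingTime,
		tableRowCount = $tableRowCount) ~> SetLogValues
SetLogValues sink(allowSchemaDrift: true,
	validateSchema: false,
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> LogSink

The single dummy column from the source can be dropped with a Select transformation or in the sink mapping so that only the three logged values land in the output file.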
This data flow can be re-used in other pipelines as your logger.
Row Count aggregation inside the data flow
With this technique, you add logging directly inside your data flows. Here, you can simply log row counts and sink those values to a text file or a database table. Use a New Branch in your data flow to create a separate logging branch, then add an Aggregate transformation with no grouping and use the count() function. You can also write out the counts of each insert, update, and upsert operation coming from your logic to audit and log your database operations. In either case, the separate logging branch does not affect your transformation logic. However, this technique requires you to add the logic, non-reusably, inside each data flow, whereas the separate logger data flow described above can be reused.
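In data flow script terms, the logging branch is simply the same upstream stream feeding both your regular sink and an Aggregate. A minimal sketch, with illustrative stream and column names, could look roughly like this:

CleanedRows sink(allowSchemaDrift: true,
	validateSchema: false) ~> MainSink
CleanedRows aggregate(rowsProcessed = count(1)) ~> CountRows
CountRows sink(allowSchemaDrift: true,
	validateSchema: false) ~> AuditSink

The insert/update/upsert audit follows the same pattern, with the Aggregate producing one count per Alter Row policy instead of a single count(1).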
Row Count lookup activity verification
Another common technique is to count the number of rows from your data flow logic and compare it against the actual number of rows written to the database sink. This is important for auditing and validation.
In this example, we use the Lookup activity in the pipeline to query the total number of rows in the destination database so that we can compare it to the number of rows reported by our data flow logger.
This pipeline expression is the 3rd parameter sent to our data flow logger:
@activity('GetRowCount').output.firstRow.myrowcount
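For reference, assuming an Azure SQL sink, the GetRowCount Lookup activity behind that expression could be configured something like this. The dataset and table names are placeholders; the only detail that matters for the expression above is the myrowcount alias in the query:

{
    "name": "GetRowCount",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT COUNT(*) AS myrowcount FROM dbo.MyTargetTable"
        },
        "dataset": {
            "referenceName": "TargetTableDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}

With firstRowOnly set to true, the activity returns a single row that the pipeline expression reads via output.firstRow.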
Now, when we look at the output file from our data flow logger, it shows the number of rows written by the activity, the time it took to execute in milliseconds, and the number of rows counted in the actual database itself, so that we can see any discrepancy:
9128,14913,9125