Encouragement of Airflow doc_md

Atsushi Hara
3 min read · Jan 12, 2021

Airflow is a great task flow management tool for ETL.

Thanks to Airflow, we can easily add or remove tasks and freely change their order, because Airflow manages the task dependencies.
And, of course, if we manage the code with git, we can control its version, too.

But I feel there is still one remaining problem.

Manage the document with code

As I mentioned above, Airflow makes it easy to add, remove, and modify tasks. Because of this, a pipeline tends to be implemented quickly and updated rapidly.
But from my personal experience, the documents are not updated as fast as the code, especially when they are written in a separate documentation tool. I don’t think this is because most developers are lazy lol. I believe it is because developers are focused on developing software.
So my suggestion is to “manage the document with code”, just like a README.

Is the Python docstring a hero?

Okay, then what is the best practice?
Python already has a documentation feature, the docstring.
It is a good way to explain an API. Is it the same for Airflow?
I feel the explanation required there is quite different.

The main objective of API documentation is to explain how a function or feature works.
So most of the explanation focuses on the input and output interfaces, how to use them, and what kinds of exceptions can occur.

But the explanation of an Airflow task is different. To understand an Airflow task, the explanation should focus more on the ETL part, with questions like:

  • Which data source does this task process?
  • Where is the output put after processing?
  • Does this task overwrite the output, append to it, or save it as another file?
  • Would this dag cause any problems for other tasks if it is re-run casually?

I think these explanations are sometimes redundant for API documentation. So, I suggest using the documentation feature in Airflow.
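
The questions above can be turned into a small template, so every DAG answers them in the same way. The following is a minimal sketch, not an Airflow API: `build_doc_md` and its parameter names are hypothetical helpers I'm assuming for illustration, and the source and destination values are made up. It only builds a Markdown string that you could then assign to `doc_md`.

```python
def build_doc_md(source, destination, write_mode, safe_to_rerun, rerun_note=""):
    """Build a Markdown note that answers the four ETL questions.

    All names here are hypothetical; this is not part of Airflow itself.
    """
    rerun_answer = "**Yes.**" if safe_to_rerun else "**No.**"
    lines = [
        "# Which data source does this process read?",
        source,
        "",
        "# Where is the output put?",
        destination,
        "",
        "# Does it overwrite, append, or save as another file?",
        write_mode,
        "",
        "# Is it safe to re-run casually?",
        f"{rerun_answer} {rerun_note}".strip(),
    ]
    return "\n".join(lines)


doc = build_doc_md(
    source="s3://example-bucket/raw/events/",  # made-up location
    destination="warehouse.events_daily",  # made-up table
    write_mode="Overwrites the partition for the execution date.",
    safe_to_rerun=True,
    rerun_note="Re-running replaces the same partition.",
)
print(doc)
# In a DAG file you would then assign it: dag.doc_md = doc
```

Keeping the questions in one helper means a reviewer can see at a glance when one of them was left unanswered.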

Documentation with Airflow

Airflow already has a documentation feature, and you can easily add an explanation to a DAG and its tasks.
I’ll show a brief example here, but for details please take a look at the reference.

from airflow import DAG

dag = DAG(
    dag_id='example_dag_for_docmd',
    default_args=args,  # `args` (the default arguments dict) is assumed to be defined elsewhere
)
dag.doc_md = """
# What will this dag do?
This dag shows how to use doc_md. You can write a note with markdown!

# Will this dag cause any problems if I re-run it casually?
**No.**<br>
Please feel free to re-run this.

## Do you want to share any details?
There are a lot of formats which this feature supports.<br>
For instance, you can leave a message in json format (I'm not sure if that would help you understand :p).<br>
<br>
And don't forget to add a `<br>` tag to add a new line.
"""

If the dag is no longer in use, you can simply update this doc_md part with the date it stopped working. Of course, when the DAG or its tasks are updated, the developer can easily update the document at the same time, and if the developer updates only the code and forgets the document, the reviewer can request it in the pull request.
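
If your team wants to make that review step harder to forget, the check can be automated. Below is a minimal sketch under my own assumptions: `find_undocumented` is a hypothetical helper, not an Airflow function, and it only relies on each object having `dag_id` and `doc_md` attributes, as Airflow's `DAG` objects do. The stand-in objects are made up so the example runs without Airflow.

```python
from types import SimpleNamespace


def find_undocumented(dags):
    """Return the dag_id of every DAG whose doc_md is missing or blank."""
    missing = []
    for dag in dags:
        doc = getattr(dag, "doc_md", None)
        if not doc or not doc.strip():
            missing.append(dag.dag_id)
    return missing


# Stand-ins for real DAG objects (made up for this example).
dags = [
    SimpleNamespace(dag_id="example_dag_for_docmd", doc_md="# What will this dag do?"),
    SimpleNamespace(dag_id="dag_without_docs", doc_md=None),
    SimpleNamespace(dag_id="dag_with_blank_docs", doc_md="   "),
]

print(find_undocumented(dags))
```

In a real project you could, for example, call this from a unit test on the DAGs loaded by Airflow's `DagBag` and fail the build when the list is not empty.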

But the benefit is not only that you can update the document together with the code.
You can also read the document in the Airflow web UI, like below.

An example of an explanation written with doc_md.

Isn’t it nice?

Summary

Let’s dive into using Airflow’s documentation feature.
If your data engineering team already has documents for your pipeline and manages them in a different tool, let’s move them to Airflow’s documentation feature.
If your data engineering team doesn’t have documents for your pipeline, then let’s start writing them.
