Airflow By Example (II)
Second Twitter thread that follows a simple example to learn Apache Airflow.
So, let's talk a bit more about watching TV using Apache Airflow (which is not a very good idea legally, but has good potential technically!)
— Roberto Alsina (@ralsina) February 18, 2020
This is the first thread: https://t.co/KG8VgqkQTO
When we last looked, we had two Airflow tasks looking for the latest episodes of two series, and one task that was using xcom_pull to get that information and find torrents.
Well, that was not really working well.
Why? Because of this line in the downstream task. pic.twitter.com/vUKLcw584O
So, xcom_pull "pulls" data that was "pushed" by the upstream tasks. But ... we are pulling once. If two tasks pushed, then we will only pull the latest value, and search for a single series' episode.
So, we have to do better.
Turns out xcom_pull can take a "task_ids" argument. If it's a list of task ids, it will pull the value from EVERY ONE of those tasks.
Luckily, operators know the ids of their upstream (and downstream) tasks, so this is easy to do.
So, I can get all that data, iterate, and process each one as before. pic.twitter.com/VeBBIjYOqA
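A minimal sketch of that idea, in case the screenshot is not at hand. This is not the code from the thread: the function name `collect_episodes` is made up, and it assumes the callable receives the task context as keyword arguments (in Airflow 1.10 that means passing `provide_context=True` to the PythonOperator). Depending on the Airflow version, a task that pushed nothing may show up as `None`, so the sketch skips those.

```python
def collect_episodes(**context):
    """Pull the XCom value pushed by every upstream task.

    Hypothetical callable for a PythonOperator; assumes Airflow passes
    the task context as keyword arguments.
    """
    task = context["task"]  # the operator that is currently running
    ti = context["ti"]      # the current TaskInstance

    # Operators know their upstream task ids; sort them for a stable order.
    upstream_ids = sorted(task.upstream_task_ids)

    # Given a list of ids, xcom_pull returns the values from those tasks.
    values = ti.xcom_pull(task_ids=upstream_ids)

    # Skip tasks that did not push anything.
    return [value for value in values if value is not None]
```

From there, the downstream task can loop over the returned list and search for each episode the same way it did for a single one.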
Another interesting bit is that there is no need to explicitly declare each "search for latest episode of series X" task in the DAG. That's what we have config files and loops for (notice the use of the dag as context manager) pic.twitter.com/dEsieHIH8e
So, now we have a WORKING thing that, given a list of series names, ends up with a list of torrents and series metadata.
The next logical step is to download those things somehow.
One way to do this is:
1) Start transmission-daemon
2) Configure it however you want
3) Use transmission-remote to add torrents to it
Since transmission-remote is just a command, we can use PythonOperator's more rustic brother, BashOperator.
Basically, BashOperator runs commands.
Here is what works (will explain) pic.twitter.com/iAmFDiAoEK
DAFUQ, you may say. And indeed.
But basically: BashOperator is **templated**. That string there? It's run through Jinja with the context (remember the context?) and thus things are replaced.
So it ends up being something like this (but much longer) because it's extracting the torrents from the context. pic.twitter.com/oZBMU3nDrs
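To make the templating less magic, here is a rough sketch that renders a command string with Jinja directly, outside Airflow. It is an illustration, not the template from the screenshot: the task id `find_torrents`, the `FakeTaskInstance` class and the magnet links are all made up, and it assumes the upstream task pushed a plain list of torrent links. The only real bits are that Airflow exposes the task instance as `ti` in the template context and that `transmission-remote -a` adds a torrent.

```python
from jinja2 import Template

# Roughly the kind of string one could pass as bash_command.
# 'find_torrents' is a hypothetical task id.
BASH_COMMAND = """
{% for torrent in ti.xcom_pull(task_ids='find_torrents') %}
transmission-remote -a '{{ torrent }}'
{% endfor %}
"""


class FakeTaskInstance:
    """Stand-in for the `ti` object Airflow puts in the template context."""

    def xcom_pull(self, task_ids):
        # Placeholder data, not real torrents.
        return ["magnet:?xt=urn:btih:aaaa", "magnet:?xt=urn:btih:bbbb"]


def render_commands(template_text, ti):
    """Render the template and return the non-empty command lines."""
    rendered = Template(template_text).render(ti=ti)
    return [line for line in rendered.splitlines() if line.strip()]


commands = render_commands(BASH_COMMAND, FakeTaskInstance())
for command in commands:
    print(command)
```

With the real BashOperator you would only write the string; Airflow does the rendering and then hands the result to bash.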
So, our DAG now looks like this. pic.twitter.com/hXWbkpbKlx
And what happens when you trigger that DAG? pic.twitter.com/kahGmSVTpj
Coming in part 3:
* So, what can you do with that?
* What value is Airflow adding?