In today’s post we are going to extract CoT (Commitments of Traders) reports from the CFTC website using a pipeline built on Apache Airflow.
What is CoT data?
The CoT report is a weekly publication that reports the open positions of market participants in the U.S. futures market. It’s published every Friday at 3:30 p.m. ET, but the positions it reports are compiled as of the Tuesday of the same week. See here for more information on it.
Getting Data From “Everywhere” to “here”
I’m sure you heard of the term data is everywhere. It definitely is. However, you often want it “here” where you can analyze it, not out there.
There are many ways of getting data from everywhere to here. The path that time- and resource-constrained traders should choose is the one that offers the most return on their effort. For example, chances are you are not going to see any productivity gain from coding a bespoke system that fits your “needs”. Simply using the tools that already exist will be more than enough.
A python library that fits the bill for what we intend to do is Apache Airflow.
What is Apache Airflow?
Apache Airflow is an open-source Python library that is used to build pipelines.
What is a pipeline?
It’s a series of tasks that need to be executed in their respective order. One of the most compelling reasons to go with Airflow is its out-of-the-box scalability, modular approach, neat UI, and logging features. You can get up to speed with Airflow concepts here.
Airflow is meant to be run on distributed cloud systems like Kubernetes. Google offers a hosted Airflow service called Cloud Composer. I recommend you either run Airflow there or, for simpler tasks like the one we will be doing here, run it on a Virtual Machine. For this post, we are simply going to run it on a Virtual Machine.
Setting up Apache Airflow on a Google Cloud Virtual Machine
First, we need to set up a Virtual Machine. You can do this with any cloud provider, but the one we are going to use is Google Cloud.
Click here to set up a new GCP project.
After you click Create you should have your new GCP project ready to go.
Next, create a VM Instance by clicking on the VM Instances tab to your left and then clicking on Create.
For this project, I’m choosing Ubuntu 18.04 LTS, and I have allowed HTTP and HTTPS traffic to the VM:
One of Airflow’s coolest features is its webserver. Using it allows us to connect to the Airflow instance and actually see our data pipeline being executed. To use it, we need to allow firewall access to port 8080.
Click on View network details.
Here you add TCP port 8080 under Specified protocols and ports and click Save.
Now we are going to connect to our machine by clicking on the SSH button.
One of the first commands you should run whenever you log on to a fresh Linux machine is:
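On Ubuntu 18.04 that usually means refreshing the package index and upgrading what’s already installed – roughly:

```bash
# refresh the package list and upgrade installed packages
sudo apt update && sudo apt upgrade -y
```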
After successfully running that we need to install additional python3 packages:
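The exact package list can vary, but at a minimum you’ll want pip and the Python development headers – something like:

```bash
# pip plus common build dependencies for Python packages
sudo apt install -y python3-pip python3-dev build-essential
```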
Now we are ready to install Airflow on our machine. To do that, simply run:
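A minimal install via pip looks like this (you may want to pin a specific Airflow version):

```bash
pip3 install apache-airflow
```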
After successfully installing Apache Airflow, you need to initialize the Airflow database and export the PATH variable for Airflow:
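Assuming pip placed the airflow executable under ~/.local/bin (the default for a user install), that boils down to:

```bash
# make the airflow CLI visible in this shell (add it to ~/.bashrc to persist)
export PATH=$PATH:~/.local/bin

# create Airflow's home directory (~/airflow) and its metadata database
# (Airflow 1.10.x syntax; on Airflow 2.x this is `airflow db init`)
airflow initdb
```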
That was a slightly tedious setup process, but if you have followed everything it should work as intended. If not, the great thing about Airflow is its big community of contributors: usually, if you have a problem, someone else has an answer. This is another thing to consider when choosing a system.
Next up, we are going to run Airflow for the first time. There are two stages to this. First, we need to start up the webserver, which gives us a nice UI to interact with. Then we need to start up the Airflow scheduler, which, as the name suggests, schedules tasks. Learn more about the Airflow scheduler here.
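Starting both in the background looks roughly like this:

```bash
# start the web UI on port 8080 and the scheduler, each as a daemon
airflow webserver -p 8080 -D
airflow scheduler -D
```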
Now that we have the system up and running, we can visit it by going to the external IP of our VM with port 8080 appended, like this: &lt;external VM IP&gt;:8080
If you have issues getting the webserver running, try changing the `max_threads` parameter to 1 in the `airflow.cfg` file in your home directory.

The image above is the Airflow UI – it’s a neat way of viewing and managing your workflows.
Imagine you are pulling tick data from your broker along with sentiment data from another website. It gets out of hand pretty quickly if you use bash scripts and cron jobs. The Airflow UI offers a lot of creature comforts that you don’t always find in data engineering.
Python Code to Pull Data From the CFTC Website
Okay, we’ve set up Airflow. Let’s leave it aside for a while and figure out the actual code that will pull the data from the CFTC website.
If you can write your task in Python, you can write it in Apache Airflow.
Let’s check out what data we want to download from CFTC. If you visit their page you’ll see a wide range of data options. Here, we’ll focus on downloading Traders in Financial Futures (Futures Only Reports). The process is easily repeatable for other data sets.
The data is available in zipped format.
If we right-click on one of the Text links we can copy the link to the file. We’ll then use that in our Python code to programmatically download the file.
Here is the code for downloading all the files from 2010-2020
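A minimal sketch of that downloader is below. The URL pattern is an assumption – verify it against the link you copied from the CFTC page – and the combined output is written to `finfut.csv`, the file name used later in this post:

```python
import io
import zipfile

import pandas as pd
import requests

# Assumed URL pattern for the yearly "Traders in Financial Futures - Futures Only"
# archives; verify it against the link copied from the CFTC page.
BASE_URL = "https://www.cftc.gov/files/dea/history/fut_fin_txt_{year}.zip"


def download_extract_zip(years=range(2010, 2021), out_file="finfut.csv"):
    """Download the zipped yearly reports, extract them and save one combined CSV."""
    frames = []
    for year in years:
        response = requests.get(BASE_URL.format(year=year))
        response.raise_for_status()
        with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
            # each archive holds a single comma-delimited text file
            with archive.open(archive.namelist()[0]) as report:
                frames.append(pd.read_csv(report, low_memory=False))
    pd.concat(frames, ignore_index=True).to_csv(out_file, index=False)


if __name__ == "__main__":
    download_extract_zip()
```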
Now we have a script that downloads all the data from 2010 to 2020 and saves it in our local directory as a csv file.
Creating an Apache Airflow DAG
We don’t need to always download everything. Going forward, we only need to download new data every Friday as it gets published. This is where Airflow can help us.
We’re going to use Airflow to schedule `download_extract_zip` to run every Friday and download the latest 2020 data. With that in mind, it’s now time to create our first DAG.
Wait, what’s a DAG?
A dag is a lock of wool matted with dung hanging from the hindquarters of a sheep.
It’s also an acronym for Directed Acyclic Graph, which is basically a collection of all the tasks that you want to run organized in order.
You can structure DAGs so that the tasks inside them have specific dependencies between one another. You could also run a task conditional on the success/failure of the previous one. Really the sky is the limit. All we need for now is a simple task that downloads the CoT data every Friday:
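Here’s a sketch of what that DAG could look like. The module name `cot_downloader` is a placeholder for wherever you saved the downloader script, and the task id and schedule mirror what’s described in the next paragraph:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# placeholder import: point this at wherever you saved the downloader script
from cot_downloader import download_extract_zip

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),
    "retries": 1,
}

with DAG(
    dag_id="cot_data",
    default_args=default_args,
    schedule_interval="0 21 * * 5",  # every Friday at 21:00 UTC
    catchup=False,
) as dag:

    cot_download = PythonOperator(
        task_id="cot-download",
        python_callable=download_extract_zip,
        op_kwargs={"years": [2020]},  # only fetch the current year's file
    )
```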
As you can see, there aren’t many differences between the Airflow DAG and the original batch downloader. The Airflow DAG object has a task called `cot-download` which calls the `download_extract_zip` function each Friday at 21:00 UTC (Airflow works in UTC).
Now we need to get this code inside the Airflow dags directory. To do that, SSH to your machine and type `cd airflow` to change your working directory to the Airflow one.

Then create a directory called `dags`, which is where all your DAG files will live.
Finally, we need to upload the Python file from our local machine to the Google VM. You do that by clicking the settings icon at the top right of the SSH window and uploading the file; it lands in your home directory.
Now you can move it from the home directory to the dags folder we just created.
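Assuming the uploaded file is called `cot_dag.py` (a placeholder name), the whole sequence looks like this:

```bash
cd ~/airflow            # Airflow's home directory
mkdir -p dags           # the folder Airflow scans for DAG definitions
mv ~/cot_dag.py dags/   # move the uploaded DAG file into place
```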
If everything went well, the DAG should pop up in our Airflow UI.

You can either let the DAG trigger by itself in due time, or you can trigger it manually by pressing the play button under the Links tab.
Airflow will handle all the logging and error exceptions. In case something goes wrong the logs will be easily available through the UI for you to investigate.
The final output of this DAG should be a file in your home directory called `finfut.csv`. This file will be updated every Friday at 21:00 UTC.

Congratulations – you created and ran your first DAG!
What’s Next?!
Hopefully, you found this tutorial useful.
How might you extend this work?
- Maybe you could download the other datasets on the website?
- How would you handle federal holidays?
- Do you think we should have a notification service if the task fails?
- Having data on a drive on a VM isn’t ideal. Could you push it to a storage bucket, a cloud database?
All this is possible with Apache Airflow. It’s a tool that really starts shining when your pipeline gets bigger.
Happy Data Scraping! #stayhome