My First Honest Review of Amazon Managed Workflows for Apache Airflow (MWAA)
6 min read, last updated on 2021-02-05
Introduction
Recently, I had an opportunity to dive deeper into the newly released AWS service that allows us to provision and use a fully-managed version of Apache Airflow. This is one of the pre-re:Invent 2020 announcements.
Personally, I am excited. Apache Airflow is a state-of-the-art workflow management platform for data analytics. Many data teams combined it with Kubernetes and used that pairing as a data infrastructure design pattern. Because of that, other cloud and SaaS providers already offered Airflow as a managed service. Not to mention that Apache Airflow itself is very pesky to manage and operate reliably.
In that sense, AWS did what they consistently do best: they’ve monetized their operational knowledge by providing a fully-managed service. The selling point is that you get the same Apache Airflow as the open source version. It means that you deal with a fully-managed service that supports well-known plugins and has full compatibility and integration with the AWS portfolio.
As a person who worked with AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions, I am thrilled that we received an alternative that is fully compatible with an open source version - because that removes another point from the list of contraindications related to diving deeper into the cloud.
Is it perfect?
It’s far from that. 😞
Don’t get me wrong: when I tried to guess what AWS would release this year, this was at the top of my list. It’s an important announcement. The service is usable, has a great value proposition, and fixes many issues of unmanaged solutions. But at the same time, it requires a lot of work.
Pros
To reiterate the good things, and emphasize that this service can be used now in certain situations:
- It’s a fully managed version of Apache Airflow, which has a reputation for being pesky to operate.
- It’s just Apache Airflow, and that means 100% compatibility with the open source ecosystem that is already in place.
- It’s well integrated with the AWS ecosystem (e.g., Amazon EMR, AWS Glue, and so on).
Cons
Well, I would like to start with something tough to understand for me: it’s not Apache Airflow 2.0. I totally understand that this version was released a few weeks before the announcement, but at this point, the service started with a significant lag, and that probably will introduce more drag to update environments later.
The second thing (which, on the other hand, is understandable): the integration is fresh, so it has rough edges. For example, integration with AWS Glue Crawlers was added two weeks after release, when the community reported that on GitHub.
Another point: the service at the moment imposes (or imposed) really strange constraints:
- It requires a specific Amazon VPC to operate, with two subnets in two different AZs.
- There is just one AWS KMS key to rule them all, which clashes with strict requirements around security and usage of customer managed CMKs.
  - Why? You have one key for Amazon S3 data (input for the jobs), Amazon SQS queues used by Celery, and … Amazon CloudWatch Logs. That’s, in most restrictive cases, unacceptable.
  - Also, I have to admit that documentation in this place is extremely poor.
- Documentation and GUI state that you need to prefix the Amazon S3 bucket name with `airflow-`, otherwise it won’t work.
  - Aaand it’s gone… that’s not true anymore, but it stays in the default IAM policies and inside docs. 🤦
- Speaking about AWS IAM: default roles are generated with invalid IAM statements around the `s3:ListAllMyBuckets` permission.
- Another limitation, which is strange and not acceptable in strict environments: at the moment, there is no way to connect a custom SSL certificate to the Apache Airflow cluster.
  - You can assign a custom domain, but you will have an issue with the domain not matching between certificate and assignment.
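The subnet constraint, at least, is easy to check before you start provisioning. Here is a minimal sketch of such a pre-flight check. The function name and the `id`/`az` dictionary layout are my own assumptions for illustration, not anything the service defines, and it only covers the "two subnets in two different AZs" rule mentioned above.

```python
from typing import Iterable, List, Mapping


def check_mwaa_subnets(subnets: Iterable[Mapping[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the layout looks fine.

    Each subnet is a hypothetical mapping like {"id": "...", "az": "..."}.
    """
    subnets = list(subnets)
    problems: List[str] = []
    if len(subnets) != 2:
        problems.append(f"expected exactly 2 subnets, got {len(subnets)}")
    # Collect the distinct availability zones the subnets live in.
    zones = {subnet["az"] for subnet in subnets}
    if len(zones) < 2:
        problems.append("subnets must be spread across two different AZs")
    return problems
```

Feeding it the subnet list from your own infrastructure code (or from an EC2 API call) should flag the most obvious layout mistake before the environment creation fails on it.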
Did I say that documentation sucks badly? It needs a lot of improvements - if we could have it on GitHub, it would be much better already, but it’s not there yet. Do you want more examples? Let’s talk about AWS CloudFormation docs (screenshot made on 20.02.2020):
Well, except it does not work (neither `Arn` nor `WebserverUrl`):
ROLLBACK_IN_PROGRESS Attribute 'WebserverUrl' does not exist.
Rollback requested by user.
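One possible way around this is to read the same two values through the API after the stack is created, instead of asking for them inside the template. Below is a minimal sketch, assuming a boto3 `mwaa` client whose `get_environment` call returns an `Environment` structure containing `Arn` and `WebserverUrl`. The helper name and the fake client are mine and exist only so that the example runs without AWS credentials.

```python
from typing import Any, Dict


def environment_outputs(mwaa_client: Any, name: str) -> Dict[str, str]:
    """Fetch the two values the template could not expose via GetAtt."""
    env = mwaa_client.get_environment(Name=name)["Environment"]
    return {"Arn": env["Arn"], "WebserverUrl": env["WebserverUrl"]}


class FakeMwaaClient:
    """Stand-in for boto3.client("mwaa"); returns made-up placeholder values."""

    def get_environment(self, Name: str) -> Dict[str, Any]:
        return {
            "Environment": {
                "Name": Name,
                "Arn": "arn:placeholder",
                "WebserverUrl": "webserver.example.invalid",
            }
        }


outputs = environment_outputs(FakeMwaaClient(), "demo-environment")
```

With real credentials you would pass `boto3.client("mwaa")` instead of the fake client, and feed the result into whatever needs the URL (a CI step, another stack's parameters, and so on).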
Speaking about AWS CloudFormation: there are no examples. None, not to mention the issue from above with `GetAtt` or `Ref`. Plus, if your cluster creation fails, debugging is impossible - there is no way to find out what happened. No logs, no AWS CloudTrail details - and I faced that because of the missing permission on the customer-managed CMK for Amazon CloudWatch Logs (mentioned above).
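For reference, the missing piece in a case like mine is a key policy statement that lets CloudWatch Logs use the CMK. The sketch below builds such a statement as a plain Python dictionary, modelled on the statement AWS documents for encrypting log groups with AWS KMS. Treat it as a starting point under that assumption: the function name and `Sid` are mine, and whether this alone is enough for your environment depends on the rest of your key policy.

```python
import json
from typing import Any, Dict


def cloudwatch_logs_key_statement(region: str, account_id: str) -> Dict[str, Any]:
    """Build a KMS key policy statement for the regional CloudWatch Logs service."""
    return {
        "Sid": "AllowCloudWatchLogsToUseTheKey",
        "Effect": "Allow",
        # CloudWatch Logs uses a region-specific service principal.
        "Principal": {"Service": f"logs.{region}.amazonaws.com"},
        "Action": [
            "kms:Encrypt*",
            "kms:Decrypt*",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:Describe*",
        ],
        "Resource": "*",
        # Restrict usage to log groups that belong to this account and region.
        "Condition": {
            "ArnLike": {
                "kms:EncryptionContext:aws:logs:arn": (
                    f"arn:aws:logs:{region}:{account_id}:*"
                )
            }
        },
    }


# Example with placeholder values; print it to paste into a key policy document.
statement = cloudwatch_logs_key_statement("eu-west-1", "123456789012")
statement_json = json.dumps(statement, indent=2)
```

The statement goes into the `Statement` list of the key policy, next to whatever you already grant to Amazon S3, Amazon SQS, and your administrators.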
The last two points are more generic.
Why, oh why, is this another service that uses Amazon S3 buckets for storing artifacts? I do not get that: there is AWS CodeArtifact, there are better ways to deliver DAGs to Apache Airflow. Yet, as we do for AWS SAM, AWS Lambda, and many other services, we still use generic buckets. I am looking forward to at least AWS CodeCommit or, to be honest, any other `git` repository provider integration. Or AWS CodeArtifact support, anything beyond those pesky buckets where you have to come up with naming conventions (e.g., for versioning) and enforce them via external, custom-made tools/scripts.
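To show what I mean by custom-made scripts, here is a minimal sketch of the kind of helper you end up writing. The convention itself (`<project>/<semantic version>/<file name>`) is purely hypothetical and so is the function name. The point is only that the bucket enforces nothing, so your tooling has to.

```python
import re

# Hypothetical convention: <project>/<major.minor.patch>/<file name>
KEY_PATTERN = re.compile(r"[a-z0-9-]+/\d+\.\d+\.\d+/[A-Za-z0-9_.-]+")


def artifact_key(project: str, version: str, filename: str) -> str:
    """Build an S3 object key and reject anything that breaks the convention."""
    key = f"{project}/{version}/{filename}"
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError(f"key does not follow the naming convention: {key}")
    return key
```

A deployment script would call `artifact_key("etl", "1.4.0", "plugins.zip")` before uploading, and fail the build on anything like `artifact_key("ETL", "latest", "plugins.zip")`.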
The second generic thing is cost. It’s not the cheapest service at the moment - first, it’s fully-managed, but there is no way to scale it down to zero (whereas, e.g., on Astronomer.io you can do it without breaking a sweat). Additionally, metadata storage can be a very high cost for bigger data platforms. On that front, I am actually calm - because AWS is well-known for decreasing costs over time, so there is a great chance it will happen here as well.
Summary
Hopefully, this review will not be just pure whining - but it will help you evaluate and avoid mistakes I made when using this service. After getting over the initial hoops, we are still using this service. From the operational side, it gives a tremendous advantage. Nevertheless, it needs more polish, and I am looking forward to updates, improvements, and innovations provided by the team responsible.
Why? I think this is a service with considerable potential to become one of the most important and fastest-growing services on the AWS platform. 🖖
What are your thoughts about Amazon MWAA? Have you found any other issues that I did not document here? If yes, share them in the comments; I will happily learn about them as well.