I want to start with a disclaimer: this article dates back to 2016, and I refreshed it to reflect the state of the 2020 landscape. What struck me is how few things I had to update.
So back to the story: around that time, I was a part of a project that had to choose:
Should we use AWS CloudFormation or Terraform?
It's a typical question for anybody who deals with infrastructure automation.
I think we can all agree that we are past the discussion about whether to automate infrastructure at all. However, another question is still valid: should we use a dedicated tool (AWS CloudFormation, Azure Resource Manager templates, OpenStack Heat, or Google Cloud Deployment Manager) or a provider-agnostic solution?
At that time, I was shocked that there was no other tool like Terraform on the market (except maybe the less popular Foreman). What is even more surprising, this is still true today - mostly because it’s a great tool, built by amazing people (HashiCorp).
I would like to show my rationale for a new project and infrastructure for which I, among others, was responsible. I present the motivation which drove us towards a pure CloudFormation setup, the reason behind our bet on AWS, and the rationale behind the decision to drop Terraform.
Keep in mind that all remarks and comments pointed out in this article are constructive criticism - it does not change my opinion about companies that created those tools. I have massive respect for both AWS and HashiCorp - the work they have done, especially in the tooling and cloud computing landscape, is outstanding.
As a user of AWS Services and HashiCorp tools, I am grateful for the work they did.
If you do not have experience with CloudFormation or Terraform, please read either the fantastic official documentation or any introductory article first. I will assume basic knowledge of both. If you are looking for a feature-by-feature comparison, I recommend the following article by Andreas Wittig from cloudonaut.io.
Terraform was not a silver bullet for us
When I evaluated Terraform, it was before the 0.7 release. Today we have version 0.12.
According to the definition, it is:
… a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
It's funny how this has not changed at all since 2016.
It is a tool built by practitioners to support the infrastructure as code approach. An orchestration tool focused on visibility (execution plans and resource graphs) and change automation, with minimal human interaction required.
Sounds great, and what is even more important, it is cloud/provider agnostic by design. This is a huge plus, especially when you need to consider mixing public cloud with on-premises or a scenario with multiple cloud providers (e.g., for disaster recovery or due to availability requirements). What is also crucial, it does not focus on cloud APIs only. It brings various 3rd party APIs and cloud provider APIs together in one place, enabling interesting scenarios - e.g., creating infrastructure inside a cloud provider, connecting it with your storage solution on-premises, and combining it with an external DNS provider.
One of its assets is the DSL - HashiCorp Configuration Language (HCL). In my opinion, it is not a revolution, but an evolution in the right direction. Still declarative, but more expressive and, at the same time, more concise than JSON/YAML formats. Terraform is also compatible with the JSON format. I will not dive into details because they are extensively covered in the documentation (but in 0.12, you finally have loops, kinda). One thing worth pointing out - it is not a complete programming language, and we will talk about it soon as well.
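Because Terraform also reads JSON (files ending in `.tf.json`), one way around the limits of a declarative DSL is to generate the configuration from a general-purpose language. Below is a minimal sketch in Python of that idea. The function name and the bucket names are hypothetical, and the script only builds the JSON document - it does not call Terraform.

```python
import json


def render_buckets(names):
    """Build a Terraform JSON configuration declaring one S3 bucket per name."""
    resources = {
        # Resource labels cannot contain dots; underscores are used throughout
        # here purely for consistency.
        name.replace(".", "_").replace("-", "_"): {"bucket": name}
        for name in names
    }
    return {"resource": {"aws_s3_bucket": resources}}


if __name__ == "__main__":
    # Hypothetical bucket names, used only for illustration.
    config = render_buckets(["logs-example", "assets-example"])
    # Saving this output as main.tf.json lets Terraform read it like HCL.
    print(json.dumps(config, indent=2))
```

The trade-off is that the generated file becomes a build artifact, so the generator, not the JSON, is what you review and version.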
So far, so good. But before making a final decision, I wanted to check its state and see how it feels. This gave me a whole new outlook. From my perspective, the tool was not as mature as I would have liked. It was in its infancy - I scanned the GitHub repository just for bugs related to the AWS provider and uncovered a long list. It is not a definitive list (probably many of them were neither confirmed nor triaged/prioritized). But it showed how rapidly the tool evolves. You may think this is no longer true today, and you would be wrong. In essence, there are still missing things (e.g., my favorite one about missing tagging capabilities for AWS Egress-only Internet Gateway 😉) and rapid evolution that severely affects developers (e.g., syntax changes between 0.11 and 0.12, and the whole migration process).
Second thing: I have some horror stories about backward-incompatible changes from version to version. Again, it should be understandable, taking into account that the tool is below 1.0. Still, it was not suitable for our needs back in the day. Nowadays it still has stability issues between versions, and yet many recommend it as a sane default, which I don’t get.
What struck me the most in 2016 was the inner workings. Terraform does not use CloudFormation but calls the AWS API directly, which has several consequences. It is impossible to perform an automatic rollback when something goes wrong. The usual workflow is also slightly different from the CloudFormation one: first we plan our changes, then we review them and decide whether to apply. With CloudFormation, it is possible to review changes as well. However, if everything goes haywire for any reason (and it eventually will - trust me 😱), CloudFormation is able to roll back and return to the previous state.
Both tools are highly opinionated. I think that many people who use this as an argument against Terraform are wrong. Both require and assume an awful lot of things regarding your workflow. Both add impedance mismatch and require effort regarding knowledge exchange and learning.
One thing, though, is apparent. State management in Terraform is a leaky abstraction. You need to take care of it on your own, and I think we all agree that this is a burden rather than a privilege or flexibility. You can handle it with git (🤦‍♂️), local files on a CI/CD server (🤦‍♂️), or an S3 bucket with remote state and DynamoDB table locking. Even though there is documentation and tooling for setting up remote state storage in the recommended way, I still end up fixing misconfigured storage. My favorite error is not enabling versioning for the S3 bucket, which kicks you in the butt just when you need to roll back to a previous state file 🤦‍♂️.
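The versioning mistake is cheap to detect before it hurts. Here is a minimal sketch in Python, assuming `boto3` is installed and credentials are configured. The function names are my own, and the only AWS call used is S3's `get_bucket_versioning`.

```python
def versioning_enabled(response):
    """Interpret a GetBucketVersioning response.

    S3 omits the "Status" key for buckets that never had versioning
    configured, so a missing key counts as disabled.
    """
    return response.get("Status") == "Enabled"


def check_state_bucket(bucket_name):
    """Ask S3 whether the bucket holding Terraform state is versioned."""
    import boto3  # imported lazily so the helper above works without it

    s3 = boto3.client("s3")
    return versioning_enabled(s3.get_bucket_versioning(Bucket=bucket_name))
```

Calling `check_state_bucket("my-terraform-state")` (a hypothetical bucket name) from a CI job before every apply would turn a silent misconfiguration into a failed build.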
AWS CloudFormation is not a hostile environment (and neither a perfect one)
Before I started my experience with CloudFormation, I heard horror stories. But I spent some time with that beast. And guess what? It is not as ugly as everybody claims.
The apparent advantage is better support for AWS services than any 3rd party tool offers. When AWS releases a new service, in most cases it is already supported in CloudFormation, at least partially.
However, some elements either do not make sense for CloudFormation (e.g., registering a DNS name for a machine spawned inside an auto-scaling group) or are not supported yet. Even if you need something custom, you have Custom Resources and AWS Lambda to the rescue.
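To make the Custom Resources idea more concrete, here is a minimal sketch in Python of a Lambda-backed custom resource. It assumes the standard request fields CloudFormation sends (`RequestType`, `ResponseURL`, `StackId`, `RequestId`, `LogicalResourceId`). The DNS work itself is left as a placeholder, and the physical ID `example-dns-record` is hypothetical.

```python
import json
import urllib.request


def build_response(event, status, physical_id, data=None, reason=""):
    """Build the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def handler(event, context):
    """Lambda entry point: react to Create/Update/Delete and report back."""
    request_type = event["RequestType"]
    # Placeholder: a real handler would register or remove a DNS record here.
    physical_id = event.get("PhysicalResourceId", "example-dns-record")
    body = build_response(event, "SUCCESS", physical_id, {"Action": request_type})

    # CloudFormation waits until it receives a PUT on the pre-signed URL.
    payload = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=payload,
        method="PUT",
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request)
```

If the handler never sends that response, the stack operation hangs until it times out, so a production version should report `FAILED` from an exception handler as well.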
According to the documentation, our hero, AWS CloudFormation, is stateless, but that is a tricky concept. It is stateless, except it is not at all. From your perspective, you do not need to bother about state management. But that is not the whole story - inside the service, it preserves a stack of operations invoked inside your cloud (called events), and it connects them with resources. Updates are based on that state.
What about reusability? You have nested stacks (yikes), and another feature called exported outputs - a global set of values shared inside the same region and AWS account. It means that you cannot create the same stack twice from the same definition, because the second one will collide with the already defined outputs. Additionally, exports couple stacks just as tightly as nested stacks do.
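The collision is easier to see in a toy model. The sketch below is plain Python and not an AWS API: it assumes only that export names must be unique per account and region, and the class, stack names, and the prefixing convention at the end are all hypothetical.

```python
class ExportNamespace:
    """Toy model of CloudFormation exports for one account and region."""

    def __init__(self):
        self._owners = {}  # export name -> owning stack

    def create_stack(self, stack_name, export_names):
        """Register a stack's exports, refusing names that are already taken."""
        clashes = [name for name in export_names if name in self._owners]
        if clashes:
            raise ValueError(f"{stack_name}: export already in use: {clashes}")
        for name in export_names:
            self._owners[name] = stack_name


if __name__ == "__main__":
    region = ExportNamespace()
    region.create_stack("network-a", ["VpcId"])
    try:
        # Same template, second stack: the fixed export name collides.
        region.create_stack("network-b", ["VpcId"])
    except ValueError as error:
        print(error)
    # Prefixing exports with the stack name (an assumed convention) avoids it.
    region.create_stack("network-b", ["network-b-VpcId"])
```

The prefixing trick removes the collision, but not the coupling: every consumer still has to know which stack's export it imports.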
What about the learning curve? Well, it turned out that it depends. It is not as hostile an environment as advertised. If you research it, you will see how much people hate it. But it’s not that bad (you can clearly see how many times I abuse this statement, referring to both tools). In 2016 you had only JSON, which was and still is horrible; since then, YAML support has made templates much more readable (sigh, I had to reread this sentence a couple of times because I couldn’t believe I am praising YAML here).
What about vendor lock-in? What about multi-cloud?
Let me start with a digression: it’s funny to observe that nobody from HashiCorp claims that Terraform is a tool that magically allows switching between cloud providers, and yet people keep making that claim based solely on its cloud-agnosticism.
As grownups, we know that just as an ORM does not let you change the database on the fly, Terraform will not let you automatically switch the cloud provider for your entire system. It will not happen for two main reasons. First, Terraform code is very provider-specific, and such a change requires a total rewrite. Second, even if you introduce an abstraction layer for multiple cloud providers, you get the very unpleasant side effect of reducing the service capabilities to a common denominator. And that’s not an effective way to use the cloud.
On the vendor lock-in, there are plenty of great essays and articles, and I will not address them here. TL;DR: Don’t get locked up into avoiding lock-in.
I think the Infrastructure as Code space is in a state of Stockholm Syndrome. We are so used to specific issues and deficiencies that we have accepted them. That is why I think it is hard to expect that those tools will solve all kinds of problems. Reality is more complicated than models, and therefore silver bullets rarely exist.
I think that the decision between those two tools is often driven by myths or folklore. That’s why I explained my rationale for sticking to AWS CloudFormation in most cases, although I have worked with Terraform, and it has many merits too.
I hope this article will help everybody looking for a deeper understanding of such a choice. If you have any questions or comments, feel free to share - I would love to hear your justification and reasoning behind such a decision.