Why you should NOT use Service Mesh

Abdellfetah SGHIOUAR
Google Cloud - Community
7 min read · Jan 10, 2022


Source: https://i.ytimg.com/vi/QiXK0B9FhO0/maxresdefault.jpg

Notes

  • In this article, I’m assuming your implementation is Kubernetes-based; you might notice I reference sidecars, clusters, or pods. While a Service Mesh can theoretically extend to VMs, these tools are primarily built for Kubernetes.
  • In this article I’m making a lot of assumptions and generalizations, and that is on purpose. A particular Service Mesh tool might not match some of the generalizations I’m making, and that’s fine. The purpose of this write-up is to be thought-provoking and to drive conversations.

Service Mesh is becoming more and more a de facto part of cloud-based architecture. It has a lot of benefits and, if used correctly, can unlock a lot of features and allow you and your team to focus on higher-value work.

In this article, I will NOT explain what a Service Mesh is; for that you can refer to this excellent write-up. In this piece I’m going to highlight some cases where adopting a Service Mesh is less beneficial. Think of this as a distilled version of 4 years of experience helping companies adopt Service Mesh (mainly using Istio): what I learned, when I saw projects fail or become more complex than they needed to be because we chose to use a Mesh, and what you should think carefully about before making such a decision.

Unfortunately, adopting a Service Mesh is a decision you have to make at the beginning of a project. Some people might argue it’s not, citing for example the Istio migration guide from a non-mTLS to an mTLS installation. That guide shows how you can deploy the control plane and migrate your workloads/namespaces gradually to the Mesh: initially you set the mTLS mode to PERMISSIVE, so workloads that haven’t been migrated yet can talk to those that have, and later, when all your namespaces are part of the mesh, you switch the mTLS mode to STRICT (a minimal sketch of this policy follows the list below). In theory this is doable; in practice it’s more complicated than that, for various reasons. To name a few:

  • Capacity planning: a Service Mesh, with its control plane and proxies, consumes resources. You have to factor that in when sizing your infrastructure in terms of compute, storage, and network capacity.
  • Network design: one of the great features of a Service Mesh is that it allows you to span a logical Mesh across multiple clusters in multiple regions, either on the same or on totally separate VPCs. This is a very important design decision you have to make at the start of a project, and using a Mesh could change it drastically.
  • Supporting infrastructure: by this I mean any extra pieces of software you need to deploy to take full advantage of the Mesh tool you are using: monitoring software (e.g. Prometheus and Grafana), tracing software, graph visualization software, etc. These tools have to be deployed and maintained, which increases cost and time and requires qualified people.
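Coming back to the mTLS migration mentioned before the list: in Istio, that PERMISSIVE-to-STRICT flow is driven by a PeerAuthentication policy. Here is a minimal sketch; the namespace name is a hypothetical placeholder.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-apps   # hypothetical namespace mid-migration
spec:
  mtls:
    mode: PERMISSIVE       # accept both plaintext and mTLS during migration;
                           # flip to STRICT once every workload is in the mesh
```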

Unfortunately, in most projects I’ve experienced, people gloss over these factors and take a simplistic Yes or No approach:

  • Do we need Security? Yes
  • Can mTLS ensure this? Yes
  • Can mTLS be a pain to manage? Yes
  • Can a Service Mesh make mTLS easy? Yes

Ok then let’s use a Service Mesh

So in this article, I will highlight the major considerations one should take into account before deciding whether or not to use a Service Mesh.

Are you planning to take full advantage of the Service Mesh software you are using, today or in the future?

Most popular Service Mesh tools do a lot of things. Besides mTLS, which is the most popular feature, they can do traffic shifting and steering, retries with back-offs, and canary and A/B testing, using a rather simple approach. Instead of implementing these things in the application code (especially retries) or using pieces of your infrastructure to perform canary releases or traffic shifting, you can define your policies in a YAML file and send it to the Mesh control plane, which will program the proxies running alongside your workloads.
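As a rough illustration, here is what such a policy can look like in Istio: a VirtualService that retries failed requests and shifts 10% of traffic to a canary. The `reviews` service and its `v1`/`v2` subsets are hypothetical and assume a matching DestinationRule defining those subsets.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews                      # hypothetical in-mesh service
  http:
  - retries:
      attempts: 3                # retry failed requests up to 3 times
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure
    route:
    - destination:
        host: reviews
        subset: v1               # stable version gets 90% of traffic
      weight: 90
    - destination:
        host: reviews
        subset: v2               # canary gets 10%
      weight: 10
```

None of this lives in application code; changing the rollout is a YAML edit, not a redeploy.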

This is a powerful concept that everyone (and I mean everyone) in your organization should be aware of. Otherwise, you risk your developers implementing a feature in code that the Service Mesh can already do declaratively. So spread the word about which tool you are going to use and what capabilities it has. In theory this should not be an issue if you are working in a DevOps-first culture where teams own an application and the pieces of infrastructure the app runs on. In traditional setups (Dev vs. Ops), you have to make sure everyone is aware of what you are using as a Service Mesh tool.

Of course, this doesn’t mean you have to make the decision on day 1 based on unknown requirements that might change in the future. But be aware of what your Service Mesh is capable of and factor that into your design.

Do people in your org have the theoretical knowledge of, and practical experience with, a Service Mesh?

This is something I came across many times: people who are familiar with neither Kubernetes nor what a Service Mesh does and doesn’t do, which creates confusion, wrong expectations, and delays, and negatively impacts your project. This is usually exacerbated during outages or other issues you might encounter. One cannot debug something one doesn’t understand.

Here is a simple example of something I found myself having to explain to people, which to me was intuitive (that is probably my PTSD brain that pushes me to understand the nitty-gritty details of everything). In a Service Mesh environment, when your pods talk to each other, there is no load balancer in the middle! One might ask: why, then, do you have to create Services of type ClusterIP inside the cluster to make two micro-services talk to each other? Isn’t that a kind of load balancer? The answer is service discovery. A tool like Istio uses Kubernetes Services for service discovery, but since each pod has a little proxy running alongside it (called a sidecar), each pod in the Mesh is aware of all the other pods’ IPs. So when two pods talk to each other, your application uses the Service you created to find the server it wants to reach, but traffic going out of your application container is intercepted by the sidecar proxy, policies are applied to it, and the traffic is sent directly to the receiving pod’s IP. The load balancing happens client-side in the sidecar; there is no ClusterIP or kube-proxy hop in between.
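If you want to see where that client-side behavior is configured, here is a minimal, hypothetical Istio DestinationRule: it tells each sidecar which algorithm to use when it picks a destination pod IP on its own. The `reviews` service name is a placeholder.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews             # the Kubernetes Service used for discovery
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN   # the sidecar itself rotates across pod IPs
```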

If you didn’t understand anything in the last paragraph, then that’s exactly the problem I’m describing here. You can read this article or this book. Or simply Google “How MESH NAME works”.

So, in conclusion, you have to make sure the people responsible for the Service Mesh understand it inside and out.

Increased technical debt

Using a Service Mesh in production is far more complex than the simple hello-world example you find in the getting-started guide. There are a lot of things you have to put in place before you go live. To name a few:

  • Automation: how is your Service Mesh going to be deployed and used? Most projects I worked on ended up with some sort of CI/CD pipeline here to deploy the Mesh and the manifests (YAML files)…
  • Monitoring and Tracing: this is an obvious one. Most Mesh tools give you a lot of metrics out of the box directly from the sidecar, but most of them expect you to have the supporting infrastructure to capture, store, and visualize those metrics. This will probably be some sort of SaaS tool (like Datadog) or a self-hosted set of OSS tools (Prometheus, Grafana, Alertmanager…); see the sketch after this list.
  • Debugging and troubleshooting: you have to get familiar with the debugging and troubleshooting capabilities of your Mesh tool, and maybe write some playbooks for your DevOps/SRE teams telling them what to do when an alert is raised.
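To make the monitoring point concrete, here is a minimal, hypothetical Prometheus scrape job for Istio sidecars, modeled on the pattern in Istio’s docs. It assumes a self-hosted Prometheus with Kubernetes service discovery; the Envoy sidecar exposes its metrics on a container port whose name ends in `-envoy-prom`.

```yaml
scrape_configs:
- job_name: envoy-stats
  metrics_path: /stats/prometheus     # Envoy's Prometheus endpoint
  kubernetes_sd_configs:
  - role: pod                         # discover every pod in the cluster
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: keep                      # scrape only the sidecar metric ports
    regex: '.*-envoy-prom'
```

Even this small snippet implies a Prometheus server somebody has to size, run, and upgrade — which is exactly the point of this section.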

What I'm trying to say here is that adopting a Service Mesh requires more work than one might think: more infrastructure and tools to set up and maintain, which can increase your technical debt.

Your Service Mesh is not compatible with your application

This one is easy to understand: you have to make sure that whatever you are trying to deploy on the Service Mesh will actually work and will not cause any issues.

This is usually not a problem for applications you build yourself, but I have seen cases where it was an issue. Here is an example:

  • Argo Workflows is an OSS tool used to orchestrate data pipelines. It’s event-driven and supports a bunch of triggers: when an event happens, the tool creates a pod and runs your custom code to do data transformation. We tried to deploy Istio on the cluster hosting Argo Workflows and came across a bunch of hurdles:
    - Increased execution time: while the custom code doing the data transformation took only a few seconds to run, adding Istio with its sidecar increased the execution time of the workloads by 5x.
    - Timeouts: this was due to a limitation in Kubernetes where you cannot control the order in which containers start. With Istio, the sidecar has to be up and running before your application container can get network access, so if your app tries to perform network access as part of its startup, there is a risk it will fail and time out. You could implement a retry in your code, but then you are pretty much fixing a non-existent problem (a mesh-side mitigation is sketched right after this list).
    - Increased resource usage: this is a known issue in Istio; the sidecar is pretty hungry and requires a lot of CPU and memory. It’s customizable, but in our case the resource usage increase was 5x. So while the app itself consumed a few mCPUs, with the sidecar we were upwards of half a CPU per pod.
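For what it’s worth, recent Istio versions let you mitigate both of those last two hurdles at the pod level. Below is a minimal, hypothetical sketch using Istio’s injection annotations: `proxy.istio.io/config` with `holdApplicationUntilProxyStarts` makes the app container wait until the proxy is ready, and `sidecar.istio.io/proxyCPU` / `sidecar.istio.io/proxyMemory` shrink the sidecar’s requests. The pod name, image, and resource values are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: transform-step                # hypothetical Argo-created pod
  annotations:
    # Hold the app container until the sidecar proxy is ready,
    # avoiding the startup network timeouts described above.
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
    # Trim the sidecar's resource requests for short-lived jobs.
    sidecar.istio.io/proxyCPU: 50m
    sidecar.istio.io/proxyMemory: 64Mi
spec:
  containers:
  - name: transform
    image: example/transform:latest   # placeholder image
```

The same startup setting can also be enabled mesh-wide via `meshConfig.defaultConfig.holdApplicationUntilProxyStarts`, but none of this changes the bigger lesson: we only found these knobs after things broke.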

You have to do some experimentation to figure out whether the tool or software you want to use is compatible with your Service Mesh.

Conclusion

In conclusion, a Service Mesh is not a must for every Cloud-Native, Kubernetes-based deployment. It does have a lot of benefits and features out of the box, but it comes with its own set of challenges that you have to take into consideration before adopting one.

I hope you enjoyed reading this article and that it got you thinking deeply about this topic. As usual, I’m up for a debate about pretty much anything I write about. Comment below or hit me up on Twitter or LinkedIn.
