How we 30x’d our Node parallelism

January 23, 2020

Table of Contents

What’s the best way to safely increase parallelism in a production Node service? That’s a question my team needed to answer a couple of months ago. We were running 4,000 Node containers (or ‘workers’) for our bank integration service.

The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

Most requests were network-bound, so we could improve our capacity and costs if we could just figure out how to increase parallelism safely. In our research, we couldn’t find a good playbook for going from ‘no parallelism’ to ‘lots of parallelism’ in a Node service. So we put together our own plan, which relied on careful planning, good tooling and observability, and a healthy dose of debugging.

In the end, we were able to 30x our parallelism, which equated to a cost savings of about $300k annually. This post will outline how we increased the performance and efficiency of our Node workers and describe the lessons that we learned in the process.

Source: plaid.com

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber’s software architectures consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex configurations that affect these products at city and sub-city levels. To maintain our growth and architecture, Uber’s Observability team built a robust, scalable metrics and alerting pipeline responsible for detecting, mitigating, and notifying engineers of issues with their services as soon as they occur.

Making the LinkedIn experimentation engine 20x faster

At LinkedIn, we like to say that experimentation is in our blood because no production release at the company happens without experimentation; by “experimentation,” we typically mean “A/B testing.” The company relies on employees to make decisions by analyzing data. Experimentation is a data-driven foundation of the decision-making process, which helps with measuring the precise impact of every change and release, and evaluating whether expectations meet reality.

The Biggest IT Failures of 2018

This year provedonce againthat IT-related failures “are universally unprejudiced: they happen in every country; to large companies and small; in commercial, nonprofit, and governmental organizations; and without regard to status or reputation.” Below is a review that just scratches the surface of the sundry failures, glitches, and other IT hiccups that made the news in 2018. This year saw a slight reduction in the number of flight cancellations and delays due to computer-related problems as compared with the past three years, especially in the United States.

How we 30x’d our Node parallelism

Tags :

Share :

Related Posts

Observability at Scale: Building Uber’s Alerting Ecosystem

Making the LinkedIn experimentation engine 20x faster

The Biggest IT Failures of 2018