Inside Uber’s Engineering Struggles

May 13 was one of Uber’s darkest days.

The computer system used by its business-operations employees ground to a halt, for the whole day. The cause: a bug in one of Uber’s databases, which set off a chain reaction that took down other systems and was made worse by human errors in response to the cascading problems, according to internal emails reviewed by The Information.

Uber’s chief technology officer, Thuan Pham, later wrote to his staff that the mistake “reflects an amateurism with our overall engineering organization, its culture, its processes, and its operation.”

The Takeaway
It may not be obvious to the outside world, but Uber’s technical infrastructure has been hanging by a thread for years. Internal emails and interviews with people who’ve worked on that system show that as CEO Travis Kalanick and operations executives pressed engineers to build new features for the business, the push came at the expense of stability in the back-end systems that power them. Now it’s up to the company’s inspirational CTO, Thuan Pham, to professionalize a fast-growing engineering team that’s been held back by what he calls “amateurism” under his watch.

The “massive outage,” which stretched to the following day but didn’t affect users of the ride-sharing app, was symptomatic of an engineering organization that has long struggled to keep Uber’s infrastructure together as the business has grown. Uber is driven by its operations groups, which include the people who oversee each of the cities where Uber operates, as well as the people who monitor and analyze Uber’s performance in real time, handle customer service for riders and deal with the needs of drivers. Their needs for new functions, such as support for certain currencies or new tools to help drivers, have made it harder for the engineering group to focus on revamping and fortifying the systems that power those functions.

Early on in the company’s life, CEO Travis Kalanick pushed engineers to take risks in the interest of furthering the goals of the business, and he believed in the common engineering phrase “Move fast and break things,” say several people who've worked with him. In other words, new functionality trumped reliability.

Mission: More Nines

So far, Uber has managed to overcome its technical challenges, attaining a $50 billion private valuation as it has grown to dominate ride-sharing in many places around the world. It now operates in more than 300 cities, handling several million rides a day. But its focus on features over server stability could threaten its ability to expand further, including into new businesses like food delivery and new markets like China. There it has to operate a duplicate of the systems it built for the rest of the world, so that the government can have access to data on Chinese customers. From a technical perspective, the company needs to remake itself in order to avoid disasters that will hurt its revenue, brand and partners.

That challenge rests on the shoulders of Mr. Pham, a 47-year-old Vietnamese immigrant who worked in ad tech and at VMware before joining Uber in 2013.

In an interview with The Information, Mr. Pham says the company’s goal is to get to 99.99% reliability, or “four nines,” in the next 12 months. That means about four minutes of down-time per month, or 50 minutes a year.
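The arithmetic behind a “nines” target is simple enough to check directly. The sketch below is illustrative only: the function name is made up for this example, and it assumes a 365-day year split into 12 equal months.

```python
# Downtime allowed by an availability target (illustrative arithmetic only).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> tuple[float, float]:
    """Return (minutes per year, minutes per average month) of allowed downtime."""
    per_year = (1.0 - availability) * MINUTES_PER_YEAR
    return per_year, per_year / 12


for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    per_year, per_month = downtime_budget_minutes(target)
    print(f"{label}: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

For four nines this works out to roughly 52.6 minutes a year, or about 4.4 minutes a month, which matches the rounded figures above.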

Uber is a “utility,” says Ganesh Srinivasan, who reports to Mr. Pham. “We have to provide a highly reliable service. But it’s extremely hard.” It doesn’t help that Uber is growing so fast that every three months, what was peak traffic becomes its average traffic.

Unanimously described by colleagues as straight-talking, serious and humble, the bespectacled Mr. Pham has had to take personal responsibility for the engineering group’s failings, including the one from four months ago, in front of the whole company. In an email to the company, Mr. Pham said then that he had been “deathly afraid” that the consumer-facing app would also go down for a prolonged period of time. (The disruption hurt internal systems but didn’t affect riders.) “It is simply unacceptable for us to make this type of mistake,” he said.

Even as he constantly plays defense, he’s invested in a better technical foundation. Since he joined, Uber’s engineering staff has grown from 40 to about 1,200, or one quarter of the company’s workforce.

Risk-taker, Firefighter

He’s taken bold bets that sometimes backfire. For instance, unhappy with its existing Web hosting provider, the engineering group set up servers in a new data center last year, hours before Halloween, the second-busiest night of the year for the Uber app. (New Year’s Eve is No. 1.) When trick-or-treat traffic was dumped on the new servers, many different systems failed, causing widespread outages and a “tense night” of fire-fighting at Uber HQ, say people who are familiar with the incident. The company had to resort to the old Web hosting provider.

He and others say that without taking such a risk, Uber would have been woefully underprepared for when growth spiked several months later, on New Year’s Eve.

“We failed early; we learned fast, and as a result, New Year’s Eve was flawless, with around 1.7 million trips,” or nearly double the volume of Halloween, Mr. Pham says.

Amid outages that required significant work overnight and his personal supervision, Mr. Pham has been caught sleeping in the office a couple of times, in one of the booths that have small beds behind the company’s engineering “war room” or on a bean bag chair in a conference room. In general, though, on weekdays he’s up at 6 a.m., drops his only child off at school and takes a two-hour trip north from his home in east San Jose to Uber in downtown San Francisco by train. Then he walks 20 minutes to his office on Market Street. He typically stays until 6 p.m. or 7 p.m. before heading home, and he often works after dinner. Because of his long daily walks, he almost always wears sneakers.

Mr. Pham says he’s “very introverted,” but he’s also known to carry around a DSLR camera at office parties and offsite meetings and snap numerous pictures of colleagues, along with their dates.

From Refugee to MIT

In a video that’s available to Uber employees, Mr. Pham recounts his inspirational life story. He immigrated to the U.S. in 1980, when he was 12. In May 1979, several years after the Vietnam War ended, his mother took him and his brother out of the country in the hopes of building a better life for them, while his father, a former officer in the South Vietnamese army, stayed behind because they couldn’t afford for the whole family to leave. Mr. Pham didn’t see his father again for a decade.

The family spent almost a year bouncing between countries in Southeast Asia, including at refugee camps in Indonesia and Malaysia, before being allowed into the U.S. Mr. Pham grew up in Rockville, Maryland, near Washington, with his mother, brother and another immigrant family inside a crummy two-bedroom apartment in a “bad part of town.” His mother, who didn’t know English at the time, had two jobs, including bookkeeping at a gas station and bagging groceries. She was earning minimum wage, Mr. Pham says. A friend of his at middle school had an IBM computer and Mr. Pham became acquainted with programming. “I like things that are orderly,” he says.

He volunteered at his church, where he got to know a congregant who was a director at the National Bureau of Standards, an arm of the federal government. He then volunteered for that bureau and reprogrammed its back-end systems.

“I always think of myself as an underdog, having come from nowhere,” Mr. Pham says. His mom “gave up everything, her whole life,” so Mr. Pham felt he “had to make something of myself.”

His government work and straight A’s in school helped him get accepted to the Massachusetts Institute of Technology. After getting his bachelor’s and master’s degrees in electrical engineering and computer science, Mr. Pham moved west to work for an R&D arm of Hewlett Packard. Later, he joined NetGravity, where he helped develop advertising technology in the mid- to late-1990s. He stayed for three more years after the firm sold to a competitor, DoubleClick.

John Danner, the founder of NetGravity, says in an interview that it was a mistake not to promote Mr. Pham to be vice president of engineering at the startup, where at one point he managed around 20 engineers. “He’s a natural manager,” Mr. Danner says.

Like many engineers, Mr. Pham is a tinkerer and has built things like his own barbecue at home. He’s also rabidly curious. After Mr. Danner’s wife, then a U.S. Supreme Court clerk, let Mr. Pham sit through a public oral argument there, he kept researching the obscure case, which was related to gambling in Mississippi, in order to understand how the court would weigh its decision.

After a stint at a computer security startup, Mr. Pham spent nine years at VMware, which helps companies make more efficient use of computer hardware to run their applications. By the end, he oversaw hundreds of engineers and helped run products including vCenter, which was the interface through which customers interacted with other VMware services.

“It’s where all the money comes in, and it was a high stress job,” says Steve Herrod, a former colleague. “His team was the beaten horse for every single function that has to get out the door; it is a hard thing to do there, and he did it well,” Mr. Herrod says.

Mr. Pham thus seemed prepared for the fast pace of Uber.

The Whole Stack

Bill Gurley, a venture capitalist who sits on Uber’s board of directors, had been chasing Mr. Pham for 12 years after hearing about him from Mr. Danner. Mr. Gurley says he had long wanted to place Mr. Pham in one of his portfolio companies, but he couldn’t find a job that was enticing enough for him, until Uber.

It’s hard to imagine Mr. Pham could have anticipated how tricky the role would be.

From the beginning of Uber, Mr. Kalanick has refused to use a “public” cloud provider like Amazon Web Services to host the app, in contrast with other startups of its day, because he didn’t want to get “locked into” a tech vendor and be dependent on it, according to Mr. Pham. Curtis Chambers, who was Uber’s top engineering manager from 2010 until Mr. Pham joined, says AWS is better suited for “volatile” traffic on an app’s systems, whereas demand on Uber’s systems has been fairly predictable.

The decision not to use a public cloud meant Uber relied on smaller third parties to manage servers for the company, and those firms weren’t always able to handle Uber’s growth. (During some outages, Mr. Kalanick has gotten mad enough to call out the main Uber infrastructure provider.) Under Mr. Pham, the company later hired people to handle the servers and network engineering, including buying and physically handling the machines that power Uber, which are made by firms like Dell and Quanta Computer. Now the company controls an entire technology “stack,” except for actually owning a data center. It spends more than $10 million a year on data center-related costs, estimates one person who is familiar with that unit at Uber.

Mr. Pham says Mr. Kalanick’s bet on owning the whole stack, despite the hardships of gaining data center expertise, was “right” and has led to reduced costs overall.

Get No Respect

The infrastructure that’s needed to run Uber (about 10,000 servers, give or take) is minuscule compared to that of companies like Google and Facebook, which use hundreds of thousands or millions of machines. As a result, engineering leaders at such companies often dismiss Uber’s challenges as small.

But Uber doesn’t cache, or store, much information the way Google does with search results or Facebook does with user-profile information. Uber is more like a massively multiplayer online game (think “World of Warcraft”) but at a bigger scale and with greater complexity because of mapping and other calculations that need to be made instantaneously. And many of the features are for employees or its contractor-drivers, meaning riders never see them. For that reason, Uber engineers, including Mr. Pham, feel like their infrastructure doesn’t get enough respect.

When he arrived at Uber, Mr. Pham says, it was obvious to him that the infrastructure wasn’t prepared for growth. Uber has two main systems: one that runs the dispatching of cars to customers and tracks both of them throughout their trips, using software known as Node.js that’s built into the app; and a back-end system that calculates fares, sends emails to customers, and provides tools to Uber employees to analyze the business and do customer service, all written in code known as Python.

The Node system, according to some engineers, has had issues because it isn’t necessarily designed for large scale combined with heavy data processing. The Python system has had its issues because it was a huge, “monolithic” code base where it was difficult to pinpoint specific causes of technical problems.

Making things worse, the engineers in charge of the two separate systems had long been at odds, in part because of personality clashes and because each team thought software should be written using its preferred language. Mr. Pham initially let the bad blood fester, as it allowed each group to work in isolation and get more done. A side benefit was that if one system had troubles, it didn’t immediately bring down the other, including during the May crisis. But overall, the friction became problematic because the teams ended up working on similar tools for software development that were written in different languages. This also meant there was a duplication of resources, but Mr. Pham was willing to overlook that because there were bigger problems to solve.

He has since tried to resolve that by assigning more engineers to design software development “platforms” that any Uber engineer can use to make new products, no matter which team they are on.

Red Bull and Features

Then there was pressure from the top. Mr. Kalanick wanted his engineers to move quickly despite the risks of pushing faulty software code that would break the system from time to time. Mr. Kalanick didn’t believe in having a separate team for QA, or quality assurance, to make sure code changes didn’t cause new problems. “Travis believed the quality of your code is your responsibility,” says one person who’s worked with him. Mr. Pham says he’s proud that the company doesn’t have any QA engineers, which can often cause friction in an engineering organization and make it harder to get things done.

Mr. Kalanick and global operations chief Ryan Graves often pushed the engineers to work at the business side’s pace, requesting features to support new currencies and languages and vehicle types. And they wanted it done yesterday. Because of such requests, “a lot of things happened outside the scope of the [engineering] roadmap; engineers would go drink Red Bull and work over the weekend to get things done,” says one person who observed the situation.

It was difficult for the engineers to explain to the business executives that adding more engineers to a project wouldn’t necessarily speed it up; new features just take time. As a result, a lot of “bad code” was pushed into production, the person said. And some engineers have gotten “burned out” and needed extended breaks to recover.

For Uber’s engineers, Mr. Pham became a critical first shield against Messrs. Kalanick and Graves and the general managers of specific cities who wanted new features built just for their markets.

But the pressure to serve the business needs has continued, and Mr. Pham has also struggled with getting past the legacy technical problems. The May crisis, coming more than two years after Mr. Pham arrived at Uber, is a prime example.

The Big Crash

As the company was preparing databases for its new system in China, an engineer introduced a bug that caused a “rapid depletion” of space on the company’s “master” database, Mr. Pham wrote in a post-mortem summary, which he shared with the rest of the company. (Databases hold information like riders’ payment credentials.)

The depletion of space on the master database triggered an alert to the engineering team, but the alert came much too late; there were only two hours before the database crashed. An on-call engineer who received the alerts “ignored” them, Mr. Pham said. After the main database crashed, an engineer was supposed to take an uncorrupted database and turn it into the new “master.” That person instead made an error and corrupted the new master database and allowed the rest of the system to replicate that bug across numerous databases “like a cancer that metastasizes quickly throughout the body.” That was the “fatal blow” that caused the prolonged outage, Mr. Pham said.
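The emails don’t describe how Uber’s alert was configured. One common way to buy more warning, sketched below under that caveat, is to alert on the projected time until the disk fills rather than on a fixed amount of free space. The function names, the 24-hour lead time and the sample readings are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in hours, free space in GB)


def hours_until_full(samples: Sequence[Sample]) -> Optional[float]:
    """Linearly extrapolate free space to zero; None if space is not shrinking."""
    if len(samples) < 2:
        return None
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (free0 - free1) / (t1 - t0)  # GB consumed per hour
    if rate <= 0:
        return None
    return free1 / rate


def should_page(samples: Sequence[Sample], lead_time_hours: float = 24.0) -> bool:
    """Alert when the projected time to exhaustion drops below the lead time."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining < lead_time_hours


# Hypothetical readings: 500 GB free at the start, dropping 40 GB per hour.
readings = [(0.0, 500.0), (1.0, 460.0), (2.0, 420.0)]
print(hours_until_full(readings))
print(should_page(readings))
```

With these made-up readings the projection is 10.5 hours, so the alert would fire while most of the disk is still free. An alert that waits for a fixed threshold, such as 10% free space remaining, would leave far less time at the same depletion rate.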

It took more than 24 hours to repair the databases and get back to normal.

In the meantime, the company couldn’t on-board new drivers and customer service was inoperable, among many other problems. In China, Baidu was “extremely upset” because the special integration between its apps and Uber's didn’t work, Mr. Pham said. The crash imposed a “huge amount of pain and inconvenience for the rest of the business,” he said.

He later wrote to his engineers that “the first step to fixing any problem is to acknowledge that we have a problem.” Uber has “many problems in multiple levels of the org with respect to quality (code, testing, monitoring, tooling, process, operation, etc.), including me for not having pushed hard enough to establish the level of rigor and quality in the tooling and operation of our services.”

To solve those problems, he promised to increase emergency-response training (“just like airline pilots”) for all engineers and create a “site reliability engineering” role within the group in order to operate Uber in a more “professional manner.” The company also started a “zombie apocalypse recovery toolkit,” which is a digital manual for how to fix problems and keep things going when a specific system goes down.

He also continued a transition, which began before he arrived at Uber, to what’s called “service oriented architecture.” SOA splits up all of a company’s back-end functions into isolated “micro services.” Doing so allows one system to go down without impacting other systems. It’s still an ongoing process, and Uber already has 450 micro services, Mr. Pham says.
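The isolation idea can be shown with a toy example. This is not Uber’s code: plain Python functions stand in for separate networked services, and every name and return value here is invented for illustration. Each dependency is called independently, and a failure in one is replaced with a degraded default instead of taking down the whole request.

```python
from typing import Callable, Dict


def call_with_fallback(service: Callable[[], str], fallback: str) -> str:
    """Call one service; if it fails, return a degraded default instead of raising."""
    try:
        return service()
    except Exception:
        return fallback


def fare_service() -> str:
    # Hypothetical healthy service.
    return "fare: $12.40"


def email_service() -> str:
    # Hypothetical service that is currently down.
    raise ConnectionError("email service is down")


def handle_trip_end() -> Dict[str, str]:
    """Finish a trip even though one of the two dependencies is failing."""
    return {
        "fare": call_with_fallback(fare_service, "fare: pending"),
        "receipt": call_with_fallback(email_service, "receipt: queued for retry"),
    }


print(handle_trip_end())
```

In a monolithic code base the same email failure could surface as an unhandled error in the middle of fare calculation. Splitting the functions apart makes it possible to contain the failure, though only if callers are written to tolerate it.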

You Don’t Know Git

Other big changes have been made. Encrypted data is now stored on drivers’ phones (which are specially designed to run a driver’s version of the Uber app) so that if a related back-end system goes down in Uber’s servers, the phones can play a backup role. In addition, there’s a new system called “uDestroy,” similar to Google’s “DiRT” and Netflix’s “Chaos Monkey,” that deliberately and randomly causes system problems in order to help the organization learn how to deal with them.
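How uDestroy works internally isn’t described, but the general technique of deliberate fault injection can be sketched in a few lines. Everything below is hypothetical, including the wrapper name, the 30% failure rate and the stand-in lookup function. A seeded random generator keeps the run repeatable.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised on purpose to simulate a dependency failure."""


def with_chaos(func: Callable[[], T], failure_rate: float, rng: random.Random) -> Callable[[], T]:
    """Wrap func so that a fraction of calls fail deliberately."""
    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func()
    return wrapped


def lookup_driver() -> str:
    # Hypothetical stand-in for a real back-end call.
    return "driver-123"


rng = random.Random(42)  # seeded so a test run is repeatable
flaky_lookup = with_chaos(lookup_driver, failure_rate=0.3, rng=rng)

failures = 0
for _ in range(1000):
    try:
        flaky_lookup()
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed on purpose")
```

The value of such a tool comes from what sits around it: callers that must handle `InjectedFault` gracefully, and engineers who practice responding before a real outage forces them to.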

Meanwhile, the quality of some new engineering recruits has fallen as hiring has increased, and many hires are fresh graduates with no experience. “People are graduating with CS degrees and don’t know how to use Git,” which is a system for writing code with other engineers, says one person familiar with Uber’s hiring. Uber says that less than 10% of its hires over the past year were new college graduates, and the average amount of experience for the other 90% has gone up significantly.

In response to the influx of graduates, the company has invested considerably in training. The orientation program for new hires, known as “Uberversity” for “nUbers,” has expanded for engineers, not only to acquaint them with Uber’s back-end systems but also for general education. There’s also a computer science curriculum, taught in part by contractors, to help engineers write good Python code or to learn how to build iOS apps.

Mr. Pham says Uber cannot compete with Facebook and Google by throwing $100,000 signing bonuses at new recruits. And Uber’s engineering culture has never been as extravagant as that of some other fast-growing companies: The food isn’t as varied, and there aren’t as many toys and other distractions (drinking from beer taps known as “uBeer” and engaging in Nerf gun battles aren’t allowed before 6 p.m.). But this year he’s been able to poach some senior people like Joe Sullivan, who was Facebook’s chief security officer, and AG Gangadhar, who was a cloud platform manager at Google.

Many of those who’ve worked with Mr. Pham at Uber have a difficult time saying whether he’s done a great job or not.

“He didn’t get crushed. The wheels didn’t fall off the bus,” one colleague notes. “That was an achievement.”

This article has been updated with a comment from Uber about the experience level of its recent hires.

Amir Efrati is executive editor at The Information, which he helped to launch in 2013. Previously he spent nine years as a reporter at the Wall Street Journal, reporting on white-collar crime and later about technology. He is on Twitter @amir