Entrepreneur, husband, Dad, and technology geek all contained within a single human being.

“Drupalgeddon2” touches off arms race to mass-exploit powerful Web servers


(Image credit: Torkild Retvedt)

Attackers are mass-exploiting a recently fixed vulnerability in the Drupal content management system that allows them to take complete control of powerful website servers, researchers from multiple security companies are warning.

At least three different attack groups are exploiting "Drupalgeddon2," the name given to an extremely critical vulnerability Drupal maintainers patched in late March, researchers with Netlab 360 said Friday. Formally indexed as CVE-2018-7600, Drupalgeddon2 makes it easy for anyone on the Internet to take complete control of vulnerable servers simply by accessing a URL and injecting publicly available exploit code. Exploits allow attackers to run code of their choice without needing an account of any type on a vulnerable website. The remote-code vulnerability harkens back to a 2014 Drupal vulnerability that also made it easy to commandeer vulnerable servers.

Drupalgeddon2 "is under active attack, and every Drupal site behind our network is being probed constantly from multiple IP addresses," Daniel Cid, CTO and founder of security firm Sucuri, told Ars. "Anyone that has not patched is hacked already at this point. Since the first public exploit was released, we are seeing this arms race between the criminals as they all try to hack as many sites as they can."

China-based Netlab 360, meanwhile, said at least three competing attack groups are exploiting the vulnerability. The most active group, Netlab 360 researchers said in a blog post published Friday, is using it to install multiple malicious payloads, including cryptocurrency miners and software for performing distributed denial-of-service attacks on other domains. The group, dubbed Muhstik after a keyword that pops up in its code, relies on 11 separate command-and-control domains and IP addresses, presumably for redundancy in the event one gets taken down.

Added punch

Netlab 360 said that the IP addresses that deliver the malicious payloads are widely dispersed and mostly run Drupal, an indication of worm-like behavior that causes infected sites to attack vulnerable sites that have not yet been compromised. Worms are among the most powerful types of malware because their self-propagation gives them viral qualities.

Adding extra punch, Muhstik is exploiting previously patched vulnerabilities in other server applications in the event administrators have yet to install the fixes. Webdav, WebLogic, Webuzo, and WordPress are some of the other applications that the group is targeting.

Muhstik has ties to Tsunami, a strain of malware that has been active since 2011 and infected more than 10,000 Unix and Linux servers in 2014. Muhstik has adopted some of the infection techniques seen in recent Internet-of-things botnets. Propagation methods include scanning for vulnerable server apps and probing servers for weak secure-shell, or SSH, passwords.

The mass exploitation of Drupal servers harkens back to the epidemic of unpatched Windows servers a decade ago, which gave criminal hackers a toehold in millions of PCs. The attackers would then use their widely distributed perches to launch new intrusions. Because website servers typically have much more bandwidth and computing power than PCs, the new rash of server compromises poses a potentially much greater threat to the Internet.

Drupal maintainers have patched the critical vulnerability in both the 7.x and 8.x version families as well as the 6.x family, which maintainers stopped supporting in 2016. Administrators who have yet to install the patch should assume their systems are compromised and take immediate action to disinfect them.


1 day ago
Drupal... just say no
Waterloo, Canada

Google gives up on Google Allo, hopes carriers will sort out RCS messaging



It's time for another chapter in the saga of Google's messaging mess. This latest news comes from The Verge, which reports that Google will be abandoning its most recent messaging app failure, Google Allo, in favor of a renewed push for the carrier-controlled RCS (Rich Communication Services) protocol.

Google Allo was Google's attempt at a WhatsApp clone, and it launched just a year-and-a-half ago with a laundry list of deficiencies. It used a phone-centric login system and didn't support using a Google account. It only worked on one device at a time and didn't have an interface for desktop or laptop computers. Distribution wasn't great either, as Allo wasn't one of the mandatory Google apps included in every Android phone. None of this really mattered since Allo didn't support sending SMS messages, so there was no one to talk to anyway. Google's other chat service, Google Hangouts, was better in nearly every way.

With such a half-baked launch, the real unknown for Google Allo was what kind of resources Google would throw at it. Like Android, which also entered a market late in the game, Allo needed a massive amount of resources to catch up to the competition. Instead, we were treated to an absolutely glacial development pace that mostly focused on new sticker packs. It took a full year before Allo addressed one of its biggest flaws—not working on a desktop—and even then, login was handled by a janky QR code pairing system that only worked on one extra device at a time. Google users expect a Google account-based login that works on all devices all the time, just like Hangouts.

At least we won't have to worry about Allo anymore. The Verge report says Google is "pausing" Allo development and "transferring almost the entire team off the project and putting all its resources into another app." Allo will continue to work for the foreseeable future, but new features won't be arriving any time soon.

Rich (and fragmented) Communication Services

What everyone wants from Google is an iMessage clone: an over-the-top messaging service that would run on all devices and platforms, with login handled by a Google account. Essentially, people want an updated version of Google Hangouts, a piece of software Google abandoned and removed features from in order to promote Google Allo. The Verge report says that Google "won’t build the iMessage clone that Android fans have clamored for" and will instead try to get the carriers to cooperate on RCS.

RCS, or Rich Communication Services, has been around as a GSMA (the worldwide mobile network trade body) project for about ten years now. RCS replaces SMS and MMS with a service that works more like an instant messaging app. RCS adds IM features to carrier messaging that most users take for granted, like user presence, typing status, read receipts, and location sharing. It sends messages over your data connection and increases the size caps on photos and video sharing.

The current problem with RCS versus an over-the-top IM service is that users on different carriers are usually not able to talk to each other with RCS features enabled. The cell carriers fear being turned into "dumb pipes" and generally prefer proprietary services that give them customer lock-in. Naturally, they have resisted building an interoperable RCS system. Currently, the RCS landscape is fragmented, with RCS flavors like AT&T Advanced Messaging, Verizon Message+, T-Mobile Advanced Messaging, and Sprint Enhanced Messaging.

Google got involved with RCS in 2015 when it acquired Jibe Mobile, a company that provides back-end RCS services to carriers. At the end of 2016, the GSMA published the "Universal Profile" spec, which was an agreed upon standard that would let the various carrier RCS implementations talk to each other. Google then started pushing carriers to adopt "Google Jibe" as an end-to-end RCS service, where Google could provide the RCS network, the cloud infrastructure, and the end-user clients. Android's default SMS app, Android Messages, was made to support this new standard.

RCS-powered “Chat”: Carrier-dependent messaging

The Verge

As part of this renewed RCS push, The Verge reports that Google is putting more resources (including the Allo team) into Android Messages, and RCS will be rebranded into a new service called "Chat." Not "Google Chat," because this is RCS, which is a carrier-controlled standard. RCS will just be rebranded to "Chat."

Being carrier-controlled comes with a number of downsides. First, Chat will need your individual carrier to support Universal Profile to work. Over 55 carriers—including the big four in the US—have "committed" to eventually support RCS, but no timeframe is included in that commitment. In the US, only Sprint has Universal Profile up and running right now. T-Mobile has promised a "Q2 2018" rollout, while Verizon and AT&T have so far declined to give a time frame. (This worldwide Universal Profile tracker is a great resource.) There's also no one single client for RCS. Google's RCS client is Android Messages, while Samsung phones come with a Samsung RCS app. Also, no one knows if Apple will support RCS on the iPhone.

Another big downside of carrier control is no end-to-end encryption. The Verge notes that Google's RCS service will follow the same legal intercept standards as SMS. Nearly all of Google's competition—like iMessage, WhatsApp, Facebook Messenger, Signal, and Telegram—supports end-to-end encryption.

Google's revamped Android Messages app will include old Allo features, like integration with the Google Assistant, GIF search, and smart replies. Android Messages is already an SMS app, so if your friends aren't on an RCS carrier, you'll still be able to send them a regular SMS message. SMS support will be a big improvement over Allo.

Messages will also get a desktop client, but it unfortunately sounds a lot like Allo's awful Web client. The Verge got to try an Android Messages Web client that, like Allo, paired to your phone through a QR code instead of a Google account. These QR-code-powered systems typically mean you'll only be allowed to be logged into one device at a time, and if your phone dies, you can't text anyone. The "desktop client" is also only a webpage, so it won't run in the background the way Hangouts and other IM apps can.

Carriers versus consumers

Various Google execs have been asked numerous times why Google doesn't just build an iMessage clone, and the answer that came back was always something along the lines of "We don't want to jeopardize our relationship with carriers." Carriers famously dislike many of the consumer-centric choices Apple makes with the iPhone, and building a quality, non-SMS messaging solution was one of those choices. For Google, keeping carriers happy so they run Android and Google services on nearly every non-Apple device is far more important than rocking the boat with a competitive messaging app. The plan this year, apparently, is to try to strike a happy medium with the carriers.

Like Google Allo, Chat will start far, far behind the competition at launch and will need to move quickly to catch up. If it catches up—if that's even possible—it needs to surpass the entrenched messaging services and be so much better that users are willing to switch. It seems like it will be especially tough to accomplish this while being hamstrung by the world's cellular carriers. Just making RCS actually work across carriers is a huge challenge, and it would only result in a very basic messaging system that can be matched by every other chat app in existence. Plus, the lack of end-to-end encryption already makes Google's Chat plans inferior to other services in many people's eyes.

With all these challenges ahead of it, can Google turn RCS into something worth using? If Google's history with past messaging apps is any indication, the answer is "no."


3 days ago
Seriously this is embarrassing. They own the OS and are a “services” company. How hard is to build an iMessage API even for apps to use let alone a decent actual app? And relying on the carriers? Lol.
Waterloo, Canada
3 days ago
Any time someone says you can’t compete with Google ask whether you have an attention span longer than a sandflea’s and any respect for your users
Washington, DC

Tesla paved the way for EVs but electrification isn’t just about cars anymore


Tesla and its electric cars may have kicked off the party, but other industries are quickly joining the revolution.

For generations of humans, the internal combustion engine has been the go-to solution for many of our needs.  We use them every day, to create electricity, work our farms, transport our products, and move us around the globe with relative ease.  It has been a spectacularly successful technology, with decades of refinement bringing us the engines we have today.  But as ubiquitous as they’ve become, the evidence is mounting that now is the time for their replacement by a cleaner, more efficient, reliable, and flexible technology.  We had just been waiting for the right motors and batteries to make it possible.

Today there are many applications where motors and batteries are primarily a direct swap for internal combustion engines.  No longer is the discussion reserved for passenger cars alone.  Freight trucks, buses, ships, planes, and utilities are all part of the growing list.  The technology is proving to be scalable, cost-effective, and flexible enough to apply to a wide variety of societal needs.  It’s quickly becoming a “general purpose technology,” arguably to an extent more significant than the internal combustion engine.  Electric motors and batteries are becoming the preferred form of motive power — it’s happening right now, allow me to illustrate:


Ground Freight – Long Haul, Cargo Trucks, Package Delivery, Food Transport, Waste Haulers

Until recently a reasonably common belief was that electric transport of commercial goods was some far off concept.  That transport trucks were too big, too heavy, and traveled too far.  I saw this demonstrated last summer at an industry event.  The presenter attested that electric freight transport was decades away and the only practical solution was direct combustion of natural gas or hydrogen.  That narrative has changed rather quickly.  In the fall of 2017, Tesla unveiled their Semi, a $230,000 Class 8 truck capable of 500 miles and 80,000 lbs.  The real kicker is that it exceeds the performance of diesel trucks and reduces operating costs by 20%.  A shorter range 300-mile version for $190,000 was also announced with production targeted for 2019.

Companies that rely on trucking took notice, with the likes of Pepsi, UPS, FedEx, Walmart, and many others placing hundreds of preorders.  It’s a safe bet that many more will follow if those initial orders prove successful.  Just this past month UPS wrote how their integrated charging system in London “..signals the beginning of the end of reliance upon traditional combustion engine powered vehicles.”   That’s from a company that delivers nearly 5 billion items a year.

Tesla Semi, Image Source: Tesla

While Tesla’s truck is currently the most ambitious, other manufacturers haven’t been sitting idle.  Most companies are starting with smaller vehicles for short hauling within cities.  Some other hybrid options do exist but the focus here is on pure electric, as ultimately the preferred solution (versus the increased complexity and maintenance of hybrids).

Daimler’s Fuso brand started delivering their eCanter truck this year, albeit in limited quantities (500 in the first two years).  It only has a 62-mile range and a max load capacity of 3 tons.  Their Mercedes brand has the eSprinter cargo van coming later in 2018.  Future options with longer range and more capacity aren’t far away though, with their Mercedes eActros truck and E-Fuso Vision One.  The eActros is marketed with a range of 125 miles and a max weight capacity of 26 tons (52,000 lbs).  It’s already in pilot testing, with 2021 targeted for sales.  The Vision One concept is a similar size but nearly double the range at 220 miles.

There’s also electric vehicle giant BYD, which already sells a Class 8 electric truck with 90 miles of range.  If you aren’t aware of them, BYD produces the most electric vehicles in the world, most of them as passenger vehicles in China.  But they have a large lineup and a growing global reach.   They even have an electric garbage truck, two of which were delivered to the city of Palo Alto for pilot testing.

The CEO of Navistar issued his electric challenge early this year, declaring that by 2025 his company would have more electric trucks on the road than Tesla (Navistar has 11% of the Class 8 truck market and is partially owned by Volkswagen).

Tesla’s approach to their Semi may have a competitive advantage.  By using motors, inverters, and battery modules produced for their mass-market Model 3, the costs of their truck can be dramatically reduced.  There are economies of scale in making millions of virtually identical parts and sharing them between their vehicles.  It drives home the point that electric motive power technology is even more general purpose than internal combustion.

Mercedes eActros, Image Source: Daimler


Public Transit – Buses and Shared Transportation


Public transit is undoubtedly a huge overall benefit to air quality in cities, but anyone that’s been hit by the black smoke spewed from a diesel bus or walked down the street partially holding their breath may beg to differ.  Diesel buses have to go.  With constant start-stops and regular periods of idling, they are inherently inefficient (it actually might be the worst application for combustion engines, right after submarines I suppose, or space..).  Diesel exhaust isn’t just annoying; it’s a serious hazard to human health.

Electric drives, on the other hand, have regenerative braking and no direct emissions.  They are efficient, clean, have drastically lower fuel costs, and require less maintenance.  That’s why in my city, the Toronto Transit Commission announced their plans to buy 30 pure electric buses to add to their existing fleet of nearly 700 hybrid buses and 1300 combustion only buses.  Los Angeles recently ordered 25 all-electric buses and declared their intent to make their fleet 100% electric by 2030.  That’s great, but other parts of the world have us sorely beat.  In China, the city of Shenzhen has already completed its conversion to fully electric buses, all 16,359 of them serving a city of nearly 12 million people.  Check out the video of their fleet below.

(Impressive stuff Shenzhen)

In the USA pure electric buses account for less than 1% of the public transit fleet, with only 300 out of a countrywide fleet of 70,000, according to BNEF.  Hybrid buses in the USA look better, accounting for nearly 18% of the fleet according to the US DOE.

Several major cities around the world have announced they will only purchase all-electric buses by 2025, but that seems like eight wasted years.  Regardless, the choice will become ever more apparent as battery costs continue to fall and cities need to cut operating costs while reducing air pollution.


Before leaving the topic of buses: Blue Bird and Daimler even have electric school buses, with deliveries starting this year.  That’s a great application to let kids learn about electric vehicles while reducing their exposure to diesel exhaust.

Blue Bird Electric School Bus, Image Source Business Wire


Taxis and Ride Sharing

Taxis and ‘shared’ transportation options are another important part of city transit.  Shenzhen is again leading the way, looking to replace all of their combustion taxis by 2020.  It may help that BYD’s headquarters are also located in Shenzhen.  But even in London, the iconic black taxis are going electric.  By 2021 London expects 9,000 to be on the road, roughly half their current fleet.

Then there’s Waymo (Google) which recently announced they are purchasing 20,000 Jaguar I-Pace electric cars to be part of their autonomous fleet.  Waymo expects those vehicles can replace 1 million combustion vehicle trips per day.  That’s something to take note of — that through shared mobility, relatively few electric cars can displace many more combustion vehicle trips.

Waymo / Jaguar I-Pace, Image Source: Waymo


The Boring Company

If none of that excites you, then here’s something.  If Elon Musk has his way, there will be a radical new approach to public transit.  It requires tunnels, but no tracks and no trains.  Instead, by utilizing self-powered autonomous electric “skates,” the Boring Company wants to create a mass transit system that’s more accessible, requires less capital investment, and offers greater flexibility. Here’s a quick video of their vision for the future.

Shipping – Ferries and Cargo Vessels

Shipping over water is very efficient but also very dirty.  About half of the world’s shipping fleet uses something called “bunker fuel” which is so viscous it often has to be heated to allow it to flow (in case you were wondering the other half of those ships use diesel).  Bunker fuel is also extremely toxic in a spill and highly polluting when combusted.  The particulates produced from ship-based combustion alone are estimated to be responsible for 60,000 deaths every year.

Following a now familiar path, the first ships being electrified are for short-range applications.  In 2013 the first electric ferry was brought into service in Norway, with spectacular results.  The ship is called Ampere, and it reduced CO2 emissions by 95% and cut operating costs by 80%.  That one vessel saves over 1 million litres of diesel a year.  Its builder, Fjellstrand, now has orders for 53 more electric ferries.  Another shipbuilder in Norway, Havyard Group, is also producing electric boats with a recently announced contract to provide 7 for operator Fjord 1.  In Canada, our first fully electric ferries have just been ordered to serve on Lake Ontario.

Havyard electric ferries, Image Source Havyard


Electric ships aren’t just limited to ferries though.  In August 2018 there will be five new autonomous electric barges operating on the inland waterways between the Netherlands and Belgium.  They’re relatively small, only capable of carrying 24 20ft containers but six larger barges will follow later in the year.  Those will carry 280 containers each and operate out of the ports of Amsterdam, Antwerp, and Rotterdam (pictured below).  In China there’s even an electric barge transporting coal, of all things; it’s almost like there’s a fracture in the space-time continuum.  It’s hard to imagine they’re doing it for environmental reasons, so the economics must be good.

For large ocean traversing vessels (“Ultra Large Container Vessels”) electrification is more difficult.  Their power demands are massive, and the single trip distances traveled are far greater.  Solutions here are expected to be more of a hybrid between technologies, including hydrogen, batteries, biofuels, and sail assist.  The key thing to note is that the solutions in shipping are scalable and even in the near-term will go a long way to improving air quality on land.  (Of course, buying local is often the best solution.)

Port-Liner 2, Image Source: GVT Logistics


Electric Utilities and Power Generation

Using electricity instead of fossil fuels for transport will reduce pollution, which is true everywhere in North America and most of the rest of the world too.  The environmental benefits are also improving every year (a previous post goes into this topic in some detail). 

Some of those improvements come from reducing the use of diesel and natural gas “peaking plants.”  A “peaking plant” is one that can be quickly dispatched to meet demand when other sources of power are unable to respond quickly enough.  Coal and nuclear, for example, are very slow to ramp up or down.  Battery packs, on the other hand, can ramp even faster than diesel or natural gas and are great at frequency regulation.  Storage also allows for more of our power to come from renewables like solar and wind.

Tesla Powerpacks, Source: Tesla


Tesla recently installed the most powerful battery storage system in the world, a 100MW/129 MWh facility in South Australia.  From contract signing, it was up and running in 100 days.  That “most powerful” battery title won’t last long though.  Hyundai is currently building one that’s 50% larger for a smelting company in South Korea.  Tesla has at least two more utility projects secured in Australia and is working on a project that will install Powerwall batteries in 50,000 homes, creating 675 MWh of storage.

In the USA, Xcel Energy is planning their massive collection of battery projects, releasing bids in December 2017 for projects totaling 1,050 MW and 7,200 MWh.   In California PG&E recently awarded 165MW of battery storage projects and Southern California Edison has a 100MW/400 MWh system awarded.  It was only one year ago that California installed a 30MW/120MWh facility, the largest in the USA at that time.  Things are moving quickly.  For small and medium-sized projects there are now simply too many to note.

The point here is that battery storage for utility power is growing rapidly.  BNEF forecasts that the worldwide market will double six times by 2030 (roughly 60x what it is today).  In the USA GTM forecasts an annual installation increase of 10x by 2022.  That’s only five years from now!  And it’s not surprising why.  A report from the World Bank shows costs continue to reduce for Li-ion batteries on both utility-scale and residential installations, even relative to other storage technologies (graphs below).  The solutions are also easily scalable, as seen by the residential and utility examples.  These are the same batteries as those going into electric cars, trucks, buses, and ships, which further supports the arguments about economies of scale and the ubiquity of the technology to serve our needs.


Small Engines – Motorcycles, etc

Motorcycles shouldn’t be left out of this discussion.  Why?  Because there are approximately 200 million of them globally and they emit more pollution per mile than a car (~10x more in fact).  Thankfully electric options are here too.  There’s Vespa, which has their first electric moped coming out in 2018 and Zero, which produces only pure electric motorbikes.  Harley Davidson is even developing one under the name Project Livewire (it’s gorgeous).  There are also hundreds of companies producing electric scooters, a transport solution which is common in many parts of the world.  A colleague recently told me how impressed he was with the battery swapping programs for scooters in Indonesia, for example.

And at the risk of lumping in lawn mowers with motorcycles, even traditionally gas-powered devices like lawn mowers, weed eaters, and snow blowers are rapidly switching to electric.

Harley-Davidson – Project Livewire


Airplanes – Commuter Hybrids, All Electric Future

It’s going to be a long time before pure electric intercontinental flights are operating (energy density is the main problem), but smaller airplanes and hybrids are being developed right now.  It’s not just by NASA and a few startups either.  Boeing and Airbus both have programs underway.  Airbus has partnered with Siemens and Rolls Royce to develop the E-Fan X pictured below.  It’s a hybrid-electric demonstrator aircraft with test flights planned for 2020.  Boeing is working with Zunum Aero out of Seattle, developing a hybrid passenger plane.  Zunum hopes to be selling their 12 seat hybrid aircraft by 2022.  The design uses two electric motors, which are fed by a battery, which is in turn charged by a jet fuel burning generator, leading to greater overall efficiency.  Electrically propelled aircraft also open up some interesting possibilities in design, such as fan arrays and vertical takeoffs.

Airlines are also looking to electrified planes to reduce costs and emissions.  EasyJet announced plans last year to develop a hybrid hydrogen aircraft with their partner Wright Electric.  Founded by engineers from NASA, Boeing, and Cessna, Wright already has a two-seater prototype.  There’s also the big announcement by Norway’s public air transport operator, Avinor, which earlier this year declared their intention for all short-haul flights to be pure electric by 2040.




The Point

All the indicators are there.  Electric motors and batteries are proliferating throughout our society.  It’s quickly becoming our new go-to “general purpose technology.”  It simply has too many benefits and yet much innovation ahead.

This is all to our benefit.  Technological revolutions are required to keep our civilization moving forward; it’s one of the ways new jobs are created.  But perhaps even more importantly, electrification brings greater efficiency and reduced pollution (yes CO2 is a pollutant when in sufficient quantities that would render life on this planet inhospitable).  That last part is important because if we don’t make changes to these industries now, we won’t have much of a civilization left to worry about.

Personally, I’m encouraged by the progress being made. I attended a Q&A session for a program funding low carbon solutions.  Several separate groups asked about funding for electric freight, electric ferries, electric buses, electric commercial car fleets, and battery storage.  Obviously, interest has really taken off.  A year ago people were barely convinced about electric cars and now, as important as they are, electrification isn’t just about passenger cars anymore.


The post Tesla paved the way for EVs but electrification isn’t just about cars anymore appeared first on TESLARATI.com.


Four cents to deanonymize: Companies reverse hashed email addresses


[This is a joint post by Gunes Acar, Steve Englehardt, and me. I’m happy to announce that Steve has recently joined Mozilla as a privacy engineer while he wraps up his Ph.D. at Princeton. He coauthored this post in his Princeton capacity, and this post doesn’t necessarily represent Mozilla’s views. — Arvind Narayanan.]

Your email address is an excellent identifier for tracking you across devices, websites and apps. Even if you clear cookies, use private browsing mode or change devices, your email address will remain the same. Due to privacy concerns, tracking companies including ad networks, marketers, and data brokers use the hash of your email address instead, purporting that hashed emails are “non-personally identifying”, “completely private” and “anonymous”. But this is a misleading argument, as hashed email addresses can be reversed to recover original email addresses. In this post we’ll explain why, and explore companies which reverse hashed email addresses as a service.

Email hashes are commonly used to match users between different providers and databases. For instance, if you provide your email to sign up for a loyalty card at a brick and mortar store, the store can target you with ads on Facebook by uploading your hashed email to Facebook. Data brokers like Acxiom allow their customers to look up personal data by hashed email addresses. In an earlier study, we found that email tracking companies leak hashed emails to data brokers.

How hash functions work

Hash functions take data of arbitrary length and convert it into a random-looking string of fixed length. For instance, the MD5 hash of *protected email* is b58996c504c5638798eb6b511e6f49af. Hashing is commonly used to ensure data integrity, but there are many other uses.

Hash functions such as MD5 and SHA256 have two important properties that are relevant for our discussion: 1) the same input always yields the same output (deterministic); 2) given a hash output, it is infeasible to recover the input (non-invertible). The determinism property allows different trackers to obtain the same hash based on your email address and match your activities across websites, devices, platforms, or online-offline realms.
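As a rough illustration of the determinism property, here is a minimal Python sketch using the standard `hashlib` module. The `email_hash` helper, the example addresses, and the trim-and-lowercase normalization are assumptions made for this sketch; they are not taken from any particular tracker's implementation.

```python
import hashlib

def email_hash(email: str, algorithm: str = "md5") -> str:
    """Return the hex digest of an email address.

    Assumption: the address is trimmed and lowercased before hashing,
    so that trivially different spellings map to the same identifier.
    """
    normalized = email.strip().lower()
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()

# Two independent parties hashing the same address get the same identifier,
# which is what lets them match a user without exchanging the raw address.
tracker_a = email_hash("alice@example.com")
tracker_b = email_hash("  Alice@Example.com ")
print(tracker_a == tracker_b)  # True

# The output length is fixed regardless of the input length.
print(len(email_hash("a@example.com")))                      # 32 hex chars (MD5)
print(len(email_hash("a-much-longer-address@example.com")))  # 32 hex chars (MD5)
print(len(email_hash("alice@example.com", "sha256")))        # 64 hex chars (SHA256)
```

The same determinism that makes the hash useful as a cross-site identifier is what allows anyone who knows or guesses the address to recompute the identifier.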

However, for hashing to be non-invertible, the number of possible inputs must be so large and unpredictable that all possible combinations cannot be tried. For instance, in a 2012 blog post, Ed Felten, then the FTC’s Chief Technologist, argued that hashing all possible SSNs would take “less time than it takes you to get a cup of coffee”.

The huge number of possible email addresses makes naively iterating over all possible combinations infeasible. However, the number of existing email addresses is much lower than the number of possible email addresses: a recent estimate puts the total number of email addresses at around 5 billion. That may sound like a lot, but hashing is an extremely fast operation; so fast that one can compute 450 billion MD5 hashes per second on a single Amazon EC2 machine at a cost of $0.0069 per second [1]. That means hashing all five billion existing email addresses would take about ten milliseconds and cost less than a hundredth of a cent.
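The arithmetic behind this estimate can be checked in a few lines of Python. The figures are the ones quoted in this post and in end note [1]; the per-second price is derived from the hourly price and assumes billing proportional to the time used.

```python
HASHES_PER_SECOND = 450e9   # MD5 rate quoted in the post
HOURLY_PRICE_USD = 24.48    # p3.16xlarge, March 2018 (end note [1])
EXISTING_ADDRESSES = 5e9    # estimated number of existing email addresses

price_per_second = HOURLY_PRICE_USD / 3600
seconds_needed = EXISTING_ADDRESSES / HASHES_PER_SECOND
total_cost_usd = seconds_needed * price_per_second

print(f"price per second: ${price_per_second:.4f}")
print(f"time to hash every address: {seconds_needed * 1000:.1f} ms")
print(f"total cost: {total_cost_usd * 100:.4f} cents")
```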

Lists of email addresses are widely available
Once an email address is known, it can be hashed and compared against supposedly “anonymous” hashed email addresses. This can be done by marketing or advertising companies that use hashed email addresses as identifiers, or hackers who acquire hashed addresses by other means. Indeed, there are several options to obtain email addresses:

    1. Data breaches: Thanks to a steady stream of data breaches, hundreds of millions of email addresses from existing leaks are publicly available. HaveIBeenPwned, a service that allows users to check if their accounts have been breached, has observed more than 4.9 billion breached accounts. Want to check if your email address is vulnerable to this attack? Use HaveIBeenPwned to determine if any of your email addresses were leaked in a data breach. If they were, an attacker would be able to use data from a breach to recover your email addresses from their hashes [2].
    2. Marketing email lists: Mailing lists with millions of addresses are available for bulk purchase, and often are labeled with privacy-invasive categories like religious affiliation, medical conditions or addictions, including “Underbanked”, “Financially Challenged”, “Gamblers”, “High Blood Pressure Sufferers in Tallahassee, Florida”, “Anti-Sharia Christian Conservatives”, “Muslim Prime Prospects”. In addition, there are websites that readily share massive lists of email addresses.
    3. Harvesting email addresses from websites, search engines, PGP key servers: There are a number of software solutions available to extract email addresses in bulk from websites, search engines and public PGP key servers.
    4. Guessing email addresses: Email addresses can also be synthetically generated by using popular names and patterns such as *protected email*. Past studies achieved recovery rates between 42% and 70% using simple heuristics and limited resources [3]. We believe this can be significantly improved by using neural networks to generate plausible email addresses.
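Whatever the source of the candidate addresses, the reversal step itself is a plain table lookup. The sketch below assumes MD5 over lowercased addresses; the candidate list and all example.com-style addresses are made up for illustration.

```python
import hashlib


def md5_hex(email: str) -> str:
    """Hash a normalized address the way a tracker might (assumed scheme)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Map hash -> email for every candidate address."""
    return {md5_hex(email): email for email in candidates}


def reverse_hashes(hashed_ids, lookup):
    """Return the recovered address for each hash, or None if unknown."""
    return {h: lookup.get(h) for h in hashed_ids}


# Hypothetical candidate list (from a breach, a purchased list, or guessing).
candidates = ["alice@example.com", "bob@example.org", "carol@example.net"]
lookup = build_lookup(candidates)

# Supposedly "anonymous" identifiers as they might be stored or exchanged.
hashed_ids = [md5_hex("bob@example.org"), md5_hex("dave@example.com")]
recovered = reverse_hashes(hashed_ids, lookup)
print(recovered)
```

Only addresses present in the candidate list are recovered, which is why the size and quality of that list, not the hash function, determines the recovery rate.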

Companies reverse email hashes as a service
The hash recovery methods listed above require only very basic technical skills. However, even those aren’t needed to reverse hashed data, as several companies reverse email hashes as a service.

Datafinder – Reverse email hashes for $0.04 per email: Datafinder, a company that combines online and offline consumer data, charges $0.04 per email to reverse hashed email addresses. The company promises a 70% recovery rate and for a nominal fee will provide additional information along with the reversed email, including: name, address, city, state, zip and phone number. Datafinder is accredited by the Better Business Bureau with an A+ rating, and its clients include T-Mobile.


Infutor – Sub 500-millisecond hashed email “decoding”: Infutor, a consumer identity management company, states “[a]nonymous hashed data can be matched to a database of known hashed information to provide consumer contact information, insights and demographic information”. In one case study, the company claims to have reversed nearly 3MM email addresses. In another case, Infutor set up a near real-time online service to reverse hashed emails for an EU company, which “is able to extract a hashed email from the website visit”. Infutor boasts that they could meet their client’s sub-500 millisecond response time requirement to reverse a given hash.

The Leads Warehouse – “We have cracked the code”: The Leads Warehouse claims that “[they] recover all of your MD5 hashed emails” quickly, securely and cost-effectively through their bizarrely named service “MD5 Reverse Encryption”. Their website reads “[i]n fact, [hashed emails are] designed to be impenetrable and irreversible. Don’t sweat it, though, we have cracked the code.” The Leads Warehouse also sells phone and mailing leads that include Sleep Apnea, Wheelchair Leads and Student Loans lists. For their Ailment & Diabetic Email Lists, they claim they have “amazing filtering options” including length of illness, age, ethnicity, cost of living/hospital expenses.

Are hashed email addresses “pseudonymous” data under the GDPR?

In response to our earlier blog post on login manager abuse, a European company official claimed that hashed email addresses are “pseudonymous identifier[s]” and are “compliant with regulations.” The upcoming EU General Data Protection Regulation (GDPR) indeed recognizes pseudonymization as a security measure [4] and considers it as a factor in certain obligations [5]. But can email hashing really be classified as pseudonymization under GDPR?

The GDPR defines pseudonymization as:

“the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;” [6]

For example, if email addresses were encrypted and the key stored separately with additional protections, the encrypted data could be considered pseudonymized under this definition. If there were a breach of the data, the adversary would not be able to recover the email addresses without the key.

However, hashing does not require a key. The additional information needed to reverse hashed email addresses — lists of email addresses, or algorithms that guess plausible email addresses — can be obtained in several ways as we described above. None of these methods requires additional information that “is kept separately and is subject to technical and organisational measures”. Therefore we argue that email hashing does not fall under GDPR’s definition of pseudonymisation.
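The difference a separately stored secret makes can be illustrated with a keyed hash. Note that this sketch swaps in HMAC-SHA256 rather than the encryption described above, purely because it is available in Python's standard library; the key, the candidate list and the addresses are all made up.

```python
import hashlib
import hmac
import secrets

candidates = ["alice@example.com", "bob@example.org"]
target = "bob@example.org"

# Unkeyed hash: anyone holding a candidate list can rebuild the identifier.
plain_id = hashlib.sha256(target.encode("utf-8")).hexdigest()
attacker_table = {
    hashlib.sha256(c.encode("utf-8")).hexdigest(): c for c in candidates
}
recovered_plain = attacker_table.get(plain_id)

# Keyed hash: the identifier also depends on a secret that is stored
# separately, so the attacker's table of unkeyed hashes no longer matches.
secret_key = secrets.token_bytes(32)  # hypothetical key, kept elsewhere
keyed_id = hmac.new(secret_key, target.encode("utf-8"), hashlib.sha256).hexdigest()
recovered_keyed = attacker_table.get(keyed_id)

print(recovered_plain, recovered_keyed)
```

Without the key, a list of known addresses is of no help; with a plain hash, the list is all that is needed.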

Hashed email addresses can be easily reversed and linked to an individual; therefore, they do not provide any significant protection for the data subjects. The existence of companies that reverse email hashes shows that calling hashed email addresses “anonymous”, “private”, “irreversible” or “de-identified” is misleading and promotes a false sense of privacy. If reversing email hashes were really impossible as claimed, it would cost more than 4 cents.

Even if hashed email addresses were not reversible, they could still be used to match, buy and sell your data between different parties, platforms or devices. As privacy scholars have already argued, when your online profile can be used to target, affect and manipulate you, keeping your true name or email address private may not bear so much significance [7].

Acknowledgements: We thank Brendan Van Alsenoy for his helpful comments.

End notes:

[1]: Hourly price for Amazon EC2 p3.16xlarge instance is $24.48 (as of March 2018).
[2]: HaveIBeenPwned does not share data from breaches, but leaked datasets can be found on underground forums, torrents and file sharing sites.
[3]: See also, Demir et al. The Pitfalls of Hashing for Privacy.
[4]: Article 32, The GDPR.
[5]: Article 6(4)(e), Article 25, Article 89(1), The GDPR.
[6]: Article 4(5), The GDPR.
[7]: See, for instance, “Big Data’s End Run around Anonymity and Consent” (Barocas and Nissenbaum, 2014) and “Singling Out People Without Knowing Their Names – Behavioural Targeting, Pseudonymous Data, and the New Data Protection Regulation” (Zuiderveen Borgesius, 2016).

I'd say something about salting your hashes, but c'mon. You cannot compete with cheap cloud compute and no incentives for people who store your data to make it HARDER to read.

Some of my favorite technical papers.

I've long been a fan of hosting paper reading groups, where a group of folks sit down and talk about interesting technical papers. One of the first steps to do that is identifying some papers worth chatting about, and here is a list of some papers I've seen lead to excellent discussions!

Dynamo: Amazon's Highly Available Key-value Store

Reading only the abstract, you'd be forgiven for not being overly excited about the Dynamo paper:

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an always-on experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

That said, this is in some senses "the" classic modern systems paper. It has happened more than once that an engineer I've met has only read a single systems paper in their career, and that paper was the Dynamo paper. This paper is a phenomenal introduction to eventual consistency, coordinating state across distributed storage, reconciling data as it diverges across replicas and much more.
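The "object versioning" mentioned in the abstract is built on vector clocks, which let a replica tell whether one version supersedes another or whether the two diverged and need application-level reconciliation. The following is a minimal sketch of that comparison, not Dynamo's actual code; the node names and the dictionary representation are assumptions made for illustration.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify the relationship between two versions' vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    # Neither version has seen the other's updates: the replicas diverged
    # and the application has to reconcile the two values.
    return "concurrent"


# Hypothetical history: node sx handles two writes, then sy and sz each
# accept a write on top of the same version while unable to see each other.
v2 = {"sx": 2}
v3 = {"sx": 2, "sy": 1}
v4 = {"sx": 2, "sz": 1}

print(compare(v3, v2))
print(compare(v3, v4))
```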

Hints for Computer System Design

Butler Lampson is an ACM Turing Award winner (among other awards), and worked at Xerox PARC. This paper concisely summarizes many of his ideas around system design, and is a great read.

In his words:

Studying the design and implementation of a number of computer [systems] has led to some general hints for system design. They are described here and illustrated by many examples, ranging from hardware such as the Alto and the Dorado to application programs such as Bravo and Star.

This paper itself acknowledges that it doesn't aim to break any new ground, but it's a phenomenal overview.

Big Ball of Mud

A reaction against exuberant papers about grandiose design patterns, this paper labels the most frequent architectural pattern as the Big Ball of Mud, and explores why elegant initial designs rarely remain intact as a system goes from concept to solution.

From the abstract:

While much attention has been focused on high-level software architectural patterns, what is, in effect, the de-facto standard software architecture is seldom discussed. This paper examines this most frequently deployed of software architectures: the BIG BALL OF MUD. A BIG BALL OF MUD is a casually, even haphazardly, structured system. Its organization, if one can call it that, is dictated more by expediency than design. Yet, its enduring popularity cannot merely be indicative of a general disregard for architecture.

Although humor certainly infuses this paper, it's also true that software design is remarkably poor, with very few systems having a design phase and few of those resembling the initial design (and documentation is rarely updated to reflect later decisions), making this an important topic for consideration.

The Google File System

From the abstract:

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Google has done something fairly remarkable in defining the technical themes in Silicon Valley and, at least debatably, across the technology industry for more than the last decade (only recently joined to a lesser extent by Facebook and Twitter as they reached significant scale), and it has done that largely through its remarkable technical papers. The Google File System (GFS) paper is one of the early entries in that strategy, and is also remarkable as the paper which largely inspired the Hadoop Distributed File System (HDFS).

On Designing and Deploying Internet-Scale Services

We don't always remember to consider Microsoft as one of the largest internet technology players, although increasingly Azure is making that comparison obvious and immediate, and it certainly wasn't a name that necessarily came to mind in 2007. This excellent paper from James Hamilton, exploring tips on building operable systems at extremely large scale, makes it clear that not considering Microsoft as a large internet player was a lapse in our collective judgement.

From the abstract:

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we've seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

This is a true checklist of how to design and evaluate large scale systems (much as The Twelve-Factor App aims to be a checklist for operable applications).

CAP Twelve Years Later: How the Rules Have Changed

Eric Brewer posited the CAP theorem in the early 2000s, and twelve years later he wrote this excellent overview and review of CAP (which argues distributed systems have to pick between either availability or consistency during partitions), in part because:

In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.

CAP is interesting because there is not a "seminal CAP paper", but this article serves well in such a paper's stead. These ideas are expanded on in the Harvest and Yield paper.

Harvest, Yield, and Scalable Tolerant Systems

This paper builds on the concepts from CAP Twelve Years Later, introducing the concepts of harvest and yield to add more nuance to the AP vs CP discussion:

The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today's Internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.

The harvest and yield concepts are particularly interesting because they are both self-evident and very rarely explicitly used; instead, distributed systems continue to fail in mostly undefined ways. Hopefully as we keep rereading this paper, we'll also start to incorporate its design concepts into the systems we subsequently build!
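As a rough illustration of the two metrics, the paper defines yield as the fraction of offered queries that are completed and harvest as the fraction of the complete data reflected in a response. The sketch below applies those ratios to a made-up scenario; the shard counts, query counts and the two degradation options are assumptions chosen only to show the tradeoff.

```python
def yield_fraction(queries_completed: int, queries_offered: int) -> float:
    """Yield: fraction of offered queries that were answered at all."""
    return queries_completed / queries_offered


def harvest_fraction(data_available: int, complete_data: int) -> float:
    """Harvest: fraction of the full data set reflected in an answer."""
    return data_available / complete_data


# Hypothetical system: data spread over 100 shards, one of which is down,
# with 1,000 queries offered during the outage.
total_shards, failed_shards, offered = 100, 1, 1000

# Option 1: answer every query from the 99 surviving shards.
degrade_harvest = (
    yield_fraction(offered, offered),
    harvest_fraction(total_shards - failed_shards, total_shards),
)

# Option 2: reject the queries that need the failed shard (assumed here to
# be 10 of the 1,000) and answer the rest from complete data.
degrade_yield = (
    yield_fraction(offered - 10, offered),
    harvest_fraction(total_shards, total_shards),
)

print("(yield, harvest) when degrading harvest:", degrade_harvest)
print("(yield, harvest) when degrading yield:  ", degrade_yield)
```

Naming the two numbers forces an explicit choice about which one a service gives up under failure, instead of leaving the behavior undefined.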

MapReduce: Simplified Data Processing on Large Clusters

The MapReduce paper is an excellent example of an idea which has been so successful that it now seems self-evident. The idea of applying the concepts of functional programming at scale became a clarion call, provoking a shift from data warehousing to a new paradigm for data analysis:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Much like the Google File System paper was an inspiration for the Hadoop Distributed File System (HDFS), this paper was itself a major inspiration for Hadoop.
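The programming model in the abstract is small enough to sketch in a single process. The word-count example below is a toy illustration of the map, group-by-key and reduce steps, not Google's or Hadoop's implementation; the function names and the two sample documents are made up.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Merge all intermediate values that share the same intermediate key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group intermediate values by key, as the framework's shuffle step does.
    grouped = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Apply the reduce function once per distinct intermediate key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())


documents = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The user writes only map_fn and reduce_fn; everything inside run_mapreduce is what a real framework distributes across machines.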

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

The Dapper paper introduces a performant approach to tracing requests across many services, which has become increasingly relevant as more companies refactor core monolithic applications into dozens or hundreds of micro-services.

From the abstract:

Here we introduce the design of Dapper, Google's production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie and X-Trace, but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

The ideas from Dapper have since made their way into open source, especially in Zipkin and OpenTracing.

Kafka: a Distributed Messaging System for Log Processing

Apache Kafka has become a core piece of infrastructure for many internet companies. Its versatility lends it to many roles, serving as the ingress point to "data land" for some, a durable queue for others, and that's just scratching the surface.

Not only a useful addition to your toolkit, Kafka is also a beautifully designed system:

Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.

In particular, Kafka's partitions do a phenomenal job of forcing application designers to make explicit tradeoffs between performance and predictable message ordering.
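A small sketch of that tradeoff: messages are ordered only within a partition, so routing by key keeps one key's messages in order while letting different keys spread across partitions. The CRC32-based partitioner below is a stand-in chosen for illustration, not the partitioner real Kafka clients use, and the keys and events are made up.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Assumption: a stable hash of the key modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(messages, num_partitions):
    """Append (key, value) messages to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in messages:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
logs = publish(messages, num_partitions=4)

# All of user-1's events sit in one partition, in the order they were sent.
user1_log = [v for k, v in logs[partition_for("user-1", 4)] if k == "user-1"]
print(user1_log)
```

More partitions allow more parallel consumers, but any ordering guarantee across keys that land in different partitions is given up.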

Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services

In many ways similar to Kafka, Facebook's Wormhole is another highly scalable approach to messaging:

Wormhole is a publish-subscribe (pub-sub) system developed for use within Facebook's geographically replicated datacenters. It is used to reliably replicate changes among several Facebook services including TAO, Graph Search and Memcache. This paper describes the design and implementation of Wormhole as well as the operational challenges of scaling the system to support the multiple data storage systems deployed at Facebook. Our production deployment of Wormhole transfers over 35 GBytes/sec in steady state (50 millions messages/sec or 5 trillion messages/day) across all deployments with bursts up to 200 GBytes/sec during failure recovery. We demonstrate that Wormhole publishes updates with low latency to subscribers that can fail or consume updates at varying rates, without compromising efficiency.

In particular, note the approach to supporting lagging consumers without sacrificing overall system throughput.

Borg, Omega, and Kubernetes

While the individual papers for each of Google's orchestration systems (Borg, Omega and Kubernetes) are worth reading in their own right, this article is an excellent overview of the three:

Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and built three different container-management systems in that time. Each system was heavily influenced by its predecessors, even though they were developed for different reasons. This article describes the lessons we've learned from developing and operating them.

Fortunately, not all orchestration happens under Google's aegis, and Mesos' alternative two-layer scheduling architecture is a fascinating read as well.

Large-scale cluster management at Google with Borg

Borg has been orchestrating much of Google's infrastructure for quite some time (significantly predating Omega, although fascinatingly the Omega paper predates the Borg paper by two years):

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

This paper takes a look at Borg's centralized scheduling model, which was both effective and efficient, although it became increasingly challenging to modify and scale over time, inspiring both Omega and Kubernetes within Google (the former to optimistically replace it, and the latter seemingly to commercialize their learnings, or at least prevent Mesos from capturing too much mindshare).

Omega: flexible, scalable schedulers for large compute clusters

Omega is, among many other things, an excellent example of the second-system effect, where an attempt to replace a complex existing system with something far more elegant ends up being more challenging than anticipated.

In particular, Omega is a reaction against the realities of extending the aging Borg system:

Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control.

Perhaps this is also an example of "worse is better" once again carrying the day.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

This paper describes the design of Apache Mesos, in particular its distinctive two-level scheduler:

We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

Used heavily by Twitter and Apple, Mesos was for some time the only open-source general scheduler with significant adoption, and is now in a fascinating competition for mindshare with Kubernetes.

Design patterns for container-based distributed systems

The move to container-based deployment and orchestration has introduced a whole new vocabulary, with terms like sidecars and adapters, and this paper provides a survey of the patterns which have evolved over the past decade as microservices and containers have become increasingly prominent infrastructure components:

In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building of applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers are particularly well-suited as the fundamental object in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason – thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and algorithms.

The "sidecar" term in particular likely originated in this blog post from Netflix, which is a worthy read in its own right.

Raft: In Search of an Understandable Consensus Algorithm

We often see the second-system effect, in which a second system becomes bloated and complex relative to a simple initial system, but the roles are reversed in the case of Paxos and Raft. Whereas Paxos is often considered beyond human comprehension, Raft is a fairly easy read:

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

Raft is used by etcd and InfluxDB among many others.
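The safety arguments in Raft lean on a simple property of majorities: any two majorities of the same cluster share at least one server, so a decision accepted by one majority is visible to any other. The sketch below checks that property by brute force for a small, hypothetical cluster; it illustrates the quorum arithmetic only and is not an implementation of Raft.

```python
from itertools import combinations


def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def all_majorities_overlap(servers) -> bool:
    """Check that every pair of minimal majorities shares a server."""
    quorum = majority(len(servers))
    quorums = [set(q) for q in combinations(servers, quorum)]
    return all(a & b for a, b in combinations(quorums, 2))


cluster = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical five-server cluster
print(majority(len(cluster)))
print(all_majorities_overlap(cluster))
```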

Paxos Made Simple

One of Leslie Lamport's numerous influential papers, Paxos Made Simple is a gem both in explaining the notoriously complex Paxos algorithm, and because even at its simplest, Paxos isn't really that simple:

The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm, the synod algorithm. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system, an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems.

Paxos itself remains a deeply innovative concept, and is the algorithm behind Google's Chubby, among many others; Apache ZooKeeper uses the closely related Zab protocol.

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

While the majority of consensus algorithms focus on being consistent during partitions, SWIM goes the other direction and focuses on availability:

Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs.

SWIM is used in HashiCorp's software, as well as Uber's Ringpop.

The Byzantine Generals Problem

Another classic Leslie Lamport paper on consensus, the Byzantine Generals Problem explores how to deal with distributed actors which intentionally or accidentally submit incorrect messages:

Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

The paper is mostly focused on the formal proof, a bit of a theme from Lamport who developed TLA+ to make formal proving easier, but is also a useful reminder that we still tend to assume our components will behave reliably and honestly, and perhaps we shouldn't!
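The oral-messages bound from the abstract is easy to turn into arithmetic: requiring more than two-thirds of the generals to be loyal is the same as requiring n > 3m for n generals and m traitors. The helper names in this sketch are made up for illustration.

```python
def tolerates(num_generals: int, num_traitors: int) -> bool:
    """Oral-message bound: more than two-thirds of the generals are loyal."""
    loyal = num_generals - num_traitors
    return 3 * loyal > 2 * num_generals


def max_traitors(num_generals: int) -> int:
    """Largest number of traitors the loyal generals can withstand."""
    return max((num_generals - 1) // 3, 0)


# Three generals cannot cope with a single traitor, but four can.
print(tolerates(3, 1), tolerates(4, 1))
for n in (4, 7, 10):
    print(n, "generals tolerate up to", max_traitors(n), "traitor(s)")
```

This is the source of the familiar 3f + 1 sizing rule for systems that must survive f Byzantine faults.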

Out of the Tar Pit

Out of the Tar Pit bemoans unnecessary complexity in software, and proposes that functional programming and better data modeling can help us reduce accidental complexity (arguing that most unnecessary complexity comes from state).

From the abstract:

Complexity is the single major difficulty in the successful development of large-scale software systems. Following Brooks we distinguish accidental from essential difficulty, but disagree with his premise that most complexity remaining in contemporary systems is essential. We identify common causes of complexity and discuss general approaches which can be taken to eliminate them where they are accidental in nature. To make things more concrete we then give an outline for a potential complexity-minimizing approach based on functional programming and Codd's relational model of data.

Certainly a good read, although reading a decade later it's fascinating to see that neither of those approaches has particularly taken off, and instead the closest "universal" approach to reducing complexity seems to be the move to numerous mostly stateless services, which is perhaps more a reduction of local complexity, at the expense of larger systemic complexity, whose maintenance is then delegated to more specialized systems engineers.

(This is yet another paper that makes me wish TLA+ felt natural enough to be a commonly adopted tool.)

The Chubby lock service for loosely-coupled distributed systems

Distributed systems are hard enough without having to frequently reimplement Paxos or Raft, and the model proposed by Chubby is to implement consensus once in a shared service, which will allow systems built upon it to share in the resilience of distribution by following greatly simplified patterns.

From the abstract:

We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

In the open source world, ZooKeeper plays the same role as Chubby in projects like Kafka and Mesos.

Bigtable: A Distributed Storage System for Structured Data

One of Google's preeminent papers and technologies is Bigtable, which was an early (early in the internet era, anyway) NoSQL datastore, operating at extremely high scale and built on top of Chubby.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

From the SSTable design to the bloom filters, Cassandra inherits significantly from the Bigtable paper, and is probably rightfully considered a merging of the Dynamo and Bigtable papers.
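For readers who haven't met bloom filters, here is a minimal sketch of the idea: a compact structure that can say "definitely not present" or "possibly present", which lets a store skip reading an on-disk table that cannot contain a key. The class name, sizes, and hashing scheme below are my own illustrative choices, not taken from Bigtable or Cassandra.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy bloom filter; parameters are illustrative, not tuned."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several positions by salting one hash with a seed (an assumption
        # made for simplicity; production filters use cheaper hash functions).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    table_filter = BloomFilter()
    table_filter.add("com.example/index.html")
    print(table_filter.might_contain("com.example/index.html"))
```

The useful property is the asymmetry: a negative answer is always safe to act on, so the expensive disk read happens only when the filter says the key might be there.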

Spanner: Google's Globally-Distributed Database

Many early NoSQL storage systems traded strong consistency for increased resiliency, but building on top of eventually consistent systems can be harrowing. Spanner represents Google's approach to offering both strong consistency and distributed reliability, relying in part on a novel approach to managing time.

Spanner is Google's scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

We haven't seen any open-source Spanner equivalents yet, but I imagine we'll start seeing them in 2017.
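To make "a novel time API that exposes clock uncertainty" a little more concrete, here is a rough sketch of a clock that returns an interval rather than a single timestamp, plus a wait loop that holds a commit until its timestamp has definitely passed. The class names, the fixed epsilon, and the injectable time source are all my assumptions for illustration; this is not Spanner's implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # true time is assumed to be no earlier than this
    latest: float    # ... and no later than this


class UncertainClock:
    """Hypothetical clock with a fixed, known error bound (epsilon)."""

    def __init__(self, epsilon_seconds: float = 0.005,
                 source: Callable[[], float] = time.time) -> None:
        self.epsilon = epsilon_seconds
        self._source = source

    def now(self) -> TimeInterval:
        reading = self._source()
        return TimeInterval(reading - self.epsilon, reading + self.epsilon)

    def after(self, timestamp: float) -> bool:
        # True only once the timestamp is in the past even in the worst case.
        return self.now().earliest > timestamp


def commit_wait(clock: UncertainClock,
                sleep: Callable[[float], None] = time.sleep) -> float:
    """Pick a commit timestamp, then wait out the uncertainty before returning."""
    commit_timestamp = clock.now().latest
    while not clock.after(commit_timestamp):
        sleep(clock.epsilon / 10)
    return commit_timestamp


if __name__ == "__main__":
    print(commit_wait(UncertainClock()))
```

The point of the sketch is the trade: the smaller the clock's error bound, the shorter every writer has to wait, which is why the time infrastructure matters so much to the design.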

Security Keys: Practical Cryptographic Second Factors for the Modern Web

Security keys like the YubiKey have emerged as the most secure second authentication factor, and this paper out of Google explains the motivations that led to their creation, and the design that makes them work.

From the abstract:

Security Keys are second-factor devices that protect users against phishing and man-in-the-middle attacks. Users carry a single device and can self-register it with any online service that supports the protocol. The devices are simple to implement and deploy, simple to use, privacy preserving, and secure against strong attackers. We have shipped support for Security Keys in the Chrome web browser and in Google's online services. We show that Security Keys lead to both an increased level of security and user satisfaction by analyzing a two year deployment which began within Google and has extended to our consumer-facing web applications. The Security Key design has been standardized by the FIDO Alliance, an organization with more than 250 member companies spanning the industry. Currently, Security Keys have been deployed by Google, Dropbox, and GitHub.

They're also remarkably cheap! Order a few and start securing your life in a day or two.

BeyondCorp: Design to Deployment at Google

Building on the original BeyondCorp paper in 2014, this paper is slightly more detailed and benefits from two more years of migration-fueled wisdom. That said, the big ideas have remained fairly consistent and there is not much new relative to the BeyondCorp paper itself (although that was a fantastic paper, and if you haven't read it, this is an equally good starting point):

The goal of Google's BeyondCorp initiative is to improve our security with regard to how employees and devices access internal applications. Unlike the conventional perimeter security model, BeyondCorp doesn't gate access to services and tools based on a user's physical location or the originating network; instead, access policies are based on information about a device, its state, and its associated user. BeyondCorp considers both internal networks and external networks to be completely untrusted, and gates access to applications by dynamically asserting and enforcing levels, or tiers, of access.

As is often the case when reading Google papers, my biggest takeaway here is wondering when we'll start to see reusable, pluggable open source versions of the techniques described within.
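The core idea in the abstract (access decided from the device, its state, and the user, never from the network) can be sketched in a few lines. Everything here is hypothetical: the tier names, the device attributes, and the group check are invented for illustration and are not Google's actual policy model.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Set


class Tier(IntEnum):
    UNTRUSTED = 0
    BASIC = 1
    PRIVILEGED = 2


@dataclass(frozen=True)
class Device:
    managed: bool         # known to the (hypothetical) device inventory
    disk_encrypted: bool
    patched: bool


def device_tier(device: Device) -> Tier:
    """Derive a trust tier from device state alone."""
    if not device.managed:
        return Tier.UNTRUSTED
    if device.disk_encrypted and device.patched:
        return Tier.PRIVILEGED
    return Tier.BASIC


def may_access(user_groups: Set[str], device: Device,
               required_tier: Tier, allowed_groups: Set[str]) -> bool:
    # Note what is absent: no IP address, VPN flag, or office location.
    return device_tier(device) >= required_tier and bool(user_groups & allowed_groups)


if __name__ == "__main__":
    laptop = Device(managed=True, disk_encrypted=True, patched=True)
    print(may_access({"eng"}, laptop, Tier.PRIVILEGED, {"eng", "sre"}))
```

Because the tier is computed from current device state, the same laptop can lose access when it falls behind on patches, which is the "dynamically asserting" part of the abstract.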

Availability in Globally Distributed Storage Systems

This paper explores how to think about availability in replicated distributed systems, and is a useful starting point for those of us who are trying to determine the correct way to measure uptime for our storage layer or any other sufficiently complex system.

From the abstract:

We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

Particularly interesting is the focus on correlated failures, building on the premise that users of distributed systems only experience the failure when multiple components have overlapping failures. Another expected but reassuring observation is that at Google's scale (and with resources distributed across racks and regions), most failure comes from tuning and system design, not from the underlying hardware.

I was also surprised by how simple their definition of availability was in this case:

A storage node becomes unavailable when it fails to respond positively to periodic health checking pings sent by our monitoring system. The node remains unavailable until it regains responsiveness or the storage system reconstructs the data from other surviving nodes.

Often discussions of availability become arbitrarily complex ("really it should be response rates are over X, but with correct results and within our latency SLO!"), and it's reassuring to see the simplest definitions are still usable.
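As a sketch of how little machinery the simple definition needs, the following turns a series of health-check results into unavailable periods and an availability ratio. The input shape (a time-sorted list of timestamp and success pairs) and the function names are my assumptions, and the sketch ignores the "data reconstructed elsewhere" exit condition from the quoted definition.

```python
from typing import List, Tuple

Ping = Tuple[float, bool]  # (timestamp in seconds, responded positively?)


def unavailable_periods(pings: List[Ping]) -> List[Tuple[float, float]]:
    """Return (start, end) spans from the first failed ping to the next good one."""
    periods = []
    down_since = None
    for timestamp, ok in pings:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            periods.append((down_since, timestamp))
            down_since = None
    if down_since is not None:
        # Still down at the end of the observation window.
        periods.append((down_since, pings[-1][0]))
    return periods


def availability(pings: List[Ping]) -> float:
    """Fraction of the observed window during which the node was available."""
    if len(pings) < 2 or pings[-1][0] <= pings[0][0]:
        raise ValueError("need at least two pings spanning a positive duration")
    window = pings[-1][0] - pings[0][0]
    downtime = sum(end - start for start, end in unavailable_periods(pings))
    return 1 - downtime / window


if __name__ == "__main__":
    samples = [(0, True), (10, False), (20, False), (30, True), (40, True)]
    print(unavailable_periods(samples), availability(samples))
```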

Still All on One Server: Perforce at Scale

As a company grows, code hosting performance becomes one of the critical factors in overall developer productivity (along with build and test performance), but it's a topic that isn't discussed frequently. This paper from Google discusses their experience scaling Perforce:

Google runs the busiest single Perforce server on the planet, and one of the largest repositories in any source control system. From this high-water mark this paper looks at server performance and other issues of scale, with digressions into where we are, how we got here, and how we continue to stay one step ahead of our users.

This paper is particularly impressive when you consider the difficulties that companies run into scaling Git monorepos (talk to an ex-Twitter employee near you for war stories).

Large-Scale Automated Refactoring Using ClangMR

Large code bases tend to age poorly, especially in the case of monorepos shared by hundreds or thousands of teams collaborating on different projects. This paper covers one of Google's attempts to reduce the burden of maintaining their large monorepo through tooling that makes it easy to rewrite abstract syntax trees (ASTs) across the entire codebase.

From the abstract:

In this paper, we present a real-world implementation of a system to refactor large C++ codebases efficiently. A combination of the Clang compiler framework and the MapReduce parallel processor, ClangMR enables code maintainers to easily and correctly transform large collections of code. We describe the motivation behind such a tool, its implementation and then present our experiences using it in a recent API update with Google's C++ codebase.

Similar work is being done with Pivot.
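ClangMR operates on C++ through Clang, but the underlying move (match a pattern in the AST, rewrite the node, emit source) is easy to show in miniature with Python's standard `ast` module. This is an analogy of my own, not ClangMR's API: the `RenameCall` class and `rewrite_source` helper are hypothetical, and `ast.unparse` (Python 3.9+) discards comments and original formatting, which a production tool would have to preserve.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite calls to one function name into calls to another."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # rewrite nested calls first
        # Only touch the callee position, so bare references are left alone.
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            replacement = ast.Name(id=self.new_name, ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
        return node


def rewrite_source(source: str, old_name: str, new_name: str) -> str:
    tree = RenameCall(old_name, new_name).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(rewrite_source("x = old_api(1) + other(old_api)", "old_api", "new_api"))
```

Working on the tree rather than on text is what makes the rewrite safe: the call is renamed, while the same identifier passed as a plain argument is not, a distinction a regular expression would struggle to make.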

Source Code Rejuvenation is not Refactoring

This paper introduces the concept of "Code Rejuvenation", a unidirectional process of moving towards cleaner abstractions as new language features and libraries become available, which is particularly applicable to sprawling, older code bases.

From the abstract:

In this paper, we present the notion of source code rejuvenation, the automated migration of legacy code and very briefly mention the tools we use to achieve that. While refactoring improves structurally inadequate source code, source code rejuvenation leverages enhanced program language and library facilities by finding and replacing coding patterns that can be expressed through higher-level software abstractions. Raising the level of abstraction benefits software maintainability, security, and performance.

There are some strong echoes of this work in Google's ClangMR paper.

Searching for Build Debt: Experiences Managing Technical Debt at Google

This paper offers interesting coverage of how to perform large-scale migrations in living codebases. Using broken builds as the running example, they break down their strategy into three pillars: automate, make it easy to do the right thing, and make it hard to do the wrong thing.

From the abstract:

With a large and rapidly changing codebase, Google software engineers are constantly paying interest on various forms of technical debt. Google engineers also make efforts to pay down that debt, whether through special Fixit days, or via dedicated teams, variously known as janitors, cultivators, or demolition experts. We describe several related efforts to measure and pay down technical debt found in Google's BUILD files and associated dead code. We address debt found in dependency specifications, unbuildable targets, and unnecessary command line flags. These efforts often expose other forms of technical debt that must first be managed.

No Silver Bullet - Essence and Accident in Software Engineering

A seminal paper from the author of The Mythical Man Month, "No Silver Bullet" expands on the discussion of accidental versus essential complexity, and argues that too little accidental complexity remains for any individual reduction in it to significantly increase engineer productivity.

From the abstract:

Most of the big past gains in software productivity have come from removing artificial barriers that have made the accidental tasks inordinately hard, such as severe hardware constraints, awkward programming languages, lack of machine time. How much of what software engineers now do is still devoted to the accidental, as opposed to the essential? Unless it is more than 9/10 of all effort, shrinking all the accidental activities to zero time will not give an order of magnitude improvement.

Interestingly, I think we do see accidental complexity in large codebases grow large enough that order-of-magnitude improvements become possible (motivating, for example, Google's investments into ClangMR and such), so perhaps we're not quite as far ahead in the shift to essential complexity as we'd like to believe.
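The 9/10 figure in the abstract is plain arithmetic: if a fraction f of all effort is accidental and you eliminate it entirely, the remaining work is 1 - f, so the best possible speedup is 1 / (1 - f). A small sketch (the function name is mine):

```python
def max_speedup(accidental_fraction: float) -> float:
    """Best-case speedup if all accidental effort shrinks to zero time."""
    if not 0 <= accidental_fraction < 1:
        raise ValueError("accidental_fraction must be in [0, 1)")
    return 1 / (1 - accidental_fraction)


if __name__ == "__main__":
    for fraction in (0.5, 0.75, 0.9):
        # Half accidental gives 2x, three quarters gives 4x, and it takes
        # nine tenths to reach roughly the order of magnitude Brooks mentions.
        print(fraction, max_speedup(fraction))
```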

The UNIX Time-Sharing System

This paper describes the fundamentals of UNIX as of 1974, and what is truly remarkable is how many of the design decisions are still used today. From the permission model we've all manipulated with chmod to system calls used to manipulate files, it's amazing how much remains intact.

From the abstract:

UNIX is a general-purpose, multi-user, interactive operating system for the Digital Equipment Corporation PDP-11/40 and 11/45 computers. It offers a number of features seldom found even in larger operating systems, including: (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and inter-process I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages. This paper discusses the nature and implementation of the file system and of the user command interface.

Also fascinating is their observation that UNIX has in part succeeded because it was designed by its authors to solve a general problem (working with the PDP-7 was frustrating), rather than to meet a more narrowly specified goal.
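To see how much of that interface survives, here is a small sketch that uses Python's thin wrappers over the file system calls (open, write, read, close) and the permission model behind chmod. It assumes a POSIX system; the file name and the `demo` helper are made up for the example.

```python
import os
import stat
import tempfile
from typing import Tuple


def demo() -> Tuple[str, bytes]:
    """Create a file with raw system calls, restrict it to its owner, read it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "notes.txt")

        # open/write/close map closely onto the calls the paper describes.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, b"hello\n")
        finally:
            os.close(fd)

        # Equivalent to `chmod 600 notes.txt`: read and write for the owner only.
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

        fd = os.open(path, os.O_RDONLY)
        try:
            contents = os.read(fd, 1024)
        finally:
            os.close(fd)

        # Render the mode bits the way `ls -l` would.
        mode = stat.filemode(os.stat(path).st_mode)
    return mode, contents


if __name__ == "__main__":
    print(demo())
```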

15 days ago
Ok, *tons* of good reading in here
Waterloo, Canada

Mark Zuckerberg’s Fourteen-Year Apology Tour

1 Share

Zeynep Tufekci, writing in Wired:

Facebook’s 2 billion users are not Facebook’s “community.” They are its user base, and they have been repeatedly carried along by the decisions of the one person who controls the platform. These users have invested time and money in building their social networks on Facebook, yet they have no means to port the connectivity elsewhere. Whenever a serious competitor to Facebook has arisen, the company has quickly copied it (Snapchat) or purchased it (WhatsApp, Instagram), often at a mind-boggling price that only a behemoth with massive cash reserves could afford. Nor do people have any means to completely stop being tracked by Facebook. The surveillance follows them not just on the platform, but elsewhere on the internet — some of them apparently can’t even text their friends without Facebook trying to snoop in on the conversation. Facebook doesn’t just collect data itself; it has purchased external data from data brokers; it creates “shadow profiles” of nonusers and is now attempting to match offline data to its online profiles.

Again, this isn’t a community; this is a regime of one-sided, highly profitable surveillance, carried out on a scale that has made Facebook one of the largest companies in the world by market capitalization.

As is often the case with one of Tufekci’s pieces, this is a must-read in full. I pulled the above quote because I think it illustrates the depth and breadth of Facebook’s business model and its intrusiveness in the public sphere, even among those who are not registered users. I don’t think it’s possible to grasp the scale of their power and influence, but Tufekci comes close.

16 days ago
Waterloo, Canada