Paywalls: if you’re doing them on the client side, you’re going to have a bad time

November 20, 2012

Helsingin Sanomat, the largest newspaper in the Nordic countries, has decided to put up a paywall, or pay gate, for its online readers. You get five free articles per week, and after that you’re hit with a message telling you to pay for more. Now, I’m not going into the debate about whether or not this is a good thing; I understand the necessity behind the decision, journalists need to eat too.

But what I am going to talk about is their implementation of the paywall. The way they do it, and they’re not the first I’ve encountered doing so, is that they hide the content instead of not rendering it at all. Okay, that’s not completely true: they do render it at first, but then they check the user’s article counter and act on that number. If it’s over 5, they put display: none on the article’s parent div and show a notification with more information on how to pay for the content.
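
In rough terms, the client-side logic boils down to something like the sketch below. To be clear, this is my simplified reconstruction, not their actual code; the counter lookup and class names are placeholders.

// Simplified reconstruction (not their actual code, names are placeholders):
// the article is already in the markup, the script only decides whether to hide it
var freeAmount = 5; // matches the freeAmount value in their config further down
var articlesRead = Number(localStorage.articlesRead || 0); // placeholder for their article-history lookup
if (articlesRead > freeAmount) {
	var article = document.querySelector('.article-parent');            // placeholder selector
	var notification = document.querySelector('.paywall-notification'); // placeholder selector
	if (article) article.style.display = 'none';   // hide the already-downloaded content
	if (notification) notification.style.display = 'block';
}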

There are a few reasons why this is not optimal. The first, which is kind of obvious, is that the content is still accessible to the client. It wouldn’t take long for even a beginner in Javascript to whip up a Greasemonkey script that resets either the cookies or the parent div’s style. The second reason is that because they hide the content with Javascript, they force the browser to redraw almost the complete page twice: once when the page has loaded and a second time when they hide the content. And lastly, because the article is sent over the wire anyway, the client has to download more data than it actually needs, which is a penalty not only for mobile users on slow connections but also for Helsingin Sanomat itself. I’m pretty sure they pay for their bandwidth, and for a site of their size, the sixth largest in Finland, those bytes are going to add up.
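
To give a taste of how little effort that first point takes, here is roughly what such a user script could look like. The class name is a placeholder; I’m only illustrating the principle.

// ==UserScript==
// @name     Unhide hidden articles (illustration only)
// @include  http://www.hs.fi/*
// ==/UserScript==

// Undo the display:none the paywall script sets on the article's parent div.
// The class name below is a placeholder, not their actual markup.
var hidden = document.querySelectorAll('.article-parent');
for (var i = 0; i < hidden.length; i++) {
	hidden[i].style.display = '';
}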

So how do they do it?

First they render out the page without caring who the recipient is. The markup contains a few inline Javascript sections, and these sections hold data about the user: whether it’s a paying, premium user or a normal, non-paying one. They give away a lot of information in their hs object in the window scope. I’ve included the data that we’re particularly interested in:

{
	paywall:{
		defaults:{
			daysBack:7,
			freeAmount:5,
			firstWarning:3,
			secondWarning:1,
			referrerBypassList:[]
		},
		instance:{
			daysBack:7,
			freeAmount:5,
			firstWarning:1,
			secondWarning:0,
			referrerBypassList:[
				"facebook.com",
				"nelonen.fi",
				"iltasanomat.fi",
				"twitter.com",
				"taloussanomat.fi",
				"metro.fi"
			]
		}
	}
}

They store the data, complete with article history, in localStorage (if the browser supports it); you can see how they do it in this file. The file tells us that they save the data either in a cookie or in localStorage under the key _hs_hist. This means that we can modify the data to our heart’s content without anybody being able to know about it. The easiest way to bypass this “paywall” is to just reset your localStorage.

// Reset the stored article history; localStorage only stores strings,
// so remove the _hs_hist key rather than assigning an object to it
window.localStorage.removeItem('_hs_hist');
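
Since the script can fall back to a cookie when localStorage isn’t available, a thorough reset would clear that too. The cookie name and attributes below are assumptions of mine, mirroring the localStorage key.

// Hedged sketch: clear a possible _hs_hist cookie fallback as well; the exact
// cookie name, path and domain are assumptions based on the localStorage key
document.cookie = '_hs_hist=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/';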

Another interesting thing they’ve included in the script is a referrerBypassList. This array contains a list of domains, and if your request has one of these domains as the referrer, you bypass the warning. The referrer is part of the HTTP request header sent with every request; it tells the server (or in this case, the browser) where the user came from. To get the referrer they probably just read the global variable document.referrer. I would guess they’ve included this so that somebody who clicks a link on Facebook or one of the other listed domains can still read the article, even if they’ve already read more than five articles that week.

So let’s see how the function works:

function canbypass( ref ) {
	if (!ref || ref === "") return false;
	var allowedDomains = referrerBypassList; // the instance.referrerBypassList shown above
	for (var i = 0; i < allowedDomains.length; i++) {
		if (ref.indexOf(allowedDomains[i]) != -1) return true;
	}
	return false;
}
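
A couple of example calls make the behaviour concrete (the results assume the referrerBypassList from the instance config above):

canbypass("https://www.facebook.com/somebody/posts/123");  // true, facebook.com is in the list
canbypass("http://facebook.com.example.org/anything");     // also true, it's a plain substring match
canbypass("https://www.google.com/search?q=hs.fi");        // false, google.com isn't in the list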

As we can see, they basically only check whether one of the listed domains appears somewhere in the referrer string. To test this, install the RefControl plugin and change the referrer for hs.fi.

And there you go, you are now able to read as many articles as you want without writing a single line of code.

So the biggest question left to answer is this: why do they send everything to the user and rely on client-side security? They’ve gone to great lengths to secure the page on the client, even though it falls flat the moment somebody tries to get around it. If I had to guess, I’d say there are a couple of reasons. They want to keep their current PageRank at Google: almost all search engines have bots, or spiders, that crawl the internet and download pages, which are then indexed into a database. That database is queried every time a user searches for something, and if the spiders were limited to five articles and then hit a complete stop, Helsingin Sanomat’s articles wouldn’t make it to the top of the search results. Another reason could be that they’re behind an HTTP accelerator like Varnish or Nginx, which makes it a lot easier to just serve the same cached content to everybody.

So what can one do to really secure the content from non-paying readers? There are probably lots of ways to do it, some easier, some harder. I’d opt for a user-agent filter combined with an IP filter on the server. The user-agent is a string sent in the request header, telling the server what browser you’re using; Google’s bots, for example, report with the user-agent string “Googlebot/2.1 (+http://www.google.com/bot.html)”. The user-agent in itself is insecure as a filter, because just like the referrer, you can spoof it with readily available browser add-ons. And that’s where the IP filter comes in.

I’d log all requests with both their user-agent and their IP, and if an IP with a specific user-agent goes over the five-article limit, I’d start serving the paywall. Some users would be unfairly blocked, like schools or libraries where multiple computers share identical user-agents behind the same IP, but we did want it to be secure, right?
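
A minimal sketch of that idea, assuming a Node.js/Express-style server; the route, the rendering helpers and the in-memory store are placeholders of mine, and a real implementation would persist the counts and reset them weekly.

// Sketch only: count article views per user-agent + IP combination on the server
// and stop sending the article body once the free quota is used up.
var express = require('express');
var app = express();

var FREE_ARTICLES_PER_WEEK = 5;
var counts = {}; // "user-agent|ip" -> number of articles served this week

app.get('/article/:id', function (req, res) {
	var key = (req.headers['user-agent'] || '') + '|' + req.ip;
	counts[key] = (counts[key] || 0) + 1;

	if (counts[key] > FREE_ARTICLES_PER_WEEK) {
		// The article body is never sent, so there is nothing on the client to unhide
		res.send(renderPaywallPage());
		return;
	}
	res.send(renderArticle(req.params.id));
});

// Placeholders standing in for the real rendering code
function renderArticle(id) { return 'full article ' + id; }
function renderPaywallPage() { return 'you have read your five free articles this week'; }

app.listen(3000);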

Now that we’ve handled the normal users, we have to look at the indexing bots, because as stated earlier, we don’t want to limit how many pages they can index per week. So every time we get a request with a Googlebot (or similar) user-agent, we’d do a lookup on the IP address. If the owner of the IP address is Google, we’d let it through. We would also flag that user-agent/IP combination for future reference (we don’t want to do a lookup for every request they make).
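
Again a sketch, assuming Node.js: a request whose user-agent claims to be Googlebot gets a reverse DNS lookup on its IP, and the verdict is cached so the lookup is only done once per IP. The function name and cache are mine; a stricter check would also resolve the hostname back and compare it with the IP.

// Sketch only: verify that a request claiming to be Googlebot really comes from Google
var dns = require('dns');

var verifiedBots = {}; // ip -> true/false, the "flag for future reference"

function isRealGooglebot(ip, callback) {
	if (ip in verifiedBots) {
		callback(verifiedBots[ip]);
		return;
	}
	dns.reverse(ip, function (err, hostnames) {
		var ok = !err && hostnames.some(function (host) {
			// Google's crawlers resolve to hosts under googlebot.com or google.com
			return /\.(googlebot|google)\.com$/.test(host);
		});
		verifiedBots[ip] = ok; // cache the verdict for future requests
		callback(ok);
	});
}

// Usage inside the request handler sketched above (hypothetical):
// if (/Googlebot/.test(req.headers['user-agent'] || '')) {
//     isRealGooglebot(req.ip, function (ok) { /* skip the counter if ok */ });
// }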

By doing it this way, you can be pretty sure that the paywall is secure against all client-side hacks.
