Using Cookies To Require Users To Visit An Introduction Page On Your Web Site

Asked recently on Yahoo! Answers:

If somebody clicks a link to my site, is there a way to redirect them to an intro page?
And then redirect them back to the page they were trying to view once they click something. What code would that require? thanks
I meant, like if somebody finds a link to my site on google it wouldn’t take them to the intro page, it would take them to whatever link they clicked. I want a way to force them to go to the intro page whenever they go to the website, and then once they click another link, allow them continue to the exact spot they wanted to go to when they first clicked the google link. I’m thinking that I would need to implement cookies or server side scripting. I just need someone to point me in the right direction and i can figure it out from there. Thanks again.

While I respect the questioner’s desire to do this himself, I’m going to actually write the code to do this in PHP. It’s the questioner’s own fault for posing an interesting problem that is practical and somewhat complicated, but within the reach of my intended audience — newer Web programmers.

Executive summary: We’ll use a cookie, set by the “intro” / “splash” page alone, to check if a visitor has yet seen the introduction page. If not, we will send him to the introduction page, along with a GET variable that records the current page; once a checkbox and button have been clicked on the introduction page, we’ll return the user to the intended page.

There are two important considerations in this script: one is to ensure that search engine indexing spiders, such as Googlebot and Yahoo!, can bypass the intro page requirement; the other is, for all other users, to check for the intro page visit cookie before the “link” page does anything — connects to a database, processes other data, etc. — and especially before any headers are sent to the Web browser.

An Aside On Headers

An HTTP header is basically a bunch of preliminary information a Web server sends to a Web browser that describes the document the server is sending. For example, an HTTP header typically contains, at minimum, what kind of document it is, how big it is, when it was last modified and the HTTP status code returned by the request. This information is sent just before the document itself, so that the Web browser knows what kind of file it is receiving and how to render the file being sent.

Imagine a basic Web page with an external CSS stylesheet and a half-dozen pictures. Not only does the Web server send an HTTP header in advance of the Web page; it also sends headers for the stylesheet and each of the images.

Because we want to ensure the introduction page has been seen before any other page of our Web site can be viewed in any given visit, we need to check for the cookie before any headers are sent.

Once a header is sent, the Web browser expects to see the document described by that header. We can’t send headers telling the Web browser, “I am sending this page”, then actually send a different page. PHP will specifically prevent us from doing this by sending the original file anyway, and giving us a warning, such as this:

Warning: Cannot modify header information - headers already sent by (output started at /home/public_html/index.php:27) in /home/public_html/index.php on line 45

(The message above tells us that at line 27 of the code in index.php, we had a statement that required PHP to send headers to the Web browser [almost always, this is an echo / print statement, or some other statement that outputs text / HTML]. But on line 45, we tried to add or amend a header statement, which we can’t do. Once the headers have been sent, they can’t be changed.)
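
To make the ordering concrete, here is a minimal sketch of a redirect that refuses to run once output has started. The function name intro_redirect_line() and the index.php target are assumptions for illustration only; headers_sent() and header() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of the original script): returns the
// header line to send, or null when output has already started and a
// redirect is no longer possible.
function intro_redirect_line($target, $outputStarted)
{
    if ($outputStarted) {
        return null;
    }
    return 'Location: ' . $target;
}

// Usage on a real page, before any echo / print / HTML:
//
//   $line = intro_redirect_line('index.php', headers_sent($file, $lineNo));
//   if ($line === null) {
//       error_log("Too late to redirect; output started at $file:$lineNo");
//   } else {
//       header($line);
//       exit;
//   }
```

When called with two variables, headers_sent() fills them with the file and line where output began, which is the same information the warning reports.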

Two Very Important Security Caveats

I would be remiss if I did not mention that neither cookies nor query strings are secure, and we’ll be using both.

It is very easy for users to override your cookie expiration requests, to forge cookies, give them to other users, read their contents, etc. For the purposes of this application, that insecurity should be OK; but don’t count on users not tampering with your cookies.

Also remember that a number of users still stuck in late-1990s Web paranoia will not enable cookies, and we’re not checking for that case. If you encounter such a user with this solution, the only page of your site that will be visible is the splash page; the user simply will not be able to leave that page, because every other page will keep bouncing him back to it. (I’m leaving it that way because anyone so paranoid as to not accept a first-party cookie isn’t someone I want browsing my Web site anyway.)
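
If you would rather not trap those users, one possible approach (not used in the script in this article) is to have the intro page add a marker to the URL it redirects to, so the next page can tell "cookie refused" apart from "intro not yet seen". The sketch below is only the decision logic; the function name intro_state() and the cookietest flag are assumptions of mine.

```php
<?php
// Hypothetical three-way check. 'cookietest' is an assumed query-string
// flag the intro page would append to its redirect after calling setcookie().
function intro_state(array $cookies, array $query)
{
    if (isset($cookies['intro'])) {
        return 'seen';        // cookie came back: intro already visited
    }
    if (isset($query['cookietest'])) {
        return 'no_cookies';  // cookie was just set, but did not come back
    }
    return 'needs_intro';     // first request of this visit
}

// Usage inside the include file (sketch):
//   $state = intro_state($_COOKIE, $_GET);
```

On 'no_cookies' you might render the page with a notice instead of redirecting again, which breaks the loop described above.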

Hoping for the best and preparing for the worst is definitely the rule of thumb when working with query strings. Remember that it is very easy to poison variables and engage in cross-site scripting attacks if your target does not sanitize every inch of a query string. Don’t be such a target.

Step 1: The Cookie Checking Code

We first create a simple PHP include file that checks for the presence of the intro page cookie. If the cookie does not exist, we get the URL of the current page, and pass that as a query string variable to the introduction page. If the cookie exists, the page is rendered.

As previously mentioned, this include file — which will be placed on every page of the site except the “splash” page, before any other code or text — will examine each visitor’s browser, checking to see if it is the Google, Yahoo! or MSN indexing robot. We do that by looking for specific terms in the user agent string sent by the browser: googlebot, slurp or msnbot, respectively. There are many more indexing robots out there; you’ll want to check your Web site statistics for other robots that are indexing your site and add their user agent IDs to the list of tested values. (More on how to do that shortly.)

We want to exclude indexing robots because every time we use header() to redirect a user to a different Web page than the one he tried to reach, we also send a response code that indicates the requested resource has moved. This will adversely affect your search engine rankings: lots of redirects suggest to the search engines that you’re pulling a keyword bait and switch. To keep your page ranks high, we want to prevent the indexing robots from encountering the splash page.

Note that in the code below, we use PHP_SELF to get the path of the requested page, and QUERY_STRING to get its query string, rather than hard-coding any URLs. This method will ensure your script works no matter where it is installed, but it could also allow for cross-site scripting attacks (thanks for the reminder, Fred), so we use htmlspecialchars() to sanitize both values, and urlencode() to ensure the result is properly sent as a single query string variable to our intro page.

<?php
//if not an excluded spider ...
if(!preg_match('/\bgooglebot\b|\bmsnbot\b|\bslurp\b/i', $_SERVER['HTTP_USER_AGENT'])) {
	//check for the intro page cookie; if it doesn't exist ...
	if(!isset($_COOKIE['intro'])) {
		//go to the intro page, passing the current page and its querystring into a new querystring
		$path = "Location: index.php?p=" . urlencode(htmlspecialchars($_SERVER['PHP_SELF']) . "?" .  htmlspecialchars($_SERVER['QUERY_STRING']));
		header($path);
		//stop here so the rest of the requested page never runs
		exit;
	}
}
?>

If you want to add more user agents to the list of those excluded from visiting the introduction page, you do so by wrapping some unique part of the robot’s user agent string in \b word-boundary markers and joining it to the pattern with a pipe, which acts as the alternation operator (that is, a logical OR).

For example, if you wanted to add Exabot to the exempted browsers, you would change line 3 of the code above to this:

if(!preg_match('/\bgooglebot\b|\bmsnbot\b|\bslurp\b|\bexabot\b/i', $_SERVER['HTTP_USER_AGENT'])) {
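
As the robot list grows, editing one long pattern by hand gets error-prone. Below is a minimal sketch of the same two ideas from the include file, pulled into functions: a robot test driven by an array of names, and a builder for the redirect line. Both function names are hypothetical. This variant also differs from the original in two ways: it omits the trailing "?" when there is no query string, and it relies on urlencode() alone for the header value, on the assumption that the value is passed through htmlspecialchars() wherever it is later echoed into HTML.

```php
<?php
// Hypothetical helper: true when the user agent contains any listed robot
// name as a whole word, case-insensitively.
function is_indexing_robot($userAgent, array $robots)
{
    if (count($robots) === 0) {
        return false;
    }
    // preg_quote() escapes any regex metacharacters in the names
    $quoted = array();
    foreach ($robots as $name) {
        $quoted[] = preg_quote($name, '/');
    }
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    return preg_match($pattern, (string) $userAgent) === 1;
}

// Hypothetical helper: builds the Location header for the intro page,
// leaving off the "?" when the requested page had no query string.
function build_intro_location($scriptPath, $queryString)
{
    $target = $scriptPath;
    if ($queryString !== '') {
        $target .= '?' . $queryString;
    }
    return 'Location: index.php?p=' . urlencode($target);
}

// Usage at the top of a page (sketch):
//   $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
//   if (!is_indexing_robot($ua, array('googlebot', 'msnbot', 'slurp'))) { ... }
```

Reading the user agent through isset() also avoids a notice from clients that send no User-Agent header at all.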

Step 2: The Introduction Page Code

Our splash page’s code is a bit more complex, because it needs to do two things: Handle setting the visit cookie via processing an approval form, and re-route the user back to the page he requested, complete with proper query string variables.

Let’s start with the simple approval form, which consists of a checkbox, a hidden field that records the incoming / referred URL, and a submit button. Just above the form is a PHP echo statement we will use to tell people they didn’t check the checkbox.

<?php echo isset($message) ? $message : ''; ?>
<form id="intro" name="intro" method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
	<label><input type="checkbox" id="ok" name="ok[]" /> I agree to the terms and conditions above.</label>
    <br />
    <input type="hidden" id="p" name="p" value="<?php echo isset($_REQUEST['p']) ? htmlspecialchars($_REQUEST['p']) : ''; ?>" />
    <input id="fsubmit" name="fsubmit" type="submit" value="Proceed" />
</form>

With the form out of the way, we can focus on the PHP script that processes the approval and sets the cookie.

The script first checks to see if there is a referring page; because we are passing referring page paths as a query string variable, I created a rudimentary system to check if the path leads to an actual page on the site. If the path does exist, the referring URL is used; if not, a default URL is used.

<?php
if(isset($_POST['fsubmit'])) {
	$path = "";
	//check for referring page; make sure request originates from this site
	if(isset($_POST['p']) && trim($_POST['p']) !== "" && preg_match('/^http:\/\/(www\.)?dougv\.com/i', $_SERVER['HTTP_REFERER'])) {
		//if referring page exists, check if page is valid
		$curl = curl_init();
		curl_setopt($curl, CURLOPT_URL, "http://" . $_SERVER['SERVER_NAME'] . $_POST['p']);
		curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
		curl_setopt($curl, CURLOPT_NOBODY, 1);
		curl_setopt($curl, CURLOPT_TIMEOUT, 10);
		$html = curl_exec($curl);
		$info = curl_getinfo($curl);
		curl_close($curl);

		if($info['http_code'] == "200") {
			//if URL is valid, use it
			$path = $_POST['p'];
			$path = str_replace("&amp;", "&", $path);
		}
	}
	//check if checkbox is checked; if so
	if(isset($_POST['ok'])) {
		//set session-only cookie
		setcookie('intro', '1');
		//go to specified path if it exists
		if(!empty($path)) {
			header("Location: " . $path);
		}
		else {
			//otherwise, go to default page
			header("Location: default.php");
		}
		//stop so the form is not rendered after the redirect
		exit;
	}
	else {
		//message to display if checkbox not checked
		$message = "<p><strong>You must check the box indicating you accept the terms above in order to proceed.</strong></p>\n";
	}
}
?>

Note that in the form-processing script above, Line 5 uses a regular expression to specifically check if the referer (that is, the page requesting the postback of this form) is on our Web server. We do that because the script makes a cURL request for whatever path it is handed; without the check, someone who knows that could build copies of this form on other computers and use them to launch a denial of service attack.

Even this is not bulletproof; the HTTP_REFERER variable can be forged. As other sites will tell you, the best way to prevent cross-site request forgeries is to require a token of some sort — that is, a random number or md5 hash or the like — that is created by a means known only to your Web server, changes every time the form is rendered and works only for that instance of the form.

Yeah, that’s a lot of effort, and the chances you encounter someone who wants to run a DoS attack on you are small. For most installations, this referrer method provides sufficient basic security.
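
For readers who do want the token approach, here is a minimal sketch of the idea. It is not part of the demo code; the function names and the intro_token key are assumptions of mine, and random_bytes() requires PHP 7 or later. The storage array is passed in so the logic can be read on its own; on a real page you would call session_start() and pass $_SESSION.

```php
<?php
// Hypothetical helper: creates a fresh random token, remembers it in
// $store, and returns it for embedding in a hidden form field.
function issue_form_token(array &$store)
{
    $token = bin2hex(random_bytes(16)); // 32 hex characters
    $store['intro_token'] = $token;
    return $token;
}

// Hypothetical helper: true only if $submitted matches the stored token.
// The stored token is discarded either way, so each token works once.
function consume_form_token(array &$store, $submitted)
{
    if (!isset($store['intro_token']) || !is_string($submitted)) {
        return false;
    }
    $expected = $store['intro_token'];
    unset($store['intro_token']);
    // hash_equals() compares in constant time
    return hash_equals($expected, $submitted);
}

// Usage (sketch):
//   session_start();
//   when rendering the form:  $token = issue_form_token($_SESSION);
//   when processing the post: consume_form_token($_SESSION, $_POST['token'])
```

Because a new token is issued each time the form is rendered and removed when it is checked, a copy of the form hosted elsewhere has no valid value to submit.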

Changing the domain name in Line 5 is straightforward. If your site’s domain name is example.com, you would use:

if(isset($_POST['p']) && trim($_POST['p']) !== "" && preg_match('/^http:\/\/(www\.)?example\.com/i', $_SERVER['HTTP_REFERER'])) {

If your site’s domain name is mysite.freehost.org, you would use:

if(isset($_POST['p']) && trim($_POST['p']) !== "" && preg_match('/^http:\/\/mysite\.freehost\.org/i', $_SERVER['HTTP_REFERER'])) {
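
One further inexpensive check you might run on the posted p value, before it reaches cURL or header(), is to confirm it looks like a path on this site at all. The sketch below is an assumption-laden illustration rather than part of the demo code; is_local_path() is a hypothetical name, and the rules it applies (must start with a single slash, no control characters) are my own choice of what counts as local.

```php
<?php
// Hypothetical helper: true only for values that look like a path on
// this site, such as "/page.php?foo=bar".
function is_local_path($p)
{
    // must be a non-empty string beginning with "/"
    if (!is_string($p) || $p === '' || $p[0] !== '/') {
        return false;
    }
    // "//host" and "/\host" are treated by browsers as other sites
    if (strlen($p) > 1 && ($p[1] === '/' || $p[1] === '\\')) {
        return false;
    }
    // control characters (including CR / LF) have no place in a header
    if (preg_match('/[\x00-\x1F\x7F]/', $p)) {
        return false;
    }
    return true;
}

// Usage (sketch): add is_local_path($_POST['p']) to the conditions tested
// before the cURL request is made.
```

Rejecting such values early means the server never spends a cURL request on them and never echoes them into a Location header.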

And that’s pretty much it.

Important note: I have not extensively tested this script. I’ve tested it to withstand most XSS attack attempts and to handle most common query strings, but I cannot vouch for it being bulletproof. You will want to test your scripts to ensure you have properly protected your code against malicious and unintended data.

I have a few examples to show. First, use this URL to see the basics: http://demo.dougv.com/php_intro_page/page.php

To test query strings, try this page: http://demo.dougv.com/php_intro_page/qs.php?foo=bar&hello="world’&html=<blink>Hello World!</blink>

You can see the “default” page, which is used when the cookie can be set but the referring URL cannot be verified, at http://demo.dougv.com/php_intro_page/default.php

If all goes according to plan, every time you close your Web browser, you’ll need to “reauthenticate” at the introduction page if you click these links.
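
That behavior comes from calling setcookie() without an expiration time, which makes a session-only cookie. If you would rather have the intro page reappear only every so many days, setcookie() accepts an expiration timestamp as its third argument. A minimal sketch follows; intro_cookie_expiry() is a hypothetical helper, and the 30-day figure in the usage comment is just an example.

```php
<?php
// Hypothetical helper: expiry timestamp to pass as setcookie()'s third
// argument. 0 means "session cookie" (dropped when the browser closes).
function intro_cookie_expiry($days, $now)
{
    if ($days <= 0) {
        return 0;
    }
    return $now + $days * 86400; // 86400 seconds in a day
}

// Usage on the intro page, before any output:
//   setcookie('intro', '1', intro_cookie_expiry(30, time()), '/');
```

As noted in the security caveats, users can override or discard cookies at will, so treat the expiration as a request rather than a guarantee.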

You can download the code used here: Using Cookies To Require Users To Visit An Introduction Page On Your Web Site Demo Code. I distribute all code under the GNU GPL.

All links in this post on delicious: http://delicious.com/dhvrm/using-cookies-to-require-users-to-visit-an-introduction-page-on-your-web-site
