Automatically Hash Tagging Text With PHP And MySQL

My recent work on the Google Reader to Twitter interface led me to recognize a serious shortcoming of such a basic system: A lack of support for hash tags.

For those unfamiliar with Twitter, hashtags are basically words preceded by a hash mark (#). When a word is “tagged”, it becomes a hyperlink to content also containing that term.

Tagging isn’t unique to Twitter. It’s integral to WordPress, Tumblr and many other blogging platforms; Google uses tags (which they call “labels”) in most of their major applications, including GMail and Google Documents.

The reason is simple: People tend to organize information in terms of categories, so linking items that belong to the same category to one another makes it easier for us to find and process that information.

So here’s a quick and easy script that lets you take keywords / tags / labels / categories / what have you from a MySQL table, run those terms over a string / subject text, and automatically tag that string with those terms.

(In a later tutorial, I will describe how to add new terms to the database.)

An aside on what constitutes a “term”: The one thing that became readily apparent during this project was that there are a lot of different trade-offs required in determining what constitutes a “term,” and in how easy it is to select simple derivatives of a given term in a subject string.

For example, hack. We probably want to be able to tag the similar terms hacks, hacker, hackers, hacking and hacked, as well as more complex derivatives, such as h4x0r. Needless to say, it’s difficult to convert hack into h4x0r, but it’s also difficult to simply append common endings to the root word. (More on this when we cover, in an upcoming post, adding terms from a subject string to the database.)
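To show why simply appending endings is unreliable, here is a minimal sketch. The helper name at_naive_derivatives and its list of endings are my own assumptions for illustration; they are not part of the script in this post.

```php
<?php
// Hypothetical helper: naively append common English endings to a root word.
function at_naive_derivatives($root) {
	$endings = array('s', 'er', 'ers', 'ing', 'ed');
	$out = array($root);
	foreach($endings as $ending) {
		$out[] = $root . $ending;
	}
	return $out;
}

// works for "hack": hack, hacks, hacker, hackers, hacking, hacked
echo implode(', ', at_naive_derivatives('hack')), "\n";
// fails for "tag": yields "taging" and "taged", not "tagging" and "tagged"
echo implode(', ', at_naive_derivatives('tag')), "\n";
```

The approach happens to work for hack, but any root that doubles its final consonant or drops a trailing e produces misspelled derivatives.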

It’s also hard to know when hack is actually in a context we want to hash tag. For example, hacker is probably always going to be a term we want to tag. But words such as hackle, hackberry and hacksaw are not ones we’re likely to want to tag, if the context in which we’re using hack is that of “altering a system to perform differently than intended.”

The compromise I am using is not the most elegant, but it is simple and direct: A term is an exact match of a word contained in the database. Therefore, if I want to tag hack, hacker, hacking and hacked, all four of those words must appear in the database.

Terms are case-insensitive. In other words, if I have hack in the database, it matches hack, Hack, hAck and HACK in the subject string.
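As a rough illustration of that rule, the sketch below checks a single word against a list of terms. The function name at_is_term is hypothetical, and the term list is hard-coded here rather than read from the database.

```php
<?php
// Hypothetical helper, not part of the script in this post:
// true if $word exactly matches one of $terms, ignoring case.
function at_is_term($word, array $terms) {
	// lowercase both sides so "HACK" and "hack" compare equal
	$lowered = array_map('strtolower', $terms);
	return in_array(strtolower($word), $lowered, true);
}

$terms = array('hack', 'hacker', 'hacking', 'hacked');

var_dump(at_is_term('HACK', $terms));    // bool(true)
var_dump(at_is_term('hAck', $terms));    // bool(true)
var_dump(at_is_term('hacksaw', $terms)); // bool(false)
var_dump(at_is_term('hackers', $terms)); // bool(false): not in the list
```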

An HTML Form To Input A Subject String

We need a simple way to get our subject string (that is, the text we want to have tagged). Here’s a form to do that; you could, of course, alter this script to open up a file, or retrieve data from some other store, as your subject text.

I am also including an echo statement, just before the form, that will show the autotagged text once the form has been submitted.

<p class="notice"><?php echo $content; ?></p>
<form id="tform" name="tform" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
	<textarea id="ttext" name="ttext" cols="50" rows="3"><?php echo isset($_POST['ttext']) ? htmlspecialchars($_POST['ttext']) : ''; ?></textarea>
	<br />
	<input type="submit" name="submit" id="submit" value="Submit" />
</form>

A MySQL Table To Contain Terms

We need to have some sort of data store to hold the terms. Eventually, we’re going to put these terms into an array, so you could simply hard-code your terms as a PHP array. Also, you could use an XML file, JSON, a CSV or other text file, etc. to hold your terms.
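If you would rather skip the database, a minimal sketch of the first two alternatives might look like this. The function name at_terms_from_json is my own, and the JSON string stands in for whatever you would read from a file; treat both as assumptions.

```php
<?php
// Hypothetical alternative to a database: decode terms from a JSON string.
// Returns an array of terms on success, false on failure.
function at_terms_from_json($json) {
	$decoded = json_decode($json, true);
	if(!is_array($decoded) || count($decoded) == 0) {
		return false;
	}
	// keep only non-empty strings, lowercase them and drop duplicates
	$out = array();
	foreach($decoded as $item) {
		if(is_string($item) && trim($item) !== '') {
			$out[] = strtolower(trim($item));
		}
	}
	$out = array_values(array_unique($out));
	return count($out) > 0 ? $out : false;
}

// simplest option: a hard-coded array
$terms = array('google', 'php', 'twitter');

// or the same data held as JSON, e.g. read with file_get_contents()
$from_json = at_terms_from_json('["Google", "php", "twitter", "PHP"]');
```

Dropping duplicates in code does the job that the unique key does in a database table.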

In my case, I am storing terms in a MySQL table. Here’s the code for my table:

CREATE TABLE IF NOT EXISTS `php_auto_hashtag` (
  `term_text` varchar(255) NOT NULL,
  UNIQUE KEY `term_text` (`term_text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `php_auto_hashtag` (`term_text`) VALUES
('adsense'),('amazon'),('android'),('aol'),('api'),('apple'),
('bing'),('canvas'),('cbs'),('chrome'),('cloud'),('comcast'),
('darpa'),('eff'),('facebook'),('firefox'),('google'),
('hacker'),('hackers'),('hacking'),('html'),('html5'),
('http'),('https'),('ie9'),('ietf'),('intel'),('internet'),
('ios'),('ipad'),('ipv6'),('javascript'),('kinect'),('malware'),
('microsoft'),('mozilla'),('mvc'),('nokia'),('pentagon'),('php'),
('ps3'),('rackspace'),('safari'),('silverlight'),('sony'),
('stuxnet'),('symbian'),('tablets'),('twitter'),('vb'),
('verizon'),('virus'),('windows'),('xml'),('youtube');

Note that we don’t have a primary key. That’s because we have a unique key. We don’t want the same term in the database twice, and that’s what a unique key does: prevent duplicate entries. As a result, a primary key isn’t necessary for tuning / optimization if we have a unique key, since their purposes in indexing are similar.

A PHP Function To Retrieve Terms From The Database

To get the terms out of the database and into a PHP array, I’ll make a function. The reason why I am doing it this way will be noted shortly. The function returns false on an error, an array on success.

The function assumes the database table contains at least one term. If it doesn’t, that’s not a fatal error: the function raises a notice and returns false.

Finally, you’ll note I am using globally defined constants for the database credentials. This isn’t really elegant, but I want this script to work out of the box for those who have limited programming skills; with the DB settings defined globally, an end user can simply plug in the right values and run the script.

//your database server variables
define('MYSQL_HOST', 'localhost');
define('MYSQL_USER', 'db_user');
define('MYSQL_PASS', 'db_password');
define('MYSQL_DB', 'db_name');
define('MYSQL_QUERY', 'SELECT term_text FROM php_auto_hashtag');

function at_get_terms() {
	//retrieve terms from database
	//returns Boolean false on failure, array of terms on success
	//uses mysqli, since the older mysql_* functions were removed in PHP 7
	
	//report failures through return values rather than exceptions
	mysqli_report(MYSQLI_REPORT_OFF);
	
	//connect to the server and select the database in one step
	if(!$link = mysqli_connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASS, MYSQL_DB)) {
		trigger_error('function at_get_terms: Cannot connect to database. Please check your host name, credentials and database name. MySQL error: ' . mysqli_connect_error(), E_USER_WARNING);
		return false;
	}
	
	if(!$rs = mysqli_query($link, MYSQL_QUERY)) {
		trigger_error('function at_get_terms: Error running query. MySQL error: ' . mysqli_error($link), E_USER_WARNING);
		mysqli_close($link);
		return false;
	}
	
	if(mysqli_num_rows($rs) == 0) {
		trigger_error('function at_get_terms: No terms found in database', E_USER_NOTICE);
		mysqli_close($link);
		return false;
	}
	
	//copy the single column of each row into the output array
	$out = array();
	while($row = mysqli_fetch_row($rs)) {
		$out[] = $row[0];
	}
	mysqli_free_result($rs);
	mysqli_close($link);
	return $out;
}

A PHP Function To Autotag The Subject

We can now create a function that does the autotagging. It takes as arguments the subject text and the array of terms we want tagged; it returns false on an error and the tagged subject string on success.

In this case, we’re using preg_replace to do the tagging. There’s a lot of argument as to whether str_replace or ereg_replace is faster / better than preg_replace, but I find such arguments to be counting angels dancing on the head of a pin. I use preg_replace because it works quickly enough, regular expressions are an elegant way to find text, and PCRE is PHP’s preferred regular expression processing extension.

function autotag($input, $terms) {
	//tags $input with $terms
	//returns false on error, tagged string on success
	
	if(strlen(trim($input)) < 1) {
		trigger_error('function autotag: string to be tagged is empty', E_USER_WARNING);
		return false;
	}
	if(!is_array($terms)) {
		trigger_error('function autotag: terms is not an array', E_USER_WARNING);
		return false;
	}

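	$tmp = array();	
	foreach($terms as $term){
		//matches will be terms exactly as in database (case-insensitive),
		//starting at a word boundary and followed by whitespace or end of string;
		//preg_quote escapes any regex metacharacters in the term
		$tmp[] = '/\b(' . preg_quote($term, '/') . ')(\s|$)/i';
	}
	$out = preg_replace($tmp, '#$0', $input);
	return $out;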
}

Note the second argument in the preg_replace call, above. # is just the hash mark, which Twitter automatically turns into a tag link. In a preg_replace replacement string, $0 stands for the entire part of the subject text (the third argument) that matched the pattern (the first argument), which here includes the whitespace that follows the term.
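To see what $0 and the numbered groups hold, here is a small sketch using a pattern shaped like the ones autotag builds, for the single term php. The square brackets and the sample sentence are only there to make the captured text visible.

```php
<?php
// a pattern shaped like the ones autotag() builds, for the single term "php"
$pattern = '/(php)(\s|$)/i';
$input = 'I like PHP a lot';

// $0 is the whole match: the term plus the trailing whitespace
$whole = preg_replace($pattern, '[$0]', $input);
echo $whole, "\n"; // I like [PHP ]a lot

// $1 is the first parenthesized group: the term alone
$term_only = preg_replace($pattern, '[$1]', $input);
echo $term_only, "\n"; // I like [PHP]a lot

// $2 is the second group: the whitespace (or nothing at end of string)
$both = preg_replace($pattern, '[$1]$2', $input);
echo $both, "\n"; // I like [PHP] a lot
```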

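So, if you wanted to use hyperlinks instead of hashtags, and use the found terms as querystring variables to a page named term.php, your preg_replace statement would be something like this. It uses $1 (the term alone) and $2 (the whitespace that followed it) rather than $0, so the trailing space stays outside the link and out of the querystring:

	$out = preg_replace($tmp, '<a href="term.php?term=$1">$1</a>$2', $input);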

(Always sanitize your querystring variables before using them in your PHP code. You have been warned. Don’t come crying to me or pointing fingers in my direction if you fall victim to an XSS or injection attack. Sanitize your variables.)
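On the output side, you can also encode each term as you build the link. The sketch below is one way to do that with preg_replace_callback; the function name autolink is hypothetical, and it assumes the input has already been HTML-escaped with htmlspecialchars. The term.php page must still sanitize whatever it receives.

```php
<?php
// Hypothetical variant of the link example: a callback encodes each term.
// Assumes $input is already HTML-escaped, e.g. with htmlspecialchars().
function autolink($input, array $terms) {
	if(count($terms) == 0) {
		return $input;
	}
	// escape regex metacharacters (and the / delimiter) in every term
	$quoted = array();
	foreach($terms as $term) {
		$quoted[] = preg_quote($term, '/');
	}
	// one pass over the text, so inserted markup is never scanned again;
	// the lookahead requires whitespace or end of string after the term
	$pattern = '/\b(' . implode('|', $quoted) . ')(?=\s|$)/i';
	return preg_replace_callback($pattern, function($m) {
		// urlencode() makes the term safe as a querystring value
		$href = 'term.php?term=' . urlencode(strtolower($m[1]));
		return '<a href="' . $href . '">' . $m[1] . '</a>';
	}, $input);
}

$linked = autolink('learn C++ and PHP', array('c++', 'php'));
echo $linked;
// learn <a href="term.php?term=c%2B%2B">C++</a> and <a href="term.php?term=php">PHP</a>
```

A term such as c++ shows why the encoding matters: left as-is in a querystring, each plus sign would be read as a space.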

Get The Terms And Tag The Target String

We now have everything we need to autotag the target string. It’s as simple as a single-command if statement:

$content = "Enter text in the textarea below, then click Submit. The text will be automatically tagged with terms contained in the database. ";

if(isset($_POST['submit'])) {
	$content = "<strong>Hashtagged string:</strong> " . autotag(htmlspecialchars($_POST['ttext']), at_get_terms());
}

And that’s all there is to it. You can see a working demo here: http://www.dougv.com/demo/php_auto_hashtag/

You can also download the source code. I distribute this code under the GNU GPL version 3.

All links in this post on delicious: http://www.delicious.com/dougvdotcom/automatically-hash-tagging-text-with-php-and-mysql
