Automatically Hash Tagging Text With PHP And MySQL

My recent work on the Google Reader to Twitter interface led me to recognize a serious shortcoming of such a basic system: A lack of support for hash tags.

For those unfamiliar with Twitter, hashtags are basically words preceded by a hash mark (#). When a word is “tagged”, it becomes a hyperlink to content also containing that term.

Tagging isn’t unique to Twitter. It’s integral to WordPress, Tumblr and many other blogging platforms; Google uses tags (which they call “labels”) in most of their major applications, including GMail and Google Documents.

The reason is simple: People tend to organize information into categories, so linking items that belong to the same category makes it easier for us to find and process that information.

So here’s a quick and easy script that lets you take keywords / tags / labels / categories / what have you from a MySQL table, run those terms over a string / subject text, and automatically tag that string with those terms.

(In a later tutorial, I will describe how to add new terms to the database.)

An aside on what constitutes a “term”: The one thing that became readily apparent during this project was that there are a lot of different trade-offs required in determining what constitutes a “term,” and in how easy it is to select simple derivatives of a given term in a subject string.

For example, hack. We probably want to be able to tag the similar terms hacks, hacker, hackers, hacking and hacked, as well as more complex derivatives, such as h4x0r. Needless to say, it’s difficult to convert hack into h4x0r, but it’s also difficult to simply append common endings to the root word. (More on this when we cover, in an upcoming post, adding terms from a subject string to the database.)

It’s also hard to know when hack is actually in a context we want to hash tag. For example, hacker is probably always going to be a term we want to tag. But words such as hackle, hackberry and hacksaw are not ones we’re likely to want to tag, if the context in which we’re using hack is that of “altering a system to perform differently than intended.”

The compromise I am using is not the most elegant, but it is simple and direct: A term is an exact match of a word contained in the database. Therefore, if I want to tag hack, hacker, hacking and hacked, all four of those words must appear in the database.

Terms are case-insensitive. In other words, if I have hack in the database, it matches hack, Hack, hAck and HACK in the subject string.
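To make the matching rule concrete, here is a minimal sketch. The helper name at_is_term_match() is made up for this illustration and is not part of the script in this post; it only checks whether a term appears in a string as a whole word, ignoring case.

```php
<?php
// Hypothetical helper, for illustration only: returns true if $term
// appears in $subject as a whole word, ignoring case.
function at_is_term_match($term, $subject) {
	// preg_quote() escapes any regex metacharacters in the term;
	// \b anchors the match at word boundaries; the i flag ignores case.
	$pattern = '/\b' . preg_quote($term, '/') . '\b/i';
	return preg_match($pattern, $subject) === 1;
}

var_dump(at_is_term_match('hack', 'A clever HACK'));   // bool(true)
var_dump(at_is_term_match('hack', 'Buy a hacksaw'));   // bool(false)
var_dump(at_is_term_match('hack', 'She is a hacker')); // bool(false)
```

The last call shows the trade-off described above: hacker is only matched if hacker itself is stored as a term.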

An HTML Form To Input A Subject String

We need a simple way to get our subject string (that is, the text we want to have tagged). Here’s a form to do that; you could, of course, alter this script to open up a file, or retrieve data from some other store, as your subject text.

I am also including an echo statement, just before the form, that will show the autotagged text once the form has been submitted.

<p class="notice"><?php echo $content; ?></p>
<form id="tform" name="tform" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
	<textarea id="ttext" name="ttext" cols="50" rows="3"><?php echo isset($_POST['ttext']) ? htmlspecialchars($_POST['ttext']) : ''; ?></textarea>
	<br />
	<input type="submit" name="submit" id="submit" value="Submit" />
</form>

A MySQL Table To Contain Terms

We need to have some sort of data store to hold the terms. Eventually, we’re going to put these terms into an array, so you could simply hard-code your terms as a PHP array. Also, you could use an XML file, JSON, a CSV or other text file, etc. to hold your terms.
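As a rough sketch of the non-database route, the terms could live in a JSON string or file. The function name at_get_terms_from_json() and the sample terms are assumptions made for this illustration; the return convention (false on error, array on success) mirrors the one used for the database function in this post.

```php
<?php
// Hypothetical alternative to the database: decode a JSON list of terms.
// Returns Boolean false on failure, array of terms on success.
function at_get_terms_from_json($json) {
	$decoded = json_decode($json, true);
	if (!is_array($decoded)) {
		trigger_error('function at_get_terms_from_json: invalid JSON', E_USER_WARNING);
		return false;
	}
	// keep strings only, lower-case them (matching is case-insensitive
	// anyway), drop duplicates and re-index the array
	$strings = array_filter($decoded, 'is_string');
	$terms = array_values(array_unique(array_map('strtolower', $strings)));
	if (count($terms) == 0) {
		trigger_error('function at_get_terms_from_json: no terms found', E_USER_NOTICE);
		return false;
	}
	return $terms;
}

// The JSON could just as well come from file_get_contents() on a file.
$terms = at_get_terms_from_json('["hack", "hacker", "Hack", "virus"]');
print_r($terms); // hack, hacker, virus
```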

In my case, I am storing terms in a MySQL table. Here’s the code for my table:

CREATE TABLE IF NOT EXISTS `php_auto_hashtag` (
  `term_text` varchar(255) NOT NULL,
  UNIQUE KEY `term_text` (`term_text`)
);

INSERT INTO `php_auto_hashtag` (`term_text`) VALUES

Note that we don’t have a primary key. That’s because we have a unique key. We don’t want the same term in the database twice, and that’s what a unique key does: prevent duplicate entries. It also gives MySQL an index on the only column we ever query, so a separate primary key would add nothing here for tuning / optimization.

A PHP Function To Retrieve Terms From The Database

To get the terms out of the database and into a PHP array, I’ll make a function. The reason why I am doing it this way will be noted shortly. The function returns false on an error, an array on success.

The function assumes the database table contains at least one term. If it doesn’t, that is not a fatal error, but the function raises a notice and returns false.

Finally, you’ll note I am using globally defined constants for the database credentials. This isn’t really elegant, but I want this script to work out of the box for those who have limited programming skills; with the DB settings defined globally, an end user can simply plug in the right values.

//your database server variables
define('MYSQL_HOST', 'localhost');
define('MYSQL_USER', 'db_user');
define('MYSQL_PASS', 'db_password');
define('MYSQL_DB', 'db_name');
define('MYSQL_QUERY', 'SELECT term_text FROM php_auto_hashtag');

function at_get_terms() {
	//retrieve terms from database
	//returns Boolean false on failure, array of terms on success
	//note: the mysql_* functions used here were removed in PHP 7
	if(!$link = mysql_connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASS)) {
		trigger_error('function at_get_terms: Cannot connect to database server. Please check your host name and credentials', E_USER_WARNING);
		return false;
	}
	if(!mysql_select_db(MYSQL_DB)) {
		trigger_error('function at_get_terms: Cannot select the database. Please check your database name', E_USER_WARNING);
		return false;
	}
	if(!$rs = mysql_query(MYSQL_QUERY)) {
		trigger_error('function at_get_terms: Error parsing query. MySQL error: ' . mysql_error(), E_USER_WARNING);
		return false;
	}
	if(mysql_num_rows($rs) == 0) {
		trigger_error('function at_get_terms: No terms found in database', E_USER_NOTICE);
		return false;
	}
	$out = array();
	while($row = mysql_fetch_array($rs)) {
		$out[] = $row[0];
	}
	return $out;
}
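The mysql_* functions were removed in PHP 7, so on a current PHP install the same idea needs a different extension. Below is a minimal sketch using PDO. It is not a drop-in replacement for at_get_terms(): the name at_get_terms_pdo() and the choice to pass in a connection are assumptions made for this example. The demo runs against an in-memory SQLite database purely so that it is self-contained (this assumes the pdo_sqlite extension is available); for MySQL you would build the PDO object with a mysql: DSN and your own credentials.

```php
<?php
// Sketch: fetch terms through PDO instead of the removed mysql_* functions.
// Assumes $pdo is set to PDO::ERRMODE_EXCEPTION (the default since PHP 8).
// Returns Boolean false on failure, array of terms on success.
function at_get_terms_pdo(PDO $pdo) {
	try {
		$rs = $pdo->query('SELECT term_text FROM php_auto_hashtag');
		// FETCH_COLUMN returns a flat array holding the first column of each row
		$out = $rs->fetchAll(PDO::FETCH_COLUMN);
	} catch (PDOException $e) {
		trigger_error('function at_get_terms_pdo: query failed: ' . $e->getMessage(), E_USER_WARNING);
		return false;
	}
	if (count($out) == 0) {
		trigger_error('function at_get_terms_pdo: No terms found in database', E_USER_NOTICE);
		return false;
	}
	return $out;
}

// Demo connection: in-memory SQLite, used only to keep this example runnable.
// For MySQL, something like:
// new PDO('mysql:host=localhost;dbname=db_name', 'db_user', 'db_password')
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE php_auto_hashtag (term_text VARCHAR(255) NOT NULL UNIQUE)');
$pdo->exec("INSERT INTO php_auto_hashtag (term_text) VALUES ('hack'), ('hacker')");

print_r(at_get_terms_pdo($pdo));
```

Passing the connection in, rather than connecting inside the function, keeps the credentials in one place and lets the same function work with any PDO driver.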

A PHP Function To Autotag The Subject

We can now create a function that does the autotagging. It takes as arguments the subject text and the array of terms we want tagged; it returns false on an error and the tagged subject string on success.

In this case, we’re using preg_replace to do the tagging. There’s a lot of argument as to whether str_replace or ereg_replace is faster / better than preg_replace, but I find such arguments to be counting angels dancing on the head of a pin. I use preg_replace because it works quickly enough, regular expressions are an elegant way to find text, and PCRE is PHP’s preferred regular expression processing extension.

function autotag($input, $terms) {
	//tags $input with $terms
	//returns false on error, tagged string on success
	if(strlen(trim($input)) < 1) {
		trigger_error('function autotag: string to be tagged is empty', E_USER_WARNING);
		return false;
	}
	if(!is_array($terms)) {
		trigger_error('function autotag: terms is not an array', E_USER_WARNING);
		return false;
	}

	$tmp = array();
	foreach($terms as $term){
		//matches will be terms exactly as in database;
		//preg_quote escapes any regex metacharacters in a term
		$tmp[] = '/(\b' . preg_quote($term, '/') . '\b)/i';
	}
	$out = preg_replace($tmp, '#$0', $input);
	return $out;
}
Update, 24 August 2015: Commenter CR notes that the previous expression in the code above would tag words that were prefixed; e.g., “antivirus” would become “anti#virus” if “virus” was a taggable word.

To fix that, we use the \b word boundary assertion in the regular expression. A word boundary basically says that $term must not be directly preceded or followed by a letter, number or underscore.
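A quick before-and-after sketch of that fix, using virus as the only term (the sample sentence is made up for illustration):

```php
<?php
$input = 'My antivirus caught a virus.';

// Without word boundaries, the term also matches inside longer words.
echo preg_replace('/virus/i', '#$0', $input), "\n";
// My anti#virus caught a #virus.

// With \b on both sides, only the standalone word is tagged.
echo preg_replace('/\bvirus\b/i', '#$0', $input), "\n";
// My antivirus caught a #virus.
```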

Note the second argument in the preg_replace call, above. # is just the hash mark, which in the case of Twitter will be turned automatically into a tag link. $0 means, in regular expressions, the entire part of the subject text (the third argument) that matched the pattern (the first argument).

So, if you wanted to use hyperlinks instead of hashtags, and use the found terms as querystring variables to a page named term.php, your preg_replace statement would be something like this:

$out = preg_replace($tmp, '<a href="term.php?term=$0">$0</a>', $input);
Always sanitize your querystring variables before using them in your PHP code. You have been warned. Don’t come crying to me or pointing fingers in my direction if you fall victim to an XSS or injection attack. Sanitize your variables.
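One way to act on that warning when building the links themselves is to encode everything before it goes into the markup. This is a sketch rather than the approach used in this script: autolink() is a hypothetical name, it expects plain (not yet escaped) text, and it uses a single combined pattern with preg_split so that a later term can never match inside a link generated for an earlier one.

```php
<?php
// Hypothetical variant of autotag() that emits links instead of hash marks.
// Takes plain text and returns HTML-escaped text with linked terms.
function autolink($input, $terms) {
	if (!is_array($terms) || count($terms) == 0) {
		return htmlspecialchars($input);
	}
	$quoted = array();
	foreach ($terms as $term) {
		$quoted[] = preg_quote($term, '/');
	}
	// one pattern for all terms, with a single capture group
	$pattern = '/\b(' . implode('|', $quoted) . ')\b/i';
	// DELIM_CAPTURE makes the pieces alternate: text, term, text, term, ...
	$parts = preg_split($pattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);
	$out = '';
	foreach ($parts as $i => $part) {
		if ($i % 2 == 1) {
			// odd positions hold the matched terms
			$out .= '<a href="term.php?term=' . rawurlencode($part) . '">'
				. htmlspecialchars($part) . '</a>';
		} else {
			$out .= htmlspecialchars($part);
		}
	}
	return $out;
}

echo autolink('Hack & slash', array('hack')), "\n";
// <a href="term.php?term=Hack">Hack</a> &amp; slash
```

On the receiving end, term.php still has to treat $_GET['term'] as untrusted input.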

Get The Terms And Tag The Target String

We now have everything we need to autotag the target string. It’s as simple as a single-command if statement:

$content = "Enter text in the textarea below, then click Submit. The text will be automatically tagged with terms contained in the database. ";

if(isset($_POST['submit'])) {
	$content = "<strong>Hashtagged string:</strong> " . autotag(htmlspecialchars($_POST['ttext']), at_get_terms());
}

And that’s all there is to it.


    1. @Ali: If the text is already hashtagged, why would you need to hashtag it?

      That said, if for some odd reason you need to hashtag text twice, just run the end result through

      preg_replace('/#+/', '#', $input);

      and back-to-back hashmarks will be replaced with a single hash.

  1. thanks for providing this info to the users i want to use it one of my project, on basis of client requirement.

  2. One thing of which you should be aware is that it will tag within words.

    e.g. on the demo above, if you want to tag “virus” and the word “antivirus” is in your text, you will end up with “anti#virus” (no space in between) which may not be what you desire. E.g. you might want no tag or “anti #virus” but probably not “anti#virus”

    Thanks for the article!

  3. To change the behavior so it only tags a word in full (e.g. virus vs antivirus), I think you can alter the code in autotag to say:

    $tmp[] = "/(\s|$)($term)(\s|$)/i";
    1. @CR: Thanks for the note. I would recommend wrapping the expression in \b.

      In Perl-compatible regular expressions (PCRE), \b indicates a word boundary, meaning that wherever \b is, a word-like character must sit next to something that is not one (or next to the start or end of the string). A “word-like character” is basically any letter, number or underscore, but not spaces, punctuation and the like.

      $tmp[] = "/(\b$term\b)/i";

      The benefit is that this prevents the need to change any other part of the code.

      Again, thanks for the tip, and this post has been updated to reflect your correction.
