Counting The Number Of Characters In A Tweet Via .NET

Here’s a bonus post in my LinqToTwitter tutorial series: How to make sure the body text of a status update (tweet) falls within the 140-character limit.

Ensuring our tweet meets the character count threshold in Twitter is a three-part problem:

  1. We need to account for the presence of any links, which will be automatically run through the t.co link shortener;
  2. If we’re tweeting with media, we need to account for those links; and
  3. We need to account for certain diacritical marks and other extended characters (such as emojis) that take up more than one codepoint.

Let’s get the simplest case out of the way: If you only intend to tweet basic Latin characters, you can count the length of a tweet as a straight count of the characters.

Consider this passage, from “The Curious Incident of the Dog in the Night-Time”:

It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

That’s 121 characters, in Twitter’s eyes: An exact count of the letters, numbers, spaces, periods and single-quotes contained in the tweet.

Twitter written out in Scrabble letters. Via Pixabay, in the public domain.

Whitespace Counts

Note that Twitter counts leading and trailing spaces when calculating a string’s length. Given these two strings:

 It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Line 1 is 122 characters long (leading whitespace); Line 2 is 121 characters long.

Twitter also counts multiple white spaces between words in a tweet. Given these two strings:

It was     7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Line 1 is 125 characters long (4 extra spaces between “was” and “7”); Line 2 is 121 characters long.

Finally, Twitter counts a carriage return / newline combination as a single character. (Technically, it combines carriage return (\r) and newline (\n) into just a newline (\n).) Given these two strings:

It was 7 minutes after midnight.
The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Lines 1 and 2 combined are 121 characters (after trimming the trailing space off Line 1, following the period, and replacing it with a newline); Line 3 is 121 characters.
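To see the newline rule in numbers, here is a minimal sketch that collapses each \r\n pair into a lone \n before measuring. The NormalizeNewlines helper name is my own, and I am assuming, from the behavior described above, that a lone \n is what Twitter ends up counting.

```csharp
using System;

namespace TwitterWordCount
{
    public static class NewlineDemo
    {
        public static void Main()
        {
            // same passage as above, with a Windows-style line break after the first sentence
            const string twoLines = "It was 7 minutes after midnight.\r\nThe dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.";

            // .NET sees \r and \n as two separate chars, so the raw length is 122
            Console.WriteLine("Raw length: {0}", twoLines.Length);
            // after collapsing, the line break counts once, for a length of 121
            Console.WriteLine("Length after collapsing newlines: {0}", NormalizeNewlines(twoLines).Length);
        }

        // hypothetical helper: turns each \r\n pair into a lone \n
        public static string NormalizeNewlines(string input)
        {
            return input.Replace("\r\n", "\n");
        }
    }
}
```

Running the collapsed string through Length then gives the same figure Twitter would report for the two-line version of the passage.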

So, trim your tweet strings, or you may get unexpected results; and consider running your tweets through a regular expression to remove duplicate whitespace, unless you anticipate multiple spaces being necessary within a tweet.

Here’s how that might look:

using System;
using System.Text.RegularExpressions;

namespace TwitterWordCount
{
    class Program
    {
        static void Main()
        {
            const string sampleWithSpaces = " This string has leading     and extra whitespace.";

            var sampleWithSpacesFixed = CleanForTwitter(sampleWithSpaces);
            //cleaned sample is 45 chars long
            Console.WriteLine("The length of the spaces string is {0}", sampleWithSpacesFixed.Length);
            Console.ReadLine();
        }

        //trims the ends, then collapses every run of whitespace (newlines included) into one space
        public static string CleanForTwitter(string input)
        {
            return Regex.Replace(input.Trim(), @"\s+", " ");
        }
    }
}
Wrong kind of Links. “Years of Link changes by SootToon” on DeviantArt.

Counting Link Lengths

Per the Twitter documentation, as of this writing every link run through t.co will be either 22 characters (http) or 23 characters (https) in length, regardless of the length of the actual URL you submitted.

In other words, http://lens.blogs.nytimes.com/2015/08/31/child-goddesses-in-nepal/?module=BlogPost-Title&version=Blog%20Main&contentCollection=Multimedia&action=Click&pgtype=Blogs&region=Body#slideshow/100000003880427/100000003880434, http://nyti.ms/1Ni6xiA and http://dougv.us/r1 will all be 22 characters in length after t.co gets done with them.

Therefore, we can use a regular expression to find all links in our tweet’s body, and count each link as being 22 characters long, regardless of its actual length. Something like this should find web links:

(http)?(s?)(://)?[a-zA-Z0-9\-]+\.+[a-z]{2,13}[\.\?=&%/\w-#]*
Nothing engenders fruitless pedantry better than regex patterns, and this one is far from perfect, so the less restrained among my readers may want to start picking nits off of it.

Will the regex above pass every correct URL possible? Nope; a really weird one might get rejected. Will it pass some poorly formed URLs? Probably.

For all but edge cases, the regex above will do just fine for finding web links. If you need something better, good; find it or make it, then use it, but please spare everyone the details of why your regex is better than this one. On behalf of everyone who has a life to live, we thank you.

Counting matches against this pattern is a two-step process:

  1. First, we need to remove the actual length of the URL(s) in our tweet from the length of our string; then
  2. We need to add back, to the length of our string(s), either 22 or 23, for each URL in the tweet, depending on whether it is secure.

Here’s how that might look:

using System;
using System.Text.RegularExpressions;

namespace TwitterWordCount
{
    class Program
    {
        static void Main()
        {
            const string sample = "There are two links in this text, one of which is secure. http://www.bn.com https://www.dougv.com/2015/08/posting-status-updates-to-twitter-via-linqtotwitter-part-2-plain-text-tweets";
            var result = GetUrlCharCount(sample);
            Console.WriteLine("The tweet length of the sample string is {0}", result);
            Console.ReadLine();
        }

        public static int GetUrlCharCount(string input)
        {
            var count = input.Length; //sample string is 182 chars long

            var regex = new Regex(@"(http)?(s?)(://)?[a-zA-Z0-9\-]+\.+[a-z]{2,13}[\.\?=&%/\w-#]*");
            var matches = regex.Matches(input); //there are 2 regex matches in the sample

            foreach (Match match in matches)
            {
                //subtract link length; in sample, that's 17 for bn.com link, 106 for dougv.com link
                count -= match.Length;
                //then add 22 for each url found, representing its t.co shortened length
                count += 22;
                //add one more char if original URL was https; check the matched text itself, because
                //the (s?) group also captures the leading "s" of a bare domain such as site.com
                if (match.Value.StartsWith("https", StringComparison.OrdinalIgnoreCase)) count += 1;
            }

            return count; //sample will return 104; 182 - 17 + 22 - 106 + 22 + 1
        }
    }
}

Important note on what t.co will shorten: The t.co shortener will shorten what looks like a top-level and second-level domain combo, even if it doesn’t have a protocol.

In other words, if you were to include the term ASP.NET in your tweet, Twitter would consider that a URL, prepend it with http://, then run it through the t.co shortener.

To avoid this behavior, either put a space between the two (e.g., ASP .NET) or replace the period with a Unicode character that looks like a period, but isn’t (e.g., \u2024, the one dot leader).

Important note on what t.co won’t shorten: Twitter will not wrap certain links with t.co. These include two or more links that are joined by commas or periods without spaces (e.g., http://www.site.com,http://www.example.com), and links containing credentials (e.g., http://user:pass@example.com).

If you send these kinds of links via the API, they will not be shortened; their full length will count against the 140-character limit.

Important note on the length of t.co shortened links: As a rule, the t.co shortener will use whichever protocol was passed to it when shortening your link. That is, if your tweet has a non-secure (http) link, t.co will shorten it with http; if you submit a secure (https) link, t.co will shorten it with https.

Also, the length of a t.co shortened link can change. To be 100 percent certain of the current length of a t.co link, you can run a GET request on help/configuration in the Twitter REST 1.1 API; it will return short_url_length and short_url_length_https, which will be integers giving the current length of a t.co shortened link.
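Since those two numbers can change, it may be worth passing them in rather than hard-coding 22 and 23. Here is a sketch of that idea. The three-parameter GetUrlCharCount and the ConfigurableLinkDemo class are hypothetical names of my own; the two values would come from short_url_length and short_url_length_https, and actually fetching them (which requires an authenticated request) is left out.

```csharp
using System;
using System.Text.RegularExpressions;

namespace TwitterWordCount
{
    public static class ConfigurableLinkDemo
    {
        // same imperfect link pattern used earlier in this post
        private static readonly Regex LinkPattern =
            new Regex(@"(http)?(s?)(://)?[a-zA-Z0-9\-]+\.+[a-z]{2,13}[\.\?=&%/\w-#]*");

        public static void Main()
        {
            const string sample = "Two links: http://www.bn.com https://www.dougv.com";
            // 22 and 23 are the values quoted in this post; substitute whatever help/configuration returns
            Console.WriteLine("The tweet length of the sample string is {0}", GetUrlCharCount(sample, 22, 23));
        }

        // shortUrlLength and shortUrlLengthHttps are assumed to hold the current t.co lengths
        public static int GetUrlCharCount(string input, int shortUrlLength, int shortUrlLengthHttps)
        {
            var count = input.Length;

            foreach (Match match in LinkPattern.Matches(input))
            {
                // swap the link's real length for its shortened length
                var isSecure = match.Value.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
                count -= match.Length;
                count += isSecure ? shortUrlLengthHttps : shortUrlLength;
            }

            return count;
        }
    }
}
```

If Twitter ever changes the shortened lengths, only the two arguments need to change; the counting logic stays the same.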


Multiple photos. Via pixabay, in the public domain.

Counting Media

In the rendered tweet, all media (images / video / animated gif) will be represented by a single t.co shortened link, regardless of the number of images in the tweet.

Therefore, if we create a status update, which includes one or more media IDs, via the LinqToTwitter TweetAsync method, we will need to reserve 24 characters (the 23 characters of a secure t.co link, plus a leading whitespace). That’s true if we are tweeting a single picture or up to four; however much media we’re tweeting, it’s represented in our tweet by a single, https t.co-shortened link, with a leading whitespace.
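As a rough sketch of that bookkeeping, the helper below reports how many characters remain for text once the media link is reserved. The GetRemainingChars name, the MediaDemo class and the idea of passing in a media count are my own framing, not part of LinqToTwitter; the 140 and 24 constants are simply the figures quoted above.

```csharp
using System;

namespace TwitterWordCount
{
    public static class MediaDemo
    {
        private const int TweetLimit = 140;       // character limit discussed in this post
        private const int MediaReservation = 24;  // 23 for a secure t.co link, plus 1 leading space

        public static void Main()
        {
            const string text = "Photos from the weekend"; // 23 chars
            Console.WriteLine("Without media: {0} left", GetRemainingChars(text.Length, 0));
            Console.WriteLine("With 3 photos: {0} left", GetRemainingChars(text.Length, 3));
        }

        // textLength is the tweet's already-computed length; mediaCount is how many media IDs are attached
        public static int GetRemainingChars(int textLength, int mediaCount)
        {
            // one reservation covers all attached media, whether there is one item or several
            var reserved = mediaCount > 0 ? MediaReservation : 0;
            return TweetLimit - textLength - reserved;
        }
    }
}
```

A negative return value would mean the text needs trimming before it is handed to TweetAsync.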

A small sample of the emoji available to use on Twitter. They’re represented by Unicode text characters.

Counting Complex Characters

The question of diacritical marks and other multibyte characters is a bit more problematic.

According to the Twitter documentation, Twitter counts the length of a string using two criteria:

  1. The string is first normalized using Unicode’s Normalization Form C; and
  2. Twitter counts codepoints, not UTF-8 bytes, for extended characters.

This is fortunate for us, for a number of technical reasons I won’t bore you with. (It has to do with very geeky stuff relating to how the encoding between UTF-8, in which Twitter operates, and UTF-16, in which .NET operates, is translated; plus how normalization changes the means by which extended characters are encoded.)

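Suffice it to say, using the StringInfo class and its LengthInTextElements property, we can get a close approximation of the codepoint count: it counts text elements, so a surrogate pair (such as most emoji) counts once, as does a base character together with any combining marks that follow it.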

using System;
using System.Globalization;
using System.Text;

namespace TwitterWordCount
{
    class Program
    {
        static void Main()
        {
            const string codepoints = "\u23f0\u0308bc\u303c";
            var codepointsResult = GetCodepointLength(codepoints);
            Console.WriteLine("The length of the codepoints string is {0}", codepointsResult);

            Console.ReadLine();
        }

        public static int GetCodepointLength(string input)
        {
            var info = new StringInfo(input.Normalize(NormalizationForm.FormC));
            return info.LengthInTextElements;
        }
    }
}
I recognize that I am assessing this string in .NET, which stores text as UTF-16, while Twitter operates in UTF-8. That matters less than it might seem: a codepoint is the same codepoint in either encoding; only the number of bytes used to store it differs.

The real caveat is that LengthInTextElements counts text elements, not codepoints. A combining mark that normalization can't merge into its base character (such as the \u0308 in the sample above) is folded into the preceding text element, so it isn't counted separately, whereas Twitter counts it as a codepoint of its own.

All things even, that means this function can underestimate the length of a string, but never overestimate it. I am assuming that for most people, whose tweets contain precomposed accented characters and simple emoji, the difference will be nil.

If that’s an OK assumption, the function above works fine. If it isn’t, count the codepoints of the normalized string directly; and if you need to work with the UTF-8 bytes themselves, there’s a good answer over at Stack Overflow for conversion functions.
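If you would rather count codepoints strictly, one option is to walk the normalized string and treat each surrogate pair as a single codepoint. This is a sketch of that approach, not something taken from the Twitter documentation; the CountCodepoints name and the CodepointDemo class are my own.

```csharp
using System;
using System.Text;

namespace TwitterWordCount
{
    public static class CodepointDemo
    {
        public static void Main()
        {
            // five codepoints: a symbol, a combining diaeresis, two Latin letters and a CJK symbol
            const string codepoints = "\u23f0\u0308bc\u303c";
            Console.WriteLine("The codepoint count of the string is {0}", CountCodepoints(codepoints));
        }

        public static int CountCodepoints(string input)
        {
            // normalize first, as Twitter is documented to do
            var normalized = input.Normalize(NormalizationForm.FormC);
            var count = 0;

            for (var i = 0; i < normalized.Length; i++)
            {
                // a surrogate pair takes two UTF-16 chars but is one codepoint, so skip its second half
                if (char.IsSurrogatePair(normalized, i)) i++;
                count++;
            }

            return count;
        }
    }
}
```

Unlike a text-element count, this treats a combining mark that survives normalization as a character of its own, which is how the two criteria listed earlier read.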

This code as a github Gist: https://gist.github.com/dougvdotcom/fcf499b38d03bb1df329
