Text truncation at a word boundary using C#

Following on from my last post on stripping HTML from text using C#: once I had removed all traces of HTML from the incoming text, I also needed to show a short preview of it. I originally went with a simple truncation method, as follows:

namespace ExtensionMethods
{
    public static class StringExtensionMethods
    {
        public static string Truncate(this string text, int maximumLength)
        {
            if (string.IsNullOrEmpty(text))
            {
                return text;
            }

            return text.Length <= maximumLength ? text : text.Substring(0, maximumLength);
        }
    }
}

This works, but the results look a little odd if the truncation happens halfway through a word.
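To make the problem concrete, here is a minimal sketch that repeats the hard cut and adds a check for whether the cut lands inside a word. The class name TruncateDemo and both method names are made up for this illustration, and "inside a word" is assumed to mean a letter or digit on both sides of the cut.

```csharp
public static class TruncateDemo // hypothetical name, for illustration only
{
    // Same logic as Truncate above, repeated so the sketch compiles on its own.
    public static string HardTruncate(string text, int maximumLength)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        return text.Length <= maximumLength ? text : text.Substring(0, maximumLength);
    }

    // Returns true when a cut at maximumLength would split a word, i.e. the
    // last kept character and the first dropped character are both letters or digits.
    public static bool CutsMidWord(string text, int maximumLength)
    {
        if (string.IsNullOrEmpty(text) || maximumLength <= 0 || maximumLength >= text.Length)
        {
            return false;
        }

        return char.IsLetterOrDigit(text[maximumLength - 1])
            && char.IsLetterOrDigit(text[maximumLength]);
    }
}
```

With the input "The quick brown fox" and a maximum length of 12, HardTruncate returns "The quick br" and CutsMidWord returns true; with a maximum length of 9 the result is "The quick" and CutsMidWord returns false.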

Instead, I came up with this method to truncate at the last word break within the allowed number of characters:

using System.Linq;

namespace ExtensionMethods
{
    public static class StringExtensionMethods
    {
        private static readonly char[] Punctuation = {'.', ',', ';', ':'};

        public static string TruncateAtWordBoundary(this string text, int maximumLength)
        {
            if (string.IsNullOrEmpty(text))
            {
                return text;
            }

            if (text.Length <= maximumLength)
            {
                return text;
            }

            // If the character after the cut off is white space or punctuation 
            // then return what we've got using substring:
            var isCutOffWhiteSpaceOrPunctuation = char.IsWhiteSpace(text[maximumLength]) || Punctuation.Contains(text[maximumLength]);
            text = text.Substring(0, maximumLength);

            if (isCutOffWhiteSpaceOrPunctuation)
            {
                return text;
            }

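            // Find the last white-space or punctuation and chop off there:
            var lastWhiteSpaceOrPunctuationPosition = -1;
            for (var i = text.Length - 1; i >= 0; i--)
            {
                if (char.IsWhiteSpace(text[i]) || Punctuation.Contains(text[i]))
                {
                    lastWhiteSpaceOrPunctuationPosition = i;
                    break;
                }
            }

            // No word break found (for example a single long word): keep the
            // hard cut rather than returning an empty string:
            if (lastWhiteSpaceOrPunctuationPosition < 0)
            {
                return text;
            }

            text = text.Substring(0, lastWhiteSpaceOrPunctuationPosition).Trim();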

            return text;
        }
    }
}

While not perfect, this approach works a lot better. Please feel free to suggest improvements.
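One possible improvement, sketched here under my own assumptions rather than as part of the method above: append a suffix when text has been cut, so the reader can see that the preview is incomplete. The names PreviewExtensions, TruncateWithEllipsis and the suffix parameter are hypothetical. Note that the suffix is not counted against maximumLength in this sketch, so the result can be up to suffix.Length characters longer.

```csharp
using System.Linq;

public static class PreviewExtensions // hypothetical name
{
    private static readonly char[] Punctuation = { '.', ',', ';', ':' };

    // Hypothetical helper: cuts at the last word break that fits and appends a suffix.
    public static string TruncateWithEllipsis(this string text, int maximumLength, string suffix = "...")
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        if (maximumLength <= 0)
        {
            return string.Empty;
        }

        if (text.Length <= maximumLength)
        {
            return text;
        }

        // Scan back from the first character that would be dropped until we
        // reach white space or punctuation; cutting there keeps whole words.
        var cut = maximumLength;
        while (cut > 0 && !IsBoundary(text[cut]))
        {
            cut--;
        }

        // No word break found before index 0: fall back to a hard cut.
        if (cut == 0)
        {
            cut = maximumLength;
        }

        return text.Substring(0, cut).TrimEnd() + suffix;
    }

    private static bool IsBoundary(char c)
    {
        return char.IsWhiteSpace(c) || Punctuation.Contains(c);
    }
}
```

For example, "The quick brown fox".TruncateWithEllipsis(12) gives "The quick...", while a single long word such as "Supercalifragilistic" with a maximum length of 5 falls back to "Super...".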

Stripping HTML from text using C#

I recently had a situation where I needed to show some text received in HTML format as plain text. This is the method I now use for this purpose, implemented as an extension method:

using System.Linq;
using System.Text.RegularExpressions;

namespace ExtensionMethods
{
    public static class StringExtensionMethods
    {
        public static string StripHtml(this string text)
        {
            if (string.IsNullOrEmpty(text))
            {
                return text;
            }

            var tagRegex = new Regex(@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");
            var tagMatches = tagRegex.Matches(text);

            var commentRegex = new Regex(@"\<![ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t]*)\>");
            var commentMatches = commentRegex.Matches(text);

            // Remove each comment first. Doing this before the tags matters:
            // a comment that contains tags would otherwise be altered by the
            // tag replacement and no longer match its original text:
            text = commentMatches.Cast<object>()
                .Aggregate(text, (current, match) => current.Replace(match.ToString(), string.Empty));

            // Replace each tag match with a single space:
            text = tagMatches.Cast<object>().Aggregate(text, (current, match) => current.Replace(match.ToString(), " "));

            // Also replace &nbsp; entities with an ordinary space:
            text = text.Replace("&nbsp;", " ");

            // Trim and remove all double spaces:
            text = text.Trim().RemoveDoubleSpaces();

            return text;
        }

        public static string RemoveDoubleSpaces(this string text)
        {
            if (string.IsNullOrEmpty(text))
            {
                return text;
            }

            // Condense all double spaces to a single space:
            while (text.Contains("  "))
            {
                text = text.Replace("  ", " ");
            }

            return text;
        }
    }
}

The RemoveDoubleSpaces method is also needed because replacing each HTML element with a space can leave runs of several spaces where a single space would do. It is quite a useful method in its own right, which is why I have separated it out.
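As an alternative to the loop, the same condensing can be done in a single pass with a regular expression. This is only a sketch: the class name WhitespaceExtensions and the method name CollapseSpaces are made up for the example, and like RemoveDoubleSpaces it only touches the ordinary space character, not tabs or line breaks.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceExtensions // hypothetical name
{
    // Matches any run of two or more ordinary space characters.
    private static readonly Regex SpaceRuns = new Regex(" {2,}", RegexOptions.Compiled);

    // Hypothetical helper: replaces every run of spaces with a single space.
    public static string CollapseSpaces(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        return SpaceRuns.Replace(text, " ");
    }
}
```

For example, "a   b  c".CollapseSpaces() returns "a b c".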

If you find any inputs which trip this method up, please let me know.
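One kind of input worth knowing about is text containing entities other than &nbsp;, such as &amp; or &lt;, which StripHtml leaves untouched. A minimal sketch of one way to handle them, assuming System.Net.WebUtility is available on your target framework; the names EntityExtensions and DecodeEntities are hypothetical. It should run after the tags have been stripped, otherwise an encoded &lt;b&gt; would be decoded into something that looks like a real tag.

```csharp
using System.Net;

public static class EntityExtensions // hypothetical name
{
    // Hypothetical helper: decodes HTML entities in text whose tags
    // have already been removed.
    public static string DecodeEntities(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        // WebUtility.HtmlDecode turns &amp; into &, &lt; into <,
        // and &nbsp; into the non-breaking space character U+00A0.
        var decoded = WebUtility.HtmlDecode(text);

        // Map the non-breaking space to an ordinary space so that later
        // space handling treats it like any other space.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

For example, "Fish &amp; chips&nbsp;&lt;3".DecodeEntities() returns "Fish & chips <3".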