Extracting all numbers from a string
A problem that I ran across: given an ill-formatted string that contains numbers, how do I extract them all? By ill-formatted, I mean that the string doesn’t have a consistent structure, such as a comma-separated list, from which I can pull out the numbers. Ill-formatted strings are usually plain text written for humans to read.
An example of an ill-formatted string would be:
There are 3 cases of number formats such that a=1, b is 27m/s, and c:33.0%
Simple string splitting won’t do the job, so I had to resort to using a regular expression splitter. The code below demonstrates how to split the string.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Ill-formatted string to parse.
string input = "There are 3 cases of number formats such that a=1, b is 27m/s, and c:33.0%";

// Split the string looking for numbers.
var numbers = Regex.Split(input, "[^.\\d]").Where(n => !String.IsNullOrWhiteSpace(n));

foreach (string number in numbers)
    Console.WriteLine(number);

// The output on the console will be:
// 3
// 1
// 27
// 33.0
There are two major components to look at. The first is the regular expression split method, which takes the expression “[^.\\d]”. The brackets with a leading ‘^’ form a negated character class: the split method treats every character that is not listed inside them as a delimiter, so the listed characters are the only ones left in the results. Two things are listed. The ‘.’ character keeps decimal points (inside brackets it is a literal dot and needs no escaping), and the ‘\d’ sequence represents any digit. Its backslash is doubled in the code only because it sits inside a regular C# string literal.
The second component is the ‘Where’ clause, which strips out the empty strings that the split returns (and there are a lot). The split produces an empty string wherever two delimiter characters sit next to each other, and also at the start or end of the input when it begins or ends with a delimiter.
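To make the role of the ‘Where’ clause concrete, here is a small sketch that counts what the split returns before and after filtering. The shorter input and the variable names (fragment, raw, kept, grouped) are only for illustration and are not part of the code above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// A shorter piece of the example sentence, used here only for illustration.
string fragment = "a=1, b is 27m/s";

// Every non-digit, non-dot character is a delimiter, so adjacent
// delimiters leave empty strings between them.
string[] raw = Regex.Split(fragment, "[^.\\d]");

// The same filter as in the code above.
string[] kept = raw.Where(n => !String.IsNullOrWhiteSpace(n)).ToArray();

Console.WriteLine($"{raw.Length} raw pieces, {kept.Length} kept");
// Prints: 13 raw pieces, 2 kept

// Variation: a trailing '+' treats each run of delimiters as one split point.
// Empty strings can still appear at the start and end, so the filter stays.
string[] grouped = Regex.Split(fragment, "[^.\\d]+");
Console.WriteLine(grouped.Length);
// Prints: 4
```

The ‘+’ variation is not required; it only cuts down how many empty strings the filter has to discard.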
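The code above isn’t perfect. A number written as “1,000,000” is broken into “1”, “000” and “000” because the comma is treated as a delimiter, and adding ‘,’ to the brackets would also keep the legitimate commas in the sentence. The same goes for decimal points and periods: a sentence that ends in “27.” yields “27.” rather than “27”, and a period that follows a word yields “.” on its own. Additional conditions may need to be added to the ‘Where’ clause to strip out strings that only contain a ‘.’ or ‘,’ character.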
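One possible way around the comma and period problems is to match the numbers directly instead of splitting on everything else. The sketch below is an alternative technique, not the approach used above. The helper name ExtractNumbers and the pattern are my own assumptions: the pattern expects thousands separators in groups of exactly three digits and ignores signs and exponents.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns each number found in the text as a string.
static string[] ExtractNumbers(string text)
{
    // First alternative: 1-3 digits followed by one or more ",ddd" groups,
    // with an optional decimal part (for example 1,000,000 or 12,345.67).
    // Second alternative: plain digits with an optional decimal part.
    const string pattern = @"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?";

    return Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
}

// Sample sentence made up for this sketch.
string sample = "Revenue was 1,000,000 dollars, up 3.5% from 27. Costs fell.";

foreach (string number in ExtractNumbers(sample))
    Console.WriteLine(number);

// The output on the console will be:
// 1,000,000
// 3.5
// 27
```

Because a comma or period is only accepted when digits follow it, the punctuation of the sentence itself is left out, and no ‘Where’ clause is needed since a match is never empty.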