The String Tokenizer library has been deprecated. In its place is the
String Toolkit Library (StrTk), an advanced
and highly efficient C++ string processing library that features
various tokenization, splitting, parsing, serialization, formatting and conversion facilities,
along with numerous other string processing and transformation routines.
String tokenization is the problem of breaking up
a string into tokens which are separated by delimiters. Both tokens and
delimiters are themselves strings. Commonly used string structures that
require string tokenization include Comma Separated Values (CSV),
written text and essentially any other data grouping format where distinct
units of data are separated by some kind of delimiter.
Assume you have data units representing fruit that is in a basket:
Apple
Peach
Orange
Banana
If you decided to package these data units and either send them over a socket or
store them in a file, a simple approach would be to concatenate all of the data units
into one large string. An example would be as follows:
ApplePeachOrangeBanana
The problem with this kind of formatting is that while producing the data string is
simple, reading the data back in and breaking it up into its independent data
units is not: parsing the data and determining where a particular unit begins and
where it ends becomes rather difficult.
A solution to this problem is to place a character, or series of characters, that is known never
to occur within the data units, and use it as a marker to separate the data units. Assuming our
delimiter is a ':' the previous example may look something like this:
Apple:Peach:Orange:Banana
The StringTokenizer class can be used to break up strings of data that have been created
from tokens separated by delimiters. The class allows delimiters to be strings
themselves, as well as simple chars. The StringTokenizer behaves somewhat like a stack,
in that you cannot access an arbitrary token within the data, only the next token at the head of the data.
It provides the ability to find out how many tokens are in the data, in order to support looping patterns
for extracting the tokens out of the data, and also allows tokens to be converted into basic
built-in types such as int and double.
Instantiating A StringTokenizer
std::string data = "abc:def:ghi:jkl";
StringTokenizer strtok(data, ":");
Obtaining Number Of Tokens (Remaining)
unsigned int tokenCount = strtok.countTokens();
Obtaining The Next Token
std::string token = strtok.nextToken();
It should be noted that every time a token is obtained from the StringTokenizer, it is
removed from the data store (similar to a pop call on a stack).
Checking To See If There Are Still More Tokens
if (strtok.hasMoreTokens())
{
   std::cout << "Still has more tokens!" << std::endl;
}
else
{
   std::cout << "No more tokens left." << std::endl;
}
Obtaining The Next Token As An int
int token = strtok.nextIntToken();
Obtaining The Next Token As A double
double token = strtok.nextFloatToken();
Obtaining The Remaining Data As One String
std::string remainingData = strtok.remaining();
Filter A Token
In some situations the tokens may be formatted with unnecessary strings of characters, such as
in database files where tuples are stored and particular fields in the tuples are defined to
be of a fixed size. When the data being placed into such a field is smaller than the
required size, the data is padded by repeatedly adding a character such as a SPACE
until it reaches the required size. StringTokenizer allows you to define the padding
pattern and have it filtered out before the token is passed back.
It should be noted that the filtering occurs over the entire token, meaning that if the
filter pattern occurs within the actual token itself, it will also be filtered out. In various instances
this may lead to undesired behaviour.
In this example we will assume there exists a database that contains tuples relating to information
about people. The fields for the tuple will be:
First Name
Surname
Year Of Birth
Height (meters)
From the above definition, and assuming we construct our data tuples in the same order as is listed above and
separate each field with a '#' symbol, we can give some examples of possible tuples that may be found in
this database.
John#Doe#1970#1.53
Jane#Doe#1980#1.78
Bob#Cob#1900#2.34
Tokenization of a data string in the format above may look something like the following using StringTokenizer:
StringTokenizer is an unambiguous parser, meaning it will attempt to alter the data string passed
to it in such a way that further processing done by StringTokenizer will result in only deterministic
and rational behaviour. These alterations consist of eliminating consecutive delimiters within the string
and delimiters at the beginning and end of the string.
It is advised that values which may later become tokens in a string always have a default or null value.
For example, in the case of strings, a value such as "N/A" or "~" will be adequate. Leaving the value empty
will cause consecutive delimiters in the string, which will be eliminated at processing time. This will
result in a shift to the left of the tokens in the string, meaning, for example, that token7 will become token6
and so on.
Update 10-05-2003 - A far more advanced and optimal implementation of string tokenizing in C++ can be
found in the String Toolkit Library (StrTk). The new implementation supports generic
tokenizing for arrays of any type, iterators and split functions.
Free use of the StringTokenizer library is permitted under the guidelines and in accordance with the most current
version of the "Common Public License."