Monday, April 7, 2014

Tokenize a string with command line arguments

The tutorial for today will explore how to parse a string in C++. Parsing a string or line of characters is a crucial exercise in programming, as many programs need to read lines of text for pieces of data, usually held in character-based strings. Parsing shows up in many different areas of programming, from networking programs that need to parse a URL, to a graphics engine that needs to parse a model file for a 3D shape, to a program like Microsoft Office that counts the number of words in a document or selected line of text.

Further, we will be implementing this program with command line arguments. These are the parameters found in the main() function: argc and argv[], which respectively stand for argument count and argument vector. argc is an integer that holds the number of arguments passed to the program when it runs, including the program name itself, and argv[] is an array of C-style character strings holding those argc arguments, followed by one terminating NULL entry. The strings in argv can be simple strings, as in our case, but they can also be things like paths to image files or text files that hold data.
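As a rough illustration of how the two parameters relate, the sketch below copies the user-supplied arguments into a std::vector. The helper name collectArguments is hypothetical and is not part of this tutorial's program; it only assumes that argv holds argc valid strings, as main() receives them.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the tutorial program): copies every
// user-supplied argument into a vector, skipping argv[0] (the program name).
std::vector<std::string> collectArguments(int argc, char* argv[])
{
    std::vector<std::string> arguments;

    // Valid argument strings live at argv[1] .. argv[argc - 1].
    for (int i = 1; i < argc; i++)
    {
        arguments.push_back(argv[i]);
    }
    return arguments;
}
```

Calling collectArguments(argc, argv) at the top of main() would give you the arguments as std::string objects, which are easier to work with than raw character pointers.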

We will be taking normal string arguments for our program, and parsing them by counting the number of words inside each string, checking for whitespace (i.e. ‘ ’) in each string to determine where one word ends and the next begins. The strings will be normal, double-quoted strings (“ ”) like in C++; the quotation marks make the command prompt pass each string, spaces included, as a single argument.

The first line in the main() function is a call to printf(), with a format string supplied as the first argument and argv[0] as the second. The string "Program name %s\n" will output “Program name” followed by the name of the application being run, which is normally the name you gave the executable when you compiled it; by convention, the first element of argv[] is the application’s name. The %s inside the string is a format specifier, marking where other text is inserted into that string. %s stands for a C-style string, and argv[0] is the string that gets copied into the position where ‘%s’ appears.
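To see the %s substitution in isolation, here is a small sketch that builds the same text with snprintf() and returns it instead of printing it. The function name formatProgramName and the 256-byte buffer size are assumptions made for this example only.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds the same text the tutorial's first printf()
// call prints, but returns it as a std::string instead of writing to stdout.
std::string formatProgramName(const char* programName)
{
    char buffer[256]; // assumed to be large enough for this example

    // %s is replaced by the C-style string passed as the next argument.
    // snprintf never writes more than sizeof(buffer) bytes; longer output
    // is truncated.
    std::snprintf(buffer, sizeof(buffer), "Program name %s\n", programName);
    return std::string(buffer);
}
```

Passing "applicationName.exe" should give back the text "Program name applicationName.exe" followed by a newline.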

An important point to make for argv[]: it always ends with a ‘NULL’ value, stored at argv[argc]. Both ‘for’ loops in our program run up to and including that index, so the value has to be tested for before an argument is used; this is what we do in the first ‘for’ loop inside the ‘if’ block, and again in the second ‘for’ loop below the first. This also means that argv normally holds a minimum of two elements: the executable name and the ‘NULL’ value that marks the end of the argument list. The ‘NULL’ gives the argument vector an end point to test for, so that code walking the list knows where to stop instead of reading past the end of the array.
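The terminating value can be sketched on its own: the function below walks an argument vector until it reaches the null pointer, without looking at argc at all. The name countUntilNull is hypothetical, and the sketch assumes the array really is NULL-terminated, as the argv passed to main() is.

```cpp
#include <cstddef>

// Hypothetical helper: counts entries by walking argv until the terminating
// null pointer, without consulting argc.
int countUntilNull(char* argv[])
{
    int count = 0;
    while (argv[count] != NULL)
    {
        count++;
    }
    return count;
}
```

For the argv given to main(), the result should match argc, which is why the two can be used interchangeably to find the end of the list.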

The second ‘for’ loop is where we tokenize our string, meaning to break the string into several words, or in our case to simply count the words inside a string. Tokenization is crucial for areas of programming that require the analysis of strings. (The same word is used in data security for replacing sensitive data with stand-in reference values, which is a different technique.) Each argument is copied into a std::string variable, which is then passed to the wordCounter function.
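Our program only counts words, but the other form of tokenization, actually breaking the string apart, can be sketched with the standard library. This is one possible approach, not the method used in the code for this tutorial; the name splitWords is hypothetical, and note that the >> operator splits on any whitespace (tabs and newlines too), not just the space character.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a line into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& line)
{
    std::vector<std::string> words;
    std::istringstream stream(line);
    std::string word;

    // operator>> skips leading whitespace and reads one word at a time,
    // so repeated spaces never produce empty words.
    while (stream >> word)
    {
        words.push_back(word);
    }
    return words;
}
```

The size() of the returned vector is the word count, so a splitter like this can also stand in for a counter when you need the words themselves as well.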

Each character in the string supplied with the call to wordCounter(string) is analyzed in this function. The ‘if’ statement tests each character for two conditions; A: that the current character is a space, and B: that the character before it is not a space (to prevent counting multiple spaces in a row as extra words). If both conditions are met, the integer wordCount is incremented. Because wordCount starts at one, this approach assumes the string contains at least one word and does not end with a space. When the line is fully read, the function returns the value of wordCount to the count variable in the second ‘for’ loop. After that, the program outputs a message with the count variable, displaying the number of words counted in the string.
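If you need a count that also copes with empty strings and with leading or trailing spaces, one alternative is to count the places where a word starts rather than where it ends. The sketch below is a variation on the idea, not the function used in this tutorial; the name countWordStarts is hypothetical, and like the original it treats only the space character as a separator.

```cpp
#include <string>

// Hypothetical alternative to wordCounter: counts a word each time a
// non-space character follows a space or begins the string.
int countWordStarts(const std::string& line)
{
    int words = 0;
    bool inWord = false; // true while we are inside a run of non-space characters

    for (std::string::size_type i = 0; i < line.length(); i++)
    {
        if (line[i] == ' ')
        {
            inWord = false;      // a space ends the current word, if any
        }
        else if (!inWord)
        {
            inWord = true;       // first character of a new word
            words++;
        }
    }
    return words;
}
```

Because the counter starts at zero and only looks at the current character, there is no need to read the character before the start of the string.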

Navigate in the command prompt to the location of your testing area, start notepad and create a .cpp file, and copy and paste the code below; when finished, save the file and compile it using:

g++ -Wall filename.cpp -o applicationName.exe
And when you run the executable, add some strings in double quotation marks after its name, like so:

applicationName.exe "Hello There!" ":D How are you?" "Good, thank you very much"
Your program will output the application name, followed by the strings you’ve input. Then each string will be analyzed by the wordCounter function, and the program will output both the string and the message with its word count.

New Terms:

Tokenization: The act of separating a string into its words (tokens), either counting them or moving them into separate data structures altogether.

argc: The first main() parameter, which holds the number of arguments passed into the application. Use this to keep count of the arguments: remember that the application name itself counts as the first argument, so test for a number >= 2 to check whether any arguments were actually passed into the application.

argv[]: The second main() parameter; this one holds the actual string arguments that the application will be working with. Remember that the first value (argv[0]) will normally be the application name, and that the element after the last argument (argv[argc]) will always be a NULL value. Remember to test for the NULL value, and never pass it to any function that expects a string.

Code

#include <stdio.h>
#include <string>
#include <iostream>

int wordCounter(std::string);

int main( int argc, char *argv[] ) 
{
   printf("Program name %s\n", argv[0]);

   if( argc >= 2 )
   {
      // Echo every argument, including the program name at index 0.
      // argv[argc] is the terminating NULL, so test for it and skip it.
      for(int i = 0; i <= argc; i++)
      {
         if(argv[i] != NULL) printf("%s \n", argv[i]);
      }

      // Count the words in each user-supplied argument (index 1 onward).
      for(int i = 1; i <= argc; i++)
      {
         if(argv[i] != NULL)
         {
            std::string argument = argv[i];

            printf("%s \n", argv[i]); //Print the argument

            int count = wordCounter(argument);

            printf("The number of words in this line is %i \n", count);
         }
      }
   }
   else
   {
      printf("No arguments were supplied.\n");
   }
  
   return 0;
}

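int wordCounter(std::string line)
{
       if(line.empty())
              return 0; //an empty string holds no words

       int wordCount = 1; //initialize to one word counted

       //Start at index 1 so that line[i - 1] is always a valid character
       for(unsigned int i = 1; i < line.length(); i++)
              {
                     if(line[i] == ' ' && line[i - 1] != ' ')
                           wordCount++;
              }
       return wordCount;
}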
