Exercise 25
Info

This is an in-class exercise. An exercise page like this one will contain a brief description but is intended to be supplemented by discussion during our meeting time. Complete the exercise to the best of your ability in the time given. Feel free to talk with other students as you work, and do not be afraid to ask questions if you get stuck. Aim to complete as much as possible during our meeting, and submit on Gradescope to check your solution. You are encouraged to work at home to complete what you do not get through today, and ask questions over Piazza or in office hours.

Info

Get started by running git pull to update your clone of the public repository, and then copying the exercises/ex25 directory into your personal git repository.

Learning Objectives

Reinforces basic C++ concepts

  • C++ basics: file I/O using istream and ostream, string manipulation
  • Using stringstream to extract data from a string

Goal

Implement programs to abbreviate words in a text document, analyze tokens in a text document, and analyze a text document to find the relative frequencies of letters.

Processing and analyzing text

In this exercise you will write programs and functions to manipulate and transform text using the string and stringstream classes, and read and write text files using ifstream and ofstream.

The exercise has two “required” parts, Part 1 and Part 2, and one “optional” part, Part 3.

Part 1

Words are often recognizable when one or more of the vowels (“a”, “e”, “i”, “o”, and “u”) are omitted. For example:

W'rds 'r' 'ft'n r'c'gn'z'bl' wh'n 'n' 'r m'r' 'f th' v'w'ls 'r' 'm'tt'd.

In this transformed sentence, each vowel or consecutive group of vowels has been replaced with a single apostrophe (') character. It’s a bit difficult to read, but not impossible. To test the theory that text abbreviated this way is still readable, you will write a program to automatically abbreviate text.

Complete the program in abbrev.cpp as follows. The program takes two command-line arguments. The first is the name of an input file, which will contain the input text. The second is the name of the output file to generate, containing the abbreviated form of the original text. Some helper functions have been provided to get you started, but you’ll have to finish writing abbreviate and main.

Compile the program using the command

g++ -g -std=c++11 -Wall -Wextra -pedantic abbrev.cpp -o abbrev

Example usage:

./abbrev example1.txt example1-abbrev.txt

As you work, the documentation for the string class will be useful: http://www.cplusplus.com/reference/string/string/

Three example files are provided, example1.txt, example2.txt, and example3.txt. The expected results of abbreviating the text in each file are in the files example1-expected.txt, example2-expected.txt, and example3-expected.txt. You can check your program’s output using the diff command, e.g. (assuming that your transformed version of example1.txt is called example1-abbrev.txt):

diff -w example1-expected.txt example1-abbrev.txt

If the diff command produces no output, then your output matches the expected output.

Here are some hints and suggestions.

You can read the input file line by line using the getline function, which reads a line of text from an istream (including an ifstream) into a string object. Your main loop might look something like this (assuming that in is an ifstream reading from the input file):

string line;
while (getline(in, line)) {
  // do something with line
}

To read the words in a line, create a stringstream from the line of text and then extract one word at a time:

stringstream ss(line);
string word;
while (ss >> word) {
  // do something with word
}

Defining a function with the following prototype will probably be a good idea:

string abbreviate(const string &word);
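
One possible shape for this function is sketched below. It is a minimal sketch, not the reference solution: it assumes that both uppercase and lowercase a, e, i, o, and u count as vowels, and is_vowel_char is a hypothetical helper whose name and behavior may differ from the helpers provided in abbrev.cpp.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (assumption): a, e, i, o, u in either case are vowels.
bool is_vowel_char(char c) {
  switch (std::tolower(static_cast<unsigned char>(c))) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
      return true;
    default:
      return false;
  }
}

// Replaces each maximal run of consecutive vowels with a single apostrophe.
// All other characters (consonants, digits, punctuation) are copied as-is.
std::string abbreviate(const std::string &word) {
  std::string result;
  bool in_vowel_run = false;
  for (char c : word) {
    if (is_vowel_char(c)) {
      if (!in_vowel_run) {
        result += '\'';  // emit one apostrophe at the start of the run
      }
      in_vowel_run = true;
    } else {
      result += c;
      in_vowel_run = false;
    }
  }
  return result;
}
```

With these assumptions, abbreviate("recognizable") yields r'c'gn'z'bl', matching the transformed sentence at the start of Part 1.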

You should use an ofstream to write the output file.
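
The hints above can be combined roughly as follows. This is a sketch under stated assumptions: transform_words is a hypothetical helper (not part of the starter code) that takes generic istream and ostream references so it works with an ifstream and ofstream as well as with string streams. It re-joins the words on each line with single spaces, so the original spacing is not preserved; since the suggested diff -w command ignores whitespace differences, that should not affect the comparison.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: copies text from `in` to `out`, applying `transform`
// to every whitespace-separated word. Each input line produces one output
// line, with the transformed words separated by single spaces.
void transform_words(std::istream &in, std::ostream &out,
                     std::string (*transform)(const std::string &)) {
  std::string line;
  while (std::getline(in, line)) {
    std::stringstream ss(line);
    std::string word;
    bool first = true;
    while (ss >> word) {
      if (!first) {
        out << ' ';
      }
      out << transform(word);
      first = false;
    }
    out << '\n';
  }
}
```

In main, in and out would be the ifstream and ofstream opened from the two command-line arguments, and transform would be your abbreviate function.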

Part 2

One useful technique enabled by the stringstream class is dynamically checking whether a string contains data in a particular form (integer value, floating point value, etc.).

Complete the program in classify.cpp so that it reads textual input from cin one token at a time, and then prints the following summary information:

  • the sum of the floating point values
  • the sum of the integer values
  • the number of non-numeric tokens
  • the total number of characters in the non-numeric tokens

Tokens are classified as follows.

A token should be considered a floating point value if a double value can be successfully extracted from it using a stringstream.

A token should be considered an integer value if it is not a floating point value, but an int value can be successfully extracted from it using a stringstream, such that the entire string is matched. For example, the string "3.14159" could be extracted into an int variable, but only the 3 character would be matched, leaving the remaining text (".14159") unmatched.

All tokens that aren’t floating point or integer should be considered non-numeric.
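
The basic extraction test might be sketched as follows. The helper name extracts_double is an assumption made for illustration; it only reports whether a double could be extracted from the start of the token. Note that a token such as 42 also extracts successfully as a double, so the order in which you apply the checks affects how a token is classified.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if a double can be extracted from the
// beginning of `token`, storing it in `value`. It does not check whether
// the whole token was consumed.
bool extracts_double(const std::string &token, double &value) {
  std::stringstream ss(token);
  return static_cast<bool>(ss >> value);
}
```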

Compile the program using the command

g++ -g -std=c++11 -Wall -Wextra -pedantic classify.cpp -o classify

You can test your program by running the command

./classify < data.txt

The output should be something like:

Floating point sum: 387.542
Integer sum: 8
Number of non-numeric tokens: 24
Number of characters in non-numeric tokens: 114

Hint: one way to determine if an extraction of an integer value from a stringstream consumed the entire string is, after the int value is extracted successfully, to attempt to extract a string. If the extraction of the string value fails, then you know that the extraction of the int value consumed the entire original string.
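
The hint could be sketched like this. The helper name is_whole_int is hypothetical; the technique is the one described above: extract an int, then try to extract a string, and treat the token as a complete integer only if the second extraction fails.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true only if `token` is an int with nothing
// left over afterwards. On success the parsed number is stored in `value`.
bool is_whole_int(const std::string &token, int &value) {
  std::stringstream ss(token);
  if (!(ss >> value)) {
    return false;  // no int at the start of the token
  }
  std::string leftover;
  // If any non-whitespace text remains, the int did not match the whole token.
  return !(ss >> leftover);
}
```

For the token "3.14159" the int extraction succeeds with 3, but the string extraction then reads ".14159", so the function returns false.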

Part 3 (Bonus)

Note that this part is optional! It’s a good opportunity to practice working with input files, STL containers, and STL algorithms, but it’s not essential.

Complete the program in letter_freq.cpp so that it counts the number of occurrences of each letter in the input text file specified as the command line argument. The program should ignore case, so (for example) “A” and “a” are considered the same letter.

After analyzing the input file, the program should print a table with the number of occurrences of each letter, in order from most-frequently occurring to least-frequently occurring. For example, the invocation

./letter_freq example2.txt

should produce the output

e: 107
i: 80
t: 76
n: 62
o: 62
a: 57
r: 56
s: 53
h: 51
l: 45
m: 30
u: 26
p: 25
g: 24
f: 22
y: 22
d: 21
c: 18
w: 17
v: 13
b: 9
k: 4
q: 2
z: 2

As the data structure for recording the occurrence count of each letter, use a vector of Bucket elements, where Bucket is a struct defined something like the following:

struct Bucket {
  char letter;
  unsigned count;
};

The vector should have one Bucket per letter.
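
One way to set up such a vector is sketched below. make_histogram is a hypothetical helper name, and the loop assumes that the letters 'a' through 'z' have consecutive character codes, which holds for ASCII.

```cpp
#include <vector>

struct Bucket {
  char letter;
  unsigned count;
};

// Hypothetical helper: builds one bucket per lowercase letter, in
// alphabetical order, with every count starting at zero.
// Assumes 'a'..'z' are contiguous (true for ASCII).
std::vector<Bucket> make_histogram() {
  std::vector<Bucket> hist;
  for (char c = 'a'; c <= 'z'; ++c) {
    hist.push_back(Bucket{c, 0u});
  }
  return hist;
}
```

Keeping the buckets in alphabetical order while counting means the bucket for a lowercase letter c sits at index c - 'a'.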

The program’s main loop should read characters from the input file one at a time. The get member function should be useful. For example, if in is an ifstream reading from the input file, the main loop might look something like this:

char c;
while (in.get(c)) {
  // do something with c
}

The <cctype> header is included, so that you can use the isalpha, toupper and/or tolower functions. Your program will need to know which characters are letters, and then convert each letter to a consistent case (upper or lower), in order to know which bucket to update when a letter is encountered.

Before printing the output, you can use the std::sort algorithm to sort the vector elements so that they are arranged from most-frequently occurring to least-frequently occurring. To do this, implement a function with the following signature:

bool compare_buckets(const Bucket &left, const Bucket &right);

The function should return true if the left bucket should be before the right bucket. If two buckets have the same count, then the one with the earlier letter should come first.

To sort the vector:

sort(hist.begin(), hist.end(), compare_buckets);

This code assumes that your vector of Bucket elements is called hist.
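
A comparison function matching the description above could be sketched as follows; it is one straightforward way to express the rule, not necessarily the reference solution.

```cpp
#include <algorithm>
#include <vector>

struct Bucket {
  char letter;
  unsigned count;
};

// Returns true if `left` should appear before `right`: higher counts come
// first, and ties are broken by the earlier letter.
bool compare_buckets(const Bucket &left, const Bucket &right) {
  if (left.count != right.count) {
    return left.count > right.count;
  }
  return left.letter < right.letter;
}
```

This ordering agrees with the sample output, where n: 62 is listed before o: 62.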

Reminder

Remember to add and commit to your local repo copy as you work. Push to your remote repo when finished. Also scp your files and submit to Gradescope to check your solution. Use exit to log out of your ugrad account when finished. If you continue to work on the program after class, make sure to keep your repo updated as well!