This is an in-class exercise. An exercise page like this one will contain a brief description but is intended to be supplemented by discussion during our meeting time. Complete the exercise to the best of your ability in the time given. Feel free to talk with other students as you work, and do not be afraid to ask questions if you get stuck. Aim to complete as much as possible during our meeting, and submit on Gradescope to check your solution. You are encouraged to work at home to complete what you do not get through today, and ask questions over Piazza or in office hours.
Get started by running git pull
to update your clone of the public repository, and then copying the exercises/ex25
directory into your personal git repository.
Reinforces basic C++ concepts
- C++ basics: file I/O using
istream
andostream
,string
manipulation - Using
stringstream
to extract data from a string
Implement programs to abbreviate words in a text document, analyze tokens in a text document, and analyze a text document to find the relative frequencies of letters.
Processing and analyzing text
In this exercise you will write programs and functions to manipulate
and transform text using the string
and stringstream
classes, and
read and write text files using ifstream
and ofstream
.
The exercise has two “required” parts, Part 1 and Part 2, and one “optional” part, Part 3.
Part 1
Words are often recognizable when one or more of the vowels (“a”, “e”, “i”, “o”, and “u”) are omitted. For example:
W’rds ‘r’ ‘ft’n r’c’gn’z’bl’ wh’n ‘n’ ‘r m’r’ ‘f th’ v’w’ls ‘r’ ‘m’tt’d.
In this transformed sentence, each vowel or consecutive group of vowels has
been replaced with a single apostrophe ('
) character. It’s a bit difficult to
read, but not impossible. To test the theory that text abbreviated this way
is still readable, you will write a program to automatically abbreviate text.
Complete the program in abbrev.cpp
as follows. The program takes two
command-line arguments. The first is the name of an input file, which will
contain the input text. The second is the name of the output file to
generate, containing the abbreviated form of the original text. Some helper functions have been provided to get you started, but you’ll have to finish writing abbreviate
and main
.
Compile the program using the command
g++ -g -std=c++11 -Wall -Wextra -pedantic abbrev.cpp -o abbrev
Example usage:
./abbrev example1.txt example1-abbrev.txt
As you work, the documentation for the string
class will be useful:
http://www.cplusplus.com/reference/string/string/
Three example files are provided, example1.txt
, example2.txt
, and example3.txt
.
The expected results of abbreviating the text in each file are in the files
example1-expected.txt
, example2-expected.txt
, and example3-expected.txt
.
You can check your program’s output using the diff
command, e.g. (assuming that
your transformed version of example1.txt
is called example1-abbrev.txt
):
diff -w example1-expected.txt expected1-abbrev.txt
If the diff
command produces no output, then your output matches the expected
output.
Here are some hints and suggestions.
You can read the input file line by line using the getline
function,
which reads a line of text from an istream
(including an ifstream
)
into a string
object. Your main loop might look something like this
(assuming that in
is an ifstream
reading from the input file):
string line;
while (getline(in, line)) {
// do something with line
}
To read words from a line, creating a stringstream
from one line of text
and then reading one word at a time can be accomplished as follows:
stringstream ss(line);
string word;
while (ss >> word) {
// do something with word
}
Defining a function with the following prototype will probably be a good idea:
string abbreviate(const string &word);
You should use an ofstream
to write the output file.
Part 2
One useful technique enabled by the stringstream
class is the capability
to dynamically check a string to determine if it contains data in a particular
form (integer value, floating point value, etc.)
Complete the program in classify.cpp
so that it reads textual input from cin
one token at a time, and then prints the following summary information:
- the sum of all floating point values found in the input
- the sum of all integer values found in the input
- the number of tokens (words) in the input that weren’t numeric
- the total length of all tokens in the input that weren’t numeric
A token should be considered a floating point value if a double
value can be
successfully extracted from it using a stringstream
.
A token should be considered an integer value if it is not a floating point value,
but an int
value can be successfully extracted from it using a stringstream
,
such that the entire string is matched. For example, the string "3.14159"
could
be extracted into an int
variable, but only the 3
character would be matched,
leaving the remaining text (".14159"
) unmatched.
All tokens that aren’t floating point or integer should be considered non-numeric.
Compile the program using the command
g++ -g -std=c++11 -Wall -Wextra -pedantic classify.cpp -o classify
You can test your program by running the command
./classify < data.txt
The output should be something like:
Floating point sum: 387.542
Integer sum: 8
Number of non-numeric tokens: 24
Number of characters in non-numeric tokens: 114
Hint: one way to determine if an extraction of an integer value from a stringstream
consumed the entire string is, after the int
value is extracted successfully,
to attempt to extract a string
. If the extraction of the string
value fails,
then you know that the extraction of the int
value consumed the entire original string.
Part 3 (Bonus)
Note that this part is optional! It’s a good opportunity to practice working with input files, STL containers, and STL algorithms, but it’s not essential.
Complete the program in letter_freq.cpp
so that it counts the number of occurrences
of each letter in the input text file specified as the command line argument.
The program should ignore case, so (for example) “A
” and “a
” are considered the
same letter.
After analyzing the input file, the program should print a table with the number of ocurrences of each letter, in the order from most-frequently occurring to least-frequently occurring. For example, the invocation
./letter_freq example2.txt
should produce the output
e: 107
i: 80
t: 76
n: 62
o: 62
a: 57
r: 56
s: 53
h: 51
l: 45
m: 30
u: 26
p: 25
g: 24
f: 22
y: 22
d: 21
c: 18
w: 17
v: 13
b: 9
k: 4
q: 2
z: 2
As the data structure for recording the occurrence count of each letter, use a vector
of Bucket
elements, where Bucket
is a struct defined something like the following:
struct Bucket {
char letter;
unsigned count;
};
The vector should have one Bucket
per letter.
The program’s main loop should read characters from the input file one at
a time. The get
member function should be useful. For example, if in
is an ifstream
reading from the input file, the main loop might look
something like this:
char c;
while (in.get(c)) {
// do something with c
}
The <cctype>
header is included, so that you can use the isalpha
,
toupper
and/or tolower
functions. Your program will need to know
which characters are letters, and then convert each letter to a consistent
case (upper or lower), in order to know which bucket to update when
a letter is encountered.
Before printing the output, you can use the std::sort
algorithm to
sort the vector elements so that they are arranged from most-frequently
occurring to least-frequently occurring. To do this, implement a function
with the following signature:
bool compare_buckets(const Bucket &left, const Bucket &right);
The function should return true
if the left
bucket should be before
the right
bucket. If two buckets have the same count, then the one with
the earlier letter should come first.
To sort the vector:
sort(hist.begin(), hist.end(), compare_buckets);
This code assumes that your vector
of Bucket
elements is called hist
.
Remember to add and commit to your local repo copy as your work. Push to your remote repo when finished. Also scp and submit to Gradescope to check your solution. Use exit
to logout from your ugrad account when finished. If you continue to work on the program after class, make sure to keep your repo updated as well!