EN.601.220: Homework 5

Homework 5

Caution

You are expected to work individually.
Due: Friday April 4th at 11pm (Baltimore time).
This assignment is worth 70 points.

Learning Objectives

Objectives

To practice with:

C++ STL containers
the `string` class
file I/O
command-line arguments
input validation

Individual Assignment

This is an individual assignment. This means you must NOT show your working code to another student, and should discuss with each other only the assignment requirements and expectations. See course staff for coding help.

Overview

Caution

Before you start working on this homework, make sure you do a `git pull` on the `public` repo to get a copy of the starter code that comes with this assignment. The starter code is very minimal that includes one file only named `wordgen.cpp` and a few text files for testing with.

The infinite monkey theorem states that given enough time, a monkey typing randomly on a typewriter will eventually produce the complete works of William Shakespeare (or any other literary work, for that matter). Of course, the length of time one would have to wait for that to occur is rather long; e.g., the universe would probably end before it actually happened. Although we don’t have time to test the theorem in its original form, we will test a modified theorem involving a more clever typewriter monkey. This particular monkey has been browsing books by well-known authors and remembers how often certain letter sequences appear. Rather than typing purely randomly, the monkey tries to mimic great authors by repeating patterns it has seen before (though it does not understand the actual meaning of anything that it types). The monkey may still not produce Hamlet itself, but might at least be able to produce something passing as Shakespearean.

More practically, this project will give you experience writing programs using multiple classes, working with STL containers, and performing file I/O in C++. You should read through the entire handout before proceeding with the project! Remember to plan your classes and methods before beginning to code!

Consider the following three excerpts of text:

Call me Ishmael.  Some years ago--never mind how long
precisely--having repeatedly smelt the spleen respectfully, not to
say reverentially, of a bad cold in his grego pockets, and throwing
grim about with his tomahawk from the bowsprit?


Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.


Call me Ishmael, said I to myself.  We hast seen that the lesser man is
far more prevalent than winds from the fishery.

The second excerpt is the first sentence of Herman Melville’s Moby Dick. The other two excerpts were generated in Melville’s style using a simple algorithm (Claude Shannon, A mathematical theory of communication). In this homework, you will implement Shannon’s algorithm, allowing you to programmatically generate text in the style of real authors!

Shannon’s algorithm is based on letter probability distributions. Imagine taking the book Tom Sawyer and determining the probability with which each character occurs (we’ll call this a level-0 analysis). You’d probably find that spaces are the most common, that the character e is fairly common, and that the character q is rather uncommon. After completing this analysis, you’d be able to produce random Tom Sawyer text based on character probabilities by just sampling one character at a time. It wouldn’t have much in common with the real thing, but at least the characters would tend to occur in the proper proportion. In fact, here’s an example of what you might produce:

Level 0: rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo

Now imagine doing a slightly more sophisticated analysis by determining the probability with which each character follows every other character – we’ll call this a level-1 analysis. You would probably discover that h follows t more frequently than x does, and that a space follows . more frequently than , does. For example, if you are analyzing the text the theater is their thing and considering the letter h, then e appears after h three times, i appears after h one time, and no other letters ever appear after h. So the probability that e follows h is 0.75 (75%); the probability that i follows h is 0.25 (25%); the probability that any other letter follows h is 0.

Using a level-1 analysis, you could produce some randomly generated Tom Sawyer text by picking some character to begin with and then repeatedly choosing the next character based on the previous one and the probabilities revealed by the original text analysis. Here’s an example:

Level 1: Shand tucthiney m? le ollds mind Theybooure He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre lin–p asto oun theanthadomoere

Now imagine doing a level-k analysis by determining the probability with which each character follows every possible sequence of characters of length k. For example, a level-5 analysis of Tom Sawyer would reveal that r follows Sawye more frequently than any other character. After such an analysis, you’d be able to produce random Tom Sawyer text by always choosing the next character based on the previous k characters and the probabilities revealed by the analysis.

At somewhat higher levels of analysis (e.g., levels 5–7), the randomly generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Hamlet or Moby Dick. Here are some more Tom Sawyer examples:

Level 2: “Yess been.” for gothin, Tome oso; ing, in to weliss of an’te cle – armit. Papper a comeasione, and smomenty, fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a youck to themen

Level 4: en themself, Mr. Welshman, but him awoke, the balmy shore. I’ll give him that he couple overy because in the slated snufflindeed structure’s kind was rath. She said that the wound the door a fever eyes that WITH him.

Level 6: people had eaten, leaving. Come – didn’t stand it better judgment; His hands and bury it again, tramped herself! She’d never would be. He found her spite of anything the one was a prime feature sunset, and hit upon that of the forever.

Level 8: look-a-here – I told you before, Joe. I’ve heard a pin drop. The stillness was complete, how- ever, this is awful crime, beyond the village was sufficient. He would be a good enough to get that night, Tom and Becky.

Level 10: you understanding that they don’t come around in the cave should get the word “beauteous” was over-fondled, and that together” and decided that he might as we used to do – it’s nobby fun. I’ll learn you.”

To summarize the algorithm: given some input text (e.g., the text of Tom Sawyer) and the level k of the desired analysis, we first process the input text and store the probabilities of every possible character that follows each k-length sequence encountered in the input text. Following this analysis, we can generate random text as follows: first, pick the first k letters from the input text to bootstrap the random text. Then, repeatedly choose the next character by looking at the preceding k characters in the random text and selecting randomly given the probability information from the input text analysis. We can continue to select random characters in this way to generate as much output text as desired.

Program Specification

Your program will take three command-line arguments: the file to read, the desired maximum level k, and the number of characters of text to generate (n). Your program should do some basic error checking on the inputs. If the entered file can’t be read, if the value of k is invalid (non-numeric, or less than 1), or if the amount of random text is invalid (non-numeric, or less than k), your program should print an error message to the standard error stream (cerr) and exit with a non-zero exit code. Exact error messages and exit codes are up to you.

You are strongly encouraged, however, to not do all of this right from the start. Instead, first write a program that works from just a single input file and uses a constant value of k. Once that works, generalize it to the more complete scenario.

Your program must perform a level-i analysis for each i between 1 and k (inclusive), printing n characters of text from each analysis, as shown in the examples above. For example, running your program as ./wordgen ts.txt 4 75 would produce the following output:

Level 1: Co l iritont d, bubentrentet foroudsped ffug s wow Mr miginerire ond tout."
~~~
Level 2: CHAPTER XXXXII
"Baroubt."
"But ove gairie MANDER XIV
Hucke gloneriss.
T
~~~
Level 3: CHAPTER XI
They've inst theird, and skiff pearthey're stookind, to that. I
~~~
Level 4: TOM'S minished at. Tom stopped in then! Tom, it himself been seen a man ice

As you can see from this example, newlines are an expected part of the output (as long as your source text also contains newlines, of course). Thus, to help make the boundaries between each level analysis clear, your program should print a line with three tilde (~) characters between each different level of output text.

In addition to generating random text, as described above, your program must also support a -m flag, which replaces the n value when the program is run (that is, run the program with a command line ./wordgen ts.txt 4 -m). When this flag is used, your program should print out the character distribution map, in JSON format, as shown in the Hints and Suggestions section below. It should do so only for the value of k provided, rather than all values up to and including k.

NOTE: The autograder will test your code by parsing the output as JSON and then comparing it against the expected output, so indentation and the order of key/value pairs does not matter, but your output must be valid JSON or the autograder will have a parse error. Make sure to test it to make sure. If you’ve not worked with JSON before, the section below shows the character distrubtions of several strings in JSON format. You can also read a bit more about it online, such as here or here. There are certain characters that have special meaning in JSON – things like newlines, quotes, slashes – that without special effort, will cause parse errors. The autograder will not test your implementation with any of these characters, and you do not need to worry about them here. We’ll only test “normal” characters (things like letters, numbers, punctuation, spaces, etc) that doesn’t cause problems with the JSON. When generating random text, your program should still work fine for these characters, however.

Hints and Suggestions

As always, you should start with design before you jump into the implementation. A good way to approach the design is to think about the two stages of the program: first, processing the input text to calculate the probability information (stage #1), and second, using that probability information to generate random text (stage #2). Thus, your program will need to repeatedly perform the following two operations:

Given a string of k characters and the following (k+1) character from the input text, update the probabilities in your probability table. This operation will be used when reading the text input and building the table (stage #1).
Given a string of k characters and using the probabilities previously computed and stored in the table, select the next (i.e., k+1) character to follow in the generated text. This operation will be used when generating the output text (stage #2).

After you implement stage #1, we strongly encourage you to test your implementation to make sure that it builds the correct character distributions.
To do so, you should come up with a small amount of text, with a decent amount of repetition (in order to make frequencies more than just 0s and 1s) and calculate the character distributions by hand. For instance, you might use the input text abcdabcbaa. A level-1 analysis of this text would produce the following distribution:

{
    "a": {
        "b": 2,
        "a": 1
    }, 
    "b": {
        "c": 2,
        "a": 1
    },
    "c": {
        "d": 1,
        "b": 1
    },
    "d": {
        "a": 1
    }
}

This shows that, after an a, the input text contained two bs and one a. After a b, the text contained two cs and one a. Results for c and d can be read similarly.

A level-3 analysis of the same text would produce the following distribution:

{
    "abc": {
        "d": 1,
        "b": 1
    },
    "bcd": {
        "a": 1
    },
    "cda": {
        "b": 1
    },
    "dab": {
        "c": 1
    },
    "bcb": {
        "a": 1
    },
    "cba": {
        "a": 1
    }
}

Important Note

The program must use container classes from the C++ Standard Template Library (STL) to keep track of character distributions. You must at least use `std::string` and `std::map`, but you are free to use others as well. Take the time to understand the STL containers; selecting the right ones will make your code cleaner and easier to write and debug.

Special Cases

One particular complication to note is any situation in which you generate a random `k`-length sequence that has no known following characters. In particular, this situation can arise if your input ends in a `k`-length sequence that appears nowhere else in the input. In this case, if you randomly generate this particular `k`-length sequence, the program will have no probability information to decide what character should follow it. A straightforward fix to this problem is to prepend (i.e., add to the beginning) the final `k`-length sequence of the input to itself before processing. That way, there will never be a unique `k`-length sequence appearing at the end of the input.

Git log

In the assignments folder of your private repository, create a new subfolder named `hw5`. Do your work in that subfolder and use `git add`, `git commit` and `git push` regularly to backup your work as you make progress!

README

You need to submit a file called `README` (not `README.txt` or `README.md`, etc -- just `README`), including information about additional changes you made (besides the program specification) and anything the graders should know about your submission. In your `README` you should write your Hopkins ID (random 6 character code) at the top of the file, briefly justify the structure of your program; why you defined the functions you did, etc., and if applicable tell the graders if you couldn't do everything. Where did you stop? What did you get stuck on? What are the parts you already know do not work according to the requirements?

Specific Requirements

Your program must have all the functionality described above.
You must use `std::cin`, `std::cout`, `std::cerr`, the insertion operator `<<` and the extraction operator `>>` for all input and output. Don't use `printf`, `scanf` or other C I/O functions.
Your program should not directly allocate or deallocate any heap memory. That is, you should not directly call `malloc`, `free`, `realloc`, `calloc`, `new`, `delete` or similar functions. STL containers or other parts of the C++ library might call these functions, which is fine.
Your solution should run very quickly, even on large input texts. If your solution is running slowly (more than a couple seconds) think again about whether you are making good use of iterators and the STL containers.
Factor your code into functions, each function performing a distinct task. Don't do everything in main.
All variables must be declared inside functions. No variables should be global or extern.
Do not use auto.
Use header guards in all header (.h) files.
Do not use `using` in header (.h) files.
In C++ source files (.cpp files), you may import individual symbols using statements like `using std::string`.
Do not use `using namespace ID`, either in headers or in source files.
You do not need to handle or report any errors other than what is mentioned above.

Makefile

Important Note

You need to write your own Makefile. Make sure you have defined the target `wordgen` properly to compile your program. We will run `make wordgen` to compile your program and produce an executable named `wordgen`. Failure to comply with this requirement will result in a zero score.

Your submission to Gradescope

Create a .zip file named hw5.zip containing your source/header files, Makefile, gitlog.txt, and README. Do not include any .txt files and never submit any executable or object files!

Copy the hw5.zip file to your local machine (using scp or pscp), and submit it to Gradescope. When you submit, Gradescope conducts a series of automatic tests. These do basic checks, e.g. to check that you submitted the right files. If you see error messages (in red), address them and resubmit. You may resubmit any number of times prior to the deadline; only your latest submission will be graded. Review the course syllabus for late submission policies (grace period and late days).

Danger

Remember that if your final submitted code does not compile, you will earn a zero score for the entire assignment.

Info

Two notes regarding automatic checks for programming assignments:

Passing an automatic check is not itself worth points. (There might be a nominal, low point value like 0.01 associated with a check, but that won’t count in the end.) The checks exist to help you and the graders find obvious errors.
The automatic checks cover some of the requirements set out in the assignment, but not all. It is up to you to test your own work and ensure your programs satisfy all stated requirements. Passing all the automatic checks does not mean you have earned all the points.