This is a follow-up to Jumping Into C++, a first-hand account of my experiences writing my first modern C++ application. It was a simple exercise that turned out to be not so simple and, thanks to the community, much more educational than I imagined.
Thank you.
And now, in no particular order, a few thoughts.
About not using regular expressions. I avoided regular expressions early in the project because I was leery of the additional complexity they might introduce. Not complexity from the regular expression language itself (\w+ etc.) but complexity introduced by the C++ plumbing required to use a regular expression. I didn't research the issue before nixing them, I simply nixed them. I was worried about getting stuck in what I thought would be additional complexity.
Fear is not a reason to eliminate a potential solution.
On the plus side, avoiding regular expressions gave me the opportunity to learn how to remove characters from a std::string and, in a subsequent version, the joys of lambdas. It also taught me that even with short example projects, correctness matters. I should have had a more rigorous definition of "word" and implemented a slightly better solution, one that a regular expression delivers concisely.
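For readers who have not seen it, here is a minimal sketch of removing characters from a std::string with a lambda, using the standard erase-remove idiom. It is not the code from the original program; the function name strip_punctuation is hypothetical, and treating "unwanted" as "anything std::ispunct accepts" is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a copy of `word` with punctuation removed.
std::string strip_punctuation(std::string word)
{
    // std::remove_if shifts the characters to keep toward the front and
    // returns an iterator to the new logical end; erase() trims the rest.
    word.erase(std::remove_if(word.begin(), word.end(),
                              // Taking unsigned char avoids undefined behavior
                              // when a char with a negative value reaches ispunct.
                              [](unsigned char c) { return std::ispunct(c) != 0; }),
               word.end());
    return word;
}
```

With the default "C" locale, strip_punctuation("don't!") yields "dont", which also shows why a loose definition of "word" bites: the apostrophe disappears along with the exclamation mark.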
About the speed of Community. MattPD posted an alternative version with comments a little over an hour after the original post. In the first twelve hours, many more shared alternative versions and observations, including Diego Dagum, Stephan Lavavej, Carl Daniel, Ivan, Daniel Earwicker, Mike and others. By Tuesday afternoon there were at least five different versions, including one in C# and two that used memory mapped files, and many more comments on the blog and elsewhere. Comments were thoughtful and there was no trolling, not even when DotNet quipped "There is a better way. Jumping into C#!".
The C# part of me agreed 100%.
Community is awesome.
About the different solutions. In broad strokes, all solutions made changes to the same core functionality, namely file input and word parsing.
To handle file input, the original program used iostream, a filename check to weed out non-text files, and a convoluted loop to check for failure on the initial read from the file. The code worked for normal cases but would fail to weed out some files. Enhancements include:
- Stephan Lavavej uses an ifstream, skipping any file that cannot be opened. He processes the file one line at a time using getline in a while loop. Words are parsed from the file one line at a time. (Grab a copy of Stephan's solution here).
- MattPD (source) handles files similarly to the original but uses Boost to filter files based on file extension (boost::algorithm::iends_with(file, ".exe")) and to parse words. I now know why C++ developers have strong feelings for Boost.
- Giovanni Dicanio and the duo James McNellis and Kenny Kerr pursued a need for speed and used memory mapped files. Giovanni wrote a custom allocator to speed up map allocation for new words. James and Kenny go a step further, exploiting the Visual C++ Concurrency Runtime. James and Kenny wrote a blog post with all the details here. Go read it now!
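To make the first approach concrete, here is a rough sketch of the "open with ifstream, skip what cannot be opened, read with getline in a while loop" pattern. This is not Stephan's actual code. The names WordCounts, count_words and count_words_in_file are hypothetical, and splitting each line on whitespace is a placeholder for whatever word parsing a real solution uses.

```cpp
#include <fstream>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical alias: maps each word to the number of times it was seen.
using WordCounts = std::map<std::string, long long>;

// Reads the stream one line at a time and counts whitespace-separated tokens.
void count_words(std::istream& in, WordCounts& counts)
{
    std::string line;
    // getline returns the stream, which converts to false at end of file
    // or on a read error, so the loop needs no separate failure check.
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            ++counts[word];
        }
    }
}

// Returns false, leaving `counts` untouched, if the file cannot be opened.
bool count_words_in_file(const std::string& path, WordCounts& counts)
{
    std::ifstream file(path);
    if (!file) {
        return false; // skip this file
    }
    count_words(file, counts);
    return true;
}
```

Taking a std::istream& in count_words is a design choice for the sketch: the same loop works on a file, on std::cin, or on a std::istringstream in a test.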
To parse a word, the original program read a string from the input stream and hacked out certain punctuation characters using a custom function. It left room for improvement but met the spirit (*cough* weasel word *cough*) of the requirements. Enhancements include:
- Stephan Lavavej uses regular expressions. I love this solution because it is clear (once I understood the token iterator), concise and painless to update if the definition of "word" changes. Even better, after discussing the solution with Stephan (perk of working down the hall from him) I got a copy of his Regex 2.0 deck with overview and useful examples. Stephan also made this video on Channel 9.
- Different folks used Boost regular expressions to improve execution speed. I need to read the Boost docs.
- MattPD uses Boost to erase unwanted characters like so: boost::remove_erase_if(word, static_cast<int(*)(int)>(std::ispunct));. (The static cast tells the compiler which overloaded std::ispunct should be used. It is cleaner than my original version.) I admire the Boost version more because I wrote a non-Boost version.
- Giovanni Dicanio and the duo Kenny Kerr and James McNellis use a custom parser. At the risk of oversimplifying, both use pointers to process file data, skipping unwanted characters with one of the std::is* character tests. These solutions were C++ but they spoke to the C side of me.
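As an illustration of the regular expression approach, here is a small sketch built on std::sregex_token_iterator. It is not Stephan's solution. The function name extract_words is hypothetical, and the pattern \w+ (letters, digits and underscore) is only one assumed definition of "word".

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every match of the word pattern in `line`.
std::vector<std::string> extract_words(const std::string& line)
{
    // Changing the definition of "word" means changing only this pattern.
    static const std::regex word_pattern(R"(\w+)");

    std::vector<std::string> words;
    // The token iterator walks over each match in turn; a default-constructed
    // iterator marks the end of the sequence.
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(line.begin(), line.end(), word_pattern);
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Note that with this pattern "It's" splits into "It" and "s". Whether that is right depends on the definition of "word", which is exactly the kind of decision the pattern makes explicit.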
The awesomest part about these solutions is watching how each evolved from previous discussions and versions. The journey is the trip, not the destination!
About scaring off potential new C++ developers. As the original program evolved, there were discussions about variable declaration best practices, performance tradeoffs for different statements and libraries, which arguments are passed at the command line, how memory is allocated in C++ versus C#, memory mapped files, long long (as in guaranteed to be at least 64 bits wide, not a galaxy far, far away), and parallel tasks.
From the outside, it is intimidating. What, can't those C++ folks agree on anything? How hard can it be? In my favorite language, it would only be two lines (if that)! From the inside, it is a technical exercise. It is about precision and developer personality. How can the code be improved? What is the proof? Can it be simpler?
I think the discussion is helpful for newbies; learning takes time and no code is perfect.
About optimization. Optimization was beyond the scope of the original exercise. Those who offered enhanced versions had experience profiling code, likely profiled the original and subsequent enhancements, and used their experience with C++ to create faster versions. I like to think that my version was just fine.
I remember some advice I was given by a grizzled veteran developer: Get a reasonable version of your code working before profiling. If you do, you will spend time optimizing where it will make the most difference, not in the places where you think it matters but it probably does not.
About reading a file using iostream. Don't. It is slower and difficult to manage for complicated file formats. This application was trivial but I will avoid it in the future.
About the next exercise. My journey to C++ Ninjahood continues. Next time, I'll write a class and likely come to terms with copy versus move semantics and all that jazz. I also want to take a look at how lack of domain knowledge can make an example program much more difficult to understand even where the underlying C++ is standard, basic stuff. Word counting does not take much domain knowledge; 3D object manipulation for real-time games does. Kinematic Equations fall somewhere in between.