
wc2 - asynchronous state machine parsing

There have been multiple articles lately implementing the classic wc program in various programming languages, to "prove" their favorite language can be "just as fast" as C.

This project does something different. Instead of a different language, it uses a different algorithm. The new algorithm is significantly faster -- even implemented in a slow language like JavaScript, it's still faster than the original wc program written in C.

The algorithm is known as an "asynchronous state-machine parser". It's a parsing technique that you don't learn in college. It's more efficient, but more importantly, it's more scalable. That's why your browser uses a state-machine to parse GIFs, and most web servers use state-machines to parse incoming HTTP requests.

This project contains three versions:

  • wc2o.c is a simplified 25-line version highlighting the idea
  • wc2.c is the full version in C, supporting Unicode
  • wc2.js is the version in JavaScript

The basic algorithm

The algorithm reads input and passes each byte one at a time to a state-machine. It looks something like:

    length = fread(buf, 1, sizeof(buf), fp);  /* read the next chunk */
    for (i = 0; i < length; i++) {
        c = buf[i];
        state = table[state][c];   /* one table lookup per byte */
        counts[state]++;           /* tally how often each state is reached */
    }

No, you aren't supposed to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, in setting up the state-machine.

The state-machine table is the difference between the simple version (wc2o.c) and the complex version (wc2.c) of the program. The algorithm is the same one shown above; the difference is in how they set up the table. The simple program creates a table for ASCII, while the complex program creates a much larger table supporting UTF-8.
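
To make this concrete, here is a minimal sketch of how such an ASCII table could be generated programmatically. The function name build_table and the state names are illustrative, not the actual wc2.c code:

    #include <ctype.h>

    enum { WAS_SPACE, WAS_NEWLINE, WORD_START, IN_WORD };  /* states */

    /* Hypothetical helper: fill in a table like the one wc2o.c hardcodes.
     * For every (state, byte) pair, decide the next state. */
    static void build_table(unsigned char table[4][256])
    {
        int state, c;
        for (state = 0; state < 4; state++) {
            for (c = 0; c < 256; c++) {
                if (c == '\n')
                    table[state][c] = WAS_NEWLINE;        /* counts a line */
                else if (isspace(c))
                    table[state][c] = WAS_SPACE;
                else if (state == WORD_START || state == IN_WORD)
                    table[state][c] = IN_WORD;            /* still in a word */
                else
                    table[state][c] = WORD_START;         /* a new word begins */
            }
        }
    }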

How wc works

The wc word-count program counts the number of words in a file. A "word" is a run of non-space characters separated by spaces.

Those who re-implement wc simplify the problem by handling only ASCII instead of full UTF-8 Unicode. This is cheating, because much of wc's runtime goes to handling character-sets like UTF-8. The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test whether they are spaces -- work that re-implementations of wc skip.
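
For reference, the hot loop of a traditional wc looks roughly like the following. This is a simplified sketch of the classic approach, not the actual GNU or BSD source, with error handling abbreviated:

    #include <wchar.h>
    #include <wctype.h>
    #include <string.h>

    /* Decode multi-byte characters one at a time and classify each one. */
    static void count_classic(const char *buf, size_t len, unsigned long *lines,
                              unsigned long *words, unsigned long *chars)
    {
        mbstate_t mbs;
        size_t i = 0;
        int in_word = 0;
        memset(&mbs, 0, sizeof(mbs));

        while (i < len) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf + i, len - i, &mbs); /* parse one char */
            if (n == (size_t)-1 || n == (size_t)-2) {        /* invalid/partial */
                memset(&mbs, 0, sizeof(mbs));
                i++;
                continue;
            }
            if (n == 0)
                n = 1;                        /* embedded NUL byte */
            i += n;
            (*chars)++;
            if (wc == L'\n')
                (*lines)++;
            if (iswspace((wint_t)wc))         /* test whether it's a space */
                in_word = 0;
            else if (!in_word) {
                in_word = 1;
                (*words)++;
            }
        }
    }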

For this reason, we've implemented a full UTF-8 version in this project, to prove that the approach works without cheating. Granted, the real wc supports many more character-sets, and we don't. But by implementing UTF-8, we've shown that it's possible, and that the speed is the same for any character-set.

Another simplification is how invalid input is handled. The original wc program largely ignores errors, but handling them is still an important factor in making sure you are doing things correctly.

Benchmark input files

This project uses a number of large input files for benchmarking. The traditional wc program has wildly different performance depending upon input, such as whether the file is full of illegal characters, or whether UTF-8 is being handled. The first test file is downloaded from the Internet as "real-world data", while the others are generated using a program built with this project (wctool).

  • pocorgtfo18.pdf a large 92-million byte PDF file that contains binary/illegal characters
  • ascii.txt a file the same size containing random words, ASCII-only
  • utf8.txt a file containing random UTF-8 sequences of 1, 2, 3, and 4 bytes
  • word.txt a file containing 92-million 'x' characters
  • space.txt a file containing 92-million ' ' (space) characters

Before benchmarking the old wc, set the character-set to UTF-8. It's probably already set to this on new systems, but do this to make sure:

$ export LC_CTYPE=en_US.UTF-8

When running wc, -lwc is the default, counting words in single-byte (ASCII) text. To switch to UTF-8 "multi-byte" mode, change the c to an m, as in -lwm.
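
For example, the two modes used in the benchmarks below are:

$ wc -lwc ascii.txt
$ wc -lwm utf8.txt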

The numbers reported come from the Unix time command: the number of seconds of user time. In other words, elapsed time and system time aren't reported.
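
In other words, each measurement is taken from the user line printed by:

$ time wc -lwc ascii.txt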

The following table shows benchmarks of the old wc program on a 2019 x86 MacBook Air. As you can see, its speed varies widely depending on the input.

The wc programs included with macOS and Linux are completely different implementations. Therefore, the following table benchmarks them against each other on the same hardware.

Command    Input File        macOS   Linux
wc -lwc    pocorgtfo18.pdf   0.709   5.591
wc -lwm    pocorgtfo18.pdf   0.693   5.419
wc -lwc    ascii.txt         0.296   2.509
wc -lwm    utf8.txt          0.532   1.840
wc -lwc    space.txt         0.296   0.284
wc -lwm    space.txt         0.295   0.298
wc -lwc    word.txt          0.302   1.268
wc -lwm    word.txt          0.294   1.337

These results tell us:

  • Illegal characters (in pocorgtfo18.pdf) slow things down a lot: roughly twice as slow on macOS, 10x slower on Linux.
  • Text that randomly switches between spaces and words is much slower than text containing all the same character.
  • On Linux, the code path that reads all spaces is significantly faster.
  • The macOS program is in general much faster than the Linux version.
  • Processing Unicode (the file utf8.txt with the -m option) is slower than processing ASCII (the file ascii.txt with the -c option).

Our benchmarks

The times for our algorithm, in C and JavaScript, are the following. The state-machine parser is immune to input type: all the input files show the same results.

Program   Input File   macOS   Linux
wc2.c     (all)        0.206   0.278
wc2.js    (all)        0.281   0.488

These results tell us:

  • This state machine approach always results in the same speed, regardless of input.
  • This state machine approach is faster than the built-in programs.
  • Even written in JavaScript, the state machine approach is competitive in speed.
  • The difference in macOS and Linux speed is actually the difference in clang and gcc speed. The LLVM clang compiler is doing better optimizations for x86 processors here.
  • I don't know why Node.js behaves differently on macOS and Linux; it's probably just due to different versions.
  • A JIT (like Node.js's) works well with simple compute algorithms. This tells us little about its relative performance in larger programs. All languages that have a JIT should compile this sort of algorithm to roughly the same speed.

Asynchronous scalability

The algorithm is faster, but more importantly, it's more scalable.

Such scalability isn't useful for wc, but it is incredibly important for network programs. Consider an HTTP web-server. The traditional way the Apache web-server worked was to read in and buffer the entire header before parsing it. This need to buffer the entire header caused an enormous scalability problem. In contrast, asynchronous web-servers like Nginx use a state-machine parser. They parse the bytes as they arrive, then discard them.

This is analogous to NFA and DFA regular-expressions. If you use the NFA approach, you need to buffer the entire chunk of data, so that the regex can backtrack. Using the DFA approach, input can be provided as a stream, one byte at a time, without needing buffering. DFAs are more scalable than NFAs.
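
To illustrate the streaming property: all of the parser's context fits in a single integer, so input can arrive in fragments of any size and be discarded immediately after processing. A minimal sketch, with illustrative struct and function names:

    #include <stddef.h>

    /* The parser's entire state, preserved between fragments. */
    struct parser {
        int state;
        unsigned long counts[4];
    };

    /* Feed one fragment of input; the buffer can be freed or reused
     * immediately afterwards -- no buffering is required. */
    static void parse_chunk(struct parser *p, const unsigned char *buf,
                            size_t len, const unsigned char table[4][256])
    {
        size_t i;
        for (i = 0; i < len; i++) {
            p->state = table[p->state][buf[i]];
            p->counts[p->state]++;
        }
    }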

State machine parsers

This project contains a minimalistic wc2o.c program to highlight the algorithm: it supports only ASCII, without all the fuss of building UTF-8 tables.

#include <stdio.h>

int main(void)
{
    /* Rows are states: 0 = was-space, 1 = was-newline,
     * 2 = start-of-word, 3 = inside-word.
     * Columns are character classes: 0 = word character,
     * 1 = space, 2 = newline, 3 = unused. */
    static const unsigned char table[4][4] = {
        {2,0,1,0,}, {2,0,1,0,}, {3,0,1,0,}, {3,0,1,0,}
    };
    /* Maps each byte to its character class: '\t' '\v' '\f' '\r' and
     * ' ' (0x20) are spaces (1), '\n' is a newline (2), and every
     * remaining byte defaults to word character (0). */
    static const unsigned char column[256] = {
        0,0,0,0,0,0,0,0,0,1,2,1,1,1,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        1,0,
    };
    unsigned long counts[4] = {0,0,0,0};
    int state = 0;
    int c;

    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }

    /* lines, words, characters */
    printf("%lu %lu %lu\n", counts[1], counts[2],
           counts[0] + counts[1] + counts[2] + counts[3]);
    return 0;
}

The key part that does all the word counting is the two lines inside the read loop:

    while ((c = getchar()) != EOF) {
        state = table[state][column[c]];
        counts[state]++;
    }

This is only defined for ASCII, so the state-machine (table) is small enough to read on a single line in the code.
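
For example, compiling and running it on a small input:

$ cc -o wc2o wc2o.c
$ echo "hello world" | ./wc2o
1 2 12

The three numbers are lines, words, and characters, in the same order wc prints them.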

Additional tools

This project includes additional tools:

  • wctool to generate large test files
  • wcdiff to find differences between two implementations of wc
  • wcstream to fragment input files (demonstrates a bug in macOS's wc)

The program wc2.c has the same logic, the difference being that it generates a larger state-machine for parsing UTF-8.
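
The table grows because UTF-8 characters span one to four bytes, so the state-machine needs additional states to track its position inside a multi-byte sequence. The length of a sequence is determined by its first byte, as this illustrative helper shows (not the actual wc2.c table-building code):

    /* How many bytes a UTF-8 sequence occupies, judged from its lead byte. */
    static int utf8_sequence_length(unsigned char c)
    {
        if (c < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII */
        if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence */
        if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence */
        if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence */
        return 0;                          /* continuation or invalid lead byte */
    }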

Pointer arithmetic

C has a peculiar idiom called "pointer arithmetic", where pointers can be incremented. Looping through a buffer is done with an expression like *buf++ instead of buf[i++]. Many programmers think pointer-arithmetic is faster.

To test this, the wc2.c program has a -P option that makes this small change, so the difference in speed can be measured.
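
The two styles look like this (a sketch of the inner loop in both forms; the function names are illustrative):

    #include <stddef.h>

    /* Index arithmetic: buf[i++] */
    static void scan_index(const unsigned char *buf, size_t length,
                           const unsigned char table[4][4],
                           const unsigned char column[256],
                           int *state, unsigned long counts[4])
    {
        size_t i;
        for (i = 0; i < length; i++) {
            *state = table[*state][column[buf[i]]];
            counts[*state]++;
        }
    }

    /* Pointer arithmetic: *buf++ */
    static void scan_pointer(const unsigned char *buf, size_t length,
                             const unsigned char table[4][4],
                             const unsigned char column[256],
                             int *state, unsigned long counts[4])
    {
        const unsigned char *p = buf, *end = buf + length;
        while (p < end) {
            *state = table[*state][column[*p++]];
            counts[*state]++;
        }
    }

In practice, modern compilers often generate identical machine code for both forms, as one of the issues below also reports for gcc.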


wc2's Issues

where are the benchmark files?

Hi,

I might have missed a line in the documentation, but I'm looking for:

  • pocorgtfo18.pdf
  • ascii.txt
  • utf8.txt
  • word.txt
  • space.txt

in this repo and they're not there. Where should I be looking?

License?

How are you licensing this software?

"The algorithm" is inconsistent, buffering nuances

I enjoyed the write-up.

The first algorithm does table[state][c] (i.e. the table is indexed by the full byte range up to UCHAR_MAX, and you will be in trouble on a platform with signed char unless your code handles that correctly), while the latter does an indirect lookup via a column table: table[state][column[c]]. Neither is wrong, of course, but it's confusing that they are inconsistent.

node.js's http_parser uses a callback model with a function-call overhead per token. It still has to buffer at least enough to emit a token, and buffering all tokens versus the whole request is more or less equivalent. You contrast that with "reading the entire header in and buffering it". I think you overlook that Ethernet data arrives in packets, so you may very well have the whole header in the first packet. If this is a production server, you have to apply size constraints; otherwise your parser may do a lot of work only to realize that you are under a denial-of-service (DoS) attempt. Either do the benchmark, or avoid extrapolating one data set to another case that is not directly comparable.

Whenever I have explored the "pointer arithmetic" point, I have found that gcc generated identical assembler. My conclusion was to write whatever is easier to read.

code in README.md differs from wc2o.c

It's maybe not an issue, and I'm not quite sure what the fourth state column in the table is needed for (probably to handle invalid input), but the initialization of table in the README.md differs from the code in wc2o.c.
