Exploring Parabix

icgrep Project Structure

Here's a summary of the largest folders in the project.

40 KB	cc	character code compiler
92 KB	editd	counts matches (seems to be broken)
112 KB	toolchain	helps connect tools to each other
212 KB	IR_Gen	LLVM IR helper code
432 KB	re	RE compiler and parser
640 KB	kernels	stream abstractions
868 KB	pablo	Pablo; mostly optimizations and transformers for Pablo
3472 KB	UCD	mostly Unicode data
5356 KB	combine	more Unicode data

Here is the overall control flow of icgrep.

icgrep.cpp:

sets configurations in accordance with the command line arguments
parses the regular expressions
finds the files to search
chooses a grep engine (e.g. grep::EmitMatchesEngine)
compiles the regular expression with the grep engine (grepEngine->grepCodeGen)
tells the grep engine which files to use (grepEngine->initFileResult)
uses the engine to search the files (grepEngine->searchAllFiles)

grepCodeGen:

uses a ParabixDriver (new ParabixDriver("engine"))

icgrep

Here's an example of the Pablo that icgrep generates. The left column comes from the regular expression "a" and the right column from "c". The two outputs have much more Pablo in common than the excerpt shown here.

	diff <(./icgrep-build/icgrep 'a' -ShowPablo 2>&1) <(./icgrep-build/icgrep 'c' -ShowPablo 2>&1) -y

	Initial Pablo AST:                                              Initial Pablo AST:
	basis[0] = Extract basis, 0                                     basis[0] = Extract basis, 0
	basis[1] = Extract basis, 1                                     basis[1] = Extract basis, 1
	basis[2] = Extract basis, 2                                     basis[2] = Extract basis, 2
	basis[3] = Extract basis, 3                                     basis[3] = Extract basis, 3
	basis[4] = Extract basis, 4                                     basis[4] = Extract basis, 4
	basis[5] = Extract basis, 5                                     basis[5] = Extract basis, 5
	basis[6] = Extract basis, 6                                     basis[6] = Extract basis, 6
	basis[7] = Extract basis, 7                                     basis[7] = Extract basis, 7
	linebreak[0] = Extract linebreak, 0                             linebreak[0] = Extract linebreak, 0
	cr+lf[0] = Extract cr+lf, 0                                     cr+lf[0] = Extract cr+lf, 0
	required[0] = Extract required, 0                               required[0] = Extract required, 0
	required[1] = Extract required, 1                               required[1] = Extract required, 1
	required[2] = Extract required, 2                               required[2] = Extract required, 2
	not = (~basis[6])                                             | not = (~basis[5])
	not_1 = (~basis[5])                                           | not_1 = (~basis[4])
	not_2 = (~basis[4])                                           | not_2 = (~basis[3])
	not_3 = (~basis[3])                                           | not_3 = (~basis[0])
	not_4 = (~basis[0])                                           | and = (basis[6] & basis[7])
	and = (basis[7] & not)                                        <
	or = (basis[4] | basis[5])                                      or = (basis[4] | basis[5])
	not_5 = (~or)                                                 | not_4 = (~or)
	and_1 = (basis[2] & not_3)                                    | and_1 = (basis[2] & not_2)
	and_2 = (basis[1] & not_4)                                    | and_2 = (basis[1] & not_3)
	and_3 = (and & not_5)                                         | and_3 = (and & not_4)
	and_4 = (and_1 & and_2)                                         and_4 = (and_1 & and_2)
	CC_61 = (and_3 & and_4)                                       | CC_63 = (and_3 & and_4)
	ipp = pablo.Advance(CC_61, 1)                                 | ipp = pablo.Advance(CC_63, 1)
	and_6 = (required[0] & ipp)                                     and_6 = (required[0] & ipp)
	fpp = pablo.ScanThru(and_6, required[1])                        fpp = pablo.ScanThru(and_6, required[1])
	matches[0] = Extract matches, 0                                 matches[0] = Extract matches, 0
	matches[0] = fpp                                                matches[0] = fpp

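The two listings compute the same thing, except that "a" tests "~basis[6]" where "c" tests "basis[6]". This makes sense, as "a" (0x61) and "c" (0x63) differ by exactly one bit. The remaining differences are just renamed temporaries (for example, not_5 on the left is not_4 on the right) and the character class name (CC_61 versus CC_63).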
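
To make the comparison concrete, here is a small Python sketch that evaluates the same boolean expressions on bit streams held in plain integers. It assumes basis[0] is the most significant bit of each byte, which is consistent with the listing (basis[7] must be set and basis[0] must be clear for both letters). The function names are made up for this note and are not part of Parabix. The temporaries that the listing defines but never uses (not_1 and not_2 for "a") are left out.

```python
def basis_streams(data: bytes) -> list[int]:
    """Transpose bytes into 8 bit streams; bit i of basis[k] describes data[i].

    Assumption: basis[0] is the most significant bit of each byte.
    """
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << i
    return basis


def char_class(basis: list[int], length: int, want_bit6: bool) -> int:
    """Mirror the listing: want_bit6=False is CC_61 ("a"), True is CC_63 ("c")."""
    mask = (1 << length) - 1  # keep negations within the input length

    def inv(stream: int) -> int:
        return ~stream & mask

    bit6 = basis[6] if want_bit6 else inv(basis[6])  # the only real difference
    and_ = basis[7] & bit6
    or_ = basis[4] | basis[5]
    and_1 = basis[2] & inv(basis[3])
    and_2 = basis[1] & inv(basis[0])
    and_3 = and_ & inv(or_)
    and_4 = and_1 & and_2
    return and_3 & and_4


def positions(stream: int) -> list[int]:
    """List the indices of the set bits, lowest first."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


data = b"abcabc"
basis = basis_streams(data)
print(positions(char_class(basis, len(data), want_bit6=False)))  # where "a" occurs
print(positions(char_class(basis, len(data), want_bit6=True)))   # where "c" occurs
```

On the input "abcabc" the first call marks positions 0 and 3 and the second marks positions 2 and 5, so flipping the single basis[6] test is enough to switch the class from "a" to "c".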

Null byte as word

I noticed that a file containing a single null byte (echo -en '\0' > /tmp/f) is counted as containing one word by the icgrep 'wc', but as containing 0 words by Linux 'wc'.
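
One possible explanation, which I have not checked against either implementation, is that the two tools disagree about which bytes can belong to a word. The Python sketch below contrasts two hypothetical rules: one treats every non-whitespace byte as part of a word, the other only lets printable ASCII bytes start or continue a word. Both functions and the whitespace set are assumptions made for illustration, not code taken from icgrep or from GNU wc.

```python
WHITESPACE = b" \t\n\r\v\f"  # assumed ASCII whitespace set


def count_words_nonspace(data: bytes) -> int:
    """Rule A (assumed): any byte that is not whitespace is part of a word."""
    count = 0
    in_word = False
    for byte in data:
        if byte in WHITESPACE:
            in_word = False
        else:
            if not in_word:
                count += 1
            in_word = True
    return count


def count_words_printable(data: bytes) -> int:
    """Rule B (assumed): only printable ASCII bytes start or continue a word."""
    count = 0
    in_word = False
    for byte in data:
        if byte in WHITESPACE:
            in_word = False
        elif 0x21 <= byte <= 0x7E:
            if not in_word:
                count += 1
            in_word = True
        # any other byte (such as NUL) neither starts nor ends a word
    return count


print(count_words_nonspace(b"\0"), count_words_printable(b"\0"))
print(count_words_nonspace(b"hello world\n"), count_words_printable(b"hello world\n"))
```

Under these assumptions the two rules agree on ordinary text but differ on the single null byte: rule A reports one word and rule B reports none, which matches the discrepancy described above.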