Exploring Parabix

icgrep Project Structure

Here's a summary of the largest folders in the project.

Size       Folder      Description
40 KB      cc          character code compiler
92 KB      editd       counts matches (seems to be broken)
112 KB     toolchain   helps connect tools to each other
212 KB     IR_Gen      LLVM IR helper code
432 KB     re          RE compiler and parser
640 KB     kernels     stream abstractions
868 KB     pablo       Pablo; mostly optimizations and transformers for Pablo
3472 KB    UCD         mostly Unicode data
5356 KB    combine     more Unicode data

Here is the overall control flow of icgrep.

[Figure: icgrep control-flow diagram; surviving labels: icgrep.cpp, grepCodeGen, icgrep]

Here's an example of the Pablo that icgrep generates. The left column comes from the regular expression "a", the right column from "c". Note that the two outputs share much more Pablo than is shown here.

	diff <(./icgrep-build/icgrep 'a' -ShowPablo 2>&1) <(./icgrep-build/icgrep 'c' -ShowPablo 2>&1) -y
Initial Pablo AST:                                              Initial Pablo AST:
basis[0] = Extract basis, 0                                     basis[0] = Extract basis, 0
basis[1] = Extract basis, 1                                     basis[1] = Extract basis, 1
basis[2] = Extract basis, 2                                     basis[2] = Extract basis, 2
basis[3] = Extract basis, 3                                     basis[3] = Extract basis, 3
basis[4] = Extract basis, 4                                     basis[4] = Extract basis, 4
basis[5] = Extract basis, 5                                     basis[5] = Extract basis, 5
basis[6] = Extract basis, 6                                     basis[6] = Extract basis, 6
basis[7] = Extract basis, 7                                     basis[7] = Extract basis, 7
linebreak[0] = Extract linebreak, 0                             linebreak[0] = Extract linebreak, 0
cr+lf[0] = Extract cr+lf, 0                                     cr+lf[0] = Extract cr+lf, 0
required[0] = Extract required, 0                               required[0] = Extract required, 0
required[1] = Extract required, 1                               required[1] = Extract required, 1
required[2] = Extract required, 2                               required[2] = Extract required, 2
not = (~basis[6])                                             | not = (~basis[5])
not_1 = (~basis[5])                                           | not_1 = (~basis[4])
not_2 = (~basis[4])                                           | not_2 = (~basis[3])
not_3 = (~basis[3])                                           | not_3 = (~basis[0])
not_4 = (~basis[0])                                           | and = (basis[6] & basis[7])
and = (basis[7] & not)                                        <
or = (basis[4] | basis[5])                                      or = (basis[4] | basis[5])
not_5 = (~or)                                                 | not_4 = (~or)
and_1 = (basis[2] & not_3)                                    | and_1 = (basis[2] & not_2)
and_2 = (basis[1] & not_4)                                    | and_2 = (basis[1] & not_3)
and_3 = (and & not_5)                                         | and_3 = (and & not_4)
and_4 = (and_1 & and_2)                                         and_4 = (and_1 & and_2)
CC_61 = (and_3 & and_4)                                       | CC_63 = (and_3 & and_4)
ipp = pablo.Advance(CC_61, 1)                                 | ipp = pablo.Advance(CC_63, 1)
and_6 = (required[0] & ipp)                                     and_6 = (required[0] & ipp)
fpp = pablo.ScanThru(and_6, required[1])                        fpp = pablo.ScanThru(and_6, required[1])
matches[0] = Extract matches, 0                                 matches[0] = Extract matches, 0
matches[0] = fpp                                                matches[0] = fpp
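
The two programs are the same, except that "a" uses "~basis[6]" where "c" uses "basis[6]". This makes sense: "a" is 0x61 and "c" is 0x63 (matching the names CC_61 and CC_63), so the two letters differ by exactly one bit. Since basis[0] is the most significant bit here, that bit is basis[6]. The remaining differences are just renamed temporaries.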
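
To convince myself of the one-bit difference, here is a small Python sketch that mimics the character-class part of the Pablo above. It is only an emulation under my own assumptions: Python integers stand in for bit streams (bit p of the integer is input position p), and the helper names basis_streams, cc_a, cc_c and positions are made up for this example and are not part of icgrep.

```python
def basis_streams(data: bytes) -> list[int]:
    """Transpose bytes into 8 bit streams.

    Stream i holds bit i of every byte, where bit 0 is the most
    significant bit (the numbering the Pablo dump appears to use).
    """
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for i in range(8):
            if (byte >> (7 - i)) & 1:
                streams[i] |= 1 << pos
    return streams


def cc_a(b: list[int], mask: int) -> int:
    """Mirror of the left column: matches 0x61 ('a')."""
    and_ = b[7] & (~b[6] & mask)      # and   = basis[7] & ~basis[6]
    not_5 = ~(b[4] | b[5]) & mask     # not_5 = ~(basis[4] | basis[5])
    and_1 = b[2] & (~b[3] & mask)     # and_1 = basis[2] & ~basis[3]
    and_2 = b[1] & (~b[0] & mask)     # and_2 = basis[1] & ~basis[0]
    return (and_ & not_5) & (and_1 & and_2)


def cc_c(b: list[int], mask: int) -> int:
    """Mirror of the right column: matches 0x63 ('c')."""
    and_ = b[6] & b[7]                # the only real difference
    not_4 = ~(b[4] | b[5]) & mask
    and_1 = b[2] & (~b[3] & mask)
    and_2 = b[1] & (~b[0] & mask)
    return (and_ & not_4) & (and_1 & and_2)


def positions(stream: int) -> list[int]:
    """List the input positions whose bit is set in a stream."""
    return [p for p in range(stream.bit_length()) if (stream >> p) & 1]


data = b"abcabc"
mask = (1 << len(data)) - 1           # one bit per input byte
basis = basis_streams(data)

print(positions(cc_a(basis, mask)))   # positions of 'a'
print(positions(cc_c(basis, mask)))   # positions of 'c'
print(hex(ord("a") ^ ord("c")))       # the single differing bit
```

The sketch stops at the character class; the Advance and ScanThru steps that follow in the real output are not modelled.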

Null byte as word

I noticed that a file containing a single null byte (echo -en '\0' > /tmp/f) is counted as containing one word by the icgrep 'wc', but zero words by Linux 'wc'.
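
One possible explanation, which I have not checked against either source tree, is that the two tools disagree on what may start a word. The Python sketch below contrasts two hypothetical rules: "any run of non-whitespace bytes is a word" versus "only printable non-whitespace bytes can begin a word". The function names and the exact whitespace and printable sets are my assumptions for illustration, not the actual logic of either tool.

```python
WHITESPACE = set(b" \t\n\r\v\f")


def count_words_nonspace(data: bytes) -> int:
    """Rule A (assumed): a word is any maximal run of non-whitespace bytes."""
    count = 0
    in_word = False
    for byte in data:
        if byte in WHITESPACE:
            in_word = False
        elif not in_word:
            in_word = True
            count += 1
    return count


def count_words_printable(data: bytes) -> int:
    """Rule B (assumed): only printable ASCII bytes (0x21-0x7E) can start a word.

    Non-printable bytes are ignored: they neither start nor end a word.
    """
    count = 0
    in_word = False
    for byte in data:
        if byte in WHITESPACE:
            in_word = False
        elif 0x21 <= byte <= 0x7E and not in_word:
            in_word = True
            count += 1
    return count


null_file = b"\0"
print(count_words_nonspace(null_file))    # rule A on the null-byte file
print(count_words_printable(null_file))   # rule B on the null-byte file
```

Under these assumptions rule A reproduces the one-word result and rule B the zero-word result, while both rules agree on ordinary text such as b"hello world\n".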