Here's a summary of the largest folders in the project.

| Size | Folder | Description |
| --- | --- | --- |
| 40 KB | cc | character code compiler |
| 92 KB | editd | counts matches (seems to be broken) |
| 112 KB | toolchain | helps connect tools to each other |
| 212 KB | IR_Gen | LLVM IR helper code |
| 432 KB | re | RE compiler and parser |
| 640 KB | kernels | stream abstractions |
| 868 KB | pablo | Pablo; mostly optimizations and transformers for Pablo |
| 3472 KB | UCD | mostly Unicode data |
| 5356 KB | combine | more Unicode data |

Here is the overall control flow of icgrep.
icgrep.cpp
Here's an example of some Pablo generated by icgrep. The left column was generated from the regular expression "a" and the right column from "c". Note that the two outputs share much more Pablo than is shown here.

diff <(./icgrep-build/icgrep 'a' -ShowPablo 2>&1) <(./icgrep-build/icgrep 'c' -ShowPablo 2>&1) -y

The two outputs are the same, except that "a" uses "~basis[6]" where "c" uses "basis[6]". This makes sense, as the two letters differ by one bit ('a' is 0x61 and 'c' is 0x63). The remaining differences are only renamed temporaries, such as not_5 versus not_4 and CC_61 versus CC_63.

Initial Pablo AST:                          Initial Pablo AST:
basis[0] = Extract basis, 0                 basis[0] = Extract basis, 0
basis[1] = Extract basis, 1                 basis[1] = Extract basis, 1
basis[2] = Extract basis, 2                 basis[2] = Extract basis, 2
basis[3] = Extract basis, 3                 basis[3] = Extract basis, 3
basis[4] = Extract basis, 4                 basis[4] = Extract basis, 4
basis[5] = Extract basis, 5                 basis[5] = Extract basis, 5
basis[6] = Extract basis, 6                 basis[6] = Extract basis, 6
basis[7] = Extract basis, 7                 basis[7] = Extract basis, 7
linebreak[0] = Extract linebreak, 0         linebreak[0] = Extract linebreak, 0
cr+lf[0] = Extract cr+lf, 0                 cr+lf[0] = Extract cr+lf, 0
required[0] = Extract required, 0           required[0] = Extract required, 0
required[1] = Extract required, 1           required[1] = Extract required, 1
required[2] = Extract required, 2           required[2] = Extract required, 2
not = (~basis[6])                         | not = (~basis[5])
not_1 = (~basis[5])                       | not_1 = (~basis[4])
not_2 = (~basis[4])                       | not_2 = (~basis[3])
not_3 = (~basis[3])                       | not_3 = (~basis[0])
not_4 = (~basis[0])                       | and = (basis[6] & basis[7])
and = (basis[7] & not)                    <
or = (basis[4] | basis[5])                  or = (basis[4] | basis[5])
not_5 = (~or)                             | not_4 = (~or)
and_1 = (basis[2] & not_3)                | and_1 = (basis[2] & not_2)
and_2 = (basis[1] & not_4)                | and_2 = (basis[1] & not_3)
and_3 = (and & not_5)                     | and_3 = (and & not_4)
and_4 = (and_1 & and_2)                     and_4 = (and_1 & and_2)
CC_61 = (and_3 & and_4)                   | CC_63 = (and_3 & and_4)
ipp = pablo.Advance(CC_61, 1)             | ipp = pablo.Advance(CC_63, 1)
and_6 = (required[0] & ipp)                 and_6 = (required[0] & ipp)
fpp = pablo.ScanThru(and_6, required[1])    fpp = pablo.ScanThru(and_6, required[1])
matches[0] = Extract matches, 0             matches[0] = Extract matches, 0
matches[0] = fpp                            matches[0] = fpp
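
The character-class part of the listing above can be checked by hand. The Python sketch below replays the CC_61 and CC_63 equations on a small input, using Python integers as bit streams. It is only an illustration, not how icgrep evaluates Pablo. The helper names (basis_bits, cc_61, cc_63, positions) are made up for this example, and it assumes that basis[0] is the most significant bit of each byte, which is what the listing suggests. The "required" streams, Advance, and ScanThru are left out.

```python
def basis_bits(data):
    """Transpose bytes into 8 bit streams, stored as Python ints.

    Assumption: basis[0] is the most significant bit of each byte and
    basis[7] the least significant. Stream position i is kept in bit i
    of each int.
    """
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                basis[k] |= 1 << pos
    return basis


def cc_61(basis, length):
    """Replay the left-hand column: marks positions holding 'a' (0x61).

    not_1 and not_2 are never used in the part of the listing shown,
    so they are omitted here.
    """
    mask = (1 << length) - 1  # keep complements within the stream length
    not_ = ~basis[6] & mask
    not_3 = ~basis[3] & mask
    not_4 = ~basis[0] & mask
    and_ = basis[7] & not_
    or_ = basis[4] | basis[5]
    not_5 = ~or_ & mask
    and_1 = basis[2] & not_3
    and_2 = basis[1] & not_4
    and_3 = and_ & not_5
    and_4 = and_1 & and_2
    return and_3 & and_4


def cc_63(basis, length):
    """Replay the right-hand column: marks positions holding 'c' (0x63)."""
    mask = (1 << length) - 1
    not_2 = ~basis[3] & mask
    not_3 = ~basis[0] & mask
    and_ = basis[6] & basis[7]  # the only real difference from cc_61
    or_ = basis[4] | basis[5]
    not_4 = ~or_ & mask
    and_1 = basis[2] & not_2
    and_2 = basis[1] & not_3
    and_3 = and_ & not_4
    and_4 = and_1 & and_2
    return and_3 & and_4


def positions(stream):
    """List the stream positions whose bit is set."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


text = b"abcabc"
bits = basis_bits(text)
print(positions(cc_61(bits, len(text))))  # [0, 3]
print(positions(cc_63(bits, len(text))))  # [2, 5]
```

The two functions differ only in how "and" is computed, which mirrors the single-bit difference between the two letters.
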
I noticed that a file containing a single null byte (echo -en '\0' > /tmp/f) is counted as one word by the icgrep 'wc', but as 0 words by Linux 'wc'.
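
One possible explanation for this discrepancy is that the two tools define a "word" differently. This is a guess and has not been checked against either code base. The Python sketch below contrasts two hypothetical definitions: one treats any run of non-whitespace bytes as a word, the other only starts a word at a printable ASCII character and ignores other bytes. The function names and both rules are assumptions made for illustration.

```python
WHITESPACE = b" \t\n\r\v\f"


def count_words_nonspace(data):
    """Hypothetical rule A: a word is any maximal run of non-whitespace bytes."""
    count = 0
    in_word = False
    for b in data:
        if b in WHITESPACE:
            in_word = False
        elif not in_word:
            in_word = True
            count += 1
    return count


def count_words_printable(data):
    """Hypothetical rule B: only printable ASCII bytes can start or continue
    a word; whitespace ends a word; any other byte is ignored."""
    count = 0
    in_word = False
    for b in data:
        if b in WHITESPACE:
            in_word = False
        elif 0x21 <= b <= 0x7E:
            if not in_word:
                in_word = True
                count += 1
    return count


print(count_words_nonspace(b"\0"))    # 1
print(count_words_printable(b"\0"))   # 0
print(count_words_nonspace(b"hello world\n"))   # 2
print(count_words_printable(b"hello world\n"))  # 2
```

Under these assumed rules the two counts agree on ordinary text and differ only on input such as a lone null byte, which is consistent with the observation above.
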