The bisect_set module

Module for dealing with sets of integers, or in particular, sets of characters from potentially large character sets such as Unicode. We can store and operate on these efficiently under the assumption that sets will contain mostly ranges of consecutive character codes.

We encode character classes such as [0-9A-Z] as lists of breakpoints e.g. [0x30, 0x3a, 0x41, 0x5b], which says that “included” characters begin at 0x30 and 0x41 (the even breakpoints) whereas “excluded” characters begin at 0x3a and 0x5b (the odd breakpoints).

ndcode.piyacc.bisect_set.N_CHARACTERS = 256

Defines the alphabet size, set this to 0x11000 for Unicode.

ndcode.piyacc.bisect_set.bisect_set_and(character_set0, character_set1)

Calculate intersection of the operand sets.

We do this by calculating a series of breakpoints, at each breakpoint evaluating the “and” (min) of the even/odd truth values of each operand, then making the output truth value even/odd by outputting if necessary.

ndcode.piyacc.bisect_set.bisect_set_not(character_set)

Calculate complement of the operand set.

  • If operand set begins with [0], remove it, otherwise add [0] prefix.

  • If operand set ends with [N_CHARACTERS], remove it, otherwise add [N_CHARACTERS] suffix.

The suffix part is not totally necessary, but makes sure length is even (the evenness is so that single character sets can always be [c, c + 1]).

ndcode.piyacc.bisect_set.bisect_set_or(character_set0, character_set1)

Calculate union of the operand sets.

We do this by calculating a series of breakpoints, at each breakpoint evaluating the “or” (max) of the even/odd truth values of each operand, then making the output truth value even/odd by outputting if necessary.