lex2

lex2 is a library for lexical analysis (also called tokenization). String analysis is performed using regular expressions (regex) specified in user-defined rules. Mechanisms such as a dynamic ruleset-stack provide a degree of flexibility at runtime.
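
A minimal end-to-end sketch of this workflow is shown below. It assumes the two-step make_lexer() call implied by the separate template-parameter and parameter lists documented further down, that string input is fed in through a load() method from the ITextIO interface, and that the EOF exception lives in lex2.excs:

    import lex2

    # User-defined rules: a token identifier paired with a regex pattern.
    ruleset = [
        lex2.Rule("WORD",   r"[a-zA-Z]+"),
        lex2.Rule("NUMBER", r"[0-9]+"),
    ]

    # make_lexer() takes the template parameters and returns a factory
    # that accepts the ruleset (and optionally a LexerOptions struct).
    lexer = lex2.make_lexer()(ruleset)
    lexer.load("Mary had 2 little lambs")  # load(): assumed ITextIO method

    while True:
        try:
            token = lexer.get_next_token()
        except lex2.excs.EOF:  # assumed location of the EOF exception
            break
        print(token.id, repr(token.data))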

Modules

lex2.excs

Components of exceptions.

lex2.lexer

Components of lexer implementations.

lex2.matcher

Components of matcher implementations.

lex2.predefs

Predefined rule objects and rule template classes.

lex2.textio

Components of textstreams.

lex2.util

Common library components and utilities.

Classes

DEFAULT_LEXER

alias of GenericLexer

DEFAULT_MATCHER

alias of ReMatcher

ILexer

Common interface to a lexer object instance.

IMatcher

Common interface to a rule matcher object instance.

LexerOptions

Struct to define processing options of a lexer.

Rule

Class representing a rule, used as a filter during lexical analysis.

RuleGroup

Abstract base class for creator classes that dynamically build up a group of rules.

Token

Represents a token that is output during lexical analysis.

Functions

make_lexer([MATCHER_T, LEXER_T])

Factory function for creating a lexer instance.

class lex2.Rule

Class representing a rule, used as a filter during lexical analysis.

__init__(id, regex, returns=True)

Rule object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • regex (str) – The regular expression used by a matcher to perform regex matching.

  • returns (bool, optional) –

    Specify whether tokens matched by this rule should be returned when scanning for tokens.

    By default True

Raises

ValueError – If the given regular expression is empty.

id: str

<readonly> Rule identifier string.

returns: bool

Whether tokens matched by this rule should be returned when scanning for tokens.

regex: str

<readonly> The regular expression used by a matcher to perform regex matching.

get_matcher()

Gets the IMatcher-compatible object instance.

The rule matcher object is used by a lexer object to identify tokens during lexical analysis.

Return type

PtrType[IMatcher]

set_matcher(matcher)

Sets the rule matcher object reference.

Parameters

matcher (IMatcher) –
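
As an illustration of the members above, a short sketch constructing rules, including a non-returning rule and the documented ValueError on an empty pattern:

    import lex2

    # A regular rule: matched text is returned as "WORD" tokens.
    word = lex2.Rule("WORD", r"[a-zA-Z]+")

    # A non-returning rule: comments are matched and consumed, but no
    # token is handed back when scanning (returns=False).
    comment = lex2.Rule("COMMENT", r"#[^\n]*", returns=False)

    print(word.id, word.regex)  # read-only attributes

    # An empty regular expression is rejected at construction time.
    try:
        lex2.Rule("EMPTY", "")
    except ValueError as exc:
        print("rejected:", exc)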

class lex2.RuleGroup

Bases: ABC

Abstract base class for creator classes that dynamically build up a group of rules.

To condense the rule group into a single Rule object, use the inherited .rule() method.

abstract __init__(id, returns=True, regex_prefix='', regex_suffix='')

RuleGroup object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • returns (bool, optional) –

Specify whether tokens matched by this rule group should be returned by default when scanning for tokens.

    By default True

  • regex_prefix (str, optional) –

Regular expression prefixed to every added regex pattern group.

    By default ""

  • regex_suffix (str, optional) –

Regular expression suffixed to every added regex pattern group.

    By default ""

rule(id=None, returns=None)

Compiles the rule group into a single Rule object.

Parameters
  • id (str, optional) –

Overrides the predefined identifying name of a resulting token.

    By default the id set by the parent class.

  • returns (bool, optional) –

Overrides whether tokens matched by this rule group should be returned by default when scanning for tokens.

    By default the returns set by the parent class.

Return type

Rule

Raises

ValueError – If the given regular expression is empty.
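
This reference does not name the method a subclass uses to append pattern groups, so the sketch below leaves that step as a placeholder and exercises only the documented __init__() and rule() members; with no patterns added, the compiled regex is empty and rule() raises the documented ValueError:

    import lex2

    class NumberGroup(lex2.RuleGroup):
        """Creator class that dynamically builds up a "NUMBER" rule."""

        def __init__(self):
            # Documented initializer: id, returns, regex_prefix, regex_suffix.
            super().__init__("NUMBER")
            # NOTE: hypothetical step -- pattern groups would be appended
            # here; this reference does not document the method name.

    try:
        # Condense the group into a single Rule; per-call overrides of
        # id and returns are documented: .rule(id=..., returns=...).
        rule = NumberGroup().rule()
    except ValueError:
        # No pattern groups were added, so the compiled regex is empty.
        print("empty rule group")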

class lex2.Token

Represents a token that is output during lexical analysis.

__init__(id='', data='', pos=TextPosition(), groups=())

Token object instance initializer.

Parameters
  • id (str, optional) –

    The identifying string of the resulting token’s type (e.g. “NUMBER”, “WORD”).

    By default ""

  • data (str, optional) –

    String data of the identified token.

    By default ""

  • pos (TextPosition, optional) –

    Position in the textstream where the token occurs.

    By default TextPosition()

  • groups (Iterable[str], optional) –

Result of the regex match, split by capture groups.

    By default ()

id: str

The identifier of a token’s type (e.g. “NUMBER”, “WORD”).

data: str

Result of regex match.

pos: TextPosition

Position in the textstream where a token occurs.

groups: Sequence[str]

Result of the regex match, split by capture groups.

is_rule(expected_rule)

Evaluates whether the token’s identifier matches that of a given rule.

Parameters

expected_rule (Rule) – Rule object instance.

Return type

bool

is_rule_oneof(expected_rules)

Evaluates whether the token’s identifier matches that of one of the given rules.

Parameters

expected_rules (List[Rule]) – List of Rule object instances.

Return type

bool

validate_rule(expected_rule)

Validates that the token’s identifier matches that of a given rule.

Parameters

expected_rule (Rule) – Rule object instance.

Raises

UnknownTokenError – When the token’s identifier does not match that of a given rule.

validate_rule_oneof(expected_rules)

Validates that the token’s identifier matches that of one of the given rules.

Parameters

expected_rules (List[Rule]) – List of Rule object instances.

Raises

UnknownTokenError – When the token’s identifier does not match that of any of the given rules.
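
A sketch of both the boolean checks and the throwing validators; UnknownTokenError is assumed to be exposed by lex2.excs:

    import lex2

    word   = lex2.Rule("WORD",   r"[a-zA-Z]+")
    number = lex2.Rule("NUMBER", r"[0-9]+")

    token = lex2.Token(id="WORD", data="lamb")

    # Boolean checks:
    print(token.is_rule(word))            # True
    print(token.is_rule_oneof([number]))  # False

    # Throwing validators:
    token.validate_rule(word)             # passes silently
    try:
        token.validate_rule_oneof([number])
    except lex2.excs.UnknownTokenError:   # assumed location of the exception
        print("token is not a NUMBER")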

class lex2.LexerOptions

Struct to define processing options of a lexer.

class SeparatorOptions

Struct that defines processing options of a separator token.

__init__()

ignored: bool

Flag to specify whether processing of tokens of this separator should be ignored. Defaults to False.

returns: bool

Flag to specify whether tokens of this separator should be returned. Defaults to False.

__init__()

space: SeparatorOptions

Options to specify how a SPACE separator should be handled.

tab: SeparatorOptions

Options to specify how a TAB separator should be handled.

newline: SeparatorOptions

Options to specify how a NEWLINE separator should be handled.

id_returns: Dict[str, bool]

Map of <str, bool> key-value pairs specifying whether tokens matched by a rule (keyed by Rule.id) should be returned to the user.
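
A short sketch configuring the struct's documented fields:

    import lex2

    options = lex2.LexerOptions()

    # Separator handling: emit NEWLINE tokens, drop spaces and tabs.
    options.newline.returns = True
    options.space.returns = False
    options.tab.returns = False

    # Per-rule override keyed by Rule.id: never return COMMENT tokens.
    options.id_returns["COMMENT"] = False

    # The struct is then passed when constructing the lexer, e.g.:
    # lexer = lex2.make_lexer()(ruleset, options)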

class lex2.ILexer

Bases: ITextIO, ABC

Common interface to a lexer object instance.

abstract push_ruleset(ruleset)

Pushes a ruleset to the ruleset-stack.

Parameters

ruleset (RulesetType) –

abstract pop_ruleset()

Pops a ruleset from the ruleset-stack.

abstract clear_rulesets()

Clears all rulesets from the ruleset-stack.

abstract get_next_token()

Finds the next token in the textstream using the currently active ruleset.

Return type

Token

Raises
  • UnknownTokenError – If an unknown token type has been encountered.

  • EOF – If the lexer has reached the end of input data from a textstream.

abstract get_options()

Gets the options that define the processing behavior of the lexer.

Return type

LexerOptions

abstract set_options(options)

Sets the options that define the processing behavior of the lexer.

Parameters

options (LexerOptions) –
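
The ruleset-stack allows switching token definitions mid-stream, for example on entering and leaving a comment context; a sketch using only the documented stack operations:

    import lex2

    main_rules = [
        lex2.Rule("WORD",          r"[a-zA-Z]+"),
        lex2.Rule("COMMENT_OPEN",  r"/\*"),
    ]
    comment_rules = [
        lex2.Rule("COMMENT_CLOSE", r"\*/"),
        lex2.Rule("COMMENT_TEXT",  r"[^*]+", returns=False),
    ]

    lexer = lex2.make_lexer()(main_rules)

    # On a COMMENT_OPEN token, make the comment ruleset the active one:
    lexer.push_ruleset(comment_rules)
    # ... get_next_token() now matches against comment_rules ...

    # On a COMMENT_CLOSE token, restore the previous ruleset:
    lexer.pop_ruleset()

    # Or drop all rulesets, e.g. before reusing the lexer instance:
    lexer.clear_rulesets()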

class lex2.IMatcher

Bases: ABC

Common interface to a rule matcher object instance.

abstract get_uid()

Gets the unique identifier (UID) of the matcher implementation.

Return type

str

abstract compile_pattern(regex)

Compiles a regex pattern into an implementation-specific regex matcher object.

Parameters

regex (str) – Regular expression to compile.

abstract match(ts, token)

Looks for a pattern match and sets it in the provided token object.

Parameters
  • ts (ITextstream) – Textstream object managed by the lexer object.

  • token (Token) – Token object in which the match data is set.

Returns

True in case of a match.

Return type

bool

lex2.DEFAULT_LEXER

alias of GenericLexer

lex2.DEFAULT_MATCHER

alias of ReMatcher

lex2.make_lexer(MATCHER_T=ReMatcher, LEXER_T=GenericLexer)

Factory function for creating a lexer instance.

If no values are provided for the template parameters, the implementations used for the matcher and lexer will default to the library constants DEFAULT_MATCHER and DEFAULT_LEXER respectively.

Template Parameters
  • MATCHER_T (Type[BaseMatcher], optional) –

    Template class type that implements the BaseMatcher base class.

    By default DEFAULT_MATCHER

  • LEXER_T (Type[BaseLexer], optional) –

    Template class type that implements the BaseLexer base class.

    By default DEFAULT_LEXER

Parameters
  • ruleset (RulesetType, optional) –

    Initial ruleset.

    By default []

  • options (LexerOptions, optional) –

    Struct specifying processing options of the lexer.

    By default LexerOptions()

Return type

ILexer
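
An explicit-parameters sketch; the public import paths for ReMatcher and GenericLexer are assumptions (the signature above shows their private module locations):

    import lex2
    from lex2.lexer import GenericLexer  # assumed public re-export
    from lex2.matcher import ReMatcher   # assumed public re-export

    ruleset = [lex2.Rule("WORD", r"[a-zA-Z]+")]
    options = lex2.LexerOptions()

    # Explicit template parameters; equivalent to lex2.make_lexer()(...)
    # since ReMatcher and GenericLexer are the DEFAULT_MATCHER and
    # DEFAULT_LEXER aliases documented above.
    lexer = lex2.make_lexer(MATCHER_T=ReMatcher, LEXER_T=GenericLexer)(ruleset, options)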