Defining Rulesets

The purpose of a lexer is to break a stream of text characters into a sequence of tokens (strings with an assigned and thus identified meaning). Before tokens can be recognized, of course, it must first be clear which tokens are possible and how they can be recognized as such. This is done by defining a set of rules, each specifying a regex pattern that describes how to recognize a particular token and the identifying name that token should be assigned.

Note

If you are not familiar with regular expressions (regex), the quickstart guide on regular-expressions.info provides a good starting point.

Rules

Defining rules in lex2 is done by creating Rule object instances. In addition to an identifying name and a regex pattern, each rule can be configured to either have its matched tokens returned to the user by the lexer (on by default) or have them processed internally without being returned.

A good example of when tokens should not be returned is comments in programming languages. Comments are typically ignored and should therefore not be returned to the parser. However, a rule describing how to recognize a comment must still be defined so that the lexer can skip it.

Instantiating a Rule object
#                      identifier  regex  returns? (optional)
#                       ┌───┴───┐ ┌──┴──┐  ┌─┴─┐
number_rule = lex2.Rule("INTEGER", r"\d+", True)

# You can inspect the identifier and regex pattern of the object, and change
# the return behaviour property value
number_rule.id
>>> 'INTEGER'
number_rule.regex
>>> '\\d+'
number_rule.returns
>>> True
number_rule.returns = False
number_rule.returns
>>> False
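
Returning to the comments example above: a rule that should be matched but not returned only differs in its returns flag. The sketch below assumes C-style "//" line comments purely for illustration; lex2 itself does not prescribe any comment syntax.

Defining a comment rule that is not returned
# A minimal sketch; the "//" line-comment pattern is an assumption made
# for illustration only.
comment_rule = lex2.Rule("COMMENT", r"//[^\n]*", returns=False)

comment_rule.returns
>>> False
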
class lex2.Rule

Class representing a rule, used as a filter during lexical analysis.

__init__(id, regex, returns=True)

Rule object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • regex (str) – The regular expression used by a matcher to perform regex matching.

  • returns (bool, optional) –

    Specify whether tokens matched by this rule should be returned when scanning for tokens.

    By default True

Raises

ValueError – If the given regular expression is empty.

id: str

<readonly> Rule identifier string.

returns: bool

Whether tokens matched by this rule should be returned when scanning for tokens.

regex: str

<readonly> The regular expression used by a matcher to perform regex matching.

get_matcher()

Gets the IMatcher-compatible object instance.

The rule matcher object is used by a lexer object to identify tokens during lexical analysis.

Return type

PtrType[IMatcher]

set_matcher(matcher)

Sets the rule matcher object reference.

Parameters

matcher (IMatcher) – The matcher object instance to reference.
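
As documented above, the Rule initializer raises a ValueError when given an empty regular expression. The sketch below relies only on that documented behaviour; the exact error message is not specified here.

An empty regular expression is rejected
try:
    lex2.Rule("EMPTY", "")
except ValueError:
    print("Empty regex patterns are not accepted")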

Rule Groups

If you find yourself writing boilerplate code to include or format parts of a regular expression, or simply would like an abstraction layer, you can opt to use the RuleGroup base class. An example is given in the code block below.

Creating a regular expression for specific e-mail domain addresses
class AllowedEmail (lex2.RuleGroup):
    def __init__(self):
        super().__init__(
            "EMAIL",
            regex_prefix=r"(?i)\A[a-z\d_\-.]+"
        )

#   Define a public method to add a regex group. It has to
#   call the inherited protected method '_add_regex_group()'
#   ┌──────────────────────┴────────────────────┐
    def add_provider(self, domain: str, tld: str):
        self._add_regex_group(fr'@{domain}\.{tld}')
        return self

allowed_email = (AllowedEmail()
                    .add_provider("gmail",   "com")
                    .add_provider("hotmail", "com")
                    .rule())
#                   └──┬──┘
#   Compiles your group into a singular Rule object

allowed_email.id
>>> 'EMAIL'
allowed_email.regex
>>> '(?i)\\A[a-z\\d_\\-.]+((@gmail\\.com)|(@hotmail\\.com))'

Warning

By convention, your custom public methods should return the object instance itself (self) so that method chaining is possible.

class lex2.RuleGroup

Bases: ABC

Abstract base class for making a creator class to dynamically build up a group of rules.

To condense down the rule group into a single Rule object, use the inherited .rule() method.

abstract __init__(id, returns=True, regex_prefix='', regex_suffix='')

RuleGroup object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • returns (bool, optional) –

    Specify whether, by default, tokens matched by this rule group should be returned when scanning for tokens.

    By default True

  • regex_prefix (str, optional) –

    Regular expression that is prefixed to every added regex pattern group.

    By default ""

  • regex_suffix (str, optional) –

    Regular expression that is suffixed to every added regex pattern group.

    By default ""

rule(id=None, returns=None)

Compiles the rule group into a single Rule object.

Parameters
  • id (str, optional) –

    Overwrites the predefined identifying name of a resulting token.

    By default the id set by the parent class.

  • returns (bool, optional) –

    Overwrites whether tokens matched by this rule group should be returned when scanning for tokens.

    By default the returns set by the parent class.

Return type

Rule

Raises

ValueError – If the given regular expression is empty.
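
As an illustration of the regex_prefix and regex_suffix parameters, and of the overrides accepted by rule(), the sketch below follows the same pattern as the AllowedEmail example. The file-extension use case, its patterns, and the chosen overrides are assumptions made purely for illustration.

Building a rule group with a prefix and a suffix
class AllowedFile (lex2.RuleGroup):
    def __init__(self):
        super().__init__(
            "FILE",
            regex_prefix=r"[\w\-]+\.",  # prefixed to the added groups
            regex_suffix=r"\b"          # suffixed to the added groups
        )

    def add_extension(self, ext: str):
        self._add_regex_group(ext)
        return self

# rule() can overwrite the identifier and return behaviour set in __init__
config_file = (AllowedFile()
                  .add_extension("ini")
                  .add_extension("toml")
                  .rule(id="CONFIG_FILE", returns=False))

config_file.id
>>> 'CONFIG_FILE'
config_file.returns
>>> False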

Ruleset

Finally, rulesets can be defined as standard lists populated with Rule object instances. It is recommended to type-hint the list variable (if stored) with the RulesetType type alias.

Defining a ruleset
ruleset: lex2.RulesetType = [
    lex2.Rule("WORD",        r"[a-zA-Z]+"),
    lex2.Rule("NUMBER",      r"[0-9]+"),
    lex2.Rule("PUNCTUATION", r"[.,:;!?\\-]")
]

Note

If you intend to use multiple rulesets that reuse previously defined rules, it is better to store the Rule instances in separate variables first and reference those variables in the rulesets.
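
For example, a sketch of two rulesets sharing the same Rule instances (the rule names and patterns are only illustrative):

Reusing Rule instances across rulesets
word_rule   = lex2.Rule("WORD",   r"[a-zA-Z]+")
number_rule = lex2.Rule("NUMBER", r"[0-9]+")
string_rule = lex2.Rule("STRING", r'"[^"]*"')

# Both rulesets reference the same Rule instances, so changing e.g.
# word_rule.returns affects every ruleset that contains that rule.
prose_ruleset: lex2.RulesetType = [word_rule, number_rule]
code_ruleset:  lex2.RulesetType = [word_rule, number_rule, string_rule]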