pregex.core.classes

All classes within this module represent so-called RegEx character classes, which define a set or “class” of characters that can be matched.

Class types

A character class can be either one of the following two types:

  1. Regular class: This type of class represents the […] pattern, which can be translated as “match any single character defined within the brackets”. You can tell regular classes by their name, which follows the Any* pattern.

  2. Negated class: This type of class represents the [^…] pattern, which can be translated as “match any single character except for those defined within the brackets”. You can tell negated classes by their name, which follows the AnyBut* pattern.

Here is an example containing a regular class as well as its negated counterpart.

from pregex.core.classes import AnyLetter, AnyButLetter

regular = AnyLetter()
negated = AnyButLetter()

regular.print_pattern() # This prints "[A-Za-z]"
negated.print_pattern() # This prints "[^A-Za-z]"
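
To see what the two printed patterns mean in practice, the sketch below feeds them to Python's standard re module rather than to pregex itself. The sample string is made up for illustration and does not come from the library.

```python
import re

# Patterns copied from the print_pattern() comments above.
regular = re.compile(r"[A-Za-z]")
negated = re.compile(r"[^A-Za-z]")

sample = "a1-Z"  # illustrative input only

letters = regular.findall(sample)  # characters inside the class
others = negated.findall(sample)   # characters outside the class

print(letters)  # ['a', 'Z']
print(others)   # ['1', '-']
```

Every character of the sample lands in exactly one of the two lists, which is what "negated counterpart" amounts to.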

Class unions

Classes of the same type can be combined in order to get the union of the sets of characters they represent. This is done through the use of the bitwise OR operator |, as shown in the code snippet below:

from pregex.core.classes import AnyDigit, AnyLowercaseLetter

pre = AnyDigit() | AnyLowercaseLetter()
pre.print_pattern() # This prints "[\da-z]"

The same goes for negated classes as well:

from pregex.core.classes import AnyButDigit, AnyButLowercaseLetter

pre = AnyButDigit() | AnyButLowercaseLetter()
pre.print_pattern() # This prints "[^\da-z]"
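
As a rough check of what these two printed patterns match, the following sketch compares them against their operands over the ASCII range using only the standard re module. It inspects the patterns, not pregex's own implementation.

```python
import re

digit = re.compile(r"\d")
lower = re.compile(r"[a-z]")
union = re.compile(r"[\da-z]")           # printed for AnyDigit() | AnyLowercaseLetter()
negated_union = re.compile(r"[^\da-z]")  # printed for AnyButDigit() | AnyButLowercaseLetter()

ascii_chars = [chr(i) for i in range(128)]

in_union = {c for c in ascii_chars if union.fullmatch(c)}
in_either = {c for c in ascii_chars if digit.fullmatch(c) or lower.fullmatch(c)}
in_negated = {c for c in ascii_chars if negated_union.fullmatch(c)}

print(in_union == in_either)                     # True
print(in_negated == set(ascii_chars) - in_union) # True
```

Note that the negated result excludes the characters of both operands, so it matches only characters that are neither digits nor lowercase letters.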

However, attempting to get the union of a regular class and a negated class raises a CannotBeUnionedException:

from pregex.core.classes import AnyDigit, AnyButLowercaseLetter

pre = AnyDigit() | AnyButLowercaseLetter() # This is not OK!
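
The rule can be pictured with a toy model. The ToyClass type below is hypothetical and is not part of pregex; it only illustrates how an overloaded | operator might refuse operands of different types, with a plain TypeError standing in for CannotBeUnionedException.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyClass:
    """Hypothetical stand-in for a character class; not pregex code."""
    chars: frozenset
    negated: bool = False

    def __or__(self, other: "ToyClass") -> "ToyClass":
        if self.negated != other.negated:
            # pregex raises CannotBeUnionedException in this situation.
            raise TypeError("cannot union a regular class with a negated class")
        return ToyClass(self.chars | other.chars, self.negated)


digits = ToyClass(frozenset("0123456789"))
not_lower = ToyClass(frozenset("abcdefghijklmnopqrstuvwxyz"), negated=True)

try:
    digits | not_lower
    mixed_union_failed = False
except TypeError:
    mixed_union_failed = True

print(mixed_union_failed)  # True
```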

Lastly, it is also possible to union a regular class with a token, that is, any string of length one or any instance of a class that is part of the pregex.core.tokens module:

from pregex.core.classes import AnyDigit
from pregex.core.tokens import Newline

pre = AnyDigit() | "a" | Newline()

pre.print_pattern() # This prints "[\da\n]"

However, in the case of negated classes one is forced to wrap any tokens within an AnyButFrom class instance in order to achieve the same result:

from pregex.core.classes import AnyButDigit, AnyButFrom
from pregex.core.tokens import Newline

pre = AnyButDigit() | AnyButFrom("a", Newline())

pre.print_pattern() # This prints "[^\da\n]"

Subtracting classes

Subtraction is another operation that is exclusive to classes and it is made possible via the overloaded subtraction operator -. This feature comes in handy when one wishes to construct a class that would be tiresome to create otherwise. Consider for example the class of all word characters except for all characters in the set {C, c, G, g, 3}. Constructing said class via subtraction is extremely easy:

from pregex.core.classes import AnyWordChar, AnyFrom

pre = AnyWordChar() - AnyFrom('C', 'c', 'G', 'g', '3')

Below we are able to see this operation’s resulting pattern, from which it becomes evident that building said pattern through multiple class unions would be more time consuming, and more importantly, prone to errors.

[A-BD-FH-Za-bd-fh-z0-24-9_]
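
The resulting pattern can be cross-checked with plain set arithmetic. This sketch uses only the standard re module and assumes, as the pattern itself suggests, that the word characters in question are A-Z, a-z, 0-9 and the underscore.

```python
import re

ascii_chars = [chr(i) for i in range(128)]

# Word characters minus the five characters being subtracted.
word_chars = {c for c in ascii_chars if re.fullmatch(r"[A-Za-z0-9_]", c)}
expected = word_chars - set("CcGg3")

# Pattern copied from the subtraction result above.
result = re.compile(r"[A-BD-FH-Za-bd-fh-z0-24-9_]")
matched = {c for c in ascii_chars if result.fullmatch(c)}

print(matched == expected)  # True
```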

It should be noted that, just like in the case of class unions, one is only allowed to subtract a regular class from a regular class or a negated class from a negated class. Any other attempt raises a CannotBeSubtractedException.

from pregex.core.classes import AnyWordChar, AnyButLowercaseLetter

pre = AnyWordChar() - AnyButLowercaseLetter() # This is not OK!

Furthermore, applying the subtraction operation between a class and a token is also possible, but just like in the case of class unions, this only works with regular classes:

from pregex.core.classes import AnyWhitespace
from pregex.core.tokens import Newline

pre = AnyWhitespace() - Newline()

pre.print_pattern() # This prints "[\t \x0b-\r]"
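
The printed pattern can be compared against "whitespace minus newline" using the standard re module. The sketch assumes that AnyWhitespace corresponds to ASCII whitespace, which is why the re.ASCII flag is passed; without it, Python's \s also accepts further Unicode whitespace.

```python
import re

ascii_chars = [chr(i) for i in range(128)]

# ASCII whitespace: space, \t, \n, \r, \f and \v.
whitespace = {c for c in ascii_chars if re.fullmatch(r"\s", c, re.ASCII)}

# Pattern copied from the print_pattern() comment above.
result = re.compile(r"[\t \x0b-\r]")
matched = {c for c in ascii_chars if result.fullmatch(c)}

print(matched == whitespace - {"\n"})  # True
```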

Negating classes

Finally, it is useful to know that every regular class can be negated through the use of the bitwise NOT operator ~:

from pregex.core.classes import AnyDigit

pre = ~ AnyDigit()
pre.print_pattern() # This prints "[^0-9]"

Negated classes can be negated as well. However, you should probably avoid this, as it does little to make the code easier to read.

from pregex.core.classes import AnyButDigit

pre = ~ AnyButDigit()
pre.print_pattern() # This prints "[0-9]"

Therefore, in order to create a negated class one can either negate a regular Any* class or use its AnyBut* negated class equivalent. The result is entirely the same and which one you’ll use is just a matter of choice.
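
The relationship between a class and its negation can be checked on the two printed patterns with the standard re module: every character should be matched by exactly one of them. The sketch below tests this over the ASCII range only.

```python
import re

digit = re.compile(r"[0-9]")       # printed for ~AnyButDigit()
non_digit = re.compile(r"[^0-9]")  # printed for ~AnyDigit()

ascii_chars = [chr(i) for i in range(128)]

# True when each character is matched by one pattern and not the other.
exactly_one = all(
    (digit.fullmatch(c) is None) != (non_digit.fullmatch(c) is None)
    for c in ascii_chars
)

print(exactly_one)  # True
```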

Classes & methods

Below are listed all classes within pregex.core.classes along with any possible methods they may possess.

class pregex.core.classes.Any

Matches any possible character, including the newline character.

class pregex.core.classes.AnyBetween(start: str, end: str)

Matches any character within the provided range.

Parameters
  • start (str) – The first character of the range.

  • end (str) – The last character of the range.

Raises
  • InvalidArgumentTypeException – At least one of the provided characters is neither a Token class instance nor a single-character string.

  • InvalidRangeException – A non-valid range is provided.

Note

Any pair of characters start, end constitutes a valid range as long as the code point of character end is greater than the code point of character start, as defined by the Unicode Standard.
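
The rule in this note can be expressed with Python's built-in ord function. The helper is_valid_range below is hypothetical and only mirrors the stated condition; it is not the library's own validation code.

```python
def is_valid_range(start: str, end: str) -> bool:
    """Hypothetical helper: True when end has a greater code point than start."""
    if len(start) != 1 or len(end) != 1:
        raise ValueError("both arguments must be single characters")
    return ord(end) > ord(start)


print(is_valid_range("a", "z"))  # True
print(is_valid_range("z", "a"))  # False
print(is_valid_range("Z", "a"))  # True, since U+005A comes before U+0061
print(is_valid_range("a", "a"))  # False, the note requires a strictly greater code point
```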

class pregex.core.classes.AnyButBetween(start: str, end: str)

Matches any character except for those within the provided range.

Parameters
  • start (str) – The first character of the range.

  • end (str) – The last character of the range.

Raises
  • InvalidArgumentTypeException – At least one of the provided characters is neither a Token class instance nor a single-character string.

  • InvalidRangeException – A non-valid range is provided.

Note

Any pair of characters start, end constitutes a valid range as long as the code point of character end is greater than the code point of character start, as defined by the Unicode Standard.

class pregex.core.classes.AnyButCJK

Matches any character except for those defined within the CJK Unified Ideographs Unicode block.

class pregex.core.classes.AnyButCyrillicLetter

Matches any character except for characters in the Cyrillic alphabet.

class pregex.core.classes.AnyButDigit

Matches any character except for numeric characters.

class pregex.core.classes.AnyButFrom(*chars: str)

Matches any character except for the provided characters.

Parameters

*chars (str | Pregex) – One or more characters not to match from. Each character must be either a string of length one or an instance of a class defined within the pregex.core.tokens module.

Raises
  • NotEnoughArgumentsException – No arguments are provided.

  • InvalidArgumentTypeException – At least one of the provided arguments is neither a string of length one nor an instance of a class defined within pregex.core.tokens.

class pregex.core.classes.AnyButGermanLetter

Matches any character except for characters in the German alphabet.

class pregex.core.classes.AnyButGreekLetter

Matches any character except for characters in the Greek alphabet.

class pregex.core.classes.AnyButHebrewLetter

Matches any character except for those defined within the Hebrew Unicode block.

class pregex.core.classes.AnyButKoreanLetter

Matches any character except for characters in the Korean alphabet.

class pregex.core.classes.AnyButLetter

Matches any character except for characters in the Latin alphabet.

class pregex.core.classes.AnyButLowercaseLetter

Matches any character except for lowercase characters in the Latin alphabet.

class pregex.core.classes.AnyButPunctuation

Matches any character except for punctuation characters as defined within the ASCII table.

class pregex.core.classes.AnyButUppercaseLetter

Matches any character except for uppercase characters in the Latin alphabet.

class pregex.core.classes.AnyButWhitespace

Matches any character except for whitespace characters.

class pregex.core.classes.AnyButWordChar(is_global: bool = False)

Matches any character except for alphanumeric characters and the underscore character “_”.

Parameters

is_global (bool) – Indicates whether to include foreign alphabetic characters or not. Defaults to False.

Raises

GlobalWordCharSubtractionException – There is an attempt to subtract a negated character class from an instance of this class for which parameter is_global has been set to True.

class pregex.core.classes.AnyCJK

Matches any character that is defined within the CJK Unified Ideographs Unicode block.
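
For orientation, the CJK Unified Ideographs block spans code points U+4E00 through U+9FFF. The sketch below builds an equivalent range with the standard re module; it is an assumption that this mirrors the pattern pregex emits, since the exact pattern is not shown here.

```python
import re

# Range covering the CJK Unified Ideographs block (U+4E00 to U+9FFF).
cjk = re.compile(r"[\u4e00-\u9fff]")

ideograph = "\u4e2d"  # a CJK ideograph inside the block
hiragana = "\u3042"   # a Japanese kana character, outside the block

print(bool(cjk.fullmatch(ideograph)))  # True
print(bool(cjk.fullmatch(hiragana)))   # False
print(bool(cjk.fullmatch("A")))        # False
```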

class pregex.core.classes.AnyCyrillicLetter

Matches any character from the Cyrillic alphabet.

class pregex.core.classes.AnyDigit

Matches any numeric character.

class pregex.core.classes.AnyFrom(*chars: str)

Matches any one of the provided characters.

Parameters

*chars (str | Pregex) – One or more characters to match from. Each character must be either a string of length one or an instance of a class defined within the pregex.core.tokens module.

Raises
  • NotEnoughArgumentsException – No arguments are provided.

  • InvalidArgumentTypeException – At least one of the provided arguments is neither a string of length one nor an instance of a class defined within pregex.core.tokens.
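
To illustrate the argument rules above, here is a minimal sketch of a hand-written equivalent built on the standard re module. The function any_from is hypothetical, handles only single-character strings (not token instances), and uses ValueError and TypeError in place of the pregex exceptions.

```python
import re


def any_from(*chars: str) -> str:
    """Hypothetical stand-in for AnyFrom that returns a raw pattern string."""
    if not chars:
        # pregex reports this case with its own "not enough arguments" exception.
        raise ValueError("at least one character is required")
    for c in chars:
        if not isinstance(c, str) or len(c) != 1:
            # pregex reports this case with InvalidArgumentTypeException.
            raise TypeError("each argument must be a string of length one")
    # Escape every character so that '-', ']' and '^' are taken literally.
    return "[" + "".join(re.escape(c) for c in chars) + "]"


pattern = any_from("a", "-", "]")
found = re.findall(pattern, "a-b]c")

print(pattern)  # [a\-\]]
print(found)    # ['a', '-', ']']
```

Escaping matters here because characters such as - and ] have a special meaning inside brackets.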

class pregex.core.classes.AnyGermanLetter

Matches any character from the German alphabet.

class pregex.core.classes.AnyGreekLetter

Matches any character from the Greek alphabet.

class pregex.core.classes.AnyHebrewLetter

Matches any character that is defined within the Hebrew Unicode block.

class pregex.core.classes.AnyKoreanLetter

Matches any character from the Korean alphabet.

class pregex.core.classes.AnyLetter

Matches any character from the Latin alphabet.

class pregex.core.classes.AnyLowercaseLetter

Matches any lowercase character from the Latin alphabet.

class pregex.core.classes.AnyPunctuation

Matches any punctuation character as defined within the ASCII table.

class pregex.core.classes.AnyUppercaseLetter

Matches any uppercase character from the Latin alphabet.

class pregex.core.classes.AnyWhitespace

Matches any whitespace character.

class pregex.core.classes.AnyWordChar(is_global: bool = False)

Matches any alphanumeric character as well as the underscore character _.

Parameters

is_global (bool) – Indicates whether to include foreign alphabetic characters or not. Defaults to False.

Raises

GlobalWordCharSubtractionException – There is an attempt to subtract a regular character class from an instance of this class for which parameter is_global has been set to True.
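
The effect of is_global can be approximated with the standard re module. This is an analogy rather than pregex's implementation: it assumes that is_global=False behaves like an ASCII-only \w and that is_global=True behaves like a Unicode-aware \w.

```python
import re

ascii_word = re.compile(r"\w", re.ASCII)  # behaves like [A-Za-z0-9_]
unicode_word = re.compile(r"\w")          # also accepts letters from other scripts

# "\u00e9" is an accented Latin e, "\u03bb" is the Greek letter lambda.
samples = ["a", "_", "\u00e9", "\u03bb", "-"]

ascii_only = [c for c in samples if ascii_word.fullmatch(c)]
unicode_aware = [c for c in samples if unicode_word.fullmatch(c)]

print(ascii_only == ["a", "_"])                         # True
print(unicode_aware == ["a", "_", "\u00e9", "\u03bb"])  # True
```

Under this assumption the global variant describes an open-ended set of characters, which may be why subtracting from it is restricted.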