Hebrew Unicode Regex Matching

Stand back, I know regular expressions

Posted by James Cuénod on August 19, 2017

Disclaimer

If you don’t know what unicode is, this is not the post for you. If you don’t know what regular expressions are, this is not the post for you. If you’re still reading and you want to identify Hebrew characters in a string, this is the post for you. This is really just to document this stuff for myself…

TL;DR

Just want to match the whole unicode block relevant to biblical studies, this should get you going:

// The standard Hebrew unicode block
/[\u0590-\u05FF]/.test("some string with Hebrew in it")

// With precomposed characters
/[\u0590-\u05FF\uFB2A-\uFB4E]/.test("some string with Hebrew in it")

Why does this matter?

There are a bunch of useful use cases for finding Hebrew characters. One would be so that I can identify pointed Hebrew in a document and remove it (which I can do using only a regex find and replace). Another would be to do the same kind of find and replace on https://parabible.com (my site to search Hebrew morphology in the Old Testament) when you copy text. I use it to remove accent/cantillation markers so that when you paste text somewhere it’s formatted better for documents (I still intend to add a more general “remove all pointing option”). So we have a why, now for the how

Relevant Unicode Blocks for Hebrew

To begin with, it’s worth noting the relevant unicode blocks. The full block is \u0590-\u05FF. We have the cantillation (\u0591-\u05AF), vowels (\u05B0-\u05BB), some other pointing (\u05BC-\u05C7), the alphabet (\u05D0-\u05EA) and some extras that I don’t care about (\u05Fx). I have shamelessly stolen this table from https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

 0123456789ABCDEF
U+059x ֑  ֒  ֓  ֔  ֕  ֖  ֗  ֘  ֙  ֚  ֛  ֜  ֝  ֞  ֟ 
U+05Ax ֠  ֡  ֢  ֣  ֤  ֥  ֦  ֧  ֨  ֩  ֪  ֫  ֬  ֭  ֮  ֯ 
U+05Bx ְ  ֱ  ֲ  ֳ  ִ  ֵ  ֶ  ַ  ָ  ֹ  ֺ  ֻ  ּ  ֽ  ־ ֿ 
U+05Cx ׀ ׁ  ׂ  ׃ ׄ  ׅ  ׆ ׇ 
U+05Dx א ב ג ד ה ו ז ח ט י ך כ ל ם מ ן
U+05Ex נ ס ע ף פ ץ צ ק ר ש ת
U+05Fx װ ױ ײ ׳ ״
Notes
  1. 1. As of Unicode version 10.0
  2. 2. Grey areas indicate non-assigned code points

Some highlights above:

  • Note the final characters interspersed above (ך: \u05DA, ם: \u05DD, ן‎: \u05DF, ף‎: \u05E3, ץ‎: \u05E5)
  • Dagesh/Mappiq at \u05BC.
  • Maqaf (or however you want to spell that) at \u05BE.
  • Inverted Nun - that’s fun \u05C6.
  • Qamets Qatan (\u05C7), I have no idea what that is…
  • The \u05Fx range is for yiddish digraphs (I think), note the Geresh and Gereshayim in there though are punctuation (rather than accents).

The table above provides “combining characters” - in other words, בְּ is “composed” of three individual code points: \u05D1 + \u05BC + \u05B0. But unicode specification also supports “precomposed” characters at a another point in the range…

So the other block we need to consider is \uFB1D-\uFB1F. A bunch of these characters are irrelevant for my purposes (what’s the point of the wide characters or the alternative Ayin?) but sometimes you’ve gotta match “Shin with a shin dot”… So again, I have shamelessly stolen the table below from https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet. In this range, the characters I actually want to match are just \uFB2A-\uFB4E.

 0123456789ABCDEF
U+FB1x (U+FB00–U+FB1C omitted) ﬞ 
U+FB2x
U+FB3x
U+FB4x
Notes
  1. 1. As of Unicode version 10.0
  2. 2. Grey areas indicate non-assigned code points

Unicode Regular Expressions (using javascript)

Building Regexes in Javascript

So now what do we do with all this unicode goodness?

Javascript actually has a number of ways of building a regular expression. For more details, the best source is MDN. In short, regular expressions can be built in two ways:

  • A string (often as a parameter to a function that expects a regular expression). This is useful if you want to generate an expression; use something like new RegExp(str).
  • A regex style string: /regex-stuff/

And you can choose to compile them for reuse: e.g. const r = new RegExp("match") or const r = new RegExp(/match/) and then r.test("string with match").

Or you can use them directly /match/.test("string with match") (this only works with regex style strings).

Quick Reminder about Regex Functions

Hopefully that will get you going. Just a reminder to check the difference between search, test and match (note the object these functions exist on and their return type):

Test

regexpObj.test(str)

Returns: true if there is a match between the regular expression and the specified string; otherwise, false.

Match

str.match(regexp)

Returns: An Array containing the entire match result and any parentheses-captured matched results; null if there were no matches.

str.search(regexp)

Returns: The index of the first match between the regular expression and the given string; if not found, -1.

Matching Unicode with Regex

To get the unicode goodness into a regular expression, we use the \u syntax…

// Remember that 05BC is the unicode point for a Dagesh
/\u05BC/.test("בְּרֵאשִׁית")
//true

// And we can also match ranges:
/[\u0590-\u05FF]/.test("בְּרֵאשִׁית")
// true

// And even multiple ranges:
/[\u0590-\u05FF]/.test("מָקוֹם") // (Using a precomposed Holem-waw)
// false
/[\u0590-\u05FF\uFB2A-\uFB4E]/.test("מָקוֹם")
// true

It’s worth also noting that this .test is just checking whether the string has a matching character. If we want to check that the whole string is Hebrew we need to use ^ and $ (and remember, this will still not match spaces).

// The match must span the whole string
/^[\u0590-\u05FF]*$/.test("בְּרֵאשִׁית something that's not hebrew")
// false