BackNem 2.0 Implementation: Part I

Introduction

This webpage is the first part of a two-part article which presents a detailed description of the approach that the BackNem 2.0 application uses to backtranslate Nemeth braille text and math and English Braille American Edition (EBAE) text to print. This first part covers everything up to the analysis and backtranslation of math items, which are covered in the second part. (A significantly shortened technical description of the entire application is also available.)

This article assumes a basic knowledge of computer science concepts but not of braille. There are only two key facts about braille that are required to understand the processes described here. The first is that a braille system, such as Nemeth braille, uses a combination of markup, shorthand, direct representation, and explicit formatting to represent print by braille cells. The second key fact is that the main difficulty of backtranslating braille comes from braille's use of context-sensitive rules.

BackNem is a software application written in Java.  It includes hand-written Java code together with lexers and parsers generated automatically from hand-written grammars specified in ANTLR.  Lexers and parsers are used wherever possible since it is considerably easier to debug and modify grammars than to debug and modify code.

The hand-written code in BackNem includes the use of hash tables to associate braille words, contractions, mathematical symbols, and other items with their back-translations. Unfortunately, the difficulty of this problem is such that BackNem requires a great deal of additional hand-written code that is considerably more complex than that needed to use hash tables.

BackNem incorporates five lexers for analyzing text plus additional lexers and two parsers for analyzing math. BackNem handles the context-sensitive rules for scanning braille text by using several different lexers to address different contexts. However, BackNem handles the context-sensitive rules for scanning braille math primarily via the actions contained in the grammars for the math lexers and parsers.

Input and Output

BackNem accepts as input a standard electronic representation of braille called North American ASCII Braille.  ASCII Braille is a (necessarily somewhat arbitrary[1]) transliteration of the 63 braille cells by 63 ASCII (keyboard) characters.

This article also uses ASCII Braille in showing examples of braille.  ASCII characters intended to represent braille cells are displayed here using a monospaced (typewriter) font:
  A B C ! # > .

Six-dot braille doesn't have capital letters per se. Single capital letters are represented by preceding the braille cell for the corresponding small letter with another braille cell used as a capitalization indicator. (This other cell is transliterated by the ASCII comma character.) Nonetheless, the standard ASCII Braille system transliterates the braille small letters by print capital letters for historical reasons, namely that ASCII Braille was specified prior to the time that the ASCII character set included small letters. We use this somewhat misleading transliteration in this article both because it makes it easier to distinguish ASCII Braille from surrounding print and also because it is standard braille technology. (The BackNem software, like most six-dot braille software, treats either ASCII small letters or ASCII capital letters as representing the six-dot braille letters.)

The output of BackNem is a webpage or XHTML file with text represented as XHTML and mathematics represented as MathML.

Basic Process

As pointed out in the introduction, the main problem to be solved in backtranslating braille is finding an effective approach for dealing with the context-sensitive rules for representing printed material by braille cells. Examples of context that can affect semantics include the absolute position on a page; the presence or absence of blank lines above or below a set of lines; the type of an individual item and, sometimes, of its neighboring items; and the relative position of the braille cells within a single item.

BackNem addresses different possibilities in a top-down manner. Document areas containing spatial arithmetic are identified and isolated first since these have to be handled separately from other material. Further discussion of BackNem's current handling of spatial arithmetic may be found elsewhere. The rest of the present article focusses on other types of material, including spatial arrangements such as matrices that are used in higher-level mathematics.

BackNem handles the remaining possibilities basically line-by-line although, of course, multi-line spatial arrangements such as matrices are handled appropriately. Each line of braille is split into braille items where a braille item is a sequence of braille cells delimited by whitespace and/or braille dashes (sequences of two braille dots-36 cells). Each item is then analyzed individually in the order of its appearance in the original document.
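
The item-splitting step described above can be sketched in a few lines of Java. This is only an illustration, not BackNem's actual code: the class and method names are hypothetical, and the sketch assumes that a braille dash appears in ASCII Braille as two consecutive hyphen characters and that dashes act purely as delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BackNem: splits one line of ASCII Braille
// into braille items.
final class BrailleItemSplitter {

    // Items are delimited by runs of whitespace and/or braille dashes.
    // Assumption: a braille dash is written "--" (two dots-36 cells).
    // A single "-" (hyphen or minus sign) stays inside its item.
    static List<String> split(String line) {
        List<String> items = new ArrayList<>();
        for (String piece : line.split("\\s+|--")) {
            if (!piece.isEmpty()) {
                items.add(piece);   // keep items in document order
            }
        }
        return items;
    }
}
```

Under these assumptions, the line `SELF-K--& Y-Z` yields the three items `SELF-K`, `&`, and `Y-Z`; the embedded single hyphens are left for later analysis.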

The next goal after isolating an item is determining its type. BackNem handles five types of braille items in addition to the previously-discussed spatial arithmetic:

  1. Page numbers
  2. Computer Braille Code (CBC)
  3. Literal braille examples intended to be left as simulated braille in print
  4. Contracted (or uncontracted) narrative braille text
  5. Mathematical symbols and expressions
In practice, the main difficulty is distinguishing the last two types, text and math, from each other. The first three types are easily identified via a preliminary screening and easily back-translated by specialized processes.

Page numbers are identified by their location on either the top or bottom line of a page together with their special syntax.  A sequence of one or more CBC items (which may extend for more than one line) is identified by unique braille start and end indicators which act similarly to tags in XML. Literal braille examples, which are not a standard part of braille, are identified with a start indicator unique to BackNem: two repeated full cells (dots-123456).

The remainder of this article focusses on the processes required to distinguish between text items and math items and on the processes used to back-translate these items once their type is known.

Note that text items are back-translated individually while mathematical items are collected into a group prior to back-translation. (A group is simply an uninterrupted sequence of mathematical items between a pair of text items.) A single group of math items may represent either one mathematical expression or more than one.
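
As a minimal sketch of this grouping rule, assuming each item has already been classified, the following Java collects every uninterrupted run of math items into one group. The `Item` and `Kind` types and the method name are hypothetical and are not taken from BackNem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of collecting math items into groups.
final class MathGrouper {
    enum Kind { TEXT, MATH }

    static final class Item {
        final String cells;   // the item in ASCII Braille
        final Kind kind;      // assumed to be known already
        Item(String cells, Kind kind) { this.cells = cells; this.kind = kind; }
    }

    // Returns each uninterrupted run of MATH items as one group.
    // TEXT items end the current group and are otherwise skipped here,
    // since they are back-translated individually.
    static List<List<Item>> groupMath(List<Item> items) {
        List<List<Item>> groups = new ArrayList<>();
        List<Item> current = new ArrayList<>();
        for (Item item : items) {
            if (item.kind == Kind.MATH) {
                current.add(item);
            } else if (!current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);   // a group may run to the end of the input
        }
        return groups;
    }
}
```

Whether a group represents one expression or several is left to the math back-translation stage; the sketch only finds the group boundaries.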

Automatically Differentiating Text from Math and Backtranslating Text

A significant challenge for the backtranslation of Nemeth braille is differentiating those braille items that represent text from those that represent mathematics.  This is necessary because different rules are used to back-translate text and to back-translate math.

The reason that differentiating these two types of items is difficult is context-sensitivity. Even though the same braille cells that are used in a text or narrative context to represent punctuation marks and the letter sequences known as braille contractions are re-used in a math context to represent decimal digits, mathematical symbols, etc., there is no explicit markup to differentiate the two contexts. In the absence of markup, it is necessary for a computer program to use an ad hoc mix of syntax analysis, lexical analysis, and heuristics to simulate how human readers differentiate braille text from braille math.

As an aside, there is a modernized version of the Nemeth code, The Nemeth Uniform Braille System (NUBS), that does include explicit markup to differentiate the two contexts. Unfortunately, BANA has not yet chosen to review NUBS. One should also note that Professor Nemeth's original version of his code used two spaces to switch between narrative context and math context. However, the braille authorities removed this rule in 1965 because braille readers were having trouble distinguishing one space from two spaces. As Caryn Navy recently stated, "With historical hindsight from the work of computerizing translation between print and Braille, it certainly seems sad that the committee didn't simply change these delimiters from two spaces to other indicators easier to recognize."

Successfully differentiating text items from math items in the current Nemeth system requires carrying out a series of rather complex processes in a carefully prescribed sequence. The exact number and nature of these processes depend on the particular item. Simple items, such as a single letter, can obviously be characterized more quickly than those that contain several symbols.

Preliminary Screening

BackNem starts the process of differentiating text from math by dealing with those easy cases where the nature or syntax of an item is such that the item is necessarily math or necessarily text. Several processes are used in screening for these items.

Items definitely known to be math include isolated reserved words, namely braille function names and function name abbreviations, and those comparison signs that are isolated from both the preceding and the following items by spaces.

Math items which have a unique syntax can be identified using standard string manipulations. Nemeth fractions are one example.[2] [Although fractions would eventually be detected as math items by later processes, catching them at this point avoids any interference with the special algorithm used to disambiguate the contraction for st from the embedded slash in terms like and/or.][3] Isolated numbers, items consisting of a number followed by a single letter, and items such as matrices which use enlarged grouping symbols also have a unique syntax.

Another process in the preliminary screening is a simple ANTLR lexical analysis[4] applied to the beginning of an item. This analysis identifies certain items as definitely text and others as definitely mathematical. It also provides useful information about the other items.

Note that up to this point the preliminary screening processes don't make use of any information about the items surrounding the item being processed. This means that these same processes can also be applied to the subsequent item if knowledge of its nature can resolve an ambiguity. For example, a single letter followed by a comparison sign is assumed to be a symbolic letter used as part of an inline expression and not a single-letter whole-word contraction, e.g. let x = 3. (Of course, since items are processed in sequence, the nature of items preceding the item being processed is always already known.)

Some Nemeth braille items that are truly ambiguous are also screened out here by default. Items consisting of a letter followed by a single digit could be either a text item consisting of a single-letter whole-word contraction followed by a punctuation mark or a math item consisting of a literal symbol followed by an implicit numerical subscript. For example, B1 is the braille translation of both but, (the word but followed by a comma) and b1 (the letter b with the subscript 1). However, it is unusual for the common words translated by the single-letter whole-word contractions (but, can, do, every, from, ...) to be followed by punctuation marks in ordinary writing. We take the approach that unless such an item is clearly at the end of a sentence, it is math. [The astute reader will have noted that it would be difficult to backtranslate the braille transcription of the previous sentence!] Of course, if such an item is preceded by or followed by a math item, then it is unambiguously math.
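
The default just described can be written down as a small decision function. The Java below is a hedged sketch with hypothetical names. In particular, how BackNem decides that an item is "clearly at the end of a sentence" is not specified here, so the sketch simply takes that judgment as a boolean argument.

```java
// Hypothetical sketch of the default for letter-plus-digit items such as B1.
final class LetterDigitScreen {
    enum Kind { TEXT, MATH }

    // True for an item made of exactly one letter followed by one digit
    // character, e.g. "B1" in ASCII Braille.
    static boolean isLetterDigit(String item) {
        return item.length() == 2
                && Character.isLetter(item.charAt(0))
                && Character.isDigit(item.charAt(1));
    }

    // neighborIsMath: the preceding or following item is known to be math.
    // endsSentence:   the item is clearly at the end of a sentence
    //                 (assumed to be decided elsewhere).
    static Kind classify(boolean neighborIsMath, boolean endsSentence) {
        if (neighborIsMath) {
            return Kind.MATH;               // unambiguously math
        }
        return endsSentence ? Kind.TEXT     // contraction plus punctuation
                            : Kind.MATH;    // default: letter with subscript
    }
}
```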

The process used for this preliminary screening is representative of BackNem's approach of making alternating use of specialized Java methods developed specifically for BackNem and of its ANTLR-generated lexers. Note that some steps can be carried out in several different sequences and there is not always an optimal order.

BackLit

The strategy at this point is to temporarily assume that any items not identified during the preliminary screening are text items. BackNem then starts applying the sequence of processes necessary to back-translate a text item while simultaneously gathering additional information, until the item is either positively identified as text or as math or is shown to be less likely to be text than math.

BackNem is built on top of the BackLit back-translator for contracted braille since the processes required to distinguish math items from text items are very similar to the processes used in BackLit to identify the particular sub-type of a text item.  These latter processes only require slight modification for use in BackNem.

Just as in processing print text, the first thing that has to be done in processing a braille text item is to strip off any leading and trailing punctuation marks and/or indicators. (For simplicity we refer to the remaining portion of the item as a word although, of course, it may turn out to be something else.) Stripping off punctuation is trivial for print text because print uses unique characters for punctuation. However, as already noted, braille uses the same braille cells for other purposes in addition to their use as punctuation. Obviously, this doesn't mean that the usage is ambiguous; otherwise, braille users couldn't read braille. However, it does mean that the process is more complex for braille than for print.

Persons unfamiliar with braille might find it useful at this point to look at the examples in the article on BackLit.

BackNem uses two separate lexers to perform this first processing step. The first lexer is used to identify leading punctuation marks and indicators (and also the leading symbol under the assumption that the item is text). The second lexer is used to identify trailing punctuation marks (and also any trailing symbol under the assumption that the item is text). It turns out to be easier to identify the latter by effectively scanning from right-to-left rather than from left-to-right. Right-to-left scanning is achieved in practice by actually reversing the input sequence and implementing the grammar rules so as to specify the tokens with the braille cells in reverse order.[5]
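
The reversal trick can be illustrated without ANTLR. The sketch below is an assumption-laden stand-in for the second lexer, not BackNem's grammar: it reverses the item, consumes cells from the front of the reversed string while they belong to a small illustrative set of lower cells that can serve as trailing literary punctuation (ASCII Braille 1, 2, 3, 4, 6, 8), and then reverses the two pieces back. It does none of the context checks that the real lexers need.

```java
// Hypothetical sketch of right-to-left scanning by reversing the input.
final class ReverseScan {

    // Illustrative subset only: ASCII Braille cells that can act as trailing
    // literary punctuation (comma, semicolon, colon, period, exclamation
    // mark, question mark). The same cells have other meanings elsewhere.
    private static final String TRAILING = "123468";

    // Returns { word, trailingPunctuation }, both in original cell order.
    static String[] splitTrailing(String item) {
        String reversed = new StringBuilder(item).reverse().toString();
        int i = 0;
        // Scan the reversed item left to right, always leaving at least one
        // cell for the word.
        while (i < reversed.length() - 1
                && TRAILING.indexOf(reversed.charAt(i)) >= 0) {
            i++;
        }
        String trailing = new StringBuilder(reversed.substring(0, i)).reverse().toString();
        String word = new StringBuilder(reversed.substring(i)).reverse().toString();
        return new String[] { word, trailing };
    }
}
```

In a generated lexer the same effect is obtained by writing each token rule with its cells in reverse order, so that multi-cell trailing symbols are matched correctly on the reversed input.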

The output of the second lexer makes it possible to distinguish certain formerly unidentified items as being definitely math. First, math items include a special indicator to separate trailing literary punctuation marks from preceding cells so the punctuation marks aren't misread as digits. Second, math items also use a different braille cell for representing a comma than do text items.

Also, once the use of both lexers has made it possible to separate a word from any attached symbols, BackLit checks to see if the word appears in the table that associates braille words that are exceptions[6] with their back-translations. This check distinguishes certain formerly unidentified items as being definitely text along with determining the back-translation of the words that are in the table.
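
The exceptions-table check amounts to a hash-table lookup keyed on the braille word. Here is a minimal sketch, assuming a plain `HashMap` and case-insensitive ASCII Braille keys; the class is hypothetical, and the only entry shown is the ACME example discussed in note [6].

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of an exceptions table keyed on ASCII Braille words.
final class ExceptionsTable {
    private final Map<String, String> table = new HashMap<>();

    // Keys are normalized to upper case because ASCII Braille treats small
    // and capital ASCII letters as the same braille letters.
    void add(String brailleWord, String print) {
        table.put(brailleWord.toUpperCase(Locale.ROOT), print);
    }

    // A hit both identifies the item as text and supplies its back-translation.
    Optional<String> lookup(String brailleWord) {
        return Optional.ofNullable(table.get(brailleWord.toUpperCase(Locale.ROOT)));
    }
}
```

For example, after `add("ACME", "acme")`, a lookup of `acme` returns the stored back-translation, while a word that is not in the table returns an empty result and continues through the normal rules.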

Final Screening

At this point, items which are still unidentified and items which have been identified as text will have been characterized as being of one of the following three sub-types as a pre-requisite to applying the back-translation rules:

  1. Items where the "word" consists of a single symbol
  2. Items where the "word" consists of two symbols
  3. Items where the "word" consists of at least three symbols

Note that the term symbol as used here refers to the outcome of the lexical analysis under the assumption that the item represents text. If it is discovered later that the item actually represents math, then the proper lexical analysis is applied.

BackNem now goes ahead and back-translates the entire item as if it is text and, if necessary, applies a spell-checker to check the back-translation.

Items with Single-Symbol Main Portions

There are three types of text items with a single-symbol main portion:

  1. Ordinary contracted words
  2. Letters (and shortform sequences) which stand for themselves because they have a leading lettersign indicator [Letters used in a text context to stand for themselves rather than being used as single-letter whole-word contractions follow different rules from letters used in a math context. ]
  3. Single-digit EBAE upper numbers

Single-symbol main portions which are mathematical in nature will have already been identified as math with the exception of special situations. For example, isolated letters used in a math context to stand for letters rather than single-letter whole word contractions will have been detected during the preliminary screening.

One way to spot an isolated mathematical symbol, such as the radical symbol, which uses the same braille cell as an ar contraction, will be by noting that when back-translated as text, it doesn't represent an ordinary English word.

Items with Two-Symbol Main Portions

There are four types of text items with a two-symbol main portion:

  1. Ordinary contracted words
  2. Two sequenced (unspaced) large-sign words such as &!, which is the braille translation of and the (as identified by inspection)
  3. Mixed items with an EBAE upper number
  4. EBAE upper numbers

Two-symbol main portions which are mathematical in nature will again have already been identified as math with the exception of special situations. This is because Nemeth mathematical expressions are written unspaced and most mathematical expressions have at least three symbols.

Items with Multi-Symbol Main Portions

There are three general categories of text items with a multi-symbol main portion:

  1. Words with no embedded hyphens (or minus signs)
  2. Words with embedded hyphens that are not true compound words; this category includes letter words [e.g. e-mail], number words [e.g. 6-pack], and the various special constructs in EBAE Rules II.11 and II.13 such as spelled-out words [e.g. c-h-e-e-s-e] and stammered words [e.g. w-w-will]
  3. Words that are true compound words where the component parts have to be back-translated separately
Distinguishing whether items with embedded hyphens belong to the second or to the third category is quite tedious. BackNem does this by explicitly checking the item against the criteria for each of the different types of items in the second category. Items which don't match any of these types are identified as being true compound words.

We now discuss these three categories in turn. Remember that a third lexical analysis is still required in order to identify the interior symbols of these items.

Words with no embedded hyphens (or minus signs)

Back-translating items in this category is straightforward. (Again, unspaced sequences of certain words are identified by inspection prior to back-translation.) The back-translations of those "words" that haven't yet been identified as text are put through a spelling check to decide if they are more likely to be math than text.

Words (and related constructs) with embedded hyphens but that are not true compound words[7]

Identifying items with embedded hyphens, including letter words and spelled-out words, requires some tedious syntax analysis. Luckily, as far as distinguishing text items in this category from math items is concerned, the algorithms used to identify the various types of constructs with embedded hyphens result in a positive identification, so we can be fairly confident that any items identified as being in this category are definitely text and not math. (The default for items where a positive identification isn't made is that they aren't in this category at all but are true compound words.)

The one possible trouble spot is number words, such as 2-car. These can generally be distinguished from subtraction problems simply by ensuring that the literal part corresponds to a stand-alone word. BackNem could, however, get into trouble with the number word 6-can where the braille translation, #6-C, uses a single-letter whole-word contraction and could be intended as an algebraic subtraction. (Luckily, algebraic subtraction usually occurs in a longer expression.) Difficult cases can be added to the exceptions table.

True compound words versus algebraic subtraction

With compound words we get into a difficult area where we have to use some heuristics. The first thing we do is to go ahead and apply the rules for back-translating compound words. (Remember that in a compound word, the component parts are back-translated independently, as though the parts were standalone words with the one exception that the contraction for com [which is the same braille cell as the hyphen] can't be used.) Once we've backtranslated a presumed compound word, we can apply the spelling test to the back-translation of each of the component parts. Then, if any of the parts fails the test, we assume that the item is math and not text.

However, if all of the parts pass the test, we're not quite done since we need to consider what to do about the rule that allows the single-letter whole-word contractions to be used in compound words. For example, the print word self-knowledge is translated as SELF-K. This rule could lead to confusion as to whether Y-Z represents y-z or you-as since the back-translation of both parts would pass the spelling test.

However, as this example suggests, there aren't very many true compound words that use more than one single-letter whole-word contraction. BackNem takes the view that two Latin letters separated by a hyphen-minus are intended as an algebraic subtraction and not as a compound word. If you have a can-do attitude (C-D ATTITUDE), perhaps you can come up with a better solution.
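
The heuristic for hyphenated items can be summarized in code. The following Java is a sketch under stated assumptions: the back-translation of each component part is supplied by the caller as a function (standing in for BackNem's text back-translator), the spelling test is reduced to membership in a word set, and all names are hypothetical.

```java
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: true compound word versus algebraic subtraction.
final class CompoundScreen {
    enum Kind { TEXT, MATH }

    // backTranslatePart stands in for the real back-translator and may
    // return null when it has no result for a part.
    static Kind classify(String brailleItem,
                         Function<String, String> backTranslatePart,
                         Set<String> dictionary) {
        String[] parts = brailleItem.split("-", -1);

        // Two single Latin letters around a hyphen-minus, e.g. Y-Z,
        // are taken to be a subtraction.
        if (parts.length == 2 && isSingleLetter(parts[0]) && isSingleLetter(parts[1])) {
            return Kind.MATH;
        }
        // Otherwise each part is back-translated independently and must
        // pass the spelling test.
        for (String part : parts) {
            String word = backTranslatePart.apply(part);
            if (word == null || !dictionary.contains(word)) {
                return Kind.MATH;
            }
        }
        return Kind.TEXT;
    }

    private static boolean isSingleLetter(String part) {
        return part.length() == 1 && Character.isLetter(part.charAt(0));
    }
}
```

With a back-translator that maps SELF to self and K to knowledge, SELF-K is classified as text, while Y-Z is classified as math even though you and as would both pass the spelling test.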

Misspelled or Math?

Once words have been back-translated, the backtranslations of those words that aren't part of an item that was previously identified as text are checked against a table of English words and, if they are not in that table, are also checked against a table of acceptable misspellings. A word that is not in either table is assumed to be math unless it happens not to contain any braille cells other than letters.
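
A minimal sketch of this final test follows, assuming the two tables are available as sets of print words and that "letters only" is judged on the ASCII Braille form of the word. The class and parameter names are hypothetical.

```java
import java.util.Set;

// Hypothetical sketch of the "misspelled or math?" decision.
final class SpellingScreen {
    enum Kind { TEXT, MATH }

    static Kind classify(String brailleWord,
                         String backTranslation,
                         Set<String> englishWords,
                         Set<String> acceptableMisspellings) {
        if (englishWords.contains(backTranslation)
                || acceptableMisspellings.contains(backTranslation)) {
            return Kind.TEXT;
        }
        // Not in either table: assume math unless the braille word
        // contains no cells other than letters.
        boolean lettersOnly = !brailleWord.isEmpty()
                && brailleWord.chars().allMatch(Character::isLetter);
        return lettersOnly ? Kind.TEXT : Kind.MATH;
    }
}
```

Because the table lookup comes first, an item such as R+S whose back-translation rings is an English word is classified as text, which is the precedence discussed in the next paragraph.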

There can be at least two pitfalls with this strategy. The first pitfall is context-dependence. There are actually a few true ambiguities in Nemeth. For example, the braille translations of the word rings and the math expression r+s are the same: r+s. Also, the braille translations of the word farmer and the math expression f sqrt(m) (where the braille expression represents the actual radical symbol) are the same: f>m]. Since BackNem gives precedence to identifying an item as text, these true ambiguities will be identified as text. (These cases are rare enough that this shouldn't be a problem in practice. Also the Nemeth rules state that such situations should be avoided.)

A more worrisome pitfall with this strategy occurs when an item is actually a text item but BackNem mistakes it for math. This will happen when the main portion is either a correctly-spelled word that is missing from the provided table of English words or a misspelled word that is missing from the table of acceptable misspellings. Two features of BackNem help minimize the impact of this mistake. First, the BackNem user can add additional misspelled words to his or her personal list of misspelled words. Second, the lexers and parsers used for back-translating math incorporate a great amount of error detection that functions as a sort of "second chance" for items that are actually misspelled words and not math.

Thoughts on spelling correction

I had at one time considered incorporating an open source spelling-correction functionality similar to GNU Aspell instead of simply letting the user control the list of misspelled words.

Spelling correction attempts to guess the intended word from a misspelling. The application here would be that if the correction algorithm were to come up with a fairly similar result that is in the list of English words, then it would be presumed that text and not math is intended. However, I determined that using such an approach effectively would require too much research.

In the first place, spelling-correction algorithms are based on print text where each character retains its identity and not on situations like braille systems where a braille cell can have multiple unrelated meanings. (For example, properly handling cases where a user has misused the contraction for ing at the start of a word would be exceedingly difficult since this contraction uses the same braille cell as the Nemeth symbol for a plus sign but, unlike the plus sign, isn't allowed at the start of a word.)

In the second place, I recently learned of research which shows that braillos are often quite different from print typos and spelling errors. For example, a braille user who uses six-key entry might accidentally enter the mirror image of the proper braille cell, such as substituting a w for an r; or omit a dot, such as substituting the contraction for and where the contraction for for was intended, etc.
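
Both kinds of braillo mentioned above are easy to express when a cell is represented by its dot pattern rather than by an ASCII character. The sketch below is illustrative only and is not part of BackNem. It assumes a six-bit mask in which bit d-1 stands for dot d, with dots 1 to 3 in the left column and dots 4 to 6 in the right column.

```java
// Hypothetical sketch: braille cells as six-bit dot masks.
final class BrailleCell {

    // Builds a mask from dot numbers, e.g. dots(1, 2, 3, 5) for the letter r.
    static int dots(int... dotNumbers) {
        int mask = 0;
        for (int d : dotNumbers) {
            mask |= 1 << (d - 1);
        }
        return mask;
    }

    // Mirror image: swaps the left column (dots 1-3) with the right
    // column (dots 4-6).
    static int mirror(int mask) {
        return ((mask & 0b000111) << 3) | ((mask & 0b111000) >> 3);
    }

    // The cell that results from omitting one dot.
    static int dropDot(int mask, int dot) {
        return mask & ~(1 << (dot - 1));
    }

    // True if two cells differ by exactly one dot.
    static boolean differByOneDot(int a, int b) {
        return Integer.bitCount(a ^ b) == 1;
    }
}
```

With this representation, the mirror image of r (dots 1235) is w (dots 2456), and dropping dot 5 from the for contraction (dots 123456) gives the and contraction (dots 12346). A braille-aware spelling corrector would presumably weight such substitutions differently from ordinary print typos.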

Backtranslating Math

Please see the rest of this article for the backtranslation of items identified as math.


Notes

[1]There are at least two reasons why any one-for-one transliteration of braille cells by print characters is necessarily arbitrary and can't reflect the semantics of all of the braille cells.

  1. Braille cells are symbols, not characters, and most have different semantics in different contexts.
  2. There are numerous cases where the semantics of a cell doesn't correspond to any ASCII character, or, often, not even to any print character, ASCII or not. For example, there is a single braille cell that represents the three-letter print sequence the when used in text. (This same cell represents an integral sign when used in Nemeth math but the integral sign isn't an ASCII character.)

[2]Actually there is one exception. David Halliday of Duxbury Systems pointed out to me that the braille item ?RU/A# can be read as the translation of either the word thrustable or the spatial fraction

       ru
       -
       a

[3]Literary braille uses the same braille cell, dots-34, to represent the contraction for st and the oblique stroke or slash symbol. This leads to ambiguity in the back-translation of terms like and/or which contain an embedded slash. (There are new rules specifying that at least some braille transcriptions should use a new unambiguous symbol for the oblique stroke but these rules have not yet been widely adopted.)

The BackLit functionality of BackNem includes a specially-developed algorithm that does a good job of disambiguating the contraction usage from the slash usage. This algorithm uses a list of standard English words to determine the most reasonable back-translation. This requires considering several trial back-translations based on the various alternatives.

Of course, common terms like either/or can always be included in the exceptions table. Unfortunately, one is always encountering new ones. A recent issue of the ACB's Braille Forum referred to an instructor/guide.

[4]Lexical analysis is a method of dividing a sequence of characters into symbols or tokens where each symbol contains one or more characters or, in this case, one or more ASCII characters representing braille cells. Identifying braille symbols is a necessary (but not always sufficient) prerequisite to back-translating them. (See Examples.)

[5]From a purely computer science perspective, one might think first of addressing this problem with a single LR lexer rather than with two ANTLR LL(k) lexers. However, given that we don't necessarily know at this point whether we are dealing with text or math, and, thus, don't know what grammar to use for the interior cells, the present approach is considerably simpler. A second reason for not using an LR lexer can be seen in this article on BackLit which shows how context-sensitive lexing of the different portions of a braille item needs to be alternated with algorithmic processes to identify various special cases. Finally, there simply aren't robust open source Java LR lexers with user support comparable to that available for ANTLR.

[6]There are a number of different types of exceptions to text translation which have to be handled with an exceptions table. One type occurs for those shortform sequences that stand for the shortform in some cases but simply for the letter sequence in others. For example, the shortform sequence AC for according represents the shortform in ACLY, which is the braille translation of the print word accordingly, but it represents the letter sequence in ACME, which is the braille translation of the print word acme, not accordingme.

[7]One could, of course, argue that documents such as student work which are sourced originally in braille are unlikely to contain stammered words and other specialized literary braille constructs and that going to all the trouble necessary to properly back-translate these constructs is a waste of time.  However, one of the intended applications of BackNem is to provide an independent check on braille material sourced originally in print and then transcribed from print to braille. Specialized constructs are (at least somewhat) likely to occur in this latter type of material. Also, one of the major goals of this project is to illustrate to the professional braille transcribers who care about such things that it is possible to develop reliable braille software.


Send questions and test files to info@dotlessbraille.org