
BackLit Technical Summary

Introduction

BackLit, the backtranslation functionality for EBAE braille, is incorporated in the BackNem backtranslator, which handles Nemeth braille as well as EBAE. This article describes as briefly as possible the sequential steps used in BackLit to backtranslate an electronic EBAE braille document to print. A separate article describes BackLit features from a user perspective.

The overriding difficulty of backtranslation is that the meaning of a braille cell depends on context, where context includes the absolute position on a page; the (sometimes implicit) scope of markup; the type of an item; and the relative position of braille cells within an item. (See the examples in the Appendix.)

Persons familiar with EBAE will note that I've skipped describing certain details that are essential to correct backtranslation. The main focus of the present article is to explain why BackLit is complex and requires a number of different types of processes. Backtranslation of EBAE can't be done accurately using just a simple approach such as table lookup. The accompanying article on BackNem includes additional information.

BackLit uses hand-written Java code together with lexers generated automatically from hand-written grammars specified in ANTLR. Lexers are used wherever possible since it is considerably easier to debug or modify grammars than to debug or modify code.
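
To give a concrete (though hypothetical) picture of how a generated lexer is driven from Java, the sketch below uses the standard ANTLR 4 runtime API; the class name Step1Lexer is a stand-in for whatever classes the real grammars generate, not BackLit's actual code.

    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.Token;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: Step1Lexer stands for a lexer class generated by ANTLR from
    // one of the hand-written grammars over ASCII Braille.
    public class LexerDriverSketch {

        // Tokenize one braille item and collect the resulting tokens.
        static List<Token> tokenize(String asciiBrailleItem) {
            CharStream input = CharStreams.fromString(asciiBrailleItem);
            Step1Lexer lexer = new Step1Lexer(input);   // generated class (assumed)
            List<Token> tokens = new ArrayList<>();
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                tokens.add(t);
            }
            return tokens;
        }
    }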

ASCII Braille

The standard way of representing electronic braille is as a plain text file with the 63 braille cells encoded using a conventional transliteration to 63 ASCII characters; this transliteration is known as ASCII Braille. There is a Unicode specification, called Braille Patterns, for representing the braille cells; however, it isn't widely used. In any case, it doesn't provide any more information than ASCII Braille and wouldn't be as convenient for the lexical analysis needed for backtranslation. Defining lexer grammars in ASCII Braille is, by contrast, quite straightforward.
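
For readers who have not seen the transliteration, the sketch below shows how a few ASCII Braille characters relate to dot numbers and to the Unicode Braille Patterns block; the table holds only a handful of illustrative entries, not the full 63-cell correspondence, and is not part of BackLit.

    import java.util.Map;

    // Sketch only: a few illustrative ASCII Braille characters with their dot
    // numbers, converted to Unicode Braille Patterns (the U+2800 block).
    public class AsciiBrailleSketch {

        // Partial table: dot numbers for a handful of ASCII Braille characters.
        static final Map<Character, int[]> DOTS = Map.of(
                'A', new int[]{1},          // letter a
                ',', new int[]{6},          // capitalization indicator
                ';', new int[]{5, 6},       // letter sign
                '#', new int[]{3, 4, 5, 6}, // number sign
                '-', new int[]{3, 6},       // hyphen (dots-36)
                '6', new int[]{2, 3, 5}     // dots-235 (see Example 1 in the Appendix)
        );

        // Unicode Braille Patterns encode dot n as bit (n - 1) above U+2800.
        static char toUnicode(char asciiBraille) {
            int bits = 0;
            for (int dot : DOTS.get(asciiBraille)) {
                bits |= 1 << (dot - 1);
            }
            return (char) (0x2800 + bits);
        }
    }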

EBAE

EBAE, or English Braille American Edition, is a system for transcribing print documents using only braille cells and whitespace. EBAE was designed to accommodate the needs of tactile readers. The EBAE system is a variation on a plain-text linear markup system characterized by the following features:

The Backtranslation Process

BackLit approaches backtranslation by splitting a braille document into braille items where a braille item is a sequence of braille cells delimited by whitespace and/or braille dashes (sequences of two braille dots-36 cells). Each braille item is then characterized and backtranslated in the order of its appearance in the original document.
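
A minimal sketch of this first splitting stage, assuming the input is already a line of ASCII Braille in which the dots-36 cell is transliterated as '-' (so a braille dash appears as two or more consecutive hyphens):

    import java.util.Arrays;
    import java.util.List;

    // Sketch only: split one line of ASCII Braille into braille items.
    // Delimiters are whitespace and braille dashes (runs of two or more '-'
    // cells); a single '-' is an embedded hyphen and stays inside its item.
    public class ItemSplitterSketch {

        static List<String> splitIntoItems(String asciiBrailleLine) {
            String[] parts = asciiBrailleLine.split("(\\s|--+)+");
            return Arrays.stream(parts).filter(s -> !s.isEmpty()).toList();
        }

        public static void main(String[] args) {
            // The two items used in the Appendix examples.
            System.out.println(splitIntoItems(".GRT;S ,8,6,JE6]SON60'0"));
            // prints [.GRT;S, ,8,6,JE6]SON60'0]
        }
    }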

Any braille items that represent page numbers or embedded Computer Braille Code (CBC) are characterized by inspection and separated out for special processing.

BackLit uses a series of three steps to characterize each of the remaining braille items individually. Each step combines custom processes with context-sensitive lexical analysis as necessary to support backtranslation.  Characterization of an item stops if an error is encountered. 

Since some of the backtranslation rules depend on the relative position of a braille cell within a braille item, it is easiest to analyze the following parts of a braille item separately:

  1. Any prefix (leading punctuation and indicators)
  2. The first symbol of the main portion
  3. The interior symbols of the main portion
  4. The last symbol of the main portion
  5. Any postfix (trailing punctuation)

One or more of these parts are handled in each of the steps described in the following sections.
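
One possible way to picture these parts in code (the type and field names are mine, not BackLit's):

    // Illustrative only: a container for the parts of a braille item during
    // characterization. The names are hypothetical, not BackLit's.
    public record BrailleItemParts(
            String prefix,          // leading punctuation and indicators, possibly empty
            String firstSymbol,     // first symbol of the main portion (Step I)
            String interiorSymbols, // interior symbols of the main portion (Step III)
            String lastSymbol,      // last symbol of the main portion (Step II)
            String postfix) {       // trailing punctuation, possibly empty
    }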

Step I

The first analysis uses a lexer to tokenize any prefix and the first symbol of the main portion.

Syntax errors

Syntax errors identified by the first analysis include:

Special case: joined words

The first analysis identifies those items where the first symbol of the main portion is a joined word. These items are divided after the joined word into two new braille items. The first new item is sent to a separate process for backtranslation, and the new second item is re-characterized by starting over at the beginning of Step I.
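
A rough sketch of this split-and-restart, assuming Step I has already determined how many cells the prefix plus the joined word occupies (the method names and the downstream processes are placeholders, not BackLit's actual interfaces):

    // Sketch only: divide an item after a joined word and restart
    // characterization of the remainder at Step I.
    public class JoinedWordSketch {

        static void handleJoinedWord(String item, int prefixPlusJoinedWordLength) {
            String first = item.substring(0, prefixPlusJoinedWordLength);
            String second = item.substring(prefixPlusJoinedWordLength);
            backTranslateJoinedWordItem(first); // separate process (not shown)
            characterizeFromStepOne(second);    // start over at Step I (not shown)
        }

        static void backTranslateJoinedWordItem(String item) { /* placeholder */ }
        static void characterizeFromStepOne(String item)     { /* placeholder */ }

        public static void main(String[] args) {
            // From Example 1 in the Appendix: ,8,6 and ,JE6]SON60'0
            handleJoinedWord(",8,6,JE6]SON60'0", 4);
        }
    }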

Special case: low words and other isolated symbols

The first analysis also identifies those items that have no postfix and where the only symbol of the main portion is a low word, long dash, or ellipsis. (Low words are special contractions that can only be used where they are followed by whitespace.) These items are sent to a separate process for backtranslation.

Summary

Any prefix of remaining items includes at most one of the following semantic markup indicators:

  1. A number sign
  2. A letter sign
  3. A non-Latin letter indicator
  4. A non-Latin word indicator
  5. An ambiguous indicator that will turn out to be either a letter sign or a non-Latin word indicator

Step II

If any braille cells remain unanalyzed, the second analysis uses another lexer to tokenize any last symbol of the main portion and/or any postfix after (temporarily) removing any prefix and the first symbol.

Implementation details

Note that the second analysis is necessarily carried out from right to left. This means that the possibility that the last symbol represents an EBAE (upper) number generally has to be ignored and resolved by the lexical analysis of Step III. This is because the scope of a number sign in the prefix does not necessarily extend to the end of the main portion and, even if there is no number sign in the prefix, there could be an embedded number sign. (The backtranslation of embedded number signs is complicated by context-sensitivity: the same braille cell that represents an embedded number sign also represents the letter sequence ble.)
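
As a rough sketch of how a right-to-left pass could be set up (reversing the remainder and lexing it with a grammar written for reversed input is my assumption for illustration; Step2Lexer is a hypothetical generated class, not BackLit's code):

    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.Token;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: drop the prefix and first symbol, then lex the remainder
    // from the right by reversing it first. Step2Lexer is a hypothetical
    // ANTLR-generated lexer whose rules would be written for reversed input.
    public class StepTwoSketch {

        static List<Token> analyzeEnd(String item, int prefixAndFirstSymbolLength) {
            String remainder = item.substring(prefixAndFirstSymbolLength);
            String reversed = new StringBuilder(remainder).reverse().toString();
            Step2Lexer lexer = new Step2Lexer(CharStreams.fromString(reversed));
            List<Token> tokens = new ArrayList<>();
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                tokens.add(t);
            }
            return tokens;
        }
    }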

Special case: non-Latin letters and words

The main portion of the item is then isolated from any prefix and/or postfix. The nature of the main portion, including whether it is a short-form word sequence, is used where necessary* to disambiguate a letter sign from a non-Latin word indicator. Items where the prefix includes either a non-Latin letter or (a possibly just-disambiguated) non-Latin word indicator are sent to a separate process for backtranslation.
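
The rule is spelled out in the footnote at the end of this article; the sketch below restates it with placeholder predicates (the fallback to a non-Latin word indicator is my reading of the rule for illustration, not a quotation of BackLit's code):

    // Sketch only: disambiguate a lone dots-56 prefix indicator using the
    // nature of the main portion (see the footnote). The helper predicates
    // are placeholders, not BackLit's actual tests.
    public class Dots56DisambiguationSketch {

        enum PrefixIndicator { LETTER_SIGN, NON_LATIN_WORD_INDICATOR }

        static PrefixIndicator disambiguate(String mainPortion) {
            if (isSingleLetter(mainPortion) || isShortFormWordSequence(mainPortion)) {
                return PrefixIndicator.LETTER_SIGN;
            }
            return PrefixIndicator.NON_LATIN_WORD_INDICATOR; // assumed fallback
        }

        static boolean isSingleLetter(String s)          { return s.length() == 1; /* placeholder */ }
        static boolean isShortFormWordSequence(String s) { return false;           /* placeholder */ }
    }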

Special case: exceptions

Items with no semantic markup indicator in the prefix and with their main portion in either the exceptions table or an optional table of supplied back-translations are sent to a separate process for backtranslation.
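
Sketched as a plain table lookup (the table contents and their loading are omitted; only the idea of consulting the exceptions table and an optional user-supplied table comes from the article):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Sketch only: look the main portion up in the exceptions table and, if
    // that fails, in an optional table of user-supplied backtranslations.
    public class ExceptionTableSketch {

        private final Map<String, String> exceptions = new HashMap<>();
        private final Map<String, String> supplied   = new HashMap<>();

        Optional<String> lookUp(String mainPortion) {
            String hit = exceptions.get(mainPortion);
            if (hit == null) {
                hit = supplied.get(mainPortion);
            }
            return Optional.ofNullable(hit);
        }
    }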

Special case: items with embedded periods

The syntax of remaining items with no semantic indicator in the prefix is evaluated to determine whether the item is a special case -- such as an unspaced abbreviation (e.g. Ph.D.) or a URL -- that contains one or more embedded periods. Items with embedded periods are sent to a separate process for backtranslation. (These items have to be treated as a special case because when the braille cell which is used for a period is embedded, it normally represents the letter sequence dd, not a period.)
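
A minimal sketch of the embedded-period test, assuming ASCII Braille input in which the period/dd cell (dots-256) is transliterated as '4'; the test itself is my illustration of the special case, not BackLit's code:

    // Sketch only: report whether the period/dd cell occurs strictly inside
    // the main portion, i.e. with at least one cell on each side, as it does
    // in an unspaced abbreviation or a URL.
    public class EmbeddedPeriodSketch {

        static boolean hasEmbeddedPeriod(String mainPortion) {
            int i = mainPortion.indexOf('4', 1);    // first occurrence after cell 0
            return i > 0 && i < mainPortion.length() - 1;
        }
    }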

Summary

Items which have been completely analyzed, i.e. those with only one or only two symbols in their main portion, are each sent to the appropriate process for backtranslation.

Any prefix of remaining items includes at most one of the following semantic markup indicators:

  1. A number sign
  2. A letter sign

Step III

The third analysis uses one or two additional lexers to tokenize the interior symbols (and the last symbol if it turns out to be numerical) of the main portion of those items with more than two symbols in their main portion. (One of the lexers is used to begin the analysis of non-numerical items and the other is used to begin the analysis of numerical items. The analysis process switches between the two lexers in the case of mixed items.)
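
Very roughly, the switching could be pictured as below; the two lexer class names, the mode-change test, and the restart-on-substring control flow are all assumptions made for the sketch, not BackLit's actual implementation.

    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.Lexer;
    import org.antlr.v4.runtime.Token;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: tokenize the interior of an item, starting with one lexer
    // and switching to the other wherever the numeric/non-numeric mode changes
    // (for example, at an embedded number sign).
    public class StepThreeSketch {

        static List<Token> analyzeInterior(String interior, boolean startsNumeric) {
            List<Token> tokens = new ArrayList<>();
            boolean numericMode = startsNumeric;
            int pos = 0;
            while (pos < interior.length()) {
                Lexer lexer = numericMode
                        ? new Step3NumLexer(CharStreams.fromString(interior.substring(pos)))
                        : new Step3TextLexer(CharStreams.fromString(interior.substring(pos)));
                Token t;
                while ((t = lexer.nextToken()).getType() != Token.EOF) {
                    tokens.add(t);
                    pos += t.getText().length();
                    if (signalsModeChange(t)) break;   // e.g. an embedded number sign
                }
                if (t.getType() == Token.EOF) break;   // reached the end of the interior
                numericMode = !numericMode;            // mixed item: continue with the other lexer
            }
            return tokens;
        }

        static boolean signalsModeChange(Token t) { return false; /* placeholder */ }
    }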

Items with no embedded hyphens

Items with no embedded hyphens are sent to the appropriate process for backtranslation.

Special cases: items with embedded hyphens that are not compound words

The syntax of items with embedded hyphens is examined to determine whether the item is a special case or a true, hyphenated compound word.  If the item is not positively characterized as a special case, it is handled as a compound word. 

Special cases include numbers, mixed items with numbers and letters, and special constructs. Special constructs include letter words [e.g. e-mail], number words [e.g. 6-pack] and the various constructs specified according to EBAE Rules II.11 and II.13. These rules cover spelled-out words [e.g. c-h-e-e-s-e], stammered words [e.g. w-w-will], and words with an italicized portion [e.g. baseball] that use an embedded hyphen (in braille only) to terminate the default scope of the italics indicator. Special cases are sent to the appropriate process for backtranslation. 

Hyphenated compound words

The main portion of a compound word is divided at the hyphen(s), and the component parts are reanalyzed individually starting at Step I. A reanalysis of the parts (but not of any prefix or postfix belonging to the original item) is required because the component parts of a compound word are backtranslated independently, with the characterization and backtranslation of each of the parts using (almost) the same rules as are used for standalone words. For example, most of the contractions that are normally permitted only at the start of a word can also be used after the hyphen of a compound word.

There is no straightforward way to characterize compound words at an earlier point so as to avoid the need for the reanalysis of their component parts. This is because the processes that are utilized to differentiate compound words from other items with embedded hyphens make use of the information previously obtained from Steps I and II in addition to the information obtained from the initial analysis of this Step.
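
A skeletal sketch of the division and reanalysis (the helper method is a placeholder for the whole characterization pipeline restarted at Step I):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: split the main portion of a compound word at its hyphen
    // cells ('-' in ASCII Braille) and backtranslate each part independently;
    // the caller rejoins the parts with print hyphens and handles the original
    // prefix and postfix once.
    public class CompoundWordSketch {

        static List<String> backTranslateCompound(String mainPortion) {
            List<String> printParts = new ArrayList<>();
            for (String part : mainPortion.split("-")) {
                printParts.add(characterizeAndBackTranslate(part)); // restart at Step I
            }
            return printParts;
        }

        static String characterizeAndBackTranslate(String part) { return part; /* placeholder */ }
    }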

Conclusion

Backtranslating EBAE is a one-of-a-kind problem. I spent a lot of time on several alternate approaches before I settled on this one. (The amount of time was to some extent a reflection of my lack of appreciation of the logical complexities of EBAE.) I finally came to the conclusion that any higher-level abstraction of the problem than the one used here was likely to be leaky and unlikely to have any other useful application. The current implementation, which attempts to reflect the processes a human reader uses to decode braille, demonstrated its robustness in the face of two challenges. First, I kept discovering new special cases that could only be handled by inserting extra processes. Second, I had to insert additional extra processes in order to disambiguate text items from math items so I could backtranslate Nemeth braille along with EBAE in BackNem 2.0.

On the other hand, I would be very happy if another Java developer, with the advantage of using BackLit in addition to the official EBAE documentation as a starting point, discovers a method of refactoring BackLit that makes it simpler and thus easier to maintain.


Appendix: Two Examples

These examples illustrate the steps described above. Braille cells are represented by ASCII Braille.

Example No. 1 (Context-sensitivity)

This example shows that certain backtranslation rules depend on which part of a braille item is being backtranslated. The dots-235 braille cell, transliterated as the ASCII digit 6, has three different meanings in the three different parts. The example is the braille translation of the last two words and their attached punctuation from this sample sentence:
She said, "His toast was short, ‘To Jefferson!’”

‘To Jefferson!’” is translated to braille as a single braille item because of the use of the joined word contraction for translating to. (Joined words are special braille contractions, used only in translating the print words by, into, and to. Joined words can only be used when they can be unspaced from the following word; there are also other restrictions on their use.)

Step I: The result of applying the first analysis to the beginning of
,8,6,JE6]SON60'0

ASCII Braille    Print backtranslation
,8               Opening single (inner) quotation mark
,                Capitalization indicator
6                Joined word to (actually To because of the preceding indicator)

Because of the joined word, the original item is separated into two new items:
,8,6 and ,JE6]SON60'0
The analysis of the first new item is the same as the Step I analysis of the original item and doesn't have to be redone. The analysis of the second new item starts over with Step I.

Step I: The result of applying the first analysis to the beginning of
,JE6]SON60'0

ASCII Braille    Print backtranslation
,                Capitalization indicator
J                Letter j (actually J because of the preceding indicator)

Step II: The result of applying the second analysis to the end of
E6]SON60'0

ASCII Braille    Print backtranslation
N                Letter n
6                Exclamation point
0'               Closing single (inner) quotation mark
0                Closing double (outer) quotation mark

Step III: The result of applying the third analysis to
E6]SO

ASCII Braille    Print backtranslation
E                Letter e
6                Letter sequence ff
]                Letter sequence er
S                Letter s
O                Letter o

Example No. 2 (Multi-cell symbols)

This second example illustrates the usefulness of standard lexical analysis for tokenizing braille: some braille contractions (and other braille symbols) consist of more than one braille cell. The example is based on the braille translation of a single print word written in italics:
greatness

Step I: The result of applying the first analysis to the beginning of
.GRT;S

ASCII Braille    Print backtranslation
.                Italics indicator used when only a single word is italicized
GRT              Shortform great

Step II: The result of applying the second analysis to the end of
;S

ASCII Braille    Print backtranslation
;S               Final-letter contraction ness

Step III: This step is not needed since the item doesn't have any interior symbols.


*In some cases the only way to distinguish a non-Latin word indicator from a letter sign is to know the nature of the main portion. For example, if the main portion is a single letter or is a sequence of two or more letters that corresponds to a short-form word, then a single dots-56 indicator is a letter sign.



Send questions and test files to info@dotlessbraille.org