LR(1) and LALR Parsing

We've seen that the following grammar is an LR(0) grammar and we constructed the corresponding LR(0) states, which define a shift–reduce parser.

\begin{aligned} S &→ S~+~E \mid E \\ E &→ n \mid (~S~) \end{aligned}

What if instead of the production \(S → S + E\), the grammar included the production \(S → E + S\)? Then the construction of the LR automaton would have proceeded a little differently. In particular, consider what happens when the parser completes an \(E\) from the initial state:

We can already see a problem with this automaton, because it's not always clear what to do in state 3: the parser is ready to reduce the production \(S→E\), but if the next input token is +, the parser could also shift that token to go to a new state. The right choice turns out to be to shift when the next token is +, and reduce otherwise. But nothing in the LR(0) construction we've seen so far takes the lookahead token into account. Notice that an Earley parser would simply try both the Scan and Complete actions, and one of them would get stuck based on the next input symbol. But the LR parser has to commit to one of them, so an LR(1) parser uses the lookahead to decide which action to take.

LR(1) construction

This observation leads us to extend the LR automaton construction to an LR(1) automaton. Just as in the LR(0) automaton, each state is a set of items closed under prediction. However, the items now carry a set of lookahead tokens \(λ\). Thus, an LR(1) item has the form \(X→α\,.\,β~~~~λ\) , where, as before, the symbols \(α\) represent the top of the automaton stack, the dot represents the current input position, and the symbols \(β\) derive possible future input. The set of tokens \(λ\) describes tokens that could appear in the input stream after the derivation of \(β\), and not (as one might think) the tokens that might follow the dot.

To build an LR(1) automaton, the initial state is the closure of the item \(S'→\,.\,S$~~~~∅\) . Each automaton transition is made as in the LR(0) construction, except that the lookahead set is propagated to the next state. Taking the closure now involves lookaheads too: when an item \(X→α\,.\,Yβ~~~~λ\) predicts the productions of \(Y\), each predicted item \(Y→\,.\,γ\) receives the lookahead set \(\mathrm{FIRST}(βλ)\), the tokens that can begin input derived from \(β\), plus the tokens of \(λ\) if \(β\) can derive the empty string.
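As a concrete illustration, here is a minimal sketch of the LR(1) closure computation for the modified grammar \(S→E+S \mid E\), \(E→n \mid (S)\). The item representation and helper names are assumptions made for this sketch, not taken from any particular parser generator; since no nonterminal in this grammar derives the empty string, the FIRST computation is simplified accordingly.

```python
# Sketch of LR(1) closure for the grammar  S' -> S $ ,  S -> E + S | E ,
# E -> n | ( S ).  Items are (lhs, rhs, dot, lookahead-token); carrying one
# lookahead token per item is equivalent to carrying a set of them.
GRAMMAR = {
    "S'": [("S", "$")],
    "S":  [("E", "+", "S"), ("E",)],
    "E":  [("n",), ("(", "S", ")")],
}
NONTERMINALS = set(GRAMMAR)

def first(symbol):
    """Tokens that can begin a string derived from `symbol`.
    (No production above derives the empty string, so this simple
    recursion suffices.)"""
    if symbol not in NONTERMINALS:
        return {symbol}
    out = set()
    for rhs in GRAMMAR[symbol]:
        out |= first(rhs[0])
    return out

def closure(items):
    """Close a set of LR(1) items under prediction."""
    done = set()
    work = set(items)
    while work:
        item = work.pop()
        if item in done:
            continue
        done.add(item)
        lhs, rhs, dot, la = item
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            # Lookaheads for the predicted items: FIRST of what follows the
            # predicted nonterminal, or the item's own lookahead if nothing
            # follows it.
            new_la = first(rhs[dot + 1]) if dot + 1 < len(rhs) else {la}
            for prod in GRAMMAR[rhs[dot]]:
                for t in new_la:
                    work.add((rhs[dot], prod, 0, t))
    return done

# The initial state: closure of  S' -> . S $  with an empty lookahead set.
I0 = closure({("S'", ("S", "$"), 0, None)})
```

Running this predicts, among other items, \(E→\,.\,n\) with lookaheads + and $: a + can follow an \(E\) via \(S→E+S\), and a $ can follow it via \(S→E\) at the end of the input.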

With the LR(1) construction process, the two states we built earlier look similar but include lookahead information.

With this extended automaton, the ambiguity about what to do in state 3 is removed. The production \(S→E\) should only be reduced if the lookahead token is the end of input ($). If the lookahead token is +, then the parser should shift the token. And any other lookahead token indicates a syntax error.

A grammar is LR(1) if the action for each of its states is unambiguous: all of the complete items (on which a reduction could be done) have disjoint lookahead sets, and further, those lookahead sets do not include any token \(c\) for which the state contains an item of the form \(X→α\,.\,c\,β~~~~λ\) . When these conditions hold, only one action can appear in each entry of the parser action table.

Conflicts

When a grammar is not LR(1), it is because some entry in the action table cannot be assigned a single action. There are two ways that things can go wrong. First, there could be two completed items of the form \(X→γ\,.~~~λ\) and \(X'→γ'\,.~~~λ'\) whose lookahead sets \(λ\) and \(λ'\) both include some token \(c\). In this case we say that the grammar has a reduce–reduce conflict on \(c\). The second way to fail to be LR(1) is to have a shift–reduce conflict in some state. In this case, the state contains a completed item of the form \(X→γ\,.~~~λ\) but also an incomplete item of the form \(Y→α\,.\,cβ~~~λ'\) , where \(c ∈ λ\). If the lookahead token is \(c\), the parser doesn't know whether to shift \(c\) or to reduce.

Much of the challenge when using a parser generator that does not generate counterexamples automatically is identifying the grammar issues that lead to reduce–reduce or shift–reduce conflicts. Ambiguous grammars often lead to a surprisingly large number of conflicts, so it is prudent to build up grammars incrementally, testing them for conflicts as you go.
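To make the two failure modes concrete, here is a small sketch that checks a single automaton state for conflicts. The item representation `(lhs, rhs, dot, lookahead-token)` is an assumption of the sketch, not the format of any particular tool.

```python
NONTERMINALS = {"S'", "S", "E"}  # nonterminals of the running example

def conflicts(state):
    """Find LR(1) conflicts in one state, given as a set of items
    (lhs, rhs, dot, lookahead-token).  Returns (kind, token) pairs."""
    # Tokens the state could shift: terminals appearing just after a dot.
    shiftable = {rhs[dot] for (_, rhs, dot, _) in state
                 if dot < len(rhs) and rhs[dot] not in NONTERMINALS}
    # Group completed items (dot at the end) by lookahead token.
    reductions = {}
    for (lhs, rhs, dot, la) in state:
        if dot == len(rhs):
            reductions.setdefault(la, []).append((lhs, rhs))
    found = []
    for token, items in sorted(reductions.items()):
        if len(items) > 1:
            found.append(("reduce-reduce", token))
        if token in shiftable:
            found.append(("shift-reduce", token))
    return found
```

For instance, the LR(1) version of state 3 above, containing \(S→E\,.\,+\,S~~\{\$\}\) and \(S→E\,.~~\{\$\}\), has no conflict: the reduction's lookahead $ is disjoint from the shiftable token +. With the ambiguous production \(E→E+E\), a state containing both \(E→E\,.\,+E~~\{+\}\) and \(E→E+E\,.~~\{+\}\) reports a shift–reduce conflict on +.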

Precedence declarations

Bottom-up parser generators usually support precedence declarations as a way to resolve conflicts. Precedence declarations do not fundamentally increase the power of the parser but often make it easier to write grammars for programming languages.

For example, suppose our example grammar contained the production \(E→E+E\). This grammar is ambiguous with respect to associativity, so the LR(1) construction produces a state with a shift–reduce conflict on the token +. This is the kind of state we would expect the parser to get into if handed input like "1 + 2 . + 3", where the dot marks the current input position.

The syntax for precedence declarations varies, but in CUP, for example, we can write a precedence declaration “ precedence left PLUS ” to cause the parser to choose the reduce action when the lookahead token is +, resulting in left associativity. Conversely, we can write “ precedence right PLUS ” to cause the parser to choose the shift action, resulting in right associativity.

We can also get shift–reduce conflicts arising from precedence. For example, suppose our grammar contains the two productions \( E→E+E \mid E*E \). In this case, we'll get states with shift–reduce conflicts like the following, corresponding to input like "1 * 2 . + 3" and "1 + 2 . * 3":

The left state has a shift–reduce conflict on token +, and the right state has one on token *. These conflicts are easy to resolve using precedence declarations, however. In CUP, the order of precedence declarations defines precedence of different tokens. For example, these declarations would make * (TIMES) have higher precedence than + (PLUS):

precedence left PLUS;
precedence left TIMES;

Operationally, these declarations would cause the parser to reduce the left state on token + and to shift the right state on token *, achieving the desired precedence.
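The rule the generator applies can be sketched as follows; the table contents mirror the declarations above, and the function and variable names are illustrative, not from any actual generator:

```python
# Sketch of precedence-based shift-reduce conflict resolution.  Later
# declarations bind tighter, as in CUP, so TIMES (declared second)
# outranks PLUS.
PRECEDENCE    = {"+": 1, "*": 2}
ASSOCIATIVITY = {"+": "left", "*": "left"}

def resolve(production_token, lookahead):
    """Settle a shift-reduce conflict between reducing a production whose
    precedence comes from `production_token` and shifting `lookahead`."""
    if PRECEDENCE[lookahead] > PRECEDENCE[production_token]:
        return "shift"    # e.g. "1 + 2 . * 3": keep going, * binds tighter
    if PRECEDENCE[lookahead] < PRECEDENCE[production_token]:
        return "reduce"   # e.g. "1 * 2 . + 3": finish the tighter 1 * 2
    # Equal precedence: associativity decides.
    return "reduce" if ASSOCIATIVITY[lookahead] == "left" else "shift"
```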

LALR parser generators also allow a precedence declaration to be attached directly to a production, to specify its priority with respect to other productions of the same nonterminal. Such declarations are helpful when the production does not itself use the token whose precedence is being used.
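In CUP, such a per-production declaration is written with %prec. The fragment below is illustrative only (it assumes terminals PLUS, MINUS, and UMINUS and a nonterminal expr are declared elsewhere); the classic use is unary minus, which should bind more tightly than the binary MINUS token that appears in its production:

```
precedence left PLUS, MINUS;
precedence right UMINUS;

expr ::= expr PLUS expr
       | expr MINUS expr
       | MINUS expr %prec UMINUS
       ;
```

Here UMINUS is never shifted; it exists only to give the unary-minus production a precedence level higher than that of PLUS and MINUS.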

LALR grammars

The number of LR(1) states is often unnecessarily large, because the LR(1) automaton ends up with many states that are identical other than lookahead tokens \(λ\). This insight leads to LALR automata. An LALR automaton is exactly the same as an LR(1) automaton except that it merges together all states that are identical other than lookaheads. In the merge, the lookahead sets are combined for each item. The LALR automaton can be constructed directly, of course; it's not necessary to build the full LR(1) automaton and then merge.

Many parser generators are based on LALR, including commonly used software like yacc, bison, CUP, mlyacc, and ocamlyacc. (Menhir, by contrast, constructs full LR(1) parsers.)

While LALR is in practice just about as good as full LR, it does occasionally lose some expressive power. To see why, consider what happens when the two LR(1) states in the following diagram are merged to form the state marked M:

The two states on the top are unambiguous LR(1) states in which the lookahead character indicates which of the two productions to reduce. But when merged, the resulting state has reduce–reduce conflicts on both + and $. When merging LR(1) states creates a new conflict in this way, the grammar is LR(1) but not LALR(1).
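The merge, and the conflict it can create, can be sketched directly. The item representation `(lhs, rhs, dot, lookahead-token)` and the productions \(A→c\) and \(B→c\) are assumptions standing in for the diagram's two states:

```python
def core(state):
    """The LR(0) core of an LR(1) state: the items minus their lookaheads."""
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in state)

def lalr_merge(lr1_states):
    """Merge LR(1) states with identical cores, combining lookaheads."""
    merged = {}
    for state in lr1_states:
        merged.setdefault(core(state), set()).update(state)
    return list(merged.values())

# Two unambiguous LR(1) states: in each, the lookahead ($ or +) selects
# which production, A -> c or B -> c, to reduce.
s1 = {("A", ("c",), 1, "$"), ("B", ("c",), 1, "+")}
s2 = {("A", ("c",), 1, "+"), ("B", ("c",), 1, "$")}

# The states share a core, so LALR merges them into one state that can
# reduce either production on either token: reduce-reduce conflicts.
m = lalr_merge([s1, s2])
```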

Comparing the power of different parsers

Earley parsers are the most powerful parsers we've seen: they can recognize any context-free grammar, including ambiguous grammars. They can even recognize inherently ambiguous languages, languages for which there is no unambiguous grammar, such as \(\{a^m b^m c^n d^n\} ∪ \{a^m b^n c^n d^m\}\). Earley parsers expose the full power of the nondeterministic pushdown automaton (NPDA), which you can think of as a shift–reduce parser with an oracle telling it what to do in case of conflicts.

\(\mathit{LR}(1)\) parsers, on the other hand, expose the full power of the deterministic shift–reduce parser (deterministic pushdown automaton, or DPDA). Anything that can be recognized by a shift–reduce parser with one-token lookahead is \(\mathit{LR}(1)\).

For any \(k\), \(\mathit{LR}(k)\) is strictly more powerful than \(\mathit{LALR}(k)\). However, \(\mathit{LALR}(1)\) is actually incomparable to \(\mathit{LL}(1)\) (though in practice, it is more expressive). And there are still unambiguous grammars that cannot be parsed by \(\mathit{LR}(k)\) parsers for any \(k\), such as a grammar for a language of palindromes.

These comparisons are depicted in the following diagram: