Jeff Wu

Computer Scientist, Programmer, Dog Lover, BMW Enthusiast

Programming Language Theory: Part 1

Let’s start with formal languages. A formal language can be either a finite or infinite set of strings. Specifically, a formal language L is a subset of Σ*, where each string from Σ* in L is called a sentence. For example, {a, aa, aaa, bb, aba, abc} is a formal language.

Syntax and Semantics

Syntax refers to the structure of language and semantics refers to the meaning of language.

Validity

Only after the validity of every individual word in the entire string is established can we examine whether the words are arranged in a proper order according to the particular language in which this particular string is a candidate sentence.

There are three progressive types of sentence validity:

  1. A sentence is lexically valid if all the words of the sentence are valid.
  2. A sentence is syntactically valid if it is lexically valid and the ordering of the words is valid.
  3. A sentence is semantically valid if it is lexically and syntactically valid and has a valid meaning.

Consider the following:

  • Garfield is a catt.

This sentence is not lexically valid because “catt” is not a word.  The sentence cannot be syntactically or semantically valid.

Consider the following:

  • Garfield is a.

This sentence is lexically valid because all of its words are valid, but it is not syntactically valid because the arrangement of those words does not conform to the subject - verb - object structure of English sentences, so it cannot be semantically valid.

Consider the following:

  • Cat is a Garfield.

This sentence is lexically valid because all of its words are valid and syntactically valid because the arrangement of those words conforms to the subject - verb - object structure of English sentences, but it is not semantically valid because the sentence does not make sense.

Consider the following:

  • Garfield is a cat.

This sentence is lexically, syntactically, and semantically valid. Notice that these types of sentence validity are progressive. Once a candidate sentence fails any test for validity, it automatically fails a more stringent test for validity. In other words, if a candidate sentence does not even have valid words, those words can never be arranged correctly. Similarly, if the words of a candidate sentence are not arranged correctly, that sentence can never make semantic sense.

Regular Expressions

Regular expressions provide a concise and formal method of describing languages, including infinite languages. A regular expression is a pattern – itself represented as a string – that denotes the strings of a language.  Regular expression notion is one way to describe a regular language. This notation involves a combination of strings of symbols from some alphabet Σ, parentheses, and the operators +, ·, and *. 

The simplest case is the language {a}, which is denoted by the regular expression: a. 

Slightly more complicated is the language {a, b, c}, for which, using the + to denote union, we have the regular expression a + b + c. An asterisk * is used for star-closure in a similar way. For example, the regular expression bar* denotes the language { bar, barr, barrr, barrrr, … }

Finite-State Automata

A finite-state automaton (FSA) is a model of computation used to determine whether a string is a sentence that belongs to a particular language. Specifically, determining whether a string s from Σ* in L is a set-membership problem, and depends on the complexity of L. For example, determining if a string s from Σ* is in the language of all three-character strings is simpler than determining if s is in the language of palindromes.

Deterministic Finite Accepters

A deterministic finite accepter (DFA) is a type of automaton, and is defined by the quintuple: M = (Q, Σ, δ, q0, F) where:

  • Q is a finite set of internal states
  • Σ is a finite set of symbols called the input alphabet
  • δ is the transition function
  • q0 is the initial state
  • F is a set of the final states

The initial state is q0, with the input mechanism pointing to the leftmost symbol of the input string. During each move of the automaton, the input mechanism advances one position to the right; in other words, each move consumes one input symbol. When the end of the string is reached, the string is accepted if the automaton is in one of its final states. Otherwise, the string is rejected. The input mechanism can move only from left to right and reads exactly one symbol on each step. The transitions from one internal state to another are governed by the transition function δ.

Every finite automaton accepts some language. If we consider all possible finite automata, we get a language family.

Transition graphs can be used to visualize finite automata. Vertices represent states and the edges represent transitions. The labels on the vertices are the names of the states, while the labels on the edges are the current values of the input symbol. For example, if q0 and q1 are internal states of some DFA M, then the graph associated with M will have one vertex labeled q0 and another labeled q1. An edge (q0, q1) labeled a represents the transition δ(q0, a) = q1. The initial state will be identified by an incoming unlabeled arrow not originating at any vertex; final states are drawn with a double circle.

To show that any language is regular, a DFA must be valid for it. With the construction of a DFA for the language, we can claim that by definition, the language is regular.

For any given regular language there are many equivalent DFAs, so it should be our goal to find the DFA with the smallest number of internal states.

Nondeterministic Finite Accepters

Nondeterministic Finite Accepters (NFA) means that some automata choices in some situations have more than one possible transition. Rather than prescribing a unique move in each situation, there is an allowed set of possible moves. This is achieved by defining the transition function so that its range is a set of possible states.

NFAs can also be represented by transition graphs. The vertices are determined by Q, while an edge (qi, qj) with label a is in the graph if and only if δ(qi, a) contains qj.

A string is accepted by an NFA if there is some sequence of possible moves that will put the machine in a final state at the end of the string. A string is rejected only if there is no possible sequence of moves by which a final state can be reached.

Next Time

In the next blog for this series, we'll explore grammars and context-free languages.