In computer programming, syntax refers to the set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in a programming language.
This applies not only to programming languages (where the text represents source code) but also to markup languages, where the text represents structured data.
The syntax of a programming language defines its surface structure—what code looks like and how it must be arranged to be understood by a compiler or interpreter.
Text-based languages use sequences of characters, while visual programming languages may rely on the spatial arrangement and connections between graphical symbols.
Code that does not adhere to these rules is considered syntactically invalid and results in a syntax error.
When designing a programming language, developers often start by creating examples of both valid and invalid code snippets and formalizing the rules that distinguish between them.
Syntax focuses on the structure of code, whereas semantics refers to its meaning and behavior. Typically, syntax analysis precedes semantic analysis in the compilation process, though in some cases, the two are interdependent.
In compiler architecture, both syntax analysis and semantic (contextual) analysis are usually part of the frontend; the middle-end and backend then optimize and generate code based on the program’s meaning rather than its surface form.
Levels of Syntax
Syntax in computer languages is often described in three hierarchical levels:
- Lexical Level: Defines how individual characters form tokens (e.g., identifiers, operators, literals).
- Grammar Level (Phrase Level): Specifies how tokens combine to form valid statements and expressions, using formal grammars.
- Contextual Level: Handles how names, types, and scopes are resolved—such as variable declarations and type checking.
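The lexical level can be observed directly in Python, whose standard-library tokenize module exposes the language’s own lexer. A minimal sketch (the source snippet here is arbitrary):

```python
# Illustrating the lexical level: turning raw characters into tokens
# with Python's standard-library tokenize module.
import io
import tokenize

source = b"total = price * 2"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Running this prints one line per token — including bookkeeping tokens such as ENCODING and ENDMARKER — with entries like NAME 'total', OP '=', NAME 'price', OP '*', and NUMBER '2'.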
This modular design allows different stages of the compiler to handle each level independently (at least in theory).

Stages of Syntax Processing

Correspondingly, a language processor typically works through three stages:
- Lexical Analysis (Tokenization): A lexer (or scanner) converts a raw sequence of characters into a sequence of tokens—logical units like identifiers, keywords, and symbols.
- Parsing (Syntax Analysis): A parser takes the tokens and builds a syntax tree (also known as a parse tree or concrete syntax tree), ensuring that the sequence conforms to the language’s grammar. The parser then often generates an abstract syntax tree (AST) that represents the logical structure of the code without unnecessary detail.
- Contextual Analysis: This phase checks that the program obeys semantic rules—such as verifying variable declarations, type consistency, and scope resolution. Although ideally separated, real-world languages often require contextual information even during parsing (e.g., C’s “lexer hack” where tokenization depends on type information).
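The first two stages can be seen in Python’s standard-library ast module, which runs the lexer and parser but performs no contextual checks. A short sketch:

```python
# Parsing with Python's standard-library ast module: the source is
# tokenized and parsed into an abstract syntax tree (AST).
import ast

tree = ast.parse("total = price * 2")
print(ast.dump(tree))  # Module(body=[Assign(...)], ...)

# The parser happily accepts a reference to an undefined name:
# whether `price` exists is a contextual question, not a syntactic
# one, and Python defers it to run time.
ast.parse("undefined_name + 1")  # no error raised here
```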
Grammar and the Chomsky Hierarchy
The syntax levels correspond roughly to levels in the Chomsky hierarchy:
- Lexical grammar: Type-3 (Regular Language), specified using regular expressions.
- Phrase grammar: Type-2 (Context-Free Language), specified using production rules in Backus–Naur Form (BNF) or similar notations.
- Contextual structure: Sometimes Type-1 (Context-Sensitive Grammar), though often implemented manually via symbol tables and type checkers.
Many programming languages’ core grammar can be expressed as a context-free grammar, though features like type checking and scoping may require context-sensitive analysis.
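To make the two lower levels concrete, here is a toy sketch in Python: a regular-expression lexer (the Type-3 level) feeding a hand-written recursive-descent parser for a small context-free expression grammar (the Type-2 level). The grammar is invented purely for illustration:

```python
# Toy grammar, one production rule per nonterminal (BNF-style):
#   expr   ::= term (("+" | "-") term)*
#   term   ::= factor (("*" | "/") factor)*
#   factor ::= NUMBER | "(" expr ")"
import re

# Lexical grammar: regular expressions are enough to split the input
# into NUMBER tokens and single-character operator tokens.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        tokens.append(("NUMBER", number) if number else ("OP", op))
    return tokens

# Phrase grammar: one function per nonterminal, each returning the
# next token index and the evaluated value of the phrase it matched.
def parse_expr(tokens, i=0):
    i, value = parse_term(tokens, i)
    while i < len(tokens) and tokens[i][1] in "+-":
        op = tokens[i][1]
        i, rhs = parse_term(tokens, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return i, value

def parse_term(tokens, i):
    i, value = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i][1] in "*/":
        op = tokens[i][1]
        i, rhs = parse_factor(tokens, i + 1)
        value = value * rhs if op == "*" else value / rhs
    return i, value

def parse_factor(tokens, i):
    kind, text = tokens[i]
    if kind == "NUMBER":
        return i + 1, int(text)
    if text == "(":
        i, value = parse_expr(tokens, i + 1)
        if tokens[i][1] != ")":
            raise SyntaxError("expected ')'")
        return i + 1, value
    raise SyntaxError(f"unexpected token {text!r}")

print(parse_expr(lex("2 + 3 * (4 - 1)"))[1])  # → 11
```

Nesting the loops this way encodes operator precedence directly in the grammar: term binds tighter than expr, so multiplication is grouped before addition without any explicit precedence table.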
Tools and Parsing Strategies
Tools like lex and yacc (or flex and bison) automate the generation of lexers and parsers from formal specifications. The generated lexer produces a token stream, the generated parser builds a concrete syntax tree from it, and the parser writer typically converts this tree to an AST by writing additional code.
Contextual analysis is usually done manually. Despite the availability of these tools, many compilers implement lexers and parsers by hand, for reasons such as performance, error handling, or flexibility.
Parsers may be written in various languages, including C, C++, Python, Perl, or Haskell.
Examples of Syntax Errors
1. Lexical error: Encountering an unexpected character sequence that cannot form a valid token.
Example: @foo might be invalid if @ is not defined in the language.
2. Parsing error: Tokens cannot be combined according to the grammar rules.
Example: if (x > 0 is missing a closing parenthesis.
3. Contextual error: The structure is correct but fails context-sensitive rules (often considered semantic errors but sometimes treated as syntax errors).
Example: int x = "hello"; fails because a string cannot be assigned to an integer variable.
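In Python, the first two kinds of error can be provoked with the built-in compile(), which runs the tokenizer and parser without executing anything; Python reports both as SyntaxError. (Being dynamically typed, Python defers type mismatches like the third example to run time.) A sketch:

```python
# Two syntactically invalid snippets, rejected without being executed.
for bad in ("x = 1 $ 2",    # lexical error: '$' is not a valid token
            "if (x > 0"):   # parsing error: unclosed parenthesis
    try:
        compile(bad, "<example>", "exec")
    except SyntaxError as err:
        print(f"{bad!r} rejected: {err.msg}")
```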
Syntax vs. Semantics
Syntax defines the form of code—whether it can be parsed by the compiler—whereas semantics defines its meaning and behavior during execution. A syntactically correct program may still be semantically invalid (e.g., using an uninitialized variable or dividing by zero).
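The distinction is easy to demonstrate in Python: the snippet below parses without complaint, yet fails the moment it runs. A sketch using the standard ast module:

```python
# Syntactically valid, semantically doomed: division by zero.
import ast

source = "result = 1 / 0"
ast.parse(source)       # syntax analysis succeeds; no error here
try:
    exec(source)        # execution reaches the semantic failure
except ZeroDivisionError as err:
    print("runtime error:", err)
```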
Complex Syntax
Some languages (e.g., Perl, Lisp) blur the lines between parsing and execution by allowing runtime code to alter the parsing process (e.g., macros or dynamic parsing).
This can make syntax analysis undecidable—famously described as “only Perl can parse Perl.” Languages whose grammar is fixed at design time avoid this problem and can be parsed without running any of the program’s code.
Summary
Syntax is fundamental to understanding and writing correct code. It governs how we structure programs and is the foundation for all further processing—like semantic analysis and code generation. Mastering syntax helps avoid errors and ensures your code is valid, consistent, and ready to run.