
Lexical Grammar

In D, lexical analysis is independent of syntax parsing and semantic analysis. The lexical analyzer splits the source text into tokens, and the lexical grammar describes what those tokens are. The D lexical grammar is designed for high speed scanning: it has a minimum of special case rules, there is only one phase of translation, and it is easy to write a correct scanner for. The tokens are readily recognizable by those familiar with C and C++.

Phases of Compilation

The process of compiling is divided into multiple phases. No phase depends on any subsequent phase; for example, the scanner is not perturbed by the semantic analyzer. This separation of the passes makes language tools like syntax-directed editors relatively easy to produce.
  1. ascii/wchar
    The source file is checked to see if it is in ascii or wchar, and the appropriate scanner is loaded.
  2. lexical analysis
    The source file is divided up into a sequence of tokens.
  3. syntax analysis
    The sequence of tokens is parsed to form syntax trees.
  4. semantic analysis
    The syntax trees are traversed to declare variables, load symbol tables, assign types, and in general determine the meaning of the program.
  5. optimization
  6. code generation

Source Text

D source text consists of wide (wchar) characters. If the source text consists of ASCII characters, they are treated as the first 128 wide characters. Multibyte and UTF-8 character sets are not supported, although nothing precludes supporting them. There are no digraphs or trigraphs in D. The source text is split into tokens using the maximal munch technique, i.e., the lexical analyzer tries to make the longest token it can. For example, >> scans as a single shift right token, not as two greater than tokens.
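
The maximal munch rule can be sketched in Python as follows. This is a minimal illustration rather than D's actual scanner: the operator list is a small assumed subset of C-like operators, and the name munch is hypothetical.

```python
# Assumed subset of operators, chosen only for illustration.
OPERATORS = [">>=", ">>", ">=", ">", "="]

def munch(text):
    """Split text into operator tokens, always taking the longest match."""
    # Try longer operators first so that ">>" wins over ">".
    candidates = sorted(OPERATORS, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1          # a space separates tokens
            continue
        for op in candidates:
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens

print(munch(">>"))   # one shift right token
print(munch("> >"))  # two greater than tokens, because of the space
```

Trying the candidates longest first is what makes the scanner greedy; with the order reversed, >> would scan as two > tokens.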

End of File

	EndOfFile:
		end of the file
		\u0000
		\u001A
	
The source text is terminated by whichever of these comes first.

End of Line

	EndOfLine:
		\u000D
		\u000A
		\u000D \u000A
		EndOfFile
	
There is no backslash line splicing, nor are there any limits on the length of a line.
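
The EndOfFile and EndOfLine rules can be sketched together in Python. This is an illustration under assumptions, not the D scanner itself: the function names effective_source and split_lines are hypothetical, and the whole source is assumed to be available as one string.

```python
import re

# EndOfFile markers from the grammar: \u0000 and \u001A.
EOF_CHARS = "\u0000\u001A"

# "\r\n" comes first in the alternation so a CR LF pair is one line ending.
END_OF_LINE = re.compile(r"\r\n|\r|\n")

def effective_source(text):
    """Return the text up to the first EndOfFile marker, or all of it."""
    for i, ch in enumerate(text):
        if ch in EOF_CHARS:
            return text[:i]
    return text  # the physical end of the file terminates the source

def split_lines(text):
    """Split the effective source on any of the three line endings."""
    return END_OF_LINE.split(effective_source(text))

print(split_lines("a\r\nb\rc\nd\u001Aignored"))
print(split_lines("a\\\nb"))  # the backslash does not splice the two lines
```

Because there is no line splicing, a trailing backslash simply stays on its own line as an ordinary character.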

White Space

White space is defined as a sequence of one or more of the following:

Comments

	Comment:
		/* characters */
		// characters EndOfLine
	
D has two kinds of comments: the block comment and the line comment. Block comments can span multiple lines; line comments end at the end of the line. Comments do not nest. Comments cannot be used as token concatenators: for example, abc/**/def is two tokens, abc and def, not a single abcdef token.
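
The comment rules can be sketched in Python as follows. This is a simplified illustration with assumptions stated up front: the name strip_comments is hypothetical, string literals are ignored entirely, and raising an error on an unterminated block comment is a choice made here, not something the text specifies.

```python
def strip_comments(text):
    """Replace each comment with one space. String literals are not handled."""
    out = []
    pos = 0
    while pos < len(text):
        if text.startswith("/*", pos):
            # The comment ends at the first "*/", so comments do not nest.
            end = text.find("*/", pos + 2)
            if end == -1:
                raise ValueError("unterminated block comment")  # assumed behavior
            pos = end + 2
            out.append(" ")
        elif text.startswith("//", pos):
            # A line comment runs up to, but not including, the end of line.
            while pos < len(text) and text[pos] not in "\r\n":
                pos += 1
            out.append(" ")
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)

print(strip_comments("abc/**/def").split())         # two separate words
print(strip_comments("/* a /* b */ c */").split())  # the inner /* is ignored
```

Replacing a comment with a space rather than deleting it is what keeps abc and def apart as two tokens.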

Identifiers

Identifiers start with a letter or '_', and are followed by any number of letters, '_' or digits. Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with '__' are reserved.
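
The identifier rule maps naturally onto a regular expression. The Python sketch below assumes that "letter" means the ASCII letters a-z and A-Z, which the text does not state explicitly; the function names are hypothetical.

```python
import re

# Assumption: "letter" means ASCII a-z and A-Z.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_identifier(s):
    """True if the whole string has the shape of an identifier."""
    return IDENTIFIER.fullmatch(s) is not None

def is_reserved(s):
    """True for identifiers that start with two underscores."""
    return is_identifier(s) and s.startswith("__")

print(is_identifier("foo_1"), is_identifier("1foo"))
print(is_reserved("__line"), is_reserved("_line"))
```

Keywords have the same shape as identifiers, so a scanner built this way would still need to check each match against the keyword list.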

String Literals

A string literal is either a double quoted string, a single quoted string, or an escape string.

Single quoted strings are enclosed by ''. All characters between the '' are part of the string; there are no escape sequences inside '':

	'hello'
	'c:\root\foo.exe'
	'ab\n'			string is 4 characters, 'a', 'b', '\', 'n'
	
Double quoted strings are enclosed by "". Escape sequences can be embedded into them with the typical \ notation.
	"hello"
	"c:\\root\\foo.exe"
	"ab\n"			string is 3 characters, 'a', 'b', and a linefeed
	
Escape strings start with a \ and form an escape character sequence. Adjacent escape strings are concatenated:
	\n			the linefeed character
	\t			the tab character
	\"			the double quote character
	\0123			octal
	\x1A			hex
	\u1234			wchar character
	\r\n			carriage return, line feed
	
Adjacent strings are concatenated with the ~ operator, or by simple juxtaposition:
	"hello " ~ "world" ~ \n	// forms the string 'h','e','l','l','o',' ','w','o','r','l','d',linefeed
	
The following are all equivalent:
	"ab" "c"
	'ab' 'c'
	'a' "bc"
	"a" ~ "b" ~ "c"
	\x61"bc"
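
The three string forms can be compared with a small Python decoder. This is a sketch under assumptions: the names decode_escapes, decode_literal and concat are hypothetical, and only the escapes \n, \t, \r, \", \\, \x with two hex digits, and \u with four hex digits are handled. Octal escapes are left out because the text does not say how many digits they take.

```python
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\"}

def decode_escapes(body):
    """Decode the supported backslash escapes in body."""
    out = []
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch != "\\":
            out.append(ch)
            pos += 1
            continue
        if pos + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[pos + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            pos += 2
        elif kind in "xu":
            width = 2 if kind == "x" else 4  # assumed digit counts
            digits = body[pos + 2 : pos + 2 + width]
            if len(digits) != width:
                raise ValueError("truncated escape")
            out.append(chr(int(digits, 16)))
            pos += 2 + width
        else:
            raise ValueError(f"unsupported escape \\{kind}")
    return "".join(out)

def decode_literal(token):
    """Return the string value of one literal token."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return token[1:-1]                  # no escapes inside ''
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return decode_escapes(token[1:-1])  # escapes allowed inside ""
    if token.startswith("\\"):
        return decode_escapes(token)        # escape string such as \r\n
    raise ValueError("not a string literal")

def concat(*tokens):
    """Concatenate adjacent literals, as juxtaposition does."""
    return "".join(decode_literal(t) for t in tokens)

print(len(decode_literal(r"'ab\n'")), len(decode_literal(r'"ab\n"')))
print(concat('"ab"', '"c"') == concat("'a'", '"bc"') == concat(r"\x61", '"bc"'))
```

The Python raw strings (r"...") are used only so that the backslashes reach the decoder unchanged.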
	

Integer Literals

Integers can be specified in decimal, binary, octal, or hexadecimal.

Decimal integers are a sequence of decimal digits.

Binary integers are a sequence of binary digits preceded by a '0b'.

Octal integers are a sequence of octal digits preceded by a '0'.

Hexadecimal integers are a sequence of hexadecimal digits preceded by a '0x' or followed by an 'h'.

Integers can be immediately followed by one 'l' or one 'u' or both.

The type of the integer is resolved as follows, where "the last representable of" a list means the last type in that list that can represent the value:

1) If it is decimal, it is the last representable of ulong, long, or int.
2) If it is not decimal, it is the last representable of ulong, long, uint, or int.
3) If it has the 'u' suffix, it is the last representable of ulong or uint.
4) If it has the 'l' suffix, it is the last representable of ulong or long.
5) If it has the 'u' and 'l' suffixes, it is ulong.
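
The resolution rules above can be sketched in Python. Several points here are assumptions rather than statements from the text: int and uint are taken to be 32 bits and long and ulong 64 bits, "last representable" is read as the last type in each list that can hold the value, and the suffix rules are taken to override the two base rules. The name literal_type is hypothetical.

```python
# Assumed maximum values: 32-bit int/uint, 64-bit long/ulong.
MAX = {
    "int": 2**31 - 1,
    "uint": 2**32 - 1,
    "long": 2**63 - 1,
    "ulong": 2**64 - 1,
}

def literal_type(value, decimal=True, u=False, l=False):
    """Return the type name for a non-negative integer literal."""
    # Each list is the rule's type list reversed, so the first fit below
    # is the last representable type in the order the rule gives.
    if u and l:
        candidates = ["ulong"]
    elif u:
        candidates = ["uint", "ulong"]
    elif l:
        candidates = ["long", "ulong"]
    elif decimal:
        candidates = ["int", "long", "ulong"]
    else:
        candidates = ["int", "uint", "long", "ulong"]
    for name in candidates:
        if value <= MAX[name]:
            return name
    raise OverflowError("literal does not fit in ulong")

print(literal_type(2147483648))                 # decimal, too big for int
print(literal_type(0x80000000, decimal=False))  # same value written in hex
```

Under this reading, the same value can get a different type depending on whether it is written in decimal, because uint is a candidate only for non-decimal literals.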

Floating Literals

Floats can be in decimal or hexadecimal format. Decimal floats work just as they do in C.

Hexadecimal floats are preceded by '0x', and the exponent is a 'p' or 'P' followed by a power of 2.

Floats can be followed by one 'f', 'F', 'l' or 'L' suffix. The 'f' or 'F' suffix means the literal is a float, and 'l' or 'L' means it is an extended.

If a floating literal is followed by 'i', it is of imaginary type.

Examples:

	0x1.FFFFFFFFFFFFFp1023		// double.max
	0x1p-52				// double.epsilon
	1.175494351e-38F		// float.min
	6.3i				// imaginary 6.3
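
The first three examples can be checked numerically. The Python sketch below is not D code: it relies on Python's float.fromhex, which accepts the same 0x...p notation, and assumes Python floats are IEEE 754 doubles, as they are on mainstream platforms. The helper to_float32 is a hypothetical name.

```python
import struct
import sys

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 0x1.FFFFFFFFFFFFFp1023 is (2 - 2**-52) * 2**1023, the largest double.
double_max = float.fromhex("0x1.FFFFFFFFFFFFFp1023")

# 0x1p-52 is 2**-52, the gap between 1.0 and the next double.
double_epsilon = float.fromhex("0x1p-52")

# 1.175494351e-38F rounds to 2**-126, the smallest normalized float.
float_min = to_float32(1.175494351e-38)

print(double_max == sys.float_info.max)
print(double_epsilon == sys.float_info.epsilon)
print(float_min == 2.0 ** -126)
```

The number after 'p' is a decimal exponent applied to base 2, which is why hexadecimal notation can express these limits exactly.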
	
Complex literals are not tokens, but are assembled from real and imaginary expressions in the semantic analysis:
	4.5 + 6.2i		// complex number
	

Keywords

Keywords are reserved identifiers.
	Keyword:
		this
		super
		assert
		null
		true
		false
		cast
		new
		delete
		throw
		void
		byte
		ubyte
		short
		ushort
		int
		uint
		long
		ulong
		float
		double
		extended
		bit
		char
		ascii
		wchar
		imaginary
		complex
		if
		else
		while
		for
		do
		switch
		case
		default
		break
		continue
		synchronized
		return
		goto
		try
		catch
		finally
		with
		struct
		class
		interface
		union
		enum
		import
		static
		final
		const
		typedef
		typealias
		override
		abstract
		volatile
		debug
		deprecated
		in
		out
		inout
		align
		extern
		private
		protected
		public
		export
		body
		invariant
	

Copyright (c) 1999-2001 by Digital Mars, All Rights Reserved