A compiler is a program that takes as input a program written in a source language and produces as output an equivalent program written in another (target) language. Usually the input program is in a high-level language such as C++ and the output is in assembly language for the target machine (see assembler).
Compilers are useful because programming directly in low-level machine instructions (as had to be done with the first computers) is tedious and prone to errors. Use of assembly language helps somewhat by allowing substitution of symbols (variable names) for memory locations and the use of mnemonic names for operations (such as “add”). Compilation of a high-level language begins with lexical analysis (scanning), in which the stream of source characters is broken into tokens. A token may be an identifier (such as a variable name), an operator or punctuation symbol, a literal value, or a keyword (reserved word) such as if or while. Next, the tokens are parsed, or grouped according to the rules of the language. The result of parsing is a “parse tree” that resolves statements into their component parts. For example, an assignment statement may be parsed into an identifier, an assignment operator (such as =), and a value to be assigned. The value in turn may be an arithmetic expression that consists of operators and operands.
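The scanning step can be sketched with a minimal tokenizer. The token categories and regular expressions here are illustrative assumptions, not taken from any particular compiler:

```python
import re

# Illustrative token categories; real scanners recognize many more.
KEYWORDS = {"if", "while"}
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[=+\-*/();]))")

def tokenize(source):
    """Break a source string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.group("num"):
            tokens.append(("NUMBER", m.group("num")))
        elif m.group("name"):
            kind = "KEYWORD" if m.group("name") in KEYWORDS else "IDENT"
            tokens.append((kind, m.group("name")))
        elif m.group("op"):
            tokens.append(("OP", m.group("op")))
    return tokens
```

For example, `tokenize("x = y + 2")` classifies `x` and `y` as identifiers, `=` and `+` as operators, and `2` as a number, while `while` would be recognized as a reserved word.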
Parsing can be done either “bottom up” (finding the individual components of the statement and then linking them together) or “top down” (identifying the type of statement and then breaking it down into its component parts). A set of grammatical rules specifies how each construct (such as an arithmetic expression) can be broken into (or built up from) its component parts.
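A common top-down technique is recursive descent, where each grammar rule becomes a function. The sketch below parses arithmetic expressions using a standard textbook grammar (expr → term (('+'|'-') term)*, and so on); the tuple-based tree shape is an assumption for illustration:

```python
def parse_expr(tokens, pos=0):
    """expr -> term (('+'|'-') term)*  — returns (tree, next_pos)."""
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        left = (op, left, right)          # build the tree left to right
    return left, pos

def parse_term(tokens, pos):
    """term -> factor (('*'|'/') factor)*"""
    left, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        left = (op, left, right)
    return left, pos

def parse_factor(tokens, pos):
    """factor -> NUMBER | IDENT | '(' expr ')'"""
    if tokens[pos] == "(":
        tree, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ")", "expected ')'"
        return tree, pos + 1
    return tokens[pos], pos + 1           # a leaf: number or identifier
```

Because `term` sits below `expr` in the grammar, multiplication binds more tightly than addition: `2 + 3 * x` parses as `("+", "2", ("*", "3", "x"))`.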
The next step is semantic analysis. During this phase the parsed statements are analyzed further to make sure they don’t violate language rules. For example, most languages require that variables be declared before they are referenced by the program. Many languages also have rules for which data types may be converted to other types when the two types are used in the same operation.
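The declared-before-use rule can be checked by a single pass over the program's statements. This sketch assumes a simplified statement form invented for illustration, with `("declare", name)` and `("assign", name, expr)` entries:

```python
def check_declared(statements):
    """Flag any variable that is used or assigned before being declared.
    An expr is a number string, a variable name, or an (op, left, right) tuple."""
    declared, errors = set(), []

    def variables_in(expr):
        if isinstance(expr, tuple):            # (op, left, right)
            yield from variables_in(expr[1])
            yield from variables_in(expr[2])
        elif isinstance(expr, str) and not expr.isdigit():
            yield expr                          # a variable reference

    for stmt in statements:
        if stmt[0] == "declare":
            declared.add(stmt[1])
        elif stmt[0] == "assign":
            name, expr = stmt[1], stmt[2]
            for var in variables_in(expr):
                if var not in declared:
                    errors.append(f"{var} used before declaration")
            if name not in declared:
                errors.append(f"{name} assigned before declaration")
    return errors
```

A real semantic analyzer would track scopes and types in a symbol table rather than a flat set, but the principle is the same.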
The result of front-end processing is an intermediate representation somewhere between the source statements and machine-level statements. The intermediate representation is then passed to the back end.
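One widely used intermediate representation is “three-address code,” in which each instruction applies at most one operator and stores its result in a temporary. A sketch of lowering an expression tree into that form (the tree shape and temporary naming are assumptions for illustration):

```python
def lower(tree):
    """Flatten an (op, left, right) expression tree into three-address code."""
    code = []
    temp_count = 0

    def emit(node):
        nonlocal temp_count
        if not isinstance(node, tuple):       # leaf: a variable or constant
            return node
        left = emit(node[1])
        right = emit(node[2])
        temp_count += 1
        temp = f"t{temp_count}"
        code.append(f"{temp} = {left} {node[0]} {right}")
        return temp

    result = emit(tree)
    return code, result
```

For example, the tree for `a + b * c` lowers to the two instructions `t1 = b * c` and `t2 = a + t1`, a form that the back end can translate almost directly into machine instructions.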
Code Generation and Optimization
The process of code generation usually involves multiple passes that gradually substitute machine-specific code and data for the information in the parse tree. An important consideration in modern compilers is optimization, which is the process of substituting equivalent (but more efficient) constructs for the original output of the front end. For example, an optimizer can replace an arithmetic expression with its value so that it need not be repeatedly calculated while the program is running. It can also “hoist out” an invariant expression from a loop so that it is calculated only once before the loop begins. On a larger scale, optimization can also improve the communication between different parts (procedures) of the program.
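Replacing an expression with its value is known as constant folding, and it can be sketched as a recursive rewrite over the expression tree (the tree representation is the same illustrative assumption as above):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(tree):
    """Constant folding: evaluate (op, left, right) at compile time
    whenever both operands are (or fold down to) integer constants."""
    if not isinstance(tree, tuple):
        return tree
    op = tree[0]
    left, right = fold(tree[1]), fold(tree[2])
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)       # computed once, at compile time
    return (op, left, right)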
The compiler must attempt to “prove” that the change it is making in the program will never cause the program to operate incorrectly. It can do this, for example, by tracing the possible paths of execution through the program (such as through branching and loops) and verifying that each possible path yields the correct result. A compiler that is too “aggressive” in making assumptions can produce subtle program errors. (Many compilers allow the user to control the level of optimization, and whether to optimize for speed or for compactness of program size.) During development, a compiler is often set to include special debugging code in the output. This code preserves potentially important information that can help the debugging facility better identify program bugs. After the program is working correctly, it will be recompiled without the debugging code.
The final code generation is usually accomplished by using templates that match each intermediate construction with a construction in the target (usually assembly) language, plugging items in as specified by the template. Often a final step, called peephole optimization, examines the assembly code and removes redundancies or, where possible, replaces a memory reference with a faster machine register.
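A peephole optimizer slides a small window over the emitted instructions and rewrites local patterns. One classic pattern is a load that immediately follows a store of the same register to the same location; the instruction syntax below is invented for illustration:

```python
def peephole(instrs):
    """Drop a 'load R, loc' that immediately follows 'store R, loc'
    (the value is already in the register, so the load is redundant)."""
    out = []
    for instr in instrs:
        if (out and instr.startswith("load ")
                and out[-1].startswith("store ")
                and instr.split(None, 1)[1] == out[-1].split(None, 1)[1]):
            continue                      # redundant load: skip it
        out.append(instr)
    return out
```

Production peephole optimizers recognize dozens of such patterns (redundant moves, strength reductions, jump-to-jump chains), but each is just a local textual or structural rewrite like this one.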
In most applications the assembly code produced by the compiler is linked to code from other source files. For example, in a C++ application, class definitions and the code that uses objects of those classes may be compiled separately. Also, most languages (such as C and C++) have operating-system-specific libraries that contain commonly used support functions.
As an alternative to bringing the external code into the final application file, code can be “dynamically linked” to libraries that will be accessed only while the program is being run. This eliminates the waste that would occur if several running applications each carried its own copy of the same standard library code (see library, program).
On mainframes, compilers were usually invoked as part of a batch job using some form of JCL (job control language). With operating systems such as UNIX and MS-DOS, a program called make is typically used with a file that specifies the compiler, linker, and other options to be used to compile the program. Modern visually oriented development environments (such as those provided by products such as Visual C++) allow options to be set via menus or simply by selecting from a variety of typical configurations.
Compiler design has become a highly complex field. A modern compiler is developed using a variety of tools (including packaged parsers and lexical analyzers) and involves a large team of programmers. Nevertheless, the principles of compiler design are emphasized in the general computer science curriculum because a student who understands even a simplified compiler in detail has become acquainted both with important ideas (such as language grammar, parsing, and optimization) and with computer architecture at many levels.