
Automated Source Code Transformations on Fourth Generation Languages

Johannes Martin, Johannes Gutenberg-Universität Mainz, Mainz, Germany, jmartin2004@notamusica.com

Abstract
To control the operation of large application suites or to tailor a special purpose application to a particular need, developers frequently use application specific languages, such as batch, scripting, and query languages. These languages, which are also referred to as Fourth Generation Languages (4GLs), therefore play an important role in today's economy. Incompatibilities between different versions of 4GLs and changing requirements may make massive changes to a company's library of 4GL programs necessary. In this paper, we explore possibilities for performing mass changes on 4GLs and show how the transformation of programs written in 4GLs compares to the transformation of mainstream programming languages.

1. Introduction
Fourth generation programming languages are application specific programming languages that have a certain level of knowledge of the application's domain. This knowledge is expressed by the language offering special purpose language features to deal with the application domain. For example, SQL offers language constructs to query database tables, the Bourne shell supplies various means of controlling program execution and IO redirection, and JavaScript allows the programmer to access the properties of a web page easily.

Because of their good support for special purpose applications, functional prototypes and preliminary deployment versions of specialised applications can often be implemented inexpensively using 4GLs. When these applications evolve and also need to perform general purpose tasks, deficiencies, limitations, and performance problems of the programming languages used can become apparent. Changes in syntax and semantics of the programming languages that often occur without formal notice in new releases of these languages may complicate matters. Changing business requirements or standardisation efforts within a company can require a program to be rewritten in a different programming language. Thus, a migration of the subject system from the 4GL to a mainstream programming language or to another, more suitable 4GL may appear desirable.

But while mainstream programming languages are relatively well supported by common reengineering tools, fourth generation languages rarely enjoy that much support. This is due to the large number of different 4GLs and of dialects of each 4GL. Custom tools need to be built to deal with the 4GLs in question. Where powerful and extensible tools are usually preferred to deal with mainstream programming languages, tools that offer the best support for grammar specification and modification seem most suitable for 4GL conversion tasks.

In this paper, we explore a specific migration task in the domain of statistical applications. A number of complex statistical algorithms had been developed at our department using the MATLAB [9] language for technical computing. Cooperation with other research institutes required these algorithms to be made available for the R-Project [4]. As we wanted to be able to keep maintaining the source code in the original MATLAB version, we developed a source code transformation tool that automates the migration of the algorithms. We report on the specific problems and techniques related to this migration in the following sections of this paper.

2. Matlab/Octave
MATLAB is an environment and programming language for performing mathematical computations. Its basic data types are vectors and matrices, and so it is very well suited for linear algebra and numerical analysis. For these purposes, MATLAB provides a wide range of algorithms and plotting routines. The MATLAB programming language is an interpreted language. Its syntax and control structures are similar to those of the C programming language. Unlike in C, variables do not need to be declared prior to their use, and they can even change type during their lifetime. MATLAB comes with an extensive utility library that has been modelled after the standard C library, in particular where IO is concerned.

Octave [3] is an open source project with the aim of providing a free alternative to MATLAB. Its programming language and libraries are mostly compatible with MATLAB. The algorithms we studied had originally been programmed for MATLAB and then ported to Octave.

Proceedings of the Eighth European Conference on Software Maintenance and Reengineering (CSMR04) 1534-5351/04 $ 20.00 2004 IEEE

We evaluated a few parser and source transformation toolkits and decided to use TXL as it seemed to be the most suitable for our purpose. The following paragraphs describe some of our experiences in transforming our algorithms from Octave to R.

5.1. Grammar Specifications for Octave and R


TXL expects grammar specifications in a BNF-like notation. They are pretty easy to devise since BNF notation is the usual means to specify the syntax of a programming language. In the case of Octave and R, we did not succeed in finding such a specification (or any formal grammar specification), but with the help of language tutorials and manuals we were able to come up with grammars that correctly describe at least our algorithms and some other sample programs we tested our grammars on.

Specifying a grammar for TXL is not quite as easy as writing down its BNF in all cases, though. A binary expression, for example, would usually be described in BNF using a recursive definition as follows:

    <expression> := <basic expression> | <binary expression>
    <binary expression> := <expression> <operator> <expression>

A parser generator such as Yacc would then use an operator precedence table to parse complex expressions according to the precedence rules. TXL cannot handle indirect left-recursive definitions as in the example above, and there is no direct way to tell it about operator precedence. In our Octave and R grammars, we therefore had to specify binary expressions differently:

    <expression> := <binary expression>
    <binary expression> := <basic expression> <binary operation>*
    <binary operation> := <operator> <basic expression>

As this definition does not involve left-recursion, TXL is able to handle it correctly, though it still disregards operator precedence. In a language conversion where source and target language have the same operator precedence rules, it is often not necessary for these precedence rules to be considered during the transformation phase. The precedence rules of Octave and R differ slightly, however, so obeying operator precedence is strictly required in our case. Rather than trying to improve the TXL grammar specification, we decided to implement some transformation rules to deal with operator precedence during the source code transformation phase.
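The idea of parsing a flat operator chain first and imposing precedence afterwards can be illustrated outside TXL. The following Python sketch is not the TXL rule set used in the project. It assumes a hypothetical precedence table (PRECEDENCE, RIGHT_ASSOC) and hypothetical helper names (parse_flat, restructure), and it uses precedence climbing as a stand-in technique for the restructuring step.

```python
# Hypothetical precedence table (higher binds tighter); the values are
# illustrative and are not taken from the Octave or R language definitions.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}


def parse_flat(tokens):
    """Mimic <basic expression> <binary operation>*: return the first
    operand and a flat list of (operator, operand) pairs."""
    first, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating operators and operands")
    pairs = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return first, pairs


def restructure(first, pairs, min_prec=1):
    """Precedence climbing over the flat chain. Consumes `pairs` in place
    and returns a nested (operator, left, right) tuple."""
    left = first
    while pairs and PRECEDENCE[pairs[0][0]] >= min_prec:
        op, right = pairs.pop(0)
        # Left-associative operators require a strictly tighter operator
        # on the right-hand side; right-associative ones allow equality.
        next_min = PRECEDENCE[op] + (0 if op in RIGHT_ASSOC else 1)
        right = restructure(right, pairs, next_min)
        left = (op, left, right)
    return left


tree = restructure(*parse_flat("a + b * c".split()))
print(tree)
```

Keeping the precedence table as data means the same restructuring code could be run once with the source language's table and once with the target language's table, which is the property the conversion relies on.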

3. The R-Project
The R-Project for Statistical Computing provides a programming language specifically designed for solving statistical problems. Like Octave's programming language, the R programming language is interpreted and provides support for vector and matrix data types. Variables do not need to be declared and can change types. The syntax and control structures of the R programming language are modelled according to the functional programming paradigm. While R is similar to Octave in its mathematical libraries, its IO and utility libraries are quite different from those of Octave. We will discuss the differences between Octave and R in more detail in Section 5.

4. TXL
TXL is a programming language designed to support source code analysis and transformation tasks [1]. To perform a conversion, one has to describe the programming languages involved and a set of code transformation rules. The description of the programming languages is provided through an unrestricted ambiguous context-free grammar using a BNF-like notation. Transformations are specified by example, using a set of context-sensitive structural transformation rules. Using these descriptions, TXL generates a parser that is able to resolve ambiguities and left-recursion and a strategy to apply the code transformation rules. TXL was used very successfully in Y2K conversion projects for programming languages such as COBOL, PL/I, and RPG.

5. Octave to R Conversion
The goal in the conversion of our Octave algorithms to R was to make it possible for scientists using R to use these algorithms for their applications. As the researchers at our institute are more familiar with Octave, the Octave versions of the algorithms will be kept for active development of new functionality. Newly developed features of the Octave algorithms should become available for R quickly. For this reason, a manual conversion of the algorithms was deemed infeasible, even though the algorithms consisted of only a few thousand lines of code.


An annoying problem was caused by the ambiguous use of the single quote in Octave: it is both a string delimiter and an operator, depending on the context. Since TXL expects a context-free grammar, it is very difficult to specify a grammar that will correctly resolve this ambiguity. We solved the problem by using context-sensitive regular expressions in a Perl script to replace single quotes by double quotes where they were used to delimit strings.

There are a few more ambiguities in the Octave grammar. Eaton provides a good overview and shows how they are usually resolved [3]. TXL tries to resolve ambiguities in a grammar, but the decisions it makes are not always correct since there is no way to instruct the TXL generated parser how to resolve certain ambiguities. We use TXL transformation rules to transform incorrectly resolved ambiguities during the conversion step.
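A context-sensitive preprocessing step of this kind might look as follows. This is a minimal Python sketch, not the Perl script used in the project. It assumes a simple heuristic: a single quote directly after an identifier character, a digit, a closing bracket, a period, or another quote is a transpose operator, and any other single quote opens a string. The names normalise_quotes and follows_operand are hypothetical, and comments and command-syntax calls are not handled.

```python
# Characters after which a single quote is assumed to be a transpose
# operator (in addition to letters and digits). This is an assumption.
TRANSPOSE_PREDECESSORS = set("_)]}.'")


def follows_operand(prev):
    """True if a quote directly after `prev` is taken to be a transpose."""
    return prev.isalnum() or prev in TRANSPOSE_PREDECESSORS


def normalise_quotes(line):
    """Rewrite single-quoted string literals in one line of Octave code as
    double-quoted ones, leaving transpose operators untouched."""
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        prev = line[i - 1] if i > 0 else ""
        if ch == '"':
            # Existing double-quoted string: copy it verbatim.
            j = i + 1
            while j < len(line) and line[j] != '"':
                j += 2 if line[j] == "\\" else 1
            out.append(line[i:j + 1])
            i = j + 1
        elif ch == "'" and not follows_operand(prev):
            # String delimiter: scan to the closing quote ('' is an
            # escaped quote inside a single-quoted string).
            body = []
            j = i + 1
            closed = False
            while j < len(line):
                if line[j] == "'":
                    if j + 1 < len(line) and line[j + 1] == "'":
                        body.append("'")
                        j += 2
                        continue
                    closed = True
                    break
                # Escape characters that are special inside double quotes.
                body.append("\\" + line[j] if line[j] in '"\\' else line[j])
                j += 1
            if not closed:
                out.append(line[i:])  # unterminated: leave the rest as is
                break
            out.append('"' + "".join(body) + '"')
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(normalise_quotes("y = [x', 'abc'];"))
```

Running such a step before parsing lets the grammar treat the single quote purely as an operator, at the cost of a heuristic that can be fooled by unusual spacing.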

5.2. Grammar Combination

TXL was originally created for applying source transformations with source and target language being identical or at least similar. In order to perform a translation between different programming languages, one needs to create a grammar specification that represents both languages. In the simplest case, this can be done by adding an additional syntactical element to the existing grammar specifications:

    <any program> := <octave program> | <r program>

The parser generated by TXL will try to parse <any program> using the first alternative, i.e. as an <octave program>, which is what is intended for the source code transformation. TXL transformation rules can then be used to transform the <octave program> into an <r program>.

Combining the lexical aspects of the grammars was more difficult, because they were partly contradictory. For example, percent signs indicate comments in Octave but operators in R. Keywords in Octave can be used as IDs in R and vice versa. Unfortunately, TXL does not offer help in resolving the ambiguities introduced by combining the contradictory grammars. In the case of Octave comments, we use an external preprocessor to remove all the comments from the Octave code before feeding it to TXL with a modified Octave grammar that disregards percent signs.

5.3. Source Transformation

5.3.1. Overall Strategy

The transformation tasks required to perform the Octave to R source code conversion can be broken down as follows:

- syntactical transformation,
- replacement of Octave function calls with R equivalents,
- semantically correct transformation of expressions.

The syntactical transformation consisted mostly of the conversion of control structures. In this respect, Octave and R differ only in the notation, so the implementation of the necessary conversion rules was straightforward. The replacement of function calls was similarly easy: we defined a TXL table of Octave functions and their R equivalents and used it in a simple conversion rule.

5.3.2. Transformation of Expressions

The transformation of expressions was more difficult. Octave and R have a different set of operators. In fact, some Octave operators do not have a functionally equivalent operator in R, so they have to be replaced by calls to functions emulating these operators. Also, some of the operators have different precedence rules in Octave and R. A semantically correct transformation of expressions might therefore involve the insertion of parentheses around subexpressions. For example, the negation operator in Octave has higher precedence than relational operators, while in R it has lower precedence. So an Octave expression that compares a negated operand with another value needs explicit parentheses around the negation in its R equivalent.

As explained earlier in this paper, the parser we implemented using TXL does not take operator precedence into account. But knowledge of operator precedence is mandatory to achieve a semantically correct conversion of expressions. We found the following strategy most suited to solve these conversion problems:

1. convert all unary and binary expressions to special nested function call expressions using an intermediate notation,
2. reorder the nested function calls taking into account Octave operator precedence,
3. remove unnecessary parentheses,
4. perform necessary operator replacements,
5. insert parentheses where needed taking into account R operator precedence,
6. convert the nested function call expressions back to unary and binary expressions.

An example conversion is shown in Figure 1.

    Original expression:                       a + b * (c + d')
    Converted to function call notation:       "*"("+"(a,b),("+"(c,"'"(d))))
    Reordered according to precedence rules:   "+"(a,"*"(b,("+"(c,"'"(d)))))
    Parentheses removed:                       "+"(a,"*"(b,"+"(c,"'"(d))))
    Operators replaced:                        "+"(a,mult(b,"+"(c,t(d))))
    Parentheses inserted (none):               "+"(a,mult(b,"+"(c,t(d))))
    Converted to operator notation:            a + mult(b, c + t(d))

Figure 1. Expression conversion example

Octave's multiplicative operator performs scalar multiplication on scalars and matrix multiplication on matrices. In contrast, R's multiplicative operator performs scalar multiplication on scalars and element-wise multiplication on matrices. As the types of the operands are known only at runtime, the multiplicative operator has to be replaced by an emulation function that checks the actual types of its operands and uses the correct operator. Unlike Octave, R does not have a transpose operator (') but a transpose function (t()). Octave also has a multiplicative operator (.*) that performs element-wise multiplication, and that one has to be converted to R's regular multiplicative operator (not shown in the example).

5.3.3. Type Inference

Octave's grammar is partly ambiguous. For example, the expression x(1) may denote a call to function x() with parameter 1, the first character of string x, or the first element of vector x. The equivalent R expression would be x(1) for the function call, substring(x, 1, 1) for the substring, and x[1] for the vector subscript. Since neither functions nor variables need to be declared in the source file they are used in, the parser cannot determine the actual type of the symbol x.

We used a simple heuristic to infer this information from the context of the expression. If x was assigned a value previously in the source file, we assume x is a variable, otherwise we assume it is a function. We partly evaluate expressions to determine whether a variable has a character or numeric type. If we cannot deduce the type of a variable with certainty, we assume it is of numeric type, since numeric types are used most often in the application domain of Octave. Though it is quite easy to construct examples that break this heuristic, it succeeded for all our subject systems.

5.3.4. Matrix Conversion

In Octave, matrices can grow during their lifetime. If, at any point in a program, a value is assigned to a previously non-existent element of a matrix, a number of rows and/or columns is added to the matrix to accommodate this element. Matrices in R, however, cannot change size once they have been defined. We therefore had to emulate the behaviour of Octave matrices in R. Assignments to matrix elements in Octave are transformed to a call to an emulation function that performs bounds checking and grows the matrix if necessary.

When retrieving single rows or columns from a matrix, R silently drops dimensions in the result, i.e. retrieving the first row of a matrix yields a vector. In Octave, dimensions are not dropped, so retrieving the first row of a matrix yields another matrix with a single row. While this difference might look trivial, it becomes a problem when these results are used in matrix multiplications. Our converter therefore not only emulates matrix element assignments but also matrix element retrievals through a function call.
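The behaviour that such emulation functions have to reproduce can be sketched in a few lines. The following Python code is an illustration only and is not the R code generated by the converter. It assumes matrices stored as lists of rows, 1-based indices as in Octave, and zero padding on growth; the names assign_elem and get_row are hypothetical.

```python
def assign_elem(matrix, row, col, value, fill=0.0):
    """Return a copy of `matrix` with element (row, col) set to `value`.
    Indices are 1-based. If the element lies outside the current bounds,
    the copy is grown and the new cells are padded with `fill`."""
    if row < 1 or col < 1:
        raise ValueError("indices are 1-based and must be positive")
    n_rows = max(len(matrix), row)
    n_cols = max(len(matrix[0]) if matrix else 0, col)
    grown = [
        [matrix[r][c] if r < len(matrix) and c < len(matrix[r]) else fill
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
    grown[row - 1][col - 1] = value
    return grown


def get_row(matrix, row):
    """Return row `row` (1-based) as a 1 x n matrix rather than as a flat
    vector, so that later matrix multiplications see two dimensions."""
    return [list(matrix[row - 1])]


m = [[1, 2], [3, 4]]
m = assign_elem(m, 3, 4, 9)   # grows the matrix to 3 x 4
print(m)
print(get_row(m, 1))
```

Returning a new matrix instead of modifying the argument mirrors the value semantics of function arguments in R, where the caller has to assign the result back to the variable.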

6. Evaluation
6.1. Conversion
Our main goal in developing the Octave to R source code transformation tool was to be able to repeatedly transform medium volumes of Octave source code to R. As we intended to keep maintaining the Octave source code and use the R version only for loose integration into other R programs, the readability of the R source code was not a very high priority. Nevertheless, and despite the use of emulation functions, reviews of the generated source code show it to be quite readable. Because TXL did not output the generated code in the desired format, we used a few Perl scripts to adjust the layout of the code.

Quite naturally, correctness and performance of the generated code were top priority goals. As we were dealing with mathematical algorithms, we had a well defined set of inputs and expected outputs for our subject systems. It was therefore possible to test the generated code extensively and show that it performs the computations both correctly and in about the same time as the original code.

While computations were as fast in the R code as in the original code, low level file access was considerably slower. Understandably, our emulation of the Octave IO library using R source code was much slower than Octave's builtin functions. However, those low level file access routines were needed only in one of the generated functions, and we were able to replace that entire function by a fast builtin R routine.

Our original Octave source code had been integrated into a web service using a few PHP scripts. Octave is very well suited for integration into web services and a Unix environment in general. It can even be installed as a shell to execute Octave programs like any other shell script. As it starts almost instantaneously, it was no problem to start Octave on demand whenever needed. Since R allows more flexible customisation of the layout of graphs, we decided to replace the Octave web service by an R version. Unfortunately, R does not integrate into the Unix environment as well as Octave. When running as a shell, it cannot pass parameters to scripts under its control, and it has a rather long startup time (several seconds) that makes interactive web pages difficult. We therefore wrote an extension to R to run as a server and accept computation requests over a TCP/IP connection. In that configuration, the R solution performs as well as the original Octave solution.

We experimented with an alternative conversion strategy to evaluate its effect on readability and performance of the generated code: R provides some object oriented language features such as classes and operator overloading. We defined a custom matrix class and subscript operators to emulate the behaviour of Octave matrices in R transparently. This solution had several drawbacks. We needed to overload a lot of operators and functions to ensure that computations on our matrix classes yielded results that also belonged to those matrix classes. Within the overloaded operator functions, type casts had to be used to delegate the calls to the original R operators. This created quite an overhead, causing the generated programs to execute in double the time of the original program. Since variables do not need to be declared in R and so the type of any variable is not immediately apparent, R programmers who use the results of the converted algorithms might not be aware of the different semantics of the emulation matrices and therefore not expect their side effects. So despite the increased readability of the converted algorithms, the overall cognitive overhead of using these algorithms would increase. We therefore abandoned this approach.
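The class-based alternative, and the reason it requires so many overloads, can be illustrated with a small sketch. The following Python class is an analogy only and is not the R class that was implemented. The name OctaveMatrix, the list-of-rows storage, and the zero padding are assumptions made for this example.

```python
class OctaveMatrix:
    """Hypothetical matrix wrapper with Octave-like growth on assignment."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __getitem__(self, index):
        row, col = index  # 1-based (row, column) pair
        return self.rows[row - 1][col - 1]

    def __setitem__(self, index, value):
        row, col = index
        if row < 1 or col < 1:
            raise IndexError("indices are 1-based")
        width = max(col, len(self.rows[0]) if self.rows else 0)
        for r in self.rows:
            r.extend([0.0] * (width - len(r)))  # widen existing rows
        while len(self.rows) < row:
            self.rows.append([0.0] * width)     # append missing rows
        self.rows[row - 1][col - 1] = value

    def __add__(self, other):
        # Unwrap both operands, delegate to the plain element type, and
        # re-wrap so that the result is again an OctaveMatrix. Every
        # further operator and function needs the same treatment.
        if not isinstance(other, OctaveMatrix):
            return NotImplemented
        return OctaveMatrix(
            [[a + b for a, b in zip(ra, rb)]
             for ra, rb in zip(self.rows, other.rows)]
        )


m = OctaveMatrix([[1, 2]])
m[2, 3] = 5          # silently grows the matrix to 2 x 3
print(m.rows)
print((m + m).rows)
```

The assignment in the usage lines looks like an ordinary element update but changes the shape of the matrix, which is the kind of side effect that users of the converted algorithms might not expect.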

6.2. TXL

TXL was a valuable tool for building the conversion utility. From experience with previous projects, we do not expect that we would have been able to design a parser for the Octave language as quickly with any other parser construction environment as was possible with TXL: it took only about one man-month to complete the conversion tool. Some limitations of TXL quickly became apparent, though.

TXL is best suited for transformations within a single programming language, or between dialects of a language. As we saw in our project, it is difficult to combine contradictory grammars. TXL transformation rules are intended to transform one syntactical element into another syntactical element of the same type. For a cross language conversion, such a transformation usually does not make sense. The intended transformation from a source syntactical element to a target syntactical element has to be done in multiple steps that hide the real transformation logic. It would be helpful if TXL differentiated between source and target grammar and allowed transformations between source and target syntactical elements.

Already within our first project, and especially when we started other conversion projects using TXL, we noticed that we duplicated a lot of code, changing only tiny bits of transformation rules to make them suitable for other projects. Unfortunately, TXL lacks support for abstraction and generalisation of transformation rules, so there is no formal way to factor out and parameterise shared code. When the maintenance of the duplicated code became too cumbersome, we decided to use the GNU C preprocessor cpp to simplify it. Preprocessor defines are used to declare templates of partial rules, complete rules, or even sets of complete transformation rules that are instantiated in any project that needs them, possibly even multiple times. Since the preprocessor does not quite understand the syntax of TXL programs, we had to employ a few Perl scripts to modify the TXL code for preprocessing.

7. Related Work

A lot of research has been done on language conversion. In their paper The Realities of Language Conversions [13], Terekhov and Verhoef give an account of their experiences with language conversions. The examples they provide deal mostly with COBOL systems and generalise to many instances of language conversions. They argue that the difficulties of such conversions are often underestimated and manifold: too much emphasis is put on technology and tools that claim to aid in language conversion, and too little attention is paid to training of the personnel, which has a major impact on the success or failure of the migration project. Terekhov and Verhoef list several requirements that have to be met to achieve a successful source conversion. They also propose a coarse three step process for source conversion.

In The Migration Barbell [6], Malton notes that there are many ad-hoc techniques for source conversion, but few systematic approaches. He formalises the process proposed by Terekhov and Verhoef. Malton also defines a set of goals for source conversion and identifies three distinct conversion tasks: dialect conversion, API migration, and language migration. His observations are founded on dialect conversions in the COBOL, PL/I, and RPG domains, and pilot studies in source conversion from COBOL to Java, RPG to COBOL, and SQL to SQLj.

There are a number of papers describing experiences and lessons learned from source conversion projects. Kontogiannis et al. [5] report on the conversion of the IBM compiler back-end from a PL/I derivative to C++. Yasumatsu [15] describes a system for translating Smalltalk programs into a C environment. Terekhov [12] presents a case study on an automated language conversion project from a proprietary language to Visual Basic and COBOL. In previous papers at CSMR, we reported on our experiences converting C/C++ source code to Java [7, 8].

A few papers describe experiences converting MATLAB to other mainstream programming languages. De Rose and Padua report on a MATLAB to Fortran conversion [2]. Since Fortran requires variables to be defined before being used, they faced challenges in having to infer types of variables from expressions and describe their inference algorithms in detail in their paper. In our experience, the more lightweight heuristics we used in our conversion were sufficient for the transformation to R. Menon and Pingali propose to apply source transformations to MATLAB programs to achieve performance gains independently or in combination with a migration or compilation of the original program [10]. They use vectorisation, pre-allocation, and expression optimisation to increase the efficiency of MATLAB programs.

At the beginning of our conversion project, we conducted a short survey of available source code transformation systems and tools. Besides TXL, we also investigated ANTLR [11] and the ASF+SDF Meta-Environment [14]. ANTLR (ANother Tool for Language Recognition) is a language tool that provides a framework for constructing recognisers, compilers, and translators. As such, it would generally have been suitable for our translation purposes as well. The ASF+SDF Meta-Environment is a system that can be used to describe syntax and semantics of programming languages as well as analyses and transformations of programs written in such programming languages and was therefore another system we considered as a tool for our transformation project.

Compared with ANTLR and the ASF+SDF Meta-Environment, TXL convinced us through the quality and accessibility of its documentation and the power and simplicity of its grammar and rule notation language. At the time of our survey, familiarisation with the ANTLR or ASF+SDF environments seemed to require a comparatively high investment. This larger initial investment might have been offset by time savings due to more powerful features of these tools during the later development of the translator. It would be interesting to explore how ANTLR and ASF+SDF compare to TXL in the development of a source code translation tool.

8. Conclusions and Future Work

In this paper, we showed our experiences in performing source code transformations between the Octave and R fourth generation languages. We illustrated our approach to achieving a successful transformation and evaluated both the conversion results and the tools we used in the conversion process.

The source code conversion was successful: the generated code performs well and is readable and maintainable. Whenever a new version of our algorithms is released as Octave source code, we can provide the R implementation almost instantaneously.

So far, we have only attempted to transform Octave source code written by three unrelated programmers. It is possible that our conversion tool cannot handle programming styles other than those used by these programmers. To make it useful for the general public, we therefore need to test it on Octave programs written by other programmers and evaluate where it needs improvements. The FALCON test suite might be a good candidate, as it covers a wide range of MATLAB functionality [2].

In a current project, we study the conversion of PHP programs to JSP. This project shows a whole new set of challenges. As Java is a compiled language and very strict about type checking, a lot of effort has to be directed towards type inference algorithms to determine the types used in programs written in the dynamically typed and interpreted PHP language. Another aspect to be dealt with is the volatility of the source language: PHP is still in active development and changes frequently.

References

[1] J. R. Cordy, T. R. Dean, A. J. Malton, and K. A. Schneider. Software Engineering by Source Transformation - Experience with TXL. In Proceedings of the 1st International IEEE Workshop on Source Code Analysis and Manipulation, pages 168-178, Florence, Italy, Nov. 2001.
[2] L. De Rose and D. A. Padua. A MATLAB to Fortran 90 translator and its effectiveness. In International Conference on Supercomputing, pages 309-316, 1996.
[3] J. W. Eaton. Octave: Past, present, and future. In Proceedings of the 2nd International Workshop on Distributed Statistical Computing, 2001.
[4] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.
[5] K. Kontogiannis, J. Martin, K. Wong, R. Gregory, H. A. Müller, and J. Mylopoulos. Code Migration Through Transformations: An Experience Report. In Proceedings of CASCON 98, pages 1-13, Toronto, ON, 1998.
[6] A. J. Malton. The Migration Barbell. First ASERC Workshop on Software Architecture, Aug. 2001. http://www.cs.ualberta.ca/kenw/conf/awsa2001/papers/malton.pdf, November 2001.
[7] J. Martin and H. A. Müller. Strategies for Migration from C to Java. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering, pages 200-209, Lisbon, Portugal, Mar. 2001.
[8] J. Martin and H. A. Müller. C to Java Migration Experiences. In Proceedings of the Sixth European Conference on Software Maintenance and Reengineering, pages 143-152, Budapest, Hungary, Mar. 2002.
[9] The MathWorks. MATLAB. http://www.mathworks.com/, October 2003.
[10] V. Menon and K. Pingali. A case for source-level transformations in MATLAB. In Domain-Specific Languages, pages 53-65, 1999.
[11] T. Parr and R. Quong. ANTLR: A predicated-LL(k) parser generator. Journal of Software Practice and Experience, 25(7):789-810, July 1995.
[12] A. A. Terekhov. Automating Language Conversion: A Case Study. In Proceedings of the International Conference on Software Maintenance (ICSM) 2001, pages 654-658, Florence, Italy, Nov. 2001.
[13] A. A. Terekhov and C. Verhoef. The Realities of Language Conversions. IEEE Software, pages 111-124, Nov. 2000.
[14] M. van den Brand, A. van Deursen, J. Heering, H. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. Olivier, J. Scheerder, J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: a component-based language development environment. In Proceedings of the 10th International Conference on Compiler Construction, pages 365-370. Springer-Verlag, Apr. 2001.
[15] K. Yasumatsu and N. Doi. SPiCE: A System for Translating Smalltalk Programs into a C Environment. IEEE Transactions on Software Engineering, 21(11):902-912, 1995.
