Rick Riolo, Trent McConaghy and Ekaterina Vladislavleva (Eds.) Genetic Programming Theory and Practice VIII


Genetic and Evolutionary Computation Series Editors John R. Koza Consulting Editor Medical Informatics Stanford University Stanford, CA 94305-5479 USA Email: [email protected]

For other titles published in this series, go to http://www.springer.com/series/6016

Rick Riolo • Trent McConaghy • Ekaterina Vladislavleva, Editors

Genetic Programming Theory and Practice VIII Foreword by Nic McPhee


Editors Dr. Rick Riolo University of Michigan Center for the Study of Complex Systems 323 West Hall Ann Arbor Michigan 48109 USA [email protected]

Dr. Ekaterina Vladislavleva University of Antwerp Dept. Mathematics & Computer Science Campus Middelheim G.103 2020 Antwerpen Belgium [email protected]

Dr. Trent McConaghy Solido Design Automation, Inc. 102-116 Research Drive S7N 3R3 Saskatoon Saskatchewan Canada [email protected]

ISSN 1566-7863 ISBN 978-1-4419-7746-5 e-ISBN 978-1-4419-7747-2 DOI 10.1007/978-1-4419-7747-2 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2010938320 © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Contents

Contributing Authors

Preface

Foreword

Genetic Programming Theory and Practice 2010: An Introduction
Trent McConaghy, Ekaterina Vladislavleva and Rick Riolo

1 FINCH: A System for Evolving Java (Bytecode)
Michael Orlov and Moshe Sipper

2 Towards Practical Autoconstructive Evolution: Self-Evolution of Problem-Solving Genetic Programming Systems
Lee Spector

3 The Rubik Cube and GP Temporal Sequence Learning: An Initial Study
Peter Lichodzijewski and Malcolm Heywood

4 Ensemble Classifiers: AdaBoost and Orthogonal Evolution of Teams
Terence Soule, Robert B. Heckendorn, Brian Dyre, and Roger Lew

5 Covariant Tarpeian Method for Bloat Control in Genetic Programming
Riccardo Poli

6 A Survey of Self Modifying Cartesian Genetic Programming
Simon Harding, Wolfgang Banzhaf and Julian F. Miller

7 Abstract Expression Grammar Symbolic Regression
Michael F. Korns

8 Age-Fitness Pareto Optimization
Michael Schmidt and Hod Lipson

9 Scalable Symbolic Regression by Continuous Evolution with Very Small Populations
Guido F. Smits, Ekaterina Vladislavleva and Mark E. Kotanchek

10 Symbolic Density Models of One-in-a-Billion Statistical Tails via Importance Sampling and Genetic Programming
Trent McConaghy

11 Genetic Programming Transforms in Linear Regression Situations
Flor Castillo, Arthur Kordon and Carlos Villa

12 Exploiting Expert Knowledge of Protein-Protein Interactions in a Computational Evolution System for Detecting Epistasis
Kristine A. Pattin, Joshua L. Payne, Douglas P. Hill, Thomas Caldwell, Jonathan M. Fisher, and Jason H. Moore

13 Composition of Music and Financial Strategies via Genetic Programming
Hitoshi Iba and Claus Aranha

14 Evolutionary Art Using Summed Multi-Objective Ranks
Steven Bergen and Brian J. Ross

Index

Contributing Authors

Claus Aranha is a graduate student at the Graduate School of Frontier Sciences in the University of Tokyo, Japan ([email protected]).

Wolfgang Banzhaf is a professor at the Department of Computer Science at Memorial University of Newfoundland, St. John’s, NL, Canada ([email protected]).

Steven Bergen is a graduate student in the Department of Computer Science, Brock University, St. Catharines, Ontario, Canada ([email protected]).

Tom Caldwell is a database developer and member of the Computational Genetics Laboratory at Dartmouth College ([email protected]).

Flor Castillo is a Lead Research Specialist in the Polyglycols, Surfactants, and Fluids group within Performance Products R&D organization of The Dow Chemical Company ([email protected]).

Brian Dyre is an Associate Professor of Experimental Psychology (Human Factors), a member of the Neuroscience Program, and the director of the Idaho Visual Performance Laboratory (IVPL) at the University of Idaho, USA ([email protected]).

Jonathan Fisher is a computer programmer and member of the Computational Genetics Laboratory at Dartmouth College ([email protected]).

Simon Harding is a postdoctoral research fellow at the Department of Computer Science at Memorial University of Newfoundland, St. John’s, NL, Canada ([email protected]).

Robert B. Heckendorn is an Associate Professor of Computer Science and a member of the Bioinformatics and Computational Biology Program at the University of Idaho, USA ([email protected]).

Malcolm Heywood is a Professor of Computer Science at Dalhousie University, Halifax, NS, Canada ([email protected]).

Douglas Hill is a computer programmer and member of the Computational Genetics Laboratory at Dartmouth College ([email protected]).


Hitoshi Iba is a professor of Computer Science at the Graduate School of Engineering in the University of Tokyo, Japan ([email protected]).

Arthur K. Kordon is a Data Mining and Modeling Leader in the Advanced Analytics Group within the Dow Business Services of The Dow Chemical Company ([email protected]).

Michael F. Korns is Chief Technology Officer at Freeman Investment Management, Henderson, Nevada, USA ([email protected]).

Mark E. Kotanchek is Chief Technology Officer of Evolved Analytics, a data modeling consulting and systems company, USA/China ([email protected]).

Roger Lew is a graduate student in the Neuroscience Program at the University of Idaho, USA ([email protected]).

Peter Lichodzijewski is a graduate student in the Faculty of Computer Science at Dalhousie University, Halifax, Nova Scotia, Canada ([email protected]).

Hod Lipson is an Associate Professor in the school of Mechanical and Aerospace Engineering and the school of Computing and Information Science at Cornell University, Ithaca, NY, USA ([email protected]).

Trent McConaghy is co-founder and Chief Scientific Officer of Solido Design Automation Inc., which makes variation-aware IC design software for top-tier semiconductor firms. He is based in Vancouver, Canada (trent [email protected]).

Julian F. Miller is a lecturer in the Department of Electronics at the University of York, UK ([email protected]).

Jason H. Moore is the Frank Lane Research Scholar in Computational Genetics and Associate Professor of Genetics at Dartmouth Medical School, USA ([email protected]).

Michael Orlov is a graduate student in Computer Science at Ben-Gurion University, Israel ([email protected]).

Kristine Pattin is a Molecular and Cellular Biology graduate student and member of the Computational Genetics Laboratory at Dartmouth College ([email protected]).


Joshua L. Payne is a postdoctoral research fellow in the computational genetics laboratory at Dartmouth College ([email protected]).

Riccardo Poli is a Professor of Computer Science in the School of Computer Science and Electronic Engineering at the University of Essex, UK ([email protected]).

Rick Riolo is Director of the Computer Lab and Associate Research Scientist in the Center for the Study of Complex Systems at the University of Michigan, USA ([email protected]).

Brian J. Ross is a Professor of Computer Science at Brock University, St. Catharines, ON, Canada ([email protected]).

Michael Schmidt is a graduate student in computational biology at Cornell University, Ithaca, NY, USA ([email protected]).

Moshe Sipper is a Professor of Computer Science at Ben-Gurion University, Israel ([email protected]).

Guido F. Smits is a Research and Development Leader in the New Products Group within the Core R&D Organization of the Dow Chemical Company, Belgium ([email protected]).

Terence Soule is an Associate Professor of Computer Science, a member of the Bioinformatics and Computational Biology Program, and Director of the Neuroscience Program at the University of Idaho, USA ([email protected]).

Lee Spector is a Professor of Computer Science in the School of Cognitive Science at Hampshire College, Amherst, MA, USA ([email protected]).

Carlos Villa is a Senior Research Specialist in Polyurethanes Process Research within Performance Products R&D organization of The Dow Chemical Company ([email protected]).

Ekaterina Vladislavleva is a Lecturer in the Department of Mathematics and Computer Science at the University of Antwerp, Belgium ([email protected]).

Preface

The work described in this book was first presented at the Eighth Workshop on Genetic Programming, Theory and Practice, organized by the Center for the Study of Complex Systems at the University of Michigan, Ann Arbor, May 20-22, 2010. The goal of this workshop series is to promote the exchange of research results and ideas between those who focus on Genetic Programming (GP) theory and those who focus on the application of GP to various real-world problems. In order to facilitate these interactions, the number of talks and participants was small and the time for discussion was large. Further, participants were asked to review each other’s chapters before the workshop. Those reviewer comments, as well as discussion at the workshop, are reflected in the chapters presented in this book. Additional information about the workshop, addendums to chapters, and a site for continuing discussions by participants and by others can be found at http://cscs.umich.edu/gptp-workshops/ . We thank all the workshop participants for making the workshop an exciting and productive three days. In particular we thank the authors, without whose hard work and creative talents neither the workshop nor the book would be possible. We also thank our keynote speaker Jürgen Schmidhuber, Director of the Swiss Artificial Intelligence Lab IDSIA, Professor of Artificial Intelligence at the University of Lugano, Switzerland, Head of the CogBotLab at TU Munich, Germany, and Professor SUPSI, Switzerland. Jürgen’s talk inspired a great deal of discussion among the participants throughout the workshop.
The workshop received support from these sources: The Center for the Study of Complex Systems (CSCS); John Koza, Third Millennium Venture Capital Limited; Michael Korns, Freeman Investment Management; Ying Becker, State Street Global Advisors, Boston, MA; Mark Kotanchek, Evolved Analytics; Jason Moore, Computational Genetics Laboratory at Dartmouth College; Conor Ryan, Biocomputing and Developmental Systems Group, Computer Science and Information Systems, University of Limerick; and William and Barbara Tozier, Vague Innovation LLC. We thank all of our sponsors for their kind and generous support for the workshop and GP research in general.


A number of people made key contributions to running the workshop and assisting the attendees while they were in Ann Arbor. Foremost among them was Howard Oishi, who makes GPTP workshops run smoothly with his diligent efforts before, during and after the workshop itself. After the workshop, many people provided invaluable assistance in producing this book. Special thanks go to Philipp Cannons, who did a wonderful job working with the authors, editors and publishers to get the book completed very quickly. Jennifer Maurer and Melissa Fearon provided invaluable editorial efforts, from the initial plans for the book through its final publication. Thanks also to Springer for helping with various technical publishing issues.

Rick Riolo, Trent McConaghy and Ekaterina (Katya) Vladislavleva

Foreword

If politics is the art of the possible, research is surely the art of the soluble. Both are immensely practical-minded affairs.
Peter Medawar¹

The annual Genetic Programming Theory and Practice (GPTP) is an important cross-fertilization event, bringing practitioners and theoreticians together in a small, focussed setting for several days. At larger conferences, parallel sessions force one to miss the great majority of the presentations, and it’s not uncommon for a theoretician and a practitioner to have little more contact than a brief conversation at a coffee break. GPTP blows away any stereotypes suggesting that theoreticians neither care about nor understand the challenges practitioners face, or that practitioners are indifferent to theoretical work, considering it an ivory tower exercise of no real consequence. The mutual respect around the table is manifest, and many participants have made substantial contributions to both theory and practice over the years. As a result, the discussions and debate are open, inclusive, lively, rigorous, and often intense. Despite the “Genetic Programming” in the title, GPTP has always been a showcase for problem solving techniques, without standing too much on the ceremony of names and labels. Many of the techniques and systems discussed this year have moved considerable distances from the standard s-expression GP of the early 90’s, and more and more hybrid systems are bringing together powerful tools from across evolutionary computation, machine learning, and statistics, often incorporating sophisticated domain knowledge as well. The creativity of our community, however, creates a plethora of challenges for those who wish to provide a theoretical understanding of these techniques and their dynamics, and evolutionary computation and GP work have long been dogged by a gap between the racing front of practical exploration and the rather more stately pace of theoretical understanding.
¹ Review of Arthur Koestler’s The Act of Creation, in the New Statesman, 19 June 1964.

Given that mismatch, events like GPTP become even more important, providing valuable opportunities for the community to take stock of the current state-of-play, identifying gaps, opportunities, and connections that have the potential to shape and inform work for years to come. This year’s papers continue to press many of the Hard Problems of the field. A number explore multi-objective evolutionary systems, co-evolution, and various types of modularity, hierarchy, and population structure, all with the goal of finding solutions to complex, structured, and often epistatic, problems. A constant challenge is finding effective representations, and many of the representations here don’t look much like a traditional tree-based GP. Similarly, configuration and parameter settings are a consistent burr, and this year’s work includes approaches that evolve this information, and approaches that dynamically set these values as a deterministic function of the current state. Application domains range widely through areas such as finance, industrial systems modeling, biology and medicine, games, art, and music; many, however, could still be described as forms of regression or classification, a vein that I suspect people will continue to mine successfully for years to come. A thread running through almost all the applications, in some cases more explicitly than others, is the importance of identifying and incorporating important domain knowledge, and it seems clear that few folks are tackling really tough problems without including the best domain knowledge they can lay their hands on. Another important trend is the continued conversion of GP into an increasingly off-the-shelf tool, what Rick Riolo and Bill Tozier might call the transition from an art to a craft. Several participants are building systems with the express goal of making high quality GP tools available to non-programmers, people with problems to solve but who aren’t interested in implementing (or able to implement) a state-of-the-art evolutionary algorithm themselves. One of the great values in participating in this sort of workshop is the conversation and discussion, both during the presentations and in the halls. Perhaps the biggest “buzz” this year was about the increased computation power being made available through cluster and cloud computing, multiple cores, and the massive parallelism of graphics processing units (GPUs). This topic came up in several papers, and was discussed with both excitement and skepticism throughout the workshop.
EC, along with most machine learning and artificial intelligence work, is a processor hungry business and one that parallelizes and distributes in fairly natural ways. This makes the increasing availability of large numbers of low-cost processing units, whether through physical devices or out on the Internet, very exciting. It wasn’t that long ago when population sizes were often 100 or less. These would now be considered small in many contexts, with population sizes routinely being several orders of magnitude larger. GPUs and cloud computing, however, make it possible to reasonably process populations of millions of individuals today, and no doubt many more in the next few years. This has enormous potential impact for both practice and theory in the field. People often comment on the fact that in the next few decades we’ll likely have computers (or clusters of computers) with computational power comparable to that of the human brain. This also gives us the ability to run much more complex evolutionary systems, effectively simulating much richer evolutionary processes in more complex environments. Many have commented over the years that to see the true potential of evolutionary algorithms we need to place them in more complex environments, and this came up again in this year’s GPTP discussions. If we only present our systems with simple problems, or problems with easily discovered local optima, we shouldn’t be surprised if their behaviors are often disappointingly simple. One of the reasons for this simplicity has all too often been the limit on available computing power. The continued growth in computing capacity makes it possible to run much richer systems and tackle more challenging problems, shedding light in exciting new places. These changes may have strong implications on the theory side as well. Many theoretical results (such as those from schema theory and many statistical techniques) require infinite population assumptions, for example. While many of the predictions of these theories have been shown to hold for finite populations, sampling effects have often led to significant variances especially for small populations, and many researchers have been skeptical of the practical value of results built on infinite population assumptions. If we reach a point where we’re routinely using population sizes in the millions, then while there will surely be issues of sampling, these will likely be profoundly different than those seen with populations of hundreds. More generally, as population sizes grow, it will become increasingly important to develop and extend theoretical techniques that process individuals in aggregate. Even if we could theoretically characterize each individual in a population of millions, to do so would likely be useless as we would drown in the data. We will instead need ways to characterize the broader properties of the population, probably using tools like statistical distributions and coarse graining. Another subject of considerable discussion throughout the workshop was that of “selling” GP in particular and evolutionary systems in general.
Despite the substantial and growing evidence of GP’s ability as a powerful problem solving tool, many remain skeptical. Sometimes this is because people are naturally nervous about the unknown, but caution is certainly warranted when there is a great deal at stake, such as people’s lives or millions of dollars. One traditional way to address this is to try to focus on the evolution of “understandable” solutions so one can offer the ideas embedded in a comprehensible solution instead of trying to pitch a black box that no one understands. Several of this year’s participants were avoiding the difficulty of selling GP by simply sidestepping it. They bundle GP as part of a complex set of tools that collectively address the customer’s problem, and find that in that setting the customer is often less concerned with the technical details of each component. I’m in a privileged position in that I rarely have to “sell” my work and so don’t have to face these issues, which I understand are very real. I must say, however, that I found it somewhat disheartening to hear so many people talk about obscuring the evolutionary component of their systems. Evolution is an incredibly powerful concept, but one that is all too little understood by the general public (especially in the United States). As an educator and evolutionary enthusiast, I see evolutionary computation as a great opportunity to help people understand that evolution is real, an idea that not only led to the amazing diversity of life on Earth, but which can also be harnessed in silico to solve tough problems and explore important new areas. To veil its use and successes seems, to me, to be a lost opportunity on many levels. Not surprisingly, however, there are no simple answers, and the conversations on all of these ideas and issues will continue well into the future, fueled and re-energized by events such as GPTP. None of this would be possible, of course, without the hard work of the folks at the University of Michigan Center for the Study of Complex Systems (CSCS), who organize and host the gathering each year. Particular thanks go to CSCS’s Howard Oishi for his administrative organization and support, and to the organizing committee and editors of this volume: Rick Riolo (CSCS), Trent McConaghy (Solido Design Automation), and Katya Vladislavleva (University of Antwerp). Like all such events, GPTP costs money and we greatly appreciate the generous contributions of Third Millennium; State Street Global Advisors (SSgA); Michael Korns, Investment Science Corporation (ISC); the Computational Genetics Laboratory at Dartmouth College; Evolved Analytics; the Biocomputing and Developmental Systems Group, CSIS, the University of Limerick; and William and Barbara Tozier of Vague Innovation LLC. All that work and those donations made it possible for a group of bright, enthusiastic folks to get together to share and push and stretch. This volume contains one form of their collective effort, and it’s a valuable one. Read on, and be prepared to take a few notes along the way.

Nic McPhee, Professor
Division of Science and Mathematics
University of Minnesota, Morris
Morris, MN, USA
July, 2010

Genetic Programming Theory and Practice 2010: An Introduction

Trent McConaghy¹, Ekaterina Vladislavleva², and Rick Riolo³

¹ Solido Design Automation Inc., Canada; ² Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; ³ Center for Study of Complex Systems, University of Michigan.

Abstract

The toy problems are long gone, real applications are standard, and the systems have arrived. Genetic programming (GP) researchers have been designing and exploiting advances in theory, algorithm design, and computing power to the point where (traditionally) hard problems are the norm. As GP is being deployed in more real-world and hard problems, GP research goals are evolving to a higher level, to systems in which GP algorithms play a key role. The key goals in GP algorithm design are reasonable resource usage, high-quality results, and reliable convergence. To these GP algorithm goals, we add GP system goals: ease of system integration, end-user friendliness, and user control of the problem and interactivity. In this book, expert GP researchers demonstrate how they have been achieving and improving upon the key GP algorithm and system aims, to realize them on real-world / hard problems. This work was presented at the GP Theory and Practice (GPTP) 2010 workshop. This introductory chapter summarizes how these experts’ work is driving the frontiers of GP algorithms and GP systems in their application to ever-harder application domains.

Keywords: genetic programming, evolutionary computation

1. The Workshop

In May 2010 the Center for the Study of Complex Systems at the University of Michigan, with deep historical roots in evolutionary computation tracing back to Holland’s seminal work, opened its doors for the invitees of the workshop on Genetic Programming in Theory and Practice 2010. Over twenty experienced and internationally distinguished GP researchers gathered in Ann Arbor to close themselves in one room for two and a half days, present their newest (and often controversial) work to the critical attention of their peers, discuss the challenges of genetic programming, search for common traits in the field’s development, get a better understanding of the global state-of-the-art and share the vision on the “next big things” in GP theory and practice. The atmosphere at the workshop has always been enjoyable, with every participant trying to get a deep understanding of presented work, provide constructive comments on it, suggest links to the relevant topics in the broad field of computing, and question the generality and scalability of the approach. The workshop fosters a friendly atmosphere wherein inquiring minds are genuinely trying to understand not only what they collectively know or can do with GP, but also what they collectively do not yet know or cannot yet do with GP. The latter understanding is a major driving force for further developments that we have observed in all workshops. We are grateful to all sponsors and acknowledge the importance of their contributions to such an intellectually productive and regular event. The workshop is generously founded and sponsored by the University of Michigan Center for the Study of Complex Systems (CSCS) and receives further funding from the following people and organizations: Michael Korns of Freeman Investment Management, State Street Global Advisors, Third Millennium, Bill and Barbara Tozier of Vague Innovation, Evolved Analytics, the Computational Genetics Laboratory of Dartmouth College and the Biocomputing and Developmental Systems Group of the University of Limerick. We also thank Jürgen Schmidhuber for an enlightening and provocative keynote speech, which covered his thoughts on what makes a scientific field mature, a review of his work in solving difficult real-world problems in pragmatic ways, and his theoretical work in GP- and non-GP-based program induction.

2. Summary of Progress

Last year, GPTP 2009 marked a transition wherein the aims of GP algorithms (reasonable resource usage, high-quality results, and reliable convergence) were being consistently realized on an impressive variety of “real-world” applications by skilled practitioners in the field. This year, for GPTP 2010, researchers have begun to aim for the next level: for systems where GP algorithms play a key role. This was evident from the record number of GPTP demos, and from a renewed emphasis on system usability and user control. Also reflecting this transition, discussions had a marked unity and depth of questions on the philosophy and future of GP, on the need to re-think the algorithms and re-design systems to solve conceptually harder problems. This chapter is organized accordingly. After a brief introduction to GP, Section 4 describes goals for design of GP algorithms and systems. Then the contributions of this volume (from the workshop) are summarized from two complementary perspectives: Section 5 describes the “best practice” techniques that GP practitioners have invented and deployed to achieve the GP algorithm and system aims (including the improvements of GPTP 2010), and Section 6 describes the application domains in which success through best practices has been reported. We conclude with a discussion of observations that emerged from the workshop, challenges that remain and potential avenues of future work. To make the results of the workshop useful to even a relative novice in the field of GP, we first provide a brief overview of GP.

3. A Brief Introduction to Genetic Programming²

Genetic programming is a search and optimization technique for executable expressions that is modeled on natural evolution. Natural evolution is a powerful process that can be described by a few central, general mechanisms; for an introduction, see (Futuyma, 2009). A population is composed of organisms which can be distinguished in terms of how fit they are with respect to their environment. Over time, members of the population breed in frequency proportional to their fitness. The new offspring inherit the combined genetic material of their parents with some random variation, and may replace existing members of the population. The entire process is iterative, adaptive and open ended. GP and other evolutionary algorithms typically realize this central description of evolution, albeit in somewhat abstract forms. GP is a set of algorithms that mimic survival of the fittest, genetic inheritance and variation, and that iterate over a “parent” population, selectively “breeding” its members and replacing them with offspring. Though in general evolution does not have a problem solving goal, GP is nonetheless used to solve problems arising in diverse domains ranging from engineering to art. This is accomplished by casting the organism in the population as a candidate program-like solution to the chosen problem. The organism is represented as a computationally executable expression (aka structure), which is considered its genome. When the expression is executed on some supplied set of inputs, it generates an output (and possibly some intermediate results). This execution behavior is akin to the natural phenotype. By comparing the expression’s output to target outputs, a measure of the solution’s quality is obtained. This is used as the “fitness” of an expression.
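The genome, phenotype and fitness vocabulary above can be made concrete with a small sketch. Everything named here is an illustrative assumption rather than something taken from the chapters of this book: expressions are nested tuples, the function set is add/sub/mul, and fitness is a sum of absolute errors in which lower is better.

```python
import operator

# Hypothetical function set: every function takes two arguments.
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def execute(expr, x):
    """Run an expression (the genome) on input x; the value returned is its behavior."""
    if expr == "x":                       # the single input variable
        return x
    if not isinstance(expr, tuple):       # a numeric constant
        return expr
    name, left, right = expr              # an internal node: (function, child, child)
    return FUNCTIONS[name](execute(left, x), execute(right, x))


def fitness(expr, cases):
    """Compare outputs with target outputs; the sum of absolute errors (0 is perfect)."""
    return sum(abs(execute(expr, x) - target) for x, target in cases)


# Target behavior for this illustration: y = x*x + 1 on a few sample inputs.
cases = [(x, x * x + 1) for x in range(-3, 4)]

exact = ("add", ("mul", "x", "x"), 1)     # encodes x*x + 1
rough = ("add", "x", 1)                   # encodes x + 1

print(fitness(exact, cases))  # 0
print(fitness(rough, cases))  # 28
```

Treating fitness as an error to be minimized is only one convention; a system that maximizes a score would work equally well for the purposes of this illustration.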
The fact that the candidate solutions are computationally executable structures (expressions), not binary or continuous coded values which are elements of a solution, is what distinguishes GP from other evolutionary algorithms (O’Reilly and Angeline, 1997). GP expressions include LISP functions (Koza, 1992; Wu and Banzhaf, 1998), stack or register based programs (Kantschik and Banzhaf, 2002; Spector and Robinson, 2002), graphs (Miller and Harding, 2008; Mattiussi and Floreano, 2007; Poli, 1997), programs derived from grammars (Gruau, 1993; Whigham, 1995; O’Neill and Ryan, 2003), and generative representations which evolve the grammar itself (Hemberg, 2001; Hornby and Pollack, 2002; O’Reilly and Hemberg, 2007). Key steps in applying GP to a specific problem collectively define its search space: the problem’s candidate solutions are designed by choosing a representation; variation operators (mutation and crossover) are selected (or specialized); and a fitness function (objectives and constraints) which expresses the relative merits of partial and complete solutions is formulated. For a more detailed overview we refer the reader to the book (Poli et al., 2008), which is available for free online.

2 Adapted from (O’Reilly et al., 2009).
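The loop described above (a population of executable expressions, fitness from comparing outputs to targets, selective breeding with variation, and replacement) can be sketched in a few dozen lines. The following is a minimal illustration only, not any particular GP system: the class name `TinyGp`, the target function x*x + x, the size cap, and all parameter values are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Toy tree-based GP for symbolic regression; every name here is illustrative. */
class TinyGp {
    static final int MAX_SIZE = 40; // crude size cap so expressions cannot bloat

    /** Genome: an executable expression tree over one input x. */
    static final class Node {
        final char op;          // '+', '*', 'x' (the input) or 'c' (a constant)
        final double value;     // used only when op == 'c'
        final Node left, right; // non-null only for '+' and '*'
        Node(char op, double value, Node left, Node right) {
            this.op = op; this.value = value; this.left = left; this.right = right;
        }
        /** Phenotype: the behavior of the expression on an input. */
        double eval(double x) {
            switch (op) {
                case 'x': return x;
                case 'c': return value;
                case '+': return left.eval(x) + right.eval(x);
                default:  return left.eval(x) * right.eval(x);
            }
        }
    }

    static Node randomTree(Random rng, int depth) {
        if (depth == 0 || rng.nextInt(3) == 0) {
            return rng.nextBoolean() ? new Node('x', 0, null, null)
                                     : new Node('c', rng.nextInt(5), null, null);
        }
        char op = rng.nextBoolean() ? '+' : '*';
        return new Node(op, 0, randomTree(rng, depth - 1), randomTree(rng, depth - 1));
    }

    static int size(Node t) { return t.left == null ? 1 : 1 + size(t.left) + size(t.right); }

    /** Fitness: total absolute error against the assumed target x*x + x (lower is better). */
    static double fitness(Node tree) {
        double error = 0;
        for (int x = -5; x <= 5; x++) error += Math.abs(tree.eval(x) - (x * x + x));
        return error;
    }

    static Node randomSubtree(Node t, Random rng) {
        while (t.left != null && rng.nextInt(3) != 0) t = rng.nextBoolean() ? t.left : t.right;
        return t;
    }

    /** Variation: replace one randomly chosen subtree with donor material or a fresh random tree. */
    static Node vary(Node parent, Node donor, Random rng) {
        if (parent.left == null || rng.nextInt(3) == 0) {
            return rng.nextInt(4) == 0 ? randomTree(rng, 2) : donor; // mutation or crossover
        }
        return rng.nextBoolean()
            ? new Node(parent.op, 0, vary(parent.left, donor, rng), parent.right)
            : new Node(parent.op, 0, parent.left, vary(parent.right, donor, rng));
    }

    static Node best(List<Node> pop) {
        return Collections.min(pop, Comparator.comparingDouble(TinyGp::fitness));
    }

    /** Selection: the fitter of two randomly drawn individuals. */
    static Node tournament(List<Node> pop, Random rng) {
        Node a = pop.get(rng.nextInt(pop.size())), b = pop.get(rng.nextInt(pop.size()));
        return fitness(a) <= fitness(b) ? a : b;
    }

    static Node evolve(long seed, int popSize, int generations) {
        Random rng = new Random(seed);
        List<Node> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(randomTree(rng, 3));
        for (int g = 0; g < generations; g++) {
            List<Node> next = new ArrayList<>();
            next.add(best(pop)); // elitism: the best individual always survives
            while (next.size() < popSize) {
                Node parent = tournament(pop, rng);
                Node donor = randomSubtree(tournament(pop, rng), rng);
                Node child = vary(parent, donor, rng);
                next.add(size(child) <= MAX_SIZE ? child : parent);
            }
            pop = next;
        }
        return best(pop);
    }

    public static void main(String[] args) {
        System.out.println("best error: " + fitness(evolve(1L, 50, 20)));
    }
}
```

The three design choices named in the text are all visible here: the representation (`Node`), the variation operators (`vary`), and the fitness function (`fitness`). Changing any one of them changes the search space.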

4. GP Challenges and Goals

In the early days of GP, the challenge was simply to “make it work” on small problems. As the field of GP research has matured, GP experts have strived to solve challenging real-world problems by improving GP algorithms in terms of efficient computational resource usage, better quality results, and more reliable convergence. With the maturation of “best practice” approaches, researchers are starting to create whole systems using GP, which presents its own challenges: ease of system integration, end-user friendliness, and user control of the problem (perhaps interactively). This section elaborates on these GP algorithm and system goals and challenges.

GP Algorithm Goals and Challenges

A successful GP algorithm has at least the following attributes.

Efficient Use of Computational Resources includes shorter runtime, reduced usage of processor(s), and reduced memory and disk usage, for a given result. Achieving efficient use of computer resources has traditionally been a major issue for GP. A key reason is that GP search spaces are astronomically large, multi-modal, and epistatic (e.g., variable interactions), and have poor locality (footnote 3) and other nonlinearities. To handle such challenging search spaces, significant exploration is needed (e.g. large population sizes). This entails intensive processing and memory needs. Exacerbating the problem, fitness evaluations (objectives and constraints) of real-world problems tend to be expensive. Finally, because GP expressions have variable length, there is a tendency for them to “bloat”, that is, to grow rapidly without a corresponding increase in performance (cf. Poli’s Chapter 5 in this book). Bloat can be a significant drain on available memory and CPU resources.

Ensuring Quality Results. The key question is: “can a GP result be used in the target application?” This may be more difficult to attain than is evident at first glance because the result may need to be human-interpretable, trustworthy, or predictive on dramatically different inputs; attaining such qualities can be challenging. Ensuring quality results has always been perceived as an issue, but the goal is becoming more prominent as GP is being applied to more real-world problems. Practitioners, not GP, are responsible for deploying a GP result in their application domain. This means that the practitioner (and potentially their client) must trust the result sufficiently to be comfortable using it. Human-interpretability (readability) of the result is a key factor in trust. This can be an issue when deployment of the result is expensive or risky; when customers’ understanding of the solution is crucial; when the result must be inspected or approved; or to gain acceptance of GP methodology.

Reliable convergence means that the GP run can be trusted to return reasonable, useful results, without the practitioner having to worry about premature convergence or whether algorithm parameters like population size were set correctly. GP can fail to capably identify sub-solutions or partially correct solutions and successfully promote, combine and reuse them to generate good solutions with effective structure. The default approach has been to use the largest population size possible, subject to time and resource constraints. This invariably implies high resource usage, and still gives no guarantee of hitting useful results even if such results exist. Alternative approaches, which increase the number of iterations while using smaller population sizes, still lack robust scenarios for computing resource allocation.

3 Poor locality means that a small change in the individual’s genotype often leads to large changes in the fitness, introducing additional difficulty into the search effort. For example, the GP “crossover” operation of swapping the subtrees of two parents might change the comparison of two elements from a “less than” relationship to an “equal to” relationship. This usually gives dramatically different behavior and fitness.
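The poor-locality example in footnote 3 (a “less than” comparison turning into an “equal to” comparison) can be made concrete with a small sketch. The setup below is an assumption for illustration: the target behavior is a < b, and fitness is the number of input pairs from 0..9 on which a candidate agrees with that target. `LocalityDemo` and `Comparison` are hypothetical names.

```java
/** Illustrates poor locality: a one-operator genotype change causes a large fitness change. */
class LocalityDemo {
    /** A candidate program: a comparison of two integers. */
    interface Comparison { boolean test(int a, int b); }

    /** Fitness: how many of the 100 pairs (a, b) in 0..9 agree with the target a < b. */
    static int fitness(Comparison candidate) {
        int hits = 0;
        for (int a = 0; a < 10; a++) {
            for (int b = 0; b < 10; b++) {
                if (candidate.test(a, b) == (a < b)) hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Comparison parent = (a, b) -> a < b;   // original genotype
        Comparison nearby = (a, b) -> a <= b;  // a small change with a small effect
        Comparison child  = (a, b) -> a == b;  // a small change with a large effect
        System.out.println(fitness(parent)); // 100
        System.out.println(fitness(nearby)); // 90
        System.out.println(fitness(child));  // 45
    }
}
```

All three genotypes differ by a single operator, yet one variant loses 10 points and the other 55, which is the kind of irregular fitness response that makes such search spaces hard to explore.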

Goals for GP Incorporated in Larger Systems

These are necessary attributes of GP for successful “GP systems,” i.e., systems in which GP plays a key role (footnote 4). A successful GP system must no doubt have many other attributes particular to the context in which it is deployed, but each of the following factors certainly has a high impact on the system’s success.

Ease of system integration is how easy the GP algorithm is to deploy as part of the entire system, by the person or team building the system. Even if a GP algorithm does well on the algorithm challenges, it may be hard for system integrators (or other researchers) to deploy because of high complexity or many parameters. Simple algorithms with few parameters are worth striving for; and if this is not possible, then readily available software with simple application programming interfaces and good documentation is a reasonable solution.

End-user friendliness is the end-user’s perspective of how easy the system is to use when solving the problem at hand, when GP is only a subcomponent of the overall system. The user wants to solve a problem economically, with quality results, reliably. The user task should be smooth and efficient, not tedious and time consuming.

User (Interactive) Control of the Problem. The system (and its subsystems) should not be solving a problem any harder than it needs to be, especially when it makes a qualitative difference to resource usage, result quality, or convergence. To meet this goal, users should be able to specify problems to be solved with as much resolution as appropriate. In some cases, this also means interactivity with results so far, to further guide exploration according to the user’s needs, intuitions or subjective tastes. It specifically does not mean user-level control of the GP algorithm itself: the end-user should not have to be a GP expert to use GP to solve a problem, just as GP experts do not have to be experts on electronics in order to use computers.

For book-length treatments of applying GP to industrial problems, we refer the reader to recent books on the subject by GPTP participants themselves: (Kordon, 2009), (Iba et al., 2010), and (McConaghy et al., 2009).

4 GP may not even be the centerpiece of the system; that’s fine!

5. GP Best Practices

First, we describe general best practices that GP practitioners use to achieve GP algorithm goals. Then, we review advances made at GPTP 2010 toward attaining those GP algorithm goals, followed by a review of GPTP 2010 work that addresses GP system goals.

In general, GP computational resource use has been made more efficient by improved algorithm design and by improved design of representation and operators in specific domains. The importance of GP’s high demands for computational resources has been lessened by Moore’s Law and the increasing availability of parallel computational approaches, meaning that computational resources become exponentially cheaper over time.

Results quality has improved for the same reasons. It is also due to a new emphasis by GP practitioners on getting interpretable or trustworthy results.

Reliability has been enhanced via algorithm techniques that support continuous evolutionary improvement in a systematic or structured fashion, so that the practitioner no longer has to “hope” that the algorithm isn’t stuck. Implicit or explicit diversity maintenance also helps.

Finally, thoughtful design of expression representation and genetic operators, for general and specific problem domains, has led to GP systems achieving human-competitive performance. Techniques along these lines include evolvability, self-adaptiveness, modularity and bloat control.

At GPTP 2010, the following papers demonstrated advances in GP algorithm aims (efficient computational resource usage, results quality, or reliable convergence):
• Poli (Chapter 5) draws on recently developed theory to construct a very simple technique that manages bloat.
• Harding et al. (Chapter 6) and Spector (Chapter 2) illuminate the state of the art in using self-modifying individuals to achieve highly scalable GP.
• Pattin, Moore et al. (Chapter 12) also uses self-adaptation and demonstrates how to incorporate expert knowledge in novel ways, for highly scalable GP.
• Lichodzijewski and Heywood (Chapter 3) and Soule et al. (Chapter 4) make further advances in GP scalability through evolution of teams.
• Orlov and Sipper (Chapter 1) is an excellent example of best-practice operator design to maintain evolvability in a highly constrained space.
• Smits et al. (Chapter 9) points towards evolution in the “compute cloud,” by exploring massively parallel evolution.
• Iba and Aranha (Chapter 13) exploits the structure of the resource-allocation problem in operator and algorithm design to improve GP scalability and results quality.
• Bergen and Ross (Chapter 14) explores how to handle problems with 2 objectives yet maintain convergence.
• Korns (Chapter 7) and McConaghy (Chapter 10) aggressively transform and simplify their respective problems for GP as much as possible, to greatly reduce GP resource needs.

At GPTP 2010, the following papers demonstrated advances in GP system goals (system-integrator usability, user-level usability, or user control of the problem and interactivity).

For system integrator usability:
• Schmidt and Lipson (Chapter 8) shows an approach that achieves the reliable convergence of the popular ALPS algorithm (Hornby, 2006), but with a simpler algorithm having fewer parameters.
• Harding et al. (Chapter 6) and Spector (Chapter 2) are also examples of relatively simple algorithms, algorithms that have been simplified over the years as their designers gained experience with them.
• In his keynote address, Jürgen Schmidhuber described the achievement of best-in-class results using simple backpropagation neural networks but with modern computational resources.

For user-level usability:
• Castillo et al. (Chapter 11) prescribes a flow for industrial modelers in which they can use GP as part of their overall manual flow in developing trustworthy industrial models.
• In the special demos session, many researchers presented highly usable GP systems, including Kotanchek’s DataModeler (symbolic regression and data analysis package for Mathematica), Schmidt and Lipson’s Eureqa (symbolic regression), Bergen and Ross’s Jnetic Textures (art), and Iba and Aranha’s CACIE (music).

For user control of the problem / interactivity:
• Korns (Chapter 7) describes an SQL-style language to specify symbolic regression problems, so that function search only changes subsections of the overall expression.
• Bergen and Ross (Chapter 14) and Iba and Aranha (Chapter 13) describe systems that emphasize usability in interactive design of art and music, respectively.


What is equally significant in these papers is that which is not mentioned or barely mentioned: GP algorithm goals that have already been solved sufficiently for particular problem domains, allowing researchers to focus their work on the more challenging issues. For example, there are several papers that do some form of symbolic regression (SR), which historically has had major issues with interpretability or bloat. Yet in these pages, the SR papers barely discuss interpretability or bloat, because best practices avoid the issue in one or more ways, most notably: Pareto optimization using an extra objective of minimizing complexity; templated functional forms like McConaghy’s CAFFEINE or Korns’ abstract expressions; or simply using the GP system to generate promising subexpressions in a manual modeling flow. Other off-the-shelf techniques that solve specific problems well have been around for years and are being increasingly adopted by the GPTP community. These include grammars to restrict program evolution (Whigham, 1995; O’Neill and Ryan, 2003), competent algorithms to handle multiple objectives and/or constraints, e.g. (Deb et al., 2002), and meta-algorithms providing diversity and continuous improvement like ALPS (Hornby, 2006). Finally, significant compute resources are available to most: in an informal survey at the workshop, we found that most groups use a compute cluster, and two groups are already using “the compute cloud.”
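Pareto optimization with complexity as an extra objective can be illustrated by a simple non-dominated filter over (error, complexity) pairs. This is a minimal sketch of the dominance idea only; it is not NSGA-II, and the class `ParetoDemo`, the `Model` fields, and the sample values in `main` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Keeps the models that are not dominated on (error, complexity); both are minimized. */
class ParetoDemo {
    static final class Model {
        final String expr;     // printable form of the expression
        final double error;    // first objective: prediction error
        final int complexity;  // second objective: e.g. number of tree nodes
        Model(String expr, double error, int complexity) {
            this.expr = expr; this.error = error; this.complexity = complexity;
        }
    }

    /** a dominates b if a is no worse on both objectives and strictly better on at least one. */
    static boolean dominates(Model a, Model b) {
        return a.error <= b.error && a.complexity <= b.complexity
            && (a.error < b.error || a.complexity < b.complexity);
    }

    /** Quadratic-time filter; adequate for the small populations of a sketch. */
    static List<Model> front(List<Model> models) {
        List<Model> front = new ArrayList<>();
        for (Model candidate : models) {
            boolean dominated = false;
            for (Model other : models) {
                if (dominates(other, candidate)) { dominated = true; break; }
            }
            if (!dominated) front.add(candidate);
        }
        return front;
    }

    public static void main(String[] args) {
        // Illustrative values only, not results from any experiment.
        List<Model> models = Arrays.asList(
            new Model("x", 9.0, 1),
            new Model("x*x", 4.0, 3),
            new Model("x*x + x", 0.0, 5),
            new Model("x*x*x + 1", 4.0, 7));
        for (Model m : front(models)) System.out.println(m.expr);
    }
}
```

In this sample, the fourth model is dropped because a simpler model achieves the same error. The surviving set gives the practitioner a trade-off curve from simple and rough to complex and accurate, which is one reason bloat and interpretability become less pressing.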

6. Application Successes Via Best Practices

One of the fascinating aspects of GP research is that GP is so general, i.e. “search through a space of (program or structure) entities,” that it can be used to attack an enormous variety of problems, including many problems that are currently unapproachable by any other technique. This year’s batch of applications is no exception. This section briefly reviews the applications.

One of the long-standing aims of AI, and GP, has been evolution of software in the most general sense possible. GPTP this year was fortunate to have three groups present work directly on this. Orlov and Sipper (Chapter 1) present FINCH, a system to evolve Java bytecode, an evolutionary substrate that has evolvability close to machine code, yet returns interpretable Java code thanks to industry-standard bytecode decompilers. Spector (Chapter 2) presents an autoconstructive version of PUSH, a GP system which evolves stack-based programs. Finally, Harding et al. (Chapter 6) presents a self-modifying Cartesian GP which evolves graphs that can be interpreted as software, circuits, equations, and more.

Two chapters introduce wholly new problems for GP. McConaghy (Chapter 10) introduces the problem of building density models at a distribution’s tails (and dusts off the general problem of symbolic density modeling), for the application of SRAM memory circuit analysis. Lichodzijewski and Heywood (Chapter 3) introduce the problem of solving a Rubik’s cube with GP, taking the perspective of temporal sequence learning.

GP continues to help the artistic types. Bergen and Ross (Chapter 14) describe a sophisticated system for interactive evolutionary art, and Iba and Aranha (Chapter 13) describe an advanced system for interactive evolutionary music. Both systems have already been used extensively by artists and musicians.

In a biology application, Pattin, Moore et al. (Chapter 12) describe the use of GP for disease susceptibility modeling.

GP remains popular in financial applications. Korns (Chapter 7) ups the ante on a set of symbolic regression and classification problems that are representative of financial modeling problems to aid stock-trading decision-making. Iba and Aranha (Chapter 13) describes a system for portfolio allocation.

For the problem of industrial modeling (e.g. of inferential sensors at Dow), Castillo et al. (Chapter 11) focuses on a structured approach to exploit GP results within industrial modelers’ model development flows. Undoubtedly, the symbolic regression approach in Smits et al. (Chapter 9) will find end usage in Dow’s industrial modeling environment as well.

Other approaches used standard problems in (symbolic) classification or regression as their test suites, though the emphasis was not the application. This includes work by Soule et al. (Chapter 4), Poli (Chapter 5), and Schmidt and Lipson (Chapter 8).

7. Themes, Summary and Looking Forward

The toy problems are gone; the GP systems have arrived. No doubt there will continue to be qualitative improvements to GP algorithms and GP systems for years to come. But is there more? We posit there is. Despite these achievements, GP’s computer-based evolution does not demonstrate the potential associated with natural evolution, nor does it always satisfactorily solve important problems we might hope to use it on. Even when using best practice approaches to manage challenges in resources, results, and reliability, the computational load may still be excessive and the final results may still be inadequate. To achieve success in a difficult problem domain takes a great deal of human effort toward thoughtful design of representations and operators. Many questions and challenges remain:
• What does it take to make GP a science? (Is this even a realistic question?) How can work on applications facilitate the continued development of a GP theory?
• What does it take to make GP a technology? (Is this even a realistic question?) What fundamental contributions will allow GP to be adopted into broader use beyond that of expert practitioners? For example, how can GP be scoped so that it becomes another standard, off-the-shelf method in the “toolboxes” of scientists and engineers around the world? Can GP follow in the same vein as linear programming? Can it follow the example of support vector machines and convex optimization methods? One challenge is in formulating the algorithm so that it provides more ease in laying out a problem. Another is determining how, by default and without parameter tuning, GP can efficiently exploit specified resources to return results reliably.
• How do we get 1 million people using GP? 1 billion? (Should they even know they’re using GP?)
• Success with GP often requires extensive human effort in capturing and embedding the domain knowledge. How can this up-front human effort be reduced while still achieving excellent results? Are there additional automatic ways to capture domain knowledge for input to GP systems?
• Scalability is always relative. GP has attacked fairly large problems, but how can GP be improved to solve problems that are 10x, 100x, 1,000,000x harder?
• What opportunities await GP due to new computing architectures and substrates, with potentially vastly richer processing resources? This includes massively multicore processors, GPUs, and cloud computing; but it also includes digital microfluidics, modern programmable logic, and more.
• What opportunities await GP due to massive memory and storage capacity, coupled with giant databases? For example, this has already profoundly affected machine learning applied to speech recognition, not to mention web search. Massive and freely available databases are coming online, especially from biology.
• What “uncrackable” problems await a creative GP approach? The future has many challenges in energy, health care, defence, and more. For many fields, there are lists of “holy grail” problems, unsolved problems, even problems with prize money attached.
These questions and their answers will provide the fodder for future GPTP workshops. We wish you many hours of stimulating reading of this volume’s contributions.

References

Deb, Kalyanmoy, Pratap, Amrit, Agarwal, Sameer, and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197.
Futuyma, Douglas (2009). Evolution, Second Edition. Sinauer Associates Inc.
Gruau, Frederic (1993). Cellular encoding as a graph grammar. IEE Colloquium on Grammatical Inference: Theory, Applications and Alternatives, (Digest No.092):17/1–10.
Hemberg, Martin (2001). GENR8 - A design tool for surface generation. Master’s thesis, Department of Physical Resource Theory, Chalmers University, Sweden.
Hornby, Gregory S. (2006). ALPS: the age-layered population structure for reducing the problem of premature convergence. In Keijzer, Maarten, Cattolico, Mike, Arnold, Dirk, Babovic, Vladan, Blum, Christian, Bosman, Peter, Butz, Martin V., Coello Coello, Carlos, Dasgupta, Dipankar, Ficici, Sevan G., Foster, James, Hernandez-Aguirre, Arturo, Hornby, Greg, Lipson, Hod, McMinn, Phil, Moore, Jason, Raidl, Guenther, Rothlauf, Franz, Ryan, Conor, and Thierens, Dirk, editors, GECCO 2006: Proceedings of the 8th annual conference on Genetic and evolutionary computation, volume 1, pages 815–822, Seattle, Washington, USA. ACM Press.
Hornby, Gregory S. and Pollack, Jordan B. (2002). Creating high-level components with a generative representation for body-brain evolution. Artificial Life, 8(3):223–246.
Iba, Hitoshi, Paul, Topon Kumar, and Hasegawa, Yoshihiko (2010). Applied Genetic Programming and Machine Learning. CRC Press.
Kantschik, Wolfgang and Banzhaf, Wolfgang (2002). Linear-graph GP - A new GP structure. In Foster, James A., Lutton, Evelyne, Miller, Julian, Ryan, Conor, and Tettamanzi, Andrea G. B., editors, Genetic Programming, Proceedings of the 5th European Conference, EuroGP 2002, volume 2278 of LNCS, pages 83–92, Kinsale, Ireland. Springer-Verlag.
Kordon, Arthur (2009). Applying Computational Intelligence: How to Create Value. Springer.
Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.
Mattiussi, Claudio and Floreano, Dario (2007). Analog genetic encoding for the evolution of circuits and networks. IEEE Transactions on Evolutionary Computation, 11(5):596–607.
McConaghy, Trent, Palmers, Pieter, Gao, Peng, Steyaert, Michiel, and Gielen, Georges G.E. (2009). Variation-Aware Analog Structural Synthesis: A Computational Intelligence Approach. Springer.
Miller, Julian Francis and Harding, Simon L. (2008). Cartesian genetic programming. In Ebner, Marc, Cattolico, Mike, van Hemert, Jano, Gustafson, Steven, Merkle, Laurence D., Moore, Frank W., Congdon, Clare Bates, Clack, Christopher D., Moore, Frank W., Rand, William, Ficici, Sevan G., Riolo, Rick, Bacardit, Jaume, Bernado-Mansilla, Ester, Butz, Martin V., Smith, Stephen L., Cagnoni, Stefano, Hauschild, Mark, Pelikan, Martin, and Sastry, Kumara, editors, GECCO-2008 tutorials, pages 2701–2726, Atlanta, GA, USA. ACM.
O’Neill, Michael and Ryan, Conor (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, volume 4 of Genetic programming. Kluwer Academic Publishers.
O’Reilly, Una-May and Angeline, Peter J. (1997). Trends in evolutionary methods for program induction. Evolutionary Computation, 5(2):v–ix.
O’Reilly, Una-May and Hemberg, Martin (2007). Integrating generative growth and evolutionary computation for form exploration. Genetic Programming and Evolvable Machines, 8(2):163–186. Special issue on developmental systems.
O’Reilly, Una-May, McConaghy, Trent, and Riolo, Rick (2009). GPTP 2009: An example of evolvability. In Riolo, Rick L., O’Reilly, Una-May, and McConaghy, Trent, editors, Genetic Programming Theory and Practice VII, Genetic and Evolutionary Computation, chapter 1, pages 1–18. Springer, Ann Arbor.
Poli, Riccardo (1997). Evolution of graph-like programs with parallel distributed genetic programming. In Back, Thomas, editor, Genetic Algorithms: Proceedings of the Seventh International Conference, pages 346–353, Michigan State University, East Lansing, MI, USA. Morgan Kaufmann.
Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).
Spector, Lee and Robinson, Alan (2002). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
Whigham, P. A. (1995). Grammatically-based genetic programming. In Rosca, Justinian P., editor, Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications, pages 33–41, Tahoe City, California, USA.
Wu, Annie S. and Banzhaf, Wolfgang (1998). Introduction to the special issue: Variable-length representation and noncoding segments for evolutionary algorithms. Evolutionary Computation, 6(4):iii–vi.

Chapter 1

FINCH: A SYSTEM FOR EVOLVING JAVA (BYTECODE)

Michael Orlov and Moshe Sipper
Department of Computer Science, Ben-Gurion University, Beer-Sheva 84105, Israel.

Abstract

The established approach in genetic programming (GP) involves the definition of functions and terminals appropriate to the problem at hand, after which evolution of expressions using these definitions takes place. We have recently developed a system, dubbed FINCH (Fertile Darwinian Bytecode Harvester), to evolutionarily improve actual, extant software, which was not intentionally written for the purpose of serving as a GP representation in particular, nor for evolution in general. This is in contrast to existing work that uses restricted subsets of the Java bytecode instruction set as a representation language for individuals in genetic programming. The ability to evolve Java programs will hopefully lead to a valuable new tool in the software engineer’s toolkit.

Keywords: Java bytecode, automatic programming, software evolution, genetic programming.

1. Introduction

The established approach in genetic programming (GP) involves the definition of functions and terminals appropriate to the problem at hand, after which evolution of expressions using these definitions takes place (Koza, 1992; Poli et al., 2008). Poli et al. recently noted that:

    While it is common to describe GP as evolving programs, GP is not typically used to evolve programs in the familiar Turing-complete languages humans normally use for software development. It is instead more common to evolve programs (or expressions or formulae) in a more constrained and often domain-specific language. (Poli et al., 2008, ch. 3.1; emphasis in original)

The above statement is (arguably) true not only where “traditional” tree-based GP is concerned, but also for other forms of GP, such as linear GP and grammatical evolution (Poli et al., 2008).

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_1, © Springer Science+Business Media, LLC 2011


(a)

    class F {
        int fact(int n) {
            // offsets 0-1
            int ans = 1;
            // offsets 2-3
            if (n > 0)
                // offsets 6-15
                ans = n * fact(n-1);
            // offsets 16-17
            return ans;
        }
    }

(b)

     0  iconst_1
     1  istore_2
     2  iload_1
     3  ifle 16
     6  iload_1
     7  aload_0
     8  iload_1
     9  iconst_1
    10  isub
    11  invokevirtual #2
    14  imul
    15  istore_2
    16  iload_2
    17  ireturn

Figure 1-1. A recursive factorial function in Java (a) and its corresponding bytecode (b). The argument to the virtual method invocation (invokevirtual) references the int F.fact(int) method via the constant pool.
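To see how the stack-based bytecode of Figure 1-1 (b) computes the factorial, the listing can be traced with a toy interpreter. The sketch below is an illustration under strong assumptions: it handles only the ten mnemonics that appear in the figure, stands in a dummy value for the `this` reference, and resolves `invokevirtual #2` by calling itself recursively. It is not how a JVM is implemented, and `FactBytecode` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy interpreter for the bytecode listed in Figure 1-1 (b); not a real JVM. */
class FactBytecode {
    // Byte offsets and mnemonics exactly as listed in the figure.
    static final int[] OFFSET = {0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17};
    static final String[] OP = {
        "iconst_1", "istore_2", "iload_1", "ifle", "iload_1", "aload_0", "iload_1",
        "iconst_1", "isub", "invokevirtual", "imul", "istore_2", "iload_2", "ireturn"};
    static final int IFLE_TARGET = 16; // operand of the single branch instruction

    /** Maps a byte offset to its position in the OP array. */
    static int indexOf(int offset) {
        for (int i = 0; i < OFFSET.length; i++) if (OFFSET[i] == offset) return i;
        throw new IllegalArgumentException("no instruction at offset " + offset);
    }

    static int fact(int n) {
        int[] locals = new int[3];  // slot 0: this (unused here), slot 1: n, slot 2: ans
        locals[1] = n;
        Deque<Integer> stack = new ArrayDeque<>(); // the operand stack
        int pc = 0;                                // index into OP
        while (true) {
            switch (OP[pc]) {
                case "iconst_1": stack.push(1); break;
                case "istore_2": locals[2] = stack.pop(); break;
                case "iload_1":  stack.push(locals[1]); break;
                case "iload_2":  stack.push(locals[2]); break;
                case "aload_0":  stack.push(0); break; // dummy stand-in for 'this'
                case "ifle":
                    if (stack.pop() <= 0) { pc = indexOf(IFLE_TARGET); continue; }
                    break;
                case "isub": { int b = stack.pop(), a = stack.pop(); stack.push(a - b); break; }
                case "imul": { int b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case "invokevirtual": {
                    int arg = stack.pop();  // the argument n-1
                    stack.pop();            // the receiver pushed by aload_0
                    stack.push(fact(arg));  // assumption: #2 resolves to F.fact(int)
                    break;
                }
                case "ireturn": return stack.pop();
                default: throw new IllegalStateException("unsupported: " + OP[pc]);
            }
            pc++;
        }
    }

    public static void main(String[] args) {
        System.out.println(fact(5)); // 120
    }
}
```

Note how every instruction communicates only through the operand stack and the local variable slots. This is why the position at which bytecode is cut and recombined matters so much for correctness.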

We have recently developed a system, dubbed FINCH (Fertile Darwinian Bytecode Harvester), to evolutionarily improve actual, extant software, which was not intentionally written for the purpose of serving as a GP representation in particular, nor for evolution in general. The only requirement is that the software source code either be written in Java or be compilable to Java bytecode. The following chapter provides an overview of our system, ending with a précis of results. Additional information can be found in (Orlov and Sipper, 2009; Orlov and Sipper, 2010).

Java compilers typically do not produce machine code directly, but instead compile source-code files to platform-independent bytecode, to be interpreted in software or, rarely, to be executed in hardware by a Java Virtual Machine (JVM) (Lindholm and Yellin, 1999). The JVM is free to apply its own optimization techniques, such as Just-in-Time (JIT) on-demand compilation to native machine code, a process that is transparent to the user. The JVM implements a stack-based architecture with high-level language features such as object management and garbage collection, virtual function calls, and strong typing. The bytecode language itself is a well-designed assembly-like language with a limited yet powerful instruction set (Engel, 1999; Lindholm and Yellin, 1999). Figure 1-1 shows a recursive Java program for computing the factorial of a number, and its corresponding bytecode.

The JVM architecture is successful enough that several programming languages compile directly to Java bytecode (e.g., Scala, Groovy, Jython, Kawa, JavaFX Script, and Clojure). Moreover, Java decompilers are available, which facilitate restoration of the Java source code from compiled bytecode. Since the design of the JVM is closely tied to the design of the Java programming language, such decompilation often produces code that is very similar to the original source code (Miecznikowski and Hendren, 2002).

We chose to automatically improve extant Java programs by evolving the respective compiled bytecode versions. This allows us to leverage the power of a well-defined, cross-platform, intermediate machine language at just the right level of abstraction: We do not need to define a special evolutionary language, thus necessitating an elaborate two-way transformation between Java and our language; nor do we evolve at the Java level, with its encumbering syntactic constraints, which render the genetic operators of crossover and mutation arduous to implement.

Note that we do not wish to invent a language to improve upon some aspect or other of GP (efficiency, terseness, readability, etc.), as has been amply done. Nor do we wish to extend standard GP to become Turing complete, an issue which has also been addressed (Woodward, 2003). Rather, conversely, our point of departure is an extant, highly popular, general-purpose language, with our aim being to render it evolvable. The ability to evolve Java programs will hopefully lead to a valuable new tool in the software engineer’s toolkit.

The motivation behind evolving Java bytecode is detailed in Section 2. The principles of bytecode evolution are described in Section 3. Section 4 describes compatible bytecode crossover, the main evolutionary operator driving the FINCH system. Alternative ways of evolving software are considered in Section 5. Program halting and compiler optimization issues are dealt with in Sections 6 and 7. Current experimental results are summarized in Section 8, and the concluding remarks are in Section 9.

2. Why Target Bytecode for Evolution?

Bytecode is the intermediate, platform-independent representation of Java programs, created by a Java compiler. Figure 1-2 depicts the process by which Java source code is compiled to bytecode and subsequently loaded by the JVM, which verifies it and (if the bytecode passes verification) decides whether to interpret the bytecode directly, or to compile and optimize it, thereupon executing the resultant native code. The decision regarding interpretation or further compilation (and optimization) depends upon the frequency at which a particular method is executed, its size, and other parameters.

[Figure 1-2 diagram: Java source is translated by the Java compiler into platform-independent bytecode; the JVM loads, verifies, and interprets the bytecode, or compiles it with the just-in-time compiler into platform-dependent native code (IA32, AMD64, SPARC).]

Figure 1-2. Java source code is first compiled to platform-independent bytecode by a Java compiler. The JVM only loads the bytecode, which it verifies for correctness, and raises an exception in case the verification fails. After that, the JVM typically interprets the bytecode until it detects that it would be advantageous to compile it, with optimizations, to native, platform-dependent code. The native code is then executed by the CPU as any other program. Note that no optimization is performed when Java source code is compiled to bytecode. Optimization only takes place during compilation from bytecode to native code.

Our decision to evolve bytecode instead of the more high-level Java source code is guided in part by the desire to avoid altogether the possibility of producing non-compilable source code. The purpose of source code is to be easy for human programmers to create and to modify, a purpose which conflicts with the ability to automatically modify such code. We note in passing that we do not seek an evolvable programming language, a problem tackled, e.g., by (Spector and Robinson, 2002), but rather aim to handle the Java programming language in particular.

Evolving bytecode instead of source code alleviates the issue of producing non-compilable programs to some extent, but not completely. Java bytecode must be correct with respect to dealing with stack and local variables (cf. Figure 1-3). Values that are read and written should be type-compatible, and stack underflow must not occur. The JVM performs bytecode verification and raises an exception in case of any such incompatibility. We wish not merely to evolve bytecode, but indeed to evolve correct bytecode. This task is hard, because our purpose is to evolve given, unrestricted code, and not simply to leverage the capabilities of the JVM to perform GP. Therefore, basic evolutionary operations, such as bytecode crossover and mutation, should produce correct individuals.

3.

Bytecode Evolution Principles

We define a good crossover of two parents as one where the offspring is a correct bytecode program, meaning one that passes verification with no errors; conversely, a bad crossover of two parents is one where the offspring is an incorrect bytecode program, meaning one whose verification produces errors. While it is easy to define a trivial slice-and-swap crossover operator on two programs, it is far more arduous to define a good crossover operator. This latter is necessary in order to preserve variability during the evolutionary process, because incorrect programs cannot be run, and therefore cannot be ascribed a

[Figure 1-3 diagram: three call frames of the fact method, the last of which is active. The active frame shows an operand stack holding int 5, a reference to the F object (this), and int 4 at the stack top, and a local variables array holding the F reference (this) at index 0, int 5 at index 1, and int 1 at index 2. Labels: Heap (shared objects store); Program Counter, value 11 (holds offset of currently executing instruction in method code area); Operand Stack (references objects on the heap; used to provide arguments to JVM instructions, such as arithmetic operations and method calls); Local Variables Array (references objects on the heap; contains method arguments and locally defined variables).]

Figure 1-3. Call frames in the architecture of the Java Virtual Machine, during execution of the recursive factorial function code shown in Figure 1-1, with parameter n = 7. The top call frame is in a state preceding execution of invokevirtual. This instruction will pop a parameter and an object reference from the operand stack, invoke the method fact of class F, and open a new frame for the fact(4) call. When that frame closes, the returned value will be pushed onto the operand stack.

fitness value—or, alternatively, must be assigned the worst possible value. Too many bad crossovers will hence produce a population with little variability. Note that we use the term good crossover to refer to an operator that produces a viable offspring (i.e., one that passes the JVM verification) given two parents; compatible crossover, defined below, is one mechanism by which good crossover can be implemented. The Java Virtual Machine is a stack-based architecture for executing Java bytecode. The JVM holds a stack for each execution thread, and creates a frame on this stack for each method invocation. The frame contains a code array, an operand stack, a local variables array, and a reference to the constant pool of the current class (Engel, 1999). The code array contains the bytecode to be executed by the JVM. The local variables array holds all method (or function) parameters, including a reference to the class instance in which the current method executes. In addition, the variables array also holds local-scope variables. The operand stack is used by stack-based instructions, and for arguments when calling other methods. A method call moves parameters from the caller’s operand stack to the callee’s variables array; a return moves the top value from the callee’s stack to the caller’s stack, and disposes of the callee’s frame. Both the operand stack and the variables array contain typed items, and instructions always act on a specific type. The relevant bytecode instructions are prefixed accordingly: ‘a’ for an object or array reference, ‘i’ and ‘l’ for integral types int and long, and

‘f’ and ‘d’ for floating-point types float and double.¹ Finally, the constant pool is an array of references to classes, methods, fields, and other unvarying entities. The JVM architecture is illustrated in Figure 1-3.

¹ The types boolean, byte, char and short are treated as the computational type int by the Java Virtual Machine, except for array accesses and explicit conversions (Lindholm and Yellin, 1999).

In our evolutionary setup, the individuals are bytecode sequences annotated with all the necessary stack and variables information. This information is gathered in one pass over the bytecode, using the ASM bytecode manipulation and analysis library (Bruneton et al., 2002). Afterwards, similar information for any sequential code segment in the individual can be aggregated separately. This preprocessing step allows us to define compatible two-point crossover on bytecode sequences (Orlov and Sipper, 2009). Code segments can be replaced only by other segments that use the operand stack and the local variables array in a depth-compatible and type-compatible manner. The compatible crossover operator thus maximizes the viability potential for offspring, preventing type incompatibility and stack underflow errors that would otherwise plague indiscriminate bytecode crossover. Note that the crossover operation is unidirectional, or asymmetric: the code segment compatibility criterion as described here is not a symmetric relation. An ability to replace segment α in individual A with segment β in individual B does not imply an ability to replace segment β in B with segment α.

As an example of compatible crossover, consider two identical programs with the same bytecode as in Figure 1-1, which are reproduced as parents A and B in Figure 1-4. We replace bytecode instructions at offsets 7–11 in parent A with the single iload_2 instruction at offset 16 from parent B. Offsets 7–11 correspond to the fact(n-1) call that leaves an integer value on the stack, whereas offset 16 corresponds to pushing the local variable ans on the stack. This crossover, the result of which is shown as offspring x in Figure 1-4, is good, because the operand stack is used in a compatible manner by the source segment, and although this segment reads the variable ans that is not read in the destination segment, that variable is guaranteed to have been written previously, at offset 1.

Alternatively, consider replacing the imul instruction in the newly formed offspring x with the single invokevirtual instruction from parent B. This crossover is bad, as illustrated by incorrect offspring y in Figure 1-4. Although both invokevirtual and imul pop two values from the stack and then push one value, invokevirtual expects the value beneath the topmost one to be a reference of type F (as in the operand stack of Figure 1-3), whereas imul expects an integer there. Another negative example is an attempt to replace bytecode offsets 0–1 in parent B (that correspond to the int ans=1 statement) with an empty segment. In this case, illustrated by incorrect offspring z in Figure 1-4, variable ans is no longer guaranteed to be initialized

[Figure 1-4 listings, with branch arrows omitted:
Parent A and Parent B (identical): iconst_1, istore_2, iload_1, ifle, iload_1, aload_0, iload_1, iconst_1, isub, invokevirtual, imul, istore_2, iload_2, ireturn
Offspring x (correct): iconst_1, istore_2, iload_1, ifle, iload_1, iload_2, imul, istore_2, iload_2, ireturn
Offspring y (incorrect): iconst_1, istore_2, iload_1, ifle, iload_1, iload_2, invokevirtual, istore_2, iload_2, ireturn
Offspring z (incorrect): iload_1, ifle, iload_1, aload_0, iload_1, iconst_1, isub, invokevirtual, imul, istore_2, iload_2, ireturn]

Figure 1-4. An example of good and bad crossovers. The two identical individuals A and B represent a recursive factorial function (see Figure 1-1; here we use an arrow instead of branch offset). In parent A, the bytecode sequence that corresponds to the fact(n-1) call that leaves an integer value on the stack, is replaced with the single instruction in B that corresponds to pushing the local variable ans on the stack. The resulting correct offspring x and the original parent B are then considered as two new parents. We see that either replacing the first two instructions in B with an empty section, or replacing the imul instruction in x with the invokevirtual instruction from B, results in incorrect bytecode, shown as offspring z and y, respectively (see main text for full explanation).

when it is read immediately prior to the function’s return, and the resulting bytecode is therefore incorrect.

A mutation operator employs the same constraints as compatible crossover, but the constraints are applied to variations of the same individual. The requirements for correct bytecode mutation are thus derived from those of compatible crossover. To date, we have not used this type of mutation, as it proved unnecessary; instead, we implemented a restricted form of constants-only point mutation, where each constant in a new individual is modified with a given probability.
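The operand-stack side of the good and bad crossovers in Figure 1-4 can be illustrated with a small simulation. The following is a minimal sketch and not FINCH's implementation: the class StackEffectSketch, the Effect record, the hand-written effects table, and the string type names are all assumptions made for illustration. It handles straight-line segments only, and it compares effects by strict equality, whereas the compatibility criterion of Section 4 also admits narrower types. It assumes Java 16 or later (for records).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class StackEffectSketch {
    /** Types an instruction pops (listed bottom-to-top) and the types it pushes. */
    record Effect(List<String> pops, List<String> pushes) {}

    // Hypothetical, hand-written effects for the instructions of the factorial example.
    static final Map<String, Effect> EFFECTS = Map.of(
        "aload_0", new Effect(List.of(), List.of("F")),
        "iload_1", new Effect(List.of(), List.of("int")),
        "iload_2", new Effect(List.of(), List.of("int")),
        "iconst_1", new Effect(List.of(), List.of("int")),
        "isub", new Effect(List.of("int", "int"), List.of("int")),
        "imul", new Effect(List.of("int", "int"), List.of("int")),
        // objectref lies below the int argument on the operand stack
        "invokevirtual F.fact", new Effect(List.of("F", "int"), List.of("int")));

    /**
     * Net effect of a straight-line segment: the types it consumes from the
     * incoming stack (topmost first) and the types it leaves (topmost first).
     */
    static Effect segmentEffect(List<String> segment) {
        List<String> needed = new ArrayList<>();
        Deque<String> produced = new ArrayDeque<>(); // values pushed by the segment itself
        for (String insn : segment) {
            Effect e = EFFECTS.get(insn); // unknown instructions are not handled in this sketch
            for (int i = e.pops().size() - 1; i >= 0; i--) { // topmost operand first
                String want = e.pops().get(i);
                if (produced.isEmpty()) {
                    needed.add(want); // must come from the stack state before the segment
                } else if (!produced.pop().equals(want)) {
                    throw new IllegalStateException("type mismatch inside segment at " + insn);
                }
            }
            for (String t : e.pushes()) {
                produced.push(t);
            }
        }
        return new Effect(needed, new ArrayList<>(produced)); // ArrayDeque iterates top-first
    }

    public static void main(String[] args) {
        Effect call = segmentEffect(
            List.of("aload_0", "iload_1", "iconst_1", "isub", "invokevirtual F.fact"));
        Effect load = segmentEffect(List.of("iload_2"));
        System.out.println("fact(n-1) call vs. iload_2 match: " + call.equals(load));

        Effect mul = segmentEffect(List.of("imul"));
        Effect invoke = segmentEffect(List.of("invokevirtual F.fact"));
        System.out.println("imul vs. invokevirtual match: " + mul.equals(invoke));
    }
}
```

In this model the fact(n-1) call at offsets 7–11 and the single iload_2 both need nothing from the incoming stack and leave one int, while imul and invokevirtual differ in the type required beneath the top of the stack.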

4.

Compatible Bytecode Crossover

As discussed above, compatible bytecode crossover is a fundamental building block for effective evolution of correct bytecode. In order to describe the formal requirements for compatible crossover, we need to define the meaning of variable accesses for a segment of code. That is, a section of code (that is not necessarily linear, since there are branching instructions) can be viewed as reading and writing some local variables, or as an aggregation of reads and writes by individual bytecode instructions. However, when a variable is written before being read, the write “shadows” the read, in the sense that the code executing prior to the given section does not have to provide a value of the correct type in the variable.

Variables Access Sets. We define variables access sets, to be used ahead by the compatible crossover operator, as follows: Let a and b be two locations in the same bytecode sequence. For a set of instructions $\delta_{a,b}$ that could potentially be executed starting at a and ending at b, we define the following access sets.

$\delta^{r}_{a,b}$: set of local variables such that for each variable v, there exists a potential execution path (i.e., one not necessarily taken) between a and b, in which v is read before any write to it.

$\delta^{w}_{a,b}$: set of local variables that are written to through at least one potential execution path.

$\delta^{w!}_{a,b}$: set of local variables that are guaranteed to be written to, no matter which execution path is taken.

These sets of local variables are incrementally computed by analyzing the data flow between locations a and b. For a single instruction c, the three access sets for $\delta_{c}$ are given by the Java bytecode definition. Consider a set of (normally non-consecutive) instructions $\{b_i\}$ that branch to instruction c or have c as their immediate subsequent instruction. The variables accessed between a and c are computed as follows:

$\delta^{r}_{a,c}$ is the union of all reads $\delta^{r}_{a,b_i}$, with the addition of variables read by instruction c, unless these variables are guaranteed to be written before c. Formally, $\delta^{r}_{a,c} = \left(\delta^{r}_{c} \setminus \bigcap_i \delta^{w!}_{a,b_i}\right) \cup \bigcup_i \delta^{r}_{a,b_i}$.

$\delta^{w}_{a,c}$ is the union of all writes $\delta^{w}_{a,b_i}$, with the addition of variables written by instruction c: $\delta^{w}_{a,c} = \bigcup_i \delta^{w}_{a,b_i} \cup \delta^{w}_{c}$.

$\delta^{w!}_{a,c}$ is the set of variables guaranteed to be written before c, with the addition of variables written by instruction c: $\delta^{w!}_{a,c} = \bigcap_i \delta^{w!}_{a,b_i} \cup \delta^{w!}_{c}$ (note that $\delta^{w!}_{c} = \delta^{w}_{c}$). When $\delta^{w!}_{a,c}$ has already been computed, its previous value needs to be a part of the intersection as well.
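The three update rules can be sketched as a single merge step over the predecessors of an instruction c. This is an illustrative sketch under stated assumptions rather than the analysis used in FINCH: the class AccessSetsSketch, the Access record, and the merge method are hypothetical names, local variables are represented by their integer indices, and the fixed-point iteration over the data-flow graph is left out. It assumes Java 16 or later (for records).

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class AccessSetsSketch {
    /** Access sets of a code span: r (read first), w (possibly written), wBang (surely written). */
    record Access(Set<Integer> r, Set<Integer> w, Set<Integer> wBang) {}

    /**
     * Combines the access sets of all spans a..b_i that lead into instruction c
     * with the access sets of c itself, following the three rules in the text.
     */
    static Access merge(List<Access> predecessors, Access c) {
        Set<Integer> reads = new TreeSet<>();
        Set<Integer> writes = new TreeSet<>(c.w());
        Set<Integer> guaranteed = null; // intersection of the predecessors' w! sets
        for (Access p : predecessors) {
            reads.addAll(p.r());   // union of all reads
            writes.addAll(p.w());  // union of all writes
            if (guaranteed == null) {
                guaranteed = new TreeSet<>(p.wBang());
            } else {
                guaranteed.retainAll(p.wBang());
            }
        }
        if (guaranteed == null) {
            guaranteed = new TreeSet<>(); // no predecessors: nothing is guaranteed
        }
        // Reads of c count only if the variable is not surely written on every path.
        Set<Integer> exposedReads = new TreeSet<>(c.r());
        exposedReads.removeAll(guaranteed);
        reads.addAll(exposedReads);
        // Writes of c are guaranteed once c has executed.
        guaranteed.addAll(c.wBang());
        return new Access(reads, writes, guaranteed);
    }

    public static void main(String[] args) {
        // Factorial example: local 0 = this, 1 = n, 2 = ans; c is the iload_2 at offset 16.
        Access viaBranch = new Access(Set.of(1), Set.of(2), Set.of(2));
        Access viaFallThrough = new Access(Set.of(0, 1), Set.of(2), Set.of(2));
        Access iload2 = new Access(Set.of(2), Set.of(), Set.of());
        System.out.println(merge(List.of(viaBranch, viaFallThrough), iload2));

        // Offspring z of Figure 1-4: the initialization of ans is gone on the branch path.
        Access branchWithoutInit = new Access(Set.of(1), Set.of(), Set.of());
        System.out.println(merge(List.of(branchWithoutInit, viaFallThrough), iload2));
    }
}
```

In the second call, variable 2 (ans) ends up in the read set and is absent from the guaranteed-write set, which matches the reason given in the text for offspring z being incorrect.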

We therefore traverse the data-flow graph starting at a, updating the variables access sets as above, until they stabilize, i.e., stop changing.² During the traversal, necessary stack depths are also updated. The requirements for compatible bytecode crossover can now be specified.

² The data-flow traversal process is similar to the data-flow analyzer’s loop in (Lindholm and Yellin, 1999).

Bytecode Constraints on Crossover. In order to attain viable offspring, several conditions must hold when performing crossover of two bytecode programs. Let A and B be functions in Java, represented as bytecode sequences. Consider segments α and β in A and B, respectively, and let $p_\alpha$ and $p_\beta$ be the necessary depth of stack for these segments, i.e., the minimal number of elements in the stack required to avoid underflow. Segment α can be replaced with β if the following conditions hold.

Operand stack:
(1) it is possible to ensure that $p_\beta \le p_\alpha$ by prefixing stack pops and pushes of α with some frames from the stack state at the beginning of α;
(2) α and β have compatible stack frames up to depth $p_\beta$: stack pops of α have types identical to, or narrower than, those of stack pops of β, and stack pushes of β have types identical to, or narrower than, those of stack pushes of α;
(3) α has compatible stack frames deeper than $p_\beta$: stack pops of α have types identical to, or narrower than, those of the corresponding stack pushes of α.

Local variables:
(1) local variables written by β ($\beta^{w}$) have types identical to, or narrower than, those of the corresponding variables that are read after α (post-$\alpha^{r}$);
(2) local variables read after α (post-$\alpha^{r}$) and not necessarily written by β ($\beta^{w!}$) must be written before α (pre-$\alpha^{w!}$), or provided as arguments for call to A, as identical or narrower types;
(3) local variables read by β ($\beta^{r}$) must be written before α (pre-$\alpha^{w!}$), or provided as arguments for call to A, as identical or narrower types.

Control flow:
(1) no branch instruction outside of α has branch destination in α, and no branch instruction in β has branch destination outside of β;
(2) code before α has transition to the first instruction of α, and code in β has transition to the first instruction after β;
(3) last instruction in α implies transition to the first instruction after α.

Detailed examples of the above conditions can be found in (Orlov and Sipper, 2009). Compatible bytecode crossover prevents verification errors in offspring; in other words, all offspring compile sans error. As with any other evolutionary method, however, it does not prevent production of non-viable offspring (in our case, those with runtime errors). An exception or a timeout can still occur during an individual’s evaluation, and the fitness of the individual should be reset accordingly.
We chose bytecode segments randomly before checking them for crossover compatibility, as follows: for a given method, a segment size is selected using a given probability distribution among all bytecode segments that are branch-consistent under the first control-flow requirement; then a segment with the chosen size is uniformly selected. Whenever the chosen segments result in bad crossover, bytecode segments are chosen again (up to some limit of retries). Note that this selection process is very fast (despite the retries), as it involves only fast operations; most importantly, it ensures that crossover always produces a viable offspring.
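The choose-check-retry procedure just described can be sketched as follows. This is a simplified illustration, not the FINCH code: the class RetrySelectionSketch and both methods are hypothetical, the segment-size distribution is taken to be uniform purely as a placeholder for the "given probability distribution", branch consistency is not modeled, and the compatibility test is passed in as a predicate.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.BiPredicate;
import java.util.function.Supplier;

public class RetrySelectionSketch {
    /**
     * Picks a segment of a method with the given number of instructions:
     * first a size (uniform here, as a placeholder), then a uniformly chosen start.
     * Returns {start, end} with end exclusive.
     */
    static int[] drawSegment(int methodLength, Random rng) {
        int size = 1 + rng.nextInt(methodLength);
        int start = rng.nextInt(methodLength - size + 1);
        return new int[] {start, start + size};
    }

    /**
     * Draws (alpha, beta) pairs until canReplace accepts one, giving up after
     * maxRetries draws. An empty result means no crossover is performed.
     */
    static <S> Optional<List<S>> pickCompatible(Supplier<S> drawAlpha, Supplier<S> drawBeta,
                                                BiPredicate<S, S> canReplace, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            S alpha = drawAlpha.get();
            S beta = drawBeta.get();
            if (canReplace.test(alpha, beta)) {
                return Optional.of(List.of(alpha, beta));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Toy compatibility rule (an assumption): beta must not be longer than alpha.
        Optional<List<int[]>> pair = pickCompatible(
            () -> drawSegment(14, rng),
            () -> drawSegment(14, rng),
            (a, b) -> (b[1] - b[0]) <= (a[1] - a[0]),
            10);
        pair.ifPresent(p -> System.out.println(
            "alpha=[" + p.get(0)[0] + "," + p.get(0)[1] + ") beta=["
                + p.get(1)[0] + "," + p.get(1)[1] + ")"));
    }
}
```

Bounding the number of retries keeps the cost of a failed search predictable; what happens when the limit is reached is not stated in the text, so the sketch simply returns an empty result.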

(a)
    float x;
    int y = 7;
    if (y >= 0)
        x = y;
    else
        x = -y;
    System.out.println(x);

(b)
    int x = 7;
    float y;
    if (y >= 0) {
        y = x;
        x = y;
    }
    System.out.println(z);

Figure 1-5. Two Java snippets that comply with the context-free grammar rules of the programming language. However, only snippet (a) is legal once the full Java Language Specification (Gosling et al., 2005) is considered. Snippet (b), though Java-compliant syntactically, is revealed to be ill-formed when semantics are thrown into play.

5.

The Grammar Alternative

One might ask whether it is really necessary to evolve bytecode in order to support the evolution of unrestricted Java software. After all, Java is a programming language with strict, formal rules, which are precisely defined in Backus-Naur form (BNF). One could make an argument for the possibility of providing this BNF description to a grammatical evolution system (O’Neill and Ryan, 2003) and evolving away.

We disagree with such an argument. The apparent ease with which one might apply the BNF rules of a real-world programming language in an evolutionary system (either grammatical or tree-based) is an illusion stemming from the blurred boundary between syntactic and semantic constraints (Poli et al., 2008, ch. 6.2.4). Java’s formal (BNF) rules are purely syntactic, in no way capturing the language’s type system, variable visibility and accessibility, and other semantic constraints. Correct handling of these constraints in order to ensure the production of viable individuals would essentially necessitate the programming of a full-scale Java compiler, a highly demanding task, not to be taken lightly. This is not to claim that such a task is completely insurmountable: for example, an extension to context-free grammars (CFGs), such as logic grammars, can be taken advantage of in order to represent the necessary contextual constraints (Wong and Leung, 2000). But we have yet to see such a GP implementation in practice, addressing real-world programming problems.

We cannot emphasize the distinction between syntax and semantics strongly enough. Consider, for example, the Java program segment shown in Figure 1-5(a). It is a seemingly simple syntactic structure, which belies, however, a host of semantic constraints, including: type compatibility in variable assignment, variable initialization before read access, and variable visibility. The similar (and CFG-conforming) segment shown in Figure 1-5(b) violates all these constraints: variable y in the conditional test is uninitialized during a read access, its subsequent assignment to x is type-incompatible, and variable z is undefined.

It is quite telling that despite the popularity and generality of grammatical evolution, we were able to uncover only a single case of evolution using a real-world, unrestricted phenotypic language, involving a semantically simple hardware description language (HDL). Mizoguchi et al. (1994) implemented the complete grammar of SFL (Structured Function description Language) (Nakamura et al., 1991) as production rules of a rewriting system, using approximately 350(!) rules for a language far simpler than Java. The semantic constraints of SFL, an object-oriented, register-transfer-level language, are sufficiently weak for using its BNF directly:

    By designing the genetic operators based on the production rules and by performing them in the chromosome, a grammatically correct SFL program can be generated. This eliminates the burden of eliminating grammatically incorrect HDL programs through the evolution process and helps to concentrate selective pressure in the target direction. (Mizoguchi et al., 1994)

Arcuri (2009) recently attempted to repair Java source code using syntax-tree transformations. His JAFF system is not able to handle the entire language, only an explicitly defined subset (Arcuri, 2009, Table 6.1), and furthermore exhibits a host of problems that evolution of correct Java bytecode avoids inherently: individuals are compiled at each fitness evaluation, compilation errors occur despite the syntax-tree modifications being legal (cf. discussion above), a significant part of the Java syntax is unsupported (inner and anonymous classes, labeled break and continue statements, Java 5.0 syntax extensions, etc.), method overloading is supported incorrectly, and there are other problems:

    The constraint system consists of 12 basic node types and 5 polymorphic types. For the functions and the leaves, there are 44 different types of constraints. For each program, we added as well the constraints regarding local variables and method calls. Although the constraint system is quite accurate, it does not completely represent yet all the possible constraints in the employed subset of the Java language (i.e., a program that satisfies these constraints would not be necessarily compilable in Java). (Arcuri, 2009)

FINCH, through its clever use of Java bytecode, attains a scalability leap in evolutionarily manageable programming language complexity.

6.

The Halting Issue

An important issue that must be considered when dealing with the evolution of unrestricted programs is whether they halt—or not (Langdon and Poli, 2006). Whenever Turing-complete programs with arbitrary control flow are evolved, a possibility arises that computation will turn out to be unending. A program that has acquired the undesirable non-termination property during evolution is executed directly by the JVM, and FINCH has nearly no control over the process.

A straightforward approach for dealing with non-halting programs is to limit the execution time of each individual during evaluation, assigning a minimal fitness value to programs that exceed the time limit. This approach, however, suffers from two shortcomings: First, limiting execution time provides coarse time granularity at best, is unreliable in the presence of varying CPU load, and as a result is wasteful of computer resources due to the relatively high time-limit value that must be used. Second, applying a time limit to an arbitrary program requires running it in a separate thread, and stopping the execution of the thread once it exceeds the time limit. However, externally stopping the execution is either unreliable (when interrupting the thread that must then eventually enter a blocked state), or unsafe for the whole application (when attempting to kill the thread).³

Therefore, in FINCH we exercise a different approach, taking advantage of the lucid structure offered by Java bytecode. Before evaluating a program, it is temporarily instrumented with calls to a function that throws an exception if called more than a given number of times (steps). A call to this function is inserted before each backward branch instruction and before each method invocation. Thus, an infinite loop in any evolved individual program will raise an exception after exceeding the predefined steps limit. Note that this is not a coarse-grained (run)time limit, but a precise limit on the number of steps.
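The effect of this instrumentation can be mimicked at the source level. The sketch below is an assumption-laden illustration, not the FINCH instrumentation itself: FINCH inserts the calls into bytecode, whereas here the calls are written by hand in Java source, and the names StepLimitSketch, StepCounter, and StepLimitExceeded are hypothetical.

```java
public class StepLimitSketch {
    /** Thrown when an evaluated program uses more steps than allowed. */
    static class StepLimitExceeded extends RuntimeException {
        StepLimitExceeded(long limit) {
            super("exceeded " + limit + " steps");
        }
    }

    /** Counts calls and throws once the count passes the limit. */
    static final class StepCounter {
        private final long limit;
        private long steps;

        StepCounter(long limit) {
            this.limit = limit;
        }

        void step() {
            if (++steps > limit) {
                throw new StepLimitExceeded(limit);
            }
        }
    }

    /** A halting loop; step() stands in for the call placed before the backward branch. */
    static int sumTo(int n, StepCounter counter) {
        int sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
            counter.step();
        }
        return sum;
    }

    /** A non-halting loop, as evolution might produce; it ends only via the exception. */
    static void spin(StepCounter counter) {
        while (true) {
            counter.step();
        }
    }

    public static void main(String[] args) {
        System.out.println("sumTo(10) = " + sumTo(10, new StepCounter(100)));
        try {
            spin(new StepCounter(1_000));
        } catch (StepLimitExceeded e) {
            // The caller would assign the worst fitness to this individual here.
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

Because the counter is advanced by the program's own control flow, the cutoff depends only on the number of loop iterations and calls, not on CPU load or thread scheduling.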

7.

(No) Loss of Compiler Optimization

Another issue that surfaces when bytecode genetic operators are considered is the apparent loss of compiler optimization. Indeed, most native-code producing compilers provide the option of optimizing the resulting machine code to varying degrees of speed and size improvements. These optimizations would presumably be lost during the process of bytecode evolution. Surprisingly, however, bytecode evolution does not induce loss of compiler optimization, since there is no optimization to begin with! The common assumption regarding Java compilers’ similarity to native-code compilers is simply incorrect. As far as we were able to uncover, with the exception of the IBM Jikes Compiler (which has not been under development since 2004, and which does not support modern Java), no Java-to-bytecode compiler is optimizing. Sun’s Java Compiler, for instance, has not had an optimization switch since version 1.3.⁴ Moreover, even the GNU Compiler for Java, which is part of the highly optimizing GNU Compiler Collection (GCC), does not optimize at the bytecode-producing phase (for which it uses the Eclipse Compiler for Java as a front-end) and instead performs (optional) optimization at the native-code-producing phase. The reason for this is that optimizations are applied at a later stage, whenever the JVM decides to proceed from interpretation to just-in-time compilation (Kotzmann et al., 2008).

³ For the intricacies of stopping Java threads see http://java.sun.com/javase/6/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html.

⁴ See the old manual page at http://java.sun.com/j2se/1.3/docs/tooldocs/solaris/javac.html, which contains the following note in the definition of the -O (Optimize) option: the -O option does nothing in the current implementation of javac.

The fact that Java compilers do not optimize bytecode does not preclude the possibility of doing so, nor render it particularly hard in various cases. Indeed, in FINCH we apply an automatic post-crossover bytecode transformation that is typically performed by a Java compiler: dead-code elimination. After crossover is done, it is possible to get a method with unreachable bytecode sections (e.g., a forward goto with no instruction that jumps into the section between the goto and its target code offset). Such dead code is problematic in Java bytecode, and it is therefore automatically removed from the resulting individuals by our system. This technique does not impede the ability of individuals to evolve introns, since there is still a multitude of other intron types that can be evolved (Brameier and Banzhaf, 2007) (e.g., any arithmetic bytecode instruction not affecting the method’s return value, which is not considered dead-code bytecode, though it is an intron nonetheless).
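Finding unreachable sections of this kind amounts to a reachability search over the method's control-flow graph. The following minimal sketch assumes a hypothetical representation in which instructions are numbered by position and each has an explicit list of successor positions; the class ReachabilitySketch and its methods are illustrative names, and the text does not say how FINCH performs this step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

public class ReachabilitySketch {
    /** Marks every instruction reachable from instruction 0 by following successor edges. */
    static BitSet reachable(int[][] successors) {
        BitSet seen = new BitSet(successors.length);
        Deque<Integer> work = new ArrayDeque<>();
        if (successors.length > 0) {
            seen.set(0);
            work.push(0);
        }
        while (!work.isEmpty()) {
            int insn = work.pop();
            for (int next : successors[insn]) {
                if (!seen.get(next)) {
                    seen.set(next);
                    work.push(next);
                }
            }
        }
        return seen;
    }

    /** Positions of instructions that can never execute and may therefore be dropped. */
    static List<Integer> deadInstructions(int[][] successors) {
        BitSet live = reachable(successors);
        List<Integer> dead = new ArrayList<>();
        for (int i = 0; i < successors.length; i++) {
            if (!live.get(i)) {
                dead.add(i);
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        // 0: goto 3   1: iconst_1   2: istore_2   3: iload_2   4: ireturn
        int[][] successors = { {3}, {2}, {3}, {4}, {} };
        System.out.println("dead positions: " + deadInstructions(successors));
    }
}
```

In the example, positions 1 and 2 sit between a forward goto and its target with nothing jumping into them, which is the situation the text describes.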

8.

A Summary of Results

Due to space limitations we only provide a brief description of our results, with the full account available in (Orlov and Sipper, 2009; Orlov and Sipper, 2010). To date, we have successfully tackled several problems:

Simple and complex symbolic regression: Evolve programs to approximate the simple polynomial $x^4 + x^3 + x^2 + x$ and the more complex polynomial $\sum_{i=1}^{9} x^i$.

Artificial ant problem: Evolve programs to find all 89 food pellets on the Santa Fe trail.

Intertwined spirals problem: Evolve programs to correctly classify 194 points on two spirals.

Array sum: Evolve programs to compute the sum of values of an integer array, along the way demonstrating FINCH’s ability to handle loops and recursion.

Tic-tac-toe: Evolve a winning program for the game, starting from a flawed implementation of the negamax algorithm. This example shows that programs can be improved.

Figure 1-6 shows two examples of Java programs evolved by FINCH.

(a)
    Number simpleRegression(Number num) {
        double d = num.doubleValue();
        return Double.valueOf(d + (d * (d * (d +
            ((d = num.doubleValue()) +
             (((num.doubleValue() * (d = d) + d) * d + d) * d + d) * d)
            * d) + d) + d) * d);
    }

(b)
    int sumlistrec(List list) {
        int sum = 0;
        if (list.isEmpty())
            sum = sum;
        else
            sum += ((Integer)list.get(0)).intValue() + sumlistrec(list.subList(1, list.size()));
        return sum;
    }

Figure 1-6. Examples of evolved programs for the degree-9 polynomial regression problem (a), and the recursive array sum problem (b). The Java code shown was produced by decompiling the respective evolved bytecode solutions.

9.

Concluding Remarks

A recent study commissioned by the US Department of Defense on the subject of futuristic ultra-large-scale (ULS) systems that have billions of lines of code noted, among others, that, “Judiciously used, digital evolution can substantially augment the cognitive limits of human designers and can find novel (possibly counterintuitive) solutions to complex ULS system design problems” (Northrop et al., 2006, p. 33). This study does not detail any actual research performed but attempts to build a road map for future research. Moreover, it concentrates on huge, futuristic systems, whereas our aim is at current systems of any size. Differences aside, both our work and this study share the vision of true software evolution. Turing famously (and wrongly...) predicted that, “in about fifty years’ time it will be possible, to programme computers [. . . ] to make them play the imitation game so well that an average interrogator will not have more than 70 per cent. chance of making the right identification after five minutes of questioning” (Turing, 1950). Recently, Harman wrote that, “. . . despite its current widespread use, there was, within living memory, equal skepticism about whether compiled code could be trusted. If a similar change of attitude to evolved code occurs over time. . . ” (Harman, 2010). We wish to offer our own prediction for fifty years hence, in the hope that we shall not be wrong: We believe that in about fifty years’ time it will be possible, to program computers by means of evolution. Not merely possible but indeed prevalent.

References Arcuri, Andrea (2009). Automatic Software Generation and Improvement Through Search Based Techniques. PhD thesis, University of Birmingham, Birmingham, UK.

FINCH: A System for Evolving Java (Bytecode)

15

Brameier, Markus and Banzhaf, Wolfgang (2007). Linear Genetic Programming. Number XVI in Genetic and Evolutionary Computation. Springer. Bruneton, Eric, Lenglet, Romain, and Coupaye, Thierry (2002). ASM: A code manipulation tool to implement adaptable systems (Un outil de manipulation de code pour la r´ealisation de syst`emes adaptables). In Adaptable and Extensible Component Systems (Syst`emes a` Composants Adaptables et Extensibles), October 17–18, 2002, Grenoble, France, pages 184–195. Engel, Joshua (1999). Programming for the JavaTM Virtual Machine. AddisonWesley, Reading, MA, USA. Gosling, James, Joy, Bill, Steele, Guy, and Bracha, Gilad (2005). The JavaTM Language Specification. The JavaTM Series. Addison-Wesley, Boston, MA, USA, third edition. Harman, Mark (2010). Automated patching techniques: The fix is in. Communications of the ACM, 53(5):108. Kotzmann, Thomas, Wimmer, Christian, M¨ossenb¨ock, Hanspeter, Rodriguez, Thomas, Russell, Kenneth, and Cox, David (2008). Design of the Java HotSpotTM client compiler for Java 6. ACM Transactions on Architecture and Code Optimization, 5(1):7:1–32. Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA. Langdon, W. B. and Poli, R. (2006). The halting probability in von Neumann architectures. In Collet, Pierre, Tomassini, Marco, Ebner, Marc, Gustafson, Steven, and Ek´art, Anik´o, editors, Proceedings of the 9th European Conference on Genetic Programming, volume 3905 of Lecture Notes in Computer Science, pages 225–237, Budapest, Hungary. Springer. Lindholm, Tim and Yellin, Frank (1999). The JavaTM Virtual Machine Specification. The JavaTM Series. Addison-Wesley, Boston, MA, USA, second edition. Miecznikowski, Jerome and Hendren, Laurie (2002). Decompiling Java bytecode: Problems, traps and pitfalls. In Horspool, R. 
Nigel, editor, Compiler Construction: 11th International Conference, CC 2002, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002, Grenoble, France, April 8–12, 2002, volume 2304 of Lecture Notes in Computer Science, pages 111–127, Berlin / Heidelberg. Springer-Verlag. Mizoguchi, Jun’ichi, Hemmi, Hitoshi, and Shimohara, Katsunori (1994). Production genetic algorithms for automated hardware design through an evolutionary process. In Proceedings of the First IEEE Conference on Evolutionary Computation, ICEC’94, volume 2, pages 661–664. Nakamura, Yukihiro, Oguri, Kiyoshi, and Nagoya, Akira (1991). Synthesis from pure behavioral descriptions. In Camposano, Raul and Wolf, Wayne Hendrix, editors, High-Level VLSI Synthesis, pages 205–229. Kluwer, Norwell, MA, USA.


Northrop, Linda et al. (2006). Ultra-Large-Scale Systems: The Software Challenge of the Future. Carnegie Mellon University, Pittsburgh, PA, USA.
O’Neill, Michael and Ryan, Conor (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, volume 4 of Genetic programming. Kluwer Academic Publishers.
Orlov, Michael and Sipper, Moshe (2009). Genetic programming in the wild: Evolving unrestricted bytecode. In Raidl, Günther et al., editors, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, July 8–12, 2009, Montréal, Québec, Canada, pages 1043–1050, New York, NY, USA. ACM Press.
Orlov, Michael and Sipper, Moshe (2010). Flight of the FINCH through the Java wilderness. IEEE Transactions on Evolutionary Computation. In press.
Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).
Spector, Lee and Robinson, Alan (2002). Genetic programming and autoconstructive evolution with the Push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
Turing, Alan Mathison (1950). Computing machinery and intelligence. Mind, 59(236):433–460.
Wong, Man Leung and Leung, Kwong Sak (2000). Data Mining Using Grammar Based Genetic Programming and Applications, volume 3 of Genetic Programming. Kluwer, Norwell, MA, USA.
Woodward, John R. (2003). Evolving Turing complete representations. In Sarker, Ruhul et al., editors, The 2003 Congress on Evolutionary Computation, CEC 2003, Canberra, Australia, 8–12 December, 2003, volume 2, pages 830–837. IEEE Press.

Chapter 2

TOWARDS PRACTICAL AUTOCONSTRUCTIVE EVOLUTION: SELF-EVOLUTION OF PROBLEM-SOLVING GENETIC PROGRAMMING SYSTEMS

Lee Spector
Cognitive Science, Hampshire College, Amherst, MA, 01002-3359 USA.

Abstract

Most genetic programming systems use hard-coded genetic operators that are applied according to user-specified parameters. Because it is unlikely that the provided operators or the default parameters will be ideal for all problems or all program representations, practitioners often devote considerable energy to experimentation with alternatives. Attempts to bring choices about operators and parameters under evolutionary control, through self-adaptive algorithms or meta-genetic programming, have been explored in the literature and have produced interesting results. However, no systems based on such principles have yet been demonstrated to have greater practical problem-solving power than the more-standard alternatives. This chapter explores the prospects for extending the practical power of genetic programming through the refinement of an approach called autoconstructive evolution, in which the algorithms used for the reproduction and variation of evolving programs are encoded in the programs themselves, and are thereby subject to variation and evolution in tandem with their problem-solving components. We present the motivation for the autoconstructive evolution approach, show how it can be instantiated using the Push programming language, summarize previous results with the Pushpop system, outline the more recent AutoPush system, and chart a course for future work focused on the production of practical systems that can solve hard problems.

Keywords: genetic programming, meta-genetic programming, autoconstructive evolution, Push, PushGP, Pushpop, AutoPush

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_2, © Springer Science+Business Media, LLC 2011

1. Introduction

The work described in this chapter is motivated both by features of biological evolution and by the requirements for the high-performance problem-solving systems of the future. Under common conceptions of biological evolution the variation of genotypes from parents to children, and hence the diversification of phenotypes from progenitors to their descendants, is essentially random prior to selection. Offspring vary randomly, it is said, and selection acts on the resulting diversity by allowing the better-adapted random variants to survive and reproduce. Such conceptions are held not only by the lay public but also by theorists such as Jerry Fodor and Massimo Piattelli-Palmarini who, in their book What Darwin Got Wrong, criticize Darwinian theory in part on the grounds that the random “generate and test” algorithm at its core is insufficiently powerful to account for the facts of natural history (Fodor and Piattelli-Palmarini, 2010). But diversification in nature, while certainly random in some respects, is also clearly non-random in several others. If one were to modify DNA molecules in truly random ways, considering all chemical bonds to be equally good candidates for breakage and re-connection, then one would not end up with DNA molecules at all but instead with some other sort of organic soup. Cellular machinery copies DNA, and repairs copying errors, in ways that allow for certain kinds of “errors” but only within tightly constrained bounds. At higher levels of organization variation is constrained by genetic regulatory processes, the mechanics of sexual recombination, cell division and development, and, at a much higher level of organization, by social structures that guide non-random mate selection. All of these constraints emerge from reproductive processes that have themselves evolved over the course of natural history. 
There is a large literature on such constraints, including a recent theory of “facilitated variation” (Gerhart and Kirschner, 2007), and summaries of the evolution of variation from pre-biotic Earth to the present (Maynard Smith and Szathmáry, 1999). Whether or not the evolved non-randomness of biological variation constitutes a significant critique of neo-Darwinism or of the historical Darwin, as claimed by Fodor and Piattelli-Palmarini, is beyond the scope of the present discussion. For our purposes, however, two related points should be made. First, while truly random variation, filtered by selection, may be too weak a mechanism to have produced the sequence of phenotypes observed over time in the historical record, it is possible for random variation, when acting on the reproductive mechanisms themselves, to produce variation mechanisms that are not purely random. This is presumably what happened in natural history. Second, this bootstrapping process, of the evolution of adaptive, not-entirely-random variation by means of the initially random variation of the variation mechanisms, might also be applied to evolutionary problem-solving technologies.

Why would we want to do this? One reason is that the problem-solving power of current evolutionary computing technologies is limited by the nature of the variation mechanisms that we build into these systems by hand. Consider, for example, the standard mutation operators used in genetic programming. Subtree replacement, applied uniformly to the nodes in a program tree (or uniformly to interior vs. leaf nodes with a specified probability), involving the replacement of subtrees with newly-generated random subtrees, provides a form of variation that leads to solutions in some but not all problem environments. This has led to the development of a wide range of alternative mutation operators; see, for example, the “Mutation Cookbook” section of (Poli et al., 2008, pp. 42–44). But which of these will be most helpful in which circumstances, and which others, perhaps not yet invented, may be needed to solve challenging new problems? The field currently has no satisfying answer to this question, which will become all the more pressing as genetic programming systems incorporate more expressive and heterogeneous program representations. In the context of such representations it may well make sense for different program elements or program locations to have different variation rates or procedures, and it will not be obvious, in advance, how to make these choices. The question will also become all the more pressing as genetic programming systems are applied to ever more complex problems, about which the system designers will have less knowledge and intuition. And the question will be raised with even greater urgency with respect to recombination operators such as crossover, for which there are even more open questions (e.g., about how to choose crossover partners) that currently require the user to make choices that may not be optimal.
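As a concrete reference point for the hand-built operators discussed above, the following is a minimal sketch of uniform subtree-replacement mutation. The nested-list tree encoding, the function and terminal sets, and all function names are assumptions made for illustration; they are not taken from any particular genetic programming system.

```python
import random

# Illustrative primitive sets (assumptions, not from the chapter).
FUNCTIONS = {"+": 2, "-": 2, "*": 2}   # operator name -> arity
TERMINALS = ["x", 1, 2]

def random_tree(rng, depth):
    """Grow a random expression tree encoded as nested lists: [op, arg1, arg2]."""
    if depth <= 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return [op] + [random_tree(rng, depth - 1) for _ in range(FUNCTIONS[op])]

def node_paths(tree, path=()):
    """Yield the path (tuple of child indices) of every node, root first."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the node at `path` is replaced."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy

def subtree_mutation(tree, rng, max_depth=3):
    """Pick one node uniformly at random and replace it with a fresh random subtree."""
    path = rng.choice(list(node_paths(tree)))
    return replace_at(tree, path, random_tree(rng, max_depth))

if __name__ == "__main__":
    parent = ["+", "x", ["*", "x", 2]]
    print(subtree_mutation(parent, random.Random(0)))
```

Every choice in this sketch (uniform node selection, the depth limit, the 0.3 terminal probability) is fixed by the designer, which is exactly the kind of decision that the text argues may not suit every problem.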
Two approaches to these general issues that have previously been explored in the literature are “self-adaptation” and “meta-genetic programming.” Many forms of self-adaptation have been investigated, both within genetic programming and in other areas of evolutionary computation (with many examples including (Angeline, 1995; Spears, 1995; Angeline, 1996; Eiben et al., 1999; MacCallum, 2003; Fry et al., 2005; Beyer and Meyer-Nieberg, 2006; Vafaee et al., 2008; Silva and Dignum, 2009)). In all of these systems the parameters of the evolutionary algorithm are varied and subjected to some form of selection, whether the variation and selection is accomplished by means of the overarching evolutionary algorithm, by a secondary evolutionary algorithm, or by some other machine learning technique. In some cases the parameters are adapted on an individual basis, while in others the self-adaptive system modifies global parameters that apply to an entire population. In general, however, these systems vary only pre-selected parameters of the variation operators in pre-specified ways, and they do not allow for the evolution of arbitrary methods of variation.
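To make the contrast concrete, here is a minimal sketch of individual-level self-adaptation, in which each individual carries its own mutation rate. The log-normal perturbation of the rate is borrowed from evolution strategies and is an assumption of this sketch, as are the bit-string genome, the bounds, and the parameter `tau`; it is not presented as the method of any of the systems cited above.

```python
import math
import random

def self_adaptive_mutate(genome, rate, rng, tau=0.2):
    """Mutate a bit-string genome using a mutation rate carried by the individual.

    The rate itself is perturbed first (log-normally) and then applied, so
    rates that tend to produce fitter children are inherited along with them.
    """
    new_rate = rate * math.exp(tau * rng.gauss(0.0, 1.0))
    new_rate = min(0.5, max(0.001, new_rate))   # keep the rate in a sane range
    child = [1 - bit if rng.random() < new_rate else bit for bit in genome]
    return child, new_rate

if __name__ == "__main__":
    child, rate = self_adaptive_mutate([0] * 20, 0.1, random.Random(1))
    print(child, rate)
```

Note what evolves here: a single number that tunes a fixed, hand-written operator. The form of the operator (independent bit flips) cannot change, which is the limitation the paragraph above points out.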


By contrast, the “meta-genetic programming” approach leverages the program-space search capabilities of genetic programming to search for variation operators (which are, after all, themselves programs) during the search for problem-solving programs (Schmidhuber, 1987; Kantschik et al., 1999; Edmonds, 2001; Tavares et al., 2004; Diosan and Oltean, 2009). These systems would appear to have more potential to evolve adaptive variation algorithms, but they have generally been subject to one or both of the following two significant limitations:

• The evolving genetic operators are not associated with specific evolving problem-solving programs; they are expected to apply to all evolving problem-solving programs equally well.

• The evolving genetic operators are restricted to being compositions of a small number of pre-designed components; many conceivable genetic operators will not be representable using these components.

The first of these limitations contrasts with some of the self-adaptive evolutionary algorithms mentioned previously, in which the values of parameters for genetic operators are encoded in individuals. That this “global” conception of the applicability of genetic operators might be a limitation should be evident from a cursory examination of the diversity of reproductive strategies in nature. For example, the reproductive strategies of the dandelion are quite different from those of the tiger, the oyster mushroom, and Escherichia coli; nobody would expect the strategies of any of these organisms to work particularly well for any of the others. Of course the diversity present in the Earth’s biosphere dwarfs that of any current genetic programming system, but it would nonetheless be quite surprising if the same genetic operators worked equally well across a genetic programming population with any significant diversity.
One could well imagine, for example, that a subset of the population might share one particular subtree in which a high degree of mutation is adaptive and a second subtree in which mutation is always deleterious. Other individuals in the population might lack either or both of these subtrees, or they might contain additional code that changes the effects of mutations within these particular subtrees.

The second of these limitations is probably mostly a reflection of the fact that most genetic programming representations limit the expressiveness of the programs that they can evolve more generally. Although several Turing complete representations have been described (for example, (Teller, 1994; Nordin and Banzhaf, 1995; Spector and Robinson, 2002a; Woodward, 2003; Yabuki and Iba, 2004; Langdon and Poli, 2006)), such representations are relatively rare and representations that can easily perform arbitrary transformations on variable-sized programs are rarer still. Nature appears to be quite flexible and inventive in the variation mechanisms that it employs (e.g., mechanisms involving gene duplication), and we can easily imagine cases in which genetic programming systems would benefit from the use of genetic operators that are not simple compositions of hand-designed operator components.

Another line of research that bears on the approach presented here generally appears in the artificial life literature. Systems such as Tierra (Ray, 1991), Avida (Ofria and Wilke, 2004), and SeMar (Suzuki, 2004) all involve the evolution of programs that are partially responsible for their own reproduction, and in which the reproductive mechanisms (including genetic operators) are therefore subject to variation and selection. However, in these systems diversification is generally driven by hand-designed “ancestor” replicators and/or by the effects of hand-designed mutation algorithms that are applied automatically to the results of all code manipulation operations. Furthermore, while some of these systems have been used to solve computational problems, their problem-solving power has been quite limited; they have been used to evolve simple logic gates and arithmetic functions, but they have not been applied to the kinds of difficult problems that genetic programming practitioners are interested in solving. This is not surprising, as these systems have generally been developed primarily to study biological evolution, not to solve difficult computational problems.

Additional related work has been conducted in the context of evolved self-reproduction (Taylor, 1999; Sipper and Reggia, 2001), although most of this work has been focused on the evolution of exact replication rather than the evolution of adaptive variation. An exception, and the closest work to that described below, is Koza’s work on the “Spontaneous Emergence of Self-Replicating and Evolutionarily Self-Improving Computer Programs” (Koza, 1994).
In that work Koza evolved programs that simultaneously solved problems (albeit simple Boolean problems) and produced variant offspring using template-based code self-modification in a “sea” or “Turing gas” of programs (Fontana, 1992).

This chapter describes an approach to self-adaptive genetic programming, called autoconstructive evolution, that combines several features of the approaches described above, with the long-term goal of producing a new generation of powerful problem solving systems. The potential advantage of the autoconstructive evolution approach is that it will allow variation mechanisms to co-evolve with the programs to which they are applied, thereby allowing the evolutionary system itself to adapt to its problem environments in significant ways. The autoconstructive evolution approach was first described in 2001 and 2002 (Spector, 2001; Spector, 2002; Spector and Robinson, 2002a; Spector and Robinson, 2002b), using the Pushpop system that leveraged features of the Push programming language for evolved programs. In the next section this earlier work is briefly described. The subsequent section describes more recent work on the approach, using better technology and a more explicit focus on the goal of high performance problem solving, implemented in a newer system called AutoPush. The final section of the chapter offers some brief conclusions.

2. Push and Pushpop

An autoconstructive evolution system was defined in (Spector and Robinson, 2002a) as “any evolutionary computation system that adaptively constructs its own mechanisms of reproduction and diversification as it runs.” In the context of the present discussion, however, that definition is too general, and a more specific definition that captures both the past and present usage would be “any genetic programming system in which the methods for reproduction and diversification are encoded in the individual programs themselves, and are thereby subject to variation and evolution.” The goal in the previous work, as in the work described here, is for the ways in which children are produced to be evolved along with the programs to which they will be applied. This is done by encoding the mechanisms for reproduction and diversification within the programs themselves, which must be capable of producing children and, in principle, of solving the problem to which the genetic programming system is being applied. The space of possible reproduction and diversification methods is vast and an ideal system would allow evolving programs to reach new and uncharted reaches of this space. Human-designed diversification mechanisms, including human-designed genetic operators, human-specified automatic mutation during code-manipulation, and human-written ancestor programs, should all be avoided. Of course it will generally be necessary for some features of any evolutionary system to be pre-specified; for example, all of the systems described here borrow several pre-specified elements of traditional genetic programming systems, including a generation-based evolutionary loop, a fixed-size population, and tournament selection with a pre-specified tournament size. The focus here is on the evolution of the means by which children are produced from parents, and it is this task for which we currently seek autoconstructive methods. 
A prerequisite for this approach is a program representation in which problem-solving functions and child-production functions can both be easily expressed. The Push programming language was originally designed specifically for this purpose (Spector, 2001). Push is a stack-based language roughly in the tradition of Forth, but for which each data type has its own stack. Instructions generally take their arguments from the appropriate stacks and push their results onto the appropriate stacks.¹ If an instruction requires arguments that are not present on the appropriate stacks when it is called then it does nothing (it acts as a “no-op”).

¹ Exceptions are instructions that draw their inputs from external data structures, for example instructions that access inputs, and instructions that act on external data structures, for example “developmental” instructions that add components to externally-developing representations of circuits or other structured objects.
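The per-type stacks and the no-op rule can be illustrated with a toy interpreter. This is a heavily simplified sketch and not the Push language itself: programs here are flat Python lists rather than nested parenthesized code, only integer and boolean stacks exist, and apart from `integer_mult` the instruction names are assumptions chosen for the example.

```python
def integer_add(stacks):
    ints = stacks["integer"]
    if len(ints) >= 2:                  # too few arguments: act as a no-op
        b, a = ints.pop(), ints.pop()
        ints.append(a + b)

def integer_mult(stacks):
    ints = stacks["integer"]
    if len(ints) >= 2:
        b, a = ints.pop(), ints.pop()
        ints.append(a * b)

def integer_lt(stacks):
    # Takes two integers, pushes its result onto the boolean stack.
    ints = stacks["integer"]
    if len(ints) >= 2:
        b, a = ints.pop(), ints.pop()
        stacks["boolean"].append(a < b)

def boolean_not(stacks):
    bools = stacks["boolean"]
    if bools:
        bools.append(not bools.pop())

INSTRUCTIONS = {
    "integer_add": integer_add,
    "integer_mult": integer_mult,
    "integer_lt": integer_lt,
    "boolean_not": boolean_not,
}

def run(program):
    """Execute a flat program; literals go to the stack of their own type."""
    stacks = {"integer": [], "boolean": []}
    for token in program:
        if isinstance(token, bool):     # check bool first: bool is a subclass of int
            stacks["boolean"].append(token)
        elif isinstance(token, int):
            stacks["integer"].append(token)
        elif token in INSTRUCTIONS:
            INSTRUCTIONS[token](stacks)
    return stacks

if __name__ == "__main__":
    print(run([2, 3, "integer_mult"]))                 # integer stack holds 6
    print(run(["boolean_not", True, 7, "integer_add"]))  # both instructions are no-ops
```

In the second example neither instruction finds the arguments it needs, so both are skipped and the literals simply end up on their own stacks. Any ordering of tokens is therefore a valid program in this sketch.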


These specifications mean that even though multiple data types may be present in a program no instruction will ever be called on arguments of the wrong type, regardless of its syntactic position in the program. Among other benefits, this means that there are essentially no syntax constraints on Push programs aside from a requirement that parentheses be balanced. This is particularly useful for systems in which child programs will be produced by evolving programs. One of Push’s most important features for autoconstructive evolution, and for genetic programming more generally, is the fact that “code” is a first-class data type. When a Push program is being executed the code that is queued for execution is stored on a special stack called the “exec” stack, and exec instructions in the program can manipulate the queued instructions in order to implement a wide variety of evolved control structures (Spector et al., 2005). Additional code stacks (including one called simply “code,” and in some implementations others with names such as “child”) can be used to store and manipulate code for a variety of other purposes. This feature has significant benefits for genetic programming even in a non-autoconstructive context (that is, even when standard, hard-coded genetic operators are used, as in the PushGP system), but here we focus on the use of Push for autoconstructive evolution. Space limits prevent full exposition of the Push language here; see (Spector et al., 2005) and the references therein for further details.²

² See also http://hampshire.edu/lspector/push.html.

The first autoconstructive evolution system built using Push, called Pushpop, can best be understood as an extension of a more-standard genetic programming system such as PushGP. In PushGP, when a program is being tested for fitness on a particular fitness case it is run and then the problem-solving outputs are collected from the relevant data stacks (typically integer or float) and tested for errors; Pushpop does this as well, but it also simultaneously collects a potential child from the child stack. If the problem to which the system is being applied involves n fitness cases then the testing of each program in the population will produce n potential children. In the reproductive phase tournaments are conducted among parents and children are selected randomly from the set of potential children of the winning parents. If there are insufficient children to fill the child population then newly generated random individuals are used.

In Pushpop, as in any autoconstructive evolution system, care must be taken to prevent the takeover of the population by perfect replicators or other pathological replicants. Because there is no automatic mutation in Pushpop a perfect replicator can rapidly fill the population with copies of itself, after which no evolution (and indeed no change at all) will occur. The production of perfect replicators in Push is generally trivial, because programs are pushed onto the code stack prior to execution. For this reason Pushpop includes a “no cloning” rule that specifies that exact clones will not be allowed into the child population. Settings are also available that prohibit children that are identical to any of their ancestors or any other individuals in the population. The “no cloning” rule forces programs to diversify in some way, but it does not dictate the mode or extent of diversification. The pathology of perfect replicators in nature was presumably overcome with the aid of vast stretches of time and over vast expanses of the Earth, within which perfect replicators may have arisen but later been eliminated when changes occurred to which they could not adapt. Our resources are much more constrained, however, and so we must proactively cull the individuals that we know cannot possibly evolve.

Programs in a Pushpop population can reproduce using evolved forms of multi-parent recombination, accessing other individuals in the population through the use of a variety of instructions provided for this purpose and using them in any computable way to produce their children (Spector and Robinson, 2002a). In fact, evolving Pushpop programs can access and then execute code from other individuals in the population, which means that evolved programs may not work correctly when executed outside of the populations within which they evolved. This is unfortunate from the perspective of a practitioner who is primarily interested in producing a program that will solve a particular problem, since the “solution” may require the entire population to work and it may be exceptionally difficult to understand. The mechanisms for population access in Pushpop are also somewhat complex, and the presence of these mechanisms makes it particularly difficult to analyze the performance of the system. For these reasons the new work described here does not allow executing programs to access the other programs in the population; see below for further discussion. Pushpop is capable of solving simple symbolic regression problems, and it has served as the basis for studies of the evolution of diversification.
For example, one study showed that evolving populations that produce adaptive Pushpop programs—that is, programs that actually solve the problems presented to the system—are reliably more diverse than is required by the “no cloning” rule alone (Spector, 2002). But Pushpop’s utility as a problem-solving system is limited, and the focus of the Push project in subsequent years has been on more traditional genetic programming systems such as PushGP. PushGP uses traditional genetic operators but the code-manipulation features of Push nonetheless provide benefits, for example by simplifying the evolution of novel control structures and modular architectures. More recently, however, the use of Push for autoconstructive evolution has been revisited in light of improvements to the Push language (Spector et al., 2005), the availability of substantially faster hardware, and a clarified focus on the long-term potential of autoconstructive evolution to solve problems that cannot be solved with hand-coded diversification mechanisms.
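The reproductive phase described in this section (tournaments among parents, children drawn from the winners' potential children, a “no cloning” filter, and random fill-in) can be sketched roughly as follows. The data layout, the function names, the per-slot tournament, and the point at which random individuals are substituted are all assumptions of this sketch; the text does not specify Pushpop at this level of detail.

```python
import random

def pushpop_style_generation(parents, rng, new_random_program, tournament_size=2):
    """Build the next population from parents and their potential children.

    `parents` is a list of dicts with keys:
      'program'  - the parent program (any comparable value),
      'error'    - its problem-solving error (lower is better),
      'children' - potential children collected during fitness testing.
    """
    next_population = []
    for _ in range(len(parents)):
        contestants = rng.sample(parents, tournament_size)
        winner = min(contestants, key=lambda p: p["error"])
        # "No cloning" rule: exact copies of the parent are not admitted.
        viable = [c for c in winner["children"] if c != winner["program"]]
        if viable:
            next_population.append(rng.choice(viable))
        else:
            # No usable child: fall back to a newly generated random individual.
            next_population.append(new_random_program(rng))
    return next_population

if __name__ == "__main__":
    parents = [
        {"program": ("a",), "error": 0.0, "children": [("a",), ("a", "b")]},
        {"program": ("c",), "error": 1.0, "children": [("c",)]},
    ]
    print(pushpop_style_generation(parents, random.Random(0), lambda r: ("rand",)))
```

In the usage example the second parent is a perfect replicator: even if it won a tournament it would contribute nothing, and a random individual would take the slot instead.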

3. Practical Autoconstructive Evolution

AutoPush is a new autoconstructive genetic programming system, a successor to Pushpop built on the more expressive version 3 of the Push programming language and designed with a more explicit focus on problem-solving power. To that end, several sources of inessential complexity in Pushpop have been removed to aid in the analysis of AutoPush runs and their results. AutoPush, like Pushpop, uses the basic generational loop of a standard genetic programming system and tournament selection with a pre-specified tournament size. Also like Pushpop it uses no pre-specified genetic operators, no ancestor replicators, and no pre-specified, automatic mutation. And like Pushpop it represents its programs in a Turing complete language so that children may be produced from parents by means of any computable function, modulo limits on execution steps or time. The current version of AutoPush is asexual (that is, parents must construct their children without having access to other programs in the population) because this eliminates complexity that may not be necessary and also simplifies analysis. Asexual programs may be run in isolation, both to solve the target problem and to study the range of children that they produce, and it is easy to store all of their ancestors (of which there will be only as many as there have been generations, while each individual in a sexually-reproducing population may have exponentially many ancestors). Future versions of AutoPush may reintroduce the possibility of recombination by reintroducing instructions that provide access to other individuals in the population; it is our intention to explore this option once the dynamics of the asexual version are better understood. It is also worth noting that the role of sex in biological diversification is a subject of considerable debate, and that asexual organisms diversify in complex and significant ways (Barraclough et al., 2003).
The processes by which programs are tested for problem-solving performance and used to produce children also differ between Pushpop and AutoPush. In Pushpop a potential child is produced for each fitness case, during the calculation of the problem-solving answer for that fitness case. This means that the number of children may depend on the number of fitness cases, which complicates analysis and also changes the way that the algorithm will perform on problems with different numbers of fitness cases. By contrast, in AutoPush no children are produced during fitness testing; any code left on the code stack after a fitness-testing run is ignored.³ Instead, when an individual is selected for autoconstructive reproduction in a tournament it is run again, with an input of 0, to produce a child program for the next generation.⁴

³ In Pushpop a special child stack is used for the production of children because the code stack is needed for the expression of evolved control structures in Push1, in which Pushpop was implemented. AutoPush is implemented in Push3, in which the new exec stack can be used for evolved control structures, freeing up the code stack for child production.

The most significant innovation in AutoPush is a new approach to constraints on birth and selection. Pushpop incorporates a “no cloning” rule but AutoPush goes further, adding more constraints on birth and selection to facilitate the evolution of adaptive diversification. Following the lead of meta-genetic programming developers who judged the fitness of evolving operators by “some measure of success in increasing the fitness of the population they operate on” (Edmonds, 2001), AutoPush incorporates factors based on the history of improvement within the ancestry of an individual. There are many ways in which one might measure “history of improvement” and many ways in which such measurements might be used in an evolutionary algorithm. For example, Smits et al. define “activity” or “potential to improve” as “the sum of the number of moves [in the program search space] that either improved the fitness or neutral moves that resulted in either no change in fitness or a change that was less than a given (dynamic) tolerance limit” (Smits et al., 2010). They use this measure to select candidates for further testing, crossover, and replacement. Additional comments on varieties and measures of self-improvement can be found in (Schmidhuber, 2006). In AutoPush the history of improvement is a scalar that summarizes the direction of problem-solving performance changes over the individual’s ancestry, with greater weight given to more recent changes (see formula below). It would be tempting to use this measure of improvement only in selection, possibly as a second objective, in addition to problem-solving performance, in the context of a multi-objective selection scheme.
But this, by itself, would not work well because selection cannot salvage a population that has become overrun by evolutionary “dead-enders” that can never produce improved descendants. Such dead-enders include not only cloners but also programs of several other categories. For example, consider a population full of programs that produce children that vary only in a subexpression that is never executed. This population is just as un-adaptive as a population of cloners, and it will do no good to select among its individuals on any basis whatsoever. Many other, more subtle categories of dead-enders exist, presenting challenges to any evolutionary system that relies only on selection to drive adaptation. The alternative approach taken in AutoPush is to prevent such dead-enders, when they can be detected, from reproducing at all, and to make room in the population for the children of improvers or at least for new random individuals.

⁴ The input value of 0 is arbitrary, and an input value is provided only for the minor convenience of avoiding re-definition of the input-pushing instruction. None of this should be significant as long as we are consistent in the ways that we conduct the autoconstructive reproduction runs.

As a result, we place a variety of constraints on birth and selection which act collectively to promote the evolution of adaptive diversification without specifying the form(s) that the actual diversification algorithms will take. More specifically, we conduct selection using tournaments, with comparisons within the tournament set computed as follows (see footnote 5):

Prefer reproductively competent parents: Individuals that were generated by other individuals beat randomly-generated individuals, and individuals that are “grandchildren” beat all others that are not. If both individuals being compared are grandchildren then the lengths of their lineages are not otherwise decisive.
Prefer parents with non-stagnant lineages: A lineage is considered stagnant if it has persisted for at least some preset number of generations (6 in the experiments described here) and if problem-solving performance has not changed in the most recent half of the lineage.
Prefer parents with good problem-solving performance: If neither reproductive competence nor lineage stagnation is decisive then select the parent that does a better job on the target problem.

The constraints on birth make use of two auxiliary definitions, for “improvement” and “code discrepancy.”

Improvement is a measure of how much the problem-solving performance of a lineage has improved, with greater weight being given to the most recent steps in the lineage. We first compute a normalized vector of changes in problem-solving performance, with improvements represented as 1, declines represented as −1, and repeats of the same value represented as 0. The overall improvement value is then calculated as the weighted average of the elements of this vector, with the weights produced by the following function (with decay factor $\delta = 0.1$ for the runs described here):

$$ w_{g = \text{current-gen}} = 1, \qquad w_{g-1} = w_g \, (1 - \delta) $$

Code discrepancy is a measure of the difference between two programs, calculated as the sum, over all unique expressions and sub-expressions in either of the programs, of the difference between the numbers of occurrences of the expression in the two programs.

In the context of these definitions we can state the constraints on birth as follows:

Prevent birth from lineages with at least a preset threshold number of ancestors (4 here) and an improvement of less than some preset minimum (0.1 here).
Prevent birth from lineages with at least a preset threshold number of ancestors (3 here) and constant discrepancy between parent and child in all generations.
Prevent birth from parents that received disqualifying fitness penalties, e.g. for nontermination or non-production of result values.
Prevent birth of children with sizes outside of the specified legal range (here 10–100 points).
Prevent birth of children that are identical to any of their ancestors.
Prevent birth of children that are identical to potential siblings; for this test the parent program is run a second time to produce an additional child that is used only for this comparison.

5 These constraints, and those mentioned for birth below, are stated using the numerical parameter values that were chosen, more or less arbitrarily, for the runs described here. Other values may perform better, and further study may provide guidance on setting these values or eliminating the parameters altogether.
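The two auxiliary measures defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the AutoPush implementation: the function names are hypothetical, programs are represented as nested tuples, performance is assumed to be an error where lower is better, the weighted average is assumed to be normalized by the sum of the weights, and the per-expression "difference" is assumed to be an absolute difference that also counts atoms and the whole program.

```python
from collections import Counter


def improvement(errors, decay=0.1):
    """Weighted average of the signs of performance changes along a lineage.

    `errors` runs from the oldest ancestor to the current individual and is
    assumed to be an error measure (lower is better).
    """
    # +1 for an improvement, -1 for a decline, 0 for a repeat of the same value.
    changes = [(prev > cur) - (prev < cur) for prev, cur in zip(errors, errors[1:])]
    if not changes:
        return 0.0
    # The most recent step has weight 1; each earlier step is scaled by (1 - decay).
    weights = [(1.0 - decay) ** age for age in range(len(changes) - 1, -1, -1)]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)


def subexpressions(program):
    """Yield the program itself and every nested expression or atom."""
    yield program
    if isinstance(program, tuple):
        for item in program:
            yield from subexpressions(item)


def discrepancy(a, b):
    """Sum of absolute differences in occurrence counts over all expressions."""
    count_a = Counter(subexpressions(a))
    count_b = Counter(subexpressions(b))
    return sum(abs(count_a[e] - count_b[e]) for e in count_a.keys() | count_b.keys())
```

Under these assumptions, a lineage with errors `[5, 4, 3]` has an improvement of 1, and two identical programs have a discrepancy of 0.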

4. Preliminary results

While the approach described here has not yet been shown to solve problems that are out of reach of more conventional genetic programming systems (indeed, it is currently weaker than the more-standard PushGP system), it has solved simple problems and produced illuminating data that may help to deepen our understanding. For example, in one run on a symbolic regression problem with the target function $y = x^3 - 2x^2 - x$ AutoPush found a solution that descended from the following randomly generated program (see footnote 6):

((code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult)

While it is difficult to tell from inspection how this program works, even for those experienced in reading Push code, the specific code instructions that are included provide clues about how it constructs children. For example, the code_rand instruction generates new random code, and the code_append instruction combines two pieces of code on the code stack. It is even more revealing to look at the code outputs from several runs of this program. In this case they are all of the form:

(RANDOM-INSTRUCTION (code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult)

where “RANDOM-INSTRUCTION” is some particular randomly chosen instruction. So this program’s reproductive strategy is merely to add a new, random instruction to the beginning of itself. This strategy continues for several generations, with several improvements in problem-solving performance, until something new and interesting happens. In the sixth generation a child is produced with a new list added, rather than just a new instruction, and it also has a new reproductive strategy: it adds something new to the beginning of both of its top-level lists. In other words, the sixth-generation individual is of this form:

(SUB-EXPRESSION-1 SUB-EXPRESSION-2)

where each “SUB-EXPRESSION-n” is a different sub-expression, and the seventh-generation children of this program are all of the form:

((RANDOM-INSTRUCTION-1 (SUB-EXPRESSION-1)) (RANDOM-INSTRUCTION-2 (SUB-EXPRESSION-2)))

where each “RANDOM-INSTRUCTION-n” is some particular randomly chosen instruction. One generation later the problem was solved, by the following program:

((integer_stackdepth (boolean_and code_map)) (integer_sub (integer_stackdepth (integer_sub (in (code_wrap (code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult))))))

6 Space limitations prevent full description of the run parameters or the instruction set; see (Spector et al., 2005) and the source code at http://hampshire.edu/lspector/gptp10 for more information.

This program inherits the altered reproductive strategy of its parent, augmenting both of its primary sub-expressions with new initial instructions in its children. In the run described above the only available code-manipulation instructions were those in the standard Push specification, which are modeled loosely on Lisp list-manipulation primitives. In some runs, however, we have added a “perturb” instruction that changes symbols and constants in a program to other random symbols or constants with a probability derived from an integer popped from the integer stack. Perturb, which was also used in some Pushpop runs, is itself a powerful mutation operator, but its availability does not dictate if or how or where it will be used; for example, it would be possible for an evolved reproductive strategy to use perturb on only one part of its code, or to use it with different probabilities on different parts of its code, or to use it conditionally or in conjunction with other code-manipulation instructions. With the perturb instruction included we have been able to solve somewhat more difficult problems such as the symbolic regression of $y = x^6 - 2x^4 + x^2 - 2$, and we are actively exploring application to more difficult problems and analysis of the resulting programs and lineages, with the hypothesis that more complex and adaptive reproductive strategies will emerge in the context of more challenging problem environments.
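To make the behaviour of a perturb-style operator concrete, the following is a rough Python sketch, not the Push instruction itself. Programs are represented as nested lists, and several details are assumptions made only for illustration: the text does not say how the probability is derived from the popped integer, so `rate_from_integer` is a hypothetical mapping; the replacement range for integer constants is arbitrary; and a replacement may happen to equal the original atom.

```python
import random


def rate_from_integer(n):
    # Hypothetical mapping from the popped integer to a probability in [0, 1];
    # the text does not specify how the probability is derived.
    return (abs(n) % 101) / 100.0


def perturb(program, n, instructions, rng):
    """Return a copy of `program` in which each atom is replaced with
    probability rate_from_integer(n); list structure is left unchanged."""
    rate = rate_from_integer(n)

    def visit(item):
        if isinstance(item, list):
            return [visit(child) for child in item]
        if rng.random() >= rate:
            return item                       # atom kept as it is
        if isinstance(item, int):
            return rng.randint(-10, 10)       # assumed range for integer constants
        return rng.choice(instructions)       # random replacement symbol

    return visit(program)


if __name__ == "__main__":
    rng = random.Random(1)
    parent = ["code_rand", ["integer_rot", 2], "exec_swap"]
    print(perturb(parent, 30, ["code_noop", "integer_mult"], rng))
```

Because the operator is just another instruction, an evolved program could apply it to a sub-list only, or with different integers in different places, which is the flexibility described above.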

5. Conclusions

The specific results reported here are preliminary, and the hypothesis that autoconstructive evolution will extend the problem-solving power of genetic programming is still speculative. However, the hypothesis has been refined, the means for testing it have been simplified, the principles that underlie it have been better articulated, and the prospects for analysis of incremental results have been improved. We have shown (again) that mechanisms of adaptive variation can evolve as components of evolving problem-solving systems, and we have described reasons to believe that the best problem-solving systems of the future will make use of some such techniques. Only further experimentation will determine whether and when autoconstructive evolution will become the most appropriate technique for solving difficult problems of practical significance.

Acknowledgments

Kyle Harrington, Paul Sawaya, Thomas Helmuth, Brian Martin, Scott Niekum and Rebecca Neimark contributed to conversations in which some of the ideas used in this work were refined. Thanks also to the GPTP reviewers, to William Josiah Erikson for superb technical support, and to Hampshire College for support for the Hampshire College Institute for Computational Intelligence.

References

Angeline, Peter J. (1995). Adaptive and self-adaptive evolutionary computations. In Palaniswami, Marimuthu and Attikiouzel, Yianni, editors, Computational Intelligence: A Dynamic Systems Perspective, pages 152–163. IEEE Press.
Angeline, Peter J. (1996). Two self-adaptive crossover operators for genetic programming. In Angeline, Peter J. and Kinnear, Jr., K. E., editors, Advances in Genetic Programming 2, chapter 5, pages 89–110. MIT Press, Cambridge, MA, USA.
Barraclough, Timothy G., Birky, C. William Jr., and Burt, Austin (2003). Diversification in sexual and asexual organisms. Evolution, 57:2166–2172.
Beyer, Hans-Georg and Meyer-Nieberg, Silja (2006). Self-adaptation of evolution strategies under noisy fitness evaluations. Genetic Programming and Evolvable Machines, 7(4):295–328.
Diosan, Laura and Oltean, Mihai (2009). Evolutionary design of evolutionary algorithms. Genetic Programming and Evolvable Machines, 10(3):263–306.
Edmonds, Bruce (2001). Meta-genetic programming: Co-evolving the operators of variation. Elektrik, 9(1):13–29. Turkish Journal Electrical Engineering and Computer Sciences.
Eiben, Agoston Endre, Hinterding, Robert, and Michalewicz, Zbigniew (1999). Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):124–141.
Fodor, Jerry and Piattelli-Palmarini, Massimo (2010). What Darwin got wrong. New York: Farrar, Straus and Giroux.
Fontana, Walter (1992). Algorithmic chemistry. In Langton, C. G., Taylor, C., Farmer, J. D., and Rasmussen, S., editors, Artificial Life II, pages 159–210. Addison-Wesley.
Fry, Rodney, Smith, Stephen L., and Tyrrell, Andy M. (2005). A self-adaptive mate selection model for genetic programming. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation, volume 3, pages 2707–2714, Edinburgh, UK. IEEE Press.
Gerhart, John and Kirschner, Marc (2007). The theory of facilitated variation. Proceedings of the National Academy of Sciences, 104:8582–8589.
Kantschik, Wolfgang, Dittrich, Peter, Brameier, Markus, and Banzhaf, Wolfgang (1999). Meta-evolution in graph GP. In Genetic Programming, Proceedings of EuroGP’99, volume 1598 of LNCS, pages 15–28, Göteborg, Sweden. Springer-Verlag.
Koza, John R. (1994). Spontaneous emergence of self-replicating and evolutionarily self-improving computer programs. In Langton, Christopher G., editor, Artificial Life III, volume XVII of SFI Studies in the Sciences of Complexity, pages 225–262. Addison-Wesley, Santa Fe, New Mexico, USA.
Langdon, William B. and Poli, Riccardo (2006). On turing complete T7 and MISC F–4 program fitness landscapes. In Arnold, Dirk V., Jansen, Thomas, Vose, Michael D., and Rowe, Jonathan E., editors, Theory of Evolutionary Algorithms, Dagstuhl, Germany. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany.
MacCallum, Robert M. (2003). Introducing a perl genetic programming system: and can meta-evolution solve the bloat problem? In Genetic Programming, Proceedings of EuroGP’2003, volume 2610 of LNCS, pages 364–373, Essex. Springer-Verlag.
Maynard Smith, John and Szathmáry, Eörs (1999). The origins of life. Oxford: Oxford University Press.
Nordin, Peter and Banzhaf, Wolfgang (1995). Evolving turing-complete programs for a register machine with self-modifying code. In Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA95), pages 318–325, Pittsburgh, PA, USA. Morgan Kaufmann.
Ofria, Charles and Wilke, Claus O. (2004). Avida: A software platform for research in computational evolutionary biology. Artificial Life, 10(2):191–229.
Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).
Ray, Thomas S. (1991). Is it alive or is it GA. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 527–534, University of California - San Diego, La Jolla, CA, USA. Morgan Kaufmann.
Schmidhuber, Jürgen (1987). Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universität München, Germany.
Schmidhuber, Jürgen (2006). Gödel machines: Fully self-referential optimal universal self-improvers. In Goertzel, B. and Pennachin, C., editors, Artificial General Intelligence, pages 119–226. Springer.
Silva, Sara and Dignum, Stephen (2009). Extending operator equalisation: Fitness based self adaptive length distribution for bloat free GP. In Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009, volume 5481 of LNCS, pages 159–170, Tuebingen. Springer.
Sipper, Moshe and Reggia, James A. (2001). Go forth and replicate. Scientific American, 265(2):27–35.
Smits, Guido F., Vladislavleva, Ekaterina, and Kotanchek, Mark E. (2010). Scalable symbolic regression by continuous evolution with very small populations. In Riolo, Rick L., McConaghy, Trent, and Vladislavleva, Ekaterina, editors, Genetic Programming Theory and Practice VIII. Springer.
Spears, William M. (1995). Adapting crossover in evolutionary algorithms. In Proceedings of the Fourth Annual Conference on Evolutionary Programming, pages 367–384. MIT Press.
Spector, Lee (2001). Autoconstructive evolution: Push, pushGP, and pushpop. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pages 137–146, San Francisco, California, USA. Morgan Kaufmann.
Spector, Lee (2002). Adaptive populations of endogenously diversifying pushpop organisms are reliably diverse. In Proceedings of Artificial Life VIII, the 8th International Conference on the Simulation and Synthesis of Living Systems, pages 142–145, University of New South Wales, Sydney, NSW, Australia. The MIT Press.
Spector, Lee, Klein, Jon, and Keijzer, Maarten (2005). The push3 execution stack and the evolution of control. In GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, volume 2, pages 1689–1696, Washington DC, USA. ACM Press.
Spector, Lee and Robinson, Alan (2002a). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
Spector, Lee and Robinson, Alan (2002b). Multi-type, self-adaptive genetic programming as an agent creation tool. In GECCO 2002: Proceedings of the Bird of a Feather Workshops, Genetic and Evolutionary Computation Conference, pages 73–80, New York. AAAI.
Suzuki, Hideaki (2004). Design Optimization of Artificial Evolutionary Systems. Doctor of informatics, Graduate School of Informatics, Kyoto University, Japan.
Tavares, Jorge, Machado, Penousal, Cardoso, Amilcar, Pereira, Francisco B., and Costa, Ernesto (2004). On the evolution of evolutionary algorithms. In Genetic Programming 7th European Conference, EuroGP 2004, Proceedings, volume 3003 of LNCS, pages 389–398, Coimbra, Portugal. Springer-Verlag.
Taylor, Timothy John (1999). From Artificial Evolution to Artificial Life. PhD thesis, Division of Informatics, University of Edinburgh, UK.
Teller, Astro (1994). Turing completeness in the language of genetic programming with indexed memory. In Proceedings of the 1994 IEEE World Congress on Computational Intelligence, volume 1, pages 136–141, Orlando, Florida, USA. IEEE Press.
Vafaee, Fatemeh, Xiao, Weimin, Nelson, Peter C., and Zhou, Chi (2008). Adaptively evolving probabilities of genetic operators. In Seventh International Conference on Machine Learning and Applications, ICMLA ’08, pages 292–299, La Jolla, San Diego, USA. IEEE.
Woodward, John (2003). Evolving turing complete representations. In Proceedings of the 2003 Congress on Evolutionary Computation, pages 830–837, Canberra. IEEE Press.
Yabuki, Taro and Iba, Hitoshi (2004). Genetic programming using a Turing complete representation: recurrent network consisting of trees. In de Castro, Leandro N. and Von Zuben, Fernando J., editors, Recent Developments in Biologically Inspired Computing, chapter 4, pages 61–81. Idea Group Publishing.

Chapter 3

THE RUBIK CUBE AND GP TEMPORAL SEQUENCE LEARNING: AN INITIAL STUDY

Peter Lichodzijewski and Malcolm Heywood
Faculty of Computer Science, Dalhousie University, 6050 University Av., Halifax, NS, B3H 1W5. Canada.

Abstract

The 3 × 3 Rubik cube represents a potential benchmark for temporal sequence learning under a discrete application domain with multiple actions. Challenging aspects of the problem domain include the large state space and a requirement to learn invariances relative to the specific colours present, the latter element of the domain making it difficult to evolve individuals that learn ‘macro-moves’ relative to multiple cube configurations. An initial study is presented in this work to investigate the utility of Genetic Programming capable of layered learning and problem decomposition. The resulting solutions are tested on 5,000 test cubes, of which specific individuals are able to solve up to 350 (7 percent) cube configurations and population wide behaviours are capable of solving up to 1,200 (24 percent) of the test cube configurations. It is noted that the design options for generic fitness functions are such that users are likely to face either reward functions that are very expensive to evaluate or functions that are very deceptive. Addressing this might well imply that domain knowledge is explicitly used to decompose the task to avoid these challenges. This would augment the described generic approach currently employed for layered learning/problem decomposition.

Keywords:

bid-based cooperative behaviours, problem decomposition, Rubik cube, symbiotic coevolution, temporal sequence learning.

1. Introduction

Evolutionary Computation as applied to temporal sequence learning problems generally assumes a phylogenetic framework for learning (Barreto et al., 2009). That is to say, policies are evaluated in their entirety on the problem domain before search operators are applied to produce new policies. Conversely, the ontogenetic approach to temporal sequence learning performs incremental refinement over a single candidate solution with respect to each state–action pair

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_3, © Springer Science+Business Media, LLC 2011


(Barreto et al., 2009). The latter is traditionally referred to as reinforcement learning. However, the distinction is often ignored, with reinforcement learning frequently used as a general label for any scenario in which the temporal credit assignment problem/delayed reward exists; not least because algorithms are beginning to appear which combine both phylogenetic and ontogenetic mechanisms of learning (Whiteson and Stone, 2006) (see footnote 1).

Examples of the temporal sequence learning problem appear in many forms, from control style formulations in which the goal is to learn a policy for controlling a robot or vehicle to games in which the general objective is to learn a strategy. In this work we are interested in the latter domain, specifically the case of learning a strategy to solve multiple configurations of the 3 × 3 Rubik cube. The problem of learning to solve Rubik cube configurations presents multiple challenges of wider interest to the temporal sequence learning community. Specific examples might include: 1) a large number of states ranging from trivial to demanding; 2) the problem is known to challenge human players; 3) a wide variation in start states exists, making the domain resilient to the self-play dynamics that might simplify board games such as backgammon (Pollack and Blair, 1998); 4) generalization to learn invariances/symmetries implicit in the game.

Approaches for finding solutions to scrambled configurations of a Rubik cube fall into one of two general categories: optimal solvers or macro-moves. In the case of solving a cube using a minimal (optimal) number of moves, extensive use is made of lookup tables to provide an exact evaluation function as deployed relative to a game tree summary of the cube state. Thus with respect to the eight corner cubies, the position and orientation of a single cubie is defined by the other 7; or $8! \times 3^7 = 88{,}179{,}840$ combinations.
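The corner count quoted above can be checked with a few lines of Python. The first part reproduces the $8! \times 3^7$ figure from the text. The extension to the full cube uses the standard counting argument for edge cubies (12! arrangements, $2^{11}$ free orientations, and a factor of one half for the parity constraint), which is not spelled out in this chapter and is included here only as a cross-check against the total number of legal states quoted in the text that follows.

```python
from math import factorial

# Corner cubies: 8! arrangements, and 3^7 orientations because the eighth
# orientation is fixed by the other seven.
corner_states = factorial(8) * 3 ** 7
print(corner_states)

# Edge cubies (standard counting argument, not taken from this chapter):
# 12! arrangements, 2^11 free orientations, halved by the parity constraint
# that links corner and edge permutations.
edge_factor = factorial(12) * 2 ** 11 // 2
total_states = corner_states * edge_factor
print(f"{total_states:.4e}")
```

The first value printed is the corner count, and the second is the total number of legal states in scientific notation.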
An iterative deepening breadth first search would naturally enumerate all such paths between goal and possible configurations for the corner cubies, forming a “pattern database” for later use. Most emphasis is therefore on the utilization of appropriate hash codings and graph symmetries to extend this enumeration over all possible legal states of a cube (in total there are $4.3252 \times 10^{19}$ legal states in a 3 × 3 cube). Such an approach recently identified an upper bound on the number of moves necessary to solve a worst case cube configuration as 26 (Kunkle and Cooperman, 2007).

1 In the following we will use the terms reinforcement and temporal sequence learning interchangeably, particularly where there is a previously established history of terminology e.g., as in hierarchical reinforcement learning.

Conversely, non-optimal methods rely on ‘macro-moves’ which establish the correct location for specific cubies without disrupting the location of previously positioned cubies. This is the approach most widely assumed by both human players and ‘automated’ solvers. Such strategies generally take 50 to 100 moves to solve a scrambled cube (Korf, 1997). The advantage this gives is that “general purpose” strategies might result that are appropriate to a wide range of scenarios, thus giving hope for identifying machine learning approaches that generalize. However, from the perspective of cube ‘state’ we can also see that once one face of a cube is completed the completion of the remaining faces will increasingly result in periods when the relative entropy of the cube will go up considerably (see footnote 2). Moreover, from a learning system perspective macro-moves need to be associated with any color combination to be effective, a problem that represents a requirement for learning invariances in a scalable manner.

Two previously published attempts to evolve solutions to the Rubik cube using evolutionary methods have taken rather different approaches to the problem. One attempts to evolve a generic strategy under little a priori information (Baum and Durdanovic, 2000); whereas the second concentrates on independently evolving optimal move sequences to each scrambled cube (El-Sourani et al., 2010), making use of domain knowledge to formulate appropriate constraints and objectives. In this work we assume the motivation of the former, thus the goal is to evolve a program able to provide solutions to as many scrambled cubes as possible.

The approach taken by Baum and Durdanovic (2000) employed a domain specific language under the Hayek framework for phylogenetic temporal sequence learning. A domain specific representation included the capability to ‘address’ specific faces of the cube and compare content with other faces as well as tests for the number of correct cubies. Two approaches to training were considered, either incrementally increasing the difficulty of cube configurations (e.g., one or two twist modifications relative to a solved cube) with binary (solved/not solved) feedback or cubes with a 100 twist ‘scrambling’ and feedback proportional to the number of correctly placed cubies.
Two different formulations for actions applied to a cube were also considered, with an action space of either three (90 degree turn of the front face, and row or column twists of the cube) or a fixed 3-dimensional co-ordinate frame on which a total of twelve 90 degree turns are applied (the scheme employed here). The most notable result from Hayek relative to this work was that up to 10 cubies could be correctly placed (one face and some of the middle). In effect Hayek was building macro-moves, but could not work through the construction of the remaining cube faces without destroying the work done on the first face.

In the following we develop the Symbiotic Bid-Based (SBB) GP framework and introduce a generic approach to layered learning that does not rely on the a priori definition of different goal functions for each ‘layer’ (as per the classical definition of Layered learning (Stone, 2007)). The use of layering is supported by the explicitly symbiotic approach adopted to evolution. A discussion of the domain specific requirements will then be made, with results establishing the relative success of the initial approach adopted here, and conclusions discussing future work in the Rubik cube domain.

2 Consider the case of completing the final face if all other cubies are correctly positioned.

2. Layered learning in Symbiotic Bid-Based GP

Symbiosis is a process by which symbionts of different species – in this case computer programs – receive sufficient ecological pressure to cooperate in a common host compartment (Heywood and Lichodzijewski, 2010). Over a period of time the symbionts will either develop the fitness of the host or not, as per natural selection. Thus, fitness evaluation takes place at the level of hosts, not at the level of individual symbionts; that is, there is a serial dependence between host and symbionts (Figure 3-1). In the case of this work, hosts are represented by an independent population – a Genetic Algorithm in this case – each host individual defining a compartment by indexing a subset of individuals from an independent symbiont population (Figure 3-1). However, rather than symbionts from the same host having their respective outcomes combined in some form of a voting policy – as in ensemble methods – we explicitly require each symbiont to learn the specific context in which they operate. To do so, symbionts assume the bid-based GP framework (Lichodzijewski and Heywood, 2007). Thus, each symbiont consists of a program and a scalar. The program is used to evolve a bidding strategy and the scalar expresses a domain dependent action, say class label or ‘turn right’. The program evolves whereas the action does not. Within the context of a host individual each symbiont executes their program on the current state of the world/training instance. The symbiont with the largest bid wins the right to present its action as the outcome from that host under the current state. Under a reinforcement learning domain this action would update the state of the world and the process repeats, with a new round of bidding between symbionts from the same host w.r.t. the updated state of the world. Fitness evaluation is performed over the worlds/training instances defined by the point population (Algorithm 1, Step 10). Competitive coevolution therefore facilitates the development of point and host populations, with co-operative coevolution developing the interaction between symbionts within a host. Competitive coevolution again appears between hosts in the host population (speciation) to maintain host diversity. This latter point is deemed particularly important in supporting ‘intrinsic motivation’ in the behaviours evolved (see footnote 3), where this represents a central tenet for hierarchical reinforcement learning in general (Oudeyer et al., 2007).
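The bidding round described above can be sketched as follows. This is only an illustration of the winner-takes-all rule, with hypothetical names: each symbiont's evolved bidding program is stood in for by a plain Python function, actions are integers, and ties are broken in favour of the first symbiont listed, which is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Symbiont:
    bid: Callable[[Sequence[float]], float]  # stand-in for the evolved bidding program
    action: int                              # fixed scalar action; it does not evolve


def host_action(symbionts: List[Symbiont], state: Sequence[float]) -> int:
    """Return the action of the symbiont that bids highest on `state`."""
    winner = max(symbionts, key=lambda s: s.bid(state))
    return winner.action


if __name__ == "__main__":
    host = [
        Symbiont(bid=lambda s: s[0], action=0),
        Symbiont(bid=lambda s: 1.0 - s[0], action=1),
    ]
    print(host_action(host, [0.9]))  # the first symbiont bids 0.9, the second 0.1
    print(host_action(host, [0.2]))  # the first symbiont bids 0.2, the second 0.8
```

In a temporal sequence learning setting the returned action would be applied to the world and `host_action` called again on the updated state.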

3 Intrinsic motivations or goals are considered to be those central to supporting the existence of an organism. In addition to behaviour diversity, the desire to reproduce is considered an intrinsic motivation/ goal. Conversely, ‘extrinsic motivations’ are secondary factors that might act in support of the original intrinsic factors such as food seeking behaviours, where these are learnt during the lifetime of the organism and might be specific to that particular organism.


Figure 3-1. Generic architecture of Symbiotic Bid-Based GP (SBB). A point population represents the subset of training scenarios over which a training epoch is performed. The host population conducts a combinatorial search for the best symbiont partnerships; whereas the symbiont population contains the bid-based GP individuals who attempt to learn a good context for their corresponding actions.

Algorithm 1 The core SBB training algorithm. P^t, H^t, and S^t refer to the point, host, and symbiont populations at time t.
 1: procedure Train
 2:     t = 0                                          ▷ Initialization
 3:     initialize point population P^t
 4:     initialize host population H^t (and symbiont population S^t)
 5:     while t ≤ t_max do                             ▷ Main loop
 6:         create new points and add to P^t
 7:         create new hosts and add to H^t (add new symbionts to S^t)
 8:         for all h_i ∈ H^t do
 9:             for all p_k ∈ P^t do
10:                 evaluate h_i on p_k
11:             end for
12:         end for
13:         remove points from P^t
14:         remove hosts from H^t (remove symbionts from S^t)
15:         t = t + 1
16:     end while
17: end procedure


The above Symbiotic Bid-Based GP or ‘SBB’ framework – as summarized by Figure 3-1 and Algorithm 1 – provides a natural scheme for layered learning by letting the content of the (converged) host population represent the actions for a new set of symbionts in a second application of the SBB algorithm; hereafter ‘Layered SBB’. The association between the next population of symbionts and the earlier population of hosts is explicitly hierarchical. However, there is no explicit requirement to re-craft fitness functions at each layering (although this is also possible). Instead, the reapplication of the SBB algorithm results in a second layer of hosts that learn how to combine previously learnt behaviours in specific contexts. The insight behind this is that SBB bidding policies under a temporal sequence learning domain are effectively evolving the conditions under which an action begins and ends its deployment. This is the general goal of hierarchical reinforcement learning. However, the SBB framework achieves this without also requiring an a priori formulation of the appropriate subtasks, the relation between subtasks, or a modified credit assignment policy; as is generally the case under hierarchical reinforcement learning (Oudeyer et al., 2007). In the following we summarize the core SBB algorithm, where this extends the original SBB framework presented in (Lichodzijewski and Heywood, 2008) and was applied elsewhere in a single layer supervised learning context (Lichodzijewski and Heywood, 2010a); the reader is referred to the latter for additional details regarding host–symbiont variation operators.
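The hierarchical relationship in Layered SBB can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' implementation: the `Host` class and the move labels are hypothetical, bidding programs are plain functions, and a second-layer decision is assumed to defer to the chosen lower-layer host for a single step before bidding again.

```python
class Host:
    """A team of (bid_function, action) pairs.

    In layer 0 an action is an atomic move; in a later layer an action is a
    host taken from the previous layer's (converged) host population.
    """

    def __init__(self, symbionts):
        self.symbionts = symbionts

    def act(self, state):
        # The symbiont with the largest bid wins the right to present its action.
        _, action = max(self.symbionts, key=lambda pair: pair[0](state))
        # A higher-layer action is itself a host: defer to it for the atomic move.
        return action.act(state) if isinstance(action, Host) else action


if __name__ == "__main__":
    # Two layer-0 hosts with hand-written bids over a toy two-element state.
    turn_front = Host([(lambda s: 1.0, "F+")])
    turn_upper = Host([(lambda s: s[0], "U+"), (lambda s: 0.5, "U-")])
    # A layer-1 host whose symbionts select between the layer-0 behaviours.
    layer1 = Host([(lambda s: s[1], turn_front), (lambda s: 1.0 - s[1], turn_upper)])
    print(layer1.act([0.9, 0.2]))
```

The second layer reuses the same bidding rule and needs no new fitness function in the sketch, which mirrors the point made above that layering does not require re-crafting goals for each layer.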

Point Population

As indicated in the above generic algorithm description, a competitive coevolutionary relationship is assumed between point and host population (Figure 3-1). Specifically, variation in the point population supports the necessary development in the host population. This implies that points have a fitness and are subject to variation operators. Thus, points are created in two phases on account of assuming a breeder style of replacement in which the worst $P_{gap}$ points are removed (Step 13) – hereafter all references to specific ‘Steps’ are w.r.t. Algorithm 1 – and a corresponding number of new points are introduced (Step 6) at each generation. New points are created under one of two paths. Either a point is created as per the routine utilized at initialization (no concept of a parent point) or offspring are initialized relative to a parent point, with the parent selected under fitness proportional selection. The relative frequency of each point creation scheme is defined by a corresponding probability, $p_{genp}$. Discussion of the point population variation operators is necessarily application dependent, and is therefore presented later (Section 3).

The evaluation function of Step 10 assumes the application of a domain specific reward that is a function of the interaction between point ($p_k$) and host ($h_i$) individuals, or $G(h_i, p_k)$. This is therefore defined later (Equation (3.6), Section 3) as a weighted distance relative to the ideal target state. The global / base point fitness, $f_k$, may now be defined relative to the count of hosts, $c_k$, within a neighbourhood (Lichodzijewski and Heywood, 2010a), or

$$ f_k = \begin{cases} 1 + \dfrac{1 - c_k}{H_{size}} & \text{if } c_k > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.1} $$

where $H_{size}$ is the host population size, and count $c_k$ is relative to the arithmetic mean $\mu_k$ of outcomes on point $p_k$ or,

$$ \mu_k = \frac{\sum_{h_i} G(h_i, p_k)}{H_{size}} \tag{3.2} $$

where $\mu_k \to 0$ implies that hosts are failing on point $p_k$ and $c_k$ is set to zero. Otherwise, $c_k$ is defined by the number of hosts satisfying $G(h_i, p_k) \geq \mu_k$; that is the number of hosts with an outcome reaching the mean performance on point $p_k$.

Equation (3.1) establishes the global fitness of a point. However, unlike classification problem domains, points frequently have context under reinforcement learning domains i.e., a geometric interpretation. This enables us to define a local factor by which the global reward is modulated in proportion to the relative ‘local’ uniqueness of the candidate point. Specifically, each point is rewarded in proportion to the distance from the point to a subset of its nearest neighbours using ideas from outlier detection (Harmeling et al., 2006). To do so, all the points are first normalized by the maximum pair-wise Euclidean distance – as estimated across the point population content, therefore limiting local reward to the unit interval – after which the following reward scheme is adopted:

1. The set of $K$ points nearest to $p_k$ is identified;
2. The local reward $r_k$ is calculated as,

$$ r_k = \left( \frac{\sum_{p_l} \left( D(p_k, p_l) \right)^2}{K} \right)^2 \tag{3.3} $$

where the summation is taken over the set of $K$ points nearest to $p_k$ and $D(\cdot, \cdot)$ is the application specific distance function (Equation (3.7), Section 3).
3. The corresponding final fitness for point $p_k$ is defined in terms of both global and local rewards or

$$ f'_k = f_k \cdot r_k \tag{3.4} $$


With the normalized fitness fk established we can now delete the worst performing Pgap points (Step 13).
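The point fitness calculation of Equations (3.1) to (3.4) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `eps` threshold used to decide that μk has reached zero, and the reading of Equation (3.3) as a squared mean of squared distances are all assumptions, and the distance function is passed in by the caller.

```python
def global_fitness(outcomes, eps=1e-12):
    """Global point fitness f_k (Equation 3.1) from the outcomes
    G(h_i, p_k) of every host on a single point p_k."""
    h_size = len(outcomes)
    mu_k = sum(outcomes) / h_size          # Equation (3.2)
    if mu_k <= eps:                        # assumed threshold for "mu -> 0"
        return 0.0                         # c_k = 0: all hosts fail on p_k
    # c_k counts the hosts reaching the mean outcome on this point.
    c_k = sum(1 for g in outcomes if g >= mu_k)
    return 1.0 + (1.0 - c_k) / h_size


def local_reward(k, points, num_neighbours, dist):
    """Local reward r_k (assumed reading of Equation 3.3) for points[k].
    Distances are scaled by the largest pair-wise distance in the
    population so that the reward stays within the unit interval."""
    d_max = max(dist(a, b) for a in points for b in points)
    if d_max == 0:
        return 0.0
    scaled = sorted(dist(points[k], p) / d_max
                    for j, p in enumerate(points) if j != k)
    nearest = scaled[:num_neighbours]      # the K nearest neighbours
    return (sum(d * d for d in nearest) / num_neighbours) ** 2


def final_fitness(k, points, outcomes, num_neighbours, dist):
    """Final point fitness: global fitness modulated by local reward (3.4)."""
    return global_fitness(outcomes) * local_reward(k, points, num_neighbours, dist)
```

Under this sketch a point that only one host handles well receives the highest global fitness (1.0), while a point that every host handles equally well scores close to the minimum non-zero value, 1/Hsize.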

Host and Symbiont Population Hosts are also subject to the removal and addition of a fixed number of Hgap individuals per generation, Steps 14 and 7 respectively. However, in order to also promote diversity in the host population behaviours, we assume a fitness sharing formulation. Thus, shared fitness, si of host hi has the form,

$$ s_i = \sum_{p_k} \left( \frac{G(h_i, p_k)}{\sum_{h_j} G(h_j, p_k)} \right)^3 \tag{3.5} $$

Thus, for point pk the shared fitness score si re-weights the reward that host hi receives on pk relative to the reward on the same point as received by all hosts. As per the earlier comments regarding the role of fitness sharing in supporting ‘intrinsic motivation,’ a strong bias for diversity is provided through the cubic power. Evaluation takes place at Step 10, thus all hosts, hi , are evaluated on all points, pk . Once the shared score for each host is calculated, the Hgap lowest ranked hosts are removed. Any symbionts that are no longer indexed by hosts are considered ineffective and are therefore also deleted. Thus, the symbiont population size may dynamically vary, with variation operators having the capacity to add additional symbionts (Lichodzijewski and Heywood, 2010a), whereas the point and host populations are of a fixed size.
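As a rough sketch of the fitness sharing and host removal just described, the following computes the shared score of Equation (3.5) from a matrix of outcomes and drops the lowest ranked hosts. The function names and the matrix layout `G[i][k]` are illustrative assumptions, as is the choice to skip points on which every host scores zero.

```python
def shared_fitness(G):
    """Shared fitness s_i (Equation 3.5) for every host.
    G[i][k] holds the outcome G(h_i, p_k)."""
    n_hosts, n_points = len(G), len(G[0])
    # Total reward collected on each point by all hosts.
    totals = [sum(G[j][k] for j in range(n_hosts)) for k in range(n_points)]
    return [
        sum((G[i][k] / totals[k]) ** 3
            for k in range(n_points) if totals[k] > 0)  # assumption: skip dead points
        for i in range(n_hosts)
    ]


def remove_worst_hosts(hosts, scores, h_gap):
    """Return the hosts that survive removal of the h_gap lowest ranked."""
    ranked = sorted(range(len(hosts)), key=lambda i: scores[i], reverse=True)
    survivors = sorted(ranked[:len(hosts) - h_gap])   # keep original order
    return [hosts[i] for i in survivors]
```

With two hosts and two points, a host that is the only one rewarded on a point receives the full contribution of 1 for that point, whereas a reward split evenly between two hosts contributes only (1/2)^3 = 0.125 to each; this is the diversity bias provided by the cubic power.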

3.

Domain specific design decisions

Cube representation and actions The representation assumed directly indexes all 54 facelets comprising the 3 × 3 Rubik cube. Indexing is sequential, beginning at the centre face, with cubie colours differentiated in terms of integers over the interval [0, ..., 5]. Such a scheme is simplistic, with no explicit support for indicating which facelets are connected to make corners or edges. Actions in layer 0 define a 90 degree clockwise or counter-clockwise twist of each face; there are 6 faces, resulting in a total of 12 actions. When additional layers are added under SBB, the population of host behaviours from the previous population represents the set of candidate actions. As such, additional layers attempt to evolve new contexts for previously evolved behaviours, i.e., to build larger macro-moves.
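The flat 54-facelet state and the 12 layer 0 actions might be laid out as below. This is only a sketch of the data structures: the face-by-face ordering of facelets and the (face, direction) encoding of actions are assumptions, and the facelet permutation that each twist performs is deliberately left out.

```python
NUM_FACES = 6
FACELETS_PER_FACE = 9

def solved_cube():
    """54 facelets stored face by face; colours are integers 0..5."""
    return [colour for colour in range(NUM_FACES)
            for _ in range(FACELETS_PER_FACE)]

# One clockwise (+1) and one counter-clockwise (-1) quarter twist per face.
ACTIONS = [(face, direction)
           for face in range(NUM_FACES) for direction in (+1, -1)]

def is_solved(cube):
    """A cube is solved when every face shows a single colour."""
    return all(
        len(set(cube[f * FACELETS_PER_FACE:(f + 1) * FACELETS_PER_FACE])) == 1
        for f in range(NUM_FACES)
    )
```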


Reward and distance functions The reward function applies a simple weighting scheme to the number of quarter turn twists (i.e., actions) necessary to move the final cube state to a solved cube. Naturally, such a test becomes increasingly expensive as the number of moves applied in the ‘search’ about the final cube state increases. Hence, the search is limited to testing for up to 2 moves away from the solution, resulting in the following reward function,

$$ G(h_i, p_k) = \frac{1}{\left( 1 + D(s_f, s^*) \right)^2} \tag{3.6} $$

where sf is the final state of the cube relative to cube configuration pk and sequence of moves defined by host hi; s∗ is the ideal solved cube configuration; and D(s2, s1) defines the weighted distance function, or

$$ D(s_2, s_1) = \begin{cases} 0, & \text{when 0 quarter twists match state } s_2 \text{ with } s_1 \\ 1, & \text{when 1 quarter twist matches state } s_2 \text{ with } s_1 \\ 4, & \text{when 2 quarter twists match state } s_2 \text{ with } s_1 \\ 16, & \text{when} > 2 \text{ quarter twists match state } s_2 \text{ with } s_1 \end{cases} \tag{3.7} $$

Naturally, curtailing the ‘look-ahead’ to 2 quarter turn twists from the presented solution casts the fitness function into that of a highly deceptive ‘needle in a haystack’ style reward i.e., feedback is only available when you have all but provided a perfect solution. Adding additional twist tests however would result in tens of thousands of cube combinations potentially requiring evaluation before fitness could be defined. Other functions such as counting the number of correct facelets or cube entropy generally appeared to be less informative. The utility of combined metrics or a priori defined constraints might be of interest in future work.
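To make the shape of this reward concrete, the sketch below evaluates Equations (3.6) and (3.7) given the outcome of the bounded look-ahead. The function names and the convention of passing `None` when no match is found within 2 quarter twists are assumptions made for illustration.

```python
# Weighted distance for 0, 1 and 2 quarter twists (Equation 3.7).
DISTANCE_BY_TWISTS = {0: 0, 1: 1, 2: 4}

def weighted_distance(min_twists):
    """min_twists: quarter twists needed to reach the solved cube, or
    None (or any value above 2) when the look-ahead finds no match."""
    return DISTANCE_BY_TWISTS.get(min_twists, 16)

def reward(min_twists):
    """Reward G(h_i, p_k) of Equation (3.6)."""
    return 1.0 / (1.0 + weighted_distance(min_twists)) ** 2
```

The four possible rewards are 1, 1/4, 1/25 and 1/289, so every final state more than 2 twists from the solution receives the same small value, which is the source of the ‘needle in a haystack’ behaviour noted above.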

Symbiont representation Symbionts take the form of a linear GP representation, with instruction set for the Bid-Based GP individuals consisting of the following generic set of operators {+, −, ×, ÷, ln(·), cos(·), exp(·), if }. The conditional operator ‘if ’ applies an inequality operator to two registers and interchanges the sign of the first register if its value is smaller than the second. There are always 8 registers and a maximum of 24 instructions per symbiont.
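A small interpreter illustrates how such a symbiont program could be executed. Only the operator set, the 8 registers, the 24 instruction limit and the behaviour of ‘if’ come from the text; the two-operand instruction format `(op, x, y)`, the handling of division by zero and ln(0), and the clamp on exp are assumptions of this sketch, which also makes no attempt to handle overflow elsewhere.

```python
import math

NUM_REGISTERS = 8
MAX_PROG_SIZE = 24

def run_symbiont(program, registers):
    """Execute a list of (op, x, y) instructions on a copy of the registers.
    Binary operators compute R[x] = R[x] op R[y]; unary operators compute
    R[x] = op(R[y]). Both conventions are assumed, not taken from the source."""
    if len(program) > MAX_PROG_SIZE or len(registers) != NUM_REGISTERS:
        raise ValueError("program or register file has the wrong size")
    r = list(registers)
    for op, x, y in program:
        if op == "+":
            r[x] += r[y]
        elif op == "-":
            r[x] -= r[y]
        elif op == "*":
            r[x] *= r[y]
        elif op == "/":
            r[x] = r[x] / r[y] if r[y] != 0 else r[x]   # assumed protection
        elif op == "ln":
            r[x] = math.log(abs(r[y])) if r[y] != 0 else 0.0   # assumed protection
        elif op == "cos":
            r[x] = math.cos(r[y])
        elif op == "exp":
            r[x] = math.exp(min(r[y], 50.0))            # assumed clamp
        elif op == "if":
            # Flip the sign of R[x] when it is smaller than R[y].
            if r[x] < r[y]:
                r[x] = -r[x]
        else:
            raise ValueError("unknown operator: %r" % (op,))
    return r
```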

Point initialization and offspring Initialization of points – cube configurations used during evolution (Step 3) – takes the form of: (1) uniform sampling from the interval [1, ..., 10] to define the number of twists applied to a solved cube; (2) stochastic selection of the


sequence of quarter twist actions used to ‘scramble’ the cube, and; (3) test for a return to the solved cube configuration (in which case the quarter twist step is repeated). Thereafter, new points introduced during breeding (Step 6) follow one of two scenarios: adding twists to a parent point to create a child with probability pgenp, or creating a new point as per the aforementioned point initialization algorithm with probability 1 − pgenp. The parent-wise creation of point offspring is governed by the following process:

1. Select parent point, pi ∈ P^t, under fitness proportional selection (point fitness defined by Equation (3.4), Section 2);

2. Define the number of additional twists, wi, applied to create the child from the parent in terms of a normal p.d.f., or

$$ w_i = \mathrm{abs}(N(0, \sigma_{genTwist})) + 1 \tag{3.8} $$

where N(0, σgenTwist) is a normal p.d.f. with zero mean and variance σgenTwist. Naturally, this is rounded to the nearest integer value;

3. Until the twist limit (wi) is reached, select faces and clockwise / counter-clockwise twists with uniform probability relative to the parent cube configuration, pi;

4. Should the resulting cube be a solved cube, the previous step is repeated.
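The two point creation paths can be sketched as follows. For brevity a point is represented here by the list of twists applied to a solved cube, which is an assumption of this example; the check for an accidental return to the solved cube (step 4) is omitted because it requires a cube simulator. The parameter σgenTwist is treated as a variance, as stated in the text, so its square root is passed to the Gaussian sampler.

```python
import math
import random

# 12 quarter twist actions: (face, direction), as assumed for this sketch.
ACTIONS = [(face, direction) for face in range(6) for direction in (+1, -1)]

def init_point(rng):
    """Initialization path: 1 to 10 random quarter twists of a solved cube."""
    return [rng.choice(ACTIONS) for _ in range(rng.randint(1, 10))]

def num_additional_twists(var_gen_twist, rng):
    """Equation (3.8): |N(0, variance)| + 1, rounded to an integer."""
    std = math.sqrt(var_gen_twist)
    return int(round(abs(rng.gauss(0.0, std)))) + 1

def select_parent(points, fitness, rng):
    """Fitness proportional selection; uniform if all fitness is zero."""
    if sum(fitness) <= 0:
        return rng.choice(points)
    return rng.choices(points, weights=fitness, k=1)[0]

def breed_point(points, fitness, p_genp, var_gen_twist, rng):
    """Create one new point by either of the two paths."""
    if rng.random() < p_genp:
        parent = select_parent(points, fitness, rng)
        w_i = num_additional_twists(var_gen_twist, rng)
        return parent + [rng.choice(ACTIONS) for _ in range(w_i)]
    return init_point(rng)
```

Because of the `+ 1` term, a child always differs from its parent by at least one additional twist.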

4.

Results

Parameterization Runs are performed over 60 initializations for both the Layered SBB (two layers) and the single layer SBB base cases. The latter are parameterized to provide the same number of fitness evaluations / upper bound on the number of instructions executed as per the total Layered SBB requirement. In this work, this implies a limit of 72000 evaluations or a maxProgSize limit of 36 under the single layer baseline; hereafter ‘big prog’. Similar reasoning brings about a team size limit (ω) of 36 under the single layer SBB baseline; hereafter ‘big team’. Relative to the sister work in which the current SBB formulation was applied to data sets from the supervised learning domain of classification (Lichodzijewski and Heywood, 2010a), three additional parameters are introduced for point generation (Section 2): (1) outlier parameter K = 13; (2) the probability of creating points pgenp = 0.9; and, (3) the variance for defining the number of additional twists necessary to create an offspring from a parent point σgenTwist = 3. All other parameters are unchanged relative to those of the classification study (Table 3-1).


Table 3-1. Parameterization at Host (GA) and Symbiont (GP) populations. As per Linear GP, a fixed number of general purpose registers are assumed (numRegisters) and variable length programs subject to a max. instruction count (maxProgSize).

Host (solution) level
Parameter        Value     Parameter        Value
tmax             1 000     ω                24
Psize, Hsize     120       Pgap, Hgap       20, 60
pmd              0.7       pma              0.7
pmm              0.2       pmn              0.1

Symbiont (program) level
Parameter        Value     Parameter        Value
numRegisters     8         maxProgSize      24
pdelete, padd    0.5       pmutate, pswap   1.0

Sampled Test Set Post training test performance is evaluated w.r.t. 5,000 unique ‘random’ test cubes, created as per the point initialization algorithm. Table 3-2 summarizes the distribution of cubes relative to the number of twists used to create them. A combined violin / quartile box plot is then used to express the total number of cube configurations solved. Figures 3-2 and 3-3 summarize this in terms of a single champion individual from each run⁴ and corresponding cumulated population wide performance. It is immediately apparent that the population wide behaviour (Figure 3-3) provides a significant source of useful diversity relative to that of the corresponding individual-wise performance (Figure 3-2). This is a generic property of fitness sharing implicit in the base SBB algorithm; Equation (3.5). However, it is also clear that under SBB 1 – in which second layer symbionts assume the hosts from layer 0 as their actions – the champion individuals are unable to directly build on the cumulative population wide behaviour from SBB 0. Conversely, under the case of real-valued reinforcement problem domains – such as the truck backer-upper (Lichodzijewski and Heywood, 2010b) – SBB 1 individuals were capable of producing champions that subsumed the SBB 0 population-wise performance. We attribute this to the more informative fitness function available under the truck backer-upper domain than that available under the Rubik cube. Relative to the non-layered SBB base cases, no real trend appears under the individual-wise performance (Figure 3-2). Conversely, under the cumulative population wide behaviour (Figure 3-3), SBB 1 provides a significant

⁴ Identified post training on an independent validation set generated as per the stochastic process used to identify the independent test set.


Table 3-2. Distribution of test cases. Samples selected over 1 to 10 random twists relative to solved cube resulting in 5,000 unique test configurations.

Number of twists   # of test cases     Number of twists   # of test cases
1                  9                   6                  662
2                  86                  7                  640
3                  403                 8                  728
4                  527                 9                  673
5                  588                 10                 683


Figure 3-2. Total test cases solved by single best individual per run under SBB with and without layering under the stochastic sampling of 5,000 1 to 10 twist cubes. ‘SBB 0’ and ‘SBB 1’ denote first and second layer Layered SBB solutions. ‘big team’ and ‘big prog’ represent single layer SBB runs with either larger host or symbiont instruction limits.


Figure 3-3. Total test cases solved by cumulated population wide performance per run under SBB with and without layering under the stochastic sampling of 5,000 1 to 10 twist cubes. ‘SBB 0’ and ‘SBB 1’ denote first and second layer Layered SBB solutions. ‘big team’ and ‘big prog’ represent single layer SBB runs with either larger host or symbiont instruction limits.


Table 3-3. Two-tailed Mann-Whitney test comparing total solutions under the Sampled Test Set provided by Layered SBB (second level) against single layer SBB parameterizations (big team (SBB-bt) and big program (SBB-bp)). The table reports p-values for the pair-wise comparison of distributions from Figures 3-2 and 3-3. Cases where the Layered SBB medians are higher (better) than non-layered SBB medians are noted with a .

Test Case                 Champion individual   Population wide
Layered SBB vs SBB-0      0.002499              3.11e-15
Layered SBB vs SBB-bt     0.1519                1.003e-10
Layered SBB vs SBB-bp     0.5566                0.0002617

improvement as measured in terms of a two-tailed Mann-Whitney test with 0.01 significance level (Table 3-3), effectively identifying the most consistently effective solutions. This appears to indicate that Layered SBB is able to build configuration specific sub-sets of Rubik cube solvers; that is to say, the strategies for solving cube configurations are not colour invariant. Specifically, the macro moves learnt at SBB 0 cannot be generalized over all permutations of cube faces. Thus, at SBB 1, subsets of hosts from SBB 0 can be usefully combined. However, this only results in the median performance improving by approximately 50 (200) test cases between layers 0 and 1 under single champion (respectively population-wise) test counts. Overall, neither increasing the instruction count limit per symbiont nor increasing the maximum limit on the number of symbionts per host is as effective as layering at leveraging the performance from individual-wise to population wide performance.
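For readers unfamiliar with the test, a two-tailed Mann-Whitney comparison of two sets of per-run solution counts can be sketched in plain Python as below. This uses the normal approximation with a tie correction and no continuity correction; the source does not state which implementation or correction was used, so the p-values it produces need not match Table 3-3 exactly, and the function name is an assumption.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-tailed p-value) using the
    normal approximation with tie correction."""
    n1, n2 = len(a), len(b)
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0       # mean of ranks i+1 .. j
        for t in range(i, j):
            ranks[t] = average_rank
        size = j - i
        tie_term += size ** 3 - size
        i = j
    rank_sum_a = sum(rank for rank, (_, group) in zip(ranks, combined)
                     if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0                        # all observations identical
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_a, p_value
```

A comparison would then be declared significant at the level used in the text when the returned p-value falls below 0.01.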

Exhaustive test set A second test set is designed consisting of all 1, 2 and 3 quarter twist cube configurations – consisting of 12, 114 and 1,068 unique test cubes respectively.⁵ Naturally, there is no a priori bias towards solving these during training, cubes being configured stochastically relative to points selected under fitness proportional selection. Figure 3-4 summarizes this as a percentage of the number of 1, 2 and 3 twist configurations solved by the single best individual in each run.⁶ The impact of layering is again evident, both from a consistency perspective and in terms of incremental improvements to the number of cases solved with each additional layer. Relative to the baseline single layer models, it is interesting to note that both ‘SBB big team’ and ‘SBB big prog’ had difficulty consistently solving the 1 twist configurations, whereas all SBB 1 first quartile performance

⁵ These counts are somewhat lower than those reported in (Korf, 1997) because we do not include 180° twists in the set of permitted actions.
⁶ The same ‘champion’ individual as identified under the aforementioned validation sample a priori to application of the sampled test set.


Figure 3-4. Percent of cases solved by single best SBB individuals as estimated under the exhaustive enumeration of 1, 2 and 3 quarter twist test cases. SBB 0 and SBB 1 denote the first and second layer solutions under Layered SBB; ‘big team’ and ‘big prog’ denote the base case SBB configurations without layering.


Figure 3-5. Number of moves used by champion individual to solve 1-, 2- and 3-twist points. ‘SBB 1’ is the second layer from Layered SBB, ‘SBB-bt’ and ‘SBB-bp’ denote the corresponding single layer SBB big team and big program parameterizations.

corresponds to all test cases solved. Of the two baseline configurations, ‘SBB big prog’ was again the more effective, implying that more complexity in the symbionts was more advantageous than larger host–symbiont capacity. Finally, we can also review the (mean) number of twists used to provide solutions to each test configuration (Figure 3-5). The resulting distributions are grouped by the original twist count. The move counts are averaged over all cases solved by an individual; thus, although some, say, 1 twist test cases might be solved in one twist, cases that used three moves would naturally increase the average move count above the ideal. Application of a two-tailed Mann-Whitney test indicates that the ‘SBB 1’ move counts are lower than the ‘SBB-bp’ (‘big program’) move counts on 2- and 3-twist test cases at a 0.01 significance level (Table 3-4). Thus, although Layered SBB and SBB big program solved a similar total number of test cases (Figure 3-4), Layered SBB is able to solve them using a statistically significantly lower number of moves. Conversely, SBB big team was not able to solve as many test cases, but when it did provide solutions, a similar number of moves to Layered SBB were used.


Figure 3-6. Number of symbionts per host over SBB runs.

Table 3-4. Two-tailed Mann-Whitney test results comparing solution move counts for champion individuals with Layered SBB (second level) against single layer SBB parameterizations (big team (SBB-bt) and big program (SBB-bp)). The table reports p-values for the pair-wise comparison of distributions from Figure 3-5. Cases where the single layer SBB medians are higher (worse) than Layered SBB medians are noted with a .

Test Case                 1-twist    2-twist     3-twist
Layered SBB vs SBB-bt     0.4976     0.1374      0.0534
Layered SBB vs SBB-bp     0.02737    0.001951    0.0007957


Figure 3-7. Number of instructions per host over SBB runs.

Model complexity Finally, we can also consider model complexity, post intron removal. Relative to the typical number of symbionts utilized per host (Figure 3-6), layer 0 clearly utilizes more symbionts per host than layer 1. This implies that at layer 1 there are 5 to 8 hosts from layer 0 being utilized. As indicated in Section 2, this is possible because each of the hosts from layer 0 is now associated with a symbiont bidding behaviour as evolved at level 1. Further analysis will be necessary to identify what the specific patterns of behaviour associated with these combinations of hosts represent. Both base cases appear to use more symbionts per host, understandable given that they do not have the capacity to make use of additional layers. The same bias towards simplicity again appears relative to instruction count (Figure 3-7), thus SBB 1 uses a significantly lower instruction count than SBB 0 and the ‘SBB big prog’ naturally results in the most complex symbiont programs. Needless to say, SBB 1 solutions will use some combination of SBB 0 solutions, however, relative to any one move, only two hosts are ever involved in defining each action.

5.

Conclusions

Temporal sequence learning represents the most challenging scenario for establishing effective mechanisms for credit assignment. Indeed, specific challenges under the temporal credit assignment problem are generally a superset of


those experienced under supervised learning domains. Layered learning represents one potential way of extending the utility of machine learning algorithms in general to temporal sequence learning (Stone, 2007). However, in order to do so effectively, solutions from any one ‘layer’ need to be both diverse and self-contained; properties that evolutionary computation may naturally support. Moreover, when building a new layer of candidate solutions the problem of automatic context association must be explicitly addressed. The SBB algorithm provides explicit support for these features and thus is able to construct layered solutions without recourse to hand designed objectives for each candidate component contributing to a solution (Lichodzijewski and Heywood, 2010b). This is in marked contrast to the original Layered learning methodology or the more recent developments in hierarchical reinforcement learning (Stone, 2007). The Rubik cube as a whole is certainly not a ‘solved’ problem from a learning algorithm perspective. The current state-of-the-art evolves solutions for each cube configuration (El-Sourani et al., 2010), or as in the work reported here, provides a general strategy for solving a subset of scrambled cubes (Baum and Durdanovic, 2000). The discrete nature of the Rubik problem domain makes the design of suitable fitness and distance functions less intuitive/ more challenging than in the case of continuous valued domains. Indeed, specific examples of the effectiveness of SBB style layered learning under continuous valued reinforcement learning tasks are beginning to appear (Lichodzijewski and Heywood, 2010b). It is therefore anticipated that future developments will need to make use of more structural adaptation to the point population and/ or make use of a priori constraints in the formulation of different fitness functions per layer, as in the case of more classical approaches to building Rubik cube ‘solvers’.

Acknowledgments Peter Lichodzijewski has been a recipient of Precarn, NSERC-PGSD and Killam Postgraduate Scholarships. Malcolm Heywood holds research grants from NSERC, MITACS, CFI, SwissCom Innovations SA. and TARA Inc.

References

Barreto, A. M. S., Augusto, D. A., and Barbosa, H. J. C. (2009). On the characteristics of sequential decision problems and their impact on Evolutionary Computation and Reinforcement learning. In Proceedings of the International Conference on Artificial Evolution, page in press.

Baum, E. B. and Durdanovic, I. (2000). Evolution of cooperative problem-solving in an artificial economy. Neural Computation, 12:2743–2775.

El-Sourani, N., Hauke, S., and Borschbach, M. (2010). An evolutionary approach for solving the Rubik’s cube incorporating exact methods. In EvoApplications Part – 1: EvoGames, volume 6024 of LNCS, pages 80–89.

Harmeling, S., Dornhege, G., Tax, F., Meinecke, F., and Müller, K. R. (2006). From outliers to prototypes: Ordering data. Neurocomputing, 69(13-15):1608–1618.

Heywood, M. I. and Lichodzijewski, P. (2010). Symbiogenesis as a mechanism for building complex adaptive systems: A review. In EvoApplications: Part 1 (EvoComplex), volume 6024 of LNCS, pages 51–60.

Korf, R. (1997). Finding optimal solutions to Rubik’s cube using pattern databases. In Proceedings of the Workshop on Computer Games (IJCAI), pages 21–26.

Kunkle, D. and Cooperman, G. (2007). Twenty-six moves suffice for Rubik’s cube. In Proceedings of ACM International Symposium on Symbolic and Algebraic Computation, pages 235–242.

Lichodzijewski, P. and Heywood, M. I. (2007). Pareto-coevolutionary Genetic Programming for problem decomposition in multi-class classification. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 464–471.

Lichodzijewski, P. and Heywood, M. I. (2008). Managing team-based problem solving with Symbiotic Bid-based Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 363–370.

Lichodzijewski, P. and Heywood, M. I. (2010a). Symbiosis, complexification and simplicity under GP. In Proceedings of the Genetic and Evolutionary Computation Conference. To appear.

Lichodzijewski, P. and Heywood, M. I. (2010b). A symbiotic coevolutionary framework for layered learning. In AAAI Symposium on Complex Adaptive Systems. Under review.

Oudeyer, P. Y., Kaplan, F., and Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11:265–286.

Pollack, J. B. and Blair, A. D. (1998). Co-evolution in the successful learning of backgammon strategy. Machine Learning, 32:225–240.

Stone, P. (2007). Learning and multiagent reasoning for autonomous agents. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 13–30.

Whiteson, S. and Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7:887–917.

Chapter 4

ENSEMBLE CLASSIFIERS: ADABOOST AND ORTHOGONAL EVOLUTION OF TEAMS

Terence Soule¹, Robert B. Heckendorn¹, Brian Dyre¹, and Roger Lew¹

¹ University of Idaho, Moscow, ID 83844, USA.

Abstract

AdaBoost is one of the most commonly used and most successful approaches for generating ensemble classifiers. However, AdaBoost is limited in that it requires independent training cases and can only use voting as a cooperation mechanism. This paper compares AdaBoost to Orthogonal Evolution of Teams (OET), an approach for generating ensembles that allows for a much wider range of problems and cooperation mechanisms. The set of test problems includes problems with significant amounts of noise in the form of erroneous training cases and problems with adjustable levels of epistasis. The results demonstrate that OET is a suitable alternative to AdaBoost for generating ensembles. Over the set of all tested problems OET with a hierarchical cooperation mechanism, rather than voting, is slightly more likely to produce better results. This is most apparent on the problems with very high levels of noise - suggesting that the hierarchical approach is less subject to over-fitting than voting techniques. The results also suggest that there are specific problems and features of problems that make them better suited for different training algorithms and different cooperation mechanisms.

Keywords:

ensembles, teams, classifiers, OET, AdaBoost

1.

Introduction

Classification, the ability to classify a case based on attribute values, is a commonly studied problem with many practical applications. Approaches based on the evolution of classifiers have been widely used and proven to be quite successful (see for example (Muni et al., 2004; Kishore et al., 2000; Paul and Iba, 2009)). However, as the complexity of the classification problem increases, and particularly as the number of attributes increases, the performance of monolithic

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_4, © Springer Science+Business Media, LLC 2011


classifiers often degrades. Thus, researchers have introduced the idea of ensemble classifiers, in which multiple classifiers vote on each case (Polikar, 2006). The general idea is that the individual classifiers can partition the attribute space into simpler, overlapping sub-domains for which individual classifiers can be more readily trained. Perhaps the most successful and widely used of these ensemble techniques is AdaBoost (Freund et al., 1999; Schapire et al., 1998). Recently, we introduced an alternative approach, called Orthogonal Evolution of Teams (Soule and Komireddy, 2006), for generating ensembles, or teams¹. A significant advantage of Orthogonal Evolution of Teams (OET) over AdaBoost is that, unlike AdaBoost, it does not require independent training cases or voting as a cooperation mechanism. Thus, OET can be applied in cases when the agents must function simultaneously, such as search and exploration problems, swarms, and problems with non-voting cooperation mechanisms. In previous research we have shown that the OET algorithm produces ensemble members whose errors are inversely correlated, demonstrating that they cooperate effectively (Soule and Komireddy, 2006). In addition, repeated tests have shown that OET performs well on traditional multi-agent search problems that are not within the traditional domain of AdaBoost (Soule and Heckendorn, 2007a; Soule and Heckendorn, 2007b; Thomason et al., 2008). However, a systematic comparison of OET and AdaBoost on classification problems has not been performed. We present that comparison here using a range of data sets. The data sets include noisy cases with errors added to the training set and data sets with adjustable levels of epistasis. The goal is to determine whether and, if so, under what circumstances, either of the two algorithms performs better.

2.

Background

Here we present the two ensemble based learning techniques, AdaBoost and OET, and briefly describe the strengths and weaknesses of each.

AdaBoost AdaBoost, developed by Freund and Schapire, is an ensemble building technique based on the idea of combining weak learners (Freund et al., 1999). It uses a combination of repeated training and re-weighting of training cases to generate cooperative ensembles. The basic algorithm is as follows:

Assign each training example a weight
For the ensemble size $N$ do
    Train a weak learner
    Calculate the error of the weak learner
    If the error $>$ 0.5 discard the learner and continue
    Calculate the normalized error of the learner
    Re-weight the training examples
Create the ensemble of the $N$ learners using a vote weighted according to each learner’s normalized error.

¹ The term ‘ensemble’ is most commonly applied to classifiers with multiple, voting members; whereas the term ‘team’ is commonly applied to multiple agents that work cooperatively on problems other than classification and/or that do not involve a vote. The term ‘swarm’ is commonly used for very large teams. Unlike AdaBoost, OET can be applied to all three types of problems.

AdaBoost has several significant advantages for generating ensembles. First, it can be used in conjunction with most learning techniques. Second, theoretical results have shown that a) the ensemble error is bounded above, b) the ensemble error is less than the best ensemble member, and c) additional ensemble members lower the ensemble error - on the training set (and when members with error > 0.5 are discarded) (Polikar, 2006). These strengths make AdaBoost a very powerful and hence widely used technique for generating ensembles. However, AdaBoost has several weaknesses. First, because of the re-weighting step it potentially has difficulty with noisy data sets in which some of the examples are mis-classified. In this case increasing emphasis may be placed on the erroneous cases: the early learners ignore them as not fitting the general pattern, their weight then increases to where later learners are effectively forced to consider them. However, in general, AdaBoost has proven surprisingly resistant to overfitting; a strength that some researchers feel has not been satisfactorily explained (Mease and Wyner, 2008). Part of the goal of this research is to compare AdaBoost and OET’s ability to resist overfitting specifically when the training examples are noisy. Second, because AdaBoost trains each ensemble member independently it is possible that problems with high levels of epistasis may confound it. The members of the ensemble may need to cooperate to overcome the high levels of epistasis in a way that is not possible when the members are trained sequentially.
In contrast an algorithm that evolves all ensemble members in parallel may be able to leverage the capabilities of the members simultaneously to more successfully address high levels of epistasis. We use a synthetic problem with adjustable levels of epistasis to test this possibility. Finally, AdaBoost is restricted to problems in which individuals can train independently and cooperate via a vote. This means that it cannot be applied to problems where more than one member is required to actually make progress. A typical example of such a problem is collective foraging where multiple members must work together to collect ‘large’ items or other problems in which members have heterogeneous, complementary capabilities and must be trained collectively to make progress. Similarly, AdaBoost depends on a (weighted) vote for cooperation. It is not directly applicable to ensembles using other


forms of cooperation. An example of an alternative cooperative mechanism is the leader mechanism, in which the first ensemble member (the leader) ‘examines’ each input case and assigns it to one of the other ensemble members to classify. AdaBoost’s sequential, vote based, ensemble generation algorithm can not be applied to ensembles using leaders for cooperation. This is a fundamental limitation of AdaBoost’s incremental approach to building ensembles and cannot be readily overcome without fundamentally rewriting the algorithm.
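The sequential train, re-weight and vote loop of AdaBoost can be illustrated with a compact sketch. This follows the common textbook formulation of discrete AdaBoost for labels in {-1, +1}, with threshold ‘stumps’ on a single numeric attribute as the weak learner; it is not necessarily the variant or the weak learner used in this chapter, and all function names are assumptions.

```python
import math

def stump_predict(threshold, polarity, x):
    """Weak classifier: predicts `polarity` at or above the threshold."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    """Train up to n_rounds stumps; returns [(vote weight, threshold, polarity)]."""
    n = len(xs)
    weights = [1.0 / n] * n                  # assign each example a weight
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        if error > 0.5:                      # discard learners worse than chance
            continue
        error = max(error, 1e-10)            # avoid log of infinity on a perfect stump
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Re-weight: misclassified examples gain weight, the rest lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of the ensemble members."""
    score = sum(alpha * stump_predict(threshold, polarity, x)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1
```

Note that each learner is trained only after the previous one is fixed, and the members interact solely through the final weighted vote; these are the two properties that the limitations discussed above stem from.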

Orthogonal Evolution of Teams Other than AdaBoost, common evolutionary ensemble training has fallen into two categories: team based and island based. In team based approaches the entire ensemble is treated as a single individual: the team receives a single fitness value and the selection process is applied to entire teams (Luke and Spector, 1996; Soule, 1999; Brameier and Banzhaf, 2001; Platel et al., 2005). Crossover techniques vary, but approaches in which team members in the same ‘position’ within the team are crossed seem to have the most success (Haynes et al., 1995; Luke and Spector, 1996). In island based techniques the individuals are evolved in independent populations, i.e. islands, and the best individuals from each island are combined into a single ensemble (see for example, (Imamura et al., 2004)). Each of these techniques has distinct strengths and weaknesses. In team based approaches the ensemble members learn to cooperate well (similar to AdaBoost). It has been shown that they can evolve inversely correlated error - the errors of one member are explicitly covered by the other members (Soule and Komireddy, 2006). However, the individual members perform relatively poorly, i.e. their average fitness is often significantly poorer than the fitness of individuals evolved independently (Soule and Komireddy, 2006). In contrast, in island based approaches the individual members have relatively high fitness. However, they cooperate more poorly than in team based approaches; at best their errors are independent and in some cases their errors are correlated, undermining the advantage of the ensemble (Imamura et al., 2004; Imamura, 2002; Soule and Komireddy, 2006). The Orthogonal Evolution of Teams approach is an attempt to combine the strengths and avoid the weaknesses of the team and island approaches.
A single population is created, but it is alternately treated as independent islands (columns in the population, see Figure 4-1) or as teams (rows in the population, see Figure 4-1). A number of OET approaches are possible depending on whether the population is treated as rows or columns during selection and replacement (Thomason et al., 2008). In this paper we take one of the most straightforward approaches: during the selection step the population is treated as islands, i.e. selection is applied to each column, creating a new team consisting of highly fit individuals. This is done twice to create two “all-star” teams. These teams undergo crossover, with crossover applied to individuals from the same column, and mutation, to create two offspring teams. These teams are evaluated and reinserted into the population, replacing two poorly fit teams. Thus, during the selection stage the population is treated as islands and during the replacement stage the population is treated as teams. This places direct selection pressure both on individuals, so they can be selected for the all-star parent teams, and on teams, to avoid being replaced.
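The selection and replacement cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `oet_step`, the tournament selection within each column, the choice to replace the two worst teams, and the caller-supplied `crossover` and `mutate` callables are all assumptions made here, since the text does not specify these details.

```python
import random


def oet_step(population, member_fitness, team_fitness, crossover, mutate,
             rng, tournament_size=3):
    """One steady-state OET iteration on a rows-by-columns population.

    Rows are teams, columns are islands. Higher fitness is assumed better.
    The population is modified in place; the two offspring teams are returned.
    """
    n_cols = len(population[0])

    def select_member(col):
        # Island view: a tournament among members of the same column
        # (tournament selection is an assumption of this sketch).
        rows = rng.sample(range(len(population)), tournament_size)
        return max((population[r][col] for r in rows), key=member_fitness)

    # Build two "all-star" parent teams, one member per column.
    parents = [[select_member(c) for c in range(n_cols)] for _ in range(2)]

    # Cross members that share a column, then mutate the offspring.
    children = [[], []]
    for a, b in zip(parents[0], parents[1]):
        c1, c2 = crossover(a, b, rng)
        children[0].append(mutate(c1, rng))
        children[1].append(mutate(c2, rng))

    # Team view: the offspring replace the two least fit teams (rows).
    worst = sorted(range(len(population)),
                   key=lambda r: team_fitness(population[r]))[:2]
    for r, child in zip(worst, children):
        population[r] = child
    return children
```

Selection works on columns while replacement works on rows, which is what puts pressure on both individual and team fitness.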

Figure 4-1. A population of individuals. Selection can be applied to members, keeping selection and replacement within the columns (a) in an island approach with each column serving as an island. Alternatively selection can be applied to whole rows (b) a team-based approach. Finally, selection can be varied between the two; these are the OET approaches.

3. Problem Instances

To compare the ensemble classifiers we selected two data sets from the UCI Machine Learning Repository (Asuncion and Newman, 2007). The sets are the Parkinson’s Telemonitoring Data Set (Tsanas et al., 2009) and the Ionosphere data set (Sigillito et al., 1989). In addition, we used data collected as part of a research project conducted at the University of Idaho to assess cognitive workload (described in detail below) and data from a synthetic problem with adjustable levels of epistasis. Each of these data sets represents a binary classification problem with numerical attributes (both integer and real). Table 4-1 summarizes the problems.

Table 4-1. Number of attributes and number of cases for each of the test problems. Attributes are numerical (integer and real). The cognitive workload case consists of two separate data sets from two different test subjects. For each of the problems 50% of the cases are used for training and 50% for testing.

Problem                            Number of Attributes    Number of Cases
Ionosphere                         34                      351
Parkinson’s                        22                      195
Cognitive Workload (2 subjects)    20                      2048
Synthetic Problem                  20                      1000

Assessing Cognitive Workload

This data set was generated as part of a research project conducted at the University of Idaho to measure cognitive workload. Subjects’ skin conductance (SC, also known as galvanic skin response, GSR) and pupil diameter were measured while they performed a task with two distinct levels of difficulty. Changes in SC are generally believed to reflect autonomic responses to anxiety or stress, while changes in pupil diameter have been linked to differences in difficulty of tasks including sentence processing, mental calculations and user interface evaluation (Just and Carpenter, 1993; Nakayama and Katsukura, 2007). Thus, it was hypothesized that these physiological indicators could be used to determine which phase of the task the subject was in.

Stimuli and Apparatus. Participants used a black cursor to chase an intensity-balanced dot moving in a pseudo-random fashion against a gray background. A balanced dot was used as a precaution against pupil dilations due to luminance changes. Participants controlled the cursor using a joystick. For the first minute of the experiment the control mappings were normal: moving the joystick forward moved the cursor up, moving the joystick right moved the cursor right, etc. After 60 seconds the joystick control mappings were abruptly rotated 90 degrees clockwise, such that moving the joystick forward-backward moved the cursor right-left, and moving the joystick left-right moved the cursor upward-downward. The control dynamics were switched between normal and rotated by 90 degrees every 60 seconds for the eight-minute duration of the experiment. The abrupt changes in control mappings were hypothesized to elicit transient physiological responses, and the rotated mappings were hypothesized to cause physiological indicators reflecting increased workload. The goal was to train classifiers to use these physiological indicators to determine the control phase, normal or rotated. For this analysis data was used from the last 2 minutes of the experiment (covering one normal and one rotated period), by which time the subjects had obtained some practice with both sets of controls. Data was collected 18 times per second for a total of 2048 separate cases for each subject.


For a more detailed explanation of the experimental conditions please see (Lew et al., 2010).

Participants. The data used is from two university students who participated in this experiment. Both had normal or corrected to normal Snellen visual acuity (20/30 or better). The participants were naive to the hypotheses of the experiment.

Synthetic Problem

The synthetic function was designed to allow control of the amount of epistasis in the problem. Each problem is defined in terms of a z-function; z-functions are random Embedded Landscapes (Heckendorn, 2002). These are generalizations of NK-Landscapes in that the sub-function masks are not guaranteed to cover the domain of the function, and the number of sub-functions is not constrained to equal the number of bits in the domain, as it is in NK-Landscapes. The values of the sub-functions lie between -1 and 1. The functions are denoted by names of the form z-N-K-P. They are randomly generated, but are of the form:

$$f(x) = \mathrm{positive}\left( \sum_{i=1}^{P} g_i(\mathrm{pack}(x, m_i)) \right)$$

where:

N is the number of bits (or binary valued features).
P is the number of sub-functions to sum.
K is the number of bits (or features) in the domain of $g_i$.
$m_i$ is an N bit mask that selects K bits out of N bits by using the pack function to extract the bits selected by the 1’s in $m_i$. In a given f: $m_i \neq m_j$ $\forall i, j$ such that $i \neq j$.
$g_i$ is a function that maps its K bit domain into the reals. This function is fully epistatic in that all Walsh coefficients are nonzero. The values of $g_i$ are random in the range between −1 and +1. This is where the randomness in the function is created.
positive takes a real argument and returns 1 if its argument is positive and 0 otherwise.

This creates a function f that has the property that it has at most K bits of epistasis in P groups of interrelated bits that may overlap. Therefore, as K goes up, the amount of epistasis goes up, and as P goes up, the complexity of the constraint satisfaction problem created by the overlapping fully epistatic g’s goes up when treated as a function to optimize.
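A z-N-K-P function of this form can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in the experiments: the names `pack` and `make_z_function`, the lookup-table representation of each $g_i$, and the uniform sampling of table values are choices made here. Uniform random tables are not checked for the nonzero Walsh coefficient property mentioned above.

```python
import math
import random


def pack(x, mask):
    """Collect the bits of x selected by the 1's in mask into one integer."""
    value = 0
    for bit, selected in zip(x, mask):
        if selected:
            value = (value << 1) | bit
    return value


def make_z_function(n, k, p, rng):
    """Return (f, masks) for a random function with P sub-functions of K bits."""
    if p > math.comb(n, k):
        raise ValueError("cannot draw p distinct masks of k bits from n bits")
    chosen = set()
    while len(chosen) < p:  # enforce m_i != m_j for i != j
        chosen.add(frozenset(rng.sample(range(n), k)))
    masks = [[1 if j in ones else 0 for j in range(n)]
             for ones in sorted(chosen, key=sorted)]
    # Each g_i is a table with one random value in [-1, 1] per K-bit input.
    tables = [[rng.uniform(-1.0, 1.0) for _ in range(2 ** k)] for _ in masks]

    def f(x):
        total = sum(g[pack(x, m)] for g, m in zip(tables, masks))
        return 1 if total > 0 else 0  # the "positive" function

    return f, masks
```

For example, `make_z_function(20, 2, 3, random.Random(0))` builds one random instance over 20 binary features with three two-bit sub-functions.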

Noisy Training Data

For many real-world data sets noisy cases (cases with an incorrect classification) are common. These cases can easily mislead training algorithms or lead to overfitting, as the training algorithm is forced to ‘memorize’ cases that don’t fit the general solution because the class is incorrect. Thus, in addition to the basic data sets we ran experiments with noisy versions of each of the problems except the synthetic problem. For the noisy cases 0 (no noise), 10, 20, 30, or 40 percent of the training case answers were changed to the opposite (incorrect) class. The erroneous cases in the training set are kept the same throughout the evolutionary process to maximize the chance of misleading the learners. All of the test cases were unchanged, i.e. all are correct.
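The noise procedure can be illustrated with a short Python sketch. It assumes binary labels coded as 0 and 1 and assumes that an exact fraction of the training cases is flipped once, before training starts; the function name `add_label_noise` is hypothetical.

```python
import random


def add_label_noise(labels, percent, rng):
    """Return a copy of `labels` with `percent` percent of them flipped."""
    n_flip = round(len(labels) * percent / 100)
    flipped = set(rng.sample(range(len(labels)), n_flip))
    # Flip the chosen cases to the opposite class; leave the rest untouched.
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]
```

Calling the function once and reusing its output for the whole run keeps the erroneous cases fixed, as described above.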

4. Cooperation Mechanisms

With AdaBoost the ensemble members cooperate, i.e. collectively determine the classification for each input case, via a weighted vote. With OET two different cooperation mechanisms are tested. The first is a simple majority vote. The second is the leader approach, in which the first ensemble member (the leader) ‘examines’ each input case and assigns it to one of the other ensemble members to classify. It is important to note that AdaBoost’s sequential, vote-based ensemble generation algorithm cannot be applied to ensembles using leaders for cooperation (or to most other cooperation mechanisms that do not use a vote).
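The two OET cooperation mechanisms can be sketched as follows. In this sketch ensemble members are plain Python callables that return class 0 or 1. How the leader's output is mapped to a team member is not specified in the text, so the integer index returned by the leader here is an assumption, as are the function names.

```python
def majority_vote(members, case):
    """Return the class chosen by more than half of the members."""
    votes = sum(member(case) for member in members)
    return 1 if 2 * votes > len(members) else 0


def leader_classify(members, case):
    """The first member picks which of the other members classifies the case."""
    leader, others = members[0], members[1:]
    # Assumption: the leader's output is an integer used as an index.
    index = leader(case) % len(others)
    return others[index](case)
```

With an ensemble of size 3, the leader mechanism leaves only two members that actually classify, each specializing in the cases the leader routes to it.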

5. Genetic Program

For these experiments the ensemble size is always 3. One of the potential advantages of GP techniques is their ability to generate (somewhat) human-readable solutions. This advantage is lost if the ensemble size is large, hence the small value used here. The results are the average of 20 trials (synthetic problem) or 10 trials (other problems).

The basic GP used in both the AdaBoost and OET experiments is steady-state with a population size of 500, run for either 50000 iterations (synthetic problems) or 12500 iterations (all others). For OET this is the total number of iterations. For AdaBoost this is the number of iterations used to generate each of the three ensemble members. With OET each iteration requires evaluating six trees, three trees for each of the two offspring teams. Because AdaBoost only generates one tree at a time, it only evaluates two trees per iteration. Thus, to equalize the number of tree evaluations AdaBoost uses the full number of iterations to generate each of the ensemble members, effectively tripling the total number of iterations used with AdaBoost (for example, 12500 × 6 = 75000 tree evaluations for OET and 3 × 12500 × 2 = 75000 for AdaBoost).

The non-terminal set consists of if-less-than-else, addition, subtraction, multiplication, and protected division (if the absolute value of the divisor is less than 0.00001 it returns 1). The terminal set consists of the N attributes of the problem and real-valued random constants generated in the range -2.0 to 2.0. Table 4-2 summarizes the GP’s parameters.

Table 4-2. Summary of the GP parameters.

Algorithm          Steady-state
Iterations         50000 (synthetic problem) or 12500 (all others)
Population Size    500
Non-terminals      iflte, +, -, *, /
Terminals          Attributes, Random constants
Crossover Rate     100%
Mutation Rate      1/size
Trials             20 (synthetic problem) or 10 (all others)
Ensemble Size      3
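The less standard primitives and the evaluation budget can be sketched in Python. The protected division follows the rule stated above. The argument order and the less-than-or-equal comparison in `iflte` are assumptions, since the text names the operator without defining it, and the two budget helpers are hypothetical names used only to check the arithmetic.

```python
import random


def protected_div(a, b):
    """Division that returns 1 when the divisor is close to zero."""
    return 1.0 if abs(b) < 0.00001 else a / b


def iflte(a, b, c, d):
    """Assumed semantics: return c if a <= b, otherwise d."""
    return c if a <= b else d


def random_constant(rng):
    """Real-valued terminal drawn from the range -2.0 to 2.0."""
    return rng.uniform(-2.0, 2.0)


def oet_evaluations(iterations, ensemble_size=3):
    """Two offspring teams of `ensemble_size` trees per iteration."""
    return iterations * 2 * ensemble_size


def adaboost_evaluations(iterations, ensemble_size=3):
    """Two offspring trees per iteration, repeated for every member."""
    return ensemble_size * iterations * 2
```

Both budget helpers give the same total for a given iteration count, which is the equalization the text describes.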

6. Results

Figure 4-2 presents the results on the ionosphere problem. For this problem the OET-leader approach performs significantly worse than the other two approaches for low levels of noise (all significance tests use a two-tailed Student’s t-test, with significance defined as P < 0.05). OET-leader’s relative performance improves as noise increases, but does not reach statistically better performance. Figure 4-3 presents the results on the Parkinson’s problem. OET-vote is significantly worse than both other techniques with 30% noise and OET-leader is significantly better with 40% noise. Figure 4-4 presents the results for the cognitive workload problem with subject 1. At 0% noise AdaBoost is significantly worse than the other two approaches and OET-vote is significantly better. At 40% noise OET-leader is significantly better than the other two approaches. Figure 4-5 presents the results for the cognitive workload problem with subject 2. At 0% noise AdaBoost is significantly worse than OET-vote. At 20% and 30% noise AdaBoost is significantly better than the other two approaches. Figure 4-6 presents the results on the synthetic functions. OET-leader is significantly better than the other two algorithms on 4 of the 7 functions (2-30, 5-10, 5-30, 10-30). OET-vote is significantly better than the other two algorithms on 1 of the functions (2-3).
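The significance test can be sketched as below. This is a hedged illustration rather than the authors' analysis code: it assumes the equal-variance (pooled) form of Student's t-test, and the function name `student_t` is hypothetical. For two sets of 10 trials there are 18 degrees of freedom, for which the two-tailed 5% critical value is 2.101.

```python
import math


def student_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in sample_a)
    ss_b = sum((v - mean_b) ** 2 for v in sample_b)
    df = na + nb - 2
    pooled_variance = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return t, df


# Two-tailed critical value at P = 0.05 for 18 degrees of freedom.
T_CRITICAL_DF18 = 2.101
```

Under these assumptions, a difference between two algorithms over 10 trials each would count as significant when the absolute value of `t` exceeds `T_CRITICAL_DF18`.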

[Figure 4-2 plot: classification error (y-axis) versus percent noise in the training data (x-axis) for AdaBoost, OET-Vote, and OET-Leader.]

Figure 4-2. Results on the ionosphere problem for varying levels of noise in the training data. Arrows show significant differences. For this problem the OET-leader approach is significantly worse (Student’s two-tailed t-test, P < 0.05) than the other two approaches for low levels of noise. Its relative performance improves for higher levels of noise, but the differences do not reach significance.

Overall the results are mixed. For the majority of cases the performance of the algorithms is statistically indistinguishable. Generally, OET-leader performs better on the noisiest cases, suggesting that it is less prone to overfitting, but often performs more poorly on the low noise cases. OET-vote performs better on some of the simplest cases (0% noise and the 2-3 function) and AdaBoost’s performance tends to fall in the middle.

7. Conclusions

In general the results confounded the expectations. The goal of this research was to compare AdaBoost, a well established and widely used ensemble training technique, to OET, a newer approach that has proven successful on a number of problems. Given the nature of AdaBoost it was hypothesized that OET was most likely to perform better under one of two conditions. First, on cases with significant noise, because AdaBoost’s re-weighting approach would force it to focus on erroneous cases, causing it to overfit. Second, on cases with high levels of epistasis, because AdaBoost’s incremental approach to building an ensemble could interfere with its ability to leverage multiple members simultaneously to ‘untangle’ high epistasis problems.

The results do strongly suggest that performance depends both on the training algorithm and the cooperation method, but they confounded the specific hypotheses regarding noise and epistasis. OET with voting ensemble members only performed better with zero noise and the least epistasis, whereas OET with hierarchical cooperation (the leader approach described previously) had the best performance with high levels of noise and epistasis. AdaBoost’s performance generally fell between OET-vote and OET-leader and showed the best results for the mid-range of noise.

However, for the majority of cases the algorithms’ performance was statistically indistinguishable. This suggests that the performance of the algorithms is generally comparable, if not identical. Based on the results it seems plausible that further testing would show that there are specific types of problems or features of problems that make them better suited for one or another of the algorithms and/or cooperation mechanisms.

Most importantly, these results strongly suggest that OET is generally on par with AdaBoost, but, as noted previously, OET can be applied to problems and cooperation mechanisms that are not suitable for AdaBoost. Thus, researchers can confidently apply OET in cases where AdaBoost is inappropriate.

[Figure 4-3 plot: classification error (y-axis) versus percent noise in the training data (x-axis) for AdaBoost, OET-Vote, and OET-Leader.]

Figure 4-3. Results on the Parkinson’s problem for varying levels of noise in the training data. Arrows show significant differences. For this problem the OET-vote approach is significantly worse (Student’s two-tailed t-test, P < 0.05) than the other two approaches for 30% noise and the OET-leader approach is significantly better with 40% noise.

[Figure 4-4 plot: classification error (y-axis) versus percent noise in the training data (x-axis) for AdaBoost, OET-Vote, and OET-Leader.]

Figure 4-4. Results on the cognitive workload problem for the first subject with varying levels of noise in the training data. Arrows show significant differences. For this problem the results between all three approaches are significantly different with no noise in the training set (Student’s two-tailed t-test, P < 0.05). The OET-leader approach is significantly better than the other two approaches with 40% noise.

[Figure 4-5 plot: classification error (y-axis) versus percent noise in the training data (x-axis) for AdaBoost, OET-Vote, and OET-Leader.]

Figure 4-5. Results on the cognitive workload problem for the second subject with varying levels of noise in the training data. Arrows show significant differences. For this problem AdaBoost is significantly better with noise levels of 20% and 30%. Additionally, OET-vote is significantly better than AdaBoost (but not OET-leader) with 0% noise.

[Figure 4-6 plot: classification error (y-axis) for each z function (x-axis: 2-3, 2-10, 2-30, 5-3, 5-10, 5-30, 10-30) for AdaBoost, OET-Vote, and OET-Leader.]

Figure 4-6. Results on the z functions. No noise was used with these problems. The problems are arranged along the x-axis in approximate order of difficulty. Arrows show significant differences in performance (Student’s two-tailed t-test, P < 0.05). For the 2-3 function the OET-vote approach is significantly better than both other approaches. For the 2-30, 5-10, 5-30, and 10-30 problems the OET-leader approach is significantly better than the other two approaches. Additionally, the OET-leader approach is significantly better than AdaBoost (but not OET-vote) for the 2-10 problem and significantly better than OET-vote (but not AdaBoost) for the 5-3 problem.

References

Asuncion, A. and Newman, D.J. (2007). UCI machine learning repository.

Brameier, Markus and Banzhaf, Wolfgang (2001). Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines, 2(4):381–408.

Freund, Y., Schapire, R., and Abe, N. (1999). A short introduction to boosting. Journal-Japanese Society for Artificial Intelligence, 14:771–780.

Haynes, Thomas, Sen, Sandip, Schoenefeld, Dale, and Wainwright, Roger (1995). Evolving a team. In Siegel, Eric V. and Koza, John, editors, Working Notes of the AAAI-95 Fall Symposium on GP, pages 23–30. AAAI Press.

Heckendorn, Robert B. (2002). Embedded landscapes. Evolutionary Computation, 10(4):345–376.

Imamura, Kosuke (2002). N-version Genetic Programming: A probabilistic Optimal Ensemble. PhD thesis, University of Idaho.

Imamura, Kosuke, Heckendorn, Robert B., Soule, Terence, and Foster, James A. (2004). Behavioral diversity and a probabilistically optimal GP ensemble. Genetic Programming and Evolvable Machines, 4:235–253.

Just, M.A. and Carpenter, P.A. (1993). The intensity dimension of thought: Pupillometric indices of sentence processing. Canadian Journal of Experimental Psychology, 47(2):310–339.

Kishore, J.K., Patnaik, L.M., Mani, V., and Agrawal, V.K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation, 4(3):242–258.

Lew, R., Dyre, B. P., Soule, T., Werner, S., and Ragsdale, S. A. (2010). Assessing mental workload from skin conductance and pupillometry using wavelets and genetic programming. In Proceedings of the 54th Annual Meeting of the Human Factors and Ergonomics Society.

Luke, Sean and Spector, Lee (1996). Evolving teamwork and coordination with genetic programming. In Koza, John R., Goldberg, David E., Fogel, David B., and Riolo, Rick L., editors, Genetic Programming 1996: Proceedings of the First Annual Conference on Genetic Programming, pages 150–156. Cambridge, MA: MIT Press.

Mease, D. and Wyner, A. (2008). Evidence contrary to the statistical view of boosting. The Journal of Machine Learning Research, 9:131–156.

Muni, D.P., Pal, N.R., and Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2):183–196.

Nakayama, M. and Katsukura, M. (2007). Feasibility of assessing usability with pupillary responses. Proc. of AUIC 2007, 15, 22.

Paul, T.K. and Iba, H. (2009). Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 6(2):353–367.

Platel, Michael Defoin, Chami, Malik, Clergue, Manuel, and Collard, Philippe (2005). Teams of genetic predictors for inverse problem solving. In Proceedings of the 8th European Conference on Genetic Programming – EuroGP 2005.

Polikar, R. (2006). Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45.

Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686.

Sigillito, V.G., Wing, S.P., Hutton, L.V., and Baker, K.B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Dig., 10:262–266.

Soule, T. and Heckendorn, R.B. (2007a). Improving Performance and Cooperation in Multi-Agent Systems. In Proceedings of the Genetic Programming Theory and Practice Workshop. Springer.

Soule, Terence (1999). Voting teams: A cooperative approach to non-typical problems. In Banzhaf, Wolfgang, Daida, Jason, Eiben, Agoston E., Garzon, Max H., Honavar, Vasant, Jakiela, Mark, and Smith, Robert E., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 916–922, Orlando, Florida, USA. Morgan Kaufmann.

Soule, Terence and Heckendorn, Robert B. (2007b). Evolutionary optimization of cooperative heterogeneous teams. In SPIE Defense and Security Symposium, volume 6563.

Soule, Terence and Komireddy, Pavankumarreddy (2006). Orthogonal evolution of teams: A class of algorithms for evolving teams with inversely correlated errors. In Riolo, Rick L., Soule, Terence, and Worzel, Bill, editors, Genetic Programming Theory and Practice IV, volume 5 of Genetic and Evolutionary Computation, chapter 8, pages –. Springer, Ann Arbor.

Thomason, Russell, Heckendorn, Robert B., and Soule, Terence (2008). Training time and team composition robustness in evolved multi-agent systems. In O’Neill, Michael, Vanneschi, Leonardo, Gustafson, Steven, Esparcia Alcazar, Anna Isabel, De Falco, Ivanoe, Della Cioppa, Antonio, and Tarantino, Ernesto, editors, Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008, volume 4971 of Lecture Notes in Computer Science, pages 1–12, Naples. Springer.

Tsanas, A., Little, M.A., McSharry, P.E., and Ramig, L.O. (2009). Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests. Scientific Commons.

Chapter 5

COVARIANT TARPEIAN METHOD FOR BLOAT CONTROL IN GENETIC PROGRAMMING

Riccardo Poli¹

¹ School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, UK.

Abstract

In this paper a simple modification of the Tarpeian bloat-control method is presented which allows one to dynamically set the parameters of the method in such a way as to guarantee that the mean program size will either keep a particular value (e.g., its initial value) or will follow a schedule chosen by the user. The mathematical derivation of the technique as well as its numerical and empirical corroboration are presented.

Keywords: Bloat control, Tarpeian Method, Price’s theorem, Size-evolution equation

1. Background

Many techniques to control bloat have been proposed in the last two decades (for recent reviews see (Poli et al., 2008; Luke and Panait, 2006; Alfaro-Cid et al., 2010; Silva, 2008)). One with a theoretically-sound basis is the Tarpeian method introduced in (Poli, 2003). This is the focus of this paper. The Tarpeian method is extremely simple in its implementation. All that is needed is a wrapper for the fitness function like the following algorithm:

Tarpeian Wrapper:
    if size(program) > average program size and random() < p_t then
        return( f_bad );
    else
        return( fitness(program) );

where $p_t$ is a real number between 0 and 1, random() is a function which returns uniformly distributed random numbers in the range [0, 1) and $f_{bad}$ is a constant which represents an extremely low (or high, if minimising) fitness value such that individuals with such fitness are almost guaranteed not to be selected.

The method is named after the Tarpeian Rock in Rome, which in Roman times was the infamous execution place for traitors and criminals (above average size individuals), who would be led to its top and then hurled down to their death.

A feature of this algorithm is that it does not require a priori knowledge of the size of the potential solutions to a problem. If programs need to grow in order to improve fitness, the original Tarpeian method will not prevent this. It will occasionally hit some individuals that, if evaluated, would turn out to be fitter than average, and this may slightly slow the progress of a run. However, because the wrapper does not evaluate the individuals being given a low fitness, very little computation is wasted. Even at a high anti-bloat intensity, $p_t$, a better-than-average longer-than-average individual still has a chance of making it into the population. If enough individuals of this kind are produced (w.r.t. the individuals which are better-than-average but also shorter-than-average), eventually the average size of the programs in the population may grow. However, when this happens the Tarpeian method will immediately adjust so as to discourage further growth.

Since its proposal, the Tarpeian method has been used in a variety of studies and applications. For example, in (Mahler et al., 2005) its performance and generalisation capabilities were studied, while it was compared with other bloat-control techniques in (Luke and Panait, 2006; Wyns and Boullart, 2009; Alfaro-Cid et al., 2010).

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_5, © Springer Science+Business Media, LLC 2011
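The Tarpeian wrapper described above translates almost directly into Python. The sketch below is illustrative only: the function name, the argument names, and the convention of passing `size` and `fitness` as callables are assumptions made here, not part of the original description.

```python
import random


def tarpeian_fitness(program, size, fitness, mean_size, p_t, f_bad, rng=random):
    """Fitness wrapper implementing the Tarpeian rule for one program.

    size and fitness are caller-supplied callables; mean_size is the average
    program size in the current population; p_t lies between 0 and 1.
    """
    if size(program) > mean_size and rng.random() < p_t:
        # The program is penalised without being evaluated, so the cost of
        # running fitness(program) is saved.
        return f_bad
    return fitness(program)
```

Because `rng.random()` returns values in [0, 1), setting `p_t` to 0 disables the method and setting it to 1 penalises every above-average-size program.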
The method has been used with success in the evolution of bin packing heuristics (Burke et al., 2007; Allen et al., 2009), in the evolution of image analysis operators (Roberts and Claridge, 2004), in artificial financial markets based on GP (Martinez-Jaramillo and Tsang, 2009), in predicting protein networks (Garcia et al., 2008a), in the design of passive analog filters using GP (Chouza et al., 2009), in the prediction of protein-protein functional associations (Garcia et al., 2008b) and in the simplification of decision trees via GP (Garcia-Almanza and Tsang, 2006). In all cases the Tarpeian method has been a solid and efficient choice.

All studies and applications, however, have had to determine by trial and error the value of the parameter $p_t$ best suited to their problem(s).¹ This is not really a drawback of this method: virtually all anti-bloat techniques require setting one or more parameters. For example, the parsimony pressure method (Koza, 1992; Zhang and Mühlenbein, 1995; Zhang and Mühlenbein, 1993; Zhang et al., 1997) also requires setting one parameter (the parsimony coefficient).

In recent research (Poli and McPhee, 2008), we developed a method, called covariant parsimony pressure, that allows one to dynamically and optimally set the parsimony coefficient for the parsimony pressure method in such a way as to completely control the evolution of the mean program size. The aim of this paper is to achieve the same level of control for the Tarpeian method. We will do this partly by following the tracks of (Poli and McPhee, 2008). We therefore start our journey by briefly summarising the main ideas that led to the covariant parsimony pressure method.

¹ In principle $f_{bad}$ also needs to be set. However, this is normally easily done (more on this later) and requires virtually no tuning.

2. Covariant Parsimony Pressure

Let us start by considering the size evolution equation developed in (Poli, 2003; Poli and McPhee, 2003), which, as shown in (Poli and McPhee, 2008), with trivial manipulations can be rewritten as follows

$$E[\mu'] = \sum_{\ell} \ell \, p(\ell) \tag{5.1}$$

where the index $\ell$ ranges over all program sizes, $\mu'$ is a stochastic variable which represents the average size of the programs at the next generation and $p(\ell)$ is the probability of selecting a program of size $\ell$ from the current generation. The equation applies to GP systems with independent selection and symmetric sub-tree crossover.² If $\phi(\ell)$ represents the proportion of programs of size $\ell$ in the current generation, then, clearly, the average size of the programs in the current generation is given by $\mu = \sum_{\ell} \ell \, \phi(\ell)$. Thus one can simply express the expected change in average size of programs between two generations as

$$E[\Delta\mu] = E[\mu'] - \mu = \sum_{\ell} \ell \, (p(\ell) - \phi(\ell)). \tag{5.2}$$

In (Poli and McPhee, 2008), we showed that if we restrict our attention to fitness proportionate selection, we can express $p(\ell) = \phi(\ell) \frac{f(\ell)}{\bar{f}}$, where $f(\ell)$ is the average fitness of the programs of size $\ell$ and $\bar{f}$ is the average fitness of the programs in the population. Then, with some algebraic manipulations, one finds that Equation (5.2) is actually equivalent to Price’s theorem (Price, 1970). That is

$$E[\Delta\mu] = \frac{\mathrm{Cov}(\ell, f)}{\bar{f}}. \tag{5.3}$$

Let us imagine that a fitness function incorporating parsimony, $f_p = f - c\ell$, is used, where $c$ is the parsimony coefficient, $\ell$ is the size of a program and $f$ is its raw fitness (problem-solving performance). Feeding this into Equation (5.3), then setting its l.h.s. ($E[\Delta\mu]$) to zero and solving for $c$, one finds

$$c = \frac{\mathrm{Cov}(\ell, f)}{\mathrm{Var}(\ell)}. \tag{5.4}$$

² In a symmetric operator the probability of selecting particular crossover points in the parents does not depend on the order in which the parents are drawn from the population.


This value of c guarantees that, in expectation, the size of the programs in the next generation will be the same as in the current generation (as long as the coefficient c is recomputed at each generation). In (Poli and McPhee, 2008) we also showed that with simple further manipulations of Equation (5.3) one can even set c dynamically in such a way as to force the mean program size to vary according to any desired function of time, thereby providing complete control over the evolution of size.
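Equation (5.4) can be checked numerically with a short sketch. Since $\mathrm{Cov}(\ell, f - c\ell) = \mathrm{Cov}(\ell, f) - c\,\mathrm{Var}(\ell)$, the choice of $c$ in Equation (5.4) makes the covariance between size and adjusted fitness vanish, so Equation (5.3) gives zero expected change. The code below assumes fitness proportionate selection, computes covariances over individuals in the current population, and uses hypothetical function names. It also assumes that program sizes are not all equal (otherwise the variance is zero) and that the adjusted fitness values stay positive.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def covariant_parsimony_coefficient(sizes, fitness):
    """Parsimony coefficient c = Cov(size, f) / Var(size), Equation (5.4)."""
    return covariance(sizes, fitness) / covariance(sizes, sizes)


def expected_size_change(sizes, fitness):
    """E[change in mean size] = Cov(size, f) / mean(f), Equation (5.3)."""
    return covariance(sizes, fitness) / mean(fitness)
```

In a run, `covariant_parsimony_coefficient` would be recomputed at every generation from the current sizes and raw fitness values, and each program would then be selected on `f - c * size`.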

3. Covariant Tarpeian Method

Let us now model the effects on program size of the Tarpeian method in GP systems with independent selection and symmetric sub-tree crossover. In the Tarpeian method the fitness of individuals of size not exceeding the mean size $\mu$ is left unaffected. If $p_t$ is the Tarpeian rate, on average individuals of size bigger than the mean will see their fitness set to a very low value, $f_{bad}$, in a proportion $p_t$ of cases, while fitness will be unaffected with probability $1 - p_t$. In order to see what effects the Tarpeian method has on the expected change in program size $E[\Delta\mu]$, we need to verify how the changes in fitness it produces affect the terms in the size evolution equation (Equation (5.2)). In other words, we need to compute

$$E[\Delta\mu_t] = \sum_{\ell} \ell \, (p_t(\ell) - \phi(\ell)) \tag{5.5}$$

or

$$E[\Delta\mu_t] = \frac{\mathrm{Cov}(\ell, f_t)}{\bar{f}_t}, \tag{5.6}$$

where $\Delta\mu_t = \mu'_t - \mu$, $\mu'_t$ is the average program size in the next generation when the Tarpeian method is used, $p_t(\ell)$ is the probability of selecting individuals of size $\ell$ when the Tarpeian method is used, $f_t$ is the fitness of individuals after the application of the Tarpeian method, and $\bar{f}_t$ is the mean program fitness after the application of the Tarpeian method.

Unfortunately, when attempting to study Equations (5.5) and (5.6) for the Tarpeian method, things are significantly harder than for the parsimony pressure method. Under fitness proportionate selection, we have that $p_t(\ell) = \phi(\ell) \frac{f_t(\ell)}{\bar{f}_t}$, where $f_t(\ell)$ is the mean fitness of the programs of size $\ell$ after the application of the Tarpeian method. In the absence of Tarpeian bloat control (i.e., for $p_t = 0$), these quantities are constants (given that we have full information about the current generation). However, as soon as $p_t > 0$, they become stochastic variables. This is because the Tarpeian method is stochastic and, so, we cannot be certain as to precisely how many individuals will have their fitness reduced by it, how many individuals in each length class will be affected and how many

75

Covariant Tarpeian Bloat Control

individuals in each fitness class will be affected. If ft ( ) and f¯t are stochastic variables then so are the selection probabilities pt ( ) and, consequently, also the quantity E[Δμt ] on the l.h.s. of Equations (5.5) and (5.6) In other words Equations (5.5) and (5.6) give us the expectation of the change in mean program size from one generation to the next conditionally to the Tarpeian method modifying the fitness of a particular set of individuals. In formulae,

$$E[\Delta\mu_t \mid F_t = f_t] = \sum_{\ell} \ell\,\bigl(p_t(\ell) - \phi(\ell)\bigr) = \frac{\mathrm{Cov}(\ell, f_t)}{\bar f_t}, \tag{5.7}$$

where $F_t$ is a (vector) stochastic variable which represents the fitness associated with the individuals in the population after the application of the Tarpeian method. The distribution $\Pr\{F_t = f_t\}$ of $F_t$ depends on the fitness and size of the individuals in the population and the parameter $p_t$. In principle, we could determine the explicit expression for such a distribution and then compute

$$E[\Delta\mu_t] = \sum_{f_t} E[\Delta\mu_t \mid F_t = f_t]\,\Pr\{F_t = f_t\}. \tag{5.8}$$

However, working out a closed form for this equation is difficult. The reason is that the fitness values $f_t$ appear in the denominator of the selection probabilities $p_t(\ell)$, via the average fitness $\bar f_t$, in addition to appearing in the numerators. To overcome the difficulty and obtain results which allow the application of the theory to the problem of optimally choosing the parameters of the Tarpeian method, we will use the following approximation:

$$E[\Delta\mu_t] = E\bigl[E[\Delta\mu_t \mid F_t = f_t]\bigr] = E\!\left[\frac{\mathrm{Cov}(\ell, f_t)}{\bar f_t}\right] \cong \frac{E[\mathrm{Cov}(\ell, f_t)]}{E[\bar f_t]}. \tag{5.9}$$

Later in the paper we will get an idea as to the degree of error introduced by the approximation. For now, however, let us see if we can find a closed form for this approximation.
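For a very small population the sum in Equation (5.8) can be evaluated by brute force, which gives a direct way of comparing it with the ratio-of-expectations approximation in Equation (5.9). The following is a minimal Python sketch under stated assumptions: each above-average individual is penalised independently with probability $p_t$, the function names are illustrative, and the population values in the usage lines are chosen only for illustration. The enumeration visits every subset of above-average individuals, so it is only practical for a handful of individuals.

```python
from itertools import product


def cov_and_mean(sizes, fitness):
    """Return (Cov(size, fitness), mean fitness) for one population."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    return cov, f_mean


def exact_and_approximate(sizes, fitness, p_t, f_bad=0.0):
    """Return (exact, approximate) values of E[delta mu_t].

    exact       : probability-weighted sum of Cov / mean fitness, as in (5.8)
    approximate : E[Cov] / E[mean fitness], as in (5.9)
    """
    mu = sum(sizes) / len(sizes)
    big = [i for i, l in enumerate(sizes) if l > mu]  # may be penalised
    exact = exp_cov = exp_mean = 0.0
    # Enumerate every subset of above-average individuals that could be hit.
    for hits in product([False, True], repeat=len(big)):
        prob = 1.0
        f_t = list(fitness)
        for i, hit in zip(big, hits):
            prob *= p_t if hit else 1.0 - p_t
            if hit:
                f_t[i] = f_bad
        cov, f_mean = cov_and_mean(sizes, f_t)
        exact += prob * cov / f_mean  # assumes the mean fitness is non-zero
        exp_cov += prob * cov
        exp_mean += prob * f_mean
    return exact, exp_cov / exp_mean


if __name__ == "__main__":
    sizes, fitness = [5, 2, 2, 7], [9, 1, 2, 8]  # illustrative values only
    for p_t in (0.0, 0.5, 1.0):
        print(p_t, exact_and_approximate(sizes, fitness, p_t))
```

For $p_t = 0$ and $p_t = 1$ there is a single possible outcome, so the two values coincide; any gap between them for intermediate $p_t$ is the error introduced by the approximation.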


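Let us start from computing $E[\bar f_t]$:

$$
\begin{aligned}
E[\bar f_t] &= E\Bigl[\sum_{\ell} \phi(\ell) f_t(\ell)\Bigr] \\
&= \sum_{\ell \le \mu} \phi(\ell) E[f_t(\ell)] + \sum_{\ell > \mu} \phi(\ell) E[f_t(\ell)] \\
&= \sum_{\ell \le \mu} \phi(\ell) f(\ell) + \sum_{\ell > \mu} \phi(\ell)\bigl[p_t \times f_{bad} + (1 - p_t) \times f(\ell)\bigr] \\
&= \sum_{\ell} \phi(\ell) f(\ell) - \sum_{\ell > \mu} \phi(\ell) f(\ell) + \sum_{\ell > \mu} \phi(\ell)\bigl[p_t \times f_{bad} + (1 - p_t) \times f(\ell)\bigr] \\
&= \bar f + \sum_{\ell > \mu} \phi(\ell)\bigl[p_t \times f_{bad} + (1 - p_t) \times f(\ell) - f(\ell)\bigr] \\
&= \bar f + \sum_{\ell > \mu} \phi(\ell)\bigl[p_t \times f_{bad} - p_t \times f(\ell)\bigr] \\
&= \bar f - p_t \sum_{\ell > \mu} \phi(\ell)\bigl(f(\ell) - f_{bad}\bigr) \\
&= \bar f - p_t\, \phi_{>} \sum_{\ell > \mu} \frac{\phi(\ell)}{\phi_{>}}\bigl(f(\ell) - f_{bad}\bigr) \\
&= \bar f - p_t\, \phi_{>} (\bar f_{>} - f_{bad})
\end{aligned}
\tag{5.10}
$$

where $\phi_{>} = \sum_{\ell > \mu} \phi(\ell)$ is the proportion of above-average-size programs and $\bar f_{>}$ is the average fitness of such programs.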
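Equation (5.10) is easy to check numerically, because the expectation of a mean is the mean of the per-individual expectations. The sketch below compares the closed form with that direct computation; it is an illustration only, with hypothetical function names, and it works with a list of individuals rather than with size classes.

```python
def expected_mean_fitness_closed_form(sizes, fitness, p_t, f_bad=0.0):
    """Right-hand side of (5.10): f_mean - p_t * phi_gt * (f_gt - f_bad)."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    above = [f for l, f in zip(sizes, fitness) if l > mu]
    if not above:  # nobody is above the mean, so nobody can be penalised
        return f_mean
    phi_gt = len(above) / n           # proportion of above-average programs
    f_gt = sum(above) / len(above)    # their mean fitness
    return f_mean - p_t * phi_gt * (f_gt - f_bad)


def expected_mean_fitness_direct(sizes, fitness, p_t, f_bad=0.0):
    """Average the expected post-Tarpeian fitness of each individual."""
    n = len(sizes)
    mu = sum(sizes) / n
    total = 0.0
    for l, f in zip(sizes, fitness):
        if l > mu:
            total += p_t * f_bad + (1.0 - p_t) * f
        else:
            total += f
    return total / n


if __name__ == "__main__":
    sizes, fitness = [5, 2, 2, 7], [9, 1, 2, 8]  # illustrative values only
    print(expected_mean_fitness_closed_form(sizes, fitness, 0.5))
    print(expected_mean_fitness_direct(sizes, fitness, 0.5))
```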


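Let us now compute the expected covariance between $\ell$ and $f_t$:

$$
\begin{aligned}
E[\mathrm{Cov}(\ell, f_t)] &= E\Bigl[\sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(f_t(\ell) - \bar f_t\bigr)\Bigr] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\, E\bigl[f_t(\ell) - \bar f_t\bigr] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - E[\bar f_t]\bigr) \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - \bar f + p_t \phi_{>} (\bar f_{>} - f_{bad})\bigr) \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - \bar f\bigr) + p_t \phi_{>} (\bar f_{>} - f_{bad}) \underbrace{\sum_{\ell} \phi(\ell)(\ell - \mu)}_{=0} \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - \bar f\bigr) \\
&= \sum_{\ell \le \mu} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - \bar f\bigr) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl(E[f_t(\ell)] - \bar f\bigr) \\
&= \sum_{\ell \le \mu} \phi(\ell)(\ell - \mu)\bigl(f(\ell) - \bar f\bigr) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl[p_t f_{bad} + (1 - p_t) f(\ell) - \bar f\bigr] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\bigl(f(\ell) - \bar f\bigr) - \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl(f(\ell) - \bar f\bigr) \\
&\quad + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl[p_t f_{bad} + (1 - p_t) f(\ell) - \bar f\bigr] \\
&= \mathrm{Cov}(\ell, f) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl[p_t f_{bad} + (1 - p_t) f(\ell) - \bar f - f(\ell) + \bar f\bigr].
\end{aligned}
$$

Thus

$$E[\mathrm{Cov}(\ell, f_t)] = \mathrm{Cov}(\ell, f) - p_t \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl(f(\ell) - f_{bad}\bigr). \tag{5.11}$$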


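If $\mu_{>}$ is the average size of the programs that are longer than $\mu$, we can write

$$
\begin{aligned}
\sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\bigl(f(\ell) - f_{bad}\bigr)
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>} - \mu + \mu_{>})\bigl(f(\ell) - f_{bad}\bigr) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\bigl(f(\ell) - f_{bad}\bigr) - (\mu - \mu_{>}) \sum_{\ell > \mu} \phi(\ell)\bigl(f(\ell) - f_{bad}\bigr) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\bigl(f(\ell) - \bar f_{>} - f_{bad} + \bar f_{>}\bigr) - (\mu - \mu_{>})\,\phi_{>} (\bar f_{>} - f_{bad}) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\bigl(f(\ell) - \bar f_{>}\bigr) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})(\bar f_{>} - f_{bad}) \\
&\quad - (\mu - \mu_{>})\,\phi_{>} (\bar f_{>} - f_{bad}) \\
&= \phi_{>} \mathrm{Cov}_{>}(\ell, f) + (\bar f_{>} - f_{bad}) \underbrace{\sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})}_{=0} - (\mu - \mu_{>})\,\phi_{>} (\bar f_{>} - f_{bad}) \\
&= \phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr],
\end{aligned}
$$

where $\mathrm{Cov}_{>}(\ell, f)$ is the covariance between program size and fitness within the programs which are of above-average size. Thus, we finally obtain

$$E[\mathrm{Cov}(\ell, f_t)] = \mathrm{Cov}(\ell, f) - p_t\, \phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]. \tag{5.12}$$

Substituting Equations (5.10) and (5.12) into Equation (5.9) we obtain

$$E[\Delta\mu_t] \cong \frac{\mathrm{Cov}(\ell, f) - p_t\, \phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]}{\bar f - p_t\, \phi_{>} (\bar f_{>} - f_{bad})}. \tag{5.13}$$

With this explicit formulation of the expected size changes, following the same strategy as in the covariant parsimony pressure method (see Section 2), we can find out for what value of $p_t$ we get $E[\Delta\mu_t] = 0$. By setting the l.h.s. of Equation (5.13) to 0 and solving for $p_t$, we obtain:

$$p_t \cong \frac{\mathrm{Cov}(\ell, f)}{\phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]}. \tag{5.14}$$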

This equation allows one to determine how often the Tarpeian method should be applied to modify the fitness of above-average-size programs as a function of a small set of descriptors of the current state of the population and of the parameter fbad .
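The descriptors that enter Equations (5.13) and (5.14) can all be computed in one pass over the population. The sketch below is a minimal illustration rather than a reference implementation: the function names are hypothetical, the population is passed as two parallel lists, and the returned rate is deliberately not clipped, so that values above 1 remain visible to the caller.

```python
def population_stats(sizes, fitness):
    """Return (mu, f_mean, cov, phi_gt, mu_gt, f_gt, cov_gt)."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    above = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    if not above:
        raise ValueError("all programs have the same size")
    k = len(above)
    phi_gt = k / n                                   # proportion above mean
    mu_gt = sum(l for l, _ in above) / k             # their mean size
    f_gt = sum(f for _, f in above) / k              # their mean fitness
    cov_gt = sum((l - mu_gt) * (f - f_gt) for l, f in above) / k
    return mu, f_mean, cov, phi_gt, mu_gt, f_gt, cov_gt


def tarpeian_rate(sizes, fitness, f_bad=0.0):
    """Rate p_t that makes the approximate E[delta mu_t] zero, as in (5.14)."""
    mu, _, cov, phi_gt, mu_gt, f_gt, cov_gt = population_stats(sizes, fitness)
    return cov / (phi_gt * (cov_gt + (mu_gt - mu) * (f_gt - f_bad)))


def approx_size_change(sizes, fitness, p_t, f_bad=0.0):
    """Approximate E[delta mu_t] for a given rate, as in (5.13)."""
    mu, f_mean, cov, phi_gt, mu_gt, f_gt, cov_gt = population_stats(sizes, fitness)
    num = cov - p_t * phi_gt * (cov_gt + (mu_gt - mu) * (f_gt - f_bad))
    den = f_mean - p_t * phi_gt * (f_gt - f_bad)
    return num / den


if __name__ == "__main__":
    sizes, fitness = [5, 2, 2, 7], [9, 1, 2, 8]  # illustrative values only
    p_t = tarpeian_rate(sizes, fitness)
    print(p_t, approx_size_change(sizes, fitness, p_t))
```

In a GP run one would recompute the rate at every generation, since every descriptor depends on the current population.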


We should note that for some values of $f_{bad}$ the method is unable to control bloat. For such values, one would need to set $p_t > 1$, which is clearly impossible (since $p_t$ is a probability). Naturally, we can find out what such values of $f_{bad}$ are by setting $p_t = 1$ in Equation (5.14) and solving for $f_{bad}$, obtaining

$$f_{bad} \cong \bar f_{>} - \frac{\mathrm{Cov}(\ell, f) - \mathrm{Cov}_{>}(\ell, f)\,\phi_{>}}{\phi_{>} (\mu_{>} - \mu)}. \tag{5.15}$$
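Equation (5.15) gives the value of $f_{bad}$ at which the required rate is exactly 1. Since $\mu_{>} > \mu$, lowering $f_{bad}$ increases the denominator of Equation (5.14), so (assuming a positive covariance between size and fitness) any $f_{bad}$ at or below this value keeps the rate within $[0, 1]$. A small Python sketch of the computation follows; the function name and the illustrative population are assumptions.

```python
def f_bad_threshold(sizes, fitness):
    """Value of f_bad for which (5.14) gives p_t = 1, as in (5.15)."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    above = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    if not above:
        raise ValueError("all programs have the same size")
    k = len(above)
    phi_gt = k / n
    mu_gt = sum(l for l, _ in above) / k
    f_gt = sum(f for _, f in above) / k
    cov_gt = sum((l - mu_gt) * (f - f_gt) for l, f in above) / k
    return f_gt - (cov - cov_gt * phi_gt) / (phi_gt * (mu_gt - mu))


if __name__ == "__main__":
    # Illustrative values only: sizes and raw fitness of four programs.
    print(f_bad_threshold([5, 2, 2, 7], [9, 1, 2, 8]))
```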

However, since we normally don’t particularly care about the specific value of $f_{bad}$, as long as the method gets the job done, the obvious and safe choice $f_{bad} = 0$ is perhaps the most practical one.

What if we wanted $\mu(t)$ to follow, in expectation, a particular function $\gamma(t)$, e.g., the ramp $\gamma(t) = \mu(0) + b \times t$ or a sinusoidal function? The theory helps us in this case as well. What we want is that $E[\mu_t] = \gamma(g)$, where $g$ is the generation number. Note that $E[\mu_t] = E[\Delta\mu_t] + \mu$. So, adding $\mu$ to both sides of Equation (5.13) we obtain:

$$\gamma(g) \cong \frac{\mathrm{Cov}(\ell, f) - p_t\, \phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]}{\bar f - p_t\, \phi_{>} (\bar f_{>} - f_{bad})} + \mu.$$

Solving again for $p_t$ yields:

$$p_t \cong \frac{\mathrm{Cov}(\ell, f) - [\gamma(g) - \mu]\,[\bar f - p_t\, \phi_{>} (\bar f_{>} - f_{bad})]}{\phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]} \tag{5.16}$$

Note that, in the absence of sampling noise (i.e., for an infinite population), requiring that $E[\Delta\mu] = 0$ at each generation implies $\mu(g) = \mu(0)$ for all $g > 0$. However, in any finite population the method can only achieve $\Delta\mu = 0$ in expectation, so there can be some random drift in $\mu(g)$ w.r.t. its starting value of $\mu(0)$. If tighter control over the mean program size is desired, one can use Equation (5.16) with the choice $\gamma(g) = \mu(0)$, which leads to the following formula

$$p_t \cong \frac{\mathrm{Cov}(\ell, f) - [\mu(0) - \mu]\,[\bar f - p_t\, \phi_{>} (\bar f_{>} - f_{bad})]}{\phi_{>} \bigl[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar f_{>} - f_{bad})\bigr]} \tag{5.17}$$

Note the similarities and differences between this and Equation (5.14). In the presence of any drift moving $\mu$ away from $\mu(0)$, this equation will actively strengthen the size control pressure to push the mean program size back to its initial value.
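The rate $p_t$ appears on both sides of the target-tracking formulas above. One way to use them in practice is to collect the $p_t$ terms of the relation $\gamma(g) - \mu \cong E[\Delta\mu_t]$ first, which gives $p_t \cong \bigl(\mathrm{Cov}(\ell, f) - [\gamma(g) - \mu]\,\bar f\bigr) / \bigl(\phi_{>}[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \gamma(g))(\bar f_{>} - f_{bad})]\bigr)$. The Python sketch below assumes this rearranged form; the function name, the clipping of the result to $[0, 1]$ and the illustrative population are all assumptions.

```python
def tarpeian_rate_for_target(sizes, fitness, target, f_bad=0.0):
    """Rate p_t for which the approximate E[mu_t] equals `target`.

    Uses the explicit rearrangement described in the lead-in and clips the
    result to [0, 1], since p_t is a probability.
    """
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    above = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    if not above:
        raise ValueError("all programs have the same size")
    k = len(above)
    phi_gt = k / n
    mu_gt = sum(l for l, _ in above) / k
    f_gt = sum(f for _, f in above) / k
    cov_gt = sum((l - mu_gt) * (f - f_gt) for l, f in above) / k
    numerator = cov - (target - mu) * f_mean
    denominator = phi_gt * (cov_gt + (mu_gt - target) * (f_gt - f_bad))
    return min(1.0, max(0.0, numerator / denominator))


if __name__ == "__main__":
    sizes, fitness = [5, 2, 2, 7], [9, 1, 2, 8]  # illustrative values only
    mu = sum(sizes) / len(sizes)
    print(tarpeian_rate_for_target(sizes, fitness, target=mu))        # hold size
    print(tarpeian_rate_for_target(sizes, fitness, target=mu - 0.5))  # shrink
```

With `target` equal to the current mean this reduces to the zero-growth rate; a target below the current mean yields a larger rate, which is the restoring pressure described above.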

4. Example and Numerical Corroboration

As an example, let us consider the small population in the first two columns of Table 5-1 and let us apply Equation (5.3) to it. We have that $\mathrm{Cov}(\ell, f) = 6.75$ and $\bar f = 5$. So, in the absence of bloat control we will have an expected increase in program size of $E[\Delta\mu] = 1.35$ at the next generation. This is to be expected given the strong correlation between fitness and size in our sample population.

Table 5-1. The effects of the covariant Tarpeian method on a small sample population of 4 individuals. The size and raw fitness of the individuals in the population are shown in the first two columns. The remaining columns report the fitness $f_t$ associated with each individual after the application of the Tarpeian method with optimal $p_t$ in each of ten trials.

| Size | $f$ | $f_t$ (1) | $f_t$ (2) | $f_t$ (3) | $f_t$ (4) | $f_t$ (5) | $f_t$ (6) | $f_t$ (7) | $f_t$ (8) | $f_t$ (9) | $f_t$ (10) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 9 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 9 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 7 | 8 | 0 | 0 | 8 | 0 | 8 | 0 | 0 | 0 | 8 | 0 |
| $E[\Delta\mu]$ | 1.35 | -2.00 | -2.00 | 1.64 | -2.00 | 1.64 | 0.25 | -2.00 | 0.25 | 1.64 | -2.00 |

Average $E[\Delta\mu]$ over the ten trials = -0.46

Let us now compute $p_t$ using Equation (5.14). Since in our population $\mu = 4$, we have that $\phi_{>} = 0.5$, the programs of size 5 and 7 being of above-average size. Their average size is $\mu_{>} = 6$ and their average fitness is $\bar f_{>} = 8.5$. Finally, the covariance between their size and their fitness is $\mathrm{Cov}_{>}(\ell, f) = -0.5$. Using these values and the covariance between size and fitness which we computed previously, and taking the safe value $f_{bad} = 0$, we obtain $p_t \cong 0.818182$.

Let us now imagine that we adopt this particular value of $p_t$ and let us recompute the Tarpeian fitness of the members of our population based on the application of the Tarpeian method (with $f_{bad} = 0$). Since the method is stochastic we will do it multiple times, so as to get an idea of its expected behaviour. The results of these trials are shown in columns 3–12 of Table 5-1. Computing the expected change in program size after the application of the Tarpeian method shows that in 5 out of 10 cases it is negative, in 2 cases it is marginally positive and only in the remaining cases is it comparable to (in fact slightly bigger than) the value expected when the Tarpeian method is not used. Indeed, on average we expect a slight contraction in the mean program size of −0.46. In other words, the estimate for $p_t$ has exceeded the value required to achieve a zero expected change in program size. Errors such as this have to be expected given the tiny population we have used.

To corroborate the theory presented in the previous section and evaluate how population size affects the accuracy of our estimate of $p_t$, we need to perform many more trials (so as to avoid small sample errors) with a variety of population sizes. For these tests we will create populations with an extremely high correlation between fitness and size.
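The bottom row of Table 5-1 can be recomputed directly from the trial columns, since each entry is the covariance between size and post-Tarpeian fitness divided by the mean post-Tarpeian fitness. The sketch below does this and also shows one way of drawing further trials; the function names and the use of Python's `random` module are assumptions made for illustration.

```python
import random


def size_change(sizes, fitness):
    """Expected change in mean size: Cov(size, fitness) / mean fitness."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    return cov / f_mean


def apply_tarpeian(sizes, fitness, p_t, f_bad=0.0, rng=random):
    """Penalise each above-average-size individual with probability p_t."""
    mu = sum(sizes) / len(sizes)
    return [f_bad if l > mu and rng.random() < p_t else f
            for l, f in zip(sizes, fitness)]


SIZES = [5, 2, 2, 7]
RAW_FITNESS = [9, 1, 2, 8]
# Post-Tarpeian fitness of the size-5 and size-7 programs in the ten trials
# of Table 5-1; the two size-2 programs always keep fitness 1 and 2.
TRIALS = [(0, 0), (0, 0), (0, 8), (0, 0), (0, 8),
          (9, 0), (0, 0), (9, 0), (0, 8), (0, 0)]

if __name__ == "__main__":
    changes = [size_change(SIZES, [a, 1, 2, b]) for a, b in TRIALS]
    print([round(c, 2) for c in changes])
    print(round(sum(changes) / len(changes), 2))
    # A fresh trial with the rate obtained in the text:
    print(apply_tarpeian(SIZES, RAW_FITNESS, 0.818182, rng=random.Random(0)))
```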


Table 5-2. Errors in $E[\Delta\mu_t]$ resulting from the approximations in the calculation of $p_t$ for different population sizes and for a fitness function where $f(\ell) = \ell$. Statistics were computed over 1,000 independent repetitions of the application of the Tarpeian method to a population including programs from size 1 to $M$, $M$ being the population size.

| Population size $M$ | $E[\Delta\mu]$ without Tarpeian | Estimated optimal $p_t$ | Average $E[\Delta\mu_t]$ with Tarpeian | Standard deviation of $E[\Delta\mu_t]$ |
|---|---|---|---|---|
| 10 | 15.00 | 0.750 | -3.050 | 10.74 |
| 100 | 16.51 | 0.795 | -0.275 | 3.64 |
| 1000 | 16.80 | 0.804 | 0.026 | 1.16 |
| 10000 | 16.83 | 0.805 | -0.004 | 0.36 |
| 100000 | 16.83 | 0.805 | -0.003 | 0.12 |

Our populations include $M$ = 10, 100, 1,000, 10,000 and 100,000 individuals. In each population individual $i$ has size $\ell_i = \frac{i}{M} \times 100$ and fitness $f_i = \ell_i$. These choices would be expected to produce very strong bloat. Indeed, as shown in the second column of Table 5-2, we expect the mean size of programs to increase by between 15 and 16.83 at the next generation.

We now apply the Tarpeian method with the optimal $p_t$ computed via Equation (5.14) on our test populations 1000 times. The optimal $p_t$ obtained for each population size is shown in the third column of Table 5-2. Each time different individuals are hit by the reduction of fitness associated with the method. So, different expected changes in program size $E[\Delta\mu_t]$ will be produced. The fourth and fifth columns of Table 5-2 show the mean and standard deviation of $E[\Delta\mu_t]$ over the 1000 repetitions of the test. As we can see from these values, in all cases bloat is entirely under control, although, for this problem, Equation (5.14) consistently overestimates $p_t$, thereby leading to slightly shrinking individuals on average. Note how rapidly the mean error becomes very small as the population size grows towards the typical values used in realistic GP runs. The standard deviations also drop rapidly, indicating that the method becomes almost deterministic for very large population sizes. This is confirmed by the distributions of $E[\Delta\mu_t]$ for different population sizes shown in Figure 5-1.
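The second and third columns of Table 5-2 depend only on the test population, so they can be recomputed without any random sampling. The sketch below does so under one assumption that should be read as such: sizes are truncated to integers, i.e. individual $i$ gets size `(100 * i) // M`. With this choice the computed values come out close to the tabulated ones for the population sizes tried; the function names are illustrative.

```python
def make_population(m):
    """Sizes and fitness for population size m, with fitness equal to size.

    Assumption: sizes are truncated to integers, (100 * i) // m for i = 1..m.
    """
    sizes = [(100 * i) // m for i in range(1, m + 1)]
    return sizes, list(sizes)


def growth_and_rate(sizes, fitness, f_bad=0.0):
    """Return (E[delta mu] without control, rate p_t from Equation (5.14))."""
    n = len(sizes)
    mu = sum(sizes) / n
    f_mean = sum(fitness) / n
    cov = sum((l - mu) * (f - f_mean) for l, f in zip(sizes, fitness)) / n
    above = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    k = len(above)
    phi_gt = k / n
    mu_gt = sum(l for l, _ in above) / k
    f_gt = sum(f for _, f in above) / k
    cov_gt = sum((l - mu_gt) * (f - f_gt) for l, f in above) / k
    p_t = cov / (phi_gt * (cov_gt + (mu_gt - mu) * (f_gt - f_bad)))
    return cov / f_mean, p_t


if __name__ == "__main__":
    for m in (10, 100, 1000, 10000, 100000):
        growth, p_t = growth_and_rate(*make_population(m))
        print(m, round(growth, 2), round(p_t, 3))
```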

5. Empirical Tests

To further corroborate the theory, we conducted experiments using a linear register-based GP system. The system we used is a generational GP system. It initialises the population by repeatedly creating random individuals with lengths uniformly distributed between 1 and 200 primitives. The primitives are drawn randomly and uniformly from a problem’s primitive set. The system uses fitness proportionate selection and crossover applied with a rate of 90%. The remaining 10% of the population is created via selection followed by point mutation (with a rate of 1 mutation per program). Crossover creates offspring by selecting two random crossover points, one in each parent, and taking the first part of the first parent and the second part of the second w.r.t. their crossover points. This is a form of sub-tree crossover for linear structures/trees. We used populations of size 1,000 and 10,000. In each condition we performed 100 independent runs, each lasting either 50 or 100 generations.

With this system we solved a classical symbolic regression problem: the quintic polynomial. In other words, the objective was to evolve a function which fits a polynomial of the form $x + x^2 + \cdots + x^d$, where $d = 5$ is the degree of the polynomial, for $x$ in the range $[-1, 1]$. In particular we sampled the polynomials at the 21 equally spaced points $x \in \{-1, -0.9, \ldots, 0.9, 1.0\}$. Polynomials of this type have been widely used as benchmark problems in the GP literature. Fitness (to be maximised) was $1/(1 + \mathit{error})$, where $\mathit{error}$ is the sum of the absolute differences between the target polynomial and the output produced by the program under evaluation over these 21 fitness cases.

The primitive set used to solve these problems is shown in Table 5-3. The instructions refer to three registers: the input register RIN, which is loaded with the value of $x$ before a fitness case is evaluated, and the two registers R1 and R2, which can be used for numerical calculations. R1 and R2 are initialised to $x$ and 0, respectively. The output of the program is read from R1 at the end of its execution.

Table 5-3. Primitive set used in our experiments.

| Instructions |
|---|
| R1 = RIN |
| R2 = RIN |
| R1 = R1 + R2 |
| R2 = R1 + R2 |
| R1 = R1 * R2 |
| R2 = R1 * R2 |
| Swap R1 R2 |

Figure 5-2 shows the results of our runs for populations of size 1,000 and 10,000 in the absence of bloat control and when using the version of the Covariant Tarpeian method in Equation (5.17). Figure 5-3 shows the results for a population of size 1,000 when using the version of the Covariant Tarpeian method in Equation (5.16) where $\gamma(g)$ is the following triangle wave of period 50 generations:

$$\gamma(g) = 100 \times \left(0.75 + 0.5 \times \left| \frac{g + 12.5}{50} - \left\lfloor \frac{g + 12.5}{50} + 0.5 \right\rfloor \right| \right). \tag{5.18}$$
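A triangle wave of this kind is commonly written as the distance between a scaled time variable and its nearest integer. The sketch below assumes that standard form for $\gamma(g)$, with the constants 100, 0.75, 0.5, 12.5 and 50 taken from Equation (5.18); the parameter names are hypothetical.

```python
import math


def gamma(g, scale=100.0, base=0.75, depth=0.5, phase=12.5, period=50.0):
    """Triangle-wave target for the mean program size at generation g."""
    x = (g + phase) / period
    # |x - floor(x + 0.5)| is the distance from x to the nearest integer,
    # a triangle wave with period 1 in x and values between 0 and 0.5.
    return scale * (base + depth * abs(x - math.floor(x + 0.5)))


if __name__ == "__main__":
    for g in range(0, 101, 10):
        print(g, gamma(g))
```

Under this form the target repeats every 50 generations, since adding 50 to `g` adds exactly 1 to `x`.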


Table 5-4. Comparison of success rates in the quintic polynomial regression for different population sizes with and without Tarpeian bloat control. Runs were declared successful if the sum of absolute errors in the best individual fell below 1. Tarpeian bloat control was exerted using Equation (5.16) with $\gamma(g) = \mu(0)$ (“Covariant Tarpeian constant”) or with the $\gamma(g)$ function in Equation (5.18) (“Covariant Tarpeian triangle”).

| Bloat control | pop size | success rate |
|---|---|---|
| None | 1,000 | 94% |
| Covariant Tarpeian constant | 1,000 | 92% |
| Covariant Tarpeian triangle | 1,000 | 95% |
| None | 10,000 | 100% |
| Covariant Tarpeian constant | 10,000 | 100% |

It is apparent that in the absence of bloat control there is very substantial bloat, while the Covariant Tarpeian method provides almost total control over the size dynamics. It has sometimes been suggested that bloat control techniques can harm performance. One may wonder, then, if performance was affected by the use of the covariant Tarpeian method. In the quintic polynomial regression there was very little variation in the success rate (for a given population size) across techniques, as illustrated in Table 5-4. This is very encouraging, but it would be surprising if in other problems and for other parameter settings there weren’t some performance differences. Future research will need to explore this.
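The register machine, fitness measure and crossover operator described at the start of this section are simple enough to sketch in a few lines of Python. The sketch below is an illustration under stated assumptions, not the system used in the experiments: programs are lists of instruction strings spelled as in Table 5-3, and the crossover points are drawn uniformly, including the two ends of each parent.

```python
import random


def run_program(program, x):
    """Execute a linear program on input x and return the content of R1."""
    r1, r2 = x, 0.0                      # R1 and R2 start at x and 0
    for op in program:
        if op == "R1 = RIN":
            r1 = x
        elif op == "R2 = RIN":
            r2 = x
        elif op == "R1 = R1 + R2":
            r1 = r1 + r2
        elif op == "R2 = R1 + R2":
            r2 = r1 + r2
        elif op == "R1 = R1 * R2":
            r1 = r1 * r2
        elif op == "R2 = R1 * R2":
            r2 = r1 * r2
        elif op == "Swap R1 R2":
            r1, r2 = r2, r1
        else:
            raise ValueError("unknown instruction: " + op)
    return r1


CASES = [(i - 10) / 10 for i in range(21)]   # -1.0, -0.9, ..., 1.0


def quintic(x):
    return x + x**2 + x**3 + x**4 + x**5


def fitness(program):
    """1 / (1 + sum of absolute errors over the 21 fitness cases)."""
    error = sum(abs(quintic(x) - run_program(program, x)) for x in CASES)
    return 1.0 / (1.0 + error)


def crossover(parent1, parent2, rng=random):
    """First part of parent1 followed by the second part of parent2."""
    cut1 = rng.randint(0, len(parent1))
    cut2 = rng.randint(0, len(parent2))
    return parent1[:cut1] + parent2[cut2:]


if __name__ == "__main__":
    # Hand-written solution: R1 <- R1 * x + x, repeated four times.
    solution = ["R2 = RIN"] + ["R1 = R1 * R2", "R1 = R1 + R2"] * 4
    print(fitness(solution), fitness([]))
    print(crossover(solution, ["Swap R1 R2"] * 3, random.Random(0)))
```

Because an offspring can inherit a long head and a long tail at the same time, this crossover can produce children longer than either parent, which is what makes size growth possible in the first place.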

6. Conclusions

There are almost as many anti-bloat recipes as there are researchers in genetic programming. Very few, however, have a theoretical pedigree. The Tarpeian method (Poli, 2003) is one of them. In recent years, the method has started becoming more and more widespread, probably because of its simplicity. The method, however, like most others, requires setting one main parameter (and one secondary one) for it to perform appropriately. Until now this parameter had to be set by trial and error. In this paper we integrate the theory that led to the development of the original Tarpeian method with ideas that recently led to the covariant parsimony pressure method (Poli and McPhee, 2008) (another theoretically derived method), to obtain equations which allow one to optimally set the parameter(s) of the method so as to achieve almost full control over the evolution of the mean program size in runs of genetic programming. Although the complexity of the task has forced us to rely on approximations to make progress, numerical and empirical corroboration confirm that the quality of the approximation is good. Experiments have also confirmed the effectiveness of the Covariant Tarpeian method.


References

Alfaro-Cid, Eva, Merelo, J. J., Fernandez de Vega, Francisco, Esparcia-Alcazar, Anna I., and Sharman, Ken (2010). Bloat control operators and diversity in genetic programming: A comparative study. Evolutionary Computation, 18(2):305–332.

Allen, Sam, Burke, Edmund K., Hyde, Matthew R., and Kendall, Graham (2009). Evolving reusable 3D packing heuristics with genetic programming. In Raidl, Guenther, Rothlauf, Franz, Squillero, Giovanni, Drechsler, Rolf, Stuetzle, Thomas, Birattari, Mauro, Congdon, Clare Bates, Middendorf, Martin, Blum, Christian, Cotta, Carlos, Bosman, Peter, Grahl, Joern, Knowles, Joshua, Corne, David, Beyer, Hans-Georg, Stanley, Ken, Miller, Julian F., van Hemert, Jano, Lenaerts, Tom, Ebner, Marc, Bacardit, Jaume, O’Neill, Michael, Di Penta, Massimiliano, Doerr, Benjamin, Jansen, Thomas, Poli, Riccardo, and Alba, Enrique, editors, GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 931–938, Montreal. ACM.

Burke, Edmund K., Hyde, Matthew R., Kendall, Graham, and Woodward, John (2007). Automatic heuristic generation with genetic programming: evolving a jack-of-all-trades or a master of one. In Thierens, Dirk, Beyer, Hans-Georg, Bongard, Josh, Branke, Jurgen, Clark, John Andrew, Cliff, Dave, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Kovacs, Tim, Kumar, Sanjeev, Miller, Julian F., Moore, Jason, Neumann, Frank, Pelikan, Martin, Poli, Riccardo, Sastry, Kumara, Stanley, Kenneth Owen, Stutzle, Thomas, Watson, Richard A, and Wegener, Ingo, editors, GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, volume 2, pages 1559–1565, London. ACM Press.

Chouza, Mariano, Rancan, Claudio, Clua, Osvaldo, and Garcia-Martinez, Ramon (2009). Passive analog filter design using GP population control strategies. In Chien, Been-Chian and Hong, Tzung-Pei, editors, Opportunities and Challenges for Next-Generation Applied Intelligence: Proceedings of the International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA-AIE) 2009, volume 214 of Studies in Computational Intelligence, pages 153–158. Springer-Verlag.

Garcia, Beatriz, Aler, Ricardo, Ledezma, Agapito, and Sanchis, Araceli (2008a). Genetic programming for predicting protein networks. In Geffner, Hector, Prada, Rui, Alexandre, Isabel Machado, and David, Nuno, editors, Proceedings of the 11th Ibero-American Conference on AI, IBERAMIA 2008, volume 5290 of Lecture Notes in Computer Science, pages 432–441, Lisbon, Portugal. Springer. Advances in Artificial Intelligence.

Garcia, Beatriz, Aler, Ricardo, Ledezma, Agapito, and Sanchis, Araceli (2008b). Protein-protein functional association prediction using genetic programming. In Keijzer, Maarten, Antoniol, Giuliano, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Hansen, Nikolaus, Holmes, John H., Hornby, Gregory S., Howard, Daniel, Kennedy, James, Kumar, Sanjeev, Lobo, Fernando G., Miller, Julian Francis, Moore, Jason, Neumann, Frank, Pelikan, Martin, Pollack, Jordan, Sastry, Kumara, Stanley, Kenneth, Stoica, Adrian, Talbi, El-Ghazali, and Wegener, Ingo, editors, GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 347–348, Atlanta, GA, USA. ACM.

Garcia-Almanza, Alma Lilia and Tsang, Edward P. K. (2006). Simplifying decision trees learned by genetic programming. In Proceedings of the 2006 IEEE Congress on Evolutionary Computation, pages 7906–7912, Vancouver. IEEE Press.

Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.

Luke, Sean and Panait, Liviu (2006). A comparison of bloat control methods for genetic programming. Evolutionary Computation, 14(3):309–344.

Mahler, Sébastien, Robilliard, Denis, and Fonlupt, Cyril (2005). Tarpeian bloat control and generalization accuracy. In Keijzer, Maarten, Tettamanzi, Andrea, Collet, Pierre, van Hemert, Jano I., and Tomassini, Marco, editors, Proceedings of the 8th European Conference on Genetic Programming, volume 3447 of Lecture Notes in Computer Science, pages 203–214, Lausanne, Switzerland. Springer.

Martinez-Jaramillo, Serafin and Tsang, Edward P. K. (2009). An heterogeneous, endogenous and coevolutionary GP-based financial market. IEEE Transactions on Evolutionary Computation, 13(1):33–55.

Poli, Riccardo (2003). A simple but theoretically-motivated method to control bloat in genetic programming. In Ryan, Conor, Soule, Terence, Keijzer, Maarten, Tsang, Edward, Poli, Riccardo, and Costa, Ernesto, editors, Genetic Programming, Proceedings of EuroGP’2003, volume 2610 of LNCS, pages 204–217, Essex. Springer-Verlag.

Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).

Poli, Riccardo and McPhee, Nicholas (2008). Parsimony pressure made easy. In Keijzer, Maarten, Antoniol, Giuliano, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Hansen, Nikolaus, Holmes, John H., Hornby, Gregory S., Howard, Daniel, Kennedy, James, Kumar, Sanjeev, Lobo, Fernando G., Miller, Julian Francis, Moore, Jason, Neumann, Frank, Pelikan, Martin, Pollack, Jordan, Sastry, Kumara, Stanley, Kenneth, Stoica, Adrian, Talbi, El-Ghazali, and Wegener, Ingo, editors, GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 1267–1274, Atlanta, GA, USA. ACM.

Poli, Riccardo and McPhee, Nicholas Freitag (2003). General schema theory for genetic programming with subtree-swapping crossover: Part II. Evolutionary Computation, 11(2):169–206.

Price, George R. (1970). Selection and covariance. Nature, 227, August 1:520–521.

Roberts, Mark E. and Claridge, Ela (2004). Cooperative coevolution of image feature construction and object detection. In Yao, Xin, Burke, Edmund, Lozano, Jose A., Smith, Jim, Merelo-Guervós, Juan J., Bullinaria, John A., Rowe, Jonathan, Tiňo, Peter, Kabán, Ata, and Schwefel, Hans-Paul, editors, Parallel Problem Solving from Nature - PPSN VIII, volume 3242 of LNCS, pages 902–911, Birmingham, UK. Springer-Verlag.

Silva, Sara (2008). Controlling Bloat: Individual and Population Based Approaches in Genetic Programming. PhD thesis, Coimbra University, Portugal. Full author name is Sara Guilherme Oliveira da Silva.

Wyns, Bart and Boullart, Luc (2009). Efficient tree traversal to reduce code growth in tree-based genetic programming. Journal of Heuristics, 15(1):77–104.

Zhang, Byoung-Tak and Mühlenbein, Heinz (1993). Evolving optimal neural networks using genetic algorithms with Occam’s razor. Complex Systems, 7:199–220.

Zhang, Byoung-Tak and Mühlenbein, Heinz (1995). Balancing accuracy and parsimony in genetic programming. Evolutionary Computation, 3(1):17–38.

Zhang, Byoung-Tak, Ohm, Peter, and Mühlenbein, Heinz (1997). Evolutionary induction of sparse neural trees. Evolutionary Computation, 5(2):213–236.

Figure 5-1. Distributions of $E[\Delta\mu_t]$ resulting from the application of the Covariant Tarpeian method for populations of size 10 (top left), 100 (top right), 1,000 (middle left), 10,000 (middle right) and 100,000 (bottom) with our sample fitness function.

Figure 5-2. Mean program size for populations of size 1,000 (a) and 10,000 (b) as a function of the generation number on the quintic polynomial symbolic regression in the absence of bloat control and when using the version of the Covariant Tarpeian method in Equation (5.17). (Both panels plot Program Size against Generations, with one curve labelled "Tarpeian method" and one labelled "no bloat control".)

Figure 5-3. Average program size for populations of size 1,000 and runs lasting 100 generations with the quintic polynomial symbolic regression when using the version of the Covariant Tarpeian method in Equation (5.16) where $\gamma(g)$ is a triangle wave. The dashed line represents the mean of the average program size across runs. (The plot shows Program Size against Generations.)

Chapter 6

A SURVEY OF SELF MODIFYING CARTESIAN GENETIC PROGRAMMING

Simon Harding¹, Wolfgang Banzhaf¹ and Julian F. Miller²

¹ Department of Computer Science, Memorial University, Canada; ² Department of Electronics, University of York, UK.

Abstract

Self-Modifying Cartesian Genetic Programming (SMCGP) is a general purpose, graph-based, developmental form of Cartesian Genetic Programming. In addition to the usual computational functions found in CGP, SMCGP includes functions that can modify the evolved program at run time. This means that programs can be iterated to produce an infinite sequence of phenotypes from a single evolved genotype. Here, we discuss the results of using SMCGP on a variety of different problems, and see that SMCGP is able to solve tasks that require scalability and plasticity. We demonstrate how SMCGP is able to produce results that would be impossible for conventional, static Genetic Programming techniques.

Keywords: Cartesian genetic programming, developmental systems

1. Introduction

In evolutionary computation (EC) scalability has always been an important issue. An evolutionary technique is scalable if the generational time it takes to evolve a satisfactory solution to a problem increases relatively weakly with increasing problem size. As in EC, scalability is an important issue in Genetic Programming (GP). In GP important methods for improving scalability are modularity and re-use. Modularity is introduced through sub-functions or subprocedures. These are often called Automatically Defined Functions (ADFs) (Koza, 1994a). The use of ADFs improves the scalability of GP by allowing solutions of larger or more difficult instances of particular classes of problems to be evolved. However, GP methods in general have largely employed genotype representations whose length (number of genes) is proportional to the size of the anticipated problem solutions. This has meant that evolutionary operators (e.g. crossover or mutation) have been used as the mechanism for building large genotypes. The same idea underlies approaches to evolving artificial neural networks. For instance, a well known method called NEAT uses evolutionary operators to introduce new neurons and connections, thus expanding the size of the genotype (Stanley and Miikkulainen, 2002).

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_6, © Springer Science+Business Media, LLC 2011

It is interesting to contrast these approaches with the mechanisms employed in the evolution of biological organisms. Multicellular organisms, having possibly enormous phenotypes, are developed from relatively simple genotypes. Development implies an unfolding in space and time. It is clearly promising to consider employing an analogue of biological development in genetic programming (Banzhaf and Miller, 2004). There are, of course, many possible aspects of developmental biology that could be adopted to construct a developmental GP method. In this chapter we discuss one such approach. It is called Self Modifying Cartesian Genetic Programming (SMCGP). It is based on a simple underlying idea, namely that a phenotype can unfold over time from a genotype by allowing the genotype to include primitive functions which act on the genotype itself. We refer to this as self-modification. As far as the authors are aware, self-modification is included in only one existing GP system: Lee Spector’s Push GP language (Spector and Robinson, 2002). One of the attractive aspects of introducing primitive self-modification functions is that it is relatively easy to include them in any GP system.

Since 2007, SMCGP has been applied to a variety of computational problems. In the ensuing time the actual details of the SMCGP implementation have changed; however, the key concepts and philosophy have remained the same. Here we present the latest version. We explain the essentials of how SMCGP works in Section 2. Section 3 briefly discusses examples of previous work with SMCGP. In Section 4 we compare and contrast the way other GP systems include iteration with the iterative unrolling that occurs in SMCGP. We end the chapter with conclusions and suggestions for future work.

2. Self Modifying Cartesian Genetic Programming

As the name suggests, SMCGP is based on the Cartesian Genetic Programming technique. In CGP, programs are encoded in a partly connected, feed forward graph. A full description can be found in (Miller and Thomson, 2000). The genotype encodes this graph. Associated with each node in the graph are genes that represent the node function and genes representing connections to either other nodes or terminals. The representation has a number of interesting features. Firstly, not all of the nodes in the genotype need to be connected to the output, so there is a degree of neutrality which has been shown to be very useful (Miller and Thomson, 2000; Vassilev and Miller, 2000; Yu and Miller,

A Survey of Self Modifying CGP

93

2001; Miller and Smith, 2006). Secondly, as the genotype encodes a graph there is reuse of nodes, which makes the representation very compact and also distinct from tree based GP. Although CGP has been used in various ways in developmental systems (Miller, 2004; Miller and Thomson, 2003; Khan et al., 2007), the programs that it produces are not themselves developmental. Instead, these approaches used a fixed length genotype to represent the programs defining the behaviour of cells. SMCGP’s representation is similar to CGP in some ways, but has extensions that allow it to have the self modifying features. SMCGP genotypes are a linear string of nodes. That is to say, only one row of nodes is used (in contrast to CGP which can have a rectangular grid of nodes). In contrast to CGP in which connection genes are absolute addresses, indicating where the data supplied to a node is to be obtained, SMCGP uses relative addressing. Each node obtains its data inputs from its connection genes by counting back from its position in the graph. To prevent cycles, nodes can only connect to previous nodes (on their left). The relative addressing allows section of the graph to be moved, duplicated, deleted etc without breaking constraints of the structure whilst allowing some sort of modularity. In addition to CGP, SMCGP has some extra genes that are used by self-modification functions to identify parts or characteristics of the graph that will be changed. Another change from CGP is the way SMCGP handles inputs and outputs. Terminals are acquired through special functions (called INP, INPP, SKIPINP) and program outputs are taken from a special function called OUTPUT. This is an important change as it enables SMCGP programs to obtain and deliver as many inputs or outputs as required by the problem domain, during program execution. This allows the possibility of evolving general solutions to problems. 
For example, to find a program that can compute even-n parity, where n is arbitrary, one needs to be able to acquire an arbitrary number of inputs or terminals. In summary, each node in the SMCGP graph contains a number of evolvable elements: the function, represented in the genotype as an integer; a list of (relative) connection addresses, again represented as integers; and a set of 3 floating point arguments used by self-modification functions. There are also primitive functions that acquire or deliver inputs and outputs. As with CGP, the number of nodes in the genotype is typically kept constant through an experiment. However, this means care has to be taken to ensure that the genotype is large enough to store the target program.
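The node layout and relative addressing described above can be illustrated with a small sketch. Everything here is hypothetical scaffolding rather than the authors' implementation: the names `Node`, `source_index` and `active_nodes` are invented for illustration, and treating an offset that points past the left edge of the graph as "no source" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    """One node: a function gene, relative connections, three arguments."""
    function: int                      # index into a (hypothetical) function table
    connections: List[int]             # relative offsets, each >= 1, counted back
    args: Tuple[float, float, float]   # read by self-modification functions


def source_index(position: int, offset: int) -> Optional[int]:
    """Resolve a relative connection to an absolute node index.

    Offsets count back from the node's own position, so a node can only
    read from nodes on its left. Assumption: an offset that falls off the
    left edge of the graph yields no source (None).
    """
    src = position - offset
    return src if src >= 0 else None


def active_nodes(graph: List[Node], outputs: List[int]) -> Set[int]:
    """Collect every node reachable from the given output positions."""
    active: Set[int] = set()
    stack = list(outputs)
    while stack:
        pos = stack.pop()
        if pos in active:
            continue
        active.add(pos)
        for offset in graph[pos].connections:
            src = source_index(pos, offset)
            if src is not None:
                stack.append(src)
    return active


# Four nodes; node 3 is taken as the output and reads node 1 twice.
graph = [
    Node(0, [1, 1], (0.0, 0.0, 0.0)),
    Node(1, [1, 1], (0.0, 0.0, 0.0)),
    Node(1, [1, 2], (0.0, 0.0, 0.0)),
    Node(2, [2, 2], (0.0, 0.0, 0.0)),
]
used = active_nodes(graph, outputs=[3])
```

In this example node 2 is never reached from the output, which is the kind of unconnected (neutral) node mentioned earlier. Because every connection is stored as an offset, copying or moving a block of nodes keeps the links inside the block intact without any renumbering.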


Executing a SMCGP Individual

SMCGP individuals are evaluated in a multi-step process, with the evolved program (the phenotype) executed several times. The evolved program in SMCGP initially has the same structure as the genotype, hence the first step is to make a copy of the genotype and call it the phenotype. This graph is to be the ‘working copy’ of the program. Each time the program is executed, the graph is first run and then any self modification operations required are invoked.

The graph is executed in the following manner. First, the node (or nodes) to be used as outputs are identified. This is done by reading through the graph looking at which nodes are of type OUTPUT. Once a sufficient number of these nodes has been found, the various nodes that they connect to are identified. If not enough output nodes are found, then the last n nodes in the graph are used, where n is the number of outputs required. If there are not enough nodes to satisfy this requirement, then the execution is aborted, and the individual is discarded. At this point in the decoding, all the nodes that are actually used by the program have been identified and so their values can be calculated (the other nodes can simply be ignored).

For the mathematical and binary operators, these functions are performed in the usual manner. However, as mentioned before, SMCGP has a number of special functions. Table 6-1 shows an example of some of the functions used in previous work (see section 3). The first special functions are the INP and INPP functions. Each time the INP function is called it returns the next available input (starting with the first, and returning to the first after reading the last input). The INPP function is similar, but moves backwards through the inputs. SKIPINP allows a number of inputs to be ignored, and then returns the next input. These functions help SMCGP to scale to handle increasing numbers of inputs through development.
This also applies to the use of the OUTPUT function, which allows the number of outputs to change over time.

If a function is a self modification function, then it may be activated depending on the following rules. For binary functions, they are always activated. For numeric function nodes, the node is activated if the 1st input is larger than the 2nd input. The self modification operation from an activated node is added to a list of pending operations - the ‘ToDo’ list. The maximum length of the list is a parameter of the system. After execution, the self modification functions on the ToDo list are applied to the current graph. The ToDo list is operated as a FIFO list in which the leftmost activated SM function is the first to be executed (and so on).

The self modification functions require arguments defining which parts of the phenotype the function operates on. These are taken from the arguments of the calling node. Many of the functions expect integer arguments, so the stored floating point values may need to be cast. The arguments may be treated as an address (depending on the function) and, like all SMCGP operations, these are relative addresses. The program can now be iterated again, if necessary.
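The ToDo mechanism can be sketched as a bounded FIFO queue of pending graph edits. This is a minimal illustration under stated assumptions, not the published implementation: node labels stand in for full nodes, `make_del` and `make_dup` are simplified stand-ins for the DEL and DUP operators of Table 6-1, "between" is read as a half-open range, arguments are cast with Python's `int` (truncation towards zero), out-of-range addresses are clamped, and operations activated after the list is full are simply dropped.

```python
from collections import deque
from typing import Callable, Deque, List

Graph = List[str]                     # node labels stand in for full nodes
Operation = Callable[[Graph], Graph]  # one pending self-modification


def numeric_node_activated(first_input: float, second_input: float) -> bool:
    """Activation rule for numeric nodes: 1st input larger than 2nd."""
    return first_input > second_input


def make_del(x: int, p0: float, p1: float) -> Operation:
    """Simplified DEL: remove int(p1) nodes starting at position x + int(p0)."""
    start, count = x + int(p0), int(p1)

    def op(graph: Graph) -> Graph:
        lo = min(max(start, 0), len(graph))
        hi = min(max(start + count, lo), len(graph))
        return graph[:lo] + graph[hi:]

    return op


def make_dup(x: int, p0: float, p1: float, p2: float) -> Operation:
    """Simplified DUP: copy int(p1) nodes from x + int(p0) and insert the
    copy after position x + int(p0) + int(p2)."""
    start, count = x + int(p0), int(p1)
    target = start + int(p2)

    def op(graph: Graph) -> Graph:
        lo = min(max(start, 0), len(graph))
        hi = min(max(start + count, lo), len(graph))
        at = min(max(target, 0), len(graph) - 1)
        return graph[: at + 1] + graph[lo:hi] + graph[at + 1:]

    return op


def build_todo(activated: List[Operation], max_length: int) -> Deque[Operation]:
    """Queue activated operations left to right, up to the ToDo limit."""
    todo: Deque[Operation] = deque()
    for op in activated:
        if len(todo) >= max_length:
            break  # assumption: later activations are ignored
        todo.append(op)
    return todo


def apply_todo(graph: Graph, todo: Deque[Operation]) -> Graph:
    """Apply pending operations in FIFO order to the working graph."""
    while todo:
        graph = todo.popleft()(graph)
    return graph


graph = ["A", "B", "C", "D"]
dup = make_dup(x=3, p0=-3.0, p1=2.0, p2=1.0)  # copy A,B and insert after B
delete = make_del(x=3, p0=-1.0, p1=1.0)       # delete the node at position 2
result = apply_todo(graph, build_todo([dup, delete], max_length=2))
swapped = apply_todo(graph, build_todo([delete, dup], max_length=2))
```

Here `result` is `A B B C D` while `swapped` is `A B A B D`, which shows why the left-to-right FIFO order of the ToDo list matters: each operation acts on the graph left behind by the previous one.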

3.

Summary of Previous Work in SMCGP

Early experiments

There are very few benchmark problems in the developmental system literature. In the first paper on SMCGP (Harding et al., 2007), we identified two possible challenges that had been described previously. The first was to find a program that generates a sequence of squares (i.e. 0, 1, 4, 9, 16, 25, ...) using a restricted set of mathematical operators such as + and −, but not multiplication or power. Without some form of self modification this challenge would be impossible to solve (Spector and Stoffel, 1996). SMCGP was easily able to solve this problem (89% success rate), and a large number of different solutions were found. Typical solutions were similar to the program in Table 6-2, where the program grew in length by adding new terms. During evolution, solutions were only tested up to the first 10 iterations. However, after evolution the solutions were tested for generality by increasing the number of iterations to 50. 66% of the solutions were correct to 50 iterations. Thus SMCGP was able to find general solutions.

The next benchmark problem was the French Flag (FF) problem. Several developmental systems have been tested on generating the FF pattern (Miller, 2003; Miller and Banzhaf, 2003; Miller, 2004), and it is one of the few common problems tackled. In this problem, the task is to evolve a program that can assign the states of cells (represented as colours) into three distinct regions so that the complete set of cells looks like a French Flag. However, the design goals of SMCGP are very different to those the FF task demands. Many developmental systems are built around the idea of multi-cellularity and although they are capable of producing cellular patterns or even concentrations of simulated proteins, they are not explicitly computational in the sense of Genetic Programming. Often researchers have to devise somewhat arbitrary mappings from developmental outputs (i.e. cell states and protein levels) to those required for some computational application. SMCGP is designed to be an explicitly computational developmental system from the outset.

Typically, the FF is produced via a type of cellular automaton (CA), where each ‘alive’ cell contains a copy of an evolved program or set of update rules. We could have taken this approach with SMCGP, but we decided on a more abstract interpretation of the problem. In the CA version, each cell in the CA is analogous to a biological cell. In SMCGP, the biological abstractions


Basic
  Delete (DEL): Delete the nodes between (P0 + x) and (P0 + x + P1).
  Add (ADD): Add P1 new random nodes after (P0 + x).
  Move (MOV): Move the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2).

Duplication
  Overwrite (OVR): Copy the nodes between (P0 + x) and (P0 + x + P1) to position (P0 + x + P2), replacing existing nodes in the target position.
  Duplication (DUP): Copy the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2).
  Duplicate Preserving Connections (DU3): Copy the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2). When copying, this function modifies the c_ij of the copied nodes so that they continue to point to the original nodes.
  Duplicate and scale addresses (DU4): Starting from position (P0 + x) copy (P1) nodes and insert after the node at position (P0 + x + P1). During the copy, c_ij of copied nodes are multiplied by P2.
  Copy To Stop (COPYTOSTOP): Copy from x to the next “COPYTOSTOP” or “STOP” function node, or the end of the graph. Nodes are inserted at the position the operator stops at.
  Stop Marker (STOP): Marks the end of a COPYTOSTOP section.

Connection modification
  Shift Connections (SHIFTCONNECTION): Starting at node index (P0 + x), add P2 to the values of the c_ij of the next P1 nodes.
  Shift Connections 2 (MULTCONNECTION): Starting at node index (P0 + x), multiply the c_ij of the next P1 nodes by P2.
  Change Connection (CHC): Change the (P1 mod 3)th connection of node P0 to P2.

Function modification
  Change Function (CHF): Change the function of node P0 to the function associated with P1.
  Change Parameter (CHP): Change the (P1 mod 3)th parameter of node P0 to P2.

Miscellaneous
  Flush (FLR): Clears the contents of the ToDo list.

Table 6-1. Self modification functions. x represents the absolute position of the node in the graph, where the leftmost node has position 0. PN are evolved parameters stored in each node.


Iteration (i)    Function      Result
0                0+i           0
1                0+i           1
2                0+i+i         4
3                0+i+i+i       9
4                0+i+i+i+i     16
etc.

Table 6-2. Program that generates sequence of squares. The program was found by reverse engineering a SMCGP phenotype. i, the current iteration, is the only input to the program.

are blurred, and the SMCGP phenotype itself could be viewed as a collection of cells. One way of viewing cells in SMCGP is to break the phenotype into ‘modules’ and then define these as the cells. In this way, SMCGP cells duplicate and differentiate using the various modifying functions. In a static program, this concept of cellularity does not exist.

To tackle the FF problem with SMCGP, we defined the target pattern to be a string of integers that could be visually interpreted as a French Flag pattern. In the CA model, the pattern would be taken as the output of the program at each cell. Here, since we can view SMCGP phenotypes as a collection of cells, we took the output pattern as the set of outputs from all the active (connected) nodes in the phenotype graph. The fitness of an individual is the number of elements of the target sequence that it reproduces correctly after a certain number of iterations. As the phenotype can change length when it is iterated, the number of active nodes can change and the length of the output pattern can also change. The output value of an active node depends on the calculation that it (and the nodes before it) performs, so the French Flag pattern is effectively the side effect of some mathematical expression. It was found that this approach was largely successful, but only in generating approximations to the flag. No exact solutions were found, which is similar to the findings of the CA solutions, where exact results are uncommon.

The final task we explored in this paper was generating parity circuits, a challenge we return to in the next section.
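To make the squares benchmark concrete, the growth pattern of Table 6-2 can be mimicked with a few lines of Python. This is only an illustrative sketch: `run_squares` is a hypothetical name, and the growth rule used here (append one more `+ i` term after any iteration where i > 0) is an assumption chosen to reproduce the rows of the table, not the evolved self-modification mechanism itself. Only addition is used, in keeping with the restricted operator set.

```python
from typing import List


def run_squares(iterations: int) -> List[int]:
    """Iterate a growing sum-of-terms program, as in Table 6-2."""
    terms = 1            # the phenotype starts as "0 + i"
    results = []
    for i in range(iterations):
        total = 0
        for _ in range(terms):   # evaluate 0 + i + i + ... using only addition
            total = total + i
        results.append(total)
        if i > 0:                # assumed growth rule: add another "+ i" term
            terms += 1
    return results


first_five = run_squares(5)
```

With this rule the program has one term at iterations 0 and 1, two terms at iteration 2, and so on, so at iteration i it adds i to itself i times and `first_five` matches the Result column of Table 6-2.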

Digital Circuits

Digital circuits have often been studied in genetic programming (Koza, 1994b; Koza, 1992b), and some systems have been used to produce ‘general’ solutions (Huelsbergen, 1998; Wong and Leung, 1996; Wong, 2005). A general solution in this sense is a program that can output a digital circuit for an arbitrary number of inputs; for example, it may generate a parity circuit of any size (see footnote 1). Conveniently, many digital circuits are modular and hierarchical - and this fits the model of development that SMCGP implements.

In our first paper, we successfully produced parity circuits up to 8 inputs (Harding et al., 2007). We stopped at this size because, at the time, this was the maximum size we could find conventional CGP solutions for. In a subsequent paper (Harding et al., 2009a), we revisited the problem (using the latest version of SMCGP), and found that not only could we evolve larger parity circuits, but we could rapidly and consistently evolve provably general parity circuits. We used an incremental fitness function to find programs that on the first iteration would solve 2 input parity, then 3 input parity on the next iteration, and continue up to a maximum number of inputs. The fitness of an individual is the number of correct output bits over all iterations. To keep the computational costs down, we limited the evolution to 2 to 20 inputs, and then tested the final programs for generality by running up to 24 bits of input. We also stopped iterating a program as soon as it failed to produce all the output bits correctly for the current truth table, so a failure on any iteration cancels the rest of the evaluation. Not only did this reduce the computation time, but we hoped it would also help with producing generalized solutions. Our function set consisted of all the two-input Boolean functions and the self modifying functions.

In 251 evolutionary runs, the average numbers of evaluations required to successfully solve the parity problems (number of inputs in parentheses) were as follows: 1,429 (2), 4,013 (3), 43,817 (6), 82,936 (8), 107,586 (10), 110,216 (17). Here we have given an incomplete list that just illustrates the trend in problem difficulty. We found that the number of evaluations stabilizes when the number of inputs is about 10.
This is because once evolution has solved the problem up to a given number of inputs, the solutions typically become generalized. We found that by the time that evolution had solved 5 inputs, more than half the solutions were generalizable up to 20 inputs, and by 10 inputs this was up to 90%. The percentage of runs that correctly computed even-parity 22 to 24 was approximately 96%. However, without analysis of the programs it was difficult to know whether they were truly general solutions. The evolved programs can be relatively compact, especially when we place constraints on the initial size, the number of self modification operations allowed on the ToDo list and the overall length of the program. Figure 6-1 shows an example of an evolved program that generates parity circuits, and which we were able to prove is a general solution to even-parity.
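The incremental fitness function with early termination described above might look roughly like the following. It is a sketch under assumptions: `develop` is a hypothetical callable standing in for "iterate the phenotype until it expects n inputs and return the resulting circuit", and the bits of a failed table are still counted before evaluation stops, which is one reading of the text.

```python
from itertools import product
from typing import Callable, Sequence

Circuit = Callable[[Sequence[int]], int]


def even_parity(bits: Sequence[int]) -> int:
    """Return 1 if an even number of the inputs are 1, else 0."""
    return 1 if sum(bits) % 2 == 0 else 0


def incremental_parity_fitness(
    develop: Callable[[int], Circuit],
    min_inputs: int = 2,
    max_inputs: int = 20,
) -> int:
    """Count correct output bits table by table, stopping at the first
    table that is not reproduced perfectly."""
    fitness = 0
    for n in range(min_inputs, max_inputs + 1):
        circuit = develop(n)
        correct = sum(
            1
            for bits in product((0, 1), repeat=n)
            if circuit(bits) == even_parity(bits)
        )
        fitness += correct
        if correct < 2 ** n:
            break  # cancel the rest of the evaluation
    return fitness


# A perfect (general) candidate and one that only handles 2 inputs.
perfect = lambda n: even_parity
only_two = lambda n: even_parity if n == 2 else (lambda bits: 1)

perfect_score = incremental_parity_fitness(perfect, max_inputs=6)
partial_score = incremental_parity_fitness(only_two, max_inputs=6)
```

For the perfect candidate the score is 4 + 8 + 16 + 32 + 64 = 124. The partial candidate scores 4 on the 2-input table and 4 more on the 3-input table (a constant 1 is right for the four even-parity rows) and is then abandoned, so larger tables cost nothing to evaluate.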

1 An even parity circuit takes a set of binary inputs and outputs true if an even number of the inputs are true, and false otherwise.


Figure 6-1. An example of the development of a parity circuit. Each line shows the phenotype graph at a given time step. The first graph solves the 2-input parity, the second solves 3-input and continues to 7-bits. The graph has been tested to generalise through to 24 inputs. This pattern of growth is typical of the programs investigated.

In recent work (to be published in (Harding et al., 2010a)) we have also shown general solutions for the digital adder circuit. A digital adder circuit of size n adds two binary n bit numbers together. This problem is much more complicated than parity, as the number of inputs scales twice as fast (i.e. it has to produce 1 bit+1 bit, 2+2, 3+3) and the number of outputs also grows with the number of inputs.

Mathematical problems

SMCGP has been applied to a variety of mathematical problems (Harding et al., 2009c; Harding et al., 2010b). For the Fibonacci sequence, the fitness function is the number of correctly calculated Fibonacci numbers in a sequence of 50. The first two Fibonacci numbers are given as fixed inputs (these were 0 and 1). Thus the phenotypes are iterated 48 times. Evolved solutions were tested for generality by iterating up to 72 times (after which the numbers exceed the long int). A success rate of 87.4% was achieved on 287 runs and 94.5% of these correctly calculated the succeeding 24 Fibonacci numbers. We found that the average number of evaluations of 774,808 compared favourably with previously published methods and that the generalization rate was higher.

In the “list summation problem” we evolved programs that could sum an arbitrarily long list of numbers. At the n-th iteration, the evolved program should be able to take n inputs and compute the sum of all the inputs. We devised this problem because we thought it would be difficult for genetic programming without the addition of an explicit summation command. Koza used a summation operator called SIGMA that repeatedly evaluates its sole input until a predefined termination condition is realised (Koza, 1992a). Input vectors consisted of random sequences of integers. The fitness is defined as the absolute cumulative error between the output of the program and the expected sum of the values. We evolved programs which were evaluated on input sequences of 2 to 10 numbers. The function set consisted of the self modifying functions and just the ADD operator. All 500 experiments were found to be successful, in that they evolved programs that could sum between 2 and 10 numbers (depending on the number of times the program is iterated). On average it took 6,922 evaluations to solve this problem. After evolution, the best individual for each run was tested to see how well it generalized. This test involved summing a sequence of 100 numbers. It was found that 99.03% of solutions generalized. When conventional CGP was used it could only sum up to 7 numbers.

We also studied how SMCGP performed on a “Powers Regression” problem. The task is to evolve a program that, depending on the iteration, approximates the expression x^n, where n is the iteration number. The fitness function applies x as integers from 0 to 20. The fitness is defined as the number of wrong outputs (i.e. lower is better). Programs were evolved to n = 10 and then tested for generality up to n = 20. As with many of the other experiments, the program is evolved with an incremental fitness function. We obtained 100% correct solutions (in 337 runs). The average number of evaluations was 869,699.

More recently we have looked at whether SMCGP could produce algorithms that can compute mathematical constants, like π and e, to arbitrary precision (Harding et al., 2010b). We were able to prove that two of the evolved formulae (one for π and one for e) rapidly converged to the constants in the limit of large iterations.
We consider this work to be significant as evolving provable mathematical results is a rarity in evolutionary computation. The fitness function was designed to produce a program where subsequent iterations of the program produced more accurate approximations to π or e. Programs were allowed to iterate for a maximum of 10 iterations. If the output after an iteration did not better approximate π, evaluation was stopped and a large fitness penalty applied. Note that it is possible that after the 10 iterations the output value diverges from the constant and the quality of the result would therefore worsen. We analyzed one of the solutions that accurately converges to π. It had the generating function:

\[
f(i) =
\begin{cases}
\cos(\sin(\cos(\sin(0)))) & i = 0 \\
f(i-1) + \sin(f(i-1)) & i > 0
\end{cases}
\tag{6.1}
\]
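As a quick numerical check of Equation 6.1, the recurrence can be iterated directly. The function name `pi_sequence` is a hypothetical choice for this sketch; the arithmetic is the recurrence exactly as stated, evaluated in double precision.

```python
import math
from typing import List


def pi_sequence(iterations: int) -> List[float]:
    """Return f(0), f(1), ..., f(iterations) for the recurrence in Eq. 6.1."""
    values = [math.cos(math.sin(math.cos(math.sin(0.0))))]  # f(0)
    for _ in range(iterations):
        prev = values[-1]
        values.append(prev + math.sin(prev))  # f(i) = f(i-1) + sin(f(i-1))
    return values


values = pi_sequence(10)
errors = [abs(math.pi - v) for v in values]
```

The fast convergence has a short explanation. Writing f(i-1) = π − δ gives sin(f(i-1)) = sin(δ), so f(i) = π − (δ − sin δ), and for small δ the new error is approximately δ³/6. Each iteration therefore roughly cubes the error, which is why only a handful of steps are needed before the limit of double precision is reached.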


Equation 6.1 is a nonlinear recurrence relation and it can be proven formally that it is an exact solution in that it rapidly approaches π in the limit of large i. Using the same fitness function as with π, evolving solutions for e was found to be significantly harder. In our experiments we chose the initial genotype to have 20 nodes and the ToDo list length to be 2. This meant that only two SM functions were used in each phenotype. We allowed the iteration number it as the sole program input. Defining $x = 4^{it}$ and $y = 4x = 4^{it+1}$, we evolved the solution for the output, z, as

\[
z = \left(1 + \frac{1}{y}\right)^{y} \sqrt{1 + \frac{1}{y}}
\tag{6.2}
\]

Eqn 6.2 tends to the form of a well-known Bernoulli formula:

\[
\lim_{y \to \infty} \left(1 + \frac{1}{y}\right)^{y}
\tag{6.3}
\]
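The behaviour of the evolved formula for e can be checked numerically. The sketch below assumes that Eqn 6.2 reads z = (1 + 1/y)^y · sqrt(1 + 1/y) with y = 4^(it+1); this reading is a reconstruction from the garbled source layout, and the function names are hypothetical. The plain Bernoulli expression of Eqn 6.3, evaluated at the same y, is included for comparison.

```python
import math


def evolved_e(it: int) -> float:
    """Assumed form of Eqn 6.2: (1 + 1/y)^y * sqrt(1 + 1/y), y = 4^(it+1)."""
    y = 4.0 ** (it + 1)
    return (1.0 + 1.0 / y) ** y * math.sqrt(1.0 + 1.0 / y)


def bernoulli_e(it: int) -> float:
    """Plain Bernoulli approximation (1 + 1/y)^y at the same y."""
    y = 4.0 ** (it + 1)
    return (1.0 + 1.0 / y) ** y


evolved_errors = [abs(evolved_e(it) - math.e) for it in range(10)]
bernoulli_errors = [abs(bernoulli_e(it) - math.e) for it in range(10)]
```

Under this reading the square-root factor is what speeds up convergence. Expanding the logarithm gives

\[
\ln z = \left(y + \tfrac{1}{2}\right)\ln\!\left(1 + \frac{1}{y}\right) = 1 + \frac{1}{12y^{2}} + O\!\left(y^{-3}\right),
\]

so the error of z shrinks like 1/y², whereas the plain Bernoulli form has \(\ln (1 + 1/y)^{y} = 1 - \frac{1}{2y} + O(y^{-2})\) and an error that shrinks only like 1/y.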

Evolving to Learn

In nature, we are used to the idea that plasticity (e.g., in the brain) can be used to learn during the lifetime of an organism. In the brain, the ‘self-modification rules’ are ultimately encoded in the genome. In (Harding et al., 2009b), we set out to use SMCGP to evolve a learning algorithm that could act on itself. The basic question was whether SMCGP can evolve a program that can learn, during the development phase, how to perform a given task. We chose the task of getting the same phenotype to learn all possible 2-input Boolean truth tables. We took 16 copies of the same phenotype, and then tried to train each copy on a different truth table, with the fitness being how well the programs (after the learning phase) did at calculating the correct value based on a pair of inputs.

In SMCGP, the activation of a self modifying node is dependent on the values that it reads as inputs. Combined with the various mathematical operators, this allows the phenotype to develop differently in the presence of different sets of inputs. To support the mathematical operators, the Boolean tables were represented (and interpreted) as numbers, with -1.0 being false and +1.0 being true.

Figure 6-2 illustrates the process. The evolved genotype (a) is copied into phenotype space (b) where it can be executed. The phenotype is allowed to develop for a number of iterations (c). The number of iterations is defined by a special gene in the genotype. Copies of the developed phenotype are made (d) and each copy is assigned a different truth table to learn. The test set data is applied (e) as described in the following section. After learning (f) the phenotype can now be tested, and its fitness found. During (f), the individual is treated as a static individual - and is no longer allowed to modify itself. This fixed program is then tested for accuracy, and its fitness used as a component in the final fitness score of that individual.

Figure 6-2. Fitness function flow chart, as described in section 3.

During the fitness evaluation stage, each row of the truth table is presented to a copy of the evolved phenotype (Figure 6-2.e). During this presentation, the error between the expected and actual output is fed back into the SMCGP program, in order to provide some sort of feedback. Full details of how this was implemented can be found in (Harding et al., 2009b). During fitness calculation, we tested all 16 tables. However, we split the tables into two sets, one for deriving the fitness score (12 tables) and the other for a validation score (4 tables). It was found that 16% of experimental runs were successfully able to produce programs that correctly learned the 12 tables. None of the evolved programs was able to generalize to learn all the unseen truth tables. However, the system did come close with the best result having only 2 errors (out of a possible 16). Figure 6-3 shows the form of the final phenotypes for the programs for each of the fitness truth tables. We can see both modularity and a high degree of variation - with the graphs for each table looking quite different from one another. This is in contrast to previous examples, such as the parity circuits, where we generally only see regular forms.
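The data side of this experiment is easy to sketch: the 16 two-input truth tables, encoded with -1.0 for false and +1.0 for true, split into 12 fitness tables and 4 validation tables. The names below are hypothetical, and the particular split (first 12 versus last 4 in enumeration order) is an arbitrary assumption, since the source does not say which tables were held out. The error-feedback mechanism used during learning is not modelled here.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

FALSE, TRUE = -1.0, 1.0
Row = Tuple[float, float]
TruthTable = Dict[Row, float]


def all_two_input_tables() -> List[TruthTable]:
    """Enumerate every mapping from the 4 input pairs to an output."""
    rows = list(product((FALSE, TRUE), repeat=2))
    return [dict(zip(rows, outputs))
            for outputs in product((FALSE, TRUE), repeat=4)]


def count_errors(program: Callable[[float, float], float],
                 table: TruthTable) -> int:
    """Number of rows where a static (no longer self-modifying) program
    disagrees with the table."""
    return sum(1 for (a, b), expected in table.items()
               if program(a, b) != expected)


tables = all_two_input_tables()
fitness_tables, validation_tables = tables[:12], tables[12:]  # assumed split

# With this encoding, min() behaves as AND and max() as OR.
and_table = {(a, b): min(a, b) for a, b in product((FALSE, TRUE), repeat=2)}
and_errors = count_errors(min, and_table)
or_errors = count_errors(max, and_table)
```

Four validation tables of four rows each give 16 output bits in total, which is consistent with the "2 errors (out of a possible 16)" reported for the best result.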

4.

Iteration in SMCGP and GP

Figure 6-3. Phenotypes for each of the tables learned during evolution.

One of the unique properties of SMCGP is how it handles iteration. Iteration is not new in genetic programming and there are several different forms. The most obvious form of GP with iteration is Linear Genetic Programming (LGP), where evolved programs can execute inside a kind of virtual machine in which the program counter can be modified using jump operations. LGP operates on registers (as in a CPU), and uses this memory to store state between iterations of the same section of program. It is also worth noting that in LGP sub-sections of code are executed repeatedly. This is different from most implementations of tree-based GP (and we restrict our discussion to the simple, common varieties found in the literature), as the tree represents an expression, and so any iteration has to be applied externally. Tree-based GP also typically does not have a concept of working registers to store state between iterations, so these must be added to the function set, or previous state information passed back via the tree’s inputs. Tree-based GP normally only has one output, and no intermediate state information. So additional mechanisms would be required to select what information to store and pass to subsequent iterations. In LGP termination can be controlled by the evolved program itself, whereas with external iteration another mechanism needs to be defined - perhaps by enforcing a limit to the number of iterations or some form of conditional.

SMCGP handles its iteration in a very different manner. It can be viewed as something analogous to loop-unrolling in a compiler, whereby the contents of the loop are explicitly rewritten a number of times. In SMCGP, the duplication operator unrolls the phenotype. State information is passed between iterations by the connections made in the duplicated blocks. In compilers, unrolling is done for program efficiency and is typically only done for small loops. In SMCGP, if the unrolling is excessive it will exceed the maximum permitted phenotype length. We speculate that this may help to evolve more efficient modularization. Because the activation of self modifying functions is determined by both the size of the ToDo list and the inputs to self modifying nodes, it is possible for SMCGP to self-limit when sections of code should be unrolled.

SMCGP’s unrolling also has the possibility to grow exponentially, which forms a different kind of loop. For example, imagine a duplication operator that copied every node to its left and inserted it before itself: e.g. NODE0 NODE1 DUPLICATE. On the next iteration it would produce NODE0 NODE1 NODE0 NODE1 DUPLICATE, then NODE0 NODE1 NODE0 NODE1 NODE0 NODE1 NODE0 NODE1 DUPLICATE and so on. Hence the program length almost doubles at each iteration. Similarly, the arguments for the duplication operation may only replicate part of the previously inserted module, so the phenotype would grow at a different, smaller rate each time. Other growth progressions are also possible, especially when several duplication-style operators are at work on the same section of phenotype. This makes the iteration capabilities of SMCGP very rich and implies that it can also do a form of recursion unrolling - removing the need for explicit procedures in a similar way to the lack of need for loop instructions.
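The exponential growth example can be reproduced with a toy duplication operator. This is a sketch of the thought experiment in the text only, not one of the operators from Table 6-1: `duplicate_left` is a hypothetical helper that copies every node to the left of the single DUPLICATE node and inserts the copy immediately before it.

```python
from typing import List


def duplicate_left(phenotype: List[str]) -> List[str]:
    """Copy all nodes left of DUPLICATE and insert them before it."""
    idx = phenotype.index("DUPLICATE")
    left = phenotype[:idx]
    return left + left + phenotype[idx:]


phenotype = ["NODE0", "NODE1", "DUPLICATE"]
lengths = [len(phenotype)]
history = [phenotype]
for _ in range(3):
    phenotype = duplicate_left(phenotype)
    history.append(phenotype)
    lengths.append(len(phenotype))
```

The lengths follow L → 2L − 1 (3, 5, 9, 17), which is the "almost doubles" progression described above; a maximum permitted phenotype length would cut this growth off after a few iterations.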

5.

Conclusions and Further Work

Self modification in Genetic Programming seems to be a useful property. With SMCGP we have shown that the implementation of such a system can be relatively straightforward, and that very good results can be achieved. In upcoming work, we will be demonstrating SMCGP on several other problems including generalized digital adders and a structural design problem. Here we have discussed problems that require some sort of developmental process, as the problems require a scaling ability. One benefit of SMCGP is that if the problem does not need self modification, evolution can stop using it. When this happens, the representation reverts to something similar to classical CGP. In (Harding et al., 2009c), we showed that on a bio-informatics classification problem where there should be no benefit in using self modification, SMCGP behaved similarly to CGP. This result lets us be confident that in future work we can by default use SMCGP and automatically gain any advantages that development might bring.

The SMCGP representation has changed over time, whilst maintaining the same design philosophy. In future work we will consider other variants as well. Currently we are investigating ways to simplify the genotype to make it easier for humans to understand. This should allow us to prove general cases more easily, and perhaps explain how processes like the evolved learning algorithm function. A whole world of self modifying systems seems to have become available now that the principle has been shown to work successfully. We plan to investigate this world further and also encourage others to consider self modification in their systems.

6.

Acknowledgments

Funding from NSERC under discovery grant RGPIN 283304-07 to W.B. is gratefully acknowledged. S.H. was supported by an ACENET fellowship.


References

Banzhaf, W. and Miller, J. F. (2004). The Challenge of Complexity. Kluwer Academic.
Harding, S., Miller, J. F., and Banzhaf, W. (2009a). Self modifying cartesian genetic programming: Parity. In Tyrrell, Andy, editor, 2009 IEEE Congress on Evolutionary Computation, pages 285–292, Trondheim, Norway. IEEE Computational Intelligence Society, IEEE Press.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2009b). Evolution, development and learning with self modifying cartesian genetic programming. In GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 699–706, New York, NY, USA. ACM.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2010a). Developments in cartesian genetic programming: Self-modifying cgp. To be published in Genetic Programming and Evolvable Machines.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2010b). Self modifying cartesian genetic programming: Finding algorithms that calculate pi and e to arbitrary precision. In Genetic and Evolutionary Computation Conference, GECCO 2010. Accepted for publication.
Harding, Simon, Miller, Julian Francis, and Banzhaf, Wolfgang (2009c). Self modifying cartesian genetic programming: Fibonacci, squares, regression and summing. In Vanneschi, Leonardo, Gustafson, Steven, et al., editors, Genetic Programming, 12th European Conference, EuroGP 2009, Tübingen, Germany, April 15-17, 2009, Proceedings, volume 5481 of Lecture Notes in Computer Science, pages 133–144. Springer.
Harding, Simon L., Miller, Julian F., and Banzhaf, Wolfgang (2007). Self-modifying cartesian genetic programming. In Thierens, Dirk, Beyer, Hans-Georg, Bongard, Josh, Branke, Jurgen, Clark, John Andrew, Cliff, Dave, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Kovacs, Tim, Kumar, Sanjeev, Miller, Julian F., Moore, Jason, Neumann, Frank, Pelikan, Martin, Poli, Riccardo, Sastry, Kumara, Stanley, Kenneth Owen, Stutzle, Thomas, Watson, Richard A, and Wegener, Ingo, editors, GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, volume 1, pages 1021–1028, London. ACM Press.
Huelsbergen, Lorenz (1998). Finding general solutions to the parity problem by evolving machine-language representations. In Koza, John R., Banzhaf, Wolfgang, Chellapilla, Kumar, Deb, Kalyanmoy, Dorigo, Marco, Fogel, David B., Garzon, Max H., Goldberg, David E., Iba, Hitoshi, and Riolo, Rick, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 158–166, University of Wisconsin, Madison, Wisconsin, USA. Morgan Kaufmann.


Khan, G.M., Miller, J.F., and Halliday, D.M. (2007). Coevolution of intelligent agents using cartesian genetic programming. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 269–276.
Koza, J. R. (1994a). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press.
Koza, John R. (1992a). A genetic approach to the truck backer upper problem and the inter-twined spiral problem. In Proceedings of IJCNN International Joint Conference on Neural Networks, volume IV, pages 310–318. IEEE Press.
Koza, John R. (1994b). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge, Massachusetts.
Koza, J.R. (1992b). Genetic Programming: On the Programming of Computers by Natural Selection. MIT Press, Cambridge, Massachusetts, USA.
Miller, J. F. and Smith, S. L. (2006). Redundancy and computational efficiency in cartesian genetic programming. In IEEE Transactions on Evolutionary Computation, volume 10, pages 167–174.
Miller, Julian F. (2003). Evolving developmental programs for adaptation, morphogenesis, and self-repair. In Banzhaf, Wolfgang, Christaller, Thomas, Dittrich, Peter, Kim, Jan T., and Ziegler, Jens, editors, Advances in Artificial Life. 7th European Conference on Artificial Life, volume 2801 of Lecture Notes in Artificial Intelligence, pages 256–265, Dortmund, Germany. Springer.
Miller, Julian F. and Banzhaf, Wolfgang (2003). Evolving the program for a cell: from french flags to boolean circuits. In Kumar, Sanjeev and Bentley, Peter J., editors, On Growth, Form and Computers. Academic Press.
Miller, Julian F. and Thomson, Peter (2000). Cartesian genetic programming. In Poli, Riccardo, Banzhaf, Wolfgang, Langdon, William B., Miller, Julian F., Nordin, Peter, and Fogarty, Terence C., editors, Genetic Programming, Proceedings of EuroGP’2000, volume 1802 of LNCS, pages 121–132, Edinburgh. Springer-Verlag.
Miller, Julian F. and Thomson, Peter (2003). A developmental method for growing graphs and circuits. In Proceedings of the 5th International Conference on Evolvable Systems: From Biology to Hardware, volume 2606 of Lecture Notes in Computer Science, pages 93–104. Springer.
Miller, Julian Francis (2004). Evolving a self-repairing, self-regulating, french flag organism. In Deb, Kalyanmoy, Poli, Riccardo, Banzhaf, Wolfgang, Beyer, Hans-Georg, Burke, Edmund K., Darwen, Paul J., Dasgupta, Dipankar, Floreano, Dario, Foster, James A., Harman, Mark, Holland, Owen, Lanzi, Pier Luca, Spector, Lee, Tettamanzi, Andrea, Thierens, Dirk, and Tyrrell, Andrew M., editors, GECCO (1), volume 3102 of Lecture Notes in Computer Science, pages 129–139. Springer.

A Survey of Self Modifying CGP

107

Spector, L. and Robinson, A. (2002). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3:7–40. Spector, Lee and Stoffel, Kilian (1996). Ontogenetic programming. In Koza, John R., Goldberg, David E., Fogel, David B., and Riolo, Rick L., editors, Genetic Programming 1996: Proceedings of the First Annual Conference, pages 394–399, Stanford University, CA, USA. MIT Press. Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127. Vassilev, Vesselin K. and Miller, Julian F. (2000). The advantages of landscape neutrality in digital circuit evolution. In Proceedings of the Third International Conference on Evolvable Systems, pages 252–263. Springer-Verlag. Wong, Man Leung (2005). Evolving recursive programs by using adaptive grammar based genetic programming. Genetic Programming and Evolvable Machines, 6(4):421–455. Wong, Man Leung and Leung, Kwong Sak (1996). Evolving recursive functions for the even-parity problem using genetic programming. In Angeline, Peter J. and Kinnear, Jr., K. E., editors, Advances in Genetic Programming 2, chapter 11, pages 221–240. MIT Press, Cambridge, MA, USA. Yu, Tina and Miller, Julian (2001). Neutrality and the evolvability of boolean function landscape. In Miller, Julian F., Tomassini, Marco, Lanzi, Pier Luca, Ryan, Conor, Tettamanzi, Andrea G. B., and Langdon, William B., editors, Genetic Programming, Proceedings of EuroGP’2001, volume 2038 of LNCS, pages 204–217, Lake Como, Italy. Springer-Verlag.

Chapter 7

ABSTRACT EXPRESSION GRAMMAR SYMBOLIC REGRESSION

Michael F. Korns¹

1 Korns Associates, 1 Plum Hollow, Henderson, Nevada 89052 USA.

Abstract

This chapter examines the use of Abstract Expression Grammars to perform the entire Symbolic Regression process without the use of Genetic Programming per se. The techniques explored produce a symbolic regression engine which has absolutely no bloat, which allows total user control of the search space and output formulas, and which is faster and more accurate than the engines produced in our previous papers using Genetic Programming. The genome is an all vector structure with four chromosomes plus additional epigenetic and constraint vectors, allowing total user control of the search space and the final output formulas. A combination of specialized compiler techniques, genetic algorithms, particle swarm, age-layered populations, plus discrete and continuous differential evolution is used to produce an improved symbolic regression system. Nine base test cases from the literature are used to test the improvement in speed and accuracy. The improved results indicate that these techniques move us a big step closer toward future industrial strength symbolic regression systems.

Keywords:

abstract expression grammars, differential evolution, grammar template genetic programming, genetic algorithms, particle swarm, symbolic regression.

1.

Introduction

This chapter examines techniques for improving symbolic regression systems with the aim of achieving entry-level industrial strength. In previous papers (Korns, 2006; Korns, 2007; Korns and Nunez, 2008; Korns, 2009), our pursuit of industrial scale performance on large-scale symbolic regression problems required us to reexamine many commonly held beliefs, to borrow a number of techniques from disparate schools of genetic programming, and to recombine them in ways not normally seen in the published literature. The techniques of abstract expression grammars were developed, but explored only tangentially.

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_7, © Springer Science+Business Media, LLC 2011

While the techniques described in detail in (Korns, 2009) produce a symbolic regression system of breadth and strength, lack of user control of the search space, bloated unreadable output formulas, accuracy, and slow convergence speed are all issues keeping an industrial strength symbolic regression system tantalizingly out of reach. In this chapter abstract expression grammars become the main focus and are promoted as the sole means of performing symbolic regression.

Using the nine base test cases from (Korns, 2007) as a training set, to test for improvements in accuracy, we constructed our symbolic regression system using these important techniques:

• Abstract expression grammars
• Universal abstract goal expression
• Standard single point vector-based mutation
• Standard two point vector-based cross over
• Continuous vector differential evolution
• Discrete vector differential evolution
• Continuous particle swarm evolution
• Pessimal vertical slicing and out-of-sample scoring during training
• Age-layered populations
• User defined epigenetic factors
• User defined constraints

For purposes of comparison, all results in this paper were achieved on two workstation computers, specifically an Intel® Core™ 2 Duo Processor T7200 (2.00GHz/667MHz/4MB) and a Dual-Core AMD Opteron™ Processor 8214 (2.21GHz), running our Analytic Information Server software. The software generates Lisp agents that compile to use the on-board Intel registers and on-chip vector processing capabilities so as to maximize execution speed; details can be found at www.korns.com/Document Lisp Language Guide.html. Furthermore, our Analytic Information Server is available in an open source software project at aiserver.sourceforge.net.

Testing Regimen and Fitness Measure

Our testing regimen uses only out-of-sample testing techniques, in keeping with statistical best practice. We test each of the nine test cases on matrices of 10000 rows (samples) by 5 columns (inputs) with no noise, and on matrices of 10000 rows by 20 columns with 40% noise, before drawing any conclusions. Taken together, these combinations create a total of 18 separate test cases. For each test a training matrix is filled with random numbers between -50 and +50. The target expression for the test case is applied to the training matrix to compute the dependent variable and the required noise is added. The symbolic regression system is trained on the training matrix to produce the regression champion. Following training, a testing matrix is filled with random numbers between -50 and +50. The target expression for the test case is applied to the testing matrix to compute the dependent variable and the required noise is added. The regression champion is evaluated on the testing matrix for all scoring (i.e. out of sample testing).

Table 7-1. Results for 10K rows by 5 columns with no random noise.

Test     Minutes  Train-NLSE  Train-TCE  Test-NLSE  Test-TCE
linear         1        0.00       0.00       0.00      0.00
cubic          1        0.00       0.00       0.00      0.00
cross        145        0.00       0.00       0.00      0.00
elipse         1        0.00       0.00       0.00      0.00
hidden         3        0.00       0.00       0.00      0.00
cyclic         1        0.02       0.00       0.00      0.00
hyper         65        0.17       0.00       0.17      0.00
mixed        233        0.94       0.32       0.95      0.32
ratio        229        0.94       0.33       0.94      0.32

Our two fitness measures are described in detail in (Korns, 2009). The first is a standard least squared error (LSE), normalized by dividing it by the standard deviation of Y (the dependent variable). This normalization allows us to meaningfully compare the normalized least squared error (NLSE) between different problems. The second is a fitness measure known as tail classification error (TCE), which measures how well the regression champion classifies the top 10% and bottom 10% of the data set. A TCE score of less than 0.20 is excellent, a TCE score of less than 0.30 is good, and a TCE of 0.30 or greater is poor.
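The testing regimen and the NLSE measure can be illustrated with a minimal sketch. It assumes that "least squared error" means the root mean squared error, which is then divided by the population standard deviation of Y; the exact formula is given in (Korns, 2009) and may differ. The target expression, the matrix size, and the function names are hypothetical, and noise and TCE are left out because their definitions are not stated here.

```python
import math
import random
import statistics

def make_matrix(rows, cols, rng):
    """Fill a rows-by-cols matrix with uniform random numbers in [-50, +50]."""
    return [[rng.uniform(-50.0, 50.0) for _ in range(cols)] for _ in range(rows)]

def nlse(y, y_hat):
    """Assumed NLSE: root mean squared error divided by the std deviation of y."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return math.sqrt(mse) / statistics.pstdev(y)

def target(row):
    # Hypothetical noise-free target expression, used only for this sketch.
    return 1.57 + 2.0 * row[0] - 3.0 * row[1]

rng = random.Random(42)
# A champion would be trained on (X_train, Y_train) and scored only on the
# separately generated testing matrix.
X_train = make_matrix(1000, 5, rng)
Y_train = [target(r) for r in X_train]
X_test = make_matrix(1000, 5, rng)
Y_test = [target(r) for r in X_test]

perfect = [target(r) for r in X_test]
baseline = [statistics.fmean(Y_test)] * len(Y_test)
print(nlse(Y_test, perfect))   # a perfect champion scores 0.0
print(nlse(Y_test, baseline))  # predicting the mean of Y scores about 1.0
```

Under this assumed definition an NLSE near 1.0 means the champion does no better than predicting the mean of Y, which is why scores near 0.94 in the tables are read as weak regression results.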

2.

Previous Results on Nine Base Problems

The previously published results (Korns, 2009) of training on the nine base training models on 10,000 rows and five columns with no random noise and only 20 generations allowed are shown in Table 7-1¹. In general, training time is very reasonable given the difficulty of some of the problems and the limited number of training generations allowed. Average percent error performance varies from excellent to poor, with the linear and cubic problems showing the best performance. Minimal differences between training error and testing error in the mixed and ratio problems suggest no over-fitting.

¹ The nine base test cases are described in detail in (Korns, 2007).

Table 7-2. Results for 10K rows by 20 columns with 40% random noise.

Test     Minutes  Train-NLSE  Train-TCE  Test-NLSE  Test-TCE
linear        82        0.11       0.00       0.11      0.00
cubic         59        0.11       0.00       0.11      0.00
cross        127        0.87       0.25       0.93      0.32
elipse       162        0.42       0.04       0.43      0.04
hidden       210        0.11       0.02       0.11      0.02
cyclic       233        0.39       0.11       0.35      0.12
hyper        163        0.48       0.06       0.50      0.07
mixed        206        0.90       0.27       0.94      0.32
ratio        224        0.90       0.26       0.95      0.33

Surprisingly, long and short classification is fairly robust in most cases, including the very difficult ratio and mixed test cases. The salient observation is the relative ease of classification compared to regression, even in problems with this much noise. In some of the test cases, testing NLSE is either close to or exceeds the standard deviation of Y (not very good); however, in many of the test cases classification is below 0.20 (very good).

The previously published results (Korns, 2009) of training on the nine base training models on 10,000 rows and twenty columns with 40% random noise and only 20 generations allowed are shown in Table 7-2. Clearly the previous symbolic regression system performs most poorly on the test cases mixed and ratio, which have conditional target expressions. There is no evidence of over-fitting, as shown by the minimal differences between training error and testing error. In addition, the testing TCE is relatively good in both the mixed and ratio test cases.

Taken together, these scores portray a symbolic regression system which is ready to handle some industrial strength problems except for a few serious issues. The output formulas are often so bloated with intron expressions that they are practically unreadable by humans. This seriously limits the acceptance of the symbolic regression system for many industrial applications. There is no user control of the search space, which makes the system impractical for most specialty applications. And of course we would love to see additional speed and accuracy improvements, because industry is insatiable on those features.

A new architecture which will completely eliminate bloat, allow total user control over the search space and the final output formulas, improve our regression scores on the two conditional base test cases, and deliver an increase in learning speed is the subject of the remainder of this chapter.

3.

New System Architecture

Our new symbolic regression system architecture is based entirely upon an Abstract Expression Grammar foundation. A single abstract expression, called the goal expression, defines the search space during each symbolic regression run. The objective of a symbolic regression run is to optimize the goal expression. An example of a goal expression is:

y = f0(c0*x5)+(f1(c1)/(v0+3.14))

As described in detail in (Korns, 2009), the expression elements f0, f1, *, +, and / are abstract and concrete functions (operators). The elements v0 and x5 are abstract and concrete features. The elements c0, c1, and 3.14 are abstract and concrete real constants. Since the goal expression is abstract, there are many possible concrete solutions.

y = f0(c0*x5)+(f1(c1)/(v0+3.14)) (...to be solved...)
y = sin(-1.45*x5)+(log(22.56)/(x4+3.14)) (...possible solution...)
y = exp(38.16*x5)+(tan(-8.41)/(x0+3.14)) (...possible solution...)
y = square(-0.16*x5)+(cos(317.1)/(x9+3.14)) (...possible solution...)

The objective of symbolic regression is to find an optimal concrete solution to the abstract goal expression. In our architecture, each individual solution to the goal expression is implemented as a set of vectors containing the solution values for each abstract function, feature, and constant present in the goal expression. This allows the system to be based upon an all vector genome, which is convenient for genetic algorithm, particle swarm, and differential evolution styled population operators. In addition to the regular vector chromosomes providing solutions to the goal expression, epigenetic wrappers and constraint vectors provide an important degree of control over the search process and will be discussed in detail later in this chapter.

Taken all together, our new symbolic regression system is based upon the following genome.

• Genome with four chromosome vectors
• Each chromosome has an epigenetic wrapper
• There are two user constraint vectors

The new system is constructed using these important techniques.

• Universal abstract goal expression
• Standard single point vector-based mutation
• Standard two point vector-based cross over
• Continuous vector differential evolution
• Discrete vector differential evolution
• Continuous particle swarm evolution
• Pessimal vertical slicing and out-of-sample scoring during training
• Age-layered populations
• User defined epigenetic factors
• User defined constraints

The universal abstract goal expression allows the system to be used for general symbolic regression and will be discussed in detail later in this chapter. Both single point vector-based mutation and two point vector-based cross over are discussed in (Man et al., 1999). Continuous and discrete vector differential evolution are discussed in (Price et al., 2005). Continuous particle swarm evolution is discussed in (Eberhart et al., 2001). Pessimal vertical slicing is discussed in (Korns, 2009). Age-layered populations are discussed in (Hornby, 2006) and (Korns, 2009). User defined epigenetic factors and user defined constraints will be discussed in detail later in this chapter. However, before proceeding to discuss the details of the system implementation, we will review abstract expression grammars as discussed in detail in (Korns, 2009).
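To make the all vector genome concrete, here is a minimal sketch. It assumes that the four chromosomes are the function, feature, constant, and term vectors (named F, V, C, and T after the f.F, f.V, f.C, and f.T vectors in the review that follows) and that wrappers and constraints can be held in plain dictionaries. The class layout and helper names are hypothetical, not the system's actual data structures.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Genome:
    # Four chromosome vectors (assumed to be F, V, C and T).
    F: List[int]      # abstract function choices
    V: List[int]      # abstract feature choices
    C: List[float]    # abstract real constants
    T: List[int]      # abstract term selectors (0 or 1)
    # Epigenetic wrappers: per chromosome, the genes that are expressed.
    epigenome: Dict[str, List[int]] = field(default_factory=dict)
    # User constraint vectors: per gene, the allowed choices (assumed layout).
    constraints: Dict[str, List[List[int]]] = field(default_factory=dict)

def two_point_crossover(a, b, rng):
    """Standard two point cross over on two equal-length vectors."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:]

def single_point_mutation(vec, choices, rng):
    """Standard single point mutation: redraw one gene from its choices."""
    child = list(vec)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child

rng = random.Random(7)
mom = Genome(F=[1, 8, 2, 4], V=[2, 4, 1, 6], C=[45.3], T=[0, 1])
dad = Genome(F=[3, 3, 3, 3], V=[5, 5, 5, 5], C=[1.0], T=[1, 1])
child_F = two_point_crossover(mom.F, dad.F, rng)
child_V = single_point_mutation(mom.V, list(range(1, 11)), rng)
print(child_F, child_V)
```

Because every chromosome is a flat vector of fixed length, the same operators apply unchanged to each one, and the child always has the same genome shape as its parents.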

Review of Abstract Expression Grammars

The simple concrete expression grammar we use in our symbolic regression system is a C-like functional grammar with the following basic elements.

• Real Numbers: 3.45, -.0982, 100.389, and all other real constants.
• Row Features: x1, x2, x9, and all other features.
• Binary Operators: +, *, /, %, max(), min(), mod()
• Unary Operators: sqrt(), square(), cube(), abs(), sign(), sigmoid()
• Unary Operators: cos(), sin(), tan(), tanh(), log(), exp()
• Relational Operators: <, =, > (as listed in the constraint examples later in this chapter)
• Conditional Operator: (expr < expr) ? expr : expr
• Colon Operator: expr : expr
• noop Operator: noop()

Our numeric expressions are C-like, containing the elements shown above and surrounded by regression commands such as regress(), svm(), etc. Currently we support univariate regression, multivariate regression, and support vector regression. Our conditional expression operator (...) ? (...) : (...) is the C-like conditional operator, where the ? and : operators always come in tandem. Our noop operator is an idempotent which simply returns its first argument regardless of the number of arguments: noop(x7,x6/2.1) = x7. Our basic expression grammar is functional in nature, therefore all operators are viewed grammatically as function calls. Our symbolic regression system creates its regression champion using evolution, but the final regression champion will be a compilation of a basic concrete expression such as:

(E1): f = (log(x3)/sin(x2*45.3))>x4 ? tan(x6) : cos(x3)

Computing an NLSE score for f requires only a single pass over every row of X and results in an attribute being added to f by executing the "score" method compiled into f as follows.

f.NLSE = f.score(X,Y)

Suppose that we are satisfied with the form of the expression in (E1), but we are not sure that the real constant 45.3 is optimal. We can enhance our symbolic regression system with the ability to optimize individual real constants by adding abstract constant rules to our built-in algebraic expression grammar.

Abstract Constants: c1, c2, and c10

Abstract constants represent placeholders for real numbers which are to be optimized by the symbolic regression system. To further optimize f we would alter the expression in (E1) as follows.

(E2): f = (log(x3)/sin(x2*c1))>x4 ? tan(x6) : cos(x3)

The compiler adds a new real number vector attribute, C, to f such that f.C has as many elements as there are abstract constants in (E2). Optimizing this version of f requires that the built-in "score" method compiled into f be changed from a single pass to a multiple pass algorithm in which the real number values in the abstract constant vector, f.C, are iterated until the expression in (E2) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract constant vector, f.C, to optimal real number choices. Clearly the particle swarm (Eberhart et al., 2001) and differential evolution algorithms are excellent candidates for optimizing f.C, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

Summarizing, we have a new grammar term, c1, which is a reference to the 1st element of the real number vector, f.C (in C language syntax c1 == f.C[1]). The f.C vector is optimized by scoring f, then altering the values in f.C, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the regression champion agent in (E2) is optimized with:

f.C == < 45.396 >

then the optimized regression champion agent in (E2) has a concrete conversion counterpart as follows:

f = (log(x3)/sin(x2*45.396))>x4 ? tan(x6) : cos(x3)

Suppose that we are satisfied with the form of the expression in (E1), but we are not sure that the features x2, x3, and x6 are optimal choices. We can enhance our symbolic regression system with the ability to optimize individual features by adding abstract feature rules to our built-in algebraic expression grammar.

Abstract Features: v1, v2, and v10

Abstract features represent placeholders for features which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E1) as follows.

(E3): f = (log(v1)/sin(v2*45.3))>v3 ? tan(v4) : cos(v1)

The compiler adds a new integer vector attribute, V, to f such that f.V has as many elements as there are abstract features in (E3). Each integer element in the f.V vector is constrained between 1 and M, and represents a choice of feature (in x). Optimizing this version of f requires that the built-in "score" method compiled into f be changed from a single pass to a multiple pass algorithm in which the integer values in the abstract feature vector, f.V, are iterated until the expression in (E3) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract feature vector, f.V, to integer choices selecting optimal features (in x). Clearly the genetic algorithm (Man et al., 1999), discrete particle swarm (Eberhart et al., 2001), and discrete differential evolution (Price et al., 2005) algorithms are excellent candidates for optimizing f.V, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

The f.V vector is optimized by scoring f, then altering the values in f.V, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the regression champion agent in (E3) is optimized with:

f.V == < 2, 4, 1, 6 >

then the optimized regression champion agent in (E3) has a concrete conversion counterpart as follows:

f = (log(x2)/sin(x4*45.3))>x1 ? tan(x6) : cos(x2)
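The multiple pass "score" idea can be sketched on a deliberately small case. The goal expression f = c1 * v1 is a hypothetical stand-in, chosen so that the error is convex in c1. The discrete pass simply enumerates every feature and the continuous pass is a deterministic step-halving search; these are swapped-in simplifications, not the genetic algorithm, particle swarm, or differential evolution methods named in the text. NLSE is assumed to be root mean squared error divided by the standard deviation of Y.

```python
import math
import random
import statistics

def nlse(y, y_hat):
    """Assumed NLSE: root mean squared error over the std deviation of y."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return math.sqrt(mse) / statistics.pstdev(y)

def evaluate(X, V, C):
    """Hypothetical goal expression f = c1 * v1 (1-based feature index)."""
    return [C[0] * row[V[0] - 1] for row in X]

def score(X, Y, M):
    """Multiple pass score: returns (best NLSE, f.V, f.C)."""
    best = (float("inf"), None, None)
    for v in range(1, M + 1):             # discrete pass over feature choices
        V, C, step = [v], [0.0], 16.0
        err = nlse(Y, evaluate(X, V, C))
        while step > 1e-9:                # continuous pass over the constant
            improved = False
            for delta in (step, -step):
                trial = [C[0] + delta]
                trial_err = nlse(Y, evaluate(X, V, trial))
                if trial_err < err:       # keep only strict improvements
                    C, err, improved = trial, trial_err, True
                    break
            if not improved:
                step /= 2.0               # refine the search resolution
        if err < best[0]:
            best = (err, V, C)
    return best

rng = random.Random(1)
X = [[rng.uniform(-50.0, 50.0) for _ in range(5)] for _ in range(200)]
Y = [3.25 * row[3] for row in X]          # hidden target uses feature x4
best_err, f_V, f_C = score(X, Y, M=5)
print(best_err, f_V, f_C)
```

As in the text, scoring has the side effect of producing the optimized vectors: the returned f_V and f_C are the values that would be written back into f.V and f.C.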

Similarly, we can enhance our nonlinear regression system with the ability to optimize individual functions by adding abstract function rules to our built-in algebraic expression grammar.

Abstract Functions: f1, f2, and f10

Abstract functions represent placeholders for built-in functions which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E1) as follows.

(E4): f = (f1(x3)/f2(x2*45.3))>x4 ? f3(x6) : f4(x3)

The compiler adds a new integer vector attribute, F, to f such that f.F has as many elements as there are abstract functions in (E4). Each integer element in the f.F vector is constrained between 1 and the number of built-in functions available in the expression grammar, and represents a choice of built-in function. Optimizing this version of f requires that the built-in "score" method compiled into f be changed from a single pass to a multiple pass algorithm in which the integer values in the abstract function vector, f.F, are iterated until the expression in (E4) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract function vector, f.F, to integer choices selecting optimal built-in functions. Clearly the genetic algorithm (Man et al., 1999), discrete particle swarm (Eberhart et al., 2001), and discrete differential evolution (Price et al., 2005) algorithms are excellent candidates for optimizing f.F, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

Summarizing, we have a new grammar term, f1, which is an indirect function reference through the 1st element of the integer vector, f.F (in C language syntax f1 == functionList[f.F[1]]). The f.F vector is optimized by scoring f, then altering the values in f.F, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the valid function list in the expression grammar is

f.functionList = < log, sin, cos, tan, max, min, avg, cube, sqrt >

and the regression champion agent in (E4) is optimized with:

f.F = < 1, 8, 2, 4 >

then the optimized regression champion agent in (E4) has a concrete conversion counterpart as follows:

f = (log(x3)/cube(x2*45.3))>x4 ? sin(x6) : tan(x3)

The built-in function argument arity issue is easily resolved by having each built-in function ignore any excess arguments and substitute defaults for any missing arguments.

Finally, we can enhance our nonlinear regression system with the ability to optimize either features or constants by adding abstract term rules to our built-in algebraic expression grammar.

Abstract Terms: t1, t2, and t10

Abstract terms represent placeholders for either abstract features or constants which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E2) as follows.

(E5): f = (log(t0)/sin(t1*t2))>t3 ? tan(t4) : cos(t5)

The compiler adds a new binary vector attribute, T, to f such that f.T has as many elements as there are abstract terms in (E5). Each binary element in the f.T vector is either 0 or 1, and represents a choice of abstract feature or abstract constant. Adding abstract terms allows the system to construct a universal formula containing all possible concrete formulas. Additional details on Abstract Expression Grammars can be found in (Korns, 2009).
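The concrete conversions described in this review amount to substituting vector entries for placeholders. The sketch below does this by text substitution over 1-based vectors, and adds a small helper for the arity rule. The helper names and the default argument value of 0.0 are assumptions; the real system compiles expressions rather than rewriting strings.

```python
import math
import re

FUNCTION_LIST = ["log", "sin", "cos", "tan", "max", "min", "avg", "cube", "sqrt"]

def concretize(expr, F=(), V=(), C=()):
    """Replace abstract f<i>, v<i>, c<i> placeholders using 1-based vectors."""
    expr = re.sub(r"\bf(\d+)\b",
                  lambda m: FUNCTION_LIST[F[int(m.group(1)) - 1] - 1], expr)
    expr = re.sub(r"\bv(\d+)\b",
                  lambda m: "x%d" % V[int(m.group(1)) - 1], expr)
    expr = re.sub(r"\bc(\d+)\b",
                  lambda m: repr(C[int(m.group(1)) - 1]), expr)
    return expr

def call_with_arity(fn, arity, args, default=0.0):
    """Ignore excess arguments and pad missing ones with an assumed default."""
    padded = list(args[:arity]) + [default] * (arity - len(args))
    return fn(*padded)

E2 = "(log(x3)/sin(x2*c1))>x4 ? tan(x6) : cos(x3)"
E3 = "(log(v1)/sin(v2*45.3))>v3 ? tan(v4) : cos(v1)"
E4 = "(f1(x3)/f2(x2*45.3))>x4 ? f3(x6) : f4(x3)"
print(concretize(E2, C=[45.396]))
print(concretize(E3, V=[2, 4, 1, 6]))
print(concretize(E4, F=[1, 8, 2, 4]))
print(call_with_arity(math.sqrt, 1, [9.0, 2.0]))  # excess argument ignored
print(call_with_arity(max, 2, [5.0]))             # missing argument defaulted
```

The indirection is the point of the design: evolution only ever edits small integer and real vectors, while the expression template itself never changes shape.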

4.

Universal Abstract Expressions

A general nonlinear regression system accepts an input matrix, X, of N rows and M columns and a dependent variable vector, Y, of length N. The dependent vector Y is related to X through the (quite possibly nonlinear) transformation function, Q, as follows: Y[n] = Q(X[n]). The nonlinear transformation function, Q, can be related to linear regression systems, without loss of generality, as follows. Given an N rows by M columns matrix X (independent variables), an N vector Y (dependent variable), and a K+1 vector of coefficients, the nonlinear transformation, Q, is a system of K transformations, $Q_k : (R_1 \times R_2 \times \dots \times R_M) \to R$, such that $y = C_0 + (C_1 * Q_1(X)) + \dots + (C_K * Q_K(X)) + err$ minimizes the normalized least squared error. Obviously, in this formalization, a nonlinear regression system is a linear regression system which searches for a set of K suitable transformations which minimize the normalized least squared error. If K is equal to M, then Q is dimensional, and Q is covering if, for every m in M, there is at least one instance of $X_m$ in at least one term $Q_k$.

With reference to our system architecture, what is needed to implement general nonlinear regression, in this formalization, is a method of constructing a universal goal expression which contains all possible nonlinear transformations up to a pre-specified complexity level. Such a method exists and is described as follows. Given any concrete expression grammar suitable for nonlinear regression, we can construct a universal abstract goal expression, of an arbitrary grammar node depth level, which contains all possible concrete instance expressions within any one of the K transformations in Q. For instance, the universal abstract expression, $U_0$, of all $Q_k$ of depth level 0 is t0. Remember that t0 is either v0 or c0. The universal abstract expression, $U_1$, of all $Q_k$ of depth level 1 is f0(t0,t1). In general we have the following.

$U_0$: t0
$U_1$: f0(t0,t1)
$U_2$: f0(f1(t0,t1),f2(t2,t3))
$U_3$: f0(f1(f2(t0,t1),f3(t2,t3)),f4(f5(t4,t5),f6(t6,t7)))
$U_k$: f0($U_{k-1}$, $U_{k-1}$)
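The recursive construction of the universal abstract expressions can be sketched directly. The sketch assumes that abstract functions are numbered in preorder and abstract terms from left to right, which is the numbering that reproduces the listed expressions up to depth 3; the function name `universal` is hypothetical.

```python
from itertools import count

def universal(depth):
    """Build the universal abstract expression of the given depth as a string."""
    f_ids, t_ids = count(), count()

    def build(level):
        if level == 0:
            return "t%d" % next(t_ids)    # leaf: an abstract term
        name = "f%d" % next(f_ids)        # number functions in preorder
        left = build(level - 1)
        right = build(level - 1)
        return "%s(%s,%s)" % (name, left, right)

    return build(depth)

for k in range(4):
    print("U%d: %s" % (k, universal(k)))
```

An expression of depth k built this way holds 2^k - 1 abstract functions and 2^k abstract terms, so the genome length grows exponentially with the chosen depth.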

Given any suitable functional grammar with features, constants, and operators, we add a special operator, noop, which simply returns its first argument. This allows any universal expression to contain all smaller concrete expressions. For instance, if f0 = noop, then f0(t0,t1) = t0. We solve the arity problem for unary operators by altering them to ignore the rightmost arguments, for binary operators by altering them to substitute default arguments for missing rightmost arguments, and for N-ary operators by wrapping the additional arguments onto lower levels of the binary expression using appropriate context sensitive grammar rules. For example, let's see how we can wrap the 4-ary conditional function (operator) ? onto multiple grammar node levels using context sensitive constraints.

y = f0(f1(expr,expr),f2(expr,expr))

Clearly if, during evolution in any concrete solution, the abstract function f0 were to represent the ? conditional function, then the abstract function f1 would be restricted to one of the relational functions (operators), and the abstract function f2 would be restricted to only the colon function (operator). Therefore one would have any number of possible solutions to the goal expression, but some of the possible solutions would violate these context sensitive constraints and would be unreasonable. The assertion that certain possible solutions are unreasonable depends upon the violation of context sensitive constraints implicit with each operator as follows.

y = f0(f1(expr,expr),f2(expr,expr)) (goal expression)
y = ?( ? : noop)
constraints: f1(+ * / % max min mod sqrt square cube abs sign sigmoid cos sin tan tanh log exp < = > ? : noop)
constraints: f2(+ * / % max min mod sqrt square cube abs sign sigmoid cos sin tan tanh log exp < = > ? : noop)

However, if we know that a particular solution has selected f0 to be the operator ?, then we must implicitly assume that the constraints for abstract functions f0, f1, and f2, with respect to that solution, are as follows.

constraints: f0(?)
constraints: f1(< = >)
constraints: f2(:)

In the goal expression genome, f0 is a single gene located in position zero in the chromosome for abstract functions. The constraints are wrapped around each chromosome and are a vector of reasonable choices for each gene. In a context insensitive genome, choosing any specific value for gene f0 or gene v6, etc. has no effect on the constraint wrappers in the genome. However, in a context sensitive genome, choosing any specific value for gene f0 or gene v6, etc. may have an effect on the constraint wrappers in the genome.

Furthermore, we are not limited to implicit control of the genome's constraint wrappers. We can extend control of the genome's constraints to the user in an effort to allow greater control of the search space. For instance, if the user wanted to perform a univariate regression on a problem with ten features but desired only trigonometric transforms in the output, the following abstract goal expression would be appropriate.

y = f0(v0) where f0(cos sin tan tanh)

Publishing the genome's constraints for explicit user guidance is an attempt to explore greater user control of the search space during the evolutionary process.
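The context sensitive constraint wrappers can be sketched as a lookup that depends on the current choice for gene f0. This sketch models only the single rule stated in the text (choosing ? for f0 restricts f1 to the relational operators and f2 to the colon operator); the function names and the dictionary layout are hypothetical.

```python
ALL_FUNCTIONS = ("+ * / % max min mod sqrt square cube abs sign sigmoid "
                 "cos sin tan tanh log exp < = > ? : noop").split()

def constraint_wrappers(f0_choice=None):
    """Constraint vectors for genes f0, f1, f2 given the current f0 choice."""
    if f0_choice == "?":
        # Context sensitive case: the conditional operator narrows f1 and f2.
        return {"f0": ["?"], "f1": ["<", "=", ">"], "f2": [":"]}
    # Otherwise every gene keeps the full list of built-in functions.
    return {gene: list(ALL_FUNCTIONS) for gene in ("f0", "f1", "f2")}

def is_reasonable(solution):
    """True if every gene's choice lies inside its constraint vector."""
    allowed = constraint_wrappers(solution["f0"])
    return all(solution[gene] in allowed[gene] for gene in allowed)

print(is_reasonable({"f0": "?", "f1": "<", "f2": ":"}))      # True
print(is_reasonable({"f0": "?", "f1": "+", "f2": ":"}))      # False
print(is_reasonable({"f0": "+", "f1": "sin", "f2": "cos"}))  # True
```

A user supplied clause such as f0(cos sin tan tanh) would simply replace the default vector for that gene before the search starts.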

6.

Epigenome

In order to perform symbolic regression with a single abstract goal expression, all of the individual solutions must have the same shape genome. In a context insensitive architecture with only one population island performing only a general search strategy, this is not an issue. However, if we wish to perform symbolic regression, with a single abstract goal expression, on multiple population islands each searching a different part of the problem space, then we have to be more sophisticated in our approach. We have already seen how constraints can be used to control, both implicitly and explicitly, evolutionary choices within a single gene. But what if we wish to influence which genes are chosen for exploration during the evolutionary process? Then we must provide some mechanism for choosing which genes are to be explored and which are not. Purely arbitrarily, and in the sole interest of keeping faith with the original biological motivation of genetic algorithms, we call genes which are chosen for exploration during evolution expressed, and genes which are chosen NOT to be explored during evolution unexpressed. Furthermore, the wrapper around each chromosome, which determines which genes are and are not expressed, we call the epigenome. Once again, consider the following goal expression.

regress(f0(f1(expr,expr),f2(expr,expr))) where f0(?) Since we know that the user has requsted only solutions where f0 has selected to be the operator ?, then we must implicitly assume that the constraints and epigenome for abstract functions f0, f1, and f2, with respect to any solution are as follows. constraints: f0(?) constraints: f1(< = >) constraints: f2(:) epigenome: ef(f1) We can assume the epigenome is limited to function f1 because, with both gene f0 and gene f2 constrained to a single choice each, f0 and f2 are implicitly no longer allowed to vary during evolution, with respect to any solution. Effectively both f0 and f2 are unexpressed. In the goal expression genome, ef is the epigenome associated with the chromosome for abstract functions. The epigenomes are wrapped around each chomosome and are a vector of expressed genes. In a context insensitive genome, chosing any specific value for gene f0 or gene v6, etc. has no effect on the contraint wrappers or the epigenome. However, in a context sensitive genome, chosing any specific value for gene f0 or gene v6, etc. may have an effect on the contraint wrappers and the epigenome. Of course, we are not limited to implicit control of the epigenome. We can extend control of the epigenome to the user in an effort to allow greater control of the search space. For instance, the following goal expression is an example of a user specified epigenome. (E6): regress(f0(f1(f2(v0,v1),f3(v2,v3)),f4(f5(v4,v5),f6(v6,v7)))) (E6.1): where {} (E6.2): where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)} Obviously expression (E6) has only one genome; however, the two where clauses request two distinct simultaneous search strategies. The first where clause (E6.1) tells the system to perform an unconstrained general search of all possible solutions. The second where clause (E6.2) tells the system to simultaneously perform a more complex search among a limited set of possible solutions as follows. 
The ff(noop) condition tells the system to initialize all functions to noop unless otherwise specified. The f2(cos sin tan tanh) condition tells the system to restrict abstract function f2 to only the trigonometric functions, starting with cos. The ef(f2) epigenome tells the system that only f2 will participate in the evolutionary process. The ev(v0) epigenome tells the system that only v0 will participate in the evolutionary process. Therefore, (E6.2) causes the system to evolve only solutions consisting of a single trigonometric function on a single feature, e.g. tan(x4), cos(x0), etc. These two distinct search strategies are explored simultaneously. The resulting champion will be the winning (optimal) solution across all simultaneous search strategies.
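To make the structure of a where clause concrete, the following is a minimal sketch, not the system's actual parser. It assumes that every entry has the form name(items), that ef and ev entries belong to the epigenome, and that every other entry is a constraint whose first item is the starting choice. The function name parse_where and the dictionary layout are hypothetical.

```python
import re

def parse_where(clause):
    """Split a where clause into (constraints, epigenome).

    Hypothetical layout: each dict maps an entry name to its list of items.
    """
    constraints, epigenome = {}, {}
    for name, body in re.findall(r"(\w+)\(([^)]*)\)", clause):
        # Items may be separated by spaces or commas, e.g. f1(noop,*).
        items = [tok for tok in re.split(r"[,\s]+", body.strip()) if tok]
        if name in ("ef", "ev"):
            epigenome[name] = items      # genes allowed to evolve
        else:
            constraints[name] = items    # allowed choices, first one is the start
    return constraints, epigenome

constraints, epigenome = parse_where("{ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}")
print(constraints)
print(epigenome)
```

Under these assumptions the empty clause {} of (E6.1) yields no constraints and no epigenome entries, which corresponds to the unconstrained general search.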

7. Control

The user community is increasingly demanding better control of the search space and better control of the output from symbolic regression systems. In search of a control paradigm for symbolic regression, we have looked to the relationship of SQL to database searches. Originally, database searches were highly constrained and heavily dictated by the choice of storage mechanism. With the advent of relational databases, searches came increasingly under user control, to the point that modern SQL is amazingly flexible. An unanswered research question is how much user control of the symbolic regression process can reasonably be achieved.

Our system architecture allows us to use abstract goal expressions to better explore the possibilities for user control. Given the immense value of search space reduction and search specialization, the symbolic regression system can benefit greatly if the epigenome and the constraints are made available to the user. This allows the user to specify goal formulas and candidate individuals which are tailored to specific applications. For instance, the following univariate abstract goal expression is a case in point.

(E7): regress(f0(f1(f2(v0,v1),f3(v2,v3)),f4(f5(v4,v5),f6(v6,v7))))
(E7.1): where {}
(E7.2): where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}
(E7.3): where {ff(noop) f1(noop,*) f2(*) ef(f1) ev(v0,v1,v2)}
(E7.4): where {ff(noop) f0(cos sin tan tanh) f1(noop,*) f2(*) ef(f0,f1) ev(v0,v1,v2)}
(E7.5): where {f0(?) f4(:)}

Expression (E7) has only one genome and can be entered as a single goal expression requesting five distinct simultaneous search strategies. Borrowing a term from chess playing programs, we can create an opening book by adding where clauses like (E7.2), (E7.3), (E7.4), and (E7.5). The first where clause (E7.1) tells the system to perform an unconstrained general search of all possible solutions.
The second where clause (E7.2) tells the system to evolve only solutions consisting of a single trigonometric function on a single feature, e.g. tan(x4), cos(x0), etc. In the third where clause (E7.3), the f1(noop,*) condition tells the system to restrict abstract function f1 to only noop and *, starting with noop. The f2(*) condition tells the system to restrict abstract function f2 to only the * function. The ef(f1) epigenome tells the system that only f1 will participate in the evolutionary process. The ev(v0,v1,v2) epigenome tells the system that only v0, v1, and v2 will participate in the evolutionary process. Therefore, (E7.3) causes the system to evolve champions that are pair or triple cross correlations only, e.g. (x3*x1) or (x1*x4*x2).
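The reduction in search space under (E7.3) can be illustrated by enumerating every expression the clause permits. This is only a sketch under stated assumptions: noop is taken to return its first argument (which is what makes f1 = noop yield a pair and f1 = * yield a triple), the feature names x0 to x4 are hypothetical, and variables are allowed to repeat. The function candidates_e73 is illustrative and is not part of the system described here.

```python
from itertools import product

# Assumption: noop passes its first argument through; * builds a product string.
OPS = {
    "noop": lambda a, b: a,
    "*": lambda a, b: f"{a}*{b}",
}

def candidates_e73(features):
    """Enumerate expressions reachable for f1(f2(v0,v1),f3(v2,v3)) under (E7.3)."""
    found = set()
    for f1, v0, v1, v2 in product(["noop", "*"], features, features, features):
        left = OPS["*"](v0, v1)         # f2 is fixed to *
        right = OPS["noop"](v2, None)   # f3 stays noop via ff(noop), so f3(v2,v3) = v2
        # f0 also stays noop, so the f4 branch never contributes to the result.
        found.add("(" + OPS[f1](left, right) + ")")
    return found

features = [f"x{i}" for i in range(5)]
exprs = candidates_e73(features)
print(len(exprs), "(x3*x1)" in exprs, "(x1*x4*x2)" in exprs)
```

With five features the enumeration is small enough to search exhaustively, which is the practical appeal of adding such a clause to the opening book alongside the unconstrained search.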


In the fourth where clause (E7.4), the ff(noop) condition tells the system to initialize all functions to noop unless otherwise specified. The f0(cos sin tan tanh) condition tells the system to restrict abstract function f0 to only the trigonometric functions, starting with cos. The f1(noop,*) condition tells the system to restrict abstract function f1 to only noop and *, starting with noop. The f2(*) condition tells the system to restrict abstract function f2 to only the * function. The ef(f0,f1) epigenome tells the system that only f0 and f1 will participate in the evolutionary process. The ev(v0,v1,v2) epigenome tells the system that only v0, v1, and v2 will participate in the evolutionary process. Therefore, (E7.4) causes the system to evolve champions consisting of a single trigonometric function operating on a pair or triple cross correlation only, e.g. cos(x3*x1) or tan(x1*x4*x2). The fifth where clause (E7.5) causes the system to evolve only conditional champions, e.g. ((x3*x1)


Contents

Contributing Authors

Preface

Foreword

Genetic Programming Theory and Practice 2010: An Introduction
Trent McConaghy, Ekaterina Vladislavleva and Rick Riolo

1 FINCH: A System for Evolving Java (Bytecode)
Michael Orlov and Moshe Sipper

2 Towards Practical Autoconstructive Evolution: Self-Evolution of Problem-Solving Genetic Programming Systems
Lee Spector

3 The Rubik Cube and GP Temporal Sequence Learning: An Initial Study
Peter Lichodzijewski and Malcolm Heywood

4 Ensemble Classifiers: AdaBoost and Orthogonal Evolution of Teams
Terence Soule, Robert B. Heckendorn, Brian Dyre, and Roger Lew

5 Covariant Tarpeian Method for Bloat Control in Genetic Programming
Riccardo Poli

6 A Survey of Self Modifying Cartesian Genetic Programming
Simon Harding, Wolfgang Banzhaf and Julian F. Miller

7 Abstract Expression Grammar Symbolic Regression
Michael F. Korns

8 Age-Fitness Pareto Optimization
Michael Schmidt and Hod Lipson

9 Scalable Symbolic Regression by Continuous Evolution with Very Small Populations
Guido F. Smits, Ekaterina Vladislavleva and Mark E. Kotanchek

10 Symbolic Density Models of One-in-a-Billion Statistical Tails via Importance Sampling and Genetic Programming
Trent McConaghy

11 Genetic Programming Transforms in Linear Regression Situations
Flor Castillo, Arthur Kordon and Carlos Villa

12 Exploiting Expert Knowledge of Protein-Protein Interactions in a Computational Evolution System for Detecting Epistasis
Kristine A. Pattin, Joshua L. Payne, Douglas P. Hill, Thomas Caldwell, Jonathan M. Fisher, and Jason H. Moore

13 Composition of Music and Financial Strategies via Genetic Programming
Hitoshi Iba and Claus Aranha

14 Evolutionary Art Using Summed Multi-Objective Ranks
Steven Bergen and Brian J. Ross

Index

Contributing Authors

Claus Aranha is a graduate student at the Graduate School of Frontier Sciences in the University of Tokyo, Japan.

Wolfgang Banzhaf is a professor at the Department of Computer Science at Memorial University of Newfoundland, St. John’s, NL, Canada.

Steven Bergen is a graduate student in the Department of Computer Science, Brock University, St. Catharines, Ontario, Canada.

Tom Caldwell is a database developer and member of the Computational Genetics Laboratory at Dartmouth College.

Flor Castillo is a Lead Research Specialist in the Polyglycols, Surfactants, and Fluids group within Performance Products R&D organization of The Dow Chemical Company.

Brian Dyre is an Associate Professor of Experimental Psychology (Human Factors), a member of the Neuroscience Program, and the director of the Idaho Visual Performance Laboratory (IVPL) at the University of Idaho, USA.

Jonathan Fisher is a computer programmer and member of the Computational Genetics Laboratory at Dartmouth College.

Simon Harding is a postdoctoral research fellow at the Department of Computer Science at Memorial University of Newfoundland, St. John’s, NL, Canada.

Robert B. Heckendorn is an Associate Professor of Computer Science and a member of the Bioinformatics and Computational Biology Program at the University of Idaho, USA.

Malcolm Heywood is a Professor of Computer Science at Dalhousie University, Halifax, NS, Canada.

Douglas Hill is a computer programmer and member of the Computational Genetics Laboratory at Dartmouth College.


Hitoshi Iba is a professor of Computer Science at the Graduate School of Engineering in the University of Tokyo, Japan.

Arthur K. Kordon is a Data Mining and Modeling Leader in the Advanced Analytics Group within the Dow Business Services of The Dow Chemical Company.

Michael F. Korns is Chief Technology Officer at Freeman Investment Management, Henderson, Nevada, USA.

Mark E. Kotanchek is Chief Technology Officer of Evolved Analytics, a data modeling consulting and systems company, USA/China.

Roger Lew is a graduate student in the Neuroscience Program at the University of Idaho, USA.

Peter Lichodzijewski is a graduate student in the Faculty of Computer Science at Dalhousie University, Halifax, Nova Scotia, Canada.

Hod Lipson is an Associate Professor in the school of Mechanical and Aerospace Engineering and the school of Computing and Information Science at Cornell University, Ithaca, NY, USA.

Trent McConaghy is co-founder and Chief Scientific Officer of Solido Design Automation Inc., which makes variation-aware IC design software for top-tier semiconductor firms. He is based in Vancouver, Canada.

Julian F. Miller is a lecturer in the Department of Electronics at the University of York, UK.

Jason H. Moore is the Frank Lane Research Scholar in Computational Genetics and Associate Professor of Genetics at Dartmouth Medical School, USA.

Michael Orlov is a graduate student in Computer Science at Ben-Gurion University, Israel.

Kristine Pattin is a Molecular and Cellular Biology graduate student and member of the Computational Genetics Laboratory at Dartmouth College.


Joshua L. Payne is a postdoctoral research fellow in the computational genetics laboratory at Dartmouth College.

Riccardo Poli is a Professor of Computer Science in the School of Computer Science and Electronic Engineering at the University of Essex, UK.

Rick Riolo is Director of the Computer Lab and Associate Research Scientist in the Center for the Study of Complex Systems at the University of Michigan, USA.

Brian J. Ross is a Professor of Computer Science at Brock University, St. Catharines, ON, Canada.

Michael Schmidt is a graduate student in computational biology at Cornell University, Ithaca, NY, USA.

Moshe Sipper is a Professor of Computer Science at Ben-Gurion University, Israel.

Guido F. Smits is a Research and Development Leader in the New Products Group within the Core R&D Organization of the Dow Chemical Company, Belgium.

Terence Soule is an Associate Professor of Computer Science, a member of the Bioinformatics and Computational Biology Program, and Director of the Neuroscience Program at the University of Idaho, USA.

Lee Spector is a Professor of Computer Science in the School of Cognitive Science at Hampshire College, Amherst, MA, USA.

Carlos Villa is a Senior Research Specialist in Polyurethanes Process Research within Performance Products R&D organization of The Dow Chemical Company.

Ekaterina Vladislavleva is a Lecturer in the Department of Mathematics and Computer Science at the University of Antwerp, Belgium.

Preface

The work described in this book was first presented at the Eighth Workshop on Genetic Programming, Theory and Practice, organized by the Center for the Study of Complex Systems at the University of Michigan, Ann Arbor, May 20-22, 2010. The goal of this workshop series is to promote the exchange of research results and ideas between those who focus on Genetic Programming (GP) theory and those who focus on the application of GP to various real-world problems. In order to facilitate these interactions, the number of talks and participants was small and the time for discussion was large. Further, participants were asked to review each other’s chapters before the workshop. Those reviewer comments, as well as discussion at the workshop, are reflected in the chapters presented in this book. Additional information about the workshop, addendums to chapters, and a site for continuing discussions by participants and by others can be found at http://cscs.umich.edu/gptp-workshops/ .

We thank all the workshop participants for making the workshop an exciting and productive three days. In particular we thank the authors, without whose hard work and creative talents neither the workshop nor the book would be possible. We also thank our keynote speaker Jürgen Schmidhuber, Director of the Swiss Artificial Intelligence Lab IDSIA, Professor of Artificial Intelligence at the University of Lugano, Switzerland, Head of the CogBotLab at TU Munich, Germany, and Professor SUPSI, Switzerland. Jürgen’s talk inspired a great deal of discussion among the participants throughout the workshop.
The workshop received support from these sources: The Center for the Study of Complex Systems (CSCS); John Koza, Third Millennium Venture Capital Limited; Michael Korns, Freeman Investment Management; Ying Becker, State Street Global Advisors, Boston, MA; Mark Kotanchek, Evolved Analytics; Jason Moore, Computational Genetics Laboratory at Dartmouth College; Conor Ryan, Biocomputing and Developmental Systems Group, Computer Science and Information Systems, University of Limerick; and William and Barbara Tozier, Vague Innovation LLC. We thank all of our sponsors for their kind and generous support for the workshop and GP research in general.


A number of people made key contributions to running the workshop and assisting the attendees while they were in Ann Arbor. Foremost among them was Howard Oishi, who makes GPTP workshops run smoothly with his diligent efforts before, during and after the workshop itself. After the workshop, many people provided invaluable assistance in producing this book. Special thanks go to Philipp Cannons who did a wonderful job working with the authors, editors and publishers to get the book completed very quickly. Jennifer Maurer and Melissa Fearon provided invaluable editorial efforts, from the initial plans for the book through its final publication. Thanks also to Springer for helping with various technical publishing issues.

Rick Riolo, Trent McConaghy and Ekaterina (Katya) Vladislavleva

Foreword

If politics is the art of the possible, research is surely the art of the soluble. Both are immensely practical-minded affairs.
Peter Medawar¹

The annual Genetic Programming Theory and Practice (GPTP) workshop is an important cross-fertilization event, bringing practitioners and theoreticians together in a small, focussed setting for several days. At larger conferences, parallel sessions force one to miss the great majority of the presentations, and it’s not uncommon for a theoretician and a practitioner to have little more contact than a brief conversation at a coffee break. GPTP blows away any stereotypes suggesting that theoreticians neither care about nor understand the challenges practitioners face, or that practitioners are indifferent to theoretical work, considering it an ivory tower exercise of no real consequence. The mutual respect around the table is manifest, and many participants have made substantial contributions to both theory and practice over the years. As a result, the discussions and debate are open, inclusive, lively, rigorous, and often intense.

Despite the “Genetic Programming” in the title, GPTP has always been a showcase for problem solving techniques, without standing too much on the ceremony of names and labels. Many of the techniques and systems discussed this year have moved considerable distances from the standard s-expression GP of the early 1990s, and more and more hybrid systems are bringing together powerful tools from across evolutionary computation, machine learning, and statistics, often incorporating sophisticated domain knowledge as well. The creativity of our community, however, creates a plethora of challenges for those who wish to provide a theoretical understanding of these techniques and their dynamics, and evolutionary computation and GP work have long been dogged by a gap between the racing front of practical exploration and the rather more stately pace of theoretical understanding.
¹ Review of Arthur Koestler’s The Act of Creation, in the New Statesman, 19 June 1964.

Given that mismatch, events like GPTP become even more important, providing valuable opportunities for the community to take stock of the current state-of-play, identifying gaps, opportunities, and connections that have the potential to shape and inform work for years to come. This year’s papers continue to press many of the Hard Problems of the field. A number explore multi-objective evolutionary systems, co-evolution, and various types of modularity, hierarchy, and population structure, all with the goal of finding solutions to complex, structured, and often epistatic, problems. A constant challenge is finding effective representations, and many of the representations here don’t look much like a traditional tree-based GP. Similarly, configuration and parameter settings are a consistent burr, and this year’s work includes approaches that evolve this information, and approaches that dynamically set these values as a deterministic function of the current state.

Application domains range widely through areas such as finance, industrial systems modeling, biology and medicine, games, art, and music; many, however, could still be described as forms of regression or classification, a vein that I suspect people will continue to mine successfully for years to come. A thread running through almost all the applications, in some cases more explicitly than others, is the importance of identifying and incorporating important domain knowledge, and it seems clear that few folks are tackling really tough problems without including the best domain knowledge they can lay their hands on.

Another important trend is the continued conversion of GP into an increasingly off-the-shelf tool, what Rick Riolo and Bill Tozier might call the transition from an art to a craft. Several participants are building systems with the express goal of making high quality GP tools available to non-programmers, people with problems to solve but who aren’t interested in implementing (or able to implement) a state-of-the-art evolutionary algorithm themselves.

One of the great values in participating in this sort of workshop is the conversation and discussion, both during the presentations and in the halls. Perhaps the biggest “buzz” this year was about the increased computation power being made available through cluster and cloud computing, multiple cores, and the massive parallelism of graphics processing units (GPUs). This topic came up in several papers, and was discussed with both excitement and skepticism throughout the workshop.
EC, along with most machine learning and artificial intelligence work, is a processor hungry business and one that parallelizes and distributes in fairly natural ways. This makes the increasing availability of large numbers of low-cost processing units, whether through physical devices or out on the Internet, very exciting. It wasn’t that long ago when population sizes were often 100 or less. These would now be considered small in many contexts, with population sizes routinely being several orders of magnitude larger. GPUs and cloud computing, however, make it possible to reasonably process populations of millions of individuals today, and no doubt many more in the next few years. This has enormous potential impact for both practice and theory in the field.

People often comment on the fact that in the next few decades we’ll likely have computers (or clusters of computers) with computational power comparable to that of the human brain. This also gives us the ability to run much more complex evolutionary systems, effectively simulating much richer evolutionary processes in more complex environments. Many have commented over the years that to see the true potential of evolutionary algorithms we need to place them in more complex environments, and this came up again in this year’s GPTP discussions. If we only present our systems with simple problems, or problems with easily discovered local optima, we shouldn’t be surprised if their behaviors are often disappointingly simple. One of the reasons for this simplicity has all too often been the limit on available computing power. The continued growth in computing capacity makes it possible to run much richer systems and tackle more challenging problems, shedding light in exciting new places.

These changes may have strong implications on the theory side as well. Many theoretical results (such as those from schema theory and many statistical techniques) require infinite population assumptions, for example. While many of the predictions of these theories have been shown to hold for finite populations, sampling effects have often led to significant variances especially for small populations, and many researchers have been skeptical of the practical value of results built on infinite population assumptions. If we reach a point where we’re routinely using population sizes in the millions, then while there will surely be issues of sampling, these will likely be profoundly different than those seen with populations of hundreds. More generally, as population sizes grow, it will become increasingly important to develop and extend theoretical techniques that process individuals in aggregate. Even if we could theoretically characterize each individual in a population of millions, to do so would likely be useless as we would drown in the data. We will instead need ways to characterize the broader properties of the population, probably using tools like statistical distributions and coarse graining.

Another subject of considerable discussion throughout the workshop was that of “selling” GP in particular and evolutionary systems in general.
Despite the substantial and growing evidence of GP’s ability as a powerful problem solving tool, many remain skeptical. Sometimes this is because people are naturally nervous about the unknown, but caution is certainly warranted when there is a great deal at stake, such as people’s lives or millions of dollars. One traditional way to address this is to try to focus on the evolution of “understandable” solutions so one can offer the ideas embedded in a comprehensible solution instead of trying to pitch a black box that no one understands. Several of this year’s participants were avoiding the difficulty of selling GP by simply sidestepping it. They bundle GP as part of a complex set of tools that collectively address the customer’s problem, and find that in that setting the customer is often less concerned with the technical details of each component.

I’m in a privileged position in that I rarely have to “sell” my work and so don’t have to face these issues, which I understand are very real. I must say, however, that I found it somewhat disheartening to hear so many people talk about obscuring the evolutionary component of their systems. Evolution is an incredibly powerful concept, but one that is all too little understood by the general public (especially in the United States). As an educator and evolutionary enthusiast, I see evolutionary computation as a great opportunity to help people understand that evolution is real, an idea that not only led to the amazing diversity of life on Earth, but which can also be harnessed in silico to solve tough problems and explore important new areas. To veil its use and successes seems, to me, to be a lost opportunity on many levels. Not surprisingly, however, there are no simple answers, and the conversations on all of these ideas and issues will continue well into the future, fueled and re-energized by events such as GPTP.

None of this would be possible, of course, without the hard work of the folks at the University of Michigan Center for the Study of Complex Systems (CSCS), who organize and host the gathering each year. Particular thanks go to CSCS’s Howard Oishi for his administrative organization and support, and to the organizing committee and editors of this volume: Rick Riolo (CSCS), Trent McConaghy (Solido Design Automation), and Katya Vladislavleva (University of Antwerp). Like all such events, GPTP costs money and we greatly appreciate the generous contributions of Third Millennium; State Street Global Advisors (SSgA); Michael Korns, Investment Science Corporation (ISC); the Computational Genetics Laboratory at Dartmouth College; Evolved Analytics; the Biocomputing and Developmental Systems Group, CSIS, the University of Limerick; and William and Barbara Tozier of Vague Innovation LLC. All that work and those donations made it possible for a group of bright, enthusiastic folks to get together to share and push and stretch. This volume contains one form of their collective effort, and it’s a valuable one. Read on, and be prepared to take a few notes along the way.

Nic McPhee, Professor
Division of Science and Mathematics
University of Minnesota, Morris
Morris, MN, USA
July, 2010

Genetic Programming Theory and Practice 2010: An Introduction

Trent McConaghy¹, Ekaterina Vladislavleva², and Rick Riolo³

¹ Solido Design Automation Inc., Canada; ² Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; ³ Center for Study of Complex Systems, University of Michigan.

Abstract

The toy problems are long gone, real applications are standard, and the systems have arrived. Genetic programming (GP) researchers have been designing and exploiting advances in theory, algorithm design, and computing power to the point where (traditionally) hard problems are the norm. As GP is being deployed in more real-world and hard problems, GP research goals are evolving to a higher level, to systems in which GP algorithms play a key role. The key goals in GP algorithm design are reasonable resource usage, high-quality results, and reliable convergence. To these GP algorithm goals, we add GP system goals: ease of system integration, end-user friendliness, and user control of the problem and interactivity. In this book, expert GP researchers demonstrate how they have been achieving and improving upon the key GP algorithm and system aims, to realize them on real-world / hard problems. This work was presented at the GP Theory and Practice (GPTP) 2010 workshop. This introductory chapter summarizes how these experts’ work is driving the frontiers of GP algorithms and GP systems in their application to ever-harder application domains.

Keywords: genetic programming, evolutionary computation

1. The Workshop

In May 2010 the Center for the Study of Complex Systems at the University of Michigan, with deep historical roots in evolutionary computation tracing back to Holland’s seminal work, opened its doors for the invitees of the workshop on Genetic Programming in Theory and Practice 2010. Over twenty experienced and internationally distinguished GP researchers gathered in Ann Arbor to close themselves in one room for two and a half days, present their newest (and often controversial) work to the critical attention of their peers, discuss the challenges of genetic programming, search for common traits in the field’s development, get a better understanding of the global state-of-the-art and share the vision on the “next big things” in GP theory and practice.

The atmosphere at the workshop has always been enjoyable, with every participant trying to get a deep understanding of presented work, provide constructive comments on it, suggest links to the relevant topics in the broad field of computing, and question the generality and scalability of the approach. The workshop fosters a friendly atmosphere wherein inquiring minds are genuinely trying to understand not only what they collectively know or can do with GP, but also what they collectively do not yet know or cannot yet do with GP. The latter understanding is a major driving force for further developments that we have observed in all workshops.

We are grateful to all sponsors and acknowledge the importance of their contributions to such an intellectually productive and regular event. The workshop is generously founded and sponsored by the University of Michigan Center for the Study of Complex Systems (CSCS) and receives further funding from the following people and organizations: Michael Korns of Freeman Investment Management, State Street Global Advisors, Third Millennium, Bill and Barbara Tozier of Vague Innovation, Evolved Analytics, the Computational Genetics Laboratory of Dartmouth College and the Biocomputing and Developmental Systems Group of the University of Limerick. We also thank Jürgen Schmidhuber for an enlightening and provocative keynote speech, which covered his thoughts on what makes a scientific field mature, a review of his work in solving difficult real-world problems in pragmatic ways, and his theoretical work in GP- and non-GP-based program induction.

2. Summary of Progress

Last year, GPTP 2009 marked a transition wherein the aims of GP algorithms (reasonable resource usage, high-quality results, and reliable convergence) were being consistently realized on an impressive variety of “real-world” applications by skilled practitioners in the field. This year, for GPTP 2010, researchers have begun to aim for the next level: for systems where GP algorithms play a key role. This was evident from the record number of GPTP demos, and from a renewed emphasis on system usability and user control. Also reflecting this transition, discussions had a marked unity and depth of questions on the philosophy and future of GP, on the need to re-think the algorithms and re-design systems to solve conceptually harder problems.

This chapter is organized accordingly. After a brief introduction to GP, Section 4 describes goals for design of GP algorithms and systems. Then the contributions of this volume (from the workshop) are summarized from two complementary perspectives: Section 5 describes the “best practice” techniques that GP practitioners have invented and deployed to achieve the GP algorithm and system aims (including the improvements of GPTP 2010), and Section 6 describes the application domains in which success through best practices has been reported. We conclude with a discussion of observations that emerged from the workshop, challenges that remain and potential avenues of future work. To make the results of the workshop useful to even a relative novice in the field of GP, we first provide a brief overview of GP.

GPTP2010: An Introduction

3. A Brief Introduction to Genetic Programming²

Genetic programming is a search and optimization technique for executable expressions that is modeled on natural evolution. Natural evolution is a powerful process that can be described by a few central, general mechanisms; for an introduction, see (Futuyma, 2009). A population is composed of organisms which can be distinguished in terms of how fit they are with respect to their environment. Over time, members of the population breed in frequency proportional to their fitness. The new offspring inherit the combined genetic material of their parents with some random variation, and may replace existing members of the population. The entire process is iterative, adaptive and open ended. GP and other evolutionary algorithms typically realize this central description of evolution, albeit in somewhat abstract forms. GP is a set of algorithms that mimic survival of the fittest, genetic inheritance and variation, and that iterate over a “parent” population, selectively “breeding” its members and replacing them with offspring. Though in general evolution does not have a problem solving goal, GP is nonetheless used to solve problems arising in diverse domains ranging from engineering to art. This is accomplished by casting the organism in the population as a candidate program-like solution to the chosen problem. The organism is represented as a computationally executable expression (aka structure), which is considered its genome. When the expression is executed on some supplied set of inputs, it generates an output (and possibly some intermediate results). This execution behavior is akin to the natural phenotype. By comparing the expression’s output to target outputs, a measure of the solution’s quality is obtained. This is used as the “fitness” of an expression.
The fact that the candidate solutions are computationally executable structures (expressions), not binary or continuous coded values which are elements of a solution, is what distinguishes GP from other evolutionary algorithms (O’Reilly and Angeline, 1997). GP expressions include LISP functions (Koza, 1992; Wu and Banzhaf, 1998), stack or register based programs (Kantschik and Banzhaf, 2002; Spector and Robinson, 2002a), graphs (Miller and Harding, 2008; Mattiussi and Floreano, 2007; Poli, 1997), programs derived from grammars (Gruau, 1993; Whigham, 1995; O’Neill and Ryan, 2003), and generative representations which evolve the grammar itself (Hemberg, 2001; Hornby and Pollack, 2002; O’Reilly and Hemberg, 2007). Key steps in applying GP to a specific problem collectively define its search space: the problem’s candidate solutions are designed by choosing a representation; variation operators (mutation and crossover) are selected (or specialized); and a fitness function (objectives and constraints) which expresses the relative merits of partial and complete solutions is formulated. For a more detailed overview we refer the reader to the book (Poli et al., 2008), which is available for free online.

² Adapted from (O’Reilly et al., 2009).
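The loop described in this section (represent, evaluate, select, vary, replace) can be sketched in a few dozen lines of Java. This is only an illustration under stated assumptions, not the method of any chapter in this volume: the class name TinyGp, the function set {+, -, *}, the small integer constants, the mutation rate and the 50-node size cap are all invented for the example, and tournament selection is used as a simple stand-in for breeding in proportion to fitness.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Minimal GP sketch for symbolic regression. All names and settings are illustrative.
class TinyGp {
    // Genome: an expression tree. Leaves are 'x' or a constant 'c'; inner nodes are + - *.
    static final class Node {
        final char op; final double value; final Node left, right;
        Node(char op, double value, Node left, Node right) {
            this.op = op; this.value = value; this.left = left; this.right = right;
        }
        // Phenotype: the behavior of the expression on one input.
        double eval(double x) {
            switch (op) {
                case 'x': return x;
                case 'c': return value;
                case '+': return left.eval(x) + right.eval(x);
                case '-': return left.eval(x) - right.eval(x);
                default:  return left.eval(x) * right.eval(x);
            }
        }
        int size() { return left == null ? 1 : 1 + left.size() + right.size(); }
    }

    // Random tree of at most the given depth.
    static Node random(Random rng, int depth) {
        if (depth == 0 || rng.nextInt(3) == 0) {
            return rng.nextBoolean() ? new Node('x', 0, null, null)
                                     : new Node('c', rng.nextInt(5), null, null);
        }
        char op = "+-*".charAt(rng.nextInt(3));
        return new Node(op, 0, random(rng, depth - 1), random(rng, depth - 1));
    }

    // Fitness: sum of squared errors against the target outputs (lower is better).
    static double error(Node e, double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            double d = e.eval(xs[i]) - ys[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns a copy of host in which one randomly chosen subtree is replaced by donor.
    static Node graft(Node host, Node donor, Random rng) {
        if (host.left == null || rng.nextInt(3) == 0) return donor;
        return rng.nextBoolean()
            ? new Node(host.op, 0, graft(host.left, donor, rng), host.right)
            : new Node(host.op, 0, host.left, graft(host.right, donor, rng));
    }

    static Node randomSubtree(Node n, Random rng) {
        while (n.left != null && rng.nextInt(3) != 0) n = rng.nextBoolean() ? n.left : n.right;
        return n;
    }

    // Binary tournament: the fitter of two randomly drawn individuals becomes a parent.
    static Node tournament(List<Node> pop, double[] xs, double[] ys, Random rng) {
        Node a = pop.get(rng.nextInt(pop.size()));
        Node b = pop.get(rng.nextInt(pop.size()));
        return error(a, xs, ys) <= error(b, xs, ys) ? a : b;
    }

    static Node evolve(double[] xs, double[] ys, int popSize, int generations, long seed) {
        Random rng = new Random(seed);
        List<Node> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(random(rng, 3));
        Comparator<Node> byError = Comparator.comparingDouble(e -> error(e, xs, ys));
        for (int g = 0; g < generations; g++) {
            List<Node> next = new ArrayList<>();
            next.add(pop.stream().min(byError).get());            // elitism: keep the best
            while (next.size() < popSize) {
                Node mom = tournament(pop, xs, ys, rng);
                Node dad = tournament(pop, xs, ys, rng);
                Node child = graft(mom, randomSubtree(dad, rng), rng);   // crossover
                if (rng.nextInt(10) == 0) child = graft(child, random(rng, 2), rng); // mutation
                next.add(child.size() <= 50 ? child : mom);       // crude size cap against bloat
            }
            pop = next;
        }
        return pop.stream().min(byError).get();
    }

    public static void main(String[] args) {
        double[] xs = {-2, -1, 0, 1, 2};
        double[] ys = new double[xs.length];
        for (int i = 0; i < xs.length; i++) ys[i] = xs[i] * xs[i] + xs[i];  // target: x*x + x
        Node best = evolve(xs, ys, 200, 40, 1L);
        System.out.println("best error = " + error(best, xs, ys) + ", size = " + best.size());
    }
}
```

The three design decisions named in the text map onto the code directly: Node is the representation, graft and randomSubtree implement the variation operators, and error is the fitness function. Changing any of the three changes the search space.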

4. GP Challenges and Goals

In the early days of GP, the challenge was simply to “make it work” on small problems. As the field of GP research has matured, GP experts have strived to improve GP algorithms so that they can solve challenging real-world problems: using computational resources efficiently, ensuring better quality results, and attaining more reliable convergence. With the maturation of “best practice” approaches, researchers are starting to create whole systems using GP, which presents its own challenges: ease of system integration, end-user friendliness, and user control of the problem (perhaps interactively). This section elaborates on these GP algorithm and system goals and challenges.

GP Algorithm Goals and Challenges

A successful GP algorithm has at least the following attributes.

Efficient Use of Computational Resources includes shorter runtime, reduced usage of processor(s), and reduced memory and disk usage, for a given result. Achieving efficient use of computer resources has traditionally been a major issue for GP. A key reason is that GP search spaces are astronomically large, multi-modal, and epistatic (e.g., variable interactions), and exhibit poor locality³ and other nonlinearities. To handle such challenging search spaces, significant exploration is needed (e.g. large population sizes). This entails intensive processing and memory needs. Exacerbating the problem, fitness evaluations (objectives and constraints) of real-world problems tend to be expensive. Finally, because GP expressions have variable length, there is a tendency for them to “bloat”, that is, to grow rapidly without a corresponding increase in performance (cf. Poli’s Chapter 5 in this book). Bloat can be a significant drain on available memory and CPU resources.

Ensuring Quality Results. The key question is: “can a GP result be used in the target application?” This may be more difficult to attain than is evident at first glance, because the result may need to be human-interpretable, trustworthy, or predictive on dramatically different inputs; attaining such qualities can be challenging. Ensuring quality results has always been perceived as an issue, but the goal is becoming more prominent as GP is being applied to more real-world problems. Practitioners, not GP, are responsible for deploying a GP result in their application domain. This means that the practitioner (and potentially their client) must trust the result sufficiently to be comfortable using it. Human-interpretability (readability) of the result is a key factor in trust. This can be an issue when deployment of the result is expensive or risky; when customers’ understanding of the solution is crucial; when the result must be inspected or approved; or when acceptance of GP methodology must be gained.

Reliable convergence means that the GP run can be trusted to return reasonable, useful results, without the practitioner having to worry about premature convergence or whether algorithm parameters like population size were set correctly. GP can fail to capably identify sub-solutions or partially correct solutions and successfully promote, combine and reuse them to generate good solutions with effective structure. The default approach has been to use the largest population size possible, subject to time and resource constraints. This invariably implies high resource usage, and still gives no guarantee of hitting useful results even if such results exist. Alternative approaches that increase the number of iterations with smaller population sizes still lack robust scenarios for computing resource allocation.

³ Poor locality means that a small change in the individual’s genotype often leads to a large change in fitness, introducing additional difficulty into the search effort. For example, the GP “crossover” operation of swapping the subtrees of two parents might change the comparison of two elements from a “less than” relationship to an “equal to” relationship. This usually gives dramatically different behavior and fitness.

Goals for GP Incorporated in Larger Systems

These are necessary attributes of GP for successful “GP systems,” i.e., systems in which GP plays a key role⁴. A successful GP system must no doubt have many other attributes particular to the context in which it is deployed, but each of the following factors certainly has a high impact on the system’s success.

Ease of system integration is how easy the GP algorithm is to deploy as part of the entire system, by the person or team building the system. Even if a GP algorithm does well on the algorithm challenges, it may be hard for system integrators (or other researchers) to deploy because of high complexity or many parameters. Simple algorithms with few parameters are worth striving for; and if this is not possible, then readily available software with simple application programming interfaces and good documentation is a reasonable solution.

End-user friendliness is the end-user’s perspective of how easy the system is to use when solving the problem at hand, when GP is only a subcomponent of the overall system. The user wants to solve a problem economically, with quality results, reliably. The user task should be smooth and efficient, not tedious and time consuming.

User (Interactive) Control of the Problem. The system (and its subsystems) should not be solving a problem any harder than it needs to be, especially when it makes a qualitative difference to resource usage, result quality, or convergence. To meet this goal, users should be able to specify problems to be solved with as much resolution as appropriate. In some cases, this also means interactivity with results so far, to further guide exploration according to the user’s needs, intuitions or subjective tastes. It specifically does not mean user-level control of the GP algorithm itself: the end-user should not have to be a GP expert to use GP to solve a problem, just as GP experts do not have to be experts on electronics in order to use computers.

For book-length texts on applying GP to industrial problems, we refer the reader to recent books on the subject by GPTP participants themselves: (Kordon, 2009), (Iba et al., 2010), and (McConaghy et al., 2009).

⁴ GP may not even be the centerpiece of the system; that’s fine!

5. GP Best Practices

First, we describe general best practices that GP practitioners use to achieve GP algorithm goals. Then, we review advances made at GPTP 2010 toward attaining those GP algorithm goals, followed by a review of GPTP 2010 work that addresses GP system goals.

In general, GP computational resource use has been made more efficient by improved algorithm design and by improved design of representations and operators in specific domains. The impact of GP’s high demand for computational resources has been lessened by Moore’s Law and the increasing availability of parallel computational approaches, which mean that computational resources become exponentially cheaper over time. Results quality has improved for the same reasons. It is also due to a new emphasis by GP practitioners on getting interpretable or trustworthy results. Reliability has been enhanced via algorithm techniques that support continuous evolutionary improvement in a systematic or structured fashion, so that the practitioner no longer has to “hope” that the algorithm isn’t stuck. Implicit or explicit diversity maintenance also helps. Finally, thoughtful design of expression representation and genetic operators, for general and specific problem domains, has led to GP systems achieving human-competitive performance. Techniques along these lines include evolvability, self-adaptiveness, modularity and bloat control.

At GPTP 2010, the following papers demonstrated advances in GP algorithm aims (efficient computational resource usage, results quality, or reliable convergence):
• Poli (Chapter 5) draws on recently developed theory to construct a very simple technique that manages bloat.
• Harding et al. (Chapter 6) and Spector (Chapter 2) illuminate the state of the art in using self-modifying individuals to achieve highly scalable GP.
• Pattin, Moore et al. (Chapter 12) also uses self-adaptation and demonstrates how to incorporate expert knowledge in novel ways, for highly scalable GP.
• Lichodzijewski and Heywood (Chapter 3) and Soule et al. (Chapter 4) make further advances in GP scalability through evolution of teams.
• Orlov and Sipper (Chapter 1) is an excellent example of best-practice operator design to maintain evolvability in a highly constrained space.
• Smits et al. (Chapter 9) points towards evolution in the “compute cloud,” by exploring massively parallel evolution.
• Iba and Aranha (Chapter 13) exploits the structure of the resource-allocation problem in operator and algorithm design to improve GP scalability and results quality.
• Bergen and Ross (Chapter 14) explores how to handle problems with 2 objectives yet maintain convergence.
• Korns (Chapter 7) and McConaghy (Chapter 10) aggressively transform and simplify their respective problems for GP as much as possible, to greatly reduce GP resource needs.

At GPTP 2010, the following papers demonstrated advances in GP system goals (system-integrator usability, user-level usability, or user control of the problem and interactivity).

For system integrator usability:
• Schmidt and Lipson (Chapter 8) shows an approach that achieves the reliable convergence of the popular ALPS algorithm (Hornby, 2006), but with a simpler algorithm having fewer parameters.
• Harding et al. (Chapter 6) and Spector (Chapter 2) are also examples of relatively simple algorithms, algorithms that have been simplified over the years as their designers gained experience with them.
• In his keynote address, Jürgen Schmidhuber described the achievement of best-in-class results using simple backpropagation neural networks but with modern computational resources.

For user-level usability:
• Castillo et al. (Chapter 11) prescribes a flow for industrial modelers in which GP is used as part of their overall manual flow for developing trustworthy industrial models.
• In the special demos session, many researchers presented highly usable GP systems, including Kotanchek’s DataModeler (symbolic regression and data analysis package for Mathematica), Schmidt and Lipson’s Eureqa (symbolic regression), Bergen and Ross’s Jnetic Textures (art), and Iba and Aranha’s CACIE (music).

For user control of the problem / interactivity:
• Korns (Chapter 7) describes an SQL-style language to specify symbolic regression problems, so that function search only changes subsections of the overall expression.
• Bergen and Ross (Chapter 14) and Iba and Aranha (Chapter 13) describe systems that emphasize usability in interactive design of art and music, respectively.


What is equally significant in these papers is that which is not mentioned or barely mentioned: GP algorithm goals that have already been solved sufficiently for particular problem domains, allowing researchers to focus their work on the more challenging issues. For example, there are several papers that do some form of symbolic regression (SR), which historically has had major issues with interpretability or bloat. Yet in these pages, the SR papers barely discuss interpretability or bloat, because best practices avoid the issue in one or more ways, most notably: Pareto optimization using an extra objective of minimizing complexity; templated functional forms like McConaghy’s CAFFEINE or Korns’ abstract expressions; or simply using the GP system to generate promising subexpressions in a manual modeling flow. Other off-the-shelf techniques that solve specific problems well have been around for years and are being increasingly adopted by the GPTP community. These include grammars to restrict program evolution (Whigham, 1995; O’Neill and Ryan, 2003), competent algorithms to handle multiple objectives and/or constraints, e.g. (Deb et al., 2002), and meta-algorithms providing diversity and continuous improvement like ALPS (Hornby, 2006). Finally, significant compute resources are available to most: in an informal survey at the workshop, we found that most groups use a compute cluster, and two groups are already using “the compute cloud.”
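The first of these practices, treating complexity as an extra objective, rests on Pareto dominance: a model is kept only if no other model is at least as good on both error and complexity and strictly better on one. The sketch below is a minimal illustration of that filter, assuming a hypothetical Model holder with made-up error and complexity values; it is not the implementation used by any chapter here, and production multi-objective algorithms such as the one in (Deb et al., 2002) add ranking and diversity mechanisms on top of this test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a two-objective Pareto filter (minimize error, minimize complexity).
class ParetoSketch {
    // Hypothetical holder for one evolved model; the fields are illustrative.
    static final class Model {
        final String name; final double error; final int complexity;
        Model(String name, double error, int complexity) {
            this.name = name; this.error = error; this.complexity = complexity;
        }
    }

    // a dominates b if a is no worse on both objectives and strictly better on at least one.
    static boolean dominates(Model a, Model b) {
        return a.error <= b.error && a.complexity <= b.complexity
            && (a.error < b.error || a.complexity < b.complexity);
    }

    // Returns the nondominated models, in their original order.
    static List<Model> front(List<Model> models) {
        List<Model> out = new ArrayList<>();
        for (Model m : models) {
            boolean dominated = false;
            for (Model other : models) {
                if (dominates(other, m)) { dominated = true; break; }
            }
            if (!dominated) out.add(m);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Model> models = Arrays.asList(
            new Model("A", 0.10, 25),   // accurate but large
            new Model("B", 0.12, 7),    // nearly as accurate, much smaller
            new Model("C", 0.30, 3),    // tiny but rough
            new Model("D", 0.30, 9));   // dominated by B on both objectives
        for (Model m : front(models)) System.out.println(m.name);
    }
}
```

The front hands the practitioner an explicit accuracy-versus-simplicity trade-off, so a bloated expression survives only if its extra size buys a lower error.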

6. Application Successes Via Best Practices

One of the fascinating aspects of GP research is that GP is so general, i.e. “search through a space of (program or structure) entities,” that it can be used to attack an enormous variety of problems, including many problems that are currently unapproachable by any other technique. This year’s batch of applications is no exception. This section briefly reviews the applications.

One of the long-standing aims of AI, and GP, has been evolution of software in the most general sense possible. GPTP this year was fortunate to have three groups present work directly on this. Orlov and Sipper (Chapter 1) present FINCH, a system to evolve Java bytecode, an evolutionary substrate that has evolvability close to machine code, yet returns interpretable Java code thanks to industry-standard bytecode decompilers. Spector (Chapter 2) presents an autoconstructive version of PUSH, a GP system which evolves stack-based programs. Finally, Harding et al. (Chapter 6) presents a self-modifying Cartesian GP which evolves graphs that can be interpreted as software, circuits, equations, and more.

Two chapters introduce wholly new problems for GP. McConaghy (Chapter 10) introduces the problem of building density models at a distribution’s tails (and dusts off the general problem of symbolic density modeling), for the application of SRAM memory circuit analysis. Lichodzijewski and Heywood (Chapter 3) introduce the problem of solving a Rubik’s cube with GP, taking the perspective of temporal sequence learning.

GP continues to help the artistic types. Bergen and Ross (Chapter 14) describe a sophisticated system for interactive evolutionary art, and Iba and Aranha (Chapter 13) describe an advanced system for interactive evolutionary music. Both systems have already been used extensively by artists and musicians.

In a biology application, Pattin, Moore et al. (Chapter 12) describe the use of GP for disease susceptibility modeling.

GP remains popular in financial applications. Korns (Chapter 7) ups the ante on a set of symbolic regression and classification problems that are representative of financial modeling problems to aid stock-trading decision-making. Iba and Aranha (Chapter 13) describes a system for portfolio allocation.

For the problem of industrial modeling (e.g. of inferential sensors at Dow), Castillo et al. (Chapter 11) focuses on a structured approach to exploit GP results within industrial modelers’ model development flows. Undoubtedly, the symbolic regression approach in Smits et al. (Chapter 9) will find end usage in Dow’s industrial modeling environment as well.

Other approaches used standard problems in (symbolic) classification or regression as their test suites, though the emphasis was not the application. This includes work by Soule et al. (Chapter 4), Poli (Chapter 5), and Schmidt and Lipson (Chapter 8).

7. Themes, Summary and Looking Forward

The toy problems are gone; the GP systems have arrived. No doubt there will continue to be qualitative improvements to GP algorithms and GP systems for years to come. But is there more? We posit there is. Despite these achievements, GP’s computer-based evolution does not demonstrate the potential associated with natural evolution, nor does it always satisfactorily solve important problems we might hope to use it on. Even when using best practice approaches to manage challenges in resources, results, and reliability, the computational load may still be excessive and the final results may still be inadequate. To achieve success in a difficult problem domain takes a great deal of human effort toward thoughtful design of representations and operators. Many questions and challenges remain:
• What does it take to make GP a science? (Is this even a realistic question?) How can work on applications facilitate the continued development of a GP theory?
• What does it take to make GP a technology? (Is this even a realistic question?) What fundamental contributions will allow GP to be adopted into broader use beyond that of expert practitioners? For example, how can GP be scoped so that it becomes another standard, off-the-shelf method in the “toolboxes” of scientists and engineers around the world? Can GP follow in the same vein as linear programming? Can it follow the example of support vector machines and convex optimization methods? One challenge is in formulating the algorithm so that it provides more ease in laying out a problem. Another is determining how, by default and without parameter tuning, GP can efficiently exploit specified resources to return results reliably.
• How do we get 1 million people using GP? 1 billion? (Should they even know they’re using GP?)
• Success with GP often requires extensive human effort in capturing and embedding the domain knowledge. How can this up-front human effort be reduced while still achieving excellent results? Are there additional automatic ways to capture domain knowledge for input to GP systems?
• Scalability is always relative. GP has attacked fairly large problems, but how can GP be improved to solve problems that are 10x, 100x, 1,000,000x harder?
• What opportunities await GP due to new computing architectures and substrates, with potentially vastly richer processing resources? This includes massively multicore processors, GPUs, and cloud computing; but it also includes digital microfluidics, modern programmable logic, and more.
• What opportunities await GP due to massive memory and storage capacity, coupled with giant databases? For example, this has already profoundly affected machine learning applied to speech recognition, not to mention web search. Massive and freely available databases are coming online, especially from biology.
• What “uncrackable” problems await a creative GP approach? The future has many challenges in energy, health care, defence, and more. For many fields, there are lists of “holy grail” problems, unsolved problems, even problems with prize money attached.

These questions and their answers will provide the fodder for future GPTP workshops. We wish you many hours of stimulating reading of this volume’s contributions.

References

Deb, Kalyanmoy, Pratap, Amrit, Agarwal, Sameer, and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197.
Futuyma, Douglas (2009). Evolution, Second Edition. Sinauer Associates Inc.
Gruau, Frederic (1993). Cellular encoding as a graph grammar. IEE Colloquium on Grammatical Inference: Theory, Applications and Alternatives, (Digest No.092):17/1–10.
Hemberg, Martin (2001). GENR8 - A design tool for surface generation. Master’s thesis, Department of Physical Resource Theory, Chalmers University, Sweden.
Hornby, Gregory S. (2006). ALPS: the age-layered population structure for reducing the problem of premature convergence. In Keijzer, Maarten, Cattolico, Mike, Arnold, Dirk, Babovic, Vladan, Blum, Christian, Bosman, Peter, Butz, Martin V., Coello Coello, Carlos, Dasgupta, Dipankar, Ficici, Sevan G., Foster, James, Hernandez-Aguirre, Arturo, Hornby, Greg, Lipson, Hod, McMinn, Phil, Moore, Jason, Raidl, Guenther, Rothlauf, Franz, Ryan, Conor, and Thierens, Dirk, editors, GECCO 2006: Proceedings of the 8th annual conference on Genetic and evolutionary computation, volume 1, pages 815–822, Seattle, Washington, USA. ACM Press.
Hornby, Gregory S. and Pollack, Jordan B. (2002). Creating high-level components with a generative representation for body-brain evolution. Artificial Life, 8(3):223–246.
Iba, Hitoshi, Paul, Topon Kumar, and Hasegawa, Yoshihiko (2010). Applied Genetic Programming and Machine Learning. CRC Press.
Kantschik, Wolfgang and Banzhaf, Wolfgang (2002). Linear-graph GP - A new GP structure. In Foster, James A., Lutton, Evelyne, Miller, Julian, Ryan, Conor, and Tettamanzi, Andrea G. B., editors, Genetic Programming, Proceedings of the 5th European Conference, EuroGP 2002, volume 2278 of LNCS, pages 83–92, Kinsale, Ireland. Springer-Verlag.
Kordon, Arthur (2009). Applying Computational Intelligence: How to Create Value. Springer.
Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.
Mattiussi, Claudio and Floreano, Dario (2007). Analog genetic encoding for the evolution of circuits and networks. IEEE Transactions on Evolutionary Computation, 11(5):596–607.
McConaghy, Trent, Palmers, Pieter, Gao, Peng, Steyaert, Michiel, and Gielen, Georges G.E. (2009). Variation-Aware Analog Structural Synthesis: A Computational Intelligence Approach. Springer.
Miller, Julian Francis and Harding, Simon L. (2008). Cartesian genetic programming. In Ebner, Marc, Cattolico, Mike, van Hemert, Jano, Gustafson, Steven, Merkle, Laurence D., Moore, Frank W., Congdon, Clare Bates, Clack, Christopher D., Moore, Frank W., Rand, William, Ficici, Sevan G., Riolo, Rick, Bacardit, Jaume, Bernado-Mansilla, Ester, Butz, Martin V., Smith, Stephen L., Cagnoni, Stefano, Hauschild, Mark, Pelikan, Martin, and Sastry, Kumara, editors, GECCO-2008 tutorials, pages 2701–2726, Atlanta, GA, USA. ACM.
O’Neill, Michael and Ryan, Conor (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, volume 4 of Genetic programming. Kluwer Academic Publishers.
O’Reilly, Una-May and Angeline, Peter J. (1997). Trends in evolutionary methods for program induction. Evolutionary Computation, 5(2):v–ix.
O’Reilly, Una-May and Hemberg, Martin (2007). Integrating generative growth and evolutionary computation for form exploration. Genetic Programming and Evolvable Machines, 8(2):163–186. Special issue on developmental systems.
O’Reilly, Una-May, McConaghy, Trent, and Riolo, Rick (2009). GPTP 2009: An example of evolvability. In Riolo, Rick L., O’Reilly, Una-May, and McConaghy, Trent, editors, Genetic Programming Theory and Practice VII, Genetic and Evolutionary Computation, chapter 1, pages 1–18. Springer, Ann Arbor.
Poli, Riccardo (1997). Evolution of graph-like programs with parallel distributed genetic programming. In Back, Thomas, editor, Genetic Algorithms: Proceedings of the Seventh International Conference, pages 346–353, Michigan State University, East Lansing, MI, USA. Morgan Kaufmann.
Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).
Spector, Lee and Robinson, Alan (2002). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
Whigham, P. A. (1995). Grammatically-based genetic programming. In Rosca, Justinian P., editor, Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications, pages 33–41, Tahoe City, California, USA.
Wu, Annie S. and Banzhaf, Wolfgang (1998). Introduction to the special issue: Variable-length representation and noncoding segments for evolutionary algorithms. Evolutionary Computation, 6(4):iii–vi.

Chapter 1

FINCH: A SYSTEM FOR EVOLVING JAVA (BYTECODE)

Michael Orlov and Moshe Sipper
Department of Computer Science, Ben-Gurion University, Beer-Sheva 84105, Israel.

Abstract

The established approach in genetic programming (GP) involves the definition of functions and terminals appropriate to the problem at hand, after which evolution of expressions using these definitions takes place. We have recently developed a system, dubbed FINCH (Fertile Darwinian Bytecode Harvester), to evolutionarily improve actual, extant software, which was not intentionally written for the purpose of serving as a GP representation in particular, nor for evolution in general. This is in contrast to existing work that uses restricted subsets of the Java bytecode instruction set as a representation language for individuals in genetic programming. The ability to evolve Java programs will hopefully lead to a valuable new tool in the software engineer’s toolkit.

Keywords:

Java bytecode, automatic programming, software evolution, genetic programming.

1. Introduction

The established approach in genetic programming (GP) involves the definition of functions and terminals appropriate to the problem at hand, after which evolution of expressions using these definitions takes place (Koza, 1992; Poli et al., 2008). Poli et al. recently noted that:

    While it is common to describe GP as evolving programs, GP is not typically used to evolve programs in the familiar Turing-complete languages humans normally use for software development. It is instead more common to evolve programs (or expressions or formulae) in a more constrained and often domain-specific language. (Poli et al., 2008, ch. 3.1; emphasis in original)

The above statement is (arguably) true not only where “traditional” tree-based GP is concerned, but also for other forms of GP, such as linear GP and grammatical evolution (Poli et al., 2008).

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_1, © Springer Science+Business Media, LLC 2011

(a)

    class F {
        int fact(int n) {
            // offsets 0-1
            int ans = 1;
            // offsets 2-3
            if (n > 0)
                // offsets 6-15
                ans = n * fact(n-1);
            // offsets 16-17
            return ans;
        }
    }

(b)

     0  iconst_1
     1  istore_2
     2  iload_1
     3  ifle 16
     6  iload_1
     7  aload_0
     8  iload_1
     9  iconst_1
    10  isub
    11  invokevirtual #2
    14  imul
    15  istore_2
    16  iload_2
    17  ireturn

Figure 1-1. A recursive factorial function in Java (a) and its corresponding bytecode (b). The argument to the virtual method invocation (invokevirtual) references the int F.fact(int) method via the constant pool.
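To make the stack-based execution of the listing in Figure 1-1(b) concrete, the following toy interpreter steps through exactly those fourteen instructions. It is a sketch for illustration only, not part of FINCH and not how a real JVM is built: the class name TinyStackVm is hypothetical, the `this` reference is modeled as the placeholder integer 0, and `invokevirtual #2` is resolved directly to a recursive call instead of going through a constant pool.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy interpreter for the bytecode of F.fact(int) shown in Figure 1-1(b).
// Only the opcodes that appear in that listing are modeled.
class TinyStackVm {
    // One "offset mnemonic [operand]" entry per instruction, copied from the figure.
    static final String[] FACT = {
        "0 iconst_1", "1 istore_2", "2 iload_1", "3 ifle 16", "6 iload_1",
        "7 aload_0", "8 iload_1", "9 iconst_1", "10 isub", "11 invokevirtual #2",
        "14 imul", "15 istore_2", "16 iload_2", "17 ireturn"
    };

    // Maps a bytecode offset (a branch target) to its index in FACT.
    static int indexOf(String offset) {
        for (int k = 0; k < FACT.length; k++) {
            if (FACT[k].startsWith(offset + " ")) return k;
        }
        throw new IllegalArgumentException("no instruction at offset " + offset);
    }

    static int fact(int n) {
        int[] locals = new int[3];            // slot 0: this (unused), slot 1: n, slot 2: ans
        locals[1] = n;
        Deque<Integer> stack = new ArrayDeque<>();   // the operand stack
        int i = 0;                            // index of the current instruction
        while (true) {
            String[] ins = FACT[i].split(" ");
            int next = i + 1;
            switch (ins[1]) {
                case "iconst_1": stack.push(1); break;
                case "istore_2": locals[2] = stack.pop(); break;
                case "iload_1":  stack.push(locals[1]); break;
                case "iload_2":  stack.push(locals[2]); break;
                case "aload_0":  stack.push(0); break;        // placeholder for 'this'
                case "ifle":     if (stack.pop() <= 0) next = indexOf(ins[2]); break;
                case "isub": { int b = stack.pop(), a = stack.pop(); stack.push(a - b); break; }
                case "imul": { int b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case "invokevirtual": {
                    int arg = stack.pop();    // the int argument, n - 1
                    stack.pop();              // the receiver pushed by aload_0
                    stack.push(fact(arg));    // assumption: #2 refers to F.fact(int)
                    break;
                }
                case "ireturn":  return stack.pop();
                default: throw new IllegalStateException("unsupported opcode " + ins[1]);
            }
            i = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(fact(5));
    }
}
```

Tracing fact(1) by hand shows why the order of pushes matters: n is pushed at offset 6 and stays on the stack beneath the receiver and the argument, so that after the recursive call returns, imul finds n and fact(n-1) as its two operands.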

We have recently developed a system, dubbed FINCH (Fertile Darwinian Bytecode Harvester), to evolutionarily improve actual, extant software, which was not intentionally written for the purpose of serving as a GP representation in particular, nor for evolution in general. The only requirement is that the software source code either be written in Java or be compilable to Java bytecode. The following chapter provides an overview of our system, ending with a précis of results. Additional information can be found in (Orlov and Sipper, 2009; Orlov and Sipper, 2010).

Java compilers typically do not produce machine code directly, but instead compile source-code files to platform-independent bytecode, to be interpreted in software or, rarely, to be executed in hardware by a Java Virtual Machine (JVM) (Lindholm and Yellin, 1999). The JVM is free to apply its own optimization techniques, such as Just-in-Time (JIT) on-demand compilation to native machine code, a process that is transparent to the user. The JVM implements a stack-based architecture with high-level language features such as object management and garbage collection, virtual function calls, and strong typing. The bytecode language itself is a well-designed assembly-like language with a limited yet powerful instruction set (Engel, 1999; Lindholm and Yellin, 1999). Figure 1-1 shows a recursive Java program for computing the factorial of a number, and its corresponding bytecode.

The JVM architecture is successful enough that several programming languages compile directly to Java bytecode (e.g., Scala, Groovy, Jython, Kawa, JavaFX Script, and Clojure). Moreover, Java decompilers are available, which facilitate restoration of the Java source code from compiled bytecode. Since the design of the JVM is closely tied to the design of the Java programming language, such decompilation often produces code that is very similar to the original source code (Miecznikowski and Hendren, 2002).

We chose to automatically improve extant Java programs by evolving the respective compiled bytecode versions. This allows us to leverage the power of a well-defined, cross-platform, intermediate machine language at just the right level of abstraction: We do not need to define a special evolutionary language, which would necessitate an elaborate two-way transformation between Java and our language; nor do we evolve at the Java level, with its encumbering syntactic constraints, which render the genetic operators of crossover and mutation arduous to implement.

Note that we do not wish to invent a language to improve upon some aspect or other of GP (efficiency, terseness, readability, etc.), as has been amply done. Nor do we wish to extend standard GP to become Turing complete, an issue which has also been addressed (Woodward, 2003). Rather, conversely, our point of departure is an extant, highly popular, general-purpose language, with our aim being to render it evolvable. The ability to evolve Java programs will hopefully lead to a valuable new tool in the software engineer’s toolkit.

The motivation behind evolving Java bytecode is detailed in Section 2. The principles of bytecode evolution are described in Section 3. Section 4 describes compatible bytecode crossover, the main evolutionary operator driving the FINCH system. Alternative ways of evolving software are considered in Section 5. Program halting and compiler optimization issues are dealt with in Sections 6 and 7. Current experimental results are summarized in Section 8, and the concluding remarks are in Section 9.

2. Why Target Bytecode for Evolution?

Bytecode is the intermediate, platform-independent representation of Java programs, created by a Java compiler. Figure 1-2 depicts the process by which Java source code is compiled to bytecode and subsequently loaded by the JVM, which verifies it and (if the bytecode passes verification) decides whether to interpret the bytecode directly, or to compile and optimize it—thereupon executing the resultant native code. The decision regarding interpretation or further compilation (and optimization) depends upon the frequency at which a particular method is executed, its size, and other parameters. Our decision to evolve bytecode instead of the more high-level Java source code is guided in part by the desire to avoid altogether the possibility of producing non-compilable source code. The purpose of source code is to be easy for human programmers to create and to modify, a purpose which conflicts with the ability to automatically modify such code. We note in passing that we do not seek an evolvable programming language—a problem tackled, e.g., by

[Figure 1-2 diagram: Java source (class F with method fact) is translated by the Java compiler into platform-independent bytecode; the JVM loads, verifies, and interprets the bytecode, and its just-in-time compiler produces platform-dependent native code (IA32, AMD64, SPARC).]

Figure 1-2. Java source code is first compiled to platform-independent bytecode by a Java compiler. The JVM only loads the bytecode, which it verifies for correctness, and raises an exception in case the verification fails. After that, the JVM typically interprets the bytecode until it detects that it would be advantageous to compile it, with optimizations, to native, platform-dependent code. The native code is then executed by the CPU as any other program. Note that no optimization is performed when Java source code is compiled to bytecode. Optimization only takes place during compilation from bytecode to native code.

(Spector and Robinson, 2002)—but rather aim to handle the Java programming language in particular. Evolving bytecode instead of source code alleviates the issue of producing non-compilable programs to some extent—but not completely. Java bytecode must be correct with respect to dealing with stack and local variables (cf. Figure 1-3). Values that are read and written should be type-compatible, and stack underflow must not occur. The JVM performs bytecode verification and raises an exception in case of any such incompatibility. We wish not merely to evolve bytecode, but indeed to evolve correct bytecode. This task is hard, because our purpose is to evolve given, unrestricted code, and not simply to leverage the capabilities of the JVM to perform GP. Therefore, basic evolutionary operations, such as bytecode crossover and mutation, should produce correct individuals.
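To make the stack and local-variable discipline concrete, the factorial method that serves as the running example (Figures 1-1 and 1-4) can be written out with its bytecode in comments. The instructions and offsets are those quoted in this chapter; the stack column is an illustrative reading of the instruction semantics rather than compiler output, and a particular compiler may emit slightly different code.

```java
// Running example: recursive factorial, annotated with the bytecode listed in
// Figure 1-4. The right-hand column shows the operand stack after each
// instruction (an illustrative reading, not actual compiler output).
class F {
    int fact(int n) {
        int ans = 1;               //  0: iconst_1       [1]
                                   //  1: istore_2       []             ans is local 2
        if (n > 0)                 //  2: iload_1        [n]            n is local 1
                                   //  3: ifle 16        []             jump to 16 when n <= 0
            ans = n * fact(n - 1); //  6: iload_1        [n]
                                   //  7: aload_0        [n, this]      this is local 0
                                   //  8: iload_1        [n, this, n]
                                   //  9: iconst_1       [n, this, n, 1]
                                   // 10: isub           [n, this, n-1]
                                   // 11: invokevirtual  [n, fact(n-1)]
                                   // 14: imul           [n * fact(n-1)]
                                   // 15: istore_2       []
        return ans;                // 16: iload_2        [ans]
                                   // 17: ireturn
    }

    public static void main(String[] args) {
        System.out.println(new F().fact(5));
    }
}
```

Every instruction finds operands of the expected type and depth on the stack, and local 2 is written at offset 1 before it is read at offset 16 on either path; these are the properties that verification checks.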

3. Bytecode Evolution Principles

We define a good crossover of two parents as one where the offspring is a correct bytecode program, meaning one that passes verification with no errors; conversely, a bad crossover of two parents is one where the offspring is an incorrect bytecode program, meaning one whose verification produces errors. While it is easy to define a trivial slice-and-swap crossover operator on two programs, it is far more arduous to define a good crossover operator. This latter is necessary in order to preserve variability during the evolutionary process, because incorrect programs cannot be run, and therefore cannot be ascribed a

[Figure 1-3 diagram: three fact method call frames (the top one active), the program counter (holds the offset of the currently executing instruction in the method code area; shown as 11), and the heap (shared objects store, holding the "F" object). The active frame shows an operand stack (references objects on the heap; used to provide arguments to JVM instructions, such as arithmetic operations and method calls) holding int 5, the "F" reference, and int 4 at the stack top, and a local variables array (references objects on the heap; contains method arguments and locally defined variables) whose slots 0, 1 and 2 hold "F" (this), int 5 and int 1.]

Figure 1-3. Call frames in the architecture of the Java Virtual Machine, during execution of the recursive factorial function code shown in Figure 1-1, with parameter n = 7. The top call frame is in a state preceding execution of invokevirtual. This instruction will pop a parameter and an object reference from the operand stack, invoke the method fact of class F, and open a new frame for the fact(4) call. When that frame closes, the returned value will be pushed onto the operand stack.

fitness value—or, alternatively, must be assigned the worst possible value. Too many bad crossovers will hence produce a population with little variability. Note that we use the term good crossover to refer to an operator that produces a viable offspring (i.e., one that passes the JVM verification) given two parents; compatible crossover, defined below, is one mechanism by which good crossover can be implemented. The Java Virtual Machine is a stack-based architecture for executing Java bytecode. The JVM holds a stack for each execution thread, and creates a frame on this stack for each method invocation. The frame contains a code array, an operand stack, a local variables array, and a reference to the constant pool of the current class (Engel, 1999). The code array contains the bytecode to be executed by the JVM. The local variables array holds all method (or function) parameters, including a reference to the class instance in which the current method executes. In addition, the variables array also holds local-scope variables. The operand stack is used by stack-based instructions, and for arguments when calling other methods. A method call moves parameters from the caller’s operand stack to the callee’s variables array; a return moves the top value from the callee’s stack to the caller’s stack, and disposes of the callee’s frame. Both the operand stack and the variables array contain typed items, and instructions always act on a specific type. The relevant bytecode instructions are prefixed accordingly: ‘a’ for an object or array reference, ‘i’ and ‘l’ for integral types int and long, and


‘f’ and ‘d’ for floating-point types float and double.1 Finally, the constant pool is an array of references to classes, methods, fields, and other unvarying entities. The JVM architecture is illustrated in Figure 1-3. In our evolutionary setup, the individuals are bytecode sequences annotated with all the necessary stack and variables information. This information is gathered in one pass over the bytecode, using the ASM bytecode manipulation and analysis library (Bruneton et al., 2002). Afterwards, similar information for any sequential code segment in the individual can be aggregated separately. This preprocessing step allows us to define compatible two-point crossover on bytecode sequences (Orlov and Sipper, 2009). Code segments can be replaced only by other segments that use the operand stack and the local variables array in a depth-compatible and type-compatible manner. The compatible crossover operator thus maximizes the viability potential for offspring, preventing type incompatibility and stack underflow errors that would otherwise plague indiscriminate bytecode crossover. Note that the crossover operation is unidirectional, or asymmetric: the code segment compatibility criterion as described here is not a symmetric relation. An ability to replace segment α in individual A with segment β in individual B does not imply an ability to replace segment β in B with segment α. As an example of compatible crossover, consider two identical programs with the same bytecode as in Figure 1-1, which are reproduced as parents A and B in Figure 1-4. We replace bytecode instructions at offsets 7–11 in parent A with the single iload_2 instruction at offset 16 from parent B. Offsets 7–11 correspond to the fact(n-1) call that leaves an integer value on the stack, whereas offset 16 corresponds to pushing the local variable ans on the stack.
This crossover, the result of which is shown as offspring x in Figure 1-4, is good, because the operand stack is used in a compatible manner by the source segment, and although this segment reads the variable ans that is not read in the destination segment, that variable is guaranteed to have been written previously, at offset 1. Alternatively, consider replacing the imul instruction in the newly formed offspring x with the single invokevirtual instruction from parent B. This crossover is bad, as illustrated by incorrect offspring y in Figure 1-4. Although both invokevirtual and imul pop two values from the stack and then push one value, invokevirtual expects one of them to be a reference of type F (the receiver object, which lies beneath the int argument), whereas imul expects two integers. Another negative example is an attempt to replace bytecode offsets 0–1 in parent B (that correspond to the int ans=1 statement) with an empty segment. In this case, illustrated by incorrect offspring z in Figure 1-4, variable ans is no longer guaranteed to be initialized

1 The types boolean, byte, char and short are treated as the computational type int by the Java Virtual Machine, except for array accesses and explicit conversions (Lindholm and Yellin, 1999).

[Figure 1-4 listings]
Parent A and Parent B: iconst_1 istore_2 iload_1 ifle iload_1 aload_0 iload_1 iconst_1 isub invokevirtual imul istore_2 iload_2 ireturn
(correct) Offspring x: iconst_1 istore_2 iload_1 ifle iload_1 iload_2 imul istore_2 iload_2 ireturn
(incorrect) Offspring y: iconst_1 istore_2 iload_1 ifle iload_1 iload_2 invokevirtual istore_2 iload_2 ireturn
(incorrect) Offspring z: iload_1 ifle iload_1 aload_0 iload_1 iconst_1 isub invokevirtual imul istore_2 iload_2 ireturn

Figure 1-4. An example of good and bad crossovers. The two identical individuals A and B represent a recursive factorial function (see Figure 1-1; here we use an arrow instead of branch offset). In parent A, the bytecode sequence that corresponds to the fact(n-1) call, which leaves an integer value on the stack, is replaced with the single instruction in B that corresponds to pushing the local variable ans on the stack. The resulting correct offspring x and the original parent B are then considered as two new parents. We see that either replacing the first two instructions in B with an empty section, or replacing the imul instruction in x with the invokevirtual instruction from B, results in incorrect bytecode, shown as offspring z and y, respectively (see main text for full explanation).

when it is read immediately prior to the function’s return, and the resulting bytecode is therefore incorrect. A mutation operator employs the same constraints as compatible crossover, but the constraints are applied to variations of the same individual. The requirements for correct bytecode mutation are thus derived from those of compatible crossover. To date, we have not used this type of mutation, as it proved unnecessary, and have instead implemented a restricted form of constants-only point mutation, where each constant in a new individual is modified with a given probability.
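The three offspring of Figure 1-4 can be checked mechanically with a toy abstract interpreter. The sketch below is not the JVM verifier and not part of FINCH: the class ToyVerifier, its instruction encoding (mnemonics as strings, branch targets as list indices) and its two-letter type system ('I' for int, 'A' for a reference to F) are assumptions made for illustration, and only forward branches are handled, so every path can simply be followed recursively.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy abstract interpreter for the few instructions used in Figure 1-4.
// Types: 'I' = int, 'A' = reference to F. Branch targets are list indices
// (an assumption of this sketch); only forward branches are supported.
class ToyVerifier {
    static final List<String> PARENT = Arrays.asList("iconst_1", "istore_2", "iload_1",
        "ifle 12", "iload_1", "aload_0", "iload_1", "iconst_1", "isub", "invokevirtual",
        "imul", "istore_2", "iload_2", "ireturn");
    static final List<String> X = Arrays.asList("iconst_1", "istore_2", "iload_1",
        "ifle 8", "iload_1", "iload_2", "imul", "istore_2", "iload_2", "ireturn");
    static final List<String> Y = Arrays.asList("iconst_1", "istore_2", "iload_1",
        "ifle 8", "iload_1", "iload_2", "invokevirtual", "istore_2", "iload_2", "ireturn");
    static final List<String> Z = Arrays.asList("iload_1", "ifle 10", "iload_1",
        "aload_0", "iload_1", "iconst_1", "isub", "invokevirtual", "imul", "istore_2",
        "iload_2", "ireturn");

    /** Returns null when every path verifies, otherwise a short error message. */
    static String verify(List<String> code) {
        Map<Integer, Character> locals = new HashMap<>();
        locals.put(0, 'A'); // local 0: this
        locals.put(1, 'I'); // local 1: argument n
        return run(code, 0, new ArrayDeque<>(), locals);
    }

    private static String run(List<String> code, int pc, Deque<Character> stack,
                              Map<Integer, Character> locals) {
        for (; pc < code.size(); pc++) {
            String[] ins = code.get(pc).split(" ");
            String err = null;
            switch (ins[0]) {
                case "iconst_1": stack.push('I'); break;
                case "iload_1":  err = load(stack, locals, 1, 'I'); break;
                case "iload_2":  err = load(stack, locals, 2, 'I'); break;
                case "aload_0":  err = load(stack, locals, 0, 'A'); break;
                case "istore_2":
                    err = pop(stack, 'I');
                    if (err == null) locals.put(2, 'I');
                    break;
                case "isub": case "imul":
                    err = pop(stack, 'I');
                    if (err == null) err = pop(stack, 'I');
                    stack.push('I');
                    break;
                case "invokevirtual": // fact(int): argument on top, receiver beneath it
                    err = pop(stack, 'I');
                    if (err == null) err = pop(stack, 'A');
                    stack.push('I');
                    break;
                case "ifle": // follow the taken branch on copies of the current state
                    err = pop(stack, 'I');
                    if (err == null)
                        err = run(code, Integer.parseInt(ins[1]),
                                  new ArrayDeque<>(stack), new HashMap<>(locals));
                    break;
                case "ireturn": return pop(stack, 'I');
                default: return "unknown instruction " + ins[0];
            }
            if (err != null) return err;
        }
        return "execution falls off the end of the code";
    }

    private static String pop(Deque<Character> stack, char want) {
        if (stack.isEmpty()) return "stack underflow";
        char got = stack.pop();
        return got == want ? null : "expected " + want + " on the stack but found " + got;
    }

    private static String load(Deque<Character> stack, Map<Integer, Character> locals,
                               int var, char want) {
        Character got = locals.get(var);
        if (got == null) return "local " + var + " may be uninitialized";
        if (got != want) return "local " + var + " has type " + got + ", expected " + want;
        stack.push(want);
        return null;
    }

    public static void main(String[] args) {
        System.out.println("x: " + verify(X));
        System.out.println("y: " + verify(Y));
        System.out.println("z: " + verify(Z));
    }
}
```

Under these assumptions, x is accepted, y is rejected at invokevirtual because an int sits where the receiver reference is expected, and z is rejected on the branch that reaches iload_2 without any prior write to local 2.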

4. Compatible Bytecode Crossover

As discussed above, compatible bytecode crossover is a fundamental building block for effective evolution of correct bytecode. In order to describe the formal requirements for compatible crossover, we need to define the meaning of variable accesses for a segment of code. That is, a section of code (that is not necessarily linear, since there are branching instructions) can be viewed as reading and writing some local variables, or as an aggregation of reads and writes by individual bytecode instructions. However, when a variable is written before being read, the write “shadows” the read, in the sense that the code executing prior to the given section does not have to provide a value of the correct type in the variable.


Variables Access Sets. We define variables access sets, to be used ahead by the compatible crossover operator, as follows: Let a and b be two locations in the same bytecode sequence. For a set of instructions $\delta_{a,b}$ that could potentially be executed starting at a and ending at b, we define the following access sets.
$\delta^{r}_{a,b}$: set of local variables such that for each variable v, there exists a potential execution path (i.e., one not necessarily taken) between a and b, in which v is read before any write to it.
$\delta^{w}_{a,b}$: set of local variables that are written to through at least one potential execution path.
$\delta^{w!}_{a,b}$: set of local variables that are guaranteed to be written to, no matter which execution path is taken.

These sets of local variables are incrementally computed by analyzing the data flow between locations a and b. For a single instruction c, the three access sets for $\delta_{c}$ are given by the Java bytecode definition. Consider a set of (normally non-consecutive) instructions $\{b_i\}$ that branch to instruction c or have c as their immediate subsequent instruction. The variables accessed between a and c are computed as follows:
$\delta^{r}_{a,c}$ is the union of all reads $\delta^{r}_{a,b_i}$, with the addition of variables read by instruction c, unless these variables are guaranteed to be written before c. Formally, $\delta^{r}_{a,c} = \bigl(\delta^{r}_{c} \setminus \bigcap_i \delta^{w!}_{a,b_i}\bigr) \cup \bigcup_i \delta^{r}_{a,b_i}$.
$\delta^{w}_{a,c}$ is the union of all writes $\delta^{w}_{a,b_i}$, with the addition of variables written by instruction c: $\delta^{w}_{a,c} = \bigcup_i \delta^{w}_{a,b_i} \cup \delta^{w}_{c}$.
$\delta^{w!}_{a,c}$ is the set of variables guaranteed to be written before c, with the addition of variables written by instruction c: $\delta^{w!}_{a,c} = \bigl(\bigcap_i \delta^{w!}_{a,b_i}\bigr) \cup \delta^{w!}_{c}$ (note that $\delta^{w!}_{c} = \delta^{w}_{c}$). When $\delta^{w!}_{a,c}$ has already been computed, its previous value needs to be a part of the intersection as well.

We therefore traverse the data-flow graph starting at a, and updating the variables access sets as above, until they stabilize—i.e., stop changing.2 During the traversal, necessary stack depths are also updated. The requirements for compatible bytecode crossover can now be specified.
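The traversal can be sketched as a small fixed-point computation. This is a minimal illustration, not FINCH's implementation: the class AccessSets, its compute method and the array-based encoding of instructions are hypothetical, each instruction is assumed to read and write at most one local variable, and all branches are assumed to stay within the segment. Stack depths, which the real traversal also tracks, are left out.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the access-set computation for a segment of instructions a..b.
// Assumptions: every instruction reads and/or writes at most one local
// variable (-1 means none), and all branches stay inside a..b.
class AccessSets {
    Set<Integer> r = new TreeSet<>();     // read before any write on some path
    Set<Integer> w = new TreeSet<>();     // written on at least one path
    Set<Integer> wAll = new TreeSet<>();  // written on every path (w! in the text)

    static AccessSets compute(int[] reads, int[] writes, int[][] succ, int a, int b) {
        AccessSets[] at = new AccessSets[reads.length]; // null = not reached yet
        boolean changed = true;
        while (changed) {                               // iterate until the sets stabilize
            changed = false;
            for (int c = a; c <= b; c++) {
                AccessSets next = new AccessSets();
                // Variables guaranteed written before c; null until a predecessor is seen.
                Set<Integer> before = (c == a) ? new TreeSet<Integer>() : null;
                for (int p = a; p <= b; p++) {
                    if (at[p] == null || !leadsTo(succ[p], c)) continue;
                    next.r.addAll(at[p].r);
                    next.w.addAll(at[p].w);
                    if (before == null) before = new TreeSet<>(at[p].wAll);
                    else before.retainAll(at[p].wAll);
                }
                if (before == null) continue;           // c has not been reached yet
                if (reads[c] >= 0 && !before.contains(reads[c])) next.r.add(reads[c]);
                next.wAll.addAll(before);
                if (writes[c] >= 0) { next.w.add(writes[c]); next.wAll.add(writes[c]); }
                // A previously computed w! takes part in the intersection as well.
                if (at[c] != null) next.wAll.retainAll(at[c].wAll);
                if (at[c] == null || !same(at[c], next)) { at[c] = next; changed = true; }
            }
        }
        return at[b];
    }

    private static boolean leadsTo(int[] successors, int c) {
        for (int s : successors) if (s == c) return true;
        return false;
    }

    private static boolean same(AccessSets x, AccessSets y) {
        return x.r.equals(y.r) && x.w.equals(y.w) && x.wAll.equals(y.wAll);
    }

    // The 14 instructions of the factorial method in Figure 1-4, by list index.
    static AccessSets factorial(int a) {
        int[] reads  = {-1, -1, 1, -1, 1, 0, 1, -1, -1, -1, -1, -1, 2, -1};
        int[] writes = {-1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, -1, -1};
        int[][] succ = new int[14][];
        for (int i = 0; i < 14; i++) succ[i] = (i < 13) ? new int[] {i + 1} : new int[0];
        succ[3] = new int[] {4, 12};   // ifle: fall through, or branch to iload_2
        return compute(reads, writes, succ, a, 13);
    }

    public static void main(String[] args) {
        AccessSets whole = factorial(0), tail = factorial(2);
        System.out.println("from 0: r=" + whole.r + " w=" + whole.w + " w!=" + whole.wAll);
        System.out.println("from 2: r=" + tail.r + " w=" + tail.w + " w!=" + tail.wAll);
    }
}
```

For the factorial method, a segment that starts at the first instruction does not list ans (local 2) among its reads, since the write at istore_2 shadows the later read. A segment that starts after the initialization does list it, because the branch taken at ifle reaches iload_2 without any write; this is the situation of offspring z in Figure 1-4.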

2 The data-flow traversal process is similar to the data-flow analyzer’s loop in (Lindholm and Yellin, 1999).

Bytecode Constraints on Crossover. In order to attain viable offspring, several conditions must hold when performing crossover of two bytecode programs. Let A and B be functions in Java, represented as bytecode sequences. Consider segments α and β in A and B, respectively, and let $p_\alpha$ and $p_\beta$ be the necessary depth of stack for these segments, i.e., the minimal number of elements in the stack required to avoid underflow. Segment α can be replaced with β if the following conditions hold.
Operand stack: (1) it is possible to ensure that $p_\beta \le p_\alpha$ by prefixing stack pops and pushes of α with some frames from the stack state at the beginning of α; (2) α and β have compatible stack frames up to depth $p_\beta$: stack pops of α have types identical to or narrower than those of stack pops of β, and stack pushes of β have types identical to or narrower than those of stack pushes of α; (3) α has compatible stack frames deeper than $p_\beta$: stack pops of α have types identical to or narrower than those of the corresponding stack pushes of α.
Local variables: (1) local variables written by β ($\beta^{w}$) have types identical to or narrower than those of the corresponding variables that are read after α (post-$\alpha^{r}$); (2) local variables read after α (post-$\alpha^{r}$) and not necessarily written by β ($\beta^{w!}$) must be written before α (pre-$\alpha^{w!}$), or provided as arguments for the call to A, as identical or narrower types; (3) local variables read by β ($\beta^{r}$) must be written before α (pre-$\alpha^{w!}$), or provided as arguments for the call to A, as identical or narrower types.
Control flow: (1) no branch instruction outside of α has a branch destination in α, and no branch instruction in β has a branch destination outside of β; (2) code before α has a transition to the first instruction of α, and code in β has a transition to the first instruction after β; (3) the last instruction in α implies a transition to the first instruction after α.
Detailed examples of the above conditions can be found in (Orlov and Sipper, 2009). Compatible bytecode crossover prevents verification errors in offspring; in other words, all offspring compile sans error. As with any other evolutionary method, however, it does not prevent production of non-viable offspring (in our case, those with runtime errors). An exception or a timeout can still occur during an individual’s evaluation, and the fitness of the individual should be reset accordingly.
We chose bytecode segments randomly before checking them for crossover compatibility, as follows: For a given method, a segment size is selected using a given probability distribution among all bytecode segments that are branch-consistent under the first control-flow requirement; then a segment with the chosen size is uniformly selected. Whenever the chosen segments result in bad crossover, bytecode segments are chosen again (up to some limit of retries). Note that this selection process is very fast (despite the retries), as it involves fast operations and, most importantly, we ensure that crossover always produces a viable offspring.
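The select, check and retry loop can be sketched as follows. This is an illustration under stated assumptions rather than FINCH's code: the class SegmentPicker and its methods are hypothetical, the size distribution is taken to be uniform, the branch-consistency filter is omitted, and the compatibility test is passed in by the caller as a placeholder for the conditions listed above.

```java
import java.util.Random;
import java.util.function.BiPredicate;

// Sketch of the select-check-retry loop. Segments are half-open index ranges
// [from, to); the uniform size distribution and the caller-supplied
// compatibility test are placeholders for the real ones.
class SegmentPicker {
    static final class Segment {
        final int from, to;
        Segment(int from, int to) { this.from = from; this.to = to; }
        int size() { return to - from; }
    }

    static Segment random(int codeLength, Random rng) {
        int size = rng.nextInt(codeLength + 1);         // 0..codeLength (assumed distribution)
        int from = rng.nextInt(codeLength - size + 1);  // uniform among segments of that size
        return new Segment(from, from + size);
    }

    // Returns {alpha, beta} such that beta may replace alpha, or null when no
    // compatible pair turned up within maxRetries additional attempts.
    static Segment[] pick(int lengthA, int lengthB,
                          BiPredicate<Segment, Segment> compatible,
                          int maxRetries, Random rng) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Segment alpha = random(lengthA, rng);
            Segment beta = random(lengthB, rng);
            if (compatible.test(alpha, beta)) return new Segment[] {alpha, beta};
        }
        return null;
    }

    public static void main(String[] args) {
        // Placeholder test for demonstration only: accept equal-sized segments.
        Segment[] pair = pick(14, 14, (a, b) -> a.size() == b.size(), 100, new Random(42));
        System.out.println(pair == null ? "no compatible pair found"
            : "replace [" + pair[0].from + "," + pair[0].to + ") with ["
              + pair[1].from + "," + pair[1].to + ")");
    }
}
```

What happens when the retry limit is exhausted is not specified in the text; returning null here leaves that decision to the caller, who might, for instance, copy a parent unchanged.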


(a)
    float x;
    int y = 7;
    if (y >= 0)
        x = y;
    else
        x = -y;
    System.out.println(x);

(b)
    int x = 7;
    float y;
    if (y >= 0) {
        y = x;
        x = y;
    }
    System.out.println(z);

Figure 1-5. Two Java snippets that comply with the context-free grammar rules of the programming language. However, only snippet (a) is legal once the full Java Language Specification (Gosling et al., 2005) is considered. Snippet (b), though Java-compliant syntactically, is revealed to be ill-formed when semantics are thrown into play.

5. The Grammar Alternative

One might ask whether it is really necessary to evolve bytecode in order to support the evolution of unrestricted Java software. After all, Java is a programming language with strict, formal rules, which are precisely defined in Backus-Naur form (BNF). One could make an argument for the possibility of providing this BNF description to a grammatical evolution system (O’Neill and Ryan, 2003) and evolving away. We disagree with such an argument. The apparent ease with which one might apply the BNF rules of a real-world programming language in an evolutionary system (either grammatical or tree-based) is an illusion stemming from the blurred boundary between syntactic and semantic constraints (Poli et al., 2008, ch. 6.2.4). Java’s formal (BNF) rules are purely syntactic, in no way capturing the language’s type system, variable visibility and accessibility, and other semantic constraints. Correct handling of these constraints in order to ensure the production of viable individuals would essentially necessitate the programming of a full-scale Java compiler, a highly demanding task, not to be taken lightly. This is not to claim that such a task is completely insurmountable: e.g., an extension to context-free grammars (CFGs), such as logic grammars, can be taken advantage of in order to represent the necessary contextual constraints (Wong and Leung, 2000). But we have yet to see such a GP implementation in practice, addressing real-world programming problems. We cannot emphasize the distinction between syntax and semantics strongly enough. Consider, for example, the Java program segment shown in Figure 1-5(a). It is a seemingly simple syntactic structure, which belies, however, a host of semantic constraints, including: type compatibility in variable assignment, variable initialization before read access, and variable visibility.
The similar (and CFG-conforming) segment shown in Figure 1-5(b) violates all these constraints: variable y in the conditional test is uninitialized during a read access, its subsequent assignment to x is type-incompatible, and variable z is undefined.


It is quite telling that despite the popularity and generality of grammatical evolution, we were able to uncover only a single case of evolution using a real-world, unrestricted phenotypic language, involving a semantically simple hardware description language (HDL). (Mizoguchi et al., 1994) implemented the complete grammar of SFL (Structured Function description Language) (Nakamura et al., 1991) as production rules of a rewriting system, using approximately 350(!) rules for a language far simpler than Java. The semantic constraints of SFL, an object-oriented, register-transfer-level language, are sufficiently weak for using its BNF directly:

“By designing the genetic operators based on the production rules and by performing them in the chromosome, a grammatically correct SFL program can be generated. This eliminates the burden of eliminating grammatically incorrect HDL programs through the evolution process and helps to concentrate selective pressure in the target direction.” (Mizoguchi et al., 1994)

(Arcuri, 2009) recently attempted to repair Java source code using syntax-tree transformations. His JAFF system is not able to handle the entire language, only an explicitly defined subset (Arcuri, 2009, Table 6.1), and furthermore exhibits a host of problems that evolution of correct Java bytecode avoids inherently: individuals are compiled at each fitness evaluation; compilation errors occur despite the syntax-tree modifications being legal (cf. discussion above); a significant part of the Java syntax is unsupported (inner and anonymous classes, labeled break and continue statements, Java 5.0 syntax extensions, etc.); method overloading is supported incorrectly; and there are other problems:

“The constraint system consists of 12 basic node types and 5 polymorphic types. For the functions and the leaves, there are 44 different types of constraints. For each program, we added as well the constraints regarding local variables and method calls. Although the constraint system is quite accurate, it does not completely represent yet all the possible constraints in the employed subset of the Java language (i.e., a program that satisfies these constraints would not be necessarily compilable in Java).” (Arcuri, 2009)

FINCH, through its clever use of Java bytecode, attains a scalability leap in evolutionarily manageable programming language complexity.

6. The Halting Issue

An important issue that must be considered when dealing with the evolution of unrestricted programs is whether they halt—or not (Langdon and Poli, 2006). Whenever Turing-complete programs with arbitrary control flow are evolved, a possibility arises that computation will turn out to be unending. A program that has acquired the undesirable non-termination property during evolution is executed directly by the JVM, and FINCH has nearly no control over the process.


A straightforward approach for dealing with non-halting programs is to limit the execution time of each individual during evaluation, assigning a minimal fitness value to programs that exceed the time limit. This approach, however, suffers from two shortcomings: First, limiting execution time provides coarse time granularity at best, is unreliable in the presence of varying CPU load, and as a result is wasteful of computer resources due to the relatively high time-limit value that must be used. Second, applying a time limit to an arbitrary program requires running it in a separate thread, and stopping the execution of the thread once it exceeds the time limit. However, externally stopping the execution is either unreliable (when interrupting the thread that must then eventually enter a blocked state), or unsafe for the whole application (when attempting to kill the thread).3 Therefore, in FINCH we exercise a different approach, taking advantage of the lucid structure offered by Java bytecode. Before evaluating a program, it is temporarily instrumented with calls to a function that throws an exception if called more than a given number of times (steps). A call to this function is inserted before each backward branch instruction and before each method invocation. Thus, an infinite loop in any evolved individual program will raise an exception after exceeding the predefined steps limit. Note that this is not a coarse-grained (run)time limit, but a precise limit on the number of steps.
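The effect of this instrumentation can be illustrated at the source level. In the sketch below the step call is placed by hand where FINCH would insert it into the bytecode; the class StepLimit, its Exceeded exception and the haltsWithin helper are hypothetical names, and the single static counter assumes that one individual is evaluated at a time.

```java
// Source-level illustration of step limiting. FINCH inserts the equivalent
// call into the bytecode; here it is written by hand. The single static
// counter assumes that only one individual is evaluated at a time.
class StepLimit {
    static final class Exceeded extends RuntimeException {
        Exceeded() { super("step limit exceeded"); }
    }

    private static long remaining;

    static void reset(long maxSteps) { remaining = maxSteps; }

    // The call that instrumentation places before backward branches and invocations.
    static void step() {
        if (--remaining < 0) throw new Exceeded();
    }

    // Stand-in for an evolved method whose loop never terminates.
    static int runaway() {
        int i = 0;
        while (true) {
            step();   // inserted before the backward branch of the loop
            i++;
        }
    }

    // Runs the program and reports whether it finished within the step budget.
    static boolean haltsWithin(long maxSteps, Runnable program) {
        reset(maxSteps);
        try {
            program.run();
            return true;
        } catch (Exceeded e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(haltsWithin(1000, () -> runaway()));
    }
}
```

Because the budget counts steps rather than wall-clock time, the outcome does not depend on CPU load, and no second thread is needed to stop the individual.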

7. (No) Loss of Compiler Optimization

Another issue that surfaces when bytecode genetic operators are considered is the apparent loss of compiler optimization. Indeed, most native-code producing compilers provide the option of optimizing the resulting machine code to varying degrees of speed and size improvements. These optimizations would presumably be lost during the process of bytecode evolution. Surprisingly, however, bytecode evolution does not induce loss of compiler optimization, since there is no optimization to begin with! The common assumption regarding Java compilers’ similarity to native-code compilers is simply incorrect. As far as we were able to uncover, with the exception of the IBM Jikes Compiler (which has not been under development since 2004, and which does not support modern Java), no Java-to-bytecode compiler is optimizing. Sun’s Java Compiler, for instance, has not had an optimization switch since version 1.3 (see footnote 4). Moreover, even the GNU Compiler for Java, which is part of the highly optimizing GNU Compiler Collection (GCC), does not optimize at the

3 For the intricacies of stopping Java threads see http://java.sun.com/javase/6/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html.
4 See the old manual page at http://java.sun.com/j2se/1.3/docs/tooldocs/solaris/javac.html, which contains the following note in the definition of the -O (Optimize) option: the -O option does nothing in the current implementation of javac.

bytecode-producing phase, for which it uses the Eclipse Compiler for Java as a front-end, and instead performs (optional) optimization at the native-code-producing phase. The reason for this is that optimizations are applied at a later stage, whenever the JVM decides to proceed from interpretation to just-in-time compilation (Kotzmann et al., 2008). The fact that Java compilers do not optimize bytecode does not preclude the possibility of doing so, nor render it particularly hard in various cases. Indeed, in FINCH we apply an automatic post-crossover bytecode transformation that is typically performed by a Java compiler: dead-code elimination. After crossover is done, it is possible to get a method with unreachable bytecode sections (e.g., a forward goto with no instruction that jumps into the section between the goto and its target code offset). Such dead code is problematic in Java bytecode, and it is therefore automatically removed from the resulting individuals by our system. This technique does not impede the ability of individuals to evolve introns, since there is still a multitude of other intron types that can be evolved (Brameier and Banzhaf, 2007) (e.g., any arithmetic bytecode instruction not affecting the method’s return value, which is not considered dead-code bytecode, though it is an intron nonetheless).
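Finding such unreachable sections amounts to a reachability search over the method's control-flow graph. The sketch below shows only that detection step; the class DeadCode and the successor-list encoding are assumptions made for illustration, and the subsequent removal of instructions, which requires adjusting branch offsets, is not shown.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Reachability over a control-flow graph given as successor lists.
// Only detection is shown; removing the dead instructions would also
// require adjusting branch offsets.
class DeadCode {
    // succ[i] lists the instructions that may execute right after instruction i.
    static BitSet reachable(int[][] succ) {
        BitSet seen = new BitSet(succ.length);
        if (succ.length == 0) return seen;
        Deque<Integer> work = new ArrayDeque<>();
        seen.set(0);                 // execution starts at the first instruction
        work.push(0);
        while (!work.isEmpty()) {
            for (int next : succ[work.pop()]) {
                if (!seen.get(next)) {
                    seen.set(next);
                    work.push(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // 0: iload_1   1: goto 4   2: iconst_1   3: iadd   4: ireturn
        int[][] succ = { {1}, {4}, {3}, {4}, {} };
        BitSet live = reachable(succ);
        for (int i = 0; i < succ.length; i++) {
            if (!live.get(i)) System.out.println("instruction " + i + " is unreachable");
        }
    }
}
```

In the example, the forward goto skips instructions 2 and 3, and no other instruction branches into that section, so both are reported as dead.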

8. A Summary of Results

Due to space limitations we only provide a brief description of our results, with the full account available in (Orlov and Sipper, 2009; Orlov and Sipper, 2010). To date, we have successfully tackled several problems:
Simple and complex symbolic regression: Evolve programs to approximate the simple polynomial $x^4 + x^3 + x^2 + x$ and the more complex polynomial $\sum_{i=1}^{9} x^i$.
Artificial ant problem: Evolve programs to find all 89 food pellets on the Santa Fe trail.
Intertwined spirals problem: Evolve programs to correctly classify 194 points on two spirals.
Array sum: Evolve programs to compute the sum of values of an integer array, along the way demonstrating FINCH’s ability to handle loops and recursion.
Tic-tac-toe: Evolve a winning program for the game, starting from a flawed implementation of the negamax algorithm. This example shows that programs can be improved.
Figure 1-6 shows two examples of Java programs evolved by FINCH.


(a)
    Number simpleRegression(Number num) {
        double d = num.doubleValue();
        return Double.valueOf(d + (d * (d * (d +
            ((d = num.doubleValue()) +
            (((num.doubleValue() * (d = d) + d) * d + d) * d + d) * d)
            * d) + d) + d) * d);
    }

(b)
    int sumlistrec(List list) {
        int sum = 0;
        if (list.isEmpty())
            sum = sum;
        else
            sum += ((Integer)list.get(0)).intValue()
                 + sumlistrec(list.subList(1, list.size()));
        return sum;
    }

Figure 1-6. Examples of evolved programs for the degree-9 polynomial regression problem (a), and the recursive array sum problem (b). The Java code shown was produced by decompiling the respective evolved bytecode solutions.
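One way to read Figure 1-6(a) is to note that the embedded assignments never change the value of d, so the nested expression unfolds, Horner-style, into $x + x^2 + \cdots + x^9$. The sketch below checks this numerically. The method body is copied from the figure; the class RegressionCheck, the reference evaluation, the sample points and the tolerance are our own choices for illustration.

```java
// Compares the decompiled program of Figure 1-6(a) with a direct evaluation
// of x + x^2 + ... + x^9. The method body is copied from the figure; the
// reference implementation, sample points and tolerance are illustrative.
class RegressionCheck {
    static Number simpleRegression(Number num) {
        double d = num.doubleValue();
        return Double.valueOf(d + (d * (d * (d +
            ((d = num.doubleValue()) +
            (((num.doubleValue() * (d = d) + d) * d + d) * d + d) * d)
            * d) + d) + d) * d);
    }

    // Direct evaluation of the target polynomial.
    static double target(double x) {
        double sum = 0, power = 1;
        for (int i = 1; i <= 9; i++) {
            power *= x;
            sum += power;
        }
        return sum;
    }

    // Largest error over x = -2.0, -1.9, ..., 2.0, relative to max(1, |target|).
    static double maxRelativeError() {
        double worst = 0;
        for (int k = -20; k <= 20; k++) {
            double x = k / 10.0;
            double got = simpleRegression(x).doubleValue();
            double want = target(x);
            worst = Math.max(worst, Math.abs(got - want) / Math.max(1.0, Math.abs(want)));
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max relative error: " + maxRelativeError());
    }
}
```

The two evaluations agree up to floating-point rounding on the sampled points, which is consistent with the evolved program being an exact solution of the degree-9 regression problem rather than an approximation.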

9. Concluding Remarks

A recent study commissioned by the US Department of Defense on the subject of futuristic ultra-large-scale (ULS) systems that have billions of lines of code noted, among others, that, “Judiciously used, digital evolution can substantially augment the cognitive limits of human designers and can find novel (possibly counterintuitive) solutions to complex ULS system design problems” (Northrop et al., 2006, p. 33). This study does not detail any actual research performed but attempts to build a road map for future research. Moreover, it concentrates on huge, futuristic systems, whereas our aim is at current systems of any size. Differences aside, both our work and this study share the vision of true software evolution. Turing famously (and wrongly...) predicted that, “in about fifty years’ time it will be possible, to programme computers [. . . ] to make them play the imitation game so well that an average interrogator will not have more than 70 per cent. chance of making the right identification after five minutes of questioning” (Turing, 1950). Recently, Harman wrote that, “. . . despite its current widespread use, there was, within living memory, equal skepticism about whether compiled code could be trusted. If a similar change of attitude to evolved code occurs over time. . . ” (Harman, 2010). We wish to offer our own prediction for fifty years hence, in the hope that we shall not be wrong: We believe that in about fifty years’ time it will be possible, to program computers by means of evolution. Not merely possible but indeed prevalent.

References Arcuri, Andrea (2009). Automatic Software Generation and Improvement Through Search Based Techniques. PhD thesis, University of Birmingham, Birmingham, UK.

FINCH: A System for Evolving Java (Bytecode)

15

Brameier, Markus and Banzhaf, Wolfgang (2007). Linear Genetic Programming. Number XVI in Genetic and Evolutionary Computation. Springer.

Bruneton, Eric, Lenglet, Romain, and Coupaye, Thierry (2002). ASM: A code manipulation tool to implement adaptable systems (Un outil de manipulation de code pour la réalisation de systèmes adaptables). In Adaptable and Extensible Component Systems (Systèmes à Composants Adaptables et Extensibles), October 17–18, 2002, Grenoble, France, pages 184–195.

Engel, Joshua (1999). Programming for the Java™ Virtual Machine. Addison-Wesley, Reading, MA, USA.

Gosling, James, Joy, Bill, Steele, Guy, and Bracha, Gilad (2005). The Java™ Language Specification. The Java™ Series. Addison-Wesley, Boston, MA, USA, third edition.

Harman, Mark (2010). Automated patching techniques: The fix is in. Communications of the ACM, 53(5):108.

Kotzmann, Thomas, Wimmer, Christian, Mössenböck, Hanspeter, Rodriguez, Thomas, Russell, Kenneth, and Cox, David (2008). Design of the Java HotSpot™ client compiler for Java 6. ACM Transactions on Architecture and Code Optimization, 5(1):7:1–32.

Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.

Langdon, W. B. and Poli, R. (2006). The halting probability in von Neumann architectures. In Collet, Pierre, Tomassini, Marco, Ebner, Marc, Gustafson, Steven, and Ekárt, Anikó, editors, Proceedings of the 9th European Conference on Genetic Programming, volume 3905 of Lecture Notes in Computer Science, pages 225–237, Budapest, Hungary. Springer.

Lindholm, Tim and Yellin, Frank (1999). The Java™ Virtual Machine Specification. The Java™ Series. Addison-Wesley, Boston, MA, USA, second edition.

Miecznikowski, Jerome and Hendren, Laurie (2002). Decompiling Java bytecode: Problems, traps and pitfalls. In Horspool, R. Nigel, editor, Compiler Construction: 11th International Conference, CC 2002, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002, Grenoble, France, April 8–12, 2002, volume 2304 of Lecture Notes in Computer Science, pages 111–127, Berlin / Heidelberg. Springer-Verlag.

Mizoguchi, Jun’ichi, Hemmi, Hitoshi, and Shimohara, Katsunori (1994). Production genetic algorithms for automated hardware design through an evolutionary process. In Proceedings of the First IEEE Conference on Evolutionary Computation, ICEC’94, volume 2, pages 661–664.

Nakamura, Yukihiro, Oguri, Kiyoshi, and Nagoya, Akira (1991). Synthesis from pure behavioral descriptions. In Camposano, Raul and Wolf, Wayne Hendrix, editors, High-Level VLSI Synthesis, pages 205–229. Kluwer, Norwell, MA, USA.

Northrop, Linda et al. (2006). Ultra-Large-Scale Systems: The Software Challenge of the Future. Carnegie Mellon University, Pittsburgh, PA, USA.

O’Neill, Michael and Ryan, Conor (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, volume 4 of Genetic Programming. Kluwer Academic Publishers.

Orlov, Michael and Sipper, Moshe (2009). Genetic programming in the wild: Evolving unrestricted bytecode. In Raidl, Günther et al., editors, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, July 8–12, 2009, Montréal Québec, Canada, pages 1043–1050, New York, NY, USA. ACM Press.

Orlov, Michael and Sipper, Moshe (2010). Flight of the FINCH through the Java wilderness. IEEE Transactions on Evolutionary Computation. In press.

Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).

Spector, Lee and Robinson, Alan (2002). Genetic programming and autoconstructive evolution with the Push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.

Turing, Alan Mathison (1950). Computing machinery and intelligence. Mind, 59(236):433–460.

Wong, Man Leung and Leung, Kwong Sak (2000). Data Mining Using Grammar Based Genetic Programming and Applications, volume 3 of Genetic Programming. Kluwer, Norwell, MA, USA.

Woodward, John R. (2003). Evolving Turing complete representations. In Sarker, Ruhul et al., editors, The 2003 Congress on Evolutionary Computation, CEC 2003, Canberra, Australia, 8–12 December, 2003, volume 2, pages 830–837. IEEE Press.

Chapter 2

TOWARDS PRACTICAL AUTOCONSTRUCTIVE EVOLUTION: SELF-EVOLUTION OF PROBLEM-SOLVING GENETIC PROGRAMMING SYSTEMS

Lee Spector
Cognitive Science, Hampshire College, Amherst, MA, 01002-3359 USA.

Abstract

Most genetic programming systems use hard-coded genetic operators that are applied according to user-specified parameters. Because it is unlikely that the provided operators or the default parameters will be ideal for all problems or all program representations, practitioners often devote considerable energy to experimentation with alternatives. Attempts to bring choices about operators and parameters under evolutionary control, through self-adaptive algorithms or meta-genetic programming, have been explored in the literature and have produced interesting results. However, no systems based on such principles have yet been demonstrated to have greater practical problem-solving power than the more-standard alternatives. This chapter explores the prospects for extending the practical power of genetic programming through the refinement of an approach called autoconstructive evolution, in which the algorithms used for the reproduction and variation of evolving programs are encoded in the programs themselves, and are thereby subject to variation and evolution in tandem with their problem-solving components. We present the motivation for the autoconstructive evolution approach, show how it can be instantiated using the Push programming language, summarize previous results with the Pushpop system, outline the more recent AutoPush system, and chart a course for future work focused on the production of practical systems that can solve hard problems.

Keywords: genetic programming, meta-genetic programming, autoconstructive evolution, Push, PushGP, Pushpop, AutoPush

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_2, © Springer Science+Business Media, LLC 2011

1. Introduction

The work described in this chapter is motivated both by features of biological evolution and by the requirements for the high-performance problem-solving systems of the future. Under common conceptions of biological evolution the variation of genotypes from parents to children, and hence the diversification of phenotypes from progenitors to their descendants, is essentially random prior to selection. Offspring vary randomly, it is said, and selection acts on the resulting diversity by allowing the better-adapted random variants to survive and reproduce. Such conceptions are held not only by the lay public but also by theorists such as Jerry Fodor and Massimo Piattelli-Palmarini who, in their book What Darwin Got Wrong, criticize Darwinian theory in part on the grounds that the random “generate and test” algorithm at its core is insufficiently powerful to account for the facts of natural history (Fodor and Piattelli-Palmarini, 2010). But diversification in nature, while certainly random in some respects, is also clearly non-random in several others. If one were to modify DNA molecules in truly random ways, considering all chemical bonds to be equally good candidates for breakage and re-connection, then one would not end up with DNA molecules at all but instead with some other sort of organic soup. Cellular machinery copies DNA, and repairs copying errors, in ways that allow for certain kinds of “errors” but only within tightly constrained bounds. At higher levels of organization variation is constrained by genetic regulatory processes, the mechanics of sexual recombination, cell division and development, and, at a much higher level of organization, by social structures that guide non-random mate selection. All of these constraints emerge from reproductive processes that have themselves evolved over the course of natural history. 
There is a large literature on such constraints, including a recent theory of “facilitated variation” (Gerhart and Kirschner, 2007), and summaries of the evolution of variation from pre-biotic Earth to the present (Maynard Smith and Szathmáry, 1999). Whether or not the evolved non-randomness of biological variation constitutes a significant critique of neo-Darwinism or of the historical Darwin, as claimed by Fodor and Piattelli-Palmarini, is beyond the scope of the present discussion. For our purposes, however, two related points should be made. First, while truly random variation, filtered by selection, may be too weak a mechanism to have produced the sequence of phenotypes observed over time in the historical record, it is possible for random variation, when acting on the reproductive mechanisms themselves, to produce variation mechanisms that are not purely random. This is presumably what happened in natural history. Second, this bootstrapping process, of the evolution of adaptive, not-entirely-random variation by means of the initially random variation of the variation mechanisms, might also be applied to evolutionary problem-solving technologies.

Why would we want to do this? One reason is that the problem-solving power of current evolutionary computing technologies is limited by the nature of the variation mechanisms that we build into these systems by hand. Consider, for example, the standard mutation operators used in genetic programming. Subtree replacement, applied uniformly to the nodes in a program tree (or uniformly to interior vs. leaf nodes with a specified probability), involving the replacement of subtrees with newly-generated random subtrees, provides a form of variation that leads to solutions in some but not all problem environments. This has led to the development of a wide range of alternative mutation operators; see, for example, the “Mutation Cookbook” section of (Poli et al., 2008, pp. 42–44). But which of these will be most helpful in which circumstances, and which others, perhaps not yet invented, may be needed to solve challenging new problems? The field currently has no satisfying answer to this question, which will become all the more pressing as genetic programming systems incorporate more expressive and heterogeneous program representations. In the context of such representations it may well make sense for different program elements or program locations to have different variation rates or procedures, and it will not be obvious, in advance, how to make these choices. The question will also become all the more pressing as genetic programming systems are applied to ever more complex problems, about which the system designers will have less knowledge and intuition. And the question will be raised with even greater urgency with respect to recombination operators such as crossover, for which there are even more open questions (e.g. about how to choose crossover partners) that currently require the user to make choices that may not be optimal.
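As a concrete point of reference, the standard subtree-replacement mutation mentioned above might be sketched as follows. This is a minimal illustration and not code from any particular genetic programming system: the nested-list tree representation, the primitive sets `FUNCTIONS` and `TERMINALS`, and all function names are assumptions made for this example.

```python
import random

# Hypothetical primitive sets, chosen only for illustration.
FUNCTIONS = {"+": 2, "*": 2, "neg": 1}   # name -> arity
TERMINALS = ["x", "1"]


def random_tree(rng, depth):
    """Grow a random tree: a terminal string or a list [op, child, ...]."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return [op] + [random_tree(rng, depth - 1) for _ in range(FUNCTIONS[op])]


def node_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every node, root first."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, path + (i,))


def replace_at(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced by new_subtree."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy


def subtree_mutation(tree, rng, max_new_depth=2):
    """Pick one node uniformly at random and replace it with a random subtree."""
    path = rng.choice(list(node_paths(tree)))
    return replace_at(tree, path, random_tree(rng, max_new_depth))
```

Every choice in this sketch (uniform node selection, the depth limit, the probability of stopping at a terminal) is fixed by hand, which is exactly the kind of hand-built variation mechanism that the discussion above calls into question.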
Two approaches to these general issues that have previously been explored in the literature are “self-adaptation” and “meta-genetic programming.” Many forms of self-adaptation have been investigated, both within genetic programming and in other areas of evolutionary computation (with many examples including (Angeline, 1995; Spears, 1995; Angeline, 1996; Eiben et al., 1999; MacCallum, 2003; Fry et al., 2005; Beyer and Meyer-Nieberg, 2006; Vafaee et al., 2008; Silva and Dignum, 2009)). In all of these systems the parameters of the evolutionary algorithm are varied and subjected to some form of selection, whether the variation and selection is accomplished by means of the overarching evolutionary algorithm, by a secondary evolutionary algorithm, or by some other machine learning technique. In some cases the parameters are adapted on an individual basis, while in others the self-adaptive system modifies global parameters that apply to an entire population. In general, however, these systems vary only pre-selected parameters of the variation operators in pre-specified ways, and they do not allow for the evolution of arbitrary methods of variation.
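To make the individual-level case concrete, here is a minimal sketch of one common style of self-adaptation, in which each individual carries its own mutation rate and that rate is itself perturbed before being used. It is not taken from any of the cited systems; the bit-string genome, the log-normal update, and the parameters `tau`, `min_rate`, and `max_rate` are all assumptions made for illustration.

```python
import math
import random


def mutate_self_adaptive(individual, rng, tau=0.5, min_rate=0.01, max_rate=0.5):
    """Mutate an individual of the form (bits, rate).

    The mutation rate is perturbed first (log-normally, an assumed scheme),
    and the new rate is then used to flip bits in the genome.
    """
    bits, rate = individual
    new_rate = rate * math.exp(tau * rng.gauss(0.0, 1.0))
    new_rate = min(max_rate, max(min_rate, new_rate))
    new_bits = [b ^ 1 if rng.random() < new_rate else b for b in bits]
    return new_bits, new_rate
```

Note that only the rate evolves here. The form of the operator (independent bit flips) remains fixed in advance, which illustrates the limitation described above.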


By contrast, the “meta-genetic programming” approach leverages the program-space search capabilities of genetic programming to search for variation operators (which are, after all, themselves programs) during the search for problem-solving programs (Schmidhuber, 1987; Kantschik et al., 1999; Edmonds, 2001; Tavares et al., 2004; Diosan and Oltean, 2009). These systems would appear to have more potential to evolve adaptive variation algorithms, but they have generally been subject to one or both of the following two significant limitations:

- The evolving genetic operators are not associated with specific evolving problem-solving programs; they are expected to apply to all evolving problem-solving programs equally well.
- The evolving genetic operators are restricted to being compositions of a small number of pre-designed components; many conceivable genetic operators will not be representable using these components.

The first of these limitations contrasts with some of the self-adaptive evolutionary algorithms mentioned previously, in which the values of parameters for genetic operators are encoded in individuals. That this “global” conception of the applicability of genetic operators might be a limitation should be evident from a cursory examination of the diversity of reproductive strategies in nature. For example, the reproductive strategies of the dandelion are quite different from those of the tiger, the oyster mushroom, and Escherichia coli; nobody would expect the strategies of any of these organisms to work particularly well for any of the others. Of course the diversity present in the Earth’s biosphere dwarfs that of any current genetic programming system, but it would nonetheless be quite surprising if the same genetic operators worked equally well across a genetic programming population with any significant diversity.
One could well imagine, for example, that a subset of the population might share one particular subtree in which a high degree of mutation is adaptive and a second subtree in which mutation is always deleterious. Other individuals in the population might lack either or both of these subtrees, or they might contain additional code that changes the effects of mutations within these particular subtrees.

The second of these limitations is probably mostly a reflection of the fact that most genetic programming representations limit the expressiveness of the programs that they can evolve more generally. Although several Turing complete representations have been described (for example, (Teller, 1994; Nordin and Banzhaf, 1995; Spector and Robinson, 2002a; Woodward, 2003; Yabuki and Iba, 2004; Langdon and Poli, 2006)), such representations are relatively rare and representations that can easily perform arbitrary transformations on variable-sized programs are rarer still. Nature appears to be quite flexible and inventive in the variation mechanisms that it employs (e.g., mechanisms involving gene duplication), and we can easily imagine cases in which genetic programming systems would benefit from the use of genetic operators that are not simple compositions of hand-designed operator components.

Another line of research that bears on the approach presented here generally appears in the artificial life literature. Systems such as Tierra (Ray, 1991), Avida (Ofria and Wilke, 2004), and SeMar (Suzuki, 2004) all involve the evolution of programs that are partially responsible for their own reproduction, and in which the reproductive mechanisms (including genetic operators) are therefore subject to variation and selection. However, in these systems diversification is generally driven by hand-designed “ancestor” replicators and/or by the effects of hand-designed mutation algorithms that are applied automatically to the results of all code manipulation operations. Furthermore, while some of these systems have been used to solve computational problems, their problem-solving power has been quite limited; they have been used to evolve simple logic gates and arithmetic functions, but they have not been applied to the kinds of difficult problems that genetic programming practitioners are interested in solving. This is not surprising, as these systems have generally been developed primarily to study biological evolution, not to solve difficult computational problems.

Additional related work has been conducted in the context of evolved self-reproduction (Taylor, 1999; Sipper and Reggia, 2001), although most of this work has been focused on the evolution of exact replication rather than the evolution of adaptive variation. An exception, and the closest work to that described below, is Koza’s work on the “Spontaneous Emergence of Self-Replicating and Evolutionarily Self-Improving Computer Programs” (Koza, 1994).
In that work Koza evolved programs that simultaneously solved problems (albeit simple Boolean problems) and produced variant offspring using template-based code self-modification in a “sea” or “Turing gas” of programs (Fontana, 1992).

This chapter describes an approach to self-adaptive genetic programming, called autoconstructive evolution, that combines several features of the approaches described above, with the long-term goal of producing a new generation of powerful problem-solving systems. The potential advantage of the autoconstructive evolution approach is that it will allow variation mechanisms to co-evolve with the programs to which they are applied, thereby allowing the evolutionary system itself to adapt to its problem environments in significant ways. The autoconstructive evolution approach was first described in 2001 and 2002 (Spector, 2001; Spector, 2002; Spector and Robinson, 2002a; Spector and Robinson, 2002b), using the Pushpop system that leveraged features of the Push programming language for evolved programs. In the next section this earlier work is briefly described. The subsequent section describes more recent work on the approach, using better technology and a more explicit focus on the goal of high-performance problem solving, implemented in a newer system called AutoPush. The final section of the chapter offers some brief conclusions.

2. Push and Pushpop

An autoconstructive evolution system was defined in (Spector and Robinson, 2002a) as “any evolutionary computation system that adaptively constructs its own mechanisms of reproduction and diversification as it runs.” In the context of the present discussion, however, that definition is too general, and a more specific definition that captures both the past and present usage would be “any genetic programming system in which the methods for reproduction and diversification are encoded in the individual programs themselves, and are thereby subject to variation and evolution.” The goal in the previous work, as in the work described here, is for the ways in which children are produced to be evolved along with the programs to which they will be applied. This is done by encoding the mechanisms for reproduction and diversification within the programs themselves, which must be capable of producing children and, in principle, of solving the problem to which the genetic programming system is being applied. The space of possible reproduction and diversification methods is vast and an ideal system would allow evolving programs to reach new and uncharted reaches of this space. Human-designed diversification mechanisms, including human-designed genetic operators, human-specified automatic mutation during code-manipulation, and human-written ancestor programs, should all be avoided. Of course it will generally be necessary for some features of any evolutionary system to be pre-specified; for example, all of the systems described here borrow several pre-specified elements of traditional genetic programming systems, including a generation-based evolutionary loop, a fixed-size population, and tournament selection with a pre-specified tournament size. The focus here is on the evolution of the means by which children are produced from parents, and it is this task for which we currently seek autoconstructive methods. 
A prerequisite for this approach is a program representation in which problem-solving functions and child-production functions can both be easily expressed. The Push programming language was originally designed specifically for this purpose (Spector, 2001). Push is a stack-based language roughly in the tradition of Forth, but one in which each data type has its own stack. Instructions generally take their arguments from the appropriate stacks and push their results onto the appropriate stacks.¹ If an instruction requires arguments that are not present on the appropriate stacks when it is called then it does nothing (it acts as a “no-op”).

¹ Exceptions are instructions that draw their inputs from external data structures, for example instructions that access inputs, and instructions that act on external data structures, for example “developmental” instructions that add components to externally-developing representations of circuits or other structured objects.
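The stack discipline described above can be illustrated with a toy interpreter. This is only a sketch under strong simplifying assumptions and is not the Push implementation: it handles flat programs, supports just two types, omits the code and exec stacks entirely, and uses instruction names (`integer.+`, `integer.<`, `boolean.not`) chosen for illustration.

```python
def run_push(program, inputs=()):
    """Run a flat list of literals and instruction names on per-type stacks."""
    stacks = {"integer": list(inputs), "boolean": []}

    def take(stack_name, n):
        """Pop n arguments from a stack, or return None if too few are present."""
        stack = stacks[stack_name]
        if len(stack) < n:
            return None
        args = stack[-n:]
        del stack[-n:]
        return args

    for token in program:
        if isinstance(token, bool):       # test bool first: bool is a subclass of int
            stacks["boolean"].append(token)
        elif isinstance(token, int):
            stacks["integer"].append(token)
        elif token == "integer.+":
            args = take("integer", 2)
            if args is not None:          # missing arguments: act as a no-op
                stacks["integer"].append(args[0] + args[1])
        elif token == "integer.<":
            args = take("integer", 2)
            if args is not None:
                stacks["boolean"].append(args[0] < args[1])
        elif token == "boolean.not":
            args = take("boolean", 1)
            if args is not None:
                stacks["boolean"].append(not args[0])
        # Unknown tokens are silently ignored in this sketch.
    return stacks
```

For example, `run_push([5, "integer.+"])` leaves the single integer 5 on the integer stack, because the addition finds only one argument and does nothing. In `run_push([True, 3, 4, "boolean.not", "integer.<"])` each instruction finds its arguments on its own type's stack regardless of how the literals are interleaved.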

These specifications mean that even though multiple data types may be present in a program no instruction will ever be called on arguments of the wrong type, regardless of its syntactic position in the program. Among other benefits, this means that there are essentially no syntax constraints on Push programs aside from a requirement that parentheses be balanced. This is particularly useful for systems in which child programs will be produced by evolving programs.

One of Push’s most important features for autoconstructive evolution, and for genetic programming more generally, is the fact that “code” is a first-class data type. When a Push program is being executed the code that is queued for execution is stored on a special stack called the “exec” stack, and exec instructions in the program can manipulate the queued instructions in order to implement a wide variety of evolved control structures (Spector et al., 2005). Additional code stacks (including one called simply “code,” and in some implementations others with names such as “child”) can be used to store and manipulate code for a variety of other purposes. This feature has significant benefits for genetic programming even in a non-autoconstructive context (that is, even when standard, hard-coded genetic operators are used, as in the PushGP system), but here we focus on the use of Push for autoconstructive evolution. Space limits prevent full exposition of the Push language here; see (Spector et al., 2005) and the references therein for further details.²

The first autoconstructive evolution system built using Push, called Pushpop, can best be understood as an extension of a more-standard genetic programming system such as PushGP.
In PushGP, when a program is being tested for fitness on a particular fitness case it is run and then the problem-solving outputs are collected from the relevant data stacks (typically integer or float) and tested for errors; Pushpop does this as well, but it also simultaneously collects a potential child from the child stack. If the problem to which the system is being applied involves n fitness cases then the testing of each program in the population will produce n potential children. In the reproductive phase tournaments are conducted among parents and children are selected randomly from the set of potential children of the winning parents. If there are insufficient children to fill the child population then newly generated random individuals are used.

² See also http://hampshire.edu/lspector/push.html.

In Pushpop, as in any autoconstructive evolution system, care must be taken to prevent the takeover of the population by perfect replicators or other pathological replicants. Because there is no automatic mutation in Pushpop a perfect replicator can rapidly fill the population with copies of itself, after which no evolution (and indeed no change at all) will occur. The production of perfect replicators in Push is generally trivial, because programs are pushed onto the code stack prior to execution. For this reason Pushpop includes a “no cloning” rule that specifies that exact clones will not be allowed into the child population. Settings are also available that prohibit children that are identical to any of their ancestors or any other individuals in the population. The “no cloning” rule forces programs to diversify in some way, but it does not dictate the mode or extent of diversification. The pathology of perfect replicators in nature was presumably overcome with the aid of vast stretches of time and over vast expanses of the Earth, within which perfect replicators may have arisen but later been eliminated when changes occurred to which they could not adapt. Our resources are much more constrained, however, and so we must proactively cull the individuals that we know cannot possibly evolve.

Programs in a Pushpop population can reproduce using evolved forms of multi-parent recombination, accessing other individuals in the population through the use of a variety of instructions provided for this purpose and using them in any computable way to produce their children (Spector and Robinson, 2002a). In fact, evolving Pushpop programs can access and then execute code from other individuals in the population, which means that evolved programs may not work correctly when executed outside of the populations within which they evolved. This is unfortunate from the perspective of a practitioner who is primarily interested in producing a program that will solve a particular problem, since the “solution” may require the entire population to work and it may be exceptionally difficult to understand. The mechanisms for population access in Pushpop are also somewhat complex, and the presence of these mechanisms makes it particularly difficult to analyze the performance of the system. For these reasons the new work described here does not allow executing programs to access the other programs in the population; see below for further discussion.

Pushpop is capable of solving simple symbolic regression problems, and it has served as the basis for studies of the evolution of diversification.
For example, one study showed that evolving populations that produce adaptive Pushpop programs—that is, programs that actually solve the problems presented to the system—are reliably more diverse than is required by the “no cloning” rule alone (Spector, 2002). But Pushpop’s utility as a problem-solving system is limited, and the focus of the Push project in subsequent years has been on more traditional genetic programming systems such as PushGP. PushGP uses traditional genetic operators but the code-manipulation features of Push nonetheless provide benefits, for example by simplifying the evolution of novel control structures and modular architectures. More recently, however, the use of Push for autoconstructive evolution has been revisited in light of improvements to the Push language (Spector et al., 2005), the availability of substantially faster hardware, and a clarified focus on the long-term potential of autoconstructive evolution to solve problems that cannot be solved with hand-coded diversification mechanisms.
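The Pushpop reproductive phase described in this section, including the “no cloning” rule and the fallback to random individuals, might be sketched roughly as follows. The details are assumptions made for illustration rather than a description of the actual Pushpop code: individuals are dictionaries with hypothetical keys `program`, `error`, and `potential_children`, one child is drawn per tournament, clones are filtered against the parent only, and the `max_attempts` budget is invented here.

```python
import random


def pushpop_reproduce(population, rng, pop_size, tournament_size,
                      random_program, max_attempts=None):
    """Build a child population from the potential children of tournament winners."""
    if max_attempts is None:
        max_attempts = 10 * pop_size      # assumed budget, not from the source
    children = []
    for _ in range(max_attempts):
        if len(children) >= pop_size:
            break
        entrants = rng.sample(population, tournament_size)
        winner = min(entrants, key=lambda ind: ind["error"])  # lower error wins
        # "No cloning" rule: exact copies of the parent are not admitted.
        candidates = [c for c in winner["potential_children"]
                      if c != winner["program"]]
        if candidates:
            children.append(rng.choice(candidates))
    # Too few acceptable children: fill with newly generated random individuals.
    while len(children) < pop_size:
        children.append(random_program(rng))
    return children
```

In this sketch a perfect replicator can win every tournament and still contribute nothing to the next generation, since all of its potential children are rejected as clones.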

3. Practical Autoconstructive Evolution

AutoPush is a new autoconstructive genetic programming system, a successor to Pushpop built on the more expressive version 3 of the Push programming language and designed with a more explicit focus on problem-solving power. To that end, several sources of inessential complexity in Pushpop have been removed to aid in the analysis of AutoPush runs and their results. AutoPush, like Pushpop, uses the basic generational loop of a standard genetic programming system and tournament selection with a pre-specified tournament size. Also like Pushpop it uses no pre-specified genetic operators, no ancestor replicators, and no pre-specified, automatic mutation. And like Pushpop it represents its programs in a Turing complete language so that children may be produced from parents by means of any computable function, modulo limits on execution steps or time. The current version of AutoPush is asexual (that is, parents must construct their children without having access to other programs in the population) because this eliminates complexity that may not be necessary and also simplifies analysis. Asexual programs may be run in isolation, both to solve the target problem and to study the range of children that they produce, and it is easy to store all of their ancestors (of which there will be only as many as there have been generations, while each individual in a sexually-reproducing population may have exponentially many ancestors). Future versions of AutoPush may reintroduce the possibility of recombination by reintroducing instructions that provide access to other individuals in the population; it is our intention to explore this option once the dynamics of the asexual version are better understood. It is also worth noting that the role of sex in biological diversification is a subject of considerable debate, and that asexual organisms diversify in complex and significant ways (Barraclough et al., 2003).
The processes by which programs are tested for problem-solving performance and used to produce children also differ between Pushpop and AutoPush. In Pushpop a potential child is produced for each fitness case, during the calculation of the problem-solving answer for that fitness case. This means that the number of children may depend on the number of fitness cases, which complicates analysis and also changes the way that the algorithm will perform on problems with different numbers of fitness cases. By contrast, in AutoPush no children are produced during fitness testing; any code left on the code stack after a fitness-testing run is ignored.³ Instead, when an individual is selected for autoconstructive reproduction in a tournament it is run again, with an input of 0, to produce a child program for the next generation.⁴

³ In Pushpop a special child stack is used for the production of children because the code stack is needed for the expression of evolved control structures in Push1, in which Pushpop was implemented. AutoPush is implemented in Push3, in which the new exec stack can be used for evolved control structures, freeing up the code stack for child production.

The most significant innovation in AutoPush is a new approach to constraints on birth and selection. Pushpop incorporates a “no cloning” rule but AutoPush goes further, adding more constraints on birth and selection to facilitate the evolution of adaptive diversification. Following the lead of meta-genetic programming developers who judged the fitness of evolving operators by “some measure of success in increasing the fitness of the population they operate on” (Edmonds, 2001), AutoPush incorporates factors based on the history of improvement within the ancestry of an individual. There are many ways in which one might measure “history of improvement” and many ways in which such measurements might be used in an evolutionary algorithm. For example, Smits et al. define “activity” or “potential to improve” as “the sum of the number of moves [in the program search space] that either improved the fitness or neutral moves that resulted in either no change in fitness or a change that was less than a given (dynamic) tolerance limit” (Smits et al., 2010). They use this measure to select candidates for further testing, crossover, and replacement. Additional comments on varieties and measures of self-improvement can be found in (Schmidhuber, 2006). In AutoPush the history of improvement is a scalar that summarizes the direction of problem-solving performance changes over the individual’s ancestry, with greater weight given to more recent changes (see formula below). It would be tempting to use this measure of improvement only in selection, possibly as a second objective, in addition to problem-solving performance, in the context of a multi-objective selection scheme.
But this, by itself, would not work well because selection cannot salvage a population that has become overrun by evolutionary “dead-enders” that can never produce improved descendants. Such dead-enders include not only cloners but also programs of several other categories. For example, consider a population full of programs that produce children that vary only in a subexpression that is never executed. This population is just as un-adaptive as a population of cloners, and it will do no good to select among its individuals on any basis whatsoever. Many other, more subtle categories of dead-enders exist, presenting challenges to any evolutionary system that relies only on selection to drive adaptation. The alternative approach taken in AutoPush is to prevent such dead-enders, when they can be detected, from reproducing at all, and to make room in the population for the children of improvers or at least for new random individuals.

4 The input value of 0 is arbitrary, and an input value is provided only for the minor convenience of avoiding re-definition of the input-pushing instruction. None of this should be significant as long as we are consistent in the ways that we conduct the autoconstructive reproduction runs.


As a result, we place a variety of constraints on birth and selection which act collectively to promote the evolution of adaptive diversification without specifying the form(s) that the actual diversification algorithms will take. More specifically, we conduct selection using tournaments, with comparisons within the tournament set computed as follows:5

Prefer reproductively competent parents: Individuals that were generated by other individuals beat randomly-generated individuals, and individuals that are “grandchildren” beat all others that are not. If both individuals being compared are grandchildren then the lengths of their lineages are not otherwise decisive.

Prefer parents with non-stagnant lineages: A lineage is considered stagnant if it has persisted for at least some preset number of generations (6 in the experiments described here) and if problem-solving performance has not changed in the most recent half of the lineage.

Prefer parents with good problem-solving performance: If neither reproductive competence nor lineage stagnation is decisive then select the parent that does a better job on the target problem.

The constraints on birth make use of two auxiliary definitions, for “improvement” and “code discrepancy.” Improvement is a measure of how much the problem-solving performance of a lineage has improved, with greater weight being given to the most recent steps in the lineage. We first compute a normalized vector of changes in problem-solving performance, with improvements represented as 1, declines represented as −1, and repeats of the same value represented as 0.
The overall improvement value is then calculated as the weighted average of the elements of this vector, with the weights produced by the following function (with decay factor δ = 0.1 for the runs described here):

$$w_{g=\text{current-gen}} = 1, \qquad w_{g-1} = w_g \cdot (1 - \delta)$$

Code discrepancy is a measure of the difference between two programs, calculated as the sum, over all unique expressions and sub-expressions in either of the programs, of the difference between the numbers of occurrences of the expression in the two programs. In the context of these definitions we can state the constraints on birth as follows:

5 These constraints, and those mentioned for birth below, are stated using the numerical parameter values that were chosen, more or less arbitrarily, for the runs described here. Other values may perform better, and further study may provide guidance on setting these values or eliminating the parameters altogether.


Prevent birth from lineages with at least a preset threshold number of ancestors (4 here) and an improvement of less than some preset minimum (0.1 here).

Prevent birth from lineages with at least a preset threshold number of ancestors (3 here) and constant discrepancy between parent and child in all generations.

Prevent birth from parents that received disqualifying fitness penalties, e.g. for nontermination or non-production of result values.

Prevent birth of children with sizes outside of the specified legal range (here 10–100 points).

Prevent birth of children that are identical to any of their ancestors.

Prevent birth of children that are identical to potential siblings; for this test the parent program is run a second time to produce an additional child that is used only for this comparison.
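The two auxiliary definitions used by these constraints, improvement and code discrepancy, can be sketched in Python as follows. This is only an illustrative reading of the prose, not the AutoPush source: the function names are hypothetical, programs are assumed to be nested tuples of atoms, performance is assumed to be an error value (lower is better), and the "weighted average" is assumed to be normalized by the sum of the weights.

```python
from collections import Counter


def improvement(errors, decay=0.1):
    """Weighted average of +1/-1/0 performance changes along a lineage.

    `errors` lists the error of each ancestor, oldest first (assumption:
    lower error means better problem-solving performance).
    """
    # +1 for an improvement, -1 for a decline, 0 for a repeated value.
    changes = [(prev > cur) - (prev < cur)
               for prev, cur in zip(errors, errors[1:])]
    if not changes:
        return 0.0
    # The most recent change has weight 1; each step further back in the
    # lineage is multiplied by (1 - decay).
    n = len(changes)
    weights = [(1.0 - decay) ** (n - 1 - i) for i in range(n)]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)


def subexpressions(program):
    """Yield the program itself and every nested item, atoms included."""
    yield program
    if isinstance(program, tuple):
        for item in program:
            yield from subexpressions(item)


def discrepancy(a, b):
    """Sum of differences in occurrence counts over all (sub)expressions."""
    count_a = Counter(subexpressions(a))
    count_b = Counter(subexpressions(b))
    return sum(abs(count_a[e] - count_b[e])
               for e in count_a.keys() | count_b.keys())
```

Under these assumptions a lineage whose error strictly decreases at every step has an improvement of 1, and a program has a discrepancy of 0 with itself, so a "constant discrepancy between parent and child" check would compare the values of `discrepancy` along the lineage.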

4. Preliminary results

While the approach described here has not yet been shown to solve problems that are out of reach of more conventional genetic programming systems (indeed, it is currently weaker than the more-standard PushGP system), it has solved simple problems and produced illuminating data that may help to deepen our understanding. For example, in one run on a symbolic regression problem with the target function $y = x^3 - 2x^2 - x$ AutoPush found a solution that descended from the following randomly generated program:6

((code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult)

6 Space limitations prevent full description of the run parameters or the instruction set; see (Spector et al., 2005) and the source code at http://hampshire.edu/lspector/gptp10 for more information.

While it is difficult to tell from inspection how this program works, even for those experienced in reading Push code, the specific code instructions that are included provide clues about how it constructs children. For example, the code_rand instruction generates new random code, and the code_append instruction combines two pieces of code on the code stack. It is even more revealing to look at the code outputs from several runs of this program. In this case they are all of the form:

(RANDOM-INSTRUCTION (code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult)

where “RANDOM-INSTRUCTION” is some particular randomly chosen instruction. So this program’s reproductive strategy is merely to add a new, random instruction to the beginning of itself. This strategy continues for several generations, with several improvements in problem-solving performance, until something new and interesting happens. In the sixth generation a child is produced with a new list added, rather than just a new instruction, and it also has a new reproductive strategy: it adds something new to the beginning of both of its top-level lists. In other words, the sixth-generation individual is of this form:

(SUB-EXPRESSION-1 SUB-EXPRESSION-2)

where each “SUB-EXPRESSION-n” is a different sub-expression, and the seventh-generation children of this program are all of the form:

((RANDOM-INSTRUCTION-1 (SUB-EXPRESSION-1)) (RANDOM-INSTRUCTION-2 (SUB-EXPRESSION-2)))

where each “RANDOM-INSTRUCTION-n” is some particular randomly chosen instruction. One generation later the problem was solved, by the following program:

((integer_stackdepth (boolean_and code_map)) (integer_sub (integer_stackdepth (integer_sub (in (code_wrap (code_if (code_noop) boolean_fromfloat (2) integer_fromfloat) (code_rand integer_rot) exec_swap code_append integer_mult))))))

This program inherits the altered reproductive strategy of its parent, augmenting both of its primary sub-expressions with new initial instructions in its children. In the run described above the only available code-manipulation instructions were those in the standard Push specification, which are modeled loosely on Lisp list-manipulation primitives. In some runs, however, we have added a “perturb” instruction that changes symbols and constants in a program to other random symbols or constants with a probability derived from an integer popped from the integer stack. Perturb, which was also used in some Pushpop runs, is itself a powerful mutation operator, but its availability does not dictate if or how or where it will be used; for example, it would be possible for an evolved reproductive strategy to use perturb on only one part of its code, or to use it with different probabilities on different parts of its code, or to use it conditionally or in conjunction with other code-manipulation instructions. With the perturb instruction included we have been able to solve somewhat more difficult problems such as the symbolic regression of $y = x^6 - 2x^4 + x^2 - 2$, and


we are actively exploring application to more difficult problems and analysis of the resulting programs and lineages, with the hypothesis that more complex and adaptive reproductive strategies will emerge in the context of more challenging problem environments.
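To make the two kinds of variation discussed in this section concrete, the following Python sketch mimics (outside of Push) the evolved "prepend a random instruction" strategy and a perturb-like operator. It is a loose illustration only: programs are assumed to be nested tuples, `INSTRUCTIONS` is a small hypothetical subset of the instruction set, and the probability is passed in directly because the mapping from the popped integer to a probability is not specified here. Unlike the perturb instruction described above, this toy version does not distinguish symbols from constants.

```python
import random

# Hypothetical subset of the instruction set, for illustration only.
INSTRUCTIONS = ["integer_add", "integer_sub", "integer_mult", "exec_swap"]


def prepend_random(program, rng):
    """Child = parent with one random instruction added at the front."""
    return (rng.choice(INSTRUCTIONS),) + tuple(program)


def perturb(program, probability, rng):
    """Replace each atom by a random instruction with the given probability,
    leaving the nesting structure of the program unchanged."""
    if isinstance(program, tuple):
        return tuple(perturb(item, probability, rng) for item in program)
    return rng.choice(INSTRUCTIONS) if rng.random() < probability else program
```

In AutoPush these behaviours are not supplied as fixed operators; they are expressed (or not) by the evolving programs themselves, which is the point of the autoconstructive approach.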

5. Conclusions

The specific results reported here are preliminary, and the hypothesis that autoconstructive evolution will extend the problem-solving power of genetic programming is still speculative. However, the hypothesis has been refined, the means for testing it have been simplified, the principles that underlie it have been better articulated, and the prospects for analysis of incremental results have been improved. We have shown (again) that mechanisms of adaptive variation can evolve as components of evolving problem-solving systems, and we have described reasons to believe that the best problem-solving systems of the future will make use of some such techniques. Only further experimentation will determine whether and when autoconstructive evolution will become the most appropriate technique for solving difficult problems of practical significance.

Acknowledgments

Kyle Harrington, Paul Sawaya, Thomas Helmuth, Brian Martin, Scott Niekum and Rebecca Neimark contributed to conversations in which some of the ideas used in this work were refined. Thanks also to the GPTP reviewers, to William Josiah Erikson for superb technical support, and to Hampshire College for support for the Hampshire College Institute for Computational Intelligence.

References

Angeline, Peter J. (1995). Adaptive and self-adaptive evolutionary computations. In Palaniswami, Marimuthu and Attikiouzel, Yianni, editors, Computational Intelligence: A Dynamic Systems Perspective, pages 152–163. IEEE Press.
Angeline, Peter J. (1996). Two self-adaptive crossover operators for genetic programming. In Angeline, Peter J. and Kinnear, Jr., K. E., editors, Advances in Genetic Programming 2, chapter 5, pages 89–110. MIT Press, Cambridge, MA, USA.
Barraclough, Timothy G., Birky, C. William Jr., and Burt, Austin (2003). Diversification in sexual and asexual organisms. Evolution, 57:2166–2172.
Beyer, Hans-Georg and Meyer-Nieberg, Silja (2006). Self-adaptation of evolution strategies under noisy fitness evaluations. Genetic Programming and Evolvable Machines, 7(4):295–328.
Diosan, Laura and Oltean, Mihai (2009). Evolutionary design of evolutionary algorithms. Genetic Programming and Evolvable Machines, 10(3):263–306.


Edmonds, Bruce (2001). Meta-genetic programming: Co-evolving the operators of variation. Elektrik, 9(1):13–29. Turkish Journal Electrical Engineering and Computer Sciences.
Eiben, Agoston Endre, Hinterding, Robert, and Michalewicz, Zbigniew (1999). Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):124–141.
Fodor, Jerry and Piattelli-Palmarini, Massimo (2010). What Darwin got wrong. New York: Farrar, Straus and Giroux.
Fontana, Walter (1992). Algorithmic chemistry. In Langton, C. G., Taylor, C., Farmer, J. D., and Rasmussen, S., editors, Artificial Life II, pages 159–210. Addison-Wesley.
Fry, Rodney, Smith, Stephen L., and Tyrrell, Andy M. (2005). A self-adaptive mate selection model for genetic programming. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation, volume 3, pages 2707–2714, Edinburgh, UK. IEEE Press.
Gerhart, John and Kirschner, Marc (2007). The theory of facilitated variation. Proceedings of the National Academy of Sciences, 104:8582–8589.
Kantschik, Wolfgang, Dittrich, Peter, Brameier, Markus, and Banzhaf, Wolfgang (1999). Meta-evolution in graph GP. In Genetic Programming, Proceedings of EuroGP’99, volume 1598 of LNCS, pages 15–28, Goteborg, Sweden. Springer-Verlag.
Koza, John R. (1994). Spontaneous emergence of self-replicating and evolutionarily self-improving computer programs. In Langton, Christopher G., editor, Artificial Life III, volume XVII of SFI Studies in the Sciences of Complexity, pages 225–262. Addison-Wesley, Santa Fe, New Mexico, USA.
Langdon, William B. and Poli, Riccardo (2006). On turing complete T7 and MISC F–4 program fitness landscapes. In Arnold, Dirk V., Jansen, Thomas, Vose, Michael D., and Rowe, Jonathan E., editors, Theory of Evolutionary Algorithms, Dagstuhl, Germany. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany.
MacCallum, Robert M. (2003). Introducing a perl genetic programming system: and can meta-evolution solve the bloat problem? In Genetic Programming, Proceedings of EuroGP’2003, volume 2610 of LNCS, pages 364–373, Essex. Springer-Verlag.
Maynard Smith, John and Szathmáry, Eörs (1999). The origins of life. Oxford: Oxford University Press.
Nordin, Peter and Banzhaf, Wolfgang (1995). Evolving turing-complete programs for a register machine with self-modifying code. In Genetic Algorithms: Proceedings of the Sixth International Conference (ICGA95), pages 318–325, Pittsburgh, PA, USA. Morgan Kaufmann.


Ofria, Charles and Wilke, Claus O. (2004). Avida: A software platform for research in computational evolutionary biology. Artificial Life, 10(2):191–229.
Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).
Ray, Thomas S. (1991). Is it alive or is it GA. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 527–534, University of California - San Diego, La Jolla, CA, USA. Morgan Kaufmann.
Schmidhuber, Jurgen (1987). Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany.
Schmidhuber, Jurgen (2006). Gödel machines: Fully self-referential optimal universal self-improvers. In Goertzel, B. and Pennachin, C., editors, Artificial General Intelligence, pages 119–226. Springer.
Silva, Sara and Dignum, Stephen (2009). Extending operator equalisation: Fitness based self adaptive length distribution for bloat free GP. In Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009, volume 5481 of LNCS, pages 159–170, Tuebingen. Springer.
Sipper, Moshe and Reggia, James A. (2001). Go forth and replicate. Scientific American, 265(2):27–35.
Smits, Guido F., Vladislavleva, Ekaterina, and Kotanchek, Mark E. (2010). Scalable symbolic regression by continuous evolution with very small populations. In Riolo, Rick L., McConaghy, Trent, and Vladislavleva, Ekaterina, editors, Genetic Programming Theory and Practice VIII. Springer.
Spears, William M. (1995). Adapting crossover in evolutionary algorithms. In Proceedings of the Fourth Annual Conference on Evolutionary Programming, pages 367–384. MIT Press.
Spector, Lee (2001). Autoconstructive evolution: Push, pushGP, and pushpop. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pages 137–146, San Francisco, California, USA. Morgan Kaufmann.
Spector, Lee (2002). Adaptive populations of endogenously diversifying pushpop organisms are reliably diverse. In Proceedings of Artificial Life VIII, the 8th International Conference on the Simulation and Synthesis of Living Systems, pages 142–145, University of New South Wales, Sydney, NSW, Australia. The MIT Press.
Spector, Lee, Klein, Jon, and Keijzer, Maarten (2005). The push3 execution stack and the evolution of control. In GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, volume 2, pages 1689–1696, Washington DC, USA. ACM Press.


Spector, Lee and Robinson, Alan (2002a). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
Spector, Lee and Robinson, Alan (2002b). Multi-type, self-adaptive genetic programming as an agent creation tool. In GECCO 2002: Proceedings of the Bird of a Feather Workshops, Genetic and Evolutionary Computation Conference, pages 73–80, New York. AAAI.
Suzuki, Hideaki (2004). Design Optimization of Artificial Evolutionary Systems. Doctor of informatics, Graduate School of Informatics, Kyoto University, Japan.
Tavares, Jorge, Machado, Penousal, Cardoso, Amilcar, Pereira, Francisco B., and Costa, Ernesto (2004). On the evolution of evolutionary algorithms. In Genetic Programming 7th European Conference, EuroGP 2004, Proceedings, volume 3003 of LNCS, pages 389–398, Coimbra, Portugal. Springer-Verlag.
Taylor, Timothy John (1999). From Artificial Evolution to Artificial Life. PhD thesis, Division of Informatics, University of Edinburgh, UK.
Teller, Astro (1994). Turing completeness in the language of genetic programming with indexed memory. In Proceedings of the 1994 IEEE World Congress on Computational Intelligence, volume 1, pages 136–141, Orlando, Florida, USA. IEEE Press.
Vafaee, Fatemeh, Xiao, Weimin, Nelson, Peter C., and Zhou, Chi (2008). Adaptively evolving probabilities of genetic operators. In Seventh International Conference on Machine Learning and Applications, ICMLA ’08, pages 292–299, La Jolla, San Diego, USA. IEEE.
Woodward, John (2003). Evolving turing complete representations. In Proceedings of the 2003 Congress on Evolutionary Computation, pages 830–837, Canberra. IEEE Press.
Yabuki, Taro and Iba, Hitoshi (2004). Genetic programming using a Turing complete representation: recurrent network consisting of trees. In de Castro, Leandro N. and Von Zuben, Fernando J., editors, Recent Developments in Biologically Inspired Computing, chapter 4, pages 61–81. Idea Group Publishing.

Chapter 3

THE RUBIK CUBE AND GP TEMPORAL SEQUENCE LEARNING: AN INITIAL STUDY

Peter Lichodzijewski and Malcolm Heywood

Faculty of Computer Science, Dalhousie University, 6050 University Av., Halifax, NS, B3H 1W5. Canada.

Abstract

The 3 × 3 Rubik cube represents a potential benchmark for temporal sequence learning under a discrete application domain with multiple actions. Challenging aspects of the problem domain include the large state space and a requirement to learn invariances relative to the specific colours present, the latter making it difficult to evolve individuals that learn ‘macro-moves’ relative to multiple cube configurations. An initial study is presented in this work to investigate the utility of Genetic Programming capable of layered learning and problem decomposition. The resulting solutions are tested on 5,000 test cubes, of which specific individuals are able to solve up to 350 (7 percent) cube configurations and population wide behaviours are capable of solving up to 1,200 (24 percent) of the test cube configurations. It is noted that the design options for generic fitness functions are such that users are likely to face either reward functions that are very expensive to evaluate or functions that are very deceptive. Addressing this might well imply that domain knowledge is explicitly used to decompose the task to avoid these challenges. This would augment the generic approach currently employed for layered learning/problem decomposition.

Keywords: bid-based cooperative behaviours, problem decomposition, Rubik cube, symbiotic coevolution, temporal sequence learning.

1. Introduction

Evolutionary Computation as applied to temporal sequence learning problems generally assumes a phylogenetic framework for learning (Barreto et al., 2009). That is to say, policies are evaluated in their entirety on the problem domain before search operators are applied to produce new policies. Conversely, the ontogenetic approach to temporal sequence learning performs incremental refinement over a single candidate solution with respect to each state–action pair

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_3, © Springer Science+Business Media, LLC 2011


(Barreto et al., 2009). The latter is traditionally referred to as reinforcement learning. However, the distinction is often ignored, with reinforcement learning frequently used as a general label for any scenario in which the temporal credit assignment problem/delayed reward exists; not least because algorithms are beginning to appear which combine both phylogenetic and ontogenetic mechanisms of learning (Whiteson and Stone, 2006).1

Examples of the temporal sequence learning problem appear in many forms, from control style formulations in which the goal is to learn a policy for controlling a robot or vehicle to games in which the general objective is to learn a strategy. In this work we are interested in the latter domain, specifically the case of learning a strategy to solve multiple configurations of the 3 × 3 Rubik cube. The problem of learning to solve Rubik cube configurations presents multiple challenges of wider interest to the temporal sequence learning community. Specific examples might include: 1) a large number of states, ranging from trivial to demanding; 2) a problem that is known to challenge human players; 3) a wide variation in start states, which makes the task resilient to the self-play dynamics that might simplify board games such as backgammon (Pollack and Blair, 1998); and 4) a need for generalization in order to learn the invariances/symmetries implicit in the game.

Approaches for finding solutions to scrambled configurations of a Rubik cube fall into one of two general categories: optimal solvers or macro-moves. In the case of solving a cube using a minimal (optimal) number of moves, extensive use is made of lookup tables to provide an exact evaluation function as deployed relative to a game tree summary of the cube state. Thus with respect to the eight corner cubies, the position and orientation of a single cubie is defined by the other 7; or $8! \times 3^7 = 88{,}179{,}840$ combinations.
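The corner-cubie count above is easy to check by direct arithmetic. The short Python sketch below reproduces it and, using the standard counting argument for the full cube (which is not derived in this chapter and is included here as background), also recovers the roughly $4.3 \times 10^{19}$ legal states of the 3 × 3 cube. The function names are illustrative only.

```python
from math import factorial


def corner_configurations():
    # 8! placements of the corner cubies; the twist of the last corner is
    # fixed by the other seven, leaving 3**7 orientation choices.
    return factorial(8) * 3 ** 7


def legal_cube_states():
    # Edges: 12! placements and 2**11 flips (the last flip is determined).
    # The division by 2 reflects the permutation parity shared by corners
    # and edges.
    return corner_configurations() * (factorial(12) // 2) * 2 ** 11


print(corner_configurations())  # 88179840
print(legal_cube_states())      # 43252003274489856000
```

A table with about 88 million entries is small enough to enumerate exhaustively, which is what makes a corner-cubie pattern database practical, whereas the full state count is not.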
An iterative deepening breadth first search would naturally enumerate all such paths between goal and possible configurations for the corner cubies, forming a “pattern database” for later use. Most emphasis is therefore on the utilization of appropriate hash codings and graph symmetries to extend this enumeration over all possible legal states of a cube (in total there are $4.3252 \times 10^{19}$ legal states in a 3 × 3 cube). Such an approach recently identified an upper bound on the number of moves necessary to solve a worst case cube configuration as 26 (Kunkle and Cooperman, 2007). Conversely, non-optimal methods rely on ‘macro-moves’ which establish the correct location for specific cubies without disrupting the location of previously positioned cubies. This is the approach most widely assumed by both human players and ‘automated’ solvers. Such strategies generally take 50 to 100 moves to solve a scrambled cube (Korf, 1997). The advantage this gives is that “general purpose” strategies might result that are appropriate to a wide range of

1 In the following we will use the terms reinforcement and temporal sequence learning interchangeably, particularly where there is a previously established history of terminology, e.g., as in hierarchical reinforcement learning.


scenarios, thus giving hope for identifying machine learning approaches that generalize. However, from the perspective of cube ‘state’ we can also see that once one face of a cube is completed the completion of the remaining faces will increasingly result in periods when the relative entropy of the cube will go up considerably.2 Moreover, from a learning system perspective macro-moves need to be associated with any color combination to be effective, a problem that represents a requirement for learning invariances in a scalable manner.

Two previously published attempts to evolve solutions to the Rubik cube using evolutionary methods have taken rather different approaches to the problem. One attempts to evolve a generic strategy under little a priori information (Baum and Durdanovic, 2000); whereas the second concentrates on independently evolving optimal move sequences to each scrambled cube (El-Sourani et al., 2010), making use of domain knowledge to formulate appropriate constraints and objectives. In this work we assume the motivation of the former, thus the goal is to evolve a program able to provide solutions to as many scrambled cubes as possible. The approach taken by (Baum and Durdanovic, 2000) employed a domain specific language under the Hayek framework for phylogenetic temporal sequence learning. A domain specific representation included the capability to ‘address’ specific faces of the cube and compare content with other faces as well as tests for the number of correct cubies. Two approaches to training were considered, either incrementally increasing the difficulty of cube configurations (e.g., one or two twist modifications relative to a solved cube) with binary (solved/not solved) feedback or cubes with a 100 twist ‘scrambling’ and feedback proportional to the number of correctly placed cubies.
Two different formulations for actions applied to a cube were also considered, with an action space of either three (90 degree turn of the front face, and row or column twists of the cube) or a fixed 3-dimensional co-ordinate frame on which a total of twelve 90 degree turns are applied (the scheme employed here). The most notable result from Hayek relative to this work was that up to 10 cubies could be correctly placed (one face and some of the middle). In effect Hayek was building macro-moves, but could not work through the construction of the remaining cube faces without destroying the work done on the first face. In the following we develop the Symbiotic Bid-Based (SBB) GP framework and introduce a generic approach to layered learning that does not rely on the a priori definition of different goal functions for each ‘layer’ (as per the classical definition of Layered learning (Stone, 2007)). The use of layering is supported by the explicitly symbiotic approach adopted to evolution. A discussion of the domain specific requirements will then be made, with results establishing the

2 Consider the case of completing the final face if all other cubies are correctly positioned.


relative success of the initial approach adopted here, and conclusions discussing future work in the Rubik cube domain.

2. Layered learning in Symbiotic Bid-Based GP

Symbiosis is a process by which symbionts of different species (in this case computer programs) receive sufficient ecological pressure to cooperate in a common host compartment (Heywood and Lichodzijewski, 2010). Over a period of time the symbionts will either develop the fitness of the host or not, as per natural selection. Thus, fitness evaluation takes place at the level of hosts not at the level of individual symbionts, or a serial dependence between host and symbionts (Figure 3-1). In the case of this work, hosts are represented by an independent population (a Genetic Algorithm in this case), with each host individual defining a compartment by indexing a subset of individuals from an independent symbiont population (Figure 3-1). However, rather than symbionts from the same host having their respective outcomes combined in some form of a voting policy, as in ensemble methods, we explicitly require each symbiont to learn the specific context in which it operates. To do so, symbionts assume the bid-based GP framework (Lichodzijewski and Heywood, 2007). Thus, each symbiont consists of a program and a scalar. The program is used to evolve a bidding strategy and the scalar expresses a domain dependent action, say class label or ‘turn right’. The program evolves whereas the action does not. Within the context of a host individual each symbiont executes its program on the current state of the world/training instance. The symbiont with the largest bid wins the right to present its action as the outcome from that host under the current state. Under a reinforcement learning domain this action would update the state of the world and the process repeats, with a new round of bidding between symbionts from the same host w.r.t. the updated state of the world. Fitness evaluation is performed over the worlds/training instances defined by the point population (Algorithm 1, Step 10).
Competitive coevolution therefore facilitates the development of point and host populations, with co-operative coevolution developing the interaction between symbionts within a host. Competitive coevolution again appears between hosts in the host population (speciation) to maintain host diversity. This latter point is deemed particularly important in supporting ‘intrinsic motivation’ in the behaviours evolved,3 where this represents a central tenet for hierarchical reinforcement learning in general (Oudeyer et al., 2007).
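The bidding mechanism within a single host can be sketched as follows. This is a minimal illustration under stated assumptions, not the SBB implementation: a symbiont is modelled as a pair of a bid function (standing in for the evolved program) and a fixed action label, a host is simply a list of symbionts, and `step` is a hypothetical environment function that applies an action to a state.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Symbiont:
    bid: Callable[[Sequence[float]], float]  # stands in for the evolved program
    action: str                               # fixed, domain-dependent action


def host_action(host, state):
    """Return the action of the symbiont that bids highest on `state`."""
    winner = max(host, key=lambda symbiont: symbiont.bid(state))
    return winner.action


def run_episode(host, state, step, max_steps):
    """Repeat bidding rounds, applying each winning action to the state."""
    trace = []
    for _ in range(max_steps):
        action = host_action(host, state)
        trace.append(action)
        state = step(state, action)
    return state, trace
```

Only the winning symbiont's action is used in each round, so, unlike a voting ensemble, each bid function has to learn when its own action is appropriate.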

3 Intrinsic motivations or goals are considered to be those central to supporting the existence of an organism. In addition to behaviour diversity, the desire to reproduce is considered an intrinsic motivation/ goal. Conversely, ‘extrinsic motivations’ are secondary factors that might act in support of the original intrinsic factors such as food seeking behaviours, where these are learnt during the lifetime of the organism and might be specific to that particular organism.


Figure 3-1. Generic architecture of Symbiotic Bid-Based GP (SBB). A point population represents the subset of training scenarios over which a training epoch is performed. The host population conducts a combinatorial search for the best symbiont partnerships; whereas the symbiont population contains the bid-based GP individuals who attempt to learn a good context for their corresponding actions.

Algorithm 1 The core SBB training algorithm. $P^t$, $H^t$, and $S^t$ refer to the point, host, and symbiont populations at time $t$.
1: procedure Train
2:   $t = 0$   (Initialization)
3:   initialize point population $P^t$
4:   initialize host population $H^t$ (and symbiont population $S^t$)
5:   while $t \le t_{max}$ do   (Main loop)
6:     create new points and add to $P^t$
7:     create new hosts and add to $H^t$ (add new symbionts to $S^t$)
8:     for all $h_i \in H^t$ do
9:       for all $p_k \in P^t$ do
10:        evaluate $h_i$ on $p_k$
11:      end for
12:    end for
13:    remove points from $P^t$
14:    remove hosts from $H^t$ (remove symbionts from $S^t$)
15:    $t = t + 1$
16:  end while
17: end procedure
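The control flow of Algorithm 1 can be written as a short Python skeleton. This is a sketch of the loop structure only: every callback (`init_points`, `new_hosts`, `evaluate`, `cull_points`, and so on) is a hypothetical placeholder for the domain-specific operators, and the bookkeeping of the symbiont population is omitted for brevity.

```python
def train(init_points, init_hosts, new_points, new_hosts,
          evaluate, cull_points, cull_hosts, t_max):
    """Skeleton of the point/host coevolution loop (Steps refer to Algorithm 1)."""
    t = 0                                             # Step 2
    points = list(init_points())                      # Step 3
    hosts = list(init_hosts())                        # Step 4
    while t <= t_max:                                 # Step 5
        points = points + list(new_points(points))    # Step 6
        hosts = hosts + list(new_hosts(hosts))        # Step 7
        outcomes = {}
        for i, host in enumerate(hosts):              # Steps 8-12
            for k, point in enumerate(points):
                outcomes[(i, k)] = evaluate(host, point)
        # Both culling decisions see the same pre-removal populations,
        # so the (i, k) indices in `outcomes` remain valid for each.
        kept_points = cull_points(points, hosts, outcomes)   # Step 13
        kept_hosts = cull_hosts(points, hosts, outcomes)     # Step 14
        points, hosts = list(kept_points), list(kept_hosts)
        t += 1                                        # Step 15
    return points, hosts
```

Passing the full outcome matrix to both culling functions reflects the competitive coevolutionary setup, in which point fitness and host fitness are both derived from the same host-on-point evaluations.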


The above Symbiotic Bid-Based GP or ‘SBB’ framework, as summarized by Figure 3-1 and Algorithm 1, provides a natural scheme for layered learning by letting the content of the (converged) host population represent the actions for a new set of symbionts in a second application of the SBB algorithm; hereafter ‘Layered SBB’. The association between the next population of symbionts and the earlier population of hosts is explicitly hierarchical. However, there is no explicit requirement to re-craft fitness functions at each layering (although this is also possible). Instead, the reapplication of the SBB algorithm results in a second layer of hosts that learn how to combine previously learnt behaviours in specific contexts. The insight behind this is that SBB bidding policies under a temporal sequence learning domain are effectively evolving the conditions under which an action begins and ends its deployment. This is the general goal of hierarchical reinforcement learning. However, the SBB framework achieves this without also requiring an a priori formulation of the appropriate subtasks, the relation between subtasks, or a modified credit assignment policy; as is generally the case under hierarchical reinforcement learning (Oudeyer et al., 2007). In the following we summarize the core SBB algorithm, where this extends the original SBB framework presented in (Lichodzijewski and Heywood, 2008) and was applied elsewhere in a single layer supervised learning context (Lichodzijewski and Heywood, 2010a); the reader is referred to the latter for additional details regarding host–symbiont variation operators.
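The layering idea can be illustrated with a small Python sketch. It assumes, purely for illustration, that a symbiont is a `(bid_function, action)` pair and a host is a list of such pairs; in the second layer the `action` is taken to be an index into the list of first-layer hosts. None of these names come from the SBB code.

```python
def select(host, state):
    """Return the (bid, action) pair whose bid function scores highest."""
    return max(host, key=lambda symbiont: symbiont[0](state))


def layered_action(level2_host, level1_hosts, state):
    """Resolve a primitive action through two rounds of bidding."""
    # A second-layer symbiont's action names a first-layer host ...
    _, host_index = select(level2_host, state)
    # ... and that first-layer host then runs its own bidding round.
    _, primitive_action = select(level1_hosts[host_index], state)
    return primitive_action
```

In this reading the second layer never proposes primitive actions itself; it only learns, through its bids, in which contexts each previously evolved behaviour should be deployed.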

Point Population

As indicated in the above generic algorithm description, a competitive coevolutionary relationship is assumed between the point and host populations (Figure 3-1). Specifically, variation in the point population supports the necessary development in the host population. This implies that points have a fitness and are subject to variation operators. Thus, points are created in two phases on account of assuming a breeder style of replacement in which the worst Pgap points are removed (Step 13; hereafter all references to specific ‘Steps’ are w.r.t. Algorithm 1) and a corresponding number of new points are introduced (Step 6) at each generation. New points are created under one of two paths. Either a point is created as per the routine utilized at initialization (no concept of a parent point) or offspring are initialized relative to a parent point, with the parent selected under fitness proportional selection. The relative frequency of each point creation scheme is defined by a corresponding probability, pgenp. Discussion of the point population variation operators is necessarily application dependent, and is therefore presented later (Section 3).

The evaluation function of Step 10 assumes the application of a domain specific reward that is a function of the interaction between point (pk) and host (hi) individuals, or G(hi, pk). This is therefore defined later (Equation (3.6), Section 3) as a weighted distance relative to the ideal target state. The global / base point fitness, fk, may now be defined relative to the count of hosts, ck, within a neighbourhood (Lichodzijewski and Heywood, 2010a), or

$$ f_k = \begin{cases} 1 + \dfrac{1 - c_k}{H_{size}} & \text{if } c_k > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.1} $$

where Hsize is the host population size, and count ck is relative to the arithmetic mean μk of outcomes on point pk, or

$$ \mu_k = \frac{\sum_{h_i} G(h_i, p_k)}{H_{size}} \tag{3.2} $$

where μk → 0 implies that hosts are failing on point pk and ck is set to zero. Otherwise, ck is defined by the number of hosts satisfying G(hi, pk) ≥ μk; that is, the number of hosts with an outcome reaching the mean performance on point pk.

Equation (3.1) establishes the global fitness of a point. However, unlike classification problem domains, points frequently have context under reinforcement learning domains, i.e., a geometric interpretation. This enables us to define a local factor by which the global reward is modulated in proportion to the relative ‘local’ uniqueness of the candidate point. Specifically, each point is rewarded in proportion to the distance from the point to a subset of its nearest neighbours using ideas from outlier detection (Harmeling et al., 2006). To do so, all the points are first normalized by the maximum pair-wise Euclidean distance (as estimated across the point population content, therefore limiting local reward to the unit interval), after which the following reward scheme is adopted:

1. The set of K points nearest to pk is identified;

2. The local reward rk is calculated as,

$$ r_k = \left( \frac{\sum_{p_l} \left( D(p_k, p_l) \right)^2}{K} \right)^2 \tag{3.3} $$

where the summation is taken over the set of K points nearest to pk and D(·, ·) is the application specific distance function (Equation (3.7), Section 3).

3. The corresponding final fitness for point pk is defined in terms of both global and local rewards, or

$$ f'_k = f_k \cdot r_k \tag{3.4} $$


With the normalized fitness fk established we can now delete the worst performing Pgap points (Step 13).
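The point fitness calculation of Equations (3.1) to (3.4) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and argument layout are hypothetical, distances are assumed to be already normalized to the unit interval, and Equation (3.3) is read as the squared mean of the squared distances to the K nearest points.

```python
def point_fitness(outcomes, distances, k_nearest):
    """Return the final fitness of every point.

    outcomes[i][k]  : G(h_i, p_k), outcome of host i on point k
    distances[k][l] : normalized distance D(p_k, p_l) in [0, 1]
    k_nearest       : K, the number of nearest neighbours considered
    """
    h_size = len(outcomes)
    n_points = len(outcomes[0])
    fitness = []
    for k in range(n_points):
        column = [outcomes[i][k] for i in range(h_size)]
        mu_k = sum(column) / h_size                        # Equation (3.2)
        # c_k counts hosts reaching the mean; zero when all hosts fail.
        c_k = sum(1 for g in column if g >= mu_k) if mu_k > 0 else 0
        f_k = 1.0 + (1.0 - c_k) / h_size if c_k > 0 else 0.0   # Equation (3.1)
        # Distances to the K nearest other points.
        nearest = sorted(d for l, d in enumerate(distances[k]) if l != k)
        nearest = nearest[:k_nearest]
        if nearest:
            r_k = (sum(d * d for d in nearest) / len(nearest)) ** 2  # Eq. (3.3)
        else:
            r_k = 0.0
        fitness.append(f_k * r_k)                          # Equation (3.4)
    return fitness
```

Under this reading a point scores highly when few hosts reach the mean outcome on it (it discriminates between hosts) and when it lies far from its neighbours (it is locally unique). The Pgap points with the lowest values would then be the ones removed at Step 13.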

Host and Symbiont Population

Hosts are also subject to the removal and addition of a fixed number of Hgap individuals per generation, Steps 14 and 7 respectively. However, in order to also promote diversity in the host population behaviours, we assume a fitness sharing formulation. Thus, the shared fitness, si, of host hi has the form,

$$ s_i = \sum_{p_k} \left( \frac{G(h_i, p_k)}{\sum_{h_j} G(h_j, p_k)} \right)^3 \tag{3.5} $$

Thus, for point pk the shared fitness score si re-weights the reward that host hi receives on pk relative to the reward on the same point as received by all hosts. As per the earlier comments regarding the role of fitness sharing in supporting ‘intrinsic motivation,’ a strong bias for diversity is provided through the cubic power. Evaluation takes place at Step 10, thus all hosts, hi , are evaluated on all points, pk . Once the shared score for each host is calculated, the Hgap lowest ranked hosts are removed. Any symbionts that are no longer indexed by hosts are considered ineffective and are therefore also deleted. Thus, the symbiont population size may dynamically vary, with variation operators having the capacity to add additional symbionts (Lichodzijewski and Heywood, 2010a), whereas the point and host populations are of a fixed size.
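A minimal sketch of the shared fitness of Equation (3.5) follows. The function name and the list-of-lists layout are assumptions made for illustration; points on which every host scores zero are skipped to avoid a division by zero, which is also an assumption since the text does not discuss that case.

```python
def shared_fitness(outcomes):
    """Shared fitness s_i for every host, given outcomes[i][k] = G(h_i, p_k)."""
    n_hosts = len(outcomes)
    n_points = len(outcomes[0])
    # Total reward collected on each point by the whole host population.
    totals = [sum(outcomes[i][k] for i in range(n_hosts))
              for k in range(n_points)]
    scores = []
    for i in range(n_hosts):
        s_i = sum((outcomes[i][k] / totals[k]) ** 3
                  for k in range(n_points) if totals[k] > 0)
        scores.append(s_i)
    return scores
```

For two hosts with outcomes `[[1, 1], [1, 0]]` the first host receives 0.5³ + 1³ = 1.125 and the second 0.5³ = 0.125: the point solved by only one host contributes eight times more than the point both hosts solve, which is the diversity bias that the cubic power provides.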

3. Domain specific design decisions

Cube representation and actions

The representation assumed directly indexes all 54 facelets comprising the 3 × 3 Rubik cube. Indexing is sequential, beginning at the centre face, with cubie colours differentiated in terms of integers over the interval [0, ..., 5]. Such a scheme is simplistic, with no explicit support for indicating which facelets are connected to make corners or edges. Actions in layer 0 define a 90 degree clockwise or counter-clockwise twist of each face; there are 6 faces, resulting in a total of 12 actions. When additional layers are added under SBB, the population of host behaviours from the previous layer represents the set of candidate actions. As such, additional layers attempt to evolve new contexts for previously evolved behaviours, i.e., to build larger macro-moves.
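A facelet-level cube model with 12 quarter-twist actions can be sketched as follows. This is an illustrative construction, not the authors' encoding: the facelet ordering here is derived from 3-D coordinates rather than from the sequential indexing described above, and all names (`SLOTS`, `MOVES`, `count_by_depth`) are hypothetical. As a sanity check, a breadth-first enumeration from the solved cube can be compared against the 12, 114 and 1,068 unique configurations quoted for 1, 2 and 3 quarter twists in Section 4.

```python
from itertools import product

def rotate(vec, axis, direction):
    """Rotate an integer 3-vector by 90 degrees about `axis` (direction = +1 or -1)."""
    b, c = (axis + 1) % 3, (axis + 2) % 3
    out = list(vec)
    out[b] = -direction * vec[c]
    out[c] = direction * vec[b]
    return tuple(out)

def unit(axis, sign):
    """Unit normal of the face lying at `sign` along `axis`."""
    v = [0, 0, 0]
    v[axis] = sign
    return tuple(v)

# Six faces, each identified by (axis, sign).
FACES = [(axis, sign) for axis in range(3) for sign in (1, -1)]

# A facelet slot is (cubie position, outward normal); 9 slots per face, 54 in all.
SLOTS = [(pos, unit(axis, sign))
         for axis, sign in FACES
         for pos in product((-1, 0, 1), repeat=3)
         if pos[axis] == sign]
INDEX = {slot: i for i, slot in enumerate(SLOTS)}

# Solved cube: every facelet carries the colour (0..5) of its face.
SOLVED = tuple(i // 9 for i in range(54))

def make_move(axis, sign, direction):
    """Permutation for a quarter twist of one face; perm[dst] = src."""
    perm = list(range(54))
    for src, (pos, normal) in enumerate(SLOTS):
        if pos[axis] == sign:  # facelet belongs to the twisted layer
            dst = INDEX[(rotate(pos, axis, direction),
                         rotate(normal, axis, direction))]
            perm[dst] = src
    return tuple(perm)

# 6 faces x 2 directions = 12 layer-0 actions.
MOVES = [make_move(axis, sign, d) for axis, sign in FACES for d in (1, -1)]

def apply_move(state, perm):
    return tuple(state[src] for src in perm)

def count_by_depth(max_depth):
    """Number of new configurations first reached at each quarter-twist depth."""
    seen = {SOLVED}
    frontier = [SOLVED]
    counts = []
    for _ in range(max_depth):
        next_frontier = []
        for state in frontier:
            for perm in MOVES:
                new_state = apply_move(state, perm)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.append(new_state)
        frontier = next_frontier
        counts.append(len(frontier))
    return counts
```

Storing each action as a fixed permutation of the 54 facelet indices keeps a twist to a single tuple comprehension, which matters when every host is evaluated on every point at each generation.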


Reward and distance functions

The reward function applies a simple weighting scheme to the number of quarter turn twists (i.e., actions) necessary to move the final cube state to a solved cube. Naturally, such a test becomes increasingly expensive as the number of moves applied in the ‘search’ about the final cube state increases. Hence, the search is limited to testing for up to 2 moves away from the solution, resulting in the following reward function,

$$ G(h_i, p_k) = \frac{1}{\left(1 + D(s_f, s^*)\right)^2} \tag{3.6} $$

where sf is the final state of the cube relative to cube configuration pk and the sequence of moves defined by host hi; s∗ is the ideal solved cube configuration; and D(s2, s1) defines the weighted distance function, or

$$ D(s_2, s_1) = \begin{cases} 0, & \text{when 0 quarter twists match state } s_2 \text{ with } s_1 \\ 1, & \text{when 1 quarter twist matches state } s_2 \text{ with } s_1 \\ 4, & \text{when 2 quarter twists match state } s_2 \text{ with } s_1 \\ 16, & \text{when} > 2 \text{ quarter twists match state } s_2 \text{ with } s_1 \end{cases} \tag{3.7} $$

Naturally, curtailing the ‘look-ahead’ to 2 quarter turn twists from the presented solution casts the fitness function into that of a highly deceptive ‘needle in a haystack’ style reward i.e., feedback is only available when you have all but provided a perfect solution. Adding additional twist tests however would result in tens of thousands of cube combinations potentially requiring evaluation before fitness could be defined. Other functions such as counting the number of correct facelets or cube entropy generally appeared to be less informative. The utility of combined metrics or a priori defined constraints might be of interest in future work.
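The two-move look-ahead behind Equations (3.6) and (3.7) can be sketched generically. The function names and the `actions` / `apply_action` interface are hypothetical; the sketch works for any state space with comparable states, and the usage below exercises it on a toy integer line rather than on a cube.

```python
def weighted_distance(state, goal, actions, apply_action):
    """Equation (3.7): 0, 1, 4 or 16 depending on the quarter twists needed."""
    if state == goal:
        return 0
    one_step = [apply_action(state, a) for a in actions]
    if goal in one_step:
        return 1
    for intermediate in one_step:
        for a in actions:
            if apply_action(intermediate, a) == goal:
                return 4
    return 16  # more than two twists away: no finer feedback is available


def reward(final_state, goal, actions, apply_action):
    """Equation (3.6): G = 1 / (1 + D)^2."""
    d = weighted_distance(final_state, goal, actions, apply_action)
    return 1.0 / (1.0 + d) ** 2
```

With integer states and the two actions +1 and −1, `reward` returns 1.0, 0.25 and 0.04 for states 0, 1 and 2 steps from the goal, and 1/289 for anything further away. With 12 cube actions the inner search costs at most 12 + 144 move applications per evaluation, which is why each extra level of look-ahead is expensive.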

Symbiont representation

Symbionts take the form of a linear GP representation, with the instruction set for the Bid-Based GP individuals consisting of the following generic set of operators: {+, −, ×, ÷, ln(·), cos(·), exp(·), if}. The conditional operator ‘if’ applies an inequality operator to two registers and interchanges the sign of the first register if its value is smaller than the second. There are always 8 registers and a maximum of 24 instructions per symbiont.
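A register machine over this operator set might look like the sketch below. Only the operator set, the ‘if’ semantics, the 8 registers and the 24-instruction limit come from the text. Everything else is an assumption made for illustration: the instruction tuple layout, the protected forms of ÷, ln and exp, the reset of non-finite results to zero, and the use of register 0 as the program output.

```python
import math

NUM_REGISTERS = 8
MAX_PROG_SIZE = 24

def execute(program, inputs):
    """Run a linear program and return register 0 (assumed to hold the bid).

    Each instruction is (op, dst, src_kind, src): the target register `dst`
    is combined with either another register (src_kind == "reg") or an
    input attribute (src_kind == "in").
    """
    regs = [0.0] * NUM_REGISTERS
    for op, dst, src_kind, src in program[:MAX_PROG_SIZE]:
        a = regs[dst]
        b = regs[src] if src_kind == "reg" else float(inputs[src])
        if op == "+":
            r = a + b
        elif op == "-":
            r = a - b
        elif op == "*":
            r = a * b
        elif op == "/":
            r = a / b if b != 0 else a            # protected division (assumption)
        elif op == "ln":
            r = math.log(abs(b)) if b != 0 else a  # protected log (assumption)
        elif op == "cos":
            r = math.cos(b)
        elif op == "exp":
            try:
                r = math.exp(b)
            except OverflowError:
                r = a                              # leave register unchanged
        elif op == "if":
            r = -a if a < b else a                 # flip sign when a < b
        else:
            raise ValueError("unknown operator: %r" % (op,))
        regs[dst] = r if math.isfinite(r) else 0.0
    return regs[0]
```

For example, the three-instruction program `[("+", 0, "in", 0), ("*", 0, "in", 1), ("if", 0, "in", 2)]` applied to inputs `[2, 3, 10]` leaves 6 in register 0 after the multiplication and then negates it because 6 < 10.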

Point initialization and offspring

Initialization of points, i.e., the cube configurations used during evolution (Step 3), takes the form of: (1) uniform sampling from the interval [1, ..., 10] to define the number of twists applied to a solved cube; (2) stochastic selection of the sequence of quarter twist actions used to ‘scramble’ the cube; and (3) a test for a return to the solved cube configuration (in which case the quarter twist step is repeated). Thereafter, new points introduced during breeding (Step 6) follow one of two scenarios: adding twists to a parent point to create a child with probability pgenp, or creating a new point as per the aforementioned point initialization algorithm with probability 1 − pgenp. The parent-wise creation of point offspring is governed by the following process:

1. Select parent point, pi ∈ P^t, under fitness proportional selection (point fitness defined by Equation (3.4), Section 2);

2. Define the number of additional twists, wi, applied to create the child from the parent in terms of a normal p.d.f., or

$$ w_i = \mathrm{abs}\left(N(0, \sigma_{genTwist})\right) + 1 \tag{3.8} $$

where N(0, σgenTwist) is a normal p.d.f. with zero mean and variance σgenTwist. Naturally, this is rounded to the nearest integer value;

3. Until the twist limit (wi) is reached, select faces and clockwise / counter-clockwise twists with uniform probability relative to the parent cube configuration, pi;

4. Should the resulting cube be a solved cube, the previous step is repeated.
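The point creation process can be sketched as below, representing a point by the sequence of quarter twists that scrambles a solved cube. This is a simplified illustration: the names are hypothetical, the re-test for an accidentally solved cube is omitted because it needs a cube model, and σgenTwist is treated as a variance as stated in the text, so its square root is passed to the Gaussian sampler.

```python
import math
import random

# 12 layer-0 actions: (face index, twist direction).
ACTIONS = [(face, direction) for face in range(6) for direction in (1, -1)]

def extra_twists(rng, variance=3.0):
    """Equation (3.8): w = round(|N(0, variance)|) + 1, so w >= 1."""
    sample = rng.gauss(0.0, math.sqrt(variance))
    return int(round(abs(sample))) + 1

def initial_point(rng, max_twists=10):
    """Scramble of 1..max_twists uniformly chosen quarter twists."""
    n_twists = rng.randint(1, max_twists)
    return tuple(rng.choice(ACTIONS) for _ in range(n_twists))

def select_parent(rng, points, fitness):
    """Fitness proportional selection, uniform if all fitness is zero."""
    if sum(fitness) <= 0:
        return rng.choice(points)
    return rng.choices(points, weights=fitness, k=1)[0]

def new_point(rng, points, fitness, p_genp=0.9, variance=3.0):
    """Child of a selected parent with probability p_genp, else a fresh point."""
    if rng.random() < p_genp:
        parent = select_parent(rng, points, fitness)
        w = extra_twists(rng, variance)
        return parent + tuple(rng.choice(ACTIONS) for _ in range(w))
    return initial_point(rng)
```

Because at least one twist is always appended, a child is never identical to its parent as a twist sequence, although it may still correspond to a cube state closer to solved than the parent if the added twists undo earlier ones.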

4. Results

Parameterization

Runs are performed over 60 initializations for both Layered SBB (two layers) and the single layer SBB base cases. The latter are parameterized to provide the same number of fitness evaluations and the same upper bound on the number of instructions executed as the total Layered SBB requirement. In this work this implies a limit of 72,000 evaluations or a maxProgSize limit of 36 under the single layer baseline, hereafter ‘big prog’. Similar reasoning brings about a team size limit (ω) of 36 under the single layer SBB baseline, hereafter ‘big team’. Relative to the sister work in which the current SBB formulation was applied to data sets from the supervised learning domain of classification (Lichodzijewski and Heywood, 2010a), three additional parameters are introduced for point generation (Section 2): (1) the outlier parameter K = 13; (2) the probability of creating points from a parent, pgenp = 0.9; and (3) the variance defining the number of additional twists necessary to create an offspring from a parent point, σgenTwist = 3. All other parameters are unchanged relative to those of the classification study (Table 3-1).


Table 3-1. Parameterization at Host (GA) and Symbiont (GP) populations. As per Linear GP, a fixed number of general purpose registers are assumed (numRegisters) and variable length programs subject to a max. instruction count (maxProgSize).

Host (solution) level
Parameter          Value     Parameter          Value
tmax               1 000     ω                  24
Psize, Hsize       120       Pgap, Hgap         20, 60
pmd                0.7       pma                0.7
pmm                0.2       pmn                0.1

Symbiont (program) level
Parameter          Value     Parameter          Value
numRegisters       8         maxProgSize        24
pdelete, padd      0.5       pmutate, pswap     1.0

Sampled Test Set

Post training test performance is evaluated w.r.t. 5,000 unique ‘random’ test cubes, created as per the point initialization algorithm. Table 3-2 summarizes the distribution of cubes relative to the number of twists used to create them. A combined violin / quartile box plot is then used to express the total number of cube configurations solved. Figures 3-2 and 3-3 summarize this in terms of a single champion individual from each run⁴ and the corresponding cumulated population wide performance. It is immediately apparent that the population wide behaviour (Figure 3-3) provides a significant source of useful diversity relative to that of the corresponding individual-wise performance (Figure 3-2). This is a generic property of the fitness sharing implicit in the base SBB algorithm; Equation (3.5). However, it is also clear that under SBB 1, in which second layer symbionts assume the hosts from layer 0 as their actions, the champion individuals are unable to directly build on the cumulative population wide behaviour from SBB 0. Conversely, under real-valued reinforcement problem domains such as the truck backer-upper (Lichodzijewski and Heywood, 2010b), SBB 1 individuals were capable of producing champions that subsumed the SBB 0 population-wise performance. We attribute this to the more informative fitness function available under the truck backer-upper domain than that available under the Rubik cube. Relative to the non-layered SBB base cases, no real trend appears under the individual-wise performance (Figure 3-2). Conversely, under the cumulative population wide behaviour (Figure 3-3), SBB 1 provides a significant improvement as measured in terms of a two-tailed Mann-Whitney test with 0.01 significance level (Table 3-3), effectively identifying the most consistently effective solutions. This appears to indicate that Layered SBB is able to build configuration specific sub-sets of Rubik cube solvers; that is to say, the strategies for solving cube configurations are not colour invariant. Specifically, the macro moves learnt at SBB 0 cannot be generalized over all permutations of cube faces. Thus, at SBB 1, subsets of hosts from SBB 0 can be usefully combined. However, this only results in the median performance improving by approximately 50 (200) test cases between layers 0 and 1 under single champion (respectively population-wise) test counts. Overall, neither increasing the instruction count limit per symbiont nor increasing the maximum limit on the number of symbionts per host is as effective as layering at leveraging the performance from individual-wise to population wide performance.

⁴ Identified post training on an independent validation set generated as per the stochastic process used to identify the independent test set.

Table 3-2. Distribution of test cases. Samples selected over 1 to 10 random twists relative to solved cube resulting in 5,000 unique test configurations.

Number of twists    # of test cases    Number of twists    # of test cases
1                   9                  6                   662
2                   86                 7                   640
3                   403                8                   728
4                   527                9                   673
5                   588                10                  683

Figure 3-2. Total test cases solved by single best individual per run under SBB with and without layering under the stochastic sampling of 5,000 1 to 10 twist cubes. ‘SBB 0’ and ‘SBB 1’ denote first and second layer Layered SBB solutions. ‘big team’ and ‘big prog’ represent single layer SBB runs with either larger host or symbiont instruction limits.

Figure 3-3. Total test cases solved by cumulated population wide performance per run under SBB with and without layering under the stochastic sampling of 5,000 1 to 10 twist cubes. ‘SBB 0’ and ‘SBB 1’ denote first and second layer Layered SBB solutions. ‘big team’ and ‘big prog’ represent single layer SBB runs with either larger host or symbiont instruction limits.

Table 3-3. Two-tailed Mann-Whitney test comparing total solutions under the Sampled Test Set provided by Layered SBB (second level) against single layer SBB parameterizations (big team (SBB-bt) and big program (SBB-bp)). The table reports p-values for the pair-wise comparison of distributions from Figures 3-2 and 3-3. Cases where the Layered SBB medians are higher (better) than non-layered SBB medians are noted with a .

Test Case                  Champion individual    Population wide
Layered SBB vs SBB-0       0.002499               3.11e-15
Layered SBB vs SBB-bt      0.1519                 1.003e-10
Layered SBB vs SBB-bp      0.5566                 0.0002617
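For readers who wish to run the same style of comparison on their own per-run counts, a two-tailed Mann-Whitney test can be sketched in pure Python as below. The chapter does not state how its p-values were computed; this sketch assumes the normal approximation with tie and continuity corrections, which is reasonable for samples of 60 runs but not for very small samples. The function name is hypothetical.

```python
import math

def mann_whitney_two_tailed(x, y):
    """Return (U statistic of x, two-tailed p-value) via normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Items i..j-1 are tied and share the average of ranks i+1..j.
        average_rank = (i + 1 + j) / 2.0
        group = j - i
        tie_term += group ** 3 - group
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u_x, 1.0  # every observation tied: no evidence of a difference
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_value = math.erfc(z / math.sqrt(2.0))
    return u_x, p_value
```

A comparison would be declared significant at the level used in the text when the returned p-value falls below 0.01.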

Exhaustive test set

A second test set is designed consisting of all 1, 2 and 3 quarter twist cube configurations, consisting of 12, 114 and 1,068 unique test cubes respectively.⁵ Naturally, there is no a priori bias towards solving these during training, cubes being configured stochastically relative to points selected under fitness proportional selection. Figure 3-4 summarizes this as a percentage of the number of 1, 2 and 3 twist configurations solved by the single best individual in each run.⁶ The impact of layering is again evident, both from a consistency perspective and in terms of incremental improvements to the number of cases solved with each additional layer. Relative to the baseline single layer models, it is interesting to note that both ‘SBB big team’ and ‘SBB big prog’ had difficulty consistently solving the 1 twist configurations, whereas all SBB 1 first quartile performance corresponds to all test cases solved. Of the two baseline configurations, ‘SBB big prog’ was again the more effective, implying that more complexity in the symbionts was more advantageous than larger host–symbiont capacity.

Finally, we can also review the (mean) number of twists used to provide solutions to each test configuration (Figure 3-5). The resulting distributions are grouped by the original twist count. The move counts are averaged over all cases solved by an individual; thus, although some, say, 1 twist test cases might be solved in one twist, cases that used three moves would naturally increase the average move count above the ideal. Application of a two-tailed Mann-Whitney test indicates that the ‘SBB 1’ move counts are lower than the ‘SBB-bp’ (‘big program’) move counts on 2- and 3-twist test cases at a 0.01 significance level (Table 3-4). Thus, although Layered SBB and SBB big program solved a similar total number of test cases (Figure 3-4), Layered SBB is able to solve them using a statistically significantly lower number of moves. Conversely, SBB big team was not able to solve as many test cases, but when it did provide solutions, a similar number of moves as Layered SBB were used.

⁵ These counts are somewhat lower than those reported in (Korf, 1997) because we do not include 180° twists in the set of permitted actions.

⁶ The same ‘champion’ individual as identified under the aforementioned validation sample a priori to application of the sampled test set.

Figure 3-4. Percent of cases solved by single best SBB individuals as estimated under the exhaustive enumeration of 1, 2 and 3 quarter twist test cases. SBB 0 and SBB 1 denote the first and second layer solutions under Layered SBB; ‘big team’ and ‘big prog’ denote the base case SBB configurations without layering.

Figure 3-5. Number of moves used by champion individual to solve 1-, 2- and 3-twist points. ‘SBB 1’ is the second layer from Layered SBB, ‘SBB-bt’ and ‘SBB-bp’ denote the corresponding single layer SBB big team and big program parameterizations.

Figure 3-6. Number of symbionts per host over SBB runs.

Table 3-4. Two-tailed Mann-Whitney test results comparing solution move counts for champion individuals with Layered SBB (second level) against single layer SBB parameterizations (big team (SBB-bt) and big program (SBB-bp)). The table reports p-values for the pair-wise comparison of distributions from Figure 3-5. Cases where the single layer SBB medians are higher (worse) than Layered SBB medians are noted with a .

Test Case                  1-twist     2-twist      3-twist
Layered SBB vs SBB-bt      0.4976      0.1374       0.0534
Layered SBB vs SBB-bp      0.02737     0.001951     0.0007957

Figure 3-7. Number of instructions per host over SBB runs.

Model complexity

Finally, we can also consider model complexity, post intron removal. Relative to the typical number of symbionts utilized per host (Figure 3-6), layer 0 clearly utilizes more symbionts per host than layer 1. This implies that at layer 1 there are 5 to 8 hosts from layer 0 being utilized. As indicated in Section 2, this is possible because each of the hosts from layer 0 is now associated with a symbiont bidding behaviour as evolved at layer 1. Further analysis will be necessary to identify what the specific patterns of behaviour associated with these combinations of hosts represent. Both base cases appear to use more symbionts per host, which is understandable given that they do not have the capacity to make use of additional layers. The same bias towards simplicity again appears relative to instruction count (Figure 3-7): SBB 1 uses a significantly lower instruction count than SBB 0, and ‘SBB big prog’ naturally results in the most complex symbiont programs. Needless to say, SBB 1 solutions will use some combination of SBB 0 solutions. However, relative to any one move, only two hosts are ever involved in defining each action.

5. Conclusions

Temporal sequence learning represents the most challenging scenario for establishing effective mechanisms for credit assignment. Indeed, specific challenges under the temporal credit assignment problem are generally a superset of those experienced under supervised learning domains. Layered learning represents one potential way of extending the utility of machine learning algorithms in general to temporal sequence learning (Stone, 2007). However, in order to do so effectively, solutions from any one ‘layer’ need to be both diverse and self-contained; properties that evolutionary computation may naturally support. Moreover, when building a new layer of candidate solutions the problem of automatic context association must be explicitly addressed. The SBB algorithm provides explicit support for these features and thus is able to construct layered solutions without recourse to hand designed objectives for each candidate component contributing to a solution (Lichodzijewski and Heywood, 2010b). This is in marked contrast to the original Layered learning methodology or the more recent developments in hierarchical reinforcement learning (Stone, 2007).

The Rubik cube as a whole is certainly not a ‘solved’ problem from a learning algorithm perspective. The current state-of-the-art evolves solutions for each cube configuration (El-Sourani et al., 2010), or as in the work reported here, provides a general strategy for solving a subset of scrambled cubes (Baum and Durdanovic, 2000). The discrete nature of the Rubik problem domain makes the design of suitable fitness and distance functions less intuitive and more challenging than in the case of continuous valued domains. Indeed, specific examples of the effectiveness of SBB style layered learning under continuous valued reinforcement learning tasks are beginning to appear (Lichodzijewski and Heywood, 2010b). It is therefore anticipated that future developments will need to make use of more structural adaptation to the point population and/or make use of a priori constraints in the formulation of different fitness functions per layer, as in the case of more classical approaches to building Rubik cube ‘solvers’.

Acknowledgments

Peter Lichodzijewski has been a recipient of Precarn, NSERC-PGSD and Killam Postgraduate Scholarships. Malcolm Heywood holds research grants from NSERC, MITACS, CFI, SwissCom Innovations SA. and TARA Inc.

References

Barreto, A. M. S., Augusto, D. A., and Barbosa, H. J. C. (2009). On the characteristics of sequential decision problems and their impact on Evolutionary Computation and Reinforcement learning. In Proceedings of the International Conference on Artificial Evolution, page in press.
Baum, E. B. and Durdanovic, I. (2000). Evolution of cooperative problem-solving in an artificial economy. Neural Computation, 12:2743–2775.
El-Sourani, N., Hauke, S., and Borschbach, M. (2010). An evolutionary approach for solving the Rubik’s cube incorporating exact methods. In EvoApplications Part – 1: EvoGames, volume 6024 of LNCS, pages 80–89.
Harmeling, S., Dornhege, G., Tax, F., Meinecke, F., and Müller, K. R. (2006). From outliers to prototypes: Ordering data. Neurocomputing, 69(13-15):1608–1618.
Heywood, M. I. and Lichodzijewski, P. (2010). Symbiogenesis as a mechanism for building complex adaptive systems: A review. In EvoApplications: Part 1 (EvoComplex), volume 6024 of LNCS, pages 51–60.
Korf, R. (1997). Finding optimal solutions to Rubik’s cube using pattern databases. In Proceedings of the Workshop on Computer Games (IJCAI), pages 21–26.
Kunkle, D. and Cooperman, G. (2007). Twenty-six moves suffice for Rubik’s cube. In Proceedings of ACM International Symposium on Symbolic and Algebraic Computation, pages 235–242.
Lichodzijewski, P. and Heywood, M. I. (2007). Pareto-coevolutionary Genetic Programming for problem decomposition in multi-class classification. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 464–471.
Lichodzijewski, P. and Heywood, M. I. (2008). Managing team-based problem solving with Symbiotic Bid-based Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 363–370.
Lichodzijewski, P. and Heywood, M. I. (2010a). Symbiosis, complexification and simplicity under GP. In Proceedings of the Genetic and Evolutionary Computation Conference. To appear.
Lichodzijewski, P. and Heywood, M. I. (2010b). A symbiotic coevolutionary framework for layered learning. In AAAI Symposium on Complex Adaptive Systems. Under review.
Oudeyer, P. Y., Kaplan, F., and Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11:265–286.
Pollack, J. B. and Blair, A. D. (1998). Co-evolution in the successful learning of backgammon strategy. Machine Learning, 32:225–240.
Stone, P. (2007). Learning and multiagent reasoning for autonomous agents. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 13–30.
Whiteson, S. and Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7:887–917.

Chapter 4

ENSEMBLE CLASSIFIERS: ADABOOST AND ORTHOGONAL EVOLUTION OF TEAMS

Terence Soule¹, Robert B. Heckendorn¹, Brian Dyre¹, and Roger Lew¹
¹ University of Idaho, Moscow, ID 83844, USA.

Abstract

AdaBoost is one of the most commonly used and most successful approaches for generating ensemble classifiers. However, AdaBoost is limited in that it requires independent training cases and can only use voting as a cooperation mechanism. This paper compares AdaBoost to Orthogonal Evolution of Teams (OET), an approach for generating ensembles that allows for a much wider range of problems and cooperation mechanisms. The set of test problems includes problems with significant amounts of noise in the form of erroneous training cases and problems with adjustable levels of epistasis. The results demonstrate that OET is a suitable alternative to AdaBoost for generating ensembles. Over the set of all tested problems OET with a hierarchical cooperation mechanism, rather than voting, is slightly more likely to produce better results. This is most apparent on the problems with very high levels of noise - suggesting that the hierarchical approach is less subject to over-fitting than voting techniques. The results also suggest that there are specific problems and features of problems that make them better suited for different training algorithms and different cooperation mechanisms.

Keywords:

ensembles, teams, classifiers, OET, AdaBoost

1. Introduction

Classification, the ability to classify a case based on attribute values, is a commonly studied problem with many practical applications. Approaches based on the evolution of classifiers have been widely used and proven to be quite successful (see for example (Muni et al., 2004; Kishore et al., 2000; Paul and Iba, 2009)). However, as the complexity of the classification problem increases, and particularly as the number of attributes increases, the performance of monolithic classifiers often degrades. Thus, researchers have introduced the idea of ensemble classifiers, in which multiple classifiers vote on each case (Polikar, 2006). The general idea is that the individual classifiers can partition the attribute space into simpler, overlapping sub-domains for which individual classifiers can be more readily trained. Perhaps the most successful and widely used of these ensemble techniques is AdaBoost (Freund et al., 1999; Schapire et al., 1998).

Recently, we introduced an alternative approach, called Orthogonal Evolution of Teams (Soule and Komireddy, 2006), for generating ensembles, or teams¹. A significant advantage of Orthogonal Evolution of Teams (OET) over AdaBoost is that it does not require independent training cases or voting as a cooperation mechanism. Thus, OET can be applied in cases where the agents must function simultaneously, such as search and exploration problems, swarms, and problems with non-voting cooperation mechanisms. In previous research we have shown that the OET algorithm produces ensemble members whose errors are inversely correlated, demonstrating that they cooperate effectively (Soule and Komireddy, 2006). In addition, repeated tests have shown that OET performs well on traditional multi-agent search problems that are not within the traditional domain of AdaBoost (Soule and Heckendorn, 2007a; Soule and Heckendorn, 2007b; Thomason et al., 2008). However, a systematic comparison of OET and AdaBoost on classification problems has not been performed. We present that comparison here using a range of data sets. The data sets include noisy cases with errors added to the training set and data sets with adjustable levels of epistasis. The goal is to determine whether and, if so, under what circumstances, either of the two algorithms performs better.

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_4, © Springer Science+Business Media, LLC 2011

2. Background

Here we present the two ensemble based learning techniques, AdaBoost and OET, and briefly describe the strengths and weaknesses of each.

AdaBoost AdaBoost, developed by Freud and Schapire, is an ensemble building technique based on the idea of combining weak learners (Freund et al., 1999). It uses a combination of repeated training and re-weighting of training cases to generate cooperative ensembles. The basic algorithm is as follows: Assign each training example a weight 1 The term ‘ensemble’ is most commonly applied to classifiers with multiple, voting members; whereas the term ‘team’ is commonly applied to multiple agents that work cooperatively on problems other than classification and/or that do not involve a vote. The term ‘swarm’ is commonly used for very large teams. Unlike AdaBoost, OET can be applied to all three types of problems.

Ensemble Classifiers: AdaBoost and Orthogonal Evolution of Teams

57

For the ensemble size $N$ do Train a weak learner Calculate the error of the weak learner If the error $>$ 0.5 discard the learner and continue Calculate the normalized error of the learner Re-weight the training examples Create the ensemble of the $N$ learners using a vote weighted according to each learners’ normalized error. AdaBoost has several significant advantages for generating ensembles. First, it can be use in conjunction with most learning techniques. Second, theoretical results have shown that a) the ensemble error is bounded above, b) the ensemble error is less than the best ensemble member, and c) additional ensemble members lower the ensemble error - on the training set (and when members with error > 0.5 are discarded) (Polikar, 2006). These strengths make AdaBoost a very powerful and hence widely used technique for generating ensembles. However, AdaBoost has several weaknesses. First,becauseof the re-weighting step it potentially has difficulty with noisy data sets in which some of the examples are mis-classified. In this case increasing emphasis may be placed on the erroneous cases: the early learners ignore them as not fitting the general pattern, their weight then increases to where later learners are effectively forced to consider them. However, in general, AdaBoost has proven surprisingly resistant to overfitting; a strength the some researchers feel has not been satisfactorily explained (Mease and Wyner, 2008). Part of the goal of this research is to compare AdaBoost and OET’s ability to resist overfitting specifically when the training examples are noisy. Second, because AdaBoost trains each ensemble member independently it’s possible that problems with high levels of epistasis may confound it. The members of the ensemble may need to cooperate to overcome the high levels of epistasis in a way that is not possible when the members are trained sequentially. 
In contrast, an algorithm that evolves all ensemble members in parallel may be able to leverage the capabilities of the members simultaneously to more successfully address high levels of epistasis. We use a synthetic problem with adjustable levels of epistasis to test this possibility.

Finally, AdaBoost is restricted to problems in which individuals can train independently and cooperate via a vote. This means that it cannot be applied to problems where more than one member is required to actually make progress. A typical example of such a problem is collective foraging, where multiple members must work together to collect ‘large’ items; other examples are problems in which members have heterogeneous, complementary capabilities and must be trained collectively to make progress. Similarly, AdaBoost depends on a (weighted) vote for cooperation. It is not directly applicable to ensembles using other forms of cooperation. An example of an alternative cooperative mechanism is the leader mechanism, in which the first ensemble member (the leader) ‘examines’ each input case and assigns it to one of the other ensemble members to classify. AdaBoost’s sequential, vote-based ensemble generation algorithm cannot be applied to ensembles using leaders for cooperation. This is a fundamental limitation of AdaBoost’s incremental approach to building ensembles and cannot be readily overcome without fundamentally rewriting the algorithm.
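The AdaBoost loop outlined earlier in this subsection can be sketched as follows. This is only an illustrative sketch, not the implementation used in the experiments: it assumes one-feature threshold "stumps" as the weak learner (the chapter uses GP trees), and it assumes the standard AdaBoost.M1 formulas, normalized error beta = error / (1 - error) and vote weight log(1 / beta), which the text does not spell out. All function names are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return the (feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [1 if pol * (row[j] - t) >= 0 else 0 for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, (j, t, pol))
    return best[1]

def stump_predict(stump, row):
    j, t, pol = stump
    return 1 if pol * (row[j] - t) >= 0 else 0

def adaboost(X, y, n_rounds):
    """Train up to n_rounds weak learners; returns a list of (vote_weight, stump)."""
    n = len(X)
    w = [1.0 / n] * n                       # assign each training example a weight
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)        # train a weak learner
        pred = [stump_predict(stump, row) for row in X]
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        if err > 0.5:                       # discard learners worse than chance
            continue
        # Normalized error (assumed AdaBoost.M1 form); clamp to avoid log(1/0).
        beta = max(err, 1e-10) / (1.0 - err)
        # Re-weight: shrink the weight of correctly classified examples.
        w = [wi * (beta if p == yi else 1.0) for wi, p, yi in zip(w, pred, y)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((math.log(1.0 / beta), stump))
    return ensemble

def ensemble_predict(ensemble, row):
    """Weighted vote over the ensemble members."""
    votes = {0: 0.0, 1: 0.0}
    for alpha, stump in ensemble:
        votes[stump_predict(stump, row)] += alpha
    return 1 if votes[1] > votes[0] else 0
```

Because misclassified examples keep their weight while correct ones are scaled down, mislabeled (noisy) cases gain relative weight in later rounds, which is the behaviour the noise experiments are designed to probe.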

Orthogonal Evolution of Teams

Other than AdaBoost, common evolutionary ensemble training has fallen into two categories: team based and island based. In team based approaches the entire ensemble is treated as a single individual: the team receives a single fitness value and the selection process is applied to entire teams (Luke and Spector, 1996; Soule, 1999; Brameier and Banzhaf, 2001; Platel et al., 2005). Crossover techniques vary, but approaches in which team members in the same ‘position’ within the team are crossed seem to have the most success (Haynes et al., 1995; Luke and Spector, 1996). In island based techniques the individuals are evolved in independent populations, i.e. islands, and the best individuals from each island are combined into a single ensemble (see, for example, (Imamura et al., 2004)).

Each of these techniques has its own strengths and weaknesses. In team based approaches the ensemble members learn to cooperate well (similar to AdaBoost). It has been shown that they can evolve inversely correlated error, i.e. the errors of one member are explicitly covered by the other members (Soule and Komireddy, 2006). However, the individual members perform relatively poorly, i.e. their average fitness is often significantly poorer than the fitness of individuals evolved independently (Soule and Komireddy, 2006). In contrast, in island based approaches the individual members have relatively high fitness. However, they cooperate more poorly than in team based approaches; at best their errors are independent and in some cases their errors are correlated, undermining the advantage of the ensemble (Imamura et al., 2004; Imamura, 2002; Soule and Komireddy, 2006).

The Orthogonal Evolution of Teams approach is an attempt to combine the strengths and avoid the weaknesses of the team and island approaches.
A single population is created, but it is alternately treated as independent islands (columns in the population, see Figure 4-1) or as teams (rows in the population, see Figure 4-1). A number of OET approaches are possible depending on whether the population is treated as rows or columns during selection and replacement (Thomason et al., 2008). In this paper we take one of the most straightforward approaches: during the selection step the population is treated as islands, i.e. selection is applied to each column, creating a new team consisting of highly fit individuals. This is done twice to create two “all-star” teams. These teams undergo crossover, with crossover applied to individuals from the same column, and mutation, to create two offspring teams. These teams are evaluated and reinserted into the population, replacing two poorly fit teams. Thus, during the selection stage the population is treated as islands and during the replacement stage the population is treated as teams. This places direct selection pressure both on individuals, so they can be selected for the all-star parent teams, and on teams, to avoid being replaced.

Figure 4-1. A population of individuals. Selection can be applied to members, keeping selection and replacement within the columns (a) in an island approach with each column serving as an island. Alternatively selection can be applied to whole rows (b) a team-based approach. Finally, selection can be varied between the two; these are the OET approaches.
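One steady-state iteration of the OET variant described above can be sketched as follows. The sketch makes several assumptions that the text does not specify: tournament selection inside each column, replacement of the two teams with the lowest team fitness, and caller-supplied `member_fitness`, `team_fitness`, `crossover` and `mutate` functions. All of these names are hypothetical.

```python
import random

def select_all_star(population, member_fitness, tournament_size, rng):
    """Build one parent team by running a tournament inside each column (island)."""
    team = []
    for col in range(len(population[0])):
        entrants = [rng.choice(population)[col] for _ in range(tournament_size)]
        team.append(max(entrants, key=member_fitness))
    return team

def oet_step(population, member_fitness, team_fitness, crossover, mutate, rng,
             tournament_size=3):
    """One OET iteration; population is a list of teams (rows) modified in place."""
    # Selection: the population is treated as islands (columns).
    parent_a = select_all_star(population, member_fitness, tournament_size, rng)
    parent_b = select_all_star(population, member_fitness, tournament_size, rng)
    # Crossover and mutation are applied to members from the same column.
    child_a, child_b = [], []
    for member_a, member_b in zip(parent_a, parent_b):
        off_a, off_b = crossover(member_a, member_b, rng)
        child_a.append(mutate(off_a, rng))
        child_b.append(mutate(off_b, rng))
    # Replacement: the population is treated as teams (rows).
    # Assumption: the two teams with the lowest team fitness are replaced.
    ranked = sorted(range(len(population)),
                    key=lambda i: team_fitness(population[i]))
    for index, child in zip(ranked[:2], (child_a, child_b)):
        population[index] = child
    return child_a, child_b
```

With three members per team, each call creates six new members (two offspring teams of three), which matches the six tree evaluations per iteration mentioned in the description of the GP setup.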

3. Problem Instances

To compare the ensemble classifiers we selected two data sets from the UCI Machine Learning Repository (Asuncion and Newman, 2007). The sets are the Parkinson’s Telemonitoring Data Set (Tsanas et al., 2009) and the Ionosphere data set (Sigillito et al., 1989). In addition, we used data collected as part of a research project conducted at the University of Idaho to assess cognitive workload (described in detail below) and data from a synthetic problem with adjustable levels of epistasis. Each of these data sets represents a binary classification problem with numerical attributes (both integer and real). Table 4-1 summarizes the problems.

Table 4-1. Number of attributes and number of cases for each of the test problems. Attributes are numerical (integer and real). The cognitive workload case consists of two separate data sets from two different test subjects. For each of the problems 50% of the cases are used for training and 50% for testing.

    Problem                            Number of Attributes    Number of Cases
    Ionosphere                         34                      351
    Parkinson’s                        22                      195
    Cognitive Workload (2 subjects)    20                      2048
    Synthetic Problem                  20                      1000

Assessing Cognitive Workload

This data set was generated as part of a research project conducted at the University of Idaho to measure cognitive workload. Subjects’ skin conductance (SC, also known as galvanic skin response, GSR) and pupil diameter were measured while they performed a task with two distinct levels of difficulty. Changes in SC are generally believed to reflect autonomic responses to anxiety or stress, while changes in pupil diameter have been linked to differences in difficulty of tasks including sentence processing, mental calculations and user interface evaluation (Just and Carpenter, 1993; Nakayama and Katsukura, 2007). Thus, it was hypothesized that these physiological indicators could be used to determine which phase of the task the subject was in.

Stimuli and Apparatus. Participants used a black cursor to chase an intensity-balanced dot moving in a pseudo-random fashion against a gray background. A balanced dot was used as a precaution against pupil dilations due to luminance changes. Participants controlled the cursor using a joystick. For the first minute of the experiment the control mappings were normal: moving the joystick forward moved the cursor up, moving the joystick right moved the cursor right, etc. After 60 seconds the joystick control mappings were abruptly rotated 90 degrees clockwise, such that moving the joystick forward-backward moved the cursor right-left, and moving the joystick left-right moved the cursor upward-downward. The control dynamics were switched between normal and rotated by 90 degrees every 60 seconds for the eight minute duration of the experiment. The abrupt changes in control mappings were hypothesized to elicit transient physiological responses, and the rotated mappings were hypothesized to cause physiological indicators reflecting increased workload. The goal was to train classifiers to use these physiological indicators to determine the control phase, normal or rotated. For this analysis data was used from the last 2 minutes of the experiment (covering one normal and one rotated period), by which time the subjects had obtained some practice with both sets of controls. Data was collected 18 times per second for a total of 2048 separate cases for each subject.


For a more detailed explanation of the experimental conditions please see (Lew et al., 2010).

Participants. The data used is from two university students who participated in this experiment. Both had normal or corrected to normal Snellen visual acuity (20/30 or better). The participants were naive to the hypotheses of the experiment.

Synthetic Problem

The synthetic function was designed to allow control of the amount of epistasis in the problem. Each problem is defined in terms of a z-function; z-functions are random Embedded Landscapes (Heckendorn, 2002). These are generalizations of NK-Landscapes in that the sub-function masks are not guaranteed to cover the domain of the function and the number of sub-functions is not constrained to be equal to the number of bits in the domain, as it is in NK-Landscapes. The values of the sub-functions lie between -1 and 1. The functions are denoted by names of the form z-N-K-P. They are randomly generated, but are of the form:

$$ f(x) = \mathrm{positive}\left( \sum_{i=1}^{P} g_i(\mathrm{pack}(x, m_i)) \right) $$

where:

    $N$ is the number of bits (or binary valued features).
    $P$ is the number of sub-functions to sum.
    $K$ is the number of bits (or features) in the domain of $g_i$.
    $m_i$ is an $N$ bit mask that selects $K$ bits out of $N$ bits by using the pack function to extract the bits selected by the 1’s in $m_i$. In a given $f$: $m_i \neq m_j$ $\forall i, j$ such that $i \neq j$.
    $g_i$ is a function that maps its $K$ bit domain into the reals. This function is fully epistatic in that all Walsh coefficients are nonzero. The values of $g_i$ are random in the range between $-1$ and $+1$. This is where the randomness in the function is created.
    positive takes a real argument and returns 1 if its argument is positive and 0 otherwise.

This creates a function $f$ that has the property that it has at most $K$ bits of epistasis in $P$ groups of interrelated bits that may overlap. Therefore, as $K$ goes up, the amount of epistasis goes up and as $P$ goes up the complexity of the constraint satisfaction problem created by the overlapping fully epistatic $g$’s goes up when treated as a function to optimize.
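A z-N-K-P function of the form defined above can be generated roughly as follows. This is a sketch under stated assumptions: the sub-functions $g_i$ are stored as lookup tables of uniform random values, the bit order used by `pack` is arbitrary, and nothing here checks that every Walsh coefficient is nonzero (random real-valued tables make this very likely but it is not verified). The names `make_z_function` and `pack` as Python functions are hypothetical.

```python
import math
import random

def pack(x, mask):
    """Collect the bits of x selected by the 1s in mask into an integer index."""
    index = 0
    for bit, selected in zip(x, mask):
        if selected:
            index = (index << 1) | bit
    return index

def make_z_function(n, k, p, rng):
    """Return (f, masks, tables) for a random z-N-K-P function."""
    if p > math.comb(n, k):
        raise ValueError("not enough distinct K-bit masks for P sub-functions")
    masks = set()
    while len(masks) < p:                    # masks must be pairwise distinct
        chosen = set(rng.sample(range(n), k))
        masks.add(tuple(1 if i in chosen else 0 for i in range(n)))
    masks = sorted(masks)
    # One table of 2^K random values in [-1, 1] per sub-function g_i.
    tables = [[rng.uniform(-1.0, 1.0) for _ in range(2 ** k)] for _ in masks]

    def f(x):
        total = sum(table[pack(x, mask)] for mask, table in zip(masks, tables))
        return 1 if total > 0 else 0         # the 'positive' function

    return f, masks, tables
```

For example, `make_z_function(20, 5, 10, random.Random(0))` would correspond to a 20-bit problem with K = 5 and P = 10; the returned `f` maps a list of 20 bits to a class label of 0 or 1.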

Noisy Training Data

For many real-world data sets noisy cases (cases with an incorrect classification) are common. These cases can easily mislead training algorithms or lead to overfitting, as the training algorithm is forced to ‘memorize’ cases that don’t fit the general solution because the class is incorrect. Thus, in addition to the basic data sets we ran experiments with noisy versions of each of the problems except the synthetic problem. For the noisy cases 0 (no noise), 10, 20, 30, or 40 percent of the training case answers were changed to the opposite (incorrect) class. The erroneous cases in the training set are kept the same throughout the evolutionary process to maximize the chance of misleading the learners. All of the test cases were unchanged, i.e. all are correct.
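The label-noise procedure just described can be sketched in a few lines. The sketch assumes binary labels coded as 0 and 1 and assumes that exactly the stated fraction of training labels is flipped, chosen uniformly at random once before training; the function name is hypothetical.

```python
import random

def flip_labels(labels, noise_fraction, rng):
    """Return a noisy copy of labels plus the set of flipped indices."""
    n_flip = round(noise_fraction * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = [1 - y if i in flipped else y for i, y in enumerate(labels)]
    return noisy, flipped
```

The noisy label list would be generated once and reused for every generation, so the erroneous cases stay fixed during a run, while the test labels are left untouched.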

4. Cooperation Mechanisms

With AdaBoost the ensemble members cooperate, i.e. collectively determine the classification for each input case, via a weighted vote. With OET two different cooperation mechanisms are tested. The first is a simple majority vote. The second is the leader approach, in which the first ensemble member (the leader) ‘examines’ each input case and assigns it to one of the other ensemble members to classify. It is important to note that AdaBoost’s sequential, vote-based ensemble generation algorithm cannot be applied to ensembles using leaders for cooperation (or to most other cooperation mechanisms that do not use a vote).
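The two OET cooperation mechanisms can be illustrated with the following sketch. It assumes that each member is a callable returning a real value, that a value greater than zero means class 1, and, for the leader mechanism, that the sign of the leader's output chooses between exactly two followers (ensemble size 3). The text does not specify these mappings, so they are assumptions made only for illustration.

```python
def to_class(value):
    """Assumed mapping from a member's real-valued output to a class label."""
    return 1 if value > 0 else 0

def majority_vote(members, case):
    """Simple (unweighted) majority vote over all members."""
    ones = sum(to_class(member(case)) for member in members)
    return 1 if 2 * ones > len(members) else 0

def leader_classify(members, case):
    """The first member (leader) picks which of two followers classifies the case."""
    leader, followers = members[0], members[1:]
    if len(followers) != 2:
        raise ValueError("this sketch assumes a leader and exactly two followers")
    chosen = followers[1] if leader(case) > 0 else followers[0]
    return to_class(chosen(case))
```

In the leader scheme only one follower's output determines the class of a given case, so the members cannot be trained one at a time against a shared vote; they have to be evaluated together.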

5. Genetic Program

For these experiments the ensemble size is always 3. One of the potential advantages of GP techniques is their ability to generate (somewhat) human-readable solutions. This advantage is lost if the ensemble size is large, hence the small value used here. The results are the average of 20 trials (synthetic problem) or 10 trials (other problems).

The basic GP used in both the AdaBoost and OET experiments is steady-state with a population size of 500, run for either 50000 iterations (synthetic problems) or 12500 iterations (all others). For OET this is the total number of iterations. For AdaBoost this is the number of iterations used to generate each of the three ensemble members. With OET each iteration requires evaluating six trees, three trees for each of the two offspring teams. Because AdaBoost only generates one tree at a time, it only evaluates two trees per iteration. Thus, to equalize the number of tree evaluations AdaBoost uses the full number of evaluations to generate each of the ensemble members, effectively tripling the total number of iterations used with AdaBoost.

The non-terminal set consists of if-less-than-else, addition, subtraction, multiplication, and protected division (if the absolute value of the divisor is less than 0.00001 it returns 1). The terminal set consists of the N attributes of the problem and real-valued random constants generated in the range -2.0 to 2.0. Table 4-2 summarizes the GP’s parameters.

Table 4-2. Summary of the GP parameters.

    Algorithm          Steady-state
    Iterations         50000 (synthetic problem) or 12500 (all others)
    Population Size    500
    Non-terminals      iflte, +, -, *, /
    Terminals          Attributes, Random constants
    Crossover Rate     100%
    Mutation Rate      1/size
    Trials             20 (synthetic problem) or 10 (all others)
    Ensemble Size      3
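The function and terminal sets described above can be sketched as a small tree evaluator. The protected division threshold and the constant range follow the text; the four-argument form of `iflte` ("if a <= b then c else d") and the nested-tuple tree encoding are assumptions made for illustration, since the chapter does not give them.

```python
import random

def protected_div(a, b):
    """Division that returns 1 when the divisor is (almost) zero."""
    return 1.0 if abs(b) < 0.00001 else a / b

def iflte(a, b, then_value, else_value):
    """Assumed semantics: if a <= b return then_value, otherwise else_value."""
    return then_value if a <= b else else_value

def random_constant(rng):
    """Real-valued random constant in the range -2.0 to 2.0."""
    return rng.uniform(-2.0, 2.0)

BINARY_OPS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '/': protected_div,
}

def evaluate(tree, case):
    """Evaluate a tree given as nested tuples, e.g. ('+', ('attr', 0), 1.5)."""
    if isinstance(tree, (int, float)):
        return float(tree)                   # random constant terminal
    op = tree[0]
    if op == 'attr':
        return float(case[tree[1]])          # attribute terminal
    args = [evaluate(child, case) for child in tree[1:]]
    if op == 'iflte':
        return iflte(*args)
    return BINARY_OPS[op](*args)
```

For instance, `evaluate(('iflte', ('attr', 0), 0.5, 1.0, -1.0), case)` returns 1.0 when the first attribute is at most 0.5 and -1.0 otherwise; small trees of this kind are what keeps an ensemble of size 3 reasonably readable.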

6. Results

Figure 4-2 presents the results on the ionosphere problem. For this problem the OET-leader approach performs significantly worse for low levels of noise (all significance tests use a two-tailed Student’s t-test, with significance defined as P < 0.05). OET-leader’s relative performance improves as noise increases, but does not reach statistically better performance. Figure 4-3 presents the results on the Parkinson’s problem. OET-vote is significantly worse than both other techniques with 30% noise and OET-leader is significantly better with 40% noise. Figure 4-4 presents the results for the cognitive workload problem with subject 1. At 0% noise AdaBoost is significantly worse than the other two approaches and OET-vote is significantly better. At 40% noise OET-leader is significantly better than the other two approaches. Figure 4-5 presents the results for the cognitive workload problem with subject 2. At 0% noise AdaBoost is significantly worse than OET-vote. At 20% and 30% noise AdaBoost is significantly better than the other two approaches. Figure 4-6 presents the results on the synthetic functions. OET-leader is significantly better than the other two algorithms on 4 of the 7 functions (2-30, 5-10, 5-30, 10-30). OET-vote is significantly better than the other two algorithms on 1 of the functions (2-3).

[Figure 4-2: plot of classification error (y-axis) versus percent noise (x-axis) for AdaBoost, OET-Vote and OET-Leader.]

Figure 4-2. Results on the ionosphere problem for varying levels of noise in the training data. Arrows show significant differences. For this problem the OET-leader approach is significantly worse (two-tailed Student’s t-test, P < 0.05) than the other two approaches for low levels of noise. Its relative performance improves for higher levels of noise, but the differences do not reach significance.

Overall the results are mixed. For the majority of cases the performance of the two algorithms is statistically indistinguishable. Generally, OET-leader performs better on the noisiest cases, suggesting that it is less prone to overfitting, but often performs more poorly on the low noise cases. OET-vote performs better on some of the simplest cases (0% noise and the 2-3 function) and AdaBoost’s performance tends to fall in the middle.

7. Conclusions

In general the results confounded the expectations. The goal of this research was to compare AdaBoost, a well established and widely used ensemble training technique, to OET, a newer approach that has proven successful on a number of problems. Given the nature of AdaBoost it was hypothesized that OET was most likely to perform better under one of two conditions. First, on cases with significant noise, because AdaBoost’s re-weighting approach would force it to focus on erroneous cases, causing it to overfit. Second, on cases with high levels of epistasis, because AdaBoost’s incremental approach to building an ensemble could interfere with its ability to leverage multiple members simultaneously to ‘untangle’ high epistasis problems.

The results do strongly suggest that performance depends both on the training algorithm and the cooperation method, but confounded the specific hypotheses regarding noise and epistasis. OET with voting ensemble members only performed better with zero noise and the least epistasis, whereas OET with hierarchical cooperation (the leader approach described previously) had the best performance with high levels of noise and epistasis. AdaBoost’s performance generally fell between OET-vote and OET-leader and showed the best results for the mid-range of noise.

However, for the majority of cases the algorithms’ performance was statistically indistinguishable. This suggests that the performance of the algorithms is generally comparable, if not identical. Based on the results it seems plausible that further testing would show that there are specific types of problems or features of problems that make them better suited for one or another of the algorithms and/or cooperation mechanisms.

Most importantly, these results strongly suggest that OET is generally on par with AdaBoost, but, as noted previously, OET can be applied to problems and cooperation mechanisms that are not suitable for AdaBoost. Thus, researchers can confidently apply OET in cases where AdaBoost is inappropriate.

[Figure 4-3: plot of classification error (y-axis) versus percent noise (x-axis) for AdaBoost, OET-Vote and OET-Leader.]

Figure 4-3. Results on the Parkinson’s problem for varying levels of noise in the training data. Arrows show significant differences. For this problem the OET-vote approach is significantly worse (two-tailed Student’s t-test, P < 0.05) than the other two approaches for 30% noise and the OET-leader approach is significantly better with 40% noise.

[Figure 4-4: plot of classification error (y-axis) versus percent noise (x-axis) for AdaBoost, OET-Vote and OET-Leader.]

Figure 4-4. Results on the cognitive workload problem for the first subject with varying levels of noise in the training data. Arrows show significant differences. For this problem the differences between all three approaches are significant with no noise in the training set (two-tailed Student’s t-test, P < 0.05). The OET-leader approach is significantly better than the other two approaches with 40% noise.

[Figure 4-5: plot of classification error (y-axis) versus percent noise (x-axis) for AdaBoost, OET-Vote and OET-Leader.]

Figure 4-5. Results on the cognitive workload problem for the second subject with varying levels of noise in the training data. Arrows show significant differences. For this problem AdaBoost is significantly better with noise levels of 20% and 30%. Additionally, OET-vote is significantly better than AdaBoost (but not OET-leader) with 0% noise.

[Figure 4-6: plot of classification error (y-axis) for AdaBoost, OET-Vote and OET-Leader on the z functions 2-3, 2-10, 2-30, 5-3, 5-10, 5-30 and 10-30 (x-axis).]

Figure 4-6. Results on the z functions. No noise was used with these problems; the problems are arranged along the x-axis in approximate order of difficulty. Arrows show significant differences in performance (two-tailed Student’s t-test, P < 0.05). For the 2-3 function the OET-vote approach is significantly better than both other approaches. For the 2-30, 5-10, 5-30, and 10-30 problems the OET-leader approach is significantly better than the other two approaches. Additionally, the OET-leader approach is significantly better than AdaBoost (but not OET-vote) for the 2-10 problem and significantly better than OET-vote (but not AdaBoost) for the 5-3 problem.

References

Asuncion, A. and Newman, D.J. (2007). UCI machine learning repository.

Brameier, Markus and Banzhaf, Wolfgang (2001). Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines, 2(4):381–408.

Freund, Y., Schapire, R., and Abe, N. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14:771–780.

Haynes, Thomas, Sen, Sandip, Schoenefeld, Dale, and Wainwright, Roger (1995). Evolving a team. In Siegel, Eric V. and Koza, John, editors, Working Notes of the AAAI-95 Fall Symposium on GP, pages 23–30. AAAI Press.

Heckendorn, Robert B. (2002). Embedded landscapes. Evolutionary Computation, 10(4):345–376.

Imamura, Kosuke (2002). N-version Genetic Programming: A probabilistic Optimal Ensemble. PhD thesis, University of Idaho.

Imamura, Kosuke, Heckendorn, Robert B., Soule, Terence, and Foster, James A. (2004). Behavioral diversity and a probabilistically optimal GP ensemble. Genetic Programming and Evolvable Machines, 4:235–253.

Just, M.A. and Carpenter, P.A. (1993). The intensity dimension of thought: Pupillometric indices of sentence processing. Canadian Journal of Experimental Psychology, 47(2):310–339.

Kishore, J.K., Patnaik, L.M., Mani, V., and Agrawal, V.K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation, 4(3):242–258.

Lew, R., Dyre, B. P., Soule, T., Werner, S., and Ragsdale, S. A. (2010). Assessing mental workload from skin conductance and pupillometry using wavelets and genetic programming. In Proceedings of the 54th Annual Meeting of the Human Factors and Ergonomics Society.

Luke, Sean and Spector, Lee (1996). Evolving teamwork and coordination with genetic programming. In Koza, John R., Goldberg, David E., Fogel, David B., and Riolo, Rick R., editors, Genetic Programming 1996: Proceedings of the First Annual Conference on Genetic Programming, pages 150–156. Cambridge, MA: MIT Press.

Mease, D. and Wyner, A. (2008). Evidence contrary to the statistical view of boosting. The Journal of Machine Learning Research, 9:131–156.

Muni, D.P., Pal, N.R., and Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2):183–196.

Nakayama, M. and Katsukura, M. (2007). Feasibility of assessing usability with pupillary responses. Proc. of AUIC 2007, 15, 22.

Paul, T.K. and Iba, H. (2009). Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 6(2):353–367.

Platel, Michael Defoin, Chami, Malik, Clergue, Manuel, and Collard, Philippe (2005). Teams of genetic predictors for inverse problem solving. In Proceeding of the 8th European Conference on Genetic Programming – EuroGP 2005.

Polikar, R. (2006). Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45.

Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686.

Sigillito, V.G., Wing, S.P., Hutton, L.V., and Baker, K.B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Dig., 10:262–266.

Soule, T. and Heckendorn, R.B. (2007a). Improving Performance and Cooperation in Multi-Agent Systems. In Proceedings of the Genetic Programming Theory and Practice Workshop. Springer.

Soule, Terence (1999). Voting teams: A cooperative approach to non-typical problems. In Banzhaf, Wolfgang, Daida, Jason, Eiben, Agoston E., Garzon, Max H., Honavar, Vasant, Jakiela, Mark, and Smith, Robert E., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 916–922, Orlando, Florida, USA. Morgan Kaufmann.

Soule, Terence and Heckendorn, Robert B. (2007b). Evolutionary optimization of cooperative heterogeneous teams. In SPIE Defense and Security Symposium, volume 6563.

Soule, Terence and Komireddy, Pavankumarreddy (2006). Orthogonal evolution of teams: A class of algorithms for evolving teams with inversely correlated errors. In Riolo, Rick L., Soule, Terence, and Worzel, Bill, editors, Genetic Programming Theory and Practice IV, volume 5 of Genetic and Evolutionary Computation, chapter 8. Springer, Ann Arbor.

Thomason, Russell, Heckendorn, Robert B., and Soule, Terence (2008). Training time and team composition robustness in evolved multi-agent systems. In O’Neill, Michael, Vanneschi, Leonardo, Gustafson, Steven, Esparcia Alcazar, Anna Isabel, De Falco, Ivanoe, Della Cioppa, Antonio, and Tarantino, Ernesto, editors, Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008, volume 4971 of Lecture Notes in Computer Science, pages 1–12, Naples. Springer.

Tsanas, A., Little, M.A., McSharry, P.E., and Ramig, L.O. (2009). Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests. Scientific Commons.

Chapter 5

COVARIANT TARPEIAN METHOD FOR BLOAT CONTROL IN GENETIC PROGRAMMING

Riccardo Poli¹

1 School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, UK.

Abstract

In this paper a simple modification of the Tarpeian bloat-control method is presented which allows one to dynamically set the parameters of the method in such a way as to guarantee that the mean program size will either keep a particular value (e.g., its initial value) or will follow a schedule chosen by the user. The mathematical derivation of the technique as well as its numerical and empirical corroboration are presented.

Keywords:

Bloat control, Tarpeian Method, Price’s theorem, Size-evolution equation

1. Background

Many techniques to control bloat have been proposed in the last two decades (for recent reviews see (Poli et al., 2008; Luke and Panait, 2006; Alfaro-Cid et al., 2010; Silva, 2008)). One with a theoretically-sound basis is the Tarpeian method introduced in (Poli, 2003). This is the focus of this paper.

The Tarpeian method is extremely simple in its implementation. All that is needed is a wrapper for the fitness function like the following algorithm:

    Tarpeian Wrapper:
    if size(program) > average program size and random() < p_t then
        return( f_bad );
    else
        return( fitness(program) );

where $p_t$ is a real number between 0 and 1, random() is a function which returns uniformly distributed random numbers in the range [0, 1) and $f_{bad}$ is a constant which represents an extremely low (or high, if minimising) fitness value such that individuals with such fitness are almost guaranteed not to be selected.

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_5, © Springer Science+Business Media, LLC 2011

The method got its name after the Tarpeian Rock in Rome, which in Roman times was the infamous execution place for traitors and criminals (above average size individuals), who would be led to its top and then hurled down to their death.

A feature of this algorithm is that it does not require a priori knowledge of the size of the potential solutions to a problem. If programs need to grow in order to improve fitness, the original Tarpeian method will not prevent this. It will occasionally hit some individuals that, if evaluated, would turn out to be fitter than average, and this may slightly slow down the progress of a run. However, because the wrapper does not evaluate the individuals being given a low fitness, very little computation is wasted. Even at a high anti-bloat intensity, $p_t$, a better-than-average longer-than-average individual still has a chance of making it into the population. If enough individuals of this kind are produced (w.r.t. the individuals which are better-than-average but also shorter-than-average), eventually the average size of the programs in the population may grow. However, when this happens the Tarpeian method will immediately adjust so as to discourage further growth.

Since its proposal, the Tarpeian method has been used in a variety of studies and applications. For example, in (Mahler et al., 2005) its performance and generalisation capabilities were studied, while it was compared with other bloat-control techniques in (Luke and Panait, 2006; Wyns and Boullart, 2009; Alfaro-Cid et al., 2010).
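The Tarpeian wrapper given at the start of this section translates almost directly into Python. In this sketch the fitness function, the size function and the current mean program size are supplied by the caller, and the parameter names (`p_t`, `f_bad`, `mean_size`) are illustrative choices rather than part of any particular GP library.

```python
import random

def tarpeian_fitness(program, fitness, size, mean_size, p_t, f_bad, rng=random):
    """Fitness wrapper implementing the Tarpeian method.

    program   -- the individual to score
    fitness   -- callable computing the raw fitness of a program
    size      -- callable returning the size of a program
    mean_size -- average program size in the current population
    p_t       -- Tarpeian rate, a probability between 0 and 1
    f_bad     -- very poor fitness value assigned to penalised programs
    """
    # Above-average-size programs are penalised with probability p_t,
    # and in that case the (possibly expensive) fitness call is skipped.
    if size(program) > mean_size and rng.random() < p_t:
        return f_bad
    return fitness(program)
```

Because of the short-circuit `and`, a random number is drawn only for programs that are larger than average, and the raw fitness is never computed for penalised individuals, which is where the computational saving mentioned above comes from.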
The method has been used with success in the evolution of bin packing heuristics (Burke et al., 2007; Allen et al., 2009), in the evolution of image analysis operators (Roberts and Claridge, 2004), in artificial financial markets based on GP (Martinez-Jaramillo and Tsang, 2009), in predicting protein networks (Garcia et al., 2008a), in the design of passive analog filters using GP (Chouza et al., 2009), in the prediction of protein-protein functional associations (Garcia et al., 2008b) and in the simplification of decision trees via GP (Garcia-Almanza and Tsang, 2006). In all cases the Tarpeian method has been a solid and efficient choice.

All studies and applications, however, have had to determine by trial and error the value of the parameter $p_t$ best suited to their problem(s).¹ This is not really a drawback of this method: virtually all anti-bloat techniques require setting one or more parameters. For example, the parsimony pressure method (Koza, 1992; Zhang and Mühlenbein, 1995; Zhang and Mühlenbein, 1993; Zhang et al., 1997) also requires setting one parameter (the parsimony coefficient).

In recent research (Poli and McPhee, 2008), we developed a method, called covariant parsimony pressure, that allows one to dynamically and optimally set the parsimony coefficient for the parsimony pressure method in such a way as to completely control the evolution of the mean program size. The aim of this paper is to achieve the same level of control for the Tarpeian method. We will do this partly by following the tracks of (Poli and McPhee, 2008). We therefore start our journey by briefly summarising the main ideas that led to the covariant parsimony pressure method.

1 In principle also $f_{bad}$ needs to be set. However, this is normally easily done (more on this later) and requires virtually no tuning.

2. Covariant Parsimony Pressure

Let us start by considering the size evolution equation developed in (Poli, 2003; Poli and McPhee, 2003), which, as shown in (Poli and McPhee, 2008), with trivial manipulations can be rewritten as follows

$$ E[\mu'] = \sum_{\ell} \ell \, p(\ell) \tag{5.1} $$

where the index $\ell$ ranges over all program sizes, $\mu'$ is a stochastic variable which represents the average size of the programs at the next generation and $p(\ell)$ is the probability of selecting a program of size $\ell$ from the current generation. The equation applies to GP systems with independent selection and symmetric sub-tree crossover.² If $\phi(\ell)$ represents the proportion of programs of size $\ell$ in the current generation, then, clearly, the average size of the programs in the current generation is given by $\mu = \sum_{\ell} \ell \, \phi(\ell)$. Thus one can simply express the expected change in average size of programs between two generations as

$$ E[\Delta\mu] = E[\mu'] - \mu = \sum_{\ell} \ell \, (p(\ell) - \phi(\ell)). \tag{5.2} $$

In (Poli and McPhee, 2008), we showed that if we restrict our attention to fitness proportionate selection, we can express $p(\ell) = \phi(\ell) \frac{f(\ell)}{\bar{f}}$, where $f(\ell)$ is the average fitness of the programs of size $\ell$ and $\bar{f}$ is the average fitness of the programs in the population. Then, with some algebraic manipulations, one finds that Equation (5.2) is actually equivalent to Price’s theorem (Price, 1970). That is

$$ E[\Delta\mu] = \frac{\mathrm{Cov}(\ell, f)}{\bar{f}}. \tag{5.3} $$

Let us imagine that a fitness function incorporating parsimony, $f_p = f - c\ell$, is used, where $c$ is the parsimony coefficient, $\ell$ is the size of a program and $f$ is its raw fitness (problem-solving performance). Feeding this into Equation (5.3), then setting its l.h.s. ($E[\Delta\mu]$) to zero and solving for $c$, one finds

$$ c = \frac{\mathrm{Cov}(\ell, f)}{\mathrm{Var}(\ell)}. \tag{5.4} $$

2 In a symmetric operator the probability of selecting particular crossover points in the parents does not depend on the order in which the parents are drawn from the population.


This value of c guarantees that, in expectation, the size of the programs in the next generation will be the same as in the current generation (as long as the coefficient c is recomputed at each generation). In (Poli and McPhee, 2008) we also showed that with simple further manipulations of Equation (5.3) one can even set c dynamically in such a way as to force the mean program size to vary according to any desired function of time, thereby providing complete control over the evolution of size.
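Equations (5.3) and (5.4) are straightforward to evaluate from the sizes and fitness values of a population. The following is a minimal sketch, assuming population (divide-by-N) covariances and a hypothetical four-program population; the function names are illustrative and are not taken from the source.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance (divide by N, not N - 1)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def expected_size_change(sizes, fitness):
    """Equation (5.3): E[delta mu] = Cov(size, f) / mean(f)."""
    return covariance(sizes, fitness) / fmean(fitness)

def parsimony_coefficient(sizes, fitness):
    """Equation (5.4): c = Cov(size, f) / Var(size)."""
    return covariance(sizes, fitness) / covariance(sizes, sizes)

# Hypothetical population, used only for illustration.
sizes = [5, 2, 2, 7]
fitness = [9, 1, 2, 8]

c = parsimony_coefficient(sizes, fitness)
penalised = [f - c * l for f, l in zip(fitness, sizes)]

print(expected_size_change(sizes, fitness))  # expected growth without control
print(c)                                     # coefficient from Equation (5.4)
print(covariance(sizes, penalised))          # size/fitness covariance after the penalty
```

Recomputing `c` at every generation, as the text requires, amounts to calling `parsimony_coefficient` on the current population before selection.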

3. Covariant Tarpeian Method

Let us now model the effects on program size of the Tarpeian method in GP systems with independent selection and symmetric sub-tree crossover. In the Tarpeian method the fitness of individuals of size not exceeding the mean size $\mu$ is left unaffected. If $p_t$ is the Tarpeian rate, on average individuals of size bigger than the mean will see their fitness set to a very low value, $f_{\text{bad}}$, in a proportion $p_t$ of cases, while fitness will be unaffected with probability $1 - p_t$. In order to see what effects the Tarpeian method has on the expected change in program size $E[\Delta\mu]$, we need to verify how the changes in fitness it produces affect the terms in the size evolution equation (Equation (5.2)). In other words, we need to compute

\[ E[\Delta\mu_t] = \sum_{\ell} \ell \, \big(p_t(\ell) - \phi(\ell)\big) \tag{5.5} \]

or

\[ E[\Delta\mu_t] = \frac{\mathrm{Cov}(\ell, f_t)}{\bar{f}_t}, \tag{5.6} \]

where $\Delta\mu_t = \mu'_t - \mu$, $\mu'_t$ is the average program size in the next generation when the Tarpeian method is used, $p_t(\ell)$ is the probability of selecting individuals of size $\ell$ when the Tarpeian method is used, $f_t$ is the fitness of individuals after the application of the Tarpeian method, and $\bar{f}_t$ is the mean program fitness after the application of the Tarpeian method.

Unfortunately, when attempting to study Equations (5.5) and (5.6) for the Tarpeian method things are significantly harder than for the parsimony pressure method. Under fitness proportionate selection, we have that $p_t(\ell) = \phi(\ell) \, f_t(\ell) / \bar{f}_t$, where $f_t(\ell)$ is the mean fitness of the programs of size $\ell$ after the application of the Tarpeian method. In the absence of Tarpeian bloat control (i.e., for $p_t = 0$), these quantities are constants (given that we have full information about the current generation). However, as soon as $p_t > 0$, they become stochastic variables. This is because the Tarpeian method is stochastic and, so, we cannot be certain as to precisely how many individuals will have their fitness reduced by it, how many individuals in each length class will be affected and how many individuals in each fitness class will be affected. If $f_t(\ell)$ and $\bar{f}_t$ are stochastic variables then so are the selection probabilities $p_t(\ell)$ and, consequently, also the quantity $E[\Delta\mu_t]$ on the l.h.s. of Equations (5.5) and (5.6). In other words, Equations (5.5) and (5.6) give us the expectation of the change in mean program size from one generation to the next conditional on the Tarpeian method modifying the fitness of a particular set of individuals. In formulae,

\[ E[\Delta\mu_t \mid F_t = f_t] = \sum_{\ell} \ell \, \big(p_t(\ell) - \phi(\ell)\big) = \frac{\mathrm{Cov}(\ell, f_t)}{\bar{f}_t}, \tag{5.7} \]

where $F_t$ is a (vector) stochastic variable which represents the fitness associated to the individuals in the population after the application of the Tarpeian method. The distribution $\Pr\{F_t = f_t\}$ of $F_t$ depends on the fitness and size of the individuals in the population and the parameter $p_t$. In principle, we could determine the explicit expression for such a distribution and then compute

\[ E[\Delta\mu_t] = \sum_{f_t} E[\Delta\mu_t \mid F_t = f_t] \, \Pr\{F_t = f_t\}. \tag{5.8} \]

However, working out a closed form for this equation is difficult. The reason is that the fitness values $f_t$ appear at the denominator of the selection probabilities $p_t(\ell)$ via the average fitness $\bar{f}_t$ in addition to appearing at the numerators. To overcome the difficulty and obtain results which allow the application of the theory to the problem of optimally choosing the parameters of the Tarpeian method, we will use the following approximation:

\[ E[\Delta\mu_t] = E\big[E[\Delta\mu_t \mid F_t = f_t]\big] = E\!\left[\frac{\mathrm{Cov}(\ell, f_t)}{\bar{f}_t}\right] \cong \frac{E[\mathrm{Cov}(\ell, f_t)]}{E[\bar{f}_t]}. \tag{5.9} \]

Later in the paper we will get an idea as to the degree of error introduced by the approximation. For now, however, let us see if we can find a closed form for this approximation.
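For a very small population the sum in Equation (5.8) can be evaluated exactly by enumerating every possible outcome of the Tarpeian method, which gives a direct way to compare it with the ratio of expectations in Equation (5.9). The sketch below does this under stated assumptions: fitness proportionate selection, each above-average-size individual hit independently with probability `pt`, and a hypothetical four-program population. It is an illustration only, not the procedure used in the source.

```python
from itertools import product
from statistics import fmean

def covariance(xs, ys):
    """Population covariance (divide by N)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def delta_mu(sizes, fitness):
    """Expected size change for a fixed fitness vector (Equation (5.5)).
    Assumes the fitness values do not sum to zero."""
    weighted = sum(l * f for l, f in zip(sizes, fitness))
    return weighted / sum(fitness) - fmean(sizes)

def tarpeian_outcomes(sizes, fitness, pt, fbad):
    """Yield (probability, modified fitness) for every possible outcome."""
    mu = fmean(sizes)
    big = [i for i, l in enumerate(sizes) if l > mu]
    for hits in product([False, True], repeat=len(big)):
        prob, ft = 1.0, list(fitness)
        for i, hit in zip(big, hits):
            prob *= pt if hit else 1.0 - pt
            if hit:
                ft[i] = fbad
        yield prob, ft

def exact_and_approx(sizes, fitness, pt, fbad):
    outcomes = list(tarpeian_outcomes(sizes, fitness, pt, fbad))
    exact = sum(p * delta_mu(sizes, ft) for p, ft in outcomes)    # Eq. (5.8)
    e_cov = sum(p * covariance(sizes, ft) for p, ft in outcomes)
    e_fbar = sum(p * fmean(ft) for p, ft in outcomes)
    return exact, e_cov / e_fbar                                  # Eq. (5.9)

# Hypothetical population and parameter values, for illustration only.
exact, approx = exact_and_approx([5, 2, 2, 7], [9, 1, 2, 8], pt=0.5, fbad=0.0)
print(exact, approx)
```

The enumeration grows as 2 to the power of the number of above-average-size programs, so it is only practical for toy populations; for tiny populations the two values can differ noticeably.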


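Let us start from computing $E[\bar{f}_t]$:

\[
\begin{aligned}
E[\bar{f}_t] &= E\Big[\sum_{\ell} \phi(\ell) f_t(\ell)\Big] \\
&= \sum_{\ell \le \mu} \phi(\ell) E[f_t(\ell)] + \sum_{\ell > \mu} \phi(\ell) E[f_t(\ell)] \\
&= \sum_{\ell \le \mu} \phi(\ell) f(\ell) + \sum_{\ell > \mu} \phi(\ell) \big[p_t \times f_{\text{bad}} + (1 - p_t) \times f(\ell)\big] \\
&= \sum_{\ell} \phi(\ell) f(\ell) - \sum_{\ell > \mu} \phi(\ell) f(\ell) + \sum_{\ell > \mu} \phi(\ell) \big[p_t \times f_{\text{bad}} + (1 - p_t) \times f(\ell)\big] \\
&= \bar{f} + \sum_{\ell > \mu} \phi(\ell) \big[p_t \times f_{\text{bad}} + (1 - p_t) \times f(\ell) - f(\ell)\big] \\
&= \bar{f} + \sum_{\ell > \mu} \phi(\ell) \big[p_t \times f_{\text{bad}} - p_t \times f(\ell)\big] \\
&= \bar{f} - p_t \sum_{\ell > \mu} \phi(\ell) \big(f(\ell) - f_{\text{bad}}\big) \\
&= \bar{f} - p_t \, \phi_{>} \sum_{\ell > \mu} \frac{\phi(\ell)}{\phi_{>}} \big(f(\ell) - f_{\text{bad}}\big) \\
&= \bar{f} - p_t \, \phi_{>} \big(\bar{f}_{>} - f_{\text{bad}}\big)
\end{aligned}
\tag{5.10}
\]

where $\phi_{>} = \sum_{\ell > \mu} \phi(\ell)$ is the proportion of above-average-size programs and $\bar{f}_{>}$ is the average fitness of such programs.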
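Equation (5.10) is exact rather than approximate, because $\bar{f}_t$ is linear in the individual fitness values. This can be checked numerically on a toy case by comparing the closed form with a brute-force expectation over all Tarpeian outcomes. The sketch below assumes independent hits with probability `pt` on each above-average-size individual; the population and parameter values are hypothetical.

```python
from itertools import product
from statistics import fmean

def closed_form(sizes, fitness, pt, fbad):
    """Right-hand side of Equation (5.10)."""
    mu = fmean(sizes)
    above = [f for l, f in zip(sizes, fitness) if l > mu]
    if not above:                      # every program has the same size
        return fmean(fitness)
    phi_gt = len(above) / len(sizes)   # proportion of above-average-size programs
    fbar_gt = fmean(above)             # mean fitness of those programs
    return fmean(fitness) - pt * phi_gt * (fbar_gt - fbad)

def enumerated(sizes, fitness, pt, fbad):
    """Expected mean fitness after Tarpeian, summed over every outcome."""
    mu = fmean(sizes)
    big = [i for i, l in enumerate(sizes) if l > mu]
    total = 0.0
    for hits in product([False, True], repeat=len(big)):
        prob, ft = 1.0, list(fitness)
        for i, hit in zip(big, hits):
            prob *= pt if hit else 1.0 - pt
            if hit:
                ft[i] = fbad
        total += prob * fmean(ft)
    return total

sizes, fitness = [5, 2, 2, 7], [9, 1, 2, 8]   # hypothetical population
print(closed_form(sizes, fitness, 0.3, 0.5))
print(enumerated(sizes, fitness, 0.3, 0.5))
```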


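Let us now compute the expected covariance between $\ell$ and $f_t$:

\[
\begin{aligned}
E[\mathrm{Cov}(\ell, f_t)] &= E\Big[\sum_{\ell} \phi(\ell)(\ell - \mu)\big(f_t(\ell) - \bar{f}_t\big)\Big] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu) \, E\big[f_t(\ell) - \bar{f}_t\big] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - E[\bar{f}_t]\big) \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - \bar{f} + p_t \phi_{>} (\bar{f}_{>} - f_{\text{bad}})\big) \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - \bar{f}\big) + p_t \phi_{>} (\bar{f}_{>} - f_{\text{bad}}) \underbrace{\sum_{\ell} \phi(\ell)(\ell - \mu)}_{=0} \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - \bar{f}\big) \\
&= \sum_{\ell \le \mu} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - \bar{f}\big) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big(E[f_t(\ell)] - \bar{f}\big) \\
&= \sum_{\ell \le \mu} \phi(\ell)(\ell - \mu)\big(f(\ell) - \bar{f}\big) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big[p_t f_{\text{bad}} + (1 - p_t) f(\ell) - \bar{f}\big] \\
&= \sum_{\ell} \phi(\ell)(\ell - \mu)\big(f(\ell) - \bar{f}\big) - \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big(f(\ell) - \bar{f}\big) \\
&\quad + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big[p_t f_{\text{bad}} + (1 - p_t) f(\ell) - \bar{f}\big] \\
&= \mathrm{Cov}(\ell, f) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big[p_t f_{\text{bad}} + (1 - p_t) f(\ell) - \bar{f} - f(\ell) + \bar{f}\big].
\end{aligned}
\]

Thus

\[ E[\mathrm{Cov}(\ell, f_t)] = \mathrm{Cov}(\ell, f) - p_t \sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big(f(\ell) - f_{\text{bad}}\big). \tag{5.11} \]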


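If $\mu_{>}$ is the average size of the programs that are longer than $\mu$, we can write

\[
\begin{aligned}
\sum_{\ell > \mu} \phi(\ell)(\ell - \mu)\big(f(\ell) - f_{\text{bad}}\big)
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>} - \mu + \mu_{>})\big(f(\ell) - f_{\text{bad}}\big) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\big(f(\ell) - f_{\text{bad}}\big) - (\mu - \mu_{>}) \sum_{\ell > \mu} \phi(\ell)\big(f(\ell) - f_{\text{bad}}\big) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\big(f(\ell) - \bar{f}_{>} - f_{\text{bad}} + \bar{f}_{>}\big) - (\mu - \mu_{>}) \phi_{>} (\bar{f}_{>} - f_{\text{bad}}) \\
&= \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})\big(f(\ell) - \bar{f}_{>}\big) + \sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})(\bar{f}_{>} - f_{\text{bad}}) \\
&\quad - (\mu - \mu_{>}) \phi_{>} (\bar{f}_{>} - f_{\text{bad}}) \\
&= \phi_{>} \mathrm{Cov}_{>}(\ell, f) + (\bar{f}_{>} - f_{\text{bad}}) \underbrace{\sum_{\ell > \mu} \phi(\ell)(\ell - \mu_{>})}_{=0} - (\mu - \mu_{>}) \phi_{>} (\bar{f}_{>} - f_{\text{bad}}) \\
&= \phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big],
\end{aligned}
\]

where $\mathrm{Cov}_{>}(\ell, f)$ is the covariance between program size and fitness within the programs which are of above-average size. Thus, we finally obtain

\[ E[\mathrm{Cov}(\ell, f_t)] = \mathrm{Cov}(\ell, f) - p_t \, \phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]. \tag{5.12} \]

Substituting Equations (5.10) and (5.12) into Equation (5.9) we obtain

\[ E[\Delta\mu_t] \cong \frac{\mathrm{Cov}(\ell, f) - p_t \, \phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]}{\bar{f} - p_t \, \phi_{>} (\bar{f}_{>} - f_{\text{bad}})}. \tag{5.13} \]

With this explicit formulation of the expected size changes, following the same strategy as in the covariant parsimony pressure method (see Section 2), we can find out for what value of $p_t$ we get $E[\Delta\mu_t] = 0$. By setting the l.h.s. of Equation (5.13) to 0 and solving for $p_t$, we obtain:

\[ p_t \cong \frac{\mathrm{Cov}(\ell, f)}{\phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]}. \tag{5.14} \]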

This equation allows one to determine how often the Tarpeian method should be applied to modify the fitness of above-average-size programs as a function of a small set of descriptors of the current state of the population and of the parameter fbad .
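Equation (5.14) can be computed in a few lines once the population descriptors are available. The following is a minimal sketch, assuming population (divide-by-N) covariances and at least one above-average-size program; `optimal_pt` is an illustrative name. The test population is the four-program example discussed in Section 4, for which the text reports a value of about 0.818182.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance (divide by N)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def optimal_pt(sizes, fitness, fbad=0.0):
    """Equation (5.14): Tarpeian rate giving zero expected size change."""
    mu = fmean(sizes)
    above = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    big_sizes = [l for l, _ in above]
    big_fit = [f for _, f in above]
    phi_gt = len(above) / len(sizes)        # proportion of above-average-size programs
    mu_gt = fmean(big_sizes)                # their mean size
    fbar_gt = fmean(big_fit)                # their mean fitness
    cov_gt = covariance(big_sizes, big_fit) # size/fitness covariance among them
    denom = phi_gt * (cov_gt + (mu_gt - mu) * (fbar_gt - fbad))
    return covariance(sizes, fitness) / denom

print(round(optimal_pt([5, 2, 2, 7], [9, 1, 2, 8]), 6))  # 0.818182
```

The function returns whatever the formula gives, including values below 0 or above 1. How such values should be handled in a run (for example by clipping to the interval [0, 1]) is left to the caller here.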


We should note that for some values of $f_{\text{bad}}$ the method is unable to control bloat. For such values, one would need to set $p_t > 1$, which is clearly impossible (since $p_t$ is a probability). Naturally, we can find out what such values of $f_{\text{bad}}$ are by setting $p_t = 1$ in Equation (5.14) and solving for $f_{\text{bad}}$, obtaining

\[ f_{\text{bad}} \cong \bar{f}_{>} - \frac{\mathrm{Cov}(\ell, f) - \mathrm{Cov}_{>}(\ell, f) \, \phi_{>}}{\phi_{>} (\mu_{>} - \mu)}. \tag{5.15} \]

However, since we normally don't particularly care about the specific value of $f_{\text{bad}}$, as long as the method gets the job done, the obvious and safe choice $f_{\text{bad}} = 0$ is perhaps the most practical one.

What if we wanted $\mu(g)$ to follow, in expectation, a particular function $\gamma(g)$, e.g., the ramp $\gamma(g) = \mu(0) + b \times g$ or a sinusoidal function? The theory helps us in this case as well. What we want is that $E[\mu'_t] = \gamma(g)$, where $g$ is the generation number. Note that $E[\mu'_t] = E[\Delta\mu_t] + \mu$. So, adding $\mu$ to both sides of Equation (5.13) we obtain:

\[ \gamma(g) \cong \frac{\mathrm{Cov}(\ell, f) - p_t \, \phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]}{\bar{f} - p_t \, \phi_{>} (\bar{f}_{>} - f_{\text{bad}})} + \mu. \]

Solving again for $p_t$ yields:

\[ p_t \cong \frac{\mathrm{Cov}(\ell, f) - [\gamma(g) - \mu]\big[\bar{f} - p_t \, \phi_{>} (\bar{f}_{>} - f_{\text{bad}})\big]}{\phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]}. \tag{5.16} \]

Note that, in the absence of sampling noise (i.e., for an infinite population), requiring that $E[\Delta\mu] = 0$ at each generation implies $\mu(g) = \mu(0)$ for all $g > 0$. However, in any finite population the method can only achieve $\Delta\mu = 0$ in expectation, so there can be some random drift in $\mu(g)$ w.r.t. its starting value of $\mu(0)$. If tighter control over the mean program size is desired, one can use Equation (5.16) with the choice $\gamma(g) = \mu(0)$, which leads to the following formula

\[ p_t \cong \frac{\mathrm{Cov}(\ell, f) - [\mu(0) - \mu]\big[\bar{f} - p_t \, \phi_{>} (\bar{f}_{>} - f_{\text{bad}})\big]}{\phi_{>} \big[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \mu)(\bar{f}_{>} - f_{\text{bad}})\big]}. \tag{5.17} \]

Note the similarities and differences between this and Equation (5.14). In the presence of any drift moving $\mu$ away from $\mu(0)$, this equation will actively strengthen the size control pressure to push the mean program size back to its initial value.
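As printed, Equations (5.16) and (5.17) contain $p_t$ on both sides. Collecting the $p_t$ terms of the relation $\gamma(g) - \mu \cong E[\Delta\mu_t]$ from Equation (5.13) gives an explicit form, $p_t \cong \big(\mathrm{Cov}(\ell, f) - [\gamma(g) - \mu]\bar{f}\big) / \big(\phi_{>}[\mathrm{Cov}_{>}(\ell, f) + (\mu_{>} - \gamma(g))(\bar{f}_{>} - f_{\text{bad}})]\big)$. This rearrangement is derived here and is not stated in the source, so it should be read as an assumption about how the equation might be used in practice. The sketch below implements it, together with the $f_{\text{bad}}$ limit of Equation (5.15), and checks it against Equation (5.13); all function names are illustrative.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance (divide by N)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def descriptors(sizes, fitness):
    """Population statistics appearing in Equations (5.13)-(5.17).
    Assumes at least one program is larger than the mean size."""
    mu = fmean(sizes)
    big = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    bl, bf = [l for l, _ in big], [f for _, f in big]
    return {
        "mu": mu, "fbar": fmean(fitness), "cov": covariance(sizes, fitness),
        "phi_gt": len(big) / len(sizes), "mu_gt": fmean(bl),
        "fbar_gt": fmean(bf), "cov_gt": covariance(bl, bf),
    }

def predicted_next_mean(d, pt, fbad):
    """Current mean size plus the right-hand side of Equation (5.13)."""
    gap = d["fbar_gt"] - fbad
    num = d["cov"] - pt * d["phi_gt"] * (d["cov_gt"] + (d["mu_gt"] - d["mu"]) * gap)
    return d["mu"] + num / (d["fbar"] - pt * d["phi_gt"] * gap)

def pt_for_target(d, gamma, fbad):
    """Explicit rearrangement of Equation (5.16) (derived here, not from the source)."""
    gap = d["fbar_gt"] - fbad
    num = d["cov"] - (gamma - d["mu"]) * d["fbar"]
    return num / (d["phi_gt"] * (d["cov_gt"] + (d["mu_gt"] - gamma) * gap))

def fbad_limit(d):
    """Equation (5.15): the value of f_bad at which p_t reaches 1."""
    num = d["cov"] - d["cov_gt"] * d["phi_gt"]
    return d["fbar_gt"] - num / (d["phi_gt"] * (d["mu_gt"] - d["mu"]))

d = descriptors([5, 2, 2, 7], [9, 1, 2, 8])   # hypothetical population
pt = pt_for_target(d, gamma=4.5, fbad=0.0)    # ask for a mean size of 4.5
print(pt, predicted_next_mean(d, pt, 0.0))
print(fbad_limit(d))
```

With `gamma` equal to the current mean size the explicit form reduces to Equation (5.14), which is a useful sanity check on the algebra.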

4. Example and Numerical Corroboration

As an example, let us consider the small population in the first two columns of Table 5-1 and let us apply Equation (5.3) to it. We have that $\mathrm{Cov}(\ell, f) = 6.75$ and $\bar{f} = 5$. So, in the absence of bloat control we will have an expected increase in program size of $E[\Delta\mu] = 1.35$ at the next generation. This is to be expected given the strong correlation between fitness and size in our sample population.

Table 5-1. The effects of the covariant Tarpeian method on a small sample population of 4 individuals. The size and raw fitness of the individuals in the population are shown in the first two columns. The remaining columns report the fitness $f_t$ associated to each such individual after the application of the Tarpeian method with optimal $p_t$ in each of ten trials. The average $E[\Delta\mu]$ over the ten trials is -0.46.

| Size | f | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Trial 6 | Trial 7 | Trial 8 | Trial 9 | Trial 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 9 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 9 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 7 | 8 | 0 | 0 | 8 | 0 | 8 | 0 | 0 | 0 | 8 | 0 |
| E[Δμ] | 1.35 | -2.00 | -2.00 | 1.64 | -2.00 | 1.64 | 0.25 | -2.00 | 0.25 | 1.64 | -2.00 |

Let us now compute $p_t$ using Equation (5.14). Since in our population $\mu = 4$, we have that $\phi_{>} = 0.5$, the programs of size 5 and 7 being of above-average size. Their average size is $\mu_{>} = 6$ and their average fitness is $\bar{f}_{>} = 8.5$. Finally, the covariance between their size and their fitness is $\mathrm{Cov}_{>}(\ell, f) = -0.5$. Using these values and the covariance between size and fitness which we computed previously, and taking the safe value $f_{\text{bad}} = 0$, we obtain $p_t \cong 0.818182$.

Let us now imagine that we adopt this particular value of $p_t$ and let us recompute the Tarpeian fitness of the members of our population based on the application of the Tarpeian method (with $f_{\text{bad}} = 0$). Since the method is stochastic we will do it multiple times, so as to get an idea of its expected behaviour. The results of these trials are shown in columns 3–12 of Table 5-1. Computing the expected change in program size after the application of the Tarpeian method shows that in 5 out of 10 cases it is negative, in 2 cases it is marginally positive and only in the remaining cases it is comparable to (in fact slightly bigger than) what is expected when the Tarpeian method is not used. Indeed, on average we expect a slight contraction in the mean program size of -0.46. In other words, the estimate for $p_t$ has exceeded the value required to achieve a zero expected change in program size. Errors such as this have to be expected given the tiny population we have used.

To corroborate the theory presented in the previous section and evaluate how population size affects the accuracy of our estimate of $p_t$, we need to perform many more trials (so as to avoid small sample errors) with a variety of population sizes. For these tests we will create populations with an extremely high correlation between fitness and size.


Table 5-2. Errors in $E[\Delta\mu_t]$ resulting from the approximations in the calculation of $p_t$ for different population sizes and for a fitness function where $f(\ell) = \ell$. Statistics were computed over 1,000 independent repetitions of the application of the Tarpeian method to a population including programs from size 1 to M, M being the population size.

| Population size M | E[Δμ] without Tarpeian | Estimated optimal p_t | Average E[Δμ_t] with Tarpeian | Standard deviation of E[Δμ_t] |
|---|---|---|---|---|
| 10 | 15.00 | 0.750 | -3.050 | 10.74 |
| 100 | 16.51 | 0.795 | -0.275 | 3.64 |
| 1000 | 16.80 | 0.804 | 0.026 | 1.16 |
| 10000 | 16.83 | 0.805 | -0.004 | 0.36 |
| 100000 | 16.83 | 0.805 | -0.003 | 0.12 |

Our populations include M = 10, 100, 1,000, 10,000 and 100,000 individuals. In each population individual $i$ has size $\ell_i = \frac{i}{M} \times 100$ and fitness $f_i = \ell_i$. These choices would be expected to produce very strong bloat. Indeed, as shown in the second column of Table 5-2 we expect to see the mean size of programs to increase by between 15 and 16.83 at the next generation. We now apply the Tarpeian method with the optimal $p_t$ computed via Equation (5.14) on our test populations 1000 times. The optimal $p_t$ obtained for each population size is shown in the third column of Table 5-2. Each time different individuals are hit by the reduction of fitness associated with the method. So, different expected changes in program size $E[\Delta\mu_t]$ will be produced. The fourth and fifth columns of Table 5-2 show the mean and standard deviations of $E[\Delta\mu_t]$ over the 1000 repetitions of the test. As we can see from these values, in all cases bloat is entirely under control, although, for this problem, Equation (5.14) consistently overestimates $p_t$ thereby leading to slightly shrinking individuals on average. Note how rapidly the mean error becomes very small as the population size grows towards the typical values used in realistic GP runs. The standard deviations also rapidly drop, indicating that the method becomes almost deterministic for very large population sizes. This is confirmed by the distributions of $E[\Delta\mu_t]$ for different population sizes shown in Figure 5-1.
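The numerical test described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: sizes are taken as $100\,i/M$ without any rounding (whether the source rounds sizes is not recoverable from the text), $f_{\text{bad}} = 0$, selection is fitness proportionate, and the random seed is arbitrary. The sampled mean and standard deviation will therefore not match Table 5-2 digit for digit.

```python
import random
from statistics import fmean, pstdev

def covariance(xs, ys):
    """Population covariance (divide by N)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def optimal_pt(sizes, fitness, fbad=0.0):
    """Equation (5.14)."""
    mu = fmean(sizes)
    big = [(l, f) for l, f in zip(sizes, fitness) if l > mu]
    bl, bf = [l for l, _ in big], [f for _, f in big]
    phi_gt = len(big) / len(sizes)
    denom = phi_gt * (covariance(bl, bf) + (fmean(bl) - mu) * (fmean(bf) - fbad))
    return covariance(sizes, fitness) / denom

def delta_mu(sizes, fitness):
    """Expected size change for a fixed fitness vector (Equation (5.5))."""
    return sum(l * f for l, f in zip(sizes, fitness)) / sum(fitness) - fmean(sizes)

def tarpeian_sample(sizes, fitness, pt, fbad, rng):
    """One stochastic application of the Tarpeian method."""
    mu = fmean(sizes)
    return [fbad if l > mu and rng.random() < pt else f
            for l, f in zip(sizes, fitness)]

def experiment(m, repetitions=1000, seed=0):
    rng = random.Random(seed)
    sizes = [100.0 * i / m for i in range(1, m + 1)]  # assumption: no rounding
    fitness = list(sizes)                              # f(l) = l
    pt = optimal_pt(sizes, fitness)
    deltas = [delta_mu(sizes, tarpeian_sample(sizes, fitness, pt, 0.0, rng))
              for _ in range(repetitions)]
    return delta_mu(sizes, fitness), pt, fmean(deltas), pstdev(deltas)

base, pt, mean_delta, sd_delta = experiment(10)
print(base, pt, mean_delta, sd_delta)
```

Calling `experiment` with larger values of `m` gives the other rows of the comparison, at a cost that grows with `m` times the number of repetitions.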

5. Empirical Tests

To further corroborate the theory, we conducted experiments using a linear register-based GP system. The system we used is a generational GP system. It initialises the population by repeatedly creating random individuals with lengths uniformly distributed between 1 and 200 primitives. The primitives are drawn randomly and uniformly from a problem's primitive set. The system uses fitness proportionate selection and crossover applied with a rate of 90%. The remaining 10% of the population is created via selection followed by point mutation (with a rate of 1 mutation per program). Crossover creates offspring by selecting two random crossover points, one in each parent, and taking the first part of the first parent and the second part of the second w.r.t. their crossover points. This is a form of sub-tree crossover for linear structures/trees. We used populations of size 1,000 and 10,000. In each condition we performed 100 independent runs, each lasting either 50 or 100 generations.

With this system we solved a classical symbolic regression problem: the quintic polynomial. In other words, the objective was to evolve a function which fits a polynomial of the form $x + x^2 + \cdots + x^d$, where $d = 5$ is the degree of the polynomial, for $x$ in the range $[-1, 1]$. In particular we sampled the polynomials at the 21 equally spaced points $x \in \{-1, -0.9, \ldots, 0.9, 1.0\}$. Polynomials of this type have been widely used as benchmark problems in the GP literature. Fitness (to be maximised) was $1/(1 + \text{error})$ where error is the sum of the absolute differences between the target polynomial and the output produced by the program under evaluation over these 21 fitness cases.

The primitive set used to solve these problems is shown in Table 5-3. The instructions refer to three registers: the input register RIN which is loaded with the value of $x$ before a fitness case is evaluated and the two registers R1 and R2 which can be used for numerical calculations. R1 and R2 are initialised to $x$ and 0, respectively. The output of the program is read from R1 at the end of its execution.

Table 5-3. Primitive set used in our experiments.

| Instructions |
|---|
| R1 = RIN |
| R2 = RIN |
| R1 = R1 + R2 |
| R2 = R1 + R2 |
| R1 = R1 * R2 |
| R2 = R1 * R2 |
| Swap R1 R2 |

Figure 5-2 shows the results of our runs for populations of size 1000 and 10,000 in the absence of bloat control and when using the version of the Covariant Tarpeian method in Equation (5.17). Figure 5-3 shows the results for a population of size 1000 when using the version of the Covariant Tarpeian method in Equation (5.16) where $\gamma(g)$ is the following triangle wave of period 50 generations:

\[ \gamma(g) = 100 \times \left( 0.75 + 0.5 \times \left| \frac{g + 12.5}{50} - \left\lfloor \frac{g + 12.5}{50} + 0.5 \right\rfloor \right| \right). \tag{5.18} \]
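The register machine and fitness measure described in this section can be sketched compactly. The code below is an illustration under stated assumptions rather than the system used in the experiments: the string encoding of the instructions, the function names and the range of admissible crossover points are all hypothetical choices, and numerical overflow in very long programs is not handled.

```python
import random

def run(program, x):
    """Execute a linear program; R1 and R2 start at x and 0, output is R1."""
    r1, r2 = x, 0.0
    for op in program:
        if op == "R1=RIN":
            r1 = x
        elif op == "R2=RIN":
            r2 = x
        elif op == "R1=R1+R2":
            r1 = r1 + r2
        elif op == "R2=R1+R2":
            r2 = r1 + r2
        elif op == "R1=R1*R2":
            r1 = r1 * r2
        elif op == "R2=R1*R2":
            r2 = r1 * r2
        elif op == "SWAP":
            r1, r2 = r2, r1
        else:
            raise ValueError(f"unknown instruction: {op}")
    return r1

CASES = [i / 10 for i in range(-10, 11)]   # the 21 points -1, -0.9, ..., 1

def target(x, degree=5):
    """x + x^2 + ... + x^degree."""
    return sum(x ** k for k in range(1, degree + 1))

def fitness(program):
    """1 / (1 + sum of absolute errors over the fitness cases)."""
    error = sum(abs(target(x) - run(program, x)) for x in CASES)
    return 1.0 / (1.0 + error)

def crossover(first, second, rng):
    """Head of the first parent up to its cut, tail of the second from its cut.
    Assumption: cut points may fall anywhere from 0 to the parent's length."""
    a = rng.randint(0, len(first))
    b = rng.randint(0, len(second))
    return first[:a] + second[b:]

# A hand-written program that evaluates the quintic by repeated multiply-add.
horner = ["R2=RIN"] + ["R1=R1*R2", "R1=R1+R2"] * 4
print(fitness(horner))
```

The hand-written `horner` program shows that the primitive set can express the target exactly, since $((((x \cdot x + x)x + x)x + x)x + x)$ expands to $x + x^2 + \cdots + x^5$.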


Table 5-4. Comparison of success rates in the quintic polynomial regression for different population sizes with and without Tarpeian bloat control. Runs were declared successful if the sum of absolute errors in the best individual fell below 1. Tarpeian bloat control was exerted using Equation (5.16) with $\gamma(g) = \mu(0)$ ("Covariant Tarpeian constant") or with the $\gamma(g)$ function in Equation (5.18) ("Covariant Tarpeian triangle").

| Bloat control | Population size | Success rate |
|---|---|---|
| None | 1,000 | 94% |
| Covariant Tarpeian constant | 1,000 | 92% |
| Covariant Tarpeian triangle | 1,000 | 95% |
| None | 10,000 | 100% |
| Covariant Tarpeian constant | 10,000 | 100% |

It is apparent that in the absence of bloat control there is very substantial bloat, while the Covariant Tarpeian method provides almost total control over the size dynamics. It has sometimes been suggested that bloat control techniques can harm performance. One may wonder, then, if performance was affected by the use of the covariant Tarpeian method. In the quintic polynomial regression there was very little variation in the success rate (for a given population size) across techniques, as illustrated in Table 5-4. This is very encouraging, but it would be surprising if in other problems and for other parameter settings there weren’t some performance differences. Future research will need to explore this.

6. Conclusions

There are almost as many anti-bloat recipes as there are researchers in genetic programming. Very few, however, have a theoretical pedigree. The Tarpeian method (Poli, 2003) is one of them. In recent years, the method has started becoming more and more widespread, probably because of its simplicity. The method, however, like most others, requires setting one main parameter (and one secondary one) for it to perform appropriately. Until now this parameter had to be set by trial and error. In this paper we integrate the theory that led to the development of the original Tarpeian method with ideas that recently led to the covariant parsimony pressure method (Poli and McPhee, 2008) (another theoretically derived method), to obtain equations which allow one to optimally set the parameter(s) of the method so as to achieve almost full control over the evolution of the mean program size in runs of genetic programming. Although the complexity of the task has forced us to rely on approximations to make progress, numerical and empirical corroboration confirm that the quality of the approximation is good. Experiments have also confirmed the effectiveness of the Covariant Tarpeian method.


References

Alfaro-Cid, Eva, Merelo, J. J., Fernandez de Vega, Francisco, Esparcia-Alcazar, Anna I., and Sharman, Ken (2010). Bloat control operators and diversity in genetic programming: A comparative study. Evolutionary Computation, 18(2):305–332.

Allen, Sam, Burke, Edmund K., Hyde, Matthew R., and Kendall, Graham (2009). Evolving reusable 3D packing heuristics with genetic programming. In Raidl, Guenther, Rothlauf, Franz, Squillero, Giovanni, Drechsler, Rolf, Stuetzle, Thomas, Birattari, Mauro, Congdon, Clare Bates, Middendorf, Martin, Blum, Christian, Cotta, Carlos, Bosman, Peter, Grahl, Joern, Knowles, Joshua, Corne, David, Beyer, Hans-Georg, Stanley, Ken, Miller, Julian F., van Hemert, Jano, Lenaerts, Tom, Ebner, Marc, Bacardit, Jaume, O'Neill, Michael, Di Penta, Massimiliano, Doerr, Benjamin, Jansen, Thomas, Poli, Riccardo, and Alba, Enrique, editors, GECCO '09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 931–938, Montreal. ACM.

Burke, Edmund K., Hyde, Matthew R., Kendall, Graham, and Woodward, John (2007). Automatic heuristic generation with genetic programming: evolving a jack-of-all-trades or a master of one. In Thierens, Dirk, Beyer, Hans-Georg, Bongard, Josh, Branke, Jurgen, Clark, John Andrew, Cliff, Dave, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Kovacs, Tim, Kumar, Sanjeev, Miller, Julian F., Moore, Jason, Neumann, Frank, Pelikan, Martin, Poli, Riccardo, Sastry, Kumara, Stanley, Kenneth Owen, Stutzle, Thomas, Watson, Richard A, and Wegener, Ingo, editors, GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, volume 2, pages 1559–1565, London. ACM Press.

Chouza, Mariano, Rancan, Claudio, Clua, Osvaldo, and Garcia-Martinez, Ramon (2009). Passive analog filter design using GP population control strategies. In Chien, Been-Chian and Hong, Tzung-Pei, editors, Opportunities and Challenges for Next-Generation Applied Intelligence: Proceedings of the International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA-AIE) 2009, volume 214 of Studies in Computational Intelligence, pages 153–158. Springer-Verlag.

Garcia, Beatriz, Aler, Ricardo, Ledezma, Agapito, and Sanchis, Araceli (2008a). Genetic programming for predicting protein networks. In Geffner, Hector, Prada, Rui, Alexandre, Isabel Machado, and David, Nuno, editors, Proceedings of the 11th Ibero-American Conference on AI, IBERAMIA 2008, volume 5290 of Lecture Notes in Computer Science, pages 432–441, Lisbon, Portugal. Springer. Advances in Artificial Intelligence.

Garcia, Beatriz, Aler, Ricardo, Ledezma, Agapito, and Sanchis, Araceli (2008b). Protein-protein functional association prediction using genetic programming. In Keijzer, Maarten, Antoniol, Giuliano, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Hansen, Nikolaus, Holmes, John H., Hornby, Gregory S., Howard, Daniel, Kennedy, James, Kumar, Sanjeev, Lobo, Fernando G., Miller, Julian Francis, Moore, Jason, Neumann, Frank, Pelikan, Martin, Pollack, Jordan, Sastry, Kumara, Stanley, Kenneth, Stoica, Adrian, Talbi, El-Ghazali, and Wegener, Ingo, editors, GECCO '08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 347–348, Atlanta, GA, USA. ACM.

Garcia-Almanza, Alma Lilia and Tsang, Edward P. K. (2006). Simplifying decision trees learned by genetic programming. In Proceedings of the 2006 IEEE Congress on Evolutionary Computation, pages 7906–7912, Vancouver. IEEE Press.

Koza, John R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.

Luke, Sean and Panait, Liviu (2006). A comparison of bloat control methods for genetic programming. Evolutionary Computation, 14(3):309–344.

Mahler, Sébastien, Robilliard, Denis, and Fonlupt, Cyril (2005). Tarpeian bloat control and generalization accuracy. In Keijzer, Maarten, Tettamanzi, Andrea, Collet, Pierre, van Hemert, Jano I., and Tomassini, Marco, editors, Proceedings of the 8th European Conference on Genetic Programming, volume 3447 of Lecture Notes in Computer Science, pages 203–214, Lausanne, Switzerland. Springer.

Martinez-Jaramillo, Serafin and Tsang, Edward P. K. (2009). An heterogeneous, endogenous and coevolutionary GP-based financial market. IEEE Transactions on Evolutionary Computation, 13(1):33–55.

Poli, Riccardo (2003). A simple but theoretically-motivated method to control bloat in genetic programming. In Ryan, Conor, Soule, Terence, Keijzer, Maarten, Tsang, Edward, Poli, Riccardo, and Costa, Ernesto, editors, Genetic Programming, Proceedings of EuroGP'2003, volume 2610 of LNCS, pages 204–217, Essex. Springer-Verlag.

Poli, Riccardo, Langdon, William B., and McPhee, Nicholas Freitag (2008). A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza).

Poli, Riccardo and McPhee, Nicholas (2008). Parsimony pressure made easy. In Keijzer, Maarten, Antoniol, Giuliano, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Hansen, Nikolaus, Holmes, John H., Hornby, Gregory S., Howard, Daniel, Kennedy, James, Kumar, Sanjeev, Lobo, Fernando G., Miller, Julian Francis, Moore, Jason, Neumann, Frank, Pelikan, Martin, Pollack, Jordan, Sastry, Kumara, Stanley, Kenneth, Stoica, Adrian, Talbi, El-Ghazali, and Wegener, Ingo, editors, GECCO '08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 1267–1274, Atlanta, GA, USA. ACM.

Poli, Riccardo and McPhee, Nicholas Freitag (2003). General schema theory for genetic programming with subtree-swapping crossover: Part II. Evolutionary Computation, 11(2):169–206.

Price, George R. (1970). Selection and covariance. Nature, 227, August 1:520–521.

Roberts, Mark E. and Claridge, Ela (2004). Cooperative coevolution of image feature construction and object detection. In Yao, Xin, Burke, Edmund, Lozano, Jose A., Smith, Jim, Merelo-Guervós, Juan J., Bullinaria, John A., Rowe, Jonathan, Tiňo, Peter, Kabán, Ata, and Schwefel, Hans-Paul, editors, Parallel Problem Solving from Nature - PPSN VIII, volume 3242 of LNCS, pages 902–911, Birmingham, UK. Springer-Verlag.

Silva, Sara (2008). Controlling Bloat: Individual and Population Based Approaches in Genetic Programming. PhD thesis, Coimbra University, Portugal. Full author name is Sara Guilherme Oliveira da Silva.

Wyns, Bart and Boullart, Luc (2009). Efficient tree traversal to reduce code growth in tree-based genetic programming. Journal of Heuristics, 15(1):77–104.

Zhang, Byoung-Tak and Mühlenbein, Heinz (1993). Evolving optimal neural networks using genetic algorithms with Occam's razor. Complex Systems, 7:199–220.

Zhang, Byoung-Tak and Mühlenbein, Heinz (1995). Balancing accuracy and parsimony in genetic programming. Evolutionary Computation, 3(1):17–38.

Zhang, Byoung-Tak, Ohm, Peter, and Mühlenbein, Heinz (1997). Evolutionary induction of sparse neural trees. Evolutionary Computation, 5(2):213–236.

Figure 5-1. Distributions of $E[\Delta\mu_t]$ resulting from the application of the Covariant Tarpeian method for populations of size 10 (top left), 100 (top right), 1,000 (middle left), 10,000 (middle right) and 100,000 (bottom) with our sample fitness function.

Figure 5-2. Mean program size for populations of size 1000 (a) and 10,000 (b) as a function of the generation number on the quintic polynomial symbolic regression in the absence of bloat control and when using the version of the Covariant Tarpeian method in Equation (5.17). (Axes: Program Size against Generations; curves: "Tarpeian method" and "no bloat control".)

Figure 5-3. Average program size for populations of size 1000 and runs lasting 100 generations with the quintic polynomial symbolic regression when using the version of the Covariant Tarpeian method in Equation (5.16) where $\gamma(g)$ is a triangle wave. The dashed line represents the mean of the average program size across runs. (Axes: Program Size against Generations.)

Chapter 6

A SURVEY OF SELF MODIFYING CARTESIAN GENETIC PROGRAMMING

Simon Harding¹, Wolfgang Banzhaf¹ and Julian F. Miller²

¹ Department Of Computer Science, Memorial University, Canada; ² Department Of Electronics, University of York, UK.

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_6, © Springer Science+Business Media, LLC 2011

Abstract

Self-Modifying Cartesian Genetic Programming (SMCGP) is a general purpose, graph-based, developmental form of Cartesian Genetic Programming. In addition to the usual computational functions found in CGP, SMCGP includes functions that can modify the evolved program at run time. This means that programs can be iterated to produce an infinite sequence of phenotypes from a single evolved genotype. Here, we discuss the results of using SMCGP on a variety of different problems, and see that SMCGP is able to solve tasks that require scalability and plasticity. We demonstrate how SMCGP is able to produce results that would be impossible for conventional, static Genetic Programming techniques.

Keywords: Cartesian genetic programming, developmental systems

1. Introduction

In evolutionary computation (EC) scalability has always been an important issue. An evolutionary technique is scalable if the generational time it takes to evolve a satisfactory solution to a problem increases relatively weakly with increasing problem size. As in EC, scalability is an important issue in Genetic Programming (GP). In GP important methods for improving scalability are modularity and re-use. Modularity is introduced through sub-functions or subprocedures. These are often called Automatically Defined Functions (ADFs) (Koza, 1994a). The use of ADFs improves the scalability of GP by allowing solutions of larger or more difficult instances of particular classes of problems to be evolved. However, GP methods in general have largely employed genotype representations whose length (number of genes) is proportional to the size of the anticipated problem solutions. This has meant that evolutionary operators (e.g. crossover or mutation) have been used as the mechanism for building large genotypes. The same idea underlies approaches to evolve artificial neural networks. For instance, a well known method called NEAT uses evolutionary operators to introduce new neurons and connections, thus expanding the size of the genotype (Stanley and Miikkulainen, 2002).

It is interesting to contrast these approaches to mechanisms employed in evolution of biological organisms. Multicellular organisms, having possibly enormous phenotypes, are developed from relatively simple genotypes. Development implies an unfolding in space and time. It is clearly promising to consider employing an analogue of biological development in genetic programming (Banzhaf and Miller, 2004). There are, of course, many possible aspects of developmental biology that could be adopted to construct a developmental GP method. In this chapter we discuss one such approach. It is called Self Modifying Cartesian Genetic Programming (SMCGP). It is based on a simple underlying idea, namely that a phenotype can unfold over time from a genotype by allowing the genotype to include primitive functions which act on the genotype itself. We refer to this as self-modification. As far as the authors are aware, self-modification is included in only one existing GP system: Lee Spector's Push GP language (Spector and Robinson, 2002). One of the attractive aspects of introducing primitive self-modification functions is that it is relatively easy to include them in any GP system.

Since 2007, SMCGP has been applied to a variety of computational problems. In the ensuing time the actual details of the SMCGP implementation have changed; however, the key concepts and philosophy have remained the same. Here we present the latest version. We explain the essentials of how SMCGP works in section 2. Section 3 discusses briefly examples of previous work with SMCGP. In section 4 we compare and contrast the way other GP systems include iteration with the iterative unrolling that occurs in SMCGP. We end the chapter with conclusions and suggestions for future work.

2. Self Modifying Cartesian Genetic Programming

As the name suggests, SMCGP is based on the Cartesian Genetic Programming technique. In CGP, programs are encoded in a partly connected, feed forward graph. A full description can be found in (Miller and Thomson, 2000). The genotype encodes this graph. Associated with each node in the graph are genes that represent the node function and genes representing connections to either other nodes or terminals. The representation has a number of interesting features. Firstly, not all of the nodes in the genotype need to be connected to the output, so there is a degree of neutrality which has been shown to be very useful (Miller and Thomson, 2000; Vassilev and Miller, 2000; Yu and Miller,

A Survey of Self Modifying CGP


2001; Miller and Smith, 2006). Secondly, as the genotype encodes a graph there is reuse of nodes, which makes the representation very compact and also distinct from tree-based GP. Although CGP has been used in various ways in developmental systems (Miller, 2004; Miller and Thomson, 2003; Khan et al., 2007), the programs that it produces are not themselves developmental. Instead, these approaches used a fixed-length genotype to represent the programs defining the behaviour of cells.

SMCGP’s representation is similar to CGP in some ways, but has extensions that allow it to have self modifying features. SMCGP genotypes are a linear string of nodes. That is to say, only one row of nodes is used (in contrast to CGP, which can have a rectangular grid of nodes). Whereas connection genes in CGP are absolute addresses, indicating where the data supplied to a node is to be obtained, SMCGP uses relative addressing. Each node obtains its data inputs by counting back from its own position in the graph by the amounts given in its connection genes. To prevent cycles, nodes can only connect to previous nodes (on their left). The relative addressing allows sections of the graph to be moved, duplicated, deleted, etc. without breaking the constraints of the structure, whilst allowing some sort of modularity. Compared with CGP, SMCGP also has some extra genes that are used by self-modification functions to identify parts or characteristics of the graph that will be changed.

Another change from CGP is the way SMCGP handles inputs and outputs. Terminals are acquired through special functions (called INP, INPP, SKIPINP) and program outputs are taken from a special function called OUTPUT. This is an important change as it enables SMCGP programs to obtain and deliver as many inputs or outputs as required by the problem domain, during program execution. This allows the possibility of evolving general solutions to problems. For example, to find a program that can compute even-n parity, where n is arbitrary, one needs to be able to acquire an arbitrary number of inputs or terminals.

In summary, each node in the SMCGP graph contains a number of evolvable elements:

- The function, represented in the genotype as an integer.
- A list of (relative) connection addresses, again represented as integers.
- A set of 3 floating point number arguments used by self-modification functions.

There are also primitive functions that acquire or deliver inputs and outputs. As with CGP, the number of nodes in the genotype is typically kept constant through an experiment. However, this means care has to be taken to ensure that the genotype is large enough to store the target program.
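The node layout and relative addressing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Node` and `input_positions` are hypothetical, and reporting an address that falls off the left edge of the graph as `None` is an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    function: int           # integer index into a function table
    connections: List[int]  # relative addresses: how far to count back (>= 1)
    params: List[float]     # the three arguments used by self-modification functions


def input_positions(graph: List[Node], x: int) -> List[Optional[int]]:
    """Absolute positions read by the node at position x, found by counting back from x."""
    positions = []
    for offset in graph[x].connections:
        source = x - offset
        # Assumption: an address before the start of the graph is reported as None.
        positions.append(source if source >= 0 else None)
    return positions


graph = [
    Node(0, [1, 1], [0.0, 0.0, 0.0]),
    Node(1, [1, 1], [0.0, 0.0, 0.0]),
    Node(2, [1, 2], [0.0, 0.0, 0.0]),
]
# Inserting a node on the far left shifts every absolute position by one,
# yet the last node still reads from the same two neighbours.
shifted = [Node(3, [1, 1], [0.0, 0.0, 0.0])] + graph
```

Because only offsets are stored, the block of nodes keeps its internal wiring when it is moved or copied, which is the property the self-modification functions rely on.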


Executing a SMCGP Individual

SMCGP individuals are evaluated in a multi-step process, with the evolved program (the phenotype) executed several times. The evolved program in SMCGP initially has the same structure as the genotype, hence the first step is to make a copy of the genotype and call it the phenotype. This graph is the ‘working copy’ of the program. Each time the program is executed, the graph is first run and then any self modification operations required are invoked.

The graph is executed in the following manner. First, the node (or nodes) to be used as outputs are identified. This is done by reading through the graph looking for nodes of type OUTPUT. Once a sufficient number of these nodes has been found, the various nodes that they connect to are identified. If not enough output nodes are found, then the last n nodes in the graph are used, where n is the number of outputs required. If there are not enough nodes to satisfy this requirement, then the execution is aborted and the individual is discarded. At this point in the decoding, all the nodes that are actually used by the program have been identified and so their values can be calculated (the other nodes can simply be ignored).

The mathematical and binary operators are evaluated in the usual manner. However, as mentioned before, SMCGP has a number of special functions. Table 6-1 lists some of the self modification functions used in previous work (see section 3). The first special functions are the INP and INPP functions. Each time the INP function is called it returns the next available input (starting with the first, and returning to the first after reading the last input). The INPP function is similar, but moves backwards through the inputs. SKIPINP allows a number of inputs to be ignored, and then returns the next input. These functions help SMCGP to scale to handle increasing numbers of inputs through development. This also applies to the use of the OUTPUT function, which allows the number of outputs to change over time.

If a function is a self modification function, then it may be activated depending on the following rules. When the functions are binary, self modification nodes are always activated. When they are numeric, a node is activated if its first input is larger than its second input. The self modification operation from an activated node is added to a list of pending operations, the ‘ToDo’ list. The maximum length of the list is a parameter of the system. After execution, the self modification functions on the ToDo list are applied to the current graph. The ToDo list is operated as a FIFO list in which the leftmost activated SM function is the first to be executed (and so on).

The self modification functions require arguments defining which parts of the phenotype the function operates on. These are taken from the arguments of the calling node. Many of the arguments are used as integers, so the stored floating point values may need to be cast. The arguments may be treated as an address (depending on the function) and, like all SMCGP operations, these are relative addresses. The program can now be iterated again, if necessary.
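The input functions and the ToDo list described in this section can be sketched as below. All names here (`InputCursor`, `collect_todo`, `apply_todo`) are hypothetical. The exact pointer behaviour of INPP and SKIPINP, and the choice to ignore operations once the ToDo list is full, are assumptions made for illustration; the text only fixes the wrap-around of INP, the length limit, and the FIFO order.

```python
from collections import deque


class InputCursor:
    """Hands out program inputs one at a time (INP, INPP, SKIPINP)."""

    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.pos = 0  # index of the input that INP returns next

    def inp(self):
        value = self.inputs[self.pos]
        self.pos = (self.pos + 1) % len(self.inputs)  # wrap back to the first input
        return value

    def inpp(self):
        # Assumption: step the pointer back one place and return that input.
        self.pos = (self.pos - 1) % len(self.inputs)
        return self.inputs[self.pos]

    def skipinp(self, count):
        # Assumption: ignore `count` inputs, then behave like INP.
        self.pos = (self.pos + count) % len(self.inputs)
        return self.inp()


def collect_todo(activated_ops, max_len):
    """Queue activated operations in left-to-right order, up to max_len entries."""
    todo = deque()
    for op in activated_ops:
        if len(todo) < max_len:  # assumption: operations beyond the limit are dropped
            todo.append(op)
    return todo


def apply_todo(phenotype, todo):
    """Apply queued operations first-in, first-out; each returns a new phenotype."""
    while todo:
        op = todo.popleft()
        phenotype = op(phenotype)
    return phenotype
```

In this sketch a self modification operation is simply a callable that takes the phenotype and returns the modified phenotype, so the graph is only changed after the current execution has finished.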

3. Summary of Previous Work in SMCGP

Early experiments

There are very few benchmark problems in the developmental system literature. In the first paper on SMCGP (Harding et al., 2007), we identified two possible challenges that had been described previously. The first was to find a program that generates a sequence of squares (i.e. 0, 1, 4, 9, 16, 25, ...) using a restricted set of mathematical operators such as + and −, but not multiplication or power. Without some form of self modification this challenge would be impossible to solve (Spector and Stoffel, 1996). SMCGP was easily able to solve this problem (89% success rate), and a large number of different solutions were found. Typical solutions were similar to the program in Table 6-2, where the program grew in length by adding new terms. During evolution, solutions were only tested up to the first 10 iterations. However, after evolution the solutions were tested for generality by increasing the number of iterations to 50. 66% of the solutions were correct to 50 iterations. Thus SMCGP was able to find general solutions.

The next benchmark problem was the French Flag (FF) problem. Several developmental systems have been tested on generating the FF pattern (Miller, 2003; Miller and Banzhaf, 2003; Miller, 2004), and it is one of the few common problems tackled. In this problem, the task is to evolve a program that can assign the states of cells (represented as colours) into three distinct regions so that the complete set of cells looks like a French Flag. However, the design goals of SMCGP are very different to those the FF task demands. Many developmental systems are built around the idea of multi-cellularity and, although they are capable of producing cellular patterns or even concentrations of simulated proteins, they are not explicitly computational in the sense of Genetic Programming. Often researchers have to devise somewhat arbitrary mappings from developmental outputs (i.e. cell states and protein levels) to those required for some computational application. SMCGP is designed to be an explicitly computational developmental system from the outset.

Typically, the FF is produced via a type of cellular automaton (CA), where each ‘alive’ cell contains a copy of an evolved program or set of update rules. We could have taken this approach with SMCGP, but we decided on a more abstract interpretation of the problem. In the CA version, each cell in the CA is analogous to a biological cell. In SMCGP, the biological abstractions


Basic
  Delete (DEL): Delete the nodes between (P0 + x) and (P0 + x + P1).
  Add (ADD): Add P1 new random nodes after (P0 + x).
  Move (MOV): Move the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2).

Duplication
  Overwrite (OVR): Copy the nodes between (P0 + x) and (P0 + x + P1) to position (P0 + x + P2), replacing existing nodes in the target position.
  Duplication (DUP): Copy the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2).
  Duplicate Preserving Connections (DU3): Copy the nodes between (P0 + x) and (P0 + x + P1) and insert after (P0 + x + P2). When copying, this function modifies the c_ij of the copied nodes so that they continue to point to the original nodes.
  Duplicate and scale addresses (DU4): Starting from position (P0 + x), copy P1 nodes and insert after the node at position (P0 + x + P1). During the copy, the c_ij of copied nodes are multiplied by P2.
  Copy To Stop (COPYTOSTOP): Copy from x to the next “COPYTOSTOP” or “STOP” function node, or the end of the graph. Nodes are inserted at the position the operator stops at.
  Stop Marker (STOP): Marks the end of a COPYTOSTOP section.

Connection modification
  Shift Connections (SHIFTCONNECTION): Starting at node index (P0 + x), add P2 to the values of the c_ij of the next P1 nodes.
  Shift Connections 2 (MULTCONNECTION): Starting at node index (P0 + x), multiply the c_ij of the next P1 nodes by P2.
  Change Connection (CHC): Change the (P1 mod 3)th connection of node P0 to P2.

Function modification
  Change Function (CHF): Change the function of node P0 to the function associated with P1.
  Change Parameter (CHP): Change the (P1 mod 3)th parameter of node P0 to P2.

Miscellaneous
  Flush (FLR): Clears the contents of the ToDo list.

Table 6-1. Self modification functions. x represents the absolute position of the node in the graph, where the leftmost node has position 0. PN are evolved parameters stored in each node.
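To make the addressing in Table 6-1 concrete, the following sketch applies DEL and DUP to a phenotype held as a plain Python list. It is an illustration under stated assumptions rather than the published implementation: the function names are hypothetical, "between A and B" is read as the half-open range [A, B), out-of-range addresses are clamped to the graph, and the parameters are taken as already cast to integers.

```python
def _clamp(index, length):
    """Keep an address inside the graph (assumed policy for out-of-range values)."""
    return max(0, min(index, length))


def delete_nodes(graph, x, p0, p1):
    """DEL: delete the nodes between (P0 + x) and (P0 + x + P1)."""
    start = _clamp(p0 + x, len(graph))
    end = max(start, _clamp(p0 + x + p1, len(graph)))  # a negative P1 deletes nothing
    return graph[:start] + graph[end:]


def duplicate_nodes(graph, x, p0, p1, p2):
    """DUP: copy the nodes between (P0 + x) and (P0 + x + P1), insert after (P0 + x + P2)."""
    start = _clamp(p0 + x, len(graph))
    end = max(start, _clamp(p0 + x + p1, len(graph)))
    target = _clamp(p0 + x + p2 + 1, len(graph))  # "after" position P0 + x + P2
    return graph[:target] + graph[start:end] + graph[target:]


phenotype = ["A", "B", "C", "D", "E"]
shorter = delete_nodes(phenotype, x=1, p0=1, p1=2)
longer = duplicate_nodes(phenotype, x=0, p0=0, p1=2, p2=1)
```

Both functions return a new list and leave the input untouched, which keeps the sketch easy to combine with a queue of pending operations.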


Iteration (i)    Function     Result
0                0+i          0
1                0+i          1
2                0+i+i        4
3                0+i+i+i      9
4                0+i+i+i+i    16
etc.

Table 6-2. Program that generates sequence of squares. The program was found by reverse engineering a SMCGP phenotype. i, the current iteration, is the only input to the program.
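The growth pattern in Table 6-2 can be reproduced with a few lines of code. This is only a sketch of the reverse-engineered behaviour, with hypothetical function names; it does not model the SMCGP graph itself, just the fact that one extra "+ i" term is present at each iteration after the first.

```python
def phenotype_expression(i):
    """Expression computed at iteration i, as listed in Table 6-2."""
    terms = max(1, i)  # iterations 0 and 1 both use a single "+i" term
    return "0" + "+i" * terms


def squares_program(i):
    """Evaluate the expression using addition only."""
    total = 0
    for _ in range(max(1, i)):
        total += i
    return total


for i in range(5):
    print(i, phenotype_expression(i), squares_program(i))
```

Adding i to itself i times gives i squared without ever using multiplication, which is why a program that can lengthen itself solves the task while a fixed expression over + and − cannot.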

are blurred, and the SMCGP phenotype itself could be viewed as a collection of cells. One way of viewing cells in SMCGP is to break the phenotype into ‘modules’ and then define these as the cells. In this way, SMCGP cells duplicate and differentiate using the various modifying functions. In a static program, this concept of cellularity does not exist.

To tackle the FF problem with SMCGP, we defined the target pattern to be a string of integers that could be visually interpreted as a French Flag pattern. In the CA model, the pattern would be taken as the output of the program at each cell. Here, since we can view SMCGP phenotypes as a collection of cells, we took the output pattern as the set of outputs from all the active (connected) nodes in the phenotype graph. The fitness of an individual is the number of elements of the target sequence that it reproduces correctly after a certain number of iterations. As the phenotype can change length when it is iterated, the number of active nodes can change and the length of the output pattern can also change. The value output by an active node depends on the calculation that it (and the nodes before it) performs, so the French Flag pattern is effectively a side effect of some mathematical expression. This approach was largely successful, but only in generating approximations to the flag. No exact solutions were found, which is similar to the findings for the CA solutions, where exact results are uncommon. The final task we explored in that first paper was generating parity circuits, a challenge we return to in the next section.

Digital Circuits

Digital circuits have often been studied in genetic programming (Koza, 1994b; Koza, 1992b), and some systems have been used to produce ‘general’ solutions (Huelsbergen, 1998; Wong and Leung, 1996; Wong, 2005). A general solution in this sense is a program that can output a digital circuit for an arbitrary number of inputs; for example, it may generate a parity circuit of any size (see footnote 1). Conveniently, many digital circuits are modular and hierarchical, and this fits the model of development that SMCGP implements.

In our first paper, we successfully produced parity circuits up to 8 inputs (Harding et al., 2007). We stopped at this size because, at the time, this was the maximum size we could find conventional CGP solutions for. In a subsequent paper (Harding et al., 2009a), we revisited the problem (using the latest version of SMCGP), and found that not only could we evolve larger parity circuits, but we could rapidly and consistently evolve provably general parity circuits. We used an incremental fitness function to find programs that on the first iteration would solve 2-input parity, then 3-input parity on the next iteration, and continue up to a maximum number of inputs. The fitness of an individual is the number of correct output bits over all iterations. To keep the computational costs down, we limited the evolution to 2 to 20 inputs, and then tested the final programs for generality by running up to 24 bits of input. We also stopped iterating a program as soon as it failed to produce all the output bits correctly for the current truth table. Not only did this reduce the computation time, but we hoped it would also help with producing generalized solutions. Our function set consisted of all the two-input Boolean functions and the self modifying functions.

In 251 evolutionary runs, the average numbers of evaluations required to solve the parity problems were as follows (number of inputs in parentheses): 1,429 (2), 4,013 (3), 43,817 (6), 82,936 (8), 107,586 (10), 110,216 (17). This incomplete list illustrates the trend in problem difficulty. We found that the number of evaluations stabilizes when the number of inputs is about 10. This is because, once evolution has solved the problem up to a given number of inputs, the solutions typically become generalized. We found that by the time that evolution had solved 5 inputs, more than half the solutions were generalizable up to 20 inputs, and by 10 inputs this was up to 90%. The percentage of runs that correctly computed even-parity 22 to 24 was approximately 96%. However, without analysis of the programs it was difficult to know whether they were truly general solutions.

The evolved programs can be relatively compact, especially when we place constraints on the initial size, the number of self modification operations allowed on the ToDo list and the overall length of the program. Figure 6-1 shows an example of an evolved program that generates parity circuits, which we were able to prove is a general solution to even-parity.
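The incremental fitness function with early stopping can be sketched as follows. The interface is an assumption made for illustration: `program_at(n)` is a hypothetical stand-in that returns the circuit produced by the phenotype at the iteration handling n inputs, as a callable from a tuple of bits to a Boolean. Even parity follows the definition in footnote 1.

```python
from itertools import product


def even_parity(bits):
    """True if an even number of the inputs are true (footnote 1)."""
    return sum(bits) % 2 == 0


def incremental_fitness(program_at, min_inputs=2, max_inputs=20):
    """Count correct output bits over successive iterations, stopping at the first failure."""
    fitness = 0
    for n in range(min_inputs, max_inputs + 1):
        circuit = program_at(n)  # hypothetical: circuit for n inputs
        rows = product((0, 1), repeat=n)  # every row of the n-input truth table
        correct = sum(circuit(bits) == even_parity(bits) for bits in rows)
        fitness += correct
        if correct < 2 ** n:  # an imperfect table cancels the rest of the evaluation
            break
    return fitness


def perfect(n):
    """A program that is correct for every n (for demonstration only)."""
    return even_parity


def only_two_inputs(n):
    """A program that is correct for n = 2 and outputs True otherwise."""
    return even_parity if n == 2 else (lambda bits: True)
```

Since the truth table doubles with each extra input, stopping at the first failed table avoids most of the cost of evaluating weak individuals. The default `max_inputs=20` mirrors the text; a much smaller value is advisable when experimenting, as exhaustive evaluation in pure Python is slow.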

1 An even parity circuit takes a set of binary inputs and outputs true if an even number of the inputs are true, and false otherwise.


Figure 6-1. An example of the development of a parity circuit. Each line shows the phenotype graph at a given time step. The first graph solves the 2-input parity, the second solves 3-input and continues to 7-bits. The graph has been tested to generalise through to 24 inputs. This pattern of growth is typical of the programs investigated.

In recent work (to be published in (Harding et al., 2010a)) we have also shown general solutions for the digital adder circuit. A digital adder circuit of size n adds two binary n bit numbers together. This problem is much more complicated than parity, as the number of inputs scales twice as fast (i.e. it has to produce 1 bit+1 bit, 2+2, 3+3) and the number of outputs also grows with the number of inputs.

Mathematical problems

SMCGP has been applied to a variety of mathematical problems (Harding et al., 2009c; Harding et al., 2010b). For the Fibonacci sequence, the fitness function is the number of correctly calculated Fibonacci numbers in a sequence of 50. The first two Fibonacci numbers are given as fixed inputs (these were 0 and 1). Thus the phenotypes are iterated 48 times. Evolved solutions were tested for generality by iterating up to 72 times (after which the numbers exceed the range of a long int). A success rate of 87.4% was achieved on 287 runs, and 94.5% of these correctly calculated the succeeding 24 Fibonacci numbers. We found that the average number of evaluations, 774,808, compared favourably with previously published methods and that the generalization rate was higher.

In the “list summation problem” we evolved programs that could sum an arbitrarily long list of numbers. At the n-th iteration, the evolved program should be able to take n inputs and compute the sum of all the inputs. We devised this problem because we thought it would be difficult for genetic programming


without the addition of an explicit summation command. Koza used a summation operator called SIGMA that repeatedly evaluates its sole input until a predefined termination condition is realised (Koza, 1992a). Input vectors consisted of random sequences of integers. The fitness is defined as the absolute cumulative error between the output of the program and the expected sum of the values. We evolved programs which were evaluated on input sequences of 2 to 10 numbers. The function set consisted of the self modifying functions and just the ADD operator. All 500 experiments were found to be successful, in that they evolved programs that could sum between 2 and 10 numbers (depending on how many times the program is iterated). On average it took 6,922 evaluations to solve this problem. After evolution, the best individual for each run was tested to see how well it generalized. This test involved summing a sequence of 100 numbers. It was found that 99.03% of solutions generalized. When conventional CGP was used, it could only sum up to 7 numbers.

We also studied how SMCGP performed on a “Powers Regression” problem. The task is to evolve a program that, depending on the iteration, approximates the expression x^n, where n is the iteration number. The fitness function applies x as integers from 0 to 20. The fitness is defined as the number of wrong outputs (i.e. lower is better). Programs were evolved to n = 10 and then tested for generality up to n = 20. As with many of the other experiments, the program is evolved with an incremental fitness function. We obtained 100% correct solutions (in 337 runs). The average number of evaluations was 869,699.

More recently we have looked at whether SMCGP could produce algorithms that can compute mathematical constants, like π and e, to arbitrary precision (Harding et al., 2010b). We were able to prove that two of the evolved formulae (one for π and one for e) rapidly converged to the constants in the limit of large iterations.
We consider this work to be significant as evolving provable mathematical results is a rarity in evolutionary computation. The fitness function was designed to produce a program where subsequent iterations of the program produced more accurate approximations to π or e. Programs were allowed to iterate for a maximum of 10 iterations. If the output after an iteration did not better approximate π, evaluation was stopped and a large fitness penalty applied. Note that it is possible that after the 10 iterations the output value diverges from the constant and the quality of the result would therefore worsen. We analyzed one of the solutions that accurately converges to π. It had the generating function:

$$
f(i) =
\begin{cases}
\cos(\sin(\cos(\sin(0)))) & i = 0 \\
f(i-1) + \sin(f(i-1)) & i > 0
\end{cases}
\tag{6.1}
$$
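A short numerical check of Equation 6.1 is given below as an illustration; the function name `f` simply follows the equation. The rapid convergence can be seen by writing f(i − 1) = π − ε: since sin(π − ε) = sin(ε) ≈ ε − ε³/6, the next value is f(i) ≈ π − ε³/6, so the error is roughly cubed at each step once the iterate is close to π.

```python
import math


def f(i):
    """Iterate the recurrence of Equation 6.1 up to step i."""
    value = math.cos(math.sin(math.cos(math.sin(0.0))))  # f(0)
    for _ in range(i):
        value = value + math.sin(value)  # f(i) = f(i-1) + sin(f(i-1))
    return value


for i in range(8):
    print(i, f(i), abs(math.pi - f(i)))
```

In double precision the printed error stops shrinking after a handful of iterations because the limit of floating point accuracy is reached; an arbitrary precision library would be needed to observe further digits.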


Equation 6.1 is a nonlinear recurrence relation and it can be proven formally that it is an exact solution in that it rapidly approaches π in the limit of large i. Using the same fitness function as with π, evolving solutions for e was found to be significantly harder. In our experiments we chose the initial genotype to have 20 nodes and the ToDo list length to be 2. This meant that only two SM functions were used in each phenotype. We allowed the iteration number it as the sole program input. Defining $x = 4^{it}$ and $y = 4x = 4^{it+1}$, we evolved the solution for the output, z, as

$$
z = \left(1 + \frac{1}{y}\right)^{y} \sqrt{1 + \frac{1}{y}}
\tag{6.2}
$$

Equation 6.2 tends to the form of a well-known Bernoulli formula:

$$
\lim_{y \to \infty} \left(1 + \frac{1}{y}\right)^{y}
\tag{6.3}
$$
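The evolved formula for e can be compared numerically with the plain Bernoulli expression. The sketch below assumes that Equation 6.2 reads z = (1 + 1/y)^y · sqrt(1 + 1/y) with y = 4^(it+1); the function names are hypothetical.

```python
import math


def evolved_z(it):
    """Equation 6.2 (as read above) for iteration number `it`."""
    y = 4.0 ** (it + 1)
    return (1.0 + 1.0 / y) ** y * math.sqrt(1.0 + 1.0 / y)


def bernoulli(it):
    """The Bernoulli expression (1 + 1/y)^y at the same y, for comparison."""
    y = 4.0 ** (it + 1)
    return (1.0 + 1.0 / y) ** y


for it in range(8):
    print(it, abs(math.e - evolved_z(it)), abs(math.e - bernoulli(it)))
```

Under this reading the extra square-root factor turns the exponent into y + 1/2, which cancels the leading 1/(2y) error term of the Bernoulli expression, so the evolved formula approaches e considerably faster for the same y.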

Evolving to Learn

In nature, we are used to the idea that plasticity (e.g., in the brain) can be used to learn during the lifetime of an organism. In the brain, the ‘self-modification rules’ are ultimately encoded in the genome. In (Harding et al., 2009b), we set out to use SMCGP to evolve a learning algorithm that could act on itself. The basic question was whether SMCGP can evolve a program that can learn, during the development phase, how to perform a given task. We chose the task of getting the same phenotype to learn all possible 2-input Boolean truth tables. We took 16 copies of the same phenotype and tried to train each copy on a different truth table, with the fitness being how well the programs (after the learning phase) did at calculating the correct value based on a pair of inputs.

In SMCGP, the activation of a self modifying node is dependent on the values that it reads as inputs. Combined with the various mathematical operators, this allows the phenotype to develop differently in the presence of different sets of inputs. To support the mathematical operators, the Boolean tables were represented (and interpreted) as numbers, with -1.0 being false and +1.0 being true.

Figure 6-2 illustrates the process. The evolved genotype (a) is copied into phenotype space (b) where it can be executed. The phenotype is allowed to develop for a number of iterations (c). The number of iterations is defined by a special gene in the genotype. Copies of the developed phenotype are made (d) and each copy is assigned a different truth table to learn. The test set data is applied (e) as described in the following section. After learning (f) the phenotype can now be tested, and its fitness found. During (f), the individual is treated as a static individual and is no longer allowed to modify itself. This fixed program is then tested for accuracy, and its fitness used as a component in the final fitness score of that individual.

Figure 6-2. Fitness function flow chart, as described in section 3.

During the fitness evaluation stage, each row of the truth table is presented to a copy of the evolved phenotype (Figure 6-2.e). During this presentation, the error between the expected and actual output is fed back into the SMCGP program, in order to provide some sort of feedback. Full details of how this was implemented can be found in (Harding et al., 2009b). During fitness calculation, we tested all 16 tables. However, we split the tables into two sets, one for deriving the fitness score (12 tables) and the other for a validation score (4 tables). It was found that 16% of experimental runs were successfully able to produce programs that correctly learned the 12 tables. None of the evolved programs was able to generalize to learn all the unseen truth tables. However, the system did come close with the best result having only 2 errors (out of a possible 16). Figure 6-3 shows the form of the final phenotypes for the programs for each of the fitness truth tables. We can see both modularity and a high degree of variation - with the graphs for each table looking quite different from one another. This is in contrast to previous examples, such as the parity circuits, where we generally only see regular forms.
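The set of learning tasks can be enumerated in a few lines. This sketch only builds the 16 two-input truth tables in the -1.0 / +1.0 encoding and splits them 12 to 4. Which tables went into which set is not stated here, so the split by position below is an arbitrary assumption, and the names are hypothetical.

```python
from itertools import product

FALSE, TRUE = -1.0, 1.0

# The four input rows of a 2-input truth table, encoded numerically.
INPUT_ROWS = list(product((FALSE, TRUE), repeat=2))


def all_truth_tables():
    """Every mapping from the four input rows to an output: 2**4 = 16 tables."""
    tables = []
    for outputs in product((FALSE, TRUE), repeat=len(INPUT_ROWS)):
        tables.append(dict(zip(INPUT_ROWS, outputs)))
    return tables


tables = all_truth_tables()
# Assumption: an arbitrary positional split; the original assignment is not given.
fitness_tables, validation_tables = tables[:12], tables[12:]
```

Each dictionary maps an input pair such as (-1.0, 1.0) to the expected output, which is the form in which a row and its target value would be presented to a copy of the phenotype.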

4. Iteration in SMCGP and GP

One of the unique properties of SMCGP is how it handles iteration. Iteration is not new in genetic programming and there are several different forms. The most obvious form of GP with iteration is Linear Genetic Programming (LGP), where evolved programs can execute inside a kind of virtual machine in which the program counter can be modified using jump operations. LGP operates on registers (as in a CPU), and uses this memory to store state between iterations of the same section of program. It is also worth noting that in LGP sub-sections


Figure 6-3. Phenotypes for each of the tables learned during evolution.

of code are executed repeatedly. This is different from most implementations of tree-based GP (and we restrict our discussion to the simple, common varieties found in the literature), as the tree represents an expression, and so any iteration has to be applied externally. Tree-based GP also typically does not have a concept of working registers to store state between iterations, so these must be added to the function set, or previous state information passed back via the tree’s inputs. Tree-based GP normally only has one output, and no intermediate state information. So additional mechanisms would be required to select what information to store and pass to subsequent iterations. In LGP termination can be controlled by the evolved program itself, whereas with external iteration another mechanism needs to be defined, perhaps by enforcing a limit to the number of iterations or some form of conditional.

SMCGP handles its iteration in a very different manner. It can be viewed as something analogous to loop-unrolling in a compiler, whereby the contents of the loop are explicitly rewritten a number of times. In SMCGP, the duplication operator unrolls the phenotype. State information is passed between iterations by the connections made in the duplicated blocks. In compilers, unrolling is done for program efficiency and is typically only done for small loops. In SMCGP, if the unrolling is excessive it will exceed the maximum permitted phenotype length. We speculate that this may help to evolve more efficient modularization. Because the activation of self modifying functions is determined by both the size of the ToDo list and the inputs to self modifying nodes, it is possible for SMCGP to self-limit when sections of code should be unrolled.

SMCGP’s unrolling also has the possibility to grow exponentially, which forms a different kind of loop. For example, imagine a duplication operator that copied every node to its left and inserted it before itself, e.g. NODE0 NODE1 DUPLICATE. On the next iteration it would produce NODE0 NODE1 NODE0 NODE1 DUPLICATE, then NODE0 NODE1 NODE0 NODE1 NODE0 NODE1 NODE0 NODE1 DUPLICATE and so on. Hence the program length almost doubles at each iteration. Similarly, the arguments for the duplication operation may only replicate part of the previously inserted module, so the phenotype would grow at a different, smaller rate each time. Other growth progressions are also possible, especially when several duplication-style operators are at work on the same section of phenotype. This makes the iteration capabilities of SMCGP very rich and implies that it can also do a form of recursion unrolling, removing the need for explicit procedures in a similar way to the lack of need for loop instructions.
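The exponential unrolling example can be traced with a small sketch. The phenotype is modelled as a list of labels and `unroll_once` is a hypothetical stand-in for one application of the duplication operator described above; it does not model connections or the ToDo list.

```python
def unroll_once(phenotype):
    """Copy every node to the left of DUPLICATE and insert the copy before it."""
    position = phenotype.index("DUPLICATE")
    left = phenotype[:position]
    return left + left + phenotype[position:]


phenotype = ["NODE0", "NODE1", "DUPLICATE"]
lengths = [len(phenotype)]
for _ in range(3):
    phenotype = unroll_once(phenotype)
    lengths.append(len(phenotype))
print(lengths)
```

If L is the current length, the next length is 2(L − 1) + 1 = 2L − 1, which is the "almost doubles" behaviour noted in the text, and shows how quickly such an operator would reach a maximum permitted phenotype length.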

5. Conclusions and Further Work

Self modification in Genetic Programming seems to be a useful property. With SMCGP we have shown that the implementation of such a system can be relatively straightforward, and that very good results can be achieved. In upcoming work, we will be demonstrating SMCGP on several other problems including generalized digital adders and a structural design problem. Here we have discussed problems that require some sort of developmental process, as the problems require a scaling ability. One benefit of SMCGP is that if the problem does not need self modification, evolution can stop using it. When this happens, the representation reverts to something similar to classical CGP. In (Harding et al., 2009c), we showed that on a bio-informatics classification problem where there should be no benefit in using self modification, SMCGP behaved similarly to CGP. This result lets us be confident that in future work we can use SMCGP by default and automatically gain any advantages that development might bring.

The SMCGP representation has changed over time, whilst maintaining the same design philosophy. In future work we will consider other variants as well. Currently we are investigating ways to simplify the genotype to make it easier for humans to understand. This should allow us to prove general cases more easily, and perhaps explain how processes like the evolved learning algorithm function. A whole world of self modifying systems seems to have become available now that the principle has been shown to work successfully. We plan to investigate this world further and also encourage others to consider self modification in their systems.

6. Acknowledgments

Funding from NSERC under discovery grant RGPIN 283304-07 to W.B. is gratefully acknowledged. S.H. was supported by an ACENET fellowship.


References

Banzhaf, W. and Miller, J. F. (2004). The Challenge of Complexity. Kluwer Academic.
Harding, S., Miller, J. F., and Banzhaf, W. (2009a). Self modifying cartesian genetic programming: Parity. In Tyrrell, Andy, editor, 2009 IEEE Congress on Evolutionary Computation, pages 285–292, Trondheim, Norway. IEEE Computational Intelligence Society, IEEE Press.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2009b). Evolution, development and learning with self modifying cartesian genetic programming. In GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 699–706, New York, NY, USA. ACM.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2010a). Developments in cartesian genetic programming: Self-modifying cgp. To be published in Genetic Programming and Evolvable Machines.
Harding, Simon, Miller, Julian F., and Banzhaf, Wolfgang (2010b). Self modifying cartesian genetic programming: Finding algorithms that calculate pi and e to arbitrary precision. In Genetic and Evolutionary Computation Conference, GECCO 2010. Accepted for publication.
Harding, Simon, Miller, Julian Francis, and Banzhaf, Wolfgang (2009c). Self modifying cartesian genetic programming: Fibonacci, squares, regression and summing. In Vanneschi, Leonardo, Gustafson, Steven, et al., editors, Genetic Programming, 12th European Conference, EuroGP 2009, Tübingen, Germany, April 15-17, 2009, Proceedings, volume 5481 of Lecture Notes in Computer Science, pages 133–144. Springer.
Harding, Simon L., Miller, Julian F., and Banzhaf, Wolfgang (2007). Self-modifying cartesian genetic programming. In Thierens, Dirk, Beyer, Hans-Georg, Bongard, Josh, Branke, Jurgen, Clark, John Andrew, Cliff, Dave, Congdon, Clare Bates, Deb, Kalyanmoy, Doerr, Benjamin, Kovacs, Tim, Kumar, Sanjeev, Miller, Julian F., Moore, Jason, Neumann, Frank, Pelikan, Martin, Poli, Riccardo, Sastry, Kumara, Stanley, Kenneth Owen, Stutzle, Thomas, Watson, Richard A, and Wegener, Ingo, editors, GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, volume 1, pages 1021–1028, London. ACM Press.
Huelsbergen, Lorenz (1998). Finding general solutions to the parity problem by evolving machine-language representations. In Koza, John R., Banzhaf, Wolfgang, Chellapilla, Kumar, Deb, Kalyanmoy, Dorigo, Marco, Fogel, David B., Garzon, Max H., Goldberg, David E., Iba, Hitoshi, and Riolo, Rick, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 158–166, University of Wisconsin, Madison, Wisconsin, USA. Morgan Kaufmann.


Khan, G.M., Miller, J.F., and Halliday, D.M. (2007). Coevolution of intelligent agents using cartesian genetic programming. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 269–276.
Koza, J. R. (1994a). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press.
Koza, John R. (1992a). A genetic approach to the truck backer upper problem and the inter-twined spiral problem. In Proceedings of IJCNN International Joint Conference on Neural Networks, volume IV, pages 310–318. IEEE Press.
Koza, John R. (1994b). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge Massachusetts.
Koza, J.R. (1992b). Genetic Programming: On the Programming of Computers by Natural Selection. MIT Press, Cambridge, Massachusetts, USA.
Miller, J. F. and Smith, S. L. (2006). Redundancy and computational efficiency in cartesian genetic programming. In IEEE Transactions on Evolutionary Computation, volume 10, pages 167–174.
Miller, Julian F. (2003). Evolving developmental programs for adaptation, morphogenesis, and self-repair. In Banzhaf, Wolfgang, Christaller, Thomas, Dittrich, Peter, Kim, Jan T., and Ziegler, Jens, editors, Advances in Artificial Life. 7th European Conference on Artificial Life, volume 2801 of Lecture Notes in Artificial Intelligence, pages 256–265, Dortmund, Germany. Springer.
Miller, Julian F. and Banzhaf, Wolfgang (2003). Evolving the program for a cell: from french flags to boolean circuits. In Kumar, Sanjeev and Bentley, Peter J., editors, On Growth, Form and Computers. Academic Press.
Miller, Julian F. and Thomson, Peter (2000). Cartesian genetic programming. In Poli, Riccardo, Banzhaf, Wolfgang, Langdon, William B., Miller, Julian F., Nordin, Peter, and Fogarty, Terence C., editors, Genetic Programming, Proceedings of EuroGP’2000, volume 1802 of LNCS, pages 121–132, Edinburgh. Springer-Verlag.
Miller, Julian F. and Thomson, Peter (2003). A developmental method for growing graphs and circuits. In Proceedings of the 5th International Conference on Evolvable Systems: From Biology to Hardware, volume 2606 of Lecture Notes in Computer Science, pages 93–104. Springer.
Miller, Julian Francis (2004). Evolving a self-repairing, self-regulating, french flag organism. In Deb, Kalyanmoy, Poli, Riccardo, Banzhaf, Wolfgang, Beyer, Hans-Georg, Burke, Edmund K., Darwen, Paul J., Dasgupta, Dipankar, Floreano, Dario, Foster, James A., Harman, Mark, Holland, Owen, Lanzi, Pier Luca, Spector, Lee, Tettamanzi, Andrea, Thierens, Dirk, and Tyrrell, Andrew M., editors, GECCO (1), volume 3102 of Lecture Notes in Computer Science, pages 129–139. Springer.


Spector, L. and Robinson, A. (2002). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3:7–40.
Spector, Lee and Stoffel, Kilian (1996). Ontogenetic programming. In Koza, John R., Goldberg, David E., Fogel, David B., and Riolo, Rick L., editors, Genetic Programming 1996: Proceedings of the First Annual Conference, pages 394–399, Stanford University, CA, USA. MIT Press.
Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127.
Vassilev, Vesselin K. and Miller, Julian F. (2000). The advantages of landscape neutrality in digital circuit evolution. In Proceedings of the Third International Conference on Evolvable Systems, pages 252–263. Springer-Verlag.
Wong, Man Leung (2005). Evolving recursive programs by using adaptive grammar based genetic programming. Genetic Programming and Evolvable Machines, 6(4):421–455.
Wong, Man Leung and Leung, Kwong Sak (1996). Evolving recursive functions for the even-parity problem using genetic programming. In Angeline, Peter J. and Kinnear, Jr., K. E., editors, Advances in Genetic Programming 2, chapter 11, pages 221–240. MIT Press, Cambridge, MA, USA.
Yu, Tina and Miller, Julian (2001). Neutrality and the evolvability of boolean function landscape. In Miller, Julian F., Tomassini, Marco, Lanzi, Pier Luca, Ryan, Conor, Tettamanzi, Andrea G. B., and Langdon, William B., editors, Genetic Programming, Proceedings of EuroGP’2001, volume 2038 of LNCS, pages 204–217, Lake Como, Italy. Springer-Verlag.

Chapter 7

ABSTRACT EXPRESSION GRAMMAR SYMBOLIC REGRESSION

Michael F. Korns¹

1 Korns Associates, 1 Plum Hollow, Henderson, Nevada 89052 USA.

Abstract

This chapter examines the use of Abstract Expression Grammars to perform the entire Symbolic Regression process without the use of Genetic Programming per se. The techniques explored produce a symbolic regression engine which has absolutely no bloat, which allows total user control of the search space and output formulas, and which is faster and more accurate than the engines produced in our previous papers using Genetic Programming. The genome is an all vector structure with four chromosomes plus additional epigenetic and constraint vectors, allowing total user control of the search space and the final output formulas. A combination of specialized compiler techniques, genetic algorithms, particle swarm, age-layered populations, plus discrete and continuous differential evolution is used to produce an improved symbolic regression system. Nine base test cases, from the literature, are used to test the improvement in speed and accuracy. The improved results indicate that these techniques move us a big step closer toward future industrial strength symbolic regression systems.

Keywords:

abstract expression grammars, differential evolution, grammar template genetic programming, genetic algorithms, particle swarm, symbolic regression.

1. Introduction

This chapter examines techniques for improving symbolic regression systems with the aim of achieving entry-level industrial strength. In previous papers (Korns, 2006; Korns, 2007; Korns and Nunez, 2008; Korns, 2009), our pursuit of industrial scale performance on large-scale symbolic regression problems required us to reexamine many commonly held beliefs, to borrow a number of techniques from disparate schools of genetic programming, and to recombine them in ways not normally seen in the published literature. The techniques of abstract expression grammars were developed, but explored only tangentially.

R. Riolo et al. (eds.), Genetic Programming Theory and Practice VIII, DOI 10.1007/978-1-4419-7747-2_7, © Springer Science+Business Media, LLC 2011


While the techniques described in detail in (Korns, 2009) produce a symbolic regression system of breadth and strength, several issues keep an industrial strength symbolic regression system tantalizingly out of reach: lack of user control of the search space, bloated and unreadable output formulas, limited accuracy, and slow convergence speed. In this chapter abstract expression grammars become the main focus and are promoted as the sole means of performing symbolic regression.

Using the nine base test cases from (Korns, 2007) as a training set, to test for improvements in accuracy, we constructed our symbolic regression system using these important techniques:

    Abstract expression grammars
    Universal abstract goal expression
    Standard single point vector-based mutation
    Standard two point vector-based cross over
    Continuous vector differential evolution
    Discrete vector differential evolution
    Continuous particle swarm evolution
    Pessimal vertical slicing and out-of-sample scoring during training
    Age-layered populations
    User defined epigenetic factors
    User defined constraints

For purposes of comparison, all results in this paper were achieved on two workstation computers, specifically an Intel® Core™ 2 Duo Processor T7200 (2.00GHz/667MHz/4MB) and a Dual-Core AMD Opteron™ Processor 8214 (2.21GHz), running our Analytic Information Server software. This software generates Lisp agents that compile to use the on-board Intel registers and on-chip vector processing capabilities so as to maximize execution speed; details can be found at www.korns.com/Document Lisp Language Guide.html. Furthermore, our Analytic Information Server is available in an open source software project at aiserver.sourceforge.net.

Testing Regimen and Fitness Measure

Our testing regimen uses only statistical best-practice, out-of-sample testing techniques. We test each of the nine test cases on matrices of 10000 rows (samples) by 5 columns (inputs) with no noise, and on matrices of 10000 rows by 20 columns with 40% noise, before drawing any conclusions. Taken together, these combinations create a total of 18 separate test cases. For each test a training matrix is filled with random numbers between -50 and +50. The target expression for the test case is applied to the training matrix to compute the dependent variable, and the required noise is added. The symbolic regression system is trained on the training matrix to produce the regression champion. Following training, a testing matrix is filled with random numbers between -50 and +50. The target expression for the test case is applied to the testing matrix to compute the dependent variable, and the required noise is added. The regression champion is evaluated on the testing matrix for all scoring (i.e. out-of-sample testing).

Our two fitness measures are described in detail in (Korns, 2009). The first is a standard least squared error (LSE) which is normalized by dividing LSE by the standard deviation of Y (the dependent variable). This normalization allows us to meaningfully compare the normalized least squared error (NLSE) between different problems. In addition we construct a fitness measure known as tail classification error (TCE), which measures how well the regression champion classifies the top 10% and bottom 10% of the data set. A TCE score of less than 0.20 is excellent, a TCE score of less than 0.30 is good, and a TCE of 0.30 or greater is poor.

Table 7-1. Results for 10K rows by 5 columns with no random noise.

Test     Minutes  Train-NLSE  Train-TCE  Test-NLSE  Test-TCE
linear         1        0.00       0.00       0.00      0.00
cubic          1        0.00       0.00       0.00      0.00
cross        145        0.00       0.00       0.00      0.00
elipse         1        0.00       0.00       0.00      0.00
hidden         3        0.00       0.00       0.00      0.00
cyclic         1        0.02       0.00       0.00      0.00
hyper         65        0.17       0.00       0.17      0.00
mixed        233        0.94       0.32       0.95      0.32
ratio        229        0.94       0.33       0.94      0.32
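The testing regimen and the two fitness measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact definitions are given in (Korns, 2009), which is not reproduced here, so the sketch assumes that LSE means the root-mean-square error and that TCE is the fraction of rows the champion places in its top or bottom 10% that do not belong to the true top or bottom 10%. The target and champion expressions, and the names make_matrix, nlse, and tce, are hypothetical.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Fill a rows x cols matrix with uniform random numbers between -50 and +50."""
    return [[rng.uniform(-50.0, 50.0) for _ in range(cols)] for _ in range(rows)]

def nlse(y_true, y_pred):
    """ASSUMPTION: NLSE = root-mean-square error / standard deviation of Y."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in y_true) / n)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)
    return rmse / std_y

def tce(y_true, y_pred, tail=0.10):
    """ASSUMPTION: share of rows in the predicted tails that miss the true tails."""
    n = len(y_true)
    k = max(1, int(n * tail))
    by_true = sorted(range(n), key=lambda i: y_true[i])
    by_pred = sorted(range(n), key=lambda i: y_pred[i])
    true_bottom, true_top = set(by_true[:k]), set(by_true[-k:])
    misses = sum(1 for i in by_pred[:k] if i not in true_bottom)
    misses += sum(1 for i in by_pred[-k:] if i not in true_top)
    return misses / (2 * k)

# Out-of-sample scoring on a freshly generated testing matrix.
rng = random.Random(0)
X = make_matrix(1000, 5, rng)
target = lambda row: 1.5 * row[0] - 2.0 * row[3]   # hypothetical target expression
champion = lambda row: 1.5 * row[0]                # hypothetical, imperfect champion
Y = [target(row) for row in X]
Y_hat = [champion(row) for row in X]
print(round(nlse(Y, Y_hat), 3), round(tce(Y, Y_hat), 3))
```

Under these assumptions a perfect champion scores 0.0 on both measures, and a champion that always predicts the mean of Y scores an NLSE of 1.0, which is why an NLSE near 1.0 is read as poor in the tables.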

2. Previous Results on Nine Base Problems

The previously published results (Korns, 2009) of training on the nine base training models on 10,000 rows and five columns with no random noise and only 20 generations allowed are shown in Table 7-1¹. In general, training time is very reasonable given the difficulty of some of the problems and the limited number of training generations allowed. Average percent error performance varies from excellent to poor, with the linear and cubic problems showing the best performance. Minimal differences between training error and testing error in the mixed and ratio problems suggest no over-fitting.

1 The nine base test cases are described in detail in (Korns, 2007).

Table 7-2. Results for 10K rows by 20 columns with 40% random noise.

Test     Minutes  Train-NLSE  Train-TCE  Test-NLSE  Test-TCE
linear        82        0.11       0.00       0.11      0.00
cubic         59        0.11       0.00       0.11      0.00
cross        127        0.87       0.25       0.93      0.32
elipse       162        0.42       0.04       0.43      0.04
hidden       210        0.11       0.02       0.11      0.02
cyclic       233        0.39       0.11       0.35      0.12
hyper        163        0.48       0.06       0.50      0.07
mixed        206        0.90       0.27       0.94      0.32
ratio        224        0.90       0.26       0.95      0.33

Surprisingly, long and short classification is fairly robust in most cases, including the very difficult ratio and mixed test cases. The salient observation is the relative ease of classification compared to regression, even in problems with this much noise. In some of the test cases, testing NLSE is either close to or exceeds the standard deviation of Y (not very good); however, in many of the test cases classification is below 0.20 (very good).

The previously published results (Korns, 2009) of training on the nine base training models on 10,000 rows and twenty columns with 40% random noise and only 20 generations allowed are shown in Table 7-2. Clearly the previous symbolic regression system performs most poorly on the test cases mixed and ratio, which have conditional target expressions. There is no evidence of over-fitting, as shown by the minimal differences between training error and testing error. In addition, the testing TCE is relatively good in both the mixed and ratio test cases.

Taken together, these scores portray a symbolic regression system which is ready to handle some industrial strength problems except for a few serious issues. The output formulas are often so bloated with intron expressions that they are practically unreadable by humans. This seriously limits the acceptance of the symbolic regression system for many industrial applications. There is no user control of the search space, which makes the system impractical for most specialty applications. And of course we would love to see additional speed and accuracy improvements, because industry is insatiable on those features.

A new architecture which will completely eliminate bloat, allow total user control over the search space and the final output formulas, improve our regression scores on the two conditional base test cases, and deliver an increase in learning speed is the subject of the remainder of this chapter.

3. New System Architecture

Our new symbolic regression system architecture is based entirely upon an Abstract Expression Grammar foundation. A single abstract expression, called the goal expression, defines the search space during each symbolic regression run. The objective of a symbolic regression run is to optimize the goal expression. An example of a goal expression is:

    y = f0(c0*x5)+(f1(c1)/(v0+3.14))

As described in detail in (Korns, 2009), the expression elements f0, f1, *, +, and / are abstract and concrete functions (operators). The elements v0 and x5 are abstract and concrete features. The elements c0, c1, and 3.14 are abstract and concrete real constants. Since the goal expression is abstract, there are many possible concrete solutions.

    y = f0(c0*x5)+(f1(c1)/(v0+3.14))            (...to be solved...)
    y = sin(-1.45*x5)+(log(22.56)/(x4+3.14))    (...possible solution...)
    y = exp(38.16*x5)+(tan(-8.41)/(x0+3.14))    (...possible solution...)
    y = square(-0.16*x5)+(cos(317.1)/(x9+3.14)) (...possible solution...)

The objective of symbolic regression is to find an optimal concrete solution to the abstract goal expression. In our architecture, each individual solution to the goal expression is implemented as a set of vectors containing the solution values for each abstract function, feature, and constant present in the goal expression. This allows the system to be based upon an all vector genome, which is convenient for genetic algorithm, particle swarm, and differential evolution styled population operators. In addition to the regular vector chromosomes providing solutions to the goal expression, epigenetic wrappers and constraint vectors provide an important degree of control over the search process and will be discussed in detail later in this chapter. Taken all together, our new symbolic regression system is based upon the following genome.

    Genome with four chromosome vectors
    Each chromosome has an epigenetic wrapper
    There are two user constraint vectors

The new system is constructed using these important techniques.

    Universal abstract goal expression
    Standard single point vector-based mutation
    Standard two point vector-based cross over
    Continuous vector differential evolution
    Discrete vector differential evolution
    Continuous particle swarm evolution
    Pessimal vertical slicing and out-of-sample scoring during training
    Age-layered populations
    User defined epigenetic factors
    User defined constraints

The universal abstract goal expression allows the system to be used for general symbolic regression and will be discussed in detail later in this chapter. Both single point vector-based mutation and two point vector-based cross over are discussed in (Man et al., 1999). Continuous and discrete vector differential evolution are discussed in (Price et al., 2005). Continuous particle swarm evolution is discussed in (Eberhart et al., 2001). Pessimal vertical slicing is discussed in (Korns, 2009). Age-layered populations are discussed in (Hornby, 2006) and (Korns, 2009). User defined epigenetic factors and user defined constraints will be discussed in detail later in this chapter. However, before proceeding to discuss the details of the system implementation, we will review abstract expression grammars as discussed in detail in (Korns, 2009).
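Because every solution is a set of plain vectors, the standard vector operators listed above apply directly. The following Python sketch illustrates single point mutation and two point cross over on one chromosome. It is not the Analytic Information Server implementation; the Genome layout (one vector each for functions, features, constants, and terms) is an assumption drawn from the review that follows, and all names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Genome:
    """ASSUMED layout of the four chromosome vectors of one individual."""
    F: list  # abstract function choices (integers)
    V: list  # abstract feature choices (integers)
    C: list  # abstract constant values (reals)
    T: list  # abstract term choices (0 or 1)

def point_mutate(vec, choices, rng):
    """Single point mutation: overwrite one randomly chosen gene."""
    child = list(vec)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child

def two_point_crossover(mom, dad, rng):
    """Two point cross over: keep mom, splice in dad's genes between two cut points."""
    i, j = sorted(rng.sample(range(len(mom) + 1), 2))
    return mom[:i] + dad[i:j] + mom[j:]

rng = random.Random(42)
mom = Genome(F=[1, 8, 2, 4], V=[2, 4, 1, 6], C=[45.3], T=[0, 1])
dad = Genome(F=[3, 3, 9, 1], V=[5, 5, 2, 2], C=[-1.45], T=[1, 1])
child_F = two_point_crossover(mom.F, dad.F, rng)          # recombine function genes
child_V = point_mutate(mom.V, choices=range(1, 11), rng=rng)  # ten features assumed
print(child_F, child_V)
```

Both operators return new lists and leave the parents untouched, which keeps every individual the same shape, a property the later sections rely on.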

Review of Abstract Expression Grammars

The simple concrete expression grammar we use in our symbolic regression system is a C-like functional grammar with the following basic elements.

    Real Numbers: 3.45, -.0982, 100.389, and all other real constants.
    Row Features: x1, x2, x9, and all other features.
    Binary Operators: +, *, /, %, max(), min(), mod()
    Unary Operators: sqrt(), square(), cube(), abs(), sign(), sigmoid()
    Unary Operators: cos(), sin(), tan(), tanh(), log(), exp()
    Relational Operators:
    Conditional Operator: (expr < expr) ? expr : expr
    Colon Operator: expr : expr
    noop Operator: noop()

Our numeric expressions are C-like, containing the elements shown above and surrounded by regression commands such as regress(), svm(), etc. Currently we support univariate regression, multivariate regression, and support vector regression. Our conditional expression operator (...) ? (...) : (...) is the C-like conditional operator where the ? and : operators always come in tandem. Our noop operator is an idempotent which simply returns its first argument regardless of the number of arguments: noop(x7,x6/2.1) = x7. Our basic expression grammar is functional in nature; therefore all operators are viewed grammatically as function calls. Our symbolic regression system creates its regression champion using evolution, but the final regression champion will be a compilation of a basic concrete expression such as:

    (E1): f = (log(x3)/sin(x2*45.3))>x4 ? tan(x6) : cos(x3)


Computing an NLSE score for f requires only a single pass over every row of X and results in an attribute being added to f by executing the “score” method compiled into f as follows.

    f.NLSE = f.score(X,Y).

Suppose that we are satisfied with the form of the expression in (E1), but we are not sure that the real constant 45.3 is optimal. We can enhance our symbolic regression system with the ability to optimize individual real constants by adding abstract constant rules to our built-in algebraic expression grammar.

    Abstract Constants: c1, c2, and c10

Abstract constants represent placeholders for real numbers which are to be optimized by the symbolic regression system. To further optimize f we would alter the expression in (E1) as follows.

    (E2): f = (log(x3)/sin(x2*c1))>x4 ? tan(x6) : cos(x3)

The compiler adds a new real number vector attribute, C, to f such that f.C has as many elements as there are abstract constants in (E2). Optimizing this version of f requires that the built-in “score” method compiled into f be changed from a single pass to a multiple pass algorithm in which the real number values in the abstract constant vector, f.C, are iterated until the expression in (E2) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract constant vector, f.C, to optimal real number choices. Clearly the particle swarm (Eberhart et al., 2001) and differential evolution algorithms provide excellent candidate algorithms for optimizing f.C, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

Summarizing, we have a new grammar term, c1, which is a reference to the 1st element of the real number vector, f.C (in C language syntax c1 == f.C[1]). The f.C vector is optimized by scoring f, then altering the values in f.C, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the regression champion agent in (E2) is optimized with:

    f.C == < 45.396 >

Then the optimized regression champion agent in (E2) has a concrete conversion counterpart as follows:

    f = (log(x3)/sin(x2*45.396))>x4 ? tan(x6) : cos(x3)

Suppose that we are satisfied with the form of the expression in (E1), but we are not sure that the features x2, x3, and x6 are optimal choices. We can enhance our symbolic regression system with the ability to optimize individual features by adding abstract feature rules to our built-in algebraic expression grammar.

    Abstract Features: v1, v2, and v10

Abstract features represent placeholders for features which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E1) as follows.

    (E3): f = (log(v1)/sin(v2*45.3))>v3 ? tan(v4) : cos(v1)

The compiler adds a new integer vector attribute, V, to f such that f.V has as many elements as there are abstract features in (E3). Each integer element in the f.V vector is constrained between 1 and M, and represents a choice of feature (in x). Optimizing this version of f requires that the built-in “score” method compiled into f be changed from a single pass to a multiple pass algorithm in which the integer values in the abstract feature vector, f.V, are iterated until the expression in (E3) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract feature vector, f.V, to integer choices selecting optimal features (in x). Clearly the genetic algorithm (Man et al., 1999), discrete particle swarm (Eberhart et al., 2001), and discrete differential evolution (Price et al., 2005) algorithms provide excellent candidate algorithms for optimizing f.V, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

The f.V vector is optimized by scoring f, then altering the values in f.V, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the regression champion agent in (E3) is optimized with:

    f.V == < 2, 4, 1, 6 >

Then the optimized regression champion agent in (E3) has a concrete conversion counterpart as follows:

    f = (log(x2)/sin(x4*45.3))>x1 ? tan(x6) : cos(x2)
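The change from a single pass to a multiple pass score method can be illustrated with a small Python sketch. To keep it short and numerically safe it uses a much simpler abstract expression than (E2) or (E3), namely f = c1 * v1, and it replaces the particle swarm, differential evolution, and genetic algorithm optimizers named above with two deliberately simple stand-ins: an exhaustive loop over the feature choice and a step-halving pattern search over the constant. The class Agent, the function nlse (assumed here to be RMS error divided by the standard deviation of Y), and the target expression are all hypothetical.

```python
import math
import random

def nlse(y_true, y_pred):
    """ASSUMPTION: NLSE = root-mean-square error / standard deviation of Y."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in y_true) / n)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n) / std_y

class Agent:
    """Hypothetical stand-in for a compiled agent with abstract expression f = c1 * v1."""

    def __init__(self, num_features):
        self.M = num_features
        self.V = [1]      # f.V: feature choice for v1, constrained between 1 and M
        self.C = [0.0]    # f.C: real value for c1
        self.NLSE = float("inf")

    def single_pass(self, X, Y):
        v, c = self.V[0], self.C[0]
        return nlse(Y, [c * row[v - 1] for row in X])

    def _tune_constant(self, X, Y, passes):
        """Pattern search on c1: try c1 +/- step, halve the step when neither helps."""
        self.C = [0.0]
        err, step = self.single_pass(X, Y), 16.0
        for _ in range(passes):
            base, improved = self.C[0], False
            for cand in (base - step, base + step):
                self.C = [cand]
                trial = self.single_pass(X, Y)
                if trial < err:
                    err, improved = trial, True
                    break
                self.C = [base]
            if not improved:
                step /= 2.0
        return err

    def score(self, X, Y, passes=60):
        """Multiple pass scoring; leaves the best choices in f.V and f.C as a side effect."""
        best = (float("inf"), None, None)
        for v in range(1, self.M + 1):      # exhaustive stand-in for a search over f.V
            self.V = [v]
            err = self._tune_constant(X, Y, passes)
            if err < best[0]:
                best = (err, list(self.V), list(self.C))
        self.NLSE, self.V, self.C = best
        return self.NLSE

rng = random.Random(1)
X = [[rng.uniform(-50.0, 50.0) for _ in range(3)] for _ in range(200)]
Y = [2.5 * row[1] for row in X]            # hypothetical target: y = 2.5 * x2
f = Agent(num_features=3)
f.NLSE = f.score(X, Y)
print(f.V, f.C, f.NLSE)
```

The point of the sketch is the side effect described in the text: after f.score(X,Y) returns, f.V and f.C hold the choices that produced the reported NLSE, so the concrete counterpart of the abstract expression can be read directly from the two vectors.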


Similarly, we can enhance our nonlinear regression system with the ability to optimize individual functions by adding abstract function rules to our built-in algebraic expression grammar.

    Abstract Functions: f1, f2, and f10

Abstract functions represent placeholders for built-in functions which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E1) as follows.

    (E4): f = (f1(x3)/f2(x2*45.3))>x4 ? f3(x6) : f4(x3)

The compiler adds a new integer vector attribute, F, to f such that f.F has as many elements as there are abstract functions in (E4). Each integer element in the f.F vector is constrained between 1 and the number of built-in functions available in the expression grammar, and represents a choice of built-in function. Optimizing this version of f requires that the built-in “score” method compiled into f be changed from a single pass to a multiple pass algorithm in which the integer values in the abstract function vector, f.F, are iterated until the expression in (E4) produces an optimized NLSE. This new score method has the side effect that executing f.score(X,Y) also alters the abstract function vector, f.F, to integer choices selecting optimal built-in functions. Clearly the genetic algorithm (Man et al., 1999), discrete particle swarm (Eberhart et al., 2001), and discrete differential evolution (Price et al., 2005) algorithms provide excellent candidate algorithms for optimizing f.F, and they can easily be compiled into f.score by common compilation techniques currently in the mainstream.

Summarizing, we have a new grammar term, f1, which is an indirect function reference through the 1st element of the integer vector, f.F (in C language syntax f1 == functionList[f.F[1]]). The f.F vector is optimized by scoring f, then altering the values in f.F, then repeating the process iteratively until an optimum NLSE is achieved. For instance, if the valid function list in the expression grammar is

    f.functionList = < log, sin, cos, tan, max, min, avg, cube, sqrt >

And the regression champion agent in (E4) is optimized with:

    f.F = < 1, 8, 2, 4 >

Then the optimized regression champion agent in (E4) has a concrete conversion counterpart as follows:

    f = (log(x3)/cube(x2*45.3))>x4 ? sin(x6) : tan(x3)

The built-in function argument arity issue is easily resolved by having each built-in function ignore any excess arguments and substitute defaults for any missing arguments.

Finally, we can enhance our nonlinear regression system with the ability to optimize either features or constants by adding abstract term rules to our built-in algebraic expression grammar.

    Abstract Terms: t1, t2, and t10

Abstract terms represent placeholders for either abstract features or constants which are to be optimized by the nonlinear regression system. To further optimize f we would alter the expression in (E2) as follows.

    (E5): f = (log(t0)/sin(t1*t2))>t3 ? tan(t4) : cos(t5)

The compiler adds a new binary vector attribute, T, to f such that f.T has as many elements as there are abstract terms in (E5). Each binary element in the f.T vector is either 0 or 1, and represents a choice of abstract feature or abstract constant. Adding abstract terms allows the system to construct a universal formula containing all possible concrete formulas. Additional details on Abstract Expression Grammars can be found in (Korns, 2009).
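The concrete conversion step for abstract functions and abstract terms amounts to an indexed lookup, which the following Python sketch makes explicit by rewriting the expression text. It is an illustration only: the real system compiles the expression rather than substituting strings, the helper names bind_functions and bind_terms are hypothetical, and the convention that T[i] == 0 selects a feature while T[i] == 1 selects a constant is an assumption, since the text does not say which value means which.

```python
import re

FUNCTION_LIST = ["log", "sin", "cos", "tan", "max", "min", "avg", "cube", "sqrt"]

def bind_functions(template, F, function_list=FUNCTION_LIST):
    """Replace abstract functions f1, f2, ... using the 1-based choices stored in F."""
    def pick(match):
        gene = int(match.group(1))               # f1 refers to the 1st element of F
        return function_list[F[gene - 1] - 1]    # the stored choice is 1-based as well
    return re.sub(r"\bf(\d+)\b", pick, template)

def bind_terms(template, T, V, C):
    """Replace abstract terms t0, t1, ...; ASSUMED: 0 selects x<V[i]>, 1 selects C[i]."""
    def pick(match):
        i = int(match.group(1))
        return f"x{V[i]}" if T[i] == 0 else repr(C[i])
    return re.sub(r"\bt(\d+)\b", pick, template)

E4 = "(f1(x3)/f2(x2*45.3))>x4 ? f3(x6) : f4(x3)"
E5 = "(log(t0)/sin(t1*t2))>t3 ? tan(t4) : cos(t5)"

print(bind_functions(E4, F=[1, 8, 2, 4]))
print(bind_terms(E5, T=[0, 0, 1, 0, 0, 0],
                 V=[3, 2, 1, 4, 6, 3],
                 C=[0.0, 0.0, 45.3, 0.0, 0.0, 0.0]))
```

With f.F = < 1, 8, 2, 4 > the first call reproduces the concrete counterpart of (E4) given in the text, and the term vectors chosen for the second call turn (E5) back into the right-hand side of (E1).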

4. Universal Abstract Expressions

A general nonlinear regression system accepts an input matrix, X, of N rows and M columns and a dependent variable vector, Y, of length N. The dependent vector Y is related to X through the (quite possibly nonlinear) transformation function, Q, as follows: Y[n] = Q(X[n]). The nonlinear transformation function, Q, can be related to linear regression systems, without loss of generality, as follows. Given an N rows by M columns matrix X (independent variables), an N vector Y (dependent variable), and a K+1 vector of coefficients, the nonlinear transformation, Q, is a system of K transformations, $Q_k : (R_1 \times R_2 \times \dots \times R_M) \to R$, such that $y = C_0 + (C_1 * Q_1(X)) + \dots + (C_K * Q_K(X)) + err$ minimizes the normalized least squared error. Obviously, in this formalization, a nonlinear regression system is a linear regression system which searches for a set of K suitable transformations which minimize the normalized least squared error. If K is equal to M, then Q is dimensional, and Q is covering if, for every m in M, there is at least one instance of $X_m$ in at least one term $Q_k$.
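In this formalization, once a set of K transformations has been proposed, the coefficients follow from ordinary linear regression on the transformed columns. The Python sketch below shows that inner step only, using the normal equations and a small Gaussian elimination so that it needs no external libraries. The two transformations and the target are hypothetical examples, and the search for the transformations themselves, which is the hard part, is not shown.

```python
import math
import random

def solve(A, b):
    """Solve the square system A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        c[r] = (M[r][n] - sum(M[r][k] * c[k] for k in range(r + 1, n))) / M[r][r]
    return c

def fit_coefficients(X, Y, transforms):
    """Least squares fit of y = C0 + C1*Q1(x) + ... + CK*QK(x) via the normal equations."""
    rows = [[1.0] + [q(x) for q in transforms] for x in X]
    k = len(transforms) + 1
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    AtY = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(k)]
    return solve(AtA, AtY)

rng = random.Random(0)
X = [[rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(200)]
transforms = [lambda x: x[0] ** 2, lambda x: math.sin(x[1])]   # hypothetical Q1, Q2
Y = [1.0 + 2.0 * x[0] ** 2 - 3.0 * math.sin(x[1]) for x in X]  # noise-free target
C = fit_coefficients(X, Y, transforms)
print([round(c, 6) for c in C])
```

When the proposed transformations match the ones that generated Y, as they do in this noise-free example, the fitted vector recovers the K+1 coefficients up to rounding error.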


With reference to our system architecture, what is needed to implement general nonlinear regression, in this formalization, is a method of constructing a universal goal expression which contains all possible nonlinear transformations up to a pre-specified complexity level. Such a method exists and is described as follows. Given any concrete expression grammar suitable for nonlinear regression, we can construct a universal abstract goal expression, of an arbitrary grammar node depth level, which contains all possible concrete instance expressions within any one of the K transformations in Q. For instance, the universal abstract expression, $U_0$, of all $Q_k$ of depth level 0 is t0. Remember that t0 is either v0 or c0. The universal abstract expression, $U_1$, of all $Q_k$ of depth level 1 is f0(t0,t1). In general we have the following.

    $U_0$: t0
    $U_1$: f0(t0,t1)
    $U_2$: f0(f1(t0,t1),f2(t2,t3))
    $U_3$: f0(f1(f2(t0,t1),f3(t2,t3)),f4(f5(t4,t5),f6(t6,t7)))
    $U_k$: f0($U_{k-1}$, $U_{k-1}$)
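The recursive definition above can be turned into a short generator. The Python sketch below builds the text of the universal expression for a given depth, numbering abstract functions in pre-order and abstract terms from left to right, which is the numbering that the listed expressions for depths 0 through 3 appear to follow. The function name universal is hypothetical.

```python
from itertools import count

def universal(depth):
    """Build the universal abstract expression of the given grammar node depth."""
    f_ids, t_ids = count(), count()

    def build(d):
        if d == 0:
            return f"t{next(t_ids)}"      # leaf: an abstract term
        name = f"f{next(f_ids)}"          # number this function before its children
        left = build(d - 1)
        right = build(d - 1)
        return f"{name}({left},{right})"

    return build(depth)

for k in range(4):
    print(k, universal(k))
```

An expression of depth k built this way contains 2^k - 1 abstract functions and 2^k abstract terms, so the lengths of the function and term chromosomes grow exponentially with the chosen depth.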

Given any suitable functional grammar with features, constants, and operators, we add a special operator, noop, which simply returns its first argument. This allows any universal expression to contain all smaller concrete expressions. For instance, if f0 = noop, then f0(t0,t1) = t0. We solve the arity problem for unary operators by altering them to ignore the rightmost arguments, for binary operators by altering them to substitute default arguments for missing rightmost arguments, and for N-ary operators by wrapping the additional arguments onto lower levels of the binary expression using appropriate context sensitive grammar rules. For example, let us see how we can wrap the 4-ary conditional function (operator) ? onto multiple grammar node levels using context sensitive constraints.

    y = f0(f1(expr,expr),f2(expr,expr))

Clearly if, during evolution in any concrete solution, the abstract function f0 were to represent the ? conditional function, then the abstract function f1 would be restricted to one of the relational functions (operators), and the abstract function f2 would be restricted to only the colon function (operator). Therefore one would have any number of possible solutions to the goal expression, but some of the possible solutions would violate these context sensitive constraints and would be unreasonable. The assertion that certain possible solutions are unreasonable depends upon the violation of context sensitive constraints implicit with each operator as follows.


    y = f0(f1(expr,expr),f2(expr,expr)) (goal expression)
    y = ?( ? : noop)
    constraints: f1(+ * / % max min mod sqrt square cube abs sign sigmoid cos sin tan tanh log exp < = > ? : noop)
    constraints: f2(+ * / % max min mod sqrt square cube abs sign sigmoid cos sin tan tanh log exp < = > ? : noop)

However, if we know that a particular solution has selected f0 to be the operator ?, then we must implicitly assume that the constraints for abstract functions f0, f1, and f2, with respect to that solution, are as follows.

    constraints: f0(?)
    constraints: f1(< = >)
    constraints: f2(:)

In the goal expression genome, f0 is a single gene located in position zero in the chromosome for abstract functions. The constraints are wrapped around each chromosome and are a vector of reasonable choices for each gene. In a context insensitive genome, choosing any specific value for gene f0 or gene v6, etc. has no effect on the constraint wrappers in the genome. However, in a context sensitive genome, choosing any specific value for gene f0 or gene v6, etc. may have an effect on the constraint wrappers in the genome.

Furthermore, we are not limited to implicit control of the genome’s constraint wrappers. We can extend control of the genome’s constraints to the user in an effort to allow greater control of the search space. For instance, if the user wanted to perform a univariate regression on a problem with ten features but desired only trigonometric transforms in the output, the following abstract goal expression would be appropriate.

    y = f0(v0) where f0(cos sin tan tanh)

Publishing the genome’s constraints for explicit user guidance is an attempt to explore greater user control of the search space during the evolutionary process.
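The context sensitive behavior described in this section can be sketched as a function that derives the constraint wrappers for f0, f1, and f2 from the choice made for f0. The Python below covers only the one rule stated in the text (choosing ? for f0 restricts f1 to the relational operators and f2 to the colon operator); the names constraints_for and is_reasonable are hypothetical, and a real system would carry similar rules for other operators and chromosomes.

```python
ALL_FUNCTIONS = ("+ * / % max min mod sqrt square cube abs sign sigmoid "
                 "cos sin tan tanh log exp < = > ? : noop").split()
RELATIONAL = ["<", "=", ">"]

def constraints_for(f0_choice=None):
    """Constraint wrappers for the goal expression y = f0(f1(expr,expr),f2(expr,expr))."""
    c = {gene: list(ALL_FUNCTIONS) for gene in ("f0", "f1", "f2")}
    if f0_choice is not None:
        c["f0"] = [f0_choice]          # the solution has fixed gene f0
    if f0_choice == "?":
        c["f1"] = list(RELATIONAL)     # the condition must be a relational operator
        c["f2"] = [":"]                # the branches must hang off the colon operator
    return c

def is_reasonable(solution, constraints):
    """A solution is reasonable when every gene takes a value its wrapper allows."""
    return all(solution[gene] in allowed for gene, allowed in constraints.items())

wrappers = constraints_for("?")
print(is_reasonable({"f0": "?", "f1": "<", "f2": ":"}, wrappers))
print(is_reasonable({"f0": "?", "f1": "+", "f2": "/"}, wrappers))
```

A user supplied clause such as f0(cos sin tan tanh) fits the same structure: it simply replaces the default list for that gene before the search starts.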

6. Epigenome

In order to perform symbolic regression with a single abstract goal expression, all of the individual solutions must have the same shape genome. In a context insensitive architecture with only one population island performing only a general search strategy, this is not an issue. However, if we wish to perform symbolic regression, with a single abstract goal expression, on multiple population islands each searching a different part of the problem space, then we have to be more sophisticated in our approach. We have already seen how constraints can be used to control, both implicitly and explicitly, evolutionary choices within a single gene. But what if we wish to influence which genes are chosen for exploration during the evolutionary process? Then we must provide some mechanism for selecting which genes are to be explored and which are not. Purely arbitrarily, and in the sole interest of keeping faith with the original biological motivation of genetic algorithms, we call genes which are chosen for exploration during evolution expressed, and genes which are chosen NOT to be explored during evolution unexpressed. Furthermore, we call the wrapper around each chromosome, which determines which genes are and are not expressed, the epigenome. Once again, consider the following goal expression.

regress(f0(f1(expr,expr),f2(expr,expr))) where f0(?)

Since we know that the user has requested only solutions where f0 is the operator ?, we must implicitly assume that the constraints and epigenome for abstract functions f0, f1, and f2, with respect to any solution, are as follows.

constraints: f0(?)
constraints: f1(< = >)
constraints: f2(:)
epigenome: ef(f1)

We can assume the epigenome is limited to function f1 because, with both gene f0 and gene f2 constrained to a single choice each, f0 and f2 are implicitly no longer allowed to vary during evolution, with respect to any solution. Effectively both f0 and f2 are unexpressed. In the goal expression genome, ef is the epigenome associated with the chromosome for abstract functions. The epigenomes are wrapped around each chromosome and are a vector of expressed genes. In a context insensitive genome, choosing any specific value for gene f0 or gene v6, etc. has no effect on the constraint wrappers or the epigenome. However, in a context sensitive genome, choosing any specific value for gene f0 or gene v6, etc. may have an effect on the constraint wrappers and the epigenome.

Of course, we are not limited to implicit control of the epigenome. We can extend control of the epigenome to the user to allow greater control of the search space. For instance, the following goal expression is an example of a user specified epigenome.

(E6): regress(f0(f1(f2(v0,v1),f3(v2,v3)),f4(f5(v4,v5),f6(v6,v7))))
(E6.1): where {}
(E6.2): where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}

Expression (E6) has only one genome; however, the two where clauses request two distinct simultaneous search strategies. The first where clause (E6.1) tells the system to perform an unconstrained general search of all possible solutions. The second where clause (E6.2) tells the system to simultaneously perform a more complex search among a limited set of possible solutions as follows.
The ff(noop) condition tells the system to initialize all functions to noop unless otherwise specified. The f2(cos sin tan tanh) condition tells the system to restrict abstract function f2 to only the trigonometric functions, starting with cos. The ef(f2) epigenome tells the system that only f2 will participate in the evolutionary process. The ev(v0) epigenome tells the system that only v0 will participate in the evolutionary process. Therefore, (E6.2) causes the system to evolve only solutions of a single trigonometric function on a single feature, e.g. tan(x4), cos(x0), etc. These two distinct search strategies are explored simultaneously. The resulting champion will be the winning (optimal) solution across all simultaneous search strategies.
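
To see why (E6.2) collapses to a single trigonometric function of a single feature, the following Python sketch evaluates the (E6) genome with every abstract function set to noop except f2. It is a minimal illustration, not the authors' implementation: the names OPS, evaluate, and candidates_e62 are hypothetical, and holding unexpressed variable genes at feature 0 is an assumption made only so the example runs.

```python
import math

# Each operator is wrapped as a binary function. noop returns its first
# argument, and unary operators ignore the rightmost argument.
OPS = {
    "noop": lambda a, b: a,
    "cos": lambda a, b: math.cos(a),
    "sin": lambda a, b: math.sin(a),
    "tan": lambda a, b: math.tan(a),
    "tanh": lambda a, b: math.tanh(a),
}


def evaluate(f, v, x):
    """Evaluate f0(f1(f2(v0,v1),f3(v2,v3)),f4(f5(v4,v5),f6(v6,v7))).

    f is a list of 7 operator names, v a list of 8 feature indices,
    and x the feature vector of one training example.
    """
    t = [x[i] for i in v]
    left = OPS[f[1]](OPS[f[2]](t[0], t[1]), OPS[f[3]](t[2], t[3]))
    right = OPS[f[4]](OPS[f[5]](t[4], t[5]), OPS[f[6]](t[6], t[7]))
    return OPS[f[0]](left, right)


def candidates_e62(num_features):
    """Enumerate genomes reachable under
    where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}."""
    for op in ("cos", "sin", "tan", "tanh"):
        for feature in range(num_features):
            f = ["noop"] * 7      # ff(noop): all functions start as noop
            f[2] = op             # only f2 is expressed
            v = [0] * 8           # assumption: unexpressed variables stay at 0
            v[0] = feature        # only v0 is expressed
            yield f, v


x = [0.5, 1.0, 2.0]
candidates = list(candidates_e62(len(x)))
for f, v in candidates:
    print(f[2], v[0], evaluate(f, v, x))
```

With three features the search space under (E6.2) contains only twelve genomes, each of which evaluates to f2 applied to the feature selected by v0.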

7. Control

The user community is increasingly demanding better control of the search space and better control of the output from symbolic regression systems. In search of a control paradigm for symbolic regression, we draw on the relationship of SQL to database searches. Originally, database searches were highly constrained and heavily dictated by the choice of storage mechanism. With the advent of relational databases, searches came increasingly under user control, to the point that modern SQL is amazingly flexible. An unanswered research question is how much user control of the symbolic regression process can reasonably be achieved. Our system architecture allows us to use abstract goal expressions to better explore the possibilities for user control. Given the immense value of search space reduction and search specialization, the symbolic regression system can benefit greatly if the epigenome and the constraints are made available to the user. This allows the user to specify goal formulas and candidate individuals which are tailored to specific applications. For instance, the following univariate abstract goal expression is a case in point.

(E7): regress(f0(f1(f2(v0,v1),f3(v2,v3)),f4(f5(v4,v5),f6(v6,v7))))
(E7.1): where {}
(E7.2): where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}
(E7.3): where {ff(noop) f1(noop,*) f2(*) ef(f1) ev(v0,v1,v2)}
(E7.4): where {ff(noop) f0(cos sin tan tanh) f1(noop,*) f2(*) ef(f0,f1) ev(v0,v1,v2)}
(E7.5): where {f0(?) f4(:)}

Expression (E7) has only one genome and can be entered as a single goal expression requesting five distinct simultaneous search strategies. Borrowing a term from chess playing programs, we can create an opening book by adding where clauses like (E7.2), (E7.3), (E7.4), and (E7.5). The first where clause (E7.1) tells the system to perform an unconstrained general search of all possible solutions.
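
As a rough illustration of how such where clauses might be turned into search specifications, the following Python sketch parses the clause syntax shown in (E7.1) through (E7.5). The parser, the function name parse_where, and the dictionary keys are all hypothetical; the chapter does not describe how the system reads these clauses.

```python
import re


def parse_where(clause):
    """Parse a where clause such as
    'where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}'
    into a dictionary. A value of None means 'not specified'.
    """
    body = clause[clause.index("{") + 1:clause.rindex("}")]
    spec = {
        "init": None,                  # ff(...): initial operator for all functions
        "constraints": {},             # fN(...): allowed operators per gene
        "expressed_functions": None,   # ef(...): function genes that evolve
        "expressed_variables": None,   # ev(...): variable genes that evolve
    }
    for name, args in re.findall(r"(\w+)\(([^)]*)\)", body):
        # Arguments may be separated by commas or by whitespace.
        items = [a for a in re.split(r"[,\s]+", args.strip()) if a]
        if name == "ff":
            spec["init"] = items[0] if items else None
        elif name == "ef":
            spec["expressed_functions"] = items
        elif name == "ev":
            spec["expressed_variables"] = items
        else:
            spec["constraints"][name] = items
    return spec


opening_book = [
    "where {}",
    "where {ff(noop) f2(cos sin tan tanh) ef(f2) ev(v0)}",
    "where {ff(noop) f1(noop,*) f2(*) ef(f1) ev(v0,v1,v2)}",
    "where {f0(?) f4(:)}",
]
strategies = [parse_where(clause) for clause in opening_book]
for strategy in strategies:
    print(strategy)
```

Each parsed dictionary would correspond to one search strategy over the same genome, with the empty clause leaving every field unspecified for the general search.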
The second where clause (E7.2) tells the system to evolve only solutions of a single trigonometric function on a single feature, e.g. tan(x4), cos(x0), etc. In the third where clause (E7.3), the f1(noop,*) condition tells the system to restrict abstract function f1 to only the noop and * functions, starting with noop. The f2(*) condition tells the system to restrict abstract function f2 to only the * function. The ef(f1) epigenome tells the system that only f1 will participate in the evolutionary process. The ev(v0,v1,v2) epigenome tells the system that only v0, v1, and v2 will participate in the evolutionary process. Therefore, (E7.3) causes the system to evolve only champions that are pair or triple cross correlations, e.g. (x3*x1) or (x1*x4*x2).
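
The reduction of (E7.3) to pair and triple products can be traced symbolically. The Python sketch below is an illustration under assumptions: the helper names render and render_e73 are hypothetical, and only the left subtree f1(f2(v0,v1),f3(v2,v3)) is rendered because f0 is noop and therefore returns its first argument.

```python
def render(op, a, b):
    """Render a binary node as text, collapsing noop to its first argument."""
    if op == "noop":
        return a
    if op == "*":
        return f"({a}*{b})"
    raise ValueError(f"operator not used in (E7.3): {op}")


def render_e73(f1, v0, v1, v2):
    """Render the concrete expression selected by (E7.3).

    f2 is fixed to *, f3 stays noop (so it passes v2 through), and
    f1 is the only expressed function gene, taking noop or *.
    """
    product = render("*", f"x{v0}", f"x{v1}")   # f2(v0,v1)
    third = render("noop", f"x{v2}", None)      # f3(v2,v3) reduces to v2
    return render(f1, product, third)           # f1(...); f0 = noop passes it up


pair = render_e73("noop", 3, 1, 4)
triple = render_e73("*", 1, 4, 2)
print(pair)
print(triple)
```

When f1 is noop the third variable is discarded and a pair product remains; when f1 is * the result is a nested triple product, which is algebraically the same as the (x1*x4*x2) form given in the text.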

In the fourth where clause (E7.4), the ff(noop) condition tells the system to initialize all functions to noop unless otherwise specified. The f0(cos sin tan tanh) condition tells the system to restrict abstract function f0 to only the trigonometric functions, starting with cos. The f1(noop,*) condition tells the system to restrict abstract function f1 to only the noop and * functions, starting with noop. The f2(*) condition tells the system to restrict abstract function f2 to only the * function. The ef(f0,f1) epigenome tells the system that only f0 and f1 will participate in the evolutionary process. The ev(v0,v1,v2) epigenome tells the system that only v0, v1, and v2 will participate in the evolutionary process. Therefore, (E7.4) causes the system to evolve only champions consisting of a single trigonometric function operating on a pair or triple cross correlation, e.g. cos(x3*x1) or tan(x1*x4*x2). The fifth where clause (E7.5) causes the system to evolve only conditional champions, e.g. ((x3*x1)
