A Tutorial on R

From Opasnet
Revision as of 08:14, 12 April 2011 by Teemu R (talk | contribs)


Sessions to be held on Wednesdays at 12:30 in Kielo, starting 13.4.2011.

Introduction

Disclaimer: I started learning R less than a year ago, so do not expect everything in this tutorial to be correct or extremely accurate. I'm stating things the way I think they are or the way I believe it is useful to think they are.

Aim of the tutorial

Starting from some fundamentals, learn R on a generic level. R has so many packages for so many different approaches that it would be very difficult to cover them all comprehensively. So I want to give you a good start in learning R for whatever purpose you may need it.

Chapter 1

  • Basic syntax
  • The creatures of R: vectors, lists, factors, data.frames, arrays
  • Data types

R is an object oriented language.

The R console is a command line interpreter: typing a command, e.g. 2+2, returns 4. Mathematical operators (+,-,*,/,^) are used in their mathematical sense and are evaluated in the usual mathematical order. The operators can also be used in function form, e.g. '+'(1,1). More operators are listed in R's own help, e.g. ?Arithmetic and ?Syntax. Brackets () can be used to control which expressions are evaluated first, e.g. (1+1)/2.
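
The precedence rules and the function form of an operator can be tried out directly in the console. The following is just a small sketch; the variable names (a, b, p, s, h) are made up for illustration.

```r
# Operator precedence follows the usual mathematical rules
a <- 1 + 2 * 3      # multiplication is evaluated first: 7
b <- (1 + 2) * 3    # brackets are evaluated first: 9
p <- 2^3            # exponentiation: 8

# Operators are functions too; backticks (or quotes) let us call them by name
s <- `+`(1, 1)      # same as 1 + 1: 2
h <- (1 + 1) / 2    # 1
```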

Variables and functions are saved in memory by assignment, using either an arrow (<- or ->) or =, e.g.: var1 <- 1. It is preferable to use the arrow operator, since = is also used to name arguments inside function calls.
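
A minimal sketch of the assignment forms mentioned above. The names var1, var2, var3, total and m are only illustrative.

```r
var1 <- 1        # leftward arrow, the usual form
2 -> var2        # rightward arrow, assigns to the name on the right
var3 = 3         # = also assigns when used at the top level

total <- var1 + var2 + var3   # 6

# Inside a function call, = names an argument instead of assigning
m <- mean(x = c(1, 2, 3))     # 2
```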

Functions are called as function(parameter1, parameter2, parameter3 = value), e.g. mean(rnorm(10, sd = 2)). Checking the syntax and other details of a function is easy by using ?function, e.g. ?mean. Lists of the functions in a package can be found using library(help="package"), where package is the container/library of functions of interest, e.g. library(help="base") or library(help="stats"). A semicolon (;) can be used to separate statements on the same line, e.g. a <- 2;a.
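
Here is a short sketch of calling functions with positional and named arguments. The fixed seed and the variable names are my own assumptions, added only so that the example is repeatable; the help calls are left as comments because they open a help page rather than return a value.

```r
set.seed(1)                  # assumption: fixed seed so the random numbers repeat

# 10 is matched by position, sd by name; the mean argument keeps its default
x <- rnorm(10, sd = 2)
m <- mean(x)

# A named optional argument: trim drops a fraction of values from each end
trimmed <- mean(c(1, 2, 3, 100), trim = 0.25)   # mean of 2 and 3: 2.5

# Two statements on one line, separated by a semicolon
a <- 2; a

# Help pages (try these interactively):
# ?mean
# library(help = "stats")
```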

The basic structure of most objects in R is a vector. Vectors can be atomic (contain only values) or recursive (a vector of objects, basically). A vector in R is basically an ordered set of values or objects. Selecting (subscripting) the nth element is done by vector[n] or vector[[n]]. The double bracket is used when subscripting from recursive vectors (lists and their subtypes) to extract the stored objects themselves rather than a recursive vector of the selected elements.
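
The difference between single and double brackets is easiest to see by trying it. A minimal sketch with made-up objects v and l:

```r
v <- c(10, 20, 30)             # atomic vector
l <- list(1:3, "text", TRUE)   # recursive vector (list)

second_value <- v[2]           # 20

sub_list <- l[2]               # still a list, of length one, holding "text"
element  <- l[[2]]             # the stored object itself: "text"

is.list(sub_list)              # TRUE
is.list(element)               # FALSE
```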

There are 3 basic data types: text (class = character), numbers (class = numeric) and logical values (class = logical). Data types in R consist of these basic data types and their more elaborate derivatives. E.g. a factor is a character vector (a vector that consists of textual elements) stored as an integer vector, where each number represents a unique element of the character vector. The unique elements are stored as a levels attribute of the factor object. Integers are a special case of the numeric class: they are handled as numeric except in storage. Normally all numbers are plain numerics; one exception is the 'a:b' operator, which produces a vector of integers from a to b. An atomic vector can only contain values of a single data type.
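
The following sketch checks the classes described above and shows how a factor is stored. The objects f and mixed are only examples.

```r
class("a")        # "character"
class(1.5)        # "numeric"
class(TRUE)       # "logical"
class(1:3)        # "integer", but still counts as numeric:
is.numeric(1:3)   # TRUE

# A factor: integer codes plus a levels attribute
f <- factor(c("low", "high", "low"))
levels(f)         # "high" "low" (sorted unique values)
as.integer(f)     # 2 1 2 (position of each value in the levels)

# An atomic vector holds a single type, so mixed input is coerced
mixed <- c(1, "a", TRUE)   # everything becomes character
```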

Attributes are used to simulate more complex data structures. An atomic vector can be given a dimensions attribute dim (which is a numeric vector containing the lengths of the dimensions) to turn it into an array (a matrix is an array with length(dim) = 2). Dimensions can be given names in the dimnames attribute, which is a list of character vectors, one per dimension. Because arrays and matrices are atomic vectors by nature, they can only contain values of one data type.
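
As a rough sketch of how the dim and dimnames attributes work, here a plain vector is turned into a matrix. The names row, col and the labels are arbitrary choices of mine.

```r
x <- 1:6
dim(x) <- c(2, 3)        # now a 2 x 3 matrix, filled column by column
is.matrix(x)             # TRUE

dimnames(x) <- list(row = c("a", "b"), col = c("x", "y", "z"))
corner <- x["b", "z"]    # last element of the vector: 6

# The same idea with three dimensions gives an array that is not a matrix
arr <- array(1:24, dim = c(2, 3, 4))
length(dim(arr))         # 3
```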

Recursive vectors, i.e. lists and data.frames (which are lists whose elements are atomic vectors of equal length), can hold values of different data types, since they consist of separate objects. The data.frame is perhaps the most common object type in R. It resembles the basic rectangular table format.
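
A small sketch of a list and a data.frame built from made-up values. I set stringsAsFactors = FALSE explicitly, because the default for text columns differs between R versions.

```r
# A list can hold objects of different types and lengths
l <- list(id = 1:3, name = "abc", flag = TRUE)

# A data.frame is a list of equal-length columns
df <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  value = c(0.5, 1.5, 2.5),
  stringsAsFactors = FALSE
)

is.list(df)               # TRUE
dims <- dim(df)           # 3 rows, 3 columns
col_classes <- sapply(df, class)   # one class per column
```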

In general, arrays are produced and used when data is summarized (summed or averaged over some marginals); for computational purposes the data.frame is superior.
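
One way to get from a data.frame to a summarizing array is tapply; this is just one possible approach, and the data below is invented for the example. as.table together with as.data.frame goes back to the long format.

```r
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  year  = c("2010", "2010", "2011", "2010", "2011", "2011"),
  value = 1:6
)

# Sum value over every combination of group and year: a 2 x 2 matrix
sums <- tapply(df$value, list(group = df$group, year = df$year), sum)
sums["a", "2010"]    # 1 + 2 = 3
sums["b", "2011"]    # 5 + 6 = 11

# Back to a data.frame, one row per cell; the values end up in column Freq
long <- as.data.frame(as.table(sums))
```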

More info on objects: http://cran.r-project.org/doc/manuals/R-lang.html#Objects

Classes in R refer to either the data type of a simple atomic vector or the object type of a more complex object. Functions in R may have different methods for handling inputs of different classes, and this may sometimes confuse newcomers; e.g. some functions treat a factor as its underlying numeric codes rather than as a character vector. Many functions try to coerce their input into a format they can operate with by using functions like as.character and as.numeric.
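
The factor confusion mentioned above is a classic one, so here is a minimal sketch of it with a made-up factor f.

```r
f <- factor(c("10", "20", "10"))
class(f)                              # "factor"

# Coercing the factor directly gives the internal codes, not the labels
codes  <- as.numeric(f)               # 1 2 1

# Going through character first gives the numbers we probably wanted
values <- as.numeric(as.character(f)) # 10 20 10

# Other simple coercions
as.numeric("3.14")                    # 3.14
as.character(1:3)                     # "1" "2" "3"
```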

More info on classes and other attributes: http://cran.r-project.org/doc/manuals/R-lang.html#Attributes

Many R functions are vectorized, meaning that a function can take one or more vectors as input to produce a vector as output. E.g. 1:5 + 10:6 produces 11 11 11 11 11. If the vectors are of different lengths, the shorter one is usually recycled to match the length of the longest vector. E.g. 1:2 + rep(4,5) produces 5 6 5 6 5 with a warning calling attention to the arguments' different lengths. Vectorized operations are usually much faster than for, while or other loops and should be used in their place whenever possible.
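
A short sketch of vectorized arithmetic, recycling, and a loop next to its vectorized equivalent. The warning is suppressed here only to keep the output clean; run the line without suppressWarnings to see it.

```r
sums <- 1:5 + 10:6                          # 11 11 11 11 11

# Lengths 2 and 5: the shorter vector is recycled (normally with a warning)
recycled <- suppressWarnings(1:2 + rep(4, 5))   # 5 6 5 6 5

# The same calculation as a loop and as a vectorized call
x <- c(1, 4, 9)
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}
roots_vec <- sqrt(x)                        # one call, no loop
```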

Chapter 2?

  • Importing and exporting data from/to files
  • Working with data: subscripting, merge, apply, reshape, conversion between data.frame and array
  • OpasnetBaseUtils?

The easiest way to get data in and out of R is through delimited text files (.txt or .csv). The read.table function reads a file specified by a path to a local file or by a URL. It follows the syntax read.table(file, header, sep, quote, dec, fill, strip.white, ...) (there are more arguments, but they are not all as relevant; check ?read.table for yourself), where:

  • file is a character string specifying the file by path or URL, e.g. "M:/test.txt". Note that you have to use forward slashes (or doubled backslashes), because a single backslash is the escape character.
  • header is either TRUE or FALSE depending on whether the first line in the file is a header; the default is FALSE.
  • sep is the cell separator. The default is "", meaning any white space (spaces or tabs); use "\t" for tab-delimited files. CSV files written on a European locale usually use ";", while the global standard is ",".
  • quote is the set of quote characters used in the file; the default is "\"'" (an escaped " followed by '), so both double and single quotes are recognized.
  • dec is the decimal separator; the default is ".".
  • fill determines whether uneven rows are filled with extra empty cells. The default is FALSE, so by default an error is produced when the file has uneven rows.
  • strip.white removes leading and trailing white space from unquoted character fields; the default is FALSE, and it only has an effect when sep has been specified.

There is a csv wrapper for the function (read.csv), which changes the defaults to sep = ",", header = TRUE and fill = TRUE, and a second one (read.csv2), which uses sep = ";" and dec = "," instead. Both pass any further arguments on to read.table. I would still recommend calling read.table directly and setting the arguments yourself, so that the settings in use are always visible.
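
To make this concrete, here is a minimal sketch that writes a small semicolon-separated file to a temporary location, reads it back, and exports it again with write.table. The file contents and variable names are invented for the example; replace path with your own file or URL.

```r
# Create a small European-style csv file (";" separator, "," decimal)
path <- tempfile(fileext = ".csv")
writeLines(c("id;name;value", "1;a;0,5", "2;b;1,5"), path)

# Read it with explicit arguments
df <- read.table(path, header = TRUE, sep = ";", dec = ",",
                 stringsAsFactors = FALSE)

# The wrapper with matching defaults reads the same file
df2 <- read.csv2(path, stringsAsFactors = FALSE)

# Export as a tab-delimited text file and read it back
out <- tempfile(fileext = ".txt")
write.table(df, out, sep = "\t", row.names = FALSE, quote = FALSE)
back <- read.table(out, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
```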

The output is in the data.frame format. We can select row(s) and column(s) by subscripting. The data.frame is the most flexible format when it comes to data exploration and subscripting. Since data.frames are essentially lists, we can use list1[[x]] to select a column x, where x can be a numeric vector (it can be longer than one, in which case it indexes recursively; experiment with nested lists if you're interested). If the list is named (data.frames always are), we can select its elements by using list1[[x]], where x is a character vector of length one, or list1$col1, where col1 is simply the name of the list element (a column in a data.frame). After selecting an object from a list we can subscript from it again, e.g. list1[[1]][1] returns the first element of the first object stored in list1, and list1$col1[1] works similarly; list1[[1]][[1]][[1]]... can be used for nested lists. A data.frame is special in that it can also be subscripted like a two-dimensional array: df1[x,y] returns the value in row x of column y, where both x and y can be vectors of any length and of any basic data type (numeric, character or logical). Either x or y can be left blank to select all rows or all columns: if a vertical slice is extracted, the result is an atomic vector if only one column was selected (same as selecting the object from the list); if a horizontal slice is extracted, the result is a data.frame.
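
The subscripting forms are easier to follow with something to try them on. This is only a sketch; df1, col1, col2 and nested are made-up names.

```r
df1 <- data.frame(col1 = c(10, 20, 30), col2 = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# List-style selection of a column: all three give the same vector
by_pos  <- df1[[1]]
by_name <- df1[["col1"]]
by_dlr  <- df1$col1

first <- df1$col1[1]          # subscript again: 10

# Array-style selection: row first, then column
cell  <- df1[2, "col2"]       # "b"
vert  <- df1[, 1]             # one column, all rows: an atomic vector
horiz <- df1[2, ]             # one row, all columns: still a data.frame
big   <- df1[df1$col1 > 15, ] # logical subscript: rows 2 and 3

# Nested lists: chained double brackets, or a vector inside [[ ]]
nested <- list(list(1:3, "x"), "y")
deep   <- nested[[1]][[1]][2] # 2
same   <- nested[[c(1, 1)]]   # same object as nested[[1]][[1]]
```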

A comprehensive guide to importing and exporting data in R can be found on http://cran.r-project.org/doc/manuals/R-data.html.

Chapter 3?

  • The magic
  • Packages
  • Fancy plots

Chapter Z

  • BRugs (OpenBUGS on R)
  • ff (on disk objects)
  • ...