A Tutorial on R: Difference between revisions

{{lecture|moderator=Teemu R|stub=Yes}}
[[Category:Contains R code]]
[[Category:Code under inspection]]


* Sessions to be held on Wednesdays at 12:30 someplace, starting 13.4.2011
== Introduction ==

Welcome to the [[R]] tutorial. The tutorial is meant for people who already know something about programming in general.

=== Aim of the tutorial ===

Starting from some fundamentals, learn [[R]] on a generic level. R has got so many different packages for many different approaches that it would be very difficult to cover all of them comprehensively. So I want to give you a nice start into learning R for whatever purpose you may need it for.

== Slicing different parts of objects ==
Run this code on your own computer to see different methods to slice objects.
<pre>
dat <- data.frame( # Define a data frame with five rows and three columns.
A = 1:5,
B = c("c", "b", "d", "e", "c"),
Result = c(6, 45, 2, 4.5, 2),
stringsAsFactors = TRUE # Needed since R 4.0.0 so that column B becomes a factor.
)
dat # The data frame dat.
dat$A  # The vector that forms column A.
dat[["A"]] # The same as previous.
dat[[1]] # The vector that forms the first column (the same as previous).
dat["A"] # The column with name A (this is a data frame with one column).
dat[1] # The first column  (with name A; the same as previous).
dat[c(1,3)] # Data frame with the first and third columns.
dat[c("A", "Result")] # The same as previous.
dat[2:4, 1:2] # Data frame with rows 2 to 4 and columns 1 to 2.
dat[2:4, "Result"] # The vector that is formed from rows 2 to 4 of column Result.
dat$Result[2:4] # The same as previous.
dat[2:4, ]["Result"] # Data frame that first takes rows 2 to 4 and then column Result.
dat$B # Factor that forms column B. Note that factors are also vectors.
levels(dat$B) # Levels (i.e. possible values of the factor) of the previous factor.
as.numeric(dat$B) # The numeric position values of the previous factor.
levels(dat$B)[dat$B] # The previous factor converted into a character vector.
c("b", "c", "d", "e")[c(2, 1, 3, 4, 2)] # The same as previous
library(OpasnetUtils) # This package is needed to operate with ovariables.
odat <- Ovariable("odat", data = dat) # Ovariable that has dat as data.
odat@data # Data slot of odat.
odat # All slots of odat.
odat <- EvalOutput(odat) # Evaluate odat (i.e., calculate the output)
odat@output # Output slot of odat.
result(odat) # The result column of the output slot of odat.
summary(odat) # Summary of odat. If odat is probabilistic, summary includes mean and other statistics.
odat@marginal # Which columns in output are marginals?
colnames(odat@output) # Names of columns of the output of odat
colnames(odat@output)[odat@marginal] # Names of marginal columns.
odat@output[odat@output$B == "c" , ] # Data frame from odat output of rows where column B has value c.
result(odat)[odat@output$B == "c"] # Vector from odat result of rows where column B has value c.
</pre>
== Chapter 1 - the Basics of the R Language ==
* Basic syntax
* The creatures of R: vectors, lists, factors, data.frames, arrays
* Data types

R is an interpreted language in which everything is an object; it supports both functional and object-oriented programming.
 
The R console is a command line interpreter: typing a command, e.g. 2+2, returns the result, 4. Mathematical operators (+,-,*,/,^) are used in their mathematical sense and the order of execution is the mathematical one. The operators can also be used in function form, e.g. '+'(1,1). More operators are listed [http://cran.r-project.org/doc/manuals/R-lang.html#Operators here]. Brackets () can be used to control which expressions are evaluated first, e.g. (1+1)/2.
 
Variables and functions are saved in memory using either an arrow (''<-'' or ''->'') or ''{{=}}'', e.g. var1 <- 1. It is preferable to use the arrow operator.
 
Functions are used as function(parameter1, parameter2, parameter6 = value), e.g. mean(rnorm(10, sd = 2)). Checking syntax and other details for a function is easy by using ?function, e.g. ?mean. Lists of functions can be found using library(help="package"), where package is the container/library of functions of interest. E.g. library(help="base") or library(help="stats"). A semicolon (;) can be used to separate statements on the same line. E.g. a <- 2;a.
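
The syntax rules above can be tried out with a few lines like the following. This is only a minimal sketch; all variable names are made up for illustration.

```r
# Arithmetic follows the usual order of operations.
x1 <- 2 + 2 * 3        # 8, multiplication first
x2 <- (2 + 2) * 3      # 12, brackets first

# Operators are ordinary functions and can be called in function form.
x3 <- `+`(1, 1)        # 2

# Assignment with the arrow (preferred) and with the equals sign.
var1 <- 1
var2 = 2

# Calling a function with positional and named arguments.
set.seed(1)                      # make the random numbers reproducible
m <- mean(rnorm(10, sd = 2))     # mean of 10 draws from a normal distribution

# Several statements on one line, separated by a semicolon.
a <- 2; a
```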
 
The basic structure of most objects in R is a vector. Vectors can be atomic (contain only values) or [[:en:Recursion (computer science)#Recursive data types|recursive]] (a vector of objects, basically). A vector in R is basically an ordered set of values or objects. Selecting (''subscripting'') the n:th element is done by vector[n] or <nowiki>vector[[n]]</nowiki>. The double bracket is used when subscripting recursive vectors (lists and their subtypes) to extract the stored object itself rather than a recursive vector containing the selected elements.
 
There are 3 basic data types: text (class = ''character''), numbers (class = ''numeric'') and ''logical''. Data types in R consist of these basic data types and their more elaborate derivatives. E.g. a ''factor'' is a ''character vector'' (a vector that consists of textual elements) stored as a vector of integer codes, where each number represents a unique element of the character vector. The unique elements are stored in the ''levels'' attribute of the factor object. Integers are a special case of numbers: they are handled like ''numeric'' values except in storage. Normally all numbers are ''numeric''; one exception is the 'a:b' operator, which produces a vector of ''integers'' from a to b. An atomic vector can only contain values of a single data type.
 
<rcode showcode="1">
a <- 1:6
a
b <- c("Q", "W", "E", "R", "T", "Y")
b
d <- data.frame(Row = a, Letter = b)
d
b <- c(b, c("W", "W", "T"))
b
class(b)
b <- as.factor(b)
b
class(b)
b <- as.numeric(b)
b
class(b)
d[ , "Letter"]
d$Letter
d["Letter"]
d[d$Row == 2, ]
d[d$Row %in% 2:4, ]
 
print(d)
library(xtable)
print(xtable(d), type = 'html')
library(OpasnetUtils)
oprint(d)
</rcode>
 
Attributes are used to simulate more complex data structures. An atomic vector can be given a dimensions attribute ''dim'' (which is a numeric vector containing the lengths of the dimensions), to turn it into an ''array'' (a ''matrix'' is an ''array'' with length(dim) = 2). Dimensions can be given names in the ''dimnames'' attribute which is a list of named character vectors. Because arrays and matrices are atomic vectors by nature they can only contain values of one data type.
 
<rcode showcode=1>
a <- 1:27;dim(a) <- c(3,3,3);dimnames(a) <- list(dim1=1:3, dim2=1:3, dim3=1:3);a
b <- array(1:27, dim = c(3,3,3), dimnames = list(dim1=1:3, dim2=1:3, dim3=1:3));b
</rcode>
 
Recursive vectors, i.e. ''lists'' and ''data.frames'', can hold values of different data types since they consist of different objects. A ''data.frame'' is a ''list'' whose elements are atomic vectors of equal length; in R versions before 4.0.0 character vectors were converted to factors by default, whereas newer versions keep them as character vectors unless ''stringsAsFactors = TRUE'' is given. The ''data.frame'' is perhaps the most common object type in R. It resembles the basic rectangular table format.
 
<rcode showcode=1>
a <- 1:9; b <- factor(rep(c("a","b","c"),3)); d <- factor(rep(c("q","w","e"), each = 3))
df <- data.frame(a,b,d);df
df2 <- data.frame(a = 18:10, b = factor(rep(c("a","b","c"),3)), d = factor(rep(c("q","w","e"), each = 3)));df2
l <- list(df, df2, k = 1:2);l
l[[1]]
l[[3]]
l[3]
l[[1]][5,1]
l[[1]][5,]
l[[1]][,1]
</rcode>
 
In general, ''arrays'' are produced and used when data is summarized (summed or averaged over some marginals); for other computational purposes the ''data.frame'' is usually the better choice.
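
As a rough illustration of the relationship between the two formats, the sketch below summarizes a small made-up ''data.frame'' into an array and converts the result back. The object and column names (obs, Year, Place, Value) are invented for this example.

```r
# A small data.frame with two indexing factors and one numeric column.
obs <- data.frame(
  Year  = factor(rep(c("2010", "2011"), each = 4)),
  Place = factor(rep(c("North", "South"), times = 4)),
  Value = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Summing over the repeated observations gives a 2 x 2 array (a matrix).
arr <- tapply(obs$Value, obs[c("Year", "Place")], sum)
arr

# as.table() and as.data.frame() turn the array back into a long data.frame.
long <- as.data.frame(as.table(arr), responseName = "Value")
long
```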
 
More info on objects: http://cran.r-project.org/doc/manuals/R-lang.html#Objects
 
Classes in R refer to either the data type of a simple atomic vector, or the object type of a more complex object. Functions in R may have different ''methods'' for handling inputs of different classes and this may sometimes confuse newcomers; e.g. some functions take a factor input as only a numeric vector instead of a character vector. Many functions try to coerce their input into the format they can operate with by using functions like ''as.character'' and ''as.numeric''.
 
<rcode showcode=1>
class(1);class(1:4);class("a");class(TRUE);class(as.factor(c("a","a","b")))
1:4
</rcode>
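
The confusion with factors mentioned above can be seen in a small sketch like this one (the factor f is made up for illustration): converting a factor directly to numbers gives the internal level codes, not the values shown on screen.

```r
f <- factor(c("10", "20", "10"))

codes  <- as.numeric(f)                 # internal level codes: 1 2 1
values <- as.numeric(as.character(f))   # the labels as numbers: 10 20 10

# The same generic function behaves differently depending on the class.
summary(f)                 # counts per level
summary(as.character(f))   # length, class and mode of a character vector
```

Going through ''as.character'' first is the usual way to recover the original numbers.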
 
More info on classes and other attributes: http://cran.r-project.org/doc/manuals/R-lang.html#Attributes
 
Many R functions are vectorized, meaning that a function can take one or more vectors as input and produce a vector as output. E.g. ''1:5 + 10:6'' produces ''11 11 11 11 11''. If the vectors are of different lengths, the shorter one is recycled to match the length of the longer one. E.g. ''1:2 + rep(4,5)'' produces ''5 6 5 6 5'' with a warning, because the longer length is not a multiple of the shorter one. Vectorized operations are usually much faster than the equivalent for, while or other loops and should be preferred whenever possible.
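
A minimal sketch of the same idea, comparing a vectorized sum with an explicit loop (the variable names are made up for illustration):

```r
x <- 1:5
y <- 10:6

s <- x + y          # element by element: 11 11 11 11 11
r <- x * 2          # the single value 2 is recycled: 2 4 6 8 10

# The loop below computes the same sums as x + y, one element at a time.
s_loop <- numeric(length(x))
for (i in seq_along(x)) {
  s_loop[i] <- x[i] + y[i]
}

all(s == s_loop)    # TRUE
```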
 
== Chapter 2 - Getting your data into R ==
 
<rcode showcode=1>
library(OpasnetUtils)
 
e <- opbase.data("Op_en5103")
oprint(e)
e <- tidy(e)
oprint(e)
</rcode>
 
'''To upload data, you can use Table2Base tables on a wiki page:
 
<t2b index="Row,Sector,Year,Observation" locations="Amount,Per.person" unit="kton CO2e">
1|Consumers' use of electricity |2010|126.96|1.31
2|Electric heating |2010|32.65|0.34
3|District heating |2010|321.18|3.31
4|Separate heating |2010|41.10|0.42
5|Traffic |2010|164.47|1.70
6|Agriculture |2010|24.36|0.25
7|Waste management |2010|20.55|0.21
8|Total |2010|731.27|7.54
</t2b>
 
'''Or you can go to upload functionality to upload an Excel or CSV file:
 
{{uploadlink}}
 
* Importing and exporting data from/to files
* Working with data: subscription, merge, apply, reshape, conversion between data.frame and array
* [[OpasnetUtils]]?
 
Example data: [[File:ArkS280.csv]]. Be careful when converting your Excel sheets to .csv, Microsoft Office is an idiotic piece of software and it sometimes writes empty cells in places where you have once edited something. To fix this open the exported .csv in the Open Office equivalent of Excel and overwrite the previous file.
 
The easiest way to get data in and out of R is through delimited text files (.txt or .csv). The ''read.table'' function reads a file specified by a path to a local file or by a URL. Its basic syntax is ''<nowiki>read.table(file, header, sep, quote, dec, fill, strip.white, ...)</nowiki>'' (there are more arguments, but they are not all as relevant; check ''?read.table'' for yourself):
* ''file'' is a character string specifying the file by path or URL, e.g. "M:/test.txt". Note that you have to use forward slashes (or doubled backslashes), because the backslash is the escape character.
* ''header'' is either ''TRUE'' or ''FALSE'' depending on whether the first line in the file is a header; the default is ''FALSE'' unless the first row has one field fewer than the other rows.
* ''sep'' is the cell separator. The default is "", meaning any white space (spaces, tabs or newlines); use "\t" for tab-delimited files. CSV files written on a European locale usually use ";", while the global standard is ",".
* ''quote'' is the set of quote characters used in the file; the default is "\"'" ([[:en:Escape character|escaped]] " followed by ').
* ''dec'' is the decimal separator; the default is ".".
* ''fill'' determines whether rows of unequal length are filled with extra empty cells; the default is ''FALSE'', hence by default an error is produced when the file has uneven rows.
* ''strip.white'' removes leading and trailing white space from unquoted character fields; the default is ''FALSE'', and it is used only when ''sep'' has been specified.

There are two csv wrappers for the function: ''read.csv'' changes the default of ''sep'' to ",", and ''read.csv2'' changes ''sep'' to ";" and ''dec'' to ",". Both pass any further arguments on to ''read.table'', but they also change other defaults (''header = TRUE'' and ''fill = TRUE''), so I would recommend always using ''read.table'' and setting the arguments explicitly.

The ''write.table'' function uses the following basic syntax: ''write.table(x, file, ..., sep, dec, row.names, na)'', where ''x'' is the object to be written; ''file'', ''sep'' and ''dec'' are the same as for ''read.table''; ''row.names'' specifies whether to write row names into the file, default is ''TRUE''; ''na'' is the string to be used in missing cells.
 
Example:
<nowiki>test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)
test</nowiki>
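
If you do not have the example file at hand, the round trip can be tried with a temporary file. This is only a sketch: the table contents and the names tab, path and back are made up for illustration.

```r
# A small example table (made-up values).
tab <- data.frame(
  Group  = c("ip", "po", "ip"),
  Weight = c(20.5, 31.25, 18)
)

# Write it in the European csv style: ";" between cells, "," as decimal mark.
path <- tempfile(fileext = ".csv")   # temporary file, path chosen by R
write.table(tab, path, sep = ";", dec = ",", row.names = FALSE)

# Read it back with matching arguments.
back <- read.table(path, sep = ";", dec = ",", header = TRUE)
back
```

Setting ''row.names = FALSE'' keeps the file free of an extra unnamed first column.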
 
The output is in the ''data.frame'' format. We can select row(s) and column(s) by subscripting. The ''data.frame'' is the most flexible format when it comes to data exploration and subscripting. Since ''data.frames'' are essentially ''lists'', we can use <nowiki>list1[[x]]</nowiki> to select column x, where x is a number (a longer numeric vector indexes recursively into nested lists; experiment if you're interested). If the ''list'' is named (''data.frames'' always are), we can also select its elements with <nowiki>list1[[x]]</nowiki>, where x is a character vector of length one, or with list1$col1, where col1 is simply the name of the ''list'' element (a column in a ''data.frame''). After selecting an object from a ''list'' we can subscript it again, e.g. ''<nowiki>list1[[1]][1]</nowiki>'' returns the first element of the first object stored in list1, and ''list1$col1[1]'' works similarly; <nowiki>list1[[1]][[1]]</nowiki>... can be used for nested lists. A data.frame is special in that it can also be subscripted as a two-dimensional array: ''df1[x,y]'' returns the values on row(s) x of column(s) y. Both x and y can be vectors of any length and of any basic data type (''numeric'', ''character'' or ''logical''). Either x or y can be left blank to select everything along that dimension: a vertical slice (rows left blank) is an atomic vector if only one column was selected (the same as selecting the object from the list), whereas a horizontal slice (columns left blank) is a ''data.frame''. Arrays with more dimensions can be subscripted in a similar fashion, e.g. arr1[x,y,z,...], where x, y, z and so on can be vectors of any basic data type.
 
To make better use of subscripting you should learn about logical operators. ''<'', ''>'', ''<='', ''>='', ''=='', and ''!='' are pretty self-explanatory. More advanced ones include and (''&'') and or (''|''). ''%in%'' is also pretty useful. To invert a result use !(statement) or (statement)==FALSE. The comparison operators are vectorized: vectors of equal length are compared element by element, and a shorter vector (typically a single value) is recycled to the length of the longer one. ''grepl'' can be used to test strings against regular expressions (''grep'' returns the positions of the matches instead). All logical operators return a ''logical'' vector.
 
Example:
<nowiki>test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)
test[test$Suklaa0>=5,]
test[test$Ryhmä=="ip", ]</nowiki>
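
The logical operators can also be tried without the example file. The sketch below uses small made-up vectors (v, w and s are invented names) and shows what each operator returns.

```r
v <- c(3, 8, 5, 1)
w <- c(3, 2, 5, 9)

eq   <- v == w                 # element by element: TRUE FALSE TRUE FALSE
big  <- v >= 5                 # compared with a single value: FALSE TRUE TRUE FALSE
both <- v >= 2 & v < 8         # and: TRUE FALSE TRUE FALSE
isin <- v %in% c(1, 8)         # membership: FALSE TRUE FALSE TRUE
neg  <- !(v >= 5)              # inversion: TRUE FALSE FALSE TRUE

# grepl() returns a logical vector, grep() returns the matching positions.
s <- c("apple", "banana", "cherry")
has_an <- grepl("an", s)       # FALSE TRUE FALSE
pos_an <- grep("an", s)        # 2

# A logical vector can be used directly for selecting elements.
v[big]                         # 8 5
```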
 
R has got very powerful data manipulation facilities. Data in ''data.frames'' usually consist of a few ''factors'' and ''numeric'' vectors. The factors are usually indices to the data, e.g. in population data there could be indices for Age, Year and Place. Unique cells in the data would be identified by a unique combination of these factors. This format can be used very similarly to an ''array''. We could sum or take a mean over the levels of specified factors using ''tapply'' (a variant of ''apply'' that groups a vector by one or more factors). ''xtabs'' creates a contingency table from a ''data.frame'' with cross-classifying factors; this is similar to an ''array'' with some extras. ''table'' is similar to ''xtabs'' but simpler and only does counts. ''reshape'' is a function that transforms a ''data.frame'' with a single ''numeric'' vector into a ''data.frame'' with multiple ''numeric'' vectors specified by one or more indexing ''factors'', and vice versa. ''merge'' is a function that merges two ''data.frames'' by finding matching (indexing) vectors; any extra vectors in either data are carried over to the resulting ''data.frame''. To simplify variable selection, components of ''data.frames'' can be ''attached'' to the general namespace, i.e. ''attach(data)'' would enable calling the component vectors of ''data'' directly: e.g. ''vec1'' instead of ''data[,"vec1"]''.
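
A minimal sketch of ''tapply'', ''xtabs'' and ''merge'' on made-up population data; the tables pop and area and their columns are invented for this example.

```r
pop <- data.frame(
  Age   = factor(c("young", "old", "young", "old")),
  Place = factor(c("A", "A", "B", "B")),
  N     = c(10, 20, 30, 40)
)

# Sum of N over the levels of Place.
by_place <- tapply(pop$N, pop$Place, sum)      # A = 30, B = 70

# Contingency table of N cross-classified by Age and Place.
tab <- xtabs(N ~ Age + Place, data = pop)

# Merge with another table on the shared column Place.
area   <- data.frame(Place = c("A", "B"), Area = c(5, 8))
merged <- merge(pop, area)                     # the Area column is carried over
merged
```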
 
There are many packages for database connections. One for use with the [[Opasnet Base]] is the [[OpasnetUtils]], which uses the RODBC package for the actual connection.
 
A comprehensive guide to importing and exporting data in R can be found on http://cran.r-project.org/doc/manuals/R-data.html.
 
'''Useful data packages
* fmi: direct access to Finnish Meteorological Institute database: all monitoring station data, modelled weather. Note! You need a personal key from FMI to use it.
* rOpenGov: direct access to many sources, e.g. eduskunta.
 
'''Opening sources from Opasnet and own computer
 
<rcode embed=1>
library(OpasnetUtils)
 
# Read a csv file from your own computer
d <- read.csv("C:/Users/Neukkari/AppData/Local/Temp/ArkS280.csv", sep = ";", dec = ",")
 
# Write a data.frame to a csv file on your own computer
write.table(d, "C:/Users/Neukkari/Documents/koe.csv", sep = ",", dec = ".") # d is the data.frame read above
 
# Download a csv file from Opasnet wiki
d <- opasnet.csv("/f/fc/ArkS280.csv", wiki = "opasnet_en")
 
# Download a table from Opasnet Base
e <- opbase.data("Op_en5103")
rob <- opbase.data("Op_fi5339", subset = "Robottiautot ja matkojenyhdistely")
sarc <- opbase.data("op_en2721", subset = "Koe")
 
# Useful package to access data from Finnish Meteorological Institute FMI
library(fmi)
</rcode>
 
== Chapter 3: Graphs ==
 
Examples of making graphs with ggplot2 package.
 
<rcode embed=1 showcode=1 graphics=1>
library(OpasnetUtils)
library(ggplot2) # graphical package
library(reshape2) # package for melt function
 
# Create a data.frame with 3 columns of data
a <- data.frame(
  A = 1:100,
  B = rnorm(100),
  C = rnorm(100, 2, 1),
  D = rnorm(100, -1, 3)
)
 
# Melt the data columns into one and create an explanatory column for source.
a <- melt(a, id.vars = "A", variable.name = "Source")
 
# Draw a line graph. You can also use geom_point() for point graphs and a lot of other alternatives.
ggplot(a, aes(x = A, y = value, colour = Source)) + geom_line()+
  theme_gray(base_size = 24)+
  labs(x = "Number",
      y = "Result",
      title = "Great graph"
  )
 
</rcode>
 
== Chapter 4 ==
 
'''Getting help
 
* Just google "What-I-want-to-do in R". Most likely you'll find an answer to your question in top 5 search results.
** You can also find all R Help pages in the Internet: http://stat.ethz.ch/R-manual/R-devel/library/base/html/print.html
* Post a message on Avary's wall: http://www.facebook.com/groups/151630078282972 and someone will answer you.
* Print object contents, function codes or help pages with R-tools.
* If you want a new package to be installed in R-tools, contact [[User:Ehac|Einari]] or [[User:Jouni|Jouni]].
 
<rcode showcode="1">
a <- 1:10
print(a)
print
?print
</rcode>
 
'''Plotting fancy plots
 
<rcode showcode="1" graphics="1">
n <- 1000
population <- data.frame(
Sex = factor(rep(c("Male", "Female"), each = n)), # a factor, so that plot() draws a box plot per sex
Height = c(
rnorm(n, 178, 18),
rnorm(n, 168, 15)
)
)
head(population)
plot(population$Sex, population$Height)
tapply(population$Height, population$Sex, mean)
tapply(population$Height, population$Sex, sd)
 
</rcode>
 
'''More fancy plots: [[:op_fi:Radonin terveysvaikutukset|Radonin terveysvaikutukset]]
 
'''Fancy map plots:
 
<rcode
showcode="1"
include="page:A_Tutorial_on_R|name:eumap"
name='gmapspsqltest2'
variables="name:fi|default:25|description:What is the value for Finland?" >
 
library(xtable)
data <- data.frame(
    Country = c("AT", "BE", "BG", "CH", "CY", "CZ", "DE", "DK", "EE", "ES", "FI",
"FR", "GR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL", "NO", "PL", "PT", "RO", "SE", "SI", "SK", "UK"),
    Result = c(42, 78, 33, 57, 82, 66, 40, 65, 93, 50, 37, 74, 93, 26, 27, 15, 83, 36, 34, 89, 45, 96, 23, 39, 40, 22, 58, 20, 10)
)
data[data$Country == "FI", "Result"] <- fi
 
cat("Results by country (using country codes).\n")
print(xtable(data), type = 'html')
 
eumap(data$Result)
</rcode>
 
<rcode name="eumap" label="Function eumap">
library(rgdal)
library(maptools)
library(RColorBrewer)
library(classInt)
library(OpasnetUtils)
 
shp<-readOGR('PG:host=localhost user=postgres dbname=spatial_db','watson_wkt')
 
###################### eumap plots values of EU29 countries on map. Parameters:
######### data: must be a vector with length 29 (country values in this order:
## "AT", "BE", "BG", "CH", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR", "GR", "HU",
## "IE", "IT", "LT", "LU", "LV", "MT", "NL", "NO", "PL", "PT", "RO", "SE", "SI", "SK", "UK"
######### nclr: number of colours to be used. Default: 24
 
eumap <- function(data = 1:29, nclr = 24) {
 
shp@data$value_inhalation <- data
plotvar        <- shp@data$value_inhalation
 
nclr            <- 24
plotclr        <- brewer.pal(nclr,"BuPu")
class          <- classIntervals(plotvar,nclr,style="quantile")
colcode        <- findColours(class,plotclr)
epsg4326String  <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
proj4string(shp)<- ("+init=epsg:3035")
shp2            <- spTransform(shp,epsg4326String)
 
out<-sapply(
slot(shp2,"polygons"),
function(x){
kmlPolygon(
x,
name=as(shp2,"data.frame")[slot(x,"ID"),"country_code"],
col=colcode[[((as.numeric(slot(x,"ID"))+1))]],
lwd=1,
border='black',
description=paste("Value:",as(shp2,"data.frame")[slot(x,"ID"),"value_inhalation"])
)
}
)
 
data<-paste(
paste(
kmlPolygon(
kmlname="This will be layer name",
kmldescription="<i>More info about layer here</i>"
)$header,
collapse="\n"
),
paste(unlist(out["style",]), collapse="\n"),
paste(unlist(out["content",]), collapse="\n"),
paste(kmlPolygon()$footer, collapse="\n"),
sep=''
)
 
google.show_kml_data_on_maps(data)
}
 
</rcode>
 
 
* The magic
* Packages
* Fancy plots
 
R is completely modular, i.e. all functions in R come in packages (libraries). The basic installation of R comes with some 10 packages, which define most of R's basic functionalities. Installing new packages is easy through a top bar menu in the R GUI. Alternatively, if you know what you're doing, you can use the ''install.packages'' function directly. Only some basic packages are loaded into memory during R startup, though those settings can be altered. Specific packages can be loaded using ''library(packagename)''.
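
A small sketch of working with packages from the console. ggplot2 is used here only as an example of a package name; nothing is installed by this code.

```r
# Packages attached in the current session.
search()

# Load a package that ships with the basic installation.
library(stats)

# Check whether a package is installed before trying to load it.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (!has_ggplot2) {
  message("ggplot2 is not installed; install.packages(\"ggplot2\") would fetch it from CRAN.")
}
```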
 
A useful package for plotting gorgeous graphs is ggplot2. Information on it can be found [http://had.co.nz/ggplot2/ here]. Help about graphics:
* [http://gettinggeneticsdone.blogspot.com/2009/07/ggplot2-more-wicked-cool-plots-in-r.html ggplot2: more wicked-cool plots in R]
* [http://www.statmethods.net/advgraphs/axes.html Quick-R plot settings]
* [https://wiki.nbic.nl/index.php/R_ggplot2_tutorial ggplot2 tutorial in BioAssist]
* [http://had.co.nz/ggplot2/book/qplot.pdf Getting started with qplot] (from Hadley Wickham's book on ggplot2)
 
 
* Modeling
* Probability distributions
* VOI analysis on R
* BRUGS (Open BUGS on R)
* RJAGS (Just another Gibbs sampler)
* ff (on disk objects)
* ...
 
== Cool tricks with R ==
 
=== Multiplying data and adding depth to a time dimension (e.g. minutes) using string selections and regular expressions (completely vectorized) ===
 
Time format "1.1.2011 00:00:00", the length of this string varies.
 
<rcode>
# First we'll create some random data
data <- data.frame(Time=paste("1.1.2011 ", c(paste(0, 0:9, sep = ""), as.character(10:23)), ":00:00", sep = ""), Conc = rnorm(24,10,2))
data
 
# Then multiply the number of rows by 60
data <- data[rep(1:nrow(data), each = 60),] #select from data 60 of each row, overwrite data
rownames(data) <- 1:nrow(data) #make rownames sensible
data
 
# Change minutes to "00" to "59", currently repeated "00"
temp <- as.character(data[,"Time"])
svec <- regexpr(":00:", temp) # gives the positions of ":00:" in the strings, output is a vector so the strings may be of different length e.g. "11.1.2011 00:00:00" vs "1.1.2011 00:00:00"
substring(temp,svec+1,svec+2) <- c(paste(0, 0:9, sep = ""), as.character(10:59)) # left side selects parts of the strings (vectorized) based on the svec --
# and the right side substitutes a vector of strings to replace the selection
data[,"Time"] <- temp
data
</rcode>
 
=== Applying a function in a data.frame with multiple rows with values ===
 
Example data from [[#Chapter 2 - Getting your data into R|Chapter 2]].
 
<nowiki>test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)
 
# The basic case where we want to take mean of a set of observations, indexed by something
tapply(test[,"Paino2"], test[,c("Annos", "Ryhmä")], mean) # mean of Paino2 by Annos and Ryhmä
 
# Tricky version where we take means of several sets of observations indexed by some indices
testf <- function(X, INDEX2, FUN2) tapply(X, INDEX2, FUN2) # define a custom function that does the above to a set of observations X
lapply(test[,7:15], testf, INDEX2 = test[,c("Annos", "Ryhmä")], FUN2 = mean) # apply our test function to all selected columns
 
# Could also be done with a loop
output <- list() # define a variable so it can be used inside the loop
for (i in 7:15) { # loop for i so that it takes the values given in a vector, 7:15 in this case
  output[[length(output)+1]] <- tapply(test[,i], test[,c("Annos", "Ryhmä")], mean) # apply the above to column defined by i and put into the output list
}
names(output) <- colnames(test)[7:15] # give names to the list objects from the column names of the original table
output # this is identical to the output from the tricky version. Both methods actually use loops so the main difference here is the syntax.</nowiki>
 
== HELP ==
 
*How to merge two very big datasets together in R? Pauliina is going to work it out here:
[[Temperature and population in Europe]]. The code is very heavy; feel free to rewrite it.
 
*If I want to attach a new column with row numbers, how can I do it (what is the code for that)?
**Would it not suffice to use the rownames attribute? E.g. rownames(test) <- 1:nrow(test)
**Yeah, I managed to do that with this code:
<pre>rownames(data) = c(1:nrow(data))#names rows by numbers of the row
rownames(data) = rownames(data, prefix = "ID")#name "ID" for the new column</pre>
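
If an actual column is wanted instead of row names, one possible way is to assign the row numbers directly. This is a minimal sketch with a made-up data frame; ID is just an example column name.

```r
data <- data.frame(Conc = c(9.8, 10.4, 11.1))   # made-up example data

# New column holding the row numbers.
data$ID <- seq_len(nrow(data))
data
```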
 
 
== See Also ==
 
*[[List of R functions]]
*''help.start()''
*[http://www.montefiore.ulg.ac.be/~kvansteen/GBIO0009-1/ac20092010/Class8/Using%20R%20for%20linear%20regression.pdf Regression Analysis with R]

Latest revision as of 09:06, 19 December 2016

Introduction

Welcome to the R tutorial. The tutorial is meant for people who already know something about programming in general.

Aim of the tutorial

Starting from some fundamentals, learn R on a generic level. R has got so many different packages for many different approaches that it would be very difficult to cover all of them comprehensively. So I want to give you a nice start into learning R for whatever purpose you may need it for.

Slicing different parts of objects

Run this code on your own computer to see different methods to slice objects.

dat <- data.frame( # Define a data frame with five rows and three columns.
	A = 1:5, 
	B = c("c", "b", "d", "e", "c"),
	Result = c(6, 45, 2, 4.5, 2)
)

dat		# The data frame dat.
dat$A  		# The vector that forms column A.
dat[["A"]]	# The same as previous.
dat[[1]]	# The vector that forms the first column (the same as previous).
dat["A"] 	# The column with name A (this is a data frame with one column).
dat[1] 		# The first column  (with name A; the same as previous).
dat[c(1,3)]	# Data frame with the first and third columns.
dat[c("A", "Result")] 	The same as previous.
dat[2:4, 1:2] 			# Data frame with rows 2 to 4 and columns 1 to 2.
dat[2:4, "Result"] 		# The vector that is formed from rows 2 to 4 of column Result.
dat$Result[2:4] 		# The same as previous.
dat[2:4, ]["Result"] 	# Data frame that first takes rows 2 to 4 and then column Result.
dat$B		# Factor that forms column B. Note that factors are also vectors.
levels(dat$B)	# Levels (i.e. possible values of the factor) of the previous factor.
as.numeric(dat$B)	# The numeric position values of the previous factor.
levels(dat$B)[dat$B]	# The previous factor converted into a character vector.
c("b", "c", "d", "e")[c(2, 1, 3, 4, 2)]	# The same as previous

library(OpasnetUtils) # This package is needed to operate with ovariables.

odat <- Ovariable("odat", data = dat) # Ovariable that has dat as data.
odat@data 	# Data slot of odat.
odat 		# All slots of odat.
odat <- EvalOutput(odat) # Evaluate odat (i.e., calculate the output)
odat@output	# Output slot of odat.
result(odat)	# The result column of the output slot of odat.
summary(odat) # Summary of odat. If odat is probabilistic, summary includes mean and other statistics.
odat@marginal # Which columns in output are marginals?
colnames(odat@output) 	# Names of columns of the output of odat
colnames(odat@output)[odat@marginal]	# Names of marginal columns.
odat@output[odat@output$B == "c" , ]	# Data frame from odat output of rows where column B has value c.
result(odat)[odat@output$B == "c"]	# Vector from odat result of rows rows where column B has value c.

Chapter 1 - the Basics of the R Language

  • Basic syntax
  • The creatures of R: vectors, lists, factors, data.frames, arrays
  • Data types

R is an object oriented programming language.

The R console is a command line interpreter, typing out a command e.g. 2+2 returns 4. Mathematical operators (+,-,*,/,^) are used in their mathematical sense and the order of execution is the mathematical one. The operators can also be used in function form, e.g. '+'(1,1). More operators here. Brackets () can be used to control which expressions are evaluated first, e.g. (1+1)/2.

Variables and functions are saved in memory using either an arrow (<- or ->) or =, e.g. var1 <- 1. It is preferable to use the arrow operator.

Functions are used as function(parameter1, parameter2, parameter6 = value), e.g. mean(rnorm(10, sd = 2)). Checking syntax and other details for a function is easy by using ?function, e.g. ?mean. Lists of functions can be found using library(help="package"), where package is the container/library of functions of interest. E.g. library(help="base") or library(help="stats"). A semicolon (;) can be used to separate statements on the same line. E.g. a <- 2;a.

The basic structure most objects in R is a vector. Vectors can be atomic (contain only values), or recursive (vector of objects, basically). A vector in R is basically an ordered set of values or objects. Selecting (subscripting) the n:th element is done by vector[n] or vector[[n]]. The double bracket is used when subscripting from recursive vectors (lists and its subtypes) to extract the stored objects themselves rather than a recursive vector of the selected elements.

There are 3 basic data types: text (class = character), numbers (class = numeric) and logical. Data types in R consist of these basic data types and their more elaborate derivatives. E.g. a factor is a character vector (a vector that consists of textual elements) stored as a numeric vector, where each number represents a unique element of the character vector. The unique elements are stored as a levels attribute of the factor object. Integers are a special case of the numeric class, they are handled as numeric except in storage; normally all numbers are numeric, a special case is the 'a:b' operator which produces a vector of integers from a to b. An atomic vector can only contain values of a single data type.

- Hide code

a <- 1:6
a
b <- c("Q", "W", "E", "R", "T", "Y")
b
d <- data.frame(Row = a, Letter = b)
d
b <- c(b, c("W", "W", "T"))
b
class(b)
b <- as.factor(b)
b
class(b)
b <- as.numeric(b)
b
class(b)
d[ , "Letter"]
d$Letter
d["Letter"]
d[d$Row == 2, ]
d[d$Row %in% 2:4, ]

print(d)
library(xtable)
print(xtable(d), type = 'html')
library(OpasnetUtils)
oprint(d)

Attributes are used to simulate more complex data structures. An atomic vector can be given a dimensions attribute dim (which is a numeric vector containing the lengths of the dimensions), to turn it into an array (a matrix is an array with length(dim) = 2). Dimensions can be given names in the dimnames attribute which is a list of named character vectors. Because arrays and matrices are atomic vectors by nature they can only contain values of one data type.

a <- 1:27;dim(a) <- c(3,3,3);dimnames(a) <- list(dim1=1:3, dim2=1:3, dim3=1:3);a
b <- array(1:27, dim = c(3,3,3), dimnames = list(dim1=1:3, dim2=1:3, dim3=1:3));b

Recursive vectors, i.e. lists and data.frames, can hold values of different data types since they consist of different objects. A data.frame is a list whose elements are atomic vectors of equal length (in R versions before 4.0.0 character vectors were converted to factors by default; since 4.0.0 they are kept as character unless stringsAsFactors = TRUE is given). The data.frame is perhaps the most common object type in R. It resembles the basic rectangular table format.

a <- 1:9; b <- factor(rep(c("a","b","c"),3)); d <- factor(rep(c("q","w","e"), each = 3))
df <- data.frame(a,b,d);df
df2 <- data.frame(a = 18:10, b = factor(rep(c("a","b","c"),3)), d = factor(rep(c("q","w","e"), each = 3)));df2
l <- list(df, df2, k = 1:2);l
l[[1]]
l[[3]]
l[3]
l[[1]][5,1]
l[[1]][5,]
l[[1]][,1]

In general, arrays are produced and used when data is summarized (summed or averaged over some margins). For computational purposes the data.frame is superior.
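
As a rough illustration of how a summary turns a data.frame into an array and back, here is a small sketch that reuses the shape of the example data above. The names arr, by_b and long are made up for this example.

```r
df <- data.frame(
  a = 1:9,
  b = factor(rep(c("a", "b", "c"), 3)),
  d = factor(rep(c("q", "w", "e"), each = 3))
)

# Sum column a over every combination of b and d: the result is a matrix.
arr <- tapply(df$a, df[c("b", "d")], sum)
arr

# Summarize further over one margin of the array (keep b, collapse d).
by_b <- apply(arr, 1, sum)
by_b

# Convert the array back into a long data.frame with one row per cell.
long <- as.data.frame(as.table(arr), responseName = "a")
long
```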

More info on objects: http://cran.r-project.org/doc/manuals/R-lang.html#Objects

Classes in R refer to either the data type of a simple atomic vector or the object type of a more complex object. Functions in R may have different methods for handling inputs of different classes, and this may sometimes confuse newcomers; e.g. some functions treat a factor as its underlying integer codes rather than as its character labels. Many functions try to coerce their input into a format they can operate on by using functions like as.character and as.numeric.

class(1);class(1:4);class("a");class(TRUE);class(as.factor(c("a","a","b")))
1:4
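
The factor confusion mentioned above is easiest to see with a factor whose labels look like numbers. A minimal sketch (the object names are made up):

```r
f <- factor(c("10", "20", "10", "5"))

codes  <- as.numeric(f)                # the internal level codes, not the labels
values <- as.numeric(as.character(f))  # the labels converted to numbers: 10 20 10 5
codes
values

# The same generic function picks a different method depending on the class.
summary(f)       # counts per level
summary(values)  # minimum, quartiles, mean and maximum
```

Converting through as.character first is the usual way to get the labels back as numbers.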

More info on classes and other attributes: http://cran.r-project.org/doc/manuals/R-lang.html#Attributes

Many R functions are vectorized, meaning that a function can take one or more vectors as input and produce a vector as output. E.g. 1:5 + 10:6 produces 11 11 11 11 11. If the vectors are of different lengths, the shorter one is recycled to match the length of the longest vector. E.g. 1:2 + rep(4,5) produces 5 6 5 6 5 with a warning calling attention to the arguments' different lengths. Vectorized operations are usually much faster than the equivalent for, while or other loops and should be used in their place whenever possible.
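
A small sketch of the same idea, comparing a vectorized expression with the equivalent explicit loop. The timings depend on the machine, so no numbers are claimed here.

```r
x <- 1:5
y <- 10:6
x + y   # 11 11 11 11 11
x * 2   # the single value 2 is recycled: 2 4 6 8 10

# The same addition written as an explicit loop.
z <- numeric(length(x))
for (i in seq_along(x)) z[i] <- x[i] + y[i]
identical(z, as.numeric(x + y))  # TRUE

# Rough timing comparison on a longer vector.
big <- runif(1e5)
system.time(r1 <- sqrt(big))
system.time({
  r2 <- numeric(length(big))
  for (i in seq_along(big)) r2[i] <- sqrt(big[i])
})
```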

Chapter 2 - Getting your data into R

library(OpasnetUtils)

e <- opbase.data("Op_en5103")
oprint(e)
e <- tidy(e)
oprint(e)

To upload data, you can use Table2Base tables on a wiki page:

A Tutorial on R (kton CO2e)

Obs  Row  Sector                         Year  Amount  Per.person
1    1    Consumers' use of electricity  2010  126.96  1.31
2    2    Electric heating               2010   32.65  0.34
3    3    District heating               2010  321.18  3.31
4    4    Separate heating               2010   41.10  0.42
5    5    Traffic                        2010  164.47  1.70
6    6    Agriculture                    2010   24.36  0.25
7    7    Waste management               2010   20.55  0.21
8    8    Total                          2010  731.27  7.54

Or you can use the upload functionality to upload an Excel or CSV file:

Upload data


  • Importing and exporting data from/to files
  • Working with data: subscription, merge, apply, reshape, conversion between data.frame and array
  • OpasnetUtils?

Example data: File:ArkS280.csv. Be careful when converting your Excel sheets to .csv. Microsoft Office is an idiotic piece of software and it sometimes writes empty cells in places where you have once edited something. To fix this, open the exported .csv in the OpenOffice equivalent of Excel (Calc) and overwrite the previous file.

The easiest way to get data in and out of R is through delimited text files (.txt or .csv). The read.table function reads a file specified by a path to a local file or by a URL. Its basic syntax is read.table(file, header, sep, quote, dec, fill, strip.white, ...); there are more arguments, but not all of them are as relevant (check ?read.table for yourself). The arguments are:

  • file: a character string specifying the file by path or URL, e.g. "M:/test.txt". Use forward slashes in paths; the backslash is the escape character in R strings, so a literal backslash has to be written as "\\".
  • header: TRUE or FALSE depending on whether the first line of the file is a header. The default is FALSE, unless the first line has one field fewer than the rest of the file.
  • sep: the cell separator. The default is "" (an empty string), which means any white space (spaces, tabs or newlines); use "\t" for tab-delimited files. csv files written on a European locale usually use ";", while the global standard is ",".
  • quote: the quote characters used in the file, default is "\"'" (an escaped " followed by ').
  • dec: the decimal separator, default is ".".
  • fill: determines whether uneven rows are filled with extra empty cells. The default is FALSE, hence by default an error is produced when the file has uneven rows.
  • strip.white: removes leading and trailing white space from unquoted character fields; it is only used when sep has been specified. The default is FALSE.

There is a csv wrapper (read.csv) for the function, which changes the defaults to sep = ",", header = TRUE and fill = TRUE, and a second one (read.csv2), which uses sep = ";" and dec = ",". The wrappers pass any other arguments on to read.table, but because they silently change several defaults I would recommend always using read.table and setting the arguments yourself.

The write.table function uses the following basic syntax: write.table(x, file, ..., sep, dec, row.names, na), where x is the object to be written; file, sep and dec have the same meaning as for read.table (although the default sep for write.table is a single space); row.names specifies whether to write row names into the file, default is TRUE; na is the string to be used for missing values, default is "NA".

Example:

test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)
test
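
If you do not have the example file at hand, the following self-contained sketch shows a full round trip with the same European-style settings. The data and the names dat, path and back are invented for illustration, and the file is written to a temporary location.

```r
dat <- data.frame(
  id     = 1:3,
  group  = c("a", "b", "a"),
  weight = c(1.5, 2.25, 3),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")  # temporary file, removed at the end

# Write with ";" as the cell separator and "," as the decimal separator.
write.table(dat, file = path, sep = ";", dec = ",", row.names = FALSE)
readLines(path)                     # look at the raw text that was written

# Read it back using the same settings.
back <- read.table(path, header = TRUE, sep = ";", dec = ",",
                   stringsAsFactors = FALSE)
str(back)

unlink(path)
```

Setting row.names = FALSE avoids an extra unnamed first column in the file.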

The output is in the data.frame format. We can select row(s) and column(s) by subscripting. The data.frame is the most flexible format when it comes to data exploration and subscripting. Since data.frames are essentially lists, we can use list1[[x]] to select column x, where x is a number (x can also be a vector longer than one, in which case it indexes recursively into nested lists; experiment with nested lists if you're interested). If the list is named (data.frames always are) we can also select its elements by using list1[[x]], where x is a character vector of length one, or list1$col1, where col1 is simply the name of the list element (a column in a data.frame). After selecting an object from a list we can subscript it again, e.g. list1[[1]][1] returns the first element of the first object stored in list1, and list1$col1[1] works similarly; list1[[1]][[1]]... can be used for nested lists.

A data.frame is special in that it can also be subscripted like a two-dimensional array: df1[x, y] returns the rows x of the columns y. Both x and y can be vectors of any length and of any basic data type (numeric positions, character names or logical). Either x or y can be left blank to select everything along that dimension: if a vertical slice is extracted (df1[, y]), the result is an atomic vector when only one column was selected (the same as selecting the object from the list); if a horizontal slice is extracted (df1[x, ]), the result is a data.frame. Arrays with more dimensions can be subscripted in similar fashion, e.g. arr1[x, y, z, ...], where x, y, z and so on can be vectors of any basic data type.
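
The different ways of selecting from a data.frame can be tried on a small invented table (df1, nested and the column names are made up for this sketch):

```r
df1 <- data.frame(
  id  = 1:4,
  grp = c("a", "b", "a", "b"),
  val = c(2.5, 3, 4.5, 6),
  stringsAsFactors = FALSE
)

df1[["val"]]               # list-style: the column val as a vector
df1$val[2]                 # second element of column val
df1[2, "val"]              # array-style: row 2 of column val, the same value
df1[2:3, c("id", "val")]   # rows 2 to 3 of two columns: a data.frame
df1[, "val"]               # vertical slice of one column: an atomic vector
df1[2, ]                   # horizontal slice: a one-row data.frame
df1[, "val", drop = FALSE] # drop = FALSE keeps a single column as a data.frame

# Recursive indexing into a nested list with a vector inside [[ ]].
nested <- list(inner = list(x = 1:3))
nested[[c("inner", "x")]]  # the same as nested[["inner"]][["x"]]
```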

To make better use of subscripting you should learn about logical operators. <, >, <=, >=, == and != are pretty self-explanatory. More advanced ones include and (&) and or (|). %in% is also pretty useful: it tests whether each element on the left is found among the values on the right. To obtain an inversion use !(statement) or (statement) == FALSE. These operators are vectorized: a vector on the left side can be compared to a single value, or element by element to another vector of the same length (a shorter vector is recycled). grep can be used to find the elements that match a regular expression; it returns the indices of the matches, while grepl returns a logical vector. All the comparison and logical operators return a logical vector.

Example:

test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)
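test[test$Suklaa0 >= 5, ] # rows where Suklaa0 is at least 5
test[test$Ryhmä == "ip", ] # rows where Ryhmä equals "ip" (note the comma: rows come before it)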
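
The operators can also be tried without the example file. A minimal sketch on invented vectors (x, y and grp are made up):

```r
x <- c(3, 8, 5, 1)
y <- c(3, 2, 9, 1)

x >= 5            # compare to a single value: FALSE TRUE TRUE FALSE
x == y            # element by element: TRUE FALSE FALSE TRUE
x > 2 & y > 2     # and: TRUE FALSE TRUE FALSE
!(x > 2)          # inversion: FALSE FALSE FALSE TRUE
x %in% c(1, 5)    # membership: FALSE FALSE TRUE TRUE
which(x >= 5)     # positions of the TRUE values: 2 3

grp <- c("ip", "po", "ip2", "ctrl")
grepl("^ip", grp) # logical vector: TRUE FALSE TRUE FALSE
grep("^ip", grp)  # indices of the matches: 1 3

x[x >= 5 & y < 5] # a logical vector used for subscripting: 8
```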

R has got very powerful data manipulation facilities. Data in data.frames usually consist of a few factors and numeric vectors. The factors are usually indices to the data, e.g. in population data there could be indices for Age, Year and Place. Unique cells in the data would be identified by a unique combination of these factors. This format can be used very similarly to an array.

  • tapply applies a function (e.g. sum or mean) to a vector separately for each combination of the levels of the specified factors.
  • xtabs creates a contingency table from a data.frame with cross-classifying factors; the result is similar to an array with some extras.
  • table is similar to xtabs but simpler and only does counts.
  • reshape transforms a data.frame with a single numeric vector (long format) into a data.frame with multiple numeric vectors specified by one or more indexing factors (wide format), and vice versa.
  • merge merges two data.frames by finding matching (indexing) vectors; any extra vectors in either data.frame are carried over to the result.

To simplify variable selection, the components of a data.frame can be attached to the search path, i.e. attach(data) enables calling the component vectors of data directly: e.g. vec1 instead of data[, "vec1"].
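
The functions named above can be tried on a small invented population table. All the data, place names and object names below are made up for this sketch.

```r
pop <- data.frame(
  Age   = rep(c("young", "old"), times = 4),
  Year  = rep(c(2010, 2011), each = 2, times = 2),
  Place = rep(c("Kuopio", "Helsinki"), each = 4),
  N     = c(10, 20, 11, 21, 30, 40, 31, 41),
  stringsAsFactors = FALSE
)

# Sum N separately for each Place.
by_place <- tapply(pop$N, pop$Place, sum)
by_place

# Cross-tabulate the sum of N by Age and Year.
tab <- xtabs(N ~ Age + Year, data = pop)
tab

# Count the rows per Place.
table(pop$Place)

# Long to wide: one N column per Year.
wide <- reshape(pop, idvar = c("Age", "Place"), timevar = "Year",
                direction = "wide")
wide

# Merge in extra information by the shared column Place.
places <- data.frame(Place = c("Kuopio", "Helsinki"),
                     Region = c("East", "South"),
                     stringsAsFactors = FALSE)
merged <- merge(pop, places, by = "Place")
head(merged)
```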

There are many packages for database connections. One for use with the Opasnet Base is the OpasnetUtils, which uses the RODBC package for the actual connection.

A comprehensive guide to importing and exporting data in R can be found on http://cran.r-project.org/doc/manuals/R-data.html.

Useful data packages

  • fmi: direct access to Finnish Meteorological Institute database: all monitoring station data, modelled weather. Note! You need a personal key from FMI to use it.
  • rOpenGov: direct access to many sources, e.g. eduskunta.

Opening sources from Opasnet and own computer

Chapter 3: Graphs

Examples of making graphs with ggplot2 package.

library(OpasnetUtils)
library(ggplot2) # graphical package
library(reshape2) # package for melt function

# Create a data.frame with 3 columns of data
a <- data.frame(
  A = 1:100, 
  B = rnorm(100), 
  C = rnorm(100, 2, 1), 
  D = rnorm(100, -1, 3)
)

# Melt the data columns into one and create an explanatory column for source.
a <- melt(a, id.vars = "A", variable.name = "Source")

# Draw a line graph. You can also use geom_point() for point graphs and a lot of other alternatives.
ggplot(a, aes(x = A, y = value, colour = Source)) + geom_line()+
  theme_gray(base_size = 24)+
  labs(x = "Number",
       y = "Result",
       title = "Great graph"
  )

Chapter 4

Getting help

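a <- 1:10
print(a) # call the function print
print # typing a function name without brackets shows its definition
?print # open the help page of the function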
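
A few more base R helpers are useful when you do not remember the exact name or arguments of a function. A short sketch (the object name hits is made up):

```r
args(read.table)             # the argument list of a function with its defaults
hits <- apropos("^read\\.")  # names of objects on the search path matching a regular expression
hits
methods("print")             # the methods of the generic function print for different classes
help("read.table")           # the same as ?read.table
```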

Plotting fancy plots

n <- 1000
population <- data.frame(
	Sex = factor(rep(c("Male", "Female"), each = n)), # a factor, so that plot() draws one box per group
	Height = c(
		rnorm(n, 178, 18), 
		rnorm(n, 168, 15)
	)
)
head(population)
plot(population$Sex, population$Height)
tapply(population$Height, population$Sex, mean)
tapply(population$Height, population$Sex, sd)

More fancy plots: Radonin terveysvaikutukset

Fancy map plots:

What is the value for Finland?:

library(xtable)
data <- data.frame( 
     Country = c("AT", "BE", "BG", "CH", "CY", "CZ", "DE", "DK", "EE", "ES", "FI",
	 "FR", "GR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL", "NO", "PL", "PT", "RO", "SE", "SI", "SK", "UK"), 
     Result = c(42, 78, 33, 57, 82, 66, 40, 65, 93, 50, 37, 74, 93, 26, 27, 15, 83, 36, 34, 89, 45, 96, 23, 39, 40, 22, 58, 20, 10)
)
data[data$Country == "FI", "Result"] <- fi # fi must hold the value given for Finland (see the question above)

cat("Results by country (using country codes).\n")
print(xtable(data), type = 'html')

eumap(data$Result)


  • The magic
  • Packages
  • Fancy plots

R is completely modular, i.e. all functions in R come in packages (libraries). The basic installation of R comes with some 10 packages, which define most of R's basic functionality. Installing new packages is easy through the top bar menu in the R GUI. Alternatively, if you know what you're doing, you can use the install.packages function directly. Only some of the basic packages are loaded into memory during R startup, though those settings can be altered. Other packages can be loaded using library(packagename).
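
Installing and loading can be combined in a small helper. The function load_or_install below is hypothetical (it is not part of R or of any package); it is only a sketch built from requireNamespace, install.packages and library.

```r
# Hypothetical helper: load a package, installing it first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs internet access and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

load_or_install("stats")  # stats ships with R, so nothing is installed here
search()                  # the packages currently attached
```

For example, load_or_install("ggplot2") would install ggplot2 on the first call and only load it on later calls.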

A useful package for plotting gorgeous graphs is ggplot2. Information on it can be found here. Help about graphics:


  • Modeling
  • Probability distributions
  • VOI analysis on R
  • BRugs (OpenBUGS on R)
  • rjags (Just Another Gibbs Sampler)
  • ff (on disk objects)
  • ...

Cool tricks with R

Multiplying data and adding depth to a time dimension (e.g. minutes) using string selections and regular expressions (completely vectorized)

Time format "1.1.2011 00:00:00", the length of this string varies.
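
A minimal sketch of the idea, assuming hourly observations that should be expanded to one row per minute. The data and the names hourly, minutely, Time and Value are invented for illustration; only string functions and recycling are used, no loops.

```r
hourly <- data.frame(
  Time  = c("1.1.2011 00:00:00", "15.12.2011 7:00:00"),
  Value = c(2.4, 3.1),
  stringsAsFactors = FALSE
)

# Multiply the data: repeat every row 60 times, once per minute.
minutely <- hourly[rep(seq_len(nrow(hourly)), each = 60), ]
minute   <- rep(0:59, times = nrow(hourly))

# Keep everything before the first colon (date and hour), whatever its length.
prefix <- sub("^([^:]*):.*$", "\\1", minutely$Time)

# Rebuild the time stamp with the minute filled in.
minutely$Time <- paste0(prefix, sprintf(":%02d:00", minute))
rownames(minutely) <- NULL

head(minutely, 3)
tail(minutely, 1)
```

Because the regular expression anchors on the first colon instead of a fixed character position, the varying length of the date part does not matter.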

Applying a function to several value columns of a data.frame

Example data from Chapter 2.

test <- read.table("M:/R koulutus/arkS280.csv", sep = ";", dec = ",", header = TRUE)

# The basic case where we want to take mean of a set of observations, indexed by something
tapply(test[,"Paino2"], test[,c("Annos", "Ryhmä")], mean) # mean of Paino2 by Annos and Ryhmä

# Tricky version where we take means of several sets of observations indexed by some indices
testf <- function(X, INDEX2, FUN2) tapply(X, INDEX2, FUN2) # define a custom function that does the above to a set of observations X
lapply(test[,7:15], testf, INDEX2 = test[,c("Annos", "Ryhmä")], FUN2 = mean) # apply our test function to all selected columns

# Could also be done with a loop
output <- list() # define a variable so it can be used inside the loop
for (i in 7:15) { # loop for i so that it takes the values given in a vector, 7:15 in this case
  output[[length(output)+1]] <- tapply(test[,i], test[,c("Annos", "Ryhmä")], mean) # apply the above to column defined by i and put into the output list
}
names(output) <- colnames(test)[7:15] # give names to the list objects from the column names of the original table
output # this is identical to the output from the tricky version. Both methods actually use loops so the main difference here is the syntax.

HELP

  • How to merge two veeery big datasets together in R? Pauliina is going to work it out here:

Temperature and population in Europe. The code is very heavy, so feel free to rewrite it.

  • If I want to attach a new column with row numbers, how can I do it (what is the code for that)?
    • Would it not suffice to use the rownames attribute? E.g. rownames(test) <- 1:nrow(test)
    • Yeah, I managed to do that with this code:
rownames(data) = c(1:nrow(data))#names rows by numbers of the row
rownames(data) = rownames(data, prefix = "ID")#name "ID" for the new column
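
If an actual column is wanted rather than row names, one possible sketch is the following. The data.frame and the column names ID and Label are invented for illustration.

```r
data <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)

data$ID    <- seq_len(nrow(data))               # new column with the row numbers 1, 2, 3
data$Label <- paste0("ID", seq_len(nrow(data))) # the same numbers with the prefix "ID"
data
```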


See Also