Opasnet Base Connection for R: Difference between revisions

From Opasnet
Jump to navigation Jump to search
 
(98 intermediate revisions by 3 users not shown)
Line 4: Line 4:
[[Category:SQL tool]]
[[Category:SQL tool]]
[[Category:Open assessment]]
[[Category:Open assessment]]
[[Heande:Opasnet Base Connection for R]]
[[Category:Code under inspection]]
{{tool|moderator=Teemu R}}
{{tool|moderator=Teemu R}}
Code for R for the purpose of interacting with the [[Opasnet Base]] is collected on this page. To use it, copy paste the code to R; this defines the functions, after which they can be called in R. I've tried to explain  the code in comments (#). This is a work in progress, if you have comments or new ideas post them using the comment templates on a new line after the one you want to comment, I'm no expert in programming so input is appreciated.  
{{comment|# |Should we merge this page with [[OpasnetBaseUtils]]?|--[[User:Jouni|Jouni]] 20:18, 28 December 2011 (EET)}}
 
Code for [[R]] for the purpose of interacting with the [[Opasnet Base]] is collected on this page. To use it, copy paste the code you need to the R console; this defines the functions, after which they can be called for in that R session. Or alternatively install the [[OpasnetBaseUtils]] package.  


==Package dependencies==
==Package dependencies==


These packages are required for most of the code to work. Load these first.
These packages are required for most of the code to work. To install: from the top bar menu Packages>Install. To load: copy-paste.
 
library(RODBC)
 
==Downloading data==
 
'''op_baseGetData


# utils
<rcode name="op_baseGetData">
# RODBC
op_baseGetData <- function(dsn, ident, include = NULL, exclude = NULL, series_id = NULL, iterations = NULL, use.utf8 = TRUE, apply.utf8 = TRUE) {
if (use.utf8) db <- odbcConnect(dsn, DBMSencoding = "UTF-8") else db <- odbcConnect(dsn)
obj_id <- sqlQuery(db, paste('SELECT id FROM obj WHERE ident = "', ident, '"', sep=''))[1,1]
if (length(series_id) == 0) {series_id <- sqlQuery(db, paste('SELECT series_id FROM actobj WHERE obj_id = ', obj_id,
' ORDER BY series_id DESC LIMIT 1', sep = ''))[1,1]}
sliced <- FALSE
locations <- NULL
x <- 1
basequery <- paste('SELECT loccell.cell_id FROM actobj LEFT JOIN cell ON actobj.id = cell.actobj_id LEFT JOIN',
' loccell ON cell.id = loccell.cell_id WHERE actobj.obj_id = ', obj_id, ' AND actobj.series_id = ',
series_id, ' AND loccell.loc_id IN(', sep = '')
if (length(include) != 0) {
sliced <- TRUE
locations[x] <- paste("IN(", basequery, paste(include, collapse = ","), ")", sep = "")
x <- x + 1
}
if (length(exclude) != 0) {
sliced <- TRUE
locations[x] <- paste("NOT IN(", basequery, paste(exclude, collapse = ","), ")", sep = "")
}
if (sliced == FALSE) {
Data <- sqlQuery(db, paste('SELECT cell.id, res.obs, obj.ident AS ind, loc.location AS loc, res.result,',
' res.restext FROM actobj LEFT JOIN cell ON actobj.id = cell.actobj_id LEFT JOIN res ON cell.id =',
' res.cell_id LEFT JOIN loccell ON cell.id = loccell.cell_id LEFT JOIN loc ON loccell.loc_id',
' = loc.id LEFT JOIN obj ON loc.obj_id_i = obj.id WHERE actobj.obj_id = ', obj_id,
' AND actobj.series_id = ', series_id, if(length(iterations)==1){paste(" AND obs <= ", iterations,
sep = "")}, sep = '')) } else {
Data <- sqlQuery(db, paste('SELECT cell.id, res.obs, obj.ident AS ind, loc.location AS loc, res.result,',
' res.restext FROM actobj LEFT JOIN cell ON actobj.id = cell.actobj_id LEFT JOIN res ON cell.id =',
' res.cell_id LEFT JOIN loccell ON cell.id = loccell.cell_id LEFT JOIN loc ON loccell.loc_id',
' = loc.id LEFT JOIN obj ON loc.obj_id_i = obj.id WHERE actobj.obj_id = ', obj_id,
' AND actobj.series_id = ', series_id, if(length(iterations)==1){paste(" AND obs <= ", iterations,
sep = "")}, ' AND (cell.id ', paste(locations, collapse = ') AND cell.id '),
'))', sep = ''))
}
odbcClose(db)
Data <- Data[order(Data[,1], Data[,2], Data[,3]),]
nind <- length(levels(Data[,3]))
nres <- nrow(Data)/nind
dataframe <- Data[1:nres*nind, c(1,2)]
for (i in 1:nind) {
dataframe[,2 + i] <- factor(Data[1:nres*nind - (nind - i), 4])
levels(dataframe[,2 + i]) <- gsub(" *$", "",gsub("^ *", "", levels(dataframe[,2 + i])))
colnames(dataframe)[2 + i] <- as.character(Data[i, 3])
if(apply.utf8) Encoding(levels(dataframe[,2 + i])) <- "UTF-8"
}
dataframe[,1:2 + 2 + nind] <- Data[1:nres*nind, 5:6]
colnames(dataframe)[1:2 + 2 + nind] <- c("Result", "Result.Text")
if(apply.utf8) {if(is.factor(dataframe[,"Result.Text"])) {Encoding(levels(dataframe[,"Result.Text"])) <- "UTF-8"} else if(is.character(dataframe[,"Result.Text"])) {
Encoding(dataframe[,"Result.Text"]) <- "UTF-8"}}
rownames(dataframe) <- 1:nres
return(dataframe)
}
</rcode>


==Setup==
===Usage===


Establishes the connection to the database. Copy paste first.
variable <- op_baseGetData("opasnet_base", "page identifier", include = vector_of_loc_ids, exclude = vector_of_loc_ids)


db <- odbcConnect("opasnet_base")  
*Assuming "opasnet_base" is a correctly defined DSN (Data Service Name; in Windows XP: Control Panel\Administrative tools\Data Sources (ODBC)).
*Include and exclude are optional.
**Include picks all cells in the locations given.
***The clearest case is when all the included locations belong to the same index: Any cells in the non-included locations of the index will be left out.
***In case given locations are in multiple indices: The effect produced will be the same as picking separately for each index and removing duplicates.
**Exclude unpicks any cells which are indexed by the locations given. Slower than include.
**They can be used in unison.
*Result will be in a table format with columns: id, obs, ind1, ind2 ... indn, Result.
*series_id is an optional parameter, if it is not given the most current upload of the data will be downloaded.


*Assuming "opasnet_base" is a correctly defined DSN.
===Finding index data===
**Note that uploading requires writing permissions.


==Downloading data==
'''op_baseGetLocs


*Tested and mostly works, in case of error with GetArray try GetDataFrame, it should work with data missing indexes and locations.  
<rcode name="op_baseGetLocs">
op_baseGetLocs <- function(dsn, ident, series_id = NULL, use.utf8 = TRUE, apply.utf8 = TRUE) {
if (use.utf8) db <- odbcConnect(dsn, DBMSencoding = "UTF-8") else db <- odbcConnect(dsn)
obj_id <- sqlQuery(db, paste('SELECT id FROM obj WHERE ident = "', ident, '"', sep=''))[1,1]
if (length(series_id) == 0) {series_id <- sqlQuery(db, paste('SELECT series_id FROM actobj WHERE obj_id = ', obj_id,
' ORDER BY series_id DESC LIMIT 1', sep = ''))[1,1]}
Locs <- sqlQuery(db, paste("SELECT DISTINCT obj.ident AS ind, loc.location AS loc, loc.id AS loc_id",
" FROM actobj LEFT JOIN actloc ON actobj.id = actloc.actobj_id LEFT JOIN loc ON actloc.loc_id = loc.id",
" LEFT JOIN obj ON loc.obj_id_i = obj.id WHERE actobj.obj_id = ", obj_id, " AND actobj.series_id = ",
series_id, sep = ""))
odbcClose(db)
Locs <- Locs[order(Locs$ind, Locs$loc_id),]
rownames(Locs) <- 1:nrow(Locs)
if(apply.utf8) {Encoding(levels(Locs$ind)) <- "UTF-8"; Encoding(levels(Locs$loc)) <- "UTF-8"}
return(Locs)
}
</rcode>


===Low level functions (mainly sql queries)===
*Returns all indexes and locations and their ids in a table of format: ind, loc, loc_id.


Copy paste these second.
==Manipulating data==
#Get latest series id
op_baseGetLatest <- function(ident) {
series_list <- sqlQuery(db, paste('
SELECT actobj.series_id
FROM obj
LEFT JOIN actobj
ON obj.id = actobj.obj_id
<nowiki>WHERE obj.ident = "', ident, '"', sep = ''))</nowiki>
series_list[nrow(series_list),1]
}
#Get results (cell id, result, obs)
op_baseGetData <- function(page_ident, series_id) {
sqlQuery(db, paste('
SELECT cell.id, res.result, res.obs
FROM obj
LEFT JOIN actobj
ON obj.id = actobj.obj_id
LEFT JOIN cell
ON actobj.id = cell.actobj_id
LEFT JOIN res
ON cell.id = res.cell_id
WHERE actobj.series_id = ', series_id,
<nowiki>' AND obj.ident = "', page_ident, '"', sep = ''))</nowiki>
}
#Not in use: Get indexes and locations per cell
op_baseGetDims <- function(cell_id_list) {
sqlQuery(db, paste('
SELECT loccell.cell_id, loc.id, loc.location, loc.obj_id_i, obj.name
FROM loccell
LEFT JOIN loc
ON loccell.loc_id = loc.id
LEFT JOIN obj
ON loc.obj_id_i = obj.id
<nowiki>WHERE loccell.cell_id BETWEEN ', cell_id_list[1,1], ' AND ', cell_id_list[nrow(cell_id_list),1], sep = ''))</nowiki>
}
#Experimental: match locations and indexes
#tapply(dims_table[,3], dims_table[,5], paste)
#Not in use: Get indexes from dims_table
op_baseFindInd <- function(dims_table) {
factor(op_baseDims[,4])
}
#Get indexes
op_baseGetInd <- function (cell_id_list) {
indexes <- sqlQuery(db, paste('
SELECT DISTINCT loc.obj_id_i
FROM cell
LEFT JOIN loccell
ON cell.id = loccell.cell_id
LEFT JOIN loc
ON loccell.loc_id = loc.id
<nowiki>WHERE cell.n <> 0 AND</nowiki>
cell.id BETWEEN ', cell_id_list[1,1], ' AND ', cell_id_list[nrow(cell_id_list),1],
<nowiki>' ORDER BY loc.obj_id_i', sep = ''))</nowiki>
as.numeric(indexes[,1])
}
#Get locations
op_baseGetLoc <- function(index, cell_id_list) {
locations <- sqlQuery(db, paste('
SELECT DISTINCT loc.id
FROM loccell
LEFT JOIN loc
ON loccell.loc_id = loc.id
WHERE loc.obj_id_i = ', index, ' AND loccell.cell_id BETWEEN ', cell_id_list[1,1], ' AND ',
<nowiki>cell_id_list[nrow(cell_id_list),1], sep = ''))</nowiki>
as.numeric(locations[,1])
}
#Not in use: Get location names
op_baseGetLocNames <- function(index, cell_id_list) {
locations <- sqlQuery(db, paste('
SELECT DISTINCT loc.location, loc.obj_id_i
FROM loccell
LEFT JOIN loc
ON loccell.loc_id = loc.id
WHERE loc.obj_id_i = ', index, ' AND loccell.cell_id BETWEEN ', cell_id_list[1,1], ' AND ',
<nowiki>cell_id_list[nrow(cell_id_list),1], sep = ''))</nowiki>
as.numeric(locations[,1])
}
#Get location name (simple)
op_baseGetLocName <- function(loc_id) {
sqlQuery(db, paste('
SELECT location
FROM loc
<nowiki>WHERE id = ', loc_id, sep = ''))</nowiki>
}
#Get index name
op_baseGetIndName <- function(ind_id) {
sqlQuery(db, paste('
SELECT name
FROM obj
<nowiki>WHERE id = ', ind_id, sep = '')) </nowiki>
}
#Get locations for a specific cell
op_baseGetLocs <- function(cell_id) {
sqlQuery(db, paste('
SELECT DISTINCT loc.location, loc.id, loc.obj_id_i
FROM loccell
LEFT JOIN loc
ON loccell.loc_id = loc.id
WHERE loccell.cell_id = ', cell_id,
<nowiki>' ORDER BY loc.obj_id_i', sep = ''))</nowiki>
}
#Get cell.n (cell id, cell.n)
op_baseGetCelln <- function(cell_id_list) {
sqlQuery(db, paste('
SELECT cell.id, cell.n
FROM cell
WHERE cell.id
<nowiki>BETWEEN ', cell_id_list[1,1], ' AND ', cell_id_list[nrow(cell_id_list),1], sep = ''))</nowiki>
}
#Get results from probabilistic data (cell id, result, obs)
op_baseGetDataProb <- function(page_ident, series_id, Iterations) {
sqlQuery(db, paste('
SELECT cell.id, res.result, res.obs
FROM obj
LEFT JOIN actobj
ON obj.id = actobj.obj_id
LEFT JOIN cell
ON actobj.id = cell.actobj_id
LEFT JOIN res
ON cell.id = res.cell_id
WHERE actobj.series_id = ', series_id,
<nowiki>' AND obj.ident = "', page_ident, '"', </nowiki>
<nowiki>' AND res.obs > 0 AND res.obs <= ', Iterations, sep = ''))</nowiki>
}
#Prompt user for amount of iterations to download
PromptIterations <- function(op_baseCelln) {
as.numeric(readline(paste("There are ", max(op_baseCelln[,2]), " iterations of this data.",
" How many iterations would you like to download?\n")))
}


===High Level function===
'''DataframeToArray


Copy paste this last.
<rcode name="DataframeToArray">
DataframeToArray <- function(dataframe, rescol = NULL) {
ColNames <- colnames(dataframe[(colnames(dataframe) == "id") == FALSE])
if (length(ColNames[(ColNames == "obs")])>0) {if(length(levels(factor(dataframe[,"obs"]))) == 1) {ColNames <- ColNames[
(ColNames == "obs") == FALSE]} else {dataframe[,"obs"] <- factor(as.character(dataframe[,"obs"]))}}
if (length(rescol)==0) {
rescol <- colnames(dataframe) == "Freq"
if (length(rescol[rescol==TRUE]) >= 1) rescol <- "Freq" else {
rescol <- colnames(dataframe) == "Result"
if (length(rescol[rescol==TRUE]) >= 1) rescol <- "Result" else {
rescol <- colnames(dataframe) == "result"
if (length(rescol[rescol==TRUE]) >= 1) rescol <- "result" else stop("No result column found")
}
}
} else {ColNames <- ColNames[(ColNames == rescol) == FALSE]}
ColNames <- ColNames[(ColNames == "Result") == FALSE]
ColNames <- ColNames[(ColNames == "result") == FALSE]
ColNames <- ColNames[(ColNames == "Freq") == FALSE]
nind <- length(ColNames)
DimNames <- rep(vector("list", 1), nind)
names(DimNames) <- ColNames
indlengths <- 0
for (i in 1:nind) {
DimNames[[i]] <- levels(factor(dataframe[,ColNames[i]]))
indlengths[i] <- length(DimNames[[i]])
}
array <- rep(NA, prod(indlengths))
dim(array) <- indlengths
dimnames(array) <- DimNames
array[as.matrix(dataframe[,ColNames])] <- dataframe[,rescol]
return(array)
}
</rcode>


op_baseGetArray <- function(page_ident, Iterations) {
===Usage===
#Basic data downloading
op_baseLatest <- op_baseGetLatest(page_ident)
op_baseData <- op_baseGetData(page_ident, op_baseLatest) #cell,result, obs
#Differentiate between probabilistic and non-probabilistic
probabilistic <- FALSE #variable indicator of whether the object is probabilistic
for (i in 1:length(op_baseData)) {
if (op_baseData[i,3] > 0) probabilistic <- TRUE
}
#final_array <- op_baseData[,2] #ideal case (data in proper order, achievable by tweaking upload order)
#op_baseDims <- op_baseGetDims(op_baseData) #with further understanding of R number of queries could be reduced
#op_baseIndexes <- op_baseFindInd(op_baseDims) #improved code here
op_baseInd <- op_baseGetInd(op_baseData)
if (probabilistic == TRUE) op_baseCelln <- op_baseGetCelln(op_baseData)
#Array structure building
IndLengths <- 1 #blank vector for index lengths
IndList <- rep(vector("list", 1), length(op_baseInd)) #blank list of appropriate length for indexes and
#locations for naming the dimensions of the array
IndIdList <- rep(vector("list", 1), length(op_baseInd)) #blank list of appropriate length for ids of indexes
IndNames <- c(1) #blank vector for index names
for (i in 1:length(op_baseInd)) {
<nowiki>IndIdList[[i]] <- op_baseGetLoc(op_baseInd[i], op_baseData) #fetch location ids in index to a list, </nowiki>
  #assumed every cell has locations for all unique index id values in loc.obj_id_i in the range of cells
<nowiki>IndNames[i] <- paste(op_baseGetIndName(op_baseInd[i])[1,1]) #fetch index name to a vector</nowiki>
<nowiki>IndLengths[i] <- length(IndIdList[[i]]) #backup: op_baseGetLoc(op_baseInd[i], op_baseData))</nowiki>
for (j in 1:IndLengths[i]) {
<nowiki>IndList[[i]][j] <- paste(op_baseGetLocName(IndIdList[[i]][j])[1,1])</nowiki>
}
}
if (probabilistic == TRUE) { #Probabilistic exceptions (add index: Iteration)
Iterations <- if (interactive()) PromptIterations(op_baseCelln) else
if (Iterations < 1 && Iterations > max(op_baseCelln[,2])) max(op_baseCelln[,2])
#if not an interactive session, returns max
while (Iterations < 1 && Iterations > max(op_baseCelln[,2])) Iterations <- PromptIterations(op_baseCelln)
IndLengths[length(IndLengths) + 1] <- Iterations
<nowiki>IndList[[length(IndList) + 1]] <- 1:Iterations</nowiki>
IndNames[length(IndNames) + 1] <- "Iteration"
op_baseData <- op_baseGetDataProb(page_ident, op_baseLatest, Iterations) #cell,result, obs
}
final_array <- rep(NA, prod(IndLengths))
dim(final_array) <- IndLengths
names(IndList) <- IndNames
names(IndLengths) <- op_baseInd
dimnames(final_array) <- IndList
#Cell population
IndexNumber <- 1:length(op_baseInd)
names(IndexNumber) <- op_baseInd
LocList <- c(1) #blank vector for locations per cell
LocMap <- c(1) #blank vector for location to position mapping
LocMapNames <- c(1)
CellPosition <- rep(1, length(IndLengths)) #blank vector for cell position
dim(CellPosition) <- c(1, length(IndLengths)) #form into matrix of appropriate form
for (i in 1:length(op_baseData[,2])) { #for each result
LocList <- op_baseGetLocs(op_baseData[i,1]) #get list of locations
for (j in 1:length(LocList[,1])) { #for each location per cell
LocMap <- 1:IndLengths[paste(LocList[j,3])] #get available positions in given index
names(LocMap) <- dimnames(final_array)[[paste(names(IndList)[IndexNumber[paste(LocList[j,3])]])]]
#add names (locations) of positions in given index
CellPosition[1,IndexNumber[paste(LocList[j,3])]] <- LocMap[paste(LocList[j,1])]
#match cell specific locations to positions in index
}
if (probabilistic == TRUE) CellPosition[1,length(CellPosition)] <- op_baseData[i,3]
final_array[CellPosition] <- op_baseData[i,2] #insert result in array in position determined by CellPosition
}
final_array #print/return the final array
}
ArraytoDataFrame <- function(array) {
as.data.frame(as.table(array))
}
op_baseGetDataFrame <- function(page_ident) {
#Basic data downloading
op_baseLatest <- op_baseGetLatest(page_ident)
op_baseData <- op_baseGetData(page_ident, op_baseLatest) #cell,result, obs
#Differentiate between probabilistic and non-probabilistic
probabilistic <- FALSE #variable indicator of whether the object is probabilistic
for (i in 1:length(op_baseData)) {
if (op_baseData[i,3] > 0) probabilistic <- TRUE
}
if (probabilistic == TRUE) op_baseData <- op_baseGetDataProb(page_ident, op_baseLatest) #cell,result, obs
op_baseInd <- op_baseGetInd(op_baseData)
#Table building
for (i in 1:length(op_baseData[,2])) { #for each result
LocList <- paste(op_baseGetLocs(op_baseData[i,1])[,1]) #get list of locations
if (length(LocList) != 0) {
for (j in 1:length(LocList)) { #for each location per cell
op_baseData[i,(3+j)] <- LocList[j] #put in place in table
}
}
}
final_table <- data.frame(1:length(op_baseData[,1]))
#final_table[,1] <- op_baseData[,1]
x <- 0
if (probabilistic == TRUE) {
x <- 1
<nowiki>dimnames(final_table)[[2]][1] <- "Iteration"</nowiki>
final_table[,1] <- op_baseData[,3]
}
if (is.na(op_baseInd[1]) == TRUE) y <- 0 else y <- length(op_baseInd)
if (y > 0) {
for (i in 1:y) {
final_table[,x + i] <- op_baseData[,3+i]
}
}
final_table[,(x + y + 1)] <- op_baseData[,2]
if (y > 0) {
for (i in 1:y) {
<nowiki>dimnames(final_table)[[2]][x+i] <- paste(op_baseGetIndName(op_baseInd[i])[1,1])</nowiki>
}
}
<nowiki>dimnames(final_table)[[2]][(x + y + 1)] <- "Result"</nowiki>
final_table
}


====Usage====
variable2 <- DataframeToArray(variable1, rescol = NULL)


variable <- op_baseGetArray("page identifier")
*variable1 must be in similar format as the result when downloading.
variable <- ArraytoDataFrame(array name)
*Columns named "id" and various versions of "Result" are ignored for dimension creation.
variable <- op_baseGetDataFrame("page identifier")
**"obs" column will also be ignored if there's only one.
*The column containing the values may be defined in the parameters, otherwise it is assumed to be either "Freq", "Result" or "result" in that order.


==Uploading data==
==Uploading data==


Tested and works. Cell writing is efficient with even with huge data, index writing requires more work. Probabilistic data not supported.  
'''op_baseWrite
 
<rcode name="op_baseWrite">
op_baseWrite <- function(dsn, input, ident = NULL, name = NULL, unit = NULL, objtype_id = NULL, who = NULL, acttype = NULL,
rescol = NULL, n.obs.const = FALSE, maxrows = 50000, use.utf8 = TRUE, use.utf8.read = TRUE, latin1.2.utf8.conv.write = TRUE, utf8.2.latin1.conv.read = TRUE) {
# Coerce input into a data frame if it isn't one already; get rid of empty cells
if (is.array(input)) dataframe <- as.data.frame(as.table(input)) else dataframe <- input
if (is.null(rescol)) {
rescol <- colnames(dataframe) == "Freq"
if (sum(rescol) == 1) rescol <- "Freq" else {
rescol <- colnames(dataframe) == "Result"
if (sum(rescol) == 1) rescol <- "Result" else {
rescol <- colnames(dataframe) == "result"
if (sum(rescol) == 1) rescol <- "result"
}
}}
dataframe <- dataframe[is.na(dataframe[,rescol]) == FALSE,]
ColNames <- colnames(dataframe)[!(colnames(dataframe)%in%c(rescol, "id", "obs"))]
for (i in ColNames) {
dataframe[,i] <- factor(dataframe[,i])
levels(dataframe[,i]) <- gsub(" *$", "",gsub("^ *", "", levels(dataframe[,i])))
if(latin1.2.utf8.conv.write) if(sum(Encoding(levels(dataframe[,i]))=="latin1")!=0) levels(dataframe[,i]) <- iconv(levels(dataframe[,i]), "latin1", "UTF-8")
}
#if(!is.numeric(dataframe[,rescol]))
# Open database connection
if(use.utf8) db <- odbcConnect(dsn, DBMSencoding = "UTF-8") else db <- odbcConnect(dsn)
if(!use.utf8.read) db2 <- odbcConnect(dsn)
# Add page to database (if it doesn't already exist)
if (is.null(ident)) if (interactive()) ident <- readline(paste("What is the identifier of this object?",
"\n", sep = "")) else stop("indentifier of object no given")
obj_id <- sqlQuery(db, paste('SELECT id FROM obj WHERE ident = "', ident, '"', sep = ''))[1,1]
if (is.na(obj_id)) {
# Wiki id
if (substr(ident, 1,5)=="Op_en") {wiki_id <- 1; page <- substr(ident, 6, nchar(ident))} else {
if (substr(ident, 1,5)=="Op_fi") {wiki_id <- 2; page <- substr(ident, 6, nchar(ident))} else {
if (substr(ident, 1,6)=="Heande") {wiki_id <- 6; page <- substr(ident, 7, nchar(ident))} else {
if (substr(ident, 1,4)=="Erac") {wiki_id <- 6; page <- substr(ident, 5, nchar(ident))} else {
wiki_id <- 0; page <- 0; warning("No wiki id found in ident, writing zero.")}}}}
page <- as.numeric(page)
if (is.na(page)) stop("could not convert characters following the wiki ident into a page number")
# Name etc.
if (is.null(name)) if (interactive()) name <- readline(paste("What is the name of this object?",
"\n", sep = "")) else stop("object name not given")
if (is.null(objtype_id)) if (interactive()) objtype_id <- readline(paste("What type of object is",
" this (id)?", paste(paste(sqlQuery(db, "SELECT id FROM objtype")[,1], sqlQuery(db, paste("SELECT objtype",
" FROM objtype", sep = ""))[,1], sep = " - "), collapse = ", "), "\n", collapse = " ")) else {
stop("object type not given")}
sqlQuery(db, paste('INSERT INTO obj (ident, name, objtype_id, page, wiki_id) VALUES ("', paste(ident,
name, sep = '","'), '",', paste(objtype_id, page, wiki_id, sep = ','), ')', sep = ''))
obj_id <- sqlQuery(db, paste('SELECT id FROM obj WHERE ident = "', ident, '"', sep = ''))[1,1]
}
# Write act and actobj
if (is.null(who)==TRUE) if (interactive()) {who <- readline(paste("What is the name of the uploader?", "\n", sep = ""))
} else stop("uploader name not given")
series_id <- sqlQuery(db, paste("SELECT series_id FROM actobj WHERE obj_id = ", obj_id, " ORDER BY series_id DESC LIMIT 1",
sep = ""))[1,1]
if (is.na(series_id)==FALSE) {if (is.null(acttype)==TRUE) {if (interactive()) {acttype <- readline(paste("What type of upload",
" is this? 4 - new data to replace any existing, 5 - new data to be appended to existing data (must have the same",
" indices).", "\n", sep = ""))
} else acttype <- 4
}} else acttype <- 4
if (!(acttype%in%c(4,5))) stop ("proper acttype not given")
sqlQuery(db, paste('INSERT INTO act (acttype_id, who, comments) VALUES (', acttype, ',"', who, '","R upload")', sep = ''))
act_id <- sqlQuery(db, paste('SELECT id FROM act WHERE who = "', who,'" AND comments = "R upload" ORDER BY id DESC LIMIT 1',
sep = ''))[1,1]
if (acttype == 4) series_id <- act_id
if (is.null(unit)) if (interactive()) unit <- readline(paste("What is the unit of this object?",
"\n", sep = "")) else stop("unit not given")
sqlQuery(db, paste('INSERT INTO actobj (act_id, obj_id, series_id, unit) VALUES (', paste(act_id, obj_id, series_id, sep = ','),
',"', unit, '")', sep = ''))
actobj_id <- sqlQuery(db, paste('SELECT id FROM actobj WHERE act_id = ', act_id, sep = ''))[1,1]
#Write indexes
for (i in ColNames) {
sqlQuery(db, paste('INSERT IGNORE INTO obj (ident, name, objtype_id) VALUES ("', gsub(' ', '_', i), '","',
i, '", 6)', sep = ''))
}
IndIds <- sqlQuery((if(use.utf8.read) db else db2), paste('SELECT id, ident FROM obj WHERE ident IN("', paste(gsub(" ", "_", ColNames),
collapse = '","'), '")', sep = ''))
if(utf8.2.latin1.conv.read) levels(IndIds$ident) <- iconv(levels(IndIds$ident), "UTF-8", "latin1")
IndIdMap <- IndIds$id
names(IndIdMap) <- tolower(IndIds$ident)
ColIds <- as.character(IndIdMap[tolower(gsub(" ", "_", ColNames))])
colnames(dataframe)[colnames(dataframe)%in%ColNames] <- ColIds
#Write locations
for (i in ColIds) {
for (j in levels(dataframe[, i])) {
sqlQuery(db, paste('INSERT IGNORE INTO loc (obj_id_i, location) VALUES (', i, ',"', j, '")',
sep = ''))
}
}
LocIds <- sqlQuery((if(use.utf8.read) db else db2), paste('SELECT id, obj_id_i, location FROM loc WHERE obj_id_i IN("', paste(ColIds, collapse = '","'),
'")', sep = ''))
if(utf8.2.latin1.conv.read) levels(LocIds$location) <- iconv(levels(LocIds$location), "UTF-8", "latin1")
for (i in ColIds) {
LocIdMap <- LocIds[LocIds$obj_id_i == i, 1]
names(LocIdMap) <- gsub(" *$", "",gsub("^ *", "", tolower(LocIds[LocIds$obj_id_i == i, 3])))
levels(dataframe[, i]) <- LocIdMap[tolower(levels(dataframe[, i]))]
if (sum(is.na(levels(dataframe[, i]))) != 0) stop("Faulty location matching. Usually caused by special characters.")
#Writing actloc
sqlQuery(db, paste("INSERT INTO actloc (actobj_id, loc_id) VALUES (", paste(actobj_id, levels(dataframe[, i]),
sep = ",", collapse = "),("), ")", sep = ""))
}
#Writing cell
n <- tapply(dataframe[,rescol], dataframe[,ColIds], length)
ncell <- sum(!is.na(n))
if (is.numeric(dataframe[,rescol])) means <- tapply(dataframe[,rescol], dataframe[,ColIds], mean) else means <- rep(0, ncell)
if (is.numeric(dataframe[,rescol])) {
sds <- tapply(dataframe[,rescol], dataframe[,ColIds], sd); sds[] <- ifelse(n == 1, 0, sds)} else sds <- rep(0, ncell)
cellQuery <- paste(actobj_id, means[!is.na(means)], sds[!is.na(sds)], n[!is.na(n)], sep = ",")
i <- 1
while (length(cellQuery) >= (i + maxrows - 1)) {
sqlQuery(db, paste('INSERT INTO cell (actobj_id, mean, sd, n) VALUES (', paste(cellQuery[i:(i + maxrows - 1)],
collapse = '),('), ')', sep = ''))
i <- i + maxrows
}
if (length(cellQuery) %% maxrows != 0) {
sqlQuery(db, paste('INSERT INTO cell (actobj_id, mean, sd, n) VALUES (',
paste(cellQuery[i:length(cellQuery)], collapse = '),('), ')', sep = ''))
}
#Writing res
cell_id <- sqlQuery(db, paste("SELECT id FROM cell WHERE actobj_id = ", actobj_id, " ORDER BY ID", sep = ""))[,1]
if (length(cell_id) != ncell) stop("number of written cells differs from given data")
if (is.numeric(dataframe[,rescol])) ids <- means else ids <- n
ids[!is.na(ids)] <- cell_id
dataframe[, ncol(dataframe) + 1] <- ids[as.matrix(dataframe[,ColIds])]
colnames(dataframe)[ncol(dataframe)] <- "cell_id"
resQuery <- paste(dataframe[,"cell_id"], ',', if(sum(colnames(dataframe) == "obs") == 0) 1 else dataframe[,"obs"], ',',
if (!is.numeric(dataframe[,rescol])) '"', dataframe[,rescol], if (!is.numeric(dataframe[,rescol])) '"', sep = "")
i <- 1
while (length(resQuery) >= (i + maxrows - 1)) {
sqlQuery(db, paste('INSERT INTO res (cell_id, obs, ', ifelse(is.numeric(dataframe[,rescol]), "result", "restext"), ') VALUES (',
paste(resQuery[i:(i + maxrows - 1)], collapse = '),('), ')', sep = ''))
i <- i + maxrows
}
if (length(resQuery) %% maxrows != 0) {
sqlQuery(db, paste('INSERT INTO res (cell_id, obs, ', ifelse(is.numeric(dataframe[,rescol]), "result", "restext"), ') VALUES (',
paste(resQuery[i:length(resQuery)], collapse = '),('), ')', sep = ''))
}
#Writing loccell
ids <- as.data.frame(as.table(ids))
ids <- ids[!is.na(ids$Freq),]
loccellQuery <- paste(ids$Freq, unlist(ids[,-ncol(ids)]), sep = ",")
i <- 1
while (length(loccellQuery) >= (i + maxrows - 1)) {
sqlQuery(db, paste('INSERT INTO loccell (cell_id, loc_id) VALUES (', paste(loccellQuery[i:(i + maxrows - 1)], collapse = '),('), ')',
sep = ''))
i <- i + maxrows
}
if (length(loccellQuery) %% maxrows != 0) {
sqlQuery(db, paste('INSERT INTO loccell (cell_id, loc_id) VALUES (', paste(loccellQuery[i:length(loccellQuery)], collapse = '),('), ')',
sep = ''))
}
#Close database connection
odbcClose(db)
cat("Successful\n")
return(character())
}
</rcode>
 
===Usage===
 
op_baseWrite(dsn, input, ident = NULL, name = NULL, unit = NULL, objtype_id = NULL, who = NULL, acttype = NULL, rescol = NULL)


===Low level===
*dsn and input must be defined, the rest of the object and act parameters if not defined are prompted for by the function as needed.
**For uploading the DSN defined must have writers permissions.
*rescol defines the column from which the values are chosen from, both numerical and textual data are allowed, if left undefined the function will check column matches for "Freq", "Result" and "result" in that order.


library(utils)
====Restrictions====
library(RODBC)
 
*Input may only be given in either array or data.frame form.
dbSimpleRead <- function(db, table, column, filtercol, filter) {
**Indexes used may not exceed the character limit of 20.
sqlQuery(db, paste('SELECT ', column, ' FROM ', table, ' WHERE ', filtercol, ' = "', filter, '"', sep = ''))[1,1]
***Indexes should preferably match an earlier entry: [[Special:OpasnetBaseIndices]].
}
***Indexes are treated as identifiers for indexes in the database, spaces in the indexes are converted to _. This ensures maximum compatibility and ease in operations in which data is downloaded and uploaded again. Names and more specific details can be edited into the indexes separately.
dbLessSimpleRead <- function(db, table, column, filtercol1, filter1, filtercol2, filter2) {
sqlQuery(db, paste('SELECT ', column, ' FROM ', table, ' WHERE ', filtercol1, ' = "', filter1, '" AND ', filtercol2, ' = "', filter2, '"', sep = ''))[1,1]
}
SimplePrompt <- function(question) {
readline(paste(question, "\n", sep = ""))
}


===High level===
== Regarding special characters and character encoding ==


op_baseWriteArray <- function(db, array, ident = NULL, name = NULL, unit = NULL, objtype_id = NULL, who = NULL) {
Using special characters like ä and ö when our database is encoded in latin1 while wiki is in UTF-8 is a bit complicated. New parameters -- for forcing the odbc connection to use UTF-8 and in-R conversion of one encoding to the other prior to writing or after reading -- have been created. The defaults have been made to work with opasnet_base. For use with heande_base, forcing UTF-8 on the odbc connection when reading should be disabled, meaning that ''use.utf8'' should be set to ''FALSE'' when using ''op_baseGetData'' and ''use.utf8.read'' should be set to ''FALSE'' when using ''op_baseWrite''.
originaldataframe <- as.data.frame(as.table(array))
values <- originaldataframe[,"Freq"]
cond <- (is.na(values) == FALSE)
dataframe <- originaldataframe[cond,]
#Add page to database (if it doesn't already exist)
if (is.null(ident)==TRUE) if (interactive()) ident <- SimplePrompt("What is the identifier of this object?")
wiki_name <- substring(ident, 1, 5)
if (wiki_name=="Op_en") {wiki_id <- 1; page <- substring(ident, 6, nchar(ident))} else if (wiki_name=="Op_fi") {wiki_id <- 2; page <- substring(ident, 6, nchar(ident))} else {wiki_id <- 0; page <- 0}
obj_id <- dbSimpleRead(db, "obj", "id", "ident", ident)
if (is.na(obj_id)==TRUE) {
if (is.null(name)==TRUE) if (interactive()) name <- SimplePrompt("What is the name of this object?") else error
if (is.null(unit)==TRUE) if (interactive()) unit <- SimplePrompt("What is the unit of this object?") else error
if (is.null(objtype_id)==TRUE) if (interactive()) objtype_id <- SimplePrompt(paste("What type of object is this (id)?", paste(paste(sqlQuery(db, "SELECT id FROM objtype")[,1], sqlQuery(db, "SELECT objtype FROM objtype")[,1], sep = " - "), collapse = ", "), collapse = " ")) else error
sqlQuery(db, paste('INSERT INTO obj (ident, name, unit, objtype_id, page, wiki_id) VALUES ("', paste(ident, name, unit, sep = '","'), '",', paste(objtype_id, page, wiki_id, sep = ','), ')', sep = ''))
obj_id <- dbSimpleRead(db, "obj", "id", "ident", ident)
}
#Write act and actobj
if (is.null(who)==TRUE) if (interactive()) who <- SimplePrompt("What is the name of the uploader?") else error
sqlQuery(db, paste('INSERT INTO act (acttype_id, who, comments) VALUES (4, "', who, '","R upload")', sep = ''))
act_id <- sqlQuery(db, "SELECT id FROM act ORDER BY id DESC LIMIT 1")[1,1]
sqlQuery(db, paste('INSERT INTO actobj (act_id, obj_id, series_id) VALUES (', paste(act_id, obj_id, act_id, sep = ','), ')', sep = ''))
actobj_id <- dbSimpleRead(db, "actobj", "id", "act_id", act_id)
#Write indexes and locations
x <- 1
DimIds <- list()
IndId <- 0
Ind <- 0
loc <- 0
LocId <- 0
for (i in 1:(ncol(dataframe) - x)) {
DimIds[[i]] <- rep(NA, length(levels(factor(dataframe[,i]))))
IndId <- dbSimpleRead(db, "obj", "id", "ident", dimnames(dataframe)[[2]][i])
if (is.na(IndId)==TRUE) {
sqlQuery(db, paste('INSERT INTO obj (ident, name, objtype_id) VALUES ("', paste(dimnames(dataframe)[[2]][i], dimnames(dataframe)[[2]][i], sep = '","'), '",', 6, ')', sep = ''))
IndId <- dbSimpleRead(db, "obj", "id", "ident", dimnames(dataframe)[[2]][i])
}
for (j in 1:length(levels(factor(dataframe[,i])))) { #1:length(dimnames(array)[[i]])) {
LocId <- dbLessSimpleRead(db, "loc", "id", "location", levels(factor(dataframe[,i]))[j], "obj_id_i", IndId)
if (is.na(LocId)==TRUE) {
sqlQuery(db, paste('INSERT INTO loc (obj_id_i, location) VALUES (', IndId, ',"', levels(factor(dataframe[,i]))[j], '")', sep = ''))
LocId <- dbLessSimpleRead(db, "loc", "id", "location", levels(factor(dataframe[,i]))[j], "obj_id_i", IndId)
}
DimIds[[i]][j] <- LocId
}
names(DimIds)[i] <- IndId
}
# Changing location names in table into ids
for (i in 1:(ncol(dataframe)-1)) {
dataframe[,i] <- factor(dataframe[,i])
levels(dataframe[,i]) <- DimIds[[i]]
}
maxrows <- 5000
#Writing cell
n <- 1
cellQuery <- NA
cellQuery[1:nrow(dataframe)*6-5] <- actobj_id
cellQuery[1:nrow(dataframe)*6-4] <- ","
cellQuery[1:nrow(dataframe)*6-3] <- dataframe[,"Freq"]
cellQuery[1:nrow(dataframe)*6-2] <- ","
cellQuery[1:nrow(dataframe)*6-1] <- n
cellQuery[1:(nrow(dataframe)-1)*6] <- "),("
if (nrow(dataframe)>=maxrows) {
for (i in 1:(nrow(dataframe)%/%maxrows)) {
sqlQuery(db, paste('INSERT INTO cell (actobj_id, mean, n) VALUES (', paste(cellQuery[((i-1)*6*maxrows+1):(i*6*maxrows-1)], collapse = ''), ')', sep = ''))
}
}
sqlQuery(db, paste('INSERT INTO cell (actobj_id, mean, n) VALUES (', paste(cellQuery[(nrow(dataframe)%/%maxrows*6*maxrows+1):length(cellQuery)], collapse = ''), ')', sep = ''))
#Writing res
cell_id <- rev(sqlQuery(db, paste("SELECT id FROM cell WHERE actobj_id = ", actobj_id, " ORDER BY ID DESC LIMIT ", nrow(dataframe), sep = ""))[,1])
resQuery <- NA
resQuery[1:length(cell_id)*4-3] <- cell_id
resQuery[1:length(cell_id)*4-2] <- ","
resQuery[1:length(cell_id)*4-1] <- dataframe[,"Freq"]
resQuery[1:(length(cell_id)-1)*4] <- "),("
if (length(cell_id)>=maxrows) {
for (i in 1:(length(cell_id)%/%maxrows)) {
sqlQuery(db, paste('INSERT INTO res (cell_id, result) VALUES (', paste(resQuery[((i-1)*4*maxrows+1):(i*4*maxrows-1)], collapse = ''), ')', sep = ''))
}
}
sqlQuery(db, paste('INSERT INTO res (cell_id, result) VALUES (', paste(resQuery[(length(cell_id)%/%maxrows*4*maxrows+1):length(resQuery)], collapse = ''), ')', sep = ''))
#Writing loccell
locidmatrix <- rep(NA, (ncol(dataframe)-1)*length(cell_id))
dim(locidmatrix) <- c(ncol(dataframe)-1,length(cell_id))
for (i in 1:(ncol(dataframe)-1)) {
locidmatrix[i,] <- as.character(dataframe[,i])
}
loccellQuery <- rep(NA, 4*(ncol(dataframe)-1)*length(cell_id))
dim(loccellQuery) <- c(4*(ncol(dataframe)-1), length(cell_id))
loccellQuery[1:(ncol(dataframe)-1)*4-3, 1:length(cell_id)] <- rep(cell_id, each = ncol(dataframe) - 1)
loccellQuery[1:(ncol(dataframe)-1)*4-2, 1:length(cell_id)] <- ","
loccellQuery[1:(ncol(dataframe)-1)*4-1, 1:length(cell_id)] <- locidmatrix
loccellQuery[1:(ncol(dataframe)-1)*4, 1:length(cell_id)] <- "),("
loccellQuery[(ncol(dataframe)-1)*4, length(cell_id)] <- ""
if (length(cell_id)>=maxrows) {
for (i in 1:(length(cell_id)%/%maxrows)) {
loccellQuery[(ncol(dataframe)-1)*4, i*maxrows] <- ""
sqlQuery(db, paste('INSERT INTO loccell (cell_id, loc_id) VALUES (', paste(loccellQuery[,((i-1)*maxrows+1):(i*maxrows)], collapse = ''), ')', sep = ''))
}
}
sqlQuery(db, paste('INSERT INTO loccell (cell_id, loc_id) VALUES (', paste(loccellQuery[,(length(cell_id)%/%maxrows*maxrows+1):length(cell_id)], collapse = ''), ')', sep = ''))
}


====Usage====
== See also ==


op_baseWriteArray(db, array, ident = NULL, name = NULL, unit = NULL, objtype_id = NULL, who = NULL)
{{Opasnet Base}}


Db and array must be defined, the rest if not defined is prompted for by the function.
*[[A Tutorial on R]]

Latest revision as of 20:29, 10 April 2015


----#: . Should we merge this page with OpasnetBaseUtils? --Jouni 20:18, 28 December 2011 (EET) (type: truth; paradigms: science: comment)

Code for R for the purpose of interacting with the Opasnet Base is collected on this page. To use it, copy paste the code you need to the R console; this defines the functions, after which they can be called for in that R session. Or alternatively install the OpasnetBaseUtils package.

Package dependencies

These packages are required for most of the code to work. To install: from the top bar menu Packages>Install. To load: copy-paste.

library(RODBC)

Downloading data

op_baseGetData

+ Show code

Usage

variable <- op_baseGetData("opasnet_base", "page identifier", include = vector_of_loc_ids, exclude = vector_of_loc_ids)
  • Assuming "opasnet_base" is a correctly defined DSN (Data Service Name; in Windows XP: Control Panel\Administrative tools\Data Sources (ODBC)).
  • Include and exclude are optional.
    • Include picks all cells in the locations given.
      • The clearest case is when all the included locations belong to the same index: Any cells in the non-included locations of the index will be left out.
      • In case given locations are in multiple indices: The effect produced will be the same as picking separately for each index and removing duplicates.
    • Exclude unpicks any cells which are indexed by the locations given. Slower than include.
    • They can be used in unison.
  • Result will be in a table format with columns: id, obs, ind1, ind2 ... indn, Result.
  • series_id is an optional parameter, if it is not given the most current upload of the data will be downloaded.

Finding index data

op_baseGetLocs

+ Show code

  • Returns all indexes and locations and their ids in a table of format: ind, loc, loc_id.

Manipulating data

DataframeToArray

+ Show code

Usage

variable2 <- DataframeToArray(variable1, rescol = NULL)
  • variable1 must be in similar format as the result when downloading.
  • Columns named "id" and various versions of "Result" are ignored for dimension creation.
    • "obs" column will also be ignored if there's only one.
  • The column containing the values may be defined in the parameters, otherwise it is assumed to be either "Freq", "Result" or "result" in that order.

Uploading data

op_baseWrite

+ Show code

Usage

op_baseWrite(dsn, input, ident = NULL, name = NULL, unit = NULL, objtype_id = NULL, who = NULL, acttype = NULL, rescol = NULL)
  • dsn and input must be defined, the rest of the object and act parameters if not defined are prompted for by the function as needed.
    • For uploading the DSN defined must have writers permissions.
  • rescol defines the column from which the values are chosen from, both numerical and textual data are allowed, if left undefined the function will check column matches for "Freq", "Result" and "result" in that order.

Restrictions

  • Input may only be given in either array or data.frame form.
    • Indexes used may not exceed the character limit of 20.
      • Indexes should preferably match an earlier entry: Special:OpasnetBaseIndices.
      • Indexes are treated as identifiers for indexes in the database, spaces in the indexes are converted to _. This ensures maximum compatibility and ease in operations in which data is downloaded and uploaded again. Names and more specific details can be edited into the indexes separately.

Regarding special characters and character encoding

Using special characters like ä and ö when our database is encoded in latin1 while wiki is in UTF-8 is a bit complicated. New parameters -- for forcing the odbc connection to use UTF-8 and in-R conversion of one encoding to the other prior to writing or after reading -- have been created. The defaults have been made to work with opasnet_base. For use with heande_base, forcing UTF-8 on the odbc connection when reading should be disabled, meaning that use.utf8 should be set to FALSE when using op_baseGetData and use.utf8.read should be set to FALSE when using op_baseWrite.

See also

Pages related to Opasnet Base

Opasnet Base · Uploading to Opasnet Base · Data structures in Opasnet · Opasnet Base UI · Modelling in Opasnet · Special:Opasnet Base Import · Opasnet Base Connection for R (needs updating) · Converting KOPRA data into Opasnet Base · Poll · Working with sensitive data · Saved R objects

Pages related to the 2008-2011 version of Opasnet Base

Opasnet base connection for Analytica · Opasnet base structure · Related Analytica file (old version File:Transferring to result database.ANA) · Analytica Web Player · Removed pages and other links · Standard run · OpasnetBaseUtils