Opasnet base structure: Difference between revisions

Revision as of 06:37, 22 February 2009

[show] This page is a knowledge crystal of subtype variable. The page identifier is Op_en1913
Moderator:Nobody (see all) Click here to sign up.

Upload data {{#opasnet_base_link:Op_en1913}}

Scope

Opasnet base is a storage and retrieval system for variable results and nugget data. What is the structure of Opasnet base such that it enables the following functionalities?

Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.R↻
Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
Description and handling of the dimensions that a variable may take.
It is possible to protect some results and data from reading by unauthorised persons.
If is possible to build user interfaces for easily entering observations into the Base.

Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of that so that the results of variables can be retrieved and new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as

  P(R|x₁,x₂,...)

where P(R) is the probability distribution of the result and x₁ and x₂ are defining locations of a dimension where a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one dimension. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.

Old description of the structure

Dependencies

Result

Opasnet base is a MySQL database located at http://base.opasnet.org.

Table structure

We need Ressec (Result secure) and Resinfosec (Result info secure) tables for secure information. All other tables are openly readable except these two. They have the same structure as Res and Resinfo tables, respectively.

Obj
Describes all objects
FIELD	TYPE	EXTRA
id	int(10)	primary
Ident	varchar(20)	unique
Name	varchar(200)
Unit	varchar(16)
Objtype_id	tinyint(3)
Page	int(10)
Wiki_id	tinyint(3)

Cell (previously Res)
Cells of an object
FIELD	TYPE	EXTRA
id	int(12)	primary
Obj_id_v (variable id)	int(10)
Obj_id_r (run id)	int(10)
Mean (mean of the cell)	float
N (samplesize)	int(10)

Loc
Location information
FIELD	TYPE	EXTRA
id	int(10)	primary
Obj_id_i (index id)	int(10)
Location	varchar(1000)
Roww (row # of index)	Mediumint(8)
Description	varchar(150)

Item
Items of a set
FIELD	TYPE	EXTRA
id	int(10)	primary
Sett_id (set to which the item belongs)	int(10)
Obj_id (item id)	int(10)
Fail (membership not valid?)	tinyint(1)

Loccell (previously Locres)
Locations of a cell
FIELD	TYPE	EXTRA
id	int(10)	primary
Cell_id	int(10)
Loc_id	int(10)

Res (previously Sam)
Result distribution (actual values)
FIELD	TYPE	EXTRA
id	bigint(20)	primary
Cell_id	int(12)
Obs (previously Sample)	int(10)
Result	float

Sett
List of sets
FIELD	TYPE	EXTRA
id	int(10)	primary
Obj_id	int(10)
Settype_id	tinyint(3)

Settype (previously Sty)
Types of set-item memberships
FIELD	TYPE	EXTRA
id	tinyint(3)
Settype (previously Stype)	varchar(30)

Objtype (previously Typ)
Types of objects
FIELD	TYPE	EXTRA
id	tinyint(3)	primary
Objtype (previously Type)	varchar(30)

Wiki (previously Wik)
Wiki information
FIELD	TYPE	EXTRA
id	tinyint(3)	primary
Url	varchar(255)
Wname	varchar(20)

Resinfo (previously Descr)
Additional description of the result
FIELD	TYPE	EXTRA
id	bigint(20)	primary
Restext (previously Description)	varchar(250)
Who	varchar(50)
When	timestamp

Objinfo (previously Inf)
Additional information about the object
FIELD	TYPE	EXTRA
id	int(10)	primary
Begin	date
End	date
Who	varchar(50)
Url	varchar(250)

Opasnet Data is a database that is designed to collect observation data from studies. A study can be a traditional research study, which is documented in Opasnet Data afterwards, or it can be an Opasnet study where the data is collected on a particular page of Opasnet using a web form. There are several purposes:

To collect observation data to be directly usable in interpretations of variables and other objects.
To collectively collect information about specific cases, and based on these data conditionalise a generalised assessment model with data specific to a particular case.

The structure of the Opasnet Data is the same as that of Opasnet Base. Actually, they are physically in the same database. However, there are some things that should be understood:

The object for a collection of observations is called a study, while the object of interpretations is called a variable. As an example, a study can collect information about a population group by a questionnaire and by taking a blood sample. The study identifier is the Obj.id in the Opasnet Base.
The object may be divided into smaller pieces along one or more indices. For example, the questionnaire may have 30 questions, and therefore the questionnaire data can be indexed by an index with 30 columns (or rows, depending on which way you think), one row for each question. Each column of the study object has one cell, i.e. an answer to one question. In this study, ten blood markers will be studied, and therefore the study object will have 40 cells, and the index 40 columns (30 from the questionnaire, 10 from the blood sample). The cell identifier is the Res.id in the Opasnet Base.
For each individual patient, there is one row of observations, each 40 cells. The observation row identifier is Sam.Sample in Opasnet Base.
The actual result of a particular cell of a particular patient is located in Sam.Result in Opasnet Base (or in Descr.Description in the case where the result is text, i.e. non-numeric).
Each study may be multidimensional just like a variable and have indices along e.g. space, time, or sex.
If the data is collected using an Opasnet web form, then the timestamp and username or IP will be recorded for each entry into Descr.When and Descr.Who fields, respectively. This is not needed, if the data comes from a previously performed study (which is static data in the eyes of Opasnet).
In some cases, it might be useful to restrict the number of entries per user to one. However, this is done only at the interpretation phase where only the last entry is counted. There are no restrictions to enter new data, and therefore a user may change his/her previous entry by simply making a new entry.

Making value-of-information analyses in Opasnet base

Value of information (VOI) is a decision analysis tool for estimating the importance of remaining uncertainty for decision-making. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an indicator variable by adjusting a decision variable so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.

Analysing the change in the quality of a variable result in Opasnet base

All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:

The time trend of informativeness and calibration (see performance) can be evaluated for a single variable against the newest information.
Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
Large number of variables can be assessed and e.g. following questions can be asked:
- How much work is needed to make a variable with reasonable performance for practical applications?
- What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?

Some useful syntax

http://www.baycongroup.com/sql_join.htm
Opasnet base connection.ANA for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.
Some historical queries

NOTE! The queries below work in the new database "opasnet_base", not "resultdb" as the old versions.

{{#sql-query: SELECT Var.Ident, Var.Name, Var.Unit, Run.Ident, Begin, Who, Run.Name as Method FROM Obj as Var, Obj as Run, Cell, Objinfo WHERE Var.Ident = "Op_en1913" AND Var.id = Cell.Obj_id_v AND Run.id = Cell.Obj_id_r AND Run.id = Objinfo.id GROUP BY Var.id, Run.id |Runs}}

{{#sql-query: SELECT Var.Ident, Var.Name, Cell.id, N, Begin, Mean, Var.Unit FROM Obj as Var, Obj as Run, Cell, Objinfo WHERE Var.Ident = "Op_en1913" AND Var.id = Cell.Obj_id_v AND Run.id = Cell.Obj_id_r AND Run.id = Objinfo.id GROUP BY Cell.id ORDER BY Run.id DESC, Var.Ident |Means and samplesizes (N)}}

{{#sql-query: SELECT Var.Ident, Cell.id, Cell.Obj_id_r as Run, Obs, Result, Var.Unit FROM Obj as Var, Cell, Res WHERE Var.Ident = "Op_en1913" AND Var.id = Cell.Obj_id_v AND Cell.id = Res.Cell_id ORDER BY Cell.Obj_id_r, Var.Ident, Cell.id |Full sample}}

List all dimensions that have indices, and the indices concatenated:

<sql-query display="1"> SELECT Dim.Ident, Dim.Name, Dim.Unit, Group_concat(Ind.Ident ORDER BY Ind.Name SEPARATOR ', ') as Indices FROM Obj AS Dim, Obj as Ind, Sett, Item WHERE Dim.id = Sett.Obj_id AND Sett.Settype_id=1 AND Sett.id = Item.Sett_id AND Item.Obj_id = Ind.id GROUP BY Dim.Name ORDER BY Dim.id </sql-query>

List all indices, and their locations concatenated:

<sql-query display="1"> SELECT Ident, Name, Unit, GROUP_CONCAT(Location ORDER BY Roww SEPARATOR ', ') AS Locations FROM Obj AS Ind, Loc WHERE Ind.id = Loc.Obj_id_i GROUP BY Name ORDER BY Name </sql-query>

List all variables and their runs, and also list all indices (concatenated) used for each variable for each run.

<sql-query display="1"> SELECT Var_id, Run_id, Ident, Name, GROUP_CONCAT(Indic SEPARATOR ', ') AS Indices, N, Method FROM

  (SELECT Var.id as Var_id, Run.id as Run_id, Var.Ident AS Ident, Var.Name as Name, Ind.Ident AS Indic, N, Run.Name AS Method
  FROM Obj AS Var, Obj AS Run, Obj AS Ind, Loccell, Loc, Cell
  WHERE Var.id = Cell.Obj_id_v
  AND Run.id = Cell.Obj_id_r
  AND Cell.id = Loccell.Cell_id
  AND Loc.id = Loccell.Loc_id
  AND Ind.id = Loc.Obj_id_i
  GROUP BY Var_id, Run_id, Ind.Ident ) AS Temp1

GROUP BY Var_id, Run_id </sql-query>

@@ Line 410: / Line 410: @@
 * [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7181#Other_queries Some historical queries]
+'''NOTE! The queries below work in the new database "opasnet_base", not "resultdb" as the old versions.
 {{#sql-query:
@@ Line 468: / Line 470: @@
-'''List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.
+'''List all variables and their runs, and also list all indices (concatenated) used for each variable for each run.
 <sql-query display="1">
- SELECT Var_id, Run_id, Var_name, Var_title, GROUP_CONCAT(Dim_name SEPARATOR ', ') as Dimensions, n, Run_method
+SELECT Var_id, Run_id, Ident, Name, GROUP_CONCAT(Indic SEPARATOR ', ') AS Indices, N, Method
- FROM
+FROM
-    (SELECT Loc_of_result.Var_id, Run_list.Run_id, Var_name, Var_title, Dim_name, n, Run_method
+   (SELECT Var.id as Var_id, Run.id as Run_id, Var.Ident AS Ident, Var.Name as Name, Ind.Ident AS Indic, N, Run.Name AS Method
-    FROM Loc_of_result, Run_list, Run, Variable, Location, Dimension
+   FROM Obj AS Var, Obj AS Run, Obj AS Ind, Loccell, Loc, Cell
-    WHERE Loc_of_result.Result_id = Run_list.Result_id
+   WHERE Var.id = Cell.Obj_id_v
-    AND Run_list.Run_id = Run.Run_id
+   AND Run.id = Cell.Obj_id_r
-    AND Loc_of_result.Var_id = Variable.Var_id
+   AND Cell.id = Loccell.Cell_id
-    AND Loc_of_result.Loc_id = Location.Loc_id
+   AND Loc.id = Loccell.Loc_id
-    AND Location.Dim_id = Dimension.Dim_id
+   AND Ind.id = Loc.Obj_id_i
-    GROUP BY Dimension.Dim_id, Loc_of_result.Var_id, Run_list.Run_id
+   GROUP BY Var_id, Run_id, Ind.Ident ) AS Temp1
-    ORDER BY Loc_of_result.Var_id, Run_list.Run_id) as temp1
+GROUP BY Var_id, Run_id
- GROUP BY Var_id, Run_id
 </sql-query>