Opasnet base structure: Difference between revisions

From Opasnet
Jump to navigation Jump to search
(→‎See also: First desctiption of Opasnet Data based on own thinking and discussions with Mikko, Hanna, and Kalle)
(marked outdated)
 
(74 intermediate revisions by 4 users not shown)
Line 2: Line 2:
[[Category:Open assessment]]
[[Category:Open assessment]]
[[Category:Tool]]
[[Category:Tool]]
{{variable}}
{{variable|moderator = Jouni
| reference = {{publication
| authors        = Juha Villman, Einari Happonen, Jouni T. Tuomisto
| page          = Opasnet Base structure
| explanation    =
| publishingyear = 2010
| urn            =
| elsewhere      =
}}
}}


==Scope==
:''This page is about the '''old structure of Opasnet Base''' used in ca. 2008-2011. For a description about the current database, see [[Opasnet base 2]].


'''Opasnet base''' is a storage and retrieval system for [[variable]] [[result]]s. What is the structure of [[Opasnet base]] such that it enables the following functionalities?
==Question==
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{disclink|Should all variables go to result distribution database?}}
# Automatic retrieval of results when called from [[Opasnet wiki]] or other platforms or modelling systems.
# Description and handling of the [[dimension]]s that a [[variable]] may take.
# Storage and retrieval system for items that are needed to calculate the [[result]]s of variables.(?)
# A platform for planning computer runs about variable results based on the update need, CPU demand, and CPU availability.


==Definition==
[[image:Opasnet Base structure.png|thumb|400px|Structure and connections (lines) of the tables (boxes) in the [[Opasnet Base]]. All table identifiers are called id (so they can be called by like obj.id). When obj.id is referred to in another table such as actobj, it is called actobj.obj_id. The latter end of a one-to-many relationship is marked with a ring. Important substantive fields are listed inside table boxes.]]


===Data===
'''Opasnet Base''' is a storage and retrieval system for [[result]]s of [[variable]] and [[data]] from [[study|studies]]. What is the structure of [[Opasnet Base]] such that it enables the following functionalities?
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{reslink|Should all variables go to result distribution database?}}
# Automatic retrieval of results when called from [[Opasnet wiki]] or other platforms or modelling systems.
# Description and handling of the [[index|indices]]s that a [[variable]] may take.
# It is possible to protect some results and data from reading by unauthorised persons.
# If is possible to build user interfaces for easily entering observations into the Base.


====Software====
==Answer==


Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a [[:en:SQL|SQL]] database. Because of its flexibility, ease of use, and cost, [[:en:MySQL|MySQL]] is an optimal choice among SQL software. In addition to the database software, a [[variable transfer protocol]] is needed on top of that so that the results of variables can be retrieved and  new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.
Opasnet base is a [[:en:MySQL|MySQL]] database located at http://base.opasnet.org.


====Storage and retrieval of results of variables====
===Data structure===


The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[variable]] page, and it is only briefly summarised here. The result is described as
:''Main article: '''[[Data structures in Opasnet]]'''''


  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
All data should be convertible into the following format:


where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining [[location]]s of a [[dimension]] where a particular P(R) applies. Typically locations are operationalised as discrete [[Index|indices]]. A variable must have at least one dimension. [[Uncertainty]] about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.
{| {{prettytable}}
! colspan="3"| || colspan="3"  style="background-color: #FFD8F0;"|Observation
|-----
!  style="background-color: #CCDFC8;"|Year ||  style="background-color: #CCDFC8;"|Sex ||  style="background-color: #CCDFC8;"|Age ||  style="background-color: #DFB8D0;"|Height ||  style="background-color: #DFB8D0;"|Weight ||  style="background-color: #DFB8D0;"|Description
|-----
|  style="background-color: #ECFFE8;"|2009 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|20 ||  style="background-color: #ECDFE8;"|178 ||  style="background-color: #ECDFE8;"|70 ||  style="background-color: #ECDFE8;"|An optional column for descriptive text about each row.
|-----
|  style="background-color: #ECFFE8;"|2009 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|30 ||  style="background-color: #ECDFE8;"|174 ||  style="background-color: #ECDFE8;"|79 ||  style="background-color: #ECDFE8;"|
|-----
|  style="background-color: #ECFFE8;"|2010 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|25 ||  style="background-color: #ECDFE8;"|183 ||  style="background-color: #ECDFE8;"|84 ||  style="background-color: #ECDFE8;"|
|-----
|  style="background-color: #ECFFE8;"|2010 ||  style="background-color: #ECFFE8;"|Female ||  style="background-color: #ECFFE8;"|22 ||  style="background-color: #ECDFE8;"|168 ||  style="background-color: #ECDFE8;"|65 ||  style="background-color: #ECDFE8;"|
|}


====Table and field names====
where


Principles:  
{| {{prettytable}}
* The names should be as short as possible: three letters.
|  style="background-color: #CCDFC8;"|Names of explanation columns, also known as indices.
* Tables that are only connecting two substance tables (i.e. tables for making many-to-many relationships) have a name that is a combination of the two, with six letters.
|-----
* Identifiers are named like Var_id where Var is the name of the table.
|  style="background-color: #ECFFE8;"|Explanation data, also known as locations. You can use these columns as search criteria.
* Substantive fields may have longer names.
|-----
* Substantive fields do not repeat the table name unless there is a possibility to mix two fields in different tables.
|  style="background-color: #FFD8F0;"|Observation index, typically called "Observation". Common name for all observation columns
* The field endings have the following meaning:
|-----
** _id: the identifier of the row in RDB, a sequential number in the table.
|  style="background-color: #DFB8D0;"|Names of observation columns. These are the parameters of interest.
** _name: the identifier for Analytica, format: wiki link+page (e.g. Op_en2356)
|-----
** _title: a longer, descriptive title
|  style="background-color: #ECDFE8;"|Observation data. These are the actual measurements.
** page: the page identifier from [[Opasnet]]
|-----
|}


'''An idea of major reconstruction of the Opasnet base
===Table structure in the database===
 
Tables:
* Obj (Int8) Object of some kind (previously Variable, Dimension, Index, and Risk_assessment)
** Oid (Var_id), Name, Title, Unit, Tid, Page, Wid (Wiki_id)
* Typ (Tinyint3) Types of objects: variable, dimension (which is a specific kind of variable), method, assessment, class, index (which is not a universal object in the [[PSSP]] context), run.
** Tid, Type
* Set (Int8) Defines the sets in the system, i.e. lists of objects that belong to a group or set.
** Item, Set, Row, Sid (Set_id)
* Sty (Tinyint3) Types of sets: locations of dimension, locations of index, items of class, indices of assessment, variables of assessment, indices of dimension, dependencies of variable. Replaces tables Locations, Index, Rows, RA_vars, RA_indices, Run_list.
** Sid, Stype
* Loc (previously Location)
** Lid (Loc_id), Did (Dim_id), Loct (Location as text), Locn (Location as number), Num (Yes/No)
* Res (Int8) Result information (previously Loc_of_result)
** Rid (Result_id), Vid (Var_id), Lid, Iid (Ind_id), Run, Med (Median), N (number of sample)
* Sam (Int10) Sample information (previously Result)
** Rid, Vid, Sample (Int 6), Result (Double)
* Wik (Wiki_location)
** Wid, Url, Wname
* Row (previously Rows)
** Iid, Row, Lid
* Inf (additional info about objects)
** Oid, Begin, End, Who, Method, Url
 
'''In practice the tables and fields would look like this:
 
Tables:
* Variable -> Obj
* Result -> Res
* Location -> Loc
* Dimension -> Dim or Obj
* Index -> Ind or
* Rows -> Row
* Loc_of_result -> Locres (the location of each result)
* Run -> Run
* Run_list -> Runres (the run of each result)
* Wiki_location -> Wik
* Risk assessment -> Oa
* RA_vars -> Oavar (the risk assessment of each variable)
* RA_indices -> Oaind (the risk assessment of each index)
* Causality -> do we actually need this?
* Formula -> do we actually need this?
* Data -> do we actually need this?
 
 
 
Fields (only those are listed that are actively used and should be changed):
* Var table: Remove the "Var_" from all fields except Var_id.
* Page_id -> Page (because this is rather a substantive field than an identifier; there is no table called "Page")
* Result_id -> Res_id
* Dimension table: Dimensions are actually variables themselves. Therefore, all substantive content should be moved to Var; we don't need any more Dim_name, Dim_title, Dim_unit, Page_id and Wiki_id in this table. We need to add Var_id field, which tells where in the Var table the info of each dimension is found.
* Row_number -> Row
* Run table: Remove "Run_" from the field names except Run_id
* Runres table: Run_order -> order (do we actually need this field?)
 
===Dependencies===
 
* [[Opasnet structure]]
* [[Open assessment]]
 
==Result==
 
Opasnet base is a [[:en:MySQL|MySQL]] database located at http://base.opasnet.org.


===Table structure===
==== All tables ====


{| VALIGN="top" BORDER="0"
{| VALIGN="top" BORDER="0"
|-
|-
|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|----
|colspan=5|'''act'''
|COLSPAN="3"|'''Obj'''
|-
|colspan=5|'''Uploads, updates, and other actions'''
|-
|Field||Type||Null||Extra||Key
|-
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|acttype_id||tinyint(3) unsigned||NO||||MUL
|-
|who||varchar(50)||NO||||
|-
|comments||varchar(250)||YES||||
|-
|-
|COLSPAN="3"|Contains all objects
|time||timestamp||NO||||
|----
| '''FIELD'''
| '''TYPE'''
| '''EXTRA'''
|----
| id
| mediumint(8)
| primary
|----
| Ident
| varchar(20)
| unique
|----
| Name
| varchar(200)
|
|----
| Unit
| varchar(16)
|
|----
| Typ_id
| tinyint(3)
|
|----
| Page
| mediumint(8)
|
|-
|-
| Wik_id
|temp_id||int(10) unsigned||NO||||MUL
| tinyint(3)
|  
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Res'''
|colspan=5|'''actloc'''
|-
|colspan=5|'''Locations of an act'''
|-
|-
|COLSPAN="3"|''Result information''
|Field||Type||Null||Extra||Key
|-
|-
| '''FIELD'''
|actobj_id||int(10) unsigned||NO||||PRI
| '''TYPE'''
|-
| '''EXTRA'''
|loc_id||int(10) unsigned||NO||||PRI
|----
| id
| int(10)
| primary
|----
| Obj_id_v
| mediumint(8)
|  
|----
| Obj_id_r
| mediumint(8)
|  
|----
| Mean
| float
|  
|----
| N
| smallint(6)
|
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Loc'''
|colspan=5|'''actobj'''
|-
|-
|COLSPAN="3"|''Location information''
|colspan=5|'''Acts of an object'''
|-
|-
| '''FIELD'''
|Field||Type||Null||Extra||Key
| '''TYPE'''
|-
| '''EXTRA'''
|id||int(10) unsigned||NO||auto_increment||PRI
|----
|-
| id
|act_id||int(10) unsigned||NO||||MUL
| mediumint(8)
|-
| primary
|obj_id||int(10) unsigned||NO||||MUL
|----
|-
| Obj_id_d
|series_id||int(10) unsigned||NO||||MUL
| mediumint(8)
|-
|  
|unit||varchar(64)||YES||||
|----
| Location
| varchar(1000)
|  
|----
| Locn
| float
|  
|----
| Num
| tinyint(1)
|  
|----
| Description
| varchar(150)
|  
|----
|}
|}
|-
|
{| {{prettytable}}
|colspan=5|'''acttype'''
|-
|colspan=5|'''List of action types'''
|-
|-
|VALIGN="top"|
|Field||Type||Null||Extra||Key
{| WIDTH="250px" {{prettytable}}
|COLSPAN="3"|'''Item'''
|-
|-
|COLSPAN="3"|''add description'''
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|-
| '''FIELD'''
|acttype||varchar(250)||NO||||UNI
| '''TYPE'''
| '''EXTRA'''
|----
| id
| mediumint(8)
| primary
|----
| Sett_id
| mediumint(8)
|
|----
| Obj_id
| mediumint(8)
|
|----
| Fail
| tinyint(1)
|  
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Locres'''
|colspan=5|'''cell'''
|-
|colspan=5|'''Cells of an object'''
|-
|Field||Type||Null||Extra||Key
|-
|-
|COLSPAN="3"|''Result information''
|id||int(12) unsigned||NO||auto_increment||PRI
|-
|-
| '''FIELD'''
|actobj_id||int(10) unsigned||NO||||MUL
| '''TYPE'''
|-
| '''EXTRA'''
|mean||float||YES||||
|----
|-
| id
|sd||float||NO||||
| int(10)
|-
| primary
|n||int(10)||NO||||
|----
|-
| Res_id
|sip||varchar(2000)||YES||||
| int(10)
|  
|----
| Roww_id
| mediumint(8)
|  
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Roww'''
|colspan=5|'''loc'''
|-
|-
|COLSPAN="3"|''Locations of an index''
|colspan=5|'''Location information'''
|-
|-
| '''FIELD'''
|Field||Type||Null||Extra||Key
| '''TYPE'''
|-
| '''EXTRA'''
|id||int(10) unsigned||NO||auto_increment||PRI
|----
|-
| id
|std_id||int(10) unsigned||NO||||MUL
| mediumint(8)
|-
| primary
|obj_id_i||int(10) unsigned||NO||||MUL
|----
|-
| Obj_id_i
|location||varchar(100)||NO||||
| mediumint(8)
|-
|  
|roww||mediumint(8) unsigned||NO||||
|----
|-
| Roww
|description||varchar(150)||NO||||
| mediumint(8)
|  
|----
| Loc_id
| mediumint(8)
|  
|----
|}
|}
|-
|-
|VALIGN="top"|
|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Sam'''
|colspan=5|'''loccell'''
|-
|colspan=5|'''Locations of a cell'''
|-
|Field||Type||Null||Extra||Key
|-
|-
|COLSPAN="3"|''Result distribution (actual values)''
|cell_id||int(10) unsigned||NO||||PRI
|-
|-
| '''FIELD'''
|loc_id||int(10) unsigned||NO||||PRI
| '''TYPE'''
| '''EXTRA'''
|----
| id
| int(10)
| primary
|----
| Res_id
| mediumint(8)
|----
| Descr_id
| mediumint(8)
|
|----
| Sample
| mediumint(8)
|  
|----
| Result
| float
|  
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Sett'''
|colspan=5|'''obj'''
|-
|colspan=5|'''Object information (all objects)'''
|-
|-
|COLSPAN="3"|''Memberships of items in sets''
|Field||Type||Null||Extra||Key
|-
|-
| '''FIELD'''
|id||int(10) unsigned||NO||auto_increment||PRI
| '''TYPE'''
|-
| '''EXTRA'''
|ident||varchar(20)||NO||||UNI
|----
|-
| id
|name||varchar(200)||NO||||
| mediumint(8)
|-
| primary
|objtype_id||tinyint(3) unsigned||NO||||MUL
|----
|-
| Obj_id
|page||int(10) unsigned||NO||||
| mediumint(8)
|-
|  
|wiki_id||tinyint(3) unsigned||NO||||
|----
| Sty_id
| mediumint(8)
|  
|----
|}
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Sty'''
|colspan=5|'''objtype'''
|-
|colspan=5|'''Types of objects'''
|-
|Field||Type||Null||Extra||Key
|-
|-
|COLSPAN="3"|''Types of set-item memberships''
|id||tinyint(3)||NO||||PRI
|-
|-
| '''FIELD'''
|objtype||varchar(30)||NO||||
| '''TYPE'''
| '''EXTRA'''
|----
| id
| tinyint(3)
|  
|----
| Stype
| varchar(30)
|  
|----
|}
|}
|-
|
{| {{prettytable}}
|colspan=5|'''res'''
|-
|-
|colspan=5|'''Result distribution (actual values)'''
|-
|Field||Type||Null||Extra||Key
|-
|id||bigint(20) unsigned||NO||auto_increment||PRI
|-
|cell_id||int(12) unsigned||NO||||MUL
|-
|obs||int(10) unsigned||NO||||
|-
|result||float||NO||||
|-
|restext||varchar(250)||YES||||
|-
|implausible||binary(1)||YES||||
|}
|VALIGN="top"|
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
{| {{prettytable}}
|COLSPAN="3"|'''Typ'''
|colspan=5|'''wiki'''
|-
|-
|COLSPAN="3"|''Object type information''
|colspan=5|'''Wiki information'''
|-
|-
| '''FIELD'''
|Field||Type||Null||Extra||Key
| '''TYPE'''
|-
| '''EXTRA'''
|id||tinyint(3)||NO||||PRI
|-
|url||varchar(255)||NO||||
|-
|wname||varchar(20)||NO||||
|}
|}
 
====Contents of selected tables====
 
{|
|
{| {{prettytable}}
|+ Table objtype
! id!! objtype
|----
|| 1|| Variable
|----
|| 2|| Study
|----
|----
| id
|| 3|| Method
| tinyint(3)
| primary
|----
|----
| Type
|| 4|| Assessment
| varchar(30)
|  
|----
|----
|}
|| 5|| Class
|VALIGN="top"|
|----
{| WIDTH="250px" {{prettytable}}
|| 6|| Index
|COLSPAN="3"|'''Wik'''
|----
|-
|| 7|| Nugget
|COLSPAN="3"|''Wiki information''
|-
| '''FIELD'''
| '''TYPE'''
| '''EXTRA'''
|----
|----
| id
|| 8|| Encyclopedia article
| tinyint(3)
| primary
|----
|----
| Url
|| 9|| Run
| varchar(255)
|  
|----
|----
|}
|}
|VALIGN="top"|
 
{| WIDTH="250px" {{prettytable}}
|
|COLSPAN="3"|'''Descr'''
 
|-
{| {{prettytable}}
|COLSPAN="3"|''Additional description of the result''
|+ Table acttype
|-
! id|| acttype
| '''FIELD'''
| '''TYPE'''
| '''EXTRA'''
|----
|----
| id
|| 1|| Start object
| mediumint(8)
| primary
|----
|----
| Description
|| 2|| Finish assessment
| varchar(250)
|  
|----
|----
|}
|| 3|| Update formula
|-
|VALIGN="top"|
{| WIDTH="250px" {{prettytable}}
|COLSPAN="3"|'''Inf'''
|-
|COLSPAN="3"|''Additional information about an object''
|-
| '''FIELD'''
| '''TYPE'''
| '''EXTRA'''
|----
|----
| id
|| 4|| Upload data (replace)
| mediumint(8)
| primary
|----
|----
| Begin
|| 5|| Upload data (append)
| date
|  
|----
|----
| End
|| 6|| Review scope
| date
|  
|----
|----
| Who
|| 7|| Review definition
| varchar(50)
|  
|----
|----
| Url
|| 8|| Add object info
| varchar(250)
|  
|----
|----
|}
|}
|}
|}


==See also==
==Rationale==
 
===Data===
 
====Software====
 
Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a [[:en:SQL|SQL]] database. Because of its flexibility, ease of use, and cost, [[:en:MySQL|MySQL]] is an optimal choice among SQL software. In addition to the database software, a [[variable transfer protocol]] is needed on top of that so that the results of variables can be retrieved and  new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.
 
====Storage and retrieval of results of variables====
 
The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[variable]] page, and it is only briefly summarised here. The result is described as
 
  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
 
where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining [[location]]s of an [[index]] where a particular P(R) applies. Typically locations are operationalised as discrete [[Index|indices]]. A variable must have at least one [[index]]. [[Uncertainty]] about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.
 
 
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7856 Old description of the structure]
 
===Dependencies===


These texts could be moved to separate pages.
* [[Opasnet structure]]
* [[Open assessment]]


===Opasnet Data===
====Replacing some cells====


'''Opasnet Data''' is another database, which is designed to collect observation data about [[object]]s. The purpose is to make it possible to take a generalised assessment model, and conditionalise that based on case-specific data. Using the Opasnet Data database, it is possible to collectively collect information about the the specific case.  
It is possible that there is a large data, where there is a need to update only a few cells while all others remain the same. How should this be done? There are a few potential alternatives.
# Use the current replace functionality. Replace all cells but most of them with the original value.
# Use a new act_type that is similar to the current append functionality. This should be understood in a way that if there are two (or more) identical cells (based on cell indices and locations), then the newest result is used and all older ones are discarded. (If the old ''append'' is used, then new info is just seen as a new row in the data table, not a replacement of an existing row.
# Add a new field into the cell (?) table with an updated cell_id (in a similar way than act_id and series_id are used in the actobj table). This way, the new cell can automatically inherit all locations of the old cell.


The structure of the Opasnet Data structure is mostly similar to that of Opasnet Base, with the following exceptions:
===Formula structure===
* The basic data unit is an ''observation'' about an object, typically a [[variable]]. An observation is treated like a result in [[Opasnet Base]], except that it is a point estimate instead of a distribution.
* Each observation may have locations along several dimensions. e.g. in space, time, or sex. However, indices are usually not used for observations, because an observation may locate anywhere along a dimension. An exception is when the data is collected using the help of an index, which restricts the data to fall into the locations of an index instead of anywhere along the dimension.
* Each data point, or observation, has always the following dimensions: time of entry, username or IP of the user who enters the data, and the object Ident.
* For a routine collection effort of data there may be a fixed list of dimensions that will be asked. This list is created by using a set of dimensions for the collection effort. These collection efforts are treated as Runs. An input format template is usually created for a collection effort.
* It is not clear whether an observation should always be a single number (or text field). If there is e.g. a table of observations with determinants on column headings and individual observations as rows, can this be treated as a single entry? Or should the table be disaggregated into single numbers with the columns treated as dimensions?
* In some cases, it  might be useful to restrict the number of entries per user to one. However, this is done only at the interpretation phase where only the last entry is counted. There are no restrictions to enter new data.


===Making value-of-information analyses in [[Opasnet base]]===
Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the [[Opasnet Base]]. There are different approaches:
* Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
* Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.


[[:en:Value of information|Value of information]] (VOI) is a [[:en:decision analysis|decision analysis]] tool for estimating the importance of remaining uncertainty for decision-making. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an [[indicator]] variable by adjusting a [[decision variable]] so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.
This approach requires new tables, namely Formula and Language.


===Analysing the change in the quality of a variable result in [[Opasnet base]]===


All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:
: {{comment|11|Do we need tables DIF and DIP like Uninet?|--[[User:Jouni|Jouni]] 21:50, 30 December 2009 (UTC)}}
* The time trend of [[informativeness]] and [[calibration]] (see [[performance]]) can be evaluated for a single variable against the newest information.
* DIP
* Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
** DIP_node_id
* Large number of variables can be assessed and e.g. following questions can be asked:
** DIP_parent_node_id
** How much work is needed to make a variable with reasonable performance for practical applications?
** DIP_corr_coeff
** What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?
** DIP_parent_index
* DIF
** DIF_node_id
** DIF_formula
** DIF_varnames_in_formula


===Some useful syntax===
===Universal Opasnet Base===


* http://www.baycongroup.com/sql_join.htm
The idea of universal Opasnet Base says that it should be possible to store results in such a way that the results themselves are public but their interpretation is limited. For example, patient symptoms and clinical test results should be openly available for research, but information about whose results they are should be private. This can be achieved with the following database structure.
* [[:image:Opasnet base connection.ANA|Opasnet base connection.ANA]] for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7181#Other_queries Some historical queries]


[[File:Universal Opasnet Base structure.png|thumb|400px|Universal Opasnet Base has some parts that exist in different versions depending on the privacy level. The yellow areas are e.g. a public area and a private area. The parts that are white are public.]]


'''List all dimensions that have indices, and the indices concatenated:
Let's say that it is enough to have two security levels, public and private. A person wants to record personal health information into the database. She logs in with her personal user name. The private profile gives the name (say, Liisa) and social security number of the person, while the public profile says only "30-40-year-old woman in Finland". Liisa writes down her symptoms or medical information and saves them. This is what is stored in the databases:


<sql-query display="1">
{| {{prettytable}}
Select Dim_name, dim_title, dim_unit, Group_concat(Ind_name order by ind_name separator ', ') as Indices
|+ '''Information stored in the public and private databases. The private database can read tables from the public one but not vice versa.
from Dimension, `Index`
! Table, field
where Dimension.dim_id = `Index`.Dim_id
! Private database
group by Dim_name
! Public database
order by Dimension.dim_id
|----
</sql-query>
| act.who
| Liisa, 010175-1024
| Woman, 30-40 a
|----
| act.when
| 2011-03-09 22:09:10
| 2011-03
|----
| obj.name
| N/A. Data is taken from public side.
| Pregnancy test
|----
| loccell.loc_id (locations and indices explained)
| Person = 010175-1024 <br>Time = 2011-03-09 <br>Test = Clearblue digital test
| Age = 30-40 <br>Sex = Female <br>Country = Finland <br>Time = 2011-03 <br>Test = Clearblue digital test
|----
| res.restext
| N/A. Data is taken from public side.
| Pregnant 1-2 weeks.
|----
|}


Based on the information, anyone can see that there is a woman in Finland who has used a Clearblue pregnancy test and the result was positive. But there is no way an outsider could connect this information to any particular person, because all information that could be used for linking is located in the private website. However, an authorised person from health case could see the data in the private database and connect Liisa and the test result.


'''List all indices, and their locations concatenated:
==See also==


<sql-query display="1">
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#All_tables:_Overview A previous discussion about the structure]
Select Dim_name, Dim_title, Dim_unit, Ind_name, Group_concat(Location order by row_number separator ', ') as Locations
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#Main_tables A previous structure and related discussions]
from `Index`, Location, Rows, Dimension
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#Tasks_performed Previous tasks performed]
where `Index`.ind_id= Rows.ind_id and Rows.loc_id = Location.loc_id and `Index`.dim_id = Dimension.dim_id
group by Ind_name
order by Dim_name, `Index`.ind_name
</sql-query>


; A basic query for retrieving the full result of a variable upload (an example):


'''List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.
{{#sql-query:
SELECT obj.ident, obj.name, obj.unit, obj.page, obj.wiki_id, comments, mean, sd, n, location, ind.ident, obs, result, restext
FROM obj
LEFT JOIN actobj ON actobj.obj_id = obj.id
LEFT JOIN act ON actobj.act_id = act.id
LEFT JOIN cell ON cell.actobj_id = actobj.id
LEFT JOIN loccell ON loccell.cell_id = cell.id
LEFT JOIN loc on loccell.loc_id = loc.id
LEFT JOIN obj AS ind ON loc.obj_id_i = ind.id
LEFT JOIN res ON res.cell_id = cell.id
WHERE obj.ident = "Op_en1912"
AND actobj.series_id = 190
LIMIT 0,100
}}


<sql-query display="1">
;Some useful syntax
SELECT Var_id, Run_id, Var_name, Var_title, GROUP_CONCAT(Dim_name SEPARATOR ', ') as Dimensions, n, Run_method
* http://www.baycongroup.com/sql_join.htm
FROM
* [[:image:Opasnet base connection.ANA|Opasnet base connection.ANA]] for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.
    (SELECT Loc_of_result.Var_id, Run_list.Run_id, Var_name, Var_title, Dim_name, n, Run_method
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7181#Other_queries Some historical queries]
    FROM Loc_of_result, Run_list, Run, Variable, Location, Dimension
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=14214#Some_useful_syntax Some historical queries 2]
    WHERE Loc_of_result.Result_id = Run_list.Result_id
[[Category:Opasnet Base]]
    AND Run_list.Run_id = Run.Run_id
[[Category:Data]]
    AND Loc_of_result.Var_id = Variable.Var_id
[[Category:Opasnet]]
    AND Loc_of_result.Loc_id = Location.Loc_id
    AND Location.Dim_id = Dimension.Dim_id
    GROUP BY Dimension.Dim_id, Loc_of_result.Var_id, Run_list.Run_id
    ORDER BY Loc_of_result.Var_id, Run_list.Run_id) as temp1
GROUP BY Var_id, Run_id
</sql-query>

Latest revision as of 12:01, 10 January 2014



This page is about the old structure of Opasnet Base used in ca. 2008-2011. For a description about the current database, see Opasnet base 2.

Question

Structure and connections (lines) of the tables (boxes) in the Opasnet Base. All table identifiers are called id (so they can be called by like obj.id). When obj.id is referred to in another table such as actobj, it is called actobj.obj_id. The latter end of a one-to-many relationship is marked with a ring. Important substantive fields are listed inside table boxes.

Opasnet Base is a storage and retrieval system for results of variable and data from studies. What is the structure of Opasnet Base such that it enables the following functionalities?

  1. Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.R↻
  2. Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
  3. Description and handling of the indicess that a variable may take.
  4. It is possible to protect some results and data from reading by unauthorised persons.
  5. If is possible to build user interfaces for easily entering observations into the Base.

Answer

Opasnet base is a MySQL database located at http://base.opasnet.org.

Data structure

Main article: Data structures in Opasnet

All data should be convertible into the following format:

Observation
Year Sex Age Height Weight Description
2009 Male 20 178 70 An optional column for descriptive text about each row.
2009 Male 30 174 79
2010 Male 25 183 84
2010 Female 22 168 65

where

Names of explanation columns, also known as indices.
Explanation data, also known as locations. You can use these columns as search criteria.
Observation index, typically called "Observation". Common name for all observation columns
Names of observation columns. These are the parameters of interest.
Observation data. These are the actual measurements.

Table structure in the database

All tables

act
Uploads, updates, and other actions
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
acttype_id tinyint(3) unsigned NO MUL
who varchar(50) NO
comments varchar(250) YES
time timestamp NO
temp_id int(10) unsigned NO MUL
actloc
Locations of an act
Field Type Null Extra Key
actobj_id int(10) unsigned NO PRI
loc_id int(10) unsigned NO PRI
actobj
Acts of an object
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
act_id int(10) unsigned NO MUL
obj_id int(10) unsigned NO MUL
series_id int(10) unsigned NO MUL
unit varchar(64) YES
acttype
List of action types
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
acttype varchar(250) NO UNI
cell
Cells of an object
Field Type Null Extra Key
id int(12) unsigned NO auto_increment PRI
actobj_id int(10) unsigned NO MUL
mean float YES
sd float NO
n int(10) NO
sip varchar(2000) YES
loc
Location information
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
std_id int(10) unsigned NO MUL
obj_id_i int(10) unsigned NO MUL
location varchar(100) NO
roww mediumint(8) unsigned NO
description varchar(150) NO
loccell
Locations of a cell
Field Type Null Extra Key
cell_id int(10) unsigned NO PRI
loc_id int(10) unsigned NO PRI
obj
Object information (all objects)
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
ident varchar(20) NO UNI
name varchar(200) NO
objtype_id tinyint(3) unsigned NO MUL
page int(10) unsigned NO
wiki_id tinyint(3) unsigned NO
objtype
Types of objects
Field Type Null Extra Key
id tinyint(3) NO PRI
objtype varchar(30) NO
res
Result distribution (actual values)
Field Type Null Extra Key
id bigint(20) unsigned NO auto_increment PRI
cell_id int(12) unsigned NO MUL
obs int(10) unsigned NO
result float NO
restext varchar(250) YES
implausible binary(1) YES
wiki
Wiki information
Field Type Null Extra Key
id tinyint(3) NO PRI
url varchar(255) NO
wname varchar(20) NO

Contents of selected tables

Table objtype
id objtype
1 Variable
2 Study
3 Method
4 Assessment
5 Class
6 Index
7 Nugget
8 Encyclopedia article
9 Run
Table acttype
id acttype
1 Start object
2 Finish assessment
3 Update formula
4 Upload data (replace)
5 Upload data (append)
6 Review scope
7 Review definition
8 Add object info

Rationale

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of that so that the results of variables can be retrieved and new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as

  P(R|x1,x2,...) 

where P(R) is the probability distribution of the result and x1 and x2 are defining locations of an index where a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one index. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.


Dependencies

Replacing some cells

It is possible that there is a large data, where there is a need to update only a few cells while all others remain the same. How should this be done? There are a few potential alternatives.

  1. Use the current replace functionality. Replace all cells but most of them with the original value.
  2. Use a new act_type that is similar to the current append functionality. This should be understood in a way that if there are two (or more) identical cells (based on cell indices and locations), then the newest result is used and all older ones are discarded. (If the old append is used, then new info is just seen as a new row in the data table, not a replacement of an existing row.
  3. Add a new field into the cell (?) table with an updated cell_id (in a similar way than act_id and series_id are used in the actobj table). This way, the new cell can automatically inherit all locations of the old cell.

Formula structure

Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the Opasnet Base. There are different approaches:

  • Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
  • Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.

This approach requires new tables, namely Formula and Language.


----11: . Do we need tables DIF and DIP like Uninet? --Jouni 21:50, 30 December 2009 (UTC) (type: truth; paradigms: science: comment)
  • DIP
    • DIP_node_id
    • DIP_parent_node_id
    • DIP_corr_coeff
    • DIP_parent_index
  • DIF
    • DIF_node_id
    • DIF_formula
    • DIF_varnames_in_formula

Universal Opasnet Base

The idea of universal Opasnet Base says that it should be possible to store results in such a way that the results themselves are public but their interpretation is limited. For example, patient symptoms and clinical test results should be openly available for research, but information about whose results they are should be private. This can be achieved with the following database structure.

Universal Opasnet Base has some parts that exist in different versions depending on the privacy level. The yellow areas are e.g. a public area and a private area. The parts that are white are public.

Let's say that it is enough to have two security levels, public and private. A person wants to record personal health information into the database. She logs in with her personal user name. The private profile gives the name (say, Liisa) and social security number of the person, while the public profile says only "30-40-year-old woman in Finland". Liisa writes down her symptoms or medical information and saves them. This is what is stored in the databases:

Information stored in the public and private databases. The private database can read tables from the public one but not vice versa.
Table, field Private database Public database
act.who Liisa, 010175-1024 Woman, 30-40 a
act.when 2011-03-09 22:09:10 2011-03
obj.name N/A. Data is taken from public side. Pregnancy test
loccell.loc_id (locations and indices explained) Person = 010175-1024
Time = 2011-03-09
Test = Clearblue digital test
Age = 30-40
Sex = Female
Country = Finland
Time = 2011-03
Test = Clearblue digital test
res.restext N/A. Data is taken from public side. Pregnant 1-2 weeks.

Based on the information, anyone can see that there is a woman in Finland who has used a Clearblue pregnancy test and the result was positive. But there is no way an outsider could connect this information to any particular person, because all information that could be used for linking is located in the private website. However, an authorised person from health case could see the data in the private database and connect Liisa and the test result.

See also

A basic query for retrieving the full result of a variable upload (an example)

{{#sql-query: SELECT obj.ident, obj.name, obj.unit, obj.page, obj.wiki_id, comments, mean, sd, n, location, ind.ident, obs, result, restext FROM obj LEFT JOIN actobj ON actobj.obj_id = obj.id LEFT JOIN act ON actobj.act_id = act.id LEFT JOIN cell ON cell.actobj_id = actobj.id LEFT JOIN loccell ON loccell.cell_id = cell.id LEFT JOIN loc on loccell.loc_id = loc.id LEFT JOIN obj AS ind ON loc.obj_id_i = ind.id LEFT JOIN res ON res.cell_id = cell.id WHERE obj.ident = "Op_en1912" AND actobj.series_id = 190 LIMIT 0,100 }}

Some useful syntax