Opasnet base structure: Difference between revisions

From Opasnet
Jump to navigation Jump to search
m (Help:Result database moved to Result database: Methods and tools are moved from the Help to the Main namespace)
(marked outdated)
 
(112 intermediate revisions by 4 users not shown)
Line 1: Line 1:
'''Result database''' is a storage and retrieval system for variable results. It is basically an SQL database with the following functionalities:
[[Category:SQL tool]]
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{disclink|Should all variables go to result distribution database?}}
[[Category:Open assessment]]
# Automatic retrieval of results when called from the collaborative workspace or other platforms.
[[Category:Tool]]
# Description and handling of the dimensions that the variables may take.
{{variable|moderator = Jouni
# Storage and retrieval system for items that are needed to calculate the results of variables.
| reference = {{publication
# A platform for performing computer runs to update the results of variables.
| authors        = Juha Villman, Einari Happonen, Jouni T. Tuomisto
# Follow-up of the linkages between variables, the data about a particular variable, and the computing formula of the variable, in respect to their impact on the variable result.
| page          = Opasnet Base structure
# Follow-up of the age and validity of the content based on the previous point.
| explanation    =
# A platform for planning computer runs based on the update need, CPU demand, and CPU availability.
| publishingyear = 2010
| urn            =
| elsewhere      =
}}
}}


'''''Functionalities of the result database'''''
:''This page is about the '''old structure of Opasnet Base''' used in ca. 2008-2011. For a description about the current database, see [[Opasnet base 2]].


'''Storage and retrieval of results of variables'''
==Question==


The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[Help:Variable|variable]] page, and it is only briefly summarised here. The result is described as
[[image:Opasnet Base structure.png|thumb|400px|Structure and connections (lines) of the tables (boxes) in the [[Opasnet Base]]. All table identifiers are called id (so they can be called by like obj.id). When obj.id is referred to in another table such as actobj, it is called actobj.obj_id. The latter end of a one-to-many relationship is marked with a ring. Important substantive fields are listed inside table boxes.]]


  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
'''Opasnet Base''' is a storage and retrieval system for [[result]]s of [[variable]] and [[data]] from [[study|studies]]. What is the structure of [[Opasnet Base]] such that it enables the following functionalities?
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{reslink|Should all variables go to result distribution database?}}
# Automatic retrieval of results when called from [[Opasnet wiki]] or other platforms or modelling systems.
# Description and handling of the [[index|indices]]s that a [[variable]] may take.
# It is possible to protect some results and data from reading by unauthorised persons.
# If is possible to build user interfaces for easily entering observations into the Base.


where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining locations where a particular P(R) applies. A ''dimension'' means a property along which there are multiple locations and the result of the variable may have different values when the location changes. In this case, x<sub>1</sub> and x<sub>2</sub> are dimensions, and particular values of x<sub>1</sub> and x<sub>2</sub> are locations. A variable can have zero, one, or more dimensions. Even if a dimension is continuous, it is usually operationalised in practice as a list of discrete locations. Such a list is called an ''index'', and each location is called a ''row'' of the index. In the general information structure of the new risk assessment method, dimensions are [[Help:Class|Classes]] with a special purpose. An index can be thought of as a variable that inherits its plausible range from a dimension (class).
==Answer==


Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.
Opasnet base is a [[:en:MySQL|MySQL]] database located at http://base.opasnet.org.


The dimensions of a variable are determined by the parent variables (by inheritance) and the formula used to calculate the result. Thus, there is not a place where the dimensions of a particular variable are explicitly asked for. In addition, the indices (as operationalisations of dimensions) are NOT properties of variables but of risk assessments. This may sound unintuitive, but the reasoning is that indices are just practical ways to handle dimensions, and these practical needs may change from one assessment to another.
===Data structure===


The tables '''Variable''' and '''Result''' contain the result data. In addition, '''Location, Dimension, Index''', and '''Rows''' contain data about the dimensions and indices used. These tables together offer the functionalities of data storage and retrieval, and handling of multidimensionality and uncertainty.
:''Main article: '''[[Data structures in Opasnet]]'''''


'''Calculation of the updated results'''
All data should be convertible into the following format:


The result of a variable can be calculated when four things are known:
{| {{prettytable}}
# The list of of upstream variables (parents) (Definition/causality attribute),
! colspan="3"| || colspan="3"  style="background-color: #FFD8F0;"|Observation
# The results of the parent variables,
|-----
# The data used to derive the result (Definition/data attribute), and
!  style="background-color: #CCDFC8;"|Year ||  style="background-color: #CCDFC8;"|Sex ||  style="background-color: #CCDFC8;"|Age ||  style="background-color: #DFB8D0;"|Height ||  style="background-color: #DFB8D0;"|Weight ||  style="background-color: #DFB8D0;"|Description
# The formula used to calculate the result based on the items above (Definition/formula attribute).
|-----
The three sub-attributes of the Definition are represented by three tables in the result database: '''Causality''', '''Formula''', and '''Data'''. In addition, the results of the parents can be obtained from the Result table. The [[Help:Variable transfer protocol|variable transfer protocol]] is used to extract these data from the result database, send them to an external software such as R to calculate the result, and store the calculated result into the Result table of the database. The technical solutions to do this in practice have to be developed.
|  style="background-color: #ECFFE8;"|2009 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|20 ||  style="background-color: #ECDFE8;"|178 ||  style="background-color: #ECDFE8;"|70 ||  style="background-color: #ECDFE8;"|An optional column for descriptive text about each row.
|-----
|  style="background-color: #ECFFE8;"|2009 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|30 ||  style="background-color: #ECDFE8;"|174 ||  style="background-color: #ECDFE8;"|79 ||  style="background-color: #ECDFE8;"|
|-----
|  style="background-color: #ECFFE8;"|2010 ||  style="background-color: #ECFFE8;"|Male ||  style="background-color: #ECFFE8;"|25 ||  style="background-color: #ECDFE8;"|183 ||  style="background-color: #ECDFE8;"|84 ||  style="background-color: #ECDFE8;"|
|-----
|  style="background-color: #ECFFE8;"|2010 ||  style="background-color: #ECFFE8;"|Female ||  style="background-color: #ECFFE8;"|22 ||  style="background-color: #ECDFE8;"|168 ||  style="background-color: #ECDFE8;"|65 ||  style="background-color: #ECDFE8;"|
|}


When a variable result is calculated, the computing software must know, which indices must be used with which variables. This can be automatically resolved using the following reasoning algorithm.
where
* Make a list of all unfinished risk assessments.
* Make a list of all indices in these risk assessments.
* Compile all indices of a particular dimension into one large "super-index" with all the locations.
* Use these "super-indices" in the calculations.
* Apply a particular "super-index" for a particular variable, if that variable has the dimension in question.
A wild use of occasional indices is discouraged, because they cause heavy computing needs with little benefit. Therefore, there should be a "standard risk assessment" that is constantly kept unfinished. It would then contain recommended indices for all major dimensions. This way, at least the standard indices are always used in computations, and the need for users to develop their own indices is smaller.


When the new results are stored in the database, the old results of the variables are deleted. The different versions of the variable results are NOT permanently stored anywhere. However, when a risk assessment report is created using the [[Help:Reporting tool|reporting tool]], the result distributions used for that report are stored, together with the definitions and other data about all variables. Thus, a full copy of everything that relates to a particular assessment can be downloaded and stored outside the result database.
{| {{prettytable}}
|  style="background-color: #CCDFC8;"|Names of explanation columns, also known as indices.
|-----
|  style="background-color: #ECFFE8;"|Explanation data, also known as locations. You can use these columns as search criteria.
|-----
|  style="background-color: #FFD8F0;"|Observation index, typically called "Observation". Common name for all observation columns
|-----
|  style="background-color: #DFB8D0;"|Names of observation columns. These are the parameters of interest.
|-----
|  style="background-color: #ECDFE8;"|Observation data. These are the actual measurements.
|-----
|}


'''Follow-up of validity'''
===Table structure in the database===


The result of a variable is valid since its update until something that affects its content (i.e., the four things listed above) changes. Therefore, there must be a system that follows what things these are for a particular variable, and whether they have changed since the last calculation of the variable result. When the data in Causality, Formula, and Data tables is combined with the data of the dates when the parent variables were run, it can be automatically concluded whether the variable is valid or not. If the variable is older than its determinants, there is a need to recalculate the result. This cannot be done fully automatically, because some variables are probably being actively edited, and this would create a constant need to update everything downstream. In addition, some complex variables may take even weeks to compute.
==== All tables ====


Therefore, there should be a planning system for result updates. This can easily be done by adding tables '''Run''' and '''Run_list''' to the database. These tables contain information about the runs that have been performed or are being planned to be performed. The user can add variables to and delete them from the lists of planned runs. The needs for updating can be combined into practical collections of variables, given their connections, computer time needed, and computer time available. Then, when the task has been defined and the resources are available, a computer run can automatically be performed.
{| VALIGN="top" BORDER="0"
|-
|VALIGN="top"|
{| {{prettytable}}
|colspan=5|'''act'''
|-
|colspan=5|'''Uploads, updates, and other actions'''
|-
|Field||Type||Null||Extra||Key
|-
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|acttype_id||tinyint(3) unsigned||NO||||MUL
|-
|who||varchar(50)||NO||||
|-
|comments||varchar(250)||YES||||
|-
|time||timestamp||NO||||
|-
|temp_id||int(10) unsigned||NO||||MUL
|}


'''Suggested techniques to get started'''
|VALIGN="top"|
{| {{prettytable}}
|colspan=5|'''actloc'''
|-
|colspan=5|'''Locations of an act'''
|-
|Field||Type||Null||Extra||Key
|-
|actobj_id||int(10) unsigned||NO||||PRI
|-
|loc_id||int(10) unsigned||NO||||PRI
|}


'''[[:en:MySQL|MySQL]]:'''  
|VALIGN="top"|
{| {{prettytable}}
|colspan=5|'''actobj'''
|-
|colspan=5|'''Acts of an object'''
|-
|Field||Type||Null||Extra||Key
|-
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|act_id||int(10) unsigned||NO||||MUL
|-
|obj_id||int(10) unsigned||NO||||MUL
|-
|series_id||int(10) unsigned||NO||||MUL
|-
|unit||varchar(64)||YES||||
|}


The current idea is to describe the variables in Mediawiki, which is a text-based software. It would therefore become very difficult to operate these functionalities from there. Instead, if we store the data into the most convenient way, it can be effectively utilised. The most convenient way is to use an SQL database, which is the standard for large databanks. Among all SQL software, MySQL is the best due to several reasons:
|-
* It is freely available open access software.
* It is easy to use.
* It has powerful functionalities.
 
 
To make this work out, we need a '''[[Help:Variable transfer protocol|variable transfer protocol]]''' so that the result of a variable can be retrieved either automatically by a calculating software, or manually by the user who wants to explore the result. Fancy presenting software can be built on top of the database, so that the user does not see huge lists of numbers, but nice distributions instead. The development of this software is, again, technically straightforward, because:
* It is only communicating with the MySQL database, except some launch codes must be placed in other parts of the [[Help:Intarese toolbox|toolbox]]. Thus, the development can easily be decentralized.
* Something applicable probably exists in the open code world.
* It is not needed in the early life stages of the toolbox.
 
 
'''''A suggested table and column structure for the database'''''
 
{|
|
|
{| {{prettytable}}
{| {{prettytable}}
!Variable
|colspan=5|'''acttype'''
|-
|-
|Information about variable attributes and validity
|colspan=5|'''List of action types'''
|-
|-
|Columns
|Field||Type||Null||Extra||Key
*Var_id<sup>*</sup> (identifier of the variable)
|-
*Var_name (variable name as in Mediawiki)
|id||int(10) unsigned||NO||auto_increment||PRI
*Var_scope (variable scope as in Mediawiki)
|-
*Var_unit (variable unit as in Mediawiki)
|acttype||varchar(250)||NO||||UNI
*Validity_date (the date when the variable last was valid [1.1.2100 if it is currently valid])
*Run_id (the run that produced the current results of the variable)
*Run_time (CPU time that was needed for this variable during the last run)
|}
|}
|
 
|VALIGN="top"|
{| {{prettytable}}
{| {{prettytable}}
!Result
|colspan=5|'''cell'''
|-
|-
|All results are stored in this table. Each value of a result of a variable has an own row.
|colspan=5|'''Cells of an object'''
|-
|-
|Columns
|Field||Type||Null||Extra||Key
*Result_id<sup>*</sup> (identifier of the row in this table)
|-
*Var_id (identifier of the variable)
|id||int(12) unsigned||NO||auto_increment||PRI
*Result (the actual value of the variable)
|-
*Sample (the row in the uncertainty index ''Sample''. Use 0, if the result is deterministic.)
|actobj_id||int(10) unsigned||NO||||MUL
|-
|mean||float||YES||||
|-
|sd||float||NO||||
|-
|n||int(10)||NO||||
|-
|sip||varchar(2000)||YES||||
|}
|}
|-----
 
|VALIGN="top"|
{| {{prettytable}}
|colspan=5|'''loc'''
|-
|colspan=5|'''Location information'''
|-
|Field||Type||Null||Extra||Key
|-
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|std_id||int(10) unsigned||NO||||MUL
|-
|obj_id_i||int(10) unsigned||NO||||MUL
|-
|location||varchar(100)||NO||||
|-
|roww||mediumint(8) unsigned||NO||||
|-
|description||varchar(150)||NO||||
|}
|}
{{pagebreak}}
 
{|
|-
|
|
{|{{prettytable}}
{| {{prettytable}}
!Location
|colspan=5|'''loccell'''
|-
|-
|The location of the result along a particular dimension.
|colspan=5|'''Locations of a cell'''
|-
|-
|Columns
|Field||Type||Null||Extra||Key
*Result_id<sup>*</sup>
|-
*Dim_id<sup>*</sup>
|cell_id||int(10) unsigned||NO||||PRI
*Location
|-
|loc_id||int(10) unsigned||NO||||PRI
|}
|}
|
 
|VALIGN="top"|
{| {{prettytable}}
{| {{prettytable}}
!Dimension
|colspan=5|'''obj'''
|-
|colspan=5|'''Object information (all objects)'''
|-
|Field||Type||Null||Extra||Key
|-
|id||int(10) unsigned||NO||auto_increment||PRI
|-
|ident||varchar(20)||NO||||UNI
|-
|-
|Information about dimensions
|name||varchar(200)||NO||||
|-
|-
|Columns
|objtype_id||tinyint(3) unsigned||NO||||MUL
*Dim_id<sup>*</sup> (Dimension identifier)
|-
*Dim_name (Dimension name)
|page||int(10) unsigned||NO||||
*Dim_scope (Dimension scope)
|-
*Dim_unit (Dimension unit)
|wiki_id||tinyint(3) unsigned||NO||||
*Dim_definition (Dimension definition)
*Dim_result:  (Dimension result: the range of plausible values, such as "non-negative real number", "positive integer", or an exhaustive list of labels)
|}
|}
|-----
 
|
|VALIGN="top"|
{| {{prettytable}}
{| {{prettytable}}
!Index
|colspan=5|'''objtype'''
|-
|colspan=5|'''Types of objects'''
|-
|Field||Type||Null||Extra||Key
|-
|-
|Information about indices
|id||tinyint(3)||NO||||PRI
|-
|-
|Columns
|objtype||varchar(30)||NO||||
*Ind_id<sup>*</sup> (index identifier)
*Ind_name (index name)
*Dim_id (dimension identifier)
|}
|}
|-
|
|
{| {{prettytable}}
{| {{prettytable}}
!Rows
|colspan=5|'''res'''
|-
|colspan=5|'''Result distribution (actual values)'''
|-
|Field||Type||Null||Extra||Key
|-
|id||bigint(20) unsigned||NO||auto_increment||PRI
|-
|cell_id||int(12) unsigned||NO||||MUL
|-
|obs||int(10) unsigned||NO||||
|-
|result||float||NO||||
|-
|-
|Information about rows of indices
|restext||varchar(250)||YES||||
|-
|-
|Columns
|implausible||binary(1)||YES||||
*Ind_id<sup>*</sup> (index identifier)
*Row_number<sup>*</sup> (the number of this row in the index)
*Location (the location along the dimension of this row and index)
|}
|}
|-----
 
|
|VALIGN="top"|
{|{{prettytable}}
{| {{prettytable}}
! Risk_assessment
|colspan=5|'''wiki'''
|-----
|-
| Attributes of a risk assessment
|colspan=5|'''Wiki information'''
|-----
|-
| Columns
|Field||Type||Null||Extra||Key
* RA_id<sup>*</sup> (risk assessment identifier)
|-
* RA_name
|id||tinyint(3)||NO||||PRI
* RA_scope
|-
* RA_started (date when the risk assessment was started)
|url||varchar(255)||NO||||
* RA_finished (date when the risk assessment was finished)
|-
|wname||varchar(20)||NO||||
|}
|}
|
{|{{prettytable}}
! RA_vars
|-----
| Defines the variables used in a risk assessment
|-----
| Columns
* RA_id<sup>*</sup> (risk assessment identifier)
* Var_id<sup>*</sup> (variable identifier)
|}
|}
|
 
|-----
====Contents of selected tables====
|
 
{|{{prettytable}}
! RA_indices
|-----
| Defines the indices used in a risk assessment
|-----
| Columns
* RA_id<sup>*</sup> (risk assessment identifier)
* Ind_id<sup>*</sup> (index identifier)
|}
|
|-----
|
|}
{{pagebreak}}
{|
{|
|
|
{|{{prettytable}}
{| {{prettytable}}
! Causality
|+ Table objtype
|-----
! id!! objtype
| Defines the parents in the causal chain
|----
|-----
|| 1|| Variable
| Columns
|----
* Var_id<sup>*</sup>
|| 2|| Study
* Causality_date (date when the parent list was last changed)
|----
* Parent_id<sup>*</sup> (var_id of a parent variable)
|| 3|| Method
|----
|| 4|| Assessment
|----
|| 5|| Class
|----
|| 6|| Index
|----
|| 7|| Nugget
|----
|| 8|| Encyclopedia article
|----
|| 9|| Run
|----
|}
|}
|
|
{|{{prettytable}}
 
! Formula
{| {{prettytable}}
|-----
|+ Table acttype
| Defines the formulas of the variables
! id|| acttype
|-----
|----
| Columns
|| 1|| Start object
* Var_id<sup>*</sup>
|----
* Formula_date (date when the formula was last changed)
|| 2|| Finish assessment
* Software<sup>*</sup> (name of the software able to run the formula)
|----
* Formula (software code)
|| 3|| Update formula
|----
|| 4|| Upload data (replace)
|----
|| 5|| Upload data (append)
|----
|| 6|| Review scope
|----
|| 7|| Review definition
|----
|| 8|| Add object info
|----
|}
|}
|-----
|
{|{{prettytable}}
! Data
|-----
| Defines the data of the variables
|-----
| Columns
* Var_id<sup>*</sup>
* Data_date (date when the data was last changed)
* URL<sup>*</sup> (location of the data file)
|}
|}
|
 
==Rationale==
 
===Data===
 
====Software====
 
Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a [[:en:SQL|SQL]] database. Because of its flexibility, ease of use, and cost, [[:en:MySQL|MySQL]] is an optimal choice among SQL software. In addition to the database software, a [[variable transfer protocol]] is needed on top of that so that the results of variables can be retrieved and  new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.
 
====Storage and retrieval of results of variables====
 
The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[variable]] page, and it is only briefly summarised here. The result is described as
 
  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
 
where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining [[location]]s of an [[index]] where a particular P(R) applies. Typically locations are operationalised as discrete [[Index|indices]]. A variable must have at least one [[index]]. [[Uncertainty]] about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.
 
 
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7856 Old description of the structure]
 
===Dependencies===
 
* [[Opasnet structure]]
* [[Open assessment]]
 
====Replacing some cells====
 
It is possible that there is a large data, where there is a need to update only a few cells while all others remain the same. How should this be done? There are a few potential alternatives.
# Use the current replace functionality. Replace all cells but most of them with the original value.
# Use a new act_type that is similar to the current append functionality. This should be understood in a way that if there are two (or more) identical cells (based on cell indices and locations), then the newest result is used and all older ones are discarded. (If the old ''append'' is used, then new info is just seen as a new row in the data table, not a replacement of an existing row.
# Add a new field into the cell (?) table with an updated cell_id (in a similar way than act_id and series_id are used in the actobj table). This way, the new cell can automatically inherit all locations of the old cell.
 
===Formula structure===
 
Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the [[Opasnet Base]]. There are different approaches:
* Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
* Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.
 
This approach requires new tables, namely Formula and Language.
 
 
: {{comment|11|Do we need tables DIF and DIP like Uninet?|--[[User:Jouni|Jouni]] 21:50, 30 December 2009 (UTC)}}
* DIP
** DIP_node_id
** DIP_parent_node_id
** DIP_corr_coeff
** DIP_parent_index
* DIF
** DIF_node_id
** DIF_formula
** DIF_varnames_in_formula
 
===Universal Opasnet Base===
 
The idea of universal Opasnet Base says that it should be possible to store results in such a way that the results themselves are public but their interpretation is limited. For example, patient symptoms and clinical test results should be openly available for research, but information about whose results they are should be private. This can be achieved with the following database structure.
 
[[File:Universal Opasnet Base structure.png|thumb|400px|Universal Opasnet Base has some parts that exist in different versions depending on the privacy level. The yellow areas are e.g. a public area and a private area. The parts that are white are public.]]
 
Let's say that it is enough to have two security levels, public and private. A person wants to record personal health information into the database. She logs in with her personal user name. The private profile gives the name (say, Liisa) and social security number of the person, while the public profile says only "30-40-year-old woman in Finland". Liisa writes down her symptoms or medical information and saves them. This is what is stored in the databases:
 
{| {{prettytable}}
{| {{prettytable}}
!Run
|+ '''Information stored in the public and private databases. The private database can read tables from the public one but not vice versa.
|-
! Table, field
|Information about the computer runs
! Private database
|-
! Public database
|Columns
|----
*Run_id<sup>*</sup> (the identifier of the computer run)
| act.who
*Run_date (when the run was actually performed successfully)
| Liisa, 010175-1024
*Run_who (who performed/will perform the run)
| Woman, 30-40 a
*Run_method (what method was/will be used in the run)
|----
*Planned_run_date (the estimated date for the run)
| act.when
|}
| 2011-03-09 22:09:10
|-----
| 2011-03
|
|----
{| {{prettytable}}
| obj.name
!Run_list
| N/A. Data is taken from public side.
|-
| Pregnancy test
|List of variables in a run
|----
|-
| loccell.loc_id (locations and indices explained)
|Columns
| Person = 010175-1024 <br>Time = 2011-03-09 <br>Test = Clearblue digital test
*Run_id<sup>*</sup> (the identifier of the computer run)
| Age = 30-40 <br>Sex = Female <br>Country = Finland <br>Time = 2011-03 <br>Test = Clearblue digital test
*Run_order<sup>*</sup> (the order in which the variables will be computed)
|----
*Var_id (the identifier of the variable)
| res.restext
|}
| N/A. Data is taken from public side.
|
| Pregnant 1-2 weeks.
|----
|}
|}
<sup>*</sup> This column or these columns together uniquely identify the row in the table


----
Based on the information, anyone can see that there is a woman in Finland who has used a Clearblue pregnancy test and the result was positive. But there is no way an outsider could connect this information to any particular person, because all information that could be used for linking is located in the private website. However, an authorised person from health case could see the data in the private database and connect Liisa and the test result.


'''Result database''' is for storing variable results in a way that they can be used independently of the assessment they were created in.
==See also==


===Some useful syntax===
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#All_tables:_Overview A previous discussion about the structure]
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#Main_tables A previous structure and related discussions]
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=18900#Tasks_performed Previous tasks performed]


http://www.baycongroup.com/sql_join.htm
; A basic query for retrieving the full result of a variable upload (an example):


----
{{#sql-query:
SELECT obj.ident, obj.name, obj.unit, obj.page, obj.wiki_id, comments, mean, sd, n, location, ind.ident, obs, result, restext
FROM obj
LEFT JOIN actobj ON actobj.obj_id = obj.id
LEFT JOIN act ON actobj.act_id = act.id
LEFT JOIN cell ON cell.actobj_id = actobj.id
LEFT JOIN loccell ON loccell.cell_id = cell.id
LEFT JOIN loc on loccell.loc_id = loc.id
LEFT JOIN obj AS ind ON loc.obj_id_i = ind.id
LEFT JOIN res ON res.cell_id = cell.id
WHERE obj.ident = "Op_en1912"
AND actobj.series_id = 190
LIMIT 0,100
}}


SELECT * FROM `Result`, `Location`, `Variable`
;Some useful syntax
where `Result`.`Result_id` = `Location`.`Result_id`
* http://www.baycongroup.com/sql_join.htm
and `Result`.`Var_id` = `Variable`.`Var_id`
* [[:image:Opasnet base connection.ANA|Opasnet base connection.ANA]] for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.
and sample <11
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7181#Other_queries Some historical queries]
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base_structure&oldid=14214#Some_useful_syntax Some historical queries 2]
[[Category:Opasnet Base]]
[[Category:Data]]
[[Category:Opasnet]]

Latest revision as of 12:01, 10 January 2014



This page is about the old structure of Opasnet Base used in ca. 2008-2011. For a description about the current database, see Opasnet base 2.

Question

Structure and connections (lines) of the tables (boxes) in the Opasnet Base. All table identifiers are called id (so they can be called by like obj.id). When obj.id is referred to in another table such as actobj, it is called actobj.obj_id. The latter end of a one-to-many relationship is marked with a ring. Important substantive fields are listed inside table boxes.

Opasnet Base is a storage and retrieval system for results of variable and data from studies. What is the structure of Opasnet Base such that it enables the following functionalities?

  1. Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.R↻
  2. Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
  3. Description and handling of the indicess that a variable may take.
  4. It is possible to protect some results and data from reading by unauthorised persons.
  5. If is possible to build user interfaces for easily entering observations into the Base.

Answer

Opasnet base is a MySQL database located at http://base.opasnet.org.

Data structure

Main article: Data structures in Opasnet

All data should be convertible into the following format:

Observation
Year Sex Age Height Weight Description
2009 Male 20 178 70 An optional column for descriptive text about each row.
2009 Male 30 174 79
2010 Male 25 183 84
2010 Female 22 168 65

where

Names of explanation columns, also known as indices.
Explanation data, also known as locations. You can use these columns as search criteria.
Observation index, typically called "Observation". Common name for all observation columns
Names of observation columns. These are the parameters of interest.
Observation data. These are the actual measurements.

Table structure in the database

All tables

act
Uploads, updates, and other actions
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
acttype_id tinyint(3) unsigned NO MUL
who varchar(50) NO
comments varchar(250) YES
time timestamp NO
temp_id int(10) unsigned NO MUL
actloc
Locations of an act
Field Type Null Extra Key
actobj_id int(10) unsigned NO PRI
loc_id int(10) unsigned NO PRI
actobj
Acts of an object
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
act_id int(10) unsigned NO MUL
obj_id int(10) unsigned NO MUL
series_id int(10) unsigned NO MUL
unit varchar(64) YES
acttype
List of action types
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
acttype varchar(250) NO UNI
cell
Cells of an object
Field Type Null Extra Key
id int(12) unsigned NO auto_increment PRI
actobj_id int(10) unsigned NO MUL
mean float YES
sd float NO
n int(10) NO
sip varchar(2000) YES
loc
Location information
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
std_id int(10) unsigned NO MUL
obj_id_i int(10) unsigned NO MUL
location varchar(100) NO
roww mediumint(8) unsigned NO
description varchar(150) NO
loccell
Locations of a cell
Field Type Null Extra Key
cell_id int(10) unsigned NO PRI
loc_id int(10) unsigned NO PRI
obj
Object information (all objects)
Field Type Null Extra Key
id int(10) unsigned NO auto_increment PRI
ident varchar(20) NO UNI
name varchar(200) NO
objtype_id tinyint(3) unsigned NO MUL
page int(10) unsigned NO
wiki_id tinyint(3) unsigned NO
objtype
Types of objects
Field Type Null Extra Key
id tinyint(3) NO PRI
objtype varchar(30) NO
res
Result distribution (actual values)
Field Type Null Extra Key
id bigint(20) unsigned NO auto_increment PRI
cell_id int(12) unsigned NO MUL
obs int(10) unsigned NO
result float NO
restext varchar(250) YES
implausible binary(1) YES
wiki
Wiki information
Field Type Null Extra Key
id tinyint(3) NO PRI
url varchar(255) NO
wname varchar(20) NO

Contents of selected tables

Table objtype
id objtype
1 Variable
2 Study
3 Method
4 Assessment
5 Class
6 Index
7 Nugget
8 Encyclopedia article
9 Run
Table acttype
id acttype
1 Start object
2 Finish assessment
3 Update formula
4 Upload data (replace)
5 Upload data (append)
6 Review scope
7 Review definition
8 Add object info

Rationale

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of that so that the results of variables can be retrieved and new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as

  P(R|x1,x2,...) 

where P(R) is the probability distribution of the result and x1 and x2 are defining locations of an index where a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one index. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.


Dependencies

Replacing some cells

It is possible that there is a large data, where there is a need to update only a few cells while all others remain the same. How should this be done? There are a few potential alternatives.

  1. Use the current replace functionality. Replace all cells but most of them with the original value.
  2. Use a new act_type that is similar to the current append functionality. This should be understood in a way that if there are two (or more) identical cells (based on cell indices and locations), then the newest result is used and all older ones are discarded. (If the old append is used, then new info is just seen as a new row in the data table, not a replacement of an existing row.
  3. Add a new field into the cell (?) table with an updated cell_id (in a similar way than act_id and series_id are used in the actobj table). This way, the new cell can automatically inherit all locations of the old cell.

Formula structure

Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the Opasnet Base. There are different approaches:

  • Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
  • Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.

This approach requires new tables, namely Formula and Language.


----11: . Do we need tables DIF and DIP like Uninet? --Jouni 21:50, 30 December 2009 (UTC) (type: truth; paradigms: science: comment)
  • DIP
    • DIP_node_id
    • DIP_parent_node_id
    • DIP_corr_coeff
    • DIP_parent_index
  • DIF
    • DIF_node_id
    • DIF_formula
    • DIF_varnames_in_formula

Universal Opasnet Base

The idea of universal Opasnet Base says that it should be possible to store results in such a way that the results themselves are public but their interpretation is limited. For example, patient symptoms and clinical test results should be openly available for research, but information about whose results they are should be private. This can be achieved with the following database structure.

Universal Opasnet Base has some parts that exist in different versions depending on the privacy level. The yellow areas are e.g. a public area and a private area. The parts that are white are public.

Let's say that it is enough to have two security levels, public and private. A person wants to record personal health information into the database. She logs in with her personal user name. The private profile gives the name (say, Liisa) and social security number of the person, while the public profile says only "30-40-year-old woman in Finland". Liisa writes down her symptoms or medical information and saves them. This is what is stored in the databases:

Information stored in the public and private databases. The private database can read tables from the public one but not vice versa.
Table, field Private database Public database
act.who Liisa, 010175-1024 Woman, 30-40 a
act.when 2011-03-09 22:09:10 2011-03
obj.name N/A. Data is taken from public side. Pregnancy test
loccell.loc_id (locations and indices explained) Person = 010175-1024
Time = 2011-03-09
Test = Clearblue digital test
Age = 30-40
Sex = Female
Country = Finland
Time = 2011-03
Test = Clearblue digital test
res.restext N/A. Data is taken from public side. Pregnant 1-2 weeks.

Based on the information, anyone can see that there is a woman in Finland who has used a Clearblue pregnancy test and the result was positive. But there is no way an outsider could connect this information to any particular person, because all information that could be used for linking is located in the private website. However, an authorised person from health case could see the data in the private database and connect Liisa and the test result.

See also

A basic query for retrieving the full result of a variable upload (an example)

{{#sql-query: SELECT obj.ident, obj.name, obj.unit, obj.page, obj.wiki_id, comments, mean, sd, n, location, ind.ident, obs, result, restext FROM obj LEFT JOIN actobj ON actobj.obj_id = obj.id LEFT JOIN act ON actobj.act_id = act.id LEFT JOIN cell ON cell.actobj_id = actobj.id LEFT JOIN loccell ON loccell.cell_id = cell.id LEFT JOIN loc on loccell.loc_id = loc.id LEFT JOIN obj AS ind ON loc.obj_id_i = ind.id LEFT JOIN res ON res.cell_id = cell.id WHERE obj.ident = "Op_en1912" AND actobj.series_id = 190 LIMIT 0,100 }}

Some useful syntax