Opasnet base structure: Difference between revisions

From Opasnet
Jump to navigation Jump to search
m (Result database moved to Opasnet base: The name is according to the new nomenclature.)
(completely edited and restructured as variable. Now is up-to-date)
Line 1: Line 1:
{{encyclopedia|Open assessment}}
[[Category:Open assessment]]
[[Category:Tool]]
[[Category:Tool]]
'''Result database''' is a storage and retrieval system for variable results. It is basically an SQL database with the following functionalities:
{{variable}}
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{disclink|Should all variables go to result distribution database?}}
# Automatic retrieval of results when called from the collaborative workspace or other platforms.
# Description and handling of the dimensions that the variables may take.
# Storage and retrieval system for items that are needed to calculate the results of variables.
# A platform for performing computer runs to update the results of variables.
# Follow-up of the linkages between variables, the data about a particular variable, and the computing formula of the variable, in respect to their impact on the variable result.
# Follow-up of the age and validity of the content based on the previous point.
# A platform for planning computer runs based on the update need, CPU demand, and CPU availability.


==Functionalities of the result database==
==Scope==


===Storage and retrieval of results of variables===
'''Opasnet base''' is a storage and retrieval system for [[variable]] [[result]]s. What is the structure of [[Opasnet base]] such that it enables the following functionalities?
# Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.{{disclink|Should all variables go to result distribution database?}}
# Automatic retrieval of results when called from [[Opasnet wiki]] or other platforms or modelling systems.
# Description and handling of the [[dimension]]s that a [[variable]] may take.
# Storage and retrieval system for items that are needed to calculate the [[result]]s of variables.(?)
# A platform for planning computer runs about variable results based on the update need, CPU demand, and CPU availability.


The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[variable]] page, and it is only briefly summarised here. The result is described as
==Definition==


  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
===Data===


where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining locations where a particular P(R) applies. A ''dimension'' means a property along which there are multiple locations and the result of the variable may have different values when the location changes. In this case, x<sub>1</sub> and x<sub>2</sub> are dimensions, and particular values of x<sub>1</sub> and x<sub>2</sub> are locations. A variable can have zero, one, or more dimensions. Even if a dimension is continuous, it is usually operationalised in practice as a list of discrete locations. Such a list is called an ''index'', and each location is called a ''row'' of the index. In the general information structure of the new risk assessment method, dimensions are [[Help:Class|Classes]] with a special purpose. An index can be thought of as a variable that inherits its plausible range from a dimension (class).
====Software====


Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.
Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a [[:en:SQL|SQL]] database. Because of its flexibility, ease of use, and cost, [[:en:MySQL|MySQL]] is an optimal choice among SQL software. In addition to the database software, a [[variable transfer protocol]] is needed on top of that so that the results of variables can be retrieved and  new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.


The dimensions of a variable are determined by the parent variables (by inheritance) and the formula used to calculate the result. Thus, there is not a place where the dimensions of a particular variable are explicitly asked for. In addition, the indices (as operationalisations of dimensions) are NOT properties of variables but of risk assessments. This may sound unintuitive, but the reasoning is that indices are just practical ways to handle dimensions, and these practical needs may change from one assessment to another.
====Storage and retrieval of results of variables====


The tables '''Variable''' and '''Result''' contain the result data. In addition, '''Location, Dimension, Index''', and '''Rows''' contain data about the dimensions and indices used. These tables together offer the functionalities of data storage and retrieval, and handling of multidimensionality and uncertainty.
The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the [[variable]] page, and it is only briefly summarised here. The result is described as


===Calculation of the updated results===
  P(R|x<sub>1</sub>,x<sub>2</sub>,...)
 
The result of a variable can be calculated when four things are known:
# The list of of upstream variables (parents) (Definition/causality attribute),
# The results of the parent variables,
# The data used to derive the result (Definition/data attribute), and
# The formula used to calculate the result based on the items above (Definition/formula attribute).
The three sub-attributes of the Definition are represented by three tables in the result database: '''Causality''', '''Formula''', and '''Data'''. In addition, the results of the parents can be obtained from the Result table. The [[Help:Variable transfer protocol|variable transfer protocol]] is used to extract these data from the result database, send them to an external software such as R to calculate the result, and store the calculated result into the Result table of the database. The technical solutions to do this in practice have to be developed.
 
When a variable result is calculated, the computing software must know, which indices must be used with which variables. This can be automatically resolved using the following reasoning algorithm.
* Make a list of all unfinished risk assessments.
* Make a list of all indices in these risk assessments.
* Compile all indices of a particular dimension into one large "super-index" with all the locations.
* Use these "super-indices" in the calculations.
* Apply a particular "super-index" for a particular variable, if that variable has the dimension in question.
A wild use of occasional indices is discouraged, because they cause heavy computing needs with little benefit. Therefore, there should be a "standard risk assessment" that is constantly kept unfinished. It would then contain recommended indices for all major dimensions. This way, at least the standard indices are always used in computations, and the need for users to develop their own indices is smaller.


When the new results are stored in the database, the old results of the variables are deleted. The different versions of the variable results are NOT permanently stored anywhere. However, when a risk assessment report is created using the [[Help:Reporting tool|reporting tool]], the result distributions used for that report are stored, together with the definitions and other data about all variables. Thus, a full copy of everything that relates to a particular assessment can be downloaded and stored outside the result database.
where P(R) is the probability distribution of the result and x<sub>1</sub> and x<sub>2</sub> are defining [[location]]s of a [[dimension]] where a particular P(R) applies. Typically locations are operationalised as discrete [[Index|indices]]. A variable must have at least one dimension. [[Uncertainty]] about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index ''Sample'', which is a list of integers 1,2,3...n, where n=number of samples.


===Follow-up of validity===
====Table and field names====
 
The result of a variable is valid since its update until something that affects its content (i.e., the four things listed above) changes. Therefore, there must be a system that follows what things these are for a particular variable, and whether they have changed since the last calculation of the variable result. When the data in Causality, Formula, and Data tables is combined with the data of the dates when the parent variables were run, it can be automatically concluded whether the variable is valid or not. If the variable is older than its determinants, there is a need to recalculate the result. This cannot be done fully automatically, because some variables are probably being actively edited, and this would create a constant need to update everything downstream. In addition, some complex variables may take even weeks to compute.
 
Therefore, there should be a planning system for result updates. This can easily be done by adding tables '''Run''' and '''Run_list''' to the database. These tables contain information about the runs that have been performed or are being planned to be performed. The user can add variables to and delete them from the lists of planned runs. The needs for updating can be combined into practical collections of variables, given their connections, computer time needed, and computer time available. Then, when the task has been defined and the resources are available, a computer run can automatically be performed.
 
===Suggested techniques to get started===
 
'''[[:en:MySQL|MySQL]]:'''
 
The current idea is to describe the variables in Mediawiki, which is a text-based software. It would therefore become very difficult to operate these functionalities from there. Instead, if we store the data into the most convenient way, it can be effectively utilised. The most convenient way is to use an SQL database, which is the standard for large databanks. Among all SQL software, MySQL is the best due to several reasons:
* It is freely available open access software.
* It is easy to use.
* It has powerful functionalities.
 
 
To make this work out, we need a '''[[Help:Variable transfer protocol|variable transfer protocol]]''' so that the result of a variable can be retrieved either automatically by a calculating software, or manually by the user who wants to explore the result. Fancy presenting software can be built on top of the database, so that the user does not see huge lists of numbers, but nice distributions instead. The development of this software is, again, technically straightforward, because:
* It is only communicating with the MySQL database, except some launch codes must be placed in other parts of the [[Help:Intarese toolbox|toolbox]]. Thus, the development can easily be decentralized.
* Something applicable probably exists in the open code world.
* It is not needed in the early life stages of the toolbox.
 
 
==A suggested table and column structure for the database==
 
===Suggested new naming for tables and fields:===


Principles:  
Principles:  
Line 115: Line 73:
* Run table: Remove "Run_" from the field names except Run_id
* Run table: Remove "Run_" from the field names except Run_id
* Runres table: Run_order -> order (do we actually need this field?)
* Runres table: Run_order -> order (do we actually need this field?)
==Result==
Opasnet base is a [[:en:MySQL]] database located at http://base.opasnet.org.


===Table structure===
===Table structure===
{| VALIGN="top" BORDER="0"
{| VALIGN="top" BORDER="0"
|-
|-
Line 551: Line 514:




==See also==


These texts could be moved to separate pages.


----
===Making value-of-information analyses in [[Opasnet base]]===
 
'''Result database''' is for storing variable results in a way that they can be used independently of the assessment they were created in.
 
==Possible uses of the database==
 
===Making value-of-information analyses===


[[:en:Value of information|Value of information]] (VOI) is a [[:en:decision analysis|decision analysis]] tool for estimating the importance of remaining uncertainty for decision-making. For detailed description, see [[:en:Value of information|Value of information]]. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an [[indicator]] by adjusting a [[decision variable]] so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.
[[:en:Value of information|Value of information]] (VOI) is a [[:en:decision analysis|decision analysis]] tool for estimating the importance of remaining uncertainty for decision-making. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an [[indicator]] variable by adjusting a [[decision variable]] so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.


==The improvement of the quality of a variable in time==
===Analysing the change in the quality of a variable result in [[Opasnet base]]===


All results that have once been stored in the result database remain there. Although the old results are not interesting for environmental health assessments after the updated result has been stored, they can be very interesting for other purposes. Some potential uses are listed below:
All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:
* The [[informativeness]] and [[calibration]] (see [[performance]]) can be evaluated for a single variable in time against the newest information.
* The time trend of [[informativeness]] and [[calibration]] (see [[performance]]) can be evaluated for a single variable against the newest information.
* Critical pieces of information that had made major contribution to the informativeness and calibration can be identified afterwards.
* Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
* Large number of variables can be assessed and e.g. following questions can be asked:
* Large number of variables can be assessed and e.g. following questions can be asked:
** How much work is needed to make a variable to have reasonable performance for practical applications?
** How much work is needed to make a variable with reasonable performance for practical applications?
** What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?
** What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?


===Some useful syntax===


==Some useful syntax==
* http://www.baycongroup.com/sql_join.htm
 
* [[:image:Opasnet base connection.ANA|Opasnet base connection.ANA]] for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.
http://www.baycongroup.com/sql_join.htm
* [http://en.opasnet.org/en-opwiki/index.php?title=Opasnet_base&oldid=7181#Other_queries Some historical queries]
 
----
 
* [[:image:RDB connection.ANA|Result database connection module]] for Analytica: for writing variable results into the database. This requires a password.
* [[:image:RDB reader.ANA|Result database reader module]] for Analytica: for reading variable results into the database.
 
* For SQL used by [[:image:RDB reader.ANA|RDB reader.ANA]] and [[:image:RDB connection.ANA|RDB connection.ANA]], see the respective pages.


===Useful queries that are not (yet) part of a model of procedure===


'''List all dimensions that have indices, and the indices concatenated:
'''List all dimensions that have indices, and the indices concatenated:
Line 595: Line 547:
order by Dimension.dim_id
order by Dimension.dim_id
</sql-query>
</sql-query>


'''List all indices, and their locations concatenated:
'''List all indices, and their locations concatenated:
Line 605: Line 558:
  order by Dim_name, `Index`.ind_name
  order by Dim_name, `Index`.ind_name
</sql-query>
</sql-query>


'''List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.
'''List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.
Line 622: Line 576:
  GROUP BY Var_id, Run_id
  GROUP BY Var_id, Run_id
</sql-query>
</sql-query>
===Other queries===
'''This query was used to transform the Var_id data from the table Result to Loc_of_result. This was a one-time operation that is recorded for historical interest only.
UPDATE Loc_of_result,
  (SELECT Variable.Var_id, Var_name, Loc_of_result.Loc_id, Loc_of_result.Result_id
    FROM Variable, Loc_of_result, Result
    WHERE Variable.Var_id = Result.Var_id and
    Loc_of_result.Result_id = Result.Result_id
    GROUP BY Loc_of_result.Result_id, Loc_of_result.Loc_id) as temp1
SET Loc_of_result.Var_id = temp1.Var_id
WHERE Loc_of_result.Loc_id = temp1.Loc_id and
    Loc_of_result.Result_id = temp1.Result_id

Revision as of 05:56, 29 November 2008



Scope

Opasnet base is a storage and retrieval system for variable results. What is the structure of Opasnet base such that it enables the following functionalities?

  1. Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.D↷
  2. Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
  3. Description and handling of the dimensions that a variable may take.
  4. Storage and retrieval system for items that are needed to calculate the results of variables.(?)
  5. A platform for planning computer runs about variable results based on the update need, CPU demand, and CPU availability.

Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of that so that the results of variables can be retrieved and new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as

  P(R|x1,x2,...) 

where P(R) is the probability distribution of the result and x1 and x2 are defining locations of a dimension where a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one dimension. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.

Table and field names

Principles:

  • The names should be as short as possible: three letters.
  • Tables that are only connecting two substance tables (i.e. tables for making many-to-many relationships) have a name that is a combination of the two, with six letters.
  • Identifiers are named like Var_id where Var is the name of the table.
  • Substantive fields may have longer names.
  • Substantive fields do not repeat the table name unless there is a possibility to mix two fields in different tables.
  • The field endings have the following meaning:
    • _id: the identifier of the row in RDB, a sequential number in the table.
    • _name: the identifier for Analytica, format: wiki link+page (e.g. Op_en2356)
    • _title: a longer, descriptive title
    • page: the page identifier from Opasnet
  • QUESTION: Should the name of Var be changed into Obj? Reasoning: in practice, all substantive information of Dimension is located in the Var table, and the Dim table only contains the information about the Dim_id. This could be operationalised in another way: all objects locate in the Obj table, which also contains a new field Type. Type is the type of object (Variable, Dimension, Index, Class, Assessment)


In practice the tables and fields would look like this:

Tables:

  • Variable -> Var
  • Result -> Res
  • Location -> Loc
  • Dimension -> Dim
  • Index -> Ind
  • Rows -> Row
  • Loc_of_result -> Locres (the location of each result)
  • Run -> Run
  • Run_list -> Runres (the run of each result)
  • Wiki_location -> Wik
  • Risk assessment -> Ra
  • RA_vars -> Ravar (the risk assessment of each variable)
  • RA_indices -> Raind (the risk assessment of each index)
  • Causality -> do we actually need this?
  • Formula -> do we actually need this?
  • Data -> do we actually need this?


Fields (only those are listed that are actively used and should be changed):

  • Var table: Remove the "Var_" from all fields except Var_id.
  • Page_id -> Page (because this is rather a substantive field than an identifier; there is no table called "Page")
  • Result_id -> Res_id
  • Dimension table: Dimensions are actually variables themselves. Therefore, all substantive content should be moved to Var; we don't need any more Dim_name, Dim_title, Dim_unit, Page_id and Wiki_id in this table. We need to add Var_id field, which tells where in the Var table the info of each dimension is found.
  • Row_number -> Row
  • Run table: Remove "Run_" from the field names except Run_id
  • Runres table: Run_order -> order (do we actually need this field?)

Result

Opasnet base is a en:MySQL database located at http://base.opasnet.org.

Table structure

Variable
Information about variable attributes and validity
FIELD TYPE EXTRA
Var_id mediumint(8) primary
Var_name varchar(20) unique
Var_title varchar(100)
Var_scope varchar(1000)
Var_unit varchar(16)
Page_id mediumint(8)
Wiki_id tinyint(3)
Result
All results are stored in this table. Each value of a result of a variable has an own row.
FIELD TYPE EXTRA
Result_id int(10) primary
Var_id mediumint(8)
Result varchar(1000)
Sample smallint(5)
Location
The location of the result along a particular dimension.
FIELD TYPE EXTRA
Loc_id mediumint(8) primary
Dim_id mediumint(8)
Location varchar(1000)
Dimension
Information about dimensions
FIELD TYPE EXTRA
Dim_id mediumint(8) primary
Dim_name varchar(100)
Dim_title varchar(100)
Dim_unit varchar(16)
Page_id mediumint(8)
Wiki_id tinyint(3)
Index
Information about indices
FIELD TYPE EXTRA
Ind_id int(10) primary
Ind_name varchar(100)
Dim_id mediumint(8)
Rows
Information about rows of indices
FIELD TYPE EXTRA
Ind_id int(10) unique
Row_number int(10) unique
Loc_id mediumint(8)
Loc_of_result
explanation coming...
FIELD TYPE EXTRA
Loc_id mediumint(8) unique
Result_id int(10) unique
Var_id mediumint(8)
Ind_id mediumint(8)
N mediumint(8)
Run_id mediumint(8)
Run
Information about the computer runs
FIELD TYPE EXTRA
Run_id mediumint(8) primary
Run_date date
Run_who varchar(50)
Run_method varchar(200)
Run_list
List of variables in a run
FIELD TYPE EXTRA
Run_id int(16)
Run_order varchar(100)
Var_id int(16)
Result_id int(10)
Wiki_location
Defines URL of a wiki where object is linked
FIELD TYPE EXTRA
Wiki_id tinyint(3) primary
URL varchar(60)
Wiki_name varchar(20)
Risk_assessment
Attributes of a risk assessment. Not actively used yet.
FIELD TYPE EXTRA
RA_id smallint(5) primary
RA_name varchar(100)
RA_scope varchar(1000)
RA_started date
RA_finished date
RA_vars
Defines the variables used in a risk assessment. Not actively used yet.
FIELD TYPE EXTRA
RA_id smallint(5) unique
Var_id mediumint(8) unique
RA_indices
Defines the indices used in a risk assessment. Not actively used yet.
FIELD TYPE EXTRA
RA_id smallint(5) unique
Ind_id int(10) unique
Causality
Defines the parents in the causal chain. Not actively used yet.
FIELD TYPE EXTRA
Var_id mediumint(8)
Causality_date date
Parent_id mediumint(8)
Formula
Defines the formulas of the variables. Not actively used yet.
FIELD TYPE EXTRA
Var_id mediumint(8)
Formula_date date
Software varchar(100)
Formula varchar(100)
Data
Defines the data of the variables. Not actively used yet.
FIELD TYPE EXTRA
Var_id mediumint(8)
Data_date date
URL varchar(100)


See also

These texts could be moved to separate pages.

Making value-of-information analyses in Opasnet base

Value of information (VOI) is a decision analysis tool for estimating the importance of remaining uncertainty for decision-making. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an indicator variable by adjusting a decision variable so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.

Analysing the change in the quality of a variable result in Opasnet base

All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:

  • The time trend of informativeness and calibration (see performance) can be evaluated for a single variable against the newest information.
  • Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
  • Large number of variables can be assessed and e.g. following questions can be asked:
    • How much work is needed to make a variable with reasonable performance for practical applications?
    • What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?

Some useful syntax


List all dimensions that have indices, and the indices concatenated:

<sql-query display="1"> Select Dim_name, dim_title, dim_unit, Group_concat(Ind_name order by ind_name separator ', ') as Indices from Dimension, `Index` where Dimension.dim_id = `Index`.Dim_id group by Dim_name order by Dimension.dim_id </sql-query>


List all indices, and their locations concatenated:

<sql-query display="1">

Select Dim_name, Dim_title, Dim_unit, Ind_name, Group_concat(Location order by row_number separator ', ') as Locations 
from `Index`, Location, Rows, Dimension
where `Index`.ind_id= Rows.ind_id and Rows.loc_id = Location.loc_id and `Index`.dim_id = Dimension.dim_id
group by Ind_name
order by Dim_name, `Index`.ind_name

</sql-query>


List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.

<sql-query display="1">

SELECT Var_id, Run_id, Var_name, Var_title, GROUP_CONCAT(Dim_name SEPARATOR ', ') as Dimensions, n, Run_method
FROM
   (SELECT Loc_of_result.Var_id, Run_list.Run_id, Var_name, Var_title, Dim_name, n, Run_method
   FROM Loc_of_result, Run_list, Run, Variable, Location, Dimension
   WHERE Loc_of_result.Result_id = Run_list.Result_id 
   AND Run_list.Run_id = Run.Run_id
   AND Loc_of_result.Var_id = Variable.Var_id
   AND Loc_of_result.Loc_id = Location.Loc_id 
   AND Location.Dim_id = Dimension.Dim_id
   GROUP BY Dimension.Dim_id, Loc_of_result.Var_id, Run_list.Run_id
   ORDER BY Loc_of_result.Var_id, Run_list.Run_id) as temp1
GROUP BY Var_id, Run_id

</sql-query>