Opasnet base structure

Revision as of 20:12, 29 November 2008

This page is a knowledge crystal of subtype variable. The page identifier is Op_en1913.

Scope

Opasnet base is a storage and retrieval system for variable results. What is the structure of Opasnet base such that it enables the following functionalities?

Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.
Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
Description and handling of the dimensions that a variable may take.
Storage and retrieval system for items that are needed to calculate the results of variables.(?)
A platform for planning computer runs about variable results based on the update need, CPU demand, and CPU availability.

Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is an SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of it, so that the results of variables can be retrieved and new results stored either automatically by calculation software or manually by the user. Elaborate presentation software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as a natural constant to an uncertain spatio-temporal concentration field over the whole of Europe), the database must be very flexible. The basic solution is described on the variable page, and it is only briefly summarised here. The result is described as

  P(R|x₁,x₂,...)

where P(R) is the probability distribution of the result, and x₁, x₂, ... are the locations along each dimension at which a particular P(R) applies. Typically, locations are operationalised as discrete indices. A variable must have at least one dimension. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution; the samples are located along an index called Sample, which is a list of integers 1, 2, 3, ..., n, where n is the number of samples.
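As a minimal sketch of this representation (in Python, with hypothetical dimension names and made-up numbers), the result of a variable can be held as one list of samples per combination of locations, and then flattened into one row per sample:

```python
import random

# Hypothetical example: a variable indexed by two dimensions
# (Country, Year). All names and numbers are made up for illustration.
random.seed(1)
N_SAMPLES = 5  # n, the length of the Sample index

def draw_samples(mean, sd, n=N_SAMPLES):
    """Return n random draws that stand in for P(R | x1, x2)."""
    return [random.gauss(mean, sd) for _ in range(n)]

# One entry per combination of locations (x1, x2).
result = {
    ("Finland", 2008): draw_samples(10.0, 2.0),
    ("Sweden", 2008): draw_samples(12.0, 3.0),
}

# Flatten into rows (x1, x2, Sample, value): one row per sample,
# with Sample running over the integers 1..n.
rows = [
    (country, year, sample, value)
    for (country, year), values in result.items()
    for sample, value in enumerate(values, start=1)
]
```

Each row then carries its locations, its position along the Sample index, and one sampled value, which is the shape a relational table needs.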

Table and field names

Principles:

The names should be as short as possible: three letters.
Tables that only connect two substantive tables (i.e. tables for making many-to-many relationships) have a name that is a combination of the two, with six letters.
Identifiers are named like Var_id where Var is the name of the table.
Substantive fields may have longer names.
Substantive fields do not repeat the table name unless there is a possibility to mix two fields in different tables.
The field endings have the following meaning:
- _id: the identifier of the row in RDB, a sequential number in the table.
- _name: the identifier for Analytica, format: wiki link+page (e.g. Op_en2356)
- _title: a longer, descriptive title
- page: the page identifier from Opasnet
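
The naming principles can be illustrated with two small helper functions. This is only a sketch: the function names `id_field` and `link_table` are hypothetical and are not part of Opasnet base.

```python
def id_field(table):
    """Identifier field of a table, e.g. Var -> Var_id."""
    return f"{table}_id"

def link_table(first, second):
    """Six-letter name of a many-to-many table joining two three-letter tables."""
    return first.capitalize() + second.lower()

# Examples that follow the principles above.
var_identifier = id_field("Var")          # "Var_id"
loc_res_table = link_table("Loc", "Res")  # "Locres"
```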

An idea of major reconstruction of the Opasnet base

Tables:

Obj (Int8) Object of some kind (previously Variable, Dimension, Index, and Risk_assessment)
- Oid (Var_id), Name, Title, Unit, Tid, Page, Wid (Wiki_id)
Typ (Tinyint3) Types of objects: variable, dimension (which is a specific kind of variable), method, assessment, class, index (which is not a universal object in the PSSP context), run.
- Tid, Type
Set (Int8) Defines the sets in the system, i.e. lists of objects that belong to a group or set.
- Item, Set, Row, Sid (Set_id)
Sty (Tinyint3) Types of sets: locations of dimension, locations of index, items of class, indices of assessment, variables of assessment, indices of dimension, dependencies of variable. Replaces tables Locations, Index, Rows, RA_vars, RA_indices, Run_list.
- Sid, Stype
Loc (previously Location)
- Lid (Loc_id), Did (Dim_id), Loct (Location as text), Locn (Location as number), Num (Yes/No)
Res (Int8) Result information (previously Loc_of_result)
- Rid (Result_id), Vid (Var_id), Lid, Iid (Ind_id), Run, Med (Median), N (number of sample)
Sam (Int10) Sample information (previously Result)
- Rid, Vid, Sample (Int 6), Result (Double)
Wik (Wiki_location)
- Wid, Url, Wname
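
The proposed tables could be sketched as follows. This is an illustration under several assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, the column types only approximate the ones suggested above, the keys are guesses, and the reading of the Set table (Item belongs to Set, ordered by Row, with the set type in Sid) is one possible interpretation. The object names and titles are hypothetical.

```python
import sqlite3

# Sketch only: SQLite stands in for MySQL. Keys and types are assumptions.
# "Set", "Row" and "Type" are quoted because they collide with SQL keywords.
SCHEMA = """
CREATE TABLE Obj (Oid INTEGER PRIMARY KEY, Name TEXT, Title TEXT,
                  Unit TEXT, Tid INTEGER, Page INTEGER, Wid INTEGER);
CREATE TABLE Typ (Tid INTEGER PRIMARY KEY, "Type" TEXT);
CREATE TABLE "Set" (Item INTEGER, "Set" INTEGER, "Row" INTEGER, Sid INTEGER);
CREATE TABLE Sty (Sid INTEGER PRIMARY KEY, Stype TEXT);
CREATE TABLE Loc (Lid INTEGER PRIMARY KEY, Did INTEGER, Loct TEXT,
                  Locn REAL, Num INTEGER);
CREATE TABLE Res (Rid INTEGER, Vid INTEGER, Lid INTEGER, Iid INTEGER,
                  Run INTEGER, Med REAL, N INTEGER);
CREATE TABLE Sam (Rid INTEGER, Vid INTEGER, Sample INTEGER, Result REAL);
CREATE TABLE Wik (Wid INTEGER PRIMARY KEY, Url TEXT, Wname TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Object types listed for the Typ table; Tid values are assigned 1..7.
types = ["variable", "dimension", "method", "assessment", "class", "index", "run"]
conn.executemany('INSERT INTO Typ ("Type") VALUES (?)', [(t,) for t in types])
conn.execute("INSERT INTO Sty (Sid, Stype) VALUES (1, 'items of class')")

# Hypothetical objects: one class (Tid 5) and two variables (Tid 1).
conn.executemany(
    "INSERT INTO Obj (Oid, Name, Title, Tid) VALUES (?, ?, ?, ?)",
    [(1, "Op_en0001", "Example class", 5),
     (2, "Op_en0002", "Example variable A", 1),
     (3, "Op_en0003", "Example variable B", 1)])

# Membership rows: Item belongs to Set, in the order given by Row.
conn.executemany(
    'INSERT INTO "Set" (Item, "Set", "Row", Sid) VALUES (?, ?, ?, 1)',
    [(2, 1, 1), (3, 1, 2)])

# List the items of class 1 in row order.
members = [title for (title,) in conn.execute(
    'SELECT o.Title FROM "Set" AS s JOIN Obj AS o ON o.Oid = s.Item '
    'WHERE s."Set" = 1 AND s.Sid = 1 ORDER BY s."Row"')]
```

Note that a table or field called Set needs quoting in most SQL dialects (backticks in MySQL), which may be a practical argument for choosing another three-letter name.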

In practice the tables and fields would look like this:

Tables:

Variable -> Obj
Result -> Res
Location -> Loc
Dimension -> Dim or Obj
Index -> Ind or
Rows -> Row
Loc_of_result -> Locres (the location of each result)
Run -> Run
Run_list -> Runres (the run of each result)
Wiki_location -> Wik
Risk assessment -> Oa
RA_vars -> Oavar (the risk assessment of each variable)
RA_indices -> Oaind (the risk assessment of each index)
Causality -> do we actually need this?
Formula -> do we actually need this?
Data -> do we actually need this?

What if we add one table, Info (or Inf), which contains additional information about the object? Its Obj_id field is the primary key, so that one object may have only one row. This table would have all the specific information that is not shared by all objects:

Assessment: date started and finished
Index: dimension to which it belongs
Possibly others

Obj table should also contain the following fields:
- Type: a selection between variable, method, class, assessment, index.
- Dim: a yes/no boolean field about whether the object is a dimension or not. Dimensions are variables, and therefore dimension cannot be added to the Type list.
If Var table is changed into Obj, what are the fields of each object type that are not covered with the existing Var fields?
- Assessment (Risk_assessment): RA_started, RA_finished.
- Dimension: List of indices that belong to this dimension.
- Index: Dim_id, List of locations that belong to this index (i.e. Loc_id, Row in Rows table).
- Class: List of items that belong to this class.
- Variable: None.

Fields (only those are listed that are actively used and should be changed):

Var table: Remove the "Var_" from all fields except Var_id.
Page_id -> Page (because this is rather a substantive field than an identifier; there is no table called "Page")
Result_id -> Res_id
Dimension table: Dimensions are actually variables themselves. Therefore, all substantive content should be moved to Var; we no longer need Dim_name, Dim_title, Dim_unit, Page_id and Wiki_id in this table. We need to add a Var_id field, which tells where in the Var table the information about each dimension is found.
Row_number -> Row
Run table: Remove "Run_" from the field names except Run_id
Runres table: Run_order -> order (do we actually need this field?)

Dependencies

Result

Opasnet base is a MySQL database located at http://base.opasnet.org.

Table structure

Variable
Information about variable attributes and validity
FIELD	TYPE	EXTRA
Var_id	mediumint(8)	primary
Var_name	varchar(20)	unique
Var_title	varchar(100)
Var_scope	varchar(1000)
Var_unit	varchar(16)
Page_id	mediumint(8)
Wiki_id	tinyint(3)

Result
All results are stored in this table. Each value of a result of a variable has its own row.
FIELD	TYPE	EXTRA
Result_id	int(10)	primary
Var_id	mediumint(8)
Result	varchar(1000)
Sample	smallint(5)

Location
The location of the result along a particular dimension.
FIELD	TYPE	EXTRA
Loc_id	mediumint(8)	primary
Dim_id	mediumint(8)
Location	varchar(1000)

Dimension
Information about dimensions
FIELD	TYPE	EXTRA
Dim_id	mediumint(8)	primary
Dim_name	varchar(100)
Dim_title	varchar(100)
Dim_unit	varchar(16)
Page_id	mediumint(8)
Wiki_id	tinyint(3)

Index
Information about indices
FIELD	TYPE	EXTRA
Ind_id	int(10)	primary
Ind_name	varchar(100)
Dim_id	mediumint(8)

Rows
Information about rows of indices
FIELD	TYPE	EXTRA
Ind_id	int(10)	unique
Row_number	int(10)	unique
Loc_id	mediumint(8)

Loc_of_result
explanation coming...
FIELD	TYPE	EXTRA
Loc_id	mediumint(8)	unique
Result_id	int(10)	unique
Var_id	mediumint(8)
Ind_id	mediumint(8)
N	mediumint(8)
Run_id	mediumint(8)

Run
Information about the computer runs
FIELD	TYPE	EXTRA
Run_id	mediumint(8)	primary
Run_date	date
Run_who	varchar(50)
Run_method	varchar(200)

Run_list
List of variables in a run
FIELD	TYPE	EXTRA
Run_id	int(16)
Run_order	varchar(100)
Var_id	int(16)
~~Result_id~~	int(10)

Wiki_location
Defines URL of a wiki where object is linked
FIELD	TYPE	EXTRA
Wiki_id	tinyint(3)	primary
URL	varchar(60)
Wiki_name	varchar(20)

Risk_assessment
Attributes of a risk assessment. Not actively used yet.
FIELD	TYPE	EXTRA
RA_id	smallint(5)	primary
RA_name	varchar(100)
RA_scope	varchar(1000)
RA_started	date
RA_finished	date

RA_vars
Defines the variables used in a risk assessment. Not actively used yet.
FIELD	TYPE	EXTRA
RA_id	smallint(5)	unique
Var_id	mediumint(8)	unique

RA_indices
Defines the indices used in a risk assessment. Not actively used yet.
FIELD	TYPE	EXTRA
RA_id	smallint(5)	unique
Ind_id	int(10)	unique

Causality
Defines the parents in the causal chain. Not actively used yet.
FIELD	TYPE	EXTRA
Var_id	mediumint(8)
Causality_date	date
Parent_id	mediumint(8)

Formula
Defines the formulas of the variables. Not actively used yet.
FIELD	TYPE	EXTRA
Var_id	mediumint(8)
Formula_date	date
Software	varchar(100)
Formula	varchar(100)

Data
Defines the data of the variables. Not actively used yet.
FIELD	TYPE	EXTRA
Var_id	mediumint(8)
Data_date	date
URL	varchar(100)
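
How storing and retrieving a result could work against the Variable and Result tables is sketched below. The sketch makes several assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, only a subset of the fields is created, the functions `store_result` and `fetch_result` are hypothetical names, and the variable Op_en0000 and its values are made up. Locations, indices and runs are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE Variable (Var_id INTEGER PRIMARY KEY, Var_name TEXT UNIQUE,
                       Var_title TEXT, Var_unit TEXT);
CREATE TABLE Result (Result_id INTEGER PRIMARY KEY, Var_id INTEGER,
                     Result TEXT, Sample INTEGER);
""")

def store_result(conn, var_name, samples):
    """Store one row per sample value for the named variable."""
    (var_id,) = conn.execute(
        "SELECT Var_id FROM Variable WHERE Var_name = ?", (var_name,)).fetchone()
    conn.executemany(
        "INSERT INTO Result (Var_id, Result, Sample) VALUES (?, ?, ?)",
        [(var_id, str(value), i) for i, value in enumerate(samples, start=1)])

def fetch_result(conn, var_name):
    """Return the sample values of the named variable, ordered by Sample."""
    rows = conn.execute(
        "SELECT r.Result FROM Result AS r "
        "JOIN Variable AS v ON v.Var_id = r.Var_id "
        "WHERE v.Var_name = ? ORDER BY r.Sample", (var_name,))
    return [float(value) for (value,) in rows]

# Hypothetical variable with three made-up sample values.
conn.execute("INSERT INTO Variable (Var_name, Var_title, Var_unit) "
             "VALUES ('Op_en0000', 'Example variable', 'ug/m3')")
store_result(conn, "Op_en0000", [1.5, 2.0, 2.5])
values = fetch_result(conn, "Op_en0000")
```

Because the Result field is a varchar, the values are written as text and converted back to numbers on retrieval.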

Value of information (VOI) is a decision analysis tool for estimating the importance of remaining uncertainty for decision-making. The result database can be used to perform a large number of VOI analyses, because all variables are already in the right format: random samples from uncertain variables. The analysis is done by optimising an indicator variable by adjusting a decision variable, while the variable under analysis is conditionalised to different values. In theory, all this can be done in the result database just by listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.
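
A minimal sketch of such a calculation from samples is given below. It computes the expected value of perfect information in the simplest case, where the variable of interest is the only uncertain variable. The samples, the decision options and the `indicator` function are all made up for illustration.

```python
from statistics import mean

# Hypothetical samples of the uncertain variable under analysis.
x_samples = [0.2, 0.5, 0.9, 1.4]
decisions = ["act", "wait"]

def indicator(decision, x):
    """Made-up net benefit: acting pays off only when x exceeds 0.6."""
    return x - 0.6 if decision == "act" else 0.0

# Best decision under current uncertainty: optimise the expected indicator.
value_now = max(
    mean([indicator(d, x) for x in x_samples]) for d in decisions)

# With perfect information on x: optimise separately for each sample,
# then take the expectation over the samples.
value_informed = mean(
    [max(indicator(d, x) for d in decisions) for x in x_samples])

# Expected value of perfect information (about 0.125 with these numbers).
evpi = value_informed - value_now
```

When other uncertain variables are present, the inner optimisation would have to average over them for each value of the variable of interest, which is where the conditionalising comes in.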

Analysing the change in the quality of a variable result in Opasnet base

All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:

The time trend of informativeness and calibration (see performance) can be evaluated for a single variable against the newest information.
Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
A large number of variables can be assessed, and e.g. the following questions can be asked:
- How much work is needed to make a variable with reasonable performance for practical applications?
- What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?

Some useful syntax

http://www.baycongroup.com/sql_join.htm
Opasnet base connection.ANA for Analytica: for writing and reading variable results into and from the database. Writing requires a password. For SQL used in the model, see the model page.

Some historical queries

List all dimensions that have indices, and the indices concatenated:

<sql-query display="1">

Select Dim_name, Dim_title, Dim_unit, Group_concat(Ind_name order by Ind_name separator ', ') as Indices
from Dimension, `Index`
where Dimension.dim_id = `Index`.dim_id
group by Dim_name
order by Dimension.dim_id

</sql-query>

List all indices, and their locations concatenated:

<sql-query display="1">

Select Dim_name, Dim_title, Dim_unit, Ind_name, Group_concat(Location order by row_number separator ', ') as Locations 
from `Index`, Location, Rows, Dimension
where `Index`.ind_id= Rows.ind_id and Rows.loc_id = Location.loc_id and `Index`.dim_id = Dimension.dim_id
group by Ind_name
order by Dim_name, `Index`.ind_name

</sql-query>
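
The join behind the query above can be tried out on toy data. The following sketch assumes SQLite (through Python's sqlite3 module) instead of MySQL, creates only the fields that the join needs, and uses made-up dimension, index and location names. The concatenation is done in Python rather than with Group_concat, so that the row order is explicit.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE Dimension (Dim_id INTEGER PRIMARY KEY, Dim_name TEXT);
CREATE TABLE "Index" (Ind_id INTEGER PRIMARY KEY, Ind_name TEXT, Dim_id INTEGER);
CREATE TABLE Location (Loc_id INTEGER PRIMARY KEY, Dim_id INTEGER, Location TEXT);
CREATE TABLE "Rows" (Ind_id INTEGER, Row_number INTEGER, Loc_id INTEGER);

-- Made-up content: one dimension, two indices, two locations.
INSERT INTO Dimension VALUES (1, 'Time');
INSERT INTO "Index" VALUES (1, 'Year_2000s', 1), (2, 'Year_odd', 1);
INSERT INTO Location VALUES (1, 1, '2007'), (2, 1, '2008');
INSERT INTO "Rows" VALUES (1, 1, 1), (1, 2, 2), (2, 1, 1);
""")

query = """
SELECT d.Dim_name, i.Ind_name, l.Location
FROM "Index" AS i
JOIN "Rows" AS r ON r.Ind_id = i.Ind_id
JOIN Location AS l ON l.Loc_id = r.Loc_id
JOIN Dimension AS d ON d.Dim_id = i.Dim_id
ORDER BY d.Dim_name, i.Ind_name, r.Row_number
"""

# Group consecutive rows by (dimension, index) and join their locations.
listing = [
    (dim, ind, ", ".join(loc for _, _, loc in group))
    for (dim, ind), group in groupby(conn.execute(query), key=lambda row: row[:2])
]
```

Each tuple in `listing` holds a dimension name, an index name, and the locations of that index in row order.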

List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.