Opasnet base structure

From Opasnet
Revision as of 19:37, 1 February 2009 by Jouni (talk | contribs) (idea of nugget data in the Base)
Jump to navigation Jump to search



Scope

Opasnet base is a storage and retrieval system for variable results and nugget data. What is the structure of Opasnet base such that it enables the following functionalities?

  1. Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.D↷
  2. Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
  3. Description and handling of the dimensions that a variable may take.
  4. It is possible to protect some results and data from reading by unauthorised persons.
  5. If is possible to build user interfaces for easily entering observations into the Base.


Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of that so that the results of variables can be retrieved and new results stored either automatically by a calculating software, or manually by the user. Fancy presenting software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as

  P(R|x1,x2,...) 

where P(R) is the probability distribution of the result and x1 and x2 are defining locations of a dimension where a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one dimension. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.

Table and field names

Principles:

  • The names should be as short as possible: three letters.
  • Tables that are only connecting two substance tables (i.e. tables for making many-to-many relationships) have a name that is a combination of the two, with six letters.
  • Identifiers are named like Var_id where Var is the name of the table.
  • Substantive fields may have longer names.
  • Substantive fields do not repeat the table name unless there is a possibility to mix two fields in different tables.
  • The field endings have the following meaning:
    • _id: the identifier of the row in RDB, a sequential number in the table.
    • _name: the identifier for Analytica, format: wiki link+page (e.g. Op_en2356)
    • _title: a longer, descriptive title
    • page: the page identifier from Opasnet

An idea of major reconstruction of the Opasnet base

Tables:

  • Obj (Int8) Object of some kind (previously Variable, Dimension, Index, and Risk_assessment)
    • Oid (Var_id), Name, Title, Unit, Tid, Page, Wid (Wiki_id)
  • Typ (Tinyint3) Types of objects: variable, dimension (which is a specific kind of variable), method, assessment, class, index (which is not a universal object in the PSSP context), run.
    • Tid, Type
  • Set (Int8) Defines the sets in the system, i.e. lists of objects that belong to a group or set.
    • Item, Set, Row, Sid (Set_id)
  • Sty (Tinyint3) Types of sets: locations of dimension, locations of index, items of class, indices of assessment, variables of assessment, indices of dimension, dependencies of variable. Replaces tables Locations, Index, Rows, RA_vars, RA_indices, Run_list.
    • Sid, Stype
  • Loc (previously Location)
    • Lid (Loc_id), Did (Dim_id), Loct (Location as text), Locn (Location as number), Num (Yes/No)
  • Res (Int8) Result information (previously Loc_of_result)
    • Rid (Result_id), Vid (Var_id), Lid, Iid (Ind_id), Run, Med (Median), N (number of sample)
  • Sam (Int10) Sample information (previously Result)
    • Rid, Vid, Sample (Int 6), Result (Double)
  • Wik (Wiki_location)
    • Wid, Url, Wname
  • Row (previously Rows)
    • Iid, Row, Lid
  • Inf (additional info about objects)
    • Oid, Begin, End, Who, Method, Url

In practice the tables and fields would look like this:

Tables:

  • Variable -> Obj
  • Result -> Res
  • Location -> Loc
  • Dimension -> Dim or Obj
  • Index -> Ind or
  • Rows -> Row
  • Loc_of_result -> Locres (the location of each result)
  • Run -> Run
  • Run_list -> Runres (the run of each result)
  • Wiki_location -> Wik
  • Risk assessment -> Oa
  • RA_vars -> Oavar (the risk assessment of each variable)
  • RA_indices -> Oaind (the risk assessment of each index)
  • Causality -> do we actually need this?
  • Formula -> do we actually need this?
  • Data -> do we actually need this?


Fields (only those are listed that are actively used and should be changed):

  • Var table: Remove the "Var_" from all fields except Var_id.
  • Page_id -> Page (because this is rather a substantive field than an identifier; there is no table called "Page")
  • Result_id -> Res_id
  • Dimension table: Dimensions are actually variables themselves. Therefore, all substantive content should be moved to Var; we don't need any more Dim_name, Dim_title, Dim_unit, Page_id and Wiki_id in this table. We need to add Var_id field, which tells where in the Var table the info of each dimension is found.
  • Row_number -> Row
  • Run table: Remove "Run_" from the field names except Run_id
  • Runres table: Run_order -> order (do we actually need this field?)

Dependencies

Result

Opasnet base is a MySQL database located at http://base.opasnet.org.

Table structure

  • Roww table is no longer needed, because the same information is in the Loc table.
  • We need Sams (Sample secure) and Descrs tables for secure information. All other tables are openly readable except these two. They have the same structure as Sam and Descr tables, respectively.
Obj
Contains all objects
FIELD TYPE EXTRA
id mediumint(8) primary
Ident varchar(20) unique
Name varchar(200)
Unit varchar(16)
Typ_id tinyint(3)
Page mediumint(8)
Wik_id tinyint(3)
Res
Result information
FIELD TYPE EXTRA
id int(10) primary
Obj_id_v mediumint(8)
Obj_id_r mediumint(8)
Mean float
N smallint(6)
Loc
Location information
FIELD TYPE EXTRA
id mediumint(8) primary
Obj_id_d mediumint(8)
Location varchar(1000)
Locn float
Num tinyint(1)
Description varchar(150)
Item
add description'
FIELD TYPE EXTRA
id mediumint(8) primary
Sett_id mediumint(8)
Obj_id mediumint(8)
Fail tinyint(1)
Locres
Result information
FIELD TYPE EXTRA
id int(10) primary
Res_id int(10)
Roww_id mediumint(8)
Roww
Locations of an index
FIELD TYPE EXTRA
id mediumint(8) primary
Obj_id_i mediumint(8)
Roww mediumint(8)
Loc_id mediumint(8)
Sam
Result distribution (actual values)
FIELD TYPE EXTRA
id int(10) primary
Res_id mediumint(8)
Descr_id mediumint(8)
Sample mediumint(8)
Result float
Sett
Memberships of items in sets
FIELD TYPE EXTRA
id mediumint(8) primary
Obj_id mediumint(8)
Sty_id mediumint(8)
Sty
Types of set-item memberships
FIELD TYPE EXTRA
id tinyint(3)
Stype varchar(30)
Typ
Object type information
FIELD TYPE EXTRA
id tinyint(3) primary
Type varchar(30)
Wik
Wiki information
FIELD TYPE EXTRA
id tinyint(3) primary
Url varchar(255)
Descr
Additional description of the result
FIELD TYPE EXTRA
id mediumint(8) primary
Description varchar(250)
Inf
Additional information about an object
FIELD TYPE EXTRA
id mediumint(8) primary
Begin date
End date
Who varchar(50)
Url varchar(250)

See also

These texts could be moved to separate pages.

Opasnet Data

Opasnet Data is another database, which is designed to collect observation data about objects. The purpose is to make it possible to take a generalised assessment model, and conditionalise that based on case-specific data. Using the Opasnet Data database, it is possible to collectively collect information about the the specific case.

The structure of the Opasnet Data structure is mostly similar to that of Opasnet Base, with the following exceptions:

  • The basic data unit is an observation about an object, typically a variable. An observation is treated like a result in Opasnet base, except that it is a point estimate instead of a distribution.
  • Each observation may have locations along several dimensions. e.g. in space, time, or sex. However, indices are usually not used for observations, because an observation may locate anywhere along a dimension. An exception is when the data is collected using the help of an index, which restricts the data to fall into the locations of an index instead of anywhere along the dimension.
  • Each data point, or observation, has always the following dimensions: time of entry, username or IP of the user who enters the data, and the object Ident.
  • For a routine collection effort of data there may be a fixed list of dimensions that will be asked. This list is created by using a set of dimensions for the collection effort. These collection efforts are treated as Runs. An input format template is usually created for a collection effort.
  • It is not clear whether an observation should always be a single number (or text field). If there is e.g. a table of observations with determinants on column headings and individual observations as rows, can this be treated as a single entry? Or should the table be disaggregated into single numbers with the columns treated as dimensions?
  • In some cases, it might be useful to restrict the number of entries per user to one. However, this is done only at the interpretation phase where only the last entry is counted. There are no restrictions to enter new data.

Making value-of-information analyses in Opasnet base

Value of information (VOI) is a decision analysis tool for estimating the importance of remaining uncertainty for decision-making. Result database can be used to perform a large number of VOI analyses, because all variables are in the right format for that: as random samples from uncertain variables. The analysis is done by optimising an indicator variable by adjusting a decision variable so that the variable under analysis is conditionalised to different values. All this can in theory be done in the result database by just listing the indicator, the decision variable, and the variable of interest. Practical tools should be developed for this. After that, systematic VOI analyses can be made over a wide range of environmental health issues.

Analysing the change in the quality of a variable result in Opasnet base

All results that have once been stored in the result database remain there. Old results can be very interesting for some purposes:

  • The time trend of informativeness and calibration (see performance) can be evaluated for a single variable against the newest information.
  • Critical pieces of information that had a major impact on the informativeness and calibration can be identified afterwards.
  • Large number of variables can be assessed and e.g. following questions can be asked:
    • How much work is needed to make a variable with reasonable performance for practical applications?
    • What are the critical steps after which the variable performance is saturated, i.e., does not improve much despite additional effort?

Some useful syntax


List all dimensions that have indices, and the indices concatenated:

<sql-query display="1"> Select Dim_name, dim_title, dim_unit, Group_concat(Ind_name order by ind_name separator ', ') as Indices from Dimension, `Index` where Dimension.dim_id = `Index`.Dim_id group by Dim_name order by Dimension.dim_id </sql-query>


List all indices, and their locations concatenated:

<sql-query display="1">

Select Dim_name, Dim_title, Dim_unit, Ind_name, Group_concat(Location order by row_number separator ', ') as Locations 
from `Index`, Location, Rows, Dimension
where `Index`.ind_id= Rows.ind_id and Rows.loc_id = Location.loc_id and `Index`.dim_id = Dimension.dim_id
group by Ind_name
order by Dim_name, `Index`.ind_name

</sql-query>


List all variables and their runs, and also list all dimensions (concatenated) used for each variable for each run.

<sql-query display="1">

SELECT Var_id, Run_id, Var_name, Var_title, GROUP_CONCAT(Dim_name SEPARATOR ', ') as Dimensions, n, Run_method
FROM
   (SELECT Loc_of_result.Var_id, Run_list.Run_id, Var_name, Var_title, Dim_name, n, Run_method
   FROM Loc_of_result, Run_list, Run, Variable, Location, Dimension
   WHERE Loc_of_result.Result_id = Run_list.Result_id 
   AND Run_list.Run_id = Run.Run_id
   AND Loc_of_result.Var_id = Variable.Var_id
   AND Loc_of_result.Loc_id = Location.Loc_id 
   AND Location.Dim_id = Dimension.Dim_id
   GROUP BY Dimension.Dim_id, Loc_of_result.Var_id, Run_list.Run_id
   ORDER BY Loc_of_result.Var_id, Run_list.Run_id) as temp1
GROUP BY Var_id, Run_id

</sql-query>