Opasnet base structure: Difference between revisions
(→The sample of the newest run of each variable: Analytica import query added) |
(→Some useful syntax: RDB reader added) |
||
Line 265: | Line 265: | ||
---- | ---- | ||
* [[:image:RDB connection.ANA|Result database connection module]] for Analytica: for writing variable results into the database. | * [[:image:RDB connection.ANA|Result database connection module]] for Analytica: for writing variable results into the database. This requires a password. | ||
* [[:image:RDB reader.ANA|Result database reader module]] for Analytica: for reading variable results into the database. | |||
====List of variables at dimension and run level==== | ====List of variables at dimension and run level==== |
Revision as of 14:43, 8 August 2008
Moderator:Nobody (see all) Click here to sign up. |
|
Upload data
|
Result database is a storage and retrieval system for variable results. It is basically an SQL database with the following functionalities:
- Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.D↷
- Automatic retrieval of results when called from the collaborative workspace or other platforms.
- Description and handling of the dimensions that the variables may take.
- Storage and retrieval system for items that are needed to calculate the results of variables.
- A platform for performing computer runs to update the results of variables.
- Follow-up of the linkages between variables, the data about a particular variable, and the computing formula of the variable, in respect to their impact on the variable result.
- Follow-up of the age and validity of the content based on the previous point.
- A platform for planning computer runs based on the update need, CPU demand, and CPU availability.
Functionalities of the result database
Storage and retrieval of results of variables
The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as natural constant to an uncertain spatio-temporal concentration field over the whole Europe), the database must be very flexible. The basic solution is described in the variable page, and it is only briefly summarised here. The result is described as
P(R|x1,x2,...)
where P(R) is the probability distribution of the result and x1 and x2 are defining locations where a particular P(R) applies. A dimension means a property along which there are multiple locations and the result of the variable may have different values when the location changes. In this case, x1 and x2 are dimensions, and particular values of x1 and x2 are locations. A variable can have zero, one, or more dimensions. Even if a dimension is continuous, it is usually operationalised in practice as a list of discrete locations. Such a list is called an index, and each location is called a row of the index. In the general information structure of the new risk assessment method, dimensions are Classes with a special purpose. An index can be thought of as a variable that inherits its plausible range from a dimension (class).
Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index Sample, which is a list of integers 1,2,3...n, where n=number of samples.
The dimensions of a variable are determined by the parent variables (by inheritance) and the formula used to calculate the result. Thus, there is not a place where the dimensions of a particular variable are explicitly asked for. In addition, the indices (as operationalisations of dimensions) are NOT properties of variables but of risk assessments. This may sound unintuitive, but the reasoning is that indices are just practical ways to handle dimensions, and these practical needs may change from one assessment to another.
The tables Variable and Result contain the result data. In addition, Location, Dimension, Index, and Rows contain data about the dimensions and indices used. These tables together offer the functionalities of data storage and retrieval, and handling of multidimensionality and uncertainty.
Calculation of the updated results
The result of a variable can be calculated when four things are known:
- The list of of upstream variables (parents) (Definition/causality attribute),
- The results of the parent variables,
- The data used to derive the result (Definition/data attribute), and
- The formula used to calculate the result based on the items above (Definition/formula attribute).
The three sub-attributes of the Definition are represented by three tables in the result database: Causality, Formula, and Data. In addition, the results of the parents can be obtained from the Result table. The variable transfer protocol is used to extract these data from the result database, send them to an external software such as R to calculate the result, and store the calculated result into the Result table of the database. The technical solutions to do this in practice have to be developed.
When a variable result is calculated, the computing software must know, which indices must be used with which variables. This can be automatically resolved using the following reasoning algorithm.
- Make a list of all unfinished risk assessments.
- Make a list of all indices in these risk assessments.
- Compile all indices of a particular dimension into one large "super-index" with all the locations.
- Use these "super-indices" in the calculations.
- Apply a particular "super-index" for a particular variable, if that variable has the dimension in question.
A wild use of occasional indices is discouraged, because they cause heavy computing needs with little benefit. Therefore, there should be a "standard risk assessment" that is constantly kept unfinished. It would then contain recommended indices for all major dimensions. This way, at least the standard indices are always used in computations, and the need for users to develop their own indices is smaller.
When the new results are stored in the database, the old results of the variables are deleted. The different versions of the variable results are NOT permanently stored anywhere. However, when a risk assessment report is created using the reporting tool, the result distributions used for that report are stored, together with the definitions and other data about all variables. Thus, a full copy of everything that relates to a particular assessment can be downloaded and stored outside the result database.
Follow-up of validity
The result of a variable is valid since its update until something that affects its content (i.e., the four things listed above) changes. Therefore, there must be a system that follows what things these are for a particular variable, and whether they have changed since the last calculation of the variable result. When the data in Causality, Formula, and Data tables is combined with the data of the dates when the parent variables were run, it can be automatically concluded whether the variable is valid or not. If the variable is older than its determinants, there is a need to recalculate the result. This cannot be done fully automatically, because some variables are probably being actively edited, and this would create a constant need to update everything downstream. In addition, some complex variables may take even weeks to compute.
Therefore, there should be a planning system for result updates. This can easily be done by adding tables Run and Run_list to the database. These tables contain information about the runs that have been performed or are being planned to be performed. The user can add variables to and delete them from the lists of planned runs. The needs for updating can be combined into practical collections of variables, given their connections, computer time needed, and computer time available. Then, when the task has been defined and the resources are available, a computer run can automatically be performed.
Suggested techniques to get started
The current idea is to describe the variables in Mediawiki, which is a text-based software. It would therefore become very difficult to operate these functionalities from there. Instead, if we store the data into the most convenient way, it can be effectively utilised. The most convenient way is to use an SQL database, which is the standard for large databanks. Among all SQL software, MySQL is the best due to several reasons:
- It is freely available open access software.
- It is easy to use.
- It has powerful functionalities.
To make this work out, we need a variable transfer protocol so that the result of a variable can be retrieved either automatically by a calculating software, or manually by the user who wants to explore the result. Fancy presenting software can be built on top of the database, so that the user does not see huge lists of numbers, but nice distributions instead. The development of this software is, again, technically straightforward, because:
- It is only communicating with the MySQL database, except some launch codes must be placed in other parts of the toolbox. Thus, the development can easily be decentralized.
- Something applicable probably exists in the open code world.
- It is not needed in the early life stages of the toolbox.
A suggested table and column structure for the database
|
|
|
| |||||||
|
| |||||||
|
|
|||||||
|
||||||||
|
| ||||||
|
| ||||||
|
* This column or these columns together uniquely identify the row in the table
Result database is for storing variable results in a way that they can be used independently of the assessment they were created in.
Some useful syntax
http://www.baycongroup.com/sql_join.htm
- Result database connection module for Analytica: for writing variable results into the database. This requires a password.
- Result database reader module for Analytica: for reading variable results into the database.
List of variables at dimension and run level
SELECT Variable.var_id, var_name, var_unit, Indices.dim_id, Indices.dim_name, Indices.ind_id, Indices.ind_name, Restat.n, Run.* FROM Variable, Run_list, Run, Loc_of_result, (SELECT var_id, result_id, avg(result) as result, min(result) as minimum, max(result) as maximum, count(sample) as n FROM Result GROUP BY result_id) as Restat, (SELECT Dimension.Dim_id, Dimension.Dim_name, Rows.Ind_id, Ind_name, row_number, Location.Loc_id, Location FROM `Dimension`, `Location`, `Rows`, `Index` WHERE Dimension.Dim_id = Location.Dim_id and Location.Loc_id = Rows.Loc_id and `Index`.Ind_id = Rows.Ind_id) as Indices WHERE Variable.var_id = Restat.var_id and Restat.result_id = Run_list.result_id and Run_list.run_id = Run.run_id and Restat.result_id = Loc_of_result.result_id and Loc_of_result.loc_id = Indices.loc_id GROUP BY Indices.ind_id, run_id, Variable.var_id ORDER BY Variable.var_id, run_id DESC
List of runs for each variable at result_id level
SELECT var_name, Restat.*, var_unit, Indices.*, Run.* FROM Variable, Run_list, Run, Loc_of_result, (SELECT var_id, result_id, avg(result) as result, min(result) as minimum, max(result) as maximum, count(sample) as n FROM Result GROUP BY result_id) as Restat, (SELECT Dimension.Dim_id, Dimension.Dim_name, Rows.Ind_id, Ind_name, row_number, Location.Loc_id, Location FROM `Dimension`, `Location`, `Rows`, `Index` WHERE Dimension.Dim_id = Location.Dim_id and Location.Loc_id = Rows.Loc_id and `Index`.Ind_id = Rows.Ind_id) as Indices WHERE Variable.var_id = Restat.var_id and Restat.result_id = Run_list.result_id and Run_list.run_id = Run.run_id and Restat.result_id = Loc_of_result.result_id and Loc_of_result.loc_id = Indices.loc_id ORDER BY run_id DESC, var_id, result_id
The newest sample from a variable to be converted into Analytica
SELECT Variable.var_name, var_unit, result, sample, dim_name, location, run_method, run_date FROM `Variable` , Result, Loc_of_result, Location, Dimension, Run_list, Run WHERE Variable.var_name = "Fig_3_cost_by_source" AND Variable.var_id = Result.var_id AND Result.result_id = Loc_of_result.result_id AND Loc_of_result.loc_id = Location.loc_id AND Location.dim_id = Dimension.dim_id AND Result.result_id = Run_list.result_id AND Run_list.run_id = Run.run_id
The sample of the newest run of each variable
SELECT Newestrun2.*, Result.* FROM Result, Run_list, ( SELECT var_id, var_name, run_id, max(run_date) as run_date FROM ( SELECT Variable.var_id, var_name, Run.run_id, run_date FROM Variable, Run, Run_list, (SELECT * FROM Result GROUP BY result_id) AS Resrun WHERE Variable.var_id = Resrun.Var_id and Resrun.result_id = Run_list.result_id and Run_list.run_id = Run.run_id GROUP BY Variable.var_id, run_date) as Newestrun GROUP BY var_id) as Newestrun2 WHERE Newestrun2.var_id = Result.var_id and Result.result_id = Run_list.result_id and Run_list.run_id = Newestrun2.run_id
All indices
SELECT Dimension.Dim_id, Dimension.Dim_name, Rows.Ind_id, Ind_name, row_number, Location.Loc_id, Location FROM `Dimension`, `Location`, `Rows`, `Index` WHERE Dimension.Dim_id = Location.Dim_id and Location.Loc_id = Rows.Loc_id and `Index`.Ind_id = Rows.Ind_id