Opasnet base structure: Difference between revisions

Revision as of 05:07, 10 March 2011

This page is a knowledge crystal of subtype variable. The page identifier is Op_en1913
Moderator: Jouni
Citation of this page: Juha Villman, Einari Happonen, Jouni T. Tuomisto: Opasnet Base structure. Opasnet 2010. [1]. Accessed 21 Nov 2024.

This page is about the structure of Opasnet Base. For a general description, see Opasnet base.

Scope

Opasnet base is a storage and retrieval system for results of variables and data from studies. What is the structure of Opasnet base such that it enables the following functionalities?

Storage of results of variables with uncertainties when necessary, and as multidimensional arrays when necessary.
Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
Description and handling of the indices that a variable may take.
Protection of some results and data from reading by unauthorised persons.
Building of user interfaces for easily entering observations into the Base.

Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, an SQL database is the appropriate structure. Because of its flexibility, ease of use, and cost, MySQL is a suitable choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of it so that the results of variables can be retrieved and new results stored either automatically by calculation software or manually by the user. Presentation software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as a natural constant to an uncertain spatio-temporal concentration field over the whole of Europe), the database must be very flexible. The basic solution is described on the variable page, and it is only briefly summarised here. The result is described as

  P(R|x₁,x₂,...)

where P(R) is the probability distribution of the result and x₁, x₂, ... are the locations along the indices for which a particular P(R) applies. Typically locations are operationalised as discrete indices. A variable must have at least one index. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution, in such a way that the samples are located along an index called Sample, which is a list of integers 1, 2, 3, ..., n, where n is the number of samples.
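The Sample index can be illustrated with a minimal Python sketch. The function name `sample_variable`, the indices Year and Area, and the normal distribution are hypothetical choices made for the example only; they are not part of Opasnet Base.

```python
import random

def sample_variable(locations, n, draw, seed=0):
    """Return one row per (location, sample number) pair.

    Each row is (x1, x2, ..., sample_number, value), so the uncertainty
    of every cell is spread along the extra index "Sample".
    """
    rng = random.Random(seed)
    rows = []
    for loc in locations:
        for s in range(1, n + 1):  # Sample index: 1, 2, 3, ..., n
            rows.append((*loc, s, draw(rng, loc)))
    return rows

# Hypothetical variable with two indices (Year, Area) and n = 3 samples.
locations = [(2009, "Urban"), (2009, "Rural")]
rows = sample_variable(locations, n=3, draw=lambda rng, loc: rng.gauss(10.0, 2.0))

for row in rows:
    print(row)
```

A deterministic variable is then simply the special case n = 1.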

Old description of the structure

Dependencies

Result

Opasnet base is a MySQL database located at http://base.opasnet.org.

Data structure

All data should be convertible into the following format:

			Personal measurements
Year	Sex	Age	Height	Weight
2009	Male	20	178	70
2009	Male	30	174	79
2010	Male	25	183	84
2010	Female	22	168	65

where

Name for explanation column(s): Year, Sex, Age.

Explanation data: the values in the explanation columns. These are determined or decided before the actual observations are made.

Observation index: Personal measurements. This is the common name for all observations.

Name for observation column(s): Height, Weight. These are the parameters studied.

Observation data: the values in the observation columns. These are the actual measurements.

This is the "Standard data" that is entered as a Data table. The observation index is given separately in Object info and does not yet show up in the table.

Year	Sex	Age	Height	Weight
2009	Male	20	178	70
2009	Male	30	174	79
2010	Male	25	183	84
2010	Female	22	168	65

This is Object information. It varies slightly depending on the format you use for uploading data.

**Info_table**
ident	Op_en2693
name	Testvariable
unit	#
# explanation cols	3
observation index	health impact
probabilistic?	No

This is the indexified table where all observations have been put into a single column. The next step is to replace all explanatory data text (columns 1-4) with identifiers (from the Loc table in the Opasnet Base).

Year	Sex	Age	Personal measurements	result
2009	Male	20	Height	178
2009	Male	30	Height	174
2010	Male	25	Height	183
2010	Female	22	Height	168
2009	Male	20	Weight	70
2009	Male	30	Weight	79
2010	Male	25	Weight	84
2010	Female	22	Weight	65

(The tables above have been created with File:Opasnet base explanation.ods.)
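The indexification step can be sketched in Python as follows. This is only an illustration of the transformation shown in the tables above: the function names `indexify` and `assign_location_ids` are hypothetical, and the running integer identifiers stand in for the real identifiers of the Loc table.

```python
def indexify(header, rows, n_explanation, observation_index):
    """Put all observation columns into a single 'result' column."""
    expl_names = header[:n_explanation]
    obs_names = header[n_explanation:]
    long_header = expl_names + [observation_index, "result"]
    long_rows = []
    for obs_pos, obs_name in enumerate(obs_names):
        for row in rows:
            value = row[n_explanation + obs_pos]
            long_rows.append(row[:n_explanation] + [obs_name, value])
    return long_header, long_rows

def assign_location_ids(long_header, long_rows):
    """Map every (index name, location text) pair to a made-up identifier."""
    loc_ids = {}
    for row in long_rows:
        for index_name, location in zip(long_header[:-1], row[:-1]):
            loc_ids.setdefault((index_name, str(location)), len(loc_ids) + 1)
    return loc_ids

header = ["Year", "Sex", "Age", "Height", "Weight"]
rows = [
    [2009, "Male", 20, 178, 70],
    [2009, "Male", 30, 174, 79],
    [2010, "Male", 25, 183, 84],
    [2010, "Female", 22, 168, 65],
]

long_header, long_rows = indexify(header, rows, 3, "Personal measurements")
loc_ids = assign_location_ids(long_header, long_rows)
```

Here `n_explanation` corresponds to the "# explanation cols" entry of the Info table, and `observation_index` to its observation index entry.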

Table structure in the database

All tables

TODO: Update these tables! (Juha Villman, Einari Happonen)

act
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
series_id	int(10) unsigned
acttype_id	tinyint(3) unsigned
who	varchar(50)
comments	varchar(250)	Yes
time	timestamp
temp_id	int(10) unsigned

actobj
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
act_id	int(10) unsigned
obj_id	int(10) unsigned
series_id	int(10) unsigned
unit	varchar(64)

acttype
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
acttype	varchar(250)

cell
Field	Type	Null	Extra
id	int(12) unsigned		auto_increment
obj_id_v	int(10) unsigned
obj_id_r	int(10) unsigned
actobj_id	int(10) unsigned
mean	float	Yes
sd	float
n	int(10)

formula
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id_v	int(10) unsigned
act_id	int(10) unsigned
actobj_id	int(10) unsigned
language	smallint(5) unsigned
code	longtext	Yes

item
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
sett_id	int(10) unsigned
obj_id	int(10) unsigned
fail	tinyint(1) unsigned

language
Field	Type	Null	Extra
id	tinyint(3) unsigned		auto_increment
language	varchar(250)

loc
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
std_id	int(10) unsigned
obj_id_i	int(10) unsigned
location	varchar(100)
roww	mediumint(8) unsigned
description	varchar(150)

loccell
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
cell_id	int(10) unsigned
loc_id	int(10) unsigned

obj
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
ident	varchar(20)
name	varchar(200)
objtype_id	tinyint(3) unsigned
page	int(10) unsigned
wiki_id	tinyint(3) unsigned
newest	int(10) unsigned

objinfo
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id	int(10) unsigned
acttype_id	tinyint(3) unsigned
who	varchar(50)
comments	varchar(250)
time	timestamp

objtype
Field	Type	Null	Extra
id	tinyint(3)
objtype	varchar(30)

res
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
cell_id	int(20) unsigned
obs	int(10) unsigned
result	float
restext	varchar(250)	Yes

resinfo
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
restext	varchar(250)
who	varchar(50)
time	timestamp

resinfosec
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
restext	varchar(250)
who	varchar(50)
time	timestamp

ressec
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
cell_id	int(20) unsigned
obs	int(10) unsigned
result	float
restext	varchar(250)	Yes

sett
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id	int(10) unsigned
settype_id	tinyint(3) unsigned

settype
Field	Type	Null	Extra
id	tinyint(3) unsigned		auto_increment
settype	varchar(30)

wiki
Field	Type	Null	Extra
id	tinyint(3)
url	varchar(255)
wname	varchar(20)
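How the cell, loc, loccell and res tables could work together is sketched below. Several things here are assumptions made for illustration: the sketch uses an in-memory SQLite database from the Python standard library instead of MySQL, it keeps only a subset of the fields with simplified types, and it reads obj_id_v as the variable object, obj_id_i as the index object, and obs as the observation (sample) number. The identifier values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loc     (id INTEGER PRIMARY KEY, obj_id_i INTEGER, location TEXT);
CREATE TABLE cell    (id INTEGER PRIMARY KEY, obj_id_v INTEGER, mean REAL, n INTEGER);
CREATE TABLE loccell (id INTEGER PRIMARY KEY, cell_id INTEGER, loc_id INTEGER);
CREATE TABLE res     (id INTEGER PRIMARY KEY, cell_id INTEGER, obs INTEGER, result REAL);
""")

# Three locations, one per index (made-up index object ids 10, 11, 12).
conn.executemany(
    "INSERT INTO loc (id, obj_id_i, location) VALUES (?, ?, ?)",
    [(1, 10, "2009"), (2, 11, "Male"), (3, 12, "Height")],
)
# One cell of a made-up variable object 100, defined by those three locations.
conn.execute("INSERT INTO cell (id, obj_id_v, mean, n) VALUES (1, 100, 178.0, 1)")
conn.executemany(
    "INSERT INTO loccell (cell_id, loc_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3)],
)
conn.execute("INSERT INTO res (cell_id, obs, result) VALUES (1, 1, 178.0)")

# Retrieve every result of the variable together with the locations of its cell.
rows = conn.execute(
    """
    SELECT loc.location, res.obs, res.result
    FROM res
    JOIN cell    ON cell.id = res.cell_id
    JOIN loccell ON loccell.cell_id = cell.id
    JOIN loc     ON loc.id = loccell.loc_id
    WHERE cell.obj_id_v = ?
    ORDER BY loc.id
    """,
    (100,),
).fetchall()
```

The query returns one row per location of the cell, so a client would still have to pivot the rows back into one record per cell.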

Replacing some cells

There may be a large dataset in which only a few cells need to be updated while all others remain the same. How should this be done? There are a few potential alternatives.

Use the current replace functionality. Replace all cells, but most of them with their original values.
Use a new act_type that is similar to the current append functionality. This should be understood in such a way that if there are two (or more) identical cells (based on cell indices and locations), then the newest result is used and all older ones are discarded. (If the old append is used, then new information is just seen as a new row in the data table, not as a replacement of an existing row.)
Add a new field into the cell (?) table with an updated cell_id (in a similar way as act_id and series_id are used in the actobj table). This way, the new cell can automatically inherit all locations of the old cell.
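The second alternative (the newest of several identical cells wins) can be sketched as follows. The record layout and the function name `newest_per_cell` are hypothetical, and the sketch assumes that a larger act_id always means a newer upload.

```python
def newest_per_cell(records):
    """Keep only the newest result for each set of locations.

    records: iterable of (act_id, locations, result), where locations is a
    dict {index name: location}. Assumes a higher act_id is a newer act.
    """
    latest = {}
    for act_id, locations, result in records:
        key = frozenset(locations.items())  # identical cells share this key
        if key not in latest or act_id > latest[key][0]:
            latest[key] = (act_id, result)
    return {key: result for key, (_, result) in latest.items()}

records = [
    (1, {"Year": "2009", "Sex": "Male"}, 178.0),    # original upload
    (1, {"Year": "2010", "Sex": "Female"}, 168.0),  # original upload
    (2, {"Year": "2009", "Sex": "Male"}, 179.0),    # later correction of one cell
]
current = newest_per_cell(records)
```

Only the corrected cell has to be uploaded again; all other cells keep the value from the earlier act.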

Formula structure

Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the Opasnet Base. There are different approaches:

Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.

These approaches require new tables, namely Formula and Language.
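The deterministic approach can be illustrated with a small Python sketch in which the formula is applied sample by sample to the results of the parents. The function name `evaluate_formula` and the parent variables exposure and unit_risk are hypothetical, and representing the formula as a Python callable is an assumption; the page does not specify how the code field of the Formula table is interpreted.

```python
def evaluate_formula(formula, parents):
    """Recalculate a result from parent results, one sample at a time.

    parents: dict {parent name: list of samples}; all lists must have the
    same length so that sample i of the result uses sample i of each parent.
    """
    lengths = {len(samples) for samples in parents.values()}
    if len(lengths) != 1:
        raise ValueError("all parents must have the same number of samples")
    n = lengths.pop()
    return [
        formula(**{name: samples[i] for name, samples in parents.items()})
        for i in range(n)
    ]

# Hypothetical parents with three samples each.
parents = {"exposure": [1.0, 2.0, 3.0], "unit_risk": [0.5, 0.5, 0.25]}
result = evaluate_formula(lambda exposure, unit_risk: exposure * unit_risk, parents)
```

Pairing the samples by position keeps any dependence between the parents intact in the recalculated result.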

Comment: Do we need tables DIF and DIP like Uninet? --Jouni 21:50, 30 December 2009 (UTC)

DIP
- DIP_node_id
- DIP_parent_node_id
- DIP_corr_coeff
- DIP_parent_index
DIF
- DIF_node_id
- DIF_formula
- DIF_varnames_in_formula

All tables: Overview

We need Ressec (Result secure) and Resinfosec (Result info secure) tables for secure information. All other tables are openly readable except these two. They have the same structure as Res and Resinfo tables, respectively.

TODO: Move the descriptions from this table to the tables above. (Juha Villman, Einari Happonen)

**Tables_in_opasnet_base**
Table	Description
Acttype	List of action types
Cell	Cells of an object
Formula	Formulas for computing variable results
Item
Language	List of languages understood by the formula
Loc	Location information
Loccell	Locations of a cell
Log
Obj	Object information (all objects)
Objinfo	Additional information about the objects
Objtype	Types of objects
Res	Result distribution (actual values)
Resinfo	Additional description of the result
Resinfosec	Additional description of the result (secure version of Resinfo)
Ressec	Result distribution (actual values; secure version of Res)
Sett	Memberships of items in sets
Settype	Types of set-item memberships
Wiki	Wiki information

Universal Opasnet Base

The idea of the universal Opasnet Base is that it should be possible to store results in such a way that the results themselves are public but their interpretation is limited. For example, patient symptoms and clinical test results should be openly available for research, but information about whose results they are should be private. This can be achieved with the following database structure.

Let's say that it is enough to have two security levels, public and private. A person wants to record personal health information into the database. She logs in with her personal user name. The private profile gives the name (say, Liisa) and social security number of the person, while the public profile says only "40-50-year-old woman in Finland". Liisa writes down her symptoms and saves them. This is what is stored in the databases:

**Information stored in the public and private databases. The private database can read tables from the public one but not vice versa.**
Table, field	Private database	Public database
act.who	Liisa, 010160-1024	Woman, 40-50 a
act.when	2011-03-09 22:09:10	2011-03
obj.name	Personal reporting of health symptoms	Personal reporting of health symptoms
loccell.loc_id (locations and indices explained)	Person = 010160-1024; Time = 2011-03-09; Severity = Moderate; ICD-10 = Headache	Age = 40-50; Sex = Female; Country = Finland; Time = 2011-03; Severity = Moderate; ICD-10 = Headache
res.restext	Nothing. res table does not exist in the private part.	I had headache all morning, but it went away after I took ibuprofen.
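How a private act record could be turned into its public counterpart is sketched below, using the values from the table above. The function name `to_public` and the dictionary layout are hypothetical. The public profile is passed in as given text because the page does not say how it is derived from the private profile.

```python
from datetime import datetime

def to_public(private_act, public_profile):
    """Return the public version of a private act record.

    The person is replaced by the public profile text and the timestamp is
    coarsened to year and month.
    """
    when = datetime.strptime(private_act["when"], "%Y-%m-%d %H:%M:%S")
    return {"who": public_profile, "when": when.strftime("%Y-%m")}

private_act = {"who": "Liisa, 010160-1024", "when": "2011-03-09 22:09:10"}
public_act = to_public(private_act, "Woman, 40-50 a")
print(public_act)
```

The same kind of coarsening applies to the locations: the private Person and exact Time are replaced by Age, Sex, Country and the month in the public database.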