Opasnet base structure: Difference between revisions

Revision as of 10:26, 13 January 2010

This page is a knowledge crystal of subtype variable. The page identifier is Op_en1913
Moderator:Jouni (see all)


This page is about the structure of Opasnet Base. For a general description, see Opasnet base.

Scope

Opasnet base is a storage and retrieval system for results of variables and data from studies. What is the structure of Opasnet base such that it enables the following functionalities?

Storage of results of variables, with uncertainties when necessary and as multidimensional arrays when necessary.
Automatic retrieval of results when called from Opasnet wiki or other platforms or modelling systems.
Description and handling of the dimensions that a variable may take.
Protection of some results and data from reading by unauthorised persons.
Building of user interfaces for easily entering observations into the Base.

Definition

Data

Software

Because Opasnet base will contain very large amounts of mostly numerical information, the state-of-the-art structure is a SQL database. Because of its flexibility, ease of use, and cost, MySQL is an optimal choice among SQL software. In addition to the database software, a variable transfer protocol is needed on top of it, so that the results of variables can be retrieved and new results stored either automatically by calculation software or manually by the user. More elaborate presentation software can be built on top of the database, but that is not the topic of this page.

Storage and retrieval of results of variables

The most important functionality is to store and retrieve the results of variables. Because variables may take very different forms (from a single value such as a natural constant to an uncertain spatio-temporal concentration field over the whole of Europe), the database must be very flexible. The basic solution is described on the variable page, and it is only briefly summarised here. The result is described as

  P(R|x₁,x₂,...)

where P(R) is the probability distribution of the result, and x₁, x₂, ... are the locations along each dimension for which that particular P(R) applies. Locations are typically operationalised as discrete indices. A variable must have at least one dimension. Uncertainty about the true value of the variable is operationalised as a random sample from the probability distribution: the samples are located along an index called Sample, which is the list of integers 1, 2, 3, ..., n, where n is the number of samples.
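The description above can be illustrated with a small sketch. It is not the Opasnet implementation: the index names (Year, City), the normal distribution and the helper `samples_at` are assumptions made only for illustration. Each combination of locations gets its own set of samples, and the samples are addressed along the Sample index 1..n.

```python
import random
from itertools import product

# Hypothetical indices (dimensions) and their locations; names are illustrative.
indices = {
    "Year": [2009, 2010],
    "City": ["Helsinki", "Kuopio"],
}
n_samples = 5  # n = number of samples along the Sample index

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One entry per (location in each dimension, Sample) combination.
result = {}
for locations in product(*indices.values()):
    for sample in range(1, n_samples + 1):
        # Assumed distribution; any sampling scheme could be used here.
        result[locations + (sample,)] = rng.gauss(10.0, 2.0)

def samples_at(*locations):
    """Return the samples approximating P(R | locations), ordered by Sample."""
    return [result[locations + (s,)] for s in range(1, n_samples + 1)]

cell = samples_at(2010, "Kuopio")
mean = sum(cell) / len(cell)  # a summary statistic computed from the samples
```

A variable without uncertainty would simply have n = 1, so the same layout covers both deterministic and probabilistic results.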

Old description of the structure

Dependencies

Result

Opasnet base is a MySQL database located at http://base.opasnet.org.

Table structure

All tables

NOTE! Update this table when table structure is changed!

act
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
series_id	int(10) unsigned
acttype_id	tinyint(3) unsigned
who	varchar(50)
comments	varchar(250)	Yes
time	timestamp
temp_id	int(10) unsigned

actobj
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
act_id	int(10) unsigned
obj_id	int(10) unsigned
series_id	int(10) unsigned

acttype
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
acttype	varchar(250)

cell
Field	Type	Null	Extra
id	int(12) unsigned		auto_increment
obj_id_v	int(10) unsigned
obj_id_r	int(10) unsigned
actobj_id	int(10) unsigned
mean	float	Yes
sd	float
n	int(10)

formula
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id_v	int(10) unsigned
act_id	int(10) unsigned
actobj_id	int(10) unsigned
language	smallint(5) unsigned
code	longtext	Yes

item
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
sett_id	int(10) unsigned
obj_id	int(10) unsigned
fail	tinyint(1) unsigned

language
Field	Type	Null	Extra
id	tinyint(3) unsigned		auto_increment
language	varchar(250)

loc
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
std_id	int(10) unsigned
obj_id_i	int(10) unsigned
location	varchar(100)
roww	mediumint(8) unsigned
description	varchar(150)

loccell
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
cell_id	int(10) unsigned
loc_id	int(10) unsigned

obj
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
ident	varchar(20)
name	varchar(200)
unit	varchar(64)
objtype_id	tinyint(3) unsigned
page	int(10) unsigned
wiki_id	tinyint(3) unsigned
newest	int(10) unsigned

objinfo
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id	int(10) unsigned
acttype_id	tinyint(3) unsigned
who	varchar(50)
comments	varchar(250)
time	timestamp

objtype
Field	Type	Null	Extra
id	tinyint(3)
objtype	varchar(30)

res
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
cell_id	int(20) unsigned
obs	int(10) unsigned
result	float
restext	varchar(250)	Yes

resinfo
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
restext	varchar(250)
who	varchar(50)
time	timestamp

resinfosec
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
restext	varchar(250)
who	varchar(50)
time	timestamp

ressec
Field	Type	Null	Extra
id	bigint(20) unsigned		auto_increment
cell_id	int(20) unsigned
obs	int(10) unsigned
result	float
restext	varchar(250)	Yes

sett
Field	Type	Null	Extra
id	int(10) unsigned		auto_increment
obj_id	int(10) unsigned
settype_id	tinyint(3) unsigned

settype
Field	Type	Null	Extra
id	tinyint(3) unsigned		auto_increment
settype	varchar(30)

wiki
Field	Type	Null	Extra
id	tinyint(3)
url	varchar(255)
wname	varchar(20)
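To show how the tables listed above could fit together when a result is retrieved, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, keeps only a few columns per table, and fills them with invented toy rows (the identifiers Op_en0001 and Op_en0002 are hypothetical). The reading of loc.obj_id_i as "the object that is the index" is an assumption, as is the exact join path.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL base
con.executescript("""
CREATE TABLE obj     (id INTEGER PRIMARY KEY, ident TEXT, name TEXT, unit TEXT);
CREATE TABLE actobj  (id INTEGER PRIMARY KEY, act_id INTEGER, obj_id INTEGER);
CREATE TABLE cell    (id INTEGER PRIMARY KEY, actobj_id INTEGER);
CREATE TABLE loc     (id INTEGER PRIMARY KEY, obj_id_i INTEGER, location TEXT);
CREATE TABLE loccell (id INTEGER PRIMARY KEY, cell_id INTEGER, loc_id INTEGER);
CREATE TABLE res     (id INTEGER PRIMARY KEY, cell_id INTEGER, obs INTEGER, result REAL);

-- Toy content: one variable (obj 1) indexed by one index (obj 2, "Year").
INSERT INTO obj VALUES (1, 'Op_en0001', 'Example variable', 'ug/m3'),
                       (2, 'Op_en0002', 'Year', '');
INSERT INTO actobj VALUES (1, 1, 1);
INSERT INTO cell VALUES (1, 1), (2, 1);
INSERT INTO loc VALUES (1, 2, '2009'), (2, 2, '2010');
INSERT INTO loccell VALUES (1, 1, 1), (2, 2, 2);
INSERT INTO res VALUES (1, 1, 1, 9.5), (2, 1, 2, 10.5),
                       (3, 2, 1, 12.0), (4, 2, 2, 13.0);
""")

# Follow obj -> actobj -> cell, attach locations via loccell -> loc,
# and read the samples of each cell from res.
rows = con.execute("""
SELECT idx.name, loc.location, res.obs, res.result
FROM obj AS var
JOIN actobj     ON actobj.obj_id = var.id
JOIN cell       ON cell.actobj_id = actobj.id
JOIN loccell    ON loccell.cell_id = cell.id
JOIN loc        ON loc.id = loccell.loc_id
JOIN obj AS idx ON idx.id = loc.obj_id_i
JOIN res        ON res.cell_id = cell.id
WHERE var.ident = ?
ORDER BY cell.id, res.obs
""", ("Op_en0001",)).fetchall()
```

Each returned row is one sample (obs) of one cell, labelled with the index name and location that the cell belongs to.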

Formula structure

Now it has become clear that it is not enough to have samples of the result distributions. It must be possible to completely recalculate the result based on the information in the Opasnet Base. There are different approaches:

Calculate the result based on a formula that may refer to other variables called parents. This is a deterministic approach.
Calculate the result based on the marginal distribution and (conditional) rank correlations with parent variables. This is a probabilistic approach.

This approach requires new tables, namely Formula and Language.
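As an illustration of the deterministic approach, the following sketch recalculates a variable from a stored formula that refers to its parents. The variable names, the dictionary layout and the use of Python expressions as the formula language are all assumptions; the Formula table only specifies that code is stored as text together with a language.

```python
# Hypothetical variables: each has a formula (stored as text) and a list of parents.
formulas = {
    "emission": {"parents": [], "code": "120.0"},
    "intake_fraction": {"parents": [], "code": "0.002"},
    "intake": {
        "parents": ["emission", "intake_fraction"],
        "code": "emission * intake_fraction",
    },
}

def recalculate(name, cache=None):
    """Recompute a variable from its formula, recalculating parents first."""
    cache = {} if cache is None else cache
    if name not in cache:
        entry = formulas[name]
        parent_values = {p: recalculate(p, cache) for p in entry["parents"]}
        # eval keeps the sketch short; builtins are disabled, and a real
        # system would need a properly sandboxed interpreter per language.
        cache[name] = eval(entry["code"], {"__builtins__": {}}, parent_values)
    return cache[name]

intake = recalculate("intake")  # parents are evaluated before the child
```

The cache ensures that a parent shared by several children is computed only once per recalculation.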

----11: . Do we need tables DIF and DIP like Uninet? --Jouni 21:50, 30 December 2009 (UTC) (type: truth; paradigms: science: comment)

DIP
- DIP_node_id
- DIP_parent_node_id
- DIP_corr_coeff
- DIP_parent_index
DIF
- DIF_node_id
- DIF_formula
- DIF_varnames_in_formula

All tables: Overview

We need Ressec (Result secure) and Resinfosec (Result info secure) tables for secure information. All other tables are openly readable except these two. They have the same structure as Res and Resinfo tables, respectively.

**Tables_in_opasnet_base**
Table	Description
Acttype	List of action types
Cell	Cells of an object
Formula	Formulas for computing variable results
Item
Language	List of languages understood by the formula
Loc	Location information
Loccell	Locations of a cell
Log
Obj	Object information (all objects)
Objinfo	Additional information about the objects
Objtype	Types of objects
Res	Result distribution (actual values)
Resinfo	Additional description of the result
Resinfosec	Additional description of the result
Ressec	Result distribution (actual values)
Sett	Memberships of items in sets
Settype	Types of set-item memberships
Wiki	Wiki information

⇤--3: . Res and Resinfo should be merged. Similarly, Ressec and Resinfosec should be merged. --Jouni 09:48, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

⇤--4: . We must divide Obj into Obj and Act. Act table (previously Objinfo) would contain uploads (previously Runs) and formula updates. --Jouni 09:48, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

←--12: . Objinfo should be renamed Act. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: defence)

←--13: . Uploads (previously "runs") should no longer be called objects. They belong to table Act. As a side effect, one upload refers to exactly one object (i.e., when uploading a full model, all objects will have different upload numbers). --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: defence)

Main tables

**Tables with primary content**
Table	Field	Type	Null	Key	Default	Extra
Cell	id	int(12) unsigned	NO	PRI	NULL	auto_increment
Cell	Obj_id_v	int(10) unsigned	NO	MUL	NULL
Cell	Obj_id_r	int(10) unsigned	NO		NULL
Cell	Mean	float	YES		NULL
Cell	SD	float	NO		NULL
Cell	N	int(10)	NO		NULL
Formula	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Formula	Obj_id_v	int(10) unsigned	NO	MUL	NULL
Formula	Act_id	int(10) unsigned	NO	MUL	NULL
Formula	Language	smallint(5) unsigned	NO		NULL
Formula	Code	longtext	YES		NULL
Loc	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Loc	Std_id	int(10) unsigned	NO	MUL	NULL
Loc	Obj_id_i	int(10) unsigned	NO	MUL	NULL
Loc	Location	varchar(100)	NO		NULL
Loc	Roww	mediumint(8) unsigned	NO		NULL
Loc	Description	varchar(150)	NO		NULL
Loccell	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Loccell	Cell_id	int(10) unsigned	NO	MUL	NULL
Loccell	Loc_id	int(10) unsigned	NO		NULL
Obj	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Obj	Ident	varchar(20)	NO	UNI	NULL
Obj	Name	varchar(200)	NO		NULL
Obj	Unit	varchar(64)	NO		NULL
Obj	Objtype_id	tinyint(3) unsigned	NO	MUL	NULL
Obj	Page	int(10) unsigned	NO		NULL
Obj	Wiki_id	tinyint(3) unsigned	NO		NULL
Obj	Newest	int(10) unsigned	NO	MUL	NULL
Objinfo	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Objinfo	Obj_id	int(10) unsigned	NO	MUL	NULL
Objinfo	Acttype_id	tinyint(3) unsigned	NO		NULL
Objinfo	Who	varchar(50)	NO		NULL
Objinfo	Comments	varchar(250)	NO		NULL
Objinfo	Moment	timestamp	NO		CURRENT_TIMESTAMP
Res	id	bigint(20) unsigned	NO	PRI	NULL	auto_increment
Res	Cell_id	int(12) unsigned	NO	MUL	NULL
Res	Obs	int(10) unsigned	NO		NULL
Res	Result	float	NO		NULL
Resinfo	id	bigint(20) unsigned	NO	PRI	NULL	auto_increment
Resinfo	Restext	varchar(250)	NO		NULL
Resinfo	Who	varchar(50)	NO		NULL
Resinfo	When	timestamp	NO		CURRENT_TIMESTAMP
Resinfosec	Resinfosec must have the identical structure to Resinfo.
Ressec	Ressec must have the identical structure to Res.

⇤--5: . We should add Res.Formula_id. --Jouni 21:50, 30 December 2009 (UTC) (type: truth; paradigms: science: attack)

⇤--9: . Cell.Obj_id_r should be renamed Cell.Act_id. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

----14: . Should we move Obj.Unit to Cell? This may be useful, because different columns in an object (especially a study) may have different units. On the other hand, it will make things more complicated. This does not work unless we develop a good interface for entering units to different columns. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: comment)

⇤--15: . Do we really need Obj.Newest? It is just redundant, although it may save computing time at some point. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

←--': . Yes we need it. --Juha Villman 11:12, 4 January 2010 (UTC) (type: truth; paradigms: science: defence)

⇤--16: . I think that Resinfo.Who and Resinfo.When should rather be merged with Act. Then, when a user uploads new data (even a single line with a wiki form), the upload would always have a new act_id with a timestamp. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

⇤--17: . Resinfo.Restext should rather be a text field, not varchar with a fixed length. Is there a difference? --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

←--18: . Varchar is almost always faster. We should stick to it --Juha Villman 11:12, 4 January 2010 (UTC) (type: truth; paradigms: science: defence)

⇤--18: . The Log table must be read protected. We must give the users an opportunity to use the data without other users being able to know about that. In contrast, uploading data is something that should be open information. --Jouni 06:27, 5 January 2010 (UTC) (type: truth; paradigms: science: attack)

⇤--19: . Loc.Roww should be removed, because it is based on old thinking that did not work out. --Jouni 21:00, 5 January 2010 (UTC) (type: truth; paradigms: science: attack)

⇤--20: . A problem: The data of one object can be uploaded at several times, e.g. if the object is the result of an online survey. Therefore, act.id is not a unique identifier for a full dataset. This can be clarified by adding two different act types: upload data (replace) and upload data (append). The former starts a new dataset that replaces the previous one, while the latter simply adds new data points to the existing data. Two problems still remain:

The nomenclature should be clarified: what is upload, act, and what is the collection of data?

There should be an unambiguous (and hopefully simple) way of performing queries so that all appends of the same data are included. --Jouni 10:34, 6 January 2010 (UTC) (type: truth; paradigms: science: attack)

----21: . Solution: A new field act.bunch_id is the id of the latest act.id of upload data (replace) for that object. This can be updated automatically. All uploads are used where obj.newest eq act.bunch_id and act.id eq cell.act_id. --Jouni 10:34, 6 January 2010 (UTC) (type: truth; paradigms: science: comment)
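The selection rule in comment 21 can be sketched as a query. This is only an illustration: act.bunch_id and cell.act_id are fields proposed in the comments, not part of the listed table structure, and the sketch runs on Python's sqlite3 with invented toy rows rather than on the MySQL base.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL base
con.executescript("""
CREATE TABLE obj  (id INTEGER PRIMARY KEY, ident TEXT, newest INTEGER);
CREATE TABLE act  (id INTEGER PRIMARY KEY, acttype TEXT, bunch_id INTEGER);
CREATE TABLE cell (id INTEGER PRIMARY KEY, obj_id_v INTEGER, act_id INTEGER);

-- Acts 1-2 form an old dataset; act 3 replaces it and acts 4-5 append to it.
INSERT INTO act VALUES (1, 'replace', 1), (2, 'append', 1),
                       (3, 'replace', 3), (4, 'append', 3), (5, 'append', 3);
-- obj.newest points at the latest 'replace' act of the object.
INSERT INTO obj VALUES (1, 'Op_en0001', 3);
INSERT INTO cell VALUES (10, 1, 1), (11, 1, 2), (12, 1, 3), (13, 1, 4), (14, 1, 5);
""")

# Use all uploads where obj.newest = act.bunch_id and act.id = cell.act_id.
current_cells = [row[0] for row in con.execute("""
SELECT cell.id
FROM obj
JOIN cell ON cell.obj_id_v = obj.id
JOIN act  ON act.id = cell.act_id
WHERE obj.ident = ? AND act.bunch_id = obj.newest
ORDER BY cell.id
""", ("Op_en0001",))]
```

Cells from the replaced dataset (acts 1 and 2) are left out, while the latest replace and all of its appends are included.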

We need a query for updating act.series_id based on information from act, actobj and obj tables. This is the first attempt, but does not work yet:

  SELECT *
  FROM actobj AS appobj
  LEFT JOIN act AS appends ON appobj.act_id = appends.id
  LEFT JOIN actobj AS repobj ON repobj.obj_id = appobj.obj_id
  LEFT JOIN act AS replaces ON repobj.act_id = replaces.id
  WHERE appends.acttype_id = 5 AND replaces.acttype_id = 4;
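One possible way to fill act.series_id is sketched below; it is not a tested fix for the query above. It assumes that act ids increase with time, that acttype_id 4 means upload (replace) and 5 means upload (append) as in that query, and that the series of an act is the latest replace act of the same object at or before it. The sketch runs on Python's sqlite3 with invented rows. MySQL does not allow the updated table to appear in a subquery of the same UPDATE, so on the real base the subquery would have to be rewritten, for example as a multi-table UPDATE or with a derived table.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL base
con.executescript("""
CREATE TABLE act    (id INTEGER PRIMARY KEY, acttype_id INTEGER, series_id INTEGER);
CREATE TABLE actobj (id INTEGER PRIMARY KEY, act_id INTEGER, obj_id INTEGER);

-- acttype_id 4 = upload (replace), 5 = upload (append), as in the query above.
INSERT INTO act (id, acttype_id) VALUES (1, 4), (2, 5), (3, 4), (4, 5), (5, 4);
-- Acts 1-4 concern object 1; act 5 concerns object 2.
INSERT INTO actobj (act_id, obj_id) VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2);
""")

# For every upload, find the latest 'replace' act of the same object
# whose id is not greater than the upload's own id.
con.execute("""
UPDATE act
SET series_id = (
    SELECT MAX(rep.id)
    FROM actobj AS mine
    JOIN actobj AS repobj ON repobj.obj_id = mine.obj_id
    JOIN act    AS rep    ON rep.id = repobj.act_id
    WHERE mine.act_id = act.id
      AND rep.acttype_id = 4
      AND rep.id <= act.id
)
WHERE acttype_id IN (4, 5)
""")

series = dict(con.execute("SELECT id, series_id FROM act ORDER BY id"))
```

With these toy rows, a replace act starts its own series and every append joins the series of the preceding replace for the same object.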

Tables with additional information

**Tables with additional information**
Table	Field	Type	Null	Key	Default	Extra
Item	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Item	Sett_id	int(10) unsigned	NO	MUL	NULL
Item	Obj_id	int(10) unsigned	NO		NULL
Item	Fail	tinyint(1) unsigned	NO	MUL	NULL
Log	id	int(10)	NO	PRI	NULL	auto_increment
Log	ip	varchar(15)	NO		NULL
Log	wiki_uid	varchar(30)	YES		NULL
Log	wiki_page	varchar(30)	YES		NULL
Log	time	timestamp	NO		0000-00-00 00:00:00
Log	query	varchar(256)	NO		NULL
Sett	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Sett	Obj_id	int(10) unsigned	NO	MUL	NULL
Sett	Settype_id	tinyint(3) unsigned	NO		NULL

Tables with lists

**Tables with lists**
Table	Field	Type	Null	Key	Default	Extra
Acttype	id	int(10) unsigned	NO	PRI	NULL	auto_increment
Acttype	Acttype	varchar(250)	NO	UNI	NULL
Language	id	tinyint(3) unsigned	NO	PRI	NULL	auto_increment
Language	Language	varchar(250)	NO	UNI	NULL
Objtype	id	tinyint(3)	NO	PRI	NULL
Objtype	Objtype	varchar(30)	NO		NULL
Settype	id	tinyint(3) unsigned	NO	PRI	NULL	auto_increment
Settype	Settype	varchar(30)	NO		NULL
Wiki	id	tinyint(3)	NO	PRI	NULL
Wiki	Url	varchar(255)	NO		NULL
Wiki	Wname	varchar(20)	NO		NULL

⇤--10: . Objtype.id and Wiki.id should be auto increments. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

Show contents of {{#sql-query: SELECT * FROM objtype |objtype}} | {{#sql-query: SELECT * FROM acttype |acttype}} | {{#sql-query: SELECT * FROM wiki |wiki}} | {{#sql-query: SELECT * FROM language |language}} | {{#sql-query: SELECT * FROM settype |settype}}

Tasks performed

⇤--6: . All timestamps (Resinfo.When, Objinfo.Moment) should be renamed to .Time. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

----': . DONE --Juha Villman 11:23, 5 January 2010 (UTC) (type: truth; paradigms: science: comment)

⇤--7: . All column names should consistently start with either a capital or a small letter. --Jouni 12:55, 31 December 2009 (UTC) (type: truth; paradigms: science: attack)

----': . DONE. All table and column names start with small letter. --Juha Villman 08:58, 5 January 2010 (UTC) (type: truth; paradigms: science: comment)

I restructured the Base by adding a few tables (act, actobj, language) and columns (act.temp_id, cell.actobj_id, formula.actobj_id, res.restext, ressec.restext).

The following queries were run to make new connections possible with the new structure (obj-actobj-cell-act instead of the old poorer way obj-cell-obj as run-objinfo).

This query was used to fetch all the object - upload (run) pairs

  SELECT cell.obj_id_v AS obj_id, act.id AS act_id, cell.obj_id_r AS run_id, objinfo.id, act.temp_id
  FROM objinfo, cell, act
  WHERE cell.obj_id_r = objinfo.obj_id
    AND objinfo.id = act.temp_id
  GROUP BY cell.obj_id_r, cell.obj_id_v;

This query was used to fill the actobj table

  INSERT IGNORE INTO actobj (act_id, obj_id)
  SELECT act.id AS act_id, cell.obj_id_v AS obj_id
  FROM objinfo, cell, act
  WHERE cell.obj_id_r = objinfo.obj_id
    AND objinfo.id = act.temp_id
  GROUP BY cell.obj_id_r, cell.obj_id_v;

This query was used to update the actobj column in the cell table

  UPDATE objinfo, cell, act, actobj
  SET cell.actobj_id = actobj.id
  WHERE cell.obj_id_r = objinfo.obj_id
    AND objinfo.id = act.temp_id
    AND act.id = actobj.act_id
    AND cell.obj_id_v = actobj.obj_id;

As a result, the Base seemed to work exactly like it should. Note! Existing columns were NOT changed, so that all previous queries should work just like before. However, with the new structure, it is now possible to start updating the queries.
