Opasnet Base: Difference between revisions

Revision as of 07:07, 19 April 2012

This page is a knowledge crystal of subtype method. The page identifier is Op_en5474
Moderator:Jouni (see all)
Citation of this page: Juha Villman, Einari Happonen, Jouni T. Tuomisto: Opasnet Base structure. Opasnet 2012. [1]. Accessed 21 Nov 2024.
Upload data {{#opasnet_base_link:Op_en5474}}

This page is about the Opasnet Base 2 -database.

Question

How to improve the existing Opasnet Base? Following issues of Opasnet Base 1 must be resolved:

Opasnet Base 1 structure makes filtering by location very slow on big data
Opasnet Base 1 structure is perhaps unnecessarily complex = queries are hard to adapt and read
MySQL is not ideal for storing huge amounts of data, or at least this is how we assume?
MySQL tables have fixed column types = difficult to store data objects with varying types of indices into one data table
Multiple languages (localization) not supported

Answer

MySQL is good on relations but weak on dynamic big data. Let's keep the basic "scaffold" in MySQL and store the big data (locations and results) into noSQL-base. After few days of research the best candidate for noSQL-base seems to be MongoDB. Combining relational MySQL and non-relational MongoDB will be the foundation for the new Opasnet Base 2.

Table structure in the database

All tables

acts
Uploads, updates, and other actions
Field	Type	Null	Extra	Key
id	int(10) unsigned	NO	auto_increment	PRI
obj_id	int(10) unsigned	NO
series_id	int(10) unsigned	NO
unit	varchar(64)	YES
type	ENUM('initial','replace','append')	NO
who	varchar(255)	NO
when	timestamp	NO
comments	varchar(255)	YES
lang	char(3)	NO	ISO 639-2, default:'eng'

----1: . I guess this should have field obj_id to identify which object we are updating. In the previous version, act - obj was many-to-many relationship, but it can as well be many-to-one which makes life easier; it is not important to know that two variables were uploaded in the same model run. Also series_id and unit could be in this table. --Jouni 22:56, 22 February 2012 (EET) (type: truth; paradigms: science: comment) ←--2: . I have now altered the table structure to implement Jouni's idea. --Einari 12:31, 19 March 2012 (EET) (type: truth; paradigms: science: defence)

objs
Object information (all objects)
Field	Type	Null	Extra	Key
id	int(10) unsigned	NO	auto_increment	PRI
ident	varchar(64)	NO		UNI
name	varchar(255)	NO
type	ENUM('variable','study','method','assessment','class','nugget','encyclopedia')	NO
page	int(10) unsigned	NO
wiki_id	tinyint(3) unsigned	NO

inds
Indices
Field	Type	Null	Extra	Key
id	int(10) unsigned	NO	auto_increment	PRI
series_id	int(10) unsigned	NO		UNI(with ident)
ident	varchar(64)	NO		UNI(with series_id)
type	ENUM('entity','number','time')	NO
name	varchar(255)	NO
page	int(10) unsigned	NO
wiki_id	tinyint(3) unsigned	NO
order_index	int(10) unsigned	NO
hidden	boolean	NO	false

wikis
Wiki information
Field	Type	Null	Extra	Key
id	tinyint(3)	NO		PRI
url	varchar(255)	NO
wname	varchar(255)	NO

MongoDB

Tables to store all locations and results

db.<objs.ident>.dat

Column names:
_sid(series.id), _aid (act.id), <inds.ident>, <inds.ident>, ..., <inds.ident>, _res (result)

Data:
<acts.series_id>, <acts.id>, <locs.id | data>, <locs.id | data>, ..., <locs.id | data>,

Index:
db.<objs.ident>.dat.ensureIndex({_sid:1, _aid:1});

----#: . I guess this is about loc.id's, not about ind.id's. Loc.id's are either identifiers explained below or values of continuous indices. --Jouni 22:56, 22 February 2012 (EET) (type: truth; paradigms: science: comment)
⇤--#: . It was actually about ind.id, but the line described column names, not the data itself. To avoid future misleads I added data describing line as well. Object data can consist of location identifiers or e.g. real numbers, depending of the type of the column index. --Einari 13:33, 19 March 2012 (EET) (type: truth; paradigms: science: attack)

Tables to store real location values for entity type indices

db.<obj.ident>.locs

Column names:
iid (index id), lid (location id), val

Index:
ensureIndex(array("iid" => 1, "lid" => 1));

←--#: . If the index is entity type, inds.id is the identifier of the index (whose details are in MySQL) and value is the name of the location. If the index is continuous, it does not need rows in this table, as everything necessary is described in the tables inds and db.<object.ident>.dat. --Jouni 22:56, 22 February 2012 (EET) (type: truth; paradigms: science: defence)

----#: . What is key? --Jouni 22:56, 22 February 2012 (EET) (type: truth; paradigms: science: comment)
←--#: . Key was something that I cannot really confirm. I altered the table description so that index id and location id together make the key (indexed in that order). This means that location values must always be accessed through index id. I believe that this is the fastest way in practice. --Einari 14:35, 19 March 2012 (EET) (type: truth; paradigms: science: defence)

@@ Line 162: / Line 162: @@
 Column names:<br/>
-ind_id (indexed), loc_id(indexed), value
+iid (index id), lid (location id), val
+Index:<br/>
+ensureIndex(array("iid" => 1, "lid" => 1));
 {{defend|# |If the index is entity type, inds.id is the identifier of the index (whose details are in MySQL) and value is the name of the location. If the index is continuous, it does not need rows in this table, as everything necessary is described in the tables inds and db.<object.ident>.dat.|--[[User:Jouni|Jouni]] 22:56, 22 February 2012 (EET)}}

Opasnet Base: Difference between revisions

Revision as of 07:07, 19 April 2012

Contents