Wikidata: Difference between revisions

From Opasnet
Line 22: Line 22:
=== Pilot for Opasnet - Opasnet Base - Wikidata - Wikipedia connection ===


:''Main article: [https://www.wikidata.org/wiki/User:Jtuom User page Jtuom in Wikidata]''
We start the pilot by adding some existing disease burden data from a credible source (IHME Institute) into Wikidata and linking to those data from Wikipedia.


Line 27: Line 28:


When adding data to Wikidata, there are some tricks you should know. These can also be found in the documentation, but I highlight some issues that may save time and trouble.
==== Items and properties ====
Properties define the relation between the item and the value. Technically, items and properties are wiki pages in Wikidata, and each has a unique identifier starting with Q (items) or P (properties). A particular property accepts only certain kinds of values. Often the values are also items, but sometimes they are numbers or strings.
These are common properties for '''causalities'''.
* [https://www.wikidata.org/wiki/Property:P1542 cause of] (P1542): the value (underlying cause). Inverse: has cause (P828). See '[https://www.wikidata.org/wiki/Help:Modeling_causes Help:Modeling causes]' for examples and discussion.
* [https://www.wikidata.org/wiki/Property:P1536 immediate cause of] (P1536) the value. Inverse: has immediate cause (P1478).
* [https://www.wikidata.org/wiki/Property:P1537 contributing factor of] (P1537) the value. Inverse: has contributing factor (P1479). For diseases, use this rather than has cause when linking risk factors, because a disease almost always has several risk factors, and often one risk factor works with others to cause a case. Also, even if a risk factor is the single cause of a particular case, we do not want to imply that ALL cases of that disease are caused by a single risk factor.
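The forward and inverse causality properties listed above can be kept in sync with a small lookup table. The following is only a minimal sketch: the function name <code>inverse_statement</code> and the placeholder item <code>Q_RISK_FACTOR</code> are hypothetical and are not part of Wikidata; only the property numbers and tuberculosis (Q12204) come from this page.

```python
# Causality property pairs from the list above: forward -> inverse.
INVERSE = {
    "P1542": "P828",   # cause of -> has cause
    "P1536": "P1478",  # immediate cause of -> has immediate cause
    "P1537": "P1479",  # contributing factor of -> has contributing factor
}
# Add the reverse direction so the lookup works both ways.
INVERSE.update({inverse: forward for forward, inverse in list(INVERSE.items())})


def inverse_statement(item, prop, value):
    """Return the mirrored (item, property, value) triplet.

    Hypothetical helper: Wikidata does not create inverse statements
    automatically, so each direction has to be added separately.
    """
    return (value, INVERSE[prop], item)


# "Q_RISK_FACTOR" is a placeholder, not a real item number.
forward = ("Q_RISK_FACTOR", "P1537", "Q12204")  # contributing factor of tuberculosis
print(inverse_statement(*forward))
```

For a risk factor of a disease, this yields the "has contributing factor" (P1479) statement that belongs on the disease item.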
All statements should have '''references'''. References are also statement triplets, nested within a main statement. Use the following kinds of references, in order of preference. For details, see [https://www.wikidata.org/wiki/Help:Sources Help:Sources].
* Use PubMed ID (P698) with a PubMed identification number for scientific articles.
* Use DOI (P356) if a PubMed ID is not available (non-medical literature).
* inventory number (P217) for URN numbers (URN itself is not a property!?)
* Stated in (P248) requires an item as the value. So, if you want to use e.g. a book as a reference, you must first add the book as an item to the database and then link to that item. See Help:Sources before doing this.
* Reference URL (P854) is the default property for some reason. However, use it only if the reference has already been added to Wikidata as an item.
** We probably should add URL items such as [[Burden of disease]] (for global numbers) and [[Burden of disease in Finland]].
* Imported from (P143) should be used if the data comes directly from another database. This is relevant if/when Opasnet (or part of it) is considered as a credible source and data can be directly downloaded from there.
** For example: Imported from (P143) Opasnet (Q7095608)
Organisations can also be used as references:
*  imported from (P143)? National Institute for Health and Welfare (Q4354957)
**  Institute for Health Metrics and Evaluation (Q6039400)
* [https://www.wikidata.org/wiki/Q7095608 Opasnet] (Q7095608)  in Wikidata
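The order of preference for reference properties described above can be sketched as a simple lookup. This is an illustration only: the dictionary keys (<code>pubmed_id</code>, <code>doi</code> and so on) and the function <code>pick_reference</code> are hypothetical names invented for this example, and the DOI and URL in the usage line are placeholders, not real sources.

```python
# Reference properties in the order of preference given above.
# The keys on the left are hypothetical field names for a source record.
REFERENCE_PREFERENCE = [
    ("pubmed_id", "P698"),        # PubMed ID
    ("doi", "P356"),              # DOI
    ("urn", "P217"),              # inventory number, used for URNs
    ("item", "P248"),             # stated in (value must be an item)
    ("url", "P854"),              # reference URL
    ("source_database", "P143"),  # imported from
]


def pick_reference(source):
    """Return (property, value) for the most preferred identifier present."""
    for key, prop in REFERENCE_PREFERENCE:
        value = source.get(key)
        if value:
            return (prop, value)
    return None  # no usable reference information


# Placeholder values only: the DOI wins over the URL.
print(pick_reference({"doi": "10.1000/example", "url": "https://example.org"}))
```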
Other useful:
* [https://www.wikidata.org/wiki/Wikidata:Userboxes Userboxes] to identify oneself in Wikidata. Babel box is useful, because then Wikidata offers languages you actually know.
* [https://www.wikidata.org/wiki/Wikidata:WikiProject_Medicine WikiProject Medicine]
We need properties and units of measurement. This is how it goes.
*'''Incidence''' and '''disease burden''' have been suggested as properties, and I hope they get accepted soon.
* '''Life expectancy''' already exists as a property (P2250). The unit '''year''' (Q577) is attached to it. See also [[:en:Life expectancy]]
* '''Applies to part''' (P518) property should be used for distinguishing subgroups such as sex, age etc.
* '''Point in time''' (P585) property should be used to distinguish the year of observation.
* Does an item for person-year exist, so that it can be used in units such as DALY / 100000 person-years? Or does it need to be a property? Use metre (Q11573) as an example of what properties a unit item should have.
* Other relevant properties:
**  route of administration (P636)
**  symptoms (P780)
** [[:en:Disease burden]] Wikipedia article
**  pathogen transmission process (P1060). process by which a pathogen is transmitted
**  prevalence (P1193) portion of a population with a given disease
**  tuberculosis (Q12204) example of a well-developed disease item. [[:en:Tuberculosis]] has an infobox example
**  Wikidata property for places (Q19829914) Wikidata property for describing places. Might be useful to characterise things in e.g. building model.
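How a quantity statement with a unit and qualifiers fits together can be sketched as a small data structure. This is an assumption-laden illustration, not the Wikidata data model itself: the class <code>Statement</code>, the function <code>life_expectancy</code>, the placeholder items <code>Q_EXAMPLE_AREA</code> and <code>Q_EXAMPLE_SUBGROUP</code>, and the number 80.0 are all made up for the example. Only P2250, Q577, P518 and P585 come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical, simplified view of one Wikidata statement."""
    item: str
    prop: str
    amount: float
    unit: str
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)


def life_expectancy(item, years, subgroup, year):
    """Build a life expectancy (P2250) statement in years (Q577)."""
    return Statement(
        item=item,
        prop="P2250",
        amount=years,
        unit="Q577",
        qualifiers={
            "P518": subgroup,   # applies to part: sex, age group etc.
            "P585": str(year),  # point in time: year of observation
        },
    )


# Placeholder items and a made-up value, for illustration only.
s = life_expectancy("Q_EXAMPLE_AREA", 80.0, "Q_EXAMPLE_SUBGROUP", 2013)
print(s)
```

The same structure would apply to incidence or disease burden once those properties exist, with a different property number and unit.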
Environmental and other pages where we might want to add infoboxes:
* [[:en:Dioxins and dioxin-like compounds]]
* [[:en:Disability-adjusted life year]]
* [[:en:Help:Infobox]]
* [[:en:Study 329]] example of a study with an infobox
* [[:en:Template:Infobox drug]]
* [[:en:Template:Infobox medical condition]]
* [[:en:Template:Infobox medical intervention]] maybe not relevant for public health?
* [[:en:Template:Infobox pandemic]] not much used but might be interesting to ROKO
* [[:en:Template:Infobox nutritional value]] might be interesting for RAVY
* [https://www.sotkanet.fi/sotkanet/fi/haku?g=284 sotkanet] these indicators might link to DALY estimates or calculations [http://www.terveytemme.fi/sairastavuusindeksi/]
Other things
* [[:en:Wikipedia:Meetup/NYC/Wikipedia Day 2016]]
* [[:en:File:Wikipeda Day 15 (2016) NYC Wikidata.pdf]]
* [https://query.wikidata.org/ Interface for making SPARQL queries to Wikidata]


== Rationale ==

Revision as of 14:37, 18 March 2016

Wikidata is an open database alongside Wikipedia, and it contains all kinds of information that can be used automatically on Wikipedia pages. It can also be queried from outside using the SPARQL language. It is based on a semantic structure of statement triplets: item, property, value. Each statement should be backed up with credible sources.
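As a rough sketch of what querying a statement triplet from outside could look like, the Python snippet below builds a SPARQL query and a request URL for the public query service at query.wikidata.org. The function names <code>values_query</code> and <code>query_url</code> are hypothetical, and the <code>format=json</code> parameter is an assumption about the service that should be checked against its documentation. The snippet only builds the URL and does not send any request.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"


def values_query(item_qid, prop_pid, limit=20):
    """SPARQL for all values of one property on one item (item, property, ?value)."""
    return f"""
SELECT ?value ?valueLabel WHERE {{
  wd:{item_qid} wdt:{prop_pid} ?value .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def query_url(sparql):
    """URL that asks the endpoint for JSON results (assumed parameter names)."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


# Example: symptoms (P780) of tuberculosis (Q12204).
sparql = values_query("Q12204", "P780")
print(sparql)
print(query_url(sparql))
```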

Question

How to combine Opasnet, Wikidata, and Wikipedia in a systematic and useful way so that

  • high-quality information is updated from Opasnet to Wikidata,
  • Wikipedia pages are updated to reflect this data,
  • the penetration of environmental health knowledge is as good as possible?

Answer


Open data flow from THL. The traditional scientific knowledge production is shown in the bottom right corner. In addition, THL has large sensitive datasets for administrative and research work. From there, synthetic data could be opened up (1) using ReplicaX and other anonymising tools. Then, anyone could develop and test statistical analysis code for the data (2). When ready, THL takes the code and runs it with real data without releasing the data outside of the institute (3). Finally, analysis results are published as knowledge crystals in Opasnet, and the actual analysis result is stored in Opasnet Base as open data.

This is a description of an idea about how THL could open up its large pool of sensitive data. Opasnet and Wikidata have central roles there, because a) there needs to be an open place where the data is shared and worked on, and b) from Wikidata, the information can be spread to all Wikipedias in any language; and Wikipedias are the most read websites in any language.

The traditional scientific knowledge production is shown in the bottom right corner. In addition, THL has large sensitive datasets for administrative and research work. From there, synthetic data could be opened up (1) using ReplicaX and other anonymising tools. Synthetic data is located in Opasnet Base and described on the Opasnet wiki. Each dataset can have its own page. Then, anyone can develop and test statistical analysis code for the data (2). Synthetic data can be accessed via Opasnet but also directly from the user's own computer using either the R (easier) or the JSON (trickier) interface. When ready, THL takes the code and runs it with real data without releasing the data outside the institute (3). Finally, analysis results are published as knowledge crystals in Opasnet, and the actual analysis result is stored in Opasnet Base as open data.

There is a need for a research policy discussion: how should the results be published? Is the basic rule that all code and results are published in Opasnet immediately when the code is run (after checking that the results do not reveal sensitive data)? Or do we give the code developer a head start of e.g. six months, after which the results are published? Or is this a service that is free of charge if the results are published immediately, but for which the author must pay to keep them hidden? The price could go up exponentially as a function of time to make sure that results are published eventually.

My first guess is that the actual policy is a matter of experimenting to see which options attract code developers. How strongly to nudge toward openness is also a value decision by THL.
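The idea of a price that rises exponentially with the time that results are kept hidden can be sketched as follows. Everything here is hypothetical: the function name <code>embargo_price</code>, the base price of 100 and the monthly growth factor of 1.5 are arbitrary illustration values, not a proposed THL policy.

```python
def embargo_price(months_hidden, base_price=100.0, monthly_growth=1.5):
    """Price of keeping results hidden for a given number of months.

    Hypothetical pricing rule: immediate publication is free, the first
    hidden month costs base_price, and each further month multiplies the
    price by monthly_growth.
    """
    if months_hidden <= 0:
        return 0.0
    return base_price * monthly_growth ** (months_hidden - 1)


for months in (0, 1, 3, 6, 12):
    print(months, round(embargo_price(months), 2))
```

With any growth factor above 1, the cost of each additional month eventually exceeds whatever benefit the author gets from keeping the results hidden, which is the point of the exponential form.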

Pilot for Opasnet - Opasnet Base - Wikidata - Wikipedia connection

Main article: User page Jtuom in Wikidata

We start the pilot by adding some existing disease burden data from a credible source (IHME Institute) into Wikidata and linking to those data from Wikipedia.

The first data set to be uploaded to Wikidata is the burden of disease estimates: global estimates from the IHME Institute, to be used in English Wikipedia to populate disease and risk factor pages. For details of this work, see Burden of disease.

When adding data to Wikidata, there are some tricks you should know. These can also be found in the documentation, but I highlight some issues that may save time and trouble.

Rationale

Based on some practical learning and testing of the Wikidata website.

See also

Related files