Kennisnet/Structured data

From Meta, a Wikimedia project coordination wiki

I am happy to announce that Kennisnet is willing to pay for the initial development of the ultimate Wiktionary. This will allow us to provide a substantial improvement in the Wiktionary functionality and it will ease the inclusion and prominence of words in languages like Papiamento.

The primary goal of this cooperation is to increase the usability of the Wiktionary projects, in part by unifying the currently split languages into one project (see the description of goals at the ultimate Wiktionary), but also by allowing the project to be structurally searchable as expressed in the Wikidata and WikiDB ideas.

Structured data is essential in the creation of the ultimate Wiktionary; it is also an important addition to the MediaWiki functionality. A proposal which encompasses these goals and more is the Wikidata idea by Erik Möller (Eloquence). In the Wikidata concept, every wiki page can follow a schema which describes its allowed individual data components and possible relations to other pages. As such, Wikidata will be strongly integrated into the software. Erik has volunteered to implement his idea in 5 milestones, as described in his proposal.

If you are willing to take on this task and believe you are better qualified than Erik to implement it, or if you have a different concept, you can apply for this development contract until March 17, 2005 (use the talk page to do so). If nobody else is interested, Erik can start programming on the 17th. If other people are interested, a decision will be made on the 24th, based on the merits of the proposals and the discussions about them on Meta.

If you have doubts about the technical feasibility of this project, you have until the 17th to express them. A go/no-go decision will be made on the 24th.

Please keep in mind:

  1. The value of the contract is 5000 EUR (gross)
  2. Kennisnet will pay the money directly, after receiving an invoice from the developer.
  3. The timeframe for completion of the software is 3-6 months.
  4. The key goals are:
    • enabling relational content in Wikimedia projects, such as Wiktionary or Wikispecies
    • importing existing structured databases like GEMET
    • enabling the unification of all the Wiktionaries into one project
  5. The code should be as application-agnostic as reasonably possible
  6. The implementation of a working ultimate Wiktionary is key to the conclusion of this project.

NB: If people express an interest in implementing the described functionality and propose alternatives, Erik’s proposal will be split off from this article and the relative merits of the proposals will be discussed on the respective talk pages.

Thanks,

 GerardM 13:41, 10 Mar 2005 (UTC)


Erik's proposal

This is a summary of the steps required to enrich MediaWiki-based wiki websites with structured data, as described in the Wikidata proposal [1]. The time of development is estimated to be 3 months, at a total cost of EUR 5000.

Current situation

March 6, 2005 -- The current stable version of MediaWiki is 1.3.11, but version 1.4 is now in Release Candidate stage 1 and actively used by many websites, including all Wikimedia sites (Wikipedia, Wiktionary, and so forth). Version 1.5 is still in the pre-alpha stage. It is a major change from 1.4; most importantly, the entire database schema has been redesigned to make queries more efficient and to ensure that every article revision has a unique, permanent identity (see http://meta.wikimedia.org/wiki/Proposed_Database_Schema_Changes/October_2004 for details).

Wikidata must be based on the new database schema. The completion of the 1.5 release and, with it, the schema transition, is therefore a prerequisite. The new schema has not yet been properly tested, and putting it into use on the Wikimedia sites after local tests will provide the empirical data needed to judge whether it is a scalable basis to build upon.

The main outstanding issues in 1.5 are:

  • Deletion does not yet interact properly with the new database schema.
  • E-mail authentication is broken.
  • New user/groups permission system is incomplete.
  • Watchlist functionality is broken.
  • Support for MySQL 4.1 and 5.0 is incomplete.

The issues are being tracked at http://bugzilla.wikimedia.org/show_bug.cgi?id=1002 (currently not up-to-date, as release work on 1.4 is taking precedence).

Milestones

Under the assumption that development begins on March 15, 2005, the following timetable is proposed:

Date                 Goal
-------------------  --------------------------------------------------
March 15 - March 31  + assist 1.5 release work to aim for an April beta
                     MILESTONE 1: FIRST MEDIAWIKI 1.5 BETA RELEASE
April 1  - April 30  + implement basic Wikidata storage and retrieval
                     + per-namespace description of data structures
                     + import simple example data (movie descriptions)
                     MILESTONE 2: WIKIDATA PROTOTYPE I
May 1    - May 15    + evaluate experiences from 1.5 beta and make
                       necessary changes to schema
                     + import GEMET data
                     + improve data display and retrieval
                     MILESTONE 3: WIKIDATA PROTOTYPE II
May 16   - May 31    + create basic Wiktionary Wikidata schema
                     + implement data display templates
                     MILESTONE 4: WIKIDATA PROTOTYPE III
June 1   - June 15   + implement data entry and structure templates
                     + testing and improvements based on user feedback
                     MILESTONE 5: WIKIDATA RELEASE IN MEDIAWIKI 1.6+

Updated timetable

The timetable above is outdated; the updated timetable can be found at Wikidata/Timetable.

MILESTONE 1: FIRST MEDIAWIKI 1.5 BETA RELEASE

Focus during this time will be on testing and completing work on the new database schema and making sure that it does not clash with any Wikidata requirements. The milestone is reached when the first public beta version of MediaWiki 1.5 is released. This release is critical to the development of Wikidata. Very likely, MediaWiki 1.5 will be put into use on the Wikimedia websites very soon after its first beta release (this is the experience from past betas).

MILESTONE 2: WIKIDATA PROTOTYPE I

Work on this prototype can happen within a separate branch of MediaWiki or within the HEAD branch, depending on the status of 1.5 at this time. The first prototype of Wikidata must allow the creation of a wiki-editable database within a MediaWiki database. The widgets for data entry as well as the data structure will be hard-coded instead of being user-definable as would be desirable from a user point of view.

On the other hand, the backend should be almost fully developed. The storage layer of the prototype must include revision histories for every individual item of data (which do not yet have to be user-visible). The 1.5 database schema must be amended to add support for different types of data that can be associated with a page (short texts, numbers, dates; Wikidata typing is minimal, type checking happens primarily on the code level).
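
The storage requirement described above (a revision history per individual data item, with minimal typing enforced in code) can be illustrated with a minimal Python sketch. It is not the proposed MediaWiki implementation: the names `Revision` and `DataItem`, the in-memory lists, and the `author` field are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union

# Hypothetical value types; the proposal only names short texts, numbers and dates.
Value = Union[str, int, float, date]


@dataclass
class Revision:
    value: Value
    author: str  # assumed field, not specified in the proposal


@dataclass
class DataItem:
    """One individual item of data attached to a page, with its own history."""
    name: str
    expected_type: type
    revisions: List[Revision] = field(default_factory=list)

    def set(self, value: Value, author: str) -> None:
        # Type checking happens in code, not in the storage layer.
        if not isinstance(value, self.expected_type):
            raise TypeError(f"{self.name} expects {self.expected_type.__name__}")
        # Every change appends a revision; nothing is overwritten.
        self.revisions.append(Revision(value, author))

    @property
    def current(self) -> Value:
        # Raises IndexError if the item has never been set.
        return self.revisions[-1].value


year = DataItem("year", int)
year.set(1926, "alice")
year.set(1927, "bob")
```

Because each item keeps its own list of revisions, the history does not have to be user-visible for it to exist in the backend, which matches the scope of this prototype.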

The search functionality in this prototype should be very basic, but should at least allow Wikidata records to be retrieved by a unique ID. The example data, a collection of records about movies (year, actors, etc.), would have to be fully viewable and make basic use of wiki-style cross-referencing. Every data element must be flaggable as language-dependent -- language-dependent elements can then be entered in different languages.
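
As a rough illustration of retrieval by unique ID and of language-dependent elements, the following Python sketch may help. The `Record` class, the `store` dictionary and the `find_by_id` helper are hypothetical names, and the movie record is example data chosen here, not data from the proposal.

```python
from typing import Dict, Optional


class Record:
    """A movie-style record; all names here are hypothetical."""

    def __init__(self, record_id: int) -> None:
        self.record_id = record_id
        self.neutral: Dict[str, object] = {}            # element -> value
        self.localized: Dict[str, Dict[str, str]] = {}  # element -> {lang: value}

    def set(self, element: str, value: object, lang: Optional[str] = None) -> None:
        # Passing a language code marks the element as language-dependent.
        if lang is None:
            self.neutral[element] = value
        else:
            self.localized.setdefault(element, {})[lang] = str(value)

    def get(self, element: str, lang: Optional[str] = None) -> object:
        if element in self.localized:
            return self.localized[element].get(lang)
        return self.neutral.get(element)


store: Dict[int, Record] = {}


def find_by_id(record_id: int) -> Optional[Record]:
    """Very basic search: look a record up by its unique ID."""
    return store.get(record_id)


movie = Record(1)
movie.set("year", 1930)
movie.set("title", "The Blue Angel", lang="en")
movie.set("title", "Der blaue Engel", lang="de")
store[movie.record_id] = movie
```

In this sketch the year is stored once, while the title is stored per language, so `find_by_id(1).get("title", "de")` and `find_by_id(1).get("title", "en")` return different strings for the same record.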

MILESTONE 3: WIKIDATA PROTOTYPE II

With prototype II, we are moving towards our primary goal of getting Wiktionary-type data into the Wikidata framework. The much simpler GEneral Multilingual Environmental Thesaurus (GEMET, http://www.eionet.eu.int/GEMET) provides a good set of example data to import. For this data, an XML import filter will have to be developed (it does not need to be generic at this point). The existing Wikidata prototype needs to be enriched to become usable by alpha testers; for example, it is essential that there is a way to restore an old revision of an individual data element, and that all changes to the data are recorded in the "Recent Changes" log.
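
A non-generic XML import filter of the kind mentioned above could look roughly like the Python sketch below. The XML layout (`thesaurus`, `concept`, `term`, and the `id` and `lang` attributes) is invented for illustration and is not the actual GEMET export format; only the standard-library `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical input; the real GEMET file layout is not assumed here.
SAMPLE = """<thesaurus>
  <concept id="1">
    <term lang="en">water</term>
    <term lang="nl">water</term>
  </concept>
  <concept id="2">
    <term lang="en">soil</term>
    <term lang="nl">bodem</term>
  </concept>
</thesaurus>"""


def import_concepts(xml_text: str) -> Dict[int, Dict[str, str]]:
    """Map each concept ID to its terms, keyed by language code."""
    root = ET.fromstring(xml_text)
    records: Dict[int, Dict[str, str]] = {}
    for concept in root.findall("concept"):
        terms = {
            term.get("lang"): (term.text or "").strip()
            for term in concept.findall("term")
        }
        records[int(concept.get("id"))] = terms
    return records


records = import_concepts(SAMPLE)
```

The filter is deliberately tied to one input layout, in line with the note that it does not need to be generic at this stage.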

Insofar as information from the MediaWiki 1.5 beta is available at this point, this would be a good time to make changes to the database schema if needed.
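
The two alpha-tester requirements for this prototype, restoring an old revision of an individual data element and recording every change, can be sketched in Python as follows. The `ItemHistory` class and the list-based change log are assumptions for illustration; they do not reflect how MediaWiki stores revisions or recent changes.

```python
from typing import List, Tuple

# Each log entry: (item name, new revision index, description).
ChangeLog = List[Tuple[str, int, str]]


class ItemHistory:
    """Hypothetical history of one data element."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.values: List[object] = []

    def edit(self, value: object, log: ChangeLog) -> None:
        self.values.append(value)
        log.append((self.name, len(self.values) - 1, "edit"))

    def restore(self, revision_index: int, log: ChangeLog) -> None:
        # Restoring appends a copy of the old value as a new revision,
        # so the history itself is never rewritten.
        old_value = self.values[revision_index]
        self.values.append(old_value)
        log.append(
            (self.name, len(self.values) - 1, f"restore of revision {revision_index}")
        )


recent_changes: ChangeLog = []
term = ItemHistory("term")
term.edit("soil", recent_changes)
term.edit("soyl", recent_changes)   # a bad edit
term.restore(0, recent_changes)     # bring back the first revision
```

Treating a restore as a new revision keeps the bad edit in the history, where it stays visible and can itself be reviewed.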

MILESTONE 4: WIKIDATA PROTOTYPE III

The key goal for prototype III is to greatly increase the circle of alpha testers for the project. Therefore, this prototype will focus on importing existing data from the Wikimedia project "Wiktionary" and making it editable through the Wikidata interface. Specifically, the Wikidata interface will have to be usable to edit a dictionary in multiple languages (see http://meta.wikimedia.org/wiki/The_ultimate_Wiktionary for details). Data display templates will be used to format the data output. These templates can be edited by anyone.
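
The idea of a data display template, a user-editable pattern into which record fields are substituted, can be shown with a small Python sketch. It uses Python's standard `string.Template` as a stand-in; the actual template syntax, the field names and the sample entry are assumptions, not part of the proposal.

```python
from string import Template

# A display template is plain text that anyone could edit;
# the $placeholders name fields of the record (hypothetical field names).
entry_template = Template("$word ($language, $part_of_speech): $definition")

record = {
    "word": "huis",
    "language": "Dutch",
    "part_of_speech": "noun",
    "definition": "house",
}

# safe_substitute leaves unknown placeholders in place instead of raising,
# which is forgiving when a template names a field the record lacks.
rendered = entry_template.safe_substitute(record)
```

Separating the stored record from the template means the output format can be changed without touching the data.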

MILESTONE 5: WIKIDATA RELEASE

The final stage of the project focuses on generalizing the existing functionality and resolving any remaining issues. Most importantly, Wikidata schema definitions must be editable through the wiki interface (though not necessarily after the creation of a Wikidata database). Internationalization of Wikidata records must be fully supported at this point. The code must prove reliable enough to be included in the next release of MediaWiki (1.6 or later).
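
To make the notion of an editable schema definition more concrete, here is a minimal Python sketch in which the schema is ordinary data and records are checked against it. The schema format, the field names and the `validate` helper are hypothetical and are not taken from the Wikidata proposal.

```python
from datetime import date
from typing import Dict, List

# Hypothetical schema: field name -> expected Python type.
# Being plain data, it could in principle be stored and edited like a page.
movie_schema: Dict[str, type] = {"title": str, "year": int, "released": date}


def validate(record: Dict[str, object], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    errors: List[str] = []
    for field_name, value in record.items():
        if field_name not in schema:
            errors.append(f"unknown field: {field_name}")
        elif not isinstance(value, schema[field_name]):
            errors.append(f"{field_name}: expected {schema[field_name].__name__}")
    return errors


good = validate({"title": "The Blue Angel", "year": 1930}, movie_schema)
bad = validate({"title": "The Blue Angel", "year": "1930", "rating": 5}, movie_schema)
```

Here `good` is an empty list, while `bad` reports the wrongly typed `year` and the unknown `rating` field.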