Repository Storage

Introduction

Overview and Scope

The SoilWise repository aims to merge and seamlessly provide different types of content. To host this content, drive internal processes efficiently and offer performant end-user functionality, several storage options are implemented:

  1. A relational database management system for the storage of the core metadata of both data and knowledge assets.
  2. A Triple Store that holds the metadata of data and knowledge assets as a linked data graph, connected to knowledge on soil health and related domains.
  3. Git for storage of user-enhanced metadata.

Intended Audience

Storage is a backend component, so we expect only maintenance roles:

  • SWC Administrator monitoring the health status and logs, signalling maintenance issues, etc.
  • SWC Maintainer performing corrective / adaptive maintenance tasks that require database access and updates.

PostgreSQL RDBMS: storage of raw and augmented metadata

Info

Current version: Postgres release 12.2

Technology: Postgres

Access point: SQL

A "conventional" RDBMS is used to store the (augmented) metadata of data and knowledge assets. There are several reasons for choosing an RDBMS as the main source for metadata storage and metadata querying:

  • An RDBMS provides good options to efficiently structure and index its contents, thus allowing performant access for both internal processes and end user interface querying.
  • An RDBMS easily allows implementing constraints and checks to keep data and relations consistent and valid.
  • Various extensions, e.g. search engines, are available to make querying and aggregation even more performant and better suited to end users.
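The constraint and indexing points above can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The table and column names (`records`, `augmentations`, `identifier`, and so on) are hypothetical and are not the actual SoilWise schema.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
# Table and column names are hypothetical, not the SoilWise schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript("""
    CREATE TABLE records (
        identifier TEXT PRIMARY KEY,
        title      TEXT NOT NULL CHECK (length(title) > 0)
    );
    CREATE TABLE augmentations (
        record_id TEXT NOT NULL REFERENCES records(identifier),
        property  TEXT NOT NULL,
        value     TEXT NOT NULL
    );
    -- Index to speed up lookups of augmentations per record.
    CREATE INDEX idx_augmentations_record ON augmentations(record_id);
""")

conn.execute("INSERT INTO records VALUES (?, ?)", ("rec-1", "Soil organic carbon map"))
conn.execute("INSERT INTO augmentations VALUES (?, ?, ?)", ("rec-1", "keyword", "soil carbon"))


def try_insert(sql, params):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# An augmentation pointing at a record that does not exist violates the foreign key.
orphan_ok = try_insert("INSERT INTO augmentations VALUES (?, ?, ?)", ("missing", "keyword", "x"))
# A record with an empty title violates the CHECK constraint.
empty_title_ok = try_insert("INSERT INTO records VALUES (?, ?)", ("rec-2", ""))
```

In this sketch both invalid inserts are rejected by the database itself, so consistency does not depend on every writing process repeating the same validation.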

Key Features

The Postgres database serves as the destination and/or source for many of the backend processes of the SoilWise Catalogue. Its key features are:

  1. Raw metadata storage — The harvester process uses it to store the raw results of the metadata harvesting of the different resources that are currently connected.
  2. Storage of augmented metadata — Various metadata augmentation jobs use it as input and write their output to this data store.
  3. Source for Search Index processing — This database is also the source for denormalisation, processing and indexing metadata through the Solr framework.
  4. Source for UI querying — While Solr is the main resource for end user querying through the catalogue UI, the catalogue also queries the Postgres database.

Virtuoso Triple Store: storage of SWR knowledge graph

Info

Current version: Virtuoso release 07.20.3239

Technology: Virtuoso

Access point: Triple Store (SWR SPARQL endpoint) https://repository.soilwise-he.eu/sparql

A Triple Store is implemented as part of the SWR infrastructure to allow more flexible linkage between the knowledge captured as metadata and various internal and external knowledge sources, particularly taxonomies, vocabularies and ontologies that are implemented as RDF graphs. Results of the harvesting and metadata augmentation that are stored in the RDBMS are converted to RDF and stored in the Triple Store.
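To give an idea of what converting an RDBMS record to RDF can look like, here is a minimal sketch that serialises a record as N-Triples using Dublin Core terms. The record layout, the `record_to_ntriples` helper and the `https://example.org/record/` base IRI are assumptions made for illustration. The actual SWR conversion and metadata model may differ.

```python
DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def record_to_ntriples(base_iri, record):
    """Serialise the title and description of a record as N-Triples lines.

    `record` is a plain dict (hypothetical layout); empty values are skipped.
    """
    subject = f"<{base_iri}{record['identifier']}>"
    lines = []
    for key in ("title", "description"):
        value = record.get(key)
        if value:
            lines.append(f'{subject} <{DCT}{key}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical record, as it might be read from a metadata table.
row = {"identifier": "rec-1", "title": 'Soil "health" map', "description": None}
ntriples = record_to_ntriples("https://example.org/record/", row)
```

Each populated field becomes one triple with the record IRI as subject, which is the form a Triple Store can load and link to other graphs.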

Key Features

A Triple Store, implemented in Virtuoso, is integrated for parallel storage of metadata because it offers several capabilities:

  1. Semantic linkage — It allows the linking of different knowledge models, e.g. to connect the SWR metadata model with existing and new knowledge structures on soil health and related domains.
  2. Cross-domain reasoning — It allows reasoning over the relations in the stored graph, and thus allows connecting and smartly combining knowledge from those domains.
  3. Semantic querying — The SPARQL interface offered on top of the Triple Store allows users and processes to use such reasoning and exploit previously unconnected sets of knowledge.
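As an illustration of semantic querying, the following sketch builds a request for the SWR SPARQL endpoint listed above, following the standard SPARQL protocol (a `query` parameter on a GET request). The query is deliberately generic and assumes nothing about the SWR vocabularies. The helper names `build_request` and `run_query` are hypothetical, and the availability and response format of the endpoint have not been verified here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://repository.soilwise-he.eu/sparql"  # access point listed on this page

# Generic query: fetch any five triples, without assuming a specific vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
# To execute against the live endpoint:
# for binding in run_query(ENDPOINT, QUERY):
#     print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```

Replacing the generic triple pattern with patterns from a soil health vocabulary is where the semantic linkage described above would come into play.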

Apache Lucene: Open-source search engine software library

Info

Current version: Apache Lucene release x.x.x

Technology: Apache Lucene

Access point: Via the Apache Solr API https://

The SoilWise Catalogue uses a dedicated index (Apache Lucene) to efficiently index and store the harvested and augmented metadata, as well as the knowledge extracted from documents referred to through the metadata records (currently only supporting PDF format). Access to the index (both indexing and querying) is provided through the Apache Solr search framework.

Key Features

Apache Lucene offers a range of options that help increase search performance and the quality of search results. It also makes it possible to implement strategies for result ranking, faceted search, etc. that can improve the end user experience:

  1. Search performance — Apache Lucene is a broadly adopted and well-maintained search index that can dramatically speed up search and improve the precision of search results.
  2. Integration — Integration with Apache Solr provides tools and programmatic access to configure, manage, optimize and query the indexed content.
  3. Lexical Search — Combined with the Apache Solr framework, Lucene offers support for lexical search (based on matching the literals of words and their variants), faceted search and ranking.
  4. Semantic Search — Combined with the Apache Solr framework, Lucene offers support for generating and storing embeddings that support semantic search (based on the meaning of data) and associated AI functions.
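A lexical, faceted query of the kind described above could be sketched as follows. The code only builds the URL for Solr's standard `select` handler and does not contact a server. The base URL (`http://localhost:8983/solr`), the core name (`metadata`) and the facet field names (`type`, `keyword`) are placeholders chosen for illustration, since the actual SoilWise Solr access point and schema are not given on this page.

```python
import urllib.parse


def build_solr_query_url(base_url, core, text, facet_fields=(), rows=10):
    """Build a URL for Solr's standard select handler with optional facets."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # Solr accepts one facet.field parameter per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?" + urllib.parse.urlencode(params)


# Hypothetical base URL, core and field names.
url = build_solr_query_url(
    "http://localhost:8983/solr",
    "metadata",
    "soil erosion",
    facet_fields=["type", "keyword"],
)
print(url)
```

The response to such a request would contain both the ranked documents and the facet counts, which is what a catalogue UI typically uses to render filters alongside the result list.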

Git: storage of code and configuration

Info

Technology: Gitlab and GitHub

Access point: https://github.com/soilwise-he

Git is a multi-purpose environment for storing and managing software and documentation, versioning and configuration. It also offers various functions that support the management and monitoring of the software development process.

Key Features

Git is an acknowledged platform to store, version, configure and document software, with additional features for software and software development management. The key features used in SoilWise are:

  1. Code storage, version and configuration management — Git is used to deposit and manage versions of SoilWise code, documentation and configurations.
  2. Issue and release management — SoilWise uses the issue and release management to document, monitor and track the development of software components and their integration.
  3. Process automation — Git defines and runs automated pipelines for deployment, augmentation, validation and harvesting external sources.

Integrations & Interfaces

Key Architectural Decisions

| Decision | Rationale |
|---|---|
| RDF/Triple Store for semantics | Allows definition of advanced semantic structures and cross-domain interlinkage. Allows semantic reasoning, both internal and by external clients. |
| To be further extended ... | |

Risks & Limitations

| Risk / Limitation | Description | Mitigation |
|---|---|---|
| Inconsistency between RDBMS and Triple Store | Parallel sources and query results might deviate if processes are not aligned. | Monitoring procedures and corrective actions to be documented for maintenance |
| Integration issues for Triple Store | Lack of infrastructure and/or technical knowledge might hinder integration. | Continuous alignment with JRC technical team |
| Integration issues for process automation | Currently implemented process automation through Git might not fit JRC. | Continuous alignment with JRC technical team |
| To be further extended ... | | |
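One possible form of the monitoring mentioned for the first risk is a periodic comparison of the record identifiers present in each store. The sketch below shows only the comparison step. How the identifier lists are retrieved (for example by an SQL query and a SPARQL query) is left out, and the function name and report layout are assumptions, not a documented SoilWise procedure.

```python
def compare_identifier_sets(rdbms_ids, triple_store_ids):
    """Report which record identifiers are present in one store but not the other."""
    rdbms, triples = set(rdbms_ids), set(triple_store_ids)
    return {
        "missing_in_triple_store": sorted(rdbms - triples),
        "missing_in_rdbms": sorted(triples - rdbms),
        "in_sync": rdbms == triples,
    }


# Hypothetical identifier lists, as they might be returned by each store.
report = compare_identifier_sets(["a", "b", "c"], ["b", "c", "d"])
```

A non-empty `missing_in_triple_store` list would suggest that the RDF conversion lags behind the RDBMS, while a non-empty `missing_in_rdbms` list would point to stale triples that need cleaning up.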