Data publication support
Introduction
Overview and Scope
At the moment, SoilWise supports data publishers with the following tools:
- DOI Resolution Widget
- Tabular Soil Data Annotation, which helps users create semantic metadata for tabular datasets.
- INSPIRE Geopackage Transformation
- Soil Vocabulary Viewer (part of the Knowledge Graph component), which visualizes and links different soil-domain vocabularies and terms.
Intended Audience
DOI Resolution Widget
Info
Current version:
Technology:
Project:
Access Point:
Overview and Scope
Key Features
Architecture
Technological Stack
Main Sequence Diagram
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Tabular Soil Data Annotation
Info
Current version:
Technology: Streamlit, Python, OpenAI API
Project: Tabular Data Annotator
Access Point: https://dataannotator-swr.streamlit.app/
Overview and Scope
DataAnnotator is a Streamlit-based web application designed to help users create semantic metadata for tabular datasets. It combines optional Large Language Model (LLM) assistance with semantic embeddings to annotate data columns with machine-readable descriptions, element definitions, units, methods, and vocabulary mappings.
The tool addresses the metadata annotation workflow by:
- Enabling manual annotation: Users directly enter descriptions for data columns
- Automating description generation (optional): If users have context documentation, LLMs can help extract and structure descriptions automatically
- Linking to vocabularies: Semantic embeddings match descriptions to controlled vocabularies for standardization
The LLM layer is optional—users can skip automated generation and manually provide descriptions, which the system will then semantically match to existing vocabulary terms.
Key Features
| Feature | Implementation | Purpose |
|---|---|---|
| Auto Type Detection | Statistical sampling | Identify data patterns |
| Manual Description Entry | Streamlit text inputs | Direct user annotation |
| Optional LLM Assistance | OpenAI/Apertus integration | Auto-extract descriptions from docs |
| Semantic Vocabulary Matching | FAISS vector search | Link descriptions to standard vocabularies |
| Context Awareness | PDF/DOCX import + prompting | Extract domain-specific info when available |
| Multi-format Export | flat csv/TableSchema/CSVW | Integration with downstream tools |
Architecture
Technological Stack
| Component | Technology |
|---|---|
| Frontend | Streamlit 1.51+ |
| Backend Logic | Python 3.12+ |
| LLM Integration | OpenAI API, Apertus HTTP |
| Embeddings | Sentence Transformers 5.1+ |
| Vector Search | FAISS (CPU) 1.12+ |
| File Parsing | PyPDF2, python-docx, openpyxl |
| ML Libraries | scikit-learn, NumPy |
Dependencies & Models

Python Packages

- streamlit>=1.51.0 - Web UI framework
- openai>=2.7.2 - LLM API client
- sentence-transformers>=5.1.2 - Semantic embedding
- faiss-cpu>=1.12.0 - Vector similarity search
- pandas>=2.0 - Data manipulation
- openpyxl>=3.1.5 - Excel handling
- python-docx>=1.2.0 - Word document parsing
- pypdf2>=3.0.1 - PDF text extraction

Pre-trained Models/Data

- Embedding Model: all-MiniLM-L6-v2 (384 dimensions, 22M parameters)
- FAISS Index: Pre-computed vocabulary embeddings (stored in the data/ directory)
Main Components
1. Data Input Module
- Supported Formats (Tabular Data): CSV, Excel (XLSX)
- Supported Formats (Context Documents): Free-form text input, PDF documents, DOCX files
- Processing Functions:
  - read_csv_with_sniffer(): Auto-detects CSV delimiters
  - import_metadata_from_file(): Reads existing metadata if provided
  - read_context_file(): Extracts context from PDFs/DOCX for LLM-assisted annotation
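The delimiter-detection step can be sketched with Python's standard `csv.Sniffer`. This is a minimal illustration under assumptions, not the app's actual implementation; the sample data is invented:

```python
import csv
import io

import pandas as pd


def read_csv_with_sniffer(raw: str) -> pd.DataFrame:
    """Detect the delimiter from the raw text, then parse with pandas."""
    dialect = csv.Sniffer().sniff(raw, delimiters=",;\t|")
    return pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)


sample = "site;depth_cm;ph\nA1;10;6.5\nA2;30;7.1"
df = read_csv_with_sniffer(sample)
print(list(df.columns))  # ['site', 'depth_cm', 'ph']
```

Restricting the candidate delimiters (`,;\t|`) keeps the sniffer from guessing exotic separators on short files.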
2. Data Analysis & Type Detection
- Function: detect_column_type_from_series()
- Detects:
  - String: Text data
  - Numeric: Integers and floats
  - Date: Temporal values
- Approach: Statistical sampling of column values (up to 200 non-null entries)
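A hypothetical re-implementation of this detection logic (the real function's internals are not shown in this document) could look like:

```python
import pandas as pd


def detect_column_type_from_series(series: pd.Series, sample_size: int = 200) -> str:
    """Classify a column as numeric, date, or string from a sample of
    up to `sample_size` non-null values."""
    sample = series.dropna().head(sample_size).astype(str)
    if sample.empty:
        return "string"
    if pd.to_numeric(sample, errors="coerce").notna().all():
        return "numeric"
    if pd.to_datetime(sample, errors="coerce").notna().all():
        return "date"
    return "string"


print(detect_column_type_from_series(pd.Series([1.2, 3.4, None])))              # numeric
print(detect_column_type_from_series(pd.Series(["2023-01-05", "2024-11-30"])))  # date
print(detect_column_type_from_series(pd.Series(["loam", "clay", "silt"])))      # string
```

Checking numeric before date matters: numeric strings would otherwise be swallowed by permissive date parsing.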
3. Metadata Framework
- Function: build_metadata_df_from_df()
- Template Fields:
  - name: Column identifier
  - datatype: Type classification (string/numeric/date)
  - element: Semantic element definition
  - unit: Measurement unit
  - method: Collection/calculation method
  - description: Human-readable description
  - element_uri: Link to external vocabulary
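Building the template can be sketched as below. This is an assumption-laden sketch: the sample data is invented, and pandas dtype strings stand in for the app's string/numeric/date classification:

```python
import pandas as pd

TEMPLATE_FIELDS = ["name", "datatype", "element", "unit",
                   "method", "description", "element_uri"]


def build_metadata_df_from_df(df: pd.DataFrame) -> pd.DataFrame:
    """One metadata row per data column; semantic fields start empty and are
    filled manually or by the LLM / vocabulary-matching steps."""
    rows = [
        {"name": col, "datatype": str(df[col].dtype),
         **{field: "" for field in TEMPLATE_FIELDS[2:]}}
        for col in df.columns
    ]
    return pd.DataFrame(rows, columns=TEMPLATE_FIELDS)


data = pd.DataFrame({"ph": [6.5, 7.1], "site": ["A1", "A2"]})
meta = build_metadata_df_from_df(data)
print(meta[["name", "datatype"]].to_dict(orient="records"))
```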
4. LLM Integration Layer (Optional)
Purpose: Automate the extraction and structuring of descriptions from existing documentation when users have context materials.
When to Use:
- User has documentation (PDFs, Word docs, etc.) describing variables
- Manual annotation is time-consuming for large datasets
- Descriptions need to be extracted from unstructured text
Supported Providers:

- OpenAI (Recommended)
  - Uses GPT models for high-quality response generation
  - get_response_OpenAI(): Direct API calls
  - Best for complex, domain-specific text extraction
- Apertus (Alternative)
  - Self-hosted LLM option
  - get_response_Apertus(): HTTP-based endpoint
  - Swiss-based open-source model
Functionality:
- Function: generate_descriptions_with_LLM()
- Inputs:
  - Variable names to describe
  - Context information from documents or text input
- Output Format: Structured JSON with descriptions for each variable
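The request/response cycle might be wired up as below. This is a hedged sketch: the prompt text, model name, and helper functions other than generate_descriptions_with_LLM are illustrative, and only the JSON-parsing part runs without an API key:

```python
import json


def build_prompt(variables, context):
    """Ask for a JSON object mapping each variable name to a description."""
    return (
        "Write a one-sentence description for each variable, "
        "using the context below.\n"
        f"Variables: {', '.join(variables)}\n"
        f"Context:\n{context}\n"
        'Reply only with JSON like {"variable": "description"}.'
    )


def parse_llm_json(text):
    """Extract the {...} block from the reply, tolerating surrounding prose."""
    return json.loads(text[text.find("{"): text.rfind("}") + 1])


def generate_descriptions_with_LLM(variables, context, model="gpt-4o-mini"):
    from openai import OpenAI  # imported lazily; only needed for live calls
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(variables, context)}],
    )
    return parse_llm_json(reply.choices[0].message.content)


# Parsing works on a canned reply without touching the API:
canned = 'Sure! {"ph": "Soil pH measured in water suspension."}'
print(parse_llm_json(canned)["ph"])
```

Tolerant JSON extraction matters in practice because chat models often wrap the requested JSON in conversational prose.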
5. Semantic Embedding & Vocabulary Matching
- Model: Sentence Transformers (default: all-MiniLM-L6-v2)
- Functions:
  - load_sentence_model(): Load embedding model
  - load_vocab_indexes(): Load pre-computed FAISS indexes
  - embed_vocab(): Generate embeddings with optional definition weighting
- Purpose: Match generated or manually-entered descriptions to controlled vocabularies
Pre-computed Vocabulary Sources
The FAISS vectorstore was pre-computed by embedding terms from four major public vocabularies:
| Vocabulary | Domain | Source |
|---|---|---|
| Agrovoc | Agricultural and food sciences | FAO - Food and Agriculture Organization |
| GEMET | Environmental terminology | European Environment Agency (EEA) |
| GLOSIS | Soil science and properties | FAO Global Soil Information System |
| ISO 11074:2005 | Soil quality terminology | International Organization for Standardization |
This multi-vocabulary approach enables annotation of diverse datasets including agricultural, environmental, and soil-related data.
FAISS Index Structure:
Index File: vocabCombined-{modelname}.index
Metadata File: vocabCombined-{modelname}-meta.npz
Metadata Dictionary Format:
```
{
  index_id: {
    "uri": "vocabulary_uri",
    "label": "preferred_label",
    "definition": "term_definition",
    "QC_label": "prefLabel|altLabel"
  },
  ...
}
```
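At query time, an embedded description is compared against these pre-computed vectors by inner product over normalized embeddings (cosine similarity). The NumPy-only sketch below stands in for the SentenceTransformer + FAISS pipeline; the vocabulary vectors, URIs, and labels are toy values:

```python
import numpy as np

# Toy stand-ins for the pre-computed vocabulary embeddings and the
# vocabCombined-{modelname}-meta.npz metadata dictionary.
vocab_meta = {
    0: {"uri": "https://example.org/vocab/soil_ph", "label": "soil pH"},
    1: {"uri": "https://example.org/vocab/soil_moisture", "label": "soil moisture"},
}
vocab_vecs = np.array([[1.0, 0.0, 0.2],
                       [0.1, 1.0, 0.0]])


def match_vocabulary(query_vec, k=1):
    """Inner product over L2-normalized vectors equals cosine similarity,
    i.e. what a FAISS inner-product index returns after normalization."""
    vocab = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = vocab @ query
    top = np.argsort(scores)[::-1][:k]
    return [vocab_meta[int(i)] for i in top]


# A query embedding close to the first vocabulary vector:
print(match_vocabulary(np.array([0.9, 0.1, 0.1]))[0]["label"])  # soil pH
```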
6. Export & Download Module
- Function: download_bytes()
- Supported Formats:
  - Excel (XLSX) - for human review
  - JSON - for machine processing
  - CSV - for spreadsheet tools
- Implementation: Streamlit session-based download management
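Serialization to download bytes can be sketched as follows (illustrative; the actual download_bytes signature may differ). In Streamlit, the returned bytes would be handed to st.download_button:

```python
import io
import json

import pandas as pd


def download_bytes(meta: pd.DataFrame, fmt: str) -> bytes:
    """Serialize the metadata table to bytes in the requested format."""
    if fmt == "csv":
        return meta.to_csv(index=False).encode("utf-8")
    if fmt == "json":
        return json.dumps(meta.to_dict(orient="records"), indent=2).encode("utf-8")
    if fmt == "xlsx":
        buffer = io.BytesIO()
        meta.to_excel(buffer, index=False)  # requires openpyxl
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


meta = pd.DataFrame([{"name": "ph", "datatype": "numeric", "unit": "pH"}])
# In the app: st.download_button("Download", download_bytes(meta, "csv"), "metadata.csv")
print(download_bytes(meta, "csv").decode())
```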
Main Sequence Diagram
```mermaid
graph TB
    A["User Interface<br/>Streamlit App"] -->|Upload Data| B["Data Input Handler<br/>CSV/Excel/PDF"]
    B -->|Parse Data| C["Data Analysis<br/>Column Type Detection"]
    C -->|Analyze Structure| D["Metadata Framework<br/>Build Template"]
    D --> E{"Description Source?"}
    E -->|Manual Entry| F["User Provides<br/>Descriptions"]
    E -->|Optional: Auto-generate| I[/"LLM Provider<br/>(Optional Tool)"\]
    I -->|OpenAI API| J["OpenAI Client<br/>GPT Models"]
    I -->|Apertus API| K["Apertus Client<br/>Local LLM"]
    J -->|JSON Descriptions| L["Response Parser<br/>JSON Extraction"]
    K -->|JSON Descriptions| L
    L --> M["LLM-generated<br/>Descriptions"]
    F --> N["Semantic Embedding<br/>Sentence Transformers"]
    M --> N
    N -->|Vector Query| O["Proposal for generalized ObservedProperty"]
    V1["Agrovoc<br/>Agricultural"] -.->|Pre-embedded| G["FAISS vector store"]
    V2["GEMET<br/>Environmental"] -.->|Pre-embedded| G
    V3["GLOSIS<br/>Soil Science"] -.->|Pre-embedded| G
    V4["ISO 11074:2005<br/>Soil Quality"] -.->|Pre-embedded| G
    G --> O
    O --> Q["Export Handler<br/>flat csv/TableSchema/CSVW"]
    Q -->|Downloaded metadata| A
```
Key Architectural Decisions
Optimization Strategies:
- Model Caching: Streamlit @st.cache_resource for persistent model loading
- API Caching: JSON-based result memoization to avoid redundant API calls
- FAISS Optimization: Pre-computed indexes for O(log n) vector search
- Batch Processing: Process multiple columns in a single LLM call
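The JSON-based API-result memoization can be sketched like this (the cache file name and helper function are hypothetical; the hash-and-lookup pattern is the point):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("llm_cache.json")  # hypothetical cache location


def cached_call(prompt, call_fn):
    """Hash the prompt; reuse a stored answer if present, otherwise call the
    expensive function once and persist its result as JSON."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = call_fn(prompt)
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]


CACHE_FILE.unlink(missing_ok=True)  # start fresh for this demo
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"description for {prompt!r}"

first = cached_call("describe ph", fake_llm)
second = cached_call("describe ph", fake_llm)  # served from the JSON cache
print(len(calls))  # 1 -- the expensive call ran only once
```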
INSPIRE Geopackage Transformation
Info
Current version:
Technology:
Project:
Access Point:
Overview and Scope
Key Features
User Manual
Architecture
Technological Stack
Main Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Other recommended tools acknowledged by SoilWise community
The following components are not products of the SoilWise project and not an integral part of the SoilWise Catalogue, but they are recommended by the SoilWise community.
Hale Studio
A proven ETL tool optimised for working with complex structured data, such as XML, relational databases, or a wide range of tabular formats. It supports all required procedures for semantic and structural transformation. It can also handle reprojection. While Hale Studio exists as a multi-platform interactive application, its capabilities can be provided through a web service with an OpenAPI.
User Manual
A comprehensive tutorial video on soil data harmonisation with hale studio can be found here.
Setting up a transformation process in hale»connect
Complete the following steps to set up soil data transformation, validation and publication processes:
- Log into hale»connect.
- Create a new transformation project (or upload it).
- Specify source and target schemas.
- Create a theme (this is a process that describes what should happen with the data).
- Add a new transformation configuration. Note: Metadata generation can be configured in this step.
- A validation process can be set up to check against conformance classes.
Executing a transformation process
- Create a new dataset, select the theme matching the current source data, and provide the source data file.
- Execute the transformation process. ETF validation processes are also performed. If successful, a target dataset and the validation reports will be created.
- View and download services will be created if required.
To create metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process.