Scoring & Workflow
This page explains how a test runs and how the score is produced. No programming knowledge is required.
The workflow of a test
flowchart TB
A["📝 The AI receives the<br>plant description"] --> B{Information missing?}
B -- "yes" --> C["❓ AI asks questions<br>– the oracle answers"]
B -- "no" --> D["🏗️ AI builds the plant"]
C --> D
D --> E["🔍 The built plant is<br>compared with the reference"]
E --> F["📊 Three scores<br>in percent"]
Step by step:
- Read the task: The AI receives the description (text or sketch).
- Ask questions (only for incomplete tasks): If something is missing, the AI may query the oracle.
- Build the plant: The AI creates instructions with which PyADM1ODE actually assembles the plant.
- Compare: The resulting plant is compared with the reference (the correct plant).
- Score: From this, three scores in percent are produced.
What does 'the AI builds the plant' mean?
The AI writes a short set of instructions in the language that PyADM1ODE understands. You can think of it as a blueprint: "Take a fermenter of this size, connect it to the secondary digester …". These instructions are executed, and a real, simulatable plant is created.
The three scores
The result is examined from three angles. Each score is a percentage between 0 % and 100 %.
1. Structure
Are the right components present and correctly connected? For example: does the digestate flow from the fermenter into the secondary digester, and the biogas to the combined heat and power unit?
2. Measures
Are the sizes and values correct, such as volume, temperature or the power of the combined heat and power unit? The check uses a tolerance range, so small deviations are allowed.
3. Gaps
Did the AI handle missing information correctly? Did it ask the oracle or fill in a plausible value, instead of simply inventing a wrong one?
Why a tolerance range?
In practice there is rarely a single "correct" value. A fermenter of 312 m³ instead of 315 m³ is not an error. A value therefore counts as correct if it lies within a reasonable range around the reference, not only when it matches exactly.
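For readers who like to see the idea spelled out, the tolerance check can be sketched in a few lines of Python. This is only an illustration: the function name `within_tolerance` and the 5 % relative tolerance are assumptions made for this example, not the names or settings the scoring actually uses.

```python
def within_tolerance(value: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Return True if value deviates from reference by at most rel_tol (relative)."""
    # rel_tol = 0.05 is an assumed example tolerance, not the real scoring setting.
    if reference == 0:
        # A relative tolerance is undefined for a zero reference; require an exact match.
        return value == 0
    return abs(value - reference) <= rel_tol * abs(reference)


# Built fermenter: 312 m³, reference: 315 m³. The deviation is about 1 %.
print(within_tolerance(312.0, 315.0))  # True
# Built fermenter: 250 m³. The deviation is about 21 % and falls outside the range.
print(within_tolerance(250.0, 315.0))  # False
```

A relative tolerance is used here so that the same rule works for small and large values alike; an absolute tolerance per quantity would be an equally valid design.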
What counts – and what does not
To keep the scoring fair and meaningful, some things are deliberately not scored:
- Names do not matter: The AI may name components differently. Comparison is by type of component (fermenter, pump …), not by name.
- Substrates are not scored: Which materials are fed in does not factor into the score – it is solely about the structure of the plant.
- Most serious error: Silently inventing an implausible value instead of asking is penalised most heavily.
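The rule that names do not matter can be illustrated with a small Python sketch. It compares two plants purely by component type and by connections between types. The data layout, the function names `typed_connections` and `structure_score`, and the formula (share of reference items found in the built plant) are all assumptions for illustration; the actual structure score may be computed differently.

```python
from collections import Counter


def typed_connections(plant: dict) -> Counter:
    """Translate name-based connections into (source type, target type) pairs."""
    types = plant["components"]  # maps component name -> component type
    return Counter((types[src], types[dst]) for src, dst in plant["connections"])


def structure_score(built: dict, reference: dict) -> float:
    """Percentage of reference component types and typed connections found in the built plant."""
    ref_items = Counter(reference["components"].values()) + typed_connections(reference)
    built_items = Counter(built["components"].values()) + typed_connections(built)
    matched = sum((ref_items & built_items).values())  # multiset intersection
    total = sum(ref_items.values())
    return 100.0 * matched / total if total else 100.0


# Hypothetical reference plant and a built plant that uses different names.
reference = {
    "components": {"F1": "fermenter", "N1": "secondary digester", "CHP1": "chp"},
    "connections": [("F1", "N1"), ("F1", "CHP1")],
}
built = {
    "components": {"main_tank": "fermenter", "post_tank": "secondary digester", "engine": "chp"},
    "connections": [("main_tank", "post_tank"), ("main_tank", "engine")],
}
print(structure_score(built, reference))  # 100.0

# The same plant without the gas line to the CHP unit loses one of five items.
built_missing = {**built, "connections": [("main_tank", "post_tank")]}
print(structure_score(built_missing, reference))  # 80.0
```

This simplified sketch does not penalise extra components and does not distinguish between several components of the same type; it is meant only to show why differently named but equivalent plants receive the same result.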
Note on sketch tasks
Tasks with a sketch (image) can only be solved by AI models that understand images. A pure text model cannot "see" a sketch and would inevitably score 0 % on such tasks. Such a result does not indicate an error in the model's handling of the content; it means the wrong kind of model was chosen for the task.