llm

academic_doc_generator.project.llm

LLM interface for extracting project work metadata.

extract_project_metadata(pdf_path, llm_client)

Extract metadata from a project work PDF (title page).

This function reads the first two pages of the PDF and uses an LLM to extract relevant information such as student name, matriculation number, project title, examiner name, and work type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_path | str | Path to the project work PDF file. | required |
| llm_client | LLMClient | LLMClient instance for API access. | required |
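The source listing on this page calls only `llm_client.chat_completion(messages)` and parses the returned string as JSON. For tests, an object exposing that one method could therefore stand in for a real client. A minimal sketch, assuming that single call is the whole contract; `ChatClient`, `CannedClient`, and `ask` are hypothetical names, not part of the package:

```python
import json
from typing import Protocol


class ChatClient(Protocol):
    """Hypothetical structural type: the one method the source listing calls."""

    def chat_completion(self, messages: list[dict[str, str]]) -> str: ...


class CannedClient:
    """Hypothetical test double: returns a fixed reply and records its input."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: list[list[dict[str, str]]] = []

    def chat_completion(self, messages: list[dict[str, str]]) -> str:
        self.calls.append(messages)
        return self.reply


def ask(client: ChatClient, prompt: str) -> str:
    """Send one user message, mirroring the message shape in the source listing."""
    return client.chat_completion([{"role": "user", "content": prompt}])


client = CannedClient(json.dumps({"title": "Example", "work_type": "Praxisprojekt"}))
reply = ask(client, "Extract the metadata.")
print(json.loads(reply)["work_type"])
```

Because the stand-in records every call, a test can also check the exact messages that were sent.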

Returns:

| Type | Description |
| --- | --- |
| dict[str, str] | Dictionary containing the extracted metadata. |

The dictionary has the following keys:

- "student_name": Full name of the student
- "student_first_name": First name only (for gender detection)
- "id_number": Student's matriculation number
- "title": Title of the project work
- "first_examiner": Name of the first examiner
- "first_examiner_christian": Given (Christian) name of the first examiner
- "first_examiner_family": Family name of the first examiner
- "work_type": Type of work (e.g., "Praxisprojekt")

If the LLM response is not valid JSON, the function instead returns a dictionary with the keys "error" and "raw", where "raw" holds the unparsed response.
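Since the result comes from an LLM, a caller may want to check it before use. The sketch below is one way to do that, assuming the key names documented above and the "error"/"raw" fallback shown in the source listing; `EXPECTED_KEYS` and `missing_keys` are hypothetical helpers, not part of the package:

```python
# Keys documented for the return value of extract_project_metadata.
EXPECTED_KEYS = (
    "student_name",
    "student_first_name",
    "id_number",
    "title",
    "first_examiner",
    "first_examiner_christian",
    "first_examiner_family",
    "work_type",
)


def missing_keys(metadata: dict[str, str]) -> list[str]:
    """Return the documented keys that are absent or empty in a result.

    Raises ValueError if the result is the error fallback, which the
    source listing returns when the LLM reply is not valid JSON.
    """
    if "error" in metadata:
        raise ValueError(f"extraction failed: {metadata['error']}")
    return [key for key in EXPECTED_KEYS if not metadata.get(key)]


# Illustrative data only, not output from a real extraction run.
result = {
    "student_name": "Erika Mustermann",
    "title": "Example",
    "work_type": "Praxisprojekt",
}
print(missing_keys(result))
```

Raising on the fallback keeps the two failure modes apart: an unparseable reply is an error, while a parsed reply with gaps is reported as a list the caller can act on.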

Source code in src/academic_doc_generator/project/llm.py
def extract_project_metadata(pdf_path: str, llm_client: LLMClient) -> dict[str, str]:
    """Extract metadata from a project work PDF (title page).

    This function reads the first two pages of the PDF and uses an LLM to
    extract relevant information such as student name, matriculation number,
    project title, examiner name, and work type.

    Args:
        pdf_path: Path to the project work PDF file.
        llm_client: LLMClient instance for API access.

    Returns:
        dict: Dictionary containing extracted metadata with keys:
            - "student_name": Full name of the student
            - "student_first_name": First name only (for gender detection)
            - "id_number": Student's matriculation number
            - "title": Title of the project work
            - "first_examiner": Name of the first examiner
            - "first_examiner_christian": Given (Christian) name of the first examiner
            - "first_examiner_family": Family name of the first examiner
            - "work_type": Type of work (e.g., "Praxisprojekt")

        If the LLM response is not valid JSON, a dictionary with the keys
        "error" and "raw" (the unparsed response) is returned instead.
    """
    # Extract text from the first two pages and join it in page order
    pages_text = extract_text_per_page(pdf_path, max_pages=2)
    sample_text = "\n\n".join(pages_text[i] for i in sorted(pages_text))

    prompt = build_prompt(PromptTemplate.EXTRACT_PROJECT_METADATA, text=sample_text)

    messages = [{"role": "user", "content": prompt}]
    content = llm_client.chat_completion(messages)

    try:
        metadata = json.loads(content)
    except json.JSONDecodeError:
        return {"error": "Could not parse JSON", "raw": content}

    # Normalize abbreviated keys ("stud_name", "sid") to the documented names
    if "stud_name" in metadata and "student_name" not in metadata:
        metadata["student_name"] = metadata.pop("stud_name")
    if "sid" in metadata and "id_number" not in metadata:
        metadata["id_number"] = metadata.pop("sid")

    return metadata
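The listing passes the raw reply straight to `json.loads`, so a reply wrapped in a Markdown code fence, which chat models sometimes produce, would take the error path. The following is a sketch of a more tolerant parsing step, not the package's implementation. It assumes the fence markers sit on their own lines, and `strip_code_fence`, `parse_json_reply`, and the "Expected a JSON object" message are hypothetical:

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep the example readable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, if there is one."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening line (fence plus optional language tag) and the closing fence.
        return "\n".join(lines[1:-1]).strip()
    return text


def parse_json_reply(content: str) -> dict:
    """Parse an LLM reply as a JSON object, using the listing's error shape on failure."""
    try:
        data = json.loads(strip_code_fence(content))
    except json.JSONDecodeError:
        return {"error": "Could not parse JSON", "raw": content}
    if not isinstance(data, dict):
        # Valid JSON, but not an object, so the key normalization would not apply.
        return {"error": "Expected a JSON object", "raw": content}
    return data


fenced = FENCE + "json\n" + '{"stud_name": "Erika Mustermann"}' + "\n" + FENCE
print(parse_json_reply(fenced))
```

The extra `isinstance` check covers replies that are valid JSON but not an object, such as a bare list, which the listing would otherwise return unchanged despite the `dict[str, str]` annotation.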