llm
academic_doc_generator.core.llm
LLM interface with comprehensive type annotations for API interactions.
detect_degree_from_filename(pdf_path, llm_client)
Detect if thesis is Bachelor or Master from PDF filename.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_path` | `str` | Path to the PDF file. | required |
| `llm_client` | `LLMClientProtocol` | LLMClient instance for API access. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | "Bachelor" or "Master", or None if unable to determine. |
Source code in src/academic_doc_generator/core/llm.py
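The return contract above (one of two labels, or None) can be illustrated without an API call. The sketch below is a keyword-based stand-in, not the documented implementation, which delegates the decision to the LLM client. The function name `guess_degree_from_filename` and the keyword lists are hypothetical; such a stand-in might serve as a test double.

```python
from pathlib import Path
from typing import Optional

# Hypothetical keyword lists; the documented function asks an LLM instead.
_BACHELOR_HINTS = ("bachelor", "bsc")
_MASTER_HINTS = ("master", "msc")


def guess_degree_from_filename(pdf_path: str) -> Optional[str]:
    """Return "Bachelor", "Master", or None based on filename keywords."""
    name = Path(pdf_path).stem.lower()
    is_bachelor = any(hint in name for hint in _BACHELOR_HINTS)
    is_master = any(hint in name for hint in _MASTER_HINTS)
    # No hint, or conflicting hints: report that the degree is undetermined.
    if is_bachelor == is_master:
        return None
    return "Bachelor" if is_bachelor else "Master"
```

Callers of either variant should handle the None case explicitly, since a filename such as `thesis.pdf` carries no degree information.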
detect_language(results, llm_client, groq_free, sample_size=3)
Detect the language (German or English) of the comments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `dict[int, list[RewrittenComment]]` | Dictionary containing rewritten comments per page. | required |
| `llm_client` | `LLMClientProtocol` | LLM client instance for API access. | required |
| `groq_free` | `bool` | Whether to apply request throttling (2-second delay). | required |
| `sample_size` | `int` | Number of sample comments to analyze for language detection. Defaults to 3. | `3` |
Returns:
| Type | Description |
|---|---|
| `str` | "German" if German is detected, "English" if English is detected. |
Example

    >>> comments = {1: [{'rewritten': 'Warum wurde das gewählt?'}]}
    >>> client = LLMClient()
    >>> lang = detect_language(comments, client, groq_free=False)
    >>> lang
    'German'

Source code in src/academic_doc_generator/core/llm.py
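How the `sample_size` comments are picked is not stated here. One plausible reading, sketched below as an assumption, is to take the first non-empty `rewritten` texts in ascending page order. The helper names `iter_rewritten` and `sample_comments` are hypothetical; only the `rewritten` key comes from the example above.

```python
from itertools import islice
from typing import Iterator


def iter_rewritten(results: dict[int, list[dict]]) -> Iterator[str]:
    """Yield non-empty 'rewritten' texts in ascending page order."""
    for page in sorted(results):
        for comment in results[page]:
            text = comment.get("rewritten", "").strip()
            if text:
                yield text


def sample_comments(results: dict[int, list[dict]], sample_size: int = 3) -> list[str]:
    """Take the first `sample_size` rewritten comments as the detection sample."""
    return list(islice(iter_rewritten(results), sample_size))
```

A small sample keeps the prompt short; the trade-off is that a document mixing both languages may be classified by whichever comments happen to come first.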
determine_gender_from_name(first_name, llm_client)
Determine the formal German address (Herr/Frau) from a first name using LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `first_name` | `str` | First/given name of the person. | required |
| `llm_client` | `LLMClientProtocol` | LLMClient instance for API access. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | Either "Herr" or "Frau" based on the name. |
Source code in src/academic_doc_generator/core/llm.py
extract_document_metadata(pages_text, language, llm_client, pdf_path=None)
Extract author, matriculation number, title, and examiners from the first two pages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pages_text` | `dict[int, str]` | Dictionary mapping page indices to text content. | required |
| `language` | `str` | Language the thesis is written in ("German" or "English"). | required |
| `llm_client` | `LLMClientProtocol` | LLM client instance for API access. | required |
| `pdf_path` | `str` | Path to PDF file for fallback degree detection from filename. | `None` |
Returns:
| Type | Description |
|---|---|
| `ThesisMetadata` | Dictionary with extracted metadata. If any field cannot be extracted, it will contain None as the value. |
Example

    >>> text = {0: "Bachelor Thesis by Max Mustermann (123456)"}
    >>> client = LLMClient()
    >>> metadata = extract_document_metadata(text, "German", client)
    >>> metadata['author']
    'Max Mustermann'
    >>> metadata['id_number']
    '123456'

Source code in src/academic_doc_generator/core/llm.py
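Because `pages_text` is a dictionary keyed by page index, restricting the input to the first two pages amounts to sorting the keys and slicing. The following is a minimal sketch of that step under the assumption that lower indices come first; `first_pages_text` is a hypothetical helper, not part of the module.

```python
def first_pages_text(pages_text: dict[int, str], max_pages: int = 2) -> str:
    """Join the text of the lowest-numbered pages, in page order."""
    selected = sorted(pages_text)[:max_pages]
    return "\n\n".join(pages_text[index] for index in selected)
```

Sorting the keys makes the result independent of the dictionary's insertion order, which matters if pages were extracted out of sequence.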
get_summary_and_metadata_of_pdf(pdf_path, language, llm_client=None, groq_free=False, verbose=False)
Extract thesis metadata and generate a summary from the PDF.
This function uses the first pages of the PDF to detect metadata such as author, matriculation number, thesis title, and examiners, and generates a LaTeX-formatted summary of the thesis content using an LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_path` | `str` | Path to the thesis PDF. | required |
| `language` | `str` | Language the thesis is written in ("German" or "English"). | required |
| `llm_client` | `Optional[LLMClientProtocol]` | LLM client instance. If None, creates a new one with automatic API selection. | `None` |
| `groq_free` | `bool` | Whether to apply request throttling to stay under free-tier rate limits. Adds a 20 s delay after metadata extraction and a 2 s delay after summarization. Defaults to False. | `False` |
| `verbose` | `bool` | If True, prints the generated summary. Defaults to False. | `False` |
Returns:
| Type | Description |
|---|---|
| `tuple[str, ThesisMetadata]` | Tuple of (summary, metadata). |
Example

    >>> from llm_client import LLMClient
    >>> client = LLMClient()
    >>> summary, metadata = get_summary_and_metadata_of_pdf(
    ...     "thesis.pdf", "German", client
    ... )
    >>> metadata['bachelor_master']
    'Bachelor'
    >>> "untersucht" in summary
    True

Source code in src/academic_doc_generator/core/llm.py
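The shape of the returned metadata can be sketched as a `TypedDict`. The keys `author`, `id_number`, and `bachelor_master` appear in the examples on this page; the keys `title` and `examiners`, their types, and the class name `ThesisMetadataSketch` are assumptions made for illustration and may differ from the real `ThesisMetadata`.

```python
from typing import Optional, TypedDict


class ThesisMetadataSketch(TypedDict):
    author: Optional[str]           # key shown in the examples
    id_number: Optional[str]        # matriculation number; key shown in the examples
    bachelor_master: Optional[str]  # "Bachelor" or "Master"; key shown in the examples
    title: Optional[str]            # hypothetical key name
    examiners: Optional[list[str]]  # hypothetical key name and type


def missing_fields(metadata: ThesisMetadataSketch) -> list[str]:
    """List the keys whose value is None, i.e. fields that were not extracted."""
    return [key for key, value in metadata.items() if value is None]
```

Since unextracted fields are reported as None rather than omitted, a check like `missing_fields` can run before the metadata is passed on to document generation.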
rewrite_comments(context_dict, llm_client, groq_free=False, verbose=False)
Rewrite rough comments into clear, polite questions using LLMClient.
Only comments categorized as "llm" are rewritten. Comments with category "quelle" or "language" are skipped but retained in the results for later analysis. Comments with category "ignore" are excluded entirely.
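The category rules above can be sketched as a simple partition. This is an illustration under assumptions, not the module's code: `partition_annotations` is a hypothetical helper, and treating unknown categories like "ignore" is a choice made here for brevity.

```python
# Categories named in the description above.
REWRITE_CATEGORIES = {"llm"}
RETAINED_CATEGORIES = {"quelle", "language"}


def partition_annotations(annotations: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split annotations into (to_rewrite, retained_unchanged).

    Annotations with category "ignore" fall into neither list. Unknown
    categories are dropped as well, which is an assumption of this sketch.
    """
    to_rewrite: list[dict] = []
    retained: list[dict] = []
    for annotation in annotations:
        category = annotation.get("category")
        if category in REWRITE_CATEGORIES:
            to_rewrite.append(annotation)
        elif category in RETAINED_CATEGORIES:
            retained.append(annotation)
    return to_rewrite, retained
```

Separating the two lists up front means only the first one costs LLM requests, which matters when throttling is enabled.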
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context_dict` | `dict[int, list[AnnotationContext]]` | Mapping of page numbers to annotation contexts, where each annotation dict contains comment, highlighted text, paragraph, and category. | required |
| `llm_client` | `LLMClientProtocol` | LLM client instance implementing the LLMClientProtocol. | required |
| `groq_free` | `bool` | Whether to apply request throttling to stay under Groq's free-tier rate limits (4s per request, 10s every 5 requests). Defaults to False. | `False` |
| `verbose` | `bool` | If True, prints debug information about responses. Defaults to False. | `False` |
Returns:
| Type | Description |
|---|---|
| `dict[int, list[RewrittenComment]]` | Dictionary mapping page numbers to rewritten comments. Skipped comments (quelle/language) are excluded from the output. |
Example

    >>> context = {1: [{'comment': 'Why?', 'highlighted': 'text',
    ...                 'paragraph': 'context', 'category': 'llm'}]}
    >>> client = LLMClient()
    >>> result = rewrite_comments(context, client)
    >>> result[1][0]['rewritten']
    'Could you explain the reasoning behind this approach?'

Source code in src/academic_doc_generator/core/llm.py
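The `groq_free` throttling (4s per request, 10s every 5 requests) could be implemented along these lines. The sketch assumes the 10s pause is added on top of the 4s delay after every fifth request; the source does not say whether it is added or substituted. The names `throttle_delay` and `throttle` are hypothetical.

```python
import time
from typing import Callable


def throttle_delay(
    request_number: int,
    per_request: float = 4.0,
    burst_pause: float = 10.0,
    burst_size: int = 5,
) -> float:
    """Seconds to wait after the given 1-based request number."""
    delay = per_request
    # Assumption: the longer pause is added after every `burst_size`-th request.
    if request_number % burst_size == 0:
        delay += burst_pause
    return delay


def throttle(
    request_number: int,
    groq_free: bool,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep after a request when groq_free is set; return the delay applied."""
    if not groq_free:
        return 0.0
    delay = throttle_delay(request_number)
    sleep(delay)
    return delay
```

Passing `sleep` as a parameter lets tests record the requested delays instead of actually waiting.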
rewrite_comments_in_pdf(pdf_path, llm_client=None, groq_free=False, verbose=False, pdf_processor=None)
Extract and rewrite PDF comments into clear, polite questions.
This function parses the given PDF, extracts annotations, finds their textual context, and uses an LLM to rewrite rough comments into more understandable, well-phrased questions or feedback.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdf_path` | `str` | Path to the PDF file containing comments/annotations. | required |
| `llm_client` | `Optional[LLMClientProtocol]` | LLM client instance. If None, creates a new one with automatic API selection. | `None` |
| `groq_free` | `bool` | Whether to apply request throttling to stay under free-tier rate limits. Defaults to False. | `False` |
| `verbose` | `bool` | If True, prints detailed information about original and rewritten comments. Defaults to False. | `False` |
| `pdf_processor` | `Any` | Optional PDF processor module for dependency injection in tests. Defaults to None (uses the standard module). | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[dict[int, list[RewrittenComment]], CommentStats]` | Tuple of (rewritten_comments, stats). |
Example

    >>> from llm_client import LLMClient
    >>> client = LLMClient()
    >>> rewritten, stats = rewrite_comments_in_pdf("thesis.pdf", client)
    >>> stats
    {'quelle': 3, 'language': 2, 'ignore': 0}
    >>> rewritten[1][0]['category']
    'llm'

Source code in src/academic_doc_generator/core/llm.py
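The `stats` value in the example above has one counter per non-rewritten category. A minimal sketch of how such counts could be derived from the annotation contexts follows; `count_skipped` is a hypothetical helper, and the real `CommentStats` may be computed differently or hold more fields.

```python
from collections import Counter

# Categories that appear as keys of `stats` in the example above.
SKIPPED_CATEGORIES = ("quelle", "language", "ignore")


def count_skipped(context_dict: dict[int, list[dict]]) -> dict[str, int]:
    """Count annotations per skipped category across all pages."""
    counts = Counter(
        annotation.get("category")
        for annotations in context_dict.values()
        for annotation in annotations
    )
    # Always report every category, using 0 when none were found.
    return {category: counts[category] for category in SKIPPED_CATEGORIES}
```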
summarize_thesis(pages_text, language, llm_client)
Summarize the thesis from the first 10 pages in LaTeX-friendly format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pages_text` | `dict[int, str]` | Dictionary mapping page indices to text content. | required |
| `language` | `str` | Language the thesis is written in ("German" or "English"). | required |
| `llm_client` | `LLMClientProtocol` | LLM client instance for API access. | required |
Returns:
| Type | Description |
|---|---|
| `str` | A LaTeX-formatted summary string with escaped special characters. |
Example

    >>> text = {0: "This thesis examines...", 1: "The methodology..."}
    >>> client = LLMClient()
    >>> summary = summarize_thesis(text, "German", client)
    >>> "untersucht" in summary
    True
    >>> "\\" in summary  # LaTeX line breaks
    True
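
Escaping special characters for LaTeX is commonly done with a single-pass substitution so that already-escaped output is not escaped again. The sketch below shows one way to do it; `escape_latex` and its mapping are assumptions for illustration, not the module's own routine, and such a function would apply to plain text before LaTeX markup such as line breaks is added.

```python
import re

# Replacement for each character that LaTeX treats specially in text mode.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_LATEX_PATTERN = re.compile("|".join(re.escape(char) for char in _LATEX_ESCAPES))


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in one pass over the input."""
    return _LATEX_PATTERN.sub(lambda match: _LATEX_ESCAPES[match.group(0)], text)
```

A chain of `str.replace` calls would be shorter, but it can re-escape the braces and backslashes introduced by earlier replacements; the single regex pass avoids that.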