Miscs (Interro 28)

master
Sébastien Miquel 2026-05-14 09:02:09 +02:00
parent 0836d5809d
commit 7e7045293a
10 changed files with 281 additions and 161 deletions

@ -1,10 +1,11 @@
#+title: Script
#+author: Sébastien Miquel
#+date: 14-03-2026
# Time-stamp: <08-05-26 22:52>
# Time-stamp: <14-05-26 08:55>
#+OPTIONS:
* What is this?
* Meta
** What is this?
This repository contains a number of Python scripts that I use
to have student papers graded by Gemini.
@ -20,7 +21,7 @@ pour faire corriger des copies par Gemini.
4. These handwritten annotations are read and recompiled into a
version of the paper for the student.
* Disclaimer
** Disclaimer
I use this tool regularly and am happy with it, but I have made
little effort to make it general-purpose or easy to use.
@ -37,9 +38,9 @@ examples du rendu final (dans le sous dossier =BGnot=).
This may improve, but making the system easier to use
is not a priority.
* Requirements
** Requirements
** Python
*** Python
Libraries :
@ -47,13 +48,13 @@ Libraries :
pip install numpy pandas matplotlib pillow pydantic pypdf pdf2image reportlab img2pdf pymupdf ftfy ezodf google
#+END_SRC
** Poppler (for pdf2image)
*** Poppler (for pdf2image)
+ Linux : install poppler-utils
+ Windows : Download from: https://github.com/oschwartz10612/poppler-windows
and add it to your PATH
** Gemini access
*** Gemini access
You need to create an API key for Gemini (not easy).
@ -66,7 +67,7 @@ Puis ajouter =GEMINI_API_KEY= à l'environnement avec :
export GEMINI_API_KEY=…
#+END_SRC
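A minimal sketch of how a script could fail early when the key is missing, assuming the scripts read =GEMINI_API_KEY= from the environment. The helper name =get_gemini_api_key= is hypothetical and is not part of this repository.

```python
import os


def get_gemini_api_key(env=None):
    """Return the Gemini API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the scripts."
        )
    return key
```

Calling such a helper once at start-up would surface a missing key before any copy is processed.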
* Grading a batch of papers
** Grading a batch of papers
1. Create a =names= file in the current directory, with the
students' last/first names, one per line
@ -83,7 +84,8 @@ export GEMINI_API_KEY=…
for such-and-such, etc.)
6. Follow the steps below.
* Preprocessing
* Steps and scripts
** Preprocessing
1. =./rotate_all.sh Interro=
(optional)
@ -107,14 +109,14 @@ export GEMINI_API_KEY=…
Rerun on a single file with =python cutleft.py Interro/Copie01.pdf=
* Generating information about the exam statement
** Generating information about the exam statement
1. =python enonce_info.py Interro= (personal setup)
OR
2. =python gemini_for_enonce.py Interro=
+ Requires =enonce.tex/org= and `correction.tex/org`
* Labeling and grouping
** Labeling and grouping
Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
@ -130,25 +132,27 @@ Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
+ When a label is missing, you can click on the image, which
copies the coordinates to the clipboard
(on Linux…); the label can then be added by hand.
+ Using `_`, `|…` and `…|`
+ Using `_`, `|…` and `…|`:
+ `|…` is not stopped vertically by its opposite type.
+ `…|` is stopped horizontally by the nearest `|…`.
To edit a single paper:
=python plotting.py Interro/Copie01.pdf=
It also generates the =Copie01.json= files from the =Copie01_01.json= files.
3. If something goes wrong (for example, the pages are out of order)
1. If something goes wrong (for example, the pages are out of order)
- Reorder the pages of the PDF file
- Rerun =python cutleft.py Interro/Copie{id}=
- Rerun =python gemini_dir_batching.py Interro/Copie{id}= ?? To be
checked; not sure this works.
4. =python splitting_int.py Interro=
3. =python splitting_int.py Interro=
Splits the papers by exercise.
5. =python grouping.py Interro=
4. =python grouping.py Interro=
Groups the same question from different papers into reasonably
sized groups.
* Grading and annotation
** Grading and annotation
Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
@ -170,16 +174,18 @@ Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
To reduce cost, requests can be batched; they are then
processed within at most 24 hours.
+ =python correction.py Interro --batch=
+ OR =python correction.py Interro --batch-from 'Ex 4'=
+ =python submit_batches.py Interro=
+ =python batch_status.py=
+ =python fetch_batched_results.py Interro=
+ =python correction.py Interro --deal-with-batched=
3. =python post-correction.py Interro=
Tries to fix encoding/accent errors in
- Tries to fix encoding/accent errors in
=correction.json=.
- Also escapes `_` outside math mode, for LaTeX.
* Generating the annotated papers
** Generating the annotated papers
1. =python annotating.py Interro= (optional)
@ -208,7 +214,7 @@ OU
- Empty =Syncthing/Annotées= on the tablet and locally.
To be automated; it is also slow…
* Reading the handwritten marking
** Reading the handwritten marking
1. =python from_tablette.py Interro= (personal setup)
@ -243,6 +249,7 @@ OU
+ =gestion_classe ne= to create the quiz, then
+ =gestion_classe we= (set the marking scheme here)
+ =python update_ods.py Interro=
or =python update_ods.py Interro --sum= (when there is no marking scheme)
+ =gestion_classe re=
+ =gestion_classe wsent=
+ =python add_final_score.py Interro21=
@ -252,10 +259,7 @@ OU
+ update the copies from =miqmacs.fr/admin=.
6. (personal setup) Printing a paper: via Evince » print to PDF.
* Regrading a single paper (lightly tested)
** Regrading a single paper (lightly tested)
!! Warning: refaire will not work if you put an ungrouped annotation
into refaire !!

@ -160,11 +160,35 @@ def main():
used_prefixes.add(unique_prefix)
existing_items = set()
max_existing_group = 0
if not args.overwrite and os.path.exists(bgnot_dir):
for d in os.listdir(bgnot_dir):
if d.startswith(f"{unique_prefix} G"):
try:
g_id = int(d.split(' G')[-1])
max_existing_group = max(max_existing_group, g_id)
except ValueError:
pass
bnote_path = os.path.join(bgnot_dir, d, "bnote.json")
if os.path.exists(bnote_path):
with open(bnote_path, "r") as bf:
bdata = json.load(bf)
for img in bdata.get("images", []):
existing_items.add((img["id"], img["label"]))
items_to_render = []
for sid, lbls in results.items():
for lbl in labels:
if lbl in lbls:
# Only add if it hasn't been generated yet
if (sid, lbl) not in existing_items:
items_to_render.append((sid, lbl, lbls[lbl]))
if not items_to_render:
continue
# Sort structurally: by student id and label
items_to_render.sort(key=lambda x: (natural_key(x[0]), natural_key(x[1])))
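The incremental behaviour added in this hunk (continue group numbering after the highest existing `G<n>` directory, and skip items already rendered) can be sketched in isolation as follows. This is an illustration under assumptions, not the script's code: the helper names are hypothetical, the directory names are passed in as a plain list, the name matching is stricter than the `startswith`/`split` test above, and a plain tuple sort stands in for the project's `natural_key`.

```python
import re


def max_group_index(dir_names, prefix):
    """Highest N among names shaped like '<prefix> G<N>' (0 if there is none)."""
    pattern = re.compile(re.escape(prefix) + r" G(\d+)$")
    highest = 0
    for name in dir_names:
        m = pattern.match(name)
        if m:
            highest = max(highest, int(m.group(1)))
    return highest


def pending_items(results, labels, existing_items):
    """Keep only the (student id, label, value) triples not rendered yet."""
    pending = [
        (sid, lbl, lbls[lbl])
        for sid, lbls in results.items()
        for lbl in labels
        if lbl in lbls and (sid, lbl) not in existing_items
    ]
    # Plain lexicographic sort; the real script sorts with natural_key.
    return sorted(pending, key=lambda item: (item[0], item[1]))
```

New batches would then be numbered `max_group_index(...) + i`, which matches the change to the `save_batch` call further down.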
@ -217,7 +241,7 @@ def main():
batches = batches2
for i, batch in enumerate(batches, 1):
save_batch(batch, unique_prefix, i, root_dir, args.overwrite)
save_batch(batch, unique_prefix, max_existing_group + i, root_dir, args.overwrite)
if __name__ == "__main__":
main()

@ -5,14 +5,11 @@ from pathlib import Path
import argparse
if len(sys.argv) < 2:
sys.exit("Usage: python script.py InterroTest/Ex 2/Group_1.jpg OR <InputDir>")
arg_path = Path(sys.argv[1])
tasks = [] # List of tuples: (filepath_str, label_str)
results = {}
sys.exit("Usage: python script.py 'InterroTest/Ex 2/Group_1.jpg' OR <InputDir> OR 'file1' 'file2'")
# Parse Arguments
parser = argparse.ArgumentParser()
parser.add_argument("paths", nargs="+", help="List of images or directories")
parser.add_argument("--overwrite", action="store_true",
help="Force redo requests even if output exists")
parser.add_argument("--limit", type=int, help="limit calls to gemini pro to this integer")
@ -20,25 +17,37 @@ parser.add_argument("--refaire", action="store_true",
help="Redo specific copies/labels defined in refaire.json")
parser.add_argument("--batch", action="store_true",
help="Generate a JSONL file of requests to send to the Gemini Batch API")
parser.add_argument("--batch-from", type=str, metavar="LABEL",
help="Do live requests before LABEL, and batch requests from LABEL onwards")
parser.add_argument("--deal-with-batched", action="store_true",
help="Process a JSONL file containing completed batch results")
args, _ = parser.parse_known_args()
tasks = [] # List of tuples: (filepath_str, label_str)
results = {}
for path_str in args.paths:
arg_path = Path(path_str)
if arg_path.suffix == ".jpg":
INPUT_DIR = str(arg_path.parents[1])
FULL_LABEL = arg_path.parent.name
tasks.append((str(arg_path), FULL_LABEL))
results[FULL_LABEL] = []
else:
# Directory behaviour
INPUT_DIR = str(arg_path)
if not arg_path.exists():
sys.exit(f"Directory {INPUT_DIR} not found.")
print(f"Warning: {path_str} not found. Skipping.")
continue
if arg_path.is_file() and arg_path.suffix.lower() == ".jpg":
# Handle individual file
# Note: assumes structure InterroTest/Ex 2/Group_1.jpg to get parents[1]
label = arg_path.parent.name
tasks.append((str(arg_path), label))
if label not in results:
results[label] = []
elif arg_path.is_dir():
# Handle directory (original behavior)
for sub in arg_path.iterdir():
if sub.is_dir() and sub.name.startswith("Ex"):
label = sub.name
if label not in results:
results[label] = []
for img in sub.glob("*.jpg"):
tasks.append((str(img), label))
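The new multi-path handling (individual `.jpg` files and whole directories in one invocation) can be sketched as a standalone function. This is a hedged illustration: `collect_tasks` and `demo` are hypothetical names, the entries are sorted for reproducibility (the hunk above does not sort), and the sketch leaves out how `INPUT_DIR` is derived.

```python
import tempfile
from pathlib import Path


def collect_tasks(paths):
    """Return (tasks, results) for a mix of .jpg files and exam directories."""
    tasks, results = [], {}
    for path_str in paths:
        p = Path(path_str)
        if not p.exists():
            print(f"Warning: {path_str} not found. Skipping.")
            continue
        if p.is_file() and p.suffix.lower() == ".jpg":
            # Single image: the label is the name of its parent directory.
            label = p.parent.name
            tasks.append((str(p), label))
            results.setdefault(label, [])
        elif p.is_dir():
            # Directory: every sub-directory starting with "Ex" is a label.
            for sub in sorted(p.iterdir()):
                if sub.is_dir() and sub.name.startswith("Ex"):
                    results.setdefault(sub.name, [])
                    for img in sorted(sub.glob("*.jpg")):
                        tasks.append((str(img), sub.name))
    return tasks, results


def demo():
    """Build a throwaway tree and collect tasks from it."""
    with tempfile.TemporaryDirectory() as root:
        for label in ("Ex 1", "Ex 2", "Notes"):
            (Path(root) / label).mkdir()
            (Path(root) / label / "Group_1.jpg").touch()
        tasks, results = collect_tasks([root, str(Path(root) / "missing")])
        return sorted(label for _, label in tasks), sorted(results)
```

In `demo`, the `Notes` directory is ignored because its name does not start with `Ex`, and the missing path only triggers a warning.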
@ -135,17 +144,15 @@ You are asked to score the question or exercice labeled `<<label>>`,
do not score or give feedback to any other question."""
def make_prompt(full_label):
    # l = full_label.split(" ")
    # ex_label = l[0] + " " + l[1]
    # text = (Path(INPUT_DIR) / "Text" / ex_label).read_text()
    # corr = (Path(INPUT_DIR) / "Sol" / ex_label).read_text()
    # persp = (Path(INPUT_DIR) / "Persp" / ex_label).read_text()
    def read_longest_prefix_file(subdir):
        dir_path = Path(INPUT_DIR) / subdir
        matches = [f for f in dir_path.iterdir() if f.is_file() and full_label.startswith(f.name)]
        matches = [f for f in dir_path.iterdir()
                   if f.is_file()
                   and full_label.startswith(f.name)
                   and f.suffix not in [".pdf", ".tex"]]
        if not matches:
            return ""
        return max(matches, key=lambda f: len(f.name)).read_text()
        return max(matches, key=lambda f: len(f.name)).read_text(encoding="utf-8", errors="replace")
    text = read_longest_prefix_file("Text")
    corr = read_longest_prefix_file("Sol")
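To make the longest-prefix lookup concrete, here is a self-contained sketch of the same idea. It assumes file names in the directory are label prefixes such as `Ex 2` or `Ex 2 a`; the directory is passed explicitly instead of going through `INPUT_DIR`, and `demo` is a hypothetical helper added only for illustration.

```python
import tempfile
from pathlib import Path


def read_longest_prefix_file(dir_path, full_label, skip_suffixes=(".pdf", ".tex")):
    """Read the file whose name is the longest prefix of `full_label`."""
    dir_path = Path(dir_path)
    if not dir_path.is_dir():
        return ""
    matches = [
        f for f in dir_path.iterdir()
        if f.is_file()
        and full_label.startswith(f.name)
        and f.suffix not in skip_suffixes
    ]
    if not matches:
        return ""
    best = max(matches, key=lambda f: len(f.name))
    # errors="replace" avoids a crash on badly encoded files.
    return best.read_text(encoding="utf-8", errors="replace")


def demo():
    """Show that the most specific file wins, with a fallback to the exercise."""
    with tempfile.TemporaryDirectory() as root:
        text_dir = Path(root) / "Text"
        text_dir.mkdir()
        (text_dir / "Ex 2").write_text("whole exercise", encoding="utf-8")
        (text_dir / "Ex 2 a").write_text("part a only", encoding="utf-8")
        return (
            read_longest_prefix_file(text_dir, "Ex 2 a) i"),
            read_longest_prefix_file(text_dir, "Ex 2 b"),
            read_longest_prefix_file(text_dir, "Ex 3"),
        )
```

A label such as `Ex 2 a) i` picks up `Ex 2 a`, while `Ex 2 b` falls back to the exercise-level file `Ex 2`, and an unknown exercise yields an empty string.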
@ -482,7 +489,7 @@ def handle_label_errors(pid, label, res, pdf_path):
error_type = res.get("error")
all_labels = read_all_labels(INPUT_DIR)
labels_txt = (Path(INPUT_DIR) / "labels").read_text()
labels_txt = (Path(INPUT_DIR) / "labels").read_text(encoding="utf-8", errors="replace")
enonce = enonce_total(INPUT_DIR)
if error_type == "wrong-label":
@ -499,7 +506,7 @@ Here is the full content of the exam :
{enonce}
Here is a list of all possible lables. You need to answer with one of these :
Here is a list of all possible labels. You need to answer with one of these :
{labels_txt}
"""
@ -780,7 +787,32 @@ if __name__ == "__main__":
print(f"Warning: --refaire flag used, but {refaire_path} not found.", file=sys.stderr)
if args.batch:
if args.batch or args.batch_from:
from utils import read_all_labels
all_labels = read_all_labels(INPUT_DIR)
batch_tasks = []
if args.batch_from:
if args.batch_from not in all_labels:
sys.exit(f"Error: Label '{args.batch_from}' not found. Available labels: {all_labels}")
target_idx = all_labels.index(args.batch_from)
live_tasks = []
for task in tasks_to_process:
lbl = task[1]
# Any label found sequentially equal or after `args.batch_from` gets batched
if lbl in all_labels and all_labels.index(lbl) >= target_idx:
batch_tasks.append(task)
else:
live_tasks.append(task)
tasks_to_process = live_tasks # Keep live tasks to be run right after
else:
batch_tasks = tasks_to_process
tasks_to_process = [] # Run nothing live if just `--batch`
if batch_tasks:
batch_flash_file = Path(INPUT_DIR) / "batch_requests_flash.jsonl"
batch_pro_file = Path(INPUT_DIR) / "batch_requests_pro.jsonl"
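The `--batch-from` split above can be sketched as a pure function: tasks whose label sits at or after the chosen label in the ordered label list go to the batch, everything else (including labels missing from the list) stays live. The function name is hypothetical, and the sketch assumes `all_labels` is ordered as in the exam.

```python
def split_live_and_batch(tasks, all_labels, batch_from):
    """Partition (file_path, label) tasks into (live, batch) around `batch_from`."""
    if batch_from not in all_labels:
        raise ValueError(f"Label {batch_from!r} not found in {all_labels}")
    target_idx = all_labels.index(batch_from)
    # First position of each label, computed once instead of calling index() per task.
    position = {}
    for i, lbl in enumerate(all_labels):
        position.setdefault(lbl, i)
    live, batch = [], []
    for task in tasks:
        lbl = task[1]
        if lbl in position and position[lbl] >= target_idx:
            batch.append(task)
        else:
            live.append(task)
    return live, batch
```

With labels `Ex 1` to `Ex 4` and `batch_from="Ex 3"`, tasks for `Ex 3` and `Ex 4` are batched, and a task with an unlisted label is kept live.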
@ -790,7 +822,7 @@ if __name__ == "__main__":
with open(batch_flash_file, "w", encoding="utf-8") as f_flash, \
open(batch_pro_file, "w", encoding="utf-8") as f_pro:
for task in tasks_to_process:
for task in batch_tasks:
file_path, label = task[0], task[1]
group_name = os.path.splitext(file_path)[0]
json_path = group_name + '.json'
@ -819,7 +851,6 @@ if __name__ == "__main__":
"maxOutputTokens": 65535,
"responseMimeType": "application/json",
"responseSchema": UNROLLED_SCHEMA
# TypeAdapter(List[EvaluationEntry]).json_schema()
}
}
}
@ -835,6 +866,9 @@ if __name__ == "__main__":
print(f" - {count_flash} requests saved to {batch_flash_file} (for {MODEL_ID_flash})")
print(f" - {count_pro} requests saved to {batch_pro_file} (for {MODEL_ID_pro})")
print("Upload these files via the File API and create two separate batch jobs.")
# If there's no live tasks to do, and we aren't doing a batched ingestion, exit right away
if not tasks_to_process and not args.deal_with_batched:
sys.exit(0)
batched_responses = {}
@ -883,7 +917,7 @@ if __name__ == "__main__":
print("Time elapsed : ", end_time - start_time)
print("Requests to pro / flash : ", pro_count, flash_count)
if errors_summary:
print("\n--- Summary of Exceptions ---", file=sys.stderr)
print("\n--- Summary of Exceptions (You can use several images on one instance) ---", file=sys.stderr)
for (err, file) in errors_summary:
print(err, file=sys.stderr)
escaped_path = shlex.quote(str(file))

@ -296,7 +296,7 @@ def process_copy_group(group_key, files):
continue # Retry immediately
else:
name = "Unknown"
annota.name = name
# Save result
with open(output_json, "w", encoding="utf-8") as f:
json.dump(annota.model_dump(), f, indent=2)

@ -12386,6 +12386,7 @@ maternelles
maternité
mathématicien
mathématique
mathématiquement
mathématiques
maths
matière

@ -7,6 +7,46 @@ import argparse
if len(sys.argv) < 2:
    sys.exit("Usage: python script.py <InputDir>")
def escape_latex_underscores(text):
    r"""
    Escape '_' outside LaTeX math environments.
    Supports:
    - $...$
    - $$...$$
    - \( ... \)
    - \[ ... \]
    """
    # Regex matching LaTeX math blocks
    math_pattern = re.compile(
        r'(\$\$.*?\$\$|'   # $$...$$
        r'\$.*?\$|'        # $...$
        r'\\\(.*?\\\)|'    # \( ... \)
        r'\\\[.*?\\\])',   # \[ ... \]
        re.DOTALL
    )
    parts = []
    last_end = 0
    for match in math_pattern.finditer(text):
        start, end = match.span()
        # Escape underscores outside math
        outside = text[last_end:start].replace('_', r'\_')
        parts.append(outside)
        # Keep math block unchanged
        parts.append(match.group(0))
        last_end = end
    # Remaining text after last math block
    outside = text[last_end:].replace('_', r'\_')
    parts.append(outside)
    return ''.join(parts)
arg_path = Path(sys.argv[1])
tasks = [] # List of tuples: (filepath_str, label_str)
results = {}
@ -79,7 +119,8 @@ def clean_string(s: str) -> str:
if '\x00' in s:
s = fast_fix(s)
s = s.replace('\x00', '')
return some_other_replacements(s)
s = some_other_replacements(s)
return escape_latex_underscores(s)
def clean_obj(obj):
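One point to watch with the plain `replace('_', r'\_')` used above: if the cleaning step runs twice on the same text, an already escaped `\_` gains a second backslash. The sketch below is a possible idempotent variant, not the commit's code. It reuses the same math regex and escapes only underscores that are not already preceded by a backslash; the function name is hypothetical. One known limitation: an underscore that follows an escaped backslash (`\\_`) is left alone.

```python
import re

# Same math-block pattern as in the function above.
MATH_PATTERN = re.compile(
    r'(\$\$.*?\$\$|'   # $$...$$
    r'\$.*?\$|'        # $...$
    r'\\\(.*?\\\)|'    # \( ... \)
    r'\\\[.*?\\\])',   # \[ ... \]
    re.DOTALL,
)
# An underscore not immediately preceded by a backslash.
UNESCAPED_UNDERSCORE = re.compile(r'(?<!\\)_')


def escape_latex_underscores_once(text):
    """Escape '_' outside math, leaving already escaped '\\_' untouched."""
    def escape(segment):
        return UNESCAPED_UNDERSCORE.sub(lambda m: r'\_', segment)

    parts = []
    last_end = 0
    for match in MATH_PATTERN.finditer(text):
        start, end = match.span()
        parts.append(escape(text[last_end:start]))  # text outside math
        parts.append(match.group(0))                # math block kept as-is
        last_end = end
    parts.append(escape(text[last_end:]))
    return ''.join(parts)
```

Applying the function a second time to its own output leaves the text unchanged, which matters if `post-correction.py` is rerun on an already cleaned `correction.json`.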

@ -8,6 +8,9 @@ import shutil
from pathlib import Path
from collections import defaultdict
carreau = 1000 // 38
def decode_json(pdf_file):
file_path = Path(pdf_file)
with open(file_path.with_suffix(".json"), "r") as f:
@ -26,8 +29,7 @@ def decode_json(pdf_file):
for d in bb_list:
(b, label) = d["box_2d"], d["label"]
pn = page_number(b)
carreau = 1000 // 38
result.append((label, pn, b[0] - int(carreau), b[2]-int(carreau), b[1], b[3]))
result.append((label, pn, b[0] - carreau, b[2]-carreau, b[1], b[3]))
result.sort(key=lambda x: (x[1], x[2]))
return (name, result)
@ -98,7 +100,7 @@ def split_an_interro(base_dir, input_pdf, coords_list):
# RULE 2: Determine stopping label
for next_item in coords_list[idx + 1:]:
n_clean, n_type, n_pn, n_y_start, _, _, _ = next_item
n_clean, n_type, n_pn, n_y_start, n_y_end, _, _ = next_item
if c_type == "L":
is_stop = (n_type in ("L", "N"))
@ -109,7 +111,9 @@ def split_an_interro(base_dir, input_pdf, coords_list):
if is_stop:
end_page = n_pn
end_y_target_raw = n_y_start
# end_y_target_raw = n_y_start
# One grid square was subtracted earlier; add it back…
end_y_target_raw = min(n_y_start + int(1.25 * carreau), 1000)
break
# RULES 3 & 4: Calculate horizontal boundaries (0.0 to 1.0 fraction of local page width)
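A small worked example of the grid-square arithmetic, assuming coordinates are normalised to the range 0 to 1000 and that `n_y_start` is the value already shifted up by one `carreau` in `decode_json` (as the comment in the hunk suggests). The helper names are hypothetical.

```python
CARREAU = 1000 // 38  # one grid square in normalised units: 26


def shifted_top(raw_y_start):
    """Top of a label box after the one-square shift applied in decode_json."""
    return raw_y_start - CARREAU


def stop_target(n_y_start):
    """Where the previous crop ends: 1.25 squares below the shifted top, clamped."""
    return min(n_y_start + int(1.25 * CARREAU), 1000)
```

Since `int(1.25 * 26)` is 32, a label whose raw top is 500 is stored as 474 and the previous crop stops at 506, that is 6 units below the raw top of the next label. Near the bottom of the page the value is clamped to 1000.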

@ -72,8 +72,6 @@ def main():
print("-" * 50)
print("All batch jobs have been initiated.")
print("Save the Batch Job Names above. You can monitor them with:")
print(" client.batches.get(name='YOUR_BATCH_JOB_NAME')")
if __name__ == "__main__":
main()

@ -1,3 +1,4 @@
import argparse
import os
import sys
import json
@ -12,11 +13,12 @@ ODS_PATH = "/home/sebastien/Rust/gestion_classe/Staging/current_eval.ods"
TARGET_DIR_NAME = "A Rendre"
def main():
    if len(sys.argv) < 2:
        # Default to current directory if not provided, or raise error
        work_dir = os.getcwd()
    else:
        work_dir = os.path.abspath(sys.argv[1])
    parser = argparse.ArgumentParser(description="Update ODS with student scores.")
    parser.add_argument("work_dir", nargs="?", default=os.getcwd(), help="Directory to process")
    parser.add_argument("--sum", action="store_true", help="Write only the total sum per student")
    args = parser.parse_args()
    work_dir = os.path.abspath(args.work_dir)
    all_labels = read_all_labels(Path(work_dir))
@ -101,7 +103,19 @@ def main():
# Start filling from Row 2 (index 2), immediately below the name line
start_row = 2
# for i, key in enumerate(scores_data.keys()):
if args.sum:
# Calculate total
total = 0.0
for val in scores_data.values():
try:
total += float(val)
except (ValueError, TypeError):
continue
cell = sheet[start_row, col_idx]
cell.set_value(total)
print(f"Set sum for {item}: {total}")
else:
for i, key in enumerate(all_labels):
row_idx = start_row + i
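The `--sum` branch boils down to a tolerant sum over one student's scores. A minimal sketch, assuming `scores_data` maps labels to values that may be numbers, numeric strings, or missing entries; the function name is hypothetical.

```python
def total_score(scores):
    """Sum the values that parse as numbers and skip everything else."""
    total = 0.0
    for val in scores.values():
        try:
            total += float(val)
        except (ValueError, TypeError):
            # None, empty strings or free-text marks are ignored.
            continue
    return total
```

Note that a value such as the string `nan` still parses as a float and would propagate into the total, so it may be worth filtering if such values can occur.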

@ -14,7 +14,7 @@ def enonce_total(base_dir):
if not text_dir.is_dir():
return ""
files = [f for f in text_dir.iterdir() if f.is_file()]
files = [f for f in text_dir.iterdir() if f.is_file() and f.suffix not in [".pdf", ".tex"]]
files.sort(key=lambda f: natural_key(f.name))
output = []