Miscs (Interro 28)

2026-05-14 09:02:09 +02:00 · 2026-05-14 09:02:09 +02:00 · 7e7045293a
parent 0836d5809d
commit 7e7045293a
10 changed files with 281 additions and 161 deletions
--- a/Readme.org
+++ b/Readme.org
@ -1,10 +1,11 @@
 #+title:  Script
 #+author: Sébastien Miquel
 #+date:   14-03-2026
-# Time-stamp: <08-05-26 22:52>
+# Time-stamp: <14-05-26 08:55>
 #+OPTIONS:

-* What is this?
+* Meta
+** What is this?

 This repository contains a number of Python scripts that I use
 to have student papers graded by Gemini.
@ -20,7 +21,7 @ to have student papers graded by Gemini.
 4. These handwritten annotations are read and recompiled into a
    version of the paper for the student.

-* Disclaimer
+** Disclaimer

 I use this tool regularly and am satisfied with it, but I have
 made little effort to make it general-purpose or easy to use.
@ -37,9 +38,9 @ examples of the final output (in the =BGnot= subfolder).
 This situation may improve, but making this system easier to use
 is not a priority.

-* Requirements
+** Requirements

-** Python
+*** Python

 Libraries :

@ -47,13 +48,13 @@ Libraries :
 pip install numpy pandas matplotlib pillow pydantic pypdf pdf2image reportlab img2pdf pymupdf ftfy ezodf google
 #+END_SRC

-** Poppler (for pdf2image)
+*** Poppler (for pdf2image)

 + Linux : install poppler-utils
 + Windows : Download from: https://github.com/oschwartz10612/poppler-windows
   and add it to your PATH

-** Gemini access
+*** Gemini access

 You need to create an API key for Gemini (not easy).

@ -66,7 +67,7 @ Then add =GEMINI_API_KEY= to the environment with:
 export GEMINI_API_KEY=…
 #+END_SRC

-* Grading a batch of papers
+** Grading a batch of papers

 1. Create a =names= file in the current directory, with the
    students' last/first names, one per line
@ -83,7 +84,8 @ export GEMINI_API_KEY=…
    for such-and-such, etc.)
 6. Follow the steps below.

-* Preprocessing
+* Steps and scripts
+** Preprocessing

 1. =./rotate_all.sh Interro=
    (optional)
@ -107,14 +109,14 @@ export GEMINI_API_KEY=…

    Rerun on a single file with =python cutleft.py Interro/Copie01.pdf=

-* Generating information about the exam statement
+** Generating information about the exam statement

 1. =python enonce_info.py Interro= (personal setup)
 OR
 2. =python gemini_for_enonce.py Interro=
    + Requires =enonce.tex/org= and =correction.tex/org=

-* Labeling and grouping
+** Labeling and grouping
+** Labelisation et regroupement

 Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~

@ -130,25 +132,27 @@ Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
    + When a label is missing, you can click on
      the image, which copies the coordinates to the clipboard
      (on Linux…), and then add the label by hand.
-    + Use of `_`, `|…` and `…|`
+    + Use of `_`, `|…` and `…|`:
+      + `|…` is not stopped vertically by its opposite type.
+      + `…|` is stopped horizontally by the nearest `|…`.
    To modify a single paper:
 =python plotting.py Interro/Copie01.pdf=

    It also generates the =Copie01.json= files from the =Copie01_01.json= files
- 3. If something goes wrong (for example, the pages are not in the right order)
+    1. If something goes wrong (for example, the pages are not in the right order)
       - Reorder the pages of the PDF file
       - Rerun =python cutleft.py Interro/Copie{id}=
       - Rerun =python gemini_dir_batching.py Interro/Copie{id}= ?? To be
         checked, not sure this works.
- 4. =python splitting_int.py Interro=
+ 3. =python splitting_int.py Interro=

    Splits the papers by exercise
- 5. =python grouping.py Interro=
+ 4. =python grouping.py Interro=

    Groups the same questions from different papers into
    reasonably sized groups.

-* Grading and annotation
+** Grading and annotation

 Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~

@ -170,16 +174,18 @@ Set proxy with ~export HTTPS_PROXY="http://10.0.0.1:3128"~
    To reduce cost, requests can be batched; they are then
    processed within at most 24 hours.
    + =python correction.py Interro --batch=
+    + OR =python correction.py Interro --batch-from 'Ex 4'=
    + =python submit_batches.py Interro=
    + =python batch_status.py=
    + =python fetch_batched_results.py Interro=
    + =python correction.py Interro --deal-with-batched=
 3. =python post-correction.py Interro=

-    Tries to fix encoding/accent errors in
+    - Tries to fix encoding/accent errors in
 =correction.json=.
+    - Also escapes `_` outside math mode, for LaTeX.

-* Generating the annotated papers
+** Generating the annotated papers

 1. =python annotating.py Interro= (optional)

@ -208,7 +214,7 @ OR
    - Empty =Syncthing/Annotées= on the tablet and locally.
      This should be automated; it is also slow…

-* Reading the handwritten grading
+** Reading the handwritten grading

 1. =python from_tablette.py Interro= (personal setup)

@ -243,6 +249,7 @ OR
    + =gestion_classe ne= to create the test, then
    + =gestion_classe we= (set the grading scale here)
    + =python update_ods.py Interro=
+      or =python update_ods.py Interro --sum= (when there is no grading scale)
    + =gestion_classe re=
    + =gestion_classe wsent=
    + =python add_final_score.py Interro21=
@ -252,10 +259,7 @ OR
    + update the copies from =miqmacs.fr/admin=.
 6. (personal setup) Printing a paper. Via Evince » print to PDF.

-
-
-* Regrading a single paper (lightly tested)
-
+** Regrading a single paper (lightly tested)

 !! Warning: refaire will not work if you put a non-grouped
 annotation into refaire !!
--- a/annotating_by_label.py
+++ b/annotating_by_label.py
@ -160,11 +160,35 @@ def main():

        used_prefixes.add(unique_prefix)

+        existing_items = set()
+        max_existing_group = 0
+
+
+        if not args.overwrite and os.path.exists(bgnot_dir):
+            for d in os.listdir(bgnot_dir):
+                if d.startswith(f"{unique_prefix} G"):
+                    try:
+                        g_id = int(d.split(' G')[-1])
+                        max_existing_group = max(max_existing_group, g_id)
+                    except ValueError:
+                        pass
+
+                    bnote_path = os.path.join(bgnot_dir, d, "bnote.json")
+                    if os.path.exists(bnote_path):
+                        with open(bnote_path, "r") as bf:
+                            bdata = json.load(bf)
+                            for img in bdata.get("images", []):
+                                existing_items.add((img["id"], img["label"]))
+
        items_to_render = []
        for sid, lbls in results.items():
            for lbl in labels:
                if lbl in lbls:
+                    # Only add if it hasn't been generated yet
+                    if (sid, lbl) not in existing_items:
                        items_to_render.append((sid, lbl, lbls[lbl]))
+        if not items_to_render:
+            continue

        # Sort structurally: by student id and label
        items_to_render.sort(key=lambda x: (natural_key(x[0]), natural_key(x[1])))
@ -217,7 +241,7 @@ def main():
            batches = batches2

        for i, batch in enumerate(batches, 1):
-            save_batch(batch, unique_prefix, i, root_dir, args.overwrite)
+            save_batch(batch, unique_prefix, max_existing_group + i, root_dir, args.overwrite)

 if __name__ == "__main__":
    main()
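
The incremental numbering introduced in `annotating_by_label.py` can be illustrated in isolation. This is a minimal sketch, assuming group directories are named `"<prefix> G<n>"` as in the hunk above; the helper name `max_existing_group` and the sample directory names are hypothetical.

```python
def max_existing_group(dir_names, unique_prefix):
    """Return the highest group id already used for unique_prefix (0 if none)."""
    highest = 0
    for name in dir_names:
        if not name.startswith(f"{unique_prefix} G"):
            continue
        try:
            highest = max(highest, int(name.split(" G")[-1]))
        except ValueError:
            # Directory suffix is not a number: ignore it, as the diff does.
            pass
    return highest


# Hypothetical directory listing.
existing = ["Ex 1 G1", "Ex 1 G2", "Ex 1 Gx", "Ex 2 G7"]
offset = max_existing_group(existing, "Ex 1")

# New batches continue after the highest existing group id.
new_ids = [offset + i for i, _ in enumerate(["batchA", "batchB"], 1)]
print(new_ids)
```

With this offset, rerunning without `--overwrite` should append new groups rather than reuse ids of groups that already exist.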
--- a/correction.py
+++ b/correction.py
@ -5,14 +5,11 @@ from pathlib import Path
 import argparse

 if len(sys.argv) < 2:
-    sys.exit("Usage: python script.py InterroTest/Ex 2/Group_1.jpg OR <InputDir>")
-
-arg_path = Path(sys.argv[1])
-tasks = [] # List of tuples: (filepath_str, label_str)
-results = {}
+    sys.exit("Usage: python script.py 'InterroTest/Ex 2/Group_1.jpg' OR <InputDir> OR 'file1' 'file2'")

 # Parse Arguments
 parser = argparse.ArgumentParser()
+parser.add_argument("paths", nargs="+", help="List of images or directories")
 parser.add_argument("--overwrite", action="store_true",
                    help="Force redo requests even if output exists")
 parser.add_argument("--limit", type=int, help="limit calls to gemini rpo integer")
@ -20,25 +17,37 @@ parser.add_argument("--refaire", action="store_true",
                    help="Redo specific copies/labels defined in refaire.json")
 parser.add_argument("--batch", action="store_true",
                    help="Generate a JSONL file of requests to send to the Gemini Batch API")
+parser.add_argument("--batch-from", type=str, metavar="LABEL",
+                    help="Do live requests before LABEL, and batch requests from LABEL onwards")
 parser.add_argument("--deal-with-batched", action="store_true",
                    help="Process a JSONL file containing completed batch results")
 args, _ = parser.parse_known_args()

+tasks = [] # List of tuples: (filepath_str, label_str)
+results = {}
+
+
+for path_str in args.paths:
+    arg_path = Path(path_str)

-if arg_path.suffix == ".jpg":
-    INPUT_DIR = str(arg_path.parents[1])
-    FULL_LABEL = arg_path.parent.name
-    tasks.append((str(arg_path), FULL_LABEL))
-    results[FULL_LABEL] = []
-else:
-    # Directory behaviour
-    INPUT_DIR = str(arg_path)
    if not arg_path.exists():
-        sys.exit(f"Directory {INPUT_DIR} not found.")
+        print(f"Warning: {path_str} not found. Skipping.")
+        continue

+    if arg_path.is_file() and arg_path.suffix.lower() == ".jpg":
+        # Handle individual file
+        # Note: assumes structure InterroTest/Ex 2/Group_1.jpg to get parents[1]
+        label = arg_path.parent.name
+        tasks.append((str(arg_path), label))
+        if label not in results:
+            results[label] = []
+
+    elif arg_path.is_dir():
+        # Handle directory (original behavior)
        for sub in arg_path.iterdir():
            if sub.is_dir() and sub.name.startswith("Ex"):
                label = sub.name
+                if label not in results:
                    results[label] = []
                for img in sub.glob("*.jpg"):
                    tasks.append((str(img), label))
@ -135,17 +144,15 @@ You are asked to score the question or exercice labeled `<<label>>`,
 do not score or give feedback to any other question."""

 def make_prompt(full_label):
-    # l = full_label.split(" ")
-    # ex_label = l[0] + " " + l[1]
-    # text = (Path(INPUT_DIR) / "Text" / ex_label).read_text()
-    # corr = (Path(INPUT_DIR) / "Sol" / ex_label).read_text()
-    # persp = (Path(INPUT_DIR) / "Persp" / ex_label).read_text()
    def read_longest_prefix_file(subdir):
        dir_path = Path(INPUT_DIR) / subdir
-        matches = [f for f in dir_path.iterdir() if f.is_file() and full_label.startswith(f.name)]
+        matches = [f for f in dir_path.iterdir()
+                   if f.is_file()
+                   and full_label.startswith(f.name)
+                   and f.suffix not in [".pdf", ".tex"]]
        if not matches:
            return ""
-        return max(matches, key=lambda f: len(f.name)).read_text()
+        return max(matches, key=lambda f: len(f.name)).read_text(encoding="utf-8", errors="replace")

    text = read_longest_prefix_file("Text")
    corr = read_longest_prefix_file("Sol")
@ -482,7 +489,7 @@ def handle_label_errors(pid, label, res, pdf_path):
    error_type = res.get("error")

    all_labels = read_all_labels(INPUT_DIR)
-    labels_txt = (Path(INPUT_DIR) / "labels").read_text()
+    labels_txt = (Path(INPUT_DIR) / "labels").read_text(encoding="utf-8", errors="replace")
    enonce = enonce_total(INPUT_DIR)

    if error_type == "wrong-label":
@ -499,7 +506,7 @@ Here is the full content of the exam :

 {enonce}

-Here is a list of all possible lables. You need to answer with one of these :
+Here is a list of all possible labels. You need to answer with one of these :

 {labels_txt}
 """
@ -780,7 +787,32 @@ if __name__ == "__main__":
            print(f"Warning: --refaire flag used, but {refaire_path} not found.", file=sys.stderr)


-    if args.batch:
+    if args.batch or args.batch_from:
+        from utils import read_all_labels
+        all_labels = read_all_labels(INPUT_DIR)
+
+        batch_tasks = []
+        if args.batch_from:
+            if args.batch_from not in all_labels:
+                sys.exit(f"Error: Label '{args.batch_from}' not found. Available labels: {all_labels}")
+
+            target_idx = all_labels.index(args.batch_from)
+            live_tasks = []
+
+            for task in tasks_to_process:
+                lbl = task[1]
+                # Any label at or after `args.batch_from` in label order gets batched
+                if lbl in all_labels and all_labels.index(lbl) >= target_idx:
+                    batch_tasks.append(task)
+                else:
+                    live_tasks.append(task)
+
+            tasks_to_process = live_tasks # Keep live tasks to be run right after
+        else:
+            batch_tasks = tasks_to_process
+            tasks_to_process = [] # Run nothing live if just `--batch`
+
+        if batch_tasks:
            batch_flash_file = Path(INPUT_DIR) / "batch_requests_flash.jsonl"
            batch_pro_file = Path(INPUT_DIR) / "batch_requests_pro.jsonl"

@ -790,7 +822,7 @@ if __name__ == "__main__":
            with open(batch_flash_file, "w", encoding="utf-8") as f_flash, \
                 open(batch_pro_file, "w", encoding="utf-8") as f_pro:

-            for task in tasks_to_process:
+                for task in batch_tasks:
                    file_path, label = task[0], task[1]
                    group_name = os.path.splitext(file_path)[0]
                    json_path = group_name + '.json'
@ -819,7 +851,6 @@ if __name__ == "__main__":
                                "maxOutputTokens": 65535,
                                "responseMimeType": "application/json",
                                "responseSchema": UNROLLED_SCHEMA
-                            # TypeAdapter(List[EvaluationEntry]).json_schema()
                            }
                        }
                    }
@ -835,6 +866,9 @@ if __name__ == "__main__":
            print(f" - {count_flash} requests saved to {batch_flash_file} (for {MODEL_ID_flash})")
            print(f" - {count_pro} requests saved to {batch_pro_file} (for {MODEL_ID_pro})")
            print("Upload these files via the File API and create two separate batch jobs.")
+
+        # If there are no live tasks and we are not ingesting batched results, exit right away
+        if not tasks_to_process and not args.deal_with_batched:
            sys.exit(0)

    batched_responses = {}
@ -883,7 +917,7 @@ if __name__ == "__main__":
    print("Time elapsed : ", end_time - start_time)
    print("Requests to pro / flash : ", pro_count, flash_count)
    if errors_summary:
-        print("\n--- Summary of Exceptions ---", file=sys.stderr)
+        print("\n--- Summary of Exceptions (You can use several images on one instance) ---", file=sys.stderr)
        for (err, file) in errors_summary:
            print(err, file=sys.stderr)
            escaped_path = shlex.quote(str(file))
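
The live/batched split performed by `--batch-from` can be sketched as a standalone function. This is an illustration only: `partition_tasks` is a hypothetical name, and the task tuples and labels below are made up; the logic mirrors the hunk above, where a task whose label is not in `all_labels` stays live.

```python
def partition_tasks(tasks, all_labels, batch_from):
    """Split (filepath, label) tasks into (live, batched) around batch_from."""
    if batch_from not in all_labels:
        raise ValueError(f"Label {batch_from!r} not found in {all_labels}")
    target_idx = all_labels.index(batch_from)

    live, batched = [], []
    for task in tasks:
        label = task[1]
        # Labels at or after batch_from (in label order) are batched.
        if label in all_labels and all_labels.index(label) >= target_idx:
            batched.append(task)
        else:
            live.append(task)
    return live, batched


# Hypothetical labels and tasks.
all_labels = ["Ex 1", "Ex 2", "Ex 3", "Ex 4"]
tasks = [("a.jpg", "Ex 1"), ("b.jpg", "Ex 4"), ("c.jpg", "Ex 3"), ("d.jpg", "Bonus")]

live, batched = partition_tasks(tasks, all_labels, "Ex 3")
print(live)
print(batched)
```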
--- a/gemini_for_labels.py
+++ b/gemini_for_labels.py
@ -296,7 +296,7 @@ def process_copy_group(group_key, files):
                        continue  # Retry immediately
                    else:
                        name = "Unknown"
-
+                annota.name = name
                # Save result
                with open(output_json, "w", encoding="utf-8") as f:
                    json.dump(annota.model_dump(), f, indent=2)
--- a/liste_francais.txt
+++ b/liste_francais.txt
@ -12386,6 +12386,7 @@ maternelles
 maternité
 mathématicien
 mathématique
+mathématiquement
 mathématiques
 maths
 matière
--- a/post-correction.py
+++ b/post-correction.py
@ -7,6 +7,46 @@ import argparse
 if len(sys.argv) < 2:
    sys.exit("Usage: python script.py <InputDir>")

+def escape_latex_underscores(text):
+    r"""
+    Escape '_' outside LaTeX math environments.
+    Supports:
+      - $...$
+      - $$...$$
+      - \( ... \)
+      - \[ ... \]
+    """
+
+    # Regex matching LaTeX math blocks
+    math_pattern = re.compile(
+        r'(\$\$.*?\$\$|'      # $$...$$
+        r'\$.*?\$|'           # $...$
+        r'\\\(.*?\\\)|'       # \( ... \)
+        r'\\\[.*?\\\])',      # \[ ... \]
+        re.DOTALL
+    )
+
+    parts = []
+    last_end = 0
+
+    for match in math_pattern.finditer(text):
+        start, end = match.span()
+
+        # Escape underscores outside math
+        outside = text[last_end:start].replace('_', r'\_')
+        parts.append(outside)
+
+        # Keep math block unchanged
+        parts.append(match.group(0))
+
+        last_end = end
+
+    # Remaining text after last math block
+    outside = text[last_end:].replace('_', r'\_')
+    parts.append(outside)
+
+    return ''.join(parts)
+
 arg_path = Path(sys.argv[1])
 tasks = [] # List of tuples: (filepath_str, label_str)
 results = {}
@ -79,7 +119,8 @@ def clean_string(s: str) -> str:
    if '\x00' in s:
        s = fast_fix(s)
        s = s.replace('\x00', '')
-    return some_other_replacements(s)
+    s = some_other_replacements(s)
+    return escape_latex_underscores(s)


 def clean_obj(obj):
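
A condensed sketch of the underscore escaping added above, for experimentation. It reuses the same math-block regex but relies on `re.split` instead of the `finditer` loop; the name `escape_outside_math` and the sample strings are hypothetical.

```python
import re

# Same alternatives as in the diff: $$...$$, $...$, \( ... \), \[ ... \]
MATH = re.compile(r'(\$\$.*?\$\$|\$.*?\$|\\\(.*?\\\)|\\\[.*?\\\])', re.DOTALL)


def escape_outside_math(text):
    # With a single capturing group, re.split alternates:
    # outside text (even index), math block (odd index), outside text, ...
    parts = MATH.split(text)
    return "".join(
        part if i % 2 else part.replace("_", r"\_")
        for i, part in enumerate(parts)
    )


print(escape_outside_math("a_b $x_1$ c_d"))
print(escape_outside_math(r"\(u_n\) tends_to \[v_k\]"))
```

One thing to keep in mind, assuming the script may be run more than once on the same file: a plain `replace('_', r'\_')` is not idempotent, so text that already contains `\_` gets a second backslash on a later pass.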
--- a/splitting_int.py
+++ b/splitting_int.py
@ -8,6 +8,9 @@ import shutil
 from pathlib import Path
 from collections import defaultdict

+carreau = 1000 // 38
+
+
 def decode_json(pdf_file):
    file_path = Path(pdf_file)
    with open(file_path.with_suffix(".json"), "r") as f:
@ -26,8 +29,7 @@ def decode_json(pdf_file):
    for d in bb_list:
        (b, label) = d["box_2d"], d["label"]
        pn = page_number(b)
-        carreau = 1000 // 38
-        result.append((label, pn, b[0] - int(carreau), b[2]-int(carreau), b[1], b[3]))
+        result.append((label, pn, b[0] - carreau, b[2]-carreau, b[1], b[3]))
    result.sort(key=lambda x: (x[1], x[2]))
    return (name, result)

@ -98,7 +100,7 @@ def split_an_interro(base_dir, input_pdf, coords_list):

        # RULE 2: Determine stopping label
        for next_item in coords_list[idx + 1:]:
-            n_clean, n_type, n_pn, n_y_start, _, _, _ = next_item
+            n_clean, n_type, n_pn, n_y_start, n_y_end, _, _ = next_item

            if c_type == "L":
                is_stop = (n_type in ("L", "N"))
@ -109,7 +111,9 @@ def split_an_interro(base_dir, input_pdf, coords_list):

            if is_stop:
                end_page = n_pn
-                end_y_target_raw = n_y_start
+                # end_y_target_raw = n_y_start
+                # We subtracted one grid square (carreau) earlier, so we add it back…
+                end_y_target_raw = min(n_y_start + int(1.25 * carreau), 1000)
                break

        # RULES 3 & 4: Calculate horizontal boundaries (0.0 to 1.0 fraction of local page width)
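
The arithmetic behind the new stopping coordinate can be checked by hand. This is a small sketch, assuming vertical coordinates live on a 0 to 1000 scale (as the `min(..., 1000)` clamp suggests); the function name `stop_y` and the sample values are hypothetical.

```python
carreau = 1000 // 38  # integer division: 26


def stop_y(n_y_start):
    """End coordinate used when the next label stops the current region."""
    return min(n_y_start + int(1.25 * carreau), 1000)


# decode_json stores y_start as b[0] - carreau, so for a raw box top of 500
# the stored start is 474 and the stop target lands slightly below the raw top.
raw_top = 500
stored_start = raw_top - carreau
print(carreau, int(1.25 * carreau), stored_start, stop_y(stored_start))

# Near the bottom of the page the value is clamped to 1000.
print(stop_y(990))
```

Since `int(1.25 * carreau)` is 32, the stop target ends up 6 units past the raw top of the next label's box rather than exactly on it.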
--- a/submit_batches.py
+++ b/submit_batches.py
@ -72,8 +72,6 @@ def main():

    print("-" * 50)
    print("All batch jobs have been initiated.")
-    print("Save the Batch Job Names above. You can monitor them with:")
-    print("  client.batches.get(name='YOUR_BATCH_JOB_NAME')")

 if __name__ == "__main__":
    main()
--- a/update_ods.py
+++ b/update_ods.py
@ -1,3 +1,4 @@
+import argparse
 import os
 import sys
 import json
@ -12,11 +13,12 @@ ODS_PATH = "/home/sebastien/Rust/gestion_classe/Staging/current_eval.ods"
 TARGET_DIR_NAME = "A Rendre"

 def main():
-    if len(sys.argv) < 2:
-        # Default to current directory if not provided, or raise error
-        work_dir = os.getcwd()
-    else:
-        work_dir = os.path.abspath(sys.argv[1])
+    parser = argparse.ArgumentParser(description="Update ODS with student scores.")
+    parser.add_argument("work_dir", nargs="?", default=os.getcwd(), help="Directory to process")
+    parser.add_argument("--sum", action="store_true", help="Write only the total sum per student")
+    args = parser.parse_args()
+
+    work_dir = os.path.abspath(args.work_dir)
    
    all_labels = read_all_labels(Path(work_dir))

@ -101,7 +103,19 @@ def main():
        # Start filling from Row 2 (index 2), immediately below the name line
        start_row = 2

-        # for i, key in enumerate(scores_data.keys()):
+        if args.sum:
+            # Calculate total
+            total = 0.0
+            for val in scores_data.values():
+                try:
+                    total += float(val)
+                except (ValueError, TypeError):
+                    continue
+
+            cell = sheet[start_row, col_idx]
+            cell.set_value(total)
+            print(f"Set sum for {item}: {total}")
+        else:
            for i, key in enumerate(all_labels):
                row_idx = start_row + i
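
The tolerant summation used by `--sum` can be tried outside the spreadsheet code. This is a minimal sketch; `total_score` is a hypothetical helper name and the score dictionary is invented for illustration.

```python
def total_score(scores_data):
    """Sum every value that can be read as a number; skip the rest."""
    total = 0.0
    for val in scores_data.values():
        try:
            total += float(val)
        except (ValueError, TypeError):
            # Non-numeric strings raise ValueError, None raises TypeError.
            continue
    return total


# Hypothetical per-label scores for one student.
scores = {"Ex 1": 2, "Ex 2": "1.5", "Ex 3": None, "Ex 4": "abs"}
print(total_score(scores))
```

Silently skipping unreadable values keeps the run going, at the price of a total that may hide a missing score.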

--- a/utils.py
+++ b/utils.py
@ -14,7 +14,7 @@ def enonce_total(base_dir):
    if not text_dir.is_dir():
        return ""

-    files = [f for f in text_dir.iterdir() if f.is_file()]
+    files = [f for f in text_dir.iterdir() if f.is_file() and f.suffix not in [".pdf", ".tex"]]
    files.sort(key=lambda f: natural_key(f.name))

    output = []