Script.org file and batching utilities

2026-04-25 09:18:32 +02:00 · 2026-04-25 09:18:32 +02:00 · 4a77f0905b
parent d7a8e03d2c
commit 4a77f0905b
4 changed files with 366 additions and 0 deletions
--- a/Script.org
+++ b/Script.org
@ -0,0 +1,136 @@
 #+title:  Script
 #+author: Sébastien Miquel
 #+date:   14-03-2026
 # Time-stamp: <22-04-26 10:54>
 #+OPTIONS:
 NEEDS :
 - `export GEMINI_API_KEY=…`
 - fichier `names` in the directory.
 * Prétraitement
 1. Rotate every single page 180
    `./rotate_all.sh Interro14`
 2. `./rename_to_copie.sh Interro14`
 3. `python page_splitter.py Interro`.
    Fix issues with `python page_splitter.py Interro14/Copie01.pdf`
 4. `python cutleft.py Interro`
    Rerun on a single file with `python cutleft.py Interro/Copie01.pdf`
 5. `python enonce_info.py Interro`
 * Labelisation et regroupement
 Set proxy with `export HTTPS_PROXY="http://192.168.241.1:3128"`
 7. `python gemini_for_labels.py Interro`, avec éventuellement `--overwrite`
 8. Vérification visuelle : `python plotting.py Interro`
    `python plotting.py InterroTest/Copie01.pdf`
    It also generates les `Copie01.json`, à partir des `Copie01_01.json`
    In case of issue, you may need to
    - Reorder the pdf
    - Run `python cutleft.py Interro/Copie`
    - Run `python gemini_dir_batching.py Interro/Copie`
 9. `python splitting_int.py Interro`
 10. `python grouping.py Interro`
 * Correction et annotation
  Success! Batch Job Name: batches/0hc83m3anayrs5iljygg2v6ozsvxz58k6a5r
 Processing batch_requests_pro.jsonl for model gemini-3.1-pro-preview...
  Uploading file...
  Uploaded successfully! File ID: files/oasj4aty5kco
  Starting batch job...
  Success! Batch Job Name: batches/8pk4m2snr17n31pun3vwzn646qvtkj8ao192
 Set proxy with `export HTTPS_PROXY="http://10.0.0.1:3128"`
 1. Il faut créer des persp, pour indication de comment corriger, et
    relancer `enonce_info.py`
 2. `python correction.py Interro --limit 240` OU
    `python correction.py Interro/Ex\ 2/Group_1.jpg` OU
    `python correction.py Interro --overwrite`
    Will it resume ? It seems so. Best to wait a bit.
    To batch it  :
    + `python correction.py Interro --batch`
    + `python submit_batches.py Interro`
    + `python batch_status.py`
    + `python fetch_batched_results.py DS08VB`
    + `python correction.py DS08VB --deal-with-batched`
 3. Try `python post-correction.py Interro` ; It makes a
    `fixed_correction.json`, to check.
 4. Facultatif : `python annotating.py Interro` dans `Anot`, pass `--overwrite`
 5.
    + `python annotating_with_checks.py Interro` dans `Bnot`, pass `--overwrite`
    OU
    + `python annotating_by_label.py Interro` dans `BGnot`
 _Needs_ : label_groups file. (made automatically by this function)
 6. `python to_tablette.py Interro`
    Cela déplace les groupes dans `SyncCopies/À Annoter`.
    - Les mettre dans le dossier racine de la tablette, et renommer en `aaa`.
    - Vider `Syncthing/Annotées` sur la tablette et localement.
      À automatiser, aussi c'est lent…
 * Lecture de la correction manuelle
 16.  Manually : delete `~/SyncCopies/Annotées`, copy from the tablette to here.
     Then `python from_tablette.py Interro`
 17.
     + `python reading_annotations.py Interro`
     OU
     + `python reading_grouped_annotations.py Interro`
 18. `python giving_names.py InterroTest BGnot`
     It will make `A Rendre` with symlink to the Concat.jpg file
     either in Anot or Bnot, and score.json.
     + In case of Unknown : rename both directory and file inside.
     + Here, you can change `score.json` manually.
 19.
     + `gestion_classe ne` pour créer l'interro puis
     + `gestion_classe we` (set barème here)
     + `python update_ods.py Interro`
     + `gestion_classe re`
     + `gestion_classe wsent`
     + `python add_final_score.py Interro21`
       (this makes files in `Server/copies`)
 20.
     + Deploy `miqmacs-copies-assets`, and
     + update the copies from `miqmacs.fr/admin`.
 * Recorrection d'une seule copie
 !! Attention, refaire ne marchera pas si tu fais une annotation non
 groupée into refaire !!
 1. Redécoupage
    + `python plotting.py InterroTest/Copie01.pdf`
    + `python splitting_int.py InterroTest/Copie20.pdf`
 2. Créer `refaire.json`, avec un contenu comme
    [["Copie01", []],
    ["Copie01", ["Ex 1 : 1)"]]]
 3. Appeler `correction` avec --refaire. Il doit créer des groupes
    individuels, faire des requêtes, et remplacer les corrections
    précédentes (à sauver ailleurs).
    Ou non, si tu veux le faire à la main.
 4. ?? Si je fais refaire, avant d'avoir créer les annotating with
    checks, que se passe-t-il ???
 5. Appeler `annotating_with_checks.py --refaire --overwrite` avec --refaire.
 6. `python to_tablette.py --refaire Interro24`
 6. `python from_tablette.py --refaire Interro24`
 7. `python reading_grouped_annotations.py --refaire Interro24`
 * Install
 #+BEGIN_SRC bash
 pip install highlight-text --break-system-packages
 #+END_SRC
--- a/batch_status.py
+++ b/batch_status.py
@ -0,0 +1,88 @@
 import os
 import sys
 import argparse
 from google import genai
 if "GEMINI_API_KEY" not in os.environ:
    sys.exit("Error: GEMINI_API_KEY environment variable not set.")
 client = genai.Client()
 def list_jobs():
    print("Fetching recent batch jobs...\n")
    try:
        batch_jobs = client.batches.list()
        jobs_found = False
        for job in batch_jobs:
            jobs_found = True
            state = job.state.name if hasattr(job.state, 'name') else job.state
            print("-" * 60)
            print(f"Job Name:      {job.name}")
            if hasattr(job, 'display_name') and job.display_name:
                print(f"Display Name:  {job.display_name}")
            print(f"State:         {state}")
            if state == 'JOB_STATE_FAILED' and hasattr(job, 'error'):
                print(f"Error:         {job.error}")
            if state == 'JOB_STATE_SUCCEEDED' and hasattr(job, 'dest') and job.dest:
                if hasattr(job.dest, 'file_name') and job.dest.file_name:
                    print(f"Output File:   {job.dest.file_name}")
        if not jobs_found:
            print("No batch jobs found.")
        else:
            print("-" * 60)
            print("\nTo download a completed job, run:")
            print("python batch_status.py --download batches/<YOUR_BATCH_ID>")
    except Exception as e:
        sys.exit(f"An error occurred while listing jobs: {e}")
 def download_job(job_name):
    print(f"Checking status for {job_name}...\n")
    try:
        job = client.batches.get(name=job_name)
        state = job.state.name if hasattr(job.state, 'name') else job.state
        print(f"State: {state}")
        if state != 'JOB_STATE_SUCCEEDED':
            print("Job is not ready yet or has failed.")
            if state == 'JOB_STATE_FAILED' and hasattr(job, 'error'):
                print(f"Error: {job.error}")
            return
        if hasattr(job, 'dest') and job.dest and hasattr(job.dest, 'file_name') and job.dest.file_name:
            result_file_name = job.dest.file_name
            print(f"Downloading results from {result_file_name}...")
            file_content_bytes = client.files.download(file=result_file_name)
            output_path = f"results_{job_name.replace('/', '_')}.jsonl"
            with open(output_path, "wb") as f:
                f.write(file_content_bytes)
            print(f"Success! Saved to {output_path}")
            print(f"You can now feed this to your correction script using: --deal-with-batched {output_path}")
        else:
            print("Job succeeded but no output file was found.")
    except Exception as e:
        sys.exit(f"An error occurred while fetching the job: {e}")
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Manage Gemini Batch Jobs")
    parser.add_argument("--download", type=str, metavar="JOB_NAME",
                        help="Download the results for a specific batch job (e.g. batches/123456)")
    args = parser.parse_args()
    if args.download:
        download_job(args.download)
    else:
        list_jobs()
--- a/fetch_batched_results.py
+++ b/fetch_batched_results.py
@ -0,0 +1,63 @@
 import os
 import sys
 import argparse
 from pathlib import Path
 from google import genai
 def main():
    parser = argparse.ArgumentParser(description="Download and combine completed batch jobs for a directory.")
    parser.add_argument("root_dir", type=str, help="Directory containing the original batches")
    args = parser.parse_args()
    target_dir = Path(args.root_dir)
    dir_name = target_dir.name
    output_path = target_dir / "batched_correction_result.jsonl"
    if "GEMINI_API_KEY" not in os.environ:
        sys.exit("Error: GEMINI_API_KEY environment variable not set.")
    client = genai.Client()
    print(f"Fetching jobs matching '{dir_name}'...")
    all_jobs = client.batches.list()
    matching_jobs = []
    # 1. Find jobs associated with this directory
    for job in all_jobs:
        if hasattr(job, 'display_name') and job.display_name and dir_name in job.display_name:
            matching_jobs.append(job)
    if not matching_jobs:
        sys.exit(f"No batch jobs found containing '{dir_name}' in their display name.")
    # 2. Check that all matching jobs are complete
    for job in matching_jobs:
        state = job.state.name if hasattr(job.state, 'name') else job.state
        print(f"Found Job: {job.display_name} | State: {state}")
        if state != 'JOB_STATE_SUCCEEDED':
            sys.exit(f"Error: Job '{job.display_name}' has not succeeded yet. Try again later.")
    # 3. Download and concatenate
    print("\nAll jobs succeeded. Downloading results...")
    combined_data = b""
    for job in matching_jobs:
        if hasattr(job, 'dest') and job.dest and hasattr(job.dest, 'file_name') and job.dest.file_name:
            print(f"Downloading output for {job.display_name}...")
            file_content_bytes = client.files.download(file=job.dest.file_name)
            combined_data += file_content_bytes
            # Ensure proper line separation between files in JSONL
            if combined_data and not combined_data.endswith(b'\n'):
                combined_data += b'\n'
        else:
            print(f"Warning: Job {job.display_name} succeeded but has no output file.")
    # 4. Save to destination
    with open(output_path, "wb") as f:
        f.write(combined_data)
    print(f"\nSuccess! All results concatenated and saved to:\n{output_path}")
 if __name__ == "__main__":
    main()
--- a/submit_batches.py
+++ b/submit_batches.py
@ -0,0 +1,79 @@
 import os
 import sys
 import argparse
 from pathlib import Path
 from google import genai
 from google.genai import types
 def main():
    parser = argparse.ArgumentParser(description="Upload JSONL files and create Gemini Batch jobs.")
    parser.add_argument("root_dir", type=str, help="Root directory containing the batch JSONL files")
    args = parser.parse_args()
    root_dir = Path(args.root_dir)
    if "GEMINI_API_KEY" not in os.environ:
        sys.exit("Error: GEMINI_API_KEY environment variable not set.")
    client = genai.Client()
    # Define the batch files and their corresponding models
    batches_to_create = [
        {
            "file_path": root_dir / "batch_requests_flash.jsonl",
            "model_id": "gemini-3-flash-preview",
            "display_name": f"flash-correction-{root_dir.name}"
        },
        {
            "file_path": root_dir / "batch_requests_pro.jsonl",
            "model_id": "gemini-3.1-pro-preview",
            "display_name": f"pro-correction-{root_dir.name}"
        }
    ]
    for batch in batches_to_create:
        file_path = batch["file_path"]
        model_id = batch["model_id"]
        display_name = batch["display_name"]
        # Check if the file exists
        if not file_path.exists():
            print(f"Skipping {model_id}: {file_path.name} does not exist.")
            continue
        # Check if the file is empty (e.g., if all tasks went to Flash, Pro might be empty)
        if file_path.stat().st_size == 0:
            print(f"Skipping {model_id}: {file_path.name} is empty.")
            continue
        print(f"Processing {file_path.name} for model {model_id}...")
        # 1. Upload the file to the File API
        print(f"  Uploading file...")
        uploaded_file = client.files.upload(
            file=str(file_path),
            config=types.UploadFileConfig(
                display_name=f"{display_name}-input",
                mime_type='jsonl'
            )
        )
        print(f"  Uploaded successfully! File ID: {uploaded_file.name}")
        # 2. Create the batch job
        print(f"  Starting batch job...")
        batch_job = client.batches.create(
            model=model_id,
            src=uploaded_file.name,
            config={
                'display_name': display_name,
            },
        )
        print(f"  Success! Batch Job Name: {batch_job.name}\n")
    print("-" * 50)
    print("All batch jobs have been initiated.")
    print("Save the Batch Job Names above. You can monitor them with:")
    print("  client.batches.get(name='YOUR_BATCH_JOB_NAME')")
 if __name__ == "__main__":
    main()