Script.org file and batching utilities

master
Sébastien Miquel 2026-04-25 09:18:32 +02:00
parent d7a8e03d2c
commit 4a77f0905b
4 changed files with 366 additions and 0 deletions

136
Script.org Normal file
View File

@ -0,0 +1,136 @@
#+title: Script
#+author: Sébastien Miquel
#+date: 14-03-2026
# Time-stamp: <22-04-26 10:54>
#+OPTIONS:
NEEDS :
- `export GEMINI_API_KEY=…`
- fichier `names` in the directory.
* Prétraitement
1. Rotate every single page 180
`./rotate_all.sh Interro14`
2. `./rename_to_copie.sh Interro14`
3. `python page_splitter.py Interro`.
Fix issues with `python page_splitter.py Interro14/Copie01.pdf`
4. `python cutleft.py Interro`
Rerun on a single file with `python cutleft.py Interro/Copie01.pdf`
5. `python enonce_info.py Interro`
* Labelisation et regroupement
Set proxy with `export HTTPS_PROXY="http://192.168.241.1:3128"`
7. `python gemini_for_labels.py Interro`, avec éventuellement `--overwrite`
8. Vérification visuelle : `python plotting.py Interro`
`python plotting.py InterroTest/Copie01.pdf`
It also generates les `Copie01.json`, à partir des `Copie01_01.json`
In case of issue, you may need to
- Reorder the pdf
- Run `python cutleft.py Interro/Copie`
- Run `python gemini_dir_batching.py Interro/Copie`
9. `python splitting_int.py Interro`
10. `python grouping.py Interro`
* Correction et annotation
Success! Batch Job Name: batches/0hc83m3anayrs5iljygg2v6ozsvxz58k6a5r
Processing batch_requests_pro.jsonl for model gemini-3.1-pro-preview...
Uploading file...
Uploaded successfully! File ID: files/oasj4aty5kco
Starting batch job...
Success! Batch Job Name: batches/8pk4m2snr17n31pun3vwzn646qvtkj8ao192
Set proxy with `export HTTPS_PROXY="http://10.0.0.1:3128"`
1. Il faut créer des persp, pour indication de comment corriger, et
relancer `enonce_info.py`
2. `python correction.py Interro --limit 240` OU
`python correction.py Interro/Ex\ 2/Group_1.jpg` OU
`python correction.py Interro --overwrite`
Will it resume ? It seems so. Best to wait a bit.
To batch it :
+ `python correction.py Interro --batch`
+ `python submit_batches.py Interro`
+ `python batch_status.py`
+ `python fetch_batched_results.py DS08VB`
+ `python correction.py DS08VB --deal-with-batched`
3. Try `python post-correction.py Interro` ; It makes a
`fixed_correction.json`, to check.
4. Facultatif : `python annotating.py Interro` dans `Anot`, pass `--overwrite`
5.
+ `python annotating_with_checks.py Interro` dans `Bnot`, pass `--overwrite`
OU
+ `python annotating_by_label.py Interro` dans `BGnot`
_Needs_ : label_groups file. (made automatically by this function)
6. `python to_tablette.py Interro`
Cela déplace les groupes dans `SyncCopies/À Annoter`.
- Les mettre dans le dossier racine de la tablette, et renommer en `aaa`.
- Vider `Syncthing/Annotées` sur la tablette et localement.
À automatiser, aussi c'est lent…
* Lecture de la correction manuelle
16. Manually : delete `~/SyncCopies/Annotées`, copy from the tablette to here.
Then `python from_tablette.py Interro`
17.
+ `python reading_annotations.py Interro`
OU
+ `python reading_grouped_annotations.py Interro`
18. `python giving_names.py InterroTest BGnot`
It will make `A Rendre` with symlink to the Concat.jpg file
either in Anot or Bnot, and score.json.
+ In case of Unknown : rename both directory and file inside.
+ Here, you can change `score.json` manually.
19.
+ `gestion_classe ne` pour créer l'interro puis
+ `gestion_classe we` (set barème here)
+ `python update_ods.py Interro`
+ `gestion_classe re`
+ `gestion_classe wsent`
+ `python add_final_score.py Interro21`
(this makes files in `Server/copies`)
20.
+ Deploy `miqmacs-copies-assets`, and
+ update the copies from `miqmacs.fr/admin`.
* Recorrection d'une seule copie
!! Attention, refaire ne marchera pas si tu fais une annotation non
groupée into refaire !!
1. Redécoupage
+ `python plotting.py InterroTest/Copie01.pdf`
+ `python splitting_int.py InterroTest/Copie20.pdf`
2. Créer `refaire.json`, avec un contenu comme
[["Copie01", []],
["Copie01", ["Ex 1 : 1)"]]]
3. Appeler `correction` avec --refaire. Il doit créer des groupes
individuels, faire des requêtes, et remplacer les corrections
précédentes (à sauver ailleurs).
Ou non, si tu veux le faire à la main.
4. ?? Si je fais refaire, avant d'avoir créer les annotating with
checks, que se passe-t-il ???
5. Appeler `annotating_with_checks.py --refaire --overwrite` avec --refaire.
6. `python to_tablette.py --refaire Interro24`
6. `python from_tablette.py --refaire Interro24`
7. `python reading_grouped_annotations.py --refaire Interro24`
* Install
#+BEGIN_SRC bash
pip install highlight-text --break-system-packages
#+END_SRC

88
batch_status.py Normal file
View File

@ -0,0 +1,88 @@
import os
import sys
import argparse
from google import genai
if "GEMINI_API_KEY" not in os.environ:
sys.exit("Error: GEMINI_API_KEY environment variable not set.")
client = genai.Client()
def list_jobs():
print("Fetching recent batch jobs...\n")
try:
batch_jobs = client.batches.list()
jobs_found = False
for job in batch_jobs:
jobs_found = True
state = job.state.name if hasattr(job.state, 'name') else job.state
print("-" * 60)
print(f"Job Name: {job.name}")
if hasattr(job, 'display_name') and job.display_name:
print(f"Display Name: {job.display_name}")
print(f"State: {state}")
if state == 'JOB_STATE_FAILED' and hasattr(job, 'error'):
print(f"Error: {job.error}")
if state == 'JOB_STATE_SUCCEEDED' and hasattr(job, 'dest') and job.dest:
if hasattr(job.dest, 'file_name') and job.dest.file_name:
print(f"Output File: {job.dest.file_name}")
if not jobs_found:
print("No batch jobs found.")
else:
print("-" * 60)
print("\nTo download a completed job, run:")
print("python batch_status.py --download batches/<YOUR_BATCH_ID>")
except Exception as e:
sys.exit(f"An error occurred while listing jobs: {e}")
def download_job(job_name):
print(f"Checking status for {job_name}...\n")
try:
job = client.batches.get(name=job_name)
state = job.state.name if hasattr(job.state, 'name') else job.state
print(f"State: {state}")
if state != 'JOB_STATE_SUCCEEDED':
print("Job is not ready yet or has failed.")
if state == 'JOB_STATE_FAILED' and hasattr(job, 'error'):
print(f"Error: {job.error}")
return
if hasattr(job, 'dest') and job.dest and hasattr(job.dest, 'file_name') and job.dest.file_name:
result_file_name = job.dest.file_name
print(f"Downloading results from {result_file_name}...")
file_content_bytes = client.files.download(file=result_file_name)
output_path = f"results_{job_name.replace('/', '_')}.jsonl"
with open(output_path, "wb") as f:
f.write(file_content_bytes)
print(f"Success! Saved to {output_path}")
print(f"You can now feed this to your correction script using: --deal-with-batched {output_path}")
else:
print("Job succeeded but no output file was found.")
except Exception as e:
sys.exit(f"An error occurred while fetching the job: {e}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Manage Gemini Batch Jobs")
parser.add_argument("--download", type=str, metavar="JOB_NAME",
help="Download the results for a specific batch job (e.g. batches/123456)")
args = parser.parse_args()
if args.download:
download_job(args.download)
else:
list_jobs()

63
fetch_batched_results.py Normal file
View File

@ -0,0 +1,63 @@
import os
import sys
import argparse
from pathlib import Path
from google import genai
def main():
parser = argparse.ArgumentParser(description="Download and combine completed batch jobs for a directory.")
parser.add_argument("root_dir", type=str, help="Directory containing the original batches")
args = parser.parse_args()
target_dir = Path(args.root_dir)
dir_name = target_dir.name
output_path = target_dir / "batched_correction_result.jsonl"
if "GEMINI_API_KEY" not in os.environ:
sys.exit("Error: GEMINI_API_KEY environment variable not set.")
client = genai.Client()
print(f"Fetching jobs matching '{dir_name}'...")
all_jobs = client.batches.list()
matching_jobs = []
# 1. Find jobs associated with this directory
for job in all_jobs:
if hasattr(job, 'display_name') and job.display_name and dir_name in job.display_name:
matching_jobs.append(job)
if not matching_jobs:
sys.exit(f"No batch jobs found containing '{dir_name}' in their display name.")
# 2. Check that all matching jobs are complete
for job in matching_jobs:
state = job.state.name if hasattr(job.state, 'name') else job.state
print(f"Found Job: {job.display_name} | State: {state}")
if state != 'JOB_STATE_SUCCEEDED':
sys.exit(f"Error: Job '{job.display_name}' has not succeeded yet. Try again later.")
# 3. Download and concatenate
print("\nAll jobs succeeded. Downloading results...")
combined_data = b""
for job in matching_jobs:
if hasattr(job, 'dest') and job.dest and hasattr(job.dest, 'file_name') and job.dest.file_name:
print(f"Downloading output for {job.display_name}...")
file_content_bytes = client.files.download(file=job.dest.file_name)
combined_data += file_content_bytes
# Ensure proper line separation between files in JSONL
if combined_data and not combined_data.endswith(b'\n'):
combined_data += b'\n'
else:
print(f"Warning: Job {job.display_name} succeeded but has no output file.")
# 4. Save to destination
with open(output_path, "wb") as f:
f.write(combined_data)
print(f"\nSuccess! All results concatenated and saved to:\n{output_path}")
if __name__ == "__main__":
main()

79
submit_batches.py Normal file
View File

@ -0,0 +1,79 @@
import os
import sys
import argparse
from pathlib import Path
from google import genai
from google.genai import types
def main():
parser = argparse.ArgumentParser(description="Upload JSONL files and create Gemini Batch jobs.")
parser.add_argument("root_dir", type=str, help="Root directory containing the batch JSONL files")
args = parser.parse_args()
root_dir = Path(args.root_dir)
if "GEMINI_API_KEY" not in os.environ:
sys.exit("Error: GEMINI_API_KEY environment variable not set.")
client = genai.Client()
# Define the batch files and their corresponding models
batches_to_create = [
{
"file_path": root_dir / "batch_requests_flash.jsonl",
"model_id": "gemini-3-flash-preview",
"display_name": f"flash-correction-{root_dir.name}"
},
{
"file_path": root_dir / "batch_requests_pro.jsonl",
"model_id": "gemini-3.1-pro-preview",
"display_name": f"pro-correction-{root_dir.name}"
}
]
for batch in batches_to_create:
file_path = batch["file_path"]
model_id = batch["model_id"]
display_name = batch["display_name"]
# Check if the file exists
if not file_path.exists():
print(f"Skipping {model_id}: {file_path.name} does not exist.")
continue
# Check if the file is empty (e.g., if all tasks went to Flash, Pro might be empty)
if file_path.stat().st_size == 0:
print(f"Skipping {model_id}: {file_path.name} is empty.")
continue
print(f"Processing {file_path.name} for model {model_id}...")
# 1. Upload the file to the File API
print(f" Uploading file...")
uploaded_file = client.files.upload(
file=str(file_path),
config=types.UploadFileConfig(
display_name=f"{display_name}-input",
mime_type='jsonl'
)
)
print(f" Uploaded successfully! File ID: {uploaded_file.name}")
# 2. Create the batch job
print(f" Starting batch job...")
batch_job = client.batches.create(
model=model_id,
src=uploaded_file.name,
config={
'display_name': display_name,
},
)
print(f" Success! Batch Job Name: {batch_job.name}\n")
print("-" * 50)
print("All batch jobs have been initiated.")
print("Save the Batch Job Names above. You can monitor them with:")
print(" client.batches.get(name='YOUR_BATCH_JOB_NAME')")
if __name__ == "__main__":
main()