Garbage commit; added junk code

This commit is contained in:
HugeFrog24
2026-02-20 18:07:02 +01:00
commit 80444405e9
15 changed files with 34271 additions and 0 deletions

.env.example

@@ -0,0 +1,11 @@
# Copy your wordpress_logged_in_... cookie from browser DevTools → Storage → Cookies.
# Paste the full name=value pair below.
# wordpress_sec_... is the wp-admin cookie — irrelevant for read-only viewers.
# __cf_bm is a Cloudflare bot-management cookie — also not needed.
WP_LOGIN_COOKIE=wordpress_logged_in_<hash>=<value>
# PeerTube upload target
PEERTUBE_URL=https://your-peertube-instance.example
PEERTUBE_USER=admin
PEERTUBE_CHANNEL=your-channel-name
PEERTUBE_PASSWORD=your-password


@@ -0,0 +1,14 @@
---
alwaysApply: true
always_on: true
trigger: always_on
applyTo: "**"
description: Snyk Security At Inception
---
# Project security best practices
- Always run the snyk_code_scan tool for new first-party code generated in a Snyk-supported language.
- If any security issues are found based on newly introduced or modified code or dependencies, attempt to fix the issues using the results context from Snyk.
- Rescan the code after fixing the issues to ensure that the issues were fixed and that there are no newly introduced issues.
- Repeat this process until no new issues are found.

.gitignore

@@ -0,0 +1,5 @@
__pycache__/
downloads/
*.mp4
*.mp4.part
.env

.vscode/settings.json

@@ -0,0 +1,4 @@
{
"snyk.advanced.organization": "512ef4a1-6034-4537-a391-9692d282122a",
"snyk.advanced.autoSelectOrganization": true
}

README.md

@@ -0,0 +1,142 @@
# 𝒥𝒶𝒾𝓁𝒷𝒾𝓇𝒹𝓏-𝒹𝓁
Jailbirdz.com is an Arizona-based subscription video site publishing arrest and jail roleplay scenarios featuring women. This tool scrapes the member area, downloads the videos, and re-hosts them on a self-owned PeerTube instance.
> [!NOTE]
> This tool does not bypass authentication, modify the site, or intercept anything it isn't entitled to. A valid, paid membership is required. The scraper authenticates using your own session cookie and accesses only content your account can already view in a browser.
>
> Downloading content for private, personal use is permitted in many jurisdictions under private copy provisions (e.g., § 53 UrhG in Germany). You are responsible for determining whether this applies in yours.
## Requirements
- Python 3.10+
- `pip install -r requirements.txt`
- `playwright install firefox`
## Setup
```bash
cp .env.example .env
```
### WP_LOGIN_COOKIE
You need to be logged into jailbirdz.com in a browser. Then either:
**Option A — auto (recommended):** let `grab_cookie.py` read it from your browser and write it to `.env` automatically:
```bash
python grab_cookie.py # tries Firefox, Chrome, Edge, Brave in order
python grab_cookie.py -b firefox # or target a specific browser
```
> **Note:** On Windows, Chrome and Edge (version 130+) require the script to run as Administrator due to App-bound Encryption. Firefox works without elevated privileges.
**Option B — manual:** open `.env` and set `WP_LOGIN_COOKIE` yourself. Get the value from browser DevTools → Storage → Cookies while on jailbirdz.com — copy the full `name=value` of the `wordpress_logged_in_*` cookie.
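Either way, the value is a single `name=value` pair. A condensed sketch of the validation that `_get_login_cookie` in `main.py` applies (the function name `parse_login_cookie` here is illustrative):

```python
# Condensed sketch of the WP_LOGIN_COOKIE check from main.py's
# _get_login_cookie; parse_login_cookie is an illustrative name.
def parse_login_cookie(raw):
    raw = raw.strip()  # tolerate accidental whitespace from copy-paste
    name, _, value = raw.partition("=")
    if not value:
        raise ValueError("WP_LOGIN_COOKIE looks malformed. Expected: name=value")
    if not name.startswith("wordpress_logged_in_"):
        raise ValueError("Expected a wordpress_logged_in_... cookie")
    return name, value
```

Because `partition` splits on the first `=` only, cookie values that themselves contain `=` are preserved intact.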
### Other `.env` values
- `PEERTUBE_URL` — base URL of your PeerTube instance.
- `PEERTUBE_USER` — PeerTube username.
- `PEERTUBE_CHANNEL` — channel to upload to.
- `PEERTUBE_PASSWORD` — PeerTube password.
## Workflow
### 1. Scrape
Discovers all post URLs via the WordPress REST API, then visits each page with a headless Firefox browser to intercept video network requests (MP4, MOV, WebM, AVI, M4V).
```bash
python main.py
```
Results are written to `video_map.json`. Safe to re-run — already-scraped posts are skipped.
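The interception step keeps only responses whose URL *path* ends in a known video extension, so signed query strings don't interfere. A minimal sketch mirroring `_is_video_url` in `main.py`:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Mirrors _is_video_url in main.py: case-insensitive extension check on
# the URL path only; the query string is ignored entirely.
VIDEO_EXTS = {".mp4", ".mov", ".m4v", ".webm", ".avi"}

def is_video_url(url):
    return PurePosixPath(urlparse(url).path).suffix.lower() in VIDEO_EXTS
```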
### 2. Download
```bash
python download.py [options]
Options:
-o, --output DIR Download directory (default: downloads)
-t, --titles Name files by post title
--original Name files by original CloudFront filename (default)
--reorganize Rename existing files to match current naming mode
-w, --workers N Concurrent downloads (default: 4)
-n, --dry-run Print what would be downloaded
```
Resumes partial downloads. The chosen naming mode is saved to `.naming_mode` inside the output directory and persists across runs. Filenames that would clash are placed into subfolders.
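The clash handling works roughly as follows (a condensed sketch of `build_download_paths` in `check_clashes.py`; `plan_paths` is an illustrative name):

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Condensed sketch of build_download_paths in check_clashes.py.
# Filenames are grouped case-insensitively; a name claimed by more than
# one source URL gets a subfolder named after the URL's parent segment.
def plan_paths(urls, output_dir="downloads"):
    by_lower = defaultdict(list)
    for url in urls:
        name = unquote(PurePosixPath(urlparse(url).path).name)
        by_lower[name.lower()].append(url)
    paths = {}
    for url in urls:
        name = unquote(PurePosixPath(urlparse(url).path).name)
        if len(by_lower[name.lower()]) > 1:
            parent = unquote(urlparse(url).path.rstrip("/").split("/")[-2])
            paths[url] = f"{output_dir}/{parent}/{name}"
        else:
            paths[url] = f"{output_dir}/{name}"
    return paths
```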
### 3. Upload
```bash
python upload.py [options]
Options:
-i, --input DIR MP4 source directory (default: downloads)
--url URL PeerTube instance URL (or set PEERTUBE_URL)
-U, --username NAME PeerTube username (or set PEERTUBE_USER)
-p, --password SECRET PeerTube password (or set PEERTUBE_PASSWORD)
-C, --channel NAME Channel to upload to (or set PEERTUBE_CHANNEL)
-b, --batch-size N Videos to upload before waiting for transcoding (default: 1)
--poll-interval SECS State poll interval in seconds (default: 30)
--skip-wait Upload without waiting for transcoding
--nsfw Mark videos as NSFW
-n, --dry-run Print what would be uploaded
```
Uploads in resumable 10 MB chunks. After each batch, waits for transcoding and object storage to complete before uploading the next batch — this prevents disk exhaustion on the PeerTube server. Videos already present on the channel (matched by name) are skipped. Progress is tracked in `.uploaded` inside the input directory.
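The chunk arithmetic behind the resumable upload can be sketched as below; the actual PeerTube request headers are omitted, this only shows the inclusive byte ranges each 10 MB chunk covers:

```python
# Chunk arithmetic only; a real upload would add the appropriate
# per-chunk request headers. 10 MB matches the chunk size above.
CHUNK_SIZE = 10 * 1024 * 1024

def chunk_ranges(total_size, chunk=CHUNK_SIZE):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk, total_size) - 1
        yield start, end
        start = end + 1
```

The final chunk is simply shorter when the file size is not a multiple of the chunk size.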
## Utilities
### Check for filename clashes
```bash
python check_clashes.py
```
Lists filenames that map to more than one source URL, with sizes.
### Estimate total download size
```bash
python total_size.py
```
Fetches `Content-Length` for every video URL in `video_map.json` and prints a size summary. Does not download anything.
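Size probing follows the same two-step fallback as `get_remote_size` in `check_clashes.py`: trust `Content-Length` from a HEAD response, else read the total from the `Content-Range` header of a one-byte ranged GET. A header-only sketch (no network, illustrative name):

```python
# Header-only sketch of the fallback used by get_remote_size in
# check_clashes.py; size_from_headers is an illustrative name.
def size_from_headers(headers):
    if "Content-Length" in headers:
        return int(headers["Content-Length"])
    # Fallback: "Content-Range: bytes 0-0/<total>" from a ranged GET.
    cr = headers.get("Content-Range", "")
    if "/" in cr:
        total = cr.split("/")[-1]
        if total.isdigit():
            return int(total)
    return None
```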
## Data files
| File | Location | Description |
| ---------------- | ---------------- | --------------------------------------------------------------------- |
| `video_map.json` | project root | Scraped post URLs mapped to titles, descriptions, and video URLs |
| `.naming_mode` | output directory | Saved filename mode (`original` or `title`) |
| `.uploaded` | input directory | Newline-delimited list of relative paths already uploaded to PeerTube |
## FAQ
**Is this necessary?**
Yes, obviously.
**Is this project exactly what it looks like?**
Also yes.
**Why go to all this trouble?**
Middle school girls bullied me so hard I decided if you're going to be the weird kid anyway, you might as well commit to the bit and build highly specific pipelines for highly specific content.
Now it's their turn to get booked.
Checkmate, society.
No apologies.
**Why not just download everything manually?**
Dude.
Bondage fantasy.
Not pain play.
Huge difference.
1,300 clicks = torture.
Know your genres.
---
This is the most normal thing I've scripted this month.

check_clashes.py

@@ -0,0 +1,159 @@
"""Filename clash detection and shared URL utilities.
Importable functions:
url_to_filename(url) - extract clean filename from a URL
find_clashes(urls) - {filename: [urls]} for filenames with >1 source
build_download_paths(urls, output_dir) - {url: local_path} with clash resolution
fmt_size(bytes) - human-readable size string
get_remote_size(session, url) - file size via HEAD without downloading
fetch_sizes(urls, workers, on_progress) - bulk size lookup
make_session() - requests.Session with required headers
load_video_map() - load video_map.json, returns {} on missing/corrupt
"""
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse, unquote
import json
import requests
from config import BASE_URL
REFERER = f"{BASE_URL}/"
VIDEO_MAP_FILE = "video_map.json"
VIDEO_EXTS = {".mp4", ".mov", ".m4v", ".webm", ".avi"}
def load_video_map():
if Path(VIDEO_MAP_FILE).exists():
try:
with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
return json.load(f)
except (json.JSONDecodeError, OSError):
return {}
return {}
def make_session():
s = requests.Session()
s.headers.update({"Referer": REFERER})
return s
def fmt_size(b):
for unit in ("B", "KB", "MB", "GB"):
if b < 1024:
return f"{b:.1f} {unit}"
b /= 1024
return f"{b:.1f} TB"
def url_to_filename(url):
return unquote(PurePosixPath(urlparse(url).path).name)
def find_clashes(urls):
# Case-insensitive grouping so that e.g. "DaisyArrest.mp4" and
# "daisyarrest.mp4" are treated as a clash. This is required for
# correctness on case-insensitive filesystems (NTFS, exFAT, macOS APFS/HFS+)
# and harmless on case-sensitive ones (ext4) — the actual filenames on
# disk keep their original casing; only the clash *detection* is folded.
by_lower = defaultdict(list)
for url in urls:
by_lower[url_to_filename(url).lower()].append(url)
return {url_to_filename(srcs[0]): srcs
for srcs in by_lower.values() if len(srcs) > 1}
def _clash_subfolder(url):
"""Parent path segment used as disambiguator for clashing filenames."""
parts = urlparse(url).path.rstrip("/").split("/")
return unquote(parts[-2]) if len(parts) >= 2 else "unknown"
def build_download_paths(urls, output_dir):
"""Map each URL to a local file path. Flat layout; clashing names get a subfolder."""
clashes = find_clashes(urls)
clash_lower = {name.lower() for name in clashes}
paths = {}
for url in urls:
filename = url_to_filename(url)
if filename.lower() in clash_lower:
paths[url] = Path(output_dir) / _clash_subfolder(url) / filename
else:
paths[url] = Path(output_dir) / filename
return paths
def get_remote_size(session, url):
try:
r = session.head(url, allow_redirects=True, timeout=15)
if r.status_code < 400 and "Content-Length" in r.headers:
return int(r.headers["Content-Length"])
except Exception:
pass
try:
r = session.get(
url, headers={"Range": "bytes=0-0"}, stream=True, timeout=15)
r.close()
cr = r.headers.get("Content-Range", "")
if "/" in cr:
return int(cr.split("/")[-1])
except Exception:
pass
return None
def fetch_sizes(urls, workers=20, on_progress=None):
"""Return {url: size_or_None}. on_progress(done, total) called after each URL."""
session = make_session()
sizes = {}
total = len(urls)
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = {pool.submit(get_remote_size, session, u): u for u in urls}
done = 0
for fut in as_completed(futures):
sizes[futures[fut]] = fut.result()
done += 1
if on_progress:
on_progress(done, total)
return sizes
# --------------- CLI ---------------
def main():
vm = load_video_map()
urls = [u for entry in vm.values() for u in entry.get("videos", []) if u.startswith("http")]
clashes = find_clashes(urls)
print(f"Total URLs: {len(urls)}")
by_name = defaultdict(list)
for url in urls:
by_name[url_to_filename(url)].append(url)
print(f"Unique filenames: {len(by_name)}")
if not clashes:
print("\nNo filename clashes — every filename is unique.")
return
clash_urls = [u for srcs in clashes.values() for u in srcs]
print(f"\n[+] Fetching file sizes for {len(clash_urls)} clashing URLs…")
sizes = fetch_sizes(clash_urls)
print(f"\n{len(clashes)} filename clash(es):\n")
for name, srcs in sorted(clashes.items()):
print(f" {name} ({len(srcs)} sources)")
for s in srcs:
sz = sizes.get(s)
tag = fmt_size(sz) if sz is not None else "unknown"
print(f" [{tag}] {s}")
print()
if __name__ == "__main__":
main()

config.py

@@ -0,0 +1,2 @@
BASE_URL = "https://www.jailbirdz.com"
COOKIE_DOMAIN = "jailbirdz.com" # rookiepy domain filter (no www)

download.py

@@ -0,0 +1,408 @@
"""Download videos from video_map.json with resume, integrity checks, and naming modes.
Usage:
python download.py # downloads with remembered (or default original) naming
python download.py --output /mnt/nas # custom directory
python download.py --titles # switch to title-based filenames (remembers choice)
python download.py --original # switch back to original filenames (remembers choice)
python download.py --reorganize # rename existing files to match current mode
python download.py --dry-run # preview what would happen
python download.py --workers 6 # override concurrency (default 4)
"""
import argparse
import json
from pathlib import Path
import re
import shutil
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from check_clashes import (
make_session,
fmt_size,
url_to_filename,
find_clashes,
build_download_paths,
fetch_sizes,
)
VIDEO_MAP_FILE = "video_map.json"
CHUNK_SIZE = 8 * 1024 * 1024
DEFAULT_OUTPUT = "downloads"
DEFAULT_WORKERS = 4
MODE_FILE = ".naming_mode"
MODE_ORIGINAL = "original"
MODE_TITLE = "title"
# ── Naming mode persistence ──────────────────────────────────────────
def read_mode(output_dir):
p = Path(output_dir) / MODE_FILE
if p.exists():
return p.read_text().strip()
return None
def write_mode(output_dir, mode):
Path(output_dir).mkdir(parents=True, exist_ok=True)
(Path(output_dir) / MODE_FILE).write_text(mode)
def resolve_mode(args):
"""Determine naming mode from CLI flags + saved marker. Returns mode string."""
saved = read_mode(args.output)
if args.titles and args.original:
print("[!] Cannot use --titles and --original together.")
raise SystemExit(1)
if args.titles:
return MODE_TITLE
if args.original:
return MODE_ORIGINAL
if saved:
return saved
return MODE_ORIGINAL
# ── Filename helpers ─────────────────────────────────────────────────
def sanitize_filename(title, max_len=180):
name = re.sub(r'[<>:"/\\|?*]', '', title)
name = re.sub(r'\s+', ' ', name).strip().rstrip('.')
return name[:max_len].rstrip() if len(name) > max_len else name
def build_title_paths(urls, url_to_title, output_dir):
name_to_urls = defaultdict(list)
url_to_base = {}
for url in urls:
title = url_to_title.get(url)
ext = Path(url_to_filename(url)).suffix or ".mp4"
base = sanitize_filename(title) if title else Path(url_to_filename(url)).stem
url_to_base[url] = (base, ext)
name_to_urls[base + ext].append(url)
paths = {}
for url in urls:
base, ext = url_to_base[url]
full = base + ext
if len(name_to_urls[full]) > 1:
slug = url_to_filename(url).rsplit('.', 1)[0]
paths[url] = Path(output_dir) / f"{base} [{slug}]{ext}"
else:
paths[url] = Path(output_dir) / full
return paths
def get_paths_for_mode(mode, urls, video_map, output_dir):
if mode == MODE_TITLE:
url_title = build_url_title_map(video_map)
return build_title_paths(urls, url_title, output_dir)
return build_download_paths(urls, output_dir)
# ── Reorganize ───────────────────────────────────────────────────────
def reorganize(urls, video_map, output_dir, target_mode, dry_run=False):
"""Rename existing files from one naming scheme to another."""
other_mode = MODE_TITLE if target_mode == MODE_ORIGINAL else MODE_ORIGINAL
old_paths = get_paths_for_mode(other_mode, urls, video_map, output_dir)
new_paths = get_paths_for_mode(target_mode, urls, video_map, output_dir)
moves = []
for url in urls:
old = old_paths[url]
new = new_paths[url]
if old == new:
continue
if old.exists() and not new.exists():
moves.append((old, new))
# also handle .part files
old_part = old.parent / (old.name + ".part")
new_part = new.parent / (new.name + ".part")
if old_part.exists() and not new_part.exists():
moves.append((old_part, new_part))
if not moves:
print("[✓] Nothing to reorganize — files already match the target mode.")
return
print(f"[+] {len(moves)} file(s) to rename ({other_mode} → {target_mode}):\n")
for old, new in moves:
old_rel = old.relative_to(output_dir)
new_rel = new.relative_to(output_dir)
if dry_run:
print(f"  [dry-run] {old_rel} → {new_rel}")
else:
new.parent.mkdir(parents=True, exist_ok=True)
shutil.move(old, new)
print(f"  {old_rel} → {new_rel}")
if not dry_run:
# Clean up empty directories left behind
output_path = Path(output_dir)
for old, _ in moves:
d = old.parent
while d != output_path:
try:
d.rmdir()
except OSError:
break
d = d.parent
write_mode(output_dir, target_mode)
print(f"\n[✓] Reorganized. Mode saved: {target_mode}")
else:
print(f"\n[dry-run] Would rename {len(moves)} files. No changes made.")
# ── Download ─────────────────────────────────────────────────────────
def download_one(session, url, dest, expected_size):
dest = Path(dest)
part = dest.parent / (dest.name + ".part")
dest.parent.mkdir(parents=True, exist_ok=True)
if dest.exists():
local = dest.stat().st_size
if expected_size and local == expected_size:
return "ok", 0
if expected_size and local != expected_size:
dest.unlink()
existing = part.stat().st_size if part.exists() else 0
headers = {}
if existing and expected_size and existing < expected_size:
headers["Range"] = f"bytes={existing}-"
try:
r = session.get(url, headers=headers, stream=True, timeout=60)
if r.status_code == 416:
part.rename(dest)
return "ok", 0
r.raise_for_status()
except Exception as e:
return f"error: {e}", 0
# Append only if the server honoured the Range request with 206 Partial
# Content; a 200 means it resent the whole file, so start from scratch.
mode = "ab" if headers.get("Range") and r.status_code == 206 else "wb"
if mode == "wb":
    existing = 0
written = 0
try:
with open(part, mode) as f:
for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
f.write(chunk)
written += len(chunk)
except Exception as e:
return f"error: {e}", written
final_size = existing + written
if expected_size and final_size != expected_size:
return "size_mismatch", written
part.rename(dest)
return "ok", written
# ── Data loading ─────────────────────────────────────────────────────
def load_video_map():
with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
return json.load(f)
def _is_valid_url(url):
return url.startswith(
"http") and "<" not in url and ">" not in url and " href=" not in url
def collect_urls(video_map):
urls, seen, skipped = [], set(), 0
for entry in video_map.values():
for video_url in entry.get("videos", []):
if video_url in seen:
continue
seen.add(video_url)
if _is_valid_url(video_url):
urls.append(video_url)
else:
skipped += 1
if skipped:
print(f"[!] Skipped {skipped} malformed URL(s)")
return urls
def build_url_title_map(video_map):
url_title = {}
for entry in video_map.values():
title = entry.get("title", "")
for video_url in entry.get("videos", []):
if video_url not in url_title:
url_title[video_url] = title
return url_title
# ── Main ─────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(
description="Download videos from video_map.json")
parser.add_argument("--output", "-o", default=DEFAULT_OUTPUT,
help=f"Download directory (default: {DEFAULT_OUTPUT})")
naming = parser.add_mutually_exclusive_group()
naming.add_argument("--titles", "-t", action="store_true",
help="Use title-based filenames (saved as default for this directory)")
naming.add_argument("--original", action="store_true",
help="Use original CloudFront filenames (saved as default for this directory)")
parser.add_argument("--reorganize", action="store_true",
help="Rename existing files to match the current naming mode")
parser.add_argument("--dry-run", "-n", action="store_true",
help="Preview without making changes")
parser.add_argument("--workers", "-w", type=int, default=DEFAULT_WORKERS,
help=f"Concurrent downloads (default: {DEFAULT_WORKERS})")
args = parser.parse_args()
video_map = load_video_map()
urls = collect_urls(video_map)
mode = resolve_mode(args)
saved = read_mode(args.output)
mode_changed = saved is not None and saved != mode
print(f"[+] {len(urls)} MP4 URLs from {VIDEO_MAP_FILE}")
print(f"[+] Naming mode: {mode}" + (" (changed!)" if mode_changed else ""))
# Handle reorganize
if args.reorganize or mode_changed:
if mode_changed and not args.reorganize:
print(f"\n[!] Mode changed from '{saved}' to '{mode}'.")
print(
" Use --reorganize to rename existing files, or --dry-run to preview.")
print(" Refusing to download until existing files are reorganized.")
return
reorganize(urls, video_map, args.output, mode, dry_run=args.dry_run)
if args.dry_run or args.reorganize:
return
# Save mode
if not args.dry_run:
write_mode(args.output, mode)
paths = get_paths_for_mode(mode, urls, video_map, args.output)
clashes = find_clashes(urls)
if clashes:
print(
f"[+] {len(clashes)} filename clash(es) resolved with subfolders/suffixes")
already = [u for u in urls if paths[u].exists()]
pending = [u for u in urls if not paths[u].exists()]
print(f"[+] Already downloaded: {len(already)}")
print(f"[+] To download: {len(pending)}")
if not pending:
print("\n[✓] Everything is already downloaded.")
return
if args.dry_run:
print(
f"\n[dry-run] Would download {len(pending)} files to {args.output}/")
for url in pending[:20]:
print(f"  {paths[url].name}")
if len(pending) > 20:
print(f" … and {len(pending) - 20} more")
return
print("\n[+] Fetching remote file sizes…")
session = make_session()
remote_sizes = fetch_sizes(pending, workers=20)
sized = {u: s for u, s in remote_sizes.items() if s is not None}
total_bytes = sum(sized.values())
print(
f"[+] Download size: {fmt_size(total_bytes)} across {len(pending)} files")
if already:
print(f"[+] Verifying {len(already)} existing files…")
already_sizes = fetch_sizes(already, workers=20)
mismatched = 0
for url in already:
dest = paths[url]
local = dest.stat().st_size
remote = already_sizes.get(url)
if remote and local != remote:
mismatched += 1
print(f"[!] Size mismatch: {dest.name} "
f"(local {fmt_size(local)} vs remote {fmt_size(remote)})")
pending.append(url)
remote_sizes[url] = remote
if mismatched:
print(
f"[!] {mismatched} file(s) will be re-downloaded due to size mismatch")
print(f"\n[⚡] Downloading with {args.workers} threads…\n")
completed = 0
failed = []
total_written = 0
total = len(pending)
interrupted = False
def do_download(url):
dest = paths[url]
expected = remote_sizes.get(url)
return url, download_one(session, url, dest, expected)
with ThreadPoolExecutor(max_workers=args.workers) as pool:
    futures = {pool.submit(do_download, u): u for u in pending}
    try:
        for fut in as_completed(futures):
            url, (status, written) = fut.result()
            total_written += written
            completed += 1
            name = paths[url].name
            if status == "ok" and written > 0:
                print(f"  [{completed}/{total}] ✓ {name} ({fmt_size(written)})")
            elif status == "ok":
                print(f"  [{completed}/{total}] ✓ {name} (already complete)")
            elif status == "size_mismatch":
                print(f"  [{completed}/{total}] ⚠ {name} (size mismatch)")
                failed.append(url)
            else:
                print(f"  [{completed}/{total}] ✗ {name} ({status})")
                failed.append(url)
    except KeyboardInterrupt:
        interrupted = True
        # Cancel queued futures before the context manager's implicit
        # shutdown(wait=True) runs; otherwise Ctrl-C would block until
        # the entire remaining queue had been downloaded.
        pool.shutdown(wait=False, cancel_futures=True)
        print("\n\n[⏸] Interrupted! Partial downloads saved as .part files.")
print(f"\n{'=' * 50}")
print(f" Downloaded: {fmt_size(total_written)}")
print(f" Completed: {completed}/{total}")
if failed:
print(f" Failed: {len(failed)} (re-run to retry)")
if interrupted:
print(" Paused — re-run to resume.")
elif not failed:
print(" All done!")
print(f"{'=' * 50}")
if __name__ == "__main__":
main()

grab_cookie.py

@@ -0,0 +1,114 @@
#!/usr/bin/env python3
"""
grab_cookie.py — read the WordPress login cookie from an
installed browser and write it to .env as WP_LOGIN_COOKIE=name=value.
Usage:
python grab_cookie.py # tries Firefox, Chrome, Edge, Brave
python grab_cookie.py --browser firefox # explicit browser
"""
import argparse
from pathlib import Path
from config import COOKIE_DOMAIN
ENV_FILE = Path(".env")
ENV_KEY = "WP_LOGIN_COOKIE"
COOKIE_PREFIX = "wordpress_logged_in_"
BROWSER_NAMES = ["firefox", "chrome", "edge", "brave"]
def find_cookie(browser_name):
"""Return (name, value) for the wordpress_logged_in_* cookie, or (None, None)."""
try:
import rookiepy
except ImportError:
raise ImportError("rookiepy not installed — run: pip install rookiepy")
fn = getattr(rookiepy, browser_name, None)
if fn is None:
raise ValueError(f"rookiepy does not support '{browser_name}'.")
try:
cookies = fn([COOKIE_DOMAIN])
except PermissionError:
raise PermissionError(
f"Permission denied reading {browser_name} cookies.\n"
" Close the browser, or on Windows run as Administrator for Chrome/Edge."
)
except Exception as e:
raise RuntimeError(f"Could not read {browser_name} cookies: {e}")
for c in cookies:
if c.get("name", "").startswith(COOKIE_PREFIX):
return c["name"], c["value"]
return None, None
def update_env(name, value):
"""Write WP_LOGIN_COOKIE=name=value into .env, replacing any existing line."""
new_line = f"{ENV_KEY}={name}={value}\n"
if ENV_FILE.exists():
text = ENV_FILE.read_text(encoding="utf-8")
lines = text.splitlines(keepends=True)
for i, line in enumerate(lines):
if line.startswith(f"{ENV_KEY}=") or line.strip() == ENV_KEY:
lines[i] = new_line
ENV_FILE.write_text("".join(lines), encoding="utf-8")
return "updated"
# Key not present — append
if text and not text.endswith("\n"):
text += "\n"
ENV_FILE.write_text(text + new_line, encoding="utf-8")
return "appended"
else:
ENV_FILE.write_text(new_line, encoding="utf-8")
return "created"
def main():
parser = argparse.ArgumentParser(
description=f"Copy the {COOKIE_DOMAIN} login cookie from your browser into .env."
)
parser.add_argument(
"--browser", "-b",
choices=BROWSER_NAMES,
metavar="BROWSER",
help=f"Browser to read from: {', '.join(BROWSER_NAMES)} (default: try all in order)",
)
args = parser.parse_args()
order = [args.browser] if args.browser else BROWSER_NAMES
cookie_name = cookie_value = None
for browser in order:
print(f"[…] Trying {browser}")
try:
cookie_name, cookie_value = find_cookie(browser)
except ImportError as e:
raise SystemExit(f"[!] {e}")
except (ValueError, PermissionError, RuntimeError) as e:
print(f"[!] {e}")
continue
if cookie_name:
print(f"[+] Found in {browser}: {cookie_name}")
break
print(f" No {COOKIE_PREFIX}* cookie found in {browser}.")
if not cookie_name:
raise SystemExit(
f"\n[!] No {COOKIE_PREFIX}* cookie found in any browser.\n"
f" Make sure you are logged into {COOKIE_DOMAIN}, then re-run.\n"
" Or set WP_LOGIN_COOKIE manually in .env — see .env.example."
)
action = update_env(cookie_name, cookie_value)
print(f"[✓] {ENV_KEY} {action} in {ENV_FILE}.")
if __name__ == "__main__":
main()

main.py

@@ -0,0 +1,467 @@
import re
import json
import os
import time
import signal
import asyncio
import tempfile
import requests
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from dotenv import load_dotenv
from playwright.async_api import async_playwright
from check_clashes import VIDEO_EXTS
from config import BASE_URL
load_dotenv()
def _is_video_url(url):
"""True if `url` ends with a recognised video extension (case-insensitive, path only)."""
return PurePosixPath(urlparse(url).path).suffix.lower() in VIDEO_EXTS
WP_API = f"{BASE_URL}/wp-json/wp/v2"
SKIP_TYPES = {
"attachment", "nav_menu_item", "wp_block", "wp_template",
"wp_template_part", "wp_global_styles", "wp_navigation",
"wp_font_family", "wp_font_face",
}
VIDEO_MAP_FILE = "video_map.json"
MAX_WORKERS = 4
API_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:147.0) Gecko/20100101 Firefox/147.0",
"Accept": "application/json",
"Referer": f"{BASE_URL}/",
}
def _get_login_cookie():
raw = os.environ.get("WP_LOGIN_COOKIE", "").strip() # strip accidental whitespace
if not raw:
raise RuntimeError(
"WP_LOGIN_COOKIE not set. Copy it from your browser into .env — see .env.example.")
name, _, value = raw.partition("=")
if not value:
raise RuntimeError(
"WP_LOGIN_COOKIE looks malformed (no '=' found). Expected: name=value")
if not name.startswith("wordpress_logged_in_"):
raise RuntimeError(
"WP_LOGIN_COOKIE doesn't look right — expected a wordpress_logged_in_... cookie.")
return name, value
def discover_content_types(session):
"""Query /wp-json/wp/v2/types and return a list of (name, rest_base, type_slug) for content types worth scraping."""
r = session.get(f"{WP_API}/types", timeout=30)
r.raise_for_status()
types = r.json()
targets = []
for type_slug, info in types.items():
if type_slug in SKIP_TYPES:
continue
rest_base = info.get("rest_base")
name = info.get("name", type_slug)
if rest_base:
targets.append((name, rest_base, type_slug))
return targets
def fetch_all_posts_for_type(session, type_name, rest_base, type_slug):
"""Paginate one content type and return (url, title, description) tuples.
Uses the `link` field when available; falls back to building from slug."""
url_prefix = type_slug.replace("_", "-")
results = []
page = 1
while True:
r = session.get(
f"{WP_API}/{rest_base}",
params={"per_page": 100, "page": page},
timeout=30,
)
if r.status_code == 400 or not r.ok:
break
data = r.json()
if not data:
break
for post in data:
link = post.get("link", "")
if not link.startswith("http"):
slug = post.get("slug")
if slug:
link = f"{BASE_URL}/{url_prefix}/{slug}/"
else:
continue
title_obj = post.get("title", {})
title = title_obj.get("rendered", "") if isinstance(
title_obj, dict) else str(title_obj)
content_obj = post.get("content", {})
content_html = content_obj.get(
"rendered", "") if isinstance(content_obj, dict) else ""
description = html_to_text(content_html) if content_html else ""
results.append((link, title, description))
print(f" {type_name} page {page}: {len(data)} items")
page += 1
return results
def fetch_post_urls_from_api(headers):
"""Auto-discover all content types via the WP REST API and collect every post URL.
Also builds video_map.json with titles pre-populated."""
print("[+] video_map.json empty or missing — discovering content types from REST API…")
session = requests.Session()
session.headers.update(headers)
targets = discover_content_types(session)
print(
f"[+] Found {len(targets)} content types: {', '.join(name for name, _, _ in targets)}\n")
all_results = []
for type_name, rest_base, type_slug in targets:
type_results = fetch_all_posts_for_type(
session, type_name, rest_base, type_slug)
all_results.extend(type_results)
seen = set()
deduped_urls = []
video_map = load_video_map()
for url, title, description in all_results:
if url not in seen and url.startswith("http"):
seen.add(url)
deduped_urls.append(url)
if url not in video_map:
video_map[url] = {"title": title,
"description": description, "videos": []}
else:
if not video_map[url].get("title"):
video_map[url]["title"] = title
if not video_map[url].get("description"):
video_map[url]["description"] = description
save_video_map(video_map)
print(
f"\n[+] Discovered {len(deduped_urls)} unique URLs → saved to {VIDEO_MAP_FILE}")
print(
f"[+] Pre-populated {len(video_map)} entries in {VIDEO_MAP_FILE}")
return deduped_urls
def fetch_metadata_from_api(video_map, urls, headers):
"""Populate missing titles and descriptions in video_map from the REST API."""
missing = [u for u in urls
if u not in video_map
or not video_map[u].get("title")
or not video_map[u].get("description")]
if not missing:
return
print(f"[+] Fetching metadata from REST API for {len(missing)} posts…")
session = requests.Session()
session.headers.update(headers)
targets = discover_content_types(session)
for type_name, rest_base, type_slug in targets:
type_results = fetch_all_posts_for_type(
session, type_name, rest_base, type_slug)
for url, title, description in type_results:
if url in video_map:
if not video_map[url].get("title"):
video_map[url]["title"] = title
if not video_map[url].get("description"):
video_map[url]["description"] = description
else:
video_map[url] = {"title": title,
"description": description, "videos": []}
save_video_map(video_map)
populated_t = sum(1 for u in urls if video_map.get(u, {}).get("title"))
populated_d = sum(1 for u in urls if video_map.get(
u, {}).get("description"))
print(f"[+] Titles populated: {populated_t}/{len(urls)}")
print(f"[+] Descriptions populated: {populated_d}/{len(urls)}")
def load_post_urls(headers):
vm = load_video_map()
if vm:
print(f"[+] {VIDEO_MAP_FILE} found — loading {len(vm)} post URLs.")
return list(vm.keys())
return fetch_post_urls_from_api(headers)
def html_to_text(html_str):
"""Strip HTML tags, decode entities, and collapse whitespace into clean plain text."""
import html
text = re.sub(r'<br\s*/?>', '\n', html_str)
text = text.replace('</p>', '\n\n')
text = re.sub(r'<[^>]+>', '', text)
text = html.unescape(text)
lines = [line.strip() for line in text.splitlines()]
text = '\n'.join(lines)
text = re.sub(r'\n{3,}', '\n\n', text)
return text.strip()
def extract_mp4_from_html(html):
candidates = re.findall(r'https?://[^\s"\'<>]+', html)
return [u for u in candidates if _is_video_url(u)]
def extract_title_from_html(html):
m = re.search(
r'<h1[^>]*class="entry-title"[^>]*>(.*?)</h1>', html, re.DOTALL)
if m:
title = re.sub(r'<[^>]+>', '', m.group(1)).strip()
return title
m = re.search(r'<title>(.*?)(?:\s*[-|].*)?</title>', html, re.DOTALL)
if m:
return m.group(1).strip()
return None
def load_video_map():
if Path(VIDEO_MAP_FILE).exists():
try:
with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
return json.load(f)
except (json.JSONDecodeError, OSError):
return {}
return {}
def save_video_map(video_map):
fd, tmp_path = tempfile.mkstemp(dir=Path(VIDEO_MAP_FILE).resolve().parent, suffix=".tmp")
try:
with os.fdopen(fd, "w", encoding="utf-8") as f:
json.dump(video_map, f, indent=2, ensure_ascii=False)
Path(tmp_path).replace(VIDEO_MAP_FILE)
except Exception:
try:
Path(tmp_path).unlink()
except OSError:
pass
raise
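The write-temp-then-replace pattern used by `save_video_map` above can be sketched in isolation. `os.replace` is atomic on POSIX when source and target share a filesystem, so a crash mid-dump never leaves a truncated JSON file behind. Function and file names here are illustrative, not part of the scraper:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(path, data):
    """Write JSON to a sibling temp file, then atomically swap it in."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.resolve().parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        os.replace(tmp, path)  # atomic on the same filesystem
    except Exception:
        try:
            os.unlink(tmp)     # best-effort cleanup of the temp file
        except OSError:
            pass
        raise
```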
def _expects_video(url):
return "/pinkcuffs-videos/" in url
MAX_RETRIES = 2
async def worker(worker_id, queue, context, known,
total, retry_counts, video_map, map_lock, shutdown_event):
page = await context.new_page()
video_hits = set()
page.on("response", lambda resp: video_hits.add(resp.url) if _is_video_url(resp.url) else None)
try:
while not shutdown_event.is_set():
try:
idx, url = queue.get_nowait()
except asyncio.QueueEmpty:
break
attempt = retry_counts.get(idx, 0)
label = f" (retry {attempt}/{MAX_RETRIES})" if attempt else ""
print(f"[W{worker_id}] ({idx + 1}/{total}) {url}{label}")
try:
await page.goto(url, wait_until="networkidle", timeout=60000)
except Exception as e:
print(f"[W{worker_id}] Navigation error: {e}")
if _expects_video(url) and attempt < MAX_RETRIES:
retry_counts[idx] = attempt + 1
queue.put_nowait((idx, url))
print(f"[W{worker_id}] Re-queued for retry.")
elif not _expects_video(url):
async with map_lock:
entry = video_map.get(url, {})
entry["scraped_at"] = int(time.time())
video_map[url] = entry
save_video_map(video_map)
else:
print(
f"[W{worker_id}] Still failing after {MAX_RETRIES} retries — will retry next run.")
continue
await asyncio.sleep(1.5)
html = await page.content()
title = extract_title_from_html(html)
html_videos = extract_mp4_from_html(html)
found = set(html_videos) | set(video_hits)
video_hits.clear()
all_videos = [m for m in found if m not in (
f"{BASE_URL}/wp-content/plugins/easy-video-player/lib/blank.mp4",
)]
async with map_lock:
new_found = found - known
if new_found:
print(f"[W{worker_id}] Found {len(new_found)} new video URLs")
known.update(new_found)
elif all_videos:
print(
f"[W{worker_id}] {len(all_videos)} video(s) already known — skipping write.")
else:
print(f"[W{worker_id}] No video found on page.")
entry = video_map.get(url, {})
if title:
entry["title"] = title
existing_videos = set(entry.get("videos", []))
existing_videos.update(all_videos)
entry["videos"] = sorted(existing_videos)
mark_done = bool(all_videos) or not _expects_video(url)
if mark_done:
entry["scraped_at"] = int(time.time())
video_map[url] = entry
save_video_map(video_map)
if not mark_done:
if attempt < MAX_RETRIES:
retry_counts[idx] = attempt + 1
queue.put_nowait((idx, url))
print(
f"[W{worker_id}] Re-queued for retry ({attempt + 1}/{MAX_RETRIES}).")
else:
print(
f"[W{worker_id}] No video after {MAX_RETRIES} retries — will retry next run.")
finally:
await page.close()
async def run():
shutdown_event = asyncio.Event()
loop = asyncio.get_running_loop()
def _handle_shutdown(signum, _frame):
print(f"\n[!] Signal {signum} received — finishing active pages then exiting…")
loop.call_soon_threadsafe(shutdown_event.set)
signal.signal(signal.SIGINT, _handle_shutdown)
signal.signal(signal.SIGTERM, _handle_shutdown)
try:
cookie_name, cookie_value = _get_login_cookie()
req_headers = {
**API_HEADERS,
"Cookie": f"{cookie_name}={cookie_value}; eav-age-verified=1",
}
urls = load_post_urls(req_headers)
video_map = load_video_map()
if any(u not in video_map
or not video_map[u].get("title")
or not video_map[u].get("description")
for u in urls if _expects_video(u)):
fetch_metadata_from_api(video_map, urls, req_headers)
known = {u for entry in video_map.values() for u in entry.get("videos", [])}
total = len(urls)
pending = []
needs_map = 0
for i, u in enumerate(urls):
entry = video_map.get(u, {})
if not entry.get("scraped_at"):
pending.append((i, u))
elif _expects_video(u) and not entry.get("videos"):
pending.append((i, u))
needs_map += 1
done_count = sum(1 for v in video_map.values() if v.get("scraped_at"))
print(f"[+] Loaded {total} post URLs.")
print(f"[+] Already have {len(known)} video URLs mapped.")
print(f"[+] Video map: {len(video_map)} entries in {VIDEO_MAP_FILE}")
if done_count:
remaining_new = len(pending) - needs_map
print(
f"[↻] Resuming: {done_count} done, {remaining_new} new + {needs_map} needing map data.")
if not pending:
print("[✓] All URLs already processed and mapped.")
return
print(
f"[⚡] Running with {min(MAX_WORKERS, len(pending))} concurrent workers.\n")
queue = asyncio.Queue()
for item in pending:
queue.put_nowait(item)
map_lock = asyncio.Lock()
retry_counts = {}
async with async_playwright() as p:
browser = await p.firefox.launch(headless=True)
context = await browser.new_context()
_cookie_domain = urlparse(BASE_URL).netloc
site_cookies = [
{
"name": cookie_name,
"value": cookie_value,
"domain": _cookie_domain,
"path": "/",
"httpOnly": True,
"secure": True,
"sameSite": "None"
},
{
"name": "eav-age-verified",
"value": "1",
"domain": _cookie_domain,
"path": "/"
}
]
await context.add_cookies(site_cookies)
num_workers = min(MAX_WORKERS, len(pending))
workers = [
asyncio.create_task(
worker(i, queue, context, known,
total, retry_counts, video_map, map_lock, shutdown_event)
)
for i in range(num_workers)
]
await asyncio.gather(*workers)
await browser.close()
mapped = sum(1 for v in video_map.values() if v.get("videos"))
print(
f"\n[+] Video map: {mapped} posts with videos, {len(video_map)} total entries.")
if not shutdown_event.is_set():
print(f"[✓] Completed. Full map in {VIDEO_MAP_FILE}")
else:
done = sum(1 for v in video_map.values() if v.get("scraped_at"))
print(f"[⏸] Paused — {done}/{total} done. Run again to resume.")
finally:
signal.signal(signal.SIGINT, signal.SIG_DFL)
signal.signal(signal.SIGTERM, signal.SIG_DFL)
def main():
try:
asyncio.run(run())
except KeyboardInterrupt:
print("\n[!] Interrupted. Run again to resume.")
except RuntimeError as e:
raise SystemExit(f"[!] {e}")
if __name__ == "__main__":
main()

21252
openapi.json Normal file

File diff suppressed because one or more lines are too long

4
requirements.txt Normal file

@@ -0,0 +1,4 @@
playwright==1.58.0
python-dotenv==1.2.1
requests==2.32.5
rookiepy==0.5.6

61
total_size.py Normal file

@@ -0,0 +1,61 @@
"""Calculate total disk space needed to download all videos.
Importable function:
summarize_sizes(sizes) - return dict with sized, total, total_bytes, smallest, largest, average, failed
"""
from check_clashes import fmt_size, fetch_sizes, load_video_map, VIDEO_MAP_FILE
def summarize_sizes(sizes):
"""Given {url: size_or_None}, return a stats dict."""
known = {u: s for u, s in sizes.items() if s is not None}
failed = [u for u, s in sizes.items() if s is None]
if not known:
return {"sized": 0, "total": len(sizes), "total_bytes": 0,
"smallest": 0, "largest": 0, "average": 0, "failed": failed}
total_bytes = sum(known.values())
return {
"sized": len(known),
"total": len(sizes),
"total_bytes": total_bytes,
"smallest": min(known.values()),
"largest": max(known.values()),
"average": total_bytes // len(known),
"failed": failed,
}
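For example, fed a hypothetical size map where one HEAD request failed, the stats work out as follows (the URLs and sizes are made up):

```python
# Hypothetical {url: size_in_bytes_or_None} result, as fetch_sizes() returns.
sizes = {"a.mp4": 100, "b.mp4": 300, "c.mp4": None}

known = {u: s for u, s in sizes.items() if s is not None}
stats = {
    "sized": len(known),                           # 2 files sized
    "total": len(sizes),                           # 3 attempted
    "total_bytes": sum(known.values()),            # 400
    "smallest": min(known.values()),               # 100
    "largest": max(known.values()),                # 300
    "average": sum(known.values()) // len(known),  # 200
    "failed": [u for u, s in sizes.items() if s is None],
}
```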
# --------------- CLI ---------------
def _progress(done, total):
if done % 200 == 0 or done == total:
print(f" {done}/{total}")
def main():
vm = load_video_map()
urls = [u for entry in vm.values() for u in entry.get("videos", []) if u.startswith("http")]
print(f"[+] {len(urls)} URLs in {VIDEO_MAP_FILE}")
print("[+] Fetching file sizes (20 threads)…\n")
sizes = fetch_sizes(urls, workers=20, on_progress=_progress)
stats = summarize_sizes(sizes)
print(f"\n{'=' * 45}")
print(f" Sized: {stats['sized']}/{stats['total']} files")
print(f" Total: {fmt_size(stats['total_bytes'])}")
print(f" Smallest: {fmt_size(stats['smallest'])}")
print(f" Largest: {fmt_size(stats['largest'])}")
print(f" Average: {fmt_size(stats['average'])}")
print(f"{'=' * 45}")
if stats["failed"]:
print(f"\n[!] {len(stats['failed'])} URL(s) could not be sized:")
for u in stats["failed"]:
print(f" {u}")
if __name__ == "__main__":
main()

603
upload.py Normal file

@@ -0,0 +1,603 @@
"""Upload videos to PeerTube with transcoding-aware flow control.
Uploads videos one batch at a time, waits for each batch to be fully transcoded
and moved to object storage before uploading the next — preventing disk
exhaustion on the PeerTube server.
Usage:
python upload.py # upload from ./downloads
python upload.py -i /mnt/vol/dl # custom input dir
python upload.py --batch-size 2 # upload 2, wait, repeat
python upload.py --dry-run # preview without uploading
python upload.py --skip-wait # upload without waiting
Required (CLI flag or env var):
--url / PEERTUBE_URL
--username / PEERTUBE_USER
--channel / PEERTUBE_CHANNEL
--password / PEERTUBE_PASSWORD
"""
import argparse
from collections import Counter
import html
import os
from pathlib import Path
import re
import sys
import time
import requests
from dotenv import load_dotenv
from check_clashes import fmt_size, url_to_filename, VIDEO_EXTS
from download import (
load_video_map,
collect_urls,
get_paths_for_mode,
read_mode,
MODE_ORIGINAL,
DEFAULT_OUTPUT,
)
load_dotenv()
# ── Defaults ─────────────────────────────────────────────────────────
DEFAULT_BATCH_SIZE = 1
DEFAULT_POLL = 30
UPLOADED_FILE = ".uploaded"
PT_NAME_MAX = 120
# ── Text helpers ─────────────────────────────────────────────────────
def clean_description(raw):
"""Strip WordPress shortcodes and HTML from a description."""
if not raw:
return ""
text = re.sub(r'\[/?[^\]]+\]', '', raw)
text = re.sub(r'<[^>]+>', '', text)
text = html.unescape(text)
text = re.sub(r'\n{3,}', '\n\n', text).strip()
return text[:10000]
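The three stripping passes in `clean_description` can be demonstrated on an invented WordPress-flavoured input containing a shortcode, a tag, and an entity:

```python
import html
import re

# Invented input for illustration only.
raw = '[av_video src="x"]<p>Hello &amp; welcome</p>'

text = re.sub(r'\[/?[^\]]+\]', '', raw)  # strip [shortcodes]
text = re.sub(r'<[^>]+>', '', text)      # strip HTML tags
text = html.unescape(text).strip()       # decode entities
```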
def make_pt_name(title, fallback_filename):
"""Build a PeerTube-safe video name (3-120 chars)."""
name = html.unescape(title).strip() if title else Path(fallback_filename).stem
if len(name) > PT_NAME_MAX:
name = name[: PT_NAME_MAX - 1].rstrip() + "\u2026"
while len(name) < 3:
name += "_"
return name
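The clamping rules above (PeerTube accepts names of 3 to 120 characters) can be checked in isolation with this minimal re-implementation, extracted here only for illustration:

```python
PT_NAME_MAX = 120

def clamp_name(name):
    """Truncate to 120 chars with an ellipsis; pad anything under 3 chars."""
    if len(name) > PT_NAME_MAX:
        name = name[:PT_NAME_MAX - 1].rstrip() + "\u2026"
    while len(name) < 3:
        name += "_"
    return name
```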
# ── PeerTube API ─────────────────────────────────────────────────────
def get_oauth_token(base, username, password):
r = requests.get(f"{base}/api/v1/oauth-clients/local", timeout=15)
r.raise_for_status()
client = r.json()
r = requests.post(
f"{base}/api/v1/users/token",
data={
"client_id": client["client_id"],
"client_secret": client["client_secret"],
"grant_type": "password",
"username": username,
"password": password,
},
timeout=15,
)
r.raise_for_status()
return r.json()["access_token"]
def api_headers(token):
return {"Authorization": f"Bearer {token}"}
def get_channel_id(base, token, channel_name):
r = requests.get(
f"{base}/api/v1/video-channels/{channel_name}",
headers=api_headers(token),
timeout=15,
)
r.raise_for_status()
return r.json()["id"]
def get_channel_video_names(base, token, channel_name):
"""Paginate through the channel and return a Counter of video names."""
counts = Counter()
start = 0
while True:
r = requests.get(
f"{base}/api/v1/video-channels/{channel_name}/videos",
params={"start": start, "count": 100},
headers=api_headers(token),
timeout=30,
)
r.raise_for_status()
data = r.json()
for v in data.get("data", []):
counts[v["name"]] += 1
start += 100
if start >= data.get("total", 0):
break
return counts
CHUNK_SIZE = 10 * 1024 * 1024 # 10 MB
MAX_RETRIES = 5
def _init_resumable(base, token, channel_id, filepath, filename, name,
description="", nsfw=False):
"""POST to create a resumable upload session. Returns upload URL."""
file_size = Path(filepath).stat().st_size
metadata = {
"name": name,
"channelId": channel_id,
"filename": filename,
"nsfw": nsfw,
"waitTranscoding": True,
"privacy": 1,
}
if description:
metadata["description"] = description
r = requests.post(
f"{base}/api/v1/videos/upload-resumable",
headers={
**api_headers(token),
"Content-Type": "application/json",
"X-Upload-Content-Length": str(file_size),
"X-Upload-Content-Type": "video/mp4",
},
json=metadata,
timeout=30,
)
r.raise_for_status()
location = r.headers["Location"]
if location.startswith("//"):
location = "https:" + location
elif location.startswith("/"):
location = base + location
return location, file_size
def _query_offset(upload_url, token, file_size):
"""Ask the server how many bytes it has received so far."""
r = requests.put(
upload_url,
headers={
**api_headers(token),
"Content-Range": f"bytes */{file_size}",
"Content-Length": "0",
},
timeout=15,
)
if r.status_code == 308:
range_hdr = r.headers.get("Range", "")
if range_hdr:
return int(range_hdr.split("-")[1]) + 1
return 0
if r.status_code == 200:
return file_size
r.raise_for_status()
return 0
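The Content-Range headers the chunk loop below sends follow the Google-style resumable scheme with inclusive byte ranges; the arithmetic can be sketched standalone:

```python
def content_ranges(file_size, chunk_size):
    """Yield (offset, end, header_value) for each inclusive byte range."""
    offset = 0
    while offset < file_size:
        end = min(offset + chunk_size, file_size) - 1  # inclusive end byte
        yield offset, end, f"bytes {offset}-{end}/{file_size}"
        offset = end + 1

# A 25-byte file in 10-byte chunks: three ranges, the last one short.
headers = [h for _, _, h in content_ranges(25, 10)]
```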
def upload_video(base, token, channel_id, filepath, name,
description="", nsfw=False):
"""Resumable chunked upload. Returns (ok, uuid)."""
filepath = Path(filepath)
filename = filepath.name
file_size = filepath.stat().st_size
try:
upload_url, _ = _init_resumable(
base, token, channel_id, filepath, filename,
name, description, nsfw,
)
except Exception as e:
print(f" Init failed: {e}")
return False, None
offset = 0
retries = 0
with open(filepath, "rb") as f:
while offset < file_size:
end = min(offset + CHUNK_SIZE, file_size) - 1
chunk_len = end - offset + 1
f.seek(offset)
chunk = f.read(chunk_len)
pct = int(100 * (end + 1) / file_size)
print(f" {fmt_size(offset)}/{fmt_size(file_size)} ({pct}%)",
end="\r", flush=True)
try:
r = requests.put(
upload_url,
headers={
**api_headers(token),
"Content-Type": "application/octet-stream",
"Content-Range": f"bytes {offset}-{end}/{file_size}",
"Content-Length": str(chunk_len),
},
data=chunk,
timeout=120,
)
except (requests.ConnectionError, requests.Timeout) as e:
retries += 1
if retries > MAX_RETRIES:
print(
f"\n Upload failed after {MAX_RETRIES} retries: {e}")
return False, None
wait = min(2 ** retries, 60)
print(f"\n Connection error, retry {retries}/{MAX_RETRIES} "
f"in {wait}s ...")
time.sleep(wait)
try:
offset = _query_offset(upload_url, token, file_size)
except Exception:
pass
continue
if r.status_code == 308:
range_hdr = r.headers.get("Range", "")
if range_hdr:
offset = int(range_hdr.split("-")[1]) + 1
else:
offset = end + 1
retries = 0
elif r.status_code == 200:
print(
f" {fmt_size(file_size)}/{fmt_size(file_size)} (100%)")
uuid = r.json().get("video", {}).get("uuid")
return True, uuid
elif r.status_code in (502, 503, 429):
retry_after = int(r.headers.get("Retry-After", 10))
retries += 1
if retries > MAX_RETRIES:
print(
f"\n Upload failed: server returned {r.status_code}")
return False, None
print(
f"\n Server {r.status_code}, retry in {retry_after}s ...")
time.sleep(retry_after)
try:
offset = _query_offset(upload_url, token, file_size)
except Exception:
pass
else:
detail = r.text[:300] if r.text else str(r.status_code)
print(f"\n Upload failed ({r.status_code}): {detail}")
return False, None
print("\n Unexpected: sent all bytes but no 200 response")
return False, None
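The connection-error path above waits `min(2 ** retries, 60)` seconds between attempts; that capped exponential backoff behaves like this:

```python
def backoff_delay(retries, cap=60):
    """Delay before retry N: doubles each attempt, never exceeds the cap."""
    return min(2 ** retries, cap)

delays = [backoff_delay(n) for n in range(1, 8)]
```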
_STATE = {
1: "Published",
2: "To transcode",
3: "To import",
6: "Moving to object storage",
7: "Transcoding failed",
8: "Storage move failed",
9: "To edit",
}
def get_video_state(base, token, uuid):
r = requests.get(
f"{base}/api/v1/videos/{uuid}",
headers=api_headers(token),
timeout=15,
)
r.raise_for_status()
state = r.json()["state"]
return state["id"], state.get("label", "")
def wait_for_published(base, token, uuid, poll_interval):
"""Block until the video reaches state 1 (Published) or a failure state."""
started = time.monotonic()
while True:
elapsed = int(time.monotonic() - started)
hours, rem = divmod(elapsed, 3600)
mins, secs = divmod(rem, 60)
if hours:
elapsed_str = f"{hours}h {mins:02d}m {secs:02d}s"
elif mins:
elapsed_str = f"{mins}m {secs:02d}s"
else:
elapsed_str = f"{secs}s"
try:
sid, label = get_video_state(base, token, uuid)
except requests.exceptions.RequestException as e:
print(f" -> Poll error ({e.__class__.__name__}) "
f"after {elapsed_str}, retrying in {poll_interval}s …")
time.sleep(poll_interval)
continue
display = _STATE.get(sid, label or f"state {sid}")
if sid == 1:
print(f" -> {display}")
return sid
if sid in (7, 8):
print(f" -> FAILED: {display}")
return sid
print(f" -> {display} — {elapsed_str} elapsed (next check in {poll_interval}s)")
time.sleep(poll_interval)
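The elapsed-time formatting inside the poll loop above reduces to a small helper, extracted here only for illustration:

```python
def fmt_elapsed(seconds):
    """Render a second count as '1h 02m 05s', '1m 35s', or '42s'."""
    hours, rem = divmod(seconds, 3600)
    mins, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {mins:02d}m {secs:02d}s"
    if mins:
        return f"{mins}m {secs:02d}s"
    return f"{secs}s"
```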
# ── State tracker ────────────────────────────────────────────────────
def load_uploaded(input_dir):
path = Path(input_dir) / UPLOADED_FILE
if not path.exists():
return set()
with open(path) as f:
return {Path(line.strip()) for line in f if line.strip()}
def mark_uploaded(input_dir, rel_path):
with open(Path(input_dir) / UPLOADED_FILE, "a") as f:
f.write(f"{rel_path}\n")
# ── File / metadata helpers ─────────────────────────────────────────
def build_path_to_meta(video_map, input_dir):
"""Map each expected download path (relative) to {title, description}."""
urls = collect_urls(video_map)
mode = read_mode(input_dir) or MODE_ORIGINAL
paths = get_paths_for_mode(mode, urls, video_map, input_dir)
url_meta = {}
for entry in video_map.values():
t = entry.get("title", "")
d = entry.get("description", "")
for video_url in entry.get("videos", []):
if video_url not in url_meta:
url_meta[video_url] = {"title": t, "description": d}
result = {}
for url, abs_path in paths.items():
rel = Path(abs_path).relative_to(input_dir)
meta = url_meta.get(url, {"title": "", "description": ""})
result[rel] = {**meta, "original_filename": url_to_filename(url)}
return result
def find_videos(input_dir):
"""Walk input_dir and return a set of relative paths for all video files."""
found = set()
for root, dirs, files in os.walk(input_dir):
dirs[:] = [d for d in dirs if not d.startswith(".")]
for f in files:
if Path(f).suffix.lower() in VIDEO_EXTS:
found.add((Path(root) / f).relative_to(input_dir))
return found
# ── Channel match helpers ─────────────────────────────────────────────
def _channel_match(rel, path_meta, existing):
"""Return (matched, name) for a local file against the channel name set.
Checks both the title-derived name and the original-filename-derived name
so that videos uploaded under either form are recognised. Extracted to
avoid duplicating the logic between the pre-reconcile sweep and the per-
file check inside the upload loop.
"""
meta = path_meta.get(rel, {})
name = make_pt_name(meta.get("title", ""), rel.name)
orig_fn = meta.get("original_filename", "")
raw_name = make_pt_name("", orig_fn) if orig_fn else None
matched = name in existing or (raw_name and raw_name != name and raw_name in existing)
return matched, name
# ── CLI ──────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(
description="Upload videos to PeerTube with transcoding-aware batching",
)
ap.add_argument("--input", "-i", default=DEFAULT_OUTPUT,
help=f"Directory with downloaded videos (default: {DEFAULT_OUTPUT})")
ap.add_argument("--url",
help="PeerTube instance URL (or set PEERTUBE_URL env var)")
ap.add_argument("--username", "-U",
help="PeerTube username (or set PEERTUBE_USER env var)")
ap.add_argument("--password", "-p",
help="PeerTube password (or set PEERTUBE_PASSWORD env var)")
ap.add_argument("--channel", "-C",
help="Channel to upload to (or set PEERTUBE_CHANNEL env var)")
ap.add_argument("--batch-size", "-b", type=int, default=DEFAULT_BATCH_SIZE,
help="Videos to upload before waiting for transcoding (default: 1)")
ap.add_argument("--poll-interval", type=int, default=DEFAULT_POLL,
help=f"Seconds between state polls (default: {DEFAULT_POLL})")
ap.add_argument("--skip-wait", action="store_true",
help="Upload everything without waiting for transcoding")
ap.add_argument("--nsfw", action="store_true",
help="Mark videos as NSFW")
ap.add_argument("--dry-run", "-n", action="store_true",
help="Preview what would be uploaded")
args = ap.parse_args()
url = args.url or os.environ.get("PEERTUBE_URL")
username = args.username or os.environ.get("PEERTUBE_USER")
channel = args.channel or os.environ.get("PEERTUBE_CHANNEL")
password = args.password or os.environ.get("PEERTUBE_PASSWORD")
if not args.dry_run:
missing = [label for label, val in [
("--url / PEERTUBE_URL", url),
("--username / PEERTUBE_USER", username),
("--channel / PEERTUBE_CHANNEL", channel),
("--password / PEERTUBE_PASSWORD", password),
] if not val]
if missing:
for label in missing:
print(f"[!] Required: {label}")
sys.exit(1)
# ── load metadata & scan disk ──
video_map = load_video_map()
path_meta = build_path_to_meta(video_map, args.input)
on_disk = find_videos(args.input)
unmatched = on_disk - set(path_meta.keys())
if unmatched:
print(
f"[!] {len(unmatched)} file(s) on disk not in video_map (will use filename as title)")
for rel in unmatched:
path_meta[rel] = {"title": "", "description": ""}
uploaded = load_uploaded(args.input)
pending = sorted(rel for rel in on_disk if rel not in uploaded)
print(f"[+] {len(on_disk)} video files in {args.input}/")
print(f"[+] {len(uploaded)} already uploaded")
print(f"[+] {len(pending)} pending")
print(f"[+] Batch size: {args.batch_size}")
if not pending:
print("\nAll videos already uploaded.")
return
# ── dry run ──
if args.dry_run:
total_bytes = 0
for rel in pending:
meta = path_meta.get(rel, {})
name = make_pt_name(meta.get("title", ""), rel.name)
sz = (Path(args.input) / rel).stat().st_size
total_bytes += sz
print(f" [{fmt_size(sz):>10}] {name}")
print(
f"\n Total: {fmt_size(total_bytes)} across {len(pending)} videos")
return
# ── authenticate ──
base = url.rstrip("/")
if not base.startswith("http"):
base = "https://" + base
print(f"\n[+] Authenticating with {base} ...")
token = get_oauth_token(base, username, password)
print(f"[+] Authenticated as {username}")
channel_id = get_channel_id(base, token, channel)
print(f"[+] Channel: {channel} (id {channel_id})")
name_counts = get_channel_video_names(base, token, channel)
existing = set(name_counts)
total = sum(name_counts.values())
print(f"[+] Found {total} video(s) on channel ({len(name_counts)} unique name(s))")
dupes = {name: count for name, count in name_counts.items() if count > 1}
if dupes:
print(f"[!] {len(dupes)} duplicate name(s) detected on channel:")
for name, count in sorted(dupes.items()):
print(f" x{count} {name}")
# ── pre-reconcile: sweep all pending against channel names ────────
# The main upload loop discovers already-uploaded videos lazily as it
# walks the sorted pending list — meaning on a fresh run (no .uploaded
# file) you won't know how many files are genuinely new until the loop
# has processed everything. Doing a full sweep here, before any
# upload starts, gives an accurate count up-front and pre-populates
# .uploaded so that interrupted/re-run sessions skip them instantly
# without re-checking each time.
pre_matched = []
for rel in pending:
if _channel_match(rel, path_meta, existing)[0]:
pre_matched.append(rel)
if pre_matched:
print(f"\n[+] Pre-sweep: {len(pre_matched)} local file(s) already on channel — marking uploaded")
for rel in pre_matched:
mark_uploaded(args.input, rel)
pending = [rel for rel in pending if rel not in set(pre_matched)]
print(f"[+] {len(pending)} left to upload\n")
nsfw = args.nsfw
total_up = 0
batch: list[tuple[str, str]] = [] # [(uuid, name), ...]
try:
for rel in pending:
# ── flush batch if full ──
if not args.skip_wait and len(batch) >= args.batch_size:
print(
f"\n[+] Waiting for {len(batch)} video(s) to finish processing ...")
for uuid, bname in batch:
print(f"\n [{bname}]")
wait_for_published(base, token, uuid, args.poll_interval)
batch.clear()
filepath = Path(args.input) / rel
meta = path_meta.get(rel, {})
name = make_pt_name(meta.get("title", ""), rel.name)
desc = clean_description(meta.get("description", ""))
sz = filepath.stat().st_size
if _channel_match(rel, path_meta, existing)[0]:
print(f"\n[skip] already on channel: {name}")
mark_uploaded(args.input, rel)
continue
print(f"\n[{total_up + 1}/{len(pending)}] {name}")
print(f" File: {rel} ({fmt_size(sz)})")
ok, uuid = upload_video(
base, token, channel_id, filepath, name, desc, nsfw)
if not ok:
continue
print(f" Uploaded uuid={uuid}")
mark_uploaded(args.input, rel)
total_up += 1
existing.add(name)
if uuid:
batch.append((uuid, name))
# ── wait for final batch ──
if batch and not args.skip_wait:
print(f"\n[+] Waiting for final {len(batch)} video(s) ...")
for uuid, bname in batch:
print(f"\n [{bname}]")
wait_for_published(base, token, uuid, args.poll_interval)
except KeyboardInterrupt:
print(
f"\n\n[!] Interrupted after {total_up} uploads. Re-run to continue.")
sys.exit(130)
print(f"\n{'=' * 50}")
print(f" Uploaded: {total_up} video(s)")
print(" Done!")
print(f"{'=' * 50}")
if __name__ == "__main__":
main()

11025
video_map.json Normal file

File diff suppressed because one or more lines are too long