mirror of
https://github.com/HugeFrog24/jailbirdz-dl.git
synced 2026-03-02 01:04:31 +00:00
Garbage commit; added junk code
.env.example (new file, 11 lines)
@@ -0,0 +1,11 @@
# Copy your wordpress_logged_in_... cookie from browser DevTools → Storage → Cookies.
# Paste the full name=value pair below.
# wordpress_sec_... is the wp-admin cookie — irrelevant for read-only viewers.
# __cf_bm is a Cloudflare bot-management cookie — also not needed.
WP_LOGIN_COOKIE=wordpress_logged_in_<hash>=<value>

# PeerTube upload target
PEERTUBE_URL=https://your-peertube-instance.example
PEERTUBE_USER=admin
PEERTUBE_CHANNEL=your-channel-name
PEERTUBE_PASSWORD=your-password
.github/instructions/snyk_rules.instructions.md (new file, vendored, 14 lines)
@@ -0,0 +1,14 @@
---
alwaysApply: true
always_on: true
trigger: always_on
applyTo: "**"
description: Snyk Security At Inception
---

# Project security best practices

- Always run the snyk_code_scan tool for new first-party code generated in a Snyk-supported language.
- If any security issues are found in newly introduced or modified code or dependencies, attempt to fix them using the results context from Snyk.
- Rescan the code after fixing the issues to ensure they were fixed and that no new issues were introduced.
- Repeat this process until no new issues are found.
.gitignore (new file, vendored, 5 lines)
@@ -0,0 +1,5 @@
__pycache__/
downloads/
*.mp4
*.mp4.part
.env
.vscode/settings.json (new file, vendored, 4 lines)
@@ -0,0 +1,4 @@
{
  "snyk.advanced.organization": "512ef4a1-6034-4537-a391-9692d282122a",
  "snyk.advanced.autoSelectOrganization": true
}
README.md (new file, 142 lines)
@@ -0,0 +1,142 @@
# 𝒥𝒶𝒾𝓁𝒷𝒾𝓇𝒹𝓏-𝒹𝓁

Jailbirdz.com is an Arizona-based subscription video site publishing arrest and jail roleplay scenarios featuring women. This tool scrapes the member area, downloads the videos, and re-hosts them on a self-owned PeerTube instance.

> [!NOTE]
> This tool does not bypass authentication, modify the site, or intercept anything it isn't entitled to. A valid, paid membership is required. The scraper authenticates using your own session cookie and accesses only content your account can already view in a browser.
>
> Downloading content for private, personal use is permitted in many jurisdictions under private copy provisions (e.g., § 53 UrhG in Germany). You are responsible for determining whether this applies in yours.

## Requirements

- Python 3.10+
- `pip install -r requirements.txt`
- `playwright install firefox`

## Setup

```bash
cp .env.example .env
```

### WP_LOGIN_COOKIE

You need to be logged into jailbirdz.com in a browser. Then either:

**Option A — auto (recommended):** let `grab_cookie.py` read it from your browser and write it to `.env` automatically:

```bash
python grab_cookie.py             # tries Firefox, Chrome, Edge, Brave in order
python grab_cookie.py -b firefox  # or target a specific browser
```

> **Note:** Chrome and Edge on Windows 130+ require the script to run as Administrator due to App-Bound Encryption. Firefox works without elevated privileges.

**Option B — manual:** open `.env` and set `WP_LOGIN_COOKIE` yourself. Get the value from browser DevTools → Storage → Cookies while on jailbirdz.com — copy the full `name=value` of the `wordpress_logged_in_*` cookie.

### Other `.env` values

- `PEERTUBE_URL` — base URL of your PeerTube instance.
- `PEERTUBE_USER` — PeerTube username.
- `PEERTUBE_CHANNEL` — channel to upload to.
- `PEERTUBE_PASSWORD` — PeerTube password.

## Workflow

### 1. Scrape

Discovers all post URLs via the WordPress REST API, then visits each page with a headless Firefox browser to intercept video network requests (MP4, MOV, WebM, AVI, M4V).

```bash
python main.py
```

Results are written to `video_map.json`. Safe to re-run — already-scraped posts are skipped.
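For orientation, the schema of `video_map.json` looks roughly like the sketch below. The `title` and `videos` fields match what the scripts in this repo read; the post-URL key and the `description` field name are assumptions based on the data-files table, not confirmed output:

```python
import json

# Hypothetical video_map.json entry. "title" and "videos" are the fields
# download.py actually reads; the URL key and "description" are assumed.
video_map = {
    "https://www.jailbirdz.com/example-post/": {
        "title": "Example Post",
        "description": "Post body text",
        "videos": ["https://example.cloudfront.net/ExampleClip.mp4"],
    }
}

# Flatten to a list of video URLs, the way download.py's collect_urls() does.
urls = [u for entry in video_map.values() for u in entry.get("videos", [])]
print(json.dumps(urls, indent=2))
```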

### 2. Download

```bash
python download.py [options]

Options:
  -o, --output DIR    Download directory (default: downloads)
  -t, --titles        Name files by post title
  --original          Name files by original CloudFront filename (default)
  --reorganize        Rename existing files to match current naming mode
  -w, --workers N     Concurrent downloads (default: 4)
  -n, --dry-run       Print what would be downloaded
```

Resumes partial downloads. The chosen naming mode is saved to `.naming_mode` inside the output directory and persists across runs. Filenames that would clash are placed into subfolders.

### 3. Upload

```bash
python upload.py [options]

Options:
  -i, --input DIR         MP4 source directory (default: downloads)
  --url URL               PeerTube instance URL (or set PEERTUBE_URL)
  -U, --username NAME     PeerTube username (or set PEERTUBE_USER)
  -p, --password SECRET   PeerTube password (or set PEERTUBE_PASSWORD)
  -C, --channel NAME      Channel to upload to (or set PEERTUBE_CHANNEL)
  -b, --batch-size N      Videos to upload before waiting for transcoding (default: 1)
  --poll-interval SECS    State poll interval in seconds (default: 30)
  --skip-wait             Upload without waiting for transcoding
  --nsfw                  Mark videos as NSFW
  -n, --dry-run           Print what would be uploaded
```

Uploads in resumable 10 MB chunks. After each batch, waits for transcoding and object storage to complete before uploading the next batch — this prevents disk exhaustion on the PeerTube server. Videos already present on the channel (matched by name) are skipped. Progress is tracked in `.uploaded` inside the input directory.
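The `.uploaded` bookkeeping can be sketched in a few lines. This is an illustration only, assuming the newline-delimited format listed in the data-files table; `upload.py` itself is not shown in this excerpt:

```python
from pathlib import Path

def pending_uploads(input_dir, candidates):
    """Return candidates not yet recorded in input_dir/.uploaded."""
    marker = Path(input_dir) / ".uploaded"
    done = set(marker.read_text(encoding="utf-8").splitlines()) if marker.exists() else set()
    return [c for c in candidates if c not in done]

def mark_uploaded(input_dir, relpath):
    """Append a freshly uploaded file's relative path to the marker."""
    with open(Path(input_dir) / ".uploaded", "a", encoding="utf-8") as f:
        f.write(relpath + "\n")
```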

## Utilities

### Check for filename clashes

```bash
python check_clashes.py
```

Lists filenames that map to more than one source URL, with sizes.

### Estimate total download size

```bash
python total_size.py
```

Fetches `Content-Length` for every video URL in `video_map.json` and prints a size summary. Does not download anything.
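(`total_size.py` itself is not shown in this excerpt.) Conceptually, the summary step reduces the `{url: size_or_None}` mapping returned by `fetch_sizes` in `check_clashes.py` to a total plus a count of unknown sizes, roughly:

```python
def summarize_sizes(sizes):
    """Reduce {url: size_or_None} to (total_known_bytes, unknown_count).
    Illustrative sketch only, not the actual total_size.py code."""
    known = [s for s in sizes.values() if s is not None]
    return sum(known), len(sizes) - len(known)

total, unknown = summarize_sizes({"a.mp4": 100, "b.mp4": None, "c.mp4": 50})
print(total, unknown)  # → 150 1
```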

## Data files

| File             | Location         | Description                                                            |
| ---------------- | ---------------- | ---------------------------------------------------------------------- |
| `video_map.json` | project root     | Scraped post URLs mapped to titles, descriptions, and video URLs       |
| `.naming_mode`   | output directory | Saved filename mode (`original` or `title`)                            |
| `.uploaded`      | input directory  | Newline-delimited list of relative paths already uploaded to PeerTube  |

## FAQ

**Is this necessary?**
Yes, obviously.

**Is this project exactly what it looks like?**
Also yes.

**Why go to all this trouble?**
Middle school girls bullied me so hard I decided if you're going to be the weird kid anyway, you might as well commit to the bit and build highly specific pipelines for highly specific content.
Now it's their turn to get booked.
Checkmate, society.
No apologies.

**Why not just download everything manually?**
Dude.
Bondage fantasy.
Not pain play.
Huge difference.
1,300 clicks = torture.
Know your genres.

---

This is the most normal thing I've scripted this month.
check_clashes.py (new file, 159 lines)
@@ -0,0 +1,159 @@
"""Filename clash detection and shared URL utilities.

Importable functions:
    url_to_filename(url)                    - extract clean filename from a URL
    find_clashes(urls)                      - {filename: [urls]} for filenames with >1 source
    build_download_paths(urls, output_dir)  - {url: local_path} with clash resolution
    fmt_size(bytes)                         - human-readable size string
    get_remote_size(session, url)           - file size via HEAD without downloading
    fetch_sizes(urls, workers, on_progress) - bulk size lookup
    make_session()                          - requests.Session with required headers
    load_video_map()                        - load video_map.json, returns {} on missing/corrupt
"""

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse, unquote
import json

import requests

from config import BASE_URL

REFERER = f"{BASE_URL}/"
VIDEO_MAP_FILE = "video_map.json"
VIDEO_EXTS = {".mp4", ".mov", ".m4v", ".webm", ".avi"}


def load_video_map():
    if Path(VIDEO_MAP_FILE).exists():
        try:
            with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
                return json.load(f)
        except (json.JSONDecodeError, OSError):
            return {}
    return {}


def make_session():
    s = requests.Session()
    s.headers.update({"Referer": REFERER})
    return s


def fmt_size(b):
    for unit in ("B", "KB", "MB", "GB"):
        if b < 1024:
            return f"{b:.1f} {unit}"
        b /= 1024
    return f"{b:.1f} TB"


def url_to_filename(url):
    return unquote(PurePosixPath(urlparse(url).path).name)


def find_clashes(urls):
    # Case-insensitive grouping so that e.g. "DaisyArrest.mp4" and
    # "daisyarrest.mp4" are treated as a clash. This is required for
    # correctness on case-insensitive filesystems (NTFS, exFAT, macOS HFS+)
    # and harmless on case-sensitive ones (ext4) — the actual filenames on
    # disk keep their original casing; only the clash *detection* is folded.
    by_lower = defaultdict(list)
    for url in urls:
        by_lower[url_to_filename(url).lower()].append(url)
    return {url_to_filename(srcs[0]): srcs
            for srcs in by_lower.values() if len(srcs) > 1}


def _clash_subfolder(url):
    """Parent path segment used as disambiguator for clashing filenames."""
    parts = urlparse(url).path.rstrip("/").split("/")
    return unquote(parts[-2]) if len(parts) >= 2 else "unknown"


def build_download_paths(urls, output_dir):
    """Map each URL to a local file path. Flat layout; clashing names get a subfolder."""
    clashes = find_clashes(urls)
    clash_lower = {name.lower() for name in clashes}

    paths = {}
    for url in urls:
        filename = url_to_filename(url)
        if filename.lower() in clash_lower:
            paths[url] = Path(output_dir) / _clash_subfolder(url) / filename
        else:
            paths[url] = Path(output_dir) / filename
    return paths


def get_remote_size(session, url):
    try:
        r = session.head(url, allow_redirects=True, timeout=15)
        if r.status_code < 400 and "Content-Length" in r.headers:
            return int(r.headers["Content-Length"])
    except Exception:
        pass
    try:
        r = session.get(
            url, headers={"Range": "bytes=0-0"}, stream=True, timeout=15)
        r.close()
        cr = r.headers.get("Content-Range", "")
        if "/" in cr:
            return int(cr.split("/")[-1])
    except Exception:
        pass
    return None


def fetch_sizes(urls, workers=20, on_progress=None):
    """Return {url: size_or_None}. on_progress(done, total) called after each URL."""
    session = make_session()
    sizes = {}
    total = len(urls)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(get_remote_size, session, u): u for u in urls}
        done = 0
        for fut in as_completed(futures):
            sizes[futures[fut]] = fut.result()
            done += 1
            if on_progress:
                on_progress(done, total)

    return sizes


# --------------- CLI ---------------

def main():
    vm = load_video_map()
    urls = [u for entry in vm.values() for u in entry.get("videos", []) if u.startswith("http")]

    clashes = find_clashes(urls)

    print(f"Total URLs: {len(urls)}")
    by_name = defaultdict(list)
    for url in urls:
        by_name[url_to_filename(url)].append(url)
    print(f"Unique filenames: {len(by_name)}")

    if not clashes:
        print("\nNo filename clashes — every filename is unique.")
        return

    clash_urls = [u for srcs in clashes.values() for u in srcs]
    print(f"\n[+] Fetching file sizes for {len(clash_urls)} clashing URLs…")
    sizes = fetch_sizes(clash_urls)

    print(f"\n{len(clashes)} filename clash(es):\n")
    for name, srcs in sorted(clashes.items()):
        print(f"  {name} ({len(srcs)} sources)")
        for s in srcs:
            sz = sizes.get(s)
            tag = fmt_size(sz) if sz is not None else "unknown"
            print(f"    [{tag}] {s}")
        print()


if __name__ == "__main__":
    main()
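To make the case-insensitive clash behavior concrete, here is a condensed, self-contained rerun of the two functions above on made-up URLs:

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote

# Condensed from check_clashes.py above; the URLs are invented for illustration.
def url_to_filename(url):
    return unquote(PurePosixPath(urlparse(url).path).name)

def find_clashes(urls):
    by_lower = defaultdict(list)
    for url in urls:
        by_lower[url_to_filename(url).lower()].append(url)
    return {url_to_filename(srcs[0]): srcs
            for srcs in by_lower.values() if len(srcs) > 1}

urls = [
    "https://cdn.example/2023/DaisyArrest.mp4",
    "https://cdn.example/2024/daisyarrest.mp4",
    "https://cdn.example/2024/Unique.mp4",
]
print(sorted(find_clashes(urls)))  # → ['DaisyArrest.mp4']
```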
config.py (new file, 2 lines)
@@ -0,0 +1,2 @@
BASE_URL = "https://www.jailbirdz.com"
COOKIE_DOMAIN = "jailbirdz.com"  # rookiepy domain filter (no www)
download.py (new file, 408 lines)
@@ -0,0 +1,408 @@
"""Download videos from video_map.json with resume, integrity checks, and naming modes.

Usage:
    python download.py                    # downloads with remembered (or default original) naming
    python download.py --output /mnt/nas  # custom directory
    python download.py --titles           # switch to title-based filenames (remembers choice)
    python download.py --original         # switch back to original filenames (remembers choice)
    python download.py --reorganize       # rename existing files to match current mode
    python download.py --dry-run          # preview what would happen
    python download.py --workers 6        # override concurrency (default 4)
"""

import argparse
import json
from pathlib import Path
import re
import shutil
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed

from check_clashes import (
    make_session,
    fmt_size,
    url_to_filename,
    find_clashes,
    build_download_paths,
    fetch_sizes,
)

VIDEO_MAP_FILE = "video_map.json"
CHUNK_SIZE = 8 * 1024 * 1024
DEFAULT_OUTPUT = "downloads"
DEFAULT_WORKERS = 4
MODE_FILE = ".naming_mode"
MODE_ORIGINAL = "original"
MODE_TITLE = "title"


# ── Naming mode persistence ──────────────────────────────────────────

def read_mode(output_dir):
    p = Path(output_dir) / MODE_FILE
    if p.exists():
        return p.read_text().strip()
    return None


def write_mode(output_dir, mode):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    (Path(output_dir) / MODE_FILE).write_text(mode)


def resolve_mode(args):
    """Determine naming mode from CLI flags + saved marker. Returns mode string."""
    saved = read_mode(args.output)

    if args.titles and args.original:
        print("[!] Cannot use --titles and --original together.")
        raise SystemExit(1)

    if args.titles:
        return MODE_TITLE
    if args.original:
        return MODE_ORIGINAL
    if saved:
        return saved
    return MODE_ORIGINAL


# ── Filename helpers ─────────────────────────────────────────────────

def sanitize_filename(title, max_len=180):
    name = re.sub(r'[<>:"/\\|?*]', '', title)
    name = re.sub(r'\s+', ' ', name).strip().rstrip('.')
    return name[:max_len].rstrip() if len(name) > max_len else name


def build_title_paths(urls, url_to_title, output_dir):
    name_to_urls = defaultdict(list)
    url_to_base = {}

    for url in urls:
        title = url_to_title.get(url)
        ext = Path(url_to_filename(url)).suffix or ".mp4"
        base = sanitize_filename(title) if title else Path(url_to_filename(url)).stem
        url_to_base[url] = (base, ext)
        name_to_urls[base + ext].append(url)

    paths = {}
    for url in urls:
        base, ext = url_to_base[url]
        full = base + ext
        if len(name_to_urls[full]) > 1:
            slug = url_to_filename(url).rsplit('.', 1)[0]
            paths[url] = Path(output_dir) / f"{base} [{slug}]{ext}"
        else:
            paths[url] = Path(output_dir) / full
    return paths


def get_paths_for_mode(mode, urls, video_map, output_dir):
    if mode == MODE_TITLE:
        url_title = build_url_title_map(video_map)
        return build_title_paths(urls, url_title, output_dir)
    return build_download_paths(urls, output_dir)


# ── Reorganize ───────────────────────────────────────────────────────

def reorganize(urls, video_map, output_dir, target_mode, dry_run=False):
    """Rename existing files from one naming scheme to another."""
    other_mode = MODE_TITLE if target_mode == MODE_ORIGINAL else MODE_ORIGINAL
    old_paths = get_paths_for_mode(other_mode, urls, video_map, output_dir)
    new_paths = get_paths_for_mode(target_mode, urls, video_map, output_dir)

    moves = []
    for url in urls:
        old = old_paths[url]
        new = new_paths[url]
        if old == new:
            continue
        if old.exists() and not new.exists():
            moves.append((old, new))
        # also handle .part files
        old_part = old.parent / (old.name + ".part")
        new_part = new.parent / (new.name + ".part")
        if old_part.exists() and not new_part.exists():
            moves.append((old_part, new_part))

    if not moves:
        print("[✓] Nothing to reorganize — files already match the target mode.")
        return

    print(f"[+] {len(moves)} file(s) to rename ({other_mode} → {target_mode}):\n")

    for old, new in moves:
        old_rel = old.relative_to(output_dir)
        new_rel = new.relative_to(output_dir)
        if dry_run:
            print(f"  [dry-run] {old_rel} → {new_rel}")
        else:
            new.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(old, new)
            print(f"  ✓ {old_rel} → {new_rel}")

    if not dry_run:
        # Clean up empty directories left behind
        output_path = Path(output_dir)
        for old, _ in moves:
            d = old.parent
            while d != output_path:
                try:
                    d.rmdir()
                except OSError:
                    break
                d = d.parent

        write_mode(output_dir, target_mode)
        print(f"\n[✓] Reorganized. Mode saved: {target_mode}")
    else:
        print(f"\n[dry-run] Would rename {len(moves)} files. No changes made.")


# ── Download ─────────────────────────────────────────────────────────

def download_one(session, url, dest, expected_size):
    dest = Path(dest)
    part = dest.parent / (dest.name + ".part")
    dest.parent.mkdir(parents=True, exist_ok=True)

    if dest.exists():
        local = dest.stat().st_size
        if expected_size and local == expected_size:
            return "ok", 0
        if expected_size and local != expected_size:
            dest.unlink()

    existing = part.stat().st_size if part.exists() else 0
    headers = {}
    if existing and expected_size and existing < expected_size:
        headers["Range"] = f"bytes={existing}-"

    try:
        r = session.get(url, headers=headers, stream=True, timeout=60)

        if r.status_code == 416:
            part.rename(dest)
            return "ok", 0

        r.raise_for_status()
    except Exception as e:
        return f"error: {e}", 0

    mode = "ab" if headers.get("Range") else "wb"
    if mode == "wb":
        existing = 0

    written = 0
    try:
        with open(part, mode) as f:
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)
                written += len(chunk)
    except Exception as e:
        return f"error: {e}", written

    final_size = existing + written
    if expected_size and final_size != expected_size:
        return "size_mismatch", written

    part.rename(dest)
    return "ok", written


# ── Data loading ─────────────────────────────────────────────────────

def load_video_map():
    with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
        return json.load(f)


def _is_valid_url(url):
    return url.startswith(
        "http") and "<" not in url and ">" not in url and " href=" not in url


def collect_urls(video_map):
    urls, seen, skipped = [], set(), 0
    for entry in video_map.values():
        for video_url in entry.get("videos", []):
            if video_url in seen:
                continue
            seen.add(video_url)
            if _is_valid_url(video_url):
                urls.append(video_url)
            else:
                skipped += 1
    if skipped:
        print(f"[!] Skipped {skipped} malformed URL(s)")
    return urls


def build_url_title_map(video_map):
    url_title = {}
    for entry in video_map.values():
        title = entry.get("title", "")
        for video_url in entry.get("videos", []):
            if video_url not in url_title:
                url_title[video_url] = title
    return url_title


# ── Main ─────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(
        description="Download videos from video_map.json")
    parser.add_argument("--output", "-o", default=DEFAULT_OUTPUT,
                        help=f"Download directory (default: {DEFAULT_OUTPUT})")

    naming = parser.add_mutually_exclusive_group()
    naming.add_argument("--titles", "-t", action="store_true",
                        help="Use title-based filenames (saved as default for this directory)")
    naming.add_argument("--original", action="store_true",
                        help="Use original CloudFront filenames (saved as default for this directory)")

    parser.add_argument("--reorganize", action="store_true",
                        help="Rename existing files to match the current naming mode")
    parser.add_argument("--dry-run", "-n", action="store_true",
                        help="Preview without making changes")
    parser.add_argument("--workers", "-w", type=int, default=DEFAULT_WORKERS,
                        help=f"Concurrent downloads (default: {DEFAULT_WORKERS})")
    args = parser.parse_args()

    video_map = load_video_map()
    urls = collect_urls(video_map)
    mode = resolve_mode(args)

    saved = read_mode(args.output)
    mode_changed = saved is not None and saved != mode

    print(f"[+] {len(urls)} MP4 URLs from {VIDEO_MAP_FILE}")
    print(f"[+] Naming mode: {mode}" + (" (changed!)" if mode_changed else ""))

    # Handle reorganize
    if args.reorganize or mode_changed:
        if mode_changed and not args.reorganize:
            print(f"\n[!] Mode changed from '{saved}' to '{mode}'.")
            print(
                "    Use --reorganize to rename existing files, or --dry-run to preview.")
            print("    Refusing to download until existing files are reorganized.")
            return
        reorganize(urls, video_map, args.output, mode, dry_run=args.dry_run)
        if args.dry_run or args.reorganize:
            return

    # Save mode
    if not args.dry_run:
        write_mode(args.output, mode)

    paths = get_paths_for_mode(mode, urls, video_map, args.output)

    clashes = find_clashes(urls)
    if clashes:
        print(
            f"[+] {len(clashes)} filename clash(es) resolved with subfolders/suffixes")

    already = [u for u in urls if paths[u].exists()]
    pending = [u for u in urls if not paths[u].exists()]

    print(f"[+] Already downloaded: {len(already)}")
    print(f"[+] To download: {len(pending)}")

    if not pending:
        print("\n[✓] Everything is already downloaded.")
        return

    if args.dry_run:
        print(
            f"\n[dry-run] Would download {len(pending)} files to {args.output}/")
        for url in pending[:20]:
            print(f"  → {paths[url].name}")
        if len(pending) > 20:
            print(f"  … and {len(pending) - 20} more")
        return

    print("\n[+] Fetching remote file sizes…")
    session = make_session()
    remote_sizes = fetch_sizes(pending, workers=20)

    sized = {u: s for u, s in remote_sizes.items() if s is not None}
    total_bytes = sum(sized.values())
    print(
        f"[+] Download size: {fmt_size(total_bytes)} across {len(pending)} files")

    if already:
        print(f"[+] Verifying {len(already)} existing files…")
        already_sizes = fetch_sizes(already, workers=20)

        mismatched = 0
        for url in already:
            dest = paths[url]
            local = dest.stat().st_size
            remote = already_sizes.get(url)
            if remote and local != remote:
                mismatched += 1
                print(f"[!] Size mismatch: {dest.name} "
                      f"(local {fmt_size(local)} vs remote {fmt_size(remote)})")
                pending.append(url)
                remote_sizes[url] = remote

        if mismatched:
            print(
                f"[!] {mismatched} file(s) will be re-downloaded due to size mismatch")

    print(f"\n[⚡] Downloading with {args.workers} threads…\n")

    completed = 0
    failed = []
    total_written = 0
    total = len(pending)
    interrupted = False

    def do_download(url):
        dest = paths[url]
        expected = remote_sizes.get(url)
        return url, download_one(session, url, dest, expected)

    try:
        with ThreadPoolExecutor(max_workers=args.workers) as pool:
            futures = {pool.submit(do_download, u): u for u in pending}
            for fut in as_completed(futures):
                url, (status, written) = fut.result()
                total_written += written
                completed += 1
                name = paths[url].name

                if status == "ok" and written > 0:
                    print(
                        f"  [{completed}/{total}] ✓ {name} ({fmt_size(written)})")
                elif status == "ok":
                    print(
                        f"  [{completed}/{total}] ✓ {name} (already complete)")
                elif status == "size_mismatch":
                    print(f"  [{completed}/{total}] ⚠ {name} (size mismatch)")
                    failed.append(url)
                else:
                    print(f"  [{completed}/{total}] ✗ {name} ({status})")
                    failed.append(url)
    except KeyboardInterrupt:
        interrupted = True
        pool.shutdown(wait=False, cancel_futures=True)
        print("\n\n[⏸] Interrupted! Partial downloads saved as .part files.")

    print(f"\n{'=' * 50}")
    print(f"  Downloaded: {fmt_size(total_written)}")
    print(f"  Completed:  {completed}/{total}")
    if failed:
        print(f"  Failed:     {len(failed)} (re-run to retry)")
    if interrupted:
        print("  Paused — re-run to resume.")
    elif not failed:
        print("  All done!")
    print(f"{'=' * 50}")


if __name__ == "__main__":
    main()
grab_cookie.py (new file, 114 lines)
@@ -0,0 +1,114 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
grab_cookie.py — read the WordPress login cookie from an
|
||||
installed browser and write it to .env as WP_LOGIN_COOKIE=name=value.
|
||||
|
||||
Usage:
|
||||
python grab_cookie.py # tries Firefox, Chrome, Edge, Brave
|
||||
python grab_cookie.py --browser firefox # explicit browser
|
||||
"""
|
||||
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from config import COOKIE_DOMAIN
|
||||
|
||||
ENV_FILE = Path(".env")
|
||||
ENV_KEY = "WP_LOGIN_COOKIE"
|
||||
COOKIE_PREFIX = "wordpress_logged_in_"
|
||||
|
||||
BROWSER_NAMES = ["firefox", "chrome", "edge", "brave"]
|
||||
|
||||
|
||||
def find_cookie(browser_name):
|
||||
"""Return (name, value) for the wordpress_logged_in_* cookie, or (None, None)."""
|
||||
try:
|
||||
import rookiepy
|
||||
except ImportError:
|
||||
raise ImportError("rookiepy not installed — run: pip install rookiepy")
|
||||
|
||||
fn = getattr(rookiepy, browser_name, None)
|
||||
if fn is None:
|
||||
raise ValueError(f"rookiepy does not support '{browser_name}'.")
|
||||
|
||||
try:
|
||||
cookies = fn([COOKIE_DOMAIN])
|
||||
except PermissionError:
|
||||
raise PermissionError(
|
||||
f"Permission denied reading {browser_name} cookies.\n"
|
||||
" Close the browser, or on Windows run as Administrator for Chrome/Edge."
|
||||
)
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Could not read {browser_name} cookies: {e}")
|
||||
|
||||
for c in cookies:
|
||||
if c.get("name", "").startswith(COOKIE_PREFIX):
|
||||
return c["name"], c["value"]
|
||||
|
||||
return None, None
|
||||
|
||||
|
||||
def update_env(name, value):
|
||||
"""Write WP_LOGIN_COOKIE=name=value into .env, replacing any existing line."""
|
||||
new_line = f"{ENV_KEY}={name}={value}\n"
|
||||
|
||||
if ENV_FILE.exists():
|
||||
text = ENV_FILE.read_text(encoding="utf-8")
|
        lines = text.splitlines(keepends=True)
        for i, line in enumerate(lines):
            if line.startswith(f"{ENV_KEY}=") or line.strip() == ENV_KEY:
                lines[i] = new_line
                ENV_FILE.write_text("".join(lines), encoding="utf-8")
                return "updated"
        # Key not present — append
        if text and not text.endswith("\n"):
            text += "\n"
        ENV_FILE.write_text(text + new_line, encoding="utf-8")
        return "appended"
    else:
        ENV_FILE.write_text(new_line, encoding="utf-8")
        return "created"


def main():
    parser = argparse.ArgumentParser(
        description=f"Copy the {COOKIE_DOMAIN} login cookie from your browser into .env."
    )
    parser.add_argument(
        "--browser", "-b",
        choices=BROWSER_NAMES,
        metavar="BROWSER",
        help=f"Browser to read from: {', '.join(BROWSER_NAMES)} (default: try all in order)",
    )
    args = parser.parse_args()

    order = [args.browser] if args.browser else BROWSER_NAMES

    cookie_name = cookie_value = None
    for browser in order:
        print(f"[…] Trying {browser}…")
        try:
            cookie_name, cookie_value = find_cookie(browser)
        except ImportError as e:
            raise SystemExit(f"[!] {e}")
        except (ValueError, PermissionError, RuntimeError) as e:
            print(f"[!] {e}")
            continue

        if cookie_name:
            print(f"[+] Found in {browser}: {cookie_name}")
            break
        print(f"    No {COOKIE_PREFIX}* cookie found in {browser}.")

    if not cookie_name:
        raise SystemExit(
            f"\n[!] No {COOKIE_PREFIX}* cookie found in any browser.\n"
            f"    Make sure you are logged into {COOKIE_DOMAIN}, then re-run.\n"
            "    Or set WP_LOGIN_COOKIE manually in .env — see .env.example."
        )

    action = update_env(cookie_name, cookie_value)
    print(f"[✓] {ENV_KEY} {action} in {ENV_FILE}.")


if __name__ == "__main__":
    main()
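The update-or-append behaviour of `update_env` above can be exercised in isolation. This is a stand-alone sketch of the same logic (the key name mirrors the script's `ENV_KEY`; the function name and file handling here are simplified for illustration):

```python
from pathlib import Path
import tempfile

ENV_KEY = "WP_LOGIN_COOKIE"  # assumed key name, mirroring the script


def upsert_env_line(env_file: Path, new_line: str) -> str:
    """Replace an existing ENV_KEY line, append if the key is absent,
    or create the file from scratch. Returns what happened."""
    if env_file.exists():
        text = env_file.read_text(encoding="utf-8")
        lines = text.splitlines(keepends=True)
        for i, line in enumerate(lines):
            if line.startswith(f"{ENV_KEY}=") or line.strip() == ENV_KEY:
                lines[i] = new_line
                env_file.write_text("".join(lines), encoding="utf-8")
                return "updated"
        # key not present: append, ensuring a trailing newline first
        if text and not text.endswith("\n"):
            text += "\n"
        env_file.write_text(text + new_line, encoding="utf-8")
        return "appended"
    env_file.write_text(new_line, encoding="utf-8")
    return "created"
```

Run against a temp directory, the three outcomes fall out of the file's prior state: missing file, file without the key, file with the key.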
467  main.py  Normal file
@@ -0,0 +1,467 @@
import re
import json
import os
import time
import signal
import asyncio
import tempfile
import requests
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from dotenv import load_dotenv
from playwright.async_api import async_playwright
from check_clashes import VIDEO_EXTS
from config import BASE_URL

load_dotenv()


def _is_video_url(url):
    """True if `url` ends with a recognised video extension (case-insensitive, path only)."""
    return PurePosixPath(urlparse(url).path).suffix.lower() in VIDEO_EXTS


WP_API = f"{BASE_URL}/wp-json/wp/v2"

SKIP_TYPES = {
    "attachment", "nav_menu_item", "wp_block", "wp_template",
    "wp_template_part", "wp_global_styles", "wp_navigation",
    "wp_font_family", "wp_font_face",
}

VIDEO_MAP_FILE = "video_map.json"
MAX_WORKERS = 4

API_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:147.0) Gecko/20100101 Firefox/147.0",
    "Accept": "application/json",
    "Referer": f"{BASE_URL}/",
}


def _get_login_cookie():
    raw = os.environ.get("WP_LOGIN_COOKIE", "").strip()  # strip accidental whitespace
    if not raw:
        raise RuntimeError(
            "WP_LOGIN_COOKIE not set. Copy it from your browser into .env — see .env.example.")
    name, _, value = raw.partition("=")
    if not value:
        raise RuntimeError(
            "WP_LOGIN_COOKIE looks malformed (no '=' found). Expected: name=value")
    if not name.startswith("wordpress_logged_in_"):
        raise RuntimeError(
            "WP_LOGIN_COOKIE doesn't look right — expected a wordpress_logged_in_... cookie.")
    return name, value


def discover_content_types(session):
    """Query /wp-json/wp/v2/types and return a list of (name, rest_base, type_slug) for content types worth scraping."""
    r = session.get(f"{WP_API}/types", timeout=30)
    r.raise_for_status()
    types = r.json()

    targets = []
    for type_slug, info in types.items():
        if type_slug in SKIP_TYPES:
            continue
        rest_base = info.get("rest_base")
        name = info.get("name", type_slug)
        if rest_base:
            targets.append((name, rest_base, type_slug))
    return targets
def fetch_all_posts_for_type(session, type_name, rest_base, type_slug):
    """Paginate one content type and return (url, title, description) tuples.

    Uses the `link` field when available; falls back to building from slug."""
    url_prefix = type_slug.replace("_", "-")
    results = []
    page = 1

    while True:
        r = session.get(
            f"{WP_API}/{rest_base}",
            params={"per_page": 100, "page": page},
            timeout=30,
        )
        if r.status_code == 400 or not r.ok:
            break
        data = r.json()
        if not data:
            break
        for post in data:
            link = post.get("link", "")
            if not link.startswith("http"):
                slug = post.get("slug")
                if slug:
                    link = f"{BASE_URL}/{url_prefix}/{slug}/"
                else:
                    continue
            title_obj = post.get("title", {})
            title = title_obj.get("rendered", "") if isinstance(
                title_obj, dict) else str(title_obj)
            content_obj = post.get("content", {})
            content_html = content_obj.get(
                "rendered", "") if isinstance(content_obj, dict) else ""
            description = html_to_text(content_html) if content_html else ""
            results.append((link, title, description))
        print(f"    {type_name} page {page}: {len(data)} items")
        page += 1

    return results


def fetch_post_urls_from_api(headers):
    """Auto-discover all content types via the WP REST API and collect every post URL.

    Also builds video_map.json with titles pre-populated."""
    print("[+] video_map.json empty or missing — discovering content types from REST API…")
    session = requests.Session()
    session.headers.update(headers)

    targets = discover_content_types(session)
    print(
        f"[+] Found {len(targets)} content types: {', '.join(name for name, _, _ in targets)}\n")

    all_results = []
    for type_name, rest_base, type_slug in targets:
        type_results = fetch_all_posts_for_type(
            session, type_name, rest_base, type_slug)
        all_results.extend(type_results)

    seen = set()
    deduped_urls = []
    video_map = load_video_map()

    for url, title, description in all_results:
        if url not in seen and url.startswith("http"):
            seen.add(url)
            deduped_urls.append(url)
            if url not in video_map:
                video_map[url] = {"title": title,
                                  "description": description, "videos": []}
            else:
                if not video_map[url].get("title"):
                    video_map[url]["title"] = title
                if not video_map[url].get("description"):
                    video_map[url]["description"] = description

    save_video_map(video_map)
    print(
        f"\n[+] Discovered {len(deduped_urls)} unique URLs → saved to {VIDEO_MAP_FILE}")
    print(
        f"[+] Pre-populated {len(video_map)} entries in {VIDEO_MAP_FILE}")
    return deduped_urls
def fetch_metadata_from_api(video_map, urls, headers):
    """Populate missing titles and descriptions in video_map from the REST API."""
    missing = [u for u in urls
               if u not in video_map
               or not video_map[u].get("title")
               or not video_map[u].get("description")]
    if not missing:
        return

    print(f"[+] Fetching metadata from REST API for {len(missing)} posts…")
    session = requests.Session()
    session.headers.update(headers)

    targets = discover_content_types(session)

    for type_name, rest_base, type_slug in targets:
        type_results = fetch_all_posts_for_type(
            session, type_name, rest_base, type_slug)
        for url, title, description in type_results:
            if url in video_map:
                if not video_map[url].get("title"):
                    video_map[url]["title"] = title
                if not video_map[url].get("description"):
                    video_map[url]["description"] = description
            else:
                video_map[url] = {"title": title,
                                  "description": description, "videos": []}

    save_video_map(video_map)
    populated_t = sum(1 for u in urls if video_map.get(u, {}).get("title"))
    populated_d = sum(1 for u in urls if video_map.get(
        u, {}).get("description"))
    print(f"[+] Titles populated: {populated_t}/{len(urls)}")
    print(f"[+] Descriptions populated: {populated_d}/{len(urls)}")


def load_post_urls(headers):
    vm = load_video_map()
    if vm:
        print(f"[+] {VIDEO_MAP_FILE} found — loading {len(vm)} post URLs.")
        return list(vm.keys())
    return fetch_post_urls_from_api(headers)


def html_to_text(html_str):
    """Strip HTML tags, decode entities, and collapse whitespace into clean plain text."""
    import html
    text = re.sub(r'<br\s*/?>', '\n', html_str)
    text = text.replace('</p>', '\n\n')
    text = re.sub(r'<[^>]+>', '', text)
    text = html.unescape(text)
    lines = [line.strip() for line in text.splitlines()]
    text = '\n'.join(lines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


def extract_mp4_from_html(html):
    candidates = re.findall(r'https?://[^\s"\'<>]+', html)
    return [u for u in candidates if _is_video_url(u)]


def extract_title_from_html(html):
    m = re.search(
        r'<h1[^>]*class="entry-title"[^>]*>(.*?)</h1>', html, re.DOTALL)
    if m:
        title = re.sub(r'<[^>]+>', '', m.group(1)).strip()
        return title
    m = re.search(r'<title>(.*?)(?:\s*[-–|].*)?</title>', html, re.DOTALL)
    if m:
        return m.group(1).strip()
    return None


def load_video_map():
    if Path(VIDEO_MAP_FILE).exists():
        try:
            with open(VIDEO_MAP_FILE, encoding="utf-8") as f:
                return json.load(f)
        except (json.JSONDecodeError, OSError):
            return {}
    return {}


def save_video_map(video_map):
    fd, tmp_path = tempfile.mkstemp(dir=Path(VIDEO_MAP_FILE).resolve().parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(video_map, f, indent=2, ensure_ascii=False)
        Path(tmp_path).replace(VIDEO_MAP_FILE)
    except Exception:
        try:
            Path(tmp_path).unlink()
        except OSError:
            pass
        raise


def _expects_video(url):
    return "/pinkcuffs-videos/" in url
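The tag-stripping pipeline in `html_to_text` above is pure string processing, so its behaviour can be checked in isolation. This stand-alone copy repeats the same regex steps (no project imports) to show the intended normalisation:

```python
import html
import re


def html_to_text(html_str):
    """Strip tags, decode entities, collapse blank runs (mirror of the helper above)."""
    text = re.sub(r'<br\s*/?>', '\n', html_str)   # <br> and <br/> become newlines
    text = text.replace('</p>', '\n\n')           # paragraph ends become blank lines
    text = re.sub(r'<[^>]+>', '', text)           # drop all remaining tags
    text = html.unescape(text)                    # &amp; -> &, etc.
    lines = [line.strip() for line in text.splitlines()]
    text = '\n'.join(lines)
    text = re.sub(r'\n{3,}', '\n\n', text)        # never more than one blank line
    return text.strip()
```

For example, `html_to_text("<p>Hello &amp; welcome</p><p>Bye</p>")` yields `"Hello & welcome\n\nBye"`: one blank line between paragraphs, entities decoded, no stray tags.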
MAX_RETRIES = 2


async def worker(worker_id, queue, context, known,
                 total, retry_counts, video_map, map_lock, shutdown_event):
    page = await context.new_page()
    video_hits = set()

    page.on("response", lambda resp: video_hits.add(resp.url) if _is_video_url(resp.url) else None)

    try:
        while not shutdown_event.is_set():
            try:
                idx, url = queue.get_nowait()
            except asyncio.QueueEmpty:
                break

            attempt = retry_counts.get(idx, 0)
            label = f" (retry {attempt}/{MAX_RETRIES})" if attempt else ""
            print(f"[W{worker_id}] ({idx + 1}/{total}) {url}{label}")

            try:
                await page.goto(url, wait_until="networkidle", timeout=60000)
            except Exception as e:
                print(f"[W{worker_id}] Navigation error: {e}")
                if _expects_video(url) and attempt < MAX_RETRIES:
                    retry_counts[idx] = attempt + 1
                    queue.put_nowait((idx, url))
                    print(f"[W{worker_id}] Re-queued for retry.")
                elif not _expects_video(url):
                    async with map_lock:
                        entry = video_map.get(url, {})
                        entry["scraped_at"] = int(time.time())
                        video_map[url] = entry
                        save_video_map(video_map)
                else:
                    print(
                        f"[W{worker_id}] Still failing after {MAX_RETRIES} retries — will retry next run.")
                continue

            await asyncio.sleep(1.5)
            html = await page.content()
            title = extract_title_from_html(html)
            html_videos = extract_mp4_from_html(html)
            found = set(html_videos) | set(video_hits)
            video_hits.clear()

            all_videos = [m for m in found if m not in (
                f"{BASE_URL}/wp-content/plugins/easy-video-player/lib/blank.mp4",
            )]

            async with map_lock:
                new_found = found - known
                if new_found:
                    print(f"[W{worker_id}] Found {len(new_found)} new video URLs")
                    known.update(new_found)
                elif all_videos:
                    print(
                        f"[W{worker_id}] {len(all_videos)} video(s) already known — skipping write.")
                else:
                    print(f"[W{worker_id}] No video found on page.")

                entry = video_map.get(url, {})
                if title:
                    entry["title"] = title
                existing_videos = set(entry.get("videos", []))
                existing_videos.update(all_videos)
                entry["videos"] = sorted(existing_videos)
                mark_done = bool(all_videos) or not _expects_video(url)
                if mark_done:
                    entry["scraped_at"] = int(time.time())
                video_map[url] = entry
                save_video_map(video_map)

            if not mark_done:
                if attempt < MAX_RETRIES:
                    retry_counts[idx] = attempt + 1
                    queue.put_nowait((idx, url))
                    print(
                        f"[W{worker_id}] Re-queued for retry ({attempt + 1}/{MAX_RETRIES}).")
                else:
                    print(
                        f"[W{worker_id}] No video after {MAX_RETRIES} retries — will retry next run.")
    finally:
        await page.close()
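The worker's requeue-with-retry pattern (a failed item goes back on the queue until its retry budget is spent) can be sketched without Playwright. The flaky `should_fail` predicate below is a stand-in for page navigation, not part of the scraper:

```python
import asyncio

MAX_RETRIES = 2


async def drain(queue, retry_counts, results, should_fail):
    """Pull items off the queue; re-queue failures until MAX_RETRIES is hit."""
    while True:
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        attempt = retry_counts.get(item, 0)
        if should_fail(item, attempt):
            if attempt < MAX_RETRIES:
                retry_counts[item] = attempt + 1
                queue.put_nowait(item)   # try again on a later pass
            else:
                results[item] = "gave_up"
        else:
            results[item] = "ok"


async def demo():
    queue = asyncio.Queue()
    for item in ("good", "flaky", "broken"):
        queue.put_nowait(item)
    results, retry_counts = {}, {}

    # "flaky" fails once then succeeds; "broken" never succeeds
    def should_fail(item, attempt):
        return (item == "flaky" and attempt == 0) or item == "broken"

    await drain(queue, retry_counts, results, should_fail)
    return results


results = asyncio.run(demo())
```

With MAX_RETRIES = 2 the flaky item recovers on its first retry and the broken one is abandoned after its budget runs out, which is exactly the "will retry next run" path in the worker.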
async def run():
    shutdown_event = asyncio.Event()
    loop = asyncio.get_running_loop()

    def _handle_shutdown(signum, _frame):
        print(f"\n[!] Signal {signum} received — finishing active pages then exiting…")
        loop.call_soon_threadsafe(shutdown_event.set)

    signal.signal(signal.SIGINT, _handle_shutdown)
    signal.signal(signal.SIGTERM, _handle_shutdown)

    try:
        cookie_name, cookie_value = _get_login_cookie()
        req_headers = {
            **API_HEADERS,
            "Cookie": f"{cookie_name}={cookie_value}; eav-age-verified=1",
        }

        urls = load_post_urls(req_headers)

        video_map = load_video_map()
        if any(u not in video_map
               or not video_map[u].get("title")
               or not video_map[u].get("description")
               for u in urls if _expects_video(u)):
            fetch_metadata_from_api(video_map, urls, req_headers)

        known = {u for entry in video_map.values() for u in entry.get("videos", [])}

        total = len(urls)
        pending = []
        needs_map = 0
        for i, u in enumerate(urls):
            entry = video_map.get(u, {})
            if not entry.get("scraped_at"):
                pending.append((i, u))
            elif _expects_video(u) and not entry.get("videos"):
                pending.append((i, u))
                needs_map += 1

        done_count = sum(1 for v in video_map.values() if v.get("scraped_at"))
        print(f"[+] Loaded {total} post URLs.")
        print(f"[+] Already have {len(known)} video URLs mapped.")
        print(f"[+] Video map: {len(video_map)} entries in {VIDEO_MAP_FILE}")
        if done_count:
            remaining_new = len(pending) - needs_map
            print(
                f"[↻] Resuming: {done_count} done, {remaining_new} new + {needs_map} needing map data.")
        if not pending:
            print("[✓] All URLs already processed and mapped.")
            return

        print(
            f"[⚡] Running with {min(MAX_WORKERS, len(pending))} concurrent workers.\n")

        queue = asyncio.Queue()
        for item in pending:
            queue.put_nowait(item)

        map_lock = asyncio.Lock()
        retry_counts = {}

        async with async_playwright() as p:
            browser = await p.firefox.launch(headless=True)
            context = await browser.new_context()

            _cookie_domain = urlparse(BASE_URL).netloc
            site_cookies = [
                {
                    "name": cookie_name,
                    "value": cookie_value,
                    "domain": _cookie_domain,
                    "path": "/",
                    "httpOnly": True,
                    "secure": True,
                    "sameSite": "None",
                },
                {
                    "name": "eav-age-verified",
                    "value": "1",
                    "domain": _cookie_domain,
                    "path": "/",
                },
            ]

            await context.add_cookies(site_cookies)

            num_workers = min(MAX_WORKERS, len(pending))
            workers = [
                asyncio.create_task(
                    worker(i, queue, context, known,
                           total, retry_counts, video_map, map_lock, shutdown_event)
                )
                for i in range(num_workers)
            ]

            await asyncio.gather(*workers)
            await browser.close()

        mapped = sum(1 for v in video_map.values() if v.get("videos"))
        print(
            f"\n[+] Video map: {mapped} posts with videos, {len(video_map)} total entries.")

        if not shutdown_event.is_set():
            print(f"[✓] Completed. Full map in {VIDEO_MAP_FILE}")
        else:
            done = sum(1 for v in video_map.values() if v.get("scraped_at"))
            print(f"[⏸] Paused — {done}/{total} done. Run again to resume.")
    finally:
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)


def main():
    try:
        asyncio.run(run())
    except KeyboardInterrupt:
        print("\n[!] Interrupted. Run again to resume.")
    except RuntimeError as e:
        raise SystemExit(f"[!] {e}")


if __name__ == "__main__":
    main()
21252  openapi.json  Normal file
File diff suppressed because one or more lines are too long
4  requirements.txt  Normal file
@@ -0,0 +1,4 @@
playwright==1.58.0
python-dotenv==1.2.1
requests==2.32.5
rookiepy==0.5.6
61  total_size.py  Normal file
@@ -0,0 +1,61 @@
"""Calculate total disk space needed to download all videos.

Importable function:
    summarize_sizes(sizes) - return dict with total, smallest, largest, average, failed
"""

from check_clashes import fmt_size, fetch_sizes, load_video_map, VIDEO_MAP_FILE


def summarize_sizes(sizes):
    """Given {url: size_or_None}, return a stats dict."""
    known = {u: s for u, s in sizes.items() if s is not None}
    failed = [u for u, s in sizes.items() if s is None]
    if not known:
        return {"sized": 0, "total": len(sizes), "total_bytes": 0,
                "smallest": 0, "largest": 0, "average": 0, "failed": failed}
    total_bytes = sum(known.values())
    return {
        "sized": len(known),
        "total": len(sizes),
        "total_bytes": total_bytes,
        "smallest": min(known.values()),
        "largest": max(known.values()),
        "average": total_bytes // len(known),
        "failed": failed,
    }
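`summarize_sizes` above only aggregates a `{url: size_or_None}` mapping, so its contract is easy to pin down with concrete numbers. A stand-alone copy of the helper behaves like this:

```python
def summarize_sizes(sizes):
    """Given {url: size_or_None}, return a stats dict (mirror of the helper above)."""
    known = {u: s for u, s in sizes.items() if s is not None}
    failed = [u for u, s in sizes.items() if s is None]
    if not known:
        return {"sized": 0, "total": len(sizes), "total_bytes": 0,
                "smallest": 0, "largest": 0, "average": 0, "failed": failed}
    total_bytes = sum(known.values())
    return {
        "sized": len(known),
        "total": len(sizes),
        "total_bytes": total_bytes,
        "smallest": min(known.values()),
        "largest": max(known.values()),
        "average": total_bytes // len(known),   # integer average, floor division
        "failed": failed,
    }


# two sized URLs plus one the HEAD request could not size
stats = summarize_sizes({"a": 100, "b": 300, "c": None})
```

Unsizable URLs (`None` values) are excluded from every aggregate and reported separately in `failed`, which is why the CLI can print the totals and the failure list independently.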
# --------------- CLI ---------------

def _progress(done, total):
    if done % 200 == 0 or done == total:
        print(f"    {done}/{total}")


def main():
    vm = load_video_map()
    urls = [u for entry in vm.values() for u in entry.get("videos", []) if u.startswith("http")]

    print(f"[+] {len(urls)} URLs in {VIDEO_MAP_FILE}")
    print("[+] Fetching file sizes (20 threads)…\n")

    sizes = fetch_sizes(urls, workers=20, on_progress=_progress)
    stats = summarize_sizes(sizes)

    print(f"\n{'=' * 45}")
    print(f"  Sized:    {stats['sized']}/{stats['total']} files")
    print(f"  Total:    {fmt_size(stats['total_bytes'])}")
    print(f"  Smallest: {fmt_size(stats['smallest'])}")
    print(f"  Largest:  {fmt_size(stats['largest'])}")
    print(f"  Average:  {fmt_size(stats['average'])}")
    print(f"{'=' * 45}")

    if stats["failed"]:
        print(f"\n[!] {len(stats['failed'])} URL(s) could not be sized:")
        for u in stats["failed"]:
            print(f"    {u}")


if __name__ == "__main__":
    main()
603  upload.py  Normal file
@@ -0,0 +1,603 @@
"""Upload videos to PeerTube with transcoding-aware flow control.

Uploads videos one batch at a time, waits for each batch to be fully transcoded
and moved to object storage before uploading the next — preventing disk
exhaustion on the PeerTube server.

Usage:
    python upload.py                  # upload from ./downloads
    python upload.py -i /mnt/vol/dl   # custom input dir
    python upload.py --batch-size 2   # upload 2, wait, repeat
    python upload.py --dry-run        # preview without uploading
    python upload.py --skip-wait      # upload without waiting

Required (CLI flag or env var):
    --url      / PEERTUBE_URL
    --username / PEERTUBE_USER
    --channel  / PEERTUBE_CHANNEL
    --password / PEERTUBE_PASSWORD
"""

import argparse
from collections import Counter
import html
import os
from pathlib import Path
import re
import sys
import time

import requests
from dotenv import load_dotenv

from check_clashes import fmt_size, url_to_filename, VIDEO_EXTS
from download import (
    load_video_map,
    collect_urls,
    get_paths_for_mode,
    read_mode,
    MODE_ORIGINAL,
    DEFAULT_OUTPUT,
)

load_dotenv()

# ── Defaults ─────────────────────────────────────────────────────────

DEFAULT_BATCH_SIZE = 1
DEFAULT_POLL = 30
UPLOADED_FILE = ".uploaded"
PT_NAME_MAX = 120


# ── Text helpers ─────────────────────────────────────────────────────

def clean_description(raw):
    """Strip WordPress shortcodes and HTML from a description."""
    if not raw:
        return ""
    text = re.sub(r'\[/?[^\]]+\]', '', raw)
    text = re.sub(r'<[^>]+>', '', text)
    text = html.unescape(text)
    text = re.sub(r'\n{3,}', '\n\n', text).strip()
    return text[:10000]


def make_pt_name(title, fallback_filename):
    """Build a PeerTube-safe video name (3-120 chars)."""
    name = html.unescape(title).strip() if title else Path(fallback_filename).stem
    if len(name) > PT_NAME_MAX:
        name = name[: PT_NAME_MAX - 1].rstrip() + "\u2026"
    while len(name) < 3:
        name += "_"
    return name
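PeerTube rejects video names shorter than 3 or longer than 120 characters, which is what `make_pt_name` above guards against. A stand-alone copy shows the clamp at both ends (too-long titles get an ellipsis, too-short ones are padded):

```python
import html
from pathlib import Path

PT_NAME_MAX = 120  # PeerTube's upper bound on video names


def make_pt_name(title, fallback_filename):
    """Build a PeerTube-safe video name, 3-120 chars (mirror of the helper above)."""
    name = html.unescape(title).strip() if title else Path(fallback_filename).stem
    if len(name) > PT_NAME_MAX:
        # truncate to 119 chars and append an ellipsis to stay at exactly 120
        name = name[: PT_NAME_MAX - 1].rstrip() + "\u2026"
    while len(name) < 3:
        name += "_"   # pad up to the 3-char minimum
    return name
```

An empty title falls back to the file's stem, so `make_pt_name("", "clip.mp4")` gives `"clip"`, while a 2-char title like `"Hi"` comes out as `"Hi_"`.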
# ── PeerTube API ─────────────────────────────────────────────────────

def get_oauth_token(base, username, password):
    r = requests.get(f"{base}/api/v1/oauth-clients/local", timeout=15)
    r.raise_for_status()
    client = r.json()

    r = requests.post(
        f"{base}/api/v1/users/token",
        data={
            "client_id": client["client_id"],
            "client_secret": client["client_secret"],
            "grant_type": "password",
            "username": username,
            "password": password,
        },
        timeout=15,
    )
    r.raise_for_status()
    return r.json()["access_token"]


def api_headers(token):
    return {"Authorization": f"Bearer {token}"}


def get_channel_id(base, token, channel_name):
    r = requests.get(
        f"{base}/api/v1/video-channels/{channel_name}",
        headers=api_headers(token),
        timeout=15,
    )
    r.raise_for_status()
    return r.json()["id"]


def get_channel_video_names(base, token, channel_name):
    """Paginate through the channel and return a Counter of video names."""
    counts = Counter()
    start = 0
    while True:
        r = requests.get(
            f"{base}/api/v1/video-channels/{channel_name}/videos",
            params={"start": start, "count": 100},
            headers=api_headers(token),
            timeout=30,
        )
        r.raise_for_status()
        data = r.json()
        for v in data.get("data", []):
            counts[v["name"]] += 1
        start += 100
        if start >= data.get("total", 0):
            break
    return counts


CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB
MAX_RETRIES = 5


def _init_resumable(base, token, channel_id, filepath, filename, name,
                    description="", nsfw=False):
    """POST to create a resumable upload session. Returns upload URL."""
    file_size = Path(filepath).stat().st_size
    metadata = {
        "name": name,
        "channelId": channel_id,
        "filename": filename,
        "nsfw": nsfw,
        "waitTranscoding": True,
        "privacy": 1,
    }
    if description:
        metadata["description"] = description

    r = requests.post(
        f"{base}/api/v1/videos/upload-resumable",
        headers={
            **api_headers(token),
            "Content-Type": "application/json",
            "X-Upload-Content-Length": str(file_size),
            "X-Upload-Content-Type": "video/mp4",
        },
        json=metadata,
        timeout=30,
    )
    r.raise_for_status()

    location = r.headers["Location"]
    if location.startswith("//"):
        location = "https:" + location
    elif location.startswith("/"):
        location = base + location
    return location, file_size


def _query_offset(upload_url, token, file_size):
    """Ask the server how many bytes it has received so far."""
    r = requests.put(
        upload_url,
        headers={
            **api_headers(token),
            "Content-Range": f"bytes */{file_size}",
            "Content-Length": "0",
        },
        timeout=15,
    )
    if r.status_code == 308:
        range_hdr = r.headers.get("Range", "")
        if range_hdr:
            return int(range_hdr.split("-")[1]) + 1
        return 0
    if r.status_code == 200:
        return file_size
    r.raise_for_status()
    return 0
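`_query_offset` above follows the Google-style resumable protocol that PeerTube's `upload-resumable` endpoint speaks: a `PUT` with `Content-Range: bytes */<total>` probes progress, and a `308` reply carries a `Range` header such as `bytes=0-10485759` naming the last byte received. The header arithmetic can be isolated in a small helper (hypothetical name, not part of upload.py):

```python
def next_offset_from_range(range_header: str, fallback: int = 0) -> int:
    """Given the Range header from a 308 response (e.g. 'bytes=0-10485759'),
    return the byte offset the next chunk should start at."""
    if not range_header:
        return fallback          # server has acknowledged nothing yet
    last_byte = int(range_header.split("-")[1])
    return last_byte + 1         # resume one past the last received byte
```

After the server acknowledges a full 10 MiB chunk (`bytes=0-10485759`), the next `Content-Range` must start at byte 10485760, which matches the `int(range_hdr.split("-")[1]) + 1` expression in both `_query_offset` and the upload loop.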
def upload_video(base, token, channel_id, filepath, name,
                 description="", nsfw=False):
    """Resumable chunked upload. Returns (ok, uuid)."""
    filepath = Path(filepath)
    filename = filepath.name
    file_size = filepath.stat().st_size

    try:
        upload_url, _ = _init_resumable(
            base, token, channel_id, filepath, filename,
            name, description, nsfw,
        )
    except Exception as e:
        print(f"    Init failed: {e}")
        return False, None

    offset = 0
    retries = 0

    with open(filepath, "rb") as f:
        while offset < file_size:
            end = min(offset + CHUNK_SIZE, file_size) - 1
            chunk_len = end - offset + 1

            f.seek(offset)
            chunk = f.read(chunk_len)

            pct = int(100 * (end + 1) / file_size)
            print(f"    {fmt_size(offset)}/{fmt_size(file_size)} ({pct}%)",
                  end="\r", flush=True)

            try:
                r = requests.put(
                    upload_url,
                    headers={
                        **api_headers(token),
                        "Content-Type": "application/octet-stream",
                        "Content-Range": f"bytes {offset}-{end}/{file_size}",
                        "Content-Length": str(chunk_len),
                    },
                    data=chunk,
                    timeout=120,
                )
            except (requests.ConnectionError, requests.Timeout) as e:
                retries += 1
                if retries > MAX_RETRIES:
                    print(f"\n    Upload failed after {MAX_RETRIES} retries: {e}")
                    return False, None
                wait = min(2 ** retries, 60)
                print(f"\n    Connection error, retry {retries}/{MAX_RETRIES} "
                      f"in {wait}s ...")
                time.sleep(wait)
                try:
                    offset = _query_offset(upload_url, token, file_size)
                except Exception:
                    pass
                continue

            if r.status_code == 308:
                range_hdr = r.headers.get("Range", "")
                if range_hdr:
                    offset = int(range_hdr.split("-")[1]) + 1
                else:
                    offset = end + 1
                retries = 0

            elif r.status_code == 200:
                print(f"    {fmt_size(file_size)}/{fmt_size(file_size)} (100%)")
                uuid = r.json().get("video", {}).get("uuid")
                return True, uuid

            elif r.status_code in (502, 503, 429):
                retry_after = int(r.headers.get("Retry-After", 10))
                retries += 1
                if retries > MAX_RETRIES:
                    print(f"\n    Upload failed: server returned {r.status_code}")
                    return False, None
                print(f"\n    Server {r.status_code}, retry in {retry_after}s ...")
                time.sleep(retry_after)
                try:
                    offset = _query_offset(upload_url, token, file_size)
                except Exception:
                    pass

            else:
                detail = r.text[:300] if r.text else str(r.status_code)
                print(f"\n    Upload failed ({r.status_code}): {detail}")
                return False, None

    print("\n    Unexpected: sent all bytes but no 200 response")
    return False, None
_STATE = {
    1: "Published",
    2: "To transcode",
    3: "To import",
    6: "Moving to object storage",
    7: "Transcoding failed",
    8: "Storage move failed",
    9: "To edit",
}


def get_video_state(base, token, uuid):
    r = requests.get(
        f"{base}/api/v1/videos/{uuid}",
        headers=api_headers(token),
        timeout=15,
    )
    r.raise_for_status()
    state = r.json()["state"]
    return state["id"], state.get("label", "")


def wait_for_published(base, token, uuid, poll_interval):
    """Block until the video reaches state 1 (Published) or a failure state."""
    started = time.monotonic()
    while True:
        elapsed = int(time.monotonic() - started)
        hours, rem = divmod(elapsed, 3600)
        mins, secs = divmod(rem, 60)
        if hours:
            elapsed_str = f"{hours}h {mins:02d}m {secs:02d}s"
        elif mins:
            elapsed_str = f"{mins}m {secs:02d}s"
        else:
            elapsed_str = f"{secs}s"

        try:
            sid, label = get_video_state(base, token, uuid)
        except requests.exceptions.RequestException as e:
            print(f"  -> Poll error ({e.__class__.__name__}) "
                  f"after {elapsed_str}, retrying in {poll_interval}s …")
            time.sleep(poll_interval)
            continue

        display = _STATE.get(sid, label or f"state {sid}")

        if sid == 1:
            print(f"  -> {display}")
            return sid
        if sid in (7, 8):
            print(f"  -> FAILED: {display}")
            return sid

        print(f"  -> {display} … {elapsed_str} elapsed (next check in {poll_interval}s)")
        time.sleep(poll_interval)
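The elapsed-time display inside `wait_for_published` reduces to a divmod chain over whole seconds. Extracted as a stand-alone helper (hypothetical name, the poll loop inlines this) it behaves like:

```python
def fmt_elapsed(elapsed: int) -> str:
    """Render whole seconds as '5s', '2m 05s', or '1h 02m 03s',
    using the same rules as the polling loop above."""
    hours, rem = divmod(elapsed, 3600)
    mins, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {mins:02d}m {secs:02d}s"
    if mins:
        return f"{mins}m {secs:02d}s"
    return f"{secs}s"
```

The widest unit that is non-zero picks the format, and the smaller fields are zero-padded to two digits so consecutive poll lines stay the same width.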
# ── State tracker ────────────────────────────────────────────────────
|
||||
|
||||
def load_uploaded(input_dir):
|
||||
path = Path(input_dir) / UPLOADED_FILE
|
||||
if not path.exists():
|
||||
return set()
|
||||
with open(path) as f:
|
||||
return {Path(line.strip()) for line in f if line.strip()}
|
||||
|
||||
|
||||
def mark_uploaded(input_dir, rel_path):
|
||||
with open(Path(input_dir) / UPLOADED_FILE, "a") as f:
|
||||
f.write(f"{rel_path}\n")
|
||||
|
||||
|
||||
# ── File / metadata helpers ─────────────────────────────────────────
|
||||
|
||||
def build_path_to_meta(video_map, input_dir):
|
||||
"""Map each expected download path (relative) to {title, description}."""
|
||||
urls = collect_urls(video_map)
|
||||
mode = read_mode(input_dir) or MODE_ORIGINAL
|
||||
paths = get_paths_for_mode(mode, urls, video_map, input_dir)
|
||||
|
||||
url_meta = {}
|
||||
for entry in video_map.values():
|
||||
t = entry.get("title", "")
|
||||
d = entry.get("description", "")
|
||||
for video_url in entry.get("videos", []):
|
||||
if video_url not in url_meta:
|
||||
url_meta[video_url] = {"title": t, "description": d}
|
||||
|
||||
result = {}
|
||||
for url, abs_path in paths.items():
|
||||
rel = Path(abs_path).relative_to(input_dir)
|
||||
meta = url_meta.get(url, {"title": "", "description": ""})
|
||||
result[rel] = {**meta, "original_filename": url_to_filename(url)}
|
||||
return result
|
||||
|
||||
|
||||
def find_videos(input_dir):
|
||||
"""Walk input_dir and return a set of relative paths for all video files."""
|
||||
found = set()
|
||||
for root, dirs, files in os.walk(input_dir):
|
||||
dirs[:] = [d for d in dirs if not d.startswith(".")]
|
||||
for f in files:
|
||||
if Path(f).suffix.lower() in VIDEO_EXTS:
|
||||
found.add((Path(root) / f).relative_to(input_dir))
|
||||
return found
|
||||
|
||||
|
||||
# ── Channel match helpers ─────────────────────────────────────────────
|
||||
|
||||
def _channel_match(rel, path_meta, existing):
|
||||
"""Return (matched, name) for a local file against the channel name set.
|
||||
|
||||
Checks both the title-derived name and the original-filename-derived name
|
||||
so that videos uploaded under either form are recognised. Extracted to
|
||||
avoid duplicating the logic between the pre-reconcile sweep and the per-
|
||||
file check inside the upload loop.
|
||||
"""
|
||||
meta = path_meta.get(rel, {})
|
||||
name = make_pt_name(meta.get("title", ""), rel.name)
|
||||
orig_fn = meta.get("original_filename", "")
|
||||
raw_name = make_pt_name("", orig_fn) if orig_fn else None
|
||||
matched = name in existing or (raw_name and raw_name != name and raw_name in existing)
|
||||
return matched, name
|
||||


# ── CLI ──────────────────────────────────────────────────────────────

def main():
    ap = argparse.ArgumentParser(
        description="Upload videos to PeerTube with transcoding-aware batching",
    )
    ap.add_argument("--input", "-i", default=DEFAULT_OUTPUT,
                    help=f"Directory with downloaded videos (default: {DEFAULT_OUTPUT})")
    ap.add_argument("--url",
                    help="PeerTube instance URL (or set PEERTUBE_URL env var)")
    ap.add_argument("--username", "-U",
                    help="PeerTube username (or set PEERTUBE_USER env var)")
    ap.add_argument("--password", "-p",
                    help="PeerTube password (or set PEERTUBE_PASSWORD env var)")
    ap.add_argument("--channel", "-C",
                    help="Channel to upload to (or set PEERTUBE_CHANNEL env var)")
    ap.add_argument("--batch-size", "-b", type=int, default=DEFAULT_BATCH_SIZE,
                    help="Videos to upload before waiting for transcoding "
                         f"(default: {DEFAULT_BATCH_SIZE})")
    ap.add_argument("--poll-interval", type=int, default=DEFAULT_POLL,
                    help=f"Seconds between state polls (default: {DEFAULT_POLL})")
    ap.add_argument("--skip-wait", action="store_true",
                    help="Upload everything without waiting for transcoding")
    ap.add_argument("--nsfw", action="store_true",
                    help="Mark videos as NSFW")
    ap.add_argument("--dry-run", "-n", action="store_true",
                    help="Preview what would be uploaded")
    args = ap.parse_args()

    url = args.url or os.environ.get("PEERTUBE_URL")
    username = args.username or os.environ.get("PEERTUBE_USER")
    channel = args.channel or os.environ.get("PEERTUBE_CHANNEL")
    password = args.password or os.environ.get("PEERTUBE_PASSWORD")

    if not args.dry_run:
        missing = [label for label, val in [
            ("--url / PEERTUBE_URL", url),
            ("--username / PEERTUBE_USER", username),
            ("--channel / PEERTUBE_CHANNEL", channel),
            ("--password / PEERTUBE_PASSWORD", password),
        ] if not val]
        if missing:
            for label in missing:
                print(f"[!] Required: {label}")
            sys.exit(1)

    # ── load metadata & scan disk ──
    video_map = load_video_map()
    path_meta = build_path_to_meta(video_map, args.input)
    on_disk = find_videos(args.input)

    unmatched = on_disk - set(path_meta.keys())
    if unmatched:
        print(f"[!] {len(unmatched)} file(s) on disk not in video_map "
              "(will use filename as title)")
        for rel in unmatched:
            path_meta[rel] = {"title": "", "description": ""}

    uploaded = load_uploaded(args.input)
    pending = sorted(rel for rel in on_disk if rel not in uploaded)

    print(f"[+] {len(on_disk)} video files in {args.input}/")
    print(f"[+] {len(uploaded)} already uploaded")
    print(f"[+] {len(pending)} pending")
    print(f"[+] Batch size: {args.batch_size}")

    if not pending:
        print("\nAll videos already uploaded.")
        return

    # ── dry run ──
    if args.dry_run:
        total_bytes = 0
        for rel in pending:
            meta = path_meta.get(rel, {})
            name = make_pt_name(meta.get("title", ""), rel.name)
            sz = (Path(args.input) / rel).stat().st_size
            total_bytes += sz
            print(f"  [{fmt_size(sz):>10}] {name}")
        print(f"\n  Total: {fmt_size(total_bytes)} across {len(pending)} videos")
        return

    # ── authenticate ──
    base = url.rstrip("/")
    if not base.startswith("http"):
        base = "https://" + base

    print(f"\n[+] Authenticating with {base} ...")
    token = get_oauth_token(base, username, password)
    print(f"[+] Authenticated as {username}")

    channel_id = get_channel_id(base, token, channel)
    print(f"[+] Channel: {channel} (id {channel_id})")

    name_counts = get_channel_video_names(base, token, channel)
    existing = set(name_counts)
    total = sum(name_counts.values())
    print(f"[+] Found {total} video(s) on channel ({len(name_counts)} unique name(s))")

    dupes = {name: count for name, count in name_counts.items() if count > 1}
    if dupes:
        print(f"[!] {len(dupes)} duplicate name(s) detected on channel:")
        for name, count in sorted(dupes.items()):
            print(f"    x{count}  {name}")

    # ── pre-reconcile: sweep all pending against channel names ────────
    # The main upload loop discovers already-uploaded videos lazily as it
    # walks the sorted pending list — meaning on a fresh run (no .uploaded
    # file) you won't know how many files are genuinely new until the loop
    # has processed everything. Doing a full sweep here, before any
    # upload starts, gives an accurate count up-front and pre-populates
    # .uploaded so that interrupted/re-run sessions skip them instantly
    # without re-checking each time.
    pre_matched = [rel for rel in pending
                   if _channel_match(rel, path_meta, existing)[0]]
    if pre_matched:
        print(f"\n[+] Pre-sweep: {len(pre_matched)} local file(s) already on channel — marking uploaded")
        for rel in pre_matched:
            mark_uploaded(args.input, rel)
        pre_set = set(pre_matched)
        pending = [rel for rel in pending if rel not in pre_set]
        print(f"[+] {len(pending)} left to upload\n")

    nsfw = args.nsfw
    total_up = 0
    batch: list[tuple[str, str]] = []  # [(uuid, name), ...]

    try:
        for rel in pending:
            # ── flush batch if full ──
            if not args.skip_wait and len(batch) >= args.batch_size:
                print(f"\n[+] Waiting for {len(batch)} video(s) to finish processing ...")
                for uuid, bname in batch:
                    print(f"\n  [{bname}]")
                    wait_for_published(base, token, uuid, args.poll_interval)
                batch.clear()

            filepath = Path(args.input) / rel
            meta = path_meta.get(rel, {})
            name = make_pt_name(meta.get("title", ""), rel.name)
            desc = clean_description(meta.get("description", ""))
            sz = filepath.stat().st_size

            if _channel_match(rel, path_meta, existing)[0]:
                print(f"\n[skip] already on channel: {name}")
                mark_uploaded(args.input, rel)
                continue

            print(f"\n[{total_up + 1}/{len(pending)}] {name}")
            print(f"    File: {rel} ({fmt_size(sz)})")

            ok, uuid = upload_video(
                base, token, channel_id, filepath, name, desc, nsfw)
            if not ok:
                continue

            print(f"    Uploaded  uuid={uuid}")
            mark_uploaded(args.input, rel)
            total_up += 1
            existing.add(name)

            if uuid:
                batch.append((uuid, name))

        # ── wait for final batch ──
        if batch and not args.skip_wait:
            print(f"\n[+] Waiting for final {len(batch)} video(s) ...")
            for uuid, bname in batch:
                print(f"\n  [{bname}]")
                wait_for_published(base, token, uuid, args.poll_interval)

    except KeyboardInterrupt:
        print(f"\n\n[!] Interrupted after {total_up} uploads. Re-run to continue.")
        sys.exit(130)

    print(f"\n{'=' * 50}")
    print(f"  Uploaded: {total_up} video(s)")
    print("  Done!")
    print(f"{'=' * 50}")


if __name__ == "__main__":
    main()