# Automating VT Hash Check: Scripts and Best Practices
## Why automate VT hash checks
Automating VirusTotal (VT) hash lookups saves time, reduces human error, and scales threat triage for many files. Instead of manually submitting hashes to the web UI, scripts let you batch-query, integrate checks into pipelines (CI/CD, EDR workflows), and trigger downstream actions (quarantine, alerts, ticket creation).
## Common automation goals
- Batch-check large sets of file hashes (MD5/SHA1/SHA256).
- Enrich alerts with VT verdicts and vendor detections.
- Cache results to avoid repeated API calls and rate limits.
- Automatically escalate or block based on thresholds.
- Log and audit all queries for incident investigation.
## Prerequisites
- A VirusTotal API key (public or private).
- Basic scripting knowledge (Python, Bash, PowerShell).
- Hashes to check in a structured form (CSV, JSON, or plain text).
- Secure storage for your API key (environment variables, secrets manager).
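As a minimal sketch, the API key can live in an environment variable so it never appears in the script itself (the key value below is a placeholder):

```shell
# Export the key for the current shell session (placeholder value).
export VT_API_KEY="your-key-here"

# Scripts can then fail fast when the key is missing:
: "${VT_API_KEY:?VT_API_KEY is not set}"
```

For persistent use, prefer a secrets manager or your CI system's secret store over hard-coding the key in shell profiles.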
## Recommended workflow
- Read a list of hashes from a file or alert feed.
- Normalize hashes (trim whitespace, verify length/format).
- Check local cache/database for prior results.
- Query VT API only for uncached hashes, obeying rate limits.
- Parse VT response: detection ratio, first/last submission dates, related indicators.
- Store results in your cache and send relevant alerts/actions.
- Periodically refresh cached results for older entries.
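The normalization step above can be sketched as a small validator; the regexes assume hex-encoded MD5/SHA1/SHA256 digests, and the function name is illustrative:

```python
import re

# Map algorithm name to the expected hex-digest pattern
# (MD5 = 32 hex chars, SHA1 = 40, SHA256 = 64).
HASH_PATTERNS = {
    "md5": re.compile(r"^[0-9a-f]{32}$"),
    "sha1": re.compile(r"^[0-9a-f]{40}$"),
    "sha256": re.compile(r"^[0-9a-f]{64}$"),
}

def normalize_hash(raw):
    """Trim whitespace, lowercase, and classify a hash string.

    Returns (algorithm, normalized_hash), or (None, None) if the
    input is not a valid hex digest of a supported length.
    """
    h = raw.strip().lower()
    for algo, pattern in HASH_PATTERNS.items():
        if pattern.match(h):
            return algo, h
    return None, None
```

Rejecting malformed hashes before querying avoids wasting API quota on lookups that can never match.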
## Example: Python script (SHA256, VT v3 API)
```python
# Requires: requests
# Usage: set VT_API_KEY env var; provide hashes.txt with one SHA256 per line
import json
import os
import time

import requests

VT_API_KEY = os.getenv("VT_API_KEY")
HEADERS = {"x-apikey": VT_API_KEY}
INPUT_FILE = "hashes.txt"
CACHE_FILE = "vt_cache.json"
RATE_LIMIT_SLEEP = 15  # seconds between requests to avoid throttling

def load_cache():
    try:
        with open(CACHE_FILE, "r") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f, indent=2)

def query_hash(h):
    url = f"https://www.virustotal.com/api/v3/files/{h}"
    r = requests.get(url, headers=HEADERS, timeout=30)
    if r.status_code == 200:
        return r.json()
    return {"error": r.status_code, "text": r.text}

def parse_result(resp):
    if "error" in resp:
        return {"status": "error", "code": resp["error"]}
    data = resp.get("data", {})
    attrs = data.get("attributes", {})
    stats = attrs.get("last_analysis_stats", {})
    return {
        "malicious": stats.get("malicious", 0),
        "suspicious": stats.get("suspicious", 0),
        "undetected": stats.get("undetected", 0),
        "total_votes": attrs.get("total_votes", {}),
        "first_submission_date": attrs.get("first_submission_date"),
        "last_analysis_date": attrs.get("last_analysis_date"),
        "links": data.get("links", {}),
    }

def main():
    cache = load_cache()
    with open(INPUT_FILE) as f:
        hashes = [line.strip() for line in f if line.strip()]
    for h in hashes:
        if h in cache:
            print(f"{h}: cached -> {cache[h].get('malicious')} malicious")
            continue
        parsed = parse_result(query_hash(h))
        cache[h] = parsed
        print(f"{h}: {parsed.get('malicious', 'err')} malicious")
        save_cache(cache)
        time.sleep(RATE_LIMIT_SLEEP)

if __name__ == "__main__":
    main()
```
## Best practices
- Respect rate limits: Use sleeps, exponential backoff, and monitor HTTP 429 responses.
- Cache aggressively: Store results with timestamps; refresh only when needed.
- Secure API keys: Use environment variables or secrets managers; never hard-code keys.
- Normalize inputs: Validate hash lengths (MD5=32, SHA1=40, SHA256=64 hex chars).
- Graceful error handling: Retry transient failures, log persistent errors for review.
- Use VT enrichment fields: Pull vendor detections, community votes, first/last submission dates, and crowdsourced tags.
- Define action thresholds: e.g., block if malicious vendors ≥ 3, quarantine if suspicious > 0. Tailor thresholds to your risk tolerance.
- Privacy and compliance: Avoid uploading sensitive content; prefer hash lookups over file uploads when privacy is a concern.
- Audit and logging: Keep query logs (without sensitive data) for investigations and compliance.
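The retry-with-backoff practice above can be sketched as a small helper; the function names here are illustrative, not part of any library, and the `sleep` callable is injectable so the delay logic can be tested:

```python
import time

def request_with_backoff(do_request, max_retries=5, base_delay=2.0,
                         sleep=time.sleep):
    """Call do_request() and retry on HTTP 429 / transient 5xx responses.

    do_request must return an object with a .status_code attribute
    (e.g. a requests.Response). The delay doubles on each retry:
    2s, 4s, 8s, ... After max_retries the last response is returned
    so the caller can log the persistent failure.
    """
    resp = None
    for attempt in range(max_retries):
        resp = do_request()
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        sleep(base_delay * (2 ** attempt))  # exponential backoff
    return resp
```

A jittered delay (adding a small random component) further reduces the chance of many workers retrying in lockstep.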
## Integrations and scaling tips
- Push results to SIEM (Splunk, Elastic) or ticketing systems (Jira, ServiceNow).
- Use serverless functions (AWS Lambda, Azure Functions) for on-demand checks.
- Parallelize with worker queues but shard to respect per-key rate limits.
- Rotate API keys or use multiple keys/accounts if volume requires it.
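Round-robin rotation across several keys can be sketched with `itertools.cycle`; the key values are placeholders, and in practice you would load them from a secrets manager:

```python
import itertools

# Placeholder keys; load real keys from a secrets manager.
API_KEYS = ["key-a", "key-b", "key-c"]
_key_cycle = itertools.cycle(API_KEYS)

def next_headers():
    """Return VT v3 request headers using the next key in rotation."""
    return {"x-apikey": next(_key_cycle)}
```

Note that rotation only spreads load: each key's own rate limit still applies, so per-key throttling must be enforced alongside the rotation.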
## Quick decision matrix
| Use case | Recommended approach |
|---|---|
| One-off checks | Manual VT UI or simple script |
| Batch daily feeds | Scheduled script with cache and logging |
| Real-time alerts | Integrate into EDR/SIEM with async workers |
| High-volume automation | Sharded workers, multiple API keys, backoff logic |
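The action thresholds mentioned under best practices (e.g. block at ≥ 3 malicious vendors) can be encoded as a small policy function; the default numbers below are illustrative, not recommendations:

```python
def decide_action(stats, block_at=3, quarantine_at=1):
    """Map VT last_analysis_stats counts to an action string.

    stats: dict with 'malicious' and 'suspicious' vendor counts.
    Thresholds are example defaults; tune them to your risk tolerance.
    """
    malicious = stats.get("malicious", 0)
    suspicious = stats.get("suspicious", 0)
    if malicious >= block_at:
        return "block"
    if malicious > 0 or suspicious >= quarantine_at:
        return "quarantine"
    return "allow"
```

Keeping the policy in one function makes the thresholds easy to document, audit, and adjust without touching the query pipeline.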
## Final checklist before production
- API key stored securely and tested.
- Rate limiting and retry logic implemented.
- Caching and expiry policy defined.
- Alert/enforcement thresholds documented.
- Logging and monitoring in place.