Best free or budget
Hugging Face See Hugging Face plansAI Infrastructure & Model APIs
Updated June 22, 2026: compare OpenRouter, OpenAI API, Claude API after Fable/Mythos suspension, Gemini API, Google Cloud data agents and managed MCP servers, Mistral, Groq, Together AI, Replicate, fal, Fireworks AI, Modal, Browserbase, Deepgram, Pinecone, Weaviate/Engram, Qdrant, Llama, LM Studio, and model-availability governance tradeoffs.
Free tier (25+ models, 50 req/day) · Pay-as-you-go (5.5% platform fee on 400+ models) · Enterprise custom
Best model router
OpenRouter
Unified LLM API for hundreds of models, with OpenAI-compatible requests, provider routing, fallbacks, app attribution, and per-model token pricing.
Editorial · no paid placements
Quick paths
Best hosted model catalog
Replicate See Replicate plansBuyer path
Source-backed shortlist
- Source
- Registered source
- Freshness
- Current
- Confidence
- High confidence
Best local or open-model starter
lm-studioLM Studio is the cleanest first stop when the buyer wants local model testing, a desktop workflow, and an OpenAI-compatible local API before choosing hosted inference.
- Source
- Registered source
- Freshness
- Current
- Confidence
- High confidence
- Verified
All tools in AI Infrastructure & Model APIs
- 1
Hugging Face Open AI collaboration hub for models, datasets, Spaces, inference endpoints, evaluations, and enterprise ML workflows. - 2
Modal Serverless cloud for Python, GPUs, jobs, web endpoints, sandboxes, queues, and AI apps that should scale without managing infrastructure. - 3
Together AI AI infrastructure platform for serverless inference, dedicated GPU deployments, fine-tuning, code sandboxes, and open-model training workflows. - 4
Weaviate Open-source vector database and managed cloud for RAG, semantic search, hybrid search, multi-tenancy, embeddings, and AI-native retrieval. - 5
OpenRouter Unified LLM API for hundreds of models, with OpenAI-compatible requests, provider routing, fallbacks, app attribution, and per-model token pricing. - 6
Pinecone Managed vector database for semantic search, hybrid search, RAG, recommendations, Pinecone Assistant, and production AI retrieval workloads. - 7
Qdrant Open-source vector database written in Rust, with managed cloud, Free/Standard/Premium tiers, hybrid/private cloud options, metadata filtering, payload indexes, and RAG-ready retrieval. - 8
Replicate Developer platform for running open and hosted AI models by API, with official models, community models, custom deployments, and usage-based pricing. - 9
Browserbase Cloud browser infrastructure for web agents, scraping, QA automation, and AI-controlled browsing, now framed around real Chromium browsers, web data APIs, Model Gateway, Functions, identity, and observability.
Quick Decision
AI infrastructure tools sit underneath the apps people see. They route model calls, host open models, run GPU workloads, store embeddings, power RAG, transcribe audio, generate media, and help teams compare cost, latency, quality, and control without rebuilding the stack every month.
This category is for developer and platform buyers. If the user is choosing a chatbot, start with AI Chatbots. If the team is shipping an AI product, agent, retrieval layer, or model-backed workflow, this is the better lane.
The late-May infrastructure update is agent control. CoreWeave’s training-to-inference loop pushes traces, evals, RL, inference, and W&B tooling into one reliability story. OpenAI’s Rosalind Biodefense trusted-access expansion shows that specialist frontier models may ship as gated capability programs. Sysdig’s LLM-agent intrusion report makes runtime telemetry and least-privilege design part of infrastructure buying, not only security cleanup.
The June 3 update widens that control story. Microsoft Build put Work IQ and Foundry around enterprise agents; GitHub made the Copilot SDK generally available while AI Credits became the agent-usage meter; NVIDIA pushed enterprise agents, Cosmos 3, open physical-AI agent skills, Alpamayo 2 Super, RTX Spark, and DGX Station for Windows; Postman launched AI Engineer for API work; RelationalAI moved agentic decision intelligence deeper into Snowflake; 7AI kept security agents in the proactive-hunting lane; and the White House AI cybersecurity order put advanced AI cyber capability into public-sector and critical-infrastructure policy. Infrastructure buyers should evaluate agent stacks by context access, runtime isolation, traces, evals, spend controls, simulation/data pipelines, local-vs-cloud compute, and write-action approvals.
The June 14 update keeps model availability as a first-class infrastructure risk. Claude Fable/Mythos access is suspended, GPT-5.2 is retired from ChatGPT, and OpenAI faces reported state-AG scrutiny. Direct frontier API buyers should now document the exact model route in production, the fallback route if a model is suspended or retired, the retention policy for that model class, the staff/client access exposure for restricted routes, and the legal/privacy review path for sensitive users. The AI Model Availability & Churn Tracker is now the canonical AiPedia surface for these app/API/router distinctions.
The June 16 infrastructure update is governed data agents. Google Cloud’s data-agent rollout puts Conversational Analytics, Data Engineering Agent, Looker agents, Gemini Enterprise data access, Data Agent Kit, Managed MCP Servers for Databases are GA, while many of the more ambitious analytics, Looker, Gemini Enterprise, and commerce routes remain preview. Infrastructure teams should evaluate these by IAM scope, roles/mcp.toolUser, service permissions, separate production identities, SQL verification, BigQuery spend limits, job labels, audit logging, Model Armor payload logging, and GA-versus-preview fallback plans.
Use OpenRouter when you need one API across many model providers. The current pricing page lists pay-as-you-go access to 400+ models and 60+ providers, with budget controls, activity logs, prompt caching, preferred vendor selections, and model-priced token billing. Its May 27 funding signal makes the category clearer: routing, fallback, governance, and spend visibility are becoming production infrastructure, not just developer convenience.
Use direct vendor APIs when native features matter. OpenAI API is the default direct route for broad multimodal app work. Claude API is the direct route for long reasoning, writing, code, and document workflows. Gemini API inputs, or Veo video generation are part of the product. The June 22 Gemini recheck keeps Gemini 3.5 Flash pricing mode-specific: standard, batch/flex, priority, grounding, tools, and media rows need separate cost modeling.
Use Mistral AI or Groq when price/performance, open-model strategy, European infrastructure, or low-latency inference matters. The June 15 Mistral check keeps the timeline and cost model honest: Mistral 3 officially launched on December 2, 2025, while Medium 3.5’s model-card date is April 28, 2026. Current Mistral pricing lists Large 3 at $0.50/M input and $1.50/M output, Medium 3.5 at $1.50/M and $7.50/M, and Small 4 at $0.10/M and $0.30/M, but the Small 4 model card still lists $0.15/M and $0.60/M, and the pricing FAQ still uses a generic Mistral Large $2/$6 example. Benchmark real prompts, confirm the live Studio quote, and pin exact model IDs before switching because model quality, output length, retries, aliases, and source drift change the bill.
Use Replicate or fal.ai when the job is hosted image, video, audio, 3D, or custom-model inference. The June 9 Replicate check keeps it strongest as a broad model catalog and custom-model deployment layer: public models may bill by hardware time or by input/output, while most private deployments bill setup, idle, and active time unless they are labeled fast-booting fine-tunes. fal is stronger when successful-output billing and fast media APIs are the buyer problem; the June 2 check keeps prepaid credits, queue behavior, failed-output billing, and the 50% batch discount as the key pricing details to model.
Use Fireworks AI when the workload is production inference over open or commercial models., cached-token discounts, batch jobs, dedicated GPU deployments, fine-tuning, and B200/B300 capacity are the actual purchase.
Use Browserbase when the infrastructure problem is web interaction.. It belongs here when agents need reliable browser sessions, Fetch/Extract, replay, and model routing rather than just another LLM.
Use Deepgram when speech is infrastructure. Deepgram is a better fit for product teams adding STT, TTS, audio intelligence, or voice agents than for creators who only need a one-off transcript.
Use Hugging Face when model discovery, model cards, datasets, Spaces, and managed endpoints need to live in one open-AI collaboration surface. The June 2 pricing check keeps Pro at $9/month, Team at $20/user/month, Enterprise from $50/user/month, storage at $12/TB public and $18/TB private before volume discounts, ZeroGPU on RTX Pro 6000 Blackwell for PRO/Enterprise, and Inference Endpoints starting at $0.033/hour CPU.
Buyer Paths
| Buyer job | Start with | Why | Watch out |
|---|---|---|---|
| Multi-model LLM routing | OpenRouter | One API, many providers, spend controls, logs, routing | Router fees and provider policy choices still need governance |
| Direct frontier LLM API | OpenAI, Claude, or Gemini | Best when native model features, support, and procurement matter | Model access, retirements, legal/data governance, long context, outputs, tools, and video can change cost and risk quickly |
| Budget/open-model API | Mistral AI or Groq | Useful for cost-sensitive, latency-sensitive, and sovereignty-sensitive workloads | Requires benchmarking against your actual prompts, exact model IDs, and current model-card/pricing-page drift |
| Hosted model catalog | Replicate | Public, proprietary, and custom models without owning GPUs | Hardware-time, output-priced media, and private-model idle billing need separate cost modeling |
| Fast media APIs | fal.ai | Image, video, audio, and 3D APIs with per-output or per-second pricing | Prepaid credits and per-model units need tracking |
| Production model inference | Fireworks AI | Serverless inference, batch jobs, dedicated GPUs, fine-tuning, and cached-token discounts | Named model rates, GPU utilization, batch timing, and cached-token behavior decide the real bill |
| Serverless Python/GPU apps | Modal | Python jobs, web endpoints, queues, sandboxes, and per-second GPU billing without Kubernetes | Region selection, non-preemptible execution, and steady 24/7 GPU load can change the economics |
| Cloud browser infrastructure | Browserbase | Managed Chromium sessions, web data APIs, Functions runtime, identity, Model Gateway, observability, Stagehand, and MCP | Browser sessions, Fetch/Extract calls, proxy bandwidth, model tokens, and agent loops need cost, timeout, and credential controls |
| Speech and voice infrastructure | Deepgram | STT, TTS, audio intelligence, and voice-agent APIs | Voice minutes, channels, model choice, and LLM orchestration affect cost |
| Model discovery and endpoints | Hugging Face | Model cards, datasets, Spaces, Inference Endpoints | License and safety checks stay with the builder |
| Production retrieval | Pinecone, Weaviate, or Qdrant | Managed or open vector search for RAG | Index design and embedding cost matter as much as database pricing |
How to Choose
- Model routing: Pick OpenRouter when you need one OpenAI-compatible API across many providers.
- Direct LLM APIs: Pick OpenAI, Claude, or Gemini when native features, procurement, and provider-specific controls matter.
- Cost and latency: Pick Mistral AI or Groq when you can benchmark quality against real prompts and need tighter unit economics.
- Open-model infrastructure: Pick Together AI when you need hosted inference, fine-tuning, dedicated endpoints, code sandboxes, and GPU capacity for open-model products. The June 9 check separates serverless model-token pricing from dedicated inference and GPU clusters, so benchmark your actual traffic before assuming a single “open model” unit cost.
- Model catalog and experiments: Pick Hugging Face for discovery, datasets, model cards, demos, Spaces, ZeroGPU, and endpoints.
- Media and community models: Pick Replicate when the job is running image, video, audio, or custom models by API. The June 9 check confirms buyers should model public output-priced examples separately from hardware-time runs and private deployments that can bill while idle.
- Fast media APIs: Pick fal.ai when successful-output billing, image/video/audio/3D endpoints, and fast experimentation matter.
- Production inference: Pick Fireworks AI when hosted model APIs, batch inference, dedicated GPU deployments, and fine-tuning are more important than a polished chatbot UI.
- Browser automation: Pick Browserbase when an AI agent, scraper, QA runner, or workflow needs managed browsers, Search/Fetch, Functions runtime, identity, observability, Model Gateway, and Stagehand-style automation.
- Speech APIs: Pick Deepgram when speech-to-text, voice agents, or audio intelligence are infrastructure, not just creator utilities.
- Serverless GPU apps: Pick Modal when you want Python jobs, endpoints, queues, sandboxes, and GPU workloads without Kubernetes. The June 8 check keeps Starter at $0 with $30/month credits, Team at $250/month plus compute with $100/month credits, B200 at $0.001736/sec, H100 at $0.001097/sec, and B200+ as a compatibility route that can run on B200 or B300 while billing as B200.
- Open-weight model family: Pick Llama when infrastructure needs self-hostable or provider-hosted open weights. The June 8 check keeps Maverick as the flagship open-weight lane, Scout as the current Groq fast-inference card at $0.11/M input and $0.34/M output, and Together Maverick at $0.27/M input and $0.85/M output.
- Local model runtime: Pick LM Studio when developers need a desktop GUI plus native v1 REST API, OpenAI-compatible and Anthropic-compatible endpoints, MCP support, SDKs, CLI server control, and LM Link for Llama, Qwen, Mistral, and other open weights. LM Studio has been free for ordinary home and work use since its July 2025 terms change.
- Managed vector search: Pick Pinecone, backups, imports, and reranking before treating the database price as the whole retrieval bill.
- Open vector databases and agent memory: Pick Weaviate or Qdrant when self-hosting optionality and control matter. The June 10 Weaviate check keeps Free, Flex from $45/month, Plus from $280/month, Premium from $400/month, Weaviate Embeddings at $0.025-$0.065 per 1M tokens, Query Agent at a free 1,000-request/month trial path or $30/org/month with 4,000 included requests, and Engram generally available as a managed memory/context service for agents. The June 8 Qdrant check keeps the Free Cloud testing tier at 0.5 vCPU, 1GB RAM, and 4GB disk; Standard as usage-based production cloud; Premium as the enterprise-support tier; Hybrid/Private Cloud as the control-first path; and v1.18.2 as the latest release checked, with security fixes included in the release notes.
Money Pages To Keep Current
- Best pay-as-you-go AI tools and APIs was refreshed June 12, 2026 to separate true metered API usage from flat subscriptions and keep OpenAI, Claude, Gemini, OpenRouter, Mistral, Groq, Replicate, fal, Deepgram, ElevenLabs, and Fish Audio pricing risk in one buyer path.
- Best open source AI tools was refreshed June 12, 2026 for Ollama, LM Studio, Open WebUI, Llama, Mistral, DeepSeek, FLUX, Stable Diffusion, Whisper, and Hugging Face because open-model buyers often compare local control against hosted pay-as-you-go APIs.
- Best AI tools for developers is the June 6 verified developer guide for separating Cursor, GitHub Copilot AI Credits, Claude Code shared limits/API credits, Codex token credits, Replit Agent, and Aider BYOK API costs.
- A new
OpenRouter vs direct APIscomparison would capture buyers choosing between a model router and direct OpenAI/Anthropic/Google contracts. - A new
Replicate vs fal.aicomparison would capture image/video/API buyers choosing between broad model catalog and fast media-generation infrastructure.
Watchouts
Infrastructure tools are powerful because they hide messy systems. That can also hide cost and governance risk. Before standardizing, test real workloads, pin model routes where quality matters, model retry costs, and document what data can pass through each provider.
Do not publish infrastructure pages with old flat monthly subscription framing. The buyer question is usually total workload cost: input tokens, output tokens, cached tokens, web/search tools, video seconds, generated images, GPU runtime, voice minutes, channels, retries, and failed generations.
Sources
- OpenRouter pricing (verified 2026-05-27)
- OpenRouter Series B announcement (verified 2026-05-27)
- OpenAI API pricing (verified 2026-06-12)
- Claude API pricing (verified 2026-06-12)
- AiPedia late June 13 AI news update (verified 2026-06-13)
- AiPedia June 14 AI news desk (verified 2026-06-14)
- Anthropic Fable/Mythos access statement (verified 2026-06-14)
- OpenAI ChatGPT release notes (verified 2026-06-13)
- Gemini API pricing (verified 2026-06-22)
- Mistral AI pricing (verified 2026-06-15)
- Mistral Vibe product page (verified 2026-06-15)
- Mistral Vibe agent announcement (verified 2026-06-15)
- Mistral AI Now Summit 2026 (verified 2026-06-15)
- Mistral model docs (verified 2026-06-15)
- Mistral 3 launch post (verified 2026-06-15)
- Groq pricing (verified 2026-06-12)
- Replicate pricing (verified 2026-06-12)
- fal Model API pricing docs (verified 2026-06-12)
- Fireworks AI pricing (verified 2026-06-12)
- Fireworks billing FAQ (verified 2026-06-12)
- Fireworks inference documentation (verified 2026-06-12)
- Browserbase pricing (verified 2026-06-18)
- Browserbase changelog (verified 2026-06-18)
- Browserbase Browser explainer (verified 2026-06-18)
- Browserbase Model Gateway docs (verified 2026-06-18)
- Deepgram pricing (verified 2026-06-12)
- Together AI pricing (verified 2026-06-12)
- Hugging Face pricing (verified 2026-06-12)
- Modal pricing (verified 2026-06-12)
- Modal GPU docs (verified 2026-06-12)
- LM Studio (verified 2026-06-12)
- LM Studio developer docs (verified 2026-06-12)
- Llama official site (verified 2026-06-12)
- Together AI Llama pricing (verified 2026-06-12)
- Groq Llama 4 Scout model card (verified 2026-06-12)
- Pinecone pricing (verified 2026-06-12)
- Pinecone cost docs (verified 2026-06-12)
- Pinecone Assistant pricing and limits (verified 2026-06-12)
- Weaviate pricing (verified 2026-06-12)
- Weaviate Engram GA announcement (verified 2026-06-12)
- Qdrant pricing (verified 2026-06-12)
- Qdrant Cloud billing (verified 2026-06-12)
- Qdrant v1.18.2 release notes (verified 2026-06-12)
- CoreWeave autonomous agent improvement launch (verified 2026-05-31)
- OpenAI Rosalind Biodefense (verified 2026-05-31)
- Geordie AI Series A (verified 2026-05-31)
- Sysdig LLM-agent intrusion analysis (verified 2026-05-31)
- Microsoft Build 2026 Work IQ and Foundry agent stack (verified 2026-06-12)
- Google Cloud data agents announcement (verified 2026-06-16)
- Google Cloud Data Engineering Agent docs (verified 2026-06-16)
- Google Cloud MCP servers docs (verified 2026-06-16)
- Google Cloud Conversational Analytics docs (verified 2026-06-16)
- GitHub Copilot SDK GA (verified 2026-06-12)
- NVIDIA enterprise software agents (verified 2026-06-12)
- NVIDIA Cosmos 3 physical AI model (verified 2026-06-12)
- NVIDIA physical AI agent tools and skills (verified 2026-06-12)
- NVIDIA Alpamayo 2 Super (verified 2026-06-12)
- NVIDIA RTX Spark Windows AI PCs (verified 2026-06-12)
- NVIDIA DGX Station for Windows (verified 2026-06-12)
- Postman AI Engineer (verified 2026-06-12)
- RelationalAI Snowflake agentic decision intelligence (verified 2026-06-12)
- 7AI Agentic Security Platform (verified 2026-06-12)
- White House AI cybersecurity order (verified 2026-06-12)
Workflow playbooks
- Best Pay-As-You-Go AI Tools and APIs (June 2026)Current buyer guide to true pay-as-you-go AI tools, separating metered APIs from flat subscriptions and showing which platform to use for text, coding, media, voice, and production workloads.
- AI Automation Agency Tech Stack (June 2026)A source-backed AI automation agency stack for selling reliable client workflows without overbuying agent platforms or hiding failure modes.
- Best Open Source AI Tools (June 2026)Current buyer guide to open source and open-weight AI tools, covering local chat, self-hosted interfaces, open models, image generation, speech recognition, privacy tradeoffs, hardware limits, and security risks.
Recent product signals
- AI News Desk, May 28, 2026: Claude Opus 4.8, Anthropic's $65B round, enterprise agents, wallets, and runtime governanceMay 28
- Compal and GMI Cloud target the infrastructure bottleneck behind large-scale agentic AIMay 28
- AI News Desk, May 27, 2026: OpenRouter funding, Qwen agents, Windows Copilot, and Samsung's multi-model rolloutMay 27
- OpenRouter's $113M Series B makes model routing an enterprise AI infrastructure betMay 27
- Microsoft Research releases MagenticLite to test small-model agents on local machinesMay 22
Spotted an error or want to share your experience with AI Infrastructure & Model APIs?
Every tool page is re-verified on a recurring cycle, and corrections land faster when readers flag them directly. If you spot a stale fact, a missing capability, or have used AI Infrastructure & Model APIs and want to share what worked or didn't, the editorial desk reviews every message sent through this form.
Email editorial@aipedia.wiki