LLM Completions

Generate AI text completions with or without context from your document collections.

Completions Overview

The LLM Completions API allows you to generate text using VectorForgeAI's language models. You can provide your own context to guide the model's responses, creating tailored answers for your specific use cases.

Generate a Completion

Generate an AI text completion with optional context.

POST /responses

Request Parameters

Parameter      Type     Required  Description
message        string   Yes       The user's message or query
context        string   No        Additional dynamic user context to guide the model's response
system_prompt  string   No        System instructions for the model
effort         enum     No        Reasoning effort (options: minimal, low, medium, high). minimal skips reasoning and does not support tools. Default: medium
verbosity      enum     No        Response verbosity (options: low, medium, high). Lower values yield shorter answers; higher values yield longer ones. Default: medium
max_tokens     integer  No        Maximum length of the generated response, in tokens (1-32768). Default: 1024
model          enum     No        The model to use (options: standard, pro). pro uses 4x more tokens than standard. Default: standard

Request

cURL
curl -X POST https://api.vectorforgeai.com/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Team-Token: YOUR_TEAM_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "context": "You are a helpful AI assistant for VectorForgeAI. Answer questions accurately and clearly based on your knowledge.",
    "message": "What is vector embedding and how does it work?",
    "effort": "medium",
    "verbosity": "medium",
    "max_tokens": 512
  }'

Response

JSON
{
  "response": "Vector embedding is a technique that converts text, images, or other data into numerical vectors in a high-dimensional space. These vectors capture the semantic meaning of the content, allowing machines to understand similarities between different pieces of information.\n\nHere's how it works:\n\n1. Input Processing: Text or other data is tokenized and prepared for the embedding model.\n\n2. Embedding Generation: A neural network trained on vast amounts of data transforms the input into a vector, typically consisting of hundreds or thousands of floating-point numbers.\n\n3. Dimensional Representation: Each dimension in the vector represents some learned aspect of the content, though these dimensions aren't necessarily human-interpretable.\n\n4. Similarity Measurement: Once content is embedded as vectors, similarity between different items can be measured mathematically, usually through cosine similarity or Euclidean distance.\n\nThese embeddings enable many AI capabilities like semantic search, recommendation systems, and knowledge retrieval for language models."
}

Using Context Effectively

The context parameter helps guide the model's response. Here are some effective ways to use it:

  • Define the Assistant's Role: "You are a helpful customer support agent for VectorForgeAI..."
  • Set Constraints: "Keep responses concise and under 3 sentences..."
  • Provide Reference Material: "Use the following information to answer questions..."
  • Specify Formats: "Structure your answer in bullet points with a brief summary at the end..."
  • Context vs System Prompt: Context is for frequently changing information, while system_prompt is for static instructions. Use context for dynamic user data and system_prompt for consistent behavior, as in the sketch below.
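
For example, the following sketch keeps the static role and constraints in system_prompt and passes per-request user data in context. The account details in the context string are hypothetical:

cURL
curl -X POST https://api.vectorforgeai.com/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Team-Token: YOUR_TEAM_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a customer support agent for VectorForgeAI. Keep responses concise and under 3 sentences.",
    "context": "The user is on a paid plan and currently has 2 document collections.",
    "message": "How do I create another document collection?"
  }'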

Understanding Parameters

Effort

The effort parameter controls the reasoning depth of the model:

  • minimal: Fastest responses without reasoning. Doesn't support tools. Best for simple queries.
  • low: Light reasoning for straightforward tasks.
  • medium: Balanced reasoning for most use cases. Default setting.
  • high: Deep reasoning for complex problems requiring thorough analysis.
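
For example, a multi-step analytical question is a natural fit for high effort. A sketch of the request body (same endpoint and headers as the example above; the query is illustrative):

JSON
{
  "message": "Compare cosine similarity and Euclidean distance for measuring embedding similarity, and explain when each is preferable.",
  "effort": "high"
}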

Verbosity

The verbosity parameter controls the length and detail of responses:

  • low: Concise, direct answers. Best for quick responses.
  • medium: Balanced detail and length. Suitable for most cases.
  • high: Detailed, comprehensive responses with extensive explanations.
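
Conversely, a quick factual lookup pairs well with low verbosity. A sketch of the request body (same endpoint and headers as above; the query is illustrative):

JSON
{
  "message": "In one sentence, what is a vector embedding?",
  "verbosity": "low"
}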

Max Tokens

The max_tokens parameter limits the length of the generated response. A token is roughly 4 characters in English:

  • 128-256: Short responses (roughly 100-200 words)
  • 512-1024: Medium responses (roughly 400-800 words)
  • 2048+: Long, detailed responses
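
Using the rough 4-characters-per-token rule above, you can estimate token counts from character counts. A minimal shell sketch (heuristic only; actual tokenization varies):

Bash
# Rough estimate: ~4 characters per token for English text
text="Vector embedding converts data into numerical vectors."
echo $(( ${#text} / 4 ))   # prints 13, i.e. roughly 13 tokens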

Model Selection

Choose between standard and pro models based on your needs:

  • standard: Efficient model for most use cases. Lower token usage and cost.
  • pro: Advanced model with enhanced capabilities. Uses 4x more tokens, resulting in 4x higher costs.
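
To opt into the pro model, set the model field explicitly. A sketch of the request body (same endpoint and headers as the example above; the query is illustrative):

JSON
{
  "message": "Design a chunking strategy for long legal contracts with nested clauses.",
  "model": "pro",
  "effort": "high"
}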

💡 Usage Tip

For most production applications, we recommend using effort="medium" with verbosity="medium" for balanced performance. Use the pro model only when you need the most advanced capabilities, as it uses 4x more tokens and costs proportionally more.

Best Practices

  • Be Specific: The more specific your message and context, the more targeted the response will be.
  • System Instructions: Use the system_prompt parameter to provide essential instructions that shape how the model responds.
  • Context Length: While you can provide extensive context, focus on the most relevant information to get the best results.
  • Validate Responses: For critical applications, implement validation of AI responses before presenting them to users (see the sketch after this list).
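
As a minimal sketch of response validation, the snippet below extracts the response field with jq (assumed installed) and substitutes a fallback message when the field is missing or empty; the fallback text is illustrative:

Bash
# Extract the response text; fall back if the field is missing or empty
answer=$(curl -s -X POST https://api.vectorforgeai.com/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Team-Token: YOUR_TEAM_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "What is vector embedding?"}' | jq -r '.response // empty')

if [ -z "$answer" ]; then
  echo "Sorry, something went wrong. Please try again."  # illustrative fallback
else
  echo "$answer"
fi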

Next Steps

To build more interactive AI experiences, explore the other endpoints in the VectorForgeAI API documentation.

Need Help?

If you're having trouble with LLM completions or have questions, we're here to help!