Hello World, Hello Tokens

If you're starting from scratch with AI engineering, APIs might be your best entry point. Early on in my journey, I found plenty of resources, but they were often either too simplistic or tied too closely to a particular vendor, making it challenging for me to see the bigger picture.
What I lacked was the practical knowledge gained from working with real-world systems. If I were starting over, this is exactly where I'd begin: experimenting, running code, and building an understanding by seeing how concepts work in practice. Calling APIs is your "Hello, World."
I'm motivated to share what I've learned through my own experience, which is why I'm writing this Getting Started series. I want to provide you with foundational concepts paired with code so you can see how things function in action.
In this post, we'll call the OpenAI, Anthropic, and Gemini APIs, and explore streaming responses, sampling parameters, structured outputs, and function calling (tool use).
Hello Unicorn 🦄
Let's start with a simple curl request:
curl "https://api.openai.com/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-4.1",
"messages": [
{
"role": "user",
"content": "Write a one-sentence bedtime story about a unicorn."
}
]
}'
Once you've obtained an API key from OpenAI, you can call the Chat Completions API as shown above. You send a prompt, in this case "Write a one-sentence bedtime story about a unicorn," and the API responds with generated content as a structured JSON response.
You'll receive a completion like:
Under the silver glow of a sleepy moon, a gentle unicorn tiptoed through a field of starlit daisies, scattering sweet dreams with every silent step.
We can achieve the same result using Python:
import os
import requests
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
    },
    json={
        "model": "gpt-4.1",
        "messages": [
            {
                "role": "user",
                "content": "Write a one-sentence bedtime story about a unicorn.",
            }
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
While Python's requests library works fine, the official OpenAI client library, which you can install with pip install openai, is typically preferred. It provides additional features such as streaming responses, asynchronous requests, automatic retries, and better error handling, so in practice the convenience of the official client usually wins out.
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": "Write a one-sentence bedtime story about a unicorn.",
        }
    ],
)
print(completion.choices[0].message.content)
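The official client also covers asynchronous requests and automatic retries. Here's a minimal sketch, assuming a recent SDK version that exposes the AsyncOpenAI class and the max_retries client option:

import asyncio

from openai import AsyncOpenAI

# max_retries asks the SDK to retry transient failures (e.g. rate limits) automatically.
client = AsyncOpenAI(max_retries=3)

async def main() -> None:
    completion = await client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}],
    )
    print(completion.choices[0].message.content)

asyncio.run(main())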
We've been using the GPT-4.1 model here, but several other options are available. At the time of writing, these include GPT-4o, as well as lighter, more cost-effective models like GPT-4o-mini and GPT-4.1-mini. There are also reasoning models, such as o3 and o4-mini, which spend more computation per request to work through tasks like math and coding.
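Switching models is usually just a matter of changing the model string; a rough sketch (which models you can call depends on your account, and reasoning models may not accept every sampling parameter):

completion = client.chat.completions.create(
    model="o4-mini",  # a reasoning model; swap in "gpt-4o-mini" or "gpt-4.1-mini" the same way
    messages=[{"role": "user", "content": "Walk through the steps to compute 23 * 47."}],
)
print(completion.choices[0].message.content)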
Not the only circus in town 🎪
OpenAI isn't the only circus in town, though. There are other providers, including Anthropic, Google, and DeepSeek, to name a few, and each offers its own API and SDKs in various programming languages.
For example, we could perform the same request using Anthropic's API:
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Write a one-sentence bedtime story about a unicorn.",
        }
    ],
)

# message.content is a list of content blocks; print the text of the first one.
print(message.content[0].text)
You'll need to pip install anthropic, and just like with OpenAI, you'll need an API key from Anthropic.
Or we could make the same call to Google's Gemini API. Before you start, pip install google-genai and set your API key in your environment with export GEMINI_API_KEY=your-api-key. We can send the same prompt as before:
import os
from google import genai
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Write a one-sentence bedtime story about a unicorn.",
)
print(response.text)
You'll notice something different here, though: there is no explicit "user" role. The Gemini SDK treats your input as a user message, so you don't need to specify roles, which keeps things concise for single-turn conversations.
But most chats are rarely this brief. Gemini can handle multi-turn conversations too, of course, but for now, let's jump back to OpenAI. Most APIs, including OpenAI's and Anthropic's, rely on explicit roles ("user", "assistant", and sometimes "system") and message history to manage context and keep the conversation coherent.
Here is what that looks like in practice:
from openai import OpenAI
client = OpenAI()
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the World Series in 2020?"},
    {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020.",
    },
    {"role": "user", "content": "Where was it played?"},
]
completion = client.chat.completions.create(model="gpt-4.1", messages=messages)
print(completion.choices[0].message.content)
The response to the final user prompt makes sense only when considering the context provided by earlier messages:
The 2020 World Series was played at Globe Life Field in Arlington, Texas. This was the first time the entire World Series was held at a neutral site, due to the COVID-19 pandemic.
Putting words in the unicorn's mouth
Up until now, the assistant's responses have been shaped only by the back-and-forth between user and assistant messages. You may have noticed, though, that our last example included a system prompt: "You are a helpful assistant."
A system message is a special instruction placed at the beginning of the conversation. It defines the assistant's behavior, persona, and boundaries. For example, you might want the model to be formal, to answer only as a financial advisor, or even respond with the wisdom of a unicorn:
messages = [
    {
        "role": "system",
        "content": "You are a wise unicorn who gives thoughtful, magical advice to all who seek it. Answer in one sentence.",
    },
    {"role": "user", "content": "How can I become more creative in my daily life?"},
]
Dance with curiosity, invite playfulness into routine tasks, and let your imagination gallop freely over the open meadows of possibility each day.
Adding a system message at the start tells the model how to behave throughout the entire conversation. Most providers support this feature, though each model may interpret the prompt differently.
You can use system prompts to:
- Set the assistant tone or domain expertise ("You are an expert Python tutor.")
- Enforce a specific style ("Always answer in bullet points.")
- Restrict certain actions ("Do not provide medical advice.")
- Establish a persona ("You speak like a swashbuckling pirate.")
In practical terms, the system prompt establishes the ground rules of the conversation.
Sampling parameters
How else can we control the model's responses? Beyond prompts, you can shape the output by adjusting sampling parameters. Each parameter influences how the model selects its next token, giving you some control over the diversity or precision of its outputs.
Temperature
Temperature controls the randomness of the model's responses. Setting it closer to 0 produces more deterministic, predictable completions, which can be useful where precision and some degree of reproducibility is the priority. Increasing the temperature (e.g., 0.7 to 1.0) makes responses more diverse, which could be preferred for generating ideas, for example.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Suggest three unusual ice cream flavors."}],
    temperature=1.0,
)
Higher temperature often yields more surprising or less conventional responses.
Nucleus sampling (top_p)
Nucleus sampling, or top_p, controls randomness by limiting which tokens are considered based on their cumulative probability. For example, a top_p of 0.1 restricts sampling to the tokens that make up the top 10% of probability mass, making responses predictable and conservative. A higher top_p (close to 1.0) lets less probable tokens into consideration, resulting in more varied responses.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Describe an innovative business idea."}],
    top_p=0.8,
)
Generally, you'll want to use either temperature or top_p, but not both at the same time.
Max tokens
Max tokens limits the length of the model's completion. It's especially useful when concise responses are required or when controlling token usage and associated costs.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Briefly summarize quantum computing."}],
    max_tokens=50,
)
Generation simply stops once the limit is reached, so responses may be cut off mid-sentence.
Stop sequences
Stop sequences instruct the model to stop output after encountering specific tokens or sequences. This helps avoid extraneous text.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "List items to pack for a trip:"}],
    stop=["\n\n"],
)
In this example, the model stops generating output after the first occurrence of a double newline. Note that stop sequences are generally not supported with reasoning models like o3 and o4-mini.
Presence and frequency penalties
Presence penalty encourages introducing new topics by penalizing previously mentioned tokens. Frequency penalty reduces repetition by penalizing tokens according to how frequently they've appeared. Reasonable values typically range from 0.1 to 1 to moderately reduce repetition. Higher values (up to 2) strongly discourage repetition but may negatively impact output quality.
completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Suggest creative names for a fantasy tavern."}],
    presence_penalty=0.7,
    frequency_penalty=0.7,
)
Each sampling parameter lets you shape how the model behaves, so you can tailor responses to a particular use case. The ones you'll reach for most often are temperature, top_p, and max_tokens.
Streaming
If you've used ChatGPT, you've probably noticed answers appearing incrementally, token by token like a typewriter, rather than all at once. This behavior comes from streaming responses, a feature users now expect from an AI application.
Fortunately, implementing streaming is straightforward:
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}
    ],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
When you run this, you'll see an incremental response token by token rather than a complete generation all at once. Streaming reduces perceived latency by delivering immediate feedback.
According to one study, the human reading speed averages 238–260 words per minute, depending on the material being read. This translates to roughly four words (or about six tokens) per second. Users will scan content even faster though, and this observation underscores why incremental responses should be seen as a baseline requirement in any interactive application.
Server-Sent Events (SSE)
Behind streaming responses is the Server-Sent Events (SSE) protocol, an HTTP-based mechanism that enables servers to push incremental data to clients. SSE establishes a one-way connection, maintained over a standard HTTP request, in which the server sends data with the text/event-stream content type.
Unlike polling, where the client periodically requests updates, SSE maintains a persistent HTTP connection, reducing overhead and latency. In practice, when you set stream=True in the OpenAI API call, the endpoint initiates an SSE stream and begins pushing incremental chunks of the response as soon as tokens become available from the model.
Each chunk is transmitted as an individual event whose data is a JSON payload with specific fields, which your client code parses and displays immediately. Although WebSockets offer similar real-time capabilities, SSE is lighter weight and sufficient for unidirectional streaming scenarios like AI-generated text completions.
SSE also integrates naturally with HTTP, which is presumably why it is a common choice by providers for streaming completions from APIs, not only from OpenAI but from others like Anthropic and Google's Gemini API as well.
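Concretely, the raw stream looks roughly like the following. This is abbreviated for illustration; real events carry additional fields such as id, model, and finish_reason:

data: {"choices":[{"delta":{"content":"Under"}}]}

data: {"choices":[{"delta":{"content":" the"}}]}

data: {"choices":[{"delta":{"content":" silver"}}]}

data: [DONE]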
Let's actually put this into practice by manually handling a simple SSE stream in Python. Since SSE is just event-formatted text arriving incrementally over a persistent HTTP connection, Python can consume these streams directly with a combination of the requests library and the minimal sseclient-py library.
import os
import json
import requests
from sseclient import SSEClient
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
}

data = {
    "model": "gpt-4.1",
    "messages": [
        {
            "role": "user",
            "content": "Write a one-sentence bedtime story about a unicorn.",
        }
    ],
    "stream": True,
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers=headers,
    json=data,
    stream=True,
)

client = SSEClient(response)
for event in client.events():
    if event.data != "[DONE]":
        payload = json.loads(event.data)
        chunk = payload["choices"][0]["delta"].get("content", "")
        print(chunk, end="", flush=True)
In this example, we're explicitly:
- Making an HTTP request to the OpenAI API with stream=True.
- Creating an SSE client from the streaming response.
- Iterating over incoming SSE events, parsing each JSON payload.
- Printing tokens immediately as they're received.
In this snippet, the wire protocol is visible. In fact, most production applications will forward the same text/event-stream bytes straight through to the browser, where you can consume the stream with the browser's EventSource API.
<script>
  const es = new EventSource("/api/chat?conversation_id=123");
  es.onmessage = (e) => {
    if (e.data === "[DONE]") {
      es.close();
      return;
    }
    const payload = JSON.parse(e.data);
    const token = payload.choices[0].delta.content || "";
    document.getElementById("out").textContent += token;
  };
  es.onerror = () => es.close();
</script>
EventSource issues a GET request, but your UI will most likely need to originate a POST so the prompt can travel in the request body. A common pattern is to keep the POST to the provider on the server and proxy the resulting SSE stream through a dedicated GET endpoint such as /api/chat, as in the snippet above.
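Here's a minimal sketch of such a proxy, assuming FastAPI; the /api/chat path and the conversation_id parameter are placeholders, and a real app would load the conversation's stored messages instead of a hard-coded prompt:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/api/chat")
def chat(conversation_id: str):
    # In a real app you'd look up the conversation's stored messages by ID;
    # a hard-coded prompt keeps this sketch self-contained.
    messages = [
        {"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}
    ]

    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4.1", messages=messages, stream=True
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content or ""
            # Re-emit each chunk in SSE format for the browser's EventSource.
            payload = {"choices": [{"delta": {"content": token}}]}
            yield f"data: {json.dumps(payload)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")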
I haven't found this to be a terribly convenient workflow, though. A framework like Vercel's AI SDK can simplify relaying SSE streams from your backend to the frontend: it abstracts away manual stream handling and provides a consistent client-side interface for consuming streaming responses from multiple providers (OpenAI, Anthropic, Gemini).
Structured outputs
Natural language responses are great for talking to users, but production systems often need the model to return machine-readable data like keys, coordinates, or arrays that downstream code can parse without string gymnastics.
You can actually prompt a model to "respond with JSON" and it will comply most of the time, yet it may still wrap the object in explanatory text. OpenAI's JSON mode eliminates that risk by ensuring the server responds with a syntactically valid JSON object and nothing else.
from openai import OpenAI
import json
client = OpenAI()
messages = [
    {
        "role": "user",
        "content": "Give me a recipe JSON object with title, ingredients[], and steps[] for chocolate mug cake.",
    },
]

reply = client.chat.completions.create(
    model="gpt-4o-mini", response_format={"type": "json_object"}, messages=messages
)
recipe = json.loads(reply.choices[0].message.content)
print(recipe)
A typical response looks like:
{
  "title": "Chocolate Mug Cake",
  "ingredients": [
    "flour",
    "cocoa powder",
    "milk",
    "oil",
    "sugar",
    "baking powder"
  ],
  "steps": [
    "mix dry",
    "add wet",
    "microwave 80 s"
  ]
}
More than likely, though, you'll need an exact structure so the payload can be deserialized into a Pydantic model. To do so, you'll need to:
- Provide a JSON schema.
- Wrap it in response_format={"type": "json_schema", "json_schema": {...}}.
- Set "strict": true inside the json_schema object.
- List every property in "required", and set "additionalProperties": false on every object.
import os
import json
from openai import OpenAI
recipe_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
        "steps": {"type": "array", "items": {"type": "string"}},
        "prep_minutes": {"type": "integer"},
    },
    "required": ["title", "ingredients", "steps", "prep_minutes"],
    "additionalProperties": False,
}

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "recipe", "strict": True, "schema": recipe_schema},
    },
    messages=[
        {"role": "user", "content": "Return a recipe object for chocolate mug cake."}
    ],
)

print(json.dumps(json.loads(reply.choices[0].message.content), indent=2))
If any field is missing, extra, or of the wrong type, the API returns HTTP 400 instead of bad data.
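If you'd rather not write the schema by hand, recent versions of the OpenAI Python SDK can derive it from a Pydantic model through the beta parse helper. A minimal sketch, assuming that helper is available in your SDK version:

from openai import OpenAI
from pydantic import BaseModel

class Recipe(BaseModel):
    title: str
    ingredients: list[str]
    steps: list[str]
    prep_minutes: int

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Return a recipe object for chocolate mug cake."}],
    response_format=Recipe,  # the SDK converts the Pydantic model into a strict JSON schema
)

recipe = completion.choices[0].message.parsed  # a validated Recipe instance
print(recipe.title, recipe.prep_minutes)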
Function calling
Language models predict text tokens based on training data. They cannot fetch live data, perform calculations, or interact with the environment on their own. Function calling or tool use lets a model explicitly request these external actions using structured JSON calls.
Your application executes the requested function and returns results for the model to incorporate into its natural language response. For example, without function calling:
# Model attempts to answer with stale training data
> "What's the current stock price of NVDA?"
"I don't have access to real-time data..."
# Model attempts arithmetic with large numbers
> "What's 48,293,847 * 392,847,592?"
(Likely incorrect response)
# Model can't execute actions
> "Send an email to john@example.com"
"I cannot send emails..."
With tool use, the model can handle these situations:
# Model recognizes it needs live data
> "What's the current stock price of NVDA?"
{"function": "get_stock_price", "args": {"symbol": "NVDA"}}
{"price": 134.81, "time": "2025-05-28T10:30:00Z"}
"NVDA is currently trading at $134.81."
# Model delegates computation
> "What's 48,293,847 * 392,847,592?"
{"function": "calculate", "args": {"expression": "48293847 * 392847592"}}
{"result": "18972121502366424"}
"The result is 18,972,121,502,366,424."
Let's see how that looks in practice. Here is a minimal example:
from openai import OpenAI
import json
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
tools = [
    {
        "type": "function",
        "function": {
            "name": "add",
            "description": "sum two integers",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"],
            },
        },
    }
]

def add(a: int, b: int) -> int:
    return a + b

messages = [{"role": "user", "content": "What is 123 456 + 654 321?"}]

response = client.chat.completions.create(
    model="gpt-4o-mini", tools=tools, messages=messages
)

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = {"result": add(**args)}

print(dict(name=tool_call.function.name, arguments=tool_call.function.arguments))
print(result)

messages.extend(
    [
        response.choices[0].message,
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": json.dumps(result),
        },
    ]
)
final_response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(final_response.choices[0].message.content)
Output:
{'name': 'add', 'arguments': '{"a":123456,"b":654321}'}
{'result': 777777}
123,456 + 654,321 equals 777,777.
Sequential and parallel tool use
Function calling isn't limited to single interactions. Some tasks require multiple tools executed either sequentially or in parallel.
Sequential tool use occurs when the model calls a tool, waits for its result, and then decides whether another tool call is necessary based on the outcome. For example, when asked, "List my public repositories on GitHub," the model might first call get_me to retrieve your GitHub username, then call search_repositories(username) in a subsequent turn. Each call depends on the results of the previous one, requiring a conversational loop that continues until the model is finished.
Parallel tool use happens when the model calls multiple tools at the same time. For instance, "What's the weather in Paris, Berlin, and Rome today?" can trigger parallel calls for each city's current conditions. Modern models typically support this by returning multiple tool_calls in a single response; your application then executes them concurrently and returns each result individually in follow-up tool messages.
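Both patterns boil down to the same loop: call the model, execute whatever tool calls it returns (one or many per turn), append the results, and stop once it replies in plain text. A minimal sketch of that loop, reusing the client, tools, and add function from the earlier example along with a small handlers dispatch table of our own:

import json

def run_with_tools(client, model, messages, tools, handlers, max_turns=5):
    """Loop until the model answers without requesting any more tools."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model, tools=tools, messages=messages
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # plain-text answer; we're done

        messages.append(message)
        # Execute every tool call returned this turn (this covers the parallel case too).
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = handlers[tool_call.function.name](**args)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps({"result": result}),
                }
            )
    raise RuntimeError("Gave up after too many tool-calling turns")

# Reusing client, tools, and add() from the earlier example:
answer = run_with_tools(
    client,
    "gpt-4o-mini",
    [{"role": "user", "content": "What is 123 456 + 654 321?"}],
    tools,
    {"add": add},
)
print(answer)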
Managing complexity with MCP
As the number of functions or tools in your application increases, maintaining custom integrations can become challenging. Model Context Protocol (MCP) provides an open standard, model-agnostic way for language models to call external tools or fetch live data.
MCP defines a shared JSON-RPC interface through which models request actions. Instead of manually routing function calls, your application communicates with an MCP server that matches each request to the appropriate handler. That could be local code, a REST API, or external services like GitHub, Slack, or a database.
In practice, MCP replaces ad hoc dispatch logic in your application with a single standardized endpoint. It doesn't alter your core function-calling loop. Rather, it streamlines integration as your application grows, making it easier to manage multiple tools and services at scale.