Easy way to run Stable Diffusion 3 using API

Updated: April 22 2024 18:02


In the ever-evolving world of AI, a new force has emerged that is redefining the way we create and visualize images. Stable Diffusion 3, or SD3 for short, is the latest text-to-image AI model that has taken the creative world by storm. Developed by Stability AI, Stable Diffusion 3 builds upon the success of its predecessors, leveraging the power of diffusion models and large-scale AI architectures to create images of unparalleled quality. But what truly sets SD3 apart is its unique approach to image generation, combining cutting-edge techniques like Diffusion Transformers and Flow Matching to ensure seamless and cohesive visuals.

On April 17 2024, Stability AI announced the availability of Stable Diffusion 3 and Stable Diffusion 3 Turbo on their Developer Platform API. This marks a significant milestone in the field of generative AI, offering enhanced capabilities in text-to-image generation systems. The company has partnered with Fireworks AI, which it describes as the fastest and most reliable API platform in the market. The model weights will also be made available for self-hosting with a Stability AI Membership in the near future.

A Tale of Expansion and Contraction

In a surprising turn of events, just a day after announcing expanded access to its new flagship model, the company announced the layoff of 20 employees, roughly 10 percent of Stability AI's workforce. Interim CEOs Shan Shan Wong and Christian LaForte explained in a memo to staff that the decision was part of a strategic plan. The aim is to reduce their cost base, strengthen support from investors and partners, and enable teams to continue developing and releasing innovative products.


Stability AI has had a tumultuous few months. Several high-profile researchers and its founder and CEO, Emad Mostaque, have left the company. Mostaque stepped down from his role and the company’s board in March, expressing his desire to "pursue decentralized AI". In his view, AI development and governance should be more transparent and distributed. Reports have also surfaced about conflicts with investors on matters such as the "fair use" of copyrighted material for training AI models and potential discussions about the company's acquisition. These disputes may have played a role in Mostaque's decision to step down.

While these recent changes are undoubtedly a significant setback, the company's continued product development and expansion suggest a commitment to moving forward. As the AI industry continues to evolve at a rapid pace, it remains to be seen how Stability AI will navigate these challenging waters.

How Stable Diffusion 3 Works

As revealed in the Stable Diffusion 3 research paper, this model equals or outperforms state-of-the-art text-to-image generation systems such as DALL-E 3 and Midjourney v6 in typography and prompt adherence, based on human preference evaluations¹. The new Multimodal Diffusion Transformer (MMDiT) architecture uses separate sets of weights for image and language representations, which improves text understanding and spelling capabilities compared to previous versions of Stable Diffusion.


At its core, Stable Diffusion is a powerful AI system that leverages latent diffusion models to transform textual prompts into strikingly realistic images. Trained on a vast dataset of 512x512 images from the LAION-5B database, this technology combines a latent diffusion approach with a frozen CLIP ViT-L/14 text encoder, enabling it to understand and interpret language with remarkable accuracy.

What sets Stable Diffusion 3 apart from its predecessors is its unique architecture, which incorporates a groundbreaking Multimodal Diffusion Transformer (MMDiT). This innovative approach utilizes separate sets of weights for image and language representations, resulting in improved text understanding and spelling capabilities, taking the model's interpretive abilities to new heights.
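To make the "separate weights per modality" idea concrete, here is a heavily simplified NumPy sketch of one MMDiT-style joint attention step. All names (`W_txt`, `W_img`, the dimensions) are illustrative assumptions, not the actual SD3 implementation: the point is only that text tokens and image patches get their own projection weights, yet attend over one concatenated sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                                # toy embedding dimension
txt = rng.standard_normal((4, d))    # 4 text tokens
img = rng.standard_normal((16, d))   # 16 latent image patches

# Separate projection weights per modality -- the core MMDiT idea
W_txt = {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkv"}
W_img = {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkv"}

def proj(x, W):
    return x @ W["q"], x @ W["k"], x @ W["v"]

q_t, k_t, v_t = proj(txt, W_txt)
q_i, k_i, v_i = proj(img, W_img)

# Joint attention over the concatenated sequence of both modalities
q = np.concatenate([q_t, q_i])
k = np.concatenate([k_t, k_i])
v = np.concatenate([v_t, v_i])
attn = softmax(q @ k.T / np.sqrt(d)) @ v

# Split the result back into the two streams
txt_out, img_out = attn[:4], attn[4:]
```

Because attention mixes both sequences, each image patch can directly attend to every text token (and vice versa), while the modality-specific weights let each stream keep its own representation space.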

Diffusion Transformers: Scalability and Performance Reimagined

At the heart of Stable Diffusion 3 lies the Diffusion Transformer (DiT), a class of diffusion models that harness the power of transformer architecture for image generation. Unlike traditional methods that rely on U-Net backbones, DiTs operate on latent patches, offering unparalleled scalability and performance.

Through extensive analysis, researchers have discovered that DiTs with higher computational power, achieved through increased transformer depth/width or a higher number of input tokens, consistently exhibit lower Frechet Inception Distance (FID). In layman's terms, this translates to higher-quality, more realistic images – a feat that has long eluded many generative models.

Flow Matching: Redefining Model Training

Complementing the Diffusion Transformer is the innovative Flow Matching (FM) technique, a model training approach that redefines Continuous Normalizing Flows (CNFs). By focusing on regressing vector fields of fixed conditional probability paths, FM eliminates the need for simulations, providing a robust and stable alternative for training diffusion models.

Empirical evaluations on the widely used ImageNet dataset have demonstrated that FM consistently outperforms traditional diffusion-based methods in terms of both likelihood (how probable the generated samples are) and sample quality. Moreover, FM enables fast and reliable sample generation using existing numerical Ordinary Differential Equation (ODE) solvers, further enhancing the model's efficiency and versatility.
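The "regressing vector fields of conditional probability paths" idea is easier to see in code. Below is a toy sketch of the conditional flow matching target for the straight-line (optimal-transport) path; the shapes and the perfect-model check are illustrative assumptions, not SD3's actual training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Straight-line conditional path: x_t = (1 - t) * x0 + t * x1,
# whose conditional vector field is simply u_t = x1 - x0.
x0 = rng.standard_normal((8, 2))   # noise samples
x1 = rng.standard_normal((8, 2))   # data samples
t = rng.uniform(size=(8, 1))       # random times in [0, 1]

x_t = (1.0 - t) * x0 + t * x1      # points on the probability path
u_t = x1 - x0                      # regression target for the network v_theta(x_t, t)

def fm_loss(v_pred, u_target):
    """Flow Matching objective: mean squared error between the
    predicted vector field and the conditional target."""
    return np.mean((v_pred - u_target) ** 2)

# A perfect model would predict u_t exactly and incur zero loss.
print(fm_loss(u_t, u_t))   # 0.0
```

No simulation of the flow is needed during training: the network only regresses `u_t` at sampled points, and at inference time the learned vector field is integrated with an off-the-shelf ODE solver.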

This groundbreaking model is not merely a tool; it's a collaborator, a companion that empowers artists, designers, and content creators to bring their imagination to life in ways previously unimaginable. With its capacity to generate high-quality images, tackle complex text integration challenges, and faithfully follow prompts, SD3 stands as a catalyst for a creative revolution, ushering in a new era of visual storytelling and artistic expression.

Easy Access to Stable Diffusion 3 using the API


  1. Create an account on Stability AI Developer platform.


  2. A new account comes with some free credits when you sign up. Add more credits on the billing page. Each image generation consumes credits. See the pricing page for the credit cost of each image.


  3. Go to your account’s API keys page and create an API key, or use an existing one. Copy the API key.


  4. Open the Google Colab notebook to start generating SD3 images.


  5. Run the notebook by clicking the play button next to the prompt (see the left image). You might also get a warning, as shown in the right image; just click "Run anyway" to continue.


  6. Paste your API key and press Enter when prompted. You should get an image generated with Stable Diffusion 3 via the API.


  7. You can also change settings such as 'prompt', 'negative_prompt', 'aspect_ratio', 'seed', and 'output_format':


  8. Here is the output image generated from the prompt. You can also save the image to your computer.


  9. When you are done, don’t forget to click "Disconnect and delete runtime".
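If you prefer calling the API directly instead of through the Colab notebook, the steps above boil down to one authenticated POST request. The sketch below targets the v2beta stable-image endpoint and the form-field names used at the time of writing; double-check them against Stability AI's API reference, as the parameters and URL may change. The `STABILITY_API_KEY` environment variable and `generate` helper are this sketch's own conventions.

```python
import os
import requests

API_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"

def build_request(prompt, negative_prompt="", aspect_ratio="1:1",
                  seed=0, output_format="png"):
    """Assemble headers and form fields for one SD3 generation call."""
    headers = {
        "authorization": f"Bearer {os.environ.get('STABILITY_API_KEY', '')}",
        "accept": "image/*",           # ask for raw image bytes in the response
    }
    data = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "aspect_ratio": aspect_ratio,
        "seed": seed,
        "output_format": output_format,
    }
    return headers, data

def generate(prompt, out_path="sd3_output.png", **kwargs):
    headers, data = build_request(prompt, **kwargs)
    # files={} forces multipart/form-data, which this endpoint expects
    resp = requests.post(API_URL, headers=headers, files={"none": ""}, data=data)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)          # each successful call consumes credits

# Uncomment with a valid STABILITY_API_KEY set in your environment:
# generate("A red sofa on top of a white building", aspect_ratio="16:9")
```

Remember that every successful generation deducts credits from your account, exactly as in the notebook.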



Stable Diffusion 3 Sample images

Here are a few Stable Diffusion 3 generated images testing its ability to render text. Text rendering is definitely the best among all Stable Diffusion models, although it still makes mistakes sometimes.


Prompt: A red sofa on top of a white building. Graffiti with the text "the best view in the city".








Prompt: Portrait photograph of an anthropomorphic tortoise seated on a New York City subway train.








Prompt: A cardboard box with the phrase “they say it's not good to think in here”, the cardboard box is large and sits on a theater stage.






