The rapid growth of web content presents a challenge for efficiently extracting and summarizing relevant information. In this tutorial, we demonstrate how to leverage Firecrawl for web scraping and process the extracted data using AI models like Google Gemini. By integrating these tools in Google Colab, we create an end-to-end workflow that scrapes web pages, retrieves meaningful content, and generates concise summaries using state-of-the-art language models. Whether you want to automate research, extract insights from articles, or build AI-powered applications, this tutorial provides a robust and adaptable solution.
!pip install google-generativeai firecrawl-py
First, we install google-generativeai and firecrawl-py, the two essential libraries required for this tutorial. google-generativeai provides access to Google’s Gemini API for AI-powered text generation, while firecrawl-py enables web scraping by fetching content from web pages in a structured format.
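As an optional sanity check, you can confirm that both packages are present in the Colab runtime before moving on. The snippet below simply prints the installed versions of the two distributions passed to pip above.

import importlib.metadata
# Optional sanity check: confirm both distributions are installed and print their versions.
for pkg in ("google-generativeai", "firecrawl-py"):
    print(pkg, importlib.metadata.version(pkg))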
import os
from getpass import getpass
# Input your API key (it will be hidden as you type)
os.environ["FIRECRAWL_API_KEY"] = getpass("Enter your Firecrawl API key: ")
Then we securely set the Firecrawl API key as an environment variable in Google Colab. getpass() prompts for the key without displaying it, ensuring confidentiality, and storing it in os.environ allows seamless authentication for Firecrawl’s web scraping functions throughout the session.
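If you expect to re-run the cell, a small variant of the same pattern only prompts when the variable is not already set, so you don’t have to re-enter the key every time. This is a minor convenience sketch, not part of the original flow.

# Optional variant: prompt only if the key is not already in the environment.
if not os.environ.get("FIRECRAWL_API_KEY"):
    os.environ["FIRECRAWL_API_KEY"] = getpass("Enter your Firecrawl API key: ")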
from firecrawl import FirecrawlApp
firecrawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
target_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
result = firecrawl_app.scrape_url(target_url)
page_content = result.get("markdown", "")
print("Scraped content length:", len(page_content))
We initialize Firecrawl by creating a FirecrawlApp instance with the stored API key, scrape the specified webpage (in this case, Wikipedia’s article on the Python programming language), and extract its content in Markdown format. Finally, we print the length of the scraped content to verify successful retrieval before further processing.
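Before handing the text to a language model, it can also help to preview the start of the scraped Markdown and fail early if the scrape returned nothing. The check below is a minimal sketch; the 500-character preview length is arbitrary.

# Optional: fail early on an empty scrape and preview the first 500 characters.
if not page_content:
    raise ValueError(f"No content scraped from {target_url}")
print(page_content[:500])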
import google.generativeai as genai
from getpass import getpass
# Securely input your Gemini API Key
GEMINI_API_KEY = getpass("Enter your Google Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
We initialize the Google Gemini API by securely capturing the API key with getpass(), preventing it from being displayed in plain text. The genai.configure(api_key=GEMINI_API_KEY) call sets up the API client, allowing seamless interaction with Google’s Gemini models for text generation and summarization tasks. This ensures secure authentication before making requests to the model.
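If you prefer to keep both keys in one place, you can mirror the Firecrawl pattern and store the Gemini key as an environment variable as well. This is purely a convenience; the variable name GEMINI_API_KEY is our own choice here, not something the library requires.

# Optional: keep the Gemini key in an environment variable, mirroring the Firecrawl setup.
os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
genai.configure(api_key=os.environ["GEMINI_API_KEY"])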
for model in genai.list_models():
    print(model.name)
We iterate through the models available in the Google Gemini API using genai.list_models() and print their names. This lets us verify which models are accessible with our API key and select an appropriate one for tasks like text generation or summarization. If a requested model is not found, this step also aids debugging and choosing an alternative.
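If the list is long, you can narrow it to models that advertise support for the generateContent method, which is what the summarization step below relies on. This filter follows the pattern in Google’s quickstart and assumes the supported_generation_methods field is available in your SDK version.

# Optional: show only models that support generateContent.
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)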
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(f"Summarize this:\n\n{page_content[:4000]}")
print("Summary:\n", response.text)
Finally, we initialize the Gemini 1.5 Pro model with genai.GenerativeModel("gemini-1.5-pro") and send a request to summarize the scraped content. The input text is limited to the first 4,000 characters to stay within API constraints. The model processes the request and returns a concise summary, which is then printed, providing a structured, AI-generated overview of the extracted webpage content.
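To cover the full page rather than only its first 4,000 characters, one simple option is a map-reduce style pass: summarize fixed-size chunks, then summarize the concatenated partial summaries. The helper below is a rough sketch; the chunk size, prompts, and function name are illustrative, and each chunk costs an additional API call against your quota.

# Illustrative sketch: chunked summarization for pages longer than the 4,000-character cutoff.
def summarize_long_text(text, chunk_size=4000):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Summarize each chunk individually (one API call per chunk).
    partial = [model.generate_content(f"Summarize this:\n\n{chunk}").text for chunk in chunks]
    # Then condense the partial summaries into a single overview.
    combined = "\n".join(partial)
    return model.generate_content(f"Combine these summaries into one concise summary:\n\n{combined}").text

full_summary = summarize_long_text(page_content)
print("Full summary:\n", full_summary)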
In conclusion, by combining Firecrawl and Google Gemini, we have created an automated pipeline that scrapes web content and generates meaningful summaries with minimal effort. The same workflow can be adapted to other Gemini models depending on API availability and quota constraints. Whether you’re working on NLP applications, research automation, or content aggregation, this approach enables efficient data extraction and summarization at scale.