Transcribing a YouTube Video Using Whisper for GPT-3 Text Summarization

6 min read · Feb 9


Photo by Possessed Photography on Unsplash

In this tutorial, we are going to build something fun and interesting. It is no news that we are now in the “GPT era” (lol). If you are new to GPT-3 or to language models, a branch of Artificial Intelligence, then this is the right place to get started. GPT stands for Generative Pretrained Transformer, and you have probably heard about ChatGPT, a conversational bot that answers questions amazingly well. ChatGPT is a variant of GPT-3.5, which is itself an improved version of GPT-3.

ChatGPT OpenAI

We will be using GPT-3 and Whisper, two models from OpenAI. Whisper is open-sourced, while GPT-3 is available through the OpenAI API. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It enables transcription in multiple languages, as well as translation from those languages into English.

The outline for this tutorial is as follows:

  • Install the necessary libraries and packages to help us read our YouTube URL
  • Install Whisper, filter the YouTube video down to its audio streams, and transcribe the audio
  • Summarize the text with GPT-3

We are going to transcribe the 60-minute recording of Fed Chairman Powell’s November 2nd speech. You can listen to the speech here. The speech itself starts about 6 min and 18 seconds into the recorded video and ends at around 51 min and 33 seconds. We are going to use that information to trim the recording down to the part we want to transcribe and summarize.

To reduce the computational cost, we are going to use Google Colab. If you have never used it, you can follow this guide to create your own notebook. Google Colab lets us choose the hardware that runs our model. In this tutorial, we will use a GPU; the other options are CPU and TPU. A GPU (graphics processing unit) speeds up the matrix computations that models like Whisper rely on. Carrying out this tutorial on the CPU would be slow.

To get started, we will install the OpenAI Whisper Python package using pip.

We will extract the audio from our YouTube video using the pytube Python package. So let’s pip install it as well.
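Both installs can be done in one step. The following is a minimal sketch, assuming the PyPI package names `openai-whisper` (which provides the `whisper` module) and `pytube`; the helper names `pip_install_command` and `install` are hypothetical and only used for illustration.

```python
import subprocess
import sys

# Assumed PyPI names: "openai-whisper" provides the `whisper` module.
PACKAGES = ["openai-whisper", "pytube"]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", *packages]


def install(packages=PACKAGES):
    """Run pip; raises CalledProcessError if the install fails."""
    subprocess.run(pip_install_command(packages), check=True)

# Call install() once at the top of the notebook.
```

In a Colab cell, the shorthand `!pip install -U openai-whisper pytube` should have the same effect.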

Now, let’s import our Whisper model and the pytube package. Whisper comes in several variants of different sizes, so which one you use is up to you: larger models are generally more accurate but slower. We will use the base model, which has 74M parameters.

The following code shows how to load the base Whisper model and instantiate a pytube YouTube object by passing the URL of the video we want to transcribe.
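As a sketch of that step (not necessarily the exact cell used here), the function below loads the model with `whisper.load_model` and wraps the URL in `pytube.YouTube`. `VIDEO_URL` is a hypothetical placeholder and `load_model_and_video` is an illustrative name.

```python
# Hypothetical placeholder: substitute the URL of the video you want to transcribe.
VIDEO_URL = "https://www.youtube.com/watch?v=VIDEO_ID"


def load_model_and_video(url, model_name="base"):
    """Return a Whisper model and a pytube YouTube object for `url`."""
    # Imported lazily so the function can be defined before the packages are installed.
    import whisper
    from pytube import YouTube

    model = whisper.load_model(model_name)  # downloads the weights on first use
    video = YouTube(url)
    return model, video

# model, video = load_model_and_video(VIDEO_URL)
# print(video.title)    # title of the video
# print(video.streams)  # every available stream
```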

We can extract its title and check for other attributes…

Looking at the second cell above, we can see the different streams of the YouTube video, ranging from high quality to low quality. We can iterate through the list to find streams with different frame rates and resolutions. We were also able to retrieve the title of our YouTube video in the code cell above. Next, let’s filter down to the audio streams, since we are only interested in the audio channel. We will simply take the first audio stream, since audio quality is not our main concern here. If you want a higher-quality transcription, you will need a larger Whisper model.
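The filtering step can be sketched with pytube's stream query, which supports `filter(only_audio=True)` and `first()`. The function name `pick_audio_stream` is hypothetical.

```python
def pick_audio_stream(video):
    """Return the first audio-only stream of a pytube YouTube object."""
    audio_streams = video.streams.filter(only_audio=True)
    # first() returns None when no stream matches the filter.
    return audio_streams.first()

# stream = pick_audio_stream(video)
```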

Since we have selected the stream we want, let’s download and save this stream for further pre-processing.
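Downloading is a single call on the selected stream. In this sketch the file name `fed_meeting.mp4` is an assumption, chosen to match the trimmed file name used later, and `download_audio` is a hypothetical helper.

```python
def download_audio(stream, filename="fed_meeting.mp4"):
    """Save the stream in the working directory and return the saved path."""
    # pytube's Stream.download returns the path of the file it wrote.
    return stream.download(filename=filename)

# audio_path = download_audio(stream)
```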

We can do some additional processing on the audio file if we choose. We want to ignore the sound and speech before and after Jerome Powell speaks, so we’ll use ffmpeg to trim the file. The command starts the audio at the 375-second mark, where he opens with “Good afternoon”, continues for 2715 seconds, and chops off the rest of the audio. The result is saved in a new file called fed_meeting_trimmed.mp4.
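A minimal sketch of the trim, driven from Python. It assumes ffmpeg is on the PATH and that the downloaded file is named `fed_meeting.mp4` (an assumption); the helper names are hypothetical. The two `print` calls convert the offsets back to minutes and seconds as a sanity check against the timestamps quoted earlier.

```python
import subprocess

START_SECONDS = 375      # where the speech begins, per the text above
DURATION_SECONDS = 2715  # how much audio to keep


def to_timestamp(total_seconds):
    """Format a number of seconds as M:SS."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


def trim_command(src, dst, start=START_SECONDS, duration=DURATION_SECONDS):
    """Build the ffmpeg argument list for the trim."""
    # -ss seeks to the start offset, -t limits the output duration (both in seconds),
    # -y overwrites the output file if it already exists.
    return ["ffmpeg", "-y", "-ss", str(start), "-i", src, "-t", str(duration), dst]


def trim_audio(src="fed_meeting.mp4", dst="fed_meeting_trimmed.mp4"):
    """Run ffmpeg; raises CalledProcessError if the trim fails."""
    subprocess.run(trim_command(src, dst), check=True)


print(to_timestamp(START_SECONDS))                     # 6:15
print(to_timestamp(START_SECONDS + DURATION_SECONDS))  # 51:30
```

So the trimmed file covers roughly 6:15 to 51:30 of the recording, which brackets the speech.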

Now, let’s transcribe our audio stream. We are also going to measure how long this takes in Google Colab. If you want to transcribe faster, you can subscribe to Colab Pro. The result shows that it took 1 min 45 s to transcribe the audio stream.
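One way to time the transcription is sketched below. `model.transcribe` is Whisper's transcription call and returns a dictionary whose `"text"` entry holds the full transcript; `transcribe_with_timing` is a hypothetical helper, and the timing you get will depend on the hardware Colab assigns you.

```python
import time


def transcribe_with_timing(model, audio_path="fed_meeting_trimmed.mp4"):
    """Return (transcript_text, elapsed_seconds) for one transcription run."""
    start = time.perf_counter()
    result = model.transcribe(audio_path)  # dict with a "text" key
    elapsed = time.perf_counter() - start
    return result["text"], elapsed

# text, elapsed = transcribe_with_timing(model)
# print(f"Transcribed in {elapsed:.0f} seconds")
```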

Let’s view our transcribed audio as text😁….

Woah😱, that’s a lot. We’ve done a really good job! But it is not over yet😉. Let’s summarize our text using GPT-3 from OpenAI. This part is pretty simple but requires you to have an API key. Here is a simple guide to help you create your API key and build your own AI projects using OpenAI models.

Let’s install openai using pip install and also import it…
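Rather than pasting the key into the notebook, one option is to read it from an environment variable. This is a sketch under that assumption: the variable name `OPENAI_API_KEY` and the helper `read_api_key` are illustrative choices, not something this tutorial prescribes.

```python
import os


def read_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key

# After `pip install openai`:
# import openai
# openai.api_key = read_api_key()
```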

We are going to use davinci, the most advanced GPT-3 model. We break the entire transcribed text into chunks because of the token limit: davinci currently accepts a maximum of 4096 tokens per request, and our text is 9000+ tokens. We also write a prompt that tells the model what we want it to do (in our case, summarization).
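The chunk-then-summarize loop can be sketched as follows. Several things here are assumptions: chunks are cut by word count as a rough stand-in for the token limit (1500 words per chunk is a guess that leaves room for the prompt and the reply), the prompt wording is illustrative, and `davinci_complete` assumes the pre-1.0 `openai` package with the model name `text-davinci-003`. Newer versions of the package expose a different client interface, so that one function may need adapting.

```python
def chunk_words(text, max_words=1500):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize(text, complete, max_words=1500):
    """Summarize each chunk with `complete(prompt) -> str` and join the results."""
    summaries = []
    for chunk in chunk_words(text, max_words):
        prompt = f"Summarize the following text:\n\n{chunk}\n\nSummary:"
        summaries.append(complete(prompt).strip())
    return "\n".join(summaries)


def davinci_complete(prompt, model="text-davinci-003", max_tokens=256):
    """Assumed adapter for the pre-1.0 `openai` package (set openai.api_key first)."""
    import openai

    response = openai.Completion.create(model=model, prompt=prompt, max_tokens=max_tokens)
    return response["choices"][0]["text"]

# summary = summarize(text, davinci_complete)
```

Passing the completion function in as an argument keeps the chunking logic independent of the API, so it can be checked with a stand-in function before spending any API credits.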

The summarized text is shown below. The notebook output is too cramped to read the entire summary comfortably, so we are going to save it as a text file with Python. Feel free to save it in whatever format you wish.
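Saving the summary takes a couple of lines with `pathlib`. The file name `summary.txt` and the helper `save_summary` are hypothetical.

```python
from pathlib import Path


def save_summary(summary, path="summary.txt"):
    """Write the summary to a UTF-8 text file and return its path."""
    out = Path(path)
    out.write_text(summary, encoding="utf-8")
    return out

# save_summary(summary)
```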

Let’s view our summarized text in Notepad😁…

In conclusion, we were able to transcribe a one-hour YouTube video into text using Whisper and then summarize it using GPT-3, reducing the length of our text from 43229 to 5003 (roughly 88% shorter).