How to Transcribe YouTube Videos with Python

In May 2020, in the middle of some research, I came across a YouTube video on a piece of history that I was studying. The video was long and packed with reference information that I needed. Instead of replaying the clip over and over to catch each detail, I decided to find a way to obtain a transcript of the video.

Whenever I have a problem to solve, I jump straight to Python for the solution. With a little research, I found a Python library that retrieves the transcript of a YouTube video, called YouTube Transcript/Subtitle API.

This is a Python API that allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, as Selenium-based solutions do.

You can install this API by using the following command:

pip install youtube_transcript_api

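This API works by passing the YouTube video ID to the YouTubeTranscriptApi.get_transcript function. The function returns a list of dictionaries, one per caption line. Each dictionary holds the caption under the 'text' key, along with its 'start' time and 'duration' in seconds. The API has many other options that you can use.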
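
To make that return value concrete, here is a minimal sketch that turns a list of caption entries into timestamped lines. It does not call the API; the sample_transcript list is made-up data shaped like the entries described above, and format_with_timestamps is a hypothetical helper name, not part of the library.

```python
# Sample data shaped like the caption entries the API returns.
# The values are invented for illustration only.
sample_transcript = [
    {"text": "[Music]", "start": 0.0, "duration": 4.5},
    {"text": "cyber security is the term used to", "start": 4.5, "duration": 3.2},
    {"text": "characterize", "start": 65.25, "duration": 2.0},
]


def format_with_timestamps(entries):
    """Return one 'MM:SS text' line per caption entry (hypothetical helper)."""
    lines = []
    for entry in entries:
        # 'start' is the offset from the beginning of the video, in seconds
        minutes, seconds = divmod(int(entry["start"]), 60)
        lines.append(f"{minutes:02d}:{seconds:02d} {entry['text']}")
    return "\n".join(lines)


print(format_with_timestamps(sample_transcript))
```

In a real script, you would pass the list returned by the API in place of sample_transcript.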

The following code will allow you to grab the transcript/subtitles for a given YouTube video. An error message will be displayed if the video does not contain a transcript. The code will accept one of the following formats:

Full YouTube Video URL (https://www.youtube.com/watch?v=BvA0J_2ZpIQ)
Short YouTube Video URL (https://youtu.be/BvA0J_2ZpIQ)
YouTube Video ID (BvA0J_2ZpIQ)

The code will extract the video ID from the YouTube URL if the URL is used instead of a YouTube video ID.
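
As an aside before the full script: splitting the URL on "/" and "=" works for the three formats listed above, but it can pick up extra characters when the URL carries additional parameters such as a start time. The sketch below is one possible alternative using the standard library's urllib.parse. The function name extract_video_id is my own, and it assumes that URLs include the https:// scheme; anything it does not recognise as a YouTube URL is returned unchanged as a bare ID.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(video):
    """Return the video ID from a full URL, a short URL, or a bare ID."""
    parsed = urlparse(video)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short URL: the ID is the first path segment
        return parsed.path.lstrip("/").split("/")[0]

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Full URL: the ID is the value of the 'v' query parameter
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]

    # Anything else is treated as a bare video ID
    return video


for example in (
    "https://www.youtube.com/watch?v=BvA0J_2ZpIQ&t=10s",
    "https://youtu.be/BvA0J_2ZpIQ?t=5",
    "BvA0J_2ZpIQ",
):
    print(extract_video_id(example))
```

All three calls print the same ID, BvA0J_2ZpIQ.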

#!/usr/bin/python3

from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video):
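    # Extract the video ID from the YouTube URL
    if "youtu.be" in video:
        # Short URL: the ID is the path segment, minus any query string
        video_id = video.split("/")[3].split("?")[0]
    elif "watch?v=" in video:
        # Full URL: the ID follows "v=" and ends at the next parameter
        video_id = video.split("v=")[1].split("&")[0]
    else:
        # Video is not a URL but the ID
        video_id = video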

    try:
        # Create an empty list to store transcript lines
        transcript_lines = []
        # Iterate over the list of caption entries returned by the API
        for transcript in YouTubeTranscriptApi.get_transcript(video_id):
            # Append the text value of each line to the list
            transcript_lines.append(transcript['text'])
        # Join the list elements together separated by a new line and return the results
        return '\n'.join(transcript_lines)
    except Exception:
        # The transcript could not be retrieved (for example, the video has none)
        return "Transcription is not available for this video."


if __name__ == "__main__":
    # Pass the YouTube Video to the 'get_transcript' function
    print(get_transcript('https://youtu.be/iODxExWFx_0'))

Terminal Output:

[Music]
cyber security is the term used to
characterize
and collect all of the activities
policies
procedures and tools used in concert to
protect the information technology
systems and data
that is core to the functioning of the
modern world
the protections of cyber security apply
to physical systems
software systems as well as the people
who use them
there are multiple things that
organizations can do to implement cyber
security protections
these include having agreed policies and
procedures
regular staff awareness training deploy
strong password management tools
and multi-factor authentication
implement identity access management and
privileged access management encrypt
data at rest
and in transit over networks install
endpoint protection software and keep it
up to date
do frequent backups to secure locations
and keep all it systems including
network equipment up to date via
installation of the latest
security and operating systems updates
the insights external threat protection
suite enhances cyber security protection
measures
by applying threat intelligence allowing
the monitoring of multiple sources to
identify
threats to your organization on the
clear deep
and dark web that could indicate
potential cyber attack planning
or data exposure against your brand
people
or infrastructure to deliver actionable
insights that can be taken
to mitigate the attack and risk

The code above will display the video transcript in your terminal. You may wish to modify the code so that it saves the output into a file instead. This snippet of code will allow you to do just that.

#!/usr/bin/python3

from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video, file_name):
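    # Extract the video ID from the YouTube URL
    if "youtu.be" in video:
        # Short URL: the ID is the path segment, minus any query string
        video_id = video.split("/")[3].split("?")[0]
    elif "watch?v=" in video:
        # Full URL: the ID follows "v=" and ends at the next parameter
        video_id = video.split("v=")[1].split("&")[0]
    else:
        # Video is not a URL but the ID
        video_id = video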

    try:
        # Create an empty list to store transcript lines
        transcript_lines = []
        # Iterate over the list of caption entries returned by the API
        for transcript in YouTubeTranscriptApi.get_transcript(video_id):
            # Append the text value of each line to the list
            transcript_lines.append(transcript['text'])

        # Create a file and write the transcript to it
        with open(file_name, "w", encoding="utf-8") as file:
            # Join the list elements together separated by a new line and write to file
            file.write('\n'.join(transcript_lines))

        # Return a successful message
        return "The transcript has been created."
    except Exception:
        # The transcript could not be retrieved or the file could not be written
        return "Transcription is not available for this video."


if __name__ == "__main__":
    # Pass the YouTube Video and file name to the 'get_transcript' function
    print(get_transcript('https://youtu.be/iODxExWFx_0', "transcript.txt"))
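
The saved file keeps the short caption fragments seen in the terminal output, one per line. If you would rather read the transcript as flowing text, one option is to join the fragments and re-wrap them with the standard library's textwrap module. This is only a sketch: to_paragraph is a hypothetical helper of my own, and the 80-column default width is an arbitrary choice.

```python
import textwrap


def to_paragraph(caption_lines, width=80):
    """Join caption fragments into one block wrapped at `width` columns."""
    # Collapse whitespace inside each fragment and skip empty fragments,
    # then join everything with single spaces
    joined = " ".join(
        " ".join(line.split()) for line in caption_lines if line.strip()
    )
    return textwrap.fill(joined, width=width)


fragments = [
    "cyber security is the term used to",
    "characterize",
    "and collect all of the activities",
]
print(to_paragraph(fragments, width=40))
```

To use it in the script above, you could write to_paragraph(transcript_lines) to the file instead of the newline-joined list.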

Tags: Programming Python
