In May 2020, while researching a piece of history, I came across a YouTube video on the topic. The video was long and packed with reference information that I needed. Instead of replaying the clip over and over to catch the details, I decided to find a way to obtain a transcript of the video.
Whenever I have a problem to solve, I jump straight to Python for the solution. With a little research, I found a Python API for retrieving the transcripts of YouTube videos, called YouTube Transcript/Subtitle API.
This is a Python API that allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, as other Selenium-based solutions do.
You can install this API by using the following command:
pip install youtube_transcript_api
This API works by passing the YouTube video ID to the YouTubeTranscriptApi.get_transcript function. The function returns a list of dictionaries, one per caption line, each holding the line's text along with its start time and duration. The API offers other options as well, such as choosing the transcript language.
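To make the shape of the returned data concrete, here is a minimal sketch that prints caption entries with a timestamp. It assumes each entry is a dictionary with 'text', 'start', and 'duration' keys, which is what get_transcript returns. The sample entries are hand-written for illustration rather than real API output, and format_entry is a hypothetical helper, not part of the library.

```python
def format_entry(entry):
    """Render one caption entry as 'MM:SS text'."""
    # 'start' is the offset from the beginning of the video, in seconds
    minutes, seconds = divmod(int(entry["start"]), 60)
    return f"{minutes:02d}:{seconds:02d} {entry['text']}"


# Hand-written sample data in the assumed shape; not real API output
sample_entries = [
    {"text": "[Music]", "start": 0.0, "duration": 4.5},
    {"text": "cyber security is the term used to", "start": 65.2, "duration": 3.1},
]

for entry in sample_entries:
    print(format_entry(entry))
```

Keeping the timestamps can be handy when you want to jump back to the exact point in the video where something was said.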
The following code will allow you to grab the transcript/subtitles for a given YouTube video. An error message will be displayed if the video does not contain a transcript. The code will accept one of the following formats:
- Full YouTube Video URL (https://www.youtube.com/watch?v=BvA0J_2ZpIQ)
- Short YouTube Video URL (https://youtu.be/BvA0J_2ZpIQ)
- YouTube Video ID (BvA0J_2ZpIQ)
The code will extract the video ID from the YouTube URL if the URL is used instead of a YouTube video ID.
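As an aside, plain string splitting can trip over URLs that carry extra parameters, such as a start time. A more defensive approach is to parse the URL with the standard library. The sketch below is one way to do that; extract_video_id is a hypothetical helper name, and the list of recognized host names is an assumption that you may need to extend.

```python
from urllib.parse import urlparse, parse_qs

# Host names treated as YouTube; an assumed, non-exhaustive list
SHORT_HOSTS = {"youtu.be", "www.youtu.be"}
FULL_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(video):
    """Return the video ID from a full URL, a short URL, or a bare ID."""
    parsed = urlparse(video)
    host = parsed.netloc.lower()
    if host in SHORT_HOSTS:
        # Short URL: the ID is the first path segment
        return parsed.path.lstrip("/").split("/")[0]
    if host in FULL_HOSTS:
        # Full URL: the ID is the value of the 'v' query parameter
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]
    # Anything else is assumed to already be a video ID
    return video


print(extract_video_id("https://www.youtube.com/watch?v=BvA0J_2ZpIQ&t=30s"))
print(extract_video_id("https://youtu.be/BvA0J_2ZpIQ"))
print(extract_video_id("BvA0J_2ZpIQ"))
```

All three calls print the same ID. The script that follows uses string splitting instead, which is shorter and works for the three formats listed above.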
#!/usr/bin/python3
from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video):
    # Extract the video ID from the YouTube URL
    if "youtu.be" in video:
        # Short URL: take the path segment and drop any query string
        video_id = video.split("/")[3].split("?")[0]
    elif "watch?v=" in video:
        # Full URL: take the 'v' value and drop any further parameters
        video_id = video.split("=")[1].split("&")[0]
    else:
        # Video is not a URL but the ID
        video_id = video

    try:
        # Create an empty list to store transcript lines
        transcript_lines = []
        # Iterate over the list of caption entries returned by the API
        for transcript in YouTubeTranscriptApi.get_transcript(video_id):
            # Append the text value of each line to the list
            transcript_lines.append(transcript['text'])
        # Join the list elements together separated by a new line and return the results
        return '\n'.join(transcript_lines)
    except Exception:
        # The transcript could not be retrieved for the video, return an error
        return "Transcription is not available for this video."


if __name__ == "__main__":
    # Pass the YouTube video to the 'get_transcript' function
    print(get_transcript('https://youtu.be/iODxExWFx_0'))
Terminal Output:
[Music]
cyber security is the term used to
characterize
and collect all of the activities
policies
procedures and tools used in concert to
protect the information technology
systems and data
that is core to the functioning of the
modern world
the protections of cyber security apply
to physical systems
software systems as well as the people
who use them
there are multiple things that
organizations can do to implement cyber
security protections
these include having agreed policies and
procedures
regular staff awareness training deploy
strong password management tools
and multi-factor authentication
implement identity access management and
privileged access management encrypt
data at rest
and in transit over networks install
endpoint protection software and keep it
up to date
do frequent backups to secure locations
and keep all it systems including
network equipment up to date via
installation of the latest
security and operating systems updates
the insights external threat protection
suite enhances cyber security protection
measures
by applying threat intelligence allowing
the monitoring of multiple sources to
identify
threats to your organization on the
clear deep
and dark web that could indicate
potential cyber attack planning
or data exposure against your brand
people
or infrastructure to deliver actionable
insights that can be taken
to mitigate the attack and risk
The code above displays the video transcript in your terminal. You may wish to save the output to a file instead. The following version of the script takes a file name as a second argument and writes the transcript to that file.
#!/usr/bin/python3
from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video, file_name):
    # Extract the video ID from the YouTube URL
    if "youtu.be" in video:
        # Short URL: take the path segment and drop any query string
        video_id = video.split("/")[3].split("?")[0]
    elif "watch?v=" in video:
        # Full URL: take the 'v' value and drop any further parameters
        video_id = video.split("=")[1].split("&")[0]
    else:
        # Video is not a URL but the ID
        video_id = video

    try:
        # Create an empty list to store transcript lines
        transcript_lines = []
        # Iterate over the list of caption entries returned by the API
        for transcript in YouTubeTranscriptApi.get_transcript(video_id):
            # Append the text value of each line to the list
            transcript_lines.append(transcript['text'])
        # Create a file and write the transcript to it (UTF-8 handles non-ASCII captions)
        with open(file_name, "w", encoding="utf-8") as file:
            # Join the list elements together separated by a new line and write to file
            file.write('\n'.join(transcript_lines))
        # Return a successful message
        return "The transcript has been created."
    except Exception:
        # The transcript could not be retrieved or written, return an error
        return "Transcription is not available for this video."


if __name__ == "__main__":
    # Pass the YouTube video and file name to the 'get_transcript' function
    print(get_transcript('https://youtu.be/iODxExWFx_0', "transcript.txt"))
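The transcript arrives as short caption fragments, as the terminal output above shows. If paragraphs are easier to read, a small post-processing step can reflow the text before you print or save it. The sketch below uses only the standard library. The name reflow is a hypothetical helper, and the default width of 80 columns is an arbitrary choice.

```python
import textwrap


def reflow(transcript, width=80):
    """Join caption fragments into one paragraph wrapped at `width` columns."""
    # split() with no arguments collapses newlines and repeated spaces
    words = transcript.split()
    return textwrap.fill(" ".join(words), width=width)


# A few lines copied from the terminal output above
sample = (
    "cyber security is the term used to\n"
    "characterize\n"
    "and collect all of the activities"
)

print(reflow(sample, width=40))
```

You could pass the string returned by get_transcript through a helper like this before printing it or writing it to the file.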