
Create a Word Cloud with Python


While analyzing data recently, I needed to see how often words appeared within a large dataset of 600,000+ records. I put together a word cloud for this, which let me see the most used words in the dataset at a glance.

A word cloud is similar to a tag cloud. It is an image filled with words of different sizes, where the size of each word reflects how often it appears in the text. In this guide, I will be showing you how to create a basic word cloud and a customized word cloud within Python.
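
To make the idea of "size reflects frequency" concrete, here is a minimal sketch of the counting step using only the standard library. It is a simplification for illustration, not how the wordcloud library works internally; the `count_words` helper, the `DEMO_STOPWORDS` set, and the sample sentence are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only; the wordcloud
# library ships its own, much larger, built-in STOPWORDS set.
DEMO_STOPWORDS = {"the", "of", "and", "to", "in", "a"}


def count_words(text, stopwords=DEMO_STOPWORDS):
    """Return a Counter of lowercase words, skipping stopwords."""
    # Simplified tokenizer: runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "the people and the states, the states and the union of the people"
counts = count_words(sample)

# The most frequent words would be drawn largest in a word cloud
print(counts.most_common(3))
```

Words with the highest counts are the ones a word cloud draws largest, and stopwords such as "the" never make it into the picture.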

For this example application, we will be using the text from the Constitution of the United States. I will also be introducing you to three new Python libraries that you will need to install before we begin.

Our Requirements

numpy – The numpy library is one of the most popular and helpful libraries to handle multi-dimensional arrays and matrices. It is also used in combination with the Pandas library to perform data analysis. We will be using this to customize our Word Cloud image later.

pip install numpy

pillow – The pillow library is a package that enables image reading. Pillow is a maintained fork of the Python Imaging Library (PIL). You will need this library to read in the image used as the mask for the word cloud.

pip install pillow

wordcloud – The wordcloud library is the star attraction of the application we are developing. It will take our text and convert it into a cloud of words.

pip install wordcloud

As mentioned, we will be using the text from the Constitution of the United States as our dataset. Click on the following link to download the file and save it within the same directory that you will be using for this Python application.

Download: constitution.txt

Let's Get Started

Now that you have your environment set up, it is time to get started with the basics. I will be showing you how to create a basic word cloud and then a custom word cloud as displayed in the featured image of this guide. As with my previous Python guides, I will be explaining what each line does within the code.

#!/usr/bin/python3

from wordcloud import WordCloud, STOPWORDS


def create_wordcloud():
    # Use the built-in list of words to be eliminated from the word cloud
    stopwords = set(STOPWORDS)

    # Read the text file into memory; the with block closes the file for us
    with open('constitution.txt', encoding='utf-8') as f:
        text = f.read()

    # Pass the text to the WordCloud function with parameters
    # and generate the word cloud data object
    wc = WordCloud(
        mode             = "RGBA",
        background_color = "black",
        colormap         = "RdYlBu",    
        width            = 815, 
        height           = 486, 
        random_state     = 6, 
        max_words        = 600, 
        stopwords        = stopwords, 
        collocations     = True
    ).generate(text)

    # Save the word cloud data as an image
    wc.to_file('constitution_cloud.png')


if __name__ == "__main__":
    # Call the create_wordcloud function
    create_wordcloud()

In the above code, we generated a rectangular image that was 815×486 pixels in size and filled with popular words from the Constitution of the United States. Your generated image should look similar to the one below.

WordCloud Parameters

The WordCloud library has several parameters that can be passed to the WordCloud constructor before the image is generated. These settings let you customize the appearance of your word cloud. Here is a reference list of those parameters.

Parameters
___________________________________________________________________________
 
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font; you need to adjust this path.
 
width : int (default=400)
    Width of the canvas.
 
height : int (default=200)
    Height of the canvas.
 
prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word if
    it doesn't fit. (There is currently no built-in way to get only
    vertical words.)
 
mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored, and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]
 
contour_width: float (default=0)
    If mask is not None and contour_width > 0, draw the mask contour.
 
contour_color: color value (default="black")
    Mask contour color.
 
scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.
 
min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.
 
font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.
 
max_words : number (default=200)
    The maximum number of words.
 
stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used.
 
background_color : color value (default="black")
    Background color for the word cloud image.
 
max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, the height of the image
    is used.
 
mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.
 
relative_scaling : float (default=0.5)
    Importance of relative word frequencies for font-size.  With
    relative_scaling=0, only word-ranks are considered.  With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size.  If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.
 
color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.
 
regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, r"\w[\w']+" is used.
 
collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.
 
colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.
 
normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'.
 
 
Attributes
___________________________________________________________________________
 
words_ : dict of string to float
    Word tokens with associated frequency.
 
    .. versionchanged: 2.0
        words_ is now a dictionary
 
layout_ : list of tuples (string, int, (int, int), int, color)
    Encodes the fitted word cloud. Encodes for each word the string, font
    size, position, orientation, and color.
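
The color_func parameter in the reference above is easier to understand with an example. The following is a minimal sketch of such a callable, assuming you want larger words to appear in a lighter shade of blue. The function name `size_to_blue`, the hue, and the lightness range are all illustrative choices of mine, not something the library prescribes.

```python
def size_to_blue(word, font_size, position, orientation,
                 random_state=None, **kwargs):
    """Return an HSL color string; larger words get a lighter blue."""
    # Clamp lightness to 35-85% so the text stays readable on a
    # black background (these bounds are an arbitrary choice)
    lightness = max(35, min(85, 35 + int(font_size) // 2))
    return "hsl(210, 80%, {}%)".format(lightness)


# The function can be tried on its own before handing it to WordCloud
print(size_to_blue("People", 40, (0, 0), None))
```

To use it, pass `color_func=size_to_blue` alongside the other WordCloud parameters. As the reference notes, a color_func overrides colormap, so the colormap setting no longer has any effect.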

One parameter deserves special mention: colormap. It controls the colors of the text within the word cloud. The names refer to matplotlib colormaps, and which one to pick depends on the colors that you want to use. Here is a list of the colormaps that are available; please note that the names are case-sensitive:

Accent, Accent_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, Blues, Blues_r, bone, bone_r, BrBG, BrBG_r, brg, brg_r, BuGn, BuGn_r, BuPu, BuPu_r, bwr, bwr_r, cividis, cividis_r, CMRmap, CMRmap_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, Dark2, Dark2_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, GnBu, GnBu_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, Greens, Greens_r, Greys, Greys_r, hot, hot_r, hsv, hsv_r, inferno, inferno_r, jet, jet_r, magma, magma_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, Oranges, Oranges_r, OrRd, OrRd_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, pink, pink_r, PiYG, PiYG_r, plasma, plasma_r, PRGn, PRGn_r, prism, prism_r, PuBu, PuBu_r, PuBuGn, PuBuGn_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, rainbow, rainbow_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, seismic, seismic_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, turbo, turbo_r, twilight, twilight_r, twilight_shifted, twilight_shifted_r, viridis, viridis_r, winter, winter_r, Wistia, Wistia_r, YlGn, YlGn_r, YlGnBu, YlGnBu_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r
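
Because the names are case-sensitive, it can be handy to check a name before passing it to WordCloud. Here is a minimal sketch, assuming matplotlib is available (it is installed as a dependency of wordcloud); `is_valid_colormap` is a hypothetical helper name used only for this example.

```python
import matplotlib.pyplot as plt


def is_valid_colormap(name):
    """Return True if name is a registered matplotlib colormap (case-sensitive)."""
    return name in plt.colormaps()


# Check the name used in this guide, then a wrongly-cased variant
print(is_valid_colormap("RdYlBu"))
print(is_valid_colormap("rdylbu"))
```

The exact set of registered names depends on your matplotlib version, so a check like this is more reliable than a static list.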

Custom Word Cloud Design

Now that we have learned how to create a basic word cloud, it is time to move on to a custom-designed one. These are much easier to make than they sound. The most difficult part is finding or creating a good base image to use as the mask. I have found it easiest to use a plain black-and-white image, since the background of the image has to be pure white (255, 255, 255). Words are drawn only on the non-white part of the image, so the black area ends up filled with the text of the word cloud.
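
One practical catch: JPEG compression can leave background pixels that are almost, but not exactly, 255, and those would not be treated as masked out. A minimal sketch of one way to clean this up with numpy follows, assuming a grayscale image and a cutoff of 200. The `binarize_mask` helper and the threshold value are my own illustrative choices and are not part of the wordcloud library.

```python
import numpy as np


def binarize_mask(gray, threshold=200):
    """Snap a 2-D grayscale array to pure white (255) or pure black (0)."""
    # Pixels brighter than the threshold become the masked-out background;
    # everything else becomes the area the words may be drawn on
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


# Small synthetic "image": two near-white pixels and two darker ones
demo = np.array([[250, 10], [255, 128]], dtype=np.uint8)
print(binarize_mask(demo))
```

With a real image, you could build the mask as `binarize_mask(np.array(Image.open("usmap.jpg").convert("L")))`, where `Image` comes from the pillow library and `convert("L")` turns the image into grayscale.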

To save you the trouble of making one, I have included the base image that I used for the featured image of this guide. Click on the following link to download the file and save it within the same directory that you will be using for this Python application.

Download: usmap.jpg

We are only making two changes to the previous code. We are going to import the numpy library to help us mask the word cloud image and we will be passing the mask parameter to the WordCloud function.

#!/usr/bin/python3

import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS


def create_wordcloud():
    # Use the built-in list of words to be eliminated from the word cloud
    stopwords = set(STOPWORDS)

    # Load the base image into a numpy array and store the
    # array within the mask variable
    mask = np.array(Image.open("usmap.jpg"))

    # Read the text file into memory; the with block closes the file for us
    with open('constitution.txt', encoding='utf-8') as f:
        text = f.read()

    # Pass the text to the WordCloud function with parameters
    # and generate the word cloud data
    wc = WordCloud(
        mode             = "RGBA",
        background_color = "black",
        colormap         = "RdYlBu",
        width            = 815,
        height           = 486,
        random_state     = 6,
        max_words        = 600,
        stopwords        = stopwords,
        mask             = mask,
        collocations     = True
    ).generate(text)

    # Save the word cloud data as an image
    wc.to_file('constitution_cloud.png')


if __name__ == "__main__":
    # Call the create_wordcloud function
    create_wordcloud()

We didn't change any other parameters within the code. Keep in mind, however, that when a mask is supplied, the width and height values are ignored and the output takes the dimensions of the mask image instead. After you have run the script, you should have an image very similar to the featured image of this guide. I encourage you to explore the WordCloud library in further depth and to experiment with different mask images.

Jonathan Moore

Senior Software Engineer and Cybersecurity Specialist with over 3 decades of experience in developing web, desktop, and server applications for Linux and Windows-based operating systems. Worked on numerous projects, including automation, artificial intelligence, data analysis, application programming interfaces, intrusion detection systems, streaming audio servers, WordPress plugins, and much more.
