
Interlinear Commentary Text of Python for DevOps - Learn Ruthlessly Effective Automation



Python for DevOps: Learn Ruthlessly Effective Automation, by Noah Gift, Kennedy Behrman, Alfredo Deza, and Grig Gheorghiu, O'Reilly, 2020, ISBN 978-1-492-05769-7

Python for DevOps

by Noah Gift, Kennedy Behrman, Alfredo Deza, and Grig Gheorghiu

Copyright © 2020 Noah Gift, Kennedy Behrman, Alfredo Deza, Grig Gheorghiu. All rights reserved.

Preface

One time Noah was in the ocean, and a wave crashed on top of him and took his breath away as it pulled him deeper into the sea. Just as he started to recover his breath, another wave dropped on top. It extracted much of his remaining energy. It pulled him even deeper into the ocean. Just as he started to recover, yet another wave crashed down on top. The more he would fight the waves and the sea, the more energy was drained. He seriously wondered if he would die at that moment. He couldn’t breathe, his body ached, and he was terrified he was going to drown. Being close to death helped him focus on the only thing that could save him, which was conserving his energy and using the waves — not fighting them.

Being in a startup that doesn’t practice DevOps is a lot like that day at the beach. Production fires burn for months; everything is manual; alerts wake you up for days on end, damaging your health. The only escape from this death spiral is the DevOps way.

Do one right thing, then another, until you find clarity. First, set up a build server, start testing your code, and automate manual tasks. Do something; it can be anything, but have a “bias for action.” Do that first thing right and make sure it is automated.

A common trap in startups or any company is the search for superheroes. “We need a performance engineer” because they will fix our performance problems. “We need a Chief Revenue Officer” because they will fix all sales problems. “We need DevOps engineers” because they will fix our deployment process.

At one company, Noah had a project that was over a year late, and the web application had been rewritten three times in multiple languages. This next release only needed a “performance engineer” to get it finished. Noah remembers being the only one brave, or stupid, enough to ask, “What is a performance engineer?” This engineer was supposed to make everything work at scale. He realized at that point that they were looking for a superhero to save them. Superhero hiring syndrome is the surest sign that something is very wrong with a new product or a new startup. No employee can save a company that won’t first save itself.

At other companies, Noah heard similar things: “If we could only hire a senior Erlang engineer,” or “If we could only hire someone to make us revenue,” or “If we could only hire someone to teach us to be financially disciplined,” or “If we could only hire a Swift developer,” etc. This hire is the last thing your startup or new product needs; what it actually needs is to understand why it believes only a superhero can save the day.

In the case of the company that wanted to hire a performance engineer, it turned out that the real issue was inadequate technical supervision. The wrong people were in charge (and verbally shouting down the people who could fix it). By removing a poor performer, listening to an existing team member who knew how to fix the problem all along, deleting that job listing, doing one right thing at a time, and inserting qualified engineering management, the issue resolved itself without a superhero hire.

No one will save you at your startup; you and your team have to protect yourselves by creating great teamwork, a great process, and believing in your organization. The solution to the problem isn’t a new hire; it is being honest and mindful about the situation you are in, how you got there, and doing one right thing at a time until you work your way out. There is no superhero unless it is you.

Just like being in the ocean in a storm and slowly drowning, no one is going to save you or the company unless it is you. You are the superhero your company needs, and you might discover your coworkers are too.

There is a way out of the chaos, and this book can be your guide. Let’s get started.

What Does DevOps Mean to the Authors?

Many abstract concepts in the software industry are hard to define precisely. Cloud Computing, Agile, and Big Data are good examples of topics that can have many definitions depending on whom you talk to. Instead of strictly defining what DevOps is, let’s use some phrases that show evidence DevOps is occurring:

Two-way collaboration between Development and Operation teams.

Turnaround of Ops tasks in minutes to hours, not days to weeks.

Strong involvement from developers; otherwise, it’s back to Devs versus Ops.

Operations people need development skills — at least Bash and Python.

Developer people need operational skills — their responsibilities don’t end with writing the code, but with deploying the system to production and monitoring alerts.

Automation, automation, automation: you can’t accurately automate without Dev skills, and you can’t correctly automate without Ops skills.

Ideally: self-service for developers, at least in terms of deploying code.

Can be achieved via CI/CD pipelines.

GitOps.

Bidirectional everything between Development and Operations (tooling, knowledge, etc.).

Constant collaboration in design, implementation, deployment — and yes, automation — can’t be successful without cooperation.

If it isn’t automated, it’s broken.

Cultural: Hierarchy < Process.

Microservices > Monolithic.

The continuous deployment system is the heart and soul of the software team.

There are no superheroes.

Continuous delivery isn’t an option; it is a mandate.

How to Use This Book

This book is useful in any order. You can randomly open any chapter you like, and you should be able to find something helpful to apply to your job. If you are an experienced Python programmer, you may want to skim Chapter 1. Likewise, if you are interested in war stories, case studies, and interviews, you may want to read Chapter 16 first.

Conceptual Topics

The content is broken up into several conceptual topics. The first group is Python Foundations, and it covers a brief introduction to the language as well as automating text, writing command-line tools, and automating the file system.

Next up is Operations, which includes useful Linux utilities, package management, build systems, monitoring and instrumentation, and automated testing. These are all essential topics to master to become a competent DevOps practitioner.

Cloud Foundations are in the next section, and there are chapters on Cloud Computing, Infrastructure as Code, Kubernetes, and Serverless. There is currently a crisis in the software industry around finding enough talent trained in the Cloud. Mastering this section will pay immediate dividends to both your salary and your career.

Next up is the Data section. Machine Learning Operations and Data Engineering are both covered from the perspective of DevOps. There is also a full, soup-to-nuts machine learning project walkthrough that takes you through building, deploying, and operationalizing a machine learning model using Flask, Sklearn, Docker, and Kubernetes.

The last section is Chapter 16, on case studies, interviews, and DevOps war stories. This chapter makes for good bedtime reading.

Python Foundations

Chapter 1, Python Essentials for DevOps

Chapter 2, Automating Files and the Filesystem

Chapter 3, Working with the Command Line

Operations

Chapter 4, Useful Linux Utilities

Chapter 5, Package Management

Chapter 6, Continuous Integration and Continuous Deployment

Chapter 7, Monitoring and Logging

Chapter 8, Pytest for DevOps

Cloud Foundations

Chapter 9, Cloud Computing

Chapter 10, Infrastructure as Code

Chapter 11, Container Technologies: Docker and Docker Compose

Chapter 12, Container Orchestration: Kubernetes

Chapter 13, Serverless Technologies

Data

Chapter 14, MLOps and Machine Learning Engineering

Chapter 15, Data Engineering

Case Studies

Chapter 16, DevOps War Stories and Interviews

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://pythondevops.com. You can also view DevOps content related to the code in the book at the Pragmatic AI Labs YouTube channel.

If you have a technical question for the authors or a problem using the code examples, please email [email protected].

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Python for DevOps by Noah Gift, Kennedy Behrman, Alfredo Deza, and Grig Gheorghiu. (O’Reilly). Copyright 2020 Noah Gift, Kennedy Behrman, Alfredo Deza, Grig Gheorghiu, 978-1-492-05769-7.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at oreil.ly/python-for-devops.

Email [email protected] to comment or ask technical questions about this book.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

To start off, the authors would like to thank the two main technical reviewers of the book:

Wes Novack is an architect and engineer specializing in public cloud systems and web-scale SaaS applications. He designs, builds, and manages complex systems that enable highly available infrastructure, continuous delivery pipelines, and rapid releases within large, polyglot microservice ecosystems hosted on AWS and GCP. Wes makes extensive use of languages, frameworks, and tools to define Infrastructure as Code, drive automation, and eliminate toil. He is vocal in the tech community by participating in mentorship, workshops, and conferences, and he is also a Pluralsight video course author. Wes is an advocate for the CALMS of DevOps: Culture, Automation, Lean, Measurement, and Sharing. You can find him on Twitter @WesleyTech or visit his personal blog.

Brad Andersen is a software engineer and architect. He has designed and developed software professionally for 30 years. He works as a catalyst for change and innovation; he has assumed leadership and development roles across a spectrum from enterprise organizations to startups. Brad is currently pursuing a master’s degree in data science at the University of California, Berkeley. You can find more information on Brad’s LinkedIn profile.

We would also like to thank Jeremy Yabrow and Colin B. Erdman for chipping in with many great ideas and bits of feedback.

Noah

I would like to thank the coauthors of the book: Grig, Kennedy, and Alfredo. It was incredible working with a team that was this effective.

Kennedy

Thanks to my coauthors; it has been a pleasure to work with you. And thanks to my family for their patience and understanding.

Alfredo

In 2010 — nine years ago as of this writing — I landed my first software engineering job. I was 31 years old with no college education and no previous engineering experience. That job meant accepting a reduced salary and no health insurance. I learned a lot, met amazing people, and gained expertise through relentless determination. Throughout those years, it would’ve been impossible to get here without people opening opportunities and pointing me in the right direction.

Thanks to Chris Benson, who saw that I was hungry for learning and kept finding opportunities to have me around.

Thanks to Alejandro Cadavid, who realized that I could fix things nobody else wanted to fix. You helped me get work when no one (including myself) thought I could be useful.

Carlos Coll got me into programming and didn’t let me quit even when I asked him to. Learning to program changed my life, and Carlos had the patience to push me to learn and land my first program in production.

To Joni Benton, for believing in me and helping me land my first full-time job.

Thanks to Jonathan LaCour, an inspiring boss who continues to help me get to a better place. Your advice has always been invaluable to me.

Noah, thanks for your friendship and guidance; you are a tremendous source of motivation to me. I always enjoy working together, like that one time when we rebuilt infrastructure from scratch. Your patience and guidance when I had no idea about Python was life-changing.

Lastly, a tremendous thanks to my family: my wife, Claudia, who never doubts my ability to learn and improve, and who was so generous and understanding of the time I spent working toward this book; and my children, Efrain, Ignacio, and Alana: I love you all.

Grig

My thanks to all creators of open source software. Without them, our jobs would be so much more bleak and unfulfilling. Also thank you to all who blog and share your knowledge freely. Lastly, I also wish to thank the coauthors of this book. It’s been a really fun ride.

Chapter 1. Python Essentials for DevOps

DevOps, the combination of software development with information technology operations, has been a hot field during the last decade. Traditional boundaries among software development, deployment, maintenance, and quality assurance have been broken, enabling more integrated teams. Python has been a popular language both in traditional IT operations and in DevOps due to its combination of flexibility, power, and ease of use.

The Python programming language was publicly released in the early 1990s for use in system administration. It has been a great success in this area and has gained wide adoption. Python is a general-purpose programming language used in just about every domain. The visual effects and the motion picture industries embraced it. More recently, it has become the de facto language of data science and machine learning (ML). It has been used across industries from aviation to bioinformatics. Python has an extensive arsenal of tools to cover the wide-ranging needs of its users. Learning the whole Python Standard Library (the capabilities that come with any Python installation) would be a daunting task. Trying to learn all the third-party packages that enliven the Python ecosystem would be an immense undertaking. The good news is that you don’t need to do those things. You can become a powerful DevOps practitioner by learning only a small subset of Python.

In this chapter, we draw on our decades of Python DevOps experience to teach only the elements of the language that you need. These are the parts of Python DevOps that are used daily. They form the essential toolbox to get things done. Once you have these core concepts down, you can add more complicated tools, as you’ll see in later chapters.

Installing and Running Python

If you want to try the code in this overview, you need Python 3.7 or later installed (the latest release is 3.8.0 as of this writing) and access to a shell. On macOS, Windows, and most Linux distributions, you can open the terminal application to access a shell. To see what version of Python you are using, open a shell and type python --version:

$ python --version
Python 3.8.0

Python installers can be downloaded directly from the Python.org website. Alternatively, you can use a package manager such as Apt, RPM, MacPorts, Homebrew, Chocolatey, or many others.

The Python Shell

The simplest way to run Python is to use the built-in interactive interpreter. Just type python in a shell. You can then interactively run Python statements. Type exit() to exit the shell.

$ python
Python 3.8.0 (default, Sep 23 2018, 09:47:03)
[Clang 9.0.0 (clang-900.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 2
3
>>> exit()

Python scripts

Python code runs from a file with the .py extension:

# This is my first Python script
print('Hello world!')

Save this code to a file named hello.py. To invoke the script, in a shell run python followed by the filename:

$ python hello.py
Hello world!

Python scripts are how most production Python code runs.
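As scripts grow beyond a single print, a common pattern is to put the logic in functions and guard the entry point. The script below is a sketch of that idiom (the greet function and filename are illustrative, not from the book):

```python
# hello.py -- a small script using the common "main guard" idiom, so the
# file can be run directly with `python hello.py` or imported by other code
# without side effects.

def greet(name):
    # Build the greeting rather than printing directly, which keeps the
    # logic reusable and testable from other modules.
    return f"Hello {name}!"

if __name__ == "__main__":
    # This block only runs when the file is executed as a script,
    # not when it is imported.
    print(greet("world"))
```

Running `python hello.py` prints the greeting; importing hello from another module makes greet available without printing anything.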

IPython

Besides the built-in interactive shell, several third-party interactive shells run Python code. One of the most popular is IPython. IPython offers introspection (the ability to dynamically get information about objects), syntax highlighting, special magic commands (which we touch on later in this chapter), and many more features, making it a pleasure to use for exploring Python. To install IPython, use the Python package manager, pip:

$ pip install ipython

Running is similar to running the built-in interactive shell described in the previous section:

$ ipython
Python 3.8.0 (default, Sep 23 2018, 09:47:03)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print('Hello')
Hello

In [2]: exit()

Jupyter Notebooks

A spin-off from the IPython project, the Jupyter project allows documents that contain text, code, and visualizations. These documents are powerful tools for combining running code, output, and formatted text. Jupyter enables the delivery of documentation along with the code. It has achieved widespread popularity, especially in the data science world. Here is how to install and run Jupyter notebooks:

$ pip install jupyter
$ jupyter notebook

This command opens a web browser tab showing the current working directory. From here, you can open existing notebooks in the current project or create new ones.

Procedural Programming

If you’ve been around programming at all, you’ve probably heard terms like object-oriented programming (OOP) and functional programming. These are different architectural paradigms used to organize programs. One of the most basic paradigms, procedural programming, is an excellent place to start. Procedural programming is the issuing of instructions to a computer in an ordered sequence:

>>> i = 3
>>> j = i + 1
>>> i + j
7

As you can see in this example, there are three statements that are executed in order from the first line to the last. Each statement uses the state produced by the previous ones. In this case, the first statement assigns the value 3 to a variable named i. In the second statement, this variable’s value is used to assign a value to a variable named j, and in the third statement, the values from both variables are added together. Don’t worry about the details of these statements yet; notice that they are executed in order and rely on the state created by the previous statements.

Variables

A variable is a name that points to some value. In the previous example, the variables are i and j . Variables in Python can be assigned to new values:

>>> dog_name = 'spot'
>>> dog_name
'spot'
>>> dog_name = 'rex'
>>> dog_name
'rex'
>>> dog_name = 't-' + dog_name
>>> dog_name
't-rex'
>>>

Python variables use dynamic typing. In practice, this means that they can be reassigned to values of different types or classes:

>>> big = 'large'
>>> big
'large'
>>> big = 1000*1000
>>> big
1000000
>>> big = {}
>>> big
{}
>>>

Here the same variable is set to a string, a number, and a dictionary. Variables can be reassigned to values of any type.

Basic Math

Basic math operations such as addition, subtraction, multiplication, and division can all be performed using built-in math operators:

>>> 1 + 1
2
>>> 3 - 4
-1
>>> 2*5
10
>>> 2/3
0.6666666666666666

Note that the // symbol performs integer (floor) division. The symbol ** creates an exponent, and % is the modulo operator:

>>> 5/2
2.5
>>> 5//2
2
>>> 3**2
9
>>> 5%2
1

Comments

Comments are text ignored by the Python interpreter. They are useful for documentation of code and can be mined by some services to provide standalone documentation. Single-line comments are delineated by prepending with #. A single-line comment can start at the beginning of a line, or at any point thereafter. Everything after the # is part of the comment until a new line break occurs:

# This is a comment
1 + 1  # This comment follows a statement

Multiline comments are enclosed in blocks beginning and ending with either """ or ''':

"""
This statement is a
block comment. It
can run for multiple lines
"""

'''
This statement is also a
block comment
'''

Built-in Functions

Functions are statements grouped as a unit. You invoke a function by typing the function name, followed by parentheses. If the function takes arguments, the arguments appear within the parentheses. Python has many built-in functions. Two of the most widely used built-in functions are print and range.

Print

The print function produces output that a user of a program can view. It is less relevant in interactive environments but is a fundamental tool when writing Python scripts. In the previous example, the argument to the print function is written as output when the script runs:

# This is my first Python script
print("Hello world!")

$ python hello.py
Hello world!

print can be used to see the value of a variable or to give feedback as to the state of a program. print writes to the standard output stream and is visible as program output in a shell.

Range

Though range is a built-in function, it is technically not a function at all. It is a type representing a sequence of numbers. When calling the range() constructor, an object representing a sequence of numbers is returned. Range objects count through a sequence of numbers. The range function takes up to three integer arguments. If only one argument appears, the sequence consists of the numbers from zero up to, but not including, that number. With a second argument, the first argument becomes the starting point rather than the default of 0. The third argument specifies the step distance, which defaults to 1.

>>> range(10)
range(0, 10)
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(5, 10))
[5, 6, 7, 8, 9]
>>> list(range(5, 10, 3))
[5, 8]
>>>

range maintains a small memory footprint, even over extended sequences, as it only stores the start, stop, and step values. The range function can iterate through long sequences of numbers without performance constraints.
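This laziness is easy to see for yourself: a range spanning a trillion numbers is created instantly, and membership and length are computed arithmetically from start, stop, and step rather than by iterating. A small sketch:

```python
# A range object stores only start, stop, and step, so even an enormous
# range is cheap to create and query.
big = range(0, 10**12, 7)

# len() is computed arithmetically from the three stored values.
print(len(big))

# Membership tests are also arithmetic: is the number in [start, stop)
# and reachable from start in steps of 7?
print(999999999999 in big)
print(10**12 in big)
```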

Execution Control

Python has many constructs to control the flow of statement execution. You can group statements you wish to run together as a block of code. These blocks can be run multiple times using for and while loops or only run under certain conditions using if statements, while loops, or try-except blocks. Using these constructs is the first step to taking advantage of the power of programming. Different languages demarcate blocks of code using different conventions. Many languages with syntax similar to the C language (a very influential language used in writing Unix) use curly brackets around a group of statements to define a block. In Python, indentation is used to indicate a block. Statements are grouped by indentation into blocks that execute as a unit.

Note

The Python interpreter does not care if you use tabs or spaces to indent, as long as you are consistent. The Python style guide, PEP-8, however, recommends using four whitespaces for each level of indentation.

if/elif/else

if/elif/else statements are common ways to branch between decisions in code. A block directly after an if statement runs if that statement evaluates to True:

>>> i = 45
>>> if i == 45:
...     print('i is 45')
...
i is 45
>>>

Here we used the == operator, which returns True if items are equal and False if not. Optionally, this block can follow an elif or else statement with an accompanying block. In the case of an elif statement, this block only executes if the elif evaluates to True:

>>> i = 35
>>> if i == 45:
...     print('i is 45')
... elif i == 35:
...     print('i is 35')
...
i is 35
>>>

Multiple elif statements can be chained together. If you are familiar with switch statements in other languages, this simulates the same behavior of choosing from multiple options. Adding an else statement at the end runs a block if none of the other conditions evaluate to True:

>>> i = 0
>>> if i == 45:
...     print('i is 45')
... elif i == 35:
...     print('i is 35')
... elif i > 10:
...     print('i is greater than 10')
... elif i%3 == 0:
...     print('i is a multiple of 3')
... else:
...     print("I don't know much about i...")
...
i is a multiple of 3
>>>

You can nest if statements, creating blocks containing if statements that only execute if an outer if statement is True:

>>> cat = 'spot'
>>> if 's' in cat:
...     print("Found an 's' in a cat")
...     if cat == 'Sheba':
...         print("I found Sheba")
...     else:
...         print("Some other cat")
... else:
...     print("a cat without 's'")
...
Found an 's' in a cat
Some other cat
>>>

for Loops

for loops allow you to repeat a block of statements (a code block) once for each member of a sequence (ordered group of items). As you iterate through the sequence, the current item can be accessed by the code block. One of most common uses of loops is to iterate through a range object to do a task a set number of times:

>>> for i in range(10):
...     x = i*2
...     print(x)
...
0
2
4
6
8
10
12
14
16
18
>>>

In this example, our block of code is as follows:

...     x = i*2
...     print(x)

We repeat this code 10 times, each time assigning the variable i to the next number in the sequence of integers from 0–9. for loops can be used to iterate through any of the Python sequence types. You will see these later in this chapter.
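To preview that flexibility, here is a short sketch iterating a string and a list; the enumerate built-in, which pairs each item with its index, goes slightly beyond the text above:

```python
# for loops work over any sequence type, not just range objects.
# Iterating a string yields one character at a time.
for letter in "abc":
    print(letter)

# Iterating a list yields one item at a time; enumerate() pairs each
# item with its position in the sequence.
tools = ["ansible", "terraform", "docker"]
for position, tool in enumerate(tools):
    print(position, tool)
```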

continue

The continue statement skips a step in a loop, jumping to the next item in the sequence:

>>> for i in range(6):
...     if i == 3:
...         continue
...     print(i)
...
0
1
2
4
5
>>>

while Loops

while loops repeat a block as long as a condition evaluates to True:

>>> count = 0
>>> while count < 3:
...     print(f"The count is {count}")
...     count += 1
...
The count is 0
The count is 1
The count is 2
>>>

It is essential to define a way for your loop to end. Otherwise, you will be stuck in the loop until your program crashes. One way to handle this is to define your conditional statement such that it eventually evaluates to False. An alternative pattern uses the break statement to exit a loop using a nested conditional:

>>> count = 0
>>> while True:
...     print(f"The count is {count}")
...     if count > 5:
...         break
...     count += 1
...
The count is 0
The count is 1
The count is 2
The count is 3
The count is 4
The count is 5
The count is 6
>>>

Handling Exceptions

Exceptions are a type of error causing your program to crash if not handled (caught). Catching them with a try-except block allows the program to continue. These blocks are created by indenting the block in which the exception might be raised, putting a try statement before it and an except statement after it, followed by a code block that should run when the error occurs:

>>> thinkers = ['Plato', 'PlayDo', 'Gumby']
>>> while True:
...     try:
...         thinker = thinkers.pop()
...         print(thinker)
...     except IndexError as e:
...         print("We tried to pop too many thinkers")
...         print(e)
...         break
...
Gumby
PlayDo
Plato
We tried to pop too many thinkers
pop from empty list
>>>

There are many built-in exceptions, such as IOError, KeyError, and ImportError. Many third-party packages also define their own exception classes. Exceptions indicate that something has gone very wrong, so it only pays to catch one if you are confident that the problem won’t be fatal to your software. Specify explicitly which exception type you will catch; ideally, catch the exact exception type (in our example, this was IndexError).
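To illustrate catching a specific exception type in a DevOps-flavored setting, here is a sketch that handles only a missing dictionary key; the config dictionary and key names are purely illustrative:

```python
# Catching the most specific exception type keeps unrelated bugs visible:
# only KeyError is handled here, so any other failure still crashes loudly.
config = {"region": "us-west-2"}

try:
    zone = config["zone"]  # raises KeyError because 'zone' is missing
except KeyError as e:
    # Fall back to a default only for the error we anticipated.
    print(f"Missing configuration key: {e}")
    zone = "default"

print(zone)
```

Catching a broad class like Exception here would also swallow typos, type errors, and genuine bugs, which is exactly why the text recommends catching the exact type.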

Built-in Objects

In this overview, we will not be covering OOP. The Python language, however, comes with quite a few built-in classes.

What Is an Object?

In OOP, data or state and functionality appear together. The essential concepts to understand when working with objects are class instantiation (creating objects from classes) and dot syntax (the syntax for accessing an object’s attributes and methods). A class defines attributes and methods shared by its objects. Think of it as the technical drawing of a car model. The class can then be instantiated to create an instance. The instance, or object, is a single car built based on those drawings.

>>> # Define a class for fancy cars
>>> class FancyCar():
...     pass
...
>>> type(FancyCar)
<class 'type'>
>>> # Instantiate a fancy car
>>> my_car = FancyCar()
>>> type(my_car)
<class '__main__.FancyCar'>

You don’t need to worry about creating your own classes at this point. Just understand that each object is an instantiation of a class.

Object Methods and Attributes

Objects store data in attributes, which are variables attached to the object or to its class. Objects define functionality in methods: functions attached to the object (shared by all objects of a class) or attached to the class itself and shared by all of its objects (class methods).

Note

In Python documentation, functions attached to objects and classes are referred to as methods.

These functions have access to the object’s attributes and can modify and use the object’s data. To call an object’s method or access one of its attributes, we use dot syntax:

>>> # Define a class for fancy cars
>>> class FancyCar():
...     # Add a class variable
...     wheels = 4
...     # Add a method
...     def driveFast(self):
...         print("Driving so fast")
...
>>> # Instantiate a fancy car
>>> my_car = FancyCar()
>>> # Access the class attribute
>>> my_car.wheels
4
>>> # Invoke the method
>>> my_car.driveFast()
Driving so fast
>>>

So here our FancyCar class defines a method called driveFast and an attribute wheels. When you instantiate an instance of FancyCar named my_car, you can access the attribute and invoke the method using the dot syntax.

Sequences

Sequences are a family of built-in types, including the list, tuple, range, string, and binary types. Sequences represent ordered and finite collections of items.

Sequence operations

There are many operations that work across all of the types of sequences. We cover some of the most commonly used operations here.

You can use the in and not in operators to test whether or not an item exists in a sequence:

>>> 2 in [1,2,3]
True
>>> 'a' not in 'cat'
False
>>> 10 in range(12)
True
>>> 10 not in range(2, 4)
True

You can reference the contents of a sequence by using its index number. To access the item at some index, use square brackets with the index number as an argument. The first item is at index 0, the second at 1, and so forth, up to one less than the number of items:

>>> my_sequence = 'Bill Cheatham'
>>> my_sequence[0]
'B'
>>> my_sequence[2]
'l'
>>> my_sequence[12]
'm'

You can also index from the end of a sequence rather than from the front by using negative numbers. The last item has the index –1, the second to last has the index –2, and so forth:

>>> my_sequence = "Bill Cheatham"
>>> my_sequence[-1]
'm'
>>> my_sequence[-2]
'a'
>>> my_sequence[-13]
'B'

The index method returns the index of an item. By default, it returns the index of the first occurrence, but optional arguments can define a subrange in which to search:

>>> my_sequence = "Bill Cheatham"
>>> my_sequence.index('C')
5
>>> my_sequence.index('a')
8
>>> my_sequence.index('a', 9, 12)
11
>>> my_sequence[11]
'a'

You can produce a new sequence from an existing one using slicing. A slice is created by indexing a sequence with brackets containing optional start, stop, and step arguments:

my_sequence[start:stop:step]

start is the index of the first item to use in the new sequence, stop is the first index beyond that point, and step is the distance between items. These arguments are all optional and are replaced with default values if omitted: 0 for start, the length of the sequence for stop, and 1 for step. With all defaults in place, the slice produces a copy of the original sequence. Note that if the step does not appear, the corresponding : can also be dropped:

>>> my_sequence = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> my_sequence[2:5]
['c', 'd', 'e']
>>> my_sequence[:5]
['a', 'b', 'c', 'd', 'e']
>>> my_sequence[3:]
['d', 'e', 'f', 'g']

Negative numbers can be used to index backward:

>>> my_sequence[-6:]
['b', 'c', 'd', 'e', 'f', 'g']
>>> my_sequence[3:-1]
['d', 'e', 'f']

Sequences share many operations for getting information about them and their contents. len returns the length of the sequence, min the smallest member, max the largest, and count the number of a particular item. min and max work only on sequences with items that are comparable. Remember that these work with any sequence type:

>>> my_sequence = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]
>>> len(my_sequence)
12
>>> min(my_sequence)
0
>>> max(my_sequence)
4
>>> my_sequence.count(1)
3

Lists

Lists, one of the most commonly used Python data structures, represent an ordered collection of items of any type. Square brackets are the syntax for defining a list.

The function list() can be used to create an empty list or a list based on another finite iterable object (such as another sequence):

>>> list()
[]
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list("Henry Miller")
['H', 'e', 'n', 'r', 'y', ' ', 'M', 'i', 'l', 'l', 'e', 'r']

Lists created by using square brackets directly are the most common form. Items in the list need to be enumerated explicitly in this case. Remember that the items in a list can be of different types:

>>> empty = []
>>> empty
[]
>>> nine = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> nine
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mixed = [0, 'a', empty, 'WheelHoss']
>>> mixed
[0, 'a', [], 'WheelHoss']

The most efficient way to add a single item to a list is to append the item to the end of the list. A less efficient method, insert, allows you to insert an item at the index position of your choice:

>>> pies = ['cherry', 'apple']
>>> pies
['cherry', 'apple']
>>> pies.append('rhubarb')
>>> pies
['cherry', 'apple', 'rhubarb']
>>> pies.insert(1, 'cream')
>>> pies
['cherry', 'cream', 'apple', 'rhubarb']

The contents of one list can be added to another using the extend method:

>>> pies
['cherry', 'cream', 'apple', 'rhubarb']
>>> desserts = ['cookies', 'paste']
>>> desserts
['cookies', 'paste']
>>> desserts.extend(pies)
>>> desserts
['cookies', 'paste', 'cherry', 'cream', 'apple', 'rhubarb']

The most efficient and common way of removing the last item from a list and returning its value is to pop it. An index argument can be supplied to this method, removing and returning the item at that index. This technique is less efficient, as the list needs to be re-indexed:

>>> pies
['cherry', 'cream', 'apple', 'rhubarb']
>>> pies.pop()
'rhubarb'
>>> pies
['cherry', 'cream', 'apple']
>>> pies.pop(1)
'cream'
>>> pies
['cherry', 'apple']

There is also a remove method, which removes the first occurrence of an item.

>>> pies.remove('apple')
>>> pies
['cherry']

One of the most potent and idiomatic Python features, list comprehensions, allows you to use the functionality of a for loop in a single line. Let’s look at a simple example, starting with a for loop squaring all of the numbers from 0–9 and appending them to a list:

>>> squares = []
>>> for i in range(10):
...     squared = i*i
...     squares.append(squared)
...
>>> squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In order to replace this with a list comprehension, we do the following:

>>> squares = [i*i for i in range(10)]
>>> squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Note that the functionality of the inner block is put first, followed by the for statement. You can also add conditionals to list comprehensions, filtering the results:

>>> squares = [i*i for i in range(10) if i%2==0]
>>> squares
[0, 4, 16, 36, 64]

Other techniques for list comprehensions include nesting them and using multiple variables, but the more straightforward form shown here is the most common.
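As a brief sketch of those more advanced forms (the variable names here are illustrative), a nested comprehension can flatten a list of lists, and multiple for clauses produce one item per combination of variables:

```python
# Flatten a list of lists with a nested comprehension:
# the for clauses read left to right, outer loop first
matrix = [[1, 2], [3, 4], [5, 6]]
flat = [n for row in matrix for n in row]
print(flat)  # [1, 2, 3, 4, 5, 6]

# Use two variables to build all coordinate pairs
pairs = [(x, y) for x in range(2) for y in range(2)]
print(pairs)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The straightforward single-loop form remains easier to read, which is why it is the most common in practice.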

Strings

The string sequence type is a collection of ordered characters surrounded by quotation marks. As of Python 3, strings default to using UTF-8 encoding.

You can create strings either by using the string constructor method, str(), or by directly enclosing the text in quotation marks:

>>> str()
''
>>> "some new string!"
'some new string!'
>>> 'or with single quotes'
'or with single quotes'

The string constructor can be used to make strings from other objects:

>>> my_list = list()
>>> str(my_list)
'[]'

You can create multiline strings by using triple quotes around the content:

>>> multi_line = """This is a
... multi-line string,
... which includes linebreaks.
... """
>>> print(multi_line)
This is a
multi-line string,
which includes linebreaks.

In addition to the methods shared by all sequences, strings have quite a few methods distinct to their class.

It is relatively common for user text to have trailing or leading whitespace. If someone types " yes " in a form instead of "yes", you usually want to treat them the same. Python strings have a strip method just for this case. It returns a string with the whitespace removed from the beginning and end. There are also methods to remove the whitespace from only the right or left side of the string:

>>> input = " I want more "
>>> input.strip()
'I want more'
>>> input.rstrip()
' I want more'
>>> input.lstrip()
'I want more '

On the other hand, if you want to add padding to a string, you can use the ljust or rjust methods. Either one pads with whitespace by default, or takes a character argument:

>>> output = 'Barry'
>>> output.ljust(10)
'Barry     '
>>> output.rjust(10, '*')
'*****Barry'

Sometimes you want to break a string up into a list of substrings. Perhaps you have a sentence you want to turn into a list of words, or a string of words separated by commas. The split method breaks a string into a list of strings. By default, it uses whitespace as the token to make the breaks. An optional argument can be used to add in another character where the split can break:

>>> text = "Mary had a little lamb"
>>> text.split()
['Mary', 'had', 'a', 'little', 'lamb']
>>> url = "gt.motomomo.io/v2/api/asset/143"
>>> url.split('/')
['gt.motomomo.io', 'v2', 'api', 'asset', '143']

You can easily create a new string from a sequence of strings using the join method. It inserts the string it is called on as a separator between the items of a list:

>>> items = ['cow', 'milk', 'bread', 'butter']
>>> " and ".join(items)
'cow and milk and bread and butter'

Changing the case of text is a common occurrence, whether it is making the case uniform for comparison or changing in preparation for user consumption. Python strings have several methods to make this an easy process:

>>> name = "bill monroe"
>>> name.capitalize()
'Bill monroe'
>>> name.upper()
'BILL MONROE'
>>> name.title()
'Bill Monroe'
>>> name.swapcase()
'BILL MONROE'
>>> name = "BILL MONROE"
>>> name.lower()
'bill monroe'

Python also provides methods to understand a string’s content. Whether it’s checking the case of the text, or seeing if it represents a number, there are quite a few built-in methods for interrogation. Here are just a few of the most commonly used methods:

>>> "William".startswith('W')
True
>>> "William".startswith('Bill')
False
>>> "Molly".endswith('olly')
True
>>> "abc123".isalnum()
True
>>> "abc123".isalpha()
False
>>> "abc".isalnum()
True
>>> "123".isnumeric()
True
>>> "Sandy".istitle()
True
>>> "Sandy".islower()
False
>>> "SANDY".isupper()
True

You can insert content into a string and control its format at runtime. Your program can use the values of variables or other calculated content in strings. This approach is used in both creating user-consumed text and for writing software logs.

The older form of string formatting in Python comes from the C language printf function. You can use the modulus operator, %, to insert formatted values into a string. This technique applies to the form string % values, where values can be a single nontuple or a tuple of multiple values. The string itself must have a conversion specifier for each value. The conversion specifier, at a minimum, starts with a % and is followed by a character representing the type of value inserted:

>>> "%s + %s = %s" % (1, 2, "Three")
'1 + 2 = Three'

Additional format arguments can appear in the conversion specifier. For example, you can control the number of decimal places a float, %f, prints:

>>> "%.3f" % 1.234567
'1.235'

This mechanism for string formatting was the dominant one in Python for years, and you encounter it in legacy code. This approach offers some compelling features, such as sharing syntax with other languages. It also has some pitfalls. In particular, due to the use of a sequence to hold the arguments, errors related to displaying tuple and dict objects are common. We recommend adopting newer formatting options, such as the string format method, template strings, and f-strings, to both avoid these errors and increase the simplicity and readability of your code.
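One of those pitfalls is worth a short sketch: when the value being formatted is itself a tuple, the % operator interprets it as the list of arguments, so it must be wrapped in a one-element tuple to be displayed:

```python
point = (1, 2)

# Wrapping the tuple in a one-element tuple protects it
# from being unpacked as the argument list
print("point is %s" % (point,))  # point is (1, 2)

# Passing the tuple directly makes %-formatting treat its
# items as separate arguments, which raises a TypeError
try:
    print("point is %s" % point)
except TypeError as err:
    print(err)
```

The format method and f-strings do not have this ambiguity, which is one reason they are recommended over %-formatting.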

Python 3 introduced a new way of formatting strings using the string method format. This way of formatting has been backported to Python 2 as well. This specification uses curly brackets in the string to indicate replacement fields rather than the modulus-based conversion specifiers of the old-style formatting. The insert values become arguments to the string format method. The order of the arguments determines their placement order in the target string:

>>> '{} comes before {}'.format('first', 'second')
'first comes before second'

You can specify index numbers in the brackets to insert values in an order different than that in the argument list. You can also repeat a value by specifying the same index number in multiple replacement fields:

>>> '{1} comes after {0}, but {1} comes before {2}'.format('first', 'second', 'third')
'second comes after first, but second comes before third'

An even more powerful feature is that the insert values can be specified by name:

>>> '''{country} is an island.
... {country} is off of the coast of
... {continent} in the {ocean}'''.format(ocean='Indian Ocean',
...                                      continent='Africa',
...                                      country='Madagascar')
'Madagascar is an island.\nMadagascar is off of the coast of\nAfrica in the Indian Ocean'

Here a dict works to supply the key values for name-based replacement fields:

>>> values = {'first': 'Bill', 'last': 'Bailey'}
>>> "Won't you come home {first} {last}?".format(**values)
"Won't you come home Bill Bailey?"

You can also specify format specification arguments. Here they add left and right padding using > and <. In the second example, we specify a character to use in the padding:

>>> text = "|{0:>22}||{0:<22}|"
>>> text.format('O','O')
'|                     O||O                     |'
>>> text = "|{0:<>22}||{0:><22}|"
>>> text.format('O','O')
'|<<<<<<<<<<<<<<<<<<<<<O||O>>>>>>>>>>>>>>>>>>>>>|'

Format specifications use the format specification mini-language. The same mini-language is also used by another string-formatting mechanism: f-strings.
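A few commonly used mini-language specifications, shown here as a brief sketch, control float precision, thousands separators, zero padding, and alignment:

```python
value = 1234567.891

# Thousands separator with two decimal places
print("{:,.2f}".format(value))   # 1,234,567.89

# Zero-padded to a total width of 10, two decimal places
print("{:010.2f}".format(3.14))  # 0000003.14

# Centered within a field 9 characters wide
print("{:^9}".format("mid"))     # '   mid   '
```

The full mini-language supports many more options, such as sign handling and binary or hexadecimal presentation.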

Python f-strings use the same formatting language as the format method, but offer a more straightforward and intuitive mechanism for using them. f-strings are prepended with either f or F before the first quotation mark. Like the format string previously described, f-strings use curly braces to demarcate replacement fields. In an f-string, however, the content of the replacement field is an expression. This approach means it can refer to variables defined in the current scope or involve calculations:

>>> a = 1
>>> b = 2
>>> f"a is {a}, b is {b}. Adding them results in {a + b}"
'a is 1, b is 2. Adding them results in 3'

As in format strings, format specifications in f-strings appear within the curly brackets, after the value expression, and begin with a colon:

>>> count = 43
>>> f"|{count:5d}"
'|   43'

The value expression can itself contain nested expressions, referencing variables and other expressions in the construction of the parent expression:

>>> padding = 10
>>> f"|{count:{padding}d}"
'|        43'

Tip

We highly recommend using f-strings for the majority of your string formatting. They combine the power of the specification mini-language with a simple and intuitive syntax.

Template strings are designed to offer a straightforward string substitution mechanism. These built-in methods work for tasks such as internationalization, where simple word substitutions are necessary. They use $ as a substitution character, with optional curly braces surrounding them. The characters directly following the $ identify the value to be inserted. When the substitute method of the string template executes, these names are used to assign values.

Note

Built-in types and functions are available whenever you run Python code, but to access the broader world of functionality available in the Python ecosystem, you need to use the import statement. This approach lets you add functionality from the Python Standard Library or third-party services into your environment. You can selectively import parts of a package by using the from keyword:

>>> from string import Template
>>> greeting = Template("$hello Mark Anthony")
>>> greeting.substitute(hello="Bonjour")
'Bonjour Mark Anthony'
>>> greeting.substitute(hello="Zdravstvuyte")
'Zdravstvuyte Mark Anthony'
>>> greeting.substitute(hello="Nǐn hǎo")
'Nǐn hǎo Mark Anthony'

Dicts

Aside from strings and lists, dicts may be the most used of the Python built-in classes. A dict is a mapping of keys to values. The lookup of any particular value using a key is highly efficient and fast. The keys can be strings, numbers, custom objects, or any other nonmutable type.

Note

A mutable object is one whose contents can change in place. Lists are a primary example; the contents of the list can change without the list’s identity changing. Strings are not mutable. You create a new string each time you change the contents of an existing one.
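This distinction matters for dict keys. As a short sketch (the example data is illustrative), immutable types such as strings and tuples work as keys, while a mutable list raises a TypeError:

```python
# Tuples are immutable, so they are valid dict keys
locations = {("us", "east"): "Virginia", ("us", "west"): "Oregon"}
print(locations[("us", "east")])  # Virginia

# Lists are mutable and therefore unhashable: using one
# as a key raises TypeError: unhashable type: 'list'
try:
    bad = {["us", "east"]: "Virginia"}
except TypeError as err:
    print(err)
```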

Dicts are represented as comma-separated key/value pairs surrounded by curly braces. The key/value pairs consist of a key, a colon (:), and then a value.

You can create a dict object using the dict() constructor. With no arguments, it creates an empty dict. It takes a sequence of key/value pairs as an argument as well:

>>> map = dict()
>>> type(map)
<class 'dict'>
>>> map
{}
>>> kv_list = [['key-1', 'value-1'], ['key-2', 'value-2']]
>>> dict(kv_list)
{'key-1': 'value-1', 'key-2': 'value-2'}

You can also create a dict directly using curly braces:

>>> map = {'key-1': 'value-1', 'key-2': 'value-2'}
>>> map
{'key-1': 'value-1', 'key-2': 'value-2'}

You can access the value associated with a key using square bracket syntax:

>>> map['key-1']
'value-1'
>>> map['key-2']
'value-2'

You can use the same syntax to set a value. If the key is not in the dict, it adds as a new entry. If it already exists, the value changes to the new value:

>>> map
{'key-1': 'value-1', 'key-2': 'value-2'}
>>> map['key-3'] = 'value-3'
>>> map
{'key-1': 'value-1', 'key-2': 'value-2', 'key-3': 'value-3'}
>>> map['key-1'] = 13
>>> map
{'key-1': 13, 'key-2': 'value-2', 'key-3': 'value-3'}

If you try to access a key that has not been defined in a dict, a KeyError exception will be thrown:

>>> map['key-4']
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    map['key-4']
KeyError: 'key-4'

You can check to see if the key exists in a dict using the in syntax we saw with sequences. In the case of dicts, it checks for the existence of keys:

>>> if 'key-4' in map:
...     print(map['key-4'])
... else:
...     print('key-4 not there')
...
key-4 not there

A more intuitive solution is to use the get() method. If you have not defined a key in a dict, it returns a supplied default value. If you have not supplied a default value, it returns None:

>>> map.get('key-4', 'default-value')
'default-value'

Use del to remove a key-value pair from a dict:

>>> del(map['key-1'])
>>> map
{'key-2': 'value-2', 'key-3': 'value-3'}

The keys() method returns a dict_keys object with the dict’s keys. The values() method returns a dict_values object, and the items() method returns key-value pairs. This last method is useful for iterating through the contents of a dict:

>>> map.keys()
dict_keys(['key-1', 'key-2'])
>>> map.values()
dict_values(['value-1', 'value-2'])
>>> for key, value in map.items():
...     print(f"{key}: {value}")
...
key-1: value-1
key-2: value-2

Similar to list comprehensions, dict comprehensions are one-line statements returning a dict by iterating through a sequence:

>>> letters = 'abcde'
>>> # mapping individual letters to their upper-case representations
>>> cap_map = {x: x.upper() for x in letters}
>>> cap_map['b']
'B'

Functions

You have seen some Python built-in functions already. Now move on to writing your own. Remember, a function is a mechanism for encapsulating a block of code. You can repeat the behavior of this block in multiple spots without having to duplicate the code. Your code will be better organized, more testable, maintainable, and easier to understand.

Anatomy of a Function

The first line of a function definition starts with the keyword def, followed by the function name, function parameters enclosed in parentheses, and then :. The rest of the function is a code block and is indented:

def <FUNCTION NAME>(<PARAMETERS>): <CODE BLOCK>

If a string using multiline syntax is provided first in the indented block, it acts as documentation. Use these to describe what your function does, how parameters work, and what it can be expected to return. You will find these docstrings are invaluable for communicating with future users of your code. Various programs and services also use them to create documentation. Providing docstrings is considered a best practice and is highly recommended:

>>> def my_function():
...     '''This is a doc string.
...
...     It should describe what the function does,
...     what parameters work, and what the
...     function returns.
...     '''
...

Function arguments occur in the parentheses following the function name. They can be either positional or keyword. Positional arguments use the order of the arguments to assign value:

>>> def positioned(first, second):
...     """Assignment based on order."""
...     print(f"first: {first}")
...     print(f"second: {second}")
...
>>> positioned(1, 2)
first: 1
second: 2

With keyword arguments, each argument is assigned a default value:

>>> def keywords(first=1, second=2):
...     '''Default values assigned'''
...     print(f"first: {first}")
...     print(f"second: {second}")
...

The default values are used when no values are passed during function invocation. The keyword parameters can be called by name during function invocation, in which case the order will not matter:

>>> keywords(0)
first: 0
second: 2
>>> keywords(3, 4)
first: 3
second: 4
>>> keywords(second='one', first='two')
first: two
second: one

When using keyword parameters, all parameters defined after a keyword parameter must be keyword parameters as well. All functions return a value. The return keyword is used to set this value. If not set from a function definition, the function returns None:

>>> def no_return():
...     '''No return defined'''
...     pass
...
>>> result = no_return()
>>> print(result)
None
>>> def return_one():
...     '''Returns 1'''
...     return 1
...
>>> result = return_one()
>>> print(result)
1
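The ordering rule above can be sketched with a hypothetical connect function (the name, parameters, and defaults are illustrative): parameters with defaults must follow plain positional parameters in the definition:

```python
def connect(host, port=22, timeout=30):
    """Positional parameter first, then parameters with defaults."""
    return f"{host}:{port} (timeout={timeout})"

print(connect("server1"))             # server1:22 (timeout=30)
print(connect("server2", 2222))       # server2:2222 (timeout=30)
print(connect("server3", timeout=5))  # server3:22 (timeout=5)

# Reversing the order is a SyntaxError:
#   def bad(port=22, host): ...
```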

Functions as Objects

Functions are objects. They can be passed around, or stored in data structures. You can define two functions, put them in a list, and then iterate through the list to invoke them:

>>> def double(input):
...     '''double input'''
...     return input*2
...
>>> double
<function double at 0x107d34ae8>
>>> type(double)
<class 'function'>
>>> def triple(input):
...     '''Triple input'''
...     return input*3
...
>>> functions = [double, triple]
>>> for function in functions:
...     print(function(3))
...
6
9

Anonymous Functions

When you need to create a very limited function, you can create an unnamed (anonymous) one using the lambda keyword. Generally, you should limit their use to situations where a function expects a small function as an argument. In this example, you take a list of lists and sort it. The default sorting mechanism compares based on the first item of each sublist.

To sort based on something other than the first entry, you can define a method which returns the item’s second entry and pass it into the sorting function’s key parameter:

>>> items = [[0, 'a', 2], [5, 'b', 0], [2, 'c', 1]]
>>> def second(item):
...     '''return second entry'''
...     return item[1]
...
>>> sorted(items, key=second)
[[0, 'a', 2], [5, 'b', 0], [2, 'c', 1]]

With the lambda keyword, you can do the same thing without the full function definition. Lambdas work with the lambda keyword followed by a parameter name, a colon, and a return value:

lambda <PARAM>: <RETURN EXPRESSION>

Sort using lambdas, first using the second entry and then using the third:

>>> sorted(items, key=lambda item: item[1])
[[0, 'a', 2], [5, 'b', 0], [2, 'c', 1]]
>>> sorted(items, key=lambda item: item[2])
[[5, 'b', 0], [2, 'c', 1], [0, 'a', 2]]

Be cautious of using lambdas more generally, as they can create code that is poorly documented and confusing to read if used in place of general functions.

Using Regular Expressions

The need to match patterns in strings comes up again and again. You could be looking for an identifier in a log file or checking user input for keywords or a myriad of other cases. You have already seen simple pattern matching using the in operation for sequences, or the string .endswith and .startswith methods. To do more sophisticated matching, you need a more powerful tool. Regular expressions, often referred to as regex, are the answer. Regular expressions use a string of characters to define search patterns. The Python re package offers regular expression operations similar to those found in Perl. The re module uses backslashes (\) to delineate special characters used in matching. To avoid confusion with regular string escape sequences, raw strings are recommended when defining regular expression patterns. Raw strings are prepended with an r before the first quotation mark.

Note

Python strings have several escape sequences. Among the most common are line-feed \n and tab \t.
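A short sketch of why raw strings matter when writing patterns: in a normal string, \n collapses to a single newline character, while a raw string preserves the backslash and the letter, which is what the regular expression engine expects to see:

```python
# Normal string: \n is one newline character
normal = "a\nb"
print(len(normal))  # 3

# Raw string: the backslash and 'n' are two literal characters
raw = r"a\nb"
print(len(raw))  # 4
```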

Searching

Let’s say you have the cc list from an email as text, and you want to understand more about who is in this list:

In [1]: cc_list = '''Ezra Koenig <ekoenig@vpwk.com>,
   ...: Rostam Batmanglij <rostam@vpwk.com>,
   ...: Chris Tomson <ctomson@vpwk.com,
   ...: Bobbi Baio <cbaio@vpwk.com'''

If you want to know whether a name is in this text, you could use the in sequence membership syntax:

In [2]: 'Rostam' in cc_list Out[2]: True

To get similar behavior, you can use the re.search function, which returns a re.Match object only if there is a match:

In [3]: import re

In [4]: re.search(r'Rostam', cc_list)
Out[4]: <re.Match object; span=(32, 38), match='Rostam'>

You can use this as a condition to test for membership:

>>> if re.search(r'Rostam', cc_list):
...     print('Found Rostam')
...
Found Rostam

Character Sets

So far re hasn’t given you anything you couldn’t get using the in operator. However, what if you are looking for a person in a text, but you can’t remember if the name is Bobbi or Robby?

With regular expressions, you can use groups of characters, any one of which could appear in a spot. These are called character sets. The characters from which a match should be chosen are enclosed by square brackets in the regular expression definition. You can match on B or R, followed by obb, and either i or y:

In [5]: re.search(r'[R,B]obb[i,y]', cc_list) Out[5]: <re.Match object; span=(101, 106), match='Bobbi'>

You can put comma-separated individual characters in a character set or use ranges. The range A–Z includes all the capitalized letters; the range 0–9 includes the digits from zero to nine:

In [6]: re.search(r'Chr[a-z][a-z]', cc_list) Out [6]: <re.Match object; span=(69, 74), match='Chris'>

The + after an item in a regular expression matches one or more of that item. A number in brackets matches an exact number of characters:

In [7]: re.search(r'[A-Za-z]+', cc_list) Out [7]: <re.Match object; span=(0, 4), match='Ezra'> In [8]: re.search(r'[A-Za-z]{6}', cc_list) Out [8]: <re.Match object; span=(5, 11), match='Koenig'>

We can construct a match using a combination of character sets and other characters to make a naive match of an email address. The . character has a special meaning. It is a wildcard and matches any character. To match against the actual . character, you must escape it using a backslash:

In [9]: re.search(r'[A-Za-z]+@[a-z]+\.[a-z]+', cc_list)
Out[9]: <re.Match object; span=(13, 29), match='ekoenig@vpwk.com'>

This example is just a demonstration of character sets. It does not represent the full complexity of a production-ready regular expression for emails.

Character Classes

In addition to character sets, Python’s re offers character classes. These are premade character sets. Some commonly used ones are \w, which is equivalent to [a-zA-Z0-9_] and \d, which is equivalent to [0-9]. You can use the + modifier to match for multiple characters:

>>> re.search(r'\w+', cc_list)
<re.Match object; span=(0, 4), match='Ezra'>

And you can replace our primitive email matcher with one built from \w:

>>> re.search(r'\w+\@\w+\.\w+', cc_list)
<re.Match object; span=(13, 29), match='ekoenig@vpwk.com'>

Groups

You can use parentheses to define groups in a match. These groups can be accessed from the match object. They are numbered in the order they appear, with the zero group being the full match:

>>> re.search(r'(\w+)\@(\w+)\.(\w+)', cc_list)
<re.Match object; span=(13, 29), match='ekoenig@vpwk.com'>
>>> matched = re.search(r'(\w+)\@(\w+)\.(\w+)', cc_list)
>>> matched.group(0)
'ekoenig@vpwk.com'
>>> matched.group(1)
'ekoenig'
>>> matched.group(2)
'vpwk'
>>> matched.group(3)
'com'

Named Groups

You can also supply names for the groups by adding ?P<NAME> in the group definition. Then you can access the groups by name instead of number:

>>> matched = re.search(r'(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)', cc_list)
>>> matched.group('name')
'ekoenig'
>>> print(f'''name: {matched.group("name")}
... Secondary Level Domain: {matched.group("SLD")}
... Top Level Domain: {matched.group("TLD")}''')
name: ekoenig
Secondary Level Domain: vpwk
Top Level Domain: com

Find All

Up until now, we have demonstrated returning just the first match found. We can also use findall to return all of the matches as a list of strings:

>>> matched = re.findall(r'\w+\@\w+\.\w+', cc_list)
>>> matched
['ekoenig@vpwk.com', 'rostam@vpwk.com', 'ctomson@vpwk.com', 'cbaio@vpwk.com']
>>> matched = re.findall(r'(\w+)\@(\w+)\.(\w+)', cc_list)
>>> matched
[('ekoenig', 'vpwk', 'com'), ('rostam', 'vpwk', 'com'), ('ctomson', 'vpwk', 'com'),
 ('cbaio', 'vpwk', 'com')]
>>> names = [x[0] for x in matched]
>>> names
['ekoenig', 'rostam', 'ctomson', 'cbaio']

Find Iterator

When dealing with large texts, such as logs, it is useful to not process the text all at once. You can produce an iterator object using the finditer method. This object processes text until it finds a match and then stops. Passing it to the next function returns the current match and continues processing until finding the next match. In this way, you can deal with each match individually without devoting resources to process all of the input at once:

>>> matched = re.finditer(r'\w+\@\w+\.\w+', cc_list)
>>> matched
<callable_iterator object at 0x108e68748>
>>> next(matched)
<re.Match object; span=(13, 29), match='ekoenig@vpwk.com'>
>>> next(matched)
<re.Match object; span=(51, 66), match='rostam@vpwk.com'>
>>> next(matched)
<re.Match object; span=(83, 99), match='ctomson@vpwk.com'>

The iterator object, matched, can be used in a for loop as well:

>>> matched = re.finditer(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", cc_list)
>>> for m in matched:
...     print(m.groupdict())
...
{'name': 'ekoenig', 'SLD': 'vpwk', 'TLD': 'com'}
{'name': 'rostam', 'SLD': 'vpwk', 'TLD': 'com'}
{'name': 'ctomson', 'SLD': 'vpwk', 'TLD': 'com'}
{'name': 'cbaio', 'SLD': 'vpwk', 'TLD': 'com'}

Substitution

Besides searching and matching, regexes can be used to substitute part or all of a string:

>>> re.sub(r"\d", "#", "The passcode you entered was 09876")
'The passcode you entered was #####'
>>> users = re.sub(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)",
...                r"\g<TLD>.\g<SLD>.\g<name>", cc_list)
>>> print(users)
Ezra Koenig <com.vpwk.ekoenig>,
Rostam Batmanglij <com.vpwk.rostam>,
Chris Tomson <com.vpwk.ctomson,
Bobbi Baio <com.vpwk.cbaio

Compiling

All of the examples so far have called methods on the re module directly. This is adequate for many cases, but if the same match is going to happen many times, performance gains can be had by compiling the regular expression into an object. This object can be reused for matches without recompiling:

>>> regex = re.compile(r'\w+\@\w+\.\w+')
>>> regex.search(cc_list)
<re.Match object; span=(13, 29), match='ekoenig@vpwk.com'>

Regular expressions offer many more features than we have dealt with here. Indeed, many books have been written on their use, but you should now be prepared for most basic cases.

Lazy Evaluation

Lazy evaluation is the idea that, especially when dealing with large amounts of data, you do not want to process all of the data before using the results. You have already seen this with the range type, where the memory footprint is the same, even for one representing a large group of numbers.

Generators

You can use generators in a similar way as range objects. They perform some operation on data in chunks as requested. They pause their state in between calls. This means that you can store variables that are needed to calculate output, and they are accessed every time the generator is called.

To write a generator function, use the yield keyword rather than a return statement. Every time the generator is called, it returns the value specified by yield and then pauses its state until it is next called. Let’s write a generator that simply counts, returning each subsequent number:

>>> def count():
...     n = 0
...     while True:
...         n += 1
...         yield n
...
>>> counter = count()
>>> counter
<generator object count at 0x10e8509a8>
>>> next(counter)
1
>>> next(counter)
2
>>> next(counter)
3

Note that the generator keeps track of its state, and hence the variable n in each call to the generator reflects the value previously set. Let’s implement a Fibonacci generator:

>>> def fib():
...     first = 0
...     last = 1
...     while True:
...         first, last = last, first + last
...         yield first
...
>>> f = fib()
>>> next(f)
1
>>> next(f)
1
>>> next(f)
2
>>> next(f)
3

We can also iterate using the generator in a for loop:

>>> f = fib()
>>> for x in f:
...     print(x)
...     if x > 12:
...         break
...
1
1
2
3
5
8
13

Generator Comprehensions

We can use generator comprehensions to create one-line generators. They are created using a syntax similar to list comprehensions, but parentheses are used rather than square brackets:

>>> list_o_nums = [x for x in range(100)]
>>> gen_o_nums = (x for x in range(100))
>>> list_o_nums
[0, 1, 2, 3, ... 97, 98, 99]
>>> gen_o_nums
<generator object <genexpr> at 0x10ea14408>

Even with this small example, we can see the difference in memory used by calling the sys.getsizeof function, which returns the size of an object in bytes:

>>> import sys
>>> sys.getsizeof(list_o_nums)
912
>>> sys.getsizeof(gen_o_nums)
120
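One behavior worth knowing that the session above does not show: unlike a list, a generator is exhausted after a single pass, so iterating it a second time yields nothing. A minimal sketch:

```python
# Generators are single-pass iterators: once consumed, they yield nothing more.
gen_o_nums = (x for x in range(100))

first_pass = sum(gen_o_nums)   # consumes the generator
second_pass = sum(gen_o_nums)  # already exhausted

print(first_pass)   # 4950
print(second_pass)  # 0
```

If you need to iterate the same data more than once, either recreate the generator or materialize it into a list first.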

More IPython Features

You saw some of IPython’s features at the beginning of the chapter. Now let’s look at some more advanced features, such as running shell commands from within the IPython interpreter and using magic functions.

Using IPython to Run Unix Shell Commands

You can use IPython to run shell commands. This is one of the most compelling reasons to perform DevOps actions in the IPython shell. Let’s take a look at a very simple example where the ! character, which IPython uses to identify shell commands, is put in front of the command ls:

In [3]: var_ls = !ls -l

In [4]: type(var_ls)
Out[4]: IPython.utils.text.SList

The output of the command is assigned to a Python variable var_ls. The type of this variable is IPython.utils.text.SList. The SList type converts a regular shell command into an object that has three main methods: fields, grep, and sort. Here is an example in action using the Unix df command. The sort method can interpret the whitespace from this Unix command and then sort the third column by size:

In [6]: df = !df

In [7]: df.sort(3, nums = True)

Let's take a look at SList and .grep next. Here is an example that searches the /usr/bin directory for installed commands with kill in their names:

In [10]: ls = !ls -l /usr/bin

In [11]: ls.grep("kill")
Out[11]:
['-rwxr-xr-x  1 root  wheel   1621 Aug 20  2018 kill.d',
 '-rwxr-xr-x  1 root  wheel  23984 Mar 20 23:10 killall',
 '-rwxr-xr-x  1 root  wheel  30512 Mar 20 23:10 pkill']

The key takeaway here is that IPython is a dream environment for hacking around with little shell scripts.

Using IPython magic commands

If you get in the habit of using IPython, you should also get in the habit of using built-in magic commands. They are essentially shortcuts that pack a big punch. Magic commands are indicated by prepending them with %%. Here is an example of how to write inline Bash inside of IPython. Note, this is just a small command, but it could be an entire Bash script:

In [13]: %%bash
    ...: uname -a
    ...:
    ...:
Darwin nogibjj.local 18.5.0 Darwin Kernel Version 18.5.0: Mon Mar ...

The %%writefile magic is pretty tricky because you can write and test Python or Bash scripts on the fly, using IPython to execute them. That's not a bad party trick at all:

In [16]: %%writefile print_time.py
    ...: #!/usr/bin/env python
    ...: import datetime
    ...: print(datetime.datetime.now().time())
    ...:
    ...:
Writing print_time.py

In [17]: cat print_time.py
#!/usr/bin/env python
import datetime
print(datetime.datetime.now().time())

In [18]: !python print_time.py
19:06:00.594914

Another very useful command, %who, will show you what is loaded into memory. It comes in quite handy when you have been working in a terminal that has been running for a long time:

In [20]: %who
df       ls      var_ls

Exercises

Write a Python function that takes a name as an argument and prints that name.

Write a Python function that takes a string as an argument and prints whether it is upper- or lowercase.

Write a list comprehension that results in a list of every letter in the word smogtether capitalized.

Write a generator that alternates between returning Even and Odd.

Chapter 2. Automating Files and the Filesystem

One of Python’s most powerful features is its ability to manipulate text and files. In the DevOps world, you are continually parsing, searching, and changing the text in files, whether you’re searching application logs or propagating configuration files. Files are a means of persisting the state of your data, code, and configuration; they are how you look back at what happened in logs and how you control what happens with configuration. With Python, you can create, read, and change files and text in the code that you can use repeatedly. Automating these tasks is indeed one aspect of modern DevOps that separates it from traditional system administration. Rather than keeping a set of instructions that you have to follow manually, you can write code. This diminishes your chances of missing steps or doing them out of order. If you are confident that your system uses the same steps every time you run it, you can have greater understanding and confidence in the process.

Reading and Writing Files

You can use the open function to create a file object that can read and write files. It takes two arguments, the path of the file and the mode (the mode is optional and defaults to reading). You use the mode to indicate, among other things, whether you want to read or write a file and whether it holds text or binary data. You can open a text file using the mode r to read its contents. The file object has a read method that returns the contents of the file as a string:

In [1]: file_path = 'bookofdreams.txt'

In [2]: open_file = open(file_path, 'r')

In [3]: text = open_file.read()

In [4]: len(text)
Out[4]: 476909

In [5]: text[56]
Out[5]: 's'

In [6]: open_file
Out[6]: <_io.TextIOWrapper name='bookofdreams.txt' mode='r' encoding='UTF-8'>

In [7]: open_file.close()

Note

It is a good practice to close a file when you finish with it. Python closes a file when it is out of scope, but until then the file consumes resources and may prevent other processes from opening it.

You can also read a file using the readlines method. This method reads the file and splits its contents on newline characters. It returns a list of strings. Each string is one line of the original text:

In [8]: open_file = open(file_path, 'r')

In [9]: text = open_file.readlines()

In [10]: len(text)
Out[10]: 8796

In [11]: text[100]
Out[11]: 'science, when it admits the possibility of occasional hallucinations\n'

In [12]: open_file.close()

A handy way of opening files is to use with statements. You do not need to close a file explicitly in this case. Python closes it and releases the file resource at the end of the indented block:

In [13]: with open(file_path, 'r') as open_file:
    ...:     text = open_file.readlines()
    ...:

In [14]: text[101]
Out[14]: 'in the sane and healthy, also admits, of course, the existence of\n'

In [15]: open_file.closed
Out[15]: True

Different operating systems use different escaped characters to represent line endings. Unix systems use \n and Windows systems use \r\n. Python converts these to \n when you open a file as text. If you open a binary file, such as a .jpeg image, as text, you are likely to corrupt the data through this conversion. You can, however, read binary files by appending a b to the mode:

In [15]: file_path = 'bookofdreamsghos00lang.pdf'

In [16]: with open(file_path, 'rb') as open_file:
    ...:     btext = open_file.read()
    ...:

In [17]: btext[0]
Out[17]: 37

In [18]: btext[:25]
Out[18]: b'%PDF-1.5\n%\xec\xf5\xf2\xe1\xe4\xef\xe3\xf5\xed\xe5\xee\xf4\n18'

Adding the b opens the file without any line-ending conversion.

To write to a file, use the write mode, represented as the argument w. The tool direnv is used to automatically set up some development environments. You can define environment variables and application runtimes in a file named .envrc; direnv uses it to set these things up when you enter the directory with the file. You can set the environment variable STAGE to PROD and TABLE_ID to token-storage-1234 in such a file in Python by using open with the write flag:

In [19]: text = '''export STAGE=PROD
    ...: export TABLE_ID=token-storage-1234'''

In [20]: with open('.envrc', 'w') as opened_file:
    ...:     opened_file.write(text)
    ...:

In [21]: !cat .envrc
export STAGE=PROD
export TABLE_ID=token-storage-1234

Warning

Be warned that pathlib's write_text method will overwrite a file if it already exists.

The open function creates a file if it does not already exist and overwrites it if it does. If you want to keep existing contents and only append to the file, use the append flag a. This flag appends new text to the end of the file while keeping the original content. If you are writing nontext content, such as the contents of a .jpeg file, you are likely to corrupt it if you use either the w or a flag. This corruption is likely because Python converts line endings to platform-specific ones when it writes text data. To write binary data, you can safely use wb or ab.
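A short sketch of the append flag in action (the file name here is illustrative):

```python
# 'w' creates or truncates the file; 'a' keeps its contents and writes at the end.
with open('scratch.log', 'w') as opened_file:
    opened_file.write('first line\n')

with open('scratch.log', 'a') as opened_file:
    opened_file.write('second line\n')

with open('scratch.log', 'r') as opened_file:
    print(opened_file.read())  # both lines survive
```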

Chapter 3 covers pathlib in depth. Two useful features are convenience functions for reading and writing files. pathlib handles the file object behind the scenes. The following allows you to read text from a file:

In [35]: import pathlib

In [36]: path = pathlib.Path(
    "/Users/kbehrman/projects/autoscaler/check_pending.py")

In [37]: path.read_text()

To read binary data, use the path.read_bytes method.
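A minimal sketch of the binary counterparts (the file name is illustrative): write_bytes and read_bytes skip line-ending translation entirely, so they are safe for nontext data:

```python
import pathlib

path = pathlib.Path('sample.bin')

# write_bytes returns the number of bytes written, just as
# write_text returns the number of characters written.
count = path.write_bytes(b'%PDF-1.5\n')
print(count)              # 9
print(path.read_bytes())  # b'%PDF-1.5\n'
```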

When you want to overwrite a file or write a new file, there are methods for writing text and for writing binary data:

In [38]: path = pathlib.Path("/Users/kbehrman/sp.config")

In [39]: path.write_text("LOG:DEBUG")
Out[39]: 9

In [40]: path = pathlib.Path("/Users/kbehrman/sp")
Out[41]: 8

Reading and writing using the file object's read and write functions is usually adequate for unstructured text, but what if you are dealing with more complex data? The JavaScript Object Notation (JSON) format is widely used to store simple structured data in modern web services. It uses two data structures: a mapping of key-value pairs similar to a Python dict and a list of items somewhat similar to a Python list. It defines data types for numbers, strings, booleans (which hold true/false values), and nulls (empty values). The AWS Identity and Access Management (IAM) web service allows you to control access to AWS resources. It uses JSON files to define access policies, as in this sample file:

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "service-prefix:action-name",
        "Resource": "*",
        "Condition": {
            "DateGreaterThan": {"aws:CurrentTime": "2017-07-01T00:00:00Z"},
            "DateLessThan": {"aws:CurrentTime": "2017-12-31T23:59:59Z"}
        }
    }
}

You could use the standard file object read or readlines methods to get the data from such a file:

In [8]: with open('service-policy.json', 'r') as opened_file:
   ...:     policy = opened_file.readlines()
   ...:

The result would not be immediately usable, as it would be a single string or list of strings, depending on your chosen read method:

In [9]: print(policy)
['{\n',
 '    "Version": "2012-10-17",\n',
 '    "Statement": {\n',
 '        "Effect": "Allow",\n',
 '        "Action": "service-prefix:action-name",\n',
 '        "Resource": "*",\n',
 '        "Condition": {\n',
 '            "DateGreaterThan": {"aws:CurrentTime": "2017-07-01T00:00:00Z"},\n',
 '            "DateLessThan": {"aws:CurrentTime": "2017-12-31T23:59:59Z"}\n',
 '        }\n',
 '    }\n',
 '}\n']

You would then need to parse this string (or strings) into data structures and types that match the original, which may be a great deal of work. A far better way is to use the json module:

In [10]: import json

In [11]: with open('service-policy.json', 'r') as opened_file:
    ...:     policy = json.load(opened_file)
    ...:

This module parses the JSON format for you, returning the data in appropriate Python data structures:

In [13]: from pprint import pprint

In [14]: pprint(policy)
{'Statement': {'Action': 'service-prefix:action-name',
               'Condition': {'DateGreaterThan': {'aws:CurrentTime': '2017-07-01T00:00:00Z'},
                             'DateLessThan': {'aws:CurrentTime': '2017-12-31T23:59:59Z'}},
               'Effect': 'Allow',
               'Resource': '*'},
 'Version': '2012-10-17'}

Note

The pprint module automatically formats Python objects for printing. Its output is often more easily read and is a handy way of looking at nested data structures.

Now you can use the data with the original file structure. For example, here is how you would change the resource whose access this policy controls to S3:

In [15]: policy['Statement']['Resource'] = 'S3'

In [16]: pprint(policy)
{'Statement': {'Action': 'service-prefix:action-name',
               'Condition': {'DateGreaterThan': {'aws:CurrentTime': '2017-07-01T00:00:00Z'},
                             'DateLessThan': {'aws:CurrentTime': '2017-12-31T23:59:59Z'}},
               'Effect': 'Allow',
               'Resource': 'S3'},
 'Version': '2012-10-17'}

You can write a Python dictionary as a JSON file by using the json.dump method. This is how you would update the policy file you just modified:

In [17]: with open('service-policy.json', 'w') as opened_file:
    ...:     json.dump(policy, opened_file)
    ...:
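By default, json.dump writes everything on one line; its standard indent parameter produces output formatted much like the original policy file. A small round-trip sketch (the file name and simplified policy here are made up):

```python
import json

policy = {'Version': '2012-10-17',
          'Statement': {'Effect': 'Allow', 'Resource': 'S3'}}

# indent=4 pretty-prints nested structures, one key per line.
with open('policy-pretty.json', 'w') as opened_file:
    json.dump(policy, opened_file, indent=4)

with open('policy-pretty.json', 'r') as opened_file:
    restored = json.load(opened_file)

print(restored == policy)  # True: the round trip preserves the data
```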

Another language commonly used in configuration files is YAML (“YAML Ain’t Markup Language”). It is a superset of JSON, but has a more compact format, using whitespace similar to how Python uses it.

Ansible is a tool used to automate software configuration, management, and deployment. Ansible uses files, referred to as playbooks, to define actions you want to automate. These playbooks use the YAML format:

---
- hosts: webservers
  vars:
    http_port: 80
    max_clients: 200
  remote_user: root
  tasks:
  - name: ensure apache is at the latest version
    yum:
      name: httpd
      state: latest
...

The most commonly used library for parsing YAML files in Python is PyYAML. It is not in the Python Standard Library, but you can install it using pip:

$ pip install PyYAML

Once installed, you can use PyYAML to import and export YAML data much as you did with JSON:

In [18]: import yaml

In [19]: with open('verify-apache.yml', 'r') as opened_file:
    ...:     verify_apache = yaml.safe_load(opened_file)
    ...:

The data loads as familiar Python data structures (a list containing a dict):

In [20]: pprint(verify_apache)
[{'handlers': [{'name': 'restart apache',
                'service': {'name': 'httpd', 'state': 'restarted'}}],
  'hosts': 'webservers',
  'remote_user': 'root',
  'tasks': [{'name': 'ensure apache is at the latest version',
             'yum': {'name': 'httpd', 'state': 'latest'}},
            {'name': 'write the apache config file',
             'notify': ['restart apache'],
             'template': {'dest': '/etc/httpd.conf', 'src': '/srv/httpd.j2'}},
            {'name': 'ensure apache is running',
             'service': {'name': 'httpd', 'state': 'started'}}],
  'vars': {'http_port': 80, 'max_clients': 200}}]

You can also save Python data to a file in YAML format:

In [22]: with open('verify-apache.yml', 'w') as opened_file:
    ...:     yaml.dump(verify_apache, opened_file)
    ...:

Another language widely used for representing structured data is Extensible Markup Language (XML). It consists of hierarchical documents of tagged elements. Historically, many web systems used XML to transport data. One such use is for Really Simple Syndication (RSS) feeds. RSS feeds are used to track and notify users of updates to websites and have been used to track the publication of articles from various sources. RSS feeds use XML-formatted pages. Python offers the xml library for dealing with XML documents. It maps the XML documents' hierarchical structure to a tree-like data structure. The nodes of the tree are elements, and a parent-child relationship is used to model the hierarchy. The top parent node is referred to as the root element. To parse an RSS XML document and get its root:

In [1]: import xml.etree.ElementTree as ET

In [2]: tree = ET.parse('http_feeds.feedburner.com_oreilly_radar_atom.xml')

In [3]: root = tree.getroot()

In [4]: root
Out[4]: <Element '{http://www.w3.org/2005/Atom}feed' at 0x11292c958>

You can walk down the tree by iterating over the child nodes:

In [5]: for child in root:
   ...:     print(child.tag, child.attrib)
   ...:
{http://www.w3.org/2005/Atom}title {}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}subtitle {}
{http://www.w3.org/2005/Atom}link {'href': 'https://www.oreilly.com'}
{http://www.w3.org/2005/Atom}link {'rel': 'hub', 'href': 'http://pubsubhubbub.appspot.com/'}
{http://www.w3.org/2003/01/geo/wgs84_pos#}long {}
{http://rssnamespace.org/feedburner/ext/1.0}emailServiceId {}
...

XML allows for namespacing (using tags to group data). XML prepends tags with namespaces enclosed in curly braces, as in the output above. If you know the structure of the hierarchy, you can search for elements by using their paths. You can supply a dictionary that defines namespaces as a convenience:

In [108]: ns = {'default':'http://www.w3.org/2005/Atom'}

In [106]: authors = root.findall("default:entry/default:author/default:name", ns)

In [107]: for author in authors:
     ...:     print(author.text)
     ...:
Nat Torkington
VM Brasseur
Adam Jacob
Roger Magoulas
Pete Skomoroch
Adrian Cockcroft
Ben Lorica
Nat Torkington
Alison McCauley
Tiffani Bell
Arun Gupta

You may find yourself dealing with data stored as comma-separated values (CSV). This format is common for spreadsheet data. You can use the Python csv module to read these easily:

In [16]: import csv

In [17]: file_path = '/Users/kbehrman/Downloads/registered_user_count_ytd.csv'

In [18]: with open(file_path, newline='') as csv_file:
    ...:     off_reader = csv.reader(csv_file, delimiter=',')
    ...:     for _ in range(5):
    ...:         print(next(off_reader))
    ...:
['Date', 'PreviousUserCount', 'UserCountTotal', 'UserCountDay']
['2014-01-02', '61', '5336', '5275']
['2014-01-03', '42', '5378', '5336']
['2014-01-04', '26', '5404', '5378']
['2014-01-05', '65', '5469', '5404']

The csv reader object iterates through the .csv file one line at a time, allowing you to process the data one row at a time. Processing a file this way is especially useful for large .csv files that you do not want to read into memory all at once. Of course, if you need to do multiple row calculations across columns and the file is not overly large, you should load it all at once.
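The row-at-a-time pattern looks like this in practice; the column layout mirrors the sample above, but the file and the numbers here are made up:

```python
import csv

# Create a small file in the same shape as the sample data above.
with open('counts.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Date', 'UserCountDay'])
    writer.writerows([['2014-01-02', '5275'], ['2014-01-03', '5336']])

# Sum a column one row at a time; only the current row is ever in memory.
total = 0
with open('counts.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for row in reader:
        total += int(row[1])

print(total)  # 10611
```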

The Pandas package is a mainstay in the data science world. It includes a data structure, the pandas.DataFrame, which acts like a data table, similar to a very powerful spreadsheet. If you have table-like data on which you want to do statistical analysis or that you want to manipulate by rows and columns, the DataFrame is the tool for you. It is a third-party library, so you need to install it with pip. You can use a variety of methods to load data into a DataFrame; one of the most common is from a .csv file:

In [54]: import pandas as pd

In [55]: df = pd.read_csv('sample-data.csv')

In [56]: type(df)
Out[56]: pandas.core.frame.DataFrame

You can take a look at the top rows of your DataFrame using the head method:

In [57]: df.head(3)
Out[57]:
   Attributes     open     high      low    close    volume
0     Symbols        F        F        F        F         F
1        date      NaN      NaN      NaN      NaN       NaN
2  2018-01-02  11.3007  11.4271  11.2827  11.4271  20773320

You can get a statistical insight using the describe method:

In [58]: df.describe()
Out[58]:
        Attributes    open    high     low   close    volume
count          357     356     356     356     356       356
unique         357     290     288     297     288       356
top     2018-10-18  10.402  8.3363    10.2  9.8111  36298597
freq             1       5       4       3       4         1

Alternatively, you can view a single column of data by using its name in square brackets:

In [59]: df['close']
Out[59]:
0            F
1          NaN
2      11.4271
3      11.5174
4      11.7159
        ...
352       9.83
353       9.78
354       9.71
355       9.74
356       9.52
Name: close, Length: 357, dtype: object

Pandas has many more methods for analyzing and manipulating table-like data, and there are many books on its use. It is a tool you should be aware of if you have the need to do data analysis.

Using Regular Expressions to Search Text

The Apache HTTP server is an open source web server widely used to serve web content. The web server can be configured to save log files in different formats. One widely used format is the Common Log Format (CLF). A variety of log analysis tools can understand this format. Below is the layout of this format:

<IP Address> <Client Id> <User Id> <Time> <Request> <Status> <Size>

What follows is an example line from a log in this format:

127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326

Chapter 1 introduced you to regular expressions and the Python re module, so let’s use it to pull information from a log in the common log format. One trick to constructing regular expressions is to do it in sections. Doing so enables you to get each subexpression working without the complication of debugging the whole expression. You can create a regular expression using named groups to pull out the IP address from a line:

In [1]: line = '127.0.0.1 - rj [13/Nov/2019:14:43:30] "GET HTTP/1.0" 200'

In [2]: re.search(r'(?P<IP>\d+\.\d+\.\d+\.\d+)', line)
Out[2]: <re.Match object; span=(0, 9), match='127.0.0.1'>

In [3]: m = re.search(r'(?P<IP>\d+\.\d+\.\d+\.\d+)', line)

In [4]: m.group('IP')
Out[4]: '127.0.0.1'

You can also create a regular expression to get the time:

In [5]: r = r'\[(?P<Time>\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\]'

In [6]: m = re.search(r, line)

In [7]: m.group('Time')
Out[7]: '13/Nov/2019:14:43:30'

You can grab multiple elements, as has been done here: the IP, user, time, and request:

In [8]: r = r'(?P<IP>\d+\.\d+\.\d+\.\d+)'

In [9]: r += r' - (?P<User>\w+) '

In [10]: r += r'\[(?P<Time>\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\]'

In [11]: r += r' (?P<Request>".+")'

In [12]: m = re.search(r, line)

In [13]: m.group('IP')
Out[13]: '127.0.0.1'

In [14]: m.group('User')
Out[14]: 'rj'

In [15]: m.group('Time')
Out[15]: '13/Nov/2019:14:43:30'

In [16]: m.group('Request')
Out[16]: '"GET HTTP/1.0"'

Parsing a single line of a log is interesting but not terribly useful. However, you can use this regular expression as a basis for designing one to pull information from the whole log. Let’s say you want to pull all of the IP addresses for GET requests that happened on November 8, 2019. Using the preceding expression, you make modifications based on the specifics of your request:

In [62]: r = r'(?P<IP>\d+\.\d+\.\d+\.\d+)'

In [63]: r += r'- (?P<User>\w+)'

In [64]: r += r'\[(?P<Time>08/Nov/\d{4}:\d{2}:\d{2}:\d{2} [-+]\d{4})\]'

In [65]: r += r' (?P<Request>"GET .+")'

Use the finditer method to process the log, printing the IP addresses of the matching lines:

In [66]: matched = re.finditer(r, access_log)

In [67]: for m in matched:
    ...:     print(m.group('IP'))
    ...:
127.0.0.1
342.3.2.33

There is a lot that you can do with regular expressions and texts of all sorts. If they do not daunt you, you will find them one of the most powerful tools in dealing with text.

Dealing with Large Files

There are times that you need to process very large files. If the files contain data that can be processed one line at a time, the task is easy with Python. Rather than loading the whole file into memory as you have done up until now, you can read one line at a time, process the line, and then move to the next. The lines are removed from memory automatically by Python’s garbage collector, freeing up memory.

Note

Python automatically allocates and frees memory. Garbage collection is one means of doing this. The Python garbage collector can be controlled using the gc package, though this is rarely needed.

The fact that operating systems use alternate line endings can be a hassle when reading a file created on a different OS. Windows-created files have \r characters in addition to \n. These show up as part of the text on a Linux-based system. If you have a large file and you want to correct the line endings to fit your current OS, you can open the file, read one line at a time, and save it to a new file. Python handles the line-ending translation for you:

In [23]: with open('big-data.txt', 'r') as source_file:
    ...:     with open('big-data-corrected.txt', 'w') as target_file:
    ...:         for line in source_file:
    ...:             target_file.write(line)
    ...:

Notice that you can nest the with statements to open two files at once and loop through the source file object one line at a time. You can define a generator function to handle this, especially if you need to parse multiple files a single line at a time:

In [46]: def line_reader(file_path):
    ...:     with open(file_path, 'r') as source_file:
    ...:         for line in source_file:
    ...:             yield line
    ...:

In [47]: reader = line_reader('big-data.txt')

In [48]: with open('big-data-corrected.txt', 'w') as target_file:
    ...:     for line in reader:
    ...:         target_file.write(line)
    ...:

If you do not or cannot use line endings as a means of breaking up your data, as in the case of a large binary file, you can read your data in chunks. You pass the number of bytes to read in each chunk to the file object's read method. When there is nothing left to read, the call returns an empty byte string:

In [27]: with open('bb141548a754113e.jpg', 'rb') as source_file:
    ...:     while True:
    ...:         chunk = source_file.read(1024)
    ...:         if chunk:
    ...:             process_data(chunk)
    ...:         else:
    ...:             break
    ...:
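Here is a runnable variant of that loop with a concrete stand-in for process_data (the file name and data are made up for illustration):

```python
# Create some sample binary data to read back in chunks.
with open('blob.bin', 'wb') as f:
    f.write(bytes(range(256)) * 10)  # 2,560 bytes

total_bytes = 0
with open('blob.bin', 'rb') as source_file:
    while True:
        chunk = source_file.read(1024)
        if not chunk:                  # read returns b'' at end of file
            break
        total_bytes += len(chunk)      # stand-in for process_data(chunk)

print(total_bytes)  # 2560
```

Only one 1,024-byte chunk is ever held in memory, no matter how large the file grows.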

Encrypting Text

There are many times you need to encrypt text to ensure security. In addition to Python’s built-in package hashlib, there is a widely used third-party package called cryptography. Let’s take a look at both.

Hashing with Hashlib

To be secure, user passwords must be stored encrypted. A common way to handle this is to use a one-way function to encrypt the password into a bit string, which is very hard to reverse engineer. Functions that do this are called hash functions. In addition to obscuring passwords, hash functions ensure that documents sent over the web are unchanged during transmission. You run the hash function on the document and send the result along with the document. The recipient can then confirm that the value is the same when they hash the document. The hashlib package includes secure algorithms for doing this, including SHA1, SHA224, SHA384, SHA512, and RSA's MD5. This is how you would hash a password using the MD5 algorithm:

In [62]: import hashlib

In [63]: secret = "This is the password or document text"

In [64]: bsecret = secret.encode()

In [65]: m = hashlib.md5()

In [66]: m.update(bsecret)

In [67]: m.digest()
Out[67]: b' \xf5\x06\xe6\xfc\x1c\xbe\x86\xddj\x96C\x10\x0f5E'

Notice that if your password or document is a string, you need to turn it into a binary string by using the encode method.
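MD5 is no longer considered collision-resistant, so for new work one of the SHA algorithms listed above is a better choice. The hashlib interface is identical; here is a sketch using SHA-256 and the handier hexdigest method:

```python
import hashlib

secret = "This is the password or document text"

m = hashlib.sha256()
m.update(secret.encode())

# hexdigest returns the hash as a printable hex string, convenient for
# storing or comparing; SHA-256 digests are 32 bytes (64 hex characters).
digest = m.hexdigest()
print(len(digest))  # 64

# The same input always hashes to the same value.
print(digest == hashlib.sha256(secret.encode()).hexdigest())  # True
```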

Encryption with Cryptography

The cryptography library is a popular choice for handling encryption problems in Python. It is a third-party package, so you must install it with pip. Symmetric key encryption is a group of encryption algorithms based on shared keys. These algorithms include Advanced Encryption Standard (AES), Blowfish, Data Encryption Standard (DES), Serpent, and Twofish. A shared key is similar to a password that is used to both encrypt and decrypt text. The fact that both the creator and the reader of an encrypted file need to share the key is a drawback when compared to asymmetric key encryption, which we will touch on later. However, symmetric key encryption is faster and more straightforward, and so is appropriate for encrypting large files. Fernet is an implementation of the popular AES algorithm. You first need to generate a key:

In [1]: from cryptography.fernet import Fernet

In [2]: key = Fernet.generate_key()

In [3]: key
Out[3]: b'q-fEOs2JIRINDR8toMG7zhQvVhvf5BRPx3mj5Atk5B8='

You need to store this key securely, as you need it to decrypt. Keep in mind that anyone who has access to it is also able to decrypt your files. If you choose to save the key to a file, use the binary data type. The next step is to encrypt the data using the Fernet object:

In [4]: f = Fernet(key)

In [5]: message = b"Secrets go here"

In [6]: encrypted = f.encrypt(message)

In [7]: encrypted
Out[7]: b'gAAAAABdPyg4 ... plhkpVkC8ezOHaOLIA=='

You can decrypt the data using a Fernet object created with the same key:

In [1]: f = Fernet(key)

In [2]: f.decrypt(encrypted)
Out[2]: b'Secrets go here'

Asymmetric key encryption uses a pair of keys, one public and one private. The public key is designed to be widely shared, while a single user holds the private one. The only way you can decrypt messages that have been encrypted using your public key is by using your private key. This style of encryption is widely used to pass information confidentially both on local networks and across the internet. One very popular asymmetric key algorithm is Rivest-Shamir-Adleman (RSA), which is widely used for communication across networks. The cryptography library offers the ability to create public/private key pairs:

In [1]: from cryptography.hazmat.backends import default_backend

In [2]: from cryptography.hazmat.primitives.asymmetric import rsa

In [3]: private_key = rsa.generate_private_key(public_exponent=65537,
   ...:                                        key_size=4096,
   ...:                                        backend=default_backend())

In [4]: private_key
Out[4]: <cryptography.hazmat.backends.openssl.rsa._RSAPrivateKey at 0x10d377c18>

In [5]: public_key = private_key.public_key

In [6]: public_key = private_key.public_key()

In [7]: public_key
Out[7]: <cryptography.hazmat.backends.openssl.rsa._RSAPublicKey at 0x10da642b0>

You can then use the public key to encrypt:

In [8]: message = b"More secrets go here"

In [9]: from cryptography.hazmat.primitives.asymmetric import padding

In [11]: from cryptography.hazmat.primitives import hashes

In [12]: encrypted = public_key.encrypt(message,
    ...:     padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
    ...:                  algorithm=hashes.SHA256(),
    ...:                  label=None))

You can use the private key to decrypt messages:

In [13]: decrypted = private_key.decrypt(encrypted,
    ...:     padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
    ...:                  algorithm=hashes.SHA256(),
    ...:                  label=None))

In [14]: decrypted
Out[14]: b'More secrets go here'

The os Module

The os module is one of the most used modules in Python. This module handles many low-level operating system calls and attempts to offer a consistent interface across multiple operating systems, which is important if you think your application might run on both Windows and Unix-based systems. It does offer some operating-system-specific features (os.O_TEXT for Windows and os.O_CLOEXEC on Linux) that are not available across platforms. Use these only if you are confident that your application does not need to be portable across operating systems. Example 2-1 shows some of the most useful additional methods of the os module.

Example 2-1. More os methods

In [1]: os.listdir('.')
Out[1]: ['__init__.py', 'os_path_example.py']

In [2]: os.rename('_crud_handler', 'crud_handler')

In [3]: os.chmod('my_script.py', 0o777)

In [4]: os.mkdir('/tmp/holding')

In [5]: os.makedirs('/Users/kbehrman/tmp/scripts/devops')

In [6]: os.remove('my_script.py')

In [7]: os.rmdir('/tmp/holding')

In [8]: os.removedirs('/Users/kbehrman/tmp/scripts/devops')

In [9]: os.stat('crud_handler')
Out[9]: os.stat_result(st_mode=16877, st_ino=4359290300, st_dev=16777220,
                       st_nlink=18, st_uid=501, st_gid=20, st_size=576,
                       st_atime=1544115987, st_mtime=1541955837,
                       st_ctime=1567266289)

List the contents of a directory.

Rename a file or directory.

Change the permission settings of a file or directory.

Create a directory.

Recursively create a directory path.

Delete a file.

Delete a single directory.

Delete a tree of directories, starting with the leaf directory and working up the tree. The operation stops with the first nonempty directory.

Get stats about the file or directory. These stats include st_mode, the file type and permissions, and st_atime, the time the item was last accessed.

Managing Files and Directories Using os.path

In Python, you can use strings (binary or otherwise) to represent paths. The os.path module offers a plethora of path-related methods for creating and manipulating paths as strings. As previously mentioned, the os module tries to offer cross-platform behaviors, and the os.path submodule is no exception. This module interprets paths based on the current operating system, using forward slashes to separate directories in Unix-like systems and backward slashes in Windows. Your program can construct paths on the fly that work on the current system, whichever it is. The ability to easily split and join paths is probably the most used functionality of os.path. The three methods used to split paths are split, basename, and dirname:

In [1]: import os

In [2]: cur_dir = os.getcwd()

In [3]: cur_dir
Out[3]: '/Users/kbehrman/Google-Drive/projects/python-devops/samples/chapter4'

In [4]: os.path.split(cur_dir)
Out[4]: ('/Users/kbehrman/Google-Drive/projects/python-devops/samples',
 'chapter4')

In [5]: os.path.dirname(cur_dir)
Out[5]: '/Users/kbehrman/Google-Drive/projects/python-devops/samples'

In [6]: os.path.basename(cur_dir)
Out[6]: 'chapter4'

Get the current working directory.

os.path.split splits the leaf level of the path from the parent path.

os.path.dirname returns the parent path.

os.path.basename returns the leaf name.

You can easily use os.path.dirname to walk up a directory tree:

In [7]: while os.path.basename(cur_dir):
   ...:     cur_dir = os.path.dirname(cur_dir)
   ...:     print(cur_dir)
   ...:
/Users/kbehrman/projects/python-devops/samples
/Users/kbehrman/projects/python-devops
/Users/kbehrman/projects
/Users/kbehrman
/Users
/

Using files to configure an application at runtime is a common practice; in Unix-like systems these files are, by convention, dotfiles whose names end with rc. Vim's .vimrc file and the Bash shell's .bashrc are two common examples. You can store these files in different locations. Often programs define a hierarchy of locations to check. For example, your tool might look first for an environment variable that defines which rc file to use, and in its absence check the working directory, and then the user's home directory. In Example 2-2 we try to locate an rc file in these locations. We use the __file__ variable that Python automatically sets when Python code runs from a file. This variable is populated with a path relative to the current working directory, not an absolute or full path. Python does not automatically expand paths, as is common in Unix-like systems, so we must expand this path before we use it to construct the path to check for our rc file. Similarly, Python does not automatically expand environment variables in paths, so we must expand these explicitly.

Example 2-2. find_rc method

import os

def find_rc(rc_name=".examplerc"):

    # Check for Env variable
    var_name = "EXAMPLERC_DIR"
    if var_name in os.environ:
        var_path = os.path.join(f"${var_name}", rc_name)
        config_path = os.path.expandvars(var_path)
        print(f"Checking {config_path}")
        if os.path.exists(config_path):
            return config_path

    # Check the current working directory
    config_path = os.path.join(os.getcwd(), rc_name)
    print(f"Checking {config_path}")
    if os.path.exists(config_path):
        return config_path

    # Check user home directory
    home_dir = os.path.expanduser("~/")
    config_path = os.path.join(home_dir, rc_name)
    print(f"Checking {config_path}")
    if os.path.exists(config_path):
        return config_path

    # Check Directory of This File
    file_path = os.path.abspath(__file__)
    parent_path = os.path.dirname(file_path)
    config_path = os.path.join(parent_path, rc_name)
    print(f"Checking {config_path}")
    if os.path.exists(config_path):
        return config_path

    print(f"File {rc_name} has not been found")

Check whether the environment variable exists in the current environment.

Use join to construct a path with the environment variable name. This will look something like $EXAMPLERC_DIR/.examplerc.

Expand the environment variable to insert its value into the path.

Check to see if the file exists.

Construct a path using the current working directory.

Use the expanduser function to get the path to the user’s home directory.

Expand the relative path stored in __file__ to an absolute path.

Use dirname to get the path to the directory holding the current file.

The os.path submodule also offers ways to interrogate stats about a path. You can determine whether a path is a file, a directory, a link, or a mount. You can get stats such as its size or its time of last access or modification. In Example 2-3 we use os.path to walk down a directory tree and report on the size and last access time of all files therein.

Example 2-3. os_path_walk.py

#!/usr/bin/env python

import fire
import os


def walk_path(parent_path):
    print(f"Checking: {parent_path}")
    childs = os.listdir(parent_path)
    for child in childs:
        child_path = os.path.join(parent_path, child)
        if os.path.isfile(child_path):
            last_access = os.path.getatime(child_path)
            size = os.path.getsize(child_path)
            print(f"File: {child_path}")
            print(f"\tlast accessed: {last_access}")
            print(f"\tsize: {size}")
        elif os.path.isdir(child_path):
            walk_path(child_path)


if __name__ == '__main__':
    fire.Fire()

os.listdir returns the contents of a directory.

Construct the full path of an item in the parent directory.

Check to see if the path represents a file.

Get the last time the file was accessed.

Get the size of the file.

Check if the path represents a directory.

Check the tree from this directory down.

You could use a script like this to identify large files or files that have not been accessed and then report, move, or delete them.
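For instance, a report-then-act tool could collect the stale candidates first. The sketch below builds on the recursive os.path approach of Example 2-3; the function name find_stale_files and its days/min_size thresholds are illustrative, not part of the original example.

```python
import os
import time


def find_stale_files(parent_path, days=30, min_size=0):
    """Collect files not accessed within `days` days and at least
    `min_size` bytes. Hypothetical helper; thresholds are examples."""
    cutoff = time.time() - days * 24 * 60 * 60
    stale = []
    for child in os.listdir(parent_path):
        child_path = os.path.join(parent_path, child)
        if os.path.isfile(child_path):
            old = os.path.getatime(child_path) < cutoff
            big_enough = os.path.getsize(child_path) >= min_size
            if old and big_enough:
                stale.append(child_path)
        elif os.path.isdir(child_path):
            # Recurse into subdirectories, same as Example 2-3
            stale.extend(find_stale_files(child_path, days, min_size))
    return stale
```

Returning a list (instead of printing) lets the caller decide whether to report, move, or delete the matches.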

Walking Directory Trees Using os.walk

The os module offers a convenience function for walking directory trees called os.walk. This function returns a generator that in turn returns a tuple for each iteration. The tuple consists of the current path, a list of directories, and a list of files. In Example 2-4 we rewrite our walk_path function from Example 2-3 to use os.walk. As you can see in this example, with os.walk you don’t need to test which paths are files or recall the function with every subdirectory.

Example 2-4. Rewrite walk_path

def walk_path(parent_path):
    for parent_path, directories, files in os.walk(parent_path):
        print(f"Checking: {parent_path}")
        for file_name in files:
            file_path = os.path.join(parent_path, file_name)
            last_access = os.path.getatime(file_path)
            size = os.path.getsize(file_path)
            print(f"File: {file_path}")
            print(f"\tlast accessed: {last_access}")
            print(f"\tsize: {size}")

Paths as Objects with Pathlib

The pathlib library represents paths as objects rather than strings. In Example 2-5 we rewrite Example 2-2 using pathlib rather than os.path.

Example 2-5. rewrite find_rc

import os
import pathlib

def find_rc(rc_name=".examplerc"):

    # Check for Env variable
    var_name = "EXAMPLERC_DIR"
    example_dir = os.environ.get(var_name)
    if example_dir:
        dir_path = pathlib.Path(example_dir)
        config_path = dir_path / rc_name
        print(f"Checking {config_path}")
        if config_path.exists():
            return config_path.as_posix()

    # Check the current working directory
    config_path = pathlib.Path.cwd() / rc_name
    print(f"Checking {config_path}")
    if config_path.exists():
        return config_path.as_posix()

    # Check user home directory
    config_path = pathlib.Path.home() / rc_name
    print(f"Checking {config_path}")
    if config_path.exists():
        return config_path.as_posix()

    # Check Directory of This File
    file_path = pathlib.Path(__file__).resolve()
    parent_path = file_path.parent
    config_path = parent_path / rc_name
    print(f"Checking {config_path}")
    if config_path.exists():
        return config_path.as_posix()

    print(f"File {rc_name} has not been found")

As of this writing, pathlib does not expand environment variables. Instead you grab the value of the variable from os.environ.

This creates a pathlib.Path object appropriate for the currently running operating system.

You can construct new pathlib.Path objects by following a parent path with forward slashes and strings.

The pathlib.Path object itself has an exists method.

Call as_posix to return the path as a string. Depending on your use case, you can return the pathlib.Path object itself.

The class method pathlib.Path.cwd returns a pathlib.Path object for the current working directory. This object is used immediately here to create the config_path by joining it with the string rc_name.

The class method pathlib.Path.home returns a pathlib.Path object for the current user’s home directory.

Create a pathlib.Path object using the relative path stored in __file__ and then call its resolve method to get the absolute path.

This returns a parent pathlib.Path object directly from the object itself.
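Example 2-3 can also be sketched with pathlib. The rglob method recurses through a directory tree for us, so neither explicit recursion nor os.walk is needed. This is an alternative shape under pathlib's API, not a version from the book:

```python
import pathlib


def walk_path(parent_path):
    """Report size and last access time for every file under
    parent_path, using pathlib.Path.rglob to recurse."""
    for child in pathlib.Path(parent_path).rglob('*'):
        if child.is_file():
            stats = child.stat()  # one stat call gives size and times
            print(f"File: {child}")
            print(f"\tlast accessed: {stats.st_atime}")
            print(f"\tsize: {stats.st_size}")
```

A single stat() call per file returns the same os.stat_result seen in Example 2-1, so st_size and st_atime come from one system call rather than separate getsize/getatime lookups.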

Chapter 3. Working with the Command Line

The command line is where the rubber hits the road. Although there are many powerful tools with graphical interfaces, the command line is still home for DevOps work. Interacting with your shell environment from within Python and creating Python command-line tools are both necessary when using Python for DevOps.

Working with the Shell

Python offers tools for interacting with systems and shells. You should become familiar with the sys, os, and subprocess modules, as all are essential tools.

Talking to the Interpreter with the sys Module

The sys module offers access to variables and methods closely tied to the Python interpreter.

Note

There are two dominant ways to interpret bytes during reading. The first, little endian, interprets each subsequent byte as having higher significance (representing a larger digit). The other, big endian, assumes the first byte has the greatest significance and moves down from there.

You can use the sys.byteorder attribute to see the byte order of your current architecture:

In [1]: import sys

In [2]: sys.byteorder
Out[2]: 'little'
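The struct module makes the byte orders concrete. This short sketch (my illustration, not from the book) packs the integer 1 with explicit little- and big-endian layouts and compares them with the native order that sys.byteorder reports:

```python
import struct
import sys

# '<' is little-endian, '>' is big-endian, '=' is the native order;
# 'I' packs a 4-byte unsigned int.
little = struct.pack('<I', 1)   # b'\x01\x00\x00\x00'
big = struct.pack('>I', 1)      # b'\x00\x00\x00\x01'
native = struct.pack('=I', 1)

expected = little if sys.byteorder == 'little' else big
print(native == expected)  # True on any architecture
```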

You can use sys.getsizeof to see the size of Python objects. This is useful if you are dealing with limited memory:

In [3]: sys.getsizeof(1)
Out[3]: 28

If you want to perform different behaviors, depending on the underlying operating system, you can use sys.platform to check:

In [5]: sys.platform
Out[5]: 'darwin'

A more common situation is that you want to use a language feature or module that is only available in specific versions of Python. You can use sys.version_info to control behavior based on the running Python interpreter. Here we print different messages for Python 3.7, for a Python 3 version below 3.7, and for Python versions lower than 3:

if sys.version_info.major < 3:
    print("You need to update your Python version")
elif sys.version_info.minor < 7:
    print("You are not running the latest version of Python")
else:
    print("All is good.")

We cover more sys usage later in this chapter when we write command-line tools.

Dealing with the Operating System Using the os Module

You have seen the os module used in Chapter 2 for dealing with the filesystem. It also has a grab bag of various attributes and functions related to dealing with the operating system. In Example 3-1 we demonstrate some of them.

Example 3-1. os module examples

In [1]: import os

In [2]: os.getcwd()
Out[2]: '/Users/kbehrman/Google-Drive/projects/python-devops'

In [3]: os.chdir('/tmp')

In [4]: os.getcwd()
Out[4]: '/private/tmp'

In [5]: os.environ.get('LOGLEVEL')

In [6]: os.environ['LOGLEVEL'] = 'DEBUG'

In [7]: os.environ.get('LOGLEVEL')
Out[7]: 'DEBUG'

In [8]: os.getlogin()
Out[8]: 'kbehrman'

Get the current working directory.

Change the current working directory.

The os.environ holds the environment variables that were set when the os module was loaded.

This sets an environment variable. The setting exists for subprocesses spawned from this code.

This is the login of the user in the terminal that spawned this process.

The most common usage of the os module is to get settings from environment variables. These could be the level to set your logging, or secrets such as API keys.
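A minimal sketch of that pattern: read a log level from the environment with a sensible default. LOGLEVEL is a hypothetical variable name your tool might document, not a standard; the fallback keeps the program working when it is unset.

```python
import logging
import os

# LOGLEVEL is an example variable name; fall back to INFO when unset
# or when the value does not name a real logging level.
level_name = os.environ.get('LOGLEVEL', 'INFO')
level = getattr(logging, level_name.upper(), logging.INFO)

logging.basicConfig(level=level)
logging.log(level, 'logging configured at %s', level_name)
```

For secrets such as API keys, prefer os.environ.get over indexing so a missing variable yields None (or your default) instead of a KeyError at import time.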

Spawn Processes with the subprocess Module

There are many instances when you need to run applications outside of Python from within your Python code. These could be built-in shell commands, Bash scripts, or any other command-line applications. To do this, you spawn a new process (instance of the application). The subprocess module is the right choice when you want to spawn a process and run commands within it. With subprocess, you can run your favorite shell command or other command-line software and collect its output from within Python. For the majority of use cases, you should use the subprocess.run function to spawn processes:

In [1]: cp = subprocess.run(['ls','-l'],
                            capture_output=True,
                            universal_newlines=True)

In [2]: cp.stdout
Out[2]: 'total 96
-rw-r--r--  1 kbehrman  staff     0 Apr 12 08:48 __init__.py
drwxr-xr-x  5 kbehrman  staff   160 Aug 18 15:47 __pycache__
-rw-r--r--  1 kbehrman  staff   123 Aug 13 12:13 always_say_it.py
-rwxr-xr-x  1 kbehrman  staff  1409 Aug  8 15:36 argparse_example.py
-rwxr-xr-x  1 kbehrman  staff   734 Aug 12 09:36 click_example.py
-rwxr-xr-x  1 kbehrman  staff   538 Aug 13 10:41 fire_example.py
-rw-r--r--  1 kbehrman  staff    41 Aug 18 15:17 foo_plugin_a.py
-rw-r--r--  1 kbehrman  staff    41 Aug 18 15:47 foo_plugin_b.py
-rwxr-xr-x  1 kbehrman  staff   335 Aug 10 12:36 simple_click.py
-rwxr-xr-x  1 kbehrman  staff   256 Aug 13 09:21 simple_fire.py
-rwxr-xr-x  1 kbehrman  staff   509 Aug  8 10:27 simple_parse.py
-rwxr-xr-x  1 kbehrman  staff   502 Aug 18 15:11 simple_plugins.py
-rwxr-xr-x  1 kbehrman  staff   850 Aug  6 14:44 sys_argv.py
-rw-r--r--  1 kbehrman  staff   182 Aug 18 16:24 sys_example.py
'

The subprocess.run function returns a CompletedProcess instance once the process completes. In this case, we run the shell command ls with the argument -l to see the contents of the current directory. We set it to capture stdout and stderr with the capture_output parameter. We then access the results using cp.stdout. If we run our ls command on a nonexistent directory, causing it to return an error, we can see the output in cp.stderr:

In [3]: cp = subprocess.run(['ls','/doesnotexist'],
                            capture_output=True,
                            universal_newlines=True)

In [4]: cp.stderr
Out[4]: 'ls: /doesnotexist: No such file or directory\n'

You can better integrate the handling of errors by using the check parameter. This raises an exception if the subprocess reports an error:

In [23]: cp = subprocess.run(['ls', '/doesnotexist'],
                             capture_output=True,
                             universal_newlines=True,
                             check=True)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-23-c0ac49c40fee> in <module>
----> 1 cp = subprocess.run(['ls', '/doesnotexist'],
                            capture_output=True,
                            universal_newlines=True,
                            check=True)

~/.pyenv/versions/3.7.0/lib/python3.7/subprocess.py ...
    466         if check and retcode:
    467             raise CalledProcessError(retcode, process.args,
--> 468                                      output=stdout, stderr=stderr)
    469     return CompletedProcess(process.args, retcode, stdout, stderr)
    470

CalledProcessError: Command '['ls', '/doesnotexist']' returned non-zero exit

In this way, you don’t have to check stderr for failures. You can treat errors from your subprocess much as you would other Python exceptions.
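A minimal sketch of that style of handling: let check=True raise, and catch CalledProcessError like any other exception. The exception object carries the command, return code, and (because of capture_output) the captured stderr.

```python
import subprocess

try:
    subprocess.run(['ls', '/doesnotexist'],
                   capture_output=True,
                   universal_newlines=True,
                   check=True)
except subprocess.CalledProcessError as err:
    # err.cmd, err.returncode, and err.stderr describe the failure
    print(f"Command {err.cmd} failed with code {err.returncode}")
    print(f"stderr: {err.stderr.strip()}")
```

This keeps the happy path free of return-code checks and concentrates failure handling in one place.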

Creating Command-Line Tools

The simplest way to invoke a Python script on the command line is to invoke it using Python. When you construct a Python script, any statements at the top level (not nested in code blocks) run whenever the script is invoked or imported. If you have a function you want to run whenever your code is loaded, you can invoke it at the top level:

def say_it():
    greeting = 'Hello'
    target = 'Joe'
    message = f'{greeting} {target}'
    print(message)

say_it()

This function runs whenever the script runs on the command line:

$ python always_say_it.py
Hello Joe

Also, when the file is imported:

In [1]: import always_say_it
Hello Joe

This should only be done with the most straightforward scripts, however. A significant downside to this approach is that if you want to import your module into other Python modules, the code runs during import instead of waiting to be invoked by the calling module. Someone who is importing your module usually wants control over when its contents are invoked. You can add functionality that only happens when called from the command line by using the global __name__ variable. You have seen that this variable reports the name of the module during import. If the module is called directly on the command line, Python sets it to the string '__main__'. The convention for modules running on the command line is to end with a block testing for this and to run command-line-specific code from that block. To modify the script to run a function automatically only when invoked on the command line, but not during import, put the function invocation into the block after the test:

def say_it():
    greeting = 'Hello'
    target = 'Joe'
    message = f'{greeting} {target}'
    print(message)

if __name__ == '__main__':
    say_it()

When you import this function, this block does not run, as the __name__ variable reflects the module path as imported. It runs when the module is run directly, however:

$ python say_it.py
Hello Joe

Making Your Shell Script Executable

To eliminate the need to explicitly type python on the command line when you run your script, you can add the line #!/usr/bin/env python to the top of your file:

#!/usr/bin/env python

def say_it():
    greeting = 'Hello'
    target = 'Joe'
    message = f'{greeting} {target}'
    print(message)

if __name__ == '__main__':
    say_it()

Then make the file executable using chmod (a command-line tool for setting permissions):

$ chmod +x say_it.py

You can then call it in a shell without directly invoking Python:

$ ./say_it.py
Hello Joe

The first step in creating command-line tools is separating code that should only run when invoked on the command line. The next step is to accept command-line arguments. Unless your tool only does one thing, you need to accept commands to know what to do. Also, command-line tools that do more than the simplest tasks accept optional flags to configure their workings. Remember that these commands and flags are the user interface (UI) for anyone using your tools. You need to consider how easy they are to use and understand. Providing documentation is an essential part of making your code understandable.

Using sys.argv

The simplest and most basic way to process arguments from the command line is to use the argv attribute of the sys module. This attribute is a list of arguments passed to a Python script at runtime. If the script runs on the command line, the first argument is the name of the script. The rest of the items in the list are any remaining command-line arguments, represented as strings:

#!/usr/bin/env python
"""
Simple command-line tool using sys.argv
"""
import sys

if __name__ == '__main__':
    print(f"The first argument:  '{sys.argv[0]}'")
    print(f"The second argument: '{sys.argv[1]}'")
    print(f"The third argument:  '{sys.argv[2]}'")
    print(f"The fourth argument: '{sys.argv[3]}'")

Run it on the command line and see the arguments:

$ ./sys_argv.py --a-flag some-value 13
The first argument:  './sys_argv.py'
The second argument: '--a-flag'
The third argument:  'some-value'
The fourth argument: '13'

You can use these arguments to write your own argument parser. To see what this might look like, check out Example 3-2.

Example 3-2. Parsing with sys.argv

#!/usr/bin/env python
"""
Simple command-line tool using sys.argv
"""
import sys

def say_it(greeting, target):
    message = f'{greeting} {target}'
    print(message)

if __name__ == '__main__':
    greeting = 'Hello'
    name = 'Joe'

    if '--help' in sys.argv:
        help_message = f"Usage: {sys.argv[0]} --name <NAME> --greeting <GREETING>"
        print(help_message)
        sys.exit()

    if '--name' in sys.argv:
        # Get position after name flag
        name_index = sys.argv.index('--name') + 1
        if name_index < len(sys.argv):
            name = sys.argv[name_index]

    if '--greeting' in sys.argv:
        # Get position after greeting flag
        greeting_index = sys.argv.index('--greeting') + 1
        if greeting_index < len(sys.argv):
            greeting = sys.argv[greeting_index]

    say_it(greeting, name)

Here we test to see if we are running from the command line.

Default values are set in these two lines.

Check if the string --help is in the list of arguments.

Exit the program after printing the help message.

We need the position of the value after the flag, which should be the associated value.

Test that the arguments list is long enough. It will not be if the flag was provided without a value.

Call the function with the values as modified by the arguments.

Example 3-2 goes far enough to print out a simple help message and accept arguments to the function:

$ ./sys_argv.py --help
Usage: ./sys_argv.py --name <NAME> --greeting <GREETING>

$ ./sys_argv.py --name Sally --greeting Bonjour
Bonjour Sally

This approach is fraught with complication and potential bugs. Example 3-2 fails to handle many situations. If a user misspells or miscapitalizes a flag, the flag is ignored with no useful feedback. If they use commands that are not supported or try to use more than one value with a flag, once again the error is ignored. You should be aware of the argv parsing approach, but do not use it for any production code unless you specifically set out to write an argument parser. Luckily there are modules and packages designed for the creation of command-line tools. These packages provide frameworks to design the user interface for your module when running in a shell. Three popular solutions are argparse, click, and python-fire. All three include ways to design required arguments, optional flags, and means to display help documentation. The first, argparse, is part of the Python standard library, and the other two are third-party packages that need to be installed separately (using pip).

Using argparse

argparse abstracts away many of the details of parsing arguments. With it, you design your command-line user interface in detail, defining commands and flags along with their help messages. It uses the idea of parser objects, to which you attach commands and flags. The parser then parses the arguments, and you use the results to call your code. You construct your interface using ArgumentParser objects that parse user input for you:

if __name__ == '__main__': parser = argparse.ArgumentParser(description='Maritime control')

You add position-based commands or optional flags to the parser using the add_argument method (see Example 3-3). The first argument to this method is the name of the new argument (command or flag). If the name begins with a dash, it is treated as an optional flag argument; otherwise it is treated as a position-dependent command. The parser creates a parsed-arguments object, with the arguments as attributes that you can then use to access input. Example 3-3 is a simple program that echoes a user's input and shows the basics of how argparse works.

Example 3-3. simple_parse.py

#!/usr/bin/env python
"""
Command-line tool using argparse
"""
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Echo your input')
    parser.add_argument('message',
                        help='Message to echo')
    parser.add_argument('--twice', '-t',
                        help='Do it twice',
                        action='store_true')
    args = parser.parse_args()

    print(args.message)
    if args.twice:
        print(args.message)

Create the parser object, with its documentation message.

Add a position-based command with its help message.

Add an optional argument.

Store the optional argument as a boolean value.

Use the parser to parse the arguments.

Access the argument values by name. The optional argument's name has the -- removed.

When you run it with the --twice flag, the input message prints twice:

$ ./simple_parse.py hello --twice
hello
hello

argparse automatically sets up help and usage messages based on the help and description text you supply:

$ ./simple_parse.py --help
usage: simple_parse.py [-h] [--twice] message

Echo your input

positional arguments:
  message      Message to echo

optional arguments:
  -h, --help   show this help message and exit
  --twice, -t  Do it twice

Many command-line tools use nested levels of commands to group command areas of control. Think of git. It has top-level commands, such as git stash, which have separate commands under them, such as git stash pop. With argparse, you create subcommands by creating subparsers under your main parser. You can create a hierarchy of commands using subparsers. In Example 3-4, we implement a maritime application that has commands for ships and sailors. Two subparsers are added to the main parser; each subparser has its own commands.

Example 3-4. argparse_example.py

#!/usr/bin/env python
"""
Command-line tool using argparse
"""
import argparse

def sail():
    ship_name = 'Your ship'
    print(f"{ship_name} is setting sail")

def list_ships():
    ships = ['John B', 'Yankee Clipper', 'Pequod']
    print(f"Ships: {','.join(ships)}")

def greet(greeting, name):
    message = f'{greeting} {name}'
    print(message)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Maritime control')
    parser.add_argument('--twice', '-t',
                        help='Do it twice',
                        action='store_true')
    subparsers = parser.add_subparsers(dest='func')
    ship_parser = subparsers.add_parser('ships',
                                        help='Ship related commands')
    ship_parser.add_argument('command',
                             choices=['list', 'sail'])
    sailor_parser = subparsers.add_parser('sailors',
                                          help='Talk to a sailor')
    sailor_parser.add_argument('name',
                               help='Sailors name')
    sailor_parser.add_argument('--greeting', '-g',
                               help='Greeting',
                               default='Ahoy there')
    args = parser.parse_args()
    if args.func == 'sailors':
        greet(args.greeting, args.name)
    elif args.command == 'list':
        list_ships()
    else:
        sail()

Create the top-level parser.

Add a top-level argument that can be used along with any command under this parser’s hierarchy.

Create a subparser object to hold the subparsers. The dest is the name of the attribute used to choose a subparser.

Add a subparser for ships.

Add a command to the ships subparser. The choices parameter gives a list of possible choices for the command.

Add a subparser for sailors.

Add a required positional argument to the sailors subparser.

Check which subparser is used by checking the func value.

Example 3-4 has one top-level optional argument (--twice) and two subparsers. Each subparser has its own commands and flags. argparse automatically creates a hierarchy of help messages and displays them with the --help flag. The top-level help message documents the subparsers and the top-level --twice argument:

$ ./argparse_example.py --help
usage: argparse_example.py [-h] [--twice] {ships,sailors} ...

Maritime control

positional arguments:
  {ships,sailors}
    ships          Ship related commands
    sailors        Talk to a sailor

optional arguments:
  -h, --help       show this help message and exit
  --twice, -t      Do it twice

You can dig into the subcommands (subparsers) by using the help flag after the command:

$ ./argparse_example.py ships --help
usage: argparse_example.py ships [-h] {list,sail}

positional arguments:
  {list,sail}

optional arguments:
  -h, --help  show this help message and exit

As you can see, argparse gives you a lot of control over your command-line interface. You can design a multilayered interface with built-in documentation with many options to fine-tune your design. Doing so takes a lot of work on your part, however, so let’s look at some easier options.

Using click

The click package was first developed to work with the Flask web framework. It uses Python function decorators to bind the command-line interface directly to your functions. Unlike argparse, click interweaves your interface decisions directly with the rest of your code.

Function Decorators

Python decorators are a special syntax for functions that take other functions as arguments. Python functions are objects, so any function can take a function as an argument. The decorator syntax provides a clean and easy way to do this. The basic format of a decorator is:

In [2]: def some_decorator(wrapped_function):
   ...:     def wrapper():
   ...:         print('Do something before calling wrapped function')
   ...:         wrapped_function()
   ...:         print('Do something after calling wrapped function')
   ...:     return wrapper
   ...:

You can define a function and pass it as an argument to this function:

In [3]: def foobat():
   ...:     print('foobat')
   ...:

In [4]: f = some_decorator(foobat)

In [5]: f()
Do something before calling wrapped function
foobat
Do something after calling wrapped function

The decorator syntax simplifies this by indicating which function should be wrapped by decorating it with @decorator_name. Here is an example using the decorator syntax with our some_decorator function:

In [6]: @some_decorator
   ...: def batfoo():
   ...:     print('batfoo')
   ...:

In [7]: batfoo()
Do something before calling wrapped function
batfoo
Do something after calling wrapped function

Now you call your wrapped function using its name rather than the decorator name. Prebuilt decorators are offered both as part of the Python standard library (staticmethod, classmethod) and by third-party packages, such as Flask and click.
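One practical wrinkle: a plain wrapper like the one above hides the wrapped function's name and docstring from introspection. The standard library's functools.wraps fixes this; the sketch below reworks the some_decorator example accordingly (the *args/**kwargs pass-through is my addition, so the wrapper works for functions that take arguments):

```python
import functools

def some_decorator(wrapped_function):
    # functools.wraps copies the wrapped function's __name__ and
    # __doc__ onto the wrapper, keeping introspection and help()
    # accurate for the decorated function.
    @functools.wraps(wrapped_function)
    def wrapper(*args, **kwargs):
        print('Do something before calling wrapped function')
        result = wrapped_function(*args, **kwargs)
        print('Do something after calling wrapped function')
        return result
    return wrapper

@some_decorator
def batfoo():
    """Print batfoo."""
    print('batfoo')

print(batfoo.__name__)  # batfoo, not wrapper
```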

This means that you tie your flags and options directly to the parameters of the functions that they expose. You can create a simple command-line tool from your functions using click’s command and option functions as decorators before your function:

#!/usr/bin/env python
"""
Simple Click example
"""
import click

@click.command()
@click.option('--greeting', default='Hiya', help='How do you want to greet?')
@click.option('--name', default='Tammy', help='Who do you want to greet?')
def greet(greeting, name):
    print(f"{greeting} {name}")

if __name__ == '__main__':
    greet()

click.command indicates that a function should be exposed to command-line access. click.option adds an argument to the command line, automatically linking it to the function parameter of the same name (--greeting to greeting and --name to name). click does some work behind the scenes so that we can call our greet method in our main block without the parameters covered by the option decorators.

These decorators handle parsing command-line arguments and automatically produce help messages:

$ ./simple_click.py --greeting Privet --name Peggy
Privet Peggy

$ ./simple_click.py --help
Usage: simple_click.py [OPTIONS]

Options:
  --greeting TEXT  How do you want to greet?
  --name TEXT      Who do you want to greet?
  --help           Show this message and exit.

You can see that with click you can expose your functions for command-line use with much less code than argparse. You can concentrate on the business logic of your code rather than designing the interface.

Now let’s look at a more complicated example with nested commands. Commands are nested by using click.group to create functions that represent the groups. In Example 3-5 we nest commands with click, using an interface that is very similar to the one from Example 3-4.

Example 3-5. click_example.py

#!/usr/bin/env python
"""
Command-line tool using click
"""
import click

@click.group()
def cli():
    pass

@click.group(help='Ship related commands')
def ships():
    pass

cli.add_command(ships)

@ships.command(help='Sail a ship')
def sail():
    ship_name = 'Your ship'
    print(f"{ship_name} is setting sail")

@ships.command(help='List all of the ships')
def list_ships():
    ships = ['John B', 'Yankee Clipper', 'Pequod']
    print(f"Ships: {','.join(ships)}")

@cli.command(help='Talk to a sailor')
@click.option('--greeting', default='Ahoy there', help='Greeting for sailor')
@click.argument('name')
def sailors(greeting, name):
    message = f'{greeting} {name}'
    print(message)

if __name__ == '__main__':
    cli()

Create a top-level group under which other groups and commands will reside.

Create a function to act as the top-level group. The click.group method transforms the function into a group.

Create a group to hold the ships commands.

Add the ships group as a command to the top-level group. Note that the cli function is now a group with an add_command method.

Add a command to the ships group. Notice that ships.command is used instead of click.command.

Add a command to the cli group.

Call the top-level group.

The top-level help messages generated by click look like this:

$ ./click_example.py --help
Usage: click_example.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  sailors  Talk to a sailor
  ships    Ship related commands

You can dig into the help for a subgroup like this:

$ ./click_example.py ships --help
Usage: click_example.py ships [OPTIONS] COMMAND [ARGS]...

  Ship related commands

Options:
  --help  Show this message and exit.

Commands:
  list-ships  List all of the ships
  sail        Sail a ship

If you compare Example 3-4 and Example 3-5, you will see some of the differences between using argparse and click. The click approach certainly requires less code, almost half in these examples. However, the user interface (UI) code is interspersed throughout the whole program, especially in the functions that exist solely to act as groups. If you have a complex program with a complex interface, you should try as best as possible to isolate different functionality. By doing so, you make individual pieces easier to test and debug. In such a case, you might choose argparse to keep your interface code separate.

Defining Classes

A class definition starts with the keyword class followed by the class name and parentheses:

In [1]: class MyClass():

Attributes and method definitions follow in the indented code block. All methods of a class receive as their first parameter a reference to the instance on which they are called. By convention this parameter is referred to as self:

In [1]: class MyClass():
   ...:     def some_method(self):
   ...:         print(f"Say hi to {self}")
   ...:

In [2]: myObject = MyClass()

In [3]: myObject.some_method()
Say hi to <__main__.MyClass object at 0x1056f4160>

Every class has an __init__ method. When the class is instantiated, this method is called. If you do not define this method, the class gets a default one, inherited from the Python base object class:

In [4]: MyClass.__init__
Out[4]: <slot wrapper '__init__' of 'object' objects>

Generally you define an object’s attributes in the __init__ method:

In [5]: class MyOtherClass():
   ...:     def __init__(self, name):
   ...:         self.name = name
   ...:

In [6]: myOtherObject = MyOtherClass('Sammy')

In [7]: myOtherObject.name
Out[7]: 'Sammy'
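Putting the two pieces together, attributes set in __init__ are available to every other method through self. A minimal sketch (the Greeter class and its default greeting are invented for illustration):

```python
class Greeter:
    def __init__(self, name):
        # Attributes assigned here live on the instance.
        self.name = name

    def greet(self, greeting="Ahoy there"):
        # Methods read instance state through self.
        return f"{greeting} {self.name}"

g = Greeter("Sammy")
print(g.greet())          # Ahoy there Sammy
print(g.greet("Hiya"))    # Hiya Sammy
```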

fire

Now, let’s take a step farther down the road of making a command-line tool with minimal UI code. The fire package uses introspection of your code to create interfaces automatically. If you have a simple function you want to expose, you call fire.Fire with it as an argument:

#!/usr/bin/env python
"""
Simple fire example
"""
import fire

def greet(greeting='Hiya', name='Tammy'):
    print(f"{greeting} {name}")

if __name__ == '__main__':
    fire.Fire(greet)

fire then creates the UI based on the method’s name and arguments:

$ ./simple_fire.py --help
NAME
    simple_fire.py

SYNOPSIS
    simple_fire.py <flags>

FLAGS
    --greeting=GREETING
    --name=NAME

In simple cases, you can expose multiple methods automatically by invoking fire with no arguments:

#!/usr/bin/env python
"""
Simple fire example
"""
import fire

def greet(greeting='Hiya', name='Tammy'):
    print(f"{greeting} {name}")

def goodbye(goodbye='Bye', name='Tammy'):
    print(f"{goodbye} {name}")

if __name__ == '__main__':
    fire.Fire()

fire creates a command from each function and documents them automatically:

$ ./simple_fire.py --help
INFO: Showing help with the command 'simple_fire.py -- --help'.

NAME
    simple_fire.py

SYNOPSIS
    simple_fire.py GROUP | COMMAND

GROUPS
    GROUP is one of the following:
     fire  The Python fire module.

COMMANDS
    COMMAND is one of the following:
     greet
     goodbye
(END)

This is really convenient if you are trying to understand someone else’s code or debug your own. With one line of additional code, you can interact with all of a module’s functions from the command line. That is powerful. Because fire uses the structure of your program itself to determine the interface, it is even more tied to your non-interface code than argparse or click. To mimic our nested command interface, you need to define classes with the structure of the interface you want to expose. To see an approach to this, check out Example 3-6.

Example 3-6. fire_example.py

#!/usr/bin/env python
"""
Command-line tool using fire
"""
import fire

class Ships():
    def sail(self):
        ship_name = 'Your ship'
        print(f"{ship_name} is setting sail")

    def list(self):
        ships = ['John B', 'Yankee Clipper', 'Pequod']
        print(f"Ships: {','.join(ships)}")

def sailors(greeting, name):
    message = f'{greeting} {name}'
    print(message)

class Cli():
    def __init__(self):
        self.sailors = sailors
        self.ships = Ships()

if __name__ == '__main__':
    fire.Fire(Cli)

Define a class for the ships commands.

sailors has no subcommands, so it can be defined as a function.

Define a class to act as the top group. Add the sailors function and the Ships as attributes of the class.

Call fire.Fire on the class acting as the top-level group.

The automatically generated documentation at the top level represents the Ships class as a group, and the sailors command as a command:

$ ./fire_example.py
NAME
    fire_example.py

SYNOPSIS
    fire_example.py GROUP | COMMAND

GROUPS
    GROUP is one of the following:
     ships

COMMANDS
    COMMAND is one of the following:
     sailors
(END)

The documentation for the ships group shows the commands representing the methods attached to the Ships class:

$ ./fire_example.py ships --help
INFO: Showing help with the command 'fire_example.py ships -- --help'.

NAME
    fire_example.py ships

SYNOPSIS
    fire_example.py ships COMMAND

COMMANDS
    COMMAND is one of the following:
     list
     sail
(END)

The parameters for the sailors function are turned into positional arguments:

$ ./fire_example.py sailors --help
INFO: Showing help with the command 'fire_example.py sailors -- --help'.

NAME
    fire_example.py sailors

SYNOPSIS
    fire_example.py sailors GREETING NAME

POSITIONAL ARGUMENTS
    GREETING
    NAME

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
(END)

You can call the commands and subcommands as expected:

$ ./fire_example.py ships sail
Your ship is setting sail
$ ./fire_example.py ships list
Ships: John B,Yankee Clipper,Pequod
$ ./fire_example.py sailors Hiya Karl
Hiya Karl

An exciting feature of fire is the ability to enter an interactive mode easily. By using the --interactive flag, fire opens an IPython shell with the objects and functions of your script available:

$ ./fire_example.py sailors Hiya Karl -- --interactive
Hiya Karl
Fire is starting a Python REPL with the following objects:
Modules: fire
Objects: Cli, Ships, component, fire_example.py, result, sailors, self, trace

Python 3.7.0 (default, Sep 23 2018, 09:47:03)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.
---------------------------------------------------------------------------

In [1]: sailors
Out[1]: <function __main__.sailors(greeting, name)>

In [2]: sailors('hello', 'fred')
hello fred

Here we run the maritime program’s sailors command in interactive mode. An IPython shell opens, and you have access to the sailors function. This interactive mode, in combination with the ease of exposing objects with fire, makes it the right tool both for debugging and introducing yourself to new code.

You have now run the gamut of command-line tool building libraries, from the very hands-on argparse, to the less verbose click, and lastly to the minimal fire. So which one should you use? We recommend click for most use cases. It balances ease and control. In the case of complex interfaces where you want to separate the UI code from the business logic, argparse is the way to go. Moreover, if you need quick command-line access to code that does not have an interface of its own, fire is right for you.

Implementing Plug-ins

Once you’ve implemented your application’s command-line user interface, you might want to consider a plug-in system. Plug-ins are pieces of code supplied by the user of your program to extend functionality. Plug-in systems are used in all sorts of applications, from large applications like Autodesk’s Maya to minimal web frameworks like Flask. You could write a tool that handles walking a filesystem and allows a user to provide plug-ins to operate on its contents. A key part of any plug-in system is plug-in discovery. Your program needs to know what plug-ins are available to load and run. In Example 3-7, we write a simple application that discovers and runs plug-ins. It uses a user-supplied prefix to search for, load, and run plug-ins.

Example 3-7. simple_plugins.py

#!/usr/bin/env python
import fire
import pkgutil
import importlib

def find_and_run_plugins(plugin_prefix):
    plugins = {}

    # Discover and Load Plugins
    print(f"Discovering plugins with prefix: {plugin_prefix}")
    for _, name, _ in pkgutil.iter_modules():
        if name.startswith(plugin_prefix):
            module = importlib.import_module(name)
            plugins[name] = module

    # Run Plugins
    for name, module in plugins.items():
        print(f"Running plugin {name}")
        module.run()

if __name__ == '__main__':
    fire.Fire()

pkgutil.iter_modules returns all modules available in the current sys.path.

Check if the module uses our plug-in prefix.

Use importlib to load the module, saving it in a dict for later use.

Call the run method on the plug-in.
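The discovery-and-run flow described above can be exercised end to end without Example 3-7’s CLI wrapper by dropping throwaway modules onto sys.path. The demo_plugin names, the temporary directory, and the returned strings are invented for this sketch:

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

# Create two throwaway plug-in modules in a temporary directory.
plugin_dir = Path(tempfile.mkdtemp())
(plugin_dir / "demo_plugin_a.py").write_text("def run():\n    return 'A ran'\n")
(plugin_dir / "demo_plugin_b.py").write_text("def run():\n    return 'B ran'\n")

# Make the directory importable, then discover modules by prefix.
sys.path.insert(0, str(plugin_dir))
found = sorted(
    name for _, name, _ in pkgutil.iter_modules()
    if name.startswith("demo_plugin")
)
print(found)  # ['demo_plugin_a', 'demo_plugin_b']

# Load and run each discovered plug-in, exactly as find_and_run_plugins does.
for name in found:
    module = importlib.import_module(name)
    print(module.run())
```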

Supplying plug-ins to Example 3-7 is as simple as providing modules whose names use a shared prefix and whose functionality is accessed through a method named run. If you write two files, foo_plugin_a.py and foo_plugin_b.py, each using the prefix foo_plugin and defining its own run method:

# foo_plugin_a.py
def run():
    print("Running plugin A")

# foo_plugin_b.py
def run():
    print("Running plugin B")

You can discover and run them with our plugin application:

$ ./simple_plugins.py find_and_run_plugins foo_plugin
Discovering plugins with prefix: foo_plugin
Running plugin foo_plugin_a
Running plugin A
Running plugin foo_plugin_b
Running plugin B

You can easily extend this simple example to create plug-in systems for your applications.

Case Study: Turbocharging Python with Command-Line Tools

It’s as good a time as ever to be writing code these days; a little bit of code goes a long way. Just a single function is capable of performing incredible things. Thanks to GPUs, machine learning, the cloud, and Python, it’s easy to create “turbocharged” command-line tools. Think of it as upgrading your code from using a basic internal combustion engine to a jet engine. What’s the basic recipe for the upgrade? One function, a sprinkle of powerful logic, and, finally, a decorator to route it to the command line.

Writing and maintaining traditional GUI applications — web or desktop — is a Sisyphean task at best. It all starts with the best of intentions, but can quickly turn into a soul-crushing, time-consuming ordeal where you end up asking yourself why you thought becoming a programmer was a good idea in the first place. Why did you run that web framework setup utility that essentially automated a 1970s technology — the relational database — into a series of Python files? The old Ford Pinto with the exploding rear gas tank has newer technology than your web framework. There has got to be a better way to make a living.

The answer is simple: stop writing web applications and start writing jet-powered command-line tools instead. The turbocharged command-line tools discussed in the following sections are focused on fast results vis-à-vis minimal lines of code. They can do things like learn from data (machine learning), make your code run two thousand times faster, and best of all, generate colored terminal output.

Here are the raw ingredients that will be used to make several solutions:

Click framework

Python CUDA framework

Numba framework

Scikit-learn machine learning framework

Using the Numba Just-in-Time (JIT) Compiler

Python has a reputation for slow performance because it’s fundamentally a scripting language. One way to get around this problem is to use the Numba Just-in-Time (JIT) compiler. Let’s take a look at what that code looks like.

First, use a timing decorator to get a grasp on the runtime of your functions:

from functools import wraps
from time import time

def timing(f):
    """Decorator that prints how long the wrapped function took to run"""
    @wraps(f)
    def wrap(*args, **kwargs):
        ts = time()
        result = f(*args, **kwargs)
        te = time()
        print(f"fun: {f.__name__}, args: [{args}, {kwargs}] took: {te-ts} sec")
        return result
    return wrap

Next, add a numba.jit decorator with the nopython keyword argument set to True. This ensures that the code is run by the JIT compiler instead of the regular Python interpreter:

@timing
@numba.jit(nopython=True)
def expmean_jit(rea):
    """Perform multiple mean calculations"""
    val = rea.mean() ** 2
    return val

When you run it, you can see both a jit as well as a regular version being run via the command-line tool:

$ python nuclearcli.py jit-test
Running NO JIT
func:'expmean' args:[(array([[1.0000e+00, 4.2080e+05, 4.2350e+05, ...,
        1.0543e+06, 1.0485e+06, 1.0444e+06],
       [2.0000e+00, 5.4240e+05, 5.4670e+05, ..., 1.5158e+06, 1.5199e+06,
        1.5253e+06],
       ...,
       [1.5281e+04, 2.5350e+05, 2.5400e+05, ..., 7.8360e+05, 7.7950e+05,
        7.7420e+05]], dtype=float32),), {}] took: 0.0007 sec

$ python nuclearcli.py jit-test --jit
Running with JIT
func:'expmean_jit' args:[(array([[1.0000e+00, 4.2080e+05, 4.2350e+05, ...,
        1.0543e+06, 1.0485e+06, 1.0444e+06],
       [2.0000e+00, 5.4240e+05, 5.4670e+05, ..., 1.5158e+06, 1.5199e+06,
        1.5253e+06],
       ...,
       [1.5281e+04, 2.5350e+05, 2.5400e+05, ..., 7.8360e+05, 7.7950e+05,
        7.7420e+05]], dtype=float32),), {}] took: 0.2180 sec

How does that work? Just a few lines of code allow for this simple toggle:

@cli.command()
@click.option('--jit/--no-jit', default=False)
def jit_test(jit):
    rea = real_estate_array()
    if jit:
        click.echo(click.style('Running with JIT', fg='green'))
        expmean_jit(rea)
    else:
        click.echo(click.style('Running NO JIT', fg='red'))
        expmean(rea)

In some cases, a JIT version could make code run thousands of times faster, but benchmarking is key. (Note that in the run above the JIT version is actually slower: the first call to a JIT-compiled function includes compilation overhead.) Another item to point out is this line:

click.echo(click.style('Running with JIT', fg='green'))

This line produces colored terminal output, which can be very helpful when creating sophisticated tools.
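Under the hood, click.style works by wrapping the text in ANSI escape sequences; the effect can be approximated with plain Python. The color codes below are standard ANSI, and the style helper is an invented stand-in for illustration, not click’s actual API:

```python
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"

def style(text, color):
    # Wrap the text in a color escape and a reset, as click.style does.
    return f"{color}{text}{RESET}"

# A terminal that supports ANSI colors renders these in green and red.
print(style("Running with JIT", GREEN))
print(style("Running NO JIT", RED))
```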

Using the GPU with CUDA Python

Another way to turbocharge your code is to run it straight on a GPU. This example requires that you run it on a machine with a CUDA-enabled GPU. Here’s what that code looks like:

@cli.command()
def cuda_operation():
    """Performs Vectorized Operations on GPU"""
    x = real_estate_array()
    y = real_estate_array()

    print("Moving calculations to GPU memory")
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array(
        shape=(x_device.shape[0], x_device.shape[1]), dtype=np.float32)
    print(x_device)
    print(x_device.shape)
    print(x_device.dtype)

    print("Calculating on GPU")
    add_ufunc(x_device, y_device, out=out_device)

    out_host = out_device.copy_to_host()
    print(f"Calculations from GPU {out_host}")

It’s useful to point out the flow: the NumPy array is first moved to GPU memory, a vectorized function then does the work on the GPU, and after that work is completed, the data is moved back from the GPU to the host. Depending on the workload, running on a GPU can be a monumental improvement to the code. The output from the command-line tool is shown here:

$ python nuclearcli.py cuda-operation
Moving calculations to GPU memory
<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x7f01bf6ccac8>
(10015, 259)
float32
Calculating on GPU
Calculations from GPU [[2.0000e+00 8.4160e+05 8.4700e+05 ... 2.1086e+06
  2.0970e+06 2.0888e+06]
 [4.0000e+00 1.0848e+06 1.0934e+06 ... 3.0316e+06 3.0398e+06 3.0506e+06]
 [6.0000e+00 1.4180e+05 1.4240e+05 ... 2.2760e+05 2.2700e+05 2.2660e+05]
 ...
 [3.0554e+04 1.9780e+05 1.9620e+05 ... 4.3960e+05 4.4000e+05 4.4080e+05]
 [3.0560e+04 1.7340e+05 1.7500e+05 ... 3.8140e+05 3.8460e+05 3.8720e+05]
 [3.0562e+04 5.0700e+05 5.0800e+05 ... 1.5672e+06 1.5590e+06 1.5484e+06]]

Running True Multicore Multithreaded Python Using Numba

One common performance problem with Python is the lack of true, multithreaded performance. This also can be fixed with Numba. Here’s an example of some basic operations:

@timing
@numba.jit(parallel=True)
def add_sum_threaded(rea):
    """Use all the cores"""
    x, _ = rea.shape
    total = 0
    for _ in numba.prange(x):
        total += rea.sum()
    print(total)

@timing
def add_sum(rea):
    """traditional for loop"""
    x, _ = rea.shape
    total = 0
    for _ in numba.prange(x):
        total += rea.sum()
    print(total)

@cli.command()
@click.option('--threads/--no-jit', default=False)
def thread_test(threads):
    rea = real_estate_array()
    if threads:
        click.echo(click.style('Running with multicore threads', fg='green'))
        add_sum_threaded(rea)
    else:
        click.echo(click.style('Running NO THREADS', fg='red'))
        add_sum(rea)

Note that the key difference in the parallel version is that it uses @numba.jit(parallel=True) and numba.prange to spawn threads for iteration. As you can see in Figure 3-1, all of the CPUs are maxed out on the machine, but when almost the exact same code is run without the parallelization, it only uses a single core.

Figure 3-1. Using all of the cores

$ python nuclearcli.py thread-test
$ python nuclearcli.py thread-test --threads
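Numba aside, the same fan-out-then-combine idea can be sketched with the standard library alone. This is not the book’s Numba approach, just an illustration of splitting a sum across worker processes with concurrent.futures; the chunk_sum and parallel_sum names are invented:

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    # Work unit: sum one slice of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the data into one chunk per worker and sum the chunks in parallel.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == '__main__':
    print(parallel_sum(list(range(1000))))  # 499500
```

Because processes sidestep the GIL entirely, this gains real multicore speedup for CPU-bound work, at the cost of serializing the chunks between processes; Numba’s prange avoids that overhead by releasing the GIL inside compiled code.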

KMeans Clustering

Another powerful thing that can be accomplished with a command-line tool is machine learning. In the example below, a KMeans clustering function is created with just a few lines of code. This clusters a Pandas DataFrame into a default of three clusters:

def kmeans_cluster_housing(clusters=3):
    """Kmeans cluster a dataframe"""
    url = "https://raw.githubusercontent.com/noahgift/\
socialpowernba/master/data/nba_2017_att_val_elo_win_housing.csv"
    val_housing_win_df = pd.read_csv(url)
    numerical_df = (
        val_housing_win_df.loc[:, ["TOTAL_ATTENDANCE_MILLIONS", "ELO",
                                   "VALUE_MILLIONS",
                                   "MEDIAN_HOME_PRICE_COUNTY_MILLIONS"]]
    )
    # scale data
    scaler = MinMaxScaler()
    scaler.fit(numerical_df)
    scaler.transform(numerical_df)
    # cluster data
    k_means = KMeans(n_clusters=clusters)
    kmeans = k_means.fit(scaler.transform(numerical_df))
    val_housing_win_df['cluster'] = kmeans.labels_
    return val_housing_win_df

The cluster number can be changed by passing in another number (as shown below) using click:

@cli.command()
@click.option("--num", default=3, help="number of clusters")
def cluster(num):
    df = kmeans_cluster_housing(clusters=num)
    click.echo("Clustered DataFrame")
    click.echo(df.head())

Finally, the output of the Pandas DataFrame with the cluster assignment is shown next. Note that it now has the cluster assignment as a column:

$ python -W nuclearcli.py cluster

Clustered DataFrame
               TEAM  GMS  ...       COUNTY  cluster
0     Chicago Bulls   41  ...         Cook        0
1  Dallas Mavericks   41  ...       Dallas        0
2  Sacramento Kings   41  ...   Sacremento        1
3        Miami Heat   41  ...   Miami-Dade        0
4   Toronto Raptors   41  ...  York-County        0

[5 rows x 12 columns]

$ python -W nuclearcli.py cluster –num 2

Clustered DataFrame
               TEAM  GMS  ...       COUNTY  cluster
0     Chicago Bulls   41  ...         Cook        1
1  Dallas Mavericks   41  ...       Dallas        1
2  Sacramento Kings   41  ...   Sacremento        0
3        Miami Heat   41  ...   Miami-Dade        1
4   Toronto Raptors   41  ...  York-County        1

[5 rows x 12 columns]

Exercises

Use sys to write a script that prints "command line" only when run from the command line.

Use click to create a command-line tool that takes a name as an argument and prints it if it does not begin with a p.

Use fire to access methods in an existing Python script from the command line.
