Data Science at the Command Line - Obtain, Scrub, Explore, and Model Data with Unix Power Tools, 2nd Edition, by Jeroen Janssens

Book Summary

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools, useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, and engineers; software and machine learning engineers; and system administrators.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on text, CSV, HTML, XML, and JSON files
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow
  • Create reusable command-line tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms
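
As a small illustration of what "combining small yet powerful command-line tools" can look like, here is a minimal Bash sketch. It is not taken from the book: the function name `top_words` and the sample sentence are made up for this example, and it assumes only standard Unix tools (`tr`, `grep`, `sort`, `uniq`, `head`, `awk`) are available.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the book): print the n most frequent
# words read from standard input as "word<TAB>count" lines.
top_words() {
  local n="${1:-3}"              # number of words to show, default 3
  tr '[:upper:]' '[:lower:]' |   # normalize case
    tr -cs '[:alpha:]' '\n' |    # turn every run of non-letters into a newline
    grep -v '^$' |               # drop empty lines
    sort |                       # bring identical words together
    uniq -c |                    # count each distinct word
    sort -k1,1nr -k2 |           # highest count first, ties in alphabetical order
    head -n "$n" |               # keep the top n
    awk '{print $2 "\t" $1}'     # reorder columns to word, then count
}

# Example run on a made-up sentence.
printf 'the cat and the dog and the bird\n' | top_words 2
```

Each stage does one job and passes plain text to the next, so any stage can be swapped out or inspected on its own by cutting the pipeline short.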

Who This Book Is For

This book makes just one assumption about you: that you work with data. It doesn’t matter which programming language or statistical computing environment you’re currently using. The book explains all the necessary concepts from the beginning.

It also doesn’t matter whether your operating system is Microsoft Windows, macOS, or some flavor of Linux. The book comes with a Docker image, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don’t have to waste time figuring out how to install all the command-line tools and their dependencies.

The book contains some code in Bash, Python, and R, so it’s helpful if you have some programming experience, but it’s by no means required to follow along with the examples.

Reviews

“Traditional computer and data science curricula all too often mistake the command line as an obsolete relic instead of teaching it as the modern and vital toolset that it is. Only well into my career did I come to grasp the elegance and power of the command line for easily exploring messy datasets and even creating reproducible data pipelines for work. The first edition of Data Science at the Command Line was one of the most comprehensive and clear references when I was a novice in the art, and now with the second edition, I'm again learning new tools and applications from it.” — Dan Nguyen, data scientist, former news application developer at ProPublica, and former Lorry I. Lokey Visiting Professor in Professional Journalism at Stanford University

“Despite what you may hear, most practical data science is still focused on interesting visualizations and insights derived from flat files. Jeroen's book leans into this reality, and helps reduce complexity for data practitioners by showing how time-tested command-line tools can be repurposed for data science.” — Paige Bailey, principal product manager code intelligence at Microsoft, GitHub

“It's amazing how fast so much data work can be performed at the command line before ever pulling the data into R, Python, or a database. Older technologies like sed and awk are still incredibly powerful and versatile. Until I read Data Science at the Command Line, I had only heard of these tools but never saw their full power. Thanks to Jeroen, it's like I now have a secret weapon for working with large data.” — Jared Lander, chief data scientist at Lander Analytics, organizer of the New York Open Statistical Programming Meetup, and author of R for Everyone

“The command line is an essential tool in every data scientist's toolbox, and knowing it well makes it easy to translate questions you have of your data to real-time insights. Jeroen not only explains the basic Unix philosophy of how to chain together single-purpose tools to arrive at simple solutions for complex problems, but also introduces new command-line tools for data cleaning, analysis, visualization, and modeling.” — Jake Hofman, senior principal researcher at Microsoft Research, and adjunct assistant professor in the department of applied mathematics at Columbia University

“The Unix philosophy of simple tools, each doing one job well, then cleverly piped together, is embodied by the command line. Jeroen expertly discusses how to bring that philosophy into your work in data science, illustrating how the command line is not only the world of file input/output, but also the world of data manipulation, exploration, and even modeling.” — Chris H. Wiggins, associate professor in the department of applied physics and applied mathematics at Columbia University, and chief data scientist at The New York Times

About the Author

Jeroen Janssens teaches data science: often through training and coaching, occasionally through speaking, and infrequently through writing. His interests include visualizing data, building machine learning models, and automating things using Python, R, or Bash. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and at various startups in New York City. Currently, Jeroen is the CEO of Data Science Workshops, which organizes open enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups, all related to data science, of course. He lives with his wife and two kids in Rotterdam, the Netherlands.


© 1994 - 2024 Cloud Monk Losang Jinpa or Fair Use. Disclaimers


data_science_at_the_command_line_-_obtain_scrub_explore_and_model_data_with_unix_power_tools_2nd_edition_by_jeroen_janssens.txt · Last modified: 2024/04/28 03:42 (external edit)