Big Data Essentials

L2: Using Linux as a Data Scientist





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

Log in to the Aliyun server

  • Mac: ssh student@39.98.252.239
  • Windows
    • Download PuTTY.
    • Hostname: 39.98.252.239
    • User ID: student
    • Port: 22

The command line

  • A command line, or terminal, is a text-based interface to the system.
  • You enter commands by typing them on the keyboard, and the system gives its feedback as text too.

Shell

  • A program that takes commands from the keyboard and gives them to the operating system to perform.
  • Bash: one kind of shell.



Basic Navigation!

In [99]:
# Who am I?
whoami

# Where are we?
pwd

pwd

whoami
yanfei
/home/yanfei/lectures
/home/yanfei/lectures
yanfei
In [100]:
# What's in our current location?

ls
BDE-L0-intro.ipynb	  BDE-L1-bigdata.slides.html  figs
BDE-L0-intro.slides.html  BDE-L2-linux.ipynb
BDE-L1-bigdata.ipynb	  BDE-L2-linux.slides.html

Paths

  • The file system under Linux is a hierarchical structure.
  • At the very top of the structure is what's called the root directory. It is denoted by a single slash ( / ).

Absolute and relative paths

  • Absolute paths specify a location (file or directory) in relation to the root directory. You can identify them easily as they always begin with a forward slash ( / ).
  • Relative paths specify a location (file or directory) in relation to where we currently are in the system. They will not begin with a slash.
In [5]:
ls
BDE-L0-intro.ipynb  BDE-L1-bigdata.ipynb  BDE-L2-linux.ipynb  figs
In [8]:
ls /home/yanfei/lectures
BDE-L0-intro.ipynb  BDE-L1-bigdata.ipynb  BDE-L2-linux.ipynb  figs

More on paths

  • ~ (tilde) - This is a shortcut for your home directory, so /home/yanfei/lectures can be written as ~/lectures.
  • . (dot) - This is a reference to your current directory. Try ls ./
  • .. (dotdot) - This is a reference to the parent directory. Try ls ../
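
A minimal sketch of these shortcuts in action. The sandbox directory and the names lectures and figs are assumptions made up for illustration; everything happens under a temporary directory so your real files are not touched.

```shell
# Throwaway sandbox (hypothetical); mktemp -d prints the path it created.
demo=$(mktemp -d)
mkdir -p "$demo/lectures/figs"

cd "$demo/lectures/figs"     # absolute path: it starts at the root /
cd ..                        # relative path: go up to the parent directory
here=$(basename "$PWD")      # name of the directory we are in now
echo "$here"

cd ./figs                    # . is the current directory; same as: cd figs
cd ../..                     # two levels up, back to the sandbox
echo ~                       # the shell expands ~ to your home directory

cd /                         # step out before cleaning up
rm -r "$demo"                # remove the sandbox
```

After the first cd .. the working directory is lectures, because .. always refers to the parent of wherever you currently are.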

Let's move around a bit

  • In order to move around in the system we use a command called cd which stands for change directory.
  • Use Tab Completion.

Using manual pages

  • The manual pages are a set of pages that explain every command available on your system including what they do, the specifics of how you run them and what command line arguments they accept.
  • Try man ls.
In [85]:
ls -lhsta
total 892K
4.0K drwxr-xr-x 4 yanfei yanfei 4.0K Sep  6 19:16 .
 20K -rw-r--r-- 1 yanfei yanfei  19K Sep  6 19:16 BDE-L2-linux.ipynb
4.0K -rw-rw-r-- 1 yanfei yanfei  166 Sep  6 19:12 output
4.0K -rw-r--r-- 1 yanfei yanfei 3.0K Sep  6 18:53 BDE-L0-intro.ipynb
292K -rw-rw-r-- 1 yanfei yanfei 291K Sep  5 15:25 BDE-L2-linux.slides.html
276K -rw-rw-r-- 1 yanfei yanfei 275K Sep  5 15:25 BDE-L1-bigdata.slides.html
276K -rw-rw-r-- 1 yanfei yanfei 274K Sep  5 15:25 BDE-L0-intro.slides.html
4.0K drwxr-xr-x 2 yanfei yanfei 4.0K Sep  4 18:11 figs
4.0K -rw-r--r-- 1 yanfei yanfei 3.0K Sep  4 17:44 BDE-L1-bigdata.ipynb
4.0K drwxr-xr-x 2 yanfei yanfei 4.0K Sep  4 17:44 .ipynb_checkpoints
4.0K drwxr-xr-x 9 yanfei yanfei 4.0K Sep  4 16:12 ..

Lab

  • Use the commands cd and ls to explore what directories are on your system and what's in them. Make sure you use a variety of relative and absolute paths.
  • Now go to your home directory using 4 different methods.
  • Make sure you are using Tab Completion when typing out your paths too. Why do anything you can get the computer to do for you?



File manipulation

  • Making a directory: mkdir
  • Removing a directory: rmdir
  • Creating a blank file: touch
  • Copying a file or directory: cp
  • Moving a file or directory: mv
  • Renaming files or directories: also mv (move it to a new name)
  • Removing files: rm (rmdir or rm -d for empty directories)
  • How do you remove non-empty directories?
  • Note: there is no undo.
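
The commands above can be sketched as one short session. The file and directory names (project, archive, notes.txt) are hypothetical, and the whole thing runs in a temporary directory, which is a sensible habit given that nothing here can be undone. The last step uses rm -r, the usual way to remove a directory together with its contents.

```shell
sandbox=$(mktemp -d)         # hypothetical scratch directory
cd "$sandbox"

mkdir project                              # make a directory
touch project/notes.txt                    # create a blank file
cp project/notes.txt project/copy.txt      # copy a file
mv project/copy.txt project/renamed.txt    # rename: move to a new name
mkdir archive
mv project/renamed.txt archive/            # move into another directory
rm project/notes.txt                       # remove a file
rmdir project                # works only because project is now empty
rm -r archive                # -r removes a directory and everything in it

remaining=$(ls -A | wc -l)   # count the entries left in the sandbox
echo "$remaining"

cd /
rmdir "$sandbox"             # the sandbox itself is empty again
```

Be careful with rm -r: it deletes everything below the directory you name without asking.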

Lab

  • Start by creating a directory in your home directory in which to experiment.
  • In that directory, create a series of files and directories (and files and directories in those directories).
  • Now rename a few of those files and directories.
  • Delete one of the directories that has other files and directories in it.
  • Move back to your home directory and from there copy a file from one of your subdirectories into the initial directory you created.
  • Now move that file back into another directory.
  • Rename a few files.
  • Next, move a file and rename it in the process.
  • Finally, have a look at the existing directories in your home directory.

Upload local files

A command line editor

  • Many text editors are available: nano, vim, emacs.
  • There are two modes in Vim: Insert (or Input) mode and Edit mode (called Normal mode in Vim's own documentation).
  • In input mode you may input or enter content into the file.
  • In edit mode you can move around the file, perform actions such as deleting, copying, search and replace, saving etc.
  • A common mistake is to start entering commands without first going back into edit mode or to start typing input without first going into insert mode.

First file

  • Start with vim firstfile.
  • You always start off in edit mode, so the first thing we are going to do is switch to insert mode by pressing i. While you are in insert mode, the bottom left corner shows -- INSERT --.

Saving and exiting

  • :q! - discard all changes since the last save and exit
  • :w - save file but don't exit
  • :wq - save and exit

Other ways to view files

  • Try cat firstfile.
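  • For larger files, less is better suited: it shows one screen at a time and lets you scroll (press q to quit).
  • head and tail print the first and last lines of a file (10 lines by default).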
In [69]:
head BDE-L0-intro.slides.html
<!DOCTYPE html>
<html>
<head>

<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="chrome=1" />

<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
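
Both head and tail accept -n to choose how many lines to print. A small sketch, assuming a throwaway file filled with the numbers 1 to 20 by seq (the file is created only for this illustration):

```shell
tmpfile=$(mktemp)                  # hypothetical scratch file
seq 1 20 > "$tmpfile"              # one number per line, 1 to 20

first=$(head -n 3 "$tmpfile")      # first three lines: 1, 2, 3
last=$(tail -n 2 "$tmpfile")       # last two lines: 19, 20
echo "$first"
echo "$last"

rm "$tmpfile"                      # clean up
```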

Navigating a file in Vim

Now let's go back into the file we just created and enter some more content. In insert mode you may use the arrow keys to move the cursor around. Enter two more paragraphs of content then hit Esc to go back to edit mode.

  • Arrow keys - move the cursor around
  • j, k, h, l - move the cursor down, up, left and right (similar to the arrow keys)
  • ^ (caret) - move cursor to the first non-blank character of the current line (0 moves to the very beginning)
  • $ - move cursor to end of the current line
  • nG - move to the nth line (eg 5G moves to 5th line)
  • G - move to the last line
  • w - move to the beginning of the next word
  • nw - move forward n words (eg 2w moves two words forwards)
  • b - move to the beginning of the previous word
  • nb - move back n words
  • { - move backward one paragraph
  • } - move forward one paragraph

Deleting content

  • x - delete a single character
  • nx - delete n characters (eg 5x deletes five characters)
  • dd - delete the current line
  • dn - d followed by a movement command. Delete to where the movement command would have taken you. (eg d5w means delete 5 words)

Undoing

  • u - Undo the last action (you may keep pressing u to keep undoing)
  • U (Note: capital) - Undo all changes to the current line

Lab

  • Start by creating a file and putting some content into it.
  • Save the file and view it in both cat and less.
  • Go back into the file in vim and enter some more content.
  • Move around the content using at least 6 different movement commands.
  • Play about with several of the delete commands, especially the ones that incorporate a movement command. Remember you may undo your changes so you don't have to keep putting new content in.



Wildcards!

What are they?

  • * - represents zero or more characters
  • ? - represents a single character
  • [] - represents any single character from the set or range inside the brackets (eg [0-9] or [abc])
In [61]:
# Examples

ls B*
ls *.????b
ls *[0-1]*
ls */*.png
ls -lhsa /home/*/.bash_history
BDE-L0-intro.ipynb  BDE-L1-bigdata.ipynb  BDE-L2-linux.ipynb
BDE-L0-intro.ipynb  BDE-L1-bigdata.ipynb  BDE-L2-linux.ipynb
BDE-L0-intro.ipynb  BDE-L1-bigdata.ipynb
figs/bigdata.png
4.0K -rw-r--r-- 1 yanfei yanfei 560 Sep  4 18:02 /home/yanfei/.bash_history
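
It is the shell, not ls, that expands a wildcard into the list of matching names, so echo is enough to see what a pattern matches. A minimal sketch with made-up file names in a temporary directory:

```shell
sandbox=$(mktemp -d)               # hypothetical scratch directory
cd "$sandbox"
touch a1.txt a2.txt b1.txt b10.txt notes.md

star=$(echo a*)                    # * : anything after a
question=$(echo b?.txt)            # ? : exactly one character after b
range=$(echo *[2-9].txt)           # [] : one character from 2 to 9 before .txt

echo "$star"                       # a1.txt a2.txt
echo "$question"                   # b1.txt
echo "$range"                      # a2.txt

cd /
rm -r "$sandbox"                   # clean up
```

Note that b?.txt does not match b10.txt, because ? stands for exactly one character.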

Lab

  • A good directory to play with is /etc which is a directory containing config files for the system. As a normal user you may view the files but you can't make any changes so we can't do any harm. Do a listing of that directory to see what's there. Then pick various subsets of files and see if you can create a pattern to select only those files.
  • Do a listing of /etc with only files that contain an extension.
  • What about only a 3 letter extension?
  • How about files whose name contains an uppercase letter? (hint: [[:upper:]] may be useful here)
  • Can you list files whose name is 4 characters long?



Piping and redirection

  • Piping and redirection help create powerful workflows that will automate your work, saving you time and effort.
  • We looked at a collection of filters that manipulate data for us. How can we join them together to do more powerful data manipulation?

Redirecting to a File

In [96]:
ls > output
cat output
BDE-L0-intro.ipynb
BDE-L0-intro.slides.html
BDE-L1-bigdata.ipynb
BDE-L1-bigdata.slides.html
BDE-L2-linux.ipynb
BDE-L2-linux.slides.html
figs
output

Saving to an existing file

In [97]:
wc -l output
wc -l output >> output
cat output
8 output
BDE-L0-intro.ipynb
BDE-L0-intro.slides.html
BDE-L1-bigdata.ipynb
BDE-L1-bigdata.slides.html
BDE-L2-linux.ipynb
BDE-L2-linux.slides.html
figs
output
8 output
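
The difference between the two operators can be sketched with a throwaway file (the file and the words written to it are made up for illustration): > replaces whatever the file held, while >> adds to the end.

```shell
log=$(mktemp)                      # hypothetical scratch file

echo "first"  > "$log"             # > creates the file or overwrites it
echo "second" > "$log"             # "first" is gone now
echo "third" >> "$log"             # >> appends to the existing content

contents=$(cat "$log")
echo "$contents"                   # second, then third

rm "$log"                          # clean up
```

The shell sets up the redirection before the command runs. That is why the listing produced by ls > output includes output itself: the file already exists by the time ls looks at the directory.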

Piping

  • Now we'll take a look at a mechanism for sending data from one program to another.
  • It's called piping.
  • The operator we use is ( | ).
In [84]:
ls | head -3
BDE-L0-intro.ipynb
BDE-L0-intro.slides.html
BDE-L1-bigdata.ipynb
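
Pipes can be chained, with each program reading the output of the one before it. A minimal sketch, assuming a temporary directory with three made-up files; grep, sort and wc are standard commands that are not otherwise covered in this lecture.

```shell
sandbox=$(mktemp -d)               # hypothetical scratch directory
cd "$sandbox"
touch a.ipynb b.ipynb c.html

# List the files, keep the names ending in ipynb, count the lines that remain.
count=$(ls | grep 'ipynb$' | wc -l)
echo "$count"                      # 2

# Sort three words alphabetically and keep the first two.
top=$(printf 'pear\napple\nfig\n' | sort | head -n 2)
echo "$top"                        # apple, then fig

cd /
rm -r "$sandbox"                   # clean up
```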