Data Scientist's Toolbox
Data Science Specialisation : Johns Hopkins University
Part 1 - Data Scientist's Toolbox
Aaditree Jaisswal
LINK FOR COURSE :
When we talk about data science, we imply that it revolves around a few things: statistics, data cleaning, computer science and so on. Data is basically any information that helps in decision making. It can be values, numbers or facts; there is no limitation.
There are six main types of analysis in data science :
1. Descriptive : summarises the data.
2. Exploratory : examines the data to find relationships between variables.
Note: Correlation does not imply causation.
3. Inferential : uses a small sample of data to draw a conclusion about a larger population.
4. Predictive : uses current and historical data to make predictions about the future.
5. Causal : looks at what happens to one variable when we manipulate another variable.
6. Mechanistic : aims to understand the exact changes in one variable that lead to exact changes in another variable.
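As a rough illustration of the first four types, here is a minimal sketch using R's built-in mtcars dataset. The choice of dataset and of the variables mpg, wt and am is an assumption made for this example; it does not come from the course.

```r
# Descriptive: summarise fuel economy (miles per gallon).
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Exploratory: look for a relationship between weight and fuel economy.
# A strong correlation here still does not prove that weight causes low mpg.
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# Inferential: treat the 32 cars as a sample and test whether mean mpg
# differs between automatic (am = 0) and manual (am = 1) cars in general.
mpg_test <- t.test(mpg ~ am, data = mtcars)
print(mpg_test$p.value)

# Predictive: fit a linear model and predict mpg for a hypothetical car
# weighing 3000 lbs (wt is recorded in units of 1000 lbs).
mpg_model <- lm(mpg ~ wt, data = mtcars)
predicted_mpg <- predict(mpg_model, newdata = data.frame(wt = 3))
print(predicted_mpg)
```

Causal and mechanistic analyses are left out of the sketch because they generally need a designed experiment or a detailed model of the system, not just an observational table like this one.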
In this course, the language used is R. The platform is RStudio, and along with that we use version control.
R is mainly used for two things : Statistical analysis and graphing.
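A few R functions for installing packages :
1. install.packages("ggplot2") - installs a single package called "ggplot2" from CRAN.
2. install.packages(c("ggplot2", "devtools")) - installs multiple packages together. The names must be passed as one character vector using c().
3. BiocManager::install() - installs Bioconductor packages. Older material uses biocLite(), which Bioconductor has since replaced with the BiocManager package.
4. devtools::install_github("author/package") - installs a package directly from GitHub. install_github() comes from the devtools package, which must be installed first.
It is not enough to just install a package. To use it, we need to load it into the R session.
Install package -> load with library(package_name)
To check which packages are installed, we use either installed.packages() or library() with no arguments.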
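The install-then-load pattern can be sketched as a small helper. The function name ensure_package is hypothetical (it is not part of R), and the example uses the "tools" package only because it ships with R and so needs no download.

```r
# Install a package only if it is missing, then load it.
# `ensure_package` is a hypothetical helper name, not part of base R.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  # character.only = TRUE because pkg holds the package name as a string.
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

# "tools" ships with R, so this call loads it without installing anything.
ensure_package("tools")

# Check that a package is installed by looking it up in the installed list.
"tools" %in% rownames(installed.packages())
```

For a package such as ggplot2 the call would be ensure_package("ggplot2"), which downloads it from CRAN on first use.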
To update all installed packages, we use update.packages(). To update a single package, run install.packages() on it again.
To check which installed packages have a newer version available : old.packages()
Suppose the updated ggplot2 package does not do what we want and we need to go back to the old version. The first step is to unload the package from the current session : detach("package:ggplot2", unload = TRUE). Note that this only unloads the package; it does not reinstall an older version.
To remove a package completely, we use : remove.packages("package_name")
To check the version of R : version
To see information about the current session (R version, operating system, loaded packages) : sessionInfo()
To see information about a certain package : help(package = package_name)
Note: Inverted commas are not needed here.
To understand how to use a particular package, with worked examples : browseVignettes("package_name")
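A short sketch of the inspection commands, using only what ships with R. The variable names are illustrative, and "stats" stands in for whichever package you care about.

```r
# R version as a single string, e.g. for a report footer.
r_version <- R.version.string
print(r_version)

# Version of one installed package ("stats" ships with R).
stats_version <- packageVersion("stats")
print(stats_version)

# Full session details: R version, platform and attached packages.
session <- sessionInfo()
print(session)

# Packages with a newer version on CRAN (needs internet, so left commented out).
# old.packages()
```

Pasting the output of sessionInfo() into a bug report or at the end of an analysis is a common way to record exactly which versions were used.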
PROJECTS IN R:
Every project in R should have three main folders to store and organise its files :
1. Data
2. Scripts
3. Output
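The folder layout above can be created from R itself. This is a minimal sketch; project_dir points at a temporary location purely for the demo and is an assumption, since in practice it would be the root of your own project.

```r
# Create the three-folder layout inside a project directory.
# `project_dir` is a temporary location for this demo; in practice it
# would be the root folder of your RStudio project.
project_dir <- file.path(tempdir(), "my_project")
folders <- c("Data", "Scripts", "Output")

for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.files(project_dir)
```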
VERSION CONTROL
Version control records every change we make to a project, with a timestamp. This helps when we do not understand or remember a change made in the past and want to go back to it.
Now that we know about R, RStudio and projects, one very important step is to upload the code to GitHub. This helps to store the code and to share it.
In GitHub, a repository is equivalent to a project.
A few Git terms used with GitHub :
Commit : saves the staged changes as a snapshot in the local repository.
Push : uploads local commits to the remote repository on GitHub.
Pull : updates the local copy of the repository with the latest changes from the remote repository.
Staging : the act of preparing a file for a commit.
Branch : a separate line of development within the repository, where changes can be made without affecting the main version.
Three habits that help while updating a repository :
1. Single-issue commits.
2. Informative commit messages.
3. Push and pull often, to keep the local and remote copies in sync.
CONFIGURING GIT WITH R
1. TERMINAL :
Step 1 : git config --global user.name "NAME"
Step 2 : git config --global user.email "xxx@gmail.com"
Then type exit to close the terminal.
Note : Both of these should match the information on your GitHub account.
2. RSTUDIO :
Step 1 : Open RStudio.
Step 2: Go to Tools.
Step 3 : Go to Options.
Step 4 : Git/SVN.
Step 5 : Create RSA key.
Step 6 : View public key and copy.
Step 7 : Close.
3. GITHUB ACCOUNT :
Step 1 : Personal Settings
Step 2 : SSH and GPG keys.
Step 3 : Paste the public key here.
Step 4 : Create a repository.
4. RSTUDIO :
Step 1 : Go to File > New Project > Version Control and select Git.
Step 2 : Enter the repository URL.
Step 3 : Create Project.
Step 4: Create a new file.
Step 5 : Make some changes. For example : print("HELLO WORLD")
Step 6 : Commit.
Step 7 : Push.
5. TO PUT PROJECTS UNDER VERSION CONTROL
Step 1 : git init
Step 2 : git add .
Step 3 : git status (shows the state of the working directory, i.e. which files have been staged and which have not)
Step 4 : git commit -m "Initial Commit"
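The steps above are terminal commands, but they can also be driven from an R session. The sketch below does this in a throwaway folder. It assumes git is installed and on the PATH; run_git is a hypothetical helper, and the file name hello.R is made up for the demo.

```r
# Throwaway project containing a single script.
project_dir <- file.path(tempdir(), "git_demo")
dir.create(project_dir, showWarnings = FALSE)
writeLines('print("HELLO WORLD")', file.path(project_dir, "hello.R"))

# Hypothetical helper: run a git command inside project_dir.
run_git <- function(...) {
  system2("git", c("-C", shQuote(project_dir), ...), stdout = TRUE, stderr = TRUE)
}

staged <- character(0)
if (nzchar(Sys.which("git"))) {            # skip quietly when git is not installed
  run_git("init")                          # Step 1: create the repository
  run_git("add", ".")                      # Step 2: stage every file
  staged <- run_git("status", "--short")   # Step 3: "A" marks a newly staged file
  print(staged)
  # Step 4 would be: run_git("commit", "-m", shQuote("Initial Commit"))
  # It is left out here because it fails until user.name and user.email are set.
}
```

In day-to-day use it is simpler to type the four commands in the terminal; the R wrapper is only there to keep the example in one language.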
R MARKDOWN
R Markdown helps with reproducibility, because the text, the code and its output are kept together in one document.
To install in RStudio : install.packages("rmarkdown")
Tips :
To make text bold : **text**
To make text italic : *text*
Code chunks : open the chunk with ```{r} and close it with ```
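To see these tips together, here is a minimal sketch that writes a small R Markdown file from R. The file name demo_report.Rmd, its title and its contents are all made up for illustration. The chunk delimiter is built with strrep() only so that the three backticks do not have to be typed literally inside the example.

```r
# Three backticks, used to open and close an R code chunk.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  'title: "Demo report"',
  "output: html_document",
  "---",
  "",
  "This is **bold** and this is *italic*.",
  "",
  paste0(fence, "{r}"),   # opens the code chunk
  "summary(cars)",        # R code that runs when the document is rendered
  fence                   # closes the code chunk
)

# Write the document to a temporary location.
rmd_path <- file.path(tempdir(), "demo_report.Rmd")
writeLines(rmd_lines, rmd_path)

# To turn it into HTML (requires the rmarkdown package and pandoc):
# rmarkdown::render(rmd_path)
```

Opening the generated file in RStudio and clicking Knit does the same job as the commented-out render() call.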