Become a Data Scientist

Inspired by nirvacana.com and following the author's map, we are going to create an online book for learning data mining and becoming a data scientist. Below the map you can find links to the articles describing each topic. Topics that have not been described yet are shown with a strikethrough. Additional topics that the author of the map did not mention, but that we think are worth writing about, are marked in bold and blue.

RoadToDataScientist1

Click a topic to learn about it:

  • Fundamentals
    • Matrices & Linear Algebra Fundamentals
    • Hash Functions, Binary Tree, O(n)
    • Relational Algebra, DB Basics
    • Inner, Outer, Cross, Theta Join
    • CAP Theorem
    • Tabular Data
    • Entropy
    • Data Frames & Series
    • Sharding
    • OLAP
    • Multidimensional Data Model
    • ETL
    • Reporting Vs BI Vs Analytics
    • JSON & XML
    • NoSQL
    • Regex
    • Vendor Landscape
    • Env Setup
  • Statistics
    • Pick a Dataset (UCI Repo)
    • Descriptive Statistics (mean, median, range, SD, Var)
    • Exploratory Data Analysis
    • Histograms
    • Percentiles & Outliers
    • Probability Theory
    • Bayes Theorem
    • Random Variables
    • Cumul Dist Fn (CDF)
    • Continuous Distributions (Normal, Poisson, Gaussian)
    • Skewness
    • ANOVA
    • Prob Den Fn (PDF)
    • Central Limit Theorem
    • Monte Carlo Method
    • Hypothesis Testing
    • p-Value
    • Chi-Squared Test
    • Estimation
    • Confid int (CI)
    • MLE
    • Kernel Density Estimate
    • Regression
    • Covariance
    • Correlation
    • Pearson Coeff
    • Causation
    • Least Squares Fit
    • Euclidean Distance
  • Econometrics
  • Programming
    • Python Basics
    • Working in Excel
    • RapidMiner
    • IBM SPSS
    • R Setup & R Studio
    • R Basics
    • Expressions
    • Variables
    • Vectors
    • Matrices
    • Arrays
    • Factors
    • Lists
    • Data Frames
    • Reading CSV Data
    • Reading Raw Data
    • Subsetting Data
    • Manipulate Data Frames
    • Functions
    • Factor Analysis
    • Install Pkgs
  • Machine Learning
    • What is ML?
    • Numerical Var
    • Categorical Var
    • Supervised Learning
    • Unsupervised Learning
    • Concepts, Inputs & Attributes
    • Training & Test Data
    • Classifier
    • Prediction
    • Lift
    • Overfitting
    • Bias & Variance
    • Trees & Classification
    • Classification Rate
    • Decision Rate
    • Boosting
    • Naive Bayes Classifiers
    • K-Nearest Neighbor
    • Regression
      • Logistic Regression
      • Ranking
      • Linear Regression
    • Perceptron
    • Clustering
      • Hierarchical Clustering
      • K-means Clustering
    • Neural Networks
    • Sentiment Analysis
    • Collaborative Filtering
  • Text Mining / Natural Language Processing
    • Tagging
    • Vocabulary Mapping
    • Classify Text
    • Using NLTK
    • Using Weka
    • Using Mahout
    • Feature Extraction
    • Market Basket Analysis
    • Association Rules
    • Support Vector Machines
    • Term Frequency & Weight
    • Term Document Matrix
    • UIMA
    • Text Analysis
    • Named Entity Recognition
    • Corpus
  • Data Visualization
    • Tableau
    • IBM ManyEyes
    • InfoVis
    • D3.js
    • Decision Tree
    • Timeline
    • Survey Plot
    • Spatial Charts
    • Line Charts (Bi)
    • Scatter Plot (Bi)
    • Tree & Tree Map
    • Histogram & Pie (Uni)
    • ggplot2
    • Uni, Bi & Multivariate Viz
    • Data Exploration in R (Hist, Boxplot, etc)
  • Big Data
    • Map Reduce Fundamentals
    • Hadoop Components
    • HDFS
    • Data Replication Principles
    • Setup Hadoop (IBM/Cloudera/HortonWorks)
    • Name & Data Nodes
    • Job & Task Tracker
    • M/R Programming
    • Sqoop: Loading Data in HDFS
    • Flume, Scribe: For Unstruct Data
    • SQL with Pig
    • DWH with Hive
    • Scribe, Chukwa For Weblog
    • Using Mahout
    • Zookeeper, Avro
    • Storm: Hadoop Realtime
    • RHadoop, RHIPE
    • rmr
    • Cassandra
    • MongoDB, Neo4j
  • Data Ingestion
    • Summary of Data Formats
    • Data Discovery
    • Data Source & Acquisition
    • Data Integration
    • Data Fusion
    • Transformation & Enrichment
    • Data Survey
    • Google OpenRefine
    • How much Data?
    • Using ETL
  • Data Munging
    • Principal Component Analysis
    • Stratified Sampling
    • Sampling
    • Denoising
    • Feature Extraction
    • Binning Sparse Values
    • Unbiased Estimators
    • Handling Missing Values
    • Data Scrubbing
    • Normalization
    • Dimensionality & Numerosity Reduction
  • Toolbox
    • MS Excel and Analysis ToolPak
    • Java, Python
    • R, R-Studio, Rattle
    • Weka, KNIME, RapidMiner
    • Hadoop Dist of Choice
    • Spark, Storm
    • Flume, Scribe, Chukwa
    • Nutch, Talend, Scraperwiki
    • Webscraper, Flume, Sqoop
    • tm, RWeka, NLTK
    • RHIPE
    • D3.js, ggplot2, Shiny
    • IBM Languageware
    • Cassandra, MongoDB