The aim of the training "Data Analysis and Relationship Modeling in the R Package" is to explore the basic features of R, a free programming language for statistical computing, and to learn how to organize and manage data entry, carry out primary statistical analysis of data, present results graphically, and find relationships in data. The training is designed for students with no experience in R or with basic knowledge of the package.

It is desirable for participants to have programming skills and to be familiar with the basics of statistical analysis.

Upon completing the training, you will be able to use R to:

  • Correctly form a data sample for analysis
  • Organize data entry and manage data
  • Perform descriptive statistical analysis
  • Explore relationships in contingency (cross) tables
  • Test statistical hypotheses about equality of means
  • Use graphical features
  • Conduct correlation analysis
  • Conduct regression analysis
  • Conduct analysis of variance

Duration of the training: 32 ac.h. (4 days).

Training program:

Topic 1. Basic concepts of statistical data analysis - 2 ac.h.

  • Statistical study
  • Ways to get data
  • The difference between observation and experiment
  • General population and sample
  • Data requirements for sampling
  • The concept of point and interval statistical estimation
  • Features and Variables
  • Variable scales
  • Directions of statistical data analysis
  • Descriptive and analytical statistics
  • The choice of methods of statistical analysis depending on the scales of measurement of variables
  • Statistical hypothesis
  • Types of statistical errors
  • Principles for Testing Statistical Hypotheses
  • Choosing a Significance Level for Hypothesis Testing

Topic 2. Introduction to working in the R environment - 2 ac.h.

  • Features of working with R
  • Program installation
  • Program launch
  • R environment
  • Interface command line and dialog boxes
  • Command rules
  • Creating a working directory
  • Packages
  • Graphical interfaces
  • R as a calculator
  • Help system

Topic 3. Fundamentals of programming in R - 2 ac.h. (an illustrative R sketch follows this list)

  • Types of objects in R
  • Vectors
  • Lists
  • Matrices
  • Factors
  • Data frames (data tables)
  • Expressions
  • Data access operators
  • Functions and arguments
  • Loops and conditional statements
  • Database management in R
  • Vectorization of operations
  • Debugging
  • Object-oriented programming
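
A minimal illustrative sketch of these object types and constructs (not part of the official course materials; all names are examples):

# Core R object types
v <- c(1, 2, 3)                      # vector
l <- list(name = "x", values = v)    # list
m <- matrix(1:6, nrow = 2)           # matrix
f <- factor(c("a", "b", "a"))        # factor
df <- data.frame(id = 1:3, val = v)  # data frame (data table)

df$val[2]                            # data access operators: $, [ ]

# A function with an argument, a loop, and its vectorized equivalent
square_all <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
square_all(v)                        # same result as the vectorized v^2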

Topic 4. Entering and organizing data in R - 2 ac.h. (an illustrative R sketch follows this list)

  • Ways to download data
  • Direct data entry
  • Entering data in a table
  • Import data from MS Excel
  • Importing data from other statistical packages and databases
  • Saving analysis results
  • Specifying quantitative data
  • Specifying ordinal and nominal data
  • Specifying missing values in data
  • Identification of outliers and errors
  • Principles of data transformation
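
A minimal illustrative sketch of these steps (the file and column names are hypothetical, not from the course):

# Import a text file, marking empty strings as missing values
df <- read.csv("survey.csv", na.strings = c("", "NA"))
# Excel files can be imported with an add-on package, e.g. readxl::read_excel()

# Specify ordinal data and flag an impossible value as missing
df$group <- factor(df$group, levels = c("low", "mid", "high"), ordered = TRUE)
df$age[df$age > 120] <- NA

boxplot(df$age)                                       # quick visual check for outliers
write.csv(df, "survey_clean.csv", row.names = FALSE)  # save results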

Topic 5. Graphical capabilities of R - 2 ac.h. (an illustrative R sketch follows this list)

  • Graphic functions
  • Graphics Devices
  • Graphics Options
  • Interactive graphics
  • Composite images
  • Output devices
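
A minimal illustrative sketch of base R graphics (not part of the official course materials):

# Open a graphics device, draw a composite image, close the device
png("plots.png", width = 800, height = 400)
par(mfrow = c(1, 2))                 # two panels side by side
x <- rnorm(100)
hist(x, main = "Histogram")
plot(x, type = "l", main = "Line plot", xlab = "Index", ylab = "Value")
dev.off()                            # write the file to disk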

Topic 6. Descriptive statistical analysis in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Central tendency statistics
  • Arithmetic mean
  • Modal value (mode)
  • Median value
  • Scatter statistics
  • Variance and standard deviation
  • The coefficient of variation
  • Percentiles
  • Histograms
  • Boxplots
  • Z-transform
  • Normal distribution law
  • Asymmetry and kurtosis
  • Checking the distribution for normality
  • Some laws of distribution
  • Binomial distribution
  • Poisson distribution
  • Uniform distribution
  • Exponential Distribution
  • Lognormal distribution
  • Standard error and interval for mean
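
A minimal illustrative sketch of these statistics on a simulated sample (not part of the official course materials):

x <- rnorm(200, mean = 50, sd = 10)      # simulated sample

mean(x); median(x)                       # central tendency
var(x); sd(x)                            # scatter statistics
sd(x) / mean(x)                          # coefficient of variation
quantile(x, c(0.25, 0.5, 0.75))          # percentiles

z <- scale(x)                            # z-transformation
hist(x); boxplot(x)                      # histogram and boxplot
shapiro.test(x)                          # one common check for normality
t.test(x)$conf.int                       # interval estimate for the mean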

Topic 7. Forming a data sample for analysis - 2 ac.h.

  • Population and sample
  • Sample Characteristics
  • Features of the sampling method of research
  • Sample classification
  • Types and methods of probabilistic selection
  • Sampling methods
  • Simple random selection
  • Systematic random selection
  • Cluster selection
  • Single-stage cluster selection
  • Multistage cluster selection
  • Algorithm for conducting sample surveys
  • Determination of the required sample size

Topic 8. Statistical tests for detecting differences between samples in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Hypotheses about comparing means
  • Z-test for comparing means
  • Z-test for comparing proportions
  • One-sample t-test
  • T-test for independent samples
  • T-test for dependent samples
  • Conditions for applying nonparametric tests
  • One-sample Wilcoxon signed-rank test
  • Mann-Whitney test
  • Sign test for related samples
  • Wilcoxon signed-rank test for related samples
  • Kruskal-Wallis nonparametric ANOVA
  • Friedman test for dependent samples
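
A minimal illustrative sketch of these tests on simulated data (not part of the official course materials):

g1 <- rnorm(30, mean = 10); g2 <- rnorm(30, mean = 11)

t.test(g1, mu = 10)                  # one-sample t-test
t.test(g1, g2)                       # t-test for independent samples
t.test(g1, g2, paired = TRUE)        # t-test for dependent samples

wilcox.test(g1, mu = 10)             # one-sample Wilcoxon signed-rank test
wilcox.test(g1, g2)                  # Mann-Whitney test
wilcox.test(g1, g2, paired = TRUE)   # Wilcoxon test for related samples

kruskal.test(list(g1, g2))           # Kruskal-Wallis nonparametric ANOVA
friedman.test(matrix(rnorm(30), ncol = 3))  # Friedman test (blocks in rows)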

Topic 9. Evaluating relationships between variables in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Analysis of the relationship between categorical variables
  • Contingency tables
  • Expected frequencies and residuals in contingency tables
  • Chi-square test
  • Goodness-of-fit tests
  • Classification of types of relationship between quantitative variables
  • Scatterplots
  • Prerequisites and conditions for conducting correlation analysis
  • Pearson correlation coefficient
  • Rank correlation coefficients
  • Spearman's correlation coefficient
  • Checking the Significance of a Relationship
  • Interval estimates of correlation coefficients
  • Partial correlation coefficients
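
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

tbl <- table(mtcars$cyl, mtcars$am)  # contingency table
chisq.test(tbl)                      # chi-square test of independence

plot(mtcars$wt, mtcars$mpg)          # scatterplot of the two variables
cor.test(mtcars$wt, mtcars$mpg)      # Pearson coefficient, significance, CI
cor.test(mtcars$wt, mtcars$mpg, method = "spearman")  # rank correlation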

Topic 10. Modeling the form of a relationship using regression analysis in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Basic concepts of regression analysis
  • Simple (paired) and multiple linear regression models
  • Assumptions of linear regression analysis
  • Estimation of regression coefficients
  • Checking the validity of the regression model
  • Significance of the regression equation
  • Significance of regression coefficients
  • Selection of variables in regression analysis
  • Estimating the accuracy of the regression equation
  • Estimation of the statistical stability of the regression equation
  • Point and interval estimation of the dependent variable
  • Nonlinear Regression Models
  • Categorical explanatory variables in a regression model
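
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # multiple linear regression
summary(fit)        # significance of the equation and of the coefficients
confint(fit)        # interval estimates of the regression coefficients

new_car <- data.frame(wt = 3, hp = 120)
predict(fit, new_car, interval = "prediction")  # point and interval estimate

fit2 <- lm(mpg ~ wt + factor(cyl), data = mtcars)  # categorical predictor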

Topic 11. Analysis of variance in R - 4 ac.h. (an illustrative R sketch follows this list)

  • ANOVA Models
  • Prerequisites for the use of analysis of variance
  • Testing the hypothesis of equality of variances
  • One-Way ANOVA Model
  • One-Way ANOVA Table
  • Assessment of the degree of influence of the factor
  • Post hoc tests for paired comparisons
  • Analysis of variance with two or more factors
  • Two-Way ANOVA Table with Interaction
  • Graphical interpretation of the interaction of factors
  • Analysis of a multivariate model
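
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

bartlett.test(mpg ~ factor(cyl), data = mtcars)  # equality of variances

fit <- aov(mpg ~ factor(cyl), data = mtcars)     # one-way ANOVA model
summary(fit)                                     # the ANOVA table
TukeyHSD(fit)                                    # post hoc paired comparisons

fit2 <- aov(mpg ~ factor(cyl) * factor(am), data = mtcars)  # two factors with interaction
summary(fit2)
interaction.plot(mtcars$cyl, mtcars$am, mtcars$mpg)  # graphical interpretation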

Data analysis in the R environment

Institute of Computational Mathematics and Information Technology, Department of Data Analysis and Operations Research


Direction: 01.03.02 "Applied Mathematics and Informatics. System Programming" (bachelor's degree, 3rd year)

Discipline: "Data analysis in the R environment"

Academic plan: "Full-time education, 2017"

Number of hours: 90 (including: lectures - 18, laboratory classes - 36, independent work - 36); form of assessment - pass/fail credit.

Direction: 38.03.05 "Business Informatics" (bachelor's degree, 4th year)

Discipline: "Data analysis"

Academic plan: "Full-time education, 2018"

Number of hours: 78 (including: lectures - 18, laboratory classes - 36, independent work - 24); form of assessment - pass/fail credit.


Keywords: Data Mining, Machine Learning, regression, classification, clustering, support vectors, SVM, artificial neuron, neural network, recommendation system, data analysis, model, sample, response variable, training sample, overfitting, supervised learning, unsupervised learning, R package, R programming language, statistics, random variable, r.v., distribution law, normal distribution, sampling, maximum likelihood method, Chi-square distribution, Student's distribution, Fisher's distribution, hypothesis, hypothesis acceptance region, significance level, type I and type II errors, sample comparison, goodness of fit, contingency table, correlation, linear regression, non-linear regression, factor, predictor, one-factor regression, multiple regression, logistic regression, discriminant analysis, Bayesian approach, naive Bayes, support vector machine, separating hyperplane, decision trees, neuron, activation function, recommender system, quality functional.

Topics:
1. The R development environment: historical background; installing and running the package.
2. Programming in R: first steps.
3. Graphing in the R environment.
4. Data entry and working with files in the R environment.
   4.1. Working with one-dimensional data arrays.
   4.2. Working with matrices and data tables.
5. Testing statistical hypotheses in the R environment.
   5.1. Testing a hypothesis about the probability distribution law of a random variable (Pearson's Chi-square test).
   5.2. Testing a hypothesis about the independence of features with qualitative grouping (Pearson's Chi-square test).
   5.3. Testing a hypothesis about the equality of means of normal populations (Student's test).
   5.4. Testing a hypothesis about the equality of variances of normal populations (Fisher's test).
6. Building a one-factor linear regression model. Forecasting.
7. Multiple linear regression.
   7.1. One-factor linear regression as a special case of multiple regression.
   7.2. Investigating the dependence of the response variable on a factor in the regression model.
8. The classification problem and approaches to solving it.
   8.1. Logistic regression.
   8.2. Linear discriminant analysis.
   8.3. Decision trees: the "divide and conquer" principle.
9. Neural networks and their application in machine learning.
10. Support vectors and support vector machines (SVM) in machine learning.
11. Recommender systems: their purpose, construction, and application.
12. Special topics in machine learning.


Course start date: September 1, 2014
  • Missarov Mukadas Dmukhtasibovich, Department of Data Analysis and Operations Research, KFU; Doctor of Physical and Mathematical Sciences, Professor; email: [email protected]
  • Kashina Olga Andreevna, Candidate of Physical and Mathematical Sciences, Associate Professor of the Department of Data Analysis and Operations Research, email: [email protected]

Introduction

First of all, let's discuss the terminology. We are talking about an area that is called Data Mining in Western literature and is often translated into Russian as "data analysis". The term is not entirely apt, since the word "analysis" in mathematics is quite familiar, has an established meaning, and appears in the names of many classical fields: mathematical analysis, functional analysis, convex analysis, non-standard analysis, multivariate complex analysis, discrete analysis, stochastic analysis, quantum analysis, etc. In all these fields of science, one studies a mathematical apparatus that rests on some fundamental results and makes it possible to solve problems in the field. In data analysis the situation is much more complicated. It is, first of all, an applied science with no such apparatus, in the sense that there is no finite set of basic facts from which it follows how to solve problems. Many problems are "individual", and more and more new classes of problems keep appearing for which a mathematical apparatus has to be developed. An even bigger factor is that data analysis is a relatively new direction in science.

Next, it is necessary to explain what "data analysis" is. I called it "an area", but an area of what? This is where the fun begins, because it is not only a field of science. A real analyst solves, first of all, applied problems and is oriented toward practice. Moreover, data needs to be analyzed in economics, biology, sociology, psychology, etc. Solving new problems, as I said, requires inventing new techniques (not always theories; sometimes just tricks and methods), so some say that data analysis is also an art and a craft.

In applied areas, the most important thing is practice! It is impossible to imagine a surgeon who has not performed a single operation; that is not a surgeon at all. Likewise, a data analyst cannot do without solving real applied problems. The more such problems you solve on your own, the more qualified a specialist you become.

First, data analysis is practice, practice, and more practice. You need to solve many real problems from different areas. The classification of signals and the classification of texts, for example, are two completely different areas: experts who can easily build an engine-diagnostics algorithm from sensor signals may be unable to make a simple email spam filter. It is very desirable to acquire basic skills in working with different kinds of objects: signals, texts, images, graphs, feature descriptions, etc. It will also let you choose problems to your liking.

Secondly, it is important to choose the right training courses and mentors.

Basically, you can learn everything yourself. After all, we are not dealing with an area whose secrets are passed from mouth to mouth. On the contrary, there are many competent training courses, program source codes, and datasets. In addition, it is very useful when several people solve the same problem in parallel, because such problems involve very specific programming. Let's say your algorithm gave 89% correct answers. Question: is that a lot or a little? If it's not enough, then what's the matter: did you program the algorithm incorrectly, choose the wrong parameters, or is the algorithm itself bad and unsuitable for this problem? If the work is duplicated, errors in the program and incorrect parameters can be found quickly. And if it is duplicated by a specialist, then questions of evaluating the result and the acceptability of the model are also resolved quickly.
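
To make the "89%" question concrete, here is a small hedged sketch in R (the labels and numbers are invented for illustration): accuracy only means something next to a baseline, such as always guessing the majority class.

set.seed(1)
actual    <- sample(c("spam", "ham"), 1000, replace = TRUE, prob = c(0.88, 0.12))
predicted <- ifelse(runif(1000) < 0.95, actual, "spam")  # a noisy classifier

mean(predicted == actual)             # accuracy: looks high in isolation...
max(table(actual)) / length(actual)   # ...but the majority baseline is high too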

Thirdly, it is useful to remember that solving a data analysis problem takes a lot of time.

Statistics

Data analysis in R

1. Variables

R, like all other programming languages, has variables. What is a variable? In essence, it is an address by which we can find some data that we store in memory.

A variable assignment consists of left and right parts separated by an assignment operator. In R, the assignment operator is "<-": the variable name is on the left and the value stored in memory is on the right, and it is analogous to "=" in other programming languages. Unlike other languages, the stored value may also be placed to the left of the assignment operator with the variable name on the right; in that case, as you might guess, the operator takes the form "->".
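
A small sketch of both assignment directions:

x <- 10    # value on the right, name on the left (the conventional form)
10 -> y    # value on the left, name on the right
x = 10     # "=" also works at the top level, as in other languages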

Depending on the stored data, variables can be of various types: integer, real, or string. For example:

my.var1 <- 42
my.var2 <- 35.25

In this case, my.var1 stores a whole number and my.var2 a real number. Note that R actually stores both as numeric (double) by default; to get a true integer you need the L suffix, as shown in the sketch below.
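
You can check for yourself how R stores a value (a small illustrative sketch):

class(my.var1); typeof(my.var1)   # "numeric", "double": 42 is not stored as integer
my.int <- 42L                     # the L suffix forces integer type
class(my.int)                     # "integer"
my.str <- "text"
class(my.str)                     # "character" (string)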

Just like in other programming languages, you can perform various arithmetic operations with variables.

my.var1 + my.var2 - 12

my.var3 <- my.var1^2 + my.var2^2

In addition to arithmetic operations, you can perform logical operations, that is, comparison operations.

my.var3 > 200
my.var3 > 3009
my.var1 == my.var2
my.var1 != my.var2
my.var3 >= 200
my.var3 <= 200

The result of a logical operation is a truth value: TRUE or FALSE. Logical operations can also compare a variable not just with a literal value but with another variable, and the result can itself be saved in a new variable.

my.new.var <- my.var1 == my.var2

Random Forest is one of my favorite data mining algorithms. Firstly, it is incredibly versatile: it can be used for both regression and classification problems, as well as for anomaly detection and predictor selection. Secondly, it is an algorithm that is genuinely hard to apply incorrectly, simply because, unlike other algorithms, it has few tunable parameters. And yet it is surprisingly simple in essence, and at the same time remarkably accurate.

What is the idea behind such a wonderful algorithm? The idea is simple: suppose we have some very weak learner, say a decision tree. If we build many different models using this weak learner and average their predictions, the final result will be much better. This is so-called ensemble learning in action, and it is why the algorithm is called "Random Forest": from the given data it builds many decision trees and then averages their predictions. An important point is the element of randomness in the creation of each tree: clearly, if we created many identical trees, averaging them would give only the accuracy of a single tree.

How does it work? Suppose we have some input data. Each column corresponds to some parameter, each row corresponds to some data element.

We can choose, at random, a number of columns and rows from the entire dataset and build a decision tree from them.
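
As an illustration (this is a sketch using the randomForest package and its built-in iris data, not code from this post), the whole idea fits in a few lines:

library(randomForest)   # assumes the randomForest package is installed

set.seed(42)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,   # number of trees in the forest
                    mtry = 2)      # columns sampled at random at each split
print(fit)                         # out-of-bag error estimate
predict(fit, iris[1:3, ])          # predictions averaged (voted) over trees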



Thursday, January 12, 2012


That's it, actually. The 17-hour flight is over, and Russia is now across the ocean. Outside the window of a cozy two-bedroom apartment lies San Francisco: the famous Silicon Valley, California, USA. Yes, this is the very reason I haven't written much lately. We moved.

It all started back in April 2011, when I had a phone interview with Zynga. At the time it all seemed like some kind of game that had nothing to do with reality, and I could not even imagine where it would lead. In June 2011, Zynga came to Moscow and conducted a series of interviews: about 60 candidates who had passed the phone screen were considered, and about 15 people were selected (I don't know the exact number; some changed their minds later, some refused right away). The interviews turned out to be surprisingly easy. No programming tasks, no tricky questions about the shape of manhole covers; mainly they tested your ability to hold a conversation. Knowledge, in my opinion, was assessed only superficially.

And then the rigmarole began. First we waited for the results, then the offer, then LCA approval, then approval of the visa petition, then the documents from the USA, then the queue at the embassy, then an additional check, then the visa. At times it seemed I was ready to drop the whole thing. At times I doubted whether we needed this America at all, since Russia isn't bad either. The whole process took about half a year; in mid-December we finally received our visas and began to prepare for departure.

Monday was my first day at the new job. The office has everything you need not only to work but to live: breakfasts, lunches, and dinners from in-house chefs, all sorts of food tucked into every corner, a gym, massage, and even a hairdresser, all completely free for employees. Many people bike to work, and several rooms are equipped for storing bicycles. I have never seen anything like it in Russia. Everything has its price, however: we were warned right away that we would have to work a lot. What "a lot" means by their standards is not yet clear to me.

I hope, however, that despite the amount of work, I will be able to resume blogging in the foreseeable future and perhaps share something about American life and working as a programmer in America. Time will tell. In the meantime, I wish you all a Merry Christmas and a Happy New Year, and see you soon!


For an example of use, let's compute the dividend yield of Russian companies. As the base price, we take the stock's closing price on the day the register closes. For some reason this information is not available on the Troika website, and it is much more interesting than the absolute dividend values.
Attention! The code takes a long time to execute because, for each stock, it has to query the Finam servers and fetch the quotes.

# Assumes the rusquant package, which adds the "Finam" source to getSymbols()
library(rusquant)

result <- NULL
for (i in 1:nrow(divs)) {
  d <- divs[i, ]                     # one stock per iteration
  if (d$Divs > 0) {
    try({
      quotes <- getSymbols(d$Symbol, src = "Finam",
                           from = "2010-01-01", auto.assign = FALSE)
      if (!any(is.nan(quotes))) {
        # closing price on the registry-close date, as described above
        price <- Cl(quotes)[as.character(d$RegistryDate)]
        if (length(price) > 0) {     # skip if that day was not a trading day
          dd <- d$Divs
          result <- rbind(result,
                          data.frame(d$Symbol, d$Name, d$RegistryDate,
                                     as.numeric(dd) / as.numeric(price),
                                     stringsAsFactors = FALSE))
        }
      }
    }, silent = TRUE)
  }
}
colnames(result) <- c("Symbol", "Name", "RegistryDate", "Divs")
result


Similarly, you can build statistics for past years.