The aim of the training "Data Analysis and Relationship Modeling in the R Package" is to explore the basic features of R, a free programming language for statistical computing, and to learn how to organize and manage data entry, carry out primary statistical analysis of data, present results graphically, and find relationships in data. The training is designed for students with no experience in R or with basic knowledge of the package.

It is desirable for participants to have programming skills and to be familiar with the basics of statistical analysis.

Upon completing the training, you will be able to use R to:

  • Correctly form a data sample for analysis
  • Organize data entry and manage data
  • Perform descriptive statistical analysis
  • Explore relationships in contingency (cross) tables
  • Test statistical hypotheses about equality of means
  • Use graphical features
  • Conduct correlation analysis
  • Conduct regression analysis
  • Conduct analysis of variance

Duration of the training: 32 ac.h. (4 days).

Training program:

Topic 1. Basic concepts of statistical data analysis - 2 ac.h.

  • Statistical study
  • Ways to get data
  • The difference between observation and experiment
  • General population and sample
  • Data requirements for sampling
  • The concept of point and interval statistical estimation
  • Features and Variables
  • Variable scales
  • Directions of statistical data analysis
  • Descriptive and analytical statistics
  • The choice of methods of statistical analysis depending on the scales of measurement of variables
  • Statistical hypothesis
  • Types of statistical errors
  • Principles for Testing Statistical Hypotheses
  • Choosing a Significance Level for Hypothesis Testing

Topic 2. Introduction to working in the R environment - 2 ac.h.

  • Features of working with R
  • Program installation
  • Program launch
  • R environment
  • Interface command line and dialog boxes
  • Command rules
  • Creating a working directory
  • Packages
  • Graphical interfaces
  • R as a calculator
  • Help system

Topic 3. Fundamentals of programming in R - 2 ac.h. (an illustrative R sketch follows this list)

  • Types of objects in R
  • Vectors
  • Lists
  • Matrices
  • Factors
  • Data frames (data tables)
  • Expressions
  • Data access operators
  • Functions and arguments
  • Loops and conditional statements
  • Database management in R
  • Vectorization of operations
  • Debugging
  • Object-oriented programming
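
A minimal illustrative sketch of these object types and constructs (not part of the official course materials; all names are examples):

# Core R object types
v <- c(1, 2, 3)                      # vector
l <- list(name = "x", values = v)    # list
m <- matrix(1:6, nrow = 2)           # matrix
f <- factor(c("a", "b", "a"))        # factor
df <- data.frame(id = 1:3, val = v)  # data frame (data table)

df$val[2]                            # data access operators: $, [ ]

# A function with an argument, a loop, and its vectorized equivalent
square_all <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
square_all(v)                        # same result as the vectorized v^2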

Topic 4. Entering and organizing data in R - 2 ac.h. (an illustrative R sketch follows this list)

  • Ways to download data
  • Direct data entry
  • Entering data in a table
  • Import data from MS Excel
  • Importing data from other statistical packages and databases
  • Saving analysis results
  • Specifying quantitative data
  • Specifying ordinal and nominal data
  • Specifying missing values in data
  • Identification of outliers and errors
  • Principles of data transformation
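
A minimal illustrative sketch of these steps (the file and column names are hypothetical, not from the course):

# Import a text file, marking empty strings as missing values
df <- read.csv("survey.csv", na.strings = c("", "NA"))
# Excel files can be imported with an add-on package, e.g. readxl::read_excel()

# Specify ordinal data and flag an impossible value as missing
df$group <- factor(df$group, levels = c("low", "mid", "high"), ordered = TRUE)
df$age[df$age > 120] <- NA

boxplot(df$age)                                       # quick visual check for outliers
write.csv(df, "survey_clean.csv", row.names = FALSE)  # save results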

Topic 5. Graphical capabilities of R - 2 ac.h. (an illustrative R sketch follows this list)

  • Graphic functions
  • Graphics Devices
  • Graphics Options
  • Interactive graphics
  • Composite images
  • Output devices
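
A minimal illustrative sketch of base R graphics (not part of the official course materials):

# Open a graphics device, draw a composite image, close the device
png("plots.png", width = 800, height = 400)
par(mfrow = c(1, 2))                 # two panels side by side
x <- rnorm(100)
hist(x, main = "Histogram")
plot(x, type = "l", main = "Line plot", xlab = "Index", ylab = "Value")
dev.off()                            # write the file to disk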

Topic 6. Descriptive statistical analysis in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Central tendency statistics
  • Arithmetic mean
  • Modal value (mode)
  • Median value
  • Scatter statistics
  • Variance and standard deviation
  • The coefficient of variation
  • Percentiles
  • Histograms
  • Boxplots
  • Z-transform
  • Normal distribution law
  • Asymmetry and kurtosis
  • Checking the distribution for normality
  • Some laws of distribution
  • Binomial distribution
  • Poisson distribution
  • Uniform distribution
  • Exponential Distribution
  • Lognormal distribution
  • Standard error and interval for mean
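
A minimal illustrative sketch of these statistics on a simulated sample (not part of the official course materials):

x <- rnorm(200, mean = 50, sd = 10)      # simulated sample

mean(x); median(x)                       # central tendency
var(x); sd(x)                            # scatter statistics
sd(x) / mean(x)                          # coefficient of variation
quantile(x, c(0.25, 0.5, 0.75))          # percentiles

z <- scale(x)                            # z-transformation
hist(x); boxplot(x)                      # histogram and boxplot
shapiro.test(x)                          # one common check for normality
t.test(x)$conf.int                       # interval estimate for the mean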

Topic 7. Forming a data sample for analysis - 2 ac.h.

  • Population and sample
  • Sample Characteristics
  • Features of the sampling method of research
  • Sample classification
  • Types and methods of probabilistic selection
  • Sampling methods
  • Simple random selection
  • Systematic random selection
  • Cluster selection
  • Single-stage cluster selection
  • Multistage cluster selection
  • Algorithm for conducting sample surveys
  • Determination of the required sample size

Topic 8. Statistical tests for detecting differences between samples in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Hypotheses about comparing means
  • Z-test for comparing means
  • Z-test for comparing proportions
  • One-sample t-test
  • T-test for independent samples
  • T-test for dependent samples
  • Conditions for applying nonparametric tests
  • One-sample Wilcoxon signed-rank test
  • Mann-Whitney test
  • Sign test for related samples
  • Wilcoxon signed-rank test for related samples
  • Kruskal-Wallis nonparametric ANOVA
  • Friedman test for dependent samples
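
A minimal illustrative sketch of these tests on simulated data (not part of the official course materials):

g1 <- rnorm(30, mean = 10); g2 <- rnorm(30, mean = 11)

t.test(g1, mu = 10)                  # one-sample t-test
t.test(g1, g2)                       # t-test for independent samples
t.test(g1, g2, paired = TRUE)        # t-test for dependent samples

wilcox.test(g1, mu = 10)             # one-sample Wilcoxon signed-rank test
wilcox.test(g1, g2)                  # Mann-Whitney test
wilcox.test(g1, g2, paired = TRUE)   # Wilcoxon test for related samples

kruskal.test(list(g1, g2))           # Kruskal-Wallis nonparametric ANOVA
friedman.test(matrix(rnorm(30), ncol = 3))  # Friedman test (blocks in rows)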

Topic 9. Evaluating relationships between variables in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Analysis of the relationship between categorical variables
  • Contingency tables
  • Expected frequencies and residuals in contingency tables
  • Chi-square test
  • Goodness-of-fit tests
  • Classification of types of relationship between quantitative variables
  • Scatterplots
  • Prerequisites and conditions for conducting correlation analysis
  • Pearson correlation coefficient
  • Rank correlation coefficients
  • Spearman's correlation coefficient
  • Checking the Significance of a Relationship
  • Interval estimates of correlation coefficients
  • Partial correlation coefficients
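
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

tbl <- table(mtcars$cyl, mtcars$am)  # contingency table
chisq.test(tbl)                      # chi-square test of independence

plot(mtcars$wt, mtcars$mpg)          # scatterplot of the two variables
cor.test(mtcars$wt, mtcars$mpg)      # Pearson coefficient, significance, CI
cor.test(mtcars$wt, mtcars$mpg, method = "spearman")  # rank correlation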

Topic 10. Modeling the form of a relationship using regression analysis in R - 4 ac.h. (an illustrative R sketch follows this list)

  • Basic concepts of regression analysis
  • Simple (paired) and multiple linear regression models
  • Assumptions of linear regression analysis
  • Estimation of regression coefficients
  • Checking the validity of the regression model
  • Significance of the regression equation
  • Significance of regression coefficients
  • Selection of variables in regression analysis
  • Estimating the accuracy of the regression equation
  • Estimation of the statistical stability of the regression equation
  • Point and interval estimation of the dependent variable
  • Nonlinear Regression Models
  • Categorical explanatory variables in a regression model
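
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # multiple linear regression
summary(fit)        # significance of the equation and of the coefficients
confint(fit)        # interval estimates of the regression coefficients

new_car <- data.frame(wt = 3, hp = 120)
predict(fit, new_car, interval = "prediction")  # point and interval estimate

fit2 <- lm(mpg ~ wt + factor(cyl), data = mtcars)  # categorical predictor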

Topic 11. Analysis of variance in R - 4 ac.h. (an illustrative R sketch follows this list)

  • ANOVA Models
  • Prerequisites for the use of analysis of variance
  • Testing the hypothesis of equality of variances
  • One-Way ANOVA Model
  • One-Way ANOVA Table
  • Assessment of the degree of influence of the factor
  • Post hoc tests for paired comparisons
  • Analysis of variance with two or more factors
  • Two-Way ANOVA Table with Interaction
  • Graphical interpretation of the interaction of factors
  • Analysis of a multivariate model
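
A minimal illustrative sketch using R's built-in mtcars dataset (not part of the official course materials):

bartlett.test(mpg ~ factor(cyl), data = mtcars)  # equality of variances

fit <- aov(mpg ~ factor(cyl), data = mtcars)     # one-way ANOVA model
summary(fit)                                     # the ANOVA table
TukeyHSD(fit)                                    # post hoc paired comparisons

fit2 <- aov(mpg ~ factor(cyl) * factor(am), data = mtcars)  # two factors with interaction
summary(fit2)
interaction.plot(mtcars$cyl, mtcars$am, mtcars$mpg)  # graphical interpretation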

Data analysis in the R environment

Institute of Computational Mathematics and Information Technology, Department of Data Analysis and Operations Research


Direction: 01.03.02 "Applied Mathematics and Informatics. System Programming" (bachelor's degree, 3rd year)

Discipline: "Data analysis in the R environment"

Academic plan: "Full-time education, 2017"

Number of hours: 90 (including: lectures - 18, laboratory classes - 36, independent work - 36); form of assessment - pass/fail credit.

Direction: 38.03.05 "Business Informatics" (bachelor's degree, 4th year)

Discipline: "Data analysis"

Academic plan: "Full-time education, 2018"

Number of hours: 78 (including: lectures - 18, laboratory classes - 36, independent work - 24); form of assessment - pass/fail credit.


Keywords: Data Mining, Machine Learning, regression, classification, clustering, support vectors, SVM, artificial neuron, neural network, recommendation system, data analysis, model, sample, response variable, training sample, overfitting, supervised learning, unsupervised learning, R package, R programming language, statistics, random variable, r.v., distribution law, normal distribution, sampling, maximum likelihood method, Chi-square distribution, Student's distribution, Fisher's distribution, hypothesis, hypothesis acceptance region, significance level, type I and type II errors, sample comparison, goodness of fit, contingency table, correlation, linear regression, non-linear regression, factor, predictor, one-factor regression, multiple regression, logistic regression, discriminant analysis, Bayesian approach, naive Bayes, support vector machine, separating hyperplane, decision trees, neuron, activation function, recommender system, quality functional.

Topics:
1. The R development environment: historical background; installing and running the package.
2. Programming in R: first steps.
3. Graphing in the R environment.
4. Data entry and working with files in the R environment.
   4.1. Working with one-dimensional data arrays.
   4.2. Working with matrices and data tables.
5. Testing statistical hypotheses in the R environment.
   5.1. Testing a hypothesis about the probability distribution law of a random variable (Pearson's Chi-square test).
   5.2. Testing a hypothesis about the independence of features with qualitative grouping (Pearson's Chi-square test).
   5.3. Testing a hypothesis about the equality of means of normal populations (Student's test).
   5.4. Testing a hypothesis about the equality of variances of normal populations (Fisher's test).
6. Building a one-factor linear regression model. Forecasting.
7. Multiple linear regression.
   7.1. One-factor linear regression as a special case of multiple regression.
   7.2. Investigating the dependence of the response variable on a factor in the regression model.
8. The classification problem and approaches to solving it.
   8.1. Logistic regression.
   8.2. Linear discriminant analysis.
   8.3. Decision trees: the "divide and conquer" principle.
9. Neural networks and their application in machine learning.
10. Support vectors and support vector machines (SVM) in machine learning.
11. Recommender systems: their purpose, construction, and application.
12. Special topics in machine learning.


Course start date: September 1, 2014
  • Missarov Mukadas Dmukhtasibovich, Department of Data Analysis and Operations Research, KFU; Doctor of Physical and Mathematical Sciences, Professor; email: [email protected]
  • Kashina Olga Andreevna, Candidate of Physical and Mathematical Sciences, Associate Professor of the Department of Data Analysis and Operations Research, email: [email protected]

Introduction

First of all, let's discuss the terminology. We are talking about an area that is called Data Mining in Western literature and is often translated into Russian as "data analysis". The term is not entirely apt, since the word "analysis" in mathematics is quite familiar, has an established meaning, and appears in the names of many classical fields: mathematical analysis, functional analysis, convex analysis, non-standard analysis, multivariate complex analysis, discrete analysis, stochastic analysis, quantum analysis, etc. In all these fields of science, one studies a mathematical apparatus that rests on some fundamental results and makes it possible to solve problems in the field. In data analysis the situation is much more complicated. It is, first of all, an applied science with no such apparatus, in the sense that there is no finite set of basic facts from which it follows how to solve problems. Many problems are "individual", and more and more new classes of problems keep appearing for which a mathematical apparatus has to be developed. An even bigger factor is that data analysis is a relatively new direction in science.

Next, it is necessary to explain what "data analysis" is. I called it "an area", but an area of what? This is where the fun begins, because it is not only a field of science. A real analyst solves, first of all, applied problems and is oriented toward practice. Moreover, data needs to be analyzed in economics, biology, sociology, psychology, etc. Solving new problems, as I said, requires inventing new techniques (not always theories; sometimes just tricks and methods), so some say that data analysis is also an art and a craft.

In applied areas, the most important thing is practice! It is impossible to imagine a surgeon who has not performed a single operation; that is not a surgeon at all. Likewise, a data analyst cannot do without solving real applied problems. The more such problems you solve on your own, the more qualified a specialist you become.

First, data analysis is practice, practice, and more practice. You need to solve many real problems from different areas. The classification of signals and the classification of texts, for example, are two completely different areas: experts who can easily build an engine-diagnostics algorithm from sensor signals may be unable to make a simple email spam filter. It is very desirable to acquire basic skills in working with different kinds of objects: signals, texts, images, graphs, feature descriptions, etc. It will also let you choose problems to your liking.

Secondly, it is important to choose the right training courses and mentors.

Basically, you can learn everything yourself. After all, we are not dealing with an area whose secrets are passed from mouth to mouth. On the contrary, there are many competent training courses, program source codes, and datasets. In addition, it is very useful when several people solve the same problem in parallel, because such problems involve very specific programming. Let's say your algorithm gave 89% correct answers. Question: is that a lot or a little? If it's not enough, then what's the matter: did you program the algorithm incorrectly, choose the wrong parameters, or is the algorithm itself bad and unsuitable for this problem? If the work is duplicated, errors in the program and incorrect parameters can be found quickly. And if it is duplicated by a specialist, then questions of evaluating the result and the acceptability of the model are also resolved quickly.
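
To make the "89%" question concrete, here is a small hedged sketch in R (the labels and numbers are invented for illustration): accuracy only means something next to a baseline, such as always guessing the majority class.

set.seed(1)
actual    <- sample(c("spam", "ham"), 1000, replace = TRUE, prob = c(0.88, 0.12))
predicted <- ifelse(runif(1000) < 0.95, actual, "spam")  # a noisy classifier

mean(predicted == actual)             # accuracy: looks high in isolation...
max(table(actual)) / length(actual)   # ...but the majority baseline is high too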

Thirdly, it is useful to remember that solving a data analysis problem takes a lot of time.

Statistics

Data analysis in R

1. Variables

R, like all other programming languages, has variables. What is a variable? In essence, it is an address by which we can find some data that we store in memory.

A variable assignment consists of left and right parts separated by an assignment operator. In R, the assignment operator is "<-": the variable name is on the left and the value stored in memory is on the right, and it is analogous to "=" in other programming languages. Unlike other languages, the stored value may also be placed to the left of the assignment operator with the variable name on the right; in that case, as you might guess, the operator takes the form "->".
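
A small sketch of both assignment directions:

x <- 10    # value on the right, name on the left (the conventional form)
10 -> y    # value on the left, name on the right
x = 10     # "=" also works at the top level, as in other languages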

Depending on the stored data, variables can be of various types: integer, real, or string. For example:

my.var1 <- 42
my.var2 <- 35.25

In this case, my.var1 stores a whole number and my.var2 a real number. Note that R actually stores both as numeric (double) by default; to get a true integer you need the L suffix, as shown in the sketch below.
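
You can check for yourself how R stores a value (a small illustrative sketch):

class(my.var1); typeof(my.var1)   # "numeric", "double": 42 is not stored as integer
my.int <- 42L                     # the L suffix forces integer type
class(my.int)                     # "integer"
my.str <- "text"
class(my.str)                     # "character" (string)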

Just like in other programming languages, you can perform various arithmetic operations with variables.

my.var1 + my.var2 - 12

my.var3 <- my.var1^2 + my.var2^2

In addition to arithmetic operations, you can perform logical operations, that is, comparison operations.

my.var3 > 200
my.var3 > 3009
my.var1 == my.var2
my.var1 != my.var2
my.var3 >= 200
my.var3 <= 200

The result of a logical operation is a truth value: TRUE or FALSE. Logical operations can also compare a variable not just with a literal value but with another variable, and the result can itself be saved in a new variable.

my.new.var <- my.var1 == my.var2

Random Forest is one of my favorite data mining algorithms. Firstly, it is incredibly versatile: it can be used for both regression and classification problems, as well as for anomaly detection and predictor selection. Secondly, it is an algorithm that is genuinely hard to apply incorrectly, simply because, unlike other algorithms, it has few tunable parameters. And yet it is surprisingly simple in essence, and at the same time remarkably accurate.

What is the idea behind such a wonderful algorithm? The idea is simple: suppose we have some very weak learner, say a decision tree. If we build many different models using this weak learner and average their predictions, the final result will be much better. This is so-called ensemble learning in action, and it is why the algorithm is called "Random Forest": from the given data it builds many decision trees and then averages their predictions. An important point is the element of randomness in the creation of each tree: clearly, if we created many identical trees, averaging them would give only the accuracy of a single tree.

How does it work? Suppose we have some input data. Each column corresponds to some parameter, each row corresponds to some data element.

We can choose, at random, a number of columns and rows from the entire dataset and build a decision tree from them.
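
As an illustration (this is a sketch using the randomForest package and its built-in iris data, not code from this post), the whole idea fits in a few lines:

library(randomForest)   # assumes the randomForest package is installed

set.seed(42)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,   # number of trees in the forest
                    mtry = 2)      # columns sampled at random at each split
print(fit)                         # out-of-bag error estimate
predict(fit, iris[1:3, ])          # predictions averaged (voted) over trees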



Thursday, January 12, 2012


That's it, actually. The 17-hour flight is over, and Russia is now across the ocean. Outside the window of a cozy two-bedroom apartment lies San Francisco: the famous Silicon Valley, California, USA. Yes, this is the very reason I haven't written much lately. We moved.

It all started back in April 2011, when I had a phone interview with Zynga. At the time it all seemed like some kind of game that had nothing to do with reality, and I could not even imagine where it would lead. In June 2011, Zynga came to Moscow and conducted a series of interviews: about 60 candidates who had passed the phone screen were considered, and about 15 people were selected (I don't know the exact number; some changed their minds later, some refused right away). The interviews turned out to be surprisingly easy. No programming tasks, no tricky questions about the shape of manhole covers; mainly they tested your ability to hold a conversation. Knowledge, in my opinion, was assessed only superficially.

And then the rigmarole began. First we waited for the results, then the offer, then LCA approval, then approval of the visa petition, then the documents from the USA, then the queue at the embassy, then an additional check, then the visa. At times it seemed I was ready to drop the whole thing. At times I doubted whether we needed this America at all, since Russia isn't bad either. The whole process took about half a year; in mid-December we finally received our visas and began to prepare for departure.

Monday was my first day at the new job. The office has everything you need not only to work but to live: breakfasts, lunches, and dinners from in-house chefs, all sorts of food tucked into every corner, a gym, massage, and even a hairdresser, all completely free for employees. Many people bike to work, and several rooms are equipped for storing bicycles. I have never seen anything like it in Russia. Everything has its price, however: we were warned right away that we would have to work a lot. What "a lot" means by their standards is not yet clear to me.

I hope, however, that despite the amount of work, I will be able to resume blogging in the foreseeable future and perhaps share something about American life and working as a programmer in America. Time will tell. In the meantime, I wish you all a Merry Christmas and a Happy New Year, and see you soon!


For an example of use, let's compute the dividend yield of Russian companies. As the base price, we take the stock's closing price on the day the register closes. For some reason this information is not available on the Troika website, and it is much more interesting than the absolute dividend values.
Attention! The code takes a long time to execute because, for each stock, it has to query the Finam servers and fetch the quotes.

# Assumes the rusquant package, which adds the "Finam" source to getSymbols()
library(rusquant)

result <- NULL
for (i in 1:nrow(divs)) {
  d <- divs[i, ]                     # one stock per iteration
  if (d$Divs > 0) {
    try({
      quotes <- getSymbols(d$Symbol, src = "Finam",
                           from = "2010-01-01", auto.assign = FALSE)
      if (!any(is.nan(quotes))) {
        # closing price on the registry-close date, as described above
        price <- Cl(quotes)[as.character(d$RegistryDate)]
        if (length(price) > 0) {     # skip if that day was not a trading day
          dd <- d$Divs
          result <- rbind(result,
                          data.frame(d$Symbol, d$Name, d$RegistryDate,
                                     as.numeric(dd) / as.numeric(price),
                                     stringsAsFactors = FALSE))
        }
      }
    }, silent = TRUE)
  }
}
colnames(result) <- c("Symbol", "Name", "RegistryDate", "Divs")
result


Similarly, you can build statistics for past years.