%TOC%

---+ Multiple plots

To print 4 histograms (each with 100 classes) of random normal samples on the same plot, where n=351, mean=160, sd=6.03:

<verbatim>
> par(mfrow=c(2,2))
> for (i in 1:4){ hist(rnorm(351, 160, 6.03), 100) }
</verbatim>

This gives the following plot (yours may differ because of randomness):

<img src="%ATTACHURLPATH%/multiplot.jpg" alt="Multi" width="480" height="480" />

To produce a jpeg:

<verbatim>
> jpeg("~/multiplot.jpg")
> par(mfrow=c(3,3))
> for (i in 1:9){ hist(rnorm(351, 160, 6.03), 100) }
> dev.off()
</verbatim>

which produces the following image:

<img src="%ATTACHURLPATH%/multiplot1.jpg" alt="Multi" width="480" height="480" />

*Note:* the image is 480x480 pixels, hence the truncated titles. The following is more appropriate:

=> jpeg("~/multiplot.jpg", 600, 600)=

<img src="%ATTACHURLPATH%/multiplot4.jpg" alt="an" width="600" height="600" />

---

---+ Cumulative plot

[[%ATTACHURL%/coal.txt][coal.txt]] contains the intervals in days between successive mine explosions. The following commands produce a cumulative plot:

=> x<-scan("coal.txt") //read data=

=> time<-cumsum(x) //cumulative 'running' sum of the data=

=> number<-1:length(x) //sequence from 1 to 190=

=> plot(time, number) //giving:=

<img src="%ATTACHURLPATH%/coal.jpg" alt="Coal" width="480" height="480" />

=> scatter.smooth(x=time, y=number) //or a scatter plot with a smoothed curve overlaid:=

<img src="%ATTACHURLPATH%/coal1.jpg" alt="coal" width="480" height="480" />

---

---+ Poisson distribution

To plot a Poisson probability function Poisson(mu) from the data in [[%ATTACHURL%/dec20.txt][dec20.txt]], issue the following commands:

=> dec20<-scan("dec20.txt") //load in data=

=> barplot(dpois(1:max(dec20),summary(dec20)['Mean'])) //prints the following:=

<img src="%ATTACHURLPATH%/dec20.jpg" alt="dec" width="480" height="480" />

*Note:* the syntax to get the mean from the summary is not straightforward. =summary(...)= returns an atomic vector, so the =$= notation cannot be used to reference the internals of the returned data structure, as it can with, say, =hist(x,y)$counts=. The array syntax =[]= must be used instead, hence =summary(dec20)['Mean']= to get hold of the mean/mu. You might find it enlightening to issue the commands =mode(summary(dec20))= and =mode(hist(x))= to get a feel for the different data structures returned by various functions. You might also like to try =attributes(...)=.

---

---+ Histograms

=> hist(dec20) //plots the following=

<img src="%ATTACHURLPATH%/dec20hist.jpg" alt="dec" width="480" height="480" />

but it can be improved with the following:

=> hist(dec20, 23, right=F)=

<img src="%ATTACHURLPATH%/dec20hist1.jpg" alt="dec" width="480" height="480" />

Notice that the second plot has more groups/classes, and that with =right=F= the intervals are right-open, so the frequency for 0 is now plotted separately.

---

---+ Poisson process

[[%ATTACHURL%/earth.txt][earth.txt]] contains intervals in days between major earthquakes.

=> e<-scan("earth.txt")=

The following finds the probability that exactly 5 earthquakes happen in a typical 4-year period:

=> dpois(5,((365*3)+366)/summary(e)['Mean'])=

And the following finds the probability that fewer than 3 earthquakes happen in a 4-year period:

=> ppois(2,((365*3)+366)/summary(e)['Mean'])=

The waiting times between successive earthquakes can be described by an exponential distribution with parameter lambda (the rate, i.e. the number of earthquakes per day).
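The rate can be estimated straight from the data, since one earthquake occurs per mean interval. A minimal sketch (assuming =e= is the vector loaded above; the =62/27107= used below appears to be exactly this ratio, 62 intervals totalling 27107 days, which matches the 437-day mean quoted further down):

<verbatim>
> lambda <- length(e)/sum(e)   # events per day: number of intervals / total days, i.e. 1/mean(e)
> lambda                       # for this data set, 62/27107
</verbatim>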
The probability that the waiting time will be less than 30 days:

=> pexp(30, 62/27107)=

and that it will exceed two years:

=> 1-pexp(2*365, 62/27107)=

---

---+ Quantiles for an exponential distribution

The mean of the interval between serious earthquakes is 437 days (see above for the data). The median can be calculated using P(X <= x) = 0.5:

=> qexp(.5, 1/437)=

which gives approximately 303 days; 1/437 is the rate of earthquakes (1/mu). Equivalently, the median of an exponential distribution is ln(2)/lambda = 0.693 * 437, which is about 303. Any quantile can be calculated in this manner.

---

---+ Cycling accident data for 2005

See also: [[http://labs.timesonline.co.uk/blog/2009/03/11/uk-cycling-accidents/][Times online]], [[http://innovate.direct.gov.uk/2009/03/10/pedalling-some-raw-data/][Gov Direct]], [[http://po-ru.com/files/pedal-cycle-accident-locations.zip][Repackaged csv]].

Load the following file: [[%ATTACHURL%/2005.csv][2005.csv]]

=> stuff<-scan("2005.csv", skip=1, what=list(year=0,north=0, east=0), sep=',', quiet=T)=

Create a density plot of the co-ordinate data (densities are calculated over a 500x500 grid). =kde2d= lives in the =MASS= package, so load that first:

=> library(MASS)=

=> blah<-kde2d(stuff$north, stuff$east, n=500)=

=blah= is a list containing 3 items: =x= and =y=, representing the grid, and =z=, a 500x500 matrix containing the kernel density for each cell in the grid. Then plot the result:

=> image(blah, col=c(0, rainbow(50)))=

<img src="%ATTACHURLPATH%/accident2005.jpg" alt="accident" width="480" height="480" />

You can see that the above plot is a lot clearer than the following:

=> plot(stuff$north, stuff$east)=

<img src="%ATTACHURLPATH%/latlong2005.jpg" alt="latlong" width="480" height="480" />

---

If you want to add a colour legend to the z-plot above, you'll need to install the =fields= package:

=> install.packages("fields")=

=> library(fields)=

=> image.plot(blah, col=c(0, rainbow(50)))=

<img src="%ATTACHURLPATH%/legend2005.jpg" alt="legend" width="480" height="480" />

---

To add a few place names (see the helper sketch at the end of this section):

<verbatim>
> points(537500, 169500); text(537500, 169500, pos=2, "Beckenham")
> points(534500, 184500); text(534500, 184500, pos=2, "Hackney")
> points(532108, 178763); text(532108, 178763, pos=2, "E&C")
> points(531496, 181359)
> text(531496, 181359, "Shoe Lane (The City)", pos=4)
</verbatim>

---

The following plot shows that the accident hotspot for London in 2005 was approximately in the region of Shoe Lane:

<img src="%ATTACHURLPATH%/accident2005Lengend.jpg" alt="accident" width="480" height="480" />

and Victoria Embankment in 2007:

<img src="%ATTACHURLPATH%/accidentHotspot2007.jpg" alt="blah" width="480" height="480" />

---

TODO: it would be nice to add the UK boundary to the plots.

---

   * [[%ATTACHURL%/2007.csv][2007.csv]]
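Each label above repeats the same =points()=/=text()= pair, so it can be worth wrapping the pair in a small helper. A minimal sketch (the function name =labelPoint= is my own, not from the original):

<verbatim>
> # plot a marker at (x, y) and label it; pos=2 puts the text to the left, pos=4 to the right
> labelPoint <- function(x, y, name, pos=2){ points(x, y); text(x, y, name, pos=pos) }
> labelPoint(537500, 169500, "Beckenham")
> labelPoint(531496, 181359, "Shoe Lane (The City)", pos=4)
</verbatim>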
---

---+ Testing the central limit theorem

I wanted to test that the mean over a large number of samples can be modelled adequately by a normal distribution, and to see whether the theorem held up when the original distribution was heavily skewed. To investigate:

=means<-numeric(10000) //somewhere to store the means=

<verbatim>
for(i in 1:10000){ means[i]<-mean(rbinom(n=10000, prob=.1, size=5)) }
</verbatim>

This generates 10000 random binomial samples, storing each of their means; note the use of prob=.1, producing a heavily skewed distribution.

=h<-hist(means) //plot the distribution=

Now let's overlay the idealised normal curve:

<verbatim>
x<-seq(0.4766, 0.5262, .0001)                        # generate the x values; use range(means)
y<-dnorm(x, mean=mean(means), sd(means))*.005*10000  # scale the y values by the bin width and the number of samples (look at h for the bin width)
lines(x, y)                                          # draw the curve over the histogram
</verbatim>

<img src="%ATTACHURLPATH%/central.jpg" alt="central" width="480" height="480" />

The normal approximation looks pretty good. The probability plot looks like this (note that qnorm(1) is Inf, so the final point is dropped from the plot):

=plot(sort(means), qnorm(seq(1/10000, 1, 1/10000)))=

<img src="%ATTACHURLPATH%/probplot.jpg" alt="prob" width="480" height="480" />

Again, this looks very good.

---+ Data frames and Combinations

Create a list containing two columns, each a count from 1 to 31:

=L<-list(A=c(1:31),B=c(1:31))=

Create a data frame that contains all the combinations of 1 to 31:

=G<-expand.grid(L)=

Add a column, Excel style:

=G$fraction <- G$A/G$B=

Filter by column:

<verbatim>
G[G$fraction <= 1,]
</verbatim>

Count the number of rows of a data frame:

<verbatim>
nrow(G[G$fraction <= 1,])
</verbatim>
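As a worked example of the same pipeline, the filtered frame can be summarised further. A minimal sketch (assuming =G= as built above; the name =H= is my own):

<verbatim>
H <- G[G$fraction <= 1,]        # keep only the fractions A/B with A <= B
nrow(H)                         # 496 rows: 31*32/2 combinations have A <= B
length(unique(H$fraction))      # number of distinct fraction values among them
</verbatim>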