Box plot hinges

TeeChart for ActiveX, COM and ASP
ESRI
Newbie
Posts: 60
Joined: Wed Mar 09, 2005 5:00 am

Post by ESRI » Wed Nov 28, 2007 11:51 pm

Hello.

We have been asked how the box part of a box plot is determined, particularly how the "hinges" are calculated. I couldn't find a detailed description, and it seems that there are several ways this can be done. Can you provide the formulas that TeeChart uses?


The conventional way of drawing box plots seems to be to draw the upper and lower boundaries at the 25th and 75th percentiles, and the median at the 50th percentile.

Our user's example data values (with ranks underneath) are as follows:
(Note: I am using CODE blocks here for formatting only.)

   1   19   20   24   28   37   57   58   60   75   80
  (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9) (10) (11)

From this, one would expect the following:
- median (2nd quartile) = value 6 = 37
- 25th percentile (1st quartile) = value 3 = 20
- 75th percentile (3rd quartile) = value 9 = 60
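
The post does not say which rule picks ranks 3, 6 and 9. One common convention that gives exactly those ranks places the p-th percentile at rank p/100 * (N + 1). As a rough check under that assumption, Python's standard `statistics.quantiles` uses this rule with its default "exclusive" method:

```python
from statistics import quantiles

# Ranked sample from the post above (N = 11).
data = [1, 19, 20, 24, 28, 37, 57, 58, 60, 75, 80]

# method="exclusive" puts the i-th quartile at rank i * (N + 1) / 4,
# which is rank 3, 6 and 9 for this sample. Assumption: this is the
# "conventional" rule the post has in mind; the thread does not name it.
q1, median, q3 = quantiles(data, n=4, method="exclusive")
print(q1, median, q3)  # 20.0 37.0 60.0
```

This only illustrates the expected values; it is not a claim about how TeeChart computes them.
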

However, the box plot we are getting seemingly puts the lower hinge at 21 and the upper at 59.5.

To illustrate, below I have plotted the ranked data values (over the character location number for reference), followed by the conventionally-expected boxplot and then the actual one:

Rank plot:

X.................XX...X...X........X...................XX.X..............X....X              
12345678901234567890123456789012345678901234567890123456789012345678901234567890

Conventional:
Note: the ends of the box are marked with a lowercase 'x' instead of '|' to indicate that each end coincides with an actual value in the input data.

                   +----------------+----------------------+                          
X.................Xx...X...X........M...................XX.x..............X....X
                   +----------------+----------------------+

Actual:

                    +---------------+---------------------+                          
X.................XX|..X...X........M...................XX|X..............X....X
                    +---------------+---------------------+   

Note: in contrast to the previous plot, neither end of the box falls on an actual data value.


Formula
Based on some investigation of statistical resources on the web, I came across a method that seems to give the same results as those seen in TeeChart. I shall describe it here:

- We are looking for the values of the 25th and 75th percentiles (p = 25 and p = 75).
- The p-th percentile of N ordered values is obtained by first calculating the rank:

Rp = N/100 * p + .5

- If Rp is an integer, the data value at rank Rp is the p-th percentile.
- If Rp is not an integer, the value must be interpolated.
- First, the rank Rp is split into its integer component k and its decimal component d.
- With these, the p-th percentile is calculated as follows, where Rk denotes the data value at rank k:

Vp = Rk + d * (R(k+1) - Rk)
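
The steps above can be sketched in Python as follows. The function name `percentile_by_rank` is made up for illustration, and the clamping at the ends of the sample is an assumption, since the description does not say what to do when the rank falls outside 1..N.

```python
import math

def percentile_by_rank(sorted_values, p):
    """p-th percentile (p in percent) using the rank Rp = N/100 * p + 0.5."""
    n = len(sorted_values)
    rank = n * p / 100 + 0.5      # 1-based rank, possibly fractional
    k = math.floor(rank)          # integer component
    d = rank - k                  # decimal component
    # Assumption: ranks outside 1..N fall back to the smallest/largest value.
    if k < 1:
        return float(sorted_values[0])
    if k >= n:
        return float(sorted_values[-1])
    lower = sorted_values[k - 1]  # value at rank k (lists are 0-based)
    upper = sorted_values[k]      # value at rank k + 1
    return lower + d * (upper - lower)

data = [1, 19, 20, 24, 28, 37, 57, 58, 60, 75, 80]
print(percentile_by_rank(data, 25), percentile_by_rank(data, 75))  # 21.0 59.5
```

For the sample in this thread the sketch gives 21 and 59.5, the hinge positions observed in the chart.
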

Application:
When this formula is applied to the dataset provided above, the calculations and results are as follows:

In the case of this dataset, N = 11.

25th percentile:
The rank of the 25th percentile is:

R25  = 11 / 100 * 25 + .5
     = .11 * 25 + .5
     = 2.75 + .5
     = 3.25

Thus, k = 3 and d = 0.25.

For the 25th percentile,

V25   = Rk + d * (R(k+1) - Rk)
          where R3 = 20  and  R4 = 24
      = 20 + .25 * (24 - 20)
      = 20 + .25 * 4
      = 20 + 1
      = 21

75th percentile:
The rank of the 75th percentile is:

R75 = 11 / 100 * 75 + .5
    = .11 * 75 + .5
    = 8.25 + .5
    = 8.75

Thus, k = 8 and d = 0.75.

For the 75th percentile,

V75 = Rk + d * (R(k+1) - Rk)
          where R8 = 58  and  R9 = 60
    = 58 + .75 * (60 - 58)
    = 58 + .75 * 2
    = 58 + 1.5
    = 59.5

Conclusion:
For a box plot drawn from this dataset, this formula puts the lower hinge of the box at 21 and the upper hinge at 59.5.

Thanks!
juan

(we are using v7.0.0.6, btw)

Yeray
Site Admin
Posts: 9601
Joined: Tue Dec 05, 2006 12:00 am
Location: Girona, Catalonia

Post by Yeray » Thu Nov 29, 2007 9:19 am

Hello Juan,

Please take a look at these two .NET threads where similar questions are discussed: thread1 and thread2.

You'll also find an example of this series at All Features\Welcome!\Chart Styles\Statistical\BoxPlot\Custom values in the features demo available in TeeChart's program group.

Best Regards,
Yeray Alonso
Development & Support
Steema Software
Av. Montilivi 33, 17003 Girona, Catalonia (SP)

Post by ESRI » Thu Nov 29, 2007 8:34 pm

Yeray,

I've looked through your suggestions (and other search results), but still cannot find the information I am looking for. I am not a programmer, and have only a very hazy (and Wikipedia-enhanced) understanding of statistics, so please bear with me if my questions are a little simplistic.

Just to be clear, we do have the box plots all hooked up and working. I am trying to get the mathematical formulas that are used for determining the box limits, not the whiskers or outliers.

The description I had put into our user documentation of how the box plots are created does not reflect the actual results. I need to better understand the calculations so that I can make the appropriate changes to the text.

BTW, if you wish to see the descriptions I have written, you can find them here:
http://webhelp.esri.com/arcgisdesktop/9 ... raph_types

Thanks,
juan

Post by ESRI » Fri Dec 14, 2007 5:24 am

Hi.

Any updates on this issue? We have clients enquiring about the methodology TeeChart uses for generating the box components of box plots, and we'd like to be able to provide this information to them (as well as update our documentation).

Thanks!
juan

Post by Yeray » Fri Dec 14, 2007 10:15 am

Hello Juan,

Sorry for the delay. I had to investigate exactly how TeeChart calculates those parameters because, as you said, there are several possible ways to do this.

Here are some websites that explain the issue: MathWorld and Engineer Statistics

Here is the code that TeeChart uses (the same method as MATLAB's default):

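// P is assumed to be the percentile as a fraction (0.25 for the 25th percentile)
// and InvN the reciprocal of the sample count (1.0 / N); with these inputs the
// routine reproduces the hinges 21 and 59.5 discussed above.
// SampleValues is assumed to hold the data sorted ascending, indexed from 0.
private double Percentile(double P, double InvN)
{
    double QQ = 0.0;     // cumulative position of the current sample: (i + 0.5) / N
    double OldQQ = 0.0;  // cumulative position of the previous sample

    // Advance until the current position reaches or passes P.
    int i = 0;
    while (QQ < P)
    {
        OldQQ = QQ;
        QQ = (0.5 + i) * InvN;
        i++;
    }

    // Interpolate linearly between the two samples that bracket P.
    double U = (P - OldQQ) / (QQ - OldQQ);
    return SampleValues[i - 2] + (SampleValues[i - 1] - SampleValues[i - 2]) * U;
}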
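
To see what the loop is stepping through, it can help to list the cumulative position (i + 0.5) / N that the code appears to assign to each sorted sample. The Python sketch below is an illustration only: the names `plotting_positions` and `bracket` are made up, and it assumes P is given as a fraction such as 0.25.

```python
def plotting_positions(sorted_values):
    """Pair each sorted sample (0-based index i) with the position (i + 0.5) / N."""
    n = len(sorted_values)
    return [((i + 0.5) / n, value) for i, value in enumerate(sorted_values)]

def bracket(sorted_values, p):
    """Return the two neighbouring (position, value) pairs that enclose fraction p."""
    positions = plotting_positions(sorted_values)
    for low, high in zip(positions, positions[1:]):
        if low[0] <= p <= high[0]:
            return low, high
    raise ValueError("p lies outside the first and last positions")

data = [1, 19, 20, 24, 28, 37, 57, 58, 60, 75, 80]
for position, value in plotting_positions(data):
    print(f"{value:3d} -> {position:.4f}")

lower, upper = bracket(data, 0.25)
print(lower[1], upper[1])  # 20 24, the two samples the 25% hinge lies between
```

With this sample, 0.25 falls between the positions of 20 and 24, and 0.75 between those of 58 and 60, which matches the interpolation ranges in the hand calculation earlier in the thread.
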

Best Regards,
Yeray Alonso

Post by ESRI » Sat Dec 15, 2007 5:48 am

Yeray,

Thanks for the update.

I'm still not quite getting it, though. As I understand your message, a "simplified" version of the algorithm would be:

1] Percentile (P, N) 
2] QQ = 0.0
3] Old = 0.0 
4] i = 0
5] while (QQ < P) 
6]   Old = QQ
7]   QQ = (0.5 + i) * N
8]   i = i + 1 
9] end
10] U = (P - Old) / (QQ - Old) 
11] Out = SV[i - 2] + ( SV[i - 1] - SV[i - 2] ) * U

Using the data values we have:

Val[1] =  1   Val[5] = 28   Val[9]  = 60
Val[2] = 19   Val[6] = 37   Val[10] = 75
Val[3] = 20   Val[7] = 57   Val[11] = 80
Val[4] = 24   Val[8] = 58

Running it through for the first quartile, I still can't generate the observed output value:

1] Percentile (25, 11)
2] QQ = 0.0
3] Old = 0.0
4] i = 0

5] While (0.0 < 25)		(TRUE)
6]    Old = 0.0
7]    QQ = ( 0.5 + 0 ) * 11   =  0.5 * 11   =  5.5
8]     i = 1

5] While (5.5 < 25)		(TRUE)
6]    Old = 5.5
7]    QQ = ( 0.5 + 1 ) * 11  =  1.5 * 11  =  16.5
8]     i = 2

5] While (16.5 < 25)		(TRUE)
6]    Old = 16.5
7]    QQ = ( 0.5 + 2 ) * 11  = 2.5 * 11  = 27.5
8]     i = 3

5] While (27.5 < 25)		(FALSE)
9] End                  (values: Old = 16.5, QQ = 27.5, i = 3)

10] U = (25 - 16.5) / (27.5 - 16.5)  = 8.5 / 11  = 0.772727...
11] Out = SV[(3 - 2)] + ( SV[(3 - 1)] - SV[(3 - 2)] ) * 0.7727..
        = SV[1] + ( SV[2] - SV[1] ) * 0.7727..
        = 1 + ( 19 - 1) * 0.7727..
        = 1 + 18 *  0.7727..
        = 1 + 13.909
        = 14.909

I must be missing something. Could you perhaps look through it and help point it out?

juan

Post by Yeray » Mon Dec 24, 2007 12:09 pm

Hello Juan,

Note that array indexes in Delphi start at 0. So, for an 11-element array, the indexes go from 0 to 10.

So the values in your example are stored as follows:

Val[0] =  1   Val[4] = 28   Val[8] =  60   
Val[1] = 19   Val[5] = 37   Val[9] = 75   
Val[2] = 20   Val[6] = 57   Val[10] = 80   
Val[3] = 24   Val[7] = 58
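
As a cross-check, the Python sketch below reruns the loop under three readings of the inputs. It suggests that switching to 0-based indexes alone does not reproduce the observed hinge of 21. The result matches only when P is also passed as a fraction (0.25) and the second argument as 1/N. That reading is an inference from the parameter name InvN and from the matching output, not something stated in the thread. The function name and the `index_base` parameter are made up for illustration.

```python
def percentile_loop(values, p, inv_n, index_base=0):
    """Loop from the posted routine; index_base says how the SV[...] subscripts are read."""
    qq = old_qq = 0.0
    i = 0
    while qq < p:
        old_qq = qq
        qq = (0.5 + i) * inv_n
        i += 1
    u = (p - old_qq) / (qq - old_qq)
    # Convert the routine's subscripts to Python's 0-based list positions.
    lo = (i - 2) - index_base
    hi = (i - 1) - index_base
    if lo < 0 or hi >= len(values):
        raise IndexError("p is outside the range covered by the samples")
    return values[lo] + (values[hi] - values[lo]) * u

data = [1, 19, 20, 24, 28, 37, 57, 58, 60, 75, 80]

as_traced = percentile_loop(data, 25, 11, index_base=1)        # about 14.909, the traced result
zero_based_only = percentile_loop(data, 25, 11, index_base=0)  # about 19.773, still not 21
fractional = percentile_loop(data, 0.25, 1 / 11, index_base=0) # about 21
upper = percentile_loop(data, 0.75, 1 / 11, index_base=0)      # about 59.5
print(as_traced, zero_based_only, fractional, upper)
```
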

Best Regards,
Yeray Alonso
