Sunday, October 11, 2015

Adolescent Suicide Attempt Analysis in United States (Part 4)

The forth step is creating graphs of the variables to explore the values. There are two things that will be analyzed. First of all, the analyis of explanatory variables of H1TO31 (How many times adolescent used Marijuana), H1TO35 (How many times adolescent used cocain) and H1TO41 (How many times adolescent used other drugs) are done by using univariate graph. Secondly, it will be analyzed the association between those variables as explanatory variables and response variable: H1SU1 (Did you ever seriously think about committing suicide? ). In the second analysis, it will be created categorical variables of explanatory variables by creating a group.

Univariate Graphs
In this part, will be analyzed the explanatory variables using univariate graph.

PROC UNIVARIATE; VAR H1TO31 H1TO35 H1TO41;

Below is the graph of H1TO31:

Based on the table above, it can be seen the mean average of adolescent who use marijuana are about 9 times during past 12 months and standard deviation are quite high about 51 times. Most of adolescent do not use marijuan. It can be seen from median and mode are equal to 0 time.


Below is the graph of H1TO35:




Based on the table above, the table shows that the mean average of adolescent who use cocain are about 0.5 time during past 12 months and standard deviation are about 12 times. Most of adolescent do not use cocain. It can be seen from median and mode are equal to 0 time.


Below is the graph of H1TO41:



Based on the table above, the table shows that the mean average of adolescent who use other drugs are about 2.4 times during past 12 months and standard deviation are about 27.4 times. Most of adolescent do not use cocain. It can be seen from median and mode are equal to 0 time



Bivariate Graphs
In this part, it will be analyzed the association between MGRP (Group of how many times adolescent used Marijuana), CGRP (Group of how many times adolescent used Cocain) and OGRP (Group of how many times adolescent used other drugs) and response variable H1SU1. Please refer to previous analysis to see the source code Adolescent Suicide Attempt Analysis in United States (Part 3).

Recall, the values are grouped based on following rule:
Group 0: Adolescent doesn't use drugs. (Note: in the previous analysis Adolescent Suicide Attempt Analysis in United States (Part 3), it is grouped into Group 9)
Group 1: Adolescent uses drugs between 1 and 5 times in last 12 months
Group 2: Adolescent uses drugs between 6 and 10 times in last 12 months
Group 3: Adolescent uses drugs between 11 and 20 times in last 12 months
Group 4: Adolescent uses drugs between 21 and 50 times in last 12 months
Group 5: Adolescent uses drugs between 51 and 900 times in last 12 months

Below is the graph of association between MGRP and H1SU1:

PROC GCHART; VBAR MGRP/discrete type=mean SUMVAR=H1SU1;





















Based on graph above, it can be seen the association between how many times adolescent use marijuana and the thinking about suicide. The graph shows strong correlation between those variables. The more often Adolescent uses marijuana the more percentage of Adolescent thinks about suicide. It can be seen that group 0, adolescents who do not use marijuana, about 9% are thinking about suicide. Among group 1, about 19% are thinking about suicide. Among those using marijuana 6 to 10 times last 12 months, almost 24% are thinking about suicide. It is increasing, about 28% are thinking about suicide amoung those using marijuana 11 - 20 times. There is slightly decreasing adolescent who thinking suicide at group 4. However, adolescent at group 5 who use marijuana more than 50 times have the highest percentage, about 29%.

Below is the graph of association between CGRP and H1SU1

PROC GCHART; VBAR CGRP/discrete type=mean SUMVAR=H1SU1;





















Based on graph above, it can be seen the pattern between how many times adolescent use cocain and the thinking about suicide. The graph shows strong correlation between those variables. The more often Adolescent uses cocain the more percentage of Adolescent thinks about suicide. It can be seen that group 0, adolescents who do not use cocain, about 12% are thinking about suicide. Among group 1, about 33% are thinking about suicide. Among those using cocain 6 to 10 times last 12 months, almost 38% are thinking about suicide. About 44% are thinking about suicide amoung those using cocain 11 - 20 times. There is slightly decreasing adolescent who thinking suicide at group 4, similar like previous graph. However, adolescent at group 5 who use cocain more than 50 times have the highest percentage, about 55%.

Below is the graph of association between OGRP and H1SU1

PROC GCHART; VBAR OGRP/discrete type=mean SUMVAR=H1SU1;






















Based on graph above, it can be seen the pattern between how many times adolescent use other drugs and the thinking about suicide. The graph shows correlation between those variables. The more often Adolescent uses other drugs the more percentage of Adolescent thinks about suicide. It can be seen that group 0, adolescents who do not use cocain, about 11% are thinking about suicide. Among group 1, about 26% are thinking about suicide. Among those using cocain 6 to 10 times last 12 months, almost 38% are thinking about suicide. There is slightly decreasing percentage at group 3 and 4, about 34% ad 36%. However, adolescent at group 5 who use other drugs more than 50 times have the highest percentage, about 45%.

Sunday, October 4, 2015

Adolescent Suicide Attempt Analysis in United States (Part 3)

The third step is about data management which includes coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. The variables that will be managed are H1TO31, H1TO35 and H1TO41. These variables are about whether the adolescent use drugs or not. For detail explanation of each variable, please refer to Adolescent Suicide Attempt Analysis in United States (Part 1)

Below is the SAS code:
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;
DATA new2; set mydata.addhealth_pds;
LABEL 
H1TO31="During your life, how many times have you used marijuana?"
H1TO35="During your life, how many times have you used cocaine?"
H1TO41="During your life, how many times have you used any other illegal drug such as LSD, PCP, ecstasy, mushrooms, speed, ice, heroin or pills without a doctor’s prescription?"
MGRP= "Group of How many times Adolescent used Marijuana"
CGRP= "Group of How many times Adolescent used Cocain"
OGRP= "Group of How many times Adolescent used Other Drugs"
DRUGS="Use drugs either marijuana or cocaine or other drugs";

Recall from previous step, that the data is filtered only for Adolescent who thinking about suicide.
 
IF H1SU1=1; /*Seriously thinking about comitting suicide*/
IF H1SU8 >= 2 and H1SU8 <=4 ;/*Somewhat Honest to Completly Honest */

1. Missing Data
Those variables have the same codes for missing data which are 996: refused, 998: Don't know and 999: Not Applicable. Therefore, these codes will be excluded. Here is the additional script in SAS:

/*Remove 996: refused, 998: Don't know and 999: Not Applicable*/
IF H1TO31 = 996 or H1TO31 = 998 or H1TO31 = 999 then H1TO31 = .;
IF H1TO35 = 996 or H1TO35 = 998 or H1TO35 = 999 then H1TO35 = .;
IF H1TO41 = 996 or H1TO41 = 998 or H1TO41 = 999 then H1TO41 = .;

 
It can be seen that after the data processing there are missing value for each variable. There are 29 row missing value for variable H1TO31, 8 row missing value for variable H1TO35 and 15 row missing value for variable H1TO41.
Note: The tables above is cut due to web space.

2. Group Data
The values of these variables are various from 1 to 700 or 900. In order to make it simple, the binning process is performed. The groups will be created into 6 groups with following range:
Group 1: 1 - 5 times
Group 2: 6 - 10 times
Group 3: 11 - 20 times
Group 4: 21 - 50 times
Group 5: > 50 times
Group 9: 997 (Has not tried)

/*Create binning for each variables*/
IF H1TO31 LE 5 THEN MGRP = 1;
ELSE IF H1TO31 LE 10 then MGRP = 2;
ELSE IF H1TO31 LE 20 then MGRP = 3;
ELSE IF H1TO31 LE 50 then MGRP = 4;
ELSE IF H1TO31 LE 900 then MGRP = 5;
ELSE IF H1TO31 = 997 then MGRP = 9;





IF H1TO35 LE 5 then CGRP = 1;
ELSE IF H1TO35 LE 10 then CGRP = 2;
ELSE IF H1TO35 LE 20 then CGRP = 3;
ELSE IF H1TO35 LE 50 then CGRP = 4;
ELSE IF H1TO35 LE 700 then CGRP = 5;
ELSE IF H1TO35 = 997 then CGRP = 9;





IF H1TO41 LE 5 then OGRP = 1;
ELSE IF H1TO41 LE 10 then OGRP = 2;
ELSE IF H1TO41 LE 20 then OGRP = 3;
ELSE IF H1TO41 LE 50 then OGRP = 4;
ELSE IF H1TO41 LE 900 then OGRP = 5;
ELSE IF H1TO41 = 997 then OGRP = 9;




Based on table above, it shows that Adolescent who thinking about suicide, 48.11% of them use Marijuana, 10.45% use Cocain and 20.28% use Other Drugs.


3. Create New Variable
It will be created one new variable to know whether the Adolescent who thinking about suicide uses one or more drugs or not.

/*New Variables DRUGS, 1: use one or more drugs, 0: doesn't use drugs*/
IF H1TO31 < 997 or H1TO35 < 997 or H1TO41 < 997 THEN DRUGS=1;
ELSE DRUGS=0
It can be seen that in total there are 400 Adolescent or about 50.4% who thinking about suicide also use drugs either marijuana, cocain or other drugs. The other 49.62% Adolescent are not clear what is the reason of suicide. It is needed further investigation.


Here is the rest of the code.
PROC SORT; by AID;
PROC FREQ; TABLES H1TO31 H1TO35 H1TO41 MGRP CGRP OGRP DRUGS;
RUN;



Friday, September 25, 2015

Adolescent Suicide Attempt Analysis in United States (Part 2)

Next step after research questions have been decided, the chosen variables need to be analyzed.
In this case, SAS programming will be performed. Below is the SAS Code to understand the frequency data:

DATA new; set mydata.addhealth_pds;
LABEL H1SU1="Seriously think about committing suicide during the past 12 months"
 H1SU2="Attempt suicide during the past 12 months"
 H1SU3="Any attempt result in an injury, poisoning, or overdose that had to be treated by a doctor or nurse"
 H1SU8="How honestly have you answered the questions?";
/*Seriously thinking about comitting suicide*/
IF H1SU1=1;
/*Somewhat Honest to Completly Honest */
IF H1SU8 >= 2 and H1SU8 <=4 ;
PROC SORT; by AID;
PROC FREQ; TABLES H1SU1 H1SU2 H1SU3 H1SU8;
RUN;

The data is filtered by taking only young adult who seriously thinking about committing suicide during the past 12 months and who honestly answered the question (variable H1SU8 with code 2,3 and4). To know the meaning of the code each variables please refer to Adolescent Suicide Attempt Analysis in United States (Part 1).

Below is the frequency diagram for each variables:

As represented in the table, H1SU1 only has data with code 1 which the answer is 'Yes'. It can be seen that in total 794 adolescent who think to suicide, there are 220 adolescent (sum of code 1,2,3 and 4 of variable H1SU2) who attempt suicide or about 27.7 percent. The other 574 adolescent did not do the suicide. Furthermore, there are 60 adolescent or 7.56 percent (Variable H1SU3) who treated by doctor because of their suicide attempt. Based on the result, there is no missing data.


Thursday, September 17, 2015

Adolescent Suicide Attempt Analysis in United States (Part 1)

Suicide is the act of intentionally causing one's own death. Suicide happens because of many reasons such as mental disorder, bullying, alcohol, drugs, financial difficulties, interpersonal relationship  and many others. Below is a chart from ngram google books. The chart displays frequencies of any word or short sentence using yearly count of n-grams found in the sources printed between 1900 - 2008. It can be seen that the word suicide increases more than double from 1900s to 2000s. It shows that suicide becomes important issue nowdays. Many studies are conducted to understand the cause of suicide. One of the studies by Henry and Short analyze the relation between business cycle and suicide rates in United States. The hypothesis that is used are suicide rates are aggressive reaction to frustration caused by economic forces. It is mentioned that suicide varies negatively with the strength of external restraint over behavior when behavior is subjected to strong intense involvement in social relationships. Other interesting study is made by Sinha, Cnaan and Gelles about relationship between religious activities and risk behavior. Based on the study, it indicates that youth religious activities are associated with decreased risk behaviors.

This study wants to analyze about suicide of Adolescent that happen in United States. The study uses data from National Longitudinal Study of Adolescent Health (AddHealth). The survey comprised adolescents in grades 7-12 in the United States. The survey focuses on factors that may influence adolescents' health and risk behaviors including, including personal traits, families, friendships, romantic relationships, peer groups, schools, neighborhoods and communities.

The aim of this study is to investigate the association of drug consumption with suicide. It is expected that there is association between drug consumption and suicide (first hypothesis).

After examination of AddHealth Codebook, Variables that relevant to the aim of first topic are

Section 33: Suicide, the relevant variables are follows:

Variables Description Data
H1SU1 During the past 12 months, did you ever seriously think about committing suicide 0: no
1: yes
6: refused
8: don’t know
9: not applicable

Note: if the answer is either 0,6,8,9 then variable H1SU2 and H1SU3 are skipped.
H1SU2 During the past 12 months, how many times did you actually attempt suicide 0: 0 times
1: 1 time

2: 2 or 3 times
3: 4 or 5 times
4: 6 or more times
6: refused
7: legitimate skip (did not seriously think about committing suicide
8: don’t know

Note: if the answer is either 0,6,8 then variable H1SU3 are skipped.
H1SU3 Did any attempt result in an injury, poisoning, or overdose that had to be treated by a doctor or nurse? 0: no
1: yes
7: legitimate skip (did not attempt suicide)
H1SU8 How honestly have you answered the questions? 1: not honestly at all
2: somewhat honestly
3: very honestly
4: completely honestly
6: refused
8: don’t know
9: not applicable


Section 28: Tobacco, Alcohol, Drugs, the relevant variables are follows:

H1TO31 During your life, how many times have you used marijuana? range 1 - 900 times
996: refused
997: legitimate skip (has not tried marijuana)
998: don’t know
999: not applicable
H1TO35 During your life, how many times have you used cocaine? range 1 - 700 times
996: refused
997: legitimate skip (has not tried marijuana)
998: don’t know
999: not applicable
H1TO41 During your life, how many times have you used any other illegal drug such as LSD, PCP, ecstasy, mushrooms, speed, ice, heroin or pills without a doctor’s prescription? range 1 - 900 times
996: refused
997: legitimate skip (has not tried marijuana)
998: don’t know
999: not applicable


A further aim to expand the study about suicide is to investigate what type of activities and time spend are associated with suicide attempt. The second hypothesis, the less time adolescent do any activities the more attempt to suicide. The relevant variables are taken from Section 2 of AddHealth Codebook.

Section 2: Daily Activities, the relevant variables are follows:
H1DA1 During the past week, how many times did you do work around the house, such as cleaning, cooking, laundry, yardwork, or caring for a pet? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA2 During the past week, how many times did you do hobbies, such as collecting baseball cards, playing a musical instrument, reading, or doing arts and crafts? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA3 During the past week, how many times did you watch television or videos, or play video games? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA4 During the past week, how many times did you go roller-blading, roller-skating, skate-boarding, or bicycling? 0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA5 During the past week, how many times did you play an active sport, such as baseball, softball, basketball, soccer, swimming, or football? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA6 During the past week, how many times did you do exercise, such as jogging, walking, karate, jumping rope, gymnastics or dancing? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA7 During the past week, how many times did you just hang out with friends? 

0: not at all
1: 1 or 2 times
2: 3 or 4 times
3: 5 or more times
6: refused
8: don’t know
H1DA8 How many hours a week do you watch television? 0: does not watch television
range 1 to 99 hours
996: refused
998: don’t know
H1DA9 How many hours a week do you watch videos? 0: does not watch videos
range 1 to 99 hours
996: refused
998: don’t know
H1DA10 How many hours a week do you play video or computer games? 0: does not play video or computer games
range 1 to 99 hours
996: refused
998: don’t know
H1DA11 How many hours a week do you listen to the radio? 0: does not listen to the radio
range 1 to 99 hours
996: refused
998: don’t know



Bibliography:
1. Henry, A.F., Short, J.F., Suicide and Homicide - Some Economic, sociological and Psychological Aspects of Aggression, Arno Press, 1977.
2. Sinha, J.W., Cnaan, R. A., Gelles, R.J., Adolescent risk behaviors and religion: Findings from a national study, Journal of Adolescent, April 2007, Vol. 30, Issue 2, Pages 231 - 249.





Monday, May 16, 2011

SQL Traps

There are some SQL traps that we find. It is important to understand how aggregates in queries may result in incorrect data. You will know how SQL traps can cause queries to return inaccurate results.


Chasm Trap
A chasm trap is a common problem in relational database schemas in which a join path returns more data than expected. A chasm trap is a type of join path between three tables when two many-to-one joins converge on a single table, and there is no context in place that separates the converging join paths.


Fun Trap
Fan traps occur when there is a "one-to-many" join to a table that "fans out" into another "one-to-many" join to another table. This is a common structure and will not normally result in a fan trap. You only get incorrect result from the fan trap when the uery includes a measure object on the middle table of the table path and  an object from the subsequent table.

Thursday, May 5, 2011

Datastage: Partitioning and Collecting

Partitioning mechanism divides a portion of data into multiple streams of rows (one for each node), which is then processed independently by each node in parallel. Each partition of rows is processed separately by the stage/operator. It helps make a benefit of parallel architectures like SMP, MPP, Grid computing and Clusters.

Collecting is the opposite of partitioning. It can be defined as a process of bringing back partitioned data into a single sequential stream (one data partition).

Partitioning Algorithm
As a default, Datastage decides to use between Same or Round Robin partitioning. Typically Same partitioning is used between two parallel stages and round robin is used between a sequential and an EE stage. Datastage supports a few types of Data partitioning algorithm which can be implemented in parallel stages:
- Auto
 
Keyed: Rows are distributed based on values in the specified key.

- Hash: rows with same key column (or multiple columns) go to the same partition. Hash is very often used and sometimes improves performance, however it is important to have in mind that hash partitioning does not guarantee load balance and misuse may lead to skew data and poor performance. Hash does not guarantee "continuity": here, 3s are bunched with 0s, not with neighboring value 2. The expensive version of Hash is RANGE, that guarantees continuity.

- Modulus: data is partitioned on one specified numeric field by calculating modulus against number of partitions. partition = MOD(key_value/#partitions). Faster than Hash. Guarantees that rows with identical key values go in the same partition. Partition size is relatively equal if the data within the key column is evenly distributed.

Keyless: Rows are distributed independently of data values

- Round Robin: rows are evenly distributed across partitions. this partitioning method guarantees an exact load balance (the same number of rows processed) between nodes and is very fast. Good for initial import of data if no other partitioning is needed and useful for redistributing data. Round robin assigns rows to partitions like dealing cards. Partition assignment will be the same for a given $APT_CONFIG_FILE.

- Same: existing partitioning remains unchanged. no data is moved between nodes
- Entire: all rows from a dataset are distributed to each partition. Duplicated rows are stored and the data volume is significantly increased. Each partition gets a complete copy of data. this is useful for distributing lookup and reference data. ENTIRE is the default partitioning for lookup reference links with "Auto" partitioning
- Random: rows are randomly distributed across partitions
- Range: an expensive refinement to hash partitioning. It is simillar to hash but partition mapping is user-determined and partitions are ordered. Rows are distributed according to the values in one or more key fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning requires processing the data twice which makes it hard to find a reason for using it

Collecting Algorithms
A collector combines partitions into a single sequential stream.
Datastage EE supports the following collecting algorithms:
- Auto: Collect first available record. the default algorithm reads rows from a partition as soon as they are ready. This may lead to producing different row orders in different runs with identical data. The execution is non-deterministic
- Round Robin: picks rows from input partition patiently, for instance: first row from partition 0, next from partition 1, even if other partitions can produce rows faster than partition 1.
- Ordered: reads all rows from first partition, then second partition, then third and so on.
- Sort Merge: produces a globally sorted sequential stream from within partition sorted rows. Sort Merge produces a non-deterministic on un-keyed columns sorted sequential stream using the following algorithm: always pick the partition that produces the row with the smallest key value. Read in by key, presumes data is sorted by the key in each partition, builds a single sorted stream based on the key.


References
1. "Data Stage v8 Essentials Core Modules", IBM, 2009.
2,"Datastage data partitioning and collecting methods", www.etl-tools.info, 2010