Hello,

This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.

I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high-performance computing cluster to perform some basic data manipulations on large source files. My question concerns the file size that results from a merge operation.

My in-memory (master) data set has 18,865 observations and is approximately 4.1MB:
Code:
 obs:        18,865                          
 vars:            24                          4 Mar 2020 14:55
 size:     4,093,705                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key             str18   %18s                  
run_yr          float   %9.0g                 Running year of alliance
firm1_gvkey     long    %12.0g                Final GVKey for P1
firm2_gvkey     long    %12.0g                Final GVKey for P2
gvkeypaired     float   %9.0g                
ann_date        float   %td                   Announcement date of the alliance
firm1_permco    long    %12.0g                P1 Permco from CCM
firm2_permco    long    %12.0g                P2 Permco from CCM
n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
firm1_parent    str30   %30s                  P1 Ultimate Parent Name
firm2_parent    str30   %30s                  P2 Ultimate Parent Name
industry        str14   %14s                  Industry
firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
count_allyear   float   %9.0g                 Count of alliance year
group           float   %9.0g                 ID variable to identify year-focal firm combination
run_yr_enddate  float   %td                   End date of running year of alliance
id1             float   %9.0g                 group(run_yr)
id2             float   %9.0g                 group(run_yr firm1_gvkey)
gfreq           float   %9.0g                
numid           float   %9.0g                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: run_yr  firm2_gvkey  gvkeypaired

The 'using' data file has 633,476,799 observations and is approximately 21GB:
Code:
 obs:   633,476,799                          
 vars:             6                          20 Feb 2020 23:41
 size: 20,904,734,367                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
year            int     %8.0g                
gvkey1          long    %12.0g                
gvkey2          long    %12.0g                
score           float   %9.0g                
ball            byte    %8.0g                
key             str18   %18s                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
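As a sanity check on these sizes: if I read the saved results of describe correctly (r(width) being the per-observation width in bytes and r(N) the number of observations, with describe using leaving the same results without loading the file), the sizes reported above are just width times number of observations. Please correct me if those saved results do not mean what I think they do.
Code:
describe                                // master data currently in memory
display %21.0fc r(width)*r(N)           // 217 * 18,865 = 4,093,705 bytes

describe using "/home/1996_2017.dta"    // the 'using' file, without loading it
display %21.0fc r(width)*r(N)           // 33 * 633,476,799 = 20,904,734,367 bytes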
I perform a merge operation on these data:
Code:
merge m:1 key using "/home/1996_2017.dta"
The resulting data set is massive: approximately 147GB!
Code:
  obs:   633,483,072                          
 vars:            30                          4 Mar 2020 15:44
 size: 147,601,555,776                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key             str18   %18s                  
run_yr          float   %9.0g                 Running year of alliance
firm1_gvkey     long    %12.0g                Final GVKey for P1
firm2_gvkey     long    %12.0g                Final GVKey for P2
gvkeypaired     float   %9.0g                
ann_date        float   %td                   Announcement date of the alliance
firm1_permco    long    %12.0g                P1 Permco from CCM
firm2_permco    long    %12.0g                P2 Permco from CCM
n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
firm1_parent    str30   %30s                  P1 Ultimate Parent Name
firm2_parent    str30   %30s                  P2 Ultimate Parent Name
industry        str14   %14s                  Industry
firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
count_allyear   float   %9.0g                 Count of alliance year
group           float   %9.0g                 ID variable to identify year-focal firm combination
run_yr_enddate  float   %td                   End date of running year of alliance
id1             float   %9.0g                 group(run_yr)
id2             float   %9.0g                 group(run_yr firm1_gvkey)
gfreq           float   %9.0g                
numid           float   %9.0g                
year            int     %8.0g                
gvkey1          long    %12.0g                
gvkey2          long    %12.0g                
score           float   %9.0g                
ball            byte    %8.0g                
_merge          byte    %23.0g     _merge    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
This is problematic even on machines with 64 CPU cores and 512GB of RAM: simple operations such as filtering and summarizing take a long time to return results, which is understandable at this size. What I do not understand is why, even though all input and output files are in .dta format, adding 6,273 observations and 24 variables to the larger ('using') data set leads to an approximately fivefold increase in file size.
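The only thing I could put together myself is a back-of-the-envelope width calculation, under my own assumption that after the merge every observation stores all 29 variables from both files plus the byte-sized _merge (30 in total, matching the describe output above), whether it matched or not. Taking float and long as 4 bytes, int as 2, byte as 1, and strN as N bytes, the master is 217 bytes per observation, the 'using' file is 33, and the two share the str18 key, so:
Code:
* my assumption: each merged observation carries all variables from both files,
* plus the 1-byte _merge, so width = 217 + 33 - 18 + 1 = 233 bytes per observation
display %21.0fc (217 + 33 - 18 + 1) * 633483072   // 147,601,555,776 bytes, i.e. ~147GB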

I have searched online to figure out what happens with merge but could not find anything substantial. All I could find and understand is that each software package optimizes its file format for reading, writing, etc. (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Could an expert here explain what is happening with the merge operation in Stata in general, and perhaps in my case specifically? Thank you!