This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.
I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high-performance computing setup to perform some basic data manipulations on large source files. My question concerns the size of the file that results from a merge operation.
My in-memory (master) data has 18,865 observations and a file size of approximately 4.1MB.
Code:
obs: 18,865
vars: 24                                      4 Mar 2020 14:55
size: 4,093,705
----------------------------------------------------------------------------------------------------
                storage   display    value
variable name   type      format     label    variable label
----------------------------------------------------------------------------------------------------
key             str18     %18s
run_yr          float     %9.0g               Running year of alliance
firm1_gvkey     long      %12.0g              Final GVKey for P1
firm2_gvkey     long      %12.0g              Final GVKey for P2
gvkeypaired     float     %9.0g
ann_date        float     %td                 Announcement date of the alliance
firm1_permco    long      %12.0g              P1 Permco from CCM
firm2_permco    long      %12.0g              P2 Permco from CCM
n_firm1         float     %9.0g               Number of alliances of firm1 in the running year
flipped         byte      %8.0g               0 if A-B, 1 if B-A in a year
firm1_parent    str30     %30s                P1 Ultimate Parent Name
firm2_parent    str30     %30s                P2 Ultimate Parent Name
industry        str14     %14s                Industry
firm1_name      str30     %30s                Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30     %30s                Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int       %8.0g               P1 Ultimate Parent Primary SIC Code
firm2_sic       int       %8.0g               P2 Ultimate Parent Primary SIC Code
count_allyear   float     %9.0g               Count of alliance year
group           float     %9.0g               ID variable to identify year-focal firm combination
run_yr_enddate  float     %td                 End date of running year of alliance
id1             float     %9.0g               group(run_yr)
id2             float     %9.0g               group(run_yr firm1_gvkey)
gfreq           float     %9.0g
numid           float     %9.0g
----------------------------------------------------------------------------------------------------
Sorted by: run_yr firm2_gvkey gvkeypaired
The 'using' data file has 633,476,799 observations and has a file size of approximately 21GB.
Code:
obs: 633,476,799
vars: 6                                       20 Feb 2020 23:41
size:20,904,734,367
----------------------------------------------------------------------------------------------------
                storage   display    value
variable name   type      format     label    variable label
----------------------------------------------------------------------------------------------------
year            int       %8.0g
gvkey1          long      %12.0g
gvkey2          long      %12.0g
score           float     %9.0g
ball            byte      %8.0g
key             str18     %18s
----------------------------------------------------------------------------------------------------
Sorted by:
Code:
merge m:1 key using "/home/1996_2017.dta"
Code:
obs: 633,483,072
vars: 30                                      4 Mar 2020 15:44
size:147601555776
----------------------------------------------------------------------------------------------------
                storage   display    value
variable name   type      format     label    variable label
----------------------------------------------------------------------------------------------------
key             str18     %18s
run_yr          float     %9.0g               Running year of alliance
firm1_gvkey     long      %12.0g              Final GVKey for P1
firm2_gvkey     long      %12.0g              Final GVKey for P2
gvkeypaired     float     %9.0g
ann_date        float     %td                 Announcement date of the alliance
firm1_permco    long      %12.0g              P1 Permco from CCM
firm2_permco    long      %12.0g              P2 Permco from CCM
n_firm1         float     %9.0g               Number of alliances of firm1 in the running year
flipped         byte      %8.0g               0 if A-B, 1 if B-A in a year
firm1_parent    str30     %30s                P1 Ultimate Parent Name
firm2_parent    str30     %30s                P2 Ultimate Parent Name
industry        str14     %14s                Industry
firm1_name      str30     %30s                Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30     %30s                Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int       %8.0g               P1 Ultimate Parent Primary SIC Code
firm2_sic       int       %8.0g               P2 Ultimate Parent Primary SIC Code
count_allyear   float     %9.0g               Count of alliance year
group           float     %9.0g               ID variable to identify year-focal firm combination
run_yr_enddate  float     %td                 End date of running year of alliance
id1             float     %9.0g               group(run_yr)
id2             float     %9.0g               group(run_yr firm1_gvkey)
gfreq           float     %9.0g
numid           float     %9.0g
year            int       %8.0g
gvkey1          long      %12.0g
gvkey2          long      %12.0g
score           float     %9.0g
ball            byte      %8.0g
_merge          byte      %23.0g     _merge
----------------------------------------------------------------------------------------------------
Sorted by:
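As a sanity check on the three describe listings, the reported sizes appear to equal the number of observations times the row width, where the row width is the sum of the storage-type bytes (byte 1, int 2, long 4, float 4, strN N). The sketch below only redoes that arithmetic with the figures reported above; the local macro names are made up for illustration.

```stata
* Row width in bytes, summed from the storage types in each listing.
* Master: str18 + 4 x str30 + str14 + 11 floats + 4 longs + 2 ints + 1 byte
local w_master = 18 + 4*30 + 14 + 11*4 + 4*4 + 2*2 + 1
* Using: int + long + long + float + byte + str18
local w_using = 2 + 4 + 4 + 4 + 1 + 18
* Merged: key is stored once, and _merge adds one byte
local w_merged = `w_master' + (`w_using' - 18) + 1

display "row widths: `w_master' `w_using' `w_merged'"
display %20.0fc (18865 * `w_master')
display %20.0fc (633476799 * `w_using')
display %20.0fc (633483072 * `w_merged')
```

The widths come out as 217, 33, and 233 bytes, and the three products are 4,093,705, 20,904,734,367, and 147,601,555,776, matching the sizes shown by describe. Read this way, each of the 633,483,072 merged observations carries the full 233-byte row, including 152 bytes of fixed-width string variables from the master data even where those values are empty.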
I have searched online for an explanation of what merge does to file size but could not find anything substantial. All I could gather is that each software package optimizes its file format for reading, writing, etc. (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Could an expert here explain what the merge operation does in Stata in general, and perhaps in my case? Thank you!
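For anyone reproducing this, one possible way to limit the size of the result is to restrict which observations and which using variables merge keeps. This is a minimal sketch, assuming the goal is to attach only score and ball to the 18,865 master observations; that goal is an assumption and is not stated above. The keep() and keepusing() options are standard merge options.

```stata
* Assumption: only master observations (matched or not) are wanted,
* and only score and ball are needed from the using file.
merge m:1 key using "/home/1996_2017.dta", keep(master match) keepusing(score ball)

* Check the resulting number of observations and the size.
describe, short

* Optionally let Stata demote variables to smaller storage types where lossless.
compress
```

With keep(master match), observations that appear only in the using file are dropped, so the result should have no more observations than the master data.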