This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.
I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.
My in-memory data has 18,865 observations and has a file size of approximately 4.1MB.
Code:
obs: 18,865
vars: 24 4 Mar 2020 14:55
size: 4,093,705
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key str18 %18s
run_yr float %9.0g Running year of alliance
firm1_gvkey long %12.0g Final GVKey for P1
firm2_gvkey long %12.0g Final GVKey for P2
gvkeypaired float %9.0g
ann_date float %td Announcement date of the alliance
firm1_permco long %12.0g P1 Permco from CCM
firm2_permco long %12.0g P2 Permco from CCM
n_firm1 float %9.0g Number of alliances of firm1 in the running year
flipped byte %8.0g 0 if A-B, 1 if B-A in a year
firm1_parent str30 %30s P1 Ultimate Parent Name
firm2_parent str30 %30s P2 Ultimate Parent Name
industry str14 %14s Industry
firm1_name str30 %30s Participant 1 in Venture / Alliance (Short Name)
firm2_name str30 %30s Participant 2 in Venture / Alliance (Short Name)
firm1_sic int %8.0g P1 Ultimate Parent Primary SIC Code
firm2_sic int %8.0g P2 Ultimate Parent Primary SIC Code
count_allyear float %9.0g Count of alliance year
group float %9.0g ID variable to identify year-focal firm combination
run_yr_enddate float %td End date of running year of alliance
id1 float %9.0g group(run_yr)
id2 float %9.0g group(run_yr firm1_gvkey)
gfreq float %9.0g
numid float %9.0g
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: run_yr firm2_gvkey gvkeypaired
The 'using' data file has 633,476,799 observations and a file size of approximately 21GB.
Code:
obs: 633,476,799
vars: 6 20 Feb 2020 23:41
size:20,904,734,367
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
year int %8.0g
gvkey1 long %12.0g
gvkey2 long %12.0g
score float %9.0g
ball byte %8.0g
key str18 %18s
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Code:
merge m:1 key using "/home/1996_2017.dta"
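After the merge, the data in memory has 633,483,072 observations and a size of approximately 148GB.
Code: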
obs: 633,483,072
vars: 30 4 Mar 2020 15:44
size:147601555776
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key str18 %18s
run_yr float %9.0g Running year of alliance
firm1_gvkey long %12.0g Final GVKey for P1
firm2_gvkey long %12.0g Final GVKey for P2
gvkeypaired float %9.0g
ann_date float %td Announcement date of the alliance
firm1_permco long %12.0g P1 Permco from CCM
firm2_permco long %12.0g P2 Permco from CCM
n_firm1 float %9.0g Number of alliances of firm1 in the running year
flipped byte %8.0g 0 if A-B, 1 if B-A in a year
firm1_parent str30 %30s P1 Ultimate Parent Name
firm2_parent str30 %30s P2 Ultimate Parent Name
industry str14 %14s Industry
firm1_name str30 %30s Participant 1 in Venture / Alliance (Short Name)
firm2_name str30 %30s Participant 2 in Venture / Alliance (Short Name)
firm1_sic int %8.0g P1 Ultimate Parent Primary SIC Code
firm2_sic int %8.0g P2 Ultimate Parent Primary SIC Code
count_allyear float %9.0g Count of alliance year
group float %9.0g ID variable to identify year-focal firm combination
run_yr_enddate float %td End date of running year of alliance
id1 float %9.0g group(run_yr)
id2 float %9.0g group(run_yr firm1_gvkey)
gfreq float %9.0g
numid float %9.0g
year int %8.0g
gvkey1 long %12.0g
gvkey2 long %12.0g
score float %9.0g
ball byte %8.0g
_merge byte %23.0g _merge
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
I have searched online to understand what merge does to the file size but could not find anything substantial. All I could gather is that each software package optimizes its file format for reading, writing, etc. (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Could someone explain what happens during a merge operation in Stata in general, and perhaps in my case in particular? Thank you!
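As a rough check on the sizes reported in the describe output above, here is a small Python sketch (Python rather than Stata, purely for the arithmetic). It assumes the usual byte widths of Stata storage types (byte 1, int 2, long 4, float 4, strN N bytes) and that the reported size is simply observations times bytes per observation. The helper names `type_width` and `row_width` are made up for this illustration. Under those assumptions the three reported sizes are reproduced exactly, which suggests that the growth comes from every one of the 633 million merged observations carrying the full width of all 30 variables, including the five long string variables from the smaller dataset.

```python
# Bytes per value for the fixed-width Stata storage types (assumed standard widths).
FIXED_WIDTH = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def type_width(stata_type):
    """Return the bytes one value of this storage type occupies (strN -> N)."""
    if stata_type.startswith("str") and stata_type[3:].isdigit():
        return int(stata_type[3:])
    return FIXED_WIDTH[stata_type]


def row_width(types):
    """Bytes per observation: the sum of the widths of all variables."""
    return sum(type_width(t) for t in types)


# Storage types copied from the describe output of the in-memory (master) data.
master_types = [
    "str18", "float", "long", "long", "float", "float", "long", "long",
    "float", "byte", "str30", "str30", "str14", "str30", "str30",
    "int", "int", "float", "float", "float", "float", "float", "float", "float",
]

# Storage types copied from the describe output of the 'using' data.
using_types = ["int", "long", "long", "float", "byte", "str18"]

master_width = row_width(master_types)
using_width = row_width(using_types)

# After the merge, key is stored once and a 1-byte _merge variable is added.
merged_width = master_width + using_width - type_width("str18") + type_width("byte")

master_size = 18_865 * master_width
using_size = 633_476_799 * using_width
merged_size = 633_483_072 * merged_width

print("bytes per observation:", master_width, using_width, merged_width)
print("master size:", master_size)
print("using size: ", using_size)
print("merged size:", merged_size)
```

The per-observation width goes from 33 bytes in the 'using' file to 233 bytes after the merge, while the number of observations barely changes, so the total size scales by roughly the same factor.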