For pitchers, the pattern seems to be that the Batting.G column shows the
number of non-DH games in which the pitcher played, except for pitchers who
did not appear in any non-DH games for that team, in which case it matches
Pitching.G.
This makes it impossible to answer questions of the form "How many major
league games has Mike Hampton played?"
I don't know of an online source for the correct data. Even mlb.com gets
this wrong. The TSN Baseball Guide, crediting STATS Inc, shows the correct
figures, when it comes out in March.
> 4. Team_ID should not need to be unique over leagues, so the Boston
Braves
> should be called 'BOS' even though the Red Sox are also
called 'BOS'.
***
*** Team ID must be unique, unless you ensure that the League field
is always paired up to the Team ID field. Any field that can be used
in a key must generate a unique value.
> >ERA = 27 * ER / ROUND(3*IP,0)
> >
> >Using this counts 1/3rd innings instead of full innings.
>
> Makes sense, but you'd still have to use .3/.6 instead of .1/.2, no?
>
***
ERA = 9 * ER / IP
or
ERA = 27 * ER / (IP * 3)
or
ERA = 27 * ER / (ROUND(IP*3,0)
Work it out, and you see it makes sense.
***
*** Why not simply add the SQL statement that you need as views, and
then after the view is created the first time, save that view as a
table? This way, those people who don't care about some of
the "calculated data" won't be burdened with all the extra data, and
those who do will have their performance optimized for all subsequent
checks. Same deal for SLG, OBA, BA, ERA. Again, no need to have
these calculated fields, because if you fix the AB or the doubles on
some line, you have to regenerate these fields. But again, for those
who want it, you supply the SQL view to generate the table with the
new calculated fields. As for the differences in calculating SLG and
OBA, etc, this should be part of a table, and the view would do the
lookup calc. However, I can see how important these 4 calcs are
anyway, and it would be nice to have them as part of the database.
***
*** Team and Franchise IDs. These are 2 separate items. You have a
franchise ID, each with "one to many" team IDs. This should be a
separate table, and I think Lahman did this in his database, though I
seem to remember some problems with it.
*** Before you agree to denormalize, you should normalize first, get
the design right, and then you can talk about the tradeoff in
performance v referential integrity.
On 2001-10-15 23:29:38 Sean Forman wrote:
>micke.hovmoller@... wrote:
>
>> 1a. If a player is traded from a team and then back in the same season, he
>> has two lines in batting and/or pitching. I can see the history behind this
>> but if the rows are not numbered and one can't tell which was the first and
>> which was the second session I don't find that information to be worth a
>> whole lot. Also, if a player is sent to the minors and back in the same
>> season, he has only one line, so I don't find this consistent.
>
>
>Team order is the proper solution here, IMO. You need a way to get the
>team totals and lumping them as one yearly entry would be very
>problematic and would lose data. I have a transaction register from Tom
>Ruane and when we get going that can start us on the team order column
>(actually get us almost all the way there). Also, there is one subtlety
>here. There are about five instances where a player was traded away and
>back to a team in the same season. I think in those cases we want both
>entries in the database separately.
Team order is the "proper" solution, definitely. The question is, is it the
desirable solution? The trade-off is the amount of time it takes to gather and
update the data compared to the value it gives. To me, it gives almost no value,
but maybe that's just me. I can't quite see where that level of granularity is
required where the next step (game by game logs) wouldn't be almost necessary.
>> 1c. Fix the LahmanID:s for all the players called O'<something> as they are
>> sometimes referred to as O'leatr01 and sometimes as oleartr01. I would think
>> that the latter one is the correct version.
>
>
>I think we should hash out a naming standard for all the team_ID's,
>lg_ID's and player_ID's and stick with that. If anyone wants to propose
>one, we can discuss it here.
Isn't there a proper standard for names and an obviuos one for leagues? Or are
there 19'th century leagues that are ambiguous?
My take on teams:
1. I prefer NYY/NYM/CHW/CHC to NYA/NYN/CHA/CHN but don't have a strong
preference.
2. I suggest that team names be unique to franchises so that the 1957 NL Braves
and the 2000 NL Brewers can't both be called 'MIL', whereas the 1990 AL Brewers
and the 2000 NL Brewers should both have the same team_ID. (We'll be happy about
NYY when the Yankees switch leagues... :) )
3. I suggest that teams get their names from this date and backwards. Thus,
'BAL' should be the modern Orioles (old Browns), not the modern Yankees (old
Orioles).
4. Team_ID should not need to be unique over leagues, so the Boston Braves
should be called 'BOS' even though the Red Sox are also called 'BOS'.
5. I suggest the namning standard for older teams should be <the first two
letters of the city>+<figure starting from 1> for teams where the combination of
city and league wouldn't be unique. I realize that this only allows ten teams
per city and league but don't expect that to be a problem. (It will if we some
day decide to add minors according to the same naming standards.)
6. I prefer filling up all the three letters over only using two:
LAD/SDP/NYG/SFG over LA/SD/NY/SF. Again, no strong preference.
>A question here is what we want to do about computed columns. I think
>the term is atomicity, but in hardcore DB design these columns wouldn't
>exist in a table as they can be computed from existing columns. I'm
>willing to make it more useful at a cost of academic honesty.
Since the problem isn't just that that they could be computed but that the
formulas that need to be used varies over time, I think it's perfectly OK to
denormalize in this case.
>> 3. IP is given in the .1/.2 notation for partial innings. This makes adding
>> and calculating stuff like ERA a whole lot more complicated. I suggest
>> changing this to .333/.667.
>
>Three decimal places only sort of fixes the rounding problems. I use
>the following trick with one decimal place.
>ERA = 27 * ER / ROUND(3*IP,0)
>
>Using this counts 1/3rd innings instead of full innings.
Makes sense, but you'd still have to use .3/.6 instead of .1/.2, no?
>> 4. SH is in the batting table but not in the Teams table. Is there any
>> particular reason for this?
>
>
>Probably should be added. Another question is what to do about data
>that isn't available. My personal opinion is that those entries should
>be NULL. Though a DB expert could sway my opinion if that is
>inappropriate.
NULL is appropriate unless one wants to differentiate between <not applicable>
and <not yet available> in the same column. I can't say for sure if we would
ever want to make that distinction. If we do, it will mess things up quite a
bit.
>> 9. By comparing the sum of records from batting/pitching to the matching
>> record in teams it is easy to spot literally hundreds of differences. For
>> instance, the following query yields 200+ rows in my DB:
>
>Others have pointed out the problems here. I think we need to think
>about how we should approve changes, etc. when corrections are found.
>As are constantly happening. The last SABR records committee newsletter
>had a couple from the teens in it.
I have found a few errors in the 1997-2000 lines where I have the Major League
Official Handbook (or whatever it's called) to compare to, and in those
instances I'm fairly certain it's just an honest mistake in the DB as opposed to
conflicting historical facts. Where should I send these?
/Micke
On 2001-10-15 18:41:06 "Tom Tom" wrote:
>> 6. I'd like there to be predefined tables for careerbatting and
>> careerpitching as well as for franchisebatting and
>franchisepitching. This
>> is for performance reasons. Without these tables already populated
>searches
>> such as "list all the players who have hit at least 100 HR for two
>different
>> teams" become terribly slow. There could also be tables for
>teambatting and
>> teampitching but I don't care too much about those records.
>***
>*** I strongly disagree here. You can create the "query" or "view"
>for this, and that's fine. But why create a table based on another
>table. In Access, it let's you create a table based on the view, if
>the user so wishes. We should strive for data to NOT be duplicated,
>except for keys. Normalization is important.
As Sean F suggested, it is perhaps better to include the instructions on how to
create these tables as opposed to actually including the tables with data and
all.
However, I do feel very strongly that most people would want these tables.
Without them, some queries are just taking forever to execute. Maybe it's just
my contrived interest in obscure records, but I have made literally hundreds of
queries where the denormalized tables will yield me an answer in a matter of
seconds as opposed to minutes or even hours. Example, and yes, I have actually
checked this even tough I don't remember the answer right now: what player has
the fewest career HR despite hitting at least 100 HR for two different
franchises.
As for denormalization, I'd suggest that the database purists out there take
caution in the design. AVG/OBP/SLG should, IMHO, be included in the batting
stats because the definitions have changed over time and I want each batting
line to use the way of calculating these stats that is now considered the proper
way to do it for that season (as opposed to the way it's calculated for *this*
season). I know, I can write a fancy view whith advanced conditions for year and
league and so on, but that will kill performance for even the simplest of
queries.
Performance do count, after all. As does the advantage of easily writing the
queries.
And as Sean L pointed out, there are discrepancies that are "officially
acknowledged" (for lack of a better term) and the DB should clearly feature the
official team lines, not just the sum of the battign/pitching lines.
>> 8. In the future, I'd like batters faced and number of pitches
>thrown to be
>> added to pitching and number of pitches seen added to batting.
>***
>*** Again, alot of work here. For the recent seasons, easy enough.
>For pitchers, you can make a good estimate of the number of pitches
>based on his other data. Just as important would be to add run
>support. Again, you can make an estimate based on team runs scored
>as well.
I have read what Sean and Sean wrote and if the data on pitches thrown in
particular is inaccurate, I can see why it shouldn't be included. However, I
don't see the problem with adding data that is not at all available for earlier
seasons. We do have CS data after all.
I think this is an important philosophical question in general. I strongly
suggest that the argument "it's not available for earlier seasons" should not be
a valid criteria for selecting what to add and what not to add. Valid arguments
should instead be "nobody cares about this" or "the data is not reliable" or
"too much work gathering it".
(Minor point: saying that a data point can be reasonably estimated is a moot
point. The pitch count estimates are about as good as the RA estimates using
only BF/H/BB/HR and league and park adjustments. That's surely a silly argument
for not including RA.)
>***
>*** By the way, the Lahman database is a tremendous tremendous
>effort.
I cannot agree more. I was just stunned at the amount of information that was in
there! I am extremely impressed by the amount of work that has gone into this. A
million thanks to everyone involved!
/Micke
Greg sent this to me and labeled it the "ultimate Baseball Reference".
I think I agree. I've annotated some of the entries.
Players
Major League Statistics
Minor League Statistics (there is an ongoing db project here)
Situational Statistics
Fielding Statistics (have most need, GS and Innings)
Postseason Statistics (including eligibility, same as season tables)
don't have fielding data here either.
All-Star Statistics (including selections)
Uniform Numbers
Photographs
Positional (not sure what you mean here)
Quotations
Biographical Data (where from, deathplace, where buried?, siblings?)
Award Data (silver sluggers, TSN awards)
Hall of Fame data (voting data)
Game Logs
Biographies
Career Highlights
Transaction Data (including league disciplinary action)
Amateur Draft Data
Pronunciation Key (Doug Gwosdz?)
Spring Training Teams
Coaching, Managing and Umpiring data (including minor leagues)
Other Baseball Jobs
Contract Information
Baseball relatives
Video/Audio
Scouting reports
High School/American Legion/College/Olympic Experience
Baseball Information
Other Professional Baseball Information
Ejections
Acting and Endorsement Credits
Bibliography
Records held (possibly records ever held)
Armed Services record
For pitchers I would like to see pitches thrown.
Teams
Team/League Statistics (all kinds)/Standings
Game Logs (including official scorers, stadiums)
Uniforms (caps, logos, etc.)
Personnel (including full ownership, GM, president)
Photographs
Full Amateur Draft data (I have some of this)
Minor league affiliates
Payroll Information
Broadcast Information (flagship station, color man, Vin Scully)
Expansion draft (We have most of this)
Rule V draft
Free agent draft info - 80s
Attendance (have most of this)
Stadiums (diagrams, naming, capacity, etc.)
Spring Training Information
Lots of stuff here. And only the beginning. Add your own and at least
we'll have it some place.
Sincerely,
Sean Forman
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
micke.hovmoller@... wrote:
I don't think I'm repeating anything anyone else has posted on this, so
I'll just answer as I go.
> Warning: Long post!
> Some changes I'd like to see made to the Lahman DB. And also a bunch of
> questions, suggestions, etc. Take it for what it's worth.
> 1. I'd like all tables checked so that referential integrity could be added
> in the obvious places (between batting/pitching/postbatting/postpitching and
> master/teams, between teams and teammaster etc.). Implementing this will
> catch a large number of errors that might otherwise go unnoticed. In order
> to change this, (at least) the following need to be done:
BR.com is based on Sean Lahman's original database. I think I split off
on 2.2 or something like that. Anyways, I have corrected many, many of
these errors in my homebrew database. All the team_ID's are consistent
across all tables. All player_ID's have an entry in the master table.
This is true of all the tables, postsason, regular season, etc. Thanks
to Sean Lahman, I have "complete" fielding data. All the batters (who
weren't just pinch hitters or runners) have fielding entries and
vice-versa. All of the pitchers have batting entries as well. It is
pretty tight and pretty complete. I have all of my id's
cross-referenced with both Sean's and other systems, so I can dump all
of this out in Sean's naming convention or any other we choose to use.
> 1a. If a player is traded from a team and then back in the same season, he
> has two lines in batting and/or pitching. I can see the history behind this
> but if the rows are not numbered and one can't tell which was the first and
> which was the second session I don't find that information to be worth a
> whole lot. Also, if a player is sent to the minors and back in the same
> season, he has only one line, so I don't find this consistent.
Team order is the proper solution here, IMO. You need a way to get the
team totals and lumping them as one yearly entry would be very
problematic and would lose data. I have a transaction register from Tom
Ruane and when we get going that can start us on the team order column
(actually get us almost all the way there). Also, there is one subtlety
here. There are about five instances where a player was traded away and
back to a team in the same season. I think in those cases we want both
entries in the database separately.
> 1c. Fix the LahmanID:s for all the players called O'<something> as they are
> sometimes referred to as O'leatr01 and sometimes as oleartr01. I would think
> that the latter one is the correct version.
I think we should hash out a naming standard for all the team_ID's,
lg_ID's and player_ID's and stick with that. If anyone wants to propose
one, we can discuss it here.
> 2. In my DB AVG, OBP and SLG are calculated with only two decimals instead
> of the usual three. I'd suggest using as many decimals as possible on each
> platform.
A question here is what we want to do about computed columns. I think
the term is atomicity, but in hardcore DB design these columns wouldn't
exist in a table as they can be computed from existing columns. I'm
willing to make it more useful at a cost of academic honesty.
> 3. IP is given in the .1/.2 notation for partial innings. This makes adding
> and calculating stuff like ERA a whole lot more complicated. I suggest
> changing this to .333/.667.
Three decimal places only sort of fixes the rounding problems. I use
the following trick with one decimal place.
ERA = 27 * ER / ROUND(3*IP,0)
Using this counts 1/3rd innings instead of full innings.
> 4. SH is in the batting table but not in the Teams table. Is there any
> particular reason for this?
Probably should be added. Another question is what to do about data
that isn't available. My personal opinion is that those entries should
be NULL. Though a DB expert could sway my opinion if that is
inappropriate.
> 5. I'd like there to be a Franchise table linking teams from teammaster.
> This is described in the Baseball Archive Support forum, and allows the user
> to more easily see that the St. Louis Browns and the Baltimore Orioles is
> the same franchise, whereas the 1902 and the 1992 Baltimore Orioles are not.
I have a franchise table and have hashed it out with SABR folks as to
what is appropriate. There are also issues with the Player's league,
Federal League etc. as to when and how those teams and leagues folded.
This caused hours of pain to clean up.
> 6. I'd like there to be predefined tables for careerbatting and
> careerpitching as well as for franchisebatting and franchisepitching. This
> is for performance reasons. Without these tables already populated searches
> such as "list all the players who have hit at least 100 HR for two different
> teams" become terribly slow. There could also be tables for teambatting and
> teampitching but I don't care too much about those records.
I disagree for reasons stated by others. I think adding instructions
and SQL on how to do this is a good idea, but if we included a career
table we would be responsible for two numbers to update rather than one.
> 7. I'd like there to be some information about park factors, but I'm not
> sure where and how to insert it.
There should be some consideration about adding a parks table. I have
much of this data available, but it needs to be expanded and added to.
Also, with the Retrosheet game logs, we can get RS, RA at the various
parks by home and visitor. In recent years (1920+), HR, H, etc. are
available.
> 8. In the future, I'd like batters faced and number of pitches thrown to be
> added to pitching and number of pitches seen added to batting. I do realize
> the amount of work needed to gather the data, and I do not volunteer to
> gather it at this point in time, but I suppose getting it for the last few
> years can't be that difficult and that would always be a start.
Nice to have. I can see adding this in as a set of supplemental
tables. Essentially, I think we should first tweak the current design
to a solid standard. This standard can then be expanded in any way the
people working here or elsewhere can take it. Greg Spira sent a Crazy
Horse-esque (in scope) list of data he would like to see, which would
keep us all busy for 40+ years. Like I said, I want something where we
get out of the way of people who want to contribute and add to the
databank. And maybe we will be get there in ten years. ;-)
> 9. By comparing the sum of records from batting/pitching to the matching
> record in teams it is easy to spot literally hundreds of differences. For
> instance, the following query yields 200+ rows in my DB:
Others have pointed out the problems here. I think we need to think
about how we should approve changes, etc. when corrections are found.
As are constantly happening. The last SABR records committee newsletter
had a couple from the teens in it.
> /Micke
Thanks for getting the ball rolling Micke.
Sincerely,
Sean Forman
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
> 1. I'd like all tables checked so that referential integrity could
be added
***
*** Yes, this should be done before releasing the database.
Milwaukee 98-2000, there are 2 hitters in 1997 (one has SEA NL as an
entry), etc, could all be caught using this.
> 1a. If a player is traded from a team and then back in the same
season, he
> has two lines in batting and/or pitching. I can see the history
behind this
> but if the rows are not numbered and one can't tell which was the
first and
> which was the second session
***
*** I don't see much value in splitting up the hitting totals
by "session". This sure is alot of work, and I don't see the benefit
payoff of this. The next level of split would be by actual game, at
which point, we'd have to link in the retrosheet data.
> 1c. Fix the LahmanID:s for all the players called O'<something> as
they are
> sometimes referred to as O'leatr01 and sometimes as oleartr01. I
would think
> that the latter one is the correct version.
***
*** I agree, periods, apostrophes, spaces and other national
characters should be avoided in the key fields. They mess things up
sometimes.
>
> 3. IP is given in the .1/.2 notation for partial innings. This
makes adding
> and calculating stuff like ERA a whole lot more complicated. I
suggest
> changing this to .333/.667.
***
*** I agree as well. When I run my "views" or "queries", that's the
first thing I change.
> 6. I'd like there to be predefined tables for careerbatting and
> careerpitching as well as for franchisebatting and
franchisepitching. This
> is for performance reasons. Without these tables already populated
searches
> such as "list all the players who have hit at least 100 HR for two
different
> teams" become terribly slow. There could also be tables for
teambatting and
> teampitching but I don't care too much about those records.
***
*** I strongly disagree here. You can create the "query" or "view"
for this, and that's fine. But why create a table based on another
table. In Access, it let's you create a table based on the view, if
the user so wishes. We should strive for data to NOT be duplicated,
except for keys. Normalization is important.
> 8. In the future, I'd like batters faced and number of pitches
thrown to be
> added to pitching and number of pitches seen added to batting.
***
*** Again, alot of work here. For the recent seasons, easy enough.
For pitchers, you can make a good estimate of the number of pitches
based on his other data. Just as important would be to add run
support. Again, you can make an estimate based on team runs scored
as well.
***
*** I noticed as well that the team data is not the sum of the player
data, and have decided to completely ignore this table. The team
data totals breaks the normalization rules anyway.
***
*** By the way, the Lahman database is a tremendous tremendous
effort. I think the next step is to link it with the retrosheet data.
> 1. I'd like all tables checked so that referential integrity could
be added
***
*** Yes, this should be done before releasing the database.
Milwaukee 98-2000, there are 2 hitters in 1997 (one has SEA NL as an
entry), etc, could all be caught using this.
> 1a. If a player is traded from a team and then back in the same
season, he
> has two lines in batting and/or pitching. I can see the history
behind this
> but if the rows are not numbered and one can't tell which was the
first and
> which was the second session
***
*** I don't see much value in splitting up the hitting totals
by "session". This sure is alot of work, and I don't see the benefit
payoff of this. The next level of split would be by actual game, at
which point, we'd have to link in the retrosheet data.
> 1c. Fix the LahmanID:s for all the players called O'<something> as
they are
> sometimes referred to as O'leatr01 and sometimes as oleartr01. I
would think
> that the latter one is the correct version.
***
*** I agree, periods, apostrophes, spaces and other national
characters should be avoided in the key fields. They mess things up
sometimes.
>
> 3. IP is given in the .1/.2 notation for partial innings. This
makes adding
> and calculating stuff like ERA a whole lot more complicated. I
suggest
> changing this to .333/.667.
***
*** I agree as well. When I run my "views" or "queries", that's the
first thing I change.
> 6. I'd like there to be predefined tables for careerbatting and
> careerpitching as well as for franchisebatting and
franchisepitching. This
> is for performance reasons. Without these tables already populated
searches
> such as "list all the players who have hit at least 100 HR for two
different
> teams" become terribly slow. There could also be tables for
teambatting and
> teampitching but I don't care too much about those records.
***
*** I strongly disagree here. You can create the "query" or "view"
for this, and that's fine. But why create a table based on another
table. In Access, it let's you create a table based on the view, if
the user so wishes. We should strive for data to NOT be duplicated,
except for keys. Normalization is important.
> 8. In the future, I'd like batters faced and number of pitches
thrown to be
> added to pitching and number of pitches seen added to batting.
***
*** Again, alot of work here. For the recent seasons, easy enough.
For pitchers, you can make a good estimate of the number of pitches
based on his other data. Just as important would be to add run
support. Again, you can make an estimate based on team runs scored
as well.
***
*** I noticed as well that the team data is not the sum of the player
data, and have decided to completely ignore this table. The team
data totals breaks the normalization rules anyway.
***
*** By the way, the Lahman database is a tremendous tremendous
effort. I think the next step is to link it with the retrosheet data.
I'll follow up to some of Micke's comments...
micke.hovmoller@... wrote:
> 8. In the future, I'd like batters faced and number of pitches thrown to be
> added to pitching and number of pitches seen added to batting. I do realize
> the amount of work needed to gather the data, and I do not volunteer to
> gather it at this point in time, but I suppose getting it for the last few
> years can't be that difficult and that would always be a start.
I don't think this data is that easy to find, and I've always resisted this sort
of thing for a couple of reasons. First, this kind of situational data is
available only for very recent seasons, and second, the accuracy of the
pitch-by-pitch data is very questionable. Long term, I think the best thing for
us to do is provide a mechanism for linking to the retrosheet data rather than
trying to incorporate it here.
> 9. By comparing the sum of records from batting/pitching to the matching
> record in teams it is easy to spot literally hundreds of differences.
That's not a database issue, that's a matter of historical fact. There are
hundreds of instances where the official team totals don't match the sum of the
player totals. I've spoken with Pete Palmer at length about this issue, and
it's
something we addressed in Total Baseball (see "Issues and Answers:
Discrepancies"
in the chapter "The History of Major League Baseball Statistics", page 524 of
the
seventh edition).
> As for 1-6, I could easily provide scripts that implement all of these
> changes, but I will only bother doing it if I know it will be used. How do I
> go about sending this to the proper place? This list?
Let's get through the 2001 update and then we can start thinking about changes
to
the overall design.
Regards,
Sean
Warning: Long post!
Some changes I'd like to see made to the Lahman DB. And also a bunch of
questions, suggestions, etc. Take it for what it's worth.
1. I'd like all tables checked so that referential integrity could be added
in the obvious places (between batting/pitching/postbatting/postpitching and
master/teams, between teams and teammaster etc.). Implementing this will
catch a large number of errors that might otherwise go unnoticed. In order
to change this, (at least) the following need to be done:
1a. If a player is traded from a team and then back in the same season, he
has two lines in batting and/or pitching. I can see the history behind this
but if the rows are not numbered and one can't tell which was the first and
which was the second session I don't find that information to be worth a
whole lot. Also, if a player is sent to the minors and back in the same
season, he has only one line, so I don't find this consistent.
Suggested solution: Change the duplicate rows into one aggregate row. I can
write scripts for this if it's considered a good idea.
Alternative solution: Add a column named "TeamOrder" or something such
indicating in which order the different rows for the same year was generated.
Filling this column properly is *a lot* of work and, IMHO, not worth it.
1b. Fix a few errors where teams do not match entries in the team table.
Again, I think I have scripts for this or can generate them relatively
quickly.
1c. Fix the LahmanID:s for all the players called O'<something> as they are
sometimes referred to as O'leatr01 and sometimes as oleartr01. I would think
that the latter one is the correct version.
1d. Go through all the postbatting and postpitching tables and fix the team
ID's.
2. In my DB AVG, OBP and SLG are calculated with only two decimals instead
of the usual three. I'd suggest using as many decimals as possible on each
platform.
3. IP is given in the .1/.2 notation for partial innings. This makes adding
and calculating stuff like ERA a whole lot more complicated. I suggest
changing this to .333/.667.
4. SH is in the batting table but not in the Teams table. Is there any
particular reason for this?
5. I'd like there to be a Franchise table linking teams from teammaster.
This is described in the Baseball Archive Support forum, and allows the user
to more easily see that the St. Louis Browns and the Baltimore Orioles is
the same franchise, whereas the 1902 and the 1992 Baltimore Orioles are not.
6. I'd like there to be predefined tables for careerbatting and
careerpitching as well as for franchisebatting and franchisepitching. This
is for performance reasons. Without these tables already populated searches
such as "list all the players who have hit at least 100 HR for two different
teams" become terribly slow. There could also be tables for teambatting and
teampitching but I don't care too much about those records.
7. I'd like there to be some information about park factors, but I'm not
sure where and how to insert it.
8. In the future, I'd like batters faced and number of pitches thrown to be
added to pitching and number of pitches seen added to batting. I do realize
the amount of work needed to gather the data, and I do not volunteer to
gather it at this point in time, but I suppose getting it for the last few
years can't be that difficult and that would always be a start.
9. By comparing the sum of records from batting/pitching to the matching
record in teams it is easy to spot literally hundreds of differences. For
instance, the following query yields 200+ rows in my DB:
select t.team, t.lg, t.year, t.ab, sum(b.ab), t.ab - sum(b.ab) diff
from teams t,
batting b
where t.team = b.team
and t.lg = b.lg
and t.year = b.year
group by t.team, t.lg, t.year, t.ab
having t.ab - sum(b.ab) <> 0
order by 6
I'd like to help out in detecting and correcting these errors but would like
suggestions on where to start looking. I suppose baseball-reference.com
still has most of these errors since it's based on the same DB, no? Since I
live in Sweden and hardly have any access to printed matter, on-line sources
are of course highly preferable.
As for 1-6, I could easily provide scripts that implement all of these
changes, but I will only bother doing it if I know it will be used. How do I
go about sending this to the proper place? This list?
/Micke
Michael Mavrogiannis wrote:
>
> Zounds:
> Update Master set Lastname="Wilson" where Lastname="A. Wilson";
>
> For consistency with all the other players known by their initials:
> Update Master set LahmanID="sabatc01" where LahmanID="sabatc.01";
> Update Master set LahmanID="dicker01" where LahmanID="dicker.01";
>
Thanks for catching this. The site has been updated.
Sincerely,
Sean Forman
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
Zounds:
Update Master set Lastname="Wilson" where Lastname="A. Wilson";
For consistency with all the other players known by their initials:
Update Master set LahmanID="sabatc01" where LahmanID="sabatc.01";
Update Master set LahmanID="dicker01" where LahmanID="dicker.01";
Sean Lahman wrote:
>
> See: http://espn.go.com/mlb/profiles/gamelogs/2001/pitching/3898.html
>
> STATS erroneously shows Dave Martinez throwing 17 pitches in the July 4 game
against
> Philadelphia, but no batters faced. What happened was that he came into the
game as
> part of a double switch with Kerry Ligtenberg. Those 17 pitches belonged to
> Lightenberg. Several other sources support this account.
>
> I recommend we drop Dave Martinez from the 2001 pitching table.
>
> --SL
Good catch. I deleted the record and the site will show the update in a
minute or so. Did I mention that I'm glad I wrote a script that does
the dump and upload for me? ;-)
later,
sean
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
See: http://espn.go.com/mlb/profiles/gamelogs/2001/pitching/3898.html
STATS erroneously shows Dave Martinez throwing 17 pitches in the July 4 game
against
Philadelphia, but no batters faced. What happened was that he came into the
game as
part of a double switch with Kerry Ligtenberg. Those 17 pitches belonged to
Lightenberg. Several other sources support this account.
I recommend we drop Dave Martinez from the 2001 pitching table.
--SL
Sean Lahman wrote:
> The Sporting News shows a pitching line for him but no IP, no batters faced,
and
> no credit for a game appaearance. ESPN shows the same thing. I'm going to
look
> at the box scores and see if I can figure out what happened.
>
> --SL
>
> Justin Kubatko wrote:
>
> > Sean,
> >
> > Dave Martinez of the Atlanta Braves is listed in the 2001 pitching file, but
> > is not credited with a game pitched. Did he actually appear as a pitcher
> > this year? If so, he should at least have a "1" in the "G" column.
> >
> > Thanks,
> >
> > Justin Kubatko
> >
> >
> > http://www.baseball-databank.org/
> >
> > To unsubscribe from this group, send an email to:
> > baseball-databank-unsubscribe@yahoogroups.com
> >
> >
> >
> > Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
>
>
> http://www.baseball-databank.org/
>
> To unsubscribe from this group, send an email to:
> baseball-databank-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
The Sporting News shows a pitching line for him but no IP, no batters faced, and
no credit for a game appaearance. ESPN shows the same thing. I'm going to look
at the box scores and see if I can figure out what happened.
--SL
Justin Kubatko wrote:
> Sean,
>
> Dave Martinez of the Atlanta Braves is listed in the 2001 pitching file, but
> is not credited with a game pitched. Did he actually appear as a pitcher
> this year? If so, he should at least have a "1" in the "G" column.
>
> Thanks,
>
> Justin Kubatko
>
>
> http://www.baseball-databank.org/
>
> To unsubscribe from this group, send an email to:
> baseball-databank-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Sean,
Dave Martinez of the Atlanta Braves is listed in the 2001 pitching file, but
is not credited with a game pitched. Did he actually appear as a pitcher
this year? If so, he should at least have a "1" in the "G" column.
Thanks,
Justin Kubatko
Justin Kubatko wrote:
>
> Sean(s),
>
> In the 2001 "TEAM.txt" file, the New York Mets and New York Yankees
> team_id's are reversed:
>
> NYN AL E 161 95 65 ...
> NYA NL E 162 82 80 ...
>
> Thanks,
>
> Justin Kubatko
Thanks, there was an error in how I had my ID's and the Lahman team ID's
cross-referenced in my team table. They should be right now. Also, I
updated the ERA column to their correct values.
Sincerely,
Sean Forman
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
Sean,
> In the 2001 "TEAM.txt" file, the New York Mets and New York Yankees
> team_id's are reversed:
This is also the case in the batting file...
piazzmi01 2001 NYA NL
jeterde01 2001 NYN AL
...as well as the pitching file:
appieke01 2001 NYA NL
pettian01 2001 NYN AL
Thanks,
Justin Kubatko
Sean(s),
In the 2001 "TEAM.txt" file, the New York Mets and New York Yankees
team_id's are reversed:
NYN AL E 161 95 65 ...
NYA NL E 162 82 80 ...
Thanks,
Justin Kubatko
Justin Kubatko wrote:
>
> Sean,
>
> A few errors I found in the 2001 files:
>
> 1) Brent Hinchliffe's lahman_id in the batting and pitching files is listed
> as "hinclbr01". It should be "hinchbr01".
>
> 2) The last column in the master file is labeled "2001". I believe this
> should be "debut_year".
>
> 3) Chuck Smith's lahman_id is "smithch07", although there is no "smithch06".
> I guess this really isn't an error, but I think his lahman_id should be
> "smithch06" to be consistent.
>
> Thanks,
>
> Justin Kubatko
All of these are changes that need to be made in the main database. I
don't have the debut column in my db, so I just fudged it in this case.
Hinchliffe and Smith's ID's appear as they do in last year's version. I
don't want to change it here because it would make the update
incompatible with Sean's version from last year. You are correct that
they should have different id's, but that is more an issue for when the
full database comes up for consideration and it will be fixed then.
Thanks for the help.
Sincerely,
Sean Forman
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
Sean,
A few errors I found in the 2001 files:
1) Brent Hinchliffe's lahman_id in the batting and pitching files is listed
as "hinclbr01". It should be "hinchbr01".
2) The last column in the master file is labeled "2001". I believe this
should be "debut_year".
3) Chuck Smith's lahman_id is "smithch07", although there is no "smithch06".
I guess this really isn't an error, but I think his lahman_id should be
"smithch06" to be consistent.
Thanks,
Justin Kubatko
--- In baseball-databank@y..., Sean Lahman <slahman@b...> wrote:
> I realize it's outside the scope of what you've done, Sean, but what
> about non-active players who have died? The SABR biographical
committee
> can provide an updated list -- I usually get it from David Vincent
or
> Bill Carle.
>
> Regards,
> Sean
The data that I sent out was only to be consistent with the data
currently in the DB. I've been able to add big chunks of biographical
data to the database as well. I have place of birth, given name,
place and data of death, date of debut, height, weight, and nicknames
for the majority of the players. One thing I'd like to see done here
by this group.
1) Fill in the gaps
2) add in any new information (recently deceased, etc.)
We need to hash out what tables and columns to have.
Currently, in my home DB, I have two multiple tables for the bio data.
a Bio data with the name, birth, death, height and weight data. This
is data everyone has and then a
players table with handedness, and debut information. This way we
aren't really concerned what Lee MacPhail's throwing hand was and it
seems a little bit cleaner to me, but I'll listen to countervailing
arguments.
I see all this data going into the V.2001 Database.
I realize it's outside the scope of what you've done, Sean, but what
about non-active players who have died? The SABR biographical committee
can provide an updated list -- I usually get it from David Vincent or
Bill Carle.
Regards,
Sean
Mike Emeigh wrote:
>
> Where's Lloyd McClendon?
>
> Mike Emeigh
> piratefan1@...
Did he actually do anything this year?
| 2951 | mcclell01 | 2001 | 0 | PIT | NL | C | 162 |
62 | 100 | 6 | 0 | |
I had the wrong entry for his lahman_ID in the manager table. It should
be fixed now.
I've incorporated the changes to the birthdates as well. And the new
updates on the site.
http://www.baseball-reference.com/data/2001/
later,
sean
Baseball Stats! http://www.Baseball-Reference.com/
Baseball Analysis! http://www.BaseballPrimer.com/
I noted a fair number of birthdates were not in the updated
2001 data for debut players. So - I went hunting on the
ESPN Web site and got the missing ones. They're all in the
file master.csv, which I uploaded to the Files area of this
group:
http://groups.yahoo.com/group/baseball-databank/files/
Mike Emeigh
piratefan1@...
Just to get this on record.
tables to include in new release, and columns
naming conventions
new tables to build
what to do with new data and how to proof it
how should updates be done
licensing issues
What infrastructure needs to be built for the dissemination of the
data and on the website.
Sites and projects that have a similar focus.
http://www.baseball-databank.org/ is not yet running
http://www.gnu.org/philosophy/license-list.html
The design science license may be appropriate for what we are doing.
However, that would allow others to redistribute the database as they
see fit.
later,
sean
Sean Forman Assistant Professor sforman@...
Math and Computer Science Dept. http://www.sju.edu/~sforman/
St. Joseph's University Philadelphia PA