First of all, let me say that I'm glad more people are thinking/working in the space of triples. Even unstructured ones like this.
But when there's no semi-strict schema, it gets really, really tricky. Free text is hard, and actual meaning is hard to separate. (I say semi-strict, as Freebase is schema-last -- feel free to create your own! -- but has some level of enforcement)
For specific domains you may be okay with tags. And for some limited applications it probably works great. Triples are cool!
But when you start talking about larger, broader datasets, ones that no one person or small group can curate, you're going to start running into collisions.
I think there's a lot of interesting work to be done. But I doubt that this is "better" per se; at the very least, right now it's little more than a toy.
And hey, I built such a toy graph engine once upon a time (be gentle -- it was really a demo hack) https://github.com/barakmich/jgd -- you can even query it with Freebase's old MQL. (Which I have mixed feelings about, but is cool in its own way)
I guess my argument is, don't throw the baby out with the bathwater. And feel free to ping me for more!
First of all, I'm glad to be talking to a Freebase engineer :)
Second: It is not intended to be a Freebase killer, but it is certainly better for me and what I needed it for: building denisthebot.com - which can answer questions about a whole lot of subjects while being completely agnostic about what "categories" these subjects fit in.
Third: Since we all know no-one is going to change anyone's mind here, I won't discuss the merits of a DB structured the way I built it.
But calling it a toy is quite flattering; I know of some very powerful stuff that was called a "toy" for some time :) Also, it would be a very easy toy to use, which will suit a lot of people, according to the emails I've received.
Also, you're right: "don't throw the baby out with the bathwater".
> Simpler structure: There are no datatypes, namespaces, lists, domains. Just ordered nodes. Having a dead simple structure like that allows developers to quickly and intuitively know how to access the info they want.
I don't see how this makes it simpler or intuitive at ALL. If there's no convention as to whether I should search for "born on" or "born_date" or "year_born", or whether the date will be "1900-08-01" or "08-01-1900" or "1900/08"... then how is this supposed to be useful?
The central problem is, there are lots of textual ways of describing the same thing. Without standardized datatypes and standardized tags, it quickly becomes a messy, useless free-for-all.
I don't see how TheBigDB gets around this. The FAQ explains how it's different from Freebase/Wikidata, but I don't at all understand how it's supposed to be better, or even as good.
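To make the concern concrete: without standardized predicates, every consumer ends up maintaining its own alias table just to ask one question. A minimal sketch (the predicate spellings and data here are made up for illustration):

```python
# Hypothetical alias table a client would need in order to query an
# unstructured triple store for a single concept, "birth date".
BIRTH_DATE_ALIASES = {"born on", "born_date", "year_born", "date of birth", "dob"}

def find_birth_dates(triples, subject):
    """Scan (subject, predicate, object) triples for anything that
    looks like a birth date for the given subject."""
    return [obj for (subj, pred, obj) in triples
            if subj == subject and pred.lower() in BIRTH_DATE_ALIASES]

triples = [
    ("Alice", "born on", "1900-08-01"),
    ("Alice", "year_born", "1900"),
    ("Bob", "born_date", "08-01-1900"),
]

# Two "answers" for Alice, in two different formats -- the client
# still has to reconcile them after retrieval.
print(find_birth_dates(triples, "Alice"))
```

And even after the alias lookup succeeds, the values themselves still disagree on format, which is the second half of the problem.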
This is stupid. I hate to say it, but by taking all of the complexity out of Freebase, they forgot the problem Freebase was trying to solve: semantic integration.
Right on the front page they have their biggest flaw in the example:
Apple weight 150g
Ok, so what about the company Apple? How are they semantically distinguishing one item from another? I guess Steve Jobs is Founder of a Fruit that weighs 150 grams.
What you're talking about is the larger problem of "context".
Thankfully, I already wrote the code and the documentation to attach statements to contexts, solving this problem in an elegant way.
I haven't released it yet to keep the core of the service simple, and to help people understand what it does first.
It will be released when it will be the right time for it.
Thanks for your comment!
I apologize for calling your project stupid. I should not have done that. I'm interested in seeing what you are doing on context. Thanks for being classy.
When you start assigning context to statements, you'll quickly find yourself walking down the path to structured data, and toward the very complexity you're arguing against. At least predicate-wise.
You'll then need to start assigning contexts to entities, at which point you're right about where Freebase begins, in principle. Even if we assume you use Wikipedia-esque strings like "Apple (company)"
Yes, whatever I do has me "walking down the path to structured data, and toward the very complexity you're arguing against" -- that's why it's done in a way where those principles are still preserved.
I don't want to talk too much about a structure that I haven't released yet, but let me assure you there's a way that doesn't end up where Freebase is. (Not that there's anything wrong with that.</seinfeld>)
It's all about conventions, redundancy and votes. When you know what you're looking for, and you've got a community agreeing on such conventions, you know you'll get what you're actually looking for.
That's pretty much it.
And yes, it is different, "better" depends on what you're trying to do with it :)
Agree. It's like WikiPedia: Make it easy for the community to decide how to organize it. Nothing stopping anyone from just populating this db in whatever format they prefer then publishing their own docs explaining their data model ... Let the audience decide if that's useful or not.
Sounds like a much simplified version of Douglas Lenat's Cyc project [1], which has been going since the mid eighties and is attempting to build a structured knowledgebase/ontology of everyday knowledge. They have a freely downloadable subset called OpenCyc [2]. It seemed pretty impressive last time I looked at it.
>Lenat was frustrated by Automated Mathematician's constraint to a single domain and so developed Eurisko; his frustration with the effort of encoding domain knowledge for Eurisko led to Lenat's subsequent (and, as of 2008, continuing) development of Cyc. Lenat envisions ultimately coupling the Cyc knowledgebase with the Eurisko discovery engine.
I don't know what he intends to do with it from there, but it could potentially make for some very powerful AIs.
Yes, simplification + open API is at the heart of the project.
The first one because it makes people actually want to use the service, the second one because it helps keep the data about all sorts of things up to date.
I wonder if you could do machine learning on schemata. Basically start learning about dates (as an example) and as it learns updates the information with what it has learned. Something that has one person putting in { name "foo", born "10/1/92"} and someone else putting in { name "bar", born "september 30th, 1966" } and then going back and replacing the dates with an ISO standard date type but with a change history so you could look backwards in time at the data and see how the database had "improved" it. (or not). Then by voting on the improvements you teach the system to clean up its data representations. Crazy? Insightful? Stupid? I don't know but it was the question that popped into my head.
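A toy sketch of that "normalize but keep history" idea -- here using a hand-maintained list of known formats in place of anything actually learned, and keeping the raw value alongside the canonical one so the "improvement" is auditable:

```python
from datetime import datetime

# Formats the system has "learned" so far. Hand-picked here; the idea
# above imagines growing this list from the data itself over time.
KNOWN_FORMATS = ["%m/%d/%y", "%B %dth, %Y", "%Y-%m-%d"]

def normalize(raw):
    """Try to canonicalize a date string to ISO 8601, keeping the
    original string so the change history can be inspected later."""
    for fmt in KNOWN_FORMATS:
        try:
            iso = datetime.strptime(raw, fmt).date().isoformat()
            return {"raw": raw, "value": iso, "normalized": True}
        except ValueError:
            pass
    # Unknown format: leave it alone rather than guess.
    return {"raw": raw, "value": raw, "normalized": False}

print(normalize("10/1/92"))               # note: assumes US month-first order
print(normalize("september 30th, 1966"))
```

Note that the very first format in the list already embeds an assumption (month before day), which is exactly where voting on "improvements" would have to come in.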
The problem is that the many possible formats conflict in ways that make resolution impossible without cross-referencing other sources.
Which date is "10/3/5"? Is it March 5th 2010? March 5th 1910? March 10th 1905? March 10th 2005? October 3rd 1905? October 3rd 2005? (or another century entirely, though the 20th and 21st would be most likely). And don't think that "/" vs. "-" as the separator is sufficient to tell them apart.
And you'll find a lot of other variations -- I'm used to writing 10/3-5, for example... But I'm not even consistent: I might write 10/3/5 or 10-3-5, or 5/3/10 / 5-3-10; anywhere I want to be explicit, I write 2005-03-10, exactly because I'm used to seeing so many ambiguous dates that can't easily be resolved.
What about the value 5.123? Is it a floating point value with "123" after the decimal point, or the integer 5123? The "decimal point" is "," in many countries, and the thousand separator is usually, but not always, "." in countries that use "," as the decimal marker. If you treat things as "just text" you are going to have to potentially deal with dozens of different combinations of decimal points and quantity markers (depending on country, the markers don't all occur only every 3 digits to the left from the decimal marker...)
Interpreting small text fragments is fraught with a near-infinite number of obnoxious details like this, and part of the problem is that few people know most of them and will be unable to quickly resolve ambiguities without cross-referencing other data (or worse: they think they know, or don't even recognize that there's an ambiguity in the first place).
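The scale of the ambiguity is easy to demonstrate by brute force: just enumerate every valid calendar date those three numbers could denote (a quick sketch, limited to two centuries):

```python
from datetime import date
from itertools import permutations

def interpretations(a, b, c, centuries=(1900, 2000)):
    """Enumerate every calendar date three small numbers could mean,
    trying each field as two-digit year (expanded per century),
    month, and day, and discarding impossible combinations."""
    found = set()
    for y, m, d in permutations((a, b, c)):
        for century in centuries:
            try:
                found.add(date(century + y, m, d))
            except ValueError:
                pass  # e.g. month > 12 or day out of range
    return sorted(found)

# "10/3/5" has a dozen distinct plausible readings across just
# the 20th and 21st centuries:
print(len(interpretations(10, 3, 5)))  # 12
```

Since 10, 3 and 5 are all valid as a month, a day, or a two-digit year, every permutation survives, which is the worst case the comment above describes.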
One nice property of the Wikidata database is that it is a "secondary database. Wikidata will record not just statements, but their sources, thus reflecting the diversity of knowledge available and supporting the notion of verifiability." [1]
I think that's far better than voting. Voting for facts amounts to relying on a logical fallacy: appeal to the majority. [2] (Voting is fine for popularity contests, or things that can only be matters of opinion, but facts?)
Is it possible to download all data and use it under some open license (like CC-BY)? I can't find data license terms.
If not, then sorry, Freebase is vastly superior IMHO -- from a user's point of view I don't see the point in a crowdsourced proprietary database (even if the API is currently free).
It's worth pointing out that in most of the world facts are not copyrightable, and a collection of facts is often not copyrightable either. In those instances, CC-BY or whatever license you apply will be unenforceable. I believe Creative Commons themselves explicitly state that CC-BY and similar licenses are not appropriate for collections of facts.
E.g. they explicitly say on their website: "Copyright does not protect the facts or ideas underlying the creative expression. So, Creative Commons licenses do not apply to ideas, factual information or other non-creative elements that are not protected by copyright."
Courts in many jurisdictions explicitly refuse to accept "sweat of the brow" arguments for copyright, explicitly requiring an element of creativity. Some countries do have "database rights" that can protect arrangements of facts in certain ways, while in others a collection of straight-up facts cannot be copyrighted pretty much no matter what.
Please get the licensing clarified. It's important. CC-BY is maybe too much; imagine having to give attribution/credit for 23,967 sources... Check out the content license used by Wikidata and OpenStreetMap (Hint: not CC-BY.)
Have you/do you plan to seed your database with the already structured data from freebase? It should be relatively straightforward, right? Well, I mean, minus the time to properly map the Freebase schema into your format. But that's probably less time than it takes to wait for people to fill in enough facts.
Excellent! I've been working on something similar. Trying to come up with a schema that is data-centric is hard enough let alone focusing on the ease of use by developers. Good luck!
"Can I send how many requests I want?" -- I think you might mean "Can I send as many requests as I want?"
Actually yes, but in a different way, I was thinking of accelerating the power of sudden downvotes, but nothing is set in stone yet.
So yes it will happen, how exactly is yet to be decided.
Thanks for your question!
'Yugoslavia is a country'. Are you really going to exclude that? So, no statements about political geography at all, then. Same problem for physical geography ('New Moore Island exists'). Nothing about climate (because 'the rain in Spain' could change at any time), nothing about population levels. Even the sample 'average weight of an apple' is pretty suspect. I think your database is going to be quite limited.
Edit: It seems we're arguing about what a priori knowledge is capable of serving as a base for factual deductions. The Kantian approach is to say that we all agree on time and space and everything can be based off of these self-evident truths. I think there is not such a clear boundary between objective truth and induction.
Edit2: I'd also like to take this moment to point out that "you're" is the proper contraction of "you are", since we're getting all semantic.
This has suddenly become very philosophical. My view is that a database of facts should contain things that are believed to be facts. It should be possible to remove facts that are shown to be incorrect, but those things should never have been true.
> I'd also like to take this moment to point out that "you're" is the proper contraction of "you are", since we're getting all semantic.
I know, it annoys me too. By the time I'd realized, it was too late to edit. Typos happen.
"My view is that a database of facts should contain things that are believed to be facts. It should be possible to remove facts that are shown to be incorrect, but those things should never have been true."
But they definitely were true. I thought you were making a distinction between 'something that is true' and 'something that is a fact (ie: is unchangingly true)' which I don't think most people make.
It's pretty obvious you didn't consider statements like that. There are a large number of examples that make your statement of the invariance of facts absurd.
In the absolute sense, you're right, they don't; but in the practical sense they do, especially with open time-dependent statements.
The easiest example is this:
We're in 2010. X has been married to Y since 2009;
In the DB it would be represented exactly like that: a "from" time period, no "to" time period, meaning "it is still true now".
They divorce in 2013.
X married to Y from 2009 to "now" isn't true anymore. It should be downvoted.
But X married to Y from "2009" to "2013" is actually true. It should be created and upvoted. That fact surely won't change over time.
> The easiest example is this: We're in 2010. X has been married to Y since 2009; In the DB it would be represented exactly like that: a "from" time period, no "to" time period, meaning "it is still true now".
This is not a fact database. This is:
X Married Y 2009-01-01
X Divorced Y 2013-01-01
This way you can represent any number of facts:
X Married Y 2013-02-01
Think how convoluted it would be for you to represent X and Y marrying again in your example, if not outright wrong (because you update/delete information).
I can be wrong, but being in a state of marriage with Y is a fact.
And since the DB handles time periods, with my way you can actually search who X is married to with one request, without checking whether they divorced, whether Y is dead, disappeared, or else.
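A sketch of the open-ended time-period representation being described, where "to" is empty while the statement is still considered true (field names and structure here are illustrative guesses, not TheBigDB's actual schema):

```python
# "to": None means "still true now", per the convention described above.
marriages = [
    {"subject": "X", "spouse": "Y", "from": 2009, "to": None},
]

def married_to(person, year, facts):
    """One lookup answers 'who is `person` married to in `year`?'
    without separately consulting divorce or death records."""
    return [f["spouse"] for f in facts
            if f["subject"] == person
            and f["from"] <= year
            and (f["to"] is None or year <= f["to"])]

print(married_to("X", 2010, marriages))  # ["Y"]
# After the 2013 divorce, the open record is closed: "to" becomes 2013.
```

The cost, as the replies below this comment point out, is that closing the record means updating a statement in place rather than appending a new one.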
> I can be wrong, but being in a state of marriage with Y is a fact.
This is akin to your bank storing the total amount of your account in their database, instead of the transaction history and deriving the total from that.
There's nothing wrong with that, until there is.
Consider this situation: X marries Y, divorces, marries again. Now you would have the date of divorce prior to the date of marriage. How do you make sense of this data?
> And since the DB handle time periods, with my way you can actually search who X is married to with one request, without checking if he divorced, if Y is dead, disappeared, or else.
The fact that Y is dead doesn't mean X wasn't married to Y. So beyond the obvious technical implications of keeping all this data in sync (e.g., Y dies, so you would update X to reflect that?), you're simply obliterating information. A fact is immutable; therefore a fact database, by definition, only appends, never updates.
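The append-only alternative being argued for here can be sketched in a few lines: store immutable events and derive the current state by replaying them, so a remarriage is just another append (the event tuples reuse the examples from this thread):

```python
# Immutable event log: (subject, verb, object, date). Nothing is ever
# updated or deleted; corrections and changes are new appends.
events = [
    ("X", "married",  "Y", "2009-01-01"),
    ("X", "divorced", "Y", "2013-01-01"),
    ("X", "married",  "Y", "2013-02-01"),  # remarriage: just another append
]

def current_spouse(person, log):
    """Replay the log in date order to derive who `person` is
    married to right now (None if unmarried)."""
    spouse = None
    for subj, verb, obj, _date in sorted(log, key=lambda e: e[3]):
        if subj == person:
            spouse = obj if verb == "married" else None
    return spouse

print(current_spouse("X", events))  # "Y"
```

This is the bank-ledger analogy made literal: the balance (current spouse) is always derivable, and the full history is never lost.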
How can you say how the data will be structured? I thought you weren't enforcing any type of data structure?
I like the new approach though. You appear to be focused on simplicity and ease of use, and hopefully you'll find ways to fix the resulting problems. For example, some fancy graph theory might be able to determine that the graph node "Apple" refers to two different ideas.
Any chance you'll release this as open source? For example, people might like to install it on their own servers and use it for their own things. I think it would be useful for fandom -- for example, a Star Wars DB or a Lord of the Rings DB :-)
I'm seriously considering it actually, but not 100% sure at the moment... :)
And yes, it would be awesome for fandom, but the whole point of the service is to be able to have all kinds of data in it, so... :)
Don't be deterred by the negative comments about the unstructured data. It's a tough problem but not an impossible one. I know because I'm battling the same question building a free-form, NLP-based self-tracking app to help track daily data ( http://thyself.io ). The problem for me is that it's hard to perform analytics when one datapoint is in "miles walked" and the other is in "laps ran".
As you said, conventions help mitigate the problem a little bit but the end user can hardly be expected to stick to best practices.
I have hope though. This is a problem worth solving.
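The "miles walked" vs. "laps ran" problem splits into two cases: conversions that are purely mechanical, and ones that need context the datapoint doesn't carry. A small sketch (the conversion factors are real; the function and schema are made up for illustration):

```python
# Units that convert to miles mechanically. "laps" is deliberately
# absent: without knowing the lap length, no factor exists.
TO_MILES = {"miles": 1.0, "km": 0.621371, "m": 0.000621371}

def walked_in_miles(value, unit):
    """Return the distance in miles, or None when the unit can't be
    converted without extra context (e.g. the length of a lap)."""
    factor = TO_MILES.get(unit)
    return None if factor is None else value * factor

print(walked_in_miles(5, "km"))    # ~3.11
print(walked_in_miles(8, "laps"))  # None -- how long is a lap?
```

The `None` case is where conventions (or asking the user once per unit) have to fill the gap, which seems to be exactly the hard part described above.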
Reminds me of Freebase. They built a huge dataset as well as tools and an API to access it. Have you talked to anyone on the team? (They are now at Google.) How would you say that you are different from them?
I like this idea in the sense of an experiment. I'm not sure where it will end up, but it could be interesting.
As others have pointed out, some kind of conventions must be established around the semantics, and something must be done to avoid redundancy (which leads to inconsistency) and ambiguity.
I agree with those criticisms, but if the community also helps develop the schema, it will be interesting to see. What collisions will happen? What will be the result of queries that reach far across disciplines?
I appreciate any new service that attempts to organize data / information. With that in mind, I hope this succeeds.
A suggestion: it needs a demo query box on the site. Shouldn't be too hard to let a rate limited IP address throw a few keywords at it and spit back results. I'd like to see what the db contains before I invest too much time (how many topics, how many facts, etc).
You're right, it would be a plus for people who quickly want to test the API.
In the meantime, you can just look through http://browser.thebigdb.com to see what the DB contains,
or just "gem install thebigdb" and start copy/pasting the code examples to see how the API really behaves.
Based on observations and prior experience (esp. Bitzi), I believe the wiki approach of "correct-in-place" leads to better convergence and community than "downvote the errors, add a corrected entry, upvote the better entry".
(Voting democracy may help prevent people from being oppressed in certain ways, but it isn't much of a truth-discovery mechanism.)
Interesting concept. It's like RDF for human beings. It's easier for human beings to look at unstructured data, but at the same time it makes it extremely hard to do interesting stuff programmatically. You just can't do reliable inferencing.
There's certainly an argument to be made for metaschema -- https://developers.google.com/freebase/v1/search-metaschema -- and crowdsourcing these sorts of things could be interesting.