First of all, let me say that I'm glad more people are thinking/working in the space of triples. Even unstructured ones like this.
But when there's no semi-strict schema, it gets really, really tricky. Free text is hard, and actual meaning is hard to separate. (I say semi-strict, as Freebase is schema-last -- feel free to create your own! -- but has some level of enforcement)
For specific domains you may be okay with tags. And for some limited applications it probably works great. Triples are cool!
But when you start talking about larger, broader datasets, ones that no one person or small group can curate, you're going to start running into collisions.
I think there's a lot of interesting work to be done. But I doubt that this is "better" per se; at the very least, right now it's little more than a toy.
And hey, I built such a toy graph engine once upon a time (be gentle -- it was really a demo hack) https://github.com/barakmich/jgd -- you can even query it with Freebase's old MQL. (Which I have mixed feelings about, but is cool in its own way)
I guess my argument is, don't throw the baby out with the bathwater. And feel free to ping me for more!
First of all, I'm glad to be talking to a Freebase engineer :)
Second: It is not intended to be a Freebase killer, but it is certainly better for me and what I needed it for: building denisthebot.com - which can answer questions about a whole lot of subjects while being completely agnostic about what "categories" these subjects fit in.
Third: Since we all know no-one is going to change anyone's mind here, I won't discuss the merits of a DB structured the way I built it.
But calling it a toy is quite flattering; I know of some very powerful stuff that was called a "toy" for some time :) Also, it would be a very easy toy to use, which will suit a lot of people, according to the emails I've received.
Also, you're right: "don't throw the baby out with the bathwater".
> Simpler structure: There are no datatypes, namespaces, lists, domains. Just ordered nodes. Having a dead simple structure like that allows developers to quickly and intuitively know how to access the info they want.
I don't see how this makes it simpler or intuitive at ALL. If there's no convention as to whether I should search for "born on" or "born_date" or "year_born", or whether the date will be "1900-08-01" or "08-01-1900" or "1900/08"... then how is this supposed to be useful?
The central problem is, there are lots of textual ways of describing the same thing. Without standardized datatypes and standardized tags, it quickly becomes a messy, useless free-for-all.
I don't see how TheBigDB gets around this. The FAQ explains how it's different from Freebase/Wikidata, but I don't at all understand how it's supposed to be better, or even as good.
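To make the concern concrete: without standardized predicates, every consumer ends up maintaining its own alias table just to ask one question. A minimal sketch (the predicate spellings and data here are made up for illustration):

```python
# Hypothetical alias table a client would need in order to query an
# unstructured triple store for a single concept, "birth date".
BIRTH_DATE_ALIASES = {"born on", "born_date", "year_born", "date of birth", "dob"}

def find_birth_dates(triples, subject):
    """Scan (subject, predicate, object) triples for anything that
    looks like a birth date for the given subject."""
    return [obj for (subj, pred, obj) in triples
            if subj == subject and pred.lower() in BIRTH_DATE_ALIASES]

triples = [
    ("Alice", "born on", "1900-08-01"),
    ("Alice", "year_born", "1900"),
    ("Bob", "born_date", "08-01-1900"),
]

# Two "answers" for Alice, in two different formats -- the client
# still has to reconcile them after retrieval.
print(find_birth_dates(triples, "Alice"))
```

And even after the alias lookup succeeds, the values themselves still disagree on format, which is the second half of the problem.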
This is stupid. I hate to say it, but by taking all of the complexity out of Freebase, they forgot the problem Freebase was trying to solve: semantic integration.
Right on the front page they have their biggest flaw in the example:
Apple weight 150g
Ok, so what about the company Apple? How are they semantically distinguishing one item from another? I guess Steve Jobs is Founder of a Fruit that weighs 150 grams.
What you're talking about is the larger problem of "context".
Thankfully, I already wrote the code and the documentation to attach statements to contexts, solving this problem in an elegant way.
I haven't released it yet to keep the core of the service simple, and to help people understand what it does first.
It will be released when it will be the right time for it.
Thanks for your comment!
I apologize for calling your project stupid. I should not have done that. I'm interested in seeing what you are doing on context. Thanks for being classy.
When you start assigning context to statements, you'll quickly find yourself walking down the path to structured data, and toward the very complexity you're arguing against. At least predicate-wise.
You'll then need to start assigning contexts to entities, at which point you're right about where Freebase begins, in principle. Even if we assume you use Wikipedia-esque strings like "Apple (company)"
Yes, whatever I do has me "walking down the path to structured data, and toward the very complexity you're arguing against" -- that's why it's done in a way where those principles are still preserved.
I don't want to talk too much about a structure that I haven't released yet, but let me assure you there's a way that doesn't end up where Freebase is. (Not that there's anything wrong with that.</seinfeld>)
It's all about conventions, redundancy and votes. When you know what you're looking for, and you've got a community agreeing on such conventions, you know you'll get what you're actually looking for.
That's pretty much it.
And yes, it is different, "better" depends on what you're trying to do with it :)
Agree. It's like WikiPedia: Make it easy for the community to decide how to organize it. Nothing stopping anyone from just populating this db in whatever format they prefer then publishing their own docs explaining their data model ... Let the audience decide if that's useful or not.
Sounds like a much simplified version of Douglas Lenat's Cyc project [1], which has been going since the mid eighties and is attempting to build a structured knowledgebase/ontology of everyday knowledge. They have a freely downloadable subset called OpenCyc [2]. It seemed pretty impressive last time I looked at it.
>Lenat was frustrated by Automated Mathematician's constraint to a single domain and so developed Eurisko; his frustration with the effort of encoding domain knowledge for Eurisko led to Lenat's subsequent (and, as of 2008, continuing) development of Cyc. Lenat envisions ultimately coupling the Cyc knowledgebase with the Eurisko discovery engine.
I don't know what he intends to do with it from there, but it could potentially make for some very powerful AIs.
Yes, simplification + open API is at the heart of the project.
The first one because it makes people actually want to use the service, the second one because it helps keep the data about all sorts of things up to date.
I wonder if you could do machine learning on schemata. Basically start learning about dates (as an example) and as it learns updates the information with what it has learned. Something that has one person putting in { name "foo", born "10/1/92"} and someone else putting in { name "bar", born "september 30th, 1966" } and then going back and replacing the dates with an ISO standard date type but with a change history so you could look backwards in time at the data and see how the database had "improved" it. (or not). Then by voting on the improvements you teach the system to clean up its data representations. Crazy? Insightful? Stupid? I don't know but it was the question that popped into my head.
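A toy sketch of that "normalize but keep history" idea -- here using a hand-maintained list of known formats in place of anything actually learned, and keeping the raw value alongside the canonical one so the "improvement" is auditable:

```python
from datetime import datetime

# Formats the system has "learned" so far. Hand-picked here; the idea
# above imagines growing this list from the data itself over time.
KNOWN_FORMATS = ["%m/%d/%y", "%B %dth, %Y", "%Y-%m-%d"]

def normalize(raw):
    """Try to canonicalize a date string to ISO 8601, keeping the
    original string so the change history can be inspected later."""
    for fmt in KNOWN_FORMATS:
        try:
            iso = datetime.strptime(raw, fmt).date().isoformat()
            return {"raw": raw, "value": iso, "normalized": True}
        except ValueError:
            pass
    # Unknown format: leave it alone rather than guess.
    return {"raw": raw, "value": raw, "normalized": False}

print(normalize("10/1/92"))               # note: assumes US month-first order
print(normalize("september 30th, 1966"))
```

Note that the very first format in the list already embeds an assumption (month before day), which is exactly where voting on "improvements" would have to come in.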
The problem is that the many possible formats conflict in ways that make resolution impossible without cross-referencing other sources.
Which date is "10/3/5"? Is it March 5th 2010? March 5th 1910? March 10th 1905? March 10th 2005? October 3rd 1905? October 3rd 2005? (or another century entirely, though the 20th and 21st would be most likely). And don't think that "/" vs. "-" as the separator is sufficient to tell them apart.
And you'll find a lot of other variations -- I'm used to writing 10/3-5, for example... But I'm not even consistent: I might write 10/3/5 or 10-3-5, or 5/3/10 / 5-3-10; anywhere I want to be explicit, I write 2005-03-10, exactly because I'm used to seeing so many ambiguous dates that can't easily be resolved.
What about the value 5.123? Is it a floating point value with "123" after the decimal point, or the integer 5123? The "decimal point" is "," in many countries, and the thousand separator is usually, but not always, "." in countries that use "," as the decimal marker. If you treat things as "just text" you are going to have to potentially deal with dozens of different combinations of decimal points and quantity markers (depending on country, the markers don't all occur only every 3 digits to the left from the decimal marker...)
Interpreting small text fragments is fraught with a near-infinite number of obnoxious details like this, and part of the problem is that few people know most of them and will be unable to quickly resolve ambiguities without cross-referencing other data (or worse: they think they know, or don't even recognize that there's an ambiguity in the first place).
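The scale of the ambiguity is easy to demonstrate by brute force: just enumerate every valid calendar date those three numbers could denote (a quick sketch, limited to two centuries):

```python
from datetime import date
from itertools import permutations

def interpretations(a, b, c, centuries=(1900, 2000)):
    """Enumerate every calendar date three small numbers could mean,
    trying each field as two-digit year (expanded per century),
    month, and day, and discarding impossible combinations."""
    found = set()
    for y, m, d in permutations((a, b, c)):
        for century in centuries:
            try:
                found.add(date(century + y, m, d))
            except ValueError:
                pass  # e.g. month > 12 or day out of range
    return sorted(found)

# "10/3/5" has a dozen distinct plausible readings across just
# the 20th and 21st centuries:
print(len(interpretations(10, 3, 5)))  # 12
```

Since 10, 3 and 5 are all valid as a month, a day, or a two-digit year, every permutation survives, which is the worst case the comment above describes.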
One nice property of the Wikidata database is that it is a "secondary database. Wikidata will record not just statements, but their sources, thus reflecting the diversity of knowledge available and supporting the notion of verifiability." [1]
I think that's far better than voting. Voting for facts amounts to relying on a logical fallacy: appeal to the majority. [2] (Voting is fine for popularity contests, or things that can only be matters of opinion, but facts?)
Is it possible to download all data and use it under some open license (like CC-BY)? I can't find data license terms.
If not, then sorry, Freebase is vastly superior IMHO -- from a user's point of view I don't see the point in a crowdsourced proprietary database (even if the API is currently free).
It's worth pointing out that in most of the world facts are not copyrightable, and a collection of facts is often not copyrightable either. In those instances, CC-BY or whatever license you apply will be unenforceable. I believe Creative Commons themselves explicitly state that CC-BY and similar licenses are not appropriate for collections of facts.
E.g. they explicitly say on their website: "Copyright does not protect the facts or ideas underlying the creative expression. So, Creative Commons licenses do not apply to ideas, factual information or other non-creative elements that are not protected by copyright."
Courts in many jurisdictions explicitly refuse to accept "sweat of the brow" arguments for copyright, explicitly requiring an element of creativity. Some countries do have "database rights" that can protect arrangements of facts in certain ways, while in others a collection of straight-up facts cannot be copyrighted pretty much no matter what.
Please get the licensing clarified. It's important. CC-BY is maybe too much; imagine having to give attribution/credit for 23,967 sources... Check out the content license used by Wikidata and OpenStreetMap (Hint: not CC-BY.)
Have you/do you plan to seed your database with the already structured data from freebase? It should be relatively straightforward, right? Well, I mean, minus the time to properly map the Freebase schema into your format. But that's probably less time than it takes to wait for people to fill in enough facts.
Excellent! I've been working on something similar. Trying to come up with a schema that is data-centric is hard enough let alone focusing on the ease of use by developers. Good luck!
"Can I send how many requests I want?" -- I think you might mean "Can I send as many requests as I want?"
Actually yes, but in a different way, I was thinking of accelerating the power of sudden downvotes, but nothing is set in stone yet.
So yes it will happen, how exactly is yet to be decided.
Thanks for your question!
'Yugoslavia is a country'. Are you really going to exclude that? So, no statements about political geography at all, then. Same problem for physical geography ('New Moore Island exists'). Nothing about climate (because 'the rain in Spain' could change at any time), nothing about population levels. Even the sample 'average weight of an apple' is pretty suspect. I think your database is going to be quite limited.
Edit: It seems we're arguing about what a priori knowledge is capable of serving as a base for factual deductions. The Kantian approach is to say that we all agree on time and space and everything can be based off of these self-evident truths. I think there is not such a clear boundary between objective truth and induction.
Edit2: I'd also like to take this moment to point out that "you're" is the proper contraction of "you are", since we're getting all semantic.
This has suddenly become very philosophical. My view is that a database of facts should contain things that are believed to be facts. It should be possible to remove facts that are shown to be incorrect, but those things should never have been true.
> I'd also like to take this moment to point out that "you're" is the proper contraction of "you are", since we're getting all semantic.
I know, it annoys me too. By the time I'd realized, it was too late to edit. Typos happen.
"My view is that a database of facts should contain things that are believed to be facts. It should be possible to remove facts that are shown to be incorrect, but those things should never have been true."
But they definitely were true. I thought you were making a distinction between 'something that is true' and 'something that is a fact (ie: is unchangingly true)' which I don't think most people make.
It's pretty obvious you didn't consider statements like that. There are a large number of examples that make your statement of the invariance of facts absurd.
In the absolute sense, you're right, they don't; but in the practical sense they do, especially with open time-dependent statements.
The easiest example is this:
We're in 2010. X has been married to Y since 2009;
In the DB it would be represented exactly like that: a "from" time period, no "to" time period, meaning "it is still true now".
They divorce in 2013.
X married to Y from 2009 to "now" isn't true anymore. It should be downvoted.
But X married to Y from "2009" to "2013" is actually true. It should be created and upvoted. That fact surely won't change over time.
> The easiest example is this: We're in 2010. X has been married to Y since 2009; In the DB it would be represented exactly like that: a "from" time period, no "to" time period, meaning "it is still true now".
This is not a fact database. This is:
X Married Y 2009-01-01
X Divorced Y 2013-01-01
This way you can represent any number of facts:
X Married Y 2013-02-01
Think how convoluted it would be for you to represent X and Y marrying again in your example, if not outright wrong (because you update/delete information).
I can be wrong, but being in a state of marriage with Y is a fact.
And since the DB handles time periods, with my way you can actually search who X is married to with one request, without checking whether they divorced, whether Y is dead, disappeared, or else.
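A sketch of the open-ended time-period representation being described, where "to" is empty while the statement is still considered true (field names and structure here are illustrative guesses, not TheBigDB's actual schema):

```python
# "to": None means "still true now", per the convention described above.
marriages = [
    {"subject": "X", "spouse": "Y", "from": 2009, "to": None},
]

def married_to(person, year, facts):
    """One lookup answers 'who is `person` married to in `year`?'
    without separately consulting divorce or death records."""
    return [f["spouse"] for f in facts
            if f["subject"] == person
            and f["from"] <= year
            and (f["to"] is None or year <= f["to"])]

print(married_to("X", 2010, marriages))  # ["Y"]
# After the 2013 divorce, the open record is closed: "to" becomes 2013.
```

The cost, as the replies below this comment point out, is that closing the record means updating a statement in place rather than appending a new one.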
> I can be wrong, but being in a state of marriage with Y is a fact.
This is akin to your bank storing the total amount of your account in their database, instead of the transaction history and deriving the total from that.
There's nothing wrong with that, until there is.
Consider this situation: X marries Y, divorces, marries again. Now you would have the date of divorce prior to the date of marriage. How do you make sense of this data?
> And since the DB handle time periods, with my way you can actually search who X is married to with one request, without checking if he divorced, if Y is dead, disappeared, or else.
The fact that Y is dead doesn't mean X wasn't married to Y. So beyond the obvious technical implications of keeping all this data in sync (e.g., Y dies, so you would update X to reflect that?), you're simply obliterating information. A fact is immutable; therefore a fact database, by definition, only appends, never updates.
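The append-only alternative being argued for here can be sketched in a few lines: store immutable events and derive the current state by replaying them, so a remarriage is just another append (the event tuples reuse the examples from this thread):

```python
# Immutable event log: (subject, verb, object, date). Nothing is ever
# updated or deleted; corrections and changes are new appends.
events = [
    ("X", "married",  "Y", "2009-01-01"),
    ("X", "divorced", "Y", "2013-01-01"),
    ("X", "married",  "Y", "2013-02-01"),  # remarriage: just another append
]

def current_spouse(person, log):
    """Replay the log in date order to derive who `person` is
    married to right now (None if unmarried)."""
    spouse = None
    for subj, verb, obj, _date in sorted(log, key=lambda e: e[3]):
        if subj == person:
            spouse = obj if verb == "married" else None
    return spouse

print(current_spouse("X", events))  # "Y"
```

This is the bank-ledger analogy made literal: the balance (current spouse) is always derivable, and the full history is never lost.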
How can you say how the data will be structured? I thought you weren't enforcing any type of data structure?
I like the new approach though. You appear to be focused on simplicity and ease of use, and hopefully you'll find ways to fix the resulting problems. For example, some fancy graph theory might be able to determine that the graph node "Apple" refers to two different ideas.
Any chance you'll release this as open source? For example, people might like to install it on their own servers and use it for their own things. I think it would be useful for fandom -- for example, a Star Wars DB or a Lord of the Rings DB :-)
I'm seriously considering it actually, but not 100% sure at the moment... :)
And yes, it would be awesome for fandom, but the whole point of the service is to be able to have all kinds of data in it, so... :)
Don't be deterred by the negative comments about the unstructured data. It's a tough problem but not an impossible one. I know because I'm battling the same question building a free-form, NLP-based self-tracking app to help track daily data ( http://thyself.io ). The problem for me is that it's hard to perform analytics when one datapoint is in "miles walked" and the other is in "laps ran".
As you said, conventions help mitigate the problem a little bit but the end user can hardly be expected to stick to best practices.
I have hope though. This is a problem worth solving.
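The "miles walked" vs. "laps ran" problem splits into two cases: conversions that are purely mechanical, and ones that need context the datapoint doesn't carry. A small sketch (the conversion factors are real; the function and schema are made up for illustration):

```python
# Units that convert to miles mechanically. "laps" is deliberately
# absent: without knowing the lap length, no factor exists.
TO_MILES = {"miles": 1.0, "km": 0.621371, "m": 0.000621371}

def walked_in_miles(value, unit):
    """Return the distance in miles, or None when the unit can't be
    converted without extra context (e.g. the length of a lap)."""
    factor = TO_MILES.get(unit)
    return None if factor is None else value * factor

print(walked_in_miles(5, "km"))    # ~3.11
print(walked_in_miles(8, "laps"))  # None -- how long is a lap?
```

The `None` case is where conventions (or asking the user once per unit) have to fill the gap, which seems to be exactly the hard part described above.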
Reminds me of Freebase. They built a huge dataset as well as tools and an API to access it. Have you talked to anyone on the team? (They are now at Google.) How would you say that you are different from them?
I like this idea in the sense of an experiment. I'm not sure where it will end up, but it could be interesting.
As others have pointed out, some kind of conventions must be established around the semantics, and something must be done to avoid redundancy (which leads to inconsistency) and ambiguity.
I agree with those criticisms, but if the community also helps develop the schema, it will be interesting to see. What collisions will happen? What will be the result of queries that reach far across disciplines?
I appreciate any new service that attempts to organize data / information. With that in mind, I hope this succeeds.
A suggestion: it needs a demo query box on the site. Shouldn't be too hard to let a rate limited IP address throw a few keywords at it and spit back results. I'd like to see what the db contains before I invest too much time (how many topics, how many facts, etc).
You're right, it would be a plus for people who quickly want to test the API.
In the meantime, you can just look through http://browser.thebigdb.com to see what the DB contains,
or just "gem install thebigdb" and start copy/pasting the code examples to see how the API really behaves.
Based on observations and prior experience (esp. Bitzi), I believe the wiki approach of "correct-in-place" leads to better convergence and community than "downvote the errors, add a corrected entry, upvote the better entry".
(Voting democracy may help prevent people from being oppressed in certain ways, but it isn't much of a truth-discovery mechanism.)
Interesting concept. It's like RDF for human beings. It's easier for human beings to look at unstructured data, but at the same time it makes it extremely hard to do interesting stuff programmatically. You just can't do reliable inferencing.
There's certainly an argument to be made for metaschema -- https://developers.google.com/freebase/v1/search-metaschema -- and crowdsourcing these sorts of things could be interesting.