Amazon Web Services
Apache Mahout: TLP project to create scalable machine learning algorithms. Mahout has many links to free and paid corpus data.
AWS (Amazon Web Services) Public Data provides a centralized repository of public data sets that can be integrated into AWS cloud-based applications.
BDR - Big Data Repository: big list of public data sources.
Bioassay data: described in Virtual screening of bioassay data, by Amanda , J. of , with 21 bioassay datasets (active/inactive compounds) available for download.
Bitly 1.usa.gov data: anonymized clicks on .gov links.
Bureau of Justice: data on law enforcement agencies, jails, probation agencies, and courts.
Canada Open Data: project with geospatial and other datasets.
Causality Workbench: data repository.
GeodaCenter: geographical and spatial data.
CERN Open Data: more than one petabyte of data from particle physics experiments carried out by CERN.
Common Crawl builds and maintains an open crawl of the web accessible to everyone. The data is stored in an Amazon S3 bucket, and the requester may have to pay some money to access it.
Complete Public Reddit Comments Corpus: over one billion public comments posted to Reddit between 2007 and 2015, for training language algorithms.
Corral: Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
Data Dumps: datasets on books, including catalogs from libraries around the world.
Data Market: offers a free package with access to datasets covering world population, currencies, development indicators, and weather data.
Data Source Handbook: A Guide to Public Data, O’Reilly (Jan 2011).
Open government data from US, EU, Canada, CKAN, and more.
Data.gov
Data.gov.uk: publicly available data from the UK (also London).
Data.gov/Education: central guide for education data resources, including high-value data sets, data visualization tools, resources for the classroom, applications created from open data, and more.
DataMarket: visualize the world’s economy, societies, nature, and industries, with millions of time series from the UN, World Bank, Eurostat, and other important data providers.
Datasets: comprehensive data on a random sample of 10,000 UK companies, updated automatically using AI/machine learning.
data.world is a platform where data scientists can find and use a vast array of high-quality open data, collaborate on data projects, and meet other like-minded data nerds.
DMOZ - Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs and is one of the main sources for internet search engines.
eBay Market Data Insights: data on millions of online sales and auctions from eBay.
EDRM File Formats Data: 381 files covering 200 file formats.
EDRM Enron Email Data Set v2: Enron e-mail messages and attachments in two sets of downloadable compressed files, XML and PST.
Enron Email: data from about 150 users, mostly senior management of Enron.
European Union Open Data Portal
Europeana Data contains open metadata on 20 million texts, images, videos, and sounds gathered by Europeana, the trusted resource for European cultural heritage content.
Facebook Graph
FBI Uniform Crime Reporting: the FBI publishes national crime statistics, with free data available at state and county level.
FEDSTATS: a comprehensive source of US statistics and more.
FIMI: repository for frequent itemset mining implementations and datasets.
Financial Times Market Data: information on financial markets from around the world, including stock price indexes, commodities, and foreign exchange.
FiveThirtyEight: poll data on public opinion of political and sporting issues.
Socrata: access to over 10,000 datasets including business, education, government, and fun.
GDELT: the Global Database of Events, Language, and Tone, described by the Guardian as “a big data history of life, the universe and everything.”
GEO (Gene Expression Omnibus): a gene expression/molecular abundance repository supporting MIAME-compliant data, and a curated online resource for gene expression data browsing, query, and retrieval.
Glassdoor API: information about job vacancies, candidates, salaries, and employee satisfaction is available through their developer API.
Google Books: text from millions of books scanned by Google.
Google datasets: for finding datasets.
Google Finance
Google Scholar: entire texts of academic papers, journals, books, and legal case law.
Google Trends: examine and analyze data on search activity and trending news stories around the world.
Grain: financial data including stocks, futures, etc.
Hilary Mason: research-quality Big Data sets collection, including image datasets.
ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
IMDB Datasets: datasets drawn from the web’s largest resource on movies and the people working in those industries.
IMF Data: the International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices, and investments.
Instagram: as with Twitter, Instagram posts and conversations are public by default. Their APIs allow likes, and business details to .
Irish Electric Vehicle Charge Point Status: collates data from the organization that oversees the network of EV charge points across the Republic of Ireland and Northern Ireland.
Jerry Smith: dataset collection with Finance, Government, Machine Learning, Science, and other data.
KONECT: the Koblenz Network Collection, with large network datasets of all types for research in network mining.
Labeled Faces in the Wild: 13,000 collated and labeled images of human faces, for developing applications involving facial recognition.
Linking Open Data: aimed at making data freely available to everyone.
List of SNA datasets for text, SNA, and other fields.
Machine Learning Dataset Repository Kaggle: collection of open datasets contributed by data scientists involved in machine learning projects.
Microsoft Azure Data Markets Free Datasets: freely available datasets covering everything from agriculture to weather.
Microsoft Marco: Microsoft’s open machine learning datasets for training systems in reading comprehension and question answering.
Million song data has data related to tracks and songs.
Million Song Data Set
ML Data: the data repository of the EU Pascal2 networks.
NASA Exoplanet Archive: public datasets covering planets and stars gathered by NASA’s space exploration missions.
NASDAQ Data provides access to market data.
National Climatic Data Center
National Government Statistical Web Sites: data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
National Space Science Data Center (NSSDC): NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
Natural History Museum Data Portal: information on nearly 4 million historical specimens in the London museum’s collection, and scientific sound recordings of the natural world.
New York Times
NHS Health and Social Care Information Centre
One Million Audio Cover Images: dataset hosted at archive.org covering music released around the world, for image processing research.
Open Data assesses the state of open data around the world.
Open Source Sports: many sports databases, including Baseball, Football, Basketball, and Hockey.
OpenCorporates: the world’s largest open database of companies.
A clearinghouse of datasets available from the City & County of San Francisco, CA.
Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, or other portable device.
Quandl: a collaboratively curated portal to millions of financial and economic time-series datasets.
qunb: a platform to find and visualize quantitative data.
Robert Shiller: data on housing, stock market, and more from his book Irrational Exuberance.
SourceForge.net Research includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management.
The CIA World Factbook
The UK Data Centre: the UK’s largest collection of social, economic, and population data.
The US National Center for Education Statistics: data on educational institutions and education demographics from the US and around the world.
THEINFO: this is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and integrate their particular projects.